SCyDia – OCR For Serbian Cyrillic with Diacritics

Ilić, Velibor; Bajčetić, Lenka; Petrović, Snežana; Španović, Ana

2022

Scydia_EURALEX.pdf (1.102Mb)

Authors

Ilić, Velibor
Bajčetić, Lenka
Petrović, Snežana
Španović, Ana

Journal article (Published version)

Abstract

In the currently ongoing process of retro-digitization of Serbian dialectal dictionaries, the biggest obstacle is the lack of machine-readable versions of paper editions. Therefore, one essential step is needed before venturing into the dictionary-making process in the digital environment – OCRing the pages with the highest possible accuracy. Successful retro-digitization of Serbian dialectal dictionaries, currently in progress, has shown a dire need for one basic yet necessary step, lacking until now – OCRing the pages with the highest possible accuracy. OCR processing is not a new technology, as many open-source and commercial software solutions can reliably convert scanned images of paper documents into digital documents. Available software solutions are usually efficient enough to process scanned contracts, invoices, financial statements, newspapers, and books. In cases where it is necessary to process documents that contain accented text and precisely extract each character with d...

Keywords:

OCR / Cyrillic / convolutional neural networks / retro-digitization / Serbian language

Source:
Dictionaries and Society. Proceedings of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany, 2022, 387-400

Publisher:

Mannheim : IDS-Verlag

TY - JOUR
AU - Ilić, Velibor
AU - Bajčetić, Lenka
AU - Petrović, Snežana
AU - Španović, Ana
PY - 2022
UR - https://dais.sanu.ac.rs/123456789/14197
AB - In the currently ongoing process of retro-digitization of Serbian dialectal dictionaries, the biggest obstacle is the lack of machine-readable versions of paper editions. Therefore, one essential step is needed before venturing into the dictionary-making process in the digital environment – OCRing the pages with the highest possible accuracy. Successful retro-digitization of Serbian dialectal dictionaries, currently in progress, has shown a dire need for one basic yet necessary step, lacking until now – OCRing the pages with the highest possible accuracy. OCR processing is not a new technology, as many open-source and commercial software solutions can reliably convert scanned images of paper documents into digital documents. Available software solutions are usually efficient enough to process scanned contracts, invoices, financial statements, newspapers, and books. In cases where it is necessary to process documents that contain accented text and precisely extract each character with diacritics, such software solutions are not efficient enough. This paper presents the OCR software called “SCyDia”, developed to overcome this issue. We demonstrate the organizational structure of the OCR software “SCyDia” and the first results. “SCyDia” is a web-based software solution that relies on the open-source software “Tesseract” in the background. “SCyDia” also contains a module for semi-automatic text correction. We have already processed over 15,000 pages, 13 dialectal dictionaries, and five dialectal monographs. At this point in our project, we have analyzed the accuracy of “SCyDia” by processing 13 dialectal dictionaries. The results were analyzed manually by an expert who examined a number of randomly selected pages from each dictionary. The preliminary results show great promise, with accuracy ranging from 97.19% to 99.87%.
PB - Mannheim : IDS-Verlag
T2 - Dictionaries and Society. Proceedings of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany
T1 - SCyDia – OCR For Serbian Cyrillic with Diacritics
SP - 387
EP - 400
UR - https://hdl.handle.net/21.15107/rcub_dais_14197
ER -

@article{
author = "Ilić, Velibor and Bajčetić, Lenka and Petrović, Snežana and Španović, Ana",
year = "2022",
abstract = "In the currently ongoing process of retro-digitization of Serbian dialectal dictionaries, the biggest obstacle is the lack of machine-readable versions of paper editions. Therefore, one essential step is needed before venturing into the dictionary-making process in the digital environment – OCRing the pages with the highest possible accuracy. Successful retro-digitization of Serbian dialectal dictionaries, currently in progress, has shown a dire need for one basic yet necessary step, lacking until now – OCRing the pages with the highest possible accuracy. OCR processing is not a new technology, as many open-source and commercial software solutions can reliably convert scanned images of paper documents into digital documents. Available software solutions are usually efficient enough to process scanned contracts, invoices, financial statements, newspapers, and books. In cases where it is necessary to process documents that contain accented text and precisely extract each character with diacritics, such software solutions are not efficient enough. This paper presents the OCR software called “SCyDia”, developed to overcome this issue. We demonstrate the organizational structure of the OCR software “SCyDia” and the first results. “SCyDia” is a web-based software solution that relies on the open-source software “Tesseract” in the background. “SCyDia” also contains a module for semi-automatic text correction. We have already processed over 15,000 pages, 13 dialectal dictionaries, and five dialectal monographs. At this point in our project, we have analyzed the accuracy of “SCyDia” by processing 13 dialectal dictionaries. The results were analyzed manually by an expert who examined a number of randomly selected pages from each dictionary. The preliminary results show great promise, with accuracy ranging from 97.19% to 99.87%.",
publisher = "Mannheim : IDS-Verlag",
journal = "Dictionaries and Society. Proceedings of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany",
title = "SCyDia – OCR For Serbian Cyrillic with Diacritics",
pages = "387-400",
url = "https://hdl.handle.net/21.15107/rcub_dais_14197"
}

Ilić, V., Bajčetić, L., Petrović, S., & Španović, A. (2022). SCyDia – OCR For Serbian Cyrillic with Diacritics. In Dictionaries and Society. Proceedings of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany (pp. 387-400). Mannheim: IDS-Verlag.
https://hdl.handle.net/21.15107/rcub_dais_14197

Ilić V, Bajčetić L, Petrović S, Španović A. SCyDia – OCR For Serbian Cyrillic with Diacritics. In: Dictionaries and Society. Proceedings of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany. 2022:387-400.
https://hdl.handle.net/21.15107/rcub_dais_14197

Ilić, Velibor, Bajčetić, Lenka, Petrović, Snežana, Španović, Ana, "SCyDia – OCR For Serbian Cyrillic with Diacritics" in Dictionaries and Society. Proceedings of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany (2022): 387-400,
https://hdl.handle.net/21.15107/rcub_dais_14197