SCyDia – OCR For Serbian Cyrillic with Diacritics
Journal article (published version)
Abstract
In the ongoing retro-digitization of Serbian dialectal dictionaries, the biggest obstacle is the lack of machine-readable versions of the paper editions. One basic yet necessary step, lacking until now, must therefore precede any dictionary-making in the digital environment: OCRing the pages with the highest possible accuracy. OCR is not a new technology, and many open-source and commercial software solutions can reliably convert scanned images of paper documents into digital documents. These solutions are usually efficient enough to process scanned contracts, invoices, financial statements, newspapers, and books. They are not efficient enough, however, when a document contains accented text and each character must be extracted precisely together with its diacritics. This paper presents “SCyDia”, OCR software developed to overcome this issue, and describes its organizational structure and first results. “SCyDia” is a web-based software solution that relies on the open-source software “Tesseract” in the background, and it also contains a module for semi-automatic text correction. We have already processed over 15,000 pages, 13 dialectal dictionaries, and five dialectal monographs. At this point in the project, we have analyzed the accuracy of “SCyDia” on the 13 dialectal dictionaries: an expert manually examined a number of randomly selected pages from each dictionary. The preliminary results show great promise, with accuracy ranging from 97.19% to 99.87%.
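The abstract does not say how the reported accuracy was computed. As an illustration only, one common way to score OCR output against a manually corrected page is character accuracy based on edit distance. The sketch below assumes that a base letter plus its combining accent marks should count as a single character, which matters for accented Cyrillic because most accented letters have no precomposed Unicode form. The function names `clusters`, `edit_distance`, and `char_accuracy` are hypothetical and are not taken from SCyDia.

```python
import unicodedata


def clusters(text):
    """Split text into units of one base character plus its combining marks."""
    units = []
    for ch in unicodedata.normalize("NFC", text):
        if units and unicodedata.combining(ch):
            units[-1] += ch  # attach the accent to the preceding base letter
        else:
            units.append(ch)
    return units


def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def char_accuracy(reference, ocr_output):
    """Assumed metric: 1 - edit_distance / reference length, floored at 0."""
    ref, hyp = clusters(reference), clusters(ocr_output)
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


# A four-letter word whose second letter carries a combining double grave
# accent (U+030F); the OCR output has lost the accent.
reference = "\u0432\u0430\u030f\u0434\u0438"
ocr_output = "\u0432\u0430\u0434\u0438"
print(char_accuracy(reference, ocr_output))
```

Under these assumptions a lost accent counts as one substituted character out of four. Counting raw code points instead would treat it as one deletion out of five and give a different score, so the choice of unit has to be fixed before accuracy figures are compared.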
Keywords:
OCR / Cyrillic / convolutional neural networks / retro-digitization / Serbian language
Source:
Dictionaries and Society. Proceedings of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany, 2022, 387-400
Publisher:
- Mannheim : IDS-Verlag
TY  - JOUR
AU  - Ilić, Velibor
AU  - Bajčetić, Lenka
AU  - Petrović, Snežana
AU  - Španović, Ana
PY  - 2022
UR  - https://dais.sanu.ac.rs/123456789/14197
PB  - Mannheim : IDS-Verlag
T2  - Dictionaries and Society. Proceedings of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany
T1  - SCyDia – OCR For Serbian Cyrillic with Diacritics
SP  - 387
EP  - 400
UR  - https://hdl.handle.net/21.15107/rcub_dais_14197
ER  -
@article{
  author = "Ilić, Velibor and Bajčetić, Lenka and Petrović, Snežana and Španović, Ana",
  year = "2022",
  publisher = "Mannheim : IDS-Verlag",
  journal = "Dictionaries and Society. Proceedings of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany",
  title = "SCyDia – OCR For Serbian Cyrillic with Diacritics",
  pages = "387-400",
  url = "https://hdl.handle.net/21.15107/rcub_dais_14197"
}
Ilić, V., Bajčetić, L., Petrović, S., & Španović, A. (2022). SCyDia – OCR For Serbian Cyrillic with Diacritics. In Dictionaries and Society. Proceedings of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany (pp. 387-400). Mannheim: IDS-Verlag. https://hdl.handle.net/21.15107/rcub_dais_14197
Ilić V, Bajčetić L, Petrović S, Španović A. SCyDia – OCR For Serbian Cyrillic with Diacritics. In: Dictionaries and Society. Proceedings of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany. 2022:387-400. https://hdl.handle.net/21.15107/rcub_dais_14197
Ilić, Velibor, Bajčetić, Lenka, Petrović, Snežana, Španović, Ana, "SCyDia – OCR For Serbian Cyrillic with Diacritics" in Dictionaries and Society. Proceedings of the XX EURALEX International Congress, 12-16 July 2022, Mannheim, Germany (2022): 387-400, https://hdl.handle.net/21.15107/rcub_dais_14197