====== Transliterating non-ASCII characters ======

The Scopus bibliographic data are Unicode encoded and can contain non-ASCII characters. The data from WoS have author names transformed into ASCII. How can we make the two data sources compatible?

I first prepared a test file ''examples.txt'' (UTF-8 with BOM):

  AU Garcia-Calvo, T
  RI García-Calvo, Tomas/AAN-6825-2021; OLIVA, DAVID SANCHEZ/L-1698-2014;
  AU Bazina, AM
     Pericic, TP
     Mihanovic, F
  RI Peričić, Tina Poklepović/G-8402-2017; Mihanović, Frane/E-3337-2017
  RI Chmura, Paweł/U-6645-2019; Struzik, Artur/AAC-2669-2021; Popowczak,
  AU - Marinović, M.
  AU - Kjær, M.
  AU - Renström, P.A.F.H.
  AU - Ibáñez, S.J.
  AU - Kristjánsdóttir, H.
  Cvetković, D., Doob, M., Sachs, H., (1995) Spectra of Graphs: Theory and Application, pp. 18-20. , Johann Ambrosius Barth, Heidelberg, 3rd edn;
  Заболотский Александр Викторович
  الدبلوم التنفيذي | مهارات الذكاء الاصطناعي وعلم البيانات
  Cui, Chunfang; Tong, Zhongliang 干燥新技术及应用 /Gan zao xin ji shu ji ying yong [Di 1 ban. ed.]

After some searching on Google I found the solution in [[https://www.geeksforgeeks.org/transliterating-non-ascii-characters-with-python/|Transliterating non-ASCII characters with Python]]. It uses the ''unidecode'' package:

  import os
  from unidecode import unidecode   # third-party package (PyPI name: Unidecode)
  
  wdir = "C:/Users/vlado/work2/mark/ascii"
  os.chdir(wdir)
  
  # "utf-8-sig" removes the BOM at the start of the file
  with open("examples.txt", "r", encoding="utf-8-sig") as infile:
      data = infile.read()
  
  a = unidecode(data)   # transliterate the whole text to ASCII
  print(data)
  print(a)

We get the following transliteration:

  AU Garcia-Calvo, T
  RI Garcia-Calvo, Tomas/AAN-6825-2021; OLIVA, DAVID SANCHEZ/L-1698-2014;
  AU Bazina, AM
     Pericic, TP
     Mihanovic, F
  RI Pericic, Tina Poklepovic/G-8402-2017; Mihanovic, Frane/E-3337-2017
  RI Chmura, Pawel/U-6645-2019; Struzik, Artur/AAC-2669-2021; Popowczak,
  AU - Marinovic, M.
  AU - Kjaer, M.
  AU - Renstrom, P.A.F.H.
  AU - Ibanez, S.J.
  AU - Kristjansdottir, H.
  Cvetkovic, D., Doob, M., Sachs, H., (1995) Spectra of Graphs: Theory and Application, pp. 18-20. , Johann Ambrosius Barth, Heidelberg, 3rd edn;
  Zabolotskii Aleksandr Viktorovich
  ldblwm ltnfydhy | mhrt ldhk lSTn`y w`lm lbynt
  Cui, Chunfang; Tong, Zhongliang Gan Zao Xin Ji Zhu Ji Ying Yong /Gan zao xin ji shu ji ying yong [Di 1 ban. ed.]

The Latin-script names with diacritics now coincide with their WoS ASCII forms; for example, ''Peričić, Tina Poklepović'' becomes ''Pericic, Tina Poklepovic''. Cyrillic, Arabic, and Chinese text is romanized only approximately: the Chinese title comes out as ''Gan Zao Xin Ji Zhu Ji Ying Yong'', while the romanization given in the same record reads ''Gan zao xin ji shu ji ying yong''.

===== URLs =====

  - [[https://www.geeksforgeeks.org/transliterating-non-ascii-characters-with-python/|Transliterating non-ASCII characters with Python]]
  - https://packaging.python.org/en/latest/tutorials/installing-packages/
  - Emacs: [[http://zvon.org/other/elisp/Output/SEC525.html|34. Non-ASCII Characters]]
  - https://www.researchgate.net/publication/260030714_Information_management_and_improvement_of_citation_indices
  - NIH UMLS [[https://www.nlm.nih.gov/research/umls/new_users/online_learning/LEX_003.html|The Lexical Tools]]
  - https://cdr.lib.unc.edu/downloads/kw52jc52z?locale=en
  - Seth Bernstein: [[http://programminghistorian.org/en/lessons/transliterating|Transliterating non-ASCII characters with Python]]
  - Martijn Visser, Nees Jan van Eck, Ludo Waltman: [[https://direct.mit.edu/qss/article/2/1/20/97574/Large-scale-comparison-of-bibliographic-data|Large-scale comparison of bibliographic data sources: Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic]]
  - Chris J. Lu, Allen C. Browne, Divita Guy: [[https://www.academia.edu/8090744/Using_Lexical_Tools_to_Convert_Unicode_Characters_to_ASCII|Using Lexical Tools to Convert Unicode Characters to ASCII]]
  - Bijay Kumar: [[https://pythonguides.com/remove-non-ascii-characters-python/|Remove non-ASCII characters Python]]
  - [[https://www.codegrepper.com/code-examples/python/python+detect+non+ascii|Python detect non-ASCII]]
  - [[https://www.programcreek.com/python/?CodeExample=remove+non+ascii|Python Code Examples for remove non-ASCII]]
  - [[https://www.delftstack.com/howto/python/python-unicode-to-string/|Convert Unicode Characters to ASCII String in Python]]
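
===== Sketch: comparing Scopus and WoS names =====

To make the opening question concrete, here is a minimal sketch of how the transliteration might be used to compare a Scopus name with its WoS counterpart. The helper names ''to_ascii'', ''non_ascii_lines'', and ''same_name'' are hypothetical and not part of any library. The sketch uses ''unidecode'' when it is installed; otherwise it falls back to a rough standard-library stand-in based on ''unicodedata.normalize'', which is an assumption of this example and not the method used on this page.

```python
import unicodedata

try:
    # Third-party package used earlier on this page.
    from unidecode import unidecode as to_ascii
except ImportError:
    def to_ascii(text):
        """Stdlib stand-in (assumption): decompose, then drop non-ASCII code points."""
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")


def non_ascii_lines(text):
    """Return the lines of text that contain at least one non-ASCII character."""
    return [line for line in text.splitlines() if not line.isascii()]


def same_name(scopus_name, wos_name):
    """Compare a Scopus (Unicode) name with a WoS (ASCII) name, ignoring case."""
    return to_ascii(scopus_name).casefold() == wos_name.casefold()


sample = "AU Garcia-Calvo, T\nRI García-Calvo, Tomas/AAN-6825-2021"
print(non_ascii_lines(sample))
print(same_name("Peričić, Tina Poklepović", "Pericic, Tina Poklepovic"))
print(same_name("Mihanović, Frane", "MIHANOVIC, FRANE"))
```

The stand-in only removes combining marks, so letters without a Unicode decomposition, such as ''ł'' and ''æ'', are dropped instead of being mapped to ''l'' and ''ae'' as ''unidecode'' does in the output above. For real data the ''unidecode'' path is the one to rely on.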