====== Transliterating non-ASCII characters ======
The Scopus bibliographic data are Unicode-encoded and can contain non-ASCII characters, whereas the WoS data have author names already converted to ASCII. How can we make the two data sources compatible?
I first prepared a test file ''examples.txt'' (UTF-8 with BOM):
AU Garcia-Calvo, T
RI García-Calvo, Tomas/AAN-6825-2021; OLIVA, DAVID SANCHEZ/L-1698-2014;
AU Bazina, AM
Pericic, TP
Mihanovic, F
RI Peričić, Tina Poklepović/G-8402-2017; Mihanović, Frane/E-3337-2017
RI Chmura, Paweł/U-6645-2019; Struzik, Artur/AAC-2669-2021; Popowczak,
AU - Marinović, M.
AU - Kjær, M.
AU - Renström, P.A.F.H.
AU - Ibáñez, S.J.
AU - Kristjánsdóttir, H.
Cvetković, D., Doob, M., Sachs, H., (1995) Spectra of Graphs: Theory and Application, pp. 18-20. , Johann Ambrosius Barth, Heidelberg, 3rd edn;
Заболотский Александр Викторович
الدبلوم التنفيذي | مهارات الذكاء الاصطناعي وعلم البيانات
Cui, Chunfang; Tong, Zhongliang 干燥新技术及应用 /Gan zao xin ji shu ji ying yong [Di 1 ban. ed.]
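The test file is saved as UTF-8 with a byte order mark (BOM), which matters when it is read back. The following is a minimal sketch, using an in-memory byte string instead of the real file (an assumption made only to keep the example self-contained), of how the ''utf-8'' and ''utf-8-sig'' codecs differ on such input:

```python
import codecs

# Simulate the first line of a file saved as "UTF-8 with BOM".
raw = codecs.BOM_UTF8 + "AU Garcia-Calvo, T\n".encode("utf-8")

as_utf8 = raw.decode("utf-8")      # keeps the BOM as the character U+FEFF
as_sig = raw.decode("utf-8-sig")   # drops a leading BOM if one is present

print(as_utf8.startswith("\ufeff"))  # the BOM is still there
print(as_sig.startswith("AU"))       # the text starts with the field tag
```

With plain ''utf-8'' the invisible U+FEFF character stays glued to the first field tag, which can break any test for a line starting with ''AU''.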
After some searching on Google I found a solution, the Python package ''unidecode'', described in [[https://www.geeksforgeeks.org/transliterating-non-ascii-characters-with-python/|Transliterating non-ASCII characters with Python]]:
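# Requires the third-party package Unidecode (pip install Unidecode)
import os
from unidecode import unidecode

wdir = "C:/Users/vlado/work2/mark/ascii"
os.chdir(wdir)  # work in the directory that holds examples.txt

# "utf-8-sig" strips the byte order mark (BOM) at the start of the file
with open("examples.txt", "r", encoding="utf-8-sig") as infile:
    data = infile.read()

a = unidecode(data)  # replace non-ASCII characters with ASCII approximations
print(data)
print(a)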
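The script relies on the third-party package ''unidecode''. For comparison only, here is a sketch of a standard-library approach based on Unicode decomposition. It is not the method used on this page, and the helper names ''strip_diacritics'' and ''non_ascii_chars'' are hypothetical. The sample names are taken from the test file.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def non_ascii_chars(text: str) -> set:
    """Return the set of characters that are still outside ASCII."""
    return {ch for ch in text if ord(ch) > 127}

names = [
    "García-Calvo, Tomas",
    "Peričić, Tina Poklepović",
    "Chmura, Paweł",
    "Kjær, M.",
]
for name in names:
    stripped = strip_diacritics(name)
    # Show the result and whatever non-ASCII characters survived.
    print(stripped, sorted(non_ascii_chars(stripped)))
```

Accented letters such as í, č and ć are reduced to their base letters, but ł and æ have no Unicode decomposition and pass through unchanged. Cyrillic, Arabic and Chinese text is not converted at all, which is the gap a transliteration table such as the one in ''unidecode'' fills.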
We get the following transliteration:
AU Garcia-Calvo, T
RI Garcia-Calvo, Tomas/AAN-6825-2021; OLIVA, DAVID SANCHEZ/L-1698-2014;
AU Bazina, AM
Pericic, TP
Mihanovic, F
RI Pericic, Tina Poklepovic/G-8402-2017; Mihanovic, Frane/E-3337-2017
RI Chmura, Pawel/U-6645-2019; Struzik, Artur/AAC-2669-2021; Popowczak,
AU - Marinovic, M.
AU - Kjaer, M.
AU - Renstrom, P.A.F.H.
AU - Ibanez, S.J.
AU - Kristjansdottir, H.
Cvetkovic, D., Doob, M., Sachs, H., (1995) Spectra of Graphs: Theory and Application, pp. 18-20. , Johann Ambrosius Barth, Heidelberg, 3rd edn;
Zabolotskii Aleksandr Viktorovich
ldblwm ltnfydhy | mhrt ldhk lSTn`y w`lm lbynt
Cui, Chunfang; Tong, Zhongliang Gan Zao Xin Ji Zhu Ji Ying Yong /Gan zao xin ji shu ji ying yong [Di 1 ban. ed.]
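Even after transliteration, the Scopus-style names (''Ibanez, S.J.'') and the WoS-style names (''Garcia-Calvo, T'') differ in punctuation and initials. The following sketch shows one possible way to reduce already transliterated names to a common key of surname plus first initial. The function ''name_key'' is hypothetical, and the WoS-style entries ''Ibanez, SJ'' and ''Marinovic, M'' are made up for illustration; they do not appear in the test file.

```python
import re

def name_key(ascii_name: str) -> str:
    """Build a 'SURNAME I' key from an ASCII 'Surname, Given' string."""
    surname, _, given = ascii_name.partition(",")
    letters = re.findall(r"[A-Za-z]", given)   # ignore periods and spaces
    first_initial = letters[0] if letters else ""
    return f"{surname.strip()} {first_initial}".strip().upper()

# Transliterated Scopus-style names (from the output above).
scopus = ["Marinovic, M.", "Ibanez, S.J.", "Kristjansdottir, H."]
# WoS-style names; the first two are invented counterparts.
wos = ["Ibanez, SJ", "Marinovic, M", "Garcia-Calvo, T"]

wos_keys = {name_key(name) for name in wos}
for name in scopus:
    key = name_key(name)
    print(name, "->", key, "(also in WoS)" if key in wos_keys else "(Scopus only)")
```

Keeping only the first initial is a deliberately coarse choice: it tolerates differences such as ''S.J.'' versus ''SJ'', at the cost of merging distinct authors who share a surname and first initial.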
===== URLs =====
- [[https://www.geeksforgeeks.org/transliterating-non-ascii-characters-with-python/|Transliterating non-ASCII characters with Python]]
- https://packaging.python.org/en/latest/tutorials/installing-packages/
- Emacs: [[http://zvon.org/other/elisp/Output/SEC525.html|34. Non-ASCII Characters]]
- https://www.researchgate.net/publication/260030714_Information_management_and_improvement_of_citation_indices
- NIH UMLS [[https://www.nlm.nih.gov/research/umls/new_users/online_learning/LEX_003.html|The Lexical Tools]]
- https://cdr.lib.unc.edu/downloads/kw52jc52z?locale=en
- Seth Bernstein: [[http://programminghistorian.org/en/lessons/transliterating|Transliterating non-ASCII characters with Python]]
- Martijn Visser, Nees Jan van Eck, Ludo Waltman: [[https://direct.mit.edu/qss/article/2/1/20/97574/Large-scale-comparison-of-bibliographic-data|Large-scale comparison of bibliographic data sources: Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic]]
- Chris J. Lu, Allen C. Browne, Guy Divita: [[https://www.academia.edu/8090744/Using_Lexical_Tools_to_Convert_Unicode_Characters_to_ASCII|Using Lexical Tools to Convert Unicode Characters to ASCII]]
- Bijay Kumar: [[https://pythonguides.com/remove-non-ascii-characters-python/|Remove non-ASCII characters Python]]
- [[https://www.codegrepper.com/code-examples/python/python+detect+non+ascii|Python detect non-ASCII]]
- [[https://www.programcreek.com/python/?CodeExample=remove+non+ascii|Python Code Examples for remove non-ASCII]]
- [[https://www.delftstack.com/howto/python/python-unicode-to-string/|Convert Unicode Characters to ASCII String in Python]]