Entity recognition

Daria Maltseva, Vladimir Batagelj: Social network analysis as a field of invasions: bibliographic approach to study SNA development. Scientometrics (2019) 121:1085–1128. https://doi.org/10.1007/s11192-019-03193-x
Daria Maltseva, Vladimir Batagelj: Journals Publishing Social Network Analysis. manuscript
Daria Maltseva, Vladimir Batagelj: Collaboration Between Authors in the Field of Social Network Analysis. manuscript
Alvaro La Parra-Perez, Félix-Fernando Muñoz, Nadia Fernández de Pinedo: EconHist: A Relational Database for Analyzing the Evolution of Economic History (1980-2019). Academia
Vladimir Batagelj, Daria Maltseva: Analysis of bibliographic networks. 10th International Summer School “Analysis of Scientific Networks”, Moscow, July, 15–21, 2019. GitHub

Auth_1, Auth_2, Auth_3, Auth_4, Auth_5, Title, Year, Source_title, Volume, Issue, Art._No., Page_start, Page_end, Cited_by, DOI, Link, Abstract, author_keyword_1, author_keyword_2, author_keyword_3, author_keyword_4, author_keyword_5, Index Keywords, References

Table Author

(33, 32 variables)

author_name, eha, clio, ehes, ehs, Margo, researcherid, orcid, gender, organization, department, rank, yearbirth, yearPh.D., universityPh.D., countryPh.D., Ph.D.program, babsyear, babsprogram, researchinterest_1, researchinterest_2, researchinterest_3, researchinterest_4, researchinterest_5, updated, noteissue, decade_Ph.D., core1, core2, core3, id_scopus_1, id_scopus_2, id_scopus_3

Table affiliation

(5, 3 variables)

affiliation, organization, department, country, region_work

Table Source

(12, 10 variables)

journal_name, classification_1, classification_2, publisher, ISSN, journal_URL, country, SJR, JCR, journal_field_1, journal_field_2, journal_field_3, journal_field_4

Works

Works that appear in descriptions can be of two types: those which have full descriptions (hits), and those which were only cited (terminal works listed in the CR fields). This information was stored in a partition DC, where DC[w] = 1 if a work w has the WoS description, and DC[w] = 0 otherwise. The partition year contains the work’s publication year from the fields PY or CR. This information is essential for the construction of temporal networks. WoS2Pajek also builds a CSV file titles with main data about hits (short name, WoS data file line, first author, title, journal, year), which can be used to list the results, and the vector NP, where NP[w] = the number of pages in a work w. The usual ISI name of a work (its description in the field CR) has the following structure:

AU + ’, ’ + PY + ’, ’ + SO[:20] + ’, V’ + VL + ’, P’ + BP

(first author’s surname, initials, year of publication, title of the journal, volume and the number of the starting page; + denotes concatenation), which results in such descriptions as

GRANOVETTER M, 1985, AM J SOCIOL, V91, P481

(all the elements are in upper case). Because the same work can have different ISI names in the WoS, WoS2Pajek also supports short names (similar to the names used in the output of the HistCite software), which have the following format:

LastNm[:8] + ’_’ + FirstNm[0] + ’(’ + PY + ’)’ + VL + ’:’ + BP.

For example, the short name of the work mentioned above is GRANOVET_M(1985)91:481. For last names with prefixes (VAN, DE, …) the spaces are deleted, and unusual names start with the characters * or $. However, some data problems can remain even with this approach, because the information in the CR field can include typos in the publication year, volume, page numbers, etc. That is why some additional cleaning of the highly cited nodes was implemented.
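The two naming schemes above can be illustrated with a minimal Python sketch. It is not the WoS2Pajek code: the function names and argument layout are assumptions made for illustration, and the special handling of unusual names (the * and $ prefixes) is not reproduced.

```python
def isi_name(au, py, so, vl, bp):
    """ISI name: AU, PY, SO[:20], V<VL>, P<BP>, all in upper case."""
    return f"{au}, {py}, {so[:20]}, V{vl}, P{bp}".upper()


def short_name(last_name, first_name, py, vl, bp):
    """Short name: LastNm[:8]_FirstNm[0](PY)VL:BP.

    Spaces inside the last name (prefixes such as VAN, DE) are deleted
    before truncation to eight characters.
    """
    last = last_name.replace(" ", "").upper()[:8]
    initial = first_name[:1].upper()
    return f"{last}_{initial}({py}){vl}:{bp}"


print(isi_name("Granovetter M", 1985, "Am J Sociol", 91, 481))
# GRANOVETTER M, 1985, AM J SOCIOL, V91, P481
print(short_name("Granovetter", "Mark", 1985, 91, 481))
# GRANOVET_M(1985)91:481
```

Because the short name keeps only eight letters of the surname and one initial, it tolerates some variation in how a work is cited, but it remains sensitive to errors in year, volume, and page.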

Synonymic referencing

Some problems associated with name recognition can occur in the database when the same work is referred to by different short names. For example, the short names BOYD_D(2007)13 and BOYD_D(2008)13:210 reference the same work of Danah Boyd, originally published in 2007, but in many cases referenced as published in 2008. There were also cases in which the short names differed because of discrepancies in the descriptions, such as GRANOVET_M(1973)78:1360 and GRANOVET_M(1973)78:6, or COLEMAN_J(1988)94:95 and COLEMAN_J(1988)94:S95. The names of some authors were presented in different ways, for example GRANOVET_M and GRANOVET_. We identified these cases for all works with large indegrees in the Cite network.

To resolve these problems, we have to correct the data. There are two possibilities: (1) to make corrections in the local copy of the original data (WoS file) or (2) to make an equivalence partition of nodes and shrink the set of works accordingly in all obtained networks. We used the second option (Batagelj et al. 2014, pp. 395–399). For the works with large frequencies we prepared lists of possible equivalents and manually determined equivalence classes. With a function in R we produced a Pajek partition of equivalent work names representing the same work. We used this partition to shrink the networks Cite, WA, WJ, and WK. The partitions year and DC and the vector NP were also shrunk.
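The second option can be sketched as follows. This is a Python illustration only, since the original work used an R function and Pajek. The function names, the choice of the first class member as representative, the summing of weights of merged arcs, and the citing works A_X(2010)1:1 and B_Y(2011)2:2 are all assumptions made for the example.

```python
from collections import defaultdict


def build_partition(equivalence_classes):
    """Map every name in a class to the class representative (its first member)."""
    representative = {}
    for names in equivalence_classes:
        for name in names:
            representative[name] = names[0]
    return representative


def shrink_edges(edges, representative):
    """Relabel arcs by representative; weights of arcs that coincide are summed."""
    shrunk = defaultdict(int)
    for source, target, weight in edges:
        key = (representative.get(source, source), representative.get(target, target))
        shrunk[key] += weight
    return dict(shrunk)


classes = [
    ["GRANOVET_M(1973)78:1360", "GRANOVET_M(1973)78:6"],
    ["COLEMAN_J(1988)94:95", "COLEMAN_J(1988)94:S95"],
]
# Hypothetical citing works, for illustration only.
cite = [
    ("A_X(2010)1:1", "GRANOVET_M(1973)78:6", 1),
    ("B_Y(2011)2:2", "GRANOVET_M(1973)78:1360", 1),
    ("B_Y(2011)2:2", "COLEMAN_J(1988)94:S95", 1),
]
partition = build_partition(classes)
for arc, weight in shrink_edges(cite, partition).items():
    print(arc, weight)
```

The same mapping can be reused for every network that shares the set of works, which is why a single partition suffices to shrink Cite, WA, WJ, and WK consistently.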

Similar problems associated with name recognition occur for journals. The original network WJ had 70,425 journals. Due to inconsistencies in journal titles in different descriptions, it contained sets of nodes denoting the same journal. To get the list of these nodes, we constructed for each journal title a short code formed from the first two letters of each word in the title (such as SONEANANMI for SOCIAL NETWORK ANALYSIS AND MINING), and then sorted the titles so that journals with the same code were grouped together. We manually inspected all the journals with at least one of their names cited at least 200 times. To get these numbers we computed in Pajek the 2-mode network Cite*WJc and determined the vector wIndegJ.vec with weighted indegrees for journals. We obtained a list of candidates for inspection with 5,482 titles. To further reduce the number of titles to inspect, we considered only titles that appeared in at least 3 citations. This gave a list, journalK100.csv, with 3,714 titles, which were manually inspected. After checking, this list was reduced to 1,688 titles. Some examples of the journal titles grouped according to their codes are presented in the figure.

63656 1312696 10849 SONEAN   | SOCIAL NETWORK ANAL
63657 1330776 3     SONEAN   | SOCIAL NETWORKS ANAL
63658 1311789 645   SONEANMI | SOC NETW ANAL MIN
63659 1313366 7     SONEANMI | SOCIAL NETW ANAL MIN
63660 1315722 7     SONEANMI | SOC NETW ANAL MINING
...
25340 1297450 195   HUREMA   | HUM RESOURCE MANAGE
25341 1298839 189   HUREMA   | HUMAN RESOURCE MANAG
25343 1304542 3     HUREMA   | HUMAN RESOURCES MANA
25344 1305503 67    HUREMA   | HUM RESOUR MANAGE
25345 1312370 222   HUREMA   | HUM RESOUR MANAGE-US
25352 1301632 189   HUREMAR  | HUM RESOUR MANAGE R
25353 1303129 5     HUREMAR  | HUM RESOUR MANAG R
...
4188 1299141 391    AMJGEPS  | AM J GERIAT PSYCHIAT
4189 1299905 23     AMJGEPS  | AM J GERIATRIC PSYCH
4190 1302259 12     AMJGEPS  | AMER J GERIATR PSYCHIATR
4191 1304932 14     AMJGEPS  | AM J GERIATR PSYCHIA
4192 1314551 7      AMJGEPS  | AM J GERIATR PSYCHIATRY
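The coding rule behind this listing can be sketched in a few lines of Python. The function names are assumptions for illustration, and splitting words on whitespace only is inferred from the codes shown above (for example, HUM RESOUR MANAGE-US receives the code HUREMA).

```python
from collections import defaultdict


def journal_code(title):
    """Concatenate the first two letters of each whitespace-separated word."""
    return "".join(word[:2] for word in title.upper().split())


def group_by_code(titles):
    """Return the codes shared by more than one title, sorted by code."""
    groups = defaultdict(list)
    for title in titles:
        groups[journal_code(title)].append(title)
    return {code: names for code, names in sorted(groups.items()) if len(names) > 1}


titles = [
    "SOC NETW ANAL MIN",
    "SOCIAL NETW ANAL MIN",
    "SOC NETW ANAL MINING",
    "HUM RESOURCE MANAGE",
    "HUM RESOUR MANAGE-US",
    "AM J GERIAT PSYCHIAT",
]
for code, names in group_by_code(titles).items():
    print(code, names)
```

Titles that share a code are only candidates for merging. As the listing shows, a code can also group distinct journals, so manual inspection remains necessary.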

However, some journal titles can also appear in an abbreviated form based on their initials. For example, the Journal of the American Statistical Association could be coded as JAMSTAS according to its short title J AM STAT ASS and as JA according to its abbreviation JASA. That is why we also produced a list of frequent journal names of at most 5 letters, chose all the cases that could be considered abbreviations (such as CACM, JACM, JASA, LNCS, NIPS, JASSS, IJCAI, BMJ, JOSS), and performed a manual search for the abbreviations of these journals in the original list of 70,425 journals. We grouped all the journal titles that included the same abbreviations; some examples are presented in Figure 14 (different codes were generated for different titles). The results of the search were added to the first list, and finally a list and the corresponding partition for network shrinking were produced. This resulted in a reduced list of 69,146 journals.

10524 1297183 50    COAC   | COMMUN ACM
10525 1311274 14141 COAC   | COMMUNICATIONS ACM
10062 1309889 12756 CA     | CACM
...
55366 1351847 54714 PSPOSC | PS POLITICAL SCIENCE
55768 1320199 23066 POSC   | POLITICAL SCI
55769 1320573 23440 POSC   | POLIT SCI
56082 1297982 849   PSSCPO | PS-POLIT SCI POLIT
56083 1298064 931   PSSCPO | PS-POLITICAL SCI POL
...
33087 1299216 2083  JAC    | J ACM
33550 1355703 58570 JACJA  | J ACM JACM
32955 1302464 5331  JA     | JACM

Authors

We encountered a similar situation with respect to authors’ names. In addition to the unsystematic use of initials, there were issues with inconsistent spellings such as tildes, which are commonly used in Spanish and French; the “ö” and “ü” of German; and the numerous Slavic, Scandinavian, or Turkish characters. We used UTF-8 encoding and normalized, whenever possible, certain spellings by removing tildes and other special characters. In addition, people with compound surnames (common in Spanish) were identified multiple times as different authors depending on how the surname was listed. For a database, “Fernández de Pinedo,” which is the correct form in Spanish, is not the same as “Fernandez-de-Pinedo,” which is incorrect in Spanish. However, the use of the hyphen spread as a way of preventing the surname from being truncated in international journals, which may make a difference for a surname as common in Spanish as Fernández. Finally, we systematized, again by removing tildes and other special spellings, several hundred authors’ names and surnames.

Indexing author names deserves a separate discussion. The use of a single identification code (e.g., ORCID or ResearcherID) for each author would solve or greatly mitigate the previous problems with author names. However, identification codes sometimes make the identification of authors more difficult, namely when one author appears with several codes. For example, Leandro Prados de la Escosura has been recorded under three different name forms and four identification codes: “Prados de la Escosura, Leandro” (ID: 55982545400, 9 records), “De la Escosura, Leandro Prados” (ID: 6505827385, 26 records), “Escosura, Leandro Prados de la” (ID: 36652659400, only 2 records), and again “Prados de la Escosura, Leandro” (ID: 8141705600, 2 records). Another paradigmatic example is that of Cormac Ó Gráda. There are many more cases in which an author appears with two codes, corresponding to two different ways of being indexed. This required identifying all these different ways of naming the same author and unifying them in a single record.
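The spelling normalization described above (removing tildes and other diacritics, and treating hyphenated and spaced compound surnames alike) can be sketched with Python's standard unicodedata module. This is one possible implementation, not the procedure actually used by the authors; the function names and the decision to map hyphens to spaces are assumptions.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def author_key(name):
    """Comparison key: no diacritics, hyphens as spaces, lower case, single spaces."""
    plain = strip_diacritics(name).replace("-", " ")
    return " ".join(plain.lower().split())


print(strip_diacritics("Ó Gráda"))          # O Grada
print(author_key("Fernández de Pinedo"))    # fernandez de pinedo
print(author_key("Fernandez-de-Pinedo"))    # fernandez de pinedo
```

Letters that have no Unicode decomposition, such as “ø”, “ł”, or “ß”, pass through unchanged and would need an explicit replacement table. The key also does not reorder name parts, so variants such as “De la Escosura, Leandro Prados” still require manual unification.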

Another problem is author disambiguation, when different authors have the same name, well known in the literature as the problem of “multiple personalities” (Harzing, 2015). It is especially relevant for authors with Chinese and Korean names due to the “three Zhang, four Li” effect, but can also occur for authors with common surnames (e.g., Smith, Rodriguez, Johnson). For such authors, the solution of WoS2Pajek does not perform well: different authors having the same surname and the same first initial of their first name are merged during the creation of the network WA. This problem can be overcome if we use a special ID (such as ORCID) for each scientist. Unfortunately, this information is not provided in all WoS descriptions. We have to accept this as a limitation of the study.
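The merging effect can be made concrete with a small Python sketch. It assumes that an author node is labelled by the upper-case surname followed by the first initial (the form seen in labels such as WANG Y); the function names and the sample people are hypothetical and serve only to show the collision.

```python
from collections import defaultdict


def wa_author_key(surname, first_name):
    """Assumed node label: upper-case surname plus the first initial."""
    return f"{surname.upper()} {first_name[:1].upper()}"


def find_homonyms(authors):
    """Return the keys under which more than one distinct full name is merged."""
    groups = defaultdict(set)
    for surname, first_name in authors:
        groups[wa_author_key(surname, first_name)].add((surname, first_name))
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


# Hypothetical names, for illustration only.
authors = [("Wang", "Yan"), ("Wang", "Yong"), ("Wang", "Xin"), ("Smith", "John")]
print(find_homonyms(authors))
# {'WANG Y': [('Wang', 'Yan'), ('Wang', 'Yong')]}
```

Such a check only detects collisions when full first names are available. When the source records contain initials only, the distinct authors behind one key cannot be separated without an identifier such as ORCID.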

In the list of authors ranked in decreasing order of their WAr indegrees, the top entries have Chinese or Korean surnames, e.g. (number of articles in parentheses): WANG Y (410), WANG X (339), ZHANG Y (332), LIU Y (321), CHEN Y (317), ZHANG J (310), LI J (305), LI Y (304), LI X (287). The issue of the super-productivity of these (groups of) authors was discussed by Harzing (2015). This is also part of the reason why the tail of the complementary cumulative distribution in Figure 3 does not follow a straight line (fitting the power law). For example, Y. WANG had almost 80 works published in each of the years 2015 and 2016.

CVs

In particular, having access to updated and homogeneous CVs would greatly facilitate the collection of biographical information. Because CV formats depend on several factors, such as regulations from universities or national science systems, there is no systematic, consistent, or structured information about authors (if there is any at all). ORCID and Publons, two international initiatives, could help in systematizing this type of information.

ORCID (https://orcid.org) is a “nonprofit organization helping create a world in which all who participate in research, scholarship and innovation are uniquely identified and connected to their contributions and affiliations, across disciplines, borders, and times.” Publons (https://publons.com) is part of the WoS Group and is powered by integrations with the WoS, ORCID, and thousands of scholarly journals; this platform serves researchers and publishers. Other platforms such as editorialmanager.com or manuscriptcentral.com require the creation of a user account or being registered with ORCID. All links were accessed on November 19, 2020.

Gender

The majority of CVs and other biographical documents do not contain explicit information about gender. We used the Gender API algorithm (https://gender-api.com; accessed November 11, 2020) to infer gender. This algorithm gives a probabilistic estimate of a person’s gender based on their first name. One important limitation of this approach is that it imposes a binary structure on gender (male or female); therefore, our imputed gender does not necessarily reflect the actual gender identity. Furthermore, the association of a given name with a given gender may vary across countries. This typically results in a low first-name-based gender probability (e.g., the algorithm gives “Andrea” a 54% chance of being female, probably because the name is commonly used for females in English-speaking countries, whereas in Italy it is commonly used for males). We searched for additional online information on individuals with a first-name-based gender probability below 60%. When public online documents for those individuals used pronouns that did not correspond to the first-name-based gender prediction, the imputation was based on those pronouns. For example, Andrea Papadia’s profile at the European University Institute’s website states: “He completed his PhD at the London School of Economics” (emphasis added); therefore, the gender for this person was changed to “Male” (see https://www.eui.eu/ProgrammesAndFellowships/MaxWeberProgramme/People/MaxWeberFellows/Fellows-2017-2018/Papadia; accessed November 11, 2020).
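The decision rule described in this paragraph can be written down as a short Python sketch. It does not call Gender API; the predicted label and probability are assumed to be already available, and the function names, the 0.60 default, and the lower-case labels are illustrative choices.

```python
def needs_manual_check(probability, threshold=0.60):
    """Flag predictions whose first-name-based probability is below the threshold."""
    return probability < threshold


def impute_gender(predicted, probability, pronoun_gender=None, threshold=0.60):
    """Keep the name-based prediction unless a low-probability case is
    contradicted by pronouns found in public online documents."""
    if (
        needs_manual_check(probability, threshold)
        and pronoun_gender is not None
        and pronoun_gender != predicted
    ):
        return pronoun_gender
    return predicted


# "Andrea": predicted female with probability 0.54, profile text uses "he".
print(impute_gender("female", 0.54, pronoun_gender="male"))  # male
print(impute_gender("female", 0.98))                         # female
```

Keeping the threshold as a parameter makes it easy to report how many records would require manual checking under a stricter or looser cut-off.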

Journals

The preparation of the data also involved making decisions about many aspects of the names of the journals. For example, The American Economic Review may appear as “The American Economic Review,” “American Economic Review,” “Amer. Econ. Rev.,” “Amer Econ Rev,” or simply “AER.” For an analysis free of any ambiguity, we had to “fix” this publication name and opted for “American Economic Review.” In general, we used full names and eliminated the article “the” when it appears at the beginning. In cases where the publication name of a journal appears in English as well as in another language, we used the English name unless the journal is perfectly identified in the other language (e.g., Spanish, German, or French). Setting the names of the sources in an unambiguous way (not only for economic history journals, but in general) was also very important for maintaining a consistent identification of the sources cited in the articles recorded in our database.
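A minimal Python sketch of such a name-fixing step is given below. The alias table is hypothetical and contains only the variants mentioned above; in practice it would have to be compiled by hand, and the function name is likewise an assumption.

```python
import re

# Hypothetical alias table: lower-cased variant -> fixed publication name.
ALIASES = {
    "amer. econ. rev.": "American Economic Review",
    "amer econ rev": "American Economic Review",
    "aer": "American Economic Review",
}


def canonical_journal(name):
    """Collapse whitespace, drop a leading article 'the', then apply known aliases."""
    cleaned = " ".join(name.split())
    cleaned = re.sub(r"^the\s+", "", cleaned, flags=re.IGNORECASE)
    return ALIASES.get(cleaned.lower(), cleaned)


for variant in ["The American Economic Review", "Amer. Econ. Rev.", "AER"]:
    print(canonical_journal(variant))
```

Names that are not in the alias table are returned with only the whitespace and the leading article normalized, so unknown variants remain visible for later inspection rather than being silently merged.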

Another important task in preparing the data set for analysis was the elimination of false references and of inconsistencies in the way journal titles were cited in the bibliographical part of our data set. Out of the 10,773 bibliometric records in our database, 367 names of the outlet where they were published could not be matched to any journal in the Scopus source list [1]. We checked each of them individually and detected three different types of issues. First, using Scimago Journal & Country Rank (SJCR) [2], we could determine that 62 of those observations (16.89%) corresponded to books, book chapters, conference proceedings, or other non-peer-reviewed publications; these were eliminated from our dataset. Second, 167 mismatches (45.5%) were due to misspellings of the journal name (e.g., “Economics & Politics” instead of “Economics and Politics”) or changes in the name of the journal (e.g., the “Journal of Behavioral Economics” was renamed “The Journal of Socio-Economics” and is currently titled the “Journal of Behavioral and Experimental Economics”). Corrections were made to ensure that the names of the journals were consistent with the spelling used in the Scopus list of sources for 2018, and the names of journals that changed before 2018 were updated to the name they currently use. Finally, the remaining 138 mismatches (37.6%) corresponded to sources that, according to SJCR, had disappeared before Scopus published its list of sources for 2018, or for which SJCR does not provide any information at all.

[1] The list for 2020 is available at https://www.elsevier.com/?a=91122 (link accessed on October 5, 2020).
[2] https://www.scimagojr.com/ (link accessed on October 5, 2020).

The main challenge in this approach is entity resolution (synonymy and homonymy of works, authors, journals, and keywords). This problem would be simplified by the standardization of information stored in bibliographic databases (ORCID, DOI, ISSN, ISBN, etc.).