====== English dictionary ======
{{pajek:data:zip:engdict.zip|EngDict}}
Two-mode network ''EngDict.net'' with 85973 + 94441 = 180414 nodes and 660758 arcs and the edge list ''Dict.csv''.
===== Description =====
The English dictionary network is distributed as a JSON file on [[https://www.kaggle.com/datasets/bfbarry/dictionary-graph|Kaggle]]. Each key is a headword and its value is the list of words in the headword's description:
{
"DIPLOBLASTIC":["CHARACTERIZING","THE","OVUM","WHEN","IT","HAS","TWO","PRIMARY","GERMINALLAYERS"],
"DEFIGURE":["TO","DELINEATE","[OBS","]THESE","TWO","STONES","AS","THEY","ARE","HERE","DEFIGURED","WEEVER"],
"LOMBARD":["OF","OR","PERTAINING","TO","LOMBARDY","OR","THE","INHABITANTS","OF","LOMBARDY"],
"BAHAISM":["THE","RELIGIOUS","TENETS","OR","PRACTICES","OF","THE","BAHAIS"],
"FUMERELL":["SEE","FEMERELL"],
"ROYALET":["A","PETTY","OR","POWERLESS","KING","[R","]THERE","WERE","AT","THIS","TIME","TWO","OTHER",
"ROYALETS","AS","ONLY","KINGS","BY","HISLEAVE","FULLER"],
"TROPHIED":["ADORNED","WITH","TROPHIES","THE","TROPHIED","ARCHES","STORIED","HALLS","INVADE","POPE"],
"ZEQUIN":["SEE","SEQUIN"],
"MILLWRIGHT":["A","MECHANIC","WHOSE","OCCUPATION","IS","TO","BUILD","MILLS","OR","TO","SET","UPTHEIR","MACHINERY"],
...
The data has some deficiencies:
* bracket residue appears among the words, such as: "[OBS", "]"; "[R","]"; "[WRITTEN","ALSOBETTEE","]THE"; "[BRAIDED]"; "[SLANG]", etc.
* phrases such as "TRUNK STEAMER", "BLACK FRIAR", "FLINT GLASS", "CHAIN TIE", etc. appear only as source nodes
* target nodes also include stopwords
* some target nodes consist of two words run together ("double" words): germinallayers, hisleave, platesemployed, theatlantic, fewspecies, etc.
* some entries are only cross-references of the form "SEE WORD"
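The deficiencies listed above can be spotted mechanically. The following is only a minimal sketch, not part of the original conversion: the function name ''find_issues'' and the issue labels are hypothetical, and the description of the phrase entry is a placeholder rather than real data.

```python
# Entries copied from the sample above, except "TRUNK STEAMER", whose
# description is a placeholder used only to exercise the phrase check.
sample = {
    "DEFIGURE": ["TO", "DELINEATE", "[OBS", "]THESE", "TWO", "STONES"],
    "FUMERELL": ["SEE", "FEMERELL"],
    "BAHAISM": ["THE", "RELIGIOUS", "TENETS", "OR", "PRACTICES", "OF", "THE", "BAHAIS"],
    "TRUNK STEAMER": ["PLACEHOLDER", "DESCRIPTION"],
}

def find_issues(key, tokens):
    """Return a sorted list of issue labels for one dictionary entry."""
    issues = set()
    if " " in key:
        issues.add("phrase key")          # multi-word headword
    if tokens and tokens[0] == "SEE":
        issues.add("cross-reference")     # "SEE WORD" entry
    if any("[" in t or "]" in t for t in tokens):
        issues.add("bracket residue")     # pieces of [OBS], [R], ...
    return sorted(issues)

report = {key: find_issues(key, tokens) for key, tokens in sample.items()}
for key, issues in report.items():
    print(key, issues)
```

Stopwords and run-together words are not checked here, since detecting them needs a stopword list and a vocabulary.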
===== Converting into an edge list =====
I decided to lemmatize the target words. I first re-joined the words from the entry list into the "original" description string and removed all []-enclosed substrings. Afterward, I lemmatized the words from the description and removed the stopwords. The program also drops the leading SEE of cross-reference entries and discards one-character words.
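The re-joining and bracket removal can be tried in isolation with the standard library only. This is a sketch of that single step: the helper name ''clean_description'' is hypothetical, the regular expression is the one used in the program, and lemmatization and stopword removal are left out because they need NLTK and its data.

```python
import re

def clean_description(tokens):
    """Re-join the tokens, lowercase them, and drop [ ... ] substrings."""
    # Drop the leading "SEE" of cross-reference entries.
    if tokens and tokens[0] == "SEE":
        tokens = tokens[1:]
    text = " ".join(tokens).lower()
    # Same pattern as in the program: brackets enclosing word characters,
    # whitespace or "|" (inside a character class "|" is a literal).
    return re.sub(r"\[[\w|\s]*\]", "", text)

cleaned = clean_description(["TO", "DELINEATE", "[OBS", "]THESE", "TWO", "STONES"])
print(cleaned)
```

Re-joining matters because a marker such as [OBS] is split over two tokens ("[OBS" and "]THESE") and becomes removable only once they sit in one string. The complete conversion program follows.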
version = "Dict to Pajek 0.1"
# by Vladimir Batagelj, May 29, 2022
wdir = "C:/Users/vlado/DL/data/kaggle"
import sys, os, re, datetime, csv, json, shutil, time
os.chdir(wdir)
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.corpus import stopwords
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)
print(version)
ts = datetime.datetime.now()
print('{0}: {1}\n'.format("START",ts))
lemmatizer = WordNetLemmatizer()
stopWords = set(stopwords.words('english'))
# read the JSON dictionary; the file is closed when the block ends
with open('dict_graph.json', encoding="utf-8") as fDict:
    D = json.load(fDict)
net = open(wdir+'/dict.csv','w',encoding="utf-8")
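for k in D:
    S = D[k]
    # report and skip source words with an empty description
    if len(S)==0: print(k); continue
    # drop the leading SEE of cross-reference entries
    if S[0]=='SEE': S = S[1:]
    # join words into text and remove [ ... ] substrings
    s = re.sub(r"\[[\w|\s]*\]", "", ' '.join(S).lower())
    L = [lemmatizer.lemmatize(w,get_wordnet_pos(w)) for w in nltk.word_tokenize(s)]
    # keep distinct lemmas that are not stopwords and have more than one character
    K = set([w for w in L if (not w in stopWords) and (len(w)>1)])
    for w in K: net.write(k.lower()+','+w+'\n')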
net.close()
print('{0} {1}\n'.format("keys",len(D)))
tf = datetime.datetime.now()
print('{0}: {1}\n'.format("END",tf))
Running the program prints the source words that have an empty description. It takes a little over 4 minutes.
============= RESTART: C:\Users\vlado\DL\data\kaggle\dict2pajek.pyw ============
Dict to Pajek 0.1
START: 2022-05-29 19:56:13.505569
COMBATIVE
DIS-
DRUM MAJOR
-ER
BOW NET
HAEMIC
F
HORSE POWER
BOW OAR
CONSONANTAL
CLUNCH
keys 86024
END: 2022-05-29 20:00:38.079762
It produces an edge list in the file ''dict.csv'':
diploblastic,germinallayers
diploblastic,two
diploblastic,primary
diploblastic,characterize
diploblastic,ovum
defigure,stone
defigure,two
defigure,defigured
defigure,weever
defigure,delineate
lombard,pertain
lombard,lombardy
lombard,inhabitant
bahaism,bahai
bahaism,religious
...
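Before converting the edge list it can be useful to check its size. The sketch below counts distinct source words, distinct target words and arcs; the function name ''edge_list_summary'' is hypothetical and the first rows of the edge list are inlined so that it runs without the real file.

```python
import csv
import io

# First rows of the edge list shown above, inlined for the example.
sample_csv = """diploblastic,germinallayers
diploblastic,two
diploblastic,primary
diploblastic,characterize
diploblastic,ovum
defigure,stone
defigure,two
"""

def edge_list_summary(handle):
    """Return (number of sources, number of targets, number of arcs)."""
    sources, targets, arcs = set(), set(), 0
    for row in csv.reader(handle):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        sources.add(row[0])
        targets.add(row[1])
        arcs += 1
    return len(sources), len(targets), arcs

summary = edge_list_summary(io.StringIO(sample_csv))
print(summary)
```

Sources and targets are collected separately because in a two-mode network the same string used as a source word and as a description word gives two different nodes. For the full file, the function can be called on a handle opened with ''open("dict.csv", encoding="utf-8", newline="")''.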
===== Converting the edge list into a Pajek network =====
The conversion is easy in R, using the function ''uvFac2net'' from the ''Pajek.R'' file of the Rnet repository.
> wdir <- "C:/Users/vlado/DL/data/kaggle"
> setwd(wdir)
> D <- read.csv("dict.csv",head=FALSE)
> dim(D)
[1] 660758 2
> head(D)
            V1             V2
1 diploblastic germinallayers
2 diploblastic            two
3 diploblastic        primary
4 diploblastic   characterize
5 diploblastic           ovum
6     defigure          stone
> source("https://raw.githubusercontent.com/bavla/Rnet/master/R/Pajek.R")
> uvFac2net(factor(D$V1),factor(D$V2),Net="EngDict.net",twomode=TRUE)
We get a Pajek two-mode network ''EngDict.net'' with 85973 + 94441 = 180414 nodes and 660758 arcs.
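For readers without R, the same kind of file can be sketched in Python. This is an illustration under assumptions, not a port of ''uvFac2net'': the function name ''write_two_mode_net'' is hypothetical, nodes are numbered in sorted label order (similar to the levels of an R factor, although locale-dependent ordering may differ), and the exact layout written by ''uvFac2net'' may differ. It relies on the Pajek convention that a two-mode network starts with ''*vertices n n1'', where n1 is the size of the first mode.

```python
import io

def write_two_mode_net(edges, out):
    """Write (source, target) pairs as a Pajek two-mode network to `out`."""
    sources = sorted({u for u, _ in edges})
    targets = sorted({v for _, v in edges})
    n1 = len(sources)
    # First-mode nodes get numbers 1..n1, second-mode nodes n1+1..n.
    src_id = {u: i for i, u in enumerate(sources, start=1)}
    tgt_id = {v: n1 + i for i, v in enumerate(targets, start=1)}
    out.write(f"*vertices {n1 + len(targets)} {n1}\n")
    for i, label in enumerate(sources + targets, start=1):
        out.write(f'{i} "{label}"\n')
    out.write("*arcs\n")
    for u, v in edges:
        out.write(f"{src_id[u]} {tgt_id[v]}\n")

edges = [("diploblastic", "two"), ("diploblastic", "ovum"), ("defigure", "two")]
buffer = io.StringIO()
write_two_mode_net(edges, buffer)
lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

Note that "two" would get a separate second-mode number even if it also occurred as a source word, which is why the node count of the network is the sum of the two mode sizes.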