====== Pokec ======

===== Pokec social network =====

http://snap.stanford.edu/data/soc-pokec.html

Pokec is the most popular Slovak on-line social network. The datasets are anonymized and contain the relationships and user profile data of the whole network. Profile data are in the Slovak language. Friendships in the Pokec network are directed. The datasets were crawled during May 25-27, 2012.

Author: Lubos Takac, lubos.takac@gmail.com

  DATASET STATISTICS:
  Nodes ............................ 1632803
  Edges ............................ 30622564
  Nodes in largest WCC ............. 1632803 (1.000)
  Edges in largest WCC ............. 30622564 (1.000)
  Nodes in largest SCC ............. 1304537 (0.799)
  Edges in largest SCC ............. 29183655 (0.953)
  Average clustering coefficient ... 0.1094
  Number of triangles .............. 32557458
  Fraction of closed triangles ..... 0.01611
  Diameter (longest shortest path) . 11
  90-percentile effective diameter . 5.3

The data file "soc-pokec-profiles.txt.gz" is large. To unzip it I installed gzip (http://www.gzip.org/ , http://gnuwin32.sourceforge.net/packages/gzip.htm ). TextPad and other text editors were not able to browse or edit the unpacked file "soc-pokec-profiles.txt", so I also installed the editor http://www.emeditor.com/ (trial version), which handles it nicely.

===== Conversion to Pajek =====

I first processed the data on a 64-bit notebook with 8 GB of memory.
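Because ordinary editors could not open the unpacked file, the first few records can also be inspected straight from the compressed archive before the full load. The following is a minimal Python sketch, assuming the file is tab-separated UTF-8 text (as it is read later on this page). The helper name `peek_rows` is hypothetical, and the demo runs on a tiny stand-in file, not on the real data.

```python
import gzip
import os
import tempfile

def peek_rows(path, k=5, sep="\t"):
    """Return the first k records of a gzip-compressed text file as lists of fields."""
    rows = []
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as f:
        for line in f:
            rows.append(line.rstrip("\r\n").split(sep))
            if len(rows) >= k:
                break
    return rows

# Stand-in file with three shortened records (hypothetical path and content).
demo_path = os.path.join(tempfile.mkdtemp(), "demo-profiles.txt.gz")
with gzip.open(demo_path, "wt", encoding="utf-8", newline="") as f:
    f.write("1\t1\t14\t1\tzilinsky kraj, zilina\n")
    f.write("2\t1\t62\t0\tzilinsky kraj, kysucke nove mesto\n")
    f.write("16\t1\t64\t1\tzilinsky kraj, kysucke nove mesto\n")

first = peek_rows(demo_path, k=2)
print(first)
```

Only the requested number of lines is decompressed, so the same call on "soc-pokec-profiles.txt.gz" should not need the whole file in memory.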
  > setwd("D:/Data/SNAP")
  > d <- read.delim("soc-pokec-profiles.txt",header=FALSE,sep="\t")
  > head(d)
    V1 V2 V3 V4 V5 V6 V7 V8 V9
  1 1 1 14 1 zilinsky kraj, zilina 2012-05-25 11:20:00.0 2005-04-03 00:00:00.0 26 185 cm, 90 kg
  2 2 1 62 0 zilinsky kraj, kysucke nove mesto 2012-05-25 23:08:00.0 2007-11-30 00:00:00.0 0 166 cm, 58 kg
  3 16 1 64 1 zilinsky kraj, kysucke nove mesto 2012-05-25 23:19:40.0 2008-05-18 00:00:00.0 23 173 cm, 70 kg
  4 3 0 38 1 bratislavsky kraj, bratislava - karlova ves 2012-05-10 18:05:00.0 2010-05-23 00:00:00.0 29 null
  5 4 1 12 0 banskobystricky kraj, brezno 2011-12-29 12:25:00.0 2011-12-29 00:00:00.0 26 null
  6 17 1 47 0 zilinsky kraj, martin 2012-05-25 09:40:00.0 2006-10-21 00:00:00.0 27 162 cm, 60 kg
    V10 V11
  1 it anglicky
  2 null nemecky
  3 najvatcsej firme na svete urad prace no predsa svoj :d a najlepsie druhy
  4 reklamy a medii, sluzieb a obchodu anglicky, nemecky
  5 null null
  6 null anglicky, nemecky
    V12
  1 sportovanie, spanie, kino, jedlo, pocuvanie hudby, priatelia, divadlo
  2 turistika, prace okolo domu, praca s pc, pocuvanie hudby, pozeranie filmov, tancovanie, diskoteky, kupalisko, varenie, party, priatelia, spanie, nakupovanie, stanovanie
  3 cestovanie, pocuvanie hudby, nenudit sa
  4 sportovanie, cestovanie
  5 null
  6 citanie, pocuvanie hudby, pozeranie filmov, spanie, nakupovanie
    V13 V14 V15
  1 v dobrej restauracii mam psa null
  2 pri svieckach s partnerom macka priemerna
  3 v dobrej restauracii ja a nas prefikany alik :) nemozem pribrat nedasa smola som moc aktivny
  4 null null null
  5 null null null
  6 null pes priemerna
    V16 V17 V18 V19
  1 null null null null
  2 vyborny zelene cierne dlhe
  3 to co by som mal nosit tak nenosim asi tak hnede hnede hnede null
  4 null zelene hnede null
  5 null null null null
  6 null zelene blond, odfarbene dlhe
    V20 V21 V22
  1 null null null
  2 zakladne, ale som uz na strednej skole dufam ze ju spravim cierna, modra, ruzova nefajcim
  3 coskoro 24.5 alebo 31.9 :d biela, modra, zelena nemam
  4 null cierna, modra null
  5 null null null
  6 vysokoskolske biela, cierna, modra, fialova nefajcim
    V23 V24
  1 null null
  2 pijem prilezitostne, iba ked sa nieco kona a to napr. na zabave,na chate,na stanovackach a pod. byk
  3 pijem iba ked musim ...svadby.pohreby.krstiny a tak lev
  4 null null
  5 null null
  6 pijem prilezitostne rak
    V25
  1 null
  2 dobreho priatela, priatelku, mozno aj viac
  3 null
  4 dobreho priatela, priatelku, niekoho na chatovanie
  5 null
  6 null
    V26
  1 null
  2 nie je nic lepsie, ako byt zamilovany(a)
  3 oplati sa pre nu bojovat
  4 null
  5 null
  6 nie je nic lepsie, ako byt zamilovany(a), hladat lasku?nezmysel...hovori rozum. smiesne...hovori hrdost. riskantne...brani skusenost. \\"ale samota ta zabija\\" sepka srdce!
    V27 V28 V29
  1 null null null
  2 iba s mojou laskou laskou mojho zivota slobodny(a)
  3 ja uz som stary na take veci :) ked ho stretnem tak vam o nom porozpravam :) slobodny(a)
  4 null null slobodny(a)
  5 null null null
  6 nedokazem mat s niekym sex bez lasky null vydata za najuzasnejsieho cloveka pod slnkom
    V30
  1 null
  2 no budu a tak chcem 2 deti staci a tak ked budeme vladat tak bude aj viac co ja viem co ma v zivote postretne:d
  3 casom ak budem este vladat :d
  4 null
  5 null
  6 null
    V31 V32 V33
  1 null null null
  2 v buducnosti chcem mat deti komedie, romanticke doma z gauca
  3 null take co ma uputaju v kine s ludmy ktory mam rad
  4 null akcne, horory, komedie, sci-fi, dokumentarne, historicke null
  5 null null null
  6 v buducnosti chcem mat deti null null
    V34
  1 null
  2 disko, pop, rap a jasn eto co teraz leti najviac nejlepsie je fun-radio
  3 hoci co co zapasuje ale klasa vede atb samozrejme najnovsie co sa hrava vrebrickoch hytparad :xd
  4 rock, metal, house, techno, pop, oldies, jazz
  5 null
  6 null
    V35 V36 V37
  1 null null null
  2 na diskoteke, pri chodzi pri svieckach s partnerom slovenskej
  3 samozrejme sam kazdy ma iny vkus neda sa vsetkym vyhoviet null ak sa to da ziet tak setko
  4 v aute, v praci, na koncerte, s partnerom null slovenskej, talianskej, japonskej
  5 null null null
  6 null null null
    V38 V39
  1 null null
  2 profesionali null
  3 null nie
  4 null pravidelne
  5 null null
  6 null null
    V40 V41 V42 V43
  1 null null null null
  2 null null null null
  3 lyzovanie, plavanie non kanal bit lepsi ako druhy uz ich moc nectem
  4 hokej, futbal, auto-moto sporty, squash auto-moto sporty, futbal, hokej null null
  5 null null null null
  6 aerobik, kolieskove korcule, plavanie, posilnovanie null zivnostnik null
    V44 V45 V46 V47
  1 null null null null
  2 null null null null
  3 null null null null
  4 null null null null
  5 null null null null
  6 magazin pre mamicky najmama.sk null null null
    V48 V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60
  1 null null null null null null null null null null null null NA
  2 null null null null null null null null null null null null NA
  3 null null null null null null null null null null null null NA
  4 null null null null null null null null null null null null NA
  5 null null null null null null null null null null null null NA
  6 null null null null null null null null null null null null NA
  >
I added the column names:

  colNames <- c(
    "user_id", "public", "completion_percentage", "gender", "region",
    "last_login", "registration", "AGE", "body", "I_am_working_in_field",
    "spoken_languages", "hobbies", "I_most_enjoy_good_food", "pets",
    "body_type", "my_eyesight", "eye_color", "hair_color", "hair_type",
    "completed_level_of_education", "favourite_color", "relation_to_smoking",
    "relation_to_alcohol", "sign_in_zodiac", "on_pokec_i_am_looking_for",
    "love_is_for_me", "relation_to_casual_sex", "my_partner_should_be",
    "marital_status", "children", "relation_to_children", "I_like_movies",
    "I_like_watching_movie", "I_like_music", "I_mostly_like_listening_to_music",
    "the_idea_of_good_evening", "I_like_specialties_from_kitchen", "fun",
    "I_am_going_to_concerts", "my_active_sports", "my_passive_sports",
    "profession", "I_like_books", "life_style", "music", "cars", "politics",
    "relationships", "art_culture", "hobbies_interests", "science_technologies",
    "computers_internet", "education", "sport", "movies", "travelling",
    "health", "companies_brands", "more" )

and

  > dim(d)
  [1] 1062701 60
  > colnames(d) <- colNames
  > write.table(d,"pokec.csv",sep=";")
  > e <- d[,c(1:37,39:43)]
  > dim(e)
  [1] 1062701 42
  > write.table(e,"pikec.csv",sep=";")

There is a problem: R reads only 1062701 rows, although the network has 1632803 nodes. I was not able to find the error in the R commands or in the input file, so I decided to select the variables in Python.
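One way to narrow down such a mismatch is to count the physical lines of the input independently of any CSV parsing and to tally how many tab-separated fields each line has. The sketch below is only an illustration in Python and was not part of the original processing. The function name `line_and_field_counts` is hypothetical and the demo text is invented. It also shows one possible cause that was not verified on the Pokec file: a reader that treats `"` as a quote character can merge several lines into one record when a quote is left unbalanced.

```python
import csv
import io
from collections import Counter

def line_and_field_counts(lines, sep="\t"):
    """Count lines and tally the number of sep-separated fields per line."""
    n = 0
    widths = Counter()
    for line in lines:
        n += 1
        widths[line.count(sep) + 1] += 1
    return n, widths

# Invented three-line input; the quote opened on line 1 is closed only on line 3.
demo = 'a\t"unbalanced\tx\nb\tplain\ty\nc\tplain"\tz\n'

n, widths = line_and_field_counts(io.StringIO(demo))

# A quote-aware reader swallows the line breaks inside the "quoted" field.
quoted = list(csv.reader(io.StringIO(demo), delimiter="\t", quotechar='"'))

# With quoting switched off every physical line stays one record.
unquoted = list(csv.reader(io.StringIO(demo), delimiter="\t",
                           quoting=csv.QUOTE_NONE))

print(n, dict(widths), len(quoted), len(unquoted))
```

If the line count of the real file agrees with the number of nodes while a parser reports fewer rows, quote handling in the parser is a reasonable first thing to check.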
===== Selection of variables =====

  import csv, sys, os
  # http://docs.python.org/py3k/library/csv.html
  # http://snap.stanford.edu/data/soc-pokec.html
  os.chdir("D:/Data/SNAP/pokec")
  podatki = "soc-pokec-profiles.txt"   # input: tab-separated profile data
  cns = [
     "user_id", "public", "completion_percentage", "gender", "region",
     "last_login", "registration", "age", "body", "I_am_working_in_field",
     "spoken_languages", "hobbies", "I_most_enjoy_good_food", "pets",
     "body_type", "my_eyesight", "eye_color", "hair_color", "hair_type",
     "completed_level_of_education", "favourite_color", "relation_to_smoking",
     "relation_to_alcohol", "sign_in_zodiac", "on_pokec_i_am_looking_for",
     "love_is_for_me", "relation_to_casual_sex", "my_partner_should_be",
     "marital_status", "children", "relation_to_children", "I_like_movies",
     "I_like_watching_movie", "I_like_music", "I_mostly_like_listening_to_music",
     "the_idea_of_good_evening", "I_like_specialties_from_kitchen", "fun",
     "I_am_going_to_concerts", "my_active_sports", "my_passive_sports",
     "profession", "I_like_books", "life_style", "music", "cars", "politics",
     "relationships", "art_culture", "hobbies_interests", "science_technologies",
     "computers_internet", "education", "sport", "movies", "travelling",
     "health", "companies_brands", "more" ]
  with open(podatki,newline='',encoding='utf-8') as dat,\
       open('pokec1.csv','w',newline='',encoding='utf-8') as lst:
     datReader = csv.reader(dat,delimiter='\t',quotechar='"')
     # output: ';'-separated, every text field quoted with '|'
     lstWriter = csv.writer(lst,delimiter=';',quotechar='|',
        quoting=csv.QUOTE_NONNUMERIC)
     n = 0
     try:
        # header row, then the same column selection for every record
        lstWriter.writerow(cns[0:36]+cns[38:42])
        for row in datReader:
           n = n+1
           if (n % 10000) == 0: print(n)   # progress report
           lstWriter.writerow(row[0:36]+row[38:42])
     except csv.Error as e:
        sys.exit('file {}, line {}: {}'.format(
           podatki, datReader.line_num, e))
  print(n)

The rest was done in R.
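To make the column selection in the script easier to check, the following short Python sketch spells out which positions the expression `cns[0:36]+cns[38:42]` keeps. The names `names`, `kept` and `dropped` and the stand-in labels `col1` ... `col59` are illustrative only; they replace the 59 real column names.

```python
# Stand-ins for the 59 column names, numbered from 1 as in R.
names = ["col{}".format(k) for k in range(1, 60)]

# Same slicing as in the script: Python slices are 0-based and exclude the end.
kept = names[0:36] + names[38:42]
dropped = [c for c in names if c not in kept]

print(len(kept), kept[35], kept[36])
print(len(dropped), dropped[:3])
```

In 1-based terms the output file holds columns 1 to 36 and 39 to 42, which gives 40 columns. This agrees with the `colClasses` vectors of length 40 used when "pokec1.csv" is read back into R.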
==== pokecA.RData ====

  > setwd("D:/Data/SNAP/pokec")
  > sel <- c(rep("character",11),rep("NULL",29))
  > system.time({d <- read.csv("pokec1.csv",header=TRUE,sep=";",colClasses=sel,
  +   na.strings=c("null"),fill=TRUE,quote="|",comment.char="",flush=TRUE)})
     user  system elapsed
   292.17    2.75  317.88
  >
  > dim(d)
  [1] 1632803 11
  > summary(d)
  user_id public completion_percentage gender region last_login registration age body I_am_working_in_field spoken_languages
  > id <- as.integer(d$user_id)
  > public <- as.integer(d$public)
  > table(public)
  public
        0       1
   552525 1080278
  > complete <- as.numeric(d$completion_percentage)
  > summary(complete)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.00   12.00   41.00   39.79   64.00  100.00
  > gender <- as.integer(d$gender)
  > table(gender)
  gender
       0      1
  828304 804336
  > reg <- d$region
  > head(reg)
  [1] "zilinsky kraj, zilina"
  [2] "zilinsky kraj, kysucke nove mesto"
  [3] "zilinsky kraj, kysucke nove mesto"
  [4] "bratislavsky kraj, bratislava - karlova ves"
  [5] "banskobystricky kraj, brezno"
  [6] "zilinsky kraj, martin"
  > i <- 1:length(reg)
  > length(i)
  [1] 1632803
  > s <- unlist(strsplit(reg,', '))
  > county <- s[2*i-1]
  > place <- s[2*i]
  > head(county)
  [1] "zilinsky kraj" "zilinsky kraj" "zilinsky kraj"
  [4] "bratislavsky kraj" "banskobystricky kraj" "zilinsky kraj"
  > pla <- as.factor(place)
  > age <- as.integer(d$age)
  > summary(age)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
     0.00    0.00   19.00   17.07   26.00  112.00  163.00
  > login <- as.Date(d$last_login)
  > register <- as.Date(d$registration)
  > body <- d$body
  > work <- d$I_am_working_in_field
  > lang <- d$spoken_languages
  > save(id,public,complete,gender,reg,age,login,register,body,work,lang,file="pokecA.RData")

[[http://vlado.fmf.uni-lj.si/pub/networks/Data/snap/pokecA.RData|pokecA.RData]] (34 M)

==== pokecB.RData ====

  > setwd("D:/Data/SNAP/pokec")
  > sel <- c("character",rep("NULL",10),"character","NULL","character",rep("NULL",2),
  +   rep("character",3),"NULL",rep("character",4),rep("NULL",4),"character",
  +   rep("NULL",2),rep("character",3),rep("NULL",2),rep("character",4))
  > system.time({d <- read.csv("pokec1.csv",header=TRUE,sep=";",colClasses=sel,
  +   na.strings=c("null"),fill=TRUE,quote="|",comment.char="",flush=TRUE)})
     user  system elapsed
   277.35    3.35  285.64
  > dim(d)
  [1] 1632803 18
  > save(d,file="pokecB.RData")

[[http://vlado.fmf.uni-lj.si/pub/networks/Data/snap/pokecB.RData|pokecB.RData]] (101 M)

===== Cleaning of variables =====

  > setwd("D:/Data/SNAP/pokec")
  > load("pokecB.RData")
  > objects()
  [1] "d"
  > dim(d)
  [1] 1632803 18
  > colnames(d)
   [1] "user_id" "hobbies" "pets"
   [4] "eye_color" "hair_color" "hair_type"
   [7] "favourite_color" "relation_to_smoking" "relation_to_alcohol"
  [10] "sign_in_zodiac" "marital_status" "I_like_movies"
  [13] "I_like_watching_movie" "I_like_music" "I_am_going_to_concerts"
  [16] "my_active_sports" "my_passive_sports" "profession"
  > zodiac <- d$sign_in_zodiac
  > zod <- substr(zodiac,1,3)
  > t <- table(zod)
  > sort(t,decreasing=TRUE)[1:30]
  zod
    lev   rak   bli   byk   bar   ryb   pan   vod   sko   koz   vah   str   som   ...    to   rac
  76264 71131 69216 68539 67214 65253 63696 63091 62407 61758 61670 59191  1668  1066   840   821
    nev     ♥   tak   tvr    ja   pot   opi   dra   kra    no   pes   had   kro   nep
    657   621   571   552   520   497   478   473   414   407   380   377   362   340
  >
  > eyeC <- d$eye_color
  > ec <- substr(eyeC,1,6)
  > t <- table(ec)
  > sort(t,decreasing=TRUE)[1:30]
  ec
   hnede  modre zelene hnede, modre, cierne zeleno   sive modro- hnedoz hnedo- hnede  modre 
  297289 198895 149255  30104  25288  24374  21679  11605   6688   5771   5509   5105   4315
  modroz  modro  hnedo  hneda  sive,  modra zelena cerven  tmavo  podla modruc sivomo krasne
    3879   3519   3418   3415   3399   2527   1893   1563   1323   1298   1178   1107   1059
  hnede. modros neviem cokola
    1055   1032   1007    933
  > pets <- substr(d$pets,1,6)
  > t <- table(pets)
  > sort(t,decreasing=TRUE)[1:30]
  pets
   nemam mam ps    pes pes, m mam ma macka, pes, r mam ry  macka mam ko vtacik
  128457 116975  76026  34902  25878  14513  11031   9544   8331   7730   7398
  pes, v mam vt  rybky mam hl korytn pes, a rybky, mal so pes, k papaga mala s
    6992   6626   6275   6171   5913   4784   4719   4612   4180   3953   3947
  mam dv nemam  skreco mam pa pes, p mam br  mam 2 mam ha
    3775   3693   3509   3343   3167   2746   2503   2390

===== Making the network =====

To produce the network data, the properties should be reordered with respect to the ''user_id''.

  Subject: Re: Pokec
  From: Ľuboš Takáč
  Date: Sat, May 18, 2013 17:07
  To: vladimir.batagelj@fmf.uni-lj.si
  
  Hallo Vladimir,
  
  the number corresponds to user_id, it mean that f.e. row in relationship file
  4 5
  is friendship relation between users with ids 4 and 5, there is not garanted
  that you find it in profiles data on rows 4 or 5. You have to find users wit
  such ids. So as you said in second example.
  
  Kind regards
  Lubos Takac.
  
  2013/5/18 Vladimir Batagelj
  > Dear Lubos Takac,
  >
  > at http://snap.stanford.edu/data/soc-pokec.html I found your data set
  > on Pokec. I would like to know what is the relation between node
  > numbers in network and the "user_id"s in the descriptions.
  > Is
  > the i-th row in the description file the description of the i-th node;
  > or
  > the i-th row in the description file is the description of the
  > corresponding user_id node;
  > or
  > something else?
  >
  > best regards, Vladimir Batagelj
  > --
  > Vladimir Batagelj, University of Ljubljana, FMF, Department of Mathematics
  > Jadranska 19, 1000 Ljubljana, Slovenia
  > http://vladowiki.fmf.uni-lj.si/doku.php?id=vlado
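Following the reply above, a property read from the profile file has to be placed at the position given by its ''user_id'', not at its row number. A minimal Python sketch of that reordering is given here. It assumes that the user ids are the integers 1..n, where n is the number of nodes; this assumption is not stated in the correspondence. The function name `reorder_by_user_id` is hypothetical. The demo uses the ids and ages (columns V1 and V8) of the six rows listed earlier on this page.

```python
def reorder_by_user_id(user_ids, values, n, missing=None):
    """Return a list of length n in which position i-1 holds the value of user i.

    user_ids[k] is the id found in row k of the profile data and values[k]
    the property read from the same row. Ids without a row keep `missing`.
    """
    out = [missing] * n
    for uid, val in zip(user_ids, values):
        if not 1 <= uid <= n:
            raise ValueError("user_id {} outside 1..{}".format(uid, n))
        out[uid - 1] = val
    return out

# Row order of the first six profile rows: ids are not in row order.
row_ids = [1, 2, 16, 3, 4, 17]
row_age = [26, 0, 23, 29, 26, 27]

# Illustration only: pretend the network has 17 nodes.
age_by_node = reorder_by_user_id(row_ids, row_age, n=17)
print(age_by_node)
```

After the reordering, the value for the node numbered 16 in the relationship file is found at index 15, regardless of the row in which that profile appeared.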