====== Pokec ======
===== Pokec social network =====
http://snap.stanford.edu/data/soc-pokec.html
Pokec is the most popular Slovak on-line social network. The datasets
are anonymized and contain the relationships and user profile data of the
whole network. Profile data are in the Slovak language. Friendships in the
Pokec network are directed. The datasets were crawled on May 25-27, 2012.
Author: Lubos Takac, lubos.takac@gmail.com
DATASET STATISTICS:
Nodes ............................ 1632803
Edges ............................ 30622564
Nodes in largest WCC ............. 1632803 (1.000)
Edges in largest WCC ............. 30622564 (1.000)
Nodes in largest SCC ............. 1304537 (0.799)
Edges in largest SCC ............. 29183655 (0.953)
Average clustering coefficient ... 0.1094
Number of triangles .............. 32557458
Fraction of closed triangles ..... 0.01611
Diameter (longest shortest path) . 11
90-percentile effective diameter . 5.3
The data file "soc-pokec-profiles.txt.gz" is large. To unzip it I installed gzip (http://www.gzip.org/ , http://gnuwin32.sourceforge.net/packages/gzip.htm ). TextPad and other text editors were not able to open the unpacked file "soc-pokec-profiles.txt", so I also installed the editor http://www.emeditor.com/ (trial version), which handles it well.
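The compressed file can also be inspected without unpacking it or opening it in an editor. The following is a minimal Python sketch, assuming the file is UTF-8 encoded text; the helper names `peek` and `count_lines` are hypothetical and not part of the original workflow.

```python
import gzip
from itertools import islice

def peek(path, n=5, encoding="utf-8"):
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(path, "rt", encoding=encoding) as f:
        return [line.rstrip("\r\n") for line in islice(f, n)]

def count_lines(path):
    """Count LF-terminated lines by streaming the file in binary mode,
    so the unpacked data never has to fit in memory."""
    with gzip.open(path, "rb") as f:
        return sum(1 for _ in f)

# Example use (file name as on the SNAP page):
# for line in peek("soc-pokec-profiles.txt.gz", 3):
#     print(line[:80])
# print(count_lines("soc-pokec-profiles.txt.gz"))
```

Streaming keeps the memory use constant, which matters on a notebook with limited RAM.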
===== Conversion to Pajek =====
I first processed the data on a 64-bit notebook with 8 GB of memory.
> setwd("D:/Data/SNAP")
> d <- read.delim("soc-pokec-profiles.txt",header=FALSE,sep="\t")
> head(d)
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 1 1 14 1 zilinsky kraj, zilina 2012-05-25 11:20:00.0 2005-04-03 00:00:00.0 26 185 cm, 90 kg
2 2 1 62 0 zilinsky kraj, kysucke nove mesto 2012-05-25 23:08:00.0 2007-11-30 00:00:00.0 0 166 cm, 58 kg
3 16 1 64 1 zilinsky kraj, kysucke nove mesto 2012-05-25 23:19:40.0 2008-05-18 00:00:00.0 23 173 cm, 70 kg
4 3 0 38 1 bratislavsky kraj, bratislava - karlova ves 2012-05-10 18:05:00.0 2010-05-23 00:00:00.0 29 null
5 4 1 12 0 banskobystricky kraj, brezno 2011-12-29 12:25:00.0 2011-12-29 00:00:00.0 26 null
6 17 1 47 0 zilinsky kraj, martin 2012-05-25 09:40:00.0 2006-10-21 00:00:00.0 27 162 cm, 60 kg
V10 V11
1 it anglicky
2 null nemecky
3 najvatcsej firme na svete urad prace no predsa svoj :d a najlepsie druhy
4 reklamy a medii, sluzieb a obchodu anglicky, nemecky
5 null null
6 null anglicky, nemecky
V12
1 sportovanie, spanie, kino, jedlo, pocuvanie hudby, priatelia, divadlo
2 turistika, prace okolo domu, praca s pc, pocuvanie hudby, pozeranie filmov, tancovanie, diskoteky, kupalisko, varenie, party, priatelia, spanie, nakupovanie, stanovanie
3 cestovanie, pocuvanie hudby, nenudit sa
4 sportovanie, cestovanie
5 null
6 citanie, pocuvanie hudby, pozeranie filmov, spanie, nakupovanie
V13 V14 V15
1 v dobrej restauracii mam psa null
2 pri svieckach s partnerom macka priemerna
3 v dobrej restauracii ja a nas prefikany alik :) nemozem pribrat nedasa smola som moc aktivny
4 null null null
5 null null null
6 null pes priemerna
V16 V17 V18 V19
1 null null null null
2 vyborny zelene cierne dlhe
3 to co by som mal nosit tak nenosim asi tak hnede hnede hnede null
4 null zelene hnede null
5 null null null null
6 null zelene blond, odfarbene dlhe
V20 V21 V22
1 null null null
2 zakladne, ale som uz na strednej skole dufam ze ju spravim cierna, modra, ruzova nefajcim
3 coskoro 24.5 alebo 31.9 :d biela, modra, zelena nemam
4 null cierna, modra null
5 null null null
6 vysokoskolske biela, cierna, modra, fialova nefajcim
V23 V24
1 null null
2 pijem prilezitostne, iba ked sa nieco kona a to napr. na zabave,na chate,na stanovackach a pod. byk
3 pijem iba ked musim ...svadby.pohreby.krstiny a tak lev
4 null null
5 null null
6 pijem prilezitostne rak
V25
1 null
2 dobreho priatela, priatelku, mozno aj viac
3 null
4 dobreho priatela, priatelku, niekoho na chatovanie
5 null
6 null
V26
1 null
2 nie je nic lepsie, ako byt zamilovany(a)
3 oplati sa pre nu bojovat
4 null
5 null
6 nie je nic lepsie, ako byt zamilovany(a), hladat lasku?nezmysel...hovori rozum. smiesne...hovori hrdost. riskantne...brani skusenost. \\"ale samota ta zabija\\" sepka srdce!
V27 V28 V29
1 null null null
2 iba s mojou laskou laskou mojho zivota slobodny(a)
3 ja uz som stary na take veci :) ked ho stretnem tak vam o nom porozpravam :) slobodny(a)
4 null null slobodny(a)
5 null null null
6 nedokazem mat s niekym sex bez lasky null vydata za najuzasnejsieho cloveka pod slnkom
V30
1 null
2 no budu a tak chcem 2 deti staci a tak ked budeme vladat tak bude aj viac co ja viem co ma v zivote postretne:d
3 casom ak budem este vladat :d
4 null
5 null
6 null
V31 V32 V33
1 null null null
2 v buducnosti chcem mat deti komedie, romanticke doma z gauca
3 null take co ma uputaju v kine s ludmy ktory mam rad
4 null akcne, horory, komedie, sci-fi, dokumentarne, historicke null
5 null null null
6 v buducnosti chcem mat deti null null
V34
1 null
2 disko, pop, rap a jasn eto co teraz leti najviac nejlepsie je fun-radio
3 hoci co co zapasuje ale klasa vede atb samozrejme najnovsie co sa hrava vrebrickoch hytparad :xd
4 rock, metal, house, techno, pop, oldies, jazz
5 null
6 null
V35 V36 V37
1 null null null
2 na diskoteke, pri chodzi pri svieckach s partnerom slovenskej
3 samozrejme sam kazdy ma iny vkus neda sa vsetkym vyhoviet null ak sa to da ziet tak setko
4 v aute, v praci, na koncerte, s partnerom null slovenskej, talianskej, japonskej
5 null null null
6 null null null
V38 V39
1 null null
2 null
3 null nie
4 null pravidelne
5 null null
6 null null
V40 V41 V42 V43
1 null null null null
2 null null null null
3 lyzovanie, plavanie non kanal bit lepsi ako druhy uz ich moc nectem
4 hokej, futbal, auto-moto sporty, squash auto-moto sporty, futbal, hokej null null
5 null null null null
6 aerobik, kolieskove korcule, plavanie, posilnovanie null zivnostnik null
V44 V45 V46 V47
1 null null null null
2 null null null null
3 null null null null
4 null null null null
5 null null null null
6 null null null
V48 V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60
1 null null null null null null null null null null null null NA
2 null null null null null null null null null null null null NA
3 null null null null null null null null null null null null NA
4 null null null null null null null null null null null null NA
5 null null null null null null null null null null null null NA
6 null null null null null null null null null null null null NA
>
I added the column names:
colNames <- c(
"user_id",
"public",
"completion_percentage",
"gender",
"region",
"last_login",
"registration",
"AGE",
"body",
"I_am_working_in_field",
"spoken_languages",
"hobbies",
"I_most_enjoy_good_food",
"pets",
"body_type",
"my_eyesight",
"eye_color",
"hair_color",
"hair_type",
"completed_level_of_education",
"favourite_color",
"relation_to_smoking",
"relation_to_alcohol",
"sign_in_zodiac",
"on_pokec_i_am_looking_for",
"love_is_for_me",
"relation_to_casual_sex",
"my_partner_should_be",
"marital_status",
"children",
"relation_to_children",
"I_like_movies",
"I_like_watching_movie",
"I_like_music",
"I_mostly_like_listening_to_music",
"the_idea_of_good_evening",
"I_like_specialties_from_kitchen",
"fun",
"I_am_going_to_concerts",
"my_active_sports",
"my_passive_sports",
"profession",
"I_like_books",
"life_style",
"music",
"cars",
"politics",
"relationships",
"art_culture",
"hobbies_interests",
"science_technologies",
"computers_internet",
"education",
"sport",
"movies",
"travelling",
"health",
"companies_brands",
"more" )
and assigned them to the data frame:
> dim(d)
[1] 1062701 60
> colnames(d) <- colNames
> write.table(d,"pokec.csv",sep=";")
> e <- d[,c(1:37,39:43)]
> dim(e)
[1] 1062701 42
> write.table(e,"pikec.csv",sep=";")
There is a problem: R reads only 1062701 rows, although the network has 1632803 nodes. I was not able to find the error in the R commands or the input files, so I decided to make the selection of variables in Python.
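One way to narrow down such a discrepancy is to count the physical lines and the number of tab-separated fields per line without any CSV parser in between. The sketch below is only a possible check, not a diagnosis of the actual cause; the function name `field_count_profile` is hypothetical, and UTF-8 encoding with LF line ends is assumed.

```python
from collections import Counter

def field_count_profile(path, sep="\t", encoding="utf-8"):
    """Return (number of lines, Counter of fields per line,
    number of lines containing a double quote)."""
    n_lines = 0
    widths = Counter()
    quoted = 0
    # newline="\n": split records on LF only, without translating line ends
    with open(path, encoding=encoding, newline="\n") as f:
        for line in f:
            n_lines += 1
            widths[line.rstrip("\n").count(sep) + 1] += 1
            if '"' in line:
                quoted += 1
    return n_lines, widths, quoted

# Example use:
# n, widths, quoted = field_count_profile("soc-pokec-profiles.txt")
# print(n, widths.most_common(5), quoted)
```

If the line count is 1632803 and the field count is constant, the file itself is regular and the rows are lost during parsing. Lines containing quote characters would then be worth inspecting first, because `read.delim` treats `"` as a quoting character by default.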
===== Selection of variables =====
import csv, sys, os
# http://docs.python.org/py3k/library/csv.html
# http://snap.stanford.edu/data/soc-pokec.html
os.chdir("D:/Data/SNAP/pokec")
podatki = "soc-pokec-profiles.txt"
cns = [ "user_id", "public", "completion_percentage", "gender", "region",
"last_login", "registration", "age", "body", "I_am_working_in_field",
"spoken_languages", "hobbies", "I_most_enjoy_good_food", "pets",
"body_type", "my_eyesight", "eye_color", "hair_color", "hair_type",
"completed_level_of_education", "favourite_color", "relation_to_smoking",
"relation_to_alcohol", "sign_in_zodiac", "on_pokec_i_am_looking_for",
"love_is_for_me", "relation_to_casual_sex", "my_partner_should_be",
"marital_status", "children", "relation_to_children", "I_like_movies",
"I_like_watching_movie", "I_like_music", "I_mostly_like_listening_to_music",
"the_idea_of_good_evening", "I_like_specialties_from_kitchen", "fun",
"I_am_going_to_concerts", "my_active_sports", "my_passive_sports",
"profession", "I_like_books", "life_style", "music", "cars", "politics",
"relationships", "art_culture", "hobbies_interests", "science_technologies",
"computers_internet", "education", "sport", "movies", "travelling",
"health", "companies_brands", "more" ]
with open(podatki,newline='',encoding='utf-8') as dat,\
     open('pokec1.csv','w',newline='',encoding='utf-8') as lst:
    datReader = csv.reader(dat,delimiter='\t',quotechar='"')
    lstWriter = csv.writer(lst,delimiter=';',quotechar='|',
                           quoting=csv.QUOTE_NONNUMERIC)
    n = 0
    try:
        # keep columns 1-36 and 39-42 (0-based slices 0:36 and 38:42)
        lstWriter.writerow(cns[0:36]+cns[38:42])
        for row in datReader:
            n = n+1
            if (n % 10000) == 0: print(n)
            lstWriter.writerow(row[0:36]+row[38:42])
    except csv.Error as e:
        sys.exit('file {}, line {}: {}'.format(
            podatki, datReader.line_num, e))
print(n)
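The Python slices `[0:36]` and `[38:42]` are 0-based and exclude their upper bound, so they do not select the same columns as the 1-based R expression `c(1:37,39:43)` tried earlier. A small check of the difference, using 1-based column positions for the 59 named columns:

```python
positions = list(range(1, 60))                     # 1-based positions of the 59 named columns
kept_py = positions[0:36] + positions[38:42]       # slices used in the Python script
kept_r = list(range(1, 38)) + list(range(39, 44))  # R's c(1:37, 39:43)

print(len(kept_py))                         # 40
print(sorted(set(kept_r) - set(kept_py)))   # [37, 43]
```

The Python selection therefore has 40 columns: compared with the R attempt it additionally drops column 37 (`I_like_specialties_from_kitchen`) and column 43 (`I_like_books`). This agrees with the 40-element `colClasses` vectors used when the file is read back into R.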
The rest was done in R.
==== pokecA.RData ====
> setwd("D:/Data/SNAP/pokec")
> sel <- c(rep("character",11),rep("NULL",29))
> system.time({d <- read.csv("pokec1.csv",header=TRUE,sep=";",colClasses=sel,
+ na.strings=c("null"),fill=TRUE,quote="|",comment.char="",flush=TRUE)})
user system elapsed
292.17 2.75 317.88
>
> dim(d)
[1] 1632803 11
> summary(d)
user_id public completion_percentage gender
region last_login registration age
body I_am_working_in_field spoken_languages
> id <- as.integer(d$user_id)
> public <- as.integer(d$public)
> table(public)
public
0 1
552525 1080278
> complete <- as.numeric(d$completion_percentage)
> summary(complete)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 12.00 41.00 39.79 64.00 100.00
> gender <- as.integer(d$gender)
> table(gender)
gender
0 1
828304 804336
> reg <- d$region
> head(reg)
[1] "zilinsky kraj, zilina"
[2] "zilinsky kraj, kysucke nove mesto"
[3] "zilinsky kraj, kysucke nove mesto"
[4] "bratislavsky kraj, bratislava - karlova ves"
[5] "banskobystricky kraj, brezno"
[6] "zilinsky kraj, martin"
> i <- 1:length(reg)
> length(i)
[1] 1632803
> s <- unlist(strsplit(reg,', '))
> county <- s[2*i-1]
> place <- s[2*i]
> head(county)
[1] "zilinsky kraj" "zilinsky kraj" "zilinsky kraj"
[4] "bratislavsky kraj" "banskobystricky kraj" "zilinsky kraj"
> pla <- as.factor(place)
> age <- as.integer(d$age)
> summary(age)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 0.00 19.00 17.07 26.00 112.00 163.00
> login <- as.Date(d$last_login)
> register <- as.Date(d$registration)
> body <- d$body
> work <- d$I_am_working_in_field
> lang <- d$spoken_languages
> save(id,public,complete,gender,reg,age,login,register,body,work,lang,file="pokecA.RData")
[[http://vlado.fmf.uni-lj.si/pub/networks/Data/snap/pokecA.RData|pokecA.RData]] (34 M)
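The `unlist(strsplit(reg,', '))` step in the session above pairs odd and even elements, which lines up only if every `region` value contains exactly one `", "`. A missing value or an extra comma would shift all later entries. The `age` summary reports 163 missing values, so at least some rows are incomplete; whether `region` is among the missing fields is not shown here. A per-row split avoids the issue. This is a minimal Python sketch; the helper name `split_region` is hypothetical.

```python
def split_region(region):
    """Split 'county, place' into (county, place).
    Missing values give (None, None); a value without ', ' gives (value, None)."""
    if region is None or region == "null":
        return None, None
    county, sep, place = region.partition(", ")
    return county, (place if sep else None)

print(split_region("zilinsky kraj, zilina"))
print(split_region("null"))
```

Because `partition` splits at the first separator only, a place name that itself contains a comma stays in one piece.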
==== pokecB.RData ====
> setwd("D:/Data/SNAP/pokec")
> sel <- c("character",rep("NULL",10),"character","NULL","character",rep("NULL",2),
+ rep("character",3),"NULL",rep("character",4),rep("NULL",4),"character",
+ rep("NULL",2),rep("character",3),rep("NULL",2),rep("character",4))
> system.time({d <- read.csv("pokec1.csv",header=TRUE,sep=";",colClasses=sel,
+ na.strings=c("null"),fill=TRUE,quote="|",comment.char="",flush=TRUE)})
user system elapsed
277.35 3.35 285.64
> dim(d)
[1] 1632803 18
> save(d,file="pokecB.RData")
[[http://vlado.fmf.uni-lj.si/pub/networks/Data/snap/pokecB.RData|pokecB.RData]] (101 M)
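The `colClasses` vector above is easy to get wrong by one position, so it can be useful to check which names it keeps. The sketch below rebuilds the mask in Python and applies it to the 40 header names of `pokec1.csv`; it only mirrors the R vector and is not part of the original workflow.

```python
# Header of pokec1.csv: the 40 names written by the Python script.
header = [
    "user_id", "public", "completion_percentage", "gender", "region",
    "last_login", "registration", "age", "body", "I_am_working_in_field",
    "spoken_languages", "hobbies", "I_most_enjoy_good_food", "pets",
    "body_type", "my_eyesight", "eye_color", "hair_color", "hair_type",
    "completed_level_of_education", "favourite_color", "relation_to_smoking",
    "relation_to_alcohol", "sign_in_zodiac", "on_pokec_i_am_looking_for",
    "love_is_for_me", "relation_to_casual_sex", "my_partner_should_be",
    "marital_status", "children", "relation_to_children", "I_like_movies",
    "I_like_watching_movie", "I_like_music", "I_mostly_like_listening_to_music",
    "the_idea_of_good_evening", "I_am_going_to_concerts", "my_active_sports",
    "my_passive_sports", "profession",
]

# Same sequence as the R vector `sel` used for pokecB.RData.
sel = (["character"] + ["NULL"] * 10 + ["character", "NULL", "character"]
       + ["NULL"] * 2 + ["character"] * 3 + ["NULL"] + ["character"] * 4
       + ["NULL"] * 4 + ["character"] + ["NULL"] * 2 + ["character"] * 3
       + ["NULL"] * 2 + ["character"] * 4)

kept = [name for name, cls in zip(header, sel) if cls != "NULL"]
print(len(sel), len(kept))   # 40 18
print(kept)
```

The 18 kept names agree with the 18 columns reported by `dim(d)`.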
===== Cleaning of variables =====
> setwd("D:/Data/SNAP/pokec")
> load("pokecB.RData")
> objects()
[1] "d"
> dim(d)
[1] 1632803 18
> colnames(d)
[1] "user_id" "hobbies" "pets"
[4] "eye_color" "hair_color" "hair_type"
[7] "favourite_color" "relation_to_smoking" "relation_to_alcohol"
[10] "sign_in_zodiac" "marital_status" "I_like_movies"
[13] "I_like_watching_movie" "I_like_music" "I_am_going_to_concerts"
[16] "my_active_sports" "my_passive_sports" "profession"
> zodiac <- d$sign_in_zodiac
> zod <- substr(zodiac,1,3)
> t <- table(zod)
> sort(t,decreasing=TRUE)[1:30]
zod
  lev   rak   bli   byk   bar   ryb   pan   vod   sko   koz   vah   str   som   ...    to   rac
76264 71131 69216 68539 67214 65253 63696 63091 62407 61758 61670 59191  1668  1066   840   821
  nev     ♥   tak   tvr    ja   pot   opi   dra   kra    no   pes   had   kro   nep
  657   621   571   552   520   497   478   473   414   407   380   377   362   340
>
> eyeC <- d$eye_color
> ec <- substr(eyeC,1,6)
> t <- table(ec)
> sort(t,decreasing=TRUE)[1:30]
ec
 hnede  modre zelene hnede, modre, cierne zeleno   sive modro- hnedoz hnedo-  hnede  modre
297289 198895 149255  30104  25288  24374  21679  11605   6688   5771   5509   5105   4315
modroz  modro  hnedo  hneda  sive,  modra zelena cerven  tmavo  podla modruc sivomo krasne
  3879   3519   3418   3415   3399   2527   1893   1563   1323   1298   1178   1107   1059
hnede. modros neviem cokola
  1055   1032   1007    933
> pets <- substr(d$pets,1,6)
> t <- table(pets)
> sort(t,decreasing=TRUE)[1:30]
pets
 nemam  mam ps     pes  pes, m  mam ma  macka,  pes, r  mam ry   macka  mam ko  vtacik
128457  116975   76026   34902   25878   14513   11031    9544    8331    7730    7398
pes, v  mam vt   rybky  mam hl  korytn  pes, a  rybky,  mal so  pes, k  papaga  mala s
  6992    6626    6275    6171    5913    4784    4719    4612    4180    3953    3947
mam dv   nemam  skreco  mam pa  pes, p  mam br   mam 2  mam ha
  3775    3693    3509    3343    3167    2746    2503    2390
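The tabulations above all follow the same pattern: cut each free-text answer to a short prefix and count the prefixes. For `sign_in_zodiac`, twelve prefixes have counts above 59000 and all others fall below 1700, which separates the standard answers from free text. The same pattern in Python is a minimal sketch; the function name `prefix_counts` is hypothetical.

```python
from collections import Counter

def prefix_counts(values, width, top=30):
    """Count the first `width` characters of each non-missing value and
    return the `top` most frequent prefixes as (prefix, count) pairs."""
    counts = Counter(v[:width] for v in values if v is not None)
    return counts.most_common(top)

# Small illustrative input, not taken from the data set:
sample = ["lev", "lev :)", "rak", None, "levik"]
print(prefix_counts(sample, 3))   # [('lev', 3), ('rak', 1)]
```

Missing values are skipped, as `table()` does by default for `NA`.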
===== Making the network =====
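To produce the network data, the properties have to be reordered with respect to
the ''user_id'': the node numbers in the relationship file are user ids, not row
positions in the profile file, as the author confirmed in the following correspondence.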
Subject: Re: Pokec
From: Ľuboš Takáč
Date: Sat, May 18, 2013 17:07
To: vladimir.batagelj@fmf.uni-lj.si
Hallo Vladimir,
the number corresponds to user_id,
it mean that f.e. row in relationship file
4 5
is friendship relation between users with ids 4 and 5, there is not
garanted that you find it in profiles data on rows 4 or 5. You have to find
users wit such ids.
So as you said in second example.
Kind regards Lubos Takac.
2013/5/18 Vladimir Batagelj
> Dear Lubos Takac,
>
> at http://snap.stanford.edu/data/soc-pokec.html I found your data set
> on Pokec. I would like to know what is the relation between node
> numbers in network and the "user_id"s in the descriptions.
> Is
> the i-th row in the description file the description of the i-th node;
> or
> the i-th row in the description file is the description of the
> corresponding user_id node;
> or
> something else?
>
> best regards, Vladimir Batagelj
> --
> Vladimir Batagelj, University of Ljubljana, FMF, Department of Mathematics
> Jadranska 19, 1000 Ljubljana, Slovenia
> http://vladowiki.fmf.uni-lj.si/doku.php?id=vlado
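Following the reply above, profile rows have to be matched to nodes by `user_id` rather than by row position. The Python sketch below reorders one attribute accordingly and writes it as a Pajek partition. It assumes that the user ids are exactly the integers 1..n (not verified here) and that the attribute is already coded as integers. The function names are hypothetical. The `.clu` layout used is a `*Vertices n` line followed by one integer per vertex.

```python
def order_by_user_id(user_ids, values, n, missing=0):
    """Return a list in which position i-1 holds the value of the user with id i.
    Ids without a profile row keep the `missing` code."""
    ordered = [missing] * n
    for uid, val in zip(user_ids, values):
        if not 1 <= uid <= n:
            raise ValueError("user_id {} outside 1..{}".format(uid, n))
        ordered[uid - 1] = val
    return ordered

def write_clu(path, partition):
    """Write a Pajek partition file: header line, then one class per vertex."""
    with open(path, "w", encoding="ascii") as f:
        f.write("*Vertices {}\n".format(len(partition)))
        for c in partition:
            f.write("{}\n".format(c))

# Rows arrive in file order (ids 2, 3, 1); the result is in node order.
print(order_by_user_id([2, 3, 1], [0, 1, 1], 3))   # [1, 0, 1]
```

With the attributes in node order, the edge list from the relationship file can be used with its numbers unchanged as Pajek vertex numbers, under the same assumption about the ids.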