Pokec

Pokec social network

http://snap.stanford.edu/data/soc-pokec.html

Pokec is the most popular Slovak online social network. The datasets are anonymized and contain the relationships and user profile data of the whole network. Profile data are in Slovak. Friendships in the Pokec network are directed. The datasets were crawled on May 25-27, 2012.

Author: Lubos Takac, lubos.takac@gmail.com

DATASET STATISTICS:

Nodes ............................  1632803
Edges ............................ 30622564
Nodes in largest WCC .............  1632803 (1.000)
Edges in largest WCC ............. 30622564 (1.000)
Nodes in largest SCC .............  1304537 (0.799)
Edges in largest SCC ............. 29183655 (0.953)
Average clustering coefficient ...   0.1094
Number of triangles .............. 32557458
Fraction of closed triangles .....  0.01611
Diameter (longest shortest path) .       11
90-percentile effective diameter .      5.3

The data file “soc-pokec-profiles.txt.gz” is large. To unzip it I installed gzip (http://www.gzip.org/ , http://gnuwin32.sourceforge.net/packages/gzip.htm ). TextPad and other text editors were not able to open the unzipped file “soc-pokec-profiles.txt”, so I also installed the editor EmEditor (http://www.emeditor.com/ , trial version), which handles the file well.
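A large gzipped file can also be inspected without unpacking it or opening it in an editor. The following is a minimal sketch in Python, assuming the file is UTF-8 encoded and tab-separated (as the Python script later in these notes assumes); the function name `peek_gzip` is hypothetical.

```python
import gzip
from itertools import islice

def peek_gzip(path, n_lines=3, encoding="utf-8"):
    """Return the first n_lines lines of a gzipped text file, split on tabs.

    The file is streamed, so only the requested lines are decompressed.
    Assumption: UTF-8 text with tab-separated fields.
    """
    with gzip.open(path, mode="rt", encoding=encoding) as handle:
        return [line.rstrip("\r\n").split("\t") for line in islice(handle, n_lines)]
```

For example, `peek_gzip("soc-pokec-profiles.txt.gz", 6)` would return the fields of the first six profiles, comparable to `head(d)` in R.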

Conversion to Pajek

I first processed the data in R on a 64-bit notebook with 8 GB of memory.

> setwd("D:/Data/SNAP")
> d <- read.delim("soc-pokec-profiles.txt",header=FALSE,sep="\t")
> head(d)
  V1 V2 V3 V4                                          V5                    V6                    V7 V8            V9
1  1  1 14  1                       zilinsky kraj, zilina 2012-05-25 11:20:00.0 2005-04-03 00:00:00.0 26 185 cm, 90 kg
2  2  1 62  0           zilinsky kraj, kysucke nove mesto 2012-05-25 23:08:00.0 2007-11-30 00:00:00.0  0 166 cm, 58 kg
3 16  1 64  1           zilinsky kraj, kysucke nove mesto 2012-05-25 23:19:40.0 2008-05-18 00:00:00.0 23 173 cm, 70 kg
4  3  0 38  1 bratislavsky kraj, bratislava - karlova ves 2012-05-10 18:05:00.0 2010-05-23 00:00:00.0 29          null
5  4  1 12  0                banskobystricky kraj, brezno 2011-12-29 12:25:00.0 2011-12-29 00:00:00.0 26          null
6 17  1 47  0                       zilinsky kraj, martin 2012-05-25 09:40:00.0 2006-10-21 00:00:00.0 27 162 cm, 60 kg
                                   V10                                 V11
1                                   it                            anglicky
2                                 null                             nemecky
3 najvatcsej firme na svete urad prace no predsa svoj :d a najlepsie druhy
4   reklamy a medii, sluzieb a obchodu                   anglicky, nemecky
5                                 null                                null
6                                 null                   anglicky, nemecky
                                                                                                                                                                       V12
1                                                                                                    sportovanie, spanie, kino, jedlo, pocuvanie hudby, priatelia, divadlo
2 turistika, prace okolo domu, praca s pc, pocuvanie hudby, pozeranie filmov, tancovanie, diskoteky, kupalisko, varenie, party, priatelia, spanie, nakupovanie, stanovanie
3                                                                                                                                  cestovanie, pocuvanie hudby, nenudit sa
4                                                                                                                                                  sportovanie, cestovanie
5                                                                                                                                                                     null
6                                                                                                          citanie, pocuvanie hudby, pozeranie filmov, spanie, nakupovanie
                        V13                        V14                                          V15
1      v dobrej restauracii                    mam psa                                         null
2 pri svieckach s partnerom                      macka                                    priemerna
3      v dobrej restauracii ja a nas prefikany alik :) nemozem pribrat nedasa smola som moc aktivny
4                      null                       null                                         null
5                      null                       null                                         null
6                      null                        pes                                    priemerna
                                         V16    V17              V18  V19
1                                       null   null             null null
2                                    vyborny zelene           cierne dlhe
3 to co by som mal nosit tak nenosim asi tak  hnede      hnede hnede null
4                                       null zelene            hnede null
5                                       null   null             null null
6                                       null zelene blond, odfarbene dlhe
                                                         V20                           V21      V22
1                                                       null                          null     null
2 zakladne, ale som uz na strednej skole dufam ze ju spravim         cierna, modra, ruzova nefajcim
3                              coskoro 24.5 alebo 31.9    :d          biela, modra, zelena    nemam
4                                                       null                 cierna, modra     null
5                                                       null                          null     null
6                                              vysokoskolske biela, cierna, modra, fialova nefajcim
                                                                                              V23  V24
1                                                                                            null null
2 pijem prilezitostne, iba ked sa nieco kona a to napr. na zabave,na chate,na stanovackach a pod.  byk
3                                             pijem iba ked musim ...svadby.pohreby.krstiny a tak  lev
4                                                                                            null null
5                                                                                            null null
6                                                                             pijem prilezitostne  rak
                                                 V25
1                                               null
2         dobreho priatela, priatelku, mozno aj viac
3                                               null
4 dobreho priatela, priatelku, niekoho na chatovanie
5                                               null
6                                               null
                                                                                                                                                                                      V26
1                                                                                                                                                                                    null
2                                                                                                                                                nie je nic lepsie, ako byt zamilovany(a)
3                                                                                                                                                                oplati sa pre nu bojovat
4                                                                                                                                                                                    null
5                                                                                                                                                                                    null
6 nie je nic lepsie, ako byt zamilovany(a), hladat lasku?nezmysel...hovori rozum. smiesne...hovori hrdost. riskantne...brani skusenost. \\&quot;ale samota ta zabija\\&quot; sepka srdce!
                                   V27                                          V28                                          V29
1                                 null                                         null                                         null
2                   iba s mojou laskou                          laskou mojho zivota                                  slobodny(a)
3      ja uz som stary na take veci :) ked ho stretnem tak vam o nom porozpravam :)                                  slobodny(a)
4                                 null                                         null                                  slobodny(a)
5                                 null                                         null                                         null
6 nedokazem mat s niekym sex bez lasky                                         null vydata za najuzasnejsieho cloveka pod slnkom
                                                                                                              V30
1                                                                                                            null
2 no budu a tak chcem 2 deti staci a tak ked budeme vladat tak bude aj viac co ja viem co ma v zivote postretne:d
3                                                                                   casom ak budem este vladat :d
4                                                                                                            null
5                                                                                                            null
6                                                                                                            null
                          V31                                                      V32                          V33
1                        null                                                     null                         null
2 v buducnosti chcem mat deti                                      komedie, romanticke                 doma z gauca
3                        null                                       take co ma uputaju v kine s ludmy ktory mam rad
4                        null akcne, horory, komedie, sci-fi, dokumentarne, historicke                         null
5                        null                                                     null                         null
6 v buducnosti chcem mat deti                                                     null                         null
                                                                                               V34
1                                                                                             null
2                          disko, pop, rap a jasn eto co teraz leti najviac nejlepsie je fun-radio
3 hoci co co zapasuje ale klasa vede atb samozrejme najnovsie co sa hrava vrebrickoch hytparad :xd
4                                                    rock, metal, house, techno, pop, oldies, jazz
5                                                                                             null
6                                                                                             null
                                                        V35                       V36                               V37
1                                                      null                      null                              null
2                                  na diskoteke, pri chodzi pri svieckach s partnerom                        slovenskej
3 samozrejme sam kazdy ma iny vkus neda sa vsetkym vyhoviet                      null        ak sa to da ziet tak setko
4                 v aute, v praci, na koncerte, s partnerom                      null slovenskej, talianskej, japonskej
5                                                      null                      null                              null
6                                                      null                      null                              null
                                                                                 V38        V39
1                                                                               null       null
2 <div> <a title=vstup do klubu href=/klub/profesionali>profesiona&shy;li</a> </div>       null
3                                                                               null        nie
4                                                                               null pravidelne
5                                                                               null       null
6                                                                               null       null
                                                  V40                             V41                 V42               V43
1                                                null                            null                null              null
2                                                null                            null                null              null
3                                 lyzovanie, plavanie                       non kanal bit lepsi ako druhy uz ich moc nectem
4             hokej, futbal, auto-moto sporty, squash auto-moto sporty, futbal, hokej                null              null
5                                                null                            null                null              null
6 aerobik, kolieskove korcule, plavanie, posilnovanie                            null          zivnostnik              null
                                                                                                                V44  V45  V46  V47
1                                                                                                              null null null null
2                                                                                                              null null null null
3                                                                                                              null null null null
4                                                                                                              null null null null
5                                                                                                              null null null null
6 <div> <a title=vstup do klubu href=/klub/magazin-pre-mamicky-najmama-sk>magazin pre mamicky najmama.sk</a> </div> null null null
   V48  V49  V50  V51  V52  V53  V54  V55  V56  V57  V58  V59 V60
1 null null null null null null null null null null null null  NA
2 null null null null null null null null null null null null  NA
3 null null null null null null null null null null null null  NA
4 null null null null null null null null null null null null  NA
5 null null null null null null null null null null null null  NA
6 null null null null null null null null null null null null  NA
> 

I added the column names:

colNames <- c(
 "user_id",
 "public",
 "completion_percentage",
 "gender",
 "region",
 "last_login",
 "registration",
 "AGE",
 "body",
 "I_am_working_in_field",
 "spoken_languages",
 "hobbies",
 "I_most_enjoy_good_food",
 "pets",
 "body_type",
 "my_eyesight",
 "eye_color",
 "hair_color",
 "hair_type",
 "completed_level_of_education",
 "favourite_color",
 "relation_to_smoking",
 "relation_to_alcohol",
 "sign_in_zodiac",
 "on_pokec_i_am_looking_for",
 "love_is_for_me",
 "relation_to_casual_sex",
 "my_partner_should_be",
 "marital_status",
 "children",
 "relation_to_children",
 "I_like_movies",
 "I_like_watching_movie",
 "I_like_music",
 "I_mostly_like_listening_to_music",
 "the_idea_of_good_evening",
 "I_like_specialties_from_kitchen",
 "fun",
 "I_am_going_to_concerts",
 "my_active_sports",
 "my_passive_sports",
 "profession",
 "I_like_books",
 "life_style",
 "music",
 "cars",
 "politics",
 "relationships",
 "art_culture",
 "hobbies_interests",
 "science_technologies",
 "computers_internet",
 "education",
 "sport",
 "movies",
 "travelling",
 "health",
 "companies_brands",
 "more" )

and applied them to the data frame:

> dim(d)
[1] 1062701      60
> colnames(d) <- colNames
> write.table(d,"pokec.csv",sep=";")
> e <- d[,c(1:37,39:43)]
> dim(e)
[1] 1062701      42
> write.table(e,"pikec.csv",sep=";")

There is a problem: R reads only 1062701 rows, although the dataset contains 1632803 profiles. I was not able to find the error in the R commands or in the input file, so I decided to make the selection of variables in Python.
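One way to narrow down such a discrepancy is to count the physical lines of the file and the number of tab-separated fields on each line, without any quote handling. The sketch below does this; the function name is hypothetical, and it assumes one profile per line and UTF-8 encoding. It also counts lines containing a double quote, because `read.delim` treats `"` as a quote character by default; whether that explains the missing rows here was not verified.

```python
from collections import Counter

def field_count_profile(path, delimiter="\t", encoding="utf-8"):
    """Return (number of lines, Counter of fields per line, lines containing '"')."""
    n_lines = 0
    n_with_quote = 0
    widths = Counter()
    # newline="\n": split lines on LF only, so a stray CR inside a field
    # does not start a new record.
    with open(path, encoding=encoding, newline="\n") as handle:
        for line in handle:
            n_lines += 1
            widths[len(line.rstrip("\n").split(delimiter))] += 1
            if '"' in line:
                n_with_quote += 1
    return n_lines, widths, n_with_quote
```

If the file holds one profile per line, the line count should match the 1632803 nodes listed in the statistics, and nearly all lines should have the same number of fields.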

Selection of variables

import csv, sys, os
# http://docs.python.org/py3k/library/csv.html
# http://snap.stanford.edu/data/soc-pokec.html
os.chdir("D:/Data/SNAP/pokec")
podatki = "soc-pokec-profiles.txt"
cns = [ "user_id", "public", "completion_percentage", "gender", "region",
 "last_login", "registration", "age", "body", "I_am_working_in_field",
 "spoken_languages", "hobbies", "I_most_enjoy_good_food", "pets",
 "body_type", "my_eyesight", "eye_color", "hair_color", "hair_type",
 "completed_level_of_education", "favourite_color", "relation_to_smoking",
 "relation_to_alcohol", "sign_in_zodiac", "on_pokec_i_am_looking_for",
 "love_is_for_me", "relation_to_casual_sex", "my_partner_should_be",
 "marital_status", "children", "relation_to_children", "I_like_movies",
 "I_like_watching_movie", "I_like_music", "I_mostly_like_listening_to_music",
 "the_idea_of_good_evening", "I_like_specialties_from_kitchen", "fun",
 "I_am_going_to_concerts", "my_active_sports", "my_passive_sports",
 "profession", "I_like_books", "life_style", "music", "cars", "politics",
 "relationships", "art_culture", "hobbies_interests", "science_technologies",
 "computers_internet", "education", "sport", "movies", "travelling",
 "health", "companies_brands", "more" ]
# copy the profiles to a ';'-separated file, keeping columns 1-36 and 39-42
with open(podatki,newline='',encoding='utf-8') as dat,\
     open('pokec1.csv','w',newline='',encoding='utf-8') as lst:
  datReader = csv.reader(dat,delimiter='\t',quotechar='"')
  # all fields are read as strings, so QUOTE_NONNUMERIC wraps each one in |...|
  lstWriter = csv.writer(lst,delimiter=';',quotechar='|',
                         quoting=csv.QUOTE_NONNUMERIC)
  n = 0   # number of rows read
  try:
    lstWriter.writerow(cns[0:36]+cns[38:42])   # header line
    for row in datReader:
      n = n+1
      if (n % 10000) == 0: print(n)            # progress report
      lstWriter.writerow(row[0:36]+row[38:42])
  except csv.Error as e:
    sys.exit('file {}, line {}: {}'.format(
      podatki, datReader.line_num, e))
print(n)
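
The slices `cns[0:36]+cns[38:42]` keep 40 of the 59 named columns: they drop `I_like_specialties_from_kitchen` and `fun` (positions 37 and 38) and everything from `I_like_books` onward. The same selection can be expressed by name, which is easier to check than index arithmetic. This is only a sketch; the helper names `keep_indices` and `select` are hypothetical and not part of the script above.

```python
def keep_indices(all_names, dropped_names):
    """Return the positions of the columns in all_names that are not dropped."""
    dropped = set(dropped_names)
    unknown = dropped - set(all_names)
    if unknown:
        # A misspelled name would otherwise be ignored silently.
        raise KeyError("unknown column names: {}".format(sorted(unknown)))
    return [i for i, name in enumerate(all_names) if name not in dropped]

def select(row, indices):
    """Pick the given positions from a row; positions past its end become ''."""
    return [row[i] if i < len(row) else "" for i in indices]
```

With `idx = keep_indices(cns, cns[36:38] + cns[42:])`, the calls `select(cns, idx)` and `select(row, idx)` would produce the same header and the same data rows as the slices for rows that have all columns. Short rows are padded here instead of shortened.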

The rest was done in R.

pokecA.RData

> setwd("D:/Data/SNAP/pokec")
> sel <- c(rep("character",11),rep("NULL",29))
> system.time({d <- read.csv("pokec1.csv",header=TRUE,sep=";",colClasses=sel,
+ na.strings=c("null"),fill=TRUE,quote="|",comment.char="",flush=TRUE)})
   user  system elapsed 
 292.17    2.75  317.88 
> 
> dim(d)
[1] 1632803      11
> summary(d)
user_id        public                  completion_percentage    gender         
region         last_login              registration             age           
body           I_am_working_in_field   spoken_languages  
> id <- as.integer(d$user_id)
> public <- as.integer(d$public)
> table(public)
public
      0       1 
 552525 1080278 
> complete <- as.numeric(d$completion_percentage) 
> summary(complete)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   12.00   41.00   39.79   64.00  100.00 
> gender <- as.integer(d$gender)
> table(gender)
gender
     0      1 
828304 804336 
> reg <- d$region
> head(reg)
[1] "zilinsky kraj, zilina"                      
[2] "zilinsky kraj, kysucke nove mesto"          
[3] "zilinsky kraj, kysucke nove mesto"          
[4] "bratislavsky kraj, bratislava - karlova ves"
[5] "banskobystricky kraj, brezno"               
[6] "zilinsky kraj, martin"                      
> i <- 1:length(reg)
> length(i)
[1] 1632803
> s <- unlist(strsplit(reg,', '))
> county <- s[2*i-1]
> place <- s[2*i]
> head(county)
[1] "zilinsky kraj"        "zilinsky kraj"        "zilinsky kraj"       
[4] "bratislavsky kraj"    "banskobystricky kraj" "zilinsky kraj"       
> pla <- as.factor(place)
> age <- as.integer(d$age)
> summary(age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    0.00   19.00   17.07   26.00  112.00  163.00 
> login <- as.Date(d$last_login)
> register <- as.Date(d$registration)
> body <- d$body
> work <- d$I_am_working_in_field
> lang <- d$spoken_languages
> save(id,public,complete,gender,reg,age,login,register,body,work,lang,file="pokecA.RData")  
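
The `strsplit`/`unlist` step above relies on every region string containing exactly one ", " separator. If any region is missing or has a different number of separators, all later entries of `county` and `place` shift. A per-record split avoids this. The sketch below is in Python; the function name is hypothetical, and treating the string "null" as missing follows the `na.strings` setting used above.

```python
def split_region(region):
    """Split 'county, place' into a (county, place) pair.

    Missing regions give (None, None); a region without ', ' gives (region, None).
    Only the first ', ' is used, so further commas stay in the place name.
    """
    if region is None or region == "null":
        return (None, None)
    county, separator, place = region.partition(", ")
    return (county, place if separator else None)
```

Applied to each record in turn, this keeps `county` and `place` aligned with the other per-user vectors whatever the content of individual fields.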

pokecA.RData (34 M)

pokecB.RData

 
> setwd("D:/Data/SNAP/pokec")
> sel <- c("character",rep("NULL",10),"character","NULL","character",rep("NULL",2),
+ rep("character",3),"NULL",rep("character",4),rep("NULL",4),"character",
+ rep("NULL",2),rep("character",3),rep("NULL",2),rep("character",4))
> system.time({d <- read.csv("pokec1.csv",header=TRUE,sep=";",colClasses=sel,
+ na.strings=c("null"),fill=TRUE,quote="|",comment.char="",flush=TRUE)})
   user  system elapsed 
 277.35    3.35  285.64 
> dim(d)
[1] 1632803      18     
> save(d,file="pokecB.RData")

pokecB.RData (101 M)

Cleaning of variables

> setwd("D:/Data/SNAP/pokec")
> load("pokecB.RData")
> objects()
[1] "d"
> dim(d)
[1] 1632803      18
> colnames(d)
 [1] "user_id"                "hobbies"                "pets"                  
 [4] "eye_color"              "hair_color"             "hair_type"             
 [7] "favourite_color"        "relation_to_smoking"    "relation_to_alcohol"   
[10] "sign_in_zodiac"         "marital_status"         "I_like_movies"         
[13] "I_like_watching_movie"  "I_like_music"           "I_am_going_to_concerts"
[16] "my_active_sports"       "my_passive_sports"      "profession"  

> zodiac <- d$sign_in_zodiac
> zod <- substr(zodiac,1,3)
> t <- table(zod)
> sort(t,decreasing=TRUE)[1:30]
zod
  lev   rak   bli   byk   bar   ryb   pan   vod   sko   koz   vah   str   som   ...   to    rac 
76264 71131 69216 68539 67214 65253 63696 63091 62407 61758 61670 59191  1668  1066   840   821 
  nev   ♥   tak   tvr   ja    pot   opi   dra   kra   no    pes   had   kro   nep 
  657   621   571   552   520   497   478   473   414   407   380   377   362   340 
>

> eyeC <- d$eye_color
> ec <- substr(eyeC,1,6)
> t <- table(ec)
> sort(t,decreasing=TRUE)[1:30]
ec
 hnede  modre zelene hnede, modre, cierne zeleno   sive modro- hnedoz hnedo- hnede  modre  
297289 198895 149255  30104  25288  24374  21679  11605   6688   5771   5509   5105   4315 
modroz modro  hnedo   hneda sive,   modra zelena cerven tmavo  podla  modruc sivomo krasne 
  3879   3519   3418   3415   3399   2527   1893   1563   1323   1298   1178   1107   1059 
hnede. modros neviem cokola 
  1055   1032   1007    933 
          
> pets <- substr(d$pets,1,6)
> t <- table(pets)
> sort(t,decreasing=TRUE)[1:30]
pets
nemam  mam ps    pes pes, m mam ma macka, pes, r mam ry  macka mam ko vtacik 
128457 116975  76026  34902  25878  14513  11031   9544   8331   7730   7398 
pes, v mam vt  rybky mam hl korytn pes, a rybky, mal so pes, k papaga mala s 
  6992   6626   6275   6171   5913   4784   4719   4612   4180   3953   3947 
mam dv  nemam skreco mam pa pes, p mam br mam 2  mam ha 
  3775   3693   3509   3343   3167   2746   2503   2390 
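
The prefix tabulations above suggest a simple cleaning rule: keep a value when its prefix is one of the frequent categories and treat everything else as missing. A minimal sketch for the zodiac sign follows. It assumes that the twelve most frequent three-letter prefixes in the table are the twelve signs; the names `prefix_counts` and `clean_zodiac` are hypothetical.

```python
from collections import Counter

# The twelve most frequent prefixes from the zodiac table above
# (assumed to correspond to the twelve signs).
ZODIAC_PREFIXES = {"lev", "rak", "bli", "byk", "bar", "ryb",
                   "pan", "vod", "sko", "koz", "vah", "str"}

def prefix_counts(values, width):
    """Tabulate the first `width` characters of each non-missing value,
    similar to table(substr(x, 1, width)) in R."""
    return Counter(v[:width] for v in values if v is not None)

def clean_zodiac(value):
    """Return the three-letter sign prefix, or None for free text or missing values."""
    if value is None:
        return None
    prefix = value[:3]
    return prefix if prefix in ZODIAC_PREFIXES else None
```

`prefix_counts(values, 3).most_common(30)` corresponds to `sort(t,decreasing=TRUE)[1:30]`. Matching on the prefix alone also accepts free text that merely starts with the same three letters, so the rule is an approximation.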

Making the network

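To produce the network data, the properties must be reordered according to user_id: the node numbers in the relationship file are user_ids, not row numbers in the profiles file. The correspondence with the dataset author below confirms this.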
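
A minimal sketch of such a reordering in Python, assuming each property is held as a list parallel to the list of user_ids (the function name is hypothetical):

```python
def order_by_user_id(user_ids, values):
    """Rearrange values so that position k holds the value of the k-th smallest user_id."""
    if len(user_ids) != len(values):
        raise ValueError("user_ids and values must have the same length")
    # Permutation that sorts the rows by user_id.
    order = sorted(range(len(user_ids)), key=user_ids.__getitem__)
    return [values[i] for i in order]
```

For the six rows shown by `head(d)`, the user_ids are 1, 2, 16, 3, 4, 17 and the ages are 26, 0, 23, 29, 26, 27; ordering by user_id gives the ages 26, 0, 29, 26, 23, 27.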

Subject:   	Re: Pokec
From:   	Ľuboš Takáč <lubos.takac@gmail.com>
Date:   	Sat, May 18, 2013 17:07
To:   	vladimir.batagelj@fmf.uni-lj.si

Hello Vladimir,
the number corresponds to user_id.
It means that, for example, the row
4 5
in the relationship file is a friendship relation between the users with ids 4 and 5; it is not
guaranteed that you will find them in the profiles data on rows 4 or 5. You have to find
the users with those ids.

So it is as you said in the second example.

Kind regards Lubos Takac.


2013/5/18 Vladimir Batagelj <vladimir.batagelj@fmf.uni-lj.si>

> Dear Lubos Takac,
>
> at http://snap.stanford.edu/data/soc-pokec.html I found your data set
> on Pokec. I would like to know what is the relation between node
> numbers in network and the "user_id"s in the descriptions.
> Is
> the i-th row in the description file the description of the i-th node;
> or
> the i-th row in the description file is the description of the
> corresponding user_id node;
> or
> something else?
>
> best regards,   Vladimir Batagelj
> --
> Vladimir Batagelj, University of Ljubljana, FMF, Department of Mathematics
>   Jadranska 19, 1000 Ljubljana, Slovenia
> http://vladowiki.fmf.uni-lj.si/doku.php?id=vlado
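
Given this answer, a Pajek network file can be written by numbering the vertices in increasing user_id order and translating each relationship row through that numbering. The following is only a sketch, not the conversion actually used for these notes. It assumes the basic Pajek .net layout (a `*Vertices n` section with numbered, labelled vertices followed by an `*Arcs` section) and that every user_id appearing in a relationship also appears in the profiles. The function name is hypothetical.

```python
def write_pajek(user_ids, arcs, out):
    """Write a directed network in Pajek .net format to the text stream `out`.

    user_ids: iterable of distinct user ids (used as vertex labels)
    arcs:     iterable of (source_user_id, target_user_id) pairs
    """
    # Pajek vertices are numbered 1..n; number them in increasing user_id order.
    number = {uid: k for k, uid in enumerate(sorted(user_ids), start=1)}
    out.write("*Vertices {}\n".format(len(number)))
    for uid, k in number.items():
        out.write('{} "{}"\n'.format(k, uid))
    out.write("*Arcs\n")
    for source, target in arcs:
        # Raises KeyError if a relationship refers to an unknown user_id.
        out.write("{} {}\n".format(number[source], number[target]))
```

If the user_ids turn out to be exactly 1..n, the numbering is the identity and the relationship rows can be copied unchanged after the `*Arcs` line.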
notes/pokec.txt · Last modified: 2015/07/13 15:29 by vlado
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported