Pokec social network

Pokec is the most popular Slovak online social network. The datasets are anonymized and contain the relationships and user profile data of the whole network. Profile data are in Slovak. Friendships in the Pokec network are directed. The datasets were crawled on May 25-27, 2012.

Author: Lubos Takac,


Nodes ............................  1632803
Edges ............................ 30622564
Nodes in largest WCC .............  1632803 (1.000)
Edges in largest WCC ............. 30622564 (1.000)
Nodes in largest SCC .............  1304537 (0.799)
Edges in largest SCC ............. 29183655 (0.953)
Average clustering coefficient ...   0.1094
Number of triangles .............. 32557458
Fraction of closed triangles .....  0.01611
Diameter (longest shortest path) .       11
90-percentile effective diameter .      5.3
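
The parenthesized values in the WCC and SCC rows appear to be shares of the whole graph. A minimal check in Python, assuming that interpretation (the variable names are illustrative only):

```python
# Counts copied from the statistics table above.
nodes, edges = 1632803, 30622564
scc_nodes, scc_edges = 1304537, 29183655

# Assumed meaning of the parenthesized values: share of the whole graph.
scc_node_share = round(scc_nodes / nodes, 3)
scc_edge_share = round(scc_edges / edges, 3)

# Average number of directed friendships per user (edges / nodes).
avg_out_degree = edges / nodes

print(scc_node_share, scc_edge_share, round(avg_out_degree, 2))
```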

The data file “soc-pokec-profiles.txt.gz” is large. To unzip it I installed gzip. TextPad and other text editors were not able to browse or edit the unzipped file “soc-pokec-profiles.txt”. I further installed another editor (trial version), which works nicely.
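
If the goal is only to look at the first few rows, the file can also be inspected without unzipping it or opening it in an editor. A minimal sketch in Python, assuming the file is UTF-8 encoded (as in the Python script further below); the function name head_gz is hypothetical:

```python
import gzip
from itertools import islice

def head_gz(path, n=5, encoding="utf-8"):
    """Return the first n lines of a gzip-compressed text file."""
    # gzip.open in text mode decompresses on the fly, so the whole
    # file is never held in memory or written to disk.
    with gzip.open(path, "rt", encoding=encoding, newline="") as f:
        return [line.rstrip("\n") for line in islice(f, n)]

# Example call (file name taken from the text above):
# for line in head_gz("soc-pokec-profiles.txt.gz", 3):
#     print(line[:200])
```

Only the requested lines are read, so the call returns quickly even for a very large file.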

Conversion to Pajek

I first processed the data in R on a 64-bit notebook with 8 GB of memory.

> setwd("D:/Data/SNAP")
> d <- read.delim("soc-pokec-profiles.txt",header=FALSE,sep="\t")
> head(d)

I added the column names:

colNames <- c(
 "more" )


> dim(d)
[1] 1062701      60
> colnames(d) <- colNames
> write.table(d,"pokec.csv",sep=";")
> e <- d[,c(1:37,39:43)]
> dim(e)
[1] 1062701      42
> write.table(e,"pikec.csv",sep=";")

There is a problem: in R we get only 1062701 rows, although the network has 1632803 nodes. I was not able to find the error in the R commands or in the input files, so I decided to make the selection of variables in Python.
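
One way to narrow down such a discrepancy is to count the rows independently of R. The sketch below counts the physical lines of the file and, separately, the records parsed with quote handling switched off, together with the number of fields per record. Unbalanced quote characters are a common cause of merged rows in tab-separated imports, but whether that is what happened here has not been established. The function name count_records is hypothetical.

```python
import csv
from collections import Counter

def count_records(path, encoding="utf-8"):
    """Count physical lines and tally fields per parsed record."""
    # Physical lines in the file.
    with open(path, newline="", encoding=encoding) as f:
        n_lines = sum(1 for _ in f)
    # Parsed records: quoting is disabled, so a stray '"' cannot
    # make the reader join several lines into one record.
    widths = Counter()
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            widths[len(row)] += 1
    return n_lines, widths

# Example call:
# n, widths = count_records("soc-pokec-profiles.txt")
# print(n, widths.most_common(5))
```

If the line count matches the number of nodes and nearly all records have the same width, the input file itself is probably consistent and the loss happens during import.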

Selection of variables

import csv, sys, os
podatki = "soc-pokec-profiles.txt"
cns = [ "user_id", "public", "completion_percentage", "gender", "region",
 "last_login", "registration", "age", "body", "I_am_working_in_field",
 "spoken_languages", "hobbies", "I_most_enjoy_good_food", "pets",
 "body_type", "my_eyesight", "eye_color", "hair_color", "hair_type",
 "completed_level_of_education", "favourite_color", "relation_to_smoking",
 "relation_to_alcohol", "sign_in_zodiac", "on_pokec_i_am_looking_for",
 "love_is_for_me", "relation_to_casual_sex", "my_partner_should_be",
 "marital_status", "children", "relation_to_children", "I_like_movies",
 "I_like_watching_movie", "I_like_music", "I_mostly_like_listening_to_music",
 "the_idea_of_good_evening", "I_like_specialties_from_kitchen", "fun",
 "I_am_going_to_concerts", "my_active_sports", "my_passive_sports",
 "profession", "I_like_books", "life_style", "music", "cars", "politics",
 "relationships", "art_culture", "hobbies_interests", "science_technologies",
 "computers_internet", "education", "sport", "movies", "travelling",
 "health", "companies_brands", "more" ]
with open(podatki,newline='',encoding='utf-8') as dat,\
     open('pokec1.csv','w',newline='',encoding='utf-8') as lst:
  datReader = csv.reader(dat,delimiter='\t',quotechar='"')
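  lstWriter = csv.writer(lst,delimiter=';',quotechar='|',
    quoting=csv.QUOTE_MINIMAL)
  # ASSUMPTION: the original selection of variables is not shown here.
  # These 40 columns are consistent with the 40-column pokec1.csv read
  # in R below, but the two columns kept at positions 35-36 are a guess.
  keep = list(range(0,36)) + list(range(38,42))
  # header row; the file is later read in R with header=TRUE
  lstWriter.writerow([cns[i] for i in keep])
  n = 0
  try:
    for row in datReader:
      n = n+1
      if (n % 10000) == 0: print(n)   # progress report
      lstWriter.writerow([row[i] for i in keep])
  except csv.Error as e:
    sys.exit('file {}, line {}: {}'.format(
      podatki, datReader.line_num, e))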

The rest was done in R.


> setwd("D:/Data/SNAP/pokec")
> sel <- c(rep("character",11),rep("NULL",29))
> system.time({d <- read.csv("pokec1.csv",header=TRUE,sep=";",colClasses=sel,
+ na.strings=c("null"),fill=TRUE,quote="|",comment.char="",flush=TRUE)})
   user  system elapsed 
 292.17    2.75  317.88 
> dim(d)
[1] 1632803      11
> summary(d)
user_id        public                  completion_percentage    gender         
region         last_login              registration             age           
body           I_am_working_in_field   spoken_languages  
> id <- as.integer(d$user_id)
> public <- as.integer(d$public)
> table(public)
      0       1 
 552525 1080278 
> complete <- as.numeric(d$completion_percentage) 
> summary(complete)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   12.00   41.00   39.79   64.00  100.00 
> gender <- as.integer(d$gender)
> table(gender)
     0      1 
828304 804336 
> reg <- d$region
> head(reg)
[1] "zilinsky kraj, zilina"                      
[2] "zilinsky kraj, kysucke nove mesto"          
[3] "zilinsky kraj, kysucke nove mesto"          
[4] "bratislavsky kraj, bratislava - karlova ves"
[5] "banskobystricky kraj, brezno"               
[6] "zilinsky kraj, martin"                      
> i <- 1:length(reg)
> length(i)
[1] 1632803
> s <- unlist(strsplit(reg,', '))
> county <- s[2*i-1]
> place <- s[2*i]
> head(county)
[1] "zilinsky kraj"        "zilinsky kraj"        "zilinsky kraj"       
[4] "bratislavsky kraj"    "banskobystricky kraj" "zilinsky kraj"       
> pla <- as.factor(place)
> age <- as.integer(d$age)
> summary(age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    0.00   19.00   17.07   26.00  112.00  163.00 
> login <- as.Date(d$last_login)
> register <- as.Date(d$registration)
> body <- d$body
> work <- d$I_am_working_in_field
> lang <- d$spoken_languages
> save(id,public,complete,gender,reg,age,login,register,body,work,lang,file="pokecA.RData")  

pokecA.RData (34 M)
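
The split of region into county and place above relies on every value containing exactly one ', ' separator; a missing or differently formatted value would shift all later entries of the unlisted vector. Whether such values occur in this data was not checked here. A minimal sketch in Python of a per-row split that keeps the rows aligned; the function name split_region is hypothetical:

```python
def split_region(region):
    """Split 'county, place' into (county, place); tolerate missing parts."""
    if region is None:
        return (None, None)
    # partition splits at the first ', ' only and never raises.
    county, sep, place = region.partition(", ")
    return (county, place if sep else None)

regions = ["zilinsky kraj, zilina",
           "bratislavsky kraj, bratislava - karlova ves",
           None]
pairs = [split_region(r) for r in regions]
print(pairs)
```

Each input row yields exactly one (county, place) pair, so the result always has the same length as the input.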


> setwd("D:/Data/SNAP/pokec")
> sel <- c("character",rep("NULL",10),"character","NULL","character",rep("NULL",2),
+ rep("character",3),"NULL",rep("character",4),rep("NULL",4),"character",
+ rep("NULL",2),rep("character",3),rep("NULL",2),rep("character",4))
> system.time({d <- read.csv("pokec1.csv",header=TRUE,sep=";",colClasses=sel,
+ na.strings=c("null"),fill=TRUE,quote="|",comment.char="",flush=TRUE)})
   user  system elapsed 
 277.35    3.35  285.64 
> dim(d)
[1] 1632803      18     
> save(d,file="pokecB.RData")

pokecB.RData (101 M)
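
The colClasses vector above marks each of the 40 columns of pokec1.csv as read ("character") or skipped ("NULL"). Counting positions by hand with rep() is easy to get wrong; a sketch of deriving the vector from column names instead, in Python. The function names col_classes and as_r_vector are hypothetical, and the short header in the example is illustrative only.

```python
def col_classes(header, wanted):
    """Return one 'character' or 'NULL' entry per column of header."""
    missing = set(wanted) - set(header)
    if missing:
        raise ValueError("unknown columns: {}".format(sorted(missing)))
    keep = set(wanted)
    return ["character" if name in keep else "NULL" for name in header]

def as_r_vector(classes):
    """Format the list as an R c(...) expression for pasting into R."""
    return "c(" + ",".join('"{}"'.format(c) for c in classes) + ")"

# Illustrative header; in practice use the header row of pokec1.csv.
header = ["user_id", "public", "hobbies", "pets"]
print(as_r_vector(col_classes(header, ["user_id", "hobbies", "pets"])))
```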

Cleaning of variables

> setwd("D:/Data/SNAP/pokec")
> load("pokecB.RData")
> objects()
[1] "d"
> dim(d)
[1] 1632803      18
> colnames(d)
 [1] "user_id"                "hobbies"                "pets"                  
 [4] "eye_color"              "hair_color"             "hair_type"             
 [7] "favourite_color"        "relation_to_smoking"    "relation_to_alcohol"   
[10] "sign_in_zodiac"         "marital_status"         "I_like_movies"         
[13] "I_like_watching_movie"  "I_like_music"           "I_am_going_to_concerts"
[16] "my_active_sports"       "my_passive_sports"      "profession"  

> zodiac <- d$sign_in_zodiac
> zod <- substr(zodiac,1,3)
> t <- table(zod)
> sort(t,decreasing=TRUE)[1:30]

Making the network

To produce the network data, the properties should be reordered by user_id.
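
A minimal sketch of such a reordering in Python, assuming that the user ids run from 1 to the number of nodes, so that a user's property can be stored at position user_id - 1 (Pajek numbers vertices from 1). The function name reorder_by_user_id is hypothetical.

```python
def reorder_by_user_id(rows, n_nodes):
    """Place each (user_id, value) pair at index user_id - 1."""
    ordered = [None] * n_nodes          # None marks users without a row
    for user_id, value in rows:
        uid = int(user_id)
        if not 1 <= uid <= n_nodes:
            raise ValueError("user_id out of range: {}".format(uid))
        ordered[uid - 1] = value
    return ordered

# Rows in arbitrary order, as they might appear in the profile file.
rows = [("3", "c"), ("1", "a")]
print(reorder_by_user_id(rows, 3))
```

Users that have no profile row keep the value None, which can later be written as a missing value.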

