Cleaning the data

Amazon new books

New books data (Nov 2017)

First try

> wdir <- "C:/Users/batagelj/Documents/papers/2017/Moscow/EDA/test"
> setwd(wdir)
> T <- read.csv2("newBooks.csv",stringsAsFactors=FALSE)
> dim(T)
[1] 970  15
> nrow(T)
[1] 970
> ncol(T)
[1] 15
> head(T)
  bID     Amazon      bind npag                        pub year    lang wid thi
1   1 0521840856 Hardcover  402 Cambridge University Press 2004 English   6 1.1
2   2 0521387078 Paperback  857 Cambridge University Press 1994 English   6 1.5
3   3 1446247414 Paperback  304      SAGE Publications Ltd 2013 English 7.3 0.7
4   4 0195379470 Paperback  264    Oxford University Press 2011 English 9.2 0.7
5   5 0199206651 Hardcover  720    Oxford University Press 2010 English 9.8 1.7
6   6 1493909827 Paperback  207                   Springer 2014 English 6.1 0.5
  hei   duni weig   wuni   pric
1   9 inches  1.4 pounds 121.52
2   9 inches  2.6 pounds  52.41
3 9.1 inches  1.4 pounds  37.38
4 6.1 inches 12.8 ounces  20.75
5 7.6 inches  4.1 pounds  61.54
6 9.2 inches 11.2 ounces  46.18
                                                                                                                                                                    titl
1        Amazon.com: Generalized Blockmodeling (Structural Analysis in the Social Sciences) (9780521840859): Patrick Doreian, Vladimir Batagelj, Anuska Ferligoj: Books 
2 Amazon.com: Social Network Analysis: Methods and Applications (Structural Analysis in the Social Sciences) (9780521387071): Stanley Wasserman, Katherine Faust: Books 
3                                                Analyzing Social Networks: Stephen P Borgatti, Martin G. Everett, Jeffrey C. Johnson: 9781446247419: Amazon.com: Books 
4                                                   Understanding Social Networks: Theories, Concepts, and Findings: Charles Kadushin: 9780195379471: Amazon.com: Books 
5                                                                                              Networks: An Introduction: Mark Newman: 9780199206650: Amazon.com: Books 
6                                               Amazon.com: Statistical Analysis of Network Data with R (Use R!) (9781493909827): Eric D. Kolaczyk, Gábor Csárdi: Books 
> tail(T)                                                                                                                                                       
> T[c(5,9,333),1:8]
    bID     Amazon      bind npag                                         pub year    lang wid
5     5 0199206651 Hardcover  720                     Oxford University Press 2010 English 9.8
9     9 1473952123 Paperback  248                       SAGE Publications Ltd 2017 English 6.7
333 332 1546640010 Paperback   74 CreateSpace Independent Publishing Platform 2017 English   6
> str(T)
'data.frame':   970 obs. of  15 variables:
 $ bID   : chr  "1" "2" "3" "4" ...
 $ Amazon: chr  "0521840856" "0521387078" "1446247414" "0195379470" ...
 $ bind  : chr  "Hardcover" "Paperback" "Paperback" "Paperback" ...
 $ npag  : int  402 857 304 264 720 207 344 744 248 272 ...
 $ pub   : chr  "Cambridge University Press" "Cambridge University Press" "SAGE Publications Ltd" "Oxford University Press" ...
 $ year  : int  2004 1994 2013 2011 2010 2014 2005 2010 2017 2011 ...
 $ lang  : chr  "English" "English" "English" "English" ...
 $ wid   : chr  "6" "6" "7.3" "9.2" ...
 $ thi   : chr  "1.1" "1.5" "0.7" "0.7" ...
 $ hei   : chr  "9" "9" "9.1" "6.1" ...
 $ duni  : chr  "inches" "inches" "inches" "inches" ...
 $ weig  : chr  "1.4" "2.6" "1.4" "12.8" ...
 $ wuni  : chr  "pounds" "pounds" "pounds" "ounces" ...
 $ pric  : chr  "121.52" "52.41" "37.38" "20.75" ...
 $ titl  : chr  "Amazon.com: Generalized Blockmodeling (Structural Analysis in the Social Sciences) (9780521840859): Patrick Dor"| __truncated__ "Amazon.com: Social Network Analysis: Methods and Applications (Structural Analysis in the Social Sciences) (978"| __truncated__ "Analyzing Social Networks: Stephen P Borgatti, Martin G. Everett, Jeffrey C. Johnson: 9781446247419: Amazon.com: Books " "Understanding Social Networks: Theories, Concepts, and Findings: Charles Kadushin: 9780195379471: Amazon.com: Books " ...

Book IDs

> n_bID <- as.integer(T$bID)
Warning message:
NAs introduced by coercion 
> which(is.na(n_bID))
[1] 142 529 617
> j <- which(is.na(n_bID))
> T$bID[j]
[1] "Python Data Science Handbook: Essential Tools for Working with Data: Jake VanderPlas: 9781491912058: ... 
[2] "Content Rules: How to Create Killer Blogs, Podcasts, Videos, Ebooks, Webinars (and More) That Engage ...
[3] "Using R for Introductory Econometrics: Florian Heiss: 9781523285136: Amazon.com: Books "  

Since there are only three trouble-makers, we decided to look the books up on Amazon and correct the data in a copy of the original data file, newBooksCorr.csv.

> T <- read.csv2("newBooksCorr.csv",stringsAsFactors=FALSE)
> n_bID <- as.integer(T$bID)
> summary(n_bID)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0   242.5   484.0   484.0   725.5   967.0 

Amazon book code

The Amazon codes are strings. We check their lengths.

> L <- nchar(T$Amazon)
> summary(L)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     10      10      10      10      10      10 

Binding

> table(T$bind)
 
 Hardcover    Leather Loose leaf      Other  Paperback     Spiral 
       262          1          7          5        691          1 
> n_bind <- factor(T$bind)

Note: we could also use factor(T$bind,levels=Cover) to fix the order of the levels.
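
For example, a hypothetical sketch, with the vector Cover defined by us (it is not given in the original):

> # hypothetical: list the binding types in the desired order of the levels
> Cover <- c("Paperback","Hardcover","Loose leaf","Other","Spiral","Leather")
> n_bind <- factor(T$bind,levels=Cover)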

> barplot(rev(sort(table(n_bind))))

Number of pages

> summary(T$npag)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
2.600e+01 2.560e+02 3.500e+02 1.603e+06 4.960e+02 1.482e+09        42 
> T$npag[(T$npag>2000)&!is.na(T$npag)]
[1] 1482014092
> which((T$npag>2000)&!is.na(T$npag))
[1] 523

Again, we decided to correct the data in the file newBooksCorr.csv.

> T <- read.csv2("newBooksCorr.csv",stringsAsFactors=FALSE)
> summary(T$npag)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   26.0   256.0   350.0   407.9   496.0  2016.0      42 

Distribution of values:

> boxplot(T$npag)
> hist(T$npag)
> pages <- T$npag[!is.na(T$npag)]
> hist(pages,breaks="Scott",prob=TRUE,ylab="",main="Number of pages")
> lines(density(pages),col="blue",lwd=2)

Publishers

> P <- rev(sort(table(T$pub)))
> length(P)
[1] 241
> P[1:20]
 
                             O'Reilly Media                     Oxford University Press 
                                         47                                          43 
                         Dover Publications                                       Wiley 
                                         40                                          38 
                 Princeton University Press                       McGraw-Hill Education 
                                         35                                          33 
                 Cambridge University Press                      W. W. Norton & Company 
                                         32                                          27 
                                    Pearson                                    Springer 
                                         26                                          24 
                  South-Western College Pub                               The MIT Press 
                                         23                                          19 
CreateSpace Independent Publishing Platform                                   Portfolio 
                                         18                                          17 
                           Simon & Schuster                                     English 
                                         16                                          16 
                                  Routledge                      SAGE Publications, Inc 
                                         15                                          14 
                   Harvard University Press                                 Basic Books 
                                         13                                          13 

There are many values with long names. Are some of them synonyms? Should we group the less frequent publishers into "Others"?
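
A possible sketch of the grouping (hypothetical, not done in the original): keep the 20 most frequent publishers and collect the rest into "Others".

> # hypothetical: collapse the less frequent publishers into a single level
> top <- names(P)[1:20]
> n_pub <- factor(ifelse(T$pub %in% top,T$pub,"Others"))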

Publication years

> summary(T$year)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      1    2008    2013    1977    2015    2018      48 
> (j <- which(T$year<1900))
 [1]  36 129 139 347 407 472 476 484 523 536 565 569 778 799 937
> T$year[j]
 [1] 31 19  3 67 37  2  1  1  6 12  1 48  8  4  1
> n_year <- T$year
> n_year[j] <- NA
> summary(n_year)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1956    2008    2013    2010    2015    2018      63 

Alternatively, we could correct these values directly in the file newBooksCorr.csv.

> years <- n_year[!is.na(n_year)]
> boxplot(years)
> hist(years,breaks="Scott",prob=TRUE,ylab="",main="Years")
> lines(density(years),col="blue",lwd=2)

Language

> L <- table(T$lang)
> as.vector(L)
 [1]   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   3   1   1   1   1   1   1   1 923   1

923 books are in English; the remaining values are errors. This variable is therefore not interesting.

Width, thickness, height and unit

> n_wid <- as.numeric(T$wid)
> n_thi <- as.numeric(T$thi)
Warning message:
NAs introduced by coercion 
> n_hei <- as.numeric(T$hei)
Warning message:
NAs introduced by coercion 
> summary(cbind(n_wid,n_thi,n_hei))
     n_wid            n_thi            n_hei       
 Min.   : 0.500   Min.   : 0.100   Min.   :  0.10  
 1st Qu.: 5.500   1st Qu.: 0.700   1st Qu.:  8.20  
 Median : 6.100   Median : 0.900   Median :  9.00  
 Mean   : 6.469   Mean   : 1.146   Mean   :  8.94  
 3rd Qu.: 7.200   3rd Qu.: 1.200   3rd Qu.:  9.30  
 Max.   :15.500   Max.   :19.700   Max.   :234.00  
 NA's   :41       NA's   :42       NA's   :44      
> which(n_wid>12)
[1] 658 672
> which(n_thi>16)
[1] 508
> which(n_hei>20)
[1] 658 672 851
> d <- table(T$duni)
> as.vector(d)
[1]   2   1 922   1
> d
 
   1.8   13.6 inches ounces 
     2      1    922      1 
> which(T$duni!="inches")
[1] 141 523 527 614

Apart from a few erroneous values, the variable duni is constant (inches). We check units 508, 658, 672 and 851 on Amazon. On the Amazon pages these data have the same values as in the data frame; I suspect that they are actually measured in cm.

> boxplot(cbind(n_wid,n_thi,n_hei))
> boxplot(n_wid)
> boxplot(n_thi)
> boxplot(n_hei)
> h <- n_hei
> h[h>200] <- NA
> boxplot(h)
> pairs(cbind(n_wid,n_thi,h))

Weight and unit

> n_weig <- as.numeric(T$weig)
Warning message:
NAs introduced by coercion 
> summary(n_weig)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
        0         2         3    144564         9 133571777        43 
> w <- table(T$wuni)
> as.vector(w)
[1]   1   1   1   1   1 392 530
> w
 
  11.33    26.90    46.54    500 Social Media ...    8    ounces    pounds 
      1        1        1                       1    1       392       530

1 pound (lb) = 16 ounces (oz), so we convert the weights given in ounces to pounds:

> i <- which((T$wuni=="ounces")&!is.na(n_weig))
> n_weig[i] <- n_weig[i] / 16
> summary(n_weig)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
        0         1         1    144560         2 133571777        43 
> which(n_weig>1000)
[1] 98
> n_weig[98] <- NA
> summary(n_weig)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0125  0.7000  1.2000  1.3937  1.8000  6.4000      44 
> boxplot(n_weig)
> pairs(cbind(n_wid,n_thi,n_hei,n_weig))

Price

> n_pric <- as.numeric(T$pric)
Warning message:
NAs introduced by coercion 
> summary(n_pric)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   4.95   20.20   38.20   56.95   71.59  292.95     404 
> boxplot(n_pric)
> prices <- n_pric[!is.na(n_pric)]
> hist(prices,breaks="Scott",prob=TRUE,ylab="",main="Prices")
> lines(density(prices),col="blue",lwd=2)

The values seem reasonable, but there are too many NAs.

We should either correct the data-collection program and recollect the data, or restrict the analysis to units with a complete description.
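
For the second option, a minimal sketch (hypothetical; it uses the numeric vectors created above):

> # hypothetical: keep only the books with a complete numeric description
> ok <- complete.cases(T$npag,n_year,n_wid,n_thi,n_hei,n_weig,n_pric)
> Tcomp <- T[ok,]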

Title

From the titles we should remove the Amazon boilerplate: the substrings "Amazon.com:" and ": Books".
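
A possible sketch of this cleanup (hypothetical, not part of the original; the titles still contain author names and the ISBN afterwards):

> # hypothetical sketch: strip the Amazon boilerplate from the titles
> n_titl <- trimws(T$titl)
> n_titl <- gsub("Amazon\\.com: ","",n_titl)
> n_titl <- gsub(": Books$","",n_titl)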

Additional check

As an additional check, we can count the number of ";" separators in each line of the file newBooksLec.csv.

> D <- readLines("newBooksLec.csv")
> L <- lengths(regmatches(D,gregexpr(";",D)))
> table(L)
L
 14 
968 

All lines have 15 fields, as expected.

We can also check the consistency of the data by looking at the density of the books, which should not vary much.

> den <- n_weig/(n_wid*n_thi*n_hei)
> boxplot(den)
> (i <- which((den>0.1)&!is.na(den)))
[1] 302 422 736 745
> clean[i,c("wid","thi","hei","weig")]
    wid thi hei   weig
302 6.3 0.9 0.3 0.6000
422 7.3 5.0 0.1 0.5000
736 6.7 4.4 0.4 2.6000
745 5.6 1.3 0.3 0.7875

All four books have a very small height.
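
A possible in-R alternative (hypothetical; the original instead corrects the source file, as described below):

> # hypothetical: mark the suspicious heights as missing
> n_hei[i] <- NA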

Make a clean data frame

I made some additional corrections in the file newBooksLec.csv and repeated the essential commands to create the clean data frame:

> T <- read.csv2("newBooksLec.csv",stringsAsFactors=FALSE)
> n_bID <- as.integer(T$bID)
> summary(n_bID)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0   242.2   483.5   483.5   724.8   966.0 
> L <- nchar(T$Amazon)
> summary(L)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     10      10      10      10      10      10 
> table(T$bind)
 
 Hardcover    Leather Loose leaf      Other  Paperback     Spiral 
       262          1          7          4        691          1 
> n_bind <- factor(T$bind)
> summary(T$npag)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   26.0   256.0   347.0   406.0   494.8  1792.0      40 
> summary(T$year)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      1    2008    2013    1979    2015    2018      44 
> (j <- which(T$year<1900))
 [1]  36 129 139 347 407 472 476 484 536 565 569 778 799 937
> n_year <- T$year
> n_year[j] <- NA
> summary(n_year)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1956    2008    2013    2010    2015    2018      58 
> n_wid <- as.numeric(T$wid)
> n_thi <- as.numeric(T$thi)
> n_hei <- as.numeric(T$hei)
> summary(cbind(n_wid,n_thi,n_hei))
     n_wid            n_thi            n_hei      
 Min.   : 0.500   Min.   : 0.100   Min.   : 0.10  
 1st Qu.: 5.500   1st Qu.: 0.700   1st Qu.: 8.20  
 Median : 6.100   Median : 0.900   Median : 9.00  
 Mean   : 6.486   Mean   : 1.117   Mean   : 8.71  
 3rd Qu.: 7.200   3rd Qu.: 1.100   3rd Qu.: 9.30  
 Max.   :15.500   Max.   :19.700   Max.   :23.40  
 NA's   :39       NA's   :39       NA's   :39     
> which(n_wid>12)
[1] 658 672
> which(n_thi>16)
[1] 508
> which(n_hei>20)
[1] 658 672 851
> n_weig <- as.numeric(T$weig)
> i <- which((T$wuni=="ounces")&!is.na(n_weig))
> n_weig[i] <- n_weig[i] / 16
> summary(n_weig)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0125  0.7000  1.2000  1.3924  1.8000  6.4000      39 
> head(T$weig)
[1] "1.4"  "2.6"  "1.4"  "12.8" "4.1"  "11.2"
> head(n_weig)
[1] 1.4 2.6 1.4 0.8 4.1 0.7
> n_pric <- as.numeric(T$pric)
> summary(n_pric)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   4.95   20.00   38.00   56.61   71.23  292.95     397 
> clean <- data.frame(bID=n_bID,Amazon=T$Amazon,bind=n_bind,npag=T$npag,year=n_year,pub=T$pub,
+ wid=n_wid,thi=n_thi,hei=n_hei,weig=n_weig,pric=n_pric,titl=T$titl) 
> dim(clean)
[1] 966  12
> write.csv2(clean,file="newBooksClean.csv",row.names=FALSE)

Corrected raw data; Clean data
