New books data (Nov 2017)
> wdir <- "C:/Users/batagelj/Documents/papers/2017/Moscow/EDA/test"
> setwd(wdir)
> T <- read.csv2("newBooks.csv",stringsAsFactors=FALSE)
> dim(T)
[1] 970  15
> nrow(T)
[1] 970
> ncol(T)
[1] 15
> head(T)
  bID     Amazon      bind npag                        pub year    lang wid thi
1   1 0521840856 Hardcover  402 Cambridge University Press 2004 English   6 1.1
2   2 0521387078 Paperback  857 Cambridge University Press 1994 English   6 1.5
3   3 1446247414 Paperback  304      SAGE Publications Ltd 2013 English 7.3 0.7
4   4 0195379470 Paperback  264    Oxford University Press 2011 English 9.2 0.7
5   5 0199206651 Hardcover  720    Oxford University Press 2010 English 9.8 1.7
6   6 1493909827 Paperback  207                   Springer 2014 English 6.1 0.5
  hei   duni weig   wuni   pric
1   9 inches  1.4 pounds 121.52
2   9 inches  2.6 pounds  52.41
3 9.1 inches  1.4 pounds  37.38
4 6.1 inches 12.8 ounces  20.75
5 7.6 inches  4.1 pounds  61.54
6 9.2 inches 11.2 ounces  46.18
  titl
1 Amazon.com: Generalized Blockmodeling (Structural Analysis in the Social Sciences) (9780521840859): Patrick Doreian, Vladimir Batagelj, Anuska Ferligoj: Books
2 Amazon.com: Social Network Analysis: Methods and Applications (Structural Analysis in the Social Sciences) (9780521387071): Stanley Wasserman, Katherine Faust: Books
3 Analyzing Social Networks: Stephen P Borgatti, Martin G. Everett, Jeffrey C. Johnson: 9781446247419: Amazon.com: Books
4 Understanding Social Networks: Theories, Concepts, and Findings: Charles Kadushin: 9780195379471: Amazon.com: Books
5 Networks: An Introduction: Mark Newman: 9780199206650: Amazon.com: Books
6 Amazon.com: Statistical Analysis of Network Data with R (Use R!) (9781493909827): Eric D. Kolaczyk, Gábor Csárdi: Books
> tail(T)
> T[c(5,9,333),1:8]
    bID     Amazon      bind npag                                         pub year    lang wid
5     5 0199206651 Hardcover  720                     Oxford University Press 2010 English 9.8
9     9 1473952123 Paperback  248                       SAGE Publications Ltd 2017 English 6.7
333 332 1546640010 Paperback   74 CreateSpace Independent Publishing Platform 2017 English   6
> str(T)
'data.frame':   970 obs. of  15 variables:
 $ bID   : chr  "1" "2" "3" "4" ...
 $ Amazon: chr  "0521840856" "0521387078" "1446247414" "0195379470" ...
 $ bind  : chr  "Hardcover" "Paperback" "Paperback" "Paperback" ...
 $ npag  : int  402 857 304 264 720 207 344 744 248 272 ...
 $ pub   : chr  "Cambridge University Press" "Cambridge University Press" "SAGE Publications Ltd" "Oxford University Press" ...
 $ year  : int  2004 1994 2013 2011 2010 2014 2005 2010 2017 2011 ...
 $ lang  : chr  "English" "English" "English" "English" ...
 $ wid   : chr  "6" "6" "7.3" "9.2" ...
 $ thi   : chr  "1.1" "1.5" "0.7" "0.7" ...
 $ hei   : chr  "9" "9" "9.1" "6.1" ...
 $ duni  : chr  "inches" "inches" "inches" "inches" ...
 $ weig  : chr  "1.4" "2.6" "1.4" "12.8" ...
 $ wuni  : chr  "pounds" "pounds" "pounds" "ounces" ...
 $ pric  : chr  "121.52" "52.41" "37.38" "20.75" ...
 $ titl  : chr  "Amazon.com: Generalized Blockmodeling (Structural Analysis in the Social Sciences) (9780521840859): Patrick Dor"| __truncated__ "Amazon.com: Social Network Analysis: Methods and Applications (Structural Analysis in the Social Sciences) (978"| __truncated__ "Analyzing Social Networks: Stephen P Borgatti, Martin G. Everett, Jeffrey C. Johnson: 9781446247419: Amazon.com: Books " "Understanding Social Networks: Theories, Concepts, and Findings: Charles Kadushin: 9780195379471: Amazon.com: Books " ...
> n_bID <- as.integer(T$bID)
Warning message:
NAs introduced by coercion
> which(is.na(n_bID))
[1] 142 529 617
> j <- which(is.na(n_bID))
> T$bID[j]
[1] "Python Data Science Handbook: Essential Tools for Working with Data: Jake VanderPlas: 9781491912058: ...
[2] "Content Rules: How to Create Killer Blogs, Podcasts, Videos, Ebooks, Webinars (and More) That Engage ...
[3] "Using R for Introductory Econometrics: Florian Heiss: 9781523285136: Amazon.com: Books "
Since there are only three trouble-makers, we decided to look them up on Amazon and correct the data in a copy of the original data file, newBooksCorr.csv.
> T <- read.csv2("newBooksCorr.csv",stringsAsFactors=FALSE)
> n_bID <- as.integer(T$bID)
> summary(n_bID)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0   242.5   484.0   484.0   725.5   967.0 
Strings. We check their lengths.
> L <- nchar(T$Amazon)
> summary(L)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     10      10      10      10      10      10 
> table(T$bind)
 Hardcover    Leather Loose leaf      Other  Paperback     Spiral 
       262          1          7          5        691          1 
> n_bind <- factor(T$bind)
Note: we could also use factor(T$bind,levels=Cover), where Cover is a vector listing the binding types in the desired order.
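The vector Cover is not defined in the text; a minimal sketch of what such a call could look like (the name Cover, the level order, and the toy data are our own assumptions):

```r
# hypothetical vector fixing the order of binding types ("Cover" is our own name)
Cover <- c("Hardcover","Paperback","Spiral","Loose leaf","Leather","Other")
bind  <- c("Paperback","Hardcover","Paperback","Ebook")   # toy data
n_bind <- factor(bind, levels=Cover)
# a value not listed among the levels becomes NA -- a cheap validity check
table(n_bind, useNA="ifany")
```

Fixing the levels also fixes the order of the bars in barplot(table(n_bind)).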
> barplot(rev(sort(table(n_bind))))
> summary(T$npag)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
2.600e+01 2.560e+02 3.500e+02 1.603e+06 4.960e+02 1.482e+09        42 
> T$npag[(T$npag>2000)&!is.na(T$npag)]
[1] 1482014092
> which((T$npag>2000)&!is.na(T$npag))
[1] 523
Again, we decided to correct the data in the file newBooksCorr.csv.
> T <- read.csv2("newBooksCorr.csv",stringsAsFactors=FALSE)
> summary(T$npag)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   26.0   256.0   350.0   407.9   496.0  2016.0      42 
Distribution of values:
> boxplot(T$npag)
> hist(T$npag)
> pages <- T$npag[!is.na(T$npag)]
> hist(pages,breaks="Scott",prob=TRUE,ylab="",main="Number of pages")
> lines(density(pages),col="blue",lwd=2)
> P <- rev(sort(table(T$pub)))
> length(P)
[1] 241
> P[1:20]
                             O'Reilly Media                     Oxford University Press 
                                         47                                          43 
                         Dover Publications                                       Wiley 
                                         40                                          38 
                 Princeton University Press                       McGraw-Hill Education 
                                         35                                          33 
                 Cambridge University Press                      W. W. Norton & Company 
                                         32                                          27 
                                    Pearson                                    Springer 
                                         26                                          24 
                  South-Western College Pub                               The MIT Press 
                                         23                                          19 
CreateSpace Independent Publishing Platform                                   Portfolio 
                                         18                                          17 
                           Simon & Schuster                                     English 
                                         16                                          16 
                                  Routledge                      SAGE Publications, Inc 
                                         15                                          14 
                   Harvard University Press                                 Basic Books 
                                         13                                          13 
Many values, long names. Are some of them synonyms? Should we group the less important publishers into Others?
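One possible way to group the less frequent publishers into Others (a sketch; the helper name lumpPub and the cut-off k are our own choices):

```r
# keep the k most frequent values, replace all the others by "Others"
lumpPub <- function(pub, k=20) {
  keep <- names(sort(table(pub), decreasing=TRUE))[1:k]
  ifelse(pub %in% keep, pub, "Others")
}
# toy example; on the real data we could run: n_pub <- lumpPub(T$pub)
pub <- c("Wiley","Wiley","Springer","Tiny Press")
table(lumpPub(pub, k=2))
```

Detecting synonyms (e.g. slightly different spellings of the same publisher) would still require inspecting the names, e.g. after sorting them alphabetically.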
> summary(T$year)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      1    2008    2013    1977    2015    2018      48 
> (j <- which(T$year<1900))
 [1]  36 129 139 347 407 472 476 484 523 536 565 569 778 799 937
> T$year[j]
 [1] 31 19  3 67 37  2  1  1  6 12  1 48  8  4  1
> n_year <- T$year
> n_year[j] <- NA
> summary(n_year)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1956    2008    2013    2010    2015    2018      63 
We could, instead, also correct the data in the file newBooksCorr.csv.
> years <- n_year[!is.na(n_year)]
> boxplot(years)
> hist(years,breaks="Scott",prob=TRUE,ylab="",main="Years")
> lines(density(years),col="blue",lwd=2)
> L <- table(T$lang)
> as.vector(L)
 [1]   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   3   1   1   1   1   1   1   1 923   1
923 books are in English; the remaining values are errors. This variable carries no useful information.
> n_wid <- as.numeric(T$wid)
> n_thi <- as.numeric(T$thi)
Warning message:
NAs introduced by coercion
> n_hei <- as.numeric(T$hei)
Warning message:
NAs introduced by coercion
> summary(cbind(n_wid,n_thi,n_hei))
     n_wid            n_thi            n_hei       
 Min.   : 0.500   Min.   : 0.100   Min.   :  0.10  
 1st Qu.: 5.500   1st Qu.: 0.700   1st Qu.:  8.20  
 Median : 6.100   Median : 0.900   Median :  9.00  
 Mean   : 6.469   Mean   : 1.146   Mean   :  8.94  
 3rd Qu.: 7.200   3rd Qu.: 1.200   3rd Qu.:  9.30  
 Max.   :15.500   Max.   :19.700   Max.   :234.00  
 NA's   :41       NA's   :42       NA's   :44      
> which(n_wid>12)
[1] 658 672
> which(n_thi>16)
[1] 508
> which(n_hei>20)
[1] 658 672 851
> d <- table(T$duni)
> as.vector(d)
[1]   2   1 922   1
> d
   1.8   13.6 inches ounces 
     2      1    922      1 
> which(T$duni!="inches")
[1] 141 523 527 614
The variable duni should be constant, with value inches. We check the dimension outliers, units 508, 658, 672, 851, on Amazon. The Amazon pages show the same values as the data frame – I suspect that these dimensions are actually given in cm.
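If this suspicion is right, the offending dimensions could be converted from cm to inches (a sketch on toy data; the threshold of 20 and the cm assumption are ours, not verified):

```r
# assume heights above 20 inches are really centimetres; 1 inch = 2.54 cm
cm2in <- function(x) x / 2.54
hei <- c(9.0, 23.4, 8.5)   # toy heights; 23.4 looks like a cm value
i <- which(hei > 20)
hei[i] <- cm2in(hei[i])
round(hei, 2)
```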
> boxplot(cbind(n_wid,n_thi,n_hei))
> boxplot(n_wid)
> boxplot(n_thi)
> boxplot(n_hei)
> h <- n_hei
> h[h>200] <- NA
> boxplot(h)
> pairs(cbind(n_wid,n_thi,h))
> n_weig <- as.numeric(T$weig)
Warning message:
NAs introduced by coercion
> summary(n_weig)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
        0         2         3    144564         9 133571777        43 
> w <- table(T$wuni)
> as.vector(w)
[1]   1   1   1   1   1 392 530
> w
            11.33             26.90             46.54 500 Social Media ... 
                1                 1                 1                    1 
                8            ounces            pounds 
                1               392               530 
Note that 1 pound (lb) = 16 ounces (oz), so we convert the weights given in ounces to pounds.
> i <- which((T$wuni=="ounces")&!is.na(n_weig))
> n_weig[i] <- n_weig[i] / 16
> summary(n_weig)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
        0         1         1    144560         2 133571777        43 
> which(n_weig>1000)
[1] 98
> n_weig[98] <- NA
> summary(n_weig)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0125  0.7000  1.2000  1.3937  1.8000  6.4000      44 
> boxplot(n_weig)
> pairs(cbind(n_wid,n_thi,n_hei,n_weig))
> n_pric <- as.numeric(T$pric)
Warning message:
NAs introduced by coercion
> summary(n_pric)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   4.95   20.20   38.20   56.95   71.59  292.95     404 
> boxplot(n_pric)
> prices <- n_pric[!is.na(n_pric)]
> hist(prices,breaks="Scott",prob=TRUE,ylab="",main="Prices")
> lines(density(prices),col="blue",lwd=2)
The values seem reasonable, but there are too many NAs. We should either correct the collecting program and recollect the data, or restrict the analysis to units with a complete description.
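The second option, restricting the analysis to completely described units, can be sketched with complete.cases (shown here on toy data; on the real data frame it would be applied to the relevant columns of T):

```r
# keep only the rows without any NA in the selected variables
d <- data.frame(npag=c(402, NA, 304), pric=c(121.52, 52.41, NA))
full <- d[complete.cases(d), ]
nrow(full)   # only the first row is complete
```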
From the variable titl we should remove the substrings ": Amazon" and ": Books".
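A sketch of this clean-up with gsub (fixed=TRUE treats the patterns literally, avoiding regular-expression surprises; only the suffix and prefix forms visible in the sample titles are handled here):

```r
# toy title taken from the head(T) output above
titl <- "Networks: An Introduction: Mark Newman: 9780199206650: Amazon.com: Books "
t1 <- gsub(": Amazon.com: Books", "", titl, fixed=TRUE)   # suffix form
t1 <- gsub("Amazon.com: ", "", t1, fixed=TRUE)            # prefix form
trimws(t1)
```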
As an additional check we can count the number of ";" separators in each line of the file newBooksLec.csv.
> D <- readLines("newBooksLec.csv")
> L <- lengths(regmatches(D,gregexpr(";",D)))
> table(L)
L
 14 
968 
All lines have 15 fields, as expected.
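The same check can be done more directly with count.fields, which counts the fields per line itself; on the real data this would be table(count.fields("newBooksLec.csv", sep=";")). A self-contained sketch on a small temporary file:

```r
# count.fields reports the number of ";"-separated fields in each line
tmp <- tempfile(fileext=".csv")
writeLines(c("a;b;c","1;2;3","4;5;6"), tmp)
table(count.fields(tmp, sep=";"))
```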
We can also check the consistency of the data by observing the density of the books – it should not vary a lot.
> den <- n_weig/(n_wid*n_thi*n_hei)
> boxplot(den)
> (i <- which((den>0.1)&!is.na(den)))
[1] 302 422 736 745
> clean[i,c("wid","thi","hei","weig")]
    wid thi hei   weig
302 6.3 0.9 0.3 0.6000
422 7.3 5.0 0.1 0.5000
736 6.7 4.4 0.4 2.6000
745 5.6 1.3 0.3 0.7875
All four books have a very small height.
I made some additional corrections in the file newBooksLec.csv, and repeated the essential commands to create the clean data frame:
> T <- read.csv2("newBooksLec.csv",stringsAsFactors=FALSE)
> n_bID <- as.integer(T$bID)
> summary(n_bID)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0   242.2   483.5   483.5   724.8   966.0 
> L <- nchar(T$Amazon)
> summary(L)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     10      10      10      10      10      10 
> table(T$bind)
 Hardcover    Leather Loose leaf      Other  Paperback     Spiral 
       262          1          7          4        691          1 
> n_bind <- factor(T$bind)
> summary(T$npag)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   26.0   256.0   347.0   406.0   494.8  1792.0      40 
> summary(T$year)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      1    2008    2013    1979    2015    2018      44 
> (j <- which(T$year<1900))
 [1]  36 129 139 347 407 472 476 484 536 565 569 778 799 937
> n_year <- T$year
> n_year[j] <- NA
> summary(n_year)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1956    2008    2013    2010    2015    2018      58 
> n_wid <- as.numeric(T$wid)
> n_thi <- as.numeric(T$thi)
> n_hei <- as.numeric(T$hei)
> summary(cbind(n_wid,n_thi,n_hei))
     n_wid            n_thi            n_hei      
 Min.   : 0.500   Min.   : 0.100   Min.   : 0.10  
 1st Qu.: 5.500   1st Qu.: 0.700   1st Qu.: 8.20  
 Median : 6.100   Median : 0.900   Median : 9.00  
 Mean   : 6.486   Mean   : 1.117   Mean   : 8.71  
 3rd Qu.: 7.200   3rd Qu.: 1.100   3rd Qu.: 9.30  
 Max.   :15.500   Max.   :19.700   Max.   :23.40  
 NA's   :39       NA's   :39       NA's   :39     
> which(n_wid>12)
[1] 658 672
> which(n_thi>16)
[1] 508
> which(n_hei>20)
[1] 658 672 851
> n_weig <- as.numeric(T$weig)
> i <- which((T$wuni=="ounces")&!is.na(n_weig))
> n_weig[i] <- n_weig[i] / 16
> summary(n_weig)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0125  0.7000  1.2000  1.3924  1.8000  6.4000      39 
> head(T$weig)
[1] "1.4"  "2.6"  "1.4"  "12.8" "4.1"  "11.2"
> head(n_weig)
[1] 1.4 2.6 1.4 0.8 4.1 0.7
> n_pric <- as.numeric(T$pric)
> summary(n_pric)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   4.95   20.00   38.00   56.61   71.23  292.95     397 
> clean <- data.frame(bID=n_bID,Amazon=T$Amazon,bind=n_bind,npag=T$npag,year=n_year,pub=T$pub,
+   wid=n_wid,thi=n_thi,hei=n_hei,weig=n_weig,pric=n_pric,titl=T$titl)
> dim(clean)
[1] 966  12
> write.csv2(clean,file="newBooksClean.csv",row.names=FALSE)