====== Analysis: years and days ======

[[notes:imfm:corona:s2orcmeta|S2ORC metadata networks]]

===== Years =====

There were some problems with reading the file ''years.tmp'' using ''read.table''. I added the header line to the data file and used the switch ''fill=TRUE''.

  > wdir <- "C:/Users/batagelj/Documents/2020/corona/MetaTit"
  > setwd(wdir)
  > Y <- read.table("years.tmp",header=TRUE,sep=" ",colClasses="character",fill=TRUE)
  > head(Y)
    idx       id       date  p1 p2 p3 p4
  1  1: ug7v899j 2001-07-04 PMC
  2  2: 02tnwd4m 2000-08-15 PMC
  3  3: ejv2xln0 2000-08-25 PMC
  4  4: 2b73a28n 2001-02-22 PMC
  5  5: 9785vg6d 2001-05-11 PMC
  6  6: zjufx4fo 2001-12-17 PMC
  > D <- Y$date
  > length(D)
  [1] 375102
  > first <- function(x) return(strsplit(x,"-")[[1]][1])
  > y <- as.integer(sapply(Y$date,first))
  > length(y)
  [1] 375102
  > t <- table(y)
  > t
  y
    1825   1870   1874   1884   1885   1887   1890   1891   1894   1899   1902
       1      1      1      1      2      1      2      1      1      2      2
    1903   1906   1916   1918   1919   1920   1922   1925   1926   1927   1931
       1      1      1      2      1      2      1      1      1      1      2
    1936   1940   1941   1942   1947   1948   1950   1951   1952   1953   1954
       1      1      2      2      2      1      4      7      4      1      1
    1955   1956   1957   1958   1959   1960   1961   1962   1963   1964   1965
       7      1      7      1      3      1      1      4      3      8      6
    1967   1968   1969   1970   1971   1972   1973   1974   1975   1976   1977
       5      8     13     31     21     29     40     29     51     51     64
    1978   1979   1980   1981   1982   1983   1984   1985   1986   1987   1988
      76     82     96    149    114    117    171    155    181    223    196
    1989   1990   1991   1992   1993   1994   1995   1996   1997   1998   1999
     222    317    268    290    294    246    314    289    292    386    424
    2000   2001   2002   2003   2004   2005   2006   2007   2008   2009   2010
     540    618   1088   1917   2837   2827   3131   2966   3614   4154   4093
    2011   2012   2013   2014   2015   2016   2017   2018   2019   2020   2021
    4630   5074   6143   6841   7405   7842   7317   7227   8470 280456    173
  > t[45]
  1967
     5
  > ny <- length(t)
  > ny
  [1] 99
  > years <- as.integer(names(t)[45:ny])
  > freqs <- as.vector(t[45:ny])
  > plot(years,freqs,pch=16)
  > plot(years,freqs,pch=16,ylim=c(0,8500))

The year 2020 (freq = 280456) is outside the picture.

{{notes:imfm:corona:pics:years.png}}

===== Days in 2020 =====

  > start <- as.integer(as.Date("2019-12-31"))
  > numDays <- function(d){
  +   tryCatch(
  +     return(as.integer(as.Date(d))-start),
  +     error=function(ermsg) return(0)
  +   )
  + }
  > nDays <- sapply(Y$date,numDays)
  > days <- nDays[nDays>0]
  > td <- table(days)
  > nd <- length(td)
  > nd
  [1] 373
  > td[nd-25]
  355
    1
  > f <- as.vector(td[1:(nd-25)])
  > d <- as.integer(names(td)[1:(nd-25)])
  > plot(d,f,pch=16,cex=0.5,type="l")

{{notes:imfm:corona:pics:days2020.png}}

The peaks probably fall on the first day of each month. There is also a weekly pattern.

===== Time series visualization =====

  > d20 <- Y$date[nDays>0]
  > length(d20)
  [1] 139613
  > t20 <- table(d20)
  > data <- data.frame(time=as.Date(names(t20)),freq=as.vector(t20))
  > str(data)
  'data.frame':   373 obs. of  2 variables:
   $ time: Date, format: "2020-01-01" "2020-01-02" ...
   $ freq: int  103 43 42 8 2 35 78 24 33 38 ...
  > plot(data)
  > library(ggplot2)
  > p <- ggplot(data, aes(x=time, y=freq)) +
  +   geom_line() +
  +   xlab("") +
  +   scale_x_date(limits=c(as.Date("2020-01-01"),as.Date("2020-12-11")),
  +     date_breaks = "1 month",date_labels = "%b %Y")
  > p

{{notes:imfm:corona:pics:gdays2020b.png}}

===== Partitions =====

The obtained partitions
  * ''y'' - year of publication (0 means unknown)
  * ''nDays'' - number of days from 2019-12-31 (only positive values are of interest)
were saved to the files ''years.clu'' and ''days.vec'' (a vector file is used for ''nDays'' because of its negative values):

  > source("https://raw.githubusercontent.com/bavla/Rnet/master/R/Pajek.R")
  > vector2clu(y,Clu="years.clu")
  > vector2vec(nDays,Vec="days.vec")

===== 375102 or 375094 ? =====

  > Y[375100:375102,]
             idx       id       date        p1  p2 p3 p4
  375100 375092: ftt4diu4 2020-10-12  MedRxiv; WHO
  375101 375093: onz4762d 2013-12-31 Elsevier; PMC
  375102 375094: 0gmtnkbh 2004-09-06  Medline; PMC

The table ''Y'' has 375102 rows, but the last index is 375094, so it contains 8 additional "works". What is the reason?

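One way to look for the reason is to count the fields on every line of the file with ''count.fields'' and compare the counts with the number of columns in the header. The sketch below only illustrates the idea on a small made-up file: the three records and their ids are invented, and it assumes that the fields are separated by single spaces and that the file contains no quote or comment characters.

```r
# Toy stand-in for years.tmp; the ids below are invented for illustration.
tmp <- tempfile(fileext = ".tmp")
writeLines(c(
  "idx id date p1 p2 p3 p4",
  "1: aaaaaaaa 2020-09-09 Medline; PMC",
  "2: bbbbbbbb 2020-04-09 BioRxiv; MedRxiv; Medline; PMC; WHO",
  "3: cccccccc 2006-10-10 Elsevier; Medline; PMC"
), tmp)

# Number of space-separated fields on each line (the header is line 1).
nf <- count.fields(tmp, sep = " ", quote = "", comment.char = "")

# Data rows that have more fields than the header has columns.
long <- which(nf > nf[1]) - 1

# Supplying enough column names keeps every record on a single row.
cols <- c("idx", "id", "date", paste0("p", seq_len(max(nf) - 3)))
Y2 <- read.table(tmp, header = FALSE, skip = 1, sep = " ", quote = "",
                 comment.char = "", col.names = cols,
                 colClasses = "character", fill = TRUE)
```

Here ''long'' holds the numbers of the data rows that are too wide for the header, and ''Y2'' has one row per record. On the real file the same two calls would show whether any record has more than 7 fields.
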
  > for(i in 1:length(D)) if(i != as.integer(strsplit(Y$idx[i],":")[[1]][1])) break
  Error in if (i != as.integer(strsplit(Y$idx[i], ":")[[1]][1])) break :
    missing value where TRUE/FALSE needed
  > i
  [1] 268342
  > Y[(i-2):(i+2),]
             idx       id       date        p1       p2       p3   p4
  268340 268340: z8xps1bw 2020-09-09  Medline;      PMC
  268341 268341: 2w0zr9c0 2020-04-09  BioRxiv; MedRxiv; Medline; PMC;
  268342     WHO
  268343 268342: bu34528t 2006-10-10 Elsevier; Medline;      PMC
  268344 268343: fvdock94 2017-01-01 Elsevier; Medline;      PMC
  > for(i in 268343:length(D)) if((i-as.integer(strsplit(Y$idx[i],":")[[1]][1]))>1) break
  Error in if ((i - as.integer(strsplit(Y$idx[i], ":")[[1]][1])) > 1) break :
    missing value where TRUE/FALSE needed
  > i
  [1] 289829
  > Y[(i-2):(i+2),]
             idx       id       date       p1       p2       p3   p4
  289827 289826: muknyk4e 2020-06-07 Medline;      PMC
  289828 289827: h5ire02h 2020-03-07 BioRxiv; MedRxiv; Medline; PMC;
  289829     WHO
  289830 289828: 5skpvx1l 2020-08-05 Medline;      PMC
  289831 289829: jaqm2jb8 2020-10-15 Medline;      PMC

A work with five sources has 8 fields, one more than the header, so its last field (''WHO'') is wrapped onto a row of its own. I guess that adding an additional column ''p5'' to the header line will resolve the problem. The problem does not influence the produced distributions.

After adding ''p5'' to the header line of ''years.tmp'':

  > Y <- read.table("years.tmp",header=TRUE,sep=" ",colClasses="character",fill=TRUE)
  > tail(Y)
             idx       id       date        p1  p2 p3 p4 p5
  375089 375089: hd1tjj6b 2020-04-24  Medline; PMC
  375090 375090: 88wfcc3y 2020-09-24  BioRxiv; WHO
  375091 375091: fbn7h6dx 2020-05-20  Medline; PMC
  375092 375092: ftt4diu4 2020-10-12  MedRxiv; WHO
  375093 375093: onz4762d 2013-12-31 Elsevier; PMC
  375094 375094: 0gmtnkbh 2004-09-06  Medline; PMC

I was right.

===== To do =====

  * distribution by the day of the week
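
A possible starting point for this item, sketched on a few made-up dates rather than on ''Y$date''. It assumes that only complete dates after 2019-12-31 should be counted, and it uses the ''%u'' format code (weekday number, 1 = Monday) so that the result does not depend on the locale. The vector ''dates'' and the weekday labels are illustrative only.

```r
# Made-up sample; in these notes the input would be Y$date.
dates <- c("2020-01-01", "2020-01-02", "2020-01-06", "2020-01-08",
           "2013-12-31", "2020")

# With an explicit format, incomplete dates become NA instead of an error.
d <- as.Date(dates, format = "%Y-%m-%d")

# Keep complete dates after 2019-12-31, as in the analysis of days.
d <- d[!is.na(d) & d > as.Date("2019-12-31")]

# "%u" gives the weekday number (1 = Monday, ..., 7 = Sunday);
# unlike weekdays(), it is independent of the locale.
wd <- factor(format(d, "%u"), levels = as.character(1:7),
             labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

# Frequency of each weekday, including weekdays with no dates.
tw <- table(wd)
tw
```

The table ''tw'' can then be drawn with ''barplot(tw)''.
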