====== Analysis: years and days ======
[[notes:imfm:corona:s2orcmeta|S2ORC metadata networks]]
===== Years =====
Reading the file ''years.tmp'' with ''read.table'' caused some problems, because the records have different numbers of fields. I added a header line to the data file and used the option ''fill=TRUE''.
> wdir <- "C:/Users/batagelj/Documents/2020/corona/MetaTit"
> setwd(wdir)
> Y <- read.table("years.tmp",header=TRUE,sep=" ",colClasses="character",fill=TRUE)
> head(Y)
idx id date p1 p2 p3 p4
1 1: ug7v899j 2001-07-04 PMC
2 2: 02tnwd4m 2000-08-15 PMC
3 3: ejv2xln0 2000-08-25 PMC
4 4: 2b73a28n 2001-02-22 PMC
5 5: 9785vg6d 2001-05-11 PMC
6 6: zjufx4fo 2001-12-17 PMC
> D <- Y$date
> length(D)
[1] 375102
> first <- function(x) return(strsplit(x,"-")[[1]][1])
> y <- as.integer(sapply(Y$date,first))
> length(y)
[1] 375102
> t <- table(y)
> t
y
1825 1870 1874 1884 1885 1887 1890 1891 1894 1899 1902
1 1 1 1 2 1 2 1 1 2 2
1903 1906 1916 1918 1919 1920 1922 1925 1926 1927 1931
1 1 1 2 1 2 1 1 1 1 2
1936 1940 1941 1942 1947 1948 1950 1951 1952 1953 1954
1 1 2 2 2 1 4 7 4 1 1
1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965
7 1 7 1 3 1 1 4 3 8 6
1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977
5 8 13 31 21 29 40 29 51 51 64
1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988
76 82 96 149 114 117 171 155 181 223 196
1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
222 317 268 290 294 246 314 289 292 386 424
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
540 618 1088 1917 2837 2827 3131 2966 3614 4154 4093
2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
4630 5074 6143 6841 7405 7842 7317 7227 8470 280456 173
> t[45]
1967
5
> ny <- length(t)
> ny
[1] 99
> years <- as.integer(names(t)[45:ny])
> freqs <- as.vector(t[45:ny])
> plot(years,freqs,pch=16)
> plot(years,freqs,pch=16,ylim=c(0,8500))
The year 2020 (freq = 280456) lies outside the picture, because ''ylim'' is capped at 8500.
{{notes:imfm:corona:pics:years.png}}
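A logarithmic frequency axis is one possible way to keep 2020 inside the picture. The following is only a sketch: it uses the last five counts copied from the table above instead of the full ''years'' and ''freqs'' vectors, and the name ''ratio'' is introduced here for illustration.

```r
# Last five counts copied from the table above (2017-2021).
years <- 2017:2021
freqs <- c(7317, 7227, 8470, 280456, 173)

# log = "y" draws the frequency axis on a log scale, so 2020 stays visible
# next to the other years; this requires all counts to be positive.
plot(years, freqs, pch = 16, log = "y", ylab = "freqs (log scale)")

# How many times larger 2020 is than the largest other year (hypothetical helper).
ratio <- freqs[years == 2020] / max(freqs[years != 2020])
```

On a log axis the differences among the small counts are compressed, so the linear plot with ''ylim'' remains useful for the earlier years.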
===== Days in 2020 =====
> start <- as.integer(as.Date("2019-12-31"))
> numDays <- function(d){
+ tryCatch(
+ return(as.integer(as.Date(d))-start),
+ error=function(ermsg) return(0)
+ )
+ }
> nDays <- sapply(Y$date,numDays)
> days <- nDays[nDays>0]
> td <- table(days)
> nd <- length(td)
> nd
[1] 373
> td[nd-25]
355
1
> f <- as.vector(td[1:(nd-25)])
> d <- as.integer(names(td)[1:(nd-25)])
> plot(d,f,pch=16,cex=0.5,type="l")
{{notes:imfm:corona:pics:days2020.png}}
The peaks probably correspond to the first day of each month. There is also a weekly pattern.
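The guess about the first day of the month could be checked by averaging the daily counts by day of month. This is a minimal sketch on artificial data: the data frame and the names ''dom'', ''byDom'' and ''peakDay'' are assumptions made for illustration, and the peaks are planted on purpose.

```r
# Artificial daily counts for 2020-01-01 .. 2020-03-31 (not the real data).
data <- data.frame(time = as.Date("2020-01-01") + 0:90, freq = 40L)
data$freq[format(data$time, "%d") == "01"] <- 400L   # planted peaks on the 1st

# Day of month (1..31) for every date, then the mean count per day of month.
dom <- as.integer(format(data$time, "%d"))
byDom <- tapply(data$freq, dom, mean)

# Day of month with the highest mean count.
peakDay <- as.integer(names(byDom)[which.max(byDom)])
```

On the real counts, a value of ''byDom'' for day 1 that is clearly above the others would support the guess.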
===== Time series visualization =====
> d20 <- Y$date[nDays>0]
> length(d20)
[1] 139613
> t20 <- table(d20)
> data <- data.frame(time=as.Date(names(t20)),freq=as.vector(t20))
> str(data)
'data.frame': 373 obs. of 2 variables:
$ time: Date, format: "2020-01-01" "2020-01-02" ...
$ freq: int 103 43 42 8 2 35 78 24 33 38 ...
> plot(data)
> library(ggplot2)
> p <- ggplot(data, aes(x=time, y=freq)) +
+ geom_line() +
+ xlab("") +
+ scale_x_date(limits=c(as.Date("2020-01-01"),as.Date("2020-12-11")),
+ date_breaks = "1 month",date_labels = "%b %Y")
> p
{{notes:imfm:corona:pics:gdays2020b.png}}
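To look past the weekly pattern, the daily series could be smoothed with a centered 7-day moving average. The sketch below uses artificial counts with an exact weekly cycle; the column name ''ma7'' is an assumption, and the approach presumes one row per consecutive day, which a table of observed dates does not guarantee.

```r
# Artificial series: four identical weeks (not the real data).
data <- data.frame(time = as.Date("2020-01-01") + 0:27,
                   freq = rep(c(10, 20, 30, 40, 50, 5, 5), 4))

# A moving average only makes sense if no day is missing.
stopifnot(all(diff(data$time) == 1))

# Centered 7-day mean; the first and last three values are NA.
data$ma7 <- as.numeric(stats::filter(data$freq, rep(1/7, 7), sides = 2))
```

The smoothed column could then be drawn over the raw series, for example with ''lines(data$time, data$ma7)'' after ''plot(data$time, data$freq)''.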
===== Partitions =====
The obtained partitions
* ''y'' - year of publication (0 means unknown)
* ''nDays'' - number of days since 2019-12-31 (only positive values are of interest)
were saved to the files ''years.clu'' and ''days.vec''; ''nDays'' is stored as a vector because it contains negative values:
> source("https://raw.githubusercontent.com/bavla/Rnet/master/R/Pajek.R")
> vector2clu(y,Clu="years.clu")
> vector2vec(nDays,Vec="days.vec")
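The day offsets described above can also be computed without calling ''tryCatch'' for every element. This is a sketch under assumptions: ''numDaysVec'' is a hypothetical name, the sample dates are made up, and only dates written as ''YYYY-MM-DD'' are accepted, which may differ slightly from the default parsing of ''as.Date''.

```r
start <- as.integer(as.Date("2019-12-31"))

# numDaysVec is a hypothetical name, not part of the original session.
numDaysVec <- function(d) {
  # With an explicit format, as.Date() returns NA for strings it cannot
  # parse instead of raising an error, so the whole vector is handled at once.
  n <- as.integer(as.Date(d, format = "%Y-%m-%d")) - start
  n[is.na(n)] <- 0L   # same convention as numDays: unparsable dates give 0
  n
}

# Made-up sample: two days in 2020, the reference day, an old date,
# a year-only date and an empty field.
dates <- c("2020-01-01", "2020-12-31", "2019-12-31", "2001-07-04", "2020", "")
nDays <- numDaysVec(dates)
```

As in the session, positive values are days in 2020 or later, negative values are earlier dates, and 0 marks the reference day or a date that could not be parsed.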
===== 375102 or 375094? =====
> Y[375100:375102,]
idx id date p1 p2 p3 p4
375100 375092: ftt4diu4 2020-10-12 MedRxiv; WHO
375101 375093: onz4762d 2013-12-31 Elsevier; PMC
375102 375094: 0gmtnkbh 2004-09-06 Medline; PMC
The table ''Y'' has 375102 rows, but the last ''idx'' is 375094, so it contains 8 additional "works". What is the reason?
> for(i in 1:length(D)) if(i != as.integer(strsplit(Y$idx[i],":")[[1]][1])) break
Error in if (i != as.integer(strsplit(Y$idx[i], ":")[[1]][1])) break :
missing value where TRUE/FALSE needed
> i
[1] 268342
> Y[(i-2):(i+2),]
idx id date p1 p2 p3 p4
268340 268340: z8xps1bw 2020-09-09 Medline; PMC
268341 268341: 2w0zr9c0 2020-04-09 BioRxiv; MedRxiv; Medline; PMC;
268342 WHO
268343 268342: bu34528t 2006-10-10 Elsevier; Medline; PMC
268344 268343: fvdock94 2017-01-01 Elsevier; Medline; PMC
> for(i in 268343:length(D)) if((i-as.integer(strsplit(Y$idx[i],":")[[1]][1]))>1) break
Error in if ((i - as.integer(strsplit(Y$idx[i], ":")[[1]][1])) > 1) break :
missing value where TRUE/FALSE needed
> i
[1] 289829
> Y[(i-2):(i+2),]
idx id date p1 p2 p3 p4
289827 289826: muknyk4e 2020-06-07 Medline; PMC
289828 289827: h5ire02h 2020-03-07 BioRxiv; MedRxiv; Medline; PMC;
289829 WHO
289830 289828: 5skpvx1l 2020-08-05 Medline; PMC
289831 289829: jaqm2jb8 2020-10-15 Medline; PMC
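Some records list five sources, and with only four columns ''p1''-''p4'' the fifth one (''WHO'') overflows into a new row. I guess that adding an additional column ''p5'' to the header line of ''years.tmp'' will resolve the problem.
The problem doesn't influence the produced distributions, because the overflow rows contain no date.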
> Y <- read.table("years.tmp",header=TRUE,sep=" ",colClasses="character",fill=TRUE)
> tail(Y)
idx id date p1 p2 p3 p4 p5
375089 375089: hd1tjj6b 2020-04-24 Medline; PMC
375090 375090: 88wfcc3y 2020-09-24 BioRxiv; WHO
375091 375091: fbn7h6dx 2020-05-20 Medline; PMC
375092 375092: ftt4diu4 2020-10-12 MedRxiv; WHO
375093 375093: onz4762d 2013-12-31 Elsevier; PMC
375094 375094: 0gmtnkbh 2004-09-06 Medline; PMC
I was right: the table now has 375094 rows, matching the last ''idx''.
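The overflow can be reproduced and detected on a small file. According to its documentation, ''read.table'' determines the number of columns from the first five lines of input, so a longer record further down is wrapped into an extra row when ''fill=TRUE''. The sketch below is an illustration only: the file contents and ids are made up, and instead of editing the header it skips it and passes ''col.names''.

```r
# Made-up miniature of years.tmp: 7-name header, 6 records, record 5 has 8 fields.
lines <- c(
  "idx id date p1 p2 p3 p4",
  "1: aaaa0001 2001-07-04 PMC",
  "2: aaaa0002 2000-08-15 PMC",
  "3: aaaa0003 2000-08-25 Medline; PMC",
  "4: aaaa0004 2001-02-22 PMC",
  "5: aaaa0005 2020-04-09 BioRxiv; MedRxiv; Medline; PMC; WHO",
  "6: aaaa0006 2006-10-10 Elsevier; Medline; PMC"
)
tmp <- tempfile(fileext = ".tmp")
writeLines(lines, tmp)

# Number of fields on every line (header included) and the widest record.
nf <- count.fields(tmp, sep = " ")
maxf <- max(nf)

# Reading with the 7-column header wraps the 8th field into an extra row.
bad <- read.table(tmp, header = TRUE, sep = " ",
                  colClasses = "character", fill = TRUE)

# Supplying enough column names (here idx, id, date, p1..p5) avoids the wrap.
good <- read.table(tmp, header = FALSE, skip = 1, sep = " ",
                   colClasses = "character", fill = TRUE,
                   col.names = c("idx", "id", "date", paste0("p", 1:(maxf - 3))))
```

Here ''nrow(bad)'' exceeds the number of records by one, while ''nrow(good)'' equals it; running ''count.fields'' on the real file before reading would show how many ''p'' columns are needed.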
===== To do =====
* distribution by the day of the week
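A possible starting point for the day-of-week distribution is sketched below. It is not part of the original session: the sample dates are made up, the names ''wd'', ''labels'', ''tw'' and ''counts'' are assumptions, and ''as.POSIXlt(...)$wday'' is used because it does not depend on the locale.

```r
# Made-up sample of dates; in the notes they would come from Y$date.
dates <- as.Date(c("2020-01-01", "2020-01-02", "2020-01-04",
                   "2020-01-05", "2020-01-06", "2020-01-08"))

# wday is 0 for Sunday, 1 for Monday, ..., 6 for Saturday.
wd <- as.POSIXlt(dates)$wday
labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Tabulate with the week ordered Monday..Sunday; unused days get count 0.
tw <- table(factor(labels[wd + 1], levels = labels[c(2:7, 1)]))
counts <- setNames(as.vector(tw), names(tw))
```

For the real data the dates that cannot be parsed would have to be removed first, for example by keeping only the entries with ''nDays>0''.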