Analysis: years and days

Years

Reading the file years.tmp with read.table caused some problems. I added a header line to the data file and used the option fill=TRUE.

> wdir <- "C:/Users/batagelj/Documents/2020/corona/MetaTit"
> setwd(wdir)
> Y <- read.table("years.tmp",header=TRUE,sep=" ",colClasses="character",fill=TRUE)
> head(Y)
  idx       id       date  p1 p2 p3 p4
1  1: ug7v899j 2001-07-04 PMC         
2  2: 02tnwd4m 2000-08-15 PMC         
3  3: ejv2xln0 2000-08-25 PMC         
4  4: 2b73a28n 2001-02-22 PMC         
5  5: 9785vg6d 2001-05-11 PMC         
6  6: zjufx4fo 2001-12-17 PMC         
> D <- Y$date
> length(D)
[1] 375102
> first <- function(x) return(strsplit(x,"-")[[1]][1])
> y <- as.integer(sapply(Y$date,first))
> length(y)
[1] 375102
> t <- table(y)
> t
y
  1825   1870   1874   1884   1885   1887   1890   1891   1894   1899   1902 
     1      1      1      1      2      1      2      1      1      2      2 
  1903   1906   1916   1918   1919   1920   1922   1925   1926   1927   1931 
     1      1      1      2      1      2      1      1      1      1      2 
  1936   1940   1941   1942   1947   1948   1950   1951   1952   1953   1954 
     1      1      2      2      2      1      4      7      4      1      1 
  1955   1956   1957   1958   1959   1960   1961   1962   1963   1964   1965 
     7      1      7      1      3      1      1      4      3      8      6 
  1967   1968   1969   1970   1971   1972   1973   1974   1975   1976   1977 
     5      8     13     31     21     29     40     29     51     51     64 
  1978   1979   1980   1981   1982   1983   1984   1985   1986   1987   1988 
    76     82     96    149    114    117    171    155    181    223    196 
  1989   1990   1991   1992   1993   1994   1995   1996   1997   1998   1999 
   222    317    268    290    294    246    314    289    292    386    424 
  2000   2001   2002   2003   2004   2005   2006   2007   2008   2009   2010 
   540    618   1088   1917   2837   2827   3131   2966   3614   4154   4093 
  2011   2012   2013   2014   2015   2016   2017   2018   2019   2020   2021 
  4630   5074   6143   6841   7405   7842   7317   7227   8470 280456    173 
> t[45]
1967 
   5 
> ny <- length(t)
> ny
[1] 99
> years <- as.integer(names(t)[45:ny])
> freqs <- as.vector(t[45:ny])
> plot(years,freqs,pch=16)
> plot(years,freqs,pch=16,ylim=c(0,8500))

With ylim=c(0,8500) the point for the year 2020 (freq = 280456) falls outside the picture.
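
One way to keep 2020 in the picture is a logarithmic y-axis. The following is only a sketch on the last few counts copied from the table above, not on the full vectors years and freqs; it relies on all plotted counts being positive.

```r
# Counts for 2015..2021 copied from the table of years above
years <- 2015:2021
freqs <- c(7405, 7842, 7317, 7227, 8470, 280456, 173)

# log = "y" draws the y-axis on a logarithmic scale,
# so 2020 no longer dwarfs the other years
plot(years, freqs, pch = 16, log = "y")
```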

Days in 2020

> start <- as.integer(as.Date("2019-12-31"))
> numDays <- function(d){
+   tryCatch(
+     return(as.integer(as.Date(d))-start),
+     error=function(ermsg) return(0)
+   )
+ }
> nDays <- sapply(Y$date,numDays)
> days <- nDays[nDays>0]
> td <- table(days)
> nd <- length(td)
> nd
[1] 373
> td[nd-25]
355 
  1 
> f <- as.vector(td[1:(nd-25)])
> d <- as.integer(names(td)[1:(nd-25)])
> plot(d,f,pch=16,cex=0.5,type="l")
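
The function numDays is applied to the dates one at a time. A vectorized variant is sketched here on made-up dates. It assumes that complete dates always have the form YYYY-MM-DD, so that as.Date with an explicit format returns NA (instead of raising an error) for year-only or empty entries. The name numDaysVec is hypothetical.

```r
start <- as.integer(as.Date("2019-12-31"))

# Hypothetical vectorized counterpart of numDays
numDaysVec <- function(d) {
  # entries that do not match the format become NA
  n <- as.integer(as.Date(d, format = "%Y-%m-%d")) - start
  n[is.na(n)] <- 0L   # same convention as numDays: 0 for unusable dates
  n
}

# Made-up test dates: complete, year only, empty, before the start, complete
toy <- c("2020-01-01", "2020", "", "2019-12-30", "2020-03-15")
nToy <- numDaysVec(toy)
nToy
```

As in the original function, dates before 2019-12-31 give negative values and unusable dates give 0.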

The peaks probably correspond to the first day of each month. There is also a weekly pattern.
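
The guess about the first day of the month could be checked by summing the frequencies by day of the month. The sketch below uses a made-up table toy with the columns time and freq; the numbers are invented for illustration only.

```r
# Made-up daily counts (not taken from the real data)
toy <- data.frame(
  time = as.Date(c("2020-03-01", "2020-03-02", "2020-04-01",
                   "2020-04-15", "2020-05-01")),
  freq = c(900, 40, 1100, 55, 1300)
)

# Day of the month (1..31) of every date
mday <- as.POSIXlt(toy$time)$mday

# Total number of works for each day of the month
byMday <- tapply(toy$freq, mday, sum)
byMday

# Share of all works dated on the first day of a month
share1 <- byMday[["1"]] / sum(toy$freq)
share1
```

A share that is much larger than about 1/30 would support the guess.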

Time series visualization

> d20 <- Y$date[nDays>0]
> length(d20)
[1] 139613
> t20 <- table(d20)
> data <- data.frame(time=as.Date(names(t20)),freq=as.vector(t20))
> str(data)
'data.frame':   373 obs. of  2 variables:
 $ time: Date, format: "2020-01-01" "2020-01-02" ...
 $ freq: int  103 43 42 8 2 35 78 24 33 38 ...
> plot(data)
> library(ggplot2)
> p <- ggplot(data, aes(x=time, y=freq)) +
+   geom_line() + 
+   xlab("") +
+   scale_x_date(limits=c(as.Date("2020-01-01"),as.Date("2020-12-11")),
+     date_breaks = "1 month",date_labels = "%b %Y")
> p

Partitions

The obtained partitions

y - year of publication (0 means unknown)
nDays - number of days since 2019-12-31 (only positive values are of interest)

were saved to the files years.clu and days.vec (nDays as a vector rather than a partition, because it contains negative values):

> source("https://raw.githubusercontent.com/bavla/Rnet/master/R/Pajek.R")
> vector2clu(y,Clu="years.clu")
> vector2vec(nDays,Vec="days.vec")
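
For orientation, a Pajek partition file can also be written with base R. This is a sketch that assumes the usual Pajek layout (a line "*vertices n" followed by one value per line) and that unknown values are coded as 0. The helper writeClu is hypothetical and is not the function from Pajek.R.

```r
# Hypothetical helper: write an integer vector as a Pajek partition (.clu)
writeClu <- function(v, file) {
  v[is.na(v)] <- 0L   # assumption: unknown values are coded as 0
  writeLines(c(paste("*vertices", length(v)), as.character(v)), file)
}

# Small made-up example with one unknown year
f <- tempfile(fileext = ".clu")
writeClu(c(2001L, NA, 2020L), f)
cluLines <- readLines(f)
cluLines
```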

375102 or 375094?

> Y[375100:375102,]
           idx       id       date        p1  p2 p3 p4
375100 375092: ftt4diu4 2020-10-12  MedRxiv; WHO      
375101 375093: onz4762d 2013-12-31 Elsevier; PMC      
375102 375094: 0gmtnkbh 2004-09-06  Medline; PMC

The table Y has 8 more rows (“works”) than the last index 375094 suggests. What is the reason?

> for(i in 1:length(D)) if(i != as.integer(strsplit(Y$idx[i],":")[[1]][1])) break
Error in if (i != as.integer(strsplit(Y$idx[i], ":")[[1]][1])) break : 
  missing value where TRUE/FALSE needed
> i
[1] 268342
> Y[(i-2):(i+2),]
           idx       id       date        p1       p2       p3   p4
268340 268340: z8xps1bw 2020-09-09  Medline;      PMC              
268341 268341: 2w0zr9c0 2020-04-09  BioRxiv; MedRxiv; Medline; PMC;
268342     WHO                                                     
268343 268342: bu34528t 2006-10-10 Elsevier; Medline;      PMC     
268344 268343: fvdock94 2017-01-01 Elsevier; Medline;      PMC      
> for(i in 268343:length(D)) if((i-as.integer(strsplit(Y$idx[i],":")[[1]][1]))>1) break
Error in if ((i - as.integer(strsplit(Y$idx[i], ":")[[1]][1])) > 1) break : 
  missing value where TRUE/FALSE needed
> i
[1] 289829
> Y[(i-2):(i+2),]
           idx       id       date       p1       p2       p3   p4
289827 289826: muknyk4e 2020-06-07 Medline;      PMC              
289828 289827: h5ire02h 2020-03-07 BioRxiv; MedRxiv; Medline; PMC;
289829     WHO                                                    
289830 289828: 5skpvx1l 2020-08-05 Medline;      PMC              
289831 289829: jaqm2jb8 2020-10-15 Medline;      PMC
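
The two loops stop at the first mismatch they meet. All rows of this kind could be located at once by testing whether idx has the form "number:". A sketch on a small made-up table with the same shape as Y (the idx values are invented; the ids are taken from the listing above):

```r
# Made-up stand-in for Y: row 3 is the overflow of the record in row 2
toy <- data.frame(
  idx = c("1:", "2:", "WHO", "3:", "4:"),
  id  = c("z8xps1bw", "2w0zr9c0", "", "bu34528t", "fvdock94"),
  stringsAsFactors = FALSE
)

# A well-formed idx consists of digits followed by a colon
bad <- which(!grepl("^[0-9]+:$", toy$idx))
bad
```

Applied to Y$idx, length(bad) should equal the number of surplus rows if all of them arise in this way.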

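Some works list five sources, so the fifth one (WHO) overflows into a new row. I guess that adding an additional column p5 to the header line will resolve the problem.

The problem does not affect the distributions produced above, because the overflow rows have no date.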

> Y <- read.table("years.tmp",header=TRUE,sep=" ",colClasses="character",fill=TRUE)
> tail(Y)
           idx       id       date        p1  p2 p3 p4 p5
375089 375089: hd1tjj6b 2020-04-24  Medline; PMC         
375090 375090: 88wfcc3y 2020-09-24  BioRxiv; WHO         
375091 375091: fbn7h6dx 2020-05-20  Medline; PMC         
375092 375092: ftt4diu4 2020-10-12  MedRxiv; WHO         
375093 375093: onz4762d 2013-12-31 Elsevier; PMC         
375094 375094: 0gmtnkbh 2004-09-06  Medline; PMC

I was right: the table now has 375094 rows.
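
Instead of extending the header by hand, the required number of columns could be determined with count.fields. The sketch below works on a small made-up file, not on years.tmp, and assumes three fixed fields (idx, id, date) followed by a variable number of sources.

```r
# Made-up stand-in for years.tmp: the second record has more
# sources than the header provides columns for
tmp <- tempfile(fileext = ".tmp")
writeLines(c(
  "idx id date p1 p2",
  "1: aaaa1111 2020-01-01 PMC",
  "2: bbbb2222 2020-04-09 BioRxiv; MedRxiv; WHO",
  "3: cccc3333 2006-10-10 Medline; PMC"
), tmp)

# Number of space-separated fields on every line (header included)
nf <- count.fields(tmp, sep = " ")
width <- max(nf)

# Three fixed columns plus as many source columns as the widest line needs
cols <- c("idx", "id", "date", paste0("p", seq_len(width - 3)))
Y2 <- read.table(tmp, header = FALSE, skip = 1, sep = " ",
                 col.names = cols, colClasses = "character", fill = TRUE)
Y2
```

With col.names given explicitly, read.table does not have to infer the number of columns from the first lines of the file.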

To do

distribution by the day of the week
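
A possible starting point, sketched on a made-up table toy with the columns time and freq (the numbers are invented). The weekday is taken from as.POSIXlt, where 0 means Sunday; English labels are hard-coded so that the result does not depend on the locale.

```r
# Made-up daily counts: Monday, Tuesday, Sunday, Monday
toy <- data.frame(
  time = as.Date(c("2020-01-06", "2020-01-07", "2020-01-12", "2020-01-13")),
  freq = c(10, 20, 5, 30)
)

# Weekday number: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
wday <- as.POSIXlt(toy$time)$wday
labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Factor ordered Monday..Sunday; empty weekdays stay in as levels
wd <- factor(labels[wday + 1], levels = labels[c(2:7, 1)])

# Total number of works per weekday (0 for weekdays without data)
byWday <- sapply(split(toy$freq, wd), sum)
byWday
```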