====== Data ======

===== Variables and data tables =====

In a standard setting we have an (ordered) set of **units**. To describe them we select a set of their **properties** (attributes). Data are obtained by **measurement** of properties on units. We obtain a **data table**

^     ^ P₁ ^ P₂ ^ P₃ ^ ... ^ P_m ^
^ U₁  | v_1,1 | v_1,2 | v_1,3 | ... | v_1,m |
^ U₂  | v_2,1 | v_2,2 | v_2,3 | ... | v_2,m |
^ U₃  | v_3,1 | v_3,2 | v_3,3 | ... | v_3,m |
^ ... |  |  |  |  |  |
^ U_n | v_n,1 | v_n,2 | v_n,3 | ... | v_n,m |

In statistics, the columns of the data table (vectors) are called **variables**. For this reason the basic data structure in R is a vector, and most operations and functions work vector-wise.


> a <- c(1,2,3,4,5)
> a
[1] 1 2 3 4 5
> b <- c(10,9,8,7,6)
> b
[1] 10  9  8  7  6
> b[4]
[1] 7
> b[c(1,3,5)]
[1] 10  8  6
> length(a)
[1] 5
> a+b
[1] 11 11 11 11 11
> a-b
[1] -9 -7 -5 -3 -1
> a*b
[1] 10 18 24 28 30
> sqrt(a)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
> 2**a
[1]  2  4  8 16 32
> summary(b)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      6       7       8       8       9      10 
> 3*a
[1]  3  6  9 12 15
> sum(b)
[1] 40
> cumsum(b)
[1] 10 19 27 34 40
> integer(10)
 [1] 0 0 0 0 0 0 0 0 0 0
> 1:9
[1] 1 2 3 4 5 6 7 8 9
> rep(c(0,1),5)
 [1] 0 1 0 1 0 1 0 1 0 1
> seq(1,3,1/3)
[1] 1.000000 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000
> c(rep(0,5),rep(1,7))
 [1] 0 0 0 0 0 1 1 1 1 1 1 1
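
The data table itself can be represented in R as a data frame, in which each column is a variable (a vector) and each row is a unit. The following is only a minimal sketch: the names ''d'', ''height'', ''weight'' and ''grade'' and all the values are made up for illustration.

```r
# Hypothetical data table: 4 units (rows), 3 properties (columns).
d <- data.frame(
  height = c(172, 165, 180, 158),                  # numerical variable
  weight = c(68.5, 59.0, 81.2, 50.3),              # numerical variable
  grade  = c("good", "poor", "good", "average"),   # grades stored as text
  row.names = c("U1", "U2", "U3", "U4")
)

dim(d)            # number of units n and number of properties m
d$height          # a column is an ordinary vector (a variable)
d["U3", ]         # a row holds all measurements on one unit
d[2, "weight"]    # the single value v_2,2
mean(d$height)    # functions on vectors apply directly to columns
```

Selecting a column with ''$'' returns a plain vector, so everything shown above for ''a'' and ''b'' works on columns as well.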

===== Measurement =====

Not all properties of units can be measured by real numbers, as they are in physics and geometry. Measurements are made in different **measurement scales** ([[wp>Level_of_measurement]], [[https://www.cambridge.org/core/books/measurement-theory/7D75B72C3E5FA676EA7AD6AB4D8DF4A7|Roberts]]):

  * **numerical** scales:
    * **absolute** (counts)
    * **ratio** (weight, length, price, temperature in K)
    * **interval** (temperature in °F, date)
  * **ordinal** (grades: excellent, good, average, poor, unsatisfactory; temperature: very cold, cold, cool, warm, hot, very hot)
  * **nominal** or **categorical** (nationality, religious preference: Buddhist, Muslim, Christian, Jewish, Other)

The type of the measurement scale determines what we can do with the corresponding variables. For example, it determines which central element is meaningful: the geometric mean (ratio scale), the arithmetic mean (interval scale), the median (ordinal scale), or the mode (nominal scale).

===== Nominal and ordinal variables in R =====

Values measured on numerical scales are represented by integers or real numbers. Integers are also often used as codes for ordinal and nominal values, but not all numerical operations on such codes are meaningful.


> t <- c("F","D","F","I","A","F","F","D","B","SLO","I","F","GB","B")
> T <- factor(t)
> t
 [1] "F"   "D"   "F"   "I"   "A"   "F"   "F"   "D"   "B"   "SLO" "I"   "F"   "GB"  "B"  
> T
 [1] F   D   F   I   A   F   F   D   B   SLO I   F   GB  B  
Levels: A B D F GB I SLO
> as.integer(T)
 [1] 4 3 4 6 1 4 4 3 2 7 6 4 5 2
> levels(T)
[1] "A"   "B"   "D"   "F"   "GB"  "I"   "SLO"
> levels(T)[as.integer(T)]
 [1] "F"   "D"   "F"   "I"   "A"   "F"   "F"   "D"   "B"   "SLO" "I"   "F"   "GB"  "B"  
> which(T=="F")
[1]  1  3  6  7 12
> table(T)
T
  A   B   D   F  GB   I SLO 
  1   2   2   5   1   2   1 
> L <- c("unsatisfactory","poor","average","good","excellent")
> s <- c("unsatisfactory","good","good","average")
> k <- factor(s,levels=L,ordered=TRUE)
> k
[1] unsatisfactory good           good           average       
Levels: unsatisfactory < poor < average < good < excellent
> as.integer(k)
[1] 1 4 4 3
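
As a rough illustration of how the scale restricts the choice of central element, the sketch below computes one for each kind of variable. The vectors ''x'', ''g'' and ''n'' are made-up examples, and taking the median through the integer codes is just one possible approach, not a prescribed method.

```r
# Central element appropriate to each scale (all data below are made up).
x <- c(2, 4, 8)                        # ratio scale
exp(mean(log(x)))                      # geometric mean
mean(x)                                # arithmetic mean

L <- c("unsatisfactory", "poor", "average", "good", "excellent")
g <- factor(c("poor", "good", "good", "average", "excellent"),
            levels = L, ordered = TRUE)    # ordinal scale
# Odd number of values, so the median is one of the observed codes.
L[median(as.integer(g))]               # median grade: "good"

n <- factor(c("F", "D", "F", "I", "F"))    # nominal scale
names(which.max(table(n)))             # mode (most frequent value): "F"
```

Calling ''mean(g)'' or ''mean(n)'' on a factor only yields ''NA'' with a warning, which reflects the fact that the arithmetic mean is not meaningful on these scales.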

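The values in ''t'' are international vehicle registration codes; see [[wp>List_of_international_vehicle_registration_codes]].

==== Test ====

Compute the empirical probability distribution and the cumulative empirical probability distribution of the values in vector ''t''. [[ru:7iss:labs:s:s1|Solution]]

===== Special values =====

Special values represent **missing data** and **not applicable** measurements, undefined results, and infinity. In R they appear as ''NA'', ''NaN'', ''Inf'' and ''-Inf''.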


> 1^2009 
[1] 1 
> 2009^0 
[1] 1 
> 0^0 
[1] 1 
> 1/0 
[1] Inf 
> -3/0 
[1] -Inf 
> 5/Inf 
[1] 0 
> Inf+Inf 
[1] Inf 
> 2009*Inf
[1] Inf
> Inf*Inf 
[1] Inf 
> sqrt(Inf) 
[1] Inf 
> Inf-Inf 
[1] NaN 
> 0/0 
[1] NaN 
> 0*Inf 
[1] NaN 
> Inf/Inf 
[1] NaN 
> 3%%0 
[1] NaN 
> sqrt(-2)
[1] NaN
Warning message:
In sqrt(-2) : NaNs produced
> 3+sqrt(NA)
[1] NA
> 1e308
[1] 1e+308
> 1e309
[1] Inf
> 1e-323
[1] 9.881313e-324
> 1e-324
[1] 0
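
In practice the special values have to be detected and handled before summarising a variable. The sketch below shows the standard base R predicates for this; the vector ''v'' is a made-up example.

```r
# Hypothetical measurements containing NA, NaN and Inf.
v <- c(3.2, NA, 5.1, 0/0, 1/0)

is.na(v)               # TRUE for both NA and NaN
is.nan(v)              # TRUE only for NaN
is.finite(v)           # FALSE for NA, NaN, Inf and -Inf
mean(v)                # missing values propagate (result is NA or NaN)
mean(v, na.rm = TRUE)  # Inf: NA and NaN are dropped, Inf is kept
mean(v[is.finite(v)])  # 4.15: only the finite values are used
```

Note that ''na.rm = TRUE'' removes only ''NA'' and ''NaN''; infinite values have to be filtered out explicitly if they should not influence the result.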

===== Unicode =====

[[|Unicode]].


> L <- c('полностью согласен','скорее согласен','скорее несогласен','полностью несогласен')
> L
[1] "полностью согласен"   "скорее согласен"      "скорее несогласен"    "полностью несогласен"
> a <- c("полностью согласен","скорее несогласен",NA,"полностью согласен","скорее согласен","полностью согласен")
> a
[1] "полностью согласен" "скорее несогласен"  NA       "полностью согласен" "скорее согласен"    "полностью согласен"
> A <- factor(a,levels=L,ordered=TRUE)
> A
[1] полностью согласен скорее несогласен  <NA>               полностью согласен скорее согласен    полностью согласен
Levels: полностью согласен < скорее согласен < скорее несогласен < полностью несогласен
> as.integer(A)
[1]  1  3 NA  1  2  1
>
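
The difference between characters and bytes in a Unicode string can be inspected directly. This is a small sketch under the assumption that the string is written with ''\u'' escapes, which makes it independent of the encoding of the script file; ''s'' is an illustrative name.

```r
# "\u" escapes give the same string regardless of the file's encoding.
s <- "\u0441\u043a\u043e\u0440\u0435\u0435"   # the word "скорее"

Encoding(s)               # declared encoding of the string
nchar(s, type = "chars")  # 6 characters
nchar(s, type = "bytes")  # 12 bytes: 2 bytes per Cyrillic letter in UTF-8
utf8ToInt(s)[1]           # 1089, the code point U+0441 of the first letter
```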

**Metadata** are information about the data (method of collection, time, place, remarks, authors, copyright, etc.). The current tendency is to store the metadata together with the data. \\ \\ [[ru:7iss#labs|Back to 7ISS Labs]]