Data

Variables and data tables

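In a standard setting we have an (ordered) set of units. To describe them we select a set of their properties (attributes). Data are obtained by measuring these properties on the units. The result is a data table with one row per unit U_i, one column per property P_j, and the measured value v_i,j in each cell: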

	P₁	P₂	P₃	…	P_m
U₁	v_1,1	v_1,2	v_1,3	…	v_1,m
U₂	v_2,1	v_2,2	v_2,3	…	v_2,m
U₃	v_3,1	v_3,2	v_3,3	…	v_3,m
…
U_n	v_n,1	v_n,2	v_n,3	…	v_n,m

In statistics the columns of the data table (vectors) are called variables. For this reason the basic data structure in R is a vector, and most operations and functions work element-wise on whole vectors.

> a <- c(1,2,3,4,5)
> a
[1] 1 2 3 4 5
> b <- c(10,9,8,7,6)
> b
[1] 10  9  8  7  6
> b[4]
[1] 7
> b[c(1,3,5)]
[1] 10  8  6
> length(a)
[1] 5
> a+b
[1] 11 11 11 11 11
> a-b
[1] -9 -7 -5 -3 -1
> a*b
[1] 10 18 24 28 30
> sqrt(a)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
> 2**a
[1]  2  4  8 16 32
> summary(b)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      6       7       8       8       9      10 
> 3*a
[1]  3  6  9 12 15
> sum(b)
[1] 40
> cumsum(b)
[1] 10 19 27 34 40
> integer(10)
 [1] 0 0 0 0 0 0 0 0 0 0
> 1:9
[1] 1 2 3 4 5 6 7 8 9
> rep(c(0,1),5)
 [1] 0 1 0 1 0 1 0 1 0 1
> seq(1,3,1/3)
[1] 1.000000 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000
> c(rep(0,5),rep(1,7))
 [1] 0 0 0 0 0 1 1 1 1 1 1 1
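
The data table described above can be held in R as a data frame, in which every column is a vector. The following is only a minimal sketch: the column names, unit labels and values are invented for illustration.

```r
# Hypothetical data table: 4 units (rows) measured on 3 properties (columns).
# All names and values are invented for illustration.
d <- data.frame(
  height = c(172, 165, 180, 158),
  weight = c(68.5, 59.0, 82.3, 51.2),
  passed = c(TRUE, FALSE, TRUE, TRUE),
  row.names = c("U1", "U2", "U3", "U4")
)

dim(d)              # number of units n and number of properties m
d$height            # one column = one variable = one vector
d["U3", "weight"]   # the value v_3,2 for unit U3 and the second property
mean(d$height)      # vector functions apply to a column directly
```

Because each column is an ordinary vector, everything shown in the transcript above (indexing, arithmetic, summary) applies to it unchanged.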

Measurement

Not all properties of units can be measured by real numbers, as quantities in physics and geometry are. Measurements are made on different measurement scales (see Level of measurement; Roberts):

- numerical scales:
  - absolute (counts)
  - ratio (weight, length, price, temperature in K)
  - interval (temperature in °F, date)
- ordinal (grades: excellent, good, average, poor, unsatisfactory; temperature: very cold, cold, cool, warm, hot, very hot)
- nominal or categorical (nationality; religious preference: Buddhist, Muslim, Christian, Jewish, Other)

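The type of the measurement scale determines what we can do with the corresponding variables. For example, the appropriate measure of the central element differs by scale: the geometric mean for ratio scales, the (arithmetic) mean for interval scales, the median for ordinal scales, and the mode for nominal scales.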
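
As a rough illustration, the central elements named above can be computed in R as follows. The data are invented, and stat_mode is our own small helper, not a base R function (the built-in mode() returns the storage type of an object, not the most frequent value).

```r
x <- c(2, 4, 4, 8, 16)     # invented ratio-scale measurements (all positive)

exp(mean(log(x)))          # geometric mean
mean(x)                    # arithmetic mean
median(x)                  # median

# Hypothetical helper: the most frequent value(s) of a variable.
stat_mode <- function(v) {
  counts <- table(v)
  names(counts)[counts == max(counts)]   # returned as character strings
}
stat_mode(c("F", "D", "F", "I", "A"))    # mode of a nominal variable
```

The helper returns every value that reaches the maximal count, so a variable with ties has more than one mode.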

Nominal and ordinal variables in R

Values measured on numerical scales are represented by integers or real numbers. Integers are also often used as codes for ordinal and nominal values, but not all numerical operations on such codes are meaningful. In R such variables are represented as factors.

> t <- c("F","D","F","I","A","F","F","D","B","SLO","I","F","GB","B")
> T <- factor(t)
> t
 [1] "F"   "D"   "F"   "I"   "A"   "F"   "F"   "D"   "B"   "SLO" "I"   "F"   "GB"  "B"  
> T
 [1] F   D   F   I   A   F   F   D   B   SLO I   F   GB  B  
Levels: A B D F GB I SLO
> as.integer(T)
 [1] 4 3 4 6 1 4 4 3 2 7 6 4 5 2
> levels(T)
[1] "A"   "B"   "D"   "F"   "GB"  "I"   "SLO"
> levels(T)[as.integer(T)]
 [1] "F"   "D"   "F"   "I"   "A"   "F"   "F"   "D"   "B"   "SLO" "I"   "F"   "GB"  "B"  
> which(T=="F")
[1]  1  3  6  7 12
> table(T)
T
  A   B   D   F  GB   I SLO 
  1   2   2   5   1   2   1 
> L <- c("unsatisfactory","poor","average","good","excellent")
> s <- c("unsatisfactory","good","good","average")
> k <- factor(s,levels=L,ordered=TRUE)
> k
[1] unsatisfactory good           good           average       
Levels: unsatisfactory < poor < average < good < excellent
> as.integer(k)
[1] 1 4 4 3
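
An ordered factor supports the operations that make sense on an ordinal scale. The sketch below repeats the definitions of L, s and k from the transcript so that it can be run on its own.

```r
L <- c("unsatisfactory", "poor", "average", "good", "excellent")
s <- c("unsatisfactory", "good", "good", "average")
k <- factor(s, levels = L, ordered = TRUE)

k >= "average"   # comparisons follow the order of the levels, not the alphabet
max(k)           # the highest grade that occurs
sort(k)          # sorted by level order
table(k)         # counts for all levels, including those that do not occur
```

For an unordered factor such as T above, comparisons with < or >= are not meaningful, and R returns NA with a warning.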

The values in t are international vehicle registration codes of countries; see the List of international vehicle registration codes.

Test

Compute the empirical probability distribution and the cumulative empirical probability distribution of the values in vector t.
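
One possible approach, given here only as a sketch and not as the official solution, is to divide the frequency table by the number of values and then accumulate the result:

```r
t <- c("F","D","F","I","A","F","F","D","B","SLO","I","F","GB","B")

p <- table(t) / length(t)   # relative frequency of each value
p
cumsum(p)                   # cumulative sums, taken in the order of the table
```

Since t is a nominal variable, the cumulative distribution depends on the (here alphabetical) order in which table() lists the values.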

Special values

R has special values for missing data and not applicable measurements (NA), for results that are not a number (NaN), and for infinity (Inf, -Inf).

> 1^2009 
[1] 1 
> 2009^0 
[1] 1 
> 0^0 
[1] 1 
> 1/0 
[1] Inf 
> -3/0 
[1] -Inf 
> 5/Inf 
[1] 0 
> Inf+Inf 
[1] Inf 
> 2009*Inf
[1] Inf
> Inf*Inf 
[1] Inf 
> sqrt(Inf) 
[1] Inf 
> Inf-Inf 
[1] NaN 
> 0/0 
[1] NaN 
> 0*Inf 
[1] NaN 
> Inf/Inf 
[1] NaN 
> 3%%0 
[1] NaN 
> sqrt(-2)
[1] NaN
Warning message:
In sqrt(-2) : NaNs produced
> 3+sqrt(NA)
[1] NA
> 1e308
[1] 1e+308
> 1e309
[1] Inf
> 1e-323
[1] 9.881313e-324
> 1e-324
[1] 0
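
Special values propagate through computations, so it is useful to be able to detect and skip them. A minimal sketch with an invented vector:

```r
x <- c(3, NA, 0/0, 1/0, -7)   # ordinary values mixed with NA, NaN and Inf

is.na(x)        # TRUE for NA and also for NaN
is.nan(x)       # TRUE only for NaN
is.finite(x)    # TRUE only for ordinary numbers
NA == NA        # the result is NA, so == cannot be used to detect missing values

mean(x)                           # not a usable number: the special values propagate
mean(x[is.finite(x)])             # mean of the ordinary values only
sum(c(3, NA, -7), na.rm = TRUE)   # many functions can skip NA via na.rm
```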

Unicode

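R character strings can contain Unicode text, so the labels of a factor need not be in English. In the example below the levels are Russian for "completely agree", "rather agree", "rather disagree" and "completely disagree".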

> L <- c('полностью согласен','скорее согласен','скорее несогласен','полностью несогласен')
> L
[1] "полностью согласен"   "скорее согласен"      "скорее несогласен"    "полностью несогласен"
> a <- c("полностью согласен","скорее несогласен",NA,"полностью согласен","скорее согласен","полностью согласен")
> a
[1] "полностью согласен" "скорее несогласен"  NA       "полностью согласен" "скорее согласен"    "полностью согласен"
> A <- factor(a,levels=L,ordered=TRUE)
> A
[1] полностью согласен скорее несогласен  <NA>             полностью согласен скорее согласен    полностью согласен
Levels: полностью согласен < скорее согласен < скорее несогласен < полностью несогласен
> as.integer(A)
[1]  1  3 NA  1  2  1
>
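
When working with non-ASCII labels it can help to inspect how R stores a string. The sketch below uses a short Russian word of our own choosing, written with \u escapes so that the code does not depend on the encoding of the source file.

```r
s <- "\u0434\u0430"        # the Russian word for "yes"
s
nchar(s)                   # number of characters
nchar(s, type = "bytes")   # number of bytes used to store the string
Encoding(s)                # the encoding R has marked the string with
utf8ToInt(s)               # the Unicode code points as integers
```

The number of bytes usually exceeds the number of characters, because UTF-8 needs more than one byte for each Cyrillic letter.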

Metadata are information about the data (way of collection, time, place, remarks, authors, copyright, etc.). The tendency is to integrate the metadata with the data.
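
One simple way to keep metadata together with the data in R is to store them as attributes of the object. This is only a sketch: the attribute names "unit" and "collected" and all values are invented for illustration.

```r
# Invented measurements with metadata attached as attributes.
temp <- c(21.5, 22.1, 19.8)
attr(temp, "unit")      <- "degrees Celsius"
attr(temp, "collected") <- "2009-03-15"    # hypothetical collection date
comment(temp)           <- "Example data, not real measurements"

attributes(temp)     # all metadata stored with the object
attr(temp, "unit")   # a single metadata item
comment(temp)        # the free-text remark
mean(temp)           # the values can still be used as an ordinary vector
```

Note that many operations (for example subsetting with temp[1:2]) drop such attributes, so this approach suits small remarks better than full documentation.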
