====== Data ======
===== Variables and data tables =====
In a standard setting we have an (ordered) set of **units**. To describe them we select a set of their **properties** (attributes). Data are obtained by **measurement** of the properties on the units. The result is a **data table**:
^ ^ P1 ^ P2 ^ P3 ^ ... ^ Pm ^
^ U1 | v1,1 | v1,2 | v1,3 | ... | v1,m |
^ U2 | v2,1 | v2,2 | v2,3 | ... | v2,m |
^ U3 | v3,1 | v3,2 | v3,3 | ... | v3,m |
^ ... | | | | | |
^ Un | vn,1 | vn,2 | vn,3 | ... | vn,m |
In statistics, the columns of the data table, viewed as vectors, are called **variables**. For this reason the basic data structure in R is the vector, and most operations and functions work on whole vectors:
> a <- c(1,2,3,4,5)
> a
[1] 1 2 3 4 5
> b <- c(10,9,8,7,6)
> b
[1] 10 9 8 7 6
> b[4]
[1] 7
> b[c(1,3,5)]
[1] 10 8 6
> length(a)
[1] 5
> a+b
[1] 11 11 11 11 11
> a-b
[1] -9 -7 -5 -3 -1
> a*b
[1] 10 18 24 28 30
> sqrt(a)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
> 2**a
[1] 2 4 8 16 32
> summary(b)
Min. 1st Qu. Median Mean 3rd Qu. Max.
6 7 8 8 9 10
> 3*a
[1] 3 6 9 12 15
> sum(b)
[1] 40
> cumsum(b)
[1] 10 19 27 34 40
> integer(10)
[1] 0 0 0 0 0 0 0 0 0 0
> 1:9
[1] 1 2 3 4 5 6 7 8 9
> rep(c(0,1),5)
[1] 0 1 0 1 0 1 0 1 0 1
> seq(1,3,1/3)
[1] 1.000000 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000
> c(rep(0,5),rep(1,7))
[1] 0 0 0 0 0 1 1 1 1 1 1 1
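The data table itself can be represented in R as a data frame, in which each column is a variable (a vector) and each row is a unit. The following is only a minimal sketch: the unit names, the variable names and all values are invented for illustration.

```r
# Hypothetical measurements of two properties on three units
P1 <- c(1.5, 2.0, 3.5)
P2 <- c(10, 20, 30)
D <- data.frame(P1, P2, row.names = c("U1", "U2", "U3"))

D              # the whole data table
D$P1           # one variable (column) as a vector
D["U2", ]      # all measurements on unit U2
D["U2", "P2"]  # the single value v2,2
dim(D)         # number of units and number of variables
```

Because every column is an ordinary vector, the vector operations shown above (sum, summary, arithmetic) apply to each variable of the table.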
===== Measurement =====
Not all properties of units can be measured by real numbers, as is usual in physics and geometry. Measurements are made on different **measurement scales** ([[wp>Level_of_measurement]], [[https://www.cambridge.org/core/books/measurement-theory/7D75B72C3E5FA676EA7AD6AB4D8DF4A7|Roberts]]):
  * **numerical** scales:
    * **absolute** (counts)
    * **ratio** (weight, length, price, temperature in K)
    * **interval** (temperature in °F, date)
  * **ordinal** (grades: excellent, good, average, poor, unsatisfactory; temperature: very cold, cold, cool, warm, hot, very hot)
  * **nominal** or **categorical** (nationality, religious preference: Buddhist, Muslim, Christian, Jewish, Other)
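The type of the measurement scale determines which operations on the corresponding variables are meaningful. For example, the appropriate central element depends on the scale: the geometric mean requires a ratio scale, the (arithmetic) mean an interval scale, the median an ordinal scale, while the mode is meaningful already on a nominal scale.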
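The central elements named above can be computed in R roughly as follows. This is a sketch on an invented vector of ratio-scale values. Base R has no function for the statistical mode (the function mode() returns the storage mode of an object), so the mode is taken from a frequency table here.

```r
x <- c(2, 4, 4, 8, 16)          # hypothetical ratio-scale values

gm <- exp(mean(log(x)))         # geometric mean (needs positive values)
am <- mean(x)                   # arithmetic mean
me <- median(x)                 # median

# Mode: the most frequent value(s), read off a frequency table
tab <- table(x)
modes <- names(tab)[tab == max(tab)]

gm; am; me; modes
```

Note that the mode is returned as a character label and that several values can share the highest frequency, in which case modes has more than one element.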
===== Nominal and ordinal variables in R =====
Values measured on numerical scales are represented by integers or real numbers. Integers are also often used as codes for ordinal and nominal values, but not all numerical operations on such codes are meaningful. In R, nominal and ordinal variables are represented by factors:
> t <- c("F","D","F","I","A","F","F","D","B","SLO","I","F","GB","B")
> T <- factor(t)
> t
[1] "F" "D" "F" "I" "A" "F" "F" "D" "B" "SLO" "I" "F" "GB" "B"
> T
[1] F D F I A F F D B SLO I F GB B
Levels: A B D F GB I SLO
> as.integer(T)
[1] 4 3 4 6 1 4 4 3 2 7 6 4 5 2
> levels(T)
[1] "A" "B" "D" "F" "GB" "I" "SLO"
> levels(T)[as.integer(T)]
[1] "F" "D" "F" "I" "A" "F" "F" "D" "B" "SLO" "I" "F" "GB" "B"
> which(T=="F")
[1] 1 3 6 7 12
> table(T)
T
A B D F GB I SLO
1 2 2 5 1 2 1
> L <- c("unsatisfactory","poor","average","good","excellent")
> s <- c("unsatisfactory","good","good","average")
> k <- factor(s,levels=L,ordered=TRUE)
> k
[1] unsatisfactory good good average
Levels: unsatisfactory < poor < average < good < excellent
> as.integer(k)
[1] 1 4 4 3
The values in vector t are international vehicle registration codes; see the [[wp>List_of_international_vehicle_registration_codes]].
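Declaring a factor as ordered tells R which operations make sense for it. A short sketch, reusing the grades from above: comparisons and min/max follow the order of the levels, while frequency counts are available for any factor.

```r
L <- c("unsatisfactory", "poor", "average", "good", "excellent")
s <- c("unsatisfactory", "good", "good", "average")
k <- factor(s, levels = L, ordered = TRUE)

k >= "average"    # comparison follows the order of the levels
max(k)            # the best grade that occurs in k
table(k)          # frequencies, including levels that do not occur
```

For an unordered factor such as T above, only the frequency table is meaningful; comparisons with < or >= are not defined for it.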
==== Test ====
Compute the empirical probability distribution and the cumulative empirical probability distribution of the values in vector t. [[ru:7iss:labs:s:s1|Solution]]
===== Special values =====
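Special values are needed for **missing data** and **not applicable** measurements, which R codes as NA, and for numerical results that are infinite (Inf, -Inf) or undefined (NaN):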
> 1^2009
[1] 1
> 2009^0
[1] 1
> 0^0
[1] 1
> 1/0
[1] Inf
> -3/0
[1] -Inf
> 5/Inf
[1] 0
> Inf+Inf
[1] Inf
> 2009*Inf
[1] Inf
> Inf*Inf
[1] Inf
> sqrt(Inf)
[1] Inf
> Inf-Inf
[1] NaN
> 0/0
[1] NaN
> 0*Inf
[1] NaN
> Inf/Inf
[1] NaN
> 3%%0
[1] NaN
> sqrt(-2)
[1] NaN
Warning message:
In sqrt(-2) : NaNs produced
> 3+sqrt(NA)
[1] NA
> 1e308
[1] 1e+308
> 1e309
[1] Inf
> 1e-323
[1] 9.881313e-324
> 1e-324
[1] 0
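When such values occur in real data they have to be detected and handled explicitly. The following sketch, on an invented vector, uses the base R functions is.na, is.nan and is.finite and the na.rm argument of mean.

```r
x <- c(4, NA, 7, Inf)        # hypothetical measurements with special values

is.na(x)                     # TRUE only for the missing value
is.finite(x)                 # FALSE for NA and for Inf

m_all   <- mean(x)                 # missing: the NA propagates
m_narm  <- mean(x, na.rm = TRUE)   # Inf: the NA is dropped, Inf is kept
m_clean <- mean(x[is.finite(x)])   # mean of the ordinary numbers only

is.na(c(NA, 0/0))            # is.na is TRUE for NA and for NaN
is.nan(c(NA, 0/0))           # is.nan is TRUE only for NaN
```

Filtering with is.finite is the stricter choice, since na.rm = TRUE removes NA and NaN but leaves infinite values in the computation.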
===== Unicode =====
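R character strings can contain Unicode characters, so the labels of values need not be in English. The Russian labels below mean: fully agree, rather agree, rather disagree, fully disagree.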
> L <- c('полностью согласен','скорее согласен','скорее несогласен','полностью несогласен')
> L
[1] "полностью согласен" "скорее согласен" "скорее несогласен" "полностью несогласен"
> a <- c("полностью согласен","скорее несогласен",NA,"полностью согласен","скорее согласен","полностью согласен")
> a
[1] "полностью согласен" "скорее несогласен" NA "полностью согласен" "скорее согласен" "полностью согласен"
> A <- factor(a,levels=L,ordered=TRUE)
> A
[1] полностью согласен скорее несогласен <NA> полностью согласен скорее согласен полностью согласен
Levels: полностью согласен < скорее согласен < скорее несогласен < полностью несогласен
> as.integer(A)
[1] 1 3 NA 1 2 1
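With Unicode text the number of characters and the number of bytes of a string differ. A small sketch, assuming the string is stored in UTF-8 (enc2utf8 is used to make sure of that); the word is written with \u escapes so that the example does not depend on the encoding of the source file.

```r
# The word "скорее" written with Unicode escapes and stored as UTF-8
w <- enc2utf8("\u0441\u043a\u043e\u0440\u0435\u0435")

nchar(w, type = "chars")   # number of characters
nchar(w, type = "bytes")   # number of bytes; Cyrillic letters take 2 bytes each in UTF-8
utf8ToInt(w)               # the Unicode code points of the characters
```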
===== Metadata =====
**Metadata** are information about the data (method of collection, time, place, remarks, authors, copyright, etc.). The current tendency is to store the metadata together with the data.
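In R one simple way to keep metadata together with the data is to attach them to the object as attributes. The sketch below uses the base functions attr, comment and attributes; the attribute names units and collected, and all values, are invented for illustration.

```r
temp <- c(12.5, 14.1, 13.8)              # hypothetical measurements

# All metadata values below are invented for illustration
attr(temp, "units")     <- "degrees Celsius"
attr(temp, "collected") <- "2009-03-15"
comment(temp)           <- "Measured at a hypothetical station"

attributes(temp)   # the metadata stored with the vector
comment(temp)      # the free-text remark
mean(temp)         # the vector still behaves as ordinary data
```

Such attributes are fragile: many operations, subsetting for example, return a result without them, so they are best treated as a lightweight annotation rather than a complete metadata solution.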