In a standard setting we have an (ordered) set of units. To describe them we select a set of their properties (attributes). Data are obtained by measuring the properties on the units. The result is a data table in which the entry vi,j is the value of property Pj measured on unit Ui:
| | P1 | P2 | P3 | … | Pm |
|---|---|---|---|---|---|
| U1 | v1,1 | v1,2 | v1,3 | … | v1,m |
| U2 | v2,1 | v2,2 | v2,3 | … | v2,m |
| U3 | v3,1 | v3,2 | v3,3 | … | v3,m |
| … | | | | | |
| Un | vn,1 | vn,2 | vn,3 | … | vn,m |
In statistics, the columns of the data table (vectors) are called variables. For this reason the basic data structure in R is a vector, and most operations and functions work on whole vectors, element by element.
    > a <- c(1,2,3,4,5)
    > a
    [1] 1 2 3 4 5
    > b <- c(10,9,8,7,6)
    > b
    [1] 10  9  8  7  6
    > b[4]
    [1] 7
    > b[c(1,3,5)]
    [1] 10  8  6
    > length(a)
    [1] 5
    > a+b
    [1] 11 11 11 11 11
    > a-b
    [1] -9 -7 -5 -3 -1
    > a*b
    [1] 10 18 24 28 30
    > sqrt(a)
    [1] 1.000000 1.414214 1.732051 2.000000 2.236068
    > 2**a
    [1]  2  4  8 16 32
    > summary(b)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
          6       7       8       8       9      10
    > 3*a
    [1]  3  6  9 12 15
    > sum(b)
    [1] 40
    > cumsum(b)
    [1] 10 19 27 34 40
    > integer(10)
    [1] 0 0 0 0 0 0 0 0 0 0
    > 1:9
    [1] 1 2 3 4 5 6 7 8 9
    > rep(c(0,1),5)
    [1] 0 1 0 1 0 1 0 1 0 1
    > seq(1,3,1/3)
    [1] 1.000000 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000
    > c(rep(0,5),rep(1,7))
    [1] 0 0 0 0 0 1 1 1 1 1 1 1
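The data table itself can be held in R as a data frame, whose columns are vectors of equal length. The following is a minimal sketch; the unit names, property names and values are hypothetical and serve only as an illustration.

```r
# Hypothetical data: three units (rows) measured on three properties (columns).
d <- data.frame(
  height  = c(172, 165, 180),           # numerical property
  weight  = c(68.5, 59.0, 82.3),        # numerical property
  country = factor(c("F", "D", "SLO")), # nominal property
  row.names = c("U1", "U2", "U3")
)

dim(d)             # number of units and number of properties
d$height           # one variable: a column, returned as a vector
d["U2", ]          # all measurements on one unit: a row
d["U2", "weight"]  # a single value v[i,j]
mean(d$height)     # vector functions apply directly to a column
```

Selecting a column with `$` returns an ordinary vector, so everything shown above for `a` and `b` applies to it unchanged.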
Not all properties of units can be measured by real numbers, as they are in physics and geometry. Measurements are made on different measurement scales (Level_of_measurement, Roberts).
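The type of the measurement scale determines what we can do with the corresponding variables. For example, the appropriate measure of the central element depends on the scale: the geometric mean requires a ratio scale, the (arithmetic) mean an interval scale, the median an ordinal scale, while the mode is meaningful even on a nominal scale.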
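The four central elements mentioned above can be computed in R as sketched here. The data are made up, and `stat_mode` is a hypothetical helper written for this illustration, since base R has no function that returns the statistical mode (the built-in `mode()` reports the storage mode of an object).

```r
# Hypothetical positive measurements on a ratio scale.
x <- c(2, 4, 4, 8, 16)

arith_mean <- mean(x)            # arithmetic mean
geo_mean   <- exp(mean(log(x)))  # geometric mean, defined for positive values
med        <- median(x)          # median

# Hypothetical helper: the most frequent value(s), returned as character strings.
stat_mode <- function(v) {
  counts <- table(v)
  names(counts)[as.vector(counts) == max(counts)]
}

stat_mode(x)                      # mode of the numeric data
stat_mode(c("F", "D", "F", "I"))  # the mode also works for nominal values
```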
Values measured on a numerical scale are represented by integers or real numbers. Integers are often also used as codes for ordinal and nominal values, but not all numerical operations on such codes are meaningful. In R, nominal and ordinal variables are represented by factors.
    > t <- c("F","D","F","I","A","F","F","D","B","SLO","I","F","GB","B")
    > T <- factor(t)
    > t
     [1] "F"   "D"   "F"   "I"   "A"   "F"   "F"   "D"   "B"   "SLO" "I"   "F"   "GB"  "B"
    > T
     [1] F   D   F   I   A   F   F   D   B   SLO I   F   GB  B
    Levels: A B D F GB I SLO
    > as.integer(T)
     [1] 4 3 4 6 1 4 4 3 2 7 6 4 5 2
    > levels(T)
    [1] "A"   "B"   "D"   "F"   "GB"  "I"   "SLO"
    > levels(T)[as.integer(T)]
     [1] "F"   "D"   "F"   "I"   "A"   "F"   "F"   "D"   "B"   "SLO" "I"   "F"   "GB"  "B"
    > which(T=="F")
    [1]  1  3  6  7 12
    > table(T)
    T
      A   B   D   F  GB   I SLO
      1   2   2   5   1   2   1
    > L <- c("unsatisfactory","poor","average","good","excellent")
    > s <- c("unsatisfactory","good","good","average")
    > k <- factor(s,levels=L,ordered=TRUE)
    > k
    [1] unsatisfactory good           good           average
    Levels: unsatisfactory < poor < average < good < excellent
    > as.integer(k)
    [1] 1 4 4 3
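The remark that not every numerical operation is meaningful on coded values can be checked directly. This is a small sketch with made-up data: R allows comparisons on ordered factors and refuses arithmetic on factors, but it cannot stop us from computing on the integer codes once they are extracted.

```r
L <- c("unsatisfactory", "poor", "average", "good", "excellent")
k <- factor(c("unsatisfactory", "good", "good", "average"),
            levels = L, ordered = TRUE)

# Ordinal scale: order comparisons and max/min are meaningful.
at_least_average <- k >= "average"
best <- max(k)

# Nominal scale: only equality makes sense.
country <- factor(c("F", "D", "F", "I"))
is_F <- country == "F"

# Arithmetic on a factor is rejected: R returns NA and issues a warning.
r <- suppressWarnings(country + 1)

# The integer codes can be averaged, but the result says nothing about countries.
meaningless <- mean(as.integer(country))
```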
Exercise: compute the empirical probability distribution and the cumulative empirical probability distribution of the values in vector t.
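One possible approach to this exercise, given as a sketch rather than as the official solution: count the values with `table`, divide by the number of observations, and accumulate with `cumsum`. The names `counts`, `p` and `P` are chosen here for illustration.

```r
t <- c("F","D","F","I","A","F","F","D","B","SLO","I","F","GB","B")

counts <- table(t)           # absolute frequencies of the distinct values
p <- counts / length(t)      # empirical probability distribution
P <- cumsum(p)               # cumulative empirical probability distribution

round(p, 3)
round(P, 3)
```

Because the values of `t` are nominal, the cumulative distribution depends on the (alphabetical) order of the levels, which is arbitrary; it is more informative for ordinal or numerical variables.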
R has special values for missing data and not applicable measurements (NA), for results that are not a number (NaN), and for infinity (Inf, -Inf).
    > 1^2009
    [1] 1
    > 2009^0
    [1] 1
    > 0^0
    [1] 1
    > 1/0
    [1] Inf
    > -3/0
    [1] -Inf
    > 5/Inf
    [1] 0
    > Inf+Inf
    [1] Inf
    > 2009*Inf
    [1] Inf
    > Inf*Inf
    [1] Inf
    > sqrt(Inf)
    [1] Inf
    > Inf-Inf
    [1] NaN
    > 0/0
    [1] NaN
    > 0*Inf
    [1] NaN
    > Inf/Inf
    [1] NaN
    > 3%%0
    [1] NaN
    > sqrt(-2)
    [1] NaN
    Warning message:
    In sqrt(-2) : NaNs produced
    > 3+sqrt(NA)
    [1] NA
    > 1e308
    [1] 1e+308
    > 1e309
    [1] Inf
    > 1e-323
    [1] 9.881313e-324
    > 1e-324
    [1] 0
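In practice these special values have to be detected and handled before summarizing a variable. The sketch below uses a made-up vector and only base R functions (`is.na`, `is.nan`, `is.finite`, and the `na.rm` argument of `mean`).

```r
# Hypothetical measurements containing a missing value, a NaN and an infinity.
x <- c(4, NA, 7, 0/0, 1/0)

missing  <- is.na(x)      # TRUE for both NA and NaN
not_num  <- is.nan(x)     # TRUE only for NaN
finite   <- is.finite(x)  # TRUE only for ordinary numbers

m_all    <- mean(x)                # missing values propagate into the result
m_narm   <- mean(x, na.rm = TRUE)  # drops NA and NaN, but Inf is still included
m_finite <- mean(x[finite])        # mean of the ordinary numbers only
```

Note that `na.rm = TRUE` does not remove infinite values, so filtering with `is.finite` is the safer choice when they may occur.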
Missing values can also occur in factors, and the level labels need not be ASCII:

    > # Russian labels: "fully agree", "rather agree", "rather disagree", "fully disagree"
    > L <- c('полностью согласен','скорее согласен','скорее несогласен','полностью несогласен')
    > L
    [1] "полностью согласен"   "скорее согласен"      "скорее несогласен"    "полностью несогласен"
    > a <- c("полностью согласен","скорее несогласен",NA,"полностью согласен","скорее согласен","полностью согласен")
    > a
    [1] "полностью согласен" "скорее несогласен"  NA                   "полностью согласен" "скорее согласен"    "полностью согласен"
    > A <- factor(a,levels=L,ordered=TRUE)
    > A
    [1] полностью согласен скорее несогласен  <NA>               полностью согласен скорее согласен    полностью согласен
    Levels: полностью согласен < скорее согласен < скорее несогласен < полностью несогласен
    > as.integer(A)
    [1]  1  3 NA  1  2  1
Metadata are information about the data (way of collection, time, place, remarks, authors, copyright, etc.). The tendency is to integrate the metadata with the data.
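One simple way to keep metadata together with the data in R is to store them as attributes of the object. This is only a sketch: the attribute names `units` and `collected` and their values are hypothetical, and other conventions are possible.

```r
# Hypothetical variable with attached metadata.
height <- c(172, 165, 180)
attr(height, "units")     <- "cm"
attr(height, "collected") <- "hypothetical survey"
comment(height)           <- "Self-reported values; illustration only."

attributes(height)      # lists all attached metadata
attr(height, "units")   # retrieves one item
mean(height)            # computations on the values still work

# Subsetting drops most attributes, so metadata may have to be re-attached.
part <- height[1:2]
```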