====== Data ======

===== Variables and data tables =====

In a standard setting we have an (ordered) set of **units**. To describe them we select a set of their **properties** (attributes). Data are obtained by **measurement** of properties on units. The result is a **data table**

^      ^ P1   ^ P2   ^ P3   ^ ... ^ Pm   ^
^ U1   | v1,1 | v1,2 | v1,3 | ... | v1,m |
^ U2   | v2,1 | v2,2 | v2,3 | ... | v2,m |
^ U3   | v3,1 | v3,2 | v3,3 | ... | v3,m |
^ ...  |      |      |      |     |      |
^ Un   | vn,1 | vn,2 | vn,3 | ... | vn,m |

In statistics, the columns of the data table, viewed as vectors, are called **variables**. For this reason the basic data structure in R is the vector, and most operations and functions work vector-wise.

  > a <- c(1,2,3,4,5)
  > a
  [1] 1 2 3 4 5
  > b <- c(10,9,8,7,6)
  > b
  [1] 10  9  8  7  6
  > b[4]
  [1] 7
  > b[c(1,3,5)]
  [1] 10  8  6
  > length(a)
  [1] 5
  > a+b
  [1] 11 11 11 11 11
  > a-b
  [1] -9 -7 -5 -3 -1
  > a*b
  [1] 10 18 24 28 30
  > sqrt(a)
  [1] 1.000000 1.414214 1.732051 2.000000 2.236068
  > 2**a
  [1]  2  4  8 16 32
  > summary(b)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
        6       7       8       8       9      10 
  > 3*a
  [1]  3  6  9 12 15
  > sum(b)
  [1] 40
  > cumsum(b)
  [1] 10 19 27 34 40
  > integer(10)
   [1] 0 0 0 0 0 0 0 0 0 0
  > 1:9
  [1] 1 2 3 4 5 6 7 8 9
  > rep(c(0,1),5)
   [1] 0 1 0 1 0 1 0 1 0 1
  > seq(1,3,1/3)
  [1] 1.000000 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000
  > c(rep(0,5),rep(1,7))
   [1] 0 0 0 0 0 1 1 1 1 1 1 1

===== Measurement =====

Not all properties of units can be measured by real numbers, as they are in physics and geometry. Measurements are made on different **measurement scales** ([[wp>Level_of_measurement]], [[https://www.cambridge.org/core/books/measurement-theory/7D75B72C3E5FA676EA7AD6AB4D8DF4A7|Roberts]]):
  * **numerical** scales:
    * **absolute** (counts)
    * **ratio** (weight, length, price, temperature in kelvins)
    * **interval** (temperature in degrees Fahrenheit, date)
  * **ordinal** (grades: excellent, good, average, poor, unsatisfactory; temperature: very cold, cold, cool, warm, hot, very hot)
  * **nominal** or **categorical** (nationality, religious preference: Buddhist, Muslim, Christian, Jewish, Other)

The type of measurement scale determines which operations are meaningful for the corresponding variables. Take the central element as an example: the geometric mean requires a ratio scale, the (arithmetic) mean an interval scale, and the median an ordinal scale, while the mode is meaningful even on a nominal scale.

===== Nominal and ordinal variables in R =====

Values measured on numerical scales are represented by integers or real numbers. Integers are often also used (as codes) to represent ordinal and nominal values, but not all numerical operations on such codes are meaningful. In R, nominal and ordinal variables are represented by factors.

  > t <- c("F","D","F","I","A","F","F","D","B","SLO","I","F","GB","B")
  > T <- factor(t)
  > t
   [1] "F"   "D"   "F"   "I"   "A"   "F"   "F"   "D"   "B"   "SLO" "I"   "F"   "GB"  "B"
  > T
   [1] F   D   F   I   A   F   F   D   B   SLO I   F   GB  B
  Levels: A B D F GB I SLO
  > as.integer(T)
   [1] 4 3 4 6 1 4 4 3 2 7 6 4 5 2
  > levels(T)
  [1] "A"   "B"   "D"   "F"   "GB"  "I"   "SLO"
  > levels(T)[as.integer(T)]
   [1] "F"   "D"   "F"   "I"   "A"   "F"   "F"   "D"   "B"   "SLO" "I"   "F"   "GB"  "B"
  > which(T=="F")
  [1]  1  3  6  7 12
  > table(T)
  T
    A   B   D   F  GB   I SLO 
    1   2   2   5   1   2   1 
  > L <- c("unsatisfactory","poor","average","good","excellent")
  > s <- c("unsatisfactory","good","good","average")
  > k <- factor(s,levels=L,ordered=TRUE)
  > k
  [1] unsatisfactory good           good           average
  Levels: unsatisfactory < poor < average < good < excellent
  > as.integer(k)
  [1] 1 4 4 3

The values in vector t are international vehicle registration codes; see the [[wp>List_of_international_vehicle_registration_codes]].

==== Test ====

Compute the empirical probability distribution and the cumulative empirical probability distribution of the values in vector t. [[ru:7iss:labs:s:s1|Solution]]

===== Special values =====

Special values are needed for **missing data** and **not applicable** measurements; in R both are usually coded as NA. Arithmetic can also produce infinity (Inf, -Inf) and undefined results (NaN).
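As a minimal sketch of how missing values behave in practice: NA propagates through arithmetic, is.na() locates the missing entries, and many summary functions accept na.rm = TRUE to skip them. The vector x below is made-up illustration data, not taken from the text.

```r
# Hypothetical measurements with one missing value (illustration only)
x <- c(12.5, NA, 7.0, 9.5)

is.na(x)                      # flags the missing entry
n_missing <- sum(is.na(x))    # number of missing values

mean(x)                       # NA: the missing value propagates
m <- mean(x, na.rm = TRUE)    # mean of the observed values only

# NaN counts as missing for is.na(), but NA is not NaN
is.na(NaN)
is.nan(NA)

# is.finite() is FALSE for Inf, NA and NaN alike
is.finite(c(1, Inf, NA, NaN))
```

Whether dropping missing values with na.rm = TRUE is appropriate depends on why they are missing; the sketch only shows the mechanics.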
  > 1^2009
  [1] 1
  > 2009^0
  [1] 1
  > 0^0
  [1] 1
  > 1/0
  [1] Inf
  > -3/0
  [1] -Inf
  > 5/Inf
  [1] 0
  > Inf+Inf
  [1] Inf
  > 2009*Inf
  [1] Inf
  > Inf*Inf
  [1] Inf
  > sqrt(Inf)
  [1] Inf
  > Inf-Inf
  [1] NaN
  > 0/0
  [1] NaN
  > 0*Inf
  [1] NaN
  > Inf/Inf
  [1] NaN
  > 3%%0
  [1] NaN
  > sqrt(-2)
  [1] NaN
  Warning message:
  In sqrt(-2) : NaNs produced
  > 3+sqrt(NA)
  [1] NA
  > 1e308
  [1] 1e+308
  > 1e309
  [1] Inf
  > 1e-323
  [1] 9.881313e-324
  > 1e-324
  [1] 0

===== Unicode =====

Character values and factor levels may contain Unicode text. The example below uses Russian response categories ("fully agree", "rather agree", "rather disagree", "fully disagree"); a missing answer stays NA in the factor and in its integer codes.

  > L <- c('полностью согласен','скорее согласен','скорее несогласен','полностью несогласен')
  > L
  [1] "полностью согласен"   "скорее согласен"      "скорее несогласен"    "полностью несогласен"
  > a <- c("полностью согласен","скорее несогласен",NA,"полностью согласен","скорее согласен","полностью согласен")
  > a
  [1] "полностью согласен" "скорее несогласен"  NA                   "полностью согласен" "скорее согласен"    "полностью согласен"
  > A <- factor(a,levels=L,ordered=TRUE)
  > A
  [1] полностью согласен скорее несогласен  <NA>               полностью согласен скорее согласен    полностью согласен
  Levels: полностью согласен < скорее согласен < скорее несогласен < полностью несогласен
  > as.integer(A)
  [1]  1  3 NA  1  2  1

**Metadata** are information about the data (method of collection, time, place, remarks, authors, copyright, etc.). The current tendency is to store the metadata together with the data.
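One way to keep metadata together with the data in R is to attach attributes to the object. The following is only a sketch: the vector temp, the attribute names units and collected, and all values are hypothetical; attr(), attributes() and comment() are base R functions.

```r
# Hypothetical data vector (illustration only)
temp <- c(21.5, 19.0, 23.2)

# Attach metadata as attributes; the names "units" and "collected" are arbitrary
attr(temp, "units") <- "degrees Celsius"
attr(temp, "collected") <- "2009-03-15"

# comment() stores a free-text remark that is not shown when the object is printed
comment(temp) <- "Example values, not real measurements"

attr(temp, "units")        # read one attribute back
comment(temp)              # read the remark back
names(attributes(temp))    # list the names of all attached attributes
```

Many operations, including subsetting with [, drop such attributes, so this approach suits short annotations rather than full documentation.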