Project 2

Analysis of textual data

We would like to explore whether the word frequency distributions of the books are characteristic of their author(s).

Select at least 50 books (from Project Gutenberg or some other source) in the same language, organized into at least 4 groups of books by the same author (for example Jane Austen, Frank Baum, Agatha Christie, James Fenimore Cooper, Jerome Klapka Jerome, Edgar Allan Poe, …) or of the same genre.

For each book b, determine the frequency distribution fb of its words. Sb = sum( fb[w] : w ∈ b ) is the total number of words in book b. Let F be the joint frequency distribution of words over all selected books, and let Wk be the set of the k most frequent words in F. Create a description xb of each book b as a vector (an initial part of the probability distribution) with elements, for w ∈ Wk,

xb[w] = fb[w] / Sb

These vectors are the rows of a data frame Xk with the words from Wk as columns. Analyze the resulting data frame Xk using clustering for two values of k (for example, k=50 and k=200). How does the cluster structure change? Do the obtained clusters reflect the selected groups (authors, genres)? Which words are specific to the obtained clusters? …
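As an illustration, here is a minimal sketch of assembling Xk and clustering it hierarchically. It assumes a (hypothetical) named list freqs of per-book frequency distributions and a vector select holding the k most frequent words; both can be obtained as described in the hints below.

# freqs: hypothetical named list of per-book frequency distributions fb
# select: the k most frequent words of the joint distribution F
Xk <- t(sapply(freqs, function(fb) {
  qb <- fb[select]                 # frequencies of the k selected words
  qb[is.na(qb)] <- 0               # words absent from the book get 0
  qb / sum(fb)                     # xb[w] = fb[w] / Sb
}))
colnames(Xk) <- select
Xk <- as.data.frame(Xk)

hc <- hclust(dist(Xk), method="ward.D2")   # Euclidean distance, Ward linkage
plot(hc, hang=-1, cex=0.6)                 # compare dendrograms for k=50, k=200

Ward linkage is only one choice; complete linkage or kmeans(Xk, centers=...) are alternatives worth comparing.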

Using (part of) the probability distribution as the unit description makes the units comparable. But, as we know, the probabilities approximately follow Zipf's law. Applying the Euclidean distance directly to the data frame Xk would therefore favor the influence of high-frequency words. An alternative is to use the standardized data frame Yk = scale(Xk) or, considering Zipf's law, the data frame Zk, where Zk[,i] = i * Xk[,i] for i = 1, …, k. Is it better to cluster the original data frame Xk or its “standardized” version Yk or Zk?
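For example, the two rescalings can be computed as follows (a sketch, assuming the columns of Xk are ordered by decreasing overall frequency, so that column i corresponds to rank i):

Yk <- as.data.frame(scale(Xk))              # each column centered and scaled
Zk <- sweep(Xk, 2, seq_len(ncol(Xk)), `*`)  # Zk[,i] = i * Xk[,i]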

Some hints

1. Avoid translations - the obtained distribution is characteristic of the translator.

2. Jack London: A Daughter of the Snows has a nonstandard start and end of the text. Improvements to the program from the slides:

# locate the Project Gutenberg "*** START OF" / "*** END OF" marker lines
i <- grep("\\*\\*\\* ?START OF",text,ignore.case=TRUE)
j <- grep("\\*\\*\\* ?END OF",text,ignore.case=TRUE)

3. Some additional delimiters (”, “, —, ‘, ’) appear in the texts and, for example in Virginia Woolf: The Voyage Out, some words are in the Greek alphabet. Changes to the program from the slides:

library(magrittr)   # provides the pipe %>% and the tee operator %T>%

separator <- "([[:punct:]]|[[:space:]]|”|“|—|‘|’)+"
book %>% strsplit(separator) %>% unlist %>% .[nchar(.)>0] %T>%
   {cat("words =",length(.),"\n")} %>% tolower %>% table %>%
   .[grepl("[[:alnum:]]+",names(.))] -> z   # keep only names with alphanumerics

The function table returns a frequency distribution with names in alphabetical order. This is also convenient for combining (summing) two distributions by merging, to get the joint frequency distribution T of all books. See the merge algorithm: when combining frequency distributions, in the case head(A) == head(B) we have to sum the corresponding frequencies. See merging distributions.
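A sketch that produces the same result as the pairwise merge, using vectorized R instead of an explicit merge loop (the function name joinDist and the list bookDists are made up here; A and B are frequency distributions as returned by table):

joinDist <- function(A, B) {
  words <- sort(union(names(A), names(B)))  # union of words, alphabetical
  C <- setNames(numeric(length(words)), words)
  C[names(A)] <- A                          # frequencies from A
  C[names(B)] <- C[names(B)] + B            # add frequencies from B
  C
}
T <- Reduce(joinDist, bookDists)            # bookDists: list of all fb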

Only the joint distribution T needs to be sorted to get the k most frequent words:

select <- names(sort(T,decreasing=TRUE))[1:k]   # the k most frequent words

From a book's frequency distribution Pb you can get the frequency vector Qb for the selected words by

Qb <- Pb[select]

Attention: if a selected word w does not appear in Pb, the corresponding entry of Qb is NA and names(Qb)[w] == NA. You have to set the corresponding frequencies to 0 and names(Qb) <- select.
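A minimal sketch of this repair:

Qb <- Pb[select]             # NA entries for words missing from the book
Qb[is.na(Qb)] <- 0           # set the missing frequencies to 0
names(Qb) <- select          # restore the proper word names
xb <- Qb / sum(Pb)           # relative frequencies; sum(Pb) equals Sb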


