Project 2

Analysis of textual data

We would like to explore whether the word frequency distributions of the books are characteristic of their author(s).

Select at least 50 books (from Project Gutenberg or some other source) in the same language, organized into at least 4 groups of books by the same author (for example Jane Austen, Frank Baum, Agatha Christie, James Fenimore Cooper, Jerome Klapka Jerome, Edgar Allan Poe, …) or of the same genre.

For each book b, determine the frequency distribution fb of its words. Sb = sum( fb[w] : w ∈ b ) is the total number of words in book b. Let F be the joint frequency distribution of words over all selected books, and let Wk be the set of the k most frequent words in F. Create a description xb of each book b as a vector (an initial part of the probability distribution) with elements, for w ∈ Wk,

xb[w] = fb[w] / Sb

These vectors are the rows of a data frame Xk with the words from Wk as columns. Analyze the resulting data frame Xk using clustering for two values of k (for example, k=50 and k=200). How does the cluster structure change? Do the obtained clusters reflect the selected groups (authors, genres)? Which words are specific to the obtained clusters? …
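As an illustration, here is a minimal sketch of assembling Xk and clustering it hierarchically. It assumes a (hypothetical) named list freqs of per-book frequency distributions and a vector select holding the k most frequent words; both can be obtained as described in the hints below.

# freqs: hypothetical named list of per-book frequency distributions fb
# select: the k most frequent words of the joint distribution F
Xk <- t(sapply(freqs, function(fb) {
  qb <- fb[select]                 # frequencies of the k selected words
  qb[is.na(qb)] <- 0               # words absent from the book get 0
  qb / sum(fb)                     # xb[w] = fb[w] / Sb
}))
colnames(Xk) <- select
Xk <- as.data.frame(Xk)

hc <- hclust(dist(Xk), method="ward.D2")   # Euclidean distance, Ward linkage
plot(hc, hang=-1, cex=0.6)                 # compare dendrograms for k=50, k=200

Ward linkage is only one choice; complete linkage or kmeans(Xk, centers=...) are alternatives worth comparing.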

Using (part of) the probability distribution as the unit description makes the units comparable. But, as we know, the probabilities approximately follow Zipf's law. Applying the Euclidean distance directly to the data frame Xk would therefore favor the influence of high-frequency words. An alternative is to use the standardized data frame Yk = scale(Xk) or, considering Zipf's law, the data frame Zk, where Zk[,i] = i * Xk[,i] for i = 1, …, k. Is it better to cluster the original data frame Xk or its “standardized” version Yk or Zk?
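For example, the two rescalings can be computed as follows (a sketch, assuming the columns of Xk are ordered by decreasing overall frequency, so that column i corresponds to rank i):

Yk <- as.data.frame(scale(Xk))              # each column centered and scaled
Zk <- sweep(Xk, 2, seq_len(ncol(Xk)), `*`)  # Zk[,i] = i * Xk[,i]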

Some hints

1. Avoid translations - the obtained distribution is characteristic of the translator.

2. Jack London: A Daughter of the Snows has a nonstandard start and end of the text. Improvements to the program from the slides:

# locate the Project Gutenberg "*** START OF" / "*** END OF" marker lines
i <- grep("\\*\\*\\* ?START OF",text,ignore.case=TRUE)
j <- grep("\\*\\*\\* ?END OF",text,ignore.case=TRUE)

3. Some additional delimiters (”, “, —, ‘, ’) appear in the texts and, for example in Virginia Woolf: The Voyage Out, some words are in the Greek alphabet. Changes to the program from the slides:

library(magrittr)   # provides the pipe %>% and the tee operator %T>%

separator <- "([[:punct:]]|[[:space:]]|”|“|—|‘|’)+"
book %>% strsplit(separator) %>% unlist %>% .[nchar(.)>0] %T>%
   {cat("words =",length(.),"\n")} %>% tolower %>% table %>%
   .[grepl("[[:alnum:]]+",names(.))] -> z   # keep only names with alphanumerics

The function table returns a frequency distribution with names in alphabetical order. This is also convenient for combining (summing) two distributions by merging, to get the joint frequency distribution T of all books. See the merge algorithm: when combining frequency distributions, in the case head(A) == head(B) we have to sum the corresponding frequencies. See merging distributions.
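A sketch that produces the same result as the pairwise merge, using vectorized R instead of an explicit merge loop (the function name joinDist and the list bookDists are made up here; A and B are frequency distributions as returned by table):

joinDist <- function(A, B) {
  words <- sort(union(names(A), names(B)))  # union of words, alphabetical
  C <- setNames(numeric(length(words)), words)
  C[names(A)] <- A                          # frequencies from A
  C[names(B)] <- C[names(B)] + B            # add frequencies from B
  C
}
T <- Reduce(joinDist, bookDists)            # bookDists: list of all fb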

Only the joint distribution T needs to be sorted to get the k most frequent words:

select <- names(sort(T,decreasing=TRUE))[1:k]   # the k most frequent words

From a book's frequency distribution Pb you can get the frequency vector Qb for the selected words by

Qb <- Pb[select]

Attention: if a selected word w does not appear in Pb, the corresponding entry of Qb is NA and names(Qb)[w] == NA. You have to set the corresponding frequencies to 0 and names(Qb) <- select.
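A minimal sketch of this repair:

Qb <- Pb[select]             # NA entries for words missing from the book
Qb[is.na(Qb)] <- 0           # set the missing frequencies to 0
names(Qb) <- select          # restore the proper word names
xb <- Qb / sum(Pb)           # relative frequencies; sum(Pb) equals Sb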


