Project 2

Collecting textual information from the internet

Alternative A

Select a topic. Using Google (or some other means), prepare a list of at least 100 HTML pages relevant to the selected topic. Download these pages, extract the text from them, and save it on your disk.
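A minimal sketch of this step in R, assuming the collected URLs are stored in a plain-text file urls.txt (one URL per line) and the extracted texts are written to a folder texts/; both names, and the use of the rvest package, are illustrative choices rather than requirements.

  # Download the listed pages, strip the HTML, and save the plain text.
  library(rvest)

  urls <- readLines("urls.txt")                        # one page URL per line (assumed layout)
  dir.create("texts", showWarnings = FALSE)

  for (i in seq_along(urls)) {
    page <- tryCatch(read_html(urls[i]), error = function(e) NULL)
    if (is.null(page)) next                            # skip pages that cannot be downloaded
    txt <- html_text(page)                             # page content without HTML tags
    writeLines(txt, file.path("texts", sprintf("page%03d.txt", i)))
  }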

Make a list (topic vocabulary) of at least 20 terms (words/phrases) characteristic of the topic.

For each downloaded page, determine the frequency distribution f of occurrences of the terms from the list on that page. In this way you obtain a data frame with the pages as units (rows) and the selected terms as variables (columns).
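A minimal sketch of building this data frame, assuming the texts saved in the previous step; the three terms shown stand in for your own topic vocabulary of at least 20 terms, and matching is done by simple case-insensitive substring search.

  # Count occurrences of every vocabulary term in every saved page.
  terms <- c("cluster", "distance", "dendrogram")      # placeholders for your own list

  count_term <- function(txt, term) {
    hits <- gregexpr(term, txt, fixed = TRUE)[[1]]     # match positions, -1 if no match
    sum(hits > 0)
  }

  files <- list.files("texts", full.names = TRUE)
  texts <- vapply(files, function(fn)
    tolower(paste(readLines(fn, warn = FALSE), collapse = " ")), character(1))

  f <- t(sapply(texts, function(txt) sapply(terms, count_term, txt = txt)))
  f <- as.data.frame(f, row.names = basename(files))   # pages as rows, terms as columns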

Analyze it: for example, are there clusters of pages, what are their characteristics, what are the dependencies among the terms, etc.? (A sketch of such an analysis is given at the end of this alternative.)

Hint: fractional approach - normalization across units x[i,j] = f[i,j]/F[i], where F[i] = sum( f[i,j] : j ∈ 1:k )
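A direct realization of this hint in R, assuming f is the page-by-term data frame built above.

  row_totals <- rowSums(f)                 # F[i] in the hint's notation
  x <- sweep(f, 1, row_totals, "/")        # x[i,j] = f[i,j] / F[i]
  x <- x[row_totals > 0, ]                 # drop pages where no vocabulary term occurs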

Hint: you don't need the HTML page itself, only its textual content. There are several ways in R to get rid of HTML tags, for example the html_text() function from the rvest package (used in the sketch above).
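Returning to the analysis step above, a minimal sketch of the suggested analyses, assuming the normalized data frame x from the fractional-approach hint; the number of clusters (4) and the clustering method are illustrative choices.

  hc <- hclust(dist(x), method = "ward.D2")    # hierarchical clustering of the pages
  plot(hc, cex = 0.5)                          # dendrogram: look for groups of pages

  cl <- cutree(hc, k = 4)                      # e.g., 4 clusters of pages
  round(t(sapply(split(x, cl), colMeans)), 3)  # average term profile of each cluster

  round(cor(x), 2)                             # dependencies (correlations) among terms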

Alternative B

Select at least 100 books (from Project Gutenberg or some other source), all in the same language, forming groups that belong to the same author or the same type.
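A minimal sketch of collecting the books in R, under the assumption that the gutenbergr package is used (its gutenberg_works() and gutenberg_download() helpers); the two authors shown are placeholders for whatever groups you choose.

  library(gutenbergr)

  authors <- c("Austen, Jane", "Dickens, Charles")     # illustrative author groups
  meta <- gutenberg_works(author %in% authors, languages = "en")

  books <- gutenberg_download(meta$gutenberg_id, meta_fields = c("title", "author"))
  # 'books' holds one row per line of text, with columns gutenberg_id, text, title, author.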

For each book b, determine the frequency distribution f[b] of its words and S[b] = sum( f[b,w] : w ∈ b ), the total number of words in b. Let t be the joint frequency distribution of words over all selected books, and let W be the set of the k most frequent words in t. Create the description x[b] of each book b as a vector with elements

x[b,w] = f[b,w] / S[b], for w ∈ W
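A minimal sketch of these definitions in R, assuming each book's full text is available as one character string in a named list book_texts (an assumed name, for instance one element per downloaded book); k is set to 50 here, and the word splitting is deliberately crude.

  k <- 50

  tokenize <- function(txt) {                            # crude word splitting
    words <- unlist(strsplit(tolower(txt), "[^a-z']+"))
    words[words != ""]
  }

  word_lists <- lapply(book_texts, tokenize)             # words of each book b
  f <- lapply(word_lists, table)                         # f[b]: word frequencies of book b
  S <- sapply(word_lists, length)                        # S[b]: total number of words in b

  t_all <- table(unlist(word_lists))                     # t: joint frequency distribution
  W <- names(sort(t_all, decreasing = TRUE))[1:k]        # the k most frequent words

  x <- t(sapply(seq_along(f), function(b) as.numeric(f[[b]][W]) / S[b]))
  x[is.na(x)] <- 0                                       # words of W absent from a book
  colnames(x) <- W
  rownames(x) <- names(book_texts)
  x <- as.data.frame(x)                                  # x[b,w] = f[b,w] / S[b]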

Analyze the obtained data frame x using clustering for two values of k (for example, k = 50 and k = 100). How does the cluster structure change? Do the obtained clusters reflect the selected groups (authors, types)? Which words are specific to the obtained clusters? …
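A minimal sketch of this clustering step, assuming the data frame x built above with k = 50 and a vector group holding the known author/type label of every book (an assumed name); rebuilding x with k = 100 and rerunning the same code allows the two cluster structures to be compared.

  d  <- dist(x)                               # distances between book profiles
  hc <- hclust(d, method = "ward.D2")         # hierarchical clustering
  plot(hc, cex = 0.5)                         # dendrogram of the books

  cl <- cutree(hc, k = 4)                     # number of clusters (not the vocabulary size k)
  table(cl, group)                            # clusters versus the known groups

  # Words specific for a cluster: those with the largest mean share inside it.
  sort(colMeans(x[cl == 1, ]), decreasing = TRUE)[1:10]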

