Collecting textual information from the internet.
Select a topic. Using Google (or some other means), prepare a list of at least 100 HTML pages relevant to the selected topic. Download these pages, extract the text from them, and save it on your disk.
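A minimal sketch of this step in R, assuming the rvest package and a file urls.txt with one URL per line (the file name and the output directory "pages" are placeholders, not part of the assignment):

    library(rvest)

    urls <- readLines("urls.txt")              # the prepared list of page URLs
    dir.create("pages", showWarnings = FALSE)  # directory for the extracted texts

    for (i in seq_along(urls)) {
      txt <- tryCatch(
        html_text(read_html(urls[i])),         # download the page and keep only its text
        error = function(e) NA_character_      # skip pages that fail to download or parse
      )
      if (!is.na(txt)) {
        writeLines(txt, file.path("pages", paste0("page_", i, ".txt")))
      }
    }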
Make a list (a topic vocabulary) of at least 20 terms (words/phrases) characteristic of the topic.
For each downloaded page, determine how frequently each term from the list occurs in the page. In this way you obtain a data frame with pages as units (rows) and the selected terms as variables (columns).
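A possible way to build this data frame in R, assuming the text files produced above and the stringr package; the terms shown are placeholders for your own topic vocabulary:

    library(stringr)

    terms <- c("term one", "term two", "term three")   # at least 20 terms in the real assignment
    files <- list.files("pages", full.names = TRUE)

    # read each page and convert it to lower case
    texts <- tolower(vapply(files,
                            function(f) paste(readLines(f, warn = FALSE), collapse = " "),
                            character(1)))

    # f[i, j] = number of occurrences of term j in page i
    f    <- sapply(terms, function(t) str_count(texts, fixed(t)))
    freq <- as.data.frame(f, row.names = basename(files))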
Analyze it. For example: find clusters of pages and describe their characteristics, examine dependencies among the terms, etc.
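A sketch of possible analyses, assuming the freq data frame built above (normalizing the rows first, as in the hint below, is usually a good idea):

    d  <- dist(freq)                      # distances between pages
    hc <- hclust(d, method = "ward.D2")   # hierarchical clustering of the pages
    plot(hc, labels = FALSE)
    groups <- cutree(hc, k = 4)           # the number of clusters is your choice to justify

    # characterize the clusters by their average term frequencies
    aggregate(freq, by = list(cluster = groups), FUN = mean)

    # dependencies among terms
    round(cor(freq), 2)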
Hint: a fractional approach, i.e. normalization across units: x(i,j) = f(i,j) / F(i), where F(i) = sum( f(i,j) : j ∈ 1:k ) is the total number of vocabulary-term occurrences in page i.
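This normalization is one line of R on the freq data frame from above (pages in which none of the terms occurs are dropped to avoid dividing by zero):

    freq_nz <- freq[rowSums(freq) > 0, ]
    x <- freq_nz / rowSums(freq_nz)       # x[i, j] = f[i, j] / F(i)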
Hint: you do not need the HTML page itself, only its textual content. There are various solutions in R for getting rid of HTML tags (for example, here).
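One crude base-R possibility, assuming the raw page source is already in a character string (the variable html_source below is just an illustration; packages such as rvest/xml2, used in the first sketch above, handle malformed HTML more robustly):

    html_source <- "<html><body><h1>Title</h1><p>Some text.</p></body></html>"
    plain_text  <- gsub("<[^>]+>", " ", html_source)   # replace every tag with a space
    plain_text  <- gsub("\\s+", " ", plain_text)        # collapse repeated whitespace
    trimws(plain_text)                                  # "Title Some text."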