Collecting textual information from the internet.
Select a topic. Using Google (or some other means), prepare a list of at least 100 HTML pages relevant to the selected topic. Download these pages, extract the text from them, and save it on your disk.
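A minimal sketch of this step in R, assuming the rvest package and a file urls.txt with one URL per line (the file name and the output directory "pages" are placeholders, not part of the assignment):

    library(rvest)

    urls <- readLines("urls.txt")              # the prepared list of page URLs
    dir.create("pages", showWarnings = FALSE)  # directory for the extracted texts

    for (i in seq_along(urls)) {
      txt <- tryCatch(
        html_text(read_html(urls[i])),         # download the page and keep only its text
        error = function(e) NA_character_      # skip pages that fail to download or parse
      )
      if (!is.na(txt)) {
        writeLines(txt, file.path("pages", paste0("page_", i, ".txt")))
      }
    }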
Make a list (a topic vocabulary) of at least 20 terms (words/phrases) characteristic of the topic.
For each downloaded page, determine how frequently each term from the list occurs in the page. In this way you obtain a data frame with pages as units (rows) and the selected terms as variables (columns).
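A possible way to build this data frame in R, assuming the text files produced above and the stringr package; the terms shown are placeholders for your own topic vocabulary:

    library(stringr)

    terms <- c("term one", "term two", "term three")   # at least 20 terms in the real assignment
    files <- list.files("pages", full.names = TRUE)

    # read each page and convert it to lower case
    texts <- tolower(vapply(files,
                            function(f) paste(readLines(f, warn = FALSE), collapse = " "),
                            character(1)))

    # f[i, j] = number of occurrences of term j in page i
    f    <- sapply(terms, function(t) str_count(texts, fixed(t)))
    freq <- as.data.frame(f, row.names = basename(files))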
Analyze it. For example: find clusters of pages and describe their characteristics, examine dependencies among the terms, etc.
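A sketch of possible analyses, assuming the freq data frame built above (normalizing the rows first, as in the hint below, is usually a good idea):

    d  <- dist(freq)                      # distances between pages
    hc <- hclust(d, method = "ward.D2")   # hierarchical clustering of the pages
    plot(hc, labels = FALSE)
    groups <- cutree(hc, k = 4)           # the number of clusters is your choice to justify

    # characterize the clusters by their average term frequencies
    aggregate(freq, by = list(cluster = groups), FUN = mean)

    # dependencies among terms
    round(cor(freq), 2)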
Hint: a fractional approach, i.e. normalization across units: x(i,j) = f(i,j) / F(i), where F(i) = sum( f(i,j) : j ∈ 1:k ) is the total number of vocabulary-term occurrences in page i.
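This normalization is one line of R on the freq data frame from above (pages in which none of the terms occurs are dropped to avoid dividing by zero):

    freq_nz <- freq[rowSums(freq) > 0, ]
    x <- freq_nz / rowSums(freq_nz)       # x[i, j] = f[i, j] / F(i)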
Hint: you do not need the HTML page itself, only its textual content. There are various solutions in R for getting rid of HTML tags (for example, here).
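One crude base-R possibility, assuming the raw page source is already in a character string (the variable html_source below is just an illustration; packages such as rvest/xml2, used in the first sketch above, handle malformed HTML more robustly):

    html_source <- "<html><body><h1>Title</h1><p>Some text.</p></body></html>"
    plain_text  <- gsub("<[^>]+>", " ", html_source)   # replace every tag with a space
    plain_text  <- gsub("\\s+", " ", plain_text)        # collapse repeated whitespace
    trimws(plain_text)                                  # "Title Some text."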