It is a very rich dataset available at
http://www.census.gov/support/USACdataDownloads.html
We used it in
http://vladowiki.fmf.uni-lj.si/lib/exe/fetch.php?media=pajek:slides:9nationssunbelt32.pdf
and an improved version in p. 369-381
Unfortunately, access to https://data.census.gov/ seems to be restricted for users outside the US. I get the message:
Access Denied You don't have permission to access "http://www.census.gov/data" on this server.
Subsets of variables can be obtained from other sources. For example:
Try the links above and some others (search Google for “US counties data”). If you can obtain some data, we can proceed to the details.
Софья Кошовец s.koshovets@gmail.com has already expressed interest in this topic, but she has not informed me of her decision. Here is my answer to her.
The idea comes from the Viszards session at Sunbelt XXXI (2011)
http://vladowiki.fmf.uni-lj.si/lib/exe/fetch.php?media=pajek:slides:visz31.pdf
The original data set was borrowed from the DataExpo 2009 flights contest:
http://stat-computing.org/dataexpo/2009/
but you can also collect (download) recent data from
Can the keywords for a paper be determined with a program?
In the term paper, you can give an overview of existing approaches to this problem, search for available programs, and start collecting a test corpus of papers. The keywords can be “free” or selected from a given list. Present the results of applying some of the programs to a few example papers.
In the master thesis, you can evaluate the available programs by comparing the keywords they suggest with the keywords selected by the authors. You can also try to develop your own procedure; for list-based keywords, you can “learn” from existing papers.
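The comparison of suggested keywords with author keywords could be sketched as follows. This is only a minimal illustration, not a recommended method: the function names `suggest_keywords` and `evaluate`, the tiny stop-word list, and the frequency-based baseline are all assumptions made for the example. Matching is exact after lowercasing, which a real evaluation would probably relax (for example, by lemmatizing first).

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "on", "with", "are"}


def suggest_keywords(text, k=5):
    """Naive baseline: return the k most frequent tokens that are not stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(k)]


def evaluate(suggested, author):
    """Precision, recall and F1 of suggested keywords against author keywords.

    Keywords are compared as exact strings after stripping and lowercasing.
    """
    s = {w.strip().lower() for w in suggested}
    a = {w.strip().lower() for w in author}
    hits = len(s & a)
    precision = hits / len(s) if s else 0.0
    recall = hits / len(a) if a else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    abstract = "Network analysis of network data with graph clustering."
    print(suggest_keywords(abstract, k=3))
    print(evaluate(["network", "graph", "cluster"],
                   ["Graph", "network", "centrality", "Pajek"]))
```

Averaging these three scores over a test corpus gives one simple way to rank the programs being compared.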
They are important in named-entity recognition.
In the term paper, you can give an overview of existing approaches to this problem and search for available programs (with special attention to the Russian language).
In the master thesis, you can develop a program for XML tagging of plain text, based on a dictionary of “interesting” terms and on lemmatization, and illustrate its use by applying it to selected data. I collected some links at
http://vladowiki.fmf.uni-lj.si/doku.php?id=notes:text:lem&#named-entity_recognition
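Dictionary-based XML tagging with lemmatization could be sketched roughly as below. Everything specific here is an assumption made for illustration: the `LEMMAS` and `TERMS` tables are toy stand-ins for a real lemmatizer and a real dictionary of “interesting” terms, and the `<text>` and `<term>` element names with their `type` and `lemma` attributes are hypothetical, not a prescribed schema.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical toy resources. A real system would call a proper lemmatizer
# (especially for Russian) instead of using a lookup table.
LEMMAS = {"networks": "network", "cities": "city"}   # word form -> lemma
TERMS = {"network": "concept", "ljubljana": "place"}  # lemma -> term type


def tag_text(text):
    """Wrap every word whose lemma is in TERMS in a <term> element."""
    out = []
    # Split into alternating word and non-word pieces so that no character is lost.
    for piece in re.findall(r"\w+|\W+", text):
        key = piece.lower()
        lemma = LEMMAS.get(key, key)  # fall back to the lowercased form itself
        if lemma in TERMS:
            out.append("<term type=%s lemma=%s>%s</term>"
                       % (quoteattr(TERMS[lemma]), quoteattr(lemma), escape(piece)))
        else:
            out.append(escape(piece))  # escape &, < and > in untagged text
    return "<text>" + "".join(out) + "</text>"


if __name__ == "__main__":
    print(tag_text("Networks in Ljubljana & beyond"))
```

In Python 3, `\w` also matches Cyrillic letters, so the same tokenization works for Russian text. Multi-word terms would need a longer-match lookup, which this sketch does not attempt.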