Lemmatization

URLs

Books

  • Nitin Hardeniya et al.: Natural Language Processing: Python and NLTK. November 2016
  • Jacob Perkins: Python 3 Text Processing with NLTK 3 Cookbook. August 2014

Dictionary lemmatizer

On “A language learners’ forum”, DangerDave2010 proposed a simple and efficient dictionary-based lemmatizer:

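# encoding: utf8
# Dictionary-based lemmatizer: each line of the file is "lemma<TAB>form".
lemmaDict = {}
with open('lemmatization-es.txt', 'rb') as f:
   data = f.read().decode('utf8').replace(u'\r', u'').split(u'\n')
   data = [a.split(u'\t') for a in data]

# Map every inflected form (second column) to its lemma (first column).
for a in data:
   if len(a) > 1:
      lemmaDict[a[1]] = a[0]

def lemmatize(word):
   # Unknown words come back unchanged, marked with a trailing '*'.
   return lemmaDict.get(word, word + u'*')

def test():
   for a in [u'salió', u'usuarios', u'abofeteéis', u'diferenciando', u'diferenciándola']:
      print(lemmatize(a))

test()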

The dictionary lemmatization-es.txt can be downloaded from lexiconista.
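
To try the idea without downloading the file, here is a minimal sketch that builds the same form-to-lemma table from any iterable of lines. The names build_lemma_dict and SAMPLE and the sample entries are illustrative assumptions, not part of the original post; the line layout (lemma, tab, inflected form) is inferred from the code above, where a[1] is used as the key and a[0] as the value.

```python
# Hypothetical helper: same lookup idea, but reading from any iterable of lines.
def build_lemma_dict(lines):
    """Map each inflected form to its lemma; lines are 'lemma<TAB>form'."""
    lemma_dict = {}
    for line in lines:
        parts = line.rstrip('\r\n').split('\t')
        if len(parts) > 1:          # skip blank or malformed lines
            lemma_dict[parts[1]] = parts[0]
    return lemma_dict


def lemmatize(word, lemma_dict):
    # Unknown words are returned with a trailing '*', as in the original code.
    return lemma_dict.get(word, word + '*')


# Made-up sample lines in the assumed 'lemma<TAB>form' layout.
SAMPLE = [
    'salir\tsalió\n',
    'usuario\tusuarios\n',
    'diferenciar\tdiferenciando\n',
    '\n',
]

lemmas = build_lemma_dict(SAMPLE)
print(lemmatize('salió', lemmas))     # known form
print(lemmatize('usuarios', lemmas))  # known form
print(lemmatize('casas', lemmas))     # not in the sample, so marked with '*'
```

With the real dictionary, the same function should accept an open file object directly, for example build_lemma_dict(open('lemmatization-es.txt', encoding='utf-8')), since iterating over a text file yields its lines.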

I downloaded some dictionaries to C:\Users\batagelj\Downloads\data\lemma.

pro/bib/lem.txt · Last modified: 2020/12/11 21:45 by vlado
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported