On “A language learners’ forum”, DangerDave2010 proposed a simple and efficient dictionary-based lemmatizer:
    #encoding: utf8
    lemmaDict = {}
    with open('lemmatization-es.txt', 'rb') as f:
        data = f.read().decode('utf8').replace(u'\r', u'').split(u'\n')
        data = [a.split(u'\t') for a in data]
        for a in data:
            if len(a) > 1:
                lemmaDict[a[1]] = a[0]

    def lemmatize(word):
        return lemmaDict.get(word, word + u'*')

    def test():
        for a in [ u'salió', u'usuarios', u'abofeteéis', u'diferenciando', u'diferenciándola' ]:
            print(lemmatize(a))

    test()
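The lemmatizer above works on single words. To apply the same lookup to running text, one possible sketch is shown here. The function name `lemmatize_text`, the lowercasing step, the regular-expression tokenization, and the two-entry `sample_dict` are assumptions made for illustration; in practice the dictionary would be the one loaded from lemmatization-es.txt.

```python
import re

# Tiny stand-in for the dictionary loaded from lemmatization-es.txt
# (hypothetical sample entries, for illustration only).
sample_dict = {
    "salió": "salir",
    "usuarios": "usuario",
}

def lemmatize_text(text, lemma_dict):
    """Lemmatize every word in text; unknown words get a trailing '*'."""
    # Lowercase first, then split into runs of word characters.
    # In Python 3, \w also matches accented letters such as 'ó'.
    words = re.findall(r"\w+", text.lower())
    return [lemma_dict.get(w, w + "*") for w in words]

print(lemmatize_text("Salió con los usuarios", sample_dict))
# prints ['salir', 'con*', 'los*', 'usuario']
```

As in the original function, words missing from the dictionary are returned unchanged with a `*` appended, so gaps in coverage stay visible.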
The dictionary lemmatization-es.txt can be downloaded from lexiconista.
I downloaded some dictionaries to C:\Users\batagelj\Downloads\data\lemma.
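With several dictionaries in one folder, they could be loaded in one pass. The following is a minimal sketch, not part of the original code. It assumes that every file follows the naming pattern lemmatization-XX.txt and uses the same tab-separated layout (lemma, then word form) that the code above reads; the helper names `load_lemma_dict` and `load_all` are hypothetical.

```python
from pathlib import Path

def load_lemma_dict(path):
    """Read a tab-separated 'lemma<TAB>form' file into a form -> lemma dict."""
    lemma_dict = {}
    # utf-8-sig also strips a byte-order mark if the file starts with one.
    with open(path, encoding="utf-8-sig") as f:
        for line in f:
            parts = line.rstrip("\r\n").split("\t")
            if len(parts) > 1:
                lemma_dict[parts[1]] = parts[0]
    return lemma_dict

def load_all(folder):
    """Load every lemmatization-*.txt file in folder, keyed by language code."""
    dicts = {}
    for p in sorted(Path(folder).glob("lemmatization-*.txt")):
        lang = p.stem.split("-", 1)[1]  # 'lemmatization-es' -> 'es'
        dicts[lang] = load_lemma_dict(p)
    return dicts
```

A call such as `load_all(r"C:\Users\batagelj\Downloads\data\lemma")` would then return one dictionary per language file found in that folder, for example under the key `"es"` for lemmatization-es.txt.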