NLTK (Natural Language Toolkit) install
"NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and an active discussion forum." - from http://www.nltk.org/.
Now, we'll install NLTK on Ubuntu 14.04. The following steps are from Installing NLTK:
- Install Setuptools: http://pypi.python.org/pypi/setuptools
- Install Pip: run sudo easy_install pip
- Install Numpy (optional): run sudo pip install -U numpy
- Install NLTK: run sudo pip install -U nltk
- Test installation: run python then type import nltk
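The last step can also be scripted. The following is a minimal sketch, assuming Python 3.4 or later (for importlib.util); check_package is a hypothetical helper name used here for illustration and is not part of NLTK:

```python
import importlib
import importlib.util


def check_package(name):
    """Return the version of an importable package, or None if it is missing.

    check_package is a hypothetical helper for this tutorial, not an NLTK API.
    Packages without a __version__ attribute are reported as "unknown".
    """
    if importlib.util.find_spec(name) is None:
        return None  # package is not installed / not on sys.path
    module = importlib.import_module(name)
    return getattr(module, "__version__", "unknown")


# Report the status of the two packages installed in the steps above.
for pkg in ("numpy", "nltk"):
    version = check_package(pkg)
    print(pkg, "->", version if version is not None else "not installed")
```

If nltk is reported as not installed, the pip step above most likely ran against a different Python interpreter than the one running this script.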
Once NLTK is installed, we need to test it. As listed in the previous section, the first thing to check is whether we can import NLTK:
>>> import nltk
Then we can move on and do more, tokenizing and tagging a sample sentence:
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN')]
During the test, we may get the following error message:
LookupError:
**********************************************************************
  Resource 'taggers/maxent_treebank_pos_tagger/english.pickle' not found.
  Please use the NLTK Downloader to obtain the resource:
  >>> nltk.download()
  Searched in:
    - '/home/k/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
In that case, we can fetch the missing resource with nltk.download() and run the tagger again:
>>> nltk.download('maxent_treebank_pos_tagger')
[nltk_data] Downloading package 'maxent_treebank_pos_tagger' to
[nltk_data]     /home/k/nltk_data...
[nltk_data]   Unzipping taggers/maxent_treebank_pos_tagger.zip.
True
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
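The download-and-retry pattern above can be wrapped in a small function. This is only a sketch under stated assumptions: tag_with_retry and main are hypothetical names, not NLTK functions, and the resource name is the one from the error message shown earlier (other NLTK versions may ask for a different resource, so use whatever name the error reports):

```python
def tag_with_retry(tokens, tagger, downloader, resource):
    """Call tagger(tokens); on LookupError, download `resource` once and retry.

    Hypothetical helper: `tagger` and `downloader` are passed in as callables,
    for example nltk.pos_tag and nltk.download.
    """
    try:
        return tagger(tokens)
    except LookupError:
        # The data named in the error message is missing: fetch it, retry once.
        downloader(resource)
        return tagger(tokens)


def main():
    # Not called automatically: requires NLTK and network access.
    import nltk

    # Plain split() avoids needing any tokenizer data for this example.
    tokens = "And now for something completely different".split()
    print(tag_with_retry(tokens, nltk.pos_tag, nltk.download,
                         "maxent_treebank_pos_tagger"))
```

Calling main() in an interactive session runs the tagger against a real NLTK installation. If the second attempt still raises LookupError, the error is not caught, so the message naming the missing resource remains visible.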