Python - Word Tokenization

Word tokenization is the process of splitting a body of text into individual words. It is a prerequisite for natural language processing tasks in which each word must be captured and analyzed further, for example classified or counted as part of sentiment analysis. The Natural Language Toolkit (NLTK) is a Python library that provides this functionality. Install NLTK before running the word tokenization program below.

conda install -c anaconda nltk

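Next, we use the word_tokenize function to split a paragraph into individual words. It relies on NLTK's Punkt tokenizer data, which needs to be downloaded once with nltk.download('punkt') (recent NLTK releases name this resource 'punkt_tab').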

import nltk

# Sample text to split into words.
word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"

# word_tokenize returns a list of word and punctuation tokens.
nltk_tokens = nltk.word_tokenize(word_data)
print(nltk_tokens)

When we execute the above code, it produces the following result.

['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers',
'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the',
'comforts', 'of', 'their', 'drawing', 'rooms']
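Once the text is a list of tokens, the counting mentioned at the start becomes straightforward. Here is a minimal sketch, assuming the token list printed above and using collections.Counter from the standard library. The lowercasing step is an assumption: whether to fold case depends on the task.

```python
from collections import Counter

# Token list copied from the word_tokenize output above.
nltk_tokens = ['It', 'originated', 'from', 'the', 'idea', 'that', 'there',
               'are', 'readers', 'who', 'prefer', 'learning', 'new', 'skills',
               'from', 'the', 'comforts', 'of', 'their', 'drawing', 'rooms']

# Lowercase each token so that, for example, 'The' and 'the' would be counted
# together (an assumption; skip this if case matters for your analysis).
counts = Counter(token.lower() for token in nltk_tokens)

# Show the two most frequent tokens with their counts.
print(counts.most_common(2))
```

Here 'from' and 'the' each occur twice, and every other token occurs once.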

Tokenizing Sentences

We can also split a paragraph into sentences, just as we split it into words. The sent_tokenize function does this. Below is an example.

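import nltk

# Sample paragraph containing two sentences.
sentence_data = "Sun rises in the east. Sun sets in the west."

# sent_tokenize returns a list with one string per sentence.
nltk_tokens = nltk.sent_tokenize(sentence_data)
print(nltk_tokens)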

When we execute the above code, it produces the following result.

['Sun rises in the east.', 'Sun sets in the west.']
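The two steps are often chained: split a paragraph into sentences, then split each sentence into words. The sketch below starts from the sentence list shown above. So that it runs without NLTK's tokenizer data, it uses a small regular expression as a stand-in for word_tokenize. The regex and the helper name simple_word_tokens are assumptions for illustration, not NLTK's algorithm. In a real program you would call nltk.word_tokenize on each sentence instead.

```python
import re

# Sentence list as printed by the sent_tokenize example above.
sentences = ['Sun rises in the east.', 'Sun sets in the west.']

def simple_word_tokens(sentence):
    """Hypothetical stand-in for nltk.word_tokenize.

    Runs of word characters become tokens, and each punctuation
    mark becomes its own token.
    """
    return re.findall(r"\w+|[^\w\s]", sentence)

# One token list per sentence.
tokens_per_sentence = [simple_word_tokens(s) for s in sentences]

for tokens in tokens_per_sentence:
    print(tokens)
```

Keeping one token list per sentence preserves sentence boundaries, which is useful when later analysis works sentence by sentence.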
