Text preprocessing for English

7 minute read

What will be included in this article

  • Why we do text preprocessing
  • Tokenization
  • Case-folding
  • Stopwords filtering
  • Stemming
  • Lemmatization

Why we do text preprocessing

When you have a collection of documents/sentences and want to build features for machine learning, text preprocessing helps you normalize your input data and reduce noise. It can facilitate your analysis; however, improper preprocessing can also make you lose important information from your raw data. So choose the preprocessing steps for your task carefully. In the following sections, we will walk through several effective text preprocessing steps.

Tokenization

A tokenizer splits a sentence into words or a document into sentences. Most tokenizers are implemented with regular expressions. The following is an example (punctuation is also removed here):

from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize
import string

data = "I have a pen. I have an Apple. Uh! Apple pen!"
print(sent_tokenize(data)) # sentence level tokens

# remove all punctuations
trantab = str.maketrans(dict.fromkeys(list(string.punctuation)))
print(word_tokenize(data.translate(trantab))) # word level tokens
['I have a pen.', 'I have an Apple.',
'Uh!', 'Apple pen!']
['I', 'have', 'a', 'pen', 'I', 'have',
'an', 'Apple', 'Uh', 'Apple', 'pen']
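
By the way, wordpunct_tokenize (imported above but not used) takes a slightly different approach: instead of requiring punctuation to be stripped first, it keeps punctuation marks as separate tokens. A quick sketch of the difference:

from nltk.tokenize import wordpunct_tokenize

data = "I have a pen. I have an Apple. Uh! Apple pen!"
# wordpunct_tokenize splits on the regex \w+|[^\w\s]+, so punctuation
# marks come back as their own tokens instead of being dropped
print(wordpunct_tokenize(data))
# expected: ['I', 'have', 'a', 'pen', '.', 'I', 'have', 'an', 'Apple', '.',
#            'Uh', '!', 'Apple', 'pen', '!']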

Case-folding

Case-folding is a very simple operation whose purpose is to normalize words into the same case (e.g. “The” and “the” would be treated as two different unigrams without case-folding).

from nltk.tokenize import word_tokenize
import string

data = "I have a pen. I have an Apple. Uh! Apple pen!"

# remove all punctuations
trantab = str.maketrans(dict.fromkeys(list(string.punctuation)))
data = data.translate(trantab)
# tokenize and lowercase
tokens_lowercase = [t.lower() for t in word_tokenize(data)]
print(tokens_lowercase)
['i', 'have', 'a', 'pen', 'i', 'have',
'an', 'apple', 'uh', 'apple', 'pen']

However, I’ve seen many people thoughtlessly lowercase tokens in NLP tasks. In fact, letter case is meaningful for some tasks, such as machine translation (e.g. a word in all caps is usually treated as an abbreviation).



Letter case also matters for authorship identification.

“For example, comparing prose and poetry pieces, without reducing to lower case the verses initial upper case, automatically creates a distance of around 1/8” – Labbé et al. (2001)

So be aware of the effects it may have before changing the case of words.
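
If case does matter for your task, one possible compromise (just a sketch of the idea; selective_lower is a hypothetical helper, not a standard NLTK function) is to lowercase everything except tokens that look like abbreviations:

from nltk.tokenize import word_tokenize

def selective_lower(tokens):
    # hypothetical helper: keep all-caps tokens (likely abbreviations),
    # lowercase everything else
    return [t if (t.isupper() and len(t) > 1) else t.lower() for t in tokens]

data = "NASA launched a new probe. The probe reached Mars."
print(selective_lower(word_tokenize(data)))
# expected: ['NASA', 'launched', 'a', 'new', 'probe', '.',
#            'the', 'probe', 'reached', 'mars', '.']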

Stopwords filtering

In computing, stop words are words which are filtered out before or after processing of natural language data (text). Though “stop words” usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Wikipedia

The purpose of stopwords filtering is to focus on “meaningful” words in sentences/documents. However, there are also side-effects:

  • It breaks the structure and context of a sentence, so both syntactic and semantic information may change (see the negation example further below).
  • The definition of “stopwords” differs from task to task, so using an improper stopword list could hurt the quality of your text features.

In the example below, we use the default stopwords corpus in NLTK, which contains 2,400 stopwords for 11 languages, built by Porter et al. (1980).

import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

data = "I have a pen. I have an Apple. Uh! Apple pen!"

# remove all punctuations
trantab = str.maketrans(dict.fromkeys(list(string.punctuation)))
data = data.translate(trantab)
# tokenize and lowercase
tokens = [t.lower() for t in word_tokenize(data)]
# remove stopwords
sw_set = set(stopwords.words('english'))
tokens = [t for t in tokens if t not in sw_set]
print(tokens)
['pen', 'apple', 'uh', 'apple', 'pen']
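
Note that the NLTK English list also contains negations such as “not”, so filtering can change the meaning of a sentence. A small sketch of this side-effect (the sentence is just a made-up example):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sw_set = set(stopwords.words('english'))
data = "this movie is not good"
# "this", "is" and "not" are all in the stopword list
print([t for t in word_tokenize(data) if t not in sw_set])
# expected: ['movie', 'good'] -- the negation is gone and the sentiment flips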

Stemming

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. Wikipedia

In order to introduce stems, let’s start from morphemes in linguistics.

A morpheme is the smallest grammatical unit in a language. In other words, it is the smallest meaningful unit of a language.

  • Free morphemes can function independently as words (e.g. town, dog)
  • Bound morphemes appear only as parts of words. Most bound morphemes in English are affixes, particularly prefixes (e.g. un-) and suffixes. (e.g. -tion, -ation, -ible, -ing)
  • When a morpheme stands by itself, it is considered a root because it has a meaning of its own (e.g. the morpheme cat)
  • Roots are composed of only one morpheme, while stems can be composed of more than one morpheme. An example of this is the word quirkiness. The root is quirk, but the stem is quirky, which has two morphemes. Wikipedia

Let’s take a look at an example. We use the stemmer proposed by Porter (1980), which is one of the most popular rule-based stemmers.

import string
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

data = "Happy and me are happy. " + \
    "Happy is the happiest girl " + \
    "who is filled with happiness and happier than others"

# remove all punctuations
trantab = str.maketrans(dict.fromkeys(list(string.punctuation)))
data = data.translate(trantab)
# tokenize and lowercase
tokens = [t.lower() for t in word_tokenize(data)]
# stemming by the stemmer built by Martin Porter
stemmer = PorterStemmer(PorterStemmer.ORIGINAL_ALGORITHM)
print(tokens)
print([stemmer.stem(d) for d in tokens])
['happy', 'and', 'me', 'are', 'happy', 'happy', 'is', 'the',
'happiest', 'girl', 'who', 'is', 'filled', 'with', 'happiness',
'and', 'happier', 'than', 'others']
['happi', 'and', 'me', 'ar', 'happi', 'happi', 'i', 'the',
'happiest', 'girl', 'who', 'i', 'fill', 'with', 'happi',
'and', 'happier', 'than', 'other']

In the above example, “happy” and “happiness” are converted to the same stem “happi” and could be grouped together as one feature. Stemming converts word features into stem features, which effectively reduces the size of the feature set. However, there are some problems:

  • The stemmer rules are manually crafted based on statistics, so they are not always correct when applied to a large sample vocabulary (Porter, 2001; see the Snowball sketch after this list).
  • Stems can be meaningless words that do not appear in dictionaries (e.g. “is” -> “i”, “happy” -> “happi”).
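
Porter himself addressed some of these issues later with Snowball (Porter, 2001), and NLTK ships the updated English algorithm as SnowballStemmer. A quick sketch comparing the two (the outputs in the comments are what I would expect from recent NLTK versions):

from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer(PorterStemmer.ORIGINAL_ALGORITHM)
snowball = SnowballStemmer('english')

# 'fairly' is a documented case where the two algorithms disagree:
# the original Porter rules give 'fairli', while Snowball gives 'fair'
for w in ['fairly', 'happiness', 'happiest']:
    print(w, '->', porter.stem(w), '/', snowball.stem(w))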

Lemmatization

Lemmatization in linguistics is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word’s lemma, or dictionary form. Unlike stemming, lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document. Wikipedia

The difference between lemmatization and stemming is that lemmatization utilizes dictionary-like resources to convert a word into its basic form. In the example below, we look up words in WordNet, a large lexical database of English (let’s talk about WordNet in the future), to lemmatize the sentence.

import string
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

data = "Happy and me are happy. " + \
    "Happy is the happiest girl " + \
    "who is filled with happiness and happier than others"

# remove all punctuations
trantab = str.maketrans(dict.fromkeys(list(string.punctuation)))
data = data.translate(trantab)
# tokenize and lowercase
tokens = [t.lower() for t in word_tokenize(data)]
# lemmatize by the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

print([wordnet_lemmatizer.lemmatize(d, pos='v') for d in tokens]) # POS = verb
print([wordnet_lemmatizer.lemmatize(d, pos='a') for d in tokens]) # POS = adjective
['happy', 'and', 'me', 'be', 'happy', 'happy', 'be', 'the',
'happiest', 'girl', 'who', 'be', 'fill', 'with', 'happiness',
'and', 'happier', 'than', 'others']
['happy', 'and', 'me', 'are', 'happy', 'happy', 'is', 'the',
'happy', 'girl', 'who', 'is', 'filled', 'with', 'happiness',
'and', 'happy', 'than', 'others']

As shown in the example, a lemmatizer can convert a word into its exact base form if the given POS tag is correct and WordNet is knowledgeable enough. Compared to stemming, lemmatization may be more accurate (in some cases), but it requires much more human effort to annotate word relationships. In my understanding, both normalizations are effective and it’s difficult to say which one is better. There are some comments on this in Introduction to Information Retrieval - Stanford NLP:

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
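
In practice you usually don’t hand-pick the POS tag as I did above. A common pattern (a sketch, assuming the averaged perceptron tagger and WordNet data have been downloaded via nltk.download) is to run nltk.pos_tag first and map its Penn Treebank tags to WordNet’s POS constants:

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def to_wordnet_pos(treebank_tag):
    # map Penn Treebank tags (JJ*, VB*, RB*, NN*, ...) to WordNet POS constants
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # default to noun

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("She is the happiest girl and sings happily")
tagged = nltk.pos_tag(tokens)  # e.g. [('She', 'PRP'), ('is', 'VBZ'), ...]
print([lemmatizer.lemmatize(w, to_wordnet_pos(t)) for w, t in tagged])
# expected (roughly): ['She', 'be', 'the', 'happy', 'girl', 'and', 'sing', 'happily']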

Let’s wrap it up for today! Please leave your comments below if you have any questions. Also, if you want to read about any specific topic, feel free to let me know and I will put it on my to-do list.

Reference

  1. Labbé, C., & Labbé, D. (2001). Inter-textual distance and authorship attribution Corneille and Molière. Journal of Quantitative Linguistics, 8(3), 213-231.
  2. Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.
  3. Porter, M. F. (2001). Snowball: A language for stemming algorithms.
