Let’s talk about text features in NLP (1)


What will be included in this article

  • A brief introduction to NLP/NLU
  • Why the selection of text features is critical
  • Word/Character N-gram models
    • Introduction
    • Comparison
    • Discussion

First, what is NLP/NLU?

“Natural-language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to fruitfully process large amounts of natural language data.” Wikipedia

“Natural language understanding (NLU) is a subtopic of natural language processing in artificial intelligence that deals with machine reading comprehension.” Wikipedia

I’m not going to discuss the difference between NLP and NLU here, so let’s use NLP in the rest of this post.

The basic idea of NLP is to teach computers to recognize certain “properties” of human language and to use them to facilitate text analysis tasks, such as sentiment analysis, topic classification, summarization, phishing detection, and authorship attribution.

Selection of text features

Because we are teaching computers new things, we need to transform human knowledge into digital representations. These representations are created from predefined text features, and those text features are generated based on language properties. Since language properties vary widely across language families, countries, regions, people, and time, it is critical to select them according to your goal.

N-gram model

One of the most intuitive text features is counting word occurrences. For example, if we want to know whether an article is about cars, we can check whether it contains more car-related words than words related to other topics. This simple idea leads to the n-gram model.
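For instance, here is a naive keyword-counting sketch of that idea (the article text and keyword list are made up for illustration):

from collections import Counter
article = 'The new Mazda 3 has great mileage and a quiet engine.'
car_words = {'car', 'engine', 'mileage', 'mazda'}  # hypothetical keyword list
# normalize tokens by stripping punctuation and lowercasing, then count them
counts = Counter(w.strip('.,!').lower() for w in article.split())
print(sum(counts[w] for w in car_words))  # 3 car-related hits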

“In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech.” Wikipedia

The n in n-gram is a positive integer. We usually refer to a 1-gram as a unigram, a 2-gram as a bigram, and a 3-gram as a trigram.

Word unigram model

The following is an example of a word unigram model:

from nltk import ngrams
from collections import Counter
sentence = 'I love my Mazda 3! Mazda 3 is the best car in the world!'
# split on whitespace and count unigrams (punctuation stays attached to tokens)
ng = ngrams(sentence.split(), n=1)
Counter(ng)

Output:

Counter({('3',): 1,
         ('3!',): 1,
         ('I',): 1,
         ('Mazda',): 2,
         ('best',): 1,
         ('car',): 1,
         ('in',): 1,
         ('is',): 1,
         ('love',): 1,
         ('my',): 1,
         ('the',): 2,
         ('world!',): 1})

However, you might want to filter out stopwords and punctuation in this sentence so that you keep only “meaningful” (well, it actually depends) unigrams:

from nltk import ngrams
from collections import Counter
from nltk.corpus import stopwords  # requires nltk.download('stopwords')
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')  # keeps word characters only, dropping punctuation
sw_set = set(stopwords.words('english'))  # English stopword list
sentence = 'I love my Mazda 3! Mazda 3 is the best car in the world!'
tokens = [t for t in tokenizer.tokenize(sentence) if t.lower() not in sw_set]
ng = ngrams(tokens, n=1)
Counter(ng)

Output:

Counter({('3',): 2,
         ('Mazda',): 2,
         ('best',): 1,
         ('car',): 1,
         ('love',): 1,
         ('world',): 1})

Now you can see how word unigrams capture the frequency of words in a sentence. They are extremely useful when word choice is an important attribute of your task (e.g., categorizing the topic of a news article).

Word bigram/n-gram model

Now let’s apply a word bigram model to the previous example.
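Continuing from the stopword-filtering snippet above (so tokens is already defined), the bigram counts can be computed as follows:

from nltk import ngrams
from collections import Counter
ng = ngrams(tokens, n=2)  # adjacent word pairs from the filtered tokens
Counter(ng)

Output: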

Counter({('3', 'Mazda'): 1,
         ('3', 'best'): 1,
         ('Mazda', '3'): 2,
         ('best', 'car'): 1,
         ('car', 'world'): 1,
         ('love', 'Mazda'): 1})

The word bigram model captures the frequency of “adjacency” between two words. Word adjacency is useful information in many applications, such as authorship attribution (Segarra et al., 2015): every person has certain writing habits that are not easily changed, and word adjacency can be used to capture them. Furthermore, we can extend word n-grams to capture syntactic information such as part-of-speech (POS) tags, as sketched below. Juola (2008) suggests that a person’s preferred syntactic constructions can also be cues to their authorship. It is an intriguing topic for me, but let’s stop here for now :p
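As a quick illustration, here is a minimal sketch of POS-tag bigrams using NLTK’s off-the-shelf tagger (it assumes the 'punkt' and 'averaged_perceptron_tagger' NLTK data packages have been downloaded):

from nltk import ngrams, pos_tag, word_tokenize
from collections import Counter
sentence = 'I love my Mazda 3! Mazda 3 is the best car in the world!'
# tag each token, keep only the POS tags, then count tag bigrams
tags = [tag for _, tag in pos_tag(word_tokenize(sentence))]
Counter(ngrams(tags, n=2))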

Character n-gram model

Character n-grams are also useful for authorship attribution and style-based text categorization (Stamatatos, 2009). Let’s see how they capture the difference between the same phrase as written in US and UK English:

  • favorite color (US)
  • favourite colour (UK)

The following is an example of a character trigram model:

from nltk import ngrams
from collections import Counter
import pandas as pd

def char_trigram_df(sentence):
    # count character trigrams (spaces included) and tabulate them
    counts = Counter(ngrams(sentence, n=3))
    return pd.DataFrame({'3_gram': ["".join(k) for k in counts],
                         'count': list(counts.values())},
                        columns=['3_gram', 'count'])

sentence_us = 'favorite color'
sentence_uk = 'favourite colour'
ng_us = char_trigram_df(sentence_us)
ng_uk = char_trigram_df(sentence_uk)
print("ng_us: '{}'".format(sentence_us))
print(ng_us)
print("ng_uk: '{}'".format(sentence_uk))
print(ng_uk)

Output:

ng_us: 'favorite color'
   3_gram  count
0     fav      1
1     avo      1
2     vor      1
3     ori      1
4     rit      1
5     ite      1
6     te       1
7     e c      1
8      co      1
9     col      1
10    olo      1
11    lor      1
ng_uk: 'favourite colour'
   3_gram  count
0     fav      1
1     avo      1
2     vou      1
3     our      2
4     uri      1
5     rit      1
6     ite      1
7     te       1
8     e c      1
9      co      1
10    col      1
11    olo      1
12    lou      1

The different usage of “or” and “our” within a word (e.g., “favorite” in the US vs. “favourite” in the UK) is a well-known spelling difference between US and UK English. This example shows that the character n-gram model can effectively capture this difference.

Comparison between word and character n-gram models

Both word-level and character-level n-gram models capture frequency and adjacency information. However, they differ in practical applications:

  • Dimensionality
    • The feature dimension of word n-grams grows with the vocabulary of the corpus (the total collection of text examples) and exponentially with n: a vocabulary of V words admits up to V^n distinct n-grams. The model therefore suffers from the curse of dimensionality and a sparse feature space, and dimensionality reduction is usually required to improve its performance (see the sketch after this list).
    • The feature dimension of character n-grams is bounded by the alphabet size instead (up to A^n for an alphabet of A characters), so it suffers far less from the curse of dimensionality.
  • Performance: It depends on the task at hand. For example, you can apply cost-sensitive evaluation measures (Kanaris et al., 2007) to select the best model for your task.
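To get a feel for the dimensionality gap, here is a minimal sketch using scikit-learn’s CountVectorizer (an assumption of this sketch; the rest of the post uses only NLTK) to compare the vocabulary sizes of word bigrams and character trigrams on a toy corpus:

from sklearn.feature_extraction.text import CountVectorizer
corpus = ['I love my Mazda 3! Mazda 3 is the best car in the world!',
          'My favourite colour is blue.']
# word bigrams: the feature space grows with the corpus vocabulary
word_vec = CountVectorizer(analyzer='word', ngram_range=(2, 2)).fit(corpus)
# character trigrams: the feature space is bounded by the alphabet
char_vec = CountVectorizer(analyzer='char', ngram_range=(3, 3)).fit(corpus)
print(len(word_vec.vocabulary_), len(char_vec.vocabulary_))

On a realistic corpus, the word n-gram vocabulary quickly dwarfs the character n-gram vocabulary.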

Discussion about N-gram model

Before the rise of word embeddings, the n-gram model was one of the most widely adopted text features in NLP (we will discuss word embeddings in a future post …). The following are some pros and cons of the n-gram model:

Pros

  • It captures the frequency and adjacency of words/characters
  • It is easy to implement and understand
  • It performs very well in many NLP tasks

Cons

  • Curse of dimensionality and a sparse feature space
  • Performance is strongly affected by the choice of n and by the dimensionality reduction method

Let’s wrap it up for today! Please leave your comments below if you have any questions. Also, if you want to read about any specific topic, feel free to let me know and I will put it on my to-do list.

References

  1. Segarra, S., Eisen, M., & Ribeiro, A. (2015). Authorship attribution through function word adjacency networks. IEEE Transactions on Signal Processing, 63(20), 5464-5478.
  2. Juola, P. (2008). Authorship attribution. Foundations and Trends® in Information Retrieval, 1(3), 233-334.
  3. Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the Association for Information Science and Technology, 60(3), 538-556.
  4. Kanaris, I., Kanaris, K., Houvardas, I., & Stamatatos, E. (2007). Words versus character n-grams for anti-spam filtering. International Journal on Artificial Intelligence Tools, 16(06), 1047-1067.
