Feature Selection Techniques and Manipulation

This tutorial presents techniques for selecting and manipulating features in text categorization problems. It explains how to work with text data, transform it into numbers, build custom feature selection methods, and find significant properties.

Requirements

The level of this tutorial is intermediate. It requires a basic understanding of the numpy, scipy, and sklearn libraries, as well as sparse matrices.

Motivation

Have you ever wondered how sklearn's functions work? Have you ever found them insufficient, or wanted to modify some of the feature selection modules? I hope this tutorial will answer some of your questions and help you start hacking your way through data science.

Suppose you have the following corpus:

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'This is an example of feature selection',
    'I love kittens',
    'My favourite book is Harry Potter',
    "Do not beleive everything you read"
]
  • How would you count how many times the word document occurs in the corpus?
  • How would you delete it from the whole corpus?
  • How would you identify in which sentences this word appears?

Let's try this out in the most straightforward way:

In [ ]:
# counting words
corpus = [
    'This is the first document',
    'This document is the second document',
    'And this is the third one',
    'Is this the first document',
    'This is an example of feature selection',
    'I love kittens',
    'My favourite book is Harry Potter',
    "Do not beleive everything you read"
]

result=0
for text in corpus:
    for word in text.split():
        if word=="document":
            result=result+1
print(result)
In [ ]:
# deleting words
corpus = [
    'This is the first document',
    'This document is the second document',
    'And this is the third one',
    'Is this the first document',
    'This is an example of feature selection',
    'I love kittens',
    'My favourite book is Harry Potter',
    "Do not beleive everything you read"
]

for k, text in enumerate(corpus):
    # rebuild each document, keeping every word except "document"
    # (removing items from a list while iterating over it skips elements)
    corpus[k] = ' '.join(word for word in text.split() if word != "document")

print(corpus)
In [ ]:
# identifying words
corpus = [
    'This is the first document',
    'This document is the second document',
    'And this is the third one',
    'Is this the first document',
    'This is an example of feature selection',
    'I love kittens',
    'My favourite book is Harry Potter',
    "Do not beleive everything you read"
]


identification_list = []
for text in corpus:
    # 1 if the document contains the word "document", 0 otherwise
    identification_list.append(1 if "document" in text.split() else 0)

print(identification_list)

Change the way you think about text

Instead of scanning the documents, try converting the corpus into a matrix representation and performing all the actions column-wise and row-wise. We can use CountVectorizer from sklearn to achieve that.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

corpus = [
    'This is the first document',
    'This document is the second document',
    'And this is the third one',
    'Is this the first document',
    'This is an example of feature selection',
    'I love kittens',
    'My favourite book is Harry Potter',
    "Do not beleive everything you read"
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())  # removed in sklearn 1.2; use get_feature_names_out() there

print(X.toarray())  
['an', 'and', 'beleive', 'book', 'do', 'document', 'everything', 'example', 'favourite', 'feature', 'first', 'harry', 'is', 'kittens', 'love', 'my', 'not', 'of', 'one', 'potter', 'read', 'second', 'selection', 'the', 'third', 'this', 'you']
[[0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0]
 [0 0 0 0 0 2 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 1 1 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0]
 [1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1]]
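
Note that X is not a plain array but a scipy sparse matrix, which is why the snippets in this tutorial call .toarray() (or the shorthand .A) before printing. A quick sanity check:

In [ ]:
# X is a scipy.sparse matrix: one row per document, one column per feature
print(type(X))   # a compressed sparse row (CSR) matrix
print(X.shape)   # (8, 27)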
In [4]:
# count how many times each word appears in the corpus
k = 0
amount_of_features = X.shape[1]
score = []
while k < amount_of_features:
    arr = X[:, k]        # extract column k (still sparse)
    arr = arr.toarray()
    score.append(np.sum(arr))
    k = k + 1
print(score)

[1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 2, 1, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 5, 1]
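
As an aside, the same per-feature totals can be computed without the explicit loop, using the sparse matrix's column-wise sum; a minimal sketch:

In [ ]:
# sum each column of the sparse matrix; the result is a 1 x n_features
# matrix, so flatten it into a plain 1-d array
score = np.asarray(X.sum(axis=0)).ravel()
print(score.tolist())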

Creating Masks

Masks are boolean arrays that act upon sparse matrices, changing their elements according to some pattern. For example, if we want to drop the column containing the word document, we have to make a mask that acts upon that column. The column containing the word is number 5.
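
Rather than counting columns by eye, you can look the index up in the vectorizer's fitted vocabulary:

In [ ]:
# vocabulary_ maps each term to its column index in X
print(vectorizer.vocabulary_["document"])  # -> 5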

In [24]:
pattern=[]
for feature in vectorizer.get_feature_names():
    if (feature== "document"):
        pattern.append(False)
    else:
        pattern.append(True)
print(pattern)
mask=np.array(pattern, dtype=bool)
[True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True]
In [28]:
# drop the column containing the word document
print('old X')
print(X.A)
print("\n new X")
X_new=X[:,mask]
print(X_new.toarray())
old X
[[0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0]
 [0 0 0 0 0 2 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 1 1 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0]
 [1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1]]

 new X
[[0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0]
 [0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 1 1 0]
 [0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0]
 [1 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1]]

Removing non-relevant features

Let's say we have the same corpus with relevant and non-relevant documents.

How would you remove all non-relevant features from the corpus?

document                                   relevance
This is the first document                 Yes
This document is the second document       Yes
And this is the third one                  No
Is this the first document                 Yes
This is an example of feature selection    Yes
I love kittens                             No
My favourite book is Harry Potter          No
Do not beleive everything you read         Yes

You can use masks to do that!

In [31]:
corpus = [
    'This is the first document',
    'This document is the second document',
    'And this is the third one',
    'Is this the first document',
    'This is an example of feature selection',
    'I love kittens',
    'My favourite book is Harry Potter',
    "Do not beleive everything you read"
]
relevance=[1, 1, 0, 1, 1, 0, 0, 1]
mask= np.array(relevance, dtype=bool)
print("old X")
print(X.A)
print("\n new X")
new_X=X[mask]
print(new_X.A)
old X
[[0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0]
 [0 0 0 0 0 2 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 1 1 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0]
 [1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1]]

 new X
[[0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0]
 [0 0 0 0 0 2 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0]
 [1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1]]

Notice that the new X is not exactly correct, because it contains empty columns.

These columns correspond to features that no longer occur in any remaining document, and they need to be removed.

To remove them, we construct the appropriate mask again.

In [32]:
# make the mask: getnnz(0) counts the non-zero entries in each column,
# so the mask is True for every feature still used by some document
mask = new_X.getnnz(0) > 0
print(mask)
[ True False  True False  True  True  True  True False  True  True False
  True False False False  True  True False False  True  True  True  True
 False  True  True]
In [33]:
#drop the column
print("old new_X")
print(new_X.A)
new_X=new_X[:,mask]
print("\n new new_X")
print(new_X.A)
old new_X
[[0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0]
 [0 0 0 0 0 2 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0]
 [1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1]]

 new new_X
[[0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 1 0]
 [0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 1 0]
 [0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 1 0]
 [1 0 0 0 0 1 1 0 1 0 1 0 0 1 0 1 0]
 [0 1 1 0 1 0 0 0 0 1 0 1 0 0 0 0 1]]

Relevant document frequency

RDF is a filtering feature selection method that assigns each feature a score based on how many relevant documents that feature occurs in. It then selects the best features and produces a reduced train set.

Using the same corpus, we will select the top 5 features and transform both the train and the test set.

document                                   relevance   set
This is the first document                 Yes         train_set
This document is the second document       Yes         train_set
And this is the third one                  No          train_set
Is this the first document                 Yes         train_set
This is an example of feature selection    Yes         test_set
I love kittens                             No          test_set
My favourite book is Harry Potter          No          test_set
Do not beleive everything you read         Yes         test_set
In [60]:
x_train = [
    'This is the first document',
    'This document is the second document',
    'And this is the third one',
    'Is this the first document' ]
y_train =[1,1,0,1]

x_test = ['This is an example of feature selection',
    'I love kittens',
    'My favourite book is Harry Potter',
    "Do not beleive everything you read" ]
y_test=[1,0,0,1]
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(x_train)
X_test = vectorizer.transform(x_test)
print(X_train.A)
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
In [68]:
mask1 = np.array(y_train, dtype=bool)  # row mask selecting only the relevant documents

x_rel_train = X_train[mask1]  # matrix containing only the relevant documents, in CountVectorizer fashion

mask = x_rel_train.getnnz(0) > 0  # 1-d column mask over all features: True for features occurring in relevant docs, False otherwise

x_rel_train = x_rel_train[:, mask]  # the matrix with [rel_docs x rel_features]
print("x_rel is:")
print(x_rel_train.A)

new_X_test1 = X_test[:,mask]
new_X_train1= X_train[:,mask]

print("\nthe new_x_train without non-relevant features is\n" )
print(new_X_train1.A)
x_rel is:
[[1 1 1 0 1 1]
 [2 0 1 1 1 1]
 [1 1 1 0 1 1]]

the new_x_train without non-relevant features is

[[1 1 1 0 1 1]
 [2 0 1 1 1 1]
 [0 0 1 0 1 1]
 [1 1 1 0 1 1]]
In [62]:
k = 0
score = []
amount_of_features = x_rel_train.shape[1]
while k < amount_of_features:
    arr = x_rel_train[:, k]
    arr = arr.toarray()
    # binarize: we count the documents that contain the feature, not its
    # total number of occurrences (like a binary CountVectorizer)
    arr = np.where(arr > 0, 1, 0)
    score.append(np.sum(arr))
    k = k + 1

print(score)
[3, 2, 3, 1, 3, 3]
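
The same scores can also be computed in one vectorized step; this sketch should reproduce the loop's output above:

In [ ]:
# binarize the sparse matrix (> 0), then sum each column: this counts,
# per feature, how many relevant documents contain it
score = np.asarray((x_rel_train > 0).sum(axis=0)).ravel()
print(score.tolist())  # should match [3, 2, 3, 1, 3, 3]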
In [63]:
# creating the mask
topk = 5
# adapted from sklearn's SelectKBest: a boolean mask with True for the
# top-scoring features and False for the rest
mask = np.zeros(len(score), dtype=bool)
mask[np.argsort(score, kind="mergesort")[-topk:]] = True
print(mask)
print("\n")
new_X_train2= new_X_train1[:,mask]
new_X_test2 = new_X_test1[:,mask]

print("old X_train")
print(new_X_train1.A)
print("\n new X_train")
print(new_X_train2.A)

print("\n new_X_test1")
print(new_X_test1.A)
print("\nnew_X_test2")
print(new_X_test2.A)
[ True  True  True False  True  True]


old X_train
[[1 1 1 0 1 1]
 [2 0 1 1 1 1]
 [0 0 1 0 1 1]
 [1 1 1 0 1 1]]

 new X_train
[[1 1 1 1 1]
 [2 0 1 1 1]
 [0 0 1 1 1]
 [1 1 1 1 1]]

 new_X_test1
[[0 0 1 0 0 1]
 [0 0 0 0 0 0]
 [0 0 1 0 0 0]
 [0 0 0 0 0 0]]

new_X_test2
[[0 0 1 0 1]
 [0 0 0 0 0]
 [0 0 1 0 0]
 [0 0 0 0 0]]

Conclusion

  • We built a custom feature selection method, relevant document frequency.
  • We learned how to transform the train and test sets accordingly.
  • We saw how sklearn's SelectKBest constructs its feature mask.
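
To consolidate the steps above, here is a minimal sketch that wraps RDF selection into a single reusable function. The function name and argument layout are illustrative, not an sklearn API:

In [ ]:
import numpy as np

def rdf_select(X_train, y_train, X_test, topk):
    """Relevant document frequency: keep the topk features that occur
    in the most relevant training documents."""
    rel = X_train[np.array(y_train, dtype=bool)]   # rows of the relevant documents
    used = rel.getnnz(0) > 0                       # drop features absent from all relevant docs
    X_train, X_test, rel = X_train[:, used], X_test[:, used], rel[:, used]
    score = np.asarray((rel > 0).sum(axis=0)).ravel()  # RDF score per remaining feature
    mask = np.zeros(len(score), dtype=bool)            # top-k mask, as in SelectKBest
    mask[np.argsort(score, kind="mergesort")[-topk:]] = True
    return X_train[:, mask], X_test[:, mask]

# should reproduce new_X_train2 and new_X_test2 from above
X_tr, X_te = rdf_select(X_train, y_train, X_test, topk=5)
print(X_tr.A)
print(X_te.A)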