This tutorial presents techniques for manipulating features in text categorization problems. It explains how to work with text data, transform it into numbers, build custom feature selection methods, and identify significant properties.
The level of this tutorial is intermediate. It requires a basic understanding of NumPy, SciPy sparse matrices, and the sklearn library.
Have you ever wondered how sklearn's functions work? Have you ever found them insufficient, or wanted to modify some of the feature selection modules? I hope this tutorial answers some of your questions and helps you start hacking your way through data science.
Suppose you have the following corpus:
corpus = [
    'This is the first document this.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'This is an example of feature selection',
    'I love kittens',
    'My favourite book is Harry Potter',
    "Do not beleive everything you read"
]
Let's try this out in the most straightforward way:
# counting words
corpus = [
    'This is the first document',
    'This document is the second document',
    'And this is the third one',
    'Is this the first document',
    'This is an example of feature selection',
    'I love kittens',
    'My favourite book is Harry Potter',
    "Do not beleive everything you read"
]
result = 0
for text in corpus:
    for word in text.split():
        if word == "document":
            result = result + 1
print(result)
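The same tally can be written more compactly with the built-in list method count; an equivalent sketch:

# count() tallies the occurrences of "document" in each tokenized text
result = sum(text.split().count("document") for text in corpus)
print(result)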
# deleting words
corpus = [
    'This is the first document',
    'This document is the second document',
    'And this is the third one',
    'Is this the first document',
    'This is an example of feature selection',
    'I love kittens',
    'My favourite book is Harry Potter',
    "Do not beleive everything you read"
]
k = 0
for text in corpus:
    # build a new list instead of calling remove() while iterating,
    # which would skip the element right after each removal
    temp_list = [word for word in text.split() if word != "document"]
    corpus[k] = ' '.join(temp_list)
    k = k + 1
print(corpus)
# identifying words
corpus = [
    'This is the first document',
    'This document is the second document',
    'And this is the third one',
    'Is this the first document',
    'This is an example of feature selection',
    'I love kittens',
    'My favourite book is Harry Potter',
    "Do not beleive everything you read"
]
identification_list = []
for text in corpus:
    flag = 0
    for word in text.split():
        if word == "document":
            flag = 1
    identification_list.append(flag)
print(identification_list)
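The same presence check fits in a one-line comprehension; an equivalent sketch:

# "document" in text.split() gives True/False; int() turns it into 1/0
identification_list = [int("document" in text.split()) for text in corpus]
print(identification_list)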
Instead of scanning the documents one by one, try converting the corpus into a matrix representation and performing all the actions column-wise and row-wise. We can use CountVectorizer from sklearn to achieve that:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

corpus = [
    'This is the first document',
    'This document is the second document',
    'And this is the third one',
    'Is this the first document',
    'This is an example of feature selection',
    'I love kittens',
    'My favourite book is Harry Potter',
    "Do not beleive everything you read"
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())
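CountVectorizer also exposes the learned word-to-column mapping through its vocabulary_ attribute, which is a convenient way to locate the column of a specific word (we rely on this position below):

# the learned word -> column-index mapping
print(vectorizer.vocabulary_["document"])  # column index of "document"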
# count how many times each word appears in the corpus
k = 0
amount_of_features = X.shape[1]
score = []
while k < amount_of_features:
    arr = X[:, k].toarray()  # column k as a dense array
    score.append(np.sum(arr))
    k = k + 1
print(score)
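The loop above works, but SciPy sparse matrices can do the same column-wise sum in a single call. A minimal equivalent, assuming the X produced above:

# summing over rows gives per-feature totals as a 1 x n_features matrix;
# ravel() flattens it into an ordinary 1-d array
score = np.asarray(X.sum(axis=0)).ravel()
print(score)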
Masks are boolean arrays that act upon sparse matrices and select or drop their elements according to some pattern. For example, if we want to drop the column containing the word "document", we have to make a mask that acts upon that column. Features are sorted alphabetically, so the column containing the word is number 5 (0-indexed).
pattern = []
for feature in vectorizer.get_feature_names_out():
    if feature == "document":
        pattern.append(False)
    else:
        pattern.append(True)
print(pattern)
mask = np.array(pattern, dtype=bool)
# drop the column containing the word document
print('old X')
print(X.A)
print("\n new X")
X_new=X[:,mask]
print(X_new.toarray())
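The loop that builds pattern can also be collapsed into one vectorized comparison, since get_feature_names_out() returns a NumPy array. A short sketch under that assumption:

# elementwise comparison yields a boolean mask: True everywhere
# except at the "document" column
mask = vectorizer.get_feature_names_out() != "document"
X_new = X[:, mask]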
Let's say we have the same corpus, this time labeled with relevant and non-relevant documents.
How would you remove all non-relevant features from the corpus?
document | relevance
---|---
This is the first document this | Yes
This document is the second document | Yes
And this is the third one | No
Is this the first document | Yes
This is an example of feature selection | Yes
I love kittens | No
My favourite book is Harry Potter | No
Do not beleive everything you read | Yes
corpus = [
    'This is the first document',
    'This document is the second document',
    'And this is the third one',
    'Is this the first document',
    'This is an example of feature selection',
    'I love kittens',
    'My favourite book is Harry Potter',
    "Do not beleive everything you read"
]
relevance = [1, 1, 0, 1, 1, 0, 0, 1]
mask = np.array(relevance, dtype=bool)
print("old X")
print(X.A)
print("\n new X")
new_X = X[mask]
print(new_X.A)
# make the mask: getnnz(0) counts the non-zero entries per column, so this
# keeps only the features that appear in at least one relevant document
mask = new_X.getnnz(0) > 0
print(mask)
# drop the columns
print("old new_X")
print(new_X.A)
new_X = new_X[:, mask]
print("\n new new_X")
print(new_X.A)
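The two masking steps (drop non-relevant rows, then drop the features that no longer occur) bundle naturally into a small helper. This is a sketch of the steps above; the name keep_relevant is made up for illustration:

def keep_relevant(X, relevance):
    """Keep relevant rows, then drop features absent from them."""
    row_mask = np.array(relevance, dtype=bool)
    X_rel = X[row_mask]              # relevant documents only
    col_mask = X_rel.getnnz(0) > 0   # features seen in relevant docs
    return X_rel[:, col_mask], col_mask

new_X, col_mask = keep_relevant(X, relevance)
print(new_X.A)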
RDF (relevant document frequency) is a filtering feature selection method that assigns each feature a score based on how many relevant documents it occurs in. It then selects the best-scoring features and produces a better train set.
Using the same corpus, we will select the top 5 features and transform both the train and the test set.
document | relevance | set
---|---|---
This is the first document this | Yes | train_set
This document is the second document | Yes | train_set
And this is the third one | No | train_set
Is this the first document | Yes | train_set
This is an example of feature selection | Yes | test_set
I love kittens | No | test_set
My favourite book is Harry Potter | No | test_set
Do not beleive everything you read | Yes | test_set
x_train = [
    'This is the first document',
    'This document is the second document',
    'And this is the third one',
    'Is this the first document'
]
y_train = [1, 1, 0, 1]
x_test = [
    'This is an example of feature selection',
    'I love kittens',
    'My favourite book is Harry Potter',
    "Do not beleive everything you read"
]
y_test = [1, 0, 0, 1]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(x_train)
X_test = vectorizer.transform(x_test)
print(X_train.A)
mask1 = np.array(y_train, dtype=bool)  # row mask selecting only the relevant documents
x_rel_train = X_train[mask1]           # matrix containing only the relevant documents
# column mask: a 1-d boolean array over all features, True for features
# that occur in relevant documents and False for the rest
mask = x_rel_train.getnnz(0) > 0
x_rel_train = x_rel_train[:, mask]     # matrix of [rel_docs x rel_features]
print("x_rel is:")
print(x_rel_train.A)
new_X_test1 = X_test[:, mask]
new_X_train1 = X_train[:, mask]
print("\nthe new_x_train without non-relevant features is\n")
print(new_X_train1.A)
k = 0
score = []
amount_of_features = x_rel_train.shape[1]
while k < amount_of_features:
    arr = x_rel_train[:, k].toarray()
    # we need the document frequency, not the total count: any value above
    # zero becomes 1, as if CountVectorizer had been run with binary=True
    arr = np.where(arr > 0, 1, 0)
    score.append(np.sum(arr))
    k = k + 1
print(score)
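As before, the scoring loop collapses into a couple of sparse operations; the same document frequencies could also be obtained by vectorizing with CountVectorizer(binary=True) in the first place. A minimal sketch, assuming the x_rel_train from above:

# (x_rel_train > 0) binarizes the counts; the column sums then give the
# number of relevant documents each feature occurs in
score = np.asarray((x_rel_train > 0).sum(axis=0)).ravel()
print(score)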
# creating the mask: this mirrors how sklearn's SelectKBest builds its mask,
# with True for the selected features and False for the rest
topk = 5
mask = np.zeros(len(score), dtype=bool)
mask[np.argsort(score, kind="mergesort")[-topk:]] = True
print(mask)
print("\n")
new_X_train2= new_X_train1[:,mask]
new_X_test2 = new_X_test1[:,mask]
print("old X_train")
print(new_X_train1.A)
print("\n new X_train")
print(new_X_train2.A)
print("\n new_X_test1")
print(new_X_test1.A)
print("\nnew_X_test2")
print(new_X_test2.A)
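Putting the walkthrough together, the whole RDF selection can be packaged as one reusable function. This is a sketch of the steps above, not a library API; the name rdf_select is made up for illustration:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def rdf_select(x_train, y_train, x_test, topk=5):
    """Score features by relevant-document frequency and keep the top k."""
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(x_train)
    X_test = vectorizer.transform(x_test)
    # keep only the features that occur in relevant documents
    rel = X_train[np.array(y_train, dtype=bool)]
    feat_mask = rel.getnnz(0) > 0
    rel = rel[:, feat_mask]
    X_train, X_test = X_train[:, feat_mask], X_test[:, feat_mask]
    # relevant-document frequency per remaining feature
    score = np.asarray((rel > 0).sum(axis=0)).ravel()
    # top-k mask, as in sklearn's SelectKBest
    top_mask = np.zeros(len(score), dtype=bool)
    top_mask[np.argsort(score, kind="mergesort")[-topk:]] = True
    return X_train[:, top_mask], X_test[:, top_mask]

new_X_train, new_X_test = rdf_select(x_train, y_train, x_test, topk=5)
print(new_X_train.A)
print(new_X_test.A)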