Rocchio and KNN — Job post detection

Kiu Lam
4 min read · Oct 28, 2017

People occasionally share job-related posts in my college Facebook group.

Example 01
Example 02

Posts like these often get a response from one person, directing the poster to a specific group.

Example
Yeah, Ollie, this is for you

Wouldn’t it be easier if you could automate it?

Seriously, this was the first thing that came to my mind after seeing the same person (yeah, this is for you, Ollie) tirelessly directing people to the specific group for this kind of post.

Here’s what I can do: write a bot that automatically flags whether a post is a job post, then automatically posts a response if it is.

In this article, I will mainly discuss the algorithms behind my job post detection system, the Rocchio classifier and K-nearest neighbors, in a tf-idf vector space model.

Before that, let’s briefly introduce the idea behind these two algorithms.

Rocchio classifier:

Given a list of vectors in category “A” and a list of vectors in category “B”, we can sum the vectors within each category to form that category’s prototype vector.

For any query submitted to the system, it can then determine whether the query belongs to category “A” or “B” by computing the cosine similarity between the query and each prototype vector and picking the more similar one.
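The idea can be sketched in a few lines of plain Python. This is only an illustration: the three-dimensional vectors are made-up numbers standing in for tf-idf weights, and `cosine`, `prototype` and `rocchio_classify` are hypothetical helper names, not taken from the original code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def prototype(vectors):
    """Sum the vectors of one category component by component."""
    return [sum(column) for column in zip(*vectors)]

def rocchio_classify(query, prototypes):
    """Return the label whose prototype is most similar to the query."""
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))

# Made-up vectors for illustration only.
category_a = [[1.0, 0.2, 0.0], [0.8, 0.1, 0.1]]
category_b = [[0.0, 0.3, 1.0], [0.1, 0.2, 0.9]]

prototypes = {"A": prototype(category_a), "B": prototype(category_b)}
label = rocchio_classify([0.9, 0.1, 0.0], prototypes)
print(label)
```

Because cosine similarity ignores vector length, summing the vectors and averaging them give the same ranking here.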

KNN classifier:

KNN, short for K-Nearest Neighbors, decides the category by computing the cosine similarity between the query and every vector in the corpus, taking the K closest ones, and returning the category that the majority of them belong to.

In this example, I have a K size of 5, and the majority of the vectors returned are “A” (A, A, A, B, B). Therefore the submitted query belongs to category “A”.
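The majority vote can be sketched in plain Python as well. Again, this is a minimal illustration with made-up vectors standing in for tf-idf weights; `cosine` and `knn_classify` are hypothetical helper names, not from the original code.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def knn_classify(query, labeled_vectors, k):
    """Rank all vectors by similarity to the query, then let the top k vote."""
    ranked = sorted(
        labeled_vectors,
        key=lambda item: cosine(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up corpus: three "A" vectors and two "B" vectors.
corpus = [
    ([1.0, 0.1, 0.0], "A"),
    ([0.9, 0.2, 0.1], "A"),
    ([0.8, 0.3, 0.0], "A"),
    ([0.1, 0.2, 1.0], "B"),
    ([0.0, 0.3, 0.9], "B"),
]

# With k=5 the five neighbours vote A, A, A, B, B.
label = knn_classify([0.9, 0.1, 0.1], corpus, k=5)
print(label)
```

Unlike Rocchio, nothing is precomputed per category: every query is compared against every stored vector.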

Here’s my approach. I have a corpus of job and non-job posts, five of each. I am going to use the popular sklearn library to transform them into tf-idf vectors, and then compare the accuracy of KNN (K set to 3 by default) and the Rocchio classifier on a set of test cases.
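As a rough sketch of that setup (not the code from the repo), something like the following would work. The ten posts are invented for illustration, `predict_knn` and `predict_rocchio` are hypothetical helper names, and Rocchio is written by hand with NumPy here rather than taken from a library class.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Invented mini-corpus: 5 job posts and 5 non-job posts.
job_posts = [
    "We are hiring a software engineer intern, apply with your resume",
    "Job opening: part time barista position, send resume today",
    "Looking to hire a marketing assistant, full time position",
    "Internship opportunity for students, apply now for the position",
    "Hiring tutors for the fall semester, paid position, apply by email",
]
other_posts = [
    "Does anyone know when the library closes tonight",
    "Selling my calculus textbook, message me if interested",
    "Lost my student id card near the cafeteria",
    "Who wants to join our study group for the midterm",
    "Free pizza at the club meeting this friday",
]
texts = job_posts + other_posts
labels = np.array(["job"] * len(job_posts) + ["not job"] * len(other_posts))

# Turn every post into a tf-idf vector (dense is fine for 10 posts).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts).toarray()

# KNN with K = 3, ranking neighbours by cosine distance.
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X, labels)

# Rocchio: one prototype per category, the sum of its tf-idf vectors.
prototypes = {name: X[labels == name].sum(axis=0) for name in ("job", "not job")}

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def predict_knn(text):
    return knn.predict(vectorizer.transform([text]).toarray())[0]

def predict_rocchio(text):
    query = vectorizer.transform([text]).toarray().ravel()
    return max(prototypes, key=lambda name: cosine(query, prototypes[name]))

query = "We are hiring, please apply with your resume for this position"
print(predict_knn(query), predict_rocchio(query))
```

Words in a query that never appear in the corpus are simply dropped by `transform`, which is one reason a ten-post corpus is fragile.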

Here’s the code

And here’s the original result

KNN needs a bigger corpus, because it takes the majority label among the K documents returned; in general, the bigger the corpus, the higher the accuracy.

1-nearest neighbor

For now, with a corpus size of 10, it is better to keep K small too.

Oh! What happens if K is 10 with a corpus of 10?
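A small sketch makes the problem visible. When K equals the corpus size, the "nearest" neighbours are the whole corpus no matter how a query ranks them, so the vote only reflects how many posts each category has. The shuffles below stand in for the rankings that different queries would produce; `top_k_votes` is a hypothetical helper name.

```python
import random
from collections import Counter

# Same 5 + 5 split as the corpus described above.
labels = ["job"] * 5 + ["not job"] * 5

def top_k_votes(ranked_labels, k):
    """Count the labels of the k best-ranked documents."""
    return Counter(ranked_labels[:k])

rng = random.Random(0)
tallies = []
for _ in range(3):
    ranking = labels[:]      # a different query would rank the posts differently
    rng.shuffle(ranking)
    tallies.append(top_k_votes(ranking, k=len(labels)))

print(tallies)
```

Every tally is a 5 to 5 tie, so the predicted category comes down to tie-breaking rather than to the query itself.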

A good example: I have a category for “uncle” on my mother’s side and another for “uncle” on my father’s side. If I submit a vector that mentions “uncle”, which

On the other hand, Rocchio may do worse with polymorphic vectors. For instance, if both prototype vectors are really close to each other, which one should the system return?

At least, if I allow the vectorizer to use n-grams, the accuracy should increase, because there are specific phrases people tend to use in job-related posts.

vectorizer = TfidfVectorizer(ngram_range=(1, 500))
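To see what n-grams add, here is a small sketch comparing the vocabulary with and without bigrams. The two posts are invented, and the upper bound of 2 is chosen only to keep the output readable; an upper bound as large as 500 would also turn long runs of words, up to whole posts, into features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example posts.
posts = ["we are hiring interns", "send your resume today"]

unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(posts)
with_bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(posts)

# Features that only exist once two-word phrases are allowed.
bigram_terms = sorted(set(with_bigrams.vocabulary_) - set(unigrams.vocabulary_))
print(bigram_terms)
```

Phrases such as "are hiring" or "your resume" become features of their own, which is what lets the classifier pick up on typical job-post wording.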

That sums up how the job detection system works. Using sklearn for machine learning tasks is worthwhile! GitHub repo
