People occasionally posted job-related posts to my college Facebook group
post like this often get a response for a person, to direct them to a specific group
Wouldn’t be easier if you can automate it
Seriously, this is the first thing that came up in my mind, after seeing the same person, yea this is for you Ollie, who restlessly directing people to the specific group for this specific post.
Here’s what I can do, I can write a bot, to automatically flagged whether the post is a job post or not, then automatically posted a response if it’s a job post.
In this article, I will mainly discuss of the algorithms, Rocchio classifier and K-nearest neighbor, behind my job post detection system, in tfidf vector space model.
Before so, let’s briefly introduce the idea behind these two algorithms
Rocchio classifier:
Given a list of vectors in category “A”, and a list of vectors in category “B”. We can sum up all of their vectors from the same category to form their corresponding prototype vector
So for any of the query submitted to the system, it can determine whether the query is belong to category “A” or “B” by performing cosine similarity.
KNN classifier:
KNN, also known as K-Nearest Neighbor, determines the outcome of the category, by performing cosine similarity on each of them, and the closet in K size
Here’s my approach. I have a corpus of job and non-job posts, with a size of 5 for each, I am going to use the popular sklearn library, to help to transform them into vectors perspective, and finally compare the accuracy with KNN (default to 3) and Rocchio classifier with test cases
Here’s the code
And here’s the original result
KNN requires a bigger size of the corpus, as it’s checking the majority of the documents returned in K size, thus the bigger the corpus, the higher the accuracy.
For now, with the corpus size of 10, it would be nice if the K size is small too
Oh! What happens if K is 10 with the corpus of 10
In another side, Rocchio may did poorer with polymorphic vectors, for instance, if both of the prototype vector are really closed to each other, then which one would the system return?
At least, if I allowed the vectorizer to have ngram, then the accuracy would be increased. Because there are some specific phrases people are going to use for job-related posts.
vectorizer = TfidfVectorizer(ngram_range=(1, 500))
At least this sums up how the job detection system work. Using sklearn to do machine learning tasks is worthwhile! Github repo