Part 1 | How to better understand job descriptions and profiles through Deep Learning
We are back with another interesting tech-blog series, and this time we discuss a topic that has been the flavour of the season: how to apply Deep Learning to make hiring better.
So let’s get started!
Decoding job descriptions and resumes
In many organizations, including ours, the goal is to intelligently match candidates to a job description (JD). This demands understanding both the JD and the resume at a deeper level: wading through messy, unstructured text, extracting important skills and traits from each section, and computing a similarity score between a resume and a JD. This can become very difficult, because it is hard to teach a machine how to extract important information from unstructured text.
Ready for a fun fact?
Of all the animals that evolved on our planet, we are somehow the only species to have created a language for communication built on complex syntactic and semantic rules!
The rich set of vocabulary, contextual information and syntax makes it tough for a machine to parse important information. Yet, we are getting very good at understanding JDs and resumes using a set of new tools in our arsenal, which includes Deep Learning.
Why Deep Learning?
Deep Learning, based on modern neural networks, has been able to surpass human-level accuracy in various domains. It fascinates us because it can efficiently encode a lot of contextual information, which lets it learn the important features automatically. Earlier machine learning models required a lot of manual feature engineering, but Deep Learning has changed that.
Deep Learning can automatically infer most of the features, and when tuned properly with enough data it can beat shallow learning approaches in many cases. For example, a trained sequential model can differentiate between ‘working with project manager’ and ‘working as project manager’. A one-hot encoded vector could easily be fooled into thinking that the two sentences are almost identical, but a Deep Learning based recurrent neural network can use the contextual information to tell them apart.
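To make this concrete, here is a toy illustration (not from the original post) of how a pure bag-of-words view scores those two phrases as highly similar, even though one describes collaborating with a manager and the other describes being one:

```python
import math
from collections import Counter

def bow_cosine(s1, s2):
    """Cosine similarity between naive word-count (bag-of-words) vectors."""
    c1, c2 = Counter(s1.split()), Counter(s2.split())
    vocab = set(c1) | set(c2)
    dot = sum(c1[w] * c2[w] for w in vocab)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2)

sim = bow_cosine("working with project manager", "working as project manager")
print(round(sim, 2))  # 0.75 -- very similar, despite the opposite meanings
```

Because three of the four words overlap, the count vectors land close together; only a model that reads the words in order can use ‘with’ versus ‘as’ to separate the two meanings.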
Predicting the relevant title
Natural language has a lot of variations like the one in the earlier example. Although a naive algorithm could extract important skills from a job description and a resume, it would make a lot of silly mistakes. What if we could reduce a job description and a resume to a bird's-eye view and look at them from a different perspective altogether?
We also work directly on client-specific problems; one important example is hiring good external candidates from job portals like Naukri, Monster and Shine. Sometimes it is tough to identify the candidate title, because titles can be very specific to a client. For example, a client could internally call a role ‘Desktop Engineer Band 2’, while the actual role is ‘Desktop Support Engineer’. When searching for candidates externally, it is better to use titles like ‘desktop engineer’ or ‘desktop support engineer’: querying with these alternative titles improves the relevancy of the search results. Here is one such job description. If we can predict these alternative titles, they can be used as a way to normalize titles.
There has been criticism of Deep Learning and its black-box nature. It looks like magic, but at its core it really boils down to dense vector representations and a lot of super-cool mathematics.
Why do you need a Deep Learning model when you can train multiple linear classifiers and directly interpret which features are affecting your result?
We can give several reasons why shallow learning methods are inefficient here. As a good debater once told me: when you don’t have one solid reason, you need to have several!
First, linear or logistic regression considers each class independently and trains multiple classifiers. That does not reflect the real world, where classes can interact and be correlated with one another. A Deep Learning model shares parameters among all the classes, which generalizes better to unseen data than individual classifiers do. There is also research suggesting that deep learning can be quite resistant to noise in the labels. Deep Learning models scale well with data, too: since they are built on matrix multiplications, they can be implemented efficiently on a GPU. Moreover, they are not really black boxes. You can interpret them much as you would interpret a linear regression model, just with a little more effort. And current research on attention models and on combining reinforcement and supervised learning is guiding us towards more interpretable deep learning models.
To achieve the task of extracting relevant titles, we trained a deep recurrent neural network on almost 5 million parsed resume, JD and title pairs, drawn from various domains. In the data preparation stage, classes with a very low number of occurrences were removed. The samples were tokenized into sentences, and the sentences were tokenized into words. Each word is assigned an index with a random embedding that is learned during the training phase. Pre-processing was kept to a minimum apart from lowercasing the words, and we did not identify any phrases beforehand. The model hyperparameters: a word embedding size of 300, a word-level Gated Recurrent Unit (GRU) with a hidden state of size 200, a sentence-level GRU with a hidden state of size 200, and around 2,600 classes.
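A minimal sketch of those preparation steps might look like the following. The cutoff value, the naive sentence splitting on full stops, and the helper names are illustrative assumptions, not the production pipeline:

```python
from collections import Counter

MIN_CLASS_COUNT = 50  # assumed cutoff for "very low number of occurrences"

def filter_rare_classes(pairs, min_count=MIN_CLASS_COUNT):
    """Drop (text, title) pairs whose title class occurs too rarely."""
    counts = Counter(title for _, title in pairs)
    return [(text, title) for text, title in pairs if counts[title] >= min_count]

def tokenize(document):
    """Lowercase, split into sentences, then split sentences into words."""
    sentences = [s.strip() for s in document.lower().split(".") if s.strip()]
    return [s.split() for s in sentences]

def build_vocab(tokenized_docs):
    """Assign each word an integer index; its embedding is learned later."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for doc in tokenized_docs:
        for sentence in doc:
            for word in sentence:
                vocab.setdefault(word, len(vocab))
    return vocab
```

Each index then maps to a randomly initialized 300-dimensional embedding that the network updates during training.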
The model uses a hierarchical structure, inspired by this research work, that encodes the given unstructured text into a vector using recurrent neural networks. Each sentence is passed through a Gated Recurrent Unit (GRU), which tries to capture the important information in that sentence. The resulting sentence vectors are then combined by another GRU into a single document vector. With this structure, all the complexity in the unstructured text is reduced to one vector, which can be used to predict the title it belongs to.
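To make the hierarchy concrete, here is a minimal numpy sketch of the idea: a word-level GRU encodes each sentence, and a sentence-level GRU encodes the sequence of sentence vectors into one document vector. The dimensions are scaled down from the post's 300/200/200 for readability, and the weight shapes are our own illustrative convention, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def gru_step(x, h, W, U, b):
    """One GRU step. W, U, b stack the update, reset and candidate gates
    along axis 0: shapes (3, hidden, in), (3, hidden, hidden), (3, hidden)."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    z = sigmoid(W[0] @ x + U[0] @ h + b[0])              # update gate
    r = sigmoid(W[1] @ x + U[1] @ h + b[1])              # reset gate
    h_tilde = np.tanh(W[2] @ x + U[2] @ (r * h) + b[2])  # candidate state
    return (1 - z) * h + z * h_tilde

def encode_sequence(xs, params, hidden):
    """Run a GRU over a sequence of vectors; return the final hidden state."""
    W, U, b = params
    h = np.zeros(hidden)
    for x in xs:
        h = gru_step(x, h, W, U, b)
    return h

def make_params(in_dim, hidden):
    return (rng.normal(0, 0.1, (3, hidden, in_dim)),
            rng.normal(0, 0.1, (3, hidden, hidden)),
            np.zeros((3, hidden)))

EMB, WORD_H, SENT_H = 8, 6, 6
word_params = make_params(EMB, WORD_H)
sent_params = make_params(WORD_H, SENT_H)

# A toy "document" of two sentences, each a sequence of word embeddings.
doc = [rng.normal(size=(5, EMB)), rng.normal(size=(3, EMB))]
sentence_vecs = [encode_sequence(s, word_params, WORD_H) for s in doc]
doc_vec = encode_sequence(sentence_vecs, sent_params, SENT_H)
print(doc_vec.shape)  # (6,) -- one vector summarising the whole document
```

In the real model this final vector feeds a softmax layer over the ~2,600 title classes.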
All the weights are initialized randomly, and the whole network is optimized with a negative log-likelihood objective, with a softmax at the end to normalize the probabilities. The model was trained with the SGD optimizer; the learning rate was chosen by closely monitoring validation loss on a small dataset, and gradient clipping was also used. The validation loss was monitored at the end of each epoch, with early stopping used as a regularization technique to avoid overfitting. At the end of the seventh epoch the validation loss started increasing, so training was halted. The model reached around 24.5% top-1 accuracy before it stopped improving.
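The early-stopping loop described above can be sketched as follows. This is a simplified version that halts on the first epoch where validation loss fails to improve; `train_one_epoch` and `validation_loss` are placeholders standing in for the real training and evaluation code:

```python
def train_with_early_stopping(train_one_epoch, validation_loss, max_epochs=50):
    """Train epoch by epoch, stopping when validation loss stops improving."""
    best = float("inf")
    history = []
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        loss = validation_loss()
        history.append(loss)
        if loss >= best:  # validation loss started increasing -> halt
            print(f"stopping at epoch {epoch}")
            break
        best = loss
    return history
```

A production version would typically also keep a checkpoint of the best-scoring weights and allow a patience of a few epochs before halting.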
Here is the training loss of the model plotted against minibatch number.
This was the first part of a two-part blog series on how Deep Learning can effectively help us decode JDs. We hope you enjoyed reading it and found the information helpful. In the next part, we will discuss the results this Deep Learning model produces in practice and how we can use it for better ‘Search and Match’.
Sandeep Tammu is a Data Scientist with EdGE Networks. With a desire to probe and experiment with the real world, Sandeep pursued his education in electronics and physics. As a Data Scientist, he strives to bring the same level of enthusiasm to exploring and gaining valuable insights from various kinds of data using machine learning and predictive modelling techniques.