For those who are actively looking for data scientist jobs in the U.S., the best news this month is the LinkedIn Workforce Report August 2018. According to the report, there is a shortage of 151,717 people with data science skills, with particularly acute shortages in New York City, the San Francisco Bay Area, and Los Angeles. To help job hunters better understand the job market, Shanshan Lu scraped the Indeed website and collected information on 7,000 data scientist jobs around the U.S. on August 3rd. The collected fields are: Company Name, Position Name, Location, Job Description, and Number of Reviews of the Company.
I am on schedule to graduate with a PhD in EE this December and I've begun applying to machine learning jobs. Something that has confused me a bit is how many different titles exist for very similar jobs (e.g., data scientists and machine learning engineers share many technical qualifications, such as SQL, Python, Torch/TensorFlow, and R).
Given Kaggle's active community, a number of code repositories already address these questions, including Exploration - Data Scientist job market:
Data Scientist Job Market(U.S)-Data Viz:
While the classifier repository achieved ~77% classification precision using job descriptions to predict the position name, it did not analyze that relationship to reveal which positions require which responsibilities, requirements, or whatever else a description includes. My research questions for this data vis project are:
I Googled the most common job positions; here are their definitions:
I've found that linear SVM classifiers are indeed generally among the best for NLP; however, a bagging ensemble of 30 classifiers improved classification precision from 77% to 95%. Additionally, I've found that incorrect classifications occur most often with the "Data analyst" and "Manager" positions because they have the least data, not because the job descriptions are poorly correlated with the job titles. I've inspected the job descriptions of the incorrect classifications for these two classes and they seem fine, including highly correlated keywords such as "team work" and "leading" for the "manager" class. Over the next work week I'm going to begin visualizing an interactive confusion matrix plot from "cm.csv" in D3 with React. Forked this project from Dr. Kelleher's Vega-Lite API Template, created this Python repo for generating viz data, and created this gist from the Python repo.
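To make the per-class error analysis above concrete, here is a minimal sketch of computing precision and recall for each class from a confusion matrix. It assumes that rows are true classes and columns are predicted classes, which may not match the layout of "cm.csv". The class names and counts below are made up for illustration and are not the project's results.

```python
def per_class_metrics(labels, cm):
    """Return {label: (precision, recall)}.

    Assumes cm[i][j] counts samples whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_positives = cm[k][k]
        predicted_as_k = sum(row[k] for row in cm)  # column sum
        actually_k = sum(cm[k])                     # row sum
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        metrics[label] = (precision, recall)
    return metrics


# Illustrative counts only (hypothetical, not taken from cm.csv).
labels = ["data scientist", "data analyst", "manager"]
cm = [
    [90, 2, 1],  # true "data scientist"
    [8, 10, 0],  # true "data analyst"
    [5, 0, 6],   # true "manager"
]

metrics = per_class_metrics(labels, cm)
for label, (precision, recall) in metrics.items():
    print(f"{label}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy matrix the two small classes lose most of their samples to the largest class, which shows up as low recall even though their precision stays high. That is the kind of pattern a lack of training data would produce.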
Made a prototype scatter plot of t-SNE embeddings of the raw position titles, color-coded by the assigned position group class. Added interaction to show the raw position title of the scatter point under the mouse pointer. Looking at some of the "other" data points, I've adjusted the class assignments to be a bit more accurate. For instance, any position title with the word "executive" is now in the "data science manager" class, where before it was "other".
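The class assignment described above can be sketched as a keyword lookup over the raw title. Only the "executive" rule comes from this log; the other keywords, the rule order, and the function name `assign_group` are hypothetical and stand in for whatever the Python repo actually does.

```python
# Ordered (keyword, group) rules; the first match wins.
RULES = [
    ("executive", "data science manager"),             # rule described in this log
    ("manager", "data science manager"),               # hypothetical
    ("machine learning", "machine learning engineer"), # hypothetical
    ("analyst", "data analyst"),                       # hypothetical
    ("scientist", "data scientist"),                   # hypothetical
]


def assign_group(title):
    """Map a raw position title to a position group, defaulting to "other"."""
    lowered = title.lower()
    for keyword, group in RULES:
        if keyword in lowered:
            return group
    return "other"


print(assign_group("Executive Director, Analytics"))
print(assign_group("Senior Data Scientist"))
print(assign_group("Software Developer"))
```

Because the first matching rule wins, the order of `RULES` decides how titles that match several keywords are grouped.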
Changed the code from Vega-Lite to D3 by forking Interactive Color Legend, which comes with the added feature of highlighting classes when mousing over them in the legend. Added a large marker at the average location of each of the 5 classes for quicker digestion, but the markers lack visibility. Removed the axes and grid.
Changed the class centroid text from the mean location to the median location. Changed the class centroid text to match the color of the class legend and to become transparent when mousing over the legend. Added a white rectangle background to all class centroid texts for increased visibility. Added a description plot, so there are now two embedding plots that represent the inputs and outputs of the machine learning classifier.
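As a rough illustration of the mean-to-median change, the sketch below computes a per-class label position from 2D embedding points. It assumes the median is taken per coordinate (a coordinate-wise median, not a geometric median), which this log does not confirm. The points and the name `class_centroids` are invented for the example.

```python
from collections import defaultdict
from statistics import mean, median


def class_centroids(points, reducer=median):
    """points: iterable of (x, y, label). Returns {label: (cx, cy)}."""
    xs = defaultdict(list)
    ys = defaultdict(list)
    for x, y, label in points:
        xs[label].append(x)
        ys[label].append(y)
    return {label: (reducer(xs[label]), reducer(ys[label])) for label in xs}


# Three nearby points plus one far-away outlier (illustrative values).
points = [
    (1.0, 1.0, "data scientist"),
    (2.0, 1.5, "data scientist"),
    (1.5, 2.0, "data scientist"),
    (40.0, -30.0, "data scientist"),
]

median_centroids = class_centroids(points)
mean_centroids = class_centroids(points, reducer=mean)
print("median:", median_centroids["data scientist"])
print("mean:  ", mean_centroids["data scientist"])
```

A single stray point drags the mean far from the cluster, while the median stays inside it. That keeps a class label sitting on top of the bulk of its points.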
Began making the confusion matrix by creating two intersecting scaleBands in index.js, which are called in a <div> and <rect> in ConfusionGroup.js, following this bar chart example. Reformatted the confusion matrix CSV file to consist of 'row', 'column', and 'value' columns for better use with scaleBand. No visual progress to report.
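The reformatting step above might look like the following sketch, which flattens a square matrix into one record per cell with 'row', 'column', and 'value' fields. The helper names and the example labels and counts are hypothetical. The assumption that 'row' holds the true class and 'column' the predicted class is mine, not something this log states.

```python
import csv
import io


def to_long_format(labels, matrix):
    """Flatten a square matrix into one {'row', 'column', 'value'} dict per cell."""
    return [
        {"row": labels[i], "column": labels[j], "value": matrix[i][j]}
        for i in range(len(labels))
        for j in range(len(labels))
    ]


def write_long_csv(records, handle):
    """Write the records with a header row to any text file-like object."""
    writer = csv.DictWriter(handle, fieldnames=["row", "column", "value"])
    writer.writeheader()
    writer.writerows(records)


# Illustrative 2x2 matrix (hypothetical counts).
labels = ["data analyst", "manager"]
matrix = [[10, 2], [1, 6]]

records = to_long_format(labels, matrix)
buffer = io.StringIO()
write_long_csv(records, buffer)
print(buffer.getvalue())
```

One record per cell maps directly onto one rectangle per datum, so each cell's x and y positions can be looked up from its 'column' and 'row' values.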
Gave up on scaleBands in favor of <rect> objects. Created a first draft of the confusion matrix. The viz is too big to fork, so I've continued making screenshots.
Corrected an error in the confusion matrix's values and added interactivity to the confusion matrix, highlighting the linear SVM's classification performance for whichever class is moused over in the legend.