High, Leigh!


Sorry for the belated follow-up on what I said I would do in my last post. Apart from some urgent business getting in the way, the main obstacle in tackling Stack Overflow’s 2017 survey is that the data is quite noisy: many observations have missing values, and even powerful imputation tools, such as mice, are not enough to solve this problem.

In order to avoid getting nonsense values, I decided to do some exploratory tests before going into clustering proper. Here are my preliminary results:

First of all, only 12,891 respondents out of 51,392 (roughly a quarter) answered the question on their annual salary. So, at least for preliminary purposes, these 12,891 are the only observations I can rely on.
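For anyone who wants to reproduce that count, here is a minimal sketch in Python, using only the standard library. It assumes the column is called Salary and that missing answers appear as "NA" or as an empty field. The rows in SAMPLE_CSV are invented stand-ins for survey_results_public.csv, not real survey data.

```python
import csv
import io
import math

# Toy stand-in for survey_results_public.csv; these rows are invented.
SAMPLE_CSV = """Respondent,Country,Salary
1,United States,NA
2,United Kingdom,NA
3,United States,113750
4,Germany,51282.05
5,Canada,
"""

def rows_with_salary(lines):
    """Return only the rows whose Salary field parses as a finite number."""
    kept = []
    for row in csv.DictReader(lines):
        raw = (row.get("Salary") or "").strip()
        try:
            salary = float(raw)
        except ValueError:
            # "NA", empty strings and other non-numeric text are treated as missing.
            continue
        if math.isnan(salary) or math.isinf(salary):
            continue
        row["Salary"] = salary
        kept.append(row)
    return kept

with_salary = rows_with_salary(io.StringIO(SAMPLE_CSV))
print(len(with_salary), "of 5 toy rows have a usable salary")
```

On the real file, the same function could be fed an open file handle instead of the StringIO object.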

Second, as would be expected, Country has the strongest correlation with Salary, as can be seen in the following table: https://www.dropbox.com/s/ycaqxlg0opd5xbv/TrainColumsCorrelation0002AllCountries.xls?dl=0 . (Folks: for your safety, you can check the URLs I am providing here on VirusTotal.com.) As a consequence, I went on to process countries one at a time, using only the observations that contain salaries.

Third, after reading the CSV with the survey data (survey_results_public.csv) into a data frame, the factor levels in FormalEducation do not have any sensible order. So I changed them to the best hierarchical order I reckoned I could provisionally work with: Primary or elementary school would be substituted with number 22, Secondary school with 23… up to Doctoral degree with 28. I excluded “Prefer not to answer”, because it represents an insignificant portion of the observations with Salary data, and I merged Master’s and Professional degrees into a single new category, 27.
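The recoding can be sketched in a few lines of Python (not necessarily the tool used for the analysis itself). Only the codes stated above are filled in. The levels between Secondary school and Master's degree are left out on purpose, because their codes are not spelled out here. The level labels are also an assumption and may not match the exact strings in the CSV.

```python
# Codes stated in the post. Levels whose codes are not spelled out there
# are deliberately omitted and would need to be added before real use.
STATED_CODES = {
    "Primary or elementary school": 22,
    "Secondary school": 23,
    "Master's degree": 27,       # merged with Professional degree
    "Professional degree": 27,   # merged with Master's degree
    "Doctoral degree": 28,
}

# Answers dropped from the analysis altogether.
EXCLUDED = {"Prefer not to answer"}

def recode_education(level, codes=STATED_CODES, excluded=EXCLUDED):
    """Return the ordinal code for a level, or None if it is excluded or unmapped."""
    if level in excluded:
        return None
    return codes.get(level)

for label in ("Secondary school", "Professional degree", "Prefer not to answer"):
    print(label, "->", recode_education(label))
```

Returning None for unmapped levels keeps such rows easy to spot and filter out, rather than silently assigning them a rank.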

Once I had a more-or-less clean set of 3,776 observations for the United States, I obtained the following summaries (in US Dollar annual salaries) for each of the education levels: https://www.dropbox.com/s/yom45u10i9a12ag/FrmlEducationLvls0006US.txt?dl=0 .

The annual salaries for the “Never completed formal education” level seem paradoxical, because of the very wide spread between salary levels; but we can disregard this category because its corresponding number of observations is insignificant.

If you look at the medians and the means, they pretty much form a sequence of stairs from the basic (primary school) to the highest (doctoral degree) education levels. The sequence of stairs is even more evident in the box plots you can find at the following URL: https://www.dropbox.com/s/1amisl5x5xc2r1x/BoxPlot0006US.jpg?dl=0 .
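A rough way to check the staircase pattern numerically is to group salaries by education code and test whether the medians never decrease from one level to the next. The Python sketch below does that with the standard library. The (code, salary) pairs are invented toy values, not figures from the survey, and the helper names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median

def summarize_by_level(records):
    """Group (education_code, salary) pairs and compute n, median and mean per code."""
    groups = defaultdict(list)
    for level, salary in records:
        groups[level].append(salary)
    return {
        level: {"n": len(values), "median": median(values), "mean": mean(values)}
        for level, values in sorted(groups.items())
    }

def is_staircase(summary, key="median"):
    """True if the chosen statistic never decreases as the education code increases."""
    values = [stats[key] for _, stats in sorted(summary.items())]
    return all(a <= b for a, b in zip(values, values[1:]))

# Invented toy values, for illustration only.
toy_records = [
    (22, 30000.0), (22, 50000.0),
    (23, 60000.0),
    (27, 90000.0), (27, 110000.0),
    (28, 120000.0),
]

summary = summarize_by_level(toy_records)
for level, stats in summary.items():
    print(level, stats)
print("staircase:", is_staircase(summary))
```

Using the median as the default makes the check less sensitive to a few extreme salaries than the mean would be.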

As a sort of preview of what we might find with a clustering analysis, I obtained a chart of the CART (Classification and Regression Tree) model for the US observations: https://www.dropbox.com/s/m2naucry62f0km2/CARTPlot0006UnitedStates.jpg?dl=0 . This CART model just confirms what I had found in the summaries and in the box plots.

My next step will be to produce clusters that might shed some light on the problem you posed in your article, on the Computer Science salary levels, by obtaining combinations of Country and MajorUndergrad. Also, the correlations table I mentioned earlier clearly tells us that we must try combinations of Country, EmploymentStatus, and YearsProgram.

Cheers, and I’ll keep in touch.

Jaime