Software for Data Science, Data Mining, Data Analytics, and Machine Learning

Lianfen Qian

In the era of big data, data scientists are in high demand. What basic skills does a data scientist have to learn, and which software languages does one need to know?

The top eight basic skills a data scientist has to pick up are programming, data visualization and communication, data intuition, statistics, data wrangling, machine learning, software engineering, and multivariate calculus and linear algebra. The following matrix of job requirements for data analysts, machine learning engineers, data engineers, and data scientists shows how important each skill is for each type of job (source: Udacity).

Data Science Skills - Udacity - Matrix

Computing skill is essential for a data scientist. Several surveys track the tools data scientists use, and one comparison covers the three most widely used: R, Python, and SAS. Note that among these three, R and Python are free, while SAS is licensed software. The results show that data scientists with fewer than five years of experience are more likely to use Python than R or SAS, while consulting firms, large companies, and healthcare companies are more likely to use SAS. A list of tools and platforms used by data scientists has also been compiled.

My question is: why do large companies and healthcare companies prefer licensed software over free software?

I asked a VP of a financial group at a large company. He told me that the company prefers licensed software because they believe it is more reliable. By "reliable," he meant that the license fee effectively buys insurance. That is, with licensed software, if an analysis goes wrong because of the software, the software vendor is responsible for the faulty analysis, and the user company can file a claim and may receive a large payout.

I am an R user. Comparing R and SAS, I don't think SAS is more reliable than R for statistical analysis, since both are written by statistics researchers, generally with PhD or MS degrees in statistics. Although Python has become popular over the last five years, it is still under development in terms of rigorous or advanced statistical methodology. The main reasons Python has caught up so fast are that it is a general-purpose language and that it is easy to learn. For those doing classical data analysis, Python can be the right choice. For researchers developing new statistical methods, I still prefer R. This is also supported by Muenchen (2016), who compared R and SAS and concluded that R has surpassed SAS in scholarly use.

How about Matlab vs. R? Overall, they are similar. MathWorks promotes Matlab with a list of reasons, while Alakent et al. prefer R in their blog posts.






