Three qualities that make a good data scientist
First of all, before I try to define what makes a good data scientist, I’ll do my best to define exactly what a data scientist is, and how their job differs from that of a statistician, an analyst or any other overlapping profession. I spent a lot of time comparing my opinion with that of the online community, only to find that this seems to be a point of some controversy.
What I found was that one person’s data scientist can be another’s business analyst. In fact, along with “big data”, the term “data scientist” is sometimes regarded as a bit of a buzzword. Buzzword or not, data science is undeniably a rapidly growing profession, with McKinsey estimating that by 2018 the US economy will be short of up to 190,000 people capable of performing deep analytics (see The Rise of the Data Scientist for more).
What are employers looking for?
Job descriptions for data science positions mention combinations of computer science skills, statistics, analytics and more. The spectrum of attributes a potential data scientist is expected to have is very broad; so broad, in fact, that I find the best way to define a data scientist is to define what makes a good one.
A lot has been said about the 3 Vs of big data: volume, velocity and variety. I find that there are many parallels between these characteristics and what a good data scientist should be able to adapt to and make use of.
Volume – know when to go big
We are living in an age where datasets can run to petabytes and beyond. A data scientist should be able to make sense of data at this scale, both technically (making use of big data toolkits) and conceptually. Conceptualising and visualising data of this size can be difficult; the human mind is not used to thinking at the petabyte scale. Even so, it is something a good data scientist must be able to do if they want to derive any meaningful insight.
In the age of big data it can be tempting to simply look for the largest datasets and assume they will provide the best answers; this is not the case. A good data scientist must be able to recognise when small datasets, or subsets of large datasets, are useful, and when they can be used in combination to yield insight.
The value of a statistical way of thinking should never be underestimated in data science. When it is possible to do away with sampling and look at everything at once, there is a danger of forgetting the fundamental principles that so many data science techniques are based on. A good data scientist should bear this in mind no matter what the size of the data.
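To make the point about sampling concrete, here is a minimal sketch using only the Python standard library. The dataset is synthetic and the function name `estimate_mean` is my own invention for illustration; the idea is simply that a modest random sample, together with its standard error, can tell you how much you actually gain by processing everything.

```python
import random
import statistics


def estimate_mean(population, sample_size, seed=0):
    """Estimate the population mean from a simple random sample.

    Returns a (sample_mean, standard_error) tuple.
    """
    rng = random.Random(seed)
    # Draw a simple random sample without replacement.
    sample = rng.sample(population, sample_size)
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(sample) / sample_size ** 0.5
    return mean, std_err


# Synthetic stand-in for a "large" dataset (assumed values, not real data).
data_rng = random.Random(42)
population = [data_rng.gauss(50.0, 10.0) for _ in range(200_000)]

full_mean = statistics.fmean(population)
sample_mean, std_err = estimate_mean(population, sample_size=2_000)

print(f"full-data mean: {full_mean:.3f}")
print(f"sample mean:    {sample_mean:.3f} (standard error {std_err:.3f})")
```

Here 1% of the records gives an estimate with a standard error of roughly 0.2 on a quantity near 50. Whether that is good enough depends on the question being asked, which is exactly the judgement a statistical way of thinking provides.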
Velocity – think in real time
As the amount of available data increases, so does the speed at which it is generated and the immediacy of the analysis required to make sense of it. A data scientist will often be faced with a project where the client wants real-time insight into their business problem; examples include a need for real-time reactions within the supply chain, or a need to make sense of constantly streaming customer data. An essential skill for a data scientist, therefore, is the ability to think of analytics and insight dynamically: to stream in data in real time and work with it alongside more static datasets to provide a “living analysis” of a system.
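One way to picture a “living analysis” is a summary that updates with every new record while being compared against a static reference. The sketch below uses Welford's online algorithm for the running mean and variance. The `RunningStats` class, the `STATIC_BASELINE` figures and the order values are all hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Mean and variance updated one value at a time (Welford's algorithm)."""

    count: int = 0
    mean: float = 0.0
    _m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; defined as 0.0 until there are two values.
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


# Hypothetical figures from a static, historical dataset.
STATIC_BASELINE = {"mean": 100.0, "std": 5.0}

stats = RunningStats()
alerts = []
for order_value in [98.0, 103.0, 101.0, 140.0, 99.0]:  # stand-in for a stream
    stats.update(order_value)
    # Flag values more than three baseline standard deviations from the norm.
    if abs(order_value - STATIC_BASELINE["mean"]) > 3 * STATIC_BASELINE["std"]:
        alerts.append(order_value)

print(f"live mean after {stats.count} orders: {stats.mean:.1f}")
print(f"flagged values: {alerts}")
```

Nothing is stored beyond three numbers, so the same approach works whether the stream holds five values or five billion. In a real project the loop would read from a message queue or streaming platform rather than a list.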
Variety – nothing is off limits
The variety of data that a business can collect and consume is greater than ever before. Companies can now access structured and unstructured data of all kinds and combine it with externally available data. A good data scientist should not only be aware of this; they should be looking to use data from any combination of diverse sources in the pursuit of insight.
For example, customer purchasing data (times and amounts) could be combined with sentiment analysis from social media, weather information and geographic information in a predictive analysis that recommends real-time stocking of products and responsive deployment of employees across a whole company.
2. Having a well-stocked toolkit
To be able to adapt to these crucial elements, a good data scientist must possess a certain set of skills, or at least be ready to learn them as they go. Some of the most useful techniques include machine learning, data mining, predictive modelling, visualisation and data manipulation.
These are used in combination, along with a whole host of others, to work through a problem from end to end: looking at the raw data, delving into it to find trends, designing and training algorithms, and finally presenting the results in a way that even the most data-phobic business can understand and benefit from.
There is often a great deal of debate around what is the best tool or platform for a certain task or project, and while learning Hadoop, Spark, R and Python would all be very useful for a data scientist, the key for a good data scientist is to be able to derive insight and solve problems. After that the tool choice becomes secondary. They must be able to use their skills as an adaptable data scientist to decide what technique to use when, and what combination of tools would best suit the problem.
3. Understanding the why
One of the most essential qualities a data scientist can have, and possibly the most prized (and rewarded) amongst employers, is the ability to understand the rationale behind the analysis.
If a data scientist can think both in terms of the analysis and in terms of the business problem they are trying to solve, they can become an invaluable asset. One who possesses these qualities, along with an ability to present the findings to business users and management, will quickly be regarded as a great data scientist.
Do you meet the grade?
So, I may not have defined exactly what a data scientist is, but rather looked at what the market demands from a good one. I am sure you will all agree that it takes a little more than just being an Excel wizard.