Avoid These Mistakes if You Want to Master Data Science

The best teacher is your last mistake, and that feels especially apt for data science. Data science is an interdisciplinary field that uses various tools, techniques, algorithms, and systems to derive actionable insights and valuable information from both structured and unstructured data.

We as humans learn from our mistakes, but in a field as lucrative as data science the margin for error is extremely thin. Organisations cannot afford mistakes when the stakes are high. An error may cost the organisation an arm and a leg and, consequently, cost the individual data scientist the job.

So if you have decided to pursue data science as your career, make sure that you do not commit the following mistakes.

1. Remember, Correlation and Causation are not the same!

Mistaking correlation for causation is a serious error. Data scientists often assume that correlation implies causation.

Causation between two variables X and Y means that X causes Y: a change in one variable is the result of the occurrence of the other. It is also referred to as cause and effect.

Correlation, on the other hand, describes the strength and direction of the relationship between the two variables X and Y. The value of correlation lies between -1 and +1. A negative correlation signifies a negative relationship between the variables: they move in opposite directions, i.e. when X decreases, Y increases and vice versa. Similarly, a positive correlation signifies that the variables move in the same direction. Zero correlation means there is no linear relationship between the two. It does not by itself prove that the variables are independent of each other, because a non-linear relationship can still exist.
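To make the -1 to +1 range concrete, here is a minimal Python sketch that computes the Pearson correlation coefficient by hand. The function name pearson and the sample lists are illustrative assumptions, not part of any library. The third data set is a word of caution: Y is completely determined by X there, yet the coefficient comes out as zero, because the coefficient only picks up straight-line relationships.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative data only.
x = [-3, -2, -1, 0, 1, 2, 3]
same_direction = [2 * v + 1 for v in x]       # rises as x rises
opposite_direction = [-2 * v + 1 for v in x]  # falls as x rises
squared = [v * v for v in x]                  # depends on x, but not linearly

r_pos = pearson(x, same_direction)
r_neg = pearson(x, opposite_direction)
r_sq = pearson(x, squared)
print(r_pos, r_neg, r_sq)
```

The first two values sit at the two ends of the range, while the third is zero even though Y is a function of X.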

In data science, correlation is not causation. If two variables X and Y seem to be related to each other, it does not mean X causes Y.
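A quick simulation shows how this can happen. In the sketch below, a hidden third factor Z drives both X and Y, while X has no effect on Y at all. Every name and number here is invented for illustration (think of Z as the daily temperature, X as ice-cream sales, and Y as sunburn cases). The helper residuals is a simple least-squares adjustment written for this example, not a library function.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def residuals(vs, zs):
    """What is left of vs after subtracting its least-squares fit on zs."""
    mv, mz = mean(vs), mean(zs)
    slope = (sum((v - mv) * (z - mz) for v, z in zip(vs, zs))
             / sum((z - mz) ** 2 for z in zs))
    return [(v - mv) - slope * (z - mz) for v, z in zip(vs, zs)]

random.seed(0)  # fixed seed so the run is repeatable
n = 5000
z = [random.gauss(0, 1) for _ in range(n)]   # hidden common cause
x = [zi + random.gauss(0, 0.5) for zi in z]  # driven by z only
y = [zi + random.gauss(0, 0.5) for zi in z]  # driven by z only, never by x

r_xy = pearson(x, y)
r_given_z = pearson(residuals(x, z), residuals(y, z))
print(f"corr(X, Y)                 = {r_xy:.2f}")
print(f"corr(X, Y) after removing Z = {r_given_z:.2f}")
```

With these settings the first figure should land close to 0.8 and the second close to zero: X and Y look strongly related, but the relationship disappears once the common cause is accounted for.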

2. Actions speak louder than words.

Data science includes a lot of linear algebra, probability, statistics, machine learning algorithms, and so on. Newcomers often put all their effort into learning the theoretical concepts. Although it is wise to know the concepts, applying what you learn through hands-on practice is paramount.

So make sure you spend more time solving practical examples than on theoretical learning. You can find data sets on Google to put your concepts to the test.

3. All day ‘code’ and no play makes Jack a dull boy.

Now, before you jump to developing the models of the future, you need to understand the basics. Young data scientists often rush into coding without understanding how the algorithm works. If you do start coding from scratch, make sure you favour a learning approach over attaining perfection.

Linear algebra, Statistics, Calculus, & Probability

You must first understand the four subjects listed above before diving into hardcore coding. Data science is the result of all these techniques combined, not just algorithms and code.

4. Jack of all trades and master of none.

There is a plethora of tools, each with its own unique features. It is easy to end up trying to learn all of them at once, and which tool to master is a constant dilemma. But chasing multiple tools will hamper your real-life problem-solving skills or even land you in ‘no-man’s-land’.

So why is it not recommended to try multiple tools all at once? Well, you will end up mastering none of them. And as we all know, ‘half knowledge is a dangerous thing!’

5. Outer results attract, but inner code captivates!

Remember, it is not always about the accuracy. Sure, developing a model with 97% accuracy is commendable, but if you cannot explain how the model is built, how it attained its accuracy, and which features drive its decisions, your efforts will be in vain. Your client may reject it.

Think of it this way. If you cannot explain to a customer that their loan application was rejected because of their age or previous credit history, the business will take a hit. So it is imperative to know how the model works.
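As a rough illustration of what an explainable decision can look like, here is a minimal sketch of a transparent linear scoring model. Everything in it is a made-up assumption for this example: the feature names, the weights, the baseline, and the approval threshold are hypothetical and do not come from any real lender or library. The point is only that each feature's contribution to the final score can be read off directly and reported to the customer.

```python
# Hypothetical weights for a simple linear credit score (illustration only).
WEIGHTS = {
    "age_under_21": -2.0,     # 1 if the applicant is under 21, else 0
    "missed_payments": -1.5,  # number of missed payments on record
    "years_employed": 0.5,    # years in current employment
}
BASELINE = 3.0   # hypothetical starting score
THRESHOLD = 0.0  # hypothetical rule: approve when score >= THRESHOLD

def explain(applicant):
    """Return the decision, the score, and per-feature contributions."""
    contributions = {name: WEIGHTS[name] * applicant[name] for name in WEIGHTS}
    score = BASELINE + sum(contributions.values())
    approved = score >= THRESHOLD
    # Sort from the most damaging contribution to the most helpful one.
    reasons = sorted(contributions.items(), key=lambda item: item[1])
    return approved, score, reasons

applicant = {"age_under_21": 1, "missed_payments": 2, "years_employed": 1}
approved, score, reasons = explain(applicant)

print("approved:", approved, "score:", score)
for name, value in reasons:
    print(f"{name}: {value:+.1f}")
```

For this invented applicant the application is declined, and the list shows that missed payments weighed most heavily, followed by age. A more complex model needs more work to produce an explanation of this kind, which is one reason to start with simple models.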

To avoid this, pick a niche and talk to the relevant people in the organization to understand how a particular project works. Once you have an initial understanding, create fundamental models. Try explaining them to people outside the industry, and increase the complexity of your model slowly. Keep repeating this review cycle until you reach the point where you no longer understand what is happening inside the model. This will help you to understand when to stop and why fundamental models are necessary.

6. All you need is the ability to ‘code’

With all the buzz around learning R and Python, entry-level professionals often end up believing that data science involves a lot of coding. As a matter of fact, gone are the days when coding was paramount. Being able to understand and write code will always be a plus, but coding is not a prerequisite for data analysis.

There are multiple tools available in the market for exploratory data analysis, some of which are mentioned below. If coding is not your strong point, drag and drop will do the job for you.

All the tools mentioned below are free to download.

Trifacta. With absolutely no coding involved, Trifacta makes data cleaning and structuring more intuitive. Even the most complicated CSV or JSON file can be wrangled in a few hundred seconds. The platform is now available as a cloud service as well.

RapidMiner. From data preparation to deploying final machine learning models, RapidMiner makes it easy to collate data for predictive modelling.

QlikView. Be it managing the data or building a custom app, QlikView provides solutions for analytics and business intelligence.

Tableau. Explore. Save. Share. This is all you have to do to create interactive visualisations from your spreadsheets. Once published, the visuals can be embedded into web pages and blogs or be shared via social media and email.

H2O. H2O is widely regarded as one of the best open-source machine learning platforms. It works with R and Python and runs on Hadoop/YARN and Spark.

7. Emphasizing tools over the business problem.

Let us take an example. Imagine you are given a data set that records how likely a customer is to purchase a particular car. The variables involved may include the customer’s income, loan history, credit history, family size, daily usage, color preference, engine power, etc. You may not know what a particular variable means, yet you may still build a model with high accuracy without understanding why that variable was dropped. Well, it might turn out to be a disastrous mistake. The variable you dropped may have significant importance in the real-world scenario.

So what happened here? Blindly applying tools to a business problem is not desirable. Knowing multiple tools is promising, but blending that knowledge with the business problem at hand is the deciding factor for a data scientist.

8. Data Science is all about the plan.

Data analysis is a structured process. First, understand the objectives of the problem, then test a few hypotheses. Jumping into the data without a structured approach is a common mistake.
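As one possible illustration of "testing a hypothesis" before building anything, the sketch below runs a simple permutation test in plain Python. The two groups and their numbers are invented for this example (say, weekly purchases under two hypothetical pricing plans), and permutation_p_value is a helper written here, not a library call. The hypothesis being tested is that the group labels make no difference to the average.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate how often shuffled labels give a gap at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        gap = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if gap >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Invented numbers for illustration only.
plan_a = [12, 15, 14, 10, 13, 16, 11, 14]
plan_b = [18, 21, 19, 22, 20, 17, 23, 19]

p_value = permutation_p_value(plan_a, plan_b)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value suggests the difference between the groups is unlikely to be a labelling accident. Whether that difference matters is still a question for the business objective defined in the first step.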

“Data scientists who do not know what they want end up with results that they do not want.”

There is no single approach to data science. There is no ideal set of ‘what’, ‘why’, and ‘how’ for data analysis. Focus on obtaining results by using big data to define the variables, design the model, and process the data accurately. Understand what the desired end result is and implement a suitable approach.

When it comes to a field as lucrative as data science, organisations aim for perfection. As the famous saying goes, ‘practice makes a man perfect’. These are a few mistakes you should definitely avoid as a decision maker of the future.
