Data Science and its many variants (Analytics, Big Data, Artificial Intelligence, Robotics, Machine Learning and Deep Learning, which are often used interchangeably) are much used, and much abused, buzzwords for the tools we use to analyze, explain and predict a whole range of events and happenings. It would appear that almost no human endeavour is untouched by Data Science and its applications. We see it applied across fields as diverse as Financial Services, Health Care, Public Governance, Sports, Space, Life Sciences, Travel and Transportation, to name just a few. There are also extreme perceptions of Data Science that paint either ideal or dystopian scenarios of a future in which we employ it to solve problems, large or small. Depending on the predictor's point of view, we will see either a world cured of all ills or one taken over by robots that are more powerful than humans and therefore control them.
This blog post aims to give beginners an overview of Data Science and to explain the general lifecycle that Data Science projects follow.
Data Science refers to the use of data to form meaningful groupings, find patterns, make predictions and validate the predicted results against actual occurrences. A comprehensive understanding of Data Science requires knowledge of diverse fields such as Statistics, Linear Algebra, Calculus, Computer Science and Programming. Applying Data Science to a particular field also requires Subject Matter Expertise (SME) in that field.
To understand what Data Science is, it helps to understand what it is not. Let us take a couple of examples from Banking to illustrate the contrast. If a customer places a sum of money in a Fixed Deposit (FD) with a bank, the outcome of the transaction, in terms of what the customer will receive on maturity, is well-defined. The amount receivable is a direct function of the interest rate, the tenure of the FD, whether interest is cumulative or paid out quarterly, and other parameters such as any preferential rate the person is entitled to and the penalty clause for early withdrawal. No Data Science is required to compute the Receivable Amount, which is deterministic. Now suppose a customer of the bank seeks a loan. The Loan Sanctioning Officer has to weigh the likelihood of the customer repaying the loan based on certain factors, which may vary from bank to bank and will depend on the parameters of the model selected by the bank (assuming it has a model in place). The outcome is by no means certain, and the model will likely give a probability for the predicted result (whether the customer will repay the loan or default). This probabilistic nature of the result is one of the defining features of Data Science.
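The contrast between the two banking examples can be sketched in a few lines of Python. This is only an illustration: the compound-interest formula assumes a cumulative FD compounded quarterly, and the function names, weights and bias in the loan example are hypothetical values invented for the sketch, not any bank's actual model.

```python
import math


def fd_maturity_amount(principal, annual_rate, years, compounds_per_year=4):
    """Deterministic: maturity value of a cumulative fixed deposit.

    Assumes compound interest, A = P * (1 + r/n) ** (n * t).
    """
    periods = compounds_per_year * years
    return principal * (1 + annual_rate / compounds_per_year) ** periods


def repayment_probability(income, existing_debt,
                          weights=(0.00004, -0.00006), bias=-0.5):
    """Probabilistic: a toy logistic score.

    The weights and bias are made up for illustration; a real model
    would learn them from historical loan data.
    """
    score = weights[0] * income + weights[1] * existing_debt + bias
    return 1.0 / (1.0 + math.exp(-score))


# Same inputs always give exactly the same maturity amount.
amount = fd_maturity_amount(principal=100_000, annual_rate=0.06, years=2)
print(f"FD maturity amount: {amount:.2f}")

# The loan model only returns a likelihood, not a certain outcome.
p = repayment_probability(income=60_000, existing_debt=10_000)
print(f"Estimated probability of repayment: {p:.2f}")
```

The first function returns a single certain figure, while the second returns a number between 0 and 1 that the bank still has to interpret.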
One of the key aspects of Data Science is Machine Learning, which is based on analyzing the data provided to the machine and seeking patterns in it. There are two main types of machine learning: supervised and unsupervised.
In supervised learning, we give the algorithm or machine a data set with the “right answers” or “labels”. There is a set of independent variables and one dependent variable. One instance of data, comprising the values of the independent variables and the dependent variable, forms a tuple, and a data set comprises many such tuples. A data set known as the training data set is provided to the machine with the “right answers” (the values of the dependent variable), and the machine “learns” the relationship between the dependent and independent variables. This learning takes the form of a mathematical equation that best explains the relationship. Some test data is then provided to the machine without the dependent variable, and the machine is asked to predict the dependent variable for the test data on the basis of the relationship it has learned. The predicted values of the dependent variable are then compared with the actual values to determine the accuracy of the model.
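As a minimal sketch of the supervised workflow described above, the following fits a straight line to a handful of training tuples and then checks its predictions against held-back test labels. The data is made up, and least-squares line fitting is just one simple choice of "mathematical equation"; the helper name `fit_line` is an assumption of this sketch.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Training tuples: (independent variable, dependent variable). Made-up data.
train = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8), (5.0, 11.1)]
# Test tuples: the labels are held back and used only for comparison.
test = [(6.0, 13.0), (7.0, 15.1)]

# "Learning": estimate the equation from the training data.
slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])

# Prediction: apply the learned equation to the test inputs only.
predictions = [slope * x + intercept for x, _ in test]

# Evaluation: compare predicted values with the actual values.
errors = [abs(pred - actual) for pred, (_, actual) in zip(predictions, test)]
mean_abs_error = sum(errors) / len(errors)

print(f"Learned equation: y = {slope:.2f} * x + {intercept:.2f}")
print(f"Mean absolute error on test data: {mean_abs_error:.3f}")
```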
In unsupervised learning, the data has no labels and the objective is to find hidden structures in the data, for example by grouping similar data points into clusters. Such techniques are often used in what is called data mining.
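To make the idea of finding groups without labels concrete, here is a small sketch of k-means clustering on one-dimensional data. K-means is only one of many unsupervised techniques, the spending figures are invented, and the function `kmeans_1d` is a hypothetical helper written for this illustration.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Tiny k-means for one-dimensional data (illustrative only)."""
    ordered = sorted(values)
    # Spread the initial centroids evenly across the sorted values.
    step = (len(ordered) - 1) / max(k - 1, 1)
    centroids = [ordered[round(i * step)] for i in range(k)]

    for _ in range(iterations):
        # Assignment step: put each value in the group of its nearest centroid.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)

        # Update step: move each centroid to the mean of its group.
        new_centroids = [
            sum(g) / len(g) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids

    return centroids, groups


# Unlabelled data: made-up monthly spend figures for eight customers.
monthly_spend = [12, 15, 14, 90, 95, 11, 88, 13]
centroids, groups = kmeans_1d(monthly_spend, k=2)

print("Centroids:", centroids)
print("Groups:", groups)
```

Nobody told the algorithm which customers are "low" or "high" spenders; the two groups emerge from the data alone.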
Steps in Machine Learning
A generalized life-cycle of machine learning, as it applies to supervised learning, is given below. The details may change in specific situations, but the overall process will be similar. Some of the steps do not apply to unsupervised learning.
- Define the problem and problem type (regression, clustering, ranking, etc.)
- Collect and organize data from multiple sources as required
- Exploratory Data Analysis (EDA) – view the data, find missing values and outliers, remove noise, clean the data and make it ready for analysis
- Split data into training and test sets
- Select the algorithm or combination of algorithms to analyze the data
- Run the training data through the algorithms and build the model
- Test the model with the test data
- Analyze the difference between actual and predicted results
- If the desired results are not obtained, iterate over some of the steps
- Derive and present the conclusions
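The steps above can be sketched end to end on a toy version of the loan example. Everything here is an assumption made for illustration: the rows are invented, the nearest-centroid classifier is just one simple algorithm choice, and the train/test split is deterministic so that the sketch is reproducible (real projects usually shuffle the data randomly first).

```python
# 1. Problem: binary classification (will a loan be repaid? 1 = yes, 0 = no).

# 2. Collect: made-up rows of (income_k, debt_k, repaid); None marks a gap.
rows = [
    (60, 5, 1), (25, 20, 0), (80, 10, 1), (None, 15, 0),
    (30, 25, 0), (75, 8, 1), (20, 18, 0), (90, 12, 1),
    (28, None, 0), (65, 6, 1), (22, 22, 0), (70, 9, 1),
]

# 3. EDA / cleaning: here we simply drop rows with missing values.
clean = [r for r in rows if None not in r]

# 4. Split: hold out every 4th clean row as test data.
test = clean[::4]
train = [r for i, r in enumerate(clean) if i % 4 != 0]


# 5-6. Select and train: a nearest-centroid classifier.
def train_centroids(data):
    """Compute the mean feature vector for each label."""
    centroids = {}
    for label in {r[2] for r in data}:
        points = [r[:2] for r in data if r[2] == label]
        centroids[label] = tuple(sum(c) / len(points) for c in zip(*points))
    return centroids


def predict(centroids, features):
    """Return the label whose centroid is closest to the features."""
    def sq_dist(center):
        return sum((f - c) ** 2 for f, c in zip(features, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


model = train_centroids(train)

# 7-8. Test the model and compare predicted with actual results.
predicted = [predict(model, r[:2]) for r in test]
actual = [r[2] for r in test]
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(test)

# 9-10. Iterate if the result is not good enough, then report.
print(f"Rows after cleaning: {len(clean)}")
print(f"Test accuracy: {accuracy:.2f}")
```

On such a tiny, cleanly separated data set the accuracy figure says little; the point is only the order of the steps.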