I see that you have decided to read this post. Great! It shows your keen interest in learning more about data science and in leaving no stone unturned while preparing for a rewarding data science career.
Learning is a never-ending process.
During a data science career, or even before starting one, you need to stay informed and clear of doubts. What you need, in this case, is a data science ready reckoner that you can refer to whenever you wish to look up a key data science term or definition.
Scope of the Basic Glossary
Indeed, the scope is data science (no guesswork!). But it is essential to know the areas of data science from which these terms and definitions are drawn. So, here is the scope of this glossary.
- Data Science
- Python and R
- Probability and Statistics
- Data Visualization
- Machine Learning
- Deep Learning
Comprehending Data Science Terms: Starting with the Basics
To save you time, here are some of the essential basic terms of the data science vocabulary. These terms and definitions are arranged in alphabetical order to make them easier to follow. You would not want to jump to Bias after reading about Feature Engineering.
Get, set, go!
An algorithm is a set of rules given to a computer to complete a data science process or solve a problem. A data scientist is expected to know why one algorithm, say logistic regression, is the best choice for a given problem and why another, say a support vector machine, will not work.
An Application Programming Interface (API) offers a set of functions used to work with a particular service or application and to use its features. Through this interface, one piece of code talks to another; this is unlike a GUI or CLI, where humans interact with the code.
One of the most common places you meet APIs is while browsing the web. When you share something on your social media accounts or pay online for a dress, you are using an API.
When you are inclined towards something or someone in an unjust way, you are biased towards it. In data science, bias can refer to a model parameter that shifts the output values away from or towards the decision boundary. More broadly, it is a systematic skew in the data or the model, which can distort the insights and result in weak and costly decisions.
Classification is the process through which the category of a record is identified, based on the record’s score. It is a supervised machine learning task. Its objective is to accurately predict the target class for each instance in the data.
A confidence interval is a range around an estimate that shows the margin of error, together with a confidence level, such as 95%, describing how likely such a range is to contain the true value. A confidence interval can be calculated using standard formulae from statistics.
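As a rough illustration of a confidence interval, the sketch below computes an approximate 95% interval for a sample mean. It assumes the normal approximation (z = 1.96), and the function name `mean_confidence_interval` and the height data are made up for this example; for small samples a t-distribution would usually be preferred.

```python
import math
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Approximate confidence interval for the mean (normal approximation).

    z=1.96 corresponds to roughly 95% confidence; this is an assumption
    of the sketch, not a rule from the glossary.
    """
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    margin = z * std_err
    return mean - margin, mean + margin


# Hypothetical sample data
heights_cm = [168, 172, 165, 180, 175, 170, 169, 174]
low, high = mean_confidence_interval(heights_cm)
print(f"Approximate 95% CI for the mean: {low:.1f} to {high:.1f}")
```

The interval is centred on the sample mean, and the margin shrinks as the sample grows or as the data become less spread out.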
A confusion matrix is a table that indicates how a classification model is performing on test data whose true values are known. The values used to assess the performance are True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN).
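To make the four counts of a confusion matrix concrete, here is a minimal sketch that tallies them for a binary classifier. The helper `confusion_counts` and the label lists are hypothetical, written only for this illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN by comparing predictions with true labels."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive and truth == positive:
            tp += 1  # predicted positive, really positive
        elif guess == positive:
            fp += 1  # predicted positive, really negative
        elif truth == positive:
            fn += 1  # predicted negative, really positive
        else:
            tn += 1  # predicted negative, really negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical test labels and model predictions
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
accuracy = (counts["TP"] + counts["TN"]) / len(actual)
print(counts)    # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
print(accuracy)  # 0.75
```

Metrics such as accuracy, precision and recall are all derived from these four numbers.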
A continuous variable is any variable that can take any of an infinite number of values within a specific range. For example, size and age are continuous variables when expressed as decimal numbers.
A relationship between two or more datasets or variables is termed correlation. If the values in both datasets increase together, the datasets are said to be positively correlated. They are negatively correlated when the values in one dataset increase as the values in the other decrease. When a change in one dataset has no relation to the change in the other, the two are not correlated.
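One common way to put a number on correlation is the Pearson coefficient, which ranges from -1 (perfectly negative) to +1 (perfectly positive). The sketch below is a plain-Python version; the function name `pearson` and the study-hours data are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two variables around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data: hours studied, exam score, mistakes made
hours = [1, 2, 3, 4, 5]
score = [52, 58, 65, 70, 78]
mistakes = [9, 7, 6, 4, 2]

print(round(pearson(hours, score), 2))     # close to +1: positively correlated
print(round(pearson(hours, mistakes), 2))  # close to -1: negatively correlated
```

A value near zero would indicate that the two variables show no linear relationship.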
A data pipeline is a group of functions or scripts that move data forward in a series. The first step’s output is the second step’s input, and so on, until the project team has cleaned and transformed data to work with.
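The idea of each step feeding the next can be sketched in a few lines. The step functions and `run_pipeline` below are hypothetical names chosen for this example, not part of any particular library.

```python
from functools import reduce


def drop_missing(rows):
    """Step 1: remove missing entries."""
    return [row for row in rows if row is not None]


def strip_whitespace(rows):
    """Step 2: trim leading and trailing spaces."""
    return [row.strip() for row in rows]


def to_lowercase(rows):
    """Step 3: normalise case."""
    return [row.lower() for row in rows]


def run_pipeline(data, steps):
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda current, step: step(current), steps, data)


raw = ["  Alice ", None, "BOB", " Carol"]
clean = run_pipeline(raw, [drop_missing, strip_whitespace, to_lowercase])
print(clean)  # ['alice', 'bob', 'carol']
```

Because every step takes a list and returns a list, steps can be added, removed or reordered without touching the others.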
A process in which a data scientist asks fundamental questions to understand the dataset’s context, and a data analyst uses visual exploration tools to look inside the data and its characteristics. The insights gained are useful in the exhaustive analysis that follows and flag any anomalies present in the data.
Decision science is a branch under the umbrella of data science. Decision scientists aim to understand the end user clearly; they use technology and maths to answer business problems and combine these with design thinking and behavioral science.
Data engineers work behind the scenes (on the backend), building the analysis platforms that data scientists use and easing their work; this process is data engineering. The infrastructure that data engineers build is used to collect, clean, store and prepare data for analysis. These professionals work on speeding up the analysis process and keeping data in an organized and easily accessible format.
A decision boundary is a boundary that separates the elements of two classes. These boundaries may or may not be distinct.
Modeled on a tree, the decision tree is a supervised learning algorithm in which a set of alternatives is denoted by the branches and the decisions by the leaves. Moving along the branches and making a choice at each one, you finally arrive at the result (on one of the leaves).
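Walking from the root of a decision tree down to a leaf can be sketched as below. Note the assumptions: this tree is written by hand with made-up features and thresholds, whereas a real decision tree algorithm would learn the questions from training data.

```python
# Hypothetical hand-built tree: each dict asks a question about one feature,
# and each plain string is a leaf holding the final result.
tree = {
    "feature": "experience_years",
    "threshold": 5,
    "below": "junior",
    "at_or_above": {
        "feature": "annual_package",
        "threshold": 100000,
        "below": "mid-level",
        "at_or_above": "senior",
    },
}


def decide(node, record):
    """Follow one branch per question until a leaf is reached."""
    while isinstance(node, dict):
        value = record[node["feature"]]
        branch = "at_or_above" if value >= node["threshold"] else "below"
        node = node[branch]
    return node


print(decide(tree, {"experience_years": 8, "annual_package": 120000}))  # senior
print(decide(tree, {"experience_years": 2, "annual_package": 50000}))   # junior
```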
Exploratory Data Analysis
The first step in data analysis, exploratory data analysis (EDA) is used to derive insights, largely through graphical techniques. With the help of EDA techniques, a data scientist summarizes the main features of a dataset and decides on the next steps, including the development of complex models.
Extract, Transform, Load (ETL)
An important data warehousing process, ETL comprises three steps that fetch raw data from different sources and bring it onto one platform, prepared for analysis. Extract reads data from the source database, transform converts the data into the desired format, and load writes the data into the target database.
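The three ETL steps can be sketched with Python’s standard library. Everything here is a stand-in: the CSV text plays the role of the source, an in-memory SQLite database plays the role of the target warehouse, and the table and column names are made up.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data with inconsistent formatting
RAW_CSV = "name,salary\n asha ,52000\nBEN,61000\n"


def extract(text):
    """Extract: read rows from the source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names and convert salaries to integers."""
    return [(row["name"].strip().title(), int(row["salary"])) for row in rows]


def load(records, conn):
    """Load: write the prepared records into the target database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS employees (name TEXT, salary INTEGER)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT name, salary FROM employees ORDER BY name").fetchall()
print(rows)  # [('Asha', 52000), ('Ben', 61000)]
```

Keeping the three steps as separate functions makes it easy to swap the source or the target without rewriting the transformation logic.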
Features are the measurable input variables of a dataset that provide information about the problem. For instance, in an organization’s employee data, years of experience, annual package and tenure in the company are three features.
Feature selection is the process of identifying the attributes of a dataset that will be crucial in model building. Using fewer features reduces the complexity of a model and the time taken to train and test it, which comes in handy when working with enormous datasets. It begins with gauging the relevance of each feature in predicting the target variable and progresses to selecting a subset of features that leads to a high-performing model.
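One simple way to gauge relevance, among many, is to rank features by the absolute value of their correlation with the target and keep the top few. The sketch below does that; the helper names and the employee data are hypothetical, and correlation ranking is only one of several possible selection criteria.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def select_features(features, target, k):
    """Keep the k features most strongly correlated with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical employee data; the target is salary (in thousands)
features = {
    "experience_years": [1, 3, 5, 7, 9],
    "tenure_years": [1, 2, 2, 4, 5],
    "shoe_size": [8, 10, 7, 9, 8],
}
salary = [30, 42, 55, 64, 80]

selected = select_features(features, salary, k=2)
print(selected)  # ['experience_years', 'tenure_years']
```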
Converting human knowledge into quantitative values that a computer can comprehend is the essence of feature engineering. In feature engineering, the inputs of the model are features. These may be either simple derived features or complex abstract ones.
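As a small sketch of feature engineering, the function below turns a raw employee record into numeric values a model could use. The field names, the ten-year seniority cut-off and the derived ratio are all assumptions made for this illustration.

```python
from datetime import date


def engineer_features(employee, today):
    """Derive simple numeric features from a raw employee record."""
    tenure_years = (today - employee["joined"]).days / 365.25
    return {
        # A date turned into a number
        "tenure_years": round(tenure_years, 1),
        # Domain knowledge ("10+ years counts as senior") encoded as 0/1
        "is_senior": int(employee["experience_years"] >= 10),
        # A ratio combining two raw fields
        "package_per_experience_year": employee["annual_package"]
        / max(employee["experience_years"], 1),
    }


# Hypothetical record
employee = {
    "joined": date(2018, 1, 1),
    "experience_years": 12,
    "annual_package": 96000,
}
engineered = engineer_features(employee, today=date(2023, 1, 1))
print(engineered)
```

The seniority flag is an example of human knowledge encoded as a number, while the ratio is a simple derived feature.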
A developer community and a code sharing and publishing service, GitHub offers features such as access control, task management and feature requests. It is a popular platform for hosting open source software projects, with free accounts and private repositories.
K-means clustering is a data mining algorithm that groups N objects, on the basis of their features, into K groups (so-called clusters). It is one of the most useful, popular and easiest unsupervised algorithms.
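The sketch below shows the two alternating steps of K-means: assign each point to its nearest centroid, then move each centroid to the mean of its points. It is deliberately naive: it assumes the first K points are used as starting centroids and runs a fixed number of iterations, whereas practical implementations use smarter initialisation and a convergence check. The points are made up.

```python
import math


def kmeans(points, k, iterations=10):
    """Naive K-means on tuples of coordinates."""
    centroids = list(points[:k])  # assumption: first k points as initial centroids
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster
            else centroids[i]  # keep the old centroid if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical 2D points forming two visible groups
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(clusters[0])  # [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5)]
print(clusters[1])  # [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
```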
K-nearest neighbors, or KNN, is a machine learning algorithm that classifies objects on the basis of their proximity to their neighbors. To run the algorithm, you decide the number of neighbors (k) to consider and the distance measure that defines the proximity of these neighbors.
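A minimal KNN classifier fits in a few lines: find the k closest labelled points and take a majority vote. This sketch assumes Euclidean distance and k = 3; the function name `knn_predict` and the training points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort labelled points by Euclidean distance to the query and keep the closest k
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labelled points: (coordinates, label)
train = [
    ((1, 1), "small"), ((1, 2), "small"), ((2, 1), "small"),
    ((8, 8), "large"), ((9, 8), "large"), ((8, 9), "large"),
]

print(knn_predict(train, (2, 2)))  # small
print(knn_predict(train, (7, 8)))  # large
```

Both choices matter: a different k or a different distance measure can change which label wins the vote.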
Linear regression is a regression technique that establishes the relation between two variables by fitting a linear equation to the observed data. This makes it easy to predict the unknown variable from the known one.
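For one input variable, fitting the line y = slope * x + intercept by ordinary least squares has a closed-form solution, sketched below. The function name `fit_line` and the data are made up; the points lie exactly on y = 2x + 1 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations lying exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predict y for an unseen x = 6
print(slope, intercept, prediction)  # 2.0 1.0 13.0
```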
Logistic regression is a statistical model and a classification technique in which the relationship between one or more independent variables and a binary dependent variable is modeled. It is much like linear regression; the difference is that the expected results are a defined set of categories rather than continuous values.
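The sketch below trains a one-variable logistic regression with plain gradient descent. It assumes a fixed learning rate and epoch count picked for this toy data, and the names `train_logistic` and `predict` are hypothetical; libraries use more robust optimisers.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit weight w and bias b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # predicted probability minus label
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict(x, w, b):
    """Return class 1 if the predicted probability is at least 0.5, else 0."""
    return int(sigmoid(w * x + b) >= 0.5)


# Hypothetical data: hours studied and whether the student passed (1) or not (0)
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = train_logistic(hours, passed)
print(predict(2, w, b), predict(7, w, b))
```

Unlike linear regression, the output is a probability between 0 and 1 that is then turned into one of the two categories.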
Machine Learning Engineer
More than just software programmers, machine learning engineers are the brains behind complicated machines that learn and perform with minimal human supervision.
A multi-functional numerical computing environment that simplifies statistical data modeling, matrix operations and the implementation of algorithms, MATLAB is used primarily in the scientific disciplines.
This brings us to the end of the vocabulary list. Did I miss a few terms? Want to know more about some of these data science terms and definitions? Do not forget to write to us in the comment box below.
Watch this space for the list of advanced terms.