Welcome to INSAID’s Data Science Glossary!

This Glossary is designed to help you refresh your Data Science concepts and introduce you to new ones that you can research later.

**Scope of the Basic Glossary**

Before diving in, it helps to know which areas of Data Science these terms and definitions cover. The scope of this glossary is:

– Data Science

– Python and R

– Probability and Statistics

– Data Visualization

– Machine Learning

– Deep Learning

**Basic Data Science Concepts you need to know!**

Here are some of the essential terms of the Data Science vocabulary. These terms and definitions are arranged in roughly alphabetical order to make them easier to learn.

**Algorithm**

A set of rules given to a computer to complete a Data Science process or solve a problem is called an algorithm. A Data Scientist is expected to know why one algorithm, say Logistic Regression, is the best choice for a given problem and why another, say Support Vector Machine, will not work.

**API**

An Application Programming Interface (API) offers a set of functions for working with a particular service or application and using its features. Through this interface, **one piece of code connects with another**; this is unlike a GUI or CLI, where humans interact with the code.

You use APIs all the time when browsing the web. If you have shared something on your social media accounts or paid online for a dress, you have used an API.

**Bias**

When you are inclined towards something or someone in an unjust way, you are biased towards it. In a Data Science model, the bias term is a parameter that shifts the output values toward or away from the decision boundary. More broadly, bias is a systematic error in the data or the model; it can distort the insights and lead to weak and costly decisions.

**Classification**

The process of identifying the category of a record, based on the record’s score, is called classification. **It is a supervised machine learning task**. Its objective is to accurately predict the target class for each instance in the data.

**Confidence Interval**

A range around an estimate that expresses the margin of error, together with a probability (the confidence level) that the true value lies within this range. The confidence interval is calculated using standard statistical formulae.
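
As a rough illustration, the sketch below computes a confidence interval for a sample mean with Python's standard library. It assumes a normal approximation (mean ± z × standard error); for small samples a t-distribution is usually more appropriate. The function name `mean_confidence_interval` and the sample values are made up for this example.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean of a sample."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean, using the sample standard deviation.
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # z-score that leaves (1 - confidence) / 2 in each tail.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * std_err
    return mean - margin, mean + margin

# Hypothetical measurements, for illustration only.
sample = [10, 12, 14, 16, 18]
low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

A higher confidence level (for example 0.99) gives a wider interval for the same data.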

**Confusion Matrix**

A table that shows how well a classification model performs on **test data** whose true values are known. The counts used to assess performance are True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN).
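
The four counts can be tallied by comparing true labels with predicted labels. The following is a minimal sketch for a binary problem; the helper `confusion_counts` and the label lists are invented for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Hypothetical test labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
accuracy = (counts["TP"] + counts["TN"]) / len(y_true)
print(counts, accuracy)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1} 0.75
```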

**Continuous Variable**

Any variable that can take any of an infinite number of values within a specific range is a continuous variable. For example, size and age are continuous variables when expressed as decimal numbers.

**Correlation**

A relationship between two or more data-sets or variables is called correlation.

If the values in both correlated variables or data-sets increase together, they are said to be **positively** correlated. They are **negatively** correlated when the values in one increase as the values in the other decrease. When a change in one data-set has no relation to a change in the other, the two are not correlated.
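
One common way to quantify this is the Pearson correlation coefficient, which ranges from -1 (perfectly negative) through 0 (no linear relationship) to +1 (perfectly positive). The sketch below is a plain-Python version; the function name `pearson_r` and the data are chosen for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)

xs = [1, 2, 3, 4]
print(pearson_r(xs, [2, 4, 6, 8]))    # values rise together: close to +1
print(pearson_r(xs, [8, 6, 4, 2]))    # one rises as the other falls: close to -1
print(pearson_r(xs, [1, -1, -1, 1]))  # no linear relationship: 0
```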

**Data Pipelines**

A group of functions or scripts that move data forward in a series: the first step’s output is the second step’s input. This continues until the project team has cleaned and transformed data.
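
The idea of chaining steps can be sketched in a few lines of Python. The step functions (`drop_missing`, `to_float`, `scale_to_unit`) and the helper `run_pipeline` are hypothetical names used only for this illustration; real pipelines typically rely on dedicated tooling.

```python
from functools import reduce

def drop_missing(rows):
    """Remove missing entries."""
    return [r for r in rows if r is not None]

def to_float(rows):
    """Convert text values to numbers."""
    return [float(r) for r in rows]

def scale_to_unit(rows):
    """Scale values so the largest becomes 1.0."""
    largest = max(rows)
    return [r / largest for r in rows]

def run_pipeline(data, steps):
    """Feed each step's output into the next step."""
    return reduce(lambda current, step: step(current), steps, data)

raw = ["4", None, "10", "6"]
clean = run_pipeline(raw, [drop_missing, to_float, scale_to_unit])
print(clean)  # [0.4, 1.0, 0.6]
```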

**Data Exploration**

A process in which a Data Scientist asks fundamental questions to understand the data-set’s context, and a Data Analyst uses visual exploration tools to look inside the data and its characteristics. The insights gained are useful in the detailed analysis that follows and flag any anomalies present in the data.

**Decision Science**

Decision science is a branch under the umbrella of Data Science. Aiming to understand the end user clearly, decision scientists use technology and math, combined with design thinking and behavioral science, to answer business problems.

**Data Engineering**

Data Engineers work behind the scenes (on the backend) to build the analysis platforms that Data Scientists rely on; this work is called data engineering. The infrastructure that data engineers build is used to **collect, clean, store and prepare data** for analysis. These professionals work on speeding up the analysis process and keeping data in an organized and easily accessible format.

**Decision Boundary**

A boundary that separates the elements of two classes. Such boundaries may or may not be sharply defined.

**Decision Tree**

Modeled on a tree, the decision tree is a supervised learning algorithm in which the **branches** represent a set of alternatives and the leaves represent the decisions. Moving along the branches and choosing at each split, you finally arrive at a result on one of the leaves.
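
To make the branch-and-leaf picture concrete, here is a small hand-built tree and a function that walks it. This sketch only shows how a prediction follows the branches; it does not learn the tree from data. The features, thresholds and labels in `TREE` are invented for illustration.

```python
# A hand-written tree: each inner node tests one feature against a threshold,
# each leaf holds the final decision.
TREE = {
    "feature": "years_experience",
    "threshold": 5,
    "below": {"leaf": "junior"},
    "at_or_above": {
        "feature": "team_size",
        "threshold": 3,
        "below": {"leaf": "senior"},
        "at_or_above": {"leaf": "lead"},
    },
}

def predict(node, record):
    """Follow branches from the root until a leaf is reached."""
    while "leaf" not in node:
        if record[node["feature"]] < node["threshold"]:
            node = node["below"]
        else:
            node = node["at_or_above"]
    return node["leaf"]

print(predict(TREE, {"years_experience": 7, "team_size": 4}))  # lead
```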

**Exploratory Data Analysis**

The first step in data analysis, exploratory data analysis (EDA) uses graphical techniques to derive insights. With EDA techniques, a Data Scientist summarizes the main features of a data-set before moving on to the next steps, such as complex model development.

**Extract, Transform, Load (ETL)**

An important data warehousing process, ETL comprises three steps that fetch raw data from different sources and bring it onto one platform, prepared for analysis. **Extract is reading data** from the source database, transform is converting the data into the desired format, and load is writing the data into the target database.
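
The three steps can be sketched with Python's standard library. In this toy example the "source" is a CSV string and the "target" is an in-memory SQLite database; the table name `employees`, the column names and the records are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data.
RAW_CSV = "name,salary\n asha ,52000\nravi,61000\n"

def extract(csv_text):
    """Extract: read rows from the source."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: tidy names and convert salaries to integers."""
    return [(row["name"].strip().title(), int(row["salary"])) for row in rows]

def load(records, conn):
    """Load: write the prepared records into the target database."""
    conn.execute("CREATE TABLE IF NOT EXISTS employees (name TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO employees VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT name, salary FROM employees").fetchall())
```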

**Feature**

The measurable input variables of a data-set that provide information about the problem. For instance, in a data-set about an organization’s employees, years of experience, annual package and tenure in the company are three features.

**Feature Selection**

The process of identifying the attributes of a data-set that will be crucial in model building. Using fewer features **reduces the complexity** and the time taken for a model’s **training** and **testing**, which is especially helpful with enormous data-sets. It begins with gauging how relevant each feature is for predicting the target variable, then progresses to selecting a subset of features that leads to a high-performance model.

**Feature Engineering**

Converting human knowledge into quantitative values that a computer can understand is the essence of feature engineering. The resulting features are the inputs of the model. They may be either simply **derived** or **complex** and **abstract**.

**GitHub**

A developer community and a code hosting and sharing service, GitHub offers features such as *access control, task management, feature requests*, etc. It is a popular platform for hosting open-source software projects, with free accounts and private repositories.

**K-means Clustering**

K-means clustering is a data mining algorithm that groups **N objects** into **K groups** (called clusters) on the basis of their features. It is one of the most useful, popular and simple unsupervised algorithms.
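
The grouping works by repeating two steps: assign every object to its nearest cluster center, then move each center to the average of its members. Below is a minimal one-dimensional sketch with fixed starting centers so the result is reproducible; the function `k_means_1d` and the points are illustrative, and real projects would normally use a library implementation.

```python
def k_means_1d(points, centroids, iterations=10):
    """Cluster 1-D points around the given starting centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(members) / len(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[1.0, 12.0])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

Here K = 2 and N = 6. The outcome generally depends on the starting centers, which is why libraries often try several initializations.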

**K-nearest Neighbors**

K-nearest neighbors, or KNN, is a machine learning algorithm that classifies objects on the basis of their proximity to neighbors. To run the algorithm, you choose the number of neighbors (k) to consider and the distance measure used to **specify the proximity** of these neighbors.
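
A minimal sketch of the idea for classification: find the k training points closest to the query and let them vote. It assumes Euclidean distance (via `math.dist`, available from Python 3.8); the function `knn_predict` and the labelled points are invented for this example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # Sort labelled points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled points: (coordinates, label).
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((9, 8), "B"), ((8, 9), "B"),
]

print(knn_predict(train, (2, 2)))  # A
print(knn_predict(train, (8, 7)))  # B
```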

**Linear Regression**

This is a regression technique that establishes the relation between two variables by fitting a linear equation to the observed data. This makes it easy to predict the value of an unknown variable from a known one.
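
For a single input variable, the line y = slope × x + intercept can be fitted with the ordinary least squares formulas. The sketch below is a plain-Python version; `fit_line` and the data points are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)       # 2.0 1.0
print(slope * 5 + intercept)  # prediction for x = 5: 11.0
```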

**Logistic Regression**

A statistical model and a classification technique that models the relationship between one or more independent variables and a binary dependent variable. It is similar to linear regression; the difference is that the expected results are a defined set of categories rather than continuous values.
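
The link to linear regression can be seen in how a prediction is made: a linear score is passed through the sigmoid function to get a probability between 0 and 1, which is then turned into a category. The sketch below shows prediction only; the weights and bias are made-up values, not the result of training.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one record."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

def predict_class(features, weights, bias, threshold=0.5):
    """Turn the probability into one of two categories (1 or 0)."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical model parameters, chosen only for illustration.
weights, bias = [1.0], -2.0

print(predict_proba([2.0], weights, bias))  # score is 0, so probability is 0.5
print(predict_class([4.0], weights, bias))  # 1
print(predict_class([0.0], weights, bias))  # 0
```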

**Machine Learning Engineer**

More than just software programmers, machine learning engineers are the brains behind complex systems that learn and perform with minimal human supervision.

**MATLAB**

A multi-functional numerical computing environment that simplifies statistical data modeling, matrix operations and the implementation of algorithms, MATLAB is used primarily in the scientific disciplines.

*This brings us to the end of the vocabulary list. Did I miss a few terms?*

Want to know more about some selected Data Science terms and definitions? Do not forget to write to us in the comment box below. Watch this space for the list of advanced terms!
