1. Machine learning terminologies
Codes and Datasets
The datasets and the code for this tutorial can be downloaded from the repository.
In this chapter, we will look at the basic building blocks of the SciKit-Learn library. We will then discuss the various types of machine learning algorithms and introduce several terms that are used in the machine learning process.
Machine learning algorithms are one part of the data analysis process, which involves the following steps:
- Collecting the data from various sources
- Cleaning and rearranging the data, e.g. filling in the missing values in the dataset.
- Exploring the data, e.g. checking the statistical values of the data and visualizing the data using plots.
- Modeling the data using a suitable machine learning algorithm.
- Lastly, checking the performance of the newly created model.
In this tutorial, we will see all the steps of the data analysis process except the first one, i.e. data collection. We will use data that are available on various websites.
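The steps listed above (without data collection) can be sketched in a few lines. This is only a minimal illustration under some assumptions: it uses the ‘iris’ dataset bundled with SciKit-Learn, and the choice of `KNeighborsClassifier`, the 25% test split and the variable names are ours, not part of the tutorial's repository.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Step 1 (collection) is skipped: we use a dataset bundled with scikit-learn.
iris = load_iris()
X, y = iris.data, iris.target

# Step 2: cleaning. Here we only count the missing values.
n_missing = int(np.isnan(X).sum())

# Step 3: exploring. The per-feature mean is one simple statistical summary.
feature_means = X.mean(axis=0)

# Step 4: modeling. Hold out a test set, then fit a classifier (our choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Step 5: checking the performance on samples the model has not seen.
accuracy = model.score(X_test, y_test)

print(f"missing values: {n_missing}")
print(f"feature means: {feature_means}")
print(f"test accuracy: {accuracy:.2f}")
```

Holding out a test set matters here: the score in the last step is only meaningful if it is measured on samples that were not used for fitting.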
Data analysis requires knowledge of multiple fields: programming in Python or R for cleaning the data, and mathematics for measuring the statistical parameters of the data. We also need some knowledge of the specific field to which we want to apply the machine learning algorithm. Lastly, we must understand the machine learning algorithms themselves.
1.2. Machine learning
In conventional programming, we write code to solve a problem, and that code can solve only one particular type of problem. This is known as the ‘hard coding’ method. In machine learning, by contrast, the code is designed to find patterns in the datasets, so it is more general and can make decisions on new problems as well. This difference is shown in Table 1.1.
Table 1.1: Hard coding and machine learning

| Method | Description |
|---|---|
| Hard coding | Can solve only one particular type of problem |
| Machine learning | Sees the patterns in the data and solves new problems by itself |
Lastly, machine learning can be defined as the process of extracting knowledge from data so that accurate predictions can be made on future data. In other words, machine learning algorithms are able to predict the outcomes for new data based on their training.
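The contrast between the two approaches can be illustrated with a small sketch. The hand-written rule, its 2.5 cm threshold and the helper name `is_setosa_hard_coded` are hypothetical and chosen only for this example; the learned version uses a one-level decision tree, which is one of many possible models.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target


def is_setosa_hard_coded(sample):
    """Hard coding: a person picked the feature and the 2.5 cm threshold."""
    petal_length = sample[2]  # third column is 'petal length (cm)'
    return petal_length < 2.5


# Machine learning: the model derives its own rule from the samples.
is_setosa = (y == 0).astype(int)  # 1 for 'setosa', 0 for the other flowers
tree = DecisionTreeClassifier(max_depth=1, random_state=0)
tree.fit(X, is_setosa)

new_flower = [[5.0, 3.4, 1.5, 0.2]]  # a hypothetical, unseen measurement
print(is_setosa_hard_coded(new_flower[0]))
print(tree.predict(new_flower)[0])
```

The hard-coded function works only for this one question, whereas the same `fit` and `predict` calls can be reused on a different dataset without rewriting the rule.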
1.3. Basic terminology
In this section, we will see the basic building blocks of the SciKit-Learn library along with several terms used in the machine learning process.
1.3.1. Data: samples and features
SciKit-Learn stores data in two-dimensional form: the rows are known as ‘samples’ and the columns as ‘features’.
- Samples: Each dataset has a certain number of samples.
- Features: Each sample has some features, e.g. if we have samples of lines, then the features of these lines can be the ‘x’ and ‘y’ coordinates.
- Every sample must have the same set of features in SciKit-Learn. For example, all the lines should have only two features, i.e. the ‘x’ and ‘y’ coordinates. If some lines have a third feature such as ‘thickness of line’, then we need to either add this feature to all the other lines or delete it from the lines that have it.
- Target: There may be a certain number of possible outputs for the data, which are known as the ‘target’. For example, the points can lie on a ‘straight line’ or on a ‘curve’. Therefore, the possible targets for this case are ‘line’ and ‘curve’.
- Different names are used for ‘targets’ and ‘features’, as shown in Table 1.2.
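Table 1.2: Alternative names for features and targets

| Term | Alternative names |
|---|---|
| Features | Inputs, Attributes, Predictors, Independent variables, Input variables |
| Target | Outputs, Outcomes, Responses, Labels, Dependent variables |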
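As a rough sketch of the terms above, the line and curve example can be written as a NumPy array with one row per sample and one column per feature. The coordinates below are made-up toy values, and the names `X` and `y` follow a common convention rather than anything required by the library.

```python
import numpy as np

# Hypothetical toy data: every row is one sample (a point) and every
# column is one feature (the 'x' and 'y' coordinates).
X = np.array([
    [2.0, 2.0],    # points on the straight line y = x
    [3.0, 3.0],
    [4.0, 4.0],
    [2.0, 4.0],    # points on the curve y = x**2
    [3.0, 9.0],
    [4.0, 16.0],
])

# One target value per sample.
y = np.array(["line", "line", "line", "curve", "curve", "curve"])

n_samples, n_features = X.shape
print(n_samples, n_features)
print(np.unique(y))  # the distinct target values
```

Because `X` is a rectangular array, every sample necessarily has the same number of features, which is the requirement described in the list above.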
1.3.3. Load the inbuilt data
Let’s understand this with an example. The SciKit-Learn library includes some built-in datasets. First we will use these datasets, and later we will read the data from files for the data analysis.
- The datasets stored in the SciKit-Learn library can be used as shown below:
>>> from sklearn.datasets import load_iris  # import 'iris' dataset
>>> iris = load_iris()  # save dataset in 'iris'
- Now, we can see the data stored in ‘iris’. Note that the dataset is stored in a dictionary-like object (a ‘Bunch’), so its contents can be accessed as keys or as attributes.
>>> iris.keys()
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
Following is the description of the above keys:
‘feature_names’: This contains the names of the features (optional).
>>> iris.feature_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
‘data’: This contains the samples of the dataset, e.g. this dataset contains 150 samples and each sample has four features. In the results below, the first three samples of the data are shown. The names of the columns (i.e. the features of the data) are given by ‘feature_names’, e.g. the first column stores the sepal length.
>>> iris.data.shape  # 150 samples, 4 features
(150, 4)
>>> iris.data[0:3]  # display the first 3 samples of the stored data
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2]])
‘target_names’: This contains the names of the target categories (optional).
>>> iris.target_names  # flower categories
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
‘target’: This contains the output value for each sample (optional). It is required for supervised learning, which is discussed in Section 1.4.1. Here ‘0’ represents the ‘setosa’ category of the Iris flower, ‘1’ represents ‘versicolor’ and ‘2’ represents ‘virginica’.
>>> iris.target
array([0, 0, 0, 0, ..., 0, 1, 1, 1, ..., 2, 2, 2])
‘DESCR’: This contains the description of the dataset (optional).
>>> iris.DESCR
'Iris Plants Database\n====================\n [...]
Following are the important points about datasets which we discussed in this section:
- A dataset consists of samples, and each sample has some features.
- All the features should be available for every sample. If some samples have missing or extra features, then we need to add or remove those features before using the data with SciKit-Learn.
- Also, the dataset may contain the ‘target’ values.
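The points above can be checked directly on the ‘iris’ dataset. The following is a small sketch; the variable names are ours, and `np.bincount` is used only as a convenient way to count how many samples belong to each target value.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Every sample has the same number of features.
n_samples, n_features = X.shape

# Number of samples for each target value (0, 1 and 2).
samples_per_class = np.bincount(y)

# Translate the integer targets of the first three samples into names.
first_names = iris.target_names[y[:3]]

print(n_samples, n_features)
print(samples_per_class)
print(first_names)
```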
1.4. Types of machine learning
Machine learning can be divided into two categories, i.e. supervised and unsupervised learning, which are described in this section.
1.4.1. Supervised learning
In supervised learning, we have a dataset which contains both the input ‘features’ and the output ‘target’, as discussed in Section 1.3.3, where the Iris flower dataset has both ‘features’ and ‘target’.
1.4.1.1. Classification and regression
Supervised learning can be further divided into two categories, i.e. classification and regression.
Classification: In classification the targets are discrete, i.e. the output can take only a fixed number of values, e.g. in Section 1.3.3 there are only three types of flower. These outputs are represented using strings, e.g. Male/Female, or with a fixed number of integers, as shown for the ‘iris’ dataset in Section 1.3.3, where 0, 1 and 2 are used for the three types of flower.
- If the target has only two possible values, then it is known as ‘binary classification’.
- If the target has more than two possible values, then it is known as ‘multiclass classification’.
Regression: In regression the targets are continuous, e.g. we want to calculate the ‘age of the animal’ (i.e. the target) with the help of the ‘fossil dataset’ (i.e. the features). This is a regression problem because the age is a continuous quantity, i.e. it does not have a fixed number of values.
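The difference between the two kinds of target can be sketched as follows. The classifier part uses the ‘iris’ dataset; the regression part uses synthetic data generated from y = 2x + 1, which is an assumption made only for this illustration (it is not the fossil dataset mentioned above). The estimators `KNeighborsClassifier` and `LinearRegression` are our choices among many.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Classification: the target takes one of three discrete values (0, 1, 2).
X_iris, y_iris = load_iris(return_X_y=True)
classifier = KNeighborsClassifier(n_neighbors=3).fit(X_iris, y_iris)
predicted_class = classifier.predict(X_iris[:1])  # first sample is 'setosa'

# Regression: the target is continuous. Synthetic data with y = 2x + 1.
X_reg = np.arange(10, dtype=float).reshape(-1, 1)
y_reg = 2.0 * X_reg.ravel() + 1.0
regressor = LinearRegression().fit(X_reg, y_reg)
predicted_value = regressor.predict([[10.5]])

print(predicted_class)   # an integer class label
print(predicted_value)   # a floating point value
```

Note that the classifier can only ever return one of the labels it saw during training, while the regressor can return values that never appeared in `y_reg`.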
1.4.2. Unsupervised learning
In unsupervised learning, the dataset contains only ‘features’ and no ‘target’. Here, we need to find the relationships between the various types of data. In other words, we have to find the labels from the given dataset.
Unsupervised learning can be divided into three categories, i.e. clustering, dimensionality reduction and anomaly detection.
- Clustering: This is the process of reducing the number of observations. It is achieved by collecting similar data into one group (cluster).
- Dimensionality reduction: This is the reduction of higher-dimensional data to fewer dimensions, often two or three, as it is easy to visualize the data in two-dimensional and three-dimensional form.
- Anomaly detection: This is the process of identifying unusual or undesired samples, so that they can be removed from the dataset.
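Each of the three categories above has several implementations in SciKit-Learn. The sketch below picks one estimator for each (`KMeans`, `PCA` and `IsolationForest`); these choices and the parameter values are ours and serve only as an illustration. The targets of the ‘iris’ dataset are deliberately ignored, since unsupervised methods use the features only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

# Use the features only; the targets are discarded on purpose.
X, _ = load_iris(return_X_y=True)

# Clustering: group similar samples into 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: project the 4 features onto 2 components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Anomaly detection: +1 marks a normal sample, -1 marks an anomaly.
detector = IsolationForest(random_state=0)
flags = detector.fit_predict(X)

print(cluster_ids[:5])
print(X_2d.shape)
print((flags == -1).sum(), "samples flagged as anomalies")
```

The cluster numbers returned by `KMeans` are arbitrary identifiers, so they need not match the 0/1/2 encoding of the original targets.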
Sometimes these two methods, i.e. supervised and unsupervised learning, are combined. For example, unsupervised learning can be used to find useful features and targets, and these features can then be used by a supervised learning method.
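One possible way to combine the two methods is to chain an unsupervised step and a supervised step in a single pipeline. In the sketch below, `PCA` extracts two features and `LogisticRegression` is trained on them; both estimators, the split and the parameter values are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Unsupervised step (PCA) feeds its output to the supervised step.
pipeline = make_pipeline(
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_train, y_train)

accuracy = pipeline.score(X_test, y_test)
print(f"test accuracy with 2 extracted features: {accuracy:.2f}")
```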
As an example of finding relationships in data, consider the ‘titanic’ dataset, which has information about the passengers, e.g. age, gender, traveling class and whether they survived the accident. Here, we need to find the relationships between the various types of data, e.g. whether people who were traveling in a higher class had a higher chance of survival.
Please note the following points:
- Not all problems can be solved using machine learning algorithms.
- If a problem can be solved directly, then do not use machine learning algorithms.
- Each machine learning algorithm has its own advantages and disadvantages. In other words, we need to choose a suitable machine learning algorithm for the problem.
- We need not be experts in the mathematics behind the machine learning algorithms, but we should be aware of the pros and cons of the algorithms.
Below is the summary of this section. Table 1.3 shows the types of machine learning, and Table 1.4 shows the types of variable in machine learning algorithms.
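Table 1.3: Types of machine learning

| Type | Subtypes |
|---|---|
| Supervised | Binary classification, multiclass classification, regression |
| Unsupervised | Clustering, dimensionality reduction, anomaly detection |

Table 1.4: Types of variable

| Variable type | Values |
|---|---|
| Categorical or factor | String (e.g. Male/Female), or a fixed number of integers (e.g. 0/1/2) |
| Numeric | Floating point values |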
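The two variable types in Table 1.4 can be illustrated with a short sketch. The gender strings and heights are made-up values; `LabelEncoder` is one way to turn string categories into a fixed set of integers, and it assigns the integers in sorted order of the category names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Categorical variable: a fixed set of string values (hypothetical data).
genders = ["Male", "Female", "Female", "Male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(genders)

print(encoder.classes_)  # categories in sorted order
print(codes)             # 'Female' is encoded as 0 and 'Male' as 1

# The integer codes can be translated back into the original strings.
print(encoder.inverse_transform(codes))

# Numeric variable: floating point values (hypothetical heights in metres).
heights = np.array([1.82, 1.65, 1.70, 1.76])
print(heights.dtype)
```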