4. Regression

In the previous chapters, we saw examples of supervised learning for ‘classification’ problems, i.e. the ‘targets’ had a fixed number of possible values. In this chapter, we will see another class of supervised learning, i.e. ‘regression’, where the ‘targets’ can have continuous values. Note that the ‘features’ can have continuous values in both cases.
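For illustration, the difference between the two kinds of ‘targets’ can be written as below (the values here are hypothetical and only for illustration),

# classification: targets come from a fixed set of labels
class_targets = [0, 2, 1, 1, 0]            # e.g. three flower species
# regression: targets can take any continuous value
reg_targets = [0.31, -1.27, 2.05, 0.96]    # e.g. samples of a noisy sine wave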

Also, in the previous chapters, we used SciKit’s built-in datasets and read datasets from files. In this chapter, we will create the dataset by ourselves.

4.1. Noisy sine wave dataset

Let’s create a dataset where the ‘features’ are samples of the x-axis coordinates, whereas the ‘targets’ are the noisy samples of a sine wave, i.e. uniformly distributed noise samples are added to the sine wave; the corresponding waveform is shown in Fig. 4.1. This can be achieved as below,

Fig. 4.1 Sine wave + Uniformly distributed noise generated by Listing 4.1

Listing 4.1 Generation of noisy sine wave as shown in Fig. 4.1
# regression_ex.py

import numpy as np
import matplotlib.pyplot as plt


N = 100 # 100 samples
x = np.linspace(-3, 3, N) # coordinates
noise_sample = np.random.RandomState(20)  # fixed seed, so the noise is reproducible
# growing sinusoid with random fluctuation
# note: 'size=N' draws one noise sample per point
sine_wave = x + np.sin(4*x) + noise_sample.uniform(size=N)
plt.plot(x, sine_wave, 'o')
plt.show()

Note

For the SciKit library, the features must be in a 2-dimensional format, i.e. the features are a ‘list of lists’, whereas the targets must be in a 1-dimensional format. Currently, we have both in the 1-dimensional format; therefore, we need to convert the ‘features’ into the 2-dimensional format, as shown in Listing 4.2.

Listing 4.2 Converting ‘x’ into the 2D format
# regression_ex.py

import numpy as np
import matplotlib.pyplot as plt


N = 100 # 100 samples
x = np.linspace(-3, 3, N) # coordinates
noise_sample = np.random.RandomState(20)  # fixed seed, so the noise is reproducible
# growing sinusoid with random fluctuation
sine_wave = x + np.sin(4*x) + noise_sample.uniform(size=N)
# plt.plot(x, sine_wave, 'o')
# plt.show()

# convert features in 2D format i.e. list of list
print('Before: ', x.shape)
features = x[:, np.newaxis]
print('After: ', features.shape)

# uncomment below lines to see the difference
# print(x)
# print(features)

# save sine wave in variable 'targets'
targets = sine_wave

Below is the output of the above code,

$ python regression_ex.py
Before:  (100,)
After:  (100, 1)
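Note that ‘np.newaxis’ is not the only way to obtain the 2D shape; NumPy’s ‘reshape’ gives the same result. Below is a minimal equivalent sketch (the variable names follow Listing 4.2),

# equivalent conversion using 'reshape';
# '-1' asks NumPy to infer the number of rows
features = x.reshape(-1, 1)
print(features.shape)  # (100, 1)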

4.2. Regression model

Now, we test the regression model, i.e. ‘LinearRegression’, on the dataset as below; the steps are similar to those of the classification problems. The predicted and actual points of the sine wave are shown in Fig. 4.2.

Important

Please note the following important points,

  • The ‘stratify’ option cannot be used here (see the commented-out ‘stratify’ line in Listing 4.3), as it requires targets with a fixed set of classes, whereas the regression targets are continuous values.
  • The ‘score’ method uses the ‘features and targets (not the predicted targets)’ for scoring in Regression; it computes the predictions internally, and the resulting value is known as the \(R^2\) score. A sketch of this equivalence is shown after this list.
  • The ‘accuracy_score’ function uses the ‘targets and predicted targets’ for scoring in Classification.
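To see this difference concretely, the regressor’s ‘score’ is equivalent to predicting first and then computing the \(R^2\) score explicitly with ‘r2_score’ from ‘sklearn.metrics’. Below is a minimal sketch, assuming ‘regressor’, ‘test_features’ and ‘test_targets’ as defined in Listing 4.3 below,

from sklearn.metrics import r2_score

# 'score' predicts internally from the features,
score_1 = regressor.score(test_features, test_targets)

# which is the same as predicting first, and then
# comparing the 'true' and 'predicted' targets,
prediction_test_targets = regressor.predict(test_features)
score_2 = r2_score(test_targets, prediction_test_targets)

print(score_1, score_2)  # both print the same R^2 value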
Listing 4.3 Score of regression model
# regression_ex.py

import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

N = 100 # 100 samples
x = np.linspace(-3, 3, N) # coordinates
noise_sample = np.random.RandomState(20)  # fixed seed, so the noise is reproducible
# growing sinusoid with random fluctuation
sine_wave = x + np.sin(4*x) + noise_sample.uniform(size=N)
# plt.plot(x, sine_wave, 'o')
# plt.show()

# convert features in 2D format i.e. list of list
# print('Before: ', x.shape)
features = x[:, np.newaxis]
# print('After: ', features.shape)

# uncomment below lines to see the difference
# print(x)
# print(features)

# save sine wave in variable 'targets'
targets = sine_wave


# split the training and test data
train_features, test_features, train_targets, test_targets = train_test_split(
        features, targets,
        train_size=0.8,
        test_size=0.2,
        # random but the same for every run; note that the score also
        # depends on the selection of the data, i.e. different
        # 'random_state' values give different scores
        random_state=23,
        # 'stratify' keeps the same proportion of each target class in
        # the train and test data; it cannot be used here, as the
        # targets are continuous values
        # stratify=targets
    )

# training using 'training data'
regressor = LinearRegression()
regressor.fit(train_features, train_targets) # fit the model for training data

# predict the 'target' for 'training data'
prediction_training_targets = regressor.predict(train_features)

# note that 'score' uses 'features and targets (not predicted targets)'
# for scoring in Regression, and returns the R^2 score,
# whereas 'accuracy_score' compares 'targets and predicted targets'
# for scoring in Classification
self_accuracy = regressor.score(train_features, train_targets)
print("Accuracy for training data (self accuracy):", self_accuracy)

# predict the 'target' for 'test data'
prediction_test_targets = regressor.predict(test_features)
test_accuracy = regressor.score(test_features, test_targets)
print("Accuracy for test data:", test_accuracy)

# plot the predicted and actual target for test data
plt.plot(prediction_test_targets, '-*')
plt.plot(test_targets, '-o')
plt.show()

Following is the output of the above code,

$ python regression_ex.py
Accuracy for training data (self accuracy): 0.843858910263
Accuracy for test data: 0.822872868183

Fig. 4.2 Actual and predicted points of the sine wave
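Since ‘LinearRegression’ fits a straight line to the data, we can also inspect the learned line directly; ‘coef_’ and ‘intercept_’ are the attributes of the fitted model. The lines below can be appended to Listing 4.3,

# slope and intercept of the fitted line,
# i.e. prediction = coef_[0] * feature + intercept_
print("slope: ", regressor.coef_[0])
print("intercept: ", regressor.intercept_)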

4.3. Conclusion

In this chapter, we saw an example of a regression problem. Also, we saw the basic differences between the scoring methods in the regression and classification problems.