Mean, Median and Mode

Mean - The average value, Median - The mid point value, Mode - The most common value

In [2]:
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
In [3]:
import numpy

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = numpy.mean(speed)
x
Out[3]:
89.76923076923077
In [4]:
x = numpy.median(speed)
x
Out[4]:
87.0
In [6]:
from scipy import stats

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = stats.mode(speed)
x
Out[6]:
ModeResult(mode=array([86]), count=array([3]))
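
To make the three definitions concrete, here is a minimal sketch that recomputes them using only the Python standard library instead of NumPy and SciPy. The helper names (ordered, middle) are illustrative choices and do not come from the notebook.

```python
from collections import Counter

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# Mean: the sum of the values divided by how many there are
mean = sum(speed) / len(speed)

# Median: the middle value after sorting
# (for an even count, the average of the two middle values)
ordered = sorted(speed)
middle = len(ordered) // 2
if len(ordered) % 2 == 1:
    median = ordered[middle]
else:
    median = (ordered[middle - 1] + ordered[middle]) / 2

# Mode: the value that occurs most often, together with its count
mode, count = Counter(speed).most_common(1)[0]

print(mean, median, mode, count)
```

The results should agree with the numpy.mean, numpy.median and stats.mode cells above.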

Standard Deviation

Standard deviation is a number that describes how spread out the values are. A low standard deviation means that most of the numbers are close to the mean (average) value. A high standard deviation means that the values are spread out over a wider range.

In [7]:
speed = [86,87,88,86,87,85,86]
In [8]:
x = numpy.std(speed)
x
Out[8]:
0.9035079029052513
In [9]:
speed = [32,111,138,28,59,77,97]

x = numpy.std(speed)
x
Out[9]:
37.84501153334721

Variance

Variance is another number that indicates how spread out the values are. In fact, if you take the square root of the variance, you get the standard deviation! Or the other way around, if you multiply the standard deviation by itself, you get the variance!

To calculate the variance: 1. Find the mean [77.4]. 2. For each value, find the difference from the mean [32 - 77.4 = -45.4]. 3. Square each difference [(-45.4) * (-45.4) = 2061.16]. The variance is the average of these squared differences.
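
The steps above can be sketched in plain Python as follows. This is a sketch that assumes the population variance (dividing by the number of values), which is what numpy.var computes by default; the variable names are illustrative.

```python
import math

speed = [32, 111, 138, 28, 59, 77, 97]

# Step 1: find the mean
mean = sum(speed) / len(speed)

# Step 2: for each value, find the difference from the mean
differences = [value - mean for value in speed]

# Step 3: square each difference
squared = [d ** 2 for d in differences]

# The variance is the average of the squared differences
variance = sum(squared) / len(squared)

# The standard deviation is the square root of the variance
std_dev = math.sqrt(variance)

print(variance, std_dev)
```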

In [11]:
(-45.4)**2 # find the square of a number using exponent operator **
Out[11]:
2061.16
In [13]:
import math
math.pow((-45.4),2) # find the square of a number using the math library
Out[13]:
2061.16
In [14]:
(-45.4)*(-45.4) # find the square of a number by multiplying it by itself
Out[14]:
2061.16
In [15]:
speed = [32,111,138,28,59,77,97]

x = numpy.var(speed)
x # calculate the variance of speed
Out[15]:
1432.2448979591834
In [17]:
print(math.sqrt(1432.25)) # standard deviation of speed as the square root of the variance (rounded to 1432.25, so the result differs slightly from numpy.std)
37.84507894033252

Percentiles

A percentile is the value below which a given percentage of the values in a data set fall. For example, the 75th percentile of a list of ages is the age that 75% of the people are younger than.

In [18]:
ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]

x = numpy.percentile(ages, 75)
x # print the age that 75% of people are younger than
Out[18]:
43.0
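
To show where such a number comes from, here is a minimal sketch of a percentile function in plain Python. It assumes linear interpolation between the two nearest sorted values, which is NumPy's default method; the function name percentile is a hypothetical helper, not a library call.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    # Fractional position of the percentile within the sorted list
    rank = (q / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    # Interpolate between the two neighbouring values
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39, 80, 82, 32, 2, 8, 6, 25, 36, 27, 61, 31]

p75 = percentile(ages, 75)
p50 = percentile(ages, 50)
print(p75, p50)
```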

Data Distribution & Histogram

In [20]:
x = numpy.random.uniform(0.0, 5.0, 250)

x # create a data set of 250 floating numbers between 0 and 5 for testing
Out[20]:
array([4.33156252, 0.97471433, 3.71475603, 3.17396971, 2.2404338 ,
       2.28635532, 0.18472212, 0.73057356, 4.32706902, 2.92728038,
       1.75696158, 3.73051633, 3.25145817, 0.09187174, 3.24075513,
       4.19741668, 1.18120835, 4.90627753, 0.60683934, 4.54105931,
       2.72455876, 2.18574515, 4.09499193, 3.77535619, 2.51412088,
       2.37694286, 4.35223562, 3.02842177, 0.52972429, 3.74687586,
       0.48603213, 0.82922318, 3.14183658, 2.73321769, 0.52122714,
       1.56120542, 1.23066887, 0.64820354, 0.78995411, 3.49759716,
       1.01303649, 0.96524112, 4.83919105, 0.28587731, 2.87289906,
       1.88916835, 4.01831328, 2.13742762, 3.67880953, 3.83668071,
       2.30873841, 0.86235962, 0.73393401, 2.55122756, 0.48754103,
       1.48235446, 4.10584957, 2.00431951, 3.30027459, 4.53902471,
       3.86917216, 0.18082425, 2.52671231, 1.05461369, 4.85200213,
       3.64454564, 2.72752939, 0.09968727, 3.20467432, 2.61556658,
       0.5549752 , 2.62990596, 4.5664124 , 0.07769675, 2.19189066,
       4.2297233 , 0.75353601, 1.79937925, 4.92945538, 4.05184829,
       0.93044776, 2.11226122, 4.30015375, 2.62253569, 2.99259077,
       0.17966373, 3.26270482, 1.12908267, 1.53010555, 2.76762684,
       4.0351432 , 2.05241603, 0.48455842, 3.62603409, 4.85260384,
       4.59022088, 0.43310487, 0.95048311, 4.96056426, 2.23169374,
       4.78855398, 4.54117111, 3.30999714, 2.5308554 , 4.95740259,
       0.82029782, 4.9963412 , 2.36048646, 1.47161929, 0.02122999,
       4.63576743, 0.88196497, 0.32724771, 0.70304922, 2.76334371,
       2.62635251, 4.46128935, 0.22588469, 4.38388976, 1.04349138,
       4.46355989, 2.71254448, 0.31828754, 1.91289007, 2.04280285,
       0.76584457, 1.88404276, 3.84729681, 2.05616365, 3.63966576,
       2.16295488, 4.36103458, 3.60257031, 2.44924892, 4.37769826,
       1.47611399, 4.56782522, 1.13131109, 2.09792929, 1.33358379,
       1.13106004, 4.31636077, 2.27383035, 3.68189946, 2.40352412,
       2.05201934, 3.0358261 , 3.08934203, 3.53440648, 2.12132655,
       0.97869997, 3.47452692, 2.27315159, 4.00419511, 0.19728448,
       4.36062923, 1.31332332, 2.00522718, 0.88583265, 4.05795219,
       2.96116015, 4.44835499, 1.26724145, 0.09323955, 4.37089069,
       1.76561541, 2.69187255, 1.30858968, 2.63084652, 1.73894333,
       3.25784253, 1.52667508, 4.53777419, 3.13178077, 1.63062888,
       0.53418646, 1.43199424, 1.51994984, 0.28534977, 1.79560033,
       2.86729882, 4.9114386 , 2.99973907, 3.43527651, 1.49686891,
       4.1684091 , 4.98248534, 1.66194136, 2.18739043, 1.35880911,
       1.1936302 , 3.53720456, 4.58645457, 1.90873596, 3.66332179,
       1.07596966, 1.75748524, 3.63413588, 3.88055709, 0.48397021,
       4.89554617, 0.01482625, 4.08829112, 0.78273011, 2.87616107,
       3.22635542, 3.01353713, 4.53042549, 2.557296  , 3.25927791,
       1.99101181, 0.28716048, 2.59108068, 4.13024389, 4.9015755 ,
       3.14180649, 0.00596959, 2.05200577, 2.43312412, 3.06413178,
       1.68790718, 2.70267791, 0.17666891, 4.55801479, 2.7904297 ,
       0.90812955, 2.80641754, 2.41895574, 1.47098139, 0.54051996,
       1.81491294, 1.2888422 , 1.47709732, 4.08000177, 0.63673579,
       2.84619684, 2.98736478, 0.15918772, 3.73900768, 3.86861603,
       1.8627632 , 2.10304393, 4.79224069, 4.64833034, 1.17489806,
       3.6217719 , 2.81374783, 0.76764518, 0.07159991, 4.9355013 ])
In [21]:
import numpy
import matplotlib.pyplot as plt

x = numpy.random.uniform(0.0, 5.0, 250)

plt.hist(x, 5) # create a histogram to visualise the numbers in the data set
plt.show()
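
What a histogram with 5 bars counts can be sketched without any plotting. The sketch below uses the standard library random module and a fixed range of 0.0 to 5.0 for the bins; both are assumptions made for illustration. plt.hist takes its bin edges from the smallest and largest value in the data by default, so its counts can differ slightly.

```python
import random

random.seed(0)  # hypothetical seed, only so the sketch is repeatable
values = [random.uniform(0.0, 5.0) for _ in range(250)]

bins = 5
low, high = 0.0, 5.0
width = (high - low) / bins

# Count how many values fall into each of the 5 equal-width intervals
counts = [0] * bins
for value in values:
    index = min(int((value - low) / width), bins - 1)  # keep 5.0 in the last bin
    counts[index] += 1

print(counts)
```

For uniformly distributed data, each bar should hold roughly the same number of values, here about 50.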

Normal Data Distribution

A data distribution in which the values are concentrated around a central value is known in probability theory as the normal data distribution, or the Gaussian data distribution, after the mathematician Carl Friedrich Gauss, who came up with its formula.

In [22]:
x = numpy.random.normal(5.0, 1.0, 100000)

plt.hist(x, 100)
plt.show()

Histogram Explained

We use the array from the numpy.random.normal() method, with 100000 values, to draw a histogram with 100 bars. We specify that the mean value is 5.0 and the standard deviation is 1.0. This means that the values should be concentrated around 5.0, with roughly two thirds of them no further than 1.0 from the mean. As you can see from the histogram, most values are between 4.0 and 6.0, with a peak at approximately 5.0.
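
How strongly the values cluster around the mean can be checked numerically. The following sketch uses random.gauss from the standard library rather than NumPy, and the seed is an arbitrary choice. For a normal distribution, about 68% of the values are expected within one standard deviation of the mean and about 95% within two.

```python
import random

random.seed(0)  # arbitrary seed, only so the sketch is repeatable
values = [random.gauss(5.0, 1.0) for _ in range(100000)]

# Share of values within one standard deviation of the mean (4.0 to 6.0)
within_one = sum(1 for v in values if 4.0 <= v <= 6.0) / len(values)

# Share of values within two standard deviations of the mean (3.0 to 7.0)
within_two = sum(1 for v in values if 3.0 <= v <= 7.0) / len(values)

print(within_one, within_two)
```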

Scatter Plot

A scatter plot is a diagram where each value in the data set is represented by a dot.

The Matplotlib module has a method for drawing scatter plots. It needs two arrays of the same length, one for the values of the x-axis and one for the values of the y-axis:

In [23]:
import matplotlib.pyplot as plt

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

plt.scatter(x, y)
plt.show()

Scatter Plot Explained

The x-axis represents ages, and the y-axis represents speeds. What we can read from the diagram is that the two fastest cars were both 2 years old, and the slowest car was 12 years old.

In [24]:
# random numbers for testing
x = numpy.random.normal(5.0, 1.0, 1000) # mean 5.0, standard deviation 1.0
y = numpy.random.normal(10.0, 2.0, 1000) # mean 10.0, standard deviation 2.0

plt.scatter(x, y)
plt.show()

Scatter Plot Explained

We can see that the dots are concentrated around the value 5 on the x-axis, and 10 on the y-axis. We can also see that the spread is wider on the y-axis than on the x-axis.

Linear Regression

Regression

The term regression is used when you try to find the relationship between variables. In Machine Learning, and in statistical modeling, that relationship is used to predict the outcome of future events.

Linear Regression

Linear regression uses the relationship between the data points to draw a straight line that fits them as closely as possible. This line can be used to predict future values.
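
How such a line is found can be sketched with the ordinary least-squares formulas in plain Python. This is an illustrative sketch and not the SciPy implementation; the names sxy, sxx and predict are hypothetical.

```python
# Age (x) and speed (y) of 13 cars
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# How x and y vary together, and how x varies on its own
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)

# Least-squares slope and intercept of the line y = slope * x + intercept
slope = sxy / sxx
intercept = mean_y - slope * mean_x

def predict(age):
    """Return the speed the fitted line gives for a car of the given age."""
    return slope * age + intercept

print(slope, intercept, predict(10))
```

The slope is negative here, which reflects that older cars in this data set tend to be slower.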

In [26]:
import matplotlib.pyplot as plt
from scipy import stats

# Create two arrays with age(x) and speed(y) of cars 
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

# Execute a method that returns some important key values of Linear Regression:
slope, intercept, r, p, std_err = stats.linregress(x, y) 

# Create a function that uses the slope and intercept values to return a new value. 
# This new value represents where on the y-axis the corresponding x value will be placed:
def myfunc(x):
    return slope * x + intercept

# Run each value of the x array through the function. This will result in a new array with new values for the y-axis:
mymodel = list(map(myfunc, x))

plt.scatter(x, y) # Draw the scatter plot
plt.plot(x, mymodel) # Draw the line of linear regression:
plt.show()
In [27]:
r # correlation coefficient: how well the data fits a straight line (square it to get r-squared)
Out[27]:
-0.758591524376155

R-Squared

It is important to know how strong the relationship between the values of the x-axis and the values of the y-axis is. If there is no relationship, linear regression cannot be used to predict anything.

The relationship is measured with the correlation coefficient r, which ranges from -1 to 1, where 0 means no relationship and 1 (or -1) means 100% related. Its square, r-squared, ranges from 0 to 1. Python and the SciPy module compute r for you; all you have to do is feed stats.linregress() the x and y values. The result above, r = -0.76, shows that there is a relationship, though not a perfect one, so the line can be used for predictions.
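
For illustration, the correlation coefficient can also be computed by hand. The sketch below uses the Pearson formula in plain Python on the car data from the linear regression cell; it is not how SciPy computes it internally, and the variable names are illustrative.

```python
import math

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Sums of products and squares of the deviations from the means
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

# Pearson correlation coefficient: between -1 and 1
r = sxy / math.sqrt(sxx * syy)

# r-squared: between 0 and 1
r_squared = r ** 2

print(r, r_squared)
```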

In [29]:
speed = myfunc(10) # predict the speed of a car that is 10 years old
speed
Out[29]:
85.59308314937454

Create an example where linear regression is a bad fit

In [30]:
x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
In [31]:
r # a correlation coefficient close to 0: a very weak linear relationship
Out[31]:
0.01331814154297491

Polynomial Regression

If your data points clearly will not fit a straight line, polynomial regression might be a better choice.

Polynomial regression, like linear regression, uses the relationship between the variables x and y to find the best way to draw a line through the data points.

In the example below, 18 cars were registered as they passed a tollbooth. We have recorded each car's speed and the time of day (hour) at which it passed. The x-axis represents the hours of the day and the y-axis represents the speed:

In [32]:
x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22] # time of day
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100] # speed of cars

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3)) # NumPy method that lets us make a polynomial model

myline = numpy.linspace(1, 22, 100) # Specify how the line will display, we start at position 1, and end at position 22:

plt.scatter(x, y)
plt.plot(myline, mymodel(myline)) # Draw the line of polynomial regression:
plt.show()
In [35]:
from sklearn.metrics import r2_score # Import the relevant library to perform the r-squared calculation

r2_score(y, mymodel(x)) # Print the r-squared number to see if the relationship is good for polynomial regression
Out[35]:
0.9432150416451027
In [38]:
speed = mymodel(17) # predict the speed of a car passing the tollbooth at 5pm (17:00)
speed
Out[38]:
88.87331269697987
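
What r2_score measures can be sketched by hand: one minus the ratio between the squared prediction errors and the squared deviations from the mean. The function name r_squared and the short sample list below are illustrative assumptions, not part of the notebook.

```python
def r_squared(actual, predicted):
    """Return 1 - SS_res / SS_tot for two equally long sequences."""
    mean_actual = sum(actual) / len(actual)
    # Squared errors of the model's predictions
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared deviations of the actual values from their mean
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [100, 90, 80, 60]  # a few sample speeds, for illustration only

# A model that predicts every value exactly scores 1
perfect = r_squared(actual, actual)

# A model that always predicts the mean scores 0
mean_only = [sum(actual) / len(actual)] * len(actual)
baseline = r_squared(actual, mean_only)

print(perfect, baseline)
```

A value close to 1, such as the 0.94 above, means the model explains most of the variation in the data.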

Create an example where polynomial regression is a bad fit

In [39]:
x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))

myline = numpy.linspace(2, 95, 100)

plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
In [40]:
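r # NOTE: r is still the linear correlation coefficient from In [30], not an r-squared for the polynomial model; use r2_score(y, mymodel(x)) to measure the polynomial fit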
Out[40]:
0.01331814154297491

Multiple Regression

Multiple regression is like linear regression, but with more than one independent value, meaning that we try to predict a value based on two or more variables.

In [43]:
import pandas
from sklearn import linear_model

df = pandas.read_csv("cars.csv")

X = df[['Weight', 'Volume']] # make a list of the independent values
y = df['CO2'] # dependent values

# From the sklearn module we will use the LinearRegression() method to create a linear regression object.
regr = linear_model.LinearRegression() 

# This object has a method called fit() that takes the independent and dependent values as parameters and 
# fills the regression object with data that describes the relationship:
regr.fit(X, y)

# predict the CO2 emission of a car where the weight is 2300 kg, and the volume is 1300 ccm:
predictedCO2 = regr.predict([[2300, 1300]])
predictedCO2
Out[43]:
array([107.2087328])

We have predicted that a car with a 1.3 liter engine and a weight of 2300 kg will release approximately 107 grams of CO2 for every kilometer it drives.

Coefficient

The coefficient is a factor that describes the relationship with an unknown variable.

Example: if x is a variable, then 2x is x two times. x is the unknown variable, and the number 2 is the coefficient. In this case, we can ask for the coefficient value of weight against CO2, and for volume against CO2. The answers we get tell us what would happen if we increase, or decrease, one of the independent values.

In [45]:
regr.coef_ # Print the coefficient values of the regression object weight=0.00755095 and volume=0.00780526
Out[45]:
array([0.00755095, 0.00780526])

These values tell us that if the weight increases by 1 kg, the CO2 emission increases by 0.00755095 g.

And if the engine size (Volume) increases by 1 ccm, the CO2 emission increases by 0.00780526 g.

In [47]:
predictedCO2 = regr.predict([[3300, 1300]])

predictedCO2
Out[47]:
array([114.75968007])

We have predicted that a car with a 1.3 liter engine and a weight of 3300 kg will release approximately 115 grams of CO2 for every kilometer it drives.

This shows that the coefficient of 0.00755095 is correct:

107.2087328 + (1000 * 0.00755095) = 114.75968

In [48]:
107.2087328 + (1000 * 0.00755095)
Out[48]:
114.75968280000001

Scale

When your data has different values, and even different measurement units, it can be difficult to compare them. How do kilograms compare to meters? Or altitude to time?

The answer to this problem is scaling. We can scale data into new values that are easier to compare.

There are different methods for scaling data. In this tutorial we will use a method called standardization.

The standardization method uses this formula:

z = (x - u) / s

Where z is the new value, x is the original value, u is the mean and s is the standard deviation.

If you take the weight column from the data set above, the first value is 790, and the scaled value will be:

(790 - 1292.23) / 238.74 = -2.1

If you take the volume column from the data set above, the first value is 1.0, and the scaled value will be:

(1.0 - 1.61) / 0.38 = -1.59

Now you can compare -2.1 with -1.59 instead of comparing 790 with 1.0.
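
The formula can be sketched in plain Python as shown below. The sketch assumes the population standard deviation (dividing by the number of values), which is what StandardScaler uses by default. The weights list is a small hypothetical sample, not the full column from the data set, and standardize is a hypothetical helper name.

```python
import statistics

def standardize(values):
    """Return z = (x - u) / s for every x in values."""
    u = sum(values) / len(values)      # mean
    s = statistics.pstdev(values)      # population standard deviation
    return [(x - u) / s for x in values]

weights = [790, 1160, 929, 865, 1140]  # hypothetical sample values

scaled = standardize(weights)
print(scaled)
```

After standardization the values have a mean of 0 and a standard deviation of 1, which is what makes columns with different units comparable.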

You do not have to do this manually. The Python sklearn module has a class called StandardScaler, which returns a scaler object with methods for transforming data sets.

In [50]:
import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

df = pandas.read_csv("cars2.csv")

X = df[['Weight', 'Volume']]

scaledX = scale.fit_transform(X)

scaledX
Out[50]:
array([[-2.10389253, -1.59336644],
       [-0.55407235, -1.07190106],
       [-1.52166278, -1.59336644],
       [-1.78973979, -1.85409913],
       [-0.63784641, -0.28970299],
       [-1.52166278, -1.59336644],
       [-0.76769621, -0.55043568],
       [ 0.3046118 , -0.28970299],
       [-0.7551301 , -0.28970299],
       [-0.59595938, -0.0289703 ],
       [-1.30803892, -1.33263375],
       [-1.26615189, -0.81116837],
       [-0.7551301 , -1.59336644],
       [-0.16871166, -0.0289703 ],
       [ 0.14125238, -0.0289703 ],
       [ 0.15800719, -0.0289703 ],
       [ 0.3046118 , -0.0289703 ],
       [-0.05142797,  1.53542584],
       [-0.72580918, -0.0289703 ],
       [ 0.14962979,  1.01396046],
       [ 1.2219378 , -0.0289703 ],
       [ 0.5685001 ,  1.01396046],
       [ 0.3046118 ,  1.27469315],
       [ 0.51404696, -0.0289703 ],
       [ 0.51404696,  1.01396046],
       [ 0.72348212, -0.28970299],
       [ 0.8281997 ,  1.01396046],
       [ 1.81254495,  1.01396046],
       [ 0.96642691, -0.0289703 ],
       [ 1.72877089,  1.01396046],
       [ 1.30990057,  1.27469315],
       [ 1.90050772,  1.01396046],
       [-0.23991961, -0.0289703 ],
       [ 0.40932938, -0.0289703 ],
       [ 0.47215993, -0.0289703 ],
       [ 0.4302729 ,  2.31762392]])
In [51]:
import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

df = pandas.read_csv("cars2.csv")

X = df[['Weight', 'Volume']]
y = df['CO2']

scaledX = scale.fit_transform(X)

regr = linear_model.LinearRegression()
regr.fit(scaledX, y)

scaled = scale.transform([[2300, 1.3]])

predictedCO2 = regr.predict([scaled[0]])
print(predictedCO2)
[107.2087328]

Train/Test

Train/Test is a method to measure the accuracy of your model.

It is called Train/Test because you split the data set into two sets: a training set and a testing set.

You train the model using the training set (80% of the data) and test it using the testing set (the remaining 20%).

In [1]:
import numpy
import matplotlib.pyplot as plt
numpy.random.seed(2)

x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

plt.scatter(x, y)
plt.show()
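
The 80/20 split described above can be sketched with list slicing. The sketch generates its own data with the standard library random module, so the values differ from the NumPy data in the cell above; the names train_x, train_y, test_x and test_y are illustrative.

```python
import random

random.seed(2)  # arbitrary seed; does not reproduce numpy.random.seed(2)
x = [random.gauss(3, 1) for _ in range(100)]
y = [random.gauss(150, 40) / xi for xi in x]

# Index where the first 80% of the data ends
split = len(x) * 80 // 100

# Training set: the first 80% of the data
train_x, train_y = x[:split], y[:split]

# Testing set: the remaining 20%
test_x, test_y = x[split:], y[split:]

print(len(train_x), len(test_x))
```

Slicing is sufficient here because the data was generated in random order. For data that is sorted or grouped, shuffling before the split would be needed.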