Source code please in Github repository

# Q1 Comparison of Classifiers

## Data Description

We use the letter dataset from Statlog. The statistics of dataset is shown in Table 1. The class number of the data is 26.

DatasetSizeFeature
Train1500016
Test500016

## Comparison of Classifiers

You are required to implement the following classifiers and compute the performance achieved by different classifiers.

• Decision Tree

You should form decision trees on dataset in terms of entropy and gini criterions. For each criterion, you should set the depth as [5,10,15,20,25] separately. You need to compare the performance (accuracy, precision, recall, f1 score and training time) and give a brief discussion.

• KNN, Random Forest

Apply three different classifiers KNN and Random Forest on the dataset. For each classifier, evaluate the performance (accuracy, precision, recall, f1 score and training time) . You are required to compare the performance of different classifiers and give a brief discussion.

import pandas as pd
import numpy as np

def preprocess(path):
"""
Preprocess the txt file

Input: the file path
Output:
X: a list contains all feature vectors
y: a list of labels
"""

with open(path) as f:
X, y = list(), list()
data = line.strip().split(" ")
i = 0
vec = list()
for item in data:
if i == 0:
y.append(item) # the first item is the label
else:
vec.append(item.split(":")) # traversal 16 features and form features vector
i += 1
X.append(vec)
return X, y

X_train, y_train = preprocess("Q1_dataset/letter_train")
X_test, y_test = preprocess("Q1_dataset/letter_test")
def evaluate(y_predict, y_test):
"""
Evaluate the classifier performance by accuracy, percision, recall and f1

Input: y_predict and the ground truth y_test
Output: accuracy, precision, recall, f1
"""
accuracy = metrics.accuracy_score(y_test, y_predict)
precision = metrics.precision_score(y_test, y_predict, average="macro")
recall = metrics.recall_score(y_test, y_predict, average="macro")
f1 = metrics.f1_score(y_test, y_predict, average="macro")
print("accuracy = ", accuracy)
print("precision = ", precision)
print("recall = ", recall)
print("f1 = ", f1)

return accuracy, precision, recall, f1

## Decision Tree

import time
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

criterion = ["gini", "entropy"]
max_depth = [5, 10, 15, 20, 25]

for c in criterion:
for m in max_depth:
t0 = time.time()
DT = DecisionTreeClassifier(criterion=c, max_depth=m)
DT.fit(X_train, y_train)
t1 = time.time()
y_predict = DT.predict(X_test)
print("criterion = %s, max_depth = %s" % (c, m))
evaluate(y_predict, y_test)
print("trainning time = ", t1 - t0)

According to the above evaluation result, we can summarize the performance of decision tree with different parameters in this table (keep 3 significant digits)

CriterionMax DepthAccuracyPrecisionRecallF1Training Time
gini50.3690.3920.3680.3270.0942s
gini100.7130.7630.7160.7260.103s
gini150.8320.8420.8330.8350.125s
gini200.8670.8690.8670.8670.126s
gini250.8720.8720.8720.8720.126s
entropy50.5010.5610.5010.4980.086s
entropy100.7990.8040.7990.8000.118s
entropy150.8740.8750.8750.8750.132s
entropy200.8710.8710.8720.8710.151s
entropy250.8740.8740.8740.8740.157s

From this table, we can see that with the same parameter of max_depth, entropy always performs better than gini. It's obvious that the larger the max_depth, the better performance of the classifier with higher accuracy, precison, recall and F1 but also with longer time to train. The best accuracy that the clssifier can achieve is about 0.874.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

n_neighbors = [2, 5, 8]
for n in n_neighbors:
t0 = time.time()
neigh.fit(X_train, y_train)
t1 = time.time()
y_predict = neigh.predict(X_test)
print("n_neighbors = ", n)
evaluate(y_predict, y_test)
print("trainning time = ", t1 - t0)

n_estimators = [50, 100, 150]
for n in n_estimators:
t0 = time.time()
RF = RandomForestClassifier(n_estimators=n, criterion="gini", max_depth=20)
RF.fit(X_train, y_train)
t1 = time.time()
y_predict = RF.predict(X_test)
print("n_estimators = ", n)
evaluate(y_predict, y_test)
print("trainning time = ", t1 - t0)

We summarize the performance of different classifiers in the following table. Notice we tune the n_neighbors(number of neighbors to use for K nearest neighbors queries) in the KNN and adjust the n_estimators(number of trees in the forest) in RandomForest. Then we get 6 different classifiers.

ClassifierAccuracyPrecisionRecallF1Training Time
KNN -> n_neighbors = 20.9470.9480.9490.9470.198s
KNN -> n_neighbors = 50.9520.9520.9520.9520.189s
KNN -> n_neighbors = 80.9480.9490.9480.9480.187s
RandomForest-> n_estimators = 500.9570.9580.9580.9580.932s
RandomForest-> n_estimators = 1000.9600.9610.9610.9611.813s
RandomForest-> n_estimators = 1500.9620.9630.9630.9632.734s

Because RandomForest is an ensemble methods that it need to train multiple base classifiers to combine, it needs much more time to train compared with KNN (2.734s is 15X of 0.187s). More number of trees in the forest, it needs more training time but can achieve better performance. However, in KNN, the training time becomes smaller when n_neighbors becomes larger. The best n_neighbors is 5 and the best accuracy that KNN achieves is about 0.952 which is lower than the performance of RandomForest's best accuracy 0.962.

In a word, RandomForest classifiers reduces the variance and has better performnce than KNN but need more time to train.

The following table shows the training dataset. It consists of 10 data and 2 labels.

x0123456789
y111-1-1-1111-1

We assume the weak classifier is produced by $x < v$ or $x > v$ where $v$ is the threshold and makes the classifier get the best accuracy on the dataset. You should implement the AdaBoost algorithm to learn a strong classifier. Notice that you CANNOT use Adaboost library. You need to implement it manually.

You should also report the final expression of the strong classifier, such as $C^∗(x) = sign [\alpha_1 C_1(x) + \alpha_2 C_2(x) + \alpha_3 C_3(x) + \cdots]$, where $C_i(x)$ is the base classifier and $\alpha_i$ is the weight of base classifier. You are also required to describe each basic classifier in detail.

For simplicity, the threshold $v$ should be the multiple of 0.5, i.e., $v\%0.5==0$. For example, you can set $v$ as 2, 2.5, or 3, but you cannot set $v$ as 2.1.

import math
import numpy as np

class BaseClassifier:
"""
The Base Classifier Class

Initial Parameters:
v: the threshold of the base classifier
name: (string) the name of the base classifier
lower_v_label: (+1 or -1) the label when x < v
"""

def __init__(self, v, name, lower_v_label):
self.v = v
self.name = name
self.lower_v_label = lower_v_label

def predict(self, X):
y_predict = list()
for x in X:
if x < self.v:
y_predict.append(self.lower_v_label)
else:
y_predict.append(-self.lower_v_label)
return y_predict

def get_error_rate(self, X, y, weights):
weight_count = 0
total = len(y)
y_predict = self.predict(X)
for i in range(total):
if y_predict[i] != y[i]:
weight_count += weights[i]
return weight_count / total

We consider from two sides to find all best base classifiers, firstly we consider this situation,

C(x) = \left\{ \begin{aligned} +1 & , x < v \\ -1 & , \ x \geq v \end{aligned} \right.

With $v$ satisfying $v\%0.5==0$, it's obvious when $v=2.5$ or $v=8.5$, classifier achieves the lowest error rate 0.3, only misclassfying 3 samples. We set the two classifiers as $C_1$ and $C_2$:

C_1(x) = \left\{ \begin{aligned} +1 & , x < 2.5 \\ -1 & , \ x \geq 2.5 \end{aligned} \right. \ \ \ \ \ C_2(x) = \left\{ \begin{aligned} +1 & , x < 8.5 \\ -1 & , \ x \geq 8.5 \end{aligned} \right.

Then we consider a classifier with this form,

C(x) = \left\{ \begin{aligned} -1 & , x < v \\ +1 & , \ x \geq v \end{aligned} \right.

Similarly, we can find that the classifier with $v=5.5$ has the smallest error rate of 0.4. We denote it by $C_3$:

C_3(x) = \left\{ \begin{aligned} -1 & , x < 5.5 \\ +1 & , \ x \geq 5.5 \end{aligned} \right.

# Training Dataset
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

weights = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Initialize three best basic classifiers C1, C2, C3
# v = 2.5, lower_v_label = 1
C1 = BaseClassifier(v=2.5, name="C1", lower_v_label=1)
print("Classifier 1: v < 2.5 => y = +1, v >= 2.5 => y = -1")
print("prediction:",C1.predict(X))
print("error rate: ",C1.get_error_rate(X, y, weights))
print()

# v = 8.5, lower_v_label = 1
C2 = BaseClassifier(v=8.5, name="C2", lower_v_label=1)
print("Classifier 1: v < 8.5 => y = +1, v >= 8.5 => y = -1")
print("prediction:", C2.predict(X))
print("error rate: ", C2.get_error_rate(X, y, weights))
print()

# v = 5.5, lower_v_label = -1
C3 = BaseClassifier(v=5.5, name="C3", lower_v_label=-1)
print("Classifier 1: v < 2.5 => y = -1, v >= 2.5 => y = +1")
print("prediction:",C3.predict(X))
print("error rate: ",C3.get_error_rate(X, y, weights))
print()
Classifier 1: v < 2.5 => y = +1, v >= 2.5 => y = -1
prediction: [1, 1, 1, -1, -1, -1, -1, -1, -1, -1]
error rate:  0.3

Classifier 1: v < 8.5 => y = +1, v >= 8.5 => y = -1
prediction: [1, 1, 1, 1, 1, 1, 1, 1, 1, -1]
error rate:  0.3

Classifier 1: v < 2.5 => y = -1, v >= 2.5 => y = +1
prediction: [-1, -1, -1, -1, -1, -1, 1, 1, 1, 1]
error rate:  0.4


class Adaboost:
"""

Initial Parameters:
base_classifiers: [Classifier1, Classifier2, ...] the list of base classifiers used to train
n_classifiers: The maximum number of classifiers at which boosting is terminated
"""

def __init__(self, base_classifiers, n_classifiers):
self.base_classifiers = base_classifiers
self.n_classifiers = n_classifiers

# train on data X -> y
def fit(self, X, y):
weights = [1/len(X)] * len(X) # initialize weight
self.alphas = list()
self.classifiers = list()
for n in range(self.n_classifiers):
min_error_rate = 1
for cf in self.base_classifiers:
e = cf.get_error_rate(X, y, weights)
print(cf.name, "error rate = ", e)
if e < min_error_rate:
best_base_classifier = cf
min_error_rate = e

# calculate the importance of the classifier
alpha = 1/2 * math.log((1-min_error_rate)/min_error_rate)

# save alpha and the best_base_classifier at each iteration
self.alphas.append(alpha)
self.classifiers.append(best_base_classifier)

# update weights
y_predict = best_base_classifier.predict(X)
for i in range(len(weights)):
if y_predict[i] == y[i]:
weights[i] = weights[i] / (2 * (1 - min_error_rate))
else:
weights[i] = weights[i] / (2 * min_error_rate)
print("weights = ", weights)
print()

# the information to print to describe the final ensemble classifier
info = "final ensemble classifier = "
i = 0
for i in range(len(self.alphas)):
if i < len(self.alphas) - 1:
info += (str(self.alphas[i]) + " * " + self.classifiers[i].name + " + ")
else:
info += (str(self.alphas[i]) + " * " + self.classifiers[i].name)
print(info)

def predict(self, X):
scores = np.array([0.0]*len(X))
for i in range(len(self.alphas)):
scores += self.alphas[i] * np.array(self.classifiers[i].predict(X))

return np.sign(scores)

def get_error_rate(self, X, y, weights):
weight_count = 0
total = len(y)
y_predict = self.predict(X)
for i in range(total):
if y_predict[i] != y[i]:
weight_count += weights[i]
return weight_count / total
ada = Adaboost(base_classifiers=[C1, C2, C3], n_classifiers=3)
ada.fit(X, y)
C1 error rate =  0.030000000000000006
C2 error rate =  0.030000000000000006
C3 error rate =  0.04
weights =  [0.051546391752577324, 0.051546391752577324, 0.051546391752577324, 0.051546391752577324, 0.051546391752577324, 0.051546391752577324, 1.6666666666666665, 1.6666666666666665, 1.6666666666666665, 0.051546391752577324]

C1 error rate =  0.5
C2 error rate =  0.015463917525773196
C3 error rate =  0.02061855670103093
weights =  [0.02617801047120419, 0.02617801047120419, 0.02617801047120419, 1.6666666666666667, 1.6666666666666667, 1.6666666666666667, 0.8464223385689353, 0.8464223385689353, 0.8464223385689353, 0.02617801047120419]

C1 error rate =  0.25392670157068065
C2 error rate =  0.5
C3 error rate =  0.010471204188481676
weights =  [1.25, 1.25, 1.25, 0.8421516754850089, 0.8421516754850089, 0.8421516754850089, 0.42768959435626097, 0.42768959435626097, 0.42768959435626097, 1.25]

final ensemble classifier = 1.7380493449176364 * C1 + 2.07683056968926 * C2 + 2.2742999172498486 * C3


ada.predict(X)
array([ 1.,  1.,  1., -1., -1., -1.,  1.,  1.,  1., -1.])

ada.get_error_rate(X, y, weights)
0.0


In the end, we get a strong classifier $C^*(x)$

$$C^*(x) = sign(1.7380493449176364 * C_1(x) + 2.07683056968926 * C_2(x) + 2.2742999172498486 * C_3(x))$$

that can achieve 0 error of classification where

C_1(x) = \left\{ \begin{aligned} +1 & , x < 2.5 \\ -1 & , \ x \geq 2.5 \end{aligned} \right. \ \ \ \ \ C_2(x) = \left\{ \begin{aligned} +1 & , x < 8.5 \\ -1 & , \ x \geq 8.5 \end{aligned} \right. \ \ \ \ \ C_3(x) = \left\{ \begin{aligned} -1 & , x < 5.5 \\ +1 & , \ x \geq 5.5 \end{aligned} \right.

The classification result of $C^*(x)$ is [1, 1, 1, -1, -1, -1, 1, 1, 1, -1] 