diff --git a/assignment-1/submission/18340986009/README.md b/assignment-1/submission/18340986009/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..fa66f00c0a8bb1084d3920b47714534b3a660dfe
--- /dev/null
+++ b/assignment-1/submission/18340986009/README.md
@@ -0,0 +1,159 @@
# KNN Classification

This report includes two parts:
1. Find a KNN model that maximizes accuracy on a given dataset (each class follows a Gaussian distribution, with distribution parameters chosen at random).
2. Assess how the distribution parameters affect model accuracy, using the model built in part 1.


## 1. Model Generation

### 1.1 Overview of Mock Data

Generate three classes of points from two-dimensional Gaussian distributions:

$
 N_0 = 150 \hspace{1cm}
 C_0 \sim \mathcal{N}(\mu = \begin{bmatrix}50\\\\50\end{bmatrix},\Sigma = \begin{bmatrix}60 & -50\\\\-50 & 140\end{bmatrix})
$

$
 N_1 = 250 \hspace{1cm}
 C_1 \sim \mathcal{N}(\mu = \begin{bmatrix}60\\\\20\end{bmatrix},\Sigma = \begin{bmatrix}130 & 10\\\\10 & 100\end{bmatrix})
$

$
 N_2 = 100 \hspace{1cm}
 C_2 \sim \mathcal{N}(\mu = \begin{bmatrix}20\\\\60\end{bmatrix},\Sigma = \begin{bmatrix}120 & 20\\\\20 & 90\end{bmatrix})
$

Mock Data 1 Overview:

![Mock Data 1](img/Figure%201.png)

The 500 points are then split randomly into a training set (80%) and a testing set (20%).

### 1.2 Model Accuracy with Different K and Distance Metric

A rule of thumb is to let $K = \sqrt{N}$, where $N$ is the size of the training set; here $N = 0.8 \times (N_0 + N_1 + N_2) = 400$, so we first try some values of K around $\sqrt{400} = 20$ using both Euclidean and Manhattan distance.

| Accuracy (%) | K = 10 | K = 15 | K = 20 | K = 25 | K = 30 |
| ------------ |:------:|:------:|:------:|:------:|:------:|
| **Euclidean** |83.0|82.0|83.0|81.0|80.0|
| **Manhattan** |83.0|82.0|81.0|81.0|81.0|

The KNN model with $K = 10$ gives the best accuracy of 83% for both distance metrics, so we choose $K_{0} = 10$ as the starting point for model optimization. Below is a scatter plot showing the prediction result of the chosen model ($K = 10$, Euclidean distance). Each red dot represents a misclassification.

*Note that model accuracy shows little difference between the two distance metrics on this dataset.

![Prediction result with K = 10, Euclidean distance](img/Figure%202.png)

### 1.3 Model Optimization

General idea: $K_{i+1} = \lceil{K_{i} + Step_{i+1}}\rceil$

Detailed steps (a code sketch follows at the end of this section):

 - For each $K_{i+1}$, calculate its accuracy rate $R_{i+1}$.
 - If $R_{i+1} > R_{0}$, a better model has been found; end the optimization. Otherwise:
   - If $R_{i+1} > R_{i}$, let $Step_{i+1} = \frac{1}{C} Step_{i}$, where $C = (R_{i+1} - R_{i}) / R_{i}$.
     That is, if model accuracy improves, continue in this direction; the step size is inversely related to the relative improvement.
   - If $R_{i+1} \le R_{i}$, let $Step_{i+1} = - \frac{1}{2} Step_{i}$.
     That is, if the new K does not improve model accuracy, try a smaller step in the reverse direction.

The model from 1.2 gives K = 10 and Euclidean distance. Using this model as the starting point, define the first step as $Step_{1} = -\frac{1}{100}N_{total} = -5$, i.e., we first search toward smaller K.

Optimization process:

| \ | K = 10 | K = 5 | K = 8 |
| ------------ |:------:|:------:|:------:|
| **Accuracy rate (%)** |83.0|83.0|85.0|

After two optimization steps, a higher accuracy of 85% is reached when K is adjusted to 8. Thus, our final KNN model uses K = 8 and Euclidean distance.

Prediction result evaluation:

![Prediction result with K = 8, Euclidean distance](img/Figure%203.png)

Compared with the model before optimization, two points near the top are now classified correctly.
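To make the search loop concrete, here is a minimal sketch of the procedure above. It is illustrative only: `accuracy` is assumed to be any callable that trains and evaluates a KNN model for a given K and returns an accuracy rate in percent, and the iteration cap is an added safeguard, not part of the rule itself.

```python
import math

def optimize_k(accuracy, K0=10, step0=5, max_iter=20):
    # Adaptive search over K, following the update rules in section 1.3.
    R0 = accuracy(K0)                 # accuracy of the starting model
    K, R, step = K0, R0, -step0       # the first step goes toward smaller K
    for _ in range(max_iter):         # safeguard against non-termination
        K_new = math.ceil(K + step)
        R_new = accuracy(K_new)
        if R_new > R0:                # better than the starting model: stop
            return K_new, R_new
        if R_new > R:                 # improved: keep direction, rescale step
            step = step / ((R_new - R) / R)
        else:                         # no improvement: halve and reverse
            step = -step / 2
        K, R = K_new, R_new
    return K, R
```

With the `KNN` class from `source.py`, one could call this as `optimize_k(lambda K: model.calc_accuracy(K, "Euclidean", train_data, train_label))`; on the accuracies reported above it reproduces the trace K = 10 → 5 → 8 and stops at 85%.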
## 2. Distribution Parameters & Model Accuracy

Intuitively, we hypothesize that any change that mixes the classes together more heavily will make classification harder, thereby decreasing model accuracy. Below, we modify the parameters of the Gaussian distributions to test this hypothesis.

### 2.1 Change of Variance and Covariance

Keep the means the same, and modify the variance-covariance matrix of each class to increase the overlap between classes:

$
 N_0 = 150 \hspace{1cm}
 C_0 \sim \mathcal{N}(\mu = \begin{bmatrix}50\\\\50\end{bmatrix},\Sigma = \begin{bmatrix}300 & 0\\\\0 & 200\end{bmatrix})
$

$
 N_1 = 250 \hspace{1cm}
 C_1 \sim \mathcal{N}(\mu = \begin{bmatrix}60\\\\20\end{bmatrix},\Sigma = \begin{bmatrix}250 & 0\\\\0 & 150\end{bmatrix})
$

$
 N_2 = 100 \hspace{1cm}
 C_2 \sim \mathcal{N}(\mu = \begin{bmatrix}20\\\\60\end{bmatrix},\Sigma = \begin{bmatrix}150 & 0\\\\0 & 150\end{bmatrix})
$

Mock Data 2 Overview:

![Mock Data 2](img/Figure%204.png)

Prediction result evaluation:

![Prediction result on Mock Data 2](img/Figure%205.png)

The accuracy of our model drops from 85% to 79%, as expected.

### 2.2 Change of Mean

Keeping the other parameters the same, move the means of the classes closer together to increase the overlap:

$
 N_0 = 150 \hspace{1cm}
 C_0 \sim \mathcal{N}(\mu = \begin{bmatrix}50\\\\50\end{bmatrix},\Sigma = \begin{bmatrix}60 & -50\\\\-50 & 140\end{bmatrix})
$

$
 N_1 = 250 \hspace{1cm}
 C_1 \sim \mathcal{N}(\mu = \begin{bmatrix}50\\\\40\end{bmatrix},\Sigma = \begin{bmatrix}130 & 10\\\\10 & 100\end{bmatrix})
$

$
 N_2 = 100 \hspace{1cm}
 C_2 \sim \mathcal{N}(\mu = \begin{bmatrix}40\\\\60\end{bmatrix},\Sigma = \begin{bmatrix}120 & 20\\\\20 & 90\end{bmatrix})
$

Mock Data 3 Overview:

![Mock Data 3](img/Figure%206.png)

Prediction result evaluation:

![Prediction result on Mock Data 3](img/Figure%207.png)

The accuracy of our model drops from 85% to 73%, as expected.

### 2.3 N & Model Accuracy

In an attempt to increase model accuracy, we double each class size of Data 3 while keeping the class proportions. With $N_{total} = 1000$, we expect some increase in model accuracy.

Mock Data 4 Overview:

![Mock Data 4](img/Figure%208.png)

Prediction result evaluation:

![Prediction result on Mock Data 4](img/Figure%209.png)

Model accuracy decreases from 73% to 62.5% even though the data size doubled. This suggests that sample size contributes much less to model accuracy than the distribution parameters do. This makes sense: if the data labeled as different categories in fact come from heavily overlapping distributions, increasing N mainly provides more evidence of the similarity between those categories rather than sharpening the boundaries between them.

## Summary

The main takeaways from this exercise:

Model accuracy depends mainly on the distribution parameters and the choice of K. The distance metric has little influence on model accuracy, and whether an increase of N improves model accuracy depends on whether the true distributions of the categories are significantly different (a statistical test and its p-value might be used to evaluate this).
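As one possible way to make that last suggestion concrete, the sketch below compares two classes feature by feature with a two-sample Kolmogorov-Smirnov test. This assumes `scipy` is available; the function name, the per-feature approach, and the `alpha` threshold are illustrative choices, not part of this submission's code.

```python
from scipy import stats

def classes_differ(a, b, alpha=0.05):
    # a, b: (n, 2) arrays of points sampled from two classes.
    # Run a two-sample KS test per feature; if any feature's p-value falls
    # below alpha, the two class distributions likely differ on that feature.
    pvalues = [stats.ks_2samp(a[:, d], b[:, d]).pvalue for d in range(a.shape[1])]
    return min(pvalues) < alpha, pvalues
```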
diff --git a/assignment-1/submission/18340986009/img/Figure 1.png b/assignment-1/submission/18340986009/img/Figure 1.png
new file mode 100644
index 0000000000000000000000000000000000000000..32d5ded9c9d662bf7eacaede5e9316ba1d545335
Binary files /dev/null and b/assignment-1/submission/18340986009/img/Figure 1.png differ
diff --git a/assignment-1/submission/18340986009/img/Figure 2.png b/assignment-1/submission/18340986009/img/Figure 2.png
new file mode 100644
index 0000000000000000000000000000000000000000..c7e7752721f808ea5ca19a56a7e642badb1617fd
Binary files /dev/null and b/assignment-1/submission/18340986009/img/Figure 2.png differ
diff --git a/assignment-1/submission/18340986009/img/Figure 3.png b/assignment-1/submission/18340986009/img/Figure 3.png
new file mode 100644
index 0000000000000000000000000000000000000000..5a3fd62c0681f995d32c1ea794258095239261ee
Binary files /dev/null and b/assignment-1/submission/18340986009/img/Figure 3.png differ
diff --git a/assignment-1/submission/18340986009/img/Figure 4.png b/assignment-1/submission/18340986009/img/Figure 4.png
new file mode 100644
index 0000000000000000000000000000000000000000..9c1e05f712b290be595b12c812476c72e0f0002d
Binary files /dev/null and b/assignment-1/submission/18340986009/img/Figure 4.png differ
diff --git a/assignment-1/submission/18340986009/img/Figure 5.png b/assignment-1/submission/18340986009/img/Figure 5.png
new file mode 100644
index 0000000000000000000000000000000000000000..e49ec9595ac9c813a2e6044375c534bb669b3a7c
Binary files /dev/null and b/assignment-1/submission/18340986009/img/Figure 5.png differ
diff --git a/assignment-1/submission/18340986009/img/Figure 6.png b/assignment-1/submission/18340986009/img/Figure 6.png
new file mode 100644
index 0000000000000000000000000000000000000000..11a84369882f65a2a3e46237e51fe479d4f14b88
Binary files /dev/null and b/assignment-1/submission/18340986009/img/Figure 6.png differ
diff --git a/assignment-1/submission/18340986009/img/Figure 7.png b/assignment-1/submission/18340986009/img/Figure 7.png
new file mode 100644
index 0000000000000000000000000000000000000000..ee33c60766eb907d5b8992c24ca3806c297d9fc8
Binary files /dev/null and b/assignment-1/submission/18340986009/img/Figure 7.png differ
diff --git a/assignment-1/submission/18340986009/img/Figure 8.png b/assignment-1/submission/18340986009/img/Figure 8.png
new file mode 100644
index 0000000000000000000000000000000000000000..a3f42ac859f2ef35448cb16f0412df387ba8e7a8
Binary files /dev/null and b/assignment-1/submission/18340986009/img/Figure 8.png differ
diff --git a/assignment-1/submission/18340986009/img/Figure 9.png b/assignment-1/submission/18340986009/img/Figure 9.png
new file mode 100644
index 0000000000000000000000000000000000000000..0de5d1f658bdd5860681cfee20432e8074f39a1d
Binary files /dev/null and b/assignment-1/submission/18340986009/img/Figure 9.png differ
diff --git a/assignment-1/submission/18340986009/source.py b/assignment-1/submission/18340986009/source.py
new file mode 100644
index 0000000000000000000000000000000000000000..410b588394c97d15227671e94f6e24e6cbb46882
--- /dev/null
+++ b/assignment-1/submission/18340986009/source.py
@@ -0,0 +1,249 @@
#!/usr/bin/env python
# coding: utf-8

import sys
import numpy as np
import matplotlib.pyplot as plt


# ## Define Global Functions

# Generate training and testing sets
def generate(Ns, Means, Covs, train_frac):

    # Draw 2-D Gaussian samples for each class
    data = list()
    label = list()

    for i in range(len(Ns)):
        Ci = np.random.multivariate_normal(Means[i], Covs[i], Ns[i])
        data.append(Ci)
        label.append([i] * Ns[i])

    data = np.concatenate(data)
    label = np.concatenate(label)

    # Shuffle the points
    idx = np.arange(sum(Ns))
    np.random.shuffle(idx)

    data = data[idx]
    label = label[idx]

    # Split into training and testing sets
    split_point = int(label.size * train_frac)
    train_data, test_data = data[:split_point], data[split_point:]
    train_label, test_label = label[:split_point], label[split_point:]

    # dtype=object lets arrays of different shapes live in one .npy file
    np.save("data.npy", np.array(((train_data, train_label),
                                  (test_data, test_label)), dtype=object))

    return train_data, train_label, test_data, test_label


# Read back the saved data
def read():
    (train_data, train_label), (test_data, test_label) = np.load(
        "data.npy", allow_pickle=True)
    return (train_data, train_label), (test_data, test_label)


# Create a scatter plot of the different categories
def display(data, colorby, name, title):
    colors = ['red', 'grey', 'blue']
    datas = [[], [], []]

    # Group the points by their class label
    for i in range(len(data)):
        datas[colorby[i]].append(data[i])

    for i in range(len(datas)):
        each = np.array(datas[i])
        if len(each) == 0:
            continue
        plt.scatter(each[:, 0], each[:, 1],
                    marker='o',
                    color=colors[i],
                    alpha=0.7)

    plt.xlabel("X1")
    plt.ylabel("X2")
    plt.title(title)
    plt.savefig(f'img/{name}')
    plt.show()
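# Example round-trip for the helpers above (assumed usage; it mirrors the
# __main__ block at the bottom of this file):
#   generate(Ns=[150, 250, 100],
#            Means=[[50, 50], [60, 20], [20, 60]],
#            Covs=[[[60, -50], [-50, 140]],
#                  [[130, 10], [10, 100]],
#                  [[120, 20], [20, 90]]],
#            train_frac=0.8)
#   (train_data, train_label), (test_data, test_label) = read()
#   display(train_data, train_label, 'train', 'Scatter Plot of Training Data')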
# ## Define Class KNN

class KNN:

    def __init__(self):
        self.K = None
        self.Dist = None
        self.data = None
        self.label = None

    # Calculate the distance between two given points
    def get_distance(self, x, y, dist_type="Euclidean"):
        if dist_type == "Euclidean":
            return np.sqrt(sum((x[i] - y[i]) ** 2 for i in range(len(x))))
        if dist_type == "Manhattan":
            return sum(np.abs(x[i] - y[i]) for i in range(len(x)))
        raise ValueError(f"Unknown distance type: {dist_type}")

    # Make a prediction for one point
    def predict_for_one(self, K, Dist, target, train_data, train_label,
                        exclude_self=False):
        # Calculate distances between the target point and all training points
        dists = []
        for i in range(len(train_data)):
            dist = self.get_distance(target, train_data[i], Dist)
            dists.append((train_data[i], train_label[i], dist))

        # Get the K nearest neighbors; when the target itself belongs to the
        # training set, skip the closest match (the target itself)
        dists.sort(key=lambda e: e[-1])
        neighbors = dists[1:K + 1] if exclude_self else dists[:K]

        # Predict the majority class among the neighbors
        neighbors_class = [e[-2] for e in neighbors]
        prediction = max(neighbors_class, key=neighbors_class.count)

        return prediction

    # Calculate model accuracy on the training set
    def calc_accuracy(self, K, Dist, train_data, train_label):
        predictions = []
        # Make predictions for the training data
        for i in range(len(train_label)):
            prediction = self.predict_for_one(
                K, Dist, train_data[i], train_data, train_label,
                exclude_self=True
            )
            predictions.append(prediction)

        correct = 0
        for i in range(len(predictions)):
            if train_label[i] == predictions[i]:
                correct += 1
        accuracy = correct / len(predictions) * 100

        return accuracy
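    # Note: calc_accuracy evaluates on the training data itself; together with
    # exclude_self=True in predict_for_one, this amounts to leave-one-out
    # cross-validation, so K and the distance metric are chosen without ever
    # touching the test set.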
    # Find the K & distance-metric combination that gives the highest accuracy
    def fit(self, K_list, Dist_list, train_data, train_label):

        # Loop through the given options for K and distance metrics
        accuracy_list = []
        for Dist in Dist_list:
            row = []
            for K in K_list:
                row.append(self.calc_accuracy(K, Dist, train_data, train_label))
            accuracy_list.append(row)

        # Locate every (distance, K) pair that reaches the global maximum
        ac_array = np.array(accuracy_list)
        params = np.where(ac_array == ac_array.max())

        # Randomly choose ONE of the maximizing pairs; sampling the row and
        # column indices independently could combine them into a pair that
        # is not a maximum
        pick = np.random.randint(len(params[0]))
        self.Dist = Dist_list[params[0][pick]]
        self.K = K_list[params[1][pick]]
        self.data = train_data
        self.label = train_label

        return ac_array

    def predict(self, test_data):
        # Predict every point in the test data with the fitted parameters
        predictions = []
        for i in range(len(test_data)):
            prediction = self.predict_for_one(
                self.K, self.Dist,
                test_data[i],
                self.data,
                self.label)
            predictions.append(prediction)

        return np.array(predictions)


# ## Start of Program

if __name__ == '__main__':

    if len(sys.argv) > 1 and sys.argv[1] == "g":
        # Class sizes follow the report: N0 = 150, N1 = 250, N2 = 100
        generate(
            Ns=[150, 250, 100],

            Means=[[50, 50],
                   [60, 20],
                   [20, 60]],

            Covs=[[[60, -50], [-50, 140]],
                  [[130, 10], [10, 100]],
                  [[120, 20], [20, 90]]],

            train_frac=0.8
        )

    elif len(sys.argv) > 1 and sys.argv[1] == "d":
        (train_data, train_label), (test_data, test_label) = read()

        display(train_data, train_label,
                'train', 'Scatter Plot of Training Data')
        display(test_data, test_label,
                'test', 'Scatter Plot of Testing Data')
    else:
        (train_data, train_label), (test_data, test_label) = read()

        model = KNN()

        # Same candidate grid as section 1.2 of the README
        model.fit(
            K_list=[10, 15, 20, 25, 30],
            Dist_list=["Euclidean", "Manhattan"],
            train_data=train_data,
            train_label=train_label)

        res = model.predict(test_data)

        print("acc =", np.mean(np.equal(res, test_label)))
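# Usage (assumed from the argument handling above):
#   python source.py g   # generate a fresh dataset and save it to data.npy
#   python source.py d   # plot the saved training and testing sets
#   python source.py     # fit the KNN model and print the test accuracy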