Blog code: 180913
Homework code: 180312
Applying KNN (Python code)
-
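Introduction to KNN
Figure: the basic principle of KNN
The KNN algorithm computes the "distance" [1] from every sample in a labeled dataset to the target sample and sorts the samples by that distance. Given a threshold K, it takes the K nearest samples, counts how many of them belong to each class, and assigns the unknown sample to the class with the largest count.
[1] Several common distance measures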
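The footnote does not say which distance measures it has in mind, so the three below (Euclidean, Manhattan, Chebyshev) are an assumption, chosen only because they are widely used with KNN. The function names are illustrative and do not come from the post. A minimal sketch:

```python
import numpy as np


def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))


def manhattan(a, b):
    # Sum of the absolute differences along each attribute.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b)))


def chebyshev(a, b):
    # Largest absolute difference along any single attribute.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.max(np.abs(a - b)))


p, q = [0.0, 0.0], [3.0, 4.0]
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
print(chebyshev(p, q))  # 4.0
```

The implementation later in this post uses the Euclidean distance.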
-
Advantages:
- Simple
- Works fairly well on basic recognition problems
-
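Disadvantages:
- It is a lazy learner: it does not learn a model from the training set, it simply uses the training set directly in its computations
- It is expensive on large datasets: the whole training set has to stay in memory, and every test sample is compared against all of it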
-
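Algorithm steps
- Choose K (the threshold)
- Compute the distance from every sample in the dataset to the test sample
- Sort the distances and take the classes of the k nearest samples
- Count how many samples fall into each class
- The class with the largest count is the class of the test sample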
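The steps above can be sketched compactly as follows. This is an illustrative version, not the homework code itself: `knn_predict` and the toy data are made up for the example, and the Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(train_x, train_y, sample, k):
    train_x = np.asarray(train_x, dtype=float)
    sample = np.asarray(sample, dtype=float)
    # Distance from every training sample to the test sample.
    distances = np.sqrt(np.sum((train_x - sample) ** 2, axis=1))
    # Indices of the k nearest training samples.
    nearest = np.argsort(distances)[:k]
    # Count how many of those neighbours fall into each class.
    votes = Counter(train_y[i] for i in nearest)
    # The class with the most votes is the prediction.
    return votes.most_common(1)[0][0]


# Toy data: two well-separated groups (made up for illustration).
train_x = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]]
train_y = ["a", "a", "a", "b", "b"]
print(knn_predict(train_x, train_y, [1.1, 1.0], k=3))  # a
```

When two classes receive the same number of votes, this sketch returns whichever of them was counted first; a real solution may want an explicit tie-breaking rule.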
-
Homework
- Three different classes
- Four values per sample (attributes)
Goal: achieve the highest accuracy
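Accuracy here is taken to mean the fraction of test samples whose predicted class matches the expected class. That reading is an assumption, although it matches how the implementation in this post scores each k. A small sketch with a hypothetical helper named `accuracy`:

```python
def accuracy(predicted, expected):
    # Fraction of positions where the predicted label equals the expected one.
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)


# Made-up labels: three of the four predictions are right.
print(accuracy(["a", "b", "c", "a"], ["a", "b", "a", "a"]))  # 0.75
```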
-
Python implementation
import codecs
import numpy as np


def load_data(filename):
    # Each line holds four numeric attributes followed by the class label,
    # separated by commas.
    with open(filename) as f:
        lines = f.readlines()
    rows = []
    for line in lines:
        fields = [s for s in line.strip().split(',') if len(s) > 0]
        if fields:  # skip blank lines
            rows.append(fields)
    data = np.array([r[0:4] for r in rows], dtype=float)
    labels = [r[4:5] for r in rows]
    return data, labels


# A: training attributes, B: training labels
A, B = load_data("train.txt")
count_1 = A.shape[0]


def KNN(test, A, B, k):
    # Euclidean distance from the test sample to every training sample
    num = A.shape[0]
    diff = np.tile(test, (num, 1)) - A
    squareddiff = diff ** 2
    squareddist = np.sum(squareddiff, axis=1)
    distance = squareddist ** 0.5

    # Count the labels of the k nearest training samples
    sortdist = np.argsort(distance)
    classcount = {}
    for i in range(k):
        vote = tuple(B[sortdist[i]])
        classcount[vote] = classcount.get(vote, 0) + 1

    # Return the label with the most votes
    maxcount = 0
    maxindex = None
    for key, value in classcount.items():
        if value > maxcount:
            maxcount = value
            maxindex = key

    return maxindex


# C: test attributes, D: expected test labels
C, D = load_data("test_try.txt")


def tryBestK(A, B):
    # Try each k and keep the one with the highest accuracy on the
    # module-level test data (C, D)
    num1 = C.shape[0]
    maxright = 0
    maxans = 1
    for k in range(1, count_1):
        out = []
        for i in range(num1):
            output = KNN(C[i], A, B, k)
            out.append(list(output))
        count_3 = 0
        for i in range(num1):
            if out[i] == D[i]:
                count_3 += 1
        right = count_3 / num1
        if maxright < right:
            maxright = right
            maxans = k

    return maxans


K = tryBestK(A, B)
num1 = C.shape[0]
out = []
for i in range(num1):
    output = KNN(C[i], A, B, K)
    out.append(list(output))

# Append the predicted label to each test row and write the result
C = np.column_stack((C, out))
with codecs.open("test_ans.txt", 'w', 'utf-8') as f:
    for i in C:
        f.write(",".join(i) + '\r\n')
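The distance step in `KNN` repeats the test sample with `tile` so that it has the same shape as the training matrix before subtracting. NumPy broadcasting gives the same result without the explicit copy. The short check below uses made-up attribute values (not taken from `train.txt`) to show that the two forms agree:

```python
import numpy as np

# Made-up samples with four attributes each, for illustration only.
train = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [6.2, 2.9, 4.3, 1.3],
    [7.7, 3.0, 6.1, 2.3],
])
test = np.array([6.0, 3.0, 4.5, 1.5])

# Explicitly repeat the test sample once per training row, then subtract.
with_tile = np.sqrt(np.sum((np.tile(test, (train.shape[0], 1)) - train) ** 2, axis=1))

# Broadcasting stretches the 1-D test sample across the rows automatically.
with_broadcast = np.sqrt(np.sum((test - train) ** 2, axis=1))

assert np.allclose(with_tile, with_broadcast)
print(np.argsort(with_broadcast))  # training row indices, nearest first
```

Either form works; broadcasting is simply shorter and avoids building the temporary tiled array.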
The individual pieces used in this code are covered separately in three posts: blog codes 180914, 180915, and 180916.