Implementing Machine Learning Algorithms in Python: K-Nearest Neighbors and Decision Trees Explained

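As datasets keep growing, traditional statistical methods alone often fall short of what modern data analysis needs, and machine learning algorithms have become an important part of data science. K-nearest neighbors and decision trees are two of the most commonly used of these algorithms. This article walks through how each one works and how to implement it.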

K-Nearest Neighbors

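The K-nearest neighbors algorithm (KNN) is an instance-based learning method. Its core idea is to predict a sample's label from the labels of the training samples closest to it in feature space. KNN is non-parametric and is often called a lazy learner: it has no explicit training phase, it simply stores the training data and defers all computation to prediction time.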

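KNN is implemented in the following steps:

1. Compute the distance between the test sample and every training sample.
2. Select the K training samples with the smallest distances.
3. Assign the test sample the class that appears most often among those K neighbors.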
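
As a minimal sketch of the three steps above, the function below classifies a single query point. The name `knn_classify` and the toy data are assumptions made for illustration, and Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

def knn_classify(x_train, y_train, query, k=3):
    """Classify one query point by majority vote of its k nearest neighbors."""
    # Step 1: Euclidean distance from the query to every training sample.
    distances = np.sqrt(np.sum((x_train - query) ** 2, axis=1))
    # Step 2: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Step 3: the most frequent class among those neighbors.
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
x_train = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [6.0, 7.0], [7.0, 6.0]])
y_train = np.array(['a', 'a', 'a', 'b', 'b', 'b'])

label_near_a = knn_classify(x_train, y_train, np.array([1.5, 1.5]), k=3)
label_near_b = knn_classify(x_train, y_train, np.array([6.5, 6.5]), k=3)
print(label_near_a, label_near_b)
```

Each query sits inside one cluster, so its three nearest neighbors all carry that cluster's label.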

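Keep the following points in mind when coding KNN:

· The distances between samples must be computed; Euclidean distance, Manhattan distance, and other metrics can be used.
· KNN bases its prediction only on the nearest neighbors, so a suitable value of K must be chosen: a small K is sensitive to noise, while a large K blurs the boundaries between classes.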
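
To make both points concrete, here is a rough sketch that compares the two distance metrics on one pair of points and then scores several values of K by leave-one-out accuracy. The helper names (`euclidean`, `manhattan`, `loo_accuracy`) and the toy data are hypothetical, and leave-one-out validation is just one reasonable way to choose K, not something this article prescribes.

```python
import numpy as np
from collections import Counter

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def loo_accuracy(x, y, k):
    """Leave-one-out accuracy of KNN (Euclidean distance) for a given k."""
    correct = 0
    for i in range(len(x)):
        distances = np.sqrt(np.sum((x - x[i]) ** 2, axis=1))
        distances[i] = np.inf  # exclude the held-out sample itself
        nearest = np.argsort(distances)[:k]
        vote = Counter(y[nearest].tolist()).most_common(1)[0][0]
        correct += int(vote == y[i])
    return correct / len(x)

# The same pair of points under the two metrics.
p, q = np.array([0.0, 0.0]), np.array([3.0, 4.0])
d_euclid = euclidean(p, q)   # sqrt(3^2 + 4^2) = 5
d_manhat = manhattan(p, q)   # |3| + |4| = 7

# Hypothetical toy data: two clusters of three points each.
x = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [6.0, 6.0], [6.0, 7.0], [7.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

scores = {k: loo_accuracy(x, y, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(d_euclid, d_manhat, scores, best_k)
```

On this toy set K = 5 fails completely, because each held-out point is outvoted by the three points of the other cluster. That is the kind of effect a validation score makes visible when K is too large relative to the class sizes.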

Here is the Python implementation:

```python
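import numpy as np
from collections import Counter

class KNN:
    def __init__(self, k=5, distance_method='euclidean'):
        # Reject unsupported metrics up front instead of failing inside predict().
        if distance_method not in ('euclidean', 'manhattan'):
            raise ValueError(f"unknown distance_method: {distance_method!r}")
        self.k = k
        self.distance_method = distance_method

    def fit(self, x_train, y_train):
        # KNN is a lazy learner: "fitting" only stores the training data.
        # Converting to arrays lets predict() index the labels with an index array.
        self.x_train = np.asarray(x_train, dtype=float)
        self.y_train = np.asarray(y_train)

    def predict(self, x_test):
        predictions = []
        for test_sample in np.asarray(x_test, dtype=float):
            # Distance from this test sample to every training sample at once.
            diff = self.x_train - test_sample
            if self.distance_method == 'euclidean':
                distances = np.sqrt(np.sum(diff ** 2, axis=1))
            else:  # 'manhattan'
                distances = np.sum(np.abs(diff), axis=1)
            # Sort by distance and keep the indices of the k nearest samples.
            indices = np.argsort(distances)[:self.k]
            k_nearest_classes = self.y_train[indices]
            # Majority vote among the k nearest neighbors.
            most_common_class = Counter(k_nearest_classes).most_common(1)
            predictions.append(most_common_class[0][0])
        return predictions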
```

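In the code, we first define the KNN class, which is initialized with a value of k and a distance method. The fit method stores the training data and training labels. The predict method computes the distance between each test sample and the training samples, sorts by distance, and selects the k nearest neighbors. Finally, a Counter object tallies the classes among those neighbors, and the most frequent class becomes the prediction for the test sample.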

Decision Trees

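A decision tree is a supervised learning algorithm built on a tree structure. It builds a classification model by repeatedly splitting the dataset into subsets. Each internal node tests a feature, and each leaf node represents a class. Learning starts at the root: the samples at a node are assigned to child nodes according to the feature tested there, and the process recurses until all samples at a node belong to the same class or a preset stopping condition is reached.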
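
To illustrate how a sample travels from the root to a leaf, here is a small sketch with a hand-built tree stored as nested dictionaries. The dictionary layout, the `classify` function, and the feature names and thresholds are all made up for illustration; they are not the representation used in this article's implementation.

```python
def classify(tree, sample):
    """Route a sample from the root to a leaf and return the leaf's class."""
    node = tree
    # Internal nodes carry a test; leaves carry only a 'label'.
    while 'label' not in node:
        if sample[node['feature']] < node['threshold']:
            node = node['left']
        else:
            node = node['right']
    return node['label']

# Hypothetical tree: test temperature first, then humidity.
tree = {
    'feature': 'temperature', 'threshold': 20.0,
    'left': {'label': 'stay in'},
    'right': {
        'feature': 'humidity', 'threshold': 70.0,
        'left': {'label': 'go out'},
        'right': {'label': 'stay in'},
    },
}

cold_day = classify(tree, {'temperature': 12.0, 'humidity': 40.0})
mild_dry_day = classify(tree, {'temperature': 25.0, 'humidity': 50.0})
mild_humid_day = classify(tree, {'temperature': 25.0, 'humidity': 85.0})
print(cold_day, mild_dry_day, mild_humid_day)
```

The cold day is decided at the root, while the two mild days need a second test on humidity before they reach a leaf.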

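A decision tree is built in the following steps:

1. Select the best feature on which to split the dataset.
2. Split the dataset on that feature so that the classes within each subset are as uniform as possible.
3. Build the tree recursively on each subset.
4. Stop when all samples at the current node belong to the same class, or when the node cannot be split any further.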

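Keep the following points in mind when coding a decision tree:

· A suitable split criterion must be chosen, such as information gain, information gain ratio, or Gini impurity.
· Continuous features need special handling, typically by splitting on a threshold.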
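
As a rough sketch of both points, the code below computes entropy and information gain, and handles a continuous feature by trying the midpoints between consecutive sorted values as candidate thresholds. The function names and the toy data are hypothetical, and midpoint thresholds are one common convention rather than the only option.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return -np.sum(p * np.log2(p))

def information_gain(y, left_mask):
    """Entropy of y minus the size-weighted entropy of the two subsets."""
    left, right = y[left_mask], y[~left_mask]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    return entropy(y) - weighted

def best_threshold(feature, y):
    """Try midpoints between consecutive distinct values; return the best one.

    Assumes the feature has at least two distinct values.
    """
    values = np.unique(feature)                  # sorted distinct values
    candidates = (values[:-1] + values[1:]) / 2  # midpoints keep both sides non-empty
    gains = [information_gain(y, feature < t) for t in candidates]
    best = int(np.argmax(gains))
    return candidates[best], gains[best]

# Hypothetical continuous feature with a clean class boundary between 3 and 7.
feature = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
y = np.array([0, 0, 0, 1, 1, 1])

threshold, gain = best_threshold(feature, y)
print(threshold, gain)
```

Here the midpoint 5.0 separates the two classes perfectly, so the gain equals the full entropy of the labels (1 bit for two balanced classes).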

Here is the Python implementation:

```python
import numpy as np
from collections import Counter

class Node:
    def __init__(self, feature=None, threshold=None, pred=None, left=None, right=None):
        self.feature = feature      # index of the feature tested here (None for a leaf)
        self.threshold = threshold  # samples with x[feature] < threshold go left
        self.pred = pred            # predicted class (leaf nodes only)
        self.left = left
        self.right = right

class DecisionTree:
    def __init__(self, impurity='gini', max_depth=None):
        if impurity not in ('gini', 'entropy'):
            raise ValueError(f"unknown impurity: {impurity!r}")
        self.impurity = impurity
        self.max_depth = max_depth

    def _gini(self, y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / len(y)
        return 1 - np.sum(p ** 2)

    def _entropy(self, y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / len(y)
        return -np.sum(p * np.log2(p))

    def _select_split(self, x, y):
        # Return the (feature index, threshold) with the largest impurity
        # decrease, or (None, None) if no feature can separate the samples.
        impurity_func = self._gini if self.impurity == 'gini' else self._entropy
        base_score = impurity_func(y)
        best_gain, best_feature, best_threshold = -np.inf, None, None
        for i in range(x.shape[1]):
            # Skip the smallest value: it would leave the left subset empty.
            for value in np.unique(x[:, i])[1:]:
                left_idx = x[:, i] < value
                left_y, right_y = y[left_idx], y[~left_idx]
                weighted = (impurity_func(left_y) * len(left_y)
                            + impurity_func(right_y) * len(right_y)) / len(y)
                gain = base_score - weighted
                if gain > best_gain:
                    best_gain, best_feature, best_threshold = gain, i, value
        return best_feature, best_threshold

    def _build_tree(self, x, y, depth=0):
        # Stop: every sample at this node has the same class.
        if len(np.unique(y)) == 1:
            return Node(pred=y[0])
        feature, threshold = None, None
        if self.max_depth is None or depth < self.max_depth:
            feature, threshold = self._select_split(x, y)
        if feature is None:
            # Depth limit reached or no valid split: majority-class leaf.
            return Node(pred=Counter(y).most_common(1)[0][0])
        left_idx = x[:, feature] < threshold
        left = self._build_tree(x[left_idx], y[left_idx], depth + 1)
        right = self._build_tree(x[~left_idx], y[~left_idx], depth + 1)
        return Node(feature=feature, threshold=threshold, left=left, right=right)

    def fit(self, x_train, y_train):
        self.tree = self._build_tree(np.asarray(x_train), np.asarray(y_train))

    def predict(self, x_test):
        predictions = []
        for sample in np.asarray(x_test):
            node = self.tree
            # Walk down from the root until a leaf is reached.
            while node.feature is not None:
                node = node.left if sample[node.feature] < node.threshold else node.right
            predictions.append(node.pred)
        return predictions
```

In the code, we first define the Node class, which represents a node of the decision tree. Its feature attribute is the index of the feature tested at the node, threshold is the value that feature is compared against, pred is the predicted class of a leaf node, and left and right are the node's left and right children.

We then define the DecisionTree class, which has two attributes: impurity selects the split criterion, and max_depth limits the depth of the tree. The fit method builds the tree recursively through _build_tree, which uses _select_split to find the feature and threshold with the largest impurity decrease. The predict method walks each test sample from the root down to a leaf and returns that leaf's predicted class.

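Conclusion

KNN and decision trees are among the most commonly used machine learning algorithms and are essential skills for data analysis work. Python is one of the most popular programming languages for machine learning, and implementing these algorithms in it makes them easier to learn and use. We hope this article has helped you understand KNN and decision trees.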