Getting both the label and the probability from a sklearn random forest model

A trained sklearn random forest model (RandomForestClassifier) exposes two main prediction methods, predict and predict_proba.

Suppose our dataset has six classes, A through F.
The predict function returns the predicted labels, for example:

array(['E', 'F', 'B', ..., 'C', 'C', 'C'], dtype=object)

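The predict_proba function returns the probability of each class: one row per sample, each row holding 6 values (one per class), for example: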

array(
  [
    [1.85903433e-02, 9.35975346e-06, 2.69576464e-02, 2.94020479e-02,
        8.54120416e-01, 7.09201869e-02],
    [4.43988224e-03, 0.00000000e+00, 2.83222005e-03, 3.47528652e-03,
        1.68512905e-02, 9.72401321e-01]
  ]
)
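To make the difference between the two return values concrete, here is a minimal sketch on synthetic data. The dataset from make_classification, the letter labels, and the hyperparameters are assumptions chosen for illustration only; they are not the data used in this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 6-class data (an assumption, standing in for the real dataset)
X, y_int = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=6, random_state=0,
)
y = np.array(list("ABCDEF"))[y_int]  # map 0..5 to 'A'..'F'

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

labels = model.predict(X[:5])        # one label per sample
proba = model.predict_proba(X[:5])   # one row per sample, one column per class

print(labels.shape)         # (5,)
print(proba.shape)          # (5, 6)
print(proba.sum(axis=1))    # each row sums to 1 (up to floating point)
```

Each row of the predict_proba result is a probability distribution over the classes, which is why the rows sum to 1.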

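So how do we get the label and the probability at the same time? In production you may well need both together. One workable approach is to call predict_proba to get the probability of each label, find the index of the largest probability in each row, and then map that index back to the actual label. According to the documentation, the class labels are stored in an attribute named "classes_", in the same order as the columns returned by predict_proba: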

classes_: ndarray of shape (n_classes,) or a list of such arrays
The classes labels (single output problem), or a list of arrays of class labels (multi-output problem).

The core code snippet is as follows:

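import numpy as np

# model: a fitted RandomForestClassifier; df: the feature matrix to score
classes = model.classes_                  # labels, in predict_proba column order
probabilities = model.predict_proba(df)   # shape (n_samples, n_classes)
index = np.argmax(probabilities, axis=1)  # column of the largest probability per row
for i, j in enumerate(index):
    label = classes[j]
    prob = probabilities[i][j]
    print(label, prob)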

The output looks like this:
E 0.8541204157531124
F 0.9724013207285382
B 0.5046077566187441
B 0.5745286467331606
E 0.713419214014316
A 0.6133339121442233
F 0.9724013207285382
C 0.8217302482178279
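The loop above can also be written without Python-level iteration by using NumPy indexing. The following is a sketch under assumptions: predict_with_confidence is a hypothetical helper name (not part of sklearn), and the synthetic data stands in for the real dataset. It also checks that the labels recovered this way agree with what predict returns for a single-output classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def predict_with_confidence(model, X):
    """Hypothetical helper: return (labels, top probability) for each row of X."""
    proba = model.predict_proba(X)               # shape (n_samples, n_classes)
    idx = np.argmax(proba, axis=1)               # best column per row
    labels = model.classes_[idx]                 # map column index to label
    confidence = proba[np.arange(len(idx)), idx] # probability of that label
    return labels, confidence


# Synthetic 6-class data (an assumption, for illustration only)
X, y_int = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=6, random_state=0,
)
y = np.array(list("ABCDEF"))[y_int]
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

labels, confidence = predict_with_confidence(model, X[:8])
for label, prob in zip(labels, confidence):
    print(label, prob)

# The recovered labels should match predict on the same input
print(np.array_equal(labels, model.predict(X[:8])))
```

This needs only one call to predict_proba, so the forest is evaluated once rather than once for predict and again for predict_proba.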

Related official documentation