An Analysis of the CountVectorizer() Class

zz22-- 2021-09-07 Original post

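The following links are the main references:

1. sklearn text feature extraction

2. Computing term weights with scikit-learn tfidf

3. Official sklearn documentation (Chinese)

4. sklearn.feature_extraction.text.CountVectorizer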

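One addition: how to use the transform() method of the CountVectorizer() class

Its main job is to convert new text into a feature matrix, but only over features that have already been fixed. That feature set is determined by the corpus passed to an earlier fit_transform() call. See the example: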

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vec = CountVectorizer()
>>> vec.transform(['Something completely new.']).toarray()

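This raises an error: sklearn.exceptions.NotFittedError: CountVectorizer - Vocabulary wasn't fitted. It means there is no vocabulary yet, so the text cannot be converted. Put simply, because the vocabulary has not been built, there are no column indices against which to count the words in the text.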
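If a script may reach transform() before the vectorizer has been fitted, the error can be caught explicitly. Below is a minimal sketch, assuming scikit-learn is installed; the helper name safe_transform is hypothetical and exists only for this illustration.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer


def safe_transform(vectorizer, texts):
    """Return the dense count matrix, or None if no vocabulary exists yet.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    try:
        return vectorizer.transform(texts).toarray()
    except NotFittedError:
        # transform() was called before fit() / fit_transform()
        return None


unfitted = CountVectorizer()
result = safe_transform(unfitted, ['Something completely new.'])
print(result)  # None
```

Returning None is just one possible choice here; in real code it may be preferable to let the exception propagate so the missing fit step is noticed early.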

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vec.fit_transform(corpus)  # learn the vocabulary and build the count matrix
X.toarray()                    # dense view of the sparse matrix

The vocabulary list:

>>> vec.get_feature_names()  # removed in scikit-learn 1.2; use get_feature_names_out() there
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

The resulting sparse matrix, shown as a dense array via toarray(), is:

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)

Once the vocabulary has been built, transform() can be used to turn new text into a matrix:

>>> vec.transform(['this is']).toarray()
array([[0, 0, 0, 1, 0, 0, 0, 0, 1]], dtype=int64)
>>> vec.transform(['too bad']).toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

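A brief analysis: both words of 'this is' are in the vocabulary, so their counts are placed in the corresponding columns of the matrix. Neither word of 'too bad' is in the vocabulary, so every entry in its row is 0.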
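To make the mechanics concrete, here is a rough pure-Python sketch of the idea: fitting fixes a word-to-column mapping, and transforming counts only the words found in that mapping. It is a simplification for illustration, not scikit-learn's actual implementation; the function names fit_vocabulary and transform are made up here, and the tokenizer (lowercasing, words of two or more characters) only approximates CountVectorizer's default behavior.

```python
import re
from collections import Counter

# Words of 2+ characters, approximating CountVectorizer's default token pattern
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN.findall(text.lower())


def fit_vocabulary(corpus):
    """Collect all tokens in the corpus and assign column indices in sorted order."""
    words = sorted({tok for doc in corpus for tok in tokenize(doc)})
    return {word: idx for idx, word in enumerate(words)}


def transform(vocabulary, texts):
    """Count tokens per text, using only the columns fixed by the vocabulary."""
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for word, n in Counter(tokenize(text)).items():
            if word in vocabulary:  # words outside the vocabulary are ignored
                row[vocabulary[word]] = n
        rows.append(row)
    return rows


corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vocab = fit_vocabulary(corpus)
print(transform(vocab, ['this is']))  # [[0, 0, 0, 1, 0, 0, 0, 0, 1]]
print(transform(vocab, ['too bad']))  # [[0, 0, 0, 0, 0, 0, 0, 0, 0]]
```

The sketch returns plain lists instead of a sparse matrix, which keeps it short; the real class stores the counts sparsely because most entries are zero for a large vocabulary.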