Corrections to the K-means clustering example in the lecture notes

Step 1: Download the test data (SGML format):
http://www.daviddlewis.com/resou ... reuters21578.tar.gz

Step 2: Extract the data (the commands below expect the archive at mahout-work/reuters21578.tar.gz)
$ mkdir -p mahout-work/reuters-sgm
$ cd mahout-work/reuters-sgm && tar xzf ../reuters21578.tar.gz && cd .. && cd ..
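Before moving on to step 3, it can help to confirm that the archive really unpacked into mahout-work/reuters-sgm. The following is a minimal sketch, not part of Mahout: count_sgm is a hypothetical helper name, and it assumes the data sits on the local filesystem. The Reuters-21578 distribution is documented as shipping 22 .sgm files, so a much smaller count probably means the extraction failed.

```shell
#!/bin/sh
# count_sgm DIR
# Hypothetical helper (not a Mahout command): prints how many *.sgm files
# sit directly inside DIR. Prints 0 if DIR does not exist.
count_sgm() {
    dir="$1"
    # find prints one line per matching file; wc -l counts those lines.
    # tr strips the padding that some wc implementations add.
    find "$dir" -maxdepth 1 -type f -name '*.sgm' 2>/dev/null | wc -l | tr -d ' '
}

# Usage, run from the directory that contains mahout-work:
#   count_sgm mahout-work/reuters-sgm
```

If the count is 0, check that reuters21578.tar.gz was saved inside mahout-work before step 2 was run.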

Step 3: Convert the SGML data to text files
$ bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters mahout-work/reuters-sgm mahout-work/reuters-out

Step 4: Convert the data to SequenceFile format
$ bin/mahout seqdirectory \
    -i mahout-work/reuters-out \
    -o mahout-work/reuters-out-seqdir \
    -c UTF-8 -chunk 5

Step 5: Create vectors and Sequence files
$ bin/mahout seq2sparse \
    -i mahout-work/reuters-out-seqdir \
    -o mahout-work/reuters-out-seqdir-sparse-kmeans


Step 6: Run K-means
$ bin/mahout kmeans \
    -i mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
    -c mahout-work/reuters-kmeans-clusters \
    -o mahout-work/reuters-kmeans \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 \
    -x 10 -k 20 -ow
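Step 6 selects CosineDistanceMeasure with -dm, so documents are compared by the angle between their TF-IDF vectors rather than by vector length. As a rough illustration of what that measure computes, here is a small sketch of the textbook formula, 1 - (a·b) / (|a| |b|), using awk. cosine_distance is a hypothetical helper written for this note; it is not Mahout's code and works on dense, space-separated vectors only, whereas Mahout operates on sparse vectors.

```shell
#!/bin/sh
# cosine_distance "a1 a2 ..." "b1 b2 ..."
# Hypothetical helper (not Mahout's implementation): prints the textbook
# cosine distance 1 - dot(a,b) / (|a| * |b|) with four decimal places.
# Returns non-zero if the lengths differ or either vector is all zeros.
cosine_distance() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, " ")
        m = split(b, y, " ")
        if (n != m) { print "length mismatch" > "/dev/stderr"; exit 1 }
        for (i = 1; i <= n; i++) {
            dot += x[i] * y[i]   # accumulate the dot product
            na  += x[i] * x[i]   # squared norm of a
            nb  += y[i] * y[i]   # squared norm of b
        }
        if (na == 0 || nb == 0) { print "zero vector" > "/dev/stderr"; exit 1 }
        printf "%.4f\n", 1 - dot / (sqrt(na) * sqrt(nb))
    }'
}

cosine_distance "1 0" "0 1"   # orthogonal vectors: prints 1.0000
cosine_distance "1 0" "1 0"   # identical vectors:  prints 0.0000
```

A distance near 0 means two documents use terms in similar proportions, and a distance near 1 means they share almost no weighted terms.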
