Hadoop Streaming running a Python NLTK script to extract keywords from files is extremely slow

I wrote a Python program that uses NLTK to process the data files on HDFS step by step: tokenization, lowercasing, stopword removal, lemmatization, POS tagging, and finally keeping only the nouns. Running it with Hadoop Streaming, or with Hive TRANSFORM (a Python UDF in Hive), turns out to be extremely slow. Processing a few thousand rows is no problem, but a job over roughly 5 million records has been running for more than an hour without finishing. The code is as follows:
#!/home/hadoop/python2.7.6/bin/env python
# encoding: utf-8
'''
Created on 2017-07-14
@author: admin
'''
import os
os.environ['NLTK_DATA'] = "/home/hadoop/nltk_data"
import nltk,re,sys
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# title_list=[
# "women Silicone Adhesive Stick on Gel Push-Up bras Backless Strapless Drawstring Corset Invisible bra ",
#"pants the hot led county fashion color new high ,Women Underwear Ba(bydoll Lingerie Dress Sleepwear Lace Bra G-string Set  2017 ( ",
# "Novelty Candy Color Kitchen Tools Heat Resistant Silicone Put A Spoon Mat Insulation Mat Placemat",
# "Silicone Spoon Holder Heat Resistant Kitchen Utensil Spatula Holder Cooking Tool",
# "Novelty Candy Color Kitchen Tools Heat Resistant Silicone Put A Spoon Mat Insulation Mat Placemat (Size: 1) (Size: 1)",
# "Silicone Heat Resistant Spoon Fork Mat Rest Utensil Spatula Holder Kitchen Tool (Size: One Size)",
# "Kitchen Silicone Spoon Rest Heat Resistant Non-stick Silicone Cooking Tools Mat",
# ]

raw_nn=["lipstick"]    
#rex=re.compile(u"\\(|\\)|\\:|\\!|\\,|\\/|[\\\u4e00-\\\u9fa5]+")
rex=re.compile(u"[a-zA-Z0-9]{2,}")
for line in sys.stdin:
    line_list=line.strip().split("\001")

    if len(line_list)==1:
        print "\t".join([line_list[0],""])
        continue 
           
    itemid=line_list[0]
    itename=" ".join(line_list[1:])
    
    #titlename=re.subn(rex,'',itename.decode("utf-8"))[0]
    titlename=" ".join(re.findall(rex,itename))
    if not titlename:
        print "\t".join([itemid,""])
        continue
    key_word=nltk.word_tokenize(titlename)
    key_word=[word.lower() for word in set(key_word) if len(word)>=2]
    wordnet_lematizer = WordNetLemmatizer()
    words = [wordnet_lematizer.lemmatize(raw_word) for raw_word in key_word]
    filtered_words = [word for word in words if word not in stopwords.words('english')]
    pos_result = nltk.pos_tag(filtered_words)
    key_word_list=[]
    pos=['NN']
    for tup in pos_result:
        word = tup[0]
        pos_word = tup[1]
        if pos_word in pos or word in raw_nn:
            key_word_list.append(word)
    #print "有效词:",list(set(key_word_list))
    key_word_lists=list(set(key_word_list))
    print "\t".join([itemid,str(" ".join(key_word_lists))])

    
It also depends on your cluster configuration.

fish - Hadooper

Upvoted by: 开心就好_kxjh

How many map and reduce tasks were launched? And at which stage is the job slow?

开心就好_kxjh

Upvoted by: fish

Hi, I've found the solution: I set the number of map tasks (via the split size) and the job finished processing quickly.
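
For reference, the split-size knob is usually passed as a -D property on the streaming command line; the jar path, split size, script name, and HDFS paths below are placeholders, not values from this thread:

# illustrative invocation; capping the split size at ~16 MB forces more map tasks over the same input
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapreduce.input.fileinputformat.split.maxsize=16777216 \
    -D mapreduce.job.reduces=0 \
    -files keyword_mapper.py \
    -mapper "python keyword_mapper.py" \
    -input /path/to/input \
    -output /path/to/output

On older clusters the same property goes by its deprecated name, mapred.max.split.size.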
