Mining DBLP Author Collaborations, FP-Growth in Practice (2): From the DBLP Dataset
Published: 2021-05-25 16:56:01  Category: Big Data  Source: compiled from the web
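Overview: Previous article: http://www.voidcn.com/article/p-nsbrwwsu-zv.html (Mining DBLP Author Collaborations, FP-Growth in Practice (1): Extracting Target Information (Conferences, Authors, etc.) from the DBLP Dataset). Readers reported that the code was unusable, mainly because it was too slow. Fair enough, I admit it is slow: it builds a tree in memory, so of course it is!
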
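import re

# fromYear and confNameDict are globals that this snippet does not define; set them before calling.
# fromYear must be a 4-character year string (years are compared as strings), and
# confNameDict must be keyed by the conference names to keep.
AUTHOR_RE = re.compile(r'<author>(.*)</author>', re.S)

def XmlLineParser(fileName):
    rf = open(fileName, "r")
    for line in rf:
        if line.startswith("<inproceedings"):
            # print("line [1]", line)  # debug output
            booktitle = ""
            year = ""
            title = ""
            authorList = ""
            # Keep consuming the same file iterator until this record ends.
            for line in rf:
                # print("line [2]", line)  # debug output
                if line.startswith("<author"):
                    authorList += line
                elif line.startswith("<title"):
                    title = line
                elif line.startswith("<year"):
                    year = line[6:10]
                    if year < fromYear:
                        break  # paper is too old, skip this record
                elif line.startswith("<booktitle"):
                    booktitle = line[11:].split("</")[0].split(" ")[0]
                    if booktitle not in confNameDict:
                        break  # not a target conference, skip this record
                elif line.startswith("</inproceedings"):
                    # Output format: confName \t year \t title \t author1|author2|...|authorN
                    localTran = booktitle + "\t" + year + "\t" + title[7:].split("</")[0] + "\t"
                    for authorLine in authorList.split("\n"):
                        for author in AUTHOR_RE.findall(authorLine):
                            localTran += author + "|"
                    wf = open("tranDB.txt", "a")
                    wf.write(localTran[:-1] + "\n")  # remove last "|"
                    wf.close()
                    break  # do not forget: go back to the outer loop for the next record
    rf.close()

To use it, just call: XmlLineParser(fileName). Help yourself, no thanks needed. The code is a bit rough, so please bear with me.

(Editor: 扬州站长网)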
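
The parser above appends one paper per line to tranDB.txt in the form confName, year, title, and author1|author2|...|authorN, separated by tabs. As a minimal sketch of how those lines might be read back into per-paper author lists (the transactions an FP-Growth step would consume), something like the following could work. The helper names parse_tran_line and load_transactions are hypothetical and do not come from the original article, and the sketch assumes that titles contain no tab characters.

```python
def parse_tran_line(line):
    """Split one tranDB.txt line into (conf, year, title, authors)."""
    # Hypothetical helper: assumes four tab-separated fields, with the
    # author names in the last field joined by "|".
    conf, year, title, author_field = line.rstrip("\n").split("\t", 3)
    authors = [a for a in author_field.split("|") if a]
    return conf, year, title, authors


def load_transactions(path="tranDB.txt"):
    """Return one list of author names per paper, skipping malformed lines."""
    transactions = []
    with open(path, "r") as f:
        for line in f:
            # A record written without any author has fewer than three tabs.
            if line.count("\t") < 3:
                continue
            authors = parse_tran_line(line)[3]
            if authors:
                transactions.append(authors)
    return transactions
```

Calling load_transactions() after XmlLineParser(fileName) has finished would give a list of author lists, one per paper, which is the shape a frequent-itemset miner typically expects.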