Finding Similar Items: Algorithms for Computing Text Similarity (Machine Learning, Vector Space Cosine, NLTK, diff, Levenshtein Distance, ...)
Published: 2019-06-25


http://infolab.stanford.edu/~ullman/mmds/ch3.pdf gathers much of this material; see also the book http://www-nlp.stanford.edu/IR-book/, which covers the vector space model, SVMs, and more.

http://pages.cs.wisc.edu/~dbbook/openAccess/thirdEdition/slides/slides3ed-english/Ch27b_ir2-vectorspace-95.pdf is devoted specifically to the vector space model.

https://courses.cs.washington.edu/courses/cse573/12sp/lectures/17-ir.pdf also mentions other approaches, apparently statistical models similar to those used in speech recognition.

On using deep learning for document similarity: https://cs224d.stanford.edu/reports/PoulosJackson.pdf and also http://www.cms.waikato.ac.nz/~ml/publications/2012/JASIST2012.pdf

An online tool that compares text similarity directly: http://www.scurtu.it/documentSimilarity.html

This thread collects a number of answers: http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents, including approaches based on the NLTK NLP library, diff, or scikit-learn's vector space model plus cosine similarity.

http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene also shows how to compute cosine similarity.

Cosine similarity in Lucene 3: https://darakpanand.wordpress.com/2013/06/01/document-comparison-by-cosine-methodology-using-lucene/#more-53 (note: the calculation differs between Lucene 3 and Lucene 4).

The vector space model (http://stackoverflow.com/questions/10649898/better-way-of-calculating-document-similarity-using-lucene):

Once you've got your data components properly standardized, then you can worry about what's better: fuzzy match, Levenshtein distance, or cosine similarity (etc.)
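For reference, the Levenshtein (edit) distance mentioned there is usually computed with the standard dynamic-programming recurrence; here is a minimal Java sketch (my own illustration, not from the linked thread):

static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;      // distance from the empty prefix of a
    for (int i = 1; i <= a.length(); i++) {
        curr[0] = i;                                        // distance to the empty prefix of b
        for (int j = 1; j <= b.length(); j++) {
            int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
            curr[j] = Math.min(Math.min(curr[j - 1] + 1,    // insertion
                                        prev[j] + 1),       // deletion
                               prev[j - 1] + cost);         // substitution or match
        }
        int[] tmp = prev; prev = curr; curr = tmp;          // reuse the two rows
    }
    return prev[b.length()];
}

For example, levenshtein("kitten", "sitting") returns 3 (substitute k->s, e->i, insert g).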

As I told you in my comment, I think you made a mistake somewhere. The vectors actually contain <word, frequency> pairs, not words only. Therefore, when you delete the sentence, only the frequencies of the words in that sentence are reduced (the remaining entries are not shifted). Consider the following example:

Document a:

A B C A A B C. D D E A B. D A B C B A.

Document b:

A B C A A B C. D A B C B A.

Vector a:

A:6, B:5, C:3, D:3, E:1

Vector b:

A:5, B:4, C:3, D:1, E:0

Which results in the following similarity measure:

(6*5 + 5*4 + 3*3 + 3*1 + 1*0) / (Sqrt(6^2+5^2+3^2+3^2+1^2) * Sqrt(5^2+4^2+3^2+1^2+0^2)) = 62 / (8.94427 * 7.14143) = 0.970648
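As a quick check of the arithmetic above, here is a small Java helper (my own illustration) that computes cosine similarity directly from two <word, frequency> maps:

import java.util.*;

static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    Set<String> vocab = new HashSet<String>(a.keySet());
    vocab.addAll(b.keySet());                         // union of both vocabularies
    double dot = 0, normA = 0, normB = 0;
    for (String term : vocab) {
        int fa = a.containsKey(term) ? a.get(term) : 0;
        int fb = b.containsKey(term) ? b.get(term) : 0;
        dot   += fa * fb;                             // numerator: sum of frequency products
        normA += fa * fa;                             // squared Euclidean length of vector a
        normB += fb * fb;                             // squared Euclidean length of vector b
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

Feeding it the two vectors above (A:6, B:5, C:3, D:3, E:1 and A:5, B:4, C:3, D:1, E:0) gives roughly 0.970648, matching the hand calculation.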

 

MoreLikeThis in Lucene:

You may want to check the MoreLikeThis feature of Lucene.

MoreLikeThis constructs a Lucene query based on terms within a document to find other similar documents in the index.

Sample code example (Java reference):

MoreLikeThis mlt = new MoreLikeThis(reader);           // pass the index reader
mlt.setFieldNames(new String[] {"title", "author"});   // specify the fields for similarity
Query query = mlt.like(docID);                         // pass the doc id
TopDocs similarDocs = searcher.search(query, 10);      // use the searcher
if (similarDocs.totalHits == 0) {
    // do handling
}

 

http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene asks:

I have built an index in Lucene. Without specifying a query, I just want to get a score (cosine similarity or another distance?) between two documents in the index.

For example, from a previously opened IndexReader ir I am getting the documents with ids 2 and 4:

Document d1 = ir.document(2); Document d2 = ir.document(4);

How can I get the cosine similarity between these two documents?

Thank you

 

When indexing, there's an option to store term frequency vectors.

During runtime, look up the term frequency vectors for both documents using IndexReader.getTermFreqVector(), and look up document frequency data for each term using IndexReader.docFreq(). That will give you all the components necessary to calculate the cosine similarity between the two docs.
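A rough sketch of that recipe against the Lucene 3.x API (my own illustration, not from the answer; it assumes the field was indexed with term vectors and uses a simple tf*idf weighting):

import java.io.IOException;
import java.util.*;
import org.apache.lucene.index.*;

static double cosineSimilarity(IndexReader reader, int docA, int docB, String field) throws IOException {
    Map<String, Double> a = tfidfVector(reader, docA, field);
    Map<String, Double> b = tfidfVector(reader, docB, field);
    double dot = 0, na = 0, nb = 0;
    for (double w : a.values()) na += w * w;
    for (double w : b.values()) nb += w * w;
    for (Map.Entry<String, Double> e : a.entrySet()) {
        Double w = b.get(e.getKey());
        if (w != null) dot += e.getValue() * w;       // only shared terms contribute
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

static Map<String, Double> tfidfVector(IndexReader reader, int docId, String field) throws IOException {
    TermFreqVector tfv = reader.getTermFreqVector(docId, field);  // requires term vectors stored at index time
    String[] terms = tfv.getTerms();
    int[] freqs = tfv.getTermFrequencies();
    Map<String, Double> vector = new HashMap<String, Double>();
    int numDocs = reader.numDocs();
    for (int i = 0; i < terms.length; i++) {
        int df = reader.docFreq(new Term(field, terms[i]));       // document frequency of the term
        double idf = Math.log((double) numDocs / (1 + df));       // one common idf variant
        vector.put(terms[i], freqs[i] * idf);                     // tf * idf weight
    }
    return vector;
}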

An easier way might be to submit doc A as a query (adding all words to the query as OR terms, boosting each by term frequency) and look for doc B in the result set.
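A hedged sketch of that alternative against the Lucene 4.x API (my own illustration; the field name and the already-open reader and searcher are assumptions):

import java.io.IOException;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.util.BytesRef;

static TopDocs queryByDocument(IndexReader reader, IndexSearcher searcher, int docIdA, String field) throws IOException {
    Terms termVector = reader.getTermVector(docIdA, field);   // term vector of doc A (must be stored)
    TermsEnum termsEnum = termVector.iterator(null);
    BooleanQuery query = new BooleanQuery();
    BytesRef text;
    while ((text = termsEnum.next()) != null) {
        TermQuery tq = new TermQuery(new Term(field, text.utf8ToString()));
        tq.setBoost(termsEnum.totalTermFreq());               // boost each term by its frequency in doc A
        query.add(tq, BooleanClause.Occur.SHOULD);            // OR semantics
    }
    return searcher.search(query, 10);                        // scan the top hits for doc B
}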

 

 


As Julia points out, the approach above is very useful, but the Lucene 4 API has substantial changes. Here is a version rewritten for Lucene 4.

import java.io.IOException;
import java.util.*;

import org.apache.commons.math3.linear.*;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.*;

public class CosineDocumentSimilarity {

    public static final String CONTENT = "Content";

    private final Set<String> terms = new HashSet<>();
    private final RealVector v1;
    private final RealVector v2;

    CosineDocumentSimilarity(String s1, String s2) throws IOException {
        Directory directory = createIndex(s1, s2);
        IndexReader reader = DirectoryReader.open(directory);
        Map<String, Integer> f1 = getTermFrequencies(reader, 0);
        Map<String, Integer> f2 = getTermFrequencies(reader, 1);
        reader.close();
        v1 = toRealVector(f1);
        v2 = toRealVector(f2);
    }

    Directory createIndex(String s1, String s2) throws IOException {
        Directory directory = new RAMDirectory();
        Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_CURRENT);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
        IndexWriter writer = new IndexWriter(directory, iwc);
        addDocument(writer, s1);
        addDocument(writer, s2);
        writer.close();
        return directory;
    }

    /* Indexed, tokenized, stored. */
    public static final FieldType TYPE_STORED = new FieldType();

    static {
        TYPE_STORED.setIndexed(true);
        TYPE_STORED.setTokenized(true);
        TYPE_STORED.setStored(true);
        TYPE_STORED.setStoreTermVectors(true);
        TYPE_STORED.setStoreTermVectorPositions(true);
        TYPE_STORED.freeze();
    }

    void addDocument(IndexWriter writer, String content) throws IOException {
        Document doc = new Document();
        Field field = new Field(CONTENT, content, TYPE_STORED);
        doc.add(field);
        writer.addDocument(doc);
    }

    double getCosineSimilarity() {
        return (v1.dotProduct(v2)) / (v1.getNorm() * v2.getNorm());
    }

    public static double getCosineSimilarity(String s1, String s2) throws IOException {
        return new CosineDocumentSimilarity(s1, s2).getCosineSimilarity();
    }

    Map<String, Integer> getTermFrequencies(IndexReader reader, int docId) throws IOException {
        Terms vector = reader.getTermVector(docId, CONTENT);
        TermsEnum termsEnum = null;
        termsEnum = vector.iterator(termsEnum);
        Map<String, Integer> frequencies = new HashMap<>();
        BytesRef text = null;
        while ((text = termsEnum.next()) != null) {
            String term = text.utf8ToString();
            int freq = (int) termsEnum.totalTermFreq();
            frequencies.put(term, freq);
            terms.add(term);
        }
        return frequencies;
    }

    RealVector toRealVector(Map<String, Integer> map) {
        RealVector vector = new ArrayRealVector(terms.size());
        int i = 0;
        for (String term : terms) {
            int value = map.containsKey(term) ? map.get(term) : 0;
            vector.setEntry(i++, value);
        }
        return (RealVector) vector.mapDivide(vector.getL1Norm());
    }
}
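A minimal usage sketch (my own addition; with the SimpleAnalyzer above, the two sample texts from the worked example should come out at roughly the same similarity as computed by hand):

public static void main(String[] args) throws IOException {
    double sim = CosineDocumentSimilarity.getCosineSimilarity(
            "A B C A A B C. D D E A B. D A B C B A.",
            "A B C A A B C. D A B C B A.");
    System.out.println(sim);   // expected to be close to 0.970648 for these two texts
}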


Articles you may also be interested in:

Website security testing: logic vulnerability detection and remediation
Dynamic where clauses with LINQ to SQL
Technology changes fast: how can programmers avoid becoming obsolete?
Spring Boot 2.x basics tutorial: quick start
How to use vi and vim
On the swap operation for C++ containers
Installing the MySQL database on CentOS
Python study notes (02)
Operating system fundamentals
My blogroll
Android development guidelines
Linux data encryption and CA certificate issuance
Fixing .NET Framework 4.0 initialization errors!!
JavaScript tutorial - DOM nodes and tree
[One article, one command] The route command explained
How to kill deadlocked processes in PostgreSQL
Basic metrics for web performance testing
Exploring the world of Perl: notes and examples for chapters 1-10
Linux security tip: restricting remote root login
My blogroll