[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804310#comment-16804310 ] Andy Hind commented on LUCENE-6968: --- [~mayyas], in answer to your questions: 1) Depends on your view - a lower default would probably make sense. The default of 512 is good for 3-10+ pages of A4 text - again with an "it depends" caveat 2) Query time is covered here https://issues.apache.org/jira/browse/SOLR-12879 You can probably get away with the default query time tokenisation in SOLR, post BM25. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili >Priority: Major > Fix For: 6.2, 7.0 > > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783921#comment-16783921 ] Mayya Sharipova commented on LUCENE-6968: - [~andyhind] Thanks very much for your answer, it made things more clear. I still have a couple of additional questions if you don't mind: 1) The default settings of the filter will produce for each document 512 tokens each of the size 16 bytes, that is approximately 8Kb. Isn't 8Kb too big of a size to be a document's signature? 2) What is the way to combine `min_hash` tokens to a query for similarity search? Do you have any examples? Is this a work in progress? Thanks again! > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili >Priority: Major > Fix For: 6.2, 7.0 > > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737192#comment-16737192 ] Andy Hind commented on LUCENE-6968: --- [~mayyas] Hi Mayya, there is a good review paper here [https://arxiv.org/pdf/1408.2927.pdf]. See sections 3.5.1 and 3.5.2 and related references. I have not found the specific comment about bias I was trying to locate. The handwaving view is that empty or missing hashes are biased for many to many comparisons. It is difficult to tune the hash parameters for a wide mix of doc sizes, and small documents in particular, as the number of hashes increases with doc size over some range. It is better to have some value rather than none. There is an argument about what value should be used but that is less important. Repetition is one way of filling in gaps and making the hash count consistent. For two small docs, there is going to be a bit of asymmetry in the measure whatever you do. In some cases, like containment, the bias may be a good thing :) Apologies for my slow response. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili >Priority: Major > Fix For: 6.2, 7.0 > > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706571#comment-16706571 ] Mayya Sharipova commented on LUCENE-6968: - [~andyhind] Hello Andy! I have several questions about the implementation of the *MinHashFilter*, and was wondering if you would be able to answer them. Thanks a lot in advance. The implementation from 1st original patch where the minimum set is kept is very clear to me, and follows the classic idea of constructing MinHash signature and LSH search after it. But I am having a hard time understanding the final implementation for MinHashFilter. 1) What constitutes the signature of a document? Are these all values stored in the hash table? Doesn't it make a signature too large? Can you please refer the paper that describes this way of constructing minhash signatures. 2) What is the use of {{withRotation}} parameter? What is the advantage of using {{withRotation=true}}? In the paper you cited: [http://www.auai.org/uai2014/proceedings/individuals/225.pdf], they fill empty bins with "value of the closest non-empty bin in the clockwise direction (circular right hand side) added *with offset C*". In the {{MinHashFilter}} implementation values for empty buckets are just blindly copied from non-empty ones, so a lot of buckets with have the same value. Hopefully the questions make sense. Thanks again in advance. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili >Priority: Major > Fix For: 6.2, 7.0 > > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601387#comment-15601387 ] Tommaso Teofili commented on LUCENE-6968: - [~yo...@apache.org] the _MinHash_ filter can be already used in Solr just like any other {{TokenFilter}} using its {{MinHashFilterFactory}}; regarding the test, there used to be one to test its end to end usage in one of the first patches, however it was decided to keep only the {{BaseTokenStreamTestCase}} ones as to check the filtering was working correctly. Perhaps we can put the old usage test in the context of Solr and add it within the Solr tests. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili > Fix For: master (7.0), 6.2 > > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15599783#comment-15599783 ] Yonik Seeley commented on LUCENE-6968: -- Is there a JIRA issue yet to expose (and test) this and MinHash in Solr? > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili > Fix For: master (7.0), 6.2 > > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15414895#comment-15414895 ] Varun Thacker commented on LUCENE-6968: --- Hi Tommaso, I think we need to fix the CHANGES entry to move the entry to the 6.2 section. Its under 7.0 section currently > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili > Fix For: master (7.0), 6.2 > > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348025#comment-15348025 ] Tommaso Teofili commented on LUCENE-6968: - I've backported this to branch 6.x for inclusion in next 6.2 release. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili > Fix For: master (7.0), 6.2 > > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348023#comment-15348023 ] ASF subversion and git services commented on LUCENE-6968: - Commit 96a5595fffd4a4a1c553a987382697cb8a92b354 in lucene-solr's branch refs/heads/branch_6x from [~teofili] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=96a5595 ] LUCENE-6968 - LSH Filter backported to branch 6.x > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili > Fix For: master (7.0) > > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333646#comment-15333646 ] ASF subversion and git services commented on LUCENE-6968: - Commit 82a9244193ba948142b834ec08e2de0d98cfba9f in lucene-solr's branch refs/heads/apiv2 from [~teofili] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=82a9244 ] LUCENE-6968 - MinHash filter, thanks to Andy Hind and Cao Manh Dat for patches > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili > Fix For: master (7.0) > > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15331794#comment-15331794 ] Tommaso Teofili commented on LUCENE-6968: - yes, I plan to merge it to 6.x, I wanted to have a few more runs on Jenkins before merging it back to make sure there're no token filtering level issues. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili > Fix For: master (7.0) > > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330032#comment-15330032 ] Andy Hind commented on LUCENE-6968: --- Hi Tommaso - are you planning to merge this to 6.x? > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili > Fix For: master (7.0) > > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15329206#comment-15329206 ] Tommaso Teofili commented on LUCENE-6968: - I've committed this, thanks to [~andyhind] and [~caomanhdat] for your patches! > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili > Fix For: master (7.0) > > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15329205#comment-15329205 ] ASF subversion and git services commented on LUCENE-6968: - Commit 82a9244193ba948142b834ec08e2de0d98cfba9f in lucene-solr's branch refs/heads/master from [~teofili] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=82a9244 ] LUCENE-6968 - MinHash filter, thanks to Andy Hind and Cao Manh Dat for patches > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili > Fix For: master (7.0) > > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328054#comment-15328054 ] Andy Hind commented on LUCENE-6968: --- Hi Tommaso, the MinHashFilterTest was running fine. It was JapaneseNumberFilter that was failing intermittently. I think on one of the evil test cases. LongPair should implement equals (and probably hashCode if it will be reused) as it goes into a TreeSet. An over sight on my part. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275123#comment-15275123 ] Tommaso Teofili commented on LUCENE-6968: - thanks a lot Andy, it generally looks much better, I will have a look at failures on point 5 (and eventually add the IDE fixes back). > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274697#comment-15274697 ] Andy Hind commented on LUCENE-6968: --- I have attached an updated patch. This addresses the following issues # Support for single hash, split into ranges with a minimum for each range # Remove end to end tests and beefed up unit tests # Remove Guava in favour of Yonik's murmur hash implementation. (Some duplication here with SOLR) # Fixed alignment and "evil" test case issue # TestFactories passes > 200 times (some Japanese Number tokenisation failures) # Fixed formatting There were issue applying patch 4 on its own or over the previous patch. I believe I have included everything other then the IDE related bits. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili > Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, > LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263867#comment-15263867 ] Andy Hind commented on LUCENE-6968: --- After a bit more digging, the single hash and keeping the minimum set can be improved. See: [1] http://jmlr.org/proceedings/papers/v32/shrivastava14.pdf [2] http://www.auai.org/uai2014/proceedings/individuals/225.pdf In summary: rather than keep the minimum set, split the hash space up into 500 buckets (for a 500 hash fingerprint) and keep the minimum for each bucket. To fill an empty bucket, take the minimum from the next non-empty bucket on the right adding an offset for each step taken. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili > Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, > LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262918#comment-15262918 ] Andy Hind commented on LUCENE-6968: --- [~yo...@apache.org] has murmurhash3_x64_128 here https://github.com/yonik/java_util/blob/master/src/util/hash/MurmurHash3.java > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili > Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, > LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262169#comment-15262169 ] Andy Hind commented on LUCENE-6968: --- I agree a pure token stream test makes sense. The only concern I have is about testing token filters chained together. Chaining shingle generation with min hashing requires that the underlying token stream has its state reset correctly for reuse. As I missed this, I added a test to cover it. Is there somewhere else in the test framework that covers this case? Some randomised chaining of filters?? Perhaps chaining is more of a SOLR thing. I would prefer to stick with a 128/96 bit hash. The link below [1] "suggests" 5-shingles become well distributed. Link [2] says upto 2/3 of all possible trigrams have been seen in 30 years of news articles. So it seems we can expect to see many of the possible 5-shingles. Some bioinformatic use cases may also require this. {quote} [1] http://googleresearch.blogspot.co.uk/2006/08/all-our-n-gram-are-belong-to-you.html [2] http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf {quote} I was not that keen to add Guava! However, it was already there somewhere. I am happy if this moves off into a separate module. I will also look to see how this dependency could be removed. Perhaps we should have some time to consider how to include the fingerprint length (sum of the min set size over all hashes) to support an unbiased query. An unbiased query would be more difficult to build correctly. Some fingerprint/LSH query support and tests may make sense. Some other statistics may also be useful in generating faster queries that find similar documents using some threshold and probability of meeting that threshold. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili > Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, > LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261997#comment-15261997 ] Robert Muir commented on LUCENE-6968: - We don't need any "integration tests" or "end to end tests". -1 to those. Just think about it. Tokenizers only output one thing, and that is tokens. And that is what we should test. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat >Assignee: Tommaso Teofili > Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, > LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261990#comment-15261990 ] Tommaso Teofili commented on LUCENE-6968: - thanks Robert, I agree on all of your suggestions. The error reported by the TestFactories witnesses what you say, I can work on those more appropriate tests; however I would also like to keep the "query" ones as to keep an eye on whether one of the use case for this filter is supported as expected (a sort of integration test). Removing Guava might of course be possible (and I'd prefer to do that), although rewriting the hashcode stuff is likely to be a bit annoying. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, > LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261961#comment-15261961 ] Robert Muir commented on LUCENE-6968: - also, these analyzers should not be tested with queries. Instead, just use the asserts in BaseTokenStreamTestCase to check that the correct tokens are output. I see some solr code was copied so that things can be tested with queries. I know that solr likes to test this way, but nobody wants to debug a test failure that tests an analyzer this way. Just keep in mind we have a ton of analyzers, and when things go wrong (and they do), people have to debug them or even do refactorings across hundreds of them. That is why we have a consistent test class (BaseTokenStreamTestCase). > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, > LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261955#comment-15261955 ] Robert Muir commented on LUCENE-6968: - analysis-common library cannot have any external dependencies. so If this filter is going to depend on guava, it needs to be its own analysis module. Just like all other analysis modules that have any third party dependencies. I would try to remove the guava dependency at all costs anyway. It causes hell for developers to depend on guava. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, > LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261480#comment-15261480 ] Cao Manh Dat commented on LUCENE-6968: -- Thanks, that make sense! > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259924#comment-15259924 ] Andy Hind commented on LUCENE-6968: --- This comes down to "what is a good estimate of |A U B|" and do we need it for the use case. For query, the main use case is finding documents like one source document. So we are comparing Doc A with all other documents. What we need is a measure that is fair for A -> B, A -> C. We probably do not care about B -> C. If we take the fingerprint from A and just OR the bits together into a big query we have a consistent measure of similarity of A with any other document. This particular measure is biased. For a start Sim(A, B) is not equal to Sim(B, A). But for this use case that may not matter. This measure contains both inclusion and duplication which may be a good thing. It is also pretty intuitive what it means. This is (A ∩ B)/|A|. If we want Sim(A, B) = Sim(B, A) then we need some consistent measure/sample of |A U B| to normalise our measure/estimate of A ∩ B. This could be (|A| + |B| - |A ∩ B|), or some similar estimate. We could use the size of the fingerprint set. We could keep the full ordered set of hashes and have extra statistics like the total number of hashes and total number of unique hashes. For two short documents, where there are fewer fingerprints than the maximim, we have the full set. For two larger docs we have an estimate of these based in the min hash sets. You can argue "min of many hashes" is a random sample with replacement and "min set of one hash" is a random sample with out replacement (min set); if your hash is good. If the sample is small compared with the set of all hashes the arguments converge. If we were doing arbitrary comparisons between any pair of documents then we would have to use an unbiased estimator. Finding candidate pairs, moving onto clustering, ... > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257732#comment-15257732 ] Cao Manh Dat commented on LUCENE-6968: -- Thanks for the link. I totally agree that keeping some lowest values for single hash function would be better. But in the wiki doc. It pointed out that the estimator formulation for "variant with a single hash function" is not same as the estimator formulation for "variant with many hash function". So the generated query must be different for each case. For example, in case we use single hash function and keep some lowest values : 1. We have doc A = [1, 2, 5, 6, 7], doc B = [3, 4, 5, 6, 7] 3. So jaccard(A,B) = |hk(A U B) ∩ hk(A) ∩ hk(B)| / k = |{5}| / k = 0.2 > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256114#comment-15256114 ] Andy Hind commented on LUCENE-6968: --- The argument here says it is pretty much the same. {code} https://en.wikipedia.org/wiki/MinHash {code} The plan was to offer both options. With respect to banding and finding docs related to some start document, the number of hashes may depend on the start document. Let's start with 5 word shingles, one hash and keep the minimum 100 hash values. For a five word document we get one hash. For a 100 word doc where all the shingles/words are the same we get one hash. For all different shingles we get 96 hashes. If we have 100 different hashes and keep the lowest one all the above cases end up with 100 hashes. So back to banding. With minimum sets, you need to look and see how many hashes you really got and then do the banding. Comparing a small documents/snippet (where we get 10 hashes in the fingerprint)with a much larger document (where we get 100 hashes) is an interesting case to consider. Starting with the small document there are fewer bits to match in the generated query. With 100 hashes from the small document I think you end up in the roughly same place, except for small snippets. Any given band is more likely to have the same shingle hashed different ways. There is also an argument for a winnowing approach. With a 100 hash fingerprint, sampling for 100 words is great but not so great for 100,000 words. With a minimum set we have the option to generate a finger print related to the document length and other features. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255435#comment-15255435 ] Cao Manh Dat commented on LUCENE-6968: -- What's a wonderful patch. The code is optimized, sure that the the index will be much smaller! But there are some line that's quite hard to understand for me. - Is combineOrdered(HasCode... ) necessary? - The patch keep some lowest values for each position, so for given expectedTruePositive how can we compute the band size? > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255434#comment-15255434 ] Cao Manh Dat commented on LUCENE-6968: -- What's a wonderful patch. The code is optimized, sure that the the index will be much smaller! But there are some line that's quite hard to understand for me. - Is combineOrdered(HasCode... ) necessary? - The patch keep some lowest values for each position, so for given expectedTruePositive how can we compute the band size? > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252743#comment-15252743 ] Andy Hind commented on LUCENE-6968: --- Hi It would be quite common to use min hashing after shingling. At this point the number of possible word combinations vs the size of the hash is important. With shingles of 5 words from 100,000 that is 10e25 combinations. Some naive processing of the ~500k Enron emails (splitting on white space, case folding and 5 word shingles) gives ~1e13 combinations. So a long hash would be better at 1.8e19. I have not yet looked at a larger corpus. The LSH query is neat. However the logic can give banding where the last band is uneven. In the patch I think the last band would be dropped unless bands * rows in band = # of hashes The underlying state of the source filter may also be lost (if using shingling) I do not believe the similarity is required at all. I think you can get Jaccard distance using constant score queries and disabling coordination on the boolean query. I went for 128-bit hashes, or a 32 bit hash identifier + 96 bit hash with a bit more flexibility allowing a minimum set of hash values for a bunch of hashes. There is clearly some trade off for speed of hashing and over representing short documents. The minimum set may be a solution to this. I think there is some interesting research there. I will add my patch inspired by the original and apologise for the mixed formatting in advance .. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252122#comment-15252122 ] Joel Bernstein commented on LUCENE-6968: HI, I believe there will be some work forthcoming from Alfresco on this ticket. We may want to hold off until those patches go up. Hopefully I will have an update on this soon. [~teofili], if you feel strongly though that this is ready to commit, then go for it. We can always make adjustments after the committal as well. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252065#comment-15252065 ] Tommaso Teofili commented on LUCENE-6968: - [~caomanhdat], I'd like to commit this patch shortly, my only concern is around the similarity implementation, having all methods returning 1 doesn't seem correct to me, or am I missing something ? > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195272#comment-15195272 ] Tommaso Teofili commented on LUCENE-6968: - I was having a second look at this patch, why is {{LSHSimilarity}} returning 1 in all its methods ? Is that intended or does it mean it still needs to be implemented? > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147204#comment-15147204 ] Tommaso Teofili commented on LUCENE-6968: - since this has also been mentioned in SOLR-7739 maybe the LSH filter can be used to create a Lucene nearest neighbour Classifier which uses such filter (instead of the existing k nearest neighbour one). > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093091#comment-15093091 ] Cao Manh Dat commented on LUCENE-6968: -- Yes, It kinda like finding K nearest neighbor. But there a different here: - In K nearest neighbor, we use K as parameter which decide how many nearest neighbor we wanna retrieve. - In LSH we use a radius as parameter. The radius define minimum distance between center doc and other docs. In both case, the result will be rank by distance to the center doc. > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6968) LSH Filter
[ https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15091901#comment-15091901 ] Joel Bernstein commented on LUCENE-6968: [~caomanhdat], interestiing ticket! Can you describe some of the potential uses for LSH. For example will this be helpful in approximating K nearest neighbor? > LSH Filter > -- > > Key: LUCENE-6968 > URL: https://issues.apache.org/jira/browse/LUCENE-6968 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Cao Manh Dat > Attachments: LUCENE-6968.patch, LUCENE-6968.patch > > > I'm planning to implement LSH. Which support query like this > {quote} > Find similar documents that have 0.8 or higher similar score with a given > document. Similarity measurement can be cosine, jaccard, euclid.. > {quote} > For example. Given following corpus > {quote} > 1. Solr is an open source search engine based on Lucene > 2. Solr is an open source enterprise search engine based on Lucene > 3. Solr is an popular open source enterprise search engine based on Lucene > 4. Apache Lucene is a high-performance, full-featured text search engine > library written entirely in Java > {quote} > We wanna find documents that have 0.6 score in jaccard measurement with this > doc > {quote} > Solr is an open source search engine > {quote} > It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org