[jira] [Commented] (LUCENE-6968) LSH Filter

2019-03-28 Thread Andy Hind (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804310#comment-16804310
 ] 

Andy Hind commented on LUCENE-6968:
---

[~mayyas], in answer to your questions:

1) Depends on your view - a lower default would probably make sense. The 
default of 512  is good  for 3-10+ pages of A4 text - again with an "it 
depends" caveat 

2) Query time is covered here https://issues.apache.org/jira/browse/SOLR-12879

    You can probably get away with the default query time tokenisation in SOLR, 
post BM25.

 

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: 6.2, 7.0
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2019-03-04 Thread Mayya Sharipova (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783921#comment-16783921
 ] 

Mayya Sharipova commented on LUCENE-6968:
-

[~andyhind] Thanks very much for your answer, it made things more clear.  I 
still have a couple of additional questions if you don't mind:

1)  The default settings of the filter will produce for each document 512 
tokens each of the size 16 bytes, that is approximately 8Kb. Isn't 8Kb too big 
of a size to be a document's signature?

2) What is the way to combine `min_hash` tokens to a query for similarity 
search? Do you have any examples? Is this a work in progress?

Thanks again!

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: 6.2, 7.0
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2019-01-08 Thread Andy Hind (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737192#comment-16737192
 ] 

Andy Hind commented on LUCENE-6968:
---

[~mayyas]     Hi Mayya, there is a good review paper here 
[https://arxiv.org/pdf/1408.2927.pdf].  See sections 3.5.1 and 3.5.2 and 
related references. I have not found the specific comment about bias I was 
trying to locate.

The handwaving view is that empty or missing hashes are biased for many to many 
comparisons. It is difficult to tune the hash parameters for a wide mix of doc 
sizes, and small documents in particular, as the number of hashes increases 
with doc size over some range. It is better to have some value rather than 
none. There is an argument about what value should be used but that is less 
important. Repetition is one way of filling in gaps and making the hash count 
consistent. For two small docs, there is going to be a bit of asymmetry in the 
measure whatever you do. In some cases, like containment, the bias may be a 
good thing :)

Apologies for my slow response.

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: 6.2, 7.0
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2018-12-02 Thread Mayya Sharipova (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706571#comment-16706571
 ] 

Mayya Sharipova commented on LUCENE-6968:
-

[~andyhind]  Hello Andy! I have several questions about the implementation of 
the *MinHashFilter*, and was wondering if you would be able to answer them. 
Thanks a lot in advance.

The implementation from 1st original patch where the minimum set is kept is 
very clear to me, and follows the classic idea of constructing MinHash 
signature and LSH search after it. But I am having a hard time understanding 
the final implementation for MinHashFilter.

1) What constitutes the signature of a document? Are these all values stored in 
the hash table? Doesn't it make a signature too large? Can you please refer the 
paper that describes this way of constructing minhash signatures.

2) What is the use of {{withRotation}} parameter? What is the advantage of 
using {{withRotation=true}}? In the paper you cited: 
[http://www.auai.org/uai2014/proceedings/individuals/225.pdf], they fill empty 
bins with "value of the closest non-empty bin in the clockwise direction 
(circular right hand side) added *with offset C*". In the {{MinHashFilter}} 
implementation values for empty buckets are just blindly copied from non-empty 
ones, so a lot of buckets with have the same value.

Hopefully the questions make sense. Thanks again in advance.

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: 6.2, 7.0
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-10-24 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601387#comment-15601387
 ] 

Tommaso Teofili commented on LUCENE-6968:
-

[~yo...@apache.org] the _MinHash_ filter can be already used in Solr just like 
any other {{TokenFilter}} using its {{MinHashFilterFactory}}; regarding the 
test, there used to be one to test its end to end usage in one of the first 
patches, however it was decided to keep only the {{BaseTokenStreamTestCase}} 
ones as to check the filtering was working correctly. Perhaps we can put the 
old usage test in the context of Solr and add it within the Solr tests.

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Fix For: master (7.0), 6.2
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-10-23 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15599783#comment-15599783
 ] 

Yonik Seeley commented on LUCENE-6968:
--

Is there a JIRA issue yet to expose (and test) this and MinHash in Solr?

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Fix For: master (7.0), 6.2
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-08-10 Thread Varun Thacker (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15414895#comment-15414895
 ] 

Varun Thacker commented on LUCENE-6968:
---

Hi Tommaso,

I think we need to fix the CHANGES entry to move the entry to the 6.2 section. 
Its under 7.0 section currently 

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Fix For: master (7.0), 6.2
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-06-24 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348025#comment-15348025
 ] 

Tommaso Teofili commented on LUCENE-6968:
-

I've backported this to branch 6.x for inclusion in next 6.2 release.

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Fix For: master (7.0), 6.2
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-06-24 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348023#comment-15348023
 ] 

ASF subversion and git services commented on LUCENE-6968:
-

Commit 96a5595fffd4a4a1c553a987382697cb8a92b354 in lucene-solr's branch 
refs/heads/branch_6x from [~teofili]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=96a5595 ]

LUCENE-6968 - LSH Filter backported to branch 6.x


> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Fix For: master (7.0)
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-06-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333646#comment-15333646
 ] 

ASF subversion and git services commented on LUCENE-6968:
-

Commit 82a9244193ba948142b834ec08e2de0d98cfba9f in lucene-solr's branch 
refs/heads/apiv2 from [~teofili]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=82a9244 ]

LUCENE-6968 - MinHash filter, thanks to Andy Hind and Cao Manh Dat for patches


> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Fix For: master (7.0)
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-06-15 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15331794#comment-15331794
 ] 

Tommaso Teofili commented on LUCENE-6968:
-

yes, I plan to merge it to 6.x, I wanted to have a few more runs on Jenkins 
before merging it back to make sure there're no token filtering level issues.

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Fix For: master (7.0)
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-06-14 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330032#comment-15330032
 ] 

Andy Hind commented on LUCENE-6968:
---

Hi Tommaso - are you planning to merge this to 6.x?

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Fix For: master (7.0)
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-06-14 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15329206#comment-15329206
 ] 

Tommaso Teofili commented on LUCENE-6968:
-

I've committed this, thanks to [~andyhind] and [~caomanhdat] for your patches!

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Fix For: master (7.0)
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-06-14 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15329205#comment-15329205
 ] 

ASF subversion and git services commented on LUCENE-6968:
-

Commit 82a9244193ba948142b834ec08e2de0d98cfba9f in lucene-solr's branch 
refs/heads/master from [~teofili]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=82a9244 ]

LUCENE-6968 - MinHash filter, thanks to Andy Hind and Cao Manh Dat for patches


> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Fix For: master (7.0)
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-06-13 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328054#comment-15328054
 ] 

Andy Hind commented on LUCENE-6968:
---

Hi Tommaso, the MinHashFilterTest was running fine. It was JapaneseNumberFilter 
that was failing intermittently. I think on one of the evil test cases.

LongPair should implement equals (and probably hashCode if it will be reused) 
as it goes into a TreeSet. An over sight on my part.

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-05-07 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275123#comment-15275123
 ] 

Tommaso Teofili commented on LUCENE-6968:
-

thanks a lot Andy, it generally looks much better, I will have a look at 
failures on point 5 (and eventually add the IDE fixes back).

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-05-06 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274697#comment-15274697
 ] 

Andy Hind commented on LUCENE-6968:
---

I have attached an updated patch.

This addresses the following issues
# Support for single hash, split into ranges with a minimum for each range
# Remove end to end tests and beefed up unit tests
# Remove Guava in favour of Yonik's murmur hash implementation. (Some 
duplication here with SOLR)
# Fixed alignment and "evil" test case issue   
# TestFactories passes > 200 times (some Japanese Number tokenisation failures)
# Fixed formatting

There were issue applying patch 4 on its own or over the previous patch. I 
believe I have included everything other then the IDE related bits.

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-29 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263867#comment-15263867
 ] 

Andy Hind commented on LUCENE-6968:
---

After a bit more digging, the single hash and keeping the minimum set can be 
improved.

See: 
[1] http://jmlr.org/proceedings/papers/v32/shrivastava14.pdf
[2] http://www.auai.org/uai2014/proceedings/individuals/225.pdf

In summary: rather than keep the minimum set, split the hash space up into 500 
buckets (for a 500 hash fingerprint) and keep the minimum for each bucket. To 
fill an empty bucket, take the minimum from the next non-empty bucket on the 
right adding an offset for each step taken. 

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, 
> LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-28 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262918#comment-15262918
 ] 

Andy Hind commented on LUCENE-6968:
---

[~yo...@apache.org] has murmurhash3_x64_128 here 
https://github.com/yonik/java_util/blob/master/src/util/hash/MurmurHash3.java


> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, 
> LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-28 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262169#comment-15262169
 ] 

Andy Hind commented on LUCENE-6968:
---

I agree a pure token stream test makes sense. The only concern I have is about 
testing token filters chained together. Chaining shingle generation with min 
hashing requires that the underlying token stream has its state reset correctly 
for reuse. As I missed this, I added a test to cover it. Is there somewhere 
else in the test framework that covers this case? Some randomised chaining of 
filters?? Perhaps chaining is more of a SOLR thing.

I would prefer to stick with a 128/96 bit hash. The link below [1] "suggests" 
5-shingles become well distributed. Link [2] says upto 2/3 of all possible 
trigrams have been seen in 30 years of news  articles. So it seems we can 
expect to see many of the possible 5-shingles. Some bioinformatic use cases may 
also require this.  

{quote}
[1] 
http://googleresearch.blogspot.co.uk/2006/08/all-our-n-gram-are-belong-to-you.html
[2] http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf
{quote}

I was not that keen to add Guava! However, it was already there somewhere.
I am happy if this moves off into a separate module. I will also look to see 
how this dependency could be removed.

Perhaps we should have some time to consider how to include the fingerprint 
length (sum of the min set size over all hashes) to support an unbiased query. 
An unbiased query would be more difficult to build correctly. Some 
fingerprint/LSH query support and tests may make sense. Some other statistics 
may also be useful in generating faster queries that find similar documents 
using some threshold and probability of meeting that threshold. 

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, 
> LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261997#comment-15261997
 ] 

Robert Muir commented on LUCENE-6968:
-

We don't need any "integration tests" or "end to end tests".

-1 to those. Just think about it. Tokenizers only output one thing, and that is 
tokens. And that is what we should test.

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, 
> LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-28 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261990#comment-15261990
 ] 

Tommaso Teofili commented on LUCENE-6968:
-

thanks Robert, I agree on all of your suggestions.
The error reported by the TestFactories witnesses what you say, I can work on 
those more appropriate tests; however I would also like to keep the "query" 
ones as to keep an eye on whether one of the use case for this filter is 
supported as expected (a sort of integration test).
Removing Guava might of course be possible (and I'd prefer to do that), 
although rewriting the hashcode stuff is likely to be a bit annoying.

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, 
> LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261961#comment-15261961
 ] 

Robert Muir commented on LUCENE-6968:
-

also, these analyzers should not be tested with queries. Instead, just use the 
asserts in BaseTokenStreamTestCase to check that the correct tokens are output.

I see some solr code was copied so that things can be tested with queries. I 
know that solr likes to test this way, but nobody wants to debug a test failure 
that tests an analyzer this way.

Just keep in mind we have a ton of analyzers, and when things go wrong (and 
they do), people have to debug them or even do refactorings across hundreds of 
them. That is why we have a consistent test class (BaseTokenStreamTestCase).

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, 
> LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261955#comment-15261955
 ] 

Robert Muir commented on LUCENE-6968:
-

analysis-common library cannot have any external dependencies. 

so If this filter is going to depend on guava, it needs to be its own analysis 
module. Just like all other analysis modules that have any third party 
dependencies.

I would try to remove the guava dependency at all costs anyway. It causes hell 
for developers to depend on guava.


> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, 
> LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-27 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261480#comment-15261480
 ] 

Cao Manh Dat commented on LUCENE-6968:
--

Thanks, that make sense!

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-27 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259924#comment-15259924
 ] 

Andy Hind commented on LUCENE-6968:
---

This comes down to "what is a good estimate of |A U B|" and do we need it for 
the use case.

For query, the main use case is finding documents like one source document. So 
we are comparing Doc A with all other documents. What we need is a measure that 
is fair for A -> B, A -> C. We probably do not care about B -> C. If we take 
the fingerprint from A and just OR the bits together into a big query we have a 
consistent measure of similarity of A with any other document. This particular 
measure is biased. For a start Sim(A, B) is not equal to Sim(B, A). But for 
this use case that may not matter. This measure contains both inclusion and 
duplication which may be a good thing. It is also pretty intuitive what it 
means. This is (A ∩ B)/|A|.

If we want Sim(A, B) = Sim(B, A) then we need some consistent measure/sample of 
|A U B| to normalise our measure/estimate of A ∩ B. This could be (|A| + |B| - 
|A ∩ B|), or some similar estimate. We could use the size of the fingerprint 
set. We could keep the full ordered set of hashes and have extra statistics 
like the total number of hashes and total number of unique hashes. 

For two short documents, where there are fewer fingerprints than the maximim, 
we have the full set.
For two larger docs we have an estimate of these based in the min hash sets.  
You can argue "min of many hashes"  is a random sample with replacement and 
"min set of one hash" is a random sample with out replacement (min set); if 
your hash is good. If the sample is small compared with the set of all hashes 
the arguments converge. 

If we were doing arbitrary comparisons between any pair of documents then we 
would have to use an unbiased estimator. Finding candidate pairs, moving onto 
clustering, ...


> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-26 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257732#comment-15257732
 ] 

Cao Manh Dat commented on LUCENE-6968:
--

Thanks for the link. I totally agree that keeping some lowest values for single 
hash function would be better. 

But in the wiki doc. It pointed out that the estimator formulation for "variant 
with a single hash function" is not same as the estimator formulation for 
"variant with many hash function". So the generated query must be different for 
each case.

For example, in case we use single hash function and keep some lowest values :
1. We have doc A = [1, 2, 5, 6, 7], doc B = [3, 4, 5, 6, 7]
3. So jaccard(A,B) = |hk(A U B) ∩ hk(A) ∩ hk(B)| / k = |{5}| / k = 0.2

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-25 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256114#comment-15256114
 ] 

Andy Hind commented on LUCENE-6968:
---

The argument here says it is pretty much the same.

{code}
https://en.wikipedia.org/wiki/MinHash
{code}

The plan was to offer both options.

With respect to banding and finding docs related to some start document, the 
number of hashes may depend on the start document. 

Let's start with 5 word shingles, one hash and keep the minimum 100 hash 
values. For a five word document we get one hash. For a 100 word doc where all 
the shingles/words are the same we get one hash. For all different shingles we 
get 96 hashes.

If we have 100 different hashes and keep the lowest one all the above cases end 
up with 100 hashes.

So back to banding. With minimum sets, you need to look and see how many hashes 
you really got and then do the banding. Comparing a small documents/snippet 
(where we get 10 hashes in the fingerprint)with a much larger document (where 
we get 100 hashes) is an interesting case to consider. Starting with the small 
document there are fewer bits to match in the generated query. With 100 hashes 
from the small document I think you end up in the roughly same place, except 
for small snippets. Any given band is more likely to have the same shingle 
hashed different ways.

There is also an argument for a winnowing approach. With a 100 hash 
fingerprint, sampling for 100 words is great but not so great for 100,000 
words. With a minimum set we have the option to generate a finger print related 
to the document length and other features.




> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-23 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255435#comment-15255435
 ] 

Cao Manh Dat commented on LUCENE-6968:
--

What's a wonderful patch. The code is optimized, sure that the the index will 
be much smaller!

But there are some line that's quite hard to understand for me.
- Is combineOrdered(HasCode... ) necessary?
- The patch keep some lowest values for each position, so for given 
expectedTruePositive how can we compute the band size?

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-23 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255434#comment-15255434
 ] 

Cao Manh Dat commented on LUCENE-6968:
--

What's a wonderful patch. The code is optimized, sure that the the index will 
be much smaller!

But there are some line that's quite hard to understand for me.
- Is combineOrdered(HasCode... ) necessary?
- The patch keep some lowest values for each position, so for given 
expectedTruePositive how can we compute the band size?

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-21 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252743#comment-15252743
 ] 

Andy Hind commented on LUCENE-6968:
---

Hi

It would be quite common to use min hashing after shingling. At this point the 
number of possible word combinations vs the size of the hash is important. With 
shingles of 5 words from 100,000 that is 10e25 combinations. Some naive 
processing of the ~500k  Enron emails (splitting on white space, case folding 
and 5 word shingles) gives ~1e13  combinations. So a long hash would be better 
at 1.8e19. I have not yet looked at a larger corpus.

The LSH query is neat. However the logic can give banding where the last band 
is uneven. In the patch I think the last band would be dropped unless bands * 
rows in band  = # of hashes

The underlying state of the source filter may also be lost (if using shingling)

I do not believe the similarity is required at all. I think you can get Jaccard 
distance using constant score queries and disabling coordination on the boolean 
query. 

I went for 128-bit hashes, or a 32 bit hash identifier + 96 bit hash with a bit 
more flexibility allowing a minimum set of hash values for a bunch of hashes. 
There is clearly some trade off for speed of hashing and over representing 
short documents. The minimum set may be a solution to this.  I think there is 
some interesting research there. 

I will add my patch inspired by the original  and apologise for the mixed 
formatting in advance ..


> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-21 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252122#comment-15252122
 ] 

Joel Bernstein commented on LUCENE-6968:


HI, I believe there will be some work forthcoming from Alfresco on this ticket. 
We may want to hold off until those patches go up. Hopefully I will have an 
update on this soon.

[~teofili], if you feel strongly though that this is ready to commit, then go 
for it. We can always make adjustments after the committal as well.

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-21 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252065#comment-15252065
 ] 

Tommaso Teofili commented on LUCENE-6968:
-

[~caomanhdat], I'd like to commit this patch shortly, my only concern is around 
the similarity implementation, having all methods returning 1 doesn't seem 
correct to me, or am I missing something ?

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-03-15 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195272#comment-15195272
 ] 

Tommaso Teofili commented on LUCENE-6968:
-

I was having a second look at this patch, why is {{LSHSimilarity}} returning 1 
in all its methods ? Is that intended or does it mean it still needs to be 
implemented?

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-02-15 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147204#comment-15147204
 ] 

Tommaso Teofili commented on LUCENE-6968:
-

since this has also been mentioned in SOLR-7739 maybe the LSH filter can be 
used to create a Lucene nearest neighbour Classifier which uses such filter 
(instead of the existing k nearest neighbour one).

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-01-11 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093091#comment-15093091
 ] 

Cao Manh Dat commented on LUCENE-6968:
--

Yes, It kinda like finding K nearest neighbor. But there a different here:
- In K nearest neighbor, we use K as parameter which decide how many nearest 
neighbor we wanna retrieve.
- In LSH we use a radius as parameter. The radius define minimum distance 
between center doc and other docs. 
In both case, the result will be rank by distance to the center doc.

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-01-11 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15091901#comment-15091901
 ] 

Joel Bernstein commented on LUCENE-6968:


[~caomanhdat], interestiing ticket!

Can you describe some of the potential uses for LSH. For example will this be 
helpful in approximating K nearest neighbor?

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org