[jira] [Updated] (SOLR-13752) MoreLikeThis MLT is biased for uncommon fields

2019-09-10 Thread Andy Hind (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Hind updated SOLR-13752:
-
Status: Patch Available  (was: Open)

> MoreLikeThis MLT is biased for uncommon fields
> --
>
> Key: SOLR-13752
> URL: https://issues.apache.org/jira/browse/SOLR-13752
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: MoreLikeThis
>Reporter: Andy Hind
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> MLT always uses the total doc count, not the count of docs that actually have
> the specific field.
>  
> To quote Maria Mestre from the discussion on the mailing list (29/01/19):
>  
> {quote}The issue I have is that when retrieving the key scored terms 
> (interestingTerms), the code uses the total number of documents in the index, 
> not the total number of documents with populated “description” field. This is 
> where it’s done in the code: 
> [https://github.com/apache/lucene-solr/blob/master/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L651]
> The effect of this choice is that the “idf” does not vary much, given that 
> numDocs >> number of documents with “description”, so the key terms end up 
> being just the terms with the highest term frequencies.
> It is inconsistent because the MLT-search then uses these extracted key terms 
> and scores all documents using an idf which is computed only on the subset of 
> documents with “description”. So one part of the MLT uses a different numDocs 
> than another part. This sounds like an odd choice, and not expected at all, 
> and I wonder if I’m missing something.
> {quote}






[jira] [Commented] (SOLR-13752) MoreLikeThis MLT is biased for uncommon fields

2019-09-10 Thread Andy Hind (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926877#comment-16926877
 ] 

Andy Hind commented on SOLR-13752:
--

https://github.com/apache/lucene-solr/pull/871

> MoreLikeThis MLT is biased for uncommon fields
> --
>
> Key: SOLR-13752
> URL: https://issues.apache.org/jira/browse/SOLR-13752
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: MoreLikeThis
>Reporter: Andy Hind
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> MLT always uses the total doc count, not the count of docs that actually have
> the specific field.
>  
> To quote Maria Mestre from the discussion on the mailing list (29/01/19):
>  
> {quote}The issue I have is that when retrieving the key scored terms 
> (interestingTerms), the code uses the total number of documents in the index, 
> not the total number of documents with populated “description” field. This is 
> where it’s done in the code: 
> [https://github.com/apache/lucene-solr/blob/master/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L651]
> The effect of this choice is that the “idf” does not vary much, given that 
> numDocs >> number of documents with “description”, so the key terms end up 
> being just the terms with the highest term frequencies.
> It is inconsistent because the MLT-search then uses these extracted key terms 
> and scores all documents using an idf which is computed only on the subset of 
> documents with “description”. So one part of the MLT uses a different numDocs 
> than another part. This sounds like an odd choice, and not expected at all, 
> and I wonder if I’m missing something.
> {quote}






[jira] [Created] (SOLR-13752) MoreLikeThis MLT is biased for uncommon fields

2019-09-10 Thread Andy Hind (Jira)
Andy Hind created SOLR-13752:


 Summary: MoreLikeThis MLT is biased for uncommon fields
 Key: SOLR-13752
 URL: https://issues.apache.org/jira/browse/SOLR-13752
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: MoreLikeThis
Reporter: Andy Hind


MLT always uses the total doc count, not the count of docs that actually have
the specific field.

To quote Maria Mestre from the discussion on the mailing list (29/01/19):

{quote}The issue I have is that when retrieving the key scored terms 
(interestingTerms), the code uses the total number of documents in the index, 
not the total number of documents with populated “description” field. This is 
where it’s done in the code: 
[https://github.com/apache/lucene-solr/blob/master/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L651]

The effect of this choice is that the “idf” does not vary much, given that 
numDocs >> number of documents with “description”, so the key terms end up 
being just the terms with the highest term frequencies.

It is inconsistent because the MLT-search then uses these extracted key terms 
and scores all documents using an idf which is computed only on the subset of 
documents with “description”. So one part of the MLT uses a different numDocs 
than another part. This sounds like an odd choice, and not expected at all, and 
I wonder if I’m missing something.
{quote}
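
For illustration, a minimal sketch of the bias and the shape of the fix. The method and variable names below are illustrative assumptions, not the patch itself (see the linked PR for the actual change); the key point is swapping the index-wide numDocs() for the per-field getDocCount(field) that Lucene already exposes:

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.similarities.TFIDFSimilarity;

// Sketch only - not the committed patch.
float idfFor(IndexReader ir, TFIDFSimilarity similarity, String fieldName, String word)
    throws IOException {
  int docFreq = ir.docFreq(new Term(fieldName, word));

  int numDocs = ir.numDocs();                     // current code: all docs in the index
  int fieldDocCount = ir.getDocCount(fieldName);  // docs that actually have the field

  // idf against all docs barely varies when numDocs >> fieldDocCount,
  // so the "interesting" terms degenerate to the highest-tf terms ...
  float biasedIdf = similarity.idf(docFreq, numDocs);

  // ... whereas idf against the field's own doc count matches what the MLT
  // search later uses when scoring, removing the inconsistency described above.
  return similarity.idf(docFreq, fieldDocCount);
}
{code}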






[jira] [Commented] (SOLR-12879) Query Parser for MinHash/LSH

2019-03-29 Thread Andy Hind (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805000#comment-16805000
 ] 

Andy Hind commented on SOLR-12879:
--

Yes, there are two parts to the doc update: one for the MinHash filter in
Lucene, the other for the related qparser in Solr.

> Query Parser for MinHash/LSH
> 
>
> Key: SOLR-12879
> URL: https://issues.apache.org/jira/browse/SOLR-12879
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 8.0
>Reporter: Andy Hind
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: 8.0
>
> Attachments: minhash.filter.adoc.fragment, minhash.patch, 
> minhash.qparser.adoc.fragment
>
>
> Following on from https://issues.apache.org/jira/browse/LUCENE-6968, provide
> a query parser that builds queries that give a measure of Jaccard
> similarity. The initial patch includes the banded queries that were also
> proposed on the original issue.
>  
> I have one outstanding question:
>  * Should the score from the overall query be normalised?
> Note that the band count is currently approximate and may be one less than
> in practice.






[jira] [Commented] (LUCENE-6968) LSH Filter

2019-03-28 Thread Andy Hind (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804310#comment-16804310
 ] 

Andy Hind commented on LUCENE-6968:
---

[~mayyas], in answer to your questions:

1) It depends on your view - a lower default would probably make sense. The
default of 512 is good for 3-10+ pages of A4 text - again with an "it
depends" caveat.

2) Query time is covered here: https://issues.apache.org/jira/browse/SOLR-12879
You can probably get away with the default query-time tokenisation in Solr,
post BM25.

 

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: 6.2, 7.0
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a
> given document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is a popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine
> library written entirely in Java
> {quote}
> we want to find documents that have a Jaccard score of 0.6 with this
> doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4).






[jira] [Commented] (SOLR-12879) Query Parser for MinHash/LSH

2019-03-28 Thread Andy Hind (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804303#comment-16804303
 ] 

Andy Hind commented on SOLR-12879:
--

I do not see the docs for this updated/added in 8.0 ...

> Query Parser for MinHash/LSH
> 
>
> Key: SOLR-12879
> URL: https://issues.apache.org/jira/browse/SOLR-12879
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 8.0
>Reporter: Andy Hind
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: 8.0
>
> Attachments: minhash.filter.adoc.fragment, minhash.patch, 
> minhash.qparser.adoc.fragment
>
>
> Following on from https://issues.apache.org/jira/browse/LUCENE-6968, provide
> a query parser that builds queries that give a measure of Jaccard
> similarity. The initial patch includes the banded queries that were also
> proposed on the original issue.
>  
> I have one outstanding question:
>  * Should the score from the overall query be normalised?
> Note that the band count is currently approximate and may be one less than
> in practice.






[jira] [Commented] (LUCENE-6968) LSH Filter

2019-01-08 Thread Andy Hind (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737192#comment-16737192
 ] 

Andy Hind commented on LUCENE-6968:
---

[~mayyas] Hi Mayya, there is a good review paper here:
[https://arxiv.org/pdf/1408.2927.pdf]. See sections 3.5.1 and 3.5.2 and the
related references. I have not found the specific comment about bias I was
trying to locate.

The handwaving view is that empty or missing hashes are biased for many-to-many
comparisons. It is difficult to tune the hash parameters for a wide mix of doc
sizes, and for small documents in particular, as the number of hashes increases
with doc size over some range. It is better to have some value rather than
none. There is an argument about what value should be used, but that is less
important. Repetition is one way of filling in gaps and making the hash count
consistent. For two small docs, there is going to be a bit of asymmetry in the
measure whatever you do. In some cases, like containment, the bias may be a
good thing :)

Apologies for my slow response.

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: 6.2, 7.0
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a
> given document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is a popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine
> library written entirely in Java
> {quote}
> we want to find documents that have a Jaccard score of 0.6 with this
> doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4).






[jira] [Commented] (SOLR-11207) Add OWASP dependency checker to detect security vulnerabilities in third party libraries

2018-11-22 Thread Andy Hind (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696201#comment-16696201
 ] 

Andy Hind commented on SOLR-11207:
--

Is there still interest in adding this improvement? It was pretty
straightforward to add the Ant task. The tool has support for recording
vulnerabilities that do not affect the product (false positives). Expect a
patch shortly ... Should we split this into Solr and Lucene issues?

> Add OWASP dependency checker to detect security vulnerabilities in third 
> party libraries
> 
>
> Key: SOLR-11207
> URL: https://issues.apache.org/jira/browse/SOLR-11207
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Build
>Affects Versions: 6.0
>Reporter: Hrishikesh Gadre
>Priority: Major
>
> The Lucene/Solr project depends on a number of third-party libraries. Some of
> those libraries contain security vulnerabilities. Upgrading to versions of
> those libraries that have fixes for those vulnerabilities is a simple,
> critical step we can take to improve the security of the system. But for that
> we need a tool which can scan the Lucene/Solr dependencies and look up known
> vulnerabilities in a security database.
> I found that [OWASP
> dependency-checker|https://jeremylong.github.io/DependencyCheck/dependency-check-ant/]
> can be used for this purpose. It provides an Ant task which we can include in
> the Lucene/Solr build. We also need to figure out how (and when) to invoke
> this dependency-checker, but that can be figured out once we complete the
> first step of integrating this tool with the Lucene/Solr build system.






[jira] [Updated] (SOLR-12879) Query Parser for MinHash/LSH

2018-11-20 Thread Andy Hind (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Hind updated SOLR-12879:
-
Attachment: minhash.qparser.adoc.fragment

> Query Parser for MinHash/LSH
> 
>
> Key: SOLR-12879
> URL: https://issues.apache.org/jira/browse/SOLR-12879
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: master (8.0)
>Reporter: Andy Hind
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: master (8.0)
>
> Attachments: minhash.filter.adoc.fragment, minhash.patch, 
> minhash.qparser.adoc.fragment
>
>
> Following on from https://issues.apache.org/jira/browse/LUCENE-6968, provide
> a query parser that builds queries that give a measure of Jaccard
> similarity. The initial patch includes the banded queries that were also
> proposed on the original issue.
>  
> I have one outstanding question:
>  * Should the score from the overall query be normalised?
> Note that the band count is currently approximate and may be one less than
> in practice.






[jira] [Commented] (SOLR-12879) Query Parser for MinHash/LSH

2018-10-30 Thread Andy Hind (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668841#comment-16668841
 ] 

Andy Hind commented on SOLR-12879:
--

Should I raise separate issues for the documentation?

> Query Parser for MinHash/LSH
> 
>
> Key: SOLR-12879
> URL: https://issues.apache.org/jira/browse/SOLR-12879
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: master (8.0)
>Reporter: Andy Hind
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: master (8.0)
>
> Attachments: minhash.filter.adoc.fragment, minhash.patch
>
>
> Following on from https://issues.apache.org/jira/browse/LUCENE-6968, provide
> a query parser that builds queries that give a measure of Jaccard
> similarity. The initial patch includes the banded queries that were also
> proposed on the original issue.
>  
> I have one outstanding question:
>  * Should the score from the overall query be normalised?
> Note that the band count is currently approximate and may be one less than
> in practice.






[jira] [Updated] (SOLR-12879) Query Parser for MinHash/LSH

2018-10-23 Thread Andy Hind (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Hind updated SOLR-12879:
-
Attachment: minhash.filter.adoc.fragment

> Query Parser for MinHash/LSH
> 
>
> Key: SOLR-12879
> URL: https://issues.apache.org/jira/browse/SOLR-12879
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: master (8.0)
>Reporter: Andy Hind
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: master (8.0)
>
> Attachments: minhash.filter.adoc.fragment, minhash.patch
>
>
> Following on from https://issues.apache.org/jira/browse/LUCENE-6968, provide
> a query parser that builds queries that give a measure of Jaccard
> similarity. The initial patch includes the banded queries that were also
> proposed on the original issue.
>  
> I have one outstanding question:
>  * Should the score from the overall query be normalised?
> Note that the band count is currently approximate and may be one less than
> in practice.






[jira] [Commented] (SOLR-12879) Query Parser for MinHash/LSH

2018-10-23 Thread Andy Hind (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660681#comment-16660681
 ] 

Andy Hind commented on SOLR-12879:
--

MinHash Filter doc ...

 

{quote}

== MinHash Filter

Generates a repeatably random fixed number of hash tokens from all of the input
tokens in the stream.
To do this, it first consumes all of the input tokens from its source.
This filter would normally be preceded by a shingle filter, as shown in the
example below.

Each input token is hashed. It is subsequently "rehashed" `hashCount` times by
combining it with a set of precomputed hashes.
For each of the resulting hashes, the hash space is divided into `bucketCount`
buckets. The lowest set of `hashSetSize` hashes (usually a set of one)
is generated for each bucket.

This filter generates one type of signature or sketch for the input tokens and 
can be used to compute Jaccard similarity between documents.


*Arguments:*

`hashCount`:: (integer) the number of hashes to use. The default is 1.

`bucketCount`:: (integer) the number of buckets to use. The default is 512.

`hashSetSize`:: (integer) the size of the set for the lowest hashes from each 
bucket. The default is 1.

`withRotation`:: (boolean) if a hash bucket is empty, generate a hash value
from the first previous bucket that has a value. The default is true if the
bucket count is greater than 1 and false otherwise.


The number of hashes generated depends on the options above. With the default
setting for `withRotation`, the number of hashes generated is
`hashCount` x `bucketCount` x `hashSetSize` => 512 by default.

*Example:*

[source,xml]
----
<analyzer>
  <!-- tokenize, fold, make 5-word shingles, then MinHash the shingles -->
  <tokenizer class="solr.ICUTokenizerFactory"/>
  <filter class="solr.ICUFoldingFilterFactory"/>
  <filter class="solr.ShingleFilterFactory" minShingleSize="5" maxShingleSize="5"
          outputUnigrams="false" outputUnigramsIfNoShingles="false" tokenSeparator=" "/>
  <filter class="solr.MinHashFilterFactory" bucketCount="512" hashSetSize="1" hashCount="1"/>
</analyzer>
----

*In:* "woof woof woof woof woof"

*Tokenizer to Filter:* "woof woof woof woof woof"

*Out:* "℁팽徭聙↝ꇁ홱杯", "℁팽徭聙↝ꇁ홱杯", "℁팽徭聙↝ꇁ홱杯", ... a total of 512 times

{quote}
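
As background on how such fingerprints are used (this is not part of the attached doc fragment): the Jaccard similarity of two documents can be estimated from the overlap of their fixed-size minhash fingerprints. A minimal Java sketch, assuming the fingerprints are available as sets of hash tokens:

{code}
import java.util.HashSet;
import java.util.Set;

public class JaccardSketch {
  // Estimate Jaccard similarity from two minhash fingerprints.
  static double estimateJaccard(Set<String> a, Set<String> b) {
    Set<String> intersection = new HashSet<>(a);
    intersection.retainAll(b);          // hashes present in both sketches
    Set<String> union = new HashSet<>(a);
    union.addAll(b);                    // hashes present in either sketch
    return (double) intersection.size() / union.size();
  }
}
{code}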

 

 

> Query Parser for MinHash/LSH
> 
>
> Key: SOLR-12879
> URL: https://issues.apache.org/jira/browse/SOLR-12879
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: master (8.0)
>Reporter: Andy Hind
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: master (8.0)
>
> Attachments: minhash.patch
>
>
> Following on from https://issues.apache.org/jira/browse/LUCENE-6968, provide
> a query parser that builds queries that give a measure of Jaccard
> similarity. The initial patch includes the banded queries that were also
> proposed on the original issue.
>  
> I have one outstanding question:
>  * Should the score from the overall query be normalised?
> Note that the band count is currently approximate and may be one less than
> in practice.






[jira] [Commented] (SOLR-12879) Query Parser for MinHash/LSH

2018-10-22 Thread Andy Hind (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16659793#comment-16659793
 ] 

Andy Hind commented on SOLR-12879:
--

I don't think there is any reason the patch would not go back to 7.x. It has no
dependencies other than the analyser. It started life on 6.x, where it needed
to disable query coordination.

The parser is mostly intended to be used with the q and fq parameters. A
default wire-up would be great.

I would not be surprised if someone comes up with a use in streaming as it 
provides another distance measure.

I will look at adding the docs. The analyser should also have some explanation. 
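
For illustration, a hypothetical request along the lines the patch enables. The parser name (min_hash) and the field/sim parameter names here are assumptions for the sketch rather than confirmed syntax; the attached adoc fragment has the definitive details:

{code}
q={!min_hash field="min_hash_field" sim="0.8"}Solr is an open source search engine
{code}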

 

 

> Query Parser for MinHash/LSH
> 
>
> Key: SOLR-12879
> URL: https://issues.apache.org/jira/browse/SOLR-12879
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: master (8.0)
>Reporter: Andy Hind
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: master (8.0)
>
> Attachments: minhash.patch
>
>
> Following on from https://issues.apache.org/jira/browse/LUCENE-6968, provide
> a query parser that builds queries that give a measure of Jaccard
> similarity. The initial patch includes the banded queries that were also
> proposed on the original issue.
>  
> I have one outstanding question:
>  * Should the score from the overall query be normalised?
> Note that the band count is currently approximate and may be one less than
> in practice.






[jira] [Updated] (SOLR-12879) Query Parser for MinHash/LSH

2018-10-17 Thread Andy Hind (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Hind updated SOLR-12879:
-
Attachment: minhash.patch

> Query Parser for MinHash/LSH
> 
>
> Key: SOLR-12879
> URL: https://issues.apache.org/jira/browse/SOLR-12879
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: master (8.0)
>Reporter: Andy Hind
>Priority: Major
> Fix For: master (8.0)
>
> Attachments: minhash.patch
>
>
> Following on from https://issues.apache.org/jira/browse/LUCENE-6968, provide
> a query parser that builds queries that give a measure of Jaccard
> similarity. The initial patch includes the banded queries that were also
> proposed on the original issue.
>  
> I have one outstanding question:
>  * Should the score from the overall query be normalised?
> Note that the band count is currently approximate and may be one less than
> in practice.






[jira] [Created] (SOLR-12879) Query Parser for MinHash/LSH

2018-10-17 Thread Andy Hind (JIRA)
Andy Hind created SOLR-12879:


 Summary: Query Parser for MinHash/LSH
 Key: SOLR-12879
 URL: https://issues.apache.org/jira/browse/SOLR-12879
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
  Components: query parsers
Affects Versions: master (8.0)
Reporter: Andy Hind
 Fix For: master (8.0)


Following on from https://issues.apache.org/jira/browse/LUCENE-6968, provide a
query parser that builds queries that give a measure of Jaccard similarity.
The initial patch includes the banded queries that were also proposed on the
original issue.

I have one outstanding question:
 * Should the score from the overall query be normalised?

Note that the band count is currently approximate and may be one less than in
practice.






[jira] [Created] (SOLR-10025) SOLR_SSL_OPTS are ignored in bin\solr.cmd

2017-01-23 Thread Andy Hind (JIRA)
Andy Hind created SOLR-10025:


 Summary: SOLR_SSL_OPTS are ignored in bin\solr.cmd
 Key: SOLR-10025
 URL: https://issues.apache.org/jira/browse/SOLR-10025
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Affects Versions: 6.3
Reporter: Andy Hind


SSL config fails on Windows.
It requires fixes for late binding.
See !SOLR_SSL_OPTS! below.


{code}
REM Select HTTP OR HTTPS related configurations
set SOLR_URL_SCHEME=http
set "SOLR_JETTY_CONFIG=--module=http"
set "SOLR_SSL_OPTS= "
IF DEFINED SOLR_SSL_KEY_STORE (
  set "SOLR_JETTY_CONFIG=--module=https"
  set SOLR_URL_SCHEME=https
  set "SCRIPT_ERROR=Solr server directory %SOLR_SERVER_DIR% not found!"
  set "SOLR_SSL_OPTS=-Dsolr.jetty.keystore=%SOLR_SSL_KEY_STORE% 
-Dsolr.jetty.keystore.password=%SOLR_SSL_KEY_STORE_PASSWORD% 
-Dsolr.jetty.truststore=%SOLR_SSL_TRUST_STORE% 
-Dsolr.jetty.truststore.password=%SOLR_SSL_TRUST_STORE_PASSWORD% 
-Dsolr.jetty.ssl.needClientAuth=%SOLR_SSL_NEED_CLIENT_AUTH% 
-Dsolr.jetty.ssl.wantClientAuth=%SOLR_SSL_WANT_CLIENT_AUTH%"
  IF DEFINED SOLR_SSL_CLIENT_KEY_STORE  (
set "SOLR_SSL_OPTS=!SOLR_SSL_OPTS! 
-Djavax.net.ssl.keyStore=%SOLR_SSL_CLIENT_KEY_STORE% 
-Djavax.net.ssl.keyStorePassword=%SOLR_SSL_CLIENT_KEY_STORE_PASSWORD% 
-Djavax.net.ssl.trustStore=%SOLR_SSL_CLIENT_TRUST_STORE% 
-Djavax.net.ssl.trustStorePassword=%SOLR_SSL_CLIENT_TRUST_STORE_PASSWORD%"
  ) ELSE (
set "SOLR_SSL_OPTS=!SOLR_SSL_OPTS! 
-Djavax.net.ssl.keyStore=%SOLR_SSL_KEY_STORE% 
-Djavax.net.ssl.keyStorePassword=%SOLR_SSL_KEY_STORE_PASSWORD% 
-Djavax.net.ssl.trustStore=%SOLR_SSL_TRUST_STORE% 
-Djavax.net.ssl.trustStorePassword=%SOLR_SSL_TRUST_STORE_PASSWORD%"
  )
) ELSE (
  set SOLR_SSL_OPTS=
)
{code}

We also use a non-default keystore type and have to disable peer name checking:
{code}

-a ". -Djavax.net.ssl.keyStoreType=JCEKS 
-Djavax.net.ssl.trustStoreType=JCEKS -Dsolr.ssl.checkPeerName=false"
{code}






[jira] [Comment Edited] (LUCENE-7476) Fix transient failure in JapaneseNumberFilter run from TestFactories

2016-10-10 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15561858#comment-15561858
 ] 

Andy Hind edited comment on LUCENE-7476 at 10/10/16 10:15 AM:
--

Running the tests 100 times via ant produces no issue. This seems to be an 
eclipse configuration issue.
{code}
ant test  -Dtestcase=TestFactories -Dtests.method=test  -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
{code}


was (Author: andyhind):
Running the tests 100 times via ant produces no issue. This seems to an eclipse 
configuration issue.
{code}
ant test  -Dtestcase=TestFactories -Dtests.method=test  -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
{code}

> Fix transient failure in JapaneseNumberFilter run from TestFactories
> 
>
> Key: LUCENE-7476
> URL: https://issues.apache.org/jira/browse/LUCENE-7476
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 6.2.1
>Reporter: Andy Hind
>Priority: Trivial
> Attachments: LUCENE-7476.patch
>
>
> Repeatedly running TestFactories shows this test failing ~10% of the time.
> I believe the fix is trivial and related to losing the state of the
> underlying input stream when testing some analyzer life-cycle flows.






[jira] [Commented] (LUCENE-7476) Fix transient failure in JapaneseNumberFilter run from TestFactories

2016-10-10 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15561858#comment-15561858
 ] 

Andy Hind commented on LUCENE-7476:
---

Running the tests 100 times via ant produces no issue. This seems to be an
eclipse configuration issue.
{code}
ant test  -Dtestcase=TestFactories -Dtests.method=test  -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
{code}

> Fix transient failure in JapaneseNumberFilter run from TestFactories
> 
>
> Key: LUCENE-7476
> URL: https://issues.apache.org/jira/browse/LUCENE-7476
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 6.2.1
>Reporter: Andy Hind
>Priority: Trivial
> Attachments: LUCENE-7476.patch
>
>
> Repeatedly running TestFactories shows this test failing ~10% of the time.
> I believe the fix is trivial and related to losing the state of the
> underlying input stream when testing some analyzer life-cycle flows.






[jira] [Commented] (LUCENE-7476) Fix transient failure in JapaneseNumberFilter run from TestFactories

2016-10-10 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15561799#comment-15561799
 ] 

Andy Hind commented on LUCENE-7476:
---

I spotted this running org.apache.lucene.analysis.core.TestFactories with
@Repeat(iterations = 100) from eclipse.
I just got 9 failures running this again. It is odd that I do not see them in
the build failures.

I believe the 9 failures are all the same:

{code}
java.lang.IllegalStateException: incrementToken() called while in wrong state: INCREMENT_FALSE
  at __randomizedtesting.SeedInfo.seed([18C3960FB72D4F07:2AB7AA6A139D55E3]:0)
  at org.apache.lucene.analysis.MockTokenizer.fail(MockTokenizer.java:125)
  at org.apache.lucene.analysis.MockTokenizer.incrementToken(MockTokenizer.java:136)
  at org.apache.lucene.analysis.ja.JapaneseNumberFilter.incrementToken(JapaneseNumberFilter.java:152)
  at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:716)
  at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:627)
  at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:525)
  at org.apache.lucene.analysis.core.TestFactories.doTestTokenFilter(TestFactories.java:108)
  at org.apache.lucene.analysis.core.TestFactories.test(TestFactories.java:61)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1764)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:871)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:907)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:921)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:367)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:809)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:460)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:880)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:781)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:816)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:827)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
  at
{code}

[jira] [Commented] (LUCENE-7476) Fix transient failure in JapaneseNumberFilter run from TestFactories

2016-10-06 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15551409#comment-15551409
 ] 

Andy Hind commented on LUCENE-7476:
---

Patch supplied - TestFactories ran with more than 700 seeds without error.
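
For context, the failure mode shown in the earlier stack trace is a filter calling incrementToken() on its upstream stream again after that stream has already returned false. A minimal sketch of the kind of guard that avoids this (an assumption about the shape of the fix, with a made-up class name - see the attached patch for the real change):

{code}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Sketch: track upstream exhaustion so incrementToken() is never called on
// the wrapped stream again once it has returned false - the illegal state
// that MockTokenizer reports.
public final class ExhaustionSafeFilter extends TokenFilter {
  private boolean exhausted = false;

  public ExhaustionSafeFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (exhausted) {
      return false;               // never touch the upstream stream again
    }
    if (!input.incrementToken()) {
      exhausted = true;           // remember that the upstream stream is done
      return false;
    }
    // ... token-rewriting logic would operate on the current token here ...
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    exhausted = false;            // the stream is reusable after reset()
  }
}
{code}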

> Fix transient failure in JapaneseNumberFilter run from TestFactories
> 
>
> Key: LUCENE-7476
> URL: https://issues.apache.org/jira/browse/LUCENE-7476
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 6.2.1
>Reporter: Andy Hind
>Priority: Trivial
> Attachments: LUCENE-7476.patch
>
>
> Repeatedly running TestFactories shows this test failing ~10% of the time.
> I believe the fix is trivial and related to losing the state of the
> underlying input stream when testing some analyzer life-cycle flows.






[jira] [Updated] (LUCENE-7476) Fix transient failure in JapaneseNumberFilter run from TestFactories

2016-10-06 Thread Andy Hind (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Hind updated LUCENE-7476:
--
Attachment: LUCENE-7476.patch

> Fix transient failure in JapaneseNumberFilter run from TestFactories
> 
>
> Key: LUCENE-7476
> URL: https://issues.apache.org/jira/browse/LUCENE-7476
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 6.2.1
>Reporter: Andy Hind
>Priority: Trivial
> Attachments: LUCENE-7476.patch
>
>
> Repeatedly running TestFactories shows this test failing ~10% of the time.
> I believe the fix is trivial and related to losing the state of the
> underlying input stream when testing some analyzer life-cycle flows.






[jira] [Created] (LUCENE-7476) Fix transient failure in JapaneseNumberFilter run from TestFactories

2016-10-06 Thread Andy Hind (JIRA)
Andy Hind created LUCENE-7476:
-

 Summary: Fix transient failure in JapaneseNumberFilter run from 
TestFactories
 Key: LUCENE-7476
 URL: https://issues.apache.org/jira/browse/LUCENE-7476
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/other
Affects Versions: 6.2.1
Reporter: Andy Hind
Priority: Trivial


Repeatedly running TestFactories shows this test failing ~10% of the time.

I believe the fix is trivial and related to losing the state of the underlying
input stream when testing some analyzer life-cycle flows.






[jira] [Commented] (LUCENE-6968) LSH Filter

2016-06-14 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330032#comment-15330032
 ] 

Andy Hind commented on LUCENE-6968:
---

Hi Tommaso - are you planning to merge this to 6.x?

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Fix For: master (7.0)
>
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a
> given document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is a popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine
> library written entirely in Java
> {quote}
> we want to find documents that have a Jaccard score of 0.6 with this
> doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4).






[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-06-13 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328054#comment-15328054
 ] 

Andy Hind edited comment on LUCENE-6968 at 6/13/16 7:55 PM:


Hi Tommaso, the MinHashFilterTest was running fine. It was JapaneseNumberFilter
that was failing intermittently - I think on one of the evil test cases.

LongPair should implement equals (and probably hashCode if it will be reused)
as it goes into a TreeSet. An oversight on my part.

FWIW, as far as I can tell, the change in patch 6 was included in 5.


was (Author: andyhind):
Hi Tommaso, the MinHashFilterTest was running fine. It was JapaneseNumberFilter 
that was failing intermittently. I think on one of the evil test cases.

LongPair should implement equals (and probably hashCode if it will be reused) 
as it goes into a TreeSet. An over sight on my part.

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a
> given document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is a popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine
> library written entirely in Java
> {quote}
> we want to find documents that have a Jaccard score of 0.6 with this
> doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4).






[jira] [Commented] (LUCENE-6968) LSH Filter

2016-06-13 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328054#comment-15328054
 ] 

Andy Hind commented on LUCENE-6968:
---

Hi Tommaso, the MinHashFilterTest was running fine. It was JapaneseNumberFilter
that was failing intermittently - I think on one of the evil test cases.

LongPair should implement equals (and probably hashCode if it will be reused)
as it goes into a TreeSet. An oversight on my part.

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a
> given document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is a popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine
> library written entirely in Java
> {quote}
> we want to find documents that have a Jaccard score of 0.6 with this
> doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4).






[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-05-06 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263867#comment-15263867
 ] 

Andy Hind edited comment on LUCENE-6968 at 5/6/16 8:43 PM:
---

After a bit more digging, the single hash and keeping the minimum set can be 
improved.

See: 
[1] http://jmlr.org/proceedings/papers/v32/shrivastava14.pdf
[2] http://www.auai.org/uai2014/proceedings/individuals/225.pdf

In summary: rather than keep the minimum set, split the hash space up into 500 
buckets (for a 500 hash fingerprint) and keep the minimum for each bucket. To 
fill an empty bucket, take the minimum from the next non-empty bucket on the 
right with rotation. 


was (Author: andyhind):
After a bit more digging, the single hash and keeping the minimum set can be 
improved.

See: 
[1] http://jmlr.org/proceedings/papers/v32/shrivastava14.pdf
[2] http://www.auai.org/uai2014/proceedings/individuals/225.pdf

In summary: rather than keep the minimum set, split the hash space up into 500 
buckets (for a 500 hash fingerprint) and keep the minimum for each bucket. To 
fill an empty bucket, take the minimum from the next non-empty bucket on the 
right adding an offset for each step taken. 

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a
> given document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is a popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine
> library written entirely in Java
> {quote}
> we want to find documents that have a Jaccard score of 0.6 with this
> doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4).






[jira] [Commented] (LUCENE-6968) LSH Filter

2016-05-06 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274697#comment-15274697
 ] 

Andy Hind commented on LUCENE-6968:
---

I have attached an updated patch.

This addresses the following issues:
# Support for a single hash, split into ranges with a minimum for each range
# Removed end-to-end tests and beefed up the unit tests
# Removed Guava in favour of Yonik's murmur hash implementation (some
duplication here with Solr)
# Fixed the alignment and "evil" test case issue
# TestFactories passes > 200 times (some Japanese number tokenisation failures)
# Fixed formatting

There were issues applying patch 4 on its own or over the previous patch. I
believe I have included everything other than the IDE-related bits.

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a
> given document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is a popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine
> library written entirely in Java
> {quote}
> we want to find documents that have a Jaccard score of 0.6 with this
> doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4).






[jira] [Updated] (LUCENE-6968) LSH Filter

2016-05-06 Thread Andy Hind (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Hind updated LUCENE-6968:
--
Attachment: LUCENE-6968.5.patch

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a
> given document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is a popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine
> library written entirely in Java
> {quote}
> we want to find documents that have a Jaccard score of 0.6 with this
> doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4).






[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-29 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263867#comment-15263867
 ] 

Andy Hind commented on LUCENE-6968:
---

After a bit more digging, the single hash and keeping the minimum set can be 
improved.

See: 
[1] http://jmlr.org/proceedings/papers/v32/shrivastava14.pdf
[2] http://www.auai.org/uai2014/proceedings/individuals/225.pdf

In summary: rather than keep the minimum set, split the hash space up into 500 
buckets (for a 500 hash fingerprint) and keep the minimum for each bucket. To 
fill an empty bucket, take the minimum from the next non-empty bucket on the 
right adding an offset for each step taken. 
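
For illustration, a minimal sketch of the bucket-fill idea described above (the method and variable names are made up for the sketch; this is not the committed filter code). Each bucket keeps the minimum hash seen; an empty bucket borrows from the next non-empty bucket to the right, with wrap-around:

{code}
// Sketch: fill empty minhash buckets from the next non-empty bucket on the
// right, with rotation (wrap-around). minPerBucket[i] holds the minimum hash
// seen for bucket i, or null if no hash landed in that bucket.
public class BucketFill {
  static Long[] fillEmptyBuckets(Long[] minPerBucket) {
    Long[] filled = minPerBucket.clone();
    for (int i = 0; i < filled.length; i++) {
      if (filled[i] == null) {
        for (int step = 1; step < filled.length; step++) {
          Long candidate = minPerBucket[(i + step) % minPerBucket.length];
          if (candidate != null) {
            filled[i] = candidate;  // borrow the neighbouring minimum
            break;
          }
        }
      }
    }
    return filled;
  }
}
{code}

Note the comment above also adds an offset per step taken; the later edit of this comment simplifies that to plain rotation, i.e. copying the neighbouring minimum as this sketch does.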

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, 
> LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a
> given document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is a popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine
> library written entirely in Java
> {quote}
> we want to find documents that have a Jaccard score of 0.6 with this
> doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4).






[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-28 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262918#comment-15262918
 ] 

Andy Hind commented on LUCENE-6968:
---

[~yo...@apache.org] has murmurhash3_x64_128 here 
https://github.com/yonik/java_util/blob/master/src/util/hash/MurmurHash3.java
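
For reference, a minimal sketch of calling that implementation (the class layout and LongPair holder are assumptions based on that repo's source; check the linked file for the exact signatures):

{code}
import java.nio.charset.StandardCharsets;
import util.hash.MurmurHash3;

public class HashExample {
  public static void main(String[] args) {
    byte[] bytes = "woof woof woof woof woof".getBytes(StandardCharsets.UTF_8);
    MurmurHash3.LongPair hash = new MurmurHash3.LongPair();
    // 128-bit murmur hash, returned as two longs in 'hash'
    MurmurHash3.murmurhash3_x64_128(bytes, 0, bytes.length, 0, hash);
    System.out.println(hash.val1 + " " + hash.val2);
  }
}
{code}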


> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, 
> LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a
> given document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is a popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine
> library written entirely in Java
> {quote}
> we want to find documents that have a Jaccard score of 0.6 with this
> doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis will also return doc 4).






[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-28 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262169#comment-15262169
 ] 

Andy Hind commented on LUCENE-6968:
---

I agree a pure token stream test makes sense. The only concern I have is about 
testing token filters chained together. Chaining shingle generation with min 
hashing requires that the underlying token stream has its state reset correctly 
for reuse. As I missed this, I added a test to cover it. Is there somewhere 
else in the test framework that covers this case? Some randomised chaining of 
filters? Perhaps chaining is more of a SOLR thing.
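
For reference, the kind of chained reuse in question looks roughly like this
against a Lucene 5/6-style analysis API (a sketch; the min hash stage is left
out since its API is what this issue is defining - it would simply wrap the
shingle stream):

{code}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Whitespace tokens -> 5-word shingles; the min hash filter would wrap
// the shingle stream at the end of the chain.
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream shingles = new ShingleFilter(source, 5, 5);
        return new TokenStreamComponents(source, shingles);
    }
};

// Analysing a second document reuses the cached components, so every
// filter in the chain must honour the reset()/end()/close() contract.
for (String doc : new String[] { "a b c d e f", "g h i j k l" }) {
    try (TokenStream ts = analyzer.tokenStream("field", doc)) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term);
        }
        ts.end();
    }
}
{code}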

I would prefer to stick with a 128/96 bit hash. The link below [1] "suggests" 
5-shingles become well distributed. Link [2] says up to 2/3 of all possible 
trigrams have been seen in 30 years of news articles. So it seems we can 
expect to see many of the possible 5-shingles. Some bioinformatic use cases may 
also require this.

{quote}
[1] 
http://googleresearch.blogspot.co.uk/2006/08/all-our-n-gram-are-belong-to-you.html
[2] http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf
{quote}

I was not that keen to add Guava! However, it was already there somewhere.
I am happy if this moves off into a separate module. I will also look to see 
how this dependency could be removed.

Perhaps we should have some time to consider how to include the fingerprint 
length (sum of the min set size over all hashes) to support an unbiased query. 
An unbiased query would be more difficult to build correctly. Some 
fingerprint/LSH query support and tests may make sense. Some other statistics 
may also be useful in generating faster queries that find similar documents 
using some threshold and probability of meeting that threshold. 

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-6968.4.patch, LUCENE-6968.patch, 
> LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-04-28 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259924#comment-15259924
 ] 

Andy Hind edited comment on LUCENE-6968 at 4/28/16 9:27 AM:


This comes down to "what is a good estimate of |A U B|" and do we need it for 
the use case.

For query, the main use case is finding documents like one source document. So 
we are comparing Doc A with all other documents. What we need is a measure that 
is fair for A -> B, A -> C. We probably do not care about B -> C. If we take 
the fingerprint from A and just OR the bits together into a big query we have a 
consistent measure of similarity of A with any other document. This particular 
measure is biased. For a start Sim(A, B) is not equal to Sim(B, A). But for 
this use case that may not matter. This measure contains both inclusion and 
duplication which may be a good thing. It is also pretty intuitive what it 
means. This is |A ∩ B|/|A|.

If we want Sim(A, B) = Sim(B, A) then we need some consistent measure/sample of 
|A U B| to normalise our measure/estimate of A ∩ B. This could be (|A| + |B| - 
|A ∩ B|), or some similar estimate. We could use the size of the fingerprint 
sets. We could keep the full ordered set of hashes and have extra statistics 
like the total number of hashes and total number of unique hashes. 

For two short documents, where there are fewer fingerprints than the maximum, 
we have the full sets.
For two larger docs we have an estimate of these based on the min hash sets. 
You can argue "min of many hashes" is a random sample with replacement and 
"min set of one hash" is a random sample without replacement; if your hash is 
good. If the sample is small compared with the set of all hashes the arguments 
converge.

If we were doing arbitrary comparisons between any pair of documents then we 
would have to use an unbiased estimator. Finding candidate pairs, moving onto 
clustering, ...
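
To make the two measures concrete, a plain-Java sketch over two fingerprint
sets (names illustrative):

{code}
import java.util.HashSet;
import java.util.Set;

// Asymmetric containment |A ∩ B| / |A| - what OR-ing A's fingerprint
// into one big query effectively scores; Sim(A,B) != Sim(B,A) in general.
static double containment(Set<Long> a, Set<Long> b) {
    Set<Long> inter = new HashSet<>(a);
    inter.retainAll(b);
    return (double) inter.size() / a.size();
}

// Symmetric Jaccard estimate |A ∩ B| / (|A| + |B| - |A ∩ B|).
static double jaccard(Set<Long> a, Set<Long> b) {
    Set<Long> inter = new HashSet<>(a);
    inter.retainAll(b);
    return (double) inter.size() / (a.size() + b.size() - inter.size());
}
{code}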



was (Author: andyhind):
This comes down to "what is a good estimate of |A U B|" and do we need it for 
the use case.

For query, the main use case is finding documents like one source document. So 
we are comparing Doc A with all other documents. What we need is a measure that 
is fair for A -> B, A -> C. We probably do not care about B -> C. If we take 
the fingerprint from A and just OR the bits together into a big query we have a 
consistent measure of similarity of A with any other document. This particular 
measure is biased. For a start Sim(A, B) is not equal to Sim(B, A). But for 
this use case that may not matter. This measure contains both inclusion and 
duplication which may be a good thing. It is also pretty intuitive what it 
means. This is |A ∩ B|/|A|.

If we want Sim(A, B) = Sim(B, A) then we need some consistent measure/sample of 
|A U B| to normalise our measure/estimate of A ∩ B. This could be (|A| + |B| - 
|A ∩ B|), or some similar estimate. We could use the size of the fingerprint 
set. We could keep the full ordered set of hashes and have extra statistics 
like the total number of hashes and total number of unique hashes. 

For two short documents, where there are fewer fingerprints than the maximum, 
we have the full set.
For two larger docs we have an estimate of these based on the min hash sets. 
You can argue "min of many hashes" is a random sample with replacement and 
"min set of one hash" is a random sample without replacement (min set); if 
your hash is good. If the sample is small compared with the set of all hashes 
the arguments converge.

If we were doing arbitrary comparisons between any pair of documents then we 
would have to use an unbiased estimator. Finding candidate pairs, moving onto 
clustering, ...


> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA

[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-27 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259924#comment-15259924
 ] 

Andy Hind commented on LUCENE-6968:
---

This comes down to "what is a good estimate of |A U B|" and do we need it for 
the use case.

For query, the main use case is finding documents like one source document. So 
we are comparing Doc A with all other documents. What we need is a measure that 
is fair for A -> B, A -> C. We probably do not care about B -> C. If we take 
the fingerprint from A and just OR the bits together into a big query we have a 
consistent measure of similarity of A with any other document. This particular 
measure is biased. For a start Sim(A, B) is not equal to Sim(B, A). But for 
this use case that may not matter. This measure contains both inclusion and 
duplication which may be a good thing. It is also pretty intuitive what it 
means. This is |A ∩ B|/|A|.

If we want Sim(A, B) = Sim(B, A) then we need some consistent measure/sample of 
|A U B| to normalise our measure/estimate of A ∩ B. This could be (|A| + |B| - 
|A ∩ B|), or some similar estimate. We could use the size of the fingerprint 
set. We could keep the full ordered set of hashes and have extra statistics 
like the total number of hashes and total number of unique hashes. 

For two short documents, where there are fewer fingerprints than the maximum, 
we have the full set.
For two larger docs we have an estimate of these based on the min hash sets. 
You can argue "min of many hashes" is a random sample with replacement and 
"min set of one hash" is a random sample without replacement (min set); if 
your hash is good. If the sample is small compared with the set of all hashes 
the arguments converge.

If we were doing arbitrary comparisons between any pair of documents then we 
would have to use an unbiased estimator. Finding candidate pairs, moving onto 
clustering, ...


> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-04-25 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256114#comment-15256114
 ] 

Andy Hind edited comment on LUCENE-6968 at 4/25/16 11:52 AM:
-

The argument here says it is pretty much the same.

{code}
https://en.wikipedia.org/wiki/MinHash
{code}

The plan was to offer both options.

With respect to banding and finding docs related to some start document, the 
number of hashes may depend on the start document. 

Let's start with 5 word shingles, one hash and keep the minimum 100 hash 
values. For a five word document we get one hash. For a 100 word doc where all 
the shingles/words are the same we get one hash. For all different shingles we 
get 96 hashes.

If we have 100 different hashes and keep the lowest one all the above cases end 
up with 100 hashes.

So back to banding. With minimum sets, you need to look and see how many hashes 
you really got and then do the banding. Comparing a small document/snippet 
(where we get 10 hashes in the fingerprint) with a much larger document (where 
we get 100 hashes) is an interesting case to consider. Starting with the small 
document there are fewer bits to match in the generated query. With 100 hashes 
from the small document I think you end up in roughly the same place, except 
for small snippets. Any given band is more likely to have the same shingle 
hashed different ways.

With a 100 hash fingerprint, sampling for 100 words is great but not so great 
for 100,000 words. With a minimum set we have the option to generate a 
fingerprint related to the document length and other features.

There is also an argument for a winnowing approach. 
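
The two fingerprinting schemes contrasted above can be sketched as follows
(hash(shingle, seed) stands in for any good 64-bit hash):

{code}
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// "Min of many hashes": one minimum per hash/seed, so the fingerprint
// always has k entries regardless of document length.
static long[] minOfManyHashes(List<String> shingles, int k) {
    long[] mins = new long[k];
    Arrays.fill(mins, Long.MAX_VALUE);
    for (String s : shingles) {
        for (int seed = 0; seed < k; seed++) {
            mins[seed] = Math.min(mins[seed], hash(s, seed));
        }
    }
    return mins;
}

// "Min set of one hash": the k smallest distinct values of one hash,
// so short or repetitive documents yield fewer than k entries.
static TreeSet<Long> minSetOfOneHash(List<String> shingles, int k) {
    TreeSet<Long> mins = new TreeSet<>();
    for (String s : shingles) {
        mins.add(hash(s, 0));
        if (mins.size() > k) {
            mins.pollLast(); // drop the largest, keep the k smallest
        }
    }
    return mins;
}
{code}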




was (Author: andyhind):
The argument here says it is pretty much the same.

{code}
https://en.wikipedia.org/wiki/MinHash
{code}

The plan was to offer both options.

With respect to banding and finding docs related to some start document, the 
number of hashes may depend on the start document. 

Let's start with 5 word shingles, one hash and keep the minimum 100 hash 
values. For a five word document we get one hash. For a 100 word doc where all 
the shingles/words are the same we get one hash. For all different shingles we 
get 96 hashes.

If we have 100 different hashes and keep the lowest one all the above cases end 
up with 100 hashes.

So back to banding. With minimum sets, you need to look and see how many hashes 
you really got and then do the banding. Comparing a small document/snippet 
(where we get 10 hashes in the fingerprint) with a much larger document (where 
we get 100 hashes) is an interesting case to consider. Starting with the small 
document there are fewer bits to match in the generated query. With 100 hashes 
from the small document I think you end up in roughly the same place, except 
for small snippets. Any given band is more likely to have the same shingle 
hashed different ways.

There is also an argument for a winnowing approach. With a 100 hash 
fingerprint, sampling for 100 words is great but not so great for 100,000 
words. With a minimum set we have the option to generate a fingerprint related 
to the document length and other features.




> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-25 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256114#comment-15256114
 ] 

Andy Hind commented on LUCENE-6968:
---

The argument here says it is pretty much the same.

{code}
https://en.wikipedia.org/wiki/MinHash
{code}

The plan was to offer both options.

With respect to banding and finding docs related to some start document, the 
number of hashes may depend on the start document. 

Let's start with 5 word shingles, one hash and keep the minimum 100 hash 
values. For a five word document we get one hash. For a 100 word doc where all 
the shingles/words are the same we get one hash. For all different shingles we 
get 96 hashes.

If we have 100 different hashes and keep the lowest one all the above cases end 
up with 100 hashes.

So back to banding. With minimum sets, you need to look and see how many hashes 
you really got and then do the banding. Comparing a small document/snippet 
(where we get 10 hashes in the fingerprint) with a much larger document (where 
we get 100 hashes) is an interesting case to consider. Starting with the small 
document there are fewer bits to match in the generated query. With 100 hashes 
from the small document I think you end up in roughly the same place, except 
for small snippets. Any given band is more likely to have the same shingle 
hashed different ways.

There is also an argument for a winnowing approach. With a 100 hash 
fingerprint, sampling for 100 words is great but not so great for 100,000 
words. With a minimum set we have the option to generate a fingerprint related 
to the document length and other features.




> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-6968) LSH Filter

2016-04-21 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252743#comment-15252743
 ] 

Andy Hind edited comment on LUCENE-6968 at 4/21/16 11:06 PM:
-

Hi

It would be quite common to use min hashing after shingling. At this point the 
number of possible word combinations vs the size of the hash is important. With 
shingles of 5 words from a 100,000 word vocabulary that is ~10^25 combinations. 
Some naive processing of the ~500k Enron emails (splitting on white space, case 
folding and 5 word shingles) gives ~52M combinations. So a long hash would be 
better at ~1.8e19 (2^64). I have not yet looked at a larger corpus.

The LSH query is neat. However the logic can give banding where the last band 
is uneven. In the patch I think the last band would be dropped unless bands * 
rows-per-band = number of hashes.

The underlying state of the source filter may also be lost (if using shingling).

I do not believe the similarity is required at all. I think you can get Jaccard 
distance using constant score queries and disabling coordination on the boolean 
query. 

I went for 128-bit hashes, or a 32-bit hash identifier + 96-bit hash, with a 
bit more flexibility allowing a minimum set of hash values for a bunch of 
hashes. There is clearly some trade-off between speed of hashing and 
over-representing short documents. The minimum set may be a solution to this. 
I think there is some interesting research there.

I will add my patch inspired by the original, and apologise for the mixed 
formatting in advance.
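
On the uneven last band, a sketch of banding that keeps the trailing partial
band rather than dropping it (illustrative only):

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Split the fingerprint into bands of rowsPerBand hashes, keeping a
// final partial band when rowsPerBand does not divide the count evenly.
static List<long[]> toBands(long[] hashes, int rowsPerBand) {
    List<long[]> bands = new ArrayList<>();
    for (int start = 0; start < hashes.length; start += rowsPerBand) {
        int end = Math.min(start + rowsPerBand, hashes.length);
        bands.add(Arrays.copyOfRange(hashes, start, end));
    }
    return bands;
}
{code}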



was (Author: andyhind):
Hi

It would be quite common to use min hashing after shingling. At this point the 
number of possible word combinations vs the size of the hash is important. With 
shingles of 5 words from a 100,000 word vocabulary that is ~10^25 combinations. 
Some naive processing of the ~500k Enron emails (splitting on white space, case 
folding and 5 word shingles) gives ~1e13 combinations. So a long hash would be 
better at ~1.8e19 (2^64). I have not yet looked at a larger corpus.

The LSH query is neat. However the logic can give banding where the last band 
is uneven. In the patch I think the last band would be dropped unless bands * 
rows-per-band = number of hashes.

The underlying state of the source filter may also be lost (if using shingling).

I do not believe the similarity is required at all. I think you can get Jaccard 
distance using constant score queries and disabling coordination on the boolean 
query. 

I went for 128-bit hashes, or a 32-bit hash identifier + 96-bit hash, with a 
bit more flexibility allowing a minimum set of hash values for a bunch of 
hashes. There is clearly some trade-off between speed of hashing and 
over-representing short documents. The minimum set may be a solution to this. 
I think there is some interesting research there.

I will add my patch inspired by the original, and apologise for the mixed 
formatting in advance.


> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6968) LSH Filter

2016-04-21 Thread Andy Hind (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Hind updated LUCENE-6968:
--
Attachment: LUCENE-6968.patch

> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6968) LSH Filter

2016-04-21 Thread Andy Hind (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252743#comment-15252743
 ] 

Andy Hind commented on LUCENE-6968:
---

Hi

It would be quite common to use min hashing after shingling. At this point the 
number of possible word combinations vs the size of the hash is important. With 
shingles of 5 words from a 100,000 word vocabulary that is ~10^25 combinations. 
Some naive processing of the ~500k Enron emails (splitting on white space, case 
folding and 5 word shingles) gives ~1e13 combinations. So a long hash would be 
better at ~1.8e19 (2^64). I have not yet looked at a larger corpus.

The LSH query is neat. However the logic can give banding where the last band 
is uneven. In the patch I think the last band would be dropped unless bands * 
rows-per-band = number of hashes.

The underlying state of the source filter may also be lost (if using shingling).

I do not believe the similarity is required at all. I think you can get Jaccard 
distance using constant score queries and disabling coordination on the boolean 
query. 

I went for 128-bit hashes, or a 32-bit hash identifier + 96-bit hash, with a 
bit more flexibility allowing a minimum set of hash values for a bunch of 
hashes. There is clearly some trade-off between speed of hashing and 
over-representing short documents. The minimum set may be a solution to this. 
I think there is some interesting research there.

I will add my patch inspired by the original, and apologise for the mixed 
formatting in advance.


> LSH Filter
> --
>
> Key: LUCENE-6968
> URL: https://issues.apache.org/jira/browse/LUCENE-6968
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Cao Manh Dat
> Attachments: LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH. Which support query like this
> {quote}
> Find similar documents that have 0.8 or higher similar score with a given 
> document. Similarity measurement can be cosine, jaccard, euclid..
> {quote}
> For example. Given following corpus
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We wanna find documents that have 0.6 score in jaccard measurement with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-307) Lock obtain time out errors when opening readers and writers

2006-06-23 Thread Andy Hind (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-307?page=comments#action_12417515 ] 

Andy Hind commented on LUCENE-307:
--

I have seen something similar.

When the lock file is deleted the return value is not checked.
I have seen cases where the lock file is left by this call when you would 
expect it to be deleted.

I suspect that checking the file exists or trying to create the lock file can 
prevent it from being deleted.
Other processes could also be preventing the deletion, but not in my case.
It seems to be more likely under heavy lock load.

It would be simple to retry the delete and report an exception if the lock file 
could not eventually be deleted by the owner.

There is also no reason why there can not be a single shared lock of each type 
for the single instance of the FSDirectory.
I suspect there are many use cases where all that is required is in-JVM locking.

I am all for lucene using nio locking and pluggable locking.

For in-JVM locking with nio you have to use the same file channel instance for 
locking. In this patch I do not think that is the case as a new RAF and channel 
instance will be created for each new lock instance.  
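
The retry-then-report idea sketched (hypothetical; real code would tune the
retry count and delay):

{code}
import java.io.File;
import java.io.IOException;

// Retry deleting the lock file a few times; surface a clear error if
// the owner ultimately cannot remove its own lock.
static void releaseLock(File lockFile) throws IOException {
    for (int attempt = 0; attempt < 5; attempt++) {
        if (lockFile.delete() || !lockFile.exists()) {
            return; // deleted, or already gone
        }
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            break;
        }
    }
    throw new IOException("Could not delete lock file: " + lockFile);
}
{code}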




 Lock obtain time out errors when opening readers and writers
 

  Key: LUCENE-307
  URL: http://issues.apache.org/jira/browse/LUCENE-307
  Project: Lucene - Java
 Type: Bug

   Components: Other
 Versions: 1.4
  Environment: Operating System: All
 Platform: All
 Reporter: Reece (YT)
 Assignee: Lucene Developers
  Attachments: FSLock.java, TestLuceneLocks.java

 The attached Java file shows a locking issue that occurs with Lucene 1.4.2.
 One thread opens and closes an IndexReader.  The other thread opens an
 IndexWriter, adds a document and then closes the IndexWriter.  I would expect
 that this app should be able to happily run without an issues.
 It fails with:
   java.io.IOException: Lock obtain timed out
 Is this expected?  I thought a Reader could be opened while a Writer is 
 adding a
 document.
 I am able to get the error in less than 5 minutes when running this on Windows
 XP and Mac OS X.
 Any help is appreciated.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-415) Merge error during add to index (IndexOutOfBoundsException)

2006-05-18 Thread Andy Hind (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-415?page=comments#action_12412384 ] 

Andy Hind commented on LUCENE-415:
--

file.getChannel() was added on windows.
It was *before* the truncating file issue was found and resolved.
It is possible the two are related.
I have not verified and tested the same issue on linux.
We had just not seen it on other platforms.


It is possible file.setLength(0) also resolves the above issue.
It *certainly* solves some JVM crash/recovery issues.





 Merge error during add to index (IndexOutOfBoundsException)
 ---

  Key: LUCENE-415
  URL: http://issues.apache.org/jira/browse/LUCENE-415
  Project: Lucene - Java
 Type: Bug

   Components: Index
 Versions: 1.4
  Environment: Operating System: Linux
 Platform: Other
 Reporter: Daniel Quaroni
 Assignee: Lucene Developers


 I've been batch-building indexes, and I've build a couple hundred indexes 
 with 
 a total of around 150 million records.  This only happened once, so it's 
 probably impossible to reproduce, but anyway... I was building an index with 
 around 9.6 million records, and towards the end I got this:
 java.lang.IndexOutOfBoundsException: Index: 54, Size: 24
 at java.util.ArrayList.RangeCheck(ArrayList.java:547)
 at java.util.ArrayList.get(ArrayList.java:322)
 at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
 at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:151)
 at 
 org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java
 :149)
 at org.apache.lucene.index.SegmentTermEnum.next
 (SegmentTermEnum.java:115)
 at org.apache.lucene.index.SegmentMergeInfo.next
 (SegmentMergeInfo.java:52)
 at org.apache.lucene.index.SegmentMerger.mergeTermInfos
 (SegmentMerger.java:294)
 at org.apache.lucene.index.SegmentMerger.mergeTerms
 (SegmentMerger.java:254)
 at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:93)
 at org.apache.lucene.index.IndexWriter.mergeSegments
 (IndexWriter.java:487)
 at org.apache.lucene.index.IndexWriter.maybeMergeSegments
 (IndexWriter.java:458)
 at 
 org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:310)
 at 
 org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:294)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-436) [PATCH] TermInfosReader, SegmentTermEnum Out Of Memory Exception

2006-05-04 Thread Andy Hind (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-436?page=comments#action_12377764 ] 

Andy Hind commented on LUCENE-436:
--

I agree this is not strictly an issue with Lucene... but:

Lucene has an unusual use pattern for thread locals (instance vs static member 
variables).
There are issues with ThreadLocal, discussed here and elsewhere, including 
potential instability.
You should expect others to use thread locals - maybe in the same way.

I fixed the issue reported in LUCENE-529 with memory sensitive caching using 
SoftReferences to the values held as ThreadLocals.
Before an out of memory error, the values are cleared and hard refs to the soft 
reference wrapper class remain.
This pattern is used by some classes in the JVM.
This limits the memory overhead with thread local use. 
There will always be some overhead.

I am happy with the alternative fix using WeakHashMap.
Note: stale entries are always scanned and removed (each call to put, get, 
size) in contrast to thread locals. 
This is what I want - it must be stable!
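
The SoftReference variant described above, sketched in generic syntax (not the
exact production patch; SegmentTermEnum and origEnum as in the class under
discussion):

{code}
import java.lang.ref.SoftReference;

// Per-thread cached enumerator held softly: under memory pressure the
// VM clears the referent, leaving only the small wrapper reachable
// from the thread local table.
private final ThreadLocal<SoftReference<SegmentTermEnum>> enumerators =
        new ThreadLocal<SoftReference<SegmentTermEnum>>();

private SegmentTermEnum getEnum() {
    SoftReference<SegmentTermEnum> ref = enumerators.get();
    SegmentTermEnum termEnum = (ref == null) ? null : ref.get();
    if (termEnum == null) { // first use on this thread, or cleared by GC
        termEnum = (SegmentTermEnum) origEnum.clone();
        enumerators.set(new SoftReference<SegmentTermEnum>(termEnum));
    }
    return termEnum;
}
{code}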
 
I spent as much time as I could trying to come up with a clear, simple test 
case that applied regardless of memory constraints.
A clear failure case must be possible, but I did not have time to investigate 
the criteria for ThreadLocal instability.
In any case, I would send this to Sun.

A test case is one thing; knowing, understanding and fixing/working around an 
issue is another.
In all the simple cases I tried I got stability but with higher memory use and 
gc activity than with the fixed version.
However, I did also remove the pointless finalize() method, which could very 
well explain the growth of the thread local table.

We have had problems with 1.5.0_06. The issue is caused by the pattern of 
thread local use and garbage collection producing instability in the size of 
the thread local table. Your single test case does not imply that the issue 
does not exist for other JVMs and use cases.  I have had the issue without 
using RAMDirectory - it seems it is just more likely with it.

By the way, cause and effect is sufficient for me. I have a problem with using 
the 1.4.3 code, this change fixes it.

Regards

Andy

 [PATCH] TermInfosReader, SegmentTermEnum Out Of Memory Exception
 

  Key: LUCENE-436
  URL: http://issues.apache.org/jira/browse/LUCENE-436
  Project: Lucene - Java
 Type: Improvement

   Components: Index
 Versions: 1.4
  Environment: Solaris JVM 1.4.1
 Linux JVM 1.4.2/1.5.0
 Windows not tested
 Reporter: kieran
  Attachments: FixedThreadLocal.java, Lucene-436-TestCase.tar.gz, 
 ThreadLocalTest.java

 We've been experiencing terrible memory problems on our production search 
 server, running lucene (1.4.3).
 Our live app regularly opens new indexes and, in doing so, releases old 
 IndexReaders for garbage collection.
 But...there appears to be a memory leak in 
 org.apache.lucene.index.TermInfosReader.java.
 Under certain conditions (possibly related to JVM version, although I've 
 personally observed it under both linux JVM 1.4.2_06, and 1.5.0_03, and SUNOS 
 JVM 1.4.1) the ThreadLocal member variable, enumerators doesn't get 
 garbage-collected when the TermInfosReader object is gc-ed.
 Looking at the code in TermInfosReader.java, there's no reason why it 
 _shouldn't_ be gc-ed, so I can only presume (and I've seen this suggested 
 elsewhere) that there could be a bug in the garbage collector of some JVMs.
 I've seen this problem briefly discussed; in particular at the following URL:
   http://java2.5341.com/msg/85821.html
 The patch that Doug recommended, which is included in lucene-1.4.3 doesn't 
 work in our particular circumstances. Doug's patch only clears the 
 ThreadLocal variable for the thread running the finalizer (my knowledge of 
 java breaks down here - I'm not sure which thread actually runs the 
 finalizer). In our situation, the TermInfosReader is (potentially) used by 
 more than one thread, meaning that Doug's patch _doesn't_ allow the affected 
 JVMs to correctly collect garbage.
 So...I've devised a simple patch which, from my observations on linux JVMs 
 1.4.2_06, and 1.5.0_03, fixes this problem.
 Kieran
 PS Thanks to daniel naber for pointing me to jira/lucene
 @@ -19,6 +19,7 @@
  import java.io.IOException;
  import org.apache.lucene.store.Directory;
 +import java.util.Hashtable;
  /** This stores a monotonically increasing set of Term, TermInfo pairs in a
   * Directory.  Pairs are accessed either by Term or by ordinal position the
 @@ -29,7 +30,7 @@
private String segment;
private FieldInfos fieldInfos;
 -  private ThreadLocal enumerators = new ThreadLocal();
 +  private final Hashtable enumeratorsByThread = new Hashtable();
private SegmentTermEnum origEnum;
private long size;
 @@ -60,10 +61,10 @@
}
private SegmentTermEnum getEnum() {
 -SegmentTermEnum termEnum = 

[jira] Updated: (LUCENE-529) TermInfosReader and other + instance ThreadLocal = transient/odd memory leaks = OutOfMemoryException

2006-03-23 Thread Andy Hind (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-529?page=all ]

Andy Hind updated LUCENE-529:
-

Attachment: ThreadLocalTest.java

Attached is a test which you can use to see how ThreadLocals are left around.
Getting an out of memory exception depends on a number of things... it is set 
up to fail for 64M.

Now I understand what is going on, there are a few alternatives:

1) set null on close
- fine for single thread use
- probably leaves (n-1)*segments*2 things hanging around for n-threaded use

2) Use a weak reference and leave it up to GC to get rid of the referent when 
it is not being used

3) Manage the things yourself by object id and thread id - and clean up on 
object close()

I would go with option 1) and 2) although it may mean things get GCed before a 
call to close() when not used.
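
Option 1) amounts to something like this (sketch):

{code}
// On close(), drop this thread's cached value so it becomes eligible
// for GC; entries created by other threads remain until touched.
public void close() throws IOException {
    enumerators.set(null); // or ThreadLocal.remove() on Java 5+
    // ... release the other resources as before ...
}
{code}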

The fix I initially suggested is in production, and has been stress tested with 
a couple of hundred users continually pounding the app, but not for 
multithreaded use of IndexReaders. Each user does a couple of simple searches 
with no clever reuse of index readers (which is on the todo list).

I do not see how setting the thread local to null on close() has any negative 
impact. You are not going to use the cached information again?

Before the fix: 10-100 threads - 1G JVM - OOM in a few hours.
After: 10-100 threads - 256M JVM - days with a flat memory footprint.

I am not sure why the thread local table is so big for us, but that is not 
really the issue.
It could just be building lots of IndexReaders (with thread locals hanging 
around - probably making 10 per instance) and GC not kicking in, so this table 
grows and can hold a lot of stale entries. I may get time to investigate further.

 TermInfosReader and other + instance ThreadLocal = transient/odd memory 
 leaks =  OutOfMemoryException
 ---

  Key: LUCENE-529
  URL: http://issues.apache.org/jira/browse/LUCENE-529
  Project: Lucene - Java
 Type: Bug
   Components: Index
 Versions: 1.9
 Environment: Lucene 1.4.3 with 1.5.0_04 JVM or newer... will apply to 1.9 
 code 
 Reporter: Andy Hind
  Attachments: ThreadLocalTest.java

 TermInfosReader uses an instance level ThreadLocal for enumerators.
 This is a transient/odd memory leak in lucene 1.4.3-1.9 and applies to 
 current JVMs, 
 not just an old JVM issue as described in the finalizer of the 1.9 code.
 There is also an instance level thread local in SegmentReaderwhich will 
 have the same issue.
 There may be other uses which also need to be fixed.
 I don't understand the intended use for these variables... however
 Each ThreadLocal has its own hashcode used for look up, see the ThreadLocal 
 source code. Each instance of TermInfosReader will be creating an instance of 
 the thread local. All this does is create an instance variable on each thread 
 when it accesses the thread local. Setting it to null in the finaliser will 
 set it to null on one thread, the finalizer thread, where it has never been 
 created.  There is no point to this :-(
 I assume there is a good concurrency reason why an instance variable can not 
 be used...
 I have not used multi-threaded searching, but I have used a lot of threads 
 each making searchers and searching.
 1.4.3 has a clear memory leak caused by this thread local. This use case 
 above is definitely solved by setting the thread local to null in the 
 close(). This at least has a chance of being on the correct thread :-) 
 I know reusing Searchers would help but that is my choice and I will get to 
 that later  
 Now you want to know why...
 Thread locals are stored in a table of entries. Each entry is *weak 
 reference* to the key (Here the TermInfosReader instance)  and a *simple 
 reference* to the thread local value. When the instance is GCed its key 
 becomes null. 
 This is now a stale entry in the table.
 Stale entries are cleared up in an ad hoc way and until they are cleared up 
 the value will not be garbage collected.
 Until the instance is GCed it is a valid key and its presence may cause the 
 table to expand.
 See the ThreadLocal code.
 So if you have lots of threads, all creating thread locals rapidly, you can 
 get each thread holding a large table of thread locals which all contain many 
 stale entries and preventing some objects from being garbage collected. 
 The limited GC of the thread local table is not enough to save you from 
 running out of memory.  
 Summary:
 
 - remove finalizer()
 - set the thread local to null in close() 
   - values will be available for gc 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (LUCENE-529) TermInfosReader and other + instance ThreadLocal = transient/odd memory leaks = OutOfMemoryException

2006-03-22 Thread Andy Hind (JIRA)
TermInfosReader and other + instance ThreadLocal = transient/odd memory leaks 
=  OutOfMemoryException 


 Key: LUCENE-529
 URL: http://issues.apache.org/jira/browse/LUCENE-529
 Project: Lucene - Java
Type: Bug
  Components: Index  
Versions: 1.9
Environment: Lucene 1.4.3 with 1.5.0_04 JVM or newer... will apply to 1.9 
code 
Reporter: Andy Hind


TermInfosReader uses an instance level ThreadLocal for enumerators.
This is a transient/odd memory leak in lucene 1.4.3-1.9 and applies to current 
JVMs, 
not just an old JVM issue as described in the finalizer of the 1.9 code.

There is also an instance level thread local in SegmentReaderwhich will 
have the same issue.
There may be other uses which also need to be fixed.

I don't understand the intended use for these variables... however

Each ThreadLocal has its own hashcode used for look up, see the ThreadLocal 
source code. Each instance of TermInfosReader will be creating an instance of 
the thread local. All this does is create an instance variable on each thread 
when it accesses the thread local. Setting it to null in the finaliser will set 
it to null on one thread, the finalizer thread, where it has never been 
created.  There is no point to this :-(

I assume there is a good concurrency reason why an instance variable can not be 
used...

I have not used multi-threaded searching, but I have used a lot of threads each 
making searchers and searching.
1.4.3 has a clear memory leak caused by this thread local. This use case above 
is definitely solved by setting the thread local to null in the close(). This 
at least has a chance of being on the correct thread :-) 
I know reusing Searchers would help but that is my choice and I will get to 
that later  

Now you want to know why...

Thread locals are stored in a table of entries. Each entry is *weak reference* 
to the key (Here the TermInfosReader instance)  and a *simple reference* to the 
thread local value. When the instance is GCed its key becomes null. 
This is now a stale entry in the table.
Stale entries are cleared up in an ad hoc way and until they are cleared up the 
value will not be garbage collected.
Until the instance is GCed it is a valid key and its presence may cause the 
table to expand.
See the ThreadLocal code.

So if you have lots of threads, all creating thread locals rapidly, you can get 
each thread holding a large table of thread locals which all contain many stale 
entries and preventing some objects from being garbage collected. 
The limited GC of the thread local table is not enough to save you from running 
out of memory.  

Summary:

- remove finalizer()
- set the thread local to null in close() 
  - values will be available for gc 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-530) Extend NumberTools to support int/long/float/double to string

2006-03-22 Thread Andy Hind (JIRA)
Extend NumberTools to support int/long/float/double to string 
--

 Key: LUCENE-530
 URL: http://issues.apache.org/jira/browse/LUCENE-530
 Project: Lucene - Java
Type: Improvement
  Components: Analysis  
Versions: 1.9
Reporter: Andy Hind
Priority: Minor


Extend NumberTools to support int/long/float/double to string 

So you can search using range queries on int/long/float/double, if you want.

Here is the basis for how NumberTools could be extended to support 
int/long/double/float.
As I only write these values to the index and fix tokenisation in searches, I 
was not so fussed about the reverse transformations back to Strings.



public class NumericEncoder
{
    /*
     * Constants for integer encoding
     */

    static int INTEGER_SIGN_MASK = 0x80000000;

    /*
     * Constants for long encoding
     */

    static long LONG_SIGN_MASK = 0x8000000000000000L;

    /*
     * Constants for float encoding
     */

    static int FLOAT_SIGN_MASK = 0x80000000;

    static int FLOAT_EXPONENT_MASK = 0x7F800000;

    static int FLOAT_MANTISSA_MASK = 0x007FFFFF;

    /*
     * Constants for double encoding
     */

    static long DOUBLE_SIGN_MASK = 0x8000000000000000L;

    static long DOUBLE_EXPONENT_MASK = 0x7FF0000000000000L;

    static long DOUBLE_MANTISSA_MASK = 0x000FFFFFFFFFFFFFL;

    private NumericEncoder()
    {
        super();
    }

    /**
     * Encode an integer into a string that orders correctly using string
     * comparison. Integer.MIN_VALUE encodes as 00000000 and MAX_VALUE as
     * ffffffff.
     * 
     * @param intToEncode
     * @return
     */
    public static String encode(int intToEncode)
    {
        // Flipping the sign bit maps the signed range onto an unsigned
        // range with the same ordering.
        int replacement = intToEncode ^ INTEGER_SIGN_MASK;
        return encodeToHex(replacement);
    }

    /**
     * Encode a long into a string that orders correctly using string comparison.
     * Long.MIN_VALUE encodes as 0000000000000000 and MAX_VALUE as
     * ffffffffffffffff.
     * 
     * @param longToEncode
     * @return
     */
    public static String encode(long longToEncode)
    {
        long replacement = longToEncode ^ LONG_SIGN_MASK;
        return encodeToHex(replacement);
    }

    /**
     * Encode a float into a string that orders correctly according to string
     * comparison. Note that there is no negative NaN but there are codings that
     * imply this. So NaN and -Infinity may not compare as expected.
     * 
     * @param floatToEncode
     * @return
     */
    public static String encode(float floatToEncode)
    {
        int bits = Float.floatToIntBits(floatToEncode);
        int sign = bits & FLOAT_SIGN_MASK;
        int exponent = bits & FLOAT_EXPONENT_MASK;
        int mantissa = bits & FLOAT_MANTISSA_MASK;
        if (sign != 0)
        {
            // Negative values: invert exponent and mantissa so that more
            // negative floats order first.
            exponent ^= FLOAT_EXPONENT_MASK;
            mantissa ^= FLOAT_MANTISSA_MASK;
        }
        sign ^= FLOAT_SIGN_MASK;
        int replacement = sign | exponent | mantissa;
        return encodeToHex(replacement);
    }

    /**
     * Encode a double into a string that orders correctly according to string
     * comparison. Note that there is no negative NaN but there are codings that
     * imply this. So NaN and -Infinity may not compare as expected.
     * 
     * @param doubleToEncode
     * @return
     */
    public static String encode(double doubleToEncode)
    {
        long bits = Double.doubleToLongBits(doubleToEncode);
        long sign = bits & DOUBLE_SIGN_MASK;
        long exponent = bits & DOUBLE_EXPONENT_MASK;
        long mantissa = bits & DOUBLE_MANTISSA_MASK;
        if (sign != 0)
        {
            exponent ^= DOUBLE_EXPONENT_MASK;
            mantissa ^= DOUBLE_MANTISSA_MASK;
        }
        sign ^= DOUBLE_SIGN_MASK;
        long replacement = sign | exponent | mantissa;
        return encodeToHex(replacement);
    }

    private static String encodeToHex(int i)
    {
        char[] buf = new char[] { '0', '0', '0', '0', '0', '0', '0', '0' };
        int charPos = 8;
        do
        {
            buf[--charPos] = DIGITS[i & MASK];
            i >>>= 4;
        }
        while (i != 0);
        return new String(buf);
    }

    private static String encodeToHex(long l)
    {
        char[] buf = new char[] { '0', '0', '0', '0', '0', '0', '0', '0', '0',
                '0', '0', '0', '0', '0', '0', '0' };
        int charPos = 16;
        do
        {
            buf[--charPos] = DIGITS[(int) l & MASK];
            l >>>= 4;
        }
        while (l != 0);
        return new String(buf);
    }

    private static final char[] DIGITS = { '0', '1', '2', '3', '4', '5', '6',
            '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f' };

    private static final int MASK = (1 << 4) - 1;
}
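
A quick illustration of the ordering property the encodings are meant to
preserve (illustrative values):

{code}
// Lexicographic order of the encoded strings matches numeric order.
assert NumericEncoder.encode(-1).compareTo(NumericEncoder.encode(0)) < 0;
assert NumericEncoder.encode(0).compareTo(NumericEncoder.encode(1)) < 0;
assert NumericEncoder.encode(-2.5f).compareTo(NumericEncoder.encode(1.5f)) < 0;
assert NumericEncoder.encode(Integer.MIN_VALUE).equals("00000000");
assert NumericEncoder.encode(Integer.MAX_VALUE).equals("ffffffff");
{code}
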
public class NumericEncodingTest extends TestCase
{

public NumericEncodingTest()
{
super();
}

public NumericEncodingTest(String arg0)
{
super(arg0);
}

[jira] Commented: (LUCENE-415) Merge error during add to index (IndexOutOfBoundsException)

2006-03-21 Thread Andy Hind (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-415?page=comments#action_12371284 ] 

Andy Hind commented on LUCENE-415:
--

We have tested the above solution pretty heavily since 18/11/2005 and would 
regard it as stable in 1.4.3.

Looking at the 1.9 code stream the issue is likely to be present, unless there 
is some other code that checks if an index segment file already exists when not 
expected, or the next segment is generated based on the segments that actually 
exist in the directory.

In 1.4.3, in FSDirectory:

..
final class FSOutputStream extends OutputStream
{
    RandomAccessFile file = null;

    public FSOutputStream(File path) throws IOException
    {
        file = new RandomAccessFile(path, "rw");
        file.setLength(0);
        file.getChannel();
    }
..

will sort this issue and some other file handle issues I have seen under XP.

Something similar is likely to be required in FSIndexOutput in the 1.9 code 
line.



 Merge error during add to index (IndexOutOfBoundsException)
 ---

  Key: LUCENE-415
  URL: http://issues.apache.org/jira/browse/LUCENE-415
  Project: Lucene - Java
 Type: Bug
   Components: Index
 Versions: 1.4
  Environment: Operating System: Linux
 Platform: Other
 Reporter: Daniel Quaroni
 Assignee: Lucene Developers


 I've been batch-building indexes, and I've build a couple hundred indexes 
 with 
 a total of around 150 million records.  This only happened once, so it's 
 probably impossible to reproduce, but anyway... I was building an index with 
 around 9.6 million records, and towards the end I got this:
 java.lang.IndexOutOfBoundsException: Index: 54, Size: 24
 at java.util.ArrayList.RangeCheck(ArrayList.java:547)
 at java.util.ArrayList.get(ArrayList.java:322)
 at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
 at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:151)
 at 
 org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java
 :149)
 at org.apache.lucene.index.SegmentTermEnum.next
 (SegmentTermEnum.java:115)
 at org.apache.lucene.index.SegmentMergeInfo.next
 (SegmentMergeInfo.java:52)
 at org.apache.lucene.index.SegmentMerger.mergeTermInfos
 (SegmentMerger.java:294)
 at org.apache.lucene.index.SegmentMerger.mergeTerms
 (SegmentMerger.java:254)
 at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:93)
 at org.apache.lucene.index.IndexWriter.mergeSegments
 (IndexWriter.java:487)
 at org.apache.lucene.index.IndexWriter.maybeMergeSegments
 (IndexWriter.java:458)
 at 
 org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:310)
 at 
 org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:294)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-415) Merge error during add to index (IndexOutOfBoundsException)

2006-03-21 Thread Andy Hind (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-415?page=comments#action_12371385 ] 

Andy Hind commented on LUCENE-415:
--

The problem is that the output is going into a file that already exists. 
I assume it leaves and then finds old bits during random access and gets 
confused.

If a merge fails while it is writing its output segment file you have a segment 
file that contains rubbish.
This can occur if you are unlucky when you kill the JVM (and to repeat the 
problem, set a break point and kill the JVM just before the segment write 
completes). The next time a merge takes place it writes to the segment file 
that already exists - as the same file name is generated for the new segment 
file. It always blows up with an error similar to that reported for this bug.

The file.getChannel() solved some fairly odd but repeatable issues with 
stale/invalid file handles under windows XP.



 Merge error during add to index (IndexOutOfBoundsException)
 ---

  Key: LUCENE-415
  URL: http://issues.apache.org/jira/browse/LUCENE-415
  Project: Lucene - Java
 Type: Bug
   Components: Index
 Versions: 1.4
  Environment: Operating System: Linux
 Platform: Other
 Reporter: Daniel Quaroni
 Assignee: Lucene Developers


 I've been batch-building indexes, and I've build a couple hundred indexes 
 with 
 a total of around 150 million records.  This only happened once, so it's 
 probably impossible to reproduce, but anyway... I was building an index with 
 around 9.6 million records, and towards the end I got this:
 java.lang.IndexOutOfBoundsException: Index: 54, Size: 24
 at java.util.ArrayList.RangeCheck(ArrayList.java:547)
 at java.util.ArrayList.get(ArrayList.java:322)
 at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
 at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:151)
 at 
 org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java
 :149)
 at org.apache.lucene.index.SegmentTermEnum.next
 (SegmentTermEnum.java:115)
 at org.apache.lucene.index.SegmentMergeInfo.next
 (SegmentMergeInfo.java:52)
 at org.apache.lucene.index.SegmentMerger.mergeTermInfos
 (SegmentMerger.java:294)
 at org.apache.lucene.index.SegmentMerger.mergeTerms
 (SegmentMerger.java:254)
 at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:93)
 at org.apache.lucene.index.IndexWriter.mergeSegments
 (IndexWriter.java:487)
 at org.apache.lucene.index.IndexWriter.maybeMergeSegments
 (IndexWriter.java:458)
 at 
 org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:310)
 at 
 org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:294)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-415) Merge error during add to index (IndexOutOfBoundsException)

2005-11-17 Thread Andy Hind (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-415?page=comments#action_12357882 ] 

Andy Hind commented on LUCENE-415:
--

And I can reproduce it... on 1.4.3.

When FSDirectory.createFile creates a FSOutputStream the random access file may 
already exist and contain data. The content is not cleaned out.

So if segment merging is taking place to a new segment, and the merge has 
written data to this file and the machine crashes/the app is terminated, you 
can end up with a partial or full segment file that the segment infos know 
nothing about. If you restart, then any merge will try to reuse the same file 
name... and the content it contains.

To reproduce the issue I created the next segment file by copying one that 
already exists... and bang, on the next merge.

I suggest that FSOutputStream sets the file length to 0 on initialisation (as 
well as opening the channel to the file, which also sorts some nasty deferred 
IO errors in Windows XP at least).

I am not sure of any side effect of this but will test it.

We are seeing this 2-3 times a day if under heavy load, or single threaded and 
killing the app at random, which may be in the process of a segment write...


 Merge error during add to index (IndexOutOfBoundsException)
 ---

  Key: LUCENE-415
  URL: http://issues.apache.org/jira/browse/LUCENE-415
  Project: Lucene - Java
 Type: Bug
   Components: Index
 Versions: 1.4
  Environment: Operating System: Linux
 Platform: Other
 Reporter: Daniel Quaroni
 Assignee: Lucene Developers


 I've been batch-building indexes, and I've build a couple hundred indexes 
 with 
 a total of around 150 million records.  This only happened once, so it's 
 probably impossible to reproduce, but anyway... I was building an index with 
 around 9.6 million records, and towards the end I got this:
 java.lang.IndexOutOfBoundsException: Index: 54, Size: 24
 at java.util.ArrayList.RangeCheck(ArrayList.java:547)
 at java.util.ArrayList.get(ArrayList.java:322)
 at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
 at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:151)
 at 
 org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java
 :149)
 at org.apache.lucene.index.SegmentTermEnum.next
 (SegmentTermEnum.java:115)
 at org.apache.lucene.index.SegmentMergeInfo.next
 (SegmentMergeInfo.java:52)
 at org.apache.lucene.index.SegmentMerger.mergeTermInfos
 (SegmentMerger.java:294)
 at org.apache.lucene.index.SegmentMerger.mergeTerms
 (SegmentMerger.java:254)
 at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:93)
 at org.apache.lucene.index.IndexWriter.mergeSegments
 (IndexWriter.java:487)
 at org.apache.lucene.index.IndexWriter.maybeMergeSegments
 (IndexWriter.java:458)
 at 
 org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:310)
 at 
 org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:294)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]