[jira] [Updated] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3821:

Attachment: LUCENE-3821-SloppyDecays.patch

Patch adds NonExactPhraseScorer (temporary name) as discussed above - work in progress, it does not yet do any sloppy matching or scoring.

SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
---
Key: LUCENE-3821
URL: https://issues.apache.org/jira/browse/LUCENE-3821
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 3.5, 4.0
Reporter: Naomi Dushay
Assignee: Doron Cohen
Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml

The general bug is a case where a phrase with no slop is found, but if you add slop it's not. I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case; Jenkins just hasn't had enough time to chew on it.

ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100

is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3821:

Attachment: LUCENE-3821.patch

Attached updated patch. Repeating PPs with multi-phrase query are handled as well. This called for more cases in the sloppy phrase scorer and more code, and, although I think the code is cleaner now, I don't know to what extent it is easier to maintain. It definitely fixes wrong behavior that exists in current 3x and trunk (patch is for 3x).

However, although the random test passes for me even with -Dtests.iter=2000, it is possible to break the scorer - that is, to create a document and a query which should match each other but would not. The patch adds just such a case as an ignored (@Ignore) test case: TestMultiPhraseQuery.testMultiSloppyWithRepeats(). I don't see how to solve this specific case in the context of the current sloppy phrase scorer.

So there are 3 options:
# leave things as they are
# commit this patch and for now document the failing scenario (also keep it in the ignored test case).
# devise a different algorithm for this.

I would love it to be the 3rd if I just knew how to do it. Otherwise I like the 2nd; we just need to keep in mind that the random test might from time to time create this scenario, and so there will be noise in the test builds.

Preferences?

SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
---
Key: LUCENE-3821
URL: https://issues.apache.org/jira/browse/LUCENE-3821
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 3.5, 4.0
Reporter: Naomi Dushay
Assignee: Doron Cohen
Attachments: LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml

The general bug is a case where a phrase with no slop is found, but if you add slop it's not. I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case; Jenkins just hasn't had enough time to chew on it.

ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100

is enough to make it fail on trunk or 3.x
[jira] [Updated] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3821:

Attachment: LUCENE-3821.patch

Updated patch with fixed MFQ.toString(), which prints the problematic doc and queries in case of failure.

SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
---
Key: LUCENE-3821
URL: https://issues.apache.org/jira/browse/LUCENE-3821
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 3.5, 4.0
Reporter: Naomi Dushay
Assignee: Doron Cohen
Attachments: LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml

The general bug is a case where a phrase with no slop is found, but if you add slop it's not. I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case; Jenkins just hasn't had enough time to chew on it.

ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100

is enough to make it fail on trunk or 3.x
[jira] [Updated] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3821:

Attachment: LUCENE-3821.patch

Patch with fix for this problem. I would expect SloppyPhrase scoring performance to degrade, though I did not measure it.

The single test that still fails (and I think the bug is in ExactPhraseScorer) is testRandomIncreasingSloppiness, and it can be recreated like this:
{noformat}
ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtestmethod=testRandomIncreasingSloppiness -Dtests.seed=47267613db69f714:-617bb800c4a3c645:-456a673444fdc184 -Dargs=-Dfile.encoding=UTF-8
{noformat}

SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
---
Key: LUCENE-3821
URL: https://issues.apache.org/jira/browse/LUCENE-3821
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 3.5, 4.0
Reporter: Naomi Dushay
Assignee: Doron Cohen
Attachments: LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml

The general bug is a case where a phrase with no slop is found, but if you add slop it's not. I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case; Jenkins just hasn't had enough time to chew on it.

ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100

is enough to make it fail on trunk or 3.x
[jira] [Updated] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3821:

Attachment: LUCENE-3821.patch

bq. Hmm patch has this: ... import backport.api...

Oops, here's a fixed patch; I also added some comments and removed the @Ignore from the test.

bq. I'm going to be ecstatic if that crazy test finds bugs both in exact and sloppy phrase scorers :)

It is a great test! Interestingly, one thing it exposed is the dependency of the SloppyPhraseScorer on the order of PPs in PhraseScorer when phraseFreq() is invoked. The way things work in the super class, that order depends on the content of previously processed documents. This fix removes that wrong dependency, of course. The point is that deliberately devising a test that exposes such a bug seems almost impossible: first you need to think about such a case, and if you did, writing a test that creates this specific scenario is error-prone in itself. Praise to random testing, and this random test in particular.

SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
---
Key: LUCENE-3821
URL: https://issues.apache.org/jira/browse/LUCENE-3821
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 3.5, 4.0
Reporter: Naomi Dushay
Assignee: Doron Cohen
Attachments: LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml

The general bug is a case where a phrase with no slop is found, but if you add slop it's not. I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case; Jenkins just hasn't had enough time to chew on it.

ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100

is enough to make it fail on trunk or 3.x
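The invariant behind this issue (a document that matches a phrase with no slop must also match it with slop) can be sketched without Lucene. The code below is only an illustration under stated assumptions: PhraseInvariantSketch, matches() and countViolations() are hypothetical names, and matches() is a brute-force, simplified model of sloppy matching, not the algorithm used by SloppyPhraseScorer or by TestSloppyPhraseQuery2.

```java
import java.util.Random;

/** Illustration only: a brute-force model of phrase matching, not Lucene's scorer. */
final class PhraseInvariantSketch {

    /**
     * True if each phrase term can be assigned its own position in doc so that the
     * spread of (position - offsetInPhrase) over all terms is at most slop.
     * This definition of slop is an assumption made for the sketch.
     */
    static boolean matches(String[] doc, String[] phrase, int slop) {
        if (phrase.length == 0) return false;
        return search(doc, phrase, slop, 0, new int[phrase.length]);
    }

    private static boolean search(String[] doc, String[] phrase, int slop, int i, int[] chosen) {
        if (i == phrase.length) {
            int min = Integer.MAX_VALUE;
            int max = Integer.MIN_VALUE;
            for (int k = 0; k < chosen.length; k++) {
                int shifted = chosen[k] - k; // position relative to the phrase start
                min = Math.min(min, shifted);
                max = Math.max(max, shifted);
            }
            return max - min <= slop;
        }
        for (int pos = 0; pos < doc.length; pos++) {
            // Repeating terms must not share a document position.
            if (!doc[pos].equals(phrase[i]) || alreadyUsed(chosen, i, pos)) continue;
            chosen[i] = pos;
            if (search(doc, phrase, slop, i + 1, chosen)) return true;
        }
        return false;
    }

    private static boolean alreadyUsed(int[] chosen, int upTo, int pos) {
        for (int k = 0; k < upTo; k++) {
            if (chosen[k] == pos) return true;
        }
        return false;
    }

    /** Counts cases where an exact match (slop 0) is lost once slop is added. */
    static int countViolations(long seed, int iterations) {
        Random random = new Random(seed);
        String[] alphabet = {"a", "b", "c"}; // tiny alphabet forces repeated terms
        int violations = 0;
        for (int it = 0; it < iterations; it++) {
            String[] doc = randomTerms(random, alphabet, 8);
            String[] phrase = randomTerms(random, alphabet, 2 + random.nextInt(2));
            if (!matches(doc, phrase, 0)) continue;
            for (int slop = 1; slop <= 3; slop++) {
                if (!matches(doc, phrase, slop)) violations++;
            }
        }
        return violations;
    }

    private static String[] randomTerms(Random random, String[] alphabet, int length) {
        String[] terms = new String[length];
        for (int i = 0; i < length; i++) {
            terms[i] = alphabet[random.nextInt(alphabet.length)];
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println("violations: " + countViolations(42L, 1000));
    }
}
```

In a real check, a scorer under test would take the place of matches(); the small alphabet is what makes repeating terms, the hard case discussed in this issue, show up often.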
[jira] [Updated] (LUCENE-3746) suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()
[ https://issues.apache.org/jira/browse/LUCENE-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3746:

Attachment: LUCENE-3746.patch

Updated patch using ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().

Javadocs are not explicit about this call being atomic, but from the wording it seems almost certain that each call returns a new Usage instance. In this patch this is (Java) asserted and the assert passes (-ea) in two different JVMs - IBM and Oracle - so this might be correct. I searched for more explicit info on this with no success.

Annoyingly though, in IBM JDK, running the tests like this produces the nice warning:
{noformat}
WARNING: test class left thread running: Thread[MemoryPoolMXBean notification dispatcher,6,main]
RESOURCE LEAK: test class left 1 thread(s) running
{noformat}

This makes me reluctant to use the memory bean - I did not find a way to prevent that thread leak.

So perhaps a better approach would be to be conservative about the sequence of calls when using Runtime? Something like this:
{code}
long free = rt.freeMemory();
if (free is sufficient)
  return decideBy(free);
long max = rt.maxMemory();
long total = rt.totalMemory();
return decideBy(max - total);
{code}

This is conservative in that 'total' is computed last, and in that total-free is not added to the computed available bytes.

In both approaches, even if atomicity is guaranteed, it is possible that more heap is allocated in another thread between the time that the size is computed and the time that the bytes are actually allocated, so I am not sure how safe this check can be made.
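The Runtime-based pseudocode above could be written roughly as follows. This is a minimal sketch, not the code in the attached patch: AvailableHeapSketch, availableBytes() and headroom() are hypothetical names, and the decision logic (decideBy in the pseudocode) is left to the caller.

```java
/** Sketch of a conservative "how much heap can I still get" estimate. */
final class AvailableHeapSketch {

    /**
     * Returns an estimate of the bytes available for a buffer of the requested size.
     * If the currently free heap already suffices, that figure is returned;
     * otherwise only the room by which the heap can still grow is reported.
     */
    static long availableBytes(long needed) {
        Runtime rt = Runtime.getRuntime();
        long free = rt.freeMemory();
        if (free >= needed) {
            return free;
        }
        long max = rt.maxMemory();
        // totalMemory() is read last: if the heap grows between the calls,
        // the larger total yields a smaller (more conservative) estimate.
        long total = rt.totalMemory();
        return headroom(max, total);
    }

    /** Room left for heap growth; deliberately ignores (total - free). */
    static long headroom(long max, long total) {
        return Math.max(0L, max - total);
    }

    public static void main(String[] args) {
        long halfMb = 512L * 1024L;
        System.out.println("estimate for 0.5MB request: " + availableBytes(halfMb));
    }
}
```

As noted in the comment above, the result is only a snapshot: another thread can allocate between this check and the actual allocation, so a caller still has to be prepared for the allocation to fail.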
suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()
--
Key: LUCENE-3746
URL: https://issues.apache.org/jira/browse/LUCENE-3746
Project: Lucene - Java
Issue Type: Bug
Components: modules/spellchecker
Reporter: Doron Cohen
Fix For: 3.6, 4.0
Attachments: LUCENE-3746.patch, LUCENE-3746.patch

Follow up on dev thread: [FSTCompletionTest failure At least 0.5MB RAM buffer is needed | http://markmail.org/message/d7ugfo5xof4h5jeh]
[jira] [Updated] (LUCENE-3746) suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()
[ https://issues.apache.org/jira/browse/LUCENE-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3746:

Attachment: LUCENE-3746.patch

Updated patch - without MemoryMXBean - computing 'max, total, free' (in that order) and deciding by 'free' or falling back to 'max-free'.

This is more conservative than MemoryMXBean, but since the latter is not foolproof either, I prefer the simpler approach.

suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()
--
Key: LUCENE-3746
URL: https://issues.apache.org/jira/browse/LUCENE-3746
Project: Lucene - Java
Issue Type: Bug
Components: modules/spellchecker
Reporter: Doron Cohen
Fix For: 3.6, 4.0
Attachments: LUCENE-3746.patch, LUCENE-3746.patch, LUCENE-3746.patch

Follow up on dev thread: [FSTCompletionTest failure At least 0.5MB RAM buffer is needed | http://markmail.org/message/d7ugfo5xof4h5jeh]
[jira] [Updated] (LUCENE-3746) suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()
[ https://issues.apache.org/jira/browse/LUCENE-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3746:

Attachment: LUCENE-3746.patch

Simple fix: also consult maxMemory if freeMemory does not suffice.

suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()
--
Key: LUCENE-3746
URL: https://issues.apache.org/jira/browse/LUCENE-3746
Project: Lucene - Java
Issue Type: Bug
Components: modules/spellchecker
Reporter: Doron Cohen
Fix For: 3.6, 4.0
Attachments: LUCENE-3746.patch

Follow up on dev thread: [FSTCompletionTest failure At least 0.5MB RAM buffer is needed | http://markmail.org/message/d7ugfo5xof4h5jeh]
[jira] [Updated] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
[ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-1812:

Attachment: pruning.patch

Updated patch: package.html and all pruning classes moved to another package, except for PruningReader. Now ant javadocs-all passes as well.

There are 3 TODO's:
# implement CarmelTermPruningDeltaTopPolicy
# dead code question in CarmelUniformTermPruningPolicy
# missing details in package.html

The first one can wait but the other two I would like to handle before committing.

Static index pruning by in-document term frequency (Carmel pruning)
---
Key: LUCENE-1812
URL: https://issues.apache.org/jira/browse/LUCENE-1812
Project: Lucene - Java
Issue Type: New Feature
Components: modules/other
Reporter: Andrzej Bialecki
Assignee: Doron Cohen
Fix For: 3.6, 4.0
Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch

This module provides tools to produce a subset of input indexes by removing postings data for those terms where their in-document frequency is below a specified threshold. The net effect of this processing is a much smaller index that for common types of queries returns nearly identical top-N results as compared with the original index, but with increased performance.

Optionally, stored values and term vectors can also be removed. This functionality is largely independent, so it can be used without term pruning (when term freq. threshold is set to 1).

As the threshold value increases, the total size of the index decreases, search performance increases, and recall decreases (i.e. search quality deteriorates). NOTE: especially phrase recall deteriorates significantly at higher threshold values.

Primary purpose of this class is to produce small first-tier indexes that fit completely in RAM, and store these indexes using IndexWriter.addIndexes(IndexReader[]).
Usually the performance of this class will not be sufficient to use the resulting index view for on-the-fly pruning and searching.

NOTE: If the input index is optimized (i.e. doesn't contain deletions) then the index produced via IndexWriter.addIndexes(IndexReader[]) will preserve internal document id-s so that they are in sync with the original index. This means that all other auxiliary information not necessary for first-tier processing, such as some stored fields, can also be removed, to be quickly retrieved on-demand from the original index using the same internal document id.

Threshold values can be specified globally (for terms in all fields) using the defaultThreshold parameter, and can be overridden using per-field or per-term values supplied in a thresholds map. Keys in this map are either field names, or terms in field:text format. The precedence of these values is the following: first a per-term threshold is used if present, then per-field threshold if present, and finally the default threshold.

A command-line tool (PruningTool) is provided for convenience. At this moment it doesn't support all functionality available through the API.
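The threshold precedence described above (per-term, then per-field, then default) can be sketched as a simple map lookup. This is an illustration only, not code from the pruning patch: ThresholdLookupSketch and thresholdFor() are hypothetical names, and only the key formats (field name, or field:text) and the defaultThreshold parameter come from the description.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the per-term / per-field / default threshold precedence. */
final class ThresholdLookupSketch {

    /**
     * Resolves the threshold for one term. Keys in the map are either a field
     * name ("body") or a term in field:text format ("body:lucene").
     */
    static int thresholdFor(Map<String, Integer> thresholds, int defaultThreshold,
                            String field, String text) {
        Integer perTerm = thresholds.get(field + ":" + text);
        if (perTerm != null) {
            return perTerm; // most specific entry wins
        }
        Integer perField = thresholds.get(field);
        if (perField != null) {
            return perField;
        }
        return defaultThreshold;
    }

    public static void main(String[] args) {
        Map<String, Integer> thresholds = new HashMap<>();
        thresholds.put("body", 3);        // per-field value
        thresholds.put("body:lucene", 1); // per-term value
        System.out.println(thresholdFor(thresholds, 2, "body", "lucene"));
        System.out.println(thresholdFor(thresholds, 2, "body", "index"));
        System.out.println(thresholdFor(thresholds, 2, "title", "index"));
    }
}
```

With a threshold of 1 for a term, nothing is pruned for it, which matches the note above that a term freq. threshold of 1 disables term pruning.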
[jira] [Updated] (LUCENE-3718) SamplingWrapperTest failure with certain test seed
[ https://issues.apache.org/jira/browse/LUCENE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3718:

Attachment: LUCENE-3718.patch

Attached simple fix to Lucene40PostingsReader: linearScan() should also set doc when returning refill().

SamplingWrapperTest failure with certain test seed
--
Key: LUCENE-3718
URL: https://issues.apache.org/jira/browse/LUCENE-3718
Project: Lucene - Java
Issue Type: Bug
Components: modules/facet
Reporter: Doron Cohen
Assignee: Doron Cohen
Fix For: 3.6, 4.0
Attachments: LUCENE-3718.patch

Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/12231/

1 tests failed.
REGRESSION: org.apache.lucene.facet.search.SamplingWrapperTest.testCountUsingSamping

Error Message:
Results are not the same!

Stack Trace:
org.apache.lucene.facet.FacetTestBase$NotSameResultError: Results are not the same!
	at org.apache.lucene.facet.FacetTestBase.assertSameResults(FacetTestBase.java:333)
	at org.apache.lucene.facet.search.sampling.BaseSampleTestTopK.assertSampling(BaseSampleTestTopK.java:104)
	at org.apache.lucene.facet.search.sampling.BaseSampleTestTopK.testCountUsingSamping(BaseSampleTestTopK.java:82)
	at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:529)
	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165)
	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57)

NOTE: reproduce with: ant test -Dtestcase=SamplingWrapperTest -Dtestmethod=testCountUsingSamping -Dtests.seed=4a5994491f79fc80:-18509d134c89c159:-34f6ecbb32e930f7 -Dtests.multiplier=3 -Dargs=-Dfile.encoding=UTF-8
NOTE: test params are: codec=Lucene40: {$facets=PostingsFormat(name=MockRandom), $full_path$=PostingsFormat(name=MockSep), content=Pulsing40(freqCutoff=19 minBlockSize=65 maxBlockSize=209), $payloads$=PostingsFormat(name=Lucene40WithOrds)}, sim=RandomSimilarityProvider(queryNorm=true,coord=true): {$facets=LM Jelinek-Mercer(0.70), content=DFR I(n)B3(800.0)}, locale=bg, timezone=Asia/Manila
[jira] [Updated] (LUCENE-3718) SamplingWrapperTest failure with certain test seed
[ https://issues.apache.org/jira/browse/LUCENE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3718:

Attachment: LUCENE-3718.patch

Updated patch with the same fix also in AllDocsSegmentDocsEnum.linearScan() (the previous patch fixed only LiveDocsSegmentDocsEnum.linearScan()).

I also verified that this facets test does not fail in 3x with the same seed.

SamplingWrapperTest failure with certain test seed
--
Key: LUCENE-3718
URL: https://issues.apache.org/jira/browse/LUCENE-3718
Project: Lucene - Java
Issue Type: Bug
Components: modules/facet
Reporter: Doron Cohen
Assignee: Doron Cohen
Fix For: 3.6, 4.0
Attachments: LUCENE-3718.patch, LUCENE-3718.patch

Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/12231/

1 tests failed.
REGRESSION: org.apache.lucene.facet.search.SamplingWrapperTest.testCountUsingSamping

Error Message:
Results are not the same!

Stack Trace:
org.apache.lucene.facet.FacetTestBase$NotSameResultError: Results are not the same!
	at org.apache.lucene.facet.FacetTestBase.assertSameResults(FacetTestBase.java:333)
	at org.apache.lucene.facet.search.sampling.BaseSampleTestTopK.assertSampling(BaseSampleTestTopK.java:104)
	at org.apache.lucene.facet.search.sampling.BaseSampleTestTopK.testCountUsingSamping(BaseSampleTestTopK.java:82)
	at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:529)
	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165)
	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57)

NOTE: reproduce with: ant test -Dtestcase=SamplingWrapperTest -Dtestmethod=testCountUsingSamping -Dtests.seed=4a5994491f79fc80:-18509d134c89c159:-34f6ecbb32e930f7 -Dtests.multiplier=3 -Dargs=-Dfile.encoding=UTF-8
NOTE: test params are: codec=Lucene40: {$facets=PostingsFormat(name=MockRandom), $full_path$=PostingsFormat(name=MockSep), content=Pulsing40(freqCutoff=19 minBlockSize=65 maxBlockSize=209), $payloads$=PostingsFormat(name=Lucene40WithOrds)}, sim=RandomSimilarityProvider(queryNorm=true,coord=true): {$facets=LM Jelinek-Mercer(0.70), content=DFR I(n)B3(800.0)}, locale=bg, timezone=Asia/Manila
[jira] [Updated] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
[ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen updated LUCENE-1812: Attachment: pruning.patch Updated patch for current 3x. Static index pruning by in-document term frequency (Carmel pruning) --- Key: LUCENE-1812 URL: https://issues.apache.org/jira/browse/LUCENE-1812 Project: Lucene - Java Issue Type: New Feature Components: modules/other Reporter: Andrzej Bialecki Assignee: Doron Cohen Fix For: 3.6, 4.0 Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch This module provides tools to produce a subset of input indexes by removing postings data for those terms where their in-document frequency is below a specified threshold. The net effect of this processing is a much smaller index that for common types of queries returns nearly identical top-N results as compared with the original index, but with increased performance. Optionally, stored values and term vectors can also be removed. This functionality is largely independent, so it can be used without term pruning (when term freq. threshold is set to 1). As the threshold value increases, the total size of the index decreases, search performance increases, and recall decreases (i.e. search quality deteriorates). NOTE: especially phrase recall deteriorates significantly at higher threshold values. Primary purpose of this class is to produce small first-tier indexes that fit completely in RAM, and store these indexes using IndexWriter.addIndexes(IndexReader[]). Usually the performance of this class will not be sufficient to use the resulting index view for on-the-fly pruning and searching. NOTE: If the input index is optimized (i.e. doesn't contain deletions) then the index produced via IndexWriter.addIndexes(IndexReader[]) will preserve internal document id-s so that they are in sync with the original index. 
This means that all other auxiliary information not necessary for first-tier processing, such as some stored fields, can also be removed, to be quickly retrieved on-demand from the original index using the same internal document id.

Threshold values can be specified globally (for terms in all fields) using the defaultThreshold parameter, and can be overridden using per-field or per-term values supplied in a thresholds map. Keys in this map are either field names, or terms in field:text format. The precedence of these values is the following: first a per-term threshold is used if present, then per-field threshold if present, and finally the default threshold.

A command-line tool (PruningTool) is provided for convenience. At this moment it doesn't support all functionality available through the API.
[jira] [Updated] (LUCENE-3596) DirectoryTaxonomyWriter extensions should be able to set internal index writer config attributes such as info stream
[ https://issues.apache.org/jira/browse/LUCENE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3596:

Attachment: LUCENE-3596.patch

Patch taking approach (1) above, and moving createIWC() to the constructor.

In addition, fixed some javadoc comments and added an assert to the constructor, which, only when assertions are enabled, will verify that the merge policy of the IWC in effect is not an instance of TieredMergePolicy.

Imperfect as this is, it at least exposed the problem in the current test (fixed to use newLogMP()).

I think this is ready to commit.

DirectoryTaxonomyWriter extensions should be able to set internal index writer config attributes such as info stream

Key: LUCENE-3596
URL: https://issues.apache.org/jira/browse/LUCENE-3596
Project: Lucene - Java
Issue Type: Improvement
Components: modules/facet
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor
Attachments: LUCENE-3596.patch, LUCENE-3596.patch

Current protected openIndexWriter(Directory directory, OpenMode openMode) does not provide access to the IWC it creates. So extensions must reimplement this method completely in order to set e.g. info stream for the internal index writer.

This came up in [user question: Taxonomy indexer debug |http://lucene.472066.n3.nabble.com/Taxonomy-indexer-debug-td3533341.html]
[jira] [Updated] (LUCENE-3596) DirectoryTaxonomyWriter extensions should be able to set internal index writer config attributes such as info stream
[ https://issues.apache.org/jira/browse/LUCENE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3596:

Attachment: LUCENE-3596.patch

Patch adds the method createIndexWriterConfig(OpenMode openMode) and javadocs for in-order segments merging.

DirectoryTaxonomyWriter extensions should be able to set internal index writer config attributes such as info stream

Key: LUCENE-3596
URL: https://issues.apache.org/jira/browse/LUCENE-3596
Project: Lucene - Java
Issue Type: Improvement
Components: modules/facet
Reporter: Doron Cohen
Priority: Minor
Attachments: LUCENE-3596.patch

Current protected openIndexWriter(Directory directory, OpenMode openMode) does not provide access to the IWC it creates. So extensions must reimplement this method completely in order to set e.g. info stream for the internal index writer.

This came up in [user question: Taxonomy indexer debug |http://lucene.472066.n3.nabble.com/Taxonomy-indexer-debug-td3533341.html]
[jira] [Updated] (LUCENE-3573) TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern
[ https://issues.apache.org/jira/browse/LUCENE-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3573:

Attachment: LUCENE-3573.patch

Final patch. Also updated the user-guide about refresh() behavior. Removed the changes entry - for facet this would go only into 3x.

Planning to commit this soon.

TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern

Key: LUCENE-3573
URL: https://issues.apache.org/jira/browse/LUCENE-3573
Project: Lucene - Java
Issue Type: Bug
Components: modules/facet
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor
Attachments: LUCENE-3573.patch, LUCENE-3573.patch, LUCENE-3573.patch

When recreating the taxonomy index, TR's assumption that categories are only added does not hold anymore. As a result, calling TR.refresh() will be incorrect at best, but will usually throw an AIOOBE.
[jira] [Updated] (LUCENE-3573) TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern
[ https://issues.apache.org/jira/browse/LUCENE-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3573:

Attachment: LUCENE-3573.patch

Patch, in principle ready to commit, though I plan to go through it once more.

In this patch:
* new tests moved to TestDirectoryTaxonomyReader
* an exception added: InconsistentTaxonomyException
* when the reader cannot refresh because the taxonomy was recreated since the last open/refresh, that exception is thrown and the application should open a fresh taxonomy reader.

Bumped into 3 TODO's while working on this:
* FilterIndexReader does not implement getCommitUserData(). Once this is fixed, the TODO in TestIndexClose can be resolved. I'll open an issue later.
* TR.refresh() should return a boolean indicating whether anything was changed (issue).
* DTW.rollback() seems wrong to me - it rolls back the internal IW, which also closes it, but then it refreshes its internal TR, which seems wrong...

TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern

Key: LUCENE-3573
URL: https://issues.apache.org/jira/browse/LUCENE-3573
Project: Lucene - Java
Issue Type: Bug
Components: modules/facet
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor
Attachments: LUCENE-3573.patch, LUCENE-3573.patch

When recreating the taxonomy index, TR's assumption that categories are only added does not hold anymore. As a result, calling TR.refresh() will be incorrect at best, but will usually throw an AIOOBE.
[jira] [Updated] (LUCENE-3573) TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern
[ https://issues.apache.org/jira/browse/LUCENE-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3573:

Attachment: LUCENE-3573.patch

Attached patch for trunk adds two tests:
* one of them opens a new TR and passes
* the other refreshes the TR and fails.

TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern

Key: LUCENE-3573
URL: https://issues.apache.org/jira/browse/LUCENE-3573
Project: Lucene - Java
Issue Type: Bug
Components: modules/facet
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor
Attachments: LUCENE-3573.patch

When recreating the taxonomy index, TR's assumption that categories are only added does not hold anymore. As a result, calling TR.refresh() will be incorrect at best, but will usually throw an AIOOBE.
[jira] [Updated] (LUCENE-3506) tests for verifying that assertions are enabled do nothing since they ignore AssertionError
[ https://issues.apache.org/jira/browse/LUCENE-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3506:

Attachment: LUCENE-3506.patch

Attached fix for this:
- assertionsEnabled() method added to LTC.
- tests that were no-ops were fixed to actually test that the assertion failed.
- after the fix, in trunk, the analyzer's final'ness assertion tests failed because being final (class or method) is no longer needed in trunk. So these tests were removed in TestAssertions.
-- note: should not remove these tests when merging to 3x.
- TestSegmentMerger also failed with this fix, because it used the stale IW's SegmentInfos to create a compound segment. Fixed by reading a fresh SIS.
- only one test (TestAssertions.testbasics()) fails if assertions are not enabled. The other tests do not fail (though they do execute). I think that this was intended in the code, though since it did not work it is hard to tell...

This is ready to commit.

tests for verifying that assertions are enabled do nothing since they ignore AssertionError
---
Key: LUCENE-3506
URL: https://issues.apache.org/jira/browse/LUCENE-3506
Project: Lucene - Java
Issue Type: Bug
Components: general/test
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor
Attachments: LUCENE-3506.patch

Follow-up from LUCENE-3501
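The two ideas in the LUCENE-3506 update above, a helper that reports whether assertions are enabled and a test that really observes the AssertionError, can be sketched in plain Java. This is a minimal illustration and not the patch itself: the class AssertionChecks and the methods tripsAssertion() and requirePositive() are hypothetical, and only the name assertionsEnabled() is taken from the message (its real implementation in LTC is not shown there).

```java
class AssertionChecks {

    /** Returns true only when the JVM runs with assertions enabled (-ea). */
    static boolean assertionsEnabled() {
        boolean enabled = false;
        // The assignment is evaluated only if assert statements are active.
        assert enabled = true;
        return enabled;
    }

    /** Runs 'body' and reports whether it raised an AssertionError. */
    static boolean tripsAssertion(Runnable body) {
        try {
            body.run();
            return false; // no assertion fired
        } catch (AssertionError expected) {
            return true;  // the assertion fired and was observed
        }
    }

    /** Hypothetical production code guarded by an assert. */
    static void requirePositive(int n) {
        assert n > 0 : "n must be positive";
    }

    public static void main(String[] args) {
        boolean tripped = tripsAssertion(() -> requirePositive(-1));
        // With -ea the assert must fire; without -ea it cannot.
        System.out.println("assertions enabled: " + assertionsEnabled()
                + ", assertion tripped: " + tripped);
    }
}
```

A test built on these helpers can compare tripsAssertion(...) with assertionsEnabled(), so it verifies the assertion when -ea is set and does not fail spuriously when it is not.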
[jira] [Updated] (LUCENE-3506) tests for verifying that assertions are enabled do nothing since they ignore AssertionError
[ https://issues.apache.org/jira/browse/LUCENE-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3506:

Attachment: LUCENE-3506.patch

Updated patch as suggested, thanks guys for reviewing and your helpful comments.

tests for verifying that assertions are enabled do nothing since they ignore AssertionError
---
Key: LUCENE-3506
URL: https://issues.apache.org/jira/browse/LUCENE-3506
Project: Lucene - Java
Issue Type: Bug
Components: general/test
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor
Attachments: LUCENE-3506.patch, LUCENE-3506.patch

Follow-up from LUCENE-3501
[jira] [Updated] (LUCENE-3501) random sampler is not random (and so facet SamplingWrapperTest occasionally fails)
[ https://issues.apache.org/jira/browse/LUCENE-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3501:

Attachment: LUCENE-3501.patch

Before applying this patch, run:
{noformat}
svn mv modules/facet/src/java/org/apache/lucene/facet/util/RandomSample.java modules/facet/src/java/org/apache/lucene/facet/search/sampling/RepeatableSampler.java
{noformat}

I looked at this, and also discussed with Gilad, and it seems that:
* The test is broken: it claims to do N trials in case of failure but it does not, because its try/catch does not catch AssertionError, and so only one trial is attempted. (A few trials make sense because with sampling, there is always a possibility that the selected sample set of docs would not contain the correct best facets, even with a high over sampling ratio. Over sampling means that for the selected set of docs more best facets are collected.)
* Even after fixing the test to actually try more than once, it still fails, because there is no randomness in RandomSample... surprising but true.

In this patch:
* Sampler made an abstract class.
* RandomSample renamed to RepeatableSampler, which extends Sampler.
* RandomSampler was added - it too extends Sampler - this is a simple *random* implementation, which is now the default (used by default in SamplingWrapperAccumulator).
* The test randomly selects between the two sampler implementations.

If you want to see the behavior that created the bug, remove that latter randomness by setting to false the variable *useRandomSampler* of *BaseSampleTestTopK.testCountUsingSamping()*.

I think this is ready to commit. Wasn't sure though, where should the Changes entry go?
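The class layout listed in the comment above (an abstract Sampler with one deterministic and one random subclass) can be sketched as follows. This is an illustration only, assuming a much simpler interface than the facet module's: the Toy* class names, the sample(int[], int) signature, and both selection strategies are hypothetical and are not taken from the patch. Both methods assume a non-negative sampleSize.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical stand-in for the abstract Sampler described above. */
abstract class ToySampler {
    /** Picks up to sampleSize doc ids out of docIds. */
    abstract int[] sample(int[] docIds, int sampleSize);
}

/** Deterministic: the same input always yields the same sample. */
class ToyRepeatableSampler extends ToySampler {
    @Override
    int[] sample(int[] docIds, int sampleSize) {
        int n = Math.min(sampleSize, docIds.length);
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            // Evenly spaced picks across the input; no randomness involved.
            out[i] = docIds[(int) ((long) i * docIds.length / n)];
        }
        return out;
    }
}

/** Random: draws a sample without replacement using java.util.Random. */
class ToyRandomSampler extends ToySampler {
    private final Random random;

    ToyRandomSampler(Random random) {
        this.random = random;
    }

    @Override
    int[] sample(int[] docIds, int sampleSize) {
        int n = Math.min(sampleSize, docIds.length);
        int[] copy = docIds.clone();
        // Partial Fisher-Yates shuffle: after i steps the first i slots are a random sample.
        for (int i = 0; i < n; i++) {
            int j = i + random.nextInt(copy.length - i);
            int tmp = copy[i];
            copy[i] = copy[j];
            copy[j] = tmp;
        }
        return Arrays.copyOf(copy, n);
    }
}
```

A test that wants both code paths can pick one of the two implementations at random, and pin the deterministic one when a failure needs to be reproduced.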
random sampler is not random (and so facet SamplingWrapperTest occasionally fails)
--
Key: LUCENE-3501
URL: https://issues.apache.org/jira/browse/LUCENE-3501
Project: Lucene - Java
Issue Type: Bug
Components: modules/facet
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor
Attachments: LUCENE-3501.patch

RandomSample is not random at all: it does not even import java.util.Random, and its behavior is deterministic. In addition, the test testCountUsingSamping() never retries as it was supposed to (for taking care of the hoped-for randomness).
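The retry problem described above comes down to a Java detail: AssertionError extends Error, so a catch clause for Exception never sees it and the loop ends on the first failed trial. A minimal sketch of a retry loop that does catch it, assuming a hypothetical helper class RetryingCheck that is not part of the patch:

```java
import java.util.function.IntConsumer;

class RetryingCheck {

    /**
     * Runs 'trial' up to maxTrials times and returns the 1-based number of the
     * first trial that passed. Rethrows an AssertionError if every trial failed.
     */
    static int runWithRetries(int maxTrials, IntConsumer trial) {
        AssertionError last = null;
        for (int attempt = 1; attempt <= maxTrials; attempt++) {
            try {
                trial.accept(attempt);
                return attempt; // this trial passed
            } catch (AssertionError e) {
                // AssertionError is an Error, not an Exception:
                // 'catch (Exception e)' here would let it escape after one trial.
                last = e;
            }
        }
        throw new AssertionError("all " + maxTrials + " trials failed", last);
    }
}
```

With this helper, a sampling test can allow a few unlucky samples while still failing when no trial succeeds.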
[jira] [Updated] (LUCENE-3262) Facet benchmarking
[ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3262:

Attachment: LUCENE-3262.patch

Updated patch according to Shai's comments and with AddFacetedDoc task.

Facet benchmarking
--
Key: LUCENE-3262
URL: https://issues.apache.org/jira/browse/LUCENE-3262
Project: Lucene - Java
Issue Type: New Feature
Components: modules/benchmark, modules/facet
Reporter: Shai Erera
Assignee: Doron Cohen
Attachments: CorpusGenerator.java, LUCENE-3262.patch, LUCENE-3262.patch, LUCENE-3262.patch, TestPerformanceHack.java

A spin off from LUCENE-3079. We should define a few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implementing FST-based caches etc.). Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point. We've also done some preliminary work on extending Benchmark for faceting, so I'll attach it here as well. We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.
[jira] [Updated] (LUCENE-3262) Facet benchmarking
[ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3262:

Attachment: LUCENE-3262.patch

Updated patch with a test, more javadocs, and a comment as Shai suggested. I think this is ready to commit. More tests are needed, and search with facets is also missing, but that can go in a separate issue.

Facet benchmarking
--
Key: LUCENE-3262
URL: https://issues.apache.org/jira/browse/LUCENE-3262
Project: Lucene - Java
Issue Type: New Feature
Components: modules/benchmark, modules/facet
Reporter: Shai Erera
Assignee: Doron Cohen
Attachments: CorpusGenerator.java, LUCENE-3262.patch, LUCENE-3262.patch, TestPerformanceHack.java

A spin off from LUCENE-3079. We should define a few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implementing FST-based caches etc.). Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point. We've also done some preliminary work on extending Benchmark for faceting, so I'll attach it here as well. We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.
[jira] [Updated] (LUCENE-3262) Facet benchmarking
[ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3262:

Attachment: LUCENE-3262.patch

Patch (3x) with working facets indexing benchmark. It follows the outline above, except that:
- there is no FacetDocMaker, only FacetSource
- there is no AddDocWithFacet; instead, AddDoc takes an additional config param: with.facet

'ant run-task -Dtask.alg=conf/facets.alg' will run an algorithm that indexes facets.

Not ready to commit yet - needs some testing and docs. Also, it only covers indexing for now, though perhaps search with facets can go in a separate issue.

Facet benchmarking
--
Key: LUCENE-3262
URL: https://issues.apache.org/jira/browse/LUCENE-3262
Project: Lucene - Java
Issue Type: New Feature
Components: modules/benchmark, modules/facet
Reporter: Shai Erera
Assignee: Doron Cohen
Attachments: CorpusGenerator.java, LUCENE-3262.patch, TestPerformanceHack.java

A spin off from LUCENE-3079. We should define a few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implementing FST-based caches etc.). Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point. We've also done some preliminary work on extending Benchmark for faceting, so I'll attach it here as well. We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.
[jira] [Updated] (LUCENE-3484) TaxonomyWriter parents array creation is not thread safe, can cause NPE
[ https://issues.apache.org/jira/browse/LUCENE-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3484:

Attachment: LUCENE-3484.patch

Patch with a test that fails in the same way as the reported error. None of the changes here should be committed; they just show the error.

TaxonomyWriter parents array creation is not thread safe, can cause NPE
---
Key: LUCENE-3484
URL: https://issues.apache.org/jira/browse/LUCENE-3484
Project: Lucene - Java
Issue Type: Bug
Components: modules/facet
Reporter: Doron Cohen
Assignee: Doron Cohen
Attachments: LUCENE-3484.patch

Following user list thread [TaxWriter leakage? | http://markmail.org/thread/jkkhemfzpnbdzoft] it appears that if two or more threads ask for the parent array for the first time, a context switch after the first thread created the empty parents array, but before it initialized it, would cause the other thread to use an uninitialized array, causing an NPE. The fix is simple: synchronize the method getParentArray().
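The race described above, and the proposed fix of synchronizing getParentArray(), can be shown with a small stand-alone model. This is a sketch under stated assumptions rather than the facet module's code: ToyTaxonomyWriter, ToyParentArray, load() and parentOf() are hypothetical names, and only the method name getParentArray() comes from the issue text.

```java
/** Hypothetical parent array: created empty, then filled by load(). */
class ToyParentArray {
    private int[] parents; // null until load() has run

    void load(int size) {
        int[] filled = new int[size];
        for (int i = 0; i < size; i++) {
            filled[i] = i - 1; // toy data: each ordinal's parent is the previous one
        }
        parents = filled;
    }

    int parentOf(int ordinal) {
        return parents[ordinal]; // NullPointerException if load() has not run yet
    }
}

/** Hypothetical writer with a lazily created parent array. */
class ToyTaxonomyWriter {
    private final int size;
    private ToyParentArray parentArray; // guarded by 'this'

    ToyTaxonomyWriter(int size) {
        this.size = size;
    }

    /**
     * Without 'synchronized', a second thread could see the non-null but still
     * empty object between the two statements in the if-block. Holding the lock
     * for the whole method makes that window invisible to other callers.
     */
    synchronized ToyParentArray getParentArray() {
        if (parentArray == null) {
            parentArray = new ToyParentArray(); // created empty...
            parentArray.load(size);             // ...and initialized afterwards
        }
        return parentArray;
    }
}
```

Every caller pays for the lock on each call in this version; that matches the simple fix named in the issue, and other lazy-initialization schemes are deliberately left out of the sketch.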