[jira] Closed: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header
[ https://issues.apache.org/jira/browse/LUCENE-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-1947. --- Resolution: Fixed Committed in revision 823445 Snowball package contains BSD licensed code with ASL header --- Key: LUCENE-1947 URL: https://issues.apache.org/jira/browse/LUCENE-1947 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Affects Versions: 2.9 Reporter: Karl Wettin Assignee: Karl Wettin Fix For: 3.0 Attachments: LUCENE-1947.patch, LUCENE-1947.patch All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) have for some reason been given an ASL header. These classes are licensed under BSD, so the ASL header should be removed. I suppose this is a mistake, possibly due to the ASL header automation tool. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1959) Index Splitter
[ https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12763919#action_12763919 ] Uwe Schindler commented on LUCENE-1959: --- Ah ok, I didn't look into the test failure yesterday (it was too late in the evening); I only wanted to make a quick design and see if it would generally work. But you are right, the numDocs() return value is incorrect, leading to a failure in this test. But as the test passes in your test environment, the assertion in the SegmentMerger seems not to be important for functionality, so in general both my code and your first code would work correctly. I do not know how costly the initial building of the BitSet used for the input reader's deleted docs is, but one possibility would be to only build/use the additional bitset if hasDeletions() on the original index returns true. Thanks for clarifying. Index Splitter -- Key: LUCENE-1959 URL: https://issues.apache.org/jira/browse/LUCENE-1959 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.9 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Trivial Fix For: 3.0 Attachments: LUCENE-1959.patch, LUCENE-1959.patch, mp-splitter-inline.patch, mp-splitter.patch, mp-splitter2.patch If an index has multiple segments, this tool allows splitting those segments into separate directories. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1959) Index Splitter
[ https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12763933#action_12763933 ] Andrzej Bialecki commented on LUCENE-1959: --- The test passed in Eclipse only - ant test run from the command line didn't pass without this fix, so I suspect my Eclipse setup is to blame for hiding the problem. Re: lazy allocation of the bitset - good point, I'll make this change. Index Splitter -- Key: LUCENE-1959 URL: https://issues.apache.org/jira/browse/LUCENE-1959 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.9 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Trivial Fix For: 3.0 Attachments: LUCENE-1959.patch, LUCENE-1959.patch, mp-splitter-inline.patch, mp-splitter.patch, mp-splitter2.patch If an index has multiple segments, this tool allows splitting those segments into separate directories. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Build failed in Hudson: Lucene-trunk #973
This failure was the back-compat test for TestConcurrentMergeScheduler, caused by my removing autocommit (LUCENE-1950). I've already fixed it, but didn't create a new back-compat tag (I was hoping it'd simply pass until we made a new back-compat tag, which is happening daily :) But it didn't). I'll make a new back-compat tag now... Mike On Thu, Oct 8, 2009 at 11:04 PM, Apache Hudson Server hud...@hudson.zones.apache.org wrote: See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/973/changes Changes: [mikemccand] LUCENE-1950: remove autoCommit entirely [mikemccand] LUCENE-1950: fix intermittent exception in testcase [mikemccand] revert accidental commit; switch to new back-compat tag [mikemccand] LUCENE-1951: fix WildcardQuery to correctly rewrite single term query and prefix query [mikemccand] LUCENE-1950: remove autoCommit=true from IndexWriter [simonw] fixed spelling [simonw] LUCENE-1965, LUCENE-1962: Added possible performance improvments for Persian-, Arabic- and SmartChineseAnalyzer to changes.txt [simonw] LUCENE-1965: Lazy Atomic Loading Stopwords in SmartCN [buschmi] LUCENE-1961: Remove remaining deprecations from document package. [buschmi] LUCENE-1961: switch to latest back-compat tag. [simonw] LUCENE-1962: Cleaned up Persian Arabic Analyzer. Prevent default stopword list from being loaded more than once. - replace if blocks with a single switch - marking private members final where needed - changed protected visibility to final in final class. [koji] LUCENE-1953: FastVectorHighlighter: small fragCharSize can cause StringIndexOutOfBoundsException [mikemccand] LUCENE-1959: reuse the copy buffer [mikemccand] LUCENE-1959: add IndexSplitter tool to pull segment files out of an index into another -- [...truncated 15926 lines...] [junit] (quoted junit output trimmed: the listed suites in org.apache.lucene.search and org.apache.lucene.search.function all report Failures: 0, Errors: 0)
[jira] Updated: (LUCENE-1959) Index Splitter
[ https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated LUCENE-1959: -- Attachment: mp-splitter3.patch As suggested by Uwe, don't allocate the old deletions bitset if there are no deletions. Index Splitter -- Key: LUCENE-1959 URL: https://issues.apache.org/jira/browse/LUCENE-1959 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.9 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Trivial Fix For: 3.0 Attachments: LUCENE-1959.patch, LUCENE-1959.patch, mp-splitter-inline.patch, mp-splitter.patch, mp-splitter2.patch, mp-splitter3.patch If an index has multiple segments, this tool allows splitting those segments into separate directories. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
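For illustration, here is a minimal sketch of the lazy-allocation idea being discussed: a FilterIndexReader wrapper that only copies the input reader's deletions into an extra bitset when the input actually has deletions, so clean segments skip the copy entirely and numDocs() stays consistent. The class and method names below are hypothetical stand-ins, not the actual patch.
{code}
// Hypothetical sketch (not the committed patch): allocate the deletions bitset lazily.
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.util.OpenBitSet;

class FakeDeleteIndexReader extends FilterIndexReader {
  private OpenBitSet dels; // stays null for segments without deletions

  FakeDeleteIndexReader(IndexReader in) {
    super(in);
    if (in.hasDeletions()) {            // the lazy-allocation check Uwe suggested
      dels = new OpenBitSet(in.maxDoc());
      for (int i = 0; i < in.maxDoc(); i++) {
        if (in.isDeleted(i)) {
          dels.set(i);
        }
      }
    }
  }

  void fakeDelete(int n) {              // the splitter marks docs outside the target range
    if (dels == null) {
      dels = new OpenBitSet(in.maxDoc());
    }
    dels.set(n);
  }

  public boolean hasDeletions() {
    return dels != null;
  }

  public boolean isDeleted(int n) {
    return dels != null && dels.get(n);
  }

  public int numDocs() {                // keeps SegmentMerger's doc-count assertion happy
    return dels == null ? in.numDocs() : in.maxDoc() - (int) dels.cardinality();
  }
}
{code}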
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12763946#action_12763946 ] Michael McCandless commented on LUCENE-1458: No problem :) Please post the patch once you have it working! We'll need to implement captureState/seek for the other codecs too. The pulsing case will be interesting since its state will hold the actual postings for the low-freq case. BTW I think an interesting codec would be one that pre-loads postings into RAM, storing them uncompressed (eg docs/positions as simple int[]) or slightly compressed (stored as packed bits). This should be a massive performance win at the expense of sizable RAM consumption, ie it makes the same tradeoff as contrib/memory and contrib/instantiated. Further steps towards flexible indexing --- Key: LUCENE-1458 URL: https://issues.apache.org/jira/browse/LUCENE-1458 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.9 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2 I attached a very rough checkpoint of my current patch, to get early feedback. All tests pass, though back-compat tests don't pass due to changes to package-private APIs plus certain bugs in tests that happened to work (eg calling TermPositions.nextPosition() too many times, which the new API asserts against). [Aside: I think, when we commit changes to package-private APIs such that back-compat tests don't pass, we could go back, make a branch on the back-compat tag, commit changes to the tests to use the new package-private APIs on that branch, then fix the nightly build to use the tip of that branch?] There's still plenty to do before this is committable! This is a rather large change: * Switches to a new more efficient terms dict format. This still uses tii/tis files, but the tii only stores term and long offset (not a TermInfo). At seek points, tis encodes term freq/prox offsets absolutely instead of with deltas. Also, tis/tii are structured by field, so we don't have to record the field number in every term. . On the first 1 M docs of Wikipedia, the tii file is 36% smaller (0.99 MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB). . RAM usage when loading the terms dict index is significantly less since we only load an array of offsets and an array of String (no more TermInfo array). It should be faster to init too. . This part is basically done. * Introduces a modular reader codec that strongly decouples the terms dict from the docs/positions readers. EG there is no more TermInfo used when reading the new format. . There's nice symmetry now between reading/writing in the codec chain -- the current docs/prox format is captured in: {code} FormatPostingsTermsDictWriter/Reader FormatPostingsDocsWriter/Reader (.frq file) and FormatPostingsPositionsWriter/Reader (.prx file). {code} This part is basically done.
* Introduces a new flex API for iterating through the fields, terms, docs and positions: {code} FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum {code} This replaces TermEnum/TermDocs/TermPositions. SegmentReader emulates the old API on top of the new API to keep back-compat. Next steps: * Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions. * Expose the new API out of IndexReader, deprecate the old API but emulate it on top of the new one, switch all core/contrib users to the new API. * Maybe switch to AttributeSources as the base class for TermsEnum, DocsEnum, PostingsEnum -- this would give readers API flexibility (not just index-file-format flexibility). EG if someone wanted to store payloads at the term-doc level instead of the term-doc-position level, you could just add a new attribute. * Test performance, iterate. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands,
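As a rough illustration of the trade-off Mike describes (RAM for speed), the sketch below pre-loads one term's postings into plain int[] arrays using the then-current pre-flex API; an actual codec would do this behind the flex TermsEnum/DocsEnum rather than at the application level, and the helper class/method names here are made up.
{code}
// Illustrative only: load one term's doc/freq postings into parallel int[] arrays.
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

final class PostingsPreloader {
  /** Returns {docs[], freqs[]} for the given term, fully resident in RAM. */
  static int[][] load(IndexReader reader, Term term) throws IOException {
    final int df = reader.docFreq(term);   // note: docFreq() still counts deleted docs,
    final int[] docs = new int[df];        // so the arrays may not fill completely
    final int[] freqs = new int[df];
    final int[] docBuf = new int[256];
    final int[] freqBuf = new int[256];
    final TermDocs td = reader.termDocs(term);
    try {
      int upto = 0;
      int n;
      while ((n = td.read(docBuf, freqBuf)) > 0) {   // bulk-read postings
        System.arraycopy(docBuf, 0, docs, upto, n);
        System.arraycopy(freqBuf, 0, freqs, upto, n);
        upto += n;
      }
    } finally {
      td.close();
    }
    return new int[][] { docs, freqs };
  }
}
{code}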
[jira] Commented: (LUCENE-1959) Index Splitter
[ https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12763950#action_12763950 ] Michael McCandless commented on LUCENE-1959: Good progress! Andrzej, how about you go ahead and commit it yourself? Index Splitter -- Key: LUCENE-1959 URL: https://issues.apache.org/jira/browse/LUCENE-1959 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.9 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Trivial Fix For: 3.0 Attachments: LUCENE-1959.patch, LUCENE-1959.patch, mp-splitter-inline.patch, mp-splitter.patch, mp-splitter2.patch, mp-splitter3.patch If an index has multiple segments, this tool allows splitting those segments into separate directories. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12763956#action_12763956 ] Michael McCandless commented on LUCENE-1458: {quote} Another for the TermsEnum wishlist: the ability to seek to the term before the given term... useful for finding the largest value in a field, etc. I imagine at-or-before semantics would also work (like the current semantics of TermEnum, in reverse) {quote} Right now seek(TermRef seekTerm) stops at the earliest term that's >= seekTerm. It sounds like you're asking for a variant of seek that'd stop at the latest term that's <= seekTerm? How would you use this to seek to the last term in a field? With the flex API, the TermsEnum only works with a single field's terms. So I guess we'd need TermRef constants, eg TermRef.FIRST and TermRef.LAST, that act like -infinity / +infinity.
[jira] Updated: (LUCENE-1959) Index Splitter
[ https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated LUCENE-1959: -- Attachment: mp-splitter4.patch I moved the files in this patch to contrib/misc and updated the contrib/CHANGES.txt. If there are no objections I'll commit it soon. Index Splitter -- Key: LUCENE-1959 URL: https://issues.apache.org/jira/browse/LUCENE-1959 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.9 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Trivial Fix For: 3.0 Attachments: LUCENE-1959.patch, LUCENE-1959.patch, mp-splitter-inline.patch, mp-splitter.patch, mp-splitter2.patch, mp-splitter3.patch, mp-splitter4.patch If an index has multiple segments, this tool allows splitting those segments into separate directories. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12763984#action_12763984 ] Michael McCandless commented on LUCENE-1458: Actually, FIRST/LAST could be achieved with seek-by-ord (plus getUniqueTermCount()). Though that'd only work for TermsEnum impls that support ords.
[jira] Updated: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement
[ https://issues.apache.org/jira/browse/LUCENE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1966: Affects Version/s: (was: 2.9.1) 2.9 Fix Version/s: (was: 2.9) 3.0 Arabic Analyzer: Stopwords list needs enhancement - Key: LUCENE-1966 URL: https://issues.apache.org/jira/browse/LUCENE-1966 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Affects Versions: 2.9 Reporter: Basem Narmok Assignee: Robert Muir Priority: Trivial Fix For: 3.0 Attachments: arabic-stopwords-comments.txt, LUCENE-1966.patch The provided Arabic stopwords list needs some enhancements (e.g. it contains a lot of words that are not stopwords) and some cleanup. A patch will be provided with this issue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-1963) ArabicAnalyzer: Lowercase before Stopfilter
[ https://issues.apache.org/jira/browse/LUCENE-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-1963. - Resolution: Fixed Committed revision 823534. (if it is ok to apply this to 2.9 branch as DM requested, we should reopen) ArabicAnalyzer: Lowercase before Stopfilter --- Key: LUCENE-1963 URL: https://issues.apache.org/jira/browse/LUCENE-1963 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Affects Versions: 2.9 Reporter: Robert Muir Assignee: Robert Muir Priority: Trivial Fix For: 3.0 Attachments: LUCENE-1963.patch, LUCENE-1963.patch ArabicAnalyzer lowercases text in case you have some non-Arabic text around. It also allows you to set a custom stopword list (you might augment the Arabic list with some English ones, for example). In this case its helpful for these non-Arabic stopwords, to lowercase before stopfilter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1963) ArabicAnalyzer: Lowercase before Stopfilter
[ https://issues.apache.org/jira/browse/LUCENE-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764013#action_12764013 ] Mark Miller commented on LUCENE-1963: - Your issue - if you can stretch it to bugish territory, I'd +1 it. I'd be wary of getting into porting features to 2.9.1 - but I wouldn't have a problem with this one myself. ArabicAnalyzer: Lowercase before Stopfilter --- Key: LUCENE-1963 URL: https://issues.apache.org/jira/browse/LUCENE-1963 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Affects Versions: 2.9 Reporter: Robert Muir Assignee: Robert Muir Priority: Trivial Fix For: 3.0 Attachments: LUCENE-1963.patch, LUCENE-1963.patch ArabicAnalyzer lowercases text in case you have some non-Arabic text around. It also allows you to set a custom stopword list (you might augment the Arabic list with some English ones, for example). In this case its helpful for these non-Arabic stopwords, to lowercase before stopfilter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1963) ArabicAnalyzer: Lowercase before Stopfilter
[ https://issues.apache.org/jira/browse/LUCENE-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1963: Fix Version/s: 2.9.1 ArabicAnalyzer: Lowercase before Stopfilter --- Key: LUCENE-1963 URL: https://issues.apache.org/jira/browse/LUCENE-1963 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Affects Versions: 2.9 Reporter: Robert Muir Assignee: Robert Muir Priority: Trivial Fix For: 2.9.1, 3.0 Attachments: LUCENE-1963.patch, LUCENE-1963.patch ArabicAnalyzer lowercases text in case you have some non-Arabic text around. It also allows you to set a custom stopword list (you might augment the Arabic list with some English ones, for example). In this case its helpful for these non-Arabic stopwords, to lowercase before stopfilter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Reopened: (LUCENE-1963) ArabicAnalyzer: Lowercase before Stopfilter
[ https://issues.apache.org/jira/browse/LUCENE-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reopened LUCENE-1963: - Mark, I think the problem is really that I overlooked this use case in LUCENE-1758, because Arabic is not case sensitive. It won't affect the default usage of the Analyzer (where all the stopwords are in Arabic and lowercase is a no-op). I am going to also set fix for 2.9.1 and give a day or two for people to comment if they disagree with applying to 2.9 branch. ArabicAnalyzer: Lowercase before Stopfilter --- Key: LUCENE-1963 URL: https://issues.apache.org/jira/browse/LUCENE-1963 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Affects Versions: 2.9 Reporter: Robert Muir Assignee: Robert Muir Priority: Trivial Fix For: 2.9.1, 3.0 Attachments: LUCENE-1963.patch, LUCENE-1963.patch ArabicAnalyzer lowercases text in case you have some non-Arabic text around. It also allows you to set a custom stopword list (you might augment the Arabic list with some English ones, for example). In this case its helpful for these non-Arabic stopwords, to lowercase before stopfilter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
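For context, the change under discussion is simply the ordering of filters in the analyzer's chain. A rough sketch follows; the StopFilter constructor signature varied across 2.9/3.0 and the stoptable field name is assumed from the contrib analyzers convention, so treat this as illustrative rather than the committed patch.
{code}
// Sketch of ArabicAnalyzer.tokenStream() with lowercasing moved before the stop filter,
// so non-Arabic (e.g. English) stopwords in a custom list are matched case-insensitively.
public TokenStream tokenStream(String fieldName, Reader reader) {
  TokenStream result = new ArabicLetterTokenizer(reader);
  result = new LowerCaseFilter(result);        // lowercase first...
  result = new StopFilter(result, stoptable);  // ...then remove stopwords
  result = new ArabicNormalizationFilter(result);
  result = new ArabicStemFilter(result);
  return result;
}
{code}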
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764020#action_12764020 ] Yonik Seeley commented on LUCENE-1458: -- bq. How would you use this to seek to the last term in a field? It's not just the last term in a field, since one may be looking for the last term out of any given term range (the highest value of a trie int is not the last value encoded in that field). So if you had a trie-based field, one would find the highest value via seekAtOrBefore(triecoded(MAXINT)) bq. Actually, FIRST/LAST could be achieved with seek-by-ord (plus getUniqueTermCount()). Ahhh... right, prev could be implemented like so: int ord = seek(triecoded(MAXINT)).ord; seek(ord-1) bq. Though that'd only work for TermsEnum impls that support ords. As long as ord is supported at the segment level, it's doable.
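Putting the two ideas together, an at-or-before seek could be layered on top of the ord-based API roughly as follows. This is only a sketch: the flex API was still in flux at this point, so the names (TermsEnum, TermRef, SeekStatus, seek(long ord), ord()) follow the patch as described in these comments and may not match what was eventually committed.
{code}
// Hypothetical helper: position the enum on the latest term that is <= target.
// Assumes seek(TermRef) lands on the earliest term >= target, as described above.
static TermRef seekAtOrBefore(TermsEnum te, TermRef target, long numTerms) throws IOException {
  TermsEnum.SeekStatus status = te.seek(target);
  if (status == TermsEnum.SeekStatus.FOUND) {
    return te.term();                    // exact match counts as "at or before"
  } else if (status == TermsEnum.SeekStatus.END) {
    if (numTerms == 0) return null;      // empty field
    te.seek(numTerms - 1);               // every term is < target: take the last one
    return te.term();
  } else {                               // NOT_FOUND: positioned on the first term > target
    long ord = te.ord();
    if (ord == 0) return null;           // target sorts before every term
    te.seek(ord - 1);                    // step back one term
    return te.term();
  }
}
{code}
With TermRef.FIRST/LAST sentinels (or getUniqueTermCount() supplying numTerms), this would cover the trie use case: seekAtOrBefore(te, triecoded(MAXINT), numTerms) lands on the highest encoded value.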
[jira] Created: (LUCENE-1967) make it easier to access default stopwords for language analyzers
make it easier to access default stopwords for language analyzers - Key: LUCENE-1967 URL: https://issues.apache.org/jira/browse/LUCENE-1967 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Reporter: Robert Muir Priority: Minor DM Smith made the following comment: (sometimes it is hard to dig out the stop set from the analyzers) Looking around, some of these analyzers have very different ways of storing the default list. One idea is to consider generalizing something like what Simon did with LUCENE-1965, LUCENE-1962, and having all stopword lists stored as .txt files in a resources folder. {code} /** * Returns an unmodifiable instance of the default stop-words set. * @return an unmodifiable instance of the default stop-words set. */ public static Set<String> getDefaultStopSet() {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
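One possible shape for such an accessor (class name and resource path here are illustrative, not the proposed API): load the default list once from a bundled .txt resource and hand out an unmodifiable view.
{code}
// Illustrative sketch: a default stop set loaded once from a classpath .txt resource.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public final class DefaultStopwords {
  private static final Set<String> DEFAULT_STOP_SET = loadStopSet("stopwords.txt");

  /** Returns an unmodifiable instance of the default stop-words set. */
  public static Set<String> getDefaultStopSet() {
    return DEFAULT_STOP_SET;
  }

  private static Set<String> loadStopSet(String resource) {
    Set<String> words = new HashSet<String>();
    try {
      BufferedReader in = new BufferedReader(new InputStreamReader(
          DefaultStopwords.class.getResourceAsStream(resource), "UTF-8"));
      try {
        String line;
        while ((line = in.readLine()) != null) {
          line = line.trim();
          if (line.length() > 0 && !line.startsWith("#")) {  // skip blanks and comments
            words.add(line);
          }
        }
      } finally {
        in.close();
      }
    } catch (IOException e) {
      throw new RuntimeException("Unable to load default stopwords: " + resource, e);
    }
    return Collections.unmodifiableSet(words);
  }
}
{code}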
[jira] Assigned: (LUCENE-1967) make it easier to access default stopwords for language analyzers
[ https://issues.apache.org/jira/browse/LUCENE-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer reassigned LUCENE-1967: --- Assignee: Simon Willnauer
[jira] Commented: (LUCENE-1967) make it easier to access default stopwords for language analyzers
[ https://issues.apache.org/jira/browse/LUCENE-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764025#action_12764025 ] Simon Willnauer commented on LUCENE-1967: - Thanks Robert for bringing this up in a general context. I will take care of it soon.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764050#action_12764050 ] Mark Miller commented on LUCENE-1458: - hmm - I think I'm close. Everything passes except for omitTermsTest, LazyProxTest, and for some odd reason the multi term tests. Getting close though. My main concern at the moment is the state capturing. It seems I have to capture the state before readTerm in next() - but I might not use that state if there are multiple next calls before the hit. So that's a lot of wasted capturing. Have to deal with that somehow. Doing things more correctly like this, the gain is much less significant. What really worries me is that my hack test was still slower than the old one - and that skipped a bunch of necessary work, so it's almost a better-than-best case here - I think you might need more gains elsewhere to get back up to speed.
[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764050#action_12764050 ] Mark Miller edited comment on LUCENE-1458 at 10/9/09 8:37 AM: -- hmm - I think I'm close. Everything passes except for omitTermsTest, LazyProxTest, and for some odd reason the multi term tests. Getting close though. My main concern at the moment is the state capturing. It seems I have to capture the state before readTerm in next() - but I might not use that state if there are multiple next calls before the hit. So that's a lot of wasted capturing. Have to deal with that somehow. Doing things more correctly like this, the gain is much less significant. What really worries me is that my hack test was still slower than the old one - and that skipped a bunch of necessary work, so it's almost a better-than-best case here - I think you might need more gains elsewhere to get back up to speed. *edit* Hmm - still no equivalent of the cached enum for one I guess. And at the least, since you only cache when the scan is greater than one, you can at least skip one capture there...
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764060#action_12764060 ] Michael McCandless commented on LUCENE-1458: bq. It seems I have to capture the state before readTerm in next() Wait, how come? It seems like we should only cache if we find exactly the requested term (ie, where we return SeekStatus.FOUND)? So you should only have to capture the state once, there? Hmm I wonder whether we should also cache the seek(ord) calls?
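In code, the rule Mike describes looks roughly like the following. Again, a sketch only: the names (seekInternal, captureState, restoreState, the cache itself) are stand-ins based on this discussion, not the committed flex API.
{code}
// Only pay for a state capture when the seek actually found the requested term;
// the scanning next() calls inside seekInternal() never capture anything.
public SeekStatus seek(TermRef term) throws IOException {
  CachedState cached = cache.get(term);
  if (cached != null) {
    restoreState(cached);                 // reuse the earlier dictionary lookup
    return SeekStatus.FOUND;
  }
  SeekStatus status = seekInternal(term); // may scan several terms via next()
  if (status == SeekStatus.FOUND) {
    cache.put(term, captureState());      // capture once, only for exact hits
  }
  return status;
}
{code}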
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764073#action_12764073 ] Mark Miller commented on LUCENE-1458: - Hmm - I must have something off then. I've never been into this stuff much before. On a cache hit, I'm still calling docs.readTerm(entry.freq, entry.isIndex) - I'm just caching the freq, isIndex, and the positions with a CurrentState object. The captureCurrentState now telescopes down, capturing the state of each object. Perhaps I'm off there - because if I do that, it seems I have to capture the state right before the call to readTerm in next() - otherwise readTerm will move everything forward before I can grab it when I actually put the state into the cache - when it's FOUND. I may be all wet though - no worries - I'm really just playing around trying to learn some of this - the only way I learn is to code. bq. Hmm I wonder whether we should also cache the seek(ord) calls? I was wondering about that, but hadn't even got to thinking about it :) Further steps towards flexible indexing --- Key: LUCENE-1458 URL: https://issues.apache.org/jira/browse/LUCENE-1458 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.9 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2 I attached a very rough checkpoint of my current patch, to get early feedback. All tests pass, though back-compat tests don't pass due to changes to package-private APIs plus certain bugs in tests that happened to work (eg call TermPositions.nextPosition() too many times, which the new API asserts against). [Aside: I think, when we commit changes to package-private APIs such that back-compat tests don't pass, we could go back, make a branch on the back-compat tag, commit changes to the tests to use the new package-private APIs on that branch, then fix the nightly build to use the tip of that branch?] There's still plenty to do before this is committable! This is a rather large change: * Switches to a new, more efficient terms dict format. This still uses tii/tis files, but the tii only stores the term plus a long offset (not a TermInfo). At seek points, tis encodes term freq/prox offsets absolutely instead of with deltas. Also, tis/tii are structured by field, so we don't have to record the field number in every term. . On the first 1 M docs of Wikipedia, the tii file is 36% smaller (0.99 MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB). . RAM usage when loading the terms dict index is significantly less since we only load an array of offsets and an array of String (no more TermInfo array). It should be faster to init too. . This part is basically done. * Introduces a modular reader codec that strongly decouples the terms dict from the docs/positions readers. EG there is no more TermInfo used when reading the new format. .
There's nice symmetry now between reading and writing in the codec chain -- the current docs/prox format is captured in: {code} FormatPostingsTermsDictWriter/Reader FormatPostingsDocsWriter/Reader (.frq file) and FormatPostingsPositionsWriter/Reader (.prx file). {code} This part is basically done. * Introduces a new flex API for iterating through the fields, terms, docs and positions: {code} FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum {code} This replaces TermEnum/Docs/Positions. SegmentReader emulates the old API on top of the new API to keep back-compat. Next steps: * Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions. * Expose new API out of IndexReader, deprecate old API but emulate old API on top of new one, switch all core/contrib users to the new API. * Maybe switch to AttributeSources as the base class for TermsEnum, DocsEnum, PostingsEnum -- this would give readers API flexibility (not just index-file-format flexibility). EG if someone wanted to store payload at the term-doc level instead of term-doc-position level, you could just add a new attribute. * Test performance; iterate. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
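A rough, self-contained sketch of the kind of per-term snapshot Mark describes (freq, isIndex, plus where the postings streams stand), kept in a small LRU map keyed by term text. All of the class and field names below are illustrative assumptions, not the classes in the attached patches:
{code}
// Illustrative only: a per-term snapshot plus a bounded LRU cache for it.
// CurrentState/TermStateCache and the pointer fields are hypothetical names.
import java.util.LinkedHashMap;
import java.util.Map;

class CurrentState {
  final int freq;           // document frequency read from the terms dict
  final boolean isIndex;    // whether this entry came from an index (tii) point
  final long freqPointer;   // assumed: offset into the docs/freqs stream for this term
  final long proxPointer;   // assumed: offset into the positions stream for this term

  CurrentState(int freq, boolean isIndex, long freqPointer, long proxPointer) {
    this.freq = freq;
    this.isIndex = isIndex;
    this.freqPointer = freqPointer;
    this.proxPointer = proxPointer;
  }
}

class TermStateCache extends LinkedHashMap<String, CurrentState> {
  private final int maxSize;

  TermStateCache(int maxSize) {
    super(16, 0.75f, true);   // access-order iteration gives simple LRU behaviour
    this.maxSize = maxSize;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, CurrentState> eldest) {
    return size() > maxSize;  // evict the least-recently-used entry beyond maxSize
  }
}
{code}
The ordering problem Mark mentions then amounts to: build the snapshot before the call that advances the underlying readers, and only put it into the map once the term is actually found.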
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764078#action_12764078 ] Michael Busch commented on LUCENE-1458: --- I added this cache originally because it seemed the easiest way to improve term lookup performance. Now we're adding the burden of implementing such a cache to every codec, right? Maybe instead we should improve the search runtime to not call idf() twice for every term?
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764079#action_12764079 ] Michael McCandless commented on LUCENE-1458: bq. on a cache hit, I'm still calling docs.readTerm(entry.freq, entry.isIndex) Hmm... I think your cache might be one level too low? I think we want the cache to live in StandardTermsDictReader. Only the seek(TermRef) method interacts with the cache for now (until we maybe add ord as well). So, seek first checks if that term is in cache, and if so pulls the opaque state and asks the docsReader to restore to that state. Else, it does the normal seek, but then if the exact term is found, it calls docsReader.captureState and stores it in the cache. Make sure the cache lives high enough to be shared by different TermsEnum instances. I think it should probably live in StandardTermsDictReader.FieldReader. There is one instance of that per field.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
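A minimal, self-contained sketch of that seek flow, with the cache held by a per-field terms-dict reader and the postings reader exposing opaque capture/restore hooks. FieldTermsReader, TermsDict, PostingsReader and their methods are placeholder names for illustration, not the patch's actual API:
{code}
// Sketch of the cache interaction on seek(term) described above. Placeholder types only.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class FieldTermsReader {

  interface PostingsReader {
    Object captureState();             // opaque snapshot of the reader's current positions
    void restoreState(Object state);   // jump straight back to that term's postings
  }

  interface TermsDict {
    boolean seekExact(String term) throws IOException;   // the normal index scan
  }

  private final Map<String, Object> cache = new HashMap<String, Object>();
  private final TermsDict dict;
  private final PostingsReader postings;

  FieldTermsReader(TermsDict dict, PostingsReader postings) {
    this.dict = dict;
    this.postings = postings;
  }

  boolean seek(String term) throws IOException {
    Object cached = cache.get(term);
    if (cached != null) {                         // hit: no terms-dict scan at all
      postings.restoreState(cached);
      return true;
    }
    if (dict.seekExact(term)) {                   // miss: do the normal seek ...
      cache.put(term, postings.captureState());   // ... then remember where we landed
      return true;
    }
    return false;                                 // term not in this segment; nothing to cache
  }
}
{code}
Keeping one such reader per field, as suggested above, lets the cache be shared by all TermsEnum instances for that field.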
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764085#action_12764085 ] Michael McCandless commented on LUCENE-1458: bq. Now we're adding the burden of implementing such a cache to every codec, right? I suspect most codecs will reuse the StandardTermsDictReader, ie, they will usually only change the docs/positions/payloads format. So each codec will only have to implement capture/restoreState. bq. Maybe instead we should improve the search runtime to not call idf() twice for every term? Oh, I didn't realize we call idf() twice per term -- we should separately just fix that. Where are we doing that? (I thought the two calls were the first for idf() and then a 2nd when it's time to get the actual TermDocs/Positions to step through.)
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
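To make the "only implement capture/restoreState" point concrete: assuming the shared terms dict does the seeking, a codec would only supply a hook along the following lines, and for a frq/prx-style codec the captured state can be as small as two file offsets. These names are hypothetical, not from the patch:
{code}
// Hypothetical per-codec hook: each postings reader defines what "state" means for it.
interface PostingsState { }                  // opaque marker type

interface CodecPostingsReader {
  PostingsState captureState();              // snapshot where this reader currently stands
  void restoreState(PostingsState state);    // return there without re-seeking the terms dict
}

// Example state for a codec that only tracks .frq and .prx file positions.
class FreqProxState implements PostingsState {
  final long freqOffset;
  final long proxOffset;

  FreqProxState(long freqOffset, long proxOffset) {
    this.freqOffset = freqOffset;
    this.proxOffset = proxOffset;
  }
}
{code}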
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764089#action_12764089 ] Michael Busch commented on LUCENE-1458: --- {quote} Oh I didn't realize we call idf() twice per term {quote} Hmm I take that back. I looked in LUCENE-1195 again: {quote} Currently we have a bottleneck for multi-term queries: the dictionary lookup is being done twice for each term. The first time in Similarity.idf(), where searcher.docFreq() is called. The second time when the posting list is opened (TermDocs or TermPositions). {quote} Hmm something's wrong with my memory this morning! Maybe the lack of caffeine :)
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
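In terms of the current public API, the double lookup quoted from LUCENE-1195 looks roughly like this; the Directory argument, field name and term text are placeholders for the example:
{code}
// Demonstrates the two dictionary lookups for one query term: docFreq() for idf,
// then termDocs() to open the postings for scoring.
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.Directory;

public class TwoLookups {
  public static void demo(Directory dir) throws Exception {
    IndexReader reader = IndexReader.open(dir);
    try {
      Term term = new Term("contents", "lucene");
      int df = reader.docFreq(term);          // lookup #1: terms dict, feeds Similarity.idf()
      System.out.println("docFreq=" + df);
      TermDocs td = reader.termDocs(term);    // lookup #2: same terms dict, to open postings
      try {
        while (td.next()) {
          // td.doc() / td.freq() would be scored with a weight derived from df
        }
      } finally {
        td.close();
      }
    } finally {
      reader.close();
    }
  }
}
{code}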
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764114#action_12764114 ] Mark Miller commented on LUCENE-1458: - Ah - okay - that helps. I think the cache itself is currently at about the right level (StandardTermsDictReader, and it gets hit pretty hard), but I thought it was funky that I still had to make that read call - I think I see now how it should work without that, by just cueing the docsReader up to where it should be. We will see. Vacation till Tuesday - don't let me stop you from doing it correctly if it's on your timeline. Just playing over here - and I don't have a lot of time to play, really.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1958) ShingleFilter creates shingles across two consecutive documents: bug or normal behaviour?
[ https://issues.apache.org/jira/browse/LUCENE-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764134#action_12764134 ] MRIT64 commented on LUCENE-1958: - Yes, I use a custom analyser which uses reusableToken - I don't know if reusableToken is supported or not in this version, but the method next(Token reusableToken) is listed in the ShingleFilter 2.4.1 Javadoc (see http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/shingle/ShingleFilter.html). That's the reason why I have used it; I don't know how it works internally, and there is nothing mentioned in the documentation. Anyway, it doesn't matter now because the problem doesn't occur with Lucene 2.9. Regards ShingleFilter creates shingles across two consecutive documents: bug or normal behaviour? Key: LUCENE-1958 URL: https://issues.apache.org/jira/browse/LUCENE-1958 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.4.1 Environment: Windows XP / jdk1.6.0_15 Reporter: MRIT64 Priority: Minor Hi, I add two consecutive documents that are indexed with some filters. The last one is ShingleFilter. ShingleFilter creates a shingle spanning the two documents, which makes no sense in my context. Is that a bug or is it ShingleFilter's normal behaviour? If it's normal behaviour, is it possible to change it optionally? Thanks MR -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1958) ShingleFilter creates shingles across two consecutive documents: bug or normal behaviour?
[ https://issues.apache.org/jira/browse/LUCENE-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764141#action_12764141 ] Robert Muir commented on LUCENE-1958: - MRIT64, actually I am curious not about next(reusableToken), but instead about whether your Analyzer implements {code} public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {code} If you were trying to reuse ShingleFilters in 2.4.1 with this technique, I think this would be unsafe. It is safe in 2.9. bq. Anyway, it doesn't matter now because the problem doesn't occur with Lucene 2.9. Ok to mark this issue as resolved? ShingleFilter creates shingles across two consecutive documents: bug or normal behaviour? Key: LUCENE-1958 URL: https://issues.apache.org/jira/browse/LUCENE-1958 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.4.1 Environment: Windows XP / jdk1.6.0_15 Reporter: MRIT64 Priority: Minor Hi, I add two consecutive documents that are indexed with some filters. The last one is ShingleFilter. ShingleFilter creates a shingle spanning the two documents, which makes no sense in my context. Is that a bug or is it ShingleFilter's normal behaviour? If it's normal behaviour, is it possible to change it optionally? Thanks MR -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
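For context, the reuse pattern Robert is asking about looks roughly like this against the 2.9 Analyzer API; the SavedStreams holder, the WhitespaceTokenizer source and the analyzer name are assumptions for the example, not code from this issue:
{code}
// Sketch of an Analyzer that reuses a ShingleFilter chain via reusableTokenStream().
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;

public class ShingleReuseAnalyzer extends Analyzer {

  private static class SavedStreams {
    WhitespaceTokenizer source;
    TokenStream result;
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new ShingleFilter(new WhitespaceTokenizer(reader));
  }

  public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    SavedStreams streams = (SavedStreams) getPreviousTokenStream();
    if (streams == null) {
      streams = new SavedStreams();
      streams.source = new WhitespaceTokenizer(reader);
      streams.result = new ShingleFilter(streams.source);
      setPreviousTokenStream(streams);
    } else {
      streams.source.reset(reader);   // point the saved chain at the next document's text
    }
    return streams.result;
  }
}
{code}
Robert's point is that wiring ShingleFilter into a reused chain like this was unsafe in 2.4.1 but is safe in 2.9.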
[jira] Commented: (LUCENE-1822) FastVectorHighlighter: SimpleFragListBuilder hard-coded 6 char margin is too naive
[ https://issues.apache.org/jira/browse/LUCENE-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764168#action_12764168 ] Chas Emerick commented on LUCENE-1822: -- Thank you for the patch. I agree, the context surrounding each fragment could definitely be improved. FastVectorHighlighter: SimpleFragListBuilder hard-coded 6 char margin is too naive -- Key: LUCENE-1822 URL: https://issues.apache.org/jira/browse/LUCENE-1822 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.9 Environment: any Reporter: Alex Vigdor Priority: Minor Attachments: LUCENE-1822.patch The new FastVectorHighlighter performs extremely well; however, I've found in testing that the window of text chosen per fragment is often very poor, as SimpleFragListBuilder is hard-coded to always start the fragment 6 characters to the left of the first phrase match. When selecting long fragments, this often means that there is barely any context before the highlighted word, and lots after; even worse, when highlighting a phrase at the end of a short text, the beginning is cut off, even though the entire phrase would fit in the specified fragCharSize. For example, highlighting Punishment in Crime and Punishment returns "e and <b>Punishment</b>" no matter what fragCharSize is specified. I am going to attach a patch that improves the text window selection by recalculating the starting margin once all phrases in the fragment have been identified - this way, if a single word is matched in a fragment, it will appear in the middle of the highlight instead of 6 characters from the beginning. This way one can also guarantee that the entirety of a short text is represented in a fragment by specifying a large enough fragCharSize. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
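A standalone sketch of the recentering idea in the description (not the code in LUCENE-1822.patch): once the matched span is known, split the leftover space evenly around it and clamp against the ends of the text. Class and method names are made up for this example:
{code}
// Illustrative arithmetic only; not part of the FastVectorHighlighter API.
public final class FragmentWindow {

  private FragmentWindow() {}

  /** Returns the fragment start offset for a match covering [matchStart, matchEnd). */
  public static int fragmentStart(int matchStart, int matchEnd, int fragCharSize, int textLength) {
    int matchLen = matchEnd - matchStart;
    if (matchLen >= fragCharSize) {
      return matchStart;                          // match fills the window: start at the match
    }
    int margin = (fragCharSize - matchLen) / 2;   // split the leftover space evenly
    int start = matchStart - margin;
    if (start + fragCharSize > textLength) {      // don't run off the end of short texts
      start = textLength - fragCharSize;
    }
    return Math.max(0, start);                    // and never start before the text begins
  }
}
{code}
For the Crime and Punishment example above (text length 20, match on Punishment at offsets 10-20, fragCharSize 20), this returns 0, so the whole title lands in the fragment instead of a snippet starting at "e and".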
[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1458: --- Attachment: LUCENE-1458.patch New patch attached. All tests pass. A few small changes (eg sync'd to trunk), but the biggest change is a new test case (TestExternalCodecs) that contains two new codecs: * RAMOnlyCodec -- like the contrib InstantiatedIndex, it writes and reads all postings in RAM, in dedicated classes * PerFieldCodecWrapper -- dispatches by field name to different codecs (this was asked about a couple of times) The test indexes one field using the standard codec and the other using the RAMOnlyCodec. It also verifies that one can in fact make a custom codec external to oal.index.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
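A bare-bones illustration of the "dispatch by field name" idea behind PerFieldCodecWrapper; the FieldCodec interface and class names below are hypothetical stand-ins, since the patch's actual Codec API is still evolving:
{code}
// Illustrative only: route each field to its own postings implementation,
// falling back to a default codec for unregistered fields.
import java.util.HashMap;
import java.util.Map;

interface FieldCodec {
  String name();   // e.g. "Standard" or "RAMOnly" in the test described above
}

class PerFieldDispatch {
  private final Map<String, FieldCodec> perField = new HashMap<String, FieldCodec>();
  private final FieldCodec defaultCodec;

  PerFieldDispatch(FieldCodec defaultCodec) {
    this.defaultCodec = defaultCodec;
  }

  void register(String fieldName, FieldCodec codec) {
    perField.put(fieldName, codec);
  }

  FieldCodec codecFor(String fieldName) {
    FieldCodec codec = perField.get(fieldName);
    return codec != null ? codec : defaultCodec;
  }
}
{code}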
Hudson build is back to normal: Lucene-trunk #974
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/974/changes - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org