[jira] Closed: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header

2009-10-09 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin closed LUCENE-1947.
---

Resolution: Fixed

Committed in revision 823445

 Snowball package contains BSD licensed code with ASL header
 ---

 Key: LUCENE-1947
 URL: https://issues.apache.org/jira/browse/LUCENE-1947
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/analyzers
Affects Versions: 2.9
Reporter: Karl Wettin
Assignee: Karl Wettin
 Fix For: 3.0

 Attachments: LUCENE-1947.patch, LUCENE-1947.patch


 All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) 
 have for some reason been given an ASL header. These classes are licensed 
 under BSD, so the ASL header should be removed. I suppose this is a mistake, 
 possibly caused by the ASL header automation tool.




[jira] Commented: (LUCENE-1959) Index Splitter

2009-10-09 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12763919#action_12763919
 ] 

Uwe Schindler commented on LUCENE-1959:
---

Ah ok, I didn't look into the test failure yesterday (it was too late in the 
evening); I only wanted to sketch a quick design and see if it would generally 
work. But you are right: the numDocs() return value is incorrect, leading to a 
failure in this test. Since the test passes in your environment, though, the 
assertion in the SegmentMerger seems not to matter for functionality, so in 
general both my code and your first code would work correctly. I do not know 
how costly the initial building of the BitSet for the input reader's deleted 
docs is, but one possibility would be to build/use the additional bitset only 
if hasDeletions() on the original index returns true.

Thanks for clarifying.
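
A minimal sketch of that lazy-allocation idea (the class and variable names 
here are my own assumptions, not the actual patch):

{code}
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.util.OpenBitSet;

class LazyDeletions {
  // Only materialize the old-deletions bitset when the input reader
  // actually has deletions, so the common case pays nothing.
  static OpenBitSet oldDeletions(IndexReader input) {
    if (!input.hasDeletions()) {
      return null;                     // no deletions: skip the allocation
    }
    OpenBitSet dels = new OpenBitSet(input.maxDoc());
    for (int i = 0; i < input.maxDoc(); i++) {
      if (input.isDeleted(i)) {
        dels.set(i);                   // record each deleted docID
      }
    }
    return dels;
  }
}
{code}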

 Index Splitter
 --

 Key: LUCENE-1959
 URL: https://issues.apache.org/jira/browse/LUCENE-1959
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.9
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 3.0

 Attachments: LUCENE-1959.patch, LUCENE-1959.patch, 
 mp-splitter-inline.patch, mp-splitter.patch, mp-splitter2.patch


 If an index has multiple segments, this tool allows splitting those segments 
 into separate directories.  




[jira] Commented: (LUCENE-1959) Index Splitter

2009-10-09 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12763933#action_12763933
 ] 

Andrzej Bialecki  commented on LUCENE-1959:
---

The test passed in Eclipse only - 'ant test' run from the command line didn't 
pass without this fix, so I suspect my Eclipse setup is to blame for hiding the 
problem. Re: lazy allocation of the bitset - good point, I'll make this change.

 Index Splitter
 --

 Key: LUCENE-1959
 URL: https://issues.apache.org/jira/browse/LUCENE-1959
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.9
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 3.0

 Attachments: LUCENE-1959.patch, LUCENE-1959.patch, 
 mp-splitter-inline.patch, mp-splitter.patch, mp-splitter2.patch


 If an index has multiple segments, this tool allows splitting those segments 
 into separate directories.  




Re: Build failed in Hudson: Lucene-trunk #973

2009-10-09 Thread Michael McCandless
This failure was the back-compat test for TestConcurrentMergeScheduler,
caused by my removing autoCommit (LUCENE-1950).

I've already fixed it, but didn't plant a new back-compat tag (I was
hoping it'd simply pass until we made a new back-compat tag, which
happens daily :)  But it didn't).

I'll make a new back-compat tag now...

Mike

On Thu, Oct 8, 2009 at 11:04 PM, Apache Hudson Server
hud...@hudson.zones.apache.org wrote:
 See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/973/changes

 Changes:

 [mikemccand] LUCENE-1950: remove autoCommit entirely

 [mikemccand] LUCENE-1950: fix intermittent exception in testcase

 [mikemccand] revert accidental commit; switch to new back-compat tag

 [mikemccand] LUCENE-1951: fix WildcardQuery to correctly rewrite single term 
 query and prefix query

 [mikemccand] LUCENE-1950: remove autoCommit=true from IndexWriter

 [simonw] fixed spelling

 [simonw] LUCENE-1965, LUCENE-1962: Added possible performance improvements for 
 Persian-, Arabic- and SmartChineseAnalyzer to changes.txt

 [simonw] LUCENE-1965: Lazy Atomic Loading Stopwords in SmartCN

 [buschmi] LUCENE-1961: Remove remaining deprecations from document package.

 [buschmi] LUCENE-1961: switch to latest back-compat tag.

 [simonw] LUCENE-1962: Cleaned up Persian & Arabic Analyzer. Prevent default 
 stopword list from being loaded more than once.
 - replace if blocks with a single switch
 - marking private members final where needed
 - changed protected visibility to final in final class.

 [koji] LUCENE-1953: FastVectorHighlighter: small fragCharSize can cause 
 StringIndexOutOfBoundsException

 [mikemccand] LUCENE-1959: reuse the copy buffer

 [mikemccand] LUCENE-1959: add IndexSplitter tool to pull segment files out of 
 an index into another

 --
 [...truncated 15926 lines...]
    [junit]
    [junit] Testsuite: org.apache.lucene.search.TestSimilarity
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.602 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.search.TestSimpleExplanations
    [junit] Tests run: 53, Failures: 0, Errors: 0, Time elapsed: 14.753 sec
    [junit]
    [junit] Testsuite: 
 org.apache.lucene.search.TestSimpleExplanationsOfNonMatches
    [junit] Tests run: 53, Failures: 0, Errors: 0, Time elapsed: 1.043 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.search.TestSloppyPhraseQuery
    [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 1.579 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.search.TestSort
    [junit] Tests run: 23, Failures: 0, Errors: 0, Time elapsed: 6.046 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.search.TestSpanQueryFilter
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.698 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.search.TestStressSort
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 43.554 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.search.TestTermRangeFilter
    [junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 4.856 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.search.TestTermRangeQuery
    [junit] Tests run: 10, Failures: 0, Errors: 0, Time elapsed: 0.983 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.search.TestTermScorer
    [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.584 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.search.TestTermVectors
    [junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 1.628 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.search.TestThreadSafe
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 8.821 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.search.TestTimeLimitingCollector
    [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 8.07 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.search.TestTopDocsCollector
    [junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 0.556 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.search.TestTopScoreDocCollector
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.515 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.search.TestWildcard
    [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.794 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.search.function.TestCustomScoreQuery
    [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 120.864 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.search.function.TestDocValues
    [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.335 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.search.function.TestFieldScoreQuery
    [junit] Tests run: 12, Failures: 0, Errors: 0, Time elapsed: 2.335 sec
    [junit]
    [junit] Testsuite: org.apache.lucene.search.function.TestOrdValues
    [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 1.557 sec
    [junit]
    [junit] Testsuite: 

[jira] Updated: (LUCENE-1959) Index Splitter

2009-10-09 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated LUCENE-1959:
--

Attachment: mp-splitter3.patch

As suggested by Uwe, don't allocate the old deletions bitset if there are no 
deletions.

 Index Splitter
 --

 Key: LUCENE-1959
 URL: https://issues.apache.org/jira/browse/LUCENE-1959
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.9
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 3.0

 Attachments: LUCENE-1959.patch, LUCENE-1959.patch, 
 mp-splitter-inline.patch, mp-splitter.patch, mp-splitter2.patch, 
 mp-splitter3.patch


 If an index has multiple segments, this tool allows splitting those segments 
 into separate directories.  




[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12763946#action_12763946
 ] 

Michael McCandless commented on LUCENE-1458:


No problem :)  Please post the patch once you have it working!  We'll need to 
implement captureState/seek for the other codecs too.  The pulsing case will be 
interesting since its state will hold the actual postings for the low-freq 
case.

BTW I think an interesting codec would be one that pre-loads postings into RAM, 
storing them uncompressed (eg docs/positions as simple int[]) or slightly 
compressed (stored as packed bits).  This should be a massive performance win 
at the expense of sizable RAM consumption, ie it makes the same tradeoff as 
contrib/memory and contrib/instantiated.
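
A toy illustration of that idea (purely a sketch - no such codec exists in the 
patch, and all names below are invented):

{code}
import java.util.HashMap;
import java.util.Map;

// Hold each term's postings fully decoded in RAM as plain int arrays,
// trading memory for zero decode cost at search time.
class RamPostings {
  // termText -> docIDs (a real codec would also keep freqs/positions)
  private final Map<String, int[]> docs = new HashMap<String, int[]>();

  void add(String term, int[] docIDs) {
    docs.put(term, docIDs);
  }

  int[] docsFor(String term) {
    return docs.get(term);             // O(1) lookup, no disk seek
  }
}
{code}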

 Further steps towards flexible indexing
 ---

 Key: LUCENE-1458
 URL: https://issues.apache.org/jira/browse/LUCENE-1458
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
 LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
 LUCENE-1458.tar.bz2


 I attached a very rough checkpoint of my current patch, to get early
 feedback.  All tests pass, though back-compat tests don't pass due to
 changes to package-private APIs plus certain bugs in tests that
 happened to work (eg calling TermPositions.nextPosition() too many times,
 which the new API asserts against).
 [Aside: I think, when we commit changes to package-private APIs such
 that back-compat tests don't pass, we could go back, make a branch on
 the back-compat tag, commit changes to the tests to use the new
 package-private APIs on that branch, then fix the nightly build to use the
 tip of that branch?]
 There's still plenty to do before this is committable! This is a
 rather large change:
   * Switches to a new, more efficient terms dict format.  This still
 uses tii/tis files, but the tii only stores term & long offset
 (not a TermInfo).  At seek points, tis encodes term & freq/prox
 offsets absolutely instead of with deltas.  Also, tis/tii
 are structured by field, so we don't have to record the field number
 in every term.
 .
 On the first 1 M docs of Wikipedia, the tii file is 36% smaller (0.99 MB
 -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB).
 .
 RAM usage when loading the terms dict index is significantly less,
 since we only load an array of offsets and an array of String (no
 more TermInfo array).  It should be faster to init too.
 .
 This part is basically done.
   * Introduces a modular reader codec that strongly decouples the terms
 dict from the docs/positions readers.  EG there is no more TermInfo used
 when reading the new format.
 .
 There's nice symmetry now between reading & writing in the codec
 chain -- the current docs/prox format is captured in:
 {code}
 FormatPostingsTermsDictWriter/Reader
 FormatPostingsDocsWriter/Reader (.frq file) and
 FormatPostingsPositionsWriter/Reader (.prx file).
 {code}
 This part is basically done.
   * Introduces a new flex API for iterating through the fields,
 terms, docs and positions:
 {code}
 FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
 {code}
 This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
 old API on top of the new API to keep back-compat.
 
 Next steps:
   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
 fix any hidden assumptions.
   * Expose the new API out of IndexReader, deprecate the old API but
 emulate it on top of the new one, and switch all core/contrib users
 to the new API.
   * Maybe switch to AttributeSource as the base class for TermsEnum,
 DocsEnum, PostingsEnum -- this would give readers API flexibility
 (not just index-file-format flexibility).  EG if someone wanted
 to store payloads at the term-doc level instead of the
 term-doc-position level, you could just add a new attribute.
   * Test performance & iterate.


[jira] Commented: (LUCENE-1959) Index Splitter

2009-10-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12763950#action_12763950
 ] 

Michael McCandless commented on LUCENE-1959:


Good progress!  Andrzej, how about you go ahead & commit yourself?

 Index Splitter
 --

 Key: LUCENE-1959
 URL: https://issues.apache.org/jira/browse/LUCENE-1959
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.9
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 3.0

 Attachments: LUCENE-1959.patch, LUCENE-1959.patch, 
 mp-splitter-inline.patch, mp-splitter.patch, mp-splitter2.patch, 
 mp-splitter3.patch


 If an index has multiple segments, this tool allows splitting those segments 
 into separate directories.  




[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12763956#action_12763956
 ] 

Michael McCandless commented on LUCENE-1458:



{quote}
Another for the TermsEnum wishlist: the ability to seek to the term before the 
given term... useful for finding the largest value in a field, etc.
I imagine at or before semantics would also work (like the current semantics 
of TermEnum in reverse)
{quote}

Right now seek(TermRef seekTerm) stops at the earliest term that's >=
seekTerm.

It sounds like you're asking for a variant of seek that'd stop at the
latest term that's <= seekTerm?

How would you use this to seek to the last term in a field?  With the
flex API, the TermsEnum only works with a single field's terms.  So I
guess we'd need TermRef constants, eg TermRef.FIRST and TermRef.LAST,
that act like -infinity / +infinity.
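
To make the two flavors concrete, here is a small standalone sketch over a 
sorted term array (illustration only - seekCeil/seekFloor are not names from 
the patch):

{code}
import java.util.Arrays;

class SeekDemo {
  // smallest index with terms[i] >= t, or terms.length if none:
  // the current seek(TermRef) behavior
  static int seekCeil(String[] terms, String t) {
    int i = Arrays.binarySearch(terms, t);
    return i >= 0 ? i : -i - 1;        // insertion point on a miss
  }

  // largest index with terms[i] <= t, or -1 if none:
  // the requested "at or before" variant
  static int seekFloor(String[] terms, String t) {
    int i = Arrays.binarySearch(terms, t);
    return i >= 0 ? i : -i - 2;        // one before the insertion point
  }
}
{code}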






[jira] Updated: (LUCENE-1959) Index Splitter

2009-10-09 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated LUCENE-1959:
--

Attachment: mp-splitter4.patch

I moved the files in this patch to contrib/misc and updated the 
contrib/CHANGES.txt. If there are no objections I'll commit it soon.

 Index Splitter
 --

 Key: LUCENE-1959
 URL: https://issues.apache.org/jira/browse/LUCENE-1959
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.9
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 3.0

 Attachments: LUCENE-1959.patch, LUCENE-1959.patch, 
 mp-splitter-inline.patch, mp-splitter.patch, mp-splitter2.patch, 
 mp-splitter3.patch, mp-splitter4.patch


 If an index has multiple segments, this tool allows splitting those segments 
 into separate directories.  




[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12763984#action_12763984
 ] 

Michael McCandless commented on LUCENE-1458:


Actually, FIRST/LAST could be achieved with seek-by-ord (plus 
getUniqueTermCount()).  Though that'd only work for TermsEnum impls that 
support ords.
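
As pseudocode (a sketch only - whether ords live on the enum or elsewhere is 
still open at this point):

{code}
// Jump straight to the first and last terms of a field by ordinal,
// emulating the proposed TermRef.FIRST / TermRef.LAST constants.
termsEnum.seek(0);                        // first term
termsEnum.seek(getUniqueTermCount() - 1); // last term
{code}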





[jira] Updated: (LUCENE-1966) Arabic Analyzer: Stopwords list needs enhancement

2009-10-09 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1966:


Affects Version/s: (was: 2.9.1)
   2.9
Fix Version/s: (was: 2.9)
   3.0

 Arabic Analyzer: Stopwords list needs enhancement
 -

 Key: LUCENE-1966
 URL: https://issues.apache.org/jira/browse/LUCENE-1966
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.9
Reporter: Basem Narmok
Assignee: Robert Muir
Priority: Trivial
 Fix For: 3.0

 Attachments: arabic-stopwords-comments.txt, LUCENE-1966.patch


 The provided Arabic stopwords list needs some enhancements (e.g. it contains 
 a lot of words that are not stopwords, and it needs some cleanup). A patch 
 will be provided with this issue.




[jira] Resolved: (LUCENE-1963) ArabicAnalyzer: Lowercase before Stopfilter

2009-10-09 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-1963.
-

Resolution: Fixed

Committed revision 823534.
(If it is OK to apply this to the 2.9 branch, as DM requested, we should reopen.)


 ArabicAnalyzer: Lowercase before Stopfilter
 ---

 Key: LUCENE-1963
 URL: https://issues.apache.org/jira/browse/LUCENE-1963
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.9
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Trivial
 Fix For: 3.0

 Attachments: LUCENE-1963.patch, LUCENE-1963.patch


 ArabicAnalyzer lowercases text in case you have some non-Arabic text around.
 It also allows you to set a custom stopword list (you might augment the 
 Arabic list with some English ones, for example).
 In this case it's helpful for these non-Arabic stopwords to lowercase before 
 the StopFilter.
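
A minimal sketch of the ordering being fixed, assembled from 2.9-era public 
classes for illustration (not the committed patch itself):

{code}
import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ar.ArabicLetterTokenizer;

class LowercaseThenStop {
  static TokenStream build(Reader reader, Set stopSet) {
    TokenStream ts = new ArabicLetterTokenizer(reader);
    ts = new LowerCaseFilter(ts);       // lowercase first...
    return new StopFilter(ts, stopSet); // ...so "The" matches stopword "the"
  }
}
{code}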





[jira] Commented: (LUCENE-1963) ArabicAnalyzer: Lowercase before Stopfilter

2009-10-09 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764013#action_12764013
 ] 

Mark Miller commented on LUCENE-1963:
-

Your issue - if you can stretch it into bug-ish territory, I'd +1 it. I'd be wary 
of getting into porting features to 2.9.1 - but I wouldn't have a problem with 
this one myself.

 ArabicAnalyzer: Lowercase before Stopfilter
 ---

 Key: LUCENE-1963
 URL: https://issues.apache.org/jira/browse/LUCENE-1963
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.9
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Trivial
 Fix For: 3.0

 Attachments: LUCENE-1963.patch, LUCENE-1963.patch


 ArabicAnalyzer lowercases text in case you have some non-Arabic text around.
 It also allows you to set a custom stopword list (you might augment the 
 Arabic list with some English ones, for example).
 In this case it's helpful for these non-Arabic stopwords to lowercase before 
 the StopFilter.




[jira] Updated: (LUCENE-1963) ArabicAnalyzer: Lowercase before Stopfilter

2009-10-09 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1963:


Fix Version/s: 2.9.1

 ArabicAnalyzer: Lowercase before Stopfilter
 ---

 Key: LUCENE-1963
 URL: https://issues.apache.org/jira/browse/LUCENE-1963
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.9
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Trivial
 Fix For: 2.9.1, 3.0

 Attachments: LUCENE-1963.patch, LUCENE-1963.patch


 ArabicAnalyzer lowercases text in case you have some non-Arabic text around.
 It also allows you to set a custom stopword list (you might augment the 
 Arabic list with some English ones, for example).
 In this case it's helpful for these non-Arabic stopwords to lowercase before 
 the StopFilter.




[jira] Reopened: (LUCENE-1963) ArabicAnalyzer: Lowercase before Stopfilter

2009-10-09 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reopened LUCENE-1963:
-


Mark, I think the problem is really that I overlooked this use case in 
LUCENE-1758, because Arabic is not case sensitive.

It won't affect the default usage of the Analyzer (where all the stopwords are 
in Arabic and lowercase is a no-op).

I am also going to set Fix For: 2.9.1 and give it a day or two for people to 
comment if they disagree with applying this to the 2.9 branch.

 ArabicAnalyzer: Lowercase before Stopfilter
 ---

 Key: LUCENE-1963
 URL: https://issues.apache.org/jira/browse/LUCENE-1963
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.9
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Trivial
 Fix For: 2.9.1, 3.0

 Attachments: LUCENE-1963.patch, LUCENE-1963.patch


 ArabicAnalyzer lowercases text in case you have some non-Arabic text around.
 It also allows you to set a custom stopword list (you might augment the 
 Arabic list with some English ones, for example).
 In this case it's helpful for these non-Arabic stopwords to lowercase before 
 the StopFilter.




[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-09 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764020#action_12764020
 ] 

Yonik Seeley commented on LUCENE-1458:
--

bq. How would you use this to seek to the last term in a field?

It's not just the last term in a field, since one may be looking for the last 
term of any given term range (the highest value of a trie int is not the last 
value encoded in that field).
So if you had a trie-based field, one would find the highest value via 
seekAtOrBefore(triecoded(MAXINT)).

bq. Actually, FIRST/LAST could be achieved with seek-by-ord (plus 
getUniqueTermCount()).

Ahhh... right, prev could be implemented like so:

int ord = seek(triecoded(MAXINT)).ord
seek(ord - 1)

bq. Though that'd only work for TermsEnum impls that support ords. 

As long as ord is supported at the segment level, it's doable.
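
Spelled out as pseudocode (method names come from this thread's discussion, 
not from a committed API, and edge cases such as an exhausted enum are 
omitted):

{code}
// Find the latest term <= the trie-coded maximum, i.e. the highest
// value actually present in the field.
long ord = termsEnum.seek(triecoded(MAXINT)).ord;  // earliest term >= target
termsEnum.seek(ord - 1);                           // step back one ordinal
{code}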



[jira] Created: (LUCENE-1967) make it easier to access default stopwords for language analyzers

2009-10-09 Thread Robert Muir (JIRA)
make it easier to access default stopwords for language analyzers
-

 Key: LUCENE-1967
 URL: https://issues.apache.org/jira/browse/LUCENE-1967
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Reporter: Robert Muir
Priority: Minor


DM Smith made the following comment: (sometimes it is hard to dig out the stop 
set from the analyzers)

Looking around, some of these analyzers have very different ways of storing the 
default list. One idea is to generalize something like what Simon did with 
LUCENE-1965 and LUCENE-1962, and have all stopword lists stored as .txt files 
in a resources folder.

{code}
  /**
   * Returns an unmodifiable instance of the default stop-words set.
   * @return an unmodifiable instance of the default stop-words set.
   */
  public static Set<String> getDefaultStopSet()
{code}
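
One possible shape for this (a sketch under the assumptions above - the class 
name and word list are invented; the real proposal would read the words from a 
bundled .txt resource):

{code}
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class DefaultStopwords {
  private static final Set<String> STOP_SET;
  static {
    Set<String> s = new HashSet<String>();
    // hard-coded for brevity; imagine these loaded from stopwords.txt
    s.add("the"); s.add("a"); s.add("of");
    STOP_SET = Collections.unmodifiableSet(s);
  }

  /** Returns an unmodifiable instance of the default stop-words set. */
  public static Set<String> getDefaultStopSet() {
    return STOP_SET;
  }
}
{code}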





[jira] Assigned: (LUCENE-1967) make it easier to access default stopwords for language analyzers

2009-10-09 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer reassigned LUCENE-1967:
---

Assignee: Simon Willnauer

 make it easier to access default stopwords for language analyzers
 -

 Key: LUCENE-1967
 URL: https://issues.apache.org/jira/browse/LUCENE-1967
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Simon Willnauer
Priority: Minor

 DM Smith made the following comment: (sometimes it is hard to dig out the 
 stop set from the analyzers)
 Looking around, some of these analyzers have very different ways of storing 
 the default list.
 One idea is to consider generalizing something like what Simon did with 
 LUCENE-1965, LUCENE-1962,
 and having all stopwords lists stored as .txt files in resources folder.
 {code}
   /**
* Returns an unmodifiable instance of the default stop-words set.
* @return an unmodifiable instance of the default stop-words set.
*/
   public static Set<String> getDefaultStopSet()
 {code}




[jira] Commented: (LUCENE-1967) make it easier to access default stopwords for language analyzers

2009-10-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764025#action_12764025
 ] 

Simon Willnauer commented on LUCENE-1967:
-

Thanks Robert for bringing this up in a general context. I will take care of it 
soon.

 make it easier to access default stopwords for language analyzers
 -

 Key: LUCENE-1967
 URL: https://issues.apache.org/jira/browse/LUCENE-1967
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Simon Willnauer
Priority: Minor

 DM Smith made the following comment: (sometimes it is hard to dig out the 
 stop set from the analyzers)
 Looking around, some of these analyzers have very different ways of storing 
 the default list.
 One idea is to consider generalizing something like what Simon did with 
 LUCENE-1965, LUCENE-1962,
 and having all stopwords lists stored as .txt files in resources folder.
 {code}
   /**
* Returns an unmodifiable instance of the default stop-words set.
* @return an unmodifiable instance of the default stop-words set.
*/
   public static Set<String> getDefaultStopSet()
 {code}







[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-09 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764050#action_12764050
 ] 

Mark Miller commented on LUCENE-1458:
-

hmm - I think I'm close. Everything passes except for omitTermsTest, 
LazyProxTest, and for some odd reason the multi term tests. Getting close 
though.

My main concern at the moment is the state capturing. It seems I have to 
capture the state before readTerm in next() - but I might not use that state if 
there are multiple next calls before the hit. So that's a lot of wasted 
capturing. Have to deal with that somehow.

Doing things more correctly like this, the gain is much less significant. What 
really worries me is that my hack test was still slower than the old - and that 
skipped a bunch of necessary work, so it's almost a better-than-best case here - 
I think you might need more gains elsewhere to get back up to speed.





[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing

2009-10-09 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764050#action_12764050
 ] 

Mark Miller edited comment on LUCENE-1458 at 10/9/09 8:37 AM:
--

hmm - I think I'm close. Everything passes except for omitTermsTest, 
LazyProxTest, and for some odd reason the multi term tests. Getting close 
though.

My main concern at the moment is the state capturing. It seems I have to 
capture the state before readTerm in next() - but I might not use that state if 
there are multiple next calls before the hit. So that's a lot of wasted 
capturing. Have to deal with that somehow.

Doing things more correctly like this, the gain is much less significant. What 
really worries me is that my hack test was still slower than the old - and that 
skipped a bunch of necessary work, so it's almost a better-than-best case here - 
I think you might need more gains elsewhere to get back up to speed.

*edit*

Hmm - still no equivalent of the cached enum for one I guess.
And at the least, since you only cache when the scan is greater than one, you 
can at least skip one capture there...

  was (Author: markrmil...@gmail.com):
hmm - I think I'm close. Everything passes except for omitTermsTest, 
LazyProxTest, and for some odd reason the multi term tests. Getting close 
though.

My main concern at the moment is the state capturing. It seems I have to 
capture the state before readTerm in next() - but I might not use that state if 
there are multiple next calls before the hit. So that's a lot of wasted 
capturing. Have to deal with that somehow.

Doing things more correctly like this, the gain is much less significant. What 
really worries me is that my hack test was still slower than the old - and that 
skipped a bunch of necessary work, so it's almost a better-than-best case here - 
I think you might need more gains elsewhere to get back up to speed.
  

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764060#action_12764060
 ] 

Michael McCandless commented on LUCENE-1458:


bq. It seems I have to capture the state before readTerm in next() 

Wait, how come?  It seems like we should only cache if we find exactly the 
requested term (ie, where we return SeekStatus.FOUND)?  So you should only have 
to capture the state once, there?

Hmm I wonder whether we should also cache the seek(ord) calls?
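
The suggested ordering, as pseudocode against the in-progress flex API (the 
cache and the captureState usage are assumptions from this thread, not 
committed code):

{code}
// Only capture on an exact match, so scans and misses pay no capture cost.
SeekStatus status = termsEnum.seek(term);
if (status == SeekStatus.FOUND) {
  cache.put(term, termsEnum.captureState());  // one capture, on hits only
}
{code}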

 Further steps towards flexible indexing
 ---

 Key: LUCENE-1458
 URL: https://issues.apache.org/jira/browse/LUCENE-1458
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
 LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
 LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
 LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
 LUCENE-1458.tar.bz2


 I attached a very rough checkpoint of my current patch, to get early
 feedback.  All tests pass, though back compat tests don't pass due to
 changes to package-private APIs plus certain bugs in tests that
 happened to work (eg call TermPostions.nextPosition() too many times,
 which the new API asserts against).
 [Aside: I think, when we commit changes to package-private APIs such
 that back-compat tests don't pass, we could go back, make a branch on
 the back-compat tag, commit changes to the tests to use the new
 package private APIs on that branch, then fix nightly build to use the
 tip of that branch?o]
 There's still plenty to do before this is committable! This is a
 rather large change:
   * Switches to a new more efficient terms dict format.  This still
 uses tii/tis files, but the tii only stores term  long offset
 (not a TermInfo).  At seek points, tis encodes term  freq/prox
 offsets absolutely instead of with deltas delta.  Also, tis/tii
 are structured by field, so we don't have to record field number
 in every term.
 .
 On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
 - 0.64 MB) and tis file is 9% smaller (75.5 MB - 68.5 MB).
 .
 RAM usage when loading terms dict index is significantly less
 since we only load an array of offsets and an array of String (no
 more TermInfo array).  It should be faster to init too.
 .
 This part is basically done.
   * Introduces modular reader codec that strongly decouples terms dict
 from docs/positions readers.  EG there is no more TermInfo used
 when reading the new format.
 .
 There's nice symmetry now between reading  writing in the codec
 chain -- the current docs/prox format is captured in:
 {code}
 FormatPostingsTermsDictWriter/Reader
 FormatPostingsDocsWriter/Reader (.frq file) and
 FormatPostingsPositionsWriter/Reader (.prx file).
 {code}
 This part is basically done.
   * Introduces a new flex API for iterating through the fields,
 terms, docs and positions:
 {code}
 FieldProducer - TermsEnum - DocsEnum - PostingsEnum
 {code}
 This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
 old API on top of the new API to keep back-compat.
 
 Next steps:
   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
 fix any hidden assumptions.
   * Expose new API out of IndexReader, deprecate old API but emulate
 old API on top of new one, switch all core/contrib users to the
 new API.
   * Maybe switch to AttributeSources as the base class for TermsEnum,
 DocsEnum, PostingsEnum -- this would give readers API flexibility
 (not just index-file-format flexibility).  EG if someone wanted
 to store payload at the term-doc level instead of
 term-doc-position level, you could just add a new attribute.
   * Test performance & iterate.
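
To make the chain concrete, here is a minimal usage sketch in Java. The class
names follow the chain above, but every method name (fields(), terms(),
docs(), nextPosition(), the NO_MORE_DOCS sentinel) is an assumption about the
eventual API, not the committed one:

{code}
// Hypothetical walk of one field's postings via the flex API chain;
// this is a fragment, and all method names are illustrative guesses.
FieldProducer fields = segmentReader.fields();
TermsEnum terms = fields.terms("body");          // per-field terms dict
while (terms.next() != null) {
  DocsEnum docs = terms.docs();                  // docs for the current term
  while (docs.next() != DocsEnum.NO_MORE_DOCS) {
    PostingsEnum postings = docs.positions();    // positions within the doc
    for (int i = 0; i < docs.freq(); i++) {
      int position = postings.nextPosition();
    }
  }
}
{code}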

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-09 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764073#action_12764073
 ] 

Mark Miller commented on LUCENE-1458:
-

Hmm - I must have something off then. I've never been into this stuff much 
before.

On a cache hit, I'm still calling docs.readTerm(entry.freq, entry.isIndex) - 
I'm just caching the freq, isIndex, and the positions with a CurrentState 
object. The captureCurrentState now telescopes down, capturing the state of 
each object.

Perhaps I'm off there - because if I do that, it seems I have to capture the 
state right before the call to readTerm in next() - otherwise readTerm will 
move everything forward before I can grab it when I actually put the state into 
the cache - when it's FOUND.

I may be all wet though - no worries - I'm really just playing around trying to 
learn some of this - the only way I learn is to code.

bq. Hmm I wonder whether we should also cache the seek(ord) calls?

I was wondering about that, but hadn't even got to thinking about it :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-09 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764078#action_12764078
 ] 

Michael Busch commented on LUCENE-1458:
---

I added this cache originally because it seemed the easiest to improve the term 
lookup performance. 

Now we're adding the burden of implementing such a cache to every codec, right? 
Maybe instead we should improve the search runtime to not call idf() twice for 
every term?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764079#action_12764079
 ] 

Michael McCandless commented on LUCENE-1458:


bq. on a cache hit, I'm still calling docs.readTerm(entry.freq, entry.isIndex) 

Hmm... I think your cache might be one level too low?  I think we want the 
cache to live in StandardTermsDictReader.  Only the seek(TermRef) method 
interacts with the cache for now (until we maybe add ord as well).

So, seek first checks if that term is in cache, and if so pulls the opaque 
state and asks the docsReader to restore to that state.  Else, it does the 
normal seek, but then if the exact term is found, it calls 
docsReader.captureState and stores it in the cache.

Make sure the cache lives high enough to be shared by different TermsEnum 
instances.  I think it should probably live in 
StandardTermsDictReader.FieldReader.  There is one instance of that per field.
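
To make the flow concrete, here's a rough sketch of how such a seek-path cache
could sit inside StandardTermsDictReader. The CacheEntry class, the
seekInternal() helper, and the capture/restoreState calls are placeholders for
whatever the codec actually exposes (java.util.Map/LinkedHashMap imports
assumed):

{code}
// Hedged sketch only: a small LRU keyed by term, holding opaque codec state.
private final Map<TermRef,CacheEntry> termCache =
  new LinkedHashMap<TermRef,CacheEntry>(1024, 0.75f, true) {
    protected boolean removeEldestEntry(Map.Entry<TermRef,CacheEntry> eldest) {
      return size() > 1024;                 // bound the cache
    }
  };

public SeekStatus seek(TermRef term) throws IOException {
  CacheEntry entry = termCache.get(term);
  if (entry != null) {                      // hit: skip the terms-dict scan
    docsReader.restoreState(entry.state);
    return SeekStatus.FOUND;
  }
  SeekStatus status = seekInternal(term);   // normal seek through tii/tis
  if (status == SeekStatus.FOUND) {         // only cache exact matches
    termCache.put(term, new CacheEntry(docsReader.captureState()));
  }
  return status;
}
{code}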

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764085#action_12764085
 ] 

Michael McCandless commented on LUCENE-1458:


bq. Now we're adding the burden of implementing such a cache to every codec, 
right?

I suspect most codecs will reuse the StandardTermsDictReader, ie, they will 
usually only change the docs/positions/payloads format.  So each codec will 
only have to implement capture/restoreState.
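
In other words, the per-codec surface could be as small as a pair of hooks
like these (a hypothetical shape, not the patch's actual classes):

{code}
// The terms dict treats the captured state as opaque; only the
// docs/positions reader knows what's inside (file pointers, pending
// skip state, etc.).
abstract class DocsReaderState {}

abstract class DocsReader {
  abstract DocsReaderState captureState();            // snapshot current position
  abstract void restoreState(DocsReaderState state);  // jump back without re-seeking
}
{code}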

bq. Maybe instead we should improve the search runtime to not call idf() twice 
for every term?

Oh I didn't realize we call idf() twice per term -- we should separately just 
fix that.  Where are we doing that?

(I thought the two calls were first for idf() and then 2nd when it's time to 
get the actual TermDocs/Positions to step through).


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-09 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764089#action_12764089
 ] 

Michael Busch commented on LUCENE-1458:
---

{quote}
Oh I didn't realize we call idf() twice per term
{quote}

Hmm I take that back. I looked in LUCENE-1195 again:

{quote}
Currently we have a bottleneck for multi-term queries: the dictionary lookup is 
being done
twice for each term. The first time in Similarity.idf(), where 
searcher.docFreq() is called.
The second time when the posting list is opened (TermDocs or TermPositions). 
{quote}

Hmm something's wrong with my memory this morning! Maybe the lack of caffeine :)
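
For reference, the double lookup LUCENE-1195 describes looks like this against
the current 2.x API; both calls below walk the terms dictionary for the same
term (illustration only):

{code}
// The same term is looked up once for its docFreq (feeding
// Similarity.idf()) and once more to open the posting list.
Term t = new Term("body", "lucene");
int df = searcher.docFreq(t);      // 1st dictionary lookup
TermDocs td = reader.termDocs(t);  // 2nd dictionary lookup
while (td.next()) {
  // consume td.doc() / td.freq() ...
}
td.close();
{code}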

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-09 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764114#action_12764114
 ] 

Mark Miller commented on LUCENE-1458:
-

Ah - okay - that helps. I think the cache itself is currently around the right 
level (StandardTermsDictReader, and it gets hit pretty hard), but I thought it 
was funky I still had to make that read call - I think I see how it should work 
without that now, by just queuing up the docsReader to where it should be 
correctly. We will see. Vacation till Tuesday - don't let me stop you from 
doing it correctly if it's on your timeline. Just playing over here - and I 
don't have a lot of time to play really.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1958) ShingleFilter creates shingles across two consecutive documents : bug or normal behaviour ?

2009-10-09 Thread MRIT64 (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764134#action_12764134
 ] 

MRIT64 commented on LUCENE-1958:


- Yes, I use a custom analyser which uses reusableToken

- I don't know if reusableToken is supported or not in this version, but the 
method next(Token reusableToken) is proposed in the ShingleFilter 2.4.1 
Javadoc (see 
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/shingle/ShingleFilter.html).
 That's the reason why I have used it; I don't know how it works internally, 
and there is nothing mentioned in the documentation.

Anyway, it doesn't matter now because the problem doesn't occur with Lucene 2.9.

Regards

 ShingleFilter creates shingles across two consecutive documents : bug or 
 normal behaviour ?
 

 Key: LUCENE-1958
 URL: https://issues.apache.org/jira/browse/LUCENE-1958
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 2.4.1
 Environment: Windows XP / jdk1.6.0_15
Reporter: MRIT64
Priority: Minor

 HI
 I add two consecutive documents that are indexed with some filters. The last 
 one is ShingleFilter.
 ShingleFilter creates a shingle spanning the two documents, which makes no 
 sense in my context.
 Is that a bug or is it ShingleFilter's normal behaviour? If it's normal 
 behaviour, is it possible to change it optionally?
 Thanks
 MR

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1958) ShingleFilter creates shingles across two consecutive documents : bug or normal behaviour ?

2009-10-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764141#action_12764141
 ] 

Robert Muir commented on LUCENE-1958:
-

MRIT64, actually I am not curious about next(reusableToken), but instead 
whether your Analyzer implements 

{code}
public TokenStream reusableTokenStream(String fieldName, Reader reader) throws 
IOException
{code}

If you were trying to reuse ShingleFilters in 2.4.1 with this technique, I 
think this would be unsafe. It is safe in 2.9.
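
For 2.9, a reuse-safe Analyzer around ShingleFilter could look roughly like
the sketch below; the tokenizer choice and the SavedStreams holder are just
examples, not the only way to wire it:

{code}
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;

public class ReusableShingleAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new ShingleFilter(new WhitespaceTokenizer(reader));
  }

  public TokenStream reusableTokenStream(String fieldName, Reader reader)
      throws IOException {
    SavedStreams streams = (SavedStreams) getPreviousTokenStream();
    if (streams == null) {              // first use on this thread
      streams = new SavedStreams();
      streams.source = new WhitespaceTokenizer(reader);
      streams.result = new ShingleFilter(streams.source);
      setPreviousTokenStream(streams);
    } else {
      streams.source.reset(reader);     // rewind, reuse the filter chain
    }
    return streams.result;
  }

  private static class SavedStreams {
    WhitespaceTokenizer source;
    TokenStream result;
  }
}
{code}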

bq. Anyway, it doesn't matter now because the problem doesn't occur with Lucene 
2.9.

Ok to mark this issue as resolved?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1822) FastVectorHighlighter: SimpleFragListBuilder hard-coded 6 char margin is too naive

2009-10-09 Thread Chas Emerick (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764168#action_12764168
 ] 

Chas Emerick commented on LUCENE-1822:
--

Thank you for the patch.  I agree, the context surrounding each fragment could 
definitely be improved.

 FastVectorHighlighter: SimpleFragListBuilder hard-coded 6 char margin is too 
 naive
 --

 Key: LUCENE-1822
 URL: https://issues.apache.org/jira/browse/LUCENE-1822
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.9
 Environment: any
Reporter: Alex Vigdor
Priority: Minor
 Attachments: LUCENE-1822.patch


 The new FastVectorHighlighter performs extremely well, however I've found in 
 testing that the window of text chosen per fragment is often very poor, as it 
 is hard coded in SimpleFragListBuilder to always start the fragment 6 characters 
 to the left of the first phrase match in a fragment.  When selecting long 
 fragments, this often means that there is barely any context before the 
 highlighted word, and lots after; even worse, when highlighting a phrase at 
 the end of a short text the beginning is cut off, even though the entire 
 phrase would fit in the specified fragCharSize.  For example, highlighting 
 "Punishment" in "Crime and Punishment" returns "e and <b>Punishment</b>" no 
 matter what fragCharSize is specified.  I am going to attach a patch that 
 improves the text window selection by recalculating the starting margin once 
 all phrases in the fragment have been identified - this way if a single word 
 is matched in a fragment, it will appear in the middle of the highlight, 
 instead of 6 characters from the beginning.  This way one can also guarantee 
 that the entirety of short texts is represented in a fragment by specifying 
 a large enough fragCharSize.
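
The recentering reduces to a small calculation once the match boundaries are
known. A sketch under assumed names (this is not the patch's actual code):

{code}
// Pick the fragment start so the match sits mid-window, clamped so the
// window never runs past either end of the text.
static int fragmentStart(int matchStart, int matchEnd,
                         int fragCharSize, int textLength) {
  int margin = Math.max(0, (fragCharSize - (matchEnd - matchStart)) / 2);
  int start = matchStart - margin;                  // center the match
  start = Math.min(start, Math.max(0, textLength - fragCharSize));
  return Math.max(0, start);
}
{code}

With "Crime and Punishment" (20 chars) and fragCharSize=20, the match on
"Punishment" yields start 0, so the whole phrase survives in the fragment.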

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

2009-10-09 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1458:
---

Attachment: LUCENE-1458.patch

New patch attached.  All tests pass.

A few small changes (eg sync'd to trunk) but the biggest change is a
new test case (TestExternalCodecs) that contains two new codecs:

  * RAMOnlyCodec -- like the instantiated contrib, it writes and reads all
postings into RAM in dedicated classes

  * PerFieldCodecWrapper -- dispatches by field name to different
codecs (this was asked about a couple times)

The test indexes one field using the standard codec, and the other
using the RAMOnlyCodec.  It also verifies one can in fact make a
custom codec external to oal.index.
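
A per-field wrapper like that presumably reduces to a name-to-codec map with a
default fallback. A hypothetical shape (the Codec type stands in for whatever
base class the patch settles on):

{code}
// Dispatches codec lookups by field name; unknown fields fall back to
// the default codec.
class PerFieldCodecWrapper {
  private final java.util.Map<String,Codec> perField =
      new java.util.HashMap<String,Codec>();
  private final Codec defaultCodec;

  PerFieldCodecWrapper(Codec defaultCodec) {
    this.defaultCodec = defaultCodec;
  }

  void add(String field, Codec codec) {
    perField.put(field, codec);
  }

  Codec lookup(String field) {
    Codec codec = perField.get(field);
    return codec != null ? codec : defaultCodec;
  }
}
{code}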


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Hudson build is back to normal: Lucene-trunk #974

2009-10-09 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/974/changes



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org