[jira] Commented: (LUCENE-1545) Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTRE E

2009-06-10 Thread Robert Muir (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718300#action_12718300 ] Robert Muir commented on LUCENE-1545: - if you are looking for a more short-term soluti

MMap certain files, leave the rest to the regular dir

2009-06-10 Thread Jason Rutherglen
On the topic of MMaping files. Would a Directory implementation that transparently MMaps only certain files be interesting? It could MMap files that are accessed frequently (term dict, postings), as opposed to files such as docstores that are accessed less frequently. This could be built using LUCE
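
A minimal sketch of the per-file selection step described above. It is not Lucene's API: the class `MMapFilePolicy`, the method `shouldMMap`, and the chosen extension set are assumptions for illustration. It only decides, by file extension, whether a file would be handed to a memory-mapped directory or to the regular one. The extensions are the Lucene 2.x term dictionary (`tis`, `tii`) and postings (`frq`, `prx`) files.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MMapFilePolicy {
    // Hypothetical configuration: extensions of files considered "hot" enough to mmap.
    private final Set<String> mmapExtensions;

    public MMapFilePolicy(String... extensions) {
        this.mmapExtensions = new HashSet<>(Arrays.asList(extensions));
    }

    /** Returns true if the file's extension is in the mmap set; files without an extension are never mmapped. */
    public boolean shouldMMap(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return false;
        }
        return mmapExtensions.contains(fileName.substring(dot + 1));
    }

    public static void main(String[] args) {
        // Term dictionary and postings files go to mmap; stored fields (fdt) stay on the regular directory.
        MMapFilePolicy policy = new MMapFilePolicy("tis", "tii", "frq", "prx");
        for (String name : new String[] {"_0.tis", "_0.frq", "_0.fdt", "segments_2"}) {
            System.out.println(name + " -> " + (policy.shouldMMap(name) ? "mmap" : "regular"));
        }
    }
}
```

A wrapping Directory would consult such a policy on each open call and delegate to one of two underlying directories; that delegation is left out here because it depends on the Lucene version in use.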

[jira] Updated: (LUCENE-1460) Change all contrib TokenStreams/Filters to use the new TokenStream API

2009-06-10 Thread Robert Muir (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1460: Attachment: LUCENE-1460_partial.txt only partial solution... some of the analyzers don't have any

[jira] Commented: (LUCENE-1628) Persian Analyzer

2009-06-10 Thread Robert Muir (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718292#action_12718292 ] Robert Muir commented on LUCENE-1628: - mark, on the same topic: if possible, at some t

[jira] Commented: (LUCENE-1466) CharFilter - normalize characters before tokenizer

2009-06-10 Thread Robert Muir (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718291#action_12718291 ] Robert Muir commented on LUCENE-1466: - just as an alternative, i have a different mech

[jira] Updated: (LUCENE-1583) SpanOrQuery skipTo() doesn't always move forwards

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1583: Attachment: LUCENE-1583.patch I haven't looked closely at it yet, but tests appear to pass (unknown

[jira] Commented: (LUCENE-1504) SerialChainFilter should use DocSet API rather then deprecated BitSet API

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718289#action_12718289 ] Mark Miller commented on LUCENE-1504: - Looks like we are close on this? Someone want t

[jira] Updated: (LUCENE-1361) QueryParser should have a setDateFormat(DateFormat) method

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1361: Fix Version/s: (was: 2.9) If a patch is supplied, we could consider for 2.9, but otherwise I a

[jira] Commented: (LUCENE-1650) Small fix in CustomScoreQuery JavaDoc

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718286#action_12718286 ] Mark Miller commented on LUCENE-1650: - This looks like it makes sense to me. Unless an

[jira] Commented: (LUCENE-1167) add compatibility statement to README.txt for all contribs

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718285#action_12718285 ] Mark Miller commented on LUCENE-1167: - I wouldn't argue against doing this, but person

[jira] Assigned: (LUCENE-1681) DocValues infinite loop caused by - a call to getMinValue | getMaxValue | getAverageValue

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller reassigned LUCENE-1681: --- Assignee: Mark Miller > DocValues infinite loop caused by - a call to getMinValue | getMaxVa

[jira] Commented: (LUCENE-1405) Support for new Resources model in ant 1.7 in Lucene ant task.

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718283#action_12718283 ] Mark Miller commented on LUCENE-1405: - Hey Erik, does this make sense? It looks like a

[jira] Updated: (LUCENE-1545) Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTRE E

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1545: Priority: Minor (was: Major) Fix Version/s: (was: 2.9) 3.0 Feel f

Re: back compat is good

2009-06-10 Thread Mark Miller
Yonik Seeley wrote: On Wed, Jun 10, 2009 at 4:11 PM, Mark Miller wrote: The computer should handle that for me. It really should be as easy as saying, look I want the best new defaults, or I want the back compat defaults. The computer should figure out the rest for me. actsAsVersion ;

[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718281#action_12718281 ] Mark Miller commented on LUCENE-1486: - What do you think about this for 2.9 Mark H? b

[jira] Commented: (LUCENE-1460) Change all contrib TokenStreams/Filters to use the new TokenStream API

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718280#action_12718280 ] Mark Miller commented on LUCENE-1460: - wanna post what you have Robert? perhaps then s

[jira] Commented: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718279#action_12718279 ] Mark Miller commented on LUCENE-1644: - I'm inclined to think we push to 3.0? > Enable

[jira] Commented: (LUCENE-1466) CharFilter - normalize characters before tokenizer

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718277#action_12718277 ] Mark Miller commented on LUCENE-1466: - Anyone want to step up for this one or should w

[jira] Commented: (LUCENE-1571) DistanceFilter problem with deleted documents

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718276#action_12718276 ] Mark Miller commented on LUCENE-1571: - Can someone that knows LocalLucene comment on w

[jira] Commented: (LUCENE-1628) Persian Analyzer

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718275#action_12718275 ] Mark Miller commented on LUCENE-1628: - Okay, I see that the stopword list for Arabic w

[jira] Commented: (LUCENE-1628) Persian Analyzer

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718274#action_12718274 ] Mark Miller commented on LUCENE-1628: - Thanks Robert, looks cool. Anyone know what th

[jira] Assigned: (LUCENE-1628) Persian Analyzer

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller reassigned LUCENE-1628: --- Assignee: Mark Miller > Persian Analyzer > > > Key: LUCENE-

[jira] Commented: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match

2009-06-10 Thread Trejkaz (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718270#action_12718270 ] Trejkaz commented on LUCENE-1683: - I screwed up the formatting. Fixed version: {code}

[jira] Resolved: (LUCENE-1598) While you could use a custom Sort Comparator source with remote searchable before, you can no longer do so with FieldComparatorSource

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller resolved LUCENE-1598. - Resolution: Fixed > While you could use a custom Sort Comparator source with remote searchable

[jira] Created: (LUCENE-1683) RegexQuery matches terms the input regex doesn't actually match

2009-06-10 Thread Trejkaz (JIRA)
RegexQuery matches terms the input regex doesn't actually match --- Key: LUCENE-1683 URL: https://issues.apache.org/jira/browse/LUCENE-1683 Project: Lucene - Java Issue Type: Improv

Re: Lucene memory usage

2009-06-10 Thread Jason Rutherglen
I read over the LUCENE-1458 comments again. Interesting. I think the most compelling argument is that the various files we're normally loading into the heap are, after merging, in the IO cache. If we can simply reuse the IO cache rather than allocate a bunch of redundant arrays in heap, we could be

[jira] Resolved: (LUCENE-1455) org.apache.lucene.ant.HtmlDocument creates a FileInputStream in its constructor that it doesn't close

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller resolved LUCENE-1455. - Resolution: Fixed > org.apache.lucene.ant.HtmlDocument creates a FileInputStream in its > const

[jira] Resolved: (LUCENE-1572) luceneweb

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller resolved LUCENE-1572. - Resolution: Incomplete not sure what issue is here, but it looks like it belongs on the mailing

[jira] Updated: (LUCENE-1407) Refactor Searchable to not have RMI Remote dependency

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1407: Fix Version/s: (was: 2.9) No work yet on this issue right? I'm going to pull off the 2.9 and l

[jira] Updated: (LUCENE-1482) Replace infoSteram by a logging framework (SLF4J)

2009-06-10 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1482: Fix Version/s: (was: 2.9) 3.0 I'm going to go out on a limb and say this on

[jira] Resolved: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter

2009-06-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1584. Resolution: Fixed Fix Version/s: 2.9 > Callback for intercepting merging se

[jira] Commented: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter

2009-06-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718261#action_12718261 ] Michael McCandless commented on LUCENE-1584: OK I committed it! Thanks Jason!

Re: Lucene memory usage

2009-06-10 Thread Michael McCandless
On Wed, Jun 10, 2009 at 7:23 PM, Jason Rutherglen wrote: > Cool! Sounds like with LUCENE-1458 we can experiment with some > of these things. Does CSF become just another codec? I believe LUCENE-1458 currently only makes terms dict & postings pluggable... >> I'm leery of having terms dict live ent

[jira] Updated: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter

2009-06-10 Thread Jason Rutherglen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1584: - Attachment: LUCENE-1584.patch Yep! > Callback for intercepting merging segments in Inde

[jira] Commented: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter

2009-06-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718258#action_12718258 ] Michael McCandless commented on LUCENE-1584: Shouldn't we make that method pac

[jira] Commented: (LUCENE-1592) fix or deprecate TermsEnum.skipTo

2009-06-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718257#action_12718257 ] Michael McCandless commented on LUCENE-1592: OK, excellent -- can you commit t

Re: Lucene memory usage

2009-06-10 Thread Jason Rutherglen
Cool! Sounds like with LUCENE-1458 we can experiment with some of these things. Does CSF become just another codec? > I'm leery of having terms dict live entirely on disk, though we should certainly explore it. Yeah, it should theoretically help with reloading, it could use a skiplist (as we have

[jira] Updated: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter

2009-06-10 Thread Jason Rutherglen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1584: - Attachment: LUCENE-1584.patch Added a protected IW.mergeSuccess method. We can't really

[jira] Commented: (LUCENE-1592) fix or deprecate TermsEnum.skipTo

2009-06-10 Thread Uwe Schindler (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718237#action_12718237 ] Uwe Schindler commented on LUCENE-1592: --- I think deprecation is to do now (before 3.

[jira] Commented: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter

2009-06-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718235#action_12718235 ] Michael McCandless commented on LUCENE-1584: But you'll still need access to p

[jira] Updated: (LUCENE-1560) maxDocBytesToAnalyze should be required arg up front

2009-06-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1560: --- Description: We recently changed IndexWriter to require you to specify up-front Ma

[jira] Updated: (LUCENE-1592) fix or deprecate TermsEnum.skipTo

2009-06-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1592: --- Fix Version/s: (was: 2.9) Moving out. > fix or deprecate TermsEnum.skipTo > ---

[jira] Commented: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter

2009-06-10 Thread Jason Rutherglen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718231#action_12718231 ] Jason Rutherglen commented on LUCENE-1584: -- We can make it protected that way it'

[jira] Updated: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays

2009-06-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1574: --- Fix Version/s: (was: 2.9) Moving out. > PooledSegmentReader, pools SegmentReade

[jira] Updated: (LUCENE-1667) ConcurrentMergeScheduler use a thread pool (per directory)

2009-06-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1667: --- Fix Version/s: (was: 2.9) Moving out. > ConcurrentMergeScheduler use a thread p

[jira] Commented: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter

2009-06-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718228#action_12718228 ] Michael McCandless commented on LUCENE-1584: Well, I'm worried about how much

Re: Payloads and TrieRangeQuery

2009-06-10 Thread Yonik Seeley
On Wed, Jun 10, 2009 at 5:45 PM, Michael McCandless wrote: > But, I realize this is a stretch... eg we'd have to fix rewrite to be > per-segment, which certainly seems spooky.  A top-level schema would > definitely be cleaner. Really goes into Solr land... my pref for Lucene is to remain a core e

RE: Payloads and TrieRangeQuery

2009-06-10 Thread Uwe Schindler
> > Another question not so simple to answer: When embedding these > TermPositions > > into the whole process, how would this work with MultiTermQuery? > > There's no reason why Trie has to use MultiTermQuery, right? No but is elegant and simplifies much (see current code in trunk). Uwe --

RE: Payloads and TrieRangeQuery

2009-06-10 Thread Uwe Schindler
> I think we'd need richer communication between MTQ and its subclasses, > so that eg your enum would return a Query instead of a Term? > > Then you'd either return a TermQuery, or, a BooleanQuery that's > filtering the TermQuery? > > But yes, doing after 3.0 seems good! There is one other thing

Re: Payloads and TrieRangeQuery

2009-06-10 Thread Yonik Seeley
> Another question not so simple to answer: When embedding these TermPositions > into the whole process, how would this work with MultiTermQuery? There's no reason why Trie has to use MultiTermQuery, right? -Yonik http://www.lucidimagination.com --

Re: Payloads and TrieRangeQuery

2009-06-10 Thread Michael McCandless
On Wed, Jun 10, 2009 at 5:24 PM, Yonik Seeley wrote: > On Wed, Jun 10, 2009 at 5:03 PM, Michael McCandless > wrote: >> On Wed, Jun 10, 2009 at 4:04 PM, Earwin Burrfoot wrote: >> * Was the field even indexed w/ Trie, or indexed as "simple text"? > > Why the special treatment for Trie? So that at

[jira] Commented: (LUCENE-1673) Move TrieRange to core

2009-06-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718209#action_12718209 ] Michael McCandless commented on LUCENE-1673: bq. NumericRangeQuery.newFloatRan

Re: Payloads and TrieRangeQuery

2009-06-10 Thread Michael McCandless
I think we'd need richer communication between MTQ and its subclasses, so that eg your enum would return a Query instead of a Term? Then you'd either return a TermQuery, or, a BooleanQuery that's filtering the TermQuery? But yes, doing after 3.0 seems good! Mike On Wed, Jun 10, 2009 at 5:26 PM,

Re: Payloads and TrieRangeQuery

2009-06-10 Thread Earwin Burrfoot
>  * Was the field even indexed w/ Trie, or indexed as "simple text"? >    It's useful to know this "automatically" at search time, so eg a >    RangeQuery can do the right thing by default.  FieldInfos seems >    like the natural place to store this.  It's basically Lucene's >    per-segment write

RE: Payloads and TrieRangeQuery

2009-06-10 Thread Uwe Schindler
> I would like to go forward with moving the classes into the right packages > and optimize the way, how queries and analyzers are created (only one > class > for each). The idea from LUCENE-1673 to use static factories to create > these > classes for the different data types seems to be more elega

Re: Payloads and TrieRangeQuery

2009-06-10 Thread Yonik Seeley
On Wed, Jun 10, 2009 at 5:03 PM, Michael McCandless wrote: > On Wed, Jun 10, 2009 at 4:04 PM, Earwin Burrfoot wrote: >  * Was the field even indexed w/ Trie, or indexed as "simple text"? Why the special treatment for Trie? >    It's useful to know this "automatically" at search time, so eg a >  

Re: Payloads and TrieRangeQuery

2009-06-10 Thread Michael McCandless
On Wed, Jun 10, 2009 at 5:07 PM, Uwe Schindler wrote: > I would really like to leave this optimization out for 2.9. We can still add > this after 2.9 as an optimization. The number of bits encoded into the > TermPosition (this is really a cool idea, thanks Yonik, I was missing > exactly that, becau

RE: Payloads and TrieRangeQuery

2009-06-10 Thread Uwe Schindler
> On Wed, Jun 10, 2009 at 3:43 PM, Michael McCandless > wrote: > > On Wed, Jun 10, 2009 at 3:19 PM, Yonik > Seeley wrote: > > > >>> And this information about the trie > >>> structure and where payloads are should be stored in FieldInfos. > >> > >> As is the case today, the info is encoded in the

Re: Payloads and TrieRangeQuery

2009-06-10 Thread Michael McCandless
On Wed, Jun 10, 2009 at 4:04 PM, Earwin Burrfoot wrote: > And then, when you merge segments indexed with different Trie* > settings, you need to convert them to some common form. > Sounds like something too complex and with minimum returns. Oh yeah... tricky. So... there are various situations t

[jira] Commented: (LUCENE-1607) String.intern() faster alternative

2009-06-10 Thread Earwin Burrfoot (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718198#action_12718198 ] Earwin Burrfoot commented on LUCENE-1607: - bq. but I was waiting for some kind of

[jira] Updated: (LUCENE-1607) String.intern() faster alternative

2009-06-10 Thread Yonik Seeley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated LUCENE-1607: - Attachment: LUCENE-1607.patch latest patch - could use a multi-threaded testcase to ensure no ex

Re: back compat is good

2009-06-10 Thread Grant Ingersoll
I'm not against back compatibility. In fact, I agree with your points, especially the use of the phrase "commonly used interfaces". My main problem is our approach seems to be very dogmatic and detrimental for _less_ commonly used interfaces (more importantly less commonly _implemented_ In

[jira] Commented: (LUCENE-1607) String.intern() faster alternative

2009-06-10 Thread Yonik Seeley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718188#action_12718188 ] Yonik Seeley commented on LUCENE-1607: -- I think so... but I was waiting for some kind

Re: Lucene memory usage

2009-06-10 Thread Michael McCandless
On Wed, Jun 10, 2009 at 4:13 PM, Jason Rutherglen wrote: > Great! If I understand correctly it looks like RAM savings? Will > there be an improvement in lookup speed? (We're using binary > search here?). Yes, sizable RAM reduction for apps that have many unique terms. And, init'ing (warming) the

Re: back compat is good

2009-06-10 Thread Yonik Seeley
On Wed, Jun 10, 2009 at 4:11 PM, Mark Miller wrote: > The computer should handle that > for me. It really should be as easy > as saying, look I want the best new defaults, or I want the back compat > defaults. The computer should figure > out the rest for me. actsAsVersion ;-) nice and back compa
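
A minimal sketch of the "actsAsVersion" idea mentioned above, assuming a single version value selects either back-compat or new defaults. The class `VersionedDefaults`, the `Version` constants, and the setting `useNewBehaviorByDefault` are hypothetical and do not correspond to an existing Lucene API in this thread.

```java
public class VersionedDefaults {
    /** Hypothetical release markers, declared in release order so they can be compared. */
    public enum Version { V2_4, V2_9 }

    private final Version actsAsVersion;

    public VersionedDefaults(Version actsAsVersion) {
        this.actsAsVersion = actsAsVersion;
    }

    /** Hypothetical setting: a changed default is enabled only when acting as 2.9 or later. */
    public boolean useNewBehaviorByDefault() {
        return actsAsVersion.compareTo(Version.V2_9) >= 0;
    }

    public static void main(String[] args) {
        // An existing application pins the old version and keeps the old defaults.
        System.out.println(new VersionedDefaults(Version.V2_4).useNewBehaviorByDefault());
        // A new application asks for the latest version and gets the new defaults.
        System.out.println(new VersionedDefaults(Version.V2_9).useNewBehaviorByDefault());
    }
}
```

The design point is that the caller states one version once, and every default is derived from it, instead of the caller toggling each setting individually.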

Re: Lucene's default settings & back compatibility

2009-06-10 Thread Mark Miller
Right - I'd actually hold off now. I figured the threat of sending might prompt some action ;) It still wouldn't hurt to know what the users think, perhaps at a more digestible, overview level though. I do think Yonik torpedoed something this liberal :) That's not a bad thing though. We will fi

[jira] Commented: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter

2009-06-10 Thread Jason Rutherglen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718185#action_12718185 ] Jason Rutherglen commented on LUCENE-1584: -- Can we put this one in 2.9? It seems

[jira] Updated: (LUCENE-1671) FSDirectory internally caches and clones FSIndexInput

2009-06-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1671: --- Fix Version/s: (was: 2.9) Moving out. > FSDirectory internally caches and clone

[jira] Resolved: (LUCENE-1682) unit tests should use private directories

2009-06-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1682. Resolution: Fixed > unit tests should use private directories > --

[jira] Commented: (LUCENE-1607) String.intern() faster alternative

2009-06-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718181#action_12718181 ] Michael McCandless commented on LUCENE-1607: Yonik is this ready to go in...?

Re: Lucene memory usage

2009-06-10 Thread Jason Rutherglen
Great! If I understand correctly it looks like RAM savings? Will there be an improvement in lookup speed? (We're using binary search here?). Is there a precedent in database systems for what was mentioned about placing the term dict, delDocs, and filters onto disk and reading them from there (wit

Re: Lucene's default settings & back compatibility

2009-06-10 Thread Shai Erera
Well .. to be honest I haven't monitored java-user for quite some time, so I don't know if it hasn't been raised there. But now there's the other thread that Yonik started, so I'm not really sure where to answer. I think that if we look back at 2.0 and compare to 2.9, anyone upgrading from that v

[jira] Updated: (LUCENE-1577) Benchmark of different in RAM realtime techniques

2009-06-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1577: --- Fix Version/s: (was: 2.9) Moving out. > Benchmark of different in RAM realtime

Re: back compat is good

2009-06-10 Thread Mark Miller
As far as default settings, it seems like it can be mostly fixed with documentation (i.e. recommended settings for maximum performance). That seems like a very small burden for people writing new applications with Lucene anyway (compare to the cost of writing the whole application). On the othe

[jira] Updated: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter

2009-06-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1584: --- Fix Version/s: (was: 2.9) Moving out. > Callback for intercepting merging segme

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

2009-06-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718176#action_12718176 ] Michael McCandless commented on LUCENE-1448: Michael are you going to get to t

Re: Payloads and TrieRangeQuery

2009-06-10 Thread Yonik Seeley
On Wed, Jun 10, 2009 at 3:43 PM, Michael McCandless wrote: > On Wed, Jun 10, 2009 at 3:19 PM, Yonik Seeley > wrote: > >>> And this information about the trie >>> structure and where payloads are should be stored in FieldInfos. >> >> As is the case today, the info is encoded in the class you use (

[jira] Commented: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead

2009-06-10 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718175#action_12718175 ] Michael McCandless commented on LUCENE-1609: Alas... the big problem with doin

Re: Payloads and TrieRangeQuery

2009-06-10 Thread Earwin Burrfoot
>>> And this information about the trie >>> structure and where payloads are should be stored in FieldInfos. >> >> As is the case today, the info is encoded in the class you use (and >> it's settings)... no need to add it to the index structure.  In any >> case, it's a completely different issue an

Re: Payloads and TrieRangeQuery

2009-06-10 Thread Michael McCandless
On Wed, Jun 10, 2009 at 3:19 PM, Yonik Seeley wrote: >> And this information about the trie >> structure and where payloads are should be stored in FieldInfos. > > As is the case today, the info is encoded in the class you use (and > it's settings)... no need to add it to the index structure.  In

Re: Payloads and TrieRangeQuery

2009-06-10 Thread Michael McCandless
On Wed, Jun 10, 2009 at 3:07 PM, Uwe Schindler wrote: >> I wonder how performance would compare.  Without payloads, there are >> many more terms (for the tiny ranges) in the index, and your OR query >> will have lots of these tiny terms.  But then these tiny terms don't >> hit many docs, and with

Re: Payloads and TrieRangeQuery

2009-06-10 Thread Yonik Seeley
On Wed, Jun 10, 2009 at 3:07 PM, Uwe Schindler wrote: > My problem with all this is how to optimize after which shift value to > switch between terms and payloads. Just make it a configurable number of bits at the end that are "stored" instead of indexed. People will want to select different tra

Re: Lucene / Solr Function API

2009-06-10 Thread Michael McCandless
Well, it's unassigned and has no comments so my guess is: it's all yours! This would be a great step forward. The line between Solr & Lucene ought to be more "crisp" and this issue is a step towards that... Mike On Wed, Jun 10, 2009 at 2:59 PM, Simon Willnauer wrote: > Hey there, > > I'm curiou

Re: back compat is good

2009-06-10 Thread Michael McCandless
On Wed, Jun 10, 2009 at 2:23 PM, Yonik Seeley wrote: >> Well... Lucene still seems to be experiencing strong adoption/growth, >> eg combined user+dev email traffic: >> http://lucene.markmail.org/ > > I think that includes all Lucene sub-projects (Solr, Tika, Mahout, > Nutch, Droids, etc). > > http

RE: Payloads and TrieRangeQuery

2009-06-10 Thread Uwe Schindler
> Ooh that sounds compelling! > > So you would not need to use payloads for the "inside" brackets, > right? Only for the edges? Exactly. > I wonder how performance would compare. Without payloads, there are > many more terms (for the tiny ranges) in the index, and your OR query > will have lot

Lucene / Solr Function API

2009-06-10 Thread Simon Willnauer
Hey there, I'm curious if anybody is working on the issue https://issues.apache.org/jira/browse/LUCENE-1085 and the blocker https://issues.apache.org/jira/browse/LUCENE-1085 ? I would love to see both solr and lucene using the same api for search functions. The issues have been idle for a while so

Re: Payloads and TrieRangeQuery

2009-06-10 Thread Yonik Seeley
Yep, makes sense. It could be a little slower, but it would decrease the number of terms indexed by a factor of 256 (for 8 bits). But the payload part... seems like another case of using that because CSF isn't there yet, right? (well, perhaps except if you didn't want to store the field...) -Yon

Re: back compat is good

2009-06-10 Thread Simon Willnauer
On Wed, Jun 10, 2009 at 7:00 PM, Yonik Seeley wrote: > I'm starting to feel like the lone holdout that thinks back compat for > commonly used interfaces and index formats is important.  So I'll sum > up some of my thoughts and leave it at that: > > - I doubt that the number of new users for each re

Re: Payloads and TrieRangeQuery

2009-06-10 Thread Michael McCandless
Ooh that sounds compelling! So you would not need to use payloads for the "inside" brackets, right? Only for the edges? I wonder how performance would compare. Without payloads, there are many more terms (for the tiny ranges) in the index, and your OR query will have lots of these tiny terms.

RE: Payloads and TrieRangeQuery

2009-06-10 Thread Uwe Schindler
Hi, sorry I missed the first mail. The idea we discussed in Amsterdam during ApacheCon was: Instead of indexing all trie precisions from e.g. the leftmost 8 bits down to all 64 bits, the TrieTokenStream only creates terms from e.g. precisions 8 to 56. The last precision is left out. Instead
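
The snippet above is cut off, so what follows is only a rough sketch of the part that is stated: with a precision step of 8 bits, terms are created for precisions 8 to 56 and the full 64-bit precision is left out. The class name `TriePrefixSketch`, its methods, and the `storedBits` parameter are hypothetical and are not Lucene API; the sketch assumes that the omitted low bits are kept somewhere outside the term dictionary (the thread discusses payloads for that), and it does not model that storage.

```java
import java.util.ArrayList;
import java.util.List;

public class TriePrefixSketch {

    // Shift values for which a prefix term would be indexed.
    // shift = 64 - precision, so storedBits = 8 skips the full 64-bit precision.
    // (storedBits is a hypothetical knob, not a Lucene parameter.)
    static List<Integer> indexedShifts(int precisionStep, int storedBits) {
        List<Integer> shifts = new ArrayList<>();
        for (int shift = storedBits; shift < 64; shift += precisionStep) {
            shifts.add(shift);
        }
        return shifts;
    }

    // Prefix of the value at a given shift: the low 'shift' bits are dropped.
    static long prefix(long value, int shift) {
        return value >>> shift;
    }

    // Low bits that would be kept outside the term dictionary instead of indexed.
    static long storedRemainder(long value, int storedBits) {
        return storedBits == 0 ? 0L : value & ((1L << storedBits) - 1);
    }

    public static void main(String[] args) {
        long value = 0x1234L;
        // All precisions indexed: one term per shift 0, 8, ..., 56.
        System.out.println("all precisions: " + indexedShifts(8, 0));
        // Last precision left out: shifts 8, ..., 56 only.
        System.out.println("last left out:  " + indexedShifts(8, 8));
        System.out.println("coarsest kept prefix: 0x" + Long.toHexString(prefix(value, 8)));
        System.out.println("remainder bits:       0x" + Long.toHexString(storedRemainder(value, 8)));
    }
}
```

Dropping the lowest 8 bits means up to 256 distinct values collapse into a single term at the finest indexed level, which is the trade-off between term count and extra per-document checking that the thread is weighing.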

Re: back compat is good

2009-06-10 Thread Yonik Seeley
On Wed, Jun 10, 2009 at 2:01 PM, Michael McCandless wrote: > Well... Lucene still seems to be experiencing strong adoption/growth, > eg combined user+dev email traffic: > http://lucene.markmail.org/ I think that includes all Lucene sub-projects (Solr, Tika, Mahout, Nutch, Droids, etc). http://lu

Re: back compat is good

2009-06-10 Thread Michael McCandless
Well... Lucene still seems to be experiencing strong adoption/growth, eg combined user+dev email traffic: http://lucene.markmail.org/ Net/net, I also think that back-compat is important and we shouldn't up and abandon it or relax our policy too much. However, I wish we had better tools for *im

Re: Payloads and TrieRangeQuery

2009-06-10 Thread Jason Rutherglen
I think instead of ORing postings (trie range, rangequery, etc), have a custom Query + Scorer that examines the payload (somehow)? It could encode the multiple levels of trie bits in it? (I'm just guessing here). On Wed, Jun 10, 2009 at 4:04 AM, Michael McCandless < luc...@mikemccandless.com> wr

Re: back compat is good

2009-06-10 Thread Mark Miller
Yonik Seeley wrote: I'm starting to feel like the lone holdout that thinks back compat for commonly used interfaces and index formats is important. I think the fact that you're not the only one is why things got stymied. I wouldn't personally support anything that didn't try and maintain stabili

Re: [jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream

2009-06-10 Thread Michael McCandless
On Wed, Jun 10, 2009 at 12:45 PM, Mark Miller wrote: > I've heard that one before ;) In fact, we pretty much committed to releasing > more often. Now if 2.9 would just fall into line with our darn commitments > :) I hear you! So... how about we try to wrap up 2.9/3.0 and ship with what we have,

back compat is good

2009-06-10 Thread Yonik Seeley
I'm starting to feel like the lone holdout that thinks back compat for commonly used interfaces and index formats is important. So I'll sum up some of my thoughts and leave it at that: - I doubt that the number of new users for each release of Lucene exceeds the sum total of all existing users of

Re: [jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream

2009-06-10 Thread Mark Miller
Michael McCandless (JIRA) wrote: bq. Adopting a fixed release cycle with small intervals between releases (compared to what we have now). I think this is almost a good solution, though instead of "fixed" it could be that we try [harder] to do major

Re: Lucene's default settings & back compatibility

2009-06-10 Thread Mark Miller
No one really responded to this Shai? And I take it that the user list never saw it? Perhaps we should just ask for opinion from the user list based on what you already have - just to gauge the reaction on different points. Unless someone responds shortly, we could take a year waiting to shake

[jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream

2009-06-10 Thread Shai Erera (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718122#action_12718122 ] Shai Erera commented on LUCENE-1678: We've had this thread http://www.nabble.com/Luce

Re: Some thoughts around the use of reader.isDeleted and hasDeletions

2009-06-10 Thread Yonik Seeley
On Wed, Jun 10, 2009 at 11:16 AM, Shai Erera wrote: >> it makes sense because isDeleted() is essentially the *only* thing >> being done in the loop, and hence we can eliminate the loop entirely > > You mean that in case there is a matching segment, we can call > matchingVectorsReader.rawDocs(rawDo
