[jira] Assigned: (LUCENE-2115) Port to Generics - test cases in contrib

2009-12-05 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-2115:
--

Assignee: Michael McCandless

> Port to Generics - test cases in contrib 
> -
>
> Key: LUCENE-2115
> URL: https://issues.apache.org/jira/browse/LUCENE-2115
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 3.0
>Reporter: Kay Kay
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2115.patch
>
>
> LUCENE-1257 in Lucene 3.0 addressed porting the public APIs to generics. 
> LUCENE-2065 addressed src/test. 
> This would be a placeholder JIRA for any remaining pending generics 
> conversions across the code base. 
> Please keep it open after committing; we can close it when we are near the 
> 3.1 release, so that it serves as a placeholder ticket. 




[jira] Updated: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2110:
--

Attachment: LUCENE-2110.patch

New patch with the attribute support of LUCENE-2109.

- Also fixes a bug in the BW compatibility layer of MTQ (the if clause was wrong).
- Some code cleanup in FilteredTermsEnum (now easier to read, as the interplay of 
next() and seekNextTerm() is complicated).
- Added EmptyTermsEnum as a shortcut (used by NRQ and TRQ on inverse ranges). 
This enum never does any disk I/O against the terms dict; it is just empty. 
EmptyTermsEnum still supports seeking (although it is a subclass of 
FilteredTermsEnum), but seeking is trivial there: it just returns END :-)

I will now port Automaton and will provide a combined patch there.
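
For illustration, here is a minimal sketch of the "always empty" idea, written 
against a deliberately simplified, hypothetical TermsEnum-style interface (not 
the actual flex API; all names below are assumptions):

{code}
// Hypothetical, simplified model -- not the actual flex-branch classes.
abstract class SimpleTermsEnum {
  /** Returns the next term, or null when the enumeration is exhausted. */
  abstract String next();
  /** Seeks to the first term >= target; returns false when there is none. */
  abstract boolean seek(String target);
}

/** An "empty" enum: never touches the terms dict; seeking trivially reports END. */
final class SimpleEmptyTermsEnum extends SimpleTermsEnum {
  String next() { return null; }                // nothing to iterate
  boolean seek(String target) { return false; } // always "END"
}
{code}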

> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.




[jira] Commented: (LUCENE-2115) Port to Generics - test cases in contrib

2009-12-05 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786363#action_12786363
 ] 

Uwe Schindler commented on LUCENE-2115:
---

Mike: Oh a new generics policeman, your second police operation :-)

> Port to Generics - test cases in contrib 
> -
>
> Key: LUCENE-2115
> URL: https://issues.apache.org/jira/browse/LUCENE-2115
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 3.0
>Reporter: Kay Kay
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2115.patch
>
>
> LUCENE-1257 in Lucene 3.0 addressed porting the public APIs to generics. 
> LUCENE-2065 addressed src/test. 
> This would be a placeholder JIRA for any remaining pending generics 
> conversions across the code base. 
> Please keep it open after committing; we can close it when we are near the 
> 3.1 release, so that it serves as a placeholder ticket. 




[jira] Commented: (LUCENE-2115) Port to Generics - test cases in contrib

2009-12-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786364#action_12786364
 ] 

Michael McCandless commented on LUCENE-2115:


Hey I'm just the messenger -- Kay Kay is the 2nd policeman :)

> Port to Generics - test cases in contrib 
> -
>
> Key: LUCENE-2115
> URL: https://issues.apache.org/jira/browse/LUCENE-2115
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 3.0
>Reporter: Kay Kay
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2115.patch
>
>
> LUCENE-1257 in Lucene 3.0 addressed porting the public APIs to generics. 
> LUCENE-2065 addressed src/test. 
> This would be a placeholder JIRA for any remaining pending generics 
> conversions across the code base. 
> Please keep it open after committing; we can close it when we are near the 
> 3.1 release, so that it serves as a placeholder ticket. 




[jira] Commented: (LUCENE-2115) Port to Generics - test cases in contrib

2009-12-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786365#action_12786365
 ] 

Michael McCandless commented on LUCENE-2115:


Patch looks good -- I'll commit shortly.  Thanks Kay Kay!

> Port to Generics - test cases in contrib 
> -
>
> Key: LUCENE-2115
> URL: https://issues.apache.org/jira/browse/LUCENE-2115
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 3.0
>Reporter: Kay Kay
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2115.patch
>
>
> LUCENE-1257 in Lucene 3.0 addressed porting the public APIs to generics. 
> LUCENE-2065 addressed src/test. 
> This would be a placeholder JIRA for any remaining pending generics 
> conversions across the code base. 
> Please keep it open after committing; we can close it when we are near the 
> 3.1 release, so that it serves as a placeholder ticket. 




[jira] Resolved: (LUCENE-2115) Port to Generics - test cases in contrib

2009-12-05 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-2115.


Resolution: Fixed

Thanks Kay Kay!

> Port to Generics - test cases in contrib 
> -
>
> Key: LUCENE-2115
> URL: https://issues.apache.org/jira/browse/LUCENE-2115
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 3.0
>Reporter: Kay Kay
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2115.patch
>
>
> LUCENE-1257 in Lucene 3.0 addressed porting the public APIs to generics. 
> LUCENE-2065 addressed src/test. 
> This would be a placeholder JIRA for any remaining pending generics 
> conversions across the code base. 
> Please keep it open after committing; we can close it when we are near the 
> 3.1 release, so that it serves as a placeholder ticket. 




[jira] Commented: (LUCENE-2037) Allow Junit4 tests in our environment.

2009-12-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786367#action_12786367
 ] 

Michael McCandless commented on LUCENE-2037:


Thanks Kay Kay.  Erick, once you've had a chance to review/iterate, I plan to 
commit... then we can make use of junit4 features in our tests.

> Allow Junit4 tests in our environment.
> --
>
> Key: LUCENE-2037
> URL: https://issues.apache.org/jira/browse/LUCENE-2037
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Affects Versions: 3.1
> Environment: Development
>Reporter: Erick Erickson
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: junit-4.7.jar, LUCENE-2037.patch, LUCENE-2037.patch, 
> LUCENE-2037_revised_2.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> Now that we're dropping Java 1.4 compatibility for 3.0, we can incorporate 
> Junit4 in testing. Junit3 and junit4 tests can coexist, so no tests should 
> have to be rewritten. We should start this for the 3.1 release so we can get 
> a clean 3.0 out smoothly.
> It's probably worthwhile to convert a small set of tests as an exemplar.




[jira] Updated: (LUCENE-2111) Wrapup flexible indexing

2009-12-05 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2111:
---

Attachment: LUCENE-2111.patch

Attached patch -- will commit soon.  I found a bug in the "flex API on
non-flex" layer (the preflex codec) -- exposed with a new test case in
TestBackCompat, and fixed.  Also cleaned up some nocommits, added
indexDivisor to the loadTermsIndex API, and fixed preflex to actually
implement it.


> Wrapup flexible indexing
> 
>
> Key: LUCENE-2111
> URL: https://issues.apache.org/jira/browse/LUCENE-2111
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2111.patch
>
>
> Spinoff from LUCENE-1458.
> The flex branch is in fairly good shape -- all tests pass, initial search 
> performance testing looks good, it survived several visits from the Unicode 
> policeman ;)
> But it still has a number of nocommits, could use some more scrutiny 
> especially on the "emulate old API on flex index" and vice/versa code paths, 
> and still needs some more performance testing.  I'll do these under this 
> issue, and we should open separate issues for other self contained fixes.
> The end is in sight!




[jira] Created: (LUCENE-2116) Add link to irc channel #lucene on the website

2009-12-05 Thread Simon Willnauer (JIRA)
Add link to irc channel #lucene on the website
--

 Key: LUCENE-2116
 URL: https://issues.apache.org/jira/browse/LUCENE-2116
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Website
Reporter: Simon Willnauer
Priority: Trivial


We should add a link to #lucene IRC channel on chat.freenode.org. 




[jira] Resolved: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-05 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer resolved LUCENE-2108.
-

Resolution: Fixed

Committed in revision 887532.

Mike, thanks for the review. We should backport this change to 2.9 - can you 
commit that, please? I cannot, though.

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Simon Willnauer
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch, LUCENE-2108.patch, 
> LUCENE-2108.patch, LUCENE-2108.patch, LUCENE-2108.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.




[jira] Resolved: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

2009-12-05 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer resolved LUCENE-2039.
-

Resolution: Fixed

Committed in revision 887533

> Regex support and beyond in JavaCC QueryParser
> --
>
> Key: LUCENE-2039
> URL: https://issues.apache.org/jira/browse/LUCENE-2039
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2039.patch, LUCENE-2039_field_ext.patch, 
> LUCENE-2039_field_ext.patch, LUCENE-2039_field_ext.patch, 
> LUCENE-2039_field_ext.patch, LUCENE-2039_field_ext.patch
>
>
> Since the early days the standard query parser has been limited to the queries 
> living in core; adding other queries or extending the parser in any way 
> always forced people to change the grammar file and regenerate. Even if you 
> change the grammar, you have to be extremely careful how you modify the parser 
> so that other parts of the standard parser are not affected by customisation 
> changes. Eventually you had to live with all the limitations the current 
> parser has, like tokenizing on whitespace before a tokenizer / analyzer has 
> the chance to look at the tokens. 
> I was thinking about how to overcome this limitation and add regex support to 
> the query parser without introducing any dependency on core. I added a new 
> special character that basically prevents the parser from interpreting any of 
> the characters it encloses. I chose the forward 
> slash '/' as the delimiter, so that everything between two forward slashes 
> is basically escaped and ignored by the parser. All chars embedded within 
> forward slashes are treated as one token, even if they contain other special 
> chars like * []?{} or whitespace. This token is subsequently passed to a 
> pluggable "parser extension" which builds a query from the embedded string. I 
> do not interpret the embedded string in any way but leave all the subsequent 
> work to the parser extension. Such an extension could be another full-featured 
> query parser itself or simply a constructor call for a regex query. The 
> interface remains quite simple but makes the parser extensible more easily 
> than by modifying the JavaCC sources.
> The downside of this patch is clearly that I introduce a new special char 
> into the syntax, but I guess that is not that big a deal, as it is 
> reflected in the escape method. It would truly be nice to have more 
> than one extension and to make this even more flexible, so treat this patch 
> as a kickoff.
> Another way of solving the problem with RegexQuery would be to move the JDK 
> version of regex into the core and simply have another method like:
> {code}
> protected Query newRegexQuery(Term t) {
>   ... 
> }
> {code}
> which I would like better, as it would be more consistent with the idea of the 
> query parser being a very strict and well-defined parser.
> I will upload a patch in a second which implements the extension-based 
> approach; I guess I will add a second patch with regex in core soon, too. 
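
For illustration, here is a self-contained sketch of the pluggable "parser 
extension" idea described above (all names are hypothetical; the real patch and 
the real QueryParser API differ):

{code}
// Hypothetical sketch: the parser hands everything between the two '/'
// delimiters, untouched, to a pluggable extension that builds the query.
interface ParserExtension {
  /** Builds a query representation from the raw text found between the slashes. */
  String buildQuery(String field, String rawText);
}

class RegexExtensionSketch implements ParserExtension {
  public String buildQuery(String field, String rawText) {
    // A real implementation would return a Lucene Query built from
    // new Term(field, rawText); here we just echo the intent.
    return "regex(" + field + ", " + rawText + ")";
  }
}

class ExtensionDemo {
  public static void main(String[] args) {
    ParserExtension ext = new RegexExtensionSketch();
    // Whitespace and special chars inside the slashes never reach the parser:
    System.out.println(ext.buildQuery("title", "http.*apache [0-9]{2}"));
  }
}
{code}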




[jira] Resolved: (LUCENE-2102) LowerCaseFilter for Turkish language

2009-12-05 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer resolved LUCENE-2102.
-

Resolution: Fixed

Committed in revision 887535

Thanks Ahmet / Robert!

> LowerCaseFilter for Turkish language
> 
>
> Key: LUCENE-2102
> URL: https://issues.apache.org/jira/browse/LUCENE-2102
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 3.0
>Reporter: Ahmet Arslan
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2102.patch, LUCENE-2102.patch, LUCENE-2102.patch, 
> LUCENE-2102.patch, LUCENE-2102.patch, LUCENE-2102.patch, LUCENE-2102.patch, 
> LUCENE-2102.patch
>
>
> java.lang.Character.toLowerCase() converts 'I' to 'i' however in Turkish 
> alphabet lowercase of 'I' is not 'i'. It is LATIN SMALL LETTER DOTLESS I.
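
For illustration, a small standalone snippet showing the locale behaviour 
described above (not part of the attached patch):

{code}
import java.util.Locale;

public class TurkishLowerCaseDemo {
  public static void main(String[] args) {
    // Character.toLowerCase is locale-insensitive: 'I' -> 'i' (U+0069).
    System.out.println(Character.toLowerCase('I'));
    // String.toLowerCase with the Turkish locale: "I" -> "\u0131" (dotless i).
    System.out.println("I".toLowerCase(new Locale("tr", "TR")));
  }
}
{code}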




[jira] Created: (LUCENE-2117) Fix SnowballAnalyzer casing behavior for Turkish Language

2009-12-05 Thread Simon Willnauer (JIRA)
Fix SnowballAnalyzer casing behavior for Turkish Language
-

 Key: LUCENE-2117
 URL: https://issues.apache.org/jira/browse/LUCENE-2117
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Affects Versions: 3.0
Reporter: Simon Willnauer
Priority: Minor
 Fix For: 3.1


LUCENE-2102 added a new TokenFilter to handle Turkish unique casing behavior 
correctly. We should fix the casing behavior in SnowballAnalyzer too as it 
supports a TurkishStemmer.




Push fast-vector-highlighter mvn artifacts for 3.0 and 2.9

2009-12-05 Thread Simon Willnauer
hi folks,
The Maven artifacts for fast-vector-highlighter have never been pushed
since it was released, because there was no pom.xml.template inside
the module. I added a pom file a day ago in the context of
LUCENE-2107. I already talked to Uwe and Grant about how to deal with this
issue and whether we should push the artifact for Lucene 2.9 / 3.0. Since
this is only a metadata file, we could consider rebuilding the
artifacts and publishing them for those releases. I cannot remember
anything like that happening before, so we should discuss how to deal
with this situation and whether we should wait until 3.1.

simon




[jira] Updated: (LUCENE-2116) Add link to irc channel #lucene on the website

2009-12-05 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2116:


Attachment: LUCENE-2116.patch

Created a patch for the website. 
As Mike mentioned in the chat, it would be desirable to have an archive for the 
IRC channel. Does anybody know how IRC archiving works and who initially 
created the channel?

> Add link to irc channel #lucene on the website
> --
>
> Key: LUCENE-2116
> URL: https://issues.apache.org/jira/browse/LUCENE-2116
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Website
>Reporter: Simon Willnauer
>Priority: Trivial
> Attachments: LUCENE-2116.patch
>
>
> We should add a link to #lucene IRC channel on chat.freenode.org. 




[jira] Updated: (LUCENE-2112) Flex on non-flex emulation of TermsEnum incorrectly seeks/nexts beyond current field

2009-12-05 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2112:
---

Attachment: LUCENE-2112.patch

Attached patch; added 2 new test cases... fixed a few issues in the emulation 
layers.

> Flex on non-flex emulation of TermsEnum incorrectly seeks/nexts beyond 
> current field
> 
>
> Key: LUCENE-2112
> URL: https://issues.apache.org/jira/browse/LUCENE-2112
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: Flex Branch
>
> Attachments: LUCENE-2112.patch
>
>
> Spinoff of LUCENE-2111, where Uwe found this issue with the flex on non-flex 
> emulation.




[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

2009-12-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786404#action_12786404
 ] 

Michael McCandless commented on LUCENE-1526:


Jake, have you guys had a chance to re-run your tests across varying
reopen rates?  Are you still hitting OOM / file handle leaks with
straight Lucene NRT?  I've been unable to reproduce these issues in
my stress testing so I'd like to hone in on what's different in our
testing.

> For near real-time search, use paged copy-on-write BitVector impl
> -
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1526.patch, LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 




[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786406#action_12786406
 ] 

Michael McCandless commented on LUCENE-2108:


bq. We should backport this change to 2.9 - can you commit that, please? I 
cannot, though.

And, to 3.0.  OK will do...

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Simon Willnauer
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch, LUCENE-2108.patch, 
> LUCENE-2108.patch, LUCENE-2108.patch, LUCENE-2108.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.




[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

2009-12-05 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786408#action_12786408
 ] 

John Wang commented on LUCENE-1526:
---

Yes, we still see the issue. After 20+ minutes of the performance/stress test, 
latency spiked from 5 ms to 550 ms, and the file handle leakage was severe 
enough that the test crashed. This is the code:

http://code.google.com/p/zoie/source/browse/branches/BR_DELETE_OPT/java/proj/zoie/impl/indexing/luceneNRT/ThrottledLuceneNRTDataConsumer.java

Our logging indicates there are at most 3 IndexReader instances open at any time, 
yet the file handle count is very high.

> For near real-time search, use paged copy-on-write BitVector impl
> -
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1526.patch, LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 




[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-05 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786410#action_12786410
 ] 

Simon Willnauer commented on LUCENE-2108:
-

Mike, I just realized that we need to change the test, as it uses Java 5
classes. I will provide a patch compatible with Java 1.4 later.



> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Simon Willnauer
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch, LUCENE-2108.patch, 
> LUCENE-2108.patch, LUCENE-2108.patch, LUCENE-2108.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.




[jira] Commented: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786411#action_12786411
 ] 

Mark Miller commented on LUCENE-2110:
-

Hey Uwe, since you're editing this code anyway, want to add a comment fix for the 
reference to TermInfo here?

{code}
+  // Loading the TermInfo from the terms dict here
+  // should not be costly, because 1) the
+  // query/filter will load the TermInfo when it
+  // runs, and 2) the terms dict has a cache:
{code}

> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.




Release artifacts

2009-12-05 Thread DM Smith
I'm wondering about the size of the builds, which are surprisingly big to me. 
The src is 12M/13M and the bin is 17M/26M (tar.gz/zip) for 2.9.1, similar for 
3.0.0.

In looking at the binary artifact I see the following:
* Every contrib jar has a corresponding javadoc jar, but there is no 
core-javadoc.jar. However, there is a doc folder that is not jarred, and it has 
all the contrib documentation in it.

Do the javadocs really need to be included in the bin artifact at all? In my 
working environment, Eclipse, they are entirely unnecessary when one has the 
src.zip, and I imagine other IDEs are similar. They are also trivial to 
generate. I'd rather see a separate JavaDoc tar.gz/zip.

And if it is needed in the bin artifact, is it necessary to have it 
uncompressed and partially duplicated?

The contrib javadoc.jars total 4.3M and the docs/api when zipped or jarred has 
a size of 13.7M.

For whatever reason, gzip is much better at compressing javadoc than zip or jar 
is. While not duplicating the contrib javadocs at all would be better, simply not 
jarring them would improve the gzip compressibility of the bin artifact and 
would not adversely affect the size of the zip.

* There is a src folder that has a few things in it.

Aren't these in another jar? And shouldn't that folder not be there? After all, 
it is not a src artifact.

* The lib folder has servlet-api-2.4.jar in it, but junit-3.8.2.jar is 
not.

Should either of these be there?


Regarding the src artifact I see the following:
* It is far more than the src for the bin artifact.
** It includes site files.
** It includes tests.
** It includes 3-rd party jars for contrib. 6.3M of them.

I get that it is merely an export of SVN, but should it be?
Could it be broken out into separate parts? Would that make sense?
E.g.
lucene-src -- Contains the parts for the bin jar.
lucene-test -- Contains the tests.
lucene-dependency -- Contains the 3-rd party jars.
lucene-misc -- Contains the site files and miscellaneous other stuff.

Regarding the 3-rd party jars, there are 2 jars that lucene/contrib requires that 
are not in svn; to get them one has to bootstrap by running ant. These are the 
bdb libs.

I also see that lucene has a patched Xerces (is that still necessary?) and a 
custom build of ICU4J (are there instructions for creating that? I didn't find 
them.).

Every release of Lucene, I find myself repackaging the bin and src to skinny 
them down to what we need for our development environment. That's my motivation 
for raising these questions.

If we can figure out if or what should change, I'd be glad to do the ant work. 
I know enough of ant to be dangerous ;)

-- DM








[jira] Assigned: (LUCENE-2096) Investigate parallelizing Ant junit tests

2009-12-05 Thread Erick Erickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson reassigned LUCENE-2096:
--

Assignee: (was: Erick Erickson)

Maybe for later

> Investigate parallelizing Ant junit tests
> -
>
> Key: LUCENE-2096
> URL: https://issues.apache.org/jira/browse/LUCENE-2096
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Build
>Reporter: Erick Erickson
>Priority: Minor
>
> Ant Contrib has a "ForEach" construct that may speed up running all of the 
> Junit tests by parallelizing them with a configurable number of threads. I 
> envision this in several stages. First, see if ForEach works for us with 
> hard-coded lists, distribute this for testing then make the changes "for 
> real". I intend to hard-code the list for the first pass, ordered by the time 
> they take. This won't do for check-in, but will give us a fast 
> proof-of-concept.
> This approach will be most useful for multi-core machines.
> In particular, we need to see whether the parallel tasks are isolated enough 
> from each other to prevent mutual interference.
> All this assumes the fragmentary reference I found is still available...




[jira] Updated: (LUCENE-2037) Allow Junit4 tests in our environment.

2009-12-05 Thread Erick Erickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson updated LUCENE-2037:
---

Attachment: LUCENE-2037.patch

Had enough time this morning to reconcile this with Kay Kay's changes.

All tests pass.

JUnit 3.x is no longer necessary; running with the JUnit 4.7 jar runs JUnit 3 
style tests as well as annotated JUnit 4 style tests.

It's preferable (but not necessary) to import from org.junit rather than 
junit.framework.
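
For illustration, a minimal pair showing the two styles that now coexist (the 
test names are made up):

{code}
// JUnit 3 style: extend TestCase, method names must start with "test".
public class LegacyStyleTest extends junit.framework.TestCase {
  public void testAddition() {
    assertEquals(4, 2 + 2);
  }
}
{code}

{code}
// JUnit 4 style: plain class, @Test annotation, imports from org.junit.
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class AnnotatedStyleTest {
  @Test
  public void addition() {
    assertEquals(4, 2 + 2);
  }
}
{code}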

> Allow Junit4 tests in our environment.
> --
>
> Key: LUCENE-2037
> URL: https://issues.apache.org/jira/browse/LUCENE-2037
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Affects Versions: 3.1
> Environment: Development
>Reporter: Erick Erickson
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: junit-4.7.jar, LUCENE-2037.patch, LUCENE-2037.patch, 
> LUCENE-2037.patch, LUCENE-2037_revised_2.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> Now that we're dropping Java 1.4 compatibility for 3.0, we can incorporate 
> Junit4 in testing. Junit3 and junit4 tests can coexist, so no tests should 
> have to be rewritten. We should start this for the 3.1 release so we can get 
> a clean 3.0 out smoothly.
> It's probably worthwhile to convert a small set of tests as an exemplar.




[jira] Updated: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2110:
--

Attachment: LUCENE-2110.patch

After porting Automaton, I realized that the seeking code should be changed 
and made a little bit more flexible.

accept() can now return five AcceptStatus values:
- YES, NO: accept / do not accept the term and go forward; the simple linear case 
that iterates until the end and filters terms (the FuzzyQuery case, linear 
Automaton).
- YES_AND_SEEK, NO_AND_SEEK: the same as above, but instead of simply going 
forward, nextSeekTerm() is called to retrieve a new term to seek to. This 
method is now supposed to always return a greater term than before; if not, the 
enumeration can end too early (see below).
- END: end the enumeration (no further seeking). This status is used by 
TermRangeQuery and PrefixQuery as before.

nextSeekTerm() should always return a term greater than the last one before 
seeking; this is asserted by NRQ. Violating it is not fatal, but the enum is 
then no longer correctly sorted. Also, if the consumer reaches the last term 
of the underlying enum, calling next() will end the enumeration, so further 
terms from the nextSeekTerm() iteration will not be consulted (the same happens 
when END is returned from accept(), of course).

If nextSeekTerm() returns null, the enumeration is also ended, so it is not 
required to return AcceptStatus.END instead of X_AND_SEEK.
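
To make the control flow concrete, here is a deliberately simplified model of 
how a consumer loop could interpret these statuses (hypothetical names and 
signatures, not the actual FilteredTermsEnum code):

{code}
// Hypothetical, simplified model of the accept()/nextSeekTerm() protocol.
enum AcceptStatus { YES, NO, YES_AND_SEEK, NO_AND_SEEK, END }

abstract class FilteredEnumSketch {
  private String pendingSeek; // set when accept() asked for a seek
  private boolean ended;      // set when nextSeekTerm() returned null or END was hit

  protected abstract AcceptStatus accept(String term);
  /** Next term to seek to; must be greater than the last term, or null to end. */
  protected abstract String nextSeekTerm(String current);
  protected abstract String rawNext();              // underlying enum: next term or null
  protected abstract String rawSeek(String target); // underlying enum: first term >= target, or null

  /** Returns the next accepted term, or null when the enumeration ends. */
  public final String next() {
    while (!ended) {
      String term = (pendingSeek != null) ? rawSeek(pendingSeek) : rawNext();
      pendingSeek = null;
      if (term == null) return null;                // underlying enum exhausted
      AcceptStatus s = accept(term);
      if (s == AcceptStatus.END) {                  // TermRangeQuery / PrefixQuery case
        ended = true;
        return null;
      }
      if (s == AcceptStatus.YES_AND_SEEK || s == AcceptStatus.NO_AND_SEEK) {
        pendingSeek = nextSeekTerm(term);           // must be > term
        if (pendingSeek == null) ended = true;      // null also ends the enumeration
      }
      if (s == AcceptStatus.YES || s == AcceptStatus.YES_AND_SEEK) return term;
      // NO or NO_AND_SEEK: keep looking
    }
    return null;
  }
}
{code}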

> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch, 
> LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.




[jira] Commented: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786426#action_12786426
 ] 

Uwe Schindler commented on LUCENE-2110:
---

Mark: I do not know what you are talking about (sorry, my brain is fuming 
after Automaton).

> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch, 
> LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.




[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1606:
--

Attachment: LUCENE-1606-flex.patch

Here is a flex patch for Automaton. It contains LUCENE-2110; as soon as 2110 is 
committed I will upload a new patch, but it's hard to differentiate between all 
the modified files.

Robert: Can you do performance tests with the old and new flex patches? I do not 
want to commit 2110 before that.

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (the name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.




[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786431#action_12786431
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. Robert: Can you do performance tests with the old and new flex patches? I do 
not want to commit 2110 before that.

Uwe, I will run a benchmark on both versions!

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (the name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.




[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1606:
--

Attachment: LUCENE-1606-flex.patch

New patch; there was a lost private field. Also changed the nextSeekTerm method 
to be more straightforward.

Robert: Sorry, it would be better to test this one *g*

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (the name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.




[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1606:
--

Attachment: (was: LUCENE-1606-flex.patch)

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (the name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786433#action_12786433
 ] 

Mark Miller commented on LUCENE-2110:
-

No problem, we can get to it afterwards - it's not really related. I just figured, 
since you were patching here anyway, I'd mention what I happened to notice while 
taking a look at the patch:

TermInfo is no longer used in flex, but it's still referenced in the comment above, 
in MTQ.



> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch, 
> LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786438#action_12786438
 ] 

Robert Muir commented on LUCENE-1606:
-

Hi Uwe, I ran my benchmarks, and with your patch the performance is the same.

But the code is much simpler and easier to read... great work.

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786439#action_12786439
 ] 

Robert Muir commented on LUCENE-2110:
-

Uwe, I really like what you have done here (as commented on LUCENE-1606)

Seeking around in a FilteredTermsEnum is even simpler here. (In my opinion, this 
whole thing is very tricky with trunk, and it is good to simplify it.)


> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch, 
> LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786451#action_12786451
 ] 

Michael McCandless commented on LUCENE-2110:


bq. nextSeekTerm() should always return a greater term than the last one before 
seeking. 

Uwe, why was this constraint needed?  What goes wrong if we allow terms to be 
returned out of order?  The consumers of this (MTQ's rewrite methods) don't 
mind if terms are out of order, right?

> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch, 
> LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786452#action_12786452
 ] 

Uwe Schindler commented on LUCENE-2110:
---

It will work (theoretically) but can fail:
if you seek to the last term and accept it, the next call to next() will end 
the enum, even if there may be more positions to seek. You cannot rely on all 
seek terms being visited. Because of that it *should* be forward only; 
otherwise you must know what you are doing.
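
To illustrate the general point (once the underlying enumeration is exhausted, the loop ends before any remaining seek positions are visited), here is a toy model in plain Java; it is not the Lucene flex API, and all names (SeekTermProducer, drive, ...) are hypothetical:

import java.util.Arrays;

public class ForwardOnlySeekDemo {

  interface SeekTermProducer {
    // Next term to jump to, or null to stop; should be > the last term seen.
    String nextSeekTerm(String lastTermSeen);
    boolean accept(String term);
  }

  // Simplified driver over a sorted "terms dictionary": every step is a seek
  // to the requested term (the real enum also scans with next(), but that
  // detail is not needed to show the point).
  static void drive(String[] sortedTerms, SeekTermProducer producer) {
    String seekTo = producer.nextSeekTerm(null);
    while (seekTo != null) {
      // seek(): jump to the first term >= seekTo.
      int idx = Arrays.binarySearch(sortedTerms, seekTo);
      int pos = (idx >= 0) ? idx : -idx - 1;
      if (pos >= sortedTerms.length) {
        return;                 // dictionary exhausted: the loop ends here and
      }                         // later seek terms are never asked for
      String term = sortedTerms[pos];
      if (producer.accept(term)) {
        System.out.println("hit: " + term);
      }
      seekTo = producer.nextSeekTerm(term);
    }
  }
}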

> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch, 
> LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2117) Fix SnowballAnalyzer casing behavior for Turkish Language

2009-12-05 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2117:


Attachment: LUCENE-2117.patch

Patch for the bug that does the following:
* for the Turkish language, when Version >= 3.1, use TurkishLowerCaseFilter 
instead in SnowballAnalyzer (a hedged sketch of this version switch follows below)
* add a javadoc note to SnowballFilter stating that it expects lowercased text to 
work (and that in the Turkish case you must use the special filter)
* add a contrib/analyzers dependency to contrib/snowball (perhaps not the best, 
but what is the other option?)
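
A minimal sketch of the version-dependent lowercasing choice from the first bullet, not the actual patch: it assumes a Version constant such as LUCENE_31 with onOrAfter(), and the TurkishLowerCaseFilter added in LUCENE-2102; the class name TurkishAwareSnowballAnalyzer is made up, and the real SnowballAnalyzer also adds StandardFilter/StopFilter:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tr.TurkishLowerCaseFilter;
import org.apache.lucene.util.Version;

public class TurkishAwareSnowballAnalyzer extends Analyzer {
  private final Version matchVersion;
  private final String stemmerName;   // e.g. "Turkish", "English", ...

  public TurkishAwareSnowballAnalyzer(Version matchVersion, String stemmerName) {
    this.matchVersion = matchVersion;
    this.stemmerName = stemmerName;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(matchVersion, reader);
    // SnowballFilter expects already-lowercased input. For Turkish, the
    // dotted/dotless i distinction means the plain LowerCaseFilter is wrong,
    // so pick the Turkish-aware filter when the match version allows it.
    if ("Turkish".equals(stemmerName) && matchVersion.onOrAfter(Version.LUCENE_31)) {
      result = new TurkishLowerCaseFilter(result);
    } else {
      result = new LowerCaseFilter(result);
    }
    return new SnowballFilter(result, stemmerName);
  }
}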


> Fix SnowballAnalyzer casing behavior for Turkish Language
> -
>
> Key: LUCENE-2117
> URL: https://issues.apache.org/jira/browse/LUCENE-2117
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2117.patch
>
>
> LUCENE-2102 added a new TokenFilter to handle Turkish unique casing behavior 
> correctly. We should fix the casing behavior in SnowballAnalyzer too as it 
> supports a TurkishStemmer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786457#action_12786457
 ] 

Uwe Schindler commented on LUCENE-2110:
---

I have a solution for this problem: if the end of the enum is reached, I just 
ask for a new seek term if seek==true (that is how it was before). But 
nextSeekTerm() gets the information that the underlying enum is already 
exhausted and *could* then return null. This is important for Automaton, 
because otherwise it would loop endlessly (it would keep producing terms and 
terms and terms... in nextSeekTerm()).

> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch, 
> LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2037) Allow Junit4 tests in our environment.

2009-12-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786460#action_12786460
 ] 

Michael McCandless commented on LUCENE-2037:


OK patch looks good, thanks Erick & Kay Kay!  I'll commit shortly.

> Allow Junit4 tests in our environment.
> --
>
> Key: LUCENE-2037
> URL: https://issues.apache.org/jira/browse/LUCENE-2037
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Affects Versions: 3.1
> Environment: Development
>Reporter: Erick Erickson
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: junit-4.7.jar, LUCENE-2037.patch, LUCENE-2037.patch, 
> LUCENE-2037.patch, LUCENE-2037_revised_2.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> Now that we're dropping Java 1.4 compatibility for 3.0, we can incorporate 
> Junit4 in testing. Junit3 and junit4 tests can coexist, so no tests should 
> have to be rewritten. We should start this for the 3.1 release so we can get 
> a clean 3.0 out smoothly.
> It's probably worthwhile to convert a small set of tests as an exemplar.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2037) Allow Junit4 tests in our environment.

2009-12-05 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-2037.


Resolution: Fixed

> Allow Junit4 tests in our environment.
> --
>
> Key: LUCENE-2037
> URL: https://issues.apache.org/jira/browse/LUCENE-2037
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Affects Versions: 3.1
> Environment: Development
>Reporter: Erick Erickson
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: junit-4.7.jar, LUCENE-2037.patch, LUCENE-2037.patch, 
> LUCENE-2037.patch, LUCENE-2037_revised_2.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> Now that we're dropping Java 1.4 compatibility for 3.0, we can incorporate 
> Junit4 in testing. Junit3 and junit4 tests can coexist, so no tests should 
> have to be rewritten. We should start this for the 3.1 release so we can get 
> a clean 3.0 out smoothly.
> It's probably worthwhile to convert a small set of tests as an exemplar.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2110:
--

Attachment: LUCENE-2110.patch

Attached is a patch that allows the TermsEnum to go backwards and not break if 
the end of the underlying TermsEnum is reached after next() or seek().

The method nextSeekTerm() gets a boolean indicating whether the underlying 
TermsEnum is exhausted. Enums that work in order can then simply return null to 
break iteration. But they are free to reposition to an earlier term.

> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch, 
> LUCENE-2110.patch, LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1606:
--

Attachment: LUCENE-1606-flex.patch

An update with the changed nextSeekTerm() semantics from LUCENE-2110.

Robert: Can you test performance again and compare it with the old patch?

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2110:
--

Attachment: (was: LUCENE-2110.patch)

> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch, 
> LUCENE-2110.patch, LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2110:
--

Attachment: LUCENE-2110.patch

Fixed patch - I have to stop for today.

> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch, 
> LUCENE-2110.patch, LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1606:
--

Attachment: (was: LUCENE-1606-flex.patch)

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1606:
--

Attachment: LUCENE-1606-flex.patch

There was a bug in the previous patch, sorry. I will finish work for today; I am 
exhausted, like the enums.

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786479#action_12786479
 ] 

Uwe Schindler commented on LUCENE-1606:
---

Stop everything before I collapse! Wrong patch again.

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786480#action_12786480
 ] 

Uwe Schindler commented on LUCENE-2110:
---

Stop everything before I collapse! Wrong patch again.

> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch, 
> LUCENE-2110.patch, LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2110:
--

Attachment: (was: LUCENE-2110.patch)

> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch, 
> LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1606:
--

Attachment: (was: LUCENE-1606-flex.patch)

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2110:
--

Attachment: LUCENE-2110.patch

Now the final one.

I somehow need a test enum which does very strange things, like seeking forwards 
and backwards and returning all the strange statuses.

I will think about one tomorrow.

> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch, 
> LUCENE-2110.patch, LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1606:
--

Attachment: LUCENE-1606-flex.patch

Now the final one.

I somehow need a test enum which does very strange things, like seeking forwards 
and backwards and returning all the strange statuses.

I will think about one tomorrow.

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786486#action_12786486
 ] 

Uwe Schindler commented on LUCENE-2110:
---

Robert and I analyzed the latest change. It is so complex that I am really not 
sure if we should do this. It is impossible to maintain.

We should enforce seeking forwards only (even if MTQ could accept terms out of 
order). Violating the TermsEnum order is a bad idea, so we should use the 
previous patch. NRQ and also Automaton enforce stepping forwards only.

Mike?

> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch, 
> LUCENE-2110.patch, LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786489#action_12786489
 ] 

Mark Miller commented on LUCENE-1606:
-

The new WildcardQuery is holding up very well under random testing -

I'm comparing the results of the old WildcardQuery impl with the new 
WildcardQuery impl.

I'm using a 2 million doc English index and a 2 million doc French index.

I'm generating random queries - both random short strings built up from random 
Unicode chars mixed with some random wildcards, and random English/French words 
from dictionaries, randomly chopped or not, with random wildcards injected. A 
whole lot of crazy randomness.

They have always produced the same number of results so far (a few hours of 
running).

The new impl is generally either a bit faster in these cases, or about the same 
- at worst (in general), I've seen it about .01s slower. When it's faster, it's 
often > .1s faster (or more when a few '?' are involved).

On average, I'd say the perf is about the same - where the new impl shines 
appears to be when '?' is used (as I think Robert has mentioned).

So far I haven't seen any anomalies in time taken or anything of that nature.
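
A hedged sketch of that kind of comparative random test (not the actual harness): it assumes a Lucene 3.x IndexSearcher and treats the old and new WildcardQuery implementations as interchangeable factories; the QueryFactory interface and the class name are made up for illustration:

import java.util.Random;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class RandomWildcardCheck {

  // Hypothetical hook so the old and new WildcardQuery impls can be swapped in.
  interface QueryFactory {
    Query create(Term term);
  }

  private static final Random random = new Random();

  // Build a random pattern: lowercase letters mixed with '*' and '?'.
  static String randomPattern(int maxLen) {
    StringBuilder sb = new StringBuilder();
    int len = 1 + random.nextInt(maxLen);
    for (int i = 0; i < len; i++) {
      int dice = random.nextInt(10);
      if (dice == 0) sb.append('*');
      else if (dice == 1) sb.append('?');
      else sb.append((char) ('a' + random.nextInt(26)));
    }
    return sb.toString();
  }

  // Run one random pattern against both implementations and compare hit counts.
  static void checkOnce(IndexSearcher searcher, String field,
                        QueryFactory oldImpl, QueryFactory newImpl) throws Exception {
    String pattern = randomPattern(12);
    Term t = new Term(field, pattern);
    TopDocs oldHits = searcher.search(oldImpl.create(t), 100);
    TopDocs newHits = searcher.search(newImpl.create(t), 100);
    if (oldHits.totalHits != newHits.totalHits) {
      throw new AssertionError("Mismatch for pattern " + pattern
          + ": old=" + oldHits.totalHits + " new=" + newHits.totalHits);
    }
  }
}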

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786487#action_12786487
 ] 

Robert Muir commented on LUCENE-2110:
-

Yeah, compared to the last patch, the backwards seeking makes the code more 
complex in my opinion.

I do not understand why an MTQ would need to seek backwards. Can we say instead 
that, if you want to do such a thing with flexible indexing, the way to do it is 
to define a custom sort order in your codec?



> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch, 
> LUCENE-2110.patch, LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786489#action_12786489
 ] 

Mark Miller edited comment on LUCENE-1606 at 12/5/09 8:43 PM:
--

The new WildcardQuery is holding up very well under random testing -

I'm comparing the results of the old WildcardQuery impl with the new 
WildcardQuery impl.

I'm using a 2 million doc English index and a 2 million doc French index 
(Wikipedia dumps).

I'm generating random queries - both random short strings built up from random 
Unicode chars mixed with some random wildcards, and random English/French words 
from dictionaries, randomly chopped or not, with random wildcards injected. A 
whole lot of crazy randomness.

They have always produced the same number of results so far (a few hours of 
running).

The new impl is generally either a bit faster in these cases, or about the same 
- at worst (in general), I've seen it about .01s slower. When it's faster, it's 
often > .1s faster (or more when a few '?' are involved).

On average, I'd say the perf is about the same - where the new impl shines 
appears to be when '?' is used (as I think Robert has mentioned).

So far I haven't seen any anomalies in time taken or anything of that nature.

  was (Author: markrmil...@gmail.com):
The new WildcardQuery is holding up very well under random testing -

I'm comparing the results of the old WildcardQuery impl with the new 
WildcardQuery impl.

I'm using a 2 million doc english and 2 million doc french index.

Generating random queries - both random short strings built up from random 
unicode chars mixed with some random wildcards, and random english/french words 
from dictionaries, randomly chopped or not, with random wildcards injected. A 
whole lot of crazy randomness.

They have always produced the same number of results so far (a few hours of 
running).

The new impl is generally either a bit faster in these cases, or about the same 
- at worst (in general), I've seen it about .01s  slower. When its faster, its 
offten > .1s faster (or more when a few '?' are involved).

On avg, I'd say the perf is about the same - where the new impl shines appears 
to be when '?' is used (as I think Robert has mentioned).

So far I haven't seen any anomalies in time taken or anything of that nature.
  
> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786493#action_12786493
 ] 

Robert Muir commented on LUCENE-1606:
-

Mark, thanks for testing!

Yes, the new wildcard should really only help for ? with trunk (especially 
leading ?).
With flex it should help a lot more: even leading * gets the benefit of the 
"common suffix" optimization, byte[] comparison, and things like that.
This code is in the trunk patch but does not really help yet, because the trunk 
enum works on String.

By the way, how many unique terms does the field you are testing have? This is 
where it starts to help with ?, once you have a ton of unique terms.
But I am glad you are testing with what is hopefully a smaller number of unique 
terms - this is probably the more common case.
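
For the curious, the underlying mechanism looks roughly like this when using 
the BSD-licensed BRICS library directly (illustration only, not the patch 
code): the wildcard pattern is translated to a regular expression, compiled to 
a DFA, and each candidate term then gets a cheap accept/reject check.

{code}
// Illustration only - this uses the BRICS automaton library directly, not the
// AutomatonQuery/Filter code from the patch.
import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;
import dk.brics.automaton.RunAutomaton;

public class WildcardDfaDemo {
  public static void main(String[] args) {
    // Wildcard "te?t*" becomes the regexp "te.t.*" ('?' = one char, '*' = any run).
    RegExp regexp = new RegExp("te.t.*");
    Automaton dfa = regexp.toAutomaton();
    RunAutomaton matcher = new RunAutomaton(dfa); // table-driven (array-based) runner

    System.out.println(matcher.run("test"));    // true
    System.out.println(matcher.run("texture")); // true
    System.out.println(matcher.run("team"));    // false
  }
}
{code}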


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786494#action_12786494
 ] 

Uwe Schindler commented on LUCENE-2110:
---

+1, I reverted it here completely. It does not make sense to support unordered 
filtered enums. If somebody wants to implement that, they should do it 
differently, by overriding next() themselves instead of using nextSeekTerm() 
and accept().
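
To make the contract concrete, a purely illustrative sketch (deliberately not 
the real flex-branch FilteredTermsEnum API, just the shape of the discussion): 
the enum is unpositioned until next() is called, and because terms arrive in 
order, accept() can signal END to short-circuit the whole enumeration.

{code}
// Illustration only - not the actual flex-branch FilteredTermsEnum class.
import java.util.Iterator;

abstract class FilteredStringEnum {
  protected enum AcceptStatus { YES, NO, END }

  private final Iterator<String> source; // stands in for the wrapped TermsEnum

  FilteredStringEnum(Iterator<String> source) {
    this.source = source;
  }

  /** Decide whether a candidate term belongs to the filtered enumeration. */
  protected abstract AcceptStatus accept(String term);

  /** Iterator-style contract: unpositioned until the first call; null when exhausted. */
  public final String next() {
    while (source.hasNext()) {
      String candidate = source.next();
      AcceptStatus status = accept(candidate);
      if (status == AcceptStatus.YES) {
        return candidate;
      } else if (status == AcceptStatus.END) {
        return null; // terms arrive in order, so nothing after this point can match
      }
      // AcceptStatus.NO: skip this term and keep iterating
    }
    return null;
  }
}
{code}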

> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch, 
> LUCENE-2110.patch, LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1606:
--

Attachment: LUCENE-1606-flex.patch

Here is the patch with the getEnum/getTermsEnum changes instead of rewrite, but 
with the LUCENE-2110 change reverted, which was a stupid idea.

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-05 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786497#action_12786497
 ] 

Shalin Shekhar Mangar commented on LUCENE-2108:
---

I ran into index corruption during stress testing in SOLR-785. After upgrading 
contrib-spellcheck to lucene trunk, those issues are no longer reproducible. 
You guys have saved me a lot of time :)

Thanks for fixing this!

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Simon Willnauer
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch, LUCENE-2108.patch, 
> LUCENE-2108.patch, LUCENE-2108.patch, LUCENE-2108.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2100) Make contrib analyzers final

2009-12-05 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786499#action_12786499
 ] 

Simon Willnauer commented on LUCENE-2100:
-

bq. Simon what do you suggest? Instead of breaking in 3.1
I suggest moving the core analyzers into a separate issue and linking the two. 
That way we can make progress here, since the backwards-compatibility policy 
for contrib is not as strict and people do not care about it as much as they do 
for core analyzers. I doubt that many people have subclassed StandardAnalyzer, 
and if they do, they are probably doing something wrong anyway. Let's have two 
issues so we can drive the core discussion independently from contrib.
My personal feeling is that we should break it in 3.1 - let's see whether the 
other devs object.

> Make contrib analyzers final
> 
>
> Key: LUCENE-2100
> URL: https://issues.apache.org/jira/browse/LUCENE-2100
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1, 
> 2.9, 2.9.1, 3.0
>Reporter: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
>
> The analyzers in contrib/analyzers should all be marked final. None of the 
> Analyzers should ever be subclassed - users should build their own analyzers 
> if a different combination of filters and Tokenizers is desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786498#action_12786498
 ] 

Mark Miller commented on LUCENE-1606:
-

bq. how many unique terms does the field you are testing have

I'm not sure at the moment - but it's Wikipedia dumps, so I'd guess it's rather 
high actually. It is hitting the StandardAnalyzer going in (mainly because I 
didn't think about changing it when building the indexes). And the queries are 
getting hit with the lowercase filter (I stole that code anyway).

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1606:
--

Attachment: LUCENE-1606-flex.patch

Again - grr, to hell with the AM/PM bug in JIRA! It is *xxx***

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1606:
--

Attachment: (was: LUCENE-1606-flex.patch)

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2110:
--

Attachment: (was: LUCENE-2110.patch)

> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch, 
> LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786502#action_12786502
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. I'm not sure at the moment - but it's Wikipedia dumps, so I'd guess it's 
rather high actually.

See the description - I created this mainly for running regexps on indexes with 
100M+ unique terms.
Wildcard doesn't get as much benefit, except for the ? operator and the 
comparisons being faster (array-based DFA).

I'm pleased to hear it's doing so well on such a "small" index as Wikipedia, as 
I would have thought the automaton overhead would make it slower (although this 
can probably be optimized away).


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-05 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786506#action_12786506
 ] 

Simon Willnauer commented on LUCENE-2108:
-

bq. Thanks for fixing this!
YW! very good feedback - I will port it to 2.9 soon.

simon

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Simon Willnauer
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch, LUCENE-2108.patch, 
> LUCENE-2108.patch, LUCENE-2108.patch, LUCENE-2108.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2100) Make contrib analyzers final

2009-12-05 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2100:


Attachment: LUCENE-2100.patch

This patch marks all analyzers in contrib as final and removes the backwards 
compat tests checking if subclasses implement reusableTokenStream.
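
For users who currently extend one of these analyzers, the alternative looks 
roughly like this against the Lucene 3.0 API (a minimal sketch - the class name 
and filter chain are just an example):

{code}
// Minimal sketch (Lucene 3.0 API) of composing your own Analyzer instead of
// subclassing a now-final contrib analyzer. Names and filter chain are examples.
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public final class MyCustomAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(Version.LUCENE_30, reader);
    result = new LowerCaseFilter(result);
    return result;
  }
}
{code}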

> Make contrib analyzers final
> 
>
> Key: LUCENE-2100
> URL: https://issues.apache.org/jira/browse/LUCENE-2100
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1, 
> 2.9, 2.9.1, 3.0
>Reporter: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2100.patch
>
>
> The analyzers in contrib/analyzers should all be marked final. None of the 
> Analyzers should ever be subclassed - users should build their own analyzers 
> if a different combination of filters and Tokenizers is desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Push fast-vector-highlighter mvn artifacts for 3.0 and 2.9

2009-12-05 Thread Grant Ingersoll
I suppose we could put up the artifacts on a dev site and then we could vote to 
release both of them pretty quickly.  I think that should be easy to do, since 
it pretty much only involves verifying the jar and the signatures.

On Dec 5, 2009, at 1:03 PM, Simon Willnauer wrote:

> hi folks,
> The maven artifacts for fast-vector-highlighter have never been pushed
> since it was released because there were no pom.xml.template inside
> the module. I added a pom file a day ago in the context of
> LUCENE-2107. I already talked to uwe and grant how to deal with this
> issues and if we should push the artifact for Lucene 2.9 / 3.0. Since
> this is only a metadata file we could consider rebuilding the
> artefacts and publish them for those releases. I can not remember that
> anything like that happened before, so we should discuss how to deal
> with this situation and if we should wait until 3.1.
> 
> simon
> 
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> 



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Push fast-vector-highlighter mvn artifacts for 3.0 and 2.9

2009-12-05 Thread Simon Willnauer
On Sat, Dec 5, 2009 at 10:25 PM, Grant Ingersoll  wrote:
> I suppose we could put up the artifacts on a dev site and then we could vote 
> to release both of them pretty quickly.  I think that should be easy to do, 
> since it pretty much only involves verifying the jar and the signatures.
Yep - that might be the best solution, as it does not change any code.
Whoever volunteers to do so has to check out the same revision to make
sure it is the same code, though I doubt that we had changes to
fast-vector-highlighter in the branch. -- Doh! No change in the 3.0
branch, but there is one in 2.9.

simon
>
> On Dec 5, 2009, at 1:03 PM, Simon Willnauer wrote:
>
>> hi folks,
>> The maven artifacts for fast-vector-highlighter have never been pushed
>> since it was released because there were no pom.xml.template inside
>> the module. I added a pom file a day ago in the context of
>> LUCENE-2107. I already talked to uwe and grant how to deal with this
>> issues and if we should push the artifact for Lucene 2.9 / 3.0. Since
>> this is only a metadata file we could consider rebuilding the
>> artefacts and publish them for those releases. I can not remember that
>> anything like that happened before, so we should discuss how to deal
>> with this situation and if we should wait until 3.1.
>>
>> simon
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>
>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-2100) Make contrib analyzers final

2009-12-05 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer reassigned LUCENE-2100:
---

Assignee: Simon Willnauer

> Make contrib analyzers final
> 
>
> Key: LUCENE-2100
> URL: https://issues.apache.org/jira/browse/LUCENE-2100
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1, 
> 2.9, 2.9.1, 3.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2100.patch
>
>
> The analyzers in contrib/analyzers should all be marked final. None of the 
> Analyzers should ever be subclassed - users should build their own analyzers 
> if a different combination of filters and Tokenizers is desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-2117) Fix SnowballAnalyzer casing behavior for Turkish Language

2009-12-05 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer reassigned LUCENE-2117:
---

Assignee: Simon Willnauer

> Fix SnowballAnalyzer casing behavior for Turkish Language
> -
>
> Key: LUCENE-2117
> URL: https://issues.apache.org/jira/browse/LUCENE-2117
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2117.patch
>
>
> LUCENE-2102 added a new TokenFilter to handle Turkish unique casing behavior 
> correctly. We should fix the casing behavior in SnowballAnalyzer too as it 
> supports a TurkishStemmer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: Push fast-vector-highlighter mvn artifacts for 3.0 and 2.9

2009-12-05 Thread Uwe Schindler
I will regenerate both artifacts and publish in my people.a.o home (2.9.1
and 3.0, but not 2.9.0).

Also 2.9.0? That’s not what you want!

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Simon Willnauer [mailto:simon.willna...@googlemail.com]
> Sent: Saturday, December 05, 2009 10:34 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Push fast-vector-highlighter mvn artifacts for 3.0 and 2.9
> 
> On Sat, Dec 5, 2009 at 10:25 PM, Grant Ingersoll 
> wrote:
> > I suppose we could put up the artifacts on a dev site and then we could
> vote to release both of them pretty quickly.  I think that should be easy
> to do, since it pretty much only involves verifying the jar and the
> signatures.
> Yep - that might be the best solution as it does not change code
> though. Whoever volunteers to do so has to checkout the same revision
> to make sure it is the same code while I doubt that we had changes in
> fast-vector-highlighter in the branch. -- Doh! no change in 3.0 branch
> but in 2.9.
> 
> simon
> >
> > On Dec 5, 2009, at 1:03 PM, Simon Willnauer wrote:
> >
> >> hi folks,
> >> The maven artifacts for fast-vector-highlighter have never been pushed
> >> since it was released because there were no pom.xml.template inside
> >> the module. I added a pom file a day ago in the context of
> >> LUCENE-2107. I already talked to uwe and grant how to deal with this
> >> issues and if we should push the artifact for Lucene 2.9 / 3.0. Since
> >> this is only a metadata file we could consider rebuilding the
> >> artefacts and publish them for those releases. I can not remember that
> >> anything like that happened before, so we should discuss how to deal
> >> with this situation and if we should wait until 3.1.
> >>
> >> simon
> >>
> >> -
> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >>
> >
> >
> >
> > -
> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >
> >
> 
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2100) Make contrib analyzers final

2009-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786512#action_12786512
 ] 

Robert Muir commented on LUCENE-2100:
-

Hi Simon, this sounds good to me if we clean up contrib first. There are not 
many analyzers in core anyway (is it just StandardAnalyzer that is not final?)

My motivation for those was so we could get rid of the deprecated 
setOverridesTokenStreamMethod method.


> Make contrib analyzers final
> 
>
> Key: LUCENE-2100
> URL: https://issues.apache.org/jira/browse/LUCENE-2100
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1, 
> 2.9, 2.9.1, 3.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2100.patch
>
>
> The analyzers in contrib/analyzers should all be marked final. None of the 
> Analyzers should ever be subclassed - users should build their own analyzers 
> if a different combination of filters and Tokenizers is desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2100) Make contrib analyzers final

2009-12-05 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786514#action_12786514
 ] 

Simon Willnauer commented on LUCENE-2100:
-

bq. There are not many analyzers in core anyway (is it just StandardAnalyzer 
that is not final?) 
Three of them:
 * StandardAnalyzer
 * KeywordAnalyzer
 * PerFieldAnalyzerWrapper

bq. My motivation for those was so we could get rid of the deprecated 
setOverridesTokenStreamMethod method.
+1 - this makes me mad each time I look at those analyzers

> Make contrib analyzers final
> 
>
> Key: LUCENE-2100
> URL: https://issues.apache.org/jira/browse/LUCENE-2100
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1, 
> 2.9, 2.9.1, 3.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2100.patch
>
>
> The analyzers in contrib/analyzers should all be marked final. None of the 
> Analyzers should ever be subclassed - users should build their own analyzers 
> if a different combination of filters and Tokenizers is desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2117) Fix SnowballAnalyzer casing behavior for Turkish Language

2009-12-05 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786515#action_12786515
 ] 

Simon Willnauer commented on LUCENE-2117:
-

Robert, the patch looks almost good. You should also change the 
pom.xml.template to reflect the new dependency. I'm still thinking about moving 
snowball into analyzers, as analyzers/snowball - would that make sense?

Somewhat unrelated, but still ugly:
{code}
Class stemClass = Class.forName("org.tartarus.snowball.ext." + name + "Stemmer");
{code}
When I look through the patch I see this "name" parameter, which is used to 
load a stemmer via reflection. We should really define a factory interface that 
creates the stemmer and get rid of the reflection code.
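
Roughly, such a factory could look something like this (a sketch only - the 
interface and implementation names are made up for illustration and do not 
exist in Lucene today):

{code}
// Rough sketch of the suggested factory - the interface and implementation
// names are made up for illustration; nothing like this exists in Lucene yet.
import org.tartarus.snowball.SnowballProgram;
import org.tartarus.snowball.ext.EnglishStemmer;

public interface SnowballStemmerFactory {
  SnowballProgram newStemmer();
}

final class EnglishStemmerFactory implements SnowballStemmerFactory {
  public SnowballProgram newStemmer() {
    return new EnglishStemmer(); // no Class.forName, no reflection
  }
}
{code}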

> Fix SnowballAnalyzer casing behavior for Turkish Language
> -
>
> Key: LUCENE-2117
> URL: https://issues.apache.org/jira/browse/LUCENE-2117
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2117.patch
>
>
> LUCENE-2102 added a new TokenFilter to handle Turkish unique casing behavior 
> correctly. We should fix the casing behavior in SnowballAnalyzer too as it 
> supports a TurkishStemmer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2117) Fix SnowballAnalyzer casing behavior for Turkish Language

2009-12-05 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2117:


Attachment: LUCENE-2117.patch

this patch includes update to pom.xml.template

> Fix SnowballAnalyzer casing behavior for Turkish Language
> -
>
> Key: LUCENE-2117
> URL: https://issues.apache.org/jira/browse/LUCENE-2117
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2117.patch, LUCENE-2117.patch
>
>
> LUCENE-2102 added a new TokenFilter to handle Turkish unique casing behavior 
> correctly. We should fix the casing behavior in SnowballAnalyzer too as it 
> supports a TurkishStemmer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Ghazal Gharooni
Hello,

I am new to the community and I'm completely confused. Could anybody help me
understand which part of the code you are working on? How should I participate?
Thank you!




On Sat, Dec 5, 2009 at 1:02 PM, Uwe Schindler (JIRA) wrote:

>
> [
> https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Uwe Schindler updated LUCENE-1606:
> --
>
> Attachment: (was: LUCENE-1606-flex.patch)
>
> > Automaton Query/Filter (scalable regex)
> > ---
> >
> > Key: LUCENE-1606
> > URL: https://issues.apache.org/jira/browse/LUCENE-1606
> > Project: Lucene - Java
> >  Issue Type: New Feature
> >  Components: Search
> >Reporter: Robert Muir
> >Assignee: Robert Muir
> >Priority: Minor
> > Fix For: 3.1
> >
> > Attachments: automaton.patch, automatonMultiQuery.patch,
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch,
> automatonWithWildCard.patch, automatonWithWildCard2.patch,
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606_nodep.patch
> >
> >
> > Attached is a patch for an AutomatonQuery/Filter (name can change if its
> not suitable).
> > Whereas the out-of-box contrib RegexQuery is nice, I have some very large
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc.
> Additionally all of the existing RegexQuery implementations in Lucene are
> really slow if there is no constant prefix. This implementation does not
> depend upon constant prefix, and runs the same query in 640ms.
> > Some use cases I envision:
> >  1. lexicography/etc on large text corpora
> >  2. looking for things such as urls where the prefix is not constant
> (http:// or ftp://)
> > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to
> convert regular expressions into a DFA. Then, the filter "enumerates" terms
> in a special way, by using the underlying state machine. Here is my short
> description from the comments:
> >  The algorithm here is pretty basic. Enumerate terms but instead of a
> binary accept/reject do:
> >
> >  1. Look at the portion that is OK (did not enter a reject state in
> the DFA)
> >  2. Generate the next possible String and seek to that.
> > the Query simply wraps the filter with ConstantScoreQuery.
> > I did not include the automaton.jar inside the patch but it can be
> downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


[jira] Commented: (LUCENE-2117) Fix SnowballAnalyzer casing behavior for Turkish Language

2009-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786518#action_12786518
 ] 

Robert Muir commented on LUCENE-2117:
-

bq. I'm still thinking about moving snowball into analyzers, as 
analyzers/snowball - would that make sense?

We have to do something about the duplication (LUCENE-2055). There I have 
suggested we upload the snowball stoplists (which are nice) so that we can get 
rid of some hand-coded Java functionality. It is silly to have the exact same 
Russian stemmer in two different places in contrib, etc.

Then we have open issues like LUCENE-559...

> Fix SnowballAnalyzer casing behavior for Turkish Language
> -
>
> Key: LUCENE-2117
> URL: https://issues.apache.org/jira/browse/LUCENE-2117
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2117.patch, LUCENE-2117.patch
>
>
> LUCENE-2102 added a new TokenFilter to handle Turkish unique casing behavior 
> correctly. We should fix the casing behavior in SnowballAnalyzer too as it 
> supports a TurkishStemmer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Simon Willnauer
On Sat, Dec 5, 2009 at 10:58 PM, Ghazal Gharooni
 wrote:
> Hello,
>
> I am new in the community and I've completely been confused. Please anybody
> help me out to know which part of codes you are working with. How should I
> participate in work? Thank you!

Hi Ghazal,
what exact information do you need? Are you asking for info on this
particular issue?

simon
>
>
>
>
> On Sat, Dec 5, 2009 at 1:02 PM, Uwe Schindler (JIRA) 
> wrote:
>>
>>     [
>> https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>> ]
>>
>> Uwe Schindler updated LUCENE-1606:
>> --
>>
>>    Attachment:     (was: LUCENE-1606-flex.patch)
>>
>> > Automaton Query/Filter (scalable regex)
>> > ---
>> >
>> >                 Key: LUCENE-1606
>> >                 URL: https://issues.apache.org/jira/browse/LUCENE-1606
>> >             Project: Lucene - Java
>> >          Issue Type: New Feature
>> >          Components: Search
>> >            Reporter: Robert Muir
>> >            Assignee: Robert Muir
>> >            Priority: Minor
>> >             Fix For: 3.1
>> >
>> >         Attachments: automaton.patch, automatonMultiQuery.patch,
>> > automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch,
>> > automatonWithWildCard.patch, automatonWithWildCard2.patch,
>> > BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
>> > LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
>> > LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>> > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>> > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>> > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>> > LUCENE-1606_nodep.patch
>> >
>> >
>> > Attached is a patch for an AutomatonQuery/Filter (name can change if its
>> > not suitable).
>> > Whereas the out-of-box contrib RegexQuery is nice, I have some very
>> > large indexes (100M+ unique tokens) where queries are quite slow, 2 
>> > minutes,
>> > etc. Additionally all of the existing RegexQuery implementations in Lucene
>> > are really slow if there is no constant prefix. This implementation does 
>> > not
>> > depend upon constant prefix, and runs the same query in 640ms.
>> > Some use cases I envision:
>> >  1. lexicography/etc on large text corpora
>> >  2. looking for things such as urls where the prefix is not constant
>> > (http:// or ftp://)
>> > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to
>> > convert regular expressions into a DFA. Then, the filter "enumerates" terms
>> > in a special way, by using the underlying state machine. Here is my short
>> > description from the comments:
>> >      The algorithm here is pretty basic. Enumerate terms but instead of
>> > a binary accept/reject do:
>> >
>> >      1. Look at the portion that is OK (did not enter a reject state in
>> > the DFA)
>> >      2. Generate the next possible String and seek to that.
>> > the Query simply wraps the filter with ConstantScoreQuery.
>> > I did not include the automaton.jar inside the patch but it can be
>> > downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Mark Miller
Could you be more specific? :)

This patch is part of an issue to add an AutomatonQuery class to Lucene
that allows for a fast RegexpQuery and replaces our WildcardQuery impl.
It's being developed in two flavors - one for the current trunk version
of Lucene, and a slightly altered version for our "flexible indexing"
branch, which is a branch where another large issue is being developed;
eventually it will be merged back into trunk.

This might not be an issue where you want to get your feet wet ;) But if
you could be more explicit about what you want to know, we might be able
to be of more help - that's a pretty broad question. To take a stab
anyway, the short of it is: find an issue you find compelling and jump
in! :)

Ghazal Gharooni wrote:
> Hello,
>
> I am new in the community and I've completely been confused. Please
> anybody help me out to know which part of codes you are working with.
> How should I participate in work? Thank you!
>
>
>
>
> On Sat, Dec 5, 2009 at 1:02 PM, Uwe Schindler (JIRA)  > wrote:
>
>
> [
> 
> https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Uwe Schindler updated LUCENE-1606:
> --
>
>Attachment: (was: LUCENE-1606-flex.patch)
>
> > Automaton Query/Filter (scalable regex)
> > ---
> >
> > Key: LUCENE-1606
> > URL:
> https://issues.apache.org/jira/browse/LUCENE-1606
> > Project: Lucene - Java
> >  Issue Type: New Feature
> >  Components: Search
> >Reporter: Robert Muir
> >Assignee: Robert Muir
> >Priority: Minor
> > Fix For: 3.1
> >
> > Attachments: automaton.patch, automatonMultiQuery.patch,
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch,
> automatonWithWildCard.patch, automatonWithWildCard2.patch,
> BenchWildcard.java, LUCENE-1606-flex.patch,
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
> >
> >
> > Attached is a patch for an AutomatonQuery/Filter (name can
> change if its not suitable).
> > Whereas the out-of-box contrib RegexQuery is nice, I have some
> very large indexes (100M+ unique tokens) where queries are quite
> slow, 2 minutes, etc. Additionally all of the existing RegexQuery
> implementations in Lucene are really slow if there is no constant
> prefix. This implementation does not depend upon constant prefix,
> and runs the same query in 640ms.
> > Some use cases I envision:
> >  1. lexicography/etc on large text corpora
> >  2. looking for things such as urls where the prefix is not
> constant (http:// or ftp://)
> > The Filter uses the BRICS package
> (http://www.brics.dk/automaton/) to convert regular expressions
> into a DFA. Then, the filter "enumerates" terms in a special way,
> by using the underlying state machine. Here is my short
> description from the comments:
> >  The algorithm here is pretty basic. Enumerate terms but
> instead of a binary accept/reject do:
> >
> >  1. Look at the portion that is OK (did not enter a reject
> state in the DFA)
> >  2. Generate the next possible String and seek to that.
> > the Query simply wraps the filter with ConstantScoreQuery.
> > I did not include the automaton.jar inside the patch but it can
> be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> 
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> 
>
>


-- 
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Robert Muir
Hi Ghazal,

I am sorry this one is a bit confusing. I think it is because a lot of
people are working on it (which is great) and a lot of ideas are going back and
forth, causing lots of files to be uploaded, etc.

Can you tell us more about your interest in working with NFAs/DFAs in Lucene?
I am very curious to hear about any use cases you might have, or why you are
interested!

In general, for contributing to lucene this link is helpful:
http://wiki.apache.org/lucene-java/HowToContribute

It tells you how the patch submission process works, how to get the latest
code from subversion, etc.

On Sat, Dec 5, 2009 at 4:58 PM, Ghazal Gharooni
wrote:

> Hello,
>
> I am new in the community and I've completely been confused. Please anybody
> help me out to know which part of codes you are working with. How should I
> participate in work? Thank you!
>
>
>
>
>
> On Sat, Dec 5, 2009 at 1:02 PM, Uwe Schindler (JIRA) wrote:
>
>>
>> [
>> https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>>
>> Uwe Schindler updated LUCENE-1606:
>> --
>>
>> Attachment: (was: LUCENE-1606-flex.patch)
>>
>> > Automaton Query/Filter (scalable regex)
>> > ---
>> >
>> > Key: LUCENE-1606
>> > URL: https://issues.apache.org/jira/browse/LUCENE-1606
>> > Project: Lucene - Java
>> >  Issue Type: New Feature
>> >  Components: Search
>> >Reporter: Robert Muir
>> >Assignee: Robert Muir
>> >Priority: Minor
>> > Fix For: 3.1
>> >
>> > Attachments: automaton.patch, automatonMultiQuery.patch,
>> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch,
>> automatonWithWildCard.patch, automatonWithWildCard2.patch,
>> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
>> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
>> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>> LUCENE-1606_nodep.patch
>> >
>> >
>> > Attached is a patch for an AutomatonQuery/Filter (name can change if its
>> not suitable).
>> > Whereas the out-of-box contrib RegexQuery is nice, I have some very
>> large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes,
>> etc. Additionally all of the existing RegexQuery implementations in Lucene
>> are really slow if there is no constant prefix. This implementation does not
>> depend upon constant prefix, and runs the same query in 640ms.
>> > Some use cases I envision:
>> >  1. lexicography/etc on large text corpora
>> >  2. looking for things such as urls where the prefix is not constant
>> (http:// or ftp://)
>> > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to
>> convert regular expressions into a DFA. Then, the filter "enumerates" terms
>> in a special way, by using the underlying state machine. Here is my short
>> description from the comments:
>> >  The algorithm here is pretty basic. Enumerate terms but instead of
>> a binary accept/reject do:
>> >
>> >  1. Look at the portion that is OK (did not enter a reject state in
>> the DFA)
>> >  2. Generate the next possible String and seek to that.
>> > the Query simply wraps the filter with ConstantScoreQuery.
>> > I did not include the automaton.jar inside the patch but it can be
>> downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>>
>


-- 
Robert Muir
rcm...@gmail.com


Lots of results

2009-12-05 Thread Grant Ingersoll
At ScaleCamp yesterday in the UK, I was listening to a talk on Xapian and the 
speaker said one of the optimizations they do when retrieving a large result 
set is that instead of managing a Priority Queue, they just allocate a large 
array to hold all of the results and then sort afterward.   Seemed like a good 
idea since you could avoid a whole slew of PQ operations at the cost of 
(possibly) some extra memory, so I thought I would see if anyone has looked at 
doing something similar here.  (Xapian is C++, I believe, so it likely has 
different perf. characteristics which may or may not matter).  

A few things come to mind:
1. In many cases, when someone is asking for a lot of results, my guess is they 
want all the results.  You often see this in people who are doing significant 
post processing on large result sets (often for machine learning)
2. My gut says there is actually some heuristic to be applied here, whereby if 
the requested number of results is some percentage of the total num. of hits, 
then do the whole results + sort, otherwise, use PQ.  Thus, this maybe would be 
useful even when doing something like, say, 500 or a 1000 docs, as opposed to 
the bigger cases I'm thinking about.  Still, don't need to prematurely optimize.
3.  I think we could almost implement this entirely w/ a Collector, except for 
the sort step, which presumably could be a callback (empty implementation in 
the base class).
4. When doing this, you don't bother to truncate the list when n < totalHits 
(although see below).

Perhaps, we could also even save having a big array for doc ids and instead 
just flip bits in a bit set (would still need the array for scores) and then 
materialize the bit set to an array at the end when the num requested is less 
than the total number of hits.

Anyone have thoughts on this?  Seems fairly trivial to crank out a Collector to 
do it, minus the post processing step, which would be relatively trivial to add.

-Grant
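
(For concreteness, a minimal sketch of the "collect everything into a flat list,
sort once afterward" Collector described above. This is not code from the thread
or from any patch; the class name and the score-only sort are illustrative
assumptions against the Lucene 2.9/3.0 Collector API, and the heuristic and
bitset variants are left out.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Scorer;

// Hypothetical name; gathers every hit with no priority queue at all.
public class CollectAllCollector extends Collector {
  private final List<ScoreDoc> hits = new ArrayList<ScoreDoc>();
  private Scorer scorer;
  private int docBase;

  public void setScorer(Scorer scorer) {
    this.scorer = scorer;
  }

  public void setNextReader(IndexReader reader, int docBase) {
    this.docBase = docBase;
  }

  public void collect(int doc) throws IOException {
    // Plain append: no per-hit heap maintenance, just memory.
    hits.add(new ScoreDoc(docBase + doc, scorer.score()));
  }

  public boolean acceptsDocsOutOfOrder() {
    return true;
  }

  // The post-collection step: one sort over the whole list, descending by score.
  public List<ScoreDoc> sortedHits() {
    Collections.sort(hits, new Comparator<ScoreDoc>() {
      public int compare(ScoreDoc a, ScoreDoc b) {
        return Float.compare(b.score, a.score);
      }
    });
    return hits;
  }
}

IndexSearcher.search(query, collector) would drive it, and sortedHits() then
stands in for what a PQ-based collector would have produced; presumably the
heuristic in point 2 would decide which of the two paths to take.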
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-05 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2108:


Attachment: LUCENE-2108_test_java14.patch

Mike, I changed the testcase to be Java 1.4 compatible; this might help you 
merge the spellchecker fix into 2.9.1, since I can not commit to the branches.
It does not make sense to create a patch against the branch, as you really want 
the mergeinfo rather than patching things into branches by hand. 

simon

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Simon Willnauer
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch, LUCENE-2108.patch, 
> LUCENE-2108.patch, LUCENE-2108.patch, LUCENE-2108.patch, 
> LUCENE-2108_test_java14.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.
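
(For illustration only, a minimal sketch of the close()/reopen-on-demand pattern
the description asks for. This is not the attached patch; the standalone wrapper
class and its field names are assumptions rather than the real SpellChecker
internals.)

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

class ClosableSpellIndex {
  private final Directory spellIndex;
  private IndexSearcher searcher;  // null while closed

  ClosableSpellIndex(Directory spellIndex) {
    this.spellIndex = spellIndex;
  }

  synchronized IndexSearcher searcher() throws IOException {
    if (searcher == null) {
      // Reopen lazily, read-only, when the index is needed again.
      searcher = new IndexSearcher(spellIndex, true);
    }
    return searcher;
  }

  synchronized void close() throws IOException {
    if (searcher != null) {
      searcher.close();  // releases the underlying IndexReader and file descriptors
      searcher = null;
    }
  }
}

Whether SpellChecker should reopen internally like this or leave it to the
caller via setSpellIndex is exactly the open question in the description.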

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: Push fast-vector-highlighter mvn artifacts for 3.0 and 2.9

2009-12-05 Thread Uwe Schindler
I rebuilt the maven-dir for 2.9.1 and 3.0.0, merged them (3.0.0 is top-level
version) and extracted only fast-vector-highlighter:

http://people.apache.org/~uschindler/staging-area/

I will copy this dir to the maven folder on people.a.o once I get votes
(how many do we need?). At least someone should check the signatures.

By the way, we have a small error in our ant build.xml where it inserts the
svnversion into the manifest file. The version recorded is not the version of
the last changed item (which would be svnversion -c) but the current
working-copy svn version, even though I checked out the corresponding tags.
It's no problem at all, but not very nice.

Maybe we should change build.xml to call "svnversion -c" in future, to get
the real number.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Grant Ingersoll [mailto:gsing...@apache.org]
> Sent: Saturday, December 05, 2009 10:26 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Push fast-vector-highlighter mvn artifacts for 3.0 and 2.9
> 
> I suppose we could put up the artifacts on a dev site and then we could
> vote to release both of them pretty quickly.  I think that should be easy
> to do, since it pretty much only involves verifying the jar and the
> signatures.
> 
> On Dec 5, 2009, at 1:03 PM, Simon Willnauer wrote:
> 
> > hi folks,
> > The maven artifacts for fast-vector-highlighter have never been pushed
> > since it was released, because there was no pom.xml.template inside
> > the module. I added a pom file a day ago in the context of
> > LUCENE-2107. I already talked to Uwe and Grant about how to deal with this
> > issue and whether we should push the artifact for Lucene 2.9 / 3.0. Since
> > this is only a metadata file, we could consider rebuilding the
> > artifacts and publishing them for those releases. I cannot remember
> > anything like that happening before, so we should discuss how to deal
> > with this situation and whether we should wait until 3.1.
> >
> > simon
> >
> > -
> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >
> 
> 
> 
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Lots of results

2009-12-05 Thread Earwin Burrfoot
If someone needs all results, they know it beforehand. Why can't they
write this collector themselves? It's trivial, just like you said.

On Sun, Dec 6, 2009 at 01:22, Grant Ingersoll  wrote:
> At ScaleCamp yesterday in the UK, I was listening to a talk on Xapian and the 
> speaker said one of the optimizations they do when retrieving a large result 
> set is that instead of managing a Priority Queue, they just allocate a large 
> array to hold all of the results and then sort afterward.   Seemed like a 
> good idea since you could avoid a whole slew of PQ operations at the cost of 
> (possibly) some extra memory, so I thought I would see if anyone has looked 
> at doing something similar here.  (Xapian is C++, I believe, so it likely has 
> different perf. characteristics which may or may not matter).
>
> A few things come to mind:
> 1. In many cases, when someone is asking for a lot of results, my guess is 
> they want all the results.  You often see this in people who are doing 
> significant post processing on large result sets (often for machine learning)
> 2. My gut says there is actually some heuristic to be applied here, whereby 
> if the requested number of results is some percentage of the total num. of 
> hits, then do the whole results + sort, otherwise, use PQ.  Thus, this maybe 
> would be useful even when doing something like, say, 500 or a 1000 docs, as 
> opposed to the bigger cases I'm thinking about.  Still, don't need to 
> prematurely optimize.
> 3.  I think we could almost implement this entirely w/ a Collector, except 
> for the sort step, which presumably could be a callback (empty implementation 
> in the base class).
> 4. When doing this, you don't bother to truncate the list when n < totalHits 
> (although see below).
>
> Perhaps, we could also even save having a big array for doc ids and instead 
> just flip bits in a bit set (would still need the array for scores) and then 
> materialize the bit set to an array at the end when the num requested is less 
> than the total number of hits.
>
> Anyone have thoughts on this?  Seems fairly trivial to crank out a Collector 
> to do it, minus the post processing step, which would be relatively trivial 
> to add.
>
> -Grant
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786528#action_12786528
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. I'm not sure at the moment - but its wikipedia dumps, so I'd guess its 
rather high actually.

I looked at the wikipedia dump in benchmark (when indexed with 
standardanalyzer), body only has 65k terms... I think thats pretty small :)
I do not think automaton will help much with such a small number of terms; it's 
definitely a worst-case benchmark you are performing.
I think very little time is probably spent here in term enumeration so 
scalability does not matter for that corpus.

More interesting to see the benefits would be something like indexing geonames 
data (lots of terms), or even that (much smaller) Persian corpus I mentioned 
with nearly 500k terms... 


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.
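
(To make the description above concrete, here is a minimal sketch of just the
baseline it starts from: compile a regex to a DFA with the BRICS library and do
a plain accept/reject scan over one field's term dictionary, using the Lucene
3.0 TermEnum API. The patch's actual contribution - computing the "next possible
string" and seeking past runs of rejected terms - is deliberately omitted, and
the class and method names below are made up.)

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

import dk.brics.automaton.RegExp;
import dk.brics.automaton.RunAutomaton;

public class BruteForceRegexScan {

  /** Counts the terms of one field that the regex accepts, by linear scan. */
  public static int countMatches(IndexReader reader, String field, String regex)
      throws IOException {
    // Compile the regular expression into a DFA once, up front.
    RunAutomaton dfa = new RunAutomaton(new RegExp(regex).toAutomaton());
    int matches = 0;
    TermEnum terms = reader.terms(new Term(field, ""));
    try {
      do {
        Term t = terms.term();
        if (t == null || !t.field().equals(field)) {
          break;                      // ran past the last term of this field
        }
        if (dfa.run(t.text())) {      // plain binary accept/reject
          matches++;
        }
        // AutomatonQuery's enum would instead look at the longest accepted
        // prefix here and seek to the "next possible string", skipping terms.
      } while (terms.next());
    } finally {
      terms.close();
    }
    return matches;
  }
}

As the comments above note, on a small term dictionary this brute force is
already cheap; the seek-ahead only starts to pay off as the number of unique
terms grows.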

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Lots of results

2009-12-05 Thread Grant Ingersoll

On Dec 5, 2009, at 10:47 PM, Earwin Burrfoot wrote:

> If someone needs all results, they know it beforehand. Why can't they
> write this collector themselves? It's trivial, just like you said.

I'm not following your comment.  Of course they can write it.  But that's true 
for all the implementations we provide.

However, the Collector stuff doesn't handle the post collection sort, so, it 
would require someone to hack some low level Lucene internals.  Also, I think 
it's interesting to think about the case of getting most of the results, but 
maybe not all.

> 
> On Sun, Dec 6, 2009 at 01:22, Grant Ingersoll  wrote:
>> At ScaleCamp yesterday in the UK, I was listening to a talk on Xapian and 
>> the speaker said one of the optimizations they do when retrieving a large 
>> result set is that instead of managing a Priority Queue, they just allocate 
>> a large array to hold all of the results and then sort afterward.   Seemed 
>> like a good idea since you could avoid a whole slew of PQ operations at the 
>> cost of (possibly) some extra memory, so I thought I would see if anyone has 
>> looked at doing something similar here.  (Xapian is C++, I believe, so it 
>> likely has different perf. characteristics which may or may not matter).
>> 
>> A few things come to mind:
>> 1. In many cases, when someone is asking for a lot of results, my guess is 
>> they want all the results.  You often see this in people who are doing 
>> significant post processing on large result sets (often for machine learning)
>> 2. My gut says there is actually some heuristic to be applied here, whereby 
>> if the requested number of results is some percentage of the total num. of 
>> hits, then do the whole results + sort, otherwise, use PQ.  Thus, this maybe 
>> would be useful even when doing something like, say, 500 or a 1000 docs, as 
>> opposed to the bigger cases I'm thinking about.  Still, don't need to 
>> prematurely optimize.
>> 3.  I think we could almost implement this entirely w/ a Collector, except 
>> for the sort step, which presumably could be a callback (empty 
>> implementation in the base class).
>> 4. When doing this, you don't bother to truncate the list when n < totalHits 
>> (although see below).
>> 
>> Perhaps, we could also even save having a big array for doc ids and instead 
>> just flip bits in a bit set (would still need the array for scores) and then 
>> materialize the bit set to an array at the end when the num requested is 
>> less than the total number of hits.
>> 
>> Anyone have thoughts on this?  Seems fairly trivial to crank out a Collector 
>> to do it, minus the post processing step, which would be relatively trivial 
>> to add.
>> 
>> -Grant
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>> 
>> 
> 
> 
> 
> -- 
> Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> ICQ: 104465785
> 
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using 
Solr/Lucene:
http://www.lucidimagination.com/search


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Ghazal Gharooni
Hello,
Thank you all for your explanations. Actually, this is my first experience
in an open source community. I downloaded the source code (lucene-3.0.0.zip)
and would like to work on part of the code in order to learn new skills from
the group and make a positive contribution. To be honest, I really don't know
which part I should start my work from. Please let me know the exact location
of the source code you are discussing (folder, file), and then I will join
you :)



On Sat, Dec 5, 2009 at 2:10 PM, Robert Muir  wrote:

> Hi Ghazal,
>
> I am sorry this one is a bit confusing. I think it is because a lot of
> people are working on it (which is great) and a lot of ideas going back and
> forth, causing lots of files to be uploaded, etc.
>
> Can you tell us more about your interest in working with NFA/DFA in Lucene?
>
> I am very curious to hear any use cases you might have, or why you are
> interested!
>
> In general, for contributing to lucene this link is helpful:
> http://wiki.apache.org/lucene-java/HowToContribute
>
> It tells you how the patch submission process works, how to get the latest
> code from subversion, etc.
>
>
> On Sat, Dec 5, 2009 at 4:58 PM, Ghazal Gharooni  > wrote:
>
>> Hello,
>>
>> I am new to the community and I'm completely confused. Please can
>> anybody help me out to know which part of the code you are working on? How
>> should I participate in the work? Thank you!
>>
>>
>>
>>
>>
>> On Sat, Dec 5, 2009 at 1:02 PM, Uwe Schindler (JIRA) wrote:
>>
>>>
>>> [
>>> https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>>>
>>> Uwe Schindler updated LUCENE-1606:
>>> --
>>>
>>> Attachment: (was: LUCENE-1606-flex.patch)
>>>
>>> > Automaton Query/Filter (scalable regex)
>>> > ---
>>> >
>>> > Key: LUCENE-1606
>>> > URL: https://issues.apache.org/jira/browse/LUCENE-1606
>>> > Project: Lucene - Java
>>> >  Issue Type: New Feature
>>> >  Components: Search
>>> >Reporter: Robert Muir
>>> >Assignee: Robert Muir
>>> >Priority: Minor
>>> > Fix For: 3.1
>>> >
>>> > Attachments: automaton.patch, automatonMultiQuery.patch,
>>> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch,
>>> automatonWithWildCard.patch, automatonWithWildCard2.patch,
>>> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
>>> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
>>> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>>> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>>> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>>> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>>> LUCENE-1606_nodep.patch
>>> >
>>> >
>>> > Attached is a patch for an AutomatonQuery/Filter (name can change if
>>> its not suitable).
>>> > Whereas the out-of-box contrib RegexQuery is nice, I have some very
>>> large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes,
>>> etc. Additionally all of the existing RegexQuery implementations in Lucene
>>> are really slow if there is no constant prefix. This implementation does not
>>> depend upon constant prefix, and runs the same query in 640ms.
>>> > Some use cases I envision:
>>> >  1. lexicography/etc on large text corpora
>>> >  2. looking for things such as urls where the prefix is not constant
>>> (http:// or ftp://)
>>> > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to
>>> convert regular expressions into a DFA. Then, the filter "enumerates" terms
>>> in a special way, by using the underlying state machine. Here is my short
>>> description from the comments:
>>> >  The algorithm here is pretty basic. Enumerate terms but instead of
>>> a binary accept/reject do:
>>> >
>>> >  1. Look at the portion that is OK (did not enter a reject state in
>>> the DFA)
>>> >  2. Generate the next possible String and seek to that.
>>> > the Query simply wraps the filter with ConstantScoreQuery.
>>> > I did not include the automaton.jar inside the patch but it can be
>>> downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> -
>>> You can reply to this email to add a comment to the issue online.
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>>
>>>
>>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>


[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786529#action_12786529
 ] 

Mark Miller commented on LUCENE-1606:
-

bq.  I think thats pretty small

Okay, fair enough ;) Guess it depends on your idea of small - though I would 
have guess (wrongly it appears), that it would be more. One diff is that I 
think the bechmark uses a 200mb (zipped) or so dump by default? I'm using a 5 
gig dump - though that prob doesn't too many more in the scheme of things.

bq. More interesting to see the benefits...

Right, but I'm not really testing for benefits - more for correctness and no 
loss of performance. I think the benches you have already done are probably 
plenty good for benefits testing.

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786529#action_12786529
 ] 

Mark Miller edited comment on LUCENE-1606 at 12/5/09 11:06 PM:
---

bq.  I think thats pretty small

Okay, fair enough ;) Guess it depends on your idea of small - though I would 
have guess (wrongly it appears), that it would be more. One diff is that I 
think the bechmark uses a 200mb (zipped) or so dump by default? I'm using a 5 
gig dump - though that prob doesn't too many more in the scheme of things.

bq. More interesting to see the benefits...

Right, but I'm not really testing for benefits - more for correctness and no 
loss of performance (on a fairly standard corpus). I think the benches you have 
already done are probably plenty good for benefits testing.

  was (Author: markrmil...@gmail.com):
bq.  I think thats pretty small

Okay, fair enough ;) Guess it depends on your idea of small - though I would 
have guess (wrongly it appears), that it would be more. One diff is that I 
think the bechmark uses a 200mb (zipped) or so dump by default? I'm using a 5 
gig dump - though that prob doesn't too many more in the scheme of things.

bq. More interesting to see the benefits...

Right, but I'm not really testing for benefits - more for correctness and no 
loss of performance. I think the benches you have already done are probably 
plenty good for benefits testing.
  
> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786531#action_12786531
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. Right, but I'm not really testing for benefits - more for correctness and 
no loss of performance (on a fairly standard corpus). I think the benches you 
have already done are probably plenty good for benefits testing.

Oh ok, I didn't know. Because my benchmark, as Mike said, is definitely very 
"contrived". 

But it's kind of realistic: there are situations where the number of terms 
compared to the number of docs is much higher (maybe even 1-1 for unique 
product ids and things like that). 

I am glad you did this test, because I was concerned about the "small index" 
case too. And definitely correctness

I think you are right about the partial dump. I am indexing the full dump now 
(at least I think). I will look at it too, at least for curiosity's sake.


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786533#action_12786533
 ] 

Mark Miller commented on LUCENE-1606:
-

bq. And definitely correctness

Right - that's my main motivation - comparing the results of the old 
WildcardQuery with the new one. I actually put the timings in there as an 
afterthought, just because I was curious.

I really just wanted to make sure every random query acts the same with both 
impls and that no random input can somehow screw things up (I'm using 
commons-lang to pump in random Unicode strings, along with turning the dict 
entries into wildcards that are more likely to get many hits).

Didn't expect to find anything, but it will make me feel better about +1ing the 
commit ;)

Also going over the code, but that's going to take more time.

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786529#action_12786529
 ] 

Mark Miller edited comment on LUCENE-1606 at 12/5/09 11:18 PM:
---

bq.  I think thats pretty small

Okay, fair enough ;) Guess it depends on your idea of small - though I would 
have guess (wrongly it appears), that it would be more. One diff is that I 
think the bechmark uses a 200mb (zipped) or so dump by default? I'm using a 5 
gig dump - though that prob doesn't add too many more in the scheme of things.

bq. More interesting to see the benefits...

Right, but I'm not really testing for benefits - more for correctness and no 
loss of performance (on a fairly standard corpus). I think the benches you have 
already done are probably plenty good for benefits testing.

  was (Author: markrmil...@gmail.com):
bq.  I think thats pretty small

Okay, fair enough ;) Guess it depends on your idea of small - though I would 
have guess (wrongly it appears), that it would be more. One diff is that I 
think the bechmark uses a 200mb (zipped) or so dump by default? I'm using a 5 
gig dump - though that prob doesn't too many more in the scheme of things.

bq. More interesting to see the benefits...

Right, but I'm not really testing for benefits - more for correctness and no 
loss of performance (on a fairly standard corpus). I think the benches you have 
already done are probably plenty good for benefits testing.
  
> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Robert Muir
Hi Ghazal, I think if you are looking to help with Lucene in general, the
HowToContribute link is the best place to start:
http://wiki.apache.org/lucene-java/HowToContribute

We are not working with the source code from the zip file, but instead the
latest unreleased code in the subversion repository. There are instructions
on that page on how you can access it.

I agree with Mark this might be a tricky one to attack as your first issue,
perhaps you want to tackle something smaller to get used to the process of
how things work?

Also keep in mind, you can contribute in more ways than actually writing the
code, you can always contribute by providing comments or feedback,
suggesting improvements to the documentation or tests, answering questions
on the user list, etc.

Finally, as I mentioned before, if you are interested in this particular
issue for some reason, I think even telling us more information such as "I
am trying to run regular expressions/wildcard/fuzzy on a large index" or
something like that, would be helpful.

On Sat, Dec 5, 2009 at 6:04 PM, Ghazal Gharooni
wrote:

> Hello,
> Thank you all for your explanations. Actually, this is my first experience
> in an open source community. I downloaded the source code (lucene-3.0.0.zip)
> and would like to work on part of the code in order to learn new skills from
> the group and make a positive contribution. To be honest, I really don't know
> which part I should start my work from. Please let me know the exact location
> of the source code you are discussing (folder, file), and then I will join
> you :)
>
>
>
>
> On Sat, Dec 5, 2009 at 2:10 PM, Robert Muir  wrote:
>
>> Hi Ghazal,
>>
>> I am sorry this one is a bit confusing. I think it is because a lot of
>> people are working on it (which is great) and a lot of ideas going back and
>> forth, causing lots of files to be uploaded, etc.
>>
>> Can you tell us more about your interest in working with NFA/DFA in
>> Lucene?
>> I am very curious to hear any use cases you might have, or why you are
>> interested!
>>
>> In general, for contributing to lucene this link is helpful:
>> http://wiki.apache.org/lucene-java/HowToContribute
>>
>> It tells you how the patch submission process works, how to get the latest
>> code from subversion, etc.
>>
>>
>> On Sat, Dec 5, 2009 at 4:58 PM, Ghazal Gharooni <
>> ghazal.gharo...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I am new to the community and I'm completely confused. Please can
>>> anybody help me out to know which part of the code you are working on? How
>>> should I participate in the work? Thank you!
>>>
>>>
>>>
>>>
>>>
>>> On Sat, Dec 5, 2009 at 1:02 PM, Uwe Schindler (JIRA) wrote:
>>>

 [
 https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]

 Uwe Schindler updated LUCENE-1606:
 --

 Attachment: (was: LUCENE-1606-flex.patch)

 > Automaton Query/Filter (scalable regex)
 > ---
 >
 > Key: LUCENE-1606
 > URL:
 https://issues.apache.org/jira/browse/LUCENE-1606
 > Project: Lucene - Java
 >  Issue Type: New Feature
 >  Components: Search
 >Reporter: Robert Muir
 >Assignee: Robert Muir
 >Priority: Minor
 > Fix For: 3.1
 >
 > Attachments: automaton.patch, automatonMultiQuery.patch,
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch,
 automatonWithWildCard.patch, automatonWithWildCard2.patch,
 BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
 LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
 LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch,
 LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
 LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
 LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
 LUCENE-1606_nodep.patch
 >
 >
 > Attached is a patch for an AutomatonQuery/Filter (name can change if
 its not suitable).
 > Whereas the out-of-box contrib RegexQuery is nice, I have some very
 large indexes (100M+ unique tokens) where queries are quite slow, 2 
 minutes,
 etc. Additionally all of the existing RegexQuery implementations in Lucene
 are really slow if there is no constant prefix. This implementation does 
 not
 depend upon constant prefix, and runs the same query in 640ms.
 > Some use cases I envision:
 >  1. lexicography/etc on large text corpora
 >  2. looking for things such as urls where the prefix is not constant
 (http:// or ftp://)
 > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to
 convert regular expressions into a DFA. Then, the filter "enumerates" terms
 in a spe

[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786536#action_12786536
 ] 

Robert Muir commented on LUCENE-1606:
-

Mark, oh ok - well, thanks for spending so much time here testing and reviewing.

bq. I really just wanted to make sure every random query acts the same with 
both impls and that no random input can somehow screw things up (I'm using 
commons-lang to pump in random Unicode strings, along with turning the dict 
entries into wildcards that are more likely to get many hits).

Yeah I tried to do some of this in a very quick way if you look at the tests... 
I generate some random wildcard/regexp queries (mainly to prevent bugs from 
being introduced).

The unicode tests (TestAutomatonUnicode) took me quite some time, they are 
definitely contrived but I think cover the bases for any unicode problems.
One problem is that none of this unicode stuff is ever a problem on trunk!

If you save this test setup, maybe in the future I can trick you into running 
your tests on flex, where the unicode handling matters (as TermRef must be 
valid UTF-8 there).


> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Lots of results

2009-12-05 Thread Paul Elschot
Could one get the best of both worlds by not heapifying the PQ
until it is full?

Regards,
Paul Elschot
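
(A rough sketch of that idea, purely illustrative: buffer the first N hits in a
plain list, build the heap once when the buffer is full, and only pay per-hit
heap costs for competitive hits after that. Class and method names are invented,
only scores are tracked, and java.util.PriorityQueue stands in for Lucene's.)

import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class LazyHeapTopN {
  private final int size;
  private final List<Float> buffer = new ArrayList<Float>();
  private PriorityQueue<Float> heap;  // min-heap over the current top N, built lazily

  public LazyHeapTopN(int size) {
    this.size = size;
  }

  public void insert(float score) {
    if (heap == null) {
      buffer.add(score);  // cheap append while the queue is still filling up
      if (buffer.size() == size) {
        // One bulk heapify instead of N individual sift-ups.
        heap = new PriorityQueue<Float>(buffer);
      }
    } else if (score > heap.peek()) {
      heap.poll();        // only competitive hits touch the heap afterwards
      heap.offer(score);
    }
  }
}

Lucene's own org.apache.lucene.util.PriorityQueue would be the natural place
for such a change; this snippet only illustrates the shape of the idea.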

Op zondag 06 december 2009 00:01:49 schreef Grant Ingersoll:
> 
> On Dec 5, 2009, at 10:47 PM, Earwin Burrfoot wrote:
> 
> > If someone needs all results, they know it beforehand. Why can't they
> > write this collector themselves? It's trivial, just like you said.
> 
> I'm not following your comment.  Of course they can write it.  But that's 
> true for all the implementations we provide.
> 
> However, the Collector stuff doesn't handle the post collection sort, so, it 
> would require someone to hack some low level Lucene internals.  Also, I think 
> it's interesting to think about the case of getting most of the results, but 
> maybe not all.
> 
> > 
> > On Sun, Dec 6, 2009 at 01:22, Grant Ingersoll  wrote:
> >> At ScaleCamp yesterday in the UK, I was listening to a talk on Xapian and 
> >> the speaker said one of the optimizations they do when retrieving a large 
> >> result set is that instead of managing a Priority Queue, they just 
> >> allocate a large array to hold all of the results and then sort afterward. 
> >>   Seemed like a good idea since you could avoid a whole slew of PQ 
> >> operations at the cost of (possibly) some extra memory, so I thought I 
> >> would see if anyone has looked at doing something similar here.  (Xapian 
> >> is C++, I believe, so it likely has different perf. characteristics which 
> >> may or may not matter).
> >> 
> >> A few things come to mind:
> >> 1. In many cases, when someone is asking for a lot of results, my guess is 
> >> they want all the results.  You often see this in people who are doing 
> >> significant post processing on large result sets (often for machine 
> >> learning)
> >> 2. My gut says there is actually some heuristic to be applied here, 
> >> whereby if the requested number of results is some percentage of the total 
> >> num. of hits, then do the whole results + sort, otherwise, use PQ.  Thus, 
> >> this maybe would be useful even when doing something like, say, 500 or a 
> >> 1000 docs, as opposed to the bigger cases I'm thinking about.  Still, 
> >> don't need to prematurely optimize.
> >> 3.  I think we could almost implement this entirely w/ a Collector, except 
> >> for the sort step, which presumably could be a callback (empty 
> >> implementation in the base class).
> >> 4. When doing this, you don't bother to truncate the list when n < 
> >> totalHits (although see below).
> >> 
> >> Perhaps, we could also even save having a big array for doc ids and 
> >> instead just flip bits in a bit set (would still need the array for 
> >> scores) and then materialize the bit set to an array at the end when the 
> >> num requested is less than the total number of hits.
> >> 
> >> Anyone have thoughts on this?  Seems fairly trivial to crank out a 
> >> Collector to do it, minus the post processing step, which would be 
> >> relatively trivial to add.
> >> 
> >> -Grant
> >> -
> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >> 
> >> 
> > 
> > 
> > 
> > -- 
> > Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
> > Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> > ICQ: 104465785
> > 
> > -
> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-dev-h...@lucene.apache.org
> > 
> 
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
> 
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using 
> Solr/Lucene:
> http://www.lucidimagination.com/search
> 
> 
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> 
> 
> 

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786541#action_12786541
 ] 

Mark Miller commented on LUCENE-1606:
-

bq. Yeah I tried to do some of this in a very quick way if you look at the 
tests... I generate some random wildcard/regexp queries (mainly to prevent bugs 
from being introduced).

Yeah, I think the tests are pretty solid (from the brief looks I've had thus 
far) - this is mainly just a precaution, so that we are not surprised by a 
more realistic corpus. And to have the opportunity to compare with the old 
WildcardQuery - I'd rather not keep it around for tests; once we are confident 
it's the same (and I am at this point), I'm happy to see it fade into the night. 
Replacing such a core piece, though, I want to be absolutely sure everything is 
on the level.

bq. they are definitely contrived but I think cover the bases for any unicode 
problems.

Right - in terms of unit tests, I think you've done great based on what I've 
seen. This is just throwing more variety at a larger more realistic corpus. 
More of a one time deal than something that should be incorporated into the 
tests. Ensures there are no surprises for me - since I didn't write any of this 
code (and I'm not yet super familiar with it), it helps with my comfort level :)

bq. One problem is that none of this unicode stuff is ever a problem on trunk!

Yeah - I assumed not. But as I'm not that familiar with the automaton stuff 
yet, I wanted to be sure there wasn't going to be any input that somehow 
confused it. I realize that your familiarity level probably tells you that's not 
possible - but mine puts me in the position of testing anyway - else I'll look 
like a moron when I +1 this thing ;)

bq. If you save this test setup, 

I'll save it for sure.
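
One way such a comparison could be wired up - a hedged sketch only, assuming a
JUnit-style test and making no assumption about how the two queries are built
from the same wildcard pattern (old WildcardQuery vs. the new automaton-based
one); the helper names here are made up:

  import java.io.IOException;
  import java.util.HashSet;
  import java.util.Set;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.TopDocs;
  import org.junit.Assert;

  // Hypothetical helper: both queries were built from the same pattern and
  // should therefore match exactly the same set of documents.
  final class SameHitsCheck {
    static void assertSameHits(IndexSearcher searcher, Query expected,
                               Query actual, int maxDoc) throws IOException {
      Assert.assertEquals(docIds(searcher.search(expected, Math.max(1, maxDoc))),
                          docIds(searcher.search(actual, Math.max(1, maxDoc))));
    }

    private static Set<Integer> docIds(TopDocs top) {
      final Set<Integer> ids = new HashSet<Integer>();
      for (ScoreDoc sd : top.scoreDocs) {
        ids.add(sd.doc);  // compare doc id sets, not scores or ordering
      }
      return ids;
    }
  }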

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786542#action_12786542
 ] 

Robert Muir commented on LUCENE-1606:
-

bq. Also going over the code, but thats going to take more time.

Btw, I will accept any criticism here. I am not happy with the complexity of 
the enum in the trunk patch, personally.
But here are the three main issues that I think make it complex (not to try to 
place blame elsewhere):

* This trie<->DFA intersection is inherently something I would want to define 
recursively, but that would obviously be bad.
* The DFA library uses UTF-16 whereas TermRef requires UTF-8. Changing 
automaton to use 'int' would fix this, but it would also destroy performance. 
The reason BRICS is the fastest Java regex library is that it tableizes the DFA 
into a 64k UTF-16 char[]. See RunAutomaton for the impl (an illustrative sketch 
of the idea follows below). I think making this require 1MB for the corner 
cases is bad.
* MultiTermQuerys that seek around are pretty complex in trunk. In my opinion 
this enum is a lot easier to understand with the improvements Uwe is working on 
for FilteredTermsEnum (see his branch patch; I think it's easier there).

If you have ideas on how we can simplify any of this in trunk for easier 
readability (instead of just adding absurd amounts of comments, as I did), I'd 
be very interested.
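
An illustrative sketch of the tableized-run idea from the second point above -
not the actual BRICS RunAutomaton code, just the shape of it: a 64k classmap
collapses the UTF-16 alphabet into character classes, and a flat int[]
transition table is indexed by (state, class). All names here are hypothetical.

  // Hypothetical, simplified version of a tableized DFA runner.
  final class TableizedRun {
    private final int initial;        // start state
    private final boolean[] accept;   // accept[state]
    private final char[] classmap;    // 65536 entries: UTF-16 code unit -> class
    private final int numClasses;     // number of character classes
    private final int[] transitions;  // transitions[state * numClasses + class], -1 = reject

    TableizedRun(int initial, boolean[] accept, char[] classmap,
                 int numClasses, int[] transitions) {
      this.initial = initial;
      this.accept = accept;
      this.classmap = classmap;
      this.numClasses = numClasses;
      this.transitions = transitions;
    }

    // Returns true if the DFA accepts the whole string; every step is a single
    // array lookup, which is where the 64k char[] buys its speed.
    boolean run(String s) {
      int state = initial;
      for (int i = 0; i < s.length(); i++) {
        state = transitions[state * numClasses + classmap[s.charAt(i)]];
        if (state == -1) {
          return false;  // entered the reject state
        }
      }
      return accept[state];
    }
  }

Widening the alphabet from UTF-16 code units to full int code points would grow
that classmap table substantially, which is the memory concern raised above.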

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2117) Fix SnowballAnalyzer casing behavior for Turkish Language

2009-12-05 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786549#action_12786549
 ] 

Simon Willnauer commented on LUCENE-2117:
-

Robert, the patch looks good and all tests pass.
I plan to commit this later tomorrow if nobody objects.

> Fix SnowballAnalyzer casing behavior for Turkish Language
> -
>
> Key: LUCENE-2117
> URL: https://issues.apache.org/jira/browse/LUCENE-2117
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2117.patch, LUCENE-2117.patch
>
>
> LUCENE-2102 added a new TokenFilter to handle Turkish unique casing behavior 
> correctly. We should fix the casing behavior in SnowballAnalyzer too as it 
> supports a TurkishStemmer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-12-05 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786551#action_12786551
 ] 

Mark Miller commented on LUCENE-1606:
-

Sorry - I haven't been paying a lot of attention to all of the Unicode 
issues/talk lately.

Could you briefly explain cleanupPosition? What's the case where a seek 
position cannot be converted to UTF-8?

> Automaton Query/Filter (scalable regex)
> ---
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, 
> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
> Additionally all of the existing RegexQuery implementations in Lucene are 
> really slow if there is no constant prefix. This implementation does not 
> depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>  The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>   
>  1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>  2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org


