[jira] Commented: (LUCENE-1960) Remove deprecated Field.Store.COMPRESS

2009-10-08 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763389#action_12763389
 ] 

Michael Busch commented on LUCENE-1960:
---

Users can use CompressionTools#decompress() now. They just need to know which 
binary fields are compressed.

I don't think the SegmentMerger should uncompress automatically? That'd make 
<=2.9 indexes suddenly bigger.
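
For reference, the replacement path is straightforward: compression of stored 
fields becomes the application's job. A hypothetical stand-alone sketch of the 
round-trip, using only java.util.zip (not Lucene's actual CompressionTools 
implementation, though that class is essentially a thin wrapper over the same 
Deflater/Inflater machinery):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative stand-in for application-side field compression:
// compress before storing the bytes in a binary field, decompress after
// loading them. The application must track which fields are compressed.
public class CompressDemo {

    // DEFLATE-compress a byte[] before adding it as a binary stored field.
    public static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress a byte[] loaded back from a binary stored field.
    public static byte[] decompress(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                out.write(buf, 0, inflater.inflate(buf));
            }
        } catch (DataFormatException e) {
            throw new RuntimeException("field bytes are not compressed", e);
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "some stored field value".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(original, decompress(compress(original)))); // prints true
    }
}
```

The point of the comment above is exactly this: since the index no longer 
records the compression, the application must remember which binary fields 
need decompress() on the way out.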

> Remove deprecated Field.Store.COMPRESS
> --
>
> Key: LUCENE-1960
> URL: https://issues.apache.org/jira/browse/LUCENE-1960
> Project: Lucene - Java
>  Issue Type: Task
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.0
>
> Attachments: lucene-1960.patch
>
>
> Also remove FieldForMerge and related code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1961) Remove remaining deprecations in document package

2009-10-08 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-1961:
--

Attachment: lucene-1961.patch

All tests pass.

> Remove remaining deprecations in document package
> -
>
> Key: LUCENE-1961
> URL: https://issues.apache.org/jira/browse/LUCENE-1961
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Other
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.0
>
> Attachments: lucene-1961.patch
>
>
> Remove different deprecated APIs:
> - Field.Index.NO_NORMS, etc.
> - Field.binaryValue()
> - getOmitTf()/setOmitTf()




Output from a small Snowball benchmark

2009-10-08 Thread Karl Wettin
There have been a few small comments in Jira about the reflection in 
Snowball's Among class. There is very little to do about this unless one 
wants to redesign the stemmers so they include an inner class that handles 
the method callbacks. That's quite a bit of work, and I don't even know how 
much CPU one would save by doing it.


So I was thinking it might save some resources if one reused the stemmers 
instead of reinstantiating them for every use, which I presume is what 
everybody does today.


I thought it would make the most sense to simulate query-time stemming, so 
my benchmark contained 4 words, 2 of them plural. Each test ran 1,000,000 
times. The amount of CPU time used is barely noticeable relative to what 
other things cost: 0.0109ms/iteration when reinstantiating, 
0.0067ms/iteration when reusing.


The heap consumption was, however, rather different: at the end, 
reinstantiation had consumed about 10x more than reusing, ~20MB vs. ~2MB.



I realize people don't usually run 1,000,000 queries in such a short time, 
but at least this is an indication that one could save some GC time here. 
Many a mickle makes a muckle...


So I was thinking that it might make sense to add something like a 
singleton concurrent queue to SnowballFilter, plus a new constructor that 
takes the Snowball program implementation class as an argument.
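
A minimal sketch of that reuse idea, with a hypothetical generic pool standing 
in for the real SnowballProgram classes (the names here are illustrative, not 
an actual Lucene API):

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of the proposed reuse: a lock-free pool of instances for one
// stemmer class. A filter would borrow() an instance when it needs one
// and release() it when done, instead of reinstantiating via reflection.
public class StemmerPool<T> {
    private final ConcurrentLinkedQueue<T> pool = new ConcurrentLinkedQueue<T>();
    private final Class<T> clazz;

    public StemmerPool(Class<T> clazz) {
        this.clazz = clazz;
    }

    // Borrow a pooled instance, or instantiate a new one if the pool is empty.
    public T borrow() {
        T instance = pool.poll();
        if (instance != null) {
            return instance;
        }
        try {
            return clazz.getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Return an instance to the pool for reuse by the next caller.
    public void release(T instance) {
        pool.offer(instance);
    }

    public static void main(String[] args) {
        // StringBuilder stands in for a generated stemmer class here.
        StemmerPool<StringBuilder> pool = new StemmerPool<StringBuilder>(StringBuilder.class);
        StringBuilder s = pool.borrow();
        pool.release(s);
        System.out.println(pool.borrow() == s); // prints true: same instance reused
    }
}
```

A SnowballFilter could then hold one such pool per stemmer class, borrowing in 
the constructor and releasing in close(); whether that wins enough over a 
simple per-analyzer instance is exactly the open question above.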


But this might also be way premature optimization.


 karl




[jira] Commented: (LUCENE-1960) Remove deprecated Field.Store.COMPRESS

2009-10-08 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763399#action_12763399
 ] 

Michael Busch commented on LUCENE-1960:
---

{quote}
Also the constant bitmask for compression should stay "reserved" for future 
use.
{quote}

Yeah I think you're right, we must make sure that we don't use this bit for 
something else, as old indexes might have it set to true already. I'll add it 
back with a deprecation comment saying that we'll remove it in 4.0. (4.0 won't 
have to be able to read <3.0 indexes anymore).




[jira] Reopened: (LUCENE-1960) Remove deprecated Field.Store.COMPRESS

2009-10-08 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch reopened LUCENE-1960:
---


Reopening so that I don't forget to add back the COMPRESS bit.




[jira] Commented: (LUCENE-1960) Remove deprecated Field.Store.COMPRESS

2009-10-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763403#action_12763403
 ] 

Uwe Schindler commented on LUCENE-1960:
---

In the discussion with Mike, we said that all pre-2.9 compressed fields should 
behave as before, i.e. they should automatically decompress. It should not be 
possible to create new ones. This is just index compatibility; in the current 
version it is simply not defined what happens with pre-2.9 fields. The second 
problem is older compressed fields using the modified Java UTF-8 encoding, 
which may not decompress correctly now (if you retrieve them with getByte()).

The problem with your patch: if the field is compressed and you try to get it, 
you would not hit it (because it is marked as string, not binary). The new 
self-compressed fields should now be "binary"; before, they were binary *or* 
string. See the discussion in LUCENE-652.




[jira] Commented: (LUCENE-1960) Remove deprecated Field.Store.COMPRESS

2009-10-08 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763408#action_12763408
 ] 

Michael Busch commented on LUCENE-1960:
---

{quote}
The problem with your patch: if the field is compressed and you try to get it, 
you would not hit it (because it is marked as string, not binary). The new 
self-compressed fields should now be "binary"; before, they were binary or 
string. See the discussion in LUCENE-652.
{quote}

Hmm, I see. I should have waited a bit before committing - sorry!
I'll take care of it tomorrow; it's getting too late now.




[jira] Commented: (LUCENE-1960) Remove deprecated Field.Store.COMPRESS

2009-10-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763410#action_12763410
 ] 

Uwe Schindler commented on LUCENE-1960:
---

No problem. :-)

I think it should not be a big task to preserve the decompression of previously 
compressed fields. Just revert the FieldsReader changes and modify them a little.




[jira] Commented: (LUCENE-1960) Remove deprecated Field.Store.COMPRESS

2009-10-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763417#action_12763417
 ] 

Uwe Schindler commented on LUCENE-1960:
---

bq. I don't think the SegmentMerger should uncompress automatically? That'd 
make <=2.9 indexes suddenly bigger.

If we re-add support for compressed fields, we also have to provide the special 
FieldForMerge again. To get rid of this code (which is the source of the 
problems behind the whole COMPRESS issue), we could simply leave it as it is 
now, without FieldForMerge. If you then merge two segments with compressed data, 
without FieldForMerge the data gets automatically decompressed and written 
uncompressed to the new segment. As compressed fields are no longer supported, 
this is the correct behaviour. During merging, the compress bit must be removed.

The problem is suddenly bigger indexes, but we should note this in the docs: "As 
compressed fields are no longer supported, the compression is removed during 
merging. If you want to compress your fields, do it yourself and store them as 
binary stored fields."

Just for confirmation:
I have some indexes with compression enabled for some of the documents (since 
2.9 we do not compress anymore, so newly added docs have no compression [it was 
never a good idea because of performance]). So I would have no way to get those 
fields anymore, because I do not know whether they are compressed and cannot do 
the decompression myself. For me, this data would simply be lost.

I think Solr will have the same problem.




[jira] Assigned: (LUCENE-1951) wildcardquery rewrite improvements

2009-10-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1951:
--

Assignee: Michael McCandless

> wildcardquery rewrite improvements
> --
>
> Key: LUCENE-1951
> URL: https://issues.apache.org/jira/browse/LUCENE-1951
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>Reporter: Robert Muir
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1951.patch, LUCENE-1951_bwcompatbranch.patch
>
>
> wildcardquery has logic to rewrite to termquery if there is no wildcard 
> character, but
> * it needs to pass along the boost if it does this
> * if the user asked for a 'constant score' rewriteMethod, it should rewrite 
> to a constant score query for consistency.
> additionally, if the query is really a prefixquery, it would be nice to 
> rewrite to prefix query.
> both will enumerate the same number of terms, but prefixquery has a simpler 
> comparison function.




[jira] Commented: (LUCENE-1951) wildcardquery rewrite improvements

2009-10-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763429#action_12763429
 ] 

Michael McCandless commented on LUCENE-1951:


Patch looks good, thanks Robert!  And those are good perf numbers;
rewriting to PrefixQuery seems a clear win.

The only thing that makes me nervous here is that we've baked MTQ's
rewrite logic into WildcardQuery.rewrite.  Ie, MTQ in general accepts
any rewrite method, so conceivably one could create their own
rewrite method and then find that it's unused in the special case where
the WildcardQuery is a single term.

And while it's true today that if the rewrite method != scoring
boolean query, it must be a constant scoring one, that could
conceivably some day change.

Maybe a different approach would be to make a degenerate
"SingleTermEnum" (subclasses FilteredTermEnum) that produces only a
single term?  Then in getEnum we could return that, instead, so the
rewrite method remains intact?
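
Lucene internals aside, the shape of that idea is just a degenerate 
single-element enumerator. A plain-Java analogue (not the actual 
FilteredTermEnum API, which has its own setEnum/termCompare contract) looks 
like this:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Plain-Java illustration of the "degenerate enum" idea behind the proposed
// SingleTermEnum: yield exactly one element, then report exhaustion, so the
// generic consumer (here Iterator callers; there MTQ's rewrite machinery)
// can process the single-term case through the normal code path.
public class SingleElementIterator<T> implements Iterator<T> {
    private final T element;
    private boolean consumed = false;

    public SingleElementIterator(T element) {
        this.element = element;
    }

    public boolean hasNext() {
        return !consumed;
    }

    public T next() {
        if (consumed) {
            throw new NoSuchElementException();
        }
        consumed = true;
        return element;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }

    public static void main(String[] args) {
        Iterator<String> it = new SingleElementIterator<String>("foo");
        while (it.hasNext()) {
            System.out.println(it.next()); // prints foo
        }
    }
}
```

The appeal of this shape is exactly what the comment argues: WildcardQuery 
stays out of the rewrite business entirely, and any rewrite method, including 
a user-supplied one, consumes the one-term enumeration the same way it would a 
multi-term one.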




[jira] Assigned: (LUCENE-1959) Index Splitter

2009-10-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1959:
--

Assignee: Michael McCandless

> Index Splitter
> --
>
> Key: LUCENE-1959
> URL: https://issues.apache.org/jira/browse/LUCENE-1959
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Trivial
> Fix For: 3.1
>
> Attachments: LUCENE-1959.patch
>
>
> If an index has multiple segments, this tool allows splitting those segments 
> into separate directories.  




[jira] Updated: (LUCENE-1962) Persian Arabic Analyzer cleanup

2009-10-08 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-1962:


Attachment: LUCENE_1962.patch

> Persian Arabic Analyzer cleanup
> ---
>
> Key: LUCENE-1962
> URL: https://issues.apache.org/jira/browse/LUCENE-1962
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.0
>
> Attachments: LUCENE_1962.patch
>
>
> While browsing through the code I found some places for minor improvements in 
> the new Arabic / Persian Analyzer code. 
> - prevent default stopwords from being loaded each time a default constructor 
> is called
> - replace if blocks with a single switch
> - mark private members final where needed
> - change protected visibility to final in a final class




[jira] Created: (LUCENE-1962) Persian Arabic Analyzer cleanup

2009-10-08 Thread Simon Willnauer (JIRA)
Persian Arabic Analyzer cleanup
---

 Key: LUCENE-1962
 URL: https://issues.apache.org/jira/browse/LUCENE-1962
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 3.0
 Attachments: LUCENE_1962.patch

While browsing through the code I found some places for minor improvements in 
the new Arabic / Persian Analyzer code. 

- prevent default stopwords from being loaded each time a default constructor 
is called
- replace if blocks with a single switch
- mark private members final where needed
- change protected visibility to final in a final class






[jira] Commented: (LUCENE-1959) Index Splitter

2009-10-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763433#action_12763433
 ] 

Michael McCandless commented on LUCENE-1959:


Looks great, thanks Jason!  I just tweaked the javadoc to this:

/**
 * Command-line tool that enables listing segments in an
 * index, copying specific segments to another index, and
 * deleting segments from an index.
 *
 * NOTE: The tool is experimental and might change
 * in incompatible ways in the next release.  You can easily
 * accidentally remove segments from your index so be
 * careful!
 */

My inclination would be to commit this today (ie for 3.0), since it's such an 
isolated change, but we have said that 3.0 would contain only removal of 
deprecated APIs, the cutover to Java 1.5 features, and bug fixes, so if anyone 
objects to my committing this for 3.0, please speak up soon!





[jira] Commented: (LUCENE-1959) Index Splitter

2009-10-08 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763440#action_12763440
 ] 

Andrzej Bialecki  commented on LUCENE-1959:
---

I'm of a split mind about this splitter ;) in the sense that I'm not sure how 
useful it is - if your input is an optimized index then it has just 1 segment, 
so this tool won't be able to split it, right?

AFAIK similar functionality can also be implemented using two other methods 
that would work on indexes with any number of segments: one method is trivial, 
based on a "delete/IndexWriter.addIndexes/undeleteAll" loop that requires 
multiple passes over the input data; the other would use the same approach as 
SegmentMerger, i.e. working with FieldsWriter, FormatPostings*Consumer, 
TermVectorsWriter, etc. for single-pass splitting.

So I guess I'm -0 on this index splitting method, because I think we can do it 
better.





[jira] Commented: (LUCENE-1959) Index Splitter

2009-10-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763443#action_12763443
 ] 

Uwe Schindler commented on LUCENE-1959:
---

I would put it into contrib, as it is a utility tool. I see no real reason to 
have it in core. We then have full flexibility to change and optimize it, as 
Andrzej suggested.

One thing against this tool in its current form: to copy the files it should 
use the Directory abstraction layer and not use java.io directly, i.e. open 
IndexInput/IndexOutput to copy the files.




[jira] Issue Comment Edited: (LUCENE-1959) Index Splitter

2009-10-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763443#action_12763443
 ] 

Uwe Schindler edited comment on LUCENE-1959 at 10/8/09 3:52 AM:


I would put it into contrib (misc, next to IndexNormModifier, which is also a 
command-line tool), as it is a utility tool. I see no real reason to have it in 
core. We then have full flexibility to change and optimize it, as Andrzej 
suggested.

One thing against this tool in its current form: to copy the files it should 
use the Directory abstraction layer and not use java.io directly, i.e. open 
IndexInput/IndexOutput to copy the files.




Arabic Analyzer: possible bug

2009-10-08 Thread DM Smith
I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know 
Arabic or Farsi, but I have some texts to index in those languages.)


The tokenizer/filter chain for ArabicAnalyzer is:
TokenStream result = new ArabicLetterTokenizer( reader );
result = new StopFilter( result, stoptable );
result = new LowerCaseFilter(result);
result = new ArabicNormalizationFilter( result );
result = new ArabicStemFilter( result );

return result;

Shouldn't the StopFilter come after ArabicNormalizationFilter?


As a comparison the PersianAnalyzer has:
TokenStream result = new ArabicLetterTokenizer(reader);
result = new LowerCaseFilter(result);
result = new ArabicNormalizationFilter(result);
/* additional persian-specific normalization */
result = new PersianNormalizationFilter(result);
/*
 * the order here is important: the stopword list is normalized with the above!
 */
result = new StopFilter(result, stoptable);

return result;


Thanks,
DM

[jira] Commented: (LUCENE-1959) Index Splitter

2009-10-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763457#action_12763457
 ] 

Michael McCandless commented on LUCENE-1959:


bq. I would put it into contrib

+1, I'll do that.

bq. To copy the files it should use the Directory abstraction layer and not use 
java.io directly.

I agree, that'd be nice, but I don't think it's really necessary before 
committing... it can be another future improvement. But we should note the 
limitations of the tool; I'll add javadocs.

Jason do you want to address any of these issues now (before committing to 
contrib)?




[jira] Commented: (LUCENE-1959) Index Splitter

2009-10-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763460#action_12763460
 ] 

Mark Miller commented on LUCENE-1959:
-

bq. To copy the files it should use the Directory abstraction layer and not use 
java.io directly.

I'd use Channels instead - generally much faster.
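
A sketch of that channel-based copy (plain java.nio, independent of Lucene): 
FileChannel.transferTo can hand the bulk of the copy to the operating system 
instead of staging every block through a Java byte[] buffer.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

// Illustrative file copy via FileChannel.transferTo, the approach Mark
// suggests for copying segment files between directories.
public class ChannelCopy {

    public static void copy(File src, File dst) throws IOException {
        FileInputStream in = new FileInputStream(src);
        FileOutputStream out = new FileOutputStream(dst);
        try {
            FileChannel source = in.getChannel();
            FileChannel target = out.getChannel();
            long size = source.size();
            long transferred = 0;
            // transferTo may copy fewer bytes than requested; loop until done.
            while (transferred < size) {
                transferred += source.transferTo(transferred, size - transferred, target);
            }
        } finally {
            in.close();
            out.close();
        }
    }
}
```

Note this bypasses the Directory abstraction just as much as raw java.io 
streams do, so it answers the speed point but not Uwe's layering point; a 
Directory-based copy would go through IndexInput/IndexOutput instead.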




[jira] Commented: (LUCENE-1959) Index Splitter

2009-10-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763461#action_12763461
 ] 

Mark Miller commented on LUCENE-1959:
-

bq. So I guess I'm -0 on this index splitting method, because I think we can do 
it better.

Improvements welcome :) No reason not to start somewhere though.




[jira] Commented: (LUCENE-1959) Index Splitter

2009-10-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763464#action_12763464
 ] 

Michael McCandless commented on LUCENE-1959:


bq. No reason not to start somewhere though.

+1

Progress not perfection!




[jira] Updated: (LUCENE-1959) Index Splitter

2009-10-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1959:
---

Attachment: LUCENE-1959.patch

New patch attached: move to contrib/misc, renamed TestFileSplitter -> 
TestIndexSplitter, added javadocs noting the limitations, added CHANGES entry.  
I'll commit shortly.

> Index Splitter
> --
>
> Key: LUCENE-1959
> URL: https://issues.apache.org/jira/browse/LUCENE-1959
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Trivial
> Fix For: 3.1
>
> Attachments: LUCENE-1959.patch, LUCENE-1959.patch
>
>
> If an index has multiple segments, this tool allows splitting those segments 
> into separate directories.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1951) wildcardquery rewrite improvements

2009-10-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763468#action_12763468
 ] 

Robert Muir commented on LUCENE-1951:
-

Michael, I thought about this problem too, but didn't know what to do.

I rather like the SingleTermEnum idea. I'll do it.


> wildcardquery rewrite improvements
> --
>
> Key: LUCENE-1951
> URL: https://issues.apache.org/jira/browse/LUCENE-1951
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>Reporter: Robert Muir
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1951.patch, LUCENE-1951_bwcompatbranch.patch
>
>
> wildcardquery has logic to rewrite to termquery if there is no wildcard 
> character, but
> * it needs to pass along the boost if it does this
> * if the user asked for a 'constant score' rewriteMethod, it should rewrite 
> to a constant score query for consistency.
> additionally, if the query is really a prefixquery, it would be nice to 
> rewrite to prefix query.
> both will enumerate the same number of terms, but prefixquery has a simpler 
> comparison function.
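The rewrite cases described above can be sketched outside of Lucene's API. This is a hypothetical classifier, not the actual WildcardQuery code; the class and method names are illustrative only:

```java
// Hypothetical sketch of the rewrite decision discussed in this issue:
// a pattern with no wildcard characters can be handled as a plain term
// (keeping the boost and the user's chosen rewrite method), and a pattern
// whose only wildcard is a single trailing '*' can be handled as a prefix,
// which enumerates the same terms with a simpler comparison function.
public class WildcardRewriteSketch {
    enum Rewrite { TERM, PREFIX, WILDCARD }

    static Rewrite classify(String pattern) {
        int star = pattern.indexOf('*');
        int quest = pattern.indexOf('?');
        if (star == -1 && quest == -1) {
            return Rewrite.TERM;      // no wildcard characters at all
        }
        if (quest == -1 && star == pattern.length() - 1
                && star == pattern.lastIndexOf('*')) {
            return Rewrite.PREFIX;    // exactly one '*', at the very end
        }
        return Rewrite.WILDCARD;      // needs full wildcard matching
    }

    public static void main(String[] args) {
        System.out.println(classify("lucene"));  // TERM
        System.out.println(classify("luc*"));    // PREFIX
        System.out.println(classify("l?cene"));  // WILDCARD
    }
}
```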

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Arabic Analyzer: possible bug

2009-10-08 Thread Robert Muir
DM, this isn't a bug.

The arabic stopwords are not normalized.

but for persian, i normalized the stopwords. mostly because i did not want
to have to create variations with farsi yah versus arabic yah for each one.

On Thu, Oct 8, 2009 at 7:24 AM, DM Smith  wrote:

> I'm wondering if there is  a bug in ArabicAnalyzer in 2.9. (I don't know
> Arabic or Farsi, but have some texts to index in those languages.)
> The tokenizer/filter chain for ArabicAnalyzer is:
> TokenStream result = new ArabicLetterTokenizer( reader );
> result = new StopFilter( result, stoptable );
> result = new LowerCaseFilter(result);
> result = new ArabicNormalizationFilter( result );
> result = new ArabicStemFilter( result );
>
> return result;
>
> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>
>
> As a comparison the PersianAnalyzer has:
> TokenStream result = new ArabicLetterTokenizer(reader);
> result = new LowerCaseFilter(result);
> result = new ArabicNormalizationFilter(result);
> /* additional persian-specific normalization */
> result = new PersianNormalizationFilter(result);
> /*
>  * the order here is important: the stopword list is normalized with
> the
>  * above!
>  */
> result = new StopFilter(result, stoptable);
>
> return result;
>
>
> Thanks,
>  DM
>



-- 
Robert Muir
rcm...@gmail.com
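The ordering question in this thread comes down to whether the stopword list stores raw or normalized forms. A toy sketch of that trade-off (plain Java, not Lucene code; the character mapping is just the Farsi-yah-to-Arabic-yah case mentioned above):

```java
import java.util.Set;

// Toy illustration (not Lucene code) of why stop-filter order matters.
// Assume normalization maps Farsi yah (U+06CC) to Arabic yah (U+064A).
// If the stopword list stores only the normalized form, stopping must
// happen AFTER normalization or a raw-form token slips through; if the
// list stores the raw forms, stopping must happen BEFORE.
public class StopOrderSketch {
    static String normalize(String s) {
        return s.replace('\u06CC', '\u064A'); // Farsi yah -> Arabic yah
    }

    static boolean isStopped(String token, Set<String> stopwords,
                             boolean stopBeforeNormalize) {
        if (!stopBeforeNormalize) {
            token = normalize(token); // normalize first, then compare
        }
        return stopwords.contains(token);
    }

    public static void main(String[] args) {
        // Hypothetical stopword stored in its normalized spelling:
        Set<String> normalizedStops = Set.of("\u0645\u064A");
        // The same word arriving with the Farsi yah variant:
        String rawToken = "\u0645\u06CC";
        System.out.println(isStopped(rawToken, normalizedStops, true));  // false: missed
        System.out.println(isStopped(rawToken, normalizedStops, false)); // true: removed
    }
}
```

This is why the Persian chain puts StopFilter last (its list is normalized) while the Arabic chain, whose list is not normalized, stops first.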


[jira] Commented: (LUCENE-1959) Index Splitter

2009-10-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763479#action_12763479
 ] 

Mark Miller commented on LUCENE-1959:
-

small opt - you might switch it to reuse the buffer between files.
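The suggested optimization, one buffer reused across all file copies instead of a fresh allocation per file, might look like this sketch (illustrative code under assumed names, not the actual committed patch; the 64 KB size is arbitrary):

```java
import java.io.*;

// Sketch of the buffer-reuse suggestion: instead of allocating a new
// byte[] for every file copied into the split directory, allocate one
// buffer up front and reuse it for each file.
public class CopyWithSharedBuffer {
    public static void copyAll(File[] sources, File destDir) throws IOException {
        byte[] buffer = new byte[64 * 1024]; // allocated once, reused per file
        for (File src : sources) {
            try (InputStream in = new FileInputStream(src);
                 OutputStream out = new FileOutputStream(new File(destDir, src.getName()))) {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n); // write only the bytes actually read
                }
            }
        }
    }
}
```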

> Index Splitter
> --
>
> Key: LUCENE-1959
> URL: https://issues.apache.org/jira/browse/LUCENE-1959
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Trivial
> Fix For: 3.1
>
> Attachments: LUCENE-1959.patch, LUCENE-1959.patch
>
>
> If an index has multiple segments, this tool allows splitting those segments 
> into separate directories.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1959) Index Splitter

2009-10-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763481#action_12763481
 ] 

Michael McCandless commented on LUCENE-1959:


bq. small opt - you might switch it to reuse the buffer between files.

OK I just committed that!

> Index Splitter
> --
>
> Key: LUCENE-1959
> URL: https://issues.apache.org/jira/browse/LUCENE-1959
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Trivial
> Fix For: 3.0
>
> Attachments: LUCENE-1959.patch, LUCENE-1959.patch
>
>
> If an index has multiple segments, this tool allows splitting those segments 
> into separate directories.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1959) Index Splitter

2009-10-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1959.


   Resolution: Fixed
Fix Version/s: (was: 3.1)
   3.0

Thanks Jason!

> Index Splitter
> --
>
> Key: LUCENE-1959
> URL: https://issues.apache.org/jira/browse/LUCENE-1959
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Trivial
> Fix For: 3.0
>
> Attachments: LUCENE-1959.patch, LUCENE-1959.patch
>
>
> If an index has multiple segments, this tool allows splitting those segments 
> into separate directories.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1962) Persian Arabic Analyzer cleanup

2009-10-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763485#action_12763485
 ] 

Robert Muir commented on LUCENE-1962:
-

Simon, thanks, please commit this :)


> Persian Arabic Analyzer cleanup
> ---
>
> Key: LUCENE-1962
> URL: https://issues.apache.org/jira/browse/LUCENE-1962
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.0
>
> Attachments: LUCENE_1962.patch
>
>
> While browsing through the code I found some places for minor improvements in 
> the new Arabic / Persian Analyzer code. 
> - prevent default stopwords from being loaded each time a default constructor 
> is called
> - replace if blocks with a single switch
> - marking private members final where needed
> - changed protected visibility to final in final class.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Arabic Analyzer: possible bug

2009-10-08 Thread DM Smith

Robert,
Thanks for the info.
As I said, I am illiterate in Arabic. So I have another, perhaps  
nonsensical, question:
Does the stop word list have every combination of upper/lower case for  
each Arabic word in the list? (i.e. is it fully de-normalized?) Or  
should it come after LowerCaseFilter?


-- DM

On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:


DM, this isn't a bug.

The arabic stopwords are not normalized.

but for persian, i normalized the stopwords. mostly because i did  
not want to have to create variations with farsi yah versus arabic  
yah for each one.


On Thu, Oct 8, 2009 at 7:24 AM, DM Smith  wrote:
I'm wondering if there is  a bug in ArabicAnalyzer in 2.9. (I don't  
know Arabic or Farsi, but have some texts to index in those  
languages.)


The tokenizer/filter chain for ArabicAnalyzer is:
TokenStream result = new ArabicLetterTokenizer( reader );
result = new StopFilter( result, stoptable );
result = new LowerCaseFilter(result);
result = new ArabicNormalizationFilter( result );
result = new ArabicStemFilter( result );

return result;

Shouldn't the StopFilter come after ArabicNormalizationFilter?


As a comparison the PersianAnalyzer has:
TokenStream result = new ArabicLetterTokenizer(reader);
result = new LowerCaseFilter(result);
result = new ArabicNormalizationFilter(result);
/* additional persian-specific normalization */
result = new PersianNormalizationFilter(result);
/*
 * the order here is important: the stopword list is normalized  
with the

 * above!
 */
result = new StopFilter(result, stoptable);

return result;


Thanks,
DM



--
Robert Muir
rcm...@gmail.com




Re: Arabic Analyzer: possible bug

2009-10-08 Thread Ahmed Al-Obaidy
There is no upper and lower case in Arabic.

--- On Thu, 10/8/09, DM Smith  wrote:

From: DM Smith 
Subject: Re: Arabic Analyzer: possible bug
To: java-dev@lucene.apache.org
Date: Thursday, October 8, 2009, 3:14 PM

Robert,
Thanks for the info.
As I said, I am illiterate in Arabic. So I have another, perhaps nonsensical, 
question:
Does the stop word list have every combination of upper/lower case for each 
Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come 
after LowerCaseFilter?
-- DM
On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
DM, this isn't a bug.

The arabic stopwords are not normalized.

but for persian, i normalized the stopwords. mostly because i did not want to 
have to create variations with farsi yah versus arabic yah for each one.



On Thu, Oct 8, 2009 at 7:24 AM, DM Smith  wrote:


I'm wondering if there is  a bug in ArabicAnalyzer in 2.9. (I don't know Arabic 
or Farsi, but have some texts to index in those languages.)
The tokenizer/filter chain for ArabicAnalyzer is:

        TokenStream result = new ArabicLetterTokenizer( reader );
        result = new StopFilter( result, stoptable );
        result = new LowerCaseFilter(result);
        result = new ArabicNormalizationFilter( result );


        result = new ArabicStemFilter( result );

        return result;

Shouldn't the StopFilter come after ArabicNormalizationFilter?


 As a comparison the PersianAnalyzer has:

    TokenStream result = new ArabicLetterTokenizer(reader);
    result = new LowerCaseFilter(result);
    result = new ArabicNormalizationFilter(result);


    /* additional persian-specific normalization */
    result = new PersianNormalizationFilter(result);
    /*
     * the order here is important: the stopword list is normalized with the
     * above!
     */


    result = new StopFilter(result, stoptable);

    return result;


Thanks,

DM


-- 
Robert Muir
rcm...@gmail.com







  

Re: Arabic Analyzer: possible bug

2009-10-08 Thread Basem Narmok
DM, there is no upper/lower cases in Arabic, so don't worry, but the
stop word list needs some corrections and may miss some common/stop
Arabic words.

Best,

On Thu, Oct 8, 2009 at 4:14 PM, DM Smith  wrote:
> Robert,
> Thanks for the info.
> As I said, I am illiterate in Arabic. So I have another, perhaps
> nonsensical, question:
> Does the stop word list have every combination of upper/lower case for each
> Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come
> after LowerCaseFilter?
> -- DM
> On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>
> DM, this isn't a bug.
>
> The arabic stopwords are not normalized.
>
> but for persian, i normalized the stopwords. mostly because i did not want
> to have to create variations with farsi yah versus arabic yah for each one.
>
> On Thu, Oct 8, 2009 at 7:24 AM, DM Smith  wrote:
>>
>> I'm wondering if there is  a bug in ArabicAnalyzer in 2.9. (I don't know
>> Arabic or Farsi, but have some texts to index in those languages.)
>> The tokenizer/filter chain for ArabicAnalyzer is:
>>         TokenStream result = new ArabicLetterTokenizer( reader );
>>         result = new StopFilter( result, stoptable );
>>         result = new LowerCaseFilter(result);
>>         result = new ArabicNormalizationFilter( result );
>>         result = new ArabicStemFilter( result );
>>
>>         return result;
>>
>> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>>
>> As a comparison the PersianAnalyzer has:
>>     TokenStream result = new ArabicLetterTokenizer(reader);
>>     result = new LowerCaseFilter(result);
>>     result = new ArabicNormalizationFilter(result);
>>     /* additional persian-specific normalization */
>>     result = new PersianNormalizationFilter(result);
>>     /*
>>      * the order here is important: the stopword list is normalized with
>> the
>>      * above!
>>      */
>>     result = new StopFilter(result, stoptable);
>>
>>     return result;
>>
>>
>> Thanks,
>> DM
>
>
> --
> Robert Muir
> rcm...@gmail.com
>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: Arabic Analyzer: possible bug

2009-10-08 Thread Uwe Schindler
Just an addition: The lowercase filter is only for the case of embedded
non-arabic words. And these will not appear in the stop words.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Basem Narmok [mailto:nar...@gmail.com]
> Sent: Thursday, October 08, 2009 4:20 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Arabic Analyzer: possible bug
> 
> DM, there is no upper/lower cases in Arabic, so don't worry, but the
> stop word list needs some corrections and may miss some common/stop
> Arabic words.
> 
> Best,
> 
> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith  wrote:
> > Robert,
> > Thanks for the info.
> > As I said, I am illiterate in Arabic. So I have another, perhaps
> > nonsensical, question:
> > Does the stop word list have every combination of upper/lower case for
> each
> > Arabic word in the list? (i.e. is it fully de-normalized?) Or should it
> come
> > after LowerCaseFilter?
> > -- DM
> > On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
> >
> > DM, this isn't a bug.
> >
> > The arabic stopwords are not normalized.
> >
> > but for persian, i normalized the stopwords. mostly because i did not
> want
> > to have to create variations with farsi yah versus arabic yah for each
> one.
> >
> > On Thu, Oct 8, 2009 at 7:24 AM, DM Smith  wrote:
> >>
> >> I'm wondering if there is  a bug in ArabicAnalyzer in 2.9. (I don't
> know
> >> Arabic or Farsi, but have some texts to index in those languages.)
> >> The tokenizer/filter chain for ArabicAnalyzer is:
> >>         TokenStream result = new ArabicLetterTokenizer( reader );
> >>         result = new StopFilter( result, stoptable );
> >>         result = new LowerCaseFilter(result);
> >>         result = new ArabicNormalizationFilter( result );
> >>         result = new ArabicStemFilter( result );
> >>
> >>         return result;
> >>
> >> Shouldn't the StopFilter come after ArabicNormalizationFilter?
> >>
> >> As a comparison the PersianAnalyzer has:
> >>     TokenStream result = new ArabicLetterTokenizer(reader);
> >>     result = new LowerCaseFilter(result);
> >>     result = new ArabicNormalizationFilter(result);
> >>     /* additional persian-specific normalization */
> >>     result = new PersianNormalizationFilter(result);
> >>     /*
> >>      * the order here is important: the stopword list is normalized
> with
> >> the
> >>      * above!
> >>      */
> >>     result = new StopFilter(result, stoptable);
> >>
> >>     return result;
> >>
> >>
> >> Thanks,
> >> DM
> >
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
> >
> >
> 
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Arabic Analyzer: possible bug

2009-10-08 Thread Robert Muir
the upper/lower case is there, in case you happen to have some english text
mixed in :)

but to answer your question, the stopword list contains some variant forms,
and I added a couple more in LUCENE-1758.

Maybe this will help:
ArabicNormalizer is 'aggressive' for arabic language.
ArabicNormalizer + PersianNormalizer is 'not very aggressive' for persian
language.

So for arabic language, i thought it unsafe to normalize the stopwords.

For persian language, the normalizer is really important so the stopwords
list will work regardless of encoding (they use a variant form of yah and
kaf sometimes, especially depending on computer system/legacy encoding).
Also, most words in persian stopword list, aren't even real words on their
own.

the languages are very different so the analyzers work in different ways...

On Thu, Oct 8, 2009 at 9:18 AM, Ahmed Al-Obaidy wrote:

> There is no upper and lower case in Arabic.
>
> --- On *Thu, 10/8/09, DM Smith * wrote:
>
>
> From: DM Smith 
> Subject: Re: Arabic Analyzer: possible bug
> To: java-dev@lucene.apache.org
> Date: Thursday, October 8, 2009, 3:14 PM
>
>
> Robert,
> Thanks for the info.
> As I said, I am illiterate in Arabic. So I have another, perhaps
> nonsensical, question:
> Does the stop word list have every combination of upper/lower case for each
> Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come
> after LowerCaseFilter?
>
> -- DM
>
> On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>
> DM, this isn't a bug.
>
> The arabic stopwords are not normalized.
>
> but for persian, i normalized the stopwords. mostly because i did not want
> to have to create variations with farsi yah versus arabic yah for each one.
>
> On Thu, Oct 8, 2009 at 7:24 AM, DM Smith  wrote:
>
>> I'm wondering if there is  a bug in ArabicAnalyzer in 2.9. (I don't know
>> Arabic or Farsi, but have some texts to index in those languages.)
>> The tokenizer/filter chain for ArabicAnalyzer is:
>> TokenStream result = new ArabicLetterTokenizer( reader );
>> result = new StopFilter( result, stoptable );
>> result = new LowerCaseFilter(result);
>> result = new ArabicNormalizationFilter( result );
>> result = new ArabicStemFilter( result );
>>
>> return result;
>>
>> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>>
>>
>> As a comparison the PersianAnalyzer has:
>> TokenStream result = new ArabicLetterTokenizer(reader);
>> result = new LowerCaseFilter(result);
>> result = new ArabicNormalizationFilter(result);
>> /* additional persian-specific normalization */
>> result = new PersianNormalizationFilter(result);
>> /*
>>  * the order here is important: the stopword list is normalized with
>> the
>>  * above!
>>  */
>> result = new StopFilter(result, stoptable);
>>
>> return result;
>>
>>
>> Thanks,
>>  DM
>>
>
>
>
> --
> Robert Muir
> rcm...@gmail.com 
>
>
>
>


-- 
Robert Muir
rcm...@gmail.com


Re: Arabic Analyzer: possible bug

2009-10-08 Thread Robert Muir
Basem, by any chance would you be willing to help improve it for us?

On Thu, Oct 8, 2009 at 9:20 AM, Basem Narmok  wrote:

> DM, there is no upper/lower cases in Arabic, so don't worry, but the
> stop word list needs some corrections and may miss some common/stop
> Arabic words.
>
> Best,
>
> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith  wrote:
> > Robert,
> > Thanks for the info.
> > As I said, I am illiterate in Arabic. So I have another, perhaps
> > nonsensical, question:
> > Does the stop word list have every combination of upper/lower case for
> each
> > Arabic word in the list? (i.e. is it fully de-normalized?) Or should it
> come
> > after LowerCaseFilter?
> > -- DM
> > On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
> >
> > DM, this isn't a bug.
> >
> > The arabic stopwords are not normalized.
> >
> > but for persian, i normalized the stopwords. mostly because i did not
> want
> > to have to create variations with farsi yah versus arabic yah for each
> one.
> >
> > On Thu, Oct 8, 2009 at 7:24 AM, DM Smith  wrote:
> >>
> >> I'm wondering if there is  a bug in ArabicAnalyzer in 2.9. (I don't know
> >> Arabic or Farsi, but have some texts to index in those languages.)
> >> The tokenizer/filter chain for ArabicAnalyzer is:
> >> TokenStream result = new ArabicLetterTokenizer( reader );
> >> result = new StopFilter( result, stoptable );
> >> result = new LowerCaseFilter(result);
> >> result = new ArabicNormalizationFilter( result );
> >> result = new ArabicStemFilter( result );
> >>
> >> return result;
> >>
> >> Shouldn't the StopFilter come after ArabicNormalizationFilter?
> >>
> >> As a comparison the PersianAnalyzer has:
> >> TokenStream result = new ArabicLetterTokenizer(reader);
> >> result = new LowerCaseFilter(result);
> >> result = new ArabicNormalizationFilter(result);
> >> /* additional persian-specific normalization */
> >> result = new PersianNormalizationFilter(result);
> >> /*
> >>  * the order here is important: the stopword list is normalized with
> >> the
> >>  * above!
> >>  */
> >> result = new StopFilter(result, stoptable);
> >>
> >> return result;
> >>
> >>
> >> Thanks,
> >> DM
> >
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
> >
> >
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


-- 
Robert Muir
rcm...@gmail.com


Re: Arabic Analyzer: possible bug

2009-10-08 Thread DM Smith

On 10/08/2009 09:23 AM, Uwe Schindler wrote:
> Just an addition: The lowercase filter is only for the case of embedded
> non-arabic words. And these will not appear in the stop words.

I learned something new!

Hmm. If one has a mixed Arabic / English text, shouldn't one be able to 
augment the stopwords list with English stop words? And if so, shouldn't 
the stop filter come after the lower case filter?


-- DM


> > -Original Message-
> > From: Basem Narmok [mailto:nar...@gmail.com]
> > Sent: Thursday, October 08, 2009 4:20 PM
> > To: java-dev@lucene.apache.org
> > Subject: Re: Arabic Analyzer: possible bug
> >
> > DM, there is no upper/lower cases in Arabic, so don't worry, but the
> > stop word list needs some corrections and may miss some common/stop
> > Arabic words.
> >
> > Best,
> >
> > On Thu, Oct 8, 2009 at 4:14 PM, DM Smith  wrote:
> > > Robert,
> > > Thanks for the info.
> > > As I said, I am illiterate in Arabic. So I have another, perhaps
> > > nonsensical, question:
> > > Does the stop word list have every combination of upper/lower case for
> > > each
> > > Arabic word in the list? (i.e. is it fully de-normalized?) Or should it
> > > come
> > > after LowerCaseFilter?
> > > -- DM
> > > On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
> > >
> > > DM, this isn't a bug.
> > >
> > > The arabic stopwords are not normalized.
> > >
> > > but for persian, i normalized the stopwords. mostly because i did not
> > > want
> > > to have to create variations with farsi yah versus arabic yah for each
> > > one.
> > >
> > > On Thu, Oct 8, 2009 at 7:24 AM, DM Smith  wrote:
> > >>
> > >> I'm wondering if there is  a bug in ArabicAnalyzer in 2.9. (I don't
> > >> know
> > >> Arabic or Farsi, but have some texts to index in those languages.)
> > >> The tokenizer/filter chain for ArabicAnalyzer is:
> > >>         TokenStream result = new ArabicLetterTokenizer( reader );
> > >>         result = new StopFilter( result, stoptable );
> > >>         result = new LowerCaseFilter(result);
> > >>         result = new ArabicNormalizationFilter( result );
> > >>         result = new ArabicStemFilter( result );
> > >>
> > >>         return result;
> > >>
> > >> Shouldn't the StopFilter come after ArabicNormalizationFilter?
> > >>
> > >> As a comparison the PersianAnalyzer has:
> > >>     TokenStream result = new ArabicLetterTokenizer(reader);
> > >>     result = new LowerCaseFilter(result);
> > >>     result = new ArabicNormalizationFilter(result);
> > >>     /* additional persian-specific normalization */
> > >>     result = new PersianNormalizationFilter(result);
> > >>     /*
> > >>      * the order here is important: the stopword list is normalized
> > >> with
> > >>      * the above!
> > >>      */
> > >>     result = new StopFilter(result, stoptable);
> > >>
> > >>     return result;
> > >>
> > >>
> > >> Thanks,
> > >> DM
> >
> > -
> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Arabic Analyzer: possible bug

2009-10-08 Thread Basem Narmok
Robert,

I will be happy to do so. Currently, I am testing the new Arabic
analyzer in 2.9, and also I will prepare a new stop word list. I will
provide you with my findings/comments soon.

Best,

On Thu, Oct 8, 2009 at 4:28 PM, Robert Muir  wrote:
> Basem, by any chance would you be willing to help improve it for us?
>
> On Thu, Oct 8, 2009 at 9:20 AM, Basem Narmok  wrote:
>>
>> DM, there is no upper/lower cases in Arabic, so don't worry, but the
>> stop word list needs some corrections and may miss some common/stop
>> Arabic words.
>>
>> Best,
>>
>> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith  wrote:
>> > Robert,
>> > Thanks for the info.
>> > As I said, I am illiterate in Arabic. So I have another, perhaps
>> > nonsensical, question:
>> > Does the stop word list have every combination of upper/lower case for
>> > each
>> > Arabic word in the list? (i.e. is it fully de-normalized?) Or should it
>> > come
>> > after LowerCaseFilter?
>> > -- DM
>> > On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>> >
>> > DM, this isn't a bug.
>> >
>> > The arabic stopwords are not normalized.
>> >
>> > but for persian, i normalized the stopwords. mostly because i did not
>> > want
>> > to have to create variations with farsi yah versus arabic yah for each
>> > one.
>> >
>> > On Thu, Oct 8, 2009 at 7:24 AM, DM Smith  wrote:
>> >>
>> >> I'm wondering if there is  a bug in ArabicAnalyzer in 2.9. (I don't
>> >> know
>> >> Arabic or Farsi, but have some texts to index in those languages.)
>> >> The tokenizer/filter chain for ArabicAnalyzer is:
>> >>         TokenStream result = new ArabicLetterTokenizer( reader );
>> >>         result = new StopFilter( result, stoptable );
>> >>         result = new LowerCaseFilter(result);
>> >>         result = new ArabicNormalizationFilter( result );
>> >>         result = new ArabicStemFilter( result );
>> >>
>> >>         return result;
>> >>
>> >> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>> >>
>> >> As a comparison the PersianAnalyzer has:
>> >>     TokenStream result = new ArabicLetterTokenizer(reader);
>> >>     result = new LowerCaseFilter(result);
>> >>     result = new ArabicNormalizationFilter(result);
>> >>     /* additional persian-specific normalization */
>> >>     result = new PersianNormalizationFilter(result);
>> >>     /*
>> >>      * the order here is important: the stopword list is normalized
>> >> with
>> >> the
>> >>      * above!
>> >>      */
>> >>     result = new StopFilter(result, stoptable);
>> >>
>> >>     return result;
>> >>
>> >>
>> >> Thanks,
>> >> DM
>> >
>> >
>> > --
>> > Robert Muir
>> > rcm...@gmail.com
>> >
>> >
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Arabic Analyzer: possible bug

2009-10-08 Thread Robert Muir
DM, i suppose. but this is a tricky subject, what if you have mixed Arabic /
German or something like that?

for some other languages written in the Latin script, English stopwords
could be bad :)

I think that Lowercasing non-Arabic (also cyrillic, etc), is pretty safe
across the board though.

On Thu, Oct 8, 2009 at 9:29 AM, DM Smith  wrote:

> On 10/08/2009 09:23 AM, Uwe Schindler wrote:
>
>> Just an addition: The lowercase filter is only for the case of embedded
>> non-arabic words. And these will not appear in the stop words.
>>
>>
> I learned something new!
>
> Hmm. If one has a mixed Arabic / English text, shouldn't one be able to
> augment the stopwords list with English stop words? And if so, shouldn't the
> stop filter come after the lower case filter?
>
> -- DM
>
>
>  -Original Message-
>>> From: Basem Narmok [mailto:nar...@gmail.com]
>>> Sent: Thursday, October 08, 2009 4:20 PM
>>> To: java-dev@lucene.apache.org
>>> Subject: Re: Arabic Analyzer: possible bug
>>>
>>> DM, there is no upper/lower cases in Arabic, so don't worry, but the
>>> stop word list needs some corrections and may miss some common/stop
>>> Arabic words.
>>>
>>> Best,
>>>
>>> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith  wrote:
>>>> Robert,
>>>> Thanks for the info.
>>>> As I said, I am illiterate in Arabic. So I have another, perhaps
>>>> nonsensical, question:
>>>> Does the stop word list have every combination of upper/lower case for
>>>> each
>>>> Arabic word in the list? (i.e. is it fully de-normalized?) Or should it
>>>> come
>>>> after LowerCaseFilter?
>>>> -- DM
>>>> On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>>>>
>>>> DM, this isn't a bug.
>>>>
>>>> The arabic stopwords are not normalized.
>>>>
>>>> but for persian, i normalized the stopwords. mostly because i did not
>>>> want
>>>> to have to create variations with farsi yah versus arabic yah for each
>>>> one.
>>>>
>>>> On Thu, Oct 8, 2009 at 7:24 AM, DM Smith  wrote:
>>>>>
>>>>> I'm wondering if there is  a bug in ArabicAnalyzer in 2.9. (I don't
>>>>> know
>>>>> Arabic or Farsi, but have some texts to index in those languages.)
>>>>> The tokenizer/filter chain for ArabicAnalyzer is:
>>>>> TokenStream result = new ArabicLetterTokenizer( reader );
>>>>> result = new StopFilter( result, stoptable );
>>>>> result = new LowerCaseFilter(result);
>>>>> result = new ArabicNormalizationFilter( result );
>>>>> result = new ArabicStemFilter( result );
>>>>>
>>>>> return result;
>>>>>
>>>>> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>>>>>
>>>>> As a comparison the PersianAnalyzer has:
>>>>> TokenStream result = new ArabicLetterTokenizer(reader);
>>>>> result = new LowerCaseFilter(result);
>>>>> result = new ArabicNormalizationFilter(result);
>>>>> /* additional persian-specific normalization */
>>>>> result = new PersianNormalizationFilter(result);
>>>>> /*
>>>>>  * the order here is important: the stopword list is normalized
>>>>> with
>>>>>  * the above!
>>>>>  */
>>>>> result = new StopFilter(result, stoptable);
>>>>>
>>>>> return result;
>>>>>
>>>>>
>>>>> Thanks,
>>>>> DM
>>>>
>>>>
>>>> --
>>>> Robert Muir
>>>> rcm...@gmail.com

>>> -
>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org


-- 
Robert Muir
rcm...@gmail.com
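The ordering point in this thread can be sketched in a few lines. This is an illustrative toy, not Lucene code: the yeh code points stand in for the "farsi yah versus arabic yah" variants mentioned above, and the stopword itself is hypothetical.

```python
FARSI_YEH = "\u06cc"   # Persian letter yeh
ARABIC_YEH = "\u064a"  # Arabic letter yeh

def normalize(token):
    # stand-in for ArabicNormalizationFilter: fold yeh variants
    return token.replace(FARSI_YEH, ARABIC_YEH)

# the stopword list stores the *normalized* form, as PersianAnalyzer does
stopwords = {normalize("ha" + FARSI_YEH)}  # hypothetical stopword

def stop_then_normalize(tokens):
    # ArabicAnalyzer-style order: the stop filter sees unnormalized tokens
    kept = [t for t in tokens if t not in stopwords]
    return [normalize(t) for t in kept]

def normalize_then_stop(tokens):
    # PersianAnalyzer-style order: normalize first, then stop
    normed = [normalize(t) for t in tokens]
    return [t for t in normed if t not in stopwords]

raw = ["ha" + FARSI_YEH]          # surface form written with Farsi yeh
print(stop_then_normalize(raw))   # stopword slips through the stop filter
print(normalize_then_stop(raw))   # stopword is removed
```

The Arabic list avoids the problem differently: its entries are unnormalized, so the stop filter can run first, at the cost of listing every surface variant.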


[jira] Commented: (LUCENE-1953) FastVectorHighlighter: small fragCharSize can cause StringIndexOutOfBoundsException

2009-10-08 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763497#action_12763497
 ] 

Koji Sekiguchi commented on LUCENE-1953:


bq. Koji can't commit to the 2.9 branch can he? Not sure how that karma works - 
we can do it for him if not - lets wait to resolve until thats done though.

I couldn't. The error I got:

{code}
[k...@macbook COMMIT-1953-lucene_2_9]$ svn up
At revision 823174.
[k...@macbook COMMIT-1953-lucene_2_9]$ svn commit -m "LUCENE-1953: 
FastVectorHighlighter: small fragCharSize can cause 
StringIndexOutOfBoundsException"
Sending        contrib/CHANGES.txt
svn: Commit failed (details follow):
svn: CHECKOUT of 
'/repos/asf/!svn/ver/818600/lucene/java/branches/lucene_2_9/contrib/CHANGES.txt':
 403 Forbidden (https://svn.apache.org)
{code}

Can you commit it for me please?

> FastVectorHighlighter: small fragCharSize can cause 
> StringIndexOutOfBoundsException 
> 
>
> Key: LUCENE-1953
> URL: https://issues.apache.org/jira/browse/LUCENE-1953
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/highlighter
>Affects Versions: 2.9
>Reporter: Koji Sekiguchi
>Priority: Trivial
> Fix For: 2.9.1
>
> Attachments: LUCENE-1953.patch
>
>
> If fragCharSize is smaller than Query string, StringIndexOutOfBoundsException 
> is thrown.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Issue Comment Edited: (LUCENE-1953) FastVectorHighlighter: small fragCharSize can cause StringIndexOutOfBoundsException

2009-10-08 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763497#action_12763497
 ] 

Koji Sekiguchi edited comment on LUCENE-1953 at 10/8/09 6:52 AM:
-

Committed revision 823170 in trunk.

bq. Koji can't commit to the 2.9 branch can he? Not sure how that karma works - 
we can do it for him if not - lets wait to resolve until thats done though.

I couldn't. The error I got:

{code}
[k...@macbook COMMIT-1953-lucene_2_9]$ svn up
At revision 823174.
[k...@macbook COMMIT-1953-lucene_2_9]$ svn commit -m "LUCENE-1953: 
FastVectorHighlighter: small fragCharSize can cause 
StringIndexOutOfBoundsException"
Sending        contrib/CHANGES.txt
svn: Commit failed (details follow):
svn: CHECKOUT of 
'/repos/asf/!svn/ver/818600/lucene/java/branches/lucene_2_9/contrib/CHANGES.txt':
 403 Forbidden (https://svn.apache.org)
{code}

Can you commit it for me please?

  was (Author: koji):
bq. Koji can't commit to the 2.9 branch can he? Not sure how that karma 
works - we can do it for him if not - lets wait to resolve until thats done 
though.

I couldn't. The error I got:

{code}
[k...@macbook COMMIT-1953-lucene_2_9]$ svn up
At revision 823174.
[k...@macbook COMMIT-1953-lucene_2_9]$ svn commit -m "LUCENE-1953: 
FastVectorHighlighter: small fragCharSize can cause 
StringIndexOutOfBoundsException"
Sending        contrib/CHANGES.txt
svn: Commit failed (details follow):
svn: CHECKOUT of 
'/repos/asf/!svn/ver/818600/lucene/java/branches/lucene_2_9/contrib/CHANGES.txt':
 403 Forbidden (https://svn.apache.org)
{code}

Can you commit it for me please?
  
> FastVectorHighlighter: small fragCharSize can cause 
> StringIndexOutOfBoundsException 
> 
>
> Key: LUCENE-1953
> URL: https://issues.apache.org/jira/browse/LUCENE-1953
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/highlighter
>Affects Versions: 2.9
>Reporter: Koji Sekiguchi
>Priority: Trivial
> Fix For: 2.9.1
>
> Attachments: LUCENE-1953.patch
>
>
> If fragCharSize is smaller than Query string, StringIndexOutOfBoundsException 
> is thrown.






[jira] Closed: (LUCENE-1962) Persian Arabic Analyzer cleanup

2009-10-08 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer closed LUCENE-1962.
---

Resolution: Fixed

Committed in r823180.
thx robert

> Persian Arabic Analyzer cleanup
> ---
>
> Key: LUCENE-1962
> URL: https://issues.apache.org/jira/browse/LUCENE-1962
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.0
>
> Attachments: LUCENE_1962.patch
>
>
> While browsing through the code I found some places for minor improvements in 
> the new Arabic / Persian Analyzer code. 
> - prevent default stopwords from being loaded each time a default constructor 
> is called
> - replace if blocks with a single switch
> - marking private members final where needed
> - changed protected visibility to final in final class.






RE: Arabic Analyzer: possible bug

2009-10-08 Thread Uwe Schindler
I think the idea of the lowercase filter in the Arabic analyzer is not to
really index mixed-language texts. It is more for the case where you have some
word embedded in the Arabic content (like product names), which happens often.
You see this often in Japanese texts, too. For these embedded English
fragments you really need no stop word list. And if there is a stop word among
them, it is not a real stop word for the target language; it may carry
additional information. Stop word removal is done mostly because stop words
are needless (they appear in every text). But if you have one Arabic sentence
where "the" appears next to an English word, that "the" is more important
than all the "the" in this mail.


Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de






Re: Arabic Analyzer: possible bug

2009-10-08 Thread Robert Muir
Uwe, I might add to what you say. I do disagree a bit and think mixed
english/arabic text is pretty common (aside from the "product name" issue
you discussed).

this can get really complex for some informal text: you have maybe some
english, arabic, and arabic written in informal romanization, sometimes all
mixed together:

Example:
http://www.mahjoob.com/en/forums/showthread.php?t=211597&page=3

Not really sure how to make the default ArabicAnalyzer meet everyone's
needs; in this example it's gonna screw up the romanized arabic, because they
use numerics for some letters, and it uses something based on CharTokenizer
:) But allowing a word to, say, start with or contain a numeric might
not be the best thing for higher-quality text...


On Thu, Oct 8, 2009 at 9:56 AM, Uwe Schindler  wrote:

> I think the idea of lowercase filter in the arabic analyzers is not to
> really index mixed language texts. It is more for the case, if you have
> some
> word between the Arabic content (like product names,.), which happens
> often.
> You see this often also in Japanese texts. And for these embedded English
> fragments you really need no stop word list. And if there is a stop word in
> it, for the target language it is not a real stop word, it may be
> additional
> information. Stop word removal is done mostly because of they are needless
> (appear in every text). But if you have one Arabic sentence where "the"
> also
> appears next to an English word, it is more important than all the "the" in
> this mail.
>
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>
>
>


-- 
Robert Muir
rcm...@gmail.com


[jira] Commented: (LUCENE-1953) FastVectorHighlighter: small fragCharSize can cause StringIndexOutOfBoundsException

2009-10-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763505#action_12763505
 ] 

Mark Miller commented on LUCENE-1953:
-

Just committed, Koji.

> FastVectorHighlighter: small fragCharSize can cause 
> StringIndexOutOfBoundsException 
> 
>
> Key: LUCENE-1953
> URL: https://issues.apache.org/jira/browse/LUCENE-1953
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/highlighter
>Affects Versions: 2.9
>Reporter: Koji Sekiguchi
>Priority: Trivial
> Fix For: 2.9.1
>
> Attachments: LUCENE-1953.patch
>
>
> If fragCharSize is smaller than Query string, StringIndexOutOfBoundsException 
> is thrown.






Re: svn commit: r823189 - in /lucene/java/branches/lucene_2_9/contrib: ./ fast-vector-highlighter/src/java/org/apache/lucene/search/vectorhighlight/ fast-vector-highlighter/src/test/org/apache/lucene/

2009-10-08 Thread Koji Sekiguchi

Thanks, Mark!
Can you change "Trunk" to "2.9 branch" in CHANGES.txt? :-)

+=== Trunk (not yet released) ===

Koji

markrmil...@apache.org wrote:

Author: markrmiller
Date: Thu Oct  8 14:32:09 2009
New Revision: 823189

URL: http://svn.apache.org/viewvc?rev=823189&view=rev
Log:
LUCENE-1953: FastVectorHighlighter: small fragCharSize can cause 
StringIndexOutOfBoundsException

Modified:
lucene/java/branches/lucene_2_9/contrib/CHANGES.txt

lucene/java/branches/lucene_2_9/contrib/fast-vector-highlighter/src/java/org/apache/lucene/search/vectorhighlight/SimpleFragListBuilder.java

lucene/java/branches/lucene_2_9/contrib/fast-vector-highlighter/src/test/org/apache/lucene/search/vectorhighlight/SimpleFragListBuilderTest.java

Modified: lucene/java/branches/lucene_2_9/contrib/CHANGES.txt
URL: 
http://svn.apache.org/viewvc/lucene/java/branches/lucene_2_9/contrib/CHANGES.txt?rev=823189&r1=823188&r2=823189&view=diff
==
--- lucene/java/branches/lucene_2_9/contrib/CHANGES.txt (original)
+++ lucene/java/branches/lucene_2_9/contrib/CHANGES.txt Thu Oct  8 14:32:09 2009
@@ -1,5 +1,14 @@
 Lucene contrib change Log
 
+=== Trunk (not yet released) ===

+
+Changes in backwards compatibility policy
+   
+Bug fixes

+
+ * LUCENE-1953: FastVectorHighlighter: small fragCharSize can cause
+   StringIndexOutOfBoundsException. (Koji Sekiguchi)
+
 === Release 2.9.0 2009-09-23 ===
 
 Changes in runtime behavior


Modified: 
lucene/java/branches/lucene_2_9/contrib/fast-vector-highlighter/src/java/org/apache/lucene/search/vectorhighlight/SimpleFragListBuilder.java
URL: 
http://svn.apache.org/viewvc/lucene/java/branches/lucene_2_9/contrib/fast-vector-highlighter/src/java/org/apache/lucene/search/vectorhighlight/SimpleFragListBuilder.java?rev=823189&r1=823188&r2=823189&view=diff
==
--- 
lucene/java/branches/lucene_2_9/contrib/fast-vector-highlighter/src/java/org/apache/lucene/search/vectorhighlight/SimpleFragListBuilder.java
 (original)
+++ 
lucene/java/branches/lucene_2_9/contrib/fast-vector-highlighter/src/java/org/apache/lucene/search/vectorhighlight/SimpleFragListBuilder.java
 Thu Oct  8 14:32:09 2009
@@ -59,6 +59,8 @@
   int st = phraseInfo.getStartOffset() - MARGIN < startOffset ?
   startOffset : phraseInfo.getStartOffset() - MARGIN;
   int en = st + fragCharSize;
+  if( phraseInfo.getEndOffset() > en )
+en = phraseInfo.getEndOffset();
   startOffset = en;
 
   while( true ){
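In plain Python, the boundary logic of the hunk above looks like the sketch below. The constants mirror what the 2.9 sources appear to use (MARGIN = 6, MIN_FRAG_CHAR_SIZE = 18); treat them as assumptions, and the function as an illustration rather than the actual Java.

```python
MARGIN = 6  # assumed value of SimpleFragListBuilder.MARGIN

def fragment_bounds(phrase_start, phrase_end, start_offset, frag_char_size):
    # mirrors the Java: back the fragment start up by MARGIN (clamped at
    # start_offset), then extend the end to cover the whole phrase match,
    # which is the added fix for LUCENE-1953
    st = start_offset if phrase_start - MARGIN < start_offset else phrase_start - MARGIN
    en = st + frag_char_size
    if phrase_end > en:   # without this clamp, a phrase longer than
        en = phrase_end   # fragCharSize overruns the fragment end
    return st, en

# 19-char match with fragCharSize 18, as in the new test case:
print(fragment_bounds(0, 19, 0, 18))  # (0, 19) -- end extended to the phrase end
```

Before the fix, `en` stayed at `st + fragCharSize`, so later substring calls sliced past a shorter fragment than the match and threw StringIndexOutOfBoundsException.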


Modified: 
lucene/java/branches/lucene_2_9/contrib/fast-vector-highlighter/src/test/org/apache/lucene/search/vectorhighlight/SimpleFragListBuilderTest.java
URL: 
http://svn.apache.org/viewvc/lucene/java/branches/lucene_2_9/contrib/fast-vector-highlighter/src/test/org/apache/lucene/search/vectorhighlight/SimpleFragListBuilderTest.java?rev=823189&r1=823188&r2=823189&view=diff
==
--- 
lucene/java/branches/lucene_2_9/contrib/fast-vector-highlighter/src/test/org/apache/lucene/search/vectorhighlight/SimpleFragListBuilderTest.java
 (original)
+++ 
lucene/java/branches/lucene_2_9/contrib/fast-vector-highlighter/src/test/org/apache/lucene/search/vectorhighlight/SimpleFragListBuilderTest.java
 Thu Oct  8 14:32:09 2009
@@ -37,6 +37,21 @@
 }
   }
   
+  public void testSmallerFragSizeThanTermQuery() throws Exception {
+    SimpleFragListBuilder sflb = new SimpleFragListBuilder();
+    FieldFragList ffl = sflb.createFieldFragList( fpl( "abcdefghijklmnopqrs", "abcdefghijklmnopqrs" ), SimpleFragListBuilder.MIN_FRAG_CHAR_SIZE );
+    assertEquals( 1, ffl.fragInfos.size() );
+    assertEquals( "subInfos=(abcdefghijklmnopqrs((0,19)))/1.0(0,19)", ffl.fragInfos.get( 0 ).toString() );
+  }
+  
+  public void testSmallerFragSizeThanPhraseQuery() throws Exception {
+    SimpleFragListBuilder sflb = new SimpleFragListBuilder();
+    FieldFragList ffl = sflb.createFieldFragList( fpl( "\"abcdefgh jklmnopqrs\"", "abcdefgh   jklmnopqrs" ), SimpleFragListBuilder.MIN_FRAG_CHAR_SIZE );
+    assertEquals( 1, ffl.fragInfos.size() );
+    System.out.println( ffl.fragInfos.get( 0 ).toString() );
+    assertEquals( "subInfos=(abcdefghjklmnopqrs((0,21)))/1.0(0,21)", ffl.fragInfos.get( 0 ).toString() );
+  }
+  
   public void test1TermIndex() throws Exception {

 SimpleFragListBuilder sflb = new SimpleFragListBuilder();
 FieldFragList ffl = sflb.createFieldFragList( fpl( "a", "a" ), 100 );



  






[jira] Commented: (LUCENE-1951) wildcardquery rewrite improvements

2009-10-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763507#action_12763507
 ] 

Robert Muir commented on LUCENE-1951:
-

think there would be objection to making this proposed SingleTermEnum public?

I would like to use it in LUCENE-1606 (contrib) to have consistency there as 
well.

> wildcardquery rewrite improvements
> --
>
> Key: LUCENE-1951
> URL: https://issues.apache.org/jira/browse/LUCENE-1951
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>Reporter: Robert Muir
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1951.patch, LUCENE-1951_bwcompatbranch.patch
>
>
> wildcardquery has logic to rewrite to termquery if there is no wildcard 
> character, but
> * it needs to pass along the boost if it does this
> * if the user asked for a 'constant score' rewriteMethod, it should rewrite 
> to a constant score query for consistency.
> additionally, if the query is really a prefixquery, it would be nice to 
> rewrite to prefix query.
> both will enumerate the same number of terms, but prefixquery has a simpler 
> comparison function.






[jira] Resolved: (LUCENE-1953) FastVectorHighlighter: small fragCharSize can cause StringIndexOutOfBoundsException

2009-10-08 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved LUCENE-1953.


Resolution: Fixed

Thanks, Mark!

BTW, I cannot assign myself because I cannot find "Assign" link in Lucene JIRA. 
Could anyone solve this problem?

> FastVectorHighlighter: small fragCharSize can cause 
> StringIndexOutOfBoundsException 
> 
>
> Key: LUCENE-1953
> URL: https://issues.apache.org/jira/browse/LUCENE-1953
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/highlighter
>Affects Versions: 2.9
>Reporter: Koji Sekiguchi
>Priority: Trivial
> Fix For: 2.9.1
>
> Attachments: LUCENE-1953.patch
>
>
> If fragCharSize is smaller than Query string, StringIndexOutOfBoundsException 
> is thrown.






[jira] Commented: (LUCENE-1951) wildcardquery rewrite improvements

2009-10-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763521#action_12763521
 ] 

Michael McCandless commented on LUCENE-1951:


bq. think there would be objection to making this proposed SingleTermEnum 
public?

I think that's fine.

> wildcardquery rewrite improvements
> --
>
> Key: LUCENE-1951
> URL: https://issues.apache.org/jira/browse/LUCENE-1951
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>Reporter: Robert Muir
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1951.patch, LUCENE-1951_bwcompatbranch.patch
>
>
> wildcardquery has logic to rewrite to termquery if there is no wildcard 
> character, but
> * it needs to pass along the boost if it does this
> * if the user asked for a 'constant score' rewriteMethod, it should rewrite 
> to a constant score query for consistency.
> additionally, if the query is really a prefixquery, it would be nice to 
> rewrite to prefix query.
> both will enumerate the same number of terms, but prefixquery has a simpler 
> comparison function.






Re: Arabic Analyzer: possible bug

2009-10-08 Thread DM Smith

Robert,
Yes it is tricky.

I'm not suggesting that the ArabicAnalyzer have any stopwords other than 
Arabic.


I'm suggesting that if I know my input document well and know that it 
has mixed text and that the text is Arabic and one other known language 
that I might want to augment the stop list with stop words appropriate 
for that known language. I think that in this case, stop filter should 
be after lower case filter.
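A toy sketch of that suggestion (illustrative Python, not Lucene): with English stop words added to the set in lowercase, the stop filter only catches a capitalized "The" if it runs after lowercasing.

```python
# augmented stopword set, stored lowercase
stopwords = {"the", "of"}

def stop_then_lower(tokens):
    # current ArabicAnalyzer order: stop filter before lower-casing
    return [t.lower() for t in tokens if t not in stopwords]

def lower_then_stop(tokens):
    # suggested order: lower-case first, then stop filter
    return [t for t in (tok.lower() for tok in tokens) if t not in stopwords]

tokens = ["The", "capital", "of", "Jordan"]
print(stop_then_lower(tokens))   # ['the', 'capital', 'jordan'] -- "The" slips through
print(lower_then_stop(tokens))   # ['capital', 'jordan']
```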


As to lower casing across the board, I also think it is pretty safe. But 
I think there are some edge cases. For example, lowercasing a Greek word 
written in all upper case and ending in sigma will not produce the same 
result as the same Greek word already written in lower case. The Greek 
word should end with a final sigma rather than a medial sigma. For Greek, 
using an UpperCaseFilter followed by a LowerCaseFilter would handle this 
case.


IMHO, this is not an issue for the Arabic or Persian analyzers.

-- DM

On 10/08/2009 09:36 AM, Robert Muir wrote:
DM, i suppose. but this is a tricky subject, what if you have mixed 
Arabic / German or something like that?


for some other languages written in the Latin script, English 
stopwords could be bad :)


I think that Lowercasing non-Arabic (also cyrillic, etc), is pretty 
safe across the board though.


On Thu, Oct 8, 2009 at 9:29 AM, DM Smith > wrote:


On 10/08/2009 09:23 AM, Uwe Schindler wrote:

Just an addition: The lowercase filter is only for the case of
embedded
non-arabic words. And these will not appear in the stop words.

I learned something new!

Hmm. If one has a mixed Arabic / English text, shouldn't one be
able to augment the stopwords list with English stop words? And if
so, shouldn't the stop filter come after the lower case filter?

-- DM


-Original Message-
From: Basem Narmok [mailto:nar...@gmail.com]
Sent: Thursday, October 08, 2009 4:20 PM
To: java-dev@lucene.apache.org

Subject: Re: Arabic Analyzer: possible bug

DM, there is no upper/lower cases in Arabic, so don't
worry, but the
stop word list needs some corrections and may miss some
common/stop
Arabic words.

Best,

On Thu, Oct 8, 2009 at 4:14 PM, DM Smith <dmsmith...@gmail.com> wrote:

Robert,
Thanks for the info.
As I said, I am illiterate in Arabic. So I have
another, perhaps
nonsensical, question:
Does the stop word list have every combination of
upper/lower case for

each

Arabic word in the list? (i.e. is it fully
de-normalized?) Or should it

come

after LowerCaseFilter?
-- DM
On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:

DM, this isn't a bug.

The arabic stopwords are not normalized.

but for persian, i normalized the stopwords. mostly
because i did not

want

to have to create variations with farsi yah versus
arabic yah for each

one.

On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith...@gmail.com> wrote:

I'm wondering if there is  a bug in ArabicAnalyzer
in 2.9. (I don't

know

Arabic or Farsi, but have some texts to index in
those languages.)
The tokenizer/filter chain for ArabicAnalyzer is:
TokenStream result = new
ArabicLetterTokenizer( reader );
result = new StopFilter( result, stoptable );
result = new LowerCaseFilter(result);
result = new ArabicNormalizationFilter(
result );
result = new ArabicStemFilter( result );

return result;

Shouldn't the StopFilter come after
ArabicNormalizationFilter?

As a comparison the PersianAnalyzer has:
TokenStream result = new
ArabicLetterTokenizer(reader);
result = new LowerCaseFilter(result);
result = new ArabicNormalizationFilter(result);
/* additional persian-specific normalization */
result = new PersianNormalizationFilter(result);
/*
 * the order here is important: the stopword
list is normalized

with

the
  

[jira] Updated: (LUCENE-1951) wildcardquery rewrite improvements

2009-10-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1951:


Attachment: LUCENE-1951.patch

updated patch, using SingleTermEnum instead of TermQuery rewrite when there are 
no wildcards to preserve all the MultiTermQuery semantics.
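The rewrite decision being discussed can be summarized with a hypothetical classifier (illustrative Python, not Lucene's code; the three labels stand for a SingleTermEnum-style rewrite, a PrefixQuery-style rewrite, and full wildcard enumeration):

```python
def classify_wildcard(pattern, wildcard="*", single="?"):
    # no wildcard characters at all: enumerates exactly one term
    if wildcard not in pattern and single not in pattern:
        return "single-term"
    # a lone trailing '*' is a prefix pattern: same terms enumerated,
    # but the per-term comparison is cheaper than wildcard matching
    if pattern.endswith(wildcard) and wildcard not in pattern[:-1] and single not in pattern:
        return "prefix"
    return "wildcard"

print(classify_wildcard("lucene"))  # single-term
print(classify_wildcard("luc*"))    # prefix
print(classify_wildcard("l?cene"))  # wildcard
```

Routing the wildcard-free case through a single-term enumeration rather than a plain TermQuery keeps the boost and the constant-score rewrite method intact, which is the point of the patch.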


> wildcardquery rewrite improvements
> --
>
> Key: LUCENE-1951
> URL: https://issues.apache.org/jira/browse/LUCENE-1951
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>Reporter: Robert Muir
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1951.patch, LUCENE-1951.patch, 
> LUCENE-1951_bwcompatbranch.patch
>
>
> wildcardquery has logic to rewrite to termquery if there is no wildcard 
> character, but
> * it needs to pass along the boost if it does this
> * if the user asked for a 'constant score' rewriteMethod, it should rewrite 
> to a constant score query for consistency.
> additionally, if the query is really a prefixquery, it would be nice to 
> rewrite to prefix query.
> both will enumerate the same number of terms, but prefixquery has a simpler 
> comparison function.






[jira] Commented: (LUCENE-1953) FastVectorHighlighter: small fragCharSize can cause StringIndexOutOfBoundsException

2009-10-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763526#action_12763526
 ] 

Mark Miller commented on LUCENE-1953:
-

I think that means someone has to give you JIRA power and hasn't yet - can't 
remember who to bug on that - Hoss or Grant I think? Perhaps the right person 
is watching ...

> FastVectorHighlighter: small fragCharSize can cause 
> StringIndexOutOfBoundsException 
> 
>
> Key: LUCENE-1953
> URL: https://issues.apache.org/jira/browse/LUCENE-1953
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/highlighter
>Affects Versions: 2.9
>Reporter: Koji Sekiguchi
>Priority: Trivial
> Fix For: 2.9.1
>
> Attachments: LUCENE-1953.patch
>
>
> If fragCharSize is smaller than Query string, StringIndexOutOfBoundsException 
> is thrown.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: Arabic Analyzer: possible bug

2009-10-08 Thread Robert Muir
>
> I'm suggesting that if I know my input document well and know that it has
> mixed text and that the text is Arabic and one other known language that I
> might want to augment the stop list with stop words appropriate for that
> known language. I think that in this case, stop filter should be after lower
> case filter.
>
 I think this is a good idea?

>
> As to lower casing across the board, I also think it is pretty safe. But I
> think there are some edge cases. For example, lowercasing a Greek word in
> all upper case ending in sigma will not produce the same as lower casing the
> same Greek word in all lower case. The Greek word should have a final sigma
> rather than a small sigma. For Greek, using an UpperCaseFilter followed by a
> LowerCaseFilter would handle this case.
>
or you could use unicode case folding. lowercasing is for display purposes,
not search.
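The sigma edge case is easy to reproduce. Python's str.lower() (3.3+) applies the context-sensitive Unicode rule, while lowering one character at a time -- which is what a per-character LowerCaseFilter amounts to -- always yields the medial form:

```python
word = "\u0399\u03a7\u0398\u03a5\u03a3"  # Greek "fish" in all upper case

full = word.lower()                          # context-aware full case mapping
per_char = "".join(c.lower() for c in word)  # per-character mapping

print(full[-1] == "\u03c2")      # final sigma at the end of the word
print(per_char[-1] == "\u03c3")  # medial sigma everywhere
print(full == per_char)          # the two results differ
```

Case folding maps both sigmas to the same code point, which is why it is the better operation for search.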

>
> IMHO, this is not an issue for the Arabic or Persian analyzers.
>
> -- DM
>
>
> On 10/08/2009 09:36 AM, Robert Muir wrote:
>
> DM, i suppose. but this is a tricky subject, what if you have mixed Arabic
> / German or something like that?
>
> for some other languages written in the Latin script, English stopwords
> could be bad :)
>
> I think that Lowercasing non-Arabic (also cyrillic, etc), is pretty safe
> across the board though.
>
> On Thu, Oct 8, 2009 at 9:29 AM, DM Smith  wrote:
>
>> On 10/08/2009 09:23 AM, Uwe Schindler wrote:
>>
>>> Just an addition: The lowercase filter is only for the case of embedded
>>> non-arabic words. And these will not appear in the stop words.
>>>
>>>
>>  I learned something new!
>>
>> Hmm. If one has a mixed Arabic / English text, shouldn't one be able to
>> augment the stopwords list with English stop words? And if so, shouldn't the
>> stop filter come after the lower case filter?
>>
>> -- DM
>>
>>  -Original Message-
 From: Basem Narmok [mailto:nar...@gmail.com]
 Sent: Thursday, October 08, 2009 4:20 PM
 To: java-dev@lucene.apache.org
 Subject: Re: Arabic Analyzer: possible bug

 DM, there is no upper/lower cases in Arabic, so don't worry, but the
 stop word list needs some corrections and may miss some common/stop
 Arabic words.

 Best,

 On Thu, Oct 8, 2009 at 4:14 PM, DM Smith  wrote:


> Robert,
> Thanks for the info.
> As I said, I am illiterate in Arabic. So I have another, perhaps
> nonsensical, question:
> Does the stop word list have every combination of upper/lower case for
>
>
 each


> Arabic word in the list? (i.e. is it fully de-normalized?) Or should it
>
>
 come


> after LowerCaseFilter?
> -- DM
> On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>
> DM, this isn't a bug.
>
> The arabic stopwords are not normalized.
>
> but for persian, i normalized the stopwords. mostly because i did not
>
>
 want


> to have to create variations with farsi yah versus arabic yah for each
>
>
 one.


> On Thu, Oct 8, 2009 at 7:24 AM, DM Smith  wrote:
>
>
>> I'm wondering if there is  a bug in ArabicAnalyzer in 2.9. (I don't
>>
>>
>  know


> Arabic or Farsi, but have some texts to index in those languages.)
>> The tokenizer/filter chain for ArabicAnalyzer is:
>> TokenStream result = new ArabicLetterTokenizer( reader );
>> result = new StopFilter( result, stoptable );
>> result = new LowerCaseFilter(result);
>> result = new ArabicNormalizationFilter( result );
>> result = new ArabicStemFilter( result );
>>
>> return result;
>>
>> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>>
>> As a comparison the PersianAnalyzer has:
>> TokenStream result = new ArabicLetterTokenizer(reader);
>> result = new LowerCaseFilter(result);
>> result = new ArabicNormalizationFilter(result);
>> /* additional persian-specific normalization */
>> result = new PersianNormalizationFilter(result);
>> /*
>>  * the order here is important: the stopword list is normalized
>>
>>
>  with


> the
>>  * above!
>>  */
>> result = new StopFilter(result, stoptable);
>>
>> return result;
>>
>>
>> Thanks,
>> DM
>>
>>
>
> --
> Robert Muir
> rcm...@gmail.com
>

>


-- 
Robert Muir
rcm...@gmail.com


Re: Arabic Analyzer: possible bug

2009-10-08 Thread Robert Muir
DM by the way, if you want this lowercasing behavior with edge cases, check
out LUCENE-1488. There is a case folding filter there, as well as a
normalization filter, and they interact correctly for what you want :)

it's my understanding that contrib/analyzers should not have any external
dependencies, and it could be eons before the JDK exposes these things, so I
don't know what to do. It would be nice if things like ArabicAnalyzer
handled Greek edge cases correctly, don't you think?

On Thu, Oct 8, 2009 at 11:38 AM, Robert Muir  wrote:

>  I'm suggesting that if I know my input document well and know that it has
>> mixed text and that the text is Arabic and one other known language that I
>> might want to augment the stop list with stop words appropriate for that
>> known language. I think that in this case, stop filter should be after lower
>> case filter.
>>
>  I think this is a good idea?
>
>>
>> As to lower casing across the board, I also think it is pretty safe. But I
>> think there are some edge cases. For example, lowercasing a Greek word in
>> all upper case ending in sigma will not produce the same as lower casing the
>> same Greek word in all lower case. The Greek word should have a final sigma
>> rather than a small sigma. For Greek, using an UpperCaseFilter followed by a
>> LowerCaseFilter would handle this case.
>>
> or you could use unicode case folding. lowercasing is for display purposes,
> not search.
>
>>
>> IMHO, this is not an issue for the Arabic or Persian analyzers.
>>
>> -- DM
>>
>>
>> On 10/08/2009 09:36 AM, Robert Muir wrote:
>>
>> DM, i suppose. but this is a tricky subject, what if you have mixed Arabic
>> / German or something like that?
>>
>> for some other languages written in the Latin script, English stopwords
>> could be bad :)
>>
>> I think that Lowercasing non-Arabic (also cyrillic, etc), is pretty safe
>> across the board though.
>>
>> On Thu, Oct 8, 2009 at 9:29 AM, DM Smith  wrote:
>>
>>> On 10/08/2009 09:23 AM, Uwe Schindler wrote:
>>>
 Just an addition: The lowercase filter is only for the case of embedded
 non-arabic words. And these will not appear in the stop words.


>>>  I learned something new!
>>>
>>> Hmm. If one has a mixed Arabic / English text, shouldn't one be able to
>>> augment the stopwords list with English stop words? And if so, shouldn't the
>>> stop filter come after the lower case filter?
>>>
>>> -- DM
>>>
>>>  -Original Message-
> From: Basem Narmok [mailto:nar...@gmail.com]
> Sent: Thursday, October 08, 2009 4:20 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Arabic Analyzer: possible bug
>
> DM, there is no upper/lower cases in Arabic, so don't worry, but the
> stop word list needs some corrections and may miss some common/stop
> Arabic words.
>
> Best,
>
> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith  wrote:
>
>
>> Robert,
>> Thanks for the info.
>> As I said, I am illiterate in Arabic. So I have another, perhaps
>> nonsensical, question:
>> Does the stop word list have every combination of upper/lower case for
>>
>>
> each
>
>
>> Arabic word in the list? (i.e. is it fully de-normalized?) Or should
>> it
>>
>>
> come
>
>
>> after LowerCaseFilter?
>> -- DM
>> On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>>
>> DM, this isn't a bug.
>>
>> The arabic stopwords are not normalized.
>>
>> but for persian, i normalized the stopwords. mostly because i did not
>>
>>
> want
>
>
>> to have to create variations with farsi yah versus arabic yah for each
>>
>>
> one.
>
>
>> On Thu, Oct 8, 2009 at 7:24 AM, DM Smith
>>  wrote:
>>
>>
>>> I'm wondering if there is  a bug in ArabicAnalyzer in 2.9. (I don't
>>>
>>>
>>  know
>
>
>> Arabic or Farsi, but have some texts to index in those languages.)
>>> The tokenizer/filter chain for ArabicAnalyzer is:
>>> TokenStream result = new ArabicLetterTokenizer( reader );
>>> result = new StopFilter( result, stoptable );
>>> result = new LowerCaseFilter(result);
>>> result = new ArabicNormalizationFilter( result );
>>> result = new ArabicStemFilter( result );
>>>
>>> return result;
>>>
>>> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>>>
>>> As a comparison the PersianAnalyzer has:
>>> TokenStream result = new ArabicLetterTokenizer(reader);
>>> result = new LowerCaseFilter(result);
>>> result = new ArabicNormalizationFilter(result);
>>> /* additional persian-specific normalization */
>>> result = new PersianNormalizationFilter(result);
>>> /*
>>>  * the order here is important: the stopword list is normalized
>>>
>>>
>>  with
>
>
>> the
>>>  * above!
>>> 

[jira] Created: (LUCENE-1963) ArabicAnalyzer: Lowercase before Stopfilter

2009-10-08 Thread Robert Muir (JIRA)
ArabicAnalyzer: Lowercase before Stopfilter
---

 Key: LUCENE-1963
 URL: https://issues.apache.org/jira/browse/LUCENE-1963
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.9
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Trivial
 Attachments: LUCENE-1963.patch

ArabicAnalyzer lowercases text in case you have some non-Arabic text around.
It also allows you to set a custom stopword list (you might augment the Arabic 
list with some English ones, for example).

In this case it's helpful, for these non-Arabic stopwords, to lowercase before 
the stop filter.
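
The ordering issue generalizes beyond Lucene. The sketch below is a hypothetical plain-Java token pipeline (the class and method names are made up for illustration, not Lucene APIs): running the stop filter before lowercasing misses capitalized stopwords such as "The", while running it after catches them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopOrderDemo {
    // Toy analyzer: optionally lowercase before stop filtering (the proposed
    // order), or stop-filter first and lowercase afterwards (the old order).
    static List<String> analyze(List<String> tokens, Set<String> stopwords,
                                boolean lowercaseFirst) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            String s = lowercaseFirst ? t.toLowerCase(Locale.ROOT) : t;
            if (!stopwords.contains(s)) {
                // Either way the surviving token is lowercased for indexing.
                out.add(lowercaseFirst ? s : s.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<String>(Arrays.asList("the"));
        List<String> tokens = Arrays.asList("The", "Nile");
        // Stop filter before lowercasing: "The" != "the", so it slips through.
        System.out.println(analyze(tokens, stop, false)); // [the, nile]
        // Stop filter after lowercasing: "the" is caught.
        System.out.println(analyze(tokens, stop, true));  // [nile]
    }
}
```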


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1963) ArabicAnalyzer: Lowercase before Stopfilter

2009-10-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1963:


Attachment: LUCENE-1963.patch

Simple patch, but we will need to warn in CHANGES.txt that folks should reindex 
if they are using non-Arabic stopwords.

> ArabicAnalyzer: Lowercase before Stopfilter
> ---
>
> Key: LUCENE-1963
> URL: https://issues.apache.org/jira/browse/LUCENE-1963
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Trivial
> Attachments: LUCENE-1963.patch
>
>
> ArabicAnalyzer lowercases text in case you have some non-Arabic text around.
> It also allows you to set a custom stopword list (you might augment the 
> Arabic list with some English ones, for example).
> In this case its helpful for these non-Arabic stopwords, to lowercase before 
> stopfilter.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1121) Use nio.transferTo when copying large blocks of bytes

2009-10-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763544#action_12763544
 ] 

Mark Miller commented on LUCENE-1121:
-

Isn't this still a nice little optimization for compound copies? When not using 
Windows Server, it's faster in general, and even when performance is similar, you 
get the lower CPU usage.

At worst, it seems we should enable it for that case when detecting non-Windows? 
We could even throw in a couple of specific Windows versions we know work well - 
the XP results I got were fantastic, and the ones Mike got were not bad. Probably 
not necessary, as most deployments will probably be on a server, but future 
versions might be better.

Seems like a little win on 'nix systems anyway, just from the CPU savings.
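
For reference, here is a minimal stand-alone sketch of a chunked FileChannel.transferTo copy. It uses the modern NIO.2 file API rather than Lucene's Directory abstraction, and the class and method names are illustrative only, not anything from the patch:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TransferToDemo {
    // Copy src to dst via FileChannel.transferTo, chunking the copy so a very
    // large file is not handed to the kernel as one huge request.
    static void copy(Path src, Path dst, long chunkSize) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.CREATE,
                                                StandardOpenOption.WRITE,
                                                StandardOpenOption.TRUNCATE_EXISTING)) {
            long pos = 0, size = in.size();
            while (pos < size) {
                // transferTo may copy fewer bytes than asked; advance by the
                // count it actually reports.
                pos += in.transferTo(pos, Math.min(chunkSize, size - pos), out);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("nio", ".src");
        Path dst = Files.createTempFile("nio", ".dst");
        Files.write(src, "hello transferTo".getBytes("UTF-8"));
        copy(src, dst, 4); // tiny chunk size just to exercise the loop
        System.out.println(new String(Files.readAllBytes(dst), "UTF-8")); // prints "hello transferTo"
    }
}
```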

> Use nio.transferTo when copying large blocks of bytes
> -
>
> Key: LUCENE-1121
> URL: https://issues.apache.org/jira/browse/LUCENE-1121
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1121.patch, LUCENE-1121.patch, testIO.java
>
>
> When building a CFS file, and also when merging stored fields (and
> term vectors, with LUCENE-1120), we copy large blocks of bytes at
> once.
> We currently do this with an intermediate buffer.
> But, nio.transferTo should be somewhat faster on OS's that offer low
> level IO APIs for moving blocks of bytes between files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1963) ArabicAnalyzer: Lowercase before Stopfilter

2009-10-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1963:


Fix Version/s: 3.0

If no one objects, I'd like to commit this for 3.0 at the end of the day.

> ArabicAnalyzer: Lowercase before Stopfilter
> ---
>
> Key: LUCENE-1963
> URL: https://issues.apache.org/jira/browse/LUCENE-1963
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Trivial
> Fix For: 3.0
>
> Attachments: LUCENE-1963.patch
>
>
> ArabicAnalyzer lowercases text in case you have some non-Arabic text around.
> It also allows you to set a custom stopword list (you might augment the 
> Arabic list with some English ones, for example).
> In this case its helpful for these non-Arabic stopwords, to lowercase before 
> stopfilter.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1963) ArabicAnalyzer: Lowercase before Stopfilter

2009-10-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1963:


Attachment: LUCENE-1963.patch

This patch also updates the javadocs to reflect the new filter order in 
ArabicAnalyzer, to prevent any confusion for users.

> ArabicAnalyzer: Lowercase before Stopfilter
> ---
>
> Key: LUCENE-1963
> URL: https://issues.apache.org/jira/browse/LUCENE-1963
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Trivial
> Fix For: 3.0
>
> Attachments: LUCENE-1963.patch, LUCENE-1963.patch
>
>
> ArabicAnalyzer lowercases text in case you have some non-Arabic text around.
> It also allows you to set a custom stopword list (you might augment the 
> Arabic list with some English ones, for example).
> In this case its helpful for these non-Arabic stopwords, to lowercase before 
> stopfilter.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Arabic Analyzer: possible bug

2009-10-08 Thread DM Smith

On 10/08/2009 11:46 AM, Robert Muir wrote:
DM by the way, if you want this lowercasing behavior with edge cases, 
check out LUCENE-1488. There is a case folding filter there, as well 
as a normalization filter, and they interact correctly for what you 
want :)

Robert,

So cool. I've been following the emails on this to java-dev that JIRA 
puts out, but I had not looked at the patch till now. Brought tears to 
my eyes.


How ready is it? I'd like to use it if it is "good enough".

BTW, does it handle the case where ' (an apostrophe) is used as a 
character in some languages? (IIRC in some African languages it is a 
whistle.) That is, do you know whether ICU will consider the context of 
adjacent characters in determining whether something is a word break?




It's my understanding that contrib/analyzers should not have any 
external dependencies,
That's my understanding too. But there has to be a way to provide it 
without duplicating code.



so it could be eons before the jdk exposes these things
I'm using ICU now for that very reason. It takes too long for the JDK to 
be current on anything, let alone something that Java boasted of in the 
early days.



, so I don't know what to do. It would be nice if things like 
ArabicAnalyzer handled greek edge cases correctly, don't you think?


I do think so. Maybe in the new package (org.apache.lucene.icu) have a 
subpackage analyzer that's dependent on contrib/analyzers. Or create a 
PluggableAnalyzer to which one could supply a Tokenizer and an ordered list 
of Filters, changing the contrib/analyzers to derive from it. Or, use 
reflection to bring in the ICU ability if the lucene-icu.jar is present. 
Or, ...


Right now, for each of the contrib/analyzers I have my own copy that 
mimics them but doesn't use the StandardAnalyzer/StandardFilter (I think 
I want to use LUCENE-1488), does NFKC normalization, optionally uses a 
StopFilter (sometimes it is hard to dig out the stop set from the 
analyzers) and optionally uses a stemmer (snowball if available.) 
Basically, I like all the parts that were provided by a 
contrib/analyzer, but I have different requirements than how those parts 
were packaged by the contrib/analyzer's Analyzer. (Thus my question on 
the order of filters in ArabicAnalyzer).


It'd really be nice if there were a way to specify that "tool chain". 
Ideally, I'd like to get the default chain, and modify it. (And I'd like 
to store a description of that tool chain with the index, with version 
info for each of the parts, so that I can tell when an index needs to be 
rebuilt.)


-- DM



On Thu, Oct 8, 2009 at 11:38 AM, Robert Muir wrote:


I'm suggesting that if I know my input document well and know
that it has mixed text and that the text is Arabic and one
other known language that I might want to augment the stop
list with stop words appropriate for that known language. I
think that in this case, stop filter should be after lower
case filter.

 I think this is a good idea?


As to lower casing across the board, I also think it is pretty
safe. But I think there are some edge cases. For example,
lowercasing a Greek word in all upper case ending in sigma
will not produce the same as lower casing the same Greek word
in all lower case. The Greek word should have a final sigma
rather than a small sigma. For Greek, using an UpperCaseFilter
followed by a LowerCaseFilter would handle this case.

or you could use unicode case folding. lowercasing is for display
purposes, not search.
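
The distinction between lowercasing and case folding is visible in plain Java, with no Lucene involved; the class below is a made-up illustration. Full-string lowercasing applies the context-sensitive Final_Sigma rule from Unicode SpecialCasing, while a per-character filter cannot:

```java
import java.util.Locale;

public class SigmaDemo {
    // Lowercase one char at a time, the way a simple per-character lowercase
    // filter works; this cannot see context, so the Final_Sigma rule never fires.
    static String perCharLower(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String word = "\u039F\u0394\u039F\u03A3"; // "ΟΔΟΣ": all caps, ends in capital sigma
        // String.toLowerCase applies Unicode SpecialCasing, producing the
        // final sigma form (U+03C2) at the end of the word...
        System.out.println(Integer.toHexString(
                word.toLowerCase(Locale.ROOT).charAt(3)));
        // ...while per-character lowercasing yields small sigma (U+03C3),
        // so the two approaches index different tokens.
        System.out.println(Integer.toHexString(perCharLower(word).charAt(3)));
    }
}
```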


IMHO, this is not an issue for the Arabic or Persian analyzers.

-- DM


On 10/08/2009 09:36 AM, Robert Muir wrote:

DM, i suppose. but this is a tricky subject, what if you have
mixed Arabic / German or something like that?

for some other languages written in the Latin script, English
stopwords could be bad :)

I think that Lowercasing non-Arabic (also cyrillic, etc), is
pretty safe across the board though.

On Thu, Oct 8, 2009 at 9:29 AM, DM Smith wrote:

On 10/08/2009 09:23 AM, Uwe Schindler wrote:

Just an addition: The lowercase filter is only for
the case of embedded
non-arabic words. And these will not appear in the
stop words.

I learned something new!

Hmm. If one has a mixed Arabic / English text, shouldn't
one be able to augment the stopwords list with English
stop words? And if so, shouldn't the stop filter come
after the lower case filter?

-- DM


-Original Message-
From: Basem Narmok [mailto:nar...@gmail.com]
Sent: Thursday, October 08, 2009 4:20

[jira] Commented: (LUCENE-1963) ArabicAnalyzer: Lowercase before Stopfilter

2009-10-08 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763554#action_12763554
 ] 

DM Smith commented on LUCENE-1963:
--

Can you commit it to 2.9.1 too? (For those stuck on Java 1.4, there is no 3.0.)

> ArabicAnalyzer: Lowercase before Stopfilter
> ---
>
> Key: LUCENE-1963
> URL: https://issues.apache.org/jira/browse/LUCENE-1963
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Trivial
> Fix For: 3.0
>
> Attachments: LUCENE-1963.patch, LUCENE-1963.patch
>
>
> ArabicAnalyzer lowercases text in case you have some non-Arabic text around.
> It also allows you to set a custom stopword list (you might augment the 
> Arabic list with some English ones, for example).
> In this case its helpful for these non-Arabic stopwords, to lowercase before 
> stopfilter.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Arabic Analyzer: possible bug

2009-10-08 Thread Robert Muir
DM, thanks. I will reply to your comments below.

How ready is it? I'd like to use it if it is "good enough".
>

It is not committed yet, so I think it would be best to say it is not ready,
but I think it works; give it a try if you have time :). Mainly it needs
better docs and tests, but I am focusing on making it customizable, and I
think I need to improve the API for this.


>
> BTW, does it handle the case where ' (an apostrophe) is used as a character
> in some languages? (IIRC in some African languages it is a whistle.) That
> is, do you know whether ICU will consider the context of adjacent characters
> in determining whether something is a word break?
>

I do not think it does this by default (it might depend if the apostrophe is
at the end of the word, or middle of a word, I'd have to check UAX#29 and
the appropriate properties).

If you look at the patch, you can see I customized the RBBI rules for Hebrew
script to take single quote and double quote into account. In this case,
double quote is allowed to be "MidLetter", for acronyms, and single quote is
allowed to "Extend", so it can represent a transliterated character.

So, you could do the same thing for the Latin script, if the unicode
defaults are not what you want (again I want to make it easy for you to
supply tailored rules, especially ones that apply only to a specific
script).

And yes, by default the UAX#29 spec considers adjacent characters when
determining word breaks; it's based on the Unicode word break property. (For
some scripts in LUCENE-1488 - Thai, Myanmar, Lao, etc. - this is not used so
much; I provided some more sophisticated mechanisms for these.)
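
For readers without ICU, the JDK's java.text.BreakIterator gives locale-sensitive word boundaries out of the box, though without the tailorable UAX#29 rules discussed above. A minimal sketch (class and method names are illustrative only):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakDemo {
    // Extract word tokens from text using the locale's word BreakIterator,
    // skipping segments that are punctuation or whitespace.
    static List<String> words(String text, Locale locale) {
        List<String> out = new ArrayList<String>();
        BreakIterator bi = BreakIterator.getWordInstance(locale);
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE;
             start = end, end = bi.next()) {
            String segment = text.substring(start, end);
            // Keep only segments that begin with a letter or digit.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Whether "don't" survives as one token depends on the JDK's
        // built-in rules; print it and see.
        System.out.println(words("don't stop", Locale.ENGLISH));
    }
}
```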


>
>
>  so it could be eons before the jdk exposes these things
>
> I'm using ICU now for that very reason. It takes too long for the JDK to be
> current on anything let alone something that Java boasted of in the early
> days.
>

Agreed.


>
>
> It'd really be nice if there were a way to specify that "tool chain".
> Ideally, I'd like to get the default chain, and modify it. (And I'd like to
> store a description of that tool chain with the index, with version info for
> each of the parts, so that I can tell when an index needs to be rebuilt.)
>

This is something that concerns me a bit about LUCENE-1488. It is driven by
properties that will change when ICU/Unicode is updated. This is both good
and bad: it's good in that it will "improve" automatically, based on
improvements done in those places, and we can remain current with the
Unicode standard. It's bad because you will probably have to reindex when
these components are updated. I think it's complex enough that we won't be
able to really guarantee much backwards compatibility if we want to stay
current with Unicode, because things change and improve.

A great example is how the word break property changed for zero-width space
in Unicode 5.2

But, something to mention on this topic, is the great work being done in
JFlex right now, which would allow you to specify a specific unicode
version, and tokenize according to that version. Tokenization is only one
piece of the puzzle though :)


>
> -- DM
>
>
> On Thu, Oct 8, 2009 at 11:38 AM, Robert Muir  wrote:
>
>>   I'm suggesting that if I know my input document well and know that it
>>> has mixed text and that the text is Arabic and one other known language that
>>> I might want to augment the stop list with stop words appropriate for that
>>> known language. I think that in this case, stop filter should be after lower
>>> case filter.
>>>
>>   I think this is a good idea?
>>
>>>
>>> As to lower casing across the board, I also think it is pretty safe. But
>>> I think there are some edge cases. For example, lowercasing a Greek word in
>>> all upper case ending in sigma will not produce the same as lower casing the
>>> same Greek word in all lower case. The Greek word should have a final sigma
>>> rather than a small sigma. For Greek, using an UpperCaseFilter followed by a
>>> LowerCaseFilter would handle this case.
>>>
>>  or you could use unicode case folding. lowercasing is for display
>> purposes, not search.
>>
>>>
>>> IMHO, this is not an issue for the Arabic or Persian analyzers.
>>>
>>> -- DM
>>>
>>> On 10/08/2009 09:36 AM, Robert Muir wrote:
>>>
>>> DM, i suppose. but this is a tricky subject, what if you have mixed
>>> Arabic / German or something like that?
>>>
>>> for some other languages written in the Latin script, English stopwords
>>> could be bad :)
>>>
>>> I think that Lowercasing non-Arabic (also cyrillic, etc), is pretty safe
>>> across the board though.
>>>
>>> On Thu, Oct 8, 2009 at 9:29 AM, DM Smith  wrote:
>>>
 On 10/08/2009 09:23 AM, Uwe Schindler wrote:

> Just an addition: The lowercase filter is only for the case of embedded
> non-arabic words. And these will not appear in the stop words.
>
>
  I learned something new!

 Hmm. If one has a mixed Arabic / English text, shouldn't one be able to
 augment the stopwords list wit

[jira] Commented: (LUCENE-1963) ArabicAnalyzer: Lowercase before Stopfilter

2009-10-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763562#action_12763562
 ] 

Robert Muir commented on LUCENE-1963:
-

bq. can you commit it to 2.9.1 too? (For those stuck on Java 1.4, there is no 
3.0). 

Can someone comment on this one for me?
I don't think it's too much of a stretch to consider this a bug, even if it does 
not affect Arabic text.


> ArabicAnalyzer: Lowercase before Stopfilter
> ---
>
> Key: LUCENE-1963
> URL: https://issues.apache.org/jira/browse/LUCENE-1963
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Robert Muir
>Assignee: Robert Muir
>Priority: Trivial
> Fix For: 3.0
>
> Attachments: LUCENE-1963.patch, LUCENE-1963.patch
>
>
> ArabicAnalyzer lowercases text in case you have some non-Arabic text around.
> It also allows you to set a custom stopword list (you might augment the 
> Arabic list with some English ones, for example).
> In this case its helpful for these non-Arabic stopwords, to lowercase before 
> stopfilter.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1950) Remove autoCommit from IndexWriter

2009-10-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1950:
---

Attachment: LUCENE-1950.patch

Attached patch.  All tests pass.  This is just the first step, which
is to remove autoCommit from IW's public APIs, and fix all places that
were using those APIs.

Second step is to remove autoCommit from inside IW.

When there were tests testing both aC=false and aC=true, I just removed
the "true" case.  When the test was only testing aC=true, I switched
it to false, and sometimes had to tweak the test (eg to insert calls
to IW.commit) to make it happy.  For one test, when I did this
(TestIndexWriter.testImmediateDiskFullWithThreads), it uncovered a
latent bug in IW where, if an IOException is hit while building the cfx
file for flushed doc stores, we were missing a call to
docWriter.abort().

I also removed autoCommit from the alg files in
contrib/benchmark/conf, fixed up the javadocs, and removed some unused
imports.

I plan to commit soon!


> Remove autoCommit from IndexWriter
> --
>
> Key: LUCENE-1950
> URL: https://issues.apache.org/jira/browse/LUCENE-1950
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.0
>
> Attachments: LUCENE-1950.patch
>
>
> IndexWriter's autoCommit is deprecated; in 3.0 it will be hardwired to false.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1121) Use nio.transferTo when copying large blocks of bytes

2009-10-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763573#action_12763573
 ] 

Mark Miller commented on LUCENE-1121:
-

Never mind - it appears that when you chunk, you lose the CPU win, and when you 
don't chunk, you get the win but it performs badly after other Java IO operations. 
Bummer.

> Use nio.transferTo when copying large blocks of bytes
> -
>
> Key: LUCENE-1121
> URL: https://issues.apache.org/jira/browse/LUCENE-1121
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1121.patch, LUCENE-1121.patch, testIO.java
>
>
> When building a CFS file, and also when merging stored fields (and
> term vectors, with LUCENE-1120), we copy large blocks of bytes at
> once.
> We currently do this with an intermediate buffer.
> But, nio.transferTo should be somewhat faster on OS's that offer low
> level IO APIs for moving blocks of bytes between files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1951) wildcardquery rewrite improvements

2009-10-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763579#action_12763579
 ] 

Michael McCandless commented on LUCENE-1951:


Patch looks good Robert!  Thanks.  I'll commit soon.

> wildcardquery rewrite improvements
> --
>
> Key: LUCENE-1951
> URL: https://issues.apache.org/jira/browse/LUCENE-1951
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>Reporter: Robert Muir
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1951.patch, LUCENE-1951.patch, 
> LUCENE-1951_bwcompatbranch.patch
>
>
> wildcardquery has logic to rewrite to termquery if there is no wildcard 
> character, but
> * it needs to pass along the boost if it does this
> * if the user asked for a 'constant score' rewriteMethod, it should rewrite 
> to a constant score query for consistency.
> additionally, if the query is really a prefixquery, it would be nice to 
> rewrite to prefix query.
> both will enumerate the same number of terms, but prefixquery has a simpler 
> comparison function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1951) wildcardquery rewrite improvements

2009-10-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763580#action_12763580
 ] 

Robert Muir commented on LUCENE-1951:
-

Michael, cool. The bw_compat patch is still valid with these changes.

I will mention one concern, just for the record (you can tell me if it is an 
issue).

These tests verify, for example, that a WildcardQuery with SCORING_REWRITE 
rewrites to a TermQuery, which is correct, but now it's a bit weird how this 
happens:

SingleTermEnum -> MultiTermQuery -> BooleanQuery with one term -> TermQuery.

I couldn't think of a better way to test the correct behavior, but it is testing 
a bit more than just what happens in WildcardQuery...


> wildcardquery rewrite improvements
> --
>
> Key: LUCENE-1951
> URL: https://issues.apache.org/jira/browse/LUCENE-1951
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>Reporter: Robert Muir
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1951.patch, LUCENE-1951.patch, 
> LUCENE-1951_bwcompatbranch.patch
>
>
> wildcardquery has logic to rewrite to termquery if there is no wildcard 
> character, but
> * it needs to pass along the boost if it does this
> * if the user asked for a 'constant score' rewriteMethod, it should rewrite 
> to a constant score query for consistency.
> additionally, if the query is really a prefixquery, it would be nice to 
> rewrite to prefix query.
> both will enumerate the same number of terms, but prefixquery has a simpler 
> comparison function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Arabic Analyzer: possible bug

2009-10-08 Thread Robert Muir
Basem, I really appreciate your time if you are able to do this.

It's been my hope that introducing Arabic/Farsi support will create enough
interest to encourage more qualified people to come along and really make things
nice.

If you don't mind, you can look at
http://wiki.apache.org/lucene-java/HowToContribute and create a JIRA Issue
with a patch file to improve our stopwords list.

Otherwise, in my opinion a good list is also acceptable and I will volunteer
to turn it into a patch :)

On Thu, Oct 8, 2009 at 9:32 AM, Basem Narmok  wrote:

> Robert,
>
> I will be happy to do so. Currently, I am testing the new Arabic
> analyzer in 2.9, and also I will prepare a new stop word list. I will
> provide you with my findings/comments soon.
>
> Best,
>
> On Thu, Oct 8, 2009 at 4:28 PM, Robert Muir  wrote:
> > Basem, by any chance would you be willing to help improve it for us?
> >
> > On Thu, Oct 8, 2009 at 9:20 AM, Basem Narmok  wrote:
> >>
> >> DM, there is no upper/lower cases in Arabic, so don't worry, but the
> >> stop word list needs some corrections and may miss some common/stop
> >> Arabic words.
> >>
> >> Best,
> >>
> >> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith  wrote:
> >> > Robert,
> >> > Thanks for the info.
> >> > As I said, I am illiterate in Arabic. So I have another, perhaps
> >> > nonsensical, question:
> >> > Does the stop word list have every combination of upper/lower case for
> >> > each
> >> > Arabic word in the list? (i.e. is it fully de-normalized?) Or should
> it
> >> > come
> >> > after LowerCaseFilter?
> >> > -- DM
> >> > On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
> >> >
> >> > DM, this isn't a bug.
> >> >
> >> > The arabic stopwords are not normalized.
> >> >
> >> > but for persian, i normalized the stopwords. mostly because i did not
> >> > want
> >> > to have to create variations with farsi yah versus arabic yah for each
> >> > one.
> >> >
> >> > On Thu, Oct 8, 2009 at 7:24 AM, DM Smith 
> wrote:
> >> >>
> >> >> I'm wondering if there is  a bug in ArabicAnalyzer in 2.9. (I don't
> >> >> know
> >> >> Arabic or Farsi, but have some texts to index in those languages.)
> >> >> The tokenizer/filter chain for ArabicAnalyzer is:
> >> >> TokenStream result = new ArabicLetterTokenizer( reader );
> >> >> result = new StopFilter( result, stoptable );
> >> >> result = new LowerCaseFilter(result);
> >> >> result = new ArabicNormalizationFilter( result );
> >> >> result = new ArabicStemFilter( result );
> >> >>
> >> >> return result;
> >> >>
> >> >> Shouldn't the StopFilter come after ArabicNormalizationFilter?
> >> >>
> >> >> As a comparison the PersianAnalyzer has:
> >> >> TokenStream result = new ArabicLetterTokenizer(reader);
> >> >> result = new LowerCaseFilter(result);
> >> >> result = new ArabicNormalizationFilter(result);
> >> >> /* additional persian-specific normalization */
> >> >> result = new PersianNormalizationFilter(result);
> >> >> /*
> >> >>  * the order here is important: the stopword list is normalized
> >> >> with
> >> >> the
> >> >>  * above!
> >> >>  */
> >> >> result = new StopFilter(result, stoptable);
> >> >>
> >> >> return result;
> >> >>
> >> >>
> >> >> Thanks,
> >> >> DM
> >> >
> >> >
> >> > --
> >> > Robert Muir
> >> > rcm...@gmail.com
> >> >
> >> >
> >>
> >> -
> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >>
> >
> >
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
> >
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


-- 
Robert Muir
rcm...@gmail.com


[jira] Commented: (LUCENE-1951) wildcardquery rewrite improvements

2009-10-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763597#action_12763597
 ] 

Michael McCandless commented on LUCENE-1951:


That is a rather roundabout way to arrive at the TermQuery, but I think the 
test is fine?

> wildcardquery rewrite improvements
> --
>
> Key: LUCENE-1951
> URL: https://issues.apache.org/jira/browse/LUCENE-1951
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>Reporter: Robert Muir
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1951.patch, LUCENE-1951.patch, 
> LUCENE-1951_bwcompatbranch.patch
>
>
> wildcardquery has logic to rewrite to termquery if there is no wildcard 
> character, but
> * it needs to pass along the boost if it does this
> * if the user asked for a 'constant score' rewriteMethod, it should rewrite 
> to a constant score query for consistency.
> additionally, if the query is really a prefixquery, it would be nice to 
> rewrite to prefix query.
> both will enumerate the same number of terms, but prefixquery has a simpler 
> comparison function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
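
The rewrite rules described in the issue can be sketched with stand-in query names. This mirrors the intent only — the real patch works through Lucene's MultiTermQuery rewrite machinery, and the strings below are purely illustrative:

```java
// Sketch of the rewrite decision described in the issue; the query names
// here are simple stand-ins, not Lucene's actual API.
public class WildcardRewriteSketch {
    static String rewrite(String pattern, float boost) {
        int star = pattern.indexOf('*');
        int quest = pattern.indexOf('?');
        if (star < 0 && quest < 0) {
            // No wildcard at all: rewrite to a TermQuery, keeping the boost.
            return "TermQuery(" + pattern + ", boost=" + boost + ")";
        }
        if (quest < 0 && star == pattern.length() - 1) {
            // "foo*" is really a prefix query: it enumerates the same
            // terms but uses a simpler comparison function.
            return "PrefixQuery(" + pattern.substring(0, star)
                + ", boost=" + boost + ")";
        }
        return "WildcardQuery(" + pattern + ", boost=" + boost + ")";
    }
}
```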



[jira] Commented: (LUCENE-1951) wildcardquery rewrite improvements

2009-10-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763599#action_12763599
 ] 

Robert Muir commented on LUCENE-1951:
-

bq. That is a rather roundabout way to arrive at the TermQuery, but I think the 
test is fine? 

Ok, that was my only concern, the test. I like the SingleTermEnum otherwise, I 
think it will reduce maintenance.

> wildcardquery rewrite improvements
> --
>
> Key: LUCENE-1951
> URL: https://issues.apache.org/jira/browse/LUCENE-1951
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>Reporter: Robert Muir
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1951.patch, LUCENE-1951.patch, 
> LUCENE-1951_bwcompatbranch.patch
>
>
> wildcardquery has logic to rewrite to termquery if there is no wildcard 
> character, but
> * it needs to pass along the boost if it does this
> * if the user asked for a 'constant score' rewriteMethod, it should rewrite 
> to a constant score query for consistency.
> additionally, if the query is really a prefixquery, it would be nice to 
> rewrite to prefix query.
> both will enumerate the same number of terms, but prefixquery has a simpler 
> comparison function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1961) Remove remaining deprecations in document package

2009-10-08 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch resolved LUCENE-1961.
---

Resolution: Fixed

Committed revision 823252.

> Remove remaining deprecations in document package
> -
>
> Key: LUCENE-1961
> URL: https://issues.apache.org/jira/browse/LUCENE-1961
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Other
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.0
>
> Attachments: lucene-1961.patch
>
>
> Remove different deprecated APIs:
> - Field.Index.NO_NORMS, etc.
> - Field.binaryValue()
> - getOmitTf()/setOmitTf()

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
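
For users hit by these removals, the replacements line up roughly as below. This reflects my reading of the 2.9 javadocs — verify against your release notes — and the lookup class itself is purely illustrative:

```java
// Illustrative lookup of deprecated document-package APIs and their
// replacements; the mapping reflects the 2.9 javadocs as I read them.
public class DocumentApiMigration {
    static String replacementFor(String deprecated) {
        switch (deprecated) {
            case "Field.Index.NO_NORMS":
                return "Field.Index.NOT_ANALYZED_NO_NORMS";
            case "Field.Index.UN_TOKENIZED":
                return "Field.Index.NOT_ANALYZED";
            case "Fieldable.binaryValue()":
                return "Fieldable.getBinaryValue()";
            case "setOmitTf(boolean)":
                return "setOmitTermFreqAndPositions(boolean)";
            default:
                return deprecated; // no known replacement
        }
    }
}
```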



[jira] Commented: (LUCENE-1961) Remove remaining deprecations in document package

2009-10-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763607#action_12763607
 ] 

Michael McCandless commented on LUCENE-1961:


I'm seeing this when I run "ant test-tag":

{code}
test-tag:
[mkdir] Created dir: 
/lucene/clean/build/lucene_2_9_back_compat_tests_20091008
[mkdir] Created dir: 
/lucene/clean/build/lucene_2_9_back_compat_tests_20091008/classes/java
[javac] Compiling 413 source files to 
/lucene/clean/build/lucene_2_9_back_compat_tests_20091008/classes/java
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
  [jar] Building jar: 
/lucene/clean/build/lucene_2_9_back_compat_tests_20091008/lucene_2_9_back_compat_tests_20091008.jar
[mkdir] Created dir: 
/lucene/clean/build/lucene_2_9_back_compat_tests_20091008/classes/test
[javac] Compiling 204 source files to 
/lucene/clean/build/lucene_2_9_back_compat_tests_20091008/classes/test
[javac] 
/lucene/clean/tags/lucene_2_9_back_compat_tests_20091008/src/test/org/apache/lucene/index/DocHelper.java:172:
 cannot find symbol
[javac] symbol  : method getOmitTermFreqAndPositions()
[javac] location: interface org.apache.lucene.document.Fieldable
[javac]   if (f.getOmitTermFreqAndPositions()) add(noTf,f);
[javac]^
[javac] 
/lucene/clean/tags/lucene_2_9_back_compat_tests_20091008/src/test/org/apache/lucene/index/TestFieldsReader.java:75:
 cannot find symbol
[javac] symbol  : method getOmitTermFreqAndPositions()
[javac] location: interface org.apache.lucene.document.Fieldable
[javac] assertTrue(field.getOmitTermFreqAndPositions() == false);
[javac] ^
[javac] 
/lucene/clean/tags/lucene_2_9_back_compat_tests_20091008/src/test/org/apache/lucene/index/TestFieldsReader.java:83:
 cannot find symbol
[javac] symbol  : method getOmitTermFreqAndPositions()
[javac] location: interface org.apache.lucene.document.Fieldable
[javac] assertTrue(field.getOmitTermFreqAndPositions() == false);
[javac] ^
[javac] 
/lucene/clean/tags/lucene_2_9_back_compat_tests_20091008/src/test/org/apache/lucene/index/TestFieldsReader.java:91:
 cannot find symbol
[javac] symbol  : method getOmitTermFreqAndPositions()
[javac] location: interface org.apache.lucene.document.Fieldable
[javac] assertTrue(field.getOmitTermFreqAndPositions() == true);
[javac] ^
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] 4 errors
{code}

I think you have to add Fieldable.getOmitTFAP on the back compat branch's 
src/java?

> Remove remaining deprecations in document package
> -
>
> Key: LUCENE-1961
> URL: https://issues.apache.org/jira/browse/LUCENE-1961
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Other
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.0
>
> Attachments: lucene-1961.patch
>
>
> Remove different deprecated APIs:
> - Field.Index.NO_NORMS, etc.
> - Field.binaryValue()
> - getOmitTf()/setOmitTf()

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
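
A minimal sketch of the fix Mike suggests — adding the accessor to the back-compat branch's Fieldable so trunk tests keep compiling. These are stand-in types, not the real Lucene sources:

```java
// Stand-in types sketching the back-compat fix: the branch's Fieldable
// interface gains the accessor that trunk tests now call.
public class BackCompatSketch {
    interface Fieldable {
        // Newly added so tests compiled against trunk keep compiling:
        boolean getOmitTermFreqAndPositions();
    }

    static class Field implements Fieldable {
        private boolean omitTermFreqAndPositions;

        public void setOmitTermFreqAndPositions(boolean v) {
            omitTermFreqAndPositions = v;
        }

        public boolean getOmitTermFreqAndPositions() {
            return omitTermFreqAndPositions;
        }
    }
}
```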



[jira] Created: (LUCENE-1964) InstantiatedIndex : TermFreqVector is missing

2009-10-08 Thread David Causse (JIRA)
InstantiatedIndex : TermFreqVector is missing
-

 Key: LUCENE-1964
 URL: https://issues.apache.org/jira/browse/LUCENE-1964
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Affects Versions: 2.9
 Environment: java 1.6
Reporter: David Causse


TermFreqVector is missing when the index is created via the constructor.
The constructor expects that fields with term vectors are retrieved with the 
getFields call, but that call returns only stored fields, and such fields are 
rarely stored.
I've attached a patch to fix this issue.
I had to add an int freq field to InstantiatedTermDocumentInformation because we 
cannot be sure the size of the termPositions array can serve as the freq 
information; that information may not be available with TermVector.YES.
Not sure I did this well, but it works with the unit test attached.
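
The freq bookkeeping described above might look like this — with TermVector.YES (frequencies but no positions), the positions array cannot stand in for the term frequency, so it is stored explicitly. The field layout is illustrative, not the actual patch:

```java
// Illustrative layout: freq is stored explicitly because the positions
// array may be empty when the field was indexed with TermVector.YES.
public class TermDocInfoSketch {
    static class TermDocumentInformation {
        final int[] termPositions; // may be empty when positions aren't stored
        final int freq;            // explicit frequency, always valid

        TermDocumentInformation(int freq, int[] termPositions) {
            this.freq = freq;
            this.termPositions = termPositions;
        }
    }
}
```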


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1964) InstantiatedIndex : TermFreqVector is missing

2009-10-08 Thread David Causse (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Causse updated LUCENE-1964:
-

Attachment: term-vector-fix.patch

Fix the TermVector storing problem.

> InstantiatedIndex : TermFreqVector is missing
> -
>
> Key: LUCENE-1964
> URL: https://issues.apache.org/jira/browse/LUCENE-1964
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.9
> Environment: java 1.6
>Reporter: David Causse
> Attachments: term-vector-fix.patch
>
>
> TermFreqVector is missing when the index is created via the constructor.
> The constructor expects that fields with term vectors are retrieved with the 
> getFields call, but that call returns only stored fields, and such fields are 
> rarely stored.
> I've attached a patch to fix this issue.
> I had to add an int freq field to InstantiatedTermDocumentInformation because 
> we cannot be sure the size of the termPositions array can serve as the freq 
> information; that information may not be available with TermVector.YES.
> Not sure I did this well, but it works with the unit test attached.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1965) Lazy Atomic Loading Stopwords in SmartCN

2009-10-08 Thread Simon Willnauer (JIRA)
Lazy Atomic Loading Stopwords in SmartCN 
-

 Key: LUCENE-1965
 URL: https://issues.apache.org/jira/browse/LUCENE-1965
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.9
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: 3.0
 Attachments: LUCENE-1965.patch

The default constructor in SmartChineseAnalyzer loads the default (jar-embedded) 
stopwords each time it is invoked. 
They should instead be loaded atomically, once, into an unmodifiable set.
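
The load-once behavior asked for here is commonly done with the initialization-on-demand holder idiom. A sketch with illustrative names — not SmartCN's actual code:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch of loading a default stopword set exactly once and sharing it as
// an unmodifiable set. Names are illustrative, not SmartCN's actual code.
public class LazyStopwords {
    static int loadCount = 0; // instrumentation for the demo only

    // Initialization-on-demand holder: the JVM initializes DefaultSetHolder
    // at most once, on first use, with no explicit locking needed.
    private static class DefaultSetHolder {
        static final Set<String> DEFAULT_STOP_SET = loadDefaultStopSet();
    }

    private static Set<String> loadDefaultStopSet() {
        loadCount++;
        // A real implementation would read the jar-embedded stopword file.
        return Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("stop1", "stop2", "stop3")));
    }

    public static Set<String> getDefaultStopSet() {
        return DefaultSetHolder.DEFAULT_STOP_SET;
    }
}
```

Constructors can then share getDefaultStopSet() instead of re-reading the embedded file on every instantiation.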



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1965) Lazy Atomic Loading Stopwords in SmartCN

2009-10-08 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-1965:


Attachment: LUCENE-1965.patch

attached patch

> Lazy Atomic Loading Stopwords in SmartCN 
> -
>
> Key: LUCENE-1965
> URL: https://issues.apache.org/jira/browse/LUCENE-1965
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
> Fix For: 3.0
>
> Attachments: LUCENE-1965.patch
>
>
> The default constructor in SmartChineseAnalyzer loads the default (jar 
> embedded) stopwords each time the constructor is invoked. 
> This should be atomically loaded only once in an unmodifiable set.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1965) Lazy Atomic Loading Stopwords in SmartCN

2009-10-08 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-1965:


Priority: Trivial  (was: Major)

> Lazy Atomic Loading Stopwords in SmartCN 
> -
>
> Key: LUCENE-1965
> URL: https://issues.apache.org/jira/browse/LUCENE-1965
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Trivial
> Fix For: 3.0
>
> Attachments: LUCENE-1965.patch
>
>
> The default constructor in SmartChineseAnalyzer loads the default (jar 
> embedded) stopwords each time the constructor is invoked. 
> This should be atomically loaded only once in an unmodifiable set.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-08 Thread John Wang
Hi guys:

 What are your thoughts about contributing Kamikaze as a Lucene contrib
package? We just finished porting Kamikaze to Lucene 2.9. The new 2.9
API allowed us to make some further code tuning and optimization improvements.

 We will be releasing Kamikaze; it might be a good time to add it to the
Lucene contrib package if there is interest.

Thanks

-John

On Thu, Sep 24, 2009 at 6:20 AM, Uwe Schindler  wrote:

> By the way: In the last RC of Lucene 2.9 we added a new method to DocIdSet
> called isCacheable(). It is used by e.g. CachingWrapperFilter to determine,
> if a DocIdSet is easy cacheable or must be copied to an OpenBitSetDISI (the
> default is false, so all custom DocIdSets are copied to OpenBitSetDISI by
> CachingWrapperFilter, even if not needed - if a DocIdSet does not do disk
> IO
> and have a fast iterator like e.g. the FieldCache ones in
> FieldCacheRangeFilter, it should return true; see CHANGES.txt). Maybe this
> should also be added to Kamikaze, which is a really nice project!
> Especially
> filter DocIdSets should pass this method to its delegate (see
> FilterDocIdSet
> in Lucene).
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -Original Message-
> > From: John Wang (JIRA) [mailto:j...@apache.org]
> > Sent: Thursday, September 24, 2009 3:14 PM
> > To: java-dev@lucene.apache.org
> > Subject: [jira] Commented: (LUCENE-1458) Further steps towards flexible
> > indexing
> >
> >
> > [ https://issues.apache.org/jira/browse/LUCENE-
> > 1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-
> > tabpanel&focusedCommentId=12759112#action_12759112 ]
> >
> > John Wang commented on LUCENE-1458:
> > ---
> >
> > Just a FYI: Kamikaze was originally started as our sandbox for Lucene
> > contributions until 2.4 is ready. (we needed the DocIdSet/Iterator
> > abstraction that was migrated from Solr)
> >
> > It has three components:
> >
> > 1) P4Delta
> > 2) Logical boolean operations on DocIdSet/Iterators (I have created a
> jira
> > ticket and a patch for Lucene awhile ago with performance numbers. It is
> > significantly faster than DisjunctionScorer)
> > 3) algorithm to determine which DocIdSet implementations to use given
> some
> > parameters, e.g. miniD,maxid,id count etc. It learns and adjust from the
> > application behavior if not all parameters are given.
> >
> > So please feel free to incorporate anything you see if or move it to
> > contrib.
> >
> >
> > > Further steps towards flexible indexing
> > > ---
> > >
> > > Key: LUCENE-1458
> > > URL: https://issues.apache.org/jira/browse/LUCENE-1458
> > > Project: Lucene - Java
> > >  Issue Type: New Feature
> > >  Components: Index
> > >Affects Versions: 2.9
> > >Reporter: Michael McCandless
> > >Assignee: Michael McCandless
> > >Priority: Minor
> > > Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-
> > compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-
> > 1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> > LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-
> > 1458.tar.bz2, LUCENE-1458.tar.bz2
> > >
> > >
> > > I attached a very rough checkpoint of my current patch, to get early
> > > feedback.  All tests pass, though back compat tests don't pass due to
> > > changes to package-private APIs plus certain bugs in tests that
> > > happened to work (eg call TermPostions.nextPosition() too many times,
> > > which the new API asserts against).
> > > [Aside: I think, when we commit changes to package-private APIs such
> > > that back-compat tests don't pass, we could go back, make a branch on
> > > the back-compat tag, commit changes to the tests to use the new
> > > package private APIs on that branch, then fix nightly build to use the
> > > tip of that branch?]
> > > There's still plenty to do before this is committable! This is a
> > > rather large change:
> > >   * Switches to a new more efficient terms dict format.  This still
> > > uses tii/tis files, but the tii only stores term & long offset
> > > (not a TermInfo).  At seek points, tis encodes term & freq/prox
> > > offsets absolutely instead of with deltas.  Also, tis/tii
> > > are structured by field, so we don't have to record field number
> > > in every term.
> > > .
> > > On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> > > -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> > > .
> > > RAM usage when loading terms dict index is significantly less
> > > since we only load an array of offsets and an array of String (no
> > > more TermInfo array).  It should be faster to init too.
> > > .
> > > This part is basically done.
> > >   * Introduces modular reader cod
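
Uwe's isCacheable() point above — that a filtering DocIdSet wrapper should delegate the flag to the set it wraps, so a cheap in-memory set is not needlessly copied by a caching filter — can be sketched with stand-in interfaces. These are NOT the real Lucene 2.9 classes:

```java
// Stand-in interfaces sketching isCacheable() delegation; the real Lucene
// 2.9 DocIdSet/FilteredDocIdSet classes have richer APIs.
public class CacheableDelegationSketch {
    interface DocIdSetLike {
        boolean isCacheable();
    }

    // A wrapper that filters documents but does no I/O of its own: whether
    // it is cheap to iterate depends entirely on the inner set, so the
    // flag is delegated rather than defaulted to false.
    static class FilteringWrapper implements DocIdSetLike {
        final DocIdSetLike inner;

        FilteringWrapper(DocIdSetLike inner) { this.inner = inner; }

        public boolean isCacheable() { return inner.isCacheable(); }
    }

    static DocIdSetLike inMemorySet()   { return () -> true;  }
    static DocIdSetLike diskBackedSet() { return () -> false; }
}
```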

Back-compat tags

2009-10-08 Thread Michael Busch

Hi,

for the last patches I committed I created a new back-compat tag each 
time. Since this is happening so often right now, because we're removing 
APIs, I was wondering whether we should not create a separate tag for 
each patch, but instead gather the changes in the back-compat branch and 
create one tag every day or every other day? Depending on when the build 
gets kicked off it might fail then on test-tag, but that should be 
acceptable?


 Michael

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1965) Lazy Atomic Loading Stopwords in SmartCN

2009-10-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763621#action_12763621
 ] 

Robert Muir commented on LUCENE-1965:
-

Simon, everything is OK, but I have one comment:

I think the new test, testChineseStopWordsNull, is a duplicate of the one 
above. Here is the context:
{code}
  /*
   * Punctuation is handled in a strange way if you disable stopwords
   * In this example the IDEOGRAPHIC FULL STOP is converted into a comma.
   * if you don't supply (true) to the constructor, or use a different 
stopwords list,
   * then punctuation is indexed.
   */
  public void testChineseStopWordsOff() throws Exception {  
Analyzer ca = new SmartChineseAnalyzer(false); /* doesnt load stopwords */
String sentence = "我购买了道具和服装。";
String result[] = { "我", "购买", "了", "道具", "和", "服装", "," };
assertAnalyzesTo(ca, sentence, result);


  }
  
  public void testChineseStopWordsNull() throws IOException{
Analyzer ca = new SmartChineseAnalyzer(false); /* sets stopwords to empty 
set */
String sentence = "我购买了道具和服装。";
String result[] = { "我", "购买", "了", "道具", "和", "服装", "," };
assertAnalyzesTo(ca, sentence, result);
assertAnalyzesToReuse(ca, sentence, result);
  }
{code}

> Lazy Atomic Loading Stopwords in SmartCN 
> -
>
> Key: LUCENE-1965
> URL: https://issues.apache.org/jira/browse/LUCENE-1965
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Trivial
> Fix For: 3.0
>
> Attachments: LUCENE-1965.patch
>
>
> The default constructor in SmartChineseAnalyzer loads the default (jar 
> embedded) stopwords each time the constructor is invoked. 
> This should be atomically loaded only once in an unmodifiable set.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Back-compat tags

2009-10-08 Thread Michael McCandless
How about we just use the tip of the back-compat branch?  (Ie no
tagging).  Until we settle down.

Mike

On Thu, Oct 8, 2009 at 2:42 PM, Michael Busch  wrote:
> Hi,
>
> for the last patches I committed I created a new back-compat tag each time.
> Since this is happening so often right now, because we're removing APIs, I
> was wondering whether we should not create a separate tag for each patch,
> but instead gather the changes in the back-compat branch and create one tag
> every day or every other day? Depending on when the build gets kicked off it
> might fail then on test-tag, but that should be acceptable?
>
>  Michael
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-08 Thread Michael McCandless
+1!

Mike

On Thu, Oct 8, 2009 at 2:41 PM, John Wang  wrote:
> Hi guys:
>
> >  What are your thoughts about contributing Kamikaze as a Lucene contrib
> > package? We just finished porting Kamikaze to Lucene 2.9. The new 2.9
> > API allowed us to make some further code tuning and optimization improvements.
> >
> >  We will be releasing Kamikaze; it might be a good time to add it to the
> > Lucene contrib package if there is interest.
>
> Thanks
>
> -John
>
> On Thu, Sep 24, 2009 at 6:20 AM, Uwe Schindler  wrote:
>>
>> By the way: In the last RC of Lucene 2.9 we added a new method to DocIdSet
>> called isCacheable(). It is used by e.g. CachingWrapperFilter to
>> determine,
>> if a DocIdSet is easy cacheable or must be copied to an OpenBitSetDISI
>> (the
>> default is false, so all custom DocIdSets are copied to OpenBitSetDISI by
>> CachingWrapperFilter, even if not needed - if a DocIdSet does not do disk
>> IO
>> and have a fast iterator like e.g. the FieldCache ones in
>> FieldCacheRangeFilter, it should return true; see CHANGES.txt). Maybe this
>> should also be added to Kamikaze, which is a really nice project!
>> Especially
>> filter DocIdSets should pass this method to its delegate (see
>> FilterDocIdSet
>> in Lucene).
>>
>> -
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>
>> > -Original Message-
>> > From: John Wang (JIRA) [mailto:j...@apache.org]
>> > Sent: Thursday, September 24, 2009 3:14 PM
>> > To: java-dev@lucene.apache.org
>> > Subject: [jira] Commented: (LUCENE-1458) Further steps towards flexible
>> > indexing
>> >
>> >
>> >     [ https://issues.apache.org/jira/browse/LUCENE-
>> > 1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-
>> > tabpanel&focusedCommentId=12759112#action_12759112 ]
>> >
>> > John Wang commented on LUCENE-1458:
>> > ---
>> >
>> > Just a FYI: Kamikaze was originally started as our sandbox for Lucene
>> > contributions until 2.4 is ready. (we needed the DocIdSet/Iterator
>> > abstraction that was migrated from Solr)
>> >
>> > It has three components:
>> >
>> > 1) P4Delta
>> > 2) Logical boolean operations on DocIdSet/Iterators (I have created a
>> > jira
>> > ticket and a patch for Lucene awhile ago with performance numbers. It is
>> > significantly faster than DisjunctionScorer)
>> > 3) algorithm to determine which DocIdSet implementations to use given
>> > some
>> > parameters, e.g. miniD,maxid,id count etc. It learns and adjust from the
>> > application behavior if not all parameters are given.
>> >
>> > So please feel free to incorporate anything you see if or move it to
>> > contrib.
>> >
>> >
>> > > Further steps towards flexible indexing
>> > > ---
>> > >
>> > >                 Key: LUCENE-1458
>> > >                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
>> > >             Project: Lucene - Java
>> > >          Issue Type: New Feature
>> > >          Components: Index
>> > >    Affects Versions: 2.9
>> > >            Reporter: Michael McCandless
>> > >            Assignee: Michael McCandless
>> > >            Priority: Minor
>> > >         Attachments: LUCENE-1458-back-compat.patch, LUCENE-1458-back-
>> > compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-
>> > 1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>> > LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-
>> > 1458.tar.bz2, LUCENE-1458.tar.bz2
>> > >
>> > >
>> > > I attached a very rough checkpoint of my current patch, to get early
>> > > feedback.  All tests pass, though back compat tests don't pass due to
>> > > changes to package-private APIs plus certain bugs in tests that
>> > > happened to work (eg call TermPostions.nextPosition() too many times,
>> > > which the new API asserts against).
>> > > [Aside: I think, when we commit changes to package-private APIs such
>> > > that back-compat tests don't pass, we could go back, make a branch on
>> > > the back-compat tag, commit changes to the tests to use the new
>> > > package private APIs on that branch, then fix nightly build to use the
> >> > > tip of that branch?]
>> > > There's still plenty to do before this is committable! This is a
>> > > rather large change:
>> > >   * Switches to a new more efficient terms dict format.  This still
>> > >     uses tii/tis files, but the tii only stores term & long offset
>> > >     (not a TermInfo).  At seek points, tis encodes term & freq/prox
> >> > >     offsets absolutely instead of with deltas.  Also, tis/tii
>> > >     are structured by field, so we don't have to record field number
>> > >     in every term.
>> > > .
>> > >     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>> > >     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
>> > > .
>> > >     RAM usage when loading terms dict index is significantly less
>> > >     since we only load an a

[jira] Updated: (LUCENE-1964) InstantiatedIndex : TermFreqVector is missing

2009-10-08 Thread David Causse (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Causse updated LUCENE-1964:
-

Attachment: iiw-regression-fix.patch

My previous patch broke the Writer, sorry...
I tried to fix it, but this class is too complicated for me, so here is my 
attempt to repair my mistake.

> InstantiatedIndex : TermFreqVector is missing
> -
>
> Key: LUCENE-1964
> URL: https://issues.apache.org/jira/browse/LUCENE-1964
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.9
> Environment: java 1.6
>Reporter: David Causse
> Attachments: iiw-regression-fix.patch, term-vector-fix.patch
>
>
> TermFreqVector is missing when the index is created via the constructor.
> The constructor expects that fields with term vectors are retrieved with the 
> getFields call, but that call returns only stored fields, and such fields are 
> rarely stored.
> I've attached a patch to fix this issue.
> I had to add an int freq field to InstantiatedTermDocumentInformation because 
> we cannot be sure the size of the termPositions array can serve as the freq 
> information; that information may not be available with TermVector.YES.
> Not sure I did this well, but it works with the unit test attached.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-08 Thread John Wang
Awesome!

Mike, can you let us know what the process is and the time line?

Thanks

-John

On Thu, Oct 8, 2009 at 11:48 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> +1!
>
> Mike
>
> On Thu, Oct 8, 2009 at 2:41 PM, John Wang  wrote:
> > Hi guys:
> >
> >  What are your thoughts about contributing Kamikaze as a Lucene contrib
> > package? We just finished porting Kamikaze to Lucene 2.9. The new 2.9
> > API allowed us to make some further code tuning and optimization improvements.
> >
> >  We will be releasing Kamikaze; it might be a good time to add it to the
> > Lucene contrib package if there is interest.
> >
> > Thanks
> >
> > -John
> >
> > On Thu, Sep 24, 2009 at 6:20 AM, Uwe Schindler  wrote:
> >>
> >> By the way: In the last RC of Lucene 2.9 we added a new method to
> DocIdSet
> >> called isCacheable(). It is used by e.g. CachingWrapperFilter to
> >> determine,
> >> if a DocIdSet is easy cacheable or must be copied to an OpenBitSetDISI
> >> (the
> >> default is false, so all custom DocIdSets are copied to OpenBitSetDISI
> by
> >> CachingWrapperFilter, even if not needed - if a DocIdSet does not do
> disk
> >> IO
> >> and have a fast iterator like e.g. the FieldCache ones in
> >> FieldCacheRangeFilter, it should return true; see CHANGES.txt). Maybe
> this
> >> should also be added to Kamikaze, which is a really nice project!
> >> Especially
> >> filter DocIdSets should pass this method to its delegate (see
> >> FilterDocIdSet
> >> in Lucene).
> >>
> >> -
> >> Uwe Schindler
> >> H.-H.-Meier-Allee 63, D-28213 Bremen
> >> http://www.thetaphi.de
> >> eMail: u...@thetaphi.de
> >>
> >>
> >> > -Original Message-
> >> > From: John Wang (JIRA) [mailto:j...@apache.org]
> >> > Sent: Thursday, September 24, 2009 3:14 PM
> >> > To: java-dev@lucene.apache.org
> >> > Subject: [jira] Commented: (LUCENE-1458) Further steps towards
> flexible
> >> > indexing
> >> >
> >> >
> >> > [ https://issues.apache.org/jira/browse/LUCENE-
> >> > 1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-
> >> > tabpanel&focusedCommentId=12759112#action_12759112 ]
> >> >
> >> > John Wang commented on LUCENE-1458:
> >> > ---
> >> >
> >> > Just a FYI: Kamikaze was originally started as our sandbox for Lucene
> >> > contributions until 2.4 is ready. (we needed the DocIdSet/Iterator
> >> > abstraction that was migrated from Solr)
> >> >
> >> > It has three components:
> >> >
> >> > 1) P4Delta
> >> > 2) Logical boolean operations on DocIdSet/Iterators (I have created a
> >> > jira
> >> > ticket and a patch for Lucene awhile ago with performance numbers. It
> is
> >> > significantly faster than DisjunctionScorer)
> >> > 3) algorithm to determine which DocIdSet implementations to use given
> >> > some
> >> > parameters, e.g. miniD,maxid,id count etc. It learns and adjust from
> the
> >> > application behavior if not all parameters are given.
> >> >
> >> > So please feel free to incorporate anything you see if or move it to
> >> > contrib.
> >> >
> >> >
> >> > > Further steps towards flexible indexing
> >> > > ---
> >> > >
> >> > > Key: LUCENE-1458
> >> > > URL:
> https://issues.apache.org/jira/browse/LUCENE-1458
> >> > > Project: Lucene - Java
> >> > >  Issue Type: New Feature
> >> > >  Components: Index
> >> > >Affects Versions: 2.9
> >> > >Reporter: Michael McCandless
> >> > >Assignee: Michael McCandless
> >> > >Priority: Minor
> >> > > Attachments: LUCENE-1458-back-compat.patch,
> LUCENE-1458-back-
> >> > compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458.patch,
> LUCENE-
> >> > 1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> >> > LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-
> >> > 1458.tar.bz2, LUCENE-1458.tar.bz2
> >> > >
> >> > >
> >> > > I attached a very rough checkpoint of my current patch, to get early
> >> > > feedback.  All tests pass, though back compat tests don't pass due
> to
> >> > > changes to package-private APIs plus certain bugs in tests that
> >> > > happened to work (eg call TermPostions.nextPosition() too many
> times,
> >> > > which the new API asserts against).
> >> > > [Aside: I think, when we commit changes to package-private APIs such
> >> > > that back-compat tests don't pass, we could go back, make a branch
> on
> >> > > the back-compat tag, commit changes to the tests to use the new
> >> > > package private APIs on that branch, then fix nightly build to use
> the
> >> > > tip of that branch?]
> >> > > There's still plenty to do before this is committable! This is a
> >> > > rather large change:
> >> > >   * Switches to a new more efficient terms dict format.  This still
> >> > > uses tii/tis files, but the tii only stores term & long offset
> >> > > (not a TermInfo).  At seek points, tis encodes term & freq/prox
> >> > > o

Re: Back-compat tags

2009-10-08 Thread Michael Busch

+1.

I guess then we have to make some changes to the build script. Currently 
it's only possible to specify a tag. I'll open a JIRA issue.


 Michael

On 10/8/09 11:45 AM, Michael McCandless wrote:

How about we just use the tip of the back-compat branch? (I.e. no
tagging.) Until we settle down.

Mike

On Thu, Oct 8, 2009 at 2:42 PM, Michael Busch  wrote:
   

Hi,

for the last patches I committed I created a new back-compat tag each time.
Since this is happening so often right now, because we're removing APIs, I
was wondering whether we should not create a separate tag for each patch,
but instead gather the changes in the back-compat branch and create one tag
every day or every other day? Depending on when the build gets kicked off, it
might then fail on test-tag, but that should be acceptable?

  Michael




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1958) ShingleFilter creates shingles across two consecutives documents : bug or normal behaviour ?

2009-10-08 Thread MRIT64 (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763638#action_12763638
 ] 

MRIT64 commented on LUCENE-1958:


It doesn't happen with Lucene 2.9 (just downloaded).

> ShingleFilter creates shingles across two consecutives documents : bug or 
> normal behaviour ?
> 
>
> Key: LUCENE-1958
> URL: https://issues.apache.org/jira/browse/LUCENE-1958
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: Windows XP / jdk1.6.0_15
>Reporter: MRIT64
>Priority: Minor
>
> Hi,
> I add two consecutive documents that are indexed with some filters. The last
> filter is ShingleFilter.
> ShingleFilter creates a shingle spanning the two documents, which makes no
> sense in my context.
> Is that a bug or is it ShingleFilter's normal behaviour? If it's normal
> behaviour, is it possible to change it optionally?
> Thanks
> MR

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1958) ShingleFilter creates shingles across two consecutives documents : bug or normal behaviour ?

2009-10-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763641#action_12763641
 ] 

Robert Muir commented on LUCENE-1958:
-

bq. It doesn't happen with Lucene 2.9 (just downloaded).

Can you tell me if you have made a custom analyzer? If so, does this analyzer 
implement reusableTokenStream?

If this is the case, it's really not a bug: reset() is an optional operation,
and with Lucene 2.4.1 you can't safely reuse instances of ShingleFilter for
this reason; it does not support reuse as of that version.
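The reuse hazard described here is easy to reproduce without Lucene: any shingle producer that buffers the previous token will emit a shingle spanning two documents unless that buffer is cleared between them. A self-contained toy sketch follows; this is illustrative only, not ShingleFilter's actual code, and the class and method names are invented:

```java
import java.util.ArrayList;
import java.util.List;

public class ToyShingler {
    private String prev = null;  // state carried across calls

    // Emits bigram shingles ("a b") for a stream of tokens.
    public List<String> shingle(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            if (prev != null) {
                out.add(prev + " " + t);
            }
            prev = t;
        }
        return out;
    }

    // The reset step the old filter lacked: without clearing the buffer
    // between documents, a shingle leaks across the document boundary.
    public void reset() {
        prev = null;
    }
}
```

Feeding doc1 ["new", "york"] and then doc2 ["city"] through one instance without reset() yields the cross-document shingle "york city"; calling reset() between documents avoids it.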


> ShingleFilter creates shingles across two consecutives documents : bug or 
> normal behaviour ?
> 
>
> Key: LUCENE-1958
> URL: https://issues.apache.org/jira/browse/LUCENE-1958
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: Windows XP / jdk1.6.0_15
>Reporter: MRIT64
>Priority: Minor
>
> Hi,
> I add two consecutive documents that are indexed with some filters. The last
> filter is ShingleFilter.
> ShingleFilter creates a shingle spanning the two documents, which makes no
> sense in my context.
> Is that a bug or is it ShingleFilter's normal behaviour? If it's normal
> behaviour, is it possible to change it optionally?
> Thanks
> MR

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1965) Lazy Atomic Loading Stopwords in SmartCN

2009-10-08 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-1965:


Attachment: LUCENE-1965.patch

Thanks Robert, good catch! I was adding a test with null in the constructor
but apparently forgot to finish it.
I merged it into testChineseStopWordsOff().

Patch attached.


> Lazy Atomic Loading Stopwords in SmartCN 
> -
>
> Key: LUCENE-1965
> URL: https://issues.apache.org/jira/browse/LUCENE-1965
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Trivial
> Fix For: 3.0
>
> Attachments: LUCENE-1965.patch, LUCENE-1965.patch
>
>
> The default constructor in SmartChineseAnalyzer loads the default
> (jar-embedded) stopwords each time the constructor is invoked.
> They should instead be loaded atomically, only once, into an unmodifiable set.
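The fix amounts to the standard lazy-holder idiom: load the default stopwords once, wrap them in an unmodifiable set, and share that single instance across all analyzer constructions. A minimal self-contained sketch, where the class names and sample words are illustrative and not SmartChineseAnalyzer's actual code:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class LazyStopwords {
    // Holder idiom: DEFAULT is initialized exactly once, on first access,
    // by the class loader, which guarantees thread-safe initialization.
    private static class DefaultSetHolder {
        static final Set<String> DEFAULT = loadDefaultStopwords();
    }

    private static Set<String> loadDefaultStopwords() {
        // Stand-in for reading the jar-embedded stopword file.
        Set<String> words = new HashSet<String>(Arrays.asList("的", "了", "和"));
        return Collections.unmodifiableSet(words);
    }

    // Every analyzer instance shares the same immutable set.
    public static Set<String> getDefaultStopwords() {
        return DefaultSetHolder.DEFAULT;
    }
}
```

The analyzer's default constructor would then call getDefaultStopwords() instead of re-reading the resource on each invocation.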

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1965) Lazy Atomic Loading Stopwords in SmartCN

2009-10-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763646#action_12763646
 ] 

Robert Muir commented on LUCENE-1965:
-

Simon, cool. I like it now; I think it's a good improvement, same as with
Persian and Arabic. Thanks :)

> Lazy Atomic Loading Stopwords in SmartCN 
> -
>
> Key: LUCENE-1965
> URL: https://issues.apache.org/jira/browse/LUCENE-1965
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Trivial
> Fix For: 3.0
>
> Attachments: LUCENE-1965.patch, LUCENE-1965.patch
>
>
> The default constructor in SmartChineseAnalyzer loads the default
> (jar-embedded) stopwords each time the constructor is invoked.
> They should instead be loaded atomically, only once, into an unmodifiable set.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Closed: (LUCENE-1965) Lazy Atomic Loading Stopwords in SmartCN

2009-10-08 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer closed LUCENE-1965.
---

Resolution: Fixed

Committed in r823285.

Thanks Robert for reviewing.

> Lazy Atomic Loading Stopwords in SmartCN 
> -
>
> Key: LUCENE-1965
> URL: https://issues.apache.org/jira/browse/LUCENE-1965
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>Priority: Trivial
> Fix For: 3.0
>
> Attachments: LUCENE-1965.patch, LUCENE-1965.patch
>
>
> The default constructor in SmartChineseAnalyzer loads the default
> (jar-embedded) stopwords each time the constructor is invoked.
> They should instead be loaded atomically, only once, into an unmodifiable set.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-08 Thread Michael McCandless
Well, it's the usual process... pull together a big patch, open an issue, etc.

Probably because it's a large amount of code (I think?) you'll need to
submit a software grant
(http://www.apache.org/licenses/software-grant.txt).

Mike

On Thu, Oct 8, 2009 at 2:58 PM, John Wang  wrote:
> Awesome!
>
> Mike, can you let us know what the process is and the time line?
>
> Thanks
>
> -John
>
> On Thu, Oct 8, 2009 at 11:48 AM, Michael McCandless
>  wrote:
>>
>> +1!
>>
>> Mike
>>
>> On Thu, Oct 8, 2009 at 2:41 PM, John Wang  wrote:
>> > Hi guys:
>> >
>> >  What are your thoughts about contributing Kamikaze as a lucene
>> > contrib
>> > package? We just finished porting kamikaze to lucene 2.9. With the new
>> > 2.9
>> > api, it allows us for some more code tuning and optimization
>> > improvements.
>> >
> >> >  We will be releasing Kamikaze; it might be a good time to add it to
> >> > the
> >> > Lucene contrib package if there is interest.
>> >
>> > Thanks
>> >
>> > -John
>> >
>> > On Thu, Sep 24, 2009 at 6:20 AM, Uwe Schindler  wrote:
>> >>
> >> >> By the way: in the last RC of Lucene 2.9 we added a new method to DocIdSet
> >> >> called isCacheable(). It is used by e.g. CachingWrapperFilter to determine
> >> >> whether a DocIdSet is easily cacheable or must be copied to an
> >> >> OpenBitSetDISI (the default is false, so all custom DocIdSets are copied
> >> >> to OpenBitSetDISI by CachingWrapperFilter, even if not needed; if a
> >> >> DocIdSet does no disk IO and has a fast iterator, like e.g. the FieldCache
> >> >> ones in FieldCacheRangeFilter, it should return true; see CHANGES.txt).
> >> >> Maybe this should also be added to Kamikaze, which is a really nice
> >> >> project! Especially filter DocIdSets should pass this method on to their
> >> >> delegate (see FilterDocIdSet in Lucene).
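Uwe's delegation point can be sketched without Lucene: a wrapping DocIdSet that adds no disk IO of its own should forward isCacheable() to the set it wraps, rather than inherit the conservative default of false. The types below are simplified stand-ins for illustration, not Lucene's real DocIdSet API:

```java
public class CacheableDelegation {
    // Simplified stand-in for Lucene 2.9's DocIdSet (illustrative only).
    public interface SimpleDocIdSet {
        boolean isCacheable();
    }

    // A set backed by an in-memory bit set: cheap to iterate, safe to cache.
    public static class BitSetDocIdSet implements SimpleDocIdSet {
        public boolean isCacheable() { return true; }
    }

    // A filtering wrapper adds no disk IO of its own, so its cacheability is
    // exactly that of the underlying set: forward the call to the delegate
    // instead of keeping a blanket "false".
    public static class FilteringDocIdSet implements SimpleDocIdSet {
        private final SimpleDocIdSet delegate;
        public FilteringDocIdSet(SimpleDocIdSet delegate) {
            this.delegate = delegate;
        }
        public boolean isCacheable() { return delegate.isCacheable(); }
    }
}
```

With the forwarding in place, a cache such as CachingWrapperFilter can keep the wrapped set directly instead of copying it.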
>> >>
>> >> -
>> >> Uwe Schindler
>> >> H.-H.-Meier-Allee 63, D-28213 Bremen
>> >> http://www.thetaphi.de
>> >> eMail: u...@thetaphi.de
>> >>
>> >>
>> >> > -Original Message-
>> >> > From: John Wang (JIRA) [mailto:j...@apache.org]
>> >> > Sent: Thursday, September 24, 2009 3:14 PM
>> >> > To: java-dev@lucene.apache.org
>> >> > Subject: [jira] Commented: (LUCENE-1458) Further steps towards
>> >> > flexible
>> >> > indexing
>> >> >
>> >> >
>> >> >     [ https://issues.apache.org/jira/browse/LUCENE-
>> >> > 1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-
>> >> > tabpanel&focusedCommentId=12759112#action_12759112 ]
>> >> >
>> >> > John Wang commented on LUCENE-1458:
>> >> > ---
>> >> >
>> >> > Just a FYI: Kamikaze was originally started as our sandbox for Lucene
>> >> > contributions until 2.4 is ready. (we needed the DocIdSet/Iterator
>> >> > abstraction that was migrated from Solr)
>> >> >
>> >> > It has three components:
>> >> >
>> >> > 1) P4Delta
>> >> > 2) Logical boolean operations on DocIdSet/Iterators (I have created a
>> >> > jira
>> >> > ticket and a patch for Lucene awhile ago with performance numbers. It
>> >> > is
>> >> > significantly faster than DisjunctionScorer)
>> >> > 3) algorithm to determine which DocIdSet implementations to use given
>> >> > some
> >> >> > parameters, e.g. minId, maxId, id count, etc. It learns and adjusts from
>> >> > the
>> >> > application behavior if not all parameters are given.
>> >> >
> >> >> > So please feel free to incorporate anything you see fit, or move it to
>> >> > contrib.
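Component 2 above, boolean operations over DocIdSets/Iterators, boils down to merging sorted doc-ID streams. A minimal self-contained sketch of a conjunction (AND) in the leapfrogging spirit of DocIdSetIterator's advance(); this is illustrative only, not Kamikaze's actual code:

```java
import java.util.ArrayList;
import java.util.List;

public class AndDocIds {
    // Intersects two ascending doc-ID lists by leapfrogging: always advance
    // the list whose current doc is behind, and emit on equality.
    public static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<Integer>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out.add(a[i]);   // doc present in both sets
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;             // a is behind: advance it
            } else {
                j++;             // b is behind: advance it
            }
        }
        return out;
    }
}
```

A real implementation would work against iterators with skip support rather than arrays, which is where the speedups over a naive disjunction/conjunction scorer come from.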
>> >> >
>> >> >
>> >> > > Further steps towards flexible indexing
>> >> > > ---
>> >> > >
>> >> > >                 Key: LUCENE-1458
>> >> > >                 URL:
>> >> > > https://issues.apache.org/jira/browse/LUCENE-1458
>> >> > >             Project: Lucene - Java
>> >> > >          Issue Type: New Feature
>> >> > >          Components: Index
>> >> > >    Affects Versions: 2.9
>> >> > >            Reporter: Michael McCandless
>> >> > >            Assignee: Michael McCandless
>> >> > >            Priority: Minor
>> >> > >         Attachments: LUCENE-1458-back-compat.patch,
>> >> > > LUCENE-1458-back-
>> >> > compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458.patch,
>> >> > LUCENE-
>> >> > 1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>> >> > LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-
>> >> > 1458.tar.bz2, LUCENE-1458.tar.bz2
>> >> > >
>> >> > >
>> >> > > I attached a very rough checkpoint of my current patch, to get
>> >> > > early
>> >> > > feedback.  All tests pass, though back compat tests don't pass due
>> >> > > to
>> >> > > changes to package-private APIs plus certain bugs in tests that
>> >> > > happened to work (eg call TermPostions.nextPosition() too many
>> >> > > times,
>> >> > > which the new API asserts against).
>> >> > > [Aside: I think, when we commit changes to package-private APIs
>> >> > > such
>> >> > > that back-compat tests don't pass, we could go back, make a branch
>> >> > > on
>> >> > > the back-compat tag, commit cha

Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-10-08 Thread Mark Miller
Yup, you need one for anything developed outside of Apache.

Michael McCandless wrote:
> Well, it's the usual process... pull together a big patch, open an issue, etc.
>
> Probably because it's a large amount of code (I think?) you'll need to
> submit a software grant
> (http://www.apache.org/licenses/software-grant.txt).
>
> Mike
>
> On Thu, Oct 8, 2009 at 2:58 PM, John Wang  wrote:
>   
>> Awesome!
>>
>> Mike, can you let us know what the process is and the time line?
>>
>> Thanks
>>
>> -John
>>
>> On Thu, Oct 8, 2009 at 11:48 AM, Michael McCandless
>>  wrote:
>> 
>>> +1!
>>>
>>> Mike
>>>
>>> On Thu, Oct 8, 2009 at 2:41 PM, John Wang  wrote:
>>>   
 Hi guys:

  What are your thoughts about contributing Kamikaze as a lucene
 contrib
 package? We just finished porting kamikaze to lucene 2.9. With the new
 2.9
 api, it allows us for some more code tuning and optimization
 improvements.

 We will be releasing Kamikaze; it might be a good time to add it to
 the
 lucene contrib package if there is interest.

 Thanks

 -John

 On Thu, Sep 24, 2009 at 6:20 AM, Uwe Schindler  wrote:
 
> By the way: in the last RC of Lucene 2.9 we added a new method to DocIdSet
> called isCacheable(). It is used by e.g. CachingWrapperFilter to determine
> whether a DocIdSet is easily cacheable or must be copied to an
> OpenBitSetDISI (the default is false, so all custom DocIdSets are copied
> to OpenBitSetDISI by CachingWrapperFilter, even if not needed; if a
> DocIdSet does no disk IO and has a fast iterator, like e.g. the FieldCache
> ones in FieldCacheRangeFilter, it should return true; see CHANGES.txt).
> Maybe this should also be added to Kamikaze, which is a really nice
> project! Especially filter DocIdSets should pass this method on to their
> delegate (see FilterDocIdSet in Lucene).
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>   
>> -Original Message-
>> From: John Wang (JIRA) [mailto:j...@apache.org]
>> Sent: Thursday, September 24, 2009 3:14 PM
>> To: java-dev@lucene.apache.org
>> Subject: [jira] Commented: (LUCENE-1458) Further steps towards
>> flexible
>> indexing
>>
>>
>> [ https://issues.apache.org/jira/browse/LUCENE-
>> 1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-
>> tabpanel&focusedCommentId=12759112#action_12759112 ]
>>
>> John Wang commented on LUCENE-1458:
>> ---
>>
>> Just a FYI: Kamikaze was originally started as our sandbox for Lucene
>> contributions until 2.4 is ready. (we needed the DocIdSet/Iterator
>> abstraction that was migrated from Solr)
>>
>> It has three components:
>>
>> 1) P4Delta
>> 2) Logical boolean operations on DocIdSet/Iterators (I have created a
>> jira
>> ticket and a patch for Lucene awhile ago with performance numbers. It
>> is
>> significantly faster than DisjunctionScorer)
>> 3) algorithm to determine which DocIdSet implementations to use given
>> some
>> parameters, e.g. minId, maxId, id count, etc. It learns and adjusts from
>> the
>> application behavior if not all parameters are given.
>>
>> So please feel free to incorporate anything you see fit, or move it to
>> contrib.
>>
>>
>> 
>>> Further steps towards flexible indexing
>>> ---
>>>
>>> Key: LUCENE-1458
>>> URL:
>>> https://issues.apache.org/jira/browse/LUCENE-1458
>>> Project: Lucene - Java
>>>  Issue Type: New Feature
>>>  Components: Index
>>>Affects Versions: 2.9
>>>Reporter: Michael McCandless
>>>Assignee: Michael McCandless
>>>Priority: Minor
>>> Attachments: LUCENE-1458-back-compat.patch,
>>> LUCENE-1458-back-
>>>   
>> compat.patch, LUCENE-1458-back-compat.patch, LUCENE-1458.patch,
>> LUCENE-
>> 1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>> LUCENE-1458.patch, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-
>> 1458.tar.bz2, LUCENE-1458.tar.bz2
>> 
>>> I attached a very rough checkpoint of my current patch, to get
>>> early
>>> feedback.  All tests pass, though back compat tests don't pass due
>>> to
>>> changes to package-private APIs plus certain bugs in tests that
>>> happened to work (eg call TermPostions.nextPosition() too many
>>> times,
>>> which the new API asserts against).
>>> [Aside: I think, when we commit changes to package-private APIs
>>> such
>>> that back-com

Re: Arabic Analyzer: possible bug

2009-10-08 Thread Basem Narmok
Uwe,
100% correct

On Thu, Oct 8, 2009 at 4:56 PM, Uwe Schindler  wrote:
> I think the idea of the lowercase filter in the Arabic analyzers is not to
> really index mixed-language texts. It is more for the case where you have some
> words between the Arabic content (like product names), which happens often.
> You see this often in Japanese texts, too. And for these embedded English
> fragments you really need no stop word list. And if there is a stop word in
> it, for the target language it is not a real stop word; it may be additional
> information. Stop word removal is done mostly because stop words are needless
> (they appear in every text). But if you have one Arabic sentence where "the"
> appears next to an English word, it is more important than all the "the"s in
> this mail.
>
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Arabic Analyzer: possible bug

2009-10-08 Thread Basem Narmok
Ok, the list is ready (an initial one, as I will continue enhancing it).
I will create a JIRA issue and send the patch.

Also, I have some small changes to the normalization (e.g. removing
some diacritics, among other changes).

Best,
Basem

On Thu, Oct 8, 2009 at 8:51 PM, Robert Muir  wrote:
> Basem, I really appreciate your time if you are able to do this.
>
> It's been my hope that introducing Arabic/Farsi support will create enough
> interest to encourage more qualified people to come and really make things
> nice.
>
> If you don't mind, you can look at
> http://wiki.apache.org/lucene-java/HowToContribute and create a JIRA Issue
> with a patch file to improve our stopwords list.
>
> Otherwise, in my opinion a good list is also acceptable and I will volunteer
> to turn it into a patch :)
>
> On Thu, Oct 8, 2009 at 9:32 AM, Basem Narmok  wrote:
>>
>> Robert,
>>
>> I will be happy to do so. Currently, I am testing the new Arabic
>> analyzer in 2.9, and also I will prepare a new stop word list. I will
>> provide you with my findings/comments soon.
>>
>> Best,
>>
>> On Thu, Oct 8, 2009 at 4:28 PM, Robert Muir  wrote:
>> > Basem, by any chance would you be willing to help improve it for us?
>> >
>> > On Thu, Oct 8, 2009 at 9:20 AM, Basem Narmok  wrote:
>> >>
>> >> DM, there is no upper/lower case in Arabic, so don't worry, but the
>> >> stop word list needs some corrections and may be missing some common
>> >> Arabic stop words.
>> >>
>> >> Best,
>> >>
>> >> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith  wrote:
>> >> > Robert,
>> >> > Thanks for the info.
>> >> > As I said, I am illiterate in Arabic. So I have another, perhaps
>> >> > nonsensical, question:
>> >> > Does the stop word list have every combination of upper/lower case
>> >> > for
>> >> > each
>> >> > Arabic word in the list? (i.e. is it fully de-normalized?) Or should
>> >> > it
>> >> > come
>> >> > after LowerCaseFilter?
>> >> > -- DM
>> >> > On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>> >> >
>> >> > DM, this isn't a bug.
>> >> >
>> >> > The arabic stopwords are not normalized.
>> >> >
>> >> > but for persian, i normalized the stopwords. mostly because i did not
>> >> > want
>> >> > to have to create variations with farsi yah versus arabic yah for
>> >> > each
>> >> > one.
>> >> >
>> >> > On Thu, Oct 8, 2009 at 7:24 AM, DM Smith 
>> >> > wrote:
>> >> >>
>> >> >> I'm wondering if there is  a bug in ArabicAnalyzer in 2.9. (I don't
>> >> >> know
>> >> >> Arabic or Farsi, but have some texts to index in those languages.)
>> >> >> The tokenizer/filter chain for ArabicAnalyzer is:
>> >> >>         TokenStream result = new ArabicLetterTokenizer( reader );
>> >> >>         result = new StopFilter( result, stoptable );
>> >> >>         result = new LowerCaseFilter(result);
>> >> >>         result = new ArabicNormalizationFilter( result );
>> >> >>         result = new ArabicStemFilter( result );
>> >> >>
>> >> >>         return result;
>> >> >>
>> >> >> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>> >> >>
>> >> >> As a comparison the PersianAnalyzer has:
>> >> >>     TokenStream result = new ArabicLetterTokenizer(reader);
>> >> >>     result = new LowerCaseFilter(result);
>> >> >>     result = new ArabicNormalizationFilter(result);
>> >> >>     /* additional persian-specific normalization */
>> >> >>     result = new PersianNormalizationFilter(result);
>> >> >>     /*
>> >> >>      * the order here is important: the stopword list is normalized
>> >> >> with
>> >> >> the
>> >> >>      * above!
>> >> >>      */
>> >> >>     result = new StopFilter(result, stoptable);
>> >> >>
>> >> >>     return result;
>> >> >>
>> >> >>
>> >> >> Thanks,
>> >> >> DM
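The ordering question above is easy to demonstrate with a toy model: if stopword removal runs before normalization, a stopword written in a variant surface form slips through; if it runs after normalization, against a stopword list that is itself normalized (as PersianAnalyzer's comment stresses), it is caught. Latin characters stand in for the Arabic forms here; none of this is the analyzers' real code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class FilterOrder {
    // A normalized stopword list (one entry is enough to show the effect).
    static final Set<String> STOPS = new HashSet<String>(Arrays.asList("the"));

    // Toy normalizer standing in for ArabicNormalizationFilter (which folds
    // e.g. alef variants); here it just folds 'é' to 'e'.
    static String normalize(String token) {
        return token.replace('é', 'e');
    }

    // Stop-then-normalize: a variant-spelled stopword survives, because the
    // stop check sees the un-normalized surface form.
    static boolean keptStopFirst(String token) {
        if (STOPS.contains(token)) return false;
        normalize(token);  // normalization happens too late to matter
        return true;
    }

    // Normalize-then-stop: variant and canonical forms are both removed,
    // provided the stop list holds normalized forms.
    static boolean keptNormalizeFirst(String token) {
        return !STOPS.contains(normalize(token));
    }
}
```

Here keptStopFirst("thé") returns true (the variant stopword is kept, i.e. the bug), while keptNormalizeFirst("thé") returns false.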
>> >> >
>> >> >
>> >> > --
>> >> > Robert Muir
>> >> > rcm...@gmail.com
>> >> >
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Robert Muir
>> > rcm...@gmail.com
>> >
>>
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1959) Index Splitter

2009-10-08 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated LUCENE-1959:
--

Attachment: mp-splitter.patch

Here's my submission to the index splitting race ;) This version implements the 
multi-pass method that uses loops of delete/addIndexes/undelete.
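The multi-pass loop can be modeled without Lucene's API: for each target part, mark every document outside the part as deleted, copy the survivors out (the analogue of addIndexes over a reader with deletions), then undelete and move to the next part. A self-contained model of that loop follows; it is purely illustrative, since the actual patch works on a real index, presumably via IndexReader's delete/undeleteAll and IndexWriter.addIndexes:

```java
import java.util.ArrayList;
import java.util.List;

public class MultiPassSplit {
    // Splits docs round-robin into numParts "indexes". Each pass "deletes"
    // (masks) every doc that does not belong to the current part, copies the
    // survivors, then "undeletes", mirroring delete/addIndexes/undelete.
    public static List<List<String>> split(List<String> docs, int numParts) {
        List<List<String>> parts = new ArrayList<List<String>>();
        for (int part = 0; part < numParts; part++) {
            boolean[] deleted = new boolean[docs.size()];
            for (int i = 0; i < docs.size(); i++) {
                deleted[i] = (i % numParts != part);  // mask the rest
            }
            List<String> copy = new ArrayList<String>();  // "addIndexes" step
            for (int i = 0; i < docs.size(); i++) {
                if (!deleted[i]) copy.add(docs.get(i));
            }
            parts.add(copy);
            // "undelete": the mask is discarded; the source is untouched,
            // ready for the next pass.
        }
        return parts;
    }
}
```

The trade-off matches the discussion below: multiple passes over the source index, but the ability to assign arbitrary documents (not just whole segments) to each part.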

> Index Splitter
> --
>
> Key: LUCENE-1959
> URL: https://issues.apache.org/jira/browse/LUCENE-1959
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Trivial
> Fix For: 3.0
>
> Attachments: LUCENE-1959.patch, LUCENE-1959.patch, mp-splitter.patch
>
>
> If an index has multiple segments, this tool allows splitting those segments 
> into separate directories.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Arabic Analyzer: possible bug

2009-10-08 Thread Basem Narmok
Robert,

Yes, this case will not work, as some numbers are used to represent
(transliterate, if I may say) some Arabic letters (e.g. 3 for Arabic
Aeen, and 7 for Arabic H'a).

Some online services provide instant translation for such
transliteration (e.g. http://www.yamli.com/; try the word "7elo", which
means nice/cool in Arabic), so we may provide an analyzer stage that
could translate such content to Arabic :)

Basem

On Thu, Oct 8, 2009 at 5:11 PM, Robert Muir  wrote:
> Uwe, I might add to what you say. I do disagree a bit and think mixed
> english/arabic text is pretty common (aside from the "product name" issue
> you discussed).
>
> this can get really complex for some informal text: you have maybe some
> english, arabic, and arabic written in informal romanization, sometimes all
> mixed together:
>
> Example:
> http://www.mahjoob.com/en/forums/showthread.php?t=211597&page=3
>
> Not really sure how to make the default ArabicAnalyzer meet everyone's
> needs; in this example it's going to mangle the romanized Arabic, because they
> use numerics for some letters, and it uses something based on CharTokenizer
> :) But allowing a word to, say, start with or contain a numeric might
> not be the best thing for higher-quality text...
>
>
> On Thu, Oct 8, 2009 at 9:56 AM, Uwe Schindler  wrote:
>>
>> I think the idea of the lowercase filter in the Arabic analyzers is not to
>> really index mixed-language texts. It is more for the case where you have
>> some words between the Arabic content (like product names), which happens
>> often. You see this often in Japanese texts, too. And for these embedded
>> English fragments you really need no stop word list. And if there is a stop
>> word in it, for the target language it is not a real stop word; it may be
>> additional information. Stop word removal is done mostly because stop words
>> are needless (they appear in every text). But if you have one Arabic
>> sentence where "the" appears next to an English word, it is more important
>> than all the "the"s in this mail.
>>
>>
>> Uwe
>>
>> -
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>
>>
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1959) Index Splitter

2009-10-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763684#action_12763684
 ] 

Michael McCandless commented on LUCENE-1959:


Excellent!

> Index Splitter
> --
>
> Key: LUCENE-1959
> URL: https://issues.apache.org/jira/browse/LUCENE-1959
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Trivial
> Fix For: 3.0
>
> Attachments: LUCENE-1959.patch, LUCENE-1959.patch, mp-splitter.patch
>
>
> If an index has multiple segments, this tool allows splitting those segments 
> into separate directories.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1961) Remove remaining deprecations in document package

2009-10-08 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763685#action_12763685
 ] 

Michael Busch commented on LUCENE-1961:
---

I committed a fix for this to the back-compat branch but didn't create a new 
tag yet. Shall I create a new one?

> Remove remaining deprecations in document package
> -
>
> Key: LUCENE-1961
> URL: https://issues.apache.org/jira/browse/LUCENE-1961
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Other
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.0
>
> Attachments: lucene-1961.patch
>
>
> Remove different deprecated APIs:
> - Field.Index.NO_NORMS, etc.
> - Field.binaryValue()
> - getOmitTf()/setOmitTf()

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1959) Index Splitter

2009-10-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763682#action_12763682
 ] 

Mark Miller commented on LUCENE-1959:
-

Nice! Let's add it to the mix. I'm guessing Jason's is quite a bit faster for
splitting segments, and this one is nicer in that it can split individual
segments. Do we keep two tools or merge them into one with options?

> Index Splitter
> --
>
> Key: LUCENE-1959
> URL: https://issues.apache.org/jira/browse/LUCENE-1959
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Trivial
> Fix For: 3.0
>
> Attachments: LUCENE-1959.patch, LUCENE-1959.patch, mp-splitter.patch
>
>
> If an index has multiple segments, this tool allows splitting those segments 
> into separate directories.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1959) Index Splitter

2009-10-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763683#action_12763683
 ] 

Uwe Schindler commented on LUCENE-1959:
---

Really cool!

> Index Splitter
> --
>
> Key: LUCENE-1959
> URL: https://issues.apache.org/jira/browse/LUCENE-1959
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Trivial
> Fix For: 3.0
>
> Attachments: LUCENE-1959.patch, LUCENE-1959.patch, mp-splitter.patch
>
>
> If an index has multiple segments, this tool allows splitting those segments 
> into separate directories.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org


