Re: lucene index file randomly crash and need to reindex

2010-01-13 Thread Michael McCandless
If you follow the rules Otis listed, you should never hit index
corruption, unless something is wrong with your hardware.

Or, if you hit an as-yet-undiscovered bug in Lucene ;)

Mike

On Wed, Jan 13, 2010 at 1:11 AM, zhang99  wrote:
>
> What is the longest time you have ever kept an index file without being
> required to reindex? I notice even a big open source project like Liferay
> suffers from this.
> Thanks for the tips.
> --
> View this message in context: 
> http://old.nabble.com/lucene-index-file-randomly-crash-and-need-to-reindex-tp27139147p27139613.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Supported way to get segment from IndexWriter?

2010-01-13 Thread Michael McCandless
Indeed, getReader is an expensive way to get the segment count (it
flushes the current RAM buffer to disk as a new segment).

Since SegmentInfos is now public, you could use SegmentInfos.read to
read the current segments_N file, and then call its .size() method?

But, this will only count as of the last commit... which is probably
not sufficient for SOLR-1559?

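Something like this, maybe -- an untested sketch against the 2.9/3.0-era
API, where "dir" is the index Directory:

// import org.apache.lucene.index.SegmentInfos;
SegmentInfos infos = new SegmentInfos();
infos.read(dir);                  // reads the current segments_N file
int segmentCount = infos.size();  // segment count as of the last commit
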
We could simply make getSegmentCount public / expert / not only for tests?

Mike

On Tue, Jan 12, 2010 at 8:42 PM, Chris Hostetter
 wrote:
>
> A conversation with someone earlier today got me thinking about cranking out
> a patch for SOLR-1559 (in which the goal is to allow for rules to determine
> the input to optimize(maxNumSegments) instead of requiring a fixed integer
> value as input) when I realized that I wasn't certain what "approved"
> methods there might be for determining the current number of segments from
> an IndexWriter.
>
> I see IndexWriter.getSegmentCount() but it's package protected (with a
> comment that it exists for tests).  So my best guess using only public APIs
> would be something like...
>
>  int numCurrentSegments = -1;
>  IndexReader r = writer.getReader();
>  try {
>    IndexReader[] tmp = r.getSequentialSubReaders();
>    numCurrentSegments = (null == tmp) ? 1 : tmp.length;
>  } finally {
>    r.close();
>  }
>
> Is there a better way?
>
> (My main concern about this approach being that my intuition (which seems
> supported by the javadocs) is that getReader might be a little
> expensive/excessive just to count the segments)
>
> -Hoss
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Text extraction from ms word doc

2010-01-13 Thread Michael McCandless
We could also fix WhitespaceAnalyzer to filter that character out?
(Or you could make your own analyzer to do so...).

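If you go the custom-analyzer route, a rough (untested) sketch against the
3.0-era CharTokenizer API could look like this -- the class name is made up,
and it simply treats control characters as token breaks too:

// import java.io.Reader;
// import org.apache.lucene.analysis.CharTokenizer;
public final class PrintableWhitespaceTokenizer extends CharTokenizer {
  public PrintableWhitespaceTokenizer(Reader in) {
    super(in);
  }
  @Override
  protected boolean isTokenChar(char c) {
    // break on whitespace and on non-printable control characters
    return !Character.isWhitespace(c) && !Character.isISOControl(c);
  }
}
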
You could also try asking on the tika-user list whether Tika has a
solution for mapping "extended" whitespace characters...

Mike

On Mon, Jan 11, 2010 at 3:04 PM, maxSchlein  wrote:
>
> I was looking for an option for Text extraction from a word doc.
>
> Currently I am using POI; however, when there is a table in the doc, for
> each column POI brings back a non-printing character.  The whitespace analyzer
> is not filtering out this character.  So whatever word or phrase is the last
> word or phrase within a table column is not found during searching.  That is,
> if the word dog is the only word in a column, a search for the word dog would
> return nothing, because the word that was indexed was "dog" plus that character.
>
> I can create a filter to fix this, using Apache's
> StringUtils.isAsciiPrintable, but I would rather not.
>
> Any and all help is welcome and thanked.
> --
> View this message in context: 
> http://old.nabble.com/Text-extraction-from-ms-word-doc-tp27116739p27116739.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Is there a way to limit the size of an index?

2010-01-13 Thread Michael McCandless
On Sun, Jan 10, 2010 at 7:33 AM, Dvora  wrote:
>
> I'm storing and reading the documents using Compass, not Lucene directly. I
> didn't touch those parameters, so I guess the default values are being used
> (I do see cfs files in the index).

OK.  If your index directory has *.cfs files, then you are using the
default compound file format (I'm not sure whether Compass changes
that default either).

> How does the ramBufferSizeMB parameter affect the file size? What value should I
> use in order to have 6MB files?

ramBufferSizeMB controls how big the initially created segments are.
Just how big a segment you get for a given ramBufferSizeMB is very app
dependent, because the RAM efficiency of IndexWriter depends on things
like whether you have many unique terms (= worse RAM efficiency).
Generally larger RAM buffers have better efficiency.

Merging can only create bigger segments from those flushed segments.

So, you have to ensure ramBufferSizeMB is set such that in your use
case it never up and flushes a segment bigger than your 10 MB size
limit.

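If you were driving IndexWriter directly (Compass may not expose these
knobs), the setting looks roughly like this -- the value is illustrative
only, and dir/analyzer are placeholders:

IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
writer.setRAMBufferSizeMB(4.0); // flush small segments, well under the ~10 MB cap
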
Mike

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Is there any difference in a document between one added field with a number of terms and a field added a number of times ?

2010-01-13 Thread Paul Taylor
So not much help here (I wonder if it's because I posted 3 questions in
one day), but I've made some progress in my understanding.


I understand there is only one norm per field, and I think Lucene does not
differentiate between adding the same field a number of times and adding
multiple texts to the same field. But I've discovered
getPositionIncrementGap() to separate my multiple adds of the same field
within a doc, and I was wondering if there was a way I could use the
position gap to get DefaultSimilarity.lengthNorm() to be called with only
the number of tokens within one field instance passed to it, rather than
the complete terms within the field as a whole.


Paul

Paul Taylor wrote:
Thanks Felipe, but you  are missing the point Artist really doesnt 
come into it, my problem is confined to the alias field, forget about 
artist its just detailed to give the complete scenario


Paul

Felipe wrote:
You could change the boost of the field artist to be bigger than the 
field alias.

field.setBoost(artistBoost);


2010/1/12 Paul Taylor :


Been doing some analysis with Luke (BTW, it doesn't work with
StandardAnalyzer since the Version field was introduced) and discovered a
problem with field length boosting for me.

I have a document that represents a recording artist (e.g. Madonna,
The Beatles, etc.). It contains an artist and an alias field; the
alias field contains other names that the artist may be known
by, and so there can be multiple aliases for an artist.

PseudoCode:
(
doc.addField(ArtistIndexField.ARTIST, rs.getString("name"));
for (String alias : aliases.get(artistId)) {
doc.addField(ArtistIndexField.ALIAS, alias);
}
)

I'm finding that when I search for the artist by the alias field,
if the value matches an alias in two different documents, the
document with the fewest aliases gets the best score,
because the boost of the alias field is split between the aliases on the
other doc. If I use ANALYZED_NO_NORMS then both documents return the
same score.

The trouble is I don't want to disable norms, because I want a
match on a single field containing fewer terms to score better than
one with more terms.

Full example:


http://musicbrainz.org/search/textsearch.html?query=minihamuzu&type=artist&limit=25&adv=on&handlearguments=1 

returns two results; the second result only has a score of 8 because
it has more aliases than the first result, even though the alias it matched
on was an exact single-term match.
http://musicbrainz.org/show/artist/aliases.html?artistid=174327

but if I remove norms then the following query (which is currently
working)


http://musicbrainz.org/search/textsearch.html?query=%22the+beatles%22&type=artist&limit=25&adv=on&handlearguments=1 

would stop working, in that searching for 'The Beatles' would no
longer score the artist 'The Beatles' better than 'The Beatles
Revival Band'.

So isn't there any way to recognise that repeated calls to
addField() are not creating a single field with many terms, but many
fields with few terms?

thanks Paul





-

To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org

For additional commands, e-mail: java-user-h...@lucene.apache.org





--
Felipe Lobo
www.jusbrasil.com.br 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Field creation with TokenStream and stored value

2010-01-13 Thread Benjamin Heilbrunn
Sorry for pushing this.

Would it be possible to add the requested constructor, or would it break
anything in Lucene's logic?


2010/1/11 Benjamin Heilbrunn :
> Hey out there,
>
> in lucene it's not possible to create a Field based on a TokenStream
> AND supply a stored value.
>
> Is there a reason why a Field constructor in the form of
>   public Field(String name, TokenStream tokenStream, String storedValue)
> does not exist?
>
> I am using trees of TeeSinkTokenFilters for the creation of many
> fields, based on a source string.
> That's the reason why I can't use the "standard constructor"
>    public Field(String name, String value, Store store, Index index)
>
>
> Regards
> Benjamin
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Field creation with TokenStream and stored value

2010-01-13 Thread Uwe Schindler
Why not simply add the field twice, one time with TokenStream, one time stored 
only? Internally stored/indexed fields are handled like that.

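Roughly like this (a sketch; the field name and variables are placeholders):

Document doc = new Document();
// indexed-only instance, driven by the TokenStream (e.g. a TeeSinkTokenFilter sink)
doc.add(new Field("content", myTokenStream));
// stored-only instance of the same field, carrying the original value
doc.add(new Field("content", storedValue, Field.Store.YES, Field.Index.NO));
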
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Benjamin Heilbrunn [mailto:ben...@gmail.com]
> Sent: Wednesday, January 13, 2010 2:31 PM
> To: java-user@lucene.apache.org
> Subject: Re: Field creation with TokenStream and stored value
> 
> Sorry for pushing this thing.
> 
> Would it be possible to add the demanded constructor or would it break
> anything of lucenes logic?
> 
> 
> 2010/1/11 Benjamin Heilbrunn :
> > Hey out there,
> >
> > in lucene it's not possible to create a Field based on a TokenStream
> > AND supply a stored value.
> >
> > Is there a reason why a Field constructor in the form of
> >   public Field(String name, TokenStream tokenStream, String
> storedValue)
> > does not exist?
> >
> > I am using trees of TeeSinkTokenFilter's for the creation of many
> > fields, based on a source string.
> > That's the reason why I can't use the "standardconstructor"
> >public Field(String name, String value, Store store, Index index)
> >
> >
> > Regards
> > Benjamin
> >
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



NullPointerExc in CloseableThreadLocal... (Lucene 3.0.0)

2010-01-13 Thread Konstantyn Smirnov

Hi all

Consider the following piece of code:

Searcher s = this.getSearcher()
def hits = s.search( query, filter, params.offset + params.max, sort )

for( hit in hits.scoreDocs[ lower..<upper ] ) {
   def obj = binder( s.doc( hit.doc ), hit.doc ) // << here the NPE is thrown
}

the code *sporadically* throws the following:

        ... 6 more
Caused by: java.lang.NullPointerException
        at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:64)
        at org.apache.lucene.index.SegmentReader.getFieldsReader(SegmentReader.java:778)
        at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:879)
        at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:518)
        at org.apache.lucene.index.IndexReader.document(IndexReader.java:658)
        at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:144)
        at SearchableService.asResult(SearchableService.groovy:128)

I assume that the reason for that could be indexing of a new document /
expungeDeletes() / re-opening the IndexReader.

Can you elaborate on this? Shall I implement/configure some sort of
time-out?

Thanks in advance

-
Konstantyn Smirnov, CTO
http://www.poiradar.ru
http://www.poiradar.com.ua
http://www.poiradar.com
http://www.poiradar.de
-- 
View this message in context: 
http://old.nabble.com/NullPointerExc-in-CloseableThreadLocal...-%28Lucene-3.0.0%29-tp27145825p27145825.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Field creation with TokenStream and stored value

2010-01-13 Thread Benjamin Heilbrunn
Thanks!
Didn't know that it's so easy ;)

2010/1/13 Uwe Schindler :
> Why not simply add the field twice, one time with TokenStream, one time 
> stored only? Internally stored/indexed fields are handled like that.
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>> -Original Message-
>> From: Benjamin Heilbrunn [mailto:ben...@gmail.com]
>> Sent: Wednesday, January 13, 2010 2:31 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Field creation with TokenStream and stored value
>>
>> Sorry for pushing this thing.
>>
>> Would it be possible to add the demanded constructor or would it break
>> anything of lucenes logic?
>>
>>
>> 2010/1/11 Benjamin Heilbrunn :
>> > Hey out there,
>> >
>> > in lucene it's not possible to create a Field based on a TokenStream
>> > AND supply a stored value.
>> >
>> > Is there a reason why a Field constructor in the form of
>> >   public Field(String name, TokenStream tokenStream, String
>> storedValue)
>> > does not exist?
>> >
>> > I am using trees of TeeSinkTokenFilter's for the creation of many
>> > fields, based on a source string.
>> > That's the reason why I can't use the "standardconstructor"
>> >    public Field(String name, String value, Store store, Index index)
>> >
>> >
>> > Regards
>> > Benjamin
>> >
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Field creation with TokenStream and stored value

2010-01-13 Thread Andrzej Bialecki

On 2010-01-13 15:29, Benjamin Heilbrunn wrote:

Thanks!
Didn't know that it's so easy ;)

2010/1/13 Uwe Schindler:

Why not simply add the field twice, one time with TokenStream, one time stored 
only? Internally stored/indexed fields are handled like that.


Actually, you can implement your own Fieldable and return what you want
from its methods. You can also use the Field constructor that takes the
stored value, and then use Field.setTokenStream(TokenStream) - it
doesn't override the stored value.

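The second option, sketched (untested; "doc", "storedValue" and
"myTokenStream" are placeholders):

Field f = new Field("content", storedValue, Field.Store.YES, Field.Index.ANALYZED);
f.setTokenStream(myTokenStream); // tokens come from the stream; the stored value stays intact
doc.add(f);
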


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Extracting contact data

2010-01-13 Thread Ortelli, Gian Luca
Hi community,

I have a general understanding of Lucene concepts, and I'm wondering if
it's the right tool for my job:

- I need to extract data like e.g. time intervals ("8am - 12pm") and street
addresses from a set of files. The common issue with these data units is
that they contain spaces and are not always definable through regexes.

- the extraction must take into consideration "proximity": for
example, a mail address which is close to the word "Contacts" will
receive a higher rank, since I'm looking for contact data.

Do you think I can get any advantage from building a solution on Lucene?

  Gianluca



Re: NullPointerExc in CloseableThreadLocal... (Lucene 3.0.0)

2010-01-13 Thread Michael McCandless
Is it possible you are closing the searcher before / while running
that for loop?

Mike

On Wed, Jan 13, 2010 at 9:26 AM, Konstantyn Smirnov  wrote:
>
> Hi all
>
> Consider the following piece of code:
>
> Searcher s = this.getSearcher()
> def hits = s.search( query, filter, params.offset + params.max, sort )
>
> for( hit in hits.scoreDocs[ lower..<upper ] ) {
>   def obj = binder( s.doc( hit.doc ), hit.doc ) // << here the NPE is
> thrown
> }
>
> the code *sporadically* throws the following:
>
>        ... 6 more
> Caused by: java.lang.NullPointerException
>        at
> org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:64)
>        at
> org.apache.lucene.index.SegmentReader.getFieldsReader(SegmentReader.java:778)
>        at
> org.apache.lucene.index.SegmentReader.document(SegmentReader.java:879)
>        at
> org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:518)
>        at
> org.apache.lucene.index.IndexReader.document(IndexReader.java:658)
>        at
> org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:144)
>        at SearchableService.asResult(SearchableService.groovy:128)
>
> I assume that the reason for that could be indexing of a new document /
> expungeDeletes() / re-opening the IndexReader.
>
> Can you elaborate on this? Shall I implement/configure some sort of
> time-out?
>
> Thanks in advance
>
> -
> Konstantyn Smirnov, CTO
> http://www.poiradar.ru www.poiradar.ru
> http://www.poiradar.com.ua www.poiradar.com.ua
> http://www.poiradar.com www.poiradar.com
> http://www.poiradar.de www.poiradar.de
> --
> View this message in context: 
> http://old.nabble.com/NullPointerExc-in-CloseableThreadLocal...-%28Lucene-3.0.0%29-tp27145825p27145825.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Extracting contact data

2010-01-13 Thread Karl Wettin
Lucene will probably only be helpful if you know what you are looking
for, e.g. that you search for a given person, a given street and given
time intervals.

Is this what you want to do?

If you instead are looking for a way to really extract any person,
street and time interval that a document is associated with, you
probably want to look for a natural language processing project that
can do something like semantic part-of-speech tagging for you.


  karl

On 13 Jan 2010, at 17.39, Ortelli, Gian Luca wrote:


Hi community,



I have a general understanding of Lucene concepts, and I'm wondering  
if

it's the right tool for my job:



- I need to extract data like e.g. time intervals ("8am - 12pm"),  
street

addresses from a set of files. The common issue with this data unit is
that they contain spaces and are not always definable through regexes.



- the extraction must take into consideration the "proximity": for
example, a mail address which is close to the work "Contacts" will
receive a higher rank, since I'm looking for contact data.



Do you think I can get any advantage from building a solution on  
Lucene?




 Gianluca




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Extracting contact data

2010-01-13 Thread Erick Erickson
Before answering: how do you measure "proximity"? You can make
Lucene work with locations (there's an example in Lucene in Action)
readily enough, though...

HTH
Erick

On Wed, Jan 13, 2010 at 11:39 AM, Ortelli, Gian Luca <
gianluca.orte...@truvo.com> wrote:

> Hi community,
>
>
>
> I have a general understanding of Lucene concepts, and I'm wondering if
> it's the right tool for my job:
>
>
>
> - I need to extract data like e.g. time intervals ("8am - 12pm"), street
> addresses from a set of files. The common issue with this data unit is
> that they contain spaces and are not always definable through regexes.
>
>
>
> - the extraction must take into consideration the "proximity": for
> example, a mail address which is close to the work "Contacts" will
> receive a higher rank, since I'm looking for contact data.
>
>
>
> Do you think I can get any advantage from building a solution on Lucene?
>
>
>
>  Gianluca
>
>


RangeFilter

2010-01-13 Thread AlexElba

Hello,

I am currently using Lucene 2.4 and have documents with 3 fields:

id
name
rank

and have a query and filter. When I try to use a range filter on rank, I am
not getting any results back:

RangeFilter rangeFilter = new RangeFilter("rank", "3", "10", true, true);

I have documents which are in this interval.


Any suggestions on what I am doing wrong?

Regards




-- 
View this message in context: 
http://old.nabble.com/RangeFilter-tp27148785p27148785.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: RangeFilter

2010-01-13 Thread Steven A Rowe
Hi AlexElba,

The problem is that Lucene only knows how to handle character strings, not 
numbers.  Lexicographically, "3" > "10", so you get the expected results 
(nothing).

The standard thing to do is transform your numbers into strings that sort as 
you want them to.  E.g., you can left-pad the "rank" field values with zeroes: 
"03", "04", ..., "10", and then create a RangeFilter over "03" .. "10".  You 
will of course need to left-zero-pad to at least the maximum character length 
of the largest rank.

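As a sketch (two-digit padding to match the examples above; "doc" and "rank"
are placeholders, and NOT_ANALYZED keeps each value a single term):

// at index time: left-pad so lexicographic order matches numeric order
String padded = String.format("%02d", rank);   // 3 -> "03", 10 -> "10"
doc.add(new Field("rank", padded, Field.Store.NO, Field.Index.NOT_ANALYZED));

// at query time: pad the bounds the same way
RangeFilter rangeFilter = new RangeFilter("rank", "03", "10", true, true);
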
Facilities to handle this problem are available in NumberTools:



(Note that NumberTools converts longs to base-36 fixed-length padded strings.)

More info here:

   

Steve

On 01/13/2010 at 12:51 PM, AlexElba wrote:
> 
> Hello,
> 
> I am currently using lucene 2.4 and have document with 3 fields
> 
> id
> name
> rank
> 
> and have query and filter when I am trying to use rang filter on rank I
> am not getting any result back
> 
> RangeFilter rangeFilter = new RangeFilter("rank", "3", "10", true, true);
> 
> I have documents which are in this interval
> 
> 
> Any suggestion what am I doing wrong?
> 
> Regards

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: RangeFilter

2010-01-13 Thread Michael McCandless
Actually, as of Lucene 2.9 (if you can upgrade), you should use
NumericField to index numerics and NumericRangeQuery to do range
search/filter -- it all just works -- no more padding.

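For reference, the 2.9+ version looks roughly like this ("doc" and "rank"
are placeholders):

// import org.apache.lucene.document.NumericField;
// import org.apache.lucene.search.NumericRangeFilter;
doc.add(new NumericField("rank").setIntValue(rank));  // trie-encoded numeric field

NumericRangeFilter<Integer> filter =
    NumericRangeFilter.newIntRange("rank", 3, 10, true, true);  // no padding needed
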
Mike

On Wed, Jan 13, 2010 at 1:17 PM, Steven A Rowe  wrote:
> Hi AlexElba,
>
> The problem is that Lucene only knows how to handle character strings, not 
> numbers.  Lexicographically, "3" > "10", so you get the expected results 
> (nothing).
>
> The standard thing to do is transform your numbers into strings that sort as 
> you want them to.  E.g., you can left-pad the "rank" field values with 
> zeroes: "03", "04", ..., "10", and then create a RangeFilter over "03" .. 
> "10".  You will of course need to left-zero-pad to at least the maximum 
> character length of the largest rank.
>
> Facilities to handle this problem are available in NumberTools:
>
> 
>
> (Note that NumberTools converts longs to base-36 fixed-length padded strings.)
>
> More info here:
>
>   
>
> Steve
>
> On 01/13/2010 at 12:51 PM, AlexElba wrote:
>>
>> Hello,
>>
>> I am currently using lucene 2.4 and have document with 3 fields
>>
>> id
>> name
>> rank
>>
>> and have query and filter when I am trying to use rang filter on rank I
>> am not getting any result back
>>
>> RangeFilter rangeFilter = new RangeFilter("rank", "3", "10", true, true);
>>
>> I have documents which are in this interval
>>
>>
>> Any suggestion what am I doing wrong?
>>
>> Regards
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NullPointerExc in CloseableThreadLocal... (Lucene 3.0.0)

2010-01-13 Thread Konstantyn Smirnov

Thanks for the answer Mike

indeed it is possible, but practically...

I start the loop immediately after searcher.search(), and with my index size
of 3 MB, the whole operation takes max 100 ms. Given the rate of like 50
updates - addDocument()/expungeDeletes()/IR.reopen() per day, the
probability is really low, I think, because the update takes also about 100
ms...

Anyway, it would be worth trying some IR reopen lock. Do you have any idea
on that?

-
Konstantyn Smirnov, CTO 
http://www.poiradar.ru
http://www.poiradar.com.ua
http://www.poiradar.com
http://www.poiradar.de
-- 
View this message in context: 
http://old.nabble.com/NullPointerExc-in-CloseableThreadLocal...-%28Lucene-3.0.0%29-tp27145825p27149158.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Performance Results on changing the way fields are stored

2010-01-13 Thread Paul Taylor

Grant Ingersoll wrote:

On Jan 5, 2010, at 7:44 AM, Paul Taylor wrote:

  

So currently in my index I index and store a number of small fields. I need
both so I can search on the fields, then I use the stored versions to generate
the output document (which is either an XML or JSON representation). Because I
read that stored and indexed fields are dealt with completely separately, I tried
another tack, only storing one field which was a serialized version of the
output document. This solves a couple of issues I was having, but I was
disappointed that both the size of the index increased and the index build
time increased. I thought that if all the stored data was held in one field
the resultant index would be smaller, and I didn't expect index time to
increase by as much as it did. I was also surprised that Java serialization was
slower and used more space than both JSON and XML serialization.

Results as Follows

Type                                                               : Time : Index Size
Only indexed, no norms                                             : 105  : 38 MB
Only indexed                                                       : 111  : 43 MB
Same fields written as Indexed and Stored (current situation)      : 115  : 83 MB
Fields Indexed, one JAXB class Stored using JSON Marshalling       : 140  : 115 MB
Fields Indexed, one JAXB class Stored using XML Marshalling        : 189  : 198 MB
Fields Indexed, one JAXB class Stored using Java Serialization     : 305  : 485 MB



How much more verbose are these than the "raw" content?  Even as terse as JSON is, it is still verbose compared to a binary format, and XML Marshalling and Java Serialization will be even more.  Given that you are likely only displaying 10 or so at a time, I'd think it would be much more efficient to only store the minimal amount needed to recreate the docs in the current result set.  

  
Yes, in the end I came to the conclusion to just stick with the current
situation, except for cases where I have sets of related fields that
would otherwise necessitate holding 'placeholder' fields, in which case
I've used JSON.

I've also seen people have success simply storing a key in Lucene that is then 
used for lookup in something like Memcachedb, Tokyo Cabinet or one of the many 
other key-value stores.
  
In my situation 90% of the fields stored are also required for
searching, so they are held in the search index anyway, so there is not
much point moving the stored version into a memcache.


thanks Paul

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: RangeFilter

2010-01-13 Thread AlexElba

Thanks Steve.

Mike, for now I cannot upgrade...
-- 
View this message in context: 
http://old.nabble.com/RangeFilter-tp27148785p27151315.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Problem: Indexing and searching repeating groups of fields

2010-01-13 Thread TJ Kolev
Greetings,

Let's assume I have to index and search "resume" documents. Two fields are
defined: Language and Years. The fields are associated together in a group
called Experience. A resume document may have 0 or more Experience groups:

Ra{ E1(Java,5); E2(C,2); E3(PHP,3);}
Rb{ E1(Java,2); E2(C,5); E3(VB,1);}

How do I index such documents, and how do I search, so I can formulate a
query like this "Resumes which have (Java,5) and (C,2)" and get back Ra. I
know I can index multiple fields of the same name, and do "(Language:Java
AND Years:5) AND (Language:C AND Years:2)", but in addition to Ra that would
also return Rb, which I don't want. The problem here is that the "grouping"
is lost. I can create fields with compound names (E1Language, E1Years,
E2Language, E2Years, etc), but that helps me none, as I don't know which
group to search. I'd also like to query for "(Language:Java AND Years:5) OR
(Language:C AND Years:2)"

This is a simplified example. Real documents may have 30 - 40 groups, each
one with several fields. Putting all the fields in a group in one index
field won't work, as the numeric/date ones should be available for range
searches.

So far the way I see it is to do my own post processing on the results. The
issue is that text fields will need to be untokenized, or otherwise it would
be difficult to work on the result, and determine what matches.

Thank you.
tjk :)


Re: Problem: Indexing and searching repeating groups of fields

2010-01-13 Thread Erick Erickson
One approach would be to do this with multi-valued fields. The
idea here is to index all your E fields in the *same* Lucene
field with an increment gap (see getPositionIncrementGap) > 1.

For this example, assume getPositionIncrementGap returns 100.

Then, for your documents you have something like
doc.add(new Field("experience", "java,5" blah blah));
doc.add(new Field("experience", "C,2" blah blah));
doc.add(new Field("experience", "PHP,3" blah blah));

Then you do proximity searches with a slop of < 100.

The trick is that, the above tokens are positioned (roughly)
1 - java
2 - 5
102 - c
103 - 2
203 - php
204 - 3

Of course you have to override a suitable analyzer to break
your tokens up appropriately.

Now a query (SpanNear? Proximity? your choice) of the
form "java 5"~90 AND "c 2"~90 should only return Ra.

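A rough, untested sketch of the whole thing (field names, tokens and the
gap/slop values are illustrative only; imports omitted):

// analyzer that keeps the groups 100 positions apart
Analyzer analyzer = new WhitespaceAnalyzer() {
  @Override
  public int getPositionIncrementGap(String fieldName) {
    return 100;
  }
};

// one value of the same field per Experience group
Document doc = new Document();
doc.add(new Field("experience", "java 5", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("experience", "c 2", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("experience", "php 3", Field.Store.NO, Field.Index.ANALYZED));

// each (language, years) pair as a sloppy phrase that cannot span groups
PhraseQuery java5 = new PhraseQuery();
java5.add(new Term("experience", "java"));
java5.add(new Term("experience", "5"));
java5.setSlop(90);

PhraseQuery c2 = new PhraseQuery();
c2.add(new Term("experience", "c"));
c2.add(new Term("experience", "2"));
c2.setSlop(90);

BooleanQuery query = new BooleanQuery();
query.add(java5, BooleanClause.Occur.MUST);
query.add(c2, BooleanClause.Occur.MUST);
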
HTH
Erick

On Wed, Jan 13, 2010 at 3:59 PM, TJ Kolev  wrote:

> Greetings,
>
> Let's assume I have to index and search "resume" documents. Two fields are
> defined: Language and Years. The fields are associated together in a group
> called Experience. A resume document may have 0 or more Experience groups:
>
> Ra{ E1(Java,5); E2(C,2); E3(PHP,3);}
> Rb{ E1(Java,2); E2(C,5); E3(VB,1);}
>
> How do I index such documents, and how do I search, so I can formulate a
> query like this "Resumes which have (Java,5) and (C,2)" and get back Ra. I
> know I can index multiple fields of the same name, and do "(Language:Java
> AND Years:5) AND (Language:C AND Years:2)", but in addition to Ra that
> would
> also return Rb, which I don't want. The problem here is that the "grouping"
> is lost. I can create fields with compound names (E1Language, E1Years,
> E2Language, E2Years, etc), but that helps me none, as I don't know which
> group to search. I'd also like to query for "(Language:Java AND Years:5) OR
> (Language:C AND Years:2)"
>
> This is a simplified example. Real documents may have 30 - 40 groups, each
> one with several fields. Putting all the fields in a group in one index
> field won't work as the numeric/date ones should be available for range
> searchers.
>
> So far the way I see it is to do my own post processing on the results. The
> issue is that text fields will need to be untokenized, or otherwise it
> would
> be difficult to work on the result, and determine what matches.
>
> Thank you.
> tjk :)
>


RE: Problem: Indexing and searching repeating groups of fields

2010-01-13 Thread Digy
How about using languages as fieldnames?
Doc1(Ra):
Java:5
C:2
PHP:3

Doc2(Rb)
Java:2
C:5
VB:1

Query:Java:5 AND C:2

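i.e. roughly (NOT_ANALYZED so each year value stays a single term; "doc" is
a placeholder):

doc.add(new Field("Java", "5", Field.Store.NO, Field.Index.NOT_ANALYZED));
doc.add(new Field("C", "2", Field.Store.NO, Field.Index.NOT_ANALYZED));
// then query: +Java:5 +C:2
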
DIGY

-Original Message-
From: TJ Kolev [mailto:tjko...@gmail.com] 
Sent: Wednesday, January 13, 2010 11:00 PM
To: java-user@lucene.apache.org
Subject: Problem: Indexing and searching repeating groups of fields

Greetings,

Let's assume I have to index and search "resume" documents. Two fields are
defined: Language and Years. The fields are associated together in a group
called Experience. A resume document may have 0 or more Experience groups:

Ra{ E1(Java,5); E2(C,2); E3(PHP,3);}
Rb{ E1(Java,2); E2(C,5); E3(VB,1);}

How do I index such documents, and how do I search, so I can formulate a
query like this "Resumes which have (Java,5) and (C,2)" and get back Ra. I
know I can index multiple fields of the same name, and do "(Language:Java
AND Years:5) AND (Language:C AND Years:2)", but in addition to Ra that would
also return Rb, which I don't want. The problem here is that the "grouping"
is lost. I can create fields with compound names (E1Language, E1Years,
E2Language, E2Years, etc), but that helps me none, as I don't know which
group to search. I'd also like to query for "(Language:Java AND Years:5) OR
(Language:C AND Years:2)"

This is a simplified example. Real documents may have 30 - 40 groups, each
one with several fields. Putting all the fields in a group in one index
field won't work as the numeric/date ones should be available for range
searchers.

So far the way I see it is to do my own post processing on the results. The
issue is that text fields will need to be untokenized, or otherwise it would
be difficult to work on the result, and determine what matches.

Thank you.
tjk :)


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Max Segmentation Size when Optimizing Index

2010-01-13 Thread Trin Chavalittumrong
Hi,



I am trying to optimize the index, which merges different segments
together. Let's say the index folder is 1GB in total; I need each segment
to be no larger than 200MB. I tried to use *LogByteSizeMergePolicy* and
setMaxMergeMB(100) to ensure no segment after merging would be over 200MB.
However, I still see segments that are larger than 200MB. I did call
IndexWriter.optimize(20) to make sure there is a large enough number of
segments to allow each segment to be under 200MB.



Can someone let me know if I am using this right? Or any suggestion on how
to tackle this would be helpful.



Thanks,

Trin


Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Jason Rutherglen
Hi Trin,

There was recently a discussion about this: the max size is
for the before-merge segments, rather than the resultant merged
segment (if that makes sense). It'd be great if we had a merge
policy that limited the resultant merged segment, though that'd
be a rough approximation at best.

Jason

On Wed, Jan 13, 2010 at 1:36 PM, Trin Chavalittumrong  wrote:
> Hi,
>
>
>
> I am trying to optimize the index which would merge different segment
> together. Let say the index folder is 1Gb in total, I need each segmentation
> to be no larger than 200Mb. I tried to use *LogByteSizeMergePolicy *and
> setMaxMergeMB(100) to ensure no segment after merging would be 200Mb.
> However, I still see segment that are larger than 200Mb. I did call
> IndexWriter.optimize(20) to make sure there are enough number segmentation
> to allow each segment to be under 200Mb.
>
>
>
> Can someone let me know if I am using this right? Or any suggestion on how
> to tackle this would be helpful.
>
>
>
> Thanks,
>
> Trin
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Trin Chavalittumrong
Thanks, Jason.

Is my understanding correct that LogByteSizeMergePolicy.setMaxMergeMB(100)
will prevent merging of two segments that are larger than 100 MB each at
optimize time?

If so, why do you think I still see segments that are larger than 200 MB?



On Wed, Jan 13, 2010 at 1:43 PM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> Hi Trin,
>
> There was recently a discussion about this, the max size is
> for the before merge segments, rather than the resultant merged
> segment (if that makes sense). It'd be great if we had a merge
> policy that limited the resultant merged segment, though that'd
> by a rough approximation at best.
>
> Jason
>
> On Wed, Jan 13, 2010 at 1:36 PM, Trin Chavalittumrong 
> wrote:
> > Hi,
> >
> >
> >
> > I am trying to optimize the index which would merge different segment
> > together. Let say the index folder is 1Gb in total, I need each
> segmentation
> > to be no larger than 200Mb. I tried to use *LogByteSizeMergePolicy *and
> > setMaxMergeMB(100) to ensure no segment after merging would be 200Mb.
> > However, I still see segment that are larger than 200Mb. I did call
> > IndexWriter.optimize(20) to make sure there are enough number
> segmentation
> > to allow each segment to be under 200Mb.
> >
> >
> >
> > Can someone let me know if I am using this right? Or any suggestion on
> how
> > to tackle this would be helpful.
> >
> >
> >
> > Thanks,
> >
> > Trin
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Problem: Indexing and searching repeating groups of fields

2010-01-13 Thread Erick Erickson
Ooooh, isn't that easier. You just prompted me to think
that you don't even have to do that, just index the pairs as single
tokens (KeywordAnalyzer? but watch out for no case folding)...

On Wed, Jan 13, 2010 at 4:30 PM, Digy  wrote:

> How about using languages as fieldnames?
> Doc1(Ra):
>Java:5
>C:2
>PHP:3
>
> Doc2(Rb)
>Java:2
>C:5
>VB:1
>
> Query:Java:5 AND C:2
>
> DIGY
>
> -Original Message-
> From: TJ Kolev [mailto:tjko...@gmail.com]
> Sent: Wednesday, January 13, 2010 11:00 PM
> To: java-user@lucene.apache.org
> Subject: Problem: Indexing and searching repeating groups of fields
>
> Greetings,
>
> Let's assume I have to index and search "resume" documents. Two fields are
> defined: Language and Years. The fields are associated together in a group
> called Experience. A resume document may have 0 or more Experience groups:
>
> Ra{ E1(Java,5); E2(C,2); E3(PHP,3);}
> Rb{ E1(Java,2); E2(C,5); E3(VB,1);}
>
> How do I index such documents, and how do I search, so I can formulate a
> query like this "Resumes which have (Java,5) and (C,2)" and get back Ra. I
> know I can index multiple fields of the same name, and do "(Language:Java
> AND Years:5) AND (Language:C AND Years:2)", but in addition to Ra that
> would
> also return Rb, which I don't want. The problem here is that the "grouping"
> is lost. I can create fields with compound names (E1Language, E1Years,
> E2Language, E2Years, etc), but that helps me none, as I don't know which
> group to search. I'd also like to query for "(Language:Java AND Years:5) OR
> (Language:C AND Years:2)"
>
> This is a simplified example. Real documents may have 30 - 40 groups, each
> one with several fields. Putting all the fields in a group in one index
> field won't work as the numeric/date ones should be available for range
> searchers.
>
> So far the way I see it is to do my own post processing on the results. The
> issue is that text fields will need to be untokenized, or otherwise it
> would
> be difficult to work on the result, and determine what matches.
>
> Thank you.
> tjk :)
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Jason Rutherglen
Oh ok, you're asking about optimizing... I think that's a different
algorithm inside LogMergePolicy.  I think it ignores the maxMergeMB
param.

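To make that concrete, a sketch against the 2.9/3.0 API (the constructor
there takes the writer; none of this caps the size of the *resulting*
segments):

LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy(writer);
mp.setMaxMergeMB(100.0);  // only limits which segments are candidates for normal merges
writer.setMergePolicy(mp);

writer.optimize(20);      // optimize(int) targets a segment count, not a segment size
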
On Wed, Jan 13, 2010 at 1:49 PM, Trin Chavalittumrong  wrote:
> Thanks, Jason.
>
> Is my understanding correct that LogByteSizeMergePolicy.setMaxMergeMB(100)
> will prevent
> merging of two segments that is larger than 100 Mb each at the optimizing
> time?
>
> If so, why do think would I still see segment that is larger than 200 MB?
>
>
>
> On Wed, Jan 13, 2010 at 1:43 PM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> Hi Trin,
>>
>> There was recently a discussion about this, the max size is
>> for the before merge segments, rather than the resultant merged
>> segment (if that makes sense). It'd be great if we had a merge
>> policy that limited the resultant merged segment, though that'd
>> by a rough approximation at best.
>>
>> Jason
>>
>> On Wed, Jan 13, 2010 at 1:36 PM, Trin Chavalittumrong 
>> wrote:
>> > Hi,
>> >
>> >
>> >
>> > I am trying to optimize the index which would merge different segment
>> > together. Let say the index folder is 1Gb in total, I need each
>> segmentation
>> > to be no larger than 200Mb. I tried to use *LogByteSizeMergePolicy *and
>> > setMaxMergeMB(100) to ensure no segment after merging would be 200Mb.
>> > However, I still see segment that are larger than 200Mb. I did call
>> > IndexWriter.optimize(20) to make sure there are enough number
>> segmentation
>> > to allow each segment to be under 200Mb.
>> >
>> >
>> >
>> > Can someone let me know if I am using this right? Or any suggestion on
>> how
>> > to tackle this would be helpful.
>> >
>> >
>> >
>> > Thanks,
>> >
>> > Trin
>> >
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Trin Chavalittumrong
Do you mean the MergePolicy is only used during index time and will be ignored
by the optimize() process?


On Wed, Jan 13, 2010 at 1:57 PM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> Oh ok, you're asking about optimizing... I think that's a different
> algorithm inside LogMergePolicy.  I think it ignores the maxMergeMB
> param.
>
> On Wed, Jan 13, 2010 at 1:49 PM, Trin Chavalittumrong 
> wrote:
> > Thanks, Jason.
> >
> > Is my understanding correct that
> LogByteSizeMergePolicy.setMaxMergeMB(100)
> > will prevent
> > merging of two segments that is larger than 100 Mb each at the optimizing
> > time?
> >
> > If so, why do think would I still see segment that is larger than 200 MB?
> >
> >
> >
> > On Wed, Jan 13, 2010 at 1:43 PM, Jason Rutherglen <
> > jason.rutherg...@gmail.com> wrote:
> >
> >> Hi Trin,
> >>
> >> There was recently a discussion about this, the max size is
> >> for the before merge segments, rather than the resultant merged
> >> segment (if that makes sense). It'd be great if we had a merge
> >> policy that limited the resultant merged segment, though that'd
> >> by a rough approximation at best.
> >>
> >> Jason
> >>
> >> On Wed, Jan 13, 2010 at 1:36 PM, Trin Chavalittumrong  >
> >> wrote:
> >> > Hi,
> >> >
> >> >
> >> >
> >> > I am trying to optimize the index which would merge different segment
> >> > together. Let say the index folder is 1Gb in total, I need each
> >> segmentation
> >> > to be no larger than 200Mb. I tried to use *LogByteSizeMergePolicy
> *and
> >> > setMaxMergeMB(100) to ensure no segment after merging would be 200Mb.
> >> > However, I still see segment that are larger than 200Mb. I did call
> >> > IndexWriter.optimize(20) to make sure there are enough number
> >> segmentation
> >> > to allow each segment to be under 200Mb.
> >> >
> >> >
> >> >
> >> > Can someone let me know if I am using this right? Or any suggestion on
> >> how
> >> > to tackle this would be helpful.
> >> >
> >> >
> >> >
> >> > Thanks,
> >> >
> >> > Trin
> >> >
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Jason Rutherglen
There's a different method in LogMergePolicy that performs the
optimize... Right, so normal merging uses the findMerges method, then
there's a findMergeOptimize (method names could be inaccurate).

On Wed, Jan 13, 2010 at 2:29 PM, Trin Chavalittumrong  wrote:
> Do you mean MergePolicy is only used during index time and will be ignored
> by by the Optimize() process?
>
>
> On Wed, Jan 13, 2010 at 1:57 PM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> Oh ok, you're asking about optimizing... I think that's a different
>> algorithm inside LogMergePolicy.  I think it ignores the maxMergeMB
>> param.
>>
>> On Wed, Jan 13, 2010 at 1:49 PM, Trin Chavalittumrong 
>> wrote:
>> > Thanks, Jason.
>> >
>> > Is my understanding correct that
>> LogByteSizeMergePolicy.setMaxMergeMB(100)
>> > will prevent
>> > merging of two segments that is larger than 100 Mb each at the optimizing
>> > time?
>> >
>> > If so, why do think would I still see segment that is larger than 200 MB?
>> >
>> >
>> >
>> > On Wed, Jan 13, 2010 at 1:43 PM, Jason Rutherglen <
>> > jason.rutherg...@gmail.com> wrote:
>> >
>> >> Hi Trin,
>> >>
>> >> There was recently a discussion about this, the max size is
>> >> for the before merge segments, rather than the resultant merged
>> >> segment (if that makes sense). It'd be great if we had a merge
>> >> policy that limited the resultant merged segment, though that'd
>> >> by a rough approximation at best.
>> >>
>> >> Jason
>> >>
>> >> On Wed, Jan 13, 2010 at 1:36 PM, Trin Chavalittumrong > >
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> >
>> >> >
>> >> > I am trying to optimize the index which would merge different segment
>> >> > together. Let say the index folder is 1Gb in total, I need each
>> >> segmentation
>> >> > to be no larger than 200Mb. I tried to use *LogByteSizeMergePolicy
>> *and
>> >> > setMaxMergeMB(100) to ensure no segment after merging would be 200Mb.
>> >> > However, I still see segment that are larger than 200Mb. I did call
>> >> > IndexWriter.optimize(20) to make sure there are enough number
>> >> segmentation
>> >> > to allow each segment to be under 200Mb.
>> >> >
>> >> >
>> >> >
>> >> > Can someone let me know if I am using this right? Or any suggestion on
>> >> how
>> >> > to tackle this would be helpful.
>> >> >
>> >> >
>> >> >
>> >> > Thanks,
>> >> >
>> >> > Trin
>> >> >
>> >>
>> >> -
>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >>
>> >>
>> >
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Trin Chavalittumrong
Seems like optimize() only cares about the final number of segments rather than
the size of each segment. Is that so?

On Wed, Jan 13, 2010 at 2:35 PM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> There's a different method in LogMergePolicy that performs the
> optimize... Right, so normal merging uses the findMerges method, then
> there's a findMergeOptimize (method names could be inaccurate).
>
> On Wed, Jan 13, 2010 at 2:29 PM, Trin Chavalittumrong 
> wrote:
> > Do you mean MergePolicy is only used during index time and will be
> ignored
> > by by the Optimize() process?
> >
> >
> > On Wed, Jan 13, 2010 at 1:57 PM, Jason Rutherglen <
> > jason.rutherg...@gmail.com> wrote:
> >
> >> Oh ok, you're asking about optimizing... I think that's a different
> >> algorithm inside LogMergePolicy.  I think it ignores the maxMergeMB
> >> param.
> >>
> >> On Wed, Jan 13, 2010 at 1:49 PM, Trin Chavalittumrong  >
> >> wrote:
> >> > Thanks, Jason.
> >> >
> >> > Is my understanding correct that
> >> LogByteSizeMergePolicy.setMaxMergeMB(100)
> >> > will prevent
> >> > merging of two segments that is larger than 100 Mb each at the
> optimizing
> >> > time?
> >> >
> >> > If so, why do think would I still see segment that is larger than 200
> MB?
> >> >
> >> >
> >> >
> >> > On Wed, Jan 13, 2010 at 1:43 PM, Jason Rutherglen <
> >> > jason.rutherg...@gmail.com> wrote:
> >> >
> >> >> Hi Trin,
> >> >>
> >> >> There was recently a discussion about this, the max size is
> >> >> for the before merge segments, rather than the resultant merged
> >> >> segment (if that makes sense). It'd be great if we had a merge
> >> >> policy that limited the resultant merged segment, though that'd
> >> >> by a rough approximation at best.
> >> >>
> >> >> Jason
> >> >>
> >> >> On Wed, Jan 13, 2010 at 1:36 PM, Trin Chavalittumrong <
> mrt...@gmail.com
> >> >
> >> >> wrote:
> >> >> > Hi,
> >> >> >
> >> >> >
> >> >> >
> >> >> > I am trying to optimize the index which would merge different
> segment
> >> >> > together. Let say the index folder is 1Gb in total, I need each
> >> >> segmentation
> >> >> > to be no larger than 200Mb. I tried to use *LogByteSizeMergePolicy
> >> *and
> >> >> > setMaxMergeMB(100) to ensure no segment after merging would be
> 200Mb.
> >> >> > However, I still see segment that are larger than 200Mb. I did call
> >> >> > IndexWriter.optimize(20) to make sure there are enough number
> >> >> segmentation
> >> >> > to allow each segment to be under 200Mb.
> >> >> >
> >> >> >
> >> >> >
> >> >> > Can someone let me know if I am using this right? Or any suggestion
> on
> >> >> how
> >> >> > to tackle this would be helpful.
> >> >> >
> >> >> >
> >> >> >
> >> >> > Thanks,
> >> >> >
> >> >> > Trin
> >> >> >
> >> >>
> >> >> -
> >> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >> >>
> >> >>
> >> >
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Jason Rutherglen
Yes... You could hack LogMergePolicy to do something else.

I use optimise(numsegments:5) regularly on 80GB indexes that, if
optimized to 1 segment, would thrash the IO excessively.  This works
fine because 15-20GB indexes are plenty large and fast.

On Wed, Jan 13, 2010 at 2:44 PM, Trin Chavalittumrong  wrote:
> Seems like optimize() only cares about final number of segments rather than
> the size of the segment. Is it so?
>
> On Wed, Jan 13, 2010 at 2:35 PM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> There's a different method in LogMergePolicy that performs the
>> optimize... Right, so normal merging uses the findMerges method, then
>> there's a findMergeOptimize (method names could be inaccurate).
>>
>> On Wed, Jan 13, 2010 at 2:29 PM, Trin Chavalittumrong 
>> wrote:
>> > Do you mean MergePolicy is only used during index time and will be
>> ignored
>> > by by the Optimize() process?
>> >
>> >
>> > On Wed, Jan 13, 2010 at 1:57 PM, Jason Rutherglen <
>> > jason.rutherg...@gmail.com> wrote:
>> >
>> >> Oh ok, you're asking about optimizing... I think that's a different
>> >> algorithm inside LogMergePolicy.  I think it ignores the maxMergeMB
>> >> param.
>> >>
>> >> On Wed, Jan 13, 2010 at 1:49 PM, Trin Chavalittumrong > >
>> >> wrote:
>> >> > Thanks, Jason.
>> >> >
>> >> > Is my understanding correct that
>> >> LogByteSizeMergePolicy.setMaxMergeMB(100)
>> >> > will prevent
>> >> > merging of two segments that is larger than 100 Mb each at the
>> optimizing
>> >> > time?
>> >> >
>> >> > If so, why do think would I still see segment that is larger than 200
>> MB?
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Jan 13, 2010 at 1:43 PM, Jason Rutherglen <
>> >> > jason.rutherg...@gmail.com> wrote:
>> >> >
>> >> >> Hi Trin,
>> >> >>
>> >> >> There was recently a discussion about this, the max size is
>> >> >> for the before merge segments, rather than the resultant merged
>> >> >> segment (if that makes sense). It'd be great if we had a merge
>> >> >> policy that limited the resultant merged segment, though that'd
>> >> >> by a rough approximation at best.
>> >> >>
>> >> >> Jason
>> >> >>
>> >> >> On Wed, Jan 13, 2010 at 1:36 PM, Trin Chavalittumrong <
>> mrt...@gmail.com
>> >> >
>> >> >> wrote:
>> >> >> > Hi,
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > I am trying to optimize the index which would merge different
>> segment
>> >> >> > together. Let say the index folder is 1Gb in total, I need each
>> >> >> segmentation
>> >> >> > to be no larger than 200Mb. I tried to use *LogByteSizeMergePolicy
>> >> *and
>> >> >> > setMaxMergeMB(100) to ensure no segment after merging would be
>> 200Mb.
>> >> >> > However, I still see segment that are larger than 200Mb. I did call
>> >> >> > IndexWriter.optimize(20) to make sure there are enough number
>> >> >> segmentation
>> >> >> > to allow each segment to be under 200Mb.
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > Can someone let me know if I am using this right? Or any suggestion
>> on
>> >> >> how
>> >> >> > to tackle this would be helpful.
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > Thanks,
>> >> >> >
>> >> >> > Trin
>> >> >> >
>> >> >>
>> >> >> -
>> >> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >> >>
>> >> >>
>> >> >
>> >>
>> >> -
>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >>
>> >>
>> >
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Otis Gospodnetic
I think Jason meant "15-20GB segments"?
 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch





From: Jason Rutherglen 
To: java-user@lucene.apache.org
Sent: Wed, January 13, 2010 5:54:38 PM
Subject: Re: Max Segmentation Size when Optimizing Index

Yes... You could hack LogMergePolicy to do something else.

I use optimise(numsegments:5) regularly on 80GB indexes, that if
optimized to 1 segment, would thrash the IO excessively.  This works
fine because 15-20GB indexes are plenty large and fast.

On Wed, Jan 13, 2010 at 2:44 PM, Trin Chavalittumrong  wrote:
> Seems like optimize() only cares about final number of segments rather than
> the size of the segment. Is it so?
>
> On Wed, Jan 13, 2010 at 2:35 PM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> There's a different method in LogMergePolicy that performs the
>> optimize... Right, so normal merging uses the findMerges method, then
>> there's a findMergeOptimize (method names could be inaccurate).
>>
>> On Wed, Jan 13, 2010 at 2:29 PM, Trin Chavalittumrong 
>> wrote:
>> > Do you mean MergePolicy is only used during index time and will be
>> ignored
>> > by by the Optimize() process?
>> >
>> >
>> > On Wed, Jan 13, 2010 at 1:57 PM, Jason Rutherglen <
>> > jason.rutherg...@gmail.com> wrote:
>> >
>> >> Oh ok, you're asking about optimizing... I think that's a different
>> >> algorithm inside LogMergePolicy.  I think it ignores the maxMergeMB
>> >> param.
>> >>
>> >> On Wed, Jan 13, 2010 at 1:49 PM, Trin Chavalittumrong > >
>> >> wrote:
>> >> > Thanks, Jason.
>> >> >
>> >> > Is my understanding correct that
>> >> LogByteSizeMergePolicy.setMaxMergeMB(100)
>> >> > will prevent
>> >> > merging of two segments that is larger than 100 Mb each at the
>> optimizing
>> >> > time?
>> >> >
>> >> > If so, why do think would I still see segment that is larger than 200
>> MB?
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Jan 13, 2010 at 1:43 PM, Jason Rutherglen <
>> >> > jason.rutherg...@gmail.com> wrote:
>> >> >
>> >> >> Hi Trin,
>> >> >>
>> >> >> There was recently a discussion about this, the max size is
>> >> >> for the before merge segments, rather than the resultant merged
>> >> >> segment (if that makes sense). It'd be great if we had a merge
>> >> >> policy that limited the resultant merged segment, though that'd
>> >> >> be a rough approximation at best.
>> >> >>
>> >> >> Jason
>> >> >>
>> >> >> On Wed, Jan 13, 2010 at 1:36 PM, Trin Chavalittumrong <
>> mrt...@gmail.com
>> >> >
>> >> >> wrote:
>> >> >> > Hi,
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > I am trying to optimize the index which would merge different
>> segment
>> >> >> > together. Let say the index folder is 1Gb in total, I need each
>> >> >> segmentation
>> >> >> > to be no larger than 200Mb. I tried to use *LogByteSizeMergePolicy
>> >> *and
>> >> >> > setMaxMergeMB(100) to ensure no segment after merging would be
>> 200Mb.
>> >> >> > However, I still see segment that are larger than 200Mb. I did call
>> >> >> > IndexWriter.optimize(20) to make sure there are enough number
>> >> >> segmentation
>> >> >> > to allow each segment to be under 200Mb.
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > Can someone let me know if I am using this right? Or any suggestion
>> on
>> >> >> how
>> >> >> > to tackle this would be helpful.
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > Thanks,
>> >> >> >
>> >> >> > Trin
>> >> >> >
>> >> >>
>> >> >> -
>> >> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >> >>
>> >> >>
>> >> >
>> >>
>> >> -
>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >>
>> >>
>> >
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
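
A minimal sketch of the partial-optimize idea discussed in this thread, assuming a Lucene 2.9/3.0-era IndexWriter that is already open; the 200 MB target and the size arithmetic are illustrative rather than anything the thread prescribes:

    import java.io.IOException;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;

    /** Merge down so the average segment lands near targetBytes (a sketch, not a guarantee). */
    static void optimizeToTargetSize(IndexWriter writer, long targetBytes) throws IOException {
        Directory dir = writer.getDirectory();
        long indexBytes = 0;
        for (String name : dir.listAll()) {       // total bytes currently in the index
            indexBytes += dir.fileLength(name);
        }
        // Round up so the average segment size stays at or under the target.
        int maxNumSegments = (int) Math.max(1, (indexBytes + targetBytes - 1) / targetBytes);
        writer.optimize(maxNumSegments);          // partial optimize, as discussed above
    }

    // e.g. optimizeToTargetSize(writer, 200L * 1024 * 1024);  // ~200 MB target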

Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Jason Rutherglen
Right... It all blends together, I need an NLP analyzer for my emails

On Wed, Jan 13, 2010 at 3:05 PM, Otis Gospodnetic
 wrote:
> I think Jason meant "15-20GB segments"?
>  Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>
>
>
>
> 
> From: Jason Rutherglen 
> To: java-user@lucene.apache.org
> Sent: Wed, January 13, 2010 5:54:38 PM
> Subject: Re: Max Segmentation Size when Optimizing Index
>
> Yes... You could hack LogMergePolicy to do something else.
>
> I use optimise(numsegments:5) regularly on 80GB indexes, that if
> optimized to 1 segment, would thrash the IO excessively.  This works
> fine because 15-20GB indexes are plenty large and fast.
>
> On Wed, Jan 13, 2010 at 2:44 PM, Trin Chavalittumrong  
> wrote:
>> Seems like optimize() only cares about final number of segments rather than
>> the size of the segment. Is it so?
>>
>> On Wed, Jan 13, 2010 at 2:35 PM, Jason Rutherglen <
>> jason.rutherg...@gmail.com> wrote:
>>
>>> There's a different method in LogMergePolicy that performs the
>>> optimize... Right, so normal merging uses the findMerges method, then
>>> there's a findMergeOptimize (method names could be inaccurate).
>>>
>>> On Wed, Jan 13, 2010 at 2:29 PM, Trin Chavalittumrong 
>>> wrote:
>>> > Do you mean MergePolicy is only used during index time and will be
>>> ignored
>>> > by the Optimize() process?
>>> >
>>> >
>>> > On Wed, Jan 13, 2010 at 1:57 PM, Jason Rutherglen <
>>> > jason.rutherg...@gmail.com> wrote:
>>> >
>>> >> Oh ok, you're asking about optimizing... I think that's a different
>>> >> algorithm inside LogMergePolicy.  I think it ignores the maxMergeMB
>>> >> param.
>>> >>
>>> >> On Wed, Jan 13, 2010 at 1:49 PM, Trin Chavalittumrong >> >
>>> >> wrote:
>>> >> > Thanks, Jason.
>>> >> >
>>> >> > Is my understanding correct that
>>> >> LogByteSizeMergePolicy.setMaxMergeMB(100)
>>> >> > will prevent
>>> >> > merging of two segments that is larger than 100 Mb each at the
>>> optimizing
>>> >> > time?
>>> >> >
>>> >> > If so, why do think would I still see segment that is larger than 200
>>> MB?
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Wed, Jan 13, 2010 at 1:43 PM, Jason Rutherglen <
>>> >> > jason.rutherg...@gmail.com> wrote:
>>> >> >
>>> >> >> Hi Trin,
>>> >> >>
>>> >> >> There was recently a discussion about this, the max size is
>>> >> >> for the before merge segments, rather than the resultant merged
>>> >> >> segment (if that makes sense). It'd be great if we had a merge
>>> >> >> policy that limited the resultant merged segment, though that'd
>>> >> >> be a rough approximation at best.
>>> >> >>
>>> >> >> Jason
>>> >> >>
>>> >> >> On Wed, Jan 13, 2010 at 1:36 PM, Trin Chavalittumrong <
>>> mrt...@gmail.com
>>> >> >
>>> >> >> wrote:
>>> >> >> > Hi,
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > I am trying to optimize the index which would merge different
>>> segment
>>> >> >> > together. Let say the index folder is 1Gb in total, I need each
>>> >> >> segmentation
>>> >> >> > to be no larger than 200Mb. I tried to use *LogByteSizeMergePolicy
>>> >> *and
>>> >> >> > setMaxMergeMB(100) to ensure no segment after merging would be
>>> 200Mb.
>>> >> >> > However, I still see segment that are larger than 200Mb. I did call
>>> >> >> > IndexWriter.optimize(20) to make sure there are enough number
>>> >> >> segmentation
>>> >> >> > to allow each segment to be under 200Mb.
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > Can someone let me know if I am using this right? Or any suggestion
>>> on
>>> >> >> how
>>> >> >> > to tackle this would be helpful.
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > Thanks,
>>> >> >> >
>>> >> >> > Trin
>>> >> >> >
>>> >> >>
>>> >> >> -
>>> >> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> >> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>> >> >>
>>> >> >>
>>> >> >
>>> >>
>>> >> -
>>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>> >>
>>> >>
>>> >
>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Jason Rutherglen
Actually I meant to say indexes... However when optimize(numsegments)
is used they're segments...

On Wed, Jan 13, 2010 at 3:05 PM, Otis Gospodnetic
 wrote:
> I think Jason meant "15-20GB segments"?
>  Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>
>
>
>
> 
> From: Jason Rutherglen 
> To: java-user@lucene.apache.org
> Sent: Wed, January 13, 2010 5:54:38 PM
> Subject: Re: Max Segmentation Size when Optimizing Index
>
> Yes... You could hack LogMergePolicy to do something else.
>
> I use optimise(numsegments:5) regularly on 80GB indexes, that if
> optimized to 1 segment, would thrash the IO excessively.  This works
> fine because 15-20GB indexes are plenty large and fast.
>
> On Wed, Jan 13, 2010 at 2:44 PM, Trin Chavalittumrong  
> wrote:
>> Seems like optimize() only cares about final number of segments rather than
>> the size of the segment. Is it so?
>>
>> On Wed, Jan 13, 2010 at 2:35 PM, Jason Rutherglen <
>> jason.rutherg...@gmail.com> wrote:
>>
>>> There's a different method in LogMergePolicy that performs the
>>> optimize... Right, so normal merging uses the findMerges method, then
>>> there's a findMergeOptimize (method names could be inaccurate).
>>>
>>> On Wed, Jan 13, 2010 at 2:29 PM, Trin Chavalittumrong 
>>> wrote:
>>> > Do you mean MergePolicy is only used during index time and will be
>>> ignored
>>> > by the Optimize() process?
>>> >
>>> >
>>> > On Wed, Jan 13, 2010 at 1:57 PM, Jason Rutherglen <
>>> > jason.rutherg...@gmail.com> wrote:
>>> >
>>> >> Oh ok, you're asking about optimizing... I think that's a different
>>> >> algorithm inside LogMergePolicy.  I think it ignores the maxMergeMB
>>> >> param.
>>> >>
>>> >> On Wed, Jan 13, 2010 at 1:49 PM, Trin Chavalittumrong >> >
>>> >> wrote:
>>> >> > Thanks, Jason.
>>> >> >
>>> >> > Is my understanding correct that
>>> >> LogByteSizeMergePolicy.setMaxMergeMB(100)
>>> >> > will prevent
>>> >> > merging of two segments that is larger than 100 Mb each at the
>>> optimizing
>>> >> > time?
>>> >> >
>>> >> > If so, why do think would I still see segment that is larger than 200
>>> MB?
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Wed, Jan 13, 2010 at 1:43 PM, Jason Rutherglen <
>>> >> > jason.rutherg...@gmail.com> wrote:
>>> >> >
>>> >> >> Hi Trin,
>>> >> >>
>>> >> >> There was recently a discussion about this, the max size is
>>> >> >> for the before merge segments, rather than the resultant merged
>>> >> >> segment (if that makes sense). It'd be great if we had a merge
>>> >> >> policy that limited the resultant merged segment, though that'd
>>> >> >> be a rough approximation at best.
>>> >> >>
>>> >> >> Jason
>>> >> >>
>>> >> >> On Wed, Jan 13, 2010 at 1:36 PM, Trin Chavalittumrong <
>>> mrt...@gmail.com
>>> >> >
>>> >> >> wrote:
>>> >> >> > Hi,
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > I am trying to optimize the index which would merge different
>>> segment
>>> >> >> > together. Let say the index folder is 1Gb in total, I need each
>>> >> >> segmentation
>>> >> >> > to be no larger than 200Mb. I tried to use *LogByteSizeMergePolicy
>>> >> *and
>>> >> >> > setMaxMergeMB(100) to ensure no segment after merging would be
>>> 200Mb.
>>> >> >> > However, I still see segment that are larger than 200Mb. I did call
>>> >> >> > IndexWriter.optimize(20) to make sure there are enough number
>>> >> >> segmentation
>>> >> >> > to allow each segment to be under 200Mb.
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > Can someone let me know if I am using this right? Or any suggestion
>>> on
>>> >> >> how
>>> >> >> > to tackle this would be helpful.
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > Thanks,
>>> >> >> >
>>> >> >> > Trin
>>> >> >> >
>>> >> >>
>>> >> >> -
>>> >> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> >> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>> >> >>
>>> >> >>
>>> >> >
>>> >>
>>> >> -
>>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>> >>
>>> >>
>>> >
>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
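
For the normal (non-optimize) merge path the thread refers to, a one-line sketch of where maxMergeMB is set, assuming an open writer that still uses its default LogByteSizeMergePolicy; as noted above, this bounds the segments selected for background merges, not the segments optimize() produces:

    import org.apache.lucene.index.LogByteSizeMergePolicy;

    // Assumes the default merge policy has not been replaced with something else.
    LogByteSizeMergePolicy mp = (LogByteSizeMergePolicy) writer.getMergePolicy();
    mp.setMaxMergeMB(100.0);   // segments over 100 MB are skipped by normal merge selection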



Re: RangeFilter

2010-01-13 Thread AlexElba

Hello,

I changed the filter to the following:

  RangeFilter rangeFilter = new RangeFilter("rank",
      NumberTools.longToString(rating),
      NumberTools.longToString(10), true, true);

and changed the index to store rank the same way... but I'm still not seeing
any results :(


AlexElba wrote:
> 
> Hello,
> 
> I am currently using lucene 2.4 and have document with 3 fields
> 
> id  
> name 
> rank
> 
> and have query and filter when I am trying to use range filter on rank I am
> not getting any result back
> 
> RangeFilter rangeFilter = new RangeFilter("rank", "3", "10", true, true);
> 
> I have documents which are in this interval 
> 
> 
> Any suggestion what am I doing wrong?
> 
> Regards
> 
> 
> 
> 
> 

-- 
View this message in context: 
http://old.nabble.com/RangeFilter-tp27148785p27155102.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
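
A sketch of a fully consistent index-time and query-time encoding with the Lucene 2.4-era classes mentioned in this thread; the field name, example value, and Field flags are illustrative. The key point is that plain strings sort lexicographically ("10" sorts before "3", so the original "3".."10" range matches nothing), while NumberTools.longToString produces padded terms whose string order matches numeric order:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumberTools;
    import org.apache.lucene.search.RangeFilter;

    // Index time: store rank in NumberTools' padded form so that
    // lexicographic term order matches numeric order.
    Document doc = new Document();
    doc.add(new Field("rank", NumberTools.longToString(7L),
                      Field.Store.YES, Field.Index.NOT_ANALYZED));

    // Query time: encode both ends of the range exactly the same way.
    RangeFilter rangeFilter = new RangeFilter("rank",
            NumberTools.longToString(3L),
            NumberTools.longToString(10L),
            true, true);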



RE: RangeFilter

2010-01-13 Thread Steven A Rowe
Hi AlexElba,

Did you completely re-index?

If you did, then there is some other problem - can you share (more of) your 
code?

Do you know about Luke?  It's an essential tool for Lucene index debugging:

   http://www.getopt.org/luke/

Steve

On 01/13/2010 at 8:34 PM, AlexElba wrote:
> 
> Hello,
> 
> I change filter to follow
>   RangeFilter rangeFilter = new RangeFilter(
>"rank", NumberTools
> .longToString(rating), NumberTools
> .longToString(10), true, true);
> 
> and change index to store rank the same way... But still not seeing :(
> any results
> 
> 
> AlexElba wrote:
> > 
> > Hello,
> > 
> > I am currently using lucene 2.4 and have document with 3 fields
> > 
> > id
> > name
> > rank
> > 
> > and have query and filter when I am trying to use range filter on rank I
> > am not getting any result back
> > 
> > RangeFilter rangeFilter = new RangeFilter("rank", "3", "10", true,
> > true);
> > 
> > I have documents which are in this interval
> > 
> > 
> > Any suggestion what am I doing wrong?
> > 
> > Regards
> > 
> > 
> > 
> > 
> > 
> 
> -- View this message in context: http://old.nabble.com/RangeFilter-
> tp27148785p27155102.html Sent from the Lucene - Java Users mailing list
> archive at Nabble.com.
> 
> 
> - To
> unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For
> additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Hints on implementing XQuery full-text search

2010-01-13 Thread Paul J. Lucas
Hi -

I've used Lucene on a previous project, so I am somewhat familiar with the API. 
However, I've never had to do anything "fancy" (where "fancy" means things like 
using filters, different analyzers, boosting, payloads, etc).

I'm about to embark on implementing the full-text search feature of XQuery:

http://www.w3.org/TR/xpath-full-text-10/

Its query abilities are the most complicated I've seen.  From my experience 
with Lucene, I know it can be used to implement some of the features; however, 
I'd like to walk through them all.  For each feature, I only need a simple 
answer like, "Feature X is best implemented using a query filter," just so I 
start off in the right direction for each feature.

Note: I will be implementing my own query parser and construct queries "by 
hand" using various instantiations of Lucene classes.

-
3.3 Cardinality Selection

This allows you to say things like:

title ftcontains "usability" occurs at least 2 times

which means that the title field must contain the word "usability" at least twice.
How can this be done?
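
One possible direction (an illustrative sketch against the Lucene 2.9/3.0 API, not a confirmed answer): the postings already record a per-document frequency, so a Filter can keep only documents where the term occurs often enough. The class, field, and term names below are made up for the example.

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.util.OpenBitSet;

    // Accepts only documents whose within-document frequency of `term`
    // is at least `minFreq`, e.g. "usability" occurs at least 2 times.
    public class MinTermFreqFilter extends Filter {
      private final Term term;
      private final int minFreq;

      public MinTermFreqFilter(Term term, int minFreq) {
        this.term = term;
        this.minFreq = minFreq;
      }

      public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        OpenBitSet bits = new OpenBitSet(reader.maxDoc());
        TermDocs td = reader.termDocs(term);
        try {
          while (td.next()) {
            if (td.freq() >= minFreq) {
              bits.set(td.doc());
            }
          }
        } finally {
          td.close();
        }
        return bits;
      }
    }

    // usage: new MinTermFreqFilter(new Term("title", "usability"), 2)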

-
3.4.4 Stemming Option

This allows you to match words that have been indexed against words in the 
query that have been stemmed like:

title ftcontains "improve" with stemming

which would match even if title contained "improving".  Note that 
PorterStemFilter can not be used because the decision whether to use stemming 
or not is specified at query-time and not index-time.

In this case, would I have to add each word to the index twice?  Once for the 
original word and once for the stemmed word (assuming the stemmed word is 
different from the original word)?  Or is there a better way?
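
The "add each word twice" idea can be done in one pass with a TokenFilter that emits the stemmed form at the same position as the original, so a stemmed query term and an unstemmed one both match. This is a sketch against the 2.9/3.0 attribute API; stem() is a placeholder for whichever stemmer gets plugged in, not a real Lucene method.

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public final class DualStemFilter extends TokenFilter {
      private final TermAttribute termAtt = addAttribute(TermAttribute.class);
      private final PositionIncrementAttribute posIncrAtt =
          addAttribute(PositionIncrementAttribute.class);
      private String pendingStem;

      public DualStemFilter(TokenStream input) {
        super(input);
      }

      public boolean incrementToken() throws IOException {
        if (pendingStem != null) {
          // Emit the buffered stem at the same position as the original token.
          termAtt.setTermBuffer(pendingStem);
          posIncrAtt.setPositionIncrement(0);
          pendingStem = null;
          return true;
        }
        if (!input.incrementToken()) {
          return false;
        }
        String original = termAtt.term();
        String stemmed = stem(original);       // placeholder stemmer hook
        if (!stemmed.equals(original)) {
          pendingStem = stemmed;               // queue the stem for the next call
        }
        return true;
      }

      private String stem(String word) {
        return word;                           // delegate to a real stemmer here
      }
    }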

-
3.4.5 Case Option

This allows you to specify -- at query-time -- one of "case insensitive", "case 
sensitive", "lowercase", "uppercase".

The last two I think can be implemented using a query filter since, for 
"lowercase", it matches only if the document text is all in lower-case (and 
same for "uppercase").

But how would you handle the case insensitive/sensitive specifications?  One 
thought is to add every word twice: once in its original case and once in a 
normalized case (arbitrarily chosen to be, say, lowercase).  Any better ideas?

---
3.4.6 Diacritics Option

This is similar to the Case Option except it's "diacritics insensitive" or 
"diacritics sensitive".  How about implementing this?

--
3.4.7 Stop Word Option

This allows you to specify -- at query time -- "with stop words", e.g.:

abstract ftcontains "propagating of errors"
with stop words ("a", "the", "of")

would match a document with an abstract that contains "propagating few errors". 
It seems odd, I know.  It's as if the stop words become wildcards, i.e.:

"propagating of errors" -> "propagating * errors"

where * will match any word in the document.  How can this be implemented in 
Lucene?
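
One guess at the wildcard behaviour (an assumption, not a confirmed recipe): PhraseQuery accepts explicit term positions, so the stop-word slot can simply be left open and any single token will fill it.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    // "propagating of errors" with "of" treated as a placeholder:
    // "propagating" at position 0, "errors" at position 2, position 1 left free.
    PhraseQuery query = new PhraseQuery();
    query.add(new Term("abstract", "propagating"), 0);
    query.add(new Term("abstract", "errors"), 2);
    // slop stays 0, so exactly one token must sit between the two words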


3.5.3 Mild-Not Selection

XQuery has two flavors of "not": (regular) not and mild-not.  This allows you 
to have a query like:

body ftcontains "Mexico" not in "New Mexico"

which would only match documents that contain "Mexico" when it's not part of 
the phrase "New Mexico".  How can this be implemented using Lucene?
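
Span queries look like a natural fit here (a suggested sketch, not something from the original question): SpanNotQuery keeps matches of the include query whose spans do not overlap the exclude query.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanNotQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // Matches "mexico" only where it is not inside the phrase "new mexico"
    // (terms shown lowercased, assuming a lowercasing analyzer).
    SpanQuery mexico = new SpanTermQuery(new Term("body", "mexico"));
    SpanQuery newMexico = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("body", "new")),
        new SpanTermQuery(new Term("body", "mexico"))
    }, 0, true);                               // slop 0, in order: the exact phrase
    SpanQuery mildNot = new SpanNotQuery(mexico, newMexico);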

---
3.6.1 Ordered Selection

This allows you to require that the order of the words in a query match the 
order of the words in a document, e.g.:

title ftcontains ("web site" ftand "usability") ordered

which would match only if the phrase "web site" and the word "usability" both 
occurred in the document and "usability" comes after "web site" in word order.  
My guess for implementing this would be to keep track of word positions and 
store this in a payload.  Yes?
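
Index-time positions may already be enough here, without payloads (an alternative worth checking rather than a definitive answer): SpanNearQuery with inOrder=true and a generous slop enforces word order without requiring adjacency.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // "web site" as an exact phrase, followed anywhere later by "usability".
    SpanQuery webSite = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("title", "web")),
        new SpanTermQuery(new Term("title", "site"))
    }, 0, true);
    SpanQuery usability = new SpanTermQuery(new Term("title", "usability"));
    SpanQuery ordered = new SpanNearQuery(new SpanQuery[] { webSite, usability },
        Integer.MAX_VALUE, true);              // any distance, but in this order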

-
3.6.4 Scope Selection

This allows you to require that words appear in the same "scope", e.g.:

abstract ftcontains "usability" ftand "web site" same sentence

You can also do any combination of {same|different} {sentence|paragraph}.  My 
guess for this would also be to keep track of sentence/paragraph data in a 
payload.  Yes?

-
3.7 Ignore Option

Given the partial XQuery:

let $x := <book>
   <title>Web Usability and Practice</title>
   <author>Montana <annotation>this author is
   an expert in Web Usability</annotation> Marigold
   </author>
   <editor>Vera Tudor-Medina on Web <annotation>best
   editor on Web Usability</annotation> Usability
   </editor>
 </book>

if I were to have a query:

book ftcontains "Web Usability" without content $x//annotation

then it would not consider any text inside of <annotation> 
elements at all.  "Web Usability" would be found twice: once in the title 
element and once in the editor element.  Note that the latter <annotation> 
element comes smack in the middle of the phrase "Web Usability".  My guess for 
this would also be to use payload data to store the element each word is inside 
of then use a filter based on that.  Yes?

-
I realize this is a