Re: score and multiValued fields

2010-03-17 Thread Marc Sturlese

Confirmed, supposition 2 is the right one. 
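In other words, all values of a multiValued field get indexed into one and the same
underlying field, so the extra values lengthen that field and dilute the score for any
single term. A rough sketch of what that means in practice (Lucene 2.9-era API; the
analyzer and class name are just for illustration):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class MultiValuedScoring {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
                IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc1 = new Document();
        doc1.add(new Field("content", "aa", Field.Store.NO, Field.Index.ANALYZED));
        doc1.add(new Field("content", "bb", Field.Store.NO, Field.Index.ANALYZED));
        doc1.add(new Field("content", "dd ff gg", Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc1);

        Document doc2 = new Document();
        doc2.add(new Field("content", "aa b", Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc2);
        writer.close();

        // All three values of doc1 end up in the single "content" field, so its length
        // norm covers five terms; doc2's two-term field gets the better norm and should
        // come out on top for content:aa, i.e. the values influence each other.
        IndexSearcher searcher = new IndexSearcher(dir);
        ScoreDoc[] hits = searcher.search(new TermQuery(new Term("content", "aa")), 10).scoreDocs;
        for (ScoreDoc hit : hits) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
        searcher.close();
    }
}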

Erick Erickson wrote:
> 
> Have you looked at:
> http://lucene.apache.org/java/2_4_0/scoring.html
> 
> even though it's for
> 2.4,
> I don't think there's any relevant changes for 3.x...
> 
> I'm pretty sure that your supposition 2 is the right one.
> 
> HTH
> Erick
> 
> On Tue, Mar 16, 2010 at 2:58 PM, Marc Sturlese
> wrote:
> 
>>
>> I would like to know how Lucene deals with the score on multiValued
>> fields.
>> I am wondering whether:
>> 1) a score is computed per field value and the maximum among them wins,
>> or
>> 2) all terms of all values (from the multiValued field) influence
>> each other
>> in computing the score.
>>
>> Let's say I have a document with a multiValued field "content" indexed 3
>> times and another document with the field indexed just once
>>
>> Doc1: content->aa; content->bb; content->dd ff gg
>> Doc2: content->aa b
>>
>> Searching for content:aa, Doc1 would be more relevant if supposition 1)
>> is
>> correct. Doc2 would be more relevant if supposition 2) is correct.
>> How does it work?
>> Thanks in advance
>>
>>
>>
> 
> 




Re: OutOfMemory ParallelMultisearcher

2010-03-17 Thread Ian Lea
Hi


Caching searchers at some level should help keep memory usage down -
and will help performance too.  Searchers themselves don't generally
consume large amounts of memory, but if you've got loads of them then
obviously things will add up.

Unless you can change the whole design of your app (single index with
a user field that you use as a query filter to restrict users to
"their" data?) you may be stuck with giving the app more memory or
restricting the number of concurrent searchers.
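If the single-index route is an option, the per-user restriction itself is cheap -
something along these lines (just a sketch; the "user" field and the sharedSearcher
and userQuery variables are made up):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// One IndexSearcher shared by everyone; only the filter differs per user.
Filter userFilter = new CachingWrapperFilter(
        new QueryWrapperFilter(new TermQuery(new Term("user", "jamie"))));
TopDocs hits = sharedSearcher.search(userQuery, userFilter, 50);

The CachingWrapperFilter means the per-user restriction only has to be computed once
per reader rather than on every query.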


---
Ian.


On Tue, Mar 16, 2010 at 10:00 PM, Jamie  wrote:
> Hi There
>
> I have an index which is 36 GB in size. When I perform eight simultaneous
> searches (performed by JMeter) on the index, an OutOfMemory error occurs.
> Since I need to potentially search across multiple indexes and those indexes
> can change from one search query to the next, each user has their own
> ParallelMultiSearcher object. Before each search operation, I reconstruct
> the ParallelMultiSearcher with the appropriate Searchers for each of the
> indexes that need to be included for that particular search query.
>
> The problem is that requiring each user to have their own
> ParallelMultiSearcher seems to limit the number of users that can use the
> system at the same time.
>
> While experimenting, I found that when I make the ParallelMultiSearcher static,
> so that the same object is used by all users, the OutOfMemory problem goes away
> and I am able to execute 50 simultaneous searches. The problem I have is that I
> cannot make the ParallelMultiSearcher static, since the specific indexes used vary
> from one search query to the next. I initially thought one could just cache
> the underlying Searchers and all would be okay, but this does not appear to
> be the case.
>
> My question: Will ParallelMultiSearcher tend to consume a large amount of
> memory by itself when used on large indices? If so, do you have any
> suggestions on how I might support the above scenario (i.e. when the indexes
> used change from one query to the next)?
>
> Thanks in advance
>
> Jamie
>
>



London open-source search social - 6th April

2010-03-17 Thread Richard Marr
Hi all,

We're meeting up at the Elgin just by Ladbroke Grove on the 6th for a
bit of relaxed chat about search, and related technology. Come along,
we're nice.
http://www.meetup.com/london-search-social/calendar/12781861/

It's a regular event, so if you want prior warning about future
meetups you can sign up here:
http://www.meetup.com/london-search-social/

Cheers,

Rich




Re: Dealing with special cases in analyser

2010-03-17 Thread Grant Ingersoll
What's your current chain of TokenFilters?  How many exceptions do you expect?  
That is, could you enumerate them?

On Mar 12, 2010, at 5:27 AM, Paul Taylor wrote:

> Hi, I'm using a custom analyser based on StandardAnalyzer with good results
> to search artists (i.e. Rolling Stones/Beatles), but it fails to match some
> weird artist names such as '!!!'. This is not surprising, because the analyser
> ignores punctuation, which is what I normally want. I just wonder about the
> best way to deal with these special cases.
> 
> My first idea was to use a version of CharFilter (PatternReplaceCharFilter)
> to replace certain patterns such as '!!!' with 'ApostropheApostropheApostrophe'
> so they would remain intact, but I'm worried about the overhead of doing
> this. Is there something else I should be doing?
> 
> thanks Paul
> 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search





Re: Increase number of available positions?

2010-03-17 Thread Rene Hackl-Sommer

Hi,

I was looking at SpanNotQuery to see if I could make do without the 
position increment gaps. A search requirement that's causing me some 
trouble to implement is when two terms are supposed to be on the same 
L_2, yet on different L_3's (L_3's are hierarchically below L_2).


With the position increments in place, I can do this:




[XML query stripped by the mail archive: two nested clauses, each containing the
terms t293 and t4979.]

This query returns the expected documents.

I didn't manage to come up with a working solution for the approach 
without posIncGaps. The following, I thought, should work, but for some 
reason it doesn't:








[XML query stripped by the mail archive: a nested span query whose first clause
contains t293, t4979 and the boundary term L_2, and whose second clause contains
t293, t4979 and the boundary term L_3.]





Shouldn't this query only match documents where t293 and t4979 are in
the same L_2, but not within the same L_3? I fiddled about with
different queries to no avail, and I feel the above is the most
straightforward attempt. But the query doesn't match any document at all.


Any ideas on how to improve the second query would be greatly appreciated.

Thanks
Rene


Hi Rene,

Have you seen SpanNotQuery?:



For a document that looks like:


   
<Level_1>
  <Level_2>
    <Level_3>T1 T2 T3</Level_3>
    <Level_3>T4 T5 T6</Level_3>
    <Level_3>T7 T8 T9</Level_3>
  </Level_2>
  <Level_2>
    <Level_3>T10 T11 T12</Level_3>
    <Level_3>T13 T14 T15</Level_3>
    <Level_3>T16 T17 T18</Level_3>
  </Level_2>
  ...
</Level_1>
...

You could generate the following token stream (L_X being a concrete level 
boundary token):

L_1 L_2 L_3 T1  T2  T3  L_3 T4  T5  T6  L_3 T7  T8  T9
 L_2 L_3 T10 T11 T12 L_3 T13 T14 T15 L_3 T16 T17 T18
 L_2 ...
...

A query to find T2 and T8 on the same Level_2 would require you to find a span 
containing T2 and T8, but not containing L_2.

This scheme will generalize to as many levels as you need, and you can use 
nested span queries to simultaneously provide constraints at multiple levels.  
No position increment gap required.
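In Java span-query form that would look roughly like this (a sketch; the field name
"f" is made up and the slop just needs to be larger than the number of tokens in one
Level_2):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// A span containing both T2 and T8, in either order...
SpanQuery bothTerms = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("f", "T2")),
        new SpanTermQuery(new Term("f", "T8")) }, 1000, false);
// ...that does not also contain an L_2 boundary token, i.e. both terms
// sit inside the same Level_2.
SpanQuery sameLevel2 = new SpanNotQuery(bothTerms,
        new SpanTermQuery(new Term("f", "L_2")));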

Caveat: this scheme is not tested - I could be way off base :).

Steve






Re: Dealing with special cases in analyser

2010-03-17 Thread Paul Taylor

Grant Ingersoll wrote:

What's your current chain of TokenFilters?  How many exceptions do you expect?  
That is, could you enumerate them?
  
Very few, yes, I could enumerate them, but I'm not sure exactly what you are
suggesting. What I was going to do is add them to the charConvertMap
(when I posted I thought this was only for individual chars, not strings).
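Concretely, something like this is what I had in mind (the replacement token is only
an illustration):

charConvertMap.add("!!!", "ExclamationExclamationExclamation");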



This is my analyzer:

package org.musicbrainz.search.analysis;

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharFilter;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

import com.ibm.icu.text.Transliterator;
import org.musicbrainz.search.LuceneVersion;

/**
 * Filters StandardTokenizer with StandardFilter, ICUTransformFilter,
 * AccentFilter, LowerCaseFilter and no stop words.
 */
public class StandardUnaccentAnalyzer extends Analyzer {

    private NormalizeCharMap charConvertMap;

    private void setCharConvertMap() {
        charConvertMap = new NormalizeCharMap();
        charConvertMap.add("&", "and");

        // Hebrew chars converted to western equivalents so both forms match
        charConvertMap.add("\u05f3", "'");
        charConvertMap.add("\u05be", "-");
        charConvertMap.add("\u05f4", "\"");
    }

    public StandardUnaccentAnalyzer() {
        setCharConvertMap();
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        CharFilter mappingCharFilter = new MappingCharFilter(charConvertMap, reader);
        StandardTokenizer tokenStream =
                new StandardTokenizer(LuceneVersion.LUCENE_VERSION, mappingCharFilter);
        TokenStream result = new ICUTransformFilter(tokenStream,
                Transliterator.getInstance("[ー[:Script=Katakana:]]Katakana-Hiragana"));
        result = new ICUTransformFilter(result,
                Transliterator.getInstance("Traditional-Simplified"));
        result = new StandardFilter(result);
        result = new AccentFilter(result);
        result = new LowercaseFilter(result);
        return result;
    }

    private static final class SavedStreams {
        StandardTokenizer tokenStream;
        TokenStream filteredTokenStream;
    }

    public TokenStream reusableTokenStream(String fieldName, Reader reader)
            throws IOException {
        SavedStreams streams = (SavedStreams) getPreviousTokenStream();
        if (streams == null) {
            streams = new SavedStreams();
            setPreviousTokenStream(streams);
            streams.tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION,
                    new MappingCharFilter(charConvertMap, reader));
            streams.filteredTokenStream = new ICUTransformFilter(streams.tokenStream,
                    Transliterator.getInstance("[ー[:Script=Katakana:]]Katakana-Hiragana"));
            streams.filteredTokenStream = new ICUTransformFilter(streams.filteredTokenStream,
                    Transliterator.getInstance("Traditional-Simplified"));
            streams.filteredTokenStream = new StandardFilter(streams.filteredTokenStream);
            streams.filteredTokenStream = new AccentFilter(streams.filteredTokenStream);
            streams.filteredTokenStream = new LowercaseFilter(streams.filteredTokenStream);
        } else {
            streams.tokenStream.reset(new MappingCharFilter(charConvertMap, reader));
        }
        return streams.filteredTokenStream;
    }
}





Re: Dealing with special cases in analyser

2010-03-17 Thread Grant Ingersoll

On Mar 17, 2010, at 11:34 AM, Paul Taylor wrote:

> Grant Ingersoll wrote:
>> What's your current chain of TokenFilters?  How many exceptions do you 
>> expect?  That is, could you enumerate them?
>>  
> Very few, yes, I could enumerate them, but I'm not sure exactly what you are
> suggesting. What I was going to do is add them to the charConvertMap (when I
> posted I thought this was only for individual chars, not strings).

You could modify whichever filter is removing them to take in a protected-words
list and then short-circuit so that it does not remove those tokens. This would be a
hash map lookup, which should be faster than the char replacement you are
considering. Many of the stemmers do this.
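A generic sketch of that pattern (Lucene 2.9/3.0 attribute API; the class name and
the protected-words set are made up, and of course it only helps if the token
actually reaches the filter):

import java.io.IOException;
import java.util.Set;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class ProtectedWordsFilter extends TokenFilter {
    private final Set<String> protectedWords;
    private final TermAttribute termAtt;

    public ProtectedWordsFilter(TokenStream input, Set<String> protectedWords) {
        super(input);
        this.protectedWords = protectedWords;
        this.termAtt = addAttribute(TermAttribute.class);
    }

    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        if (protectedWords.contains(termAtt.term())) {
            return true;   // protected: pass the token through untouched
        }
        // ...otherwise apply whatever normalisation/removal this filter normally does
        return true;
    }
}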


> 
> 
> This is my analyzer:
> 
> package org.musicbrainz.search.analysis;
> 
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.CharFilter;
> import org.apache.lucene.analysis.MappingCharFilter;
> import org.apache.lucene.analysis.NormalizeCharMap;
> 
> 
> import java.io.IOException;
> import java.io.Reader;
> 
> import com.ibm.icu.text.Transliterator;
> import org.apache.lucene.util.Version;
> import org.musicbrainz.search.LuceneVersion;
> 
> /**
> * Filters StandardTokenizer with StandardFilter, ICUTransformFilter, 
> AccentFilter, LowerCaseFilter
> * and no stop words.
> */
> public class StandardUnaccentAnalyzer extends Analyzer {
> 
> private NormalizeCharMap charConvertMap;
> 
> private void setCharConvertMap() {
> charConvertMap = new NormalizeCharMap();
> charConvertMap.add("&","and");
> 
> //Hebrew chars converted to western cases so matches both
> charConvertMap.add("\u05f3","'");
> charConvertMap.add("\u05be","-");
> charConvertMap.add("\u05f4","\"");
> 
> 
> }
> 
> public StandardUnaccentAnalyzer() {
> setCharConvertMap();
> }
> 
> public TokenStream tokenStream(String fieldName, Reader reader) {
> CharFilter mappingCharFilter = new MappingCharFilter(charConvertMap,reader);
> StandardTokenizer tokenStream = new 
> StandardTokenizer(LuceneVersion.LUCENE_VERSION, mappingCharFilter);
> TokenStream result = new ICUTransformFilter(tokenStream, 
> Transliterator.getInstance("[ー[:Script=Katakana:]]Katakana-Hiragana"));
> result = new ICUTransformFilter(result, 
> Transliterator.getInstance("Traditional-Simplified"));
> result = new StandardFilter(result);
> result = new AccentFilter(result);
> result = new LowercaseFilter(result);
> return result;
> }
> 
> private static final class SavedStreams {
> StandardTokenizer tokenStream;
> TokenStream filteredTokenStream;
> }
> 
> public TokenStream reusableTokenStream(String fieldName, Reader reader) 
> throws IOException {
> SavedStreams streams = (SavedStreams)getPreviousTokenStream();
> if (streams == null) {
> streams = new SavedStreams();
> setPreviousTokenStream(streams);
> streams.tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, new 
> MappingCharFilter(charConvertMap, reader));
> streams.filteredTokenStream = new ICUTransformFilter(streams.tokenStream, 
> Transliterator.getInstance("[ー [:Script=Katakana:]]Katakana-Hiragana"));
> streams.filteredTokenStream = new 
> ICUTransformFilter(streams.filteredTokenStream, 
> Transliterator.getInstance("Traditional-Simplified"));
> streams.filteredTokenStream = new StandardFilter(streams.filteredTokenStream);
> streams.filteredTokenStream = new AccentFilter(streams.filteredTokenStream);
> streams.filteredTokenStream = new 
> LowercaseFilter(streams.filteredTokenStream);
> }
> else {
> streams.tokenStream.reset(new MappingCharFilter(charConvertMap,reader));
> }
> return streams.filteredTokenStream;
> }
> }
> 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search





RE: Batch Indexing - best practice?

2010-03-17 Thread Murdoch, Paul
Thanks.  Timing the different parts of the indexing process led me to
the real cause of the problem.  I wasn't reusing my threaded
indexWriter.  By keeping the indexWriter open, I'm now able to index 500
documents in less than 1 second.  That's a huge improvement.
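For anyone hitting the same thing, the pattern that fixed it is simply this (a
sketch; dir, analyzer and the document batches are whatever your app already has):

IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
for (Document doc : batch) {
    writer.addDocument(doc);   // reuse the one open writer for every document/batch
}
writer.commit();               // or close() once all the batches are done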

Thanks again,

Paul


-Original Message-
From: java-user-return-45439-paul.b.murdoch=saic@lucene.apache.org
[mailto:java-user-return-45439-paul.b.murdoch=saic@lucene.apache.org
] On Behalf Of Erick Erickson
Sent: Monday, March 15, 2010 12:45 PM
To: java-user@lucene.apache.org
Subject: Re: Batch Indexing - best practice?

What's a document? What's indexing?

Here's what I'd do as a very first step. Time the actual
indexing and report it out. By that I mean how long does
IndexWriter.addDocument() take? If you actually get the
document from wherever first then add all the fields
and add the document, I'd time adding the fields too. The point
is to separate the Lucene stuff from whatever else you do
before trying to fix anything.
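Something as crude as this is usually enough to see where the time goes (a sketch;
buildDocument() just stands in for however you assemble your data, and writer and
numDocs are assumed to exist):

long buildMs = 0, addMs = 0;
for (int i = 0; i < numDocs; i++) {
    long t0 = System.currentTimeMillis();
    Document doc = buildDocument(i);      // your code: fetch data, add fields
    long t1 = System.currentTimeMillis();
    writer.addDocument(doc);              // the actual Lucene work
    addMs += System.currentTimeMillis() - t1;
    buildMs += t1 - t0;
}
System.out.println("building docs: " + buildMs + " ms, addDocument: " + addMs + " ms");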

The first point of the link Ian provided has the easily-overlooked
phrase "and the slowness is indeed inside Lucene"...

Best
Erick



On Mon, Mar 15, 2010 at 11:02 AM, Murdoch, Paul
wrote:

> Thanks.  I'll try lowering the merge factor and see if speed increases.
> The indexing is threaded, similar to the utility class in Listing 10.1
> from Lucene in Action.  Search speed is great once the index is
> built, close to real time.  So my main problem is getting the indexing
> speed fixed.  I do use the StandardAnalyzer for most of my fields.  What
> type of performance level should I be trying to hit for indexing
> (docs/sec)...just to give me an idea of what to shoot for?
>
> Paul
>
> -Original Message-
> From: java-user-return-45433-paul.b.murdoch=saic@lucene.apache.org
>
[mailto:java-user-return-45433-paul.b.murdoch=saic@lucene.apache.org
> ] On Behalf Of Mark Miller
> Sent: Monday, March 15, 2010 10:48 AM
> To: java-user@lucene.apache.org
> Subject: Re: Batch Indexing - best practice?
>
> On 03/15/2010 10:41 AM, Murdoch, Paul wrote:
> > Hi,
> >
> >
> >
> > I'm using Lucene 2.9.2.  Currently, when creating my index, I'm calling
> > indexWriter.addDocument(doc) for each Document I want to index.  The
> > Documents aren't large and I'm averaging indexing about 500 documents
> > every 90 seconds.  I'd like to try and speed this up...unless 90
> > seconds for 500 Documents is reasonable.  I have the merge factor set to
> > 1000.  Do you have any suggestions for batch indexing?  Is there
> > something like indexWriter.addDocuments(Document[] docs) in the API?
> >
> >
> >
> > Thanks.
> >
> > Paul
> >
> >
> >
> >
> >
> You should lower that merge factor - that's *really* high.
>
> You shouldn't really need much more than 50 or so ... and for search
> speed you're going to want fewer segments anyway -
> if you're just going to end up optimizing at the end, there is no reason
> for such a large merge factor - you will pay for most of what
> you saved when you optimize.
>
> That is very slow by the way. Should be much faster - especially if you
> are using multiple threads.
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>




Get info whether a field is multivalued

2010-03-17 Thread Stefan Trcek
Hello

Is there an API that indicates whether a field is multivalued, just like
IndexReader.getFieldNames(IndexReader.FieldOption fldOption) does
for fields being indexed/stored/termvector?

Of course I could track it at index time.

Stefan




Re: Get info whether a field is multivalued

2010-03-17 Thread mark harwood
Not the fastest thing in the world but works:

Term startTerm = new Term("myFieldName", "");
TermEnum te = reader.terms(startTerm);
BitSet docsRead = new BitSet(reader.maxDoc());
boolean multiValued = false;
do {
    Term t = te.term();
    if ((t == null) || (t.field() != startTerm.field())) {   // Term fields are interned, so != is safe
        break;                                               // ran off the end of this field's terms
    }
    TermDocs td = reader.termDocs(t);
    while (td.next()) {
        if (docsRead.get(td.doc())) {
            multiValued = true;     // a second term of this field seen in the same doc
            break;
        }
        docsRead.set(td.doc());
    }
} while (te.next() && multiValued == false);




- Original Message 
From: Stefan Trcek 
To: java-user@lucene.apache.org
Sent: Wed, 17 March, 2010 16:15:36
Subject: Get info whether a field is multivalued

Hello

Is there an API that indicates whether a field is multivalued, just like
IndexReader.getFieldNames(IndexReader.FieldOption fldOption) does
for fields being indexed/stored/termvector?

Of course I could track it at index time.

Stefan




Re: Get info whether a field is multivalued

2010-03-17 Thread Stefan Trcek
On Wednesday 17 March 2010 18:42:10 mark harwood wrote:
> Not the fastest thing in the world but works:
>
> Term startTerm=new Term("myFieldName","");
> TermEnum te=reader.terms(startTerm);
> BitSet docsRead=new BitSet(reader.maxDoc());
> boolean multiValued=false;
> do{
> Term t=te.term();
> if((t==null)||(t.field()!=startTerm.field()))
> {
> break;
> }
> TermDocs td = reader.termDocs(t);
> while(td.next())
> {
> if(docsRead.get(td.doc()))
> {
> multiValued=true;
> break;
> }
> docsRead.set(td.doc());
> }
> }while(te.next()&&multiValued==false);

Nice idea. Just ran it: 1.5s for 1M terms. I guess I'll track it at
index time.
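Something like this at index time should do it (a sketch; the class name is made up
and where the set gets persisted is up to the app):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.index.IndexWriter;

/** Tracks which fields ever get more than one value in a single document. */
class MultiValuedTracker {
    final Set<String> multiValuedFields = new HashSet<String>();

    void add(IndexWriter writer, Document doc) throws IOException {
        Set<String> seen = new HashSet<String>();
        for (Object o : doc.getFields()) {
            String name = ((Fieldable) o).name();
            if (!seen.add(name)) {
                multiValuedFields.add(name);   // second value of this field in one doc
            }
        }
        writer.addDocument(doc);
    }
}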

Thanks,
Stefan




RE: Increase number of available positions?

2010-03-17 Thread Steven A Rowe
Hi Rene,

On 03/17/2010 at 11:17 AM, Rene Hackl-Sommer wrote:
> [XML query stripped by the mail archive: a nested span query whose first clause
> contains t293, t4979 and the boundary term L_2, and whose second clause contains
> t293, t4979 and the boundary term L_3.]
>
> Shouldn't this query only match documents where t293 and t4979 are in
> the same L_2, but not within the same L_3?

I'm not sure what's wrong with the above (have you tried each of the two nested 
SpanNot clauses independently?), but here's another thing to try:


  

  
[XML query stripped by the mail archive: what appears to be a SpanNot whose Include
clause is a SpanOr of two ordered SpanNears, (t293, L_3, t4979) and (t4979, L_3,
t293), and whose Exclude clause is the single term L_2.]

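Roughly the same query in Java span-query form (again untested; the field name "f"
and the slop are placeholders):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

SpanQuery forward = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("f", "t293")),
        new SpanTermQuery(new Term("f", "L_3")),
        new SpanTermQuery(new Term("f", "t4979")) }, 1000, true);
SpanQuery backward = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("f", "t4979")),
        new SpanTermQuery(new Term("f", "L_3")),
        new SpanTermQuery(new Term("f", "t293")) }, 1000, true);
// either order, as long as an L_3 boundary falls between the two terms...
SpanQuery include = new SpanOrQuery(new SpanQuery[] { forward, backward });
// ...and the matching span must not contain an L_2 boundary:
// same Level_2, different Level_3s.
SpanQuery query = new SpanNotQuery(include, new SpanTermQuery(new Term("f", "L_2")));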

Steve





exact query match?

2010-03-17 Thread Joachim De Beule
Hi All,

I have a corpus of documents which I want to search for phrases. I only want
to get those documents that exactly contain a phrase. For example, if:
doc1 = "x 11 windowing system"
doc2 = "x windowing system"
doc3 = "the x 11 windowing system"

then I want the query "x 11 windowing system" to return only doc1 and doc3, and
the query "the x 11" to return only doc3.

I have tried using SimpleAnalyzer together with treating the query as a single
phrase, but this still also returns doc2 for the first example query, because that
analyzer discards the number 11. There does not seem to be an alternative
analyzer for this, however, and I don't know how to write one myself.

Is there a standard way of doing this?

Thanks!

Joachim.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: exact query match?

2010-03-17 Thread Erick Erickson
You might get some joy from WhitespaceAnalyzer, but beware of case and
punctuation. You could pre-process your indexing and querying to remove
non-alphanumerics.

Or you could create your own analyzer, see SynonymAnalyzer in Lucene In
Action, and there's another example here: http://mext.at/?p=26.

The idea is to string together some number of Filters, starting with a
Tokenizer that "does the right thing",  and create your own Analyzer.

But as far as I know, there's nothing out of the box that does what you
want.
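A minimal sketch of that idea - whitespace tokens plus lower-casing, nothing thrown
away, so "11" and "the" survive (the class name is made up):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

class ExactPhraseAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}

Index and query with the same analyzer, and a phrase query for "the x 11" should
then match doc3 but not the others.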

Best
Erick

On Wed, Mar 17, 2010 at 4:25 PM, Joachim De Beule wrote:

> Hi All,
>
> I have a corpus of documents which I want to search for phrases. I only
> want
> to get those documents that exactly contain a phrase. for example if:
> doc1 = "x 11 windowing system"
> doc2 = "x windowing system"
> doc3 = "the x 11 windowing system"
>
> then I want the query "x 11 windowing system" to return only doc1 and doc3
> and
> the query "the x 11" to return only doc3.
>
> I have tried to use SimpleAnalyzer together with using the query as a
> single
> phrase, but this still also gives doc2 for the first example query because
> this
> analyzer discards the number 11. There does not seem to be an alternative
> analyzer for this however, and I don't know how to write one myself.
>
> Is there a standard way of doing this?
>
> Thanks!
>
> Joachim.
>


Re: OutOfMemory ParallelMultisearcher

2010-03-17 Thread Jamie

Hi Ian

Thanks for the info. It's difficult to reuse searchers, as my users are
performing realtime searches, so I need to open an IndexReader for every
live search query.


I've since tracked the OutOfMemory issue down to the sort on date. I am
using too high a precision (down to the second), which is causing the
number of terms to escalate. Also, I haven't yet moved the date
field in the index over to using Numerics instead of a String value, for
fear of breaking compatibility with my old index format.


I don't yet quite understand the implications of precisionStep and
what it all means. If I change my date string to a numeric integer in
the format MMddHHmm, what should the precisionStep value be?
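What I have in mind is roughly this (the field name is made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericField;

Document doc = new Document();
// e.g. a long like 201003171430L, i.e. down to the minute rather than the second
doc.add(new NumericField("sentdate").setLongValue(201003171430L));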


Jamie



On 2010/03/17 01:20 PM, Ian Lea wrote:

Hi


Caching searchers at some level should help keep memory usage down -
and will help performance too.  Searchers themselves don't generally
consume large amounts of memory, but if you've got loads of them then
obviously things will add up.

Unless you can change the whole design of your app (single index with
a user field that you use as a query filter to restrict users to
"their" data?) you may be stuck with giving the app more memory or
restricting the number of concurrent searchers.


---
Ian.


On Tue, Mar 16, 2010 at 10:00 PM, Jamie  wrote:
   

Hi There

I have an index which is 36 GB large. When I perform  eight simultaneous
searches (performed by JMeter) on the index, an OutOfMemory error occurs.
Since I need to potentially search across multiple indexes and those indexes
can change from one search query to the next, each user has their own
ParallelMultiSearcher object. Before each search operation, I reconstruct
the ParallelMultiSearcher with the appropriate Searchers for each of the
indexes that need to be included for that particular search query.

The problem is that requiring each user to have their own
ParallelMultisearcher seems to limit the number of  users that can use the
system at the same time.

While experimenting, when I make the ParallelMultiSearcher static, the same
object used by all users, the OutOfMemory problem goes away and I am able to
execute 50 simultaneous searches. The problem I have is I cannot make
ParallelMultiSearcher static, since the specific indexes used are variable
from one search query to the next. I initially thought one could just cache
the underlying Searchers and all would be okay, but this does not appear to
be the case.

My question: Will ParallelMultiSearcher tend to consume a large amount of
memory by itself when used on large indices? If so, do you have any
suggestions on how I might support the above scenario (i.e. when the indexes
used change from one query to the next)

Thanks in advance

Jamie





Re: Dealing with special cases in analyser

2010-03-17 Thread Paul Taylor

Grant Ingersoll wrote:

On Mar 17, 2010, at 11:34 AM, Paul Taylor wrote:

  

Grant Ingersoll wrote:


What's your current chain of TokenFilters?  How many exceptions do you expect?  
That is, could you enumerate them?
 
  

Very few, yes I could enumerate them, but not sure what exactly you are 
suggesting, what I was going to do would be add to the charConvertMap (when I 
posted I thought this was only for individual chars not strings)



You could have modify whichever filter is removing them to take in a protected 
words list and then short circuit to not remove that token.  This would be a 
hash map lookup, which should be faster than the char replacement you are 
considering. Many of the stemmers do this.


  
Hmm, they are removed by the tokenizer, not a filter, because they are
punctuation chars. I suppose I could try to modify the JFlex file.

