sub search

2006-03-07 Thread Anton Potehin
Is it possible to search among the results of a previous search?

For example, I ran a search:

Searcher searcher = ...
Query query = ...
Hits hits = searcher.search(query);

Now I don't want to run a new search over the whole index; I want to search
among the found results...



Re: sub search

2006-03-07 Thread hu andy
2006/3/7, Anton Potehin <[EMAIL PROTECTED]>:
>
> Is it possible to make search among results of previous search?
>
>
>
>
>
> For example: I made search:
>
>
>
> Searcher searcher =...
>
>
>
> Query query = ...
>
>
>
> Hits hits = 
>
>
>
> hits = Searcher.search(query);
>
>
>
>
>
>
>
> After it I want to not make a new search, I want to make search among
> found results...
>
You can use something like this:

TermQuery termQuery = new TermQuery(new Term("field", "value")); // term is illustrative
Filter queryFilter = new QueryFilter(termQuery);
hits = searcher.search(query, queryFilter);
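The filter approach restricts the second search to the document set matched by the first. A minimal plain-Java sketch of that idea, using java.util.BitSet in place of Lucene's filter machinery (class and method names here are illustrative, not Lucene API):

```java
import java.util.BitSet;

public class SubSearchSketch {
    // Documents matched by the first query (doc ids 0..9), hard-coded for the sketch.
    static BitSet firstResults(int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        bits.set(1); bits.set(3); bits.set(7);
        return bits;
    }

    // Documents matched by the second query.
    static BitSet secondResults(int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        bits.set(3); bits.set(4); bits.set(7); bits.set(9);
        return bits;
    }

    public static void main(String[] args) {
        int maxDoc = 10;
        BitSet filter = firstResults(maxDoc);
        BitSet hits = secondResults(maxDoc);
        hits.and(filter); // keep only docs already in the first result set
        System.out.println(hits); // prints {3, 7}
    }
}
```

A QueryFilter does essentially this: it materializes the first query's matches as a bit set over the index and ANDs it against the second query's candidates.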


Re: Distributed Lucene..

2006-03-07 Thread Andrzej Bialecki

Prasenjit Mukherjee wrote:
I think Nutch has a distributed Lucene implementation. I could have
used Nutch straightaway, but I have a different crawler, and also don't
want to use NDFS (which is used by Nutch). What I have proposed
earlier is basically based on the MapReduce paradigm, which is used by
Nutch as well.


It would be nice to get some articles specifically detailing the
distributed architecture used in Nutch.




A few comments:

* you can use your own crawler, and then only write some glue code to 
convert the output of that crawler to the format that Nutch uses.


* Nutch can be run in a so-called "local" mode, without using NDFS

* the core map-reduce and I/O functionality has been split to its own 
project, Hadoop, where the development is taking place at a furious rate 
;-) This code is completely independent of Nutch or Lucene. You can 
implement your own data processing using this framework.


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: sub search

2006-03-07 Thread anton
As far as I understand, that will run a new search over the whole index. But
what is the difference between that and the search described below?

TermQuery termQuery = new TermQuery(new Term("field", "value")); // term is illustrative
BooleanQuery bq = new BooleanQuery();
bq.add(termQuery, true, false); // required, not prohibited
bq.add(query, true, false);
hits = searcher.search(bq, queryFilter);



-Original Message-
From: hu andy [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 07, 2006 12:40 PM
To: java-user@lucene.apache.org
Subject: Re: sub search
Importance: High

2006/3/7, Anton Potehin <[EMAIL PROTECTED]>:
>
> Is it possible to make search among results of previous search?
> For example: I made search:
> Searcher searcher =...
> Query query = ...
> Hits hits = 
> hits = Searcher.search(query);
> After it I want to not make a new search, I want to make search among
> found results...
>
> You can use like this

TermQuery termQuery = new TermQuery(new Term("field", "value")); // term is illustrative
Filter queryFilter = new QueryFilter(termQuery);
hits = searcher.search(query, queryFilter);






about lucene 1.9

2006-03-07 Thread Haritha_Parvatham
Hi,
I have downloaded the latest release, Lucene 1.9, and deployed it in Tomcat.
When I search from the front end, it gives me the output below. Please tell me
how to use Lucene 1.9.

Welcome to the Lucene Template application. (This is the header)

Document Summary
null   null
null   null
null   null




Writing terms/freq pairs directly to the inverted file

2006-03-07 Thread Murat Yakici

Hi,
I would like to bypass the IndexWriter and directly write terms and
their frequencies to the index (and maybe proximity info later on). I
might have missed any previous discussion of this. As far as I know, the
high-level API in Lucene only allows you to add documents (which are
populated by terms) to the index through IndexWriter. These are resolved
to low-level method calls and written to the index. However, I'm getting
the following information (term/frequency pairs) to push to the
index: t1->f1, t2->f2, t3->f3, and so on. In other words, I need functionality
equivalent to IndexReader's termEnum/termDocs, but for IndexWriter (for
directly pushing terms to a document which already exists in the
index).



Is there a way to use the low-level API (FieldInfos, TermVectorWriter,
etc.) safely, without breaking the integrity of the index?


Which classes should I be looking at specifically?

Regards
Murat




Re: MultiPhraseQuery

2006-03-07 Thread Erik Hatcher

On Mar 7, 2006, at 2:35 AM, Eric Jain wrote:

Daniel Naber wrote:
Please try to add this to MultiPhraseQuery and let us know if it  
helps:

  public List getTerms() {
return termArrays;
  }


That is indeed all I need (the list wouldn't have to be mutable  
though). Any chance this could be committed?


Incidentally, it would be helpful if the PrecedenceQueryParser
instantiated MultiPhraseQueries via a call to an (overridable)
getMultiPhraseQuery method.


Since PQP is my doing, if you supply a patch for both of the above
against svn trunk and add it to a new JIRA issue, I'd be happy to apply it.


Erik





Re: sub search

2006-03-07 Thread hu andy
It uses a cache mechanism. The details are described in the book Lucene in
Action. Maybe you can test both to decide which is faster.

2006/3/7, [EMAIL PROTECTED] <[EMAIL PROTECTED]>:
>
> As far as I understood that will make new search throughout the index. But
> what the difference between that and search described below:
>
> TermQuery termQuery = new TermQuery(
> BooleanQuery bq = ..
> bq.add(termQuery,true,false);
> bq.add(query,true,false);
> hits = Searcher.search(bq,queryFilter);
>
>
>
> -Original Message-
> From: hu andy [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 07, 2006 12:40 PM
> To: java-user@lucene.apache.org
> Subject: Re: sub search
> Importance: High
>
> 2006/3/7, Anton Potehin <[EMAIL PROTECTED]>:
> >
> > Is it possible to make search among results of previous search?
> > For example: I made search:
> > Searcher searcher =...
> > Query query = ...
> > Hits hits = 
> > hits = Searcher.search(query);
> > After it I want to not make a new search, I want to make search among
> > found results...
> >
> > You can use like this
>
> TermQuery termQuery = new TermQuery(
> Filter  queryFilter = new QueryFilter(temQuery);
> hits = Searcher.search(query,queryFilter);
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Question

2006-03-07 Thread Thomas Papke

Hello,

Has anyone implemented the "Google Suggest" feature using Lucene? The frontend
is clear, but I need a very fast way to retrieve matching terms. For
example: the user has typed "Ab" and I want to give him a list of 10
possible words in the "name" term starting with "Ab*". So I don't need the
whole document, and I need this information really fast.


Thx,
Thomas
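Outside of Lucene, the core of a suggest feature is a fast prefix lookup over the set of indexed terms, which a sorted structure gives almost for free. A hedged plain-Java sketch of the idea using java.util.TreeSet (the vocabulary and the limit of 10 are illustrative; a real implementation would enumerate Lucene's term dictionary instead):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class PrefixSuggest {
    private final TreeSet<String> terms = new TreeSet<String>();

    public PrefixSuggest(List<String> vocabulary) {
        terms.addAll(vocabulary);
    }

    // Return up to 'limit' terms starting with 'prefix', in sorted order.
    // tailSet(prefix) jumps straight to the first candidate in O(log n).
    public List<String> suggest(String prefix, int limit) {
        List<String> out = new ArrayList<String>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix) || out.size() == limit) {
                break; // left the prefix range, or collected enough
            }
            out.add(term);
        }
        return out;
    }
}
```

Because the terms are kept sorted, the loop stops as soon as a term no longer shares the prefix, so each lookup touches only the matching range.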




RE: Question

2006-03-07 Thread Pasha Bizhan
Hi, 

> From: Thomas Papke [mailto:[EMAIL PROTECTED] 

> anyone implement the "Google Suggest" Feature using Lucene? 
> The Frontend is clear - but i need a very fast way to 
> retrieve matching terms. For
> example: The user typed "Ab" and i want to give him a list of 
> 10 possible words in term "name" starting with "Ab*". So i 
> don't need the hole document and i need this information realy fast.

It was implemented by David Spencer. See
http://searchmorph.com/experiments.php.
Unfortunately the online demo is unavailable now :(

Pasha Bizhan







Re: Question

2006-03-07 Thread gekkokid
Would Lucene even have to be accessed? Couldn't you save the queries when
they are submitted and search those via a SQL database?


_gk
- Original Message - 
From: "Thomas Papke" <[EMAIL PROTECTED]>

To: 
Sent: Tuesday, March 07, 2006 12:11 PM
Subject: Question



Hello,

anyone implement the "Google Suggest" Feature using Lucene? The Frontend 
is clear - but i need a very fast way to retrieve matching terms. For 
example: The user typed "Ab" and i want to give him a list of 10 possible 
words in term "name" starting with "Ab*". So i don't need the hole 
document and i need this information realy fast.


Thx,
Thomas




Re: Question

2006-03-07 Thread Leon Chaddock

Hi,
I am very interested in this as well, as I wish to display related searches
for users.

Does anyone know if this work is open source and is there an api available?
Thanks

Leon

- Original Message - 
From: "Pasha Bizhan" <[EMAIL PROTECTED]>

To: 
Sent: Tuesday, March 07, 2006 12:39 PM
Subject: RE: Question



Hi,


From: Thomas Papke [mailto:[EMAIL PROTECTED]



anyone implement the "Google Suggest" Feature using Lucene?
The Frontend is clear - but i need a very fast way to
retrieve matching terms. For
example: The user typed "Ab" and i want to give him a list of
10 possible words in term "name" starting with "Ab*". So i
don't need the hole document and i need this information realy fast.


It was implemented by David Spencer. See
http://searchmorph.com/experiments.php.
Unfortunately online demo is unavailable now :(

Pasha Bizhan
















Lucene version 1.9

2006-03-07 Thread WATHELET Thomas
I've created an index with Lucene version 1.9, and when I try to open
this index I always get this error message:
java.lang.ArrayIndexOutOfBoundsException.
If I use an index built with Lucene version 1.4.3, it works.
What's wrong?


RE: Question

2006-03-07 Thread Pasha Bizhan
Hi, 

> From: Leon Chaddock [mailto:[EMAIL PROTECTED] 
 
> I am very interested in this aswell, as I wish to display 
> related searches for users.

What does "related" mean?

> Does anyone know if this work is open source and is there an 
> api available?

Ask David or use web.archive:
http://web.archive.org/web/20050306065912/http://www.searchmorph.com/weblog/
index.php?id=26

Pasha Bizhan







Re: Question

2006-03-07 Thread Jeff Rodenburg
We've done this, and it's not that complex.  (Sorry, client won't allow me
to release the code.)
It's AJAX on the front end, so that background call is simply executing a
search against an index that consists of the aggregated search terms.  We do
wildcard queries to get the results we want.  For us, the search term
represents the whole document.

Pretty straightforward.

-- j

On 3/7/06, Thomas Papke <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> anyone implement the "Google Suggest" Feature using Lucene? The Frontend
> is clear - but i need a very fast way to retrieve matching terms. For
> example: The user typed "Ab" and i want to give him a list of 10
> possible words in term "name" starting with "Ab*". So i don't need the
> hole document and i need this information realy fast.
>
> Thx,
> Thomas
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Get only count

2006-03-07 Thread Anton Potehin
Now I create a new search to get the number of results. For example:

IndexSearcher is = ...

Query q = ...

numberOfResults = is.search(q).length();

Can I speed this up? And how?

 



RE: Search for synonyms - implementation for review

2006-03-07 Thread Ziv Gome
Hi all,

I have a few more remarks to add to Andrew's already thorough mail... I fear,
though, that Andrew gave me too much credit for a cooperative, brainstorming
work we both did.

1. How are the results? We have not conducted real research on the
results we got, in terms of recall and precision measurements, but we
definitely got much better results. In terms of precision, there was a
significant drop in non-relevant docs popping up, which is the main
problem with any query expansion, and specifically with synonym expansion.
Recall is harder to evaluate, but I don't see how we could lose
documents with synonyms; such documents could only be pushed down.

2. An addition to Andrew's explanation of idfFactor:
Andrew says:
>> factor like the following
>>  tf = sqrt( freq[tire] + 0.8 * freq[tyre] * idfFactor )
>> where
>>  idfFactor = (IDF[synonym] * IDF[synonym]) / (IDF[word] * IDF[word])

a) Note that in the code there is no squaring, since the value
returned from sumOfSquares is already squared.
b) We used the squared ratio since the IDF is taken squared in
Lucene's calculation (TermQuery, lines 45-55): first in constructing
queryWeight, second in "value = queryWeight * idf". Therefore, if we
wish to cancel the synonym IDF effect, we want the IDF factor to be
the square of the IDFs' ratio (or equivalently the ratio of the
sumOfSquares return values, as we do in the code). This part is done in
order to treat the queryNorm/queryWeight side of the calculation. This
is explained in the code, but was not elaborated in Andrew's mail.
c) We additionally multiply at this stage by the inverse ratio
(not squared). This is done in order to treat the tf() side of the
calculation. The idea is to normalize the frequency of the synonym using
IDFs. I'll try to explain this with the following example: say the user
searched for the term "car", which has "automobile" and "auto" as
synonyms, and say the synonym deboost factor is 0.9. We aim to get the
following tf result for it:
tf(car + syn) = tf(freq[car] + 0.9*(idf[auto]/idf[car])*freq[auto] +
0.9*(idf[automobile]/idf[car])*freq[automobile])
Let's see why this is needed: say car is a rare term (high IDF), but
auto is very common (low IDF). We wish to put those terms on the same
ground, i.e. an occurrence of the common term counts less than an
occurrence of the rare term. In addition, remember that the synonyms
are now normalized (relative to the other original terms in the query)
according to the original term (car in this case), and so it would be
"unfair" to bring in a common word through the back door by virtue of its
being a synonym of a rare term. True, the IDF comparison is not perfectly
suitable for evaluating in-document occurrences, but this was the best we
came up with. Hope this explanation helps. This is also explained, with
fewer details, in the code.
d) As mentioned in the code, items b and c above partly cancel
out, but we feel it is easier to understand this way.
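As a worked example of the formula in item (c), the synonym's frequency is scaled by the deboost factor times the IDF ratio before entering the usual tf = sqrt(freq) computation. All the numbers below (deboost 0.9, idf[car] = 4.0, idf[auto] = 1.0, freq[car] = 2, freq[auto] = 6) are made up for illustration:

```java
public class SynonymTf {
    // Frequency of the synonym is scaled by deboost * idf[syn]/idf[orig]
    // before entering the usual tf = sqrt(freq) computation.
    static double adjustedTf(double freqOrig, double idfOrig,
                             double freqSyn, double idfSyn, double deboost) {
        double combined = freqOrig + deboost * (idfSyn / idfOrig) * freqSyn;
        return Math.sqrt(combined);
    }

    public static void main(String[] args) {
        // "car" is rare (idf 4.0), "auto" is common (idf 1.0): six occurrences
        // of "auto" contribute only 0.9 * (1.0 / 4.0) * 6 = 1.35 to the
        // combined frequency, so tf = sqrt(2 + 1.35) = sqrt(3.35).
        System.out.println(adjustedTf(2, 4.0, 6, 1.0, 0.9));
    }
}
```

The common synonym thus cannot dominate the rare original term no matter how often it occurs in a document.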
  


3. The trick of normalizing the IDF according to the root term (the term
the user actually entered) also helps in solving a different problem:
the problem of searching in two fields and combining the results. The
DisjunctionQuery tries to solve it, and is doing a good job unless the
fields are of very different size, and the IDFs behave very differently.
One such case could be looking in a document's content and in its title.
We figured it would be better, in this case, to treat only the IDF given
by the "main" field (in the above example - the document content) in the
sumOfSquaredWeights calculation, in order to balance the different
terms. Later on, during the normalize(f) call, our query will pass on
down an adjusted norm to the "secondary" field TermQuery, where the
adjustment is basically the IDF ratios:
IDF-main-field(term)/IDF-secondary-field(term). This way the
summation/max (however you wish to accumulate the term over different
fields) is performed on equivalent-basis scores.

Thank you for reading all this...
Ziv Gome.


-Original Message-
From: Andrew Schetinin 
Sent: Monday, March 06, 2006 4:20 PM
To: java-user@lucene.apache.org
Cc: Ziv Gome
Subject: Search for synonyms - implementation for review

Dear all,
 
My colleague, Mr. Ziv Gome, and I would like to present here an
implementation of synonym search that we use in our server.
It will probably be interesting to those who have worked on synonyms, or
are going to implement synonym search.
We hope that this mail will raise interesting ideas and lead to
useful results :-)

And I would like to mention that it was Ziv who inspired this research
and development.
He did most of the analytical work and basically invented this entire
idea.
 
1. Background

There is an interesting discussion about synonyms in Lucene mailing list

http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg11594.html
which suggests expanding the search query with additional clauses for
synonyms penalized using boost fa

Re: sub search

2006-03-07 Thread Erik Hatcher

On Mar 7, 2006, at 7:03 AM, hu andy wrote:

It uses cache mechanism. The detail is described in the book Lucene in
Action. Maybe you can test it to decide which is faster


Major caveat here is that the caching QueryFilter employs really only  
works if you use the same instance of QueryFilter for successive  
searches using the same IndexReader (via IndexSearcher) instance.  If  
you're simply using a previous query to AND the current query and the  
previous query is not something that will be reused later, the  
BooleanQuery AND option is what I recommend.
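The caveat above can be sketched in plain Java: the cache lives inside the filter instance, keyed by the reader, so constructing a fresh filter per search never gets a cache hit. Everything here (MockReader, the hard-coded bits) is a stand-in, not Lucene code:

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;

public class CachingFilterSketch {
    // Stand-in for IndexReader: only its identity matters for the cache key.
    public static class MockReader {}

    // WeakHashMap so closed readers do not pin their cached bits forever.
    private final Map<MockReader, BitSet> cache = new WeakHashMap<MockReader, BitSet>();
    public int computations = 0; // how many times bits were actually computed

    public BitSet bits(MockReader reader) {
        BitSet cached = cache.get(reader);
        if (cached != null) {
            return cached; // cache hit: same filter instance + same reader
        }
        computations++;
        BitSet bits = new BitSet(8); // pretend the wrapped query ran here
        bits.set(2);
        bits.set(5);
        cache.put(reader, bits);
        return bits;
    }
}
```

Reusing one instance against the same reader computes the bits once; a new instance for each search pays the full query cost every time, which is why the BooleanQuery AND is the better fit for one-off sub-searches.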


Erik




2006/3/7, [EMAIL PROTECTED] <[EMAIL PROTECTED]>:


As far as I understood that will make new search throughout the  
index. But

what the difference between that and search described below:

TermQuery termQuery = new TermQuery(
BooleanQuery bq = ..
bq.add(termQuery,true,false);
bq.add(query,true,false);
hits = Searcher.search(bq,queryFilter);



-Original Message-
From: hu andy [mailto:[EMAIL PROTECTED]
Sent: Tuesday, March 07, 2006 12:40 PM
To: java-user@lucene.apache.org
Subject: Re: sub search
Importance: High

2006/3/7, Anton Potehin <[EMAIL PROTECTED]>:


Is it possible to make search among results of previous search?
For example: I made search:
Searcher searcher =...
Query query = ...
Hits hits = 
hits = Searcher.search(query);
After it I want to not make a new search, I want to make search  
among

found results...

You can use like this


TermQuery termQuery = new TermQuery(
Filter  queryFilter = new QueryFilter(temQuery);
hits = Searcher.search(query,queryFilter);






Re: Get only count

2006-03-07 Thread Eric Jain

Anton Potehin wrote:

Now I create new search for get number of results. For example:

IndexSearcher is = ...

Query q = ... 


numberOfResults = Is.search(q).length();

Can I accelerate this example ? And how ?


Perhaps something like:

class CountingHitCollector
  extends HitCollector
{
  public int count;

  public void collect(int doc, float score)
  {
    if (score > 0.0f)
      ++count;
  }
}

...

CountingHitCollector c = new CountingHitCollector();
searcher.search(query, c);
int hits = c.count;
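The same counting idea in dependency-free Java, with a tiny callback interface standing in for Lucene's HitCollector (all names here are illustrative): the point is that nothing beyond an int is allocated per hit, unlike Hits, which buffers and scores result objects.

```java
public class CountingSketch {
    // Stand-in for Lucene's HitCollector callback.
    interface Collector {
        void collect(int doc, float score);
    }

    static class CountingCollector implements Collector {
        int count;
        public void collect(int doc, float score) {
            ++count; // count every collected document; no result objects built
        }
    }

    // Stand-in for a search: feeds each matching doc id to the collector.
    static void search(int[] matchingDocs, Collector collector) {
        for (int doc : matchingDocs) {
            collector.collect(doc, 1.0f);
        }
    }

    public static void main(String[] args) {
        CountingCollector c = new CountingCollector();
        search(new int[] {1, 4, 7}, c);
        System.out.println(c.count); // prints 3
    }
}
```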





Re: sub search

2006-03-07 Thread Eric Jain

Anton Potehin wrote:
After it I want to not make a new search, 

> I want to make search among found results...

Perhaps something like this would work:

final BitSet results = toBitSet(hits); // helper converting the previous Hits to a BitSet
searcher.search(newQuery, new Filter() {
  public BitSet bits(IndexReader reader) {
    return results;
  }
});





Unreported IOException received for SpanTermQuery class

2006-03-07 Thread Murat Yakici

Hi,
I was building the Lucene 1.9.1 source code. I have received the 
following error msg:


"Unreported exceptions: java.io.IOException must be caught or declared 
to be thrown. " in class SpanOrQuery, line number 154.


Any ideas how to resolve it?

Regards,
Murat




RE: Get only count

2006-03-07 Thread anton
Why did you add "if (score > 0.0f)"? The Javadoc contains the line
"HitCollector.collect(int,float) is called for every non-zero scoring" document.

-Original Message-
From: Eric Jain [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 07, 2006 5:08 PM
To: java-user@lucene.apache.org
Subject: Re: Get only count
Importance: High

Anton Potehin wrote:
> Now I create new search for get number of results. For example:
> 
> IndexSearcher is = ...
> 
> Query q = ... 
> 
> numberOfResults = Is.search(q).length();
> 
> Can I accelerate this example ? And how ?

Perhaps something like:

class CountingHitCollector
   implements HitCollector
{
   public int count;

   public void collect(int doc, float score)
   {
 if (score > 0.0f)
   ++count;
   }
}

...

CountingHitCollector c = new CountingHitCollector();
searcher.search(query, c);
int hits = c.count;






Re: Lucene version 1.9

2006-03-07 Thread Paul Elschot
Thomas,

On Tuesday 07 March 2006 13:57, WATHELET Thomas wrote:
> I've created an index with the Lucene version 1.9 and when I try to open
> this index I have always this error mesage:
> java.lang.ArrayIndexOutOfBoundsException.
> if I use an index built with the lucene version 1.4.3 it's working.
> Wath's wrong?

Iirc this was fixed in 1.9.1:
http://lucene.apache.org/java/docs/index.html

Regards,
Paul Elschot
 




Re: Get only count

2006-03-07 Thread Yonik Seeley
On 3/7/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> While you added "if (score > 0.0f)". Javadoc contain lines
> "HitCollector.collect(int,float) is called for every non-zero scoring".

That should probably read "is called for every matching document".

-Yonik




Re: Unreported IOException received for SpanTermQuery class

2006-03-07 Thread Paul Elschot
On Tuesday 07 March 2006 15:35, Murat Yakici wrote:
> Hi,
> I was building the Lucene 1.9.1 source code. I have received the 
> following error msg:
> 
> "Unreported exceptions: java.io.IOException must be caught or declared 
> to be thrown. " in class SpanOrQuery, line number 154.
> 
> Any ideas how to resolve it?

Which compiler do you use?
My guess would be gcj.
The indicated line is in an initialisation block for an anonymous
inline subclass, and gcj's support for such constructs was not
complete the last time I tried.

Regards,
Paul Elschot.




RE: Get only count

2006-03-07 Thread anton
Can a matching document have a score equal to zero?
-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 07, 2006 6:20 PM
To: java-user@lucene.apache.org
Subject: Re: Get only count
Importance: High

On 3/7/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> While you added "if (score > 0.0f)". Javadoc contain lines
> "HitCollector.collect(int,float) is called for every non-zero scoring".

That should probably read "is called for every matching document".

-Yonik




indexing problems

2006-03-07 Thread Apache Lucene
Hi,
   I am using Lucene 1.9.1 to index the files. The index writer created
the following files
(1) segment file "segments"
(2) deletable file "deletable"
(3) compound file "cfs"

None of the other files like term info, frequency..etc were created. Is
there something obvious, I am doing wrong?

thanks,
lucenator


Re: Unreported IOException received for SpanTermQuery class

2006-03-07 Thread Murat Yakici

The compiler is Sun Java 1.4.2_08.

Paul Elschot wrote:


On Tuesday 07 March 2006 15:35, Murat Yakici wrote:


Hi,
I was building the Lucene 1.9.1 source code. I have received the 
following error msg:


"Unreported exceptions: java.io.IOException must be caught or declared 
to be thrown. " in class SpanOrQuery, line number 154.


Any ideas how to resolve it?



Which compiler do you use?
My guess would be gcj.
The indicated line is in an initialisation block for an anonymous
inline subclass, and gcj's support for such constructs was not
complete the last time I tried.

Regards,
Paul Elschot.




Re: indexing problems

2006-03-07 Thread Yonik Seeley
You are using the compound file format (the default since 1.4) and the
.cfs file contains all those individual parts.

-Yonik

On 3/7/06, Apache Lucene <[EMAIL PROTECTED]> wrote:
> Hi,
>I am using Lucene 1.9.1 to index the files. The index writer created
> the following files
> (1) segment file "segments"
> (2) deletable file "deletable"
> (3) compound file "cfs"
>
> None of the other files like term info, frequency..etc were created. Is
> there something obvious, I am doing wrong?
>
> thanks,
> lucenator
>
>




Re: Get only count

2006-03-07 Thread Yonik Seeley
On 3/7/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Can have matching document score equals zero ?

Yes.  Scorers don't generally use "score" to determine if a document
matched the query.
Scores <= 0.0f are currently screened out at the top level search
functions, but not when you use a HitCollector yourself.

-Yonik


> -Original Message-
> From: Yonik Seeley [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 07, 2006 6:20 PM
> To: java-user@lucene.apache.org
> Subject: Re: Get only count
> Importance: High
>
> On 3/7/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> > While you added "if (score > 0.0f)". Javadoc contain lines
> > "HitCollector.collect(int,float) is called for every non-zero scoring".
>
> That should probably read "is called for every matching document".
>
> -Yonik




Re: indexing problems

2006-03-07 Thread Apache Lucene
Is it advisable to use the compound file format, or should I revert to the
simple file format? How do I revert?

thanks,

lucenenator


On 3/7/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
> You are using the compound file format (the default since 1.4) and the
> .cfs file contains all those individual parts.
>
> -Yonik
>
> On 3/7/06, Apache Lucene <[EMAIL PROTECTED]> wrote:
> > Hi,
> >I am using Lucene 1.9.1 to index the files. The index writer
> created
> > the following files
> > (1) segment file "segments"
> > (2) deletable file "deletable"
> > (3) compound file "cfs"
> >
> > None of the other files like term info, frequency..etc were created. Is
> > there something obvious, I am doing wrong?
> >
> > thanks,
> > lucenator
> >
> >
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: Unreported IOException received for SpanTermQuery class

2006-03-07 Thread Paul Elschot
On Tuesday 07 March 2006 16:34, Murat Yakici wrote:
> The compiler is Sun Java 1.4.2_08.

I'm using sun javac 1.5.0_01 and this compiles the current trunk without
any problems, so I cannot reproduce the error msg.
The common-build.xml file uses source and target 1.4 for javac,
(in the compile macro) and I can't think of anything else that might
cause the error msg.

Btw, this message title mentions SpanTermQuery but below it says
SpanOrQuery.

Regards,
Paul Elschot

 
> Paul Elschot wrote:
> 
> > On Tuesday 07 March 2006 15:35, Murat Yakici wrote:
> > 
> >>Hi,
> >>I was building the Lucene 1.9.1 source code. I have received the 
> >>following error msg:
> >>
> >>"Unreported exceptions: java.io.IOException must be caught or declared 
> >>to be thrown. " in class SpanOrQuery, line number 154.
> >>
> >>Any ideas how to resolve it?
> > 
> > 
> > Which compiler do you use?
> > My guess would be gcj.
> > The indicated line is in an initialisation block for an anonymous
> > inline subclass, and gcj's support for such constructs was not
> > complete the last time I tried.
> > 
> > Regards,
> > Paul Elschot.
> > 
> 
> 
> 
> 
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing problems

2006-03-07 Thread Erik Hatcher


On Mar 7, 2006, at 10:41 AM, Apache Lucene wrote:

Is it advisable to use compound file format? or should I revert it  
back to

simple file format? How do I revert it back?


There is a setter on IndexWriter to set it back if you like.   The  
compound format avoids the issues that cropped up a lot in the past  
with greatly segmented indexes eating up all available file handles.


How you set it really depends.  Try both and see.
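The setter Erik refers to is IndexWriter.setUseCompoundFile. A minimal sketch (the index path is only an example; note that segments already written as .cfs stay compound until the next merge or optimize rewrites them):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class NonCompoundIndex {
    public static void main(String[] args) throws Exception {
        // "/tmp/myindex" is only an illustrative path
        Directory dir = FSDirectory.getDirectory("/tmp/myindex", true);
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        // new segments get separate .fnm/.frq/.prx/... files instead of one .cfs
        writer.setUseCompoundFile(false);
        // ... writer.addDocument(...) calls ...
        writer.close();
    }
}
```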


lucenenator


haha!

Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Unreported IOException received for SpanTermQuery class

2006-03-07 Thread Murat Yakici

Yeah, I know, sorry for that.
The reason is, first I tried to solve the problem by wrapping the line 
with a try-catch block. Then, the next build gave the same error for 
SpanTermQuery and some other classes.


I will try to compile that on 1.5.0_01.

Thanks,
Murat

Paul Elschot wrote:


On Tuesday 07 March 2006 16:34, Murat Yakici wrote:


The compiler is Sun Java 1.4.2_08.



I'm using sun javac 1.5.0_01 and this compiles the current trunk without
any problems, so I cannot reproduce the error msg.
The common-build.xml file uses source and target 1.4 for javac,
(in the compile macro) and I can't think of anything else that might
cause the error msg.

Btw, this message title mentions SpanTermQuery but below it says
SpanOrQuery.

Regards,
Paul Elschot

 


Paul Elschot wrote:



On Tuesday 07 March 2006 15:35, Murat Yakici wrote:



Hi,
I was building the Lucene 1.9.1 source code. I have received the 
following error msg:


"Unreported exceptions: java.io.IOException must be caught or declared 
to be thrown. " in class SpanOrQuery, line number 154.


Any ideas how to resolve it?



Which compiler do you use?
My guess would be gcj.
The indicated line is in an initialisation block for an anonymous
inline subclass, and gcj's support for such constructs was not
complete the last time I tried.

Regards,
Paul Elschot.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Classification / Change Scoring during search

2006-03-07 Thread Rainer Dollinger
Hello,

I want to use Lucene to get similar documents based on a Boolean Query
(similar metadata with OR clauses) and ratings of the user for already
searched documents.

I intend to implement a Naive Bayes classifier to categorize documents
into liked/disliked classes and would do this by using a HitCollector class.

class ClassifyingHitCollector implements HitCollector {

  public void collect(int doc, float score) {
// classify document

// if document is liked -> add to hit collection
  }

}

...

ClassifyingHitCollector c = new ClassifyingHitCollector ();
searcher.search(query, c);


This means that the calculation of the bayes classification has to be
calculated for each matching document. Is there a possibility to do this
(during search) for only the n top matching documents or does this mean
to use the Hits returning searcher.search(..) overload and do the
calculation on the n top matching documents, after the Lucene search?

Is there another possibility to change the scoring of the search(..)
method that is more efficient?

TIA,
Rainer
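One way to restrict the expensive Bayes computation to the n top-scoring matches is to let Lucene rank first and classify afterwards, using the Hits-returning search overload. A rough sketch — isLiked() is a stand-in for the actual Naive Bayes decision, which is not shown:

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class TopNClassifier {
    /** Placeholder for the Naive Bayes liked/disliked decision. */
    static boolean isLiked(Document doc) {
        return true; // hypothetical: the real classifier goes here
    }

    /** Classify only the n top-scoring matches instead of every match. */
    static void collectLiked(IndexSearcher searcher, Query query, int n)
            throws IOException {
        Hits hits = searcher.search(query); // already ranked by score
        int limit = Math.min(n, hits.length());
        for (int i = 0; i < limit; i++) {
            if (isLiked(hits.doc(i))) {
                // keep hits.id(i) in the "liked" result set
            }
        }
    }
}
```

The HitCollector approach in the original message classifies every matching document during scoring; this post-hoc variant trades that for one field fetch (hits.doc) per top document, which is usually cheaper when n is small.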

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing problems

2006-03-07 Thread Apache Lucene
This line is throwing a null pointer exception for the index I created as I
mentioned in my previous emails.

searcher = new IndexSearcher(IndexReader.open(indexPath) );

Any ideas? I made sure the indexPath is a valid path.

thanks,

lucenenator


On 3/7/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>
>
> On Mar 7, 2006, at 10:41 AM, Apache Lucene wrote:
>
> > Is it advisable to use compound file format? or should I revert it
> > back to
> > simple file format? How do I revert it back?
>
> There is a setter on IndexWriter to set it back if you like.   The
> compound format avoids the issues that cropped up a lot in the past
> with greatly segmented indexes eating up all available file handles.
>
> How you set it really depends.  Try both and see.
>
> > lucenenator
>
> haha!
>
>Erik
>
>
>
>


RE: Using NOT queries inside parentheses

2006-03-07 Thread Satuluri, Venu_Madhav

>   Query at = new TermQuery(new Term("alwaysTrueField","true"));
>   Query user = queryParser.parse(userInput);
>   if (user instanceof BooleanQuery) {
>  BooleanQuery bq = (BooleanQuery)user;
>  if (! usableBooleanQuery(bq)) {
> bq.add(at, true, false); /* add 'always true' clause directly
*/
> return bq;
>  }
>   }
>   /* if we made it here, wrap both clauses. */
>  BooleanQuery q = new BooleanQuery();
>   q.add(at, true, false);
>   q.add(user, true, false);
>   return q;

Many thanks, Chris, it's working for me perfectly.

> If you want this to work, the most elegant way I've found is to
override 
> the getBooleanQuery(Vector) method in QueryParser and insert a 
> MatchAllDocsQuery into the boolean query if every clause is
prohibited.
> 
> Daniel

I tried this, but it looks like the overridden method
getBooleanQuery(vector) does not get called. I am using 1.4.3.

Thanks,
Venu





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: indexing problems

2006-03-07 Thread Apache Lucene
BTW, I could access that index using Luke. It works fine.


On 3/7/06, Apache Lucene <[EMAIL PROTECTED]> wrote:
>
>  This line is throwing a null pointer exception for the index I created as
> I mentioned in my previous emails.
>
> searcher = new IndexSearcher(IndexReader.open(indexPath) );
>
> Any ideas? I made sure the indexPath is a valid path.
>
> thanks,
>
> lucenenator
>
>
>  On 3/7/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> >
> >
> > On Mar 7, 2006, at 10:41 AM, Apache Lucene wrote:
> >
> > > Is it advisable to use compound file format? or should I revert it
> > > back to
> > > simple file format? How do I revert it back?
> >
> > There is a setter on IndexWriter to set it back if you like.   The
> > compound format avoids the issues that cropped up a lot in the past
> > with greatly segmented indexes eating up all available file handles.
> >
> > How you set it really depends.  Try both and see.
> >
> > > lucenenator
> >
> > haha!
> >
> >Erik
> >
> >
> >
> >
>


Scoring with FunctionQueries?

2006-03-07 Thread Sebastian Marius Kirsch
Hello,

I have been trying out Yonik's excellent FunctionQuery (from Solr),
but am having some problems regarding the scoring of FunctionQueries
in conjunction with other queries.

I am currently researching a data fusion approach, where you have
several separate scores for a document and combine them to produce a
composite score. One of these scores is the regular Lucene score,
which I essentially treat as a black box.

I'm trying a simple linear combination, ie.

score = a * score_a + b * score_b ...

I have one Query produced by a QueryParser, and one (possibly several)
FunctionQueries which provide additional scores. I combine them with a
BooleanQuery with required clauses.

When I look at the explanation of such a combined Query, I see that
the scores of the subqueries are all multiplied by the query norm --
but I want only the score of the full-text query to be multiplied by
the query norm. The function queries should be added to the final
query as they are (the factors a, b, ... could be set using a query
boost.)

How do I achieve that? I'm rather lost in the forest of Scorer,
Similarity and Weight right now. Which is the right place to add such
a modification, so that it doesn't mess up the rest of the scoring?


I already tried extending BooleanQuery so that getSimilarity returns a
Similarity which overloads just queryNorm, to return 1.0. But this
queryNorm is then used both for the FunctionQuery and the full-text
query.


Thanks very much for your answers.

Regards, Sebastian

-- 
Sebastian Kirsch <[EMAIL PROTECTED]> [http://www.sebastian-kirsch.org/]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene version 1.9

2006-03-07 Thread Doug Cutting

WATHELET Thomas wrote:

I've created an index with the Lucene version 1.9 and when I try to open
this index I have always this error mesage:
java.lang.ArrayIndexOutOfBoundsException.
if I use an index built with the lucene version 1.4.3 it's working.
What's wrong?


Are you perhaps trying to open an index created with 1.9 with 1.4.3? 
That won't work.  In general, you need to open indexes with a Lucene 
version greater or equal to the version which last wrote the index.


Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Throughput doesn't increase when using more concurrent threads

2006-03-07 Thread Peter Keegan
I ran a query performance tester against 8-cpu and 16-cpu Xeon servers
(16/32 cpu hyperthreaded). on Linux. Here are the results:

8-cpu:  275 qps
16-cpu: 305 qps
(the dual-core Opteron servers are still faster)

Here is the stack trace of 8 of the 16 query threads during the test:

at org.apache.lucene.index.SegmentReader.document(SegmentReader.java
:281)
- waiting to lock <0x002adf5b2110> (a
org.apache.lucene.index.SegmentReader)
at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:83)
at org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java
:146)
at org.apache.lucene.search.Hits.doc(Hits.java:103)

SegmentReader.document is a synchronized method. I have one stored field
(binary, uncompressed) with an average length of 0.5 KB. The retrieval of
this stored field is within this synchronized code. Since I am using
MMapDirectory, does this retrieval need to be synchronized?

Peter

On 2/23/06, Peter Keegan <[EMAIL PROTECTED]> wrote:
>
> Yonik,
>
> We're investigating both approaches.
> Yes, the resources (and permutations) are dizzying!
>
> Peter
>
>
> On 2/23/06, Yonik Seeley < [EMAIL PROTECTED]> wrote:
> >
> > Wow, some resources!
> > Would it be cheaper / more scalable to copy the index to multiple
> > boxes and loadbalance requests across them?
> >
> > -Yonik
> >
> > On 2/23/06, Peter Keegan <[EMAIL PROTECTED]> wrote:
> > > Since I seem to be cpu-bound right now, I'll be trying a 16-cpu system
> > next
> > > (32 with hyperthreading), on LinTel. I may give JRockit another go
> > around
> > > then.
> > >
> > > Thanks,
> > > Peter
> >
> >
> >
>


Re: Throughput doesn't increase when using more concurrent threads

2006-03-07 Thread Doug Cutting

Peter Keegan wrote:

I ran a query performance tester against 8-cpu and 16-cpu Xeon servers
(16/32 cpu hyperthreaded). on Linux. Here are the results:

8-cpu:  275 qps
16-cpu: 305 qps
(the dual-core Opteron servers are still faster)

Here is the stack trace of 8 of the 16 query threads during the test:

at org.apache.lucene.index.SegmentReader.document(SegmentReader.java
:281)
- waiting to lock <0x002adf5b2110> (a
org.apache.lucene.index.SegmentReader)
at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:83)
at org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java
:146)
at org.apache.lucene.search.Hits.doc(Hits.java:103)

SegmentReader.document is a synchronized method. I have one stored field
(binary, uncompressed) with an average length of 0.5 KB. The retrieval of
this stored field is within this synchronized code. Since I am using
MMapDirectory, does this retrieval need to be synchronized?


Yes, since in FieldReader the file positions must be synchronized.

The way to avoid this would be to:

1. Add a clone() method to FieldReader that clones it's two IndexInputs.
2. Add a ThreadLocal to SegmentReader whose value is a cloned FieldReader.
3. Use the ThreadLocal's FieldReader in the document() method.

TermInfosReader has a similar optimization, using a ThreadLocal 
containing a SegmentTermEnum for each thread.


Doug
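The per-thread clone pattern Doug describes, reduced to its essentials. This is not actual Lucene source; "FieldsData" stands in for the object holding the two IndexInputs (and their file positions) that must not be shared between threads.

```java
public class PerThreadReader {
    static class FieldsData implements Cloneable {
        public Object clone() {
            try {
                return super.clone(); // in Lucene: also clone the two IndexInputs
            } catch (CloneNotSupportedException e) {
                throw new RuntimeException(e);
            }
        }
    }

    private final FieldsData master = new FieldsData();

    // Each thread lazily gets its own clone, so document() needs no
    // synchronization around the file positions.
    private final ThreadLocal perThread = new ThreadLocal() {
        protected Object initialValue() {
            return (FieldsData) master.clone();
        }
    };

    FieldsData get() {
        return (FieldsData) perThread.get();
    }
}
```

This is the same trick TermInfosReader already plays with its per-thread SegmentTermEnum: the clones share the underlying (memory-mapped) data but each keeps a private read position.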

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Distributed Lucene..

2006-03-07 Thread Otis Gospodnetic
Hi,
Just curious about this:

> We hacked :-) IndexWriter of Lucene to start all segment names with a
> prefix unique for each small index part.
> Then, when adding it to the actual index, we simply copy the new segment
> to the folder with the other segments, and add it in such a way so that
> the optimize() function cannot be called.
> This way adding a new segment is very unintrusive for the searcher.
> Optimization is scheduled to happen at night.


You just copy your uniquely-named segments in the index directory and manually 
modify the "segments" file to list all copied segments?

Thanks,
Otis




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Scoring with FunctionQueries?

2006-03-07 Thread Chris Hostetter

: but I want only the score of the full-text query to be multiplied by
: the query norm. The function queries should be added to the final
: query as they are (the factors a, b, ... could be set using a query
: boost.)
:
: How do I achieve that? I'm rather lost in the forest of Scorer,
: Similarity and Weight right now. Which is the right place to add such
: a modification, so that it doesn't mess up the rest of the scoring?

two approaches would work depending on your goal:

1) change the default similarity (using Similarity.setDefault(Similarity)
used by all queries to a version with queryNorm returning an constant, and
then in the few queries where you want the more traditional queryNorm,
override the getSimilarity method inline...

   Query q = new TermQuery(new Term("foo","bar")) {
  public Similarity getSimilarity(Searcher s) {
return new DefaultSimilarity();
  }
   };

2) reverse step one ... override getSimiliarity() just in the classes you
want to queryNorm to be constant and leave hte default alone.

: I already tried extending BooleanQuery so that getSimilarity returns a
: Similarity which overloads just queryNorm, to return 1.0. But this
: queryNorm is then used both for the FunctionQuery and the full-text
: query.

Hmmm ... that really doesn't sound right, are you sure you don't mean you
changed the default similarity, or changed the similarity on the searcher?



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



ReIndex or rework query

2006-03-07 Thread Jennifer Sears
We've built an index that has 8 stored, tokenized text fields. For
optimizing search results, should we:
1. build the query programmatically and try to determine which field the
searchTerm might fit in (i.e. Terms that would match in City, country, would
not match in award or amenity)

2. Do a multi field query

3. Do a boolean query for searchTerms on all 8 fields?

4. Option we haven't thought of.

We're able to get results from the index using the following code, however,
we have been unable to get reliable or appropriate scoring.

If there's a better way, we'd appreciate the help.

Thanks,

Jennifer


IndexSearcher is = new IndexSearcher(indexDirectory);
StandardAnalyzer analyzer = new StandardAnalyzer();
String[] fields = {"hotel_name", "hotel_city", "hotel_brand", "hotel_country",
    "hotel_type", "hotel_feature", "hotel_activity", "hotel_award"};
BooleanClause.Occur[] flags = {BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD,
    BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD,
    BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD};
Query q = MultiFieldQueryParser.parse(query, fields, flags, analyzer);
hits = is.search(q);






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene 1.9.1 and timeToString() apparent incompatibility with 1.4.3

2006-03-07 Thread George Washington


I recently converted from Lucene 1.4.3 to 1.9.1 and in the process replaced 
all deprecated classes with the new ones as recommended (for forward 
compatibility with Lucene 2.0).
This however seems to introduce an incompatibility when the new 
timeToString() and stringToTime() methods are used. Using an index created 
with 1.4.3 and searched with 1.9.1 I now receive the following errors:


java.text.ParseException: Input is not valid date string: 0ehi17c0g
   at org.apache.lucene.document.DateTools.stringToDate(DateTools.java:140)
   at org.apache.lucene.document.DateTools.stringToTime(DateTools.java:110)
   at etc.., etc..

My 1.9.1 code is

when indexing:
   long modDate = conn.getLastModified(); //the file's last modified 
date
   String longDate = 
DateTools.timeToString(modDate,DateTools.Resolution.MINUTE);

   indxDoc.add(new Field("longdate", longDate, Field.Store.YES,
   Field.Index.TOKENIZED));

when searching:

   Date d = new Date();
   try {
 d.setTime(stringToTime(longDate));
   } catch (ParseException e) {
 e.printStackTrace();
   }


My 1.4.3 code was:
when indexing:

String longDate = DateField.timeToString(modDate);
indxDoc.add(new Field("longdate", longDate, Field.Store.YES, 
Field.Index.TOKENIZED));


when searching:

Date d = new Date();
d.setTime(org.apache.lucene.document.DateField.stringToTime(longDate));

The problem does not occur if I both create and search the index with 1.9.1

I assume there is a better way to do this than the above code as this 
incompatibility is not documented.
I know I can always revert to the old code in order to avoid re-creating the 
index, but I would prefer to find a solution that uses the latest classes 
AND avoids re-creating the index, if possible.
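One workaround sketch (not an official migration path): old 1.4.3 values like "0ehi17c0g" use DateField's radix-36 encoding, which DateTools rejects with a ParseException, so the search code can try the new parser and fall back to the deprecated DateField, which is still shipped in 1.9.1.

```java
import java.text.ParseException;
import org.apache.lucene.document.DateField;
import org.apache.lucene.document.DateTools;

public class DateCompat {
    /** Parse a stored date written either by DateTools (1.9.x) or DateField (1.4.3). */
    static long parseStoredDate(String longDate) throws ParseException {
        try {
            return DateTools.stringToTime(longDate); // new-format value
        } catch (ParseException e) {
            return DateField.stringToTime(longDate); // old radix-36 value
        }
    }
}
```

New documents can then be written with DateTools only, and the fallback becomes dead code once the index is eventually rebuilt.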

thanks for any help,

PS this is a re-transmission, I apologise if more than one copy is received, 
I had mail problems


_
realestate.com.au: the biggest address in property   
http://ninemsn.realestate.com.au



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: BooleanQuery$TooManyClauses with 1.9.1 when Number RangeQuery

2006-03-07 Thread Youngho Cho
Hello

- Original Message - 
From: "Chris Hostetter" <[EMAIL PROTECTED]>
To: "Lucene Users" 
Sent: Tuesday, March 07, 2006 3:49 PM
Subject: Re: BooleanQuery$TooManyClauses with 1.9.1 when Number RangeQuery


> 
> : I upgade to 1.9.1 and reindexing
> : I used NumberTool when I index the number.
> :
> : after upgrade I got following error when number range query.
> : with query
> 
> The possibility of a TooManyClauses exception has always existed with
> RangeQuery and numbers, even when using NumberTool.  Even if you never saw
> it before, and you are still querying on the exact same range as before,
> adding new docs with values in that range can trigger the exception.
> 
You mean theoretically
RangeQuery should be forbidden because it always has a potential time bomb?
Should we comment it in javadoc ?

> Consider using a RangeFilter instead, or the ConstantScoreRangeQuery which
> doesn't have this limitation.
> 
Thanks for your alternative suggestion.
I will try.

Youngho
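The two alternatives Chris suggests, sketched with an illustrative field name ("price", assumed to hold NumberTools-padded values):

```java
import java.io.IOException;
import org.apache.lucene.search.ConstantScoreRangeQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RangeFilter;

public class RangeAlternatives {
    /** Filter form: never expands to one clause per term, so no TooManyClauses. */
    static Hits filterForm(IndexSearcher s, String lo, String hi) throws IOException {
        return s.search(new MatchAllDocsQuery(),
                        new RangeFilter("price", lo, hi, true, true));
    }

    /** Query form of the same idea; composes with other clauses in a BooleanQuery. */
    static Hits queryForm(IndexSearcher s, String lo, String hi) throws IOException {
        Query q = new ConstantScoreRangeQuery("price", lo, hi, true, true);
        return s.search(q);
    }
}
```

Both iterate the matching terms instead of rewriting to a BooleanQuery, so they scale with the number of terms in the range but never hit the clause limit; the trade-off is that matching documents all score the same for that clause.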

Weighted Terms Per Document

2006-03-07 Thread Matthew O'Connor
Hello,

I'm using Lucene 1.9 to replace an in-house search engine where all of the
documents to be searched are also created in-house.  One of the features of the
search engine is something called 'xtras' which are associated with the
documents.  I am wondering how best to model this feature using Lucene.  I have
one solution (offered below for critique) but I'm not sure it's the best way,
being a Lucene newbie.

First let me better explain 'xtras' and how they work in the *old* search
engine.  A document can have zero or more 'xtras'.  'xtras' consist of a token
and a weight.  At index time  this weight is taken into account when computing
a score which is saved in the index.  

The index is a database table with three columns and PK of (token, docid):

token => document id => score

The search algorithm is pretty obvious from here.  A user enters in a query,
it's parsed into tokens, and we gather up all the unique document ids and add
their scores together.  In SQL the logic is something like this:

SELECT docid, SUM(score) AS score
FROM SearchIndex 
WHERE token IN (...constructed from user query...)
GROUP BY docid
ORDER BY score DESC

The 'xtras' come into play when saving the score to the index.  Each row in the
index is a triple: (token, docid, score).  The base score is calculated somehow
and then the 'xtra' weight is merely tacked on to the final value saved.

For example, here is a document with an id of 'foo' and two 'xtras':  

Document: 
id: foo
xtra: 
token: breed
weight: 2
xtra: 
token: dog
weight: 10

When this document gets indexed the tokens 'breed' and 'dog' will have some
base score calculated some how.  This base score could be 0 if the token isn't
even in the document.  Then the weight is added onto this base score and the
results saved to the index.  So assume 'breed' has a base score of 1.2 and
'dog' has a base score of 0.4 then the rows saved to the index are:

(breed, foo, 3.2)
(dog, foo, 10.4)

There are some 12,000 in-house created documents that I am searching and nearly
all of them have these associated 'xtras'.  I feel like this is a huge hint to
any search engine and that it should be taken advantage of.  The information is
already there and new documents are created every day with these little hints.
In more popular terminology 'xtras' are kind of like tags with weights.

So, I want to use Lucene as the basis for a new search engine and I want to
take this already out there information into account.  I have developed one
approach which works okay, no complaints or problems really, but I feel like
it's wrong some how.  My solution is as follows:

I noticed that 99.9% of the 'xtras' had weights less than 10.  So in my Lucene
index I create 11 fields:

xtra_1, xtra_2, xtra_3, ..., xtra_10, xtra_max

In field 'xtra_1' I stick all of the tokens (joined by spaces) which have a
weight of 1, in field 'xtra_2' I stick all the tokens that have a weight of 2,
and so on.  In 'xtra_max' I stick all the tokens with a weight of more than 10.

I give field xtra_1 a boost of 20, field xtra_2 a boost of 40, and so, with
field xtra_10 a boost of 200.  Field xtra_max gets a gigantic boost of 1.
I picked the scaling value of 20 for the first 10 fields out of thin air, same
with the boost for xtra_max.
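At index time the weighted-bucket scheme described above might look like the following sketch. The 20-per-level boosts follow the message; the xtra_max value in the message appears truncated, so the 10000 here is purely illustrative.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class XtraFields {
    /**
     * tokensByWeight[w-1] holds the space-joined tokens with weight w (1..10);
     * maxTokens holds everything with weight above 10.
     */
    static void addXtras(Document doc, String[] tokensByWeight, String maxTokens) {
        for (int w = 1; w <= 10; w++) {
            Field f = new Field("xtra_" + w, tokensByWeight[w - 1],
                                Field.Store.NO, Field.Index.TOKENIZED);
            f.setBoost(20.0f * w); // xtra_1 -> 20, xtra_2 -> 40, ..., xtra_10 -> 200
            doc.add(f);
        }
        Field max = new Field("xtra_max", maxTokens,
                              Field.Store.NO, Field.Index.TOKENIZED);
        max.setBoost(10000.0f); // stand-in for the "gigantic" xtra_max boost
        doc.add(max);
    }
}
```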

I'm a QueryParser fan, so that's what I've been using.  Our current search
language is very primitive so QueryParser is a huge bonus and probably good
enough for us.  However, now that I've created all these new fields I need to
search them all.  So, obviously, MultiFieldQueryParser is what I moved to.

When I search a document I have 13 fields that get passed to
MultiFieldQueryParser.  'body', 'title', and the 11 'xtra' fields above.  So
far this has worked well enough.  I can clearly see that the 'xtras' and their
weights influence the final rankings.  

In all honestly, I don't have any complaints quite yet.  However, I am left
with a feeling that the above is kind of "dirty" and that there is a better
way.  For example, had the values of the 'xtras' ranged more wildly I don't
think my approach would've scaled.  Also, it feels like this should be a 
common problem and perhaps I just lack the vocabulary to find the right 
approach.

So, is what I am doing problematic or is it an okay approach?  Am I going to
run into some kind of wall eventually?  Is there some library or API methods I
missed which do exactly what I want which I somehow blindly missed (if so,
sorry!)? 

Thanks for any input!

-matthew

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Scoring with FunctionQueries?

2006-03-07 Thread Sebastian Marius Kirsch
Dear Chris,

thanks very much for your quick answer.

I tried both approaches, and both don't seem to do what I
want. Perhaps I did not understand you properly.

I generated a small in-memory index (six documents) for testing your
suggestions, with some text in field "content" and a numeric score in
field "score". Following are the code I used and the explanations I
obtained.

On Tue, Mar 07, 2006 at 11:10:51AM -0800, Chris Hostetter wrote:
> 1) change the default similarity (using Similarity.setDefault(Similarity)
> used by all queries to a version with queryNorm returning an constant, and
> then in the few queries where you want the more traditional queryNorm,
> override the getSimilarity method inline...
> 
>Query q = new TermQuery(new Term("foo","bar")) {
>   public Similarity getSimilarity(Searcher s) {
> return new DefaultSimilarity();
>   }
>};

This is the code I used:

IndexSearcher searcher = new IndexSearcher(directory);
 
searcher.setSimilarity(new DefaultSimilarity() {
  public float queryNorm(float sumOfSquaredWeight) {
return 1.0f;
  }
});

TermQuery tq = new TermQuery(new Term("content", "desmond")) {
  public Similarity getSimilarity(Searcher s) {
return new DefaultSimilarity();
  }
};

FunctionQuery fq = new FunctionQuery(new FloatFieldSource("score"));

BooleanQuery bq = new BooleanQuery();
bq.add(fq, BooleanClause.Occur.SHOULD);
bq.add(tq, BooleanClause.Occur.MUST);
 
And this is the explanation I obtained:

2.526826 = sum of:
  0.6 = 
FunctionQuery(org.apache.solr.search.function.FloatFieldSource:float(score)), 
product of:
0.6 = float(score)=0.6
1.0 = boost
1.0 = queryNorm
  1.926826 = weight(content:desmond in 3), product of:
2.0986123 = queryWeight(content:desmond), product of:
  2.0986123 = idf(docFreq=1)
  1.0 = queryNorm
0.9181429 = fieldWeight(content:desmond in 3), product of:
  1.0 = tf(termFreq(content:desmond)=1)
  2.0986123 = idf(docFreq=1)
  0.4375 = fieldNorm(field=content, doc=3)

So, as you see, the query norm for the FunctionQuery is 1.0, but for
the TermQuery, this query norm is also used (when it should be
computed from the terms in the query.)

> 2) reverse step one ... override getSimiliarity() just in the classes you
> want to queryNorm to be constant and leave hte default alone.

OK, so this would look like the following:

IndexSearcher searcher = new IndexSearcher(directory);
 
TermQuery tq = new TermQuery(new Term("content", "desmond"));
FunctionQuery fq = new FunctionQuery(new FloatFieldSource("score")) {
  public Similarity getSimilarity(Searcher s) {
return new DefaultSimilarity() {
  public float queryNorm(float sumOfSquaredWeight) {
return 1.0f;
  }
};
  }
};

BooleanQuery bq = new BooleanQuery();
bq.add(fq, BooleanClause.Occur.SHOULD);
bq.add(tq, BooleanClause.Occur.MUST);

And what I get as an explanation is this:

1.0869528 = sum of:
  0.25809917 = 
FunctionQuery(org.apache.solr.search.function.FloatFieldSource:float(score)), 
product of:
0.6 = float(score)=0.6
1.0 = boost
0.43016526 = queryNorm
  0.82885367 = weight(content:desmond in 3), product of:
0.90275013 = queryWeight(content:desmond), product of:
  2.0986123 = idf(docFreq=1)
  0.43016526 = queryNorm
0.9181429 = fieldWeight(content:desmond in 3), product of:
  1.0 = tf(termFreq(content:desmond)=1)
  2.0986123 = idf(docFreq=1)
  0.4375 = fieldNorm(field=content, doc=3)

So, this is also wrong, but in a different way -- the queryNorm for
the FunctionQuery should be 1.0.

I hope I interpreted your explanations correctly, and this is what you
intended me to try.


So, what I *really* want is something like this (modulo normalization;
I might want to boost both clauses to 0.5. But I'm not worrying about
that right now.):

1.42885367 = sum of:
  0.6 = 
FunctionQuery(org.apache.solr.search.function.FloatFieldSource:float(score)), 
product of:
0.6 = float(score)=0.6
1.0 = boost
1.0 = queryNorm
  0.82885367 = weight(content:desmond in 3), product of:
0.90275013 = queryWeight(content:desmond), product of:
  2.0986123 = idf(docFreq=1)
  0.43016526 = queryNorm
0.9181429 = fieldWeight(content:desmond in 3), product of:
  1.0 = tf(termFreq(content:desmond)=1)
  2.0986123 = idf(docFreq=1)
  0.4375 = fieldNorm(field=content, doc=3)

> Hmmm ... that really doesn't sound right, are you sure you don't mean you
> changed the default similarity, or changed the similarity on the searcher?

Please see the code above. I have not delved into the depths of Lucene
yet, but it seems that Lucene uses only one similarity instance for
scoring all clauses in the boolean query, and doesn't honour the
similarity instances provided by the individual clauses.

Or I'm wrong somewhere ;)

I've also wondered whether perhaps I might

Re: Using NOT queries inside parentheses

2006-03-07 Thread Daniel Noll

Satuluri, Venu_Madhav wrote:

If you want this to work, the most elegant way I've found is to
override 
the getBooleanQuery(Vector) method in QueryParser and insert a 
MatchAllDocsQuery into the boolean query if every clause is

prohibited.

Daniel


I tried this, but it looks like the overridden method
getBooleanQuery(vector) does not get called. I am using 1.4.3.


We're using 1.4.2.

The minimalist example:

public class Test {
  public static void main() throws Exception {
QueryParser parser = new QueryParser("text",
 new StandardAnalyzer()) {
  protected Query getBooleanQuery(Vector clauses)
  throws ParseException {
System.out.println("getBooleanQuery called");
return super.getBooleanQuery(clauses);
  }
};
parser.parse("-foo");
  }
}

Output of this will be:
getBooleanQuery called

If it isn't being called in your application, the likelihood is that 
you're not using the overridden query parser at all.


Daniel

--
Daniel Noll

Nuix Australia Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Phone: (02) 9280 0699
Fax:   (02) 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene 1.9.1 and timeToString() apparent incompatibility with 1.4.3

2006-03-07 Thread Victor Negrin
I recently converted from Lucene 1.4.3 to 1.9.1 and in the process
replaced all deprecated classes with the new ones as recommended (for
forward compatibility with Lucene 2.0).
This however seems to introduce an incompatibility when the new
timeToString() and stringToTime() methods are used. Using an index created
with 1.4.3 and searched with 1.9.1  I now receive the following errors:

java.text.ParseException: Input is not valid date string: 0ehi17c0g
at org.apache.lucene.document.DateTools.stringToDate(DateTools.java:140)
at org.apache.lucene.document.DateTools.stringToTime(DateTools.java:110)
at etc.., etc..

My 1.9.1 code is

when indexing:
long modDate = conn.getLastModified(); // the file's last modified date
String longDate = DateTools.timeToString(modDate,
DateTools.Resolution.MINUTE);
indxDoc.add(new Field("longdate", longDate, Field.Store.YES,
Field.Index.TOKENIZED));

when searching:

Date d = new Date();
try {
  d.setTime(DateTools.stringToTime(longDate));
} catch (ParseException e) {
  e.printStackTrace();
}


My 1.4.3 code was:
when indexing:

String longDate = DateField.timeToString(modDate);
indxDoc.add(new Field("longdate", longDate, Field.Store.YES,
Field.Index.TOKENIZED));

when searching:

Date d = new Date();
d.setTime(org.apache.lucene.document.DateField.stringToTime(longDate));

The problem does not occur if I create and search the index with 1.9.1.

I assume there is a better way to do this than the above, as this
incompatibility is not documented.
I know I can always revert to the old code in order to avoid re-creating the
index, but I would prefer a solution that uses the latest classes
AND avoids re-creating the index, if possible.
Thanks for any help,

Victor


Re: sub search

2006-03-07 Thread Daniel Noll

Anton Potehin wrote:

Is it possible to make search among results of previous search?




After it I want to not make a new search, I want to make search among
found results...


Simple.  Create a new BooleanQuery and put the original query into it, 
along with the new query.
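
That approach might look something like the sketch below (not from the
original thread; it assumes the Lucene 1.9 BooleanClause.Occur API -- on
1.4.x the equivalent call is combined.add(query, true, false)):

```java
import java.io.IOException;

import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

public class SubSearch {
    /**
     * Narrows a previous search by AND-ing a refining query onto the
     * original one. Both clauses are required, so only documents matching
     * the original query AND the refinement are returned.
     */
    public static Hits searchWithin(Searcher searcher, Query original,
                                    Query refinement) throws IOException {
        BooleanQuery combined = new BooleanQuery();
        combined.add(original, BooleanClause.Occur.MUST);
        combined.add(refinement, BooleanClause.Occur.MUST);
        return searcher.search(combined);
    }
}
```

Re-running the combined query is usually cheap, since the original
clause restricts the candidate set just as a cached filter would.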


Daniel


--
Daniel Noll


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: BooleanQuery$TooManyClauses with 1.9.1 when Number RangeQuery

2006-03-07 Thread Youngho Cho
Hello,

> > 
> > : I upgade to 1.9.1 and reindexing
> > : I used NumberTool when I index the number.
> > :
> > : after upgrade I got following error when number range query.
> > : with query
> > 
> > The possibility of a TooManyClauses exception has always existed with
> > RangeQuery and numbers, even when using NumberTool.  Even if you never saw
> > it before, and you are still querying on the exact same range as before,
> > adding new docs with values in that range can trigger the exception.
> > 
> You mean theoretically
> RangeQuery should be forbidden because it always has a potential time bomb?
> Should we comment it in the javadoc?
> 
I found the comment in the BooleanQuery javadoc:
the default value is 1024.
But I still don't understand why this happened after switching to
NumberTools with 1.9.1.

Thanks.

Youngho



Re: BooleanQuery$TooManyClauses with 1.9.1 when Number RangeQuery

2006-03-07 Thread Chris Hostetter

: > You mean theoretically
: > RangeQuery should be forbidden because it always has a potential time bomb?
: > Should we comment it in the javadoc?

In my opinion, the only reason to use RangeQuery is if you are dealing
with very controlled ranges, where you know the number of terms it will
expand to is small.  If you are just parsing a user-supplied query, and
you have no control over what they give you, I would always use a
RangeFilter.
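
A sketch of the RangeFilter alternative (the "price" field and the bounds
are hypothetical, added here only for illustration; API as of Lucene 1.9):

```java
import java.io.IOException;

import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RangeFilter;
import org.apache.lucene.search.Searcher;

public class RangeSearch {
    /**
     * Applies the range as a Filter rather than a RangeQuery, so it is
     * never expanded into BooleanQuery clauses and cannot trigger
     * TooManyClauses, no matter how many unique terms fall in the range.
     */
    public static Hits search(Searcher searcher, Query userQuery)
            throws IOException {
        // Endpoints inclusive; terms are compared lexicographically, so
        // the values must be zero-padded the way NumberTools pads them.
        Filter priceRange = new RangeFilter("price", "000", "100", true, true);
        return searcher.search(userQuery, priceRange);
    }
}
```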

: I found the comment at the BooleanQuery javadoc.
: the default value is 1024.
: But I still don't understand why this happened after using NumberTools with 1.9.1

As I said, it may not have anything to do with 1.9.1 or NumberTools; it
may just be that when you built your new index, you had more
documents with more unique values in that range.

(Or maybe there has been some change in NumberTools I'm not aware of ... I
don't know, but I would advise against RangeQuery either way.)



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene 1.9.1 and timeToString() apparent incompatibility with 1.4.3

2006-03-07 Thread Chris Hostetter

: timeToString() and stringToTime() methods are used. Using an index created
: with 1.4.3 and searched with 1.9.1 I now receive the following errors:

As the deprecation comment in DateField says...

If you build a new index, use DateTools instead. For existing indices 
you
can continue using this class, as it will not be removed in the near
future despite being deprecated.

...DateTools is not backwards-compatible with DateField, which is why that
comment tries to make it clear that you shouldn't use DateField for new
indexes, but you can continue using it for old ones without fear.

: I assume there is a better way to do this than the above as this
: incompatibility is not documented.
: I know I can always revert to the old code in order to avoid re-creating the
: index, but I would prefer to find a solution that uses the latest classes
: AND avoids re-creating the index, if possible.
: thanks for any help,

If you don't want to rebuild your index, then just keep using the
DateField class and everything will be fine.
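
If an index ends up containing a mix of old DateField strings and new
DateTools strings, one possible workaround (my sketch, not an official
Lucene API) is to try the DateTools format first and fall back to the
deprecated DateField format:

```java
import java.text.ParseException;

import org.apache.lucene.document.DateField;
import org.apache.lucene.document.DateTools;

public class DateCompat {
    /**
     * Decodes a stored date string that may have been written either by
     * DateTools (1.9-style) or by the deprecated DateField (1.4-style).
     * Returns milliseconds since the epoch.
     */
    public static long decode(String stored) {
        try {
            return DateTools.stringToTime(stored);
        } catch (ParseException e) {
            // Not a valid DateTools string; assume it came from DateField.
            return DateField.stringToTime(stored);
        }
    }
}
```

This only helps at read time; for range queries the two encodings still
sort differently, so re-indexing remains the clean long-term fix.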


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Scoring with FunctionQueries?

2006-03-07 Thread Chris Hostetter

: I tried both approaches, and both don't seem to do what I
: want. Perhaps I did not understand you properly.

From what I can tell it looks like you understood me perfectly; I too am
baffled by the results you are getting.  I have a couple of thoughts:

1) Check the raw score you get from these docs using a HitCollector and
compare that with the value from explain ... the explain info is calculated
through a parallel code path which differs from the normal search/score
code path, and it's totally possible there are bugs (BooleanQuery for
example will happily deal with sub-queries that return scores of <= 0, but
its explain function will not ... I don't think that's the issue here, but
it may be similar).

2) Add some logging (or set some breakpoints) in your custom Similarity's
   queryNorm methods (and your getSimilarity methods) to see
   if/when/how-often the methods are being called.

3) Try eliminating some variables and see what happens ...
   a) create concrete subclasses instead of using anonymous instances with
      overridden methods.
   b) don't bother using FunctionQuery, just use two separate TermQueries
      with different getSimilarity() methods (FunctionQuery is fairly new
      ... there may be bugs in it; also this way, if you still have a
      problem, you have a use case that anyone with Lucene familiarity will
      understand even if they've never seen FunctionQuery)
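
Suggestion 1 might look something like this sketch (the query is a
placeholder; API as of Lucene 1.9):

```java
import java.io.IOException;

import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

public class RawScoreDump {
    /**
     * Prints the raw scores the scorer produces, bypassing the
     * normalization that Hits applies, so they can be compared against
     * the values reported by searcher.explain(query, doc).
     */
    public static void dump(Searcher searcher, Query query)
            throws IOException {
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                System.out.println("doc=" + doc + " raw score=" + score);
            }
        });
    }
}
```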

: I generated a small in-memory index (six documents) for testing your
: suggestions, with some text in field "content" and a numeric score in
: field "score". Following are the code I used and the explanations I
: obtained.

Once you've tried the suggestions above, can you send out a
self-contained JUnit test showing the problems?

: Please see the code above. I have not delved into the depths of Lucene
: yet, but it seems that Lucene uses only one similarity instance for
: scoring all clauses in the boolean query, and doesn't honour the
: similarity instances provided by the individual clauses.

I just double-checked, and I can't see any way that could be happening --
but you're seeing something weird, so *something* isn't working the way I
thought.  As I said, if you can post a self-contained unit test that
demonstrates the problem, then maybe someone can spot the glitch.




-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene 1.9.1 and timeToString() apparent incompatibility with 1.4.3

2006-03-07 Thread George Washington
Thanks Chris for making it clear; I had read the comment, but I had not
understood that it implied incompatibility. But will the code be preserved
in Lucene 2.0, in light of the comment contained in the Lucene 1.9.1
announcement?

QUOTE
Applications must compile against 1.9 without deprecation warnings
before they are compatible with 2.0.
UNQUOTE

Victor


From: Chris Hostetter <[EMAIL PROTECTED]>
Reply-To: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: Lucene 1.9.1 and timeToString() apparent incompatibility with 
1.4.3

Date: Tue, 7 Mar 2006 17:54:27 -0800 (PST)


: timeToString() and stringToTime() methods are used. Using an index created
: with 1.4.3 and searched with 1.9.1 I now receive the following errors:

As the deprecation comment in DateField says...

If you build a new index, use DateTools instead. For existing indices 
you
can continue using this class, as it will not be removed in the near
future despite being deprecated.

...DateTools is not backwards-compatible with DateField, which is why that
comment tries to make it clear that you shouldn't use DateField for new
indexes, but you can continue using it for old ones without fear.

: I assume there is a better way to do this than the above as this
: incompatibility is not documented.
: I know I can always revert to the old code in order to avoid re-creating the
: index, but I would prefer to find a solution that uses the latest classes
: AND avoids re-creating the index, if possible.
: thanks for any help,

if you don't want to rebuild your index, then just keep using the
DateField class and everything will be fine.


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



_
mycareer.com.au: http://www.mycareer.com.au/?s_cid=213596  Land the Job


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Distributed Lucene..

2006-03-07 Thread Andrew Schetinin
Hi,

Sure not. We created another IndexWriter class and modified its
addIndexes() function (if I remember the name correctly) so that it will
not call optimize() at the end - that's all.
Having unique segment names was necessary because the segment file name
is used inside the file itself and cannot be changed on the fly.

Best Regards,

Andrew

 


 

--
Andrew Schetinin
C++ System Architect
Phone: +972 8 643 6560, ext. 212
Email: mailto:[EMAIL PROTECTED]

www.entopia.com

Entopia Awards: 
"Visionary in Enterprise Search Magic Quadrant" Gartner Group
"Best Search Engine" SIIA Codie Award
"Trend Setting Product" KMWorld Magazine


-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 07, 2006 8:55 PM
To: java-user@lucene.apache.org
Subject: Re: Distributed Lucene..

Hi,
Just curious about this:

> We hacked :-) IndexWriter of Lucene to start all segment names with a 
> prefix unique for each small index part.
> Then, when adding it to the actual index, we simply copy the new 
> segment to the folder with the other segments, and add it in such a 
> way so that the optimize() function cannot be called.
> This way adding a new segment is very unintrusive for the searcher.
> Optimization is scheduled to happen at night.


You just copy your uniquely-named segments into the index directory and
manually modify the "segments" file to list all copied segments?

Thanks,
Otis




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]