RE: [Urgent] deleteDocuments fails after merging ...

2007-03-14 Thread DECAFFMEYER MATHIEU
>> Is that the same reader that is used in IndexSearcher?

I opened an IndexSearcher on the path (String) to the index.
Now I have tried opening it on the cloned IndexReader instead, using the
constructor that takes an IndexReader as a parameter, and everything
works now.

I just have two IndexSearchers open most of the time now, which is
deprecated, but I think that's my only choice!

Thank you!

__
   Matt
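For what it's worth, the working setup described above amounts to sharing one IndexReader between searching and deleting. A minimal sketch (Lucene 2.x API; the index path, field name, and term value are made up for illustration):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class SharedReaderDelete {
    public static void main(String[] args) throws Exception {
        // Open one IndexReader and build the searcher on top of it, so
        // searches and deletions see the same view of the index.
        IndexReader reader = IndexReader.open("/path/to/index");
        IndexSearcher searcher = new IndexSearcher(reader);

        Term urlTerm = new Term("url", "http://example.com/page");
        Hits hits = searcher.search(new TermQuery(urlTerm));
        if (hits.length() > 0) {
            // deleteDocuments(Term) deletes ALL documents with this term.
            int nb = reader.deleteDocuments(urlTerm);
            System.out.println(nb + " document(s) deleted");
        }

        searcher.close();
        reader.close();  // closing the reader commits the deletions
    }
}
```

Because the searcher wraps the same reader that performs the deletions, a second pass over the same term finds no hits and deletes nothing, which is the consistent behavior the thread is after.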



-Original Message-
From: Antony Bowesman [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 13, 2007 10:13 PM
To: java-user@lucene.apache.org
Subject: Re: [Urgent] deleteDocuments fails after merging ...


Erick Erickson wrote:
> The javadocs point out that this line
> 
> int nb = mIndexReaderClone.deleteDocuments(urlTerm)
> 
> removes *all* documents for a given term. So of course you'll fail
> to delete any documents the second time you call
> deleteDocuments with the same term.

Isn't the code snippet below doing a search before attempting the
deletion? So from the IndexReader's point of view (as used by the
IndexSearcher) the item exists. What is mIndexReaderClone? Is that the
same reader that is used in the IndexSearcher?

I'm not sure, but if you search with one IndexReader and delete the
document using another IndexReader and then repeat the process, I think
the search would still result in a hit, but the deletion would return 0.

> On 3/13/07, DECAFFMEYER MATHIEU <[EMAIL PROTECTED]> wrote:
>>
>> Before I delete a document, I search for it in the index to be sure
>> there is a hit (via a Term object).
>> When I find a hit, I delete the document (with the same Term object):
>>
>> Hits hits = search(query);
>> if (hits.length() > 0) {
>>     if (hits.length() > 1) {
>>         System.out.println("found in the index with duplicates");
>>     }
>>     System.out.println("found in the index");
>>     try {
>>         int nb = mIndexReaderClone.deleteDocuments(urlTerm);
>>         if (nb > 0)
>>             System.out.println("successfully deleted");
>>         else
>>             throw new IOException("0 doc deleted");
>>     } catch (IOException e) {
>>         e.printStackTrace();
>>         throw new Exception(
>>             Thread.currentThread().getName() + " --- Deleting

Antony


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Internet communications are not secure and therefore Fortis Banque Luxembourg 
S.A. does not accept legal responsibility for the contents of this message. The 
information contained in this e-mail is confidential and may be legally 
privileged. It is intended solely for the addressee. If you are not the 
intended recipient, any disclosure, copying, distribution or any action taken 
or omitted to be taken in reliance on it, is prohibited and may be unlawful. 
Nothing in the message is capable or intended to create any legally binding 
obligations on either party and it is not intended to provide legal advice.






Can we extract phrase from lucene index

2007-03-14 Thread Bhavin Pandya

Hello guys,

I am using Lucene 1.9 and I have a 3 GB index.
I know we can easily extract tokens from the index, but can we extract phrases?

Regards.
Bhavin pandya




ways to minimize index size?

2007-03-14 Thread jm

Hi,

I want to make my index as small as possible. I noticed
Field.setOmitNorms(true); I read on the list that the difference is 1 byte
per field per doc. Not huge, but hey... Is the only effect that the scores
come out different? I hardly care about the score, so that would be OK.

And can I add documents without norms to an index that already contains
documents with norms?

Any other way to minimize the size of an index? All of my fields but one
are Field.Store.NO, Field.Index.TOKENIZED and Field.TermVector.NO; one is
Field.Store.YES, Field.Index.UN_TOKENIZED and Field.TermVector.NO. I
tried compressing that one and the size is reduced by around 1% (it's a
small field), but I guess compression means worse performance, so I am
not sure about applying it.

thanks
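A minimal sketch of the field setup described above, including the setOmitNorms(true) option (Lucene 1.9/2.x API; the index path, field names, and values are made up):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SmallIndexExample {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/small-index",
                new StandardAnalyzer(), true);

        Document doc = new Document();
        // Tokenized body field: not stored, no term vectors, no norms.
        Field body = new Field("body", "some example text",
                Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.NO);
        body.setOmitNorms(true);  // saves 1 byte per field per document
        doc.add(body);

        // Stored, untokenized identifier field.
        doc.add(new Field("id", "doc-42",
                Field.Store.YES, Field.Index.UN_TOKENIZED,
                Field.TermVector.NO));

        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }
}
```

Omitting norms disables length normalization and index-time boosts for that field, so scores change; whether that matters depends on how much you rely on relevance ranking.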




RE: IndexReader.GetTermFreqVectors

2007-03-14 Thread Kainth, Sachin
Yes but what is a term vector? 

-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED] 
Sent: 13 March 2007 19:28
To: java-user@lucene.apache.org
Subject: Re: IndexReader.GetTermFreqVectors

It means it returns the term vectors for all the fields of that document
for which you enabled TermVector when creating the Document,
i.e. new Field(..., TermVector.YES). (See
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.TermVector.html
for the full array of options.)

-Grant
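A hedged sketch of enabling term vectors at indexing time and reading them back through IndexReader.getTermFreqVector (Lucene 1.9/2.x API; the path, field name, and text are made up):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermFreqVector;

public class TermVectorExample {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/tv-index",
                new StandardAnalyzer(), true);
        Document doc = new Document();
        // TermVector.YES makes this a "vectorized" field.
        doc.add(new Field("body", "hello world hello",
                Field.Store.NO, Field.Index.TOKENIZED,
                Field.TermVector.YES));
        writer.addDocument(doc);
        writer.close();

        IndexReader reader = IndexReader.open("/tmp/tv-index");
        // One vector per vectorized field of the given document.
        TermFreqVector tv = reader.getTermFreqVector(0, "body");
        String[] terms = tv.getTerms();
        int[] freqs = tv.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + ": " + freqs[i]);
        }
        reader.close();
    }
}
```

For the example text, the vector lists each distinct term with its in-document frequency ("hello" twice, "world" once), which is exactly the "list of the document's terms and their number of occurrences" definition quoted later in this thread.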

On Mar 13, 2007, at 1:24 PM, Kainth, Sachin wrote:

> Hi all,
>
> The documentation for the above method mentions something called a 
> vectorized field.  Does anyone know what a vectorized field is?
>
>
>
>
> This email and any attached files are confidential and copyright 
> protected. If you are not the addressee, any dissemination of this 
> communication is strictly prohibited. Unless otherwise expressly 
> agreed in writing, nothing stated in this communication shall be 
> legally binding.
>
> The ultimate parent company of the Atkins Group is WS Atkins plc.   
> Registered in England No. 1885586.  Registered Office Woodcote Grove, 
> Ashley Road, Epsom, Surrey KT18 5BW.
>
> Consider the environment. Please don't print this e-mail unless you 
> really need to.

--
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/






This message has been scanned for viruses by MailControl - (see
http://bluepages.wsatkins.co.uk/?6875772)




Re: IndexReader.GetTermFreqVectors

2007-03-14 Thread Ian Lea

From 
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.TermVector.html:


"A term vector is a list of the document's terms and their number of
occurrences in that document."


--
Ian.


On 3/14/07, Kainth, Sachin <[EMAIL PROTECTED]> wrote:

Yes but what is a term vector?








Re: Can we extract phrase from lucene index

2007-03-14 Thread Erick Erickson

Your problem statement lends itself to flippant answers like "just
use a PhraseQuery". So I clearly don't understand what you're trying
to accomplish. Are you trying to find all of the occurrences of a
particular phrase? All the phrases (however that's defined) for
all the documents? What problem are you trying to solve?


Best
Erick


On 3/14/07, Bhavin Pandya <[EMAIL PROTECTED]> wrote:


Hello guys,

I am using lucene 1.9 and i have 3GB of index.
I know we can extract tokens from index easily but can we extract phrase ?

Regards.
Bhavin pandya





how to get approximate total matching

2007-03-14 Thread senthil kumaran

Hi.
   I have several index directories (more than 6), each gigabytes in size,
and I search my query with a single IndexSearcher against each index, one
after another. That is, I create one IndexSearcher for index1 and search
over that; then I close it and create a new IndexSearcher for index2, and
so on. If I get 200 total results, I don't search the other index
directories; I print the 200 results and exit the search.
   I need to get the approximate total number of matching documents across
all the indexes without actually searching the other indexes.
   Please suggest the easiest way to achieve this.

P.S.: To avoid more memory usage and to reduce search time, I don't want
to run my query against all the indexes once I have 200 results.
MultiSearcher causes an OOM error, which is why I'm using a single
IndexSearcher.


Thanks in Advance
Senthil
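One hedged possibility, not suggested in the thread itself: for simple single-term queries, IndexReader.docFreq() gives a cheap per-index upper bound on the number of matches without running a search. It reads only a term-dictionary entry, but it still counts deleted documents, so the total is approximate. (Paths and the field/term are made up.)

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class ApproxCount {
    public static void main(String[] args) throws Exception {
        String[] indexDirs = { "/indexes/index1", "/indexes/index2" };
        Term term = new Term("body", "lucene");

        int approxTotal = 0;
        for (int i = 0; i < indexDirs.length; i++) {
            IndexReader reader = IndexReader.open(indexDirs[i]);
            // docFreq() is much cheaper than a search, but deleted
            // documents are still included in the count.
            approxTotal += reader.docFreq(term);
            reader.close();
        }
        System.out.println("approximately " + approxTotal + " matches");
    }
}
```

This only works for term-level estimates; for arbitrary boolean queries there is no comparably cheap exact shortcut, which is presumably why the thread asks for an "approximate" total.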


Re: ways to minimize index size?

2007-03-14 Thread Erick Erickson

Store as little as possible, index as little as possible.

How big is your index, and how much do you expect it to grow?
I ask this because it's probably not worth your time to try to
reduce the index size below some threshold... I found that
reducing my index from 8G to 4G (through not stemming) gave
me about a 10% performance improvement, so at some point
it's just not worth the effort. Also, if you posted the index size,
it would give folks a chance to say "there's not much you can
gain by reducing things more". As it is, I don't have a clue
whether your index is 100M or 100T. The former is in the
"don't waste your time" class, and the latter is...er...
different

I wouldn't bother compressing for 1%

Question for "the guys" so I can check an assumption:
Is there any difference between these two?
Field(Name, Value, Store, index)
Field(Name, Value, Store, index, Field.TermVector.NO)


Best
Erick

On 3/14/07, jm <[EMAIL PROTECTED]> wrote:


Hi,

I want to make my index as small as possible. I noticed about
field.setOmitNorms(true), I read in the list the diff is 1 byte per
field per doc, not huge but hey...is the only effect the score being
different? I hardly mind about the score so that would be ok.

And can I add to an index without norms when it has previous doc with
norms?

Any other way to minimize size of index? Most of my fields but one are
Field.Store.NO, Field.Index.TOKENIZED and Field.TermVector.NO, one is
Field.Store.YES, Field.Index.UN_TOKENIZED and Field.TermVector.NO. I
tried compressing that one and size is reduced around 1% (it's a small
field), but I guess compression means worse performance so I am not
sure about applying that.

thanks





Re: how to get approximate total matching

2007-03-14 Thread Erick Erickson

How much memory are you allocating for your JVM? You're paying a huge
search-time penalty by opening and closing your searchers sequentially,
so it would be good to avoid that. But, as you say, if you're getting
OOM errors, that's a problem.

What is the total size of all your indexes? That would help folks
give you better responses and perhaps suggest other ways of
solving your problem.

Erick

On 3/14/07, senthil kumaran <[EMAIL PROTECTED]> wrote:


Hi.
I have more index directories (>6) all in GB,and searching my query
with
single IndexSearcher  to all indexes one after another.i.e. I create one
IndexSearcher for index1 and search over that.Finally I close that and
create new IndexSearcher for index2 and so on. If i get 200 total results
then i don't go to search other index directories and i print 200 results
and exit from search.
I need to get approximate total matching documents all over the
indexes
without going to search in other indexes.
Please suggest me a easiest way to achieve this.

P.S: To avoid more memory usage and to reduce search timeI don't want to
search my query through all indexes if i got 200 results. MultiSearcher
create OOM error,  so that I'm using single IndexSearcher.


Thanks in Advance
Senthil



RE: ways to minimize index size?

2007-03-14 Thread Jeff

> I found that reducing my index from 8G to 4G (through not stemming)
> gave me about a 10% performance improvement.

How did you do this? I don't see this as an option.

Jeff


Re: Can we extract phrase from lucene index

2007-03-14 Thread Bhavin Pandya

Hi Erick,
What I am looking for is a dictionary for a spell checker.
I am trying to customise the Lucene spell checker for phrases,
so I'm thinking that if I can somehow fetch phrases from the index itself,
then I can train my spellchecker.

I tried with query logs, but they contain a lot of spelling mistakes...

Any suggestions?

Thanks.
Bhavin pandya

- Original Message - 
From: "Erick Erickson" <[EMAIL PROTECTED]>

To: ; "Bhavin Pandya" <[EMAIL PROTECTED]>
Sent: Wednesday, March 14, 2007 6:29 PM
Subject: Re: Can we extract phrase from lucene index









memory consumption on large indices

2007-03-14 Thread Dennis Berger

Is there anything I have to keep in mind when searching large indices?
I currently have an index with a size of 1.8 GB; I have indexed 1.5
million items from Amazon.

How much memory do I have to give to the JVM?
As a side note, I have optimized the index, so it's one segment file.

Do I need to have 1.8 GB of memory available for the JVM?

regards,
-Dennis




Re: memory consumption on large indices

2007-03-14 Thread Ian Lea

No, you don't need 1.8 GB of memory. Start with the default and raise it
if you need to, or jump straight in at about 512 MB.
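For reference, the JVM heap ceiling is set with the -Xmx flag; a sketch of the suggested 512 MB starting point (the jar and class names are made up for illustration):

```shell
# Run the search application with a 512 MB maximum heap; raise -Xmx
# only if the app later dies with java.lang.OutOfMemoryError.
java -Xmx512m -cp lucene-core-2.1.0.jar:myapp.jar com.example.SearchApp
```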


--
Ian.


On 3/14/07, Dennis Berger <[EMAIL PROTECTED]> wrote:

Do I have to keep something in mind to do searching on large indices?
I actually have an index with a size of 1.8gb. I have indexed 1.5
million items from Amazon.
How much memory do I have to give to the jvm?
As a sidenote I have to tell you that I optimized the index so it's one
segment file.
Do I need to have 1.8gb memory available for the jvm?

regards,
-Dennis








Re: ways to minimize index size?

2007-03-14 Thread jm

hi Erick,

Well, typically my application will start with some hundreds of
indexes... and then grow at a rate of several per day, for ever. At
some point, I know, I can do some merging etc. if needed.

Size depends on the customer; it could be up to 1 GB per index. That
is why I would like to minimize them. I am not worried about search
performance.

I don't understand how not stemming can reduce the size of an index... I
would have thought it's the other way around: doesn't stemming make the
words shorter? (I don't stem, so I never looked into it.)

thanks
On 3/14/07, Erick Erickson <[EMAIL PROTECTED]> wrote:







Re: memory consumption on large indices

2007-03-14 Thread Dennis Berger

Ian Lea wrote:

No, you don't need 1.8Gb of memory.  Start with default and raise if
you need to?

how do I know when I need it?

Or jump straight in at about 512Mb.


--
Ian.


On 3/14/07, Dennis Berger <[EMAIL PROTECTED]> wrote:

Do I have to keep something in mind to do searching on large indices?
I actually have an index with a size of 1.8gb. I have indexed 1.5
million items from Amazon.
How much memory do I have to give to the jvm?
As a sidenote I have to tell you that I optimized the index so it's one
segment file.
Do I need to have 1.8gb memory available for the jvm?

regards,
-Dennis









--
Dennis Berger
BSDSystems
Eduardstrasse 43b
20257 Hamburg

Phone: +49 (0)40 54 00 18 17
Mobile: +49 (0) 179 123 15 09
E-Mail: [EMAIL PROTECTED]





Re: memory consumption on large indices

2007-03-14 Thread Ian Lea

When your app gets a java.lang.OutOfMemoryError.


--
Ian.


On 3/14/07, Dennis Berger <[EMAIL PROTECTED]> wrote:

Ian Lea wrote:
> No, you don't need 1.8Gb of memory.  Start with default and raise if
> you need to?
how do I know when I need it?
> Or jump straight in at about 512Mb.
>
>
> --
> Ian.
>
>
> On 3/14/07, Dennis Berger <[EMAIL PROTECTED]> wrote:
>> Do I have to keep something in mind to do searching on large indices?
>> I actually have an index with a size of 1.8gb. I have indexed 1.5
>> million items from Amazon.
>> How much memory do I have to give to the jvm?
>> As a sidenote I have to tell you that I optimized the index so it's one
>> segment file.
>> Do I need to have 1.8gb memory available for the jvm?
>>
>> regards,
>> -Dennis
>>





Re: Can we extract phrase from lucene index

2007-03-14 Thread karl wettin


On 14 Mar 2007, at 14:51, Bhavin Pandya wrote:


what i am looking for is dictionary for spell checker.
I am trying to customised lucene spell checker for phrase.
so thinking if anyhow i am able to fetech phrases from the index  
itself then i can train my spellchecker.


I tried with query logs but it has lot of spell mistakes...


You can try this:

https://issues.apache.org/jira/browse/LUCENE-626

--
karl



Any suggestions..

Thanks.
Bhavin pandya








Re: memory consumption on large indices

2007-03-14 Thread Tim Patton
I'm searching a 20 GB index and my searching JVM is allocated 1 GB.
However, my indexing app only had 384 MB available to it, which means you
can get away with far less. I believe certain index tables will need to
be swapped in and out of memory, though, so it may not search as quickly.

With a 1.8 GB index you could try the JVM default (64 MB) and see how
it works.


Tim

Dennis Berger wrote:

Do I have to keep something in mind to do searching on large indices?
I actually have an index with a size of 1.8gb. I have indexed 1.5 
million items from Amazon.

How much memory do I have to give to the jvm?
As a sidenote I have to tell you that I optimized the index so it's one 
segment file.

Do I need to have 1.8gb memory available for the jvm?

regards,
-Dennis









Re: Wildcard searches with * or ? as the first character - Thanks

2007-03-14 Thread Oystein Reigem

Thanks Steven and Antony.

I read the FAQ not very long ago, but that slipped my attention. Or 
perhaps it's a recent change.


- Øystein -

--
Øystein Reigem, The Department of Culture, Language and Information Technology (Aksis), Allegt 27, N-5007 Bergen, Norway. Tel: +47 55 58 32 42. Fax: +47 55 58 94 70. E-mail: <[EMAIL PROTECTED]>. Home tel: +47 56 14 06 11. Mobile: +47 97 16 96 64. Home e-mail: <[EMAIL PROTECTED]>. Aksis home page: .





Re: ways to minimize index size?

2007-03-14 Thread Erick Erickson

OK, I caused more confusion than I rendered help with my stemming
statement. The only reason I mentioned it was to illustrate that
performance is not linearly related to size.

It took some effort to put stemming into the index (see
PorterStemmer etc.); it is NOT the default. So I took it out
to see what the effect would be.

Why not stemming made things smaller: because we also
have the requirement that phrases (i.e. words in double quotes)
do NOT match the stemmed version. Thus, if we index
"running watching", the following searches have the
indicated results:

run - hits
watch - hits
running - hits
"run watch" does NOT hit
"running watching" hits

So I indexed the following terms:

run
running$
watch
watching$

with the two forms of "run" indexed at the same position (0)
and the two forms of "watch" at the same position (1).

I agree that if we didn't have the exact-phrase-match requirement,
the stemmed version of the index should be smaller.

Sorry for the confusion
Erick
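The dual-token scheme above can be sketched as a custom TokenFilter (Lucene 2.x Token API). This is an assumption-laden illustration, not Erick's actual code: the "$" marker follows his example, the class name is made up, and the stem() method is a crude placeholder where a real Porter stemmer would go.

```java
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/** Emits the stemmed form of each token, plus the original form
 *  (suffixed with '$') at the same position. */
public class StemAndKeepOriginalFilter extends TokenFilter {
    private Token pendingOriginal;

    public StemAndKeepOriginalFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        if (pendingOriginal != null) {
            Token t = pendingOriginal;
            pendingOriginal = null;
            return t;
        }
        Token token = input.next();
        if (token == null) return null;

        String original = token.termText();
        String stemmed = stem(original);  // plug in a real stemmer here
        if (!stemmed.equals(original)) {
            // Queue the marked original at the same position (position
            // increment 0), so exact phrases can still be matched
            // against the unstemmed forms.
            pendingOriginal = new Token(original + "$",
                    token.startOffset(), token.endOffset());
            pendingOriginal.setPositionIncrement(0);
        }
        return new Token(stemmed, token.startOffset(), token.endOffset());
    }

    private String stem(String s) {
        // Placeholder stemmer for illustration only.
        return s.endsWith("ing") ? s.substring(0, s.length() - 3) : s;
    }
}
```

For "running watching" this would index run/running$ at position 0 and watch/watching$ at position 1, matching the layout Erick describes; the query side would need the same treatment so that quoted phrases target the "$"-marked forms.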

On 3/14/07, jm <[EMAIL PROTECTED]> wrote:


hi Erick,


Well, typically my application will start with some hundreds of
indexes...and then grow at a rate of several per day, for ever. At
some point I know I can do some merging etc if needed.

Size is dependant on the customer, could be up to a 1G per index. That
is way I would like to minimize them. I am not worried with search
performance.

I dont understand how not stemming can reduce the size of an index...I
would think it happens the other way, does not stemming makes the
words shorter? (I dont stemm, so I never looked into it)

thanks
On 3/14/07, Erick Erickson <[EMAIL PROTECTED]> wrote:





RE: [Urgent] deleteDocuments fails after merging ...

2007-03-14 Thread Chris Hostetter

: I just have two IndexSearchers opened now most of the time, which is
: deprecated,
: But I think that's my only choice !

2 searchers is fine ... it's "N", where N is not bounded, that you want to
avoid.

From what I understand of your requirements, you don't *really* need two
searchers open ... open a searcher to do whatever complex queries you need
to get the docIds to delete, then delete them all, then close/reopen the
searcher (and check that the deletes worked, if you don't trust it).

The only real reason you should need 2 searchers at a time is if
you are searching other queries in parallel threads at the same time ...
or if you are warming up one new searcher that's "on deck" while still
serving queries with an older searcher.



-Hoss
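A minimal sketch of that search, delete, close/reopen cycle (Lucene 2.x API; the path, field, and term are made up):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class DeleteAndReopen {
    public static void main(String[] args) throws Exception {
        String indexPath = "/path/to/index";
        IndexSearcher searcher = new IndexSearcher(indexPath);

        // 1. Run whatever query identifies the documents to delete.
        Query query = new TermQuery(new Term("url", "http://example.com/old"));
        Hits hits = searcher.search(query);

        // Collect ids first, then delete, so Hits never re-runs the
        // search mid-iteration after documents start disappearing.
        int[] ids = new int[hits.length()];
        for (int i = 0; i < ids.length; i++) ids[i] = hits.id(i);

        // 2. Delete by internal doc id via the searcher's own reader.
        IndexReader reader = searcher.getIndexReader();
        for (int i = 0; i < ids.length; i++) {
            reader.deleteDocument(ids[i]);
        }

        // 3. Close (committing the deletes) and reopen a fresh searcher.
        searcher.close();
        searcher = new IndexSearcher(indexPath);
        // ... serve further queries with the fresh searcher ...
        searcher.close();
    }
}
```

Only one searcher is ever open at a time here; the "on deck" pattern Hoss mentions would instead open the replacement searcher and warm it before closing the old one.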





Re: Performance between Filter and HitCollector?

2007-03-14 Thread Chris Hostetter

It's kind of an apples/oranges comparison ... in the examples you gave
below, one is executing an arbitrary query (which could be anything); the
other is doing a simple term enumeration.

Assuming that Query is a TermQuery, the Filter is theoretically going to
be faster because it doesn't have to compute any scores ... generally
speaking, a Filter will always be a little faster than a functionally
equivalent Query for the purposes of building up a simple BitSet of
matching documents, because the Query involves the score calculations ...
but the Query is generally more usable.

The Query can also be more efficient in other ways, because the
HitCollector doesn't *have* to build a BitSet; it can deal with the
results in whatever way it wants (whereas a Filter always generates a
BitSet).

Solr goes the HitCollector route for a few reasons:
  1) it allows us to use the DocSet abstraction, which allows other
 performance benefits over straight BitSets
  2) it allows us to have simpler code that builds DocSets and DocLists
 (DocLists know about scores, sorting, and pagination) in a single
 pass when scores or sorting are requested.



-Hoss
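A hedged sketch of the HitCollector side of this comparison (Lucene 2.x API; path, field, and term are made up): the collector is free to do anything with each hit, and here it happens to build a BitSet while ignoring scores, mimicking what a Filter would produce.

```java
import java.util.BitSet;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class BitSetCollectorExample {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        final BitSet bits = new BitSet(searcher.maxDoc());

        // The collector receives every matching doc id plus its score;
        // this one records the id and discards the score.
        searcher.search(new TermQuery(new Term("body", "lucene")),
                new HitCollector() {
                    public void collect(int doc, float score) {
                        bits.set(doc);
                    }
                });

        System.out.println(bits.cardinality() + " matching documents");
        searcher.close();
    }
}
```

Note that the scorer still computes the score passed to collect(), which is the overhead the Filter route avoids; the collector's flexibility is that it need not materialize a BitSet at all.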





Search vs. Rank

2007-03-14 Thread Walt Stoneburner

Most search engine technologies return result sets based on some weighted
frequency of the search terms found. I've got a new problem: I want to
rank by different criteria than I searched for.

For example, I might want to return as my result set all documents that
contain the word pizza, but rank them according to topping preferences (with
garlic at the top and goat cheese at the bottom).

Two questions.

1) Does Lucene allow one to mandatorily search for a term, but provide it
zero weight, while allowing other terms to have zero influence on the result
set, but affect their order?

I'm thinking something like  +pizza^0 garlic^1 "goat cheese"^-1

The concern is that I don't want any results that happen to mention garlic
or goat cheese except in the context of pizza.

2) Once I have this list of results, can I change their rank order without
having to do a full scale search again?

-wls


Re: Search vs. Rank

2007-03-14 Thread Chris Hostetter

: I'm thinking something like  +pizza^0 garlic^1 "goat cheese"^-1

that does in fact work.

: 2) Once I have this list of results, can I change their rank order without
: having to do a full scale search again?

The frequency of "pizza" won't affect the score at all, so you shouldn't
need to do much to change the order ... but you can implement your own
custom Sort to get any order you want.

An alternate approach is to use a Filter to define the superset of all
things you are interested in (i.e. "pizza"), and then execute a Query
against that Filter that matches all docs, scoring the ones you care
most about higher.


-Hoss
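A hedged sketch of that Filter-plus-ranking-query approach (Lucene 2.x API; field names and values are made up): the filter restricts the result set to pizza documents, while the query only influences their ordering.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

public class PizzaRanking {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");

        // The filter defines the superset: only pizza docs can match.
        Filter pizzaOnly = new QueryFilter(
                new TermQuery(new Term("body", "pizza")));

        // The query matches all docs in the filtered set, but the
        // boosted clause decides the ordering.
        TermQuery garlic = new TermQuery(new Term("body", "garlic"));
        garlic.setBoost(2.0f);
        BooleanQuery ranking = new BooleanQuery();
        ranking.add(new MatchAllDocsQuery(), BooleanClause.Occur.SHOULD);
        ranking.add(garlic, BooleanClause.Occur.SHOULD);

        Hits hits = searcher.search(ranking, pizzaOnly);
        System.out.println(hits.length() + " pizza documents, garlic first");
        searcher.close();
    }
}
```

The MatchAllDocsQuery clause guarantees every pizza document appears in the results even when it mentions no topping, while the boosted garlic clause pushes garlic pizzas toward the top.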





Re: Performance between Filter and HitCollector?

2007-03-14 Thread eks dev
Just to complete this fine answer: there is also the Matcher patch
(https://issues.apache.org/jira/browse/LUCENE-584), which could bring the
best of both worlds via e.g. a ConstantScoringQuery, or another
abstraction that enables disabling scoring (where appropriate).

- Original Message 
From: Chris Hostetter <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, 14 March, 2007 7:15:06 PM
Subject: Re: Performance between Filter and HitCollector?


it's kind of an Apples/Oranges comparison .. in the examples you gave
below, one is executing an arbitrary query (which could be anything), the
other is doing a simple TermEnumeration.

Assuming that Query is a TermQuery, the Filter is theoretically going to be
faster because it doesn't have to compute any Scores ... generally speaking
a Filter will always be a little faster than a functionally equivalent
Query for the purposes of building up a simple BitSet of matching
documents because the Query involves the score calculations ... but the
Query is generally more usable.

The Query can also be more efficient in other ways, because the
HitCollector doesn't *have* to build a BitSet; it can deal with the
results in whatever way it wants (whereas a Filter always generates a
BitSet).
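
For instance, a HitCollector that only counts matches never materializes a
BitSet at all (the query and field name here are illustrative, and an open
IndexSearcher 'searcher' is assumed):

```java
// Assumes an open IndexSearcher 'searcher'.
final int[] count = new int[1];
searcher.search(new TermQuery(new Term("body", "pizza")), new HitCollector() {
    public void collect(int doc, float score) {
        count[0]++;                 // no BitSet, no per-hit allocation
    }
});
```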

Solr goes the HitCollector route for a few reasons:
  1) allows us to use the DocSet abstraction which allows other
 performance benefits over straight BitSets
  2) allows us to have simpler code that builds DocSets and DocLists
 (DocLists know about scores, sorting, and pagination) in a single
 pass when scores or sorting are requested.



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







___ 
All New Yahoo! Mail – Tired of unwanted email come-ons? Let our SpamGuard 
protect you. http://uk.docs.yahoo.com/nowyoucan.html

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [Urgent] deleteDocuments fails after merging ...

2007-03-14 Thread Antony Bowesman

Chris Hostetter wrote:

the only real reason you should really need 2 searchers at a time is if
you are searching other queries in parallel threads at the same time ...
or if you are warming up one new searcher that's "ondeck" while still
serving queries with an older searcher.


Hoss, I hope I misunderstood this: are you saying that the same 
IndexSearcher/IndexReader pair cannot be used concurrently against a single 
index by different threads executing different queries?


The archives have several mentions of sharing an IndexSearcher among threads, and 
Otis covers it at http://www.jguru.com/faq/view.jsp?EID=492393.


Can you clarify what you meant, please?

Antony




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



SpellChecker and Lucene 2.1

2007-03-14 Thread Ryan O'Hara
Is there a SpellChecker.jar compatible with Lucene 2.1?  After  
updating to Lucene 2.1, I seem to have lost the ability to create a  
spell index using spellchecker-2.0-rc1-dev.jar.  Any help would be  
greatly appreciated.


Thanks,
Ryan

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [Urgent] deleteDocuments fails after merging ...

2007-03-14 Thread Chris Hostetter

: > the only real reason you should really need 2 searchers at a time is if
: > you are searching other queries in parallel threads at the same time ...
: > or if you are warming up one new searcher that's "ondeck" while still
: > serving queries with an older searcher.
:
: Hoss, I hope I misunderstood this: are you saying that the same
: IndexSearcher/IndexReader pair can not be used concurrently against a single
: index by different threads executing different queries?

No, I'm saying the only reasons you need two searchers are:

  1) if, separate from the searcher you are using to do deletes (you
seem to have a use case that involves reopening to check the deletes), you
also want a searcher open continuously which you use to serve search
clients.

  2) if, for performance reasons, when opening a new searcher to expose a
new version of the index, you want to open the new one, warm it up with
some queries, and only then direct new threads to the new searcher
and close the old searcher.
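
A sketch of that warm-then-swap pattern in (2); the path, warm-up query, and
locking details are all assumptions:

```java
// Assumes an IndexSearcher 'current' serving queries and a shared 'lock' object.
IndexSearcher fresh = new IndexSearcher("/path/to/index");       // hypothetical path
fresh.search(new TermQuery(new Term("body", "warmup")));         // prime caches with typical queries
IndexSearcher old;
synchronized (lock) {                                            // swap the reference atomically
    old = current;
    current = fresh;
}
// close 'old' only after in-flight queries against it have finished
```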




-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Fast index traversal and update for stored field?

2007-03-14 Thread Thomas K. Burkholder

Hi there,

I'm using lucene to index and store entries from a database table for  
ultimate retrieval as search results.  This works fine.  But I find  
myself in the position of wanting to occasionally (daily-ish) bulk- 
update a single, stored, non-indexed field in every document in the  
index, without changing any indexed value at all.


The obviously documented way to do this would be to remove and then  
re-add each updated document successively.  However, I know from  
experience that rebuilding our index from scratch in this fashion  
would take several hours at least, which is too long to delay pending  
incremental index jobs.  It seems to me that at some level it should  
be possible to iterate over all the document storage on disk and  
modify only the field I'm interested in (no index modification  
required remember as this is a field that is stored but not  
indexed).  It's plain from the documentation on file formats that it  
would be potentially possible to do this from a low level, however  
before I go possibly re-inventing that wheel, I'm wondering if anyone  
knows of any existing code out there that would aid in solving this  
problem.


Thanks in advance,

//Thomas
Thomas K. Burkholder
Code Janitor

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Fast index traversal and update for stored field?

2007-03-14 Thread Erick Erickson

If you search the mail archive for "update in place" (no quotes),
you'll find extensive discussions of this idea. Although you're
raising an interesting variant because you're talking about a non-
indexed field, so now I'm not sure those discussions are relevant.

I don't know of anyone who has done what you're asking though...

But if it's just stored data, you could go out to a database and
pick it up at search time, although there are sound reasons for
not requiring a database connection.

What about having a separate index for just this one field? And
make it an indexed value, along with some id (not the Lucene ID,
probably) of your original. Something like

index fields
ID  (unique ID for each document)
field (the corresponding value).

Searching this should be very fast, and if the usual Hits based
search wasn't fast enough, perhaps something with
termenum/termdocs would be faster.

Or you could just index the unique ID and store (but not index)
the field. Hits or variants should work for that too.

So the general algorithm would be:

search main index
for each hit:
  search second index and fetch that field
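
That loop might look like this sketch, where the "id" and "field" names are
hypothetical and 'reader2' is an IndexReader open on the secondary index:

```java
// Assumes an IndexSearcher 'mainSearcher' and an IndexReader 'reader2'
// on the secondary index; "id" and "field" are hypothetical field names.
Hits hits = mainSearcher.search(query);
for (int i = 0; i < hits.length(); i++) {
    String id = hits.doc(i).get("id");
    TermDocs td = reader2.termDocs(new Term("id", id));
    if (td.next()) {
        String value = reader2.document(td.doc()).get("field");
        // use 'value' alongside the main hit
    }
    td.close();
}
```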

I have no idea whether this has any traction for your problem
space, but I thought I'd mention it. This assumes that building
the mutable index would be acceptably fast...

Although conceptually, this is really just a Map of ID/value pairs.
I have no idea how much data you're talking about, but if it's not
a huge data set, might it be possible just to store it in
a simple map and look it up that way?

And if I'm all wet, I'm sure others will chime in...

Best
Erick
On 3/14/07, Thomas K. Burkholder <[EMAIL PROTECTED]> wrote:


Hi there,

I'm using lucene to index and store entries from a database table for
ultimate retrieval as search results.  This works fine.  But I find
myself in the position of wanting to occasionally (daily-ish) bulk-
update a single, stored, non-indexed field in every document in the
index, without changing any indexed value at all.

The obviously documented way to do this would be to remove and then
re-add each updated document successively.  However, I know from
experience that rebuilding our index from scratch in this fashion
would take several hours at least, which is too long to delay pending
incremental index jobs.  It seems to me that at some level it should
be possible to iterate over all the document storage on disk and
modify only the field I'm interested in (no index modification
required remember as this is a field that is stored but not
indexed).  It's plain from the documentation on file formats that it
would be potentially possible to do this from a low level, however
before I go possibly re-inventing that wheel, I'm wondering if anyone
knows of any existing code out there that would aid in solving this
problem.

Thanks in advance,

//Thomas
Thomas K. Burkholder
Code Janitor

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Fast index traversal and update for stored field?

2007-03-14 Thread Thomas K. Burkholder

Hey, thanks for the quick reply.

I've considered using a secondary index just for this data but  
thought I would look at storing the data in lucene first, since  
ultimately this data gets transported to an outside system, and it's  
a lot easier if there's only one "thing" to transfer.  The  
destination environment that receives this lucene index doesn't (and  
shouldn't) have access to the database, which is why we don't simply  
store it there.  Even if it did, we try not to access the database  
for search results when we don't have to, as this tends to make  
searching slow (as I think you were alluding to).


Sounds like there's nothing "out of the box" to solve my problem; if  
I write something to update lucene indexes in place I'll follow up  
about it in here (don't know that I will though; building a new,  
narrower index is probably more expedient and will probably be fast  
enough for my purposes in this case).


Thanks again,

//Thomas

On Mar 14, 2007, at 4:50 PM, Erick Erickson wrote:


If you search the mail archive for "update in place" (no quotes),
you'll find extensive discussions of this idea. Although you're
raising an interesting variant because you're talking about a non-
indexed field, so now I'm not sure those discussions are relevant.

I don't know of anyone who has done what you're asking though...

But if it's just stored data, you could go out to a database and
pick it up at search time, although there are sound reasons for
not requiring a database connection.

What about having a separate index for just this one field? And
make it an indexed value, along with some id (not the Lucene ID,
probably) of your original. Something like

index fields
ID  (unique ID for each document)
field (the corresponding value).

Searching this should be very fast, and if the usual Hits based
search wasn't fast enough, perhaps something with
termenum/termdocs would be faster.

Or you could just index the unique ID and store (but not index)
the field. Hits or variants should work for that too.

So the general algorithm would be:

search main index
for each hit:
  search second index and fetch that field

I have no idea whether this has any traction for your problem
space, but I thought I'd mention it. This assumes that building
the mutable index would be acceptably fast...

Although conceptually, this is really just a Map of ID/value pairs.
I have no idea how much data you're talking about, but if it's not
a huge data set, might it be possible just to store it in
a simple map and look it up that way?

And if I'm all wet, I'm sure others will chime in...

Best
Erick
On 3/14/07, Thomas K. Burkholder <[EMAIL PROTECTED]> wrote:


Hi there,

I'm using lucene to index and store entries from a database table for
ultimate retrieval as search results.  This works fine.  But I find
myself in the position of wanting to occasionally (daily-ish) bulk-
update a single, stored, non-indexed field in every document in the
index, without changing any indexed value at all.

The obviously documented way to do this would be to remove and then
re-add each updated document successively.  However, I know from
experience that rebuilding our index from scratch in this fashion
would take several hours at least, which is too long to delay pending
incremental index jobs.  It seems to me that at some level it should
be possible to iterate over all the document storage on disk and
modify only the field I'm interested in (no index modification
required remember as this is a field that is stored but not
indexed).  It's plain from the documentation on file formats that it
would be potentially possible to do this from a low level, however
before I go possibly re-inventing that wheel, I'm wondering if anyone
knows of any existing code out there that would aid in solving this
problem.

Thanks in advance,

//Thomas
Thomas K. Burkholder
Code Janitor

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: how to get approximate total matching

2007-03-14 Thread Xiaocheng Luan
If I remember correctly, I once searched over 40G of indexes using a 
multi-searcher with a 512M max heap size.  How much memory did you give the JVM?
Thanks,
Xiaocheng

senthil kumaran <[EMAIL PROTECTED]> wrote: Hi.
I have several index directories (>6), each gigabytes in size, and I search my
query with a single IndexSearcher against the indexes one after another; i.e., I
create one IndexSearcher for index1 and search over that, then close it and
create a new IndexSearcher for index2, and so on.  If I get 200 total results
then I don't go on to search the other index directories; I print the 200 results
and exit from the search.
I need to get the approximate total of matching documents over all the indexes
without searching the remaining indexes.
Please suggest the easiest way to achieve this.

P.S: To avoid more memory usage and to reduce search time, I don't want to
search my query through all indexes once I have 200 results.  MultiSearcher
causes an OOM error, so I'm using a single IndexSearcher.
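
The sequential search described above might be sketched as below (paths and
field names are hypothetical); as an assumption on my part, IndexReader's
docFreq() is one cheap per-term count that could stand in as a rough
estimate for the indexes that are never searched:

```java
// Sketch: search indexes one at a time, stop at 200 hits,
// and estimate the rest from docFreq without running the query.
String[] paths = {"/idx1", "/idx2", "/idx3"};        // hypothetical paths
Term term = new Term("body", "lucene");              // hypothetical term
int shown = 0, estimatedRemaining = 0;
for (int i = 0; i < paths.length; i++) {
    if (shown < 200) {
        IndexSearcher s = new IndexSearcher(paths[i]);
        Hits hits = s.search(new TermQuery(term));
        shown += hits.length();
        s.close();
    } else {
        IndexReader r = IndexReader.open(paths[i]);
        estimatedRemaining += r.docFreq(term);       // upper bound for a single term
        r.close();
    }
}
```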


Thanks in Advance
Senthil


 

Indexing HTML pages and phrases

2007-03-14 Thread Maryam
Hi, 

I am wondering if we can index a phrase (not a term) in
Lucene?  Also, I am not sure if it can index HTML
pages?  I need to have access to the text of some of the
tags; I am not sure if this can be done in Lucene.  I
would be so glad if you could help me in this case. 

Thanks 



 

Expecting? Get great news right away with email Auto-Check. 
Try the Yahoo! Mail Beta.
http://advision.webevents.yahoo.com/mailbeta/newmail_tools.html 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Is Lucene Java trunk still stable for production code?

2007-03-14 Thread Jean-Philippe Robichaud
Hello Dear Lucene Users!

 

Back in the old days (well, last year) the lucene/java/trunk subversion
path was always stable enough for everyone to use in production code.
Now, with the 2.0/2.1/2.2 branches, is it still the case?  

 

In December, I 'ported' my app to use the lucene 2.0 release.  Now, I
have another chance to upgrade the production code (this is not
happening every month!), so I would like to upgrade the lucene library
I'm using to take advantage of performance gains.  Should I just update
my svn image from lucene/java/trunk, or should I take
lucene/java/branches/lucene_2_1?

 

Thanks!

 

Jp



Re: Performance between Filter and HitCollector?

2007-03-14 Thread Otis Gospodnetic
eks dev and others - have you tried using the code from LUCENE-584?  Noticed 
any performance increase when you disabled scoring?  I'd like to look at that 
patch soon and commit it if everything is in place and makes sense, so I'm 
curious if you or anyone else already tried this patch...

Thanks,
Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: eks dev <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, March 14, 2007 3:59:25 PM
Subject: Re: Performance between Filter and HitCollector?

just to complete this fine answer,
there is also the Matcher patch (https://issues.apache.org/jira/browse/LUCENE-584)  
that could bring the best of both worlds via e.g. a ConstantScoringQuery or 
another abstraction that makes it possible to disable scoring (where appropriate)

- Original Message 
From: Chris Hostetter <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, 14 March, 2007 7:15:06 PM
Subject: Re: Performance between Filter and HitCollector?


it's kind of an Apples/Oranges comparison .. in the examples you gave
below, one is executing an arbitrary query (which could be anything), the
other is doing a simple TermEnumeration.

Assuming that Query is a TermQuery, the Filter is theoretically going to be
faster because it doesn't have to compute any Scores ... generally speaking
a Filter will always be a little faster than a functionally equivalent
Query for the purposes of building up a simple BitSet of matching
documents because the Query involves the score calculations ... but the
Query is generally more usable.

The Query can also be more efficient in other ways, because the
HitCollector doesn't *have* to build a BitSet; it can deal with the
results in whatever way it wants (whereas a Filter always generates a
BitSet).

Solr goes the HitCollector route for a few reasons:
  1) allows us to use the DocSet abstraction which allows other
 performance benefits over straight BitSets
  2) allows us to have simpler code that builds DocSets and DocLists
 (DocLists know about scores, sorting, and pagination) in a single
 pass when scores or sorting are requested.



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]








-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Performance between Filter and HitCollector?

2007-03-14 Thread Antony Bowesman
Thanks for the detailed response, Hoss.  That's the sort of in-depth golden nugget 
I'd like to see in a copy of LIA 2 when it becomes available...


I've wanted to use Filter to cache certain of my Term Queries, as it looked 
faster for straight Term Query searches, but Solr's DocSet interface abstraction 
is more useful.  HashDocSet will probably satisfy 90% of my cache.


Index DBs will typically be in the 1-3 million document range, for mail 
spread over 1-6K users, so caching lots of BitSets for that number of 
users is not practical!


I ended up creating a DocSetFilter, building DocSets (a la Solr) from the BitSet, 
and caching those; I then convert back during Filter.bits().  Not the 
best solution, but the typical hit size is small, so the iteration is fast.


Thanks eks dev for the info about Lucene-584 - that looks like an interesting 
set of patches.


Antony

Chris Hostetter wrote:

it's kind of an Apples/Oranges comparison .. in the examples you gave
below, one is executing an arbitrary query (which could be anything), the
other is doing a simple TermEnumeration.

Assuming that Query is a TermQuery, the Filter is theoretically going to be
faster because it doesn't have to compute any Scores ... generally speaking
a Filter will always be a little faster than a functionally equivalent
Query for the purposes of building up a simple BitSet of matching
documents because the Query involves the score calculations ... but the
Query is generally more usable.

The Query can also be more efficient in other ways, because the
HitCollector doesn't *have* to build a BitSet; it can deal with the
results in whatever way it wants (whereas a Filter always generates a
BitSet).

Solr goes the HitCollector route for a few reasons:
  1) allows us to use the DocSet abstraction which allows other
 performance benefits over straight BitSets
  2) allows us to have simpler code that builds DocSets and DocLists
 (DocLists know about scores, sorting, and pagination) in a single
 pass when scores or sorting are requested.




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: SpellChecker and Lucene 2.1

2007-03-14 Thread karl wettin


On 14 Mar 2007, at 21:47, Ryan O'Hara wrote:

Is there a SpellChecker.jar compatible with Lucene 2.1?  After  
updating to Lucene 2.1, I seem to have lost the ability to create a  
spell index using spellchecker-2.0-rc1-dev.jar.  Any help would be  
greatly appreciated.


Can you explain the problem more detailed? Exceptions? API changes?

--
karl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Performance between Filter and HitCollector?

2007-03-14 Thread karl wettin


On 15 Mar 2007, at 04:09, Otis Gospodnetic wrote:

eks dev and others - have you tried using the code from  
LUCENE-584?  Noticed any performance increase when you disabled  
scoring?  I'd like to look at that patch soon and commit it if  
everything is in place and makes sense, so I'm curious if you or  
anyone else already tried this patch...


I was trying out Matcher some months ago when fooling around with ways  
of improving speed in the "active search cache" of LUCENE-550.  It  
worked just fine for me.  I made no further investigations, nor do I  
have any performance details.  I plan to implement it in there for  
real any year now.


So +1 for commit.


--
karl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing HTML pages and phrases

2007-03-14 Thread Bhavin Pandya


- Original Message - 
From: "Maryam" <[EMAIL PROTECTED]>

To: 
Sent: Thursday, March 15, 2007 7:55 AM
Subject: Indexing HTML pages and phrases



Hi,

I am wondering if we can index a phrase (not a term) in
Lucene?  Also, I am not sure if it can index HTML
pages?  I need to have access to the text of some of the
tags; I am not sure if this can be done in Lucene.  I
would be so glad if you could help me in this case.

Thanks






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing HTML pages and phrases

2007-03-14 Thread Bhavin Pandya

Hi Maryam,

You can index the content of a specific field as UN_TOKENIZED and then you can 
do a phrase search on that field.
It will match only whole phrases, not individual tokens.
To index HTML pages you can use any HTML parser;
this may be useful to you:
http://lucene.apache.org/java/docs/api/org/apache/lucene/demo/html/HTMLParser.html

Thanks.
Bhavin pandya
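
For phrases inside a normally tokenized field, a PhraseQuery sketch along
these lines could also work (the "body" field name and an open IndexSearcher
'searcher' are assumptions):

```java
// Assumes an open IndexSearcher 'searcher' and an analyzed "body" field.
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("body", "goat"));
phrase.add(new Term("body", "cheese"));   // terms must appear adjacent, in order
Hits hits = searcher.search(phrase);
```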


- Original Message - 
From: "Maryam" <[EMAIL PROTECTED]>

To: 
Sent: Thursday, March 15, 2007 7:55 AM
Subject: Indexing HTML pages and phrases



Hi,

I am wondering if we can index a phrase (not a term) in
Lucene?  Also, I am not sure if it can index HTML
pages?  I need to have access to the text of some of the
tags; I am not sure if this can be done in Lucene.  I
would be so glad if you could help me in this case.

Thanks






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]