Re: frequent terms - Re: combining open office spellchecker with Lucene
Doug Cutting wrote:

David Spencer wrote:

Doug Cutting wrote: And one should not try correction at all for terms which occur in a large proportion of the collection.

I keep thinking over this one and I don't understand it. If a user misspells a word and the "did you mean" spelling correction algorithm determines that a frequent term is a good suggestion, why not suggest it? The very fact that it's common could mean that it's more likely that the user wanted this word (well, the heuristic here is that users frequently search for frequent terms, which is probably wrong, but anyway..).

I think you misunderstood me. What I meant to say was that if the term the user enters is very common then spell correction may be skipped. Very common words which are similar to the term the user entered should of course be shown. But if the user's term is very common one need not even attempt to find similarly-spelled words. Is that any better?

Doug

Yes, sure, thx, I understand now - but maybe not - the context I had in mind was something like this:

[1] The user enters a query like: recursize descent parser

[2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("descent" and "parser") and suggests alternatives to "recursize"... thus if any term is in the index, regardless of frequency, it is left as-is.

I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly (as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternatives to these words too (in addition to the words in the query that are not in the index at all).
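To make the per-term decision being discussed concrete, here is a minimal sketch (not from the thread): terms missing from the index always get suggestions, terms that are present but rare may themselves be collection misspellings and also get suggestions, and common terms are left alone. The class, field name and 0.1% threshold are illustrative assumptions only.

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Illustrative only: decide per query term whether "did you mean" suggestions
// should even be attempted, based on how often the term occurs in the index.
public class QueryTermChecker {

    private final IndexReader reader;
    private final String field;

    public QueryTermChecker(IndexReader reader, String field) {
        this.reader = reader;
        this.field = field;
    }

    public boolean shouldSuggestFor(String word) throws IOException {
        int df = reader.docFreq(new Term(field, word));
        if (df == 0) {
            return true;                                      // not in the index at all: definitely suggest
        }
        // present but rare: may itself be one of the collection's misspellings
        return ((double) df / reader.numDocs()) < 0.001;      // "rare" threshold is arbitrary
    }
}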
Re: frequent terms - Re: combining open office spellchecker with Lucene
David Spencer wrote: Doug Cutting wrote: And one should not try correction at all for terms which occur in a large proportion of the collection. I keep thinking over this one and I don't understand it. If a user misspells a word and the "did you mean" spelling correction algorithm determines that a frequent term is a good suggestion, why not suggest it? The very fact that it's common could mean that it's more likely that the user wanted this word (well, the heuristic here is that users frequently search for frequent terms, which is probably wrong, but anyway..).

I think you misunderstood me. What I meant to say was that if the term the user enters is very common then spell correction may be skipped. Very common words which are similar to the term the user entered should of course be shown. But if the user's term is very common one need not even attempt to find similarly-spelled words. Is that any better?

Doug
frequent terms - Re: combining open office spellchecker with Lucene
Doug Cutting wrote:

Aad Nales wrote: Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to be to create an Analyzer based on the spellchecker of OpenOffice. My question is: "has anybody tried this before?"

Note that a spell checker used with a search engine should use collection frequency information. That's to say, only "corrections" which are more frequent in the collection than what the user entered should be displayed. Frequency information can also be used when constructing the checker. For example, one need never consider proposing terms that occur in very few documents. And one should not try correction at all for terms which occur in a large proportion of the collection.

Doug

I keep thinking over this one and I don't understand it. If a user misspells a word and the "did you mean" spelling correction algorithm determines that a frequent term is a good suggestion, why not suggest it? The very fact that it's common could mean that it's more likely that the user wanted this word (well, the heuristic here is that users frequently search for frequent terms, which is probably wrong, but anyway..). I know in other contexts of IR frequent terms are penalized but in this context it seems that frequent terms should be fine...

-- Dave
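For readers who want to see the three frequency heuristics above in code, here is a rough sketch (mine, not from the list). It assumes some candidate generator (n-grams, edit distance, whatever) has already produced raw suggestions, and only filters them against collection frequency; the thresholds and the field name are arbitrary placeholders.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Illustration of the frequency heuristics described above, not a real spellchecker.
public class FrequencyFilter {

    private final IndexReader reader;
    private final String field;

    public FrequencyFilter(IndexReader reader, String field) {
        this.reader = reader;
        this.field = field;
    }

    public List filter(String userWord, List candidates) throws IOException {
        int userFreq = reader.docFreq(new Term(field, userWord));
        int numDocs = reader.numDocs();

        // Heuristic 3: the user's term is already very common -- don't correct at all.
        if (numDocs > 0 && (double) userFreq / numDocs > 0.10) {   // threshold is arbitrary
            return new ArrayList();
        }

        List kept = new ArrayList();
        for (Iterator it = candidates.iterator(); it.hasNext();) {
            String cand = (String) it.next();
            int candFreq = reader.docFreq(new Term(field, cand));
            // Heuristic 1: only show corrections more frequent than what the user typed.
            // Heuristic 2: never propose terms that occur in very few documents.
            if (candFreq > userFreq && candFreq >= 5) {            // "5" is arbitrary
                kept.add(cand);
            }
        }
        return kept;
    }
}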
IRC?!
There isn't a Lucene IRC room is there (at least there isn't according to Google)? I just joined #lucene on irc.freenode.net if anyone is interested... Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: question on Hits.doc
Hello Roy,

This sounds normal. When you pull a Document from Hits, you are really pulling it from the disk. All fields are read from disk at that time (i.e. no lazy loading of fields), so if you have large text fields, this is going to result in a lot of disk IO. You could try running vmstat or sar (I'm assuming you are using a UNIX flavour) and look at the bi/bo columns (really just bi -- blocks in, i.e. data read from disk). There is not much you can do. If you don't have to store the field, not storing it will probably help. Some people are working on adding support for field compression, so maybe that will help.

Otis

--- [EMAIL PROTECTED] wrote:
> Hey guys,
>
> We were noticing some speed problems on our searches and after adding some
> debug statements to the lucene source code, we have determined that the
> Hits.doc(x) is the problem. (BTW, we are using Lucene 1.2 [with plans to
> upgrade]). It seems that retrieving the actual Document from the search is
> very slow.
>
> We think it might be our "Message" field which stores a huge amount of text.
> We are currently running a test in which we won't "store" the "Message" field,
> however, I was wondering if any of you guys would know if that would be the
> reason why we're having the performance problems? If so, could anyone also
> please explain it? It seemed that we weren't having these performance
> problems before. Has anyone else experienced this? Our environment is NT 4,
> JDK 1.4.2, and PIIIs.
>
> I know that for large text fields, storing the field is not a good practice,
> however, it held certain conveniences for us that I hope to not get rid of.
>
> Roy.
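Since the question keeps coming up: here is a small sketch of what "not storing the Message field" looks like with the Lucene 1.x field API. The field names, index path and id value are invented for the example.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Sketch: index the large "Message" body as indexed-but-not-stored (Field.UnStored)
// so that Hits.doc() no longer has to drag the full text back off disk.
public class MessageIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/mail-index", new StandardAnalyzer(), true);

        Document doc = new Document();
        doc.add(Field.Keyword("id", "12345"));                   // stored, used to fetch the original later
        doc.add(Field.Text("subject", "question on Hits.doc"));  // stored and indexed
        doc.add(Field.UnStored("Message", "...huge body..."));   // indexed only, never stored

        writer.addDocument(doc);
        writer.close();
    }
}

The trade-off is that the body can no longer be displayed straight from the index; it has to be fetched from the original store (database, file system, ...) via the stored id.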
question on Hits.doc
Hey guys, We were noticing some speed problems on our searches and after adding some debug statements to the lucene source code, we have determined that the Hits.doc(x) is the problem. (BTW, we are using Lucene 1.2 [with plans to upgrade]). It seems that retrieving the actual Document from the search is very slow. We think it might be our "Message" field which stores a huge amount of text. We are currently running a test in which we won't "store" the "Message" field, however, I was wondering if any of you guys would know if that would be the reason why we're having the performance problems? If so, could anyone also please explain it? It seemed that we weren't having these performance problems before. Has anyone else experienced this? Our environment is NT 4, JDK 1.4.2, and PIIIs. I know that for large text fields, storing the field is not a good practice, however, it held certain conveniences for us that I hope to not get rid of. Roy. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Daniel Taurat wrote: Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. But then: we also have experienced that the gc of ibm jdk1.3.1 that we use is sometimes behaving strangely with too large heap space anyway. (Limit seems to be 1.2 Gb) Depends on what OS and with what patches... Linux on i386 seems to have a physical limit of 1.7G (256M for VM) ... There are some patches to apply to get 3G but only on really modern kernels. I just need to get Athlon systems :-/ Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
TermInfo using 300M for large index?
I'm trying to do some heap debugging of my application to find a memory leak. Noticed that org.apache.lucene.index.TermInfo had 1.7M instances which consumed 300M ... this is of course for a 40G index. Is this normal and is there any way I can streamline this? We are of course caching the IndexSearchers but I want to reduce the memory footprint... Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: MultiFieldQueryParser seems broken... Fix attached.
Daniel Naber wrote: On Thursday 09 September 2004 18:52, Doug Cutting wrote: I have not been able to construct a two-word query that returns a page without both words in either the content, the title, the url or in a single anchor. Can you? Like this one? konvens leitseite Leitseite is only in the title of the first match (www.gldv.org), konvens is only in the body. Good job finding that! I guess I should fix Nutch's BasicQueryFilter. Thanks, Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
It sounds like the ThreadLocal in TermInfosReader is not getting correctly garbage collected when the TermInfosReader is collected. Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is that you're running in an older JVM. Is that right? I've attached a patch which should fix this. Please tell me if it works for you. Doug Daniel Taurat wrote: Okay, that (1.4rc3)worked fine, too! Got only 257 SegmentTermEnums for 1900 objects. Now I will go for the final test on the production server with the 1.4rc3 version and about 40.000 objects. Daniel Daniel Taurat schrieb: Hi all, here is some update for you: I switched back to Lucene 1.3-final and now the number of the SegmentTermEnum objects is controlled by gc again: it goes up to about 1000 and then it is down again to 254 after indexing my 1900 test-objects. Stay tuned, I will try 1.4RC3 now, the last version before FieldCache was introduced... Daniel Rupinder Singh Mazara schrieb: hi all I had a similar problem, i have database of documents with 24 fields, and a average content of 7K, with 16M+ records i had to split the jobs into slabs of 1M each and merging the resulting indexes, submissions to our job queue looked like java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22 and i still had outofmemory exception , the solution that i created was to after every 200K, documents create a temp directory, and merge them together, this was done to do the first production run, updates are now being handled incrementally Exception in thread "main" java.lang.OutOfMemoryError at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code)) at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366) at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code)) at lucene.Indexer.main(CDBIndexer.java:168) -Original Message- From: Daniel Taurat [mailto:[EMAIL PROTECTED] Sent: 10 September 2004 14:42 To: Lucene Users List Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. But then: we also have experienced that the gc of ibm jdk1.3.1 that we use is sometimes behaving strangely with too large heap space anyway. (Limit seems to be 1.2 Gb) I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect. regards, Daniel Pete Lewis wrote: Hi all Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. 
Can you check whether or not your garbage collection is being triggered? Anomalously therefore if this is the case, by reducing the heap space you can improve performance get rid of the out of memory errors. Cheers Pete Lewis - Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Daniel Aber schrieb: On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out of memory problem using Lucene 1.4.1. Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing. Regards Daniel Well, it seems not to be files, it looks more like those SegmentTermEnum objects accumulating in memory. #I've seen some discussion on these objects in the developer-newsgroup that had taken place some time ago. I am afraid this is some kind of runaway caching I have to deal with. Maybe not correctly addressed in this newsgroup, after all... Anyway: any idea if there is an API command to re-init caches? Thanks,
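Doug's actual patch is attached to the original mail and is not reproduced here. Purely to illustrate the kind of leak being described, here is a stand-alone sketch (not Lucene's real TermInfosReader, and not the patch) of a per-thread object cached in a ThreadLocal, together with a close() that explicitly drops the reference so that older (pre-1.4.2) JVMs do not keep it reachable.

// Illustration only: a per-thread cache held in a ThreadLocal can pin large
// objects in memory on older JVMs unless the reference is cleared explicitly
// when the owning object is closed.
public class PerThreadEnumHolder {

    private final ThreadLocal enumerators = new ThreadLocal();

    Object getEnum() {
        Object termEnum = enumerators.get();
        if (termEnum == null) {
            termEnum = createEnum();          // expensive per-thread state
            enumerators.set(termEnum);
        }
        return termEnum;
    }

    private Object createEnum() {
        return new byte[1024 * 1024];         // stand-in for something like a SegmentTermEnum
    }

    public void close() {
        enumerators.set(null);                // drop the per-thread reference explicitly
    }
}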
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Okay, that (1.4rc3)worked fine, too! Got only 257 SegmentTermEnums for 1900 objects. Now I will go for the final test on the production server with the 1.4rc3 version and about 40.000 objects. Daniel Daniel Taurat schrieb: Hi all, here is some update for you: I switched back to Lucene 1.3-final and now the number of the SegmentTermEnum objects is controlled by gc again: it goes up to about 1000 and then it is down again to 254 after indexing my 1900 test-objects. Stay tuned, I will try 1.4RC3 now, the last version before FieldCache was introduced... Daniel Rupinder Singh Mazara schrieb: hi all I had a similar problem, i have database of documents with 24 fields, and a average content of 7K, with 16M+ records i had to split the jobs into slabs of 1M each and merging the resulting indexes, submissions to our job queue looked like java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22 and i still had outofmemory exception , the solution that i created was to after every 200K, documents create a temp directory, and merge them together, this was done to do the first production run, updates are now being handled incrementally Exception in thread "main" java.lang.OutOfMemoryError at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code)) at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366) at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code)) at lucene.Indexer.main(CDBIndexer.java:168) -Original Message- From: Daniel Taurat [mailto:[EMAIL PROTECTED] Sent: 10 September 2004 14:42 To: Lucene Users List Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. But then: we also have experienced that the gc of ibm jdk1.3.1 that we use is sometimes behaving strangely with too large heap space anyway. (Limit seems to be 1.2 Gb) I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect. regards, Daniel Pete Lewis wrote: Hi all Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously therefore if this is the case, by reducing the heap space you can improve performance get rid of the out of memory errors. 
Cheers Pete Lewis - Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Daniel Aber schrieb: On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out of memory problem using Lucene 1.4.1. Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing. Regards Daniel Well, it seems not to be files, it looks more like those SegmentTermEnum objects accumulating in memory. #I've seen some discussion on these objects in the developer-newsgroup that had taken place some time ago. I am afraid this is some kind of runaway caching I have to deal with. Maybe not correctly addressed in this newsgroup, after all... Anyway: any idea if there is an API command to re-init caches? Thanks, Daniel - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] ---
Re: combining open office spellchecker with Lucene
eks dev wrote:

> Hi Doug,
>
> > Perhaps. Are folks really better at spelling the beginning of words?
>
> Yes they are. There were some comprehensive empirical studies on this topic. The Winkler modification of the Jaro string distance is based on this assumption (boosting similarity if the first n, I think 4, chars match). Jaro-Winkler is well documented and some folks think that it is much more efficient and precise than plain edit distance (of course for normal language, not numbers or so). I will try to dig out some references from my disk on

Good ole Citeseer finds 2 docs that seem relevant: http://citeseer.ist.psu.edu/cs?cs=1&q=Winkler+Jaro&submit=Documents&co=Citations&cm=50&cf=Any&ao=Citations&am=20&af=Any

I have some of the ngram spelling suggestion stuff, based on earlier msgs in this thread, working in my dev tree. I'll try to get a test site up later today for people to fool around with.

> this topic, if you are interested.
>
> On another note, I would even suggest using Jaro-Winkler distance as the default for fuzzy query (one could configure a max prefix required => prefix query to reduce the number of distance calculations). This could speed up fuzzy search dramatically.
>
> Hope this was helpful,
> Eks
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi all, here is some update for you: I switched back to Lucene 1.3-final and now the number of the SegmentTermEnum objects is controlled by gc again: it goes up to about 1000 and then it is down again to 254 after indexing my 1900 test-objects. Stay tuned, I will try 1.4RC3 now, the last version before FieldCache was introduced... Daniel Rupinder Singh Mazara schrieb: hi all I had a similar problem, i have database of documents with 24 fields, and a average content of 7K, with 16M+ records i had to split the jobs into slabs of 1M each and merging the resulting indexes, submissions to our job queue looked like java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22 and i still had outofmemory exception , the solution that i created was to after every 200K, documents create a temp directory, and merge them together, this was done to do the first production run, updates are now being handled incrementally Exception in thread "main" java.lang.OutOfMemoryError at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code)) at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366) at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code)) at lucene.Indexer.main(CDBIndexer.java:168) -Original Message- From: Daniel Taurat [mailto:[EMAIL PROTECTED] Sent: 10 September 2004 14:42 To: Lucene Users List Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. But then: we also have experienced that the gc of ibm jdk1.3.1 that we use is sometimes behaving strangely with too large heap space anyway. (Limit seems to be 1.2 Gb) I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect. regards, Daniel Pete Lewis wrote: Hi all Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously therefore if this is the case, by reducing the heap space you can improve performance get rid of the out of memory errors. 
Cheers Pete Lewis - Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Daniel Aber schrieb: On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out of memory problem using Lucene 1.4.1. Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing. Regards Daniel Well, it seems not to be files, it looks more like those SegmentTermEnum objects accumulating in memory. #I've seen some discussion on these objects in the developer-newsgroup that had taken place some time ago. I am afraid this is some kind of runaway caching I have to deal with. Maybe not correctly addressed in this newsgroup, after all... Anyway: any idea if there is an API command to re-init caches? Thanks, Daniel - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] ---
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
The parser is pdfBox. PDF is about 25% of the overall indexing volume on the productive system. I also have word-docs and loads of html resources to be indexed. In my testing environment I merely have 5 pdf docs and still those permanent objects hanging around, though.

Cheers, Daniel

Ben Litchfield wrote:

> I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect.

What PDF parser are you using? Is the problem within the parser and not lucene? Are you releasing all resources?

Ben
RE: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi all,

I had a similar problem: I have a database of documents with 24 fields and an average content of 7K, with 16M+ records. I had to split the job into slabs of 1M documents each and merge the resulting indexes. Submissions to our job queue looked like

java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22

and I still had an OutOfMemory exception. The solution I came up with was to create a temp directory after every 200K documents and merge the partial indexes together; this was done for the first production run, and updates are now being handled incrementally.

Exception in thread "main" java.lang.OutOfMemoryError
 at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code))
 at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code))
 at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code))
 at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code))
 at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code))
 at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code))
 at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code))
 at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code))
 at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code))
 at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
 at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
 at lucene.Indexer.main(CDBIndexer.java:168)

>-Original Message- >From: Daniel Taurat [mailto:[EMAIL PROTECTED] >Sent: 10 September 2004 14:42 >To: Lucene Users List >Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number >of documents > > >Hi Pete, >good hint, but we actually do have physical memory of 4Gb on the >system. But then: we also have experienced that the gc of ibm jdk1.3.1 >that we use is sometimes >behaving strangely with too large heap space anyway. (Limit seems to be >1.2 Gb) >I can say that gc is not collecting these objects since I forced gc >runs when indexing every now and then (when parsing pdf-type objects, >that is): No effect. > >regards, > >Daniel > > >Pete Lewis wrote: > >>Hi all >> >>Reading the thread with interest, there is another way I've come >across out >>of memory errors when indexing large batches of documents. >> >>If you have your heap space settings too high, then you get >swapping (which >>impacts performance) plus you never reach the trigger for garbage >>collection, hence you don't garbage collect and hence you run out >of memory. >> >>Can you check whether or not your garbage collection is being triggered? >> >>Anomalously therefore if this is the case, by reducing the heap space you >>can improve performance get rid of the out of memory errors. >> >>Cheers >>Pete Lewis >> >>- Original Message - >>From: "Daniel Taurat" <[EMAIL PROTECTED]> >>To: "Lucene Users List" <[EMAIL PROTECTED]> >>Sent: Friday, September 10, 2004 1:10 PM >>Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large >number of >>documents >> >> >> >> >>>Daniel Aber schrieb: >>> >>> >>> On Thursday 09 September 2004 19:47, Daniel Taurat wrote: >I am facing an out of memory problem using Lucene 1.4.1. > > > > Could you try with a recent CVS version? There has been a fix >about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing.
Regards Daniel >>>Well, it seems not to be files, it looks more like those SegmentTermEnum >>>objects accumulating in memory. >>>#I've seen some discussion on these objects in the developer-newsgroup >>>that had taken place some time ago. >>>I am afraid this is some kind of runaway caching I have to deal with. >>>Maybe not correctly addressed in this newsgroup, after all... >>> >>>Anyway: any idea if there is an API command to re-init caches? >>> >>>Thanks, >>> >>>Daniel >>> >>> >>> >>>- >>>To unsubscribe, e-mail: [EMAIL PROTECTED] >>>For additional commands, e-mail: [EMAIL PROTECTED] >>> >>> >>> >> >> >>- >>To unsubscribe, e-mail: [EMAIL PROTECTED] >>For additional commands, e-mail: [EMAIL PROTECTED] >> >> >> > > > >- >To unsubscribe, e-mail: [EMAIL PROTECTED] >For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
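As a concrete illustration of the slab-and-merge approach Rupinder describes above, here is a sketch using the Lucene 1.x addIndexes() call. The paths, slab count and analyzer are assumptions made up for the example, not his actual job code.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch of "index in slabs, then merge": each slab was built into its own
// temporary directory, and the partial indexes are merged into the final index
// with IndexWriter.addIndexes().
public class SlabMerger {
    public static void main(String[] args) throws Exception {
        int slabs = 16;                                       // e.g. 16 slabs of 1M docs each
        Directory[] partials = new Directory[slabs];
        for (int i = 0; i < slabs; i++) {
            partials[i] = FSDirectory.getDirectory("/tmp/slab-" + i, false);
        }

        IndexWriter writer = new IndexWriter("/data/final-index", new StandardAnalyzer(), true);
        writer.addIndexes(partials);                          // merges (and optimizes) the partial indexes
        writer.close();
    }
}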
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
> I can say that gc is not collecting these objects since I forced gc > runs when indexing every now and then (when parsing pdf-type objects, > that is): No effect. What PDF parser are you using? Is the problem within the parser and not lucene? Are you releasing all resources? Ben - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Addition to contributions page
I hope I'm posting this request to the right list. The page that lists 3rd-party tools that work with Lucene ( http://jakarta.apache.org/lucene/docs/contributions.html ) says that to be added to the page, one should send a message to one of the Lucene mailing lists. So, that's what I'm doing. PDFTextStream should be added to the 'Document Converters' section, with this URL < http://snowtide.com >, and perhaps this heading: 'PDFTextStream -- PDF text and metadata extraction'. The 'Author' field should probably be left blank, since there's no single creator. Thanks much, Chas Emerick | [EMAIL PROTECTED] PDFTextStream: fast PDF text extraction for Java apps and Lucene http://snowtide.com/home/PDFTextStream/
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. But then: we also have experienced that the gc of ibm jdk1.3.1 that we use is sometimes behaving strangely with too large heap space anyway. (Limit seems to be 1.2 Gb) I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect. regards, Daniel Pete Lewis wrote: Hi all Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously therefore if this is the case, by reducing the heap space you can improve performance get rid of the out of memory errors. Cheers Pete Lewis - Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Daniel Aber schrieb: On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out of memory problem using Lucene 1.4.1. Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing. Regards Daniel Well, it seems not to be files, it looks more like those SegmentTermEnum objects accumulating in memory. #I've seen some discussion on these objects in the developer-newsgroup that had taken place some time ago. I am afraid this is some kind of runaway caching I have to deal with. Maybe not correctly addressed in this newsgroup, after all... Anyway: any idea if there is an API command to re-init caches? Thanks, Daniel - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi all Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously therefore if this is the case, by reducing the heap space you can improve performance get rid of the out of memory errors. Cheers Pete Lewis - Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents > Daniel Aber schrieb: > > >On Thursday 09 September 2004 19:47, Daniel Taurat wrote: > > > > > > > >>I am facing an out of memory problem using Lucene 1.4.1. > >> > >> > > > >Could you try with a recent CVS version? There has been a fix about files > >not being deleted after 1.4.1. Not sure if that could cause the problems > >you're experiencing. > > > >Regards > > Daniel > > > > > > > Well, it seems not to be files, it looks more like those SegmentTermEnum > objects accumulating in memory. > #I've seen some discussion on these objects in the developer-newsgroup > that had taken place some time ago. > I am afraid this is some kind of runaway caching I have to deal with. > Maybe not correctly addressed in this newsgroup, after all... > > Anyway: any idea if there is an API command to re-init caches? > > Thanks, > > Daniel > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Daniel Aber schrieb:

> On Thursday 09 September 2004 19:47, Daniel Taurat wrote:
>> I am facing an out of memory problem using Lucene 1.4.1.
>
> Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing.
>
> Regards
> Daniel

Well, it seems not to be files, it looks more like those SegmentTermEnum objects accumulating in memory. I've seen some discussion on these objects in the developer-newsgroup that had taken place some time ago. I am afraid this is some kind of runaway caching I have to deal with. Maybe not correctly addressed in this newsgroup, after all...

Anyway: any idea if there is an API command to re-init caches?

Thanks,

Daniel
Re: combining open office spellchecker with Lucene
Hi Doug,

> Perhaps. Are folks really better at spelling the beginning of words?

Yes they are. There were some comprehensive empirical studies on this topic. The Winkler modification of the Jaro string distance is based on this assumption (boosting similarity if the first n, I think 4, chars match). Jaro-Winkler is well documented and some folks think that it is much more efficient and precise than plain edit distance (of course for normal language, not numbers or so). I will try to dig out some references from my disk on this topic, if you are interested.

On another note, I would even suggest using Jaro-Winkler distance as the default for fuzzy query (one could configure a max prefix required => prefix query to reduce the number of distance calculations). This could speed up fuzzy search dramatically.

Hope this was helpful,
Eks
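For anyone who wants to experiment with the suggestion, here is a plain-Java sketch of the standard Jaro-Winkler similarity (the usual 0.1 scaling factor and 4-character prefix cap). It is a textbook implementation written for this thread, not code from Lucene or from Eks.

// Textbook Jaro-Winkler similarity: plain Jaro plus a boost when the first
// few characters of the two strings agree.
public class JaroWinkler {

    /** Plain Jaro similarity in [0, 1]. */
    static double jaro(String s1, String s2) {
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 && len2 == 0) return 1.0;
        if (len1 == 0 || len2 == 0) return 0.0;

        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];

        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int start = Math.max(0, i - window);
            int end = Math.min(len2 - 1, i + window);
            for (int j = start; j <= end; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // count out-of-order characters among the matches
        int transposed = 0, k = 0;
        for (int i = 0; i < len1; i++) {
            if (matched1[i]) {
                while (!matched2[k]) k++;
                if (s1.charAt(i) != s2.charAt(k)) transposed++;
                k++;
            }
        }
        double m = matches;
        return (m / len1 + m / len2 + (m - transposed / 2.0) / m) / 3.0;
    }

    /** Winkler modification: boost the score when the first (up to 4) characters agree. */
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int prefix = 0;
        int max = Math.min(4, Math.min(s1.length(), s2.length()));
        while (prefix < max && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("recursize", "recursive"));  // long shared prefix: strongly boosted
        System.out.println(jaroWinkler("descent", "descend"));
    }
}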
Re: MultiFieldQueryParser seems broken... Fix attached.
I reckon there has been a discussion (and solution :-) on how to achieve the functionality you've been after: http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116 I'm not sure if this would be the same though. Best regards, René

Hi all, I took the code indicated by Rene but I've seen that it's not completely fitting my requirements, because my application should provide the facility to treat queries as fuzzy queries. So I modified the code to the following, and added a test main method. Hope it helps someone.

package org.apache.lucene;

/* @(#) CWK 1.5 10.09.2004
 *
 * Copyright 2003-2005 ConfigWorks Informationssysteme & Consulting GmbH
 * Universitätsstr. 94/7 9020 Klagenfurt Austria
 * www.configworks.com
 * All rights reserved.
 */

import java.util.Vector;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.Query;

/**
 * @author sergiu
 *
 * This class is a patch for MultiFieldQueryParser.
 * Its behaviour can be tested by running the main method.
 *
 * Now:
 *   String[] fields = new String[] { "title", "abstract", "content" };
 *   QueryParser parser = new CustomQueryParser(fields, new SimpleAnalyzer());
 *   parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
 *   Query query = parser.parse("foo -bar (baz OR title:bla)");
 *   System.out.println("? " + query);
 *
 * Produces:
 *   ? +(title:foo abstract:foo content:foo) -(title:bar abstract:bar content:bar)
 *     +((title:baz abstract:baz content:baz) title:bla)
 *
 * Perfect!
 *
 * @version 1.0
 * @since CWK 1.5
 */
public class CustomQueryParser extends QueryParser {

    private String[] fields;
    private boolean fuzzySearch = false;

    public CustomQueryParser(String[] fields, Analyzer analyzer) {
        super(null, analyzer);
        this.fields = fields;
    }

    public CustomQueryParser(String[] fields, Analyzer analyzer, int defaultOperator) {
        super(null, analyzer);
        this.fields = fields;
        setOperator(defaultOperator);
    }

    protected Query getFieldQuery(String field, Analyzer analyzer, String queryText) throws ParseException {
        Query query = null;
        if (field == null) {
            // no explicit field given: expand the clause over all configured fields
            Vector clauses = new Vector();
            for (int i = 0; i < fields.length; i++) {
                if (isFuzzySearch())
                    clauses.add(new BooleanClause(super.getFuzzyQuery(fields[i], queryText), false, false));
                else
                    clauses.add(new BooleanClause(super.getFieldQuery(fields[i], analyzer, queryText), false, false));
            }
            query = getBooleanQuery(clauses);
        } else {
            if (isFuzzySearch())
                query = super.getFuzzyQuery(field, queryText);
            else
                query = super.getFieldQuery(field, analyzer, queryText);
        }
        return query;
    }

    public boolean isFuzzySearch() {
        return fuzzySearch;
    }

    public void setFuzzySearch(boolean fuzzySearch) {
        this.fuzzySearch = fuzzySearch;
    }

    public static void main(String[] args) throws Exception {
        String[] fields = new String[] { "title", "abstract", "content" };
        CustomQueryParser parser = new CustomQueryParser(fields, new StandardAnalyzer());
        parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
        parser.setFuzzySearch(true);
        String queryString = "foo -bar (baz OR title:bla)";
        System.out.println(queryString);
        Query query = parser.parse(queryString);
        System.out.println("? " + query);
    }
}