Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread Michael McCandless


I still don't quite understand what's causing your memory growth.

SegmentTermEnum instances have been held in a ThreadLocal cache in
TermInfosReader for a very long time (at least since Lucene 1.4).


If indeed it's the RAMDir's contents being kept "alive" due to this,
then you should have already been seeing this problem before rev
659602.  And I still don't get why your reference tree is missing the
TermInfosReader.ThreadResources class.


I'd like to understand the root cause before we hash out possible  
solutions.


Can you post the sources for your load test?

Mike

Chris Lu wrote:

Actually, even if I only use one IndexReader, some resources are cached
via the ThreadLocal cache and cannot be released unless all
threads perform the close action.


SegmentTermEnum itself is small, but it holds the RAMDirectory along
its reference path, and that is big.


--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per
request) got 2.6 Million Euro funding!


On Tue, Sep 9, 2008 at 10:43 PM, robert engels  
<[EMAIL PROTECTED]> wrote:

You do not need a pool of IndexReaders...

It does not matter what class it is; what matters is the class that
ultimately holds the reference.


If the IndexReader is never closed, the SegmentReader(s) are never
closed, so the thread-local in TermInfosReader is not cleared
(because the thread never dies). So you will get one
SegmentTermEnum per thread per segment.


The SegmentTermEnum is not a large object, so even if you had 100
threads and 100 segments, that's 10k instances - it seems hard to
believe that is the source of your memory issue.


The SegmentTermEnum is cached per thread since it needs to enumerate
the terms; not having a per-thread cache would lead to lots of
random access when multiple threads read the index - very slow.


Keep in mind: if every thread were executing a
search simultaneously, you would still have 100x100 SegmentTermEnum
instances anyway!  The only way to prevent that would be to create
and destroy the SegmentTermEnum on each call (opening and seeking to
the proper spot) - which would be SLOW SLOW SLOW.
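The caching pattern described above can be sketched in plain Java. This is a minimal sketch with stand-in names (TermEnum, PerThreadCache), not Lucene's actual code: each thread lazily clones its own enumerator so concurrent readers never reseek each other's state, and the per-thread entry lives until the thread dies or the ThreadLocal is removed on that thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PerThreadCache {
    static final AtomicInteger clones = new AtomicInteger();

    // Stand-in for SegmentTermEnum: cloning is cheap; sharing one
    // instance across threads would force constant reseeking.
    static class TermEnum {
        TermEnum newClone() {
            clones.incrementAndGet();
            return new TermEnum();
        }
    }

    private final TermEnum original = new TermEnum();
    // One cached clone per thread; the entry lives until the thread
    // dies or this ThreadLocal is explicitly removed on that thread.
    private final ThreadLocal<TermEnum> cache = new ThreadLocal<>();

    TermEnum getTermEnum() {
        TermEnum te = cache.get();
        if (te == null) {            // first call on this thread
            te = original.newClone();
            cache.set(te);
        }
        return te;                   // subsequent calls reuse the clone
    }
}
```

With 100 threads touching 100 such readers, this pattern yields the 100x100 instances mentioned above; the cost is memory, not speed.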


On Sep 10, 2008, at 12:19 AM, Chris Lu wrote:

I have tried to create an IndexReader pool and dynamically create  
searcher. But the memory leak is the same. It's not related to the  
Searcher class specifically, but the SegmentTermEnum in  
TermInfosReader.




On Tue, Sep 9, 2008 at 10:14 PM, robert engels  
<[EMAIL PROTECTED]> wrote:
A searcher uses an IndexReader - the IndexReader is slow to open,  
not a Searcher. And searchers can share an IndexReader.


You want to create a single shared (across all threads/users)
IndexReader (usually), and create a Searcher as needed and
dispose of it.  It is VERY CHEAP to create the Searcher.
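The shape of this advice can be sketched with stand-in classes (Reader and Searcher here are illustrative placeholders, not Lucene's IndexReader/IndexSearcher API): one expensive shared reader opened at startup, and a throwaway wrapper built per request.

```java
public class SharedReaderPattern {
    // Stand-in for IndexReader: expensive to open, safe to share.
    static class Reader {
        final String name = "shared-index";
    }

    // Stand-in for IndexSearcher: construction just wraps the reader,
    // which is why creating one per request is cheap.
    static class Searcher {
        final Reader reader;
        Searcher(Reader reader) { this.reader = reader; }
        String search(String query) { return "results for " + query; }
    }

    static final Reader SHARED = new Reader();  // opened once at startup

    // Per request (and therefore per thread): wrap, search, discard.
    static String handleRequest(String query) {
        Searcher searcher = new Searcher(SHARED);
        return searcher.search(query);
    }
}
```

Every Searcher built this way points at the same Reader instance, so no per-user index state accumulates.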


I am fairly certain the javadoc on Searcher is incorrect.  The  
warning "For performance reasons it is recommended to open only one  
IndexSearcher and use it for all of your searches" is not true in  
the case where an IndexReader is passed to the ctor.


Any caching should USUALLY be performed at the IndexReader level.

You are most likely using the "path" ctor, and that is the source
of your problems, as multiple IndexReader instances are being
created - hence the memory use.



On Sep 9, 2008, at 11:44 PM, Chris Lu wrote:

In a J2EE environment, there is usually a searcher pool with several
searchers open.

The cost of opening a large index for every user is not acceptable.



On Tue, Sep 9, 2008 at 9:03 PM, robert engels  
<[EMAIL PROTECTED]> wrote:
You need to close the searcher within the thread that is using it,  
in order to have it cleaned up quickly... usually right after you  
display the page of results.


If you are keeping multiple searcher refs across multiple threads  
for paging/whatever, you have not coded it correctly.


Imagine 10,000 users - storing a searcher for each one is not  
going to work...


On Sep 9, 2008, at 

[jira] Created: (LUCENE-1380) Patch for ShingleFilter.coterminalPositionIncrement

2008-09-10 Thread Michael Semb Wever (JIRA)
Patch for ShingleFilter.coterminalPositionIncrement
---

 Key: LUCENE-1380
 URL: https://issues.apache.org/jira/browse/LUCENE-1380
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Reporter: Michael Semb Wever
 Fix For: 2.4


Make it possible for *all* words and shingles to be placed at the same position.

Default is to place each shingle at the same position as the unigram (or first 
shingle if outputUnigrams=false). That is, each coterminal token has 
positionIncrement=1 and every other token a positionIncrement=0. 
This leads to a MultiPhraseQuery where at least one word/shingle must be 
matched from each word/token. This is not always desired. 

See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for mailing 
list thread.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.coterminalPositionIncrement

2008-09-10 Thread Michael Semb Wever (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated LUCENE-1380:
---

Attachment: LUCENE-1380.patch

Addition to ShingleFilter for property coterminalPositionIncrement.
New corresponding test in ShingleFilterTest.

> Patch for ShingleFilter.coterminalPositionIncrement
> ---
>
> Key: LUCENE-1380
> URL: https://issues.apache.org/jira/browse/LUCENE-1380
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Reporter: Michael Semb Wever
> Fix For: 2.4
>
> Attachments: LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same 
> position.
> Default is to place each shingle at the same position as the unigram (or 
> first shingle if outputUnigrams=false). That is, each coterminal token has 
> positionIncrement=1 and every other token a positionIncrement=0. 
> This leads to a MultiPhraseQuery where at least one word/shingle must be 
> matched from each word/token. This is not always desired. 
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for 
> mailing list thread.




Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread Noble Paul നോബിള്‍ नोब्ळ्
Why do you need to keep a strong reference?
Why not a WeakReference ?
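For reference, the suggestion amounts to something like the following sketch (illustrative names, not Lucene's actual code): hold the per-thread cached value through a WeakReference so the GC can reclaim it even while the thread, and its ThreadLocal map entry, stays alive.

```java
import java.lang.ref.WeakReference;

public class WeakThreadLocalCache<T> {
    // The ThreadLocal entry itself still lives with the thread, but it
    // only weakly pins the cached value, so the GC may reclaim it.
    private final ThreadLocal<WeakReference<T>> cache = new ThreadLocal<>();

    void put(T value) {
        cache.set(new WeakReference<>(value));
    }

    // Returns null when nothing was cached on this thread, or when the
    // GC has already cleared the weakly-held value.
    T get() {
        WeakReference<T> ref = cache.get();
        return ref == null ? null : ref.get();
    }
}
```

The trade-off is that callers must tolerate a cleared cache (recompute on null), and a SoftReference may be preferable if the value should survive until memory pressure.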

--Noble

On Wed, Sep 10, 2008 at 12:27 AM, Chris Lu <[EMAIL PROTECTED]> wrote:
> The problem should be similar to what's talked about on this discussion.
> http://lucene.markmail.org/message/keosgz2c2yjc7qre?q=ThreadLocal
>
> There is a memory leak in Lucene search introduced by LUCENE-1195 (svn
> r659602, May 23, 2008).
>
> This patch brings in a ThreadLocal cache to TermInfosReader.
>
> It's usually recommended to keep the reader open and reuse it when
> possible. In a common J2EE application, HTTP requests are usually
> handled by different threads. But since the cache is ThreadLocal, the
> cached entries are not usable by other threads. What's worse, the cache
> cannot be cleared by another thread!
>
> This leak is usually not so obvious. But my case uses a RAMDirectory of
> several hundred megabytes, so one unreleased resource is obvious to me.
>
> Here is the reference tree:
> org.apache.lucene.store.RAMDirectory
>  |- directory of org.apache.lucene.store.RAMFile
>  |- file of org.apache.lucene.store.RAMInputStream
>  |- base of org.apache.lucene.index.CompoundFileReader$CSIndexInput
>  |- input of org.apache.lucene.index.SegmentTermEnum
>  |- value of java.lang.ThreadLocal$ThreadLocalMap$Entry
>
>
> After I switched back to svn revision 659601, right before this patch was
> checked in, the memory leak was gone.
> Although my case uses a RAMDirectory, I believe this will also affect
> disk-based indexes.
>
>



-- 
--Noble Paul




[jira] Commented: (LUCENE-1320) ShingleMatrixFilter, a three dimensional permutating shingle filter

2008-09-10 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629763#action_12629763
 ] 

Karl Wettin commented on LUCENE-1320:
-

It really is quite a bit of work to downgrade this to 1.4 - lots of generics,
and it also depends on enums.

So if you don't want 1.5 in contrib/analyzers, I vote for simply removing it
from trunk now and reintroducing it in the 3.1-dev trunk.


   karl

> ShingleMatrixFilter, a three dimensional permutating shingle filter
> ---
>
> Key: LUCENE-1320
> URL: https://issues.apache.org/jira/browse/LUCENE-1320
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 2.3.2
>Reporter: Karl Wettin
>Assignee: Karl Wettin
>Priority: Blocker
> Fix For: 2.4
>
> Attachments: LUCENE-1320.txt, LUCENE-1320.txt, LUCENE-1320.txt
>
>
> Backed by a column focused matrix that creates all permutations of shingle 
> tokens in three dimensions. I.e. it handles multi token synonyms.
> Could for instance in some cases be used to replace 0-slop phrase queries 
> with something speedier.
> {code:java}
> Token[][][]{
>   {{hello}, {greetings, and, salutations}},
>   {{world}, {earth}, {tellus}}
> }
> {code}
> passes the following test  with 2-3 grams:
> {code:java}
> assertNext(ts, "hello_world");
> assertNext(ts, "greetings_and");
> assertNext(ts, "greetings_and_salutations");
> assertNext(ts, "and_salutations");
> assertNext(ts, "and_salutations_world");
> assertNext(ts, "salutations_world");
> assertNext(ts, "hello_earth");
> assertNext(ts, "and_salutations_earth");
> assertNext(ts, "salutations_earth");
> assertNext(ts, "hello_tellus");
> assertNext(ts, "and_salutations_tellus");
> assertNext(ts, "salutations_tellus");
> {code}
> Contains more and less complex tests that demonstrate offsets, posincr, 
> payload boosts calculation and construction of a matrix from a token stream.
> The matrix attempts to hog as little memory as possible by seeking no more 
> than maximumShingleSize columns forward in the stream and clearing up unused 
> resources (columns and unique token sets). Can still be optimized quite a bit 
> though.




[jira] Commented: (LUCENE-1320) ShingleMatrixFilter, a three dimensional permutating shingle filter

2008-09-10 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629798#action_12629798
 ] 

Grant Ingersoll commented on LUCENE-1320:
-

I'm almost done w/ a conversion.  Regex is your friend.  As is IntelliJ.

> ShingleMatrixFilter, a three dimensional permutating shingle filter
> ---
>
> Key: LUCENE-1320
> URL: https://issues.apache.org/jira/browse/LUCENE-1320
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 2.3.2
>Reporter: Karl Wettin
>Assignee: Karl Wettin
>Priority: Blocker
> Fix For: 2.4
>
> Attachments: LUCENE-1320.txt, LUCENE-1320.txt, LUCENE-1320.txt
>
>
> Backed by a column focused matrix that creates all permutations of shingle 
> tokens in three dimensions. I.e. it handles multi token synonyms.
> Could for instance in some cases be used to replace 0-slop phrase queries 
> with something speedier.
> {code:java}
> Token[][][]{
>   {{hello}, {greetings, and, salutations}},
>   {{world}, {earth}, {tellus}}
> }
> {code}
> passes the following test  with 2-3 grams:
> {code:java}
> assertNext(ts, "hello_world");
> assertNext(ts, "greetings_and");
> assertNext(ts, "greetings_and_salutations");
> assertNext(ts, "and_salutations");
> assertNext(ts, "and_salutations_world");
> assertNext(ts, "salutations_world");
> assertNext(ts, "hello_earth");
> assertNext(ts, "and_salutations_earth");
> assertNext(ts, "salutations_earth");
> assertNext(ts, "hello_tellus");
> assertNext(ts, "and_salutations_tellus");
> assertNext(ts, "salutations_tellus");
> {code}
> Contains more and less complex tests that demonstrate offsets, posincr, 
> payload boosts calculation and construction of a matrix from a token stream.
> The matrix attempts to hog as little memory as possible by seeking no more 
> than maximumShingleSize columns forward in the stream and clearing up unused 
> resources (columns and unique token sets). Can still be optimized quite a bit 
> though.




[jira] Updated: (LUCENE-1320) ShingleMatrixFilter, a three dimensional permutating shingle filter

2008-09-10 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated LUCENE-1320:


Attachment: LUCENE-1320.patch

Java 1.4 compatible. Give this a try.

> ShingleMatrixFilter, a three dimensional permutating shingle filter
> ---
>
> Key: LUCENE-1320
> URL: https://issues.apache.org/jira/browse/LUCENE-1320
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/analyzers
>Affects Versions: 2.3.2
>Reporter: Karl Wettin
>Assignee: Karl Wettin
>Priority: Blocker
> Fix For: 2.4
>
> Attachments: LUCENE-1320.patch, LUCENE-1320.txt, LUCENE-1320.txt, 
> LUCENE-1320.txt
>
>
> Backed by a column focused matrix that creates all permutations of shingle 
> tokens in three dimensions. I.e. it handles multi token synonyms.
> Could for instance in some cases be used to replace 0-slop phrase queries 
> with something speedier.
> {code:java}
> Token[][][]{
>   {{hello}, {greetings, and, salutations}},
>   {{world}, {earth}, {tellus}}
> }
> {code}
> passes the following test  with 2-3 grams:
> {code:java}
> assertNext(ts, "hello_world");
> assertNext(ts, "greetings_and");
> assertNext(ts, "greetings_and_salutations");
> assertNext(ts, "and_salutations");
> assertNext(ts, "and_salutations_world");
> assertNext(ts, "salutations_world");
> assertNext(ts, "hello_earth");
> assertNext(ts, "and_salutations_earth");
> assertNext(ts, "salutations_earth");
> assertNext(ts, "hello_tellus");
> assertNext(ts, "and_salutations_tellus");
> assertNext(ts, "salutations_tellus");
> {code}
> Contains more and less complex tests that demonstrate offsets, posincr, 
> payload boosts calculation and construction of a matrix from a token stream.
> The matrix attempts to hog as little memory as possible by seeking no more 
> than maximumShingleSize columns forward in the stream and clearing up unused 
> resources (columns and unique token sets). Can still be optimized quite a bit 
> though.




Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread robert engels

Sorry, but I am fairly certain you are mistaken.

If you only have a single IndexReader, the RAMDirectory will be  
shared in all cases.


The only memory growth is any buffer space allocated by an IndexInput  
(used in many places and cached).


Normally the IndexInputs created by a RAMDirectory do not have any
buffer allocated, since the underlying store is already in memory.


You have some other problem in your code...

On Sep 10, 2008, at 1:10 AM, Chris Lu wrote:

Actually, even if I only use one IndexReader, some resources are
cached via the ThreadLocal cache and cannot be released unless
all threads perform the close action.


SegmentTermEnum itself is small, but it holds the RAMDirectory along
its reference path, and that is big.




On Tue, Sep 9, 2008 at 10:43 PM, robert engels  
<[EMAIL PROTECTED]> wrote:

You do not need a pool of IndexReaders...

It does not matter what class it is; what matters is the class that
ultimately holds the reference.


If the IndexReader is never closed, the SegmentReader(s) are never
closed, so the thread-local in TermInfosReader is not cleared
(because the thread never dies). So you will get one
SegmentTermEnum per thread per segment.


The SegmentTermEnum is not a large object, so even if you had 100
threads and 100 segments, that's 10k instances - it seems hard to
believe that is the source of your memory issue.


The SegmentTermEnum is cached per thread since it needs to enumerate
the terms; not having a per-thread cache would lead to lots of
random access when multiple threads read the index - very slow.


Keep in mind: if every thread were executing a
search simultaneously, you would still have 100x100
SegmentTermEnum instances anyway!  The only way to prevent that
would be to create and destroy the SegmentTermEnum on each call
(opening and seeking to the proper spot) - which would be SLOW SLOW
SLOW.


On Sep 10, 2008, at 12:19 AM, Chris Lu wrote:

I have tried to create an IndexReader pool and dynamically create  
searcher. But the memory leak is the same. It's not related to the  
Searcher class specifically, but the SegmentTermEnum in  
TermInfosReader.




On Tue, Sep 9, 2008 at 10:14 PM, robert engels  
<[EMAIL PROTECTED]> wrote:
A searcher uses an IndexReader - the IndexReader is slow to open,  
not a Searcher. And searchers can share an IndexReader.


You want to create a single shared (across all threads/users)
IndexReader (usually), and create a Searcher as needed and
dispose of it.  It is VERY CHEAP to create the Searcher.


I am fairly certain the javadoc on Searcher is incorrect.  The  
warning "For performance reasons it is recommended to open only  
one IndexSearcher and use it for all of your searches" is not true  
in the case where an IndexReader is passed to the ctor.


Any caching should USUALLY be performed at the IndexReader level.

You are most likely using the "path" ctor, and that is the source
of your problems, as multiple IndexReader instances are being
created - hence the memory use.



On Sep 9, 2008, at 11:44 PM, Chris Lu wrote:

In a J2EE environment, there is usually a searcher pool with
several searchers open.

The cost of opening a large index for every user is not acceptable.



On Tue, Sep 9, 2008 at 9:03 PM, robert engels  
<[EMAIL PROTECTED]> wrote:
You need to close the searcher within the thread that is using  
it, in order to have it cleaned up quickly... usually right after  
you display the page of results.


If you are keeping multiple searcher refs across multiple threads  
for paging/whatever, you have not coded it correctly.


Imagine 10,000 users - storing a searcher for each one is not  
going to work...


On Sep 9, 2008, at 10:21 PM, Chris Lu wrote:

Right, in a sense I can not release it from another thread. But  
that's the problem.


It's a J2EE

[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.coterminalPositionIncrement

2008-09-10 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629827#action_12629827
 ] 

Steven Rowe commented on LUCENE-1380:
-

As I said in the thread on java-user that spawned this issue (emphasis added):

{quote}
It works because you've set all of the shingles to be at the same position - 
probably better to change the one instance of .setPositionIncrement(0) to 
.setPositionIncrement(1) - that way, MultiPhraseQuery will not be invoked, and 
the standard disjunction thing should happen.

> [W]ould a patch to ShingleFilter that offers an option
> "unigramPositionIncrement" (that defaults to 1) likely be
> accepted into trunk?

The issue is not directly related to whether a unigram is involved, but rather 
whether or not _*tokens that begin at the same word*_ are given the same 
position.  The option thus should be named something like 
"coterminalPositionIncrement".  This seems like a reasonable addition, and a 
patch likely would be accepted, if it included unit tests.
{quote}

You have used the option name I suggested, but have implemented it in a form 
that doesn't follow the name -- in your implementation, *all* tokens are placed 
at the same position, not just those that start at the same word -- and I think 
this form is inappropriate for the general user.

I'm -1 on the patch in its current form.  If rewritten to modify the position 
increment only for those shingles that begin at the same word, I'd be +1 
(assuming it works and is tested appropriately).
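The distinction Steven draws can be sketched with plain data (the helper below is hypothetical, not part of ShingleFilter): among tokens that begin at the same word, only the first gets positionIncrement=1 and the rest get 0, so each group stacks at one position while distinct start words still advance. For instance, with 2-grams over three words the tokens might start at words [0, 0, 1, 1, 2] (unigram then bigram at each start word).

```java
import java.util.ArrayList;
import java.util.List;

public class CoterminalPositions {
    // startWords[i] is the index of the word at which token i begins,
    // in token-stream order (tokens starting at the same word are adjacent).
    static List<Integer> positionIncrements(int[] startWords) {
        List<Integer> increments = new ArrayList<>();
        int previousStart = -1;
        for (int start : startWords) {
            // Same start word as the previous token -> stack at the same
            // position (increment 0); new start word -> advance by 1.
            increments.add(start == previousStart ? 0 : 1);
            previousStart = start;
        }
        return increments;
    }
}
```

Under this scheme [0, 0, 1, 1, 2] yields increments [1, 0, 1, 0, 1], whereas the patch as submitted would give every token after the first an increment of 0.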

> Patch for ShingleFilter.coterminalPositionIncrement
> ---
>
> Key: LUCENE-1380
> URL: https://issues.apache.org/jira/browse/LUCENE-1380
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Reporter: Michael Semb Wever
> Fix For: 2.4
>
> Attachments: LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same 
> position.
> Default is to place each shingle at the same position as the unigram (or 
> first shingle if outputUnigrams=false). That is, each coterminal token has 
> positionIncrement=1 and every other token a positionIncrement=0. 
> This leads to a MultiPhraseQuery where at least one word/shingle must be 
> matched from each word/token. This is not always desired. 
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for 
> mailing list thread.




Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread Chris Lu
Does this make any difference? If intentionally closing the searcher and
reader fails to release the memory, I cannot rely on some magic of the JVM to
release it.



On Wed, Sep 10, 2008 at 4:03 AM, Noble Paul നോബിള്‍ नोब्ळ् <
[EMAIL PROTECTED]> wrote:

> Why do you need to keep a strong reference?
> Why not a WeakReference ?
>
> --Noble
>
> On Wed, Sep 10, 2008 at 12:27 AM, Chris Lu <[EMAIL PROTECTED]> wrote:
> > The problem should be similar to what's talked about on this discussion.
> > http://lucene.markmail.org/message/keosgz2c2yjc7qre?q=ThreadLocal
> >
> > There is a memory leak for Lucene search from Lucene-1195.(svn r659602,
> > May23,2008)
> >
> > This patch brings in a ThreadLocal cache to TermInfosReader.
> >
> > It's usually recommended to keep the reader open, and reuse it when
> > possible. In a common J2EE application, the http requests are usually
> > handled by different threads. But since the cache is ThreadLocal, the
> cache
> > are not really usable by other threads. What's worse, the cache can not
> be
> > cleared by another thread!
> >
> > This leak is not so obvious usually. But my case is using RAMDirectory,
> > having several hundred megabytes. So one un-released resource is obvious
> to
> > me.
> >
> > Here is the reference tree:
> > org.apache.lucene.store.RAMDirectory
> >  |- directory of org.apache.lucene.store.RAMFile
> >  |- file of org.apache.lucene.store.RAMInputStream
> >  |- base of
> org.apache.lucene.index.CompoundFileReader$CSIndexInput
> >  |- input of org.apache.lucene.index.SegmentTermEnum
> >  |- value of java.lang.ThreadLocal$ThreadLocalMap$Entry
> >
> >
> > After I switched back to svn revision 659601, right before this patch is
> > checked in, the memory leak is gone.
> > Although my case is RAMDirectory, I believe this will affect disk based
> > index also.
> >
> >
>
>
>
> --
> --Noble Paul
>
>
>


Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread Chris Lu
I do not believe I am making any mistake. Actually, I just got an email from
another user complaining about the same thing, and I am using the same
usage pattern.
After the reader is opened, the RAMDirectory is shared by several objects.
There is one instance of RAMDirectory in memory, and it is holding lots
of memory, which is expected.

If I close the reader in the same thread that opened it, the
RAMDirectory is gone from memory.
If I close the reader in another thread, the RAMDirectory is left in
memory, referenced along the tree I drew in the first email.

I do not think the usage is wrong. Period.
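One way to honor the constraint Chris observes - close on the thread that opened - can be sketched as follows. This is an illustrative workaround pattern, not from the thread and not Lucene API: route open, use, and close of the reader through a single owner thread, so any ThreadLocal state is created and cleared on the same thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OwnerThreadClose {
    // Stand-in for an IndexReader whose thread-local resources must be
    // released on the thread that populated them.
    static class Reader {
        final long ownerThreadId = Thread.currentThread().getId();
        long closedOnThreadId = -1;
        void close() { closedOnThreadId = Thread.currentThread().getId(); }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService owner = Executors.newSingleThreadExecutor();
        try {
            Future<Reader> opened = owner.submit(Reader::new);  // open on the owner thread
            Reader reader = opened.get();
            // ... request threads may use the reader here ...
            owner.submit(reader::close).get();                  // close on the same thread
        } finally {
            owner.shutdown();  // owner thread exits; its ThreadLocal map becomes collectable
        }
    }
}
```

A single-thread executor guarantees both tasks run on one worker thread, so the close sees the same thread-local state the open created.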

-

Hi,

   I found a forum post from you here [1] where you mention that you
have a memory leak using the Lucene RAM directory. I'd like to ask you
if you have already resolved the problem and how you did it, or maybe
you know where I can read about the solution. We are using
RAMDirectory too and figured out that over time the memory
consumption rises and rises until the system breaks down, but only
when we perform many index updates. If we only create the index and
do nothing except search it, it works fine.

maybe you can give me a hint or a link,
greetz,

-


On Wed, Sep 10, 2008 at 7:12 AM, robert engels <[EMAIL PROTECTED]>wrote:

> Sorry, but I am fairly certain you are mistaken.
> If you only have a single IndexReader, the RAMDirectory will be shared in
> all cases.
>
> The only memory growth is any buffer space allocated by an IndexInput (used
> in many places and cached).
>
> Normally the IndexInput created by a RAMDirectory do not have any buffer
> allocated, since the underlying store is already in memory.
>
> You have some other problem in your code...
>
> On Sep 10, 2008, at 1:10 AM, Chris Lu wrote:
>
> Actually, even I only use one IndexReader, some resources are cached via
> the ThreadLocal cache, and can not be released unless all threads do the
> close action.
>
> SegmentTermEnum itself is small, but it holds RAMDirectory along the path,
> which is big.
>
> On Tue, Sep 9, 2008 at 10:43 PM, robert engels <[EMAIL PROTECTED]>wrote:
>
>>  You do not need a pool of IndexReaders...
>> It does not matter what class it is, what matters is the class that
>> ultimately holds the reference.
>>
>> If the IndexReader is never closed, the SegmentReader(s) is never closed,
>> so the thread local in TermInfosReader is not cleared (because the thread
>> never dies). So you will get one SegmentTermEnum, per thread * per segment.
>>
>> The SegmentTermEnum is not a large object, so even if you had 100 threads,
>> and 100 segments, for 10k instances, seems hard to believe that is the
>> source of your memory issue.
>>
>> The SegmentTermEnum is cached by thread since it needs to enumerate the
>> terms, not having a per thread cache, would lead to lots of random access
>> when multiple threads read the index - very slow.
>>
>> You need to keep in mind, what if every thread was executing a search
>> simultaneously - you would still have 100x100 SegmentTermEnum instances
>> anyway !  The only way to prevent that would be to create and destroy the
>> SegmentTermEnum on each call (opening and seeking to the proper spot) -
>> which would be SLOW SLOW SLOW.
>>
>> On Sep 10, 2008, at 12:19 AM, Chris Lu wrote:
>>
>> I have tried to create an IndexReader pool and dynamically create
>> searcher. But the memory leak is the same. It's not related to the Searcher
>> class specifically, but the SegmentTermEnum in TermInfosReader.
>>
>>
>> On Tue, Sep 9, 2008 at 10:14 PM, robert engels <[EMAIL PROTECTED]>wrote:
>>
>>>  A searcher uses an IndexReader - the IndexReader is slow to open, not a
>>> Searcher. And searchers can share an IndexReader.
>>> You want to c

Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread Chris Lu
Frankly, I don't know why TermInfosReader.ThreadResources is not showing up
in the memory snapshot.

Yes, it has been there for a long time. But let's look at what changed: an
LRU termInfoCache was added.
Previously the SegmentTermEnum would be released, since it's a relatively
simple object.
But with a cache added to the same ThreadResources class, which holds many
objects, and with the threads still hanging around, the cache cannot be
released; in turn the SegmentTermEnum cannot be released, so the
RAMDirectory cannot be released.

My test is too coupled with the software I am working on and not easy to
post here. But here is a similar case from another user:

---

I found a forum post from you here [1] where you mention that you
have a memory leak using the Lucene RAM directory. I'd like to ask you
if you have already resolved the problem and how you did it, or maybe
you know where I can read about the solution. We are using
RAMDirectory too and found that over time the memory
consumption rises and rises until the system breaks down, but only
when we perform many index updates. If we only create the index and
do nothing except search it, it works fine.

---

-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!

On Wed, Sep 10, 2008 at 2:45 AM, Michael McCandless <
[EMAIL PROTECTED]> wrote:

>
> I still don't quite understand what's causing your memory growth.
>
> SegmentTermEnum instances have been held in a ThreadLocal cache in
> TermInfosReader for a very long time (at least since Lucene 1.4).
>
> If indeed it's the RAMDir's contents being kept "alive" due to this, then,
> you should have already been seeing this problem before rev 659602.  And I
> still don't get why your reference tree is missing the
> TermInfosReader.ThreadResources class.
>
> I'd like to understand the root cause before we hash out possible
> solutions.
>
> Can you post the sources for your load test?
>
> Mike
>
>
> Chris Lu wrote:
>
>  Actually, even if I only use one IndexReader, some resources are cached via
>> the ThreadLocal cache, and can not be released unless all threads do the
>> close action.
>>
>> SegmentTermEnum itself is small, but it holds RAMDirectory along the path,
>> which is big.
>>
>> --
>> Chris Lu
>> -
>> Instant Scalable Full-Text Search On Any Database/Application
>> site: http://www.dbsight.net
>> demo: http://search.dbsight.com
>> Lucene Database Search in 3 minutes:
>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>> DBSight customer, a shopping comparison site, (anonymous per request) got
>> 2.6 Million Euro funding!
>>
>> On Tue, Sep 9, 2008 at 10:43 PM, robert engels <[EMAIL PROTECTED]>
>> wrote:
>> You do not need a pool of IndexReaders...
>>
>> It does not matter what class it is, what matters is the class that
>> ultimately holds the reference.
>>
>> If the IndexReader is never closed, the SegmentReader(s) is never closed,
>> so the thread local in TermInfosReader is not cleared (because the thread
>> never dies). So you will get one SegmentTermEnum, per thread * per segment.
>>
>> The SegmentTermEnum is not a large object, so even if you had 100 threads,
>> and 100 segments, for 10k instances, seems hard to believe that is the
>> source of your memory issue.
>>
>> The SegmentTermEnum is cached by thread since it needs to enumerate the
>> terms, not having a per thread cache, would lead to lots of random access
>> when multiple threads read the index - very slow.
>>
>> You need to keep in mind, what if every thread was executing a search
>> simultaneously - you would still have 100x100 SegmentTermEnum instances
>> anyway !  The only way to prevent that would be to create and destroy the
>> SegmentTermEnum on each call (opening and seeking to the proper spot) -
>> which would be SLOW SLOW SLOW.
>>
>> On Sep 10, 2008, at 12:19 AM, Chris Lu wrote:
>>
>>  I have tried to create an IndexReader pool and dynamically create
>>> searcher. But the memory leak is the same. It's not related to the Searcher
>>> class specifically, but the SegmentTermEnum in TermInfosReader.
>>>
>>> --
>>> Chris Lu
>>> -
>>> Instant Scalable Full-Text Search On Any Database/Application
>>> site: http://www.dbsight.net
>>> demo: http://search.dbsight.com
>>> Lucene Database Search in 3 minutes:
>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>> DBSight customer, a shopping comparison site, (anonymous per request

Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread robert engels
It is basic Java. Threads are not guaranteed to run on any particular
schedule. If you create lots of large objects in one thread and release
them in another, there is a good chance you will get an OOM (since the
releasing thread may not run before the OOM occurs)... This is not
Lucene-specific by any means.


It is a misunderstanding on your part about how GC works.

I assume you must at some point be creating new RAMDirectories -  
otherwise the memory would never really increase, since the  
IndexReader/enums/etc are not very large...


When you create a new RAMDirectory, you need to BE CERTAIN !!! that
the other IndexReaders/Searchers using the old RAMDirectory are ALL
CLOSED, otherwise their memory will still be in use, which leads to
your OOM...
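A sketch of that discipline, with stand-in names (RamIndexStub is hypothetical, not a Lucene class): the old in-memory index refuses to be retired while any reader on it is still open, which is exactly the invariant you have to enforce manually before swapping in a rebuilt RAMDirectory.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicInteger;

public class RamIndexStub implements Closeable {
    private final AtomicInteger openReaders = new AtomicInteger();
    boolean released;

    Closeable openReader() {
        openReaders.incrementAndGet();
        return new Closeable() {
            public void close() { openReaders.decrementAndGet(); }
        };
    }

    public void close() {
        if (openReaders.get() != 0)     // refuse to retire while readers linger
            throw new IllegalStateException(openReaders.get() + " readers still open");
        released = true;                // only now can the buffers behind it be reclaimed
    }

    public static void main(String[] args) throws Exception {
        RamIndexStub old = new RamIndexStub();
        Closeable reader = old.openReader();
        reader.close();                 // ALL readers closed first...
        old.close();                    // ...then the old index may be retired
        RamIndexStub rebuilt = new RamIndexStub();  // safe to swap in the rebuild
        System.out.println("old index released: " + old.released);
        rebuilt.close();
    }
}
```

Lucene itself does no such counting for you; if the swap happens while a reader is still open somewhere, the old directory simply stays reachable.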



On Sep 10, 2008, at 10:16 AM, Chris Lu wrote:

I do not believe I am making any mistake. Actually I just got an  
email from another user, complaining about the same thing. And I am  
having the same usage pattern.


After the reader is opened, the RAMDirectory is shared by several  
objects.
There is one instance of RAMDirectory in the memory, and it is  
holding lots of memory, which is expected.


If I close the reader in the same thread that opened it, the
RAMDirectory is gone from memory.
If I close the reader in another thread, the RAMDirectory is left in
memory, referenced along the tree I drew in the first email.


I do not think the usage is wrong. Period.
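The same-thread vs. other-thread observation above has a plain-Java core: ThreadLocal.remove() only operates on the calling thread's map, so a "close" performed on a different thread can never clear the opener's entry. A minimal sketch with hypothetical names (Lucene's real cleanup is more involved):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class CrossThreadCloseDemo {
    static final ThreadLocal<byte[]> cache = new ThreadLocal<byte[]>();
    static final AtomicBoolean openerStillCached = new AtomicBoolean();

    public static void main(String[] args) throws Exception {
        final Thread closer = new Thread(new Runnable() {
            public void run() { cache.remove(); }     // clears only the closer's OWN map
        });
        Thread opener = new Thread(new Runnable() {
            public void run() {
                cache.set(new byte[1 << 20]);         // this thread's private entry
                try {
                    closer.start();                   // "close" happens on another thread
                    closer.join();
                } catch (InterruptedException e) {
                    throw new RuntimeException(e);
                }
                openerStillCached.set(cache.get() != null);
            }
        });
        opener.start();
        opener.join();
        // The other thread's remove() could not touch the opener's entry:
        System.out.println("opener still cached: " + openerStillCached.get());
    }
}
```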

-
Hi,

   I found a forum post from you here [1] where you mention that you
have a memory leak using the Lucene RAM directory. I'd like to ask you
if you have already resolved the problem and how you did it, or maybe
you know where I can read about the solution. We are using
RAMDirectory too and found that over time the memory
consumption rises and rises until the system breaks down, but only
when we perform many index updates. If we only create the index and
do nothing except search it, it works fine.

maybe you can give me a hint or a link,
greetz,
-

--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/ 
index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!


On Wed, Sep 10, 2008 at 7:12 AM, robert engels  
<[EMAIL PROTECTED]> wrote:

Sorry, but I am fairly certain you are mistaken.

If you only have a single IndexReader, the RAMDirectory will be  
shared in all cases.


The only memory growth is any buffer space allocated by an  
IndexInput (used in many places and cached).


Normally the IndexInputs created by a RAMDirectory do not have any
buffer allocated, since the underlying store is already in memory.


You have some other problem in your code...

On Sep 10, 2008, at 1:10 AM, Chris Lu wrote:

Actually, even if I only use one IndexReader, some resources are
cached via the ThreadLocal cache, and can not be released unless  
all threads do the close action.


SegmentTermEnum itself is small, but it holds RAMDirectory along  
the path, which is big.


--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/ 
index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!


On Tue, Sep 9, 2008 at 10:43 PM, robert engels  
<[EMAIL PROTECTED]> wrote:

You do not need a pool of IndexReaders...

It does not matter what class it is, what matters is the class  
that ultimately holds the reference.


If the IndexReader is never closed, the SegmentReader(s) is never  
closed, so the thread local in TermInfosReader is not cleared  
(because the thread never dies). So you will get one  
SegmentTermEnum, per thread * per segment.


The SegmentTermEnum is not a large object, so even if you had 100  
threads, and 100 segments, for 10k instances, seems hard to  
believe that is the source of your memory issue.


The SegmentTermEnum is cached by thread since it needs to  
enumerate the terms, not having a per thread cache, would lead to  
lots of random access when multiple threads read the index - very  
slow.


You need to keep in mind, what if every thread was executing a  
search simultaneously - you would still have 100x100  
SegmentTermEnum instances anyway !  The only way to prevent that  
would be to create and destroy the SegmentTermEnum on each call  
(opening and seeking to the proper spot) - which would be SLOW  
SLOW SLOW.


On Sep 10, 2008, at 12:19 AM, Chris Lu wrote:

I have tried to create an IndexReader pool and dynamically create  
searcher.

Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread robert engels
Actually, a single RAMDirectory would be sufficient (since it
supports writes). There should never be a reason to create a new
RAMDirectory (unless you have some specialized real-time search
occurring).


If you are creating new RAMDirectories, the statements below hold.

On Sep 10, 2008, at 10:34 AM, robert engels wrote:

It is basic Java. Threads are not guaranteed to run on any sort of  
schedule. If you create lots of large objects in one thread,  
releasing them in another, there is a good chance you will get an  
OOM (since the releasing thread may not run before the OOM  
occurs)...  This is not Lucene specific by any means.


It is a misunderstanding on your part about how GC works.

I assume you must at some point be creating new RAMDirectories -  
otherwise the memory would never really increase, since the  
IndexReader/enums/etc are not very large...


When you create a new RAMDirectory, you need to BE CERTAIN !!!
that the other IndexReaders/Searchers using the old RAMDirectory  
are ALL CLOSED, otherwise their memory will still be in use, which  
leads to your OOM...



On Sep 10, 2008, at 10:16 AM, Chris Lu wrote:

I do not believe I am making any mistake. Actually I just got an  
email from another user, complaining about the same thing. And I  
am having the same usage pattern.


After the reader is opened, the RAMDirectory is shared by several  
objects.
There is one instance of RAMDirectory in the memory, and it is  
holding lots of memory, which is expected.


If I close the reader in the same thread that opened it, the
RAMDirectory is gone from memory.
If I close the reader in another thread, the RAMDirectory is left
in memory, referenced along the tree I drew in the first email.


I do not think the usage is wrong. Period.

-
Hi,

   I found a forum post from you here [1] where you mention that you
have a memory leak using the Lucene RAM directory. I'd like to ask you
if you have already resolved the problem and how you did it, or maybe
you know where I can read about the solution. We are using
RAMDirectory too and found that over time the memory
consumption rises and rises until the system breaks down, but only
when we perform many index updates. If we only create the index and
do nothing except search it, it works fine.

maybe you can give me a hint or a link,
greetz,
-

--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/ 
index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!


On Wed, Sep 10, 2008 at 7:12 AM, robert engels  
<[EMAIL PROTECTED]> wrote:

Sorry, but I am fairly certain you are mistaken.

If you only have a single IndexReader, the RAMDirectory will be  
shared in all cases.


The only memory growth is any buffer space allocated by an  
IndexInput (used in many places and cached).


Normally the IndexInputs created by a RAMDirectory do not have any
buffer allocated, since the underlying store is already in memory.


You have some other problem in your code...

On Sep 10, 2008, at 1:10 AM, Chris Lu wrote:

Actually, even if I only use one IndexReader, some resources are
cached via the ThreadLocal cache, and can not be released unless  
all threads do the close action.


SegmentTermEnum itself is small, but it holds RAMDirectory along  
the path, which is big.


--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/ 
index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!


On Tue, Sep 9, 2008 at 10:43 PM, robert engels  
<[EMAIL PROTECTED]> wrote:

You do not need a pool of IndexReaders...

It does not matter what class it is, what matters is the class  
that ultimately holds the reference.


If the IndexReader is never closed, the SegmentReader(s) is never  
closed, so the thread local in TermInfosReader is not cleared  
(because the thread never dies). So you will get one  
SegmentTermEnum, per thread * per segment.


The SegmentTermEnum is not a large object, so even if you had 100  
threads, and 100 segments, for 10k instances, seems hard to  
believe that is the source of your memory issue.


The SegmentTermEnum is cached by thread since it needs to  
enumerate the terms, not having a per thread cache, would lead to  
lots of random access when multiple threads read the index - very  
slow.


You need to keep in mind, what if every thread was executing a  
search simultaneously - you would still have 100x100  

Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread Chris Lu
I really do want to find out where I am going wrong, if that's the case.

Yes, I have made certain that I closed all Readers/Searchers, and verified
that through a memory profiler.
Yes, I am creating new RAMDirectory instances. But that's the point: I need
to update the content. Sure, with no content updates and everything staying
the same, of course there is no OOM.

Yes, there is no guarantee of the thread schedule. But that's the problem:
Lucene is using ThreadLocal to cache lots of things keyed by the Thread,
with no idea when they will be released. Of course ThreadLocal itself is not
Lucene's problem...

Chris
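The app-server angle is what makes this bite: pooled J2EE worker threads never die, so a ThreadLocal entry written during one request is still pinned when the next request reuses the thread. A small sketch of that behavior (plain Java, illustrative names), including the explicit remove() that is the only early release:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PooledThreadLocalDemo {
    static final ThreadLocal<String> cache = new ThreadLocal<String>();

    static String[] demo() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1); // one long-lived worker thread
        pool.submit(new Runnable() {                            // "request 1" populates the cache
            public void run() { cache.set("per-thread cached state"); }
        }).get();
        String survived = pool.submit(new Callable<String>() {  // "request 2", same thread
            public String call() { return cache.get(); }
        }).get();
        pool.submit(new Runnable() {                            // cleanup on the owning thread
            public void run() { cache.remove(); }
        }).get();
        String afterRemove = pool.submit(new Callable<String>() {
            public String call() { return cache.get(); }
        }).get();
        pool.shutdown();
        return new String[] { survived, afterRemove };
    }

    public static void main(String[] args) throws Exception {
        String[] r = demo();
        System.out.println("leaked across requests: " + r[0]);
        System.out.println("after remove(): " + r[1]);
    }
}
```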

On Wed, Sep 10, 2008 at 8:34 AM, robert engels <[EMAIL PROTECTED]>wrote:

> It is basic Java. Threads are not guaranteed to run on any sort of
> schedule. If you create lots of large objects in one thread, releasing them
> in another, there is a good chance you will get an OOM (since the releasing
> thread may not run before the OOM occurs)...  This is not Lucene specific by
> any means.
> It is a misunderstanding on your part about how GC works.
>
> I assume you must at some point be creating new RAMDirectories - otherwise
> the memory would never really increase, since the IndexReader/enums/etc are
> not very large...
>
> When you create a new RAMDirectory, you need to BE CERTAIN !!! that the
> other IndexReaders/Searchers using the old RAMDirectory are ALL CLOSED,
> otherwise their memory will still be in use, which leads to your OOM...
>
>
> On Sep 10, 2008, at 10:16 AM, Chris Lu wrote:
>
> I do not believe I am making any mistake. Actually I just got an email from
> another user, complaining about the same thing. And I am having the same
> usage pattern.
> After the reader is opened, the RAMDirectory is shared by several objects.
> There is one instance of RAMDirectory in the memory, and it is holding lots
> of memory, which is expected.
>
> If I close the reader in the same thread that has opened it, the
> RAMDirectory is gone from the memory.
> If I close the reader in another thread, the RAMDirectory is left in
> memory, referenced along the tree I drew in the first email.
>
> I do not think the usage is wrong. Period.
>
> -
>
> Hi,
>
> I found a forum post from you here [1] where you mention that you
> have a memory leak using the Lucene RAM directory. I'd like to ask you
> if you have already resolved the problem and how you did it, or maybe
> you know where I can read about the solution. We are using
> RAMDirectory too and found that over time the memory
> consumption rises and rises until the system breaks down, but only
> when we perform many index updates. If we only create the index and
> do nothing except search it, it works fine.
>
> maybe you can give me a hint or a link,
> greetz,
>
> -
>
> --
> Chris Lu
> -
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes:
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
> DBSight customer, a shopping comparison site, (anonymous per request) got
> 2.6 Million Euro funding!
>
> On Wed, Sep 10, 2008 at 7:12 AM, robert engels <[EMAIL PROTECTED]>wrote:
>
>> Sorry, but I am fairly certain you are mistaken.
>> If you only have a single IndexReader, the RAMDirectory will be shared in
>> all cases.
>>
>> The only memory growth is any buffer space allocated by an IndexInput
>> (used in many places and cached).
>>
>> Normally the IndexInputs created by a RAMDirectory do not have any buffer
>> allocated, since the underlying store is already in memory.
>>
>> You have some other problem in your code...
>>
>> On Sep 10, 2008, at 1:10 AM, Chris Lu wrote:
>>
>> Actually, even if I only use one IndexReader, some resources are cached via
>> the ThreadLocal cache, and can not be released unless all threads do the
>> close action.
>>
>> SegmentTermEnum itself is small, but it holds RAMDirectory along the path,
>> which is big.
>>
>> --
>> Chris Lu
>> -
>> Instant Scalable Full-Text Search On Any Database/Application
>> site: http://www.dbsight.net
>> demo: http://search.dbsight.com
>> Lucene Database Search in 3 minutes:
>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>> DBSight customer, a shopping comparison site, (anonymous per request) got
>> 2.6 Million Euro funding!
>> On Tue, Sep 9, 2008 at 10:43 PM, robert engels <[EMAIL PROTECTED]>wrote:
>>
>>>  You do not need a pool of IndexReaders...
>>> It does not matter what class it is, what matters is the class that
>>> ultimately holds the reference.
>>>
>>> If the IndexReader is never closed, the SegmentReader(s) is never closed,
>>> so the thread local in TermInfosReader is not cleared (because the thread
>>> never dies). So you will get one SegmentTermEnum, per thread * per segment.
>>>
>>> The SegmentTermEnum is not a l

Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread Chris Lu
Is it possible that some other place is using SegmentTermEnum as a
ThreadLocal? This may explain why TermInfosReader.ThreadResources is not in
the memory snapshot.

-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!

On Wed, Sep 10, 2008 at 2:45 AM, Michael McCandless <
[EMAIL PROTECTED]> wrote:

>
> I still don't quite understand what's causing your memory growth.
>
> SegmentTermEnum instances have been held in a ThreadLocal cache in
> TermInfosReader for a very long time (at least since Lucene 1.4).
>
> If indeed it's the RAMDir's contents being kept "alive" due to this, then,
> you should have already been seeing this problem before rev 659602.  And I
> still don't get why your reference tree is missing the
> TermInfosReader.ThreadResources class.
>
> I'd like to understand the root cause before we hash out possible
> solutions.
>
> Can you post the sources for your load test?
>
> Mike
>
>
> Chris Lu wrote:
>
>  Actually, even if I only use one IndexReader, some resources are cached via
>> the ThreadLocal cache, and can not be released unless all threads do the
>> close action.
>>
>> SegmentTermEnum itself is small, but it holds RAMDirectory along the path,
>> which is big.
>>
>> --
>> Chris Lu
>> -
>> Instant Scalable Full-Text Search On Any Database/Application
>> site: http://www.dbsight.net
>> demo: http://search.dbsight.com
>> Lucene Database Search in 3 minutes:
>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>> DBSight customer, a shopping comparison site, (anonymous per request) got
>> 2.6 Million Euro funding!
>>
>> On Tue, Sep 9, 2008 at 10:43 PM, robert engels <[EMAIL PROTECTED]>
>> wrote:
>> You do not need a pool of IndexReaders...
>>
>> It does not matter what class it is, what matters is the class that
>> ultimately holds the reference.
>>
>> If the IndexReader is never closed, the SegmentReader(s) is never closed,
>> so the thread local in TermInfosReader is not cleared (because the thread
>> never dies). So you will get one SegmentTermEnum, per thread * per segment.
>>
>> The SegmentTermEnum is not a large object, so even if you had 100 threads,
>> and 100 segments, for 10k instances, seems hard to believe that is the
>> source of your memory issue.
>>
>> The SegmentTermEnum is cached by thread since it needs to enumerate the
>> terms, not having a per thread cache, would lead to lots of random access
>> when multiple threads read the index - very slow.
>>
>> You need to keep in mind, what if every thread was executing a search
>> simultaneously - you would still have 100x100 SegmentTermEnum instances
>> anyway !  The only way to prevent that would be to create and destroy the
>> SegmentTermEnum on each call (opening and seeking to the proper spot) -
>> which would be SLOW SLOW SLOW.
>>
>> On Sep 10, 2008, at 12:19 AM, Chris Lu wrote:
>>
>>  I have tried to create an IndexReader pool and dynamically create
>>> searcher. But the memory leak is the same. It's not related to the Searcher
>>> class specifically, but the SegmentTermEnum in TermInfosReader.
>>>
>>> --
>>> Chris Lu
>>> -
>>> Instant Scalable Full-Text Search On Any Database/Application
>>> site: http://www.dbsight.net
>>> demo: http://search.dbsight.com
>>> Lucene Database Search in 3 minutes:
>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>> DBSight customer, a shopping comparison site, (anonymous per request) got
>>> 2.6 Million Euro funding!
>>>
>>> On Tue, Sep 9, 2008 at 10:14 PM, robert engels <[EMAIL PROTECTED]>
>>> wrote:
>>> A searcher uses an IndexReader - the IndexReader is slow to open, not a
>>> Searcher. And searchers can share an IndexReader.
>>>
>>> You want to create a single shared (across all threads/users) IndexReader
>>> (usually), and create a Searcher as needed and dispose.  It is VERY CHEAP
>>> to create the Searcher.
>>>
>>> I am fairly certain the javadoc on Searcher is incorrect.  The warning
>>> "For performance reasons it is recommended to open only one IndexSearcher
>>> and use it for all of your searches" is not true in the case where an
>>> IndexReader is passed to the ctor.
>>>
>>> Any caching should USUALLY be performed at the IndexReader level.
>>>
>>> You are most likely using the "path" ctor, and that is the source of your
>>> problems, as multiple IndexReader instances are being created, and thus the
>>> memory use.
>>>
>>>
>>> On Sep 9, 2008, at 11:44 PM, Chris Lu wrote:
>>>
>>>  In a J2EE environment, there is usually a searcher pool with several
 searchers open.
 The cost of opening a large index for every user is not acceptable.
>>

Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread Michael McCandless


Chris,

After you close your IndexSearcher/Reader, is it possible you're still  
holding a reference to it?


Mike
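The question matters because close() releases a resource's behavior, not its memory: if a stale strong reference survives, say an entry never evicted from a searcher pool, the closed object and everything it drags along stays on the heap. A hedged sketch with hypothetical names (these are not Lucene classes):

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

public class StaleReferenceDemo {
    static final List<Closeable> searcherPool = new ArrayList<Closeable>();

    static class HeavySearcher implements Closeable {
        final byte[] payload = new byte[1 << 20];   // stands in for index data
        boolean closed;
        public void close() { closed = true; }      // does NOT make 'payload' collectible
    }

    public static void main(String[] args) {
        HeavySearcher s = new HeavySearcher();
        searcherPool.add(s);      // pooled for reuse
        s.close();                // closed, but...
        // ...the pool still references it, so 'payload' cannot be collected:
        System.out.println("still pooled after close: " + searcherPool.contains(s));
        searcherPool.remove(s);   // the fix: drop the reference as well as closing
    }
}
```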

Chris Lu wrote:

Frankly I don't know why TermInfosReader.ThreadResources is not  
showing up in the memory snapshot.


Yes, it has been there for a long time. But let's look at what changed: an
LRU termInfoCache was added.
Previously the SegmentTermEnum would be released, since it's a relatively
simple object.
But with a cache added to the same ThreadResources class, which holds many
objects, and with the threads still hanging around, the cache cannot be
released; in turn the SegmentTermEnum cannot be released, so the
RAMDirectory cannot be released.


My test is too coupled with the software I am working on and not  
easy to post here. But here is a similar case from another user:


---
I found a forum post from you here [1] where you mention that you
have a memory leak using the Lucene RAM directory. I'd like to ask you
if you have already resolved the problem and how you did it, or maybe
you know where I can read about the solution. We are using
RAMDirectory too and found that over time the memory
consumption rises and rises until the system breaks down, but only
when we perform many index updates. If we only create the index and
do nothing except search it, it works fine.
---

--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: 
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!


On Wed, Sep 10, 2008 at 2:45 AM, Michael McCandless <[EMAIL PROTECTED] 
> wrote:


I still don't quite understand what's causing your memory growth.

SegmentTermEnum instances have been held in a ThreadLocal cache in
TermInfosReader for a very long time (at least since Lucene 1.4).


If indeed it's the RAMDir's contents being kept "alive" due to this,  
then, you should have already been seeing this problem before rev  
659602.  And I still don't get why your reference tree is missing  
the TermInfosReader.ThreadResources class.


I'd like to understand the root cause before we hash out possible  
solutions.


Can you post the sources for your load test?

Mike


Chris Lu wrote:

Actually, even if I only use one IndexReader, some resources are cached
via the ThreadLocal cache, and can not be released unless all  
threads do the close action.


SegmentTermEnum itself is small, but it holds RAMDirectory along the  
path, which is big.


--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: 
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!


On Tue, Sep 9, 2008 at 10:43 PM, robert engels  
<[EMAIL PROTECTED]> wrote:

You do not need a pool of IndexReaders...

It does not matter what class it is, what matters is the class that  
ultimately holds the reference.


If the IndexReader is never closed, the SegmentReader(s) is never  
closed, so the thread local in TermInfosReader is not cleared  
(because the thread never dies). So you will get one  
SegmentTermEnum, per thread * per segment.


The SegmentTermEnum is not a large object, so even if you had 100  
threads, and 100 segments, for 10k instances, seems hard to believe  
that is the source of your memory issue.


The SegmentTermEnum is cached by thread since it needs to enumerate  
the terms, not having a per thread cache, would lead to lots of  
random access when multiple threads read the index - very slow.


You need to keep in mind, what if every thread was executing a  
search simultaneously - you would still have 100x100 SegmentTermEnum  
instances anyway !  The only way to prevent that would be to create  
and destroy the SegmentTermEnum on each call (opening and seeking to  
the proper spot) - which would be SLOW SLOW SLOW.


On Sep 10, 2008, at 12:19 AM, Chris Lu wrote:

I have tried to create an IndexReader pool and dynamically create  
searcher. But the memory leak is the same. It's not related to the  
Searcher class specifically, but the SegmentTermEnum in  
TermInfosReader.


--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: 
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous p

Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread Michael McCandless


Good question.

As far as I can tell, nowhere in Lucene do we put a SegmentTermEnum  
directly into ThreadLocal, after rev 659602.


Is it possible that output came from a run with Lucene before rev  
659602?


Mike

Chris Lu wrote:

Is it possible that some other place is using SegmentTermEnum as a
ThreadLocal?
This may explain why TermInfosReader.ThreadResources is not in the
memory snapshot.


--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: 
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!


On Wed, Sep 10, 2008 at 2:45 AM, Michael McCandless <[EMAIL PROTECTED] 
> wrote:


I still don't quite understand what's causing your memory growth.

SegmentTermEnum instances have been held in a ThreadLocal cache in
TermInfosReader for a very long time (at least since Lucene 1.4).


If indeed it's the RAMDir's contents being kept "alive" due to this,  
then, you should have already been seeing this problem before rev  
659602.  And I still don't get why your reference tree is missing  
the TermInfosReader.ThreadResources class.


I'd like to understand the root cause before we hash out possible  
solutions.


Can you post the sources for your load test?

Mike


Chris Lu wrote:

Actually, even if I only use one IndexReader, some resources are cached
via the ThreadLocal cache, and can not be released unless all  
threads do the close action.


SegmentTermEnum itself is small, but it holds RAMDirectory along the  
path, which is big.


--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: 
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!


On Tue, Sep 9, 2008 at 10:43 PM, robert engels  
<[EMAIL PROTECTED]> wrote:

You do not need a pool of IndexReaders...

It does not matter what class it is; what matters is the class that  
ultimately holds the reference.


If the IndexReader is never closed, the SegmentReader(s) is never  
closed, so the thread local in TermInfosReader is not cleared  
(because the thread never dies). So you will get one  
SegmentTermEnum, per thread * per segment.


The SegmentTermEnum is not a large object, so even if you had 100  
threads and 100 segments, for 10k instances, it seems hard to believe  
that is the source of your memory issue.


The SegmentTermEnum is cached per thread since it needs to enumerate  
the terms; not having a per-thread cache would lead to lots of  
random access when multiple threads read the index - very slow.


You need to keep in mind: what if every thread were executing a  
search simultaneously - you would still have 100x100 SegmentTermEnum  
instances anyway!  The only way to prevent that would be to create  
and destroy the SegmentTermEnum on each call (opening and seeking to  
the proper spot) - which would be SLOW SLOW SLOW.
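The per-thread caching being described can be sketched in plain Java. This is an illustrative model, not Lucene's actual code; PerThreadEnumCache and TermEnumStub are hypothetical names standing in for TermInfosReader's ThreadLocal and SegmentTermEnum:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative model of per-thread caching: each thread lazily gets
// its own enumerator, so sequential term scans in one thread never
// contend with, or re-seek because of, other threads.
public class PerThreadEnumCache {
    static final AtomicInteger LIVE_ENUMS = new AtomicInteger();

    // Hypothetical stand-in for SegmentTermEnum.
    static class TermEnumStub {
        TermEnumStub() { LIVE_ENUMS.incrementAndGet(); }
    }

    private final ThreadLocal<TermEnumStub> perThread =
        ThreadLocal.withInitial(TermEnumStub::new);

    // The same thread always gets the same instance back; each new
    // thread gets its own - hence "one per thread * per segment".
    TermEnumStub enumForCurrentThread() {
        return perThread.get();
    }
}
```

The instances live as long as their threads do, which is exactly why a thread pool whose threads never die keeps them all reachable.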


On Sep 10, 2008, at 12:19 AM, Chris Lu wrote:

I have tried to create an IndexReader pool and dynamically create  
searchers. But the memory leak is the same. It's not related to the  
Searcher class specifically, but to the SegmentTermEnum in  
TermInfosReader.


--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: 
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!


On Tue, Sep 9, 2008 at 10:14 PM, robert engels  
<[EMAIL PROTECTED]> wrote:
A searcher uses an IndexReader - the IndexReader is slow to open,  
not a Searcher. And searchers can share an IndexReader.


You want to create a single shared (across all threads/users)  
IndexReader (usually), and create a Searcher as needed and  
dispose of it.  It is VERY CHEAP to create the Searcher.


I am fairly certain the javadoc on Searcher is incorrect.  The  
warning "For performance reasons it is recommended to open only one  
IndexSearcher and use it for all of your searches" is not true in  
the case where an IndexReader is passed to the ctor.


Any caching should USUALLY be performed at the IndexReader level.

You are most likely using the "path" ctor, and that is the source of  
your problems, as multiple IndexReader instances are being created,  
and thus the memory use grows.
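The recommended pattern - one expensive shared reader, cheap throwaway searchers - can be modeled with stub types (SharedReader and CheapSearcher are illustrative names only; with the real API you would pass the one shared IndexReader to each IndexSearcher constructor):

```java
import java.util.List;

// Stub model of "share one IndexReader across threads, create a
// Searcher per request".
public class SearcherPattern {
    static class SharedReader {           // expensive: opened once, shared
        final List<String> docs;
        SharedReader(List<String> docs) { this.docs = docs; }
    }

    static class CheapSearcher {          // trivial wrapper: one per request
        private final SharedReader reader;
        CheapSearcher(SharedReader reader) { this.reader = reader; }
        long count(String term) {
            return reader.docs.stream().filter(d -> d.contains(term)).count();
        }
    }
}
```

Creating the wrapper costs almost nothing, so there is no need to pool it; only the shared reader is worth keeping alive.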



On Sep 9, 2008, at 11:44 PM, Chris Lu wrote:

On J2EE environment, usually there is a searcher pool with several  
searchers open.

The speed of opening a large index for every use

Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread robert engels
You do not need to create a new RAMDirectory - just write to the  
existing one, and then reopen() the IndexReader using it.


This will prevent lots of big objects being created. This may be the  
source of your problem.


Even if the Segment is closed, the ThreadLocal will no longer be  
referenced, but there will still be a reference to the  
SegmentTermEnum (which will be cleared when the thread dies, or "most  
likely" when new thread locals on that thread are created), so here is a  
potential problem.


Thread 1 does a search, creates a thread local that references the  
RAMDir (A).
Thread 2 does a search, creates a thread local that references the  
RAMDir (A).


All readers are closed on RAMDir (A).

A new RAMDir (B) is opened.

There may still be references in the thread local maps to RAMDir A  
(since no new thread locals have been created yet).


So you may get OOM depending on the size of the RAMDir (since you  
would need room for more than 1).  If you extend this out with lots  
of threads that don't run very often, you can see how you could  
easily run out of memory.  "I think" that ThreadLocal should use a  
ReferenceQueue so stale object slots can be reclaimed as soon as the  
key is dereferenced - but that is an issue for SUN.
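The retention scenario above needs nothing Lucene-specific to reproduce; a plain ThreadLocal plus an idle pooled thread shows it (a hedged sketch - the byte[] stands in for a RAMDirectory's contents):

```java
public class ThreadLocalRetention {
    // Stand-in for the per-thread cache slot inside a reader.
    static final ThreadLocal<byte[]> CACHE = new ThreadLocal<>();

    // Simulates a pool thread that served one search and then idles.
    public static Thread startIdleThreadHoldingBuffer(final int size) {
        Thread t = new Thread(() -> {
            CACHE.set(new byte[size]);         // references "RAMDir A"
            try {
                Thread.sleep(Long.MAX_VALUE);  // pool thread sits idle
            } catch (InterruptedException ignored) { }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

Even after every reader over the buffer is "closed", the array stays strongly reachable through the idle thread's ThreadLocalMap until that thread dies or overwrites its slot; multiply by pool size and RAMDir size and the OOM follows.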


This is why you don't want to create new RAMDirs.

A good rule of thumb - don't keep references to large objects in  
ThreadLocal (especially indirectly).  If needed, use a "key", and  
then read the cache using the "key".
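That "use a key" rule of thumb can be sketched as follows (an illustrative design, not Lucene code): the ThreadLocal holds only a small key, while the large objects live in an ordinary map that ANY thread can clear deterministically when the index is discarded.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// The ThreadLocal stores a cheap per-thread key; the big values live
// outside the ThreadLocalMap, so cleanup does not depend on thread death.
public class KeyedThreadCache<V> {
    private final Map<Long, V> byKey = new ConcurrentHashMap<>();
    private final ThreadLocal<Long> key =
        ThreadLocal.withInitial(() -> Thread.currentThread().getId());

    public void put(V value) { byKey.put(key.get(), value); }
    public V get() { return byKey.get(key.get()); }

    // Unlike values stored directly in a ThreadLocal, these can be
    // released immediately, without waiting for each thread to die.
    public void clearAll() { byKey.clear(); }
}
```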

This would be something for the Lucene folks to change.

On Sep 10, 2008, at 10:44 AM, Chris Lu wrote:

I really want to find out where I am going wrong, if that's the  
case.


Yes. I have made certain that I closed all Readers/Searchers, and  
verified that through memory profiler.


Yes. I am creating new RAMDirectories. But that's the problem: I need  
to update the content. Sure, with no content updates and everything  
the same, of course there is no OOM.


Yes. No guarantee of the thread schedule. But that's the problem:  
Lucene is using ThreadLocal to cache lots of things with the  
Thread as the key, and there is no telling when it'll be released.  
Of course ThreadLocal is not Lucene's problem...


Chris

On Wed, Sep 10, 2008 at 8:34 AM, robert engels  
<[EMAIL PROTECTED]> wrote:
It is basic Java. Threads are not guaranteed to run on any sort of  
schedule. If you create lots of large objects in one thread,  
releasing them in another, there is a good chance you will get an  
OOM (since the releasing thread may not run before the OOM  
occurs)...  This is not Lucene specific by any means.


It is a misunderstanding on your part about how GC works.

I assume you must at some point be creating new RAMDirectories -  
otherwise the memory would never really increase, since the  
IndexReader/enums/etc are not very large...


When you create a new RAMDirectories, you need to BE CERTAIN !!!  
that the other IndexReaders/Searchers using the old RAMDirectory  
are ALL CLOSED, otherwise their memory will still be in use, which  
leads to your OOM...



On Sep 10, 2008, at 10:16 AM, Chris Lu wrote:

I do not believe I am making any mistake. Actually I just got an  
email from another user, complaining about the same thing. And I  
am having the same usage pattern.


After the reader is opened, the RAMDirectory is shared by several  
objects.
There is one instance of RAMDirectory in the memory, and it is  
holding lots of memory, which is expected.


If I close the reader in the same thread that has opened it, the  
RAMDirectory is gone from the memory.
If I close the reader in other threads, the RAMDirectory is left  
in the memory, referenced along the tree I drew in the first email.


I do not think the usage is wrong. Period.

-
Hi,

   I found a forum post from you here [1] where you mention that you
have a memory leak using the lucene ram directory. I'd like to ask you
if you already have resolved the problem and how you did it, or maybe
you know where I can read about the solution. We are using
RAMDirectory too and figured out that over time the memory
consumption rises and rises until the system breaks down, but only
when we perform many index updates. If we only create the index and
do nothing except search it, it works fine.

maybe you can give me a hint or a link,
greetz,
-

--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/ 
index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!


On Wed, Sep 10, 2008 at 7:12 AM, robert engels  
<[EMAIL PROTECTED]> wrote:

Sorry, but I am fairly certain you are mistaken.

If you only have a single IndexReader, the RAMDirectory will be  
shared in all cases.



[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.coterminalPositionIncrement

2008-09-10 Thread Michael Semb Wever (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629846#action_12629846
 ] 

Michael Semb Wever commented on LUCENE-1380:


I suspected as much re the option name, but "coterminal" is a word I haven't used 
since high school.

> I'm -1 on the patch in its current form. If rewritten to modify the position 
> increment only for those shingles that begin at the same word, I'd be +1 
> (assuming it works and is tested appropriately).

As I said in the thread, your suggestion does not work.
Setting each shingle to have positionIncrement=1, so as to avoid using the 
MultiPhraseQuery in favour of the plain PhraseQuery, makes sense, but does not 
work. And not phrasing the query doesn't invoke the ShingleFilter properly.

> The ShingleFilter appears to only work, at least for me, on phrases.
> I would think this correct as each shingle is in fact a sub-phrase to the 
> larger original phrase.

If this is the case, i.e. ShingleFilter works on phrases as a whole entity, and 
the shingles from each term in the phrase do have a relationship as they all 
come from the one phrase, then does it not make sense to have the possibility 
to position them altogether?

For example in the current implementation, in the phrase "abcd efgh ijkl" it is 
the first term "abcd" that is responsible for generating the shingles "abcd 
efgh ijkl" and "abcd efgh". 
What says that these shingles couldn't be generated from the "efgh" (or "ijkl" 
for the former shingle) term in an alternative implementation?
Why the presumption that it's in the user's interest to force this separation 
between where this implementation chooses to put its shingles?
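For concreteness, the shingles in the "abcd efgh ijkl" example fall out like this (a plain-Java sketch of word-n-gram generation, not the actual ShingleFilter implementation):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {
    // Generates word n-gram shingles (sizes 2..maxSize) the way the
    // example describes: each shingle is emitted at the position of its
    // FIRST term, so "abcd" yields "abcd efgh" and "abcd efgh ijkl".
    public static List<String> shingles(String[] tokens, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            for (int n = 2; n <= maxSize && i + n <= tokens.length; n++) {
                out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
            }
        }
        return out;
    }
}
```

An alternative implementation could just as easily attribute each shingle to its last term instead; nothing in the n-gram itself forces the choice.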

If this isn't lost-in-the-bush-logic, have you a suggestion for a more 
appropriate option name for the current solution?

> Patch for ShingleFilter.coterminalPositionIncrement
> ---
>
> Key: LUCENE-1380
> URL: https://issues.apache.org/jira/browse/LUCENE-1380
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Reporter: Michael Semb Wever
> Fix For: 2.4
>
> Attachments: LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same 
> position.
> Default is to place each shingle at the same position as the unigram (or 
> first shingle if outputUnigrams=false). That is, each coterminal token has 
> positionIncrement=1 and every other token a positionIncrement=0. 
> This leads to a MultiPhraseQuery where at least one word/shingle must be 
> matched from each word/token. This is not always desired. 
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for 
> mailing list thread.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread robert engels
My review of trunk shows a SegmentReader contains a TermInfosReader,  
which contains a ThreadLocal of ThreadResources, which contains a  
SegmentTermEnum.


So there should be a ThreadResources in the memory profiler for each  
SegmentTermEnum instance - unless you have something goofy going on.


On Sep 10, 2008, at 11:05 AM, Michael McCandless wrote:



Good question.

As far as I can tell, nowhere in Lucene do we put a SegmentTermEnum  
directly into ThreadLocal, after rev 659602.


Is it possible that output came from a run with Lucene before rev  
659602?


Mike

Chris Lu wrote:

Is it possible that some other place is using SegmentTermEnum  
as a ThreadLocal value?
This may explain why TermInfosReader.ThreadResources is not in the  
memory snapshot.


--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/ 
index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!


On Wed, Sep 10, 2008 at 2:45 AM, Michael McCandless  
<[EMAIL PROTECTED]> wrote:


I still don't quite understand what's causing your memory growth.

SegmentTermEnum instances have been held in a ThreadLocal cache in  
TermInfosReader for a very long time (at least since Lucene 1.4).


If indeed it's the RAMDir's contents being kept "alive" due to  
this, then, you should have already been seeing this problem  
before rev 659602.  And I still don't get why your reference tree  
is missing the TermInfosReader.ThreadResources class.


I'd like to understand the root cause before we hash out possible  
solutions.


Can you post the sources for your load test?

Mike


Chris Lu wrote:

Actually, even if I only use one IndexReader, some resources are  
cached via the ThreadLocal cache and cannot be released unless  
all threads perform the close action.


SegmentTermEnum itself is small, but it holds RAMDirectory along  
the path, which is big.


--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/ 
index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!


On Tue, Sep 9, 2008 at 10:43 PM, robert engels  
<[EMAIL PROTECTED]> wrote:

You do not need a pool of IndexReaders...

It does not matter what class it is; what matters is the class  
that ultimately holds the reference.


If the IndexReader is never closed, the SegmentReader(s) is never  
closed, so the thread local in TermInfosReader is not cleared  
(because the thread never dies). So you will get one  
SegmentTermEnum, per thread * per segment.


The SegmentTermEnum is not a large object, so even if you had 100  
threads and 100 segments, for 10k instances, it seems hard to  
believe that is the source of your memory issue.


The SegmentTermEnum is cached per thread since it needs to  
enumerate the terms; not having a per-thread cache would lead to  
lots of random access when multiple threads read the index - very  
slow.


You need to keep in mind: what if every thread were executing a  
search simultaneously - you would still have 100x100  
SegmentTermEnum instances anyway!  The only way to prevent that  
would be to create and destroy the SegmentTermEnum on each call  
(opening and seeking to the proper spot) - which would be SLOW  
SLOW SLOW.


On Sep 10, 2008, at 12:19 AM, Chris Lu wrote:

I have tried to create an IndexReader pool and dynamically create  
searchers. But the memory leak is the same. It's not related to the  
Searcher class specifically, but to the SegmentTermEnum in  
TermInfosReader.


--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/ 
index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!


On Tue, Sep 9, 2008 at 10:14 PM, robert engels  
<[EMAIL PROTECTED]> wrote:
A searcher uses an IndexReader - the IndexReader is slow to open,  
not a Searcher. And searchers can share an IndexReader.


You want to create a single shared (across all threads/users)  
IndexReader (usually), and create a Searcher as needed and  
dispose of it.  It is VERY CHEAP to create the Searcher.


I am fairly certain the javadoc on Searcher is incorrect.  The  
warning "For performance reasons it is recommended to open only  
one IndexSearcher and use it for all of your searches" is not true  
in the case where an IndexReader is passed to the ctor.


Any caching should USUALLY be performed at the IndexReader level.

Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread robert engels
The other thing Lucene can do is create a SafeThreadLocal - it is  
rather trivial - and have it integrate at a higher level, allowing  
for manual clean-up across all threads.


It MIGHT be a bit slower than the JDK version (which uses  
heuristics to clear stale entries, and so doesn't always clear).


But it will be far more deterministic.

If someone is interested I can post the class, but I think it is well  
within the understanding of the core Lucene developers.
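One way such a SafeThreadLocal could look (my reconstruction of the idea sketched above, not robert's actual class): hold each per-thread value behind a WeakReference, keep the only hard references in a map keyed weakly by thread, and let close() purge them all deterministically from any thread.

```java
import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

public class SafeThreadLocal<T> {
    // Per-thread slot holds only a weak reference...
    private final ThreadLocal<WeakReference<T>> slot = new ThreadLocal<>();
    // ...while the hard references live here, keyed weakly by thread so
    // entries for dead threads drop out automatically.
    private final Map<Thread, T> hardRefs =
        Collections.synchronizedMap(new WeakHashMap<Thread, T>());

    public void set(T value) {
        slot.set(new WeakReference<T>(value));
        hardRefs.put(Thread.currentThread(), value);
    }

    public T get() {
        WeakReference<T> ref = slot.get();
        return ref == null ? null : ref.get();
    }

    // Deterministic cleanup, callable from ANY thread: once the hard
    // references are gone, each thread's weakly-held value becomes
    // collectible even if that thread never runs again.
    public void close() { hardRefs.clear(); }
}
```

The trade-off is a synchronized map access on set(), but cleanup no longer depends on thread death or on the JDK's lazy stale-entry expunging.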



On Sep 10, 2008, at 11:10 AM, robert engels wrote:

You do not need to create a new RAMDirectory - just write to the  
existing one, and then reopen() the IndexReader using it.


This will prevent lots of big objects being created. This may be  
the source of your problem.


Even if the Segment is closed, the ThreadLocal will no longer be  
referenced, but there will still be a reference to the  
SegmentTermEnum (which will be cleared when the thread dies, or  
"most likely" when new thread locals on that thread are created), so  
here is a potential problem.


Thread 1 does a search, creates a thread local that references the  
RAMDir (A).
Thread 2 does a search, creates a thread local that references the  
RAMDir (A).


All readers are closed on RAMDir (A).

A new RAMDir (B) is opened.

There may still be references in the thread local maps to RAMDir A  
(since no new thread locals have been created yet).


So you may get OOM depending on the size of the RAMDir (since you  
would need room for more than 1).  If you extend this out with lots  
of threads that don't run very often, you can see how you could  
easily run out of memory.  "I think" that ThreadLocal should use a  
ReferenceQueue so stale object slots can be reclaimed as soon as  
the key is dereferenced - but that is an issue for SUN.


This is why you don't want to create new RAMDirs.

A good rule of thumb - don't keep references to large objects in  
ThreadLocal (especially indirectly).  If needed, use a "key", and  
then read the cache using the "key".

This would be something for the Lucene folks to change.

On Sep 10, 2008, at 10:44 AM, Chris Lu wrote:

I really want to find out where I am going wrong, if that's the  
case.


Yes. I have made certain that I closed all Readers/Searchers, and  
verified that through memory profiler.


Yes. I am creating new RAMDirectories. But that's the problem: I  
need to update the content. Sure, with no content updates and  
everything the same, of course there is no OOM.


Yes. No guarantee of the thread schedule. But that's the problem:  
Lucene is using ThreadLocal to cache lots of things with the  
Thread as the key, and there is no telling when it'll be released.  
Of course ThreadLocal is not Lucene's problem...


Chris

On Wed, Sep 10, 2008 at 8:34 AM, robert engels  
<[EMAIL PROTECTED]> wrote:
It is basic Java. Threads are not guaranteed to run on any sort of  
schedule. If you create lots of large objects in one thread,  
releasing them in another, there is a good chance you will get an  
OOM (since the releasing thread may not run before the OOM  
occurs)...  This is not Lucene specific by any means.


It is a misunderstanding on your part about how GC works.

I assume you must at some point be creating new RAMDirectories -  
otherwise the memory would never really increase, since the  
IndexReader/enums/etc are not very large...


When you create a new RAMDirectories, you need to BE CERTAIN !!!  
that the other IndexReaders/Searchers using the old RAMDirectory  
are ALL CLOSED, otherwise their memory will still be in use, which  
leads to your OOM...



On Sep 10, 2008, at 10:16 AM, Chris Lu wrote:

I do not believe I am making any mistake. Actually I just got an  
email from another user, complaining about the same thing. And I  
am having the same usage pattern.


After the reader is opened, the RAMDirectory is shared by several  
objects.
There is one instance of RAMDirectory in the memory, and it is  
holding lots of memory, which is expected.


If I close the reader in the same thread that has opened it, the  
RAMDirectory is gone from the memory.
If I close the reader in other threads, the RAMDirectory is left  
in the memory, referenced along the tree I drew in the first email.


I do not think the usage is wrong. Period.

-
Hi,

   I found a forum post from you here [1] where you mention that you
have a memory leak using the lucene ram directory. I'd like to ask you
if you already have resolved the problem and how you did it, or maybe
you know where I can read about the solution. We are using
RAMDirectory too and figured out that over time the memory
consumption rises and rises until the system breaks down, but only
when we perform many index updates. If we only create the index and
do nothing except search it, it works fine.

maybe you can give me a hint or a link,
greetz,
-

--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application

Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread Chris Lu
Thanks for the analysis, really appreciate it, and I agree with it. But...
This is really a normal J2EE use case. The threads seldom die.
Doesn't that mean closing the RAMDirectory doesn't work for J2EE
applications?
And only reopen() works?
And close() doesn't release the resources? duh...

I can only say this is a problem to be cleaned up.

-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!


On Wed, Sep 10, 2008 at 9:10 AM, robert engels <[EMAIL PROTECTED]>wrote:

> You do not need to create a new RAMDirectory - just write to the existing
> one, and then reopen() the IndexReader using it.
> This will prevent lots of big objects being created. This may be the source
> of your problem.
>
> Even if the Segment is closed, the ThreadLocal will no longer be
> referenced, but there will still be a reference to the SegmentTermEnum
> (which will be cleared when the thread dies, or "most likely" when new
> thread locals on that thread are created), so here is a potential problem.
>
> Thread 1 does a search, creates a thread local that references the RAMDir
> (A).
> Thread 2 does a search, creates a thread local that references the RAMDir
> (A).
>
> All readers are closed on RAMDir (A).
>
> A new RAMDir (B) is opened.
>
> There may still be references in the thread local maps to RAMDir A (since
> no new thread locals have been created yet).
>
> So you may get OOM depending on the size of the RAMDir (since you would
> need room for more than 1).  If you extend this out with lots of threads
> that don't run very often, you can see how you could easily run out of
> memory.  "I think" that ThreadLocal should use a ReferenceQueue so stale
> object slots can be reclaimed as soon as the key is dereferenced - but that
> is an issue for SUN.
>
> This is why you don't want to create new RAMDirs.
>
> A good rule of thumb - don't keep references to large objects in
> ThreadLocal (especially indirectly).  If needed, use a "key", and then read
> the cache using the "key".
> This would be something for the Lucene folks to change.
>
> On Sep 10, 2008, at 10:44 AM, Chris Lu wrote:
>
> I really want to find out where I am going wrong, if that's the case.
>
> Yes. I have made certain that I closed all Readers/Searchers, and verified
> that through memory profiler.
> Yes. I am creating new RAMDirectories. But that's the problem: I need to
> update the content. Sure, with no content updates and everything the same, of
> course there is no OOM.
>
> Yes. No guarantee of the thread schedule. But that's the problem: Lucene
> is using ThreadLocal to cache lots of things with the Thread as the key, and
> there is no telling when it'll be released. Of course ThreadLocal is not
> Lucene's problem...
>
> Chris
>
> On Wed, Sep 10, 2008 at 8:34 AM, robert engels <[EMAIL PROTECTED]>wrote:
>
>>  It is basic Java. Threads are not guaranteed to run on any sort of
>> schedule. If you create lots of large objects in one thread, releasing them
>> in another, there is a good chance you will get an OOM (since the releasing
>> thread may not run before the OOM occurs)...  This is not Lucene specific by
>> any means.
>> It is a misunderstanding on your part about how GC works.
>>
>> I assume you must at some point be creating new RAMDirectories - otherwise
>> the memory would never really increase, since the IndexReader/enums/etc are
>> not very large...
>>
>> When you create a new RAMDirectories, you need to BE CERTAIN !!! that the
>> other IndexReaders/Searchers using the old RAMDirectory are ALL CLOSED,
>> otherwise their memory will still be in use, which leads to your OOM...
>>
>>
>> On Sep 10, 2008, at 10:16 AM, Chris Lu wrote:
>>
>> I do not believe I am making any mistake. Actually I just got an email
>> from another user, complaining about the same thing. And I am having the
>> same usage pattern.
>> After the reader is opened, the RAMDirectory is shared by several objects.
>> There is one instance of RAMDirectory in the memory, and it is holding
>> lots of memory, which is expected.
>>
>> If I close the reader in the same thread that has opened it, the
>> RAMDirectory is gone from the memory.
>> If I close the reader in other threads, the RAMDirectory is left in the
>> memory, referenced along the tree I drew in the first email.
>>
>> I do not think the usage is wrong. Period.
>>
>> -
>>
>> Hi,
>>
>>i found a forum post from you here [1] where you mention that you
>> have a memory leak using the lucene ram directory. I'd like to ask you
>> if you already have resolved the problem and how you did it or maybe
>> you know where i can read about the solution. We are using
>> RAMDirectory too and figured out, 

Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread Chris Lu
Not holding searcher/reader. I did check that via memory snapshot.

-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!

On Wed, Sep 10, 2008 at 8:58 AM, Michael McCandless <
[EMAIL PROTECTED]> wrote:

>
> Chris,
>
> After you close your IndexSearcher/Reader, is it possible you're still
> holding a reference to it?
>
> Mike
>
>
> Chris Lu wrote:
>
>  Frankly I don't know why TermInfosReader.ThreadResources is not showing up
>> in the memory snapshot.
>>
>> Yes. It's been there for a long time. But let's see what's changed: an LRU
>> cache, termInfoCache, was added.
>> The SegmentTermEnum previously would be released, since it's a relatively
>> simple object.
>> But with a cache added to the same ThreadResources class, which holds many
>> objects, and with the threads still hanging around, the cache cannot be
>> released, so in turn the SegmentTermEnum cannot be released, so the
>> RAMDirectory cannot be released.
>>
>> My test is too coupled with the software I am working on and not easy to
>> post here. But here is a similar case from another user:
>>
>>
>> ---
>> I found a forum post from you here [1] where you mention that you
>> have a memory leak using the lucene ram directory. I'd like to ask you
>> if you already have resolved the problem and how you did it, or maybe
>> you know where I can read about the solution. We are using
>> RAMDirectory too and figured out that over time the memory
>> consumption rises and rises until the system breaks down, but only
>> when we perform many index updates. If we only create the index and
>> do nothing except search it, it works fine.
>>
>> ---
>>
>> --
>> Chris Lu
>> -
>> Instant Scalable Full-Text Search On Any Database/Application
>> site: http://www.dbsight.net
>> demo: http://search.dbsight.com
>> Lucene Database Search in 3 minutes:
>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>> DBSight customer, a shopping comparison site, (anonymous per request) got
>> 2.6 Million Euro funding!
>>
>> On Wed, Sep 10, 2008 at 2:45 AM, Michael McCandless <
>> [EMAIL PROTECTED]> wrote:
>>
>> I still don't quite understand what's causing your memory growth.
>>
>> SegmentTermEnum instances have been held in a ThreadLocal cache in
>> TermInfosReader for a very long time (at least since Lucene 1.4).
>>
>> If indeed it's the RAMDir's contents being kept "alive" due to this, then,
>> you should have already been seeing this problem before rev 659602.  And I
>> still don't get why your reference tree is missing the
>> TermInfosReader.ThreadResources class.
>>
>> I'd like to understand the root cause before we hash out possible
>> solutions.
>>
>> Can you post the sources for your load test?
>>
>> Mike
>>
>>
>> Chris Lu wrote:
>>
>> Actually, even if I only use one IndexReader, some resources are cached via
>> the ThreadLocal cache and cannot be released unless all threads perform the
>> close action.
>>
>> SegmentTermEnum itself is small, but it holds RAMDirectory along the path,
>> which is big.
>>
>> --
>> Chris Lu
>> -
>> Instant Scalable Full-Text Search On Any Database/Application
>> site: http://www.dbsight.net
>> demo: http://search.dbsight.com
>> Lucene Database Search in 3 minutes:
>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>> DBSight customer, a shopping comparison site, (anonymous per request) got
>> 2.6 Million Euro funding!
>>
>> On Tue, Sep 9, 2008 at 10:43 PM, robert engels <[EMAIL PROTECTED]>
>> wrote:
>> You do not need a pool of IndexReaders...
>>
>> It does not matter what class it is, what matters is the class that
>> ultimately holds the reference.
>>
>> If the IndexReader is never closed, the SegmentReader(s) is never closed,
>> so the thread local in TermInfosReader is not cleared (because the thread
>> never dies). So you will get one SegmentTermEnum, per thread * per segment.
>>
>> The SegmentTermEnum is not a large object, so even if you had 100 threads
>> and 100 segments, for 10k instances, it seems hard to believe that is the
>> source of your memory issue.
>>
>> The SegmentTermEnum is cached per thread since it needs to enumerate the
>> terms; not having a per-thread cache would lead to lots of random access
>> when multiple threads read the index - very slow.
>>
>> You need to keep in mind: what if every thread were executing a search
>> simultaneously - you would still have 100x100 SegmentTermEnum instances
>> anyway!  The only way to prevent that would be to create and destroy the
>> SegmentTermEnum on each call (opening and seeking to the proper spot) -
>> which would be SLOW SLOW SLOW.

Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread robert engels
Close() does work - it is just that the memory may not be freed until  
much later...


When working with VERY LARGE objects, this can be a problem.

On Sep 10, 2008, at 12:36 PM, Chris Lu wrote:

Thanks for the analysis, really appreciate it, and I agree with it.  
But...


This is really a normal J2EE use case. The threads seldom die.
Doesn't that mean closing the RAMDirectory doesn't work for J2EE  
applications?

And only reopen() works?
And close() doesn't release the resources? duh...

I can only say this is a problem to be cleaned up.

--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/ 
index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!



On Wed, Sep 10, 2008 at 9:10 AM, robert engels  
<[EMAIL PROTECTED]> wrote:
You do not need to create a new RAMDirectory - just write to the  
existing one, and then reopen() the IndexReader using it.


This will prevent lots of big objects being created. This may be  
the source of your problem.


Even if the Segment is closed, the ThreadLocal will no longer be
referenced, but there will still be a reference to the
SegmentTermEnum (which will be cleared when the thread dies, or
"most likely" when new thread locals on that thread are created),
so there is a potential problem.


Thread 1 does a search, creates a thread local that references the
RAMDir (A).
Thread 2 does a search, creates a thread local that references the
RAMDir (A).


All readers are closed on RAMDir (A).

A new RAMDir (B) is opened.

There may still be references in the thread local maps to RAMDir A
(since no new thread locals have been created yet).


So you may get an OOM depending on the size of the RAMDir (since you
would need room for more than one).  If you extend this out with lots
of threads that don't run very often, you can see how you could
easily run out of memory.  "I think" that ThreadLocal should use a
ReferenceQueue so stale object slots can be reclaimed as soon as
the key is dereferenced - but that is an issue for SUN.


This is why you don't want to create new RAMDirs.

A good rule of thumb - don't keep references to large objects in
ThreadLocal (especially indirectly).  If needed, use a "key", and
then read the cache using the "key".

This would be something for the Lucene folks to change.
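The "key" rule of thumb above can be sketched as follows. This is illustrative Java only, not Lucene's actual classes: the ThreadLocal holds just a small key, while the large object lives in a central map that close() can clear for every thread at once.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the "key" approach: per-thread state is only a small key;
// the big object sits in one shared map with a deterministic lifetime.
class KeyedCache {
    // Central cache: key -> large object (e.g. a directory's contents).
    private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

    // Each thread remembers only the small key, never the big value.
    private static final ThreadLocal<String> CURRENT_KEY = new ThreadLocal<>();

    static void put(String key, Object bigValue) {
        CACHE.put(key, bigValue);
        CURRENT_KEY.set(key);
    }

    static Object get() {
        String key = CURRENT_KEY.get();
        return key == null ? null : CACHE.get(key);
    }

    // close() frees the large object for ALL threads at once; stale keys
    // left behind in ThreadLocal maps pin only a short String.
    static void close(String key) {
        CACHE.remove(key);
    }
}
```

With this shape, a reader closed in one thread cannot leave a large object pinned in another thread's ThreadLocal map, which is exactly the failure mode described in this thread.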

On Sep 10, 2008, at 10:44 AM, Chris Lu wrote:

I really want to find out what I am doing wrong, if that's the  
case.


Yes. I have made certain that I closed all Readers/Searchers, and  
verified that through a memory profiler.


Yes. I am creating new RAMDirectory instances. But that's the problem.  
I need to update the content. Sure, with no content updates and  
everything the same, of course there is no OOM.


Yes. There is no guarantee of the thread schedule. But that's the problem.  
If Lucene is using ThreadLocal to cache lots of things with the  
Thread as the key, there is no telling when it will be released. Of course  
ThreadLocal itself is not Lucene's problem...


Chris

On Wed, Sep 10, 2008 at 8:34 AM, robert engels  
<[EMAIL PROTECTED]> wrote:
It is basic Java. Threads are not guaranteed to run on any sort of  
schedule. If you create lots of large objects in one thread,  
releasing them in another, there is a good chance you will get an  
OOM (since the releasing thread may not run before the OOM  
occurs)...  This is not Lucene specific by any means.


It is a misunderstanding on your part about how GC works.

I assume you must at some point be creating new RAMDirectories -  
otherwise the memory would never really increase, since the  
IndexReader/enums/etc are not very large...


When you create a new RAMDirectory, you need to BE CERTAIN !!!  
that the other IndexReaders/Searchers using the old RAMDirectory  
are ALL CLOSED, otherwise their memory will still be in use, which  
leads to your OOM...



On Sep 10, 2008, at 10:16 AM, Chris Lu wrote:

I do not believe I am making any mistake. Actually I just got an  
email from another user, complaining about the same thing. And I  
am having the same usage pattern.


After the reader is opened, the RAMDirectory is shared by several  
objects.
There is one instance of RAMDirectory in the memory, and it is  
holding lots of memory, which is expected.


If I close the reader in the same thread that has opened it, the  
RAMDirectory is gone from the memory.
If I close the reader in other threads, the RAMDirectory is left  
in the memory, referenced along the tree I draw in the first email.


I do not think the usage is wrong. Period.

-
Hi,

   i found a forum post from you here [1] where you mention that you
have a memory leak using the Lucene RAM directory. I'd like to  
ask you

if you have already resolved the problem and how you did it 

Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread Chris Lu
Not likely. Actually I made some changes to the Lucene source code and I can
see the changes in the memory snapshot. So it is the latest Lucene version.
-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!

On Wed, Sep 10, 2008 at 9:05 AM, Michael McCandless <
[EMAIL PROTECTED]> wrote:

>
> Good question.
>
> As far as I can tell, nowhere in Lucene do we put a SegmentTermEnum
> directly into ThreadLocal, after rev 659602.
>
> Is it possible that output came from a run with Lucene before rev 659602?
>
> Mike
>
>
> Chris Lu wrote:
>
>> Is it possible that some other place is using SegmentTermEnum as a
>> ThreadLocal?
>> This may explain why TermInfosReader.ThreadResources is not in the memory
>> snapshot.
>>
>> --
>> Chris Lu
>> -
>> Instant Scalable Full-Text Search On Any Database/Application
>> site: http://www.dbsight.net
>> demo: http://search.dbsight.com
>> Lucene Database Search in 3 minutes:
>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>> DBSight customer, a shopping comparison site, (anonymous per request) got
>> 2.6 Million Euro funding!
>>
>> On Wed, Sep 10, 2008 at 2:45 AM, Michael McCandless <
>> [EMAIL PROTECTED]> wrote:
>>
>> I still don't quite understand what's causing your memory growth.
>>
>> SegmentTermEnum instances have been held in a ThreadLocal cache in
>> TermInfosReader for a very long time (at least since Lucene 1.4).
>>
>> If indeed it's the RAMDir's contents being kept "alive" due to this, then,
>> you should have already been seeing this problem before rev 659602.  And I
>> still don't get why your reference tree is missing the
>> TermInfosReader.ThreadResources class.
>>
>> I'd like to understand the root cause before we hash out possible
>> solutions.
>>
>> Can you post the sources for your load test?
>>
>> Mike
>>
>>
>> Chris Lu wrote:
>>
>> Actually, even if I only use one IndexReader, some resources are cached via
>> the ThreadLocal cache and cannot be released unless all threads do the
>> close action.
>>
>> SegmentTermEnum itself is small, but it holds RAMDirectory along the path,
>> which is big.
>>
>> --
>> Chris Lu
>> -
>> Instant Scalable Full-Text Search On Any Database/Application
>> site: http://www.dbsight.net
>> demo: http://search.dbsight.com
>> Lucene Database Search in 3 minutes:
>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>> DBSight customer, a shopping comparison site, (anonymous per request) got
>> 2.6 Million Euro funding!
>>
>> On Tue, Sep 9, 2008 at 10:43 PM, robert engels <[EMAIL PROTECTED]>
>> wrote:
>> You do not need a pool of IndexReaders...
>>
>> It does not matter what class it is, what matters is the class that
>> ultimately holds the reference.
>>
>> If the IndexReader is never closed, the SegmentReader(s) are never closed,
>> so the thread local in TermInfosReader is not cleared (because the thread
>> never dies). So you will get one SegmentTermEnum per thread * per segment.
>>
>> The SegmentTermEnum is not a large object, so even with 100 threads
>> and 100 segments, for 10k instances, it seems hard to believe that is the
>> source of your memory issue.
>>
>> The SegmentTermEnum is cached per thread since it needs to enumerate the
>> terms; not having a per-thread cache would lead to lots of random access
>> when multiple threads read the index - very slow.
>>
>> You need to keep in mind: what if every thread were executing a search
>> simultaneously - you would still have 100x100 SegmentTermEnum instances
>> anyway!  The only way to prevent that would be to create and destroy the
>> SegmentTermEnum on each call (opening and seeking to the proper spot) -
>> which would be SLOW SLOW SLOW.
>>
>> On Sep 10, 2008, at 12:19 AM, Chris Lu wrote:
>>
>> I have tried creating an IndexReader pool and dynamically creating
>> searchers. But the memory leak is the same. It's not related to the Searcher
>> class specifically, but to the SegmentTermEnum in TermInfosReader.
>>
>> --
>> Chris Lu
>> -
>> Instant Scalable Full-Text Search On Any Database/Application
>> site: http://www.dbsight.net
>> demo: http://search.dbsight.com
>> Lucene Database Search in 3 minutes:
>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>> DBSight customer, a shopping comparison site, (anonymous per request) got
>> 2.6 Million Euro funding!
>>
>> On Tue, Sep 9, 2008 at 10:14 PM, robert engels <[EMAIL PROTECTED]>
>> wrote:
>> A searcher uses an IndexReader - the IndexReader is slow to open, not a
>> Searcher. And searchers can share an IndexReader.
>>
>> You want to create a

Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread Chris Lu
SafeThreadLocal is very interesting. It'll be good not only for Lucene, but
also other projects.

Could you please post it?
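For readers following along, here is a minimal sketch of what such a SafeThreadLocal could look like. This is an assumption about the idea described in the thread, not Robert's actual class: values are tracked per thread in one central map, so they can be purged deterministically across all threads instead of waiting for the JDK ThreadLocal's lazy stale-entry cleanup.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical SafeThreadLocal: a per-thread value store with manual,
// cross-thread clean-up (e.g. invoked when an index is closed).
class SafeThreadLocal<T> {
    private final Map<Thread, T> values = new ConcurrentHashMap<>();

    public void set(T value) {
        values.put(Thread.currentThread(), value);
    }

    public T get() {
        return values.get(Thread.currentThread());
    }

    public void remove() {
        values.remove(Thread.currentThread());
    }

    // The point of the exercise: manual clean-up across ALL threads, so a
    // large cached object cannot outlive its owner's close().
    // Note: entries for dead threads also persist until purgeAll() runs,
    // so the owner must call it at a well-defined point.
    public void purgeAll() {
        values.clear();
    }
}
```

As Robert notes, a hash lookup keyed by Thread may be slightly slower than the JDK's array-slot ThreadLocal, but the release point becomes deterministic.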

-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!

On Wed, Sep 10, 2008 at 9:41 AM, robert engels <[EMAIL PROTECTED]>wrote:

> The other thing Lucene can do is create a SafeThreadLocal - it is rather
> trivial - and have it integrate at a higher level, allowing for manual
> clean-up across all threads.
> It MIGHT be a bit slower than the JDK version (since that one uses
> heuristics to clear stale entries, and so doesn't always clear).
>
> But it will be far more deterministic.
>
> If someone is interested I can post the class, but I think it is well
> within the understanding of the core Lucene developers.
>
>
> On Sep 10, 2008, at 11:10 AM, robert engels wrote:
>
> You do not need to create a new RAMDirectory - just write to the existing
> one, and then reopen() the IndexReader using it.
> This will prevent lots of big objects being created. This may be the source
> of your problem.
>
> Even if the Segment is closed, the ThreadLocal will no longer be
> referenced, but there will still be a reference to the SegmentTermEnum
> (which will be cleared when the thread dies, or "most likely" when new
> thread locals on that thread are created), so there is a potential problem.
>
> Thread 1 does a search, creates a thread local that references the RAMDir
> (A).
> Thread 2 does a search, creates a thread local that references the RAMDir
> (A).
>
> All readers are closed on RAMDir (A).
>
> A new RAMDir (B) is opened.
>
> There may still be references in the thread local maps to RAMDir A (since
> no new thread locals have been created yet).
>
> So you may get an OOM depending on the size of the RAMDir (since you would
> need room for more than one).  If you extend this out with lots of threads
> that don't run very often, you can see how you could easily run out of
> memory.  "I think" that ThreadLocal should use a ReferenceQueue so stale
> object slots can be reclaimed as soon as the key is dereferenced - but that
> is an issue for SUN.
>
> This is why you don't want to create new RAMDirs.
>
> A good rule of thumb - don't keep references to large objects in
> ThreadLocal (especially indirectly).  If needed, use a "key", and then read
> the cache using the "key".
> This would be something for the Lucene folks to change.
>
> On Sep 10, 2008, at 10:44 AM, Chris Lu wrote:
>
> I really want to find out what I am doing wrong, if that's the case.
>
> Yes. I have made certain that I closed all Readers/Searchers, and verified
> that through a memory profiler.
> Yes. I am creating new RAMDirectory instances. But that's the problem. I
> need to update the content. Sure, with no content updates and everything
> the same, of course there is no OOM.
>
> Yes. There is no guarantee of the thread schedule. But that's the problem.
> If Lucene is using ThreadLocal to cache lots of things with the Thread as
> the key, there is no telling when it will be released. Of course ThreadLocal
> itself is not Lucene's problem...
>
> Chris
>
> On Wed, Sep 10, 2008 at 8:34 AM, robert engels <[EMAIL PROTECTED]>wrote:
>
>>  It is basic Java. Threads are not guaranteed to run on any sort of
>> schedule. If you create lots of large objects in one thread, releasing them
>> in another, there is a good chance you will get an OOM (since the releasing
>> thread may not run before the OOM occurs)...  This is not Lucene specific by
>> any means.
>> It is a misunderstanding on your part about how GC works.
>>
>> I assume you must at some point be creating new RAMDirectories - otherwise
>> the memory would never really increase, since the IndexReader/enums/etc are
>> not very large...
>>
>> When you create a new RAMDirectory, you need to BE CERTAIN !!! that the
>> other IndexReaders/Searchers using the old RAMDirectory are ALL CLOSED,
>> otherwise their memory will still be in use, which leads to your OOM...
>>
>>
>> On Sep 10, 2008, at 10:16 AM, Chris Lu wrote:
>>
>> I do not believe I am making any mistake. Actually I just got an email
>> from another user, complaining about the same thing. And I am having the
>> same usage pattern.
>> After the reader is opened, the RAMDirectory is shared by several objects.
>> There is one instance of RAMDirectory in the memory, and it is holding
>> lots of memory, which is expected.
>>
>> If I close the reader in the same thread that has opened it, the
>> RAMDirectory is gone from the memory.
>> If I close the reader in other threads, the RAMDirectory is left in the
>> memory, referenced along the tree I draw in the first email.
>>
>> I do not think the usage is wrong. Period.
>>
>> -
>>
>

Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread Chris Lu
Yeah, the timing is different. But it's an unknown, undetermined, and
uncontrollable time...
We can not ask the user,

while(memory is low){
  sleep(1000);
}
do_the_real_thing_an_hour_later


-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!

On Wed, Sep 10, 2008 at 10:39 AM, robert engels <[EMAIL PROTECTED]>wrote:

> Close() does work - it is just that the memory may not be freed until much
> later...
> When working with VERY LARGE objects, this can be a problem.
>
> On Sep 10, 2008, at 12:36 PM, Chris Lu wrote:
>
> Thanks for the analysis, really appreciate it, and I agree with it. But...
> This is really a normal J2EE use case. The threads seldom die.
> Doesn't that mean closing the RAMDirectory doesn't work for J2EE
> applications?
> And only reopen() works?
> And close() doesn't release the resources? duh...
>
> I can only say this is a problem to be cleaned up.
>
> --
> Chris Lu
> -
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes:
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
> DBSight customer, a shopping comparison site, (anonymous per request) got
> 2.6 Million Euro funding!
>
>
> On Wed, Sep 10, 2008 at 9:10 AM, robert engels <[EMAIL PROTECTED]>wrote:
>
>> You do not need to create a new RAMDirectory - just write to the existing
>> one, and then reopen() the IndexReader using it.
>> This will prevent lots of big objects being created. This may be the
>> source of your problem.
>>
>> Even if the Segment is closed, the ThreadLocal will no longer be
>> referenced, but there will still be a reference to the SegmentTermEnum
>> (which will be cleared when the thread dies, or "most likely" when new
>> thread locals on that thread are created), so there is a potential problem.
>>
>> Thread 1 does a search, creates a thread local that references the RAMDir
>> (A).
>> Thread 2 does a search, creates a thread local that references the RAMDir
>> (A).
>>
>> All readers are closed on RAMDir (A).
>>
>> A new RAMDir (B) is opened.
>>
>> There may still be references in the thread local maps to RAMDir A (since
>> no new thread locals have been created yet).
>>
>> So you may get an OOM depending on the size of the RAMDir (since you would
>> need room for more than one).  If you extend this out with lots of threads
>> that don't run very often, you can see how you could easily run out of
>> memory.  "I think" that ThreadLocal should use a ReferenceQueue so stale
>> object slots can be reclaimed as soon as the key is dereferenced - but that
>> is an issue for SUN.
>>
>> This is why you don't want to create new RAMDirs.
>>
>> A good rule of thumb - don't keep references to large objects in
>> ThreadLocal (especially indirectly).  If needed, use a "key", and then read
>> the cache using the "key".
>> This would be something for the Lucene folks to change.
>>
>> On Sep 10, 2008, at 10:44 AM, Chris Lu wrote:
>>
>> I really want to find out what I am doing wrong, if that's the case.
>>
>> Yes. I have made certain that I closed all Readers/Searchers, and verified
>> that through a memory profiler.
>> Yes. I am creating new RAMDirectory instances. But that's the problem. I
>> need to update the content. Sure, with no content updates and everything
>> the same, of course there is no OOM.
>>
>> Yes. There is no guarantee of the thread schedule. But that's the problem.
>> If Lucene is using ThreadLocal to cache lots of things with the Thread as
>> the key, there is no telling when it will be released. Of course
>> ThreadLocal itself is not Lucene's problem...
>>
>> Chris
>>
>> On Wed, Sep 10, 2008 at 8:34 AM, robert engels <[EMAIL PROTECTED]>wrote:
>>
>>>  It is basic Java. Threads are not guaranteed to run on any sort of
>>> schedule. If you create lots of large objects in one thread, releasing them
>>> in another, there is a good chance you will get an OOM (since the releasing
>>> thread may not run before the OOM occurs)...  This is not Lucene specific by
>>> any means.
>>> It is a misunderstanding on your part about how GC works.
>>>
>>> I assume you must at some point be creating new RAMDirectories -
>>> otherwise the memory would never really increase, since the
>>> IndexReader/enums/etc are not very large...
>>>
>>> When you create a new RAMDirectory, you need to BE CERTAIN !!! that the
>>> other IndexReaders/Searchers using the old RAMDirectory are ALL CLOSED,
>>> otherwise their memory will still be in use, which leads to your OOM...
>>>
>>>
>>> On Sep 10, 2008, at 10:16 AM, Chris Lu wrote:
>>>
>>> I do not believe I am making any mistake. Actually I just got an em

Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread robert engels

Why not just use reopen() and be done with it???
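The reopen() pattern being advocated here can be sketched with stand-in classes (these are not Lucene's real Reader types): refresh against the same underlying storage, and close the old instance only when a different one actually comes back, so at most one large directory stays alive at a time.

```java
// Generic model of the reopen-and-swap idiom, with hypothetical classes.
class Reader {
    final long version;
    boolean closed = false;

    Reader(long version) { this.version = version; }

    // Returns this same reader if nothing changed, or a fresh reader
    // over the SAME underlying storage if the index was updated.
    Reader reopen(long latestVersion) {
        return latestVersion == version ? this : new Reader(latestVersion);
    }

    void close() { closed = true; }
}

class ReopenDemo {
    // Swap in the refreshed reader; release the old one only if replaced.
    static Reader refresh(Reader current, long latestVersion) {
        Reader fresh = current.reopen(latestVersion);
        if (fresh != current) {
            current.close();
        }
        return fresh;
    }
}
```

The design point: because the storage is reused rather than rebuilt, stale per-thread references pin far less memory than a whole replacement RAMDirectory would.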

On Sep 10, 2008, at 12:48 PM, Chris Lu wrote:

Yeah, the timing is different. But it's an unknown, undetermined,  
and uncontrollable time...


We can not ask the user,

while(memory is low){
  sleep(1000);
}
do_the_real_thing_an_hour_later


--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/ 
index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!


On Wed, Sep 10, 2008 at 10:39 AM, robert engels  
<[EMAIL PROTECTED]> wrote:
Close() does work - it is just that the memory may not be freed  
until much later...


When working with VERY LARGE objects, this can be a problem.

On Sep 10, 2008, at 12:36 PM, Chris Lu wrote:

Thanks for the analysis, really appreciate it, and I agree with  
it. But...


This is really a normal J2EE use case. The threads seldom die.
Doesn't that mean closing the RAMDirectory doesn't work for J2EE  
applications?

And only reopen() works?
And close() doesn't release the resources? duh...

I can only say this is a problem to be cleaned up.

--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/ 
index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!



On Wed, Sep 10, 2008 at 9:10 AM, robert engels  
<[EMAIL PROTECTED]> wrote:
You do not need to create a new RAMDirectory - just write to the  
existing one, and then reopen() the IndexReader using it.


This will prevent lots of big objects being created. This may be  
the source of your problem.


Even if the Segment is closed, the ThreadLocal will no longer be
referenced, but there will still be a reference to the
SegmentTermEnum (which will be cleared when the thread dies, or
"most likely" when new thread locals on that thread are created),
so there is a potential problem.


Thread 1 does a search, creates a thread local that references the
RAMDir (A).
Thread 2 does a search, creates a thread local that references the
RAMDir (A).


All readers are closed on RAMDir (A).

A new RAMDir (B) is opened.

There may still be references in the thread local maps to RAMDir A
(since no new thread locals have been created yet).


So you may get an OOM depending on the size of the RAMDir (since you
would need room for more than one).  If you extend this out with
lots of threads that don't run very often, you can see how you
could easily run out of memory.  "I think" that ThreadLocal should
use a ReferenceQueue so stale object slots can be reclaimed as
soon as the key is dereferenced - but that is an issue for SUN.


This is why you don't want to create new RAMDirs.

A good rule of thumb - don't keep references to large objects in
ThreadLocal (especially indirectly).  If needed, use a "key", and
then read the cache using the "key".

This would be something for the Lucene folks to change.

On Sep 10, 2008, at 10:44 AM, Chris Lu wrote:

I really want to find out what I am doing wrong, if that's  
the case.


Yes. I have made certain that I closed all Readers/Searchers, and  
verified that through a memory profiler.


Yes. I am creating new RAMDirectory instances. But that's the problem.  
I need to update the content. Sure, with no content updates and  
everything the same, of course there is no OOM.


Yes. There is no guarantee of the thread schedule. But that's the problem.  
If Lucene is using ThreadLocal to cache lots of things with the  
Thread as the key, there is no telling when it will be released. Of course  
ThreadLocal itself is not Lucene's problem...


Chris

On Wed, Sep 10, 2008 at 8:34 AM, robert engels  
<[EMAIL PROTECTED]> wrote:
It is basic Java. Threads are not guaranteed to run on any sort  
of schedule. If you create lots of large objects in one thread,  
releasing them in another, there is a good chance you will get an  
OOM (since the releasing thread may not run before the OOM  
occurs)...  This is not Lucene specific by any means.


It is a misunderstanding on your part about how GC works.

I assume you must at some point be creating new RAMDirectories -  
otherwise the memory would never really increase, since the  
IndexReader/enums/etc are not very large...


When you create a new RAMDirectory, you need to BE CERTAIN !!!  
that the other IndexReaders/Searchers using the old RAMDirectory  
are ALL CLOSED, otherwise their memory will still be in use,  
which leads to your OOM...



On Sep 10, 2008, at 10:16 AM, Chris Lu wrote:

I do not believe I am making any mistake. Actually I just got an  
email from another user, complaining about the same thing. And I

Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread Chris Lu
Actually I am done with it by simply downgrading and not using r659602 and
later. The old version is cleaner and more consistent with the API, and
close() does mean close - not something complicated and unknown to most
users, which almost feels like a trap. And later on, if no changes happen
to this file, I will have to upgrade Lucene and manually remove the patch
Lucene-1195.

-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!

On Wed, Sep 10, 2008 at 10:56 AM, robert engels <[EMAIL PROTECTED]>wrote:

> Why not just use reopen() and be done with it???
>
> On Sep 10, 2008, at 12:48 PM, Chris Lu wrote:
>
> Yeah, the timing is different. But it's an unknown, undetermined, and
> uncontrollable time...
> We can not ask the user,
>
> while(memory is low){
>   sleep(1000);
> }
> do_the_real_thing_an_hour_later
>
>
> --
> Chris Lu
> -
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes:
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
> DBSight customer, a shopping comparison site, (anonymous per request) got
> 2.6 Million Euro funding!
>
> On Wed, Sep 10, 2008 at 10:39 AM, robert engels <[EMAIL PROTECTED]>wrote:
>
>> Close() does work - it is just that the memory may not be freed until much
>> later...
>> When working with VERY LARGE objects, this can be a problem.
>>
>> On Sep 10, 2008, at 12:36 PM, Chris Lu wrote:
>>
>> Thanks for the analysis, really appreciate it, and I agree with it. But...
>> This is really a normal J2EE use case. The threads seldom die.
>> Doesn't that mean closing the RAMDirectory doesn't work for J2EE
>> applications?
>> And only reopen() works?
>> And close() doesn't release the resources? duh...
>>
>> I can only say this is a problem to be cleaned up.
>>
>> --
>> Chris Lu
>> -
>> Instant Scalable Full-Text Search On Any Database/Application
>> site: http://www.dbsight.net
>> demo: http://search.dbsight.com
>> Lucene Database Search in 3 minutes:
>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>> DBSight customer, a shopping comparison site, (anonymous per request) got
>> 2.6 Million Euro funding!
>>
>>
>> On Wed, Sep 10, 2008 at 9:10 AM, robert engels <[EMAIL PROTECTED]>wrote:
>>
>>> You do not need to create a new RAMDirectory - just write to the existing
>>> one, and then reopen() the IndexReader using it.
>>> This will prevent lots of big objects being created. This may be the
>>> source of your problem.
>>>
>>> Even if the Segment is closed, the ThreadLocal will no longer be
>>> referenced, but there will still be a reference to the SegmentTermEnum
>>> (which will be cleared when the thread dies, or "most likely" when new
>>> thread locals on that thread are created), so there is a potential problem.
>>>
>>> Thread 1 does a search, creates a thread local that references the RAMDir
>>> (A).
>>> Thread 2 does a search, creates a thread local that references the RAMDir
>>> (A).
>>>
>>> All readers are closed on RAMDir (A).
>>>
>>> A new RAMDir (B) is opened.
>>>
>>> There may still be references in the thread local maps to RAMDir A (since
>>> no new thread locals have been created yet).
>>>
>>> So you may get an OOM depending on the size of the RAMDir (since you would
>>> need room for more than one).  If you extend this out with lots of threads
>>> that don't run very often, you can see how you could easily run out of
>>> memory.  "I think" that ThreadLocal should use a ReferenceQueue so stale
>>> object slots can be reclaimed as soon as the key is dereferenced - but that
>>> is an issue for SUN.
>>>
>>> This is why you don't want to create new RAMDirs.
>>>
>>> A good rule of thumb - don't keep references to large objects in
>>> ThreadLocal (especially indirectly).  If needed, use a "key", and then read
>>> the cache using the "key".
>>> This would be something for the Lucene folks to change.
>>>
>>> On Sep 10, 2008, at 10:44 AM, Chris Lu wrote:
>>>
>>>  I really want to find out what I am doing wrong, if that's the case.
>>>
>>> Yes. I have made certain that I closed all Readers/Searchers, and
>>> verified that through memory profiler.
>>> Yes. I am creating new RAMDirectory instances. But that's the problem. I
>>> need to update the content. Sure, with no content updates and everything
>>> the same, of course there is no OOM.
>>>
>>> Yes. No guarantee of the thread schedule. But that's the problem. If
>>> Lucene is using ThreadLocal to cache lots of things by the Thread as the
>>> key, and no idea when it'll be released. Of course ThreadLocal is no

Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread robert engels

Always your prerogative.

On Sep 10, 2008, at 1:15 PM, Chris Lu wrote:

Actually I am done with it by simply downgrading and not using  
r659602 and later.
The old version is cleaner and more consistent with the API, and  
close() does mean close - not something complicated and unknown to most  
users, which almost feels like a trap. And later on, if no changes  
happen to this file, I will have to upgrade Lucene and manually  
remove the patch Lucene-1195.


--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/ 
index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!


On Wed, Sep 10, 2008 at 10:56 AM, robert engels  
<[EMAIL PROTECTED]> wrote:

Why not just use reopen() and be done with it???

On Sep 10, 2008, at 12:48 PM, Chris Lu wrote:

Yeah, the timing is different. But it's an unknown, undetermined,  
and uncontrollable time...


We can not ask the user,

while(memory is low){
  sleep(1000);
}
do_the_real_thing_an_hour_later


--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/ 
index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!


On Wed, Sep 10, 2008 at 10:39 AM, robert engels  
<[EMAIL PROTECTED]> wrote:
Close() does work - it is just that the memory may not be freed  
until much later...


When working with VERY LARGE objects, this can be a problem.

On Sep 10, 2008, at 12:36 PM, Chris Lu wrote:

Thanks for the analysis, really appreciate it, and I agree with  
it. But...


This is really a normal J2EE use case. The threads seldom die.
Doesn't that mean closing the RAMDirectory doesn't work for J2EE  
applications?

And only reopen() works?
And close() doesn't release the resources? duh...

I can only say this is a problem to be cleaned up.

--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/ 
index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got 2.6 Million Euro funding!



On Wed, Sep 10, 2008 at 9:10 AM, robert engels  
<[EMAIL PROTECTED]> wrote:
You do not need to create a new RAMDirectory - just write to the  
existing one, and then reopen() the IndexReader using it.


This will prevent lots of big objects being created. This may be  
the source of your problem.


Even if the Segment is closed, the ThreadLocal will no longer be
referenced, but there will still be a reference to the
SegmentTermEnum (which will be cleared when the thread dies, or
"most likely" when new thread locals on that thread are created),
so there is a potential problem.


Thread 1 does a search, creates a thread local that references  
the RAMDir (A).
Thread 2 does a search, creates a thread local that references  
the RAMDir (A).


All readers are closed on RAMDir (A).

A new RAMDir (B) is opened.

There may still be references in the thread local maps to RAMDir
A (since no new thread locals have been created yet).


So you may get OOM depending on the size of the RAMDir (since you  
would need room for more than 1).  If you extend this out with  
lots of threads that don't run very often, you can see how you  
could easily run out of memory.  "I think" that ThreadLocal  
should use a ReferenceQueue so stale object slots can be  
reclaimed as soon as the key is dereferenced - but that is an  
issue for SUN.


This is why you don't want to create new RAMDirs.
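The retention pattern described above can be reproduced with plain JDK classes (no Lucene involved); the class and field names below are illustrative stand-ins, not Lucene API. A long-lived pooled thread caches a large buffer in a ThreadLocal; even after the caller drops every reference, the buffer stays strongly reachable through the worker thread's ThreadLocalMap:

```java
import java.lang.ref.WeakReference;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalLeakDemo {
    // Stands in for the per-thread cache (e.g. a SegmentTermEnum chain).
    static final ThreadLocal<byte[]> CACHE = new ThreadLocal<>();

    // Simulates a long-lived J2EE worker thread caching a large object
    // (standing in for "RAMDir A") in its ThreadLocalMap.
    public static WeakReference<byte[]> leakThroughPooledThread() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        byte[] ramDirA = new byte[4 * 1024 * 1024];
        pool.submit(() -> CACHE.set(ramDirA)).get(); // worker now holds it
        // When this frame pops, all *our* references are gone, but the pooled
        // thread keeps ramDirA strongly reachable via its ThreadLocalMap.
        return new WeakReference<>(ramDirA);
    }
}
```

Because the pooled thread never dies, the weak reference above will never clear, which is exactly the J2EE situation in this thread.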

A good rule of thumb - don't keep references to large objects in
ThreadLocal (especially indirectly).  If needed, use a "key", and
then read the cache using the "key".
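A hedged sketch of that key-based approach, with made-up class and method names: the large state lives in a WeakHashMap keyed by a small key object the caller holds, so dropping the key makes the state collectible without any per-thread cleanup.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// Illustrative only: a cache keyed by a small caller-held token instead of
// by Thread. When the caller drops the key, the WeakHashMap entry (and the
// large value it references) becomes eligible for collection.
public class KeyedCache<V> {
    private final Map<Object, V> cache =
            Collections.synchronizedMap(new WeakHashMap<>());

    public Object newKey() { return new Object(); }

    public void put(Object key, V largeState) { cache.put(key, largeState); }

    public V get(Object key) { return cache.get(key); }
}
```

The key is cheap to keep per request; the large state never outlives it.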

This would be something for the Lucene folks to change.

On Sep 10, 2008, at 10:44 AM, Chris Lu wrote:

I really want to find out where I am going wrong, if that's
the case.


Yes. I have made certain that I closed all Readers/Searchers,  
and verified that through memory profiler.


Yes. I am creating new RAMDirectory. But that's the problem. I  
need to update the content. Sure, if no content update and  
everything the same, of course no OOM.


Yes. No guarantee of the thread schedule. But that's the
problem: Lucene is using ThreadLocal to cache lots of things
with the Thread as the key, and there's no telling when it'll be
released. Of course ThreadLocal is not Lucene's problem...


Chris

On Wed, Sep 10, 2008 at 8:34 AM, robert engels  
<[EMAIL PROTECTED]> wrote:
It is basic Java. Threads

Re: Realtime Search for Social Networks Collaboration

2008-09-10 Thread Jason Rutherglen
Hi Mike,

There would be a new sorted list or something to replace the
hashtable?  Seems like an issue that is not solved.

Jason

On Tue, Sep 9, 2008 at 5:29 AM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
>
> This would just tap into the live hashtable that DocumentsWriter* maintain
> for the posting lists... except the docFreq will need to be copied away on
> reopen, I think.
>
> Mike
>
> Jason Rutherglen wrote:
>
>> Term dictionary?  I'm curious how that would be solved?
>>
>> On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless
>> <[EMAIL PROTECTED]> wrote:
>>>
>>> Yonik Seeley wrote:
>>>
> I think it's quite feasible, but, it'd still have a "reopen" cost in
> that
> any buffered delete by term or query would have to be "materialized"
> into
> docIDs on reopen.  Though, if this somehow turns out to be a problem,
> in
> the
> future we could do this materializing immediately, instead of
> buffering,
> if
> we already have a reader open.

 Right... it seems like re-using readers internally is something we
 could already be doing in IndexWriter.
>>>
>>> True.
>>>
> Flushing is somewhat tricky because any open RAM readers would then
> have
> to
> cutover to the newly flushed segment once the flush completes, so that
> the
> RAM buffer can be recycled for the next segment.

 Re-use of a RAM buffer doesn't seem like such a big deal.

 But, how would you maintain a static view of an index...?

 IndexReader r1 = indexWriter.getCurrentIndex()
 indexWriter.addDocument(...)
 IndexReader r2 = indexWriter.getCurrentIndex()

 I assume r1 will have a view of the index before the document was
 added, and r2 after?
>>>
>>> Right, getCurrentIndex would return a MultiReader that includes
>>> SegmentReader for each segment in the index, plus a "RAMReader" that
>>> searches the RAM buffer.  That RAMReader is a tiny shell class that would
>>> basically just record the max docID it's allowed to go up to (the docID
>>> as
>>> of when it was opened), and stop enumerating docIDs (eg in the TermDocs)
>>> when it hits a docID beyond that limit.
>>>
>>> For reading stored fields and term vectors, which are now flushed
>>> immediately to disk, we need to somehow get an IndexInput from the
>>> IndexOutputs that IndexWriter holds open on these files.  Or, maybe, just
>>> open new IndexInputs?
>>>
 Another thing that will help is if users could get their hands on the
 sub-readers of a multi-segment reader.  Right now that is hidden in
 MultiSegmentReader and makes updating anything incrementally
 difficult.
>>>
>>> Besides what's handled by MultiSegmentReader.reopen already, what else do
>>> you need to incrementally update?
>>>
>>> Mike
>>>
>>> -
>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> For additional commands, e-mail: [EMAIL PROTECTED]
>>>
>>>
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread Chris Lu
Well, the code is correct, because it can work if you avoid this trap. But it
fails to act as a good API.

I learned the inside details from you. I am not the only one that's trapped,
and more users will likely be trapped again, unless the javadoc describing the
close() function is changed. Actually, I didn't look at the javadoc of
close(), because shouldn't close() mean close(), not uncontrollably
delayed resource releasing? So I fear just changing the javadoc is not
enough.
-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!


On Wed, Sep 10, 2008 at 1:03 PM, robert engels <[EMAIL PROTECTED]>wrote:

> Always your prerogative.
>
> On Sep 10, 2008, at 1:15 PM, Chris Lu wrote:
>
> Actually I am done with it by simply downgrading and not using r659602
> and later. The old version is cleaner and more consistent with the API, and
> close() does mean close, not something complicated and unknown to most
> users, which almost feels like a trap. And later on, if no changes happen
> to this file, I will have to upgrade Lucene and manually remove the patch
> Lucene-1195.
>
> --
> Chris Lu
> -
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes:
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
> DBSight customer, a shopping comparison site, (anonymous per request) got
> 2.6 Million Euro funding!
>
> On Wed, Sep 10, 2008 at 10:56 AM, robert engels <[EMAIL PROTECTED]>wrote:
>
>> Why not just use reopen() and be done with it???
>>
>> On Sep 10, 2008, at 12:48 PM, Chris Lu wrote:
>>
>> Yeah, the timing is different. But it's an unknown, undetermined, and
>> uncontrollable time...
>> We can not ask the user,
>>
>> while(memory is low){
>>   sleep(1000);
>> }
>> do_the_real_thing_an_hour_later
>>
>>
>> --
>> Chris Lu
>> -
>> Instant Scalable Full-Text Search On Any Database/Application
>> site: http://www.dbsight.net
>> demo: http://search.dbsight.com
>> Lucene Database Search in 3 minutes:
>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>> DBSight customer, a shopping comparison site, (anonymous per request) got
>> 2.6 Million Euro funding!
>>
>> On Wed, Sep 10, 2008 at 10:39 AM, robert engels <[EMAIL PROTECTED]>wrote:
>>
>>> Close() does work - it is just that the memory may not be freed until
>>> much later...
>>> When working with VERY LARGE objects, this can be a problem.
>>>
>>> On Sep 10, 2008, at 12:36 PM, Chris Lu wrote:
>>>
>>> Thanks for the analysis, really appreciate it, and I agree with it.
>>> But...
>>> This is really a normal J2EE use case. The threads seldom die.
>>> Doesn't that mean closing the RAMDirectory doesn't work for J2EE
>>> applications?
>>> And only reopen() works?
>>> And close() doesn't release the resources? duh...
>>>
>>> I can only say this is a problem to be cleaned up.
>>>
>>> --
>>> Chris Lu
>>> -
>>> Instant Scalable Full-Text Search On Any Database/Application
>>> site: http://www.dbsight.net
>>> demo: http://search.dbsight.com
>>> Lucene Database Search in 3 minutes:
>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>> DBSight customer, a shopping comparison site, (anonymous per request) got
>>> 2.6 Million Euro funding!
>>>
>>>
>>> On Wed, Sep 10, 2008 at 9:10 AM, robert engels <[EMAIL PROTECTED]>wrote:
>>>
 You do not need to create a new RAMDirectory - just write to the
 existing one, and then reopen() the IndexReader using it.
 This will prevent lots of big objects being created. This may be the
 source of your problem.

 Even if the Segment is closed, the ThreadLocal will no longer be
 referenced, but there will still be a reference to the SegmentTermEnum
 (which will be cleared when the thread dies, or "most likely" when new
 thread locals on that thread are created), so there is a potential problem.

 Thread 1 does a search, creates a thread local that references the
 RAMDir (A).
 Thread 2 does a search, creates a thread local that references the
 RAMDir (A).

 All readers are closed on RAMDir (A).

 A new RAMDir (B) is opened.

 There may still be references in the thread local maps to RAMDir A
 (since no new thread locals have been created yet).

 So you may get OOM depending on the size of the RAMDir (since you would
 need room for more than 1).  If you extend this out with lots of threads
 that don't run very often, you can see how you could easily run out of
>>

[jira] Commented: (LUCENE-1344) Make the Lucene jar an OSGi bundle

2008-09-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629954#action_12629954
 ] 

Nicolas Lalevée commented on LUCENE-1344:
-

About the missing headers in the maven jar: this is weird, because they exist in 
every other jar in the distrib but not in the maven one. And it is even stranger 
to see that the manifest of the lucene core jar is in fact the manifest of the 
demo one... And I retested without the patch, and everything works correctly. I 
don't yet see how this can happen.

And the META-INF/MANIFEST.MF file doesn't have to be updated when releasing. 
The build process is overriding the header entries. The file is mainly a 
template. See the {{build-bundle-manifest}} macro in the patch.
But it will have to be updated after the release, just like the 
common-build.xml, to update the version number.

> Make the Lucene jar an OSGi bundle
> --
>
> Key: LUCENE-1344
> URL: https://issues.apache.org/jira/browse/LUCENE-1344
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Build
>Reporter: Nicolas Lalevée
>Assignee: Michael McCandless
> Fix For: 2.4
>
> Attachments: LUCENE-1344-r679133.patch, LUCENE-1344-r690675.patch, 
> LUCENE-1344-r690691.patch, MANIFEST.MF.diff
>
>
> In order to use Lucene in an OSGi environment, some additional headers are 
> needed in the manifest of the jar. As Lucene has no dependency, it is pretty 
> straightforward, and it will be easy to maintain I think.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: docid set compression and boolean docid set operations

2008-09-10 Thread John Wang
Sorry, I meant lucene 2.4

-John

On Wed, Sep 10, 2008 at 2:08 PM, John Wang <[EMAIL PROTECTED]> wrote:

> Hi guys:
>
>  We have built this on top of the lucene 1.4 api/refactoring for docid
> sets and docIdIterator.
>
>  We've implemented the p4Delta compression algorithm presented at
> www2008: http://www2008.org/papers/fp618.html
>
>  We've been using this in production here at LinkedIn and would love to
> contribute it into lucene.
>
>  We currently open sourced it at:
> http://code.google.com/p/lucene-ext/wiki/Kamikaze
>
>  Please let us know if it is something you guys want to proceed with; if so,
> what are the steps we should take.
>
> Thanks
>
> -John
>
>


docid set compression and boolean docid set operations

2008-09-10 Thread John Wang
Hi guys:

 We have built this on top of the lucene 1.4 api/refactoring for docid
sets and docIdIterator.

 We've implemented the p4Delta compression algorithm presented at
www2008: http://www2008.org/papers/fp618.html

 We've been using this in production here at LinkedIn and would love to
contribute it into lucene.

 We currently open sourced it at:
http://code.google.com/p/lucene-ext/wiki/Kamikaze

 Please let us know if it is something you guys want to proceed with; if so,
what are the steps we should take.

Thanks

-John


Re: 2.4 status

2008-09-10 Thread John Wang
Looking forward to 2.4!

-John

On Tue, Sep 9, 2008 at 2:38 AM, Michael McCandless <
[EMAIL PROTECTED]> wrote:

>
> OK we are gradually whittling down the list.  It's down to 9 issues now.
>
> I have 2 issues, Grant has 3, Otis has 2 and Mark and Karl have 1 each.
>
> Can each of you try to finish your issues this week, or, take them off your
> plate / move to future?
>
> We are almost there!!
>
> I can be the release manager  It'll be my first time so there could be
> some "fun" ;)
>
> Mike
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


[jira] Resolved: (LUCENE-1366) Rename Field.Index.UN_TOKENIZED/TOKENIZED/NO_NORMS

2008-09-10 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1366.


Resolution: Fixed

Committed revision 694004.

> Rename Field.Index.UN_TOKENIZED/TOKENIZED/NO_NORMS
> --
>
> Key: LUCENE-1366
> URL: https://issues.apache.org/jira/browse/LUCENE-1366
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1366.patch, LUCENE-1366.patch
>
>
> There is confusion about these current Field options and I think we
> should rename them, deprecating the old names in 2.4/2.9 and removing
> them in 3.0.  How about this:
> {code}
> TOKENIZED --> ANALYZED
> UN_TOKENIZED --> NOT_ANALYZED
> NO_NORMS --> NOT_ANALYZED_NO_NORMS
> {code}
> Should we also add ANALYZED_NO_NORMS?
> Spinoff from here:
> 
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200808.mbox/%3C48a3076a.2679420a.1c53.a5c4%40mx.google.com%3E
> 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1195) Performance improvement for TermInfosReader

2008-09-10 Thread robert engels (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

robert engels updated LUCENE-1195:
--

Attachment: SafeThreadLocal.java

A "safe" ThreadLocal that can be used for more deterministic memory usage.

Probably a bit slower than the JDK ThreadLocal, due to the synchronization.

Offers a "purge()" method to force the cleanup of stale entries.  Probably most 
useful in code like this:

SomeLargeObject slo; // maybe a RAMDirectory?
try {
    slo = new SomeLargeObject(); // or other creation mechanism
} catch (OutOfMemoryError e) { // note: OOM is an Error, not an Exception, in Java
    SafeThreadLocal.purge();
    // now try again
    slo = new SomeLargeObject(); // or other creation mechanism
}
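The attached SafeThreadLocal.java itself is not reproduced in this mail; the following is only a sketch of how such a class might be built from the description above (a single synchronized map keyed weakly by thread, plus a purge() that drops entries for dead threads). The real attachment may differ in detail.

```java
import java.util.Map;
import java.util.WeakHashMap;

public class SafeThreadLocal<T> {
    // One global map, keyed weakly by Thread; each per-thread map is keyed
    // weakly by the SafeThreadLocal instance, so dropping either the thread
    // or the SafeThreadLocal releases the value.
    private static final Map<Thread, Map<SafeThreadLocal<?>, Object>> VALUES =
            new WeakHashMap<>();

    // Force cleanup of entries belonging to threads that have died.
    public static void purge() {
        synchronized (VALUES) {
            VALUES.keySet().removeIf(t -> !t.isAlive());
        }
    }

    public void set(T value) {
        synchronized (VALUES) {
            VALUES.computeIfAbsent(Thread.currentThread(), t -> new WeakHashMap<>())
                  .put(this, value);
        }
    }

    @SuppressWarnings("unchecked")
    public T get() {
        synchronized (VALUES) {
            Map<SafeThreadLocal<?>, Object> perThread = VALUES.get(Thread.currentThread());
            return perThread == null ? null : (T) perThread.get(this);
        }
    }
}
```

Every access takes the lock, which is the deterministic-memory/speed trade-off mentioned above.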




> Performance improvement for TermInfosReader
> ---
>
> Key: LUCENE-1195
> URL: https://issues.apache.org/jira/browse/LUCENE-1195
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.4
>
> Attachments: lucene-1195.patch, lucene-1195.patch, lucene-1195.patch, 
> SafeThreadLocal.java
>
>
> Currently we have a bottleneck for multi-term queries: the dictionary lookup 
> is being done
> twice for each term. The first time in Similarity.idf(), where 
> searcher.docFreq() is called.
> The second time when the posting list is opened (TermDocs or TermPositions).
> The dictionary lookup is not cheap, that's why a significant performance 
> improvement is
> possible here if we avoid the second lookup. An easy way to do this is to add 
> a small LRU 
> cache to TermInfosReader. 
> I ran some performance experiments with an LRU cache size of 20, and an 
> mid-size index of
> 500,000 documents from wikipedia. Here are some test results:
> 50,000 AND queries with 3 terms each:
> old:  152 secs
> new (with LRU cache): 112 secs (26% faster)
> 50,000 OR queries with 3 terms each:
> old:  175 secs
> new (with LRU cache): 133 secs (24% faster)
> For bigger indexes this patch will probably have less impact, for smaller 
> ones more.
> I will attach a patch soon.
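The "small LRU cache" described above can be sketched with java.util.LinkedHashMap in access order; the actual patch may implement the cache differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU cache: LinkedHashMap in access order, evicting the
// least-recently-used entry once the fixed capacity (20 in the experiments
// above) is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // true = access order, giving LRU behavior
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

With such a cache in front of the dictionary, the second lookup for a term just seen by Similarity.idf() becomes a map hit instead of a seek.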

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread Noble Paul നോബിള്‍ नोब्ळ्
When I look at the reference tree, that is the feeling I get. If you
held a WeakReference it would get released.
 |- base of org.apache.lucene.index.CompoundFileReader$CSIndexInput
  |- input of org.apache.lucene.index.SegmentTermEnum
  |- value of java.lang.ThreadLocal$ThreadLocalMap$Entry

On Wed, Sep 10, 2008 at 8:39 PM, Chris Lu <[EMAIL PROTECTED]> wrote:
> Does this make any difference?
> If intentionally closing the searcher and reader failed to release the
> memory, I can not rely on some magic of the JVM to release it.
> --
> Chris Lu
> -
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes:
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
> DBSight customer, a shopping comparison site, (anonymous per request) got
> 2.6 Million Euro funding!
>
> On Wed, Sep 10, 2008 at 4:03 AM, Noble Paul നോബിള്‍ नोब्ळ्
> <[EMAIL PROTECTED]> wrote:
>>
>> Why do you need to keep a strong reference?
>> Why not a WeakReference ?
>>
>> --Noble
>>
>> On Wed, Sep 10, 2008 at 12:27 AM, Chris Lu <[EMAIL PROTECTED]> wrote:
>> > The problem should be similar to what's talked about on this discussion.
>> > http://lucene.markmail.org/message/keosgz2c2yjc7qre?q=ThreadLocal
>> >
>> > There is a memory leak for Lucene search from Lucene-1195 (svn r659602,
>> > May 23, 2008)
>> >
>> > This patch brings in a ThreadLocal cache to TermInfosReader.
>> >
>> > It's usually recommended to keep the reader open, and reuse it when
>> > possible. In a common J2EE application, the http requests are usually
>> > handled by different threads. But since the cache is ThreadLocal, the
>> > cache
>> > is not really usable by other threads. What's worse, the cache can not
>> > be
>> > cleared by another thread!
>> >
>> > This leak is not so obvious usually. But my case is using RAMDirectory,
>> > having several hundred megabytes. So one un-released resource is obvious
>> > to
>> > me.
>> >
>> > Here is the reference tree:
>> > org.apache.lucene.store.RAMDirectory
>> >  |- directory of org.apache.lucene.store.RAMFile
>> >  |- file of org.apache.lucene.store.RAMInputStream
>> >  |- base of
>> > org.apache.lucene.index.CompoundFileReader$CSIndexInput
>> >  |- input of org.apache.lucene.index.SegmentTermEnum
>> >  |- value of java.lang.ThreadLocal$ThreadLocalMap$Entry
>> >
>> >
>> > After I switched back to svn revision 659601, right before this patch is
>> > checked in, the memory leak is gone.
>> > Although my case is RAMDirectory, I believe this will affect disk based
>> > index also.
>> >
>> > --
>> > Chris Lu
>> > -
>> > Instant Scalable Full-Text Search On Any Database/Application
>> > site: http://www.dbsight.net
>> > demo: http://search.dbsight.com
>> > Lucene Database Search in 3 minutes:
>> >
>> > http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>> > DBSight customer, a shopping comparison site, (anonymous per request)
>> > got
>> > 2.6 Million Euro funding!
>> >
>>
>>
>>
>> --
>> --Noble Paul
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>
>
>



-- 
--Noble Paul


Re: ThreadLocal causing memory leak with J2EE applications

2008-09-10 Thread robert engels
You can't hold the ThreadLocal value in a WeakReference, because  
there is no hard reference between enumeration calls (so it would be  
cleared out from under you while enumerating).


All of this occurs because you have some objects (readers/segments  
etc.) that are shared across all threads, but these contain objects  
that are 'thread/search state' specific. These latter objects are  
essentially "cached" for performance (so you don't need to seek and  
read, sequential buffer access, etc.)


A sometimes better solution is to have the state returned to the  
caller, and require the caller to pass/use the state later - then you  
don't need thread locals.


You can accomplish a similar solution by returning a "SessionKey"  
object, and have the caller pass this later.  You can then have a  
WeakHashMap of SessionKey,SearchState that the code can use.  When  
the SessionKey is destroyed (no longer referenced), the state map can  
be cleaned up automatically.
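A minimal sketch of that SessionKey idea, with made-up class names (this is not Lucene API): per-search state sits in a WeakHashMap keyed by an opaque key the caller retains, so the state becomes collectible as soon as the caller drops the key - no ThreadLocal needed.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

public class SearchSession {
    // Opaque token the caller holds for the lifetime of its "session".
    public static final class SessionKey { }

    // Weakly keyed: when a SessionKey is no longer referenced anywhere,
    // its state entry can be cleaned up automatically by the GC.
    private final Map<SessionKey, Object> stateByKey =
            Collections.synchronizedMap(new WeakHashMap<>());

    public SessionKey openSession(Object searchState) {
        SessionKey key = new SessionKey();
        stateByKey.put(key, searchState);
        return key;
    }

    public Object stateFor(SessionKey key) {
        return stateByKey.get(key);
    }
}
```

The caller passes the key back on each call instead of relying on which thread is executing.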




On Sep 10, 2008, at 11:30 PM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



When I look at the reference tree, that is the feeling I get. If you
held a WeakReference it would get released.
 |- base of org.apache.lucene.index.CompoundFileReader$CSIndexInput
  |- input of org.apache.lucene.index.SegmentTermEnum
  |- value of java.lang.ThreadLocal$ThreadLocalMap$Entry


On Wed, Sep 10, 2008 at 8:39 PM, Chris Lu <[EMAIL PROTECTED]> wrote:

Does this make any difference?
If intentionally closing the searcher and reader failed to release the
memory, I can not rely on some magic of the JVM to release it.
--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request) got

2.6 Million Euro funding!

On Wed, Sep 10, 2008 at 4:03 AM, Noble Paul  
നോബിള്‍ नोब्ळ्

<[EMAIL PROTECTED]> wrote:


Why do you need to keep a strong reference?
Why not a WeakReference ?

--Noble

On Wed, Sep 10, 2008 at 12:27 AM, Chris Lu <[EMAIL PROTECTED]>  
wrote:
The problem should be similar to what's talked about on this  
discussion.

http://lucene.markmail.org/message/keosgz2c2yjc7qre?q=ThreadLocal

There is a memory leak for Lucene search from Lucene-1195 (svn r659602,
May 23, 2008)

This patch brings in a ThreadLocal cache to TermInfosReader.

It's usually recommended to keep the reader open, and reuse it when
possible. In a common J2EE application, the http requests are  
usually
handled by different threads. But since the cache is  
ThreadLocal, the

cache
is not really usable by other threads. What's worse, the cache 
can not

be
cleared by another thread!

This leak is not so obvious usually. But my case is using  
RAMDirectory,
having several hundred megabytes. So one un-released resource is  
obvious

to
me.

Here is the reference tree:
org.apache.lucene.store.RAMDirectory
 |- directory of org.apache.lucene.store.RAMFile
 |- file of org.apache.lucene.store.RAMInputStream
 |- base of
org.apache.lucene.index.CompoundFileReader$CSIndexInput
 |- input of org.apache.lucene.index.SegmentTermEnum
  |- value of java.lang.ThreadLocal$ThreadLocalMap$Entry



After I switched back to svn revision 659601, right before this  
patch is

checked in, the memory leak is gone.
Although my case is RAMDirectory, I believe this will affect  
disk based

index also.

--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:

http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per  
request)

got
2.6 Million Euro funding!





--
--Noble Paul

 
-

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]









--
--Noble Paul



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1195) Performance improvement for TermInfosReader

2008-09-10 Thread robert engels (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630091#action_12630091
 ] 

robert engels commented on LUCENE-1195:
---

Also, SafeThreadLocal can be trivially changed to reduce the synchronization 
time by using a synchronized map - then only the access is sync'd.

Since a ThreadLocal in Lucene is primarily read (after initial creation), a Java 
1.5 lock designed for read-often, write-rarely use would be best.
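The read-often/write-rarely locking suggested above can be sketched with java.util.concurrent.locks.ReentrantReadWriteLock (in the JDK since Java 1.5); the cache shape here is illustrative, not the actual patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadMostlyCache<K, V> {
    private final Map<K, V> map = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public V get(K key) {
        lock.readLock().lock(); // shared: many readers proceed concurrently
        try {
            return map.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    public void put(K key, V value) {
        lock.writeLock().lock(); // exclusive: rare writes block everyone briefly
        try {
            map.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Readers never block each other, which fits a cache that is written once per term and read on every query.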



> Performance improvement for TermInfosReader
> ---
>
> Key: LUCENE-1195
> URL: https://issues.apache.org/jira/browse/LUCENE-1195
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.4
>
> Attachments: lucene-1195.patch, lucene-1195.patch, lucene-1195.patch, 
> SafeThreadLocal.java
>
>
> Currently we have a bottleneck for multi-term queries: the dictionary lookup 
> is being done
> twice for each term. The first time in Similarity.idf(), where 
> searcher.docFreq() is called.
> The second time when the posting list is opened (TermDocs or TermPositions).
> The dictionary lookup is not cheap, that's why a significant performance 
> improvement is
> possible here if we avoid the second lookup. An easy way to do this is to add 
> a small LRU 
> cache to TermInfosReader. 
> I ran some performance experiments with an LRU cache size of 20, and an 
> mid-size index of
> 500,000 documents from wikipedia. Here are some test results:
> 50,000 AND queries with 3 terms each:
> old:  152 secs
> new (with LRU cache): 112 secs (26% faster)
> 50,000 OR queries with 3 terms each:
> old:  175 secs
> new (with LRU cache): 133 secs (24% faster)
> For bigger indexes this patch will probably have less impact, for smaller 
> ones more.
> I will attach a patch soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1381) Hanging while indexing/digesting on multiple threads

2008-09-10 Thread David Fertig (JIRA)
Hanging while indexing/digesting on multiple threads


 Key: LUCENE-1381
 URL: https://issues.apache.org/jira/browse/LUCENE-1381
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.3.2
 Environment: Java HotSpot(TM) 64-Bit Server VM (1.5.0_16-b02 mixed 
mode) on 2.6.9-78.0.1.ELsmp #1 SMP x86_64 x86_64 x86_64 GNU/Linux
Reporter: David Fertig


With several older lucene projects already running and "stable", I have 
recently written a multi-threaded indexer using the 2.3.2 release.
My volume is in the millions of documents indexed daily and I have been stress 
testing for a while now.  My current setup has 3 JVMs, each running 6 threads 
indexing different documents, with 1 IndexWriter per JVM.  For stability 
testing, the indexer shuts down and exits every 5-10 minutes, with a new JVM 
started again for a clean restart. At this rate, I have noticed a rare, but 
eventually consistent internal hang/deadlock in all indexer threads while 
parsing documents.  My 'manager' thread is alive and regularly polling the 
indexer threads and displaying their state variables, but the indexer threads 
themselves appear not to be making progress while using up nearly 100% of 
available CPU.  Memory usage is relatively low and stable at 481m out of 2048m 
available. 

Most stack traces are in, and STAY in, this state even after repeated inspections: 
(pressing CTRL-\ in active JVM window)
--
Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.5.0_16-b02 mixed mode):

"Thread-6" prio=1 tid=0x002b25750920 nid=0x34f6 runnable 
[0x41465000..0x41465db0]
at java.util.WeakHashMap.eq(WeakHashMap.java:254)
at java.util.WeakHashMap.get(WeakHashMap.java:345)
at 
org.apache.commons.beanutils.MethodUtils.getMatchingAccessibleMethod(MethodUtils.java:530)
at 
org.apache.commons.beanutils.MethodUtils.invokeMethod(MethodUtils.java:209)
at 
org.apache.commons.digester.CallMethodRule.end(CallMethodRule.java:625)
at org.apache.commons.digester.Rule.end(Rule.java:230)
at org.apache.commons.digester.Digester.endElement(Digester.java:1130)
at 
com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633)
at 
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241)
at 
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685)
at 
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
at 
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
at 
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
at 
com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
at 
com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
at org.apache.commons.digester.Digester.parse(Digester.java:1685)
at 
com.cymfony.dci.lucene.IndexerThread.indexFile(IndexerThread.java:199)
at 
com.cymfony.dci.lucene.IndexerThread.indexDirectory(IndexerThread.java:142)
at com.cymfony.dci.lucene.IndexerThread.run(IndexerThread.java:81)

"Thread-5" prio=1 tid=0x002b25754eb0 nid=0x34f5 runnable [0x41364000..0x41364d30]
at java.lang.String.equals(String.java:858)
at org.apache.commons.beanutils.MethodUtils$MethodDescriptor.equals(MethodUtils.java:833)
at java.util.WeakHashMap.eq(WeakHashMap.java:254)
at java.util.WeakHashMap.get(WeakHashMap.java:345)
at org.apache.commons.beanutils.MethodUtils.getMatchingAccessibleMethod(MethodUtils.java:530)
at org.apache.commons.beanutils.MethodUtils.invokeMethod(MethodUtils.java:209)
at org.apache.commons.digester.CallMethodRule.end(CallMethodRule.java:625)
at org.apache.commons.digester.Rule.end(Rule.java:230)
at org.apache.commons.digester.Digester.endElement(Digester.java:1130)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
at com.sun.org.apache.xerces.internal.parsers.XML11Co

[jira] Updated: (LUCENE-1381) Hanging while indexing/digesting on multiple threads

2008-09-10 Thread David Fertig (JIRA)

 [ https://issues.apache.org/jira/browse/LUCENE-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Fertig updated LUCENE-1381:
-

Description: 
With several older Lucene projects already running and "stable", I have 
recently written a multi-threaded indexer using the 2.3.2 release.
My volume is in the millions of documents indexed daily, and I have been 
stress testing for a while now.  My current setup has 3 JVMs, each running 
6 threads indexing different documents, with 1 IndexWriter per JVM.  For 
stability testing, the indexer shuts down and exits every 5-10 minutes, and 
a new JVM is then started for a clean restart. At this rate, I have noticed 
a rare, but eventually consistent, internal hang/deadlock in all indexer 
threads while parsing documents.  My 'manager' thread is alive, regularly 
polling the indexer threads and displaying their state variables, but the 
indexer threads themselves appear to make no progress while using up nearly 
100% of available CPU.  Memory usage is relatively low and stable at 481m 
out of 2048m available. 

Most threads show stack traces like the following, and STAY in this state 
even after repeated inspections (pressing CTRL-\ in the active JVM window):
--
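[Editorial note, not part of the original report: the dumps below show every indexer thread busy inside WeakHashMap.get, reached through commons-beanutils' MethodUtils method cache via commons-digester. Concurrent, unsynchronized access to a shared hash map can degenerate into exactly this kind of 100%-CPU busy loop. A common mitigation for non-thread-safe parsing machinery such as Digester is to confine one instance to each thread rather than sharing. The sketch below illustrates that pattern with a hypothetical stand-in `Parser` class; it is not the reporter's code.]

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PerThreadParserDemo {
    // Stand-in for a non-thread-safe parser such as commons-digester's
    // Digester: it carries mutable state that must not be shared.
    static class Parser {
        int parsed;
        int parse(String doc) { return ++parsed; }
    }

    // Each thread lazily gets its own Parser instance on first use.
    static final ThreadLocal<Parser> PARSERS =
        ThreadLocal.withInitial(Parser::new);

    public static void main(String[] args) throws InterruptedException {
        Set<Parser> distinct = ConcurrentHashMap.newKeySet();
        Runnable indexer = () -> {
            Parser p = PARSERS.get();          // thread-confined instance
            for (int i = 0; i < 1000; i++) p.parse("doc" + i);
            distinct.add(p);
        };
        Thread[] threads = new Thread[6];      // mirrors 6 indexer threads
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(indexer, "Indexer-" + i);
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        // Six threads -> six independent parsers, no shared mutable state.
        System.out.println("distinct parsers: " + distinct.size());
    }
}
```

The trade-off is one parser instance per thread (extra memory) in exchange for removing all cross-thread contention on the parser's internal state.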
Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.5.0_16-b02 mixed mode):

"Thread-6" prio=1 tid=0x002b25750920 nid=0x34f6 runnable [0x41465000..0x41465db0]
at java.util.WeakHashMap.eq(WeakHashMap.java:254)
at java.util.WeakHashMap.get(WeakHashMap.java:345)
at org.apache.commons.beanutils.MethodUtils.getMatchingAccessibleMethod(MethodUtils.java:530)
at org.apache.commons.beanutils.MethodUtils.invokeMethod(MethodUtils.java:209)
at org.apache.commons.digester.CallMethodRule.end(CallMethodRule.java:625)
at org.apache.commons.digester.Rule.end(Rule.java:230)
at org.apache.commons.digester.Digester.endElement(Digester.java:1130)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
at org.apache.commons.digester.Digester.parse(Digester.java:1685)
...

"Thread-5" prio=1 tid=0x002b25754eb0 nid=0x34f5 runnable [0x41364000..0x41364d30]
at java.lang.String.equals(String.java:858)
at org.apache.commons.beanutils.MethodUtils$MethodDescriptor.equals(MethodUtils.java:833)
at java.util.WeakHashMap.eq(WeakHashMap.java:254)
at java.util.WeakHashMap.get(WeakHashMap.java:345)
at org.apache.commons.beanutils.MethodUtils.getMatchingAccessibleMethod(MethodUtils.java:530)
at org.apache.commons.beanutils.MethodUtils.invokeMethod(MethodUtils.java:209)
at org.apache.commons.digester.CallMethodRule.end(CallMethodRule.java:625)
at org.apache.commons.digester.Rule.end(Rule.java:230)
at org.apache.commons.digester.Digester.endElement(Digester.java:1130)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
at org.apache.commons.digester.Digester.parse(Digester.java:1685)
...

"Thread-4" prio=1 tid=0x002b25754860 nid=0x34f4 runnable [0x41263000..0x41263cb0]
at java.lang.String.