Re: RE: RE: About lucene memory consumption

2014-07-01 Thread wangzhijiang999
My application also hit this problem last year; I looked into the code and
found the reason.
The whole process is as follows:
1. When NRTCachingDirectory is used, it keeps a RAMDirectory as the cache and
(typically) an MMapDirectory as the delegate. New segments are created during
flush or merge, and NRTCachingDirectory uses the maxMergeSizeBytes and
maxCachedBytes parameters to decide whether to create a new segment in the
cache (in memory) or in the delegate (on disk).
2. When a flush creates a new segment, it compares the
context.flushInfo.estimatedSegmentSize of the new segment against those
parameters. If the new segment is small enough, it is created in the
RAMDirectory, otherwise in the MMapDirectory.
3. When a merge creates a new segment, it compares the
context.mergeInfo.estimatedMergeBytes of the new segment against those
parameters. If the new segment is small enough, it is created in the cache,
otherwise in the delegate.
4. But when the new segment is a compound index file (.cfs), whether produced
by flush or merge, IOContext.DEFAULT is used for that segment, and
estimatedMergeBytes and estimatedSegmentSize are both null for
IOContext.DEFAULT. As a result the new compound segment file is always created
in the cache, no matter how big it really is. This is the core issue (see the
sketch after this list).
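For illustration, here is a minimal sketch of that decision, loosely modeled on
NRTCachingDirectory.doCacheWrite() in Lucene 4.x (simplified, not the exact
source; maxMergeSizeBytes, maxCachedBytes and cache are the directory's
internal fields):

    // Simplified sketch of NRTCachingDirectory's cache-or-delegate decision.
    protected boolean doCacheWrite(String name, IOContext context) {
      long bytes = 0;
      if (context.mergeInfo != null) {
        bytes = context.mergeInfo.estimatedMergeBytes;   // merge case (point 3)
      } else if (context.flushInfo != null) {
        bytes = context.flushInfo.estimatedSegmentSize;  // flush case (point 2)
      }
      // With IOContext.DEFAULT both mergeInfo and flushInfo are null, so bytes
      // stays 0 and the file is cached no matter how large it really is --
      // exactly the compound-file problem in point 4.
      return bytes <= maxMergeSizeBytes
          && bytes + cache.sizeInBytes() <= maxCachedBytes;
    }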
 
Next I will explain how segments in the cache are released.
1. Normally, during commit, the sync operation flushes the newly created
segment files to disk and deletes them from the cache. But if a merge is
running during the sync, the segment created by that merge will not be synced
to disk in this commit, and the new merged compound segment file will be
created in the cache as described above.
2. When the NRT feature is used, the IndexSearcher gets SegmentReaders from
the IndexWriter via the getReader method, and there is a ReaderPool inside the
IndexWriter (see the setup sketch below). A new segment is first fetched from
the cache of NRTCachingDirectory; if it is not in the cache (because it was
created directly on disk, or a commit already moved it to disk and released it
from the cache), it is fetched from the delegate. The fetched segment reader is
put into the IndexWriter's ReaderPool. As described above, the segment created
by a merge is now in the cache, and once it is fetched it is referenced by the
IndexWriter's ReaderPool. During the next commit this segment is synced to disk
and released from the cache, but it is still referenced by the ReaderPool, so
you will see the IndexSearcher referencing a lot of RAMFiles whose contents are
already on disk. When can these RAMFiles be dropped? Only when those segments
take part in a new merge; then the old segments are released from the
IndexWriter's ReaderPool completely.
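For context, the NRT setup that exercises this path is roughly the following (a
sketch only; the path, analyzer, Version constant and cache sizes are made-up
example values, not taken from the original report):

    // Hypothetical NRT setup (Lucene 4.x); values below are illustrative.
    Directory delegate = new MMapDirectory(new File("/path/to/index"));
    NRTCachingDirectory dir =
        new NRTCachingDirectory(delegate, 5.0, 60.0); // maxMergeSizeMB, maxCachedMB
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_48);
    IndexWriter writer =
        new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_48, analyzer));

    // The NRT reader's SegmentReaders come out of the IndexWriter's ReaderPool,
    // so they can keep referencing RAMFiles even after a later commit has moved
    // the underlying files to disk.
    DirectoryReader reader = DirectoryReader.open(writer, true);
    IndexSearcher searcher = new IndexSearcher(reader);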
 
I modified the Lucene source code to solve this problem, in the
CompoundFileWriter class:
out = new DirectCFSIndexOutput(getOutput(), entry, false);  //original
out = new DirectCFSIndexOutput(getOutput(context), entry, false); //modified

IndexOutput createOutput(String name, IOContext context) throws IOException {
  ensureOpen();
  boolean success = false;
  boolean outputLocked = false;
  try {
    assert name != null : "name must not be null";
    if (entries.containsKey(name)) {
      throw new IllegalArgumentException("File " + name + " already exists");
    }
    final FileEntry entry = new FileEntry();
    entry.file = name;
    entries.put(name, entry);
    final String id = IndexFileNames.stripSegmentName(name);
    assert !seenIDs.contains(id) : "file=\"" + name + "\" maps to id=\"" + id
        + "\", which was already written";
    seenIDs.add(id);
    final DirectCFSIndexOutput out;
    if ((outputLocked = outputTaken.compareAndSet(false, true))) {
      // out = new DirectCFSIndexOutput(getOutput(), entry, false);      // original
      out = new DirectCFSIndexOutput(getOutput(context), entry, false);  // modified: pass the caller's IOContext through
    } else {
      entry.dir = this.directory;
      if (directory.fileExists(name)) {
        throw new IllegalArgumentException("File " + name + " already exists");
      }
      out = new DirectCFSIndexOutput(directory.createOutput(name, context), entry, true);
    }
    success = true;
    return out;
  } finally {
    if (!success) {
      entries.remove(name);
      if (outputLocked) { // release the output lock if not successful
        assert outputTaken.get();
        releaseOutputLock();
      }
    }
  }
}

private synchronized IndexOutput getOutput(IOContext context) throws IOException {
  // modified: the data file is now created with the real IOContext instead of
  // IOContext.DEFAULT, so NRTCachingDirectory can see the estimated size.
  if (dataOut == null) {
    boolean success = false;
    try {
      dataOut = directory.createOutput(dataFileName, context);
      CodecUtil.writeHeader(dataOut, DATA_CODEC, VERSION_CURRENT);
      success = true;
    } finally {
      if (!success) {
        IOUtils.closeWhileHandlingException(dataOut);
      }
    }
  }
  return dataOut;
}
 

RE: Re: RE: RE: About lucene memory consumption

2014-07-01 Thread Uwe Schindler
Hi Wang,

would it be possible to open a JIRA issue so we can track this?
In any case, I would recommend disabling compound files if you use
NRTCachingDirectory (as a workaround).
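A sketch of that workaround (not from the original mail; setter availability
varies across 4.x minor versions, so verify against your release):

    // Sketch: turn off compound files as a workaround (Lucene 4.x era APIs).
    // analyzer and dir are assumed to exist already.
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, analyzer);
    iwc.setUseCompoundFile(false);   // newly flushed segments (4.5+)
    TieredMergePolicy mp = new TieredMergePolicy();
    mp.setNoCFSRatio(0.0);           // never build .cfs for merged segments
    iwc.setMergePolicy(mp);
    IndexWriter writer = new IndexWriter(dir, iwc);

With compound files disabled, the IOContext.DEFAULT path described earlier is
not taken, so the oversized-.cfs-in-cache problem should not occur.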

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de



Re: Re: RE: RE: About lucene memory consumption

2014-07-01 Thread wangzhijiang999
Hi Uwe, 
   I already created the issue in JIRA: https://issues.apache.org/jira/browse/LUCENE-5800
 
 
 
 
 
Zhijiang Wang




Incremental Field Updates

2014-07-01 Thread Sandeep Khanzode
Hi,

I wanted to know the best approach to follow if a few fields in my indexed
documents change at run time (after indexing, and before or during search),
while the majority of them are created at index time.

I came across the JIRA given below, but it is scheduled for Lucene 4.9, I
believe.

There are a few other approaches, like maintaining a separate index for the
changing fields and using either a ParallelReader or a Join.

Could everyone share their experience with this scenario and how it is handled
in your systems? Thanks,

[LUCENE-4258] Incremental Field Updates through Stacked Segments - ASF JIRA
Shai and I would like to start working on the proposal to Incremental Field
Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex).
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode

Re: createNormalizedWeight

2014-07-01 Thread Pawel Rog
Hi,
Is there any way to pre-build rewritten queries and cache them somewhere?
I have a set of queries that is used very frequently, and I would get a
significant boost (10-20% of CPU is wasted) if I could skip rewriting (for
example by caching rewritten queries).
Thank you for any suggestions.
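One minimal way to cache rewritten queries could look like the sketch below; it
is an assumption, not an existing Lucene utility, and it is only safe while the
same IndexReader (or the same MemoryIndex contents) is used, since rewriting
can depend on the terms present in the reader:

    import java.io.IOException;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Query;

    // Hypothetical cache of rewritten queries, keyed by the original query.
    public class RewrittenQueryCache {
      private final Map<Query, Query> cache = new ConcurrentHashMap<>();

      public Query rewrite(IndexReader reader, Query query) throws IOException {
        Query rewritten = cache.get(query);
        if (rewritten == null) {
          // the expensive step we want to skip on subsequent lookups
          rewritten = query.rewrite(reader);
          cache.put(query, rewritten);
        }
        return rewritten;
      }
    }

Passing the already-rewritten query to the searcher should then make the
rewrite step inside createNormalizedWeight essentially a no-op for those
queries.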


On Mon, Jun 30, 2014 at 3:24 PM, Pawel Rog  wrote:

> Hi,
> Thank you Uwe. I see mostly ConstantScoreQuery, BooleanQuery and
> FilteredQuery. Maybe it is quite cheap for MI but I execute quite many
> queries on it and I was looking for optimizations.
>
> --
> Paweł
>
>
> On Mon, Jun 30, 2014 at 3:01 PM, Uwe Schindler  wrote:
>
>> Hi,
>>
>> Queries have to be rewritten, this has nothing to do with scoring. What
>> type of queries are you seeing this? Wildcard or text ranges are expensive,
>> there is no way around, but for MemoryIndex (I assume you mean this class),
>> this should be quite cheap.
>>
>> -
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>
>> > -Original Message-
>> > From: ppp.pawel...@gmail.com [mailto:ppp.pawel...@gmail.com] On
>> > Behalf Of Pawel Rog
>> > Sent: Monday, June 30, 2014 2:26 PM
>> > To: java-user@lucene.apache.org
>> > Subject: createNormalizedWeight
>> >
>> > Hi,
>> > I'm running queries over memory index and see in profiler significant
>> CPU
>> > usage on method createNormalizedWeight. Most of the time is spent on
>> > rewrite method.
>> >
>> > Is it any way to avoid it or optimize to reduce CPU usage on
>> > createNormalizedWeight? Scoring is not important for me at all. It only
>> want
>> > to know if query matches do document or not.
>> >
>> > --
>> > Regards,
>> > Paweł
>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>


Re: Incremental Field Updates

2014-07-01 Thread Erick Erickson
This JIRA is "complicated", don't really expect it in 4.9 as it's
been hanging around for quite a while. Everyone would like this,
but it's not easy.

Atomic updates will work, but you have to have stored="true" for all
source fields. Under the covers this actually reads the document
out of the stored fields, deletes the old one, and adds it
over again.
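At the raw Lucene level (the stored="true" wording above is Solr schema
syntax), that read-stored-fields / delete / re-add cycle looks roughly like the
sketch below; the field names ("id", "views") are made up, and an open
IndexWriter writer and Directory dir are assumed:

    // Hypothetical "atomic update" in plain Lucene: rebuild the document from
    // its stored fields, change one field, and re-add it. updateDocument()
    // deletes the old document matching the term and adds the new one.
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
    TopDocs hits = searcher.search(new TermQuery(new Term("id", "doc-42")), 1);
    Document oldDoc = searcher.doc(hits.scoreDocs[0].doc); // only stored fields survive

    Document newDoc = new Document();
    newDoc.add(new StringField("id", "doc-42", Field.Store.YES));
    // ...re-create every other unchanged field here, with its original type...
    long oldViews = oldDoc.getField("views").numericValue().longValue();
    newDoc.add(new LongField("views", oldViews + 1, Field.Store.YES));

    writer.updateDocument(new Term("id", "doc-42"), newDoc);
    writer.commit();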

FWIW,
Erick

On Tue, Jul 1, 2014 at 5:32 AM, Sandeep Khanzode
 wrote:
> Hi,
>
> I wanted to know of the best approach to follow if a few fields in my indexed 
> documents are changing at run time (after index and before or during search), 
> but a majority of them are created at index time.
>
> I could see the JIRA given below but it is scheduled for Lucene 4.9, I 
> believe.
>
> There are a few other approaches, like maintaining a separate index for 
> changing fields and use either a parallelreader or use a Join.
>
> Can everyone share their experience for this scenario on how it is handled in 
> your systems? Thanks,
>
> [LUCENE-4258] Incremental Field Updates through Stacked Segments - ASF JIRA
>
>
>  [LUCENE-4258] Incremental Field Updates through Stacked Segments - ASF JIRA
> Shai and I would like to start working on the proposal to Incremental Field 
> Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex).
>
>
> ---
> Thanks n Regards,
> Sandeep Ramesh Khanzode

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Incremental Field Updates

2014-07-01 Thread Shai Erera
Except that Lucene now offers efficient numeric and binary DocValues
updates. See IndexWriter.updateNumeric/Binary...
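For reference, the calls Shai mentions look roughly like this (the field and
term names are examples; the target field must already exist as a
numeric/binary DocValues field on those documents, and an open IndexWriter
writer is assumed):

    // In-place DocValues updates (numeric since Lucene 4.6, binary in later 4.x).
    writer.updateNumericDocValue(new Term("id", "doc-42"), "price", 1999L);
    writer.updateBinaryDocValue(new Term("id", "doc-42"), "payload",
        new BytesRef("new-value"));
    writer.commit();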
On Jul 1, 2014 5:51 PM, "Erick Erickson"  wrote:

> This JIRA is "complicated", don't really expect it in 4.9 as it's
> been hanging around for quite a while. Everyone would like this,
> but it's not easy.
>
> Atomic updates will work, but you have to stored="true" for all
> source fields. Under the covers this actually reads the document
> out of the stored fields, deletes the old one and adds it
> over again.
>
> FWIW,
> Erick
>
> On Tue, Jul 1, 2014 at 5:32 AM, Sandeep Khanzode
>  wrote:
> > Hi,
> >
> > I wanted to know of the best approach to follow if a few fields in my
> indexed documents are changing at run time (after index and before or
> during search), but a majority of them are created at index time.
> >
> > I could see the JIRA given below but it is scheduled for Lucene 4.9, I
> believe.
> >
> > There are a few other approaches, like maintaining a separate index for
> changing fields and use either a parallelreader or use a Join.
> >
> > Can everyone share their experience for this scenario on how it is
> handled in your systems? Thanks,
> >
> > [LUCENE-4258] Incremental Field Updates through Stacked Segments - ASF
> JIRA
> >
> >
> >  [LUCENE-4258] Incremental Field Updates through Stacked Segments - ASF
> JIRA
> > Shai and I would like to start working on the proposal to Incremental
> Field Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex
> ).
> >
> >
> > ---
> > Thanks n Regards,
> > Sandeep Ramesh Khanzode
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Incremental Field Updates

2014-07-01 Thread Sandeep Khanzode
Hi Shai,

So one follow-up question.

Assume that my use case is to have approx. ~50M documents indexed with each 
document having about ~10-15 indexed but not stored fields. These fields will 
never change, but there are another ~5-6 fields that will change and will 
continue to change after the index is written. These ~5-6 fields may also be 
multivalued. The size of this index turns out to be ~120GB. 

In this case, I would like to sort, facet, or search on these ~5-6 fields.
Which approach do you suggest? Should I use BinaryDocValues and update them
through the IndexWriter, or use a ParallelReader/Join query?
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode


On Tuesday, July 1, 2014 9:53 PM, Shai Erera  wrote:
 


Except that Lucene now offers efficient numeric and binary DocValues
updates. See IndexWriter.updateNumeric/Binary...

On Jul 1, 2014 5:51 PM, "Erick Erickson"  wrote:

> This JIRA is "complicated", don't really expect it in 4.9 as it's
> been hanging around for quite a while. Everyone would like this,
> but it's not easy.
>
> Atomic updates will work, but you have to stored="true" for all
> source fields. Under the covers this actually reads the document
> out of the stored fields, deletes the old one and adds it
> over again.
>
> FWIW,
> Erick
>
> On Tue, Jul 1, 2014 at 5:32 AM, Sandeep Khanzode
>  wrote:
> > Hi,
> >
> > I wanted to know of the best approach to follow if a few fields in my
> indexed documents are changing at run time (after index and before or
> during search), but a majority of them are created at index time.
> >
> > I could see the JIRA given below but it is scheduled for Lucene 4.9, I
> believe.
> >
> > There are a few other approaches, like maintaining a separate index for
> changing fields and use either a parallelreader or use a Join.
> >
> > Can everyone share their experience for this scenario on how it is
> handled in your systems? Thanks,
> >
> > [LUCENE-4258] Incremental Field Updates through Stacked Segments - ASF
> JIRA
> >
> >
> >  [LUCENE-4258] Incremental Field Updates through Stacked Segments - ASF
> JIRA
> > Shai and I would like to start working on the proposal to Incremental
> Field Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex
> ).
> >
> >
> > ---
> > Thanks n Regards,
> > Sandeep Ramesh Khanzode
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>