Reply: RE: RE: About lucene memory consumption
My application also met this problem last year, and I researched the code and found the reason. The whole process is as follows:

1. When using NRTCachingDirectory, it uses a RAMDirectory as the cache and an MMapDirectory as the delegate. New segments are created during flush or merge, and NRTCachingDirectory uses the parameters maxMergeSizeBytes and maxCachedBytes to decide whether to create a new segment in the cache (in memory) or in the delegate (on disk).
2. When a flush creates a new segment, it compares context.flushInfo.estimatedSegmentSize of the new segment with the above parameters. If the new segment is small, it is created in the RAMDirectory, otherwise in the MMapDirectory.
3. When a merge creates a new segment, it compares context.mergeInfo.estimatedMergeBytes of the new segment with the above parameters. If the new segment is small, it is created in the cache, otherwise in the delegate.
4. But when the new segment is a compound index file (.cfs), no matter whether it comes from a flush or a merge, IOContext.DEFAULT is used for that segment, and estimatedMergeBytes and estimatedSegmentSize are both null for IOContext.DEFAULT. As a result, the new compound segment file is always created in the cache, no matter how big it really is. This is the core issue.

Then I will explain the mechanism of releasing segments from the cache.

1. Normally, during a commit, the sync operation flushes the newly created segment files to disk and deletes them from the cache. But if a merge is running during the sync, the segment created by that merge will not be synced to disk in this commit, and the new merged compound segment file will be created in the cache as described above.
2. When using the NRT feature, the IndexSearcher gets segment readers from the IndexWriter via the getReader method, and there is a ReaderPool inside the IndexWriter. A new segment is first fetched from the cache of NRTCachingDirectory; if it is not in the cache (because it was created directly on disk, or a commit released it from the cache), it is fetched from the delegate. The newly fetched segment is put into the ReaderPool of the IndexWriter. As described above, the new segment created by the merge is now in the cache, and when it is fetched by the IndexWriter it becomes referenced by the ReaderPool. During the next commit, this new segment is synced to disk and released from the cache, but it is still referenced by the ReaderPool. So you will see the IndexSearcher referencing a lot of RAMFile instances that are already on disk. When can these RAMFiles be dropped? Only when these segments take part in a new merge to create a new segment are the old segments released from the ReaderPool of the IndexWriter completely.

I modified the Lucene source code to solve this problem in the CompoundFileWriter class:
// original:
//   out = new DirectCFSIndexOutput(getOutput(), entry, false);
// modified:
//   out = new DirectCFSIndexOutput(getOutput(context), entry, false);

IndexOutput createOutput(String name, IOContext context) throws IOException {
  ensureOpen();
  boolean success = false;
  boolean outputLocked = false;
  try {
    assert name != null : "name must not be null";
    if (entries.containsKey(name)) {
      throw new IllegalArgumentException("File " + name + " already exists");
    }
    final FileEntry entry = new FileEntry();
    entry.file = name;
    entries.put(name, entry);
    final String id = IndexFileNames.stripSegmentName(name);
    assert !seenIDs.contains(id) : "file=\"" + name + "\" maps to id=\"" + id + "\", which was already written";
    seenIDs.add(id);
    final DirectCFSIndexOutput out;
    if ((outputLocked = outputTaken.compareAndSet(false, true))) {
      // out = new DirectCFSIndexOutput(getOutput(), entry, false);
      out = new DirectCFSIndexOutput(getOutput(context), entry, false);
    } else {
      entry.dir = this.directory;
      if (directory.fileExists(name)) {
        throw new IllegalArgumentException("File " + name + " already exists");
      }
      out = new DirectCFSIndexOutput(directory.createOutput(name, context), entry, true);
    }
    success = true;
    return out;
  } finally {
    if (!success) {
      entries.remove(name);
      if (outputLocked) { // release the output lock if not successful
        assert outputTaken.get();
        releaseOutputLock();
      }
    }
  }
}

private synchronized IndexOutput getOutput(IOContext context) throws IOException {
  if (dataOut == null) {
    boolean success = false;
    try {
      dataOut = directory.createOutput(dataFileName, context);
      CodecUtil.writeHeader(dataOut, DATA_CODEC, VERSION_CURRENT);
      success = true;
    } finally {
      if (!success) {
        IOUtils.closeWhileHandlingException(dataOut);
      }
    }
  }
  return dataOut;
}
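For context, here is a minimal sketch of how NRTCachingDirectory is typically set up with the maxMergeSizeMB and maxCachedMB thresholds referred to in step 1 above. The index path, analyzer, and size values are made-up examples, and Version.LUCENE_CURRENT is only a placeholder for whichever 4.x release is in use; this is not taken from the original mail.

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NRTCachingDirectory;
import org.apache.lucene.util.Version;

public class NrtCachingSetup {
  public static void main(String[] args) throws Exception {
    // On-disk delegate; NRTCachingDirectory decides per segment whether to
    // write into its RAMDirectory cache or pass through to this delegate.
    Directory delegate = new MMapDirectory(new File("/tmp/nrt-index")); // made-up path
    // maxMergeSizeMB = 5.0, maxCachedMB = 60.0: segments estimated below the
    // first threshold (while total cached bytes stay under the second) are
    // kept in memory; everything else goes straight to disk.
    NRTCachingDirectory dir = new NRTCachingDirectory(delegate, 5.0, 60.0);

    IndexWriterConfig iwc =
        new IndexWriterConfig(Version.LUCENE_CURRENT, new StandardAnalyzer(Version.LUCENE_CURRENT));
    IndexWriter writer = new IndexWriter(dir, iwc);

    // NRT reader: sees flushed-but-uncommitted segments, including those that
    // currently exist only in the RAM cache.
    DirectoryReader reader = DirectoryReader.open(writer, true);

    reader.close();
    writer.close();
    dir.close();
  }
}

Segments whose estimated size exceeds those thresholds bypass the RAM cache and go straight to the MMapDirectory delegate, which is exactly the decision that the IOContext.DEFAULT problem described above short-circuits for compound files.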
RE: Reply: RE: RE: About lucene memory consumption
Hi Wang,

would it be possible to open a JIRA issue so we can track this? In any case, I would recommend disabling compound files if you use NRTCachingDirectory (as a workaround).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: wangzhijiang999 [mailto:wangzhijiang...@aliyun.com]
> Sent: Tuesday, July 01, 2014 9:17 AM
> To: java-user
> Subject: Reply: RE: RE: About lucene memory consumption
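A minimal sketch of the workaround Uwe suggests (disabling compound files when NRTCachingDirectory is in use) might look like the following. The setter names assume the Lucene 4.x IndexWriterConfig and TieredMergePolicy APIs and are not taken from the thread itself.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.util.Version;

public class NoCompoundFilesConfig {
  // Build an IndexWriterConfig that avoids compound (.cfs) files entirely, so
  // no segment is written through the CompoundFileWriter path discussed above.
  public static IndexWriterConfig newConfig() {
    IndexWriterConfig iwc =
        new IndexWriterConfig(Version.LUCENE_CURRENT, new StandardAnalyzer(Version.LUCENE_CURRENT));
    // Newly flushed segments are left as individual files instead of .cfs.
    iwc.setUseCompoundFile(false);
    // Merged segments are not rewritten into compound format either.
    TieredMergePolicy mp = new TieredMergePolicy();
    mp.setNoCFSRatio(0.0);
    iwc.setMergePolicy(mp);
    return iwc;
  }
}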
Reply: Reply: RE: RE: About lucene memory consumption
Hi Uwe,

I already created the issue in JIRA: https://issues.apache.org/jira/i#browse/LUCENE-5800

Zhijiang Wang

------------------------------------------------------------------
From: Uwe Schindler
Sent: Tuesday, July 1, 2014 15:47
To: java-user; wangzhijiang999
Subject: RE: Reply: RE: RE: About lucene memory consumption
Incremental Field Updates
Hi,

I wanted to know the best approach to follow if a few fields in my indexed documents change at run time (after indexing and before or during search), but the majority of them are created at index time.

I could see the JIRA given below, but I believe it is scheduled for Lucene 4.9.

There are a few other approaches, like maintaining a separate index for the changing fields and using either a ParallelReader or a join.

Can everyone share their experience with this scenario and how it is handled in your systems? Thanks,

[LUCENE-4258] Incremental Field Updates through Stacked Segments - ASF JIRA
Shai and I would like to start working on the proposal to Incremental Field Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex).

---
Thanks n Regards,
Sandeep Ramesh Khanzode
Re: createNormalizedWeight
Hi,

Is there any way to pre-build rewritten queries and cache them somewhere? I have a set of queries that is used very frequently, and I would get a significant boost (10-20% of CPU is currently wasted on rewriting) if I could skip it, for example by caching the rewritten queries.

Thank you for any suggestions.

On Mon, Jun 30, 2014 at 3:24 PM, Pawel Rog wrote:
> Hi,
> Thank you, Uwe. I see mostly ConstantScoreQuery, BooleanQuery and
> FilteredQuery. Maybe it is quite cheap for MemoryIndex, but I execute quite
> many queries on it and I was looking for optimizations.
>
> --
> Paweł
>
> On Mon, Jun 30, 2014 at 3:01 PM, Uwe Schindler wrote:
>> Hi,
>>
>> Queries have to be rewritten; this has nothing to do with scoring. For what
>> type of queries are you seeing this? Wildcard or text range queries are
>> expensive, there is no way around that, but for MemoryIndex (I assume you
>> mean this class) this should be quite cheap.
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>> -----Original Message-----
>>> From: ppp.pawel...@gmail.com [mailto:ppp.pawel...@gmail.com] On Behalf Of Pawel Rog
>>> Sent: Monday, June 30, 2014 2:26 PM
>>> To: java-user@lucene.apache.org
>>> Subject: createNormalizedWeight
>>>
>>> Hi,
>>> I'm running queries over a MemoryIndex and see in the profiler significant
>>> CPU usage in the method createNormalizedWeight. Most of the time is spent
>>> in the rewrite method.
>>>
>>> Is there any way to avoid it or optimize it to reduce CPU usage in
>>> createNormalizedWeight? Scoring is not important for me at all; I only want
>>> to know whether a query matches a document or not.
>>>
>>> --
>>> Regards,
>>> Paweł
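One possible direction, sketched here as an assumption rather than a confirmed recipe: rewrite each frequent query once via IndexSearcher.rewrite and reuse the rewritten form. Note that rewriting of multi-term queries depends on the reader's term dictionary, so a cached rewritten query is only safe while the same (or an equivalent) reader is in use. The helper class and cache key scheme below are hypothetical.

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

// Hypothetical helper: caches the result of query rewriting so that repeated
// searches with the same frequently used query skip the rewrite step.
public class RewrittenQueryCache {
  private final Map<Query, Query> cache = new ConcurrentHashMap<Query, Query>();

  public TopDocs search(IndexSearcher searcher, Query query, int n) throws IOException {
    Query rewritten = cache.get(query);
    if (rewritten == null) {
      // IndexSearcher.rewrite() keeps calling Query.rewrite(reader) until the
      // query reaches its primitive form; this is the expensive part showing
      // up under createNormalizedWeight in the profiler.
      rewritten = searcher.rewrite(query);
      cache.put(query, rewritten);
    }
    // Rewriting an already-rewritten query is a cheap no-op, so subsequent
    // searches pay almost nothing for this step.
    return searcher.search(rewritten, n);
  }
}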
Re: Incremental Field Updates
This JIRA is "complicated"; don't really expect it in 4.9, as it's been hanging around for quite a while. Everyone would like this, but it's not easy.

Atomic updates will work, but you have to set stored="true" for all source fields. Under the covers this actually reads the document out of the stored fields, deletes the old one, and adds it over again.

FWIW,
Erick

On Tue, Jul 1, 2014 at 5:32 AM, Sandeep Khanzode wrote:
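To make the "read stored fields, delete, re-add" pattern concrete, here is a rough Lucene-level sketch; the id scheme and field names are invented, and Erick's description is about Solr atomic updates, so this is only an approximation of the same idea in raw Lucene. In real code you would normally rebuild the Document from your source data rather than from the loaded stored fields.

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Sketch of a "replace the whole document" update: look the document up,
// change one field, and let updateDocument() delete the old version and add
// the new one atomically. Every field you want to keep must have been stored.
public class ReAddUpdate {
  static void updateStatus(IndexWriter writer, IndexSearcher searcher,
                           String id, String newStatus) throws IOException {
    TopDocs hits = searcher.search(new TermQuery(new Term("id", id)), 1);
    if (hits.totalHits == 0) {
      return; // nothing to update
    }
    // Fields loaded here come back as plain stored values; anything that was
    // indexed but not stored is lost at this point.
    Document doc = searcher.doc(hits.scoreDocs[0].doc);
    doc.removeFields("status");
    doc.add(new StringField("status", newStatus, Field.Store.YES));
    // updateDocument = delete-by-term + add, applied atomically by the writer.
    writer.updateDocument(new Term("id", id), doc);
  }
}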
Re: Incremental Field Updates
Except that Lucene now offers efficient numeric and binary DocValues updates. See IndexWriter.updateNumeric/Binary...

On Jul 1, 2014 5:51 PM, "Erick Erickson" wrote:
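A small sketch of what Shai refers to, assuming Lucene 4.6+ for numeric and 4.8+ for binary DocValues updates; the field and term names are made up.

import java.io.IOException;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.util.BytesRef;

// Sketch: update DocValues in place, without re-indexing the whole document.
// The target fields must already exist as NumericDocValuesField /
// BinaryDocValuesField on every document matched by the term.
public class DocValuesUpdates {
  static void updateCounters(IndexWriter writer, String docId) throws IOException {
    Term idTerm = new Term("id", docId);
    writer.updateNumericDocValue(idTerm, "viewCount", 42L);
    writer.updateBinaryDocValue(idTerm, "payload", new BytesRef("updated-bytes"));
  }
}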
Re: Incremental Field Updates
Hi Shai,

One follow-up question. Assume that my use case is to have approx. ~50M documents indexed, with each document having about ~10-15 indexed but not stored fields. These fields will never change, but there are another ~5-6 fields that will change, and will continue to change after the index is written. These ~5-6 fields may also be multivalued. The size of this index turns out to be ~120GB.

In this case, I would like to sort, facet, or search on these ~5-6 fields. Which approach do you suggest? Should I use BinaryDocValues and update them through the IndexWriter, or use a ParallelReader / join query?

---
Thanks n Regards,
Sandeep Ramesh Khanzode

On Tuesday, July 1, 2014 9:53 PM, Shai Erera wrote: