Re: large scale indexing issues / single threaded bottleneck
On Fri, Oct 28, 2011 at 3:27 PM, Simon Willnauer simon.willna...@googlemail.com wrote:

One more thing: after somebody (thanks Robert) pointed me at the stack trace, it seems kind of obvious what the root cause of your problem is. It's Solr :) Solr closes the IndexWriter on commit, which is very wasteful since you basically wait until all merges are done. Solr trunk has solved this problem.

That is very wasteful, but I don't think it's actually the cause of the slowdown here. The cause looks like it's in applying deletes, which will still occur even once Solr stops closing the IW (i.e., IW.commit must also resolve all deletes). When IW resolves deletes it 1) opens a SegmentReader for each segment in the index, and 2) looks up each deleted term and marks its document(s) as deleted.

I saw a mention somewhere that you can tell Solr to use IW.addDocument (not IW.updateDocument) when you add a document, if you are certain it's not replacing a previous document with the same ID. I don't know how to do that, but if that's true and you are truly only adding documents, that could be the easiest fix here.

Failing that, you could try increasing IndexWriterConfig.setReaderTermsIndexDivisor (not sure if/how this is exposed in Solr's config). This will reduce init time and RAM usage for each SegmentReader, but make term lookups slower; whether this helps depends on whether your slowness is in opening the SegmentReader (how long does IR.open take on your index?) or in resolving the deletes once the SR is open.

Do you have a great many terms in your index? Can you run CheckIndex and post the output? If so, this might mean you have an analysis problem, i.e., you are putting too many terms into the index.

We should maybe try to fix this in 3.x too?

+1; having to wait for running merges to complete when the app calls commit is crazy (Lucene long ago removed that limitation).

Mike McCandless
http://blog.mikemccandless.com
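The two steps Mike describes for resolving deletes can be sketched as a toy model. This is purely illustrative plain Python, not Lucene's actual code; the segment layout and function names are made up. The point is that the cost grows with (number of segments) x (number of buffered delete terms), and every segment must first be opened as a reader:

```python
def open_segment_reader(segment):
    """Stand-in for opening a SegmentReader: builds a term -> doc-ids lookup."""
    return {term: set(doc_ids) for term, doc_ids in segment.items()}

def apply_deletes(segments, deleted_terms):
    """Step 1: open a reader per segment; step 2: mark each term's docs deleted."""
    deleted = set()
    for segment in segments:                  # one SegmentReader per segment
        reader = open_segment_reader(segment)
        for term in deleted_terms:            # look up every buffered delete term
            deleted |= reader.get(term, set())
    return deleted

segments = [
    {"id:1": [0], "id:2": [1]},   # segment 0: term -> doc ids
    {"id:2": [0], "id:3": [1]},   # segment 1 (id:2 was updated; old copy lives here)
]
print(sorted(apply_deletes(segments, ["id:2"])))  # -> [0, 1]
```

This is why updateDocument (delete-by-term plus add) is so much costlier at commit time than a plain addDocument when the ID is guaranteed new.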
Re: URL Redirect
I would personally implement this in the app tier, above Solr. One way to do it using Solr to match keywords to URLs is to index special "redirect" documents with the keywords in the search field (either in the main index or in a separate core). But there is nothing magically built into Solr, at the moment, to do what you're asking out of the box.

I'm curious what other tasks are tedious about migrating from Endeca to Solr.

	Erik

On Oct 28, 2011, at 23:40, prr wrote:

Finotti Simone tech178 at yoox.com writes:

Hello, I have been assigned the task of migrating from Endeca to Solr. The former engine allowed me to set keyword triggers that, when matched exactly, caused the web client to redirect to a specified URL. Does that feature exist in Solr? If so, where can I get some info? Thank you

Hi, I am also looking at migrating from Endeca to Solr, but at first glance it looks extremely tedious to me. Please pass on any tips or suggestions on how to approach the problem.
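The app-tier approach Erik suggests can be sketched in a few lines. Everything here is hypothetical (the redirect table, field names, and URLs are made up); in practice the table could live in a config file, a database, or a separate Solr core of redirect documents:

```python
# Hypothetical keyword -> URL redirect table, checked before searching.
REDIRECTS = {
    "careers": "https://example.com/jobs",
    "returns": "https://example.com/help/returns",
}

def handle_query(q):
    """Check the redirect table first; fall through to a normal search."""
    target = REDIRECTS.get(q.strip().lower())
    if target is not None:
        return ("redirect", target)
    return ("search", q)  # here you would query Solr as usual

print(handle_query("Careers"))    # -> ('redirect', 'https://example.com/jobs')
print(handle_query("red shoes"))  # -> ('search', 'red shoes')
```

The exact-match semantics Endeca provided fall out naturally: only a verbatim (case-insensitive) keyword hit triggers the redirect; everything else goes to search.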
Incomplete date expressions
Hi all,

I want to index MEDLINE documents which do not always contain complete dates of publication; the year is always known. Now the Solr documentation states that dates must have the format 1995-12-31T23:59:59Z, for which month, day, and even the time of day must be known.

I could, of course, just complement incomplete dates with default values, 01-01 for example. But then I won't be able to distinguish between complete and incomplete dates afterwards, which is of importance when displaying the documents. I could just store the known information, e.g. the year, in an integer-typed field, but then I won't have date math.

Is there a good solution to my problem? Probably I'm just missing the obvious; perhaps you can help me :-)

Best regards,
Erik
Re: Incomplete date expressions
Erik

I would complement the date with default values as you suggest and store a boolean flag indicating whether the date was complete or not. Or store the original date if it is not complete, which would probably be better: the presence of that data would tell you that the original date was incomplete, and you would also still have it.

Cheers

François
Re: Incomplete date expressions
Hello François,

thank you for your quick reply. I thought about just storing which information I am lacking, and this would be a possibility, of course. It just seemed a bit quick-and-dirty to me, and I wondered whether Solr really cannot understand dates which consist only of the year. Isn't it a common case that a date/time expression is not determined down to the hour, for example?

But if there is no other possibility I will stick with your suggestion. Thank you!

Best,
Erik
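François's suggestion is easy to sketch: pad the missing parts with defaults for the Solr date field, and carry a completeness flag plus the original string alongside it. The field names below are hypothetical, and this assumes MEDLINE dates arrive as "YYYY", "YYYY-MM", or "YYYY-MM-DD":

```python
def normalize_pub_date(raw):
    """Pad a partial date for Solr and record what was actually known."""
    parts = raw.split("-")
    year = parts[0]
    month = parts[1] if len(parts) > 1 else "01"   # default month
    day = parts[2] if len(parts) > 2 else "01"     # default day
    return {
        "pub_date": f"{year}-{month}-{day}T00:00:00Z",  # full Solr date format
        "pub_date_complete": len(parts) == 3,           # flag for display logic
        "pub_date_original": raw,                       # keep what we really had
    }

print(normalize_pub_date("1995"))
# -> {'pub_date': '1995-01-01T00:00:00Z', 'pub_date_complete': False,
#     'pub_date_original': '1995'}
```

The padded field gives you date math and sorting; the flag and original string let the display layer show "1995" rather than a fabricated January 1st.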
Re: large scale indexing issues / single threaded bottleneck
On Sat, Oct 29, 2011 at 6:35 AM, Michael McCandless luc...@mikemccandless.com wrote:

I saw a mention somewhere that you can tell Solr to use IW.addDocument (not IW.updateDocument) when you add a document if you are certain it's not replacing a previous document with the same ID

Right - adding overwrite=false to the URL when updating should do this.

-Yonik
http://www.lucidimagination.com
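Yonik's tip amounts to adding one request parameter to the update URL. A minimal sketch of building that URL (host, port, and path are placeholders for a typical 3.x setup; only do this if the IDs really are new, since overwrite=false skips the duplicate check entirely):

```python
from urllib.parse import urlencode

def update_url(base="http://localhost:8983/solr/update", overwrite=False):
    """Build an update URL; overwrite=false makes Solr call IW.addDocument."""
    params = {"overwrite": str(overwrite).lower(), "wt": "json"}
    return base + "?" + urlencode(params)

print(update_url())
# -> http://localhost:8983/solr/update?overwrite=false&wt=json
```

POSTing documents to that URL avoids the delete-by-ID-then-add path, so commit no longer has to resolve deletes against every segment.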
Re: large scale indexing issues / single threaded bottleneck
Roman:

2) what would be the best way to port these (and only these) changes to 3.4.0? I tried to dig into the branching and revisions, but got lost quickly. Tried something like svn diff […]realtime_search@r953476 […]realtime_search@r1097767, but I'm not sure if it's even possible to merge these into 3.4.0
3) what would you recommend for production 24/7 use? 3.4.0?

If you want to try real-time indexing without commits with ver 3.4.0, you can give Solr with RankingAlgorithm a try. It does not need commits to add documents (you can set your commits to every 15 mins or as desired). You can get more information about NRT with 3.4.0 from here:
http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x

You can download Solr 3.4.0 with RankingAlgorithm 1.3 from here:
http://solr-ra.tgels.org

Regards,
- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

On 10/28/2011 11:38 AM, Roman Alekseenkov wrote:

Hi everyone,

I'm looking for some help with Solr indexing issues on a large scale. We are indexing a few terabytes/month on a sizeable Solr cluster (8 masters serving writes, 16 slaves serving reads). After a certain amount of tuning we got to the point where a single Solr instance can handle an index size of 100GB without much issue, but after that we are starting to observe noticeable delays on index flush, and they are getting larger. See the attached picture for details; it's done for a single JVM on a single machine.

We are posting data in 8 threads using javabin format and doing a commit every 5K documents, merge factor 20, and RAM buffer size about 384MB. From the picture it can be seen that single-threaded index flushing code kicks in on every commit and blocks all other indexing threads. The hardware is decent (12 physical / 24 virtual cores per machine) and it is mostly idle when the index is flushing.
Very little CPU utilization and disk I/O (5%), with the exception of a single CPU core which actually does the index flush (95% CPU, 5% I/O wait).

My questions are:

1) will Solr changes from the real-time branch help to resolve these issues? I was reading http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html and it looks like we have exactly the same problem
2) what would be the best way to port these (and only these) changes to 3.4.0? I tried to dig into the branching and revisions, but got lost quickly. Tried something like svn diff […]realtime_search@r953476 […]realtime_search@r1097767, but I'm not sure if it's even possible to merge these into 3.4.0
3) what would you recommend for production 24/7 use? 3.4.0?
4) is there a workaround that can be used? also, I listed the stack trace below

Thank you!
Roman

P.S. This single index flushing thread spends 99% of all the time in org.apache.lucene.index.BufferedDeletesStream.applyDeletes, and then the merge seems to go quickly. I looked it up, and it looks like the intent here is deleting old commit points (we are keeping only 1 non-optimized commit point per config). Not sure why it is taking that long.
pool-2-thread-1 [RUNNABLE] CPU time: 3:31
  java.nio.Bits.copyToByteArray(long, Object, long, long)
  java.nio.DirectByteBuffer.get(byte[], int, int)
  org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, int)
  org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
  org.apache.lucene.index.SegmentTermEnum.next()
  org.apache.lucene.index.TermInfosReader.init(Directory, String, FieldInfos, int, int)
  org.apache.lucene.index.SegmentCoreReaders.init(SegmentReader, Directory, SegmentInfo, int, int)
  org.apache.lucene.index.SegmentReader.get(boolean, Directory, SegmentInfo, int, boolean, int)
  org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean, int, int)
  org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean)
  org.apache.lucene.index.BufferedDeletesStream.applyDeletes(IndexWriter$ReaderPool, List)
  org.apache.lucene.index.IndexWriter.doFlush(boolean)
  org.apache.lucene.index.IndexWriter.flush(boolean, boolean)
  org.apache.lucene.index.IndexWriter.closeInternal(boolean)
  org.apache.lucene.index.IndexWriter.close(boolean)
  org.apache.lucene.index.IndexWriter.close()
  org.apache.solr.update.SolrIndexWriter.close()
  org.apache.solr.update.DirectUpdateHandler2.closeWriter()
  org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand)
  org.apache.solr.update.DirectUpdateHandler2$CommitTracker.run()
  java.util.concurrent.Executors$RunnableAdapter.call()
  java.util.concurrent.FutureTask$Sync.innerRun()
  java.util.concurrent.FutureTask.run()
  java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor$ScheduledFutureTask)
  java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
  java.util.concurrent.ThreadPoolExecutor$Worker.run()
  java.lang.Thread.run()
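One workaround that is independent of any Lucene changes: Roman's setup commits every 5K documents, and each commit pays the single-threaded flush/applyDeletes cost shown in the trace. Committing on a time interval instead amortizes that cost. Below is an illustrative sketch (not Solr's autocommit implementation) with an injectable clock so the batching behavior is easy to see:

```python
class TimedCommitter:
    """Buffer adds and fire a commit callback at most once per interval."""

    def __init__(self, interval_s, commit, clock):
        self.interval_s = interval_s
        self.commit = commit        # callback standing in for a Solr commit
        self.clock = clock
        self.last_commit = clock()
        self.pending = 0

    def add(self, doc):
        self.pending += 1           # send the document to Solr here
        now = self.clock()
        if now - self.last_commit >= self.interval_s:
            self.commit(self.pending)
            self.pending = 0
            self.last_commit = now

# Simulate 150 documents arriving one second apart, committing every 60s.
fake_time = [0]
commits = []
c = TimedCommitter(60, commits.append, lambda: fake_time[0])
for i in range(150):
    fake_time[0] += 1
    c.add({"id": i})
print(commits, c.pending)  # -> [60, 60] 30
```

Two commits instead of many small ones; the remaining 30 documents wait for the next tick (or an explicit final commit) rather than triggering another expensive flush.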
difference between analysis output and searches
Why is it that I can see in the analysis admin page an obvious match between terms, yet sometimes they don't come back in searches? Debug output on the searches indicates a non-match, yet the analysis page shows an obvious match. I don't get it.
Re: difference between analysis output and searches
Robert -

Can you give us a concrete input text, the field type definition, and the query(/ies) that you'd expect to match? The devil is in the details.

A match in analysis.jsp only means that an index-time and a query-time output token for the given text were equal. But in the real world of doing a search, the query parser adds a whole other level of processing; analysis.jsp does not do query parsing and thus can be misleading.

	Erik
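Erik's point can be shown with a toy example. Suppose a field uses a keyword-style analyzer (the whole value becomes one token). The analyzers below are stand-ins, not Solr's actual classes, but the mismatch they demonstrate is the real one: analysis.jsp compares analyzer output for the raw text directly, while a real search runs the query parser first:

```python
def keyword_analyzer(text):
    return [text.lower()]          # whole input becomes a single token

def query_parser(query):
    return query.split()           # the parser splits on whitespace first

indexed = keyword_analyzer("Foo Bar")             # ['foo bar']

# analysis.jsp-style check: analyze the raw query text, compare token streams.
print(keyword_analyzer("Foo Bar") == indexed)     # -> True (looks like a match)

# Real search: parse first, then analyze each clause separately.
clauses = [keyword_analyzer(w) for w in query_parser("Foo Bar")]
print(any(tok in indexed for c in clauses for tok in c))  # -> False (no hit)
```

The analysis page says "match" because it never parses; the real query produces the tokens 'foo' and 'bar', neither of which equals the indexed token 'foo bar'.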
shingles and dismax?
Hello. While trying to understand why phrase match and boost were not working with shingles and the dismax parser, I saw this thread: http://lucene.472066.n3.nabble.com/Local-Params-syntax-not-protecting-Shingles-in-DisMax-from-Lucene-query-parser-td1563090.html

It states: "I really like the DisMax query parser, but of course its main design is a bit at odds with shingles and phrases."

What are these issues? I thought that I'd use shingles in conjunction with a higher pf boost and dismax to get better phrase matches, but it's just not working: the shingle field is almost never seen to match, and I have no idea why!

thanks,
vijay

--
Performance marketing on Twitter - http://www.wisdomtap.com/
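The tension the linked thread alludes to can be made concrete with a toy shingler (a simplified stand-in for a shingle filter, not Solr's actual ShingleFilter): the 3.x dismax parser splits the query on whitespace before analysis, so at query time the filter sees one word at a time and can never emit the multi-word shingles that exist in the index:

```python
def shingle(tokens, size=2):
    """Join each run of `size` adjacent tokens into one shingle token."""
    return ["_".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

# Index time: the whole field value reaches the analysis chain together.
print(shingle("wireless router sale".split()))
# -> ['wireless_router', 'router_sale']

# Query time under dismax: each whitespace-separated word is analyzed
# alone, so the shingler gets single-token input and emits nothing.
for word in "wireless router".split():
    print(word, "->", shingle([word]))
# -> wireless -> []
# -> router -> []
```

With no query-time shingles to look up, the shingled pf field contributes nothing, which matches the "almost never seen to match" symptom.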