Re: Congratulations to the new Apache Solr PMC Chair, Jan Høydahl!
Congratulations and thank you, Jan! It is so exciting that Solr is now a TLP! Mike McCandless http://blog.mikemccandless.com On Thu, Feb 18, 2021 at 1:56 PM Anshum Gupta wrote: > Hi everyone, > > I’d like to inform everyone that the newly formed Apache Solr PMC > nominated and elected Jan Høydahl for the position of the Solr PMC Chair > and Vice President. This decision was approved by the board in its February > 2021 meeting. > > Congratulations Jan! > > -- > Anshum Gupta >
Re: ExecutorService support in SolrIndexSearcher
We pass ExecutorService to Lucene's IndexSearcher at Amazon (for customer facing product search) and it's a big win on long-pole query latencies, but hurts red-line QPS (cluster capacity) a bit, due to less efficient collection across segments and thread context switching. I'm surprised it's not an option for Solr and Elasticsearch ... for certain applications it's a huge win. And yes as David points out -- Collectors (CollectorManagers) need to support this "gather results for each segment separately then reduce in the end" mode... Mike McCandless http://blog.mikemccandless.com On Fri, Aug 30, 2019 at 4:45 PM David Smiley wrote: > It'd take some work to do that. Years ago I recall Etsy did a POC and > shared their experience at Lucene/Solr Revolution in Washington DC; I > attended the presentation with great interest. One of the major obstacles, > if I recall, was the Collector needs to support this mode of operation, and > in particular Solr's means of flipping bits in a big bitset to accumulate > the DocSet had to be careful so that multiple threads don't try to > overwrite the same underlying "long" in the long[]. > > ~ David Smiley > Apache Lucene/Solr Search Developer > http://www.linkedin.com/in/davidwsmiley > > > On Mon, Aug 26, 2019 at 7:02 AM Aghasi Ghazaryan wrote: > > Hi, > > Lucene's IndexSearcher > > <http://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/IndexSearcher.html#IndexSearcher-org.apache.lucene.index.IndexReaderContext-java.util.concurrent.ExecutorService-> > > supports running searches for each segment separately, using the provided > > ExecutorService. > > I wonder why SolrIndexSearcher does not support the same, as it may > > improve query performance a lot? > > > > Thanks, looking forward to hearing from you. > > > > Regards > > Aghasi Ghazaryan
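For reference, a minimal sketch of the Lucene-level API being discussed, assuming Lucene 8.x; the index path and the "body" field are hypothetical. With an executor set, IndexSearcher splits the search into per-segment slices, collects each slice on a pool thread via a CollectorManager, and reduces the partial results at the end:

    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class ConcurrentSearchSketch {
      public static void main(String[] args) throws Exception {
        // One pool shared by all searches; each segment slice becomes a task.
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
          IndexSearcher searcher = new IndexSearcher(reader, executor);
          // Collects per slice, then reduces; this is the "gather then reduce" mode above.
          TopDocs hits = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
          System.out.println("totalHits: " + hits.totalHits);
        } finally {
          executor.shutdown();
        }
      }
    }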
Re: Mistaken assert message in FST builder?
Hello, Indeed, your cosmetic fix looks great -- I'll push that change. Thanks for noticing and raising! Mike McCandless http://blog.mikemccandless.com On Tue, Apr 16, 2019 at 12:04 AM zhenyuan wei wrote: > Hi, > With the current newest version (9.0.0-snapshot), in the > Builder.UnCompiledNode.addArc() function, I found this line: > > assert numArcs == 0 || label > arcs[numArcs-1].label: "arc[-1].label=" > + arcs[numArcs-1].label + " new label=" + label + " numArcs=" + > numArcs; > > Maybe the assert message should be: > > assert numArcs == 0 || label > arcs[numArcs-1].label: > "arc[numArcs-1].label=" + arcs[numArcs-1].label + " new label=" + label > + " numArcs=" + numArcs; > > Is this an intentional style, or a small mistake? > > Just curious about it.
Re: Long blocking during indexing + deleteByQuery
I'm not sure this is what's affecting you, but you might try upgrading to Lucene/Solr 7.1; in 7.0 there were big improvements in using multiple threads to resolve deletions: http://blog.mikemccandless.com/2017/07/lucene-gets-concurrent-deletes-and.html Mike McCandless http://blog.mikemccandless.com On Tue, Nov 7, 2017 at 2:26 PM, Chris Troullis wrote: > @Erick, I see, thanks for the clarification. > > @Shawn, Good idea for the workaround! I will try that and see if it > resolves the issue. > > Thanks, > > Chris > > On Tue, Nov 7, 2017 at 1:09 PM, Erick Erickson > wrote: > > > bq: you think it is caused by the DBQ deleting a document while a > > document with that same ID > > > > No. I'm saying that DBQ has no idea _if_ that would be the case so > > can't carry out the operations in parallel because it _might_ be the > > case. > > > > Shawn: > > > > IIUC, here's the problem. For deleteById, I can guarantee the > > sequencing through the same optimistic locking that regular updates > > use (i.e. the _version_ field). But I'm kind of guessing here. > > > > Best, > > Erick > > > > On Tue, Nov 7, 2017 at 8:51 AM, Shawn Heisey > wrote: > > > On 11/5/2017 12:20 PM, Chris Troullis wrote: > > >> The issue I am seeing is when some > > >> threads are adding/updating documents while other threads are issuing > > >> deletes (using deleteByQuery), solr seems to get into a state of > extreme > > >> blocking on the replica > > > > > > The deleteByQuery operation cannot coexist very well with other > indexing > > > operations. Let me tell you about something I discovered. I think > your > > > problem is very similar. > > > > > > Solr 4.0 and later is supposed to be able to handle indexing operations > > > at the same time that the index is being optimized (in Lucene, > > > forceMerge). I have some indexes that take about two hours to > optimize, > > > so having indexing stop while that happens is a less than ideal > > > situation. Ongoing indexing is similar in many ways to a merge, enough > > > that it is handled by the same Merge Scheduler that handles an > optimize. > > > > > > I could indeed add documents to the index without issues at the same > > > time as an optimize, but when I would try my full indexing cycle while > > > an optimize was underway, I found that all operations stopped until the > > > optimize finished. > > > > > > Ultimately what was determined (I think it was Yonik that figured it > > > out) was that *most* indexing operations can happen during the > optimize, > > > *except* for deleteByQuery. The deleteById operation works just fine. > > > > > > I do not understand the low-level reasons for this, but apparently it's > > > not something that can be easily fixed. > > > > > > A workaround is to send the query you plan to use with deleteByQuery as > > > a standard query with a limited fl parameter, to retrieve matching > > > uniqueKey values from the index, then do a deleteById with that list of > > > ID values instead. > > > > > > Thanks, > > > Shawn > > > > > >
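For anyone hitting this, a minimal SolrJ sketch of Shawn's workaround, assuming a uniqueKey field named "id", a hypothetical core name, and a hypothetical delete query; very large result sets would need paging (e.g. cursorMark) rather than one big rows value:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class DeleteByQueryWorkaround {
      public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
          // Run the would-be deleteByQuery as a normal query, fetching only the uniqueKey.
          SolrQuery q = new SolrQuery("status:expired");
          q.setFields("id");
          q.setRows(10000);
          QueryResponse rsp = client.query(q);
          List<String> ids = new ArrayList<>();
          for (SolrDocument doc : rsp.getResults()) {
            ids.add((String) doc.getFieldValue("id"));
          }
          // deleteById sequences safely with concurrent adds/updates, unlike deleteByQuery.
          if (!ids.isEmpty()) {
            client.deleteById(ids);
            client.commit();
          }
        }
      }
    }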
Re: SOLR-11504: Provide a config to restrict number of indexing threads
Actually, it's one lucene segment per *concurrent* indexing thread. So if you have 10 indexing threads in Lucene at once, then 10 in-memory segments will be created and will have to be written on refresh/commit. Elasticsearch uses a bounded thread pool to service all indexing requests, which I think is a healthy approach. It shouldn't have to be the client's job to worry about server side details like this. Mike McCandless http://blog.mikemccandless.com On Thu, Nov 2, 2017 at 5:23 AM, Emir Arnautović < emir.arnauto...@sematext.com> wrote: > Hi Nawab, > > > One indexing thread in lucene corresponds to one segment being written. > I need a fine control on the number of segments. > > I didn't check the code, but I would be surprised that it is how things > work. It can appear that it is working like that if each client thread is > doing commits. Is that the case? > > Thanks, > Emir > -- > Monitoring - Log Management - Alerting - Anomaly Detection > Solr & Elasticsearch Consulting Support Training - http://sematext.com/ > > > > > On 1 Nov 2017, at 18:00, Nawab Zada Asad Iqbal wrote: > > > > Well, the reason I want to control number of indexing threads is to > > restrict number of "segments" being created at one time in the RAM. One > > indexing thread in lucene corresponds to one segment being written. I > need > > a fine control on the number of segments. Less than that, and I will not > be > > fully utilizing my writing capacity. On the other hand, if I have more > > threads, then I will end up with a lot more segments of small size, which I > will > > need to flush frequently and then merge, and that will cause a different > > kind of problem. > > > > Your suggestion will require me and other such solr users to create a > tight > > coupling between the clients and the Solr servers. My client is not SolrJ > > based. In a scenario when I am connecting and indexing to Solr remotely, > I > > want more requests to be waiting on the solr side so that they start > > writing as soon as an Indexing thread is available, vs waiting on my > client > > side - on the other side of the wire. > > > > Thanks > > Nawab > > > > On Wed, Nov 1, 2017 at 7:11 AM, Shawn Heisey > wrote: > > > >> On 10/31/2017 4:57 PM, Nawab Zada Asad Iqbal wrote: > >> > >>> I hit this issue https://issues.apache.org/jira/browse/SOLR-11504 > while > >>> migrating to solr6 and locally working around it in Lucene code. I am > >>> thinking to fix it properly and hopefully patch back to Solr. Since, > >>> Lucene > >>> code does not want to keep any such config, I am thinking to use a > >>> counting > >>> semaphore in Solr code before calling IndexWriter.addDocument(s) or > >>> IndexWriter.updateDocument(s). > >>> > >> > >> There's a fairly simple way to control the number of indexing threads > that > >> doesn't require ANY changes to Solr: Don't start as many > threads/processes > >> on your indexing client(s). If you control the number of simultaneous > >> requests sent to Solr, then Solr won't start as many indexing threads. > >> That kind of control over your indexing system is something that's > always > >> preferable to have. > >> > >> Thanks, > >> Shawn > >> > >
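A minimal sketch of the counting-semaphore idea Nawab describes, placed server side around the IndexWriter calls; the limit of 10 is arbitrary and the wrapper class itself is hypothetical:

    import java.util.concurrent.Semaphore;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class BoundedIndexer {
      // At most 10 threads inside addDocument at once => at most 10 in-memory segments.
      private final Semaphore indexingSlots = new Semaphore(10);
      private final IndexWriter writer;

      public BoundedIndexer(IndexWriter writer) {
        this.writer = writer;
      }

      public void addDocument(Document doc) throws Exception {
        indexingSlots.acquire(); // callers beyond the limit block here, on the server side
        try {
          writer.addDocument(doc);
        } finally {
          indexingSlots.release();
        }
      }
    }

This keeps the waiting requests queued on the Solr side of the wire, which is exactly the behavior Nawab wants for a non-SolrJ remote client.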
Re: StandardDirectoryReader.java:: applyAllDeletes, writeAllDeletes
If you are not using NRT readers then the applyAllDeletes/writeAllDeletes boolean values are completely unused (and should have no impact on your performance). Mike McCandless http://blog.mikemccandless.com On Sun, May 28, 2017 at 8:34 PM, Nawab Zada Asad Iqbal <khi...@gmail.com> wrote: > After reading some more code it seems if we are sure that there are no > deletes in this segment/index, then setting applyAllDeletes and > writeAllDeletes both to false will achieve something similar to what I was getting in > 4.5.0 > > However, after I read the comment from IndexWriter::DirectoryReader > getReader(boolean applyAllDeletes, boolean writeAllDeletes) , it seems that > this method is particular to NRT. Since we are not using soft commits, can > this change actually improve our performance during full reindex? > > > Thanks > Nawab > > > > > > > > > > On Sun, May 28, 2017 at 2:16 PM, Nawab Zada Asad Iqbal <khi...@gmail.com> > wrote: > >> Thanks Michael and Shawn for the detailed response. I was later able to >> pull the full history using gitk; and found the commits behind this patch. >> >> Mike: >> >> So, in Solr 4.5.0 some earlier developer has added code and config to >> set applyAllDeletes to false when we reindex all the data. At the moment, >> I am not sure about the performance gain by this. >> >> >> >> >> I am investigating the question, if this change is still needed in 6.5.1 >> or can this be achieved by any other configuration? >> >> For now, we are not planning to use NRT and solrCloud. >> >> >> Thanks >> Nawab >> >> On Sun, May 28, 2017 at 9:26 AM, Michael McCandless < >> luc...@mikemccandless.com> wrote: >> >>> Sorry, yes, that commit was one of many on a feature branch I used to >>> work on LUCENE-5438, which added near-real-time index replication to >>> Lucene. Before this change, Lucene's replication module required a commit >>> in order to replicate, which is a heavy operation. >>> >>> The writeAllDeletes boolean option asks Lucene to move all recent >>> deletes (tombstone bitsets) to disk while opening the NRT (near-real-time) >>> reader. >>> >>> Normally Lucene won't always do that, and will instead carry the bitsets >>> in memory from writer to reader, for reduced refresh latency. >>> >>> What sort of custom changes do you have in this part of Lucene? >>> >>> Mike McCandless >>> >>> http://blog.mikemccandless.com >>> >>> On Sat, May 27, 2017 at 10:35 PM, Nawab Zada Asad Iqbal < >>> khi...@gmail.com> wrote: >>> >>>> Hi all >>>> >>>> I am looking at following change in lucene-solr which doesn't mention any >>>> JIRA. How can I know more about it? >>>> >>>> "1ae7291 Mike McCandless on 1/24/16 at 3:17 PM current patch" >>>> >>>> Specifically, I am interested in what 'writeAllDeletes' does in the >>>> following method. Let me know if it is a very stupid question and I should >>>> have done something else before emailing here. >>>> >>>> static DirectoryReader open(IndexWriter writer, SegmentInfos infos, >>>> boolean applyAllDeletes, boolean writeAllDeletes) throws IOException { >>>> >>>> Background: We are running Solr 4.5 and upgrading to 6.5.1. We have >>>> some custom code in this area, which we need to merge. >>>> >>>> >>>> Thanks >>>> >>>> Nawab >>>> >>> >>> >> >
Re: StandardDirectoryReader.java:: applyAllDeletes, writeAllDeletes
Sorry, yes, that commit was one of many on a feature branch I used to work on LUCENE-5438, which added near-real-time index replication to Lucene. Before this change, Lucene's replication module required a commit in order to replicate, which is a heavy operation. The writeAllDeletes boolean option asks Lucene to move all recent deletes (tombstone bitsets) to disk while opening the NRT (near-real-time) reader. Normally Lucene won't always do that, and will instead carry the bitsets in memory from writer to reader, for reduced refresh latency. What sort of custom changes do you have in this part of Lucene? Mike McCandless http://blog.mikemccandless.com On Sat, May 27, 2017 at 10:35 PM, Nawab Zada Asad Iqbal wrote: > Hi all > > I am looking at following change in lucene-solr which doesn't mention any > JIRA. How can I know more about it? > > "1ae7291 Mike McCandless on 1/24/16 at 3:17 PM current patch" > > Specifically, I am interested in what 'writeAllDeletes' does in the > following method. Let me know if it is a very stupid question and I should > have done something else before emailing here. > > static DirectoryReader open(IndexWriter writer, SegmentInfos infos, > boolean applyAllDeletes, boolean writeAllDeletes) throws IOException { > > Background: We are running Solr 4.5 and upgrading to 6.5.1. We have > some custom code in this area, which we need to merge. > > > Thanks > > Nawab >
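For context, the public entry point for these flags, sketched against the Lucene 6.x API (the method quoted above that takes SegmentInfos is the package-private internal variant):

    import java.io.IOException;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;

    class NrtOpenSketch {
      static DirectoryReader openNrt(IndexWriter writer) throws IOException {
        // Apply recent deletes to the new reader, but keep the tombstone bitsets
        // in memory instead of writing them to disk (lower refresh latency).
        return DirectoryReader.open(writer, /*applyAllDeletes=*/true, /*writeAllDeletes=*/false);
      }

      static DirectoryReader openFromCommit(IndexWriter writer) throws IOException {
        // Non-NRT: open from the last commit point; the two flags never come into play.
        return DirectoryReader.open(writer.getDirectory());
      }
    }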
Re: AnalyzingInfixSuggester performance
It also indexes edge ngrams for short sequences (e.g. a*, b*, etc.) and switches to ordinary PrefixQuery for longer sequences, and does some work at search time to do the "infixing". But yeah otherwise that's it. If your ranking at lookup isn't exactly matching the weight, but "roughly" has some correlation to it, you could still use the fast early termination, except collect deeper than just the top N to ensure you likely found the best hits according to your ranking function. Mike McCandless http://blog.mikemccandless.com On Tue, Apr 18, 2017 at 4:35 PM, OTH <omer.t@gmail.com> wrote: > I see. I had actually overlooked the fact that Suggester provides a > 'weightField', and I could possibly use that in my case instead of the > regular Solr index with bq. > > So if I understand then - the main advantage of using the > AnalyzingInfixSuggester instead of a regular Solr index (since both are > using standard Lucene?) is that the AInfixSuggester does sorting at > index-time using the weightField? So it's only ever advantageous to use > this Suggester if you need sorting based on a field? > > Thanks > > On Tue, Apr 18, 2017 at 2:20 PM, Michael McCandless < > luc...@mikemccandless.com> wrote: > > > AnalyzingInfixSuggester uses index-time sort, to sort all postings by the > > suggest weight, so that lookup, as long as you sort by the suggest weight, > > is extremely fast. > > > > But if you need to rank at lookup time by something not "congruent" with > > the index-time sort then you lose that benefit. > > > > Mike McCandless > > > > http://blog.mikemccandless.com > > > > On Sun, Apr 16, 2017 at 11:46 AM, OTH <omer.t@gmail.com> wrote: > > > > > Hello, > > > > > > From what I understand, the AnalyzingInfixSuggester is using a simple > > > Lucene query; so I was wondering, how then would this suggester have > > better > > > performance than using a simple Solr 'select' query on a regular Solr > > index > > > (with an asterisk placed at the start and end of the query string). I > > > could understand why say an FST based suggester would be faster, but I > > > wanted to confirm if that indeed is the case with > > AnalyzingInfixSuggester. > > > > > > One reason I ask is: > > > I needed the results to be boosted based on the value of another field; > > > e.g., if a user in the UK is searching for cities, then I'd need the > > cities > > > which are in the UK to be boosted. I was able to do this with a > regular > > > Solr index by adding something like these parameters: > > > defType=edismax&bq=country:UK^2.0 > > > > > > However, I'm not sure if this is possible with the Suggester. > Moreover - > > > other than the 'country' field above, there are other fields as well > > which > > > I need to be returned with the results. Since the Suggester seems to > > only > > > allow one additional field, called 'payload', I'm able to do this by > > > putting the values of all the other fields into a JSON and then placing > > > that into the 'payload' field - however, I don't know if it would be > > > possible then to incorporate the boosting mechanism I showed above. > > > > > > So I was thinking of just using a regular Solr index instead of the > > > Suggester; I wanted to confirm, what if any is the performance > > improvement > > > in using the AnalyzingInfixSuggester over using a regular index? > > > > > > Much thanks > > > > > >
Re: AnalyzingInfixSuggester performance
AnalyzingInfixSuggester uses index-time sort, to sort all postings by the suggest weight, so that lookup, as long as you sort by the suggest weight, is extremely fast. But if you need to rank at lookup time by something not "congruent" with the index-time sort then you lose that benefit. Mike McCandless http://blog.mikemccandless.com On Sun, Apr 16, 2017 at 11:46 AM, OTH wrote: > Hello, > > From what I understand, the AnalyzingInfixSuggester is using a simple > Lucene query; so I was wondering, how then would this suggester have better > performance than using a simple Solr 'select' query on a regular Solr index > (with an asterisk placed at the start and end of the query string). I > could understand why say an FST based suggester would be faster, but I > wanted to confirm if that indeed is the case with AnalyzingInfixSuggester. > > One reason I ask is: > I needed the results to be boosted based on the value of another field; > e.g., if a user in the UK is searching for cities, then I'd need the cities > which are in the UK to be boosted. I was able to do this with a regular > Solr index by adding something like these parameters: > defType=edismax&bq=country:UK^2.0 > > However, I'm not sure if this is possible with the Suggester. Moreover - > other than the 'country' field above, there are other fields as well which > I need to be returned with the results. Since the Suggester seems to only > allow one additional field, called 'payload', I'm able to do this by > putting the values of all the other fields into a JSON and then placing > that into the 'payload' field - however, I don't know if it would be > possible then to incorporate the boosting mechanism I showed above. > > So I was thinking of just using a regular Solr index instead of the > Suggester; I wanted to confirm, what if any is the performance improvement > in using the AnalyzingInfixSuggester over using a regular index? > > Much thanks >
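A minimal sketch of building and querying this suggester, assuming Lucene 6.x and hypothetical field names ("city" for suggestions, "weight" for the index-time sort key, "payload" for the extra JSON data mentioned above):

    import java.nio.file.Paths;
    import java.util.List;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.suggest.DocumentDictionary;
    import org.apache.lucene.search.suggest.Lookup;
    import org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester;
    import org.apache.lucene.store.FSDirectory;

    public class InfixSuggestSketch {
      public static void main(String[] args) throws Exception {
        try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/source-index")));
             AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(
                 FSDirectory.open(Paths.get("/path/to/suggest-index")), new StandardAnalyzer())) {
          // Index-time sort by "weight"; "payload" rides along with each suggestion.
          suggester.build(new DocumentDictionary(reader, "city", "weight", "payload"));
          List<Lookup.LookupResult> results = suggester.lookup("lond", false, 5);
          for (Lookup.LookupResult r : results) {
            System.out.println(r.key + " weight=" + r.value + " payload=" + r.payload.utf8ToString());
          }
        }
      }
    }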
Re: Is there a way to tell if multivalued field actually contains multiple values?
I think you can use the term stats that Lucene tracks for each field. Compare Terms.getSumTotalTermFreq and Terms.getDocCount. If they are equal, it means every document that had this field had only one token. Mike McCandless http://blog.mikemccandless.com On Fri, Nov 11, 2016 at 5:50 AM, Mikhail Khludnev wrote: > I suppose it's needless to remind that norm(field) is proportional (but not > precisely by default) to the number of tokens in a doc's field (although not > actual text values). > > On Fri, Nov 11, 2016 at 5:08 AM, Alexandre Rafalovitch > wrote: > >> Hello, >> >> Say I indexed a large dataset against a schemaless configuration. Now >> I have a bunch of multivalued fields. Is there any way to say which of >> these (text) fields have (for given data) only single values? I know I >> am supposed to look at the original data, and all that, but this is >> more for debugging/troubleshooting. >> >> Turning on termOffsets/termPositions would make it easy, but that's a bit >> messy for troubleshooting purposes. >> >> I was thinking that one giveaway is the positionIncrementGap causing >> the second value's token to start at number above a hundred. But I am >> not sure how to craft a query against a field to see if such a token >> is generically present. >> >> >> Any ideas? >> >> Regards, >> Alex. >> >> >> Solr Example reading group is starting November 2016, join us at >> http://j.mp/SolrERG >> Newsletter and resources for Solr beginners and intermediates: >> http://www.solr-start.com/ >> > > > > -- > Sincerely yours > Mikhail Khludnev
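A minimal Lucene-level sketch of that check, with a hypothetical field name; note both stats can come back as -1 when the codec or index options don't record them:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Terms;

    class SingleTokenCheck {
      // True if every document that has this field contains exactly one token.
      static boolean singleTokenPerDoc(IndexReader reader, String field) throws Exception {
        Terms terms = MultiFields.getTerms(reader, field);
        if (terms == null) {
          return true; // no document has this field at all
        }
        long sumTotalTermFreq = terms.getSumTotalTermFreq(); // total tokens across all docs
        int docCount = terms.getDocCount();                  // docs that have this field
        return sumTotalTermFreq != -1 && docCount != -1 && sumTotalTermFreq == docCount;
      }
    }

For an untokenized string field, one token per document implies the field is effectively single-valued in the indexed data.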
[ANNOUNCE] Apache Solr 6.2.0 released
26 August 2016, Apache Solr 6.2.0 available

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search and analytics, rich document parsing, geospatial search, extensive REST APIs as well as parallel SQL. Solr is enterprise grade, secure and highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites.

Solr 6.2.0 is available for immediate download at:
* http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 6.2.0 Release Highlights:

DocValues, streaming, /export, machine learning:
* DocValues can now be used with BoolFields
* Date and boolean support added to /export handler
* Add "scoreNodes" streaming graph expression
* Support parallel ETL with the "topic" expression
* Feature selection and logistic regression on text via new streaming expressions: "features" and "train"

bin/solr script:
* Add basic auth support to the bin/solr script
* File operations to/from Zookeeper are now supported

SolrCloud:
* New tag 'role' in replica placement rules, e.g. rule=role:!overseer keeps new replicas off overseer nodes
* CDCR: fall back to whole-index replication when tlogs are insufficient
* New REPLACENODE command to decommission an existing node and replace it with another new node
* New DELETENODE command to delete all replicas on a node

Security:
* Add Kerberos delegation token support
* Support secure impersonation / proxy user for Kerberos authentication

Misc changes:
* A large number of regressions were fixed in the new Admin UI
* New boolean comparison function queries comparing numeric arguments: gt, gte, lt, lte, eq
* Upgraded Extraction module to Apache Tika 1.13
* Updated to Hadoop 2.7.2

Further details of changes are available in the change log available at: http://lucene.apache.org/solr/6_2_0/changes/Changes.html

Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also applies to Maven access.

Happy searching,

Mike McCandless
http://blog.mikemccandless.com
Re: ConcurrentMergeScheduler options not exposed
Really we need the infoStream output, to see what IW is doing, to take so long merging. Likely only one merge thread is running (CMS tries to detect if your IO system "spins" and if so, uses 1 merge thread) ... maybe try configuring this to something higher since your RAID array can probably handle it? It's good that disabling auto IO throttling didn't change things ... that's what I expected (since forced merges are not throttled by default). Maybe capture all thread stacks and post back here? Mike McCandless http://blog.mikemccandless.com On Thu, Jun 16, 2016 at 4:04 PM, Shawn Heisey <apa...@elyograg.org> wrote: > On 6/16/2016 2:35 AM, Michael McCandless wrote: > > > > Hmm, merging can't read at 800 MB/sec and only write at 20 MB/sec for > > very long ... unless there is a huge percentage of deletes. Also, by > > default CMS doesn't throttle forced merges (see > > CMS.get/setForceMergeMBPerSec). Maybe capture > > IndexWriter.setInfoStream output? > > I can see the problem myself. I have a RAID10 array with six SATA > disks. When I click the Optimize button for a core that's several > gigabytes, iotop shows me reads happening at about 100MB/s for several > seconds, then writes clocking no more than 25 MB/s, and usually a lot > less. The last several gigabytes that were written were happening at > less than 5 MB/s. This is VERY slow, and does affect my nightly > indexing processes. > > Asking the shell to copy a 5GB file revealed sustained write rates of > over 500MB/s, so the hardware can definitely go faster. > > I patched in an option for solrconfig.xml where I could force it to call > disableAutoIOThrottle(). I included logging in my patch to make > absolutely sure that the new code was used. This option made no > difference in the write speed. I also enabled infoStream, but either I > configured it wrong or I do not know where to look for the messages. I > was modifying and compiling branch_5_5. > > This is the patch that I applied: > > http://apaste.info/wKG > > I did see the expected log entries in solr.log when I restarted with the > patch and the new option in solrconfig.xml. > > What else can I look at? > > Thanks, > Shawn > >
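For the knobs being discussed, a minimal Lucene-level sketch assuming the 5.5/6.x API (the values are illustrative); in Solr these map onto <mergeScheduler> configuration in solrconfig.xml, though not every setter is exposed there, which is what Shawn's patch works around:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.ConcurrentMergeScheduler;
    import org.apache.lucene.index.IndexWriterConfig;

    class MergeSchedulerSketch {
      static IndexWriterConfig configure() {
        ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
        // Override the spinning-disk auto-detection: allow several concurrent merges.
        cms.setMaxMergesAndThreads(/*maxMergeCount=*/6, /*maxThreadCount=*/3);
        // Remove the adaptive IO throttle entirely (forced merges are unthrottled by default anyway).
        cms.disableAutoIOThrottle();
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setMergeScheduler(cms);
        return iwc;
      }
    }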
Re: ConcurrentMergeScheduler options not exposed
Hmm, merging can't read at 800 MB/sec and only write at 20 MB/sec for very long ... unless there is a huge percentage of deletes. Also, by default CMS doesn't throttle forced merges (see CMS.get/setForceMergeMBPerSec). Maybe capture IndexWriter.setInfoStream output? Mike McCandless http://blog.mikemccandless.com On Wed, Jun 15, 2016 at 9:12 PM, Shawn Heisey wrote: > On the IRC channel, I ran into somebody who was having problems with > optimizes on their Solr indexes taking a really long time. When > investigating, they found that during the optimize, *reads* were > happening on their SSD disk at over 800MB/s, but *writes* were > proceeding at only 20 MB/s. > > Looking into ConcurrentMergeScheduler, I discovered that it does indeed > have a default write throttle of only 20 MB/s. I saw code that would > sometimes set the speed to unlimited, but had a hard time figuring out > what circumstances will result in the different settings, so based on > the user experience, I assume that the 20MB/s throttle must be applied > for Solr optimizes. > > From what I can see in the code, there's currently no way in > solrconfig.xml to configure scheduler options like the maximum write > speed. Before I open an issue to add additional configuration > options for the merge scheduler, I thought it might be a good idea to > just double-check with everyone here to see whether there's something I > missed. > > This is likely even affecting people who are not using SSD storage. > Most modern magnetic disks can easily exceed 20MB/s on both reads and > writes. Some RAID arrays can write REALLY fast. > > Thanks, > Shawn > >
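To capture the infoStream output Mike mentions: recent Solr versions can switch it on with <infoStream>true</infoStream> inside <indexConfig> in solrconfig.xml (the messages then go to the Solr log); at the Lucene level it is a one-liner, sketched here:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.util.PrintStreamInfoStream;

    class InfoStreamSketch {
      static IndexWriterConfig withInfoStream() {
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        // Log every flush/merge decision IndexWriter makes, including merge rate throttling.
        iwc.setInfoStream(new PrintStreamInfoStream(System.out));
        return iwc;
      }
    }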
Re: Lucene/Solr Git Mirrors 5 day lag behind SVN?
I added a comment on the INFRA issue. I don't understand why it periodically "gets stuck". Mike McCandless http://blog.mikemccandless.com On Fri, Oct 23, 2015 at 11:27 AM, Kevin Risden wrote: > It looks like both Apache Git mirror (git://git.apache.org/lucene-solr.git) > and GitHub mirror (https://github.com/apache/lucene-solr.git) are 5 days > behind SVN. This seems to have happened before: > https://issues.apache.org/jira/browse/INFRA-9182 > > Is this a known issue? > > Kevin Risden
Re: CheckIndex failed for Solr 4.7.2 index
IBM's J9 JVM unfortunately still has a number of nasty bugs affecting Lucene; most likely you are hitting one of these. We used to test J9 in our continuous Jenkins jobs, but there were just too many J9-specific failures and we couldn't get IBM's attention to resolve them, so we stopped. For now you should switch to Oracle JDK, or OpenJDK. But there's some good news! Recently, a member from the IBM JDK team replied to this Elasticsearch thread: https://discuss.elastic.co/t/need-help-with-ibm-jdk-issues-with-es-1-4-5/1748/3 And then Robert Muir ran Lucene's tests with the latest J9 and opened several issues; see the 2nd bullet under Apache Lucene at https://www.elastic.co/blog/this-week-in-elasticsearch-and-apache-lucene-2015-06-09 and at least one of the issues seems to be making progress (https://issues.apache.org/jira/browse/LUCENE-6522). So there is hope for the future, but for today it's too dangerous to use J9 with Lucene/Solr/Elasticsearch. Mike McCandless http://blog.mikemccandless.com On Tue, Jun 9, 2015 at 12:23 PM, Guy Moshkowich g...@il.ibm.com wrote: We are using Solr 4.7.2 and we found that when we run CheckIndex.checkIndex on one of the Solr shards we are getting the error below. Both replicas of the shard had the same error. The shard index looked healthy: 1) It appeared active in the Solr admin page. 2) We could run searches against it. 3) No relevant errors were found in Solr logs. 4) After we optimized the index in LUKE, CheckIndex did not report any error. My questions: 1) Is this a real issue, or a known bug in the CheckIndex code that causes a false negative? 2) Is there a known fix for this issue? Here is the error we got: validateIndex Segments file=segments_bhe numSegments=15 version=4.7 format= userData={commitTimeMSec=1432689607801} 1 of 15: name=_6cth docCount=248744 codec=Lucene46 compound=false numFiles=11 size (MB)=86.542 diagnostics = {timestamp=1428883354605, os=Linux, os.version=2.6.32-431.23.3.el6.x86_64, mergeFactor=10, source=merge, lucene.version=4.7.2 1586229 - rmuir - 2014-04-10 09:00:35, os.arch=amd64, mergeMaxNumSegments=-1, java.version=1.7.0, java.vendor=IBM Corporation} has deletions [delGen=3174] test: open reader.........FAILED WARNING: fixIndex() would remove reference to this segment; full exception: java.lang.RuntimeException: liveDocs count mismatch: info=156867, vs bits=156872 at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:581) at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:372) Appreciate your help, Guy.
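For completeness, a minimal sketch of running CheckIndex programmatically against a 4.x index (the path is hypothetical); run it on a copy or with indexing stopped, and preferably under Oracle JDK/OpenJDK, to rule the JVM out:

    import java.io.File;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class CheckIndexSketch {
      public static void main(String[] args) throws Exception {
        try (Directory dir = FSDirectory.open(new File("/path/to/index"))) {
          CheckIndex checker = new CheckIndex(dir);
          checker.setInfoStream(System.out); // per-segment detail, like the output quoted above
          CheckIndex.Status status = checker.checkIndex();
          System.out.println(status.clean ? "Index is clean" : "Index has problems");
        }
      }
    }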
[ANNOUNCE] Apache Solr 4.10.4 released
March 2015, Apache Solr™ 4.10.4 available The Lucene PMC is pleased to announce the release of Apache Solr 4.10.4 Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites. Solr 4.10.4 is available for immediate download at: http://www.apache.org/dyn/closer.cgi/lucene/solr/4.10.4 Solr 4.10.4 includes 24 bug fixes, as well as Lucene 4.10.4 and its 13 bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details. Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Mike McCandless http://blog.mikemccandless.com
Re: Frequent deletions
Also see this G+ post I wrote up recently showing how the % of deletions changes over time for an "every add also deletes a previous document" stress test: https://plus.google.com/112759599082866346694/posts/MJVueTznYnD Mike McCandless http://blog.mikemccandless.com On Wed, Dec 31, 2014 at 12:21 PM, Erick Erickson erickerick...@gmail.com wrote: It's usually not necessary to optimize; as more indexing happens you should see background merges happen that'll reclaim the space, so I wouldn't worry about it unless you're seeing actual problems that have to be addressed. Here's a great visualization of the process: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html See especially the third video, TieredMergePolicy, which is the default. If you insist, however, try a commit with expungeDeletes=true, and if that isn't enough, try an optimize call. You can issue a force merge (aka optimize) command from the URL (or curl etc.) as: http://localhost:8983/solr/techproducts/update?optimize=true But please don't do this unless it's absolutely necessary. You state that you have frequent deletions, but eventually this should all happen in the background. Optimize is a fairly expensive operation and should be used judiciously. Best, Erick On Wed, Dec 31, 2014 at 1:32 AM, ig01 inna.gel...@elbitsystems.com wrote: Hello, We perform frequent deletions from our index, which greatly increases the index size. How can we perform an optimization in order to reduce the size? Please advise, Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DocsEnum and TermsEnum reuse in lucene join library?
They should be reused if the impl. allows for it. Besides reducing GC cost, it can also be a sizable performance gain since these enums can have quite a bit of state that otherwise must be re-initialized. If you really don't want to reuse them (force a new enum every time), pass null. Mike McCandless http://blog.mikemccandless.com On Fri, Dec 5, 2014 at 8:14 PM, Darin Amos dari...@gmail.com wrote: Hi All, I have been working on a custom query and I am going off of samples in the lucene join library (4.3.0) and I am a little unclear about a couple lines. 1) When getting a TermsEnum in TermsIncludingScoreQuery.createWeight(…).scorer()… A previous TermsEnum is used like the following: segmentTermsEnum = terms.iterator(segmentTermsEnum); 2) When getting a DocsEnum in SVInOrderScorer.fillDocsAndScores: for (int i = 0; i < terms.size(); i++) { if (termsEnum.seekExact(terms.get(ords[i], spare), true)) { docsEnum = termsEnum.docs(acceptDocs, docsEnum, DocsEnum.FLAG_NONE); My assumption is that the previous enum values are not reused, but this is a tuning mechanism for garbage collection; is that a correct assumption? Thanks! Darin
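A minimal sketch of the reuse pattern against the 4.3 APIs (the field name is hypothetical); each enum is passed back into the call that produced it so the codec can recycle its internal state:

    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.DocsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;

    class EnumReuseSketch {
      static long countPostings(DirectoryReader reader, String field) throws Exception {
        long count = 0;
        TermsEnum termsEnum = null;
        DocsEnum docsEnum = null;
        for (AtomicReaderContext ctx : reader.leaves()) {
          Terms terms = ctx.reader().terms(field);
          if (terms == null) continue;
          termsEnum = terms.iterator(termsEnum); // pass the old enum back in for reuse
          while (termsEnum.next() != null) {
            docsEnum = termsEnum.docs(ctx.reader().getLiveDocs(), docsEnum, DocsEnum.FLAG_NONE);
            while (docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
              count++;
            }
          }
        }
        return count;
      }
    }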
[ANNOUNCE] Apache Solr 4.10.2 released
October 2014, Apache Solr™ 4.10.2 available The Lucene PMC is pleased to announce the release of Apache Solr 4.10.2 Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites. Solr 4.10.2 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html Solr 4.10.2 includes 10 bug fixes, as well as Lucene 4.10.2 and its 2 bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details. Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Happy Halloween, Mike McCandless http://blog.mikemccandless.com
[ANNOUNCE] Apache Solr 4.10.1 released
September 2014, Apache Solr™ 4.10.1 available The Lucene PMC is pleased to announce the release of Apache Solr 4.10.1 Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites. Solr 4.10.1 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html Solr 4.10.1 includes 6 bug fixes, as well as Lucene 4.10.1 and its 7 bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details. Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Mike McCandless http://blog.mikemccandless.com
[ANNOUNCE] Apache Solr 4.9.1 released
September 2014, Apache Solr™ 4.9.1 available The Lucene PMC is pleased to announce the release of Apache Solr 4.9.1 Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites. Solr 4.9.1 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html Solr 4.9.1 includes 2 bug fixes, as well as Lucene 4.9.1 and its 7 bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details. Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Mike McCandless http://blog.mikemccandless.com
Re: [ANNOUNCE] Apache Solr 4.9.1 released
I'll merge back the 4.9.1 CHANGES entries so when we do a 4.10.1, they'll be there ... and I'll also make sure any fix we backported for 4.9.1, we also backport for 4.10.1. Mike McCandless http://blog.mikemccandless.com On Mon, Sep 22, 2014 at 9:11 AM, Shawn Heisey s...@elyograg.org wrote: On 9/22/2014 6:24 AM, Bernd Fehling wrote: This confuses me a bit, aren't we already at 4.10.0? But CHANGES.txt of 4.10.0 doesn't know anything about 4.9.1. Is this an interim version or something about backward compatibility? It's a bugfix release, fixing some showstopper bugs in a recent release that is critical to the RM (Michael McCandless) and/or an organization where he has influence or liability. Apparently this was a more expedient path than completely validating a 4.10 upgrade and waiting for the 4.10.1 bugfix release. Validating the 4.10 upgrade probably would have taken considerably longer than simply backporting some critical fixes to the 4.9 release that they're actually using. The two bug fixes for Solr are a license issue and a security vulnerability. The bugfix list for Lucene includes fixes for some major problems that can cause index corruption or incorrect operation. I had thought that the CHANGES.txt list would remain the same for trunk and the stable branch because some of those bugfixes skipped the 4.10.0 release, but it looks like that's not the case for LUCENE-5919 (the only one that I actually investigated). If these issues all got updated to the 4.9.1 section of CHANGES.txt in places other than the 4.9 branch and the 4.9.1 tag, there might be a small amount of confusion in the distant future. That confusion would be cleared up by looking at CHANGES.txt for the 4.10.0 release, though. Looks like the 4.10.1 release has been delayed a little. I hope that this collection of fixes makes it in there too, so that 4.10.0 is the only release where that confusion might impact users. Thanks, Shawn
Re: optimize and .nfsXXXX files
Soft commit (i.e. opening a new IndexReader in Lucene and closing the old one) should make those go away? The .nfsXXXX files are created when a file is deleted but a local process (in this case, the current Lucene IndexReader) still has the file open. Mike McCandless http://blog.mikemccandless.com On Mon, Aug 18, 2014 at 5:20 AM, BorisG boris.golo...@mail.huji.ac.il wrote: Hi, I am using solr 3.6.2. I use NFS and my index folder is a mounted folder. When I run the command: server:port/solr/collection1/update?optimize=true&maxSegments=1&waitFlush=true&expungeDeletes=true in order to optimize my index, I have some .nfsXXXX files created while the optimize is running. The problem I am having is that after optimize finishes its run the .nfs files aren't deleted. When I close the solr process they immediately disappear. I don't want to restart the solr process after each optimize; is there anything that can be done in order for solr to get rid of those files? Thanks,
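At the Lucene level the mechanics look roughly like this, sketched against the 4.x+ DirectoryReader API (in the 3.6 line the equivalent was IndexReader.openIfChanged); Solr does the same thing internally when a new searcher is opened on commit:

    import org.apache.lucene.index.DirectoryReader;

    class ReaderRefreshSketch {
      static DirectoryReader refresh(DirectoryReader old) throws Exception {
        DirectoryReader fresh = DirectoryReader.openIfChanged(old);
        if (fresh == null) {
          return old; // nothing changed; keep the current reader
        }
        // Closing the old reader releases its open file handles, letting
        // NFS finally drop the .nfsXXXX placeholders for deleted files.
        old.close();
        return fresh;
      }
    }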
Re: Does Lucene use tries?
The default terms dictionary (BlockTree) also uses a trie index structure to locate the block on disk that may contain a target term. Mike McCandless http://blog.mikemccandless.com On Thu, Jun 5, 2014 at 12:11 PM, Shawn Heisey s...@elyograg.org wrote: > I just want to know whether Lucene uses the trie data structure to store the data. Lucene (and Solr) will use whatever you tell it when you create the field. If you indicate in your schema fieldType that you want to use a class of solr.TrieIntField, then the field will use a Lucene trie type that holds integers. Similar for TrieLongField, TrieFloatField, etc. Thanks, Shawn
Re: Expected date of release for Solr 4.7.1
RC2 is being voted on now ... so it should be soon (a few days, but more if any new blocker issues are found and we need to do RC3). Mike McCandless http://blog.mikemccandless.com On Sat, Mar 29, 2014 at 2:26 PM, Puneet Pawaia puneet.paw...@gmail.com wrote: Hi Any idea on the expected date of release for Solr 4.7.1 Regards Puneet
Re: Enabling other SimpleText formats besides postings
You told the fieldType to use SimpleText only for the postings, not all other parts of the codec (doc values, live docs, stored fields, etc...), and so it used the default codec for those components. If instead you used the SimpleTextCodec (not sure how to specify this in Solr's schema.xml) then all components would be SimpleText. Mike McCandless http://blog.mikemccandless.com On Fri, Mar 28, 2014 at 8:53 AM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi all, I've been using the SimpleTextCodec in the past, but I just noticed something odd... I'm running Solr 4.3, and enabled the SimpleText posting format via something like: <fieldType name="date" class="solr.DateField" postingsFormat="SimpleText"/> The resulting index does have the expected _0_SimpleText_0.pst text output, but I just noticed that the other files are all the standard binary format (e.g. .fdt for field data) Based on SimpleTextCodec.java, I was assuming that I'd get the SimpleTextStoredFieldsFormat for stored data. The same holds true for most (all?) of the other files, e.g. https://issues.apache.org/jira/browse/LUCENE-3074 is about adding a simple text format for DocValues. I can walk the code to figure out what's up, but I'm hoping I just need to change some configuration setting. Thanks! -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Cassandra Solr
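At the Lucene level, swapping the whole codec looks like this, sketched against the 4.3 API; Solr's schema-level postingsFormat attribute only swaps the postings component, which is exactly the behavior described above (swapping the whole codec in Solr would need a custom CodecFactory):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Version;

    class SimpleTextSketch {
      static IndexWriter open(Directory dir) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43, new StandardAnalyzer(Version.LUCENE_43));
        // Every component -- postings, stored fields, doc values, norms, live docs -- in plain text.
        iwc.setCodec(new SimpleTextCodec());
        return new IndexWriter(dir, iwc);
      }
    }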
Re: AutoSuggest like Google in Solr using Solarium Client.
I think it's best to use one of the many autosuggesters Lucene/Solr provide? E.g. AnalyzingInfixSuggester is running here: http://jirasearch.mikemccandless.com But that's just one suggester... there are many more. Mike McCandless http://blog.mikemccandless.com On Mon, Mar 17, 2014 at 10:44 AM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Can anyone suggest best practices for how to do SpellCheck and AutoSuggest in Solarium? Can anyone give me an example? -- Regards, *Sohan Kalsariya*
Re: Join Scoring
I suspect (not certain) one reason for the performance difference with Solr vs Lucene joins is that Solr operates on a top-level reader? This results in fast joins, but it means whenever you open a new reader (NRT reader) there is a high cost to regenerate the top-level data structures. But if the app doesn't open NRT readers, or opens them rarely, perhaps that cost is a good tradeoff to get faster joins. Mike McCandless http://blog.mikemccandless.com On Thu, Feb 13, 2014 at 12:10 AM, anand chandak anand.chan...@oracle.com wrote: > Re-posting... Thanks, Anand On 2/12/2014 10:55 AM, anand chandak wrote: Thanks David, really helpful response. You mentioned that if we have to add scoring support in solr then a possible approach would be to add a custom QueryParser, which might be taking Lucene's JOIN module. I have tried this approach and this makes it slow, because I believe this is making more searches.. Curious, if it is possible instead to enhance existing solr's JoinQParserPlugin and add the scoring support in the same class ? Do you think its feasible and recommended ? If yes, what would it take (highlevel) - in terms of code changes, any pointers ? Thanks, Anand On 2/12/2014 10:31 AM, David Smiley (@MITRE.org) wrote: Hi Anand. Solr's JOIN query, {!join}, constant-scores. It's simpler and faster and more memory efficient (particularly the worst-case memory use) to implement the JOIN query without scoring, so that's why. Of course, you might want it to score and pay whatever penalty is involved. For that you'll need to write a Solr QueryParser that might use Lucene's join module which has scoring variants. I've taken this approach before. You asked a specific question about the purpose of JoinScorer when it doesn't actually score. Lucene's Query produces a Weight which in turn produces a Scorer that is a DocIdSetIterator plus it returns a score. So Queries have to have a Scorer to match any document even if the score is always 1. Solr does indeed have a lot of caching; that may be in play here when comparing against a quick attempt at using Lucene directly. In particular, the matching documents are likely to end up in Solr's DocumentCache. Returning stored fields that come back in search results are one of the more expensive things Lucene/Solr does. I also think you noted that the fields on documents from the from side of the query are not available to be returned in search results, just the to side. Yup; that's true. To remedy this, you might write a Solr SearchComponent that adds fields from the from side. That could be tricky to do; it would probably need to re-run the from-side query but filtered to the matching top-N documents being returned. ~ David anand chandak wrote Resending, if somebody can please respond. Thanks, Anand On 2/5/2014 6:26 PM, anand chandak wrote: Hi, Having a question on join score: why doesn't the solr join query return the scores? Looking at the code, I see there's a JoinScorer defined in the JoinQParserPlugin class. If it's not used for scoring, where is it actually used? Also, to evaluate the performance of the solr join plugin vs lucene joinutil, I fired the same join query against the same data-set and same schema, and in the results I am always seeing the Qtime for Solr much lower than Lucene's. What is the reason behind this ? Solr doesn't return scores; could that cause so much difference ? My guess is solr has a very sophisticated caching mechanism and that might be coming into play, is that true ? or is there a difference in the way the JOIN happens in the 2 approaches? If I understand correctly, both implementations are using a 2-pass approach: first gathering all the terms from the fromField, and then returning all documents that have matching terms in the toField. If somebody can throw some light, I would highly appreciate it. Thanks, Anand - Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Join-Scoring-tp4115539p4116818.html Sent from the Solr - User mailing list archive at Nabble.com.
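For reference, the Lucene join-module call being compared here, sketched against the 4.x API with hypothetical field names; the ScoreMode argument is what gives you the scoring that Solr's {!join} drops:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.join.JoinUtil;
    import org.apache.lucene.search.join.ScoreMode;

    class JoinSketch {
      static TopDocs join(IndexSearcher searcher) throws Exception {
        Query fromQuery = new TermQuery(new Term("color", "red")); // selects the "from" docs
        // Pass 1 collects terms from "parent_id" on the from side; pass 2 matches them in "id".
        Query joinQuery = JoinUtil.createJoinQuery(
            "parent_id", /*multipleValuesPerDocument=*/false, "id",
            fromQuery, searcher, ScoreMode.Max);
        return searcher.search(joinQuery, 10);
      }
    }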
Re: Lucene Join
Look in lucene's join module? Mike McCandless http://blog.mikemccandless.com On Thu, Jan 30, 2014 at 4:15 AM, anand chandak anand.chan...@oracle.com wrote: Hi, I am trying to find out whether the Lucene joins (not the Solr join) are using any filter cache. The API that Lucene uses for joining is JoinUtil.createJoinQuery(); where can I find the source code for this API? Thanks in advance Thanks, Anand
Re: background merge hit exception while optimizing index (SOLR 4.4.0)
Which version of Java are you using? That root cause exception is somewhat spooky: it's in the ByteBufferIndexInput code that handles a BufferUnderflowException, ie when a small (maybe a few hundred bytes) read happens to span the 1 GB page boundary, and specifically the exception happens on the final read (curBuf.get(b, offset, len)). Such page-spanning reads are very rare. The code looks fine to me though, and it's hard to explain how NPE (b = null) could happen: that byte array is allocated in the Lucene41PostingsReader.BlockDocsAndPositionsEnum class's ctor: encoded = new byte[MAX_ENCODED_SIZE]. Separately, you really should not have to optimize daily, if ever. Mike McCandless http://blog.mikemccandless.com On Mon, Jan 13, 2014 at 2:06 AM, Ralf Matulat ralf.matu...@bundestag.de wrote: Hi, I am currently running into merge-issues while optimizing an index. To give you some information: We are running 4 SOLR Servers with identical OS, VM-Hardware, RAM etc. Only one Server by now is having issues, the others are fine. We are running SOLR 4.4.0 with Tomcat 6.0. It had been running since October without any problems. The problems first occurred after doing a minor change in the synonyms.txt, but I guess that was just a coincidence. We added `ulimit -v unlimited` to our tomcat init-script years ago. We have 4 Cores running on each SOLR Server; configuration and index-sizes of all 4 servers are identical (we are distributing cfgs via git). We did a rebuild of the index twice: First time without removing the old index files, second time deleting the data dir and starting from scratch. We are working with DIH, getting data from a MySQL DB. After an initial complete index-run, the optimize is working. The optimize fails one or two days later. We are doing one optimize-run a day; the index contains ~10 million documents, the index size on disc is ~39GB while having 127G of free disc space. We have a mergeFactor of 3.
The solr.log says: ERROR - 2014-01-12 22:47:11.062; org.apache.solr.common.SolrException; java.io.IOException: background merge hit exception: _dc8(4.4):C9876366/1327 _e8u(4.4):C4250/7 _f4a(4.4):C1553/13 _fj6(4.4 ):C1903/15 _ep3(4.4):C1217/42 _fle(4.4):C256/7 _flf(4.4):C11 into _flg [maxNumSegments=1] at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1714) at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1650) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:530) at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95) at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1235) at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1219) at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:157) at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:69) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489) at java.lang.Thread.run(Thread.java:735) Caused by: java.lang.NullPointerException at java.nio.ByteBuffer.get(ByteBuffer.java:661)
Re: background merge hit exception while optimizing index (SOLR 4.4.0)
I have trouble understanding J9's version strings ... but, is it really from 2008? You could be hitting a JVM bug; can you test upgrading? I don't have much experience with Solr faceting on optimized vs unoptimized indices; maybe someone else can answer your question. Lucene's facet module (not yet exposed through Solr) performance shouldn't change much for optimized vs unoptimized indices. Mike McCandless http://blog.mikemccandless.com On Mon, Jan 13, 2014 at 10:09 AM, Ralf Matulat ralf.matu...@bundestag.de wrote: java -version java version 1.6.0 Java(TM) SE Runtime Environment (build pxa6460sr3ifix-20090218_02(SR3+IZ43791+IZ43798)) IBM J9 VM (build 2.4, J2RE 1.6.0 IBM J9 2.4 Linux amd64-64 jvmxa6460-20081105_25433 (JIT enabled, AOT enabled) J9VM - 20081105_025433_LHdSMr JIT - r9_20081031_1330 GC - 20081027_AB) JCL - 20090218_01 A question regarding optimizing the index: As of SOLR 3.X we encountered massive performance improvements with faceted queries after optimizing an index. So we once started optimizing the indexes on a daily basis. With SOLR 4.X and the new index-format, is that not true anymore? Btw: The checkIndex failed with 'java.io.FileNotFoundException:', I guess because I did not stop Tomcat while checking. So SOLR created, merged and deleted some segments while checking. I will restart the check after stopping SOLR. Kind regards Ralf Matulat Which version of Java are you using? That root cause exception is somewhat spooky: it's in the ByteBufferIndexInput code that handles a BufferUnderflowException, ie when a small (maybe a few hundred bytes) read happens to span the 1 GB page boundary, and specifically the exception happens on the final read (curBuf.get(b, offset, len)). Such page-spanning reads are very rare. The code looks fine to me though, and it's hard to explain how NPE (b = null) could happen: that byte array is allocated in the Lucene41PostingsReader.BlockDocsAndPositionsEnum class's ctor: encoded = new byte[MAX_ENCODED_SIZE]. Separately, you really should not have to optimize daily, if ever. Mike McCandless http://blog.mikemccandless.com
Re: MergePolicy for append-only indices?
On Mon, Jan 6, 2014 at 3:42 PM, Michael Sokolov msoko...@safaribooksonline.com wrote: I think the key optimization when there are no deletions is that you don't need to renumber documents and can bulk-copy blocks of contiguous documents, and that is independent of merge policy. I think :) Merging of term vectors and stored fields will always use bulk-copy for contiguous chunks of non-deleted docs, so for the append-only case these will be the max chunk size and be efficient. We have no codec that implements bulk merging for postings, which would be interesting to pursue: in the append-only case it's possible, and merging of postings is normally by far the most time consuming step of a merge. Also, no RAM will be used holding the doc mapping, since the docIDs don't change. These benefits are independent of the MergePolicy. I think TieredMergePolicy will work fine for append-only; I'm not sure how you'd improve on its approach. It will in general renumber the docs, so if that's a problem, apps should use LogByteSizeMP. Mike McCandless http://blog.mikemccandless.com
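If stable docIDs matter for the append-only case, the merge policy swap Mike mentions is a one-liner, sketched here against the 4.x API (TieredMergePolicy is the default and may reorder segments, renumbering docs):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.LogByteSizeMergePolicy;
    import org.apache.lucene.util.Version;

    class AppendOnlyConfigSketch {
      static IndexWriterConfig configure() {
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46));
        // Merges only adjacent segments, so docIDs stay in insertion order.
        iwc.setMergePolicy(new LogByteSizeMergePolicy());
        return iwc;
      }
    }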
Re: Possible memory leak after segment merge? (related to DocValues?)
On Mon, Dec 30, 2013 at 1:22 PM, Greg Preston gpres...@marinsoftware.com wrote: That was it. Setting omitNorms=true on all fields fixed my problem. I left it indexing all weekend, and heap usage still looks great. Good! I'm still not clear why bouncing the solr instance freed up memory, unless the in-memory structure for this norms data is lazily loaded somehow. In fact it is lazily loaded, the first time a search (well, Similarity) needs to load the norms for scoring. Anyway, thank you very much for the suggestion. You're welcome. Mike McCandless http://blog.mikemccandless.com
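For reference, a Lucene-level sketch of what omitNorms=true turns off (4.x API; the field name is a placeholder):

    FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
    ft.setOmitNorms(true); // no norms are written, so merging/searching never materializes them on the heap
    Document doc = new Document();
    doc.add(new Field("body", "some text", ft));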
Re: Possible memory leak after segment merge? (related to DocValues?)
Likely this is for field norms, which use doc values under the hood. Mike McCandless http://blog.mikemccandless.com On Thu, Dec 26, 2013 at 5:03 PM, Greg Preston gpres...@marinsoftware.com wrote: Does anybody with knowledge of solr internals know why I'm seeing instances of Lucene42DocValuesProducer when I don't have any fields that are using DocValues? Or am I misunderstanding what this class is for? -Greg On Mon, Dec 23, 2013 at 12:07 PM, Greg Preston gpres...@marinsoftware.com wrote: Hello, I'm loading up our solr cloud with data (from a solrj client) and running into a weird memory issue. I can reliably reproduce the problem.

- Using Solr Cloud 4.4.0 (also replicated with 4.6.0)
- 24 solr nodes (one shard each), spread across 3 physical hosts, each host has 256G of memory
- index and tlogs on ssd
- Xmx=7G, G1GC
- Java 1.7.0_25
- schema and solrconfig.xml attached

I'm using composite routing to route documents with the same clientId to the same shard. After several hours of indexing, I occasionally see an IndexWriter go OOM. I think that's a symptom. When that happens, indexing continues, and that node's tlog starts to grow. When I notice this, I stop indexing, and bounce the problem node. That's where it gets interesting. Upon bouncing, the tlog replays, and then segments merge. Once the merging is complete, the heap is fairly full, and forced full GC only helps a little. But if I then bounce the node again, the heap usage goes way down, and stays low until the next segment merge. I believe segment merges are also what causes the original OOM. More details: Index on disk for this node is ~13G, tlog is ~2.5G. See attached mem1.png. This is a jconsole view of the heap during the following: (Solr cloud node started at the left edge of this graph)

A) One CPU core pegged at 100%. Thread dump shows:

"Lucene Merge Thread #0" daemon prio=10 tid=0x7f5a3c064800 nid=0x7a74 runnable [0x7f5a41c5f000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.lucene.util.fst.Builder.add(Builder.java:397)
        at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1000)
        at org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:112)
        at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
        at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:365)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

B) One CPU core pegged at 100%. Manually triggered GC. Lots of memory freed. Thread dump shows:

"Lucene Merge Thread #0" daemon prio=10 tid=0x7f5a3c064800 nid=0x7a74 runnable [0x7f5a41c5f000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
        at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:144)
        at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
        at org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
        at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

C) One CPU core pegged at 100%. Manually triggered GC. No memory freed. Thread dump shows:

"Lucene Merge Thread #0" daemon prio=10 tid=0x7f5a3c064800 nid=0x7a74 runnable [0x7f5a41c5f000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
        at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:108)
        at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
        at org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
        at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
        at
Re: Problems with gaps removed with SynonymFilter
Unfortunately the current SynonymFilter cannot handle posInc != 1 ... we could perhaps try to fix this ... patches welcome :) So for now it's best to place SynonymFilter before StopFilter, and before any other filters that may create graph tokens (posLen > 1, posInc == 0). Mike McCandless http://blog.mikemccandless.com On Mon, Sep 23, 2013 at 2:45 AM, david.dav...@correo.aeat.es wrote: Hi, I am having a problem applying StopFilterFactory and SynonymFilterFactory. The problem is that SynonymFilter removes the gaps that were previously put in by the StopFilterFactory. I'm applying the filters at query time, because users need to change synonym lists frequently. This is my schema, and an example of the issue. String: documentacion para agentes

org.apache.solr.analysis.WhitespaceTokenizerFactory {luceneMatchVersion=LUCENE_35}
  position:      1              2     3
  term text:     documentación  para  agentes
  startOffset:   0              14    19
  endOffset:     13             18    26

org.apache.solr.analysis.LowerCaseFilterFactory {luceneMatchVersion=LUCENE_35}
  position:      1              2     3
  term text:     documentación  para  agentes
  startOffset:   0              14    19
  endOffset:     13             18    26

org.apache.solr.analysis.StopFilterFactory {words=stopwords_intranet.txt, ignoreCase=true, enablePositionIncrements=true, luceneMatchVersion=LUCENE_35}
  position:      1              3
  term text:     documentación  agentes
  startOffset:   0              19
  endOffset:     13             26

org.apache.solr.analysis.SynonymFilterFactory {synonyms=sinonimos_intranet.txt, expand=true, ignoreCase=true, luceneMatchVersion=LUCENE_35}
  position:      1              2
  term text:     documentación  agente   archivo  agentes
  type:          SYNONYM        SYNONYM  SYNONYM  SYNONYM
  startOffset:   0              19       0        19
  endOffset:     13             26       13       26

As you can see, the positions should be 1 and 3, but SynonymFilter removes the gap and moves the token from position 3 to 2. I've got the same problem with Solr 3.5 and 4.0. I don't know if it's a bug or an error in my configuration. In other schemas that I have worked with, I had always put the SynonymFilter before the StopFilter, but in this one I preferred this order because of the big number of synonyms in the list (i.e. I don't want to generate a lot of synonyms for a word that I really wanted to remove). Thanks, David Dávila Atienza AEAT - Departamento de Informática Tributaria, Subdirección de Tecnologías de Análisis de la Información e Investigación del Fraude, Área de Infraestructuras. Telephone: 915831543, Extension: 31543
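A minimal Lucene-level sketch of the workaround (4.x APIs; the synonym mapping and stopword are made-up stand-ins): run SynonymFilter before StopFilter, so stopword gaps are introduced only after synonyms have matched.

    SynonymMap.Builder b = new SynonymMap.Builder(true);
    b.add(new CharsRef("agente"), new CharsRef("agentes"), true); // hypothetical mapping
    final SynonymMap synonyms = b.build();
    final CharArraySet stops = new CharArraySet(Version.LUCENE_40, Arrays.asList("para"), true);

    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String field, Reader reader) {
        Tokenizer tok = new WhitespaceTokenizer(Version.LUCENE_40, reader);
        TokenStream ts = new LowerCaseFilter(Version.LUCENE_40, tok);
        ts = new SynonymFilter(ts, synonyms, true);        // synonyms first...
        ts = new StopFilter(Version.LUCENE_40, ts, stops); // ...then stopword gaps
        return new TokenStreamComponents(tok, ts);
      }
    };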
Re: Why solr 4.0 use FSIndexOutput to write file, otherwise MMap/NIO
Output is quite a bit simpler than input because all we do is write a single stream of bytes with no seeking (append only), and it's done with only one thread, so I don't think there'd be much to gain by using the newer IO APIs for writing... Mike McCandless http://blog.mikemccandless.com On Fri, Jun 28, 2013 at 2:23 AM, Jeffery Wang jeffery.w...@morningstar.com wrote: I have checked FSDirectory; it will create an MMapDirectory or NIOFSDirectory. These two directories only supply an IndexInput extension for reading files (MMapIndexInput extends ByteBufferIndexInput), so why is there no MMap/NIO IndexOutput extension for writing files? Only FSIndexOutput is used for writing (FSIndexOutput extends BufferedIndexOutput). Is FSIndexOutput much slower at writing files than MMap/NIO would be? How can I improve the IO write performance? Thanks, __ Jeffery Wang Application Service - Backend Morningstar (Shenzhen) Ltd.
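To illustrate the point, a small sketch of the write path (4.x APIs; the file name and path are made up): every index output is one sequential, append-only stream.

    Directory dir = FSDirectory.open(new File("/path/to/index"));
    IndexOutput out = dir.createOutput("_0_test.dat", IOContext.DEFAULT);
    out.writeVInt(42);           // strictly sequential writes, never a seek
    out.writeString("appended"); // the OS write cache absorbs these efficiently
    out.close();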
Re: TieredMergePolicy reclaimDeletesWeight
The default is 2.0, and higher values will more strongly favor merging segments with deletes. I think 20.0 is likely way too high ... maybe try 3-5? Mike McCandless http://blog.mikemccandless.com On Tue, Jun 18, 2013 at 6:46 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: Hi, In continuing a previous conversation, I am attempting to avoid optimizes on our continuously updated index in solr3.6.1, and I came across a mention of the reclaimDeletesWeight setting in this blog: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html We do a *lot* of deletes in our index, so I want to make the merges more aggressive about reclaiming deletes, but I am having trouble finding much out about this setting. Does anyone have experience with it? Would the below accomplish what I want, i.e. make it go after deletes more aggressively than normal? I got the impression 10.0 was the default from looking at this code, but I could be wrong: https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessfulBuild/clover-report/org/apache/lucene/index/TieredMergePolicy.html?id=3085

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">20</int>
  <int name="segmentsPerTier">8</int>
  <double name="reclaimDeletesWeight">20.0</double>
</mergePolicy>

Thanks, Robert (Robi) Petersen Senior Software Engineer Search Department
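At the Lucene level this knob looks like the following sketch (the 3.0/5.0 range suggested above):

    TieredMergePolicy mp = new TieredMergePolicy();
    mp.setReclaimDeletesWeight(3.0); // default is 2.0; modestly favor delete-heavy segments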
Re: TieredMergePolicy reclaimDeletesWeight
Way too high would cause it to pick highly lopsided merges just because a few deletes were removed. Highly lopsided merges (e.g. one big segment and N tiny segments) can be horrible because they can lead to O(N^2) merge cost over time. Mike McCandless http://blog.mikemccandless.com On Wed, Jun 19, 2013 at 1:36 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: OK thanks, will do. Just out of curiosity, what would having that set way too high do? Would the index become fragmented or what?
Re: Slow Highlighter Performance Even Using FastVectorHighlighter
You could also try the new[ish] PostingsHighlighter: http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html Mike McCandless http://blog.mikemccandless.com On Sat, Jun 15, 2013 at 8:50 AM, Michael Sokolov msoko...@safaribooksonline.com wrote: If you have very large documents (many MB), that can lead to slow highlighting, even with FVH. See https://issues.apache.org/jira/browse/LUCENE-3234 and try setting phraseLimit=1 (or some bigger number, but not infinite, which is the default). -Mike On 6/14/13 4:52 PM, Andy Brown wrote: Bryan, For specifics, I'll refer you back to my original email where I specified all the fields/field types/handlers I use. Here's a general overview. I really only have 3 fields that I index and search against: name, description, and content. All of which are just general text (string) fields. I have a catch-all field called text that is only used for querying. It's indexed but not stored. The name, description, and content fields are copied into the text field. For partial word matching, I have 4 more fields: name_par, description_par, content_par, and text_par. The text_par field has the same relationship to the *_par fields as text does to the others (only used for querying). Those partial word matching fields are of type text_general_partial, which I created. That field type is analyzed differently than the regular text field in that it goes through an EdgeNGramFilterFactory with minGramSize=2 and maxGramSize=7 at index time. I query against both the text and text_par fields using the edismax defType, with my qf set to "text^2 text_par^1" to give full word matches a higher score. This part returns very fast, as previously stated. It's when I turn on highlighting that I take the huge performance hit. Again, I'm using the FastVectorHighlighter. The hl.fl is set to "name name_par description description_par content content_par" so that it returns highlights for full and partial word matches. All of those fields have indexed, stored, termPositions, termVectors, and termOffsets set to true. It all seems redundant just to allow for partial word matching/highlighting, but I didn't know of a better way. Does anything stand out to you that could be the culprit? Let me know if you need any more clarification. Thanks! - Andy -Original Message- From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com] Sent: Wednesday, May 29, 2013 5:44 PM To: solr-user@lucene.apache.org Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter Andy, I don't understand why it's taking 7 secs to return highlights. The size of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to 1024 for this verification purpose and that should be more than enough. The processor is plenty powerful enough as well. Running VisualVM shows all my CPU time being taken by mainly these 3 methods: org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset() org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset() org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap() That is a strange and interesting set of things to be spending most of your CPU time on. The implication, I think, is that the number of term matches in the document for terms in your query (or, at least, terms matching exact words or the beginning of phrases in your query) is extremely high. Perhaps that's coming from this partial word match you mention -- how does that work?
-- Bryan My guess is that this has something to do with how I'm handling partial word matches/highlighting. I have setup another request handler that only searches the whole word fields and it returns in 850 ms with highlighting. Any ideas? - Andy -Original Message- From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com] Sent: Monday, May 20, 2013 1:39 PM To: solr-user@lucene.apache.org Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter My guess is that the problem is those 200M documents. FastVectorHighlighter is fast at deciding whether a match, especially a phrase, appears in a document, but it still starts out by walking the entire list of term vectors, and ends by breaking the document into candidate-snippet fragments, both processes that are proportional to the length of the document. It's hard to do much about the first, but for the second you could choose to expose FastVectorHighlighter's FieldPhraseList representation, and return offsets to the caller rather than fragments, building up your own snippets from a separate store of indexed files. This would also permit you to set stored=false, improving your memory/core size ratio, which I'm guessing could use some improving. It would require some work, and it would require you to store a
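A hedged sketch of the PostingsHighlighter route suggested above (Lucene 4.1+ APIs; the field name, doc, query and searcher are placeholders). The field must be indexed with offsets stored in the postings, which avoids term vectors entirely:

    // index time: store offsets in the postings
    FieldType ft = new FieldType(TextField.TYPE_STORED);
    ft.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    doc.add(new Field("content", text, ft));

    // search time: highlight straight from the postings
    PostingsHighlighter highlighter = new PostingsHighlighter();
    TopDocs topDocs = searcher.search(query, 10);
    String[] snippets = highlighter.highlight("content", query, searcher, topDocs);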
Re: How to recover from Error opening new searcher when machine crashed while indexing
Alas, I think CheckIndex can't do much here: there is no segments file, so you'll have to reindex from scratch. Just to check: did you ever call commit while building the index, before the machine crashed? Mike McCandless http://blog.mikemccandless.com On Tue, Apr 30, 2013 at 8:17 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, Try running the CheckIndex tool. Otis Solr ElasticSearch Support http://sematext.com/ On Apr 30, 2013 3:10 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Solr 4.0 was indexing data and the machine crashed. Any suggestions on how to recover my index, since I don't want to delete my data directory? When I try to start it again, I get this error:

ERROR 12:01:46,493 Failed to load Solr core: xyz.index1
ERROR 12:01:46,493 Cause:
ERROR 12:01:46,494 Error opening new searcher
org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:701)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:564)
        at org.apache.solr.core.CassandraCoreContainer.load(CassandraCoreContainer.java:213)
        at com.datastax.bdp.plugin.SolrCorePlugin.activateImpl(SolrCorePlugin.java:66)
        at com.datastax.bdp.plugin.PluginManager$PluginInitializer.call(PluginManager.java:161)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1290)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1402)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:675)
        ... 9 more
Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in NRTCachingDirectory(org.apache.lucene.store.NIOFSDirectory@/media/SSD/data/solr.data/rlcatalogks.prodinfo/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@d7581b; maxCacheMB=48.0 maxMergeSizeMB=4.0): files: [_73ne_nrm.cfs, _73ng_Lucene40_0.tip, _73nh_nrm.cfs, _73ng_Lucene40_0.tim, _73nf.fnm, _73n5_Lucene40_0.frq, _73ne.fdt, _73nh.fdx, _73ne_nrm.cfe, _73ne.fdx, _73ne_Lucene40_0.tim, _73ne.si, _73ni.fnm, _73nh_Lucene40_0.prx, _73ni.fdt, _73n5.si, _73ne_Lucene40_0.tip, _73nf_Lucene40_0.frq, _73nf_Lucene40_0.prx, _73nf_nrm.cfe, _73ne_Lucene40_0.frq, _73ng_Lucene40_0.prx, _73nf_Lucene40_0.tip, _73n5.fdx, _73ng_Lucene40_0.frq, _73ng.fnm, _73ni.fdx, _73n5.fnm, _73nf_Lucene40_0.tim, _73ni.si, _73n5.fdt, _73nf_nrm.cfs, _73nh_nrm.cfe, _73ni_Lucene40_0.frq, _73ng.fdx, _73ne_Lucene40_0.prx, _73nh.fnm, _73nh_Lucene40_0.tip, _73nh_Lucene40_0.tim, _73nh.si, _73n5_Lucene40_0.tip, _73ni_Lucene40_0.prx, _73n5_Lucene40_0.tim, _73nf.si, _73ng_nrm.cfe, _73n5_Lucene40_0.prx, _392j_42f.del, _73ng.fdt, _73ng.si, _73ni_nrm.cfe, _73n5_nrm.cfe, _73ni_nrm.cfs, _73nf.fdx, _73ni_Lucene40_0.tip, _73n5_nrm.cfs, _73ni_Lucene40_0.tim, _73nf.fdt, _73ne.fnm, _73nh.fdt, _73nh_Lucene40_0.frq, _73ng_nrm.cfs]

-- Thanks, -Utkarsh
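A sketch of committing periodically while building a large index, so a crash leaves a usable segments_N file behind (the interval and setup are arbitrary assumptions):

    IndexWriter writer = new IndexWriter(dir, iwc); // dir/iwc configured elsewhere
    int count = 0;
    for (Document doc : docs) {
      writer.addDocument(doc);
      if (++count % 100000 == 0) {
        writer.commit(); // fsyncs index files and writes a new segments_N
      }
    }
    writer.commit();
    writer.close();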
Re: Bloom filters and optimized vs. unoptimized indices
Be sure to test the bloom postings format on your own use case ... in my tests (heavy PK lookups) it was slower. But to answer your question: I would expect a single segment index to have much faster PK lookups than a multi-segment one, with and without the bloom postings format, but bloom may make the many-segment case faster (just be sure to test it yourself). Mike McCandless http://blog.mikemccandless.com On Tue, Apr 30, 2013 at 1:05 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, I was looking at http://lucene.apache.org/core/4_2_1/codecs/org/apache/lucene/codecs/bloom/BloomFilteringPostingsFormat.html and this piece of text: A PostingsFormat useful for low doc-frequency fields such as primary keys. Bloom filters are maintained in a .blm file which offers fast-fail for reads in segments known to have no record of the key. Is this implying that if you are doing PK lookups AND you have a large index (i.e. slow queries) it may actually be better to keep the index unoptimized, so whole index segments can be skipped? Thanks, Otis -- SOLR Performance Monitoring - http://sematext.com/spm/index.html
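A Lucene-level sketch of enabling the bloom postings format for a primary-key field (4.2-era codecs module; the "id" field name and analyzer are assumptions):

    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_42, analyzer);
    iwc.setCodec(new Lucene42Codec() {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        if ("id".equals(field)) {
          // wrap the default postings with a per-segment bloom filter (.blm file)
          return new BloomFilteringPostingsFormat(new Lucene41PostingsFormat());
        }
        return super.getPostingsFormatForField(field);
      }
    });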
Re: Document adds, deletes, and commits ... a question about visibility.
At the Lucene level, you don't have to commit before doing the deleteByQuery; i.e. 'a' will be correctly deleted without any intervening commit. Mike McCandless http://blog.mikemccandless.com On Mon, Apr 15, 2013 at 3:57 PM, Shawn Heisey s...@elyograg.org wrote: Simple question first: Is there anything in SolrJ that prevents indexing more than 500 documents in one request? I'm not aware of anything myself, but a co-worker remembers running into something, so his code is restricting them to 490 docs. The only related limit I'm aware of is the POST buffer size limit, which defaults in recent Solr versions to 2MiB. A more complex question: If I am doing both deletes and adds in separate update requests, and I want to ensure that a delete in the next request can delete a document that I am adding in the current one, do I need to commit between the two requests? This is probably more of a Lucene question than a Solr one, but Solr is what I'm using. To simplify: Let's say I start with an empty index. I add documents a and b in one request ... then I send a deleteByQuery request for a, c and e. If I don't do a commit between these two requests, will a still be in the index when I commit after the second request? If so, would there be an easy fix? Thanks, Shawn
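A minimal Lucene-level sketch of that point (field names are assumptions; writer is an open IndexWriter): the delete sees the uncommitted add.

    Document a = new Document();
    a.add(new StringField("id", "a", Field.Store.YES));
    Document b = new Document();
    b.add(new StringField("id", "b", Field.Store.YES));
    writer.addDocument(a);
    writer.addDocument(b);
    writer.deleteDocuments(new TermQuery(new Term("id", "a"))); // deletes the just-added "a", no commit in between
    writer.commit(); // only "b" is searchable afterwards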
Re: Is Lucene's DrillSideways something suitable for Solr?
On Tue, Mar 12, 2013 at 11:24 PM, Yonik Seeley yo...@lucidworks.com wrote: On Tue, Mar 12, 2013 at 10:27 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: Lucene seems to get a new DrillSideways functionality on top of its own facet implementation. I would love to have something like that in Solr Solr has had multi-select faceting for 4 years now. My understanding of DrillSideways is that it implements the same type of thing for Lucene faceting module (which Solr doesn't use). There's implementation (which DrillSideways is), and interface (which for Solr means tagging / excluding filters). If you have any ideas around improving the Solr interface for multi-select faceting, please share them! Actually DrillSideways is independent of multi-select. Ie, it's useful to have the sideways counts for a drill-down field, whether your UI offers single or multi select for a given dimension. DrillSideways.java is a different implementation (minShouldMatch=N-1 query w/ custom collector to separate hit from near-miss) than Solr (tagging/excluding filters), and also a different interface. Mike McCandless http://blog.mikemccandless.com
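For a feel of the Lucene-side interface, a sketch against the facet module's later (FacetsConfig-era) API shape; the dimension name and readers are placeholders:

    DrillDownQuery q = new DrillDownQuery(config);      // config: a FacetsConfig
    q.add("Publish Date", "2012");                      // drill down on one dimension
    DrillSideways ds = new DrillSideways(searcher, config, taxoReader);
    DrillSideways.DrillSidewaysResult result = ds.search(q, 10);
    TopDocs hits = result.hits;                         // hits constrained by the drill-down
    FacetResult sideways = result.facets.getTopChildren(10, "Publish Date"); // counts as if this dim weren't filtered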
Re: AW: 170G index, 1.5 billion documents, out of memory on query
It really should be unlimited: this setting has nothing to do with how much RAM is on the computer. See http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html Mike McCandless http://blog.mikemccandless.com On Tue, Feb 26, 2013 at 12:18 PM, zqzuk ziqizh...@hotmail.co.uk wrote: Hi, sorry, I couldn't do this directly... the way I do this is by subscribing to a cluster of computers in our organisation and sending the job with the required memory. It gets randomly allocated to a node (one single server in the cluster) once executed, and it is not possible to connect to that specific node to check. But I'm pretty sure it won't be unlimited; it will match the figure I requested, which was 40G (the max memory on a single node is 48G anyway). So Solr only gets a maximum of 40G of memory for this index. -- View this message in context: http://lucene.472066.n3.nabble.com/170G-index-1-5-billion-documents-out-of-memory-on-query-tp4042696p4043110.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: AnalyzingSuggester returning index value instead of field value?
I'm not very familiar with how AnalyzingSuggester works inside Solr ... if you try this directly with the Lucene APIs, does it still happen? Hmm, maybe one idea: if you remove whitespace from your suggestion, does it work? I wonder if there's a whitespace / multi-token issue ... if so then maybe see how TestPhraseSuggestions.java (in Solr) does this? Mike McCandless http://blog.mikemccandless.com On Thu, Feb 7, 2013 at 9:48 AM, Sebastian Saip sebastian.s...@gmail.com wrote: I'm looking into a way to implement an autosuggest for my special needs (I'm doing a startsWith search that should retrieve the full name, which may have accents - however, I want to search with/without accents and in any upper/lowercase for comfort). Here's part of my configuration: http://pastebin.com/20vSGJ1a So I have a name "Têst Námè" and I query for "test", "tést", "TÈST", or similar. This gives me back "test name" as a suggestion, which looks like the indexed form, rather than the actual value. Furthermore, when I fed the document without index-analyzers, then added the index-analyzers, restarted without refeeding and queried, it returned the right value (so this seems to retrieve the indexed term, rather than the actual stored value?). Or maybe I just configured it the wrong way :? There's not really much documentation about this yet :( BR Sebastian Saip
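To try it directly with the Lucene APIs, a sketch (4.x suggest module; the file name is a placeholder, and a real analyzer chain would fold accents/case -- StandardAnalyzer here is just for illustration). AnalyzingSuggester analyzes the keys it is built from but stores and returns the original surface forms:

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
    AnalyzingSuggester sugg = new AnalyzingSuggester(analyzer);
    sugg.build(new PlainTextDictionary(new FileReader("names.txt"))); // one surface form per line, e.g. "Têst Námè"
    for (Lookup.LookupResult r : sugg.lookup("test", false, 5)) {
      System.out.println(r.key); // should print the original surface form, not the analyzed term
    }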
Re: get a list of terms sorted by total term frequency
Lucene's misc module has HighFreqTerms tool. Mike McCandless http://blog.mikemccandless.com On Wed, Nov 7, 2012 at 1:15 PM, Edward Garrett heacu.mcint...@gmail.com wrote: hi, is there a simple way to get a list of all terms that occur in a field sorted by their total term frequency within that field? TermsComponent (http://wiki.apache.org/solr/TermsComponent) provides fast field faceting over the whole index, but as counts it gives the number of documents that each term occurs in (given a field or set of fields). in place of document counts, i want total term frequency counts. the ttf function (http://wiki.apache.org/solr/FunctionQuery#totaltermfreq) provides this, but only if you know what term to pass to the function. edward
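For example (a sketch; jar names, index path and field are placeholders -- with -t the tool orders terms by totalTermFreq rather than docFreq):

    java -cp lucene-core.jar:lucene-misc.jar org.apache.lucene.misc.HighFreqTerms /path/to/index -t 100 myfield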
Re: throttle segment merging
With Lucene 4.0, FSDirectory now supports merge bytes/sec throttling (FSDirectory.setMaxMergeWriteMBPerSec): it rate-limits the max bytes/sec load on the IO system due to merging. Not sure if it's been exposed in Solr / ElasticSearch yet ... Mike McCandless http://blog.mikemccandless.com On Mon, Oct 29, 2012 at 7:07 AM, Tomás Fernández Löbbe tomasflo...@gmail.com wrote: Is there a way to set up logging to output something when segment merging runs? I think segment merging is logged when you enable infoStream logging (you should see it commented in the solrconfig.xml). No, segment merging is not logged at info level; it needs a customized log config. INFO level is not the same as infoStream. See solrconfig: there is a commented section that talks about it, and if you uncomment it, it will generate a file with low-level Lucene logging. This file will include segment information, including merging. Can segment merges be throttled? You can change when and how segments are merged with the merge policy; maybe changing the initial settings (mergeFactor for example) is enough for you? I am now researching elasticsearch; it can do it, and it's Lucene 3.6 based. I don't know if this is what you are looking for, but the TieredMergePolicy (the default) allows you to set the maximum number of segments to be merged at once and the maximum size of segments to be created during normal merging. Another option is, as you said, to create a Jira for a new merge policy. Tomás
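A Lucene-level sketch of the throttle named above (the path and rate are assumptions):

    FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
    dir.setMaxMergeWriteMBPerSec(5.0); // cap merge write load at ~5 MB/sec
    IndexWriter writer = new IndexWriter(dir, iwc);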
Re: Indexing in Solr: invalid UTF-8
Python's unicode function takes an optional (keyword) errors argument, telling it what to do when an invalid UTF8 byte sequence is seen. The default (errors='strict') is to throw the exceptions you're seeing. But you can also pass errors='replace' or errors='ignore'. See http://docs.python.org/howto/unicode.html for details ... However, I agree with Robert: you should dig into why whatever process you used to extract the full text from your binary documents is producing invalid UTF-8 ... something is wrong with that process. Mike McCandless http://blog.mikemccandless.com On Tue, Sep 25, 2012 at 10:44 PM, Robert Muir rcm...@gmail.com wrote: On Tue, Sep 25, 2012 at 2:02 PM, Patrick Oliver Glauner patrick.oliver.glau...@cern.ch wrote: Hi Thanks. But I see that 0xd835 is missing in this list (see my exceptions). What's the best way to get rid of all of them in Python? I am new to unicode in Python but I am sure that this use case is quite frequent. I don't really know python either: so I could be wrong here but are you just taking these binary .PDF and .DOC files and treating them as UTF-8 text and sending them to Solr? If so, I don't think that will work very well. Maybe instead try parsing these binary files with something like Tika to get at the actual content and send that? (it seems some people have developed python integration for this, e.g. http://redmine.djity.net/projects/pythontika/wiki)
Re: offsets issues with multiword synonyms since LUCENE_33
See also SOLR-3390. Some cases have been addressed. Eg, if you match "domain name system" -> "dns", then "dns" will have correct offsets spanning the full phrase "domain name system" in the input. (However: QueryParser won't work, because a query for "domain name system" is pre-split on whitespace, so the synonym never matches.) But for the reverse case, which I call expanding (ie, match "dns" -> "domain name system"), the results are not correct (or at least different from the previous SynFilter impl): the three tokens are overlapped onto subsequent tokens, resulting in highlighting the wrong tokens. However, QueryParser will work correctly for the query "domain name system"... But, I'd like to ask: why do apps want to expand (replace a match with more than one token, ie the "dns" -> "domain name system" case)? Is it ONLY because of QueryParser's limitation (that it pre-splits on whitespace)? Or are there other realistic use cases? Mike McCandless http://blog.mikemccandless.com On Tue, Aug 14, 2012 at 11:53 AM, Marc Sturlese marc.sturl...@gmail.com wrote: Has someone noticed this problem and solved it somehow? (without using LUCENE_33 in the solrconfig.xml) https://issues.apache.org/jira/browse/LUCENE-3668 Thanks in advance -- View this message in context: http://lucene.472066.n3.nabble.com/offsets-issues-with-multiword-synonyms-since-LUCENE-33-tp4001195.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: synonym file
Actually, FST (and the SynFilter based on it) was backported to 3.x. Mike McCandless http://blog.mikemccandless.com On Fri, Aug 3, 2012 at 11:28 AM, Jack Krupansky j...@basetechnology.com wrote: The Lucene FST guys made a big improvement in synonym filtering in Lucene/Solr 4.0 using FSTs. Or are you already using that? Or, if you are stuck with pre-4.0, you could do a preprocessor that efficiently generates boolean queries for the synonym expansions. That should give you more decent query times, assuming you develop a decent synonym lookup filter. Maybe you could backport the 4.0 FST code, or at least use the same techniques for your own preprocessor. -- Jack Krupansky -Original Message- From: Peyman Faratin Sent: Friday, August 03, 2012 12:56 AM To: solr-user@lucene.apache.org Subject: synonym file Hi, I have a (23M) synonym file that takes a long time (3 or so minutes) to load and, once included, seems to adversely affect the QTime of the application by approximately 4 orders of magnitude. Any advice on how to load it faster and lower the QTime would be much appreciated. best Peyman
Re: Near Real Time Indexing and Searching with solr 3.6
Hi, You might want to take a look at Solr's trunk (very soon to be the 4.0.0 alpha release), which already has a near-real-time solution (using Lucene's near-real-time APIs). Lucene has NRTCachingDirectory (to use RAM for small / recently flushed segments), but I don't think Solr uses it yet. Mike McCandless http://blog.mikemccandless.com On Tue, Jul 3, 2012 at 4:02 AM, thomas tho...@codemium.com wrote: Hi, As part of my bachelor thesis I'm trying to achieve NRT with Solr 3.6. I've come up with a basic concept and would be thrilled if I could get some feedback. The main idea is to use two different indexes: one persistent on disk and one in RAM. The plan is to route every added and modified document to the RAM index (http://imgur.com/kLfUN). After a certain period of time, this index would get cleared and the documents would get added to the persistent index. Some major problems I still have with this idea are: - deletions of documents in the persistent index - having the same unique IDs in both the RAM index and the persistent index, as a result of an updated document - merging search results to filter out old versions of updated documents Would such an idea be viable to pursue? Thanks for your time
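A sketch of the Lucene-level NRT pieces mentioned above (trunk/4.0 APIs; paths and sizes are assumptions):

    Directory fsDir = FSDirectory.open(new File("/path/to/index"));
    NRTCachingDirectory dir = new NRTCachingDirectory(fsDir, 5.0, 60.0); // keep small flushed segments in RAM
    IndexWriter writer = new IndexWriter(dir, iwc);
    DirectoryReader reader = DirectoryReader.open(writer, true); // NRT reader, no commit needed
    // ... later, cheaply refresh to see new docs:
    DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, true);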
Re: leap second bug
Looks like this is a low-level Linux issue ... see Shay's email to the ElasticSearch list about it: https://groups.google.com/forum/?fromgroups#!topic/elasticsearch/_I1_OfaL7QY Also see the comments here: http://news.ycombinator.com/item?id=4182642 Mike McCandless http://blog.mikemccandless.com On Sun, Jul 1, 2012 at 8:08 AM, Óscar Marín Miró oscarmarinm...@gmail.com wrote: Hello Michael, thanks for the note :) I'm having a similar problem since yesterday; the tomcats are wild on CPU [near 100%]. Did your solr servers stop replying to index/query requests? Thanks :) On Sun, Jul 1, 2012 at 1:22 PM, Michael Tsadikov mich...@myheritage.com wrote: Our solr servers went into GC hell, and became non-responsive on the date change today. Restarting tomcats did not help. Rebooting the machine did. http://www.wired.com/wiredenterprise/2012/07/leap-second-bug-wreaks-havoc-with-java-linux/ -- Whether it's science, technology, personal experience, true love, astrology, or gut feelings, each of us has confidence in something that we will never fully comprehend. --Roy H. William
Re: Exception when optimizing index
Is it possible the Linux machine has bad RAM / a bad disk? Mike McCandless http://blog.mikemccandless.com On Mon, Jun 18, 2012 at 7:06 AM, Erick Erickson erickerick...@gmail.com wrote: Is it possible that you somehow have some problem with jars and classpath? I'm wondering because this problem really seems odd, and you've eliminated a bunch of possibilities. I'm wondering if you've somehow gotten some old jars mixed in the bunch. Or, alternately, what about re-installing Solr, on the theory that somehow you got a bad download and/or files (i.e. the Solr jar files) got corrupted, or your disk has a bad spot... Really clutching at straws here. Erick On Mon, Jun 18, 2012 at 3:44 AM, Rok Rejc rokrej...@gmail.com wrote: Hi all, during the last few days I have created a solr instance on a windows environment - the same Solr as on the linux machine (solr 4.0 from 9th June 2012), the same solr configuration, Tomcat 6, Java 6u23. I have also upgraded Java on the linux machine (1.7.0_05-b05 from Oracle). Import and optimize on the windows machine worked without any issue, but on the linux machine optimize fails with the same exception:

Caused by: java.io.IOException: Invalid vInt detected (too many bits)
        at org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:217)
        ...

After that I also changed the directory factory (on the linux machine) to SimpleFSDirectoryFactory. I have reindexed all the documents and run the optimize again - it fails again with the same exception. As a next step I could maybe do partial insertions (which will be a painful process), but after that I'm out of ideas (and out of time for experimenting). Many thanks for further suggestions. Rok On Wed, Jun 13, 2012 at 1:31 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Jun 7, 2012 at 5:50 AM, Rok Rejc rokrej...@gmail.com wrote: java.runtime.name = OpenJDK Runtime Environment, java.runtime.version = 1.6.0_22-b22 ... As far as I can see from the JIRA issue, I have the patch applied (as mentioned, I have a trunk version from May 12). Any ideas? It's not guaranteed that the patch will work around all hotspot bugs related to http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5091921 Since you can reproduce this, is it possible for you to re-test the scenario with a newer JVM (e.g. 1.7.0_04), just to rule that out? -- lucidimagination.com
Re: field name was indexed without position data; cannot run PhraseQuery (term=a)
This behavior has changed. In 3.x, you silently got no results in such cases. In trunk, you get an exception notifying you that the query cannot run. Mike McCandless http://blog.mikemccandless.com On Thu, May 24, 2012 at 6:04 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, What is the intended behaviour for explicit phrase queries on fields without position data? If an (e)dismax qf parameter includes a field with omitTermFreqAndPositions=true, explicit user phrase queries throw the following error on trunk but not on the 3x branch.

java.lang.IllegalStateException: field "name" was indexed without position data; cannot run PhraseQuery (term=a)
        at org.apache.lucene.search.PhraseQuery$PhraseWeight.scorer(PhraseQuery.java:274)
        at org.apache.lucene.search.DisjunctionMaxQuery$DisjunctionMaxWeight.scorer(DisjunctionMaxQuery.java:160)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:589)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:280)
        at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1518)
        at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1265)
        at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:384)
        at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:411)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1555)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
        at org.eclipse.jetty.server.Server.handle(Server.java:351)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
        at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
        at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
        at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
        at java.lang.Thread.run(Thread.java:662)

Thanks
Re: field name was indexed without position data; cannot run PhraseQuery (term=a)
I believe termPositions=false refers to the term vectors, not to how the field is indexed (which is very confusing, I think...). I think you'll need to index a separate field with term freqs + positions disabled, alongside the field the query parser queries. But ... if all of this is just to do custom scoring ... can't you just set a custom similarity for the field and index it normally (with term freqs + positions)? Mike McCandless http://blog.mikemccandless.com On Thu, May 24, 2012 at 6:47 AM, Markus Jelsma markus.jel...@openindex.io wrote: Thanks! How can we, in that case, omit term frequency for a qf field? I assume the way to go is to configure a custom flat term-frequency similarity for that field. And how can it be that this error is not thrown with termPositions=false for that field, but only with omitTermFreqAndPositions? Markus
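A sketch of the custom-similarity route suggested above (4.x API): keep positions indexed so PhraseQuery works, but make term frequency contribute nothing to the score.

    // assumes org.apache.lucene.search.similarities.DefaultSimilarity
    public class FlatTfSimilarity extends DefaultSimilarity {
      @Override
      public float tf(float freq) {
        return freq > 0 ? 1f : 0f; // ignore how often the term occurs in the doc
      }
    }

If I remember right, Solr lets you wire a Similarity per field type via a <similarity class="..."/> element in the fieldType, so only the one field gets the flat-tf scoring.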
Re: question about NRT(soft commit) and Transaction Log in trunk
This is a good question... I don't know much about how Solr's transaction log works but, peeking in the code, I do see it fsync'ing (look in TransactionLog.java, in the finish method), but only if the SyncLevel is FSYNC. If the default is really flush, I don't see how the transaction log helps on recovery...? Should we change the default to FSYNC? Mike McCandless http://blog.mikemccandless.com On Sat, Apr 28, 2012 at 7:11 AM, Li Li fancye...@gmail.com wrote: hi, I checked out the trunk and played with its new soft commit feature. It's cool. But I've got a few questions about it. From reading some introductory articles and the wiki, plus some hasty code reading, my understanding of its implementation is: For a normal commit (hard commit), we flush everything to disk and commit it. The flush is not very time consuming because of the OS-level cache; the most time-consuming part is the sync in the commit process. A soft commit just flushes postings and pending deletions to disk, generating new segments. Then solr can use a new searcher to read the latest index, warm up, and then register itself. If there is no hard commit and the jvm crashes, new data may be lost. If my understanding is correct, then why do we need the transaction log? I found that in DirectUpdateHandler2, every time a command is executed, TransactionLog will record a line in the log. But the default sync level in RunUpdateProcessorFactory is flush, which means it will not sync the log file. Does this make sense? In database implementations, we usually write the log and modify data in memory, because the log is smaller than the real data. If the process crashes, we can redo the unfinished log entries and make the data correct. Will Solr leverage this log like that? If so, why is it not synced?
Re: SOLR 3.5 Index Optimization not producing single .cfs file
By default, the default merge policy (TieredMergePolicy) won't create the CFS if the segment is very large (> 10% of the total index size). Likely that's what you are seeing? If you really must have a CFS (how come?) then you can call TieredMergePolicy.setNoCFSRatio(1.0) -- not sure how/where this is exposed in Solr though. LogMergePolicy also has the same behaviour/method... Mike McCandless http://blog.mikemccandless.com On Thu, May 3, 2012 at 5:18 AM, pravesh suyalprav...@yahoo.com wrote: Hi, I've migrated the search servers to the latest stable release (SOLR-3.5) from SOLR-1.4.1. We've fully recreated the index for this. After indexing completes, when I'm optimizing the index, it is not merging the index into a single .cfs file as was being done with the 1.4.1 version. We've set <useCompoundFile>true</useCompoundFile>. Is it something related to the new MergePolicy being used with SOLR 3.x onwards (I suppose it is TieredMergePolicy with the 3.x version)? If yes, should I change it to LogByteSizeMergePolicy? Does this change require a complete rebuild, or will it apply incrementally? Regards, Pravesh -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-3-5-Index-Optimization-not-producing-single-cfs-file-tp3958619.html Sent from the Solr - User mailing list archive at Nabble.com.
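At the Lucene level the knob looks like this sketch (iwc is an IndexWriterConfig set up elsewhere):

    TieredMergePolicy mp = new TieredMergePolicy();
    mp.setNoCFSRatio(1.0); // always write CFS, no matter how large the merged segment
    iwc.setMergePolicy(mp);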
Re: Large Index and OutOfMemoryError: Map failed
Is it possible you are hitting this (just opened) Solr issue?: https://issues.apache.org/jira/browse/SOLR-3392 Mike McCandless http://blog.mikemccandless.com On Fri, Apr 20, 2012 at 9:33 AM, Gopal Patwa gopalpa...@gmail.com wrote: We cannot avoid auto soft commit, since we need Lucene's NRT feature. And I use StreamingUpdateSolrServer for adding/updating the index. On Thu, Apr 19, 2012 at 7:42 AM, Boon Low boon@brightsolid.com wrote: Hi, I also came across this error recently, while indexing with 10 DIH processes in parallel + the default index settings. The JVM grinds to a halt and throws this error. Checking the index of a core reveals thousands of files! Tuning the default autocommit from 15000ms to 90ms solved the problem for us (no 'autosoftcommit'). Boon - Boon Low Search UX and Engine Developer brightsolid Online Publishing On 14 Apr 2012, at 17:40, Gopal Patwa wrote: I checked, it was MMapDirectory.UNMAP_SUPPORTED=true, and below is my system data. Is there any existing test case to reproduce this issue? I am trying to understand how I can reproduce it with a unit/integration test. I will try a recent solr trunk build too. If it is some bug in solr or lucene keeping an old searcher open, then how do I reproduce it?

SYSTEM DATA
===========
PROCESSOR: Intel(R) Xeon(R) CPU E5504 @ 2.00GHz
SYSTEM ID: x86_64
CURRENT CPU SPEED: 1600.000 MHz
CPUS: 8 processor(s)
MEMORY: 49449296 kB
DISTRIBUTION: CentOS release 5.3 (Final)
KERNEL NAME: 2.6.18-128.el5
UPTIME: up 71 days
LOAD AVERAGE: 1.42, 1.45, 1.53
JBOSS Version: Implementation-Version: 4.2.2.GA (build: SVNTag=JBoss_4_2_2_GA date=20
JAVA Version: java version 1.6.0_24

On Thu, Apr 12, 2012 at 3:07 AM, Michael McCandless luc...@mikemccandless.com wrote: Your largest index has 66 segments (690 files) ... biggish but not insane. With 64K maps you should be able to have ~47 searchers open on each core. Enabling compound file format (not the opposite!) will mean fewer maps ... ie it should improve this situation. I don't understand why Solr defaults to compound file off... that seems dangerous. Really we need a Solr dev here... to answer how long a stale searcher is kept open. Is it somehow possible 46 old searchers are being left open...? I don't see any other reason why you'd run out of maps. Hmm, unless MMapDirectory didn't think it could safely invoke unmap in your JVM. Which exact JVM are you using? If you can print the MMapDirectory.UNMAP_SUPPORTED constant, we'd know for sure. Yes, switching away from MMapDir will sidestep the "too many maps" issue; however, 1) MMapDir has better perf than NIOFSDir, and 2) if there really is a leak here (Solr not closing the old searchers, or a Lucene bug, or something...) then you'll eventually run out of file descriptors (ie, same problem, different manifestation). Mike McCandless http://blog.mikemccandless.com 2012/4/11 Gopal Patwa gopalpa...@gmail.com: I have not changed the mergeFactor; it was 10. Compound file format is disabled in my config, but I read in the post below that someone had a similar issue which was resolved by switching from the compound index file format to non-compound index files, and some folks resolved it by changing the lucene code to disable MMapDirectory. Is this best practice, and if so, can it be done in configuration?
http://lucene.472066.n3.nabble.com/MMapDirectory-failed-to-map-a-23G-compound-index-segment-td3317208.html I have index documents of core1 = 5 million, core2 = 8 million and core3 = 3 million, and all indexes are hosted in a single Solr instance. I am going to use Solr for our site StubHub.com; see the attached ls -l listing of index files for all cores. SolrConfig.xml:

<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>1</maxFieldLength>
  <ramBufferSizeMB>4096</ramBufferSizeMB>
  <maxThreadStates>10</maxThreadStates>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>1</commitLockTimeout>
  <lockType>single</lockType>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <double name="forceMergeDeletesPctAllowed">0.0</double>
    <double name="reclaimDeletesWeight">10.0</double>
  </mergePolicy>
  <deletionPolicy class="solr.SolrDeletionPolicy">
    <str name="keepOptimizedOnly">false</str>
    <str name="maxCommitsToKeep">0</str>
  </deletionPolicy>
</indexDefaults>
<updateHandler class="solr.DirectUpdateHandler2">
  <maxPendingDeletes>1000</maxPendingDeletes>
  <autoCommit>
    <maxTime>90</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>
Re: Large Index and OutOfMemoryError: Map failed
commit error...:
java.io.IOException: Map failed
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:748)
        at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(MMapDirectory.java:293)
        at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:221)
        at org.apache.lucene.codecs.lucene40.Lucene40PostingsReader.<init>(Lucene40PostingsReader.java:58)
        at org.apache.lucene.codecs.lucene40.Lucene40PostingsFormat.fieldsProducer(Lucene40PostingsFormat.java:80)
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader$1.visitOneFormat(PerFieldPostingsFormat.java:189)
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$VisitPerFieldFile.<init>(PerFieldPostingsFormat.java:280)
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader$1.<init>(PerFieldPostingsFormat.java:186)
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:186)
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:256)
        at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:108)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:51)
        at org.apache.lucene.index.IndexWriter$ReadersAndLiveDocs.getReader(IndexWriter.java:494)
        at org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:214)
        at org.apache.lucene.index.IndexWriter.applyAllDeletes(IndexWriter.java:2939)
        at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:2930)
        at org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2681)
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2804)
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2786)
        at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:391)
        at org.apache.solr.update.CommitTracker.run(CommitTracker.java:197)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at ... [Message clipped]

-- From: Michael McCandless luc...@mikemccandless.com Date: Sat, Mar 31, 2012 at 3:15 AM To: solr-user@lucene.apache.org It's the virtual memory limit that matters; yours says unlimited below (good!), but are you certain that's really the limit your Solr process runs with? On Linux, there is also a per-process map count: cat /proc/sys/vm/max_map_count I think it typically defaults to 65,536, but you should check on your env. If a process tries to map more than this many regions, you'll hit that exception. I think you can: cat /proc/pid/maps | wc to see how many maps your Solr process currently has... if that is anywhere near the limit then it could be the cause. Mike McCandless http://blog.mikemccandless.com On Sat, Mar 31, 2012 at 1:26 AM, Gopal Patwa gopalpa...@gmail.com wrote: I need help!! I am using a Solr 4.0 nightly build with NRT, and I often get this error during auto commit: java.lang.OutOfMemoryError: Map failed. I have searched this forum and what I found is that it is related to the OS ulimit setting; please see my ulimit settings below. I am not sure what ulimit setting I should have?
and we also get **java.net.SocketException:* *Too* *many* *open* *files NOT sure how many open file we need to set?* I have 3 core with index size : core1 - 70GB, Core2 - 50GB and Core3 - 15GB, with Single shard * * *We update the index every 5 seconds, soft commit every 1 second and hard commit every 15 minutes* * * *Environment: Jboss 4.2, JDK 1.6 , CentOS, JVM Heap Size = 24GB* * * ulimit: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 401408 max locked memory (kbytes, -l) 1024 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 10240 cpu time (seconds, -t) unlimited max user processes (-u) 401408 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited * * *ERROR:* * * *2012-03-29* *15:14:08*,*560* [] *priority=ERROR* *app_name=* *thread=pool-3-thread-1* *location=CommitTracker* *line=93* *auto* *commit* *error...:java.io.IOException:* *Map* *failed* *at* *sun.nio.ch.FileChannelImpl.map*(*FileChannelImpl.java:748*) *at* *org.apache.lucene.store.MMapDirectory$MMapIndexInput.**init*(*MMapDirectory.java:293*) *at* *org.apache.lucene.store.MMapDirectory.openInput*(*MMapDirectory.java:221
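To make the per-process map count check above concrete, here is a minimal Java sketch that compares a JVM's current mapped-region count against the Linux limit Mike mentions. It assumes a Linux /proc filesystem; the class name is illustrative.

    import java.io.BufferedReader;
    import java.io.FileReader;

    // Minimal sketch: report how close this JVM is to vm.max_map_count.
    public class MapCountCheck {
      public static void main(String[] args) throws Exception {
        BufferedReader limitReader = new BufferedReader(new FileReader("/proc/sys/vm/max_map_count"));
        int limit = Integer.parseInt(limitReader.readLine().trim());
        limitReader.close();

        int maps = 0;
        BufferedReader mapsReader = new BufferedReader(new FileReader("/proc/self/maps"));
        while (mapsReader.readLine() != null) {
          maps++;  // one line per mapped region
        }
        mapsReader.close();

        System.out.println(maps + " of " + limit + " allowed maps in use");
      }
    }

Run it inside the same JVM as Solr (or read /proc/<pid>/maps for the Solr process instead of /proc/self/maps) to see whether the commit-time mmap failures line up with the limit.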
Re: codecs for sorted indexes
Do you mean you are pre-sorting the documents (by what criteria?) yourself, before adding them to the index? In which case... you should already be seeing some benefit (smaller index size) compared to adding them randomly (ie the vInts should take fewer bytes), I think. (Probably the savings would be greater for better intblock codecs like PForDelta, SimpleX, but I'm not sure...). Or do you mean having a codec re-sort the documents (on flush/merge)? I think this should be possible w/ the Codec API... but nobody has tried it yet that I know of. Note that the bulkpostings branch is effectively dead (nobody is iterating on it, and we've removed the old bulk API from trunk), but there is likely a GSoC project to add a PForDelta codec to trunk: https://issues.apache.org/jira/browse/LUCENE-3892 Mike McCandless http://blog.mikemccandless.com On Thu, Apr 12, 2012 at 6:13 AM, Carlos Gonzalez-Cadenas c...@experienceon.com wrote: Hello, We're using a sorted index in order to implement early termination efficiently over an index of hundreds of millions of documents. As of now, we're using the default codecs coming with Lucene 4, but we believe that due to the fact that the docids are sorted, we should be able to do much better in terms of storage and achieve much better performance, especially decompression performance. In particular, Robert Muir is commenting on these lines here: https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411 We're aware that in the bulkpostings branch there are different codecs being implemented and different experiments being done. We don't know whether we should implement our own codec (i.e. using some RLE-like techniques) or we should use one of the codecs implemented there (PFOR, Simple64, ...). Can you please give us some advice on this? Thanks Carlos Carlos Gonzalez-Cadenas CEO, ExperienceOn - New generation search http://www.experienceon.com Mobile: +34 652 911 201 Skype: carlosgonzalezcadenas LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
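For readers wondering what early termination over a docid-sorted index looks like in code: a rough sketch against Lucene 3.x's Collector API. The sentinel exception is our own idiom (Lucene of this era has no official early-termination hook), and maxHits and the wrapped collector are illustrative.

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    class EarlyTerminatingCollector extends Collector {
      static class DoneException extends RuntimeException {}

      private final Collector delegate;
      private final int maxHits;
      private int count;

      EarlyTerminatingCollector(Collector delegate, int maxHits) {
        this.delegate = delegate;
        this.maxHits = maxHits;
      }

      @Override public void setScorer(Scorer scorer) throws IOException {
        delegate.setScorer(scorer);
      }

      @Override public void collect(int doc) throws IOException {
        delegate.collect(doc);
        if (++count >= maxHits) {
          throw new DoneException();  // caller catches this and keeps the partial results
        }
      }

      @Override public void setNextReader(IndexReader reader, int docBase) throws IOException {
        delegate.setNextReader(reader, docBase);
      }

      @Override public boolean acceptsDocsOutOfOrder() {
        return false;  // docid order matters when the index is pre-sorted by rank
      }
    }

The caller wraps searcher.search(query, collector) in a try/catch for DoneException and then reads the (partial) top hits from the delegate collector.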
Re: Large Index and OutOfMemoryError: Map failed
Hi, 65K is already a very large number and should have been sufficient... However: have you increased the merge factor? Doing so increases the open files (maps) required. Have you disabled compound file format? (Hmmm: I think Solr does so by default... which is dangerous). Maybe try enabling compound file format? Can you ls -l your index dir and post the results? It's also possible Solr isn't closing the old searchers quickly enough ... I don't know the details on when Solr closes old searchers... Mike McCandless http://blog.mikemccandless.com On Tue, Apr 10, 2012 at 11:35 PM, Gopal Patwa gopalpa...@gmail.com wrote: Michael, Thanks for the response. It was 65K, as you mentioned, the default value of cat /proc/sys/vm/max_map_count. How do we determine what value this should be? Is it the number of documents during a hard commit (in my case every 15 minutes)? Or is it the number of index files, or the number of documents we have across all cores? I have raised the number to 140K but I still get the error when it reaches 140K, and we have to restart the JBoss server to free up the map count; sometimes the OOM error happens during "Error opening new searcher". Is making this number unlimited the only solution? Error log:
    location=CommitTracker line=93 auto commit error...:org.apache.solr.common.SolrException: Error opening new searcher
      at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138)
      at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251)
      at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:409)
      at org.apache.solr.update.CommitTracker.run(CommitTracker.java:197)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      at java.lang.Thread.run(Thread.java:662)
    Caused by: java.io.IOException: Map failed
      at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:748)
      at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(MMapDirectory.java:293)
      at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:221)
      at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$VisitPerFieldFile.<init>(PerFieldPostingsFormat.java:262)
      at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$1.<init>(PerFieldPostingsFormat.java:316)
      at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.files(PerFieldPostingsFormat.java:316)
      at org.apache.lucene.codecs.Codec.files(Codec.java:56)
      at org.apache.lucene.index.SegmentInfo.files(SegmentInfo.java:423)
      at org.apache.lucene.index.SegmentInfo.sizeInBytes(SegmentInfo.java:215)
      at org.apache.lucene.index.IndexWriter.prepareFlushedSegment(IndexWriter.java:2220)
      at org.apache.lucene.index.DocumentsWriter.publishFlushedSegment(DocumentsWriter.java:497)
      at org.apache.lucene.index.DocumentsWriter.finishFlush(DocumentsWriter.java:477)
      at org.apache.lucene.index.DocumentsWriterFlushQueue$SegmentFlushTicket.publish(DocumentsWriterFlushQueue.java:201)
      at org.apache.lucene.index.DocumentsWriterFlushQueue.innerPurge(DocumentsWriterFlushQueue.java:119)
      at org.apache.lucene.index.DocumentsWriterFlushQueue.tryPurge(DocumentsWriterFlushQueue.java:148)
      at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:438)
      at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:553)
      at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:354)
      at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:258)
      at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:243)
      at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:250)
      at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1091)
      ... 11 more
    Caused by: java.lang.OutOfMemoryError: Map failed
      at sun.nio.ch.FileChannelImpl.map0(Native Method)
      at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:745)
And one more issue we came across, i.e. On Sat, Mar 31, 2012 at 3:15 AM, Michael McCandless luc...@mikemccandless.com wrote: It's the virtual memory limit that matters; yours says unlimited below (good!), but are you certain that's really the limit your Solr process
Re: Virtual Memory very high
Are you seeing a real problem here, besides just being alarmed by the big numbers from top? Consumption of virtual memory by itself is basically harmless, as long as you're not running up against any of the OS limits (and you're running a 64 bit JVM). This is just top telling you that you've mapped large files into the virtual memory space. It's not telling you that you don't have any RAM left... virtual memory is different from RAM. In my tests, generally MMapDirectory gives faster search performance than NIOFSDirectory... so unless there's an actual issue, I would recommend sticking with MMapDirectory. Mike McCandless http://blog.mikemccandless.com On Fri, Dec 9, 2011 at 11:54 PM, Rohit ro...@in-rev.com wrote: Hi All, I don't know if this question is directly related to this forum; I am running Solr in Tomcat on a Linux server. The moment I start Tomcat, the virtual memory shown using the top command goes to its max, 31.1G, and then remains there. Is this the right behaviour? Why is the virtual memory usage so high? I have 36GB of RAM on the server.
    Tasks: 309 total, 1 running, 308 sleeping, 0 stopped, 0 zombie
    Cpu(s): 19.1%us, 0.2%sy, 0.0%ni, 79.3%id, 1.2%wa, 0.0%hi, 0.2%si, 0.0%st
    Mem:  49555260k total, 36152224k used, 13403036k free, 121612k buffers
    Swap: 999416k total, 0k used, 999416k free, 5409052k cached
    PID   USER   PR  NI  VIRT   RES   SHR   S  %CPU  %MEM  TIME+      COMMAND
    2741  mysql  20  0   6412m  5.8g  6380  S  182   12.3  108:07.45  mysqld
    2814  root   20  0   31.1g  22g   9716  S  100   46.6  375:51.70  java
    1765  root   20  0   12.2g  285m  9488  S  2     0.6   3:52.59    java
    3591  root   20  0   19352  1576  1068  R  0     0.0   0:00.24    top
    1     root   20  0   23684  1908  1276  S  0     0.0   0:06.21    init
Regards, Rohit
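For anyone who does want to compare the two Directory implementations Mike mentions, here is a minimal sketch against the Lucene 3.x API (the index path is illustrative; in Solr the same choice is made via the directoryFactory setting in solrconfig.xml):

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.MMapDirectory;
    import org.apache.lucene.store.NIOFSDirectory;

    public class DirectoryChoice {
      public static IndexSearcher open(File indexDir) throws IOException {
        // MMapDirectory: large VIRT in top, but usually faster searches.
        Directory dir = new MMapDirectory(indexDir);
        // NIOFSDirectory trades the big virtual-memory footprint for
        // extra copying on reads:
        // Directory dir = new NIOFSDirectory(indexDir);
        return new IndexSearcher(IndexReader.open(dir));
      }
    }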
Re: Open deleted index file failing jboss shutdown with Too many open files Error
Hmm, unless the ulimits are low, or the default mergeFactor was changed, or you have many indexes open in a single JVM, or you keep too many IndexReaders open, even in an NRT or frequent commit use case, you should not run out of file descriptors. Frequent commit/reopen should be perfectly fine, as long as you close the old readers... Mike McCandless http://blog.mikemccandless.com On Mon, Apr 2, 2012 at 8:35 AM, Erick Erickson erickerick...@gmail.com wrote: How often are you committing index updates? This kind of thing can happen if you commit too often. Consider setting commitWithin to something like, say, 5 minutes. Or doing the equivalent with the autoCommit parameters in solrconfig.xml. If that isn't relevant, you need to provide some more details about what you're doing and how you're using Solr. Best Erick On Sun, Apr 1, 2012 at 10:47 PM, Gopal Patwa gopalpa...@gmail.com wrote: I am using a Solr 4.0 nightly build with NRT and I often get this error during auto commit: Too many open files. I have searched this forum and what I found is that it is related to the OS ulimit setting; please see below my ulimit settings. I am not sure what ulimit setting I should have for open files. ulimit -n unlimited? Even if I set it to a higher number, it will just delay the issue until it reaches the new open file limit. What I have seen is that Solr keeps deleted index files open in the java process, which prevents our application server (JBoss) from shutting down gracefully because of the files held open by the java process. I have seen that this issue was recently resolved in Lucene; is that true? https://issues.apache.org/jira/browse/LUCENE-3855 I have 3 cores with index sizes: core1 - 70GB, core2 - 50GB and core3 - 15GB, with a single shard. We update the index every 5 seconds, soft commit every 1 second and hard commit every 15 minutes. Environment: JBoss 4.2, JDK 1.6 64 bit, CentOS, JVM heap size = 24GB. ulimit:
    core file size (blocks, -c) 0
    data seg size (kbytes, -d) unlimited
    scheduling priority (-e) 0
    file size (blocks, -f) unlimited
    pending signals (-i) 401408
    max locked memory (kbytes, -l) 1024
    max memory size (kbytes, -m) unlimited
    open files (-n) 4096
    pipe size (512 bytes, -p) 8
    POSIX message queues (bytes, -q) 819200
    real-time priority (-r) 0
    stack size (kbytes, -s) 10240
    cpu time (seconds, -t) unlimited
    max user processes (-u) 401408
    virtual memory (kbytes, -v) unlimited
    file locks (-x) unlimited
ERROR:
    2012-04-01 20:08:35,323 [] priority=ERROR app_name= thread=pool-10-thread-1 location=CommitTracker line=93 auto commit error...:org.apache.solr.common.SolrException: Error opening new searcher
      at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138)
      at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251)
      at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:409)
      at org.apache.solr.update.CommitTracker.run(CommitTracker.java:197)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      at java.lang.Thread.run(Thread.java:662)
    Caused by: java.io.FileNotFoundException: /opt/mci/data/srwp01mci001/inventory/index/_4q1y_0.tip (Too many open files)
      at java.io.RandomAccessFile.open(Native Method)
      at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
      at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:449)
      at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:288)
      at org.apache.lucene.codecs.BlockTreeTermsWriter.<init>(BlockTreeTermsWriter.java:161)
      at org.apache.lucene.codecs.lucene40.Lucene40PostingsFormat.fieldsConsumer(Lucene40PostingsFormat.java:66)
      at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.addField(PerFieldPostingsFormat.java:118)
      at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:322)
      at
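A minimal sketch of the close-the-old-reader discipline Mike describes, assuming Lucene 3.5's IndexReader.openIfChanged; in a frequent-commit loop, skipping the close is exactly what pins deleted segment files open:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;

    public class ReaderRefresher {
      public static IndexReader refresh(IndexReader current) throws IOException {
        IndexReader newer = IndexReader.openIfChanged(current);
        if (newer == null) {
          return current;   // nothing changed since the last open
        }
        current.close();    // releases the old segments' file handles,
                            // letting the OS drop the deleted files
        return newer;
      }
    }

In a multi-threaded server you would decRef the old reader rather than close it outright; the point is only that every superseded reader must eventually be released.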
Re: Large Index and OutOfMemoryError: Map failed
It's the virtual memory limit that matters; yours says unlimited below (good!), but are you certain that's really the limit your Solr process runs with? On Linux, there is also a per-process map count: cat /proc/sys/vm/max_map_count. I think it typically defaults to 65,536 but you should check on your env. If a process tries to map more than this many regions, you'll hit that exception. I think you can run cat /proc/<pid>/maps | wc to see how many maps your Solr process currently has... if that is anywhere near the limit then it could be the cause. Mike McCandless http://blog.mikemccandless.com On Sat, Mar 31, 2012 at 1:26 AM, Gopal Patwa gopalpa...@gmail.com wrote: I need help!! I am using a Solr 4.0 nightly build with NRT and I often get this error during auto commit: java.lang.OutOfMemoryError: Map failed. I have searched this forum and what I found is that it is related to the OS ulimit setting; please see below my ulimit settings. I am not sure what ulimit setting I should have? and we also get java.net.SocketException: Too many open files -- not sure how many open files we need to allow. I have 3 cores with index sizes: core1 - 70GB, core2 - 50GB and core3 - 15GB, with a single shard. We update the index every 5 seconds, soft commit every 1 second and hard commit every 15 minutes. Environment: JBoss 4.2, JDK 1.6, CentOS, JVM heap size = 24GB. ulimit:
    core file size (blocks, -c) 0
    data seg size (kbytes, -d) unlimited
    scheduling priority (-e) 0
    file size (blocks, -f) unlimited
    pending signals (-i) 401408
    max locked memory (kbytes, -l) 1024
    max memory size (kbytes, -m) unlimited
    open files (-n) 1024
    pipe size (512 bytes, -p) 8
    POSIX message queues (bytes, -q) 819200
    real-time priority (-r) 0
    stack size (kbytes, -s) 10240
    cpu time (seconds, -t) unlimited
    max user processes (-u) 401408
    virtual memory (kbytes, -v) unlimited
    file locks (-x) unlimited
ERROR:
    2012-03-29 15:14:08,560 [] priority=ERROR app_name= thread=pool-3-thread-1 location=CommitTracker line=93 auto commit error...:java.io.IOException: Map failed
      at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:748)
      at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(MMapDirectory.java:293)
      at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:221)
      at org.apache.lucene.codecs.lucene40.Lucene40PostingsReader.<init>(Lucene40PostingsReader.java:58)
      at org.apache.lucene.codecs.lucene40.Lucene40PostingsFormat.fieldsProducer(Lucene40PostingsFormat.java:80)
      at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader$1.visitOneFormat(PerFieldPostingsFormat.java:189)
      at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$VisitPerFieldFile.<init>(PerFieldPostingsFormat.java:280)
      at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader$1.<init>(PerFieldPostingsFormat.java:186)
      at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:186)
      at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:256)
      at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:108)
      at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:51)
      at org.apache.lucene.index.IndexWriter$ReadersAndLiveDocs.getReader(IndexWriter.java:494)
      at org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:214)
      at ...
Re: effect of continuous deletes on index's read performance
On Mon, Feb 6, 2012 at 8:20 AM, prasenjit mukherjee prasen@gmail.com wrote: Pardon my ignorance: why can't the IndexWriter and IndexSearcher share the same underlying in-memory data structure, so that the IndexSearcher need not be reopened with every commit? Because the semantics of an IndexReader in Lucene guarantee an unchanging point-in-time view of the index, as of when that IndexReader was opened. That said, Lucene has near-real-time readers, which keep point-in-time semantics but are very fast to open after adding/deleting docs, and do not require a (costly) commit. EG see my blog post: http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html The tests I ran there indexed at a highish rate (~1000 1KB sized docs per second, or 1 MB plain text per second, or ~2X Twitter's peak rate, at least as of last July), and the reopen latency was fast (~60 msec). Admittedly this was a fast machine, the index was on a good SSD, and I used NRTCachingDir and MemoryCodec for the id field. But net/net Lucene's NRT search is very fast. It should easily handle your 20 docs/second rate, unless your docs are enormous. Solr trunk has finally cut over to using these APIs, but unfortunately this has not been backported to Solr 3.x. You might want to check out ElasticSearch, an alternative to Solr, which does use Lucene's NRT APIs. Mike McCandless http://blog.mikemccandless.com
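A minimal sketch of the NRT reader pattern Mike's blog post benchmarks, written against the Lucene 3.5 API (the directory, analyzer and reopen cadence here are illustrative):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Version;

    public class NrtExample {
      public static void run(Directory dir, Analyzer analyzer) throws IOException {
        IndexWriter writer = new IndexWriter(dir,
            new IndexWriterConfig(Version.LUCENE_35, analyzer));

        // NRT reader straight from the writer -- no commit required:
        IndexReader reader = IndexReader.open(writer, true);

        // ... add/delete documents on other threads, then periodically:
        IndexReader newer = IndexReader.openIfChanged(reader, writer, true);
        if (newer != null) {
          reader.close();   // release the superseded point-in-time view
          reader = newer;   // sees recent adds/deletes without a commit
        }
      }
    }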
Re: LUCENE-995 in 3.x
Thank you Ingo! I'd say post the 3.x patch directly on the original issue. I'm not sure why this wasn't backported to 3.x the first time around... Mike McCandless http://blog.mikemccandless.com On Thu, Jan 5, 2012 at 8:15 AM, Ingo Renner i...@typo3.org wrote: Hi all, I've backported LUCENE-995 to 3.x and the unit test for TestQueryParser is green. What would be the workflow to actually get it into 3.x now? - attach the patch to the original issue, or - create a new issue and attach the patch there? best Ingo -- Ingo Renner TYPO3 Core Developer, Release Manager TYPO3 4.2, Admin Google Summer of Code TYPO3 Open Source Enterprise Content Management System http://typo3.org
Re: LUCENE-995 in 3.x
Awesome, thanks Ingo... I'll have a look! Mike McCandless http://blog.mikemccandless.com On Thu, Jan 5, 2012 at 9:23 AM, Ingo Renner i...@typo3.org wrote: On 05.01.2012 at 15:05, Michael McCandless wrote: Thank you Ingo! I'd say post the 3.x patch directly on the original issue. Thanks for the advice Michael; the patch is attached: https://issues.apache.org/jira/browse/LUCENE-995 Ingo -- Ingo Renner TYPO3 Core Developer, Release Manager TYPO3 4.2, Admin Google Summer of Code TYPO3 Open Source Enterprise Content Management System http://typo3.org
Re: help no segment in my lucene index!!!
Which version of Solr/Lucene were you using when you hit the power loss? There was a known bug that could allow power loss to cause corruption, but this was fixed in Lucene 3.4.0. Unfortunately, there is no easy way to recreate the segments_N file... in principle it should be possible and maybe not too much work, but nobody has created such a tool yet, that I know of. Mike McCandless http://blog.mikemccandless.com On Mon, Nov 28, 2011 at 5:54 AM, Roberto Iannone roberto.iann...@gmail.com wrote: Hi all, after a power supply interruption my lucene index (about 28 GB) looks like this:
    18/11/2011 20:29 2.016.961.997 _3d.fdt
    18/11/2011 20:29 1.816.004 _3d.fdx
    18/11/2011 20:29 89 _3d.fnm
    18/11/2011 20:30 197.323.436 _3d.frq
    18/11/2011 20:30 1.816.004 _3d.nrm
    18/11/2011 20:30 358.016.461 _3d.prx
    18/11/2011 20:30 637.604 _3d.tii
    18/11/2011 20:30 48.565.519 _3d.tis
    18/11/2011 20:31 454.004 _3d.tvd
    18/11/2011 20:31 1.695.380.935 _3d.tvf
    18/11/2011 20:31 3.632.004 _3d.tvx
    18/11/2011 23:33 2.048.500.822 _6g.fdt
    18/11/2011 23:33 3.032.004 _6g.fdx
    18/11/2011 23:33 89 _6g.fnm
    18/11/2011 23:34 221.593.644 _6g.frq
    18/11/2011 23:34 3.032.004 _6g.nrm
    18/11/2011 23:34 350.136.996 _6g.prx
    18/11/2011 23:34 683.668 _6g.tii
    18/11/2011 23:34 52.224.328 _6g.tis
    18/11/2011 23:36 758.004 _6g.tvd
    18/11/2011 23:36 1.758.786.158 _6g.tvf
    18/11/2011 23:36 6.064.004 _6g.tvx
    19/11/2011 03:29 1.966.167.843 _9j.fdt
    19/11/2011 03:29 3.832.004 _9j.fdx
    19/11/2011 03:28 89 _9j.fnm
    19/11/2011 03:30 222.733.606 _9j.frq
    19/11/2011 03:30 3.832.004 _9j.nrm
    19/11/2011 03:30 324.722.843 _9j.prx
    19/11/2011 03:30 715.441 _9j.tii
    19/11/2011 03:30 54.488.546 _9j.tis
without any segment files! I tried to fix it with the CheckIndex utility in lucene, but I got the following message:
    ERROR: could not read any segments file in directory
    org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.MMapDirectory@E:\recover_me lockFactory=org.apache.lucene.store.NativeFSLockFactory@5d36d1d7: files: [_3d.fdt, _3d.fdx, _3d.fnm, _3d.frq, _3d.nrm, _3d.prx, _3d.tii, _3d.tis, _3d.tvd, _3d.tvf, _3d.tvx, _6g.fdt, _6g.fdx, _6g.fnm, _6g.frq, _6g.nrm, _6g.prx, _6g.tii, _6g.tis, _6g.tvd, _6g.tvf, _6g.tvx, _9j.fdt, _9j.fdx, _9j.fnm, _9j.frq, _9j.nrm, _9j.prx, _9j.tii, _9j.tis, _9j.tvd, _9j.tvf, _9j.tvx, _cf.cfs, _cm.fdt, _cm.fdx, _cm.fnm, _cm.frq, _cm.nrm, _cm.prx, _cm.tii, _cm.tis, _cm.tvd, _cm.tvf, _cm.tvx, _ff.fdt, _ff.fdx, _ff.fnm, _ff.frq, _ff.nrm, _ff.prx, _ff.tii, _ff.tis, _ff.tvd, _ff.tvf, _ff.tvx, _ii.fdt, _ii.fdx, _ii.fnm, _ii.frq, _ii.nrm, _ii.prx, _ii.tii, _ii.tis, _ii.tvd, _ii.tvf, _ii.tvx, _lc.cfs, _ll.fdt, _ll.fdx, _ll.fnm, _ll.frq, _ll.nrm, _ll.prx, _ll.tii, _ll.tis, _ll.tvd, _ll.tvf, _ll.tvx, _lo.cfs, _lp.cfs, _lq.cfs, _lr.cfs, _ls.cfs, _lt.cfs, _lu.cfs, _lv.cfs, _lw.fdt, _lw.fdx, _lw.tvd, _lw.tvf, _lw.tvx, _m.fdt, _m.fdx, _m.fnm, _m.frq, _m.nrm, _m.prx, _m.tii, _m.tis, _m.tvd, _m.tvf, _m.tvx]
      at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:712)
      at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:593)
      at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:359)
      at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:327)
      at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:995)
Is there a way to recover this index? Cheers Rob
Re: help no segment in my lucene index!!!
On Mon, Nov 28, 2011 at 10:49 AM, Roberto Iannone iann...@crmpa.unisa.it wrote: Hi Michael, thx for your help :) You're welcome! 2011/11/28 Michael McCandless luc...@mikemccandless.com Which version of Solr/Lucene were you using when you hit the power loss? I'm using Lucene 3.4. Hmm, which OS/filesystem? Unexpected power loss (or OS crash, or JVM crash) in 3.4.0 should not cause corruption, as long as the IO system properly implements fsync. There was a known bug that could allow power loss to cause corruption, but this was fixed in Lucene 3.4.0. Unfortunately, there is no easy way to recreate the segments_N file... in principle it should be possible and maybe not too much work, but nobody has created such a tool yet, that I know of. Any hints on how I could write this code myself? Well, you'd need to take a listing of all files, aggregate those into unique segment names, open a SegmentReader on each segment name, and from that SegmentReader reconstruct what you can (numDocs, delCount, isCompoundFile, etc.) about each SegmentInfo. Add all the resulting SegmentInfo instances into a new SegmentInfos and write it to the directory. Was the index newly created in 3.4.x? If not (if you inherited segments from earlier Lucene versions) you might also have to reconstruct the shared doc stores (stored fields, term vectors) files, which will be trickier... Mike
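The first of Mike's steps, recovering the unique segment names from a file listing, is straightforward; a sketch under the assumption of a Lucene 3.4 index directory. Rebuilding the SegmentInfo/SegmentInfos objects themselves requires Lucene-internal APIs and is the real work, so this only shows step one:

    import java.io.File;
    import java.util.Set;
    import java.util.TreeSet;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class SegmentNames {
      public static Set<String> list(File indexDir) throws Exception {
        Directory dir = FSDirectory.open(indexDir);
        Set<String> names = new TreeSet<String>();
        for (String file : dir.listAll()) {
          int dot = file.indexOf('.');
          // Lucene segment files look like _3d.fdt, _6g.tis, _cf.cfs, ...
          if (file.startsWith("_") && dot > 0) {
            names.add(file.substring(0, dot));
          }
        }
        dir.close();
        return names;  // e.g. [_3d, _6g, _9j, ...]; one SegmentInfo is needed per name
      }
    }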
Re: Parent-child options
Lucene itself has BlockJoinQuery/Collector (in contrib/join), which is what ElasticSearch is using under the hood for its nested documents (I think?). But I don't think this has been exposed in Solr yet -- patches welcome! Mike McCandless http://blog.mikemccandless.com On Tue, Nov 8, 2011 at 12:59 PM, Jean Maynier jmayn...@eco2market.com wrote: Hello, Did someone find a way to solve the parent-child problem? The Join option is too complex because you have to create multiple document types and do the join in the query. ElasticSearch did a better job at solving this problem: http://www.elasticsearch.org/guide/reference/mapping/nested-type.html http://www.elasticsearch.org/guide/reference/query-dsl/nested-query.html Does Solr have a similar feature (at least on the roadmap)? I don't want to switch to ES (too big a change) but it seems better at the moment for structured content. -- Jean Maynier
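For the Lucene-level API Mike refers to, a rough sketch against the 3.4 contrib/join classes. Field names and values here are made up; the key constraints are that the block is indexed atomically and the parent document comes last:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.join.BlockJoinQuery;

    public class BlockJoinSketch {
      // Index time: a parent and its children go in as one contiguous block.
      public static void indexBlock(IndexWriter writer, List<Document> children,
                                    Document parent) throws Exception {
        List<Document> block = new ArrayList<Document>(children);
        block.add(parent);            // the parent must be last in the block
        writer.addDocuments(block);   // the block is indexed atomically
      }

      // Query time: match children, join up to their parent documents.
      public static Query childrenToParents() {
        Filter parents = new CachingWrapperFilter(
            new QueryWrapperFilter(new TermQuery(new Term("type", "parent"))));
        Query childQuery = new TermQuery(new Term("color", "red"));
        return new BlockJoinQuery(childQuery, parents, BlockJoinQuery.ScoreMode.Max);
      }
    }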
Re: large scale indexing issues / single threaded bottleneck
On Fri, Oct 28, 2011 at 3:27 PM, Simon Willnauer simon.willna...@googlemail.com wrote: one more thing, after somebody (thanks robert) pointed me at the stacktrace it seems kind of obvious what the root cause of your problem is. Its solr :) Solr closes the IndexWriter on commit, which is very wasteful since you basically wait until all merges are done. Solr trunk has solved this problem. That is very wasteful, but I don't think it's actually the cause of the slowdown here? The cause looks like it's in applying deletes, which, even once Solr stops closing the IW, will still occur (ie, IW.commit must also resolve all deletes). When IW resolves deletes it 1) opens a SegmentReader for each segment in the index, and 2) looks up each deleted term and marks its document(s) as deleted. I saw a mention somewhere that you can tell Solr to use IW.addDocument (not IW.updateDocument) when you add a document, if you are certain it's not replacing a previous document with the same ID -- I don't know how to do that, but if that's true, and you are truly only adding documents, that could be the easiest fix here. Failing that... you could try increasing IndexWriterConfig.setReaderTermsIndexDivisor (not sure if/how this is exposed in Solr's config)... this will reduce init time and RAM usage for each SegmentReader, but make lookup time slower; whether this helps depends on whether your slowness is in opening the SegmentReader (how long does it take to IR.open on your index?) or in resolving the deletes once the SR is open. Do you have a great many terms in your index? Can you run CheckIndex and post the output? (If so this might mean you have an analysis problem, ie, putting too many terms in the index). We should maybe try to fix this in 3.x too? +1; having to wait for running merges to complete when the app calls commit is crazy (Lucene long ago removed that limitation). Mike McCandless http://blog.mikemccandless.com
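A sketch of the two knobs mentioned above, against the Lucene 3.4 API. The divisor value and the id field name are illustrative, and (as Mike notes) whether Solr exposes setReaderTermsIndexDivisor is a separate question:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Version;

    public class AddVsUpdate {
      public static void index(Directory dir, Analyzer analyzer, Document doc) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34, analyzer);
        // Higher divisor: the SegmentReaders opened to resolve deletes use
        // less RAM and open faster, at the cost of slower term lookups.
        iwc.setReaderTermsIndexDivisor(4);
        IndexWriter writer = new IndexWriter(dir, iwc);

        writer.addDocument(doc);  // pure add: no delete to resolve at commit
        // updateDocument buffers a delete-by-term that commit must resolve:
        writer.updateDocument(new Term("id", "doc-42"), doc);
        writer.close();
      }
    }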
Re: How to make UnInvertedField faster?
On Sat, Oct 22, 2011 at 4:10 AM, Simon Willnauer simon.willna...@googlemail.com wrote: On Fri, Oct 21, 2011 at 4:37 PM, Michael McCandless luc...@mikemccandless.com wrote: Well... the limitation of DocValues is that it cannot handle more than one value per document (which UnInvertedField can). you can pack this into one byte[] or use more than one field? I don't see a real limitation here. Well... not very easily? UnInvertedField (DocTermOrds in Lucene) is the same as DocValues' BYTES_VAR_SORTED. So for an app to do this on top, it'd have to handle the term -> ord resolution itself, save that somewhere, then encode the multiple ords into a byte[]. I agree that for other simple types (no deref/sorting involved) an app could pack them into its own byte[] that's otherwise opaque to Lucene. Mike McCandless http://blog.mikemccandless.com
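To make the "pack it into one byte[]" idea concrete: a plain-Java sketch of encoding several per-document ords into a single byte array. The app would still have to maintain the term-to-ord mapping itself, which is the part Mike flags as the real difficulty:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class PackedOrds {
      // Delta-encode ascending ords into one value per document; a true vInt
      // encoding would be more compact than writeInt.
      public static byte[] pack(int[] sortedOrds) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        int prev = 0;
        for (int ord : sortedOrds) {
          out.writeInt(ord - prev);  // store the gap, not the absolute ord
          prev = ord;
        }
        out.flush();
        return bytes.toByteArray();  // store as the document's byte[] value
      }
    }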
Re: How to make UnInvertedField faster?
Well... the limitation of DocValues is that it cannot handle more than one value per document (which UnInvertedField can). Hopefully we can fix that at some point :) Mike McCandless http://blog.mikemccandless.com On Fri, Oct 21, 2011 at 7:50 AM, Simon Willnauer simon.willna...@googlemail.com wrote: In trunk we have a feature called IndexDocValues which basically creates the uninverted structure at index time. You can then simply suck that into memory or even access it on disk directly (RandomAccess). Even if I can't help you right now this is certainly going to help you here. There is no need to uninvert at all anymore in lucene 4.0 simon On Wed, Oct 19, 2011 at 8:05 PM, Michael Ryan mr...@moreover.com wrote: I was wondering if anyone has any ideas for making UnInvertedField.uninvert() faster, or other alternatives for generating facets quickly. The vast majority of the CPU time for our Solr instances is spent generating UnInvertedFields after each commit. Here's an example of one of our slower fields: [2011-10-19 17:46:01,055] INFO125974[pool-1-thread-1] - (SolrCore:440) - UnInverted multi-valued field {field=authorCS,memSize=38063628,tindexSize=422652, time=15610,phase1=15584,nTerms=1558514,bigTerms=0,termInstances=4510674,uses=0} That is from an index with approximately 8 million documents. After each commit, it takes on average about 90 seconds to uninvert all the fields that we facet on. Any ideas at all would be greatly appreciated. -Michael
Re: Indexing PDF
Can you attach this PDF to an email sent to the list? Or is it too large for that? Or, you can try running Tika directly on the PDF to see if it's able to extract the text. Mike McCandless http://blog.mikemccandless.com 2011/10/5 Héctor Trujillo hecto...@gmail.com: Sorry, you are right: this file was indexed with a .NET web service client that calls a Java application (a web service), which calls Solr using SolrJ. I will try to index it in a different way; maybe this resolves the problem. Thanks Best regards On 5 October 2011 at 08:42, Héctor Trujillo hecto...@gmail.com wrote: It seems unreasonable that if I want to index a local file, I have to reference this local file by a URL. This isn't a strange file; it is a file downloaded from the Lucid web portal, called: Starting a Search Application.pdf. This may be an encoding problem, or a charset problem. I open this file with a PDF reader and I have no problems, and I don't know why referencing this file with a URL would fix the problem; can you help me? I'm working with SolrJ, from Java; does someone have the same problem with SolrJ? Thanks to Paul Libbrecht for your suggestion. Best regards 2011/10/4 Paul Libbrecht p...@hoplahup.net full of boxes for me. Héctor, you need another way to reference these! (e.g. a URL) paul On 4 Oct 2011 at 16:49, Héctor Trujillo wrote: Hi all, I'm indexing PDF files with SolrJ, and most of them work. But with some files I've got problems because strange characters get stored. This content got stored: +++ Starting a Search Application Abstract Starting a Search Application A Lucid Imagination White Paper ¥ April 2009 Page i Starting a Search Application A Lucid Imagination White Paper ¥ April 2009 Page ii Do You Need Full-text Search? ∞ ∞ ∞
Re: Indexing PDF
Hmm, no attachment; maybe it's too large? Can you send it directly to me? Mike McCandless http://blog.mikemccandless.com 2011/10/5 Héctor Trujillo hecto...@gmail.com: This is the file that gives me errors. 2011/10/5 Michael McCandless luc...@mikemccandless.com Can you attach this PDF to an email sent to the list? Or is it too large for that? Or, you can try running Tika directly on the PDF to see if it's able to extract the text. Mike McCandless http://blog.mikemccandless.com 2011/10/5 Héctor Trujillo hecto...@gmail.com: Sorry, you are right: this file was indexed with a .NET web service client that calls a Java application (a web service), which calls Solr using SolrJ. I will try to index it in a different way; maybe this resolves the problem. Thanks Best regards On 5 October 2011 at 08:42, Héctor Trujillo hecto...@gmail.com wrote: It seems unreasonable that if I want to index a local file, I have to reference this local file by a URL. This isn't a strange file; it is a file downloaded from the Lucid web portal, called: Starting a Search Application.pdf. This may be an encoding problem, or a charset problem. I open this file with a PDF reader and I have no problems, and I don't know why referencing this file with a URL would fix the problem; can you help me? I'm working with SolrJ, from Java; does someone have the same problem with SolrJ? Thanks to Paul Libbrecht for your suggestion. Best regards 2011/10/4 Paul Libbrecht p...@hoplahup.net full of boxes for me. Héctor, you need another way to reference these! (e.g. a URL) paul On 4 Oct 2011 at 16:49, Héctor Trujillo wrote: Hi all, I'm indexing PDF files with SolrJ, and most of them work. But with some files I've got problems because strange characters get stored. This content got stored: +++ Starting a Search Application Abstract Starting a Search Application A Lucid Imagination White Paper ¥ April 2009 Page i Starting a Search Application A Lucid Imagination White Paper ¥ April 2009 Page ii Do You Need Full-text Search
Re: Query failing because of omitTermFreqAndPositions
This is because, within one segment, only one value (omit positions or not) is possible for all the docs in that segment. This then means, on merging segments with different values for omitP, Lucene must reconcile the different values, and that reconciliation will favor omitting positions (if it went the other way, Lucene would have to make up fake positions, which seems very dangerous). Even if you delete all documents containing that field, and optimize down to one segment, this omit-positions bit will still stick, because of how Lucene stores the metadata per field. omitNorms also behaves this way: once omitted, always omitted. Mike McCandless http://blog.mikemccandless.com On Tue, Oct 4, 2011 at 1:41 AM, Isan Fulia isan.fu...@germinait.com wrote: Hi Mike, Thanks for the information. But why is it that, once positions were omitted in the past, the field will always omit positions, even if omitPositions is made false? Thanks, Isan Fulia. On 29 September 2011 17:49, Michael McCandless luc...@mikemccandless.com wrote: Once a given field has omitted positions in the past, even for just one document, it sticks and that field will forever omit positions. Try creating a new index, never omitting positions from that field? Mike McCandless http://blog.mikemccandless.com On Thu, Sep 29, 2011 at 1:14 AM, Isan Fulia isan.fu...@germinait.com wrote: Hi All, My schema consisted of the field textForQuery, which was defined as <field name="textForQuery" type="text" indexed="true" stored="false" multiValued="true"/> After indexing 10 lakh documents I changed the field to <field name="textForQuery" type="text" indexed="true" stored="false" multiValued="true" omitTermFreqAndPositions="true"/> So documents that were indexed after that omitted the position information of the terms. As a result I was not able to search text which relies on position information, e.g. "coke studio at mtv", even though it is present in some documents. So I again changed the field textForQuery to <field name="textForQuery" type="text" indexed="true" stored="false" multiValued="true"/> But now, even for newly added documents, queries requiring position information are still failing. For example, I reindexed certain documents that contain "coke studio at mtv" but the query still returns no documents when searching for textForQuery:"coke studio at mtv" Can anyone please help me out with why this is happening -- Thanks & Regards, Isan Fulia. -- Thanks & Regards, Isan Fulia.
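Since the omitted-positions bit is permanent per field name within an index, the practical fixes are a brand-new index or a brand-new field name. A sketch of the latter at the Lucene level; "textForQuery_v2" is a hypothetical replacement name (in Solr terms, add a new <field> with a new name to the schema and reindex into it):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class FreshField {
      public static Document buildDoc(String text) {
        Document doc = new Document();
        // Freqs and positions are kept by default; never call
        // setOmitTermFreqAndPositions(true) on this field for any document,
        // or the omission becomes permanent for the field in this index.
        doc.add(new Field("textForQuery_v2", text, Field.Store.NO, Field.Index.ANALYZED));
        return doc;
      }
    }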
Re: Query failing because of omitTermFreqAndPositions
Once a given field has omitted positions in the past, even for just one document, it sticks and that field will forever omit positions. Try creating a new index, never omitting positions from that field? Mike McCandless http://blog.mikemccandless.com On Thu, Sep 29, 2011 at 1:14 AM, Isan Fulia isan.fu...@germinait.com wrote: Hi All, My schema consisted of the field textForQuery, which was defined as <field name="textForQuery" type="text" indexed="true" stored="false" multiValued="true"/> After indexing 10 lakh documents I changed the field to <field name="textForQuery" type="text" indexed="true" stored="false" multiValued="true" omitTermFreqAndPositions="true"/> So documents that were indexed after that omitted the position information of the terms. As a result I was not able to search text which relies on position information, e.g. "coke studio at mtv", even though it is present in some documents. So I again changed the field textForQuery to <field name="textForQuery" type="text" indexed="true" stored="false" multiValued="true"/> But now, even for newly added documents, queries requiring position information are still failing. For example, I reindexed certain documents that contain "coke studio at mtv" but the query still returns no documents when searching for textForQuery:"coke studio at mtv" Can anyone please help me out with why this is happening -- Thanks & Regards, Isan Fulia.
Re: Example setting TieredMergePolicy for Solr 3.3 or 3.4?
On Wed, Sep 21, 2011 at 10:10 PM, Michael Sokolov soko...@ifactory.com wrote: I wonder if config-file validation would be helpful here :) I posted a patch in SOLR-1758 once. Big +1. We should aim for as stringent config file checking as possible. Mike McCandless http://blog.mikemccandless.com
Re: Optimize fails with OutOfMemory Exception - sun.nio.ch.FileChannelImpl.map involved
Are you sure you are using a 64 bit JVM? Are you sure you really changed your vmem limit to unlimited? That should have resolved the OOME from mmap. Or: can you run cat /proc/sys/vm/max_map_count? This is a limit on the total number of maps in a single process, that Linux imposes. But the default limit is usually high (64K), so it'd be surprising if you are hitting that unless it's lower in your env. The amount of [free] RAM on the machine should have no bearing on whether mmap succeeds or fails; it's the available address space (32 bit is tiny; 64 bit is immense) and then any OS limits imposed. Mike McCandless http://blog.mikemccandless.com On Thu, Sep 22, 2011 at 5:27 AM, Ralf Matulat ralf.matu...@bundestag.de wrote: Good morning! Recently we slipped into an OOME by optimizing our index. It looks like it's related to the nio class and the memory handling. I'll try to describe the environment, the error and what we did to solve the problem. Nevertheless, none of our approaches was successful. The environment:
    - Tested with both SOLR 3.3 and 3.4
    - SuSE SLES 11 (X64) virtual machine with 16GB RAM
    - ulimit: virtual memory 14834560 (14GB)
    - Java: java-1_6_0-ibm-1.6.0-124.5
    - Apache Tomcat/6.0.29
    - Index size (on filesystem): ~5GB, 1.1 million text documents.
The error: First, building the index from scratch with a mysql DIH, with an empty index dir, works fine. Building an index with command=full-import, when the old segment files are still in place, fails with an OutOfMemoryException. Same as optimizing the index. Doing an optimize fails after some time with:
    SEVERE: java.io.IOException: background merge hit exception: _6p(3.4):Cv1150724 _70(3.4):Cv667 _73(3.4):Cv7 _72(3.4):Cv4 _71(3.4):Cv1 into _74 [optimize]
      at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2552)
      at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2472)
      at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:410)
      at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85)
      at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:154)
      at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:107)
      at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:61)
      at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
      at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
      at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
      at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
      at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
      at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
      at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
      at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
      at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
      at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
      at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
      at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
      at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
      at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
      at java.lang.Thread.run(Thread.java:735)
    Caused by: java.io.IOException: Map failed
      at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:765)
      at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(MMapDirectory.java:264)
      at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:216)
      at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:89)
      at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:115)
      at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:710)
      at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4378)
      at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3917)
      at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:388)
      at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:456)
    Caused by: java.lang.OutOfMemoryError: Map failed
      at sun.nio.ch.FileChannelImpl.map0(Native Method)
      at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:762)
Re: Optimize fails with OutOfMemory Exception - sun.nio.ch.FileChannelImpl.map involved
OK, excellent. Thanks for bringing closure, Mike McCandless http://blog.mikemccandless.com On Thu, Sep 22, 2011 at 9:00 AM, Ralf Matulat ralf.matu...@bundestag.de wrote: Dear Mike, thanks for your reply. Just a couple of minutes ago we found a solution, or - to be honest - where we went wrong. Our failure was the use of ulimit. We missed that ulimit sets the vmem for each shell separately. So we set 'ulimit -v unlimited' in one shell, thinking that we'd done the job correctly. As we recognized our mistake, we added 'ulimit -v unlimited' to the init script of the Tomcat instance, and now it looks like everything works as expected. We need some further testing with the Java versions, but I'm quite optimistic. Best regards Ralf On 22.09.2011 at 14:46, Michael McCandless wrote: Are you sure you are using a 64 bit JVM? Are you sure you really changed your vmem limit to unlimited? That should have resolved the OOME from mmap. Or: can you run cat /proc/sys/vm/max_map_count? This is a limit on the total number of maps in a single process, that Linux imposes. But the default limit is usually high (64K), so it'd be surprising if you are hitting that unless it's lower in your env. The amount of [free] RAM on the machine should have no bearing on whether mmap succeeds or fails; it's the available address space (32 bit is tiny; 64 bit is immense) and then any OS limits imposed. Mike McCandless http://blog.mikemccandless.com On Thu, Sep 22, 2011 at 5:27 AM, Ralf Matulat ralf.matu...@bundestag.de wrote: Good morning! Recently we slipped into an OOME by optimizing our index. It looks like it's related to the nio class and the memory handling. I'll try to describe the environment, the error and what we did to solve the problem. Nevertheless, none of our approaches was successful. The environment:
    - Tested with both SOLR 3.3 and 3.4
    - SuSE SLES 11 (X64) virtual machine with 16GB RAM
    - ulimit: virtual memory 14834560 (14GB)
    - Java: java-1_6_0-ibm-1.6.0-124.5
    - Apache Tomcat/6.0.29
    - Index size (on filesystem): ~5GB, 1.1 million text documents.
The error: First, building the index from scratch with a mysql DIH, with an empty index dir, works fine. Building an index with command=full-import, when the old segment files are still in place, fails with an OutOfMemoryException. Same as optimizing the index.
Re: Optimize fails with OutOfMemory Exception - sun.nio.ch.FileChannelImpl.map involved
Unfortunately I really don't know ;) Every time I set forth to figure things like this out I seem to learn some new way... Maybe someone else knows? Mike McCandless http://blog.mikemccandless.com On Thu, Sep 22, 2011 at 2:15 PM, Shawn Heisey s...@elyograg.org wrote: Michael, What is the best central place on an rpm-based distro (CentOS 6 in my case) to raise the vmem limit for specific user(s), assuming it's not already correct? I'm using /etc/security/limits.conf to raise the open file limit for the user that runs Solr: ncindex hard nofile 65535 ncindex soft nofile 49151 Thanks, Shawn On 9/22/2011 9:56 AM, Michael McCandless wrote: OK, excellent. Thanks for bringing closure, Mike McCandless http://blog.mikemccandless.com On Thu, Sep 22, 2011 at 9:00 AM, Ralf Matulatralf.matu...@bundestag.de wrote: Dear Mike, thanks for your your reply. Just a couple of minutes we found a solution or - to be honest - where we went wrong. Our failure was the use of ulimit. We missed, that ulimit sets the vmem for each shell seperatly. So we set 'ulimit -v unlimited' on a shell, thinking that we've done the job correctly. As we recognized our mistake, we added 'ulimit -v unlimited' to our init-Skript of the tomcat-instance and now it looks like everything works as aspected.
Re: MMapDirectory failed to map a 23G compound index segment
Since you hit the OOME during mmap, I think this is an OS issue, not a JVM issue. Ie, the JVM isn't running out of memory. How many segments were in the unoptimized index? It's possible the OS rejected the mmap because of process limits. Run cat /proc/sys/vm/max_map_count to see how many mmaps are allowed. Or: is it possible you reopened the reader several times against the index (ie, after committing from Solr)? If so, I think 2.9.x never unmaps the mapped areas, and so this would accumulate against the system limit. My memory of this is a little rusty, but isn't mmap also limited by mem + swap on the box? What does 'free -g' report? I don't think this should be the case; you are using a 64 bit OS/JVM so in theory (except for OS system-wide / per-process limits imposed) you should be able to mmap up to the full 64 bit address space. Your virtual memory is unlimited (from the ulimit output), so that's good. Mike McCandless http://blog.mikemccandless.com On Wed, Sep 7, 2011 at 12:25 PM, Rich Cariens richcari...@gmail.com wrote: Ahoy ahoy! I've run into the dreaded OOM error with MMapDirectory on a 23G cfs compound index segment file. The stack trace looks pretty much like every other trace I've found when searching for "OOM map failed" [1]. My configuration follows: Solr 1.4.1/Lucene 2.9.3 (plus SOLR-1969 https://issues.apache.org/jira/browse/SOLR-1969), CentOS 4.9 (Final), Linux 2.6.9-100.ELsmp x86_64 yada yada yada, Java SE (build 1.6.0_21-b06), Hotspot 64-bit Server VM (build 17.0-b16, mixed mode). ulimits:
    core file size (blocks, -c) 0
    data seg size (kbytes, -d) unlimited
    file size (blocks, -f) unlimited
    pending signals (-i) 1024
    max locked memory (kbytes, -l) 32
    max memory size (kbytes, -m) unlimited
    open files (-n) 256000
    pipe size (512 bytes, -p) 8
    POSIX message queues (bytes, -q) 819200
    stack size (kbytes, -s) 10240
    cpu time (seconds, -t) unlimited
    max user processes (-u) 1064959
    virtual memory (kbytes, -v) unlimited
    file locks (-x) unlimited
Any suggestions? Thanks in advance, Rich [1] ...
    java.io.IOException: Map failed
      at sun.nio.ch.FileChannelImpl.map(Unknown Source)
      at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(Unknown Source)
      at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(Unknown Source)
      at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
      at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(Unknown Source)
      at org.apache.lucene.index.SegmentReader.get(Unknown Source)
      at org.apache.lucene.index.SegmentReader.get(Unknown Source)
      at org.apache.lucene.index.DirectoryReader.<init>(Unknown Source)
      at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(Unknown Source)
      at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
      at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown Source)
      at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
      at org.apache.lucene.index.IndexReader.open(Unknown Source)
    ...
    Caused by: java.lang.OutOfMemoryError: Map failed
      at sun.nio.ch.FileChannelImpl.map0(Native Method)
    ...
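If the reopen-without-unmap behavior Mike mentions is the culprit, MMapDirectory has had optional eager unmapping since Lucene 2.9; a minimal sketch (the index path is illustrative):

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.store.MMapDirectory;

    public class UnmapOnClose {
      public static MMapDirectory open(File indexDir) throws IOException {
        MMapDirectory dir = new MMapDirectory(indexDir);
        if (MMapDirectory.UNMAP_SUPPORTED) {
          // Unmap buffers when inputs are closed, instead of waiting for GC;
          // keeps stale maps from accumulating across reader reopens.
          dir.setUseUnmap(true);
        }
        return dir;
      }
    }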
[ANNOUNCE] Apache Solr 3.4.0 released
September 14 2011, Apache Solr™ 3.4.0 available The Lucene PMC is pleased to announce the release of Apache Solr 3.4.0. Apache Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites. This release contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below. The release is available for immediate download at: http://www.apache.org/dyn/closer.cgi/lucene/solr (see note below). If you are already using Apache Solr 3.1, 3.2 or 3.3, we strongly recommend you upgrade to 3.4.0 because of the index corruption bug on OS or computer crash or power loss (LUCENE-3418), now fixed in 3.4.0. See the CHANGES.txt file included with the release for a full list of details. Solr 3.4.0 Release Highlights: * Bug fixes and improvements from Apache Lucene 3.4.0, including a major bug (LUCENE-3418) whereby a Lucene index could easily become corrupted if the OS or computer crashed or lost power. * SolrJ client can now parse grouped and range facets results (SOLR-2523). * A new XsltUpdateRequestHandler allows posting XML that's transformed by a provided XSLT into a valid Solr document (SOLR-2630). * Post-group faceting option (group.truncate) can now compute facet counts for only the highest ranking documents per-group. (SOLR-2665). * Add commitWithin update request parameter to all update handlers that were previously missing it. This tells Solr to commit the change within the specified amount of time (SOLR-2540). * You can now specify NIOFSDirectory (SOLR-2670). * New parameter hl.phraseLimit speeds up FastVectorHighlighter (LUCENE-3234). * The query cache and filter cache can now be disabled per request See http://wiki.apache.org/solr/CommonQueryParameters#Caching_of_filters (SOLR-2429). * Improved memory usage, build time, and performance of SynonymFilterFactory (LUCENE-3233). * Added omitPositions to the schema, so you can omit position information while still indexing term frequencies (LUCENE-2048). * Various fixes for multi-threaded DataImportHandler. Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Happy searching, Apache Lucene/Solr Developers
Re: Nested documents
Even if it applies, this is for Lucene. I don't think we've added Solr support for this yet... we should! Mike McCandless http://blog.mikemccandless.com On Sun, Sep 11, 2011 at 12:16 PM, Erick Erickson erickerick...@gmail.com wrote: Does this JIRA apply? https://issues.apache.org/jira/browse/LUCENE-3171 Best Erick On Sat, Sep 10, 2011 at 8:32 PM, Andy angelf...@yahoo.com wrote: Hi, Does Solr support nested documents? If not is there any plan to add such a feature? Thanks.
Re: What will happen when one thread is closing a searcher while another is searching?
Closing a searcher while thread(s) is/are still using it is definitely bad, so this code looks spooky... But: is it possible something higher up (in Solr) is ensuring this code runs exclusively? I don't know enough about this part of Solr...

Mike McCandless
http://blog.mikemccandless.com

On Mon, Sep 5, 2011 at 10:43 PM, Li Li fancye...@gmail.com wrote:
> hi all,
> I am using spellcheck in solr 1.4. I found that spell check is not implemented the way SolrCore's searcher is. In SolrCore, a reference count is used to track the current searcher, so oldSearcher and newSearcher will both exist while oldSearcher is servicing some query. But in FileBasedSpellChecker:
>
> public void build(SolrCore core, SolrIndexSearcher searcher) {
>   try {
>     loadExternalFileDictionary(core.getSchema(), core.getResourceLoader());
>     spellChecker.clearIndex();
>     spellChecker.indexDictionary(dictionary);
>   } catch (IOException e) {
>     throw new RuntimeException(e);
>   }
> }
>
> public void clearIndex() throws IOException {
>   IndexWriter writer = new IndexWriter(spellIndex, null, true);
>   writer.close();
>   // close the old searcher
>   searcher.close();
>   searcher = new IndexSearcher(this.spellIndex);
> }
>
> it clears the old index and closes the current searcher. If another thread is searching when searcher.close() is called, will that cause a problem? Or if searcher.close() has finished and the new IndexSearcher has not yet been constructed, and another thread tries to search, will that also be problematic?
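[Editor's note] For reference, the reference-counting pattern described above can be sketched with Lucene's own IndexReader.incRef()/decRef() (these methods exist in Lucene 2.9+); "current" and "lock" are hypothetical fields, not Solr code:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;

    // Acquire: take a reference before searching, so a concurrent
    // swap can't close the reader out from under this thread.
    IndexReader r;
    synchronized (lock) {
      r = current;
      r.incRef();
    }
    try {
      IndexSearcher s = new IndexSearcher(r);
      // ... run the query ...
    } finally {
      r.decRef();  // the reader only really closes once all refs are released
    }

    // Swap: publish a new reader, then drop the owner's reference on the
    // old one; in-flight searches keep it alive until they finish.
    IndexReader old;
    synchronized (lock) {
      old = current;
      current = newReader;
    }
    old.decRef();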
heads up: re-index trunk Lucene/Solr indices
Hi, I just committed a new block tree terms dictionary implementation, which requires fully re-indexing any trunk indices. See here for details: https://issues.apache.org/jira/browse/LUCENE-3030 If you are using a released version of Lucene/Solr then you can ignore this message. Mike McCandless http://blog.mikemccandless.com
Re: Solr Join in 3.3.x
Unfortunately Solr's join impl hasn't been backported to 3.x, as far as I know. You might want to look at ElasticSearch, which already has a join implementation, or use Solr 4.0.

Mike McCandless
http://blog.mikemccandless.com

On Wed, Aug 17, 2011 at 7:40 PM, Cameron Hurst wakemaste...@z33k.com wrote:
> Hello all,
>
> I was looking into finding a way to do filtering of documents based on fields of other documents in the index. In particular I have a document that will update very frequently and hundreds that will very rarely change, but the rarely changing documents have a field that will change often that is denormalized from the frequently changing document. The brute force method I have is to reindex all the documents every time that field changes, but this at times is a huge load on my server at a critical time that I am trying to avoid.
>
> To avoid this hit I was trying to implement patch SOLR-2272. This opens up a join feature to map fields of one document onto another (or so my understanding is). This would allow me to only update that one document and have the change applied to all others that rely on it. There are a number of spots where this patch fails to apply, and I was wondering if anyone has tried to use join in 3.3 or any other released version of SOLR, or if the only way to do it is to use 4.0.
>
> Also, while I found this patch, I am also open to any other ideas that people have on how to accomplish what I need; this just seemed like the most direct method.
>
> Thanks for the help,
> Cameron
Re: segment.gen file is not replicated
This file is actually optional; it's there for redundancy in case the filesystem is not reliable when listing a directory. Ie, normally we list the directory to find the latest segments_N file; but if this is wrong (eg the file system might have a stale cache) then we fall back to reading the segments.gen file. For example this is sometimes needed for NFS.

Likely replication is just skipping it?

Mike McCandless
http://blog.mikemccandless.com

On Thu, Aug 4, 2011 at 3:38 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:
> I have now updated to solr 3.3 but segment.gen is still not replicated.
> Any idea why, is it a bug or a feature?
> Should I write a jira issue for it?
>
> Regards
> Bernd
>
> Am 29.07.2011 14:10, schrieb Bernd Fehling:
>> Dear list,
>> is there a deeper logic behind why the segment.gen file is not replicated with solr 3.2? Is it obsolete because I have a single segment?
>>
>> Regards,
>> Bernd
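[Editor's note] To make the fallback concrete, here is a minimal sketch of the directory-listing half of that lookup, written from memory of how Lucene compares segments_N generations (the generation is encoded in base 36 after the underscore; reading segments.gen itself is elided, and "dir" is assumed to be an open Directory):

    import org.apache.lucene.store.Directory;

    // Find the newest commit point by scanning for segments_N files.
    long gen = -1;
    for (String file : dir.listAll()) {
      if (file.startsWith("segments_")) {
        long g = Long.parseLong(file.substring("segments_".length()),
                                Character.MAX_RADIX);
        gen = Math.max(gen, g);
      }
    }
    // If the listing can't be trusted (eg a stale NFS cache), Lucene
    // cross-checks against the generation recorded in segments.gen.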
Re: segment.gen file is not replicated
I think we should fix replication to copy it?

Mike McCandless
http://blog.mikemccandless.com

On Thu, Aug 4, 2011 at 8:16 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:
> Am 04.08.2011 12:52, schrieb Michael McCandless:
>> This file is actually optional; it's there for redundancy in case the filesystem is not reliable when listing a directory. Ie, normally we list the directory to find the latest segments_N file; but if this is wrong (eg the file system might have a stale cache) then we fall back to reading the segments.gen file. For example this is sometimes needed for NFS.
>>
>> Likely replication is just skipping it?
>
> That was my first idea: if it is not changed and not touched, then it will be skipped. Being smart, I deleted it from the index dir on the slave and then replicated, but segment.gen was still not replicated.
>
> Following your explanation, NFS could then not be reliable any more. So it's either a bug or a feature, and the experts will know :-)
>
> Regards
> Bernd
>
>> On Thu, Aug 4, 2011 at 3:38 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:
>>> I have now updated to solr 3.3 but segment.gen is still not replicated. Any idea why, is it a bug or a feature? Should I write a jira issue for it?
>>>
>>> Regards
>>> Bernd
>>>
>>> Am 29.07.2011 14:10, schrieb Bernd Fehling:
>>>> Dear list,
>>>> is there a deeper logic behind why the segment.gen file is not replicated with solr 3.2? Is it obsolete because I have a single segment?
>>>>
>>>> Regards,
>>>> Bernd
Re: Field collapsing on multiple fields and/or ranges?
I believe the underlying grouping module is now technically able to do this, because subclasses of the abstract first/second pass grouping collectors are free to decide what type/value the group key is.

But, we have to fix Solr to allow for compound keys by creating the necessary concrete subclasses.

Mike McCandless
http://blog.mikemccandless.com

On Wed, Jul 6, 2011 at 6:22 AM, Rih tanrihae...@gmail.com wrote:
> Have the same requirement. What is your workaround for this?
>
> On Thu, May 12, 2011 at 7:40 AM, arian487 akarb...@tagged.com wrote:
>> I'm wondering if there is a way to get the field collapsing to collapse on multiple things? For example, is there a way to get it to collapse on a field (lets say 'domain') but ALSO something else (maybe time or something)? To visualize, maybe something like this:
>>
>> Group1 has common field 'www.forum1.com' and ALSO the posts are all from may 11
>> Group2 has common field 'www.forum2.com' and ALSO the posts are all from may 11
>> ...
>> GroupX has common field 'www.forum1.com' and ALSO the posts are from may 12
>>
>> So obviously it's still sorted by date, but it won't group the 'www.forum1.com' things together if the document is from a different date; it'll group on common date AND common domain field. Thanks!
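[Editor's note] A hedged sketch of what such a compound key might look like: the grouping collectors are generic over the group value, so a concrete subclass could return something like the class below (domain + day) instead of a single term. This class is hypothetical, not part of the grouping module:

    // Hypothetical compound group key: groups must hash/compare by
    // both fields, so equals() and hashCode() cover domain and day.
    final class DomainDayKey {
      final String domain;  // eg "www.forum1.com"
      final long day;       // eg days since epoch, truncated from the date field

      DomainDayKey(String domain, long day) {
        this.domain = domain;
        this.day = day;
      }

      @Override
      public boolean equals(Object other) {
        if (!(other instanceof DomainDayKey)) return false;
        DomainDayKey o = (DomainDayKey) other;
        return day == o.day && domain.equals(o.domain);
      }

      @Override
      public int hashCode() {
        return 31 * domain.hashCode() + (int) (day ^ (day >>> 32));
      }
    }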
Re: Cannot I search documents added by IndexWriter after commit?
After your writer.commit you need to reopen your searcher to see the changes.

Mike McCandless
http://blog.mikemccandless.com

On Tue, Jul 5, 2011 at 1:48 PM, Gabriele Kahlout gabri...@mysimpatico.com wrote:
> @Test
> public void testUpdate() throws IOException, ParserConfigurationException, SAXException, ParseException {
>   Analyzer analyzer = getAnalyzer();
>   QueryParser parser = new QueryParser(Version.LUCENE_32, "content", analyzer);
>   Query allQ = parser.parse("*:*");
>
>   IndexWriter writer = getWriter();
>   IndexSearcher searcher = new IndexSearcher(IndexReader.open(writer, true));
>   TopDocs docs = searcher.search(allQ, 10);
>   assertEquals(0, docs.totalHits); // empty/no index
>
>   Document doc = getDoc();
>   writer.addDocument(doc);
>   writer.commit();
>
>   docs = searcher.search(allQ, 10);
>   assertEquals(1, docs.totalHits); // it fails here: docs.totalHits equals 0
> }
>
> What am I doing wrong here? If I initialize searcher with new IndexSearcher(directory) I'm told:
>
> org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.RAMDirectory@3caa4 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@ed0220c: files: []
>
> --
> Regards,
> K. Gabriele
Re: Cannot I search documents added by IndexWriter after commit?
Sorry, you must reopen the underlying IndexReader, and then make a new IndexSearcher from the reopened reader.

Mike McCandless
http://blog.mikemccandless.com

On Tue, Jul 5, 2011 at 2:12 PM, Gabriele Kahlout gabri...@mysimpatico.com wrote:
> and how do you do that? There is no reopen method
>
> On Tue, Jul 5, 2011 at 8:09 PM, Michael McCandless luc...@mikemccandless.com wrote:
>> After your writer.commit you need to reopen your searcher to see the changes.
>>
>> Mike McCandless
>> http://blog.mikemccandless.com
>>
>> On Tue, Jul 5, 2011 at 1:48 PM, Gabriele Kahlout gabri...@mysimpatico.com wrote:
>>> @Test
>>> public void testUpdate() throws IOException, ParserConfigurationException, SAXException, ParseException {
>>>   Analyzer analyzer = getAnalyzer();
>>>   QueryParser parser = new QueryParser(Version.LUCENE_32, "content", analyzer);
>>>   Query allQ = parser.parse("*:*");
>>>
>>>   IndexWriter writer = getWriter();
>>>   IndexSearcher searcher = new IndexSearcher(IndexReader.open(writer, true));
>>>   TopDocs docs = searcher.search(allQ, 10);
>>>   assertEquals(0, docs.totalHits); // empty/no index
>>>
>>>   Document doc = getDoc();
>>>   writer.addDocument(doc);
>>>   writer.commit();
>>>
>>>   docs = searcher.search(allQ, 10);
>>>   assertEquals(1, docs.totalHits); // it fails here: docs.totalHits equals 0
>>> }
>>>
>>> What am I doing wrong here? If I initialize searcher with new IndexSearcher(directory) I'm told:
>>>
>>> org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.RAMDirectory@3caa4 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@ed0220c: files: []
>
> --
> Regards,
> K. Gabriele
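[Editor's note] A minimal sketch of that fix against the Lucene 3.x API, inserted into the test above after the commit. IndexReader.reopen() returns the same instance if nothing changed; getDoc()/getWriter() remain the test's own helpers:

    writer.addDocument(doc);
    writer.commit();

    // The old searcher still sees the pre-commit point-in-time view.
    // Reopen the reader, close the stale one, and wrap a new searcher.
    IndexReader oldReader = searcher.getIndexReader();
    IndexReader newReader = oldReader.reopen();
    if (newReader != oldReader) {
      oldReader.close();
      searcher = new IndexSearcher(newReader);
    }

    docs = searcher.search(allQ, 10);
    assertEquals(1, docs.totalHits); // now passes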
Re: Fuzzy Query Param
Good question... I think in Lucene 4.0 the edit distance is (will be) in Unicode code points, but in past releases it's UTF-16 code units.

Mike McCandless
http://blog.mikemccandless.com

2011/6/30 Floyd Wu floyd...@gmail.com:
> If this is an edit distance implementation, what is the result when it is applied to a CJK query? For example, 您好~3
>
> Floyd
>
> 2011/6/30 entdeveloper cameron.develo...@gmail.com
>> I'm using Solr trunk. If it's levenshtein/edit distance, that's great, that's what I want. It just didn't seem to be officially documented anywhere so I wanted to find out for sure. Thanks for confirming.
Re: Fuzzy Query Param
Which version of Solr (Lucene) are you using? Recent versions of Lucene now accept ~N, where N > 1, as an edit distance. Ie foobar~2 matches any term that's <= 2 edit distance away from foobar.

Mike McCandless
http://blog.mikemccandless.com

On Tue, Jun 28, 2011 at 11:00 PM, entdeveloper cameron.develo...@gmail.com wrote:
> According to the docs on lucene query syntax: "Starting with Lucene 1.9 an additional (optional) parameter can specify the required similarity. The value is between 0 and 1; with a value closer to 1 only terms with a higher similarity will be matched."
>
> I was messing around with this and started doing queries with values greater than 1 and it seemed to be doing something. However I haven't been able to find any documentation on this. What happens when specifying a fuzzy query with a value > 1?
>
> tiger~2
> animal~3
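[Editor's note] A small sketch of exercising both forms through the classic 3.x QueryParser; the field and terms are hypothetical, error handling is elided, and which behavior you get (similarity vs. edit distance) depends on the Lucene version as described above:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    QueryParser parser = new QueryParser(Version.LUCENE_32, "content",
        new StandardAnalyzer(Version.LUCENE_32));

    // Classic form: a value in 0..1 is a minimum similarity.
    Query bySimilarity = parser.parse("foobar~0.7");

    // Newer form: an integer > 1 is read as a max edit distance,
    // so this matches terms within 2 edits of "foobar".
    Query byEdits = parser.parse("foobar~2");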
Re: Optimize taking two steps and extra disk space
OK that sounds like a good solution!

You can also have CMS limit how many merges are allowed to run at once, if your IO system has trouble w/ that much concurrency.

Mike McCandless
http://blog.mikemccandless.com

On Mon, Jun 20, 2011 at 6:29 PM, Shawn Heisey s...@elyograg.org wrote:
> On 6/20/2011 3:18 PM, Michael McCandless wrote:
>> With segmentsPerTier at 35 you will easily cross 70 segs in the index... If you want optimize to run in a single merge, I would lower segmentsPerTier and mergeAtOnce (maybe back to the 10 default), and set your maxMergeAtOnceExplicit to 70 or higher... Lower mergeAtOnce means merges run more frequently but for shorter time, and your searching should be faster (than 35/35) since there are fewer segments to visit.
>
> Thanks again for more detailed information. There is method to my madness, which I will now try to explain.
>
> With a value of 10, the reindex involves enough merges that there are many second-level merges, and a third-level merge. I was running into situations on my development platform (with its slow disks) where three merges were happening at the same time, which caused all indexing activity to cease for several minutes. This in turn would cause JDBC to time out and drop the connection to the database, which caused DIH to fail and roll back the entire import about two hours (two thirds) in. With a mergeFactor of 35, there are no second-level merges, and no third-level merges. I can do a complete reindex successfully even on a system with slow disks.
>
> In production, one shard (out of six) is optimized every day to eliminate deleted documents. When I have to reindex everything, I will typically go through and manually optimize each shard in turn after it's done. This is the point where I discovered this two-pass problem. I don't want to do a full-import with optimize=true, because all six large shards build at the same time in a Xen environment. The I/O storm that results from three optimizes happening on each host at the same time and then replicating to similar Xen hosts is very bad.
>
> I have now set maxMergeAtOnceExplicit to 105. I think that is probably enough, given that I currently do not experience any second-level merges. When my index gets big enough, I will increase the ram buffer. By then I will probably have more memory, so the first-level merges can still happen entirely from I/O cache.
>
> Shawn
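[Editor's note] A sketch of capping merge concurrency programmatically; ConcurrentMergeScheduler.setMaxThreadCount() exists in Lucene 3.x, "analyzer" is assumed to be an existing Analyzer, and wiring a configured scheduler into Solr would need a custom factory, which is not shown:

    import org.apache.lucene.index.ConcurrentMergeScheduler;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.util.Version;

    // Allow only one merge thread at a time; additional merges queue
    // up instead of competing for the same slow disks.
    ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
    cms.setMaxThreadCount(1);

    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_32, analyzer);
    iwc.setMergeScheduler(cms);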
Re: Optimize taking two steps and extra disk space
On Tue, Jun 21, 2011 at 9:42 AM, Shawn Heisey s...@elyograg.org wrote:
> On 6/20/2011 12:31 PM, Michael McCandless wrote:
>> For back-compat, mergeFactor maps to both of these, but it's better to set them directly, eg:
>>
>> <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>   <int name="maxMergeAtOnce">10</int>
>>   <int name="segmentsPerTier">20</int>
>> </mergePolicy>
>>
>> (and then remove your mergeFactor setting under indexDefaults)
>
> When I did this and ran a reindex, it merged once it reached 10 segments, despite what I had defined in the mergePolicy. This is Solr 3.2 with the patch from SOLR-1972 applied. I've included the config snippet below into solrconfig.xml using xinclude via another file. I had to put mergeFactor back in to make it work right. I haven't checked yet to see whether an optimize takes one pass. That will be later today.
>
> <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>   <int name="maxMergeAtOnce">35</int>
>   <int name="segmentsPerTier">35</int>
>   <int name="maxMergeAtOnceExplicit">105</int>
> </mergePolicy>

Hmm, something strange is going on.

In Solr 3.2, if you attempt to use mergeFactor and useCompoundFile inside indexDefaults (and outside the mergePolicy), when your mergePolicy is TMP, you should see a warning like this:

  "Use of compound file format or mergefactor cannot be configured if merge policy is not an instance of LogMergePolicy. The configured policy's defaults will be used."

And it shouldn't work. But using the right params inside your mergePolicy section ought to work (though I don't think this is well tested...). I'm not sure why you're seeing the opposite of what I'd expect... I wonder if you're actually really getting the TMP? Can you turn on a verbose IndexWriter infoStream and post the output?

Mike McCandless
http://blog.mikemccandless.com
Re: Optimize taking two steps and extra disk space
On Sun, Jun 19, 2011 at 12:35 PM, Shawn Heisey s...@elyograg.org wrote:
> On 6/19/2011 7:32 AM, Michael McCandless wrote:
>> With LogXMergePolicy (the default before 3.2), optimize respects mergeFactor, so it's doing 2 steps because you have 37 segments but 35 mergeFactor. With TieredMergePolicy (default on 3.2 and after), there is now a separate merge factor used for optimize (maxMergeAtOnceExplicit)... so you could eg set this factor higher and more often get a single merge for the optimize.
>
> This makes sense. The default for maxMergeAtOnceExplicit is 30 according to LUCENE-854, so it merges the first 30 segments, then it goes back and merges the new one plus the other 7 that remain. To counteract this behavior, I've put this in my solrconfig.xml, to test next week:
>
> <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>   <int name="maxMergeAtOnceExplicit">70</int>
> </mergePolicy>
>
> I figure that twice the mergeFactor (35) will likely cover every possible outcome. Is that a correct thought?

Actually, TieredMP has two different params (different from the previous default LogMP):

* segmentsPerTier controls how many segments you can tolerate in the index (bigger number means more segments)

* maxMergeAtOnce says how many segments can be merged at a time for "normal" (not optimize) merging

For back-compat, mergeFactor maps to both of these, but it's better to set them directly, eg:

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">20</int>
</mergePolicy>

(and then remove your mergeFactor setting under indexDefaults)

You should always have maxMergeAtOnce <= segmentsPerTier, else too much merging will happen. If you set segmentsPerTier to 35 then this can easily exceed 70 segments, so your optimize will again need more than one merge.

Note that if you make the maxMergeAtOnce/Explicit too large then 1) you risk running out of file handles (if you don't use compound file), and 2) merge performance likely gets worse as the OS is forced to splinter its IO cache across more files (I suspect) and so more seeking will happen.

Mike McCandless
http://blog.mikemccandless.com
Re: Optimize taking two steps and extra disk space
On Mon, Jun 20, 2011 at 4:00 PM, Shawn Heisey s...@elyograg.org wrote:
> On 6/20/2011 12:31 PM, Michael McCandless wrote:
>> Actually, TieredMP has two different params (different from the previous default LogMP):
>>
>> * segmentsPerTier controls how many segments you can tolerate in the index (bigger number means more segments)
>>
>> * maxMergeAtOnce says how many segments can be merged at a time for "normal" (not optimize) merging
>>
>> For back-compat, mergeFactor maps to both of these, but it's better to set them directly, eg:
>>
>> <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>   <int name="maxMergeAtOnce">10</int>
>>   <int name="segmentsPerTier">20</int>
>> </mergePolicy>
>>
>> (and then remove your mergeFactor setting under indexDefaults)
>>
>> You should always have maxMergeAtOnce <= segmentsPerTier, else too much merging will happen. If you set segmentsPerTier to 35 then this can easily exceed 70 segments, so your optimize will again need more than one merge.
>>
>> Note that if you make the maxMergeAtOnce/Explicit too large then 1) you risk running out of file handles (if you don't use compound file), and 2) merge performance likely gets worse as the OS is forced to splinter its IO cache across more files (I suspect) and so more seeking will happen.
>
> Thanks much for the information! I've set my server up so that the user running the index has a soft limit of 4096 files and a hard limit of 6144 files, and /proc/sys/fs/file-max is 48409, so I should be OK on file handles. The index is almost twice as big as available memory, so I'm not really worried about the I/O cache. I've sized my mergeFactor and ramBufferSizeMB so that the individual merges during indexing happen entirely from the I/O cache, which is the point where I really care about it. There's nothing I can do about the optimize without spending a LOT of money.
>
> I will remove mergeFactor, set maxMergeAtOnce and segmentsPerTier to 35, and maxMergeAtOnceExplicit to 70. If I ever run into a situation where it gets beyond 70 segments at any one time, I've probably got bigger problems than the number of passes my optimize takes, so I'll think about it then. :) Does that sound reasonable?

With segmentsPerTier at 35 you will easily cross 70 segs in the index... If you want optimize to run in a single merge, I would lower segmentsPerTier and mergeAtOnce (maybe back to the 10 default), and set your maxMergeAtOnceExplicit to 70 or higher...

Lower mergeAtOnce means merges run more frequently but for shorter time, and your searching should be faster (than 35/35) since there are fewer segments to visit.

Mike McCandless
http://blog.mikemccandless.com
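[Editor's note] Putting that final recommendation into one solrconfig.xml snippet, as a sketch only: the values are the ones discussed in this thread, and as noted above, honoring the mergePolicy element in Solr 3.2 depends on the SOLR-1972 patch being in place.

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
  <!-- lets optimize merge up to 70 segments in a single pass -->
  <int name="maxMergeAtOnceExplicit">70</int>
</mergePolicy>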
Re: Optimize taking two steps and extra disk space
With LogXMergePolicy (the default before 3.2), optimize respects mergeFactor, so it's doing 2 steps because you have 37 segments but 35 mergeFactor.

With TieredMergePolicy (default on 3.2 and after), there is now a separate merge factor used for optimize (maxMergeAtOnceExplicit)... so you could eg set this factor higher and more often get a single merge for the optimize.

Mike McCandless
http://blog.mikemccandless.com

On Sat, Jun 18, 2011 at 6:45 PM, Shawn Heisey s...@elyograg.org wrote:
> I've noticed something odd in Solr 3.2 when it does an optimize. One of my shards (freshly built via DIH full-import) had 37 segments, totalling 17.38GB of disk space. 13 of those segments were results of merges during initial import; the other 24 were untouched after creation. Starting at _0, the final segment before optimizing is _co. The mergeFactor on the index is 35, chosen because it makes merged segments line up nicely on z boundaries.
>
> The optimization process created a _cp segment of 14.4GB, followed by a _cq segment at the final 17.27GB size, so at the peak it took 49GB of disk space to hold the index. Is there any way to make it do the optimize in one pass? Is there a compelling reason why it does it this way?
>
> Thanks,
> Shawn
Re: Field Collapsing and Grouping in Solr 3.2
Alas, no, not yet... grouping/field collapse has had a long history with Solr.

There were many iterations on SOLR-236, but that impl was never committed. Instead, SOLR-1682 was committed, but committed only to trunk (never backported to 3.x despite requests). Then, a new grouping module was factored out of Solr's trunk implementation, and was backported to 3.x. Finally, there is now an effort to cut over Solr trunk (SOLR-2564) and Solr 3.x (SOLR-2524) to the new grouping module, which looks like it's close to being done!

So hopefully for 3.3, but no promises! This is open-source...

Mike McCandless
http://blog.mikemccandless.com

2011/6/16 Sergio Martín sergio.mar...@playence.com
> Hello. Does anybody know if Field Collapsing and Grouping is available in Solr 3.2? I mean directly available, not as a patch. I have read conflicting statements about it... Thanks a lot!
>
> Sergio Martín Cantero
> playence KG
> Penthouse office Soho II - Top 1, Grabenweg 68, 6020 Innsbruck, Austria
> eMail: sergio.mar...@playence.com
> Web: www.playence.com
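[Editor's note] For anyone wanting to try the standalone 3.x grouping module directly (before the Solr cutover lands), a hedged sketch of the two-pass flow Mike describes; the collector names and signatures are from memory of the 3.x module, so treat them as approximate, and "searcher"/"query" are assumed to exist:

    import java.util.Collection;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.grouping.SearchGroup;
    import org.apache.lucene.search.grouping.TermFirstPassGroupingCollector;
    import org.apache.lucene.search.grouping.TermSecondPassGroupingCollector;
    import org.apache.lucene.search.grouping.TopGroups;

    // Pass 1: find the top 10 group keys for the query.
    TermFirstPassGroupingCollector firstPass =
        new TermFirstPassGroupingCollector("domain", Sort.RELEVANCE, 10);
    searcher.search(query, firstPass);
    Collection<SearchGroup<String>> topGroups = firstPass.getTopGroups(0, true);

    // Pass 2: re-run the query, collecting the top 5 docs within each group.
    TermSecondPassGroupingCollector secondPass =
        new TermSecondPassGroupingCollector("domain", topGroups,
            Sort.RELEVANCE, Sort.RELEVANCE, 5, true, false, true);
    searcher.search(query, secondPass);
    TopGroups<String> results = secondPass.getTopGroups(0);

A compound-key version would subclass the abstract first/second pass collectors to return a richer group value, as sketched earlier in this archive.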