Re: how to update billions of docs
An update on how I ended up implementing the requirement, in case it helps others. There is a lot of other code I did not include, but the general logic is below. While performance is still not great, it is 10x faster than atomic updates (because RealTimeGetComponent.getInputDocument() is not needed).

1. Wrote an update handler:

   /myupdater?q=*:*&sort=fieldx desc&fl=fieldx,fieldy&stream.file=exampledocs/oldvalueToNewValue.properties&update.chain=myprocessor

2. In the handler, read the map from the content stream and invoke the export handler for the query params:

   SolrRequestHandler handler = core.getRequestHandler("/export");
   core.execute(handler, req, rsp);
   numFound = (Integer) req.getContext().get("totalHits");

3. Iterate over the /export handler response, similar to the SortingResponseWriter.write() method:

   List<LeafReaderContext> leaves = req.getSearcher().getTopReaderContext().leaves();
   for (int i = 0; i < leaves.size(); i++) {
     DocIdSetIterator it = new BitSetIterator(sets[i], 0);
     while ((docId = it.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
       // get the lucene doc
       Document luceneDoc = leaves.get(i).reader().document(docId);
       // update the lucene doc with the new values
       updateDoc(luceneDoc, oldValueToNewValuesMap);
       // post the lucene doc to a linked blocking queue
       queue.add(luceneDoc);
     }
   }

4.
have N threads waiting on the queue for docs; each invokes the UpdateRequestProcessor chain given by the update.chain param:

   AddUpdateCommand cmd = new AddUpdateCommand(request);
   IndexSchema schema = req.getLatestSchema();
   while (true) {
     Document luceneDoc = queue.take();
     SolrDocument doc = toSolrDocument(luceneDoc, schema);
     cmd.doc = doc;
     // set these fields as needed
     cmd.overwrite = false;
     cmd.setVersion(0);
     doc.removeField("_version_");
     // post doc
     updateProcessor.processAdd(cmd);
   }

-Mohsin

----- Original Message -----
From: jack.krupan...@gmail.com
To: solr-user@lucene.apache.org
Sent: Friday, March 18, 2016 6:55:17 AM GMT -08:00 US/Canada Pacific
Subject: Re: how to update billions of docs

That's another great example of a mode that Bulk Field Update (my mythical feature) needs - switch a list of fields from stored to docValues. And maybe even the opposite, since there are scenarios in which docValues is worse than stored and you would only find that out after indexing... billions of documents.

Being able to switch the indexed mode of a field (or list of fields) is also a mode needed for bulk update (reindex).

-- Jack Krupansky

On Fri, Mar 18, 2016 at 4:12 AM, Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
> Hi Mohsin,
> There's some work in progress for in-place updates to docValued fields,
> https://issues.apache.org/jira/browse/SOLR-5944. Can you try the latest
> patch there (or ping me if you need a git branch)?
> It would be nice to know how fast the updates go for your use case with
> that patch. Please note that for that patch, both the version field and
> the updated field need to have stored=false, indexed=false, docValues=true.
> Regards,
> Ishan
>
> On Thu, Mar 17, 2016 at 10:55 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>
> > It would be nice to have a wiki/doc for "Bulk Field Update" that listed
> > all of these techniques and tricks.
> >
> > And, of course, it would be so much better to have an explicit Lucene
> > feature for this.
> > It could work in the background, like merge, and process
> > one segment at a time as efficiently as possible.
> >
> > Have several modes:
> >
> > 1. Set a field of all documents to an explicit value.
> > 2. Set a field of query documents to an explicit value.
> > 3. Increment by n.
> > 4. Add a new field to all documents, or maybe by query.
> > 5. Delete an existing field for all documents.
> > 6. Delete a field value for all documents or a specified query.
> >
> > -- Jack Krupansky
> >
> > On Thu, Mar 17, 2016 at 12:31 PM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
> >
> > > As others noted, currently updating a field means deleting and
> > > inserting the entire document.
> > >
> > > Depending on how you use the field, you might be able to create
> > > another core/container with that one field (plus the key field),
> > > and use join support.
> > >
> > > Note that https://issues.apache.org/jira/browse/LUCENE-6352 is an
> > > improvement, which looks like it's in the 5.x code line, though I
> > > don't see a fix version.
> > >
> > > -- Ken
> > >
> > > > From: Mohsin Beg Beg
> > > > Sent: March 16, 20
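Since the Solr-specific pieces in steps 3 and 4 above can't run standalone, here is a minimal, Solr-free sketch of the same producer/consumer shape: one producer feeds docs into a LinkedBlockingQueue and N worker threads drain it, applying an old-value-to-new-value map. All names here (BulkUpdatePipeline, run, POISON, the Map-based "doc") are hypothetical stand-ins, not Solr APIs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BulkUpdatePipeline {
    // Poison pill: a distinct sentinel instance telling consumers to stop
    // (a stand-in for reaching the end of the /export stream).
    static final Map<String, String> POISON = new HashMap<>();

    static List<Map<String, String>> run(List<Map<String, String>> docs,
                                         Map<String, String> oldToNew,
                                         int nThreads) throws InterruptedException {
        BlockingQueue<Map<String, String>> queue = new LinkedBlockingQueue<>();
        List<Map<String, String>> updated = Collections.synchronizedList(new ArrayList<>());

        // Consumer: stand-in for the thread that calls updateProcessor.processAdd(cmd).
        Runnable consumer = () -> {
            try {
                while (true) {
                    Map<String, String> doc = queue.take();
                    if (doc == POISON) return;
                    // Stand-in for updateDoc(luceneDoc, oldValueToNewValuesMap).
                    Map<String, String> copy = new HashMap<>(doc);
                    copy.replaceAll((k, v) -> oldToNew.getOrDefault(v, v));
                    updated.add(copy);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < nThreads; i++) {
            Thread t = new Thread(consumer);
            t.start();
            threads.add(t);
        }
        // Producer: stand-in for iterating the export DocIdSetIterator.
        for (Map<String, String> d : docs) queue.add(d);
        for (int i = 0; i < nThreads; i++) queue.add(POISON);
        for (Thread t : threads) t.join();
        return updated;
    }
}
```

One poison pill per consumer thread is the simplest way to shut the workers down cleanly once the producer has drained the export response.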
how to update billions of docs
Hi,

I have a requirement to replace a value of a field in 100B's of docs in 100's of cores. The field is multiValued=false, docValues=true, type=StrField, stored=true, indexed=true. Atomic update performance is on the order of 5K docs per sec per core in Solr 5.3 (the other fields are quite big). Any suggestions?

-Mohsin
Re: Dealing with bad apples in a SolrCloud cluster
How about dynamic loading/unloading of some shards (cores), similar to the transient cores feature? Should be ok if the unloaded shard has a replica. If there is no replica, then extending the shards.tolerant concept to use some timeout/acceptable-latency value sounds interesting.

-Mohsin

----- Original Message -----
From: thelabd...@gmail.com
To: solr-user@lucene.apache.org
Sent: Friday, November 21, 2014 10:56:51 AM GMT -08:00 US/Canada Pacific
Subject: Dealing with bad apples in a SolrCloud cluster

Just soliciting some advice from the community ...

Let's say I have a 10-node SolrCloud cluster and have a single collection with 2 shards with replication factor 10, so basically each shard has one replica on each of my nodes.

Now imagine one of those nodes starts getting into a bad state and starts to be slow about serving queries (not bad enough to crash outright though) ... I'm sure we could ponder any number of ways a box might slow down without crashing. From my calculations, about 2/10ths of the queries will now be affected, since 1/10 queries from client apps will hit the bad apple + 1/10 queries from other replicas will hit the bad apple (distrib=false).

If QPS is high enough and the bad apple is slow enough, things can start to get out of control pretty fast, esp. since we've set max threads so high to avoid distributed dead-lock.

What have others done to mitigate this risk? Anything we can do in Solr to help deal with this? It seems reasonable that nodes can identify a bad apple by keeping track of query times and looking for nodes that are significantly outside (>= 2 stddev) what the other nodes are doing. Then maybe mark the node as being down in ZooKeeper so clients and other nodes stop trying to send requests to it; or maybe a simple policy of just not sending requests to that node for a few minutes.
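The detection idea in the last paragraph above — flag a node whose query latency sits well outside the cluster's distribution — can be sketched in a few lines. This is a hypothetical helper (not a Solr API); it assumes per-node mean latencies have already been collected somewhere.

```java
import java.util.List;

public class BadAppleDetector {
    /**
     * Returns true if a node's observed latency is more than k standard
     * deviations above the cluster-wide mean. Population stddev is used
     * for simplicity; a production version would want windowed samples.
     */
    static boolean isOutlier(double nodeLatency, List<Double> clusterLatencies, double k) {
        double mean = clusterLatencies.stream()
                .mapToDouble(Double::doubleValue).average().orElse(0);
        double variance = clusterLatencies.stream()
                .mapToDouble(l -> (l - mean) * (l - mean)).average().orElse(0);
        double stddev = Math.sqrt(variance);
        return nodeLatency > mean + k * stddev;
    }
}
```

The k threshold (2 in the email's suggestion) trades false positives against detection speed; the action taken on a flagged node (mark down in ZooKeeper, or just skip for a few minutes) is a separate policy decision.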
OutOfMemory on 28 docs with facet.method=fc/fcs
Hi,

I am getting OOM when faceting on numFound=28. The receiving solr node throws the OutOfMemoryError even though there is 7gb of available heap before the faceting request is submitted. If a different solr node is selected, that one fails too. Any suggestions?

1) Test setup is:
   100 collections with 20 shards each := 2000 cores
   20 solr nodes of 16gb jvm memory := 100 cores per jvm node
   5 hosts of 300 gb memory := 4 solr nodes per host

2) Query (edited for brevity) is below; field1...field15 are 15 among ~500 fields of type strings (tokenized) and numerics:

   http://myhost:8983/solr/Collection1/query
     ?q=fieldX:xyz AND fieldY:(r OR g OR b)
     &rows=0
     &fq={!cache=false}time:[begin time TO end time]
     &facet=true
     &facet.sort=count
     &facet.missing=false
     &facet.mincount=1
     &facet.threads=10
     &facet.field=field1...field15
     &f.field1...field15.facet.method=fc/fcs
     &collection=Collection1...Collection100

-M
Re: OutOfMemory on 28 docs with facet.method=fc/fcs
Looking at SimpleFacets.java, doesn't fc/fcs iterate only over the DocSet for the fields? So assuming each field has a unique term across the 28 rows, a max of 28 * 15 unique small strings (100 bytes each) should be on the order of 1MB. For 100 collections, let's say a total of 1GB. Now let's say I multiply it by 3, to 3GB. That still leaves more than 4GB of heap used by something else to run out of memory? Who and why? *sigh*

Now looking at the alternatives...

1. I don't know which shards have the docs, so yes, all collections are needed. The q and fq params can be complicated.
2. A hierarchical field doesn't work when selecting 15 fields (out of 300+), since there is no way to give one hierarchical path in fq or via the facet.prefix value.
3. One facet-at-a-time exceeds the total latency requirements of the app on top.

Am I stuck?

ps: Doesn't enum build the uninverted index for each unique term in the field and then intersect with the DocSet to return the facet counts? This causes filterCache entries to be bloated in each core. That causes OOM on just 4 or 5 string fields (depending on their cardinality).

-M

----- Original Message -----
From: t...@statsbiblioteket.dk
To: solr-user@lucene.apache.org
Sent: Tuesday, November 18, 2014 12:34:08 PM GMT -08:00 US/Canada Pacific
Subject: RE: OutOfMemory on 28 docs with facet.method=fc/fcs

Mohsin Beg Beg [mohsin@oracle.com] wrote:
> I am getting OOM when faceting on numFound=28. The receiving solr node
> throws the OutOfMemoryError even though there is 7gb available heap
> before the faceting request was submitted.

fc and fcs faceting memory overhead is (nearly) independent of the number of hits in the search result.

> If a different solr node is selected that one fails too. Any suggestions ?
> facet.field=field1...field15
> f.field1...field15.facet.method=fc/fcs
> collection=Collection1...Collection100

You seem to be issuing a facet request for 15 fields in 100 collections concurrently.
The memory overhead will be linear in the number of documents, the references from documents to field values, and the number of unique values in your facets, for each facet independently. That was confusing. Let me try an example instead:

For each field, the static memory requirement will be a structure that maps from documents to term ordinals. Depending on circumstances, this can be small (DocValues and a numeric field) or big (multi-value, non-DocValues String).

Each concurrent call will temporarily allocate a structure for counting. If the field is numeric, this will be a hashmap. If it is String, it will be an integer array with as many entries as there are unique values: if there are 1M unique String values in the field, the overhead will be 4 bytes * 1M = 4MB.

So, if each field has 250K unique String values, the temporary overhead for all 15 fields will be 15MB. I don't know if the request for multiple collections is threaded, but if so, the 15MB should be multiplied by 100, totalling 1.5GB memory overhead for each call. Add the static structures and it does not seem unreasonable that you run out of memory.

All this is very loose, but the overall message is that documents, unique facet values, facets and collections all multiply memory requirements.

* Do you need to query all collections at once?
* Can you collapse some of the facet fields, to reduce the total number?
* Are some of the fields very small? If so, use enum for them instead of fc/fcs.
* Maybe you can determine your limits by issuing requests first for 1 field, then 2, etc. This is to see whether it is feasible to do a minor tweak to get it to work, or whether your setup is so large that something entirely different needs to be done.

- Toke Eskildsen
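Toke's back-of-envelope arithmetic above (one 4-byte counter per unique String value, per field, per concurrently queried collection) can be captured in a tiny helper. The class and method names are illustrative only, not anything in Solr:

```java
public class FacetMemoryEstimate {
    /**
     * Rough temporary counting overhead for String fc/fcs faceting:
     * a 4-byte int counter per unique value, allocated per field and
     * per concurrently queried collection. Static un-inversion
     * structures are NOT included in this figure.
     */
    static long countingOverheadBytes(long uniqueValuesPerField, int fields, int collections) {
        return 4L * uniqueValuesPerField * fields * collections;
    }
}
```

Plugging in the thread's numbers (250K unique values, 15 fields, 100 collections) reproduces the 1.5GB-per-call estimate, which is why querying all collections at once is the first thing to reconsider.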
Re: OutOfMemory on 28 docs with facet.method=fc/fcs
The solrcloud has 8 billion+ docs and is increasing non-linearly each hour. numFound=28 was for the faceting query only. If fieldCache (lucene caches) is the issue, would q=time:[begin time TO end time] be better instead?

-Mohsin

----- Original Message -----
From: apa...@elyograg.org
To: solr-user@lucene.apache.org
Sent: Tuesday, November 18, 2014 2:45:46 PM GMT -08:00 US/Canada Pacific
Subject: Re: OutOfMemory on 28 docs with facet.method=fc/fcs

On 11/18/2014 3:06 PM, Mohsin Beg Beg wrote:
> Looking at SimpleFacets.java, doesn't fc/fcs iterate only over the
> DocSet for the fields. So assuming each field has a unique term across
> the 28 rows, a max of 28 * 15 unique small strings (100bytes), should
> be in the order of 1MB. For 100 collections, lets say a total of 1GB.
> Now lets say I multiply it by 3 to 3GB.

Are there 28 documents in the entire index? It's my understanding that the fieldCache memory required is not dependent on the number of documents that match your query (numFound); it's dependent on the number of documents in the entire index.

If my understanding is correct, once that memory structure is calculated and stored in the fieldCache, it's available to speed up future facets on that field, even if the query and filters are different than what was used the first time. It doesn't seem as useful for typical use cases to store a facet cache entry that depends on the specific query.

Thanks,
Shawn
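Shawn's point — that un-inversion cost scales with the whole index rather than with numFound — is easy to see with this thread's numbers. A rough, assumption-laden sketch (assumes one 4-byte ordinal entry per document per faceted String field; real per-version memory layouts differ):

```java
public class FieldCacheEstimate {
    /**
     * Very rough lower bound: un-inverting a String field for faceting
     * costs on the order of one ordinal entry per document in the index,
     * no matter how few documents the query actually matched.
     */
    static long ordinalBytes(long docsInIndex, int bytesPerOrdinal) {
        return docsInIndex * bytesPerOrdinal;
    }
}
```

At 8 billion docs and 4 bytes per ordinal, a single faceted field is already on the order of 32GB across the cluster, which dwarfs the 7GB of free heap on one node even though the query matched only 28 documents.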
DocSet getting cached in filterCache for facet request with {!cache=false}
Hello,

It seems Solr is caching when faceting even with fq={!cache=false}*:* specified. This is what I am doing on Solr 4.10.0 on jre 1.7.0_51.

Query 1) No cache entry in filterCache, as expected:
http://localhost:8983/solr/collection1/select?q=*:*&rows=0&fq={!cache=false}*:*
http://localhost:8983/solr/#/collection1/plugins/cache?entry=filterCache confirms this.

Query 2) Query result docset cached in filterCache unexpectedly?
http://localhost:8983/solr/collection1/select?q=*:*&rows=0&fq={!cache=false}*:*&facet=true&facet.field=foobar&facet.method=enum
http://localhost:8983/solr/#/collection1/plugins/cache?entry=filterCache shows an entry of item_*:*: org.apache.solr.search.BitDocSet@66afbbf cached.

Suggestions why, or how this may be avoided? I don't want to cache anything other than the facet(ed) terms in the filterCache (for predictable heap usage). The culprit seems to be line 1431 @ http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_10_2/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java?view=markup

Thanks.
-M
Re: DocSet getting cached in filterCache for facet request with {!cache=false}
Shawn, then how does one skip the filterCache for facet.method=enum? The wiki says fq={!cache=false}*:* is ok, no? https://wiki.apache.org/solr/SolrCaching#filterCache

-Mohsin

----- Original Message -----
From: erickerick...@gmail.com
To: solr-user@lucene.apache.org
Sent: Tuesday, November 11, 2014 8:40:54 AM GMT -08:00 US/Canada Pacific
Subject: Re: DocSet getting cached in filterCache for facet request with {!cache=false}

Well, the difference is that you're faceting with method=enum, which uses the filterCache (I think, it's been a while). I admit I'm a little surprised that when I tried faceting with the inStock field in the standard distro I got 3 entries when there are only two values, but I'm willing to let that go ;)

i.e. this produces 3 entries in the filterCache:
http://localhost:8983/solr/techproducts/select?q=*:*&rows=0&facet=true&facet.field=inStock&facet.method=enum

not an fq clause in sight...

Best,
Erick

On Tue, Nov 11, 2014 at 9:31 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 11/11/2014 1:22 AM, Mohsin Beg Beg wrote:
> > It seems Solr is caching when faceting even with fq={!cache=false}*:*
> > specified. This is what I am doing on Solr 4.10.0 on jre 1.7.0_51.
> >
> > Query 1) No cache in filterCache as expected
> > http://localhost:8983/solr/collection1/select?q=*:*&rows=0&fq={!cache=false}*:*
> > http://localhost:8983/solr/#/collection1/plugins/cache?entry=filterCache confirms this.
> >
> > Query 2) Query result docset cached in filterCache unexpectedly ?
> > http://localhost:8983/solr/collection1/select?q=*:*&rows=0&fq={!cache=false}*:*&facet=true&facet.field=foobar&facet.method=enum
> > http://localhost:8983/solr/#/collection1/plugins/cache?entry=filterCache shows entry of item_*:*: org.apache.solr.search.BitDocSet@66afbbf cached.
> >
> > Suggestions why or how this may be avoided since I don't want to cache
> > anything other than facet(ed) terms in the filterCache (for predictable
> > heap usage).
>
> I hope this is just for testing, because fq=*:* is completely unnecessary, and will cause Solr to do extra work that it doesn't need to do.
> Try changing that second query so q and fq are not the same, so you can
> see for sure which one is producing the filterCache entry. With the same
> query for both, you cannot know which one is populating the filterCache.
>
> If it's coming from the q parameter, then it's probably working as
> designed. If it comes from the fq, then we probably actually do have a
> problem that needs investigation.
>
> Thanks,
> Shawn