Re: how to update billions of docs

2016-03-24 Thread Mohsin Beg Beg

An update on how I ended up implementing the requirement, in case it helps 
others. There is a lot of other code I did not include, but the general logic is 
below.

While performance is still not great, it is 10x faster than atomic updates 
(because RealTimeGetComponent.getInputDocument() is not needed).


1. Wrote an update handler
   /myupdater?q=*:*&sort=fieldx desc&fl=fieldx,fieldy
       &stream.file=exampledocs/oldvalueToNewValue.properties&update.chain=myprocessor


2. In the handler, read the map from the content stream and invoke the /export 
handler with the query params
   SolrRequestHandler handler = core.getRequestHandler("/export");
   core.execute(handler, req, rsp);
   numFound = (Integer) req.getContext().get("totalHits");
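
   For reference, a minimal sketch of how the oldValue -> newValue map could be
   loaded from the stream.file content stream, assuming a plain key=value
   properties file (this helper code is illustrative, not the actual handler):

   Map<String, String> oldValueToNewValuesMap = new HashMap<>();
   for (ContentStream stream : req.getContentStreams()) {
     Properties props = new Properties();
     try (Reader reader = stream.getReader()) {
       props.load(reader);                       // one oldValue=newValue pair per line
     }
     for (String oldValue : props.stringPropertyNames()) {
       oldValueToNewValuesMap.put(oldValue, props.getProperty(oldValue));
     }
   }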


3. Iterate over the /export handler response, similar to the 
SortingResponseWriter.write() method
 
   List<LeafReaderContext> leaves =
       req.getSearcher().getTopReaderContext().leaves();
   for (int i = 0; i < leaves.size(); i++) {
     // sets[i] is the bitset of hits for this segment (from the export response)
     DocIdSetIterator it = new BitSetIterator(sets[i], 0);
     int docId;
     while ((docId = it.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
       // get lucene doc
       Document luceneDoc = leaves.get(i).reader().document(docId);

       // update lucene doc with new values
       updateDoc(luceneDoc, oldValueToNewValuesMap);

       // post lucene doc to a linked blocking queue
       queue.add(luceneDoc);
     }
   }
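
   The updateDoc() helper is not shown above; a minimal sketch of what it might
   do for the single-valued string field being rewritten (fieldx, from the fl/sort
   example; only the stored value needs replacing, since the update chain rebuilds
   the full document from the schema when the doc is re-added):

   private void updateDoc(Document luceneDoc, Map<String, String> oldToNew) {
     String oldValue = luceneDoc.get("fieldx");
     String newValue = oldToNew.get(oldValue);
     if (newValue != null) {
       luceneDoc.removeField("fieldx");
       luceneDoc.add(new StoredField("fieldx", newValue));
     }
   }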


4. Have N threads waiting on the queue for docs; each invokes the 
UpdateRequestProcessor chain named by the update.chain param
   // updateProcessor = processor created from the chain named by update.chain (not shown)
   AddUpdateCommand cmd = new AddUpdateCommand(req);
   IndexSchema schema = req.getSchema();
   while (true) {
     Document luceneDoc = queue.take();
     SolrDocument doc = toSolrDocument(luceneDoc, schema);
     doc.removeField("_version_");        // drop the old version so the doc is re-added cleanly

     // set these fields as needed
     cmd.solrDoc = ClientUtils.toSolrInputDocument(doc);  // AddUpdateCommand.solrDoc is a SolrInputDocument
     cmd.overwrite = false;
     cmd.setVersion(0);

     // post doc
     updateProcessor.processAdd(cmd);
   }
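
   To wire up the N consumers from step 4, a fixed thread pool sharing the
   blocking queue would do; a rough sketch (numThreads, the queue bound, and the
   hypothetical postDoc() helper, which would hold the convert-and-processAdd
   body from step 4, are all illustrative):

   BlockingQueue<Document> queue = new LinkedBlockingQueue<>(10_000);
   ExecutorService executor = Executors.newFixedThreadPool(numThreads);
   for (int t = 0; t < numThreads; t++) {
     executor.submit(() -> {
       try {
         while (true) {
           Document luceneDoc = queue.take();   // blocks until the producer adds a doc
           postDoc(luceneDoc);                  // convert to SolrInputDocument + processAdd
         }
       } catch (InterruptedException e) {
         Thread.currentThread().interrupt();    // exit on shutdown
       }
     });
   }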


-Mohsin


- Original Message -
From: jack.krupan...@gmail.com
To: solr-user@lucene.apache.org
Sent: Friday, March 18, 2016 6:55:17 AM GMT -08:00 US/Canada Pacific
Subject: Re: how to update billions of docs

That's another great example of a mode that Bulk Field Update (my mythical
feature) needs - switch a list of fields from stored to docvalues.

And maybe even the opposite since there are scenarios in which docValues is
worse than stored and you would only find that out after indexing...
billions of documents.

Being able to switch indexed mode of a field (or list of fields) is also a
mode needed for bulk update (reindex).


-- Jack Krupansky

On Fri, Mar 18, 2016 at 4:12 AM, Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Hi Mohsin,
> There's some work in progress for in-place updates to docValued fields,
> https://issues.apache.org/jira/browse/SOLR-5944. Can you try the latest
> patch there (or ping me if you need a git branch)?
> It would be nice to know how fast the updates go for your usecase with that
> patch. Please note that for that patch, both the version field and the
> updated field needs to have stored=false, indexed=false, docValues=true.
> Regards,
> Ishan
>
>
> On Thu, Mar 17, 2016 at 10:55 PM, Jack Krupansky <jack.krupan...@gmail.com
> >
> wrote:
>
> > It would be nice to have a wiki/doc for "Bulk Field Update" that listed
> all
> > of these techniques and tricks.
> >
> > And, of course, it would be so much better to have an explicit Lucene
> > feature for this. It could work in the background like merge and process
> > one segment at a time as efficiently as possible.
> >
> > Have several modes:
> >
> > 1. Set a field of all documents to explicit value.
> > 2. Set a field of query documents to an explicit value.
> > 3. Increment by n.
> > 4. Add new field to all document, or maybe by query.
> > 5. Delete existing field for all documents.
> > 6. Delete field value for all documents or a specified query.
> >
> >
> > -- Jack Krupansky
> >
> > On Thu, Mar 17, 2016 at 12:31 PM, Ken Krugler <
> kkrugler_li...@transpac.com
> > >
> > wrote:
> >
> > > As others noted, currently updating a field means deleting and
> inserting
> > > the entire document.
> > >
> > > Depending on how you use the field, you might be able to create another
> > > core/container with that one field (plus the key field), and use join
> > > support.
> > >
> > > Note that https://issues.apache.org/jira/browse/LUCENE-6352 is an
> > > improvement, which looks like it's in the 5.x code line, though I don't
> > see
> > > a fix version.
> > >
> > > -- Ken
> > >
> > > > From: Mohsin Beg Beg
> > > > Sent: March 16, 20

how to update billions of docs

2016-03-19 Thread Mohsin Beg Beg
Hi,

I have a requirement to replace the value of a field in 100B's of docs across 
100's of cores.
The field is multiValued=false, docValues=true, type=StrField, stored=true, 
indexed=true.

Atomic update performance is on the order of 5K docs per sec per core in Solr 
5.3 (the other fields are quite big).
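
(For context, the atomic update being benchmarked is the usual SolrJ "set" 
operation, roughly like the sketch below; the id, field name, and client setup 
are illustrative.)

   SolrInputDocument doc = new SolrInputDocument();
   doc.addField("id", "doc123");
   doc.addField("myfield", Collections.singletonMap("set", "newValue"));  // atomic "set"
   client.add(doc);   // client is a SolrClient pointed at the target collection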

Any suggestions ?

-Mohsin


Re: Dealing with bad apples in a SolrCloud cluster

2014-11-21 Thread Mohsin Beg Beg

How about dynamic loading/unloading of some shards (cores), similar to the 
transient cores feature? That should be OK if the unloaded shard has a replica. If 
there is no replica, then extending the shards.tolerant concept to use some 
timeout/acceptable-latency value sounds interesting.

-Mohsin

- Original Message -
From: thelabd...@gmail.com
To: solr-user@lucene.apache.org
Sent: Friday, November 21, 2014 10:56:51 AM GMT -08:00 US/Canada Pacific
Subject: Dealing with bad apples in a SolrCloud cluster

Just soliciting some advice from the community ...

Let's say I have a 10-node SolrCloud cluster and have a single collection
with 2 shards with replication factor 10, so basically each shard has one
replica on each of my nodes.

Now imagine one of those nodes starts getting into a bad state and starts
to be slow about serving queries (not bad enough to crash outright though)
... I'm sure we could ponder any number of ways a box might slow down
without crashing.

From my calculations, about 2/10ths of the queries will now be affected
since

1/10 queries from client apps will hit the bad apple
  +
1/10 queries from other replicas will hit the bad apple (distrib=false)


If QPS is high enough and the bad apple is slow enough, things can start to
get out of control pretty fast, esp. since we've set max threads so high to
avoid distributed dead-lock.

What have others done to mitigate this risk? Anything we can do in Solr to
help deal with this? It seems reasonable that nodes can identify a bad
apple by keeping track of query times and looking for nodes that are
significantly outside (>= 2 stddev) what the other nodes are doing. Then
maybe mark the node as being down in ZooKeeper so clients and other nodes
stop trying to send requests to it; or maybe a simple policy of just don't
send requests to that node for a few minutes.
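
(A rough sketch of the outlier test described above, just to make the policy 
concrete; the inputs would come from whatever per-node query-time tracking is 
in place, and the 2-stddev threshold is the one suggested here:)

   // flag a node whose mean query time sits well outside the other nodes'
   boolean isBadApple(double nodeMeanMs, double[] otherNodeMeansMs) {
     double sum = 0, sumSq = 0;
     for (double m : otherNodeMeansMs) { sum += m; sumSq += m * m; }
     double mean = sum / otherNodeMeansMs.length;
     double stddev = Math.sqrt(Math.max(0, sumSq / otherNodeMeansMs.length - mean * mean));
     return nodeMeanMs > mean + 2 * stddev;
   }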


OutOfMemory on 28 docs with facet.method=fc/fcs

2014-11-18 Thread Mohsin Beg Beg

Hi,

I am getting an OOM when faceting on numFound=28. The receiving Solr node throws 
the OutOfMemoryError even though there is 7GB of available heap before the 
faceting request is submitted. If a different Solr node is selected, that one 
fails too. Any suggestions?


1) Test setup is:-
100 collections with 20 shards each := 2000 cores
20 solr nodes of 16gb jvm memory := 100 cores per jvm node
5 hosts of 300 gb memory := 4 solr nodes per host


2) Query (edited for brevity) is :-
field1...field15 below are 15 of the ~500 fields, of type string (tokenized) and 
numeric.

http://myhost:8983/solr/Collection1/query
   ?q=fieldX:xyz AND fieldY:(r OR g OR b)
   &rows=0
   &fq={!cache=false}time:[begin time TO end time]
   &facet=true
   &facet.sort=count
   &facet.missing=false
   &facet.mincount=1
   &facet.threads=10
   &facet.field=field1...field15
   &f.field1...field15.facet.method=fc/fcs
   &collection=Collection1...Collection100


-M


Re: OutOfMemory on 28 docs with facet.method=fc/fcs

2014-11-18 Thread Mohsin Beg Beg


Looking at SimpleFacets.java, doesn't fc/fcs iterate only over the DocSet for 
the fields? So assuming each field has a unique term across the 28 rows, a max 
of 28 * 15 unique small strings (~100 bytes each) should be on the order of 1MB. 
For 100 collections, let's say a total of 1GB. Now let's say I multiply it by 3, 
to 3GB.

That still leaves more than 4GB of heap used by something else to run out of 
memory. Who and why? *sigh*

Now looking at the alternatives...
1. I don't know which shards have the docs, so yes, all collections are needed. 
The q and fq params can be complicated.
2. A hierarchical field doesn't work when selecting 15 fields (out of 300+), 
since there is no way to give one hierarchical path in fq or via a facet.prefix 
value.
3. Faceting one field at a time exceeds the total latency requirements of the 
app on top.

Am I stuck?

PS: Doesn't enum build the uninverted index for each unique term in the field 
and then intersect with the DocSet to return the facet counts? This causes 
filterCache entries to be bloated in each core, which causes OOM on just 4 or 5 
string fields (depending on their cardinality).

-M


- Original Message -
From: t...@statsbiblioteket.dk
To: solr-user@lucene.apache.org
Sent: Tuesday, November 18, 2014 12:34:08 PM GMT -08:00 US/Canada Pacific
Subject: RE: OutOfMemory on 28 docs with facet.method=fc/fcs

Mohsin Beg Beg [mohsin@oracle.com] wrote:
> I am getting OOM when faceting on numFound=28. The receiving
> solr node throws the OutOfMemoryError even though there is 7gb
> available heap before the faceting request was submitted.

fc and fcs faceting memory overhead is (nearly) independent of the number of 
hits in the search result.

> If a different solr node is selected that one fails too. Any suggestions ?

> &facet.field=field1...field15
> &f.field1...field15.facet.method=fc/fcs
> &collection=Collection1...Collection100

You seem to be issuing a facet request for 15 fields in 100 collections 
concurrently. The memory overhead will be linear in the number of documents, 
references from documents to field values, and the number of unique values in 
your facets, for each facet independently.

That was confusing. Let me try an example instead:

For each field, static memory requirements will be a structure that maps from 
documents to term ordinals. Depending on circumstances, this can be small 
(DocValues and a numeric field) or big (multi-value, non-DocValue String). Each 
concurrent call will temporarily allocate a structure for counting. If the 
field is numeric, this will be a hashmap. If it is String, it will be an 
integer-array with as many entries as there are unique values: If there are 1M 
unique String values in the field, the overhead will be 4 bytes * 1M = 4MB.

So, if each field has 250K unique String values, the temporary overhead for all 
15 fields will be 15MB. I don't know if the request for multiple collections is 
threaded, but if so, the 15MB should be multiplied by 100, totalling 1.5GB 
memory overhead for each call. Add the static structures and it does not seem 
unreasonable that you run out of memory.
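
(In code form, that back-of-the-envelope estimate is just the product below; the 
numbers are the illustrative ones from this example, not measurements:)

   long uniqueValuesPerField = 250_000;   // unique String values per facet field
   long bytesPerCounter      = 4;         // one int counter per unique value
   long fields               = 15;
   long collections          = 100;       // if the collections are faceted concurrently
   long tempOverheadBytes = uniqueValuesPerField * bytesPerCounter * fields * collections;
   // ~1.5GB of temporary counting arrays, before any static doc->ordinal structures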

All this is very loose, but the overall message is that documents, unique facet 
values, facets and collections all multiply memory requirements.

* Do you need to query all collections at once?
* Can you collapse some of the facet fields, to reduce the total number?
* Are some of the fields very small? If so, use enum for them instead of fc/fcs.
* Maybe you can determine your limits by issuing requests first for 1 field, 
then 2, etc. This is to see if it is feasible to do a minor tweak to get it to 
work, or if your setup is so large that something else entirely needs to be done.

- Toke Eskildsen


Re: OutOfMemory on 28 docs with facet.method=fc/fcs

2014-11-18 Thread Mohsin Beg Beg


The SolrCloud has 8 billion+ docs and is increasing non-linearly each hour.
numFound=28 was for the faceting query only.

If fieldCache (Lucene caches) is the issue, would q=time:[begin time TO end 
time] be better instead?

-Mohsin



- Original Message -
From: apa...@elyograg.org
To: solr-user@lucene.apache.org
Sent: Tuesday, November 18, 2014 2:45:46 PM GMT -08:00 US/Canada Pacific
Subject: Re: OutOfMemory on 28 docs with facet.method=fc/fcs

On 11/18/2014 3:06 PM, Mohsin Beg Beg wrote:
> Looking at SimpleFacets.java, doesn't fc/fcs iterate only over the DocSet for 
> the fields. So assuming each field has a unique term across the 28 rows, a 
> max of 28 * 15 unique small strings (100bytes), should be in the order of 
> 1MB. For 100 collections, lets say a total of 1GB. Now lets say I multiply it 
> by 3 to 3GB. 

Are there 28 documents in the entire index?  It's my understanding that
the fieldcache memory required is not dependent on the number of
documents that match your query (numFound), it's dependent on the number
of documents in the entire index.

If my understanding is correct, once that memory structure is calculated
and stored in the fieldcache, it's available to speed up future facets
on that field, even if the query and filters are different than what was
used the first time.  It doesn't seem as useful for typical use cases to
store a facet cache entry that depends on the specific query.

Thanks,
Shawn


DocSet getting cached in filterCache for facet request with {!cache=false}

2014-11-11 Thread Mohsin Beg Beg

Hello,

It seems Solr is caching when faceting even with fq={!cache=false}*:* specified. 
This is what I am doing on Solr 4.10.0 on JRE 1.7.0_51.

Query 1) No cache in filterCache as expected
http://localhost:8983/solr/collection1/select?q=*:*&rows=0&fq={!cache=false}*:*
http://localhost:8983/solr/#/collection1/plugins/cache?entry=filterCache 
confirms this.

Query 2) Query result docset cached in filterCache unexpectedly ?
http://localhost:8983/solr/collection1/select?q=*:*&rows=0&fq={!cache=false}*:*&facet=true&facet.field=foobar&facet.method=enum
http://localhost:8983/solr/#/collection1/plugins/cache?entry=filterCache shows 
entry of item_*:*: org.apache.solr.search.BitDocSet@66afbbf cached.

Any suggestions on why this happens or how it may be avoided? I don't want to 
cache anything other than facet(ed) terms in the filterCache (for predictable 
heap usage).

The culprit seems to be line 1431 @ 
http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_10_2/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java?view=markup

Thanks.

-M


Re: DocSet getting cached in filterCache for facet request with {!cache=false}

2014-11-11 Thread Mohsin Beg Beg

Shawn, then how do I skip the filterCache for facet.method=enum?

Wiki says fq={!cache=false}*:* is ok, no?
https://wiki.apache.org/solr/SolrCaching#filterCache

-Mohsin


- Original Message -
From: erickerick...@gmail.com
To: solr-user@lucene.apache.org
Sent: Tuesday, November 11, 2014 8:40:54 AM GMT -08:00 US/Canada Pacific
Subject: Re: DocSet getting cached in filterCache for facet request with 
{!cache=false}

Well, the difference is that you're faceting with method=enum, which uses
the filterCache (I think, it's been a while).

I admit I'm a little surprised that when I tried faceting with the
inStock field in the standard distro I got 3 entries when there are
only two values but I'm willing to let that go ;)

i.e. this produces 3 entries in the filterCache:
http://localhost:8983/solr/techproducts/select?q=*:*&rows=0&facet=true&facet.field=inStock&facet.method=enum

not an fq clause in sight..

Best,
Erick

On Tue, Nov 11, 2014 at 9:31 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 11/11/2014 1:22 AM, Mohsin Beg Beg wrote:
> > It seems Solr is caching when facting even with fq={!cache=false}*:* 
> > specified. This is what I am doing on Solr 4.10.0 on jre 1.7.0_51.
> >
> > Query 1) No cache in filterCache as expected
> > http://localhost:8983/solr/collection1/select?q=*:*&rows=0&fq={!cache=false}*:*
> > http://localhost:8983/solr/#/collection1/plugins/cache?entry=filterCache 
> > confirms this.
> >
> > Query 2) Query result docset cached in filterCache unexpectedly ?
> > http://localhost:8983/solr/collection1/select?q=*:*&rows=0&fq={!cache=false}*:*&facet=true&facet.field=foobar&facet.method=enum
> > http://localhost:8983/solr/#/collection1/plugins/cache?entry=filterCache 
> > shows entry of item_*:*: org.apache.solr.search.BitDocSet@66afbbf cached.
> >
> > Suggestions why or how this may be avoided since I don't want to cache 
> > anything other than facet(ed) terms in the filterCache (for predictable heap 
> > usage).
>
> I hope this is just for testing, because fq=*:* is completely
> unnecessary, and will cause Solr to do extra work that it doesn't need
> to do.
>
> Try changing that second query so q and fq are not the same, so you can
> see for sure which one is producing the filterCache entry.  With the
> same query for both, you cannot know which one is populating the
> filterCache.  If it's coming from the q parameter, then it's probably
> working as designed.  If it comes from the fq, then we probably actually
> do have a problem that needs investigation.
>
> Thanks,
> Shawn