So SolrJ with CommonsHttpSolrServer will not support handling several
requests concurrently?


Nope. Use StreamingUpdateSolrServer; it should be a drop-in replacement with
a different constructor.
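Something like this is all it should take (untested sketch; class names and the queue/thread numbers are from 3.x and are just placeholders to tune, trunk may have moved things around):

    import java.net.MalformedURLException;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;

    public class IndexerSetup {
        public SolrServer createServer() throws MalformedURLException {
            // Old, one request at a time:
            // SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // Drop-in replacement: buffers documents in a queue and streams them
            // to Solr over several concurrent connections.
            // Args: URL, queue size, worker threads -- tune both for your load.
            return new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4);
        }
    }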
I will try to do that. It is a little bit difficult for me, as we are not actually dealing with Solr ourselves. We are using Lily, but I will modify Lily, compile, and see how it goes.
 Especially with trunk (4.0) and the Document Writer Per Thread stuff.
We are using trunk (4.0). Can you provide me with a little more info on this
"Document Writer Per Thread" stuff? A link or something?


I already did; follow the link I provided.
Ahh OK, I didn't get it the first time that the link below was about that:

http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/

So Jetty is not an "easy to use, but non-performant" container?


Again, test and see. Lots of commercial systems use Jetty. Consider
that you're just sending sets of documents at Solr; the container
is doing very little work. You are batching up your Solr documents,
aren't you?
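I.e., something like this rather than one add() per document (just a sketch; MyRecord and the field names are made up):

    import java.util.ArrayList;
    import java.util.Collection;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchFeeder {
        // Stand-in for whatever your source records look like.
        public interface MyRecord { String getId(); String getBody(); }

        // Send documents in batches instead of one HTTP request per doc.
        public void feed(SolrServer server, Iterable<MyRecord> records) throws Exception {
            Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (MyRecord r : records) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", r.getId());     // field names are made up
                doc.addField("text", r.getBody());
                batch.add(doc);
                if (batch.size() >= 1000) {        // batch size to tune
                    server.add(batch);             // one request, many docs
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                server.add(batch);
            }
            server.commit();                       // or rely on autoCommit
        }
    }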
Haven't looked into Lily to see whether or not documents are batched, but I will. I didn't expect Jetty to be the problem; I basically just wanted to know that it was not a "stupid" everything-in-a-single-thread container, almost designed not to perform (because the focus might be different, e.g. providing an easy-to-use/understand container for testing etc.).
Actually, right now I am trying to find out what my bottleneck is.

You should see a difference with StreamingUpdateSolrServer, assuming your
client can feed documents fast enough. You can consider having multiple
clients feed the same Solr indexer if necessary.
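The server instances in SolrJ are thread-safe, so the multiple-clients idea can be as simple as something like this (sketch; feedSlice is a made-up stand-in for your batching code):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.solr.client.solrj.SolrServer;

    public class ParallelFeeders {
        // Run several feeder threads against the same SolrServer instance.
        public void run(final SolrServer server, int feeders) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(feeders);
            for (int i = 0; i < feeders; i++) {
                final int slice = i;
                pool.submit(new Runnable() {
                    public void run() {
                        // Each thread reads its own slice of the source data
                        // and batches documents to Solr.
                        feedSlice(server, slice);
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        private void feedSlice(SolrServer server, int slice) {
            // ... build batches and server.add(batch) as above ...
        }
    }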
Thanks!

5> How high is "high performance"? On a stock Solr
    with the Wikipedia dump (11M docs), all running on
    my laptop, I see 7K docs/sec indexed. I know of
    installations that see 60 docs/sec or even less. I'm
    sending simple docs with SolrJ locally and they're
    sending huge documents over the wire that Tika
    handles. There are just so many variables it's hard
    to say anything except "try it and see"...

50 million documents need to be deleted and indexed per day. Two years of
history = 36 billion docs in store.

My off-the-top-of-my-head feeling is that this will be a LOT of hardware.
Well, it takes what it takes. Someone else will buy the hardware. My first concern is to make sure we have a system that scales, so that we can buy our way out of problems by adding more hardware. On the other hand, of course, I want to provide a system that makes the most of the hardware.
 You'll
without doubt be sharding the index. NOTE: shards are cores, just special-purpose
ones, i.e. they all use the same schema. When Solr folks see "cores",
we assume several cores that may have different schemas and handle
unrelated queries. It sounds like you're talking about a sharded system rather
than independent cores, is that so?
Yes, that is correct. We only have one single schema/config shared by all cores through ZK. So the many cores are just for sharding, because I do not expect it will work very well with 20 billion docs in the same core/shard :-)
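At query time we fan the search out over all shards with the standard "shards" parameter, roughly like this in SolrJ (host/core names made up):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ShardedQuery {
        // Distributed search: one core coordinates, and the "shards" parameter
        // lists every shard the query should fan out to.
        public QueryResponse query(SolrServer anyShard, String q) throws Exception {
            SolrQuery query = new SolrQuery(q);
            query.set("shards",
                "host1:8983/solr/shard1,host2:8983/solr/shard2,host3:8983/solr/shard3");
            return anyShard.query(query);
        }
    }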
You should have no trouble indexing 50M documents/day, even assuming that the
ingestion rate is not evenly distributed. The link I referenced talks
about indexing 10M documents in a little over 6 minutes. YMMV, however. I think
you're on the right path in trying to push a single indexer to
the max. My setup uses Jetty and is getting 5-7K docs/second, so I doubt it's
inherently a Jetty problem, although there may be configuration tweaks getting
in your way.
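Back-of-the-envelope, and assuming an even spread: 50,000,000 docs / 86,400 seconds is roughly 580 docs/sec sustained. Even if ingestion bunches up into a quarter of the day, that's only around 2.3K docs/sec, well under the 5-7K docs/sec figure above.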

Bottom line: I doubt it's a Jetty issue at this point, but I've been
wrong on too many occasions to count. I'd be looking at other places
first, though. Start with StreamingUpdateSolrServer, and also check
whether your clients can spit out documents fast enough...
I will have a look at all that. Thanks!
Best
Erick
