Right. See below.

On Mon, Feb 6, 2012 at 7:53 AM, Per Steffensen <st...@designware.dk> wrote:
> See response below
>
> Erick Erickson skrev:
>
>> Unfortunately, the answer is "it depends(tm)".
>>
>> First question: How are you indexing things? SolrJ? post.jar?
>>
>
> SolrJ, CommonsHttpSolrServer
>
>> But some observations:
>>
>> 1> sure, using multiple cores will have some parallelism. So will
>>    using a single core but using something like SolrJ and
>>    StreamingUpdateSolrServer.
>
> So SolrJ with CommonsHttpSolrServer will not support handling several
> requests concurrently?
>

Nope. Use StreamingUpdateSolrServer; it should be a drop-in replacement with
a different constructor.
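
In case it helps, here is a minimal sketch of the swap. The localhost URL,
field names, queue size and thread count are just placeholder/tuning values,
not anything from your setup:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class StreamingIndexer {
    public static void main(String[] args) throws Exception {
        // Drop-in for CommonsHttpSolrServer: adds are queued and flushed by a
        // pool of background threads instead of one blocking HTTP request per
        // add(). Queue size (100) and thread count (4) are knobs to tune.
        SolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");    // illustrative field names
        doc.addField("text", "hello");
        server.add(doc);
        server.commit();
    }
}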

>>  Especially with trunk (4.0)
>>     and the Document Writer Per Thread stuff.
>
> We are using trunk (4.0). Can you provide me with a little more info on this
> "Document Writer Per Thread stuff". A link or something?
>

I already did, follow the link I provided.

>>  In 3.x, you'll
>>     see some pauses when segments are merged that you
>>     can't get around (per core). See:
>>
>> http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/
>>     for an excellent writeup. But whether or not you use several
>>     cores should be determined by your problem space, certainly
>>     not by trying to increase the throughput. Indexing usually
>>     takes a back seat to search performance.
>>
>
> We will have few searches, but a lot of indexing.
>

Hmmm, this is the inverse of most installations, so it's good to know.

>> 2> general settings are hard to come by. If you're sending
>>      structured documents that use Tika to parse the data
>>      behind the scenes, your performance will be much
>>      different (slower) than sending SolrInputDocuments
>>     (SolrJ).
>>
>
> We are sending SolrInputDocuments
>
>> 3> The recommended servlet container is, generally,
>>      "The one you're most comfortable with". Tomcat is
>>      certainly popular. That said, use whatever you're
>>      most comfortable with until you see a performance
>>     problem. Odds are you'll find your load on Solr is
>>     at its limit before your servlet container has problems.
>>
>
> So Jetty is not an "easy to use, but non-performant" container?
>

Again, test and see. Lots of commercial systems use Jetty. Consider
that you're just sending sets of documents at Solr; the container
is doing very little work. You are batching up your Solr documents,
aren't you?
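
For reference, batching with SolrJ just means calling add() with a collection
of SolrInputDocuments rather than one document at a time. A rough sketch
(batch size, URL and field names are made up):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchFeeder {
    public static void main(String[] args) throws Exception {
        SolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 10000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);           // illustrative fields
            doc.addField("text", "document body " + i);
            batch.add(doc);

            if (batch.size() == 500) {                // arbitrary batch size
                server.add(batch);                    // one request per batch,
                batch.clear();                        // not one per document
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit();
    }
}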

>> 4> Monitor your CPU and fire more requests at it until it
>>     hits 100%. Note that there are occasions where the
>>    servlet container limits the number of outstanding
>>     requests it will allow and queues ones over that
>>     limit (find the magic setting to increase this if it's a
>>     problem, it differs by container). If you start to see
>>     your response times lengthen while the CPU is not
>>     fully utilized, that may be the cause.
>>
>
> Actually right now, I am trying to find out what my bottleneck is. The setup
> is more complex than I would bother you with, but basically I have servers
> with 80-90% IO-wait and only 5-10% "real CPU usage". It might not be a
> Solr-related problem; I am investigating different things, but I just wanted
> to know a little more about how Jetty/Solr works in order to make a
> qualified guess.

You should see this change with StreamingUpdateSolrServer, assuming your
client can feed documents fast enough. You can also consider having multiple
clients feed the same Solr indexer if necessary.
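
As far as I know, StreamingUpdateSolrServer can be shared between threads, so
"multiple clients" can also just be several feeder threads in one JVM. A rough
sketch, with thread count, URL and document contents made up:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelFeeder {
    public static void main(String[] args) throws Exception {
        // One shared streaming server; its internal queue and runner threads
        // do the actual HTTP work.
        final SolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);

        int feeders = 4;                               // arbitrary
        ExecutorService pool = Executors.newFixedThreadPool(feeders);
        for (int t = 0; t < feeders; t++) {
            final int feederId = t;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        for (int i = 0; i < 100000; i++) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", feederId + "-" + i);
                            doc.addField("text", "body " + i);
                            server.add(doc);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        server.commit();
    }
}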


>
>> 5> How high is "high performance"? On a stock solr
>>     with the Wikipedia dump (11M docs), all running on
>>     my laptop, I see 7K docs/sec indexed. I know of
>>     installations that see 60 docs/sec or even less. I'm
>>    sending simple docs with SolrJ locally and they're
>>     sending huge documents over the wire that Tika
>>     handles. There are just so many variables it's hard
>>     to say anything except "try it and see"......
>>
>
> Well, eventually we need to be able to index and delete about 50 million
> documents per day. We will need to keep a "history" of 2 years of data in
> our system, so deletion will not start before we have been in production for
> 2 years. At that point the system needs to contain 2 years * 365 days/year *
> 50 million docs/day = 36.5 billion documents, and from then on 50 million
> documents need to be deleted and indexed per day; before that we only need
> to index 50 million documents per day. We are aware that we are probably
> going to need a certain amount of hardware for this, but the most important
> thing is that we make a scalable setup so that we can reach these kinds of
> numbers at all. Right now I am focusing on getting the most out of one Solr
> instance, potentially with several cores, though.

My off-the-top-of-my-head feeling is that this will be a LOT of hardware. You'll
without doubt be sharding the index. NOTE: shards are cores, just special-purpose
ones, i.e. they all use the same schema. When Solr folks see "cores", we assume
several cores that may have different schemas and handle unrelated queries. It
sounds like you're talking about a sharded system rather than independent cores,
is that so?
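
To make the shards-vs-cores distinction concrete: in a sharded setup each shard
is an ordinary core with the same schema, and a distributed query just lists
the shard cores in the request. A hedged sketch, with made-up hosts and core
names:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardedQuery {
    public static void main(String[] args) throws Exception {
        // Hosts and core names are illustrative; every shard shares one schema.
        SolrServer anyShard =
            new CommonsHttpSolrServer("http://host1:8983/solr/shard1");

        SolrQuery q = new SolrQuery("*:*");
        // Fan the query out over all shard cores.
        q.set("shards", "host1:8983/solr/shard1,host2:8983/solr/shard2");

        QueryResponse rsp = anyShard.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}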

You should have no trouble indexing 50M documents/day, even assuming that the
ingestion rate is not evenly distributed. The link I referenced talks about
indexing 10M documents in a little over 6 minutes. YMMV, however. I think
you're on the right path in trying to push a single indexer to the max. My
setup uses Jetty and is getting 5-7K docs/second, so I doubt it's inherently a
Jetty problem, although there may be configuration tweaks getting in your way.

Bottom line: I doubt it's a Jetty issue at this point, but I've been wrong on
too many occasions to count. I'd look in other places first. Start with
StreamingUpdateSolrServer, and also check whether your clients can spit out
documents fast enough...

Best
Erick

>
>> Best
>> Erick
>>
>> On Fri, Feb 3, 2012 at 3:55 AM, Per Steffensen <st...@designware.dk>
>> wrote:
>>
>>>
>>> Hi
>>>
>>> This topic has probably been covered before, but I haven't had the luck to
>>> find the answer.
>>>
>>> We are running Solr instances with several cores inside, with Solr running
>>> out-of-the-box on top of Jetty. I believe Jetty receives all the HTTP
>>> requests about indexing new documents and forwards them to the Solr
>>> engine. What kind of parallelism does this setup provide? Can more than
>>> one index request get processed concurrently? How many? How do I increase
>>> the number of index requests that can be handled in parallel? Will I get
>>> better parallelism by running on another web container than Jetty, e.g.
>>> Tomcat? What is the recommended web container for high-performance
>>> production systems?
>>>
>>> Thanks!
>>>
>>> Regards, Per Steffensen
>>>
>>
>>
>>
>
>
