RE: Embedded about 50% faster for indexing

Sundling, Paul Mon, 27 Aug 2007 19:24:22 -0700

At this point I think I'm going to recommend against embedded, regardless
of any performance advantage.  The level of documentation is just too
low, while the XML API is clearly documented.  It's clear that XML is
preferred.

The embedded example on the wiki is pretty good, but until multiple core
support comes out in the next version, you have to use multiple
SolrCore instances.  If they are accessed in the same webapp, then you
can't just set JNDI (since you can only have one value).  So you have to
use a Config object as alluded to in the example.  However, when you look
at the code, there is no javadoc for the constructor.  The constructor
args are (String name, InputStream is, String prefix).  I think name is a
unique name for the solr core, but that is a guess.  The InputStream may
be a stream to the solr home, but it could be anything.  Prefix may be a
URI prefix.  These are all guesses without trying to read through the
code.

When I look at SolrCore, it looks like it's a singleton, so maybe I
can't even access more than one SolrCore using embedded anyway.  :(  So
I apologize for highlighting Embedded.  

Anyway it's clear how to do multiple solr cores using XML.  You just
have a different post URI for each of the different cores.  You can
easily inject that with Spring and externalize the config.  Simple and
easy.  So I concede XML is the way to go. :)
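
A minimal sketch of the "different post URI per core" idea, assuming each
core is reachable at its own update URL.  The class name SolrXmlPoster and
the URLs in the usage note are hypothetical; only the <add><doc><field>
message shape comes from the Solr XML update format.  The URL is passed
into the constructor so that it can be injected (for example by Spring)
rather than hard-coded.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper: one instance per core, each with its own update URL.
class SolrXmlPoster {
    private final String updateUrl; // injected from external config

    SolrXmlPoster(String updateUrl) {
        this.updateUrl = updateUrl;
    }

    // Escape the characters that are special in XML text and attributes.
    public static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Build an <add> message containing a single document.
    public static String buildAddXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue()))
              .append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    // POST the XML to this core's update URL and return the HTTP status code.
    public int post(String xml) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(updateUrl).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }
}
```

With this shape, two cores are just two instances, for example
new SolrXmlPoster("http://localhost:8080/solr-jobs/update") and
new SolrXmlPoster("http://localhost:8080/solr-users/update") (both URLs
made up for illustration).  A "<commit/>" message can be sent through the
same post method to make the added documents visible.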

Paul Sundling

-----Original Message-----
From: Mike Klaas [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 27, 2007 5:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Embedded about 50% faster for indexing

On 27-Aug-07, at 12:44 PM, Sundling, Paul wrote:

> Whether embedded solr should give me a performance boost or not, it
> did. :)  I'm not surprised, since it skips XML parsing.  Although you
> never know where cycles are used for sure until you profile.

It certainly is possible that XML parsing dwarfs indexing, but I'd  
expect that only to occur under very light analysis and field storage  
workloads.

> I tried doing more records per post (200) and it was actually slightly
> slower and seemed to require more memory.  This makes sense because
> you have to take up more memory for the StringBuilder to store the
> much larger XML.  For 10,000 it was much slower.  For that size I
> would need to use XML streaming or something to make it work.
>
> The solr war was on the same machine, so network overhead was only
> from using loopback.

The big question is still your connection handling strategy:  are you  
using persistent http connections?  Are you indexing with multiple threads?
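
A rough sketch of batching combined with threaded indexing, under stated
assumptions: the documents are already serialized as <doc> elements, and
the sender callback is a hypothetical stand-in for whatever actually
POSTs an <add> message to the update URL.  BatchIndexer and its method
names are made up for illustration.  If the sender uses
java.net.HttpURLConnection, connections should be kept alive and reused
by default as long as each response is read to the end, but that is worth
confirming with a profiler or a packet trace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical helper: groups documents into batches and posts them in parallel.
class BatchIndexer {

    // Split the list into consecutive batches of at most batchSize items.
    public static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            int end = Math.min(i + batchSize, docs.size());
            batches.add(new ArrayList<>(docs.subList(i, end)));
        }
        return batches;
    }

    // Wrap already-serialized <doc> elements in one <add> message.
    public static String toAddXml(List<String> docXml) {
        StringBuilder sb = new StringBuilder("<add>");
        for (String doc : docXml) {
            sb.append(doc);
        }
        return sb.append("</add>").toString();
    }

    // Send one <add> message per batch on a fixed-size thread pool.
    // Returns the number of batches sent; rethrows the first failure.
    public static int indexAll(List<String> docXml, int batchSize, int threads,
                               Consumer<String> sender) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<String> batch : partition(docXml, batchSize)) {
                pending.add(pool.submit(() -> sender.accept(toAddXml(batch))));
            }
            for (Future<?> f : pending) {
                try {
                    f.get(); // wait for the batch and surface any exception
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return pending.size();
        } finally {
            pool.shutdown();
        }
    }
}
```

Keeping the sender as a callback means the batch size and thread count
can be varied independently of the HTTP code, which makes it easier to
measure which of the three factors actually matters for a given workload.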

cheers,
-Mike

> Paul Sundling
>
> -----Original Message-----
> From: climbingrose [mailto:[EMAIL PROTECTED]
> Sent: Monday, August 27, 2007 12:22 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Embedded about 50% faster for indexing
>
>
> Haven't tried the embedded server but I think I have to agree with
> Mike.  We're currently sending 2000 job batches to SOLR server and the
> amount of time required to transfer documents over http is
> insignificant compared with the time required to index them.  So I do
> think unless you are sending documents one by one, embedded SOLR
> shouldn't give you much more of a performance boost.
>
> On 8/25/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
>>
>> On 24-Aug-07, at 2:29 PM, Wu, Daniel wrote:
>>
>>>> -----Original Message-----
>>>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of 
>>>> Yonik Seeley
>>>> Sent: Friday, August 24, 2007 2:07 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Embedded about 50% faster for indexing
>>>>
>>>> One thing I'd like to avoid is everyone trying to embed just for
>>>> performance gains. If there is really that much difference, then we
>>>> need a better way for people to get that without resorting to Java
>>>> code.
>>>>
>>>> -Yonik
>>>>
>>>
>>> Theoretically and practically, embedded solution will be faster than
>>> going through http/xml.
>>
>> This is only true if the http interface adds significant overhead to
>> the cost of indexing a document, and I don't see why this should be
>> so, as indexing is relatively heavyweight.  Setting up the connection
>> could be expensive, but this can be greatly mitigated by sending more
>> than one doc per http request, using persistent connections, and
>> threading.
>>
>> -Mike
>>
>
>
>
> --
> Regards,
>
> Cuong Hoang
