I have to agree with Shawn.  We have a SolrCloud setup with 256 shards,
~400M documents in total, with 4-way replication (so it's quite a big
setup!)  I had thought that HTTP would slow things down, so we recently
trialed a JNI approach (clients are C++) so we could call SolrJ and get the
benefits of JavaBin encoding for our indexing....

Once we had done benchmarks with both solutions, I think we saved about 1ms
per document (on average) with JNI, so it wasn't as big a gain as we were
expecting.  There are other benefits of SolrJ (zookeeper integration,
better routing, etc) and we were doing local HTTP (so it was literally just
a TCP port to localhost, no actual net traffic) but that just goes to prove
what other posters have said here.  Check whether HTTP really *is* the
bottleneck before you try to replace it!
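
For anyone wanting to run that check, here is a minimal sketch of the kind
of timing harness I mean.  It is not our actual benchmark code: the class
name IndexTiming, the method meanMillisPerDoc and the sendDoc callback are
all made up for illustration.  The idea is to plug in whatever actually
sends one document (an HTTP POST, a SolrJ add, a JNI call) and compare the
per-document averages before deciding to replace anything.

```java
import java.util.function.IntConsumer;

// Hypothetical helper, not part of Solr or SolrJ.
class IndexTiming {

    /**
     * Calls sendDoc once per document index (0 .. docCount-1) and returns
     * the mean wall-clock time per call, in milliseconds.
     */
    static double meanMillisPerDoc(int docCount, IntConsumer sendDoc) {
        if (docCount <= 0) {
            throw new IllegalArgumentException("docCount must be positive");
        }
        long start = System.nanoTime();
        for (int i = 0; i < docCount; i++) {
            sendDoc.accept(i);  // assumption: this blocks until the doc is sent
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / docCount;
    }

    public static void main(String[] args) {
        // Placeholder sender that does nothing; swap in the real transport.
        double ms = meanMillisPerDoc(1000, i -> { });
        System.out.println("mean ms per doc: " + ms);
    }
}
```

Run it once per transport with the same documents and a warmed-up JVM, and
only start worrying about HTTP if the difference is large compared with
the total indexing time per document.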


On 7 April 2014 17:05, Shawn Heisey <s...@elyograg.org> wrote:

> On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:
>
>> Do you mean to tell me that the people on this list that are indexing
>> 100s of millions of documents are doing this over http?  I have been using
>> custom Lucene code to index files, as I thought this would be faster for
>> many documents and I wanted some non-standard OCR and index fields.  Is
>> there a better way?
>>
>> To the OP: You can also use Lucene to locally index files for Solr.
>>
>
> My sharded index has 94 million docs in it.  All normal indexing and
> maintenance is done with SolrJ, over http.  Currently full rebuilds are done
> with the dataimport handler loading from MySQL, but that is legacy.  This
> is NOT a SolrCloud installation.  It is also not a replicated setup -- my
> indexing program keeps both copies up to date independently, similar to
> what happens behind the scenes with SolrCloud.
>
> The single-thread DIH is very well optimized, and is faster than what I
> have written myself -- also single-threaded.
>
> The real reason that we still use DIH for rebuilds is that I can run the
> DIH simultaneously on all shards.  A full rebuild that way takes about 5
> hours.  A SolrJ process feeding all shards with a single thread would take
> a lot longer.  Once I have time to work on it, I can make the SolrJ rebuild
> multi-threaded, and I expect it will be similar to DIH in rebuild speed.
> Hopefully I can make it faster.
>
> There is always overhead with HTTP.  On a gigabit LAN, I don't think it's
> high enough to matter.
>
> Using Lucene to index files for Solr is an option -- but that requires
> writing a custom Lucene application, and knowledge about how to turn the
> Solr schema into Lucene code.  A lot of users on this list (me included) do
> not have the skills required.  I know SolrJ reasonably well, but Lucene is
> a nut that I haven't cracked.
>
> Thanks,
> Shawn
>
>
