I have to agree with Shawn. We have a SolrCloud setup with 256 shards, ~400M documents in total, and 4-way replication (so it's quite a big setup!). I had thought that HTTP would slow things down, so we recently trialed a JNI approach (our clients are C++) so we could call SolrJ directly and get the benefits of JavaBin encoding for our indexing...
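
For anyone who hasn't used it, this is roughly the kind of SolrJ/JavaBin indexing call we ended up wrapping via JNI -- a minimal sketch against the SolrJ 4.x API, where the URL, core name, and fields are placeholders rather than our real setup:

    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    import java.io.IOException;

    public class JavaBinIndexer {
        public static void main(String[] args) throws IOException, SolrServerException {
            // Plain HTTP to a local Solr core (placeholder URL/core name).
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            // Send updates as JavaBin instead of the default XML request writer.
            server.setRequestWriter(new BinaryRequestWriter());

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("title", "example document");

            server.add(doc);
            server.commit();
            server.shutdown();
        }
    }
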
Once we had benchmarked both solutions, we found we saved only about 1ms per document (on average) with JNI, so it wasn't as big a gain as we were expecting. There are other benefits to SolrJ (ZooKeeper integration, better routing, etc.), and we were using local HTTP (so it was literally just a TCP connection to localhost, with no actual network traffic), but that just goes to prove what other posters have said here: check whether HTTP really *is* the bottleneck before you try to replace it!

On 7 April 2014 17:05, Shawn Heisey <s...@elyograg.org> wrote:

> On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:
>
>> Do you mean to tell me that the people on this list that are indexing
>> 100s of millions of documents are doing this over http? I have been using
>> custom Lucene code to index files, as I thought this would be faster for
>> many documents and I wanted some non-standard OCR and index fields. Is
>> there a better way?
>>
>> To the OP: You can also use Lucene to locally index files for Solr.
>
> My sharded index has 94 million docs in it. All normal indexing and
> maintenance is done with SolrJ, over HTTP. Currently, full rebuilds are
> done with the dataimport handler loading from MySQL, but that is legacy.
> This is NOT a SolrCloud installation. It is also not a replicated setup --
> my indexing program keeps both copies up to date independently, similar to
> what happens behind the scenes with SolrCloud.
>
> The single-threaded DIH is very well optimized, and is faster than what I
> have written myself -- also single-threaded.
>
> The real reason that we still use DIH for rebuilds is that I can run the
> DIH simultaneously on all shards. A full rebuild that way takes about 5
> hours. A SolrJ process feeding all shards with a single thread would take
> a lot longer. Once I have time to work on it, I can make the SolrJ rebuild
> multi-threaded, and I expect it will be similar to DIH in rebuild speed.
> Hopefully I can make it faster.
>
> There is always overhead with HTTP. On a gigabit LAN, I don't think it's
> high enough to matter.
>
> Using Lucene to index files for Solr is an option -- but that requires
> writing a custom Lucene application, and knowledge about how to turn the
> Solr schema into Lucene code. A lot of users on this list (me included) do
> not have the skills required. I know SolrJ reasonably well, but Lucene is
> a nut that I haven't cracked.
>
> Thanks,
> Shawn
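
P.S. For anyone curious what a multi-threaded SolrJ feed like the one Shawn describes might look like, here is a rough sketch using ConcurrentUpdateSolrServer from SolrJ 4.x, which queues documents and streams them to Solr over HTTP from several background threads. The URL, queue size, thread count, and fields are illustrative guesses, not anything from this thread:

    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    import java.io.IOException;

    public class ThreadedFeeder {
        public static void main(String[] args) throws IOException, SolrServerException {
            // 10,000-document buffer, 4 sender threads (placeholder values).
            ConcurrentUpdateSolrServer server =
                    new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 10000, 4);

            for (int i = 0; i < 1000000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("title", "document " + i);
                server.add(doc); // returns quickly; sending happens on background threads
            }

            server.blockUntilFinished(); // wait for the queue to drain
            server.commit();
            server.shutdown();
        }
    }
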