Thanks everybody for the information.

Shawn, thanks for bringing up the issues around making sure each document
is indexed ok.  With our current architecture, that is important for us.

Yonik's clarification about streaming really helped me to understand one of
the main advantages of CUSS:

>>When you add a document, it immediately writes it to a stream where
solr can read it off and index it.  When you add a second document,
it's immediately written to the same stream (or at least one of the
open streams), as part of the same update request.  No separate HTTP
request, No separate update request.

In our use case, where documents are in the 700KB-2MB range, I suspect that
the overhead of opening/closing new requests is dwarfed by the time it
takes to just send the data over the wire and parse it. However,
I'm starting to think about whether I can find some time to do some testing.

Mikhail, thanks for suggesting looking at DIH. I haven't looked at it in
several years and didn't realize there is now functionality to deal with
XML documents.

When I asked about being able to read XML files from the filesystem, it was
for the purposes of running some benchmark tests to see if CUSS offers
enough advantages to re-architect our system.

Currently the main bottleneck in our system is constructing Solr documents.
We use multiple "document producers" which are responsible both for
creating a document and for sending it to Solr.  Although each producer
waits until it gets a response from Solr before sending the next document
to be indexed, we run 20-100 producers, so this is similar to CUSS running
multiple threads (although of course we open a new HTTP request and Solr
update request each time).

As far as using DIH or something like it, we might be able to use it for
testing with already created documents.

Creating the documents requires assembling (and massaging) data from
several sources including a few database queries, unzipping files on our
filesystem and concatenating them, and querying another Solr instance which
has metadata.

I'm now thinking that for testing purposes it might be sufficient to
construct dummy documents as in the examples rather than trying to use our
actual documents.  If the speed improvements look significant enough, then
I'd need to figure out how to test with real documents.
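
As a rough idea of what I mean by dummy documents, here is a minimal sketch
that pads a single text field until the Solr XML add message reaches roughly
the size of our real documents. The class name, the field names ("id" and
"ocr") and the filler text are all placeholders I made up for illustration,
not our actual schema, and the id is not XML-escaped.

```java
import java.nio.charset.StandardCharsets;

final class DummyDocs {
    // Hypothetical filler; ASCII only, so character count equals UTF-8 byte count.
    private static final String FILLER = "lorem ipsum dolor sit amet ";

    // Builds a Solr XML <add> message holding one document whose "ocr" field
    // (a made-up field name) is padded until the whole message is at least
    // targetBytes long. The id is assumed to need no XML escaping.
    static String build(String id, int targetBytes) {
        String head = "<add><doc><field name=\"id\">" + id
                + "</field><field name=\"ocr\">";
        String tail = "</field></doc></add>";
        StringBuilder sb = new StringBuilder(head);
        while (sb.length() + tail.length() < targetBytes) {
            sb.append(FILLER);
        }
        return sb.append(tail).toString();
    }

    public static void main(String[] args) {
        // The two ends of our document size range: 700KB and 2MB.
        for (int size : new int[] {700 * 1024, 2 * 1024 * 1024}) {
            String xml = build("dummy-" + size, size);
            int bytes = xml.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(size + " requested, " + bytes + " built");
        }
    }
}
```

Repeated filler will compress and tokenize very differently from real text,
so numbers from documents like these would only tell me about request and
transfer overhead, not about analysis cost.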

Thanks again for all the input.

Tom
