To sum it up, there is no way to do bulk loading in Solr, because the order of operations is not preserved. Solr can only support bulk loading if your data is really unique, right?
By the way, the queue used is java.util.concurrent.BlockingQueue.
Changing that to ArrayBlockingQueue (to force FIFO) would not really help,
I guess, because the bottleneck is not reading the content from the
filesystem, but analyzing and indexing.

Any other options for bulk loading?

You say "If there are at least three threads in the concurrent client...",
but two threads would work?

How are other users doing bulk loading with archived backups while
preserving the order?

Can't believe that I'm the only one on earth having this need.

Regards
Bernd

On 11.01.2018 at 08:53, Shawn Heisey wrote:
> On 1/11/2018 12:05 AM, Bernd Fehling wrote:
>> This will never pass a Jepsen test and I call it _NOT_ thread safe.
>>
>> I haven't looked into the code yet, to see if the queue is FIFO, otherwise
>> this would be stupid.
>
> I was not thinking about order of operations when I said that the client
> was threadsafe. I meant that one client object can be used simultaneously
> by multiple threads without anything getting cross-contaminated within
> the program.
>
> If you are absolutely reliant on operations happening in a precise order,
> such that a document could get indexed in one request and then replaced
> (or updated) with a later request, you should not use the concurrent
> client. You could define it with a single thread, but if you do that,
> then the concurrent client doesn't work any faster than the standard
> client.
>
> When a concurrent client is built, it creates the specified number of
> processing threads. When updates are sent, they are added to an internal
> queue. The processing threads will handle requests from the queue as long
> as the queue is not empty.
>
> Those threads will process the requests they have been assigned
> simultaneously. Although I'm sure that each thread pulls requests off the
> queue in a FIFO manner, I have a scenario for you to consider. This
> scenario is not just an intellectual exercise, it is the kind of thing
> that can easily happen in the wild.
>
> Let's say that when document X is initially indexed, it is at position
> 997 in a batch of 1000 documents. Then two update requests later, the new
> version of document X is at position 2 in another batch of 1000 documents.
>
> If there are at least three threads in the concurrent client, those
> update requests may begin execution at nearly the same time. In that
> situation, Solr is likely to index document X in the request added later
> before it indexes document X in the request added earlier, resulting in
> outdated information ending up in the index.
>
> The same thing can happen even with a non-concurrent client when it is
> used in a multi-threaded manner.
>
> Preserving order of operations cannot be guaranteed if there are multiple
> threads. It could be possible to add some VERY sophisticated
> synchronization capabilities, but writing code to do that would be very
> difficult, and it wouldn't be trivial to use either.
>
> Thanks,
> Shawn
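
The document X scenario from the quoted message can be replayed with a small, self-contained simulation. This is only a sketch under stated assumptions, not Solr code: it uses no SolrJ classes, it assumes every document costs exactly one time unit to index, and it assumes an idle worker always takes the next request from the queue in FIFO order. The names QueueOrderSketch, simulate, batch and scenario are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of "FIFO queue + N worker threads"; not part of SolrJ.
final class QueueOrderSketch {

    /** One document version: an id plus a version number (higher = newer). */
    static final class Doc {
        final String id;
        final int version;
        Doc(String id, int version) { this.id = id; this.version = version; }
    }

    /** A write reaching the index: when it happened and which request it came from. */
    static final class Write {
        final long time;
        final int batchIndex;
        final Doc doc;
        Write(long time, int batchIndex, Doc doc) {
            this.time = time; this.batchIndex = batchIndex; this.doc = doc;
        }
    }

    /** Returns the final id -> version map after all batches have been "indexed". */
    static Map<String, Integer> simulate(int threads, List<List<Doc>> batches) {
        long[] freeAt = new long[threads];           // time at which each worker is idle
        List<Write> writes = new ArrayList<>();

        for (int b = 0; b < batches.size(); b++) {   // requests leave the queue in FIFO order
            int worker = 0;                          // the earliest idle worker takes the request
            for (int t = 1; t < threads; t++) {
                if (freeAt[t] < freeAt[worker]) worker = t;
            }
            long clock = freeAt[worker];
            for (Doc doc : batches.get(b)) {
                clock++;                             // assumption: one time unit per document
                writes.add(new Write(clock, b, doc));
            }
            freeAt[worker] = clock;
        }

        // The index applies writes in time order (ties: queue order); last write wins.
        writes.sort(Comparator.comparingLong((Write w) -> w.time)
                              .thenComparingInt(w -> w.batchIndex));
        Map<String, Integer> index = new HashMap<>();
        for (Write w : writes) index.put(w.doc.id, w.doc.version);
        return index;
    }

    /** Builds a batch of filler documents with document X at xPosition (1-based), if in range. */
    static List<Doc> batch(int batchNo, int size, int xPosition, int xVersion) {
        List<Doc> docs = new ArrayList<>();
        for (int pos = 1; pos <= size; pos++) {
            docs.add(pos == xPosition ? new Doc("X", xVersion)
                                      : new Doc("b" + batchNo + "-" + pos, 1));
        }
        return docs;
    }

    /** X version 1 at position 997 of request 1, X version 2 at position 2 of request 3. */
    static List<List<Doc>> scenario() {
        return Arrays.asList(batch(1, 1000, 997, 1),
                             batch(2, 1000, -1, 0),   // no document X in this request
                             batch(3, 1000, 2, 2));
    }

    public static void main(String[] args) {
        for (int threads = 1; threads <= 3; threads++) {
            System.out.println(threads + " thread(s): X ends at version "
                    + simulate(threads, scenario()).get("X"));
        }
    }
}
```

In this idealized model, one and two threads leave X at version 2, while three threads leave it at the stale version 1. The two-thread result holds only because the two versions are two requests apart and every document costs the same. If the two versions sit in adjacent requests, two threads are already enough to end up with the stale version, which matches the statement that order cannot be guaranteed once there are multiple threads.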