What are your commit settings? Solr keeps certain in-memory structures between commits, so it’s important to commit periodically. Say every 60 seconds as a straw-man proposal (and openSearcher should be set to true or soft commits should be enabled).
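As a concrete straw man, that 60-second policy might look like this in solrconfig.xml — the intervals are illustrative, not recommendations; this shows the soft-commit variant (hard commits with openSearcher=false plus soft commits for visibility):

```xml
<!-- solrconfig.xml: illustrative 60-second commit policy, tune to your load -->
<autoCommit>
  <maxTime>60000</maxTime>            <!-- hard commit every 60s: flushes the tlog, frees memory -->
  <openSearcher>false</openSearcher>  <!-- don't open a new searcher on hard commits... -->
</autoCommit>
<autoSoftCommit>
  <maxTime>60000</maxTime>            <!-- ...soft commits make new docs searchable instead -->
</autoSoftCommit>
```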
When firing a zillion docs at Solr, it’s also best that your commits (both hard and soft) aren’t happening too frequently, thus my 60-second proposal. The commit on the command you send will be executed after the last doc is sent, so it’s irrelevant to the above.

Apart from that, every time you commit during indexing, background merges are kicked off, and there’s a limited number of merge threads allowed to run concurrently. When that max is reached, the next update is queued until one of the threads is free. So you _may_ be hitting a simple timeout that’s showing up as a 400 error, which is something of a catch-all return code. If this is the case, just lengthening the timeouts might fix the issue.

Are you sending the documents to the leader? That’ll make the process simpler, since docs received by followers are simply forwarded to the leader. That shouldn’t really matter, just a side note. Not all that helpful, I know.

Does the failure happen in the same place? I.e., is it possible that a particular doc is making this happen? Unlikely, but worth asking. One bad doc shouldn’t stop the whole process, but it’d be a clue if it did.

If you’re particularly interested in performance, you should consider indexing to a leader-only collection, either by deleting the followers or shutting down their Solr instances. There’s a performance penalty due to forwarding the docs (talking NRT replicas here) that can be quite substantial. When you turn the Solr instances back on (or ADDREPLICA), they’ll sync back up.

Finally, I mistrust just sending a large amount of data via HTTP, because there’s not much you can do except hope it all works. If this is a recurring process, I’d seriously consider writing a SolrJ program that parses the CSV file and sends it to Solr.

Best,
Erick

> On Feb 2, 2020, at 9:32 AM, Joseph Lorenzini <jalo...@gmail.com> wrote:
>
> Hi all,
>
> I have a three-node SolrCloud cluster. The collection has a single shard.
> I am importing a 140 GB CSV file into Solr using curl, with a URL that looks
> roughly like this. I am streaming the file from disk for performance
> reasons.
>
> http://localhost:8983/solr/example/update?separator=%09&stream.file=/tmp/input.tsv&stream.contentType=text/csv;charset=utf-8&commit=true&f.note.split=true&f.note.separator=%7C
>
> There are 139 million records in that file. I am able to import about
> 800,000 records into Solr, at which point Solr hangs and then, several
> minutes later, returns a 400 Bad Request back to curl. I looked in the logs
> and found a handful of exceptions (e.g. invalid date, docValues field is
> too large, etc.) for particular records, but nothing that would explain why
> the processing stalled and failed.
>
> My expectation is that if Solr encounters a record it cannot ingest, it
> will throw an exception for that particular record and continue processing
> the next record. Is that how the importing works, or do all records need to
> be valid? If invalid records should not abort the process, then does anyone
> have any idea what might be going on here?
>
> Thanks,
> Joe
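To sketch the batching approach Erick suggests above: rather than streaming the whole 140 GB file in one request, a small client can read the TSV and POST fixed-size batches to the update handler, so a single bad batch fails on its own and can be logged and retried instead of sinking the entire run. This sketch uses plain java.net.http instead of SolrJ to stay dependency-free; the URL, batch size, and file path are illustrative assumptions, and a real SolrJ version would use a SolrClient with SolrInputDocuments instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class BatchedCsvIndexer {

    // Illustrative endpoint: commit=false, relying on autoCommit/autoSoftCommit
    // in solrconfig.xml per the 60-second proposal above. Adjust to your setup.
    static final String UPDATE_URL =
        "http://localhost:8983/solr/example/update?separator=%09&commit=false";
    static final int BATCH_SIZE = 10_000; // illustrative; tune to your doc size

    // Split data rows into fixed-size batches, each re-prefixed with the
    // header row so Solr can map columns on every request independently.
    static List<String> toBatches(String header, List<String> rows, int batchSize) {
        List<String> batches = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += batchSize) {
            List<String> slice = rows.subList(i, Math.min(i + batchSize, rows.size()));
            batches.add(header + "\n" + String.join("\n", slice));
        }
        return batches;
    }

    static void send(HttpClient client, String body) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(UPDATE_URL))
            .header("Content-Type", "text/csv;charset=utf-8")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            // A failed batch is small enough to log and retry or skip,
            // instead of losing the whole 140 GB run to one 400.
            System.err.println("Batch failed: " + response.statusCode() + " " + response.body());
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        try (BufferedReader reader = Files.newBufferedReader(Path.of("/tmp/input.tsv"))) {
            String header = reader.readLine(); // first line holds the column names
            List<String> buffer = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == BATCH_SIZE) {
                    send(client, header + "\n" + String.join("\n", buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                send(client, header + "\n" + String.join("\n", buffer)); // trailing partial batch
            }
        }
    }
}
```

With per-batch responses you also get a natural place to record which slice of the file triggered an error, which would answer the "does the failure happen in the same place?" question above.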