What are your commit settings? Solr keeps certain in-memory structures
between commits, so it’s important to commit periodically. Say every 60
seconds as a straw-man proposal, with either openSearcher set to true on
the hard commit or soft commits enabled so the documents actually become
searchable.

When firing a zillion docs at Solr, it’s also best that your commits (both hard
and soft) aren’t happening too frequently, hence the 60-second proposal.
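For concreteness, the straw-man above would look something like this in
solrconfig.xml (treat the 60-second values as a starting point, not numbers
tuned for your data):

  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit every 60 seconds, flushes to disk -->
    <openSearcher>false</openSearcher>  <!-- visibility is handled by the soft commit below -->
  </autoCommit>

  <autoSoftCommit>
    <maxTime>60000</maxTime>            <!-- soft commit makes newly indexed docs searchable -->
  </autoSoftCommit>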

The commit=true on the URL you’re sending is only executed after the last doc
is sent, so it doesn’t help with any of the above.

Apart from that, background merges are kicked off every time you commit
while indexing, and there’s a limited number of threads that are allowed to
run them concurrently. When that maximum is reached, the next update is
queued until one of the threads frees up. So you _may_ be hitting a simple
timeout that’s showing up as a 400 error, which is something of a
catch-all return code. If that’s the case, just lengthening the timeouts
might fix the issue.
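If it does turn out to be a timeout, one place to look is the <solrcloud>
section of solr.xml, which governs the leader-to-replica update timeouts.
The values below are only illustrative, and whether these are the timeouts
you’re actually hitting depends on where the 400 is coming from:

  <solrcloud>
    <!-- other solrcloud settings (zkHost, hostPort, etc.) go here -->
    <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
    <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
  </solrcloud>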

Are you sending the documents to the leader? That makes the process slightly
simpler, since docs received by followers are simply forwarded to the leader
anyway (an extra hop). It shouldn’t really matter, though; just a side note.

Not all that helpful, I know. Does the failure happen in the same place each
time? I.e., is it possible that a particular doc is causing this? Unlikely, but
worth asking. One bad doc shouldn’t stop the whole process, but it would be
a clue if it did.

If you’re particularly interested in indexing performance, you should consider
indexing to a leader-only collection, either by deleting the followers or by
shutting down the Solr instances that host them. There’s a performance
penalty from forwarding the docs to the followers (talking NRT replicas here)
that can be quite substantial. When you turn those Solr instances back on
(or ADDREPLICA), they’ll sync back up.
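Adding a replica back is just a Collections API call along these lines (the
collection name is from your setup, the shard name assumes the default
shard1, and the node value is a placeholder you’d replace with one of your
actual nodes):

  http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=example&shard=shard1&node=some_host:8983_solr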

Finally, I mistrust sending that much data in a single HTTP request, simply
because there’s not much you can do except hope it all works. If this is a
recurring process, I’d seriously consider writing a SolrJ program that parses
the CSV file and sends the documents to Solr.
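
A rough sketch of what I have in mind is below. It assumes SolrJ 8.x, a
ZooKeeper at localhost:2181, and made-up field mappings; the batch size and
error handling would need tuning for your data, and you’d probably want to
retry a failed batch doc-by-doc rather than skip it:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class TsvIndexer {
  public static void main(String[] args) throws Exception {
    // Connect via ZooKeeper so updates are routed straight to the leader.
    try (SolrClient client = new CloudSolrClient.Builder(
             Collections.singletonList("localhost:2181"), Optional.empty()).build();
         BufferedReader in = new BufferedReader(new FileReader("/tmp/input.tsv"))) {

      List<SolrInputDocument> batch = new ArrayList<>();
      String line;
      while ((line = in.readLine()) != null) {
        // Skip or map a header line here if your file has one.
        String[] cols = line.split("\t", -1);
        if (cols.length < 2) continue;                              // guard against short lines

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", cols[0]);                                // field mapping is a guess
        doc.addField("note", Arrays.asList(cols[1].split("\\|")));  // multi-valued, '|'-separated
        batch.add(doc);

        if (batch.size() >= 1000) {                                 // send in modest batches
          sendBatch(client, batch);
        }
      }
      sendBatch(client, batch);                                     // flush the final partial batch
      client.commit("example");                                     // one explicit commit at the end
    }
  }

  private static void sendBatch(SolrClient client, List<SolrInputDocument> batch) {
    if (batch.isEmpty()) return;
    try {
      client.add("example", batch);           // one bad batch doesn’t have to stop the whole run
    } catch (Exception e) {
      System.err.println("Batch failed, skipping: " + e.getMessage());
    } finally {
      batch.clear();
    }
  }
}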

Best,
Erick



> On Feb 2, 2020, at 9:32 AM, Joseph Lorenzini <jalo...@gmail.com> wrote:
> 
> Hi all,
> 
> I have three node solr cloud cluster. The collection has a single shard. I
> am importing 140 GB CSV file into solr using curl with a URL that looks
> roughly like this. I am streaming the file from disk for performance
> reasons.
> 
> http://localhost:8983/solr/example/update?separator=%09&stream.file=/tmp/input.tsv&stream.contentType=text/csv;charset=utf-8&commit=true&f.note.split=true&f.note.separator=%7C
> 
> There are 139 million records in that file. I am able to import about
> 800,000 records into solr at which point solr hangs and then several
> minutes later returns a 400 bad request back to curl. I looked in the logs
> and I did find a handful of exceptions (e.g invalid date, docvalues field
> is too large etc) for particular records but nothing that would explain why
> the processing stalled and failed.
> 
> My expectation is that if solr encounters a record it cannot ingest, it
> will throw an exception for that particular record and continue processing
> the next record. Is that how the importing works or do all records need to
> be valid? If invalid records should not abort the process, then does anyone
> have any idea what might be going on here?
> 
> Thanks,
> Joe
