Hi Eric,

Thanks for the help.

For commit settings, I assume you are referring to
https://lucene.apache.org/solr/guide/8_3/updatehandlers-in-solrconfig.html.
If so, yes, I have soft commits on. According to the docs, openSearcher is
on by default. Here are the settings:

        <autoCommit>
            <maxTime>600000</maxTime>
            <maxDocs>180000</maxDocs>
        </autoCommit>
        <autoSoftCommit>
            <maxTime>1000</maxTime>
            <maxDocs>10000</maxDocs>
        </autoSoftCommit>


Please note, I am actually streaming a file from disk -- I am not sending
the data via curl. curl is merely telling Solr which local file to read from.

So I turned off two Solr nodes, leaving a single node up. When I ran curl
again, I noticed the import aborted with this exception:

Error adding field 'primary_dob'='1983-12-21T00:00:00Z' msg=Invalid Date in
Date Math String:'1983-12-21T00:00:00Z
caused by: java.time.format.DateTimeParseException: Text
'1983-12-21T00:00:00Z' could not be parsed at index 0'

This field is a DatePointField. I've verified that if I remove the records
whose DatePointField has parsing problems, the upload proceeds further ...
until it hits another record with a similar problem. I was surprised that a
single record with an invalid DatePointField would abort the whole process,
but that does seem to be what's happening.

That would be easy enough to fix if I knew why the value is failing to
parse. The date certainly looks valid to me based on this documentation:

http://lucene.apache.org/solr/7_2_1/solr-core/org/apache/solr/schema/DatePointField.html
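
As a sanity check, here is a throwaway java.time snippet (just a sketch --
this is not Solr's DateMathParser, so it only rules out the obvious) that I
can point at the raw value pulled straight from the TSV, to confirm it
really is a clean ISO-8601 instant with nothing hidden in front of it:

    import java.time.Instant;
    import java.time.format.DateTimeParseException;

    public class DateCheck {
        public static void main(String[] args) {
            // Value copied from the error message; pass the raw field value
            // from the TSV as an argument to check for hidden characters
            // (BOM, stray quote, non-breaking space, etc.).
            String value = args.length > 0 ? args[0] : "1983-12-21T00:00:00Z";
            for (int i = 0; i < value.length(); i++) {
                System.out.printf("index %d: U+%04X%n", i, (int) value.charAt(i));
            }
            try {
                // Instant.parse expects a strict ISO-8601 instant like the one above
                System.out.println("parsed OK: " + Instant.parse(value));
            } catch (DateTimeParseException e) {
                System.out.println("failed: " + e.getMessage());
            }
        }
    }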

Any ideas on why that won't parse?
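
Separately, on your SolrJ suggestion: if I do end up replacing the curl
upload, is this roughly the shape you had in mind? Just a rough sketch on
my end -- the ZooKeeper address is a placeholder, and the TSV parsing and
column-to-field mapping are simplified:

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Optional;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class TsvLoader {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                         Collections.singletonList("localhost:2181"), Optional.empty()).build();
                 BufferedReader reader = Files.newBufferedReader(Paths.get("/tmp/input.tsv"))) {
                client.setDefaultCollection("example");
                List<SolrInputDocument> batch = new ArrayList<>();
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] cols = line.split("\t", -1);
                    SolrInputDocument doc = new SolrInputDocument();
                    // column-to-field mapping would go here, e.g.
                    // doc.addField("id", cols[0]);
                    // doc.addField("primary_dob", cols[1]);
                    batch.add(doc);
                    if (batch.size() >= 10000) {
                        // if one bad doc makes this batch fail, I can log it and
                        // re-send the batch one doc at a time to isolate the record
                        client.add(batch);
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    client.add(batch);
                }
                client.commit();
            }
        }
    }

The batching is mostly so that a single bad record would only cost me one
batch instead of the whole run.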

Thanks,
Joe


On Sun, Feb 2, 2020 at 8:51 AM Erick Erickson <erickerick...@gmail.com>
wrote:

> What are your commit settings? Solr keeps certain in-memory structures
> between commits, so it’s important to commit periodically. Say every 60
> seconds as a straw-man proposal (and openSearcher should be set to
> true or soft commits should be enabled).
>
> When firing a zillion docs at Solr, it’s also best that your commits (both
> hard
> and soft) aren’t happening too frequently, thus my 60 second proposal.
>
> The commit on the command you send will be executed after the last doc
> is sent, so it’s irrelevant to the above.
>
> Apart from that, when indexing every time you do commit, background
> merges are kicked off and there’s a limited number of threads that are
> allowed to run concurrently. When that max is reached the next update is
> queued until one of the threads is free. So you _may_ be hitting a simple
> timeout that’s showing up as a 400 error, which is something of a
> catch-all return code. If this is the case, just lengthening the timeouts
> might fix the issue.
>
> Are you sending the documents to the leader? That’ll make the process
> simpler since docs received by followers are simply forwarded to the
> leader. That shouldn’t really matter, just a side-note.
>
> Not all that helpful I know. Does the failure happen in the same place?
> I.e.
> is it possible that a particular doc is making this happen? Unlikely, but
> worth
> asking. One bad doc shouldn’t stop the whole process, but it’d be a clue
> if there was.
>
> If you’re particularly interested in performance, you should consider
> indexing to a leader-only collection, either by deleting the followers or
> shutting down the Solr instances. There’s a performance penalty due to
> forwarding the docs (talking NRT replicas here) that can be quite
> substantial. When you turn the Solr instances back on (or ADDREPLICA),
> they’ll sync back up.
>
> Finally, I mistrust just sending a large amount of data via HTTP, just
> because
> there’s not much you can do except hope it all works. If this is a
> recurring
> process I’d seriously consider writing a SolrJ program that parsed the
> csv file and sent it to Solr.
>
> Best,
> Erick
>
>
>
> > On Feb 2, 2020, at 9:32 AM, Joseph Lorenzini <jalo...@gmail.com> wrote:
> >
> > Hi all,
> >
> > I have three node solr cloud cluster. The collection has a single shard.
> I
> > am importing 140 GB CSV file into solr using curl with a URL that looks
> > roughly like this. I am streaming the file from disk for performance
> > reasons.
> >
> >
> http://localhost:8983/solr/example/update?separator=%09&stream.file=/tmp/input.tsv&stream.contentType=text/csv;charset=utf-8&commit=true&f.note.split=true&f.note.separator=%7C
> >
> > There are 139 million records in that file. I am able to import about
> > 800,000 records into solr at which point solr hangs and then several
> > minutes later returns a 400 bad request back to curl. I looked in the
> logs
> > and I did find a handful of exceptions (e.g invalid date, docvalues field
> > is too large etc) for particular records but nothing that would explain
> why
> > the processing stalled and failed.
> >
> > My expectation is that if solr encounters a record it cannot ingest, it
> > will throw an exception for that particular record and continue
> processing
> > the next record. Is that how the importing works or do all records need
> to
> > be valid? If invalid records should not abort the process, then does
> anyone
> > have any idea what might be going on here?
> >
> > Thanks,
> > Joe
>
>
