Hi Erick,

Thanks for the help.
For commit settings, you are referring to
https://lucene.apache.org/solr/guide/8_3/updatehandlers-in-solrconfig.html,
right? If so, yes, I have soft commits on, and according to the docs
openSearcher is turned on by default. Here are the settings:

<autoCommit>
  <maxTime>600000</maxTime>
  <maxDocs>180000</maxDocs>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
  <maxDocs>10000</maxDocs>
</autoSoftCommit>

Please note, I am actually streaming a file from disk -- I am not sending
the data via curl. curl is merely telling Solr which local file to read from.

So I turned off two Solr nodes, leaving a single Solr node up. When I ran
curl again, I noticed the import aborted with this exception:

Error adding field 'primary_dob'='1983-12-21T00:00:00Z',
msg=Invalid Date in Date Math String:'1983-12-21T00:00:00Z'
caused by: java.time.format.DateTimeParseException:
Text '1983-12-21T00:00:00Z' could not be parsed at index 0

This field is a DatePointField. I've verified that if I remove the records
whose DatePointField fails to parse, the upload proceeds further... until
it hits another record with a similar problem. I was surprised that a
single record with an invalid DatePointField would abort the whole process,
but that does seem to be what's happening.

So that's easy enough to fix, if only I knew why the text was failing to
parse. The date certainly looks valid to me based on this documentation:
http://lucene.apache.org/solr/7_2_1/solr-core/org/apache/solr/schema/DatePointField.html

Any ideas on why that won't parse?
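For what it's worth, the literal string from the log parses cleanly with
java.time (which is where the exception originates), so my working guess --
and it is only a guess -- is that the failing values carry something
invisible such as a BOM, stray quotes, or whitespace, rather than a
genuinely bad format. A quick standalone check along these lines shows the
idea (the BOM is just a hypothetical example that reproduces the "index 0"
failure):

import java.time.Instant;
import java.time.format.DateTimeParseException;

public class DateCheck {
    public static void main(String[] args) {
        // The literal value from the log parses fine, so the format itself is OK.
        System.out.println(Instant.parse("1983-12-21T00:00:00Z"));

        // Hypothetical: the same value with a leading UTF-8 BOM fails
        // "at index 0", matching the exception Solr reported.
        String suspect = "\uFEFF1983-12-21T00:00:00Z";
        try {
            Instant.parse(suspect);
        } catch (DateTimeParseException e) {
            System.out.println(e.getMessage());
        }

        // Dump code points to spot invisible characters in the raw CSV value.
        suspect.chars().forEach(c -> System.out.printf("%04x ", c));
    }
}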
Thanks,
Joe

On Sun, Feb 2, 2020 at 8:51 AM Erick Erickson <erickerick...@gmail.com> wrote:

> What are your commit settings? Solr keeps certain in-memory structures
> between commits, so it’s important to commit periodically. Say every 60
> seconds as a straw-man proposal (and openSearcher should be set to
> true, or soft commits should be enabled).
>
> When firing a zillion docs at Solr, it’s also best that your commits
> (both hard and soft) aren’t happening too frequently, thus my 60-second
> proposal.
>
> The commit on the command you send will be executed after the last doc
> is sent, so it’s irrelevant to the above.
>
> Apart from that, every time you commit while indexing, background merges
> are kicked off, and there’s a limited number of threads that are allowed
> to run concurrently. When that max is reached, the next update is queued
> until one of the threads is free. So you _may_ be hitting a simple
> timeout that’s showing up as a 400 error, which is something of a
> catch-all return code. If this is the case, just lengthening the timeouts
> might fix the issue.
>
> Are you sending the documents to the leader? That’ll make the process
> simpler, since docs received by followers are simply forwarded to the
> leader. That shouldn’t really matter; just a side note.
>
> Not all that helpful, I know. Does the failure happen in the same place?
> I.e., is it possible that a particular doc is making this happen?
> Unlikely, but worth asking. One bad doc shouldn’t stop the whole process,
> but it’d be a clue if it did.
>
> If you’re particularly interested in performance, you should consider
> indexing to a leader-only collection, either by deleting the followers or
> shutting down the Solr instances. There’s a performance penalty due to
> forwarding the docs (talking NRT replicas here) that can be quite
> substantial. When you turn the Solr instances back on (or ADDREPLICA),
> they’ll sync back up.
>
> Finally, I mistrust just sending a large amount of data via HTTP, because
> there’s not much you can do except hope it all works. If this is a
> recurring process, I’d seriously consider writing a SolrJ program that
> parses the CSV file and sends it to Solr.
>
> Best,
> Erick
>
> > On Feb 2, 2020, at 9:32 AM, Joseph Lorenzini <jalo...@gmail.com> wrote:
> >
> > Hi all,
> >
> > I have a three-node SolrCloud cluster. The collection has a single
> > shard. I am importing a 140 GB CSV file into Solr using curl, with a
> > URL that looks roughly like this (I am streaming the file from disk
> > for performance reasons):
> >
> > http://localhost:8983/solr/example/update?separator=%09&stream.file=/tmp/input.tsv&stream.contentType=text/csv;charset=utf-8&commit=true&f.note.split=true&f.note.separator=%7C
> >
> > There are 139 million records in that file. I am able to import about
> > 800,000 records into Solr, at which point Solr hangs and then, several
> > minutes later, returns a 400 Bad Request back to curl. I looked in the
> > logs and did find a handful of exceptions (e.g., invalid date,
> > docvalues field is too large, etc.) for particular records, but nothing
> > that would explain why the processing stalled and failed.
> >
> > My expectation is that if Solr encounters a record it cannot ingest, it
> > will throw an exception for that particular record and continue
> > processing the next record. Is that how importing works, or do all
> > records need to be valid? If invalid records should not abort the
> > process, does anyone have any idea what might be going on here?
> >
> > Thanks,
> > Joe
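P.S. On the SolrJ suggestion: if the raw HTTP stream keeps failing, a
minimal sketch along these lines is roughly what I'd try (the ZooKeeper
address, collection name, column layout, and batch size below are all
illustrative placeholders, not our real setup):

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class TsvIndexer {
    public static void main(String[] args) throws Exception {
        // ZooKeeper address and collection name are placeholders.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                 Arrays.asList("localhost:9983"), Optional.empty()).build();
             BufferedReader in = Files.newBufferedReader(Paths.get("/tmp/input.tsv"))) {

            List<SolrInputDocument> batch = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split("\t", -1);
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", cols[0]);          // placeholder columns
                doc.addField("primary_dob", cols[1]); // could pre-validate dates here
                batch.add(doc);
                if (batch.size() == 1000) {           // tune batch size
                    client.add("example", batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add("example", batch);
            }
            client.commit("example");
        }
    }
}

That would also let me validate each date client-side (e.g., with
Instant.parse) and log and skip bad rows, instead of one record aborting
the entire import.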