Re: Handling All Replicas Down in Solr 8.3 Cloud Collection

2020-02-04 Thread Joseph Lorenzini
Here's roughly what was going on:


   1. Set up a three-node cluster with a collection. The collection has one
   shard and three replicas for that shard.
   2. Shut down two of the nodes and verify that the remaining node is the
   leader; the other two nodes show as dead in the Solr UI.
   3. Bulk import several million documents into Solr from a CSV file.
   4. Shut down the remaining node.
   5. Start up all three nodes.

Even after three minutes, no leader was active. I executed the FORCELEADER
API call, which completed successfully, and waited three minutes -- still no
replica was elected leader. I then compared my Solr 8 cluster to a
different Solr cluster. I noticed that the znode
/collections/example/leaders/shard1
existed on both clusters, but in the Solr 8 cluster the znode was empty. I
manually uploaded a JSON document with the proper settings to that znode
and then called the FORCELEADER API again and waited three minutes.

A leader still wasn't elected.

Then I removed the replica on the node that I had imported all the documents
into and added the replica back in. At that point, a leader was elected.
I am not sure I have exact steps to reproduce, but I did get it working.
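
For reference, a rough sketch of the Collections API calls involved. The
collection and shard names come from this thread; the replica and node
names are placeholders:

# force a leader election for the shard
curl 'http://localhost:8983/solr/admin/collections?action=FORCELEADER&collection=example&shard=shard1'

# drop and re-add the replica on the node that held the bulk-imported data
curl 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=example&shard=shard1&replica=core_node3'
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=example&shard=shard1&node=host3:8983_solr'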

Thanks,
Joe

On Tue, Feb 4, 2020 at 7:54 AM Erick Erickson wrote:

> First, be sure to wait at least 3 minutes before concluding the replicas
> are permanently down, that’s the default wait period for certain leader
> election fallbacks. It’s easy to conclude it’s never going to recover, 180
> seconds is an eternity ;).
>
> You can try the collections API FORCELEADER command. Assuming a leader is
> elected and becomes active, you _may_ have to restart the other two Solr
> nodes.
>
> How did you stop the servers? You mention disaster recovery, so I’m
> thinking you did a “kill -9” or similar? Were you actively indexing at the
> time? Solr _should_ manage the recovery even in that case, I’m mostly
> wondering what the sequence of events that led up to this was…
>
> Best,
> Erick
>
> > On Feb 4, 2020, at 8:38 AM, Joseph Lorenzini  wrote:
> >
> > Hi all,
> >
> > I have a 3-node SolrCloud instance with a single collection. The Solr
> > nodes are pointed to a 3-node ZooKeeper ensemble. I was doing some basic
> > disaster recovery testing and have encountered a problem that hasn't
> > been obvious how to fix.
> >
> > After I started the three Solr Java processes back up, I can see that
> > they are registered again in the Solr UI. However, each replica is
> > permanently in a down state. There are no logs in either Solr or
> > ZooKeeper that indicate what the problem might be -- neither exceptions
> > nor warnings.
> >
> > So is there any way to collect more diagnostics to figure out what's
> > going on? Short of deleting and recreating the replicas, is there any
> > way to fix this?
> >
> > Thanks,
> > Joe
>
>


Handling All Replicas Down in Solr 8.3 Cloud Collection

2020-02-04 Thread Joseph Lorenzini
Hi all,

I have a 3-node SolrCloud instance with a single collection. The Solr
nodes are pointed to a 3-node ZooKeeper ensemble. I was doing some basic
disaster recovery testing and have encountered a problem that hasn't been
obvious how to fix.

After I started the three Solr Java processes back up, I can see that they
are registered again in the Solr UI. However, each replica is permanently
in a down state. There are no logs in either Solr or ZooKeeper that
indicate what the problem might be -- neither exceptions nor warnings.

So is there any way to collect more diagnostics to figure out what's going
on? Short of deleting and recreating the replicas, is there any way to fix
this?
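
One quick way to gather more state than the UI shows is the Collections API
CLUSTERSTATUS call, which reports each replica's state and leader flag as
recorded in ZooKeeper. A minimal sketch, assuming the collection is named
example:

# dump cluster state, including per-replica state and which replica leads
curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=example'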

Thanks,
Joe


Re: Importing Large CSV File into Solr Cloud Fails with 400 Bad Request

2020-02-03 Thread Joseph Lorenzini
Hi Shawn/Erick,

This information has been very helpful. Thank you.

So I did some more investigation into our ETL process, and I verified that,
with the exception of the value I sent above, the failing values are all
obviously invalid dates. For example, one field value had 00 for the day;
so for the value above, I would guess the field had a non-printable
character in it. So at least in the case of a record where a field has an
invalid date, the entire import process is aborted. I'll adjust the ETL
process to stop passing invalid dates, but this does lead me to a question
about failure modes for importing large data sets into a collection. Is
there any way to specify a "continue on failure" mode, such that Solr logs
that it was unable to parse a record and why, and then continues on to the
next record?
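
In the meantime, a quick way to flag bad rows before they reach Solr is to
validate the date column in the extract. A rough sketch, assuming a
tab-separated file with a header row where primary_dob is column 7 (the
file name and column number are placeholders):

# print row numbers whose primary_dob value is not a plausible
# yyyy-MM-ddTHH:mm:ssZ string (catches day 00, stray bytes, etc.)
awk -F'\t' '
  NR > 1 && $7 !~ /^[0-9][0-9][0-9][0-9]-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])T[0-2][0-9]:[0-5][0-9]:[0-5][0-9]Z$/ {
    print NR ": " $7
  }' input.tsv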

Thanks,
Joe

On Sun, Feb 2, 2020 at 4:46 PM Shawn Heisey  wrote:

> On 2/2/2020 8:47 AM, Joseph Lorenzini wrote:
> >  <autoSoftCommit>
> >    <maxTime>1000</maxTime>
> >    <maxDocs>1</maxDocs>
> >  </autoSoftCommit>
>
> That autoSoftCommit setting is far too aggressive, especially for bulk
> indexing.  I don't know whether it's causing the specific problem you're
> asking about here, but it's still a setting that will cause problems,
> because Solr will constantly be doing commit operations while bulk
> indexing is underway.
>
> Erick mentioned this as well.  Greatly increasing the maxTime, and
> removing maxDocs, is recommended.  I would recommend starting at one
> minute.  The maxDocs setting should be removed from autoCommit as well.
>
> > So I turned off two solr nodes, leaving a single solr node up. When I ran
> > curl again, I noticed the import aborted with this exception.
> >
> > Error adding field 'primary_dob'='1983-12-21T00:00:00Z' msg=Invalid Date
> in
> > Date Math String:'1983-12-21T00:00:00Z
> > caused by: java.time.format.DateTimeParseException: Text
> > '1983-12-21T00:00:00Z' could not be parsed at index 0'
>
> That date string looks OK.  Which MIGHT mean there are characters in it
> that are not visible.  Erick said that the single quote is balanced in
> his message, which COULD mean that the character causing the problem is
> one that deletes things when it is printed.
>
> Thanks,
> Shawn
>


Re: Importing Large CSV File into Solr Cloud Fails with 400 Bad Request

2020-02-02 Thread Joseph Lorenzini
Hi Erick,

Thanks for the help.

For commit settings, you are referring to
https://lucene.apache.org/solr/guide/8_3/updatehandlers-in-solrconfig.html.
If so, yes, I have soft commits on. According to the docs, openSearcher is
on by default. Here are the settings:


<autoCommit>
  <maxTime>60</maxTime>
  <maxDocs>18</maxDocs>
</autoCommit>

<autoSoftCommit>
  <maxTime>1000</maxTime>
  <maxDocs>1</maxDocs>
</autoSoftCommit>
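
(A rough sketch of loosening these at runtime via the Config API overlay,
along the lines of the one-minute suggestion elsewhere in this thread; the
collection name example is assumed, and the maxDocs triggers would still
need to be removed from solrconfig.xml or overridden the same way:)

# raise both commit intervals to 60 seconds
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/example/config' \
  -d '{"set-property": {"updateHandler.autoCommit.maxTime": 60000,
                        "updateHandler.autoSoftCommit.maxTime": 60000}}'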

Please note, I am actually streaming a file from disk -- I am not sending
the data via curl; curl is merely telling Solr which local file to read from.

So I turned off two Solr nodes, leaving a single Solr node up. When I ran
curl again, I noticed the import aborted with this exception:

Error adding field 'primary_dob'='1983-12-21T00:00:00Z' msg=Invalid Date in
Date Math String:'1983-12-21T00:00:00Z
caused by: java.time.format.DateTimeParseException: Text
'1983-12-21T00:00:00Z' could not be parsed at index 0'

This field is a DatePointField. I've verified that if I remove records
whose DatePointField value has parsing problems, then the upload proceeds
further until it hits another record with a similar problem. I was
surprised that a single record with an invalid DatePointField value would
abort the whole process, but that does seem to be what's happening.

So that would be easy enough to fix if I knew why the text was failing to
parse. The date certainly seems valid to me based on this documentation:

http://lucene.apache.org/solr/7_2_1/solr-core/org/apache/solr/schema/DatePointField.html

Any ideas on why that won't parse?
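
(One way to check the invisible-character theory raised elsewhere in this
thread is to look at the raw bytes of the offending rows. A quick sketch,
assuming GNU grep; the line number is a placeholder:)

# show any lines containing bytes outside printable ASCII plus tab
grep -nP '[^\x09\x20-\x7E]' /tmp/input.tsv | head

# or dump the exact bytes of one suspect line
sed -n '812345p' /tmp/input.tsv | od -c | head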

Thanks,
Joe


On Sun, Feb 2, 2020 at 8:51 AM Erick Erickson wrote:

> What are your commit settings? Solr keeps certain in-memory structures
> between commits, so it’s important to commit periodically. Say every 60
> seconds as a straw-man proposal (and openSearcher should be set to
> true or soft commits should be enabled).
>
> When firing a zillion docs at Solr, it’s also best that your commits (both
> hard
> and soft) aren’t happening too frequently, thus my 60 second proposal.
>
> The commit on the command you send will be executed after the last doc
> is sent, so it’s irrelevant to the above.
>
> Apart from that, when indexing every time you do commit, background
> merges are kicked off and there’s a limited number of threads that are
> allowed to run concurrently. When that max is reached the next update is
> queued until one of the threads is free. So you _may_ be hitting a simple
> timeout that’s showing up as a 400 error, which is something of a
> catch-all return code. If this is the case, just lengthening the timeouts
> might fix the issue.
>
> Are you sending the documents to the leader? That’ll make the process
> simpler since docs received by followers are simply forwarded to the
> leader. That shouldn’t really matter, just a side-note.
>
> Not all that helpful I know. Does the failure happen in the same place?
> I.e.
> is it possible that a particular doc is making this happen? Unlikely, but
> worth
> asking. One bad doc shouldn’t stop the whole process, but it’d be a clue
> if there was.
>
> If you’re particularly interested in performance, you should consider
> indexing to a leader-only collection, either by deleting the followers or
> shutting down the Solr instances. There’s a performance penalty due to
> forwarding the docs (talking NRT replicas here) that can be quite
> substantial. When you turn the Solr instances back on (or ADDREPLICA),
> they’ll sync back up.
>
> Finally, I mistrust just sending a large amount of data via HTTP, just
> because
> there’s not much you can do except hope it all works. If this is a
> recurring
> process I’d seriously consider writing a SolrJ program that parsed the
> csv file and sent it to Solr.
>
> Best,
> Erick
>
>
>
> > On Feb 2, 2020, at 9:32 AM, Joseph Lorenzini  wrote:
> >
> > Hi all,
> >
> > I have a three-node SolrCloud cluster. The collection has a single
> > shard. I am importing a 140 GB CSV file into Solr using curl with a URL
> > that looks roughly like this. I am streaming the file from disk for
> > performance reasons.
> >
> >
> http://localhost:8983/solr/example/update?separator=%09=/tmp/input.tsv=text/csv;charset=utf-8=true=true=%7C
> >
> > There are 139 million records in that file. I am able to import about
> > 800,000 records into Solr, at which point Solr hangs and then, several
> > minutes later, returns a 400 Bad Request back to curl. I looked in the
> > logs and did find a handful of exceptions (e.g. invalid date, docvalues
> > field is too large, etc.) for particular records, but nothing that would
> > explain why the processing stalled and failed.
> >
> > My expectation is that if Solr encounters a record it cannot ingest, it
> > will throw an exception for that particular record and continue
> > processing the next record. Is that how importing works, or do all
> > records need to be valid? If invalid records should not abort the
> > process, then does anyone have any idea what might be going on here?
> >
> > Thanks,
> > Joe
>
>


Importing Large CSV File into Solr Cloud Fails with 400 Bad Request

2020-02-02 Thread Joseph Lorenzini
Hi all,

I have a three-node SolrCloud cluster. The collection has a single shard. I
am importing a 140 GB CSV file into Solr using curl with a URL that looks
roughly like this. I am streaming the file from disk for performance
reasons.

http://localhost:8983/solr/example/update?separator=%09=/tmp/input.tsv=text/csv;charset=utf-8=true=true=%7C

There are 139 million records in that file. I am able to import about
800,000 records into Solr, at which point Solr hangs and then, several
minutes later, returns a 400 Bad Request back to curl. I looked in the logs
and did find a handful of exceptions (e.g. invalid date, docvalues field
is too large, etc.) for particular records, but nothing that would explain
why the processing stalled and failed.

My expectation is that if Solr encounters a record it cannot ingest, it
will throw an exception for that particular record and continue processing
the next record. Is that how importing works, or do all records need to
be valid? If invalid records should not abort the process, then does anyone
have any idea what might be going on here?

Thanks,
Joe


Cost of Stored=True Setting for All Fields

2020-01-28 Thread Joseph Lorenzini
Hi all,

I am in the process of migrating a Solr collection from 4 to 8. I
discovered that there was no ETL process for loading all the data into a
new collection in Solr 8, so I had to build one. For technical reasons that
aren't important here, I'd prefer this tool to be a one-off.

In the future, I'd like to use the Solr DIH to do the reindexing. However,
that can only work if the DIH can get all the Solr fields. I discovered
that in Solr 4, at least, if a field is set to stored=false, then the DIH
won't get that field. So I am wondering if I can fix this by simply setting
stored=true for all the fields. Since I am going to have to do a full
re-index for the Solr 8 migration anyway, now would be the time to update
the schema for this.
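
(For reference, a rough sketch of flipping stored on one field with the
Schema API. The collection, field name, and field type below are
placeholders, and replace-field expects the field's full definition:)

# redefine a field with stored=true (repeat for each field)
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/example/schema' \
  -d '{"replace-field": {"name": "primary_dob", "type": "pdate", "stored": true}}'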

I expect that disk usage would grow, but I'd like to find out if there are
any other costs or potential problems that could come up if I go that route.

Thanks,
Joe


Performance of Bulk Importing TSV File in Solr 8

2020-01-02 Thread Joseph Lorenzini
Hi all,

I have a TSV file that contains 1.2 million rows. I want to bulk import this
file into Solr, where each row becomes a Solr document. The TSV has 24
columns. I am using the streaming API like so:

curl -v '
http://localhost:8983/solr/example/update?stream.file=/opt/solr/results.tsv&separator=%09&escape=%5c&stream.contentType=text/csv;charset=utf-8&commit=true
'

The ingestion rate is 167,000 rows a minute, and the import takes about 7.5
minutes to complete. I have a few questions:

- Is there a way to increase the ingestion rate? I am open to doing
something other than a bulk import of a TSV, up to and including writing a
small program; I am just not sure what that would look like at a high level
(see the sketch after this list).
- If the file is a TSV, I noticed that Solr never closes the HTTP connection
with a 200 OK after all the documents are uploaded. The connection seems to
be held open indefinitely. If, however, I upload the same file as a CSV,
then Solr does close the HTTP connection. Is this a bug?
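
(On the first question, one low-tech approach that often helps is posting
the file in parallel chunks instead of one long streaming request. A rough
sketch, assuming GNU split/xargs, a collection named example, a header row
in the TSV, and made-up column names -- the real 24 column names would go
in fieldnames:)

# drop the header row and split the TSV into 100k-row chunks
tail -n +2 /opt/solr/results.tsv | split -l 100000 - /tmp/chunk_

# post the chunks with 4 parallel workers, passing column names explicitly
ls /tmp/chunk_* | xargs -P 4 -I{} curl -s -X POST \
  -H 'Content-Type: text/csv;charset=utf-8' \
  --data-binary @{} \
  'http://localhost:8983/solr/example/update?separator=%09&header=false&fieldnames=id,name,primary_dob'

# commit once at the end instead of once per request
curl 'http://localhost:8983/solr/example/update?commit=true'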


need for re-indexing when using managed schema

2019-12-16 Thread Joseph Lorenzini
Hi all,

I have a question about the managed schema functionality. According to the
docs, "All changes to a collection’s schema require reindexing". This would
imply that if you use a managed schema and you use the Schema API to update
the schema, then doing a full re-index is necessary each time.

Is this accurate, or can a full re-index be avoided?

Thanks,
Joe