Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-12-12 Thread Erick Erickson
Right, look at autoAddReplicas which is designed to do this
automagically (but I confess I don't have much experience with it).
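
For anyone reading along: autoAddReplicas is switched on per collection at
creation time and relies on a shared filesystem such as HDFS. A rough sketch,
with a made-up collection name and counts:

  # create an HDFS-backed collection whose replicas get recreated on live
  # nodes if the node hosting them dies
  curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=45&replicationFactor=1&autoAddReplicas=true'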

What that doesn't handle is capacity; if you need to increase the QPS,
you still need to add replicas.

Depends on your needs of course.

Best,
Erick

On Mon, Dec 11, 2017 at 2:39 PM, Joe Obernberger
 wrote:
> Thank you Erick.  Perhaps it makes more sense to not use any replicas when
> using HDFS for storage (and having a very large index) since it is already
> replicated.  It seems to me that if there were no replicas, and a leader
> went down, that another node could take over by just going through the
> regular startup cycle (replaying logs etc.) similar to the auto add replicas
> capability.  Not sure how one would handle a node coming back.
>
> I think there could be a lot to be gained by taking advantage of a global
> file system with Solr.  Would be fun!
>
> -Joe
>
>
> On 12/9/2017 10:26 PM, Erick Erickson wrote:
>>
>> The complications are things like this:
>>
>> Say an update comes in and gets written to the tlog and indexed but
>> not committed. Now the leader goes down. How does the replica that
>> takes over leadership
>> 1> understand the current state of the index, i.e. that there are
>> uncommitted updates
>> 2> replay the updates from the tlog correctly?
>>
>> Not to mention that during leader election one of the read-only
>> replicas must become a read/write replica when it takes over
>> leadership.
>>
>> The current mechanism does, indeed, use Zk to elect a new leader, the
>> devil is in the details of how in-flight updates get handled properly.
>>
>> There's no a-priori reason all those details couldn't be worked out,
>> it's just gnarly. Nobody has yet stepped up to commit the
>> time/resources to work them all out. My guess is that the cost of
>> having a bunch more disks is cheaper than the engineering time it
>> would take to change this. The standard answer is "patches welcome"
>> ;).
>>
>> Best,
>> Erick
>>
>> On Sat, Dec 9, 2017 at 1:02 PM, Hendrik Haddorp 
>> wrote:
>>>
>>> Ok, thanks for the answer. The leader election and update notification
>>> sound
>>> like they should work using ZooKeeper (leader election recipe and a
>>> normal
>>> watch) but I guess there are some details that make things more
>>> complicated.
>>>
>>> On 09.12.2017 20:19, Erick Erickson wrote:

 This has been bandied about on a number of occasions, it boils down to
 nobody has stepped up to make it happen. It turns out there are a
 number of tricky issues:

> how does leadership change if the leader goes down?
> the raw complexity of getting it right. Getting it wrong corrupts
> indexes
> how do you resolve leadership in the first place so only the leader
> writes to the index?
> how would that affect performance if N replicas were autowarming at the
> same time, thus reading from HDFS?
> how do the read-only replicas know to open a new searcher?
> I'm sure there are a bunch more.

 So this is one of those things that everyone agrees is interesting,
 but nobody is willing to code and it's not actually clear that it
 makes sense in the Solr context. It'd be a pity to put in all the work
 then discover that the performance issues prohibited using it.

 If you _guarantee_ that the index doesn't change, there's the
 NoLockFactory you could specify. That would allow you to share a
 common index, woe be unto you if you start updating the index though.

 Best,
 Erick

 On Sat, Dec 9, 2017 at 4:46 AM, Hendrik Haddorp
 
 wrote:
>
> Hi,
>
> for the HDFS case wouldn't it be nice if there was a mode in which the
> replicas just read the same index files as the leader? I mean after all
> the
> data is already on a shared readable file system so why would one even
> need
> to replicate the transaction log files?
>
> regards,
> Hendrik
>
>
> On 08.12.2017 21:07, Erick Erickson wrote:
>>
>> bq: Will TLOG replicas use less network bandwidth?
>>
>> No, probably more bandwidth. TLOG replicas work like this:
>> 1> the raw docs are forwarded
>> 2> the old-style master/slave replication is used
>>
>> So what you do save is CPU processing on the TLOG replica in exchange
>> for increased bandwidth.
>>
>> Since the only thing forwarded in NRT replicas (outside of recovery)
>> is the raw documents, I expect that TLOG replicas would _increase_
>> network usage. The deal is that TLOG replicas can take over leadership
>> if the leader goes down so they must have an
>> up-to-date-after-last-index-sync set of tlogs.
>>
>> At least that's my current understanding...
>>
>> Best,
>> Erick
>>
>> On Fri, Dec 8, 2017 at 12:01 PM, Joe Obernberger
>>  wrote:
>>>
>>> Anyone have any thoughts on this?  Will TLOG replicas use less
>>> network bandwidth?

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-12-11 Thread Joe Obernberger
Thank you Erick.  Perhaps it makes more sense to not use any replicas 
when using HDFS for storage (and having a very large index) since it is 
already replicated.  It seems to me that if there were no replicas, and 
a leader went down, that another node could take over by just going 
through the regular startup cycle (replaying logs etc.) similar to the 
auto add replicas capability.  Not sure how one would handle a node 
coming back.


I think there could be a lot to be gained by taking advantage of a 
global file system with Solr.  Would be fun!


-Joe


On 12/9/2017 10:26 PM, Erick Erickson wrote:

The complications are things like this:

Say an update comes in and gets written to the tlog and indexed but
not committed. Now the leader goes down. How does the replica that
takes over leadership
1> understand the current state of the index, i.e. that there are
uncommitted updates
2> replay the updates from the tlog correctly?

Not to mention that during leader election one of the read-only
replicas must become a read/write replica when it takes over
leadership.

The current mechanism does, indeed, use Zk to elect a new leader, the
devil is in the details of how in-flight updates get handled properly.

There's no a-priori reason all those details couldn't be worked out,
it's just gnarly. Nobody has yet stepped up to commit the
time/resources to work them all out. My guess is that the cost of
having a bunch more disks is cheaper than the engineering time it
would take to change this. The standard answer is "patches welcome"
;).

Best,
Erick

On Sat, Dec 9, 2017 at 1:02 PM, Hendrik Haddorp  wrote:

Ok, thanks for the answer. The leader election and update notification sound
like they should work using ZooKeeper (leader election recipe and a normal
watch) but I guess there are some details that make things more complicated.

On 09.12.2017 20:19, Erick Erickson wrote:

This has been bandied about on a number of occasions, it boils down to
nobody has stepped up to make it happen. It turns out there are a
number of tricky issues:


how does leadership change if the leader goes down?
the raw complexity of getting it right. Getting it wrong corrupts indexes
how do you resolve leadership in the first place so only the leader
writes to the index?
how would that affect performance if N replicas were autowarming at the
same time, thus reading from HDFS?
how do the read-only replicas know to open a new searcher?
I'm sure there are a bunch more.

So this is one of those things that everyone agrees is interesting,
but nobody is willing to code and it's not actually clear that it
makes sense in the Solr context. It'd be a pity to put in all the work
then discover that the performance issues prohibited using it.

If you _guarantee_ that the index doesn't change, there's the
NoLockFactory you could specify. That would allow you to share a
common index, woe be unto you if you start updating the index though.

Best,
Erick

On Sat, Dec 9, 2017 at 4:46 AM, Hendrik Haddorp 
wrote:

Hi,

for the HDFS case wouldn't it be nice if there was a mode in which the
replicas just read the same index files as the leader? I mean after all
the
data is already on a shared readable file system so why would one even
need
to replicate the transaction log files?

regards,
Hendrik


On 08.12.2017 21:07, Erick Erickson wrote:

bq: Will TLOG replicas use less network bandwidth?

No, probably more bandwidth. TLOG replicas work like this:
1> the raw docs are forwarded
2> the old-style master/slave replication is used

So what you do save is CPU processing on the TLOG replica in exchange
for increased bandwidth.

Since the only thing forwarded in NRT replicas (outside of recovery)
is the raw documents, I expect that TLOG replicas would _increase_
network usage. The deal is that TLOG replicas can take over leadership
if the leader goes down so they must have an
up-to-date-after-last-index-sync set of tlogs.

At least that's my current understanding...

Best,
Erick

On Fri, Dec 8, 2017 at 12:01 PM, Joe Obernberger
 wrote:

Anyone have any thoughts on this?  Will TLOG replicas use less network
bandwidth?

-Joe


On 12/4/2017 12:54 PM, Joe Obernberger wrote:

Hi All - this same problem happened again, and I think I partially
understand what is going on.  The part I don't know is what caused any
of
the replicas to go into full recovery in the first place, but once
they
do,
they cause network interfaces on servers to go fully utilized in both
in/out
directions.  It appears that when a solr replica needs to recover, it
calls
on the leader for all the data.  In HDFS, the data from the leader's
point
of view goes:

HDFS --> Solr Leader Process -->Network--> Replica Solr Process
-->HDFS

Do I have this correct?  That poor network in the middle becomes a
bottleneck and causes other replicas to go into recovery, which causes
more
network traffic.  Perhaps going to TLOG replicas with 7.1 would be
better
>> with HDFS?  Would it be possible for the leader to send a message to the
>> replica to instead get the data straight from HDFS instead of going from
>> one solr process to another?

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-12-09 Thread Erick Erickson
The complications are things like this:

Say an update comes in and gets written to the tlog and indexed but
not committed. Now the leader goes down. How does the replica that
takes over leadership
1> understand the current state of the index, i.e. that there are
uncommitted updates
2> replay the updates from the tlog correctly?

Not to mention that during leader election one of the read-only
replicas must become a read/write replica when it takes over
leadership.

The current mechanism does, indeed, use Zk to elect a new leader, the
devil is in the details of how in-flight updates get handled properly.

There's no a-priori reason all those details couldn't be worked out,
it's just gnarly. Nobody has yet stepped up to commit the
time/resources to work them all out. My guess is that the cost of
having a bunch more disks is cheaper than the engineering time it
would take to change this. The standard answer is "patches welcome"
;).

Best,
Erick

On Sat, Dec 9, 2017 at 1:02 PM, Hendrik Haddorp  wrote:
> Ok, thanks for the answer. The leader election and update notification sound
> like they should work using ZooKeeper (leader election recipe and a normal
> watch) but I guess there are some details that make things more complicated.
>
> On 09.12.2017 20:19, Erick Erickson wrote:
>>
>> This has been bandied about on a number of occasions, it boils down to
>> nobody has stepped up to make it happen. It turns out there are a
>> number of tricky issues:
>>
>>> how does leadership change if the leader goes down?
>>> the raw complexity of getting it right. Getting it wrong corrupts indexes
>>> how do you resolve leadership in the first place so only the leader
>>> writes to the index?
>>> how would that affect performance if N replicas were autowarming at the
>>> same time, thus reading from HDFS?
>>> how do the read-only replicas know to open a new searcher?
>>> I'm sure there are a bunch more.
>>
>> So this is one of those things that everyone agrees is interesting,
>> but nobody is willing to code and it's not actually clear that it
>> makes sense in the Solr context. It'd be a pity to put in all the work
>> then discover that the performance issues prohibited using it.
>>
>> If you _guarantee_ that the index doesn't change, there's the
>> NoLockFactory you could specify. That would allow you to share a
>> common index, woe be unto you if you start updating the index though.
>>
>> Best,
>> Erick
>>
>> On Sat, Dec 9, 2017 at 4:46 AM, Hendrik Haddorp 
>> wrote:
>>>
>>> Hi,
>>>
>>> for the HDFS case wouldn't it be nice if there was a mode in which the
>>> replicas just read the same index files as the leader? I mean after all
>>> the
>>> data is already on a shared readable file system so why would one even
>>> need
>>> to replicate the transaction log files?
>>>
>>> regards,
>>> Hendrik
>>>
>>>
>>> On 08.12.2017 21:07, Erick Erickson wrote:

 bq: Will TLOG replicas use less network bandwidth?

 No, probably more bandwidth. TLOG replicas work like this:
 1> the raw docs are forwarded
 2> the old-style master/slave replication is used

 So what you do save is CPU processing on the TLOG replica in exchange
 for increased bandwidth.

 Since the only thing forwarded in NRT replicas (outside of recovery)
 is the raw documents, I expect that TLOG replicas would _increase_
 network usage. The deal is that TLOG replicas can take over leadership
 if the leader goes down so they must have an
 up-to-date-after-last-index-sync set of tlogs.

 At least that's my current understanding...

 Best,
 Erick

 On Fri, Dec 8, 2017 at 12:01 PM, Joe Obernberger
  wrote:
>
> Anyone have any thoughts on this?  Will TLOG replicas use less network
> bandwidth?
>
> -Joe
>
>
> On 12/4/2017 12:54 PM, Joe Obernberger wrote:
>>
>> Hi All - this same problem happened again, and I think I partially
>> understand what is going on.  The part I don't know is what caused any
>> of
>> the replicas to go into full recovery in the first place, but once
>> they
>> do,
>> they cause network interfaces on servers to go fully utilized in both
>> in/out
>> directions.  It appears that when a solr replica needs to recover, it
>> calls
>> on the leader for all the data.  In HDFS, the data from the leader's
>> point
>> of view goes:
>>
>> HDFS --> Solr Leader Process -->Network--> Replica Solr Process
>> -->HDFS
>>
>> Do I have this correct?  That poor network in the middle becomes a
>> bottleneck and causes other replicas to go into recovery, which causes
>> more
>> network traffic.  Perhaps going to TLOG replicas with 7.1 would be
>> better
>> with HDFS?  Would it be possible for the leader to send a message to
>> the
>> replica to instead get the data straight from HDFS instead of going
>> from
>> one
>> solr process to another?

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-12-09 Thread Hendrik Haddorp
Ok, thanks for the answer. The leader election and update notification 
sound like they should work using ZooKeeper (leader election recipe and 
a normal watch) but I guess there are some details that make things more 
complicated.
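
For context, SolrCloud already keeps the per-shard election queue and the
registered leader in ZooKeeper, so it is easy to poke at with the stock
ZooKeeper CLI. A sketch, with a made-up collection name and znode paths from
memory for 6.x (verify against your own tree):

  # sequential ephemeral nodes; the lowest sequence number is next in line
  zkCli.sh -server zk1:2181 ls /collections/mycollection/leader_elect/shard1/election
  # the currently registered leader for that shard
  zkCli.sh -server zk1:2181 get /collections/mycollection/leaders/shard1/leader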


On 09.12.2017 20:19, Erick Erickson wrote:

This has been bandied about on a number of occasions, it boils down to
nobody has stepped up to make it happen. It turns out there are a
number of tricky issues:


how does leadership change if the leader goes down?
the raw complexity of getting it right. Getting it wrong corrupts indexes
how do you resolve leadership in the first place so only the leader writes to 
the index?
how would that affect performance if N replicas were autowarming at the same 
time, thus reading from HDFS?
how do the read-only replicas know to open a new searcher?
I'm sure there are a bunch more.

So this is one of those things that everyone agrees is interesting,
but nobody is willing to code and it's not actually clear that it
makes sense in the Solr context. It'd be a pity to put in all the work
then discover that the performance issues prohibited using it.

If you _guarantee_ that the index doesn't change, there's the
NoLockFactory you could specify. That would allow you to share a
common index, woe be unto you if you start updating the index though.

Best,
Erick

On Sat, Dec 9, 2017 at 4:46 AM, Hendrik Haddorp  wrote:

Hi,

for the HDFS case wouldn't it be nice if there was a mode in which the
replicas just read the same index files as the leader? I mean after all the
data is already on a shared readable file system so why would one even need
to replicate the transaction log files?

regards,
Hendrik


On 08.12.2017 21:07, Erick Erickson wrote:

bq: Will TLOG replicas use less network bandwidth?

No, probably more bandwidth. TLOG replicas work like this:
1> the raw docs are forwarded
2> the old-style master/slave replication is used

So what you do save is CPU processing on the TLOG replica in exchange
for increased bandwidth.

Since the only thing forwarded in NRT replicas (outside of recovery)
is the raw documents, I expect that TLOG replicas would _increase_
network usage. The deal is that TLOG replicas can take over leadership
if the leader goes down so they must have an
up-to-date-after-last-index-sync set of tlogs.

At least that's my current understanding...

Best,
Erick

On Fri, Dec 8, 2017 at 12:01 PM, Joe Obernberger
 wrote:

Anyone have any thoughts on this?  Will TLOG replicas use less network
bandwidth?

-Joe


On 12/4/2017 12:54 PM, Joe Obernberger wrote:

Hi All - this same problem happened again, and I think I partially
understand what is going on.  The part I don't know is what caused any
of
the replicas to go into full recovery in the first place, but once they
do,
they cause network interfaces on servers to go fully utilized in both
in/out
directions.  It appears that when a solr replica needs to recover, it
calls
on the leader for all the data.  In HDFS, the data from the leader's
point
of view goes:

HDFS --> Solr Leader Process -->Network--> Replica Solr Process -->HDFS

Do I have this correct?  That poor network in the middle becomes a
bottleneck and causes other replicas to go into recovery, which causes
more
network traffic.  Perhaps going to TLOG replicas with 7.1 would be
better
with HDFS?  Would it be possible for the leader to send a message to the
replica to instead get the data straight from HDFS instead of going from
one
solr process to another?  HDFS would better be able to use the cluster
since
each block has 3x replicas.  Perhaps there is a better way to handle
replicas with a shared file system.

Our current plan to fix the issue is to go to Solr 7.1.0 and use TLOG.
Good idea?  Thank you!

-Joe


On 11/22/2017 8:17 PM, Erick Erickson wrote:

Hmm. This is quite possible. Any time things take "too long" it can be
a problem. For instance, if the leader sends docs to a replica and
the request times out, the leader throws the follower into "Leader
Initiated Recovery". The smoking gun here is that there are no errors
on the follower, just the notification that the leader put it into
recovery.

There are other variations on the theme, it all boils down to when
communications fall apart replicas go into recovery.

Best,
Erick

On Wed, Nov 22, 2017 at 11:02 AM, Joe Obernberger
 wrote:

Hi Shawn - thank you for your reply. The index is 29.9TBytes as
reported
by:
hadoop fs -du -s -h /solr6.6.0
29.9 T  89.9 T  /solr6.6.0

The 89.9TBytes is due to HDFS having 3x replication.  There are about
1.1
billion documents indexed and we index about 2.5 million documents per
day.
Assuming an even distribution, each node is handling about 680GBytes
of
index.  So our cache size is 1.4%. Perhaps 'relatively small block
cache'
was an understatement! This is why we split the largest collection
into
two,
where one is data going back 30 days, and the other is all the data.
Most
of our searches are not longer than 30 days back.  The 30 day index is
2.6TBytes total.

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-12-09 Thread Erick Erickson
This has been bandied about on a number of occasions, it boils down to
nobody has stepped up to make it happen. It turns out there are a
number of tricky issues:

> how does leadership change if the leader goes down?
> the raw complexity of getting it right. Getting it wrong corrupts indexes
> how do you resolve leadership in the first place so only the leader writes to 
> the index?
> how would that affect performance if N replicas were autowarming at the same 
> time, thus reading from HDFS?
> how do the read-only replicas know to open a new searcher?
> I'm sure there are a bunch more.

So this is one of those things that everyone agrees is interesting,
but nobody is willing to code and it's not actually clear that it
makes sense in the Solr context. It'd be a pity to put in all the work
then discover that the performance issues prohibited using it.

If you _guarantee_ that the index doesn't change, there's the
NoLockFactory you could specify. That would allow you to share a
common index, woe be unto you if you start updating the index though.
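
For the record, "none" is the lockType that maps to NoLockFactory. A minimal
sketch of wiring it in through a system property, assuming your solrconfig.xml
resolves the lock type from solr.lock.type the way the stock configs do:

  # read-only searchers pointed at a shared index; safe ONLY if nothing,
  # anywhere, ever writes to that index
  bin/solr start -c -z zk1:2181 -Dsolr.lock.type=none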

Best,
Erick

On Sat, Dec 9, 2017 at 4:46 AM, Hendrik Haddorp  wrote:
> Hi,
>
> for the HDFS case wouldn't it be nice if there was a mode in which the
> replicas just read the same index files as the leader? I mean after all the
> data is already on a shared readable file system so why would one even need
> to replicate the transaction log files?
>
> regards,
> Hendrik
>
>
> On 08.12.2017 21:07, Erick Erickson wrote:
>>
>> bq: Will TLOG replicas use less network bandwidth?
>>
>> No, probably more bandwidth. TLOG replicas work like this:
>> 1> the raw docs are forwarded
>> 2> the old-style master/slave replication is used
>>
>> So what you do save is CPU processing on the TLOG replica in exchange
>> for increased bandwidth.
>>
>> Since the only thing forwarded in NRT replicas (outside of recovery)
>> is the raw documents, I expect that TLOG replicas would _increase_
>> network usage. The deal is that TLOG replicas can take over leadership
>> if the leader goes down so they must have an
>> up-to-date-after-last-index-sync set of tlogs.
>>
>> At least that's my current understanding...
>>
>> Best,
>> Erick
>>
>> On Fri, Dec 8, 2017 at 12:01 PM, Joe Obernberger
>>  wrote:
>>>
>>> Anyone have any thoughts on this?  Will TLOG replicas use less network
>>> bandwidth?
>>>
>>> -Joe
>>>
>>>
>>> On 12/4/2017 12:54 PM, Joe Obernberger wrote:

 Hi All - this same problem happened again, and I think I partially
 understand what is going on.  The part I don't know is what caused any
 of
 the replicas to go into full recovery in the first place, but once they
 do,
 they cause network interfaces on servers to go fully utilized in both
 in/out
 directions.  It appears that when a solr replica needs to recover, it
 calls
 on the leader for all the data.  In HDFS, the data from the leader's
 point
 of view goes:

 HDFS --> Solr Leader Process -->Network--> Replica Solr Process -->HDFS

 Do I have this correct?  That poor network in the middle becomes a
 bottleneck and causes other replicas to go into recovery, which causes
 more
 network traffic.  Perhaps going to TLOG replicas with 7.1 would be
 better
 with HDFS?  Would it be possible for the leader to send a message to the
 replica to instead get the data straight from HDFS instead of going from
 one
 solr process to another?  HDFS would better be able to use the cluster
 since
 each block has 3x replicas.  Perhaps there is a better way to handle
 replicas with a shared file system.

 Our current plan to fix the issue is to go to Solr 7.1.0 and use TLOG.
 Good idea?  Thank you!

 -Joe


 On 11/22/2017 8:17 PM, Erick Erickson wrote:
>
> Hmm. This is quite possible. Any time things take "too long" it can be
> a problem. For instance, if the leader sends docs to a replica and
> the request times out, the leader throws the follower into "Leader
> Initiated Recovery". The smoking gun here is that there are no errors
> on the follower, just the notification that the leader put it into
> recovery.
>
> There are other variations on the theme, it all boils down to when
> communications fall apart replicas go into recovery.
>
> Best,
> Erick
>
> On Wed, Nov 22, 2017 at 11:02 AM, Joe Obernberger
>  wrote:
>>
>> Hi Shawn - thank you for your reply. The index is 29.9TBytes as
>> reported
>> by:
>> hadoop fs -du -s -h /solr6.6.0
>> 29.9 T  89.9 T  /solr6.6.0
>>
>> The 89.9TBytes is due to HDFS having 3x replication.  There are about
>> 1.1
>> billion documents indexed and we index about 2.5 million documents per
>> day.
>> Assuming an even distribution, each node is handling about 680GBytes
>> of
>> index.  So our cache size is 1.4%. Perhaps 'relatively small block
>> cache'

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-12-09 Thread Hendrik Haddorp

Hi,

for the HDFS case wouldn't it be nice if there was a mode in which the 
replicas just read the same index files as the leader? I mean after all 
the data is already on a shared readable file system so why would one 
even need to replicate the transaction log files?


regards,
Hendrik

On 08.12.2017 21:07, Erick Erickson wrote:

bq: Will TLOG replicas use less network bandwidth?

No, probably more bandwidth. TLOG replicas work like this:
1> the raw docs are forwarded
2> the old-style master/slave replication is used

So what you do save is CPU processing on the TLOG replica in exchange
for increased bandwidth.

Since the only thing forwarded in NRT replicas (outside of recovery)
is the raw documents, I expect that TLOG replicas would _increase_
network usage. The deal is that TLOG replicas can take over leadership
if the leader goes down so they must have an
up-to-date-after-last-index-sync set of tlogs.

At least that's my current understanding...

Best,
Erick

On Fri, Dec 8, 2017 at 12:01 PM, Joe Obernberger
 wrote:

Anyone have any thoughts on this?  Will TLOG replicas use less network
bandwidth?

-Joe


On 12/4/2017 12:54 PM, Joe Obernberger wrote:

Hi All - this same problem happened again, and I think I partially
understand what is going on.  The part I don't know is what caused any of
the replicas to go into full recovery in the first place, but once they do,
they cause network interfaces on servers to go fully utilized in both in/out
directions.  It appears that when a solr replica needs to recover, it calls
on the leader for all the data.  In HDFS, the data from the leader's point
of view goes:

HDFS --> Solr Leader Process -->Network--> Replica Solr Process -->HDFS

Do I have this correct?  That poor network in the middle becomes a
bottleneck and causes other replicas to go into recovery, which causes more
network traffic.  Perhaps going to TLOG replicas with 7.1 would be better
with HDFS?  Would it be possible for the leader to send a message to the
replica to instead get the data straight from HDFS instead of going from one
solr process to another?  HDFS would better be able to use the cluster since
each block has 3x replicas.  Perhaps there is a better way to handle
replicas with a shared file system.

Our current plan to fix the issue is to go to Solr 7.1.0 and use TLOG.
Good idea?  Thank you!

-Joe


On 11/22/2017 8:17 PM, Erick Erickson wrote:

Hmm. This is quite possible. Any time things take "too long" it can be
   a problem. For instance, if the leader sends docs to a replica and
the request times out, the leader throws the follower into "Leader
Initiated Recovery". The smoking gun here is that there are no errors
on the follower, just the notification that the leader put it into
recovery.

There are other variations on the theme, it all boils down to when
communications fall apart replicas go into recovery.

Best,
Erick

On Wed, Nov 22, 2017 at 11:02 AM, Joe Obernberger
 wrote:

Hi Shawn - thank you for your reply. The index is 29.9TBytes as reported
by:
hadoop fs -du -s -h /solr6.6.0
29.9 T  89.9 T  /solr6.6.0

The 89.9TBytes is due to HDFS having 3x replication.  There are about
1.1
billion documents indexed and we index about 2.5 million documents per
day.
Assuming an even distribution, each node is handling about 680GBytes of
index.  So our cache size is 1.4%. Perhaps 'relatively small block
cache'
was an understatement! This is why we split the largest collection into
two,
where one is data going back 30 days, and the other is all the data.
Most
of our searches are not longer than 30 days back.  The 30 day index is
2.6TBytes total.  I don't know how the HDFS block cache splits between
collections, but the 30 day index performs acceptable for our specific
application.

If we wanted to cache 50% of the index, each of our 45 nodes would need
a
block cache of about 350GBytes.  I'm accepting offers of DIMMs!

What I believe caused our 'recovery, fail, retry loop' was one of our
servers died.  This caused HDFS to start to replicate blocks across the
cluster and produced a lot of network activity.  When this happened, I
believe there was high network contention for specific nodes in the
cluster
and their network interfaces became pegged and requests for HDFS blocks
timed out.  When that happened, SolrCloud went into recovery which
caused
more network traffic.  Fun stuff.

-Joe


On 11/22/2017 11:44 AM, Shawn Heisey wrote:

On 11/22/2017 6:44 AM, Joe Obernberger wrote:

Right now, we have a relatively small block cache due to the
requirements that the servers run other software.  We tried to find
the best balance between block cache size, and RAM for programs, while
still giving enough for local FS cache.  This came out to be 84 128M
blocks - or about 10G for the cache per node (45 nodes total).

How much data is being handled on a server with 10GB allocated for
caching HDFS data?

The first message in this thread says the index size is 31TB, which is
*enormous*.  You have also said that the index takes 93TB of disk space.

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-12-08 Thread Erick Erickson
bq: Will TLOG replicas use less network bandwidth?

No, probably more bandwidth. TLOG replicas work like this:
1> the raw docs are forwarded
2> the old-style master/slave replication is used

So what you do save is CPU processing on the TLOG replica in exchange
for increased bandwidth.

Since the only thing forwarded in NRT replicas (outside of recovery)
is the raw documents, I expect that TLOG replicas would _increase_
network usage. The deal is that TLOG replicas can take over leadership
if the leader goes down so they must have an
up-to-date-after-last-index-sync set of tlogs.

At least that's my current understanding...

Best,
Erick

On Fri, Dec 8, 2017 at 12:01 PM, Joe Obernberger
 wrote:
> Anyone have any thoughts on this?  Will TLOG replicas use less network
> bandwidth?
>
> -Joe
>
>
> On 12/4/2017 12:54 PM, Joe Obernberger wrote:
>>
>> Hi All - this same problem happened again, and I think I partially
>> understand what is going on.  The part I don't know is what caused any of
>> the replicas to go into full recovery in the first place, but once they do,
>> they cause network interfaces on servers to go fully utilized in both in/out
>> directions.  It appears that when a solr replica needs to recover, it calls
>> on the leader for all the data.  In HDFS, the data from the leader's point
>> of view goes:
>>
>> HDFS --> Solr Leader Process -->Network--> Replica Solr Process -->HDFS
>>
>> Do I have this correct?  That poor network in the middle becomes a
>> bottleneck and causes other replicas to go into recovery, which causes more
>> network traffic.  Perhaps going to TLOG replicas with 7.1 would be better
>> with HDFS?  Would it be possible for the leader to send a message to the
>> replica to instead get the data straight from HDFS instead of going from one
>> solr process to another?  HDFS would better be able to use the cluster since
>> each block has 3x replicas.  Perhaps there is a better way to handle
>> replicas with a shared file system.
>>
>> Our current plan to fix the issue is to go to Solr 7.1.0 and use TLOG.
>> Good idea?  Thank you!
>>
>> -Joe
>>
>>
>> On 11/22/2017 8:17 PM, Erick Erickson wrote:
>>>
>>> Hmm. This is quite possible. Any time things take "too long" it can be
>>>   a problem. For instance, if the leader sends docs to a replica and
>>> the request times out, the leader throws the follower into "Leader
>>> Initiated Recovery". The smoking gun here is that there are no errors
>>> on the follower, just the notification that the leader put it into
>>> recovery.
>>>
>>> There are other variations on the theme, it all boils down to when
>>> communications fall apart replicas go into recovery.
>>>
>>> Best,
>>> Erick
>>>
>>> On Wed, Nov 22, 2017 at 11:02 AM, Joe Obernberger
>>>  wrote:

 Hi Shawn - thank you for your reply. The index is 29.9TBytes as reported
 by:
 hadoop fs -du -s -h /solr6.6.0
 29.9 T  89.9 T  /solr6.6.0

 The 89.9TBytes is due to HDFS having 3x replication.  There are about
 1.1
 billion documents indexed and we index about 2.5 million documents per
 day.
 Assuming an even distribution, each node is handling about 680GBytes of
 index.  So our cache size is 1.4%. Perhaps 'relatively small block
 cache'
 was an understatement! This is why we split the largest collection into
 two,
 where one is data going back 30 days, and the other is all the data.
 Most
 of our searches are not longer than 30 days back.  The 30 day index is
 2.6TBytes total.  I don't know how the HDFS block cache splits between
 collections, but the 30 day index performs acceptable for our specific
 application.

 If we wanted to cache 50% of the index, each of our 45 nodes would need
 a
 block cache of about 350GBytes.  I'm accepting offers of DIMMs!

 What I believe caused our 'recovery, fail, retry loop' was one of our
 servers died.  This caused HDFS to start to replicate blocks across the
 cluster and produced a lot of network activity.  When this happened, I
 believe there was high network contention for specific nodes in the
 cluster
 and their network interfaces became pegged and requests for HDFS blocks
 timed out.  When that happened, SolrCloud went into recovery which
 caused
 more network traffic.  Fun stuff.

 -Joe


 On 11/22/2017 11:44 AM, Shawn Heisey wrote:
>
> On 11/22/2017 6:44 AM, Joe Obernberger wrote:
>>
>> Right now, we have a relatively small block cache due to the
>> requirements that the servers run other software.  We tried to find
>> the best balance between block cache size, and RAM for programs, while
>> still giving enough for local FS cache.  This came out to be 84 128M
>> blocks - or about 10G for the cache per node (45 nodes total).
>
> How much data is being handled on a server with 10GB allocated for
> caching HDFS data?
>
> The first message in this thread says the index size is 31TB, which is
> *enormous*.

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-12-08 Thread Joe Obernberger
Anyone have any thoughts on this?  Will TLOG replicas use less network 
bandwidth?


-Joe


On 12/4/2017 12:54 PM, Joe Obernberger wrote:
Hi All - this same problem happened again, and I think I partially 
understand what is going on.  The part I don't know is what caused any 
of the replicas to go into full recovery in the first place, but once 
they do, they cause network interfaces on servers to go fully utilized 
in both in/out directions.  It appears that when a solr replica needs 
to recover, it calls on the leader for all the data.  In HDFS, the 
data from the leader's point of view goes:


HDFS --> Solr Leader Process -->Network--> Replica Solr Process -->HDFS

Do I have this correct?  That poor network in the middle becomes a 
bottleneck and causes other replicas to go into recovery, which causes 
more network traffic.  Perhaps going to TLOG replicas with 7.1 would 
be better with HDFS?  Would it be possible for the leader to send a 
message to the replica to instead get the data straight from HDFS 
instead of going from one solr process to another?  HDFS would better 
be able to use the cluster since each block has 3x replicas.  Perhaps 
there is a better way to handle replicas with a shared file system.


Our current plan to fix the issue is to go to Solr 7.1.0 and use 
TLOG.  Good idea?  Thank you!


-Joe


On 11/22/2017 8:17 PM, Erick Erickson wrote:

Hmm. This is quite possible. Any time things take "too long" it can be
  a problem. For instance, if the leader sends docs to a replica and
the request times out, the leader throws the follower into "Leader
Initiated Recovery". The smoking gun here is that there are no errors
on the follower, just the notification that the leader put it into
recovery.

There are other variations on the theme, it all boils down to when
communications fall apart replicas go into recovery.

Best,
Erick

On Wed, Nov 22, 2017 at 11:02 AM, Joe Obernberger
 wrote:
Hi Shawn - thank you for your reply. The index is 29.9TBytes as 
reported

by:
hadoop fs -du -s -h /solr6.6.0
29.9 T  89.9 T  /solr6.6.0

The 89.9TBytes is due to HDFS having 3x replication.  There are 
about 1.1
billion documents indexed and we index about 2.5 million documents 
per day.

Assuming an even distribution, each node is handling about 680GBytes of
index.  So our cache size is 1.4%. Perhaps 'relatively small block 
cache'
was an understatement! This is why we split the largest collection 
into two,
where one is data going back 30 days, and the other is all the 
data.  Most

of our searches are not longer than 30 days back.  The 30 day index is
2.6TBytes total.  I don't know how the HDFS block cache splits between
collections, but the 30 day index performs acceptable for our specific
application.

If we wanted to cache 50% of the index, each of our 45 nodes would 
need a

block cache of about 350GBytes.  I'm accepting offers of DIMMs!

What I believe caused our 'recovery, fail, retry loop' was one of our
servers died.  This caused HDFS to start to replicate blocks across the
cluster and produced a lot of network activity.  When this happened, I
believe there was high network contention for specific nodes in the 
cluster

and their network interfaces became pegged and requests for HDFS blocks
timed out.  When that happened, SolrCloud went into recovery which 
caused

more network traffic.  Fun stuff.

-Joe


On 11/22/2017 11:44 AM, Shawn Heisey wrote:

On 11/22/2017 6:44 AM, Joe Obernberger wrote:

Right now, we have a relatively small block cache due to the
requirements that the servers run other software.  We tried to find
the best balance between block cache size, and RAM for programs, 
while

still giving enough for local FS cache.  This came out to be 84 128M
blocks - or about 10G for the cache per node (45 nodes total).

How much data is being handled on a server with 10GB allocated for
caching HDFS data?

The first message in this thread says the index size is 31TB, which is
*enormous*.  You have also said that the index takes 93TB of disk
space.  If the data is distributed somewhat evenly, then the answer to
my question would be that each of those 45 Solr servers would be
handling over 2TB of data.  A 10GB cache is *nothing* compared to 2TB.

When index data that Solr needs to access for an operation is not 
in the

cache and Solr must actually wait for disk and/or network I/O, the
resulting performance usually isn't very good.  In most cases you 
don't

need to have enough memory to fully cache the index data ... but less
than half a percent is not going to be enough.

Thanks,
Shawn









Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-12-04 Thread Joe Obernberger
Hi All - this same problem happened again, and I think I partially 
understand what is going on.  The part I don't know is what caused any 
of the replicas to go into full recovery in the first place, but once 
they do, they cause network interfaces on servers to go fully utilized 
in both in/out directions.  It appears that when a solr replica needs to 
recover, it calls on the leader for all the data.  In HDFS, the data 
from the leader's point of view goes:


HDFS --> Solr Leader Process -->Network--> Replica Solr Process -->HDFS

Do I have this correct?  That poor network in the middle becomes a 
bottleneck and causes other replicas to go into recovery, which causes 
more network traffic.  Perhaps going to TLOG replicas with 7.1 would be 
better with HDFS?  Would it be possible for the leader to send a message 
to the replica to instead get the data straight from HDFS instead of 
going from one solr process to another?  HDFS would better be able to 
use the cluster since each block has 3x replicas.  Perhaps there is a 
better way to handle replicas with a shared file system.


Our current plan to fix the issue is to go to Solr 7.1.0 and use TLOG.  
Good idea?  Thank you!
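
If it helps anyone planning the same move: in Solr 7.x the replica type is
requested through the Collections API. A rough sketch, with made-up names and
counts:

  # create a collection backed only by TLOG replicas
  curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&nrtReplicas=0&tlogReplicas=2'
  # or add a TLOG copy of one shard to an existing collection
  curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&type=TLOG'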


-Joe


On 11/22/2017 8:17 PM, Erick Erickson wrote:

Hmm. This is quite possible. Any time things take "too long" it can be
  a problem. For instance, if the leader sends docs to a replica and
the request times out, the leader throws the follower into "Leader
Initiated Recovery". The smoking gun here is that there are no errors
on the follower, just the notification that the leader put it into
recovery.

There are other variations on the theme, it all boils down to when
communications fall apart replicas go into recovery.

Best,
Erick

On Wed, Nov 22, 2017 at 11:02 AM, Joe Obernberger
 wrote:

Hi Shawn - thank you for your reply.  The index is 29.9TBytes as reported
by:
hadoop fs -du -s -h /solr6.6.0
29.9 T  89.9 T  /solr6.6.0

The 89.9TBytes is due to HDFS having 3x replication.  There are about 1.1
billion documents indexed and we index about 2.5 million documents per day.
Assuming an even distribution, each node is handling about 680GBytes of
index.  So our cache size is 1.4%. Perhaps 'relatively small block cache'
was an understatement! This is why we split the largest collection into two,
where one is data going back 30 days, and the other is all the data.  Most
of our searches are not longer than 30 days back.  The 30 day index is
2.6TBytes total.  I don't know how the HDFS block cache splits between
collections, but the 30 day index performs acceptable for our specific
application.

If we wanted to cache 50% of the index, each of our 45 nodes would need a
block cache of about 350GBytes.  I'm accepting offers of DIMMs!

What I believe caused our 'recovery, fail, retry loop' was one of our
servers died.  This caused HDFS to start to replicate blocks across the
cluster and produced a lot of network activity.  When this happened, I
believe there was high network contention for specific nodes in the cluster
and their network interfaces became pegged and requests for HDFS blocks
timed out.  When that happened, SolrCloud went into recovery which caused
more network traffic.  Fun stuff.

-Joe


On 11/22/2017 11:44 AM, Shawn Heisey wrote:

On 11/22/2017 6:44 AM, Joe Obernberger wrote:

Right now, we have a relatively small block cache due to the
requirements that the servers run other software.  We tried to find
the best balance between block cache size, and RAM for programs, while
still giving enough for local FS cache.  This came out to be 84 128M
blocks - or about 10G for the cache per node (45 nodes total).

How much data is being handled on a server with 10GB allocated for
caching HDFS data?

The first message in this thread says the index size is 31TB, which is
*enormous*.  You have also said that the index takes 93TB of disk
space.  If the data is distributed somewhat evenly, then the answer to
my question would be that each of those 45 Solr servers would be
handling over 2TB of data.  A 10GB cache is *nothing* compared to 2TB.

When index data that Solr needs to access for an operation is not in the
cache and Solr must actually wait for disk and/or network I/O, the
resulting performance usually isn't very good.  In most cases you don't
need to have enough memory to fully cache the index data ... but less
than half a percent is not going to be enough.

Thanks,
Shawn







Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-27 Thread Joe Obernberger
Just to add onto this.  Right now the cluster has recovered, and life is 
good.  My concerns with a cluster restart are lock files and network
timeouts on startup.  The 1st can be addressed by stopping indexing,
waiting until things flush out, and then halting all the nodes.  No lock
files.
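
If a node ever goes down hard, the leftover Lucene write locks in HDFS have to
be removed by hand before those cores will load again. A rough sketch against
the path used in this thread, assuming the default write.lock file name, and
only to be run with every Solr node stopped:

  # list any leftover lock files under the Solr HDFS home
  hadoop fs -ls -R /solr6.6.0 | awk '/write\.lock$/ {print $NF}'
  # remove them once you are certain no Solr process is still running
  hadoop fs -ls -R /solr6.6.0 | awk '/write\.lock$/ {print $NF}' | xargs -r -n 50 hadoop fs -rm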


The 2nd is the one I'm scared about.  We use puppet to start/stop all 
the 45 nodes in the cluster, and on startup there is a massive amount of 
HDFS activity, that I'm afraid will put some of the replicas into 
recovery.  If that happens, then we're probably in for the recovery, 
failed, retry loop.  Anyone else run into this?


Thanks.

-Joe


On 11/27/2017 11:28 AM, Joe Obernberger wrote:
Thank you Erick.  Right now, we have our autoCommit time set to 
180 (30 minutes), and our autoSoftCommit set to 12.  The 
thought was that with HDFS we want less frequent, but larger 
operations, since HDFS has such a large block size.  Is that incorrect 
thinking?


As to why we are using HDFS.  For our use case, we already have a 
large cluster that runs HBase, and we want to index data within it.  
Adding another layer of storage that we would need to manage would add 
complexity.  With HDFS, we just add another box that has disk, and 
boom - more storage for all players involved.


-Joe


On 11/22/2017 8:17 PM, Erick Erickson wrote:

Hmm. This is quite possible. Any time things take "too long" it can be
  a problem. For instance, if the leader sends docs to a replica and
the request times out, the leader throws the follower into "Leader
Initiated Recovery". The smoking gun here is that there are no errors
on the follower, just the notification that the leader put it into
recovery.

There are other variations on the theme, it all boils down to when
communications fall apart replicas go into recovery.

Best,
Erick

On Wed, Nov 22, 2017 at 11:02 AM, Joe Obernberger
 wrote:
Hi Shawn - thank you for your reply. The index is 29.9TBytes as 
reported

by:
hadoop fs -du -s -h /solr6.6.0
29.9 T  89.9 T  /solr6.6.0

The 89.9TBytes is due to HDFS having 3x replication.  There are 
about 1.1
billion documents indexed and we index about 2.5 million documents 
per day.

Assuming an even distribution, each node is handling about 680GBytes of
index.  So our cache size is 1.4%. Perhaps 'relatively small block 
cache'
was an understatement! This is why we split the largest collection 
into two,
where one is data going back 30 days, and the other is all the 
data.  Most

of our searches are not longer than 30 days back.  The 30 day index is
2.6TBytes total.  I don't know how the HDFS block cache splits between
collections, but the 30 day index performs acceptable for our specific
application.

If we wanted to cache 50% of the index, each of our 45 nodes would 
need a

block cache of about 350GBytes.  I'm accepting offers of DIMMs!

What I believe caused our 'recovery, fail, retry loop' was one of our
servers died.  This caused HDFS to start to replicate blocks across the
cluster and produced a lot of network activity.  When this happened, I
believe there was high network contention for specific nodes in the 
cluster

and their network interfaces became pegged and requests for HDFS blocks
timed out.  When that happened, SolrCloud went into recovery which 
caused

more network traffic.  Fun stuff.

-Joe


On 11/22/2017 11:44 AM, Shawn Heisey wrote:

On 11/22/2017 6:44 AM, Joe Obernberger wrote:

Right now, we have a relatively small block cache due to the
requirements that the servers run other software.  We tried to find
the best balance between block cache size, and RAM for programs, 
while

still giving enough for local FS cache.  This came out to be 84 128M
blocks - or about 10G for the cache per node (45 nodes total).

How much data is being handled on a server with 10GB allocated for
caching HDFS data?

The first message in this thread says the index size is 31TB, which is
*enormous*.  You have also said that the index takes 93TB of disk
space.  If the data is distributed somewhat evenly, then the answer to
my question would be that each of those 45 Solr servers would be
handling over 2TB of data.  A 10GB cache is *nothing* compared to 2TB.

When index data that Solr needs to access for an operation is not 
in the

cache and Solr must actually wait for disk and/or network I/O, the
resulting performance usually isn't very good.  In most cases you 
don't

need to have enough memory to fully cache the index data ... but less
than half a percent is not going to be enough.

Thanks,
Shawn









Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-27 Thread Joe Obernberger
Thank you Erick.  Right now, we have our autoCommit time set to 180 
(30 minutes), and our autoSoftCommit set to 12.  The thought was 
that with HDFS we want less frequent, but larger operations, since HDFS 
has such a large block size.  Is that incorrect thinking?
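
The two numbers above look mangled by the list archive. For reference, the
stock solrconfig.xml reads both intervals from system properties, so a sketch
like the following would give 30-minute hard commits and, purely as an
illustration, 2-minute soft commits:

  bin/solr start -c -z zk1:2181 \
    -Dsolr.autoCommit.maxTime=1800000 \
    -Dsolr.autoSoftCommit.maxTime=120000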


As to why we are using HDFS.  For our use case, we already have a large 
cluster that runs HBase, and we want to index data within it.  Adding 
another layer of storage that we would need to manage would add 
complexity.  With HDFS, we just add another box that has disk, and boom 
- more storage for all players involved.


-Joe


On 11/22/2017 8:17 PM, Erick Erickson wrote:

Hmm. This is quite possible. Any time things take "too long" it can be
  a problem. For instance, if the leader sends docs to a replica and
the request times out, the leader throws the follower into "Leader
Initiated Recovery". The smoking gun here is that there are no errors
on the follower, just the notification that the leader put it into
recovery.

There are other variations on the theme, it all boils down to when
communications fall apart replicas go into recovery.

Best,
Erick

On Wed, Nov 22, 2017 at 11:02 AM, Joe Obernberger
 wrote:

Hi Shawn - thank you for your reply.  The index is 29.9TBytes as reported
by:
hadoop fs -du -s -h /solr6.6.0
29.9 T  89.9 T  /solr6.6.0

The 89.9TBytes is due to HDFS having 3x replication.  There are about 1.1
billion documents indexed and we index about 2.5 million documents per day.
Assuming an even distribution, each node is handling about 680GBytes of
index.  So our cache size is 1.4%. Perhaps 'relatively small block cache'
was an understatement! This is why we split the largest collection into two,
where one is data going back 30 days, and the other is all the data.  Most
of our searches are not longer than 30 days back.  The 30 day index is
2.6TBytes total.  I don't know how the HDFS block cache splits between
collections, but the 30 day index performs acceptable for our specific
application.

If we wanted to cache 50% of the index, each of our 45 nodes would need a
block cache of about 350GBytes.  I'm accepting offers of DIMMs!

What I believe caused our 'recovery, fail, retry loop' was one of our
servers died.  This caused HDFS to start to replicate blocks across the
cluster and produced a lot of network activity.  When this happened, I
believe there was high network contention for specific nodes in the cluster
and their network interfaces became pegged and requests for HDFS blocks
timed out.  When that happened, SolrCloud went into recovery which caused
more network traffic.  Fun stuff.

-Joe


On 11/22/2017 11:44 AM, Shawn Heisey wrote:

On 11/22/2017 6:44 AM, Joe Obernberger wrote:

Right now, we have a relatively small block cache due to the
requirements that the servers run other software.  We tried to find
the best balance between block cache size, and RAM for programs, while
still giving enough for local FS cache.  This came out to be 84 128M
blocks - or about 10G for the cache per node (45 nodes total).

How much data is being handled on a server with 10GB allocated for
caching HDFS data?

The first message in this thread says the index size is 31TB, which is
*enormous*.  You have also said that the index takes 93TB of disk
space.  If the data is distributed somewhat evenly, then the answer to
my question would be that each of those 45 Solr servers would be
handling over 2TB of data.  A 10GB cache is *nothing* compared to 2TB.

When index data that Solr needs to access for an operation is not in the
cache and Solr must actually wait for disk and/or network I/O, the
resulting performance usually isn't very good.  In most cases you don't
need to have enough memory to fully cache the index data ... but less
than half a percent is not going to be enough.

Thanks,
Shawn







Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Erick Erickson
Hmm. This is quite possible. Any time things take "too long" it can be
 a problem. For instance, if the leader sends docs to a replica and
the request times out, the leader throws the follower into "Leader
Initiated Recovery". The smoking gun here is that there are no errors
on the follower, just the notification that the leader put it into
recovery.

There are other variations on the theme; it all boils down to this: when
communications fall apart, replicas go into recovery.
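
If you want to confirm that is what happened, the leader-initiated-recovery
markers live in ZooKeeper in 6.x. A sketch with the ZooKeeper CLI; the znode
path is from memory and the collection name is made up, so verify both:

  # any children here mean the leader has forced those cores into recovery
  zkCli.sh -server zk1:2181 ls /collections/mycollection/leader_initiated_recovery/shard1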

Best,
Erick

On Wed, Nov 22, 2017 at 11:02 AM, Joe Obernberger
 wrote:
> Hi Shawn - thank you for your reply.  The index is 29.9TBytes as reported
> by:
> hadoop fs -du -s -h /solr6.6.0
> 29.9 T  89.9 T  /solr6.6.0
>
> The 89.9TBytes is due to HDFS having 3x replication.  There are about 1.1
> billion documents indexed and we index about 2.5 million documents per day.
> Assuming an even distribution, each node is handling about 680GBytes of
> index.  So our cache size is 1.4%. Perhaps 'relatively small block cache'
> was an understatement! This is why we split the largest collection into two,
> where one is data going back 30 days, and the other is all the data.  Most
> of our searches are not longer than 30 days back.  The 30 day index is
> 2.6TBytes total.  I don't know how the HDFS block cache splits between
> collections, but the 30 day index performs acceptable for our specific
> application.
>
> If we wanted to cache 50% of the index, each of our 45 nodes would need a
> block cache of about 350GBytes.  I'm accepting offers of DIMMs!
>
> What I believe caused our 'recovery, fail, retry loop' was one of our
> servers died.  This caused HDFS to start to replicate blocks across the
> cluster and produced a lot of network activity.  When this happened, I
> believe there was high network contention for specific nodes in the cluster
> and their network interfaces became pegged and requests for HDFS blocks
> timed out.  When that happened, SolrCloud went into recovery which caused
> more network traffic.  Fun stuff.
>
> -Joe
>
>
> On 11/22/2017 11:44 AM, Shawn Heisey wrote:
>>
>> On 11/22/2017 6:44 AM, Joe Obernberger wrote:
>>>
>>> Right now, we have a relatively small block cache due to the
>>> requirements that the servers run other software.  We tried to find
>>> the best balance between block cache size, and RAM for programs, while
>>> still giving enough for local FS cache.  This came out to be 84 128M
>>> blocks - or about 10G for the cache per node (45 nodes total).
>>
>> How much data is being handled on a server with 10GB allocated for
>> caching HDFS data?
>>
>> The first message in this thread says the index size is 31TB, which is
>> *enormous*.  You have also said that the index takes 93TB of disk
>> space.  If the data is distributed somewhat evenly, then the answer to
>> my question would be that each of those 45 Solr servers would be
>> handling over 2TB of data.  A 10GB cache is *nothing* compared to 2TB.
>>
>> When index data that Solr needs to access for an operation is not in the
>> cache and Solr must actually wait for disk and/or network I/O, the
>> resulting performance usually isn't very good.  In most cases you don't
>> need to have enough memory to fully cache the index data ... but less
>> than half a percent is not going to be enough.
>>
>> Thanks,
>> Shawn
>>
>>
>>
>


Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Joe Obernberger
Hi Shawn - thank you for your reply.  The index is 29.9TBytes as 
reported by:

hadoop fs -du -s -h /solr6.6.0
29.9 T  89.9 T  /solr6.6.0

The 89.9TBytes is due to HDFS having 3x replication.  There are about 
1.1 billion documents indexed and we index about 2.5 million documents 
per day.  Assuming an even distribution, each node is handling about 
680GBytes of index.  So our cache size is 1.4%. Perhaps 'relatively 
small block cache' was an understatement! This is why we split the 
largest collection into two, where one is data going back 30 days, and 
the other is all the data.  Most of our searches are not longer than 30 
days back.  The 30 day index is 2.6TBytes total.  I don't know how the 
HDFS block cache splits between collections, but the 30 day index 
performs acceptably for our specific application.


If we wanted to cache 50% of the index, each of our 45 nodes would need 
a block cache of about 350GBytes.  I'm accepting offers of DIMMs!
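
A rough back-of-the-envelope check of those numbers from the shell, assuming the 
29.9T single-copy figure and an even spread across the 45 nodes:

  echo "scale=1; 29.9 * 1024 / 45" | bc      # ~680 GB of index per node
  echo "scale=1; 10 * 100 / 680" | bc        # a 10 GB cache covers ~1.4% of that
  echo "scale=0; 31 * 1024 / 2 / 45" | bc    # ~352 GB per node to cache half of a 31 TB index

The ~350 GB figure lines up if you start from the 31 TB number quoted earlier in 
the thread.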


What I believe caused our 'recovery, fail, retry loop' was one of our 
servers died.  This caused HDFS to start to replicate blocks across the 
cluster and produced a lot of network activity.  When this happened, I 
believe there was high network contention for specific nodes in the 
cluster and their network interfaces became pegged and requests for HDFS 
blocks timed out.  When that happened, SolrCloud went into recovery 
which caused more network traffic.  Fun stuff.


-Joe


On 11/22/2017 11:44 AM, Shawn Heisey wrote:

On 11/22/2017 6:44 AM, Joe Obernberger wrote:

Right now, we have a relatively small block cache due to the
requirements that the servers run other software.  We tried to find
the best balance between block cache size, and RAM for programs, while
still giving enough for local FS cache.  This came out to be 84 128M
blocks - or about 10G for the cache per node (45 nodes total).

How much data is being handled on a server with 10GB allocated for
caching HDFS data?

The first message in this thread says the index size is 31TB, which is
*enormous*.  You have also said that the index takes 93TB of disk
space.  If the data is distributed somewhat evenly, then the answer to
my question would be that each of those 45 Solr servers would be
handling over 2TB of data.  A 10GB cache is *nothing* compared to 2TB.

When index data that Solr needs to access for an operation is not in the
cache and Solr must actually wait for disk and/or network I/O, the
resulting performance usually isn't very good.  In most cases you don't
need to have enough memory to fully cache the index data ... but less
than half a percent is not going to be enough.

Thanks,
Shawn







Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Shawn Heisey
On 11/22/2017 6:44 AM, Joe Obernberger wrote:
> Right now, we have a relatively small block cache due to the
> requirements that the servers run other software.  We tried to find
> the best balance between block cache size, and RAM for programs, while
> still giving enough for local FS cache.  This came out to be 84 128M
> blocks - or about 10G for the cache per node (45 nodes total).

How much data is being handled on a server with 10GB allocated for
caching HDFS data?

The first message in this thread says the index size is 31TB, which is
*enormous*.  You have also said that the index takes 93TB of disk
space.  If the data is distributed somewhat evenly, then the answer to
my question would be that each of those 45 Solr servers would be
handling over 2TB of data.  A 10GB cache is *nothing* compared to 2TB.

When index data that Solr needs to access for an operation is not in the
cache and Solr must actually wait for disk and/or network I/O, the
resulting performance usually isn't very good.  In most cases you don't
need to have enough memory to fully cache the index data ... but less
than half a percent is not going to be enough.

Thanks,
Shawn



Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Kevin Risden
Thanks for the detailed answers Joe. Definitely sounds like you covered
most of the easy HDFS performance items.

Kevin Risden

On Wed, Nov 22, 2017 at 7:44 AM, Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Hi Kevin -
> * HDFS is part of Cloudera 5.12.0.
> * Solr is co-located in most cases.  We do have several nodes that run on
> servers that are not data nodes, but most do. Unfortunately, our nodes are
> not the same size.  Some nodes have 8TBytes of disk, while our largest
> nodes are 64TBytes.  This results in a lot of data that needs to go over
> the network.
>
> * Command is:
> /usr/lib/jvm/jre-1.8.0/bin/java -server -Xms12g -Xmx16g -Xss2m
> -XX:+UseG1GC -XX:MaxDirectMemorySize=11g -XX:+PerfDisableSharedMem
> -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=16m
> -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=75
> -XX:+UseLargePages -XX:ParallelGCThreads=16 -XX:-ResizePLAB
> -XX:+AggressiveOpts -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
> -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime
> -Xloggc:/opt/solr6/server/logs/solr_gc.log -XX:+UseGCLogFileRotation
> -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M -DzkClientTimeout=30
> -DzkHost=frodo.querymasters.com:2181,bilbo.querymasters.com:2181,
> gandalf.querymasters.com:2181,cordelia.querymasters.com:2181,cressida.
> querymasters.com:2181/solr6.6.0 -Dsolr.log.dir=/opt/solr6/server/logs
> -Djetty.port=9100 -DSTOP.PORT=8100 -DSTOP.KEY=solrrocks -Dhost=tarvos
> -Duser.timezone=UTC -Djetty.home=/opt/solr6/server
> -Dsolr.solr.home=/opt/solr6/server/solr -Dsolr.install.dir=/opt/solr6
> -Dsolr.clustering.enabled=true -Dsolr.lock.type=hdfs
> -Dsolr.autoSoftCommit.maxTime=12 -Dsolr.autoCommit.maxTime=180
> -Dsolr.solr.home=/etc/solr6 -Djava.library.path=/opt/cloud
> era/parcels/CDH/lib/hadoop/lib/native -Xss256k -Dsolr.log.muteconsole
> -XX:OnOutOfMemoryError=/opt/solr6/bin/oom_solr.sh 9100
> /opt/solr6/server/logs -jar start.jar --module=http
>
> * We have enabled short circuit reads.
>
> Right now, we have a relatively small block cache due to the requirements
> that the servers run other software.  We tried to find the best balance
> between block cache size, and RAM for programs, while still giving enough
> for local FS cache.  This came out to be 84 128M blocks - or about 10G for
> the cache per node (45 nodes total).
>
> <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
>   <bool name="solr.hdfs.blockcache.enabled">true</bool>
>   <bool name="solr.hdfs.blockcache.global">true</bool>
>   <int name="solr.hdfs.blockcache.slab.count">84</int>
>   <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
>   <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
>   <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
>   <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
>   <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">128</int>
>   <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">1024</int>
>   <str name="solr.hdfs.home">hdfs://nameservice1:8020/solr6.6.0</str>
>   <str name="solr.hdfs.confdir">/etc/hadoop/conf.cloudera.hdfs1</str>
> </directoryFactory>
>
> Thanks for reviewing!
>
> -Joe
>
>
>
> On 11/22/2017 8:20 AM, Kevin Risden wrote:
>
>> Joe,
>>
>> I have a few questions about your Solr and HDFS setup that could help
>> improve the recovery performance.
>>
>> * Is HDFS part of a distribution from Hortonworks, Cloudera, etc?
>> * Is Solr colocated with HDFS data nodes?
>> * What is the output of "ps aux | grep solr"? (specifically looking for
>> the
>> Java arguments that are being set.)
>>
>> Depending on how Solr on HDFS was setup, there are some potentially simple
>> settings that can help significantly improve performance.
>>
>> 1) Short circuit reads
>>
>> If Solr is colocated with an HDFS datanode, short circuit reads can
>> improve
>> read performance since it skips a network hop if the data is local to that
>> node. This requires HDFS native libraries to be added to Solr.
>>
>> 2) HDFS block cache in Solr
>>
>> Solr without HDFS uses the OS page cache to handle caching data for
>> queries. With HDFS, Solr has a special HDFS block cache which allows for
>> caching HDFS blocks. This significantly helps query performance. There are
>> a few configuration parameters that can help here.
>>
>> Kevin Risden
>>
>> On Wed, Nov 22, 2017 at 4:20 AM, Hendrik Haddorp > >
>> wrote:
>>
>> Hi Joe,
>>>
>>> sorry, I have not seen that problem. I would normally not delete a
>>> replica
>>> if the shard is down but only if there is an active shard. Without an
>>> active leader the replica should not be able to recover. I also just had
>>> a
>>> case where all replicas of a shard stayed in down state and restarts
>>> didn't
>>> help. This was however also caused by lock files. Once I cleaned them up
>>> and restarted all Solr instances that had a replica they recovered.
>>>
>>> For the lock files I discovered that the index is not always in the
>>> "index" folder but can also be in an index. folder. There can
>>> be
>>> an "index.properties" file in the "data" directory in HDFS and this
>>> contains the correct index folder name.
>>>
>>> If you are really desperate you could also delete all but one replica so
>>> that the leader election is quite trivial. But this does of course
>>> increase
>>> the risk of finally losing the data quite a bit. So I would try looking
>>> into the code and figure out what the p

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Joe Obernberger

Hi Kevin -
* HDFS is part of Cloudera 5.12.0.
* Solr is co-located in most cases.  We do have several nodes that run 
on servers that are not data nodes, but most do. Unfortunately, our 
nodes are not the same size.  Some nodes have 8TBytes of disk, while our 
largest nodes are 64TBytes.  This results in a lot of data that needs to 
go over the network.


* Command is:
/usr/lib/jvm/jre-1.8.0/bin/java -server -Xms12g -Xmx16g -Xss2m 
-XX:+UseG1GC -XX:MaxDirectMemorySize=11g -XX:+PerfDisableSharedMem 
-XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=16m 
-XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=75 
-XX:+UseLargePages -XX:ParallelGCThreads=16 -XX:-ResizePLAB 
-XX:+AggressiveOpts -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails 
-XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps 
-XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime 
-Xloggc:/opt/solr6/server/logs/solr_gc.log -XX:+UseGCLogFileRotation 
-XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M -DzkClientTimeout=30 
-DzkHost=frodo.querymasters.com:2181,bilbo.querymasters.com:2181,gandalf.querymasters.com:2181,cordelia.querymasters.com:2181,cressida.querymasters.com:2181/solr6.6.0 
-Dsolr.log.dir=/opt/solr6/server/logs -Djetty.port=9100 -DSTOP.PORT=8100 
-DSTOP.KEY=solrrocks -Dhost=tarvos -Duser.timezone=UTC 
-Djetty.home=/opt/solr6/server -Dsolr.solr.home=/opt/solr6/server/solr 
-Dsolr.install.dir=/opt/solr6 -Dsolr.clustering.enabled=true 
-Dsolr.lock.type=hdfs -Dsolr.autoSoftCommit.maxTime=12 
-Dsolr.autoCommit.maxTime=180 -Dsolr.solr.home=/etc/solr6 
-Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native 
-Xss256k -Dsolr.log.muteconsole 
-XX:OnOutOfMemoryError=/opt/solr6/bin/oom_solr.sh 9100 
/opt/solr6/server/logs -jar start.jar --module=http


* We have enabled short circuit reads.

Right now, we have a relatively small block cache due to the 
requirements that the servers run other software.  We tried to find the 
best balance between block cache size, and RAM for programs, while still 
giving enough for local FS cache.  This came out to be 84 128M blocks - 
or about 10G for the cache per node (45 nodes total).



    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
        <bool name="solr.hdfs.blockcache.enabled">true</bool>
        <bool name="solr.hdfs.blockcache.global">true</bool>
        <int name="solr.hdfs.blockcache.slab.count">84</int>
        <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
        <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
        <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
        <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
        <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">128</int>
        <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">1024</int>
        <str name="solr.hdfs.home">hdfs://nameservice1:8020/solr6.6.0</str>
        <str name="solr.hdfs.confdir">/etc/hadoop/conf.cloudera.hdfs1</str>
    </directoryFactory>

Thanks for reviewing!
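
One sanity check on that sizing, assuming the default 8 KB cache block size 
(16384 blocks per slab works out to 128 MB per slab):

  echo $(( 84 * 16384 * 8 / 1024 )) MiB   # 10752 MiB (~10.5 GiB) of off-heap cache; must fit under -XX:MaxDirectMemorySize=11g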

-Joe


On 11/22/2017 8:20 AM, Kevin Risden wrote:

Joe,

I have a few questions about your Solr and HDFS setup that could help
improve the recovery performance.

* Is HDFS part of a distribution from Hortonworks, Cloudera, etc?
* Is Solr colocated with HDFS data nodes?
* What is the output of "ps aux | grep solr"? (specifically looking for the
Java arguments that are being set.)

Depending on how Solr on HDFS was setup, there are some potentially simple
settings that can help significantly improve performance.

1) Short circuit reads

If Solr is colocated with an HDFS datanode, short circuit reads can improve
read performance since it skips a network hop if the data is local to that
node. This requires HDFS native libraries to be added to Solr.

2) HDFS block cache in Solr

Solr without HDFS uses the OS page cache to handle caching data for
queries. With HDFS, Solr has a special HDFS block cache which allows for
caching HDFS blocks. This significantly helps query performance. There are
a few configuration parameters that can help here.

Kevin Risden

On Wed, Nov 22, 2017 at 4:20 AM, Hendrik Haddorp 
wrote:


Hi Joe,

sorry, I have not seen that problem. I would normally not delete a replica
if the shard is down but only if there is an active shard. Without an
active leader the replica should not be able to recover. I also just had a
case where all replicas of a shard stayed in down state and restarts didn't
help. This was however also caused by lock files. Once I cleaned them up
and restarted all Solr instances that had a replica they recovered.

For the lock files I discovered that the index is not always in the
"index" folder but can also be in an index. folder. There can be
an "index.properties" file in the "data" directory in HDFS and this
contains the correct index folder name.

If you are really desperate you could also delete all but one replica so
that the leader election is quite trivial. But this does of course increase
the risk of finally losing the data quite a bit. So I would try looking
into the code and figure out what the problem is here and maybe compare the
state in HDFS and ZK with a shard that works.

regards,
Hendrik


On 21.11.2017 23:57, Joe Obernberger wrote:


Hi Hendrik - the shards in question have three replicas.  I tried
restarting each one (one by one) - no luck.  No leader is found. I deleted
one of the replicas and added a new one, and the new one also shows as
'down'.  I also tried the FORCELEADER call, but that had no effect.  I
checked the OVERSEERSTATUS, but there is nothing u

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Kevin Risden
Joe,

I have a few questions about your Solr and HDFS setup that could help
improve the recovery performance.

* Is HDFS part of a distribution from Hortonworks, Cloudera, etc?
* Is Solr colocated with HDFS data nodes?
* What is the output of "ps aux | grep solr"? (specifically looking for the
Java arguments that are being set.)

Depending on how Solr on HDFS was setup, there are some potentially simple
settings that can help significantly improve performance.

1) Short circuit reads

If Solr is colocated with an HDFS datanode, short circuit reads can improve
read performance since it skips a network hop if the data is local to that
node. This requires HDFS native libraries to be added to Solr.
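
If you want to verify short circuit reads from the client side, something along 
these lines should work (assuming the Hadoop client configs and native libs are 
present on the node):

  hdfs getconf -confKey dfs.client.read.shortcircuit   # should print true
  hdfs getconf -confKey dfs.domain.socket.path         # UNIX domain socket used for local reads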

2) HDFS block cache in Solr

Solr without HDFS uses the OS page cache to handle caching data for
queries. With HDFS, Solr has a special HDFS block cache which allows for
caching HDFS blocks. This significantly helps query performance. There are
a few configuration parameters that can help here.

Kevin Risden

On Wed, Nov 22, 2017 at 4:20 AM, Hendrik Haddorp 
wrote:

> Hi Joe,
>
> sorry, I have not seen that problem. I would normally not delete a replica
> if the shard is down but only if there is an active shard. Without an
> active leader the replica should not be able to recover. I also just had a
> case where all replicas of a shard stayed in down state and restarts didn't
> help. This was however also caused by lock files. Once I cleaned them up
> and restarted all Solr instances that had a replica they recovered.
>
> For the lock files I discovered that the index is not always in the
> "index" folder but can also be in an index. folder. There can be
> an "index.properties" file in the "data" directory in HDFS and this
> contains the correct index folder name.
>
> If you are really desperate you could also delete all but one replica so
> that the leader election is quite trivial. But this does of course increase
> the risk of finally losing the data quite a bit. So I would try looking
> into the code and figure out what the problem is here and maybe compare the
> state in HDFS and ZK with a shard that works.
>
> regards,
> Hendrik
>
>
> On 21.11.2017 23:57, Joe Obernberger wrote:
>
>> Hi Hendrik - the shards in question have three replicas.  I tried
>> restarting each one (one by one) - no luck.  No leader is found. I deleted
>> one of the replicas and added a new one, and the new one also shows as
>> 'down'.  I also tried the FORCELEADER call, but that had no effect.  I
>> checked the OVERSEERSTATUS, but there is nothing unusual there.  I don't
>> see anything useful in the logs except the error:
>>
>> org.apache.solr.common.SolrException: Error getting leader from zk for
>> shard shard21
>> at org.apache.solr.cloud.ZkController.getLeader(ZkController.
>> java:996)
>> at org.apache.solr.cloud.ZkController.register(ZkController.java:902)
>> at org.apache.solr.cloud.ZkController.register(ZkController.java:846)
>> at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(
>> ZkContainer.java:181)
>> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolE
>> xecutor.lambda$execute$0(ExecutorUtil.java:229)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>> Executor.java:1149)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>> lExecutor.java:624)
>> at java.lang.Thread.run(Thread.java:748)
>> Caused by: org.apache.solr.common.SolrException: Could not get leader
>> props
>> at org.apache.solr.cloud.ZkController.getLeaderProps(ZkControll
>> er.java:1043)
>> at org.apache.solr.cloud.ZkController.getLeaderProps(ZkControll
>> er.java:1007)
>> at org.apache.solr.cloud.ZkController.getLeader(ZkController.
>> java:963)
>> ... 7 more
>> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
>> KeeperErrorCode = NoNode for /collections/UNCLASS/leaders/shard21/leader
>> at org.apache.zookeeper.KeeperException.create(KeeperException.
>> java:111)
>> at org.apache.zookeeper.KeeperException.create(KeeperException.
>> java:51)
>> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
>> at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkCl
>> ient.java:357)
>> at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkCl
>> ient.java:354)
>> at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(Zk
>> CmdExecutor.java:60)
>> at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClie
>> nt.java:354)
>> at org.apache.solr.cloud.ZkController.getLeaderProps(ZkControll
>> er.java:1021)
>> ... 9 more
>>
>> Can I modify zookeeper to force a leader?  Is there any other way to
>> recover from this?  Thanks very much!
>>
>> -Joe
>>
>>
>> On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:
>>
>>> We sometimes also have replicas not recovering. If one replica is left
>> active the easiest is then to delete the replica and create a new one.
>>> When all replicas are down it h

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Joe Obernberger
Hi Hendrik - I was halting a replica and then restarting it, waited, 
then restarted another one.  That didn't work, but when I halted all 
three, and then restarted those one by one, the shard finally elected a 
leader and came up.  Phew!  I too noticed the lock files in 
index. folders.  Usually what I do is:

hadoop fs -ls -R /solr6.6.0 | grep write.lock > out.txt
then
cat out.txt | cut --bytes 57-
to get a list of files to delete
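
A variant of the same idea that doesn't depend on the byte offset, and removes 
the locks in one pass (only safe once every Solr instance that might own them is 
stopped; assumes no spaces in the paths):

  hadoop fs -ls -R /solr6.6.0 | grep write.lock | awk '{print $NF}' | xargs -r -n 1 hadoop fs -rm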

Glad these shards have come up!  Thanks very much.

-Joe


On 11/22/2017 5:20 AM, Hendrik Haddorp wrote:

Hi Joe,

sorry, I have not seen that problem. I would normally not delete a 
replica if the shard is down but only if there is an active shard. 
Without an active leader the replica should not be able to recover. I 
also just had a case where all replicas of a shard stayed in down 
state and restarts didn't help. This was however also caused by lock 
files. Once I cleaned them up and restarted all Solr instances that 
had a replica they recovered.


For the lock files I discovered that the index is not always in the 
"index" folder but can also be in an index. folder. There 
can be an "index.properties" file in the "data" directory in HDFS and 
this contains the correct index folder name.


If you are really desperate you could also delete all but one replica 
so that the leader election is quite trivial. But this does of course 
increase the risk of finally losing the data quite a bit. So I would 
try looking into the code and figure out what the problem is here and 
maybe compare the state in HDFS and ZK with a shard that works.


regards,
Hendrik

On 21.11.2017 23:57, Joe Obernberger wrote:
Hi Hendrik - the shards in question have three replicas.  I tried 
restarting each one (one by one) - no luck.  No leader is found. I 
deleted one of the replicas and added a new one, and the new one also 
shows as 'down'.  I also tried the FORCELEADER call, but that had no 
effect.  I checked the OVERSEERSTATUS, but there is nothing unusual 
there.  I don't see anything useful in the logs except the error:


org.apache.solr.common.SolrException: Error getting leader from zk 
for shard shard21
    at 
org.apache.solr.cloud.ZkController.getLeader(ZkController.java:996)
    at 
org.apache.solr.cloud.ZkController.register(ZkController.java:902)
    at 
org.apache.solr.cloud.ZkController.register(ZkController.java:846)
    at 
org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:181)
    at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Could not get leader 
props
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1043)
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1007)
    at 
org.apache.solr.cloud.ZkController.getLeader(ZkController.java:963)

    ... 7 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for /collections/UNCLASS/leaders/shard21/leader
    at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:51)

    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at 
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:357)
    at 
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:354)
    at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at 
org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:354)
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1021)

    ... 9 more

Can I modify zookeeper to force a leader?  Is there any other way to 
recover from this?  Thanks very much!


-Joe


On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:
We sometimes also have replicas not recovering. If one replica is 
left active the easiest is then to delete the replica and create 
a new one. When all replicas are down it helps most of the time to 
restart one of the nodes that contains a replica in down state. If 
that also doesn't get the replica to recover I would check the logs 
of the node and also that of the overseer node. I have seen the same 
issue on Solr using local storage. The main HDFS related issues we 
had so far was those lock files and if you delete and recreate 
collections/cores and it sometimes happens that the data was not 
cleaned up in HDFS and then causes a conflict.


Hendrik

On 21.11.2017 21:07, Joe Obernberger wrote:
We've never run an index this size in anything but HDFS, so I have 
no comparison.  What we've been doing is keeping two main 
collections - all data, and the last 30 days of data.  Then w

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-22 Thread Hendrik Haddorp

Hi Joe,

sorry, I have not seen that problem. I would normally not delete a 
replica if the shard is down but only if there is an active shard. 
Without an active leader the replica should not be able to recover. I 
also just had a case where all replicas of a shard stayed in down state 
and restarts didn't help. This was however also caused by lock files. 
Once I cleaned them up and restarted all Solr instances that had a 
replica they recovered.


For the lock files I discovered that the index is not always in the 
"index" folder but can also be in an index. folder. There can 
be an "index.properties" file in the "data" directory in HDFS and this 
contains the correct index folder name.
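
A quick way to check that from the shell (the path below is just an example core 
layout under the HDFS home dir, adjust to the core you are looking at):

  hadoop fs -cat /solr6.6.0/UNCLASS_shard21_replica1/data/index.properties
  # the index= line names the active folder, e.g. index=index.20171121...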


If you are really desperate you could also delete all but one replica so 
that the leader election is quite trivial. But this does of course 
increase the risk of finally losing the data quite a bit. So I would 
try looking into the code and figure out what the problem is here and 
maybe compare the state in HDFS and ZK with a shard that works.


regards,
Hendrik

On 21.11.2017 23:57, Joe Obernberger wrote:
Hi Hendrik - the shards in question have three replicas.  I tried 
restarting each one (one by one) - no luck.  No leader is found. I 
deleted one of the replicas and added a new one, and the new one also 
shows as 'down'.  I also tried the FORCELEADER call, but that had no 
effect.  I checked the OVERSEERSTATUS, but there is nothing unusual 
there.  I don't see anything useful in the logs except the error:


org.apache.solr.common.SolrException: Error getting leader from zk for 
shard shard21
    at 
org.apache.solr.cloud.ZkController.getLeader(ZkController.java:996)

    at org.apache.solr.cloud.ZkController.register(ZkController.java:902)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:846)
    at 
org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:181)
    at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Could not get leader 
props
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1043)
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1007)
    at 
org.apache.solr.cloud.ZkController.getLeader(ZkController.java:963)

    ... 7 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for /collections/UNCLASS/leaders/shard21/leader
    at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:51)

    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at 
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:357)
    at 
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:354)
    at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at 
org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:354)
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1021)

    ... 9 more

Can I modify zookeeper to force a leader?  Is there any other way to 
recover from this?  Thanks very much!


-Joe


On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:
We sometimes also have replicas not recovering. If one replica is 
left active the easiest is then to delete the replica and create a 
new one. When all replicas are down it helps most of the time to 
restart one of the nodes that contains a replica in down state. If 
that also doesn't get the replica to recover I would check the logs 
of the node and also that of the overseer node. I have seen the same 
issue on Solr using local storage. The main HDFS related issues we 
had so far was those lock files and if you delete and recreate 
collections/cores and it sometimes happens that the data was not 
cleaned up in HDFS and then causes a conflict.


Hendrik

On 21.11.2017 21:07, Joe Obernberger wrote:
We've never run an index this size in anything but HDFS, so I have 
no comparison.  What we've been doing is keeping two main 
collections - all data, and the last 30 days of data.  Then we 
handle queries based on date range. The 30 day index is 
significantly faster.


My main concern right now is that 6 of the 100 shards are not coming 
back because of no leader.  I've never seen this error before.  Any 
ideas?  ClusterStatus shows all three replicas with state 'down'.


Thanks!

-joe


On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
We actually also have some performance issue with HDFS at the 
moment. We are doing lots of soft commits for NRT search. Those 
seem to be slower than with local storage. The investigation is 

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-21 Thread Erick Erickson
Well, you can  always manually change the ZK nodes, but whether just
setting a node's state to "leader" in ZK then starting the Solr
instance hosting that node would work... I don't know. Do consider
running CheckIndex on one of the replicas in question first though.
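
If you do end up looking at ZK directly, it's worth inspecting before changing
anything; roughly this, assuming the /solr6.6.0 chroot used on this cluster:

  ./zkCli.sh -server frodo.querymasters.com:2181/solr6.6.0
  get /collections/UNCLASS/leaders/shard21/leader
  ls /collections/UNCLASS/leader_elect/shard21/election

Hand-editing those znodes really is a last resort.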

Best,
Erick

On Tue, Nov 21, 2017 at 3:06 PM, Joe Obernberger
 wrote:
> One other data point I just saw on one of the nodes.  It has the following
> error:
> 2017-11-21 22:59:48.886 ERROR
> (coreZkRegister-1-thread-1-processing-n:leda:9100_solr) [c:UNCLASS s:shard14
> r:core_node175 x:UNCLASS_shard14_replica3]
> o.a.s.c.ShardLeaderElectionContext There was a problem trying to register as
> the leader:org.apache.solr.common.SolrException: Leader Initiated Recovery
> prevented leadership
> at
> org.apache.solr.cloud.ShardLeaderElectionContext.checkLIR(ElectionContext.java:521)
> at
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:424)
> at
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
> at
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
> at
> org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:307)
> at
> org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:216)
> at
> org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:684)
> at
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:454)
> at
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
> at
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
> at
> org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:307)
> at
> org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:216)
> at
> org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:684)
> at
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:454)
> at
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
> at
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
> at
> org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:307)
> at
> org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:216)
>
> This stack trace repeats for a long while; looks like a recursive call.
>
>
> -Joe
>
>
> On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:
>>
>> We sometimes also have replicas not recovering. If one replica is left
>> active the easiest is then to delete the replica and create a new one.
>> When all replicas are down it helps most of the time to restart one of the
>> nodes that contains a replica in down state. If that also doesn't get the
>> replica to recover I would check the logs of the node and also that of the
>> overseer node. I have seen the same issue on Solr using local storage. The
>> main HDFS related issues we had so far was those lock files and if you
>> delete and recreate collections/cores and it sometimes happens that the data
>> was not cleaned up in HDFS and then causes a conflict.
>>
>> Hendrik
>>
>> On 21.11.2017 21:07, Joe Obernberger wrote:
>>>
>>> We've never run an index this size in anything but HDFS, so I have no
>>> comparison.  What we've been doing is keeping two main collections - all
>>> data, and the last 30 days of data.  Then we handle queries based on date
>>> range. The 30 day index is significantly faster.
>>>
>>> My main concern right now is that 6 of the 100 shards are not coming back
>>> because of no leader.  I've never seen this error before.  Any ideas?
>>> ClusterStatus shows all three replicas with state 'down'.
>>>
>>> Thanks!
>>>
>>> -joe
>>>
>>>
>>> On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:

 We actually also have some performance issue with HDFS at the moment. We
 are doing lots of soft commits for NRT search. Those seem to be slower than
 with local storage. The investigation is however not really far yet.

 We have a setup with 2000 collections, with one shard each and a
 replication factor of 2 or 3. When we restart nodes too fast that causes
 problems with the overseer queue, which can lead to the queue getting out 
 of
 control and Solr pretty much dying. We are still on Solr 6.3. 6.6 has some
 improvements and should handle these actions faster. I would check what you
 see for "/solr/admin/collections?action=OVERSEERSTATUS&wt=json". The
 critical part is the "overseer_queue_size" value. If this goes up to about
 1 it is pretty much game over on our setup. In that case it seems to be
 best to stop all nodes, clear the queue in ZK and then restart the nodes 
 one
 by one with a gap of like 5min. That normally recovers pretty well.
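
For reference, one quick way to watch that value from the shell (host and port
here are just examples; the grep assumes the default un-indented JSON output):

  curl -s 'http://localhost:9100/solr/admin/collections?action=OVERSEERSTATUS&wt=json' | grep -o '"overseer_queue_size":[0-9]*'
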
>

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-21 Thread Joe Obernberger
One other data point I just saw on one of the nodes.  It has the 
following error:
2017-11-21 22:59:48.886 ERROR 
(coreZkRegister-1-thread-1-processing-n:leda:9100_solr) [c:UNCLASS 
s:shard14 r:core_node175 x:UNCLASS_shard14_replica3] 
o.a.s.c.ShardLeaderElectionContext There was a problem trying to 
register as the leader:org.apache.solr.common.SolrException: Leader 
Initiated Recovery prevented leadership
    at 
org.apache.solr.cloud.ShardLeaderElectionContext.checkLIR(ElectionContext.java:521)
    at 
org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:424)
    at 
org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
    at 
org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
    at 
org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:307)
    at 
org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:216)
    at 
org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:684)
    at 
org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:454)
    at 
org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
    at 
org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
    at 
org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:307)
    at 
org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:216)
    at 
org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:684)
    at 
org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:454)
    at 
org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
    at 
org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
    at 
org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:307)
    at 
org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:216)


This stack trace repeats for a long while; looks like a recursive call.

-Joe


On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:
We sometimes also have replicas not recovering. If one replica is left 
active the easiest is then to delete the replica and create a new 
one. When all replicas are down it helps most of the time to restart 
one of the nodes that contains a replica in down state. If that also 
doesn't get the replica to recover I would check the logs of the node 
and also that of the overseer node. I have seen the same issue on Solr 
using local storage. The main HDFS related issues we had so far was 
those lock files and if you delete and recreate collections/cores and 
it sometimes happens that the data was not cleaned up in HDFS and then 
causes a conflict.


Hendrik

On 21.11.2017 21:07, Joe Obernberger wrote:
We've never run an index this size in anything but HDFS, so I have no 
comparison.  What we've been doing is keeping two main collections - 
all data, and the last 30 days of data.  Then we handle queries based 
on date range. The 30 day index is significantly faster.


My main concern right now is that 6 of the 100 shards are not coming 
back because of no leader.  I've never seen this error before.  Any 
ideas?  ClusterStatus shows all three replicas with state 'down'.


Thanks!

-joe


On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
We actually also have some performance issue with HDFS at the 
moment. We are doing lots of soft commits for NRT search. Those seem 
to be slower than with local storage. The investigation is however 
not really far yet.


We have a setup with 2000 collections, with one shard each and a 
replication factor of 2 or 3. When we restart nodes too fast that 
causes problems with the overseer queue, which can lead to the queue 
getting out of control and Solr pretty much dying. We are still on 
Solr 6.3. 6.6 has some improvements and should handle these actions 
faster. I would check what you see for 
"/solr/admin/collections?action=OVERSEERSTATUS&wt=json". The 
critical part is the "overseer_queue_size" value. If this goes up to 
about 1 it is pretty much game over on our setup. In that case 
it seems to be best to stop all nodes, clear the queue in ZK and 
then restart the nodes one by one with a gap of like 5min. That 
normally recovers pretty well.


regards,
Hendrik

On 21.11.2017 20:12, Joe Obernberger wrote:
We set the hard commit time long because we were having performance 
issues with HDFS, and thought that since the block size is 128M, 
having a longer hard commit made sense.  That was our hypothesis 
anyway. Happy to switch it back and see what happens.


I don't know what caused the cluster to go into recovery in the 
first place.  We had a server die over the weekend, but it's just 
one out of ~50.  Every shard is 3x replicated (and 3x replicated in 
HDFS...so 9 copies).  It was at this point th

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-21 Thread Joe Obernberger
Hi Hendrik - the shards in question have three replicas.  I tried 
restarting each one (one by one) - no luck.  No leader is found.  I 
deleted one of the replicas and added a new one, and the new one also 
shows as 'down'.  I also tried the FORCELEADER call, but that had no 
effect.  I checked the OVERSEERSTATUS, but there is nothing unusual 
there.  I don't see anything useful in the logs except the error:


org.apache.solr.common.SolrException: Error getting leader from zk for 
shard shard21

    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:996)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:902)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:846)
    at 
org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:181)
    at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Could not get leader props
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1043)
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1007)

    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:963)
    ... 7 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for /collections/UNCLASS/leaders/shard21/leader
    at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:111)

    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at 
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:357)
    at 
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:354)
    at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at 
org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:354)
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1021)

    ... 9 more

Can I modify zookeeper to force a leader?  Is there any other way to 
recover from this?  Thanks very much!
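
For completeness, the force-leader call looks roughly like this (host and port
are examples); it only helps if at least one replica is actually live:

  curl 'http://localhost:9100/solr/admin/collections?action=FORCELEADER&collection=UNCLASS&shard=shard21'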


-Joe


On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:
We sometimes also have replicas not recovering. If one replica is left 
active the easiest is then to delete the replica and create a new 
one. When all replicas are down it helps most of the time to restart 
one of the nodes that contains a replica in down state. If that also 
doesn't get the replica to recover I would check the logs of the node 
and also that of the overseer node. I have seen the same issue on Solr 
using local storage. The main HDFS related issues we had so far was 
those lock files and if you delete and recreate collections/cores and 
it sometimes happens that the data was not cleaned up in HDFS and then 
causes a conflict.


Hendrik

On 21.11.2017 21:07, Joe Obernberger wrote:
We've never run an index this size in anything but HDFS, so I have no 
comparison.  What we've been doing is keeping two main collections - 
all data, and the last 30 days of data.  Then we handle queries based 
on date range. The 30 day index is significantly faster.


My main concern right now is that 6 of the 100 shards are not coming 
back because of no leader.  I've never seen this error before.  Any 
ideas?  ClusterStatus shows all three replicas with state 'down'.


Thanks!

-joe


On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
We actually also have some performance issue with HDFS at the 
moment. We are doing lots of soft commits for NRT search. Those seem 
to be slower than with local storage. The investigation is however 
not really far yet.


We have a setup with 2000 collections, with one shard each and a 
replication factor of 2 or 3. When we restart nodes too fast that 
causes problems with the overseer queue, which can lead to the queue 
getting out of control and Solr pretty much dying. We are still on 
Solr 6.3. 6.6 has some improvements and should handle these actions 
faster. I would check what you see for 
"/solr/admin/collections?action=OVERSEERSTATUS&wt=json". The 
critical part is the "overseer_queue_size" value. If this goes up to 
about 1 it is pretty much game over on our setup. In that case 
it seems to be best to stop all nodes, clear the queue in ZK and 
then restart the nodes one by one with a gap of like 5min. That 
normally recovers pretty well.


regards,
Hendrik

On 21.11.2017 20:12, Joe Obernberger wrote:
We set the hard commit time long because we were having performance 
issues with HDFS, and thought that since the block size is 128M, 
having a longer hard commit made sense.  That was our hypothesis 
anyway. Happy to switch it back and see what happens.


I don't know what caused

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-21 Thread Joe Obernberger
Thank you Erick.  I've set the RamBufferSize to 1G; perhaps higher would 
be beneficial.  One more data point is that if I restart a node, more 
often than not, it goes into recovery, beats up the network for a while, 
and then goes green.  This happens even if I do no indexing between 
restarts.  Is that expected? Sometimes this can take longer than 20 
minutes.  No new data was added to the index between the restarts.


-Joe


On 11/21/2017 3:43 PM, Erick Erickson wrote:

bq: We are doing lots of soft commits for NRT search...

It's not surprising that this is slower than local storage, especially
if you have any autowarming going on. Opening  new searchers will need
to read data from disk for the new segments, and HDFS may be slower
here.

As far as the commit interval, an under-appreciated event is that when
RAMBufferSizeMB is exceeded (default 100M last I knew) new segments
are written _anyway_, they're just a little invisible. That is, the
segments_n file isn't updated even though they're closed IIUC at
least. So that very long interval isn't helping with that problem I
don't think

Evidence to the contrary trumps my understanding of course.

About starting all these collections up at once and the Overseer
queue. I've seen this in similar situations. There are a _lot_ of
messages flying back and forth for each replica on startup, and the
Overseer processing was very inefficient historically so that queue
could get in the 100s of K, I've seen some pathological situations
where it's over 1M. SOLR-10524 made this a lot better. There are still
a lot of messages written in a case like yours, but at least the
Overseer has a much better chance to keep up Solr 6.6... At that
point bringing up Solr took a very long time.


Erick

On Tue, Nov 21, 2017 at 12:24 PM, Hendrik Haddorp
 wrote:

We sometimes also have replicas not recovering. If one replica is left
active the easiest is then to delete the replica and create a new one.
When all replicas are down it helps most of the time to restart one of the
nodes that contains a replica in down state. If that also doesn't get the
replica to recover I would check the logs of the node and also that of the
overseer node. I have seen the same issue on Solr using local storage. The
main HDFS related issues we had so far was those lock files and if you
delete and recreate collections/cores and it sometimes happens that the data
was not cleaned up in HDFS and then causes a conflict.

Hendrik


On 21.11.2017 21:07, Joe Obernberger wrote:

We've never run an index this size in anything but HDFS, so I have no
comparison.  What we've been doing is keeping two main collections - all
data, and the last 30 days of data.  Then we handle queries based on date
range.  The 30 day index is significantly faster.

My main concern right now is that 6 of the 100 shards are not coming back
because of no leader.  I've never seen this error before.  Any ideas?
ClusterStatus shows all three replicas with state 'down'.

Thanks!

-joe


On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:

We actually also have some performance issue with HDFS at the moment. We
are doing lots of soft commits for NRT search. Those seem to be slower than
with local storage. The investigation is however not really far yet.

We have a setup with 2000 collections, with one shard each and a
replication factor of 2 or 3. When we restart nodes too fast that causes
problems with the overseer queue, which can lead to the queue getting out of
control and Solr pretty much dying. We are still on Solr 6.3. 6.6 has some
improvements and should handle these actions faster. I would check what you
see for "/solr/admin/collections?action=OVERSEERSTATUS&wt=json". The
critical part is the "overseer_queue_size" value. If this goes up to about
1 it is pretty much game over on our setup. In that case it seems to be
best to stop all nodes, clear the queue in ZK and then restart the nodes one
by one with a gap of like 5min. That normally recovers pretty well.

regards,
Hendrik

On 21.11.2017 20:12, Joe Obernberger wrote:

We set the hard commit time long because we were having performance
issues with HDFS, and thought that since the block size is 128M, having a
longer hard commit made sense.  That was our hypothesis anyway.  Happy to
switch it back and see what happens.

I don't know what caused the cluster to go into recovery in the first
place.  We had a server die over the weekend, but it's just one out of ~50.
Every shard is 3x replicated (and 3x replicated in HDFS...so 9 copies).  It
was at this point that we noticed lots of network activity, and most of the
shards in this recovery, fail, retry loop.  That is when we decided to shut
it down resulting in zombie lock files.

I tried using the FORCELEADER call, which completed, but doesn't seem to
have any effect on the shards that have no leader. Kinda out of ideas for
that problem.  If I can get the cluster back up, I'll try a lower hard
commit time.  Thanks again Erick!

-J

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-21 Thread Erick Erickson
bq: We are doing lots of soft commits for NRT search...

It's not surprising that this is slower than local storage, especially
if you have any autowarming going on. Opening  new searchers will need
to read data from disk for the new segments, and HDFS may be slower
here.

As far as the commit interval, an under-appreciated event is that when
RAMBufferSizeMB is exceeded (default 100M last I knew) new segments
are written _anyway_, they're just a little invisible. That is, the
segments_n file isn't updated even though they're closed IIUC at
least. So that very long interval isn't helping with that problem I
don't think

Evidence to the contrary trumps my understanding of course.
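
A quick way to see what a collection actually has for ramBufferSizeMB (host and
collection name are examples; assumes the default compact JSON output):

  curl -s 'http://localhost:9100/solr/UNCLASS/config?wt=json' | grep -o '"ramBufferSizeMB":[0-9.]*'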

About starting all these collections up at once and the Overseer
queue. I've seen this in similar situations. There are a _lot_ of
messages flying back and forth for each replica on startup, and the
Overseer processing was very inefficient historically so that queue
could get in the 100s of K, I've seen some pathological situations
where it's over 1M. SOLR-10524 made this a lot better. There are still
a lot of messages written in a case like yours, but at least the
Overseer has a much better chance to keep up Solr 6.6... At that
point bringing up Solr took a very long time.


Erick

On Tue, Nov 21, 2017 at 12:24 PM, Hendrik Haddorp
 wrote:
> We sometimes also have replicas not recovering. If one replica is left
> active the easiest is then to delete the replica and create a new one.
> When all replicas are down it helps most of the time to restart one of the
> nodes that contains a replica in down state. If that also doesn't get the
> replica to recover I would check the logs of the node and also that of the
> overseer node. I have seen the same issue on Solr using local storage. The
> main HDFS related issues we had so far was those lock files and if you
> delete and recreate collections/cores and it sometimes happens that the data
> was not cleaned up in HDFS and then causes a conflict.
>
> Hendrik
>
>
> On 21.11.2017 21:07, Joe Obernberger wrote:
>>
>> We've never run an index this size in anything but HDFS, so I have no
>> comparison.  What we've been doing is keeping two main collections - all
>> data, and the last 30 days of data.  Then we handle queries based on date
>> range.  The 30 day index is significantly faster.
>>
>> My main concern right now is that 6 of the 100 shards are not coming back
>> because of no leader.  I've never seen this error before.  Any ideas?
>> ClusterStatus shows all three replicas with state 'down'.
>>
>> Thanks!
>>
>> -joe
>>
>>
>> On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
>>>
>>> We actually also have some performance issue with HDFS at the moment. We
>>> are doing lots of soft commits for NRT search. Those seem to be slower than
>>> with local storage. The investigation is however not really far yet.
>>>
>>> We have a setup with 2000 collections, with one shard each and a
>>> replication factor of 2 or 3. When we restart nodes too fast that causes
>>> problems with the overseer queue, which can lead to the queue getting out of
>>> control and Solr pretty much dying. We are still on Solr 6.3. 6.6 has some
>>> improvements and should handle these actions faster. I would check what you
>>> see for "/solr/admin/collections?action=OVERSEERSTATUS&wt=json". The
>>> critical part is the "overseer_queue_size" value. If this goes up to about
>>> 1 it is pretty much game over on our setup. In that case it seems to be
>>> best to stop all nodes, clear the queue in ZK and then restart the nodes one
>>> by one with a gap of like 5min. That normally recovers pretty well.
>>>
>>> regards,
>>> Hendrik
>>>
>>> On 21.11.2017 20:12, Joe Obernberger wrote:

 We set the hard commit time long because we were having performance
 issues with HDFS, and thought that since the block size is 128M, having a
 longer hard commit made sense.  That was our hypothesis anyway.  Happy to
 switch it back and see what happens.

 I don't know what caused the cluster to go into recovery in the first
 place.  We had a server die over the weekend, but it's just one out of ~50.
 Every shard is 3x replicated (and 3x replicated in HDFS...so 9 copies).  It
 was at this point that we noticed lots of network activity, and most of the
 shards in this recovery, fail, retry loop.  That is when we decided to shut
 it down resulting in zombie lock files.

 I tried using the FORCELEADER call, which completed, but doesn't seem to
 have any effect on the shards that have no leader. Kinda out of ideas for
 that problem.  If I can get the cluster back up, I'll try a lower hard
 commit time.  Thanks again Erick!

 -Joe


 On 11/21/2017 2:00 PM, Erick Erickson wrote:
>
> Frankly with HDFS I'm a bit out of my depth so listen to Hendrik ;)...
>
> I need to back up a bit. Once nodes are in this state it's not
> surprising that the

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-21 Thread Hendrik Haddorp
We sometimes also have replicas not recovering. If one replica is left 
active the easiest is then to delete the replica and create a new 
one. When all replicas are down it helps most of the time to restart one 
of the nodes that contains a replica in down state. If that also doesn't 
get the replica to recover I would check the logs of the node and also 
that of the overseer node. I have seen the same issue on Solr using 
local storage. The main HDFS related issues we had so far was those lock 
files and if you delete and recreate collections/cores and it sometimes 
happens that the data was not cleaned up in HDFS and then causes a conflict.


Hendrik

On 21.11.2017 21:07, Joe Obernberger wrote:
We've never run an index this size in anything but HDFS, so I have no 
comparison.  What we've been doing is keeping two main collections - 
all data, and the last 30 days of data.  Then we handle queries based 
on date range.  The 30 day index is significantly faster.


My main concern right now is that 6 of the 100 shards are not coming 
back because of no leader.  I've never seen this error before.  Any 
ideas?  ClusterStatus shows all three replicas with state 'down'.


Thanks!

-joe


On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
We actually also have some performance issue with HDFS at the moment. 
We are doing lots of soft commits for NRT search. Those seem to be 
slower than with local storage. The investigation is however not 
really far yet.


We have a setup with 2000 collections, with one shard each and a 
replication factor of 2 or 3. When we restart nodes too fast that 
causes problems with the overseer queue, which can lead to the queue 
getting out of control and Solr pretty much dying. We are still on 
Solr 6.3. 6.6 has some improvements and should handle these actions 
faster. I would check what you see for 
"/solr/admin/collections?action=OVERSEERSTATUS&wt=json". The critical 
part is the "overseer_queue_size" value. If this goes up to about 
1 it is pretty much game over on our setup. In that case it seems 
to be best to stop all nodes, clear the queue in ZK and then restart 
the nodes one by one with a gap of like 5min. That normally recovers 
pretty well.


regards,
Hendrik

On 21.11.2017 20:12, Joe Obernberger wrote:
We set the hard commit time long because we were having performance 
issues with HDFS, and thought that since the block size is 128M, 
having a longer hard commit made sense.  That was our hypothesis 
anyway.  Happy to switch it back and see what happens.


I don't know what caused the cluster to go into recovery in the 
first place.  We had a server die over the weekend, but it's just 
one out of ~50.  Every shard is 3x replicated (and 3x replicated in 
HDFS...so 9 copies).  It was at this point that we noticed lots of 
network activity, and most of the shards in this recovery, fail, 
retry loop.  That is when we decided to shut it down resulting in 
zombie lock files.


I tried using the FORCELEADER call, which completed, but doesn't 
seem to have any effect on the shards that have no leader. Kinda out 
of ideas for that problem.  If I can get the cluster back up, I'll 
try a lower hard commit time.  Thanks again Erick!


-Joe


On 11/21/2017 2:00 PM, Erick Erickson wrote:

Frankly with HDFS I'm a bit out of my depth so listen to Hendrik ;)...

I need to back up a bit. Once nodes are in this state it's not
surprising that they need to be forcefully killed. I was more thinking
about how they got in this situation in the first place. _Before_ you
get into the nasty state how are the Solr nodes shut down? Forcefully?

Your hard commit is far longer than it needs to be, resulting in much
larger tlog files etc. I usually set this at 15-60 seconds with local
disks, not quite sure whether longer intervals are helpful on HDFS.
What this means is that you can spend up to 30 minutes when you
restart solr _replaying the tlogs_! If Solr is killed, it may not have
had a chance to fsync the segments and may have to replay on startup.
If you have openSearcher set to false, the hard commit operation is
not horribly expensive, it just fsync's the current segments and opens
new ones. It won't be a total cure, but I bet reducing this interval
would help a lot.

Also, if you stop indexing there's no need to wait 30 minutes if you
issue a manual commit, something like
.../collection/update?commit=true. Just reducing the hard commit
interval will make the wait between stopping indexing and restarting
shorter all by itself if you don't want to issue the manual commit.

Best,
Erick

On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorp
 wrote:

Hi,

the write.lock issue I see as well when Solr has not been stopped 
gracefully. The write.lock files are then left in HDFS as they do not get 
removed automatically when the client disconnects, like an ephemeral node in 
ZooKeeper. Unfortunately Solr also does not realize that it should 
be owning the lock as it is marked in the state stored in ZooKee

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-21 Thread Joe Obernberger
We've never run an index this size in anything but HDFS, so I have no 
comparison.  What we've been doing is keeping two main collections - all 
data, and the last 30 days of data.  Then we handle queries based on 
date range.  The 30 day index is significantly faster.


My main concern right now is that 6 of the 100 shards are not coming 
back because of no leader.  I've never seen this error before.  Any 
ideas?  ClusterStatus shows all three replicas with state 'down'.


Thanks!

-joe
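
If it helps, here is a small SolrJ sketch (assumed 6.x API) that walks the 
cluster state and prints the shards that currently have no elected leader 
together with the state of each replica, which is roughly what CLUSTERSTATUS 
reports; the collection name and ZooKeeper address are placeholders:

// List shards of a collection that have no leader and show their replicas.
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class LeaderCheck {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
            .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {
      client.connect();
      DocCollection coll =
          client.getZkStateReader().getClusterState().getCollection("UNCLASS");
      for (Slice slice : coll.getSlices()) {
        if (slice.getLeader() == null) {
          System.out.println(slice.getName() + " has no leader");
          for (Replica replica : slice.getReplicas()) {
            System.out.println("  " + replica.getName() + ": " + replica.getState());
          }
        }
      }
    }
  }
}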


On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
We actually also have some performance issues with HDFS at the moment. 
We are doing lots of soft commits for NRT search. Those seem to be 
slower than with local storage. The investigation is, however, not 
really far along yet.


We have a setup with 2000 collections, with one shard each and a 
replication factor of 2 or 3. When we restart nodes too fast that 
causes problems with the overseer queue, which can lead to the queue 
getting out of control and Solr pretty much dying. We are still on 
Solr 6.3. 6.6 has some improvements and should handle these actions 
faster. I would check what you see for 
"/solr/admin/collections?action=OVERSEERSTATUS&wt=json". The critical 
part is the "overseer_queue_size" value. If this goes up to about 
1 it is pretty much game over on our setup. In that case it seems 
to be best to stop all nodes, clear the queue in ZK and then restart 
the nodes one by one with a gap of like 5min. That normally recovers 
pretty well.


regards,
Hendrik

On 21.11.2017 20:12, Joe Obernberger wrote:
We set the hard commit time long because we were having performance 
issues with HDFS, and thought that since the block size is 128M, 
having a longer hard commit made sense.  That was our hypothesis 
anyway.  Happy to switch it back and see what happens.


I don't know what caused the cluster to go into recovery in the first 
place.  We had a server die over the weekend, but it's just one out 
of ~50.  Every shard is 3x replicated (and 3x replicated in HDFS...so 
9 copies).  It was at this point that we noticed lots of network 
activity, and most of the shards stuck in this recovery, fail, retry loop.  
That is when we decided to shut it down, resulting in zombie lock files.


I tried using the FORCELEADER call, which completed, but doesn't seem 
to have any effect on the shards that have no leader. Kinda out of 
ideas for that problem.  If I can get the cluster back up, I'll try a 
lower hard commit time.  Thanks again Erick!


-Joe


On 11/21/2017 2:00 PM, Erick Erickson wrote:

Frankly with HDFS I'm a bit out of my depth so listen to Hendrik ;)...

I need to back up a bit. Once nodes are in this state it's not
surprising that they need to be forcefully killed. I was more thinking
about how they got in this situation in the first place. _Before_ you
get into the nasty state how are the Solr nodes shut down? Forcefully?

Your hard commit is far longer than it needs to be, resulting in much
larger tlog files etc. I usually set this at 15-60 seconds with local
disks, not quite sure whether longer intervals are helpful on HDFS.
What this means is that you can spend up to 30 minutes when you
restart solr _replaying the tlogs_! If Solr is killed, it may not have
had a chance to fsync the segments and may have to replay on startup.
If you have openSearcher set to false, the hard commit operation is
not horribly expensive, it just fsync's the current segments and opens
new ones. It won't be a total cure, but I bet reducing this interval
would help a lot.

Also, if you stop indexing there's no need to wait 30 minutes if you
issue a manual commit, something like
.../collection/update?commit=true. Just reducing the hard commit
interval will make the wait between stopping indexing and restarting
shorter all by itself if you don't want to issue the manual commit.

Best,
Erick

On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorp
 wrote:

Hi,

the write.lock issue I see as well when Solr has not been stopped 
gracefully. The write.lock files are then left in HDFS as they do not get 
removed automatically when the client disconnects, like an ephemeral node in 
ZooKeeper. Unfortunately Solr also does not realize that it should be 
owning the lock as it is marked in the state stored in ZooKeeper as the 
owner, and it is also not willing to retry, which is why you need to 
restart the whole Solr instance after the cleanup. I added some logic to 
my Solr start up script which scans the lock files in HDFS, compares that 
with the state in ZooKeeper, and then deletes all lock files that belong 
to the node that I'm starting.

regards,
Hendrik


On 21.11.2017 14:07, Joe Obernberger wrote:
Hi All - we have a system with 45 physical boxes running solr 
6.6.1 using

HDFS as the index.  The current index size is about 31TBytes. With 3x
replication that takes up 93TBytes of disk. Our main collection is 
split
across 100 shards with 3 replicas each.  The issue that we're 
running into
is when restarting

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-21 Thread Hendrik Haddorp
We actually also have some performance issues with HDFS at the moment. We 
are doing lots of soft commits for NRT search. Those seem to be slower 
than with local storage. The investigation is, however, not really far along yet.


We have a setup with 2000 collections, with one shard each and a 
replication factor of 2 or 3. When we restart nodes too fast that causes 
problems with the overseer queue, which can lead to the queue getting 
out of control and Solr pretty much dying. We are still on Solr 6.3. 6.6 
has some improvements and should handle these actions faster. I would 
check what you see for 
"/solr/admin/collections?action=OVERSEERSTATUS&wt=json". The critical 
part is the "overseer_queue_size" value. If this goes up to about 1 
it is pretty much game over on our setup. In that case it seems to be 
best to stop all nodes, clear the queue in ZK and then restart the nodes 
one by one with a gap of like 5min. That normally recovers pretty well.


regards,
Hendrik
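
A rough Java sketch of that check, plus the last-resort queue cleanup; the Solr 
host and ZooKeeper connect string are assumptions, and, as described above, 
clearing /overseer/queue should only be done while all Solr nodes are stopped:

// Fetch OVERSEERSTATUS and, only as a last resort with all nodes stopped,
// clear the overseer queue in ZooKeeper. Host names are placeholders.
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class OverseerQueueCheck {
  public static void main(String[] args) throws Exception {
    URL url = new URL(
        "http://solr-host:8983/solr/admin/collections?action=OVERSEERSTATUS&wt=json");
    try (InputStream in = url.openStream();
         Scanner s = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
      // Look at "overseer_queue_size" in the response; if it keeps growing,
      // the cluster is likely heading into the state described above.
      System.out.println(s.hasNext() ? s.next() : "");
    }
  }

  // Last resort: delete the children of /overseer/queue while all Solr nodes
  // are stopped, then restart them one by one.
  static void clearOverseerQueue(String zkConnect) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        zkConnect, new ExponentialBackoffRetry(1000, 3));
    zk.start();
    for (String child : zk.getChildren().forPath("/overseer/queue")) {
      zk.delete().forPath("/overseer/queue/" + child);
    }
    zk.close();
  }
}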

On 21.11.2017 20:12, Joe Obernberger wrote:
We set the hard commit time long because we were having performance 
issues with HDFS, and thought that since the block size is 128M, 
having a longer hard commit made sense.  That was our hypothesis 
anyway.  Happy to switch it back and see what happens.


I don't know what caused the cluster to go into recovery in the first 
place.  We had a server die over the weekend, but it's just one out of 
~50.  Every shard is 3x replicated (and 3x replicated in HDFS...so 9 
copies).  It was at this point that we noticed lots of network 
activity, and most of the shards stuck in this recovery, fail, retry loop.  
That is when we decided to shut it down, resulting in zombie lock files.


I tried using the FORCELEADER call, which completed, but doesn't seem 
to have any effect on the shards that have no leader.  Kinda out of 
ideas for that problem.  If I can get the cluster back up, I'll try a 
lower hard commit time.  Thanks again Erick!


-Joe


On 11/21/2017 2:00 PM, Erick Erickson wrote:

Frankly with HDFS I'm a bit out of my depth so listen to Hendrik ;)...

I need to back up a bit. Once nodes are in this state it's not
surprising that they need to be forcefully killed. I was more thinking
about how they got in this situation in the first place. _Before_ you
get into the nasty state how are the Solr nodes shut down? Forcefully?

Your hard commit is far longer than it needs to be, resulting in much
larger tlog files etc. I usually set this at 15-60 seconds with local
disks, not quite sure whether longer intervals are helpful on HDFS.
What this means is that you can spend up to 30 minutes when you
restart solr _replaying the tlogs_! If Solr is killed, it may not have
had a chance to fsync the segments and may have to replay on startup.
If you have openSearcher set to false, the hard commit operation is
not horribly expensive, it just fsync's the current segments and opens
new ones. It won't be a total cure, but I bet reducing this interval
would help a lot.

Also, if you stop indexing there's no need to wait 30 minutes if you
issue a manual commit, something like
.../collection/update?commit=true. Just reducing the hard commit
interval will make the wait between stopping indexing and restarting
shorter all by itself if you don't want to issue the manual commit.

Best,
Erick

On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorp
 wrote:

Hi,

the write.lock issue I see as well when Solr has not been stopped 
gracefully. The write.lock files are then left in HDFS as they do not get 
removed automatically when the client disconnects, like an ephemeral node in 
ZooKeeper. Unfortunately Solr also does not realize that it should be 
owning the lock as it is marked in the state stored in ZooKeeper as the 
owner, and it is also not willing to retry, which is why you need to 
restart the whole Solr instance after the cleanup. I added some logic to 
my Solr start up script which scans the lock files in HDFS, compares that 
with the state in ZooKeeper, and then deletes all lock files that belong 
to the node that I'm starting.

regards,
Hendrik


On 21.11.2017 14:07, Joe Obernberger wrote:
Hi All - we have a system with 45 physical boxes running solr 6.6.1 
using

HDFS as the index.  The current index size is about 31TBytes. With 3x
replication that takes up 93TBytes of disk. Our main collection is 
split
across 100 shards with 3 replicas each.  The issue that we're 
running into
is when restarting the solr6 cluster.  The shards go into recovery 
and start
to utilize nearly all of their network interfaces.  If we start too 
many of
the nodes at once, the shards will go into a recovery, fail, and 
retry loop

and never come up.  The errors are related to HDFS not responding fast
enough and warnings from the DFSClient.  If we stop a node when 
this is

happening, the script will force a stop (180 second timeout) and upon
restart, we have lock files (write.lock) inside of HDFS.

The process at this point is to start one node, find o

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-21 Thread Hendrik Haddorp
Unfortunately I cannot upload my cleanup code, but the steps I'm doing 
are quite easy. I wrote it in Java using the HDFS API and Curator for 
ZooKeeper. Steps are:
    - read out the children of /collections in ZK so you know all the 
collection names
    - read /collections//state.json to get the state
    - find the replicas in the state and keep only those that have a 
"node_name" matching your local node (the node name is basically a 
combination of your host name and the solr port)
    - if the replica data has "dataDir" set then you basically only 
need to add "index/write.lock" to it and you have the lock location
    - if "dataDir" is not set (not really sure why) then you need to 
construct it yourself: //name>/data/index/write.lock
    - if the lock file exists, delete it

I believe there is a small race condition in case you use replica auto 
failover, so I try to keep the time between checking the state in 
ZooKeeper and deleting the lock file as short as possible: rather than 
first determining all lock file locations and only then deleting them, I 
delete each lock file while checking the state.


regards,
Hendrik
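
Since the actual script isn't shared here, this is only a rough sketch of those 
steps in Java (Curator for ZooKeeper, the HDFS FileSystem API, Jackson for the 
JSON); the ZooKeeper connect string, the HDFS namenode address, the node name 
and the fallback path layout are assumptions loosely based on the paths in the 
logs above, and error handling is left out:

// For every replica that lives on this node (per the state in ZooKeeper),
// delete a leftover write.lock file in HDFS.
import java.util.Iterator;
import java.util.Map;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteLockCleanup {
  public static void main(String[] args) throws Exception {
    String myNodeName = "frodo:9100_solr";              // host:port + "_solr", example from the logs
    String hdfsBase = "hdfs://namenode:8020/solr6.6.0"; // solr.hdfs.home, namenode address assumed

    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();
    FileSystem fs = FileSystem.get(new Configuration());
    ObjectMapper mapper = new ObjectMapper();

    for (String collection : zk.getChildren().forPath("/collections")) {
      byte[] raw = zk.getData().forPath("/collections/" + collection + "/state.json");
      JsonNode shards = mapper.readTree(raw).path(collection).path("shards");
      for (Iterator<Map.Entry<String, JsonNode>> s = shards.fields(); s.hasNext(); ) {
        JsonNode replicas = s.next().getValue().path("replicas");
        for (Iterator<Map.Entry<String, JsonNode>> r = replicas.fields(); r.hasNext(); ) {
          Map.Entry<String, JsonNode> replica = r.next();
          if (!myNodeName.equals(replica.getValue().path("node_name").asText())) {
            continue;                                   // replica belongs to another node
          }
          // Use "dataDir" when present, otherwise fall back to a constructed path.
          String dataDir = replica.getValue().has("dataDir")
              ? replica.getValue().get("dataDir").asText()
              : hdfsBase + "/" + collection + "/" + replica.getKey() + "/data/";
          Path lock = new Path(dataDir + "index/write.lock");
          if (fs.exists(lock)) {
            fs.delete(lock, false);                     // remove the stale lock file
          }
        }
      }
    }
    fs.close();
    zk.close();
  }
}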

On 21.11.2017 19:53, Joe Obernberger wrote:
A clever idea.  Normally what we do when we need to do a restart, is 
to halt indexing, and then wait about 30 minutes.  If we do not wait, 
and stop the cluster, the default script's 180 second timeout is not 
enough and we'll have lock files to clean up.  We use puppet to start 
and stop the nodes, but at this point that is not working well since 
we need to start one node at a time.  With each one taking hours, this 
is a lengthy process!  I'd love to see your script!


This new error is now coming up - see screen shot.  For some reason 
some of the shards have no leader assigned:


http://lovehorsepower.com/SolrClusterErrors.jpg

-Joe


On 11/21/2017 1:34 PM, Hendrik Haddorp wrote:

Hi,

the write.lock issue I see as well when Solr has not been stopped 
gracefully. The write.lock files are then left in HDFS as they do 
not get removed automatically when the client disconnects, like an 
ephemeral node in ZooKeeper. Unfortunately Solr also does not realize 
that it should be owning the lock as it is marked in the state stored 
in ZooKeeper as the owner, and it is also not willing to retry, which is 
why you need to restart the whole Solr instance after the cleanup. I 
added some logic to my Solr start up script which scans the lock files 
in HDFS, compares that with the state in ZooKeeper, and then deletes 
all lock files that belong to the node that I'm starting.


regards,
Hendrik

On 21.11.2017 14:07, Joe Obernberger wrote:
Hi All - we have a system with 45 physical boxes running solr 6.6.1 
using HDFS as the index. The current index size is about 31TBytes. 
With 3x replication that takes up 93TBytes of disk. Our main 
collection is split across 100 shards with 3 replicas each.  The 
issue that we're running into is when restarting the solr6 cluster.  
The shards go into recovery and start to utilize nearly all of their 
network interfaces.  If we start too many of the nodes at once, the 
shards will go into a recovery, fail, and retry loop and never come 
up.  The errors are related to HDFS not responding fast enough and 
warnings from the DFSClient.  If we stop a node when this is 
happening, the script will force a stop (180 second timeout) and 
upon restart, we have lock files (write.lock) inside of HDFS.


The process at this point is to start one node, find out the lock 
files, wait for it to come up completely (hours), stop it, delete 
the write.lock files, and restart.  Usually this second restart is 
faster, but it still can take 20-60 minutes.


The smaller indexes recover much faster (less than 5 minutes). 
Should we have not used so many replicas with HDFS?  Is there a 
better way we should have built the solr6 cluster?


Thank you for any insight!

-Joe











Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-21 Thread Joe Obernberger
We set the hard commit time long because we were having performance 
issues with HDFS, and thought that since the block size is 128M, having 
a longer hard commit made sense.  That was our hypothesis anyway.  Happy 
to switch it back and see what happens.


I don't know what caused the cluster to go into recovery in the first 
place.  We had a server die over the weekend, but it's just one out of 
~50.  Every shard is 3x replicated (and 3x replicated in HDFS...so 9 
copies).  It was at this point that we noticed lots of network activity, 
and most of the shards stuck in this recovery, fail, retry loop.  That is when 
we decided to shut it down, resulting in zombie lock files.


I tried using the FORCELEADER call, which completed, but doesn't seem to 
have any effect on the shards that have no leader.  Kinda out of ideas 
for that problem.  If I can get the cluster back up, I'll try a lower 
hard commit time.  Thanks again Erick!


-Joe
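
For anyone hitting the same wall, this is roughly what that call looks like 
over plain HTTP; the Solr host is a placeholder and the collection and shard 
names are just the ones appearing in the logs later in the thread:

// Issue FORCELEADER for a single shard via the Collections API.
import java.net.HttpURLConnection;
import java.net.URL;

public class ForceLeader {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://solr-host:9100/solr/admin/collections"
        + "?action=FORCELEADER&collection=UNCLASS&shard=shard21&wt=json");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    // A 200 response only means the command ran; it does not guarantee
    // that a leader was actually elected for the shard.
    System.out.println("HTTP " + conn.getResponseCode());
    conn.disconnect();
  }
}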


On 11/21/2017 2:00 PM, Erick Erickson wrote:

Frankly with HDFS I'm a bit out of my depth so listen to Hendrik ;)...

I need to back up a bit. Once nodes are in this state it's not
surprising that they need to be forcefully killed. I was more thinking
about how they got in this situation in the first place. _Before_ you
get into the nasty state how are the Solr nodes shut down? Forcefully?

Your hard commit is far longer than it needs to be, resulting in much
larger tlog files etc. I usually set this at 15-60 seconds with local
disks, not quite sure whether longer intervals are helpful on HDFS.
What this means is that you can spend up to 30 minutes when you
restart solr _replaying the tlogs_! If Solr is killed, it may not have
had a chance to fsync the segments and may have to replay on startup.
If you have openSearcher set to false, the hard commit operation is
not horribly expensive, it just fsync's the current segments and opens
new ones. It won't be a total cure, but I bet reducing this interval
would help a lot.

Also, if you stop indexing there's no need to wait 30 minutes if you
issue a manual commit, something like
.../collection/update?commit=true. Just reducing the hard commit
interval will make the wait between stopping indexing and restarting
shorter all by itself if you don't want to issue the manual commit.

Best,
Erick

On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorp
 wrote:

Hi,

the write.lock issue I see as well when Solr has not been stopped gracefully.
The write.lock files are then left in HDFS as they do not get removed
automatically when the client disconnects, like an ephemeral node in
ZooKeeper. Unfortunately Solr also does not realize that it should be owning
the lock as it is marked in the state stored in ZooKeeper as the owner, and
it is also not willing to retry, which is why you need to restart the whole
Solr instance after the cleanup. I added some logic to my Solr start up
script which scans the lock files in HDFS, compares that with the state in
ZooKeeper, and then deletes all lock files that belong to the node that I'm
starting.

regards,
Hendrik


On 21.11.2017 14:07, Joe Obernberger wrote:

Hi All - we have a system with 45 physical boxes running solr 6.6.1 using
HDFS as the index.  The current index size is about 31TBytes. With 3x
replication that takes up 93TBytes of disk. Our main collection is split
across 100 shards with 3 replicas each.  The issue that we're running into
is when restarting the solr6 cluster.  The shards go into recovery and start
to utilize nearly all of their network interfaces.  If we start too many of
the nodes at once, the shards will go into a recovery, fail, and retry loop
and never come up.  The errors are related to HDFS not responding fast
enough and warnings from the DFSClient.  If we stop a node when this is
happening, the script will force a stop (180 second timeout) and upon
restart, we have lock files (write.lock) inside of HDFS.

The process at this point is to start one node, find out the lock files,
wait for it to come up completely (hours), stop it, delete the write.lock
files, and restart.  Usually this second restart is faster, but it still can
take 20-60 minutes.

The smaller indexes recover much faster (less than 5 minutes). Should we
have not used so many replicas with HDFS?  Is there a better way we should
have built the solr6 cluster?

Thank you for any insight!

-Joe







Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-21 Thread Erick Erickson
Frankly with HDFS I'm a bit out of my depth so listen to Hendrik ;)...

I need to back up a bit. Once nodes are in this state it's not
surprising that they need to be forcefully killed. I was more thinking
about how they got in this situation in the first place. _Before_ you
get into the nasty state how are the Solr nodes shut down? Forcefully?

Your hard commit is far longer than it needs to be, resulting in much
larger tlog files etc. I usually set this at 15-60 seconds with local
disks, not quite sure whether longer intervals are helpful on HDFS.
What this means is that you can spend up to 30 minutes when you
restart solr _replaying the tlogs_! If Solr is killed, it may not have
had a chance to fsync the segments and may have to replay on startup.
If you have openSearcher set to false, the hard commit operation is
not horribly expensive, it just fsync's the current segments and opens
new ones. It won't be a total cure, but I bet reducing this interval
would help a lot.

Also, if you stop indexing there's no need to wait 30 minutes if you
issue a manual commit, something like
.../collection/update?commit=true. Just reducing the hard commit
interval will make the wait between stopping indexing and restarting
shorter all by itself if you don't want to issue the manual commit.

Best,
Erick
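
A hedged SolrJ equivalent of that manual commit, assuming a CloudSolrClient 
pointed at the cluster (the ZooKeeper address and collection name are 
placeholders); the hard commit interval itself is configured in solrconfig.xml 
under <autoCommit>/<maxTime>, with <openSearcher>false</openSearcher> giving the 
cheap fsync-only behaviour described above:

// Issue an explicit hard commit before stopping indexing and restarting,
// instead of waiting out the autoCommit interval.
import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class ManualCommit {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
            .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {
      client.setDefaultCollection("UNCLASS");
      // waitFlush=true, waitSearcher=true: returns once the commit has completed.
      client.commit(true, true);
    }
  }
}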

On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorp
 wrote:
> Hi,
>
> the write.lock issue I see as well when Solr has not been stopped gracefully.
> The write.lock files are then left in HDFS as they do not get removed
> automatically when the client disconnects, like an ephemeral node in
> ZooKeeper. Unfortunately Solr also does not realize that it should be owning
> the lock as it is marked in the state stored in ZooKeeper as the owner, and
> it is also not willing to retry, which is why you need to restart the whole
> Solr instance after the cleanup. I added some logic to my Solr start up
> script which scans the lock files in HDFS, compares that with the state in
> ZooKeeper, and then deletes all lock files that belong to the node that I'm
> starting.
>
> regards,
> Hendrik
>
>
> On 21.11.2017 14:07, Joe Obernberger wrote:
>>
>> Hi All - we have a system with 45 physical boxes running solr 6.6.1 using
>> HDFS as the index.  The current index size is about 31TBytes. With 3x
>> replication that takes up 93TBytes of disk. Our main collection is split
>> across 100 shards with 3 replicas each.  The issue that we're running into
>> is when restarting the solr6 cluster.  The shards go into recovery and start
>> to utilize nearly all of their network interfaces.  If we start too many of
>> the nodes at once, the shards will go into a recovery, fail, and retry loop
>> and never come up.  The errors are related to HDFS not responding fast
>> enough and warnings from the DFSClient.  If we stop a node when this is
>> happening, the script will force a stop (180 second timeout) and upon
>> restart, we have lock files (write.lock) inside of HDFS.
>>
>> The process at this point is to start one node, find out the lock files,
>> wait for it to come up completely (hours), stop it, delete the write.lock
>> files, and restart.  Usually this second restart is faster, but it still can
>> take 20-60 minutes.
>>
>> The smaller indexes recover much faster (less than 5 minutes). Should we
>> have not used so many replicas with HDFS?  Is there a better way we should
>> have built the solr6 cluster?
>>
>> Thank you for any insight!
>>
>> -Joe
>>
>


Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-21 Thread Joe Obernberger
A clever idea.  Normally what we do when we need to do a restart, is to 
halt indexing, and then wait about 30 minutes.  If we do not wait, and 
stop the cluster, the default script's 180 second timeout is not enough 
and we'll have lock files to clean up.  We use puppet to start and stop 
the nodes, but at this point that is not working well since we need to 
start one node at a time.  With each one taking hours, this is a lengthy 
process!  I'd love to see your script!


This new error is now coming up - see screen shot.  For some reason some 
of the shards have no leader assigned:


http://lovehorsepower.com/SolrClusterErrors.jpg

-Joe


On 11/21/2017 1:34 PM, Hendrik Haddorp wrote:

Hi,

the write.lock issue I see as well when Solr has not been stopped 
gracefully. The write.lock files are then left in HDFS as they do 
not get removed automatically when the client disconnects, like an 
ephemeral node in ZooKeeper. Unfortunately Solr also does not realize 
that it should be owning the lock as it is marked in the state stored 
in ZooKeeper as the owner, and it is also not willing to retry, which is 
why you need to restart the whole Solr instance after the cleanup. I 
added some logic to my Solr start up script which scans the lock files 
in HDFS, compares that with the state in ZooKeeper, and then deletes 
all lock files that belong to the node that I'm starting.


regards,
Hendrik

On 21.11.2017 14:07, Joe Obernberger wrote:
Hi All - we have a system with 45 physical boxes running solr 6.6.1 
using HDFS as the index.  The current index size is about 31TBytes. 
With 3x replication that takes up 93TBytes of disk. Our main 
collection is split across 100 shards with 3 replicas each.  The 
issue that we're running into is when restarting the solr6 cluster.  
The shards go into recovery and start to utilize nearly all of their 
network interfaces.  If we start too many of the nodes at once, the 
shards will go into a recovery, fail, and retry loop and never come 
up.  The errors are related to HDFS not responding fast enough and 
warnings from the DFSClient.  If we stop a node when this is 
happening, the script will force a stop (180 second timeout) and upon 
restart, we have lock files (write.lock) inside of HDFS.


The process at this point is to start one node, find out the lock 
files, wait for it to come up completely (hours), stop it, delete the 
write.lock files, and restart.  Usually this second restart is 
faster, but it still can take 20-60 minutes.


The smaller indexes recover much faster (less than 5 minutes). Should 
we have not used so many replicas with HDFS?  Is there a better way 
we should have built the solr6 cluster?


Thank you for any insight!

-Joe









Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-21 Thread Hendrik Haddorp

Hi,

the write.lock issue I see as well when Solr has not been stopped 
gracefully. The write.lock files are then left in HDFS as they do 
not get removed automatically when the client disconnects, like an 
ephemeral node in ZooKeeper. Unfortunately Solr also does not realize 
that it should be owning the lock as it is marked in the state stored in 
ZooKeeper as the owner, and it is also not willing to retry, which is why 
you need to restart the whole Solr instance after the cleanup. I added 
some logic to my Solr start up script which scans the lock files in HDFS, 
compares that with the state in ZooKeeper, and then deletes all lock 
files that belong to the node that I'm starting.


regards,
Hendrik

On 21.11.2017 14:07, Joe Obernberger wrote:
Hi All - we have a system with 45 physical boxes running solr 6.6.1 
using HDFS as the index.  The current index size is about 31TBytes. 
With 3x replication that takes up 93TBytes of disk. Our main 
collection is split across 100 shards with 3 replicas each.  The issue 
that we're running into is when restarting the solr6 cluster.  The 
shards go into recovery and start to utilize nearly all of their 
network interfaces.  If we start too many of the nodes at once, the 
shards will go into a recovery, fail, and retry loop and never come 
up.  The errors are related to HDFS not responding fast enough and 
warnings from the DFSClient.  If we stop a node when this is 
happening, the script will force a stop (180 second timeout) and upon 
restart, we have lock files (write.lock) inside of HDFS.


The process at this point is to start one node, find out the lock 
files, wait for it to come up completely (hours), stop it, delete the 
write.lock files, and restart.  Usually this second restart is faster, 
but it still can take 20-60 minutes.


The smaller indexes recover much faster (less than 5 minutes). Should 
we have not used so many replicas with HDFS?  Is there a better way we 
should have built the solr6 cluster?


Thank you for any insight!

-Joe





Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-21 Thread Joe Obernberger
Erick - thank you very much for the reply.  I'm still working through 
restarting the nodes one by one.


I'm stopping the nodes with the script, but yes - they are being killed 
forcefully because they are in this recovery, failed, retry loop.  I 
could increase the timeout, but they never seem to recover.


The largest tlog file that I see currently is 222MBytes. Autocommit is 
set to 180 and autoSoftCommit is set to 12. Errors when they are 
in the long recovery are things like:


2017-11-20 21:41:29.755 ERROR 
(recoveryExecutor-3-thread-4-processing-n:frodo:9100_solr 
x:UNCLASS_shard37_replica1 s:shard37 c:UNCLASS r:core_node196) 
[c:UNCLASS s:shard37 r:core_node196 x:UNCLASS_shard37_replica1] 
o.a.s.h.IndexFetcher Error closing file: _8dmn.cfs
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/solr6.6.0/UNCLASS/core_node196/data/index.20171120195705961/_8dmn.cfs 
could only be replicated to 0 nodes instead of minReplication (=1).  
There are 39 datanode(s) running and no node(s) are excluded in this 
operation.
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1716)
    at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3385)


Complete log is here for one of the shards that was forcefully stopped.
http://lovehorsepower.com/solr.log

As to what is in the logs when it is recovering for several hours, it's 
many WARN messages from the DFSClient such as:
Abandoning 
BP-1714598269-10.2.100.220-1341346046854:blk_4366207808_1103082741732

and
Excluding datanode 
DatanodeInfoWithStorage[172.16.100.229:50010,DS-5985e40d-830a-44e7-a2ea-fc60bebabc30,DISK]


or from the IndexFetcher:

File _a96y.cfe did not match. expected checksum is 3502268220 and actual 
is checksum 2563579651. expected length is 405 and actual length is 405


Unfortunately, I'm now getting errors from some of the nodes (still 
working through restarting them) about ZooKeeper:


org.apache.solr.common.SolrException: Could not get leader props
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1043)
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1007)

    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:963)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:902)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:846)
    at 
org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:181)
    at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for /collections/UNCLASS/leaders/shard21/leader
    at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:111)

    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at 
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:357)
    at 
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:354)
    at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
    at 
org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:354)
    at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1021)


Any idea what those could be?  Those shards are not coming back up. 
Sorry so many questions!


-Joe

On 11/21/2017 12:11 PM, Erick Erickson wrote:

How are you stopping Solr? Nodes should not go into recovery on
startup unless Solr was killed un-gracefully (i.e. kill -9 or the
like). If you use the bin/solr script to stop Solr and see a message
about "killing XXX forcefully" then you can lengthen out the time the
script waits for shutdown (there's a sysvar you can set, look in the
script).

Actually I'll correct myself a bit. Shards _do_ go into recovery but
it should be very short in the graceful shutdown case. Basically
shards temporarily go into recovery as part of normal processing just
long enough to see there's no recovery necessary, but that should be
measured in a few seconds.

What it sounds like from this "The shards go into recovery and start
to utilize nearly all of their network" is that your nodes go into
"full recovery" where the entire index is copied down because the
replica thinks it's "too far" out of date. That indicates something
weird about the state when the Solr nodes stopped.

wild-shot-in-the-dark question. How big are your tlogs? If you don't
hard commit very often, the tlogs can replay at startup for a very
long time.

This makes no sense to me, I'm surely missing something:

The proc

Re: Recovery Issue - Solr 6.6.1 and HDFS

2017-11-21 Thread Erick Erickson
How are you stopping Solr? Nodes should not go into recovery on
startup unless Solr was killed un-gracefully (i.e. kill -9 or the
like). If you use the bin/solr script to stop Solr and see a message
about "killing XXX forcefully" then you can lengthen out the time the
script waits for shutdown (there's a sysvar you can set, look in the
script).

Actually I'll correct myself a bit. Shards _do_ go into recovery but
it should be very short in the graceful shutdown case. Basically
shards temporarily go into recovery as part of normal processing just
long enough to see there's no recovery necessary, but that should be
measured in a few seconds.

What it sounds like from this "The shards go into recovery and start
to utilize nearly all of their network" is that your nodes go into
"full recovery" where the entire index is copied down because the
replica thinks it's "too far" out of date. That indicates something
weird about the state when the Solr nodes stopped.

wild-shot-in-the-dark question. How big are your tlogs? If you don't
hard commit very often, the tlogs can replay at startup for a very
long time.

This makes no sense to me, I'm surely missing something:

The process at this point is to start one node, find out the lock
files, wait for it to come up completely (hours), stop it, delete the
write.lock files, and restart.  Usually this second restart is faster,
but it still can take 20-60 minutes.

When you start one node it may take a few minutes for leader electing
to kick in (the default is 180 seconds) but absent replication it
should just be there. Taking hours totally violates my expectations.
What does Solr _think_ it's doing? What's in the logs at that point?
And if you stop solr gracefully, there shouldn't be a problem with
write.lock.

You could also try increasing the timeouts, and the HDFS directory
factory has some parameters to tweak that are a mystery to me...

All in all, this is behavior that I find mystifying.

Best,
Erick

On Tue, Nov 21, 2017 at 5:07 AM, Joe Obernberger
 wrote:
> Hi All - we have a system with 45 physical boxes running solr 6.6.1 using
> HDFS as the index.  The current index size is about 31TBytes.  With 3x
> replication that takes up 93TBytes of disk. Our main collection is split
> across 100 shards with 3 replicas each.  The issue that we're running into
> is when restarting the solr6 cluster.  The shards go into recovery and start
> to utilize nearly all of their network interfaces.  If we start too many of
> the nodes at once, the shards will go into a recovery, fail, and retry loop
> and never come up.  The errors are related to HDFS not responding fast
> enough and warnings from the DFSClient.  If we stop a node when this is
> happening, the script will force a stop (180 second timeout) and upon
> restart, we have lock files (write.lock) inside of HDFS.
>
> The process at this point is to start one node, find out the lock files,
> wait for it to come up completely (hours), stop it, delete the write.lock
> files, and restart.  Usually this second restart is faster, but it still can
> take 20-60 minutes.
>
> The smaller indexes recover much faster (less than 5 minutes). Should we
> have not used so many replicas with HDFS?  Is there a better way we should
> have built the solr6 cluster?
>
> Thank you for any insight!
>
> -Joe
>


Recovery Issue - Solr 6.6.1 and HDFS

2017-11-21 Thread Joe Obernberger
Hi All - we have a system with 45 physical boxes running solr 6.6.1 
using HDFS as the index.  The current index size is about 31TBytes.  
With 3x replication that takes up 93TBytes of disk. Our main collection 
is split across 100 shards with 3 replicas each.  The issue that we're 
running into is when restarting the solr6 cluster.  The shards go into 
recovery and start to utilize nearly all of their network interfaces.  
If we start too many of the nodes at once, the shards will go into a 
recovery, fail, and retry loop and never come up.  The errors are 
related to HDFS not responding fast enough and warnings from the 
DFSClient.  If we stop a node when this is happening, the script will 
force a stop (180 second timeout) and upon restart, we have lock files 
(write.lock) inside of HDFS.


The process at this point is to start one node, find out the lock files, 
wait for it to come up completely (hours), stop it, delete the 
write.lock files, and restart.  Usually this second restart is faster, 
but it still can take 20-60 minutes.


The smaller indexes recover much faster (less than 5 minutes). Should we 
have not used so many replicas with HDFS?  Is there a better way we 
should have built the solr6 cluster?


Thank you for any insight!

-Joe