Re: [External] Re: odd issue with accumulo 1.10.0 starting up

2022-03-17 Thread Zachary Radtka
I have had this exact error before:

ERROR: read a frame size of 1195725856, which is bigger than the maximum
allowable buffer size for ALL connections.

My cluster was on a client's AWS account which would regularly have
security scans on the weekends. Logging in on Monday the master would
always be down. We didn't know what the security scans were, but we did
solve our issue by placing our servers in a security group that only
allowed the accumulo servers to talk with each other. We also restricted
inbound traffic to security groups for our other systems that were
accessing Accumulo directly.

-Zach
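
The number in that error is worth decoding. Thrift's framed transport reads the first four bytes of a connection as a big-endian 32-bit frame length, so a client speaking some other protocol has its opening bytes interpreted as a size. A minimal Python sketch (the helper names are illustrative, not part of Accumulo or Thrift) shows that 1195725856 is the ASCII text "GET ", which would be consistent with a scanner sending an HTTP request to a Thrift port:

```python
import struct


def frame_size_to_ascii(frame_size: int) -> str:
    """Render a 32-bit frame length as the four ASCII bytes it was read from."""
    raw = struct.pack(">i", frame_size)  # big-endian signed 32-bit integer
    return raw.decode("ascii", errors="replace")


def ascii_to_frame_size(first_bytes: bytes) -> int:
    """Interpret the first four bytes of a request as a frame length."""
    return struct.unpack(">i", first_bytes[:4])[0]


print(repr(frame_size_to_ascii(1195725856)))  # prints 'GET '
```

If that reading is right, raising a size limit would not help, since the value is not a real message size; restricting which hosts can reach the ports addresses the source of the traffic.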



On Thu, Mar 17, 2022 at 11:45 AM Mike Miller wrote:

> Are you still running Replication? I would turn it off if you can.
>
> On Thu, Mar 17, 2022 at 7:44 AM dev1 wrote:
>
>> When an Accumulo process terminates abnormally, there may be a file
>> created with the exception that caused the problem. The files may be
>> named *.out (or *.err); I can't recall which. Normally the files have
>> 0 size, but on termination they will have some text.
>>
>>
>>
>> Are you seeing those files and do they point to the issue?
>>
>>
>>
>> Do you have the jvm configured to terminate on out of memory – and print
>> that error condition? Maybe the manager is running out of memory.
>>
>>
>>
>> Ed Coleman
>>
>>
>>
>> *From:* Ligade, Shailesh [USA] 
>> *Sent:* Wednesday, March 16, 2022 3:31 PM
>> *To:* user@accumulo.apache.org
>> *Subject:* RE: [External] Re: odd issue with accumulo 1.10.0 starting up
>>
>>
>>
>> Thanks,
>>
>>
>>
>> I think we are having the same or similar issue with virus scan/security
>> scan. However, that should not bring down the master, should it?
>>
>>
>>
>> I am still digging through the logs.
>>
>>
>>
>> -S
>>
>>
>>
>> *From:* Adam J. Shook 
>> *Sent:* Wednesday, March 16, 2022 2:46 PM
>> *To:* user@accumulo.apache.org
>> *Subject:* Re: [External] Re: odd issue with accumulo 1.10.0 starting up
>>
>>
>>
>> This is certainly anecdotal, but we've seen this "ERROR: Read a frame
>> size of (large number)" before on our Accumulo cluster that would show up
>> at a regular and predictable frequency. The root cause was due to a routine
>> scan done by the security team looking for vulnerabilities across the
>> entire enterprise (nothing Accumulo-specific). I don't have any additional
>> information about the specifics of the scan. From all that we can tell, it
>> has no impact on our Accumulo cluster outside of these error messages.
>>
>>
>>
>> --Adam
>>
>>
>>
>> On Wed, Mar 16, 2022 at 8:35 AM Christopher wrote:
>>
>> Since that error message is coming from the libthrift library, and not
>> Accumulo code, we would need a lot more context to even begin helping you
>> troubleshoot it. For example, the complete stack trace that shows the
>> Accumulo code that called into the Thrift library, would be extremely
>> helpful.
>>
>> It's a bit concerning that you're trying to send a single buffer over
>> thrift that's over a gigabyte in size, according to that number. You've
>> said before that you use live ingest. Are you trying to send a 1GB mutation
>> to a tablet server? Or are you using replication and the stack trace looks
>> like it's sending 1GB of replication data?
>>
>>
>>
>> On Wed, Mar 16, 2022 at 7:14 AM Ligade, Shailesh [USA] <
>> ligade_shail...@bah.com> wrote:
>>
>> Well, I re-initialized accumulo but I still see
>>
>>
>>
>> ERROR: Read a frame size of 1195725856, which is bigger than the maximum
>> allowable buffer size for ALL connections.
>>
>>
>>
>> Is there a setting that I can increase to get past it?
>>
>>
>>
>> -S
>>
>>
>>
>>
>> --
>>
>> *From:* Ligade, Shailesh [USA] 
>> *Sent:* Tuesday, March 15, 2022 12:47 PM
>> *To:* user@accumulo.apache.org 
>> *Subject:* Re: [External] Re: odd issue with accumulo 1.10.0 starting up
>>
>>
>>
>> Not daily, but over the weekend.
>> --
>>
>> *From:* Mike Miller 
>> *Sent:* Tuesday, March 15, 2022 10:39 AM
>> *To:* user@accumulo.apache.org 
>> *Subject:* Re: [External] Re: odd issue with accumulo 1.10.0 starting up
>>
>>
>>
>> Why are you bringing the cluster down every night? That is not ideal.
>>
>>
>>
>> On Tue, Mar 15, 2022 at 9:24 AM Ligade, Shailesh [USA] <
>> ligade_shail...@bah.com> wrote:
>>
>> Thanks Mike,
>>
>>
>>
>> We bring the servers down nightly. These are on AWS. This worked
>> yesterday (Monday), but today (Tuesday) I went to check on it and it was
>> down. I guess I didn't check yesterday; I assumed it was up as no one
>> complained. It was up and kicking last week for sure.
>>
>>
>>
>> So I am not exactly sure when or what caused it. All services are up
>> (tserver, master), so the services themselves are not crashing.
>>
>>
>>
>> I guess worst case, I can re-initialize and recreate the tables from HDFS..:-(
>>
>>
>>
>> -S
>> --
>>
>> *From:* Mike Miller 
>> *Sent:* Tuesday, March 15, 2022 9:16 AM
>> *To:* user@accumulo.apache.org 
>> *Subject:* Re: [External] Re: odd issue with accumulo 

Re: [External] Re: odd issue with accumulo 1.10.0 starting up

2022-03-17 Thread Mike Miller
Are you still running Replication? I would turn it off if you can.

> *From:* Mike Miller 
> *Sent:* Tuesday, March 15, 2022 9:16 AM
> *To:* user@accumulo.apache.org 
> *Subject:* Re: [External] Re: odd issue with accumulo 1.10.0 starting up
>
>
>
> What was going on in the tserver before you saw that error? Did it finish
> recovering after the restart? If it is still recovering, I don't think you
> will be able to do any scans.
>
>
>
> On Tue, Mar 15, 2022 at 8:56 AM Ligade, Shailesh [USA] <
> ligade_shail...@bah.com> wrote:
>
> Thanks Mike,
>
>
>
> That was my first reaction, but the instance is backed up by Puppet and no
> configuration was updated (I double-checked, and ran Puppet manually as well
> as automatically after restart). Since the system was operational
> yesterday, I think I can rule that out.
>
>
>
> For the other error, I did see the exact error
> https://lists.apache.org/thread/bobn2vhkswl6c0pkzpy8n13z087z1s6j
> 

RE: [External] Re: odd issue with accumulo 1.10.0 starting up

2022-03-17 Thread dev1
When an Accumulo process terminates abnormally, there may be a file created with
the exception that caused the problem. The files may be named *.out (or *.err);
I can't recall which. Normally the files have 0 size, but on termination they
will have some text.

Are you seeing those files and do they point to the issue?

Do you have the jvm configured to terminate on out of memory – and print that 
error condition? Maybe the manager is running out of memory.

Ed Coleman
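
The check for those files can be scripted. A minimal Python sketch, assuming the *.out and *.err files sit directly in the Accumulo log directory (the function name and the example path are hypothetical; adjust to your install):

```python
from pathlib import Path


def nonempty_crash_files(log_dir, suffixes=(".out", ".err")):
    """Return (path, size_in_bytes) for each *.out / *.err file that has content."""
    found = []
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        size = path.stat().st_size
        if size > 0:  # these files are normally empty; content suggests a crash
            found.append((path, size))
    return found


# Example usage (the directory is an assumption, not taken from the thread):
# for path, size in nonempty_crash_files("/opt/accumulo/logs"):
#     print(f"{path}: {size} bytes")
```

Any file this returns is worth reading in full, since it may contain the exception or out-of-memory message from the terminated process.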
