accumulo prometheus endpoint

2022-05-16 Thread Ligade, Shailesh [USA]
Hello,

I know Accumulo can use Hadoop Metrics2 to expose metrics. 
https://accumulo.apache.org/blog/2018/03/22/view-metrics-in-grafana.html

Can it also use the "hadoop.prometheus.endpoint.enabled" property, or something 
similar, to expose a Prometheus endpoint?
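
For reference, the Metrics2 route from that blog post looks roughly like the 
sketch below (the sink class and option names are hadoop-common's GraphiteSink; 
the host, port, and period here are placeholders):

  # hadoop-metrics2-accumulo.properties (sketch)
  accumulo.sink.graphite.class=org.apache.hadoop.metrics2.sink.GraphiteSink
  accumulo.sink.graphite.server_host=graphite.example.com
  accumulo.sink.graphite.server_port=2003
  *.period=30

My understanding is that hadoop.prometheus.endpoint.enabled is a property of the 
Hadoop daemons' own HTTP servers, so I am not sure Accumulo reads it.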

Thanks in advance

-S


accumulo 1.10 replication table issues

2022-05-12 Thread Ligade, Shailesh [USA]
Hello,

I used to have a backup cluster, and the primary was replicating properly. Now I 
don't need the backup cluster, so I deleted the replication entries from the 
primary cluster and took the replication table offline. However, from time to 
time the replication table comes back online, and I am seeing entries in the 
master log stating that it can't find replication targets and that replication 
is utilizing the max 100 threads, etc. Since this table comes online on its own, 
I think it is affecting cluster performance.
The question is: why is the replication table not staying offline? What is the 
proper way of keeping it offline? And what is the best way to remove all entries 
from this table, so that even if it comes online it will not clutter my logs?
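
For context, the steps I mean are roughly the following sketch (table and peer 
names are placeholders; the properties are the 1.x replication settings, and 
'help offline' / 'help deleteall' show the exact options):

  config -t my_table -s table.replication=false
  config -t my_table -d table.replication.target.my_peer
  offline -t accumulo.replication
  # to empty the table, bring it online first, then:
  deleteall -t accumulo.replication -f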

Thanks

-S




RE: [External] Re: minor compaction same as flush

2022-04-14 Thread Ligade, Shailesh [USA]
Thanks Christopher for the detailed explanation.

Appreciate your response!

-S



-Original Message-
From: Christopher  
Sent: Thursday, April 14, 2022 3:33 PM
To: accumulo-user 
Subject: [External] Re: minor compaction same as flush

Ed's description is slightly wrong:

Yes, flush is the same as a minor compaction, writing in-memory data to an 
RFile. The shell calls this a "flush", because it's a more intuitive name than 
"minor compaction". In the technical documentation, it could be referred to as 
either, and some of our configuration options and APIs may have "minc" for 
minor compaction.

If you just say "compaction", without qualification, it implies "major 
compaction", which combines multiple files into one replacement file.
The shell has a "compact" command for forcing one to occur.

Furthermore, major compactions come in a few varieties: partial and full. A 
partial one is when only a subset of a tablet's file set are combined. Ed 
incorrectly referred to these partial major compactions as a minor compaction, 
but these are still major compactions... they just don't include all files. A 
full major compaction is when all the tablet's files are combined. There are 
some operations (like dropping delete markers) that can only occur if the major 
compaction is of the full variety.

There is also something called a "merging minor compaction", which is a kind of 
flush/minor compaction that also picks up a small file to combine with. It was 
overly complex and not really necessary, so we removed it in 2.1 (not yet 
released).

There's also some special terms that refer to major compactions that occur 
under certain circumstances: chop compactions are partial major compactions 
that occur to truncate files in different tablets before those tablets are 
merged; idle compactions are major compactions that happen periodically when a 
tablet is not busy. Most compactions are system-initiated, but a user 
compaction is one that is user-initiated, such as from the shell's 'compact' 
command.
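
To make the terminology concrete, a quick shell sketch ("mytable" and the rows 
are placeholders; see 'help flush' and 'help compact' for the exact options):

  flush -t mytable -w                      # minor compaction: flush in-memory data to a new RFile
  compact -t mytable -w                    # user-initiated full major compaction
  compact -t mytable -b row_a -e row_m -w  # major compaction restricted to a row range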

I hope that helps.


RE: minor compaction same as flush

2022-04-14 Thread Ligade, Shailesh [USA]
Thanks Ed,

Appreciate your clarification!

-S

From: dev1 
Sent: Thursday, April 14, 2022 9:36 AM
To: 'user@accumulo.apache.org' 
Subject: [External] RE: minor compaction same as flush

Flush and compactions are different actions.

Flush - sorts and writes current, in-memory changes to a file.  This can reduce 
the amount of recovery in case of a failure because the flushed entries do not 
need to be processed from the WAL.

Compactions combine multiple files into a single file.  Major compactions 
combine all files into a single file.  Minor compactions select a subset of 
files and combines them into a file.

See: 
https://accumulo.apache.org/1.10/accumulo_user_manual.html#_compaction

Flushing will increase the number of files generated, and this will potentially 
increase the number of compactions. There are tradeoffs. If you are asking 
whether frequent flushes will reduce the time required to perform a major 
compaction: probably not much, if at all.
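
The tradeoff above is driven by the table's compaction ratio; a sketch of 
inspecting and adjusting it ("mytable" is a placeholder, and 3 is the default):

  config -t mytable -f table.compaction.major.ratio
  config -t mytable -s table.compaction.major.ratio=3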

Ed Coleman

From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Thursday, April 14, 2022 9:14 AM
To: user@accumulo.apache.org
Subject: minor compaction same as flush

Hello, just wanted some clarification:

Is a flush the same as a minor compaction? Is a flush better (performance-wise) 
than running, say, a range compaction?
If I flush often, will it help major compaction performance, or make no difference?

Thanks

-S


minor compaction same as flush

2022-04-14 Thread Ligade, Shailesh [USA]
Hello, just wanted some clarification:

Is a flush the same as a minor compaction? Is a flush better (performance-wise) 
than running, say, a range compaction?
If I flush often, will it help major compaction performance, or make no difference?

Thanks

-S


Re: Accumulo 1.10.0

2022-04-13 Thread Ligade, Shailesh [USA]
Thanks Ed,

after a few rounds of deleting the duplicate locations, the cluster is up and 
kicking. I didn't have to restart Accumulo after the deletes.

Thanks again

-S

From: Ligade, Shailesh [USA] 
Sent: Wednesday, April 13, 2022 9:20 AM
To: 'user@accumulo.apache.org' 
Subject: Re: Accumulo 1.10.0

Thanks

I noticed that issue with the write permission; anyway, I was able to run all 
the deletes. In reality I ran
CheckForMetadataProblems to generate a script and ran all the deletes.
The shell didn't throw any errors.
Then I restarted both my masters, but when the master came up I am seeing the 
same duplicate entry error in the master log, so either I didn't do something 
right, maybe the master didn't get bounced properly, or ZooKeeper or something 
still has that data..

Any suggestions?
-S

From: dev1 
Sent: Wednesday, April 13, 2022 9:14 AM
To: 'user@accumulo.apache.org' 
Subject: [External] RE: Accumulo 1.10.0


If you still have an issue – check that your user has WRITE permission on the 
metadata table (even root needs to be added). If you grant permission(s), you 
likely would want to remove what you added once you are done, to prevent 
inadvertently modifying the table in the future if you make a mistake with a 
command intended for another table. (Besides, it is good security practice to 
operate with the minimum required permissions.)



From: Ligade, Shailesh [USA] 
Sent: Wednesday, April 13, 2022 8:41 AM
To: 'user@accumulo.apache.org' 
Subject: Re: Accumulo 1.10.0



I think I figured it out.



I have to be on the accumulo.metadata table in order for the delete command to 
work; -t accumulo.metadata did not work, not sure why.



Thanks



-S



From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Wednesday, April 13, 2022 7:51 AM
To: 'user@accumulo.apache.org' <user@accumulo.apache.org>
Subject: Re: Accumulo 1.10.0



Thanks Ed,



a quick question,



Now that I want to delete those duplicates (there are many of those),



the scan output is



a;: loc:a [] tablet1:9997

a;: loc:zz [] tablet2:9997



What is the right delete command? When I issue



delete a; loc a -t accumulo.metadata



I get the help message, so it doesn't think it is a valid command.



I tried



delete a; loca a -t accumulo.metadata  or

delete a;: loc a -t accumulo.metadata



I still get the help message..



Thanks in advance,



-S







From: dev1 <d...@etcoleman.com>
Sent: Tuesday, April 12, 2022 10:07 AM
To: 'user@accumulo.apache.org' <user@accumulo.apache.org>
Subject: [External] RE: Accumulo 1.10.0



I would suspect that the metadata table became corrupted when the system went 
unstable and two tablet servers somehow ended up both thinking that they were 
responsible for the same extent(s). This should not be because of the balancer 
running.



If you scan the accumulo.metadata table using the shell (scan -t 
accumulo.metadata -c loc) or (scan -t accumulo.metadata -c loc -b 
[TABLE_ID#]:[EXTENT]), there will be duplicated loc entries.



I am uncertain of the best way to fix this and do not have a place to try 
things out, but here are some possible actions.



Shutdown / bounce the tservers that have the duplicated assignments – you could 
start with just one and see what happens. When the tservers go offline – the 
tablets should be reassigned and maybe only one (re)assignment will occur.



Try bouncing the manager (master)



If those don’t work, then a very aggressive / dangerous / only as a last resort:



Delete the specific loc rows from the metadata table (delete [row_id] loc 
[value] -t accumulo.metadata). This will cause a future entry in ZooKeeper; to 
get that to reassign, it might be enough to bounce the master, or you may need 
to shut down / restart the cluster.
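
As a sketch, using the rows shown in the scan output earlier in this thread: the 
shell's delete takes the row, column family, and column qualifier as separate 
arguments, and switching to the table first is what eventually worked here:

  table accumulo.metadata
  scan -c loc
  delete a;: loc a
  delete a;: loc zz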



Ed Coleman



From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Tuesday, April 12, 2022 8:36 AM
To: user@accumulo.apache.org
Subject: Accumulo 1.10.0



Hello, last weekend we ran out of HDFS space; all volumes were 100% full, yeah 
it was crazy. This Accumulo has many tables with good data.



Although Accumulo was up, it had 3 unassigned tablets.



So I added a few nodes to HDFS/Accumulo, and now HDFS capacity is 33% free. I 
issued the hdfs rebalance command (just in case), so all good. The Accumulo 
unassigned tablets went away, but tables show no assigned tablets on the 
Accumulo monitor.



On the active master i am seeing error



ERROR: Error processing table state for store Normal Tablets

java.lang.RuntimeException: 
org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException:
 found two locations for the same extent 



The question is: am I getting this because the balancer is running, and once it 
finishes it will recover? What can be done to save this cluster?



Thanks



-S


Re: Accumulo 1.10.0

2022-04-13 Thread Ligade, Shailesh [USA]
Thanks

I noticed that issue with the write permission; anyway, I was able to run all 
the deletes. In reality I ran
CheckForMetadataProblems to generate a script and ran all the deletes.
The shell didn't throw any errors.
Then I restarted both my masters, but when the master came up I am seeing the 
same duplicate entry error in the master log, so either I didn't do something 
right, maybe the master didn't get bounced properly, or ZooKeeper or something 
still has that data..

Any suggestions?
-S



Re: Accumulo 1.10.0

2022-04-13 Thread Ligade, Shailesh [USA]
I think I figured it out.

I have to be on the accumulo.metadata table in order for the delete command to 
work; -t accumulo.metadata did not work, not sure why.

Thanks

-S



Re: Accumulo 1.10.0

2022-04-13 Thread Ligade, Shailesh [USA]
Thanks Ed,

a quick question,

Now that i want to delete those duplicates (there are many of those )

the scan output is

a;: loc:a [] tablet1:9997
a;: loc:zz [] tablet2:9997

What is the right delete command? When I issue

delete a; loc a -t accumulo.metadata

I get the help message, so it doesn't think it is a valid command.

I tried

delete a; loca a -t accumulo.metadata  or
delete a;: loc a -t accumulo.metadata

I still get the help message..

Thanks in advance,

-S





Re: Accumulo 1.10.0

2022-04-12 Thread Ligade, Shailesh [USA]
Thanks Ed,

Stopping the tserver didn't help; the weird issue I saw is that both locations 
are on the same tserver. So I guess I have to do it the hard way... :-(

-S



Accumulo 1.10.0

2022-04-12 Thread Ligade, Shailesh [USA]
Hello, last weekend we ran out of HDFS space; all volumes were 100% full, yeah 
it was crazy. This Accumulo has many tables with good data.

Although Accumulo was up, it had 3 unassigned tablets.

So I added a few nodes to HDFS/Accumulo, and now HDFS capacity is 33% free. I 
issued the hdfs rebalance command (just in case), so all good. The Accumulo 
unassigned tablets went away, but tables show no assigned tablets on the 
Accumulo monitor.

On the active master i am seeing error

ERROR: Error processing table state for store Normal Tablets
java.lang.RuntimeException: 
org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException:
 found two locations for the same extent 

The question is: am I getting this because the balancer is running, and once it 
finishes it will recover? What can be done to save this cluster?

Thanks

-S


RE: [External] Re: odd issue with accumulo 1.10.0 starting up

2022-03-16 Thread Ligade, Shailesh [USA]
Thanks,

I think we are having the same or similar issue with a virus scan/security scan. 
However, that should not bring down the master... or can it?

I am still digging through the logs.

-S

From: Adam J. Shook 
Sent: Wednesday, March 16, 2022 2:46 PM
To: user@accumulo.apache.org
Subject: Re: [External] Re: odd issue with accumulo 1.10.0 starting up

This is certainly anecdotal, but we've seen this "ERROR: Read a frame size of 
(large number)" before on our Accumulo cluster that would show up at a regular 
and predictable frequency. The root cause was due to a routine scan done by the 
security team looking for vulnerabilities across the entire enterprise (nothing 
Accumulo-specific). I don't have any additional information about the specifics 
of the scan. From all that we can tell, it has no impact on our Accumulo 
cluster outside of these error messages.
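
Consistent with that explanation: 1195725856 is exactly the four ASCII bytes of 
an HTTP "GET " request read as a big-endian Thrift frame length, which is what 
an HTTP scanner probing the Thrift port would produce. A quick sketch to verify:

  $ printf 'GET ' | od -An -tx1
   47 45 54 20
  # 0x47455420 == 1195725856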

--Adam

On Wed, Mar 16, 2022 at 8:35 AM Christopher 
<ctubb...@apache.org> wrote:
Since that error message is coming from the libthrift library, and not Accumulo 
code, we would need a lot more context to even begin helping you troubleshoot 
it. For example, the complete stack trace that shows the Accumulo code that 
called into the Thrift library, would be extremely helpful.

It's a bit concerning that you're trying to send a single buffer over thrift 
that's over a gigabyte in size, according to that number. You've said before 
that you use live ingest. Are you trying to send a 1GB mutation to a tablet 
server? Or are you using replication and the stack trace looks like it's 
sending 1GB of replication data?

On Wed, Mar 16, 2022 at 7:14 AM Ligade, Shailesh [USA] 
<ligade_shail...@bah.com> wrote:
Well, I re-initialized accumulo but I still see

ERROR: Read a frame size of 1195725856, which is bigger than the maximum 
allowable buffer size for ALL connections.

Is there a setting that I can increase to get past it?

-S


________
From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Tuesday, March 15, 2022 12:47 PM
To: user@accumulo.apache.org <user@accumulo.apache.org>
Subject: Re: [External] Re: odd issue with accumulo 1.10.0 starting up

Not daily, but over the weekend.

From: Mike Miller <mmil...@apache.org>
Sent: Tuesday, March 15, 2022 10:39 AM
To: user@accumulo.apache.org <user@accumulo.apache.org>
Subject: Re: [External] Re: odd issue with accumulo 1.10.0 starting up

Why are you bringing the cluster down every night? That is not ideal.

On Tue, Mar 15, 2022 at 9:24 AM Ligade, Shailesh [USA] 
<ligade_shail...@bah.com> wrote:
Thanks Mike,

We bring the servers down nightly; these are on AWS. This worked yesterday 
(Monday), but today (Tuesday) I went to check on it and it was down. I guess I 
didn't check yesterday; I assume it was up, as no one complained, but it was up 
and kicking last week for sure.

So I am not exactly sure when or what caused it; all services are up (tserver, 
master), so the services are not crashing themselves.

I guess worst case, I can re-initialize and recreate the tables from HDFS... :-(

-S

From: Mike Miller <mmil...@apache.org>
Sent: Tuesday, March 15, 2022 9:16 AM
To: user@accumulo.apache.org <user@accumulo.apache.org>
Subject: Re: [External] Re: odd issue with accumulo 1.10.0 starting up

What was going on in the tserver before you saw that error? Did it finish 
recovering after the restart? If it is still recovering, I don't think you will 
be able to do any scans.

On Tue, Mar 15, 2022 at 8:56 AM Ligade, Shailesh [USA] 
<ligade_shail...@bah.com> wrote:
Thanks Mike,

That was my first reaction, but the instance is backed by puppet and no 
configuration was updated (I double checked, and ran puppet manually as well as 
automatically after restart). Since the system was operational yesterday, I 
think I can rule that out.

For the other error, I did see the exact error in 
https://lists.apache.org/thread/bobn2vhkswl6c0pkzpy8n13z087z1s6j, 
https://github.com/RENCI-NRIG/COMET-Accumulo/issues/14 and 
https://markmail.org/message/bc7ijdsgqmod5p2h but those are for a lot older 
Accumulo, and the server didn't go out of memory, so I think that must have 
been fixed..

Re: [External] Re: odd issue with accumulo 1.10.0 starting up

2022-03-16 Thread Ligade, Shailesh [USA]
Thanks Mike,

After stopping all the services, I just moved /accumulo to /old-accumulo and 
then ran

accumulo init --clear-instance-name --instance-name  --password 


With that, plain vanilla Accumulo came up after restarting the services.

The plan is to re-create all the tables from /old-accumulo using importdirectory.
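
A sketch of that plan for one table, with hypothetical paths; the shell's 
importdirectory operates on the current table and takes a source directory, a 
failure directory, and a setTime flag:

  createtable x
  table x
  importdirectory /tmp/x /tmp/x-failures false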

Although Accumulo is up, and I can see it from the monitor and scan the accumulo 
tables, I do see that error in the master log..

-S

From: Mike Miller 
Sent: Wednesday, March 16, 2022 8:24 AM
To: user@accumulo.apache.org 
Subject: Re: [External] Re: odd issue with accumulo 1.10.0 starting up

It is hard to help you without a full explanation of what exactly you are 
doing. Was that error in the Master log? What commands did you run exactly to 
"re-initialize"? Did you wipe all the data or just run "--reset-security"?


Re: [External] Re: odd issue with accumulo 1.10.0 starting up

2022-03-16 Thread Ligade, Shailesh [USA]
Well, I re-initialized accumulo but I still see

ERROR: Read a frame size of 1195725856, which is bigger than the maximum 
allowable buffer size for ALL connections.

Is there a setting that I can increase to get past it?

-S




Re: [External] Re: odd issue with accumulo 1.10.0 starting up

2022-03-15 Thread Ligade, Shailesh [USA]
Not daily, but over the weekend.




Re: [External] Re: odd issue with accumulo 1.10.0 starting up

2022-03-15 Thread Ligade, Shailesh [USA]
Thanks Mike,

We bring the servers down nightly; these are on AWS. This worked yesterday 
(Monday), but today (Tuesday) I went to check on it and it was down. I guess I 
didn't check yesterday; I assume it was up, as no one complained, but it was up 
and kicking last week for sure.

So I am not exactly sure when or what caused it; all services are up (tserver, 
master), so the services are not crashing themselves.

I guess worst case, I can re-initialize and recreate the tables from HDFS... :-(

-S



Re: [External] Re: odd issue with accumulo 1.10.0 starting up

2022-03-15 Thread Ligade, Shailesh [USA]
Thanks Mike,

That was my first reaction, but the instance is backed by puppet and no 
configuration was updated (I double checked, and ran puppet manually as well as 
automatically after restart). Since the system was operational yesterday, I 
think I can rule that out.

For the other error, I did see the exact error in 
https://lists.apache.org/thread/bobn2vhkswl6c0pkzpy8n13z087z1s6j, 
https://github.com/RENCI-NRIG/COMET-Accumulo/issues/14 and 
https://markmail.org/message/bc7ijdsgqmod5p2h but those are for a lot older 
Accumulo, and the server didn't go out of memory, so I think that must have 
been fixed..


-S


From: Mike Miller 
Sent: Tuesday, March 15, 2022 8:47 AM
To: user@accumulo.apache.org 
Subject: [External] Re: odd issue with accumulo 1.10.0 starting up

Check your configuration. The log message indicates that there is a problem 
with the internal system user performing operations. The internal system user 
uses credentials derived from the configuration (such as the instance.secret 
field). Make sure your configuration is identical across all nodes in your 
cluster.



odd issue with accumulo 1.10.0 starting up

2022-03-15 Thread Ligade, Shailesh [USA]
Hello,

I am getting a little odd issue with Accumulo starting up.

On the tserver I am seeing

[tserver.TabletServer] ERROR: Caller doesn't have permission to get active scans
ThriftSecurityException(user:!SYSTEM, code:BAD_CREDENTIALS)

In the master log I am seeing

ERROR: read a frame size of 1195725856, which is bigger than the maximum 
allowable buffer size for ALL connections.

From the shell I can list all the tables but cannot scan any. The monitor is 
showing a tablet count of 0 and 1 unassigned tablet.

HDFS fsck is all healthy.

Any suggestions?

Thanks

-S



RE: [External] Re: accumulo 1.10.0 unassigned tablets issue

2022-03-04 Thread Ligade, Shailesh [USA]
Thanks Chris,

Appreciate your support!

Not sure why volumes.replacement was set, especially since we have an HA 
namenode and that's the only HDFS targeted. The volumes.replacement was set to 
the same URL though, e.g. nameservice/accumulo, nameservice:8020/accumulo 

Regardless, when the tserver went down, even though we set 
table.suspend.duration=15m, I was seeing volume replacement messages in the 
master log for every tablet hosted, and that is taking a looong time (hours for 
33k tablets/tserver). So how best to remove these volumes? There is no 
delete-volumes; I see only add-volumes under accumulo init. Is there anything I 
need to do after I remove the entire instance.volumes.replacements section from 
accumulo-site.xml?
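
A sketch of the section in question, with hypothetical nameservice names; the 
value maps an old volume URI to a new one as a space-separated pair, and 
removing the whole property block (or leaving the value empty, the default) 
should disable volume replacement:

  <property>
    <name>instance.volumes.replacements</name>
    <value>hdfs://nameservice:8020/accumulo hdfs://nameservice/accumulo</value>
  </property>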

I will have to look at each and every property to ensure it makes sense for 
sure..

Thanks

-S

-Original Message-
From: Christopher  
Sent: Wednesday, March 2, 2022 3:09 PM
To: accumulo-user 
Subject: Re: [External] Re: accumulo 1.10.0 unassigned tablets issue

On Wed, Mar 2, 2022 at 1:51 PM Ligade, Shailesh [USA]  
wrote:
>
> Thanks Chris[topher],
>
> I do have instance.volume.replacement overridden
>
> Does that mean it will not work with table.suspend.duration property?

No. It's just that's where the RecoveryManager message is coming from.

>
> uhmm thinking about it i am not sure why we set that as we have only one hdfs 
> and we have less than 10 beefy nodes...
>
> may be I can remove this property after i set table.suspend.duration, and 
> stop/reboot tserver. After i am done, i can restore the property. Please 
> advise.

I have no idea why you would set that if you're not replacing one volume with 
another. I think you would probably benefit from reviewing all of your 
configuration. Please check the documentation for an explanation of each 
property. If you have a specific question regarding them, you can ask here, but 
I would start by reviewing your configs against the docs.

>
> Thanks
>
> -S
>
>
> 
> From: Christopher 
> Sent: Wednesday, March 2, 2022 1:32 PM
> To: accumulo-user 
> Subject: [External] Re: accumulo 1.10.0 unassigned tablets issue
>
> The replacements message should only appear if you have 
> instance.volumes.replacements set in your configuration.
>
> On Wed, Mar 2, 2022 at 11:02 AM Ligade, Shailesh [USA] 
>  wrote:
> >
> > Hello,
> >
> > I need reboot a tserver with 34k hosted tablets.
> >
> > I set table.suspend.duration to 15 min, stopped the tserver, and rebooted the 
> > machine.
> >
> > As soon as tablet server came on line the its hosted tablets counts went 
> > from 0 to 34k, however, on the master i see 34k unassigned tablets, 
> > although the count is going down it is taking hours.
> > not sure why the master is stating unassigned tablets when the tablet server 
> > has the correct hosted tablet count?
> >
> > Also in the master log i see
> >
> > recovery.RecoveryManager INFO: Volume replaced hdfs:// -> hdfs://   
> > the issue is both from and to hdfs urls are identical, so why master is 
> > trying to do that??
> >
> > Is the cluster safe to use? I can reboot another tablet server before this 
> > unassigned tablet count goes to 0? I can reboot entire cluster if i have 
> > to, will that help?
> >
> > Thanks in advance.
> >
> > -S


Re: [External] Re: accumulo 1.10.0 unassigned tablets issue

2022-03-02 Thread Ligade, Shailesh [USA]
Thanks Chris,

I do have instance.volumes.replacements overridden.

Does that mean it will not work with table.suspend.duration property?

Uhmm, thinking about it, I am not sure why we set that, as we have only one HDFS 
and we have fewer than 10 beefy nodes...

Maybe I can remove this property after I set table.suspend.duration, and 
stop/reboot the tserver. After I am done, I can restore the property. Please advise.

Thanks

-S





accumulo 1.10.0 unassigned tablets issue

2022-03-02 Thread Ligade, Shailesh [USA]
Hello,

I need to reboot a tserver with 34k hosted tablets.

I set table.suspend.duration to 15 min, stopped the tserver, and rebooted the machine.

As soon as the tablet server came online, its hosted tablet count went from 0 to 
34k; however, on the master I see 34k unassigned tablets, and although the count 
is going down it is taking hours.
Not sure why the master is stating unassigned tablets when the tablet server has 
the correct hosted tablet count?

Also in the master log i see

recovery.RecoveryManager INFO: Volume replaced hdfs:// -> hdfs://   The 
issue is that both the from and to HDFS URLs are identical, so why is the master 
trying to do that?

Is the cluster safe to use? Can I reboot another tablet server before this 
unassigned tablet count goes to 0? I can reboot the entire cluster if I have to; 
will that help?

Thanks in advance.

-S


ENTRIES listed on accumulo 1.10 monitor

2022-02-18 Thread Ligade, Shailesh [USA]
Good morning,

Just a quick question: what does the Entries column on the Accumulo master 
status page indicate for a table?
Yesterday I had to re-initialize Accumulo 1.10. Before starting, I moved my HDFS 
/accumulo to /accumulo-old.
Then I copied the rf files under /accumulo-old/tables/x to /tmp/x, and after 
initializing I did an importdirectory to create the new table x.
Everything worked flawlessly, and the HDFS data size of /accumulo-old/tables/x 
matches /accumulo/tables/x; however, the old Entries were 22B and the new 
Entries are 2B.
So what do the entries really represent? I always thought it was actually a 
record count, but it doesn't appear that way..

-S


Re: how to stop entire (or on a single table) replication properly

2022-02-17 Thread Ligade, Shailesh [USA]
Thanks Ed,

I could have done that, but at this point my destination cluster is not healthy
and I need to reinitialize that cluster. Eventually the same tables will be
there after initialization, but with different table ids. Once that happens I
will set up replication again.
At this point I am not worried about in-flight replication data, and I don't
want this replication process to impact my primary cluster. How can I safely
achieve this?

-S

From: dev1 
Sent: Thursday, February 17, 2022 8:24 AM
To: 'user@accumulo.apache.org' 
Subject: [External] RE: how to stop entire (or on a single table) replication 
properly


So, I’m not familiar with replication in an operational setting – my comments
are based on my mental model of what I think replication is doing – the
implementation may not match my mental model – maybe someone else with more
familiarity can chime in.



I’m reading that you want to stop replication and do not care to preserve data
that may be “in-flight”.



Why don’t you just stop replication on the source and then create the
destination table that is expected to exist as the destination?  When that data
has been “replicated”, the source replication table should be empty – then just
delete the destination table.  You are still getting rid of the data, and you
let replication do the housekeeping for you.
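
A minimal sketch of that flow (table names and credentials are placeholders;
accumulo.replication as the replication table name on 1.10 is an assumption):

# on the source: stop queuing new data for replication
accumulo shell -u root -p secret -e "config -t my_table -s table.replication=false"
# watch the pending work drain; when this scan returns nothing, replication has caught up
accumulo shell -u root -p secret -e "scan -t accumulo.replication -np"
# on the peer: once drained, drop the recreated destination table
accumulo shell -u root -p secret -e "deletetable -f my_table"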



Ed Coleman



From: Ligade, Shailesh [USA] 
Sent: Thursday, February 17, 2022 8:15 AM
To: 'user@accumulo.apache.org' 
Subject: Re: how to stop entire (or on a single table) replication properly



Thanks Ed,



Let me rephrase it. I need to stop replication, as my tables on the peer are
changing. After stopping, I will need to start replication to the tables again.



To stop the replication, on the primary instance's tables I am going to set
replication to false, basically running

config -t my_table -s table.replication=false (currently true).

I believe that setting will stop replicating that table to the peer.



However, there is still data in the primary replication table, and the system
will still try to replicate to the peer (where the corresponding tables no
longer exist!); I can see it is still replicating to the peer on the replication
page of the monitor UI. I can take the primary replication table offline, but
when I bring it online again, that data will still be there. So the question is:
how can I safely remove the data in the primary replication table?



One time I tried to do a deleteall on the primary replication table, but when
the accumulo master restarted it was complaining a lot about replication data,
so I just wanted to figure out the proper steps.



thanks



-S



From: dev1 <d...@etcoleman.com>
Sent: Thursday, February 17, 2022 8:04 AM
To: 'user@accumulo.apache.org' <user@accumulo.apache.org>
Subject: [External] RE: how to stop entire (or on a single table) replication 
properly



I do not understand what you are asking – it would help if you stated what you 
are trying to accomplish and if you clearly identified source vs. destination.



Ed Coleman



From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Thursday, February 17, 2022 7:37 AM
To: user@accumulo.apache.org
Subject: how to stop entire (or on a single table) replication properly



Hello,



If I must stop replication entirely, I set each individual table's replication
config to false. However, this will not affect entries in the replication table,
and the system will keep (or try to keep) replicating.

I can take the replication table offline, but eventually, when I need to start
replication again, it will not be clean. How can I delete the entries in the
replication table? Can I just do a deleteall? Will that work?



-S


Re: how to stop entire (or on a single table) replication properly

2022-02-17 Thread Ligade, Shailesh [USA]
Thanks Ed,

Let me rephrase it. I need to stop replication, as my tables on the peer are
changing. After stopping, I will need to start replication to the tables again.

To stop the replication, on the primary instance's tables I am going to set
replication to false, basically running
config -t my_table -s table.replication=false (currently true).
I believe that setting will stop replicating that table to the peer.

However, there is still data in the primary replication table, and the system
will still try to replicate to the peer (where the corresponding tables no
longer exist!); I can see it is still replicating to the peer on the replication
page of the monitor UI. I can take the primary replication table offline, but
when I bring it online again, that data will still be there. So the question is:
how can I safely remove the data in the primary replication table?

One time I tried to do a deleteall on the primary replication table, but when
the accumulo master restarted it was complaining a lot about replication data,
so I just wanted to figure out the proper steps.

thanks

-S

From: dev1 
Sent: Thursday, February 17, 2022 8:04 AM
To: 'user@accumulo.apache.org' 
Subject: [External] RE: how to stop entire (or on a single table) replication 
properly


I do not understand what you are asking – it would help if you stated what you 
are trying to accomplish and if you clearly identified source vs. destination.



Ed Coleman



From: Ligade, Shailesh [USA] 
Sent: Thursday, February 17, 2022 7:37 AM
To: user@accumulo.apache.org
Subject: how to stop entire (or on a single table) replication properly



Hello,



If I must stop replication entirely, I set each individual table's replication
config to false. However, this will not affect entries in the replication table,
and the system will keep (or try to keep) replicating.

I can take the replication table offline, but eventually, when I need to start
replication again, it will not be clean. How can I delete the entries in the
replication table? Can I just do a deleteall? Will that work?



-S


how to stop entire (or on a single table) replication properly

2022-02-17 Thread Ligade, Shailesh [USA]
Hello,

If I must stop replication entirely, I set each individual table's replication
config to false. However, this will not affect entries in the replication table,
and the system will keep (or try to keep) replicating.
I can take the replication table offline, but eventually, when I need to start
replication again, it will not be clean. How can I delete the entries in the
replication table? Can I just do a deleteall? Will that work?

-S


Re: accumulo 1.10.0 masters won't start

2022-02-16 Thread Ligade, Shailesh [USA]
Thanks Ed,

Hmm, copying all that 7T of data from the unflattened hdfs layout to a
flattened one will take some time.. I guess it makes sense to just copy/flatten
100 rfiles, import, and keep repeating; that will work without filling a single
datanode... I wish there were an accumulo command that would take the structure
as is..

-S

From: dev1 
Sent: Wednesday, February 16, 2022 10:49 AM
To: 'user@accumulo.apache.org' 
Subject: [External] RE: accumulo 1.10.0 masters won't start


I would use importdirectory [src_dir] [fail_dir] -t new_table false



I would move the files from under accumulo (either shut down, or at least have
the table offline) into hdfs directories for each batch (10K or so): batch1,
batch2,… I think importdirectory expects a flat directory of just files.

Then I can import one batch, check for errors, repeat.
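
A hedged sketch of one batch (paths and credentials are placeholders; the -t
flag follows the usage above, though on some 1.10 shells you may need to select
the table with the shell's table command first):

# stage a flat directory of rfiles plus an empty failure directory
hdfs dfs -mkdir -p /staging/batch1 /staging/fail1
hdfs dfs -mv /accumulo-old/tables/x/t-0001/*.rf /staging/batch1/
# import the batch, then confirm the failure directory stayed empty before the next one
accumulo shell -u root -p secret -e "importdirectory -t new_table /staging/batch1 /staging/fail1 false"
hdfs dfs -ls /staging/fail1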



The table you create – the splits will be whatever you set.  Again, maybe 1
split for each tserver (that's about optimum for the batch import).  Set the
split size higher.  The import command will then place the files according to
the splits on the new table – so in your case, it's likely multiple files from
your current splits will be assigned to 1 tablet – effectively being a merge.



My approach to these things is to create scripts based on the info that I have,
but run them in a way that lets me see whether things are progressing and make
adjustments if not.  I use individual scripts so that I have positive control.
Pipe, grep, awk to build the commands, review the files as a sanity check, and
then run them.



Ed Coleman



From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 16, 2022 9:37 AM
To: 'user@accumulo.apache.org' 
Subject: Re: accumulo 1.10.0 masters won't start



Thanks



Since hdfs fsck is fine on /accumulo, I can back up my tables to some location
within hdfs (not under accumulo) and reinitialize accumulo.

Then I can recreate my tables/users on the new instance.

What will be the command to import/load the existing hdfs data into this newly
created table? The importtable command creates a new table as well, so I guess I
need to test it somewhere.

Also, while loading the old data into the new table, what can I do to get rid
of these splits/tablets?



I think this will be a faster approach for me to recover..



Thank you so much



-S



From: dev1 <d...@etcoleman.com>
Sent: Wednesday, February 16, 2022 9:29 AM
To: 'user@accumulo.apache.org' <user@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start



> What happens if I let the #tablets grow?



It sounds like you might be in the worst case now?  There is overhead for each
tablet - at what point the master / other things fall over is not something I've
tried to find out. Even scanning the metadata table and the gc process are doing
a lot of work to track/process that many files / tablets, and it is likely
unnecessary.



What is the command / arguments that you are using for compactions?  The 
comment minimal sleep after 100 compaction commands is confusing to me.



Can you buffer the replication?



You might be able to:

 - create a new table.

 - point the replication to write to the new table.

 - ingest data from the old into the new.



You should look towards picking a split threshold so that you have 1 or maybe a 
few tablets per tserver (with some reasonable split size.)  Split sizes of 3G 
or 5G are not uncommon - and larger is reasonable.



Ed Coleman

____

From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Wednesday, February 16, 2022 8:05 AM
To: 'user@accumulo.apache.org' <user@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks Ed,



We have two clusters; one is the source, the other is the peer. I am testing
this large table on the peer first (eventually I need to do the same on the
source cluster). I am not stopping ingest on the source cluster, so replication
will continue to the peer table; however, while I am doing this not much ingest
is happening.





I tried the range compaction along with a range merge; however, the merge takes
forever (even over a single range.. I didn't try many, just the first few), and
before it finishes I get zookeeper errors and both masters crash. I had to bump
that jute setting (in both java.env on zookeeper and accumulo-env on accumulo)
to get it back. So I left merges alone and am just trying the 72k compactions;
since compactions are not backing up, I am doing a minimal sleep after every 100
compact commands. But sometimes during compactions I still get zookeeper errors
and a master crash.



I do get your idea of creating a new table with fewer splits; that way the new
table will be compacted. However, for that I will need to stop ingest on primary
and then set up replication on the new cluster again.. I was avoiding that, but
I guess that may be my only option.



A last question: if I let this table continue to grow tablets, what is the
worst case scenario? How might it affect system performance?

Re: accumulo 1.10.0 masters won't start

2022-02-16 Thread Ligade, Shailesh [USA]
Thanks

Since hdfs fsck is fine on /accumulo, I can back up my tables to some location
within hdfs (not under accumulo) and reinitialize accumulo.
Then I can recreate my tables/users on the new instance.
What will be the command to import/load the existing hdfs data into this newly
created table? The importtable command creates a new table as well, so I guess I
need to test it somewhere.
Also, while loading the old data into the new table, what can I do to get rid of
these splits/tablets?

I think this will be a faster approach for me to recover..

Thank you so much

-S

From: dev1 
Sent: Wednesday, February 16, 2022 9:29 AM
To: 'user@accumulo.apache.org' 
Subject: [External] Re: accumulo 1.10.0 masters won't start

> What happens if I let the #tablets grow?

It sounds like you might be in the worst case now?  There is overhead for each
tablet - at what point the master / other things fall over is not something I've
tried to find out. Even scanning the metadata table and the gc process are doing
a lot of work to track/process that many files / tablets, and it is likely
unnecessary.

What is the command / arguments that you are using for compactions?  The 
comment minimal sleep after 100 compaction commands is confusing to me.

Can you buffer the replication?

You might be able to:
 - create a new table.
 - point the replication to write to the new table.
 - ingest data from the old into the new.

You should look towards picking a split threshold so that you have 1 or maybe a 
few tablets per tserver (with some reasonable split size.)  Split sizes of 3G 
or 5G are not uncommon - and larger is reasonable.
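
For example (my_table and 5G are placeholders; merge -s merges adjacent tablets
smaller than the given size):

accumulo shell -u root -p secret -e "config -t my_table -s table.split.threshold=5G"
accumulo shell -u root -p secret -e "merge -t my_table -s 5G -v"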

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 16, 2022 8:05 AM
To: 'user@accumulo.apache.org' 
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

We have two clusters; one is the source, the other is the peer. I am testing
this large table on the peer first (eventually I need to do the same on the
source cluster). I am not stopping ingest on the source cluster, so replication
will continue to the peer table; however, while I am doing this not much ingest
is happening.


I tried the range compaction along with a range merge; however, the merge takes
forever (even over a single range.. I didn't try many, just the first few), and
before it finishes I get zookeeper errors and both masters crash. I had to bump
that jute setting (in both java.env on zookeeper and accumulo-env on accumulo)
to get it back. So I left merges alone and am just trying the 72k compactions;
since compactions are not backing up, I am doing a minimal sleep after every 100
compact commands. But sometimes during compactions I still get zookeeper errors
and a master crash.

I do get your idea of creating a new table with fewer splits; that way the new
table will be compacted. However, for that I will need to stop ingest on primary
and then set up replication on the new cluster again.. I was avoiding that, but
I guess that may be my only option.

A last question: if I let this table continue to grow tablets, what is the
worst case scenario? How might it affect system performance?

Thanks

-S

From: dev1 
Sent: Tuesday, February 15, 2022 3:11 PM
To: 'user@accumulo.apache.org' 
Subject: [External] RE: accumulo 1.10.0 masters won't start


Can you compact the table?  How aggressive do you want to get? I do not 
understand why you are getting the ZooKeeper errors – is it related to the 
number of tablets, or is it something else?  (an iterator that was attached 
with a very large set of arguments, a very large list or some sort of binary 
data – say a bloom filter)



If it were me – you need to balance your goals against requirements that might
dictate a less aggressive approach.  At this point I'm assuming that getting
things back online without data loss is the top priority (and that I was sure it
was not related to something that I attached to the table).



If I have room, I would compact the table(s). It could also depend on how long
a compaction would take and whether I could wait.  (It is generally preferable
to work on files that have had any deleted data removed; a compaction can also
reduce the total number of files, since files from minor compactions and bulk
ingest are combined into a single file for each tablet.)



Stop ingest.

Flush the source table – allow any compactions to settle. (Optional if 
compacting, but should be a quick command to execute)

(Optional – compact the original.)

Clone the source table.

Compact the clone so that the clone does not share any files with the source

Optionally – use the exporttable command to generate a list of files from the
clone – you may not need it, but it could be handy.

Take the clone offline.

Move the files under /accumulo/tables/[clone-id]/dirs to one or more staging 
directories (in hdfs) – the export list could help.

Delete the clone table – (I believe that the delete will not check for the
existence of files if it is offline.) If not, then it would be necessary to use
an empty rfile as a placeholder.

Re: accumulo 1.10.0 masters won't start

2022-02-16 Thread Ligade, Shailesh [USA]
Well, now the master doesn't come up, throwing all sorts of zookeeper errors.
The only changes I made were jute.maxbuffer set to a max of 0x60 (in both the
zookeeper java.env and accumulo-site) and instance.zookeeper.timeout set to 90s.

Even if both masters are down, I still see table_locks under the znode; is that
normal?

Appreciate your help

-S

From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 16, 2022 8:05 AM
To: 'user@accumulo.apache.org' 
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

We have two clusters; one is the source, the other is the peer. I am testing
this large table on the peer first (eventually I need to do the same on the
source cluster). I am not stopping ingest on the source cluster, so replication
will continue to the peer table; however, while I am doing this not much ingest
is happening.


I tried the range compaction along with a range merge; however, the merge takes
forever (even over a single range.. I didn't try many, just the first few), and
before it finishes I get zookeeper errors and both masters crash. I had to bump
that jute setting (in both java.env on zookeeper and accumulo-env on accumulo)
to get it back. So I left merges alone and am just trying the 72k compactions;
since compactions are not backing up, I am doing a minimal sleep after every 100
compact commands. But sometimes during compactions I still get zookeeper errors
and a master crash.

I do get your idea of creating a new table with fewer splits; that way the new
table will be compacted. However, for that I will need to stop ingest on primary
and then set up replication on the new cluster again.. I was avoiding that, but
I guess that may be my only option.

A last question: if I let this table continue to grow tablets, what is the
worst case scenario? How might it affect system performance?

Thanks

-S

From: dev1 
Sent: Tuesday, February 15, 2022 3:11 PM
To: 'user@accumulo.apache.org' 
Subject: [External] RE: accumulo 1.10.0 masters won't start


Can you compact the table?  How aggressive do you want to get? I do not 
understand why you are getting the ZooKeeper errors – is it related to the 
number of tablets, or is it something else?  (an iterator that was attached 
with a very large set of arguments, a very large list or some sort of binary 
data – say a bloom filter)



If it were me – you need to balance your goals against requirements that might
dictate a less aggressive approach.  At this point I'm assuming that getting
things back online without data loss is the top priority (and that I was sure it
was not related to something that I attached to the table).



If I have room, I would compact the table(s). It could also depend on how long
a compaction would take and whether I could wait.  (It is generally preferable
to work on files that have had any deleted data removed; a compaction can also
reduce the total number of files, since files from minor compactions and bulk
ingest are combined into a single file for each tablet.)



Stop ingest.

Flush the source table – allow any compactions to settle. (Optional if 
compacting, but should be a quick command to execute)

(Optional – compact the original.)

Clone the source table.

Compact the clone so that the clone does not share any files with the source

Optionally – use the exporttable command to generate a list of files from the
clone – you may not need it, but it could be handy.

Take the clone offline.

Move the files under /accumulo/tables/[clone-id]/dirs to one or more staging 
directories (in hdfs) – the export list could help.

Delete the clone table – (I believe that the delete will not check for the 
existence of files if it is offline.) If not, then it would be necessary to use 
an empty rfile as a placeholder.

Create a new table and set splits – this could be your desired number – or use 
just enough splits that each tserver has 1 tablet.

Set the default table split size to some multiple of the desired final size – 
this limits splitting during the imports. Not critical, but may be faster.

Take the new table offline and then back online – this will immediately migrate 
the splits – or you could just wait for the migration to finish.

Bulk import the files from the staging area(s) – likely in batches.  You will 
likely have ~72K files – so maybe ~7,000 files / batch?

Once all files have been imported set the split threshold to desired size.

Check that permissions, users, iterators, table config parameters are present 
on the new table and match the source.

Rename the source table to old_xxx or whatever

Rename the new table to the source table, verify things are okay and delete the 
original.



If you don’t have the space, you could skip operating on the clone, but then 
you can’t fall back to the original if things go wrong.



Another way would be to use importtable, but you need to make sure that it
doesn't just recreate the original splits; otherwise you end up with the same
72K files.



Ed Coleman





From: Ligade, Shailesh [USA] 
Sent

Re: accumulo 1.10.0 masters won't start

2022-02-16 Thread Ligade, Shailesh [USA]
Thanks Ed,

We have two clusters; one is the source, the other is the peer. I am testing
this large table on the peer first (eventually I need to do the same on the
source cluster). I am not stopping ingest on the source cluster, so replication
will continue to the peer table; however, while I am doing this not much ingest
is happening.


I tried the range compaction along with a range merge; however, the merge takes
forever (even over a single range.. I didn't try many, just the first few), and
before it finishes I get zookeeper errors and both masters crash. I had to bump
that jute setting (in both java.env on zookeeper and accumulo-env on accumulo)
to get it back. So I left merges alone and am just trying the 72k compactions;
since compactions are not backing up, I am doing a minimal sleep after every 100
compact commands. But sometimes during compactions I still get zookeeper errors
and a master crash.

I do get your idea of creating a new table with fewer splits; that way the new
table will be compacted. However, for that I will need to stop ingest on primary
and then set up replication on the new cluster again.. I was avoiding that, but
I guess that may be my only option.

A last question: if I let this table continue to grow tablets, what is the
worst case scenario? How might it affect system performance?

Thanks

-S

From: dev1 
Sent: Tuesday, February 15, 2022 3:11 PM
To: 'user@accumulo.apache.org' 
Subject: [External] RE: accumulo 1.10.0 masters won't start


Can you compact the table?  How aggressive do you want to get? I do not 
understand why you are getting the ZooKeeper errors – is it related to the 
number of tablets, or is it something else?  (an iterator that was attached 
with a very large set of arguments, a very large list or some sort of binary 
data – say a bloom filter)



If it were me – you need to balance your goals against requirements that might
dictate a less aggressive approach.  At this point I'm assuming that getting
things back online without data loss is the top priority (and that I was sure it
was not related to something that I attached to the table).



If I have room, I would compact the table(s). It could also depend on how long
a compaction would take and whether I could wait.  (It is generally preferable
to work on files that have had any deleted data removed; a compaction can also
reduce the total number of files, since files from minor compactions and bulk
ingest are combined into a single file for each tablet.)



Stop ingest.

Flush the source table – allow any compactions to settle. (Optional if 
compacting, but should be a quick command to execute)

(Optional – compact the original.)

Clone the source table.

Compact the clone so that the clone does not share any files with the source

Optionally – use the exporttable command to generate a list of files from the
clone – you may not need it, but it could be handy.

Take the clone offline.

Move the files under /accumulo/tables/[clone-id]/dirs to one or more staging 
directories (in hdfs) – the export list could help.

Delete the clone table – (I believe that the delete will not check for the 
existence of files if it is offline.) If not, then it would be necessary to use 
an empty rfile as a placeholder.

Create a new table and set splits – this could be your desired number – or use 
just enough splits that each tserver has 1 tablet.

Set the default table split size to some multiple of the desired final size – 
this limits splitting during the imports. Not critical, but may be faster.

Take the new table offline and then back online – this will immediately migrate 
the splits – or you could just wait for the migration to finish.

Bulk import the files from the staging area(s) – likely in batches.  You will 
likely have ~72K files – so maybe ~7,000 files / batch?

Once all files have been imported set the split threshold to desired size.

Check that permissions, users, iterators, table config parameters are present 
on the new table and match the source.

Rename the source table to old_xxx or whatever

Rename the new table to the source table, verify things are okay and delete the 
original.



If you don’t have the space, you could skip operating on the clone, but then 
you can’t fall back to the original if things go wrong.



Another way would be to use importtable, but you need to make sure that it
doesn't just recreate the original splits; otherwise you end up with the same
72K files.
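
A condensed, hedged sketch of those steps in the shell (table names, paths, and
sizes are placeholders; exact flags may differ slightly on 1.10):

flush -t bigtable -w
clonetable bigtable bigtable_clone
compact -t bigtable_clone -w
offline -t bigtable_clone
# move the rfiles under /accumulo/tables/<clone-id>/ into hdfs staging directories here
deletetable -f bigtable_clone
createtable bigtable_new
config -t bigtable_new -s table.split.threshold=8G
addsplits -t bigtable_new -sf /tmp/splits.txt
table bigtable_new
importdirectory /staging/batch1 /staging/fail1 false
config -t bigtable_new -s table.split.threshold=5G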



Ed Coleman





From: Ligade, Shailesh [USA] 
Sent: Tuesday, February 15, 2022 1:11 PM
To: user@accumulo.apache.org
Subject: Re: accumulo 1.10.0 masters won't start



Well,



I am trying to merge a large table (8T with 72k tablets, with default tablet 
size 1G)



Since I keep getting those zookeeper errors related to size, I keep bumping
jute.maxbuffer, and now it is all the way to 8m



But I still can't merge even a small subset (-b and -e). Now the error is Xid
out of order. Got Xid xx with err -101 expected Xid yy for a packet with
details

Re: accumulo 1.10.0 masters won't start

2022-02-15 Thread Ligade, Shailesh [USA]
Well,

I am trying to merge a large table (8T with 72k tablets, with default tablet 
size 1G)

Since I keep getting those zookeeper errors related to size, I keep bumping
jute.maxbuffer, and now it is all the way to 8m

But I still can't merge even a small subset (-b and -e). Now the error is Xid
out of order. Got Xid xx with err -101 expected Xid yy for a packet with
details:

After this the master crashes.

Any suggestions on how to go about merging this large table?

Thanks

-S

From: Ligade, Shailesh [USA] 
Sent: Thursday, February 10, 2022 9:20 AM
To: user@accumulo.apache.org 
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

That saved the day. The confusing part about setting that property is the
documentation: whether it takes hex or bytes, etc. Even the example they
provided here

https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit

states they are setting the value to 2MB, but at first glance the value looks
like 200k (with five 0s); since the property is parsed as hex here, 0x200000 is
actually 2,097,152 bytes (2MB).



Add the following line to increase the file size limit to 2MB:

SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=0x200000"

-

Anyways, a master has been up and running for an hour now.. so I'm just trying
to understand what was changed, and I'll revert it after things stabilize.

Thanks a bunch.

-S


From: dev1 
Sent: Wednesday, February 9, 2022 5:54 PM
To: user@accumulo.apache.org 
Subject: [External] Re: accumulo 1.10.0 masters won't start

You might want to set the accumulo (zookeeper client) side - by setting 
ACCUMULO_JAVA_OPTS that is processed in accumulo-env.sh (or just edit that 
file?)

Looking at the Zookeeper documentation it describes what looks like you are 
seeing:

When jute.maxbuffer in the client side is less than the server side, the client 
wants to read the data exceeds jute.maxbuffer in the client side, the client 
side will get java.io.IOException: Unreasonable length or Packet len is out of 
range!

Also, a search showed jira tickets that had a server-side limit of 4MB but
client limits of 1MB - you may want to see if 4194304 (or larger) works as a
value.


From: dev1 
Sent: Wednesday, February 9, 2022 5:25 PM
To: user@accumulo.apache.org 
Subject: Re: accumulo 1.10.0 masters won't start

jute.maxbuffer is a ZooKeeper property - it needs to be set on the zookeeper 
configuration.  If this is still correct, then it looks like there are a few 
options 
https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit

But maybe the ZooKeeper documentation for your version can provide additional 
guidance?



From: Shailesh Ligade 
Sent: Wednesday, February 9, 2022 5:02 PM
To: user@accumulo.apache.org 
Subject: RE: accumulo 1.10.0 masters won't start


Thanks



Even if I set jute.maxbuffer on zookeeper in the conf/java.env file to



-Djute.maxbuffer=30



I see in the accumulo master log:



INFO: jute.maxbuffer value is 1048575 Bytes

Not sure where to set that on the accumulo side.



I set instance.zookeeper.timeout value to 90s in accumulo-site.xml



But I still get those zookeeper KeeperErrorCode errors



-S



From: dev1 
Sent: Wednesday, February 9, 2022 4:27 PM
To: user@accumulo.apache.org
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start



I would not recommend setting the goal state directly until there are no other
alternatives.



It is hard to recommend what to do, because it is unclear what put you into the 
current situation and what action / impact you might have had trying to fix 
things -



why did the goal state bec

Re: accumulo 1.10.0 masters won't start

2022-02-10 Thread Ligade, Shailesh [USA]
er - how far does it come up?  It will not be able to load 
the root / metadata tables, but it may give some indication of state,



I'd then cycle between stopping the master and trying to clean up things using
zkCli.sh, using any guidance from the errors the master is generating. If that
looks promising, then:



With the master stopped - start the tservers and check a few logs; if there are
exceptions, determine whether they are something pointing to an issue - or just
something transient and handled.



Once the tservers are up and looking okay - start the master.



One of the things to grab as soon as you can get the shell to run - get a 
listing of the tables and the ids.  If the worst happens, you can use that to 
map the existing data into a "new" instance. Hopefully it will not come to that 
and you will not need it - but if you don't have it and you need it, well... 
The table names and id are all in ZooKeeper.
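
For example, from the shell (credentials are placeholders):

accumulo shell -u root -p secret -e "tables -l" > table-ids.txt   # name-to-id listing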



Ed Coleman





From: Shailesh Ligade <slig...@fbi.gov>
Sent: Wednesday, February 9, 2022 3:47 PM
To: user@accumulo.apache.org
Subject: RE: accumulo 1.10.0 masters won't start



Thanks I can try that,



At this point my goal is to get accumulo up. I was just wondering: if I set a
different goal state like SAFE_MODE, will it come up by ignoring fate and other
issues? If it comes up, can I then switch back to NORMAL; will that work? I
understand there may be some data loss..



-S



From: dev1 <d...@etcoleman.com>
Sent: Wednesday, February 9, 2022 3:36 PM
To: user@accumulo.apache.org
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start



For values in zoo.cfg see: 
https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#sc_advancedConfiguration



maxSessionTimeout



In the accumulo config - #instance.zookeeper.timeout=30s



The zookeeper setting controls the max time that the ZK servers will grant - 
the accumulo setting is how much time accumulo will ask for.










From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Wednesday, February 9, 2022 3:03 PM
To: user@accumulo.apache.org
Subject: Re: accumulo 1.10.0 masters won't start



Thanks for the response,



No, I have not updated any timeout.



Is that going in zoo.cfg? I can see there is min/maxSessionTimeout 2/20; is
that what you are referring to?



-S



From: dev1 <d...@etcoleman.com>
Sent: Wednesday, February 9, 2022 2:51 PM
To: user@accumulo.apache.org
Subject: [External] Re: accumulo 1.10.0 masters won't start



Have you tried to increase the zoo session timeout value? I think it's 
zookeeper.session.timeout.ms





From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Wednesday, February 9, 2022 2:47 PM
To: user@accumulo.apache.org
Subject: Re: accumulo 1.10.0 masters won't start



Thanks



That fixed the goal state issue, but now I am still getting



Errors with zookeeper

e.g.



KeeperErrorCode = ConnectionLoss for



/accumulo//config/tserver.hold.time.max

/accumulo//tables

/accumulo//tables/1/name

/accumulo//fate

/accumulo//masters/goal_state



So it is all over

Re: accumulo 1.10.0 masters won't start

2022-02-09 Thread Ligade, Shailesh [USA]
Thanks for the response,

No, I have not updated any timeout.

Is that going in zoo.cfg? I can see there is min/maxSessionTimeout 2/20; is
that what you are referring to?

-S

From: dev1 
Sent: Wednesday, February 9, 2022 2:51 PM
To: user@accumulo.apache.org 
Subject: [External] Re: accumulo 1.10.0 masters won't start

Have you tried to increase the zoo session timeout value? I think it's 
zookeeper.session.timeout.ms


From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 9, 2022 2:47 PM
To: user@accumulo.apache.org 
Subject: Re: accumulo 1.10.0 masters won't start


Thanks



That fixed the goal state issue, but now I am still getting



Errors with zookeeper

e.g.



KeeperErrorCode = ConnectionLoss for


/accumulo//config/tserver.hold.time.max

/accumulo//tables

/accumulo//tables/1/name

/accumulo//fate

/accumulo//masters/goal_state



So it is all over the place… for some I see good values in zookeeper… so not sure..



-S


From: dev1 
Sent: Wednesday, February 9, 2022 2:22 PM
To: user@accumulo.apache.org 
Subject: [External] Re: accumulo 1.10.0 masters won't start

There is a utility - SetGoalState - that can be run from the command line:

accumulo SetGoalState NORMAL

(or SAFE_MODE, CLEAN_STOP)

It sets a value in ZK at /accumulo/<instance-id>/masters/goal_state

Ed Coleman


From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 9, 2022 1:54 PM
To: user@accumulo.apache.org 
Subject: Re: accumulo 1.10.0 masters won't start

Well,

I just went ahead and deleted fate in zookeeper and restarted the master. It
was doing better, but then I am getting a different error:

ERROR: Problem getting real goal state from zookeeper: 
java.lang.IllegalArgumentException: No enum constant 
org.apache.accumulo.core.master.thrift.MasterGoalState

I hope I didn't delete goal_state accidentally ...;-( Currently ls on
goal_state is []; is there a way to add some value there?

-S

From: dev1 
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org 
Subject: [External] Re: accumulo 1.10.0 masters won't start

Did you try setting the increased size in the zkCli.sh command (or wherever it
gets its environment from)?

The ZK docs indicate that it needs to be set to the same size on all servers 
and clients.

You should be able to use zkCli.sh to at least see what's going on - if that 
does not work, then it seems unlikely that the master would either.

Can you:

  *   list the nodes under /accumulo/[instance id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list nodes under any of the /accumulo/[instance_id]/fate/tx-#
  *   there should be a node named debug - doing a get on that should show the 
op name.
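
A hedged sketch of that inspection (instance id and tx id are placeholders;
CLIENT_JVMFLAGS as the way zkCli.sh picks up a larger client-side jute.maxbuffer
is an assumption about your ZooKeeper install):

CLIENT_JVMFLAGS="-Djute.maxbuffer=4194304" zkCli.sh -server zk1:2181
ls /accumulo/<instance-id>/fate
stat /accumulo/<instance-id>/fate/tx-<id>       # dataLength is the node size in bytes
ls /accumulo/<instance-id>/fate/tx-<id>
get /accumulo/<instance-id>/fate/tx-<id>/debug  # shows the fate op name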

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org 
Subject: Re: accumulo 1.10.0 masters won't start


Thanks



I added



-Djute.maxbuffer=3000



In conf/java.env and restarted all zookeepers, but I am still getting the same
error.. The documentation is kind of fuzzy on setting this property, as it
states it in hex (default 0xfffff), so I'm not 100% sure if 3000 is OK, but
at least I could see zookeeper was up.



-S


From: dev1 
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org 
Subject: [External] Re: accumulo 1.10.0 masters won't start

Does the monitor or any of the logs show errors that relate to exceeding the 
ZooKeeper jute buffer size?

If so, have you tried increasing the ZooKeeper jute.maxbuffer limit
(https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?

Ed Coleman



From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org 
Subject: accumulo 1.10.0 masters won't start

Hello,

Both my masters are stuck with errors on zookeeper:

IOException: Packet len 2791093 is out of range!
KeeperErrorCode = ConnectionLoss for /accumulo//fate


If I use zkCli to see what is under fate, I get

IOException Packet len 2791161 is out of range
Unable to read additional data from server sessionid , likely server has 
closed socket

hdfs fsck is all good

How can I clear this fate?

The master process is up and I can get into the accumulo shell, but there are
no fate transactions (fate print returns empty).

Any idea how to bring the master up?

Thanks

S


Re: accumulo 1.10.0 masters won't start

2022-02-09 Thread Ligade, Shailesh [USA]
Thanks

That fixed the goal state issue, but now I am still getting

Errors with zookeeper
e.g.

KeeperErrorCode = ConnectionLoss for

/accumulo//config/tserver.hold.time.max
/accumulo//tables
/accumulo//tables/1/name
/accumulo//fate
/accumulo//masters/goal_state

So it is all over the place… for some I see good values in zookeeper… so not sure..

-S


From: dev1 
Sent: Wednesday, February 9, 2022 2:22 PM
To: user@accumulo.apache.org 
Subject: [External] Re: accumulo 1.10.0 masters won't start

There is a utility - SetGoalState - that can be run from the command line:

accumulo SetGoalState NORMAL

(or SAFE_MODE, CLEAN_STOP)

It sets a value in ZK at /accumulo/<instance-id>/masters/goal_state

Ed Coleman


From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 9, 2022 1:54 PM
To: user@accumulo.apache.org 
Subject: Re: accumulo 1.10.0 masters won't start

Well,

I just went ahead and deleted fate in zookeeper and restarted the master. It
was doing better, but then I am getting a different error:

ERROR: Problem getting real goal state from zookeeper: 
java.lang.IllegalArgumentException: No enum constant 
org.apache.accumulo.core.master.thrift.MasterGoalState

I hope I didn't delete goal_state accidentally ...;-( Currently ls on
goal_state is []; is there a way to add some value there?

-S

From: dev1 
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org 
Subject: [External] Re: accumulo 1.10.0 masters won't start

Did you try setting the increased size in the zkCli.sh command (or wherever it
gets its environment from)?

The ZK docs indicate that it needs to be set to the same size on all servers 
and clients.

You should be able to use zkCli.sh to at least see what's going on - if that 
does not work, then it seems unlikely that the master would either.

Can you:

  *   list the nodes under /accumulo/[instance id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list nodes under any of the /accumulo/[instance_id]/fate/tx-#
  *   there should be a node named debug - doing a get on that should show the 
op name.

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org 
Subject: Re: accumulo 1.10.0 masters won't start


Thanks



I added



-Djute.maxbuffer=3000



In conf/java.env and restarted all zookeepers, but I am still getting the same
error.. The documentation is kind of fuzzy on setting this property, as it
states it in hex (default 0xfffff), so I'm not 100% sure if 3000 is OK, but
at least I could see zookeeper was up.



-S


From: dev1 
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org 
Subject: [External] Re: accumulo 1.10.0 masters won't start

Does the monitor or any of the logs show errors that relate to exceeding the 
ZooKeeper jute buffer size?

If so, have you tried increasing the ZooKeeper jute.maxbuffer limit
(https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?

Ed Coleman



From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org 
Subject: accumulo 1.10.0 masters won't start

Hello,

Both my masters are stuck with errors on zookeeper:

IOException: Packet len 2791093 is out of range!
KeeperErrorCode = ConnectionLoss for /accumulo//fate


If I use zkCli to see what is under fate, I get

IOException Packet len 2791161 is out of range
Unable to read additional data from server sessionid , likely server has 
closed socket

hdfs fsck is all good

How can I clear this fate?

The master process is up and I can get into the accumulo shell, but there are
no fate transactions (fate print returns empty).

Any idea how to bring the master up?

Thanks

S


Re: accumulo 1.10.0 masters won't start

2022-02-09 Thread Ligade, Shailesh [USA]
Well,

I just went ahead and deleted fate in zookeeper and restarted the master. It
was doing better, but then I am getting a different error:

ERROR: Problem getting real goal state from zookeeper: 
java.lang.IllegalArgumentException: No enum constant 
org.apache.accumulo.core.master.thrift.MasterGoalState

I hope I didn't delete goal_state accidentally ...;-( Currently ls on
goal_state is []; is there a way to add some value there?

-S

From: dev1 
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org 
Subject: [External] Re: accumulo 1.10.0 masters won't start

Did you try setting the increased size in the zkCli.sh command (or wherever it
gets its environment from)?

The ZK docs indicate that it needs to be set to the same size on all servers 
and clients.

You should be able to use zkCli.sh to at least see what's going on - if that 
does not work, then it seems unlikely that the master would either.

Can you:

  *   list the nodes under /accumulo/[instance id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list nodes under any of the /accumulo/[instance_id]/fate/tx-#
  *   there should be a node named debug - doing a get on that should show the 
op name.

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org 
Subject: Re: accumulo 1.10.0 masters won't start


Thanks



I added



-Djute.maxbuffer=3000



In conf/java.env and restarted all zookeepers, but I am still getting the same
error.. The documentation is kind of fuzzy on setting this property, as it
states it in hex (default 0xfffff), so I'm not 100% sure if 3000 is OK, but
at least I could see zookeeper was up.



-S


From: dev1 
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org 
Subject: [External] Re: accumulo 1.10.0 masters won't start

Does the monitor or any of the logs show errors that relate to exceeding the 
ZooKeeper jute buffer size?

If so, have you tried increasing the ZooKeeper jute.maxbuffer limit
(https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?

Ed Coleman



From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org 
Subject: accumulo 1.10.0 masters won't start

Hello,

Both my masters are stuck with errors on zookeeper:

IOException: Packet len 2791093 is out of range!
KeeperErrorCode = ConnectionLoss for /accumulo//fate


If I use zkCli to see what is under fate, I get

IOException Packet len 2791161 is out of range
Unable to read additional data from server sessionid , likely server has 
closed socket

hdfs fsck is all good

How can I clear this fate?

The master process is up and I can get into the accumulo shell, but there are
no fate transactions (fate print returns empty).

Any idea how to bring the master up?

Thanks

S


Re: accumulo 1.10.0 masters won't start

2022-02-09 Thread Ligade, Shailesh [USA]
Hmm, I guess I can't even list anything under fate without that error.

Yes, I updated java.env on all zookeepers.

Can I just delete the fate folder and recreate it, and see if the master comes up?

-S

From: dev1 
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org 
Subject: [External] Re: accumulo 1.10.0 masters won't start

Did you try setting the increased size in the zkCli.sh command (or wherever it
gets its environment from)?

The ZK docs indicate that it needs to be set to the same size on all servers 
and clients.

You should be able to use zkCli.sh to at least see what's going on - if that 
does not work, then it seems unlikely that the master would either.

Can you:

  *   list the nodes under /accumulo/[instance id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list nodes under any of the /accumulo/[instance_id]/fate/tx-#
  *   there should be a node named debug - doing a get on that should show the 
op name.

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org 
Subject: Re: accumulo 1.10.0 masters won't start


Thanks



I added



-Djute.maxbuffer=3000



In conf/java.env and restarted all zookeepers, but I am still getting the same
error.. The documentation is kind of fuzzy on setting this property, as it
states it in hex (default 0xfffff), so I'm not 100% sure if 3000 is OK, but
at least I could see zookeeper was up.



-S


From: dev1 
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org 
Subject: [External] Re: accumulo 1.10.0 masters won't start

Does the monitor or any of the logs show errors that relate to exceeding the 
ZooKeeper jute buffer size?

If so, have you tried increasing the ZooKeeper jute.maxbuffer limit
(https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?

Ed Coleman



From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org 
Subject: accumulo 1.10.0 masters won't start

Hello,

Both my masters are stuck with errors on zookeeper:

IOException: Packet len 2791093 is out of range!
KeeperErrorCode = ConnectionLoss for /accumulo//fate


If I use zkCli to see what is under fate, I get

IOException Packet len 2791161 is out of range
Unable to read additional data from server sessionid , likely server has 
closed socket

hdfs fsck is all good

How can I clear this fate?

The master process is up and I can get into the accumulo shell, but there are
no fate transactions (fate print returns empty).

Any idea how to bring the master up?

Thanks

S


Re: accumulo 1.10.0 masters won't start

2022-02-09 Thread Ligade, Shailesh [USA]
Thanks

I added

-Djute.maxbuffer=3000

In conf/java.env and restarted all zookeepers, but I am still getting the same
error.. The documentation is kind of fuzzy on setting this property, as it
states it in hex (default 0xfffff), so I'm not 100% sure if 3000 is OK, but
at least I could see zookeeper was up.

-S


From: dev1 
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org 
Subject: [External] Re: accumulo 1.10.0 masters won't start

Does the monitor or any of the logs show errors that relate to exceeding the 
ZooKeeper jute buffer size?

If so, have you tried increasing the ZooKeeper jute.maxbuffer limit
(https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?

Ed Coleman



From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org 
Subject: accumulo 1.10.0 masters won't start

Hello,

Both my masters are stuck with errors on zookeeper:

IOException: Packet len 2791093 is out of range!
KeeperErrorCode = ConnectionLoss for /accumulo//fate


If I use zkCli to see what is under fate, I get

IOException Packet len 2791161 is out of range
Unable to read additional data from server sessionid , likely server has 
closed socket

hdfs fsck is all good

How can I clear this fate?

The master process is up and I can get into the accumulo shell, but there are
no fate transactions (fate print returns empty).

Any idea how to bring the master up?

Thanks

S


accumulo 1.10.0 masters won't start

2022-02-09 Thread Ligade, Shailesh [USA]
Hello,

Both my masters are stuck with errors on zookeeper:

IOException: Packet len 2791093 is out of range!
KeeperErrorCode = ConnectionLoss for /accumulo//fate


If I use zkCli to see what is under fate, I get

IOException Packet len 2791161 is out of range
Unable to read additional data from server sessionid , likely server has 
closed socket

hdfs fsck is all good

How can I clear this fate?

The master process is up and I can get into the accumulo shell, but there are
no fate transactions (fate print returns empty).

Any idea how to bring the master up?

Thanks

S


Re: tablets per tablet server for accumulo 1.10.0

2022-02-04 Thread Ligade, Shailesh [USA]
Thank you,

Will a range compaction (compact -t <> --begin-row <> --end-row <>) be faster
than just compact -t <>? My worry is that if I somehow issue 72k compact
commands at once, it will kill the system.
On that note, what is the best way to issue these compact commands, especially
because there are so many of them? I saw that accumulo shell -u <> -p <> -e
'compact ...,compact...,compact,' will work; I just don't know how many I can
tack onto one shell command.. Is there a better way of doing all this? I mean, I
want to be as gentle to my production system and yet as fast as possible.. I
don't want to spend days doing compact/merge
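
One gentler pattern is a small driver script; a hedged sketch (table name,
credentials, and the sleep interval are placeholders, and it assumes the
getsplits output is printable text):

accumulo shell -u root -p secret -e "getsplits -t bigtable" > /tmp/splits.txt
prev=""
n=0
while IFS= read -r row; do
  if [ -z "$prev" ]; then
    accumulo shell -u root -p secret -e "compact -t bigtable -e '$row' -w"
  else
    accumulo shell -u root -p secret -e "compact -t bigtable -b '$prev' -e '$row' -w"
  fi
  prev="$row"
  n=$((n+1))
  [ $((n % 100)) -eq 0 ] && sleep 60   # pause after every 100 ranges
done < /tmp/splits.txt
accumulo shell -u root -p secret -e "compact -t bigtable -b '$prev' -w"   # final open-ended range

Because each command uses -w, the ranges run one at a time instead of 72k
compactions being queued at once.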

Thanks

-S


From: dev1 
Sent: Tuesday, February 1, 2022 8:53 AM
To: 'user@accumulo.apache.org' 
Subject: [External] RE: tablets per tablet server for accumulo 1.10.0


Before.  That has the benefit that file sizes are reduced (if data is eligible 
for age off) and the merge is operating on current file sizes.



From: Ligade, Shailesh [USA] 
Sent: Tuesday, February 1, 2022 7:49 AM
To: 'user@accumulo.apache.org' 
Subject: Re: tablets per tablet server for accumulo 1.10.0



Thank you for the explanation!



Once I ran getsplits it was clear that the splits were the culprit, so I need
to do a merge as well as bump the threshold to a higher number, as you have
suggested.



If I have to perform a major compaction, should I do it before the merge or
after the merge?



Thanks again,



-S







From: dev1 mailto:d...@etcoleman.com>>
Sent: Monday, January 31, 2022 1:14 PM
To: 'user@accumulo.apache.org' 
mailto:user@accumulo.apache.org>>
Subject: [External] RE: tablets per tablet server for accumulo 1.10.0



You can get the hdfs size using standard hdfs commands – count or ls.  As long 
as you have not cloned the table, the size of the hdfs files and the space 
occupied by the table are equivalent.



You can also get a sense of the referenced files by examining the metadata
table – the column qualifier file: will give you just the referenced files. You
can look at the directories: b-xxx are from a bulk import, and t-xxx directories
hold files assigned to the tablets.  Also, bulk import file names start with
I-xx; files from compactions will be A-xx if from a full major compaction,
C-xxx from a partial major compaction, and F-xx is the result of a flush (a
minor compaction). You can look at the entries for the files – the numbers in
the value are the number of entries and the file size.
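
For example, a hedged one-liner from the shell (table id "x" is a placeholder;
metadata rows for a table span x; through x<):

scan -t accumulo.metadata -b x; -e x< -c file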



How do you ingest? Bulk or continuous?  On a bulk ingest, the imported files
end up in /accumulo/table/x/b-x and then are assigned to tablets – the
directories for the tablets will be created, but will be “empty” until a
compaction occurs.  A compaction will copy from the files referenced by the
tablets into a new file that will be placed into the corresponding
/accumulo/table/x/t-xx directory.  When a bulk imported file is no longer
referenced by any tablets, it will get garbage collected; until then the file
will exist and inflate the actual space used by the table. The compaction will
also remove any data that is past the TTL for the records.



Do you ever run a compaction?  With a very large number of tablets, you may 
want to run the compaction in parts so that you don’t end up occupying all of 
the compaction slots for a long time.



Are you using keys (row ids) that are always increasing? A typical example
would be a date.  Say some of your row ids are yyyy-mm-dd-hh and there is a 10
day TTL.  What will happen is that new data will continue to create new tablets,
and on compaction the old tablets will age off and have 0 size.  You can remove
the “unused splits” by running a merge.  Anything that creates new row ids that
are ordered can do this – new splits are necessary and the old splits eventually
become unnecessary; if the row ids are distributed across the splits it will not
do this. It is not necessarily a problem if this is what your data looks like,
just something that you may want to manage with merges.



There is usually not much benefit having a large number of tablets for a single 
table on a server.  You can reduce the number of tablets required by setting 
the split threshold to a larger number and then running a merge.  This can be 
done in sections, and you should run a compaction on the section first.



If you have recently compacted, you can figure out the rough number of tablets 
necessary by taking hdfs size / split threshold = number of tablets.   If you 
increase the split threshold size you will need fewer tablets.  You may also 
consider setting a split threshold that is larger than your target – say you 
decided that 5G was a good target; setting the threshold to 8G during the 
merge and then setting it to 5G when completed will cause the table to split – 
and it could give you a better distribution of data in the splits.
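
As a minimal sketch of that sequence (table name, sizes, and credentials are 
illustrative):

accumulo shell -u root -p secret -e "config -t mytable -s table.split.threshold=8G"
accumulo shell -u root -p secret -e "compact -t mytable -w"
accumulo shell -u root -p secret -e "merge -t mytable -s 8G"
accumulo shell -u root -p secret -e "config -t mytable -s table.split.threshold=5G"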



This can be done while things are running, but it will be a heavy IO load 
(files and on the hdfs namenode) and can take a very long time. What can be 
useful is to use the getSplits command with the max-splits option and create a 
script that compacts, then merges a section – using the splits as start / end 
rows for the compaction and merge commands.

Re: tablets per tablet server for accumulo 1.10.0

2022-02-01 Thread Ligade, Shailesh [USA]
Thank you for the explanation!

Once I ran getsplits it was clear that splits were the culprit, so I need to do 
a merge as well as bump the threshold to a higher number, as you have suggested.

If I have to perform a major compaction, should I do it before or after the 
merge?

Thanks again,

-S



From: dev1 
Sent: Monday, January 31, 2022 1:14 PM
To: 'user@accumulo.apache.org' 
Subject: [External] RE: tablets per tablet server for accumulo 1.10.0


You can get the hdfs size using standard hdfs commands – count or ls.  As long 
as you have not cloned the table, the size of the hdfs files and the space 
occupied by the table are equivalent.



You can also get a sense of the referenced files by examining the metadata table – 
the column qualifier file: will give you just the referenced files. You can also 
look at the directories: b-xxx directories are from a bulk import and t-xxx 
directories are assigned to tablets.  Also, bulk import file names start with I-xx, 
files from compactions will be A-xx if from a full major compaction or C-xxx 
if from a partial major compaction, and F-xx is the result of a flush (minor 
compaction). You can look at the entries for the files – the value holds the 
file size and number of entries.



How do you ingest? Bulk or continuous?  On a bulk ingest, the imported files 
end up in /accumulo/table/x/b-x and then are assigned to tablets – the 
directories for the tablets will be created, but will be “empty” until a 
compaction occurs.  A compaction will copy from the files referenced by the 
tablets into a new file that will be placed into the corresponding 
/accumulo/table/x/t-xx directory.  When a bulk imported file is no longer 
referenced by any tablets, it will get garbage collected; until then the file 
will exist and inflate the actual space used by the table. The compaction will 
also remove any data that is past the TTL for the records.



Do you ever run a compaction?  With a very large number of tablets, you may 
want to run the compaction in parts so that you don’t end up occupying all of 
the compaction slots for a long time.



Are you using keys (row ids) that are always increasing? A typical example 
would be a date.  Say some of your row ids are yyyy-mm-dd-hh and there is a 10 
day TTL.  What will happen is that new data will continue to create new 
tablets, and on compaction the old tablets will age off and have 0 size.  You 
can remove the “unused splits” by running a merge.  Anything that creates new, 
ordered row ids can do this – new splits are necessary and the old splits 
eventually become unnecessary; if the row ids are distributed across the 
splits it will not do this. It is not necessarily a problem if this is what 
your data looks like, just something that you may want to manage with merges.



There is usually not much benefit having a large number of tablets for a single 
table on a server.  You can reduce the number of tablets required by setting 
the split threshold to a larger number and then running a merge.  This can be 
done in sections, and you should run a compaction on the section first.



If you have recently compacted, you can figure out the rough number of tablets 
necessary by taking hdfs size / split threshold = number of tablets.   If you 
increase the split threshold size you will need fewer tablets.  You may also 
consider setting a split threshold that is larger than your target – say you 
decided that 5G was a good target; setting the threshold to 8G during the 
merge and then setting it to 5G when completed will cause the table to split – 
and it could give you a better distribution of data in the splits.



This can be done while things are running, but it will be a heavy IO load 
(files and on the hdfs namenode) and can take a very long time. What can be 
useful is to use the getSplits command with the max-splits option and create a 
script that compacts, then merges a section – using the splits as start / end 
rows for the compaction and merge commands.



Ed Coleman



From: Ligade, Shailesh [USA] 
Sent: Monday, January 31, 2022 11:16 AM
To: user@accumulo.apache.org
Subject: tablets per tablet server for accumulo 1.10.0



Hello,



table.split.threshold is set to the default 1G (except for metadata and root, 
which are set to 64M).

What can cause the tablets-per-tablet-server count to go high? Within a week, 
that count jumped from 5k/tablet server to 23k/tablet server, even though the 
total size in hdfs has not changed.

Is a high count a cause for concern?

We didn't apply any splits. I did a dumpConfig and checked all my tables and 
didn't see splits either.



Is there a way to find tablet size in hdfs? When I look at hdfs 
/accumulo/table/x/ I see some empty folders, meaning not all folders have rf 
files. Is that normal?



Thanks in advance!



-S


tablets per tablet server for accumulo 1.10.0

2022-01-31 Thread Ligade, Shailesh [USA]
Hello,

table.split.threshold is set to the default 1G (except for metadata and root, 
which are set to 64M).
What can cause the tablets-per-tablet-server count to go high? Within a week, 
that count jumped from 5k/tablet server to 23k/tablet server, even though the 
total size in hdfs has not changed.
Is a high count a cause for concern?
We didn't apply any splits. I did a dumpConfig and checked all my tables and 
didn't see splits either.

Is there a way to find tablet size in hdfs? When I look at hdfs 
/accumulo/table/x/ I see some empty folders, meaning not all folders have rf 
files. Is that normal?

Thanks in advance!

-S


uassigned tablets issue

2022-01-25 Thread Ligade, Shailesh [USA]
Hello,

For some strange reason, the monitor master page (version 1.10.0) shows the 
unassigned tablet count going up and down, typically up to 4 and back to 0.
The master log shows the unassigned tablet count and says it is waiting for it 
to go to zero.

This just started happening; is this a cause for concern? I checked for tablet 
server locks, ran checkForMetadataProblems, and tried the other usual 
debugging commands, and everything looks good.

-S


Re: issue with OS user using shell

2022-01-21 Thread Ligade, Shailesh [USA]
Hmm, I didn't see any -D option on the shell.

If I try

bin/accumulo -Dlog4j.configuration=/tmp/log4j.properties shell

it throws an error, as the value is not a classpath entry, and if I give -D 
after shell, it says shell does not have that option.

What is the correct way?
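
One hedged guess at what may work: log4j 1.2's log4j.configuration property 
expects a URL rather than a bare path, and the 1.x launch scripts pass extra 
JVM options from ACCUMULO_GENERAL_OPTS, so something like the following may 
pick up a per-user config (the file path is illustrative):

ACCUMULO_GENERAL_OPTS="-Dlog4j.configuration=file:///tmp/log4j.properties" \
  bin/accumulo shell -u root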

-S

From: dev1 
Sent: Friday, January 21, 2022 8:31 AM
To: 'user@accumulo.apache.org' 
Subject: [External] RE: issue with OS user using shell


Did you try to pass a log4j configuration using -D option on the command line?



Ed Coleman



From: Ligade, Shailesh [USA] 
Sent: Friday, January 21, 2022 8:19 AM
To: user@accumulo.apache.org
Subject: issue with OS user using shell



Hello,



When a regular OS user uses the shell command, it looks like that OS user needs 
access to the accumulo log directory. Otherwise it throws a FileNotFound 
exception for the log file (tserver/monitor/master/gc/tracer - all the log 
files).



If I use the OS root user or the OS accumulo user (the user who runs accumulo), 
there is no issue with the shell command.



I obviously don't want to open the log directory to all OS users or turn off 
shell logging. Is there a way around it? I didn't see a shell command option 
for accepting a log4j.conf so that I can use a different file just for the 
shell.





Thanks



-S


issue with OS user using shell

2022-01-21 Thread Ligade, Shailesh [USA]
Hello,

When a regular OS user uses the shell command, it looks like that OS user needs 
access to the accumulo log directory. Otherwise it throws a FileNotFound 
exception for the log file (tserver/monitor/master/gc/tracer - all the log 
files).

If I use the OS root user or the OS accumulo user (the user who runs accumulo), 
there is no issue with the shell command.

I obviously don't want to open the log directory to all OS users or turn off 
shell logging. Is there a way around it? I didn't see a shell command option 
for accepting a log4j.conf so that I can use a different file just for the 
shell.


Thanks

-S


Re: replication table offline issue

2022-01-04 Thread Ligade, Shailesh [USA]
Thanks Ed/Chris,

Yes, the replication table is hosted on only one tablet server, and I looked at 
that tablet server; there are no errors.

I am seeing all these errors in the active master server log.

Checking the replication table state with zkCli showed it is online.

I guess I will restart the masters; if that doesn't work, I will restart just 
the one tserver holding the replication tablet, and if that fails, I will 
restart the entire cluster.

I will remove the write grant from root and the replication user.

Appreciated your help

-S

From: dev1 
Sent: Tuesday, January 4, 2022 1:18 PM
To: 'user@accumulo.apache.org' 
Subject: [External] RE: replication table offline issue


Deleting / recreating the replication table should not be necessary and in any 
case you very likely cannot delete / create the accumulo.replication table – 
the shell will error on the delete because it is in the accumulo namespace.



Is the replication table hosted on a single tserver?  Are there any exceptions 
in the log for that server (or in the logs of any of the tservers hosting it, 
if it is hosted across multiple tservers)?



Have you restarted the client? It looks like the exception fragment has client 
in the classname. In what log is that exception occurring?



You can try restarting the master(s)



The monitor shows the replication table is online? Can you check in ZooKeeper 
(using the zkCli.sh)

  *   get /accumulo/[instance id]/tables/+rep/state



That should return the text ONLINE
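
For example (a minimal sketch; the ZooKeeper host and instance id are 
placeholders, and the instance id can be read from the monitor page):

/path/to/zookeeper/bin/zkCli.sh -server zk1:2181
# then, at the zk prompt:
get /accumulo/<instance-id>/tables/+rep/state
# expected value: ONLINE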



If the replication table is on a single tserver, then you might be able to just 
restart that server rather than needing to do a rolling restart of the cluster. 
If there are no errors in the tserver log, this seems unlikely to help.



Ed Coleman



From: Ligade, Shailesh [USA] 
Sent: Tuesday, January 4, 2022 12:24 PM
To: user@accumulo.apache.org
Subject: Re: replication table offline issue



Sorry this is for accumulo 1.10.0



I am wondering whether there is a way to delete and recreate the 
accumulo.replication table. I know it is a bit of a special table, so…



Also, will restarting the entire cluster solve this? Or maybe just restarting 
the accumulo master?



Since a rolling restart of tservers is a bit of a lengthy process for us, I 
just wanted to make sure whether it is likely to resolve this or not.



-S



From: Ligade, Shailesh [USA]
Sent: Tuesday, January 4, 2022 11:27 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org> 
mailto:user@accumulo.apache.org>>
Subject: replication table offline issue



Hello,



I set up replication and ran 'online accumulo.replication'; however, in the 
master log I keep getting an error stating accumulo.replication is offline. I 
can scan the accumulo.replication table; it has no data at all.

the error is:

-



WARN Failed to write work mutations for replication, will retry

client.MutationRejectedException: # constraint violations: 0 security codes: {} 
# server errors 0 # exceptions 6

at xxxclient.impl.TabletServerBatchWriter.checkForFailures



caused by TableOfflineException: Table accumulo.replication (+rep) is offline

---



There are no constraints that I am using on any table.

I added grants for root as well as my replication user for accumulo.replication 
Table.WRITE (there was only Table.READ before).

If I run offline accumulo.replication I can see it is offline, and then I can 
bring it online again; however, I still keep getting the error.



Any suggestion on how to fix this?



Thanks



-S






Re: replication table offline issue

2022-01-04 Thread Ligade, Shailesh [USA]
Sorry this is for accumulo 1.10.0

I am wondering whether there is a way to delete and recreate the 
accumulo.replication table. I know it is a bit of a special table, so…

Also, will restarting the entire cluster solve this? Or maybe just restarting 
the accumulo master?

Since a rolling restart of tservers is a bit of a lengthy process for us, I 
just wanted to make sure whether it is likely to resolve this or not.

-S

From: Ligade, Shailesh [USA]
Sent: Tuesday, January 4, 2022 11:27 AM
To: user@accumulo.apache.org 
Subject: replication table offline issue

Hello,

I set up replication and ran 'online accumulo.replication'; however, in the 
master log I keep getting an error stating accumulo.replication is offline. I 
can scan the accumulo.replication table; it has no data at all.

the error is:
-

WARN Failed to write work mutations for replication, will retry
client.MutationRejectedException: # constraint violations: 0 security codes: {} 
# server errors 0 # exceptions 6
at xxxclient.impl.TabletServerBatchWriter.checkForFailures

caused by TableOfflineException: Table accumulo.replication (+rep) is offline
---

There are no constraints that I am using on any table.
I added grants for root as well as my replication user for accumulo.replication 
Table.WRITE (there was only Table.READ before).
If I run offline accumulo.replication I can see it is offline, and then I can 
bring it online again; however, I still keep getting the error.

Any suggestion on how to fix this?

Thanks

-S




replication table offline issue

2022-01-04 Thread Ligade, Shailesh [USA]
Hello,

I set up replication and ran 'online accumulo.replication'; however, in the 
master log I keep getting an error stating accumulo.replication is offline. I 
can scan the accumulo.replication table; it has no data at all.

the error is:
-

WARN Failed to write work mutations for replication, will retry
client.MutationRejectedException: # constraint violations: 0 security codes: {} 
# server errors 0 # exceptions 6
at xxxclient.impl.TabletServerBatchWriter.checkForFailures

caused by TableOfflineException: Table accumulo.replication (+rep) is offline
---

There are no constraints that I am using on any table.
I added grants for root as well as my replication user for accumulo.replication 
Table.WRITE (there was only Table.READ before).
If I run offline accumulo.replication I can see it is offline, and then I can 
bring it online again; however, I still keep getting the error.

Any suggestion on how to fix this?

Thanks

-S




RE: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

2021-12-27 Thread Ligade, Shailesh [USA]
Thanks,

Just a quick question. The steps identified worked; however, I noticed that if 
replication is turned on and I set table.suspend.duration=5m, then stop and 
reboot a tserver, I get a lot of replication messages in the master log. Since 
ingest is turned off, I thought I would not see much replication. Do I need to 
turn off replication while doing the rolling restart? Will it have any adverse 
effects?

-S

From: Mike Miller 
Sent: Thursday, December 2, 2021 10:39 AM
To: user@accumulo.apache.org
Subject: Re: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling 
restart

Some things to keep in mind... The Master will wait the table.suspend.duration 
before reassigning the SUSPENDED tablets to new tservers. With 
table.suspend.duration set > 0, a tablet will go from HOSTED to SUSPENDED if 
its tserver is shut down. It will then stay SUSPENDED until its old tserver is 
available or table.suspend.duration has passed. If table.suspend.duration has 
passed before its tserver has returned, it will then be UNASSIGNED. Once a 
tablet is UNASSIGNED it won't enter the SUSPENDED state.

On Thu, Dec 2, 2021 at 9:43 AM Ligade, Shailesh [USA] 
mailto:ligade_shail...@bah.com>> wrote:
Thanks Mike,

If I set the value to 0s (the default) or to 5m, when I restart a tserver (it 
is pretty quick, on the order of seconds), I still get unassigned tablets on 
the monitor page. My understanding is that with that setting at 5m (or 200s, 
etc.), the master will wait that much time before starting to move unassigned 
tablets. In my situation, the unassigned tablet count goes back to zero only 
after a long time, and hence rolling restarts take a lot longer (hours in most 
cases – depends on how many tablets/tserver).

This setting appears to be working on accumulo 2.0.1, but since that is not my 
prod version I have not tested it completely.

Thanks

-S
From: Mike Miller mailto:mmil...@apache.org>>
Sent: Thursday, December 2, 2021 9:38 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

When you say "since that setting (table.suspend.duration) is not working for me 
in accumulo 1.10.0" do you mean that the feature is not helping to solve your 
problem? Or that the feature is not working and there could be a bug?

On Thu, Dec 2, 2021 at 8:00 AM Ligade, Shailesh [USA] 
mailto:ligade_shail...@bah.com>> wrote:
Thanks for the detailed steps! Really appreciated.

Just curious: since that setting (table.suspend.duration) is not working for 
me in accumulo 1.10.0, can I just stop both masters and then restart the 
tservers one at a time (or all at once)? Will that speed up the restart 
without getting into an offline-tablet or data-loss type situation? I can stop 
the ingest, flush the tables, and then bring down the masters…

We can take a short downtime, and my understanding is that the master is the 
one keeping track of the tservers and the offline-tablet situation. So just 
curious…

Thanks again

-S

From: dev1 mailto:d...@etcoleman.com>>
Sent: Monday, November 29, 2021 2:56 PM
To: 'user@accumulo.apache.org<mailto:user@accumulo.apache.org>' 
mailto:user@accumulo.apache.org>>
Subject: [External] RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

I believe the property is table.suspend.duration (not tablet.suspended.duration 
as you have in this email) – but the shell should have thrown an error saying 
the property cannot be set in zookeeper if you had it wrong.

What do you mean by:

but when i issued restart tserver (one at a time without waiting for first to 
come up)

I’m assuming the requirement is to keep the cluster up and serving users 
without major disruption – not to rip through the restart as fast as possible.  
With 6 – 8 nodes you should still be able to do this in under an hour.  If you 
had a much larger cluster then the concept is the same but you would want to 
use some number of tservers that is a fraction of the total available that 
would be cycled at any given point in time.

In general the way that I would do a conservative, rolling restart:


  1.  [optional] pause ingest – or be prepared for recovering any failed 
ingests if they occur.
  2.  [optional] Flush tables that have continuous ingest using the wait option 
– this should help minimize recovery.
  3.  Set the table.suspend.duration
  4.  For each tserver – one (or a small group for large cluster) at a time

 *   Stop the tserver
 *   Pause long enough that ZooKeeper recognizes the lost connection
 *   Restart the tserver
 *   Pause to allow for any recovery

  5.  Reset the table.suspend.duration back to 0s (the default)

If you tail the master / manager debug log you should get a good idea of what 
is going on – there should be messages showing the tserver leaving and then 
rejoining and any other activity related to recovery.  With a rolling restart 
the idea is to keep the cluster up and serving tables – only one (or a few) 
tservers go offline and for a short duration (generally less than a minute), 
and between each tserver restart, time is allowed for things to stabilize.
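
A minimal scripted sketch of the steps above, assuming a systemd unit named 
accumulo-tserver (the unit name is illustrative), the conf/slaves host list, 
passwordless ssh to each node, and placeholder shell credentials:

accumulo shell -u root -p secret -e "config -s table.suspend.duration=5m"
for h in $(cat conf/slaves); do
  ssh "$h" sudo systemctl stop accumulo-tserver
  sleep 30   # long enough for ZooKeeper to notice the lost lock
  ssh "$h" sudo systemctl start accumulo-tserver
  sleep 60   # allow any recovery to settle before the next node
done
accumulo shell -u root -p secret -e "config -s table.suspend.duration=0s"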

RE: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

2021-12-02 Thread Ligade, Shailesh [USA]
Thanks Mike,

So let me see if I understood this,

No matter what this suspend.duration setting is, we will always see unassigned 
tablets on the monitor page when a tserver goes down.

If the setting is high enough, the master basically assigns the same old 
tablets back to that tserver when it is back online, and thus will not move 
any tablets around.

If the duration is the default (0s) or short, the master will start 
reassigning tablets to other tservers. And when the original tserver comes 
back up, the master will try to rebalance tablets (it may not get the old 
tablets back).

And thus having that setting high enough will make things recover faster.

Thanks

-S

From: Mike Miller 
Sent: Thursday, December 2, 2021 10:39 AM
To: user@accumulo.apache.org
Subject: Re: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling 
restart

Some things to keep in mind... The Master will wait the table.suspend.duration 
before reassigning the SUSPENDED tablets to new tservers. With 
table.suspend.duration set > 0, a tablet will go from HOSTED to SUSPENDED if 
its tserver is shut down. It will then stay SUSPENDED until its old tserver is 
available or table.suspend.duration has passed. If table.suspend.duration has 
passed before its tserver has returned, it will then be UNASSIGNED. Once a 
tablet is UNASSIGNED it won't enter the SUSPENDED state.

On Thu, Dec 2, 2021 at 9:43 AM Ligade, Shailesh [USA] 
mailto:ligade_shail...@bah.com>> wrote:
Thanks Mike,

If I set the value to 0s (the default) or to 5m, when I restart a tserver (it 
is pretty quick, on the order of seconds), I still get unassigned tablets on 
the monitor page. My understanding is that with that setting at 5m (or 200s, 
etc.), the master will wait that much time before starting to move unassigned 
tablets. In my situation, the unassigned tablet count goes back to zero only 
after a long time, and hence rolling restarts take a lot longer (hours in most 
cases – depends on how many tablets/tserver).

This setting appears to be working on accumulo 2.0.1, but since that is not my 
prod version I have not tested it completely.

Thanks

-S
From: Mike Miller mailto:mmil...@apache.org>>
Sent: Thursday, December 2, 2021 9:38 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

When you say "since that setting (table.suspend.duration) is not working for me 
in accumulo 1.10.0" do you mean that the feature is not helping to solve your 
problem? Or that the feature is not working and there could be a bug?

On Thu, Dec 2, 2021 at 8:00 AM Ligade, Shailesh [USA] 
mailto:ligade_shail...@bah.com>> wrote:
Thanks for the detailed steps! Really appreciated.

Just curious: since that setting (table.suspend.duration) is not working for 
me in accumulo 1.10.0, can I just stop both masters and then restart the 
tservers one at a time (or all at once)? Will that speed up the restart 
without getting into an offline-tablet or data-loss type situation? I can stop 
the ingest, flush the tables, and then bring down the masters…

We can take a short downtime, and my understanding is that the master is the 
one keeping track of the tservers and the offline-tablet situation. So just 
curious…

Thanks again

-S

From: dev1 mailto:d...@etcoleman.com>>
Sent: Monday, November 29, 2021 2:56 PM
To: 'user@accumulo.apache.org<mailto:user@accumulo.apache.org>' 
mailto:user@accumulo.apache.org>>
Subject: [External] RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

I believe the property is table.suspend.duration (not tablet.suspended.duration 
as you have in this email) – but the shell should have thrown an error saying 
the property cannot be set in zookeeper if you had it wrong.

What do you mean by:

but when i issued restart tserver (one at a time without waiting for first to 
come up)

I’m assuming the requirement is to keep the cluster up and serving users 
without major disruption – not to rip through the restart as fast as possible.  
With 6 – 8 nodes you should still be able to do this in under an hour.  If you 
had a much larger cluster then the concept is the same but you would want to 
use some number of tservers that is a fraction of the total available that 
would be cycled at any given point in time.

In general the way that I would do a conservative, rolling restart:


  1.  [optional] pause ingest – or be prepared for recovering any failed 
ingests if they occur.
  2.  [optional] Flush tables that have continuous ingest using the wait option 
– this should help minimize recovery.
  3.  Set the table.suspend.duration
  4.  For each tserver – one (or a small group for large cluster) at a time

 *   Stop the tserver
 *   Pause long enough that ZooKeeper recognizes the lost connection
 *   Restart the tserver
 *   Pause to allow for any recovery

  5.  Reset the table.suspend.duration back to 0s (the default)

If you tail the master / manager debug log you should get a good idea of what 
is going on – there should be messages showing the tserver leaving and then 
rejoining and any other activity related to recovery.

RE: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

2021-12-02 Thread Ligade, Shailesh [USA]
Thanks Mike,

If I set the value to 0s (the default) or to 5m, when I restart a tserver (it 
is pretty quick, on the order of seconds), I still get unassigned tablets on 
the monitor page. My understanding is that with that setting at 5m (or 200s, 
etc.), the master will wait that much time before starting to move unassigned 
tablets. In my situation, the unassigned tablet count goes back to zero only 
after a long time, and hence rolling restarts take a lot longer (hours in most 
cases – depends on how many tablets/tserver).

This setting appears to be working on accumulo 2.0.1, but since that is not my 
prod version I have not tested it completely.

Thanks

-S
From: Mike Miller 
Sent: Thursday, December 2, 2021 9:38 AM
To: user@accumulo.apache.org
Subject: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

When you say "since that setting (table.suspend.duration) is not working for me 
in accumulo 1.10.0" do you mean that the feature is not helping to solve your 
problem? Or that the feature is not working and there could be a bug?

On Thu, Dec 2, 2021 at 8:00 AM Ligade, Shailesh [USA] 
mailto:ligade_shail...@bah.com>> wrote:
Thanks for the detailed steps! Really appreciated.

Just curious: since that setting (table.suspend.duration) is not working for 
me in accumulo 1.10.0, can I just stop both masters and then restart the 
tservers one at a time (or all at once)? Will that speed up the restart 
without getting into an offline-tablet or data-loss type situation? I can stop 
the ingest, flush the tables, and then bring down the masters…

We can take a short downtime, and my understanding is that the master is the 
one keeping track of the tservers and the offline-tablet situation. So just 
curious…

Thanks again

-S

From: dev1 mailto:d...@etcoleman.com>>
Sent: Monday, November 29, 2021 2:56 PM
To: 'user@accumulo.apache.org<mailto:user@accumulo.apache.org>' 
mailto:user@accumulo.apache.org>>
Subject: [External] RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

I believe the property is table.suspend.duration (not tablet.suspended.duration 
as you have in this email) – but the shell should have thrown an error saying 
the property cannot be set in zookeeper if you had it wrong.

What do you mean by:

but when i issued restart tserver (one at a time without waiting for first to 
come up)

I’m assuming the requirement is to keep the cluster up and serving users 
without major disruption – not to rip through the restart as fast as possible.  
With 6 – 8 nodes you should still be able to do this in under an hour.  If you 
had a much larger cluster then the concept is the same but you would want to 
use some number of tservers that is a fraction of the total available that 
would be cycled at any given point in time.

In general the way that I would do a conservative, rolling restart:


  1.  [optional] pause ingest – or be prepared for recovering any failed 
ingests if they occur.
  2.  [optional] Flush tables that have continuous ingest using the wait option 
– this should help minimize recovery.
  3.  Set the table.suspend.duration
  4.  For each tserver – one (or a small group for large cluster) at a time

 *   Stop the tserver
 *   Pause long enough that ZooKeeper recognizes the lost connection
 *   Restart the tserver
 *   Pause to allow for any recovery

  5.  Reset the table.suspend.duration back to 0s (the default)

If you tail the master / manager debug log you should get a good idea of what 
is going on – there should be messages showing the tserver leaving and then 
rejoining and any other activity related to recovery.  With a rolling restart 
the idea is to keep the cluster up and serving tables – only one (or a few) 
tservers go offline and for a short duration (generally less than a minute) and 
between each tserver restart, time is allowed for things to stabilize.


From: Shailesh Ligade mailto:slig...@fbi.gov>>
Sent: Monday, November 29, 2021 11:17 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart


Hmm, I updated the setting tablet.suspended.duration to 5m:

config -s tablet.suspended.duration=5m

but when I issued the tserver restarts (one at a time, without waiting for the 
first to come up), I still get all tablets unassigned. Maybe I need to bring 
the masters down first?

btw this is for accumulo 1.10.0

am I missing anything?

-S

From: Shailesh Ligade mailto:slig...@fbi.gov>>
Sent: Monday, November 29, 2021 10:35 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org> 
mailto:user@accumulo.apache.org>>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Thanks Michael,

stop cluster using admin stop? The issue is that, since we are using systemd 
with restart=always, it interferes with any of those stop (stop-all, stop-here 
etc) commands/scripts. So either we have to modify the systemd settings or 
maybe just do a shutdown-the-VM type of operation (I think that is a little 
brutal).

RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

2021-12-02 Thread Ligade, Shailesh [USA]
Thanks for the detailed steps! Really appreciated.

Just curious: since that setting (table.suspend.duration) is not working for 
me in accumulo 1.10.0, can I just stop both masters and then restart the 
tservers one at a time (or all at once)? Will that speed up the restart 
without getting into an offline-tablet or data-loss type situation? I can stop 
the ingest, flush the tables, and then bring down the masters…

We can take a short downtime, and my understanding is that the master is the 
one keeping track of the tservers and the offline-tablet situation. So just 
curious…

Thanks again

-S

From: dev1 
Sent: Monday, November 29, 2021 2:56 PM
To: 'user@accumulo.apache.org' 
Subject: [External] RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

I believe the property is table.suspend.duration (not tablet.suspended.duration 
as you have in this email) – but the shell should have thrown an error saying 
the property cannot be set in zookeeper if you had it wrong.

What do you mean by:

but when i issued restart tserver (one at a time without waiting for first to 
come up)

I’m assuming the requirement is to keep the cluster up and serving users 
without major disruption – not to rip through the restart as fast as possible.  
With 6 – 8 nodes you should still be able to do this in under an hour.  If you 
had a much larger cluster then the concept is the same but you would want to 
use some number of tservers that is a fraction of the total available that 
would be cycled at any given point in time.

In general the way that I would do a conservative, rolling restart:


  1.  [optional] pause ingest – or be prepared for recovering any failed 
ingests if they occur.
  2.  [optional] Flush tables that have continuous ingest using the wait option 
– this should help minimize recovery.
  3.  Set the table.suspend.duration
  4.  For each tserver – one (or a small group for large cluster) at a time
 *   Stop the tserver
 *   Pause long enough that ZooKeeper recognizes the lost connection
 *   Restart the tserver
 *   Pause to allow for any recovery
  5.  Reset the table.suspend.duration back to 0s (the default)

If you tail the master / manager debug log you should get a good idea of what 
is going on – there should be messages showing the tserver leaving and then 
rejoining and any other activity related to recovery.  With a rolling restart 
the idea is to keep the cluster up and serving tables – only one (or a few) 
tservers go offline and for a short duration (generally less than a minute) and 
between each tserver restart, time is allowed for things to stabilize.


From: Shailesh Ligade mailto:slig...@fbi.gov>>
Sent: Monday, November 29, 2021 11:17 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart


Hmm, I updated the setting tablet.suspended.duration to 5m:

config -s tablet.suspended.duration=5m

but when I issued the tserver restarts (one at a time, without waiting for the 
first to come up), I still get all tablets unassigned. Maybe I need to bring 
the masters down first?

btw this is for accumulo 1.10.0

am I missing anything?

-S

From: Shailesh Ligade mailto:slig...@fbi.gov>>
Sent: Monday, November 29, 2021 10:35 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org> 
mailto:user@accumulo.apache.org>>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Thanks Michael,

Stop the cluster using admin stop? The issue is that, since we are using 
systemd with restart=always, it interferes with any of those stop (stop-all, 
stop-here etc) commands/scripts. So either we have to modify the systemd 
settings or maybe just do a shutdown-the-VM type of operation (I think that is 
a little brutal).

-S

From: Michael Wall mailto:mjw...@gmail.com>>
Sent: Monday, November 29, 2021 9:54 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org> 
mailto:user@accumulo.apache.org>>
Subject: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Is there a reason to not just stop the cluster, reset the heap and restart the 
cluster?  That is simpler.

On Mon, Nov 29, 2021 at 9:37 AM dev1 
mailto:d...@etcoleman.com>> wrote:

Yes – and don’t forget to reset it back when you are done.



From: Ligade, Shailesh [USA] 
mailto:ligade_shail...@bah.com>>
Sent: Monday, November 29, 2021 9:36 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: RE: accumulo tserver rolling restart



Thanks,



I am assuming I can set that property using shell and it will take effect 
immediately?



Thanks



-S



From: dev1 mailto:d...@etcoleman.com>>
Sent: Monday, November 29, 2021 9:25 AM
To: 'user@accumulo.apache.org<mailto:user@accumulo.apache.org>' 
mailto:user@accumulo.apache.org>>
Subject: [External] RE: accumulo tserver rolling restart



See 
https://accumulo.apache.org/1.10/accumulo_user_manual.html#_restarting_process_on_a_node

RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

2021-11-29 Thread Ligade, Shailesh [USA]
Sorry, I had a typo; I used table.suspend.duration. I tried with 300s, 5m, and 
60s but still saw unassigned tablets. I actually tried on an accumulo 2 
cluster, which didn't have much data, and it appeared to work there; will test 
it more on accumulo 2.

Thanks

-S

From: dev1 
Sent: Monday, November 29, 2021 2:56 PM
To: 'user@accumulo.apache.org' 
Subject: [External] RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

I believe the property is table.suspend.duration (not tablet.suspended.duration 
as you have in this email) – but the shell should have thrown an error saying 
the property cannot be set in zookeeper if you had it wrong.

What do you mean by:

but when i issued restart tserver (one at a time without waiting for first to 
come up)

I’m assuming the requirement is to keep the cluster up and serving users 
without major disruption – not to rip through the restart as fast as possible.  
With 6 – 8 nodes you should still be able to do this in under an hour.  If you 
had a much larger cluster then the concept is the same but you would want to 
use some number of tservers that is a fraction of the total available that 
would be cycled at any given point in time.

In general the way that I would do a conservative, rolling restart:


  1.  [optional] pause ingest – or be prepared for recovering any failed 
ingests if they occur.
  2.  [optional] Flush tables that have continuous ingest using the wait option 
– this should help minimize recovery.
  3.  Set the table.suspend.duration
  4.  For each tserver – one (or a small group for large cluster) at a time
 *   Stop the tserver
 *   Pause long enough that ZooKeeper recognizes the lost connection
 *   Restart the tserver
 *   Pause to allow for any recovery
  5.  Reset the table.suspend.duration back to 0s (the default)

If you tail the master / manager debug log you should get a good idea of what 
is going on – there should be messages showing the tserver leaving and then 
rejoining and any other activity related to recovery.  With a rolling restart 
the idea is to keep the cluster up and serving tables – only one (or a few) 
tservers go offline and for a short duration (generally less than a minute) and 
between each tserver restart, time is allowed for things to stabilize.


From: Shailesh Ligade mailto:slig...@fbi.gov>>
Sent: Monday, November 29, 2021 11:17 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart


Hmm, I updated the setting tablet.suspended.duration to 5m:

config -s tablet.suspended.duration=5m

but when I issued the tserver restarts (one at a time, without waiting for the 
first to come up), I still get all tablets unassigned. Maybe I need to bring 
the masters down first?

btw this is for accumulo 1.10.0

am I missing anything?

-S

From: Shailesh Ligade mailto:slig...@fbi.gov>>
Sent: Monday, November 29, 2021 10:35 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org> 
mailto:user@accumulo.apache.org>>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Thanks Michael,

Stop the cluster using admin stop? The issue is that, since we are using 
systemd with restart=always, it interferes with any of those stop (stop-all, 
stop-here etc) commands/scripts. So either we have to modify the systemd 
settings or maybe just do a shutdown-the-VM type of operation (I think that is 
a little brutal).

-S

From: Michael Wall mailto:mjw...@gmail.com>>
Sent: Monday, November 29, 2021 9:54 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org> 
mailto:user@accumulo.apache.org>>
Subject: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Is there a reason to not just stop the cluster, reset the heap and restart the 
cluster?  That is simpler.

On Mon, Nov 29, 2021 at 9:37 AM dev1 
mailto:d...@etcoleman.com>> wrote:

Yes – and don’t forget to reset it back when you are done.



From: Ligade, Shailesh [USA] 
mailto:ligade_shail...@bah.com>>
Sent: Monday, November 29, 2021 9:36 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: RE: accumulo tserver rolling restart



Thanks,



I am assuming I can set that property using shell and it will take effect 
immediately?



Thanks



-S



From: dev1 mailto:d...@etcoleman.com>>
Sent: Monday, November 29, 2021 9:25 AM
To: 'user@accumulo.apache.org<mailto:user@accumulo.apache.org>' 
mailto:user@accumulo.apache.org>>
Subject: [External] RE: accumulo tserver rolling restart



See 
https://accumulo.apache.org/1.10/accumulo_user_manual.html#_restarting_process_on_a_node

RE: accumulo tserver rolling restart

2021-11-29 Thread Ligade, Shailesh [USA]
Thanks,

I am assuming I can set that property using shell and it will take effect 
immediately?

Thanks

-S

From: dev1 
Sent: Monday, November 29, 2021 9:25 AM
To: 'user@accumulo.apache.org' 
Subject: [External] RE: accumulo tserver rolling restart

See 
https://accumulo.apache.org/1.10/accumulo_user_manual.html#_restarting_process_on_a_node
 - A note on rolling restarts.

There is a property that can be set (table.suspend.duration) that will delay 
the reassignment while a tserver is restarting - there is a trade-off in the 
data not being available, so try to minimize the time the tserver is offline.

From: Ligade, Shailesh [USA] 
mailto:ligade_shail...@bah.com>>
Sent: Monday, November 29, 2021 9:19 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: accumulo tserver rolling restart

Hello,

I want to restart all the tservers, say after I updated the tserver heap size. 
Since we are using systemd, I can issue a restart command on a tserver. This 
causes all sorts of tablet movement even though accumulo is down for maybe a 
second. If I wait for all unassigned tablets to become 0 before restarting the 
next tserver, then completely restarting a small cluster (6-8 nodes) takes 
hours (roughly 4k+ tablets per tserver).

What may be the right way to perform such a routine maintenance operation? Is 
there a delay setting we can change so that it will not move tablets around? 
What may be a safe delay value?

-S


accumulo tserver rolling restart

2021-11-29 Thread Ligade, Shailesh [USA]
Hello,

I want to restart all the tservers, say after I updated the tserver heap size. 
Since we are using systemd, I can issue a restart command on a tserver. This 
causes all sorts of tablet movement even though accumulo is down for maybe a 
second. If I wait for all unassigned tablets to become 0 before restarting the 
next tserver, then completely restarting a small cluster (6-8 nodes) takes 
hours (roughly 4k+ tablets per tserver).

What may be the right way to perform such a routine maintenance operation? Is 
there a delay setting we can change so that it will not move tablets around? 
What may be a safe delay value?

-S


accumulo 1.10.0 log rotate

2021-11-23 Thread Ligade, Shailesh [USA]
Hello,

Currently the logs (all tserver, master) are getting huge and sometimes fill up 
the entire disk. They only get rotated when I restart the service. I saw that 
code in the start-daemon.sh script.

Is there a way to use a RollingFileAppender? I tried to rotate using logrotate, 
but the process is still logging into the old log file (now the 
master_xxx.out.1 file); a workaround is sketched below.

I saw some examples under the templates folder, but there is no information on 
which java classes go to which log file, e.g. packages x,y,z go to the 
master_xxx out file and package kzb goes to the monitor log.
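
On the logrotate symptom, one hedged workaround: logrotate's default 
move-and-recreate leaves the running daemon holding the old file handle, while 
the copytruncate directive truncates the live file in place instead, so the 
process keeps writing to the path being rotated. Paths below are illustrative.

sudo tee /etc/logrotate.d/accumulo <<'EOF' >/dev/null
/var/log/accumulo/*.out /var/log/accumulo/*.err {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}
EOF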

Thanks

-S


RE: [External] Re: acumulo 1.10.0 tserver goes down under heavy ingest

2021-11-23 Thread Ligade, Shailesh [USA]
Yes, we are using the native library... I was thinking to reduce the heap to 65G.

-S

-Original Message-
From: Christopher  
Sent: Monday, November 22, 2021 7:20 PM
To: accumulo-user 
Subject: Re: [External] Re: acumulo 1.10.0 tserver goes down under heavy ingest

I don't know how to tune the oom killer, but I do wonder why you would need an 
80G Java heap. That seems excessive to me. Are you using the native map library?

On Mon, Nov 22, 2021 at 7:06 PM Ligade, Shailesh [USA] 
 wrote:
>
> Thanks Christopher,
>
> It is actually the oom killer. So how can I prevent it? I mean I have 
> Xmx/Xms set to 80G on a 128G machine, so some process is hogging the memory. 
> Under normal usage I don't see the issue, but under bulk ingest I see the issue.
> I am going to try to reduce the heap and test, but I really don't want to 
> starve the tserver either. I added more tservers, hoping that reducing the 
> number of tablets per tserver might help, but it didn't.
> Do you recommend setting oom_score_adj to, say, -100?
>
> Appreciate your help
>
> -S
>
> -Original Message-
> From: Christopher 
> Sent: Monday, November 22, 2021 12:23 PM
> To: accumulo-user 
> Subject: [External] Re: acumulo 1.10.0 tserver goes down under heavy 
> ingest
>
> That log message is basically just reporting that the connection to ZK 
> failed. It's not very helpful in determining what led to that. You'll 
> probably have to gather additional evidence to track down the problem.
> Check the master and tserver logs prior to the crash, as well as the 
> ZooKeeper logs. If you can detect the manager or a tserver in a bad state, 
> try to capture a jstack of its process ID. Also check for system log 
> messages, such as the oom-killer running and killing your processes.
>
> On Mon, Nov 22, 2021 at 12:04 PM Ligade, Shailesh [USA] 
>  wrote:
> >
> > Hello,
> >
> > I have an 8 node cluster; under heavy load a tserver goes down. We have a 
> > systemd unit file to auto restart, but that causes unassigned tablets for 
> > an hour.
> >
> > In the log of the restarted tserver I see
> > WARN: Saw (possibly) transient exception communicating with 
> > zookeeper and then error KeeperErrorCode = ConnectionLoss for 
> > /accumulo//xxx KeeperErrorCode = ConnectionLoss
> > at KeeperException.create(KeeperException.java:102)
> > at KeeperException.create(KeeperException.java:54)
> > at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2736)
> > at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2762)
> > at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:159)
> > x
> >
> > Any suggestions?
> >
> > -S


RE: [External] Re: acumulo 1.10.0 tserver goes down under heavy ingest

2021-11-22 Thread Ligade, Shailesh [USA]
Thanks Christopher,

It is actually the oom killer. So how can I prevent it? I mean I have Xmx/Xms 
set to 80G on a 128G machine, so some process is hogging the memory. Under 
normal usage I don't see the issue, but under bulk ingest I see the issue.
I am going to try to reduce the heap and test, but I really don't want to 
starve the tserver either. I added more tservers, hoping that reducing the 
number of tablets per tserver might help, but it didn't.
Do you recommend setting oom_score_adj to, say, -100?
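
If the aim is just to make the kernel's oom-killer prefer other processes over 
the tserver, a systemd drop-in is one hedged option (the unit name is 
illustrative, and this only shifts the victim choice; it does not fix the 
underlying memory pressure):

sudo mkdir -p /etc/systemd/system/accumulo-tserver.service.d
sudo tee /etc/systemd/system/accumulo-tserver.service.d/oom.conf <<'EOF' >/dev/null
[Service]
OOMScoreAdjust=-500
EOF
sudo systemctl daemon-reload
sudo systemctl restart accumulo-tserver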

Appreciate your help

-S

-Original Message-
From: Christopher  
Sent: Monday, November 22, 2021 12:23 PM
To: accumulo-user 
Subject: [External] Re: acumulo 1.10.0 tserver goes down under heavy ingest

That log message is basically just reporting that the connection to ZK failed. 
It's not very helpful in determining what led to that. You'll probably have to 
gather additional evidence to track down the problem.
Check the master and tserver logs prior to the crash, as well as the ZooKeeper 
logs. If you can detect the manager or a tserver in a bad state, try to capture 
a jstack of its process ID. Also check for system log messages, such as the 
oom-killer running and killing your processes.

On Mon, Nov 22, 2021 at 12:04 PM Ligade, Shailesh [USA] 
 wrote:
>
> Hello,
>
> I have an 8 node cluster; under heavy load a tserver goes down. We have a 
> systemd unit file to auto restart, but that causes unassigned tablets for an 
> hour.
>
> In the log of the restarted tserver I see
> WARN: Saw (possibly) transient exception communicating with zookeeper 
> and then error KeeperErrorCode = ConnectionLoss for 
> /accumulo//xxx KeeperErrorCode = ConnectionLoss
> at KeeperException.create(KeeperException.java:102)
> at KeeperException.create(KeeperException.java:54)
> at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2736)
> at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2762)
> at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:159)
> x
>
> Any suggestions?
>
> -S


acumulo 1.10.0 tserver goes down under heavy ingest

2021-11-22 Thread Ligade, Shailesh [USA]
Hello,

I have an 8 node cluster; under heavy load a tserver goes down. We have a 
systemd unit file to auto restart, but that causes unassigned tablets for an 
hour.

In the log of the restarted tserver I see
WARN: Saw (possibly) transient exception communicating with zookeeper
and then error
KeeperErrorCode = ConnectionLoss for /accumulo//xxx
KeeperErrorCode = ConnectionLoss
at KeeperException.create(KeeperException.java:102)
at KeeperException.create(KeeperException.java:54)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2736)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2762)
at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:159)
x

Any suggestions?

-S


older wal files not getting removed

2021-11-22 Thread Ligade, Shailesh [USA]
Hello,

I am using Accumulo 1.10.0 with replication on.

On the primary accumulo, when I look at hdfs I do see many (100 or so) old 
(month-plus) wal files present under the tserver folders 
(/accumulo/wal/tserverX/). What can be causing that?

I also see a lot of order entries in the accumulo.replication table which 
correspond to wal files that are old. Does that mean the data in those wal 
files never got replicated? My replication table has millions of entries and I 
can see some are duplicated. Is that normal?

From my cursory check replication is working properly, but I do see that my 
purges on the primary are not getting replicated, so I was checking further.

thanks

-S


RE: [External] Re: accumulo 1.10 stop-all.sh script

2021-11-01 Thread Ligade, Shailesh [USA]
Thanks for quick reply.

In reality I have a very similar unit file; the only difference is I have 
added
Type=forking
Restart=always.

I am also using ExecStart "bin/start-daemon.sh  " (as opposed to 
accumulo master in your unit file).

Yes, I updated the stop-server.sh script (which is called from stop-all.sh) to 
use systemctl and not the kill command. However, the first part of stop-all 
calls "accumulo admin stopAll"; do I still need that functionality? If so, can 
I replicate it (see the sketch below)? I can see in the code that it does a 
flush etc.

Also, when I run start-all (after stop-all), systemctl status shows failed; 
however, accumulo is working great (I can scan tables, the monitor is up, etc.).

I also updated start-all to use systemctl but that did not help..
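
A minimal sketch of a systemd-friendly stopAll, assuming unit names like 
accumulo-tserver / accumulo-master (illustrative) and the 1.x conf/slaves host 
file; admin stopAll still performs the graceful flush and shutdown, and 
systemctl stop then keeps the units down, since Restart=always does not 
re-fire after an explicit systemctl stop:

accumulo admin stopAll          # graceful: flushes and asks the servers to exit
sleep 30                        # give the servers a moment to finish
for h in $(cat conf/slaves); do
  ssh "$h" sudo systemctl stop accumulo-tserver
done
sudo systemctl stop accumulo-master accumulo-gc accumulo-monitor accumulo-tracer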

-S
-Original Message-
From: Christopher  
Sent: Monday, November 1, 2021 9:47 AM
To: accumulo-user 
Subject: [External] Re: accumulo 1.10 stop-all.sh script

The start-all.sh / stop-all.sh scripts that come with Accumulo 1.10 are just 
one possible set of out-of-the-box scripts that you could use. If you have 
written or acquired systemd unit files to manage your services, you may be 
better off using those instead, and avoiding the built-in scripts entirely.

For me, with unit files, I would probably just do something like `pssh -h 
 systemctl stop accumulo-` or similar, rather than use 
the stop-all.sh script.

If you want to try to shut it down "cleanly" first, then you'll definitely have 
to remove the "restart=always" line from your systemd unit files. In fact, I'm 
not sure automatic restarts are ever a good idea, since you won't necessarily 
have triaged the problem that caused a crash before it tries to restart, and 
could be perpetuating a failure or making it worse.

You could also modify your launch scripts or unit files to guard on some 
precondition that must be met before it can be restarted (like the existence of 
a specific file or some other systemd unit being loaded). Systemd supports lots 
of conditions to check:
https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Conditions%20and%20Asserts
When you want to do a graceful shutdown, you can change the state that is 
checked by the precondition, so the service doesn't restart.

One example set of very simple unit files was written by me a couple of years 
ago for 1.x in Fedora. It did not, however, have automatic restarts. These were 
accompanied by a custom accumulo launch script generated by the 
%jpackage_script macro. See 
https://src.fedoraproject.org/rpms/accumulo/blob/f31/f/accumulo.spec#_369
and 
https://src.fedoraproject.org/rpms/accumulo/tree/f31 . These may not be better 
than the unit files you're currently using, though.

On Mon, Nov 1, 2021 at 9:11 AM Ligade, Shailesh [USA]  
wrote:
>
> Hello,
>
>
>
> I noticed that the stop-all.sh script first calls accumulo admin stopAll and 
> then, if the servers are still up, it stops the individual servers by going 
> through the masters, gc, slaves etc files.
>
>
>
> Since we are using systemd unit files to start the services, and our 
> unit files have restart=always, we can’t cleanly stop the services ☹. I 
> understand the unit files didn’t come with accumulo distribution. So 
> the question is: is use of unit files supported, and if so, what may 
> be the correct way to issue stopAll? If anyone can share a good unit file 
> that can be used? Or do I need to write my own stopAll script? What may be 
> the main logic of such a script (since it calls admin stopAll and that does 
> several different things underneath..)
>
>
>
> -S


accumulo 1.10 stop-all.sh script

2021-11-01 Thread Ligade, Shailesh [USA]
Hello,

I noticed that the stop-all.sh script first calls accumulo admin stopAll and 
then, if the servers are still up, it stops the individual servers by going 
through the masters, gc, slaves etc files.

Since we are using systemd unit files to start the services, and our unit 
files have restart=always, we can’t cleanly stop the services ☹. I understand 
the unit files didn’t come with the accumulo distribution. So the question is: 
is use of unit files supported, and if so, what is the correct way to issue 
stopAll? If anyone can share a good unit file that can be used? Or do I need 
to write my own stopAll script? What would be the main logic of such a script 
(since it calls admin stopAll and that does several different things 
underneath..)?

-S


Re: [External] Re: accumulo 1.10 open file recommendation

2021-10-26 Thread Ligade, Shailesh [USA]
Thanks

I saw that in the scripts!

-S

From: Christopher 
Sent: Tuesday, October 26, 2021 11:01 AM
To: accumulo-user 
Subject: [External] Re: accumulo 1.10 open file recommendation

Some of our launch scripts (start-daemon.sh in 1.x and the equivalent
accumulo-service in 2.x) do check the output of `ulimit -n` to see if
it is at least 32768. These scripts are optional, though.

Everybody's situation is unique, so we probably can't offer specific
advice for your environment, but this seems like a generally good idea
to me if you're running on a typical Linux system, as an active
Accumulo cluster on Hadoop will likely use a lot of open file handles.
I don't see that behavior changing anytime soon.
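
One systemd-specific caveat, as a hedged sketch: for systemd-managed services 
the pam limits in /etc/security/limits.conf are bypassed, so the open-file 
limit is best raised in the unit itself (the unit name is illustrative):

sudo mkdir -p /etc/systemd/system/accumulo-tserver.service.d
sudo tee /etc/systemd/system/accumulo-tserver.service.d/limits.conf <<'EOF' >/dev/null
[Service]
LimitNOFILE=32768
EOF
sudo systemctl daemon-reload && sudo systemctl restart accumulo-tserver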

On Tue, Oct 26, 2021 at 7:22 AM Ligade, Shailesh [USA]
 wrote:
>
> Hello,
>
> Do we still recommend a max open files setting on the system (ulimit -n)? I 
> saw older posts on the Cloudera site (32768), but didn't see any in the 
> Accumulo documentation. So just curious.
>
> -S


accumulo 1.10 open file recommendation

2021-10-26 Thread Ligade, Shailesh [USA]
Hello,

Do we still recommend a max open files setting on the system (ulimit -n)? I 
saw older posts on the Cloudera site (32768), but didn't see any in the 
Accumulo documentation. So just curious.

-S


accumulo 1.10 client.conf

2021-10-14 Thread Ligade, Shailesh [USA]
Hello,

Did the client.conf properties change between 1.10 and 2.x? I saw a lot of 
documentation on it and am trying to use it so that I can use shell commands 
from a script.

My current client.conf has

instance.name=
instance.zookeeper=xxx:2181,yyy:2181
auth.type=PasswordToken
auth.principal=auser
auth.token=

when I run accumulo shell --config-file 

Re: [External] Re: accumulo 1.10 tuning

2021-10-14 Thread Ligade, Shailesh [USA]
Thanks!

-S

From: Christopher 
Sent: Wednesday, October 13, 2021 1:46 PM
To: accumulo-user 
Subject: [External] Re: accumulo 1.10 tuning

I'm not aware of any existing prescriptive recommendations of the type
you are asking for. I think most recommendations tend to be reactive
and specific, rather than prescriptive and general, because so much
depends on the particulars of a user's situation. There are too many
variables, and each user's situation is different. What works for one
person's data and environment may not work for you in your
environment.

Tservers generally don't require a lot of heap for ingest. Make sure
you reserve enough room for the OS, and other processes on the
machine. And, don't forget to account for the native memory taken by
native compression libraries, like GZip. Monitor your tservers to see
how much heap you're using in your workload, and make sure you adjust
to optimize Java GC runs. Take into account your iterators and what
they are doing, as your iterators may require more memory. To optimize
your workloads, you may wish to experiment with running multiple
tservers with smaller memory footprints on the same server, rather
than a single tserver with a larger memory footprint. These are just a
few things to consider. Everybody's use case is unique.
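To make the heap knobs concrete: in 1.10 the per-process heap is set in
accumulo-env.sh; a sketch with purely illustrative sizes (not a recommendation,
per the above):

    # accumulo-env.sh (1.10) -- sizes are illustrative only
    export ACCUMULO_TSERVER_OPTS="${POLICY} -Xmx8g -Xms8g"
    export ACCUMULO_MASTER_OPTS="${POLICY} -Xmx1g -Xms1g"
    # native in-memory maps live outside the Java heap, so budget
    # tserver.memory.maps.max separately when dividing server RAM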

On Wed, Oct 13, 2021 at 8:48 AM Ligade, Shailesh [USA]
 wrote:
>
> Hello,
>
>
>
> I saw various guidelines on how to set memory heap sizes etc. Is there a
> pluggable spreadsheet, like if the server has x memory, tserver heap should be
> x/3 and datanode should have x/4, etc.?
>
> Also, is there any recommendation on the number of tablets hosted per tserver and
> tserver memory requirements? Maybe that will also indicate when to add a new
> tserver (when the # of tablets per tserver goes above some threshold)
>
>
> -S
>
>


accumulo 1.10 tuning

2021-10-13 Thread Ligade, Shailesh [USA]
Hello,

I saw various guidelines on how to set memory heap sizes etc. Is there a
pluggable spreadsheet, like if the server has x memory, tserver heap should be
x/3 and datanode should have x/4, etc.?
Also, is there any recommendation on the number of tablets hosted per tserver and
tserver memory requirements? Maybe that will also indicate when to add a new
tserver (when the # of tablets per tserver goes above some threshold)

-S



RE: [External] RE: accumulo 1.10 replication issue

2021-09-24 Thread Ligade, Shailesh [USA]
Thank you all,

I think replication itself is working, but I am still not sure what triggers it.
If I restart both clusters (master and tserver on both), I can see data getting
replicated. This indicates that, if properly triggered, replication will work;
the needed plumbing is there and the configs are correct.

I understand the other triggers are WAL closing and WAL aging. Since those are set
pretty high, I can't reliably state that replication worked ☹ Are there any other
triggers? I did flush and/or compact the source table, but that didn't work.

Also, I noticed that I had some bad configuration at one point (not anymore),
e.g., an incorrect peer name (that peer name does not exist under zookeeper), but
even after a tserver/master restart, that incorrect config still shows up under
in-progress replication. Not sure how to clear all that; maybe over time it will
fix itself.

-S

-Original Message-
From: Christopher  
Sent: Thursday, September 23, 2021 2:23 PM
To: accumulo-user 
Subject: Re: [External] RE: accumulo 1.10 replication issue

The design of replication is intentionally passive and "eventually consistent" 
for efficiency. Batching efficiency is one reason why the feature is tightly 
coupled to WALs. If you need immediate replication, or want greater control 
over the batching process, you can create a layer on top of two Accumulo 
clients that will coordinate sending each mutation to both instances, rather 
than rely on the built-in replication features tied to WALs. I've even seen 
some developers try using Kafka to deliver updates, and having two different 
Accumulo clusters subscribe to the desired Kafka topics and do ingest into 
Accumulo that way.

Also, please be aware that the existing replication features have not been 
maintained in some time, and many of their test cases are known to be buggy. 
There are many known bugs and potential bugs with the replication as currently 
implemented, but there has not been active development maintaining that feature 
in many years. It is unclear what the future of the current implementation is. 
Like all open source software, you use it at your own risk. Please ensure that 
you've tested the feature to ensure it is suitable for your use case, before 
using it in production. And, if you find any fixes to problems as you go, we 
very much would welcome pull requests to fix them. If you find you were able to 
get it to work for your use case, that's great, and we welcome success stories 
as well! :)


On Thu, Sep 23, 2021 at 2:05 PM Ligade, Shailesh [USA] 
 wrote:
>
> Thanks, I appreciate your help.
>
>
>
> It can be confusing, as I am waiting here but it is not replicating. I will
> reduce those values temporarily and see what happens. The interesting part is
> that Files Needing Replication is stuck at 3, so it is possible that the problem
> is somewhere else.
>
>
>
> Thanks again
>
> -S
>
>
>
> From: Adam J. Shook 
> Sent: Thursday, September 23, 2021 1:50 PM
> To: user@accumulo.apache.org
> Subject: Re: [External] RE: accumulo 1.10 replication issue
>
>
>
> Yes, if it is not heavily used then you may see a significant delay. You can 
> change the defaults using tserver.walog.max.age [1] and 
> tserver.walog.max.size [2]. If I recall you can change these via the shell 
> and a restart is not required.
>
>
>
> If you aren't seeing much ingestion, then the max age would be what you want 
> to set to ensure data is replicated within the window you want it to be 
> replicated.  Please keep in mind that setting either of these values to very 
> low thresholds will cause the WALs to roll over frequently and could 
> negatively impact the system, particularly for large Accumulo clusters.
>
>
>
> In my experience, using Accumulo replication is a good fit when you have 
> longer SLAs on replication.  If you are looking for anything in the 
> near-real-time realm (milliseconds to seconds to maybe even a few minutes), 
> you'd be better off double writing to multiple Accumulo instances.
>
>
>
> --Adam
>
>
>
> [1] https://accumulo.apache.org/1.10/accumulo_user_manual.html#_tserver_walog_max_age
>
> [2] https://accumulo.apache.org/1.10/accumulo_user_manual.html#_tserver_walog_max_size
>
>
>
> On Thu, Sep 23, 2021 at 1:31 PM Ligade, Shailesh [USA] 
>  wrote:
>
> Thanks Adam,
>
>
>
> The system is not heavily used. Does that mean it will wait for 1 GB of data in
> the WAL file (or 24 hours) before it will replicate?
>
>
>
> I don't see any errors in any log: source master, tserver, or target
> master/tserver
>

RE: [External] RE: accumulo 1.10 replication issue

2021-09-23 Thread Ligade, Shailesh [USA]
Thanks, I appreciate your help.

It can be confusing, as I am waiting here but it is not replicating. I will
reduce those values temporarily and see what happens. The interesting part is that
Files Needing Replication is stuck at 3, so it is possible that the problem is
somewhere else.

Thanks again
-S

From: Adam J. Shook 
Sent: Thursday, September 23, 2021 1:50 PM
To: user@accumulo.apache.org
Subject: Re: [External] RE: accumulo 1.10 replication issue

Yes, if it is not heavily used then you may see a significant delay. You can 
change the defaults using tserver.walog.max.age [1] and tserver.walog.max.size 
[2]. If I recall you can change these via the shell and a restart is not 
required.

If you aren't seeing much ingestion, then the max age would be what you want to 
set to ensure data is replicated within the window you want it to be 
replicated.  Please keep in mind that setting either of these values to very 
low thresholds will cause the WALs to roll over frequently and could negatively 
impact the system, particularly for large Accumulo clusters.
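Concretely, from the shell that would be something like the following (values
illustrative; see the caveats above):

    config -s tserver.walog.max.age=4h
    config -s tserver.walog.max.size=256M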

In my experience, using Accumulo replication is a good fit when you have longer 
SLAs on replication.  If you are looking for anything in the near-real-time 
realm (milliseconds to seconds to maybe even a few minutes), you'd be better 
off double writing to multiple Accumulo instances.

--Adam

[1] https://accumulo.apache.org/1.10/accumulo_user_manual.html#_tserver_walog_max_age
[2] https://accumulo.apache.org/1.10/accumulo_user_manual.html#_tserver_walog_max_size

On Thu, Sep 23, 2021 at 1:31 PM Ligade, Shailesh [USA] <ligade_shail...@bah.com> wrote:
Thanks Adam,

The system is not heavily used. Does that mean it will wait for 1 GB of data in
the WAL file (or 24 hours) before it will replicate?

I don't see any errors in any log: source master, tserver, or target master/tserver

The monitor replication page shows the correct status, and once in a while I see
the In-Progress Replication section flashing by. But I don't see any new data in
the target table. ☹

-S

From: Adam J. Shook <adamjsh...@gmail.com>
Sent: Thursday, September 23, 2021 12:10 PM
To: user@accumulo.apache.org
Subject: Re: [External] RE: accumulo 1.10 replication issue

Yes, inserting via the shell will be enough to test it.

Note that the replication system uses the write-ahead logs (WAL) to replicate 
the data.  These logs must be closed before any replication can occur, so there 
will be a delay before it shows up in the peer table.  How long of a delay 
depends on how much data is actively being written to the TabletServers (and 
therefore the WAL) and/or how much time has passed since the WAL was opened. 
The default max WAL data size is 1 GB and the max age is 24 hours.

--Adam

On Thu, Sep 23, 2021 at 11:13 AM Ligade, Shailesh [USA] <ligade_shail...@bah.com> wrote:
Thanks Adam,

I am setting the accumulo.name property in accumulo-site.xml. I think this
property must be set to the "Instance Name" value; I tried setting it to
"primary" and saw an error stating that the instance id was not found in
zookeeper

I have a few tables to replicate, so I am thinking I will set all the other
properties using the shell config command

To test this, I just insert a value using the shell, right? Or do I need to flush
or compact the table to see those values on the other side?

-S

From: Adam J. Shook <adamjsh...@gmail.com>
Sent: Thursday, September 23, 2021 11:08 AM
To: user@accumulo.apache.org
Subject: Re: [External] RE: accumulo 1.10 replication issue

Your configurations look correct to me, and it sounds like it is partially
working, as you are seeing files that need to be replicated in the Accumulo
Monitor. I do have the replication.name and all replication.peer.* properties
defined in accumulo-site.xml. Do you have all these properties defined there?
If not, try setting them in accumulo-site.xml and restarting your Accumulo
services, particularly the Master and TabletServers. The Master may not be
queuing work and/or the TabletServers may not be looking for work.

You should see DEBUG-level logs in the TabletServers that say "Looking for work
in /accumulo/<instance id>/replication/workqueue", so enable debug logging if
you haven't done so already in the generic_logger.xml file.

RE: [External] RE: accumulo 1.10 replication issue

2021-09-23 Thread Ligade, Shailesh [USA]
Thanks Adam,

The system is not heavily used. Does that mean it will wait for 1 GB of data in
the WAL file (or 24 hours) before it will replicate?

I don't see any errors in any log: source master, tserver, or target master/tserver

The monitor replication page shows the correct status, and once in a while I see
the In-Progress Replication section flashing by. But I don't see any new data in
the target table. ☹

-S

From: Adam J. Shook 
Sent: Thursday, September 23, 2021 12:10 PM
To: user@accumulo.apache.org
Subject: Re: [External] RE: accumulo 1.10 replication issue

Yes, inserting via the shell will be enough to test it.

Note that the replication system uses the write-ahead logs (WAL) to replicate 
the data.  These logs must be closed before any replication can occur, so there 
will be a delay before it shows up in the peer table.  How long of a delay 
depends on how much data is actively being written to the TabletServers (and 
therefore the WAL) and/or how much time has passed since the WAL was opened. 
The default max WAL data size is 1 GB and the max age is 24 hours.

--Adam

On Thu, Sep 23, 2021 at 11:13 AM Ligade, Shailesh [USA] <ligade_shail...@bah.com> wrote:
Thanks Adam,

I am setting the accumulo.name property in accumulo-site.xml. I think this
property must be set to the "Instance Name" value; I tried setting it to
"primary" and saw an error stating that the instance id was not found in
zookeeper

I have a few tables to replicate, so I am thinking I will set all the other
properties using the shell config command

To test this, I just insert a value using the shell, right? Or do I need to flush
or compact the table to see those values on the other side?

-S

From: Adam J. Shook <adamjsh...@gmail.com>
Sent: Thursday, September 23, 2021 11:08 AM
To: user@accumulo.apache.org
Subject: Re: [External] RE: accumulo 1.10 replication issue

Your configurations look correct to me, and it sounds like it is partially
working, as you are seeing files that need to be replicated in the Accumulo
Monitor. I do have the replication.name and all replication.peer.* properties
defined in accumulo-site.xml. Do you have all these properties defined there?
If not, try setting them in accumulo-site.xml and restarting your Accumulo
services, particularly the Master and TabletServers. The Master may not be
queuing work and/or the TabletServers may not be looking for work.

You should see DEBUG-level logs in the TabletServers that say "Looking for work
in /accumulo/<instance id>/replication/workqueue", so enable debug logging if
you haven't done so already in the generic_logger.xml file.
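For 1.10's log4j-based generic_logger.xml, such an entry could look like this
(the logger name is an assumption about where the replication work-queue
messages originate):

    <!-- illustrative: enable DEBUG for tserver replication classes -->
    <logger name="org.apache.accumulo.tserver.replication">
      <level value="DEBUG"/>
    </logger>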

--Adam

On Thu, Sep 23, 2021 at 6:53 AM Ligade, Shailesh [USA] <ligade_shail...@bah.com> wrote:
Thanks for the reply.

I am using the insert command from the shell to insert data.

Also, a quick question: can the replication.name property be set using the CLI?
Will that work, or must it be defined in accumulo-site.xml?

Thanks
-S


From: d...@etcoleman.com
Sent: Thursday, September 23, 2021 6:50 AM
To: user@accumulo.apache.org
Subject: [External] RE: accumulo 1.10 replication issue

How are you inserting the data?

From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Wednesday, September 22, 2021 10:22 PM
To: user@accumulo.apache.org
Subject: accumulo 1.10 replication issue

Hello,

I am following the Apache Accumulo® User Manual Version 1.10
(https://accumulo.apache.org/1.10/accumulo_user_manual.html#_replication)

I want to set up replication from accumulo instance inst1, table source, to
inst2, table target. I created a replication user (same password) on both
instances and granted Table.READ/WRITE on source and target respectively.

I set the replication.name property to be the same as the instance name on both
instances.

On inst1 Set following properties

replication.peer.inst1=org.apache.accumulo.tserver.replication.AccumuloReplicaSystem,inst2,inst2zoo1:2181,inst2zoo2:2181,inst2zoo3:2181
replication.peer.user.inst2=replication
replication.peer.password.inst2=replication

set the source table for replication
config -t source -s table.replication=true
config -t source -s table.replication.target.inst2=(number I got for target
table from inst2 tables -l command)

RE: [External] RE: accumulo 1.10 replication issue

2021-09-23 Thread Ligade, Shailesh [USA]
Thanks Adam,

I am setting the accumulo.name property in accumulo-site.xml. I think this
property must be set to the "Instance Name" value; I tried setting it to
"primary" and saw an error stating that the instance id was not found in
zookeeper

I have a few tables to replicate, so I am thinking I will set all the other
properties using the shell config command

To test this, I just insert a value using the shell, right? Or do I need to flush
or compact the table to see those values on the other side?

-S

From: Adam J. Shook 
Sent: Thursday, September 23, 2021 11:08 AM
To: user@accumulo.apache.org
Subject: Re: [External] RE: accumulo 1.10 replication issue

Your configurations look correct to me, and it sounds like it is partially
working, as you are seeing files that need to be replicated in the Accumulo
Monitor. I do have the replication.name and all replication.peer.* properties
defined in accumulo-site.xml. Do you have all these properties defined there?
If not, try setting them in accumulo-site.xml and restarting your Accumulo
services, particularly the Master and TabletServers. The Master may not be
queuing work and/or the TabletServers may not be looking for work.

You should see DEBUG-level logs in the TabletServers that say "Looking for work
in /accumulo/<instance id>/replication/workqueue", so enable debug logging if
you haven't done so already in the generic_logger.xml file.

--Adam

On Thu, Sep 23, 2021 at 6:53 AM Ligade, Shailesh [USA] <ligade_shail...@bah.com> wrote:
Thanks for the reply.

I am using the insert command from the shell to insert data.

Also, a quick question: can the replication.name property be set using the CLI?
Will that work, or must it be defined in accumulo-site.xml?

Thanks
-S


From: d...@etcoleman.com
Sent: Thursday, September 23, 2021 6:50 AM
To: user@accumulo.apache.org
Subject: [External] RE: accumulo 1.10 replication issue

How are you inserting the data?

From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Wednesday, September 22, 2021 10:22 PM
To: user@accumulo.apache.org
Subject: accumulo 1.10 replication issue

Hello,

I am following the Apache Accumulo® User Manual Version 1.10
(https://accumulo.apache.org/1.10/accumulo_user_manual.html#_replication)

I want to set up replication from accumulo instance inst1, table source, to
inst2, table target. I created a replication user (same password) on both
instances and granted Table.READ/WRITE on source and target respectively.

I set the replication.name property to be the same as the instance name on both
instances.

On inst1 Set following properties

replication.peer.inst1=org.apache.accumulo.tserver.replication.AccumuloReplicaSystem,inst2,inst2zoo1:2181,inst2zoo2:2181,inst2zoo3:2181
replication.peer.user.inst2=replication
replication.peer.password.inst2=replication

set the source table for replication
config -t source -s table.replication=true
config -t source -s table.replication.target.inst2=(number I got for target 
table from inst2 tables -l command)

and finally I did
online accumulo.replication

Now when I insert data into source, I get Files Needing Replication = 1 in the
monitor replication section. All the other values are correct: TABLE – source,
PEER – inst2, REMOTE ID – the number I set.

However, my In-Progress Replication always stays empty, and I don't see any data
in the inst2 target table.

No errors that I can see in the master log or the tserver log where the tablet exists.

Any idea what may be wrong? Is there any way to debug this?

-S





RE: [External] RE: accumulo 1.10 replication issue

2021-09-23 Thread Ligade, Shailesh [USA]
Thanks for the reply.

I am using the insert command from the shell to insert data.

Also, a quick question: can the replication.name property be set using the CLI?
Will that work, or must it be defined in accumulo-site.xml?

Thanks
-S


From: d...@etcoleman.com 
Sent: Thursday, September 23, 2021 6:50 AM
To: user@accumulo.apache.org
Subject: [External] RE: accumulo 1.10 replication issue

How are you inserting the data?

From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Wednesday, September 22, 2021 10:22 PM
To: user@accumulo.apache.org
Subject: accumulo 1.10 replication issue

Hello,

I am following the Apache Accumulo® User Manual Version 1.10
(https://accumulo.apache.org/1.10/accumulo_user_manual.html#_replication)

I want to set up replication from accumulo instance inst1, table source, to
inst2, table target. I created a replication user (same password) on both
instances and granted Table.READ/WRITE on source and target respectively.

I set the replication.name property to be the same as the instance name on both instances.

On inst1 Set following properties

replication.peer.inst1=org.apache.accumulo.tserver.replication.AccumuloReplicaSystem,inst2,inst2zoo1:2181,inst2zoo2:2181,inst2zoo3:2181
replication.peer.user.inst2=replication
replication.peer.password.inst2=replication

set the source table for replication
config -t source -s table.replication=true
config -t source -s table.replication.target.inst2=(number I got for target 
table from inst2 tables -l command)

and finally I did
online accumulo.replication

Now when I insert data into source, I get Files Needing Replication = 1 in the
monitor replication section. All the other values are correct: TABLE - source,
PEER - inst2, REMOTE ID - the number I set.

However, my In-Progress Replication always stays empty, and I don't see any data
in the inst2 target table.

No errors that I can see in the master log or the tserver log where the tablet exists.

Any idea what may be wrong? Is there any way to debug this?

-S





accumulo 1.10 replication issue

2021-09-22 Thread Ligade, Shailesh [USA]
Hello,

I am following the Apache Accumulo® User Manual Version 1.10.

I want to set up replication from accumulo instance inst1, table source, to
inst2, table target. I created a replication user (same password) on both
instances and granted Table.READ/WRITE on source and target respectively.

I set the replication.name property to be the same as the instance name on both instances.

On inst1 Set following properties

replication.peer.inst1=org.apache.accumulo.tserver.replication.AccumuloReplicaSystem,inst2,inst2zoo1:2181,inst2zoo2:2181,inst2zoo3:2181
replication.peer.user.inst2=replication
replication.peer.password.inst2=replication

set the source table for replication
config -t source -s table.replication=true
config -t source -s table.replication.target.inst2=(number I got for target 
table from inst2 tables -l command)

and finally I did
online accumulo.replication

Now when I insert data into source, I get Files Needing Replication = 1 in the
monitor replication section. All the other values are correct: TABLE - source,
PEER - inst2, REMOTE ID - the number I set.

However, my In-Progress Replication always stays empty, and I don't see any data
in the inst2 target table.

No errors that I can see in the master log or the tserver log where the tablet exists.

Any idea what may be wrong? Is there any way to debug this?

-S





exporttable content to linux local file system (not hdfs)

2021-09-16 Thread Ligade, Shailesh [USA]
Hello,

I need to export data to another hdfs instance for replication baselining
purposes.

Can exporttable directly export data to the local file system? I was planning to
scp it to the other instance.

From the shell, using accumulo 1.10, I tried

exporttable -t tname file:///export

but it didn't work; I tried file:// as well as export/, and none worked.
I received errors like: Failed to create export files: Mkdirs failed to create
file:/export (exist=false, cwd=file:/)

For file:// it gives an AccumuloException:
Internal error processing waitForFateOperation

So is it possible?

-S
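For what it's worth, exporttable expects the table to be offline and is normally
pointed at a distributed-filesystem directory, which can then be copied out; a
hedged sketch (paths are illustrative):

    offline tname -w
    exporttable -t tname /tmp/tname-export

    # then, outside the shell:
    hdfs dfs -copyToLocal /tmp/tname-export /export/tname
    scp -r /export/tname otherhost:/staging/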



RE: [External] Re: [EXTERNAL EMAIL] - Re: accumulo and hdfs data locality

2021-09-10 Thread Ligade, Shailesh [USA]
Thanks, appreciated!

-S

From: Christopher 
Sent: Friday, September 10, 2021 9:15 AM
To: accumulo-user 
Subject: [External] Re: [EXTERNAL EMAIL] - Re: accumulo and hdfs data locality

One correction to what Mike said. The last location column doesn't store where 
it was last hosted. The current location column does that. Rather, the last 
location column stores where it was hosted when it last wrote to HDFS. The goal 
is what Mike said: it tries to provide a mechanism for preserving locality 
during reboots. However, it may not work very well, especially since the last 
written file may be very small, and only a tiny fraction of the tablet's 
overall data.

On Fri, Sep 10, 2021, 09:03 Michael Wall <mjw...@apache.org> wrote:
If a tablet moves, the data files in HDFS do not go with it.  However, during 
the next compaction one copy of the rfile should be written locally.

Note, the metadata has a last location column for each tablet, to record where
the tablet was last hosted. On startup, Accumulo will try to assign a tablet to
that last location if possible, to hopefully take advantage of the locality.

To really take advantage of the data locality, you should configure Short 
Circuit reads in HDFS. See 
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html
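From that page, the relevant hdfs-site.xml settings look like the following (the
socket path is an assumption; it just needs to be a path the datanode can
create):

    <property>
      <name>dfs.client.read.shortcircuit</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.domain.socket.path</name>
      <value>/var/lib/hadoop-hdfs/dn_socket</value>
    </property>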

On Fri, Sep 10, 2021 at 8:47 AM Shailesh Ligade <slig...@fbi.gov> wrote:
Thank you,

Is there a way to maintain that data locality? I mean, over time, with table
splitting, hdfs rebalancing, etc., we may not have data locality…

Thanks again

-S

From: Christopher <ctubb...@apache.org>
Sent: Friday, September 10, 2021 8:40 AM
To: accumulo-user <user@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo and hdfs data locality

Data locality and simplified deployments are the only reasons I can think of. 
Accumulo doesn't do anything particularly special for data locality, but 
typically, an HDFS client (like Accumulo servers) will (or can be configured 
to) write one copy of any new blocks locally, which should permit efficient 
reads later. This works well with Accumulo's hosting behavior, where each 
tablet is hosted on a single server solely responsible for its reads and writes.

On Fri, Sep 10, 2021, 07:22 Shailesh Ligade <slig...@fbi.gov> wrote:
Hello, I am using Hadoop 3.3 and accumulo 1.10. Does accumulo take advantage of
Hadoop data locality? What are the other benefits of having the tserver and
datanode processes on the same instance?

-S




Re: [External] RE: how to decommission tablet server

2021-08-18 Thread Ligade, Shailesh [USA]
Thank you for the good explanation! I really appreciate it.

Yes, I need to remove the hardware, meaning I need to stop everything on the
server (tserver and datanode).

One quick question:

What is the difference between accumulo admin stop <host>:9997 and stopping the
tserver linux service?

When I issue admin stop, I can see from the monitor that the hosted tablet count
for the tserver in question goes down to 0; however, it doesn't stop the tserver
process or service.

In your steps, you are stopping the datanode service first (adding it to the
exclude file, running refreshNodes, and then stopping the service). I was
thinking of stopping the accumulo tserver and letting it hand off its hosted
tablets first, before touching the datanode. Will there be any difference? Just
trying to understand what the relationship between accumulo and hadoop is.

Thank you!

-S

From: d...@etcoleman.com 
Sent: Tuesday, August 17, 2021 2:39 PM
To: user@accumulo.apache.org 
Subject: [External] RE: how to decommission tablet server


Maybe you could clarify.  Decommissioning tablet servers and hdfs replication 
are separate and distinct issues.  Accumulo will generally be unaware of hdfs 
replication and table assignment does not change the hdfs replication.  You can 
set the replication factor for a table – but that is used on writes to hdfs –
Accumulo will assume that once a successful write returns, hdfs is managing the
details.



When a tablet is assigned / migrated, the underlying files in hdfs are not 
changed – the file references are reassigned in a metadata operation, but the 
files themselves are not modified.  They will maintain whatever replication 
factor that was assigned and whatever the namenode decides.



If you are removing servers that have both data nodes and tserver processes 
running:



If you stop / kill the tserver, the tablets assigned to that server will be 
reassigned rather quickly.  It is only a metadata update.  The exact timing 
will depend on your ZooKeeper time-out setting, but the “dead” tserver should 
be detected and reassigned in short order. The reassignment may cause some 
churn of assignments if the cluster becomes un-balanced.   The manager (master) 
will select tablets from tservers that are over-subscribed and then assign them 
to tservers that have fewer tablets – you can monitor the manager (master) 
debug log to see the migration progress.  If you want to be gentile, stop a 
tserver, wait for the number of unassigned tables to hit zero and migration to 
settle and then repeat.



If you want to stop the data nodes, you can do that independently of Accumulo – 
just follow the Hadoop data node decommission process.  Hadoop will move the 
data blocks assigned to the data node so that it is “safe” to then stop the 
data node process.  This is independent of Accumulo and Accumulo will not be 
aware that the blocks are moving.  If you are running compactions, Accumulo may 
try to write blocks locally, but if the data node is rejecting new block 
assignments (which I rather assume that it would when in decommission mode) 
then Accumulo still would not care.  If somehow new blocks where written it may 
just delay the Hadoop data node decommissioning.



If you are running ingest while killing tservers – things should mostly work – 
there may be ingest failures, but normally things would get retried and the 
subsequent effort should succeed – the issue may be that if by bad luck the 
work keeps getting assigned to tservers that are then killed, you could end up 
exceeding the number of retries and the ingest would fail out right.  If you 
can pause ingest, then this limits that chance.  If you can monitor your ingest 
and know when an ingest failed you could just reschedule the ingest (for bulk 
import)  If you are doing continuous ingest, it may be harder to determine if a 
specific ingest fails, so you may need to select an appropriate range for 
replay.  Overall it may mostly work – it will depend on your processes and your 
tolerance for any particular data loss on an ingest.



The modest approach (if you can accept transient errors; a command sketch follows the list):



1 Start the data node decommission process.

2 Pause ingest and cancel any running user compactions.

3 Stop a tserver and wait for unassigned tablets to go back to 0.  Wait for the 
tablet migration (if any) to quiet down.

4 Repeat 3 until all tserver processes have been stopped on the nodes you are 
removing.

5 Restart ingest – rerun any user compactions if you stopped any.

6 Wait for the hdfs decommission process to finish moving / replicating blocks.

7 stop the data node process.

8 do what you want with the node.
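A hedged command-level sketch of steps 1, 3, 6, and 7 (the hostname and
exclude-file path are assumptions; dfs.hosts.exclude must already point at that
file):

    echo "worker5.example.com" >> /etc/hadoop/conf/dfs.exclude  # step 1
    hdfs dfsadmin -refreshNodes                                 # step 1: begin datanode decommission
    accumulo admin stop worker5.example.com:9997                # step 3: unload this tserver's tablets
    hdfs dfsadmin -report                                       # step 6: wait for "Decommissioned"
    hdfs --daemon stop datanode                                 # step 7: run on the node itself (Hadoop 3)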



You do not need to schedule down time – if you can accept transient errors – 
say that a client scan is running and that tserver is stopped – the client may 
receive an error for the scan.  If the scan is resubmitted and the tablet has 
been reassigned it should work – it may pause for the reassignment and / or 
timeout 

RE: [External] RE: hdfs rack awareness and accumulo

2021-08-05 Thread Ligade, Shailesh [USA]
All tables are online (except for replication, which was off to begin with)

Master process is up

I see INFO messages like

Failed to open transport
Waiting for file to be closed /accumulo/wal/<hostname>/xxx
RecoveryManager Volume replaced /accumulo/wal/<stopped hostname>/xxx

Is there anything specific I should be looking for?

-S

From: d...@etcoleman.com 
Sent: Thursday, August 5, 2021 8:37 AM
To: user@accumulo.apache.org
Subject: [External] RE: hdfs rack awareness and accumulo

Are there any Accumulo system tables that are offline (root, metadata)?  Is
there a manager (master) process available?  What is the manager log saying?

From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Thursday, August 5, 2021 7:55 AM
To: user@accumulo.apache.org
Subject: hdfs rack awareness and accumulo

Hello,

Our hdfs setup is rack aware with a replication factor of 3. The datanodes and
tservers share the same hosts. In the event that one rack goes down, will
accumulo still be functioning (after hdfs data replication)?

What I am finding is that the accumulo monitor is up and shows half the tablets
as unreachable. I can get to the accumulo shell, but I can't scan any tables.
From the log I can see there are some locks in zookeeper. Overall, accumulo,
although up, is not usable ☹ Is there any way around it?

-S


hdfs rack awareness and accumulo

2021-08-05 Thread Ligade, Shailesh [USA]
Hello,

Our hdfs setup is rack aware with a replication factor of 3. The datanodes and
tservers share the same hosts. In the event that one rack goes down, will
accumulo still be functioning (after hdfs data replication)?

What I am finding is that the accumulo monitor is up and shows half the tablets
as unreachable. I can get to the accumulo shell, but I can't scan any tables.
From the log I can see there are some locks in zookeeper. Overall, accumulo,
although up, is not usable ☹ Is there any way around it?

-S