ASF Trip opportunity - Berlin Buzzwords

2023-03-24 Thread dev1
In case anyone is interested.

Hi All,

The ASF Travel Assistance Committee is supporting taking up to six (6) people 
to attend Berlin Buzzwords in June this year.

This includes Conference passes, and travel & accommodation as needed.

Please see our website at https://tac.apache.org for more information and how 
to apply.

Applications close on 15th April.

Good luck to those that apply.

Gavin McDonald (VP TAC)


RE: accumulo shell compact --cancel -t mytable does not work

2023-03-11 Thread dev1
Which Accumulo version are you using?

The command should work - it may not stop the compactions that are currently 
writing files, but it should stop new ones from starting and cancel once those 
writes complete.

Can you look at the number of FATE operations you have running, and check whether 
there are conflicting table locks?
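
For example, a quick check might look like this (a sketch; the fate admin 
utility's exact syntax can vary by version, so check its help on yours):

  # list FATE transactions, their status, and any locks held or waited on
  accumulo fate print

A compaction cancel runs through FATE as well, so a long transaction queue or a 
lock held by another operation on the same table can delay it.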

Ed Coleman

From: Hart, Andrew via user 
Sent: Saturday, March 11, 2023 4:06 AM
To: user@accumulo.apache.org
Subject: accumulo shell compact --cancel -t mytable does not work

Hi,
I had some thousands of compactions running and queued (full user type) and 
tried to cancel but despite the command completing, the compactions did not 
cancel.
What could be wrong?
I resorted to repeatedly sending the same compact --cancel command, but I don't 
know if the cancel just takes hours or if it is necessary to repeat it until the 
queue is drained.





RE: Exception during accumulo init

2023-01-20 Thread dev1
Why do you keep using "--upload-accumulo-props"?  During initialization Accumulo 
will automatically read your site config file and then set properties in 
ZooKeeper accordingly.

Ed Coleman

From: Samudrala, Ranganath [USA] via user 
Sent: Friday, January 20, 2023 10:48 AM
To: user@accumulo.apache.org
Subject: Re: Exception during accumulo init

I think this problem is resolved. Lower down in the log, after I enter the 
credentials, I can see logs related to accumulo.

From: Samudrala, Ranganath [USA] via user <user@accumulo.apache.org>
Date: Friday, January 20, 2023 at 9:53 AM
To: Samudrala, Ranganath [USA] via user <user@accumulo.apache.org>
Subject: [External] Exception during accumulo init
Hello

The setup is in Kubernetes and the Accumulo version is v2.1.0. None of the accumulo 
processes are running. Hadoop services are running in separate pods.  I opened a 
shell in the Accumulo manager/master pod and invoked the command "accumulo init 
--upload-accumulo-props".  I see an exception in the log and I am wondering if 
this is normal/acceptable, or if this problem should be resolved before attempting 
to perform a start?

ERROR StatusLogger An exception occurred processing Appender MonitorLog
java.lang.RuntimeException: Accumulo not initialized, there is no instance id at 
hdfs://accumulo-hdfs-namenode-0.accumulo-hdfs-namenodes:8020/accumulo/instance_id
    at org.apache.accumulo.server.fs.VolumeManager.getInstanceIDFromHdfs(VolumeManager.java:218)
    at org.apache.accumulo.server.ServerInfo.<init>(ServerInfo.java:102)
    at org.apache.accumulo.server.ServerContext.<init>(ServerContext.java:106)
    at org.apache.accumulo.monitor.util.logging.AccumuloMonitorAppender.lambda$new$1(AccumuloMonitorAppender.java:93)
    at org.apache.accumulo.monitor.util.logging.AccumuloMonitorAppender.append(AccumuloMonitorAppender.java:111)
    at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:161)
    at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:134)
    at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:125)
    at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:89)
    at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:683)
    at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:641)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:624)
    at org.apache.logging.log4j.core.config.LoggerConfig.logParent(LoggerConfig.java:674)
    at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:643)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:624)
    at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:612)
    at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:98)
    at org.apache.logging.log4j.core.async.AsyncLogger.actualAsyncLog(AsyncLogger.java:488)
    at org.apache.logging.log4j.core.async.RingBufferLogEvent.execute(RingBufferLogEvent.java:156)
    at org.apache.logging.log4j.core.async.RingBufferLogEventHandler.onEvent(RingBufferLogEventHandler.java:51)
    at org.apache.logging.log4j.core.async.RingBufferLogEventHandler.onEvent(RingBufferLogEventHandler.java:29)
    at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
    at java.base/java.lang.Thread.run(Thread.java:829)

Thanks
Ranga



RE: Debug logging of accumulo classes

2023-01-18 Thread dev1
You can change the logging configuration levels for Accumulo processes by editing 
the log4j2-service.properties file.
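
For example, a sketch of the relevant stanza (mirroring the snippet quoted below; 
the logger name and level are the parts to adjust), placed in the file read by 
the server processes rather than the client one:

  logger.accumulo.name = org.apache.accumulo
  logger.accumulo.level = debug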

Ed Coleman

From: Samudrala, Ranganath [USA] via user 
Sent: Wednesday, January 18, 2023 9:41 AM
To: user@accumulo.apache.org
Subject: Debug logging of accumulo classes

Hello
I am trying to configure Accumulo v2.1.0 with non-AWS S3 storage (Minio). The 
steps are as below:

  1.  Run "accumulo init --upload-accumulo-props"
  2.  Add the S3 volume to the instance.volumes property and run "accumulo init 
--add-volumes":
      instance.volumes=hdfs://accumulo-hdfs-namenode-0.accumulo-hdfs-namenodes:8020/accumulo,s3a://accumulo/

When I run the second command, I see error as below:
2023-01-18T14:00:39,848 [main] [org.apache.accumulo.core.volume.VolumeImpl] 
ERROR: Basepath is empty. Make sure instance.volumes is set to a correct path

I am unable to turn on debug level logging for accumulo classes. I can see debug 
level logging for Hadoop classes just fine. I also turned on the async logger in 
accumulo-env.sh - so I get more logs, but debug level logging stays off 
even though I have the below in log4j2.properties:

logger.accumulo.name = org.apache.accumulo
logger.accumulo.level = debug
logger.accumulo.appenderRef.console.ref = STDOUT

and root level logger is set to debug as well!


rootLogger.level = debug
rootLogger.appenderRef.console.ref = STDOUT

So, I am unable to figure out what the BasePath value is and why.

How do I turn on debug level logging for accumulo classes?

Thanks
Ranga



RE: very-high latency updates through thrift proxy server

2022-10-31 Thread dev1
Could it be data dependent?  For example, if you have a lot of data that has 
passed its TTL, you may be scanning across a lot of data to find data that is 
eligible to return.  Similar situations could have to do with visibilities that 
you can’t access…  Or maybe it’s related to your scan range? You are scanning 
across a lot of data, but most of the rows do not match your scan criteria?

If you think it could be related to age-off rather than visibilities or scan 
range, can you run a full compaction on the table and see if that improves 
things?  That would eliminate data that has aged off and reduce the amount of 
data that must be scanned.  If you can, use hdfs to determine the directory 
size of the table – it would be under /accumulo/tables/[TABLE-ID] – then run the 
compaction (compact -w -t tablename), and when it finishes and the accumulo gc 
has run to remove the “old” files, check the size again.  That should give you 
an idea of how much data was removed by the compaction.
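
For example (a sketch; substitute your actual table id and name):

  # hdfs: directory size before the compaction
  hdfs dfs -du -s -h /accumulo/tables/TABLE-ID
  # accumulo shell: full compaction, waiting for completion
  compact -w -t tablename
  # hdfs: size again, after the accumulo gc has removed the old files
  hdfs dfs -du -s -h /accumulo/tables/TABLE-ID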

Ed Coleman

From: Christopher 
Sent: Monday, October 31, 2022 5:42 PM
To: accumulo-user 
Subject: Re: very-high latency updates through thrift proxy server

That's odd. They should be available immediately. Are you using replication? 
What kind of scan are you doing? Is it an offline scanner, or isolated scanner?
On Mon, Oct 31, 2022, 15:41 Jeff Turner <sjtsp2...@gmail.com> wrote:
any ideas why mutations via thrift proxy server would take 120 to 150
seconds to show up in a scan?

accumulo 1.9.3

the mutations have all been submitted through thrift (from python), the
writer has been closed, and the writing process has exited.

95% of the time, the latency is between 120.1 and 120.5 seconds.
occasionally the latency is 150 seconds.

there don't appear to be many configuration options for the proxy
server.  and other people i have talked to said that they see their
updates through thrift proxy immediately.

updates via java/jython have millisecond latency.  (for a while i had
been trying to blame tservers or the main server (maybe some delay in
processing compactions, ...).  i don't think that's the issue)





RE: [EXTERNAL EMAIL] - RE: accumulo 1.10.0 master log connection refused error

2022-10-17 Thread dev1
You need to kill any running masters so that the FATE command can get the 
master lock.  Once the transaction is deleted, restart them.
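
A sketch of the sequence (txid values come from a fate print listing; the exact 
fate admin utility syntax can vary by version):

  accumulo fate print           # find the stuck transaction id
  accumulo fate fail <txid>     # with no master running to hold the lock
  accumulo fate delete <txid>
  # then restart the masters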

Ed Coleman

-Original Message-
From: Shailesh Ligade via user  
Sent: Monday, October 17, 2022 9:22 AM
To: user@accumulo.apache.org
Subject: RE: [EXTERNAL EMAIL] - RE: accumulo 1.10.0 master log connection refused error

Well,

I tried fate fail, but it fails with the message:
ERROR: master lock is held, not running
Could not fail transaction:

For fate delete I also get a similar message.

So how can I just remove this transaction?

-S
-Original Message-
From: Ligade, Shailesh (ITADD) (CON)
Sent: Tuesday, October 11, 2022 11:44 AM
To: user@accumulo.apache.org
Subject: RE: [EXTERNAL EMAIL] - RE: accumulo 1.10.0 master log connection refused error

Thanks Ed,

This really helps.

I didn't see any other exception in the tserver log. Will check other things 
too.

Appreciated

-S

-Original Message-
From: Ed Coleman 
Sent: Tuesday, October 11, 2022 11:35 AM
To: user@accumulo.apache.org
Subject: [EXTERNAL EMAIL] - RE: accumulo 1.10.0 master log connection refused error

Is the table being compacted?  And if so, can compactions complete on that 
table?

You can scan hdfs under the table id and look for xx.rf_tmp files - those are 
the temporary files generated by the compaction, and they are swapped to xx.rf 
files when the compaction completes. Things to look for:

  - Are there other exceptions in the tserver log?
  - With a compaction running, is there one (or a few) hdfs directories that 
have rf_tmp files much longer than the others - use that info to track the rows 
in the metadata table and see what sticks out.
  - Are there xx.rf_tmp files that are huge?
  - Is the compaction making progress?  (the rf_tmp file size might change over 
time - though if the compaction is just processing deletes, maybe not.)
  - Are there hdfs directories that have a lot of files and the timestamp of 
the last compacted files is really old?
  - In the accumulo.metadata table, there should be a compaction count (once a 
tablet is compacted) - "srv:compact" - you may be able to scan the metadata for 
your table, find a compact id that is lagging the others, and then use the row 
info for that id to isolate the tablet server that is hosting the data and 
look at the logs there. (This is assuming that the compaction is not completing.)
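
A sketch of that scan (row-range syntax assumed from the examples above; 
substitute your table id):

  scan -t accumulo.metadata -c srv:compact -b [TABLE-ID];

Comparing the compact ids across the rows should show which tablet(s) are lagging.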

You likely can delete the FATE, but if there is an underlying issue, then at 
the next compaction it seems like it would just reoccur. You would be better 
off finding the issue and fixing it before deleting the FATE.

Ed Coleman

On 2022/10/11 12:44:01 Shailesh Ligade via user wrote:
> Looking at the fate print/dump
> 
> I do see repo: {
>"org.apache.accumulo.master.tableOps.CompactRange" {
>tableId: xx
>namespace: default
>   }
> }
> 
> Does that mean it is stuck on a table compact operation but can't finish it for 
> whatever reason, and hence it drops the tserver connection?
> Is it safe to fail/delete this fate? What are the alternatives, if any?
> 
> Appreciate your help
> 
> -S
> 
> From: Shailesh Ligade via user 
> Sent: Tuesday, October 11, 2022 8:09 AM
> To: user@accumulo.apache.org
> Subject: [EXTERNAL EMAIL] - accumulo 1.10.0 master log connection refused error
> 
> Hello,
> 
> I have a 25 node cluster with two masters. From time to time (every 4/5
> hours) I get, on a different tserver:
> org.apache.thrift.transport.TTransportException:
> java.net.ConnectException: Connection refused
> Error closing output stream
> java.io.IOException: The stream is closed
> SocketOutputStream.write(SocketOutputStream.java:118)
> ...
>
> master.LiveTServerSet$TServerConnection.compact(LiveTServerSet.java:214)
>master.tableOps.CompactionDriver.isReady(CompactionDriver:168)
> master.tableOps.CompactionDriver.isReady(CompactionDriver:54)
> master.tableOps.Tracerepo.isReady(Tracerepo.java:47)
> fate.Fate$TransactionRunner.run(Fate.java:72)
> 
> Every time it's the same exception. What may be the issue? Is it stuck in some fate
> operation?
> After this the tserver restarts (I have it in systemd, with an auto restart flag)
> 
> How to debug this further.
> Appreciate any response.
> 
> -S
> 


RE: accumulo 1.10 replication table issues

2022-05-12 Thread dev1
Looking at the code, I do not see anything other than the manager (master) 
calling things that would set the table online.

Were the manager(s) started when the replication table was ONLINE?  You could 
try stopping the managers, setting the replication table offline and then 
restarting the manager(s).  If you cannot take the replication table offline 
with the shell without a manager, you can force the table offline using the 
zkCli.sh set command – the zk path is /accumulo/[instance_id]/tables/+rep/state 
– and replace ONLINE with OFFLINE.
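
A sketch of forcing the state via zkCli.sh (with the managers stopped; 
substitute your zookeeper host and instance id):

  zkCli.sh -server zkhost:2181
  get /accumulo/[instance_id]/tables/+rep/state
  set /accumulo/[instance_id]/tables/+rep/state OFFLINE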

You could also try:

  - disabling the TABLE_REPLICATION property for any tables that you were replicating
  - deleting the TABLE_REPLICATION_TARGET property if it is set
  - removing any other replication properties that you may have set

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Thursday, May 12, 2022 9:32 AM
To: user@accumulo.apache.org
Subject: accumulo 1.10 replication table issues

Hello,

I used to have a backup cluster and the primary was replicating properly. Now I 
don't need the backup cluster, so I deleted the replication entries from the 
primary cluster and took the replication table offline. However, from time to 
time the replication table comes online, and I am seeing entries in the master 
log stating it can't find replication targets and that replication is utilizing 
max 100 threads, etc.
Since this table comes online on its own, I think it is affecting cluster 
performance.
Question is: why is the replication table not staying offline? What is the proper 
way of keeping it offline? What is the best way to remove all entries from this 
table so that even if it comes online, it will not clutter my logs?

Thanks

-S




RE: minor compaction same as flush

2022-04-14 Thread dev1
Flush and compactions are different actions.

Flush - sorts and writes current, in-memory changes to a file.  This can reduce 
the amount of recovery in case of a failure because the flushed entries do not 
need to be processed from the WAL.

Compactions combine multiple files into a single file.  Major compactions 
combine all files into a single file.  Minor compactions select a subset of 
files and combine them into a single file.

See: https://accumulo.apache.org/1.10/accumulo_user_manual.html#_compaction

Flushing will increase the number of files generated, and this will potentially 
increase the number of compactions. There are tradeoffs. If you are asking 
whether frequent flushes will reduce the time required to perform a major 
compaction - probably not much, if at all.
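
For reference, a sketch of the two shell commands being compared:

  flush -t mytable -w     # write in-memory entries to a new file, wait for completion
  compact -t mytable -w   # combine the table's files, wait for completion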

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Thursday, April 14, 2022 9:14 AM
To: user@accumulo.apache.org
Subject: minor compaction same as flush

Hello, just wanted some clarification:

Is a flush the same as a minor compaction? Is a flush better (performance-wise) 
than running, say, a range compaction?
Will flushing often help major compaction performance, or make no difference?

Thanks

-S


RE: Accumulo 1.10.0

2022-04-13 Thread dev1
If you still have an issue – check that your user has WRITE permission on the 
metadata table (even root needs to be granted it). If you grant permission(s), 
you will likely want to remove what you added once you are done, to prevent 
inadvertently modifying the table in the future if you make a mistake with a 
command intended for another table. (Besides, it is good security practice to 
operate with the minimum required permissions.)
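
For example, a sketch of the grant-then-revoke pattern in the shell (user and 
table names as appropriate for your setup):

  grant Table.WRITE -t accumulo.metadata -u root
  # ... perform the metadata repair ...
  revoke Table.WRITE -t accumulo.metadata -u root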

From: Ligade, Shailesh [USA] 
Sent: Wednesday, April 13, 2022 8:41 AM
To: 'user@accumulo.apache.org' 
Subject: Re: Accumulo 1.10.0

I think I figured it out.

I have to be on the accumulo.metadata table in order for the delete command to 
work; -t accumulo.metadata did not work, not sure why.

Thanks

-S

From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Wednesday, April 13, 2022 7:51 AM
To: 'user@accumulo.apache.org' <user@accumulo.apache.org>
Subject: Re: Accumulo 1.10.0

Thanks Ed,

a quick question,

Now that I want to delete those duplicates (there are many of those):

the scan output is

a;: loc:a [] tablet1:9997
a;: loc:zz [] tablet2:9997

What is the right delete command? When I issue

delete a; loc a -t accumulo.metadata

I get the help text, so it doesn't think it is a valid command.

I tried

delete a; loca a -t accumulo.metadata  or
delete a;: loc a -t accumulo.metadata

and still get the help message..

Thanks in advance,

-S


____
From: dev1 <d...@etcoleman.com>
Sent: Tuesday, April 12, 2022 10:07 AM
To: 'user@accumulo.apache.org' <user@accumulo.apache.org>
Subject: [External] RE: Accumulo 1.10.0


I would suspect that the metadata table became corrupted when the system went 
unstable and two tablet servers somehow ended up both thinking that they were 
responsible for the same extent(s).  This should not be because of the balancer 
running.

If you scan the accumulo.metadata table using the shell (scan -t 
accumulo.metadata -c loc) or (scan -t accumulo.metadata -c loc -b 
[TABLE_ID#]:[EXTENT]), there will be duplicated loc entries.

I am uncertain on the best way to fix this and do not have a place to try 
things out, but possible actions:

Shutdown / bounce the tservers that have the duplicated assignments – you could 
start with just one and see what happens. When the tservers go offline – the 
tablets should be reassigned and maybe only one (re)assignment will occur.

Try bouncing the manager (master).

If those don’t work, then a very aggressive / dangerous / only as a last resort:

Delete the specific loc rows from the metadata table (delete [row_id] loc 
[value] -t accumulo.metadata).  This will cause a future entry in zookeeper – 
to get that to reassign, it might be enough to bounce the master, or you may 
need to shutdown / restart the cluster.

Ed Coleman



From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Tuesday, April 12, 2022 8:36 AM
To: user@accumulo.apache.org
Subject: Accumulo 1.10.0



Hello, Last weekend we ran out of hdfs space - all volumes were 100%, yeah it 
was crazy. This accumulo has many tables with good data.

Although accumulo was up, it had 3 unassigned tablets.

So I added a few nodes to hdfs/accumulo; now hdfs capacity is 33% empty. I 
issued the hdfs rebalance command (just in case). So all good. The Accumulo 
unassigned tablets went away, but tables show no assigned tablets on the 
accumulo monitor.

On the active master I am seeing the error:

ERROR: Error processing table state for store Normal Tablets
java.lang.RuntimeException: 
org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException:
 found two locations for the same extent 

Question is: am I getting this because the balancer is running, and once it 
finishes it will recover? What can be done to save this cluster?

Thanks

-S


RE: Accumulo 1.10.0

2022-04-12 Thread dev1
I would suspect that the metadata table became corrupted when the system went 
unstable and two tablet servers somehow ended up both thinking that they were 
responsible for the same extent(s).  This should not be because of the balancer 
running.

If you scan the accumulo.metadata table using the shell (scan -t 
accumulo.metadata -c loc) or (scan -t accumulo.metadata -c loc -b 
[TABLE_ID#]:[EXTENT]), there will be duplicated loc entries.

I am uncertain on the best way to fix this and do not have a place to try 
things out, but possible actions:

Shutdown / bounce the tservers that have the duplicated assignments – you could 
start with just one and see what happens. When the tservers go offline – the 
tablets should be reassigned and maybe only one (re)assignment will occur.

Try bouncing the manager (master)

If those don’t work, then a very aggressive / dangerous / only as a last resort:

Delete the specific loc rows from the metadata table (delete [row_id] loc 
[value] -t accumulo.metadata).  This will cause a future entry in zookeeper – 
to get that to reassign, it might be enough to bounce the master, or you may 
need to shutdown / restart the cluster.
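
A sketch of that last-resort sequence in the shell (note: in the scan output, 
the piece after "loc:" is the column qualifier, which is what the delete 
command takes as its third argument):

  scan -t accumulo.metadata -c loc          # identify the duplicated loc entries
  delete [row_id] loc [qualifier] -t accumulo.metadata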

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Tuesday, April 12, 2022 8:36 AM
To: user@accumulo.apache.org
Subject: Accumulo 1.10.0

Hello, Last weekend we ran out of hdfs space - all volumes were 100%, yeah it 
was crazy. This accumulo has many tables with good data.

Although accumulo was up, it had 3 unassigned tablets.

So I added a few nodes to hdfs/accumulo; now hdfs capacity is 33% empty. I 
issued the hdfs rebalance command (just in case). So all good. The Accumulo 
unassigned tablets went away, but tables show no assigned tablets on the 
accumulo monitor.

On the active master I am seeing the error:

ERROR: Error processing table state for store Normal Tablets
java.lang.RuntimeException: 
org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException:
 found two locations for the same extent 

Question is: am I getting this because the balancer is running, and once it 
finishes it will recover? What can be done to save this cluster?

Thanks

-S


RE: [External] Re: odd issue with accumulo 1.10.0 starting up

2022-03-17 Thread dev1
When an Accumulo process abnormally terminates, there may be a file created with 
the exception that caused the problem – the files may be named *.out (or *.err), 
I can’t recall which. Normally the files have 0 size, but on termination they 
will have some text.

Are you seeing those files and do they point to the issue?

Do you have the jvm configured to terminate on out of memory – and print that 
error condition? Maybe the manager is running out of memory.
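
For example, a sketch of the relevant JVM flags (set in accumulo-env.sh; the 
exact variable name differs between versions, and these HotSpot flags require 
Java 8u92+):

  # die loudly, and leave a heap dump, instead of limping along after an OOM
  JAVA_OPTS="${JAVA_OPTS} -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError"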

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Wednesday, March 16, 2022 3:31 PM
To: user@accumulo.apache.org
Subject: RE: [External] Re: odd issue with accumulo 1.10.0 starting up

Thanks,

I think we are having the same or a similar issue with a virus scan/security scan. 
However, that should not bring down the master, should it?

I am still digging thru the logs.

-S

From: Adam J. Shook <adamjsh...@gmail.com>
Sent: Wednesday, March 16, 2022 2:46 PM
To: user@accumulo.apache.org
Subject: Re: [External] Re: odd issue with accumulo 1.10.0 starting up

This is certainly anecdotal, but we've seen this "ERROR: Read a frame size of 
(large number)" before on our Accumulo cluster that would show up at a regular 
and predictable frequency. The root cause was due to a routine scan done by the 
security team looking for vulnerabilities across the entire enterprise (nothing 
Accumulo-specific). I don't have any additional information about the specifics 
of the scan. From all that we can tell, it has no impact on our Accumulo 
cluster outside of these error messages.

--Adam

On Wed, Mar 16, 2022 at 8:35 AM Christopher <ctubb...@apache.org> wrote:
Since that error message is coming from the libthrift library, and not Accumulo 
code, we would need a lot more context to even begin helping you troubleshoot 
it. For example, the complete stack trace that shows the Accumulo code that 
called into the Thrift library, would be extremely helpful.

It's a bit concerning that you're trying to send a single buffer over thrift 
that's over a gigabyte in size, according to that number. You've said before 
that you use live ingest. Are you trying to send a 1GB mutation to a tablet 
server? Or are you using replication and the stack trace looks like it's 
sending 1GB of replication data?

On Wed, Mar 16, 2022 at 7:14 AM Ligade, Shailesh [USA] 
<ligade_shail...@bah.com> wrote:
Well, I re-initialized accumulo but I still see

ERROR: Read a frame size of 1195725856, which is bigger than the maximum 
allowable buffer size for ALL connections.

Is there a setting that I can increase to get past it?

-S



From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Tuesday, March 15, 2022 12:47 PM
To: user@accumulo.apache.org
Subject: Re: [External] Re: odd issue with accumulo 1.10.0 starting up

Not daily, but over the weekend.

From: Mike Miller <mmil...@apache.org>
Sent: Tuesday, March 15, 2022 10:39 AM
To: user@accumulo.apache.org
Subject: Re: [External] Re: odd issue with accumulo 1.10.0 starting up

Why are you bringing the cluster down every night? That is not ideal.

On Tue, Mar 15, 2022 at 9:24 AM Ligade, Shailesh [USA] 
<ligade_shail...@bah.com> wrote:
Thanks Mike,

We bring the servers down nightly; these are on aws. This worked yesterday 
(Monday), but this morning (Tuesday) I went to check on it and it was down. I 
guess I didn't check yesterday - I assume it was up as no one complained - but 
it was up and kicking last week for sure.

So I am not exactly sure when or what caused it. All services are up (tserver, 
master), so the services are not crashing themselves.

I guess worst case, I can re-initialize and recreate the tables from hdfs.. :-(

-S

From: Mike Miller <mmil...@apache.org>
Sent: Tuesday, March 15, 2022 9:16 AM
To: user@accumulo.apache.org
Subject: Re: [External] Re: odd issue with accumulo 1.10.0 starting up

What was going on in the tserver before you saw that error? Did it finish 
recovering after the restart? If it is still recovering, I don't think you will 
be able to do any scans.

On Tue, Mar 15, 2022 at 8:56 AM Ligade, Shailesh [USA] 
<ligade_shail...@bah.com> wrote:
Thanks Mike,

That was my first reaction, but the instance is backed by puppet and no 
configuration was updated (I double checked and ran puppet manually as well as 
automatically after the restart). Since the system was operational yesterday, I 
think I can rule that out.

For the other error, I did see the exact error 

RE: [External] Re: accumulo 1.10.0 unassigned tablets issue

2022-03-04 Thread dev1
To amplify Christopher's comments: 33k+ tablets is a lot of work for a tserver 
to manage, and may be causing some of your stability issues.  Besides 
consolidating tablets into fewer files / tablets, you probably should look into 
running multiple tservers per node.  You will need to adjust the memory 
allocations so that everything fits.

33k+ is really a high number - I would suggest that you take an all-of-the-above 
approach to reduce it: consolidate tablets to the extent possible, use larger 
split sizes, and run multiple tservers per node.

Ed Coleman

-Original Message-
From: Christopher  
Sent: Friday, March 4, 2022 9:14 AM
To: accumulo-user 
Subject: Re: [External] Re: accumulo 1.10.0 unassigned tablets issue

On Fri, Mar 4, 2022 at 8:36 AM Ligade, Shailesh [USA]  
wrote:
>
> Thanks Chris[topher],


>
> Appreciate your support!
>
> Not sure why volumes.replacement was set, especially since we have an HA 
> namenode and that’s the only hdfs targeted. The volumes.replacement 
> was set to the same url though, e.g. nameservice/accumulo, 
> nameservice:8020/accumulo

That explains the relocation messages.

>
> Regardless, when a tserver went down, even though we set 
> table.suspend.duration=15m, I was seeing volume replacement messages in the 
> master log for every tablet hosted, and that is taking a long time (hours for 
> 33k tablets/tserver). So how best to remove these volumes? There is no 
> delete-volumes; I see only add-volumes under accumulo init. Is there anything 
> I need to do after I remove the entire instance.volumes.replacement section 
> from accumulo-site.xml?

Just restart any server that had that replacements config, so they don't try to 
unnecessarily update metadata that is already correct.
Updating volume references using the replacements config is just a metadata 
update, though, not a lot of I/O. I'm not sure it would explain things taking a 
long time. It's possible that it's contributing to the slowness, I suppose; 
perhaps the tserver hosting the metadata tablet for the tablet whose metadata 
is being updated is also managing 33k other tablets.

In the past, I think we've recommended around 100 up to 1K tablets per server. 
I'm not sure if that's still a good recommendation or not. In any case, you 
can't reduce the number of tablets you have without doing merges, or deleting 
entire ranges, or compacting and bulk importing into a new table with more 
reasonable split points. And you probably shouldn't try that until you have 
your current situation under control. But, that's sorta why I was previously 
suggesting to examine your whole config. Maybe think about your whole 
architecture, to figure out where you want to go, and compare with where you 
are now, so you can figure out how to get to your target setup from your 
current setup.

>
> I will have to look at each and every property to ensure it makes sense for 
> sure..
>
> Thanks
>
> -S
>
> -Original Message-
> From: Christopher 
> Sent: Wednesday, March 2, 2022 3:09 PM
> To: accumulo-user 
> Subject: Re: [External] Re: accumulo 1.10.0 unassigned tablets issue
>
> On Wed, Mar 2, 2022 at 1:51 PM Ligade, Shailesh [USA] 
>  wrote:
> >
> > Thanks Chris[topher],
> >
> > I do have instance.volumes.replacements overridden
> >
> > Does that mean it will not work with table.suspend.duration property?
>
> No. It's just that's where the RecoveryManager message is coming from.
>
> >
> > uhmm thinking about it i am not sure why we set that as we have only one 
> > hdfs and we have less than 10 beefy nodes...
> >
> > Maybe I can remove this property after I set table.suspend.duration, and 
> > stop/reboot the tserver. After I am done, I can restore the property. Please 
> > advise.
>
> I have no idea why you would set that if you're not replacing one volume with 
> another. I think you would probably benefit from reviewing all of your 
> configuration. Please check the documentation for an explanation of each 
> property. If you have a specific question regarding them, you can ask here, 
> but I would start by reviewing your configs against the docs.
>
> >
> > Thanks
> >
> > -S
> >
> >
> > 
> > From: Christopher 
> > Sent: Wednesday, March 2, 2022 1:32 PM
> > To: accumulo-user 
> > Subject: [External] Re: accumulo 1.10.0 unassigned tablets issue
> >
> > The replacements message should only appear if you have 
> > instance.volumes.replacements set in your configuration.
> >
> > On Wed, Mar 2, 2022 at 11:02 AM Ligade, Shailesh [USA] 
> >  wrote:
> > >
> > > Hello,
> > >
> > > I need to reboot a tserver with 34k hosted tablets.
> > >
> > > I set table.suspend.duration to 15 min, stopped the tserver and rebooted 
> > > the machine.
> > >
> > > As soon as the tablet server came online, its hosted tablet count went 
> > > from 0 to 34k; however, on the master I see 34k unassigned tablets, and 
> > > although the count is going down it is taking hours.
> > > not sure why master is 

RE: ENTRIES listed on accumulo 1.10 monitor

2022-02-18 Thread dev1
On bulk import the count is not known. The count is updated on compactions: 
once a tablet / file is compacted, its count will be updated. The count is only 
approximate anyway - it will not be able to determine if a row has expired 
(past age-off) after the count has been calculated.

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Friday, February 18, 2022 8:16 AM
To: user@accumulo.apache.org
Subject: ENTRIES listed on accumulo 1.10 monitor

Good morning,

Just a quick question: what does Entries on the accumulo master status page 
indicate for a table?
Yesterday I had to re-initialize accumulo 1.10. Before starting, I moved my hdfs 
/accumulo to /accumulo-old, then I copied the rf files under 
/accumulo-old/tables/x to /tmp/x, and after initializing I did importdirectory 
to create the new table x.
Everything worked flawlessly, and the hdfs data size of /accumulo-old/tables/x 
matches /accumulo/tables/x; however, the old Entries were 22B and the new 
Entries are 2B.
So what do the entries really represent? I always thought it was actually a 
record count, but it doesn't appear that way..

-S


RE: how to stop entire (or on a single table) replication properly

2022-02-17 Thread dev1
So, I'm not familiar with replication in an operational setting - so my 
comments are based on my mental model of what I think replication is doing - 
the implementation may not match my mental model - maybe someone else with more 
familiarity can chime in.

I'm reading that you want to stop replication and do not care to preserve data 
that may be "in-flight"

Why don't you just stop replication on the source and then create the table 
that is expected to exist as the destination?  When that data has been 
"replicated", the source replication table should be empty - then just delete 
the destination table. You are still getting rid of the data, and you let 
replication do the housekeeping for you.

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Thursday, February 17, 2022 8:15 AM
To: 'user@accumulo.apache.org' 
Subject: Re: how to stop entire (or on a single table) replication properly

Thanks Ed,

Let me rephrase it. I need to stop replication, as my tables on the peer are 
changing. After stopping, I will need to start replication to the tables again.

To stop the replication, on the primary instance tables I am going to set the 
replication config to false. Basically running
config -t my_table -s table.replication=false (currently true).
I believe that setting will stop replicating that table to the peer.

However, there is still data in the primary replication table, and the system 
will still try to replicate to the peer (the corresponding tables no longer 
exist on the peer!); I can see it is still replicating to the peer on the 
replication page of the monitor UI. I can set the primary replication table 
offline, but when I bring it online again, that data will still be there. So 
the question is: how can I safely remove the data in the primary replication 
table?

One time I tried to do deleteall on the primary replication table, but when the 
accumulo master restarted, it was complaining a lot about replication data, so 
I just wanted to figure out the proper steps.

thanks

-S
____
From: dev1 <d...@etcoleman.com>
Sent: Thursday, February 17, 2022 8:04 AM
To: 'user@accumulo.apache.org' <user@accumulo.apache.org>
Subject: [External] RE: how to stop entire (or on a single table) replication 
properly


I do not understand what you are asking - it would help if you stated what you 
are trying to accomplish and if you clearly identified source vs. destination.



Ed Coleman



From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Thursday, February 17, 2022 7:37 AM
To: user@accumulo.apache.org
Subject: how to stop entire (or on a single table) replication properly



Hello,



If I must stop replication entirely, I set the replication config for each 
individual table to false. However, this will not affect entries in the 
replication table, and the system will keep (or try to keep) replicating.

I can take the replication table offline, but eventually, when I need to start 
replication again, it will not be clean. How can I delete entries in the 
replication table? Can I just do a deleteall? Will that work?



-S


RE: how to stop entire (or on a single table) replication properly

2022-02-17 Thread dev1
I do not understand what you are asking - it would help if you stated what you 
are trying to accomplish and if you clearly identified source vs. destination.

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Thursday, February 17, 2022 7:37 AM
To: user@accumulo.apache.org
Subject: how to stop entire (or on a single table) replication properly

Hello,

If I must stop replication entirely, I set the replication config for each 
individual table to false. However, this will not affect entries in the 
replication table, and the system will keep (or try to keep) replicating.
I can take the replication table offline, but eventually, when I need to start 
replication again, it will not be clean. How can I delete entries in the 
replication table? Can I just do a deleteall? Will that work?

-S


RE: accumulo 1.10.0 masters won't start

2022-02-16 Thread dev1
I would use importdirectory [src_dir] [fail_dir] -t new_table false

I would move the files from under accumulo (either shut down, or at least have 
the table offline) into hdfs directories for each batch (10K or so): batch1, 
batch2, …  I think importdirectory expects a flat directory of just files.
Then I can import one batch, check for errors, and repeat.

The table you create – the splits will be whatever you set.  Again, maybe 1 
split for each tserver (that’s about optimum for the batch import).  Set the 
split size higher.  The import command will then place the data according 
to the splits on the new table – so in your case, it's likely multiple files from 
your current splits will be assigned to 1 tserver – effectively being a merge.

My approach to these things is to create scripts based on the info that I have, 
and run them in steps so I can see if things are progressing and make 
adjustments if not.  I use individual scripts so that I have positive 
control.  Pipe, grep, awk to build the commands, review the files as a sanity 
check, and then run them.
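
A sketch of that kind of batching script (the paths are hypothetical, and the 
importdirectory syntax follows the form above - adapt both to your setup):

  for b in batch1 batch2 batch3; do
    hdfs dfs -mkdir -p /staging/fail-$b
    # shell will prompt for a password; add credentials as appropriate
    accumulo shell -u root -e "importdirectory /staging/$b /staging/fail-$b false -t new_table"
    hdfs dfs -ls /staging/fail-$b   # any files here are failed imports - stop and review
  done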

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 16, 2022 9:37 AM
To: 'user@accumulo.apache.org' 
Subject: Re: accumulo 1.10.0 masters won't start

Thanks

Since hdfs fsck is fine on /accumulo, I can back up my tables to some location 
within hdfs (not under accumulo) and reinitialize accumulo.
Then I can recreate my tables/users on the new instance.
What will be the command to import/load existing hdfs data into this newly 
created table? The importtable command creates a new table as well, so I guess 
I need to test it somewhere.
Also, while loading old data into the new table, what can I do to get rid of 
these splits/tablets?

I think this will be the faster approach for me to recover..

Thank you so much

-S

From: dev1 <d...@etcoleman.com>
Sent: Wednesday, February 16, 2022 9:29 AM
To: 'user@accumulo.apache.org' <user@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

> What happens if I let the #tablets grown?

It sounds like you might be in the worst case now?  There is overhead for each 
tablet - at what point the master / other things fall over is not something 
I've tried to find out. Even scanning the metadata table and the gc process are 
doing a lot of work to track / process that many files / tablets, and it is 
likely unnecessary.

What is the command / arguments that you are using for compactions?  The 
comment minimal sleep after 100 compaction commands is confusing to me.

Can you buffer the replication?

You might be able to:
 - create a new table.
 - point the replication to write to the new table.
 - ingest data from the old into the new.

You should look towards picking a split threshold so that you have 1 or maybe a 
few tablets per tserver (with some reasonable split size.)  Split sizes of 3G 
or 5G are not uncommon - and larger is reasonable.

Ed Coleman

From: Ligade, Shailesh [USA] <ligade_shail...@bah.com>
Sent: Wednesday, February 16, 2022 8:05 AM
To: 'user@accumulo.apache.org' <user@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

We have two clusters: one is the source, the other is the peer. I am testing 
this large table on the peer first (eventually I need to do the same on the 
source cluster). I am not stopping ingest on the source cluster, so replication 
will continue to the peer table; however, while I am doing this not much ingest 
is happening.


I tried the range compaction along with a range merge; however, merge takes 
forever (even over a single range.. I didn't try many, just the first few) and 
before it finishes I get a zookeeper error and both masters crash. I had to bump 
that jute setting (in both java.env on zookeeper and accumulo-env on accumulo) 
to get it back. So I left merges alone and am just trying the 72k compactions; 
since compactions are not backing up, I am doing a minimal sleep after every 
100 compact commands. But sometimes during compactions I still get zookeeper 
errors and a master crash.

I do get your idea of creating a new table with fewer splits; that way the new 
table will be compacted. However, for that I will need to stop ingest on the 
primary and then set up replication on the new cluster again.. I was avoiding 
that, but I guess that may be my only option.

A last question: if I let this table continue to grow tablets, what is the 
worst case scenario? How may it affect system performance?

Thanks

-S

From: dev1 <d...@etcoleman.com>
Sent: Tuesday, February 15, 2022 3:11 PM
To: 'user@accumulo.apache.org' <user@accumulo.apache.org>
Subject: [External] RE: accumulo 1.10.0 masters won't start


Can you compact the table?  How aggressive do you want to get? I do not 
understand why you are getting the ZooKeeper errors – is it related to the 
number of tablets, or is it something else?  (an iterator t

Re: accumulo 1.10.0 masters won't start

2022-02-16 Thread dev1
> What happens if I let the #tablets grown?

It sounds like you might be in the worst case now?  There is overhead for each 
tablet - at what point the master / other things fall over is not something 
I've tried to find out. Even scanning the metadata table and the gc process are 
doing a lot of work to track / process that many files / tablets, and it is 
likely unnecessary.

What is the command / arguments that you are using for compactions?  The 
comment minimal sleep after 100 compaction commands is confusing to me.

Can you buffer the replication?

You might be able to:
 - create a new table.
 - point the replication to write to the new table.
 - ingest data from the old into the new.

You should look towards picking a split threshold so that you have 1 or maybe a 
few tablets per tserver (with some reasonable split size.)  Split sizes of 3G 
or 5G are not uncommon - and larger is reasonable.
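
For example (a sketch; table name hypothetical, property per the user manual):

  config -t new_table -s table.split.threshold=5G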

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 16, 2022 8:05 AM
To: 'user@accumulo.apache.org' 
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

We have two clusters: one is the source, the other is the peer. I am testing 
this large table on the peer first (eventually I need to do the same on the 
source cluster). I am not stopping ingest on the source cluster, so replication 
will continue to the peer table; however, while I am doing this not much ingest 
is happening.


I tried the range compaction along with a range merge; however, merge takes 
forever (even over a single range.. I didn't try many, just the first few) and 
before it finishes I get a zookeeper error and both masters crash. I had to bump 
that jute setting (in both java.env on zookeeper and accumulo-env on accumulo) 
to get it back. So I left merges alone and am just trying the 72k compactions; 
since compactions are not backing up, I am doing a minimal sleep after every 
100 compact commands. But sometimes during compactions I still get zookeeper 
errors and a master crash.

I do get your idea of creating a new table with fewer splits; that way the new 
table will be compacted. However, for that I will need to stop ingest on the 
primary and then set up replication on the new cluster again.. I was avoiding 
that, but I guess that may be my only option.

A last question: if I let this table continue to grow tablets, what is the 
worst case scenario? How may it affect system performance?

Thanks

-S

From: dev1 
Sent: Tuesday, February 15, 2022 3:11 PM
To: 'user@accumulo.apache.org' 
Subject: [External] RE: accumulo 1.10.0 masters won't start


Can you compact the table?  How aggressive do you want to get? I do not 
understand why you are getting the ZooKeeper errors – is it related to the 
number of tablets, or is it something else?  (an iterator that was attached 
with a very large set of arguments, a very large list or some sort of binary 
data – say a bloom filter)



If it were me – you need to balance your goals and requirements, which might 
dictate a less aggressive approach.  At this point I’m assuming that getting 
things back on line without data loss is the top priority (and that I was sure 
it was not related to something that I attached to the table).



If I have room, I would compact the table(s). It could also depend on how long a 
compaction would take and if I could wait.  It is generally preferable to work 
on files that have had any deleted data removed, and the total number of files 
can be reduced when files from minor compactions and bulk ingest are combined 
into a single file for that tablet.



Stop ingest.

Flush the source table – allow any compactions to settle. (Optional if 
compacting, but should be a quick command to execute)

(Optional – compact the original.)

Clone the source table.

Compact the clone so that the clone does not share any files with the source

Optionally – use the exporttable command to generate a list of files from the 
clone – you may not need it, but it could be handy.

Take the clone offline.

Move the files under /accumulo/tables/[clone-id]/dirs to one or more staging 
directories (in hdfs) – the export list could help.

Delete the clone table – (I believe that the delete will not check for the 
existence of files if it is offline.) If not, then it would be necessary to use 
an empty rfile as a placeholder.

Create a new table and set splits – this could be your desired number – or use 
just enough splits that each tserver has 1 tablet.

Set the default table split size to some multiple of the desired final size – 
this limits splitting during the imports. Not critical, but may be faster.

Take the new table offline and then back online – this will immediately migrate 
the splits – or you could just wait for the migration to finish.

Bulk import the files from the staging area(s) – likely in batches.  You will 
likely have ~72K files – so maybe ~7,000 files / batch?

Once all files have been imported set the split threshold to desired size.

Check that permissions, us

Re: accumulo 1.10.0 masters won't start

2022-02-16 Thread dev1
Tservers will hold locks.

From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 16, 2022 9:10 AM
To: 'user@accumulo.apache.org' 
Subject: Re: accumulo 1.10.0 masters won't start

Well, now the master doesn't come up, throwing all sorts of zookeeper errors; 
the only changes I made were jute.maxbuffer set to a max of 0x60 (in both 
zookeeper java.env and accumulo-site) and instance.zookeeper.timeout set to 90s.

Even if both masters are down, I still see table_locks under the znode - is that 
normal?

Appreciate your help

-S

From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 16, 2022 8:05 AM
To: 'user@accumulo.apache.org' 
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

We have two clusters: one is the source, the other is the peer. I am testing 
this large table on the peer first (eventually I need to do the same on the 
source cluster). I am not stopping ingest on the source cluster, so replication 
will continue to the peer table; however, while I am doing this not much ingest 
is happening.


I tried the range compaction along with a range merge; however, merge takes 
forever (even over a single range.. I didn't try many, just the first few) and 
before it finishes I get a zookeeper error and both masters crash. I had to bump 
that jute setting (in both java.env on zookeeper and accumulo-env on accumulo) 
to get it back. So I left merges alone and am just trying the 72k compactions; 
since compactions are not backing up, I am doing a minimal sleep after every 
100 compact commands. But sometimes during compactions I still get zookeeper 
errors and a master crash.

I do get your idea of creating a new table with fewer splits; that way the new 
table will be compacted. However, for that I will need to stop ingest on the 
primary and then set up replication on the new cluster again.. I was avoiding 
that, but I guess that may be my only option.

A last question: if I let this table continue to grow tablets, what is the 
worst case scenario? How may it affect system performance?

Thanks

-S

From: dev1 
Sent: Tuesday, February 15, 2022 3:11 PM
To: 'user@accumulo.apache.org' 
Subject: [External] RE: accumulo 1.10.0 masters won't start


Can you compact the table?  How aggressive do you want to get? I do not 
understand why you are getting the ZooKeeper errors – is it related to the 
number of tablets, or is it something else?  (an iterator that was attached 
with a very large set of arguments, a very large list or some sort of binary 
data – say a bloom filter)



If it were me – you need to balance your goals and requirements, which might 
dictate a less aggressive approach.  At this point I’m assuming that getting 
things back on line without data loss is the top priority (and that I was sure 
it was not related to something that I attached to the table).



If I have room, I would compact the table(s). It could also depend on how long a 
compaction would take and if I could wait.  It is generally preferable to work 
on files that have had any deleted data removed, and the total number of files 
can be reduced when files from minor compactions and bulk ingest are combined 
into a single file for that tablet.



Stop ingest.

Flush the source table – allow any compactions to settle. (Optional if 
compacting, but should be a quick command to execute)

(Optional – compact the original.)

Clone the source table.

Compact the clone so that the clone does not share any files with the source

Optionally – use the exporttable command to generate a list of files from the 
clone – you may not need it, but it could be handy.

Take the clone offline.

Move the files under /accumulo/tables/[clone-id]/dirs to one or more staging 
directories (in hdfs) – the export list could help.

Delete the clone table – (I believe that the delete will not check for the 
existence of files if it is offline.) If not, then it would be necessary to use 
an empty rfile as a placeholder.

Create a new table and set splits – this could be your desired number – or use 
just enough splits that each tserver has 1 tablet.

Set the default table split size to some multiple of the desired final size – 
this limits splitting during the imports. Not critical, but may be faster.

Take the new table offline and then back online – this will immediately migrate 
the splits – or you could just wait for the migration to finish.

Bulk import the files from the staging area(s) – likely in batches.  You will 
likely have ~72K files – so maybe ~7,000 files / batch?

Once all files have been imported set the split threshold to desired size.

Check that permissions, users, iterators, table config parameters are present 
on the new table and match the source.

Rename the source table to old_xxx or whatever

Rename the new table to the source table, verify things are okay and delete the 
original.
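
A condensed sketch of the clone-and-stage steps above, in the shell (table and 
path names hypothetical):

  flush -t src_table -w
  clonetable src_table src_clone
  compact -t src_clone -w            # clone no longer shares files with the source
  offline -t src_clone
  exporttable -t src_clone /staging/export   # optional file listing (table must be offline)
  # move the clone's files out to a staging area in hdfs, then:
  deletetable src_clone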



If you don’t have the space, you could skip operating on the clone, but then 
you can’t fall back to the original if things go wrong.



Another

RE: accumulo 1.10.0 masters won't start

2022-02-15 Thread dev1
at was changed and revert it after it stabilizes.

Thanks a bunch.

-S

____
From: dev1 <d...@etcoleman.com>
Sent: Wednesday, February 9, 2022 5:54 PM
To: user@accumulo.apache.org
Subject: [External] Re: accumulo 1.10.0 masters won't start

You might want to set the accumulo (zookeeper client) side - by setting 
ACCUMULO_JAVA_OPTS, which is processed in accumulo-env.sh (or just edit that 
file?)

Looking at the ZooKeeper documentation, it describes what looks like what you 
are seeing:

When jute.maxbuffer in the client side is less than the server side, the client 
wants to read the data exceeds jute.maxbuffer in the client side, the client 
side will get java.io.IOException: Unreasonable length or Packet len is out of 
range!

Also, a search showed jira tickets that had a server side limit of 4MB, but 
client limits of 1MB - you may want to see if 4194304 (or larger) works as a 
value.



From: dev1 <d...@etcoleman.com>
Sent: Wednesday, February 9, 2022 5:25 PM
To: user@accumulo.apache.org
Subject: Re: accumulo 1.10.0 masters won't start

jute.maxbuffer is a ZooKeeper property - it needs to be set in the zookeeper 
configuration.  If this is still correct, then it looks like there are a few 
options: 
https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit

But maybe the ZooKeeper documentation for your version can provide additional 
guidance?



From: Shailesh Ligade <slig...@fbi.gov>
Sent: Wednesday, February 9, 2022 5:02 PM
To: user@accumulo.apache.org
Subject: RE: accumulo 1.10.0 masters won't start


Thanks

Even if I set jute.maxbuffer on zookeeper in the conf/java.env file to

-Djute.maxbuffer=30

I see in the accumulo master log:

INFO: jute.maxbuffer value is 1048575 Bytes

Not sure where to set that on the accumulo side.

I set the instance.zookeeper.timeout value to 90s in accumulo-site.xml

But still get those zookeeper KeeperErrorCode errors

-S



From: dev1 <d...@etcoleman.com>
Sent: Wednesday, February 9, 2022 4:27 PM
To: user@accumulo.apache.org
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start



I would not recommend setting the goal state directly until there are no other 
alternatives.

It is hard to recommend what to do, because it is unclear what put you into the 
current situation and what action / impact you might have had trying to fix 
things -

why did the goal state become unset in the first place?
what did you stuff into the fates that increased the need for larger jute 
buffers?

It could be that the number of tables and servers pushed you over the limit - 
or it could be something else.

What I would do:

Shutdown accumulo and make sure all services / tservers are stopped.
Shutdown any other services that might be using ZooKeeper.
Shutdown ZooKeeper.

Set the larger jute.maxbuffer and increase the timeout values across the board 
and in any dependent services.

Start hdfs - if you needed to shut it down.
Start just zookeeper - and use zkCli.sh to examine the state of things.  If 
that looks okay:
Start just the master - how far does it come up?  It will not be able to load 
the root / metadata tables, but it may give some indication of state.

I'd then cycle between stopping the master and trying to clean up things using 
zkCli.sh, following any guidance from the errors the master is generating. If 
that looks promising, then:

With the master stopped - start the tservers and check a few logs. If there 
are exceptions, determine if they are something that is pointing to an issue - 
or just something that is transient and handled.

Once the tservers are up and looking okay - start the master.

One of the things to grab as soon as you can get the shell to run - get a 
listing of the tables and the ids.  If

RE: Tablet assignment slow upon restart [SEC=UNOFFICIAL]

2022-02-14 Thread dev1
There is a setting, table.suspend.duration (see 
https://accumulo.apache.org/1.10/accumulo_user_manual.html#_table_suspend_duration)

That will pause the tablet reassignment while a tserver restarts.  There was a 
discussion on doing rolling restarts on this list around Dec 2, 2021 (one of 
the emails in the chain - 
https://lists.apache.org/thread/m3twvthrfrc79m4ln365wts3p62pl23l )
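For example, a minimal sketch from the shell - the table name and duration are 
placeholders, size the duration to cover a single tserver restart:

  config -t mytable -s table.suspend.duration=5m

Running config -s table.suspend.duration=5m without -t should set the 
system-wide default instead.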

Ed Coleman

From: Christopher 
Sent: Monday, February 14, 2022 7:30 PM
To: accumulo-user 
Subject: Re: Tablet assignment slow upon restart [SEC=UNOFFICIAL]

Have you considered upgrading to 1.10.2? It includes changes in 1.10.0 that we 
released in September 2020 to specifically address slow startups due to 
rebalance thrashing on restarts: 
https://accumulo.apache.org/release/accumulo-1.10.0/#tserver-startup-and-shutdown-protections

However, I don't know if that is the cause of your specific issues.

On Mon, Feb 14, 2022 at 5:46 PM McClure, Bruce MR 2 
mailto:bruce.mcclu...@defence.gov.au>> wrote:

UNOFFICIAL
Hi,

Working with a reasonably sized Accumulo 1.9 cluster (not small, not enormous) 
it seems that when things go wrong and I need to restart all the t-servers, it 
takes a long time to re-assign all the tablets.  For example, restart, go home 
for the evening, come in in the morning and it is half-way through re-assigning 
the tablets.

Is there a setting or obvious place to look regarding why this takes so long?  
A “go faster” button would be great.  I have tried changing 
“tserver.assignment.concurrent.max” from 2 to 10 to 100, but it doesn’t seem to 
help.

In addition, in the most recent exercise, I have seen some warnings in the log 
about how long reassignments were taking (“has been running for at least 
3152602ms”) – with a specific tablet’s range, and the associated stack trace 
went through “acquireRecoveryMemory”.  So maybe this is a clue about something 
not happy with some specific tablets?

Thanks,

Bruce.



Re: accumulo 1.10.0 masters won't start

2022-02-09 Thread dev1
You might want to set it on the accumulo (zookeeper client) side - by setting 
ACCUMULO_JAVA_OPTS, which is processed in accumulo-env.sh (or just edit that 
file?)

Looking at the ZooKeeper documentation, it describes what looks like what you 
are seeing:

When jute.maxbuffer on the client side is less than on the server side and the 
client tries to read data that exceeds the client-side jute.maxbuffer, the 
client will get java.io.IOException: Unreasonable length or Packet len is out 
of range!

Also, a search showed jira tickets that had a server-side limit of 4MB but 
client limits of 1MB - you may want to see if 4194304 (or larger) works as a 
value.
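As a sketch of the client side - assuming a 1.x accumulo-env.sh that exports 
ACCUMULO_GENERAL_OPTS (check which variable your accumulo-env.sh actually 
passes to the JVM):

  # accumulo-env.sh - match the ZooKeeper servers' jute.maxbuffer (4MB here, illustrative)
  export ACCUMULO_GENERAL_OPTS="${ACCUMULO_GENERAL_OPTS} -Djute.maxbuffer=4194304"

All accumulo processes (and zkCli.sh, if you use it) would need the same value 
as the servers.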



Re: accumulo 1.10.0 masters won't start

2022-02-09 Thread dev1
jute.maxbuffer is a ZooKeeper property - it needs to be set in the ZooKeeper 
configuration.  If this is still correct, then it looks like there are a few 
options: 
https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit

But maybe the ZooKeeper documentation for your version can provide additional 
guidance?



From: Shailesh Ligade 
Sent: Wednesday, February 9, 2022 5:02 PM
To: user@accumulo.apache.org 
Subject: RE: accumulo 1.10.0 masters won't start


Thanks



Even if I set jute.maxbuffer on zookeeper in conf/java.env file to



-Djute.maxbuffer=30



I see in accumulo master log as



INFO: jute.maxbuffer value is 1048575 Bytes - not sure where to set that on 
the accumulo side.



I set instance.zookeeper.timeout value to 90s in accumulo-site.xml



But still get those zookeeper KeeperErrorCode errors



-S




Re: accumulo 1.10.0 masters won't start

2022-02-09 Thread dev1
I would not recommend setting the goal state directly until there are no other 
alternatives.

It is hard to recommend what to do, because it is unclear what put you into the 
current situation and what action / impact you might have had trying to fix 
things -

why did the goal state become unset in the first place?
what did you stuff into the fates that increased the need for larger jute 
buffers?

It could be that the number of tables and servers pushed you over the limit - 
or it could be something else.

What I would do.

Shutdown accumulo and make sure all services / tservers are stopped.
Shutdown any other services that might be using ZooKeeper.
Shutdown ZooKeeper.

Set the larger jute.maxbuffer and increase the timeout values across the board 
and in any dependent services.

Start hdfs - if you needed to shut it down.
Start just zookeeper - and use zkCli.sh to examine the state of things.  If 
that looks okay:

Start just the master - how far does it come up?  It will not be able to load 
the root / metadata tables, but it may give some indication of state.

I'd then cycle between stopping the master and trying to clean up things using 
zkCli.sh, guided by any errors the master is generating. If that looks 
promising, then:

With the master stopped - start the tservers and check a few logs.  If there 
are exceptions, determine whether they point to a real issue - or are just 
something transient and handled.

Once the tservers are up and looking okay - start the master.

One of the things to grab as soon as you can get the shell to run - get a 
listing of the tables and the ids.  If the worst happens, you can use that to 
map the existing data into a "new" instance. Hopefully it will not come to that 
and you will not need it - but if you don't have it and you need it, well... 
The table names and id are all in ZooKeeper.
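One sketch of capturing that, with placeholder credentials:

  accumulo shell -u root -p supersecret -e 'tables -l' > table_name_to_id.txt

If I recall correctly, tables -l prints each table name together with its 
internal id.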

Ed Coleman


From: Shailesh Ligade 
Sent: Wednesday, February 9, 2022 3:47 PM
To: user@accumulo.apache.org 
Subject: RE: accumulo 1.10.0 masters won't start


Thanks I can try that,



At this point, my goal is to get accumulo up. I was just wondering: if I set a 
different goal like SAFE_MODE, will it come up by ignoring fate and other 
issues? If that comes up, can I switch back to NORMAL - will that work? I 
understand there may be some data loss..



-S




Re: accumulo 1.10.0 masters won't start

2022-02-09 Thread dev1
For values in zoo.cfg see: 
https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#sc_advancedConfiguration

maxSessionTimeout

In the accumulo config - instance.zookeeper.timeout (default 30s)

The zookeeper setting controls the max time that the ZK servers will grant - 
the accumulo setting is how much time accumulo will ask for.
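A sketch with illustrative values only - on the ZooKeeper side in zoo.cfg:

  maxSessionTimeout=120000

and on the accumulo side (accumulo-site.xml in 1.10):

  <property>
    <name>instance.zookeeper.timeout</name>
    <value>90s</value>
  </property>

The accumulo request only takes effect if it falls within the min/max session 
timeouts the ZooKeeper servers allow.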




From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 9, 2022 3:03 PM
To: user@accumulo.apache.org 
Subject: Re: accumulo 1.10.0 masters won't start

thanks for the response,

no, I have not updated any timeout

does that go in zoo.cfg? I can see there is min/maxSessionTimeout 2/20 - is 
that what you are referring to?

-S


Re: accumulo 1.10.0 masters won't start

2022-02-09 Thread dev1
Have you tried to increase the zoo session timeout value? I think it's 
zookeeper.session.timeout.ms


From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 9, 2022 2:47 PM
To: user@accumulo.apache.org 
Subject: Re: accumulo 1.10.0 masters won't start


Thanks



That fixed the goal state issue but I am still getting



Errors with zookeeper

e.g.



KeeperErrorCode = ConnectionLoss for


/accumulo//config/tserver.hold.time.max

/accumulo//tables

/accumulo//tables/1/name

/accumulo//fate

/accumulo//masters/goal_state



So it is all over the place … for some I see good values in zookeeper … so I am not sure..



-S




Re: accumulo 1.10.0 masters won't start

2022-02-09 Thread dev1
There is a utility - SetGoalState - that can be run from the command line:

accumulo SetGoalState NORMAL

(or SAFE_MODE, CLEAN_STOP)

It sets a value in ZK at /accumulo/instance-id/managers/goal_state

Ed Coleman


From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 9, 2022 1:54 PM
To: user@accumulo.apache.org 
Subject: Re: accumulo 1.10.0 masters won't start

Well,

i just went ahead and deleted fate in zookeeper and restarted the master.. it 
was doing better, but then i am getting a different error

ERROR: Problem getting real goal state from zookeeper: 
java.lang.IllegalArgumentException: No enum constant 
org.apache.accumulo.core.master.thrift.MasterGoalState

I hope i didn't delete goal_state accidentally ...;-( currently ls on goal_state 
is [], is there a way to add some value there?

-S



Re: accumulo 1.10.0 masters won't start

2022-02-09 Thread dev1
Did you try setting the increased size for the zkCli.sh command (or wherever it 
gets its environment from)?

The ZK docs indicate that it needs to be set to the same size on all servers 
and clients.

You should be able to use zkCli.sh to at least see what's going on - if that 
does not work, then it seems unlikely that the master would either.

Can you:

  *   list the nodes under /accumulo/[instance id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list nodes under any of the /accumulo/[instance id]/fate/tx-#
  *   there should be a node named debug - doing a get on that should show the 
op name. (A sample session is sketched below.)
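A rough example of that inspection from a zkCli.sh session - the instance id 
and transaction id here are made up:

  ls /accumulo/[instance id]/fate
  stat /accumulo/[instance id]/fate/tx-0123456789abcdef
  ls /accumulo/[instance id]/fate/tx-0123456789abcdef
  get /accumulo/[instance id]/fate/tx-0123456789abcdef/debug

In the stat output, dataLength is the node size that counts against 
jute.maxbuffer.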

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org 
Subject: Re: accumulo 1.10.0 masters won't start


Thanks



I added



-Djute.maxbuffer=3000



In conf/java.env, and restarted all zookeepers, but still getting the same error.. 
documentation is kind of fuzzy on setting this property as it states in hex 
(default 0x) so not 100% sure if 3000 is ok, but at least I could see 
zookeeper was up



-S




Re: accumulo 1.10.0 masters won't start

2022-02-09 Thread dev1
Does the monitor or any of the logs show errors that relate to exceeding the 
ZooKeeper jute buffer size?

If so, have you tried increasing the ZooKeeper jute.maxbuffer limit 
(https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?
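For the ZooKeeper server side, a sketch - assuming the servers pick up 
conf/java.env (which zkEnv.sh sources if present); the 4MB value is 
illustrative:

  # conf/java.env on every ZooKeeper server
  export SERVER_JVMFLAGS="-Djute.maxbuffer=4194304"

The clients (accumulo processes, zkCli.sh) need a matching -Djute.maxbuffer or 
they will still reject the large packets.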

Ed Coleman



From: Ligade, Shailesh [USA] 
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org 
Subject: accumulo 1.10.0 masters won't start

Hello,

Both of my masters are stuck with errors on zookeeper:

IOException: Packet len 2791093 is out of range!
KeeperErrorCode = ConnectionLoss for /accumulo//fate


if I use zkCli to see what is under fate, I get

IOException Packet len 2791161 is out of range
Unable to read additional data from server sessionid , likely server has 
closed socket

hdfs fsck is all good

How can I clear this fate?

master process is up and I can get into accumulo shell, but there are no fate 
(fate print returns empty)

Any idea how to bring the master up?

Thanks

S


RE: tablets per tablet server for accumulo 1.10.0

2022-02-04 Thread dev1
Roughly (I don’t have the exact command syntax on hand), I make a file that is 
then executed by passing it to the shell command. To build the command file:

Use the getSplits command with the number of batches that I want – that can 
roughly be calculated as # current tablets / (# tservers * # compaction 
slots * comfort factor). You can specify an output file or tee the command 
output, something like


  *   getsplits -t tablename -n 20 -o /tmp/my_splits.txt


This would give you the splits for 20 rounds. Using those splits the compact 
command file then looks like:

compact -w -t tablename -e [first split]
compact -w -t tablename -b [first split] -e [second split]
…
compact -w -t tablename -b [last split]

To do a merge, interleave the merge commands:

compact -w -t tablename -e [first split]
merge -w -t tablename --size 5G -e [first split]
compact -w -t tablename -b [first split] -e [second split]
merge -w -t tablename --size 5G -b [first split] -e [second split]

Then just issue the shell command with (login info) -e filename. (I don’t 
recall if the switch to pass a file is -e, -f,…?)
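If memory serves, the shell has both -e (execute a command string) and -f 
(execute a file of commands) - worth confirming with accumulo shell --help on 
your version. A sketch with placeholder credentials and path:

  accumulo shell -u root -p supersecret -f /tmp/compact_merge_rounds.txt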

The -w switch pauses each round so that it completes before moving to the next.

The comfort factor is some multiple to increase the number of tablets in each 
round.  This will over-subscribe the compaction slots – but usually some 
compactions are quick for small tablets and the over-subscription quickly 
drops. It is a balancing act: you want fewer rounds, but you also want to limit 
the over-subscription period.

You may want to increase the # of compaction slots available – depending on 
your hardware and load – I think the default is 3, 6 is not unreasonable.

Using the compact / merge command with just an end (for the first round) and 
just a beginning (for the last round) ensures that all splits are covered – 
don’t mix them up – or you will compact everything.

A few tablets can take much longer if the row ids are not evenly distributed – 
the time that each round takes will be the time of the longest compaction. With 
larger but fewer rounds you increase the chance that more of the long poles 
will land in the same round and run in parallel, shortening the total time 
needed to complete – doing it in rounds does take longer overall, because each 
round may have a long pole that is essentially compacted serially.

Ed Coleman


From: Ligade, Shailesh [USA] 
Sent: Friday, February 4, 2022 8:28 AM
To: 'user@accumulo.apache.org' 
Subject: Re: tablets per tablet server for accumulo 1.10.0

Thank you,

Will a range compaction (compact -t <> --begin-row <> --end-row <>) be faster than 
just compact -t <>? My worry is, if I somehow issue 72k compact commands at 
once, it will kill the system?
On that note, what is the best way to issue these compact commands, especially 
because there are so many of them? I saw accumulo shell -u <> -p <> -e 'compact 
...,compact...,compact,' will work, I just don't know how many I can tack onto 
one shell command.. is there a better way of doing all this? I mean I want to be 
as gentle to my production system and yet as fast as possible.. don't want to 
spend days doing compact/merge 

Thanks

-S


RE: Zero size tablets in accumulo.metadata

2022-02-03 Thread dev1
What version of Accumulo?

From: Hart, Andrew 
Sent: Thursday, February 3, 2022 2:29 AM
To: user@accumulo.apache.org
Subject: Zero size tablets in accumulo.metadata

Hi,

I have about twenty accumulo.metadata tablets and they all have zero size 
except one with most of the data and one tablet with just a few entries - both 
annoyingly sitting on the same tserver.
I tried the obvious merge command (which I won't repeat to avoid spreading bad 
answers) but it throws an exception saying it can't merge metadata.

What is the correct way to merge the metadata and clean up the zero sized 
tablets mess?

And




RE: tablets per tablet server for accumulo 1.10.0

2022-02-01 Thread dev1
Before.  That has the benefit that file sizes are reduced (if data is eligible 
for age off) and the merge is operating on current file sizes.

From: Ligade, Shailesh [USA] 
Sent: Tuesday, February 1, 2022 7:49 AM
To: 'user@accumulo.apache.org' 
Subject: Re: tablets per tablet server for accumulo 1.10.0

Thank you for the explanation!

Once I ran getsplits it was clear that splits were the culprit, so I need to do 
a merge as well as bump the threshold to a higher number, as you have suggested.

If I have to perform a major compaction, should i do it before merge or after 
merge?

Thanks again,

-S




RE: tablets per tablet server for accumulo 1.10.0

2022-01-31 Thread dev1
You can get the hdfs size using standard hdfs commands - count or ls.  As long 
as you have not cloned the table, the size of the hdfs files and the space 
occupied by the table are equivalent.
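For example, a sketch assuming the default /accumulo/tables layout - substitute 
your volume and table id (the shell's tables -l shows the ids):

  hdfs dfs -du -s -h /accumulo/tables/3

This reports the space used under that table's directory.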

You can also get a sense of the referenced files by examining the metadata table - 
the column qualifier file: will give you just the referenced files. Looking at 
the directories, b-xxx are from a bulk import and t-xxx directories 
are assigned to the tablets.  Also, bulk import file names start with I-xx, 
files from compactions will be A-xx if from a full compaction or C-xxx 
from a partial major compaction, and F-xx is the result of a flush (minor 
compaction). You can look at the entries for the files - the numbers in the 
value are the number of entries and the file size. (A sample metadata scan is 
sketched below.)
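A sketch of such a metadata scan from the shell - the table id 3 is 
hypothetical, and the ; and < row-suffix characters may need escaping depending 
on how you invoke it:

  scan -t accumulo.metadata -b 3; -e 3< -c file

Each result is one file reference, with the value holding the two numbers 
described above.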

How do you ingest? Bulk or continuous?  On a bulk ingest, the imported files 
end up in /accumulo/table/x/b-x and then are assigned to tablets - the 
directories for the tablets will be created, but will be "empty" until a 
compaction occurs.  A compaction will copy from the files referenced by the 
tablets into a new file that will be placed into the corresponding 
/accumulo/table/x/t-xx directory.  When a bulk imported file is no longer 
referenced by any tablet, it will get garbage collected; until then the file 
will exist and inflate the actual space used by the table. The compaction will 
also remove any data that is past the TTL for the records.

Do you ever run a compaction?  With a very large number of tablets, you may 
want to run the compaction in parts so that you don't end up occupying all of 
the compaction slots for a long time.

Are you using keys (row ids) that are always increasing? A typical example 
would be a date.  Say some of your row ids are -mm-dd-hh and there is a 10 
day TTL.  What will happen is that new data will continue to create new 
tablets and on compaction the old tablets will age off and have 0 size.  You 
can remove the "unused splits" by running a merge.  Anything that creates new 
row ids that are ordered can do this - new splits are necessary and the 
old splits eventually become unnecessary; if the row ids are distributed across 
the splits it will not do this. It is not necessarily a problem if this is what 
your data looks like, just something that you may want to manage with merges.

There is usually not much benefit in having a large number of tablets for a 
single table on a server.  You can reduce the number of tablets required by 
setting the split threshold to a larger number and then running a merge.  This 
can be done in sections, and you should run a compaction on each section first.

If you have recently compacted, you can figure out the rough number of tablets 
necessary by taking hdfs size / split threshold = number of tablets.   If you 
increase the split threshold size you will need fewer tablets.  You may also 
consider setting a split threshold that is larger than your target - say you 
decided that 5G was a good target; setting the threshold to 8G during the 
merge and then setting it to 5G when completed will cause the table to split - 
and it could give you a better distribution of data in the splits.
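A rough worked example with invented numbers: a table holding 4 TB in hdfs 
needs about 4096G / 1G = ~4096 tablets at a 1G threshold, but only 
4096G / 5G = ~820 tablets at a 5G threshold - roughly a 5x reduction in tablet 
count, traded for larger files per compaction.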

This can be done while things are running, but it will be a heavy IO load 
(files and on the hdfs namenode) and can take a very long time. What can be 
useful is to use the getSplits command with the number-of-splits option and 
create a script that compacts, then merges a section - using the splits as 
start / end rows for the compaction and merge commands.

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Monday, January 31, 2022 11:16 AM
To: user@accumulo.apache.org
Subject: tablets per tablet server for accumulo 1.10.0

Hello,

table.split.threshold is set to the default 1G (except for metadata and root - 
which are set to 64M)
What can cause the tablets per tablet server count to go high? Within a week, that 
count jumped from 5k/tablet server to 23k/tablet server, even though the total size 
in hdfs has not changed.
Is a high count a cause for concern?
We didn't apply any splits. I did a dumpConfig and checked all my tables and 
didn't see splits either.

Is there a way to find tablet size in hdfs? When I look at hdfs 
/accumulo/table/x/ I see some empty folders, meaning not all folders have rf 
files. Is that normal?

Thanks in advance!

-S


Re: uassigned tablets issue

2022-01-26 Thread dev1
Unassigned tablets for a short duration is not unusual.  Tablets can become 
"unassigned" for a number of reasons during normal operations, including, but 
not limited to:

  *   tserver failure - all tablets assigned to the newly deceased tserver will 
be reassigned.
  *   balancer migrations - the balancer will try to even out tablet 
assignments and will select a number of tablets from "busy" servers to servers 
that look like they have spare capacity - the exact behavior will be decided by 
the balancer that you have configured.
  *   new table / add splits.  If you create a table and then add splits, the 
splits first land on the tserver where the table is first assigned - the 
balancer will then migrate these over time.  For a faster rebalance, on the 1.x 
line, you can create the table, add splits, offline the table, then online the 
table - this will force the splits to be assigned throughout the cluster without 
needing to wait for the balancer - this can help distribute ingest if you are 
going to start ingest immediately after the table creation (see the sketch after 
this list).  There is an option to create and add splits to an offline table, 
but I'm not sure that is available in 1.10.
  *   tablet(s) exceed the split threshold - during ingest, either continuous 
(batch writes) or bulk imports, if a tablet size exceeds the table split 
threshold, the tablet will split into two or more tablets - the balancer may 
then see this as unbalanced and begin a migration.
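A minimal sketch of that create / add splits / offline / online sequence - the 
table name and split points are placeholders, and the -w (wait) flag may not 
exist on every 1.x shell:

  createtable mytable
  addsplits g m t -t mytable
  offline -t mytable -w
  online -t mytable -w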

The best way to determine what is occurring is the master's main debug log - 
you may want to filter out replication-related messages if that helps you find 
the relevant messages.



From: Shailesh Ligade 
Sent: Wednesday, January 26, 2022 11:52 AM
To: 'user@accumulo.apache.org' 
Subject: Re: uassigned tablets issue

Sorry for the late reply - from the log I can't tell if it is migrating tablets.. 
we have replication on so the logs are very chatty, but from whatever I saw it 
didn't look like it was migrating tablets.

Thanks

S


RE: uassigned tablets issue

2022-01-25 Thread dev1
Can you look at the master log?  Do you see migrations occurring?

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Tuesday, January 25, 2022 9:23 AM
To: user@accumulo.apache.org
Subject: uassigned tablets issue

Hello,

For some strange reason, the monitor master page (version 1.10.0) shows unassigned 
tablets going up and down - it typically goes up to 4 and back down to 0.
The master log shows the unassigned tablet count and waits for it to go to zero.

This just started happening, is this a cause for concern? I checked for 
tabletserverlocks, checkformetadataproblems, and other usual debugging 
commands, everything looks good.

-S


RE: issue with OS user using shell

2022-01-21 Thread dev1
Did you try to pass a log4j configuration using the -D option on the command line?
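A sketch of the idea - the variable name and file are assumptions; adjust to 
however your accumulo-env.sh feeds JVM options to the shell:

  ACCUMULO_GENERAL_OPTS="-Dlog4j.configuration=file:///home/someuser/shell-log4j.properties" \
    accumulo shell -u someuser

where shell-log4j.properties is a hypothetical config that logs only to the 
console, so the shell never opens files in the server log directory.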

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Friday, January 21, 2022 8:19 AM
To: user@accumulo.apache.org
Subject: issue with OS user using shell

Hello,

When a regular OS user uses the shell command, it looks like that OS user needs 
access to the accumulo log directory. Otherwise it throws a FileNotFound 
exception for the log files (tserver/monitor/master/gc/tracer - all the log 
files).

If I use the OS root user or the OS accumulo user (the user who runs accumulo), 
there is no issue with the shell command.

I obviously don't want to open the log directory to all OS users or turn off 
shell logging. Is there a way around it? I didn't see a shell command option for 
accepting a log4j.conf so that just for the shell I can use a different file..


Thanks

-S


RE: replication table offline issue

2022-01-04 Thread dev1
Deleting / recreating the replication table should not be necessary and in any 
case you very likely cannot delete / create the accumulo.replication table - 
the shell will error on the delete because it is in the accumulo namespace.

Is the replication table hosted on a single tserver?  Are there any exceptions 
in the log for that server (or for any of the tservers that host it, if it is 
hosted across multiple tservers)?

Have you restarted the client? It looks like the exception fragment has client 
in the classname. In what log is that exception occurring?

You can try restarting the master(s)

The monitor shows the replication table is online? Can you check in ZooKeeper 
(using the zkCli.sh)

  *   get /accumulo/[instance id]/tables/+rep/state



That should return the text ONLINE

If the replication table is on a single tserver, then you might be able to just 
restart that server rather than needing to do a rolling restart of the cluster. 
If there are no errors in the tserver log, this seems unlikely to help.

Ed Coleman

From: Ligade, Shailesh [USA] 
Sent: Tuesday, January 4, 2022 12:24 PM
To: user@accumulo.apache.org
Subject: Re: replication table offline issue

Sorry this is for accumulo 1.10.0

I am wondering, is there a way to delete and recreate the accumulo.replication 
table? I know it is a bit of a special table.. so

Also, will restarting the entire cluster solve this? Or maybe just restarting 
the accumulo master?

Since a rolling restart of the tservers is a bit of a lengthy process for us, I 
just wanted to check whether it is likely to resolve this or not..

-S

From: Ligade, Shailesh [USA]
Sent: Tuesday, January 4, 2022 11:27 AM
To: user@accumulo.apache.org 
mailto:user@accumulo.apache.org>>
Subject: replication table offline issue

Hello,

I set up replication and ran 'online accumulo.replication', however in the 
master I keep getting an error stating accumulo.replication is offline. I can 
scan the accumulo.replication table; it has no data at all.
the error is:
-

WARN Failed to write work mutations for replication, will retry
client.MutationRejectedException: # constraint violations: 0 security codes: {} 
# server errors 0 # exceptions 6
at xxxclient.impl.TabletServerBatchWriter.checkForFailures

caused by TableOfflineException: Table accumulo.replication (+rep) is offline
---

There are no constraints that I am using on any table.
I added grants for root as well as my replication user for accumulo.replication 
Table.WRITE (there was only Table.READ before)
if I run offline accumulo.replication I can see it is offline, and then I can 
bring it online again; however I still keep getting the error

Any suggestion on how to fix this?

Thanks

-S




RE: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

2021-12-27 Thread dev1
Can you specify the messages?  It may be that replication is working as 
designed.  The current replication is based on the WALs – it would seem normal, 
if the WAL is closed when the tserver stops, that it would then trigger 
replication, so it might just be expected activity.  The messages might look 
scary – unexpected file closed, improperly closed file,… which would be more 
of a concern if they were happening in a stable system (and if not associated 
with something like a tserver dying for reasons).

Do I need to turn off replication while I am rolling restart?

First, are you detecting errors / missing data in the replication destination? 
If not, then you might just want to leave it alone.

If you wanted to stop replication, you may need to stop ingest and then take 
steps so that data that is pending for replication is also sent before 
proceeding. I do not know if replication flushes changes when it is stopped, or 
if it would pick back up where it left off on the restart.  If it does not, then 
any data that was “pending replication” could be lost.

Ed

From: Ligade, Shailesh [USA] 
Sent: Monday, December 27, 2021 8:45 AM
To: user@accumulo.apache.org
Subject: RE: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling 
restart

Thanks,

Just a quick question. The steps identified worked.. however I noticed that if 
replication is turned on, and I set table.suspend.duration=5m and stop and 
reboot a tserver, I do get a lot of replication messages in the master log. Since 
ingest is turned off, I thought I would not see much replication. Do I need to 
turn off replication while I am doing a rolling restart? Will it have any 
adverse effects?

-S

From: Mike Miller mailto:mmil...@apache.org>>
Sent: Thursday, December 2, 2021 10:39 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: Re: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling 
restart

Some things to keep in mind... The Master will wait the table.suspend.duration 
before reassigning the SUSPENDED tablets to new tservers. With 
table.suspend.duration set > 0, a tablet will go from HOSTED to SUSPENDED if 
its tserver is shut down. It will then stay SUSPENDED until its old tserver is 
available or table.suspend.duration has passed. If table.suspend.duration has 
passed before its tserver has returned, it will then be UNASSIGNED. Once a 
tablet is UNASSIGNED it won't enter the SUSPENDED state.

On Thu, Dec 2, 2021 at 9:43 AM Ligade, Shailesh [USA] 
mailto:ligade_shail...@bah.com>> wrote:
Thanks Mike,

If I set the value to 0s (default) or to 5m, when I restart a tserver 
(it is pretty quick, on the order of seconds), I still get unassigned 
tablets on the monitor page. My understanding is that with a setting of 5m (or 
200s etc.), the master will wait that much time before starting to move 
unassigned tablets. In my situation, the unassigned tablet count goes back to 
zero only after a long time, and hence rolling restarts take a lot longer (hours 
in most cases – depending on how many tablets/tserver).

This setting appears to be working on accumulo 2.0.1, but since that is not my 
prod version I have not tested it completely.

Thanks

-S
From: Mike Miller mailto:mmil...@apache.org>>
Sent: Thursday, December 2, 2021 9:38 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

When you say "since that setting (table.suspend.duration) is not working for me 
in accumulo 1.10.0" do you mean that the feature is not helping to solve your 
problem? Or that the feature is not working and there could be a bug?

On Thu, Dec 2, 2021 at 8:00 AM Ligade, Shailesh [USA] 
mailto:ligade_shail...@bah.com>> wrote:
Thanks for detail steps! Really appreciated.

Just curious, since that setting (table.suspend.duration) is not working for me 
in Accumulo 1.10.0, can I just stop both the masters and then restart the tservers 
one at a time (or all at once)? Will that speed up the restart without getting 
into this offline-tablet situation and/or a data-loss type situation? I can stop 
the ingest, flush the tables and then bring down the master…

We can take a short downtime, and my understanding is that the master is the one 
keeping track of tservers and the offline-tablet situation. So just curious…

Thanks again

-S

From: dev1 mailto:d...@etcoleman.com>>
Sent: Monday, November 29, 2021 2:56 PM
To: 'user@accumulo.apache.org<mailto:user@accumulo.apache.org>' 
mailto:user@accumulo.apache.org>>
Subject: [External] RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

I believe the property is table.suspend.duration (not tablet.suspended.duration 
as you have in this email) – but the shell should have thrown an error saying 
the property cannot be set in zookeeper if you had it wrong.

What do you mean by:

but when I issued the tserver restarts (one at a time, without waiting for the 
first to come up)

RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

2021-11-30 Thread dev1
All of this is hard to quantify, because it really depends on your usage, but 
in general.

The larger the number of tablets per server, the more “work” the tserver and the 
rest of the system need to do to keep track of everything.  It seems to be very 
rare that hosting a large number of splits for a single table on every tserver 
improves performance – there may be some cases where it might, but unless you can 
specifically measure that it is really helping in your specific usage, you may 
want to try to bring the count down.

If you adjust the split threshold so that you have larger tablets with more 
entries you would need fewer splits (you may need to merge to combine tablets). 
The indexing used in the RFiles is efficient in being able to quickly skip to 
the relevant data on scans and this really minimizes the impact on scans for 
larger files. The largest drawback is that compactions will take longer to read 
/ write the larger files.

Are you using the default split size of 3G?  Even setting that to 6G would 
reduce the tablet count by 50%, and larger, say 9G, should still be feasible.

If you try this, one strategy for merging would be to set a size larger than 
your target, do the merge and then when that is complete, set the threshold to 
your target and allow the data to pick new split points that should be 
relatively balanced across the tablets.  The merge will take a long time and 
can hammer the namenode – so you might want to consider doing it in stages.
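A sketch of that staged approach in the shell, assuming a table named mytable and 
a 9G target (the table name, threshold values, and range endpoints are all 
placeholders):

  config -t mytable -s table.split.threshold=12G
  merge -t mytable -b row_a -e row_b
  config -t mytable -s table.split.threshold=9G

The 12G value is deliberately above the final target so the merged tablets do not 
immediately re-split; once the merges complete, the 9G setting lets the table 
pick new, relatively balanced split points.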

From: Shailesh Ligade 
Sent: Tuesday, November 30, 2021 9:37 AM
To: user@accumulo.apache.org
Subject: RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

There are not that many tables, just a large number of splits. There are 25B 
entries, but each entry is large.
Is there an optimal tserver memory/heap usage to number-of-tablets 
relationship? I saw some references like 
https://www.oreilly.com/library/view/accumulo/9781491947098/ch10.html that 
state that you should keep 1k tablets per server, but I think that is 
overkill in our situation. Each tserver is quite large: 16 cores, 128GB.

On the table.suspend.duration setting: once I update that setting, do I need 
to restart the master? After updating the setting, the master log had the old 
value (0s), but if I restart the master it shows the correct value. In my testing 
it didn’t make any difference, but I am just curious.

-S

From: dev1 mailto:d...@etcoleman.com>>
Sent: Tuesday, November 30, 2021 9:17 AM
To: 'user@accumulo.apache.org' 
mailto:user@accumulo.apache.org>>
Subject: RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

One thing that you might be able to optimize is the number of tablets per 
server – you stated that you have “roughly 4k+ tablets per tserver”

Is that driven by the number of tables, or do you have lots of splits for a 
much smaller number of tables?

From: Shailesh Ligade mailto:slig...@fbi.gov>>
Sent: Monday, November 29, 2021 11:17 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart


Uhmm updated the setting tablet.suspended.duration to 5m

config -s tablet.suspended.duration=5m

but when I issued the tserver restarts (one at a time, without waiting for the 
first to come up), I still get all tablets unassigned.  Maybe I need to bring the 
masters down first?

btw this is for Accumulo 1.10.0

am I missing anything?

-S

From: Shailesh Ligade mailto:slig...@fbi.gov>>
Sent: Monday, November 29, 2021 10:35 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org> 
mailto:user@accumulo.apache.org>>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Thanks Michael,

Stop the cluster using admin stop? The issue is that, since we are using systemd 
with restart=always, it interferes with any of those stop (stop-all, stop-here, 
etc.) commands/scripts. So either we have to modify the systemd settings or maybe 
just do a shutdown-the-VM type of operation (I think that is a little brutal).

-S

From: Michael Wall mailto:mjw...@gmail.com>>
Sent: Monday, November 29, 2021 9:54 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org> 
mailto:user@accumulo.apache.org>>
Subject: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Is there a reason to not just stop the cluster, reset the heap and restart the 
cluster?  That is simpler.

On Mon, Nov 29, 2021 at 9:37 AM dev1 
mailto:d...@etcoleman.com>> wrote:

Yes – and don’t forget to reset it back when you are done.



From: Ligade, Shailesh [USA] 
mailto:ligade_shail...@bah.com>>
Sent: Monday, November 29, 2021 9:36 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: RE: accumulo tserver rolling restart



Thanks,



I am assuming I can set that property using shell and it will take effect 
immediately?



Thanks



-S



From: dev1 mailto:d...@etcoleman.com>&g

RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

2021-11-30 Thread dev1
If the shell command config shows the correct value you should be okay.

From: Shailesh Ligade 
Sent: Tuesday, November 30, 2021 9:37 AM
To: user@accumulo.apache.org
Subject: RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

There are not that many tables, just a large number of splits. There are 25B 
entries, but each entry is large.
Is there an optimal tserver memory/heap usage to number-of-tablets 
relationship? I saw some references like 
https://www.oreilly.com/library/view/accumulo/9781491947098/ch10.html that 
state that you should keep 1k tablets per server, but I think that is 
overkill in our situation. Each tserver is quite large: 16 cores, 128GB.

On the table.suspend.duration setting: once I update that setting, do I need 
to restart the master? After updating the setting, the master log had the old 
value (0s), but if I restart the master it shows the correct value. In my testing 
it didn’t make any difference, but I am just curious.

-S

From: dev1 mailto:d...@etcoleman.com>>
Sent: Tuesday, November 30, 2021 9:17 AM
To: 'user@accumulo.apache.org' 
mailto:user@accumulo.apache.org>>
Subject: RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

One thing that you might be able to optimize is the number of tablets per 
server – you stated that you have “roughly 4k+ tablets per tserver”

Is that driven by the number of tables, or do you have lots of splits for a 
much smaller number of tables?

From: Shailesh Ligade mailto:slig...@fbi.gov>>
Sent: Monday, November 29, 2021 11:17 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart


Uhmm updated the setting tablet.suspended.duration to 5m

config -s tablet.suspended.duration=5m

but when I issued the tserver restarts (one at a time, without waiting for the 
first to come up), I still get all tablets unassigned.  Maybe I need to bring the 
masters down first?

btw this is for Accumulo 1.10.0

am I missing anything?

-S

From: Shailesh Ligade mailto:slig...@fbi.gov>>
Sent: Monday, November 29, 2021 10:35 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org> 
mailto:user@accumulo.apache.org>>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Thanks Michael,

Stop the cluster using admin stop? The issue is that, since we are using systemd 
with restart=always, it interferes with any of those stop (stop-all, stop-here, 
etc.) commands/scripts. So either we have to modify the systemd settings or maybe 
just do a shutdown-the-VM type of operation (I think that is a little brutal).

-S

From: Michael Wall mailto:mjw...@gmail.com>>
Sent: Monday, November 29, 2021 9:54 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org> 
mailto:user@accumulo.apache.org>>
Subject: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Is there a reason to not just stop the cluster, reset the heap and restart the 
cluster?  That is simpler.

On Mon, Nov 29, 2021 at 9:37 AM dev1 
mailto:d...@etcoleman.com>> wrote:

Yes – and don’t forget to reset it back when you are done.



From: Ligade, Shailesh [USA] 
mailto:ligade_shail...@bah.com>>
Sent: Monday, November 29, 2021 9:36 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: RE: accumulo tserver rolling restart



Thanks,



I am assuming I can set that property using shell and it will take effect 
immediately?



Thanks



-S



From: dev1 mailto:d...@etcoleman.com>>
Sent: Monday, November 29, 2021 9:25 AM
To: 'user@accumulo.apache.org<mailto:user@accumulo.apache.org>' 
mailto:user@accumulo.apache.org>>
Subject: [External] RE: accumulo tserver rolling restart



See 
https://accumulo.apache.org/1.10/accumulo_user_manual.html#_restarting_process_on_a_node
 – A note on rolling restarts.



There is a property that can be set (table.suspend.duration) that will delay the 
reassignment while a tserver is restarting – the trade-off is that the data is 
not available, so try to minimize the time the tserver is off-line.



From: Ligade, Shailesh [USA] 
mailto:ligade_shail...@bah.com>>
Sent: Monday, November 29, 2021 9:19 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: accumulo tserver rolling restart



Hello,



I want to restart all the tservers, say I updated t

RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

2021-11-30 Thread dev1
One thing that you might be able to optimize is the number of tablets per 
server – you stated that you have “roughly 4k+ tablets per tserver”

Is that driven by the number of tables, or do you have lots of splits for a 
much smaller number of tables?

From: Shailesh Ligade 
Sent: Monday, November 29, 2021 11:17 AM
To: user@accumulo.apache.org
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart


Uhmm updated the setting tablet.suspended.duration to 5m

config -s tablet.suspended.duration=5m

but when I issued the tserver restarts (one at a time, without waiting for the 
first to come up), I still get all tablets unassigned.  Maybe I need to bring the 
masters down first?

btw this is for Accumulo 1.10.0

am I missing anything?

-S

From: Shailesh Ligade mailto:slig...@fbi.gov>>
Sent: Monday, November 29, 2021 10:35 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org> 
mailto:user@accumulo.apache.org>>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Thanks Michael,

Stop the cluster using admin stop? The issue is that, since we are using systemd 
with restart=always, it interferes with any of those stop (stop-all, stop-here, 
etc.) commands/scripts. So either we have to modify the systemd settings or maybe 
just do a shutdown-the-VM type of operation (I think that is a little brutal).

-S

From: Michael Wall mailto:mjw...@gmail.com>>
Sent: Monday, November 29, 2021 9:54 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org> 
mailto:user@accumulo.apache.org>>
Subject: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Is there a reason to not just stop the cluster, reset the heap and restart the 
cluster?  That is simpler.

On Mon, Nov 29, 2021 at 9:37 AM dev1 
mailto:d...@etcoleman.com>> wrote:

Yes – and don’t forget to reset it back when you are done.



From: Ligade, Shailesh [USA] 
mailto:ligade_shail...@bah.com>>
Sent: Monday, November 29, 2021 9:36 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: RE: accumulo tserver rolling restart



Thanks,



I am assuming I can set that property using shell and it will take effect 
immediately?



Thanks



-S



From: dev1 mailto:d...@etcoleman.com>>
Sent: Monday, November 29, 2021 9:25 AM
To: 'user@accumulo.apache.org<mailto:user@accumulo.apache.org>' 
mailto:user@accumulo.apache.org>>
Subject: [External] RE: accumulo tserver rolling restart



See 
https://accumulo.apache.org/1.10/accumulo_user_manual.html#_restarting_process_on_a_node
 – A note on rolling restarts.



There is a property that can be set (table.suspend.duration) that will delay the 
reassignment while a tserver is restarting – the trade-off is that the data is 
not available, so try to minimize the time the tserver is off-line.



From: Ligade, Shailesh [USA] 
mailto:ligade_shail...@bah.com>>
Sent: Monday, November 29, 2021 9:19 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: accumulo tserver rolling restart



Hello,



I want to restart all the tservers, say after I updated the tserver heap size. 
Since we are using systemd, I can issue a restart command on a tserver. This 
causes all sorts of tablet movements even though Accumulo is down for maybe a 
second. If I wait for all unassigned tablets to become 0 before restarting the 
next tserver, then completely restarting a small cluster (6-8 nodes) takes hours 
(roughly 4k+ tablets per tserver).



What is the right way to perform such a routine maintenance operation? Is there a 
delay setting we can change so that it will not move tablets around? What would 
be a safe delay value?



-S


RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

2021-11-29 Thread dev1
I believe the property is table.suspend.duration (not tablet.suspended.duration 
as you have in this email) – but the shell should have thrown an error saying 
the property cannot be set in zookeeper if you had it wrong.

What do you mean by:

but when I issued the tserver restarts (one at a time, without waiting for the 
first to come up)

I’m assuming the requirement is to keep the cluster up and serving users 
without major disruption – not to rip through the restart as fast as possible.  
With 6 – 8 nodes you should still be able to do this in under an hour.  If you 
had a much larger cluster then the concept is the same but you would want to 
use some number of tservers that is a fraction of the total available that 
would be cycled at any given point in time.

In general the way that I would do a conservative, rolling restart:


  1.  [optional] pause ingest – or be prepared for recovering any failed 
ingests if they occur.
  2.  [optional] Flush tables that have continuous ingest using the wait option 
– this should help minimize recovery.
  3.  Set the table.suspend.duration
  4.  For each tserver – one (or a small group for large cluster) at a time
 *   Stop the tserver
 *   Pause long enough that ZooKeeper recognizes the lost connection
 *   Restart the tserver
 *   Pause to allow for any recovery
  5.  Reset the table.suspend.duration back to 0s (the default)

If you tail the master / manager debug log you should get a good idea of what 
is going on – there should be messages showing the tserver leaving and then 
rejoining, and any other activity related to recovery.  With a rolling restart 
the idea is to keep the cluster up and serving tables – only one (or a few) 
tservers go offline, and for a short duration (generally less than a minute), and 
between each tserver restart, time is allowed for things to stabilize.
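A minimal sketch of steps 3-5 as a script, assuming a systemd unit named 
accumulo-tserver and a host list in conf/tservers (the unit name, file path, and 
sleep durations are all assumptions to tune for your cluster):

  accumulo shell -u root -e 'config -s table.suspend.duration=5m'
  while read host; do
    ssh "$host" 'sudo systemctl stop accumulo-tserver'
    sleep 30   # give ZooKeeper time to notice the lost lock
    ssh "$host" 'sudo systemctl start accumulo-tserver'
    sleep 60   # let recovery and migrations settle before the next host
  done < conf/tservers
  accumulo shell -u root -e 'config -s table.suspend.duration=0s'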


From: Shailesh Ligade 
Sent: Monday, November 29, 2021 11:17 AM
To: user@accumulo.apache.org
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart


Uhmm updated the setting tablet.suspended.duration to 5m

config -s tablet.suspended.duration=5m

but when I issued the tserver restarts (one at a time, without waiting for the 
first to come up), I still get all tablets unassigned.  Maybe I need to bring the 
masters down first?

btw this is for Accumulo 1.10.0

am I missing anything?

-S

From: Shailesh Ligade mailto:slig...@fbi.gov>>
Sent: Monday, November 29, 2021 10:35 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org> 
mailto:user@accumulo.apache.org>>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Thanks Michael,

Stop the cluster using admin stop? The issue is that, since we are using systemd 
with restart=always, it interferes with any of those stop (stop-all, stop-here, 
etc.) commands/scripts. So either we have to modify the systemd settings or maybe 
just do a shutdown-the-VM type of operation (I think that is a little brutal).

-S

From: Michael Wall mailto:mjw...@gmail.com>>
Sent: Monday, November 29, 2021 9:54 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org> 
mailto:user@accumulo.apache.org>>
Subject: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Is there a reason to not just stop the cluster, reset the heap and restart the 
cluster?  That is simpler.

On Mon, Nov 29, 2021 at 9:37 AM dev1 
mailto:d...@etcoleman.com>> wrote:

Yes – and don’t forget to reset it back when you are done.



From: Ligade, Shailesh [USA] 
mailto:ligade_shail...@bah.com>>
Sent: Monday, November 29, 2021 9:36 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: RE: accumulo tserver rolling restart



Thanks,



I am assuming I can set that property using shell and it will take effect 
immediately?



Thanks



-S



From: dev1 mailto:d...@etcoleman.com>>
Sent: Monday, November 29, 2021 9:25 AM
To: 'user@accumulo.apache.org<mailto:user@accumulo.apache.org>' 
mailto:user@accumulo.apache.org>>
Subject: [External] RE: accumulo tserver rolling restart



See 
https://accumulo.apache.org/1.10/accumulo_user_manual.html#_restarting_process_on_a_node
 – A note on rolling restarts.



There is a property that can be set (table.suspend.duration) that will delay the 
reassignment while a tserver is restarting – there is a trade-off on the data 
not being 

RE: accumulo tserver rolling restart

2021-11-29 Thread dev1
Yes - and don't forget to reset it back when you are done.

From: Ligade, Shailesh [USA] 
Sent: Monday, November 29, 2021 9:36 AM
To: user@accumulo.apache.org
Subject: RE: accumulo tserver rolling restart

Thanks,

I am assuming I can set that property using shell and it will take effect 
immediately?

Thanks

-S

From: dev1 mailto:d...@etcoleman.com>>
Sent: Monday, November 29, 2021 9:25 AM
To: 'user@accumulo.apache.org' 
mailto:user@accumulo.apache.org>>
Subject: [External] RE: accumulo tserver rolling restart

See 
https://accumulo.apache.org/1.10/accumulo_user_manual.html#_restarting_process_on_a_node
 - A note on rolling restarts.

There is a property that can be set (table.suspend.duration) that will delay the 
reassignment while a tserver is restarting - the trade-off is that the data is 
not available, so try to minimize the time the tserver is off-line.

From: Ligade, Shailesh [USA] 
mailto:ligade_shail...@bah.com>>
Sent: Monday, November 29, 2021 9:19 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: accumulo tserver rolling restart

Hello,

I want to restart all the tservers, say after I updated the tserver heap size. 
Since we are using systemd, I can issue a restart command on a tserver. This 
causes all sorts of tablet movements even though Accumulo is down for maybe a 
second. If I wait for all unassigned tablets to become 0 before restarting the 
next tserver, then completely restarting a small cluster (6-8 nodes) takes hours 
(roughly 4k+ tablets per tserver).

What is the right way to perform such a routine maintenance operation? Is there a 
delay setting we can change so that it will not move tablets around? What would 
be a safe delay value?

-S


RE: accumulo tserver rolling restart

2021-11-29 Thread dev1
See 
https://accumulo.apache.org/1.10/accumulo_user_manual.html#_restarting_process_on_a_node
 - A note on rolling restarts.

There is a property that can be set (table.suspend.duration) that will delay the 
reassignment while a tserver is restarting - the trade-off is that the data is 
not available, so try to minimize the time the tserver is off-line.
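For example, from the shell (the 5m value is just an illustration):

  config -s table.suspend.duration=5m
  config -f table.suspend.duration

The second command filters the configuration listing so you can confirm the value 
actually took - and remember to set it back to 0s when the restarts are done.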

From: Ligade, Shailesh [USA] 
Sent: Monday, November 29, 2021 9:19 AM
To: user@accumulo.apache.org
Subject: accumulo tserver rolling restart

Hello,

I want to restart all the tservers, say after I updated the tserver heap size. 
Since we are using systemd, I can issue a restart command on a tserver. This 
causes all sorts of tablet movements even though Accumulo is down for maybe a 
second. If I wait for all unassigned tablets to become 0 before restarting the 
next tserver, then completely restarting a small cluster (6-8 nodes) takes hours 
(roughly 4k+ tablets per tserver).

What is the right way to perform such a routine maintenance operation? Is there a 
delay setting we can change so that it will not move tablets around? What would 
be a safe delay value?

-S


RE: Triggering Table Balancer in Accumulo [SEC=UNOFFICIAL]

2021-11-22 Thread dev1
Just thinking about other ways that might work - have not tried any of this, so 
safe may be relative...

Sometimes it seems easier to give Accumulo what it wants rather than fighting 
it - an example would be when you have a "missing" file - you can add an 
"empty" file to serve as a placeholder and things will progress. With that as 
an analogy - what if you synthetically added data that corresponded to the 
splits that it is looking for?

If you added rows with a TTL that was expired - or very short - they should 
not be returned in queries, and once compacted should go away. If you use 
visibilities, you could pick a value that would be inaccessible to users.  If 
you can use visibilities you may want to use a TTL as well, to keep the entries 
around long enough to complete whatever you need to do to get the splits back to 
what you want.  That way the balancer would have the rows even if a compaction ran.

If the incorrectly named splits will sort to a range, then clean-up could be 
easier - or you can scan using the fake visibility and that should only return 
the synthetic rows - or just keep track of what you added.

With the "missing" splits added, then maybe the balancer will complete faster 
and settle down, you could then work to merge those splits away.  Merging is 
usually not a speedy operation - running a compaction before the merge can 
sometimes help.
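A sketch of the visibility variant in the shell, assuming a table with a missing 
split at row_x and a visibility label SYNTHETIC that ordinary users lack (the row, 
family/qualifier, and label are all placeholders, and the shell user must hold the 
SYNTHETIC authorization):

  insert row_x f q placeholder -l SYNTHETIC
  scan -s SYNTHETIC
  deletemany -s SYNTHETIC -f

The scan restricted to the SYNTHETIC authorization should return only the 
synthetic rows, and deletemany with the same authorization cleans them up once 
the balancer has settled.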

Ed Coleman

-Original Message-
From: McClure, Bruce MR 2  
Sent: Monday, November 22, 2021 6:15 PM
To: user@accumulo.apache.org
Subject: RE: Triggering Table Balancer in Accumulo [SEC=UNOFFICIAL]

UNOFFICIAL
Hi,

After looking at the master logs, I can see that the custom balancer is running 
every few minutes as you said, but reporting problems with some splits that do 
not conform to the expected naming scheme for the rows (non-existent row-id).  
I also see errors and warnings in the tserver logs "Failed to find midpoint 
using indexes, falling back to data files which is slower.  No entries between 
...",  which reference the same incorrectly named splits that the balancer is 
complaining about.

Attempts have been made to merge these incorrect and empty splits (which were 
created by human error) out of the system by merging a range on either side of 
the bad split.  However, this has taken a very long time (multiple hours) to run 
for a single range, and there are quite a number of them.

QUESTION:
Is there a safe, relatively quick way to remove manually created splits that 
were created with the addsplits accumulo shell command?

Thanks,

Bruce.

-Original Message-
From: Christopher  
Sent: Monday, 1 November 2021 10:40 PM
To: accumulo-user 
Subject: Re: Triggering Table Balancer in Accumulo [SEC=UNOFFICIAL]

EXTERNAL EMAIL: Do not click any links or open any attachments unless you trust 
the sender and know the content is safe.

Hi Bruce,

We don't have an API for forcing the balancer to rebalance, but I believe it 
automatically runs every couple of minutes. So, it should get frequent 
opportunities to rebalance. It shouldn't be necessary to force a rebalance, if 
your balancer logic takes into account all the factors you care about.

If you need to force it, killing a tserver and allowing its tablets to be 
reassigned can be relatively unintrusive, provided you don't have a lot of 
ingest going on and your tables are flushed (to avoid WAL recovery costs). 
Another way might be to take the table offline and back online again, but that 
feels more intrusive to me, because it would affect an entire table. You could 
also manipulate the metadata table for the tablet to remove the saved location 
information while it is offline (don't do this while it is online), in order to 
prevent tablets from just being reassigned back to their previous servers.

Regarding the empty splits, Accumulo generally balances tablets without regard 
to their contents, because we can't know how the application intends to use the 
splits (I say generally, because a custom balancer could be written to do 
anything). It's expected that the application's schema and the user's choice of 
manual splits reflect their preferred distribution of data across tablets, so 
the balancer only has to care about the number of tablets without regard to 
what they contain. You can merge empty tablets away if you don't need them, 
especially for pre-splits that you didn't end up using, but this incurs a cost 
in terms of  chop-compactions on adjacent tablets.
This might be acceptable. There has been some discussion about a feature to 
avoid chop-compactions, which would be nice because it would make merges much 
more instantaneous and cost-free, but it is not implemented yet.

On Sun, Oct 31, 2021 at 8:59 PM McClure, Bruce MR 2 
 wrote:
>
> UNOFFICIAL
>
> Hi,
>
>
>
> I have a custom table balancer set on a particular table, and a cron job that 
> creates splits for the next-days data, each day.  Normally it is all fine, 
> but after some problems happened, I found that 

RE: accumulo 1.10 replication issue

2021-09-23 Thread dev1
How are you inserting the data?

 

From: Ligade, Shailesh [USA]  
Sent: Wednesday, September 22, 2021 10:22 PM
To: user@accumulo.apache.org
Subject: accumulo 1.10 replication issue

 

Hello,

 

I am following the Apache Accumulo User Manual, Version 1.10.

 

I want to set up replication from Accumulo instance inst1, table source, TO
inst2, table target.

I created a replication user (same password) on both instances and granted
Table.READ/WRITE for source and target respectively.

 

I set the replication.name property to be the same as the instance name on both instances

 

On inst1 Set following properties

 

replication.peer.inst1=org.apache.accumulo.tserver.replication.AccumuloReplicaSystem,inst2,inst2zoo1:2181,inst2zoo2:2181,inst2zoo3:2181

replication.peer.user.inst2=replication

replication.peer.password.inst2=replication

 

set the source table for replication

config -t source -s table.replication=true

config -t source -s table.replication.target.inst2=(number I got for target
table from inst2 tables -l command)

 

and finally I did

online accumulo.replication

 

Now when I insert data in source, I get 'files needing replication: 1' in the
monitor replication section. All other values are correct: TABLE - source,
PEER - inst2, REMOTE ID as the number I set.

 

However my In-Progress Replication always stays empty and I don't see any
data in the inst2 target table.

 

No errors that I can see in the master log or the tserver log where the tablet exists.

 

Any idea what may be wrong? Is there any way to debug this?

 

-S

 

 

 



RE: [EXTERNAL EMAIL] - Re: [External] RE: how to decommission tablet server

2021-08-18 Thread dev1
If the admin stop fails to stop the service you will need to either kill the 
process or stop the Linux service.

 

The hosts file can be modified either before or after.  If you do 
it before, remember that things like admin start-all / stop-all will not know 
about those nodes.  Likewise, if after, those commands may try actions 
like start that you’d rather not happen.

 

From: Shailesh Ligade  
Sent: Wednesday, August 18, 2021 7:53 AM
To: user@accumulo.apache.org
Subject: RE: [EXTERNAL EMAIL] - Re: [External] RE: how to decommission tablet 
server

 

Thanks

 

So in reality, if I issue admin stop on the tserver (with -f if needed), I 
don’t need to stop the Linux service, right?

 

Also, when is it safe to update the slaves file? Can I wait till I decommission 
all my nodes, or do I need to do that after each node is decommissioned?

 

Appreciated

 

-S

 

 

From: Mike Miller mailto:mmil...@apache.org> > 
Sent: Wednesday, August 18, 2021 7:47 AM
To: user@accumulo.apache.org  
Subject: [EXTERNAL EMAIL] - Re: [External] RE: how to decommission tablet server

 

The admin stop command issues a graceful shutdown to Accumulo for that tserver. 
There is a force option you could try ("-f", "--force") that will remove the 
lock. But these are more graceful than a Linux kill -9 command, which you may 
have to do if the admin command doesn't kill the process entirely.
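For example (the hostname is a placeholder):

  accumulo admin stop tserver1.example.com:9997 -f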

 

On Wed, Aug 18, 2021 at 7:31 AM Ligade, Shailesh [USA] mailto:ligade_shail...@bah.com> > wrote:

Thank you for good explanation! I really appreciate that.

 

Yes I need to remove the hardware, meaning I need to stop everything on the 
server (tserver and datanode)

 

One quick question:

 

What is the difference between accumulo admin stop <host>:9997 and stopping 
the tserver Linux service?

 

When I issue admin stop, I can see from the monitor that the hosted tablet count 
for the tserver in question goes down to 0; however, it doesn't stop the 
tserver process or service.

 

In your steps, you are stopping the datanode service first (adding it to the 
exclude file, running refreshNodes, and then stopping the service). I was thinking 
to stop the Accumulo tserver and let it handle the hosted tablets first, before 
touching the datanode - will there be any difference? Just trying to understand 
what the relationship between Accumulo and Hadoop is.

 

Thank you!

 

-S


From: d...@etcoleman.com
Sent: Tuesday, August 17, 2021 2:39 PM
To: user@accumulo.apache.org
Subject: [External] RE: how to decommission tablet server 

 

Maybe you could clarify.  Decommissioning tablet servers and hdfs replication 
are separate and distinct issues.  Accumulo will generally be unaware of hdfs 
replication and table assignment does not change the hdfs replication.  You can 
set the replication factor for a tablet – but that is used on writes to hdfs – 
Accumulo will assume that on any successful write, on return hdfs  is managing 
the details.

 

When a tablet is assigned / migrated, the underlying files in hdfs are not 
changed – the file references are reassigned in a metadata operation, but the 
files themselves are not modified.  They will maintain whatever replication 
factor that was assigned and whatever the namenode decides.

 

If you are removing servers that have both data nodes and tserver processes 
running: 

 

If you stop / kill the tserver, the tablets assigned to that server will be 
reassigned rather quickly.  It is only a metadata update.  The exact timing 
will depend on your ZooKeeper time-out setting, but the “dead” tserver should 
be detected and reassigned in short order. The reassignment may cause some 
churn of assignments if the cluster becomes un-balanced.   The manager (master) 
will select tablets from tservers that are over-subscribed and then assign them 
to tservers that have fewer tablets – you can monitor the manager (master) 
debug log to see the migration progress.  If you want to be gentle, stop a 
tserver, wait for the number of unassigned tablets to hit zero and migration to 
settle, and then repeat.

 

If you want to stop the data nodes, you can do that independently of Accumulo – 
just follow the Hadoop data node decommission process.  Hadoop will move the 
data blocks assigned to the data node so that it is “safe” to then stop the 
data node process.  This is independent of Accumulo and Accumulo will not be 
aware that the blocks are moving.  If you are running compactions, Accumulo may 
try to write blocks locally, but if the data node is rejecting new block 
assignments (which I rather assume that it would when in decommission mode) 
then Accumulo still would not care.  If somehow new blocks were written it may 
just delay the Hadoop data node decommissioning.

 

If you are running ingest while killing tservers – things should mostly work – 
there 

RE: [External] RE: how to decommission tablet server

2021-08-18 Thread dev1
The accumulo admin command should attempt to unload the tablets a little
more gracefully than just stopping the service.   I don't know why it's not
stopping the service. I do know that some had issues when the services were
controlled with systemd.  Stopping the service without the admin command
(either kill [pid] or service stop) is not catastrophic - it's basically
just like the node / service failed, which Accumulo should handle.  One note:
if you can identify the servers that may be hosting the metadata and root
tablets, stop them last - that reduces the chance that they would
migrate to a server that is about to be decommissioned.
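One way to spot those servers from the shell - a sketch, assuming your user can 
read the accumulo.root table:

  scan -t accumulo.root -c loc

The loc column entries show which tservers currently host metadata tablets; the 
root tablet's own location is tracked in ZooKeeper rather than in a table.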

 

Sorry if I indicated that you should stop the data node first - you can
start the Hadoop decommission first and then, when the data node migration
has completed, stop the data node process.  You can stop Accumulo nodes
in parallel with that decommissioning.   An Accumulo process running on a
node and a data node process running on that same node are independent.
Accumulo uses Hadoop / the namenode to persist files.  The namenode coordinates
with the data nodes where those blocks are stored.

 

You likely can edit the hosts files either before or after - if you elect to
do it after, make sure that you disable the ability to restart the tserver
once stopped on a decommissioning node - say by running accumulo start-all
or through an auto-service restart - as that would end up restarting the
services.

 

From: Ligade, Shailesh [USA]  
Sent: Wednesday, August 18, 2021 7:31 AM
To: user@accumulo.apache.org
Subject: Re: [External] RE: how to decommission tablet server

 

Thank you for good explanation! I really appreciate that.

 

Yes I need to remove the hardware, meaning I need to stop everything on the
server (tserver and datanode)

 

One quick question:

 

What is the difference between accumulo admin stop <host>:9997 and
stopping the tserver Linux service?

 

When I issue admin stop, I can see from the monitor that the hosted tablet count
for the tserver in question goes down to 0; however, it doesn't stop
the tserver process or service.

 

In your steps, you are stopping the datanode service first (adding it to the
exclude file, running refreshNodes, and then stopping the service). I was
thinking to stop the Accumulo tserver and let it handle the hosted tablets
first, before touching the datanode - will there be any difference? Just trying
to understand what the relationship between Accumulo and Hadoop is.

 

Thank you!

 

-S


From: d...@etcoleman.com
Sent: Tuesday, August 17, 2021 2:39 PM
To: user@accumulo.apache.org
Subject: [External] RE: how to decommission tablet server 

 

Maybe you could clarify.  Decommissioning tablet servers and hdfs
replication are separate and distinct issues.  Accumulo will generally be
unaware of hdfs replication and table assignment does not change the hdfs
replication.  You can set the replication factor for a tablet - but that is
used on writes to hdfs - Accumulo will assume that on any successful write,
on return hdfs  is managing the details.

 

When a tablet is assigned / migrated, the underlying files in hdfs are not
changed - the file references are reassigned in a metadata operation, but
the files themselves are not modified.  They will maintain whatever
replication factor that was assigned and whatever the namenode decides.

 

If you are removing servers that have both data nodes and tserver processes
running: 

 

If you stop / kill the tserver, the tablets assigned to that server will be
reassigned rather quickly.  It is only a metadata update.  The exact timing
will depend on your ZooKeeper time-out setting, but the "dead" tserver
should be detected and reassigned in short order. The reassignment may cause
some churn of assignments if the cluster becomes un-balanced.   The manager
(master) will select tablets from tservers that are over-subscribed and then
assign them to tservers that have fewer tablets - you can monitor the
manager (master) debug log to see the migration progress.  If you want to be
gentle, stop a tserver, wait for the number of unassigned tablets to hit
zero and migration to settle, and then repeat.

 

If you want to stop the data nodes, you can do that independently of
Accumulo - just follow the Hadoop data node decommission process.  Hadoop
will move the data blocks assigned to the data node so that it is "safe" to
then stop the data node process.  This is independent of Accumulo and
Accumulo will not be aware that the blocks are moving.  If you are running
compactions, Accumulo may try to write blocks locally, but if the data node
is rejecting new block assignments (which I rather assume that it would when
in decommission mode) then Accumulo still would not care.  If somehow new
blocks were written it may just delay the Hadoop data node decommissioning.

 

If you are running 

RE: how to decommission tablet server

2021-08-17 Thread dev1
Maybe you could clarify.  Decommissioning tablet servers and hdfs
replication are separate and distinct issues.  Accumulo will generally be
unaware of hdfs replication and table assignment does not change the hdfs
replication.  You can set the replication factor for a tablet - but that is
used on writes to hdfs - Accumulo will assume that on any successful write,
on return hdfs  is managing the details.

 

When a tablet is assigned / migrated, the underlying files in hdfs are not
changed - the file references are reassigned in a metadata operation, but
the files themselves are not modified.  They will maintain whatever
replication factor that was assigned and whatever the namenode decides.

 

If you are removing servers that have both data nodes and tserver processes
running: 

 

If you stop / kill the tserver, the tablets assigned to that server will be
reassigned rather quickly.  It is only a metadata update.  The exact timing
will depend on your ZooKeeper time-out setting, but the "dead" tserver
should be detected and reassigned in short order. The reassignment may cause
some churn of assignments if the cluster becomes un-balanced.   The manager
(master) will select tablets from tservers that are over-subscribed and then
assign them to tservers that have fewer tablets - you can monitor the
manager (master) debug log to see the migration progress.  If you want to be
gentle, stop a tserver, wait for the number of unassigned tablets to hit
zero and migration to settle, and then repeat.

 

If you want to stop the data nodes, you can do that independently of
Accumulo - just follow the Hadoop data node decommission process.  Hadoop
will move the data blocks assigned to the data node so that it is "safe" to
then stop the data node process.  This is independent of Accumulo and
Accumulo will not be aware that the blocks are moving.  If you are running
compactions, Accumulo may try to write blocks locally, but if the data node
is rejecting new block assignments (which I rather assume that it would when
in decommission mode) then Accumulo still would not care.  If somehow new
blocks were written it may just delay the Hadoop data node decommissioning.
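For reference, the Hadoop side of the decommission is typically driven by the
exclude file - a sketch, assuming dfs.hosts.exclude points at
/etc/hadoop/conf/dfs.exclude (the path and hostname are placeholders):

  echo 'node-to-remove.example.com' >> /etc/hadoop/conf/dfs.exclude
  hdfs dfsadmin -refreshNodes
  hdfs dfsadmin -report

Watch the report until the node shows Decommissioned before stopping the
datanode process.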

 

If you are running ingest while killing tservers - things should mostly work
- there may be ingest failures, but normally things would get retried and
the subsequent effort should succeed - the issue is that if by bad luck
the work keeps getting assigned to tservers that are then killed, you could
end up exceeding the number of retries and the ingest would fail outright.
If you can pause ingest, then this limits that chance.  If you can monitor
your ingest and know when an ingest failed you could just reschedule the
ingest (for bulk import)  If you are doing continuous ingest, it may be
harder to determine if a specific ingest fails, so you may need to select an
appropriate range for replay.  Overall it may mostly work - it will depend
on your processes and your tolerance for any particular data loss on an
ingest.

 

The modest approach (if you can accept transient errors):

 

1 Start the data node decommission process.

2 Pause ingest and cancel any running user compactions.

3 Stop a tserver and wait for unassigned tablets to go back to 0.  Wait for
the tablet migration (if any) to quiet down. 

4 Repeat 3 until all tserver processes have been stopped on the nodes you
are removing.

5 Restart ingest - rerun any user compactions if you stopped any.

6 Wait for the hdfs decommission process to finish moving / replicating
blocks.

7 stop the data node process.

8 do what you want with the node.

 

You do not need to schedule down time - if you can accept transient errors -
say that a client scan is running and that tserver is stopped - the client
may receive an error for the scan.  If the scan is resubmitted and the
tablet has been reassigned it should work - it may pause for the
reassignment and / or time out if the assignment takes some time.   You are
basically playing a numbers game here - the number of tablets, the number of
unassigned tablets, the odds that a scan would be using a particular tablet
for the duration that it is unavailable.  It's not guaranteed that it will
fail, it's just that there is a greater-than-0 chance that it could - if that
is unacceptable then:

 

1 Stop ingest - wait for all to finish or mark which ones will need to be
rescheduled

2 Stop Accumulo

3 Remove the tservers from the servers list

4 Start Accumulo without starting the decommissioned tserver nodes.

 

Do what you want with the data node decommissioning.

 

The latter approach removes possible transient issues.  It is up to you to
determine your tolerance for possible transient issues for the duration that
tservers are being stopped vs a complete outage for the duration that
Accumulo is down.  If it is a large cluster and just a few tservers, the
odds of a specific tablet being off line for a short duration may be very
low.  If it is a small cluster or the 

RE: unassigned tablet "is SUSPENDED"

2021-08-05 Thread dev1
Are the wal files available for the table in hdfs? (there should be a metadata 
table entry with the file paths)
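A sketch of that check from the shell - in the metadata table the log column 
family holds the WAL references:

  scan -t accumulo.metadata -c log

Each returned path can then be confirmed with hadoop fs -ls <path>.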

 

From: Bulldog20630405  
Sent: Wednesday, August 4, 2021 5:08 PM
To: accumulo-user 
Cc: Bulldog20630405 
Subject: unassigned tablet "is SUSPENDED"

 

 

we have a few tablets that are unassigned and the logs show "is SUSPENDED"

 

e.g.
2;abcd;wxyz@(null, null, myserver:12345[abcdefg123456]) is SUSPENDED #walogs:2

 

how do you fix this?

 



RE: hdfs rack awareness and accumulo

2021-08-05 Thread dev1
Are there any Accumulo system tables that are offline (root, metadata)?  Is 
there a manager (master) process available?  What is the manager log saying?

 

From: Ligade, Shailesh [USA]  
Sent: Thursday, August 5, 2021 7:55 AM
To: user@accumulo.apache.org
Subject: hdfs rack awareness and accumulo

 

Hello,

 

Our HDFS setup is rack-aware with a replication factor of 3. The datanodes and 
tservers share the same hosts. In the event that one rack goes down, will 
Accumulo still be functioning (after HDFS data replication)?

 

What I am finding is that the Accumulo monitor is up and showing half the tablets 
as unreachable. I can get to the Accumulo shell but I can’t scan any tables. From 
the log I can see there are some locks in ZooKeeper. But overall Accumulo, 
although up, is not usable ☹ Is there any way around it?

 

-S



RE: How to stop accumulo replication

2021-07-20 Thread dev1
See: https://accumulo.apache.org/docs/2.x/administration/replication 

 

Replication should be controlled using the accumulo shell config command.

 

To see the current system replication configuration (-f provides a filter
and will show only properties that contain 'replication' in the name):

 

> config -f replication

 

It may be as simple as setting the system property table.replication to
false.

 

> config -s table.replication=false

 

The per-table replication configuration can be see using:

 

> config -t [your_table] -f table.replication    (or just use -f replication
to see properties that contain replication)

 

And disabled (if true) with:

 

> config -t my_table -s table.replication=false

 

From: Shailesh Ligade  
Sent: Tuesday, July 20, 2021 1:01 PM
To: user@accumulo.apache.org
Subject: How to stop accumulo replication

 

Hello,

 

Is there a way to find out which tables are getting replicated? How can I
stop the entire replication process?

 

-S

 



RE: Re: Re: Re: Hadoop ConnectException

2021-07-10 Thread dev1
Use jps -m to check which processes you have running.

 

Check the accumulo logs – are there any with *.err that have a size > 0?  The 
.err files will be created on an unexpected exit.  The other debug logs will 
provide a clearer picture of what is happening.

 

Tail the master debug log / and a tserver debug log – are they showing 
exceptions being thrown?
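For example (the log directory is an assumption - use whatever ACCUMULO_LOG_DIR 
points at on your install):

  jps -m
  find /opt/accumulo/logs -name '*.err' -size +0c
  tail -f /opt/accumulo/logs/*master*.log /opt/accumulo/logs/*tserver*.log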

 

From: Christine Buss  
Sent: Saturday, July 10, 2021 9:57 AM
To: user@accumulo.apache.org
Subject: Aw: Re: Re: Re: Hadoop ConnectException

 

 

Ok, so in the file 'accumulo.properties' I changed

## Sets location in HDFS where Accumulo will store data
instance.volumes=hdfs://localhost:8020/accumulo

 

to

 

## Sets location in HDFS where Accumulo will store data
instance.volumes=hdfs://localhost:9000/accumulo

 

 

Then I was able to run 'accumulo init' and 'accumulo-cluster start'.

But when I run 'accumulo shell -u root' it hangs:

 

 

christine@centauri:~/accumulo-2.0.1/bin$ ./accumulo shell -u root
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in 
version 9.0 and will likely be removed in a future release.
Loading configuration from 
/home/christine/accumulo-2.0.1/conf/accumulo-client.properties
Password: *

Shell - Apache Accumulo Interactive Shell
-
- version: 2.0.1
- instance name: accumulotest
- instance id: 5d8c404a-c741-48b3-b7a4-adaf19cc1499
-
- type 'help' for a list of available commands
-
2021-07-10 15:39:17,328 [clientImpl.ServerClient] WARN : There are no tablet 
servers: check that zookeeper and accumulo are running.

 

 

 

 

 

Sent: Saturday, 10 July 2021 at 14:43
From: "Christine Buss" christine.buss...@gmx.de
To: user@accumulo.apache.org
Subject: Re: Re: Re: Hadoop ConnectException

Sorry, found it:

The ‘accumulo-cluster’ command was created to manage Accumulo on a cluster and 
replaces ‘start-all.sh’ and ‘stop-all.sh’.

  

  

Sent: Saturday, 10 July 2021 at 12:14
From: "Christine Buss" christine.buss...@gmx.de
To: user@accumulo.apache.org
Subject: Re: Re: Re: Hadoop ConnectException

I am still trying to run Accumulo 2.0.1.

Question: what do you use in 2.0.1 instead of ./bin/start-all.sh?

  

  

Sent: Friday, 09 July 2021 at 17:15
From: "Christopher" ctubb...@apache.org
To: "accumulo-user" user@accumulo.apache.org
Subject: Re: Re: Re: Hadoop ConnectException

Oh, so you weren't able to get 2.0.1 working? That's unfortunate. If
you try 2.0.1 again and are able to figure out how to get past the
issue you were having, feel free to let us know what you did
differently.

On Fri, Jul 9, 2021 at 10:56 AM Christine Buss mailto:christine.buss...@gmx.de> > wrote:
>
>
> yes of course!
> I deleted accumulo 2.0.1 and installed accumulo 1.10.1.
> Then edited the conf/ files. I think I didn't do that right before.
> And then it worked.
>
> Sent: Friday, 09 July 2021 at 16:30
> From: "Christopher" ctubb...@apache.org
> To: "accumulo-user" user@accumulo.apache.org
> Subject: Re: Re: Hadoop ConnectException
> Glad to hear you got it working! Can you share what your solution was in case 
> it helps others?
>
> On Fri, Jul 9, 2021, 10:20 Christine Buss   > wrote:
>>
>>
>> It works!! Thanks a lot to veryone!
>> I worked through all your hints and suggestions.
>>
>> Sent: Thursday, 08 July 2021 at 18:18
>> From: "Ed Coleman" edcole...@apache.org
>> To: user@accumulo.apache.org
>> Subject: Re: Hadoop ConnectException
>>
>> According to the Hadoop getting started guide 
>> (https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html)
>>  the resouce manager runs at: http://localhost:8088/
>>
>> Can you run hadoop commands like:
>> > hadoop fs -ls /accumulo (or whatever you've decided on as the destination 
>> > for files)
>>
>> Did you check that accumulo-env.sh and other configuration files have been 
>> set-up for your environemnt?
>>
>>
>> On 2021/07/07 15:20:41, Christine Buss >  > wrote:
>> > Hi,
>> >
>> >
>> >
>> > I am using:
>> >
>> > Java 11
>> >
>> > Ubuntu 20.04.2
>> >
>> > Hadoop 3.3.1
>> >
>> > Zookeeper 3.7.0
>> >
>> > Accumulo 2.0.1
>> >
>> >
>> >
>> >
>> >
>> > I followed the instructions here:
>> >
>> > https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-
>> > common/SingleCluster.html
>> >
>> > and edited `etc/hadoop/hadoop-env.sh`, etc/hadoop/core-site.xml,
>> > etc/hadoop/hdfs-site.xml accordingly.
>> >
>> > 'ssh localhost' works without a passphrase.
>> >
>> >
>> >
>> > Then I started Zookeper, start-dfs.sh and start-yarn.sh:
>> >
>> > christine@centauri:~$ ./zookeeper-3.4.9/bin/zkServer.sh start
>> > ZooKeeper JMX enabled by default
>> > Using config: 

RE: Hadoop ConnectException

2021-07-07 Thread dev1
Did you verify that Hadoop is really up and healthy?  Look at the Hadoop 
monitor pages and confirm that you can use the Hadoop CLI to navigate around.  
You may also need to update the Accumulo configuration files / env to match 
your configuration.
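For example, a quick smoke test from the command line:

  hdfs dfsadmin -report
  hadoop fs -ls /
  hadoop fs -mkdir /tmp/accumulo-smoke-test

If these hang or throw connection errors, accumulo init will fail the same way.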

 

You might want to look at using https://github.com/apache/fluo-uno as a quick 
way to stand up an instance for testing – that might give you additional 
insights. 

 

From: Christine Buss  
Sent: Wednesday, July 7, 2021 11:21 AM
To: user@accumulo.apache.org
Subject: Hadoop ConnectException

 

Hi,

 

I am using:

Java 11

Ubuntu 20.04.2

Hadoop 3.3.1

Zookeeper 3.7.0

Accumulo 2.0.1

 

 

I followed the instructions here:

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html

and edited etc/hadoop/hadoop-env.sh,  etc/hadoop/core-site.xml, 
etc/hadoop/hdfs-site.xml accordingly.

'ssh localhost' works without a passphrase.

 

Then I started Zookeper, start-dfs.sh and start-yarn.sh:

christine@centauri:~$ ./zookeeper-3.4.9/bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /home/christine/zookeeper-3.4.9/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
christine@centauri:~$ ./hadoop-3.3.1/sbin/start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [centauri]
centauri: Warning: Permanently added 
'centauri,2003:d4:771c:3b00:7223:40a1:4c07:7c7b' (ECDSA) to the list of known 
hosts.
christine@centauri:~$ ./hadoop-3.3.1/sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers
christine@centauri:~$ jps
3921 Jps
2387 QuorumPeerMain
3171 SecondaryNameNode
3732 NodeManager
2955 DataNode
3599 ResourceManager

 

BUT

when running 'accumulo init' I get this Error:

hristine@centauri:~$ ./accumulo-2.0.1/bin/accumulo init
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in 
version 9.0 and will likely be removed in a future release.
2021-07-07 15:59:05,590 [conf.SiteConfiguration] INFO : Found Accumulo 
configuration on classpath at 
/home/christine/accumulo-2.0.1/conf/accumulo.properties
2021-07-07 15:59:08,460 [fs.VolumeManagerImpl] WARN : dfs.datanode.synconclose 
set to false in hdfs-site.xml: data loss is possible on hard system reset or 
power loss
2021-07-07 15:59:08,461 [init.Initialize] INFO : Hadoop Filesystem is 
hdfs://localhost:9000
2021-07-07 15:59:08,461 [init.Initialize] INFO : Accumulo data dirs are 
[hdfs://localhost:8020/accumulo]
2021-07-07 15:59:08,461 [init.Initialize] INFO : Zookeeper server is 
localhost:2181
2021-07-07 15:59:08,461 [init.Initialize] INFO : Checking if Zookeeper is 
available. If this hangs, then you need to make sure zookeeper is running
2021-07-07 15:59:08,938 [init.Initialize] ERROR: Fatal exception
java.io.IOException: Failed to check if filesystem already initialized
at org.apache.accumulo.server.init.Initialize.checkInit(Initialize.java:285)
at org.apache.accumulo.server.init.Initialize.doInit(Initialize.java:323)
at org.apache.accumulo.server.init.Initialize.execute(Initialize.java:991)
at org.apache.accumulo.start.Main.lambda$execKeyword$0(Main.java:129)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.ConnectException: Call From centauri/192.168.178.30 to 
localhost:8020 failed on connection exception: java.net.ConnectException: 
Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused
at 
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
at 
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:913)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:828)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1577)
at org.apache.hadoop.ipc.Client.call(Client.java:1519)
at org.apache.hadoop.ipc.Client.call(Client.java:1416)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
at com.sun.proxy.$Proxy18.getFileInfo(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:965)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at 
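
A likely culprit in the init output above: the Hadoop filesystem is reported 
as hdfs://localhost:9000 while the Accumulo data dirs point at 
hdfs://localhost:8020, so init tries a NameNode port that nothing is 
listening on. A minimal sketch of consistent settings, assuming the NameNode 
really listens on 9000 (the property names are the standard ones; the values 
are illustrative):

# etc/hadoop/core-site.xml (Hadoop side):
#   <property>
#     <name>fs.defaultFS</name>
#     <value>hdfs://localhost:9000</value>
#   </property>
#
# conf/accumulo.properties (Accumulo side) - the volume should match
# fs.defaultFS:
instance.volumes=hdfs://localhost:9000/accumulo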

RE: Unassigned tables on every restart accumulo 1.10.1

2021-05-12 Thread dev1
There was additional discussion at 
https://github.com/apache/accumulo/issues/2085 incase that helps anyone.

 

From: Ivershyn Viktor  
Sent: Wednesday, May 12, 2021 1:09 PM
To: user@accumulo.apache.org
Subject: Unassigned tables on every restart accumulo 1.10.1

 

HI all,

 

We have Accumulo 1.10.1, and for the last month every restart has broken the 
+r (root) table. After it is restored, the whole cluster's tablets become 
unassigned and we have to wait nearly an hour for all of them to be 
reassigned.

There is 1 master and 2 tablet servers - can anybody help me with that?

Log:

2021/05/06 17:20:50,343   tserver:xx140.internal..net  1   
WARN exception trying to assign tablet +r<< 
hdfs://120.-dev.local:8020/apps/accumulo/data/tables/+r/root_tablet
 java.lang.RuntimeException: Error recovering tablet +r<< from log files at 
org.apache.accumulo.tserver.tablet.Tablet.(Tablet.java:499) at 
org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2413)
 at 
org.apache.accumulo.tserver.TabletServer$ThriftClientHandler$3.run(TabletServer.java:1775)
 Caused by: java.io.IOException: Unable to find recovery files for extent +r<< 
logEntry: +r<< 
hdfs://120.-dev.local:8020/apps/accumulo/data/wal/xx.x-dev.local+9997/e31761f2-e600-49f9-9a5c-8972aa37005b
 at org.apache.accumulo.tserver.TabletServer.recover(TabletServer.java:3311) at 
org.apache.accumulo.tserver.tablet.Tablet.(Tablet.java:437) ... 2 more

 

2021/05/06 17:20:50,345   tserver:140.internal.cloudapp.net   1 
   WARN Error recovering tablet +r<< from log files

2021/05/06 17:20:50,349   tserver:x140.internal.cloudapp.net   
1WARN failed to open tablet +r<< reporting failure to master

2021/05/06 17:20:50,351   tserver:x140.internal.cloudapp.net   
1WARN rescheduling tablet load in 1.00 seconds

2021/05/06 17:20:50,383   master:x120.internal.cloudapp.net   1 
   ERROR x140.xxx-dev.local:9997 reports assignment failed for 
tablet +r<<
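
One low-risk first check is whether the WAL file named in the log still 
exists and is healthy in HDFS. A sketch using the Hadoop CLI, with the path 
copied from the log above (the tserver hostname is masked in the original, so 
<tserver-host> is a placeholder for your real address):

hdfs dfs -ls /apps/accumulo/data/wal/<tserver-host>+9997/e31761f2-e600-49f9-9a5c-8972aa37005b
hdfs fsck /apps/accumulo/data/wal -files -blocks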

 

 

 



RE: accumulo 1.9 slow startup

2020-09-25 Thread dev1
How are you shutting down and starting up?

 

For start-up, it works much better if most, if not all tservers are up
before the master is started.  That way, the assignments are distributed
across the available tservers.  If you start the master first, what will
happen is that when the first few tservers start, the master immediately
begins to assign all tablets to them - it then will take considerable time
for those few tservers to rebalance and distribute the tablets to the
remaining tservers that did not get additional assignments.
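
A rough sketch of that ordering with the stock process keywords (service 
wrappers and hostnames are site-specific, so treat the exact invocation as 
an assumption):

# On every tablet server node first:
./bin/accumulo tserver &

# Then, once all tservers have registered, on the master node:
./bin/accumulo master &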

 

You should be able to see reassignment messages in the master debug log if
your cluster is unbalanced.
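
For example (the log file name and location depend on your log4j setup, so 
treat the path as an assumption):

grep -iE 'balanc|assign' /var/log/accumulo/master_*.debug.log | tail -n 50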

 

With 1.10.0, the ability to delay assignments until a threshold number of
tservers have registered is available to avoid this issue.
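
If I recall the 1.10 release correctly, the relevant knobs are the following 
master properties (double-check the names against the 1.10 property docs 
before relying on them; the values here are illustrative for a 45-tserver 
cluster):

master.startup.tserver.avail.min.count=40   # wait for at least 40 tservers
master.startup.tserver.avail.max.wait=90s   # but stop waiting after 90s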

 

 

 

From: Ligade, Shailesh (ITADD) (CON)  
Sent: Friday, September 25, 2020 8:07 AM
To: user@accumulo.apache.org
Subject: accumulo 1.9 slow startup

 

Hello,

 

We have 45 tservers hosting a few tables; the max entries in a table are
about 1B. Each tserver hosts about 2.3k tablets.

 

When we start Accumulo it shows offline tables and slowly works through them.
After there are no more offline tables, it again goes through 2-3 cycles of
offline tablets, and finally I can scan tables. The overall process takes
about 25+ minutes.

 

What is causing this slowness?

 

Tserver heapsize is set at 24G

Memory map is set at 3G

Master heapsize is set at 48G

 

Thanks

 

S