[ https://issues.apache.org/jira/browse/ACCUMULO-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634471#comment-15634471 ]

Adam J Shook edited comment on ACCUMULO-4506 at 11/3/16 10:18 PM:
------------------------------------------------------------------

There are two znodes under the {{locks}} node, one for each file.  They belong 
to different tservers, which I identified by matching each lock's 
{{ephemeralOwner}} against the {{zlock}} nodes under {{tservers}}.

{noformat}
[zk: host:2181(CONNECTED) 9] get /accumulo/43d2ac5e-0df0-4727-aba9-05bae9e908e7/replication/workqueue/locks/ae4b03ec-159b-44e8-9a88-ccf7fa849c19|peer_instance|5h|k
ephemeralOwner = 0x357d1bf618f80ad

[zk: host:2181(CONNECTED) 14] get /accumulo/43d2ac5e-0df0-4727-aba9-05bae9e908e7/replication/workqueue/locks/9f038f64-4252-44a0-bfd0-99d4a316b397|peer_instance|5g|j
ephemeralOwner = 0x357d1bf618f4f72

[zk: host:2181(CONNECTED) 12] get /accumulo/43d2ac5e-0df0-4727-aba9-05bae9e908e7/tservers/host:31658/zlock-0000000000
TSERV_CLIENT=host:31658
ephemeralOwner = 0x357d1bf618f80ad

[zk: host:2181(CONNECTED) 13] get /accumulo/43d2ac5e-0df0-4727-aba9-05bae9e908e7/tservers/host:31368/zlock-0000000000
TSERV_CLIENT=host:31368
ephemeralOwner = 0x357d1bf618f4f72
{noformat}
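
For reference, here is a rough sketch of the same cross-check using the plain 
ZooKeeper Java client rather than the CLI.  The connect string and instance 
UUID are copied from the session above and would need to be substituted for 
another cluster:

{code:java}
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ReplicationLockOwners {
    // Connect string and instance path taken from the zkCli session above; substitute your own.
    static final String ZK = "host:2181";
    static final String ROOT = "/accumulo/43d2ac5e-0df0-4727-aba9-05bae9e908e7";

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper(ZK, 30_000, event -> {});
        try {
            // Session id (ephemeralOwner) of each replication work-queue lock znode
            for (String lock : zk.getChildren(ROOT + "/replication/workqueue/locks", false)) {
                Stat stat = zk.exists(ROOT + "/replication/workqueue/locks/" + lock, false);
                System.out.printf("lock %s owned by session 0x%x%n", lock, stat.getEphemeralOwner());
            }
            // Session id of each tserver's zlock, to match against the locks above
            for (String tserver : zk.getChildren(ROOT + "/tservers", false)) {
                for (String zlock : zk.getChildren(ROOT + "/tservers/" + tserver, false)) {
                    Stat stat = zk.exists(ROOT + "/tservers/" + tserver + "/" + zlock, false);
                    System.out.printf("tserver %s (%s) session 0x%x%n", tserver, zlock, stat.getEphemeralOwner());
                }
            }
        } finally {
            zk.close();
        }
    }
}
{code}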

Unfortunately, we don't keep logs around long enough to see when these files 
were initially assigned.  We only have data back to October 27th, and a Kibana 
search for the WAL UUIDs returns log entries only from the Master and the GC.

For what it's worth, we've been trying out replication and are seeing some 
behavior we can't really explain without digging into it a lot more (source 
code included).  The time between a WAL file being closed and it actually being 
replicated is much longer than I would expect -- anywhere from five minutes to 
a couple of hours.  I see a lot of log entries saying work is being scheduled, 
but it takes a while before the work is actually done.  This particular cluster 
has four tablet servers, and there are always 40-60 files pending replication, 
with files rarely "in-progress" (besides these two problematic files).  
Replication seems to happen in waves, and I haven't put my finger on what 
causes files to move from pending to in-progress.  With that said, things *are* 
replicating; it just takes a while longer than we anticipated.  I'm not sure 
whether this is expected behavior or something else is going on.
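
In case it is useful to anyone else looking at this, here is a rough sketch of 
how one could watch the backlog from the client API instead of the shell: scan 
the {{work}} column family of {{accumulo.replication}} and count distinct rows 
(WALs).  The instance name, ZooKeepers, and credentials are placeholders, and 
the sketch only prints keys rather than decoding the serialized Status values:

{code:java}
import java.util.HashSet;
import java.util.Map.Entry;
import java.util.Set;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class ReplicationBacklog {
    public static void main(String[] args) throws Exception {
        // Placeholder instance name, ZooKeepers, and credentials.
        Connector conn = new ZooKeeperInstance("instance", "host:2181")
                .getConnector("root", new PasswordToken("secret"));

        // Each row of accumulo.replication is a WAL; the "work" column family
        // holds the per-peer replication work records shown in the issue description.
        Scanner scanner = conn.createScanner("accumulo.replication", Authorizations.EMPTY);
        scanner.fetchColumnFamily(new Text("work"));

        Set<String> wals = new HashSet<>();
        for (Entry<Key, Value> entry : scanner) {
            wals.add(entry.getKey().getRow().toString());
            System.out.println(entry.getKey()); // value is a serialized Status protobuf, skipped here
        }
        System.out.println(wals.size() + " WALs with outstanding work entries");
    }
}
{code}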


>  Some in-progress files for replication never replicate
> -------------------------------------------------------
>
>                 Key: ACCUMULO-4506
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4506
>             Project: Accumulo
>          Issue Type: Bug
>          Components: replication
>    Affects Versions: 1.7.2
>            Reporter: Adam J Shook
>
> We're seeing an issue with replication where two files have been in-progress 
> for a long time and based on the logs are not going to be replicated.  The 
> metadata from the {{accumulo.replication}} table looks a little funky, with a 
> very large {{begin}} value of 9223372036854775807 ({{Long.MAX_VALUE}}).
> *Logs*
> {noformat}
> 2016-11-02 19:52:50,900 [replication.DistributedWorkQueueWorkAssigner] DEBUG: 
> Not queueing work for 
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 
> to Remote Name: peer_instance Remote identifier: 5h Source Table ID: k 
> because [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true 
> createdTime: 1477314365827] doesn't need replication
> 2016-11-02 19:53:08,900 [replication.DistributedWorkQueueWorkAssigner] DEBUG: 
> Not queueing work for 
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 
> to Remote Name: peer_instance Remote identifier: 5i Source Table ID: l 
> because [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true 
> createdTime: 1477052816174] doesn't need replication
> {noformat}
> *Replication table*
> {noformat}
> scan -r 
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 
> -t accumulo.replication
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 
> repl:j []    [begin: 0 end: 0 infiniteEnd: true closed: true createdTime: 
> 1477314369633]
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 
> repl:k []    [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: 
> true createdTime: 1477314365827]
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 
> repl:l []    [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: 
> true createdTime: 1477314365707]
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 
> work:\x01\x00\x00\x00\x17peer_instance\x01\x00\x00\x00\x025g\x01\x00\x00\x00\x01j
>  []    [begin: 0 end: 0 infiniteEnd: true closed: true createdTime: 
> 1477314369633]
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 
> work:\x01\x00\x00\x00\x17peer_instance\x01\x00\x00\x00\x025h\x01\x00\x00\x00\x01k
>  []    [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true 
> createdTime: 1477314365827]
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 
> work:\x01\x00\x00\x00\x17peer_instance\x01\x00\x00\x00\x025i\x01\x00\x00\x00\x01l
>  []    [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true 
> createdTime: 1477314365707]
> scan -r 
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 
> -t accumulo.replication
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 
> repl:j []    [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: 
> true createdTime: 1477052819752]
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 
> repl:k []    [begin: 0 end: 0 infiniteEnd: true closed: true createdTime: 
> 1477052816238]
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 
> repl:l []    [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: 
> true createdTime: 1477052816174]
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 
> work:\x01\x00\x00\x00\x17peer_instance\x01\x00\x00\x00\x025g\x01\x00\x00\x00\x01j
>  []    [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true 
> createdTime: 1477052819752]
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 
> work:\x01\x00\x00\x00\x17peer_instance\x01\x00\x00\x00\x025h\x01\x00\x00\x00\x01k
>  []    [begin: 0 end: 0 infiniteEnd: true closed: true createdTime: 
> 1477052816238]
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 
> work:\x01\x00\x00\x00\x17peer_instance\x01\x00\x00\x00\x025i\x01\x00\x00\x00\x01l
>  []    [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true 
> createdTime: 1477052816174]
> {noformat}
> *HDFS*
> {noformat}
> hdfs dfs -ls 
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19
> -rwxr-xr-x   3 ubuntu supergroup 1117650900 2016-10-24 13:09 
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397
> -rwxr-xr-x   3 ubuntu supergroup 1171968390 2016-10-21 12:31 
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19
> {noformat}


