[ 
https://issues.apache.org/jira/browse/HDFS-16586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack resolved HDFS-16586.
----------------------------------
    Fix Version/s: 3.4.0
                   3.2.4
                   3.3.4
     Hadoop Flags: Reviewed
       Resolution: Fixed

Merged to branch-3, branch-3.3, and to branch-3.2. Thank you for the review 
[~hexiaoqiao] 

> Purge FsDatasetAsyncDiskService threadgroup; it causes 
> BPServiceActor$CommandProcessingThread IllegalThreadStateException 'fatal 
> exception and exit' 
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16586
>                 URL: https://issues.apache.org/jira/browse/HDFS-16586
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.3.0, 3.2.3
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.2.4, 3.3.4
>
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> The below failed block finalize is causing a downstreamer's test to fail when 
> it uses hadoop 3.2.3 or 3.3.0+:
> {code:java}
> 2022-05-19T18:21:08,243 INFO  [Command processor] 
> impl.FsDatasetAsyncDiskService(234): Scheduling blk_1073741840_1016 replica 
> FinalizedReplica, blk_1073741840_1016, FINALIZED
>   getNumBytes()     = 52
>   getBytesOnDisk()  = 52
>   getVisibleLength()= 52
>   getVolume()       = 
> /Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2
>   getBlockURI()     = 
> file:/Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2/current/BP-62743752-127.0.0.1-1653009535881/current/finalized/subdir0/subdir0/blk_1073741840
>  for deletion
> 2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] 
> metrics.TopMetrics(134): a metric is reported: cmd: delete user: stack.hfs.0 
> (auth:SIMPLE)
> 2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] 
> top.TopAuditLogger(78): ------------------- logged event for top service: 
> allowed=true ugi=stack.hfs.0 (auth:SIMPLE) ip=/127.0.0.1 cmd=delete  
> src=/user/stack/test-data/b8167d53-bcd7-c682-a767-55faaf7f3e96/data/default/t1/4499521075f51d5138fe4f1916daf92d/.tmp
>   dst=null  perm=null
> 2022-05-19T18:21:08,243 DEBUG [PacketResponder: 
> BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1645): 
> PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE, replyAck=seqno: 901 reply: SUCCESS 
> downstreamAckTimeNanos: 0 flag: 0
> 2022-05-19T18:21:08,243 DEBUG [PacketResponder: 
> BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1327): 
> PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE: seqno=-2 waiting for local datanode to finish write.
> 2022-05-19T18:21:08,243 ERROR [Command processor] 
> datanode.BPServiceActor$CommandProcessingThread(1276): Command processor 
> encountered fatal exception and exit.
> java.lang.IllegalThreadStateException: null
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:865) ~[?:?]
>   at java.lang.Thread.<init>(Thread.java:430) ~[?:?]
>   at java.lang.Thread.<init>(Thread.java:704) ~[?:?]
>   at java.lang.Thread.<init>(Thread.java:525) ~[?:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService$1.newThread(FsDatasetAsyncDiskService.java:113)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:623)
>  ~[?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:912)
>  ~[?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343) 
> ~[?:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:189)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:238)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2184)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2103)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:736)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:682)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1318)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1364)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1291)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1274)
>  ~[hadoop-hdfs-3.2.3.jar:?]
> 2022-05-19T18:21:08,243 DEBUG [DataXceiver for client 
> DFSClient_NONMAPREDUCE_1036441577_16 at /127.0.0.1:54879 [Receiving block 
> BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006]] 
> datanode.BlockReceiver(532): Receiving one packet for block 
> BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006: PacketHeader with 
> packetLen=750 header data: offsetInBlock: 495104
> seqno: 902
> lastPacketInBlock: false
> dataLen: 7382022-05-19T18:21:08,243 WARN  [Command processor] 
> datanode.BPServiceActor$CommandProcessingThread(1279): Ending command 
> processor service for: Thread[Command processor,5,FailOnTimeoutGroup] {code}
> Above, we are trying to run a FINALIZED 'command' asynchronously using a 
> thread from the volume's dedicated executor. The executor is trying to create 
> a thread because it has none-to-hand as it has not seen action in over 
> {color:#660e7a}THREADS_KEEP_ALIVE_SECONDS{color} (60seconds) so all live 
> threads have been let go (The volume executor as configured allows core 
> threads to die also). The thread creation fails because the 
> FsDatasetAsyncDiskService#threadGroup we are trying to associate the new 
> thread with is 'destroyed'. Threadgroups auto-'destroy' themselves after 
> their last member has 
> died|[https://github.com/openjdk/jdk11/blob/master/src/java.base/share/classes/java/lang/ThreadGroup.java#L846]|https://github.com/openjdk/jdk11/blob/master/src/java.base/share/classes/java/lang/ThreadGroup.java#L846].]
>  which can happen given the way this volume executor is configured. This 
> FsDatasetAsyncDiskService#threadGroup goes unexploited. If no thread group is 
> supplied, the executor makes a default that internally associates any created 
> threads with a dedicated executor 'threadgroup' but given how our executor is 
> configured to let all threads die in the name of economizing on total unning 
> threads, a threadgroup-for-each-volume-executor is likely to have the same 
> issue as this  FsDatasetAsyncDiskService#threadGroup one. Let me put up a PR 
> to remove it and undo the cause of the above ITSE.
> Of interest perhaps is that this IllegalThreadStateException is not new to 
> 3.2.3/3.0.0+. See the below from 3.2.2 when I run the same downstream test:
> {{2022-05-19T17:06:11,140 ERROR [BP-1924619834-10.0.0.192-1653005021952 
> heartbeating to localhost/127.0.0.1:54360] datanode.BPServiceActor(856): 
> Exception in BPOfferService for Block pool 
> BP-1924619834-10.0.0.192-1653005021952 (Datanode Uuid 
> bb5219c0-a444-4e5a-9455-6176ced7ffe7) service to localhost/127.0.0.1:54360}}
> {{java.lang.IllegalThreadStateException: null}}
> {{  at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:865) ~[?:?]}}
> {{  at java.lang.Thread.<init>(Thread.java:430) ~[?:?]}}
> {{  at java.lang.Thread.<init>(Thread.java:704) ~[?:?]}}
> {{  at java.lang.Thread.<init>(Thread.java:525) ~[?:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService$1.newThread(FsDatasetAsyncDiskService.java:105)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:623)
>  ~[?:?]}}
> {{  at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:912)
>  ~[?:?]}}
> {{  at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343) 
> ~[?:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:181)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:230)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2155)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2074)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:736)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:682)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:888)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:854)
>  ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at java.lang.Thread.run(Thread.java:829) ~[?:?]}}
> It too runs into the IllegalThreadStateException.
> The difference is HDFS-14997 which nicely made command processing async. It 
> changed how commands work (of course). Now on an ITSE, the processor shuts 
> down completely with ERROR. In my test case, the FINALIZED of the block fails 
> and the procedure waiting on its completion times out, in turn timing out the 
> test. In the old days, before HDFS-14997, the then synchronus heartbeats 
> would pick up the ITSE earlier and would just keep retrying in spite of ITSE 
> – the block finalize happened to complete and the test passed (I did not do 
> deep analysis of how command-processnig behavior has changed. If interest, 
> I'd be up for digging in).
> {color:#660e7a} {color}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to