[ https://issues.apache.org/jira/browse/HDDS-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain resolved HDDS-2347.
-------------------------------
    Fix Version/s: 0.5.0
       Resolution: Fixed

> XCeiverClientGrpc's parallel use leads to NPE
> ---------------------------------------------
>
>                 Key: HDDS-2347
>                 URL: https://issues.apache.org/jira/browse/HDDS-2347
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>          Components: Ozone Client
>            Reporter: Istvan Fajth
>            Assignee: Istvan Fajth
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.5.0
>
>         Attachments: changes.diff, logs.txt
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> This issue came up when testing Hive with ORC tables on an Ozone storage 
> backend. So far I could not reproduce it locally within a JUnit test, but the 
> issue does show up on a cluster.
> I am attaching a diff file that shows the logging I have added in 
> XceiverClientGrpc and in KeyInputStream to get the results that led me to the 
> following understanding of the scenario:
> - Hive starts a couple of threads to work on the table data during query 
> execution
> - There is one RPCClient that is being used by these threads
> - The threads are opening different streams to read from the same key in Ozone
> - The InputStreams internally are using the same XceiverClientGrpc
> - XceiverClientGrpc intermittently throws the following NPE (a simplified 
> sketch of the race follows the stack trace):
> {code}
> Caused by: java.lang.NullPointerException
>         at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandAsync(XceiverClientGrpc.java:398)
>         at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:295)
>         at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:259)
>         at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:242)
>         at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.getBlock(ContainerProtocolCalls.java:118)
>         at org.apache.hadoop.hdds.scm.storage.BlockInputStream.getChunkInfos(BlockInputStream.java:169)
>         at org.apache.hadoop.hdds.scm.storage.BlockInputStream.initialize(BlockInputStream.java:118)
>         at org.apache.hadoop.hdds.scm.storage.BlockInputStream.read(BlockInputStream.java:224)
>         at org.apache.hadoop.ozone.client.io.KeyInputStream.read(KeyInputStream.java:173)
>         at org.apache.hadoop.fs.ozone.OzoneFSInputStream.read(OzoneFSInputStream.java:52)
>         at org.apache.hadoop.fs.FSInputStream.read(FSInputStream.java:75)
>         at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121)
>         at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:112)
>         at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:555)
>         at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
>         at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:61)
>         at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:105)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:1708)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.callInternal(OrcInputFormat.java:1596)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.access$2900(OrcInputFormat.java:1383)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1568)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1565)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1565)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1383)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> {code}
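> To illustrate the race in a self-contained way, here is a minimal, simplified 
> sketch. It is not the actual Ozone code; LazyClient is just a stand-in for the 
> lazily connecting XceiverClientGrpc, but it shows how a second thread can pass 
> the connected check while the stub is still null and hit the NPE:
> {code}
> // Stand-alone sketch of the race: LazyClient plays the role of a lazily
> // connecting client (heavily simplified stand-in, not XceiverClientGrpc).
> // The connected flag is published before the stub is created, so a
> // concurrent caller can pass the check and dereference a null stub.
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
>
> public class LazyClientRace {
>
>   static class LazyClient {
>     private volatile boolean connected = false;
>     private Object stub; // created during connect(), used by every request
>
>     Object sendCommandAsync() {
>       if (!connected) {
>         connect(); // the first request "connects" the client
>       }
>       // Another thread may see connected == true while stub is still null.
>       return stub.toString(); // intermittent NullPointerException here
>     }
>
>     private void connect() {
>       connected = true; // the flag flips before the stub actually exists
>       try {
>         Thread.sleep(10); // simulate the latency of opening a gRPC channel
>       } catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>       }
>       stub = new Object();
>     }
>   }
>
>   public static void main(String[] args) throws InterruptedException {
>     LazyClient client = new LazyClient(); // shared, like the cached SPI client
>     ExecutorService pool = Executors.newFixedThreadPool(4);
>     for (int i = 0; i < 4; i++) {
>       pool.execute(() -> System.out.println(client.sendCommandAsync()));
>       Thread.sleep(2); // small stagger; in the real workload timing decides
>     }
>     pool.shutdown();
>     pool.awaitTermination(5, TimeUnit.SECONDS);
>   }
> }
> {code}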
> I have two proposals to fix this issue: one is the easy answer of adding 
> synchronization to the XceiverClientGrpc code; the other one is a bit more 
> involved, so let me explain below.
> Naively I would assume that when I get a client SPI instance from 
> XceiverClientManager, that instance is ready to use. In fact it is not: the 
> client only becomes ready when its user sends the first request. Adding 
> synchronization to this code is the easy solution, but my pragmatic half 
> screams for a better one that ensures the Manager really manages the clients 
> it gives to its users, and the clients themselves do not become ready by 
> accident.
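> Roughly, the easy option means something along these lines, shown on the 
> simplified stand-in above rather than on the real XceiverClientGrpc: guard the 
> lazy connect with a lock so nobody can observe a half-connected client.
> {code}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
>
> // The same stand-in client as above, with the lazy connect guarded by a lock
> // so no caller can observe connected == true while the stub is still null.
> public class SynchronizedLazyClient {
>   private boolean connected = false;
>   private Object stub;
>   private final Object connectLock = new Object();
>
>   Object sendCommandAsync() {
>     synchronized (connectLock) {
>       if (!connected) {
>         connect(); // flag and stub are published together under the lock
>       }
>     }
>     return stub.toString(); // visible and non-null after the lock is released
>   }
>
>   private void connect() {
>     try {
>       Thread.sleep(10); // simulate opening the gRPC channel
>     } catch (InterruptedException e) {
>       Thread.currentThread().interrupt();
>     }
>     stub = new Object();
>     connected = true;
>   }
>
>   public static void main(String[] args) throws InterruptedException {
>     SynchronizedLazyClient client = new SynchronizedLazyClient();
>     ExecutorService pool = Executors.newFixedThreadPool(4);
>     for (int i = 0; i < 4; i++) {
>       pool.execute(() -> System.out.println(client.sendCommandAsync()));
>     }
>     pool.shutdown();
>     pool.awaitTermination(5, TimeUnit.SECONDS);
>   }
> }
> {code}
> This works, but it treats the symptom inside the client rather than fixing who 
> owns the connect step.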
> I am working on a proposal that moves things around a bit, and I am looking 
> for other possible solutions that do not feel as hacky as the easy one; a 
> rough sketch of the direction follows.
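> In very rough terms the direction is that the Manager only hands out clients 
> that are already connected, so readiness stops being a side effect of the 
> first request. The names below are made up for illustration and this is not 
> the final proposal:
> {code}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.function.Function;
>
> // Rough sketch of the manager-side idea (made-up names, not the real
> // XceiverClientManager): the manager owns the connect step, so every client
> // it hands out is fully ready before any caller can send a request on it.
> public class EagerClientManager {
>
>   interface Client {
>     void connect(); // open channels / create stubs
>     Object sendCommandAsync(String request);
>   }
>
>   private final Map<String, Client> cache = new ConcurrentHashMap<>();
>   private final Function<String, Client> factory;
>
>   EagerClientManager(Function<String, Client> factory) {
>     this.factory = factory;
>   }
>
>   // computeIfAbsent creates and connects the client at most once per key,
>   // and concurrent callers block until the connected client is in the cache.
>   Client acquireClient(String pipelineKey) {
>     return cache.computeIfAbsent(pipelineKey, key -> {
>       Client client = factory.apply(key);
>       client.connect(); // readiness handled here, not lazily on first use
>       return client;
>     });
>   }
>
>   public static void main(String[] args) {
>     EagerClientManager manager = new EagerClientManager(key -> new Client() {
>       private Object stub;
>       public void connect() { stub = new Object(); }
>       public Object sendCommandAsync(String request) { return stub; }
>     });
>     // The returned client is already connected; parallel users cannot race
>     // the setup any more.
>     System.out.println(
>         manager.acquireClient("pipeline-1").sendCommandAsync("getBlock"));
>   }
> }
> {code}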
> I am attaching the following:
> - a diff that shows the extended logging added in XceiverClientGrpc and 
> KeyInputStream
> - a job log snippet from a Hive query on a cluster that shows the relevant 
> output from the extensive logging added by the diff
> - later, a proposal for the fix, which I need to work on a bit more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
