Re: Release plans for Hadoop 2.10.0

2019-09-27 Thread Viral Bajaria
Hi Wei-Chiu,

Just got a chance to pull the jstack on a datanode that's showing high
ReadBlockOpAvgTime and I found a few different scenarios:

A lot of DataXceiver threads stay in the following state for a while:

   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x7f9ad9953f68> (a java.util.concurrent.locks.ReentrantLock$FairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at org.apache.hadoop.util.InstrumentedLock.lock(InstrumentedLock.java:100)
at org.apache.hadoop.util.AutoCloseableLock.acquire(AutoCloseableLock.java:67)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.acquireDatasetLock(FsDatasetImpl.java:3383)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:255)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:593)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:288)
at java.lang.Thread.run(Thread.java:748)
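
To make the pattern above concrete: each of these threads is constructing a
BlockSender for a readBlock request and has to go through FsDatasetImpl's
dataset-wide lock, which (per the ReentrantLock$FairSync frames) is a fair
ReentrantLock wrapped by InstrumentedLock and AutoCloseableLock. Below is a
simplified, self-contained sketch of that acquire pattern; the class and
method names are illustrative, not the actual Hadoop implementation.

import java.util.concurrent.locks.ReentrantLock;

// Simplified sketch of the locking pattern suggested by the trace above:
// every BlockSender setup goes through one dataset-wide lock, so a single
// long hold serializes all pending readBlock requests.
public class DatasetLockSketch {

    // fair=true matches the ReentrantLock$FairSync frames in the jstack
    private final ReentrantLock datasetLock = new ReentrantLock(true);

    // Analogue of acquireDatasetLock() + AutoCloseableLock: take the lock
    // and hand back something try-with-resources can release.
    public AutoCloseable acquireDatasetLock() {
        datasetLock.lock();
        return datasetLock::unlock;
    }

    // Analogue of the BlockSender constructor: per-read replica lookup
    // happens under the global lock, which is where the parked
    // DataXceiver threads are queued.
    public void setUpBlockSender(long blockId) throws Exception {
        try (AutoCloseable ignored = acquireDatasetLock()) {
            // look up replica metadata, open the block file, etc.
        }
    }
}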

For some reason I can't find any thread that actually holds the lock on
0x7f9ad9953f68, so it's hard to say conclusively why these threads can't
acquire it.
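
One generic JVM detail that may explain why the owner is invisible (this is
a general observation, not anything specific to this cluster): a plain
jstack shows what each thread is waiting for, but the threads that currently
own ReentrantLock-style synchronizers are only listed in the "Locked ownable
synchronizers" section that jstack -l adds. The same information is
available in-process through ThreadMXBean; here is a hedged, generic sketch
(the class name LockOwnerDump is made up, this is not Hadoop code):

import java.lang.management.LockInfo;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Generic diagnostic sketch: list which threads currently own
// java.util.concurrent ownable synchronizers (e.g. ReentrantLock syncs).
public class LockOwnerDump {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // lockedMonitors=true, lockedSynchronizers=true
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            LockInfo[] owned = info.getLockedSynchronizers();
            if (owned.length == 0) {
                continue;
            }
            System.out.println(info.getThreadName() + " owns:");
            for (LockInfo lock : owned) {
                // Prints class@identityHashCode, which is not the same hex
                // value as the heap address jstack shows for the lock.
                System.out.println("  " + lock);
            }
        }
    }
}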

On a second attempt, I was able to find a bunch of threads BLOCKED on
MetricsRegistry:

   java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.metrics2.lib.MetricsRegistry.snapshot(MetricsRegistry.java)
- waiting to lock <0x7f9ad80a2e08> (a org.apache.hadoop.metrics2.lib.MetricsRegistry)
at org.apache.hadoop.metrics2.lib.MetricsSourceBuilder$1.getMetrics(MetricsSourceBuilder.java:80)
at org.apache.hadoop.metrics2.impl.MetricsSourceAdapter.getMetrics(MetricsSourceAdapter.java:200)
at org.apache.hadoop.metrics2.impl.MetricsSourceAdapter.updateJmxCache(MetricsSourceAdapter.java:183)
at org.apache.hadoop.metrics2.impl.MetricsSourceAdapter.getAttribute(MetricsSourceAdapter.java:107)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:647)
at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:678)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1445)
at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:76)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1309)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1401)
at javax.management.remote.rmi.RMIConnectionImpl.getAttribute(RMIConnectionImpl.java:639)
at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
at sun.rmi.transport.Transport$1.run(Transport.java:200)
at sun.rmi.transport.Transport$1.run(Transport.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$14/515379090.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
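
The "waiting to lock <0x7f9ad80a2e08> (a ...MetricsRegistry)" frames are
consistent with snapshot() being synchronized on the registry, so every
JMX/RMI handler thread that ends up in getAttribute has to take that one
monitor and queues behind whichever thread is currently snapshotting. A
standalone sketch of that contention pattern (illustrative names only, not
the actual Hadoop metrics code):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Standalone sketch of the contention pattern in the trace above: one
// monitor guards every snapshot, so concurrent "JMX" pollers serialize and
// the waiters show up as BLOCKED in a jstack.
public class MetricsRegistrySketch {

    // Stand-in for a metrics registry with a synchronized snapshot.
    static class Registry {
        synchronized void snapshot() throws InterruptedException {
            Thread.sleep(500); // pretend snapshotting many metrics is slow
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        ExecutorService jmxPollers = Executors.newFixedThreadPool(4);
        // Four concurrent pollers: only one makes progress at a time.
        for (int i = 0; i < 4; i++) {
            jmxPollers.submit(() -> {
                registry.snapshot();
                return null;
            });
        }
        jmxPollers.shutdown();
    }
}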

Thanks,
Viral


On Thu, Sep 26, 2019 at 8:18 PM Viral Bajaria 
wrote:

> Hi Wei-Chiu,
>
> I came across HDFS-14476 when searching if anyone else is seeing the same
> issues as us. I didn't see it merged and 

Re: Release plans for Hadoop 2.10.0

2019-09-26 Thread Viral Bajaria
Hi Wei-Chiu,

I came across HDFS-14476 when searching to see if anyone else is seeing the
same issues as us. I didn't see it merged, so I assumed there are other
fixes done around what you mention in there.

HDFS-11187 is already applied in 2.9.1 and we are on 2.9.2, so it might not
be impacting us, though I need to be 100% sure of that.

I will grab the jstack as soon as I am on the server again, but I have the
following WARN statement that I felt could point to the issue. By the way,
we have tested the disk I/O, and when these slowdowns happen the disks are
hardly under any kind of I/O pressure. On restart of the datanode process
everything goes back to normal, i.e. the disks are able to do high I/O again.

2019-09-27 00:48:14,101 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
Lock held time above threshold: lock identifier:
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl
lockHeldTimeMs=12628 ms. Suppressed 1 lock warnings. The stack trace is:
java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1021)
org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:143)
org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186)
org.apache.hadoop.util.InstrumentedLock.unlock(InstrumentedLock.java:133)
org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84)
org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96)
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeBlock(FsDatasetImpl.java:1781)
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.finalizeBlock(BlockReceiver.java:1517)
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1474)
java.lang.Thread.run(Thread.java:748)
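
For context, that WARN is InstrumentedLock reporting that the dataset lock
was held for about 12.6 seconds inside finalizeBlock. The general idea
behind such an instrumented lock is just to timestamp the acquire and log on
release when the hold time crosses a threshold; a rough illustrative sketch
follows (this is not the actual org.apache.hadoop.util.InstrumentedLock
source):

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Rough sketch of an "instrumented" lock: time each hold and warn when it
// exceeds a threshold. Illustrative only; the real Hadoop class also
// rate-limits the warnings (hence "Suppressed 1 lock warnings").
public class TimedLockSketch {
    private final Lock delegate = new ReentrantLock();
    private final long warnThresholdMs;
    private long acquiredAtMs; // only read/written while holding the lock

    public TimedLockSketch(long warnThresholdMs) {
        this.warnThresholdMs = warnThresholdMs;
    }

    public void lock() {
        delegate.lock();
        acquiredAtMs = System.currentTimeMillis();
    }

    public void unlock() {
        long heldMs = System.currentTimeMillis() - acquiredAtMs;
        delegate.unlock();
        if (heldMs > warnThresholdMs) {
            System.err.println("Lock held time above threshold: "
                + heldMs + " ms");
        }
    }
}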


Thanks,
Viral


On Thu, Sep 26, 2019 at 7:03 PM Wei-Chiu Chuang 
wrote:

> or maybe https://issues.apache.org/jira/browse/HDFS-14476
>
> I reverted this fix and I've not looked at it further. But take a look.
> It's disappointing to me that none of the active Hadoop contributors seem
> to understand DirectoryScanner well enough.
>
> or HDFS-11187 
>
> If you can post a jstack snippet I might be able to help out.
> On Thu, Sep 26, 2019 at 9:48 PM Viral Bajaria 
> wrote:
>
>> Thanks for the quick response Jonathan.
>>
>> Honestly, I am not sure if 2.10.0 will fix my issue, but it looks similar
>> to https://issues.apache.org/jira/browse/HDFS-14536 which is not fixed
>> yet, so we will probably not see the benefit.
>>
>> We need to dig more into the logs and jstack and create a JIRA to see if
>> the developers can comment on what's going on. A datanode restart fixes
>> the latency issue, so it is something that happens over time and needs the
>> right instrumentation to figure it out!
>>
>> Thanks,
>> Viral
>>
>>
>> On Thu, Sep 26, 2019 at 6:41 PM Jonathan Hung 
>> wrote:
>>
>> > Hey Viral, yes, we're working on a 2.10.0 release; I'm the release
>> > manager for that. I can't comment on the particular issue you're seeing,
>> > but I plan to start the release process for 2.10.0 early next week, and
>> > then 2.10.0 will be released shortly after that (assuming all goes well).
>> >
>> > Thanks,
>> > Jonathan Hung
>> >
>> >
>> > On Thu, Sep 26, 2019 at 6:34 PM Viral Bajaria 
>> > wrote:
>> >
>> >> (Cross posting from user list based on feedback by Sean Busbey)
>> >>
>> >> All,
>> >>
>> >> Just saw the announcement of the new Hadoop 3.2.1 release.
>> >> Congratulations, team, and thanks for all the hard work!
>> >>
>> >> Are we going to see a new release in the 2.x.x line?
>> >>
>> >> I noticed a bunch of tickets that have been resolved in the last year
>> >> have been tagged with 2.10.0 as a fix version, and it's been a while
>> >> since 2.9.2 was released, so I was wondering if we are going to see a
>> >> 2.10.0 release soon, or should we start looking to upgrade to the 3.x
>> >> line?
>> >>
>> >> The reason I ask is that we are seeing very high ReadBlockOp Latency on
>> >> our datanodes and feel that the issue is due to some locking going on
>> >> between VolumeScanner, DirectoryScanner, RW Block Operations and
>> >> MetricsRegistry. Looking at a few JIRAs it looks like 2.10.0 might have
>> >> some fixes that we should try, not fully sure yet!
>> >>
>> >> Thanks,
>> >> Viral
>> >>
>> >
>>
>


Re: Release plans for Hadoop 2.10.0

2019-09-26 Thread Wei-Chiu Chuang
or maybe https://issues.apache.org/jira/browse/HDFS-14476

I reverted this fix and I've not looked at it further. But take a look.
It's disappointing to me that none of the active Hadoop contributors seem
to understand DirectoryScanner well enough.

or HDFS-11187 

If you can post a jstack snippet I might be able to help out.
On Thu, Sep 26, 2019 at 9:48 PM Viral Bajaria 
wrote:

> Thanks for the quick response Jonathan.
>
> Honestly, I am not sure if 2.10.0 will fix my issue, but it looks similar
> to https://issues.apache.org/jira/browse/HDFS-14536 which is not fixed yet,
> so we will probably not see the benefit.
>
> We need to dig more into the logs and jstack and create a JIRA to see if
> the developers can comment on what's going on. A datanode restart fixes the
> latency issue, so it is something that happens over time and needs the
> right instrumentation to figure it out!
>
> Thanks,
> Viral
>
>
> On Thu, Sep 26, 2019 at 6:41 PM Jonathan Hung 
> wrote:
>
> > Hey Viral, yes, we're working on a 2.10.0 release; I'm the release
> > manager for that. I can't comment on the particular issue you're seeing,
> > but I plan to start the release process for 2.10.0 early next week, and
> > then 2.10.0 will be released shortly after that (assuming all goes well).
> >
> > Thanks,
> > Jonathan Hung
> >
> >
> > On Thu, Sep 26, 2019 at 6:34 PM Viral Bajaria 
> > wrote:
> >
> >> (Cross posting from user list based on feedback by Sean Busbey)
> >>
> >> All,
> >>
> >> Just saw the announcement of the new Hadoop 3.2.1 release.
> >> Congratulations, team, and thanks for all the hard work!
> >>
> >> Are we going to see a new release in the 2.x.x line?
> >>
> >> I noticed a bunch of tickets that have been resolved in the last year
> >> have been tagged with 2.10.0 as a fix version, and it's been a while
> >> since 2.9.2 was released, so I was wondering if we are going to see a
> >> 2.10.0 release soon, or should we start looking to upgrade to the 3.x
> >> line?
> >>
> >> The reason I ask is that we are seeing very high ReadBlockOp Latency on
> >> our datanodes and feel that the issue is due to some locking going on
> >> between VolumeScanner, DirectoryScanner, RW Block Operations and
> >> MetricsRegistry. Looking at a few JIRAs it looks like 2.10.0 might have
> >> some fixes that we should try, not fully sure yet!
> >>
> >> Thanks,
> >> Viral
> >>
> >
>


Re: Release plans for Hadoop 2.10.0

2019-09-26 Thread Viral Bajaria
Thanks for the quick response Jonathan.

Honestly, I am not sure if 2.10.0 will fix my issue, but it looks similar to
https://issues.apache.org/jira/browse/HDFS-14536 which is not fixed yet, so
we will probably not see the benefit.

We need to dig more into the logs and jstack and create a JIRA to see if the
developers can comment on what's going on. A datanode restart fixes the
latency issue, so it is something that happens over time and needs the right
instrumentation to figure it out!

Thanks,
Viral


On Thu, Sep 26, 2019 at 6:41 PM Jonathan Hung  wrote:

> Hey Viral, yes, we're working on a 2.10.0 release; I'm the release manager
> for that. I can't comment on the particular issue you're seeing, but I plan
> to start the release process for 2.10.0 early next week, and then 2.10.0
> will be released shortly after that (assuming all goes well).
>
> Thanks,
> Jonathan Hung
>
>
> On Thu, Sep 26, 2019 at 6:34 PM Viral Bajaria 
> wrote:
>
>> (Cross posting from user list based on feedback by Sean Busbey)
>>
>> All,
>>
>> Just saw the announcement of the new Hadoop 3.2.1 release.
>> Congratulations, team, and thanks for all the hard work!
>>
>> Are we going to see a new release in the 2.x.x line?
>>
>> I noticed a bunch of tickets that have been resolved in the last year have
>> been tagged with 2.10.0 as a fix version, and it's been a while since
>> 2.9.2 was released, so I was wondering if we are going to see a 2.10.0
>> release soon, or should we start looking to upgrade to the 3.x line?
>>
>> The reason I ask is that we are seeing very high ReadBlockOp Latency on
>> our datanodes and feel that the issue is due to some locking going on
>> between VolumeScanner, DirectoryScanner, RW Block Operations and
>> MetricsRegistry. Looking at a few JIRAs it looks like 2.10.0 might have
>> some fixes that we should try, not fully sure yet!
>>
>> Thanks,
>> Viral
>>
>


Re: Release plans for Hadoop 2.10.0

2019-09-26 Thread Jonathan Hung
Hey Viral, yes, we're working on a 2.10.0 release; I'm the release manager
for that. I can't comment on the particular issue you're seeing, but I plan
to start the release process for 2.10.0 early next week, and then 2.10.0
will be released shortly after that (assuming all goes well).

Thanks,
Jonathan Hung


On Thu, Sep 26, 2019 at 6:34 PM Viral Bajaria 
wrote:

> (Cross posting from user list based on feedback by Sean Busbey)
>
> All,
>
> Just saw the announcement of the new Hadoop 3.2.1 release. Congratulations,
> team, and thanks for all the hard work!
>
> Are we going to see a new release in the 2.x.x line?
>
> I noticed a bunch of tickets that have been resolved in the last year have
> been tagged with 2.10.0 as a fix version, and it's been a while since 2.9.2
> was released, so I was wondering if we are going to see a 2.10.0 release
> soon, or should we start looking to upgrade to the 3.x line?
>
> The reason I ask is that we are seeing very high ReadBlockOp Latency on our
> datanodes and feel that the issue is due to some locking going on between
> VolumeScanner, DirectoryScanner, RW Block Operations and MetricsRegistry.
> Looking at a few JIRAs it looks like 2.10.0 might have some fixes that we
> should try, not fully sure yet!
>
> Thanks,
> Viral
>