[jira] [Commented] (HBASE-9393) Hbase does not closing a closed socket resulting in many CLOSE_WAIT

2016-01-18 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105939#comment-15105939
 ] 

Colin Patrick McCabe commented on HBASE-9393:
-

Unfortunately, this is kind of a complex topic.

In HDFS, sockets for input streams are managed by the {{Peer}} class.  
{{Peers}} can either be "owned" by {{DFSInputStream}} objects, or stored in the 
{{PeerCache}}.  The {{PeerCache}} already has appropriate timeouts and won't 
keep open too many sockets.  However, there is no limit to how long a 
{{DFSInputStream}} could hold on to a {{Peer}}.

There are a few ways to minimize the number of open peers.
1. If HBase only ever called positional read (pread), the {{DFSInputStream}} 
object would never own a {{Peer}}, so this issue would not arise.
2. If HBase called {{DFSInputStream#unbuffer}}, any open peers would be closed, 
even though the stream would continue to be open.
3. If HDFS had a timeout for how long it would hold onto a {{Peer}}, that could 
limit the number of open sockets.

Configuring HBase to periodically close open streams is not necessary; it's 
strictly worse than option #2.

I believe there is an option to do #1 even right now.  Can't HBase be 
configured to use only pread and never the stateful seek+read path?  #2 would 
require a code change to HBase; #3 would require a code change to HDFS.
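
To make the difference concrete, here's a minimal sketch of options #1 and #2 
against the {{FSDataInputStream}} API (this is not HBase code, and the path is 
made up):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PreadVsRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    try (FSDataInputStream in = fs.open(new Path("/hbase/example/hfile"))) {
      byte[] buf = new byte[4096];

      // Option #1: positional read (pread) is stateless, so the underlying
      // DFSInputStream never takes ownership of a Peer.
      in.read(0L, buf, 0, buf.length);

      // seek+read is stateful; the stream may hold on to a Peer afterwards.
      in.seek(0L);
      in.read(buf, 0, buf.length);

      // Option #2: unbuffer() releases any held sockets and buffers while
      // the stream itself stays open (CanUnbuffer, Hadoop 2.7+).
      in.unbuffer();
    }
  }
}
{code}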

Are you running out of file descriptors?  What's the user-visible problem here?

> Hbase does not closing a closed socket resulting in many CLOSE_WAIT 
> 
>
> Key: HBASE-9393
> URL: https://issues.apache.org/jira/browse/HBASE-9393
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.94.2, 0.98.0
> Environment: Centos 6.4 - 7 regionservers/datanodes, 8 TB per node, 
> 7279 regions
>Reporter: Avi Zrachya
>
> HBase does not close a dead connection with the datanode.
> This results in over 60K sockets in CLOSE_WAIT, and at some point HBase can 
> not connect to the datanode because there are too many mapped sockets from 
> one host to another on the same port.
> The example below shows a low CLOSE_WAIT count because we had to restart 
> hbase to solve the problem; over time it will increase to 60-100K sockets 
> in CLOSE_WAIT.
> [root@hd2-region3 ~]# netstat -nap |grep CLOSE_WAIT |grep 21592 |wc -l
> 13156
> [root@hd2-region3 ~]# ps -ef |grep 21592
> root     17255 17219  0 12:26 pts/0    00:00:00 grep 21592
> hbase    21592     1 17 Aug29 ?        03:29:06 
> /usr/java/jdk1.6.0_26/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx8000m 
> -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode 
> -Dhbase.log.dir=/var/log/hbase 
> -Dhbase.log.file=hbase-hbase-regionserver-hd2-region3.swnet.corp.log ...





[jira] [Commented] (HBASE-9393) Hbase does not closing a closed socket resulting in many CLOSE_WAIT

2016-01-15 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102383#comment-15102383
 ] 

Colin Patrick McCabe commented on HBASE-9393:
-

The timeout that I'm talking about is inside DFSClient.java, not inside HBase.  
HDFS-4911 fixed a problem where the timeout was too long.  Can you be a little 
bit clearer on what you'd like to implement, and what you see as the problem 
here?

> Hbase does not closing a closed socket resulting in many CLOSE_WAIT 
> 
>
> Key: HBASE-9393
> URL: https://issues.apache.org/jira/browse/HBASE-9393
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.94.2, 0.98.0
> Environment: Centos 6.4 - 7 regionservers/datanodes, 8 TB per node, 
> 7279 regions
>Reporter: Avi Zrachya
>
> HBase does not close a dead connection with the datanode.
> This results in over 60K sockets in CLOSE_WAIT, and at some point HBase can 
> not connect to the datanode because there are too many mapped sockets from 
> one host to another on the same port.
> The example below shows a low CLOSE_WAIT count because we had to restart 
> hbase to solve the problem; over time it will increase to 60-100K sockets 
> in CLOSE_WAIT.
> [root@hd2-region3 ~]# netstat -nap |grep CLOSE_WAIT |grep 21592 |wc -l
> 13156
> [root@hd2-region3 ~]# ps -ef |grep 21592
> root     17255 17219  0 12:26 pts/0    00:00:00 grep 21592
> hbase    21592     1 17 Aug29 ?        03:29:06 
> /usr/java/jdk1.6.0_26/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx8000m 
> -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode 
> -Dhbase.log.dir=/var/log/hbase 
> -Dhbase.log.file=hbase-hbase-regionserver-hd2-region3.swnet.corp.log ...





[jira] [Commented] (HBASE-9393) Hbase does not closing a closed socket resulting in many CLOSE_WAIT

2016-01-14 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098796#comment-15098796
 ] 

Colin Patrick McCabe commented on HBASE-9393:
-

The client should be configured so that it closes sockets a short time after 
the server does.  In other words, its timeout should be slightly longer than 
the server's.  Suggest checking your timeout configuration (this was too long 
in older versions of Hadoop).
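
For reference, the relationship might look like this (these are the standard 
HDFS keys, but the values here are made up):
{code}
import org.apache.hadoop.conf.Configuration;

// Sketch: keep the client-side cache expiry slightly longer than the
// datanode's keepalive, so the client closes its end shortly after the
// server does.  Illustrative values only.
Configuration conf = new Configuration();
conf.setInt("dfs.datanode.socket.reuse.keepalive", 4000);  // server side, ms
conf.setInt("dfs.client.socketcache.expiryMsec", 5000);    // client side, ms
{code}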

> Hbase does not closing a closed socket resulting in many CLOSE_WAIT 
> 
>
> Key: HBASE-9393
> URL: https://issues.apache.org/jira/browse/HBASE-9393
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.94.2, 0.98.0
> Environment: Centos 6.4 - 7 regionservers/datanodes, 8 TB per node, 
> 7279 regions
>Reporter: Avi Zrachya
>
> HBase does not close a dead connection with the datanode.
> This results in over 60K sockets in CLOSE_WAIT, and at some point HBase can 
> not connect to the datanode because there are too many mapped sockets from 
> one host to another on the same port.
> The example below shows a low CLOSE_WAIT count because we had to restart 
> hbase to solve the problem; over time it will increase to 60-100K sockets 
> in CLOSE_WAIT.
> [root@hd2-region3 ~]# netstat -nap |grep CLOSE_WAIT |grep 21592 |wc -l
> 13156
> [root@hd2-region3 ~]# ps -ef |grep 21592
> root     17255 17219  0 12:26 pts/0    00:00:00 grep 21592
> hbase    21592     1 17 Aug29 ?        03:29:06 
> /usr/java/jdk1.6.0_26/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx8000m 
> -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode 
> -Dhbase.log.dir=/var/log/hbase 
> -Dhbase.log.file=hbase-hbase-regionserver-hd2-region3.swnet.corp.log ...





[jira] [Updated] (HBASE-14451) Move on to htrace-4.0.1 (from htrace-3.2.0)

2015-09-26 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HBASE-14451:
-
Summary: Move on to htrace-4.0.1 (from htrace-3.2.0)  (was: Move on to 
htrace-4.0.0 (from htrace-3.2.0))

> Move on to htrace-4.0.1 (from htrace-3.2.0)
> ---
>
> Key: HBASE-14451
> URL: https://issues.apache.org/jira/browse/HBASE-14451
> Project: HBase
>  Issue Type: Task
>Reporter: stack
>Assignee: stack
> Attachments: 14451.txt, 14451v2.txt, 14451v3.txt, 14451v4.txt, 
> 14451v5.txt, 14451v6.txt, 14451v7.txt, 14451v8.txt, 14451v9.txt
>
>
> htrace-4.0.0 was just released with a new API. Get up on it.





[jira] [Commented] (HBASE-14451) Move on to htrace-4.0.0 (from htrace-3.2.0)

2015-09-22 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903324#comment-14903324
 ] 

Colin Patrick McCabe commented on HBASE-14451:
--

Thanks for this, [~stack].

{{ResultBoundedCompletionService}}: it seems like {{Tracer}} should be an 
argument to the constructor here, rather than pulled from 
{{Tracer#curThreadTracer}}.

{code}
requestHeaderBuilder.setTraceInfo(TracingProtos.RPCTInfo.newBuilder().
    setParentId(spanId.getHigh()).
    setTraceId(spanId.getLow()));
{code}
Not sure if it matters, but for consistency, we should probably set 
{{TraceId}} to {{spanId#getHigh}}, since those are the 64 bits that are 
conserved between parent and child (in single-parent scenarios).  Same comment 
applies in {{RpcClientImpl.java}}.

{code}
protected void tracedWriteRequest(Call call, int priority, TraceScope traceScope)
    throws IOException {
  try {
    writeRequest(call, priority, traceScope);
  } finally {
    if (traceScope != null) traceScope.close();
  }
}
{code}
Do we need this method any more?  It seems like the calls to {{writeRequest}} 
are already wrapped in try...catch blocks that we could use a traceScope with.

{{RpcClientImpl.java}}: there is a lot of awkwardness here with trying to get 
the current thread tracer.  Shouldn't the {{RpcClientImpl}} have its own 
{{Tracer}} object internally and just use that for everything?  Same comment 
for {{RecoverableZooKeeper}}.

{{hbase-default.xml}}: should we also document {{hbase.htrace.sampler.classes}}?

In general, {{Tracer#curThreadTracer}} is a hack.  It may be helpful in some 
legacy code, but ordinarily you should pass tracers around "normally", i.e. 
when the {{HRegionServer}} creates objects to do what it needs to do, it should 
pass them its own tracer.  Remember that worker threads won't have a current 
tracer when they're first created.  It is always safer and cleaner to pass the 
{{Tracer}} object you want in explicitly than to rely on {{curThreadTracer}}.
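
A hypothetical sketch of what that looks like with the htrace-4 API (this is 
not from the patch; the class name is made up):
{code}
import org.apache.htrace.core.TraceScope;
import org.apache.htrace.core.Tracer;

// The Tracer is injected by the owner (e.g. HRegionServer) instead of
// being pulled from Tracer#curThreadTracer, which can be null on a
// freshly created worker thread.
class TracedService {
  private final Tracer tracer;

  TracedService(Tracer tracer) {
    this.tracer = tracer;
  }

  void doWork() {
    try (TraceScope scope = tracer.newScope("TracedService.doWork")) {
      // ... the actual work, traced under the injected tracer ...
    }
  }
}
{code}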

I don't see where we're creating the Tracer for the HBase client.  I only see 
us creating a tracer for the RegionServer.
{code}
cmccabe@keter:~/hbase1> git grep Tracer.Builder
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java:    this.tracer = new Tracer.Builder("RegionServer").
hbase-server/src/test/java/org/apache/hadoop/hbase/PerformanceEvaluation.java:    this.tracer = new Tracer.Builder().name("Client").
hbase-server/src/test/java/org/apache/hadoop/hbase/trace/TestHTraceHooks.java:      new Tracer.Builder().name("test").conf(new HBaseHTraceConfiguration(conf)).build()) {
{code}
(Note that the last two grep results are unit tests, and so don't count here.)

We should trace the HBaseClient as well as the region server.  And probably we 
need another tracer for the HBase Master?

RingBufferTruck: I thought HBase was more of a series of tubes?

> Move on to htrace-4.0.0 (from htrace-3.2.0)
> ---
>
> Key: HBASE-14451
> URL: https://issues.apache.org/jira/browse/HBASE-14451
> Project: HBase
>  Issue Type: Task
>Reporter: stack
>Assignee: stack
> Attachments: 14451.txt, 14451v2.txt, 14451v3.txt, 14451v4.txt, 
> 14451v5.txt, 14451v6.txt, 14451v7.txt
>
>
> htrace-4.0.0 was just released with a new API. Get up on it.





[jira] [Commented] (HBASE-14451) Move on to htrace-4.0.0 (from htrace-3.2.0)

2015-09-21 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901651#comment-14901651
 ] 

Colin Patrick McCabe commented on HBASE-14451:
--

[~stack]: right, to get "cross-cutting" tracing with HTrace you need to have 
the same major version in all your components.  You can still get tracing in 
just HBase with any version you choose.  Hadoop 2.8 will have htrace 4.0.

> Move on to htrace-4.0.0 (from htrace-3.2.0)
> ---
>
> Key: HBASE-14451
> URL: https://issues.apache.org/jira/browse/HBASE-14451
> Project: HBase
>  Issue Type: Task
>Reporter: stack
>Assignee: stack
> Attachments: 14451.txt, 14451v2.txt, 14451v3.txt, 14451v4.txt, 
> 14451v5.txt
>
>
> htrace-4.0.0 was just released with a new API. Get up on it.





[jira] [Commented] (HBASE-9393) Hbase does not closing a closed socket resulting in many CLOSE_WAIT

2015-05-08 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535512#comment-14535512
 ] 

Colin Patrick McCabe commented on HBASE-9393:
-

CDH4.4 had some configuration defaults that weren't the best; they were 
improved in later versions.  It is getting pretty old now, so I would suggest 
just upgrading.  If that's not possible, then you could check out some of the 
recent HBaseCon talks about tuning HBase and HDFS performance.

I think this jira should be closed since I don't see any bug here.  If we get 
more information about something specific we could improve, we can reopen it.

> Hbase does not closing a closed socket resulting in many CLOSE_WAIT 
> 
>
> Key: HBASE-9393
> URL: https://issues.apache.org/jira/browse/HBASE-9393
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.94.2, 0.98.0
> Environment: Centos 6.4 - 7 regionservers/datanodes, 8 TB per node, 
> 7279 regions
>Reporter: Avi Zrachya
>
> HBase does not close a dead connection with the datanode.
> This results in over 60K sockets in CLOSE_WAIT, and at some point HBase can 
> not connect to the datanode because there are too many mapped sockets from 
> one host to another on the same port.
> The example below shows a low CLOSE_WAIT count because we had to restart 
> hbase to solve the problem; over time it will increase to 60-100K sockets 
> in CLOSE_WAIT.
> [root@hd2-region3 ~]# netstat -nap |grep CLOSE_WAIT |grep 21592 |wc -l
> 13156
> [root@hd2-region3 ~]# ps -ef |grep 21592
> root     17255 17219  0 12:26 pts/0    00:00:00 grep 21592
> hbase    21592     1 17 Aug29 ?        03:29:06 
> /usr/java/jdk1.6.0_26/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx8000m 
> -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode 
> -Dhbase.log.dir=/var/log/hbase 
> -Dhbase.log.file=hbase-hbase-regionserver-hd2-region3.swnet.corp.log ...





[jira] [Commented] (HBASE-13060) Don't use deprecated HTrace API addKVAnnotation(byte[], byte[])

2015-02-17 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324981#comment-14324981
 ] 

Colin Patrick McCabe commented on HBASE-13060:
--

It looks like the String-based API didn't make it into HTrace 3.1.0.  I guess 
we will have to wait on this one.
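
For context, the change we're waiting to make is just the overload swap below 
({{span}} stands for any active {{Span}}):
{code}
// Deprecated byte[] API that this issue wants to stop using:
span.addKVAnnotation("table".getBytes(), "usertable".getBytes());

// The String-based replacement we're waiting on (didn't make HTrace 3.1.0):
span.addKVAnnotation("table", "usertable");
{code}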

> Don't use deprecated HTrace API addKVAnnotation(byte[], byte[])
> ---
>
> Key: HBASE-13060
> URL: https://issues.apache.org/jira/browse/HBASE-13060
> Project: HBase
>  Issue Type: Bug
>Reporter: Colin Patrick McCabe
>Priority: Critical
>
> Let's avoid using the deprecated HTrace API addKVAnnotation(byte[], byte[]).





[jira] [Created] (HBASE-13060) Don't use deprecated HTrace API addKVAnnotation(byte[], byte[])

2015-02-17 Thread Colin Patrick McCabe (JIRA)
Colin Patrick McCabe created HBASE-13060:


 Summary: Don't use deprecated HTrace API addKVAnnotation(byte[], 
byte[])
 Key: HBASE-13060
 URL: https://issues.apache.org/jira/browse/HBASE-13060
 Project: HBase
  Issue Type: Bug
Reporter: Colin Patrick McCabe
Priority: Critical


Let's avoid using the deprecated HTrace API addKVAnnotation(byte[], byte[]).





[jira] [Updated] (HBASE-12899) HBase should prefix htrace configuration keys with "hbase.htrace" rather than just "hbase."

2015-01-21 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HBASE-12899:
-
Attachment: HBASE-12899.002.patch

OK.  We can set the new configuration keys when we see the old ones, and print 
a warning, if that's helpful.  Here is a new patch that does this.  We know 
which config keys existed in the pre-3.1 world, so we can just include 
deprecations for those.
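
The idea is roughly the following (a sketch only; the key list here is 
illustrative, not the full list in the patch):
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;

// Copy known pre-3.1 keys to their new "hbase.htrace." names and warn.
public final class HTraceKeyDeprecation {
  private static final Log LOG = LogFactory.getLog(HTraceKeyDeprecation.class);
  private static final String[][] DEPRECATED = {
    { "hbase.sampler.fraction", "hbase.htrace.sampler.fraction" },
  };

  public static void apply(Configuration conf) {
    for (String[] pair : DEPRECATED) {
      String value = conf.get(pair[0]);
      if (value != null && conf.get(pair[1]) == null) {
        LOG.warn(pair[0] + " is deprecated; use " + pair[1] + " instead");
        conf.set(pair[1], value);
      }
    }
  }
}
{code}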

> HBase should prefix htrace configuration keys with "hbase.htrace" rather than 
> just "hbase."
> ---
>
> Key: HBASE-12899
> URL: https://issues.apache.org/jira/browse/HBASE-12899
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Colin Patrick McCabe
> Attachments: HBASE-12899.001.patch, HBASE-12899.002.patch
>
>
> In Hadoop, we pass all configuration keys starting with "hadoop.htrace" to 
> htrace.  So "hadoop.htrace.sampler.fraction" gets passed to HTrace as 
> sampler.fraction, and so forth.
> For consistency, it seems like HBase should prefix htrace configuration keys 
> with "hbase.htrace" rather than just "hbase."





[jira] [Commented] (HBASE-12899) HBase should prefix htrace configuration keys with "hbase.htrace" rather than just "hbase."

2015-01-21 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286476#comment-14286476
 ] 

Colin Patrick McCabe commented on HBASE-12899:
--

thanks for the review, Nick.

HTrace-for-hbase isn't really deployed in production yet.  I would be surprised 
if anyone but a handful of devs had used this.  I think now is the right time 
to change the prefix so that we don't get conflicts between htrace and hbase 
configuration keys.

If we start with compatibility shims, we'll have to carry them around forever 
(speaking from Hadoop experience) and there's little benefit.  We only just got 
an HTrace release last week! :)

> HBase should prefix htrace configuration keys with "hbase.htrace" rather than 
> just "hbase."
> ---
>
> Key: HBASE-12899
> URL: https://issues.apache.org/jira/browse/HBASE-12899
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Colin Patrick McCabe
> Attachments: HBASE-12899.001.patch
>
>
> In Hadoop, we pass all configuration keys starting with "hadoop.htrace" to 
> htrace.  So "hadoop.htrace.sampler.fraction" gets passed to HTrace as 
> sampler.fraction, and so forth.
> For consistency, it seems like HBase should prefix htrace configuration keys 
> with "hbase.htrace" rather than just "hbase."





[jira] [Updated] (HBASE-12899) HBase should prefix htrace configuration keys with "hbase.htrace" rather than just "hbase."

2015-01-21 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HBASE-12899:
-
Status: Patch Available  (was: Open)

> HBase should prefix htrace configuration keys with "hbase.htrace" rather than 
> just "hbase."
> ---
>
> Key: HBASE-12899
> URL: https://issues.apache.org/jira/browse/HBASE-12899
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Colin Patrick McCabe
> Attachments: HBASE-12899.001.patch
>
>
> In Hadoop, we pass all configuration keys starting with "hadoop.htrace" to 
> htrace.  So "hadoop.htrace.sampler.fraction" gets passed to HTrace as 
> sampler.fraction, and so forth.
> For consistency, it seems like HBase should prefix htrace configuration keys 
> with "hbase.htrace" rather than just "hbase."





[jira] [Updated] (HBASE-12899) HBase should prefix htrace configuration keys with "hbase.htrace" rather than just "hbase."

2015-01-21 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HBASE-12899:
-
Attachment: HBASE-12899.001.patch

> HBase should prefix htrace configuration keys with "hbase.htrace" rather than 
> just "hbase."
> ---
>
> Key: HBASE-12899
> URL: https://issues.apache.org/jira/browse/HBASE-12899
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Colin Patrick McCabe
> Attachments: HBASE-12899.001.patch
>
>
> In Hadoop, we pass all configuration keys starting with "hadoop.htrace" to 
> htrace.  So "hadoop.htrace.sampler.fraction" gets passed to HTrace as 
> sampler.fraction, and so forth.
> For consistency, it seems like HBase should prefix htrace configuration keys 
> with "hbase.htrace" rather than just "hbase."





[jira] [Created] (HBASE-12899) HBase should prefix htrace configuration keys with "hbase.htrace" rather than just "hbase."

2015-01-21 Thread Colin Patrick McCabe (JIRA)
Colin Patrick McCabe created HBASE-12899:


 Summary: HBase should prefix htrace configuration keys with 
"hbase.htrace" rather than just "hbase."
 Key: HBASE-12899
 URL: https://issues.apache.org/jira/browse/HBASE-12899
 Project: HBase
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Colin Patrick McCabe


In Hadoop, we pass all configuration keys starting with "hadoop.htrace" to 
htrace.  So "hadoop.htrace.sampler.fraction" gets passed to HTrace as 
sampler.fraction, and so forth.

For consistency, it seems like HBase should prefix htrace configuration keys 
with "hbase.htrace" rather than just "hbase."





[jira] [Commented] (HBASE-9393) Hbase does not closing a closed socket resulting in many CLOSE_WAIT

2014-09-02 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118943#comment-14118943
 ] 

Colin Patrick McCabe commented on HBASE-9393:
-

Best guess is that you didn't apply your configuration to HBase, which is the 
DFSClient in this scenario.  Suggest posting to hdfs-u...@apache.org

> Hbase does not closing a closed socket resulting in many CLOSE_WAIT 
> 
>
> Key: HBASE-9393
> URL: https://issues.apache.org/jira/browse/HBASE-9393
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.94.2, 0.98.0
> Environment: Centos 6.4 - 7 regionservers/datanodes, 8 TB per node, 
> 7279 regions
>Reporter: Avi Zrachya
>
> HBase does not close a dead connection with the datanode.
> This results in over 60K sockets in CLOSE_WAIT, and at some point HBase can 
> not connect to the datanode because there are too many mapped sockets from 
> one host to another on the same port.
> The example below shows a low CLOSE_WAIT count because we had to restart 
> hbase to solve the problem; over time it will increase to 60-100K sockets 
> in CLOSE_WAIT.
> [root@hd2-region3 ~]# netstat -nap |grep CLOSE_WAIT |grep 21592 |wc -l
> 13156
> [root@hd2-region3 ~]# ps -ef |grep 21592
> root     17255 17219  0 12:26 pts/0    00:00:00 grep 21592
> hbase    21592     1 17 Aug29 ?        03:29:06 
> /usr/java/jdk1.6.0_26/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx8000m 
> -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode 
> -Dhbase.log.dir=/var/log/hbase 
> -Dhbase.log.file=hbase-hbase-regionserver-hd2-region3.swnet.corp.log ...





[jira] [Commented] (HBASE-10689) Explore advisory caching for MR over snapshot scans

2014-03-10 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13926010#comment-13926010
 ] 

Colin Patrick McCabe commented on HBASE-10689:
--

No problem.  I agree it can get a bit confusing.  It would be nice to see some 
numbers from tweaking readahead on HBase when you guys get a chance.  I guess 
the gain will depend partly on how much caching HBase is doing.  If HBase is 
caching that extra 4 MB that it read, then it's not such a loss.  If it's 
throwing that away, then making readahead shorter may be a big gain.

> Explore advisory caching for MR over snapshot scans
> ---
>
> Key: HBASE-10689
> URL: https://issues.apache.org/jira/browse/HBASE-10689
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce, Performance
>Reporter: Nick Dimiduk
>
> Per 
> [comment|https://issues.apache.org/jira/browse/HBASE-10660?focusedCommentId=13921730&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13921730]
>  on HBASE-10660, explore using the new HDFS advisory caching feature 
> introduced in HDFS-4817 for TableSnapshotInputFormat.





[jira] [Commented] (HBASE-10689) Explore advisory caching for MR over snapshot scans

2014-03-09 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13925386#comment-13925386
 ] 

Colin Patrick McCabe commented on HBASE-10689:
--

[~stack], there are multiple kinds of caching in HDFS.  The path-based caching 
added in HDFS-4949 caches at the file level, so you are right that it is not 
that useful for HBase.  The advisory caching API is a little different.  It 
allows the application to control how much readahead HDFS does and a little bit 
about how the page cache is used.

When HBase reads a 64 KB chunk, HDFS will currently load a 4 MB segment off 
the disk.  The rest of that 4 MB is thrown away unless HBase uses it.  HBase 
could avoid this issue by calling {{DFSInputStream#setReadahead(65536)}}.  
Unless HBase is doing something smart with the rest of that 4 MB, it seems 
like this might be a good idea?
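
A minimal sketch of that call (the path is made up):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadaheadSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    try (FSDataInputStream in = fs.open(new Path("/hbase/example/hfile"))) {
      in.setReadahead(65536L);     // CanSetReadahead, from HDFS-4817
      byte[] chunk = new byte[65536];
      in.readFully(0L, chunk);     // HDFS now reads ~64 KB, not 4 MB
    }
  }
}
{code}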

> Explore advisory caching for MR over snapshot scans
> ---
>
> Key: HBASE-10689
> URL: https://issues.apache.org/jira/browse/HBASE-10689
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce, Performance
>Reporter: Nick Dimiduk
>
> Per 
> [comment|https://issues.apache.org/jira/browse/HBASE-10660?focusedCommentId=13921730&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13921730]
>  on HBASE-10660, explore using the new HDFS advisory caching feature 
> introduced in HDFS-4817 for TableSnapshotInputFormat.





[jira] [Commented] (HBASE-10052) use HDFS advisory caching to avoid caching HFiles that are not going to be read again (because they are being compacted)

2013-12-05 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13840355#comment-13840355
 ] 

Colin Patrick McCabe commented on HBASE-10052:
--

bq. One thing to be wary of: during the compaction, readers are still accessing 
the old files, so if you're compacting large files, this could really hurt read 
latency during compactions (assuming that people are relying on linux LRU in 
addition to hbase-internal LRU for performance).

That's a fair point.

bq. In most cases, as soon as the compaction is complete, we end up removing 
the input files anyway (thus removing from cache), right?

Unlinking a file doesn't remove that file from the buffer cache.  If the 
unlinked file is no longer referenced (certainly the case here), it will be 
removed over time, as other things evict it.  In the meantime, having those 
pages buffered means that something else isn't.

When doing the fadvise work, I remember us coming up with a crude hack that 
did fadvise from HBase during compactions and seeing some performance gain.  
But it seems like it might be workload-dependent.
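
From memory, the hack amounted to something like this ({{fs}} and 
{{compactedFilePath}} are stand-ins):
{code}
// Drop pages behind the compaction reader so the old HFile's data does
// not evict hotter pages from the OS page cache.
FSDataInputStream in = fs.open(compactedFilePath);  // stand-in names
in.setDropBehind(Boolean.TRUE);                     // CanSetDropBehind
{code}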

It's a shame that there isn't a way to tell Linux to do a read without 
caching.  That's really what we want here.  Instead, we just have a way of 
nuking the cache for a range of the file if it exists, which is not at all the 
same thing.  I took a look at the Linux source tree again today, and 
{{FADV_NOREUSE}} was still a no-op :(

bq. Hmm, ok, moving out until we have something with a quantified benefit.

Yeah, it would be interesting to see some test numbers.  I also wonder if we 
could somehow quantify how often the HBase LRU hits.

> use HDFS advisory caching to avoid caching HFiles that are not going to be 
> read again (because they are being compacted)
> 
>
> Key: HBASE-10052
> URL: https://issues.apache.org/jira/browse/HBASE-10052
> Project: HBase
>  Issue Type: Improvement
>Reporter: Colin Patrick McCabe
>Assignee: Andrew Purtell
>Priority: Minor
> Fix For: 0.98.1, 0.99.0
>
>
> HBase can benefit from doing dropbehind during compaction since compacted 
> files are not read again.  HDFS advisory caching, introduced in HDFS-4817, 
> can help here.  The right API here is {{DataInputStream#setDropBehind}}.





[jira] [Commented] (HBASE-10052) use HDFS advisory caching to avoid caching HFiles that are not going to be read again (because they are being compacted)

2013-12-03 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838543#comment-13838543
 ] 

Colin Patrick McCabe commented on HBASE-10052:
--

[~enis] It would be interesting to experiment with using drop-behind on HBase's 
block files.  However, in my experiments at least, this wasn't a performance 
win since HBase still relies on the OS page cache in some cases.  It's been a 
while since I did them, though.

> use HDFS advisory caching to avoid caching HFiles that are not going to be 
> read again (because they are being compacted)
> 
>
> Key: HBASE-10052
> URL: https://issues.apache.org/jira/browse/HBASE-10052
> Project: HBase
>  Issue Type: Improvement
>Reporter: Colin Patrick McCabe
> Fix For: 0.98.0
>
>
> HBase can benefit from doing dropbehind during compaction since compacted 
> files are not read again.  HDFS advisory caching, introduced in HDFS-4817, 
> can help here.  The right API here is {{DataInputStream#setDropBehind}}.





[jira] [Commented] (HBASE-10052) use HDFS advisory caching to avoid caching HFiles that are not going to be read again (because they are being compacted)

2013-11-27 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13834477#comment-13834477
 ] 

Colin Patrick McCabe commented on HBASE-10052:
--

[~andrew.purt...@gmail.com] you can take it if you want

> use HDFS advisory caching to avoid caching HFiles that are not going to be 
> read again (because they are being compacted)
> 
>
> Key: HBASE-10052
> URL: https://issues.apache.org/jira/browse/HBASE-10052
> Project: HBase
>  Issue Type: Improvement
>Reporter: Colin Patrick McCabe
> Fix For: 0.98.0
>
>
> HBase can benefit from doing dropbehind during compaction since compacted 
> files are not read again.  HDFS advisory caching, introduced in HDFS-4817, 
> can help here.  The right API here is {{DataInputStream#setDropBehind}}.





[jira] [Updated] (HBASE-10052) use HDFS advisory caching to avoid caching HFiles that are not going to be read again (because they are being compacted)

2013-11-27 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HBASE-10052:
-

Description: HBase can benefit from doing dropbehind during compaction 
since compacted files are not read again.  HDFS advisory caching, introduced in 
HDFS-4817, can help here.  The right API here is 
{{DataInputStream#setDropBehind}}.  (was: HBase can benefit from doing 
dropbehind during compaction since compacted files are not read again.  HDFS 
advisory caching, introduced in HDFS-4817, can help here.  The right API here 
is {{DataOutputStream#setDropBehind}}.)

> use HDFS advisory caching to avoid caching HFiles that are not going to be 
> read again (because they are being compacted)
> 
>
> Key: HBASE-10052
> URL: https://issues.apache.org/jira/browse/HBASE-10052
> Project: HBase
>  Issue Type: Improvement
>Reporter: Colin Patrick McCabe
>
> HBase can benefit from doing dropbehind during compaction since compacted 
> files are not read again.  HDFS advisory caching, introduced in HDFS-4817, 
> can help here.  The right API here is {{DataInputStream#setDropBehind}}.





[jira] [Commented] (HBASE-10052) use HDFS advisory caching to avoid caching HFiles that are not going to be read again (because they are being compacted)

2013-11-27 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13834203#comment-13834203
 ] 

Colin Patrick McCabe commented on HBASE-10052:
--

[~andrew.purt...@gmail.com]: sorry, I did mean {{DFSInputStream}}, since the 
files not being used again would be the compactees.  And yeah, reflection would 
be a good way to do this and support older Hadoops (see the 
{{CanSetDropBehind}} interface).
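
A sketch of the reflection approach (a hypothetical helper, not from any 
patch):
{code}
import java.io.InputStream;
import java.lang.reflect.Method;

// Call setDropBehind(Boolean) only when the running Hadoop exposes it
// (CanSetDropBehind, HDFS-4817+); silently no-op on older Hadoops.
public final class DropBehindCompat {
  public static void trySetDropBehind(InputStream in, boolean drop) {
    try {
      Method m = in.getClass().getMethod("setDropBehind", Boolean.class);
      m.invoke(in, Boolean.valueOf(drop));
    } catch (Exception e) {
      // Older Hadoop, or a stream that doesn't support it: ignore.
    }
  }
}
{code}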

> use HDFS advisory caching to avoid caching HFiles that are not going to be 
> read again (because they are being compacted)
> 
>
> Key: HBASE-10052
> URL: https://issues.apache.org/jira/browse/HBASE-10052
> Project: HBase
>  Issue Type: Improvement
>Reporter: Colin Patrick McCabe
>
> HBase can benefit from doing dropbehind during compaction since compacted 
> files are not read again.  HDFS advisory caching, introduced in HDFS-4817, 
> can help here.  The right API here is {{DataOutputStream#setDropBehind}}.





[jira] [Created] (HBASE-10052) use HDFS advisory caching to avoid caching HFiles that are not going to be read again (because they are being compacted)

2013-11-27 Thread Colin Patrick McCabe (JIRA)
Colin Patrick McCabe created HBASE-10052:


 Summary: use HDFS advisory caching to avoid caching HFiles that 
are not going to be read again (because they are being compacted)
 Key: HBASE-10052
 URL: https://issues.apache.org/jira/browse/HBASE-10052
 Project: HBase
  Issue Type: Improvement
Reporter: Colin Patrick McCabe


HBase can benefit from doing dropbehind during compaction since compacted files 
are not read again.  HDFS advisory caching, introduced in HDFS-4817, can help 
here.  The right API here is {{DataOutputStream#setDropBehind}}.





[jira] [Commented] (HBASE-9393) Hbase dose not closing a closed socket resulting in many CLOSE_WAIT

2013-10-11 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792938#comment-13792938
 ] 

Colin Patrick McCabe commented on HBASE-9393:
-

I guess I should also explain why this doesn't happen in branch-1 of Hadoop.  
The reason is that Hadoop-1 had no socket cache and no grace period before 
the sockets were closed.  The client simply opened a new socket each time, 
performed the op, and then closed it.  This would result in (basically) no 
sockets in {{CLOSE_WAIT}}.  Remember {{CLOSE_WAIT}} only happens when the 
server is waiting for the client to execute {{close}}.

Keeping sockets open is an optimization, but one that may require you to raise 
your maximum number of file descriptors.  If you are not happy with this 
tradeoff, you can set {{dfs.client.socketcache.capacity}} to {{0}} and 
{{dfs.datanode.socket.reuse.keepalive}} to {{0}} to get the old branch-1 
behavior.  It will be slower, though.
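
Concretely, that would be (sketch):
{code}
import org.apache.hadoop.conf.Configuration;

// Restore the (slower) branch-1 behavior described above.
Configuration conf = new Configuration();
conf.setInt("dfs.client.socketcache.capacity", 0);      // no socket cache
conf.setInt("dfs.datanode.socket.reuse.keepalive", 0);  // no grace period
{code}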

> Hbase dose not closing a closed socket resulting in many CLOSE_WAIT 
> 
>
> Key: HBASE-9393
> URL: https://issues.apache.org/jira/browse/HBASE-9393
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.94.2
> Environment: Centos 6.4 - 7 regionservers/datanodes, 8 TB per node, 
> 7279 regions
>Reporter: Avi Zrachya
>
> HBase does not close a dead connection with the datanode.
> This results in over 60K sockets in CLOSE_WAIT, and at some point HBase can 
> not connect to the datanode because there are too many mapped sockets from 
> one host to another on the same port.
> The example below shows a low CLOSE_WAIT count because we had to restart 
> hbase to solve the problem; over time it will increase to 60-100K sockets 
> in CLOSE_WAIT.
> [root@hd2-region3 ~]# netstat -nap |grep CLOSE_WAIT |grep 21592 |wc -l
> 13156
> [root@hd2-region3 ~]# ps -ef |grep 21592
> root     17255 17219  0 12:26 pts/0    00:00:00 grep 21592
> hbase    21592     1 17 Aug29 ?        03:29:06 
> /usr/java/jdk1.6.0_26/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx8000m 
> -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode 
> -Dhbase.log.dir=/var/log/hbase 
> -Dhbase.log.file=hbase-hbase-regionserver-hd2-region3.swnet.corp.log ...





[jira] [Commented] (HBASE-9393) Hbase dose not closing a closed socket resulting in many CLOSE_WAIT

2013-10-11 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792915#comment-13792915
 ] 

Colin Patrick McCabe commented on HBASE-9393:
-

I looked into this issue.  I found a few things:

The HDFS socket cache is too small by default and times out too quickly.  Its 
default size is 16, but HBase seems to be opening many more connections to the 
DN than that.  In this situation, sockets must inevitably be opened and then 
discarded, leading to sockets in {{CLOSE_WAIT}}.

When you use positional read (aka {{pread}}), we grab a socket from the cache, 
read from it, and then immediately put it back.  When you seek and then call 
{{read}}, we don't put the socket back at the end.  The assumption behind the 
normal {{read}} method is that you are probably going to call {{read}} again, 
so it holds on to the socket until something else comes up (such as closing the 
stream).  In many scenarios, this can lead to {{seek+read}} generating more 
sockets in {{CLOSE_WAIT}} than {{pread}}.

I don't think we want to alter this HDFS behavior, since it's helpful in the 
case that you're reading through the entire file from start to finish, which 
many HDFS clients do.  It allows us to make certain optimizations such as 
reading a few kilobytes at a time, even if the user only asks for a few bytes 
at a time.  These optimizations are unavailable with {{pread}} because it 
creates a new {{BlockReader}} each time.

So as far as recommendations for HBase go:
* use short-circuit reads whenever possible, since in many cases you can avoid 
needing a socket at all and just reuse the same file descriptor
* set the socket cache to a bigger size and adjust the timeouts to be longer, 
as sketched below (I may explore changing the defaults in HDFS...)
* if you are going to keep files open for a while and read at random offsets, 
use {{pread}}, never {{seek+read}}.
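
For the cache tuning, a configuration sketch (the values are illustrative, not 
tested recommendations):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

Configuration conf = HBaseConfiguration.create();
conf.setBoolean("dfs.client.read.shortcircuit", true);    // recommendation 1
conf.setInt("dfs.client.socketcache.capacity", 256);      // default is 16
conf.setInt("dfs.client.socketcache.expiryMsec", 60000);  // hold sockets longer
{code}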

> Hbase dose not closing a closed socket resulting in many CLOSE_WAIT 
> 
>
> Key: HBASE-9393
> URL: https://issues.apache.org/jira/browse/HBASE-9393
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.94.2
> Environment: Centos 6.4 - 7 regionservers/datanodes, 8 TB per node, 
> 7279 regions
>Reporter: Avi Zrachya
>
> HBase does not close a dead connection with the datanode.
> This results in over 60K sockets in CLOSE_WAIT, and at some point HBase can 
> not connect to the datanode because there are too many mapped sockets from 
> one host to another on the same port.
> The example below shows a low CLOSE_WAIT count because we had to restart 
> hbase to solve the problem; over time it will increase to 60-100K sockets 
> in CLOSE_WAIT.
> [root@hd2-region3 ~]# netstat -nap |grep CLOSE_WAIT |grep 21592 |wc -l
> 13156
> [root@hd2-region3 ~]# ps -ef |grep 21592
> root     17255 17219  0 12:26 pts/0    00:00:00 grep 21592
> hbase    21592     1 17 Aug29 ?        03:29:06 
> /usr/java/jdk1.6.0_26/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx8000m 
> -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode 
> -Dhbase.log.dir=/var/log/hbase 
> -Dhbase.log.file=hbase-hbase-regionserver-hd2-region3.swnet.corp.log ...





[jira] [Commented] (HBASE-8337) Investigate why disabling hadoop short circuit read is required to make recovery tests pass consistently under hadoop2

2013-04-24 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641069#comment-13641069
 ] 

Colin Patrick McCabe commented on HBASE-8337:
-

bq. Maybe I'm missing it in the discussion above, but why is this only a 
problem with hadoop2 and not hadoop1? Is SCR not enabled by default in hadoop1?

Probably because of HDFS-4595.

> Investigate why disabling hadoop short circuit read is required to make 
> recovery tests pass consistently under hadoop2
> --
>
> Key: HBASE-8337
> URL: https://issues.apache.org/jira/browse/HBASE-8337
> Project: HBase
>  Issue Type: Sub-task
>  Components: hadoop2, test
>Affects Versions: 0.98.0, 0.95.1
>Reporter: Jonathan Hsieh
>Priority: Critical
> Fix For: 0.95.1
>
>
> HBASE-7636 makes some TestDistributedLogSplitting pass consistently by 
> disabling hdfs short circuit reads.  
> HBASE-8349 makes datanode node death recovery pass consistently by disabling 
> hdfs short circuit reads.
> This will likely require configuration modifications to fix and may have 
> different fixes for hadoop1, hadoop2 (HDFS-2246), and hadoop3 (HDFS-347)...



[jira] [Commented] (HBASE-8337) Investigate why disabling hadoop short circuit read is required to make recovery tests pass consistently under hadoop2

2013-04-24 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640976#comment-13640976
 ] 

Colin Patrick McCabe commented on HBASE-8337:
-

bq. Looks like this has allowed us to get away with things we shouldn't. Tested 
using the same User for master and all regionservers in the minicluster, with 
0.94 branch and the default Hadoop 1. TestMasterZKSessionRecovery OOMEs after 
surefire tries to parse a 180 MB logfile full of IOExceptions. As soon as one 
regionserver aborts, its filesystem is cached and/or closed by user, the master 
file system's DFS client is closed, and all hell breaks loose.

You can use {{FileSystem#newInstance}} to prevent this problem.
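
That is, something like:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Bypass the FileSystem cache so each component owns its own instance,
// and one user's close() can't tear down a shared client.
FileSystem fs = FileSystem.newInstance(new Configuration());
{code}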

> Investigate why disabling hadoop short circuit read is required to make 
> recovery tests pass consistently under hadoop2
> --
>
> Key: HBASE-8337
> URL: https://issues.apache.org/jira/browse/HBASE-8337
> Project: HBase
>  Issue Type: Sub-task
>  Components: hadoop2, test
>Affects Versions: 0.98.0, 0.95.1
>Reporter: Jonathan Hsieh
>Priority: Critical
> Fix For: 0.95.1
>
>
> HBASE-7636 makes some TestDistributedLogSplitting pass consistently by 
> disabling hdfs short circuit reads.  
> HBASE-8349 makes datanode node death recovery pass consistently by disabling 
> hdfs short circuit reads.
> This will likely require configuration modifications to fix and may have 
> different fixes for hadoop1, hadoop2 (HDFS-2246), and hadoop3 (HDFS-347)...



[jira] [Commented] (HBASE-8337) Investigate why disabling hadoop short circuit read is required to make recovery tests pass consistently under hadoop2

2013-04-23 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13639970#comment-13639970
 ] 

Colin Patrick McCabe commented on HBASE-8337:
-

Actually, the 2.0 series will have SCR; it's just not there now.

> Investigate why disabling hadoop short circuit read is required to make 
> recovery tests pass consistently under hadoop2
> --
>
> Key: HBASE-8337
> URL: https://issues.apache.org/jira/browse/HBASE-8337
> Project: HBase
>  Issue Type: Sub-task
>  Components: hadoop2, test
>Affects Versions: 0.98.0, 0.95.1
>Reporter: Jonathan Hsieh
>Priority: Critical
> Fix For: 0.95.1
>
>
> HBASE-7636 makes some TestDistributedLogSplitting pass consistently by 
> disabling hdfs short circuit reads.  
> HBASE-8349 makes datanode node death recovery pass consistently by disabling 
> hdfs short circuit reads.
> This will likely require configuration modifications to fix and may have 
> different fixes for hadoop1, hadoop2 (HDFS-2246), and hadoop3 (HDFS-347)...



[jira] [Commented] (HBASE-8337) Investigate why disabling hadoop short circuit read is required to make recovery tests pass consistently under hadoop2

2013-04-15 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632467#comment-13632467
 ] 

Colin Patrick McCabe commented on HBASE-8337:
-

It's good to have HDFS-347 merged into trunk.  I hope that the merge into 
branch-2 will not take too long.  Regardless, I think the bottom line is that 
HBase needs to have support for both new-style and old-style short-circuit 
local reads, at least for a while.  Even if we had HDFS-347 in branch-2 now, 
you'd still need this support to test against branch-1, or test against a 
Windows-based HDFS cluster.

> Investigate why disabling hadoop short circuit read is required to make 
> recovery tests pass consistently under hadoop2
> --
>
> Key: HBASE-8337
> URL: https://issues.apache.org/jira/browse/HBASE-8337
> Project: HBase
>  Issue Type: Sub-task
>  Components: hadoop2, test
>Affects Versions: 0.98.0, 0.95.1
>Reporter: Jonathan Hsieh
>Priority: Critical
> Fix For: 0.95.1
>
>
> HBASE-7636 makes some TestDistributedLogSplitting pass consistently by 
> disabling hdfs short circuit reads.  
> HBASE-8349 makes datanode node death recovery pass consistently by disabling 
> hdfs short circuit reads.
> This will likely require configuration modifications to fix and may have 
> different fixes for hadoop1, hadoop2 (HDFS-2246), and hadoop3 (HDFS-347)...



[jira] [Commented] (HBASE-7636) TestDistributedLogSplitting#testThreeRSAbort fails against hadoop 2.0

2013-04-12 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630477#comment-13630477
 ] 

Colin Patrick McCabe commented on HBASE-7636:
-

bq. Is HDFS-347 the "new" version?

Yes.

bq. Is the current short circuit read (HDFS-2246?) in hadoop1 different from 
hadoop2's?

It's very similar.

bq. What is the fix version/target version for HDFS-347 – what HDFS branches 
are you targeting?

It's going to be merged to trunk at first.

bq. From HBase's point of view, this is purely a unit test fix and specifically 
for the MiniDFSCluster. Do you think disabling the SCR feature for unit tests 
is a prudent idea?

If you are just trying to test functionality rather than performance, it's 
easiest to keep it off.

> TestDistributedLogSplitting#testThreeRSAbort fails against hadoop 2.0
> -
>
> Key: HBASE-7636
> URL: https://issues.apache.org/jira/browse/HBASE-7636
> Project: HBase
>  Issue Type: Sub-task
>  Components: hadoop2, test
>Affects Versions: 0.95.0
>Reporter: Ted Yu
>Assignee: Jonathan Hsieh
> Fix For: 0.98.0, 0.95.1
>
> Attachments: hbase-7636.v2.patch, hbase-7636.v3.patch
>
>
> From 
> https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/364/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testThreeRSAbort/
>  :
> {code}
> 2013-01-21 11:49:34,276 DEBUG 
> [MASTER_SERVER_OPERATIONS-juno.apache.org,57966,1358768818594-0] 
> client.HConnectionManager$HConnectionImplementation(956): Looked up root 
> region location, connection=hconnection 0x12f19fe; 
> serverName=juno.apache.org,55531,1358768819479
> 2013-01-21 11:49:34,278 INFO  
> [MASTER_SERVER_OPERATIONS-juno.apache.org,57966,1358768818594-0] 
> catalog.CatalogTracker(576): Failed verification of .META.,,1 at 
> address=juno.apache.org,57582,1358768819456; 
> org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is 
> in the failed servers list: juno.apache.org/67.195.138.61:57582
> {code}



[jira] [Commented] (HBASE-7636) TestDistributedLogSplitting#testThreeRSAbort fails against hadoop 2.0

2013-04-12 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630453#comment-13630453
 ] 

Colin Patrick McCabe commented on HBASE-7636:
-

I was not able to find any HDFS logs attached (but maybe I didn't look in the 
right place?)  If you can find where the test puts those, your answer is almost 
certainly in there.  And it's almost certainly a configuration problem.

In general, configuring short circuit reads is complex.  For old-style 
short-circuit reads, you need to set up special permissions on your DataNode 
storage directories.  You also need to specify the right user for 
{{dfs.block.local-path-access.user}}.  For new-style short-circuit reads, you 
need to have {{libhadoop.so}} installed, and possibly be running with the 
native profile {{-Pnative}} so that Maven will set up {{LD_LIBRARY_PATH}} 
correctly.  Then you need to set a valid socket path.
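
A configuration sketch of the two styles (the user name and socket path below 
are made up):
{code}
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.setBoolean("dfs.client.read.shortcircuit", true);

// Old-style (HDFS-2246): the reader opens block files directly, so the
// user must be whitelisted and have permission on the storage dirs.
conf.set("dfs.block.local-path-access.user", "hbase");

// New-style (HDFS-347): file descriptors are passed over a Unix domain
// socket; needs libhadoop.so and a valid socket path.
conf.set("dfs.domain.socket.path", "/var/run/hdfs/dn_socket");
{code}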

It's probably best to wait until we finish merging the HDFS-347 branch (vote 
was successful, now we just need to do the work in svn), and then I'll help you 
set up the conf for this test.  You probably want to have a fallback for if 
native code is not enabled.

> TestDistributedLogSplitting#testThreeRSAbort fails against hadoop 2.0
> -
>
> Key: HBASE-7636
> URL: https://issues.apache.org/jira/browse/HBASE-7636
> Project: HBase
>  Issue Type: Sub-task
>  Components: hadoop2, test
>Affects Versions: 0.95.0
>Reporter: Ted Yu
>Assignee: Jonathan Hsieh
> Fix For: 0.98.0, 0.95.1
>
> Attachments: hbase-7636.v2.patch, hbase-7636.v3.patch
>
>
> From 
> https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/364/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testThreeRSAbort/
>  :
> {code}
> 2013-01-21 11:49:34,276 DEBUG 
> [MASTER_SERVER_OPERATIONS-juno.apache.org,57966,1358768818594-0] 
> client.HConnectionManager$HConnectionImplementation(956): Looked up root 
> region location, connection=hconnection 0x12f19fe; 
> serverName=juno.apache.org,55531,1358768819479
> 2013-01-21 11:49:34,278 INFO  
> [MASTER_SERVER_OPERATIONS-juno.apache.org,57966,1358768818594-0] 
> catalog.CatalogTracker(576): Failed verification of .META.,,1 at 
> address=juno.apache.org,57582,1358768819456; 
> org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is 
> in the failed servers list: juno.apache.org/67.195.138.61:57582
> {code}



[jira] [Commented] (HBASE-6686) HFile Quarantine fails with missing dirs in hadoop 2.0

2012-08-29 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1349#comment-1349
 ] 

Colin Patrick McCabe commented on HBASE-6686:
-

As far as I can see, hadoop-1 returns null, not an empty list, when the 
directory does not exist.  Admittedly, I only checked 
DistributedFileSystem.java at line 279 ({{DistributedFileSystem#listStatus}}).  
Maybe there's some other override that does it, but... it seems questionable.

You're right that this is an exception in hadoop-2 / cdh4.
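
A compatibility sketch for callers that want one behavior on both lines:
{code}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Normalize both semantics: hadoop-1 appears to return null for a missing
// directory, while hadoop-2 / cdh4 throws FileNotFoundException.
public final class ListStatusCompat {
  public static FileStatus[] listStatusOrEmpty(FileSystem fs, Path dir)
      throws IOException {
    try {
      FileStatus[] statuses = fs.listStatus(dir);
      return statuses == null ? new FileStatus[0] : statuses;  // hadoop-1
    } catch (FileNotFoundException e) {
      return new FileStatus[0];                                // hadoop-2
    }
  }
}
{code}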

> HFile Quarantine fails with missing dirs in hadoop 2.0 
> ---
>
> Key: HBASE-6686
> URL: https://issues.apache.org/jira/browse/HBASE-6686
> Project: HBase
>  Issue Type: Bug
>  Components: hbck
>Affects Versions: 0.92.2, 0.96.0, 0.94.2
>Reporter: Jonathan Hsieh
>Assignee: Jonathan Hsieh
> Fix For: 0.92.2, 0.96.0, 0.94.2
>
> Attachments: hbase-6686-94-92.patch
>
>
> Two unit tests fail because listStatus's semantics change between hadoop 1.0 
> and hadoop 2.0.  (specifically -- hadoop 1.0 returns empty array if used on 
> dir that does not exist, but hadoop 2.0 throws FileNotFoundException).
> here's the exception:
> {code}
> 2012-08-28 16:01:19,789 WARN  [Thread-3155] hbck.HFileCorruptionChecker(230): 
> Failed to quaratine an HFile in regiondir 
> hdfs://localhost:38096/user/jenkins/hbase/testQuarantineMissingFamdir/34b2e072b33052bf4875f85513e9c669
> java.io.FileNotFoundException: File 
> hdfs://localhost:38096/user/jenkins/hbase/testQuarantineMissingFamdir/34b2e072b33052bf4875f85513e9c669/fam
>  does not exist.
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:406)
>   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1341)
>   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1381)
>   at 
> org.apache.hadoop.hbase.util.hbck.HFileCorruptionChecker.checkColFamDir(HFileCorruptionChecker.java:152)
>   at 
> org.apache.hadoop.hbase.util.TestHBaseFsck$2$1.checkColFamDir(TestHBaseFsck.java:1401)
>   at 
> org.apache.hadoop.hbase.util.hbck.HFileCorruptionChecker.checkRegionDir(HFileCorruptionChecker.java:185)
>   at 
> org.apache.hadoop.hbase.util.hbck.HFileCorruptionChecker$RegionDirChecker.call(HFileCorruptionChecker.java:267)
>   at 
> org.apache.hadoop.hbase.util.hbck.HFileCorruptionChecker$RegionDirChecker.call(HFileCorruptionChecker.java:258)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>   at java.lang.Thread.run(Thread.java:662)
> {code}
