Re: Lost regions question
Brennon: Have you run hbck to diagnose the problem? Since the issue might have involved HDFS, browsing the DataNode log(s) may provide some clues as well. What Hadoop version are you using?

Cheers

On Thu, Apr 11, 2013 at 10:58 PM, ramkrishna vasudevan ramkrishna.s.vasude...@gmail.com wrote:

When you say that the parent regions got reopened, does that mean that you did not lose any data (i.e., no data became unreadable)? The reason I ask is that if, after the parent got split into daughters, the data was written to the daughters and the daughters' files could then not be opened, you could have ended up unable to read that data. Some logs could tell us what made the parent get reopened rather than the daughters. Another thing I would like to ask: was the cluster brought down abruptly by killing the RSs? Which version of HBase?

Regards
Ram

On Fri, Apr 12, 2013 at 11:20 AM, Brennon Church bren...@getjar.com wrote:

Hello,

I had an interesting problem come up recently. We have a few thousand regions across 8 datanode/regionservers. I made a change, increasing the heap size for Hadoop from 128M to 2048M, which ended up bringing the cluster to a complete halt after about an hour. I reverted back to 128M and turned things back on again, but didn't realize at the time that I came up with 9 fewer regions than I started with. Upon further investigation, I found that all 9 missing regions were from splits that occurred while the cluster was running, after making the heap change and before it came to a halt. There was a 10th region (5 splits involved in total) that managed to get recovered. The really odd thing is that in the case of the other 9 regions, the original parent regions, which as far as I can tell from the logs were deleted, were re-opened upon restarting things once again. The daughter regions were gone. Interestingly, I found the orphaned data blocks still intact, and in at least some cases I have been able to extract the data from them and will hopefully re-add it to the tables.

My question is this: does anyone know, based on the rather muddled description I've given above, what could possibly have happened here? My best guess is that the bad state HDFS was in caused some critical step of the split process to be missed, which resulted in a reference to the parent regions sticking around while the references to the daughter regions were lost. Thanks for any insight you can provide.

--Brennon
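Ted's suggestion above refers to the hbck consistency checker. A minimal sketch of how it might be run against a 0.92-era cluster follows; flag names can vary by version, and `<table_name>` is a placeholder, so treat this as illustrative rather than a recipe:

```shell
# Report inconsistencies between .META. and the region files in HDFS.
hbase hbck

# More verbose per-region output (flag availability varies by version).
hbase hbck -details

# Cross-check which region directories HDFS still holds for the
# affected table, to compare against what .META. believes exists.
hadoop fs -ls /hbase/<table_name>
```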
Re: Error while doing multi get from HBase
Hi Ted,

The region servers are not loaded; they are showing 5% CPU usage. The datanode is showing around 50% CPU utilization, and disk I/O is around 7Mbps. There is nothing noticeable in the GC log.

Thanks,
Anand

On 12 April 2013 02:56, Ted Yu yuzhih...@gmail.com wrote:

How loaded were the region servers when the query was running? Did you check the GC log?

Thanks

On Thu, Apr 11, 2013 at 8:23 AM, anand nalya anand.na...@gmail.com wrote:

Hi,

I'm using HBase 0.94.5 with the Thrift server. I'm trying to get rows from HBase using org.apache.hadoop.hbase.thrift.generated.Hbase.Client.getRows(ByteBuffer, List<ByteBuffer>, Map<ByteBuffer, ByteBuffer>) but it is returning results very slowly (around 2 minutes for 100 rows). For larger numbers of records, there is no response. I have two region servers and a total of 128 regions. Total data size is around 250GB (250 million records), uniformly distributed across regions.

The region server only shows the following in its log:

2013-04-11 19:53:44,535 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
org.apache.hadoop.hbase.ipc.CallerDisconnectedException: Aborting call multi(org.apache.hadoop.hbase.client.MultiAction@49ac272), rpc version=1, client version=29, methodsFingerPrint=-1368823753 from 192.168.145.195:52277 after 74994 ms, since caller disconnected
    at org.apache.hadoop.hbase.ipc.HBaseServer$Call.throwExceptionIfCallerDisconnected(HBaseServer.java:436)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3723)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3643)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3626)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3664)
    at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4576)
    at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4549)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2042)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3516)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)

2013-04-11 19:53:46,121 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
org.apache.hadoop.hbase.ipc.CallerDisconnectedException: Aborting call multi(org.apache.hadoop.hbase.client.MultiAction@49ac272), rpc version=1, client version=29, methodsFingerPrint=-1368823753 from 192.168.145.195:52277 after 76580 ms, since caller disconnected
    [identical stack trace as above]

Any idea what might be wrong here?

Thanks,
Anand
Re: Error while doing multi get from HBase
Hi Azuryy,

I'm using the default cache size of 100 for the scanner. For multigets, I've tried with 1 (13ms), 10 (356ms), 100 (1135ms), 1000 (4330ms), and 10000 (17744ms) keys. The normal workload will be around 10000 keys at a time. Are there any optimizations that can be done for multigets? Is HBase a good candidate for this use case?

Thanks,
Anand

On 12 April 2013 19:17, anand nalya a.na...@computer.org wrote:
[quoted text snipped]
Re: Error while doing multi get from HBase
And what's your block cache size? There are two possible reasons: 1. the result is too big; 2. the GC options are not optimized. Can you paste your GC options here?

--Sent from my Sony mobile.

On Apr 12, 2013 9:53 PM, anand nalya a.na...@computer.org wrote:
[quoted text snipped]
Re: Error while doing multi get from HBase
The block cache size is 0.25. Each row holds around 2KB of data, so size should not be an issue, at least while the number of records is less than 1000. Also:

HBASE_HEAPSIZE=8000
HBASE_OPTS="-XX:+UseConcMarkSweepGC"
HBASE_REGIONSERVER_OPTS="-Xmx4g -Xms4g -Xmn256m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

On 12 April 2013 19:30, Azuryy Yu azury...@gmail.com wrote:
[quoted text snipped]
Re: Error while doing multi get from HBase
Your CMS is not tuned; please look up how to tune CMS on the Java web site. Also, Xmn is too small, which is not suitable for frequent multi-gets. Please change Xmn to 1g.

--Sent from my Sony mobile.

On Apr 12, 2013 10:29 PM, anand nalya a.na...@computer.org wrote:
[quoted text snipped]
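Azuryy's advice above could be sketched as a change to hbase-env.sh. The values here are illustrative assumptions rather than tested recommendations; only the larger young generation (Xmn) differs from the options Anand posted:

```shell
# hbase-env.sh (sketch): same CMS settings as before, but with a 1 GB
# young generation so the short-lived allocations from frequent
# multi-gets die in Eden instead of being promoted into the CMS heap.
export HBASE_REGIONSERVER_OPTS="-Xmx4g -Xms4g -Xmn1g \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
```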
Re: Region stuck in transition
Hello,

I'm facing some trouble and I don't know how to figure it out. The HBase master crashed a couple of days ago. I restarted it, but since then the -ROOT- region has been stuck in transition. I tried restarting the service and deleting/removing all the files in the hbase folder, but the region is still in transition. Do you have an idea why?

*Edit: I also tried creating the service on another node, but the result is still the same: the region stays in transition.*

Regards

--
Chung Fabien
EFREI Promo 2013
Tel : 06 48 03 54 92
Re: Region stuck in transition
Hi Fabien,

How are you doing today? Have you tried shutting down HBase, going to zkCli and deleting the znode for unassigned regions, and then restarting HBase? It sounds to me like you may have a corrupt state.

On Fri, Apr 12, 2013 at 8:50 AM, Fabien Chung chung.fab...@gmail.com wrote:
[quoted text snipped]

--
Kevin O'Dell
Systems Engineer, Cloudera
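Kevin's suggestion could look roughly like the following. The paths assume the default /hbase chroot and a 0.9x-era znode layout, and deleting the znode is destructive, so treat this as a sketch rather than a recipe:

```shell
# Stop HBase first so nothing is writing region-transition state.
./bin/stop-hbase.sh

# Open a ZooKeeper shell against the cluster's quorum.
./bin/hbase zkcli

# Inside zkCli:
#   ls /hbase/unassigned      # inspect regions stuck in transition
#   rmr /hbase/unassigned     # delete the stale transition state
#   quit

./bin/start-hbase.sh
```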
Re: Region stuck in transition
Can you pastebin the master log? What version of HBase are you using?

Thanks

On Apr 12, 2013, at 5:50 AM, Fabien Chung chung.fab...@gmail.com wrote:
[quoted text snipped]
Re: Region stuck in transition
We have faced a similar issue before (we are on 0.94.2); the way I resolved it was by doing this in the hbase shell:

assign 'region_name'

Hope this works for you.

On Fri, Apr 12, 2013 at 8:22 AM, Ted Yu yuzhih...@gmail.com wrote:
[quoted text snipped]
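Spelled out, the workaround above might look like this; the region name is a placeholder taken from the master's regions-in-transition list, and the forced unassign is a 0.94-era shell feature, so verify against your version:

```shell
hbase shell
#   assign 'REGION_NAME'          # force-assign the stuck region
#   # if it stays stuck, some 0.94 shells also accept a forced unassign:
#   unassign 'REGION_NAME', true
```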
hbase-0.94.6.1 balancer issue
Hi all,

I'm evaluating hbase-0.94.6.1 and I have 48 regions on a 2-node cluster. I was restarting one of the RSs and after that tried to balance the cluster by running balancer from the shell. After running the command the regions were not distributed to the second RS, and I found this line in the master log:

2013-04-12 16:45:15,589 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=2 *regions=1* average=0.5 mostloaded=1 leastloaded=0

It looks to me like the wrong number of regions is reported to the balancer, and that is the cause of skipping load balancing. In the hbase shell I see all 48 tables that I have, and everything else looks fine. Did someone else see this type of behavior? Did something change around the balancer in hbase-0.94.6.1?

Regards
Samir
Re: Lost regions question
Hello,

We lost the data when the parent regions got reopened. My guess, and it's only that, is that the regions were essentially empty when they started up again in these cases. We definitely lost data from the tables.

I've looked through the HDFS and HBase logs and can't find any obvious difference between a successful split and these failed ones. All steps show up the same in all cases. After the handled-split message that listed the parent and daughter regions, the next reference is to the parent regions once again, as HBase is started back up after the failure. No further reference to the daughters is made.

I couldn't cleanly shut several of the regionservers down, so they were abruptly killed, yes. The HBase version is 0.92.0, and Hadoop is 1.0.1.

Thanks.

--Brennon

On 4/11/13 10:58 PM, ramkrishna vasudevan wrote:
[quoted text snipped]
Re: hbase-0.94.6.1 balancer issue
Hi Samir,

Regions are balanced per table. So if you have 48 regions within the same table, they should be split about 24 on each server. But if you have 48 tables with 1 region each, then for each table the balancer will see only 1 region and will display the message you saw.

Have you looked at the UI? What do you have in it? Can you please confirm if you have 48 tables or 1 table?

Thanks,

JM

2013/4/12 Samir Ahmic ahmic.sa...@gmail.com

Hi, all

I'm evaluating hbase-0.94.6.1 and i have 48 regions on a 2-node cluster. I was restarting one of the RSs and after that tried to balance the cluster by running balancer from the shell. After running the command, regions were not distributed to the second RS, and i found this line in the master log:

2013-04-12 16:45:15,589 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=2 regions=1 average=0.5 mostloaded=1 leastloaded=0

This looks to me like the wrong number of regions is being reported by the balancer, and that causes it to skip load balancing. In the hbase shell i see all 48 tables that i have, and everything else looks fine. Did someone else see this type of behavior? Did something change around the balancer in hbase-0.94.6.1?

Regards
Samir
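JM's per-table point can be sketched with a small illustration. This is plain Python, not HBase code; the floor/ceil rule is an assumption inferred from the "mostloaded"/"leastloaded" figures in the log line, not the balancer's actual implementation:

```python
# Rough illustration of per-table balancing: the balancer considers each
# table separately, so 48 regions of one table piled on one server look
# unbalanced, while 48 one-region tables each look "balanced" on their own.
from math import ceil, floor

def is_table_balanced(region_count, server_count, loads):
    """Per-table check: balanced if every server's load for this table
    sits between floor and ceil of the per-server average."""
    avg = region_count / server_count
    return max(loads) <= ceil(avg) and min(loads) >= floor(avg)

# 48 regions of a single table, all on one of two servers: not balanced.
print(is_table_balanced(48, 2, loads=[48, 0]))   # False

# A one-region table with its only region on server 1: "balanced",
# matching "regions=1 average=0.5 mostloaded=1 leastloaded=0" in the log.
print(is_table_balanced(1, 2, loads=[1, 0]))     # True
```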
Re: hbase-0.94.6.1 balancer issue
Hi Samir,

Since regions are balanced per table, as soon as you have more than one region in a table, the balancer will start to balance its regions over the servers. You can split some of those tables and you will start to see HBase balance them. This is normal behavior for 0.94. I don't know for versions before that.

Also, are you sure you need 48 tables? And not fewer tables with more CFs?

JM

2013/4/12 Samir Ahmic ahmic.sa...@gmail.com

Hi, JM

I have 48 tables and, as you said, it is 1 region per table since i did not reach the splitting limit yet. So this is normal behavior in the 0.94.6.1 version? And at what point will the balancer start redistributing regions to the second server?

Thanks
Samir
Re: Lost regions question
Oh.. sorry to hear that. But i think it should still be there in the system, just not accessible to you. We should be able to bring it back.

One set of logs that would be of interest is that of the RS and master when the split happened. And the main thing would be when you restarted your cluster and the Master came back. That is where the system does some self-rectification after it checks whether there were some partial splits.

Regards
Ram
Re: Lost regions question
Brennon:

Can you try hbck to see if the problem is repaired?

Thanks
Re: hbase-0.94.6.1 balancer issue
Thanks for explaining, Jean-Marc.

We have been using 0.90.4 for a very long time, and balancing was based on the total number of regions. That is why i was surprised by the balancer log on 0.94. Well, i'm more an ops guy than a dev; i handle what others develop :)

Regards
Re: Lost regions question
hbck does show the hdfs files there without associated regions. I probably could have recovered had I noticed just after this happened, but given that we've been running like this for over a week, and that there is the potential for collisions between the missing and new data, I'm probably just going to manually reinsert it all using the hdfs files. Hadoop version is 1.0.1, btw.

Thanks.

--Brennon
Re: hbase-0.94.6.1 balancer issue
Samir,

When you say "at what point will the balancer start redistributing regions to the second server", do you mean that when you look at the master's web UI you see that one region server has 0 regions? That would be a problem.

Otherwise, the line you posted in your original message should be repeated for each table, and globally the regions should all be correctly distributed... unless there's an edge case where, when you have only tables with 1 region, it puts them all on the same server :)

Thx,

J-D
Re: hbase-0.94.6.1 balancer issue
I have just created 50 tables and they got distributed over the different nodes (8) at create time. I ran the balancer manually and they are still correctly distributed all over the cluster. But Samir tried with only 2 nodes; I don't know if this might change the results or not.

JM.
Re: hbase-0.94.6.1 balancer issue
Thanks for the replies, Jean-Marc.

HBASE-7060 is related to the scenario Samir described:

Region load balancing by table does not handle the case where a table's region count is lower than the number of the RS in the cluster

It was fixed in 0.94.3

Cheers
Re: hbase-0.94.6.1 balancer issue
Hi, J-D

Well, at this moment i have that edge case with only one region per table :). Like i said, i was using 0.90 for a long time and regions were distributed evenly over all RSs regardless of the region-per-table ratio. Here is what confused me (like i said, i have a 2-node cluster in distributed mode):

start-hbase --> tables (regions) are distributed evenly on the two RSs (as expected)
stop one RS --> all tables (regions) are moved to the remaining RS (as expected)
start the RS that was down, run balancer --> LOG: 2013-04-12 19:47:20,725 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=2 regions=1 average=0.5 mostloaded=1 leastloaded=0 --> all tables (regions) stayed on one server (this is what i did not expect?) :)

Here is part of status 'detailed' from the shell after i started the RS that was down and ran the balancer:

hbase(main):001:0> status 'detailed'
version 0.94.6.1
0 regionsInTransition
master coprocessors: []
2 live servers
    172.17.33.2:60020 1365787755294
        requestsPerSecond=0, numberOfOnlineRegions=0, usedHeapMB=38, maxHeapMB=3487
    172.17.33.3:60020 1365777858778
        requestsPerSecond=0, numberOfOnlineRegions=49, usedHeapMB=53, maxHeapMB=3487

So because i have 1 region per table, the regions were not rebalanced after starting the RS that was down?

Thanks
Samir
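Samir's lopsided end state follows from applying that per-table check once per table. The sketch below is plain Python, not HBase code, and the floor/ceil rule is an assumption based on the "mostloaded"/"leastloaded" log values; it shows how 49 individually "balanced" one-region tables can still leave one server holding everything:

```python
# If every table has a single region, each per-table pass sees
# "mostloaded=1 leastloaded=0", declares the table balanced, and no
# region ever moves, even though one server holds all 49 regions.
from math import ceil, floor

def moves_for_cluster(tables):
    """tables: one load list per table, with one entry per server.
    Returns how many tables would trigger any rebalancing."""
    needs_move = 0
    for loads in tables:
        avg = sum(loads) / len(loads)
        if max(loads) > ceil(avg) or min(loads) < floor(avg):
            needs_move += 1
    return needs_move

# 49 one-region tables, all on server 0 of a 2-server cluster:
tables = [[1, 0] for _ in range(49)]
print(moves_for_cluster(tables))                          # 0 tables move
print(sum(t[0] for t in tables), sum(t[1] for t in tables))  # 49 vs 0 regions
```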
Re: hbase-0.94.6.1 balancer issue
HBASE-7060 explains my case. I'm using 0.94.6.1 and it looks like the issue is still present. Thanks for replying, guys. Cheers :)
Re: hbase-0.94.6.1 balancer issue
bq. looks like issue is still present.

I think the issue expressed in HBASE-7060 is slightly different:

bq. For example, the cluster has 100 RS, the table has 50 regions sitting on one RS,

Note: one table had 50 regions