date:20101123

[jira] Updated: (HBASE-3263) Stack overflow in AssignmentManager

2010-11-23 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HBASE-3263:
---

Attachment: stackoverflow-log.txt

Here's a log showing the beginning of the runaway recursion. It goes like this 
until it gets a stack overflow error.

 Stack overflow in AssignmentManager
 ---

 Key: HBASE-3263
 URL: https://issues.apache.org/jira/browse/HBASE-3263
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Blocker
 Attachments: stackoverflow-log.txt


 My test cluster experienced a switch outage earlier this week which threw the 
 master into a really bad state. In the catch clause of 
 AssignmentManager.assign, we recurse, and if all of the region servers are 
 inaccessible, we do so until we get a stack overflow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Created: (HBASE-3263) Stack overflow in AssignmentManager

2010-11-23 Thread Todd Lipcon (JIRA)

Stack overflow in AssignmentManager
---

 Key: HBASE-3263
 URL: https://issues.apache.org/jira/browse/HBASE-3263
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Blocker
 Attachments: stackoverflow-log.txt

My test cluster experienced a switch outage earlier this week which threw the 
master into a really bad state. In the catch clause of 
AssignmentManager.assign, we recurse, and if all of the region servers are 
inaccessible, we do so until we get a stack overflow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Created: (HBASE-3264) Remove unnecessary Guava Dependency

2010-11-23 Thread Nicolas Spiegelberg (JIRA)

Remove unnecessary Guava Dependency
---

 Key: HBASE-3264
 URL: https://issues.apache.org/jira/browse/HBASE-3264
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Reporter: Nicolas Spiegelberg
Assignee: Nicolas Spiegelberg
Priority: Minor
 Fix For: 0.90.1
 Attachments: HBASE-3264.patch

Currently, TableMapReduceUtil uses Guava for trivial functionality and 
addDependencyJars() currently adds Guava by default.  However, this jar is only 
necessary for the ImportTsv MR job.  This is annoying when naively bundling 
hbase jar with a MR job because you now need a second dependency jar.  Should 
default bundle with only critical dependencies and have jobs that need fancy 
Guava functionality explicitly include them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager

2010-11-23 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12934769#action_12934769
 ] 

Todd Lipcon commented on HBASE-3263:


Shortly after the StackOverflowError it also started spitting this exception:

2010-11-19 12:09:50,366 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Failed assignment of usertable,,1289960558114.03110b4c3c0b24fa1c920ec7669d03a6. 
to serverName=haus03.sf.cloudera.com,60020,1289890926773, load=(requests=0, 
regions=11, usedHeap=5403, maxHeap=8185), trying to assign elsewhere instead
java.lang.NullPointerException
  at 
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:485)
  at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:733)
  at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
  at $Proxy8.openRegion(Unknown Source)
  at 
org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:537)
  at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:830)


 Stack overflow in AssignmentManager
 ---

 Key: HBASE-3263
 URL: https://issues.apache.org/jira/browse/HBASE-3263
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Blocker
 Attachments: stackoverflow-log.txt


 My test cluster experienced a switch outage earlier this week which threw the 
 master into a really bad state. In the catch clause of 
 AssignmentManager.assign, we recurse, and if all of the region servers are 
 inaccessible, we do so until we get a stack overflow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-3264) Remove unnecessary Guava Dependency

2010-11-23 Thread Nicolas Spiegelberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Spiegelberg updated HBASE-3264:
---

Attachment: HBASE-3264.patch

 Remove unnecessary Guava Dependency
 ---

 Key: HBASE-3264
 URL: https://issues.apache.org/jira/browse/HBASE-3264
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Reporter: Nicolas Spiegelberg
Assignee: Nicolas Spiegelberg
Priority: Minor
 Fix For: 0.90.1

 Attachments: HBASE-3264.patch


 Currently, TableMapReduceUtil uses Guava for trivial functionality and 
 addDependencyJars() currently adds Guava by default.  However, this jar is 
 only necessary for the ImportTsv MR job.  This is annoying when naively 
 bundling hbase jar with a MR job because you now need a second dependency 
 jar.  Should default bundle with only critical dependencies and have jobs 
 that need fancy Guava functionality explicitly include them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager

2010-11-23 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12934770#action_12934770
 ] 

Todd Lipcon commented on HBASE-3263:


And also thereafter lots of these:

java.lang.NullPointerException
  at 
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:485)
  at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:733)
  at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
  at $Proxy8.getRegionInfo(Unknown Source)
  at 
org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:416)
  at 
org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:270)
  at 
org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:322)

So somehow we borked a null into one of our maps, it seems

 Stack overflow in AssignmentManager
 ---

 Key: HBASE-3263
 URL: https://issues.apache.org/jira/browse/HBASE-3263
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Blocker
 Attachments: stackoverflow-log.txt


 My test cluster experienced a switch outage earlier this week which threw the 
 master into a really bad state. In the catch clause of 
 AssignmentManager.assign, we recurse, and if all of the region servers are 
 inaccessible, we do so until we get a stack overflow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Created: (HBASE-3265) Regionservers waiting for ROOT while Master waiting for RegionServers

2010-11-23 Thread Todd Lipcon (JIRA)

Regionservers waiting for ROOT while Master waiting for RegionServers
-

 Key: HBASE-3265
 URL: https://issues.apache.org/jira/browse/HBASE-3265
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Critical


After a cluster disastrophe due to a disconnected switch, I ended up in a state 
where the master was up with no region servers (see HBASE-3263). When I brought 
the RS back up, because of the aforementioned bug, the master didn't get itself 
into a happy state (internal datastructure had some null in it). So I killed 
the master and started it again. Now, the master is in Waiting for region 
servers to check in mode, and the region servers are in the following stack:

- locked 0x2aaab1bda5d0 (a 
org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
at 
org.apache.hadoop.hbase.catalog.CatalogTracker.waitForRoot(CatalogTracker.java:177)
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:537)
at java.lang.Thread.run(Thread.java:619)

I imagine what happened is that the RS got through tryReportForDuty with the 
old master, but the old master was unable to assign anything due to bad state. 
So, when it crashed, all the RS were stuck in waitForRoot(), and when I brought 
the new one up, no one was reporting for duty.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3264) Remove unnecessary Guava Dependency

2010-11-23 Thread Nicolas Spiegelberg (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12934776#action_12934776
]

Nicolas Spiegelberg commented on HBASE-3264:

@Todd: you're not precluded from adding Guava or whatever libraries to this,
but I don't think the default action should be to add libraries that you're not
using. Guava is currently the only dependency under addDependencyJars(Job)
that is not essential for basic HBase table operations. Since
addDependencyJars(conf, ...) allows concatenation, you can easily append jars
that are necessary for your specific config. We need to use that ourselves to
add in compression jars for HFileOutputFormat. Note that I used this api to
change the ImportTsv job to append the Guava jar, since it is the job that
requires it right now.

Remove unnecessary Guava Dependency
---

Key: HBASE-3264
URL: https://issues.apache.org/jira/browse/HBASE-3264
Project: HBase
Issue Type: Bug
Components: mapreduce
Reporter: Nicolas Spiegelberg
Assignee: Nicolas Spiegelberg
Priority: Minor
Fix For: 0.90.1

Attachments: HBASE-3264.patch

Currently, TableMapReduceUtil uses Guava for trivial functionality and
addDependencyJars() currently adds Guava by default. However, this jar is
only necessary for the ImportTsv MR job. This is annoying when naively
bundling hbase jar with a MR job because you now need a second dependency
jar. Should default bundle with only critical dependencies and have jobs
that need fancy Guava functionality explicitly include them.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Created: (HBASE-3266) Master does not seem to properly scan ZK for running RS during startup

2010-11-23 Thread Todd Lipcon (JIRA)

Master does not seem to properly scan ZK for running RS during startup
--

 Key: HBASE-3266
 URL: https://issues.apache.org/jira/browse/HBASE-3266
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Critical


I was in the situation described by HBASE-3265, where I had a number of RS 
waiting on ROOT, but the master hadn't seen any RS checkins, so was waiting on 
checkins. To get past this, I restarted one of the region servers. The 
restarted server checked in, and the master began its startup.
At this point the master started scanning /hbase/.logs for things to split. It 
correctly identified that the RS on haus01 was running (this is the one I 
restarted):

2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
Log folder 
hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus01.sf.cloudera.com,60020,1290500443143
 belongs to an existing region server

but then incorrectly decided that the RS on haus02 was down:

2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
Log folder 
hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus02.sf.cloudera.com,60020,1290498411450
 doesn't belong to a known region server, splitting

However ZK shows that this RS is up:
[zk: haus01.sf.cloudera.com:(CONNECTED) 3] ls /hbase/rs
[haus04.sf.cloudera.com,60020,1290498411533, 
haus05.sf.cloudera.com,60020,1290498411520, 
haus03.sf.cloudera.com,60020,1290498411518, 
haus01.sf.cloudera.com,60020,1290500443143, 
haus02.sf.cloudera.com,60020,1290498411450]

splitLogsAfterStartup seems to check ServerManager.onlineServers, which best I 
can tell is derived from heartbeats and not from ZK (sorry if I got some of 
this wrong, still new to this new codebase)

Of course, the master went into an infinite splitting loop at this point since 
haus02 is up and renewing its DFS lease on its logs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Created: (HBASE-3267) close_region shell command breaks region

2010-11-23 Thread Todd Lipcon (JIRA)

close_region shell command breaks region


 Key: HBASE-3267
 URL: https://issues.apache.org/jira/browse/HBASE-3267
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver, shell
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Critical


It used to be that you could use the close_region command from the shell to 
close a region on one server and have the master reassign it elsewhere. Now if 
you close a region, you get the following errors in the master log:

2010-11-23 00:46:34,090 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Received CLOSING for region ffaa7999e909dbd6544688cc8ab303bd from server 
haus01.sf.cloudera.com,12020,1290501789693 but region was in  the state null 
and not in expected PENDI
2010-11-23 00:46:34,530 DEBUG 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: 
master:6-0x12c537d84e10062 Received ZooKeeper Event, type=NodeDataChanged, 
state=SyncConnected, path=/hbase/unassigned/ffaa7999e909dbd6544688cc8ab303bd
2010-11-23 00:46:34,531 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x12c537d84e10062 Retrieved 128 byte(s) of data from znode 
/hbase/unassigned/ffaa7999e909dbd6544688cc8ab303bd and set watcher; 
region=usertable,user1951957302,1290501969
2010-11-23 00:46:34,531 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=RS_ZK_REGION_CLOSED, 
server=haus01.sf.cloudera.com,12020,1290501789693, 
region=ffaa7999e909dbd6544688cc8ab303bd
2010-11-23 00:46:34,531 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Received CLOSED for region ffaa7999e909dbd6544688cc8ab303bd from server 
haus01.sf.cloudera.com,12020,1290501789693 but region was in  the state null 
and not in expected PENDIN

and the region just gets stuck closed

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Created: (HBASE-3268) Auto-tune balance frequency based on cluster size

2010-11-23 Thread Todd Lipcon (JIRA)

Auto-tune balance frequency based on cluster size
-

 Key: HBASE-3268
 URL: https://issues.apache.org/jira/browse/HBASE-3268
 Project: HBase
  Issue Type: Improvement
  Components: master
Reporter: Todd Lipcon


Right now we only balance the cluster once every 5 minutes by default. This is 
likely to confuse new users. When you start a new region server, you expect it 
to pick up some load very quickly, but right now you have to wait 5 minutes for 
it to start doing anything in the worst case.

We could/should also add a button/shell command to trigger balance now

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3268) Auto-tune balance frequency based on cluster size

2010-11-23 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12934795#action_12934795
 ] 

Andrew Purtell commented on HBASE-3268:
---

+1

Actually I've been considering filing against this as a bug. I have been 
testing recently some heavy write scenarios that on current 0.90 pile regions 
on a single RS and can cause it to OOME before balancing happens. 

Perhaps at least the default should be 1 minute instead of 5?


 Auto-tune balance frequency based on cluster size
 -

 Key: HBASE-3268
 URL: https://issues.apache.org/jira/browse/HBASE-3268
 Project: HBase
  Issue Type: Improvement
  Components: master
Reporter: Todd Lipcon

 Right now we only balance the cluster once every 5 minutes by default. This 
 is likely to confuse new users. When you start a new region server, you 
 expect it to pick up some load very quickly, but right now you have to wait 5 
 minutes for it to start doing anything in the worst case.
 We could/should also add a button/shell command to trigger balance now

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3268) Auto-tune balance frequency based on cluster size

2010-11-23 Thread Jonathan Gray (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12934915#action_12934915
 ] 

Jonathan Gray commented on HBASE-3268:
--

I think rather than needing to do much more frequent load balances or just in 
addition to them, we should add more intelligence into non-balancing region 
assignment (all random now).  And we could also lazily move splits off their 
original server.

But for now making it more aggressive at one minute or so should be fine.

 Auto-tune balance frequency based on cluster size
 -

 Key: HBASE-3268
 URL: https://issues.apache.org/jira/browse/HBASE-3268
 Project: HBase
  Issue Type: Improvement
  Components: master
Reporter: Todd Lipcon

 Right now we only balance the cluster once every 5 minutes by default. This 
 is likely to confuse new users. When you start a new region server, you 
 expect it to pick up some load very quickly, but right now you have to wait 5 
 minutes for it to start doing anything in the worst case.
 We could/should also add a button/shell command to trigger balance now

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3268) Auto-tune balance frequency based on cluster size

2010-11-23 Thread Jonathan Gray (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12934918#action_12934918
 ] 

Jonathan Gray commented on HBASE-3268:
--

Also, when the master gets a new regionserver on an already running cluster, it 
should automatically trigger a balance.

 Auto-tune balance frequency based on cluster size
 -

 Key: HBASE-3268
 URL: https://issues.apache.org/jira/browse/HBASE-3268
 Project: HBase
  Issue Type: Improvement
  Components: master
Reporter: Todd Lipcon

 Right now we only balance the cluster once every 5 minutes by default. This 
 is likely to confuse new users. When you start a new region server, you 
 expect it to pick up some load very quickly, but right now you have to wait 5 
 minutes for it to start doing anything in the worst case.
 We could/should also add a button/shell command to trigger balance now

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3265) Regionservers waiting for ROOT while Master waiting for RegionServers

2010-11-23 Thread Jonathan Gray (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12934921#action_12934921
 ] 

Jonathan Gray commented on HBASE-3265:
--

The RSs should also be heartbeating in to the master as well.  Can you post 
full stack dumps from one of the stuck RS and the master?

 Regionservers waiting for ROOT while Master waiting for RegionServers
 -

 Key: HBASE-3265
 URL: https://issues.apache.org/jira/browse/HBASE-3265
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Critical

 After a cluster disastrophe due to a disconnected switch, I ended up in a 
 state where the master was up with no region servers (see HBASE-3263). When I 
 brought the RS back up, because of the aforementioned bug, the master didn't 
 get itself into a happy state (internal datastructure had some null in it). 
 So I killed the master and started it again. Now, the master is in Waiting 
 for region servers to check in mode, and the region servers are in the 
 following stack:
 - locked 0x2aaab1bda5d0 (a 
 org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
 at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.waitForRoot(CatalogTracker.java:177)
 at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:537)
 at java.lang.Thread.run(Thread.java:619)
 I imagine what happened is that the RS got through tryReportForDuty with 
 the old master, but the old master was unable to assign anything due to bad 
 state. So, when it crashed, all the RS were stuck in waitForRoot(), and when 
 I brought the new one up, no one was reporting for duty.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3266) Master does not seem to properly scan ZK for running RS during startup

2010-11-23 Thread Jonathan Gray (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12934924#action_12934924
 ] 

Jonathan Gray commented on HBASE-3266:
--

Yeah, I think as it is currently the HMaster is using the startup/heartbeat 
messages to determine which RS are online.  As I commented in the other jira, 
we should see why they were not doing so.

We should do some reconciliation between what we find in ZK and what we think 
is online based on RPCs, but not sure exactly what course we would take in a 
state like this.

 Master does not seem to properly scan ZK for running RS during startup
 --

 Key: HBASE-3266
 URL: https://issues.apache.org/jira/browse/HBASE-3266
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Critical

 I was in the situation described by HBASE-3265, where I had a number of RS 
 waiting on ROOT, but the master hadn't seen any RS checkins, so was waiting 
 on checkins. To get past this, I restarted one of the region servers. The 
 restarted server checked in, and the master began its startup.
 At this point the master started scanning /hbase/.logs for things to split. 
 It correctly identified that the RS on haus01 was running (this is the one I 
 restarted):
 2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
 Log folder 
 hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus01.sf.cloudera.com,60020,1290500443143
  belongs to an existing region server
 but then incorrectly decided that the RS on haus02 was down:
 2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
 Log folder 
 hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus02.sf.cloudera.com,60020,1290498411450
  doesn't belong to a known region server, splitting
 However ZK shows that this RS is up:
 [zk: haus01.sf.cloudera.com:(CONNECTED) 3] ls /hbase/rs
 [haus04.sf.cloudera.com,60020,1290498411533, 
 haus05.sf.cloudera.com,60020,1290498411520, 
 haus03.sf.cloudera.com,60020,1290498411518, 
 haus01.sf.cloudera.com,60020,1290500443143, 
 haus02.sf.cloudera.com,60020,1290498411450]
 splitLogsAfterStartup seems to check ServerManager.onlineServers, which best 
 I can tell is derived from heartbeats and not from ZK (sorry if I got some of 
 this wrong, still new to this new codebase)
 Of course, the master went into an infinite splitting loop at this point 
 since haus02 is up and renewing its DFS lease on its logs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3267) close_region shell command breaks region

2010-11-23 Thread Jonathan Gray (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12934925#action_12934925
 ] 

Jonathan Gray commented on HBASE-3267:
--

The master doesn't expect a region to be properly closed out on an RS w/o being 
the one to tell it to do so.

Let me dig in to the code and see what the easiest solution would be.

@Todd... I had plans to start working on new features for 0.92, stop finding 
bugs!  ;)

 close_region shell command breaks region
 

 Key: HBASE-3267
 URL: https://issues.apache.org/jira/browse/HBASE-3267
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver, shell
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Critical

 It used to be that you could use the close_region command from the shell to 
 close a region on one server and have the master reassign it elsewhere. Now 
 if you close a region, you get the following errors in the master log:
 2010-11-23 00:46:34,090 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSING for region 
 ffaa7999e909dbd6544688cc8ab303bd from server 
 haus01.sf.cloudera.com,12020,1290501789693 but region was in  the state null 
 and not in expected PENDI
 2010-11-23 00:46:34,530 DEBUG 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: 
 master:6-0x12c537d84e10062 Received ZooKeeper Event, 
 type=NodeDataChanged, state=SyncConnected, 
 path=/hbase/unassigned/ffaa7999e909dbd6544688cc8ab303bd
 2010-11-23 00:46:34,531 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
 master:6-0x12c537d84e10062 Retrieved 128 byte(s) of data from znode 
 /hbase/unassigned/ffaa7999e909dbd6544688cc8ab303bd and set watcher; 
 region=usertable,user1951957302,1290501969
 2010-11-23 00:46:34,531 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_CLOSED, 
 server=haus01.sf.cloudera.com,12020,1290501789693, 
 region=ffaa7999e909dbd6544688cc8ab303bd
 2010-11-23 00:46:34,531 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
 ffaa7999e909dbd6544688cc8ab303bd from server 
 haus01.sf.cloudera.com,12020,1290501789693 but region was in  the state null 
 and not in expected PENDIN
 and the region just gets stuck closed

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3268) Auto-tune balance frequency based on cluster size

2010-11-23 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12934927#action_12934927
 ] 

Todd Lipcon commented on HBASE-3268:


I think the idea of triggering balance when we get a new server is a good one.

One thing we want to be a little careful of is the situation when someone flips 
on 10 new servers at the same time. Rather than triggering a rebalance for 
each (and thus lots of churn), we want a little bit of lag before the rebalance.

Maybe when a new server is added, we trigger the rebalance in 5-10 seconds?

 Auto-tune balance frequency based on cluster size
 -

 Key: HBASE-3268
 URL: https://issues.apache.org/jira/browse/HBASE-3268
 Project: HBase
  Issue Type: Improvement
  Components: master
Reporter: Todd Lipcon

 Right now we only balance the cluster once every 5 minutes by default. This 
 is likely to confuse new users. When you start a new region server, you 
 expect it to pick up some load very quickly, but right now you have to wait 5 
 minutes for it to start doing anything in the worst case.
 We could/should also add a button/shell command to trigger balance now

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3268) Auto-tune balance frequency based on cluster size

2010-11-23 Thread Jonathan Gray (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12934938#action_12934938
 ] 

Jonathan Gray commented on HBASE-3268:
--

Something like that sounds reasonable.  I'm trying to figure some of these 
other issues now so this is up for grabs.

 Auto-tune balance frequency based on cluster size
 -

 Key: HBASE-3268
 URL: https://issues.apache.org/jira/browse/HBASE-3268
 Project: HBase
  Issue Type: Improvement
  Components: master
Reporter: Todd Lipcon

 Right now we only balance the cluster once every 5 minutes by default. This 
 is likely to confuse new users. When you start a new region server, you 
 expect it to pick up some load very quickly, but right now you have to wait 5 
 minutes for it to start doing anything in the worst case.
 We could/should also add a button/shell command to trigger balance now

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-3262) TestHMasterRPCException uses non-ephemeral port for master

2010-11-23 Thread Jonathan Gray (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Gray updated HBASE-3262:
-

Fix Version/s: 0.92.0
   Status: Patch Available  (was: Open)

 TestHMasterRPCException uses non-ephemeral port for master
 --

 Key: HBASE-3262
 URL: https://issues.apache.org/jira/browse/HBASE-3262
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.0
Reporter: Jonathan Gray
Assignee: Jonathan Gray
 Fix For: 0.90.0, 0.92.0

 Attachments: HBASE-3262-v1.patch


 TestHMasterRPCException instantiates an HMaster but doesn't use an ephemeral 
 port which can cause the test to fail if port already in use.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3264) Remove unnecessary Guava Dependency

2010-11-23 Thread Jonathan Gray (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12934954#action_12934954
 ] 

Jonathan Gray commented on HBASE-3264:
--

I'm fine with adding dependencies/libraries when we are using them, but in 
general think we should also aim to minimize our dependencies.

So I'm +1 for removing an additional dependency from the job if it's trivial to 
remove it.

I'm also +1 on a complete client-dep jar if we could get maven to do that for 
us.

 Remove unnecessary Guava Dependency
 ---

 Key: HBASE-3264
 URL: https://issues.apache.org/jira/browse/HBASE-3264
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Reporter: Nicolas Spiegelberg
Assignee: Nicolas Spiegelberg
Priority: Minor
 Fix For: 0.90.1

 Attachments: HBASE-3264.patch


 Currently, TableMapReduceUtil uses Guava for trivial functionality and 
 addDependencyJars() currently adds Guava by default.  However, this jar is 
 only necessary for the ImportTsv MR job.  This is annoying when naively 
 bundling hbase jar with a MR job because you now need a second dependency 
 jar.  Should default bundle with only critical dependencies and have jobs 
 that need fancy Guava functionality explicitly include them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-3256) Coprocessors: Coprocessor host and observer for HMaster

2010-11-23 Thread Gary Helmling (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-3256:
-

Attachment: HBASE-3256_initial.patch

Here's a preview version of the patch adding the MasterObserver interface and 
related changes and refactorings.  The final version of this is waiting on an 
implementation of the lifecycle hooks for HBASE-3260.  Once I complete those 
changes, I will merge here, add unit tests and put the final version of this 
patch up on review board.

 Coprocessors: Coprocessor host and observer for HMaster
 ---

 Key: HBASE-3256
 URL: https://issues.apache.org/jira/browse/HBASE-3256
 Project: HBase
  Issue Type: Sub-task
Reporter: Andrew Purtell
Assignee: Gary Helmling
 Fix For: 0.92.0

 Attachments: HBASE-3256_initial.patch


 Implement a coprocessor host for HMaster. Hook observers into administrative 
 operations performed on tables: create, alter, assignment, load balance, and 
 allow observers to modify base master behavior. Support automatic loading of 
 coprocessor implementation. 
 Consider refactoring the master coprocessor host and regionserver coprocessor 
 host into a common base class. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Created: (HBASE-3269) HBase table truncate semantics seems broken as disable table is now async by default.

2010-11-23 Thread Suraj Varma (JIRA)

HBase table truncate semantics seems broken as disable table is now async by 
default.
---

 Key: HBASE-3269
 URL: https://issues.apache.org/jira/browse/HBASE-3269
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.0
 Environment: RHEL5 x86_64
Reporter: Suraj Varma
Priority: Critical


The new async design for disable table seems to have caused a side effect on 
the truncate command. (IRC chat with jdcryans)

Apparent Cause: 
Disable is now async by default. When truncate is called, the disable 
operation returns immediately and when the drop is called, the disable 
operation is still not completed. This results in 
HMaster.checkTableModifiable() throwing a TableNotDisabledException.

With earlier versions, disable returned only after Table was disabled.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-3269) HBase table truncate semantics seems broken as disable table is now async by default.

2010-11-23 Thread Jean-Daniel Cryans (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated HBASE-3269:
--

Fix Version/s: 0.92.0
   0.90.0
 Assignee: stack

Assigning to Stack and marking it against 0.90

 HBase table truncate semantics seems broken as disable table is now async 
 by default.
 ---

 Key: HBASE-3269
 URL: https://issues.apache.org/jira/browse/HBASE-3269
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.0
 Environment: RHEL5 x86_64
Reporter: Suraj Varma
Assignee: stack
Priority: Critical
 Fix For: 0.90.0, 0.92.0


 The new async design for disable table seems to have caused a side effect on 
 the truncate command. (IRC chat with jdcryans)
 Apparent Cause: 
 Disable is now async by default. When truncate is called, the disable 
 operation returns immediately and when the drop is called, the disable 
 operation is still not completed. This results in 
 HMaster.checkTableModifiable() throwing a TableNotDisabledException.
 With earlier versions, disable returned only after Table was disabled.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-3227) Edit of log messages before branching...

2010-11-23 Thread HBase Review Board (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12934996#action_12934996
 ] 

HBase Review Board commented on HBASE-3227:
---

Message from: st...@duboce.net


bq.  On 2010-11-22 17:29:45, Nicolas wrote:
bq.   trunk/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java, 
line 739
bq.   http://review.cloudera.org/r/1212/diff/1/?file=17170#file17170line739
bq.  
bq.   I'd suggest keeping the store name in this debug message since we're 
considering thread pools for compactions...

Won't the store name be part of the path on the next line when we do 
sf.toString() where sf is the file we're compacting all into?


- stack


---
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1212/#review1971
---





 Edit of log messages before branching...
 

 Key: HBASE-3227
 URL: https://issues.apache.org/jira/browse/HBASE-3227
 Project: HBase
  Issue Type: Improvement
Reporter: stack
 Fix For: 0.90.0




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

71 matches

Mail list logo