[jira] Updated: (HBASE-3263) Stack overflow in AssignmentManager

2010-11-23 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HBASE-3263:
---

Attachment: stackoverflow-log.txt

Here's a log showing the beginning of the runaway recursion. It goes like this 
until it gets a stack overflow error.

 Stack overflow in AssignmentManager
 ---

 Key: HBASE-3263
 URL: https://issues.apache.org/jira/browse/HBASE-3263
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Blocker
 Attachments: stackoverflow-log.txt


 My test cluster experienced a switch outage earlier this week which threw the 
 master into a really bad state. In the catch clause of 
 AssignmentManager.assign, we recurse, and if all of the region servers are 
 inaccessible, we do so until we get a stack overflow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HBASE-3263) Stack overflow in AssignmentManager

2010-11-23 Thread Todd Lipcon (JIRA)
Stack overflow in AssignmentManager
---

 Key: HBASE-3263
 URL: https://issues.apache.org/jira/browse/HBASE-3263
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Blocker
 Attachments: stackoverflow-log.txt

My test cluster experienced a switch outage earlier this week which threw the 
master into a really bad state. In the catch clause of 
AssignmentManager.assign, we recurse, and if all of the region servers are 
inaccessible, we do so until we get a stack overflow.
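The runaway recursion described above can be avoided by making the retry iterative and bounded. The sketch below is hypothetical: the real AssignmentManager.assign signature and retry policy are different, and RetryLoopSketch/sendRegionOpen are made-up names. It only illustrates the control-flow fix of looping in place of calling yourself from a catch clause, which keeps stack depth constant no matter how many region servers are down.

```java
// Hypothetical sketch, not the actual AssignmentManager code.
public class RetryLoopSketch {
    static final int MAX_RETRIES = 10;

    // Stand-in for the RPC that fails when every region server is
    // unreachable, as after a switch outage.
    static void sendRegionOpen() throws Exception {
        throw new Exception("connection refused");
    }

    // Iterative retry: bounded attempts, constant stack depth.
    static boolean assign() {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                sendRegionOpen();
                return true;
            } catch (Exception e) {
                // Log and fall through to the next loop iteration instead
                // of recursing into assign(), which grows the stack until
                // a StackOverflowError.
            }
        }
        return false; // give up; the caller can requeue the region
    }

    public static void main(String[] args) {
        System.out.println(assign() ? "assigned" : "gave up");
    }
}
```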

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HBASE-3264) Remove unnecessary Guava Dependency

2010-11-23 Thread Nicolas Spiegelberg (JIRA)
Remove unnecessary Guava Dependency
---

 Key: HBASE-3264
 URL: https://issues.apache.org/jira/browse/HBASE-3264
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Reporter: Nicolas Spiegelberg
Assignee: Nicolas Spiegelberg
Priority: Minor
 Fix For: 0.90.1
 Attachments: HBASE-3264.patch

Currently, TableMapReduceUtil uses Guava for trivial functionality and 
addDependencyJars() adds Guava by default.  However, this jar is only 
necessary for the ImportTsv MR job.  This is annoying when naively bundling the 
hbase jar with a MR job because you now need a second dependency jar.  We 
should bundle only critical dependencies by default and have jobs that need 
fancy Guava functionality include it explicitly.
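The direction proposed here, appending only the jars a given job actually needs, can be sketched with plain JDK calls. The helper names below (DependencyJarsSketch, appendJars, jarFor) are invented for illustration and are not the TableMapReduceUtil API; they mimic how a comma-separated "tmpjars"-style value could be built per job.

```java
import java.security.CodeSource;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of per-job dependency selection.
public class DependencyJarsSketch {
    // Resolve the classpath location that provides the given class.
    static String jarFor(Class<?> clazz) {
        CodeSource src = clazz.getProtectionDomain().getCodeSource();
        return src == null ? clazz.getName() : src.getLocation().toString();
    }

    // Append jar paths for the given classes to a comma-separated list,
    // deduplicating while preserving order.
    static String appendJars(String existing, Class<?>... classes) {
        Set<String> jars = new LinkedHashSet<>();
        if (!existing.isEmpty()) {
            for (String j : existing.split(",")) jars.add(j);
        }
        for (Class<?> c : classes) jars.add(jarFor(c));
        return String.join(",", jars);
    }

    public static void main(String[] args) {
        // A job that needs Guava would pass a Guava class here; core
        // table jobs would pass nothing extra.
        System.out.println(appendJars("", DependencyJarsSketch.class));
    }
}
```

The point of the design is that the default path carries no optional jars; only ImportTsv (or any other Guava-using job) opts in.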

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager

2010-11-23 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934769#action_12934769
 ] 

Todd Lipcon commented on HBASE-3263:


Shortly after the StackOverflowError it also started spitting this exception:

2010-11-19 12:09:50,366 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Failed assignment of usertable,,1289960558114.03110b4c3c0b24fa1c920ec7669d03a6. 
to serverName=haus03.sf.cloudera.com,60020,1289890926773, load=(requests=0, 
regions=11, usedHeap=5403, maxHeap=8185), trying to assign elsewhere instead
java.lang.NullPointerException
  at 
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:485)
  at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:733)
  at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
  at $Proxy8.openRegion(Unknown Source)
  at 
org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:537)
  at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:830)


 Stack overflow in AssignmentManager
 ---

 Key: HBASE-3263
 URL: https://issues.apache.org/jira/browse/HBASE-3263
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Blocker
 Attachments: stackoverflow-log.txt


 My test cluster experienced a switch outage earlier this week which threw the 
 master into a really bad state. In the catch clause of 
 AssignmentManager.assign, we recurse, and if all of the region servers are 
 inaccessible, we do so until we get a stack overflow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HBASE-3264) Remove unnecessary Guava Dependency

2010-11-23 Thread Nicolas Spiegelberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Spiegelberg updated HBASE-3264:
---

Attachment: HBASE-3264.patch

 Remove unnecessary Guava Dependency
 ---

 Key: HBASE-3264
 URL: https://issues.apache.org/jira/browse/HBASE-3264
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Reporter: Nicolas Spiegelberg
Assignee: Nicolas Spiegelberg
Priority: Minor
 Fix For: 0.90.1

 Attachments: HBASE-3264.patch


 Currently, TableMapReduceUtil uses Guava for trivial functionality and 
 addDependencyJars() adds Guava by default.  However, this jar is only 
 necessary for the ImportTsv MR job.  This is annoying when naively bundling 
 the hbase jar with a MR job because you now need a second dependency jar.  We 
 should bundle only critical dependencies by default and have jobs that need 
 fancy Guava functionality include it explicitly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager

2010-11-23 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934770#action_12934770
 ] 

Todd Lipcon commented on HBASE-3263:


And also thereafter lots of these:

java.lang.NullPointerException
  at 
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:485)
  at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:733)
  at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
  at $Proxy8.getRegionInfo(Unknown Source)
  at 
org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:416)
  at 
org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:270)
  at 
org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:322)

So somehow we borked a null into one of our maps, it seems.

 Stack overflow in AssignmentManager
 ---

 Key: HBASE-3263
 URL: https://issues.apache.org/jira/browse/HBASE-3263
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Blocker
 Attachments: stackoverflow-log.txt


 My test cluster experienced a switch outage earlier this week which threw the 
 master into a really bad state. In the catch clause of 
 AssignmentManager.assign, we recurse, and if all of the region servers are 
 inaccessible, we do so until we get a stack overflow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HBASE-3265) Regionservers waiting for ROOT while Master waiting for RegionServers

2010-11-23 Thread Todd Lipcon (JIRA)
Regionservers waiting for ROOT while Master waiting for RegionServers
-

 Key: HBASE-3265
 URL: https://issues.apache.org/jira/browse/HBASE-3265
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Critical


After a cluster disaster due to a disconnected switch, I ended up in a state 
where the master was up with no region servers (see HBASE-3263). When I brought 
the RS back up, because of the aforementioned bug, the master didn't get itself 
into a happy state (an internal data structure had a null in it). So I killed 
the master and started it again. Now, the master is in "Waiting for region 
servers to check in" mode, and the region servers are in the following stack:

- locked 0x2aaab1bda5d0 (a 
org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
at 
org.apache.hadoop.hbase.catalog.CatalogTracker.waitForRoot(CatalogTracker.java:177)
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:537)
at java.lang.Thread.run(Thread.java:619)

I imagine what happened is that the RS got through tryReportForDuty with the 
old master, but the old master was unable to assign anything due to bad state. 
So, when it crashed, all the RS were stuck in waitForRoot(), and when I brought 
the new one up, no one was reporting for duty.
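One mitigation consistent with this report is to bound the wait so a regionserver periodically drops out of the ROOT wait and can re-report for duty to whichever master is now running. The class below is a hypothetical sketch of a timed wait, not the actual CatalogTracker/RootRegionTracker code.

```java
// Hypothetical sketch: a bounded waitForRoot so the caller can retry
// reporting for duty instead of blocking forever on a dead master's state.
public class BoundedWaitSketch {
    private final Object lock = new Object();
    private String rootLocation = null;

    // Invoked when ZK tells us where ROOT lives.
    void setRootLocation(String loc) {
        synchronized (lock) {
            rootLocation = loc;
            lock.notifyAll();
        }
    }

    // Wait up to timeoutMillis for the ROOT location; returns null on
    // timeout so the caller can re-check in with the (possibly new) master.
    String waitForRoot(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        synchronized (lock) {
            while (rootLocation == null) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) return null;
                try {
                    lock.wait(remaining); // loop guards spurious wakeups
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return null;
                }
            }
            return rootLocation;
        }
    }
}
```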

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3264) Remove unnecessary Guava Dependency

2010-11-23 Thread Nicolas Spiegelberg (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934776#action_12934776
 ] 

Nicolas Spiegelberg commented on HBASE-3264:


@Todd: you're not precluded from adding Guava or whatever libraries to this, 
but I don't think the default action should be to add libraries that you're not 
using.  Guava is currently the only dependency under addDependencyJars(Job) 
that is not essential for basic HBase table operations.  Since 
addDependencyJars(conf, ...) allows concatenation, you can easily append jars 
that are necessary for your specific config.  We need to use that ourselves to 
add in compression jars for HFileOutputFormat.  Note that I used this API to 
change the ImportTsv job to append the Guava jar, since it is the job that 
requires it right now.

 Remove unnecessary Guava Dependency
 ---

 Key: HBASE-3264
 URL: https://issues.apache.org/jira/browse/HBASE-3264
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Reporter: Nicolas Spiegelberg
Assignee: Nicolas Spiegelberg
Priority: Minor
 Fix For: 0.90.1

 Attachments: HBASE-3264.patch


 Currently, TableMapReduceUtil uses Guava for trivial functionality and 
 addDependencyJars() adds Guava by default.  However, this jar is only 
 necessary for the ImportTsv MR job.  This is annoying when naively bundling 
 the hbase jar with a MR job because you now need a second dependency jar.  We 
 should bundle only critical dependencies by default and have jobs that need 
 fancy Guava functionality include it explicitly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HBASE-3266) Master does not seem to properly scan ZK for running RS during startup

2010-11-23 Thread Todd Lipcon (JIRA)
Master does not seem to properly scan ZK for running RS during startup
--

 Key: HBASE-3266
 URL: https://issues.apache.org/jira/browse/HBASE-3266
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Critical


I was in the situation described by HBASE-3265, where I had a number of RS 
waiting on ROOT, but the master hadn't seen any RS checkins, so was waiting on 
checkins. To get past this, I restarted one of the region servers. The 
restarted server checked in, and the master began its startup.
At this point the master started scanning /hbase/.logs for things to split. It 
correctly identified that the RS on haus01 was running (this is the one I 
restarted):

2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
Log folder 
hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus01.sf.cloudera.com,60020,1290500443143
 belongs to an existing region server

but then incorrectly decided that the RS on haus02 was down:

2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
Log folder 
hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus02.sf.cloudera.com,60020,1290498411450
 doesn't belong to a known region server, splitting

However ZK shows that this RS is up:
[zk: haus01.sf.cloudera.com:(CONNECTED) 3] ls /hbase/rs
[haus04.sf.cloudera.com,60020,1290498411533, 
haus05.sf.cloudera.com,60020,1290498411520, 
haus03.sf.cloudera.com,60020,1290498411518, 
haus01.sf.cloudera.com,60020,1290500443143, 
haus02.sf.cloudera.com,60020,1290498411450]

splitLogsAfterStartup seems to check ServerManager.onlineServers, which as best 
I can tell is derived from heartbeats and not from ZK (sorry if I got some of 
this wrong, I'm still new to this codebase).

Of course, the master went into an infinite splitting loop at this point since 
haus02 is up and renewing its DFS lease on its logs.
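The reconciliation suggested above could look roughly like this: treat a log directory as splittable only when its server is absent from both the heartbeat-derived online set and the ZK /hbase/rs listing. This is a hypothetical sketch, not the MasterFileSystem.splitLogsAfterStartup code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of consulting both liveness sources before splitting.
public class SplitDecisionSketch {
    // Split a server's logs only when neither source of truth knows it:
    // splitting logs out from under a live RS causes the infinite loop
    // seen here, since the RS keeps renewing its DFS lease.
    static boolean shouldSplit(String logDirServer,
                               Set<String> heartbeatOnline,
                               Set<String> zkRegistered) {
        return !heartbeatOnline.contains(logDirServer)
            && !zkRegistered.contains(logDirServer);
    }

    public static void main(String[] args) {
        Set<String> heartbeat = new HashSet<>(Arrays.asList("haus01"));
        Set<String> zk = new HashSet<>(Arrays.asList("haus01", "haus02"));
        // haus02 only checked in with the old master, but ZK says it is
        // live, so its logs must not be split.
        System.out.println(shouldSplit("haus02", heartbeat, zk));
    }
}
```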

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HBASE-3267) close_region shell command breaks region

2010-11-23 Thread Todd Lipcon (JIRA)
close_region shell command breaks region


 Key: HBASE-3267
 URL: https://issues.apache.org/jira/browse/HBASE-3267
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver, shell
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Critical


It used to be that you could use the close_region command from the shell to 
close a region on one server and have the master reassign it elsewhere. Now if 
you close a region, you get the following errors in the master log:

2010-11-23 00:46:34,090 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Received CLOSING for region ffaa7999e909dbd6544688cc8ab303bd from server 
haus01.sf.cloudera.com,12020,1290501789693 but region was in  the state null 
and not in expected PENDI
2010-11-23 00:46:34,530 DEBUG 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: 
master:6-0x12c537d84e10062 Received ZooKeeper Event, type=NodeDataChanged, 
state=SyncConnected, path=/hbase/unassigned/ffaa7999e909dbd6544688cc8ab303bd
2010-11-23 00:46:34,531 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:6-0x12c537d84e10062 Retrieved 128 byte(s) of data from znode 
/hbase/unassigned/ffaa7999e909dbd6544688cc8ab303bd and set watcher; 
region=usertable,user1951957302,1290501969
2010-11-23 00:46:34,531 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=RS_ZK_REGION_CLOSED, 
server=haus01.sf.cloudera.com,12020,1290501789693, 
region=ffaa7999e909dbd6544688cc8ab303bd
2010-11-23 00:46:34,531 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Received CLOSED for region ffaa7999e909dbd6544688cc8ab303bd from server 
haus01.sf.cloudera.com,12020,1290501789693 but region was in  the state null 
and not in expected PENDIN

and the region just gets stuck closed.
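The failure mode in the log is a CLOSED transition arriving for a region the master holds no in-memory state for (state null), so the event is warned about and dropped. A hypothetical sketch of a friendlier policy, adopting the unsolicited close and reassigning instead of leaving the region stuck; none of these names come from the actual AssignmentManager.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the actual AssignmentManager state machine.
public class TransitionSketch {
    enum State { PENDING_CLOSE, OFFLINE }

    final Map<String, State> regionStates = new HashMap<>();

    // Handle an RS_ZK_REGION_CLOSED event and report the action taken.
    String onRegionClosed(String region) {
        State s = regionStates.get(region);
        if (s == State.PENDING_CLOSE) {
            // Expected case: the master asked for this close.
            regionStates.put(region, State.OFFLINE);
            return "offline-and-reassign";
        }
        // Unsolicited close, e.g. close_region from the shell talking
        // straight to the RS: adopt it and reassign rather than dropping
        // the event and stranding the region.
        regionStates.put(region, State.OFFLINE);
        return "adopt-and-reassign";
    }
}
```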

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HBASE-3268) Auto-tune balance frequency based on cluster size

2010-11-23 Thread Todd Lipcon (JIRA)
Auto-tune balance frequency based on cluster size
-

 Key: HBASE-3268
 URL: https://issues.apache.org/jira/browse/HBASE-3268
 Project: HBase
  Issue Type: Improvement
  Components: master
Reporter: Todd Lipcon


Right now we only balance the cluster once every 5 minutes by default. This is 
likely to confuse new users. When you start a new region server, you expect it 
to pick up some load very quickly, but right now you have to wait 5 minutes for 
it to start doing anything in the worst case.

We could/should also add a button/shell command to trigger a balance now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3268) Auto-tune balance frequency based on cluster size

2010-11-23 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934795#action_12934795
 ] 

Andrew Purtell commented on HBASE-3268:
---

+1

Actually, I've been considering filing this as a bug. I have recently been 
testing some heavy write scenarios that, on current 0.90, pile regions on a 
single RS and can cause it to OOME before balancing happens. 

Perhaps at least the default should be 1 minute instead of 5?


 Auto-tune balance frequency based on cluster size
 -

 Key: HBASE-3268
 URL: https://issues.apache.org/jira/browse/HBASE-3268
 Project: HBase
  Issue Type: Improvement
  Components: master
Reporter: Todd Lipcon

 Right now we only balance the cluster once every 5 minutes by default. This 
 is likely to confuse new users. When you start a new region server, you 
 expect it to pick up some load very quickly, but right now you have to wait 5 
 minutes for it to start doing anything in the worst case.
 We could/should also add a button/shell command to trigger balance now

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3268) Auto-tune balance frequency based on cluster size

2010-11-23 Thread Jonathan Gray (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934915#action_12934915
 ] 

Jonathan Gray commented on HBASE-3268:
--

Rather than (or in addition to) much more frequent load balancing, I think we 
should add more intelligence to non-balancing region assignment (it's all 
random now).  And we could also lazily move splits off their original server.

But for now making it more aggressive at one minute or so should be fine.

 Auto-tune balance frequency based on cluster size
 -

 Key: HBASE-3268
 URL: https://issues.apache.org/jira/browse/HBASE-3268
 Project: HBase
  Issue Type: Improvement
  Components: master
Reporter: Todd Lipcon

 Right now we only balance the cluster once every 5 minutes by default. This 
 is likely to confuse new users. When you start a new region server, you 
 expect it to pick up some load very quickly, but right now you have to wait 5 
 minutes for it to start doing anything in the worst case.
 We could/should also add a button/shell command to trigger balance now

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3268) Auto-tune balance frequency based on cluster size

2010-11-23 Thread Jonathan Gray (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934918#action_12934918
 ] 

Jonathan Gray commented on HBASE-3268:
--

Also, when the master gets a new regionserver on an already running cluster, it 
should automatically trigger a balance.

 Auto-tune balance frequency based on cluster size
 -

 Key: HBASE-3268
 URL: https://issues.apache.org/jira/browse/HBASE-3268
 Project: HBase
  Issue Type: Improvement
  Components: master
Reporter: Todd Lipcon

 Right now we only balance the cluster once every 5 minutes by default. This 
 is likely to confuse new users. When you start a new region server, you 
 expect it to pick up some load very quickly, but right now you have to wait 5 
 minutes for it to start doing anything in the worst case.
 We could/should also add a button/shell command to trigger balance now

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3265) Regionservers waiting for ROOT while Master waiting for RegionServers

2010-11-23 Thread Jonathan Gray (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934921#action_12934921
 ] 

Jonathan Gray commented on HBASE-3265:
--

The RSs should also be heartbeating in to the master as well.  Can you post 
full stack dumps from one of the stuck RS and the master?

 Regionservers waiting for ROOT while Master waiting for RegionServers
 -

 Key: HBASE-3265
 URL: https://issues.apache.org/jira/browse/HBASE-3265
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Critical

 After a cluster disaster due to a disconnected switch, I ended up in a 
 state where the master was up with no region servers (see HBASE-3263). When I 
 brought the RS back up, because of the aforementioned bug, the master didn't 
 get itself into a happy state (an internal data structure had a null in it). 
 So I killed the master and started it again. Now, the master is in "Waiting 
 for region servers to check in" mode, and the region servers are in the 
 following stack:
 - locked 0x2aaab1bda5d0 (a 
 org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
 at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.waitForRoot(CatalogTracker.java:177)
 at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:537)
 at java.lang.Thread.run(Thread.java:619)
 I imagine what happened is that the RS got through tryReportForDuty with 
 the old master, but the old master was unable to assign anything due to bad 
 state. So, when it crashed, all the RS were stuck in waitForRoot(), and when 
 I brought the new one up, no one was reporting for duty.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3266) Master does not seem to properly scan ZK for running RS during startup

2010-11-23 Thread Jonathan Gray (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934924#action_12934924
 ] 

Jonathan Gray commented on HBASE-3266:
--

Yeah, I think that as it currently stands, the HMaster is using the 
startup/heartbeat messages to determine which RS are online.  As I commented in 
the other jira, we should see why they were not doing so.

We should do some reconciliation between what we find in ZK and what we think 
is online based on RPCs, but I'm not sure exactly what course we would take in 
a state like this.

 Master does not seem to properly scan ZK for running RS during startup
 --

 Key: HBASE-3266
 URL: https://issues.apache.org/jira/browse/HBASE-3266
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Critical

 I was in the situation described by HBASE-3265, where I had a number of RS 
 waiting on ROOT, but the master hadn't seen any RS checkins, so was waiting 
 on checkins. To get past this, I restarted one of the region servers. The 
 restarted server checked in, and the master began its startup.
 At this point the master started scanning /hbase/.logs for things to split. 
 It correctly identified that the RS on haus01 was running (this is the one I 
 restarted):
 2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
 Log folder 
 hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus01.sf.cloudera.com,60020,1290500443143
  belongs to an existing region server
 but then incorrectly decided that the RS on haus02 was down:
 2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
 Log folder 
 hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus02.sf.cloudera.com,60020,1290498411450
  doesn't belong to a known region server, splitting
 However ZK shows that this RS is up:
 [zk: haus01.sf.cloudera.com:(CONNECTED) 3] ls /hbase/rs
 [haus04.sf.cloudera.com,60020,1290498411533, 
 haus05.sf.cloudera.com,60020,1290498411520, 
 haus03.sf.cloudera.com,60020,1290498411518, 
 haus01.sf.cloudera.com,60020,1290500443143, 
 haus02.sf.cloudera.com,60020,1290498411450]
 splitLogsAfterStartup seems to check ServerManager.onlineServers, which best 
 I can tell is derived from heartbeats and not from ZK (sorry if I got some of 
 this wrong, still new to this new codebase)
 Of course, the master went into an infinite splitting loop at this point 
 since haus02 is up and renewing its DFS lease on its logs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3267) close_region shell command breaks region

2010-11-23 Thread Jonathan Gray (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934925#action_12934925
 ] 

Jonathan Gray commented on HBASE-3267:
--

The master doesn't expect a region to be properly closed out on an RS w/o being 
the one to tell it to do so.

Let me dig in to the code and see what the easiest solution would be.

@Todd... I had plans to start working on new features for 0.92, stop finding 
bugs!  ;)

 close_region shell command breaks region
 

 Key: HBASE-3267
 URL: https://issues.apache.org/jira/browse/HBASE-3267
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver, shell
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Critical

 It used to be that you could use the close_region command from the shell to 
 close a region on one server and have the master reassign it elsewhere. Now 
 if you close a region, you get the following errors in the master log:
 2010-11-23 00:46:34,090 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSING for region 
 ffaa7999e909dbd6544688cc8ab303bd from server 
 haus01.sf.cloudera.com,12020,1290501789693 but region was in  the state null 
 and not in expected PENDI
 2010-11-23 00:46:34,530 DEBUG 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: 
 master:6-0x12c537d84e10062 Received ZooKeeper Event, 
 type=NodeDataChanged, state=SyncConnected, 
 path=/hbase/unassigned/ffaa7999e909dbd6544688cc8ab303bd
 2010-11-23 00:46:34,531 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
 master:6-0x12c537d84e10062 Retrieved 128 byte(s) of data from znode 
 /hbase/unassigned/ffaa7999e909dbd6544688cc8ab303bd and set watcher; 
 region=usertable,user1951957302,1290501969
 2010-11-23 00:46:34,531 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_CLOSED, 
 server=haus01.sf.cloudera.com,12020,1290501789693, 
 region=ffaa7999e909dbd6544688cc8ab303bd
 2010-11-23 00:46:34,531 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
 ffaa7999e909dbd6544688cc8ab303bd from server 
 haus01.sf.cloudera.com,12020,1290501789693 but region was in  the state null 
 and not in expected PENDIN
 and the region just gets stuck closed

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3268) Auto-tune balance frequency based on cluster size

2010-11-23 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934927#action_12934927
 ] 

Todd Lipcon commented on HBASE-3268:


I think the idea of triggering balance when we get a new server is a good one.

One thing we want to be a little careful of is the situation when someone flips 
on 10 new servers at the same time. Rather than triggering a rebalance for 
each (and thus lots of churn), we want a little bit of lag before the rebalance.

Maybe when a new server is added, we trigger the rebalance in 5-10 seconds?
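The lag proposed here is a debounce: each new checkin resets a quiet-period timer, and the balance fires only once the burst of joins has settled. A hypothetical sketch (class and method names invented):

```java
// Hypothetical sketch: postpone a rebalance until no new server has
// joined for a quiet window, so 10 servers flipped on together trigger
// one balance instead of ten.
public class BalanceDebouncer {
    private final long quietMillis;
    private long lastJoin;
    private boolean pending = false;

    public BalanceDebouncer(long quietMillis) {
        this.quietMillis = quietMillis;
    }

    // Called when a new regionserver checks in; resets the quiet timer.
    public void onServerJoin(long nowMillis) {
        lastJoin = nowMillis;
        pending = true;
    }

    // Polled by the master's chore thread; returns true exactly once per
    // settled burst of joins.
    public boolean shouldBalance(long nowMillis) {
        if (pending && nowMillis - lastJoin >= quietMillis) {
            pending = false;
            return true;
        }
        return false;
    }
}
```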

 Auto-tune balance frequency based on cluster size
 -

 Key: HBASE-3268
 URL: https://issues.apache.org/jira/browse/HBASE-3268
 Project: HBase
  Issue Type: Improvement
  Components: master
Reporter: Todd Lipcon

 Right now we only balance the cluster once every 5 minutes by default. This 
 is likely to confuse new users. When you start a new region server, you 
 expect it to pick up some load very quickly, but right now you have to wait 5 
 minutes for it to start doing anything in the worst case.
 We could/should also add a button/shell command to trigger balance now

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3268) Auto-tune balance frequency based on cluster size

2010-11-23 Thread Jonathan Gray (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934938#action_12934938
 ] 

Jonathan Gray commented on HBASE-3268:
--

Something like that sounds reasonable.  I'm trying to figure out some of these 
other issues now, so this one is up for grabs.

 Auto-tune balance frequency based on cluster size
 -

 Key: HBASE-3268
 URL: https://issues.apache.org/jira/browse/HBASE-3268
 Project: HBase
  Issue Type: Improvement
  Components: master
Reporter: Todd Lipcon

 Right now we only balance the cluster once every 5 minutes by default. This 
 is likely to confuse new users. When you start a new region server, you 
 expect it to pick up some load very quickly, but right now you have to wait 5 
 minutes for it to start doing anything in the worst case.
 We could/should also add a button/shell command to trigger balance now

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HBASE-3262) TestHMasterRPCException uses non-ephemeral port for master

2010-11-23 Thread Jonathan Gray (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Gray updated HBASE-3262:
-

Fix Version/s: 0.92.0
   Status: Patch Available  (was: Open)

 TestHMasterRPCException uses non-ephemeral port for master
 --

 Key: HBASE-3262
 URL: https://issues.apache.org/jira/browse/HBASE-3262
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.0
Reporter: Jonathan Gray
Assignee: Jonathan Gray
 Fix For: 0.90.0, 0.92.0

 Attachments: HBASE-3262-v1.patch


 TestHMasterRPCException instantiates an HMaster but doesn't use an ephemeral 
 port, which can cause the test to fail if the port is already in use.
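The usual fix for this class of test flakiness is to bind to port 0 so the OS assigns a free ephemeral port; how the chosen port is then wired into the HMaster under test is beyond this sketch. A minimal JDK-only illustration:

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical sketch of ephemeral-port selection for a test.
public class EphemeralPortSketch {
    // Ask the OS for any free port instead of hard-coding one, so two
    // concurrent test runs (or a stray process) cannot collide.
    public static int pickFreePort() throws IOException {
        try (ServerSocket s = new ServerSocket(0)) {
            return s.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("picked port " + pickFreePort());
    }
}
```

Note the small race: the socket is closed before the caller rebinds, so another process could grab the port in between. Passing port 0 directly to the server under test avoids that window entirely.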

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3264) Remove unnecessary Guava Dependency

2010-11-23 Thread Jonathan Gray (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934954#action_12934954
 ] 

Jonathan Gray commented on HBASE-3264:
--

I'm fine with adding dependencies/libraries when we are using them, but in 
general I think we should also aim to minimize our dependencies.

So I'm +1 for removing an additional dependency from the job if it's trivial to 
remove it.

I'm also +1 on a complete client-dep jar if we could get maven to do that for 
us.

 Remove unnecessary Guava Dependency
 ---

 Key: HBASE-3264
 URL: https://issues.apache.org/jira/browse/HBASE-3264
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Reporter: Nicolas Spiegelberg
Assignee: Nicolas Spiegelberg
Priority: Minor
 Fix For: 0.90.1

 Attachments: HBASE-3264.patch


 Currently, TableMapReduceUtil uses Guava for trivial functionality and 
 addDependencyJars() adds Guava by default.  However, this jar is only 
 necessary for the ImportTsv MR job.  This is annoying when naively bundling 
 the hbase jar with a MR job because you now need a second dependency jar.  We 
 should bundle only critical dependencies by default and have jobs that need 
 fancy Guava functionality include it explicitly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HBASE-3256) Coprocessors: Coprocessor host and observer for HMaster

2010-11-23 Thread Gary Helmling (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Helmling updated HBASE-3256:
-

Attachment: HBASE-3256_initial.patch

Here's a preview version of the patch adding the MasterObserver interface and 
related changes and refactorings.  The final version of this is waiting on an 
implementation of the lifecycle hooks for HBASE-3260.  Once I complete those 
changes, I will merge here, add unit tests and put the final version of this 
patch up on review board.

 Coprocessors: Coprocessor host and observer for HMaster
 ---

 Key: HBASE-3256
 URL: https://issues.apache.org/jira/browse/HBASE-3256
 Project: HBase
  Issue Type: Sub-task
Reporter: Andrew Purtell
Assignee: Gary Helmling
 Fix For: 0.92.0

 Attachments: HBASE-3256_initial.patch


 Implement a coprocessor host for HMaster. Hook observers into administrative 
 operations performed on tables: create, alter, assignment, load balance, and 
 allow observers to modify base master behavior. Support automatic loading of 
 coprocessor implementation. 
 Consider refactoring the master coprocessor host and regionserver coprocessor 
 host into a common base class. 




[jira] Created: (HBASE-3269) HBase table truncate semantics seems broken as disable table is now async by default.

2010-11-23 Thread Suraj Varma (JIRA)
HBase table truncate semantics seems broken as disable table is now async by 
default.
---

 Key: HBASE-3269
 URL: https://issues.apache.org/jira/browse/HBASE-3269
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.0
 Environment: RHEL5 x86_64
Reporter: Suraj Varma
Priority: Critical


The new async design for disable table seems to have caused a side effect on 
the truncate command. (IRC chat with jdcryans)

Apparent Cause: 
Disable is now async by default. When truncate is called, the disable 
operation returns immediately and when the drop is called, the disable 
operation is still not completed. This results in 
HMaster.checkTableModifiable() throwing a TableNotDisabledException.

With earlier versions, disable returned only after Table was disabled.
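A minimal sketch of what a repaired truncate has to do: block until the asynchronous disable actually completes before issuing the drop. Names here are hypothetical; `isDisabled` stands in for an admin-API table-state check.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class TruncateSketch {
    // Poll the table's state until the asynchronous disable completes,
    // or give up after timeoutMs. Only if this returns true is it safe
    // to proceed with the drop-and-recreate part of a truncate.
    static boolean waitUntilDisabled(BooleanSupplier isDisabled, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (isDisabled.getAsBoolean()) {
                return true;                 // disable finished; safe to drop
            }
            TimeUnit.MILLISECONDS.sleep(10); // back off before re-checking
        }
        return false;                        // timed out: do not drop
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(waitUntilDisabled(() -> true, 100)); // prints "true"
        System.out.println(waitUntilDisabled(() -> false, 50)); // prints "false"
    }
}
```

The pre-async behavior (disable returning only once the table was disabled) made this wait implicit; with an async disable, the wait has to be explicit somewhere.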




[jira] Updated: (HBASE-3269) HBase table truncate semantics seems broken as disable table is now async by default.

2010-11-23 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated HBASE-3269:
--

Fix Version/s: 0.92.0
   0.90.0
 Assignee: stack

Assigning to Stack and marking it against 0.90

 HBase table truncate semantics seems broken as disable table is now async 
 by default.
 ---

 Key: HBASE-3269
 URL: https://issues.apache.org/jira/browse/HBASE-3269
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.0
 Environment: RHEL5 x86_64
Reporter: Suraj Varma
Assignee: stack
Priority: Critical
 Fix For: 0.90.0, 0.92.0


 The new async design for disable table seems to have caused a side effect on 
 the truncate command. (IRC chat with jdcryans)
 Apparent Cause: 
 Disable is now async by default. When truncate is called, the disable 
 operation returns immediately and when the drop is called, the disable 
 operation is still not completed. This results in 
 HMaster.checkTableModifiable() throwing a TableNotDisabledException.
 With earlier versions, disable returned only after Table was disabled.




[jira] Commented: (HBASE-3227) Edit of log messages before branching...

2010-11-23 Thread HBase Review Board (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934996#action_12934996
 ] 

HBase Review Board commented on HBASE-3227:
---

Message from: st...@duboce.net


bq.  On 2010-11-22 17:29:45, Nicolas wrote:
bq.   trunk/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java, 
line 739
bq.   http://review.cloudera.org/r/1212/diff/1/?file=17170#file17170line739
bq.  
bq.   I'd suggest keeping the store name in this debug message since we're 
considering thread pools for compactions...

Won't the store name be part of the path on the next line when we do 
sf.toString() where sf is the file we're compacting all into?


- stack


---
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1212/#review1971
---





 Edit of log messages before branching...
 

 Key: HBASE-3227
 URL: https://issues.apache.org/jira/browse/HBASE-3227
 Project: HBase
  Issue Type: Improvement
Reporter: stack
 Fix For: 0.90.0







[jira] Resolved: (HBASE-3264) Remove unnecessary Guava Dependency

2010-11-23 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack resolved HBASE-3264.
--

   Resolution: Fixed
Fix Version/s: (was: 0.90.1)
   0.90.0
 Hadoop Flags: [Reviewed]

Committed.

I agree we should use libs instead of writing the stuff ourselves, but also 
that we should move to minimize dependencies.  In this case, I like the bit of 
Nicolas footwork that changes a little bit of code so we can cut our client 
dependencies by 25 (or 33?) percent.

 Remove unnecessary Guava Dependency
 ---

 Key: HBASE-3264
 URL: https://issues.apache.org/jira/browse/HBASE-3264
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Reporter: Nicolas Spiegelberg
Assignee: Nicolas Spiegelberg
Priority: Minor
 Fix For: 0.90.0

 Attachments: HBASE-3264.patch


 Currently, TableMapReduceUtil uses Guava for trivial functionality, and 
 addDependencyJars() adds Guava by default.  However, this jar is only 
 necessary for the ImportTsv MR job.  This is annoying when naively bundling 
 the hbase jar with an MR job because you now need a second dependency jar.  
 We should bundle only critical dependencies by default and have jobs that 
 need fancy Guava functionality include them explicitly.




[jira] Created: (HBASE-3270) When we create the .version file, we should create it in a tmp location and then move it into place

2010-11-23 Thread stack (JIRA)
When we create the .version file, we should create it in a tmp location and 
then move it into place
---

 Key: HBASE-3270
 URL: https://issues.apache.org/jira/browse/HBASE-3270
 Project: HBase
  Issue Type: Improvement
  Components: master
Reporter: stack
Priority: Minor


Todd suggests over in HBASE-3258 that when writing hbase.version, we should 
write it to a tmp location and then move it into place after the write 
completes, to protect against the case where the file writer crashes between 
creation and write.
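The write-then-rename pattern Todd suggests, sketched with java.nio against a local filesystem (HBase itself would do the equivalent against HDFS; names are illustrative):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SafeVersionWrite {
    // Write the content to a temporary sibling first, then rename it into
    // place. A crash between creation and write then leaves at worst a
    // stray .tmp file, never a truncated or empty file at the final path.
    static void writeAtomically(Path target, String content) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("version-demo");
        Path version = dir.resolve("hbase.version");
        writeAtomically(version, "7");
        System.out.println(Files.readAllLines(version).get(0)); // prints "7"
    }
}
```

Readers then either see the complete old file or the complete new one, never a partial write.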




[jira] Commented: (HBASE-3262) TestHMasterRPCException uses non-ephemeral port for master

2010-11-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935014#action_12935014
 ] 

stack commented on HBASE-3262:
--

This patch is fine -- especially if you want to apply this to 0.90 for second 
RC -- but yeah, best if HTU does this for all HMaster instantiations.

 TestHMasterRPCException uses non-ephemeral port for master
 --

 Key: HBASE-3262
 URL: https://issues.apache.org/jira/browse/HBASE-3262
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.0
Reporter: Jonathan Gray
Assignee: Jonathan Gray
 Fix For: 0.90.0, 0.92.0

 Attachments: HBASE-3262-v1.patch


 TestHMasterRPCException instantiates an HMaster but doesn't use an ephemeral 
 port, which can cause the test to fail if the port is already in use.




[jira] Updated: (HBASE-3267) close_region shell command breaks region

2010-11-23 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-3267:
-

Fix Version/s: 0.90.0

Bringing into 0.90.0.

 close_region shell command breaks region
 

 Key: HBASE-3267
 URL: https://issues.apache.org/jira/browse/HBASE-3267
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver, shell
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Critical
 Fix For: 0.90.0


 It used to be that you could use the close_region command from the shell to 
 close a region on one server and have the master reassign it elsewhere. Now 
 if you close a region, you get the following errors in the master log:
 2010-11-23 00:46:34,090 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSING for region 
 ffaa7999e909dbd6544688cc8ab303bd from server 
 haus01.sf.cloudera.com,12020,1290501789693 but region was in  the state null 
 and not in expected PENDI
 2010-11-23 00:46:34,530 DEBUG 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: 
 master:6-0x12c537d84e10062 Received ZooKeeper Event, 
 type=NodeDataChanged, state=SyncConnected, 
 path=/hbase/unassigned/ffaa7999e909dbd6544688cc8ab303bd
 2010-11-23 00:46:34,531 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
 master:6-0x12c537d84e10062 Retrieved 128 byte(s) of data from znode 
 /hbase/unassigned/ffaa7999e909dbd6544688cc8ab303bd and set watcher; 
 region=usertable,user1951957302,1290501969
 2010-11-23 00:46:34,531 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_CLOSED, 
 server=haus01.sf.cloudera.com,12020,1290501789693, 
 region=ffaa7999e909dbd6544688cc8ab303bd
 2010-11-23 00:46:34,531 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
 ffaa7999e909dbd6544688cc8ab303bd from server 
 haus01.sf.cloudera.com,12020,1290501789693 but region was in  the state null 
 and not in expected PENDIN
 and the region just gets stuck closed




[jira] Updated: (HBASE-3265) Regionservers waiting for ROOT while Master waiting for RegionServers

2010-11-23 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-3265:
-

Fix Version/s: 0.90.0

Bringing in for triage

 Regionservers waiting for ROOT while Master waiting for RegionServers
 -

 Key: HBASE-3265
 URL: https://issues.apache.org/jira/browse/HBASE-3265
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Critical
 Fix For: 0.90.0


 After a cluster disastrophe due to a disconnected switch, I ended up in a 
 state where the master was up with no region servers (see HBASE-3263). When I 
 brought the RS back up, because of the aforementioned bug, the master didn't 
 get itself into a happy state (internal datastructure had some null in it). 
 So I killed the master and started it again. Now, the master is in Waiting 
 for region servers to check in mode, and the region servers are in the 
 following stack:
 - locked 0x2aaab1bda5d0 (a 
 org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
 at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.waitForRoot(CatalogTracker.java:177)
 at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:537)
 at java.lang.Thread.run(Thread.java:619)
 I imagine what happened is that the RS got through tryReportForDuty with 
 the old master, but the old master was unable to assign anything due to bad 
 state. So, when it crashed, all the RS were stuck in waitForRoot(), and when 
 I brought the new one up, no one was reporting for duty.




[jira] Updated: (HBASE-3266) Master does not seem to properly scan ZK for running RS during startup

2010-11-23 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-3266:
-

Fix Version/s: 0.90.0

Bringing into 0.90.0 while we triage.

 Master does not seem to properly scan ZK for running RS during startup
 --

 Key: HBASE-3266
 URL: https://issues.apache.org/jira/browse/HBASE-3266
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Critical
 Fix For: 0.90.0


 I was in the situation described by HBASE-3265, where I had a number of RS 
 waiting on ROOT, but the master hadn't seen any RS checkins, so was waiting 
 on checkins. To get past this, I restarted one of the region servers. The 
 restarted server checked in, and the master began its startup.
 At this point the master started scanning /hbase/.logs for things to split. 
 It correctly identified that the RS on haus01 was running (this is the one I 
 restarted):
 2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
 Log folder 
 hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus01.sf.cloudera.com,60020,1290500443143
  belongs to an existing region server
 but then incorrectly decided that the RS on haus02 was down:
 2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
 Log folder 
 hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus02.sf.cloudera.com,60020,1290498411450
  doesn't belong to a known region server, splitting
 However ZK shows that this RS is up:
 [zk: haus01.sf.cloudera.com:(CONNECTED) 3] ls /hbase/rs
 [haus04.sf.cloudera.com,60020,1290498411533, 
 haus05.sf.cloudera.com,60020,1290498411520, 
 haus03.sf.cloudera.com,60020,1290498411518, 
 haus01.sf.cloudera.com,60020,1290500443143, 
 haus02.sf.cloudera.com,60020,1290498411450]
 splitLogsAfterStartup seems to check ServerManager.onlineServers, which best 
 I can tell is derived from heartbeats and not from ZK (sorry if I got some of 
 this wrong, still new to this new codebase)
 Of course, the master went into an infinite splitting loop at this point 
 since haus02 is up and renewing its DFS lease on its logs.




[jira] Updated: (HBASE-3263) Stack overflow in AssignmentManager

2010-11-23 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-3263:
-

Fix Version/s: 0.90.0

 Stack overflow in AssignmentManager
 ---

 Key: HBASE-3263
 URL: https://issues.apache.org/jira/browse/HBASE-3263
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Blocker
 Fix For: 0.90.0

 Attachments: stackoverflow-log.txt


 My test cluster experienced a switch outage earlier this week which threw the 
 master into a really bad state. In the catch clause of 
 AssignmentManager.assign, we recurse, and if all of the region servers are 
 inaccessible, we do so until we get a stack overflow.
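The failure mode can be illustrated generically: retrying from a catch clause by recursing grows the stack on every failed attempt, so when every region server is unreachable the retries eventually overflow. A bounded loop avoids that. This is a sketch with hypothetical names, not the actual AssignmentManager patch:

```java
import java.io.IOException;

public class BoundedRetry {
    // Hypothetical stand-in for an assignment attempt that may fail.
    interface Attempt { void run() throws IOException; }

    // Retry in a loop with a cap instead of recursing from the catch
    // clause. Each failed iteration reuses the same stack frame, so even
    // an unbounded outage cannot produce a StackOverflowError.
    static boolean assignWithRetries(Attempt attempt, int maxRetries) {
        for (int i = 0; i < maxRetries; i++) {
            try {
                attempt.run();
                return true;          // assignment succeeded
            } catch (IOException e) {
                // log and fall through to the next loop iteration
            }
        }
        return false;                 // give up after maxRetries attempts
    }

    public static void main(String[] args) {
        // An attempt that always fails: the loop terminates cleanly
        // instead of recursing until the stack overflows.
        boolean ok = assignWithRetries(
                () -> { throw new IOException("all region servers unreachable"); }, 5);
        System.out.println(ok);       // prints "false"
    }
}
```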




[jira] Commented: (HBASE-3261) NPE out of HRS.run at startup when clock is out of sync

2010-11-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935027#action_12935027
 ] 

stack commented on HBASE-3261:
--

+1 on adding nullcheck.

The sequence in which stuff is started was undergoing high-velocity change up 
to the end.  Not surprised there are holes if the start sequence aborted midway.

 NPE out of HRS.run at startup when clock is out of sync
 ---

 Key: HBASE-3261
 URL: https://issues.apache.org/jira/browse/HBASE-3261
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
 Fix For: 0.90.0, 0.92.0


 This is what I get when I start a region server that's not properly sync'ed:
 {noformat}
 Exception in thread regionserver60020 java.lang.NullPointerException
   at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:603)
   at java.lang.Thread.run(Thread.java:637)
 {noformat}
 In this case the line was:
 {noformat}
 hlogRoller.interruptIfNecessary();
 {noformat}
 I guess we could add a bunch of other null checks.
 The end result is the same, the RS dies, but I think it's misleading.




[jira] Commented: (HBASE-3260) Coprocessors: Lifecycle management

2010-11-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935031#action_12935031
 ] 

stack commented on HBASE-3260:
--

I'm good with borrowing the nomenclature.  Can we borrow libraries that will 
manage the lifecycle for us?  Would it make sense to implement CPs atop some 
lifecycle-supporting framework?  Would it make sense to use, say, any of the DI 
containers to wire up CPs?

The regionserver and master have similar need of a lifecycle, as has been 
discussed elsewhere.  It would be grand if the same lifecycle nomenclature were 
used throughout -- for hbase daemons and CPs.

 Coprocessors: Lifecycle management
 --

 Key: HBASE-3260
 URL: https://issues.apache.org/jira/browse/HBASE-3260
 Project: HBase
  Issue Type: Sub-task
Reporter: Andrew Purtell
 Fix For: 0.92.0

 Attachments: statechart.png


 Considering extending CPs to the master, we have no equivalent to 
 pre/postOpen and pre/postClose as on the regionserver. We also should 
 consider how to resolve dependencies and initialization ordering if loading 
 coprocessors that depend on others. 
 OSGi (http://en.wikipedia.org/wiki/OSGi) has a lifecycle API and is familiar 
 to many Java programmers, so we propose to borrow its terminology and state 
 machine.
 A lifecycle layer manages coprocessors as they are dynamically installed, 
 started, stopped, updated and uninstalled. Coprocessors rely on the framework 
 for dependency resolution and class loading. In turn, the framework calls up 
 to lifecycle management methods in the coprocessor as needed.
 A coprocessor transitions between the below states over its lifetime:
 ||State||Description||
 |UNINSTALLED|The coprocessor implementation is not installed. This is the 
 default implicit state.|
 |INSTALLED|The coprocessor implementation has been successfully installed|
 |STARTING|A coprocessor instance is being started.|
 |ACTIVE|The coprocessor instance has been successfully activated and is 
 running.|
 |STOPPING|A coprocessor instance is being stopped.|
 See attached state diagram. Transitions to STOPPING will only happen as the 
 region is being closed. If a coprocessor throws an unhandled exception, this 
 will cause the RegionServer to close the region, stopping all coprocessor 
 instances on it. 
 Transitions from INSTALLED->STARTING and ACTIVE->STOPPING would go through 
 upcall methods into the coprocessor via the CoprocessorLifecycle interface:
 {code:java}
 public interface CoprocessorLifecycle {
   void start(CoprocessorEnvironment env) throws IOException; 
   void stop(CoprocessorEnvironment env) throws IOException;
 }
 {code}
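The state table above could be modeled as an enum with an explicit transition check. This is a sketch only; the exact edge set below is an assumption read off the description (the attached statechart is authoritative), and none of these names are committed code:

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

public class CoprocessorState {
    // States from the proposed lifecycle table.
    enum State { UNINSTALLED, INSTALLED, STARTING, ACTIVE, STOPPING }

    // Legal transitions (assumed edge set); anything absent is rejected.
    static final Map<State, EnumSet<State>> LEGAL = new EnumMap<>(State.class);
    static {
        LEGAL.put(State.UNINSTALLED, EnumSet.of(State.INSTALLED));
        LEGAL.put(State.INSTALLED,   EnumSet.of(State.STARTING, State.UNINSTALLED));
        LEGAL.put(State.STARTING,    EnumSet.of(State.ACTIVE));
        LEGAL.put(State.ACTIVE,      EnumSet.of(State.STOPPING));
        LEGAL.put(State.STOPPING,    EnumSet.of(State.INSTALLED));
    }

    // A lifecycle layer would consult this before invoking start()/stop().
    static boolean canTransition(State from, State to) {
        return LEGAL.getOrDefault(from, EnumSet.noneOf(State.class)).contains(to);
    }

    public static void main(String[] args) {
        System.out.println(canTransition(State.INSTALLED, State.STARTING)); // prints "true"
        System.out.println(canTransition(State.ACTIVE, State.UNINSTALLED)); // prints "false"
    }
}
```

Centralizing the edge set this way makes illegal transitions a checkable error rather than an implicit assumption in the host.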




[jira] Commented: (HBASE-3254) Ability to specify the host published in zookeeper

2010-11-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935040#action_12935040
 ] 

stack commented on HBASE-3254:
--

Hey Cheddar.  You got a patch that would illustrate how you'd fix this (We 
should be using hostnames rather than IPs I'd say)?

 Ability to specify the host published in zookeeper
 

 Key: HBASE-3254
 URL: https://issues.apache.org/jira/browse/HBASE-3254
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.89.20100924
Reporter: Eric Tschetter

 We are running HBase on EC2 and I'm trying to get a client external from EC2 
 to connect to the cluster.  But, each of the nodes appears to be publishing 
 its IP address into zookeeper.  The problem is that the nodes on EC2 see a 
 10. IP address that is only resolvable inside of EC2.
 Specifically for EC2, there is a DNS name that will resolve properly both 
 externally and internally, so it would be nice if I could tell each of the 
 processes what host to publish into zookeeper via a property.  As it stands, 
 I have to do ssh tunnelling/muck with the hosts file in order to get my 
 client to connect.
  
 This problem could occur anywhere that you have a different DNS entry for 
 public vs. private access.  That might only ever happen on EC2, but it might 
 happen elsewhere.  I don't really know :).




[jira] Updated: (HBASE-3249) Typing 'help shutdown' in the shell shouldn't shutdown the cluster

2010-11-23 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-3249:
-

Attachment: shutdown.txt

I took a quick look.  'help shutdown' is being interpreted as two commands, a 
'help' followed by a 'shutdown'.  The interpreter runs the 'shutdown' command 
first and then 'help'.  'help' is a native IRB command that we did some hackery 
to override.  The 'shutdown' is part of our command-set injection.  I'd need to 
dig in and spend some time figuring out how I hacked this up.

For 0.90.0, I propose the following patch, which just removes the shutdown 
command.  Shutdown from inside the shell seems 'odd' to me.  Meantime I'll open 
an issue to look into how this help stuff is being interpreted by IRB.  It's 
kinda ugly.  If you do 'help get', you first get a complaint that get has not 
been passed enough parameters, then the help output, followed by the total help 
output.  Ugly.  We need to fix it.  But I don't think it's a blocker on the 0.90 RC.

 Typing 'help shutdown' in the shell shouldn't shutdown the cluster
 --

 Key: HBASE-3249
 URL: https://issues.apache.org/jira/browse/HBASE-3249
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
 Fix For: 0.90.0

 Attachments: shutdown.txt


 _hp_ on IRC found out the bad way that typing 'help shutdown' actually gives 
 you the full help... and shuts down the cluster. I don't really understand 
 why we process both commands, putting against 0.90.0 if anyone has an idea.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (HBASE-3249) Typing 'help shutdown' in the shell shouldn't shutdown the cluster

2010-11-23 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack reassigned HBASE-3249:


Assignee: stack

 Typing 'help shutdown' in the shell shouldn't shutdown the cluster
 --

 Key: HBASE-3249
 URL: https://issues.apache.org/jira/browse/HBASE-3249
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: stack
 Fix For: 0.90.0

 Attachments: shutdown.txt


 _hp_ on IRC found out the bad way that typing 'help shutdown' actually gives 
 you the full help... and shuts down the cluster. I don't really understand 
 why we process both commands, putting against 0.90.0 if anyone has an idea.




[jira] Commented: (HBASE-3249) Typing 'help shutdown' in the shell shouldn't shutdown the cluster

2010-11-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935047#action_12935047
 ] 

stack commented on HBASE-3249:
--

Oh, I'm looking for a +1

 Typing 'help shutdown' in the shell shouldn't shutdown the cluster
 --

 Key: HBASE-3249
 URL: https://issues.apache.org/jira/browse/HBASE-3249
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: stack
 Fix For: 0.90.0

 Attachments: shutdown.txt


 _hp_ on IRC found out the bad way that typing 'help shutdown' actually gives 
 you the full help... and shuts down the cluster. I don't really understand 
 why we process both commands, putting against 0.90.0 if anyone has an idea.




[jira] Commented: (HBASE-3249) Typing 'help shutdown' in the shell shouldn't shutdown the cluster

2010-11-23 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935049#action_12935049
 ] 

Jean-Daniel Cryans commented on HBASE-3249:
---

+1

 Typing 'help shutdown' in the shell shouldn't shutdown the cluster
 --

 Key: HBASE-3249
 URL: https://issues.apache.org/jira/browse/HBASE-3249
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: stack
 Fix For: 0.90.0

 Attachments: shutdown.txt


 _hp_ on IRC found out the bad way that typing 'help shutdown' actually gives 
 you the full help... and shuts down the cluster. I don't really understand 
 why we process both commands, putting against 0.90.0 if anyone has an idea.




[jira] Commented: (HBASE-3247) Changes API: API for pulling edits from HBase

2010-11-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935055#action_12935055
 ] 

stack commented on HBASE-3247:
--

@Steven Yes, we should start with RowLog 
(http://www.lilyproject.org/maven-site/0.1/apidocs/org/lilycms/rowlog/api/RowLog.html).

bq. I'm wondering, as I'm seeing a proliferation of alternative yet overlapping 
approaches to a certain number of issues (secondary indexes, change listening) 
which in the end could confuse new users.

-1 to proliferation of alternate yet overlapping... things

What do you fellas suggest for bootstrapping the system -- doing a fat bulk load 
into the search index and then cutting over to rowlog for incremental updates?  
Doesn't there have to be an exact transition point so followers do not miss 
edits?  Do you fellas have ideas for how to do that?

 Changes API: API for pulling edits from HBase
 -

 Key: HBASE-3247
 URL: https://issues.apache.org/jira/browse/HBASE-3247
 Project: HBase
  Issue Type: Task
Reporter: stack

 Talking to Shay from Elastic Search, he was asking where the Changes API is 
 in HBase.  Talking more -- there was a bit of beer involved so apologize up 
 front -- he wants to be able to bootstrap an index and thereafter ask HBase 
 for changes since time t.  We thought he could tie into the replication 
 stream, but rather he wants to be able to pull rather than have it pushed to 
 him (in case he crashes, etc. so on recovery he can start pulling again from 
 last good edit received).  He could do the bootstrap with a Scan.  
 Thereafter, requests to pull from hbase would pass a marker of some  sort.  
 HBase would then give out edits that came in after this marker, in batches, 
 along with an updated marker.
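The pull-with-marker idea can be sketched as a toy in-memory feed (illustrative names only, not a proposed HBase API): edits carry monotonically increasing sequence ids, and each pull returns a batch plus the marker to resume from, so a crashed consumer restarts from its last durable marker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ChangesFeed {
    // A batch of edits plus the marker to pass into the next pull.
    static class Batch {
        final List<String> edits;
        final long marker;
        Batch(List<String> edits, long marker) { this.edits = edits; this.marker = marker; }
    }

    private final TreeMap<Long, String> log = new TreeMap<>(); // seq id -> edit
    private long nextSeq = 1;

    void append(String edit) { log.put(nextSeq++, edit); }

    // Return up to 'limit' edits strictly after 'afterMarker', along with
    // the marker of the last edit returned (unchanged if nothing new).
    Batch pull(long afterMarker, int limit) {
        List<String> out = new ArrayList<>();
        long marker = afterMarker;
        for (Map.Entry<Long, String> e : log.tailMap(afterMarker, false).entrySet()) {
            if (out.size() == limit) break;
            out.add(e.getValue());
            marker = e.getKey();
        }
        return new Batch(out, marker);
    }

    public static void main(String[] args) {
        ChangesFeed feed = new ChangesFeed();
        feed.append("put r1");
        feed.append("put r2");
        feed.append("delete r1");
        Batch b = feed.pull(0L, 2);
        System.out.println(b.edits + " next marker=" + b.marker); // prints "[put r1, put r2] next marker=2"
    }
}
```

In the real system the marker would have to survive compactions/log rolls; that durability question is exactly the hard part the discussion above is circling.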




[jira] Resolved: (HBASE-3249) Typing 'help shutdown' in the shell shouldn't shutdown the cluster

2010-11-23 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack resolved HBASE-3249.
--

  Resolution: Fixed
Hadoop Flags: [Reviewed]

Committed.  Thanks for review j-d.

 Typing 'help shutdown' in the shell shouldn't shutdown the cluster
 --

 Key: HBASE-3249
 URL: https://issues.apache.org/jira/browse/HBASE-3249
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: stack
 Fix For: 0.90.0

 Attachments: shutdown.txt


 _hp_ on IRC found out the bad way that typing 'help shutdown' actually gives 
 you the full help... and shuts down the cluster. I don't really understand 
 why we process both commands, putting against 0.90.0 if anyone has an idea.




[jira] Resolved: (HBASE-3258) EOF when version file is empty

2010-11-23 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved HBASE-3258.
---

  Resolution: Fixed
Hadoop Flags: [Reviewed]

Committed to trunk and 0.90, thanks Stack.

 EOF when version file is empty
 --

 Key: HBASE-3258
 URL: https://issues.apache.org/jira/browse/HBASE-3258
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: Jean-Daniel Cryans
Priority: Blocker
 Fix For: 0.90.0, 0.92.0

 Attachments: HBASE-3258.patch


 I somehow was able to get an empty hbase.version file on a test machine and 
 when I start HBase I see:
 {noformat}
 starting master, logging to 
 /data/jdcryans/git/hbase/bin/../logs/hbase-jdcryans-master-hbasedev.out
 Exception in thread master-hbasedev:6 java.lang.NullPointerException
   at 
 org.apache.hadoop.hbase.master.HMaster.stopServiceThreads(HMaster.java:559)
   at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:286)
 {noformat}
 And in the master's log:
 {noformat}
 2010-11-22 10:08:43,003 FATAL org.apache.hadoop.hbase.master.HMaster: 
 Unhandled exception. Starting shutdown.
 java.io.EOFException
 at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:323)
 at java.io.DataInputStream.readUTF(DataInputStream.java:572)
 at org.apache.hadoop.hbase.util.FSUtils.getVersion(FSUtils.java:151)
 at org.apache.hadoop.hbase.util.FSUtils.checkVersion(FSUtils.java:170)
 at 
 org.apache.hadoop.hbase.master.MasterFileSystem.checkRootDir(MasterFileSystem.java:226)
 at 
 org.apache.hadoop.hbase.master.MasterFileSystem.createInitialFileSystemLayout(MasterFileSystem.java:104)
 at 
 org.apache.hadoop.hbase.master.MasterFileSystem.init(MasterFileSystem.java:89)
 at 
 org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:337)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:273)
 2010-11-22 10:08:43,006 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
 {noformat}
 I thought that that kind of issue was solved a long time ago, but somehow 
 it's there again. I'll fix by handling the EOF and also will look at that 
 ugly NPE.
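The EOF fix JD describes amounts to catching EOFException from readUTF() and treating it as "no version present" rather than letting it abort master startup. A self-contained sketch with hypothetical names:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class VersionRead {
    // readUTF() first reads a 2-byte length prefix, so on an empty file it
    // throws EOFException -- exactly the stack trace above. Map that to
    // null ("no version") so the caller can rewrite the file instead of
    // shutting down.
    static String readVersionOrNull(InputStream in) throws IOException {
        try (DataInputStream dis = new DataInputStream(in)) {
            return dis.readUTF();
        } catch (EOFException e) {
            return null; // empty or truncated version file
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readVersionOrNull(new ByteArrayInputStream(new byte[0]))); // prints "null"
    }
}
```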




[jira] Commented: (HBASE-3259) Can't kill the region servers when they wait on the master or the cluster state znode

2010-11-23 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935064#action_12935064
 ] 

Jean-Daniel Cryans commented on HBASE-3259:
---

Committed to trunk and 0.90, thanks Stack.

 Can't kill the region servers when they wait on the master or the cluster 
 state znode
 -

 Key: HBASE-3259
 URL: https://issues.apache.org/jira/browse/HBASE-3259
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: Jean-Daniel Cryans
Priority: Blocker
 Fix For: 0.90.0, 0.92.0

 Attachments: HBASE-3259.patch


 With a situation like HBASE-3258, it's easy to have the region servers stuck 
 waiting on either the master or the cluster state znode, since that wait has 
 no timeout. You have to kill -9 them to shut them down. This is very bad for 
 usability.
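A hedged sketch of the general fix shape (the names here are illustrative, not the actual HBASE-3259 patch): instead of blocking forever, wait in short bounded slices and re-check a stop flag on each pass, so a normal shutdown request can break the wait instead of requiring kill -9.

{code:java}
public class InterruptibleWaitSketch {
  private volatile boolean stopped = false;

  public void stop() { stopped = true; }

  /** Returns true once ready() holds; returns false if stop() was called first. */
  public boolean waitFor(java.util.function.BooleanSupplier ready, long sliceMillis)
      throws InterruptedException {
    while (!stopped) {
      if (ready.getAsBoolean()) return true;
      Thread.sleep(sliceMillis); // bounded slice; the loop re-checks the stop flag
    }
    return false; // shutdown requested before the condition held
  }

  public static void main(String[] args) throws InterruptedException {
    InterruptibleWaitSketch w = new InterruptibleWaitSketch();
    w.stop(); // simulate a shutdown request arriving before the master is up
    System.out.println(w.waitFor(() -> false, 10)); // prints false: wait broke out
  }
}
{code}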

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HBASE-3262) TestHMasterRPCException uses non-ephemeral port for master

2010-11-23 Thread Jonathan Gray (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Gray updated HBASE-3262:
-

  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Committed to branch and trunk.

 TestHMasterRPCException uses non-ephemeral port for master
 --

 Key: HBASE-3262
 URL: https://issues.apache.org/jira/browse/HBASE-3262
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.0
Reporter: Jonathan Gray
Assignee: Jonathan Gray
 Fix For: 0.90.0, 0.92.0

 Attachments: HBASE-3262-v1.patch


 TestHMasterRPCException instantiates an HMaster but doesn't use an ephemeral 
 port, which can cause the test to fail if the port is already in use.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HBASE-3271) Allow .META. table to be exported

2010-11-23 Thread Ted Yu (JIRA)
Allow .META. table to be exported
-

 Key: HBASE-3271
 URL: https://issues.apache.org/jira/browse/HBASE-3271
 Project: HBase
  Issue Type: Improvement
  Components: util
Affects Versions: 0.20.6
Reporter: Ted Yu


I tried to export .META. table in 0.20.6 and got:

[had...@us01-ciqps1-name01 hbase]$ bin/hbase 
org.apache.hadoop.hbase.mapreduce.Export .META. h-meta 1 0 0
10/11/23 20:59:05 INFO jvm.JvmMetrics: Initializing JVM Metrics with 
processName=JobTracker, sessionId=
2010-11-23 20:59:05.255::INFO:  Logging to STDERR via org.mortbay.log.StdErrLog
2010-11-23 20:59:05.255::INFO:  verisons=1, starttime=0, 
endtime=9223372036854775807
10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client 
environment:zookeeper.version=3.2.2-888565, built on 12/08/2009 21:51 GMT
10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client 
environment:host.name=us01-ciqps1-name01.carrieriq.com
10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client 
environment:java.version=1.6.0_21
10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Sun 
Microsystems Inc.
...
10/11/23 20:59:05 INFO zookeeper.ClientCnxn: Server connection successful
10/11/23 20:59:05 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode 
/hbase/root-region-server got 10.202.50.112:60020
10/11/23 20:59:05 DEBUG client.HConnectionManager$TableServers: Found ROOT at 
10.202.50.112:60020
10/11/23 20:59:05 DEBUG client.HConnectionManager$TableServers: Cached location 
for .META.,,1 is us01-ciqps1-grid02.carrieriq.com:60020
Exception in thread "main" java.io.IOException: Expecting at least one region.
at 
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:281)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
at org.apache.hadoop.hbase.mapreduce.Export.main(Export.java:146)

Related code is:
{code}
if (keys == null || keys.getFirst() == null ||
    keys.getFirst().length == 0) {
  throw new IOException("Expecting at least one region.");
}
{code}

My intention was to save the dangling rows in .META. (for future investigation) 
which prevented a table from being created.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3261) NPE out of HRS.run at startup when clock is out of sync

2010-11-23 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935074#action_12935074
 ] 

Jean-Daniel Cryans commented on HBASE-3261:
---

The patch attached just adds a bunch of null checks.

 NPE out of HRS.run at startup when clock is out of sync
 ---

 Key: HBASE-3261
 URL: https://issues.apache.org/jira/browse/HBASE-3261
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: Jean-Daniel Cryans
 Fix For: 0.90.0, 0.92.0

 Attachments: HBASE-3261.patch


 This is what I get when I start a region server that's not properly sync'ed:
 {noformat}
 Exception in thread "regionserver60020" java.lang.NullPointerException
   at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:603)
   at java.lang.Thread.run(Thread.java:637)
 {noformat}
 In this case the line was:
 {noformat}
 hlogRoller.interruptIfNecessary();
 {noformat}
 I guess we could add a bunch of other null checks.
 The end result is the same, the RS dies, but I think it's misleading.
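The null-guard shape of the fix could look like this (a sketch with illustrative names, not the actual patch): service threads such as the log roller may still be null if startup aborted early, for example when the clock check fails, so each interrupt call gets a guard.

{code:java}
class LogRollerStub {
  void interruptIfNecessary() { /* interrupt the roller thread if running */ }
}

public class NullGuardSketch {
  LogRollerStub hlogRoller; // may still be null if startup aborted early

  void stopServiceThreads() {
    // Guard each service-thread field: startup may have aborted before
    // the thread was ever constructed (e.g. clock out of sync).
    if (hlogRoller != null) {
      hlogRoller.interruptIfNecessary();
    }
  }

  public static void main(String[] args) {
    new NullGuardSketch().stopServiceThreads(); // no NPE even when null
    System.out.println("ok");
  }
}
{code}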

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HBASE-3261) NPE out of HRS.run at startup when clock is out of sync

2010-11-23 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated HBASE-3261:
--

Attachment: HBASE-3261.patch

 NPE out of HRS.run at startup when clock is out of sync
 ---

 Key: HBASE-3261
 URL: https://issues.apache.org/jira/browse/HBASE-3261
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: Jean-Daniel Cryans
 Fix For: 0.90.0, 0.92.0

 Attachments: HBASE-3261.patch


 This is what I get when I start a region server that's not properly sync'ed:
 {noformat}
 Exception in thread "regionserver60020" java.lang.NullPointerException
   at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:603)
   at java.lang.Thread.run(Thread.java:637)
 {noformat}
 In this case the line was:
 {noformat}
 hlogRoller.interruptIfNecessary();
 {noformat}
 I guess we could add a bunch of other null checks.
 The end result is the same, the RS dies, but I think it's misleading.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HBASE-3272) Remove no longer used options

2010-11-23 Thread stack (JIRA)
Remove no longer used options
-

 Key: HBASE-3272
 URL: https://issues.apache.org/jira/browse/HBASE-3272
 Project: HBase
  Issue Type: Bug
Reporter: stack


From Lars George's list, up on hbase-dev:

{code}
Hi,

I went through the config values as per the defaults XML file (still
going through it again now based on what is actually in the code, i.e.
those not in defaults). Here is what I found:

hbase.master.balancer.period - Only used in hbase-default.xml?

hbase.regions.percheckin, hbase.regions.slop - Some tests still have
it but not used anywhere else

zookeeper.pause, zookeeper.retries - Never used? Only in hbase-defaults.xml


And then there are differences between hardcoded and XML based defaults:

hbase.client.pause - XML: 1000, hardcoded: 2000 (HBaseClient) and 30 *
1000 (HBaseAdmin)

hbase.client.retries.number - XML: 10, hardcoded 5 (HBaseAdmin) and 2 (HMaster)

hbase.hstore.blockingStoreFiles - XML: 7, hardcoded: -1

hbase.hstore.compactionThreshold - XML: 3, hardcoded: 2

hbase.regionserver.global.memstore.lowerLimit - XML: 0.35, hardcoded: 0.25

hbase.regionserver.handler.count - XML: 25, hardcoded: 10

hbase.regionserver.msginterval - XML: 3000, hardcoded: 1000

hbase.rest.port - XML: 8080, hardcoded: 9090

hfile.block.cache.size - XML: 0.2, hardcoded: 0.0


Finally, some keys are already in HConstants, some are in local
classes and others used as literals. There is an issue open to fix
this though. Just saying.

Thoughts?
{code}
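The XML-vs-hardcoded mismatches in the list above matter because the hardcoded value is only a fallback for a missing key: when hbase-default.xml is on the classpath its value wins, and when it isn't, the hardcoded default silently takes over. A minimal illustration, using plain java.util.Properties as a stand-in for Hadoop's Configuration:

{code:java}
import java.util.Properties;

public class DefaultMismatchSketch {
  // Mirrors Configuration.getInt(key, defaultValue): the second argument
  // is only used when the key is absent from the loaded XML.
  static int getInt(Properties conf, String key, int hardcodedDefault) {
    String v = conf.getProperty(key);
    return (v == null) ? hardcodedDefault : Integer.parseInt(v);
  }

  public static void main(String[] args) {
    Properties withXml = new Properties();
    withXml.setProperty("hbase.client.pause", "1000"); // value from the XML
    Properties withoutXml = new Properties();          // XML not on classpath

    System.out.println(getInt(withXml, "hbase.client.pause", 2000));    // 1000
    System.out.println(getInt(withoutXml, "hbase.client.pause", 2000)); // 2000
  }
}
{code}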

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3272) Remove no longer used options

2010-11-23 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935098#action_12935098
 ] 

Jean-Daniel Cryans commented on HBASE-3272:
---

+1

Looking at the patch, it made me remember that we wanted to raise the default 
for blocking store files to something like 12. Do we still want to do this?

 Remove no longer used options
 -

 Key: HBASE-3272
 URL: https://issues.apache.org/jira/browse/HBASE-3272
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: stack
 Fix For: 0.90.0

 Attachments: 3272.txt


 From Lars George list up on hbase-dev:
 {code}
 Hi,
 I went through the config values as per the defaults XML file (still
 going through it again now based on what is actually in the code, i.e.
 those not in defaults). Here is what I found:
 hbase.master.balancer.period - Only used in hbase-default.xml?
 hbase.regions.percheckin, hbase.regions.slop - Some tests still have
 it but not used anywhere else
 zookeeper.pause, zookeeper.retries - Never used? Only in hbase-defaults.xml
 And then there are differences between hardcoded and XML based defaults:
 hbase.client.pause - XML: 1000, hardcoded: 2000 (HBaseClient) and 30 *
 1000 (HBaseAdmin)
 hbase.client.retries.number - XML: 10, hardcoded 5 (HBaseAdmin) and 2 
 (HMaster)
 hbase.hstore.blockingStoreFiles - XML: 7, hardcoded: -1
 hbase.hstore.compactionThreshold - XML: 3, hardcoded: 2
 hbase.regionserver.global.memstore.lowerLimit - XML: 0.35, hardcoded: 0.25
 hbase.regionserver.handler.count - XML: 25, hardcoded: 10
 hbase.regionserver.msginterval - XML: 3000, hardcoded: 1000
 hbase.rest.port - XML: 8080, hardcoded: 9090
 hfile.block.cache.size - XML: 0.2, hardcoded: 0.0
 Finally, some keys are already in HConstants, some are in local
 classes and others used as literals. There is an issue open to fix
 this though. Just saying.
 Thoughts?
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3254) Ability to specify the host published in zookeeper

2010-11-23 Thread Eric Tschetter (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935102#action_12935102
 ] 

Eric Tschetter commented on HBASE-3254:
---

Not yet.  I worked around it by mucking with my hosts file and setting up an
SSH-based SOCKS proxy for now.  If I get some time, I'll take a stab at a
patch.

I think that it should be reasonable to just have a System/hbase conf
property that the HRegionServer and HMaster look at to get the address that
they publish.  If that is not set, then just do what it does now.


 Ability to specify the host published in zookeeper
 

 Key: HBASE-3254
 URL: https://issues.apache.org/jira/browse/HBASE-3254
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.89.20100924
Reporter: Eric Tschetter

 We are running HBase on EC2 and I'm trying to get a client external from EC2 
 to connect to the cluster.  But, each of the nodes appears to be publishing 
 its IP address into zookeeper.  The problem is that the nodes on EC2 see a 
 10. IP address that is only resolvable inside of EC2.
 Specifically for EC2, there is a DNS name that will resolve properly both 
 externally and internally, so it would be nice if I could tell each of the 
 processes what host to publish into zookeeper via a property.  As it stands, 
 I have to do ssh tunnelling/muck with the hosts file in order to get my 
 client to connect.
  
 This problem could occur anywhere that you have a different DNS entry for 
 public vs. private access.  That might only ever happen on EC2, but it might 
 happen elsewhere.  I don't really know :).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3260) Coprocessors: Lifecycle management

2010-11-23 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935111#action_12935111
 ] 

Andrew Purtell commented on HBASE-3260:
---

We could use something like Guice as a lightweight DI framework within HBase 
code in general, but I think this is orthogonal to what coprocessors tries to 
achieve. 

bq. But each coprocessor is currently a separate island the relationship 
between them seems more akin to chained servlet filters on a request than 
independent components.

Chained servlet filters is a good analogy. Need to add client-transparent 
compression support to your webapp? Register a compression filter on the chain. 
Need to add client-transparent value compression to your table? Register a 
value compression coprocessor on the region.


 Coprocessors: Lifecycle management
 --

 Key: HBASE-3260
 URL: https://issues.apache.org/jira/browse/HBASE-3260
 Project: HBase
  Issue Type: Sub-task
Reporter: Andrew Purtell
 Fix For: 0.92.0

 Attachments: statechart.png


 Considering extending CPs to the master, we have no equivalent to 
 pre/postOpen and pre/postClose as on the regionserver. We also should 
 consider how to resolve dependencies and initialization ordering if loading 
 coprocessors that depend on others. 
 OSGi (http://en.wikipedia.org/wiki/OSGi) has a lifecycle API and is familiar 
 to many Java programmers, so we propose to borrow its terminology and state 
 machine.
 A lifecycle layer manages coprocessors as they are dynamically installed, 
 started, stopped, updated and uninstalled. Coprocessors rely on the framework 
 for dependency resolution and class loading. In turn, the framework calls up 
 to lifecycle management methods in the coprocessor as needed.
 A coprocessor transitions between the below states over its lifetime:
 ||State||Description||
 |UNINSTALLED|The coprocessor implementation is not installed. This is the 
 default implicit state.|
 |INSTALLED|The coprocessor implementation has been successfully installed|
 |STARTING|A coprocessor instance is being started.|
 |ACTIVE|The coprocessor instance has been successfully activated and is 
 running.|
 |STOPPING|A coprocessor instance is being stopped.|
 See attached state diagram. Transitions to STOPPING will only happen as the 
 region is being closed. If a coprocessor throws an unhandled exception, this 
 will cause the RegionServer to close the region, stopping all coprocessor 
 instances on it. 
 Transitions from INSTALLED->STARTING and ACTIVE->STOPPING would go through 
 upcall methods into the coprocessor via the CoprocessorLifecycle interface:
 {code:java}
 public interface CoprocessorLifecycle {
   void start(CoprocessorEnvironment env) throws IOException; 
   void stop(CoprocessorEnvironment env) throws IOException;
 }
 {code}
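An implementing coprocessor would then look something like the sketch below (a hypothetical example, not from the patch; CoprocessorEnvironment is restated as an empty stub since its real shape is not given here). Resources are acquired in start() on the INSTALLED -> STARTING -> ACTIVE path and released in stop() on ACTIVE -> STOPPING, mirroring the state table.

{code:java}
import java.io.IOException;

interface CoprocessorEnvironment { }

interface CoprocessorLifecycle {
  void start(CoprocessorEnvironment env) throws IOException;
  void stop(CoprocessorEnvironment env) throws IOException;
}

public class ExampleCoprocessor implements CoprocessorLifecycle {
  private boolean active = false;

  public void start(CoprocessorEnvironment env) throws IOException {
    active = true;  // STARTING -> ACTIVE: acquire resources here
  }

  public void stop(CoprocessorEnvironment env) throws IOException {
    active = false; // ACTIVE -> STOPPING: release resources here
  }

  public boolean isActive() { return active; }

  public static void main(String[] args) throws IOException {
    ExampleCoprocessor cp = new ExampleCoprocessor();
    cp.start(null);
    System.out.println(cp.isActive()); // true while ACTIVE
    cp.stop(null);
    System.out.println(cp.isActive()); // false after STOPPING
  }
}
{code}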

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HBASE-3272) Remove no longer used options

2010-11-23 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack resolved HBASE-3272.
--

  Resolution: Fixed
Hadoop Flags: [Reviewed]

Committed to branch and trunk.

 Remove no longer used options
 -

 Key: HBASE-3272
 URL: https://issues.apache.org/jira/browse/HBASE-3272
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: stack
 Fix For: 0.90.0

 Attachments: 3272.txt


 From Lars George list up on hbase-dev:
 {code}
 Hi,
 I went through the config values as per the defaults XML file (still
 going through it again now based on what is actually in the code, i.e.
 those not in defaults). Here is what I found:
 hbase.master.balancer.period - Only used in hbase-default.xml?
 hbase.regions.percheckin, hbase.regions.slop - Some tests still have
 it but not used anywhere else
 zookeeper.pause, zookeeper.retries - Never used? Only in hbase-defaults.xml
 And then there are differences between hardcoded and XML based defaults:
 hbase.client.pause - XML: 1000, hardcoded: 2000 (HBaseClient) and 30 *
 1000 (HBaseAdmin)
 hbase.client.retries.number - XML: 10, hardcoded 5 (HBaseAdmin) and 2 
 (HMaster)
 hbase.hstore.blockingStoreFiles - XML: 7, hardcoded: -1
 hbase.hstore.compactionThreshold - XML: 3, hardcoded: 2
 hbase.regionserver.global.memstore.lowerLimit - XML: 0.35, hardcoded: 0.25
 hbase.regionserver.handler.count - XML: 25, hardcoded: 10
 hbase.regionserver.msginterval - XML: 3000, hardcoded: 1000
 hbase.rest.port - XML: 8080, hardcoded: 9090
 hfile.block.cache.size - XML: 0.2, hardcoded: 0.0
 Finally, some keys are already in HConstants, some are in local
 classes and others used as literals. There is an issue open to fix
 this though. Just saying.
 Thoughts?
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HBASE-3273) Set the ZK default timeout to 3 minutes

2010-11-23 Thread Jean-Daniel Cryans (JIRA)
Set the ZK default timeout to 3 minutes
---

 Key: HBASE-3273
 URL: https://issues.apache.org/jira/browse/HBASE-3273
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: Jean-Daniel Cryans
 Fix For: 0.90.0, 0.92.0


Following HBASE-3272, Stack suggested that we up the ZK timeout and proposed 
that we set it to 3 minutes (he said that last part in person). This should 
cover most of the big GC pauses out there.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HBASE-3259) Can't kill the region servers when they wait on the master or the cluster state znode

2010-11-23 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack resolved HBASE-3259.
--

  Resolution: Fixed
Hadoop Flags: [Reviewed]

Committed.  Closing.

 Can't kill the region servers when they wait on the master or the cluster 
 state znode
 -

 Key: HBASE-3259
 URL: https://issues.apache.org/jira/browse/HBASE-3259
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: Jean-Daniel Cryans
Priority: Blocker
 Fix For: 0.90.0, 0.92.0

 Attachments: HBASE-3259.patch


 With a situation like HBASE-3258, it's easy to have the region servers stuck 
 waiting on either the master or the cluster state znode, since that wait has 
 no timeout. You have to kill -9 them to shut them down. This is very bad for 
 usability.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3246) Add API to Increment client class that increments rather than replaces the amount for a column when done multiple times

2010-11-23 Thread Jonathan Gray (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935136#action_12935136
 ] 

Jonathan Gray commented on HBASE-3246:
--

I would be okay with just removing addColumn and having a single call, 
incrementColumn().  That method would be additive w/ existing calls for a given 
column.  If you wanted to somehow undo existing increments of columns, you 
would have to start over with a new Increment.

I could also add a separate call, resetColumn(), but then things start to get 
confusing again.

Is everyone okay with removing addColumn() and just leaving incrementColumn()?
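The replace-vs-add semantics under discussion can be sketched with a plain map (a hypothetical stand-in for the real Increment class, which keys on byte[] family/qualifier rather than strings):

{code:java}
import java.util.Map;
import java.util.TreeMap;

public class IncrementSketch {
  private final Map<String, Long> amounts = new TreeMap<>();

  // addColumn(): repeated calls REPLACE the amount for the column.
  public IncrementSketch addColumn(String column, long amount) {
    amounts.put(column, amount);
    return this;
  }

  // incrementColumn(): repeated calls ADD to any existing amount.
  public IncrementSketch incrementColumn(String column, long amount) {
    amounts.merge(column, amount, Long::sum);
    return this;
  }

  public long amountFor(String column) {
    return amounts.getOrDefault(column, 0L);
  }

  public static void main(String[] args) {
    IncrementSketch inc = new IncrementSketch();
    inc.addColumn("cf:q", 5).addColumn("cf:q", 3);             // replaced: 3
    inc.incrementColumn("cf:q", 2).incrementColumn("cf:q", 2); // 3 + 2 + 2
    System.out.println(inc.amountFor("cf:q"));                 // prints 7
  }
}
{code}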

 Add API to Increment client class that increments rather than replaces the 
 amount for a column when done multiple times
 ---

 Key: HBASE-3246
 URL: https://issues.apache.org/jira/browse/HBASE-3246
 Project: HBase
  Issue Type: Improvement
  Components: client
Reporter: Jonathan Gray
Assignee: Jonathan Gray
 Attachments: HBASE-3246-v1.patch


 In the new Increment class, the API to add columns is {{addColumn()}}.  If 
 you do this multiple times for an individual column, the amount to increment 
 by is replaced.  I think this is the right way for this method to work and it 
 is javadoc'd with the behavior.
 We should add a new method, {{incrementColumn()}} which will increment any 
 existing amount for the specified column rather than replacing it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3246) Add API to Increment client class that increments rather than replaces the amount for a column when done multiple times

2010-11-23 Thread Jonathan Gray (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935137#action_12935137
 ] 

Jonathan Gray commented on HBASE-3246:
--

@Stack, and yeah, none of our client classes are thread-safe.  If you wanted 
to use an Increment across threads you'd have to do your own synchronization, 
but I think that would be kind of odd.  I can add a note to the javadoc that 
the class is not thread-safe.

 Add API to Increment client class that increments rather than replaces the 
 amount for a column when done multiple times
 ---

 Key: HBASE-3246
 URL: https://issues.apache.org/jira/browse/HBASE-3246
 Project: HBase
  Issue Type: Improvement
  Components: client
Reporter: Jonathan Gray
Assignee: Jonathan Gray
 Attachments: HBASE-3246-v1.patch


 In the new Increment class, the API to add columns is {{addColumn()}}.  If 
 you do this multiple times for an individual column, the amount to increment 
 by is replaced.  I think this is the right way for this method to work and it 
 is javadoc'd with the behavior.
 We should add a new method, {{incrementColumn()}} which will increment any 
 existing amount for the specified column rather than replacing it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HBASE-3273) Set the ZK default timeout to 3 minutes

2010-11-23 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-3273:
-

Attachment: doc_of_three_minute.txt

Here is a patch for the manual that adds zookeeper.session.timeout to the list 
of configurations we suggest you change, with an explanation of why it's set to 
a long three-minute default timeout.

 Set the ZK default timeout to 3 minutes
 ---

 Key: HBASE-3273
 URL: https://issues.apache.org/jira/browse/HBASE-3273
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: Jean-Daniel Cryans
 Fix For: 0.90.0, 0.92.0

 Attachments: doc_of_three_minute.txt


 Following HBASE-3272, Stack suggested that we up the ZK timeout and proposed 
 that we set it to 3 minutes (he said that last part in person). This should 
 cover most of the big GC pauses out there.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HBASE-3273) Set the ZK default timeout to 3 minutes

2010-11-23 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated HBASE-3273:
--

Attachment: HBASE-3273.patch

Patch that changes the timeout to 3 minutes and fixes HQuorumPeer to use the 
new configuration introduced in ZK 3.3.0.

 Set the ZK default timeout to 3 minutes
 ---

 Key: HBASE-3273
 URL: https://issues.apache.org/jira/browse/HBASE-3273
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: Jean-Daniel Cryans
 Fix For: 0.90.0, 0.92.0

 Attachments: doc_of_three_minute.txt, HBASE-3273.patch


 Following HBASE-3272, Stack suggested that we up the ZK timeout and proposed 
 that we set it to 3 minutes (he said that last part in person). This should 
 cover most of the big GC pauses out there.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3273) Set the ZK default timeout to 3 minutes

2010-11-23 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935156#action_12935156
 ] 

Jean-Daniel Cryans commented on HBASE-3273:
---

Regarding the documentation:

bq. The default timeout is three minutes

I would add: The default timeout is three minutes (specified in milliseconds)

bq. This means that if a server crash

Shouldn't it be crashes?

Also there's a typo later in intriciacies.

 Set the ZK default timeout to 3 minutes
 ---

 Key: HBASE-3273
 URL: https://issues.apache.org/jira/browse/HBASE-3273
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: Jean-Daniel Cryans
 Fix For: 0.90.0, 0.92.0

 Attachments: doc_of_three_minute.txt, HBASE-3273.patch


 Following HBASE-3272, Stack suggested that we up the ZK timeout and proposed 
 that we set it to 3 minutes (he said that last part in person). This should 
 cover most of the big GC pauses out there.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3273) Set the ZK default timeout to 3 minutes

2010-11-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935180#action_12935180
 ] 

stack commented on HBASE-3273:
--

+1 on your patch and on the doc fixes.

Do you want to make the doc fixes you suggest above and commit the doc 
alongside your patch?

 Set the ZK default timeout to 3 minutes
 ---

 Key: HBASE-3273
 URL: https://issues.apache.org/jira/browse/HBASE-3273
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: Jean-Daniel Cryans
 Fix For: 0.90.0, 0.92.0

 Attachments: doc_of_three_minute.txt, HBASE-3273.patch


 Following HBASE-3272, Stack suggested that we up the ZK timeout and proposed 
 that we set it to 3 minutes (he said that last part in person). This should 
 cover most of the big GC pauses out there.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3246) Add API to Increment client class that increments rather than replaces the amount for a column when done multiple times

2010-11-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935182#action_12935182
 ] 

stack commented on HBASE-3246:
--

I'm fine w/ removing addColumn.  Let's get the change in before we cut a new RC.

 Add API to Increment client class that increments rather than replaces the 
 amount for a column when done multiple times
 ---

 Key: HBASE-3246
 URL: https://issues.apache.org/jira/browse/HBASE-3246
 Project: HBase
  Issue Type: Improvement
  Components: client
Reporter: Jonathan Gray
Assignee: Jonathan Gray
 Attachments: HBASE-3246-v1.patch


 In the new Increment class, the API to add columns is {{addColumn()}}.  If 
 you do this multiple times for an individual column, the amount to increment 
 by is replaced.  I think this is the right way for this method to work and it 
 is javadoc'd with the behavior.
 We should add a new method, {{incrementColumn()}} which will increment any 
 existing amount for the specified column rather than replacing it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3260) Coprocessors: Lifecycle management

2010-11-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935183#action_12935183
 ] 

stack commented on HBASE-3260:
--

Thanks for entertaining my random ramble Gary and Andrew.  Your reasoning seems 
good to me.

 Coprocessors: Lifecycle management
 --

 Key: HBASE-3260
 URL: https://issues.apache.org/jira/browse/HBASE-3260
 Project: HBase
  Issue Type: Sub-task
Reporter: Andrew Purtell
 Fix For: 0.92.0

 Attachments: statechart.png


 Considering extending CPs to the master, we have no equivalent to 
 pre/postOpen and pre/postClose as on the regionserver. We also should 
 consider how to resolve dependencies and initialization ordering if loading 
 coprocessors that depend on others. 
 OSGi (http://en.wikipedia.org/wiki/OSGi) has a lifecycle API and is familiar 
 to many Java programmers, so we propose to borrow its terminology and state 
 machine.
 A lifecycle layer manages coprocessors as they are dynamically installed, 
 started, stopped, updated and uninstalled. Coprocessors rely on the framework 
 for dependency resolution and class loading. In turn, the framework calls up 
 to lifecycle management methods in the coprocessor as needed.
 A coprocessor transitions between the below states over its lifetime:
 ||State||Description||
 |UNINSTALLED|The coprocessor implementation is not installed. This is the 
 default implicit state.|
 |INSTALLED|The coprocessor implementation has been successfully installed|
 |STARTING|A coprocessor instance is being started.|
 |ACTIVE|The coprocessor instance has been successfully activated and is 
 running.|
 |STOPPING|A coprocessor instance is being stopped.|
 See attached state diagram. Transitions to STOPPING will only happen as the 
 region is being closed. If a coprocessor throws an unhandled exception, this 
 will cause the RegionServer to close the region, stopping all coprocessor 
 instances on it. 
 Transitions from INSTALLED->STARTING and ACTIVE->STOPPING would go through 
 upcall methods into the coprocessor via the CoprocessorLifecycle interface:
 {code:java}
 public interface CoprocessorLifecycle {
   void start(CoprocessorEnvironment env) throws IOException; 
   void stop(CoprocessorEnvironment env) throws IOException;
 }
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3271) Allow .META. table to be exported

2010-11-23 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935184#action_12935184
 ] 

Ted Yu commented on HBASE-3271:
---

I used this code:
{code:java}
if (keys == null || keys.getFirst() == null ||
    keys.getFirst().length == 0) {
  HRegionLocation regLoc =
      table.getRegionLocation(HConstants.EMPTY_BYTE_ARRAY);
  if (null == regLoc)
    throw new IOException("Expecting at least one region.");
  List<InputSplit> splits = new ArrayList<InputSplit>(1);
  InputSplit split = new TableSplit(table.getTableName(),
      HConstants.EMPTY_BYTE_ARRAY,
      HConstants.EMPTY_BYTE_ARRAY,
      regLoc.getServerAddress().getHostname());
  splits.add(split);
  return splits;
}
{code}

The following command only exports rows in .META. which have 'packageindex' 
(refer to HBASE-3255):
bin/hbase org.apache.hadoop.hbase.mapreduce.Export .META. h-meta 1 0 0 
packageindex

-rwxrwxrwx 1 hadoop users 90700 Nov 24 03:31 h-meta/part-m-0

 Allow .META. table to be exported
 -

 Key: HBASE-3271
 URL: https://issues.apache.org/jira/browse/HBASE-3271
 Project: HBase
  Issue Type: Improvement
  Components: util
Affects Versions: 0.20.6
Reporter: Ted Yu

 I tried to export .META. table in 0.20.6 and got:
 [had...@us01-ciqps1-name01 hbase]$ bin/hbase 
 org.apache.hadoop.hbase.mapreduce.Export .META. h-meta 1 0 0
 10/11/23 20:59:05 INFO jvm.JvmMetrics: Initializing JVM Metrics with 
 processName=JobTracker, sessionId=
 2010-11-23 20:59:05.255::INFO:  Logging to STDERR via 
 org.mortbay.log.StdErrLog
 2010-11-23 20:59:05.255::INFO:  verisons=1, starttime=0, 
 endtime=9223372036854775807
 10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client 
 environment:zookeeper.version=3.2.2-888565, built on 12/08/2009 21:51 GMT
 10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client 
 environment:host.name=us01-ciqps1-name01.carrieriq.com
 10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client 
 environment:java.version=1.6.0_21
 10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client 
 environment:java.vendor=Sun Microsystems Inc.
 ...
 10/11/23 20:59:05 INFO zookeeper.ClientCnxn: Server connection successful
 10/11/23 20:59:05 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode 
 /hbase/root-region-server got 10.202.50.112:60020
 10/11/23 20:59:05 DEBUG client.HConnectionManager$TableServers: Found ROOT at 
 10.202.50.112:60020
 10/11/23 20:59:05 DEBUG client.HConnectionManager$TableServers: Cached 
 location for .META.,,1 is us01-ciqps1-grid02.carrieriq.com:60020
 Exception in thread "main" java.io.IOException: Expecting at least one region.
 at 
 org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:281)
 at 
 org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
 at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
 at org.apache.hadoop.hbase.mapreduce.Export.main(Export.java:146)
 Related code is:
 if (keys == null || keys.getFirst() == null ||
 keys.getFirst().length == 0) {
   throw new IOException("Expecting at least one region.");
 }
 My intention was to save the dangling rows in .META. (for future 
 investigation) which prevented a table from being created.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3254) Ability to specify the host published in zookeeper

2010-11-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935187#action_12935187
 ] 

stack commented on HBASE-3254:
--

A patch whereby servers would optionally register themselves in zk using a 
suggested hostname seems reasonable. (The only tricky part is that a 
RegionServer will use the 'name' the Master tells it to use -- see the 
reportForDuty code in HRegionServer. On startup, a RegionServer reads its 
'address' and volunteers it to the Master, but the Master can change it on the 
RegionServer. After the first reportForDuty, the RegionServer will always check 
in using the name the Master told it. In fact, you might be able to exploit 
this behavior by patching the Master only.)

 Ability to specify the host published in zookeeper
 

 Key: HBASE-3254
 URL: https://issues.apache.org/jira/browse/HBASE-3254
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.89.20100924
Reporter: Eric Tschetter

 We are running HBase on EC2 and I'm trying to get a client external from EC2 
 to connect to the cluster.  But, each of the nodes appears to be publishing 
 its IP address into zookeeper.  The problem is that the nodes on EC2 see a 
 10. IP address that is only resolvable inside of EC2.
 Specifically for EC2, there is a DNS name that will resolve properly both 
 externally and internally, so it would be nice if I could tell each of the 
 processes what host to publish into zookeeper via a property.  As it stands, 
 I have to do ssh tunnelling/muck with the hosts file in order to get my 
 client to connect.
  
 This problem could occur anywhere that you have a different DNS entry for 
 public vs. private access.  That might only ever happen on EC2, but it might 
 happen elsewhere.  I don't really know :).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3261) NPE out of HRS.run at startup when clock is out of sync

2010-11-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935188#action_12935188
 ] 

stack commented on HBASE-3261:
--

+1

 NPE out of HRS.run at startup when clock is out of sync
 ---

 Key: HBASE-3261
 URL: https://issues.apache.org/jira/browse/HBASE-3261
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: Jean-Daniel Cryans
 Fix For: 0.90.0, 0.92.0

 Attachments: HBASE-3261.patch


 This is what I get when I start a region server that's not properly sync'ed:
 {noformat}
 Exception in thread "regionserver60020" java.lang.NullPointerException
   at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:603)
   at java.lang.Thread.run(Thread.java:637)
 {noformat}
 In this case the line was:
 {noformat}
 hlogRoller.interruptIfNecessary();
 {noformat}
 I guess we could add a bunch of other null checks.
 The end result is the same, the RS dies, but I think it's misleading.
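The defensive pattern being suggested is simple; a minimal sketch follows. The `LogRoller` stand-in and `ShutdownGuard` class below are hypothetical illustrations of the null-guard pattern, not the actual HBase classes:

```java
// Hypothetical stand-in for the real HLog roller; illustrates only the
// null-guard pattern, not HBase internals.
class LogRoller {
    void interruptIfNecessary() { /* interrupt the roller thread if running */ }
}

public class ShutdownGuard {
    LogRoller hlogRoller; // may still be null if startup aborted early

    // Guarded shutdown: skip components that were never initialized
    // instead of dying with an NPE.
    void stopRoller() {
        if (hlogRoller != null) {
            hlogRoller.interruptIfNecessary();
        }
    }

    public static void main(String[] args) {
        // Startup failed before hlogRoller was created; shutdown still succeeds.
        new ShutdownGuard().stopRoller();
        System.out.println("clean shutdown");
    }
}
```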

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3271) Allow .META. table to be exported

2010-11-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935190#action_12935190
 ] 

stack commented on HBASE-3271:
--

Any chance of your making the code above into a patch and attaching it to this 
issue so we can review it, Ted?  (This might be of help: 
http://www.apache.org/dev/contributors.html#patches).  Thanks.

 Allow .META. table to be exported
 -

 Key: HBASE-3271
 URL: https://issues.apache.org/jira/browse/HBASE-3271
 Project: HBase
  Issue Type: Improvement
  Components: util
Affects Versions: 0.20.6
Reporter: Ted Yu

 I tried to export .META. table in 0.20.6 and got:
 [had...@us01-ciqps1-name01 hbase]$ bin/hbase 
 org.apache.hadoop.hbase.mapreduce.Export .META. h-meta 1 0 0
 10/11/23 20:59:05 INFO jvm.JvmMetrics: Initializing JVM Metrics with 
 processName=JobTracker, sessionId=
 2010-11-23 20:59:05.255::INFO:  Logging to STDERR via 
 org.mortbay.log.StdErrLog
 2010-11-23 20:59:05.255::INFO:  verisons=1, starttime=0, 
 endtime=9223372036854775807
 10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client 
 environment:zookeeper.version=3.2.2-888565, built on 12/08/2009 21:51 GMT
 10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client 
 environment:host.name=us01-ciqps1-name01.carrieriq.com
 10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client 
 environment:java.version=1.6.0_21
 10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client 
 environment:java.vendor=Sun Microsystems Inc.
 ...
 10/11/23 20:59:05 INFO zookeeper.ClientCnxn: Server connection successful
 10/11/23 20:59:05 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode 
 /hbase/root-region-server got 10.202.50.112:60020
 10/11/23 20:59:05 DEBUG client.HConnectionManager$TableServers: Found ROOT at 
 10.202.50.112:60020
 10/11/23 20:59:05 DEBUG client.HConnectionManager$TableServers: Cached 
 location for .META.,,1 is us01-ciqps1-grid02.carrieriq.com:60020
 Exception in thread "main" java.io.IOException: Expecting at least one region.
 at 
 org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:281)
 at 
 org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
 at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
 at org.apache.hadoop.hbase.mapreduce.Export.main(Export.java:146)
 Related code is:
 if (keys == null || keys.getFirst() == null ||
 keys.getFirst().length == 0) {
   throw new IOException("Expecting at least one region.");
 }
 My intention was to save the dangling rows in .META. (for future 
 investigation) which prevented a table from being created.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-2888) Review all our metrics

2010-11-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935193#action_12935193
 ] 

stack commented on HBASE-2888:
--

@Alex 1. and 2. sound great.  How would 3. differ from the current log?  All it 
has in it is 'events', no?  Regarding your question about process, I'd say 
there's no need for an issue per metric, at least not for now.  The way it 
usually runs is we have a fat umbrella issue like this one, a bunch of work 
gets done under this umbrella -- in this case, a bunch of the above will be 
addressed by the patch -- but then subsequent amendments or additions get done 
in separate issues.  Hope this helps.

 Review all our metrics
 --

 Key: HBASE-2888
 URL: https://issues.apache.org/jira/browse/HBASE-2888
 Project: HBase
  Issue Type: Improvement
  Components: master
Reporter: Jean-Daniel Cryans
 Fix For: 0.92.0


 HBase publishes a bunch of metrics, some useful some wasteful, that should be 
 improved to deliver a better ops experience. Examples:
  - Block cache hit ratio converges at some point and stops moving
  - fsReadLatency goes down when compactions are running
  - storefileIndexSizeMB is the exact same number once a system is serving 
 production load
 We could use new metrics too.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3269) HBase table truncate semantics seems broken as disable table is now async by default.

2010-11-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935196#action_12935196
 ] 

stack commented on HBASE-3269:
--

This is the fix:

{code}
pynchon-432:clean_trunk stack$ svn diff 
Index: src/main/ruby/hbase/admin.rb
===================================================================
--- src/main/ruby/hbase/admin.rb	(revision 1038464)
+++ src/main/ruby/hbase/admin.rb	(working copy)
@@ -83,7 +83,7 @@
 # Disables a table
 def disable(table_name)
   return unless enabled?(table_name)
-  @admin.disableTableAsync(table_name)
+  @admin.disableTable(table_name)
 end
 
 
#--
{code}

I'm just going to commit.

 HBase table truncate semantics seems broken as disable table is now async 
 by default.
 ---

 Key: HBASE-3269
 URL: https://issues.apache.org/jira/browse/HBASE-3269
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.0
 Environment: RHEL5 x86_64
Reporter: Suraj Varma
Assignee: stack
Priority: Critical
 Fix For: 0.90.0, 0.92.0


 The new async design for disable table seems to have caused a side effect on 
 the truncate command. (IRC chat with jdcryans)
 Apparent Cause: 
 Disable is now async by default. When truncate is called, the disable 
 operation returns immediately and when the drop is called, the disable 
 operation is still not completed. This results in 
 HMaster.checkTableModifiable() throwing a TableNotDisabledException.
 With earlier versions, disable returned only after Table was disabled.
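The race described above can be illustrated with a small, self-contained simulation. All class and method names below are hypothetical stand-ins, not the real HBaseAdmin API:

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical simulation of the truncate race: an async disable returns
// before the table is actually disabled, so the drop that follows fails.
public class TruncateRaceDemo {
    static class FakeTable {
        volatile boolean disabled = false;
        final CountDownLatch disableDone = new CountDownLatch(1);
    }

    // Async flavor: kicks off the disable and returns immediately.
    static void disableTableAsync(final FakeTable t) {
        new Thread(new Runnable() {
            public void run() {
                try { Thread.sleep(100); } catch (InterruptedException e) { }
                t.disabled = true;
                t.disableDone.countDown();
            }
        }).start();
    }

    // Sync flavor: waits until the disable has completed, as pre-0.90 did.
    static void disableTable(FakeTable t) throws InterruptedException {
        disableTableAsync(t);
        t.disableDone.await();
    }

    static void dropTable(FakeTable t) {
        if (!t.disabled) {
            throw new IllegalStateException("TableNotDisabledException");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        FakeTable t1 = new FakeTable();
        disableTableAsync(t1);
        try {
            dropTable(t1); // typically fails: disable has not completed yet
        } catch (IllegalStateException expected) {
            System.out.println("async disable + drop: " + expected.getMessage());
        }

        FakeTable t2 = new FakeTable();
        disableTable(t2); // blocks until disabled
        dropTable(t2);    // now succeeds
        System.out.println("sync disable + drop: ok");
    }
}
```

This is why the one-line fix of switching the shell back to the blocking disableTable resolves the truncate failure.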

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HBASE-3269) HBase table truncate semantics seems broken as disable table is now async by default.

2010-11-23 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack resolved HBASE-3269.
--

Resolution: Fixed

Committed branch and trunk.

 HBase table truncate semantics seems broken as disable table is now async 
 by default.
 ---

 Key: HBASE-3269
 URL: https://issues.apache.org/jira/browse/HBASE-3269
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.0
 Environment: RHEL5 x86_64
Reporter: Suraj Varma
Assignee: stack
Priority: Critical
 Fix For: 0.90.0, 0.92.0


 The new async design for disable table seems to have caused a side effect on 
 the truncate command. (IRC chat with jdcryans)
 Apparent Cause: 
 Disable is now async by default. When truncate is called, the disable 
 operation returns immediately and when the drop is called, the disable 
 operation is still not completed. This results in 
 HMaster.checkTableModifiable() throwing a TableNotDisabledException.
 With earlier versions, disable returned only after Table was disabled.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-3269) HBase table truncate semantics seems broken as disable table is now async by default.

2010-11-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935199#action_12935199
 ] 

stack commented on HBASE-3269:
--

Oh, checking for other mentions of async, I found that enable is also async in 
the shell, so I changed that too.  I committed the change under this issue.

 HBase table truncate semantics seems broken as disable table is now async 
 by default.
 ---

 Key: HBASE-3269
 URL: https://issues.apache.org/jira/browse/HBASE-3269
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.0
 Environment: RHEL5 x86_64
Reporter: Suraj Varma
Assignee: stack
Priority: Critical
 Fix For: 0.90.0, 0.92.0


 The new async design for disable table seems to have caused a side effect on 
 the truncate command. (IRC chat with jdcryans)
 Apparent Cause: 
 Disable is now async by default. When truncate is called, the disable 
 operation returns immediately and when the drop is called, the disable 
 operation is still not completed. This results in 
 HMaster.checkTableModifiable() throwing a TableNotDisabledException.
 With earlier versions, disable returned only after Table was disabled.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-2888) Review all our metrics

2010-11-23 Thread Alex Baranau (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935217#action_12935217
 ] 

Alex Baranau commented on HBASE-2888:
-

bq. How would 3. differ from current log?

The difference is to have this shown somewhere more user-friendly, e.g. *on the 
web interface*. The point I wanted to stress is that currently users/ops have 
to go to the log files for this information, which isn't straightforward for 
them (and leads to questions on the ML, as I linked to), I think, but I might 
be wrong.

Thanks.

 Review all our metrics
 --

 Key: HBASE-2888
 URL: https://issues.apache.org/jira/browse/HBASE-2888
 Project: HBase
  Issue Type: Improvement
  Components: master
Reporter: Jean-Daniel Cryans
 Fix For: 0.92.0


 HBase publishes a bunch of metrics, some useful some wasteful, that should be 
 improved to deliver a better ops experience. Examples:
  - Block cache hit ratio converges at some point and stops moving
  - fsReadLatency goes down when compactions are running
  - storefileIndexSizeMB is the exact same number once a system is serving 
 production load
 We could use new metrics too.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.