Re: Errors after major compaction

2011-07-07 Thread Eran Kutner
Well, the master doesn't know that s05 has the region open -- that's why it gives it to s02 -- and then, there is no channel available to s05 to figure who has what. The way I see it, that's the root of the problem. It would probably make sense if the RS could figure this out independently from

double assignment WAS: Errors after major compaction

2011-07-07 Thread Ted Yu
Mind pastebin'ing this part of master log? 2011-06-29 16:39:54,326 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. on hadoop1-s05.farm-ny.gigya.com,60020,1307349217076

Re: Errors after major compaction

2011-07-07 Thread Stack
On Thu, Jul 7, 2011 at 2:56 AM, Eran Kutner e...@gigya.com wrote: Well, the master doesn't know that s05 has the region open -- that's why it gives it to s02 -- and then, there is no channel available to s05 to figure who has what. The way I see it, that's the root of the problem. Well backing

Re: Errors after major compaction

2011-07-06 Thread Eran Kutner
no. but I did run major compaction. As I explained initially, I disabled the table so I could change its TTL, then re-enabled it then ran major compaction so it would clean up the expired data due to the TTL change. -eran On Wed, Jul 6, 2011 at 02:43, Ted Yu yuzhih...@gmail.com wrote: Eran:

Re: Errors after major compaction

2011-07-06 Thread Stack
On Sun, Jul 3, 2011 at 12:00 AM, Eran Kutner e...@gigya.com wrote: It does seem that both servers opened the same region around the same time. The region was offline because I disabled the table so I can change its TTL. . 2011-06-29 16:37:12,964 DEBUG

Re: Errors after major compaction

2011-07-06 Thread Stack
On Sun, Jul 3, 2011 at 12:02 PM, Eran Kutner e...@gigya.com wrote: 4. Then at 16:40:00 the master log says: master:6-0x13004a31d7804c4 Creating (or updating) unassigned node for 584dac5cc70d8682f71c4675a843c309 with OFFLINE state - why did it decide to take the region offline after

Re: Errors after major compaction

2011-07-05 Thread Ted Yu
Eran: I logged https://issues.apache.org/jira/browse/HBASE-4060 for you. On Mon, Jul 4, 2011 at 2:30 AM, Ted Yu yuzhih...@gmail.com wrote: Thanks for the understanding. Can you log a JIRA and put your ideas below in it ? On Jul 4, 2011, at 12:42 AM, Eran Kutner e...@gigya.com wrote:

Re: Errors after major compaction

2011-07-05 Thread Eran Kutner
Appreciate it, sorry I didn't get to it sooner. Had some crazy days :) -eran On Tue, Jul 5, 2011 at 17:19, Ted Yu yuzhih...@gmail.com wrote: Eran: I logged https://issues.apache.org/jira/browse/HBASE-4060 for you. On Mon, Jul 4, 2011 at 2:30 AM, Ted Yu yuzhih...@gmail.com wrote: Thanks

Re: Errors after major compaction

2011-07-05 Thread Ted Yu
Eran: You didn't run hbck during the enabling of gs_raw_events table, right ? I saw: 2011-06-29 16:43:50,395 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction (major) requested for gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.

Re: Errors after major compaction

2011-07-04 Thread Ted Yu
Thanks for the understanding. Can you log a JIRA and put your ideas below in it ? On Jul 4, 2011, at 12:42 AM, Eran Kutner e...@gigya.com wrote: Thanks for the explanation Ted, I will try to apply HBASE-3789 and hope for the best but my understanding is that it doesn't really solve the

Re: Errors after major compaction

2011-07-04 Thread Eran Kutner
Sure, I'll do that. -eran On Mon, Jul 4, 2011 at 12:30, Ted Yu yuzhih...@gmail.com wrote: Thanks for the understanding. Can you log a JIRA and put your ideas below in it ? On Jul 4, 2011, at 12:42 AM, Eran Kutner e...@gigya.com wrote: Thanks for the explanation Ted, I will try

Re: Errors after major compaction

2011-07-03 Thread Eran Kutner
It does seem that both servers opened the same region around the same time. The region was offline because I disabled the table so I can change its TTL. Here is the log from hadoop1-s05 : 2011-06-29 16:37:12,576 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open

Re: Errors after major compaction

2011-07-03 Thread Ted Yu
Eran: I was thinking of this: HBASE-3789 Cleanup the locking contention in the master though it doesn't directly handle 'PENDING_OPEN for too long' case. https://issues.apache.org/jira/browse/HBASE-3741 is in 0.90.3 and actually close to the symptom you described. On Sun, Jul 3, 2011 at 12:00

Re: Errors after major compaction

2011-07-03 Thread Eran Kutner
Thanks Ted, but, as stated before, I'm already using 0.90.3, so either it's not fixed or it's not the same thing. -eran On Sun, Jul 3, 2011 at 17:27, Ted Yu yuzhih...@gmail.com wrote: Eran: I was thinking of this: HBASE-3789 Cleanup the locking contention in the master though it doesn't

Re: Errors after major compaction

2011-07-03 Thread Ted Yu
HBASE-3789 should have sped up region assignment. The patch for 0.90 is attached to that JIRA. You may prudently apply that patch. Regards On Sun, Jul 3, 2011 at 10:01 AM, Eran Kutner e...@gigya.com wrote: Thanks Ted, but, as stated before, I'm already using 0.90.3, so either it's not fixed

Re: Errors after major compaction

2011-07-03 Thread Eran Kutner
Ted, So if I understand correctly the theory is that because of the issue fixed in HBASE-3789 the master took too long to detect that the region was successfully opened by the first server so it force-closed it and transitioned to a second server, but there are a few things about this

Re: Errors after major compaction

2011-07-03 Thread Ted Yu
Let me try to answer some of your questions. The two paragraphs below were written along my reasoning which is in reverse order of the actual call sequence. For #4 below, the log indicates that the following was executed: private void assign(final RegionState state, final boolean

Re: Errors after major compaction

2011-07-01 Thread Stack
So, Eran, it seems as though two RegionServers were carrying the region? One deleted a file (compaction on its side)? Can you figure if indeed two servers had the same region? (Check master logs for this region's assignments). What version of hbase? St.Ack On Thu, Jun 30, 2011 at 3:58 AM, Eran

Re: Errors after major compaction

2011-07-01 Thread Eran Kutner
Hi Stack, I'm not sure what the log means. I do see references to two different servers, but that would probably happen if there was a normal transition, I assume. I'm using version 0.90.3. Here are the relevant lines from the master logs: 2011-06-19 21:39:37,164 INFO

Re: Errors after major compaction

2011-07-01 Thread Stack
Is gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. the region that was having the issue? If so, if you looked in hadoop1-s05's logs, was this region opened around 2011-06-29 16:43:57? Was it also opened on hadoop1-s02 not long after? Did you say what

Re: Errors after major compaction

2011-07-01 Thread Ted Yu
2011-06-29 16:43:57,880 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. The double assignment should have been fixed by J-D's recent

Errors after major compaction

2011-06-30 Thread Eran Kutner
Hi, I have a cluster of 5 nodes with one large table that currently has around 12000 regions. Everything was working fine for a relatively long time, until now. Yesterday I significantly reduced the TTL on the table and initiated major compaction. This should have reduced the table size to about 20%