[jira] [Updated] (HBASE-5165) Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause deleted region to assign again

2012-01-10 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5165:


Attachment: hbase-5165v2.patch

Modified it per Ted's comment in 5165v2.patch (DeleteTableHandler doesn't set 
the table enabled until no ServerShutdownHandler is running).

Since fixupDaughters() will not be executed if the table is disabled, I think 
patch v2 fixes this issue, including HBASE-5155.

Waiting for a better approach.


> Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
> cause deleted region to assign again
> --
>
> Key: HBASE-5165
> URL: https://issues.apache.org/jira/browse/HBASE-5165
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.4
>Reporter: chunhui shen
> Attachments: hbase-5165.patch, hbase-5165v2.patch
>
>
> Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
> cause the following situation:
> 1. The table has already been disabled.
> 2. ServerShutdownHandler is executing MetaReader.getServerUserRegions.
> 3. While step 2 is in progress or has just completed, DeleteTableHandler 
> starts to delete the region (removing it from META and deleting it from the FS).
> 4. DeleteTableHandler sets the table enabled.
> 5. ServerShutdownHandler starts to assign the region, which has already been 
> deleted by DeleteTableHandler.
> The result of the above operations is an invalid record in .META. that 
> can't be fixed by hbck.
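The race in the description comes down to ordering: the table must not be flipped back to enabled while a ServerShutdownHandler can still assign one of its regions. A minimal Java sketch of that guard follows; all names here are hypothetical illustrations, not the actual HBase master internals or the attached patch.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the idea in patch v2: DeleteTableHandler does not mark the
// table enabled while any ServerShutdownHandler is still running, so a
// deleted region cannot be picked up for reassignment.
public class HandlerSync {
    // count of ServerShutdownHandler instances currently processing
    static final AtomicInteger runningShutdownHandlers = new AtomicInteger();
    static volatile boolean tableEnabled = false;

    static void serverShutdownHandlerBody(Runnable assignRegions) {
        runningShutdownHandlers.incrementAndGet();
        try {
            assignRegions.run();   // MetaReader.getServerUserRegions + assignment
        } finally {
            runningShutdownHandlers.decrementAndGet();
        }
    }

    static void deleteTableHandlerBody() throws InterruptedException {
        // ... remove regions from META and delete them from the FS ...
        // Only flip the table back to enabled once no shutdown handler is active.
        while (runningShutdownHandlers.get() > 0) {
            Thread.sleep(10);
        }
        tableEnabled = true;
    }

    public static void main(String[] args) throws Exception {
        Thread ssh = new Thread(() -> serverShutdownHandlerBody(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) {}
        }));
        ssh.start();
        deleteTableHandlerBody();  // waits out any in-flight shutdown handler
        ssh.join();
        System.out.println(tableEnabled && runningShutdownHandlers.get() == 0);
    }
}
```

The sketch only shows the wait-before-enable ordering; the real patch has to cover the window where the shutdown handler has already read META but not yet assigned.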

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5165) Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause deleted region to assign again

2012-01-10 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5165:
--

Status: Patch Available  (was: Open)

> Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
> cause deleted region to assign again
> --
>
> Key: HBASE-5165
> URL: https://issues.apache.org/jira/browse/HBASE-5165
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.4
>Reporter: chunhui shen
> Attachments: hbase-5165.patch, hbase-5165v2.patch
>
>
> Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
> cause the following situation:
> 1. The table has already been disabled.
> 2. ServerShutdownHandler is executing MetaReader.getServerUserRegions.
> 3. While step 2 is in progress or has just completed, DeleteTableHandler 
> starts to delete the region (removing it from META and deleting it from the FS).
> 4. DeleteTableHandler sets the table enabled.
> 5. ServerShutdownHandler starts to assign the region, which has already been 
> deleted by DeleteTableHandler.
> The result of the above operations is an invalid record in .META. that 
> can't be fixed by hbck.





[jira] [Updated] (HBASE-5139) Compute (weighted) median using AggregateProtocol

2012-01-10 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5139:
--

Attachment: (was: 5139.txt)

> Compute (weighted) median using AggregateProtocol
> -
>
> Key: HBASE-5139
> URL: https://issues.apache.org/jira/browse/HBASE-5139
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Zhihong Yu
>Assignee: Zhihong Yu
>
> Suppose cf:cq1 stores numeric values and, optionally, cf:cq2 stores weights. 
> This task finds the median among the values of cf:cq1 (see 
> http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html).
> This can be done in two passes.
> The first pass utilizes AggregateProtocol, where the following tuple is 
> returned from each region:
> (partial-sum-of-values, partial-sum-of-weights)
> The start rowkey (supplied by the coprocessor framework) would be used to sort 
> the tuples. This way we can determine which region (call it R) contains the 
> (weighted) median. partial-sum-of-weights can be 0 if an unweighted median is 
> sought.
> The second pass involves scanning the table, beginning with the start row of 
> region R, and computing the partial (weighted) sum until the threshold of S/2 
> (half the total weight) is crossed. The (weighted) median is then returned.
> However, this approach wouldn't work if the underlying table is mutated 
> between pass one and pass two.
> In that case, sequential scanning seems to be the solution, though it is 
> slower than the above approach.
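The quantity both passes converge on can be checked locally: once all (value, weight) pairs are ordered by value, the weighted median is the first value at which the running weight reaches S/2. A self-contained sketch of that definition, not the coprocessor code from the attached patch:

```java
import java.util.Arrays;
import java.util.Comparator;

// Local computation of a weighted median over (value, weight) pairs -- the
// same quantity the two-pass AggregateProtocol approach is converging on.
public class WeightedMedian {
    static double weightedMedian(double[] values, double[] weights) {
        // sort indices by value so weights stay paired with their values
        Integer[] idx = new Integer[values.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> values[i]));

        double total = 0;
        for (double w : weights) total += w;   // S = total weight

        // walk the sorted values until the running weight crosses S/2
        double running = 0;
        for (int i : idx) {
            running += weights[i];
            if (running >= total / 2) return values[i];
        }
        throw new IllegalStateException("empty input");
    }

    public static void main(String[] args) {
        double[] v = {1, 2, 3, 4};
        double[] w = {1, 1, 1, 10};   // heavy weight drags the median to 4
        System.out.println(weightedMedian(v, w));                        // 4.0
        System.out.println(weightedMedian(v, new double[]{1, 1, 1, 1})); // 2.0 (unweighted)
    }
}
```

Passing all-equal weights reduces this to the ordinary median, matching the "partial-sum-of-weights can be 0 if an unweighted median is sought" remark.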





[jira] [Updated] (HBASE-5139) Compute (weighted) median using AggregateProtocol

2012-01-10 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5139:
--

Attachment: 5139.txt

> Compute (weighted) median using AggregateProtocol
> -
>
> Key: HBASE-5139
> URL: https://issues.apache.org/jira/browse/HBASE-5139
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Zhihong Yu
>Assignee: Zhihong Yu
> Attachments: 5139.txt
>
>
> Suppose cf:cq1 stores numeric values and, optionally, cf:cq2 stores weights. 
> This task finds the median among the values of cf:cq1 (see 
> http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html).
> This can be done in two passes.
> The first pass utilizes AggregateProtocol, where the following tuple is 
> returned from each region:
> (partial-sum-of-values, partial-sum-of-weights)
> The start rowkey (supplied by the coprocessor framework) would be used to sort 
> the tuples. This way we can determine which region (call it R) contains the 
> (weighted) median. partial-sum-of-weights can be 0 if an unweighted median is 
> sought.
> The second pass involves scanning the table, beginning with the start row of 
> region R, and computing the partial (weighted) sum until the threshold of S/2 
> (half the total weight) is crossed. The (weighted) median is then returned.
> However, this approach wouldn't work if the underlying table is mutated 
> between pass one and pass two.
> In that case, sequential scanning seems to be the solution, though it is 
> slower than the above approach.





[jira] [Updated] (HBASE-5139) Compute (weighted) median using AggregateProtocol

2012-01-10 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5139:
--

Attachment: 5139.txt

Patch now supports weighted median

> Compute (weighted) median using AggregateProtocol
> -
>
> Key: HBASE-5139
> URL: https://issues.apache.org/jira/browse/HBASE-5139
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Zhihong Yu
>Assignee: Zhihong Yu
> Attachments: 5139.txt
>
>
> Suppose cf:cq1 stores numeric values and, optionally, cf:cq2 stores weights. 
> This task finds the median among the values of cf:cq1 (see 
> http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html).
> This can be done in two passes.
> The first pass utilizes AggregateProtocol, where the following tuple is 
> returned from each region:
> (partial-sum-of-values, partial-sum-of-weights)
> The start rowkey (supplied by the coprocessor framework) would be used to sort 
> the tuples. This way we can determine which region (call it R) contains the 
> (weighted) median. partial-sum-of-weights can be 0 if an unweighted median is 
> sought.
> The second pass involves scanning the table, beginning with the start row of 
> region R, and computing the partial (weighted) sum until the threshold of S/2 
> (half the total weight) is crossed. The (weighted) median is then returned.
> However, this approach wouldn't work if the underlying table is mutated 
> between pass one and pass two.
> In that case, sequential scanning seems to be the solution, though it is 
> slower than the above approach.





[jira] [Updated] (HBASE-5139) Compute (weighted) median using AggregateProtocol

2012-01-10 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5139:
--

Attachment: (was: 5139.txt)

> Compute (weighted) median using AggregateProtocol
> -
>
> Key: HBASE-5139
> URL: https://issues.apache.org/jira/browse/HBASE-5139
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Zhihong Yu
>Assignee: Zhihong Yu
>
> Suppose cf:cq1 stores numeric values and, optionally, cf:cq2 stores weights. 
> This task finds the median among the values of cf:cq1 (see 
> http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html).
> This can be done in two passes.
> The first pass utilizes AggregateProtocol, where the following tuple is 
> returned from each region:
> (partial-sum-of-values, partial-sum-of-weights)
> The start rowkey (supplied by the coprocessor framework) would be used to sort 
> the tuples. This way we can determine which region (call it R) contains the 
> (weighted) median. partial-sum-of-weights can be 0 if an unweighted median is 
> sought.
> The second pass involves scanning the table, beginning with the start row of 
> region R, and computing the partial (weighted) sum until the threshold of S/2 
> (half the total weight) is crossed. The (weighted) median is then returned.
> However, this approach wouldn't work if the underlying table is mutated 
> between pass one and pass two.
> In that case, sequential scanning seems to be the solution, though it is 
> slower than the above approach.





[jira] [Commented] (HBASE-5165) Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause deleted region to assign again

2012-01-10 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183167#comment-13183167
 ] 

Zhihong Yu commented on HBASE-5165:
---

As Stack said:
bq. now when a table is disabled, we now set a flag for the table in zk rather 
than do it individually on each region

@Chunhui:
Can you address the above in ServerShutdownHandler?

> Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
> cause deleted region to assign again
> --
>
> Key: HBASE-5165
> URL: https://issues.apache.org/jira/browse/HBASE-5165
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.4
>Reporter: chunhui shen
> Attachments: hbase-5165.patch, hbase-5165v2.patch
>
>
> Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
> cause the following situation:
> 1. The table has already been disabled.
> 2. ServerShutdownHandler is executing MetaReader.getServerUserRegions.
> 3. While step 2 is in progress or has just completed, DeleteTableHandler 
> starts to delete the region (removing it from META and deleting it from the FS).
> 4. DeleteTableHandler sets the table enabled.
> 5. ServerShutdownHandler starts to assign the region, which has already been 
> deleted by DeleteTableHandler.
> The result of the above operations is an invalid record in .META. that 
> can't be fixed by hbck.





[jira] [Issue Comment Edited] (HBASE-5165) Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause deleted region to assign again

2012-01-10 Thread Zhihong Yu (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183167#comment-13183167
 ] 

Zhihong Yu edited comment on HBASE-5165 at 1/10/12 9:39 AM:


As Stack said in HBASE-5155:
bq. now when a table is disabled, we now set a flag for the table in zk rather 
than do it individually on each region

@Chunhui:
Can you address the above in ServerShutdownHandler ?

  was (Author: zhi...@ebaysf.com):
As Stack said:
bq. now when a table is disabled, we now set a flag for the table in zk rather 
than do it individually on each region

@Chunhui:
Can you address the above in ServerShutdownHandler ?
  
> Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
> cause deleted region to assign again
> --
>
> Key: HBASE-5165
> URL: https://issues.apache.org/jira/browse/HBASE-5165
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.4
>Reporter: chunhui shen
> Attachments: hbase-5165.patch, hbase-5165v2.patch
>
>
> Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
> cause the following situation:
> 1. The table has already been disabled.
> 2. ServerShutdownHandler is executing MetaReader.getServerUserRegions.
> 3. While step 2 is in progress or has just completed, DeleteTableHandler 
> starts to delete the region (removing it from META and deleting it from the FS).
> 4. DeleteTableHandler sets the table enabled.
> 5. ServerShutdownHandler starts to assign the region, which has already been 
> deleted by DeleteTableHandler.
> The result of the above operations is an invalid record in .META. that 
> can't be fixed by hbck.





[jira] [Commented] (HBASE-5165) Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause deleted region to assign again

2012-01-10 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183170#comment-13183170
 ] 

Hadoop QA commented on HBASE-5165:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510019/hbase-5165v2.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 javadoc.  The javadoc tool appears to have generated -151 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 79 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.regionserver.wal.TestLogRolling
  org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/714//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/714//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/714//console

This message is automatically generated.

> Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
> cause deleted region to assign again
> --
>
> Key: HBASE-5165
> URL: https://issues.apache.org/jira/browse/HBASE-5165
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.4
>Reporter: chunhui shen
> Attachments: hbase-5165.patch, hbase-5165v2.patch
>
>
> Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
> cause the following situation:
> 1. The table has already been disabled.
> 2. ServerShutdownHandler is executing MetaReader.getServerUserRegions.
> 3. While step 2 is in progress or has just completed, DeleteTableHandler 
> starts to delete the region (removing it from META and deleting it from the FS).
> 4. DeleteTableHandler sets the table enabled.
> 5. ServerShutdownHandler starts to assign the region, which has already been 
> deleted by DeleteTableHandler.
> The result of the above operations is an invalid record in .META. that 
> can't be fixed by hbck.





[jira] [Created] (HBASE-5166) MultiThreaded Table Mapper analogous to MultiThreaded Mapper in hadoop

2012-01-10 Thread Jai Kumar Singh (Created) (JIRA)
MultiThreaded Table Mapper analogous to MultiThreaded Mapper in hadoop
--

 Key: HBASE-5166
 URL: https://issues.apache.org/jira/browse/HBASE-5166
 Project: HBase
  Issue Type: Improvement
Reporter: Jai Kumar Singh
Priority: Minor


There is currently no MultiThreadedTableMapper in HBase, analogous to the 
MultithreadedMapper Hadoop provides for IO-bound jobs.
Use case, web crawler: take input (URLs) from an HBase table and put the 
content (URL, content) back into HBase.
Running this kind of HBase MapReduce job with the normal table mapper is quite 
slow, as the CPU is not fully utilized (the job is network IO bound).

Moreover, I want to know whether it would be a good or bad idea to use HBase 
for this kind of use case.
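The pattern being requested is the one Hadoop's MultithreadedMapper implements: fan the records of a single input split out to a pool of worker threads so that slow network calls overlap. A stripped-down, runnable sketch of that idea in plain Java follows; all names are hypothetical, and the real change would subclass TableMapper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Stripped-down sketch of the MultithreadedMapper pattern: one input feed,
// N worker threads each running the map logic, so an IO-bound job (e.g. a
// crawler fetching URLs) keeps more of the CPU and network busy.
public class MultiThreadedMapSketch {
    interface MapLogic { String map(String row); } // stands in for map(key, value, context)

    static List<String> run(List<String> rows, MapLogic logic, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<String>> futures = new ArrayList<>();
        for (String row : rows) futures.add(pool.submit(() -> logic.map(row)));
        List<String> out = new ArrayList<>();
        for (Future<String> f : futures) out.add(f.get()); // collect in input order
        pool.shutdown();
        return out;
    }

    public static void main(String[] args) throws Exception {
        // each map call would be a slow network fetch in the crawler use case
        List<String> out = run(Arrays.asList("a", "b", "c"), row -> "fetched:" + row, 4);
        System.out.println(out);   // [fetched:a, fetched:b, fetched:c]
    }
}
```

As with Hadoop's MultithreadedMapper, this only helps when the map logic is thread-safe and the bottleneck is IO rather than CPU.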






[jira] [Commented] (HBASE-5165) Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause deleted region to assign again

2012-01-10 Thread chunhui shen (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183181#comment-13183181
 ] 

chunhui shen commented on HBASE-5165:
-

I'm not clear on why we need to set a disabled flag for each region in ZK.
It seems to be no help for this issue (DeleteTableHandler will delete these 
nodes, and ServerShutdownHandler will assign a region if its node does not exist).

We only need to synchronize DeleteTableHandler and ServerShutdownHandler to 
ensure a deleted region is not assigned.
Since regions of a disabled table will not be assigned or fixed up under the 
current strategy, we only need to ensure that no regions are in the assignment 
queue before the table is set enabled.
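The proposal above — don't flip the table enabled while any of its regions still sit in the assignment queue — can be sketched as a simple wait loop. All names here are illustrative, not HBase master code:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the proposed synchronization: before DeleteTableHandler marks the
// table enabled, it waits until none of that table's regions remain in the
// assignment queue, so nothing deleted can still be assigned.
public class AssignQueueGuard {
    // region names keyed as "<table>,<startkey>...", mirroring HBase naming
    static final Set<String> regionsInAssignQueue = ConcurrentHashMap.newKeySet();

    static boolean tableHasRegionsQueued(String table) {
        return regionsInAssignQueue.stream().anyMatch(r -> r.startsWith(table + ","));
    }

    static void enableAfterDelete(String table) throws InterruptedException {
        while (tableHasRegionsQueued(table)) Thread.sleep(10);
        // safe from here on: no deleted region of this table can be assigned
        System.out.println("enabled " + table);
    }

    public static void main(String[] args) throws Exception {
        regionsInAssignQueue.add("test1,089cd0c9");
        Thread drainer = new Thread(() -> {
            try { Thread.sleep(30); } catch (InterruptedException ignored) {}
            regionsInAssignQueue.clear();   // assignment finishes or is cancelled
        });
        drainer.start();
        enableAfterDelete("test1");
        drainer.join();
    }
}
```

This is the per-table variant of the same ordering constraint patch v2 enforces at the handler level.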

> Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
> cause deleted region to assign again
> --
>
> Key: HBASE-5165
> URL: https://issues.apache.org/jira/browse/HBASE-5165
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.4
>Reporter: chunhui shen
> Attachments: hbase-5165.patch, hbase-5165v2.patch
>
>
> Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
> cause the following situation:
> 1. The table has already been disabled.
> 2. ServerShutdownHandler is executing MetaReader.getServerUserRegions.
> 3. While step 2 is in progress or has just completed, DeleteTableHandler 
> starts to delete the region (removing it from META and deleting it from the FS).
> 4. DeleteTableHandler sets the table enabled.
> 5. ServerShutdownHandler starts to assign the region, which has already been 
> deleted by DeleteTableHandler.
> The result of the above operations is an invalid record in .META. that 
> can't be fixed by hbck.





[jira] [Created] (HBASE-5167) We shouldn't be injecting 'Killing [daemon]' into logs, when we aren't doing that.

2012-01-10 Thread Harsh J (Created) (JIRA)
We shouldn't be injecting 'Killing [daemon]' into logs, when we aren't doing 
that.
--

 Key: HBASE-5167
 URL: https://issues.apache.org/jira/browse/HBASE-5167
 Project: HBase
  Issue Type: Improvement
  Components: scripts
Affects Versions: 0.92.0
Reporter: Harsh J
Priority: Trivial


HBASE-4209 changed the behavior of the scripts such that we no longer kill the 
daemons. We should have also changed the message shown in the logs.





[jira] [Updated] (HBASE-5167) We shouldn't be injecting 'Killing [daemon]' into logs, when we aren't doing that.

2012-01-10 Thread Harsh J (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HBASE-5167:
---

Fix Version/s: 0.94.0
   Status: Patch Available  (was: Open)

> We shouldn't be injecting 'Killing [daemon]' into logs, when we aren't doing 
> that.
> --
>
> Key: HBASE-5167
> URL: https://issues.apache.org/jira/browse/HBASE-5167
> Project: HBase
>  Issue Type: Improvement
>  Components: scripts
>Affects Versions: 0.92.0
>Reporter: Harsh J
>Priority: Trivial
> Fix For: 0.94.0
>
> Attachments: HBASE-5167.patch
>
>
> HBASE-4209 changed the behavior of the scripts such that we no longer kill 
> the daemons. We should have also changed the message shown in the logs.





[jira] [Updated] (HBASE-5167) We shouldn't be injecting 'Killing [daemon]' into logs, when we aren't doing that.

2012-01-10 Thread Harsh J (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HBASE-5167:
---

Attachment: HBASE-5167.patch

Let's make it {{s/Killing/Terminating/}}.

> We shouldn't be injecting 'Killing [daemon]' into logs, when we aren't doing 
> that.
> --
>
> Key: HBASE-5167
> URL: https://issues.apache.org/jira/browse/HBASE-5167
> Project: HBase
>  Issue Type: Improvement
>  Components: scripts
>Affects Versions: 0.92.0
>Reporter: Harsh J
>Priority: Trivial
> Fix For: 0.94.0
>
> Attachments: HBASE-5167.patch
>
>
> HBASE-4209 changed the behavior of the scripts such that we no longer kill 
> the daemons. We should have also changed the message shown in the logs.





[jira] [Created] (HBASE-5168) Backport HBASE-5100 - Rollback of split could cause closed region to be opened again

2012-01-10 Thread ramkrishna.s.vasudevan (Created) (JIRA)
Backport HBASE-5100 - Rollback of split could cause closed region to be opened 
again


 Key: HBASE-5168
 URL: https://issues.apache.org/jira/browse/HBASE-5168
 Project: HBase
  Issue Type: Bug
Reporter: ramkrishna.s.vasudevan


Considering the importance of the defect, merging it to 0.90.6.





[jira] [Commented] (HBASE-5167) We shouldn't be injecting 'Killing [daemon]' into logs, when we aren't doing that.

2012-01-10 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183199#comment-13183199
 ] 

Hadoop QA commented on HBASE-5167:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510031/HBASE-5167.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 javadoc.  The javadoc tool appears to have generated -151 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 79 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/715//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/715//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/715//console

This message is automatically generated.

> We shouldn't be injecting 'Killing [daemon]' into logs, when we aren't doing 
> that.
> --
>
> Key: HBASE-5167
> URL: https://issues.apache.org/jira/browse/HBASE-5167
> Project: HBase
>  Issue Type: Improvement
>  Components: scripts
>Affects Versions: 0.92.0
>Reporter: Harsh J
>Priority: Trivial
> Fix For: 0.94.0
>
> Attachments: HBASE-5167.patch
>
>
> HBASE-4209 changed the behavior of the scripts such that we no longer kill 
> the daemons. We should have also changed the message shown in the logs.





[jira] [Updated] (HBASE-5166) MultiThreaded Table Mapper analogous to MultiThreaded Mapper in hadoop

2012-01-10 Thread Jai Kumar Singh (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jai Kumar Singh updated HBASE-5166:
---

Attachment: 0001-Added-MultithreadedTableMapper-HBASE-5166.patch

This is the implementation I am currently using for MultithreadedTableMapper; 
it is a modification of org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper.

> MultiThreaded Table Mapper analogous to MultiThreaded Mapper in hadoop
> --
>
> Key: HBASE-5166
> URL: https://issues.apache.org/jira/browse/HBASE-5166
> Project: HBase
>  Issue Type: Improvement
>Reporter: Jai Kumar Singh
>Priority: Minor
>  Labels: multithreaded, tablemapper
> Attachments: 0001-Added-MultithreadedTableMapper-HBASE-5166.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> There is currently no MultiThreadedTableMapper in HBase, analogous to the 
> MultithreadedMapper Hadoop provides for IO-bound jobs.
> Use case, web crawler: take input (URLs) from an HBase table and put the 
> content (URL, content) back into HBase.
> Running this kind of HBase MapReduce job with the normal table mapper is quite 
> slow, as the CPU is not fully utilized (the job is network IO bound).
> Moreover, I want to know whether it would be a good or bad idea to use HBase 
> for this kind of use case.





[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-10 Thread ramkrishna.s.vasudevan (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-5120:
--

Attachment: HBASE-5120_2.patch

> Timeout monitor races with table disable handler
> 
>
> Key: HBASE-5120
> URL: https://issues.apache.org/jira/browse/HBASE-5120
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.0
>Reporter: Zhihong Yu
>Priority: Blocker
> Fix For: 0.94.0, 0.92.1
>
> Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract my statement that it "used to be extremely racy 
> and caused more troubles than it fixed"; on my first test I got a stuck 
> region in transition instead of being able to recover. The timeout was set to 
> 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
> sv4r5s38,62023,1325635980913 for region 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, 
> region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
> event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so 
> deleting ZK node and removing from regions in transition, skipping assignment 
> of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Deleting existing unassigned node for 
> 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Successfully deleted unassigned node for 
> region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But 
> wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Creating unassigned node for 
> 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Server null returned 
> java.lang.NullPointerException: Passed server is null for 
> 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't 
> even exist anymore. It even tries to shut it down but is blocked by NPEs. Now 
> this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated 
> after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: The znode of region 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been 
> deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG 
> org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region 
> to clear regions in transition; 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG 
> org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region 
> to clear regions in transition; 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59

[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted

2012-01-10 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183205#comment-13183205
 ] 

ramkrishna.s.vasudevan commented on HBASE-5155:
---

@Stack
After analysing the code I found one thing. Avoiding SSH running in parallel 
with DisableTableHandler and DeleteTableHandler may be a bigger discussion, 
but the above problem can be solved. 
In SSH 
{code}
  public static boolean processDeadRegion(HRegionInfo hri, Result result,
      AssignmentManager assignmentManager, CatalogTracker catalogTracker)
      throws IOException {
    // If table is not disabled but the region is offlined,
    boolean disabled = assignmentManager.getZKTable().isDisabledTable(
        hri.getTableDesc().getNameAsString());
{code}
we check whether the table is disabled.  But if you look at the above logs, it is 
the DeleteTableHandler that has already deleted the region and also removed the 
entry from the ZKTable cache.
{code}
am.getZKTable().setEnabledTable(Bytes.toString(tableName));
{code}
Currently setEnabledTable means removing the entry from the map, so we cannot 
differentiate between an enabled table and a deleted table: in both cases the 
entry is removed from the cache map.

So can we use the currently unused TableState.ENABLED in the enable table handler, 
with only the delete table handler removing the entry?
This would ensure that in SSH.processDeadRegion() we can first check whether the 
table is present in the map before proceeding. If it is not present, we know the 
table has already been deleted.  
Pls give your opinion.
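The ambiguity described above can be illustrated with a small self-contained sketch (plain Java; the class and variable names here are hypothetical stand-ins, not the actual ZKTable code): when "enabled" is represented by the absence of an entry, a reader cannot tell an enabled table from a deleted one, whereas an explicit ENABLED entry makes the two distinguishable.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model only; names are hypothetical stand-ins, not HBase code.
public class TableStateModel {
    enum TableState { DISABLED, ENABLED }

    public static void main(String[] args) {
        // Design 1 (current): setEnabledTable removes the entry, and deleting
        // the table removes it too, so absence of an entry is ambiguous.
        Map<String, TableState> cache = new HashMap<>();
        cache.put("enabledTable", TableState.DISABLED); // table disabled
        cache.remove("enabledTable");                   // setEnabledTable
        // "deletedTable" also has no entry: it was deleted earlier.
        boolean ambiguous = !cache.containsKey("enabledTable")
                         && !cache.containsKey("deletedTable");
        System.out.println("design1 ambiguous: " + ambiguous);

        // Design 2 (proposed): the enable handler stores an explicit ENABLED
        // entry and only the delete handler removes it, so absence of an
        // entry can only mean "deleted".
        Map<String, TableState> cache2 = new HashMap<>();
        cache2.put("enabledTable", TableState.ENABLED); // enable handler
        System.out.println("enabled present: " + cache2.containsKey("enabledTable"));
        System.out.println("deleted absent: " + !cache2.containsKey("deletedTable"));
    }
}
```

Under design 2, SSH.processDeadRegion() could treat a missing entry as "table deleted" and skip the assignment.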

> ServerShutDownHandler And Disable/Delete should not happen parallely leading 
> to recreation of regions that were deleted
> ---
>
> Key: HBASE-5155
> URL: https://issues.apache.org/jira/browse/HBASE-5155
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.90.4
>Reporter: ramkrishna.s.vasudevan
>Priority: Blocker
>
> ServerShutDownHandler and the disable/delete table handlers race.  This is not an 
> issue due to TM.
> -> A regionserver goes down.  In our cluster the regionserver holds a lot of 
> regions.
> -> A region R1 has two daughters D1 and D2.
> -> The ServerShutdownHandler gets called and scans the META and gets all the 
> user regions
> -> In parallel, a table is disabled. (No problem in this step).
> -> Delete table is done.
> -> The tables and its regions are deleted including R1, D1 and D2.. (So META 
> is cleaned)
> Now ServerShutdownHandler starts to processDeadRegion:
> {code}
>  if (hri.isOffline() && hri.isSplit()) {
>   LOG.debug("Offlined and split region " + hri.getRegionNameAsString() +
> "; checking daughter presence");
>   fixupDaughters(result, assignmentManager, catalogTracker);
> {code}
> As part of fixupDaughters, since the daughters D1 and D2 are missing for R1, 
> {code}
> if (isDaughterMissing(catalogTracker, daughter)) {
>   LOG.info("Fixup; missing daughter " + daughter.getRegionNameAsString());
>   MetaEditor.addDaughter(catalogTracker, daughter, null);
>   // TODO: Log WARN if the regiondir does not exist in the fs.  If its not
>   // there then something wonky about the split -- things will keep going
>   // but could be missing references to parent region.
>   // And assign it.
>   assignmentManager.assign(daughter, true);
> {code}
> we call assign on the daughters.  
> After this we again run the code below.
> {code}
> if (processDeadRegion(e.getKey(), e.getValue(),
> this.services.getAssignmentManager(),
> this.server.getCatalogTracker())) {
>   this.services.getAssignmentManager().assign(e.getKey(), true);
> {code}
> Now when the SSH scanned the META it had R1, D1 and D2.
> So, as part of the above code, D1 and D2, which were already assigned by 
> fixupDaughters, are assigned again by 
> {code}
> this.services.getAssignmentManager().assign(e.getKey(), true);
> {code}
> This leads to a ZooKeeper bad-version issue that kills the master.
> The important part here is that the regions that were deleted are recreated, 
> which I think is more critical.
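The double assignment described above can be modeled with a tiny simulation (hypothetical names, not HBase code): the SSH snapshot of META still contains R1, D1 and D2, so the daughters get assigned once by fixupDaughters and once more by the main SSH loop.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the race; region names and flow are illustrative only.
public class DoubleAssignSketch {
    static final Map<String, Integer> assignCount = new HashMap<>();

    static void assign(String region) {
        assignCount.merge(region, 1, Integer::sum);
    }

    public static void main(String[] args) {
        // SSH scanned META before the table was deleted: the stale snapshot
        // still contains the split parent R1 and its daughters D1, D2.
        List<String> snapshot = Arrays.asList("R1", "D1", "D2");

        for (String region : snapshot) {
            if (region.equals("R1")) {
                // processDeadRegion -> fixupDaughters: the daughters look
                // missing (the delete already removed them from META), so
                // they are re-added and assigned here.
                assign("D1");
                assign("D2");
            }
            // The main SSH loop then assigns every region from the snapshot.
            assign(region);
        }
        // D1 and D2 end up assigned twice -> ZK bad-version error, and the
        // deleted regions are back in META.
        System.out.println(assignCount.get("D1")); // 2
        System.out.println(assignCount.get("D2")); // 2
    }
}
```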

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-10 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183206#comment-13183206
 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
---

@Ted
I feel the abort is not needed. I cannot see a place where the znode can be in 
another state.  Also, even if the znode is changed in the assign flow, we still 
go ahead with the deletion and will abort the master if the deletion fails.

Just my thought. 

> Timeout monitor races with table disable handler
> 
>
> Key: HBASE-5120
> URL: https://issues.apache.org/jira/browse/HBASE-5120
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.0
>Reporter: Zhihong Yu
>Priority: Blocker
> Fix For: 0.94.0, 0.92.1
>
> Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy 
> and caused more troubles than it fixed", on my first test I got a stuck 
> region in transition instead of being able to recover. The timeout was set to 
> 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
> sv4r5s38,62023,1325635980913 for region 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, 
> region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
> event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so 
> deleting ZK node and removing from regions in transition, skipping assignment 
> of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Deleting existing unassigned node for 
> 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Successfully deleted unassigned node for 
> region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But 
> wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Creating unassigned node for 
> 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Server null returned 
> java.lang.NullPointerException: Passed server is null for 
> 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't 
> even exist anymore. It even tries to shut it down but is blocked by NPEs. Now 
> this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated 
> after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: The znode of region 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been 
> deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG 
> org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region 
> to clear regions in transition; 
> test1,089cd0c9,1325635015491.1a4b111b

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-10 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183214#comment-13183214
 ] 

Hadoop QA commented on HBASE-5120:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510034/HBASE-5120_2.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 javadoc.  The javadoc tool appears to have generated -151 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 80 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/716//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/716//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/716//console

This message is automatically generated.

> Timeout monitor races with table disable handler
> 
>
> Key: HBASE-5120
> URL: https://issues.apache.org/jira/browse/HBASE-5120
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.0
>Reporter: Zhihong Yu
>Priority: Blocker
> Fix For: 0.94.0, 0.92.1
>
> Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy 
> and caused more troubles than it fixed", on my first test I got a stuck 
> region in transition instead of being able to recover. The timeout was set to 
> 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
> sv4r5s38,62023,1325635980913 for region 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, 
> region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
> event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so 
> deleting ZK node and removing from regions in transition, skipping assignment 
> of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Deleting existing unassigned node for 
> 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Successfully deleted unassigned node for 
> region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But 
> wait, remember that li

[jira] [Created] (HBASE-5169) This is a subtask of issue 4120,this patch provides the region server group feature of HBase.

2012-01-10 Thread Liu Jia (Created) (JIRA)
 This is a subtask of issue 4120,this patch provides the region server group 
feature of HBase.
--

 Key: HBASE-5169
 URL: https://issues.apache.org/jira/browse/HBASE-5169
 Project: HBase
  Issue Type: Sub-task
  Components: master
Reporter: Liu Jia
Assignee: Liu Jia


This is a subtask of issue 4120; this patch provides the region server group 
feature of HBase.
With this patch, region servers can be divided into groups. One table can 
belong to one or more groups, while a region server can belong to only one 
group. Workload in different groups will not affect each other. This patch 
provides table-level and group-level load balancing; the default load balancer 
and region assignment will consider the group configuration and assign regions 
to their corresponding groups.

For more information, please check out the documents of issue 4120.

This patch also comes with a web tool providing group management operations 
such as add/delete group, move servers in/out, change a table's group 
attribute, balance groups, and balance tables.






[jira] [Updated] (HBASE-5169) This is a subtask of issue 4120,this patch provides the region server group feature of HBase.

2012-01-10 Thread Liu Jia (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liu Jia updated HBASE-5169:
---

Attachment: GroupOfRegionServer_v1.patch

>  This is a subtask of issue 4120,this patch provides the region server group 
> feature of HBase.
> --
>
> Key: HBASE-5169
> URL: https://issues.apache.org/jira/browse/HBASE-5169
> Project: HBase
>  Issue Type: Sub-task
>  Components: master
>Reporter: Liu Jia
>Assignee: Liu Jia
> Fix For: 0.94.0
>
> Attachments: GroupOfRegionServer_v1.patch
>
>
> This is a subtask of issue 4120; this patch provides the region server group 
> feature of HBase.
> With this patch, region servers can be divided into groups. One table can 
> belong to one or more groups, while a region server can belong to only one 
> group. Workload in different groups will not affect each other. This patch 
> provides table-level and group-level load balancing; the default load 
> balancer and region assignment will consider the group configuration and 
> assign regions to their corresponding groups.
> For more information, please check out the documents of issue 4120.
> This patch also comes with a web tool providing group management operations 
> such as add/delete group, move servers in/out, change a table's group 
> attribute, balance groups, and balance tables.





[jira] [Commented] (HBASE-5153) HConnection re-creation in HTable after HConnection abort

2012-01-10 Thread Jieshan Bean (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183290#comment-13183290
 ] 

Jieshan Bean commented on HBASE-5153:
-

Only doing that in flushCommits may not be enough. I'll go through the code and 
come up with a more thorough approach, and will also provide a patch for TRUNK.

> HConnection re-creation in HTable after HConnection abort
> -
>
> Key: HBASE-5153
> URL: https://issues.apache.org/jira/browse/HBASE-5153
> Project: HBase
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.90.4
>Reporter: Jieshan Bean
>Assignee: Jieshan Bean
> Fix For: 0.90.6
>
> Attachments: HBASE-5153-V2.patch, HBASE-5153.patch
>
>
> HBASE-4893 is related to this issue. From that issue we know that if multiple 
> threads share the same connection, once the connection is aborted in one 
> thread, the other threads will get an 
> "HConnectionManager$HConnectionImplementation@18fb1f7 closed" exception.
> It solves the problem of the stale connection not being removed, but the 
> original HTable instance cannot continue to be used: the connection in HTable 
> should be recreated.
> There are two approaches to solve this:
> 1. In user code, once an IOE is caught, close the connection and re-create 
> the HTable instance. We can use this as a workaround.
> 2. On the HBase client side, catch this exception and re-create the connection.
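Approach 1 above amounts to a catch-and-recreate pattern on the client. A minimal self-contained sketch of that pattern (plain Java; `Table`, `newTable`, and `putWithRecreate` are hypothetical stand-ins, not the HBase client API):

```java
import java.io.IOException;

// Hypothetical stand-ins to illustrate the catch-and-recreate workaround;
// this is not the HBase client API.
public class RecreateOnAbort {
    interface Table {
        void put(String row) throws IOException;
        void close();
    }

    // Returns a table whose first use fails when failFirst is true,
    // simulating an aborted shared connection.
    static Table newTable(boolean failFirst) {
        return new Table() {
            boolean fail = failFirst;
            public void put(String row) throws IOException {
                if (fail) { fail = false; throw new IOException("connection closed"); }
                System.out.println("put ok: " + row);
            }
            public void close() { /* release the underlying connection */ }
        };
    }

    // Approach 1: on IOException, close and re-create the table, then retry.
    static void putWithRecreate(String row) throws IOException {
        Table t = newTable(true);
        try {
            t.put(row);
        } catch (IOException e) {
            t.close();               // drop the aborted connection
            t = newTable(false);     // re-create the table instance
            t.put(row);              // retry with the fresh connection
        } finally {
            t.close();
        }
    }

    public static void main(String[] args) throws IOException {
        putWithRecreate("row1");
    }
}
```

Approach 2 would move the same catch-and-recreate logic inside the client library, so callers never see the stale connection.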





[jira] [Updated] (HBASE-5169) This is a subtask of issue 4120,this patch provides the region server group feature of HBase.

2012-01-10 Thread Liu Jia (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liu Jia updated HBASE-5169:
---

Status: Patch Available  (was: Open)

The region server group patch, with test cases, but not containing the web pages.

>  This is a subtask of issue 4120,this patch provides the region server group 
> feature of HBase.
> --
>
> Key: HBASE-5169
> URL: https://issues.apache.org/jira/browse/HBASE-5169
> Project: HBase
>  Issue Type: Sub-task
>  Components: master
>Reporter: Liu Jia
>Assignee: Liu Jia
> Fix For: 0.94.0
>
> Attachments: GroupOfRegionServer_v1.patch
>
>
> This is a subtask of issue 4120; this patch provides the region server group 
> feature of HBase.
> With this patch, region servers can be divided into groups. One table can 
> belong to one or more groups, while a region server can belong to only one 
> group. Workload in different groups will not affect each other. This patch 
> provides table-level and group-level load balancing; the default load 
> balancer and region assignment will consider the group configuration and 
> assign regions to their corresponding groups.
> For more information, please check out the documents of issue 4120.
> This patch also comes with a web tool providing group management operations 
> such as add/delete group, move servers in/out, change a table's group 
> attribute, balance groups, and balance tables.





[jira] [Created] (HBASE-5170) The web tools of region server groups

2012-01-10 Thread Liu Jia (Created) (JIRA)
The web tools of region server groups
-

 Key: HBASE-5170
 URL: https://issues.apache.org/jira/browse/HBASE-5170
 Project: HBase
  Issue Type: Sub-task
Reporter: Liu Jia


The web pages allow users to perform group management operations, including 
add/delete group, move servers in/out, change a table's group attribute, 
balance groups, and balance tables.






[jira] [Updated] (HBASE-5170) The web tools of region server groups

2012-01-10 Thread Liu Jia (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liu Jia updated HBASE-5170:
---

Attachment: GroupOfRegionServerWebTool.patch

> The web tools of region server groups
> -
>
> Key: HBASE-5170
> URL: https://issues.apache.org/jira/browse/HBASE-5170
> Project: HBase
>  Issue Type: Sub-task
>  Components: master, regionserver
>Reporter: Liu Jia
> Fix For: 0.94.0
>
> Attachments: GroupOfRegionServerWebTool.patch
>
>
> The web pages allow users to perform group management operations, including 
> add/delete group, move servers in/out, change a table's group attribute, 
> balance groups, and balance tables.





[jira] [Commented] (HBASE-5165) Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause deleted region to assign again

2012-01-10 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183308#comment-13183308
 ] 

ramkrishna.s.vasudevan commented on HBASE-5165:
---

@Chunhui
I just now saw your issue.  It is the same as HBASE-5155.  Can you see my comment 
in HBASE-5155?  I propose to use TableState.ENABLED.  Currently I am working 
on a patch for it.  Will upload it by tomorrow. 

> Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
> cause deleted region to assign again
> --
>
> Key: HBASE-5165
> URL: https://issues.apache.org/jira/browse/HBASE-5165
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.4
>Reporter: chunhui shen
> Attachments: hbase-5165.patch, hbase-5165v2.patch
>
>
> Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
> cause the following situation:
> 1. Table has already been disabled.
> 2. ServerShutdownHandler is doing MetaReader.getServerUserRegions.
> 3. While step 2 is in progress, or just after it completes, DeleteTableHandler 
> starts to delete the region (removes the region from META and deletes it from the FS).
> 4. DeleteTableHandler sets the table enabled.
> 5. ServerShutdownHandler starts to assign the region which was already deleted 
> by DeleteTableHandler.
> The result of the above operations is an invalid record in .META. that 
> cannot be fixed by hbck. 





[jira] [Updated] (HBASE-5169) Group of Region Server, a subtask of issue 4120

2012-01-10 Thread Liu Jia (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liu Jia updated HBASE-5169:
---

Summary:  Group of Region Server,  a subtask of issue 4120  (was:  This is 
a subtask of issue 4120,this patch provides the region server group feature of 
HBase.)

>  Group of Region Server,  a subtask of issue 4120
> -
>
> Key: HBASE-5169
> URL: https://issues.apache.org/jira/browse/HBASE-5169
> Project: HBase
>  Issue Type: Sub-task
>  Components: master
>Reporter: Liu Jia
>Assignee: Liu Jia
> Fix For: 0.94.0
>
> Attachments: GroupOfRegionServer_v1.patch
>
>
> This is a subtask of issue 4120; this patch provides the region server group 
> feature of HBase.
> With this patch, region servers can be divided into groups. One table can 
> belong to one or more groups, while a region server can belong to only one 
> group. Workload in different groups will not affect each other. This patch 
> provides table-level and group-level load balancing; the default load 
> balancer and region assignment will consider the group configuration and 
> assign regions to their corresponding groups.
> For more information, please check out the documents of issue 4120.
> This patch also comes with a web tool providing group management operations 
> such as add/delete group, move servers in/out, change a table's group 
> attribute, balance groups, and balance tables.





[jira] [Issue Comment Edited] (HBASE-5165) Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause deleted region to assign again

2012-01-10 Thread ramkrishna.s.vasudevan (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183308#comment-13183308
 ] 

ramkrishna.s.vasudevan edited comment on HBASE-5165 at 1/10/12 3:04 PM:


@Chunhui
I just now saw your issue.  It is the same as HBASE-5155.  Can you see my comment 
in HBASE-5155?  I propose to use TableState.ENABLED.  Currently I am 
working on a patch for it.  Will upload it by tomorrow. 

  was (Author: ram_krish):
@Chunhui
I just now saw your issue.  It is same as HBASE-5155.  Can you see my comment 
in HBSE-5155.  I propose to use the TableState.ENABLED.  Currently i am working 
on a patch for it.  Will upload it by tomorrow. 
  
> Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
> cause deleted region to assign again
> --
>
> Key: HBASE-5165
> URL: https://issues.apache.org/jira/browse/HBASE-5165
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.4
>Reporter: chunhui shen
> Attachments: hbase-5165.patch, hbase-5165v2.patch
>
>
> Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
> cause the following situation:
> 1. Table has already been disabled.
> 2. ServerShutdownHandler is doing MetaReader.getServerUserRegions.
> 3. While step 2 is in progress, or just after it completes, DeleteTableHandler 
> starts to delete the region (removes the region from META and deletes it from the FS).
> 4. DeleteTableHandler sets the table enabled.
> 5. ServerShutdownHandler starts to assign the region which was already deleted 
> by DeleteTableHandler.
> The result of the above operations is an invalid record in .META. that 
> cannot be fixed by hbck. 





[jira] [Commented] (HBASE-5169) Group of Region Server, a subtask of issue 4120

2012-01-10 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183313#comment-13183313
 ] 

Hadoop QA commented on HBASE-5169:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12510045/GroupOfRegionServer_v1.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -131 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 92 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.client.TestFromClientSide
  org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/717//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/717//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/717//console

This message is automatically generated.

>  Group of Region Server,  a subtask of issue 4120
> -
>
> Key: HBASE-5169
> URL: https://issues.apache.org/jira/browse/HBASE-5169
> Project: HBase
>  Issue Type: Sub-task
>  Components: master
>Reporter: Liu Jia
>Assignee: Liu Jia
> Fix For: 0.94.0
>
> Attachments: GroupOfRegionServer_v1.patch
>
>
> This is a subtask of issue 4120; this patch provides the region server group 
> feature for HBase.
> With this patch, region servers can be divided into groups. A table can 
> belong to one or more groups, while a region server can belong to only one 
> group, so workload in different groups will not affect each other. The 
> patch provides table-level and group-level load balancing; the default load 
> balancer and region assignment take the group configuration into account 
> and assign regions to their corresponding groups.
> For more information, please check the documents of issue 4120.
> The patch also includes a web tool for group management operations such as 
> adding/deleting groups, moving servers in/out, changing a table's group 
> attribute, balancing groups, and balancing tables.
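The grouping idea described above can be sketched in a few lines (this is a hypothetical illustration, not the patch's actual API; the table and server names are made up): a table maps to a group, each server belongs to exactly one group, and assignment only considers servers in the table's group.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of group-aware assignment: regions of a table may
// only land on servers that belong to the table's group.
public class GroupSketch {
    // each region server belongs to exactly one group
    static final Map<String, String> serverGroup =
        Map.of("rs1", "online", "rs2", "online", "rs3", "batch");
    // each table is associated with a group
    static final Map<String, String> tableGroup =
        Map.of("userTable", "online", "mrTable", "batch");

    // candidate servers for a table's regions, restricted to its group
    static List<String> candidates(String table) {
        String group = tableGroup.get(table);
        return serverGroup.entrySet().stream()
            .filter(e -> e.getValue().equals(group))
            .map(Map.Entry::getKey)
            .sorted()
            .toList();
    }

    public static void main(String[] args) {
        System.out.println(candidates("mrTable")); // prints "[rs3]"
    }
}
```

Because the candidate set is filtered before balancing, heavy MapReduce load on the "batch" group cannot spill onto "online" servers.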





[jira] [Updated] (HBASE-5157) Backport HBASE-4880- Region is on service before openRegionHandler completes, may cause data loss

2012-01-10 Thread ramkrishna.s.vasudevan (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-5157:
--

Attachment: HBASE-4880_branch90_1.patch

> Backport HBASE-4880- Region is on service before openRegionHandler completes, 
> may cause data loss
> -
>
> Key: HBASE-5157
> URL: https://issues.apache.org/jira/browse/HBASE-5157
> Project: HBase
>  Issue Type: Bug
>Reporter: ramkrishna.s.vasudevan
> Attachments: HBASE-4880_branch90_1.patch
>
>
> Backporting to 0.90.6 considering the importance of the issue.





[jira] [Updated] (HBASE-5158) Backport HBASE-4878 - Master crash when splitting hlog may cause data loss

2012-01-10 Thread ramkrishna.s.vasudevan (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-5158:
--

Attachment: HBASE-4878_branch90_1.patch

> Backport HBASE-4878 - Master crash when splitting hlog may cause data loss
> --
>
> Key: HBASE-5158
> URL: https://issues.apache.org/jira/browse/HBASE-5158
> Project: HBase
>  Issue Type: Bug
>Reporter: ramkrishna.s.vasudevan
> Attachments: HBASE-4878_branch90_1.patch
>
>
> Backporting to 0.90.6 considering the importance of the issue.





[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-10 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183319#comment-13183319
 ] 

Zhihong Yu commented on HBASE-5120:
---

I think deleteClosingOrClosedNode() should catch KeeperException and abort, 
considering that this code appears twice in the patch.

deleteClosingOrClosedNode() does not check the return value of the second 
call to ZKAssign.deleteNode(); at the least we should log it.
There might be more corner cases that we discover in the future.

Good job.
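The two suggestions above can be sketched as follows (a stand-in model, not the real patch: KeeperException and deleteNode are simplified fakes here so the shape of the handling is visible): catch KeeperException rather than letting it escape, and log the outcome of the second delete attempt.

```java
// Illustrative sketch of the review suggestion; the real ZKAssign talks to
// ZooKeeper, this fake just returns whether the delete matched.
public class DeleteNodeSketch {
    static class KeeperException extends Exception {}

    // stand-in for ZKAssign.deleteNode: true if the znode was in the
    // expected state and was deleted
    static boolean deleteNode(String region, String expectedState) throws KeeperException {
        return "RS_ZK_REGION_CLOSED".equals(expectedState);
    }

    static boolean deleteClosingOrClosedNode(String region) {
        try {
            // first try the CLOSING state
            if (deleteNode(region, "RS_ZK_REGION_CLOSING")) {
                return true;
            }
            // second attempt: log the result instead of ignoring it
            boolean deleted = deleteNode(region, "RS_ZK_REGION_CLOSED");
            if (!deleted) {
                System.out.println("Znode for " + region + " was not deleted in CLOSED state");
            }
            return deleted;
        } catch (KeeperException e) {
            // per the review, the real handler should abort the master here
            System.out.println("Aborting on KeeperException: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(deleteClosingOrClosedNode("1a4b111b")); // prints "true"
    }
}
```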

> Timeout monitor races with table disable handler
> 
>
> Key: HBASE-5120
> URL: https://issues.apache.org/jira/browse/HBASE-5120
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.0
>Reporter: Zhihong Yu
>Priority: Blocker
> Fix For: 0.94.0, 0.92.1
>
> Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy 
> and caused more troubles than it fixed", on my first test I got a stuck 
> region in transition instead of being able to recover. The timeout was set to 
> 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
> sv4r5s38,62023,1325635980913 for region 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, 
> region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
> event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so 
> deleting ZK node and removing from regions in transition, skipping assignment 
> of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Deleting existing unassigned node for 
> 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Successfully deleted unassigned node for 
> region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But 
> wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Creating unassigned node for 
> 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Server null returned 
> java.lang.NullPointerException: Passed server is null for 
> 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't 
> even exist anymore. It even tries to shut it down but is blocked by NPEs. Now 
> this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated 
> after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: The znode of region 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been 
> deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG 
> org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting 
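The core of the race in the log above is that the timeout monitor forces an unassign after the CLOSED event has already removed the region from transition. One fix idea, sketched here as a stand-alone model (not the committed patch; the map stands in for the regions-in-transition table), is to make the forced unassign conditional on the region still being in PENDING_CLOSE:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the timeout monitor only acts if the region is
// still PENDING_CLOSE, so a CLOSED event processed in between wins.
public class TimeoutRace {
    static final ConcurrentHashMap<String, String> regionsInTransition =
        new ConcurrentHashMap<>();

    // forced unassign is a no-op unless the region is still PENDING_CLOSE;
    // replace(key, old, new) makes the check-and-transition atomic
    static boolean forceUnassign(String region) {
        return regionsInTransition.replace(region, "PENDING_CLOSE", "CLOSING");
    }

    public static void main(String[] args) {
        regionsInTransition.put("1a4b111b", "PENDING_CLOSE");
        // ClosedRegionHandler processes the CLOSED event first:
        regionsInTransition.remove("1a4b111b");
        // ...so the timeout monitor's forced unassign must do nothing.
        System.out.println(forceUnassign("1a4b111b")); // prints "false"
    }
}
```

Without the atomic check, the monitor recreates the CLOSING znode for a region that no longer exists, which is exactly the stuck state the log shows.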

[jira] [Updated] (HBASE-5159) Backport HBASE-4079 - HTableUtil - helper class for loading data

2012-01-10 Thread ramkrishna.s.vasudevan (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-5159:
--

Attachment: HBASE-4079.patch

> Backport HBASE-4079 - HTableUtil - helper class for loading data 
> -
>
> Key: HBASE-5159
> URL: https://issues.apache.org/jira/browse/HBASE-5159
> Project: HBase
>  Issue Type: Bug
>Reporter: ramkrishna.s.vasudevan
> Attachments: HBASE-4079.patch
>
>
> Backporting to 0.90.6 considering the usefulness of the feature.





[jira] [Updated] (HBASE-5156) Backport HBASE-4899 - Region would be assigned twice easily with continually killing server and moving region in testing environment

2012-01-10 Thread ramkrishna.s.vasudevan (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-5156:
--

Attachment: HBASE-4899_Branch90_1.patch

> Backport HBASE-4899 -  Region would be assigned twice easily with continually 
> killing server and moving region in testing environment
> -
>
> Key: HBASE-5156
> URL: https://issues.apache.org/jira/browse/HBASE-5156
> Project: HBase
>  Issue Type: Bug
>Reporter: ramkrishna.s.vasudevan
> Attachments: HBASE-4899_Branch90_1.patch
>
>
> Need to backport to 0.90.6 considering the criticality of the issue





[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-10 Thread ramkrishna.s.vasudevan (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-5120:
--

Attachment: HBASE-5120_3.patch

> Timeout monitor races with table disable handler
> 
>
> Key: HBASE-5120
> URL: https://issues.apache.org/jira/browse/HBASE-5120
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.0
>Reporter: Zhihong Yu
>Priority: Blocker
> Fix For: 0.94.0, 0.92.1
>
> Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
> HBASE-5120_2.patch, HBASE-5120_3.patch
>
>

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-10 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183337#comment-13183337
 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
---

@Ted
Thanks for your review. I have updated the patch.

> Timeout monitor races with table disable handler
> 
>
> Key: HBASE-5120
> URL: https://issues.apache.org/jira/browse/HBASE-5120
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.0
>Reporter: Zhihong Yu
>Priority: Blocker
> Fix For: 0.94.0, 0.92.1
>
> Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
> HBASE-5120_2.patch, HBASE-5120_3.patch
>
>

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-10 Thread ramkrishna.s.vasudevan (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-5120:
--

Status: Open  (was: Patch Available)

> Timeout monitor races with table disable handler
> 
>
> Key: HBASE-5120
> URL: https://issues.apache.org/jira/browse/HBASE-5120
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.0
>Reporter: Zhihong Yu
>Priority: Blocker
> Fix For: 0.94.0, 0.92.1
>
> Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
> HBASE-5120_2.patch, HBASE-5120_3.patch
>
>

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-10 Thread ramkrishna.s.vasudevan (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-5120:
--

Status: Patch Available  (was: Open)

> Timeout monitor races with table disable handler
> 
>
> Key: HBASE-5120
> URL: https://issues.apache.org/jira/browse/HBASE-5120
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.0
>Reporter: Zhihong Yu
>Priority: Blocker
> Fix For: 0.94.0, 0.92.1
>
> Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
> HBASE-5120_2.patch, HBASE-5120_3.patch
>
>

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-10 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183355#comment-13183355
 ] 

Hadoop QA commented on HBASE-5120:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510054/HBASE-5120_3.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 javadoc.  The javadoc tool appears to have generated -151 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 80 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.regionserver.wal.TestLogRolling
  org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/718//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/718//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/718//console

This message is automatically generated.

> Timeout monitor races with table disable handler
> 
>
> Key: HBASE-5120
> URL: https://issues.apache.org/jira/browse/HBASE-5120
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.0
>Reporter: Zhihong Yu
>Priority: Blocker
> Fix For: 0.94.0, 0.92.1
>
> Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
> HBASE-5120_2.patch, HBASE-5120_3.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy 
> and caused more troubles than it fixed", on my first test I got a stuck 
> region in transition instead of being able to recover. The timeout was set to 
> 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
> sv4r5s38,62023,1325635980913 for region 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, 
> region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
> event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so 
> deleting ZK node and removing from regions in transition, skipping assignment 
> of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Deleting existing unassigned node for 
> 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Successfully deleted unassigned node for 
> region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At

[jira] [Created] (HBASE-5171) Potential NullPointerException while obtaining row lock

2012-01-10 Thread Yves Langisch (Created) (JIRA)
Potential NullPointerException while obtaining row lock 


 Key: HBASE-5171
 URL: https://issues.apache.org/jira/browse/HBASE-5171
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.90.5
Reporter: Yves Langisch


We have a table that is concurrently accessed (read/write) from many threads, 
and we make use of row locks. Under heavy load we regularly get an NPE while 
obtaining row locks. An example stack trace looks as follows:

java.lang.NullPointerException
  at 
org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:986)
  at 
org.apache.hadoop.hbase.regionserver.HRegionServer.lockRow(HRegionServer.java:2008)
  at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
  at 
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
Caused by: java.lang.NullPointerException
  at java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:881)
  at 
org.apache.hadoop.hbase.regionserver.HRegionServer.addRowLock(HRegionServer.java:2018)
  at 
org.apache.hadoop.hbase.regionserver.HRegionServer.lockRow(HRegionServer.java:2004)
  ... 5 more 

After checking the source code I've noticed that the value that is going to be 
put into the map can be null in the case where the waitForLock flag is true 
or the rowLockWaitDuration has expired (HRegion#internalObtainRowLock, line 
2111ff). The latter, I think, is what happens in our case, as we have heavy 
load hitting the server.

IMHO this case should be handled properly and must not lead to an NPE.

-
Yves
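The failure mode above can be reproduced outside HBase: java.util.concurrent.ConcurrentHashMap rejects null values in put(), so storing a null lock id in the rowlocks map produces exactly the NPE in the trace. A minimal self-contained sketch (RowLockNpeDemo and addRowLockSafely are illustrative names, not HBase code):

```java
import java.util.concurrent.ConcurrentHashMap;

public class RowLockNpeDemo {
    // Simplified stand-in for HRegionServer's rowlocks map.
    static final ConcurrentHashMap<String, Long> rowLocks = new ConcurrentHashMap<>();

    // Guarded variant: refuse a null lock id (i.e. the lock could not be
    // obtained in time) instead of letting ConcurrentHashMap.put throw.
    static boolean addRowLockSafely(Long lockId, String row) {
        if (lockId == null) {
            return false; // lock wait expired; surface a clean failure
        }
        rowLocks.put(row, lockId);
        return true;
    }

    public static void main(String[] args) {
        // ConcurrentHashMap rejects null values outright:
        boolean npe = false;
        try {
            rowLocks.put("row1", null);
        } catch (NullPointerException e) {
            npe = true;
        }
        System.out.println("put(null) throws NPE: " + npe);   // true
        System.out.println("guarded add: " + addRowLockSafely(null, "row1")); // false
    }
}
```

With the guard in place the caller can report a timeout instead of crashing the RPC handler.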

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-10 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183380#comment-13183380
 ] 

Zhihong Yu commented on HBASE-5120:
---

{code}
+LOG.debug("The deletion of the CLOSED node returned " + deleteNode);
{code}
I think the above should be at ERROR level. Also, please make the sentence 
syntactically correct.

> Timeout monitor races with table disable handler
> 
>
> Key: HBASE-5120
> URL: https://issues.apache.org/jira/browse/HBASE-5120
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.0
>Reporter: Zhihong Yu
>Priority: Blocker
> Fix For: 0.94.0, 0.92.1
>
> Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
> HBASE-5120_2.patch, HBASE-5120_3.patch
>
>

[jira] [Issue Comment Edited] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-10 Thread Zhihong Yu (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183380#comment-13183380
 ] 

Zhihong Yu edited comment on HBASE-5120 at 1/10/12 5:07 PM:


{code}
+LOG.debug("The deletion of the CLOSED node returned " + deleteNode);
{code}
I think the above should be at ERROR level. Also, please log the region name.

  was (Author: zhi...@ebaysf.com):
{code}
+LOG.debug("The deletion of the CLOSED node returned " + deleteNode);
{code}
I think the above should be at ERROR level. Also, please make the sentence 
syntactically correct.
  
> Timeout monitor races with table disable handler
> 
>
> Key: HBASE-5120
> URL: https://issues.apache.org/jira/browse/HBASE-5120
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.0
>Reporter: Zhihong Yu
>Priority: Blocker
> Fix For: 0.94.0, 0.92.1
>
> Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
> HBASE-5120_2.patch, HBASE-5120_3.patch
>
>

[jira] [Updated] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

2012-01-10 Thread ramkrishna.s.vasudevan (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-5137:
--

Fix Version/s: 0.90.6
   0.92.1

Committed to 0.90 and trunk. Do we need to commit to 0.92 also?

> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws 
> IOException
> 
>
> Key: HBASE-5137
> URL: https://issues.apache.org/jira/browse/HBASE-5137
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.4
>Reporter: ramkrishna.s.vasudevan
>Assignee: ramkrishna.s.vasudevan
> Fix For: 0.92.1, 0.90.6
>
> Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch
>
>
> I am not sure if this bug was already raised in JIRA.
> In our test cluster we had a scenario where the RS had gone down and 
> ServerShutDownHandler started with splitLog.
> But as HDFS was down, the waitOnSafeMode check threw an IOException.
> {code}
> try {
> // If FS is in safe mode, just wait till out of it.
> FSUtils.waitOnSafeMode(conf,
>   conf.getInt(HConstants.THREAD_WAKE_FREQUENCY, 1000));  
> splitter.splitLog();
>   } catch (OrphanHLogAfterSplitException e) {
> {code}
> We catch the exception
> {code}
> } catch (IOException e) {
>   checkFileSystem();
>   LOG.error("Failed splitting " + logDir.toString(), e);
> }
> {code}
> So the HLog split itself did not happen. We found that about 4 regions that 
> had recently been split on the crashed RS were lost.
> Can we abort the Master in such scenarios? Please suggest.
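One hedged way to realize the abort the issue asks for, sketched with stand-in interfaces rather than the real MasterFileSystem/FSUtils types (LogSplitter, Aborter, and splitLogOrAbort are hypothetical names): escalate the IOException to the master's abort hook instead of only logging it, so log edits are not silently dropped.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

public class SplitLogAbortDemo {
    interface LogSplitter { void splitLog() throws IOException; }
    interface Aborter { void abort(String why, Throwable cause); }

    // Instead of the current catch block that just calls checkFileSystem()
    // and logs, abort the master when log splitting fails.
    static boolean splitLogOrAbort(LogSplitter splitter, Aborter master) {
        try {
            splitter.splitLog(); // stands in for waitOnSafeMode + splitLog
            return true;
        } catch (IOException e) {
            master.abort("Failed splitting logs; aborting to avoid data loss", e);
            return false;
        }
    }

    public static void main(String[] args) {
        AtomicBoolean aborted = new AtomicBoolean(false);
        Aborter master = (why, cause) -> aborted.set(true);
        splitLogOrAbort(() -> { throw new IOException("HDFS in safe mode"); }, master);
        System.out.println("master aborted: " + aborted.get()); // true
    }
}
```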





[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

2012-01-10 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183413#comment-13183413
 ] 

stack commented on HBASE-5137:
--

@Ram Please commit to 0.92 branch also.

> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws 
> IOException
> 
>
> Key: HBASE-5137
> URL: https://issues.apache.org/jira/browse/HBASE-5137
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.4
>Reporter: ramkrishna.s.vasudevan
>Assignee: ramkrishna.s.vasudevan
> Fix For: 0.92.1, 0.90.6
>
> Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch
>
>





[jira] [Commented] (HBASE-5156) Backport HBASE-4899 - Region would be assigned twice easily with continually killing server and moving region in testing environment

2012-01-10 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183416#comment-13183416
 ] 

stack commented on HBASE-5156:
--

+1

> Backport HBASE-4899 -  Region would be assigned twice easily with continually 
> killing server and moving region in testing environment
> -
>
> Key: HBASE-5156
> URL: https://issues.apache.org/jira/browse/HBASE-5156
> Project: HBase
>  Issue Type: Bug
>Reporter: ramkrishna.s.vasudevan
> Attachments: HBASE-4899_Branch90_1.patch
>
>
> Need to backport to 0.90.6 considering the criticality of the issue





[jira] [Commented] (HBASE-5159) Backport HBASE-4079 - HTableUtil - helper class for loading data

2012-01-10 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183417#comment-13183417
 ] 

stack commented on HBASE-5159:
--

+1

> Backport HBASE-4079 - HTableUtil - helper class for loading data 
> -
>
> Key: HBASE-5159
> URL: https://issues.apache.org/jira/browse/HBASE-5159
> Project: HBase
>  Issue Type: Bug
>Reporter: ramkrishna.s.vasudevan
> Attachments: HBASE-4079.patch
>
>
> Backporting to 0.90.6 considering the usefulness of the feature.





[jira] [Commented] (HBASE-5158) Backport HBASE-4878 - Master crash when splitting hlog may cause data loss

2012-01-10 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183420#comment-13183420
 ] 

stack commented on HBASE-5158:
--

+1

> Backport HBASE-4878 - Master crash when splitting hlog may cause data loss
> --
>
> Key: HBASE-5158
> URL: https://issues.apache.org/jira/browse/HBASE-5158
> Project: HBase
>  Issue Type: Bug
>Reporter: ramkrishna.s.vasudevan
> Attachments: HBASE-4878_branch90_1.patch
>
>
> Backporting to 0.90.6 considering the importance of the issue.





[jira] [Commented] (HBASE-5157) Backport HBASE-4880- Region is on service before openRegionHandler completes, may cause data loss

2012-01-10 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183421#comment-13183421
 ] 

stack commented on HBASE-5157:
--

+1

> Backport HBASE-4880- Region is on service before openRegionHandler completes, 
> may cause data loss
> -
>
> Key: HBASE-5157
> URL: https://issues.apache.org/jira/browse/HBASE-5157
> Project: HBase
>  Issue Type: Bug
>Reporter: ramkrishna.s.vasudevan
> Attachments: HBASE-4880_branch90_1.patch
>
>
> Backporting to 0.90.6 considering the importance of the issue.





[jira] [Updated] (HBASE-5134) Remove getRegionServerWithoutRetries and getRegionServerWithRetries from HConnection Interface

2012-01-10 Thread stack (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-5134:
-

Status: Patch Available  (was: Open)

> Remove getRegionServerWithoutRetries and getRegionServerWithRetries from 
> HConnection Interface
> --
>
> Key: HBASE-5134
> URL: https://issues.apache.org/jira/browse/HBASE-5134
> Project: HBase
>  Issue Type: Improvement
>Reporter: stack
>Assignee: stack
> Fix For: 0.94.0
>
> Attachments: 5134-v2.txt, 5134-v3.txt, 5134-v4.txt, 5134-v5.txt, 
> 5134-v6.txt, 5134-v6.txt
>
>
> It's broken having these meta methods in HConnection. They take 
> ServerCallables, which themselves inevitably hold HConnections. That makes 
> for a tangle in the model and frustrates mocked implementations of 
> HConnection. These methods belong somewhere like HConnectionManager, or 
> elsewhere altogether.
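The mocking pain can be illustrated with stand-in types (Connection, ConnectionUtil, and withRetries are hypothetical names, not the real HBase API): once the retry plumbing lives in a helper class rather than on the interface, a test double for the connection is a one-line lambda.

```java
import java.util.concurrent.Callable;

public class HConnectionMockDemo {
    // Minimal mockable surface: no retry methods on the interface itself.
    interface Connection { String locateRegion(byte[] row); }

    static class ConnectionUtil {
        // Retry logic moved out of the interface, in the spirit of
        // getRegionServerWithRetries being moved off HConnection.
        static <T> T withRetries(Callable<T> callable, int attempts) throws Exception {
            Exception last = null;
            for (int i = 0; i < attempts; i++) {
                try { return callable.call(); } catch (Exception e) { last = e; }
            }
            throw last;
        }
    }

    public static void main(String[] args) throws Exception {
        // A mocked connection no longer needs to stub retry machinery.
        Connection mock = row -> "server-1";
        String result = ConnectionUtil.withRetries(() -> mock.locateRegion(new byte[0]), 3);
        System.out.println(result); // server-1
    }
}
```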





[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

2012-01-10 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183424#comment-13183424
 ] 

ramkrishna.s.vasudevan commented on HBASE-5137:
---

Committed to 0.92 also.

> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws 
> IOException
> 
>
> Key: HBASE-5137
> URL: https://issues.apache.org/jira/browse/HBASE-5137
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.4
>Reporter: ramkrishna.s.vasudevan
>Assignee: ramkrishna.s.vasudevan
> Fix For: 0.92.0, 0.90.6
>
> Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch
>
>





[jira] [Resolved] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

2012-01-10 Thread ramkrishna.s.vasudevan (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan resolved HBASE-5137.
---

   Resolution: Fixed
Fix Version/s: (was: 0.92.1)
   0.92.0

> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws 
> IOException
> 
>
> Key: HBASE-5137
> URL: https://issues.apache.org/jira/browse/HBASE-5137
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.4
>Reporter: ramkrishna.s.vasudevan
>Assignee: ramkrishna.s.vasudevan
> Fix For: 0.92.0, 0.90.6
>
> Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch
>
>





[jira] [Issue Comment Edited] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

2012-01-10 Thread ramkrishna.s.vasudevan (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183424#comment-13183424
 ] 

ramkrishna.s.vasudevan edited comment on HBASE-5137 at 1/10/12 5:54 PM:


Committed to 0.92 also.
Thanks for the review Stack.
Thanks to Ted for the patch and review.

  was (Author: ram_krish):
Committed to 0.92 also.
  
> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws 
> IOException
> 
>
> Key: HBASE-5137
> URL: https://issues.apache.org/jira/browse/HBASE-5137
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.4
>Reporter: ramkrishna.s.vasudevan
>Assignee: ramkrishna.s.vasudevan
> Fix For: 0.92.0, 0.90.6
>
> Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch
>
>





[jira] [Commented] (HBASE-5171) Potential NullPointerException while obtaining row lock

2012-01-10 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183425#comment-13183425
 ] 

Zhihong Yu commented on HBASE-5171:
---

@Yves:
Do you want to provide a patch?

I think one approach is to create an IOException subclass, similar to 
UnknownRowLockException, for the scenario where wait duration expires.

> Potential NullPointerException while obtaining row lock 
> 
>
> Key: HBASE-5171
> URL: https://issues.apache.org/jira/browse/HBASE-5171
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.5
>Reporter: Yves Langisch
>
> We have a table which is concurrently accessed (read/write) from many threads 
> and we make use of row locks. Under heavy load we regularly get NPEs while 
> obtaining row locks. An example stack trace looks as follows:
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:986)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.lockRow(HRegionServer.java:2008)
>   at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> Caused by: java.lang.NullPointerException
>   at 
> java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:881)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.addRowLock(HRegionServer.java:2018)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.lockRow(HRegionServer.java:2004)
>   ... 5 more 
> After checking the source code I've noticed that the value which is going to 
> be put into the HashMap can be null in the case where the waitForLock flag is 
> true or the rowLockWaitDuration has expired (HRegion#internalObtainRowLock, 
> line 2111ff). The latter, I think, happens in our case as we have heavy load 
> hitting the server.
> IMHO this case should be handled somehow and must not lead to an NPE.
> -
> Yves

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5171) Potential NullPointerException while obtaining row lock

2012-01-10 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183429#comment-13183429
 ] 

Zhihong Yu commented on HBASE-5171:
---

One workaround is to increase the value for "hbase.rowlock.wait.duration".
But a new exception should be introduced anyway.

> Potential NullPointerException while obtaining row lock 
> 
>
> Key: HBASE-5171
> URL: https://issues.apache.org/jira/browse/HBASE-5171
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.5
>Reporter: Yves Langisch
>
> We have a table which is concurrently accessed (read/write) from many threads 
> and we make use of row locks. Under heavy load we regularly get NPEs while 
> obtaining row locks. An example stack trace looks as follows:
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:986)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.lockRow(HRegionServer.java:2008)
>   at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>   at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> Caused by: java.lang.NullPointerException
>   at 
> java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:881)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.addRowLock(HRegionServer.java:2018)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.lockRow(HRegionServer.java:2004)
>   ... 5 more 
> After checking the source code I've noticed that the value which is going to 
> be put into the HashMap can be null in the case where the waitForLock flag is 
> true or the rowLockWaitDuration has expired (HRegion#internalObtainRowLock, 
> line 2111ff). The latter, I think, happens in our case as we have heavy load 
> hitting the server.
> IMHO this case should be handled somehow and must not lead to an NPE.
> -
> Yves





[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-10 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183434#comment-13183434
 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
---

@Ted
Ok, I will make a patch. Before making the next patch I will wait for others' 
comments.
@Stack
Any comments?

> Timeout monitor races with table disable handler
> 
>
> Key: HBASE-5120
> URL: https://issues.apache.org/jira/browse/HBASE-5120
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.0
>Reporter: Zhihong Yu
>Priority: Blocker
> Fix For: 0.94.0, 0.92.1
>
> Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
> HBASE-5120_2.patch, HBASE-5120_3.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy 
> and caused more troubles than it fixed", on my first test I got a stuck 
> region in transition instead of being able to recover. The timeout was set to 
> 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
> sv4r5s38,62023,1325635980913 for region 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, 
> region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
> event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so 
> deleting ZK node and removing from regions in transition, skipping assignment 
> of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Deleting existing unassigned node for 
> 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Successfully deleted unassigned node for 
> region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But 
> wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Creating unassigned node for 
> 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Server null returned 
> java.lang.NullPointerException: Passed server is null for 
> 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused: it recreated the RIT znode but the region doesn't 
> even exist anymore. It even tries to shut it down but is blocked by NPEs. 
> Here's what's going on.
> The late ZK notification that the znode was deleted (but it got recreated 
> after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: The znode of region 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been 
> deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG 
> org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region 
> to clear regions in transition; 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBU

[jira] [Updated] (HBASE-5139) Compute (weighted) median using AggregateProtocol

2012-01-10 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5139:
--

Attachment: 5139-v2.txt

Patch version 2 modifies some javadoc and utilizes scanner caching in pass 2.

> Compute (weighted) median using AggregateProtocol
> -
>
> Key: HBASE-5139
> URL: https://issues.apache.org/jira/browse/HBASE-5139
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Zhihong Yu
>Assignee: Zhihong Yu
> Attachments: 5139-v2.txt, 5139.txt
>
>
> Suppose cf:cq1 stores numeric values and optionally cf:cq2 stores weights. 
> This task finds out the median value among the values of cf:cq1 (See 
> http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html)
> This can be done in two passes.
> The first pass utilizes AggregateProtocol where the following tuple is 
> returned from each region:
> (partial-sum-of-values, partial-sum-of-weights)
> The start rowkey (supplied by coprocessor framework) would be used to sort 
> the tuples. This way we can determine which region (called R) contains the 
> (weighted) median. partial-sum-of-weights can be 0 if unweighted median is 
> sought
> The second pass involves scanning the table, beginning with startrow of 
> region R and computing partial (weighted) sum until the threshold of S/2 is 
> crossed. The (weighted) median is returned.
> However, this approach wouldn't work if there is mutation in the underlying 
> table between pass one and pass two.
> In that case, sequential scanning seems to be the solution which is slower 
> than the above approach.
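The two passes above can be sketched as follows, under the assumption that regions are ordered by the value being aggregated and that each inner list holds one region's (value, weight) pairs; this is illustrative code, not the AggregateProtocol implementation:

```java
import java.util.List;

public class WeightedMedianSketch {
    static double weightedMedian(List<List<double[]>> regions) {
        // Pass 1: per-region partial weight sums, as each region would return.
        double total = 0;
        double[] partial = new double[regions.size()];
        for (int i = 0; i < regions.size(); i++) {
            for (double[] vw : regions.get(i)) partial[i] += vw[1];
            total += partial[i];
        }
        // Determine region R where the cumulative weight crosses S/2.
        double half = total / 2, cum = 0;
        int r = 0;
        while (r < regions.size() - 1 && cum + partial[r] < half) cum += partial[r++];
        // Pass 2: scan from the startrow of region R until S/2 is crossed.
        for (double[] vw : regions.get(r)) {
            cum += vw[1];
            if (cum >= half) return vw[0];
        }
        throw new IllegalStateException("unreachable for positive total weight");
    }
}
```

As the issue notes, this gives a consistent answer only if the table does not mutate between the two passes.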





[jira] [Updated] (HBASE-5139) Compute (weighted) median using AggregateProtocol

2012-01-10 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5139:
--

Status: Patch Available  (was: Open)

> Compute (weighted) median using AggregateProtocol
> -
>
> Key: HBASE-5139
> URL: https://issues.apache.org/jira/browse/HBASE-5139
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Zhihong Yu
>Assignee: Zhihong Yu
> Attachments: 5139-v2.txt, 5139.txt
>
>
> Suppose cf:cq1 stores numeric values and optionally cf:cq2 stores weights. 
> This task finds out the median value among the values of cf:cq1 (See 
> http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html)
> This can be done in two passes.
> The first pass utilizes AggregateProtocol where the following tuple is 
> returned from each region:
> (partial-sum-of-values, partial-sum-of-weights)
> The start rowkey (supplied by coprocessor framework) would be used to sort 
> the tuples. This way we can determine which region (called R) contains the 
> (weighted) median. partial-sum-of-weights can be 0 if unweighted median is 
> sought
> The second pass involves scanning the table, beginning with startrow of 
> region R and computing partial (weighted) sum until the threshold of S/2 is 
> crossed. The (weighted) median is returned.
> However, this approach wouldn't work if there is mutation in the underlying 
> table between pass one and pass two.
> In that case, sequential scanning seems to be the solution which is slower 
> than the above approach.





[jira] [Created] (HBASE-5172) HTableInterface should extend java.io.Closeable

2012-01-10 Thread Zhihong Yu (Created) (JIRA)
HTableInterface should extend java.io.Closeable
---

 Key: HBASE-5172
 URL: https://issues.apache.org/jira/browse/HBASE-5172
 Project: HBase
  Issue Type: Bug
Reporter: Zhihong Yu


Ioan Eugen Stan found this issue.
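For illustration only (TableLike below is a simplified stand-in, not the HBase API): if the interface extends java.io.Closeable, callers can release a table with Java 7's try-with-resources and pass it to any utility that accepts a Closeable.

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseableTableSketch {
    interface TableLike extends Closeable {
        byte[] get(byte[] row) throws IOException;
    }

    // The table is closed on every exit path, with no explicit finally block.
    static byte[] readOne(TableLike table, byte[] row) throws IOException {
        try (TableLike t = table) {
            return t.get(row);
        }
    }
}
```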





[jira] [Commented] (HBASE-5033) Opening/Closing store in parallel to reduce region open/close time

2012-01-10 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183442#comment-13183442
 ] 

Zhihong Yu commented on HBASE-5033:
---

@Liyin:
Do you happen to have some performance data when this feature is used ?

> Opening/Closing store in parallel to reduce region open/close time
> --
>
> Key: HBASE-5033
> URL: https://issues.apache.org/jira/browse/HBASE-5033
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: 5033.txt, D933.1.patch, D933.2.patch, D933.3.patch, 
> D933.4.patch, D933.5.patch, HBASE-5033-apach-trunk.patch
>
>
> Region servers are opening/closing each store and each store file for every 
> store in sequential fashion, which may cause inefficiency to open/close 
> regions. 
> So this diff is to open/close each store in parallel in order to reduce 
> region open/close time. Also it would help to reduce the cluster restart time.
> 1) Opening each store in parallel
> 2) Loading each store file for every store in parallel
> 3) Closing each store in parallel
> 4) Closing each store file for every store in parallel.
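A hedged sketch of the idea, not the patch itself: submit one task per store so store (and store file) opening overlaps, and consider the region open only when every future has completed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelStoreOpenSketch {
    static List<String> openStores(List<String> storeNames, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (final String name : storeNames) {
                // Each task stands in for constructing one Store and loading
                // its store files; tasks run concurrently on the pool.
                futures.add(pool.submit(() -> name + ":open"));
            }
            List<String> opened = new ArrayList<>();
            for (Future<String> f : futures) {
                opened.add(f.get()); // region is open only once all stores are
            }
            return opened;
        } finally {
            pool.shutdown();
        }
    }
}
```

Closing follows the same pattern with close tasks instead of open tasks.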





[jira] [Commented] (HBASE-5139) Compute (weighted) median using AggregateProtocol

2012-01-10 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183443#comment-13183443
 ] 

Hadoop QA commented on HBASE-5139:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510074/5139-v2.txt
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The patch appears to cause mvn compile goal to fail.

-1 findbugs.  The patch appears to cause Findbugs (version 1.3.9) to fail.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
 

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/720//testReport/
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/720//console

This message is automatically generated.

> Compute (weighted) median using AggregateProtocol
> -
>
> Key: HBASE-5139
> URL: https://issues.apache.org/jira/browse/HBASE-5139
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Zhihong Yu
>Assignee: Zhihong Yu
> Attachments: 5139-v2.txt, 5139.txt
>
>
> Suppose cf:cq1 stores numeric values and optionally cf:cq2 stores weights. 
> This task finds out the median value among the values of cf:cq1 (See 
> http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html)
> This can be done in two passes.
> The first pass utilizes AggregateProtocol where the following tuple is 
> returned from each region:
> (partial-sum-of-values, partial-sum-of-weights)
> The start rowkey (supplied by coprocessor framework) would be used to sort 
> the tuples. This way we can determine which region (called R) contains the 
> (weighted) median. partial-sum-of-weights can be 0 if unweighted median is 
> sought
> The second pass involves scanning the table, beginning with startrow of 
> region R and computing partial (weighted) sum until the threshold of S/2 is 
> crossed. The (weighted) median is returned.
> However, this approach wouldn't work if there is mutation in the underlying 
> table between pass one and pass two.
> In that case, sequential scanning seems to be the solution which is slower 
> than the above approach.





[jira] [Updated] (HBASE-5139) Compute (weighted) median using AggregateProtocol

2012-01-10 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5139:
--

Attachment: (was: 5139-v2.txt)

> Compute (weighted) median using AggregateProtocol
> -
>
> Key: HBASE-5139
> URL: https://issues.apache.org/jira/browse/HBASE-5139
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Zhihong Yu
>Assignee: Zhihong Yu
>
> Suppose cf:cq1 stores numeric values and optionally cf:cq2 stores weights. 
> This task finds out the median value among the values of cf:cq1 (See 
> http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html)
> This can be done in two passes.
> The first pass utilizes AggregateProtocol where the following tuple is 
> returned from each region:
> (partial-sum-of-values, partial-sum-of-weights)
> The start rowkey (supplied by coprocessor framework) would be used to sort 
> the tuples. This way we can determine which region (called R) contains the 
> (weighted) median. partial-sum-of-weights can be 0 if unweighted median is 
> sought
> The second pass involves scanning the table, beginning with startrow of 
> region R and computing partial (weighted) sum until the threshold of S/2 is 
> crossed. The (weighted) median is returned.
> However, this approach wouldn't work if there is mutation in the underlying 
> table between pass one and pass two.
> In that case, sequential scanning seems to be the solution which is slower 
> than the above approach.





[jira] [Updated] (HBASE-5139) Compute (weighted) median using AggregateProtocol

2012-01-10 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5139:
--

Attachment: (was: 5139.txt)

> Compute (weighted) median using AggregateProtocol
> -
>
> Key: HBASE-5139
> URL: https://issues.apache.org/jira/browse/HBASE-5139
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Zhihong Yu
>Assignee: Zhihong Yu
>
> Suppose cf:cq1 stores numeric values and optionally cf:cq2 stores weights. 
> This task finds out the median value among the values of cf:cq1 (See 
> http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html)
> This can be done in two passes.
> The first pass utilizes AggregateProtocol where the following tuple is 
> returned from each region:
> (partial-sum-of-values, partial-sum-of-weights)
> The start rowkey (supplied by coprocessor framework) would be used to sort 
> the tuples. This way we can determine which region (called R) contains the 
> (weighted) median. partial-sum-of-weights can be 0 if unweighted median is 
> sought
> The second pass involves scanning the table, beginning with startrow of 
> region R and computing partial (weighted) sum until the threshold of S/2 is 
> crossed. The (weighted) median is returned.
> However, this approach wouldn't work if there is mutation in the underlying 
> table between pass one and pass two.
> In that case, sequential scanning seems to be the solution which is slower 
> than the above approach.





[jira] [Updated] (HBASE-5139) Compute (weighted) median using AggregateProtocol

2012-01-10 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5139:
--

Attachment: 5139-v2.txt

> Compute (weighted) median using AggregateProtocol
> -
>
> Key: HBASE-5139
> URL: https://issues.apache.org/jira/browse/HBASE-5139
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Zhihong Yu
>Assignee: Zhihong Yu
> Attachments: 5139-v2.txt
>
>
> Suppose cf:cq1 stores numeric values and optionally cf:cq2 stores weights. 
> This task finds out the median value among the values of cf:cq1 (See 
> http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html)
> This can be done in two passes.
> The first pass utilizes AggregateProtocol where the following tuple is 
> returned from each region:
> (partial-sum-of-values, partial-sum-of-weights)
> The start rowkey (supplied by coprocessor framework) would be used to sort 
> the tuples. This way we can determine which region (called R) contains the 
> (weighted) median. partial-sum-of-weights can be 0 if unweighted median is 
> sought
> The second pass involves scanning the table, beginning with startrow of 
> region R and computing partial (weighted) sum until the threshold of S/2 is 
> crossed. The (weighted) median is returned.
> However, this approach wouldn't work if there is mutation in the underlying 
> table between pass one and pass two.
> In that case, sequential scanning seems to be the solution which is slower 
> than the above approach.





[jira] [Updated] (HBASE-5139) Compute (weighted) median using AggregateProtocol

2012-01-10 Thread Zhihong Yu (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5139:
--

Comment: was deleted

(was: -1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510074/5139-v2.txt
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The patch appears to cause mvn compile goal to fail.

-1 findbugs.  The patch appears to cause Findbugs (version 1.3.9) to fail.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
 

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/720//testReport/
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/720//console

This message is automatically generated.)

> Compute (weighted) median using AggregateProtocol
> -
>
> Key: HBASE-5139
> URL: https://issues.apache.org/jira/browse/HBASE-5139
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Zhihong Yu
>Assignee: Zhihong Yu
> Attachments: 5139-v2.txt
>
>
> Suppose cf:cq1 stores numeric values and optionally cf:cq2 stores weights. 
> This task finds out the median value among the values of cf:cq1 (See 
> http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html)
> This can be done in two passes.
> The first pass utilizes AggregateProtocol where the following tuple is 
> returned from each region:
> (partial-sum-of-values, partial-sum-of-weights)
> The start rowkey (supplied by coprocessor framework) would be used to sort 
> the tuples. This way we can determine which region (called R) contains the 
> (weighted) median. partial-sum-of-weights can be 0 if unweighted median is 
> sought
> The second pass involves scanning the table, beginning with startrow of 
> region R and computing partial (weighted) sum until the threshold of S/2 is 
> crossed. The (weighted) median is returned.
> However, this approach wouldn't work if there is mutation in the underlying 
> table between pass one and pass two.
> In that case, sequential scanning seems to be the solution which is slower 
> than the above approach.





[jira] [Commented] (HBASE-5134) Remove getRegionServerWithoutRetries and getRegionServerWithRetries from HConnection Interface

2012-01-10 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183452#comment-13183452
 ] 

Hadoop QA commented on HBASE-5134:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12509981/5134-v6.txt
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 18 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -147 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 78 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/719//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/719//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/719//console

This message is automatically generated.

> Remove getRegionServerWithoutRetries and getRegionServerWithRetries from 
> HConnection Interface
> --
>
> Key: HBASE-5134
> URL: https://issues.apache.org/jira/browse/HBASE-5134
> Project: HBase
>  Issue Type: Improvement
>Reporter: stack
>Assignee: stack
> Fix For: 0.94.0
>
> Attachments: 5134-v2.txt, 5134-v3.txt, 5134-v4.txt, 5134-v5.txt, 
> 5134-v6.txt, 5134-v6.txt
>
>
> It's broke having these meta methods in HConnection.  They take 
> ServerCallables which themselves have HConnections inevitably.  It makes for 
> a tangle in the model and frustrates being able to do mocked implementations 
> of HConnection.  These methods better belong in something like 
> HConnectionManager, or elsewhere altogether.
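A simplified sketch of the design point (names below are stand-ins, not the HBase API): with the retry helper kept out of the interface, a mock only has to implement the data methods, and the helper lives in a utility class in the HConnectionManager style.

```java
public class ConnectionRefactorSketch {
    interface Connection {
        String locateRegion(String row); // pure data method: trivial to mock
    }

    // Retry logic moved out of the interface into a static utility, so the
    // interface no longer depends on callables that hold connections.
    static String withRetries(Connection conn, String row, int attempts) {
        RuntimeException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return conn.locateRegion(row);
            } catch (RuntimeException e) {
                last = e; // real code would sleep/backoff between attempts
            }
        }
        throw last;
    }
}
```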





[jira] [Commented] (HBASE-5134) Remove getRegionServerWithoutRetries and getRegionServerWithRetries from HConnection Interface

2012-01-10 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183458#comment-13183458
 ] 

stack commented on HBASE-5134:
--

I ran your nice script Ted -- shall we check it in under ./dev-support? -- and 
it reports a new test hanging:

Hanging test: Running org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

... which also seems unrelated.  Mind if I check in v6?


> Remove getRegionServerWithoutRetries and getRegionServerWithRetries from 
> HConnection Interface
> --
>
> Key: HBASE-5134
> URL: https://issues.apache.org/jira/browse/HBASE-5134
> Project: HBase
>  Issue Type: Improvement
>Reporter: stack
>Assignee: stack
> Fix For: 0.94.0
>
> Attachments: 5134-v2.txt, 5134-v3.txt, 5134-v4.txt, 5134-v5.txt, 
> 5134-v6.txt, 5134-v6.txt
>
>
> It's broke having these meta methods in HConnection.  They take 
> ServerCallables which themselves have HConnections inevitably.  It makes for 
> a tangle in the model and frustrates being able to do mocked implementations 
> of HConnection.  These methods better belong in something like 
> HConnectionManager, or elsewhere altogether.





[jira] [Commented] (HBASE-5134) Remove getRegionServerWithoutRetries and getRegionServerWithRetries from HConnection Interface

2012-01-10 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183462#comment-13183462
 ] 

stack commented on HBASE-5134:
--

I ran your script back to build 700 and every other test has a random hang, 
usually different each time.

> Remove getRegionServerWithoutRetries and getRegionServerWithRetries from 
> HConnection Interface
> --
>
> Key: HBASE-5134
> URL: https://issues.apache.org/jira/browse/HBASE-5134
> Project: HBase
>  Issue Type: Improvement
>Reporter: stack
>Assignee: stack
> Fix For: 0.94.0
>
> Attachments: 5134-v2.txt, 5134-v3.txt, 5134-v4.txt, 5134-v5.txt, 
> 5134-v6.txt, 5134-v6.txt
>
>
> It's broken having these meta methods in HConnection.  They take 
> ServerCallables, which themselves inevitably hold HConnections.  This makes 
> for a tangle in the model and frustrates mocked implementations of 
> HConnection.  These methods belong in something like HConnectionManager, or 
> elsewhere altogether.





[jira] [Commented] (HBASE-5134) Remove getRegionServerWithoutRetries and getRegionServerWithRetries from HConnection Interface

2012-01-10 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183463#comment-13183463
 ] 

Zhihong Yu commented on HBASE-5134:
---

Feel free to check the script under ./dev-support.

I think v6 is good to go.

Thanks

> Remove getRegionServerWithoutRetries and getRegionServerWithRetries from 
> HConnection Interface
> --
>
> Key: HBASE-5134
> URL: https://issues.apache.org/jira/browse/HBASE-5134
> Project: HBase
>  Issue Type: Improvement
>Reporter: stack
>Assignee: stack
> Fix For: 0.94.0
>
> Attachments: 5134-v2.txt, 5134-v3.txt, 5134-v4.txt, 5134-v5.txt, 
> 5134-v6.txt, 5134-v6.txt
>
>
> It's broken having these meta methods in HConnection.  They take 
> ServerCallables, which themselves inevitably hold HConnections.  This makes 
> for a tangle in the model and frustrates mocked implementations of 
> HConnection.  These methods belong in something like HConnectionManager, or 
> elsewhere altogether.





[jira] [Updated] (HBASE-5134) Remove getRegionServerWithoutRetries and getRegionServerWithRetries from HConnection Interface

2012-01-10 Thread stack (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-5134:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed trunk.  Thanks for reviews lads.

> Remove getRegionServerWithoutRetries and getRegionServerWithRetries from 
> HConnection Interface
> --
>
> Key: HBASE-5134
> URL: https://issues.apache.org/jira/browse/HBASE-5134
> Project: HBase
>  Issue Type: Improvement
>Reporter: stack
>Assignee: stack
> Fix For: 0.94.0
>
> Attachments: 5134-v2.txt, 5134-v3.txt, 5134-v4.txt, 5134-v5.txt, 
> 5134-v6.txt, 5134-v6.txt
>
>
> It's broken having these meta methods in HConnection.  They take 
> ServerCallables, which themselves inevitably hold HConnections.  This makes 
> for a tangle in the model and frustrates mocked implementations of 
> HConnection.  These methods belong in something like HConnectionManager, or 
> elsewhere altogether.





[jira] [Updated] (HBASE-5173) Commit hbase-4480 findHangingTest.sh script under dev-support

2012-01-10 Thread stack (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-5173:
-

Attachment: 5173.txt

Here is patch that adds script.

> Commit hbase-4480 findHangingTest.sh script under dev-support
> -
>
> Key: HBASE-5173
> URL: https://issues.apache.org/jira/browse/HBASE-5173
> Project: HBase
>  Issue Type: Task
>Reporter: stack
> Fix For: 0.94.0
>
> Attachments: 5173.txt
>
>
> See hbase-4480 for the script from Ted





[jira] [Updated] (HBASE-5140) TableInputFormat subclass to allow N number of splits per region during MR jobs

2012-01-10 Thread Josh Wymer (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Wymer updated HBASE-5140:
--

Description: 
In regards to [HBASE-5138|https://issues.apache.org/jira/browse/HBASE-5138] I 
am working on a patch for the TableInputFormat class that overrides getSplits 
in order to generate N splits per region and/or N splits per job. The idea is 
to convert the startKey and endKey for each region from byte[] to BigDecimal, 
take the difference, divide by N, convert back to byte[], and generate splits 
on the resulting values. Assuming your keys are fully distributed, this should 
generate splits with nearly the same number of rows per split. Any suggestions 
on this issue are welcome.


  was:In regards to 
[HBASE-5138|https://issues.apache.org/jira/browse/HBASE-5138] I am working on a 
subclass for the TableInputFormat class that overrides getSplits in order to 
generate N number of splits per regions and/or N number of splits per job. The 
idea is to convert the startKey and endKey for each region from byte[] to 
BigDecimal, take the difference, divide by N, convert back to byte[] and 
generate splits on the resulting values. Assuming your keys are fully 
distributed this should generate splits at nearly the same number of rows per 
split. Any suggestions on this issue are welcome.
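The key-range arithmetic described above can be sketched in plain Java. This is a hypothetical standalone helper (not the attached patch; the class name SplitCalculator is made up), and it uses BigInteger rather than the BigDecimal mentioned in the description, since row keys are integral byte strings:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public final class SplitCalculator {

    // Pad a key on the right to a fixed width so short and long keys
    // compare as unsigned big-endian integers of the same length.
    private static byte[] pad(byte[] key, int width) {
        byte[] out = new byte[width];
        System.arraycopy(key, 0, out, 0, key.length);
        return out;
    }

    /**
     * Returns n - 1 intermediate split keys dividing [startKey, endKey)
     * into n roughly equal byte ranges.
     */
    public static List<byte[]> splitKeys(byte[] startKey, byte[] endKey, int n) {
        int width = Math.max(startKey.length, endKey.length);
        // Interpret each key as a non-negative big-endian integer.
        BigInteger lo = new BigInteger(1, pad(startKey, width));
        BigInteger hi = new BigInteger(1, pad(endKey, width));
        BigInteger range = hi.subtract(lo);
        List<byte[]> splits = new ArrayList<>();
        for (int i = 1; i < n; i++) {
            BigInteger split = lo.add(range.multiply(BigInteger.valueOf(i))
                                           .divide(BigInteger.valueOf(n)));
            byte[] raw = split.toByteArray();
            // toByteArray may add a leading sign byte or drop leading
            // zeros; normalize back to the fixed key width.
            byte[] key = new byte[width];
            int copy = Math.min(raw.length, width);
            System.arraycopy(raw, raw.length - copy, key, width - copy, copy);
            splits.add(key);
        }
        return splits;
    }
}
```

Dividing the unsigned integer range this way yields evenly populated splits only when keys are uniformly distributed, which matches the assumption stated in the description.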


> TableInputFormat subclass to allow N number of splits per region during MR 
> jobs
> ---
>
> Key: HBASE-5140
> URL: https://issues.apache.org/jira/browse/HBASE-5140
> Project: HBase
>  Issue Type: New Feature
>  Components: mapreduce
>Affects Versions: 0.90.4
>Reporter: Josh Wymer
>Priority: Trivial
>  Labels: mapreduce, split
> Fix For: 0.90.4
>
> Attachments: 
> Added_functionality_to_TableInputFormat_that_allows_splitting_of_regions.patch,
>  
> Added_functionality_to_TableInputFormat_that_allows_splitting_of_regions.patch.1,
>  Added_functionality_to_split_n_times_per_region_on_mapreduce_jobs.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In regards to [HBASE-5138|https://issues.apache.org/jira/browse/HBASE-5138] I 
> am working on a patch for the TableInputFormat class that overrides getSplits 
> in order to generate N splits per region and/or N splits per job. The idea is 
> to convert the startKey and endKey for each region from byte[] to BigDecimal, 
> take the difference, divide by N, convert back to byte[], and generate 
> splits on the resulting values. Assuming your keys are fully distributed, 
> this should generate splits with nearly the same number of rows per split. 
> Any suggestions on this issue are welcome.





[jira] [Resolved] (HBASE-5173) Commit hbase-4480 findHangingTest.sh script under dev-support

2012-01-10 Thread stack (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack resolved HBASE-5173.
--

   Resolution: Fixed
Fix Version/s: 0.94.0

Committed to trunk.

> Commit hbase-4480 findHangingTest.sh script under dev-support
> -
>
> Key: HBASE-5173
> URL: https://issues.apache.org/jira/browse/HBASE-5173
> Project: HBase
>  Issue Type: Task
>Reporter: stack
> Fix For: 0.94.0
>
> Attachments: 5173.txt
>
>
> See hbase-4480 for the script from Ted





[jira] [Created] (HBASE-5173) Commit hbase-4480 findHangingTest.sh script under dev-support

2012-01-10 Thread stack (Created) (JIRA)
Commit hbase-4480 findHangingTest.sh script under dev-support
-

 Key: HBASE-5173
 URL: https://issues.apache.org/jira/browse/HBASE-5173
 Project: HBase
  Issue Type: Task
Reporter: stack
 Attachments: 5173.txt

See hbase-4480 for the script from Ted





[jira] [Commented] (HBASE-5139) Compute (weighted) median using AggregateProtocol

2012-01-10 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183480#comment-13183480
 ] 

Zhihong Yu commented on HBASE-5139:
---

I want to point out that the weighted median computation in patch v2 is only 
for reference.
Consider the case where cf:cq1, the value column, has a different data type 
from cf:cq2, the weight column. Two ColumnInterpreters should be provided, one 
for the value column and one for the weight column.
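The pass-two selection described in the quoted issue can be illustrated with a single-process sketch (hypothetical code, assuming values arrive already sorted in rowkey order with their weights; the real version would aggregate per region through AggregateProtocol first):

```java
public final class WeightedMedian {

    /**
     * values must be sorted ascending (the pass-two scan walks rowkeys in
     * order); weights[i] is the weight of values[i]. Returns the first value
     * at which the running weight sum reaches half the total weight S,
     * mirroring the S/2 threshold in the description.
     */
    public static double weightedMedian(double[] values, double[] weights) {
        double total = 0;
        for (double w : weights) total += w;       // pass one: total weight S
        double running = 0;
        for (int i = 0; i < values.length; i++) {  // pass two: scan to S/2
            running += weights[i];
            if (running >= total / 2) return values[i];
        }
        throw new IllegalStateException("empty input");
    }
}
```

Passing all-equal weights reduces this to the ordinary (unweighted) median threshold.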

> Compute (weighted) median using AggregateProtocol
> -
>
> Key: HBASE-5139
> URL: https://issues.apache.org/jira/browse/HBASE-5139
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Zhihong Yu
>Assignee: Zhihong Yu
> Attachments: 5139-v2.txt
>
>
> Suppose cf:cq1 stores numeric values and optionally cf:cq2 stores weights. 
> This task finds out the median value among the values of cf:cq1 (See 
> http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html)
> This can be done in two passes.
> The first pass utilizes AggregateProtocol where the following tuple is 
> returned from each region:
> (partial-sum-of-values, partial-sum-of-weights)
> The start rowkey (supplied by coprocessor framework) would be used to sort 
> the tuples. This way we can determine which region (called R) contains the 
> (weighted) median. partial-sum-of-weights can be 0 if the unweighted median 
> is sought.
> The second pass involves scanning the table, beginning with startrow of 
> region R and computing partial (weighted) sum until the threshold of S/2 is 
> crossed. The (weighted) median is returned.
> However, this approach wouldn't work if there is mutation in the underlying 
> table between pass one and pass two.
> In that case, sequential scanning seems to be the solution, which is slower 
> than the above approach.





[jira] [Assigned] (HBASE-5172) HTableInterface should extend java.io.Closeable

2012-01-10 Thread stack (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack reassigned HBASE-5172:


Assignee: stack

> HTableInterface should extend java.io.Closeable
> ---
>
> Key: HBASE-5172
> URL: https://issues.apache.org/jira/browse/HBASE-5172
> Project: HBase
>  Issue Type: Bug
>Reporter: Zhihong Yu
>Assignee: stack
> Attachments: 5172.txt
>
>
> Ioan Eugen Stan found this issue.





[jira] [Updated] (HBASE-5172) HTableInterface should extend java.io.Closeable

2012-01-10 Thread stack (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-5172:
-

Attachment: 5172.txt

> HTableInterface should extend java.io.Closeable
> ---
>
> Key: HBASE-5172
> URL: https://issues.apache.org/jira/browse/HBASE-5172
> Project: HBase
>  Issue Type: Bug
>Reporter: Zhihong Yu
> Attachments: 5172.txt
>
>
> Ioan Eugen Stan found this issue.





[jira] [Updated] (HBASE-5172) HTableInterface should extend java.io.Closeable

2012-01-10 Thread stack (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-5172:
-

Status: Patch Available  (was: Open)

> HTableInterface should extend java.io.Closeable
> ---
>
> Key: HBASE-5172
> URL: https://issues.apache.org/jira/browse/HBASE-5172
> Project: HBase
>  Issue Type: Bug
>Reporter: Zhihong Yu
> Attachments: 5172.txt
>
>
> Ioan Eugen Stan found this issue.





[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked.")

2012-01-10 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183487#comment-13183487
 ] 

stack commented on HBASE-5163:
--

The lock issue happens in a bunch of tests IIRC.  Nice digging, N.  Can #2 
ever fail?  If so, I like #1.

> TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or 
> hadoop QA on trunk ("The directory is already locked.")
> -
>
> Key: HBASE-5163
> URL: https://issues.apache.org/jira/browse/HBASE-5163
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.94.0
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Minor
>
> The stack is typically:
> {noformat}
> java.io.IOException: Cannot lock storage 
> /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3.
> The directory is already locked.
>   at 
> org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
>   at 
> org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:290)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
> // ...
> {noformat}
> It can be reproduced without parallelization or without executing the other 
> tests in the class. It seems to fail about 5% of the time.
> This comes from the naming policy for the directories in 
> MiniDFSCluster#startDataNode. It depends on the number of nodes *currently* 
> in the cluster, and does not take into account previous starts/stops:
> {noformat}
>for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) {
>   if (manageDfsDirs) {
> File dir1 = new File(data_dir, "data"+(2*i+1));
> File dir2 = new File(data_dir, "data"+(2*i+2));
> dir1.mkdirs();
> dir2.mkdirs();
>   // [...]
> {noformat}
> This means that if we want to stop/start a datanode, we should always stop 
> the last one; otherwise the names will conflict. This test exhibits the 
> behavior:
> {noformat}
>   @Test
>   public void testMiniDFSCluster_startDataNode() throws Exception {
> assertTrue( dfsCluster.getDataNodes().size() == 2 );
> // Works, as we kill the last datanode, we can now start a datanode
> dfsCluster.stopDataNode(1);
> dfsCluster
>   .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
> // Fails, as it's not the last datanode, the directory will conflict on
> //  creation
> dfsCluster.stopDataNode(0);
> try {
>   dfsCluster
> .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
>   fail("There should be an exception because the directory already 
> exists");
> } catch (IOException e) {
>   assertTrue( e.getMessage().contains("The directory is already 
> locked."));
>   LOG.info("Expected (!) exception caught " + e.getMessage());
> }
> // Works, as we kill the last datanode, we can now restart 2 datanodes
> // This makes us back with 2 nodes
> dfsCluster.stopDataNode(0);
> dfsCluster
>   .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
>   }
> {noformat}
> And then this behavior is randomly triggered in testLogRollOnDatanodeDeath 
> because when we do
> {noformat}
> DatanodeInfo[] pipeline = getPipeline(log);
> assertTrue(pipeline.length == fs.getDefaultReplication());
> {noformat}
> and then kill the datanodes in the pipeline, we will have:
>  - most of the time: pipeline = 1 & 2, so after killing 1&2 we can start a 
> new datanode that will reuse the available 2's directory.
>  - sometimes: pipeline = 1 & 3. In this case, when we try to launch the new 
> datanode, it fails because it wants to use the same directory as the 
> still-alive '2'.
> There are two ways of fixing the test:
> 1) Fix the naming rule

[jira] [Commented] (HBASE-5139) Compute (weighted) median using AggregateProtocol

2012-01-10 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183488#comment-13183488
 ] 

Hadoop QA commented on HBASE-5139:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510076/5139-v2.txt
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -151 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 81 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/721//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/721//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/721//console

This message is automatically generated.

> Compute (weighted) median using AggregateProtocol
> -
>
> Key: HBASE-5139
> URL: https://issues.apache.org/jira/browse/HBASE-5139
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Zhihong Yu
>Assignee: Zhihong Yu
> Attachments: 5139-v2.txt
>
>
> Suppose cf:cq1 stores numeric values and optionally cf:cq2 stores weights. 
> This task finds out the median value among the values of cf:cq1 (See 
> http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html)
> This can be done in two passes.
> The first pass utilizes AggregateProtocol where the following tuple is 
> returned from each region:
> (partial-sum-of-values, partial-sum-of-weights)
> The start rowkey (supplied by coprocessor framework) would be used to sort 
> the tuples. This way we can determine which region (called R) contains the 
> (weighted) median. partial-sum-of-weights can be 0 if the unweighted median 
> is sought.
> The second pass involves scanning the table, beginning with startrow of 
> region R and computing partial (weighted) sum until the threshold of S/2 is 
> crossed. The (weighted) median is returned.
> However, this approach wouldn't work if there is mutation in the underlying 
> table between pass one and pass two.
> In that case, sequential scanning seems to be the solution, which is slower 
> than the above approach.





[jira] [Commented] (HBASE-5172) HTableInterface should extend java.io.Closeable

2012-01-10 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183494#comment-13183494
 ] 

Zhihong Yu commented on HBASE-5172:
---

+1 on patch.

Thanks Stack.

> HTableInterface should extend java.io.Closeable
> ---
>
> Key: HBASE-5172
> URL: https://issues.apache.org/jira/browse/HBASE-5172
> Project: HBase
>  Issue Type: Bug
>Reporter: Zhihong Yu
>Assignee: stack
> Attachments: 5172.txt
>
>
> Ioan Eugen Stan found this issue.





[jira] [Updated] (HBASE-5172) HTableInterface should extend java.io.Closeable

2012-01-10 Thread stack (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-5172:
-

   Resolution: Fixed
Fix Version/s: 0.94.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

Committed to trunk

> HTableInterface should extend java.io.Closeable
> ---
>
> Key: HBASE-5172
> URL: https://issues.apache.org/jira/browse/HBASE-5172
> Project: HBase
>  Issue Type: Bug
>Reporter: Zhihong Yu
>Assignee: stack
> Fix For: 0.94.0
>
> Attachments: 5172.txt
>
>
> Ioan Eugen Stan found this issue.





[jira] [Commented] (HBASE-4608) HLog Compression

2012-01-10 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183495#comment-13183495
 ] 

Zhihong Yu commented on HBASE-4608:
---

I got the following on my MacBook for 4608v9.txt:
{code}
testReplayEditsWrittenViaHRegion(org.apache.hadoop.hbase.regionserver.wal.TestWALReplayCompressed)
  Time elapsed: 2.009 sec  <<< FAILURE!
java.lang.AssertionError
  at org.junit.Assert.fail(Assert.java:92)
  at org.junit.Assert.assertTrue(Assert.java:43)
  at org.junit.Assert.assertTrue(Assert.java:54)
  at 
org.apache.hadoop.hbase.regionserver.wal.TestWALReplay.testReplayEditsWrittenViaHRegion(TestWALReplay.java:289)
{code}

> HLog Compression
> 
>
> Key: HBASE-4608
> URL: https://issues.apache.org/jira/browse/HBASE-4608
> Project: HBase
>  Issue Type: New Feature
>Reporter: Li Pi
>Assignee: Li Pi
> Attachments: 4608v1.txt, 4608v5.txt, 4608v6.txt, 4608v7.txt, 
> 4608v8fixed.txt
>
>
> The current bottleneck to HBase write speed is replicating the WAL appends 
> across different datanodes. We can speed up this process by compressing the 
> HLog. The current plan involves using a dictionary to compress the table 
> name, region id, cf name, and possibly other bits of repeated data. Also, 
> the HLog format may be changed in other ways to produce a smaller HLog.
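A minimal sketch of the dictionary idea (hypothetical, not the actual patch's wire format; the class name WalDictionary is made up): strings that repeat across WAL entries, such as table and column family names, are assigned small integer indexes, and only the index needs to be written after the first occurrence.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class WalDictionary {
    private final Map<String, Integer> toIndex = new HashMap<>();
    private final List<String> fromIndex = new ArrayList<>();

    /** Returns the dictionary index for s, assigning one on first sight. */
    public int encode(String s) {
        Integer idx = toIndex.get(s);
        if (idx == null) {
            idx = fromIndex.size();
            toIndex.put(s, idx);
            fromIndex.add(s);
        }
        return idx;
    }

    /** Inverse lookup used on the read/replay side. */
    public String decode(int idx) {
        return fromIndex.get(idx);
    }
}
```

In a real log format the writer would emit the full string alongside the index the first time, so a reader can rebuild the same dictionary while scanning.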





[jira] [Created] (HBASE-5174) Coalesce aborted tasks in the TaskMonitor

2012-01-10 Thread Jean-Daniel Cryans (Created) (JIRA)
Coalesce aborted tasks in the TaskMonitor
-

 Key: HBASE-5174
 URL: https://issues.apache.org/jira/browse/HBASE-5174
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.92.0
Reporter: Jean-Daniel Cryans
 Fix For: 0.94.0, 0.92.1


Some tasks can get repeatedly canceled, like flushing when splitting is going 
on; in the logs it looks like this:

{noformat}
2012-01-10 19:28:29,164 INFO 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region 
test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. due to global heap 
pressure
2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT 
flushing memstore for region 
test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c., flushing=false, 
writesEnabled=false
2012-01-10 19:28:29,164 DEBUG 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up 
because memory above low water=1.6g
2012-01-10 19:28:29,164 INFO 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region 
test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. due to global heap 
pressure
2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT 
flushing memstore for region 
test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c., flushing=false, 
writesEnabled=false
2012-01-10 19:28:29,164 DEBUG 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up 
because memory above low water=1.6g
2012-01-10 19:28:29,164 INFO 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region 
test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. due to global heap 
pressure
2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT 
flushing memstore for region 
test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c., flushing=false, 
writesEnabled=false
{noformat}

But in the TaskMonitor UI you'll get MAX_TASKS (1000) displayed on top of the 
regions. Basically 1000x:

{noformat}
Tue Jan 10 19:28:29 UTC 2012Flushing 
test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. ABORTED (since 31sec 
ago)   Not flushing since writes not enabled (since 31sec ago)
{noformat}

It's ugly and I'm sure some users will freak out seeing this, plus you have to 
scroll down all the way to see your regions. Coalescing consecutive aborted 
tasks seems like a good solution.
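The proposed coalescing could look roughly like the following (a hypothetical sketch, not TaskMonitor's actual API): consecutive identical aborted entries collapse into one row with a repeat count, so 1000 identical flush aborts render as a single line.

```java
import java.util.ArrayList;
import java.util.List;

public final class TaskCoalescer {

    /** One display row: a task description plus how many consecutive
     *  identical aborted entries it stands for. */
    public static final class Row {
        public final String description;
        public int count = 1;
        public Row(String description) { this.description = description; }
    }

    /**
     * Collapses runs of identical aborted-task descriptions into a single
     * row with a repeat count, preserving the order of distinct runs.
     */
    public static List<Row> coalesce(List<String> abortedTasks) {
        List<Row> rows = new ArrayList<>();
        for (String task : abortedTasks) {
            if (!rows.isEmpty()
                && rows.get(rows.size() - 1).description.equals(task)) {
                rows.get(rows.size() - 1).count++;  // extend the current run
            } else {
                rows.add(new Row(task));            // start a new run
            }
        }
        return rows;
    }
}
```

The UI would then render each row once, annotated with its count (e.g. "x1000"), instead of MAX_TASKS near-identical lines.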





[jira] [Assigned] (HBASE-4440) add an option to presplit table to PerformanceEvaluation

2012-01-10 Thread Jean-Daniel Cryans (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans reassigned HBASE-4440:
-

Assignee: Jean-Daniel Cryans

> add an option to presplit table to PerformanceEvaluation
> 
>
> Key: HBASE-4440
> URL: https://issues.apache.org/jira/browse/HBASE-4440
> Project: HBase
>  Issue Type: Improvement
>  Components: util
>Reporter: Sujee Maniyam
>Assignee: Jean-Daniel Cryans
>Priority: Minor
>  Labels: benchmark
> Fix For: 0.94.0
>
> Attachments: PerformanceEvaluation.java, 
> PerformanceEvaluation_HBASE_4440.patch, 
> PerformanceEvaluation_HBASE_4440_2.patch
>
>
> PerformanceEvaluation is a quick way to 'benchmark' an HBase cluster.  The 
> current 'write*' operations do not pre-split the table.  Pre-splitting the 
> table will really boost the insert performance.
> It would be nice to have an option to enable pre-splitting the table before 
> the inserts begin.
> It would look something like:
> (a) hbase ...PerformanceEvaluation   --presplit=10 
> (b) hbase ...PerformanceEvaluation   --presplit 
> (b) will try to presplit the table on some default value (say the number of 
> region servers)





[jira] [Assigned] (HBASE-4440) add an option to presplit table to PerformanceEvaluation

2012-01-10 Thread Jean-Daniel Cryans (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans reassigned HBASE-4440:
-

Assignee: Sujee Maniyam  (was: Jean-Daniel Cryans)

Ugh, pressed the wrong button and assigned it to myself. I added Sujee as a 
contributor and reassigned this jira.

> add an option to presplit table to PerformanceEvaluation
> 
>
> Key: HBASE-4440
> URL: https://issues.apache.org/jira/browse/HBASE-4440
> Project: HBase
>  Issue Type: Improvement
>  Components: util
>Reporter: Sujee Maniyam
>Assignee: Sujee Maniyam
>Priority: Minor
>  Labels: benchmark
> Fix For: 0.94.0
>
> Attachments: PerformanceEvaluation.java, 
> PerformanceEvaluation_HBASE_4440.patch, 
> PerformanceEvaluation_HBASE_4440_2.patch
>
>
> PerformanceEvaluation is a quick way to 'benchmark' an HBase cluster.  The 
> current 'write*' operations do not pre-split the table.  Pre-splitting the 
> table will really boost the insert performance.
> It would be nice to have an option to enable pre-splitting the table before 
> the inserts begin.
> It would look something like:
> (a) hbase ...PerformanceEvaluation   --presplit=10 
> (b) hbase ...PerformanceEvaluation   --presplit 
> (b) will try to presplit the table on some default value (say the number of 
> region servers)





[jira] [Commented] (HBASE-5172) HTableInterface should extend java.io.Closeable

2012-01-10 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183521#comment-13183521
 ] 

Hadoop QA commented on HBASE-5172:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510082/5172.txt
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 javadoc.  The javadoc tool appears to have generated -147 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 78 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/722//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/722//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/722//console

This message is automatically generated.

> HTableInterface should extend java.io.Closeable
> ---
>
> Key: HBASE-5172
> URL: https://issues.apache.org/jira/browse/HBASE-5172
> Project: HBase
>  Issue Type: Bug
>Reporter: Zhihong Yu
>Assignee: stack
> Fix For: 0.94.0
>
> Attachments: 5172.txt
>
>
> Ioan Eugen Stan found this issue.





[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

2012-01-10 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183531#comment-13183531
 ] 

Hudson commented on HBASE-5137:
---

Integrated in HBase-0.92 #238 (See 
[https://builds.apache.org/job/HBase-0.92/238/])
HBASE-5137  MasterFileSystem.splitLog() should abort even if 
waitOnSafeMode() throws IOException (Ram & Ted)

ramkrishna : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* 
/hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java


> MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws 
> IOException
> 
>
> Key: HBASE-5137
> URL: https://issues.apache.org/jira/browse/HBASE-5137
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.90.4
>Reporter: ramkrishna.s.vasudevan
>Assignee: ramkrishna.s.vasudevan
> Fix For: 0.92.0, 0.90.6
>
> Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch
>
>
> I am not sure if this bug was already raised in JIRA.
> In our test cluster we had a scenario where the RS had gone down and 
> ServerShutdownHandler started with splitLog.
> But as HDFS was down, the waitOnSafeMode check threw an IOException.
> {code}
> try {
> // If FS is in safe mode, just wait till out of it.
> FSUtils.waitOnSafeMode(conf,
>   conf.getInt(HConstants.THREAD_WAKE_FREQUENCY, 1000));  
> splitter.splitLog();
>   } catch (OrphanHLogAfterSplitException e) {
> {code}
> We catch the exception
> {code}
> } catch (IOException e) {
>   checkFileSystem();
>   LOG.error("Failed splitting " + logDir.toString(), e);
> }
> {code}
> So the HLog split itself did not happen. We encountered a case where about 4 
> regions that had recently been split on the crashed RS were lost.
> Can we abort the Master in such scenarios? Pls suggest.
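The suggestion amounts to treating a failed log split as fatal instead of logging and moving on. A minimal, hypothetical sketch of that control flow follows; the Aborter interface is a stand-in for the master's abort hook, and RuntimeException stands in for IOException so the snippet stays self-contained:

```java
public class SplitLogAbort {
    // Stand-in for the master's abort hook; not HBase's actual interface.
    interface Aborter {
        void abort(String why, Throwable cause);
    }

    // Sketch: run the splitter; on failure, abort the master instead of
    // merely logging, so a dead server's logs are never silently skipped.
    public static boolean splitLog(Runnable splitter, Aborter master) {
        try {
            splitter.run();
            return true;
        } catch (RuntimeException e) {
            master.abort("Failed splitting logs; aborting master", e);
            return false;
        }
    }
}
```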





[jira] [Created] (HBASE-5175) Add DoubleColumnInterpreter

2012-01-10 Thread Zhihong Yu (Created) (JIRA)
Add DoubleColumnInterpreter
---

 Key: HBASE-5175
 URL: https://issues.apache.org/jira/browse/HBASE-5175
 Project: HBase
  Issue Type: Sub-task
Reporter: Zhihong Yu


DoubleColumnInterpreter was requested by Royston Sellman.





[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted

2012-01-10 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183592#comment-13183592
 ] 

Zhihong Yu commented on HBASE-5155:
---

+1 on utilizing TableState.ENABLED
Nice finding, Ram.

> ServerShutDownHandler And Disable/Delete should not happen parallely leading 
> to recreation of regions that were deleted
> ---
>
> Key: HBASE-5155
> URL: https://issues.apache.org/jira/browse/HBASE-5155
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.90.4
>Reporter: ramkrishna.s.vasudevan
>Priority: Blocker
>
> ServerShutdownHandler and the disable/delete table handlers race.  This is 
> not an issue due to TM.
> -> A regionserver goes down.  In our cluster the regionserver holds lot of 
> regions.
> -> A region R1 has two daughters D1 and D2.
> -> The ServerShutdownHandler gets called and scans the META and gets all the 
> user regions
> -> Parallely a table is disabled. (No problem in this step).
> -> Delete table is done.
> -> The tables and its regions are deleted including R1, D1 and D2.. (So META 
> is cleaned)
> -> Now ServerShutdownHandler starts processDeadRegion
> {code}
>  if (hri.isOffline() && hri.isSplit()) {
>   LOG.debug("Offlined and split region " + hri.getRegionNameAsString() +
> "; checking daughter presence");
>   fixupDaughters(result, assignmentManager, catalogTracker);
> {code}
> As part of fixupDaughters, as the daughters D1 and D2 are missing for R1, 
> {code}
> if (isDaughterMissing(catalogTracker, daughter)) {
>   LOG.info("Fixup; missing daughter " + daughter.getRegionNameAsString());
>   MetaEditor.addDaughter(catalogTracker, daughter, null);
>   // TODO: Log WARN if the regiondir does not exist in the fs.  If its not
>   // there then something wonky about the split -- things will keep going
>   // but could be missing references to parent region.
>   // And assign it.
>   assignmentManager.assign(daughter, true);
> {code}
> we call assign on the daughters.  
> Now after this we again run the code below.
> {code}
> if (processDeadRegion(e.getKey(), e.getValue(),
> this.services.getAssignmentManager(),
> this.server.getCatalogTracker())) {
>   this.services.getAssignmentManager().assign(e.getKey(), true);
> {code}
> Now when the SSH scanned the META it had R1, D1 and D2.
> So as part of the above code, D1 and D2, which were assigned by 
> fixupDaughters, are again assigned by 
> {code}
> this.services.getAssignmentManager().assign(e.getKey(), true);
> {code}
> Thus leading to a zookeeper issue due to bad version and killing the master.
> The important part here is that the regions that were deleted are recreated, 
> which I think is more critical.
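The fix direction discussed in the comment above (checking TableState.ENABLED) can be sketched as a guard in the dead-server region walk. Everything below is hypothetical scaffolding for illustration, not the actual ServerShutdownHandler code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class DeadServerRegionFilter {
    // Hypothetical guard: only regions of tables still in ENABLED state
    // remain candidates for re-assignment. Regions of a table being
    // disabled/deleted are dropped, so neither fixupDaughters nor the
    // final assign loop can resurrect deleted regions.
    public static List<String> regionsToAssign(List<String> deadServerRegions,
                                               Set<String> enabledTables) {
        List<String> out = new ArrayList<>();
        for (String region : deadServerRegions) {
            // Assume region names are "<table>,<startKey>,<id>" as in .META.
            String table = region.split(",", 2)[0];
            if (enabledTables.contains(table)) {
                out.add(region);
            }
        }
        return out;
    }
}
```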





[jira] [Commented] (HBASE-5174) Coalesce aborted tasks in the TaskMonitor

2012-01-10 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183653#comment-13183653
 ] 

Zhihong Yu commented on HBASE-5174:
---

I think this issue is similar to HBASE-5136 in that 
TaskMonitor.get().createStatus() is called imprudently.
We can store MonitoredTask for flushcache() as a field in HRegion and reuse it.

> Coalesce aborted tasks in the TaskMonitor
> -
>
> Key: HBASE-5174
> URL: https://issues.apache.org/jira/browse/HBASE-5174
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 0.92.0
>Reporter: Jean-Daniel Cryans
> Fix For: 0.94.0, 0.92.1
>
>
> Some tasks can get repeatedly canceled, like flushing while splitting is 
> going on; in the logs it looks like this:
> {noformat}
> 2012-01-10 19:28:29,164 INFO 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region 
> test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. due to global heap 
> pressure
> 2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
> NOT flushing memstore for region 
> test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c., flushing=false, 
> writesEnabled=false
> 2012-01-10 19:28:29,164 DEBUG 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up 
> because memory above low water=1.6g
> 2012-01-10 19:28:29,164 INFO 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region 
> test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. due to global heap 
> pressure
> 2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
> NOT flushing memstore for region 
> test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c., flushing=false, 
> writesEnabled=false
> 2012-01-10 19:28:29,164 DEBUG 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up 
> because memory above low water=1.6g
> 2012-01-10 19:28:29,164 INFO 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region 
> test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. due to global heap 
> pressure
> 2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
> NOT flushing memstore for region 
> test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c., flushing=false, 
> writesEnabled=false
> {noformat}
> But in the TaskMonitor UI you'll get MAX_TASKS (1000) displayed on top of the 
> regions. Basically 1000x:
> {noformat}
> Tue Jan 10 19:28:29 UTC 2012  Flushing 
> test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. ABORTED (since 31sec 
> ago)   Not flushing since writes not enabled (since 31sec ago)
> {noformat}
> It's ugly and I'm sure some users will freak out seeing this; plus you have 
> to scroll all the way down to see your regions. Coalescing consecutive 
> aborted tasks seems like a good solution.
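Coalescing could be as simple as bumping a counter on the newest entry when a new aborted task repeats its description. A hypothetical sketch (not TaskMonitor's real API):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class CoalescingTaskLog {
    static final class Entry {
        final String description;
        int count = 1;
        Entry(String description) { this.description = description; }
    }

    private final Deque<Entry> entries = new ArrayDeque<>();

    // If the newest entry has the same description, coalesce by
    // incrementing its count instead of appending a duplicate; the UI
    // would then render one line like "... ABORTED (x1000)".
    public void addAborted(String description) {
        Entry last = entries.peekLast();
        if (last != null && last.description.equals(description)) {
            last.count++;
        } else {
            entries.addLast(new Entry(description));
        }
    }

    public int size() { return entries.size(); }

    public int lastCount() { return entries.peekLast().count; }
}
```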





[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked.")

2012-01-10 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183655#comment-13183655
 ] 

nkeywal commented on HBASE-5163:


I've got a fix using a variation of #2. Tested 100 times without any failure.
The advantage of #1 for me is that it eliminates a quite tricky behavior, but 
the fix would then be outside HBase...


> TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or 
> hadoop QA on trunk ("The directory is already locked.")
> -
>
> Key: HBASE-5163
> URL: https://issues.apache.org/jira/browse/HBASE-5163
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.94.0
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Minor
>
> The stack is typically:
> {noformat}
>  type="java.io.IOException">java.io.IOException: Cannot lock storage 
> /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3.
>  The directory is already locked.
>   at 
> org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
>   at 
> org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:290)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
> // ...
> {noformat}
> It can be reproduced without parallelization or without executing the other 
> tests in the class. It seems to fail about 5% of the time.
> This comes from the naming policy for the directories in 
> MiniDFSCluster#startDataNode. It depends on the number of nodes *currently* 
> in the cluster, and does not take into account previous starts/stops:
> {noformat}
>for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) {
>   if (manageDfsDirs) {
> File dir1 = new File(data_dir, "data"+(2*i+1));
> File dir2 = new File(data_dir, "data"+(2*i+2));
> dir1.mkdirs();
> dir2.mkdirs();
>   // [...]
> {noformat}
> This means that if we want to stop/start a datanode, we should always stop 
> the last one; otherwise the names will conflict. This test exhibits the behavior:
> {noformat}
>   @Test
>   public void testMiniDFSCluster_startDataNode() throws Exception {
> assertTrue( dfsCluster.getDataNodes().size() == 2 );
> // Works, as we kill the last datanode, we can now start a datanode
> dfsCluster.stopDataNode(1);
> dfsCluster
>   .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
> // Fails, as it's not the last datanode, the directory will conflict on
> //  creation
> dfsCluster.stopDataNode(0);
> try {
>   dfsCluster
> .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
>   fail("There should be an exception because the directory already 
> exists");
> } catch (IOException e) {
>   assertTrue( e.getMessage().contains("The directory is already 
> locked."));
>   LOG.info("Expected (!) exception caught " + e.getMessage());
> }
> // Works, as we kill the last datanode, we can now restart 2 datanodes
> // This makes us back with 2 nodes
> dfsCluster.stopDataNode(0);
> dfsCluster
>   .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
>   }
> {noformat}
> And then this behavior is randomly triggered in testLogRollOnDatanodeDeath 
> because when we do
> {noformat}
> DatanodeInfo[] pipeline = getPipeline(log);
> assertTrue(pipeline.length == fs.getDefaultReplication());
> {noformat}
> and then kill the datanodes in the pipeline, we will have:
>  - most of the time: pipeline = 1 & 2, so after killing 1&2 we can start a 
> new datanode that will reuse the available 2's directory.
>  - sometimes: pipeline = 1 & 3. In this case, when we try to launch the new 
> datanode, it fails because it wants to use the same directory as the still 
> alive '2'.
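The naming scheme can be simulated without HDFS at all. In this hypothetical model, datanode i owns directories data(2i+1) and data(2i+2), and a new node claims directories based only on the current node count, which is exactly what makes restarts of a non-last node collide:

```java
import java.util.Set;

public class MiniDfsDirNaming {
    // Model of MiniDFSCluster#startDataNodes' naming: the new node's
    // directories depend only on how many nodes are alive *now*
    // (i = curDatanodesNum), not on which directories earlier nodes
    // already claimed, so a replacement node can collide with a live one.
    public static boolean newNodeCollides(Set<Integer> liveDirs, int curDatanodesNum) {
        int dir1 = 2 * curDatanodesNum + 1;  // "data" + (2*i+1)
        int dir2 = 2 * curDatanodesNum + 2;  // "data" + (2*i+2)
        return liveDirs.contains(dir1) || liveDirs.contains(dir2);
    }
}
```

Starting from two nodes (directories 1-4): stop the last node and the replacement claims directories 3 and 4, which are now free; stop the first node instead and the replacement still claims 3 and 4, colliding with the surviving node.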

[jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked.")

2012-01-10 Thread nkeywal (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5163:
---

Attachment: 5163.patch

> TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or 
> hadoop QA on trunk ("The directory is already locked.")
> -
>
> Key: HBASE-5163
> URL: https://issues.apache.org/jira/browse/HBASE-5163
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.94.0
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Minor
> Attachments: 5163.patch
>
>
> The stack is typically:
> {noformat}
>  type="java.io.IOException">java.io.IOException: Cannot lock storage 
> /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3.
>  The directory is already locked.
>   at 
> org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
>   at 
> org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:290)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
> // ...
> {noformat}
> It can be reproduced without parallelization or without executing the other 
> tests in the class. It seems to fail about 5% of the time.
> This comes from the naming policy for the directories in 
> MiniDFSCluster#startDataNode. It depends on the number of nodes *currently* 
> in the cluster, and does not take into account previous starts/stops:
> {noformat}
>for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) {
>   if (manageDfsDirs) {
> File dir1 = new File(data_dir, "data"+(2*i+1));
> File dir2 = new File(data_dir, "data"+(2*i+2));
> dir1.mkdirs();
> dir2.mkdirs();
>   // [...]
> {noformat}
> This means that if we want to stop/start a datanode, we should always stop 
> the last one; otherwise the names will conflict. This test exhibits the behavior:
> {noformat}
>   @Test
>   public void testMiniDFSCluster_startDataNode() throws Exception {
> assertTrue( dfsCluster.getDataNodes().size() == 2 );
> // Works, as we kill the last datanode, we can now start a datanode
> dfsCluster.stopDataNode(1);
> dfsCluster
>   .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
> // Fails, as it's not the last datanode, the directory will conflict on
> //  creation
> dfsCluster.stopDataNode(0);
> try {
>   dfsCluster
> .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
>   fail("There should be an exception because the directory already 
> exists");
> } catch (IOException e) {
>   assertTrue( e.getMessage().contains("The directory is already 
> locked."));
>   LOG.info("Expected (!) exception caught " + e.getMessage());
> }
> // Works, as we kill the last datanode, we can now restart 2 datanodes
> // This makes us back with 2 nodes
> dfsCluster.stopDataNode(0);
> dfsCluster
>   .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
>   }
> {noformat}
> And then this behavior is randomly triggered in testLogRollOnDatanodeDeath 
> because when we do
> {noformat}
> DatanodeInfo[] pipeline = getPipeline(log);
> assertTrue(pipeline.length == fs.getDefaultReplication());
> {noformat}
> and then kill the datanodes in the pipeline, we will have:
>  - most of the time: pipeline = 1 & 2, so after killing 1&2 we can start a 
> new datanode that will reuse the available 2's directory.
>  - sometimes: pipeline = 1 & 3. In this case, when we try to launch the new 
> datanode, it fails because it wants to use the same directory as the still 
> alive '2'.
> There are two ways of fixing the test:
> 1) Fix the naming rule in MiniDFSCluster#startDataNode, for example to ensure 
> that the directory names will not be reused. Bu

[jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked.")

2012-01-10 Thread nkeywal (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5163:
---

Status: Patch Available  (was: Open)

> TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or 
> hadoop QA on trunk ("The directory is already locked.")
> -
>
> Key: HBASE-5163
> URL: https://issues.apache.org/jira/browse/HBASE-5163
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.94.0
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Minor
> Attachments: 5163.patch
>
>
> The stack is typically:
> {noformat}
>  type="java.io.IOException">java.io.IOException: Cannot lock storage 
> /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3.
>  The directory is already locked.
>   at 
> org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
>   at 
> org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:290)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
> // ...
> {noformat}
> It can be reproduced without parallelization or without executing the other 
> tests in the class. It seems to fail about 5% of the time.
> This comes from the naming policy for the directories in 
> MiniDFSCluster#startDataNode. It depends on the number of nodes *currently* 
> in the cluster, and does not take into account previous starts/stops:
> {noformat}
>for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) {
>   if (manageDfsDirs) {
> File dir1 = new File(data_dir, "data"+(2*i+1));
> File dir2 = new File(data_dir, "data"+(2*i+2));
> dir1.mkdirs();
> dir2.mkdirs();
>   // [...]
> {noformat}
> This means that if we want to stop/start a datanode, we should always stop 
> the last one; otherwise the names will conflict. This test exhibits the behavior:
> {noformat}
>   @Test
>   public void testMiniDFSCluster_startDataNode() throws Exception {
> assertTrue( dfsCluster.getDataNodes().size() == 2 );
> // Works, as we kill the last datanode, we can now start a datanode
> dfsCluster.stopDataNode(1);
> dfsCluster
>   .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
> // Fails, as it's not the last datanode, the directory will conflict on
> //  creation
> dfsCluster.stopDataNode(0);
> try {
>   dfsCluster
> .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
>   fail("There should be an exception because the directory already 
> exists");
> } catch (IOException e) {
>   assertTrue( e.getMessage().contains("The directory is already 
> locked."));
>   LOG.info("Expected (!) exception caught " + e.getMessage());
> }
> // Works, as we kill the last datanode, we can now restart 2 datanodes
> // This makes us back with 2 nodes
> dfsCluster.stopDataNode(0);
> dfsCluster
>   .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
>   }
> {noformat}
> And then this behavior is randomly triggered in testLogRollOnDatanodeDeath 
> because when we do
> {noformat}
> DatanodeInfo[] pipeline = getPipeline(log);
> assertTrue(pipeline.length == fs.getDefaultReplication());
> {noformat}
> and then kill the datanodes in the pipeline, we will have:
>  - most of the time: pipeline = 1 & 2, so after killing 1&2 we can start a 
> new datanode that will reuse the available 2's directory.
>  - sometimes: pipeline = 1 & 3. In this case, when we try to launch the new 
> datanode, it fails because it wants to use the same directory as the still 
> alive '2'.
> There are two ways of fixing the test:
> 1) Fix the naming rule in MiniDFSCluster#startDataNode, for example to ensure 
> that the directory names will not be reused.

[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked.")

2012-01-10 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183682#comment-13183682
 ] 

Zhihong Yu commented on HBASE-5163:
---

+1 on patch.
Looped over the test and didn't encounter test failure.

> TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or 
> hadoop QA on trunk ("The directory is already locked.")
> -
>
> Key: HBASE-5163
> URL: https://issues.apache.org/jira/browse/HBASE-5163
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.94.0
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Minor
> Attachments: 5163.patch
>
>
> The stack is typically:
> {noformat}
>  type="java.io.IOException">java.io.IOException: Cannot lock storage 
> /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3.
>  The directory is already locked.
>   at 
> org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
>   at 
> org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:290)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
> // ...
> {noformat}
> It can be reproduced without parallelization or without executing the other 
> tests in the class. It seems to fail about 5% of the time.
> This comes from the naming policy for the directories in 
> MiniDFSCluster#startDataNode. It depends on the number of nodes *currently* 
> in the cluster, and does not take into account previous starts/stops:
> {noformat}
>for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) {
>   if (manageDfsDirs) {
> File dir1 = new File(data_dir, "data"+(2*i+1));
> File dir2 = new File(data_dir, "data"+(2*i+2));
> dir1.mkdirs();
> dir2.mkdirs();
>   // [...]
> {noformat}
> This means that if we want to stop/start a datanode, we should always stop 
> the last one; otherwise the names will conflict. This test exhibits the behavior:
> {noformat}
>   @Test
>   public void testMiniDFSCluster_startDataNode() throws Exception {
> assertTrue( dfsCluster.getDataNodes().size() == 2 );
> // Works, as we kill the last datanode, we can now start a datanode
> dfsCluster.stopDataNode(1);
> dfsCluster
>   .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
> // Fails, as it's not the last datanode, the directory will conflict on
> //  creation
> dfsCluster.stopDataNode(0);
> try {
>   dfsCluster
> .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
>   fail("There should be an exception because the directory already 
> exists");
> } catch (IOException e) {
>   assertTrue( e.getMessage().contains("The directory is already 
> locked."));
>   LOG.info("Expected (!) exception caught " + e.getMessage());
> }
> // Works, as we kill the last datanode, we can now restart 2 datanodes
> // This makes us back with 2 nodes
> dfsCluster.stopDataNode(0);
> dfsCluster
>   .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
>   }
> {noformat}
> And then this behavior is randomly triggered in testLogRollOnDatanodeDeath 
> because when we do
> {noformat}
> DatanodeInfo[] pipeline = getPipeline(log);
> assertTrue(pipeline.length == fs.getDefaultReplication());
> {noformat}
> and then kill the datanodes in the pipeline, we will have:
>  - most of the time: pipeline = 1 & 2, so after killing 1&2 we can start a 
> new datanode that will reuse the available 2's directory.
>  - sometimes: pipeline = 1 & 3. In this case, when we try to launch the new 
> datanode, it fails because it wants to use the same directory as the still 
> alive '2'.
> There are two ways of fixing the test:
> 1) Fix the naming rule in MiniDFSCluster#startDataNode, for example to 
> ensure that the directory names will not be reused.

[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked.")

2012-01-10 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183691#comment-13183691
 ] 

Hadoop QA commented on HBASE-5163:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510114/5163.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -147 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 78 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.replication.TestReplication
  org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/723//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/723//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/723//console

This message is automatically generated.

> TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or 
> hadoop QA on trunk ("The directory is already locked.")
> -
>
> Key: HBASE-5163
> URL: https://issues.apache.org/jira/browse/HBASE-5163
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.94.0
> Environment: all
>Reporter: nkeywal
>Assignee: nkeywal
>Priority: Minor
> Attachments: 5163.patch
>
>
> The stack is typically:
> {noformat}
> <error type="java.io.IOException">java.io.IOException: Cannot lock storage 
> /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3.
>  The directory is already locked.
>   at 
> org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
>   at 
> org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.&lt;init&gt;(DataNode.java:290)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
> // ...
> {noformat}
> It can be reproduced without parallelization or without executing the other 
> tests in the class. It seems to fail about 5% of the time.
> This comes from the naming policy for the directories in 
> MiniDFSCluster#startDataNode. It depends on the number of nodes *currently* 
> in the cluster, and does not take into account previous starts/stops:
> {noformat}
>for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) {
>   if (manageDfsDirs) {
> File dir1 = new File(data_dir, "data"+(2*i+1));
> File dir2 = new File(data_dir, "data"+(2*i+2));
> dir1.mkdirs();
> dir2.mkdirs();
>   // [...]
> {noformat}
> This means that if we want to stop/start a datanode, we should always stop 
> the last one; otherwise the names will conflict. This test exhibits the behavior:
> {noformat}
>   @Test
>   public void testMiniDFSCluster_startDataNode() throws Exception {
> assertTrue( dfsCluster.getDataNodes().size() == 2 );
> // Works, as we kill the last datanode, we can now start a datanode
> dfsCluster.stopDataNode(1);
> dfsCluster
>   .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
> // Fails, as it's not the last datanode, the directory will conflict on
> //  creation
> dfsCluster.stopDataNode(0);
> t
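The naming policy quoted above can be sketched outside of HDFS. The class and method names below are hypothetical; only the `data" + (2*i+1)` arithmetic comes from the quoted `MiniDFSCluster` loop:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch (not MiniDFSCluster itself) of the naming policy above:
// directory names depend only on the number of nodes *currently* in the cluster.
public class DirNamingSketch {

    // Mirrors the quoted loop: node i claims data(2*i+1) and data(2*i+2).
    static List<String> dirsFor(int curDatanodesNum, int numDataNodes) {
        List<String> dirs = new ArrayList<>();
        for (int i = curDatanodesNum; i < curDatanodesNum + numDataNodes; i++) {
            dirs.add("data" + (2 * i + 1));
            dirs.add("data" + (2 * i + 2));
        }
        return dirs;
    }

    public static void main(String[] args) {
        // Two nodes at startup hold data1..data4.
        System.out.println(dirsFor(0, 2)); // [data1, data2, data3, data4]
        // Kill the *first* node: the survivor still locks data3/data4, but the
        // next start computes names from curDatanodesNum = 1 and collides:
        System.out.println(dirsFor(1, 1)); // [data3, data4]
    }
}
```

This is why the failure only shows when the surviving datanode is not the last-started one: the recomputed names land on directories a live node still holds locked.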

[jira] [Commented] (HBASE-5134) Remove getRegionServerWithoutRetries and getRegionServerWithRetries from HConnection Interface

2012-01-10 Thread Lars Hofhansl (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183704#comment-13183704
 ] 

Lars Hofhansl commented on HBASE-5134:
--

+1 on v6

> Remove getRegionServerWithoutRetries and getRegionServerWithRetries from 
> HConnection Interface
> --
>
> Key: HBASE-5134
> URL: https://issues.apache.org/jira/browse/HBASE-5134
> Project: HBase
>  Issue Type: Improvement
>Reporter: stack
>Assignee: stack
> Fix For: 0.94.0
>
> Attachments: 5134-v2.txt, 5134-v3.txt, 5134-v4.txt, 5134-v5.txt, 
> 5134-v6.txt, 5134-v6.txt
>
>
> It's broken having these meta methods in HConnection.  They take 
> ServerCallables which themselves have HConnections, inevitably.  It makes for 
> a tangle in the model and frustrates being able to do mocked implementations 
> of HConnection.  These methods better belong in something like 
> HConnectionManager, or elsewhere altogether.
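The tangle described above can be sketched with deliberately simplified, hypothetical signatures (the real classes live in org.apache.hadoop.hbase.client and carry much more state): the interface method takes a callable that itself holds a connection, so even a trivial mock must drive the callable machinery.

```java
// Simplified, illustrative sketch of the circular dependency; not the real
// HBase types or signatures.
public class ConnectionTangleSketch {

    public interface HConnection {
        <T> T getRegionServerWithRetries(ServerCallable<T> callable) throws Exception;
    }

    public static abstract class ServerCallable<T> {
        final HConnection connection; // callable -> connection -> callable ...
        public ServerCallable(HConnection connection) { this.connection = connection; }
        public abstract T call() throws Exception;
    }

    public static void main(String[] args) throws Exception {
        // A mock HConnection cannot be a dumb stub: it must implement the
        // retry plumbing itself, because the callable references it back.
        HConnection mock = new HConnection() {
            public <T> T getRegionServerWithRetries(ServerCallable<T> c) throws Exception {
                return c.call(); // no retries; just invoke once
            }
        };
        String result = mock.getRegionServerWithRetries(
            new ServerCallable<String>(mock) {
                public String call() { return "ok"; }
            });
        System.out.println(result); // ok
    }
}
```

Moving such methods out of the interface lets a mock HConnection stay a plain data stub.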

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5121) MajorCompaction may affect scan's correctness

2012-01-10 Thread Lars Hofhansl (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-5121:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to 0.92 and trunk (5121-suggest.txt)

> MajorCompaction may affect scan's correctness
> -
>
> Key: HBASE-5121
> URL: https://issues.apache.org/jira/browse/HBASE-5121
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
>Reporter: chunhui shen
>Assignee: chunhui shen
>Priority: Critical
> Fix For: 0.94.0, 0.92.1
>
> Attachments: 5121-0.92.txt, 5121-suggest.txt, 
> 5121-trunk-combined.txt, 5121.90, hbase-5121-testcase.patch, 
> hbase-5121.patch, hbase-5121v2.patch
>
>
> In our test, there are KeyValues for two families of one row.
> But we found an infrequent problem in scan's next() if a 
> majorCompaction happens concurrently.
> In two consecutive client calls to scan.next():
> 1. The first next() returns a result where family A is null.
> 2. The second next() returns a result where family B is null.
> The two next() results have the same row.
> If there are more families, I think the scenario will be even stranger...
> We found the reason is that storescanner.peek() is changed after 
> majorCompaction if there are delete-type KeyValues.
> This change means the PriorityQueue backing RegionScanner's heap 
> is no longer guaranteed to be sorted.
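The heap problem described above is a general property of java.util.PriorityQueue, which can be shown with a hypothetical stand-in class (this is not HBase's StoreScanner, only an illustration of the invariant that breaks):

```java
import java.util.PriorityQueue;

// Illustrative sketch: if an element's comparison key changes while it sits
// inside a PriorityQueue, the heap is never re-ordered, so poll() can return
// a stale head instead of the true minimum.
public class HeapInvariantSketch {

    public static class Scanner implements Comparable<Scanner> {
        public String peek; // stands in for StoreScanner.peek()
        public Scanner(String peek) { this.peek = peek; }
        public int compareTo(Scanner o) { return peek.compareTo(o.peek); }
    }

    public static void main(String[] args) {
        PriorityQueue<Scanner> heap = new PriorityQueue<>();
        Scanner a = new Scanner("row1/famA");
        Scanner b = new Scanner("row1/famB");
        heap.add(a);
        heap.add(b);
        // Simulate a major compaction moving a's peek past b without the
        // heap's knowledge: b is now the smallest element, but a stays head.
        a.peek = "row2/famA";
        System.out.println(heap.poll().peek); // prints row2/famA, not row1/famB
    }
}
```

In the bug, the scanner whose peek silently advanced past its siblings is polled first, which is how one next() can see only family A and the following next() only family B for the same row.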





[jira] [Updated] (HBASE-5121) MajorCompaction may affect scan's correctness

2012-01-10 Thread Lars Hofhansl (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-5121:
-

Attachment: 5121-0.92.txt

Patch against 0.92

> MajorCompaction may affect scan's correctness
> -
>
> Key: HBASE-5121
> URL: https://issues.apache.org/jira/browse/HBASE-5121
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.90.4
>Reporter: chunhui shen
>Assignee: chunhui shen
>Priority: Critical
> Fix For: 0.94.0, 0.92.1
>
> Attachments: 5121-0.92.txt, 5121-suggest.txt, 
> 5121-trunk-combined.txt, 5121.90, hbase-5121-testcase.patch, 
> hbase-5121.patch, hbase-5121v2.patch
>
>
> In our test, there are KeyValues for two families of one row.
> But we found an infrequent problem in scan's next() if a 
> majorCompaction happens concurrently.
> In two consecutive client calls to scan.next():
> 1. The first next() returns a result where family A is null.
> 2. The second next() returns a result where family B is null.
> The two next() results have the same row.
> If there are more families, I think the scenario will be even stranger...
> We found the reason is that storescanner.peek() is changed after 
> majorCompaction if there are delete-type KeyValues.
> This change means the PriorityQueue backing RegionScanner's heap 
> is no longer guaranteed to be sorted.




