[jira] [Updated] (HBASE-14932) bulkload fails because file not found

2016-02-29 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-14932:
---
Status: Patch Available  (was: Open)

> bulkload fails because file not found
> -
>
> Key: HBASE-14932
> URL: https://issues.apache.org/jira/browse/HBASE-14932
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.98.10
>Reporter: Shuaifeng Zhou
>Assignee: Alicia Ying Shu
> Fix For: 0.98.18
>
> Attachments: HBASE-14932-0.98.patch
>
>
> When a doBulkLoad call is made, one call may contain several HFiles to load, 
> but the call may time out while the regionserver is loading the files, and 
> the client will retry the load.
> While the client retries, the regionserver may still be running the original 
> load operation. If some files have already loaded successfully, the retry 
> call throws a FileNotFoundException, which makes the client retry again and 
> again until the retries are exhausted and the bulk load fails.
> When this happens, some files have actually been loaded successfully, so the 
> result is an inconsistent state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14735) Region may grow too big and can not be split

2016-02-22 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158469#comment-15158469
 ] 

Shuaifeng Zhou commented on HBASE-14735:


Our clusters were fixed this way, and the problem has not happened again in 
the past few months.

> Region may grow too big and can not be split
> 
>
> Key: HBASE-14735
> URL: https://issues.apache.org/jira/browse/HBASE-14735
> Project: HBase
>  Issue Type: Bug
>  Components: Compaction, regionserver
>Affects Versions: 1.1.2, 0.98.15
>Reporter: Shuaifeng Zhou
>Assignee: Shuaifeng Zhou
> Attachments: 14735-0.98.patch, 14735-branch-1.1.patch, 
> 14735-branch-1.2.patch, 14735-branch-1.2.patch, 14735-master (2).patch, 
> 14735-master.patch, 14735-master.patch
>
>
> When a compaction completes, there may still be many storefiles in the store. 
> If CompactPriority <= 0, compactSplitThread issues a "Recursive enqueue" 
> compaction request instead of requesting a split:
> {code:title=CompactSplitThread.java|borderStyle=solid}
> if (completed) {
>   // degenerate case: blocked regions require recursive enqueues
>   if (store.getCompactPriority() <= 0) {
>     requestSystemCompaction(region, store, "Recursive enqueue");
>   } else {
>     // see if the compaction has caused us to exceed max region size
>     requestSplit(region);
>   }
> }
> {code}
> But in some situations the "recursive enqueue" request may return null and 
> not build a new compaction runner. For example, when another compaction of 
> the same region is running, compaction selection excludes all files older 
> than the newest file currently being compacted, so the "recursive enqueue" 
> request may not find enough files to select. When this happens, no split is 
> triggered. If the input load is high enough, compactions are always running 
> on the region, and a split is never triggered.
> This happened in our cluster, and a huge region of more than 400GB with 100+ 
> storefiles appeared. The version is 0.98.10, and trunk has the same 
> problem.
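
The selection behavior described in the quoted report can be sketched roughly 
as follows. This is a minimal illustration, not HBase's actual selection code: 
the class CompactionSelectionSketch, the select method and the minFiles 
parameter are hypothetical names, and the sketch assumes the store's files are 
given ordered from oldest to newest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the selection rule; not the HBase implementation. */
class CompactionSelectionSketch {

  /**
   * Returns the files that are newer than the newest file already being
   * compacted, or an empty list when fewer than minFiles remain. An empty
   * result stands in for the "request returns null" case in the report.
   */
  static List<String> select(List<String> filesOldestFirst,
                             Set<String> compacting,
                             int minFiles) {
    // Find the position of the newest file that is part of a running compaction.
    int newestCompacting = -1;
    for (int i = 0; i < filesOldestFirst.size(); i++) {
      if (compacting.contains(filesOldestFirst.get(i))) {
        newestCompacting = i;
      }
    }
    // Everything up to and including that position is excluded.
    List<String> eligible = new ArrayList<String>(
        filesOldestFirst.subList(newestCompacting + 1, filesOldestFirst.size()));
    if (eligible.size() < minFiles) {
      return new ArrayList<String>();
    }
    return eligible;
  }
}
```

With five files a..e and a running compaction over b, c and d, only e remains 
eligible, so a request that needs at least three files selects nothing, and 
under the quoted logic no split is requested either.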





[jira] [Updated] (HBASE-14932) bulkload fails because file not found

2016-02-22 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-14932:
---
Attachment: HBASE-14932-0.98.patch

One solution is for HRegion.bulkLoadHFiles to ignore FileNotFoundException 
and continue.
A patch for 0.98 is attached; please review it.
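
The idea of skipping missing files can be sketched as below. This is only an 
illustration of the approach, not the attached patch and not the HBase API: 
TolerantBulkLoadSketch, FileLoader and loadAll are hypothetical names, and the 
sketch assumes that a missing source file means an earlier attempt already 
loaded it.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the HBase API and not the attached patch. */
class TolerantBulkLoadSketch {

  /** Stand-in for whatever moves one HFile into the region's store. */
  interface FileLoader {
    void load(String hfilePath) throws FileNotFoundException;
  }

  /**
   * Tries every file. A missing file is recorded and skipped instead of
   * failing the whole call. Returns the paths that were skipped.
   */
  static List<String> loadAll(List<String> hfilePaths, FileLoader loader) {
    List<String> skipped = new ArrayList<String>();
    for (String path : hfilePaths) {
      try {
        loader.load(path);
      } catch (FileNotFoundException e) {
        // Assumption: the file is gone because an earlier attempt loaded it.
        skipped.add(path);
      }
    }
    return skipped;
  }
}
```

One trade-off of this approach is that a file that is missing for some other 
reason would also be skipped silently, so returning (or logging) the skipped 
paths lets the caller check them.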



[jira] [Created] (HBASE-14932) bulkload fails because file not found

2015-12-05 Thread Shuaifeng Zhou (JIRA)
Shuaifeng Zhou created HBASE-14932:
--

 Summary: bulkload fails because file not found
 Key: HBASE-14932
 URL: https://issues.apache.org/jira/browse/HBASE-14932
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.98.10
Reporter: Shuaifeng Zhou
 Fix For: 0.98.17


When a doBulkLoad call is made, one call may contain several HFiles to load, 
but the call may time out while the regionserver is loading the files, and the 
client will retry the load.
While the client retries, the regionserver may still be running the original 
load operation. If some files have already loaded successfully, the retry call 
throws a FileNotFoundException, which makes the client retry again and again 
until the retries are exhausted and the bulk load fails.
When this happens, some files have actually been loaded successfully, so the 
result is an inconsistent state.
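
The failure sequence can be reproduced with a small single-threaded model. 
This is a simplified sketch under stated assumptions, not HBase code: 
BulkLoadRetrySketch and its fields are hypothetical, loading a file is modeled 
as moving it from a staging set to a store set, and the limit argument stands 
for how many files the regionserver had finished when the client gave up 
waiting.

```java
import java.io.FileNotFoundException;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of the retry problem; not HBase code. */
class BulkLoadRetrySketch {
  final Set<String> staging = new LinkedHashSet<String>(); // files waiting to be loaded
  final Set<String> store = new LinkedHashSet<String>();   // files already loaded

  BulkLoadRetrySketch(List<String> stagedFiles) {
    staging.addAll(stagedFiles);
  }

  /**
   * Moves at most limit files from staging into the store, in order.
   * Throws FileNotFoundException as soon as a requested file is no longer
   * in staging, which is what a retry sees after a partial first attempt.
   */
  void bulkLoad(List<String> files, int limit) throws FileNotFoundException {
    int done = 0;
    for (String f : files) {
      if (done == limit) {
        return; // models the point at which the client's call timed out
      }
      if (!staging.remove(f)) {
        throw new FileNotFoundException(f);
      }
      store.add(f);
      done++;
    }
  }
}
```

Calling bulkLoad with three files and a limit of 1, then retrying with the same 
three files, throws FileNotFoundException for the first file while the store 
already holds it, which is the inconsistent state described above.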





[jira] [Created] (HBASE-14931) Active master switches may cause region close forever

2015-12-04 Thread Shuaifeng Zhou (JIRA)
Shuaifeng Zhou created HBASE-14931:
--

 Summary: Active master switches may cause region close forever
 Key: HBASE-14931
 URL: https://issues.apache.org/jira/browse/HBASE-14931
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.98.10
Reporter: Shuaifeng Zhou
Priority: Critical
 Fix For: 0.98.17


The master web page (port 60010) shows a region as online on one RS, but 
accessing data in the region throws NotServingRegionException. After looking 
through the source code and logs, I found that this happens because the active 
master switched while the region was opening:
1, master1 opens region 'region1': it sends an open-region request to the RS 
and creates the node in ZK
2, master1 stops
3, master2 becomes the active master
4, master2 obtains the status of all regions; the status of 'region1' is offline
5, the RS opens 'region1', changes the node in ZK to opened, and sends a 
message to master2
6, master2 receives RS_ZK_REGION_OPENED, but the region state is not 
PENDING_OPEN or OPENING, so it sends an unassign to the RS and 'region1' is 
closed
{code:title=AssignmentManager.java|borderStyle=solid}
case RS_ZK_REGION_OPENED:
  // Should see OPENED after OPENING but possible after PENDING_OPEN.
  if (regionState == null
      || !regionState.isPendingOpenOrOpeningOnServer(sn)) {
    LOG.warn("Received OPENED for " + prettyPrintedRegionName
      + " from " + sn + " but the region isn't PENDING_OPEN/OPENING here: "
      + regionStates.getRegionState(encodedName));

    if (regionState != null) {
      // Close it without updating the internal region states,
      // so as not to create double assignments in unlucky scenarios
      // mentioned in OpenRegionHandler#process
      unassign(regionState.getRegion(), null, -1, null, false, sn);
    }
    return;
  }
{code}
7, master2 continues processing the region information left from when master1 
stopped, finds that the status of 'region1' in ZK is opened, and updates its 
in-memory status to opened.
8, from then on, 'region1' is shown as opened on the master status web page, 
but it is not open on any regionserver.
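
The sequence in steps 4 to 8 can be modeled with a few fields to show how the 
two views diverge. This is a deliberately simplified sketch, not the 
AssignmentManager logic: MasterFailoverRaceSketch, its State enum and its 
method names are hypothetical, and ZK is reduced to a single boolean.

```java
/** Hypothetical model of steps 4 to 8; not the AssignmentManager code. */
class MasterFailoverRaceSketch {
  enum State { OFFLINE, PENDING_OPEN, OPENING, OPEN }

  State masterView = State.OFFLINE;     // step 4: what master2 believes
  boolean zkNodeOpened = false;         // state of the region's node in ZK
  boolean onlineOnRegionServer = false; // whether the RS actually serves the region

  /** Step 5: the RS finishes opening and marks the ZK node as opened. */
  void regionServerOpens() {
    onlineOnRegionServer = true;
    zkNodeOpened = true;
  }

  /** Step 6: master2 handles the OPENED event. */
  void masterHandlesOpenedEvent() {
    if (masterView != State.PENDING_OPEN && masterView != State.OPENING) {
      // Unexpected state: master2 sends unassign and the RS closes the region.
      onlineOnRegionServer = false;
      return;
    }
    masterView = State.OPEN;
  }

  /** Step 7: failover processing trusts the ZK node left behind. */
  void masterProcessesFailover() {
    if (zkNodeOpened) {
      masterView = State.OPEN;
    }
  }

  /** Step 8: the master shows OPEN while no RS serves the region. */
  boolean inconsistent() {
    return masterView == State.OPEN && !onlineOnRegionServer;
  }
}
```

Running regionServerOpens, masterHandlesOpenedEvent and masterProcessesFailover 
in that order leaves inconsistent() returning true, matching the state reported 
in step 8.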





[jira] [Commented] (HBASE-14735) Region may grow too big and can not be split

2015-11-30 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15031504#comment-15031504
 ] 

Shuaifeng Zhou commented on HBASE-14735:


Hi, [~stack]
We are running with this patch applied on our clusters. We have many clusters, 
some on 0.94 and some on 0.98. We are currently upgrading, and the upgrade is 
not finished. This patch really works. 
In 0.98 there is no reference problem, but in 0.94 there is: in 0.98, if there 
is any reference, the compaction is forced to be a major one, but in 0.94 it 
is not. Both versions have the huge region problem. Because 0.94 is too old 
and is about to be upgraded, I haven't provided a patch for 0.94.
Below are some of the du and lsr results for one example on 0.94 (after 
splitting once, there is still a huge 200G+ region with a file of more than 
100G that is always being selected during compaction, and it still has 2 
references after several compactions). The configured region size is 40GB.
du:
{noformat}
32796614610   hdfs://hm101:9000/hbase/TAB_INTERESTING/effa8658177d023f4001b5d169bca149
24719467342   hdfs://hm101:9000/hbase/TAB_INTERESTING/f0819cb446cbdf785fb85638553605c5
210031594622  hdfs://hm101:9000/hbase/TAB_INTERESTING/f0c180d817bd74e1743c56f6478ac019
40210595441   hdfs://hm101:9000/hbase/TAB_INTERESTING/f0e08b1a4b1169a7b1f537c068a577bb
50824015435   hdfs://hm101:9000/hbase/TAB_INTERESTING/f0e710bb05dbc394d11524fa6dc34016
21566277612   hdfs://hm101:9000/hbase/TAB_INTERESTING/f11affc0f157e8f4cacce13c6faefe52
{noformat}
lsr:
{noformat}
-rw-r--r--   2 root supergroup   4181396311 2015-11-23 09:48 /hbase/TAB_INTERESTING/f0c180d817bd74e1743c56f6478ac019/F/1f5b326bbbe64b178ce98783fe8223af
-rw-r--r--   2 root supergroup   4128995550 2015-11-23 10:03 /hbase/TAB_INTERESTING/f0c180d817bd74e1743c56f6478ac019/F/29289c0ea8a746a284d585a928611d65
-rw-r--r--   2 root supergroup   4137771163 2015-11-22 08:05 /hbase/TAB_INTERESTING/f0c180d817bd74e1743c56f6478ac019/F/3458f12dd5d842fa8629ade59fbc5443
-rw-r--r--   2 root supergroup   4122308215 2015-11-23 10:08 /hbase/TAB_INTERESTING/f0c180d817bd74e1743c56f6478ac019/F/4ecdd970f2d845d680d5273b13a4d463
-rw-r--r--   2 root supergroup           74 2015-11-22 01:34 /hbase/TAB_INTERESTING/f0c180d817bd74e1743c56f6478ac019/F/5b4fa0edcd37427cadc50602b0a0758a.78b89f6a03d5e5f61e7e49b2cb1bb0a8
-rw-r--r--   2 root supergroup 122997494766 2015-11-22 22:22 /hbase/TAB_INTERESTING/f0c180d817bd74e1743c56f6478ac019/F/6517dbf39bd449c1ae97cdcc0f341100
-rw-r--r--   2 root supergroup   4121185787 2015-11-22 07:57 /hbase/TAB_INTERESTING/f0c180d817bd74e1743c56f6478ac019/F/72c864a6b36148b98a26f6e9fd52e89c
-rw-r--r--   2 root supergroup   4131467137 2015-11-23 09:58 /hbase/TAB_INTERESTING/f0c180d817bd74e1743c56f6478ac019/F/74ff9ced889f43839fa520dcaba1744a
-rw-r--r--   2 root supergroup   1963236714 2015-11-23 10:34 /hbase/TAB_INTERESTING/f0c180d817bd74e1743c56f6478ac019/F/75e5ab25679e4bc6bc7490577e90b166
-rw-r--r--   2 root supergroup   4141563183 2015-11-23 09:54 /hbase/TAB_INTERESTING/f0c180d817bd74e1743c56f6478ac019/F/7c5d0db31f92424fa06e6070dc4d0817
{noformat}
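
To put the du figures next to the configured region size, the byte counts can 
be converted and compared as in the sketch below. RegionSizeCheckSketch and its 
methods are hypothetical helpers, and the sketch assumes that "40GB" means 
40 GiB (40 * 1024^3 bytes).

```java
/** Hypothetical helper for reading the du output; not part of HBase. */
class RegionSizeCheckSketch {
  static final long GIB = 1024L * 1024L * 1024L;

  /** Converts a byte count, as printed by du, to GiB. */
  static double toGiB(long bytes) {
    return (double) bytes / GIB;
  }

  /** True when the region directory is larger than the configured maximum. */
  static boolean exceedsMax(long regionBytes, long maxRegionGiB) {
    return regionBytes > maxRegionGiB * GIB;
  }
}
```

Under that assumption, the region f0c180d817bd74e1743c56f6478ac019 at 
210031594622 bytes is about 195.6 GiB, nearly five times the configured size, 
while the first region in the du list (32796614610 bytes) is still below it.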

[jira] [Commented] (HBASE-14735) Region may grow too big and can not be split

2015-11-26 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029342#comment-15029342
 ] 

Shuaifeng Zhou commented on HBASE-14735:


Thanks a lot for the explanation, [~stack]
We hit this problem. The huge region cannot be compacted down to a few files 
because of the high input load, and if it cannot be split, the input load 
stays on that region and the situation gets worse and worse.
If the region is split in two, the input load is split and balanced across the 
2 children.
Your worry about the patch is reasonable; we also hit the reference file 
problem. After we applied the patch on our cluster, a huge region still could 
not be split because there was a reference file that, for some reason, was 
never selected for compaction, and we sent a major compaction request to solve 
the problem. The patch may not fix an existing huge region, but it can prevent 
one.
In the patch, we respect the rule that compaction comes first, but give the 
region a chance to split if it is too big. 
If a region is split before it grows too big, compacting the children is 
easier, and the references can be cleaned up in time, before the children grow 
too big. 
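
One way to read "compaction comes first, but give a chance to split" is the 
decision function below. This is an illustration only and not the attached 
patch: SplitOrCompactSketch, Action and afterCompaction are hypothetical 
names, compactionEnqueued stands for "the recursive request actually produced 
a compaction", and regionTooBig stands for whatever size check the split 
policy applies.

```java
/** Hypothetical decision sketch; not the attached patch. */
class SplitOrCompactSketch {
  enum Action { COMPACT, SPLIT, NONE }

  /**
   * Decides what to do after a compaction completes.
   * Compaction keeps priority while the store is blocked and a new compaction
   * was really enqueued; otherwise the region gets a chance to split.
   */
  static Action afterCompaction(int compactPriority,
                                boolean compactionEnqueued,
                                boolean regionTooBig) {
    if (compactPriority <= 0 && compactionEnqueued) {
      return Action.COMPACT; // compaction comes first
    }
    // Either the store is not blocked, or the recursive request selected nothing.
    return regionTooBig ? Action.SPLIT : Action.NONE;
  }
}
```

The difference from the quoted CompactSplitThread logic is the middle case: 
when the store is blocked but the recursive request selects nothing, an 
oversized region can still be split instead of waiting indefinitely.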



[jira] [Commented] (HBASE-14735) Region may grow too big and can not be split

2015-11-21 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020332#comment-15020332
 ] 

Shuaifeng Zhou commented on HBASE-14735:


priority <= 0 means the store has more storefiles than 
blockingFileCount (default = 7), and memstore flushes on this region will be 
blocked for a while.
So I think checking for priority <= 0 is all right, isn't it?

But the issue is that if we don't split the region, it may grow too big. 
So the recursive enqueue is all right, but if the region is too big, it should 
be split.



[jira] [Commented] (HBASE-14735) Region may grow too big and can not be split

2015-11-15 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006181#comment-15006181
 ] 

Shuaifeng Zhou commented on HBASE-14735:


lgtm,  no zombie tests this time



[jira] [Updated] (HBASE-14735) Region may grow too big and can not be split

2015-11-11 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-14735:
---
Attachment: 14735-branch-1.2.patch

Reattached the patch for branch-1.2.



[jira] [Commented] (HBASE-14735) Region may grow too big and can not be split

2015-11-10 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999831#comment-14999831
 ] 

Shuaifeng Zhou commented on HBASE-14735:


Should this problem block the JIRA issue?
I haven't seen it before; what can I do to proceed?
Thanks, [~stack] [~tedyu]



[jira] [Commented] (HBASE-14735) Region may grow too big and can not be split

2015-11-04 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990880#comment-14990880
 ] 

Shuaifeng Zhou commented on HBASE-14735:


Yes, my mistake.



[jira] [Updated] (HBASE-14735) Region may grow too big and can not be split

2015-11-04 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-14735:
---
Attachment: 14735-branch-1.2.patch
14735-branch-1.1.patch
14735-0.98.patch

Patches for branches 0.98, 1.1 and 1.2 are attached.

> Region may grow too big and can not be split
> 
>
> Key: HBASE-14735
> URL: https://issues.apache.org/jira/browse/HBASE-14735
> Project: HBase
>  Issue Type: Bug
>  Components: Compaction, regionserver
>Affects Versions: 1.1.2, 0.98.15
>Reporter: Shuaifeng Zhou
>Assignee: Shuaifeng Zhou
> Attachments: 14735-0.98.patch, 14735-branch-1.1.patch, 
> 14735-branch-1.2.patch, 14735-master (2).patch, 14735-master.patch, 
> 14735-master.patch
>
>
> When a compaction completes, there may still be many storefiles in the store 
> and the compact priority is <= 0. In that case CompactSplitThread issues a 
> "Recursive enqueue" compaction request instead of requesting a split:
> {code:title=CompactSplitThread.java|borderStyle=solid}
> if (completed) {
>   // degenerate case: blocked regions require recursive enqueues
>   if (store.getCompactPriority() <= 0) {
>     requestSystemCompaction(region, store, "Recursive enqueue");
>   } else {
>     // see if the compaction has caused us to exceed max region size
>     requestSplit(region);
>   }
> }
> {code}
> In some situations, however, the "Recursive enqueue" request may return null 
> and not create a new compaction runner. For example, if another compaction of 
> the same region is running, compaction selection excludes all files older 
> than the newest file currently being compacted, so the "Recursive enqueue" 
> request may not find enough files to select. When this happens, no split is 
> triggered. If the input load is high enough, compactions are always running 
> on the region, and a split is never triggered.
> This happened in our cluster: a huge region of more than 400GB with 100+ 
> storefiles appeared. The version is 0.98.10, and trunk also has the problem.





[jira] [Commented] (HBASE-14735) Region may grow too big and can not be split

2015-11-04 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989259#comment-14989259
 ] 

Shuaifeng Zhou commented on HBASE-14735:


I think it's correct.

A lower value means a higher priority, because a lower value means there are 
more storefiles in the store.

If the number of storefiles >= blockingFileCount, flushes will be blocked, so 
the store should have a higher priority.
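
The priority rule in this comment can be sketched as follows. This is only an illustration: it assumes the priority is computed as blockingFileCount minus the number of storefiles, and that a store counts as blocking once the result is <= 0. The class and method names are hypothetical and are not HBase API.

```java
// Hypothetical sketch, not HBase source: models the priority rule discussed
// above, assuming priority = blockingFileCount - storefileCount.
final class CompactPrioritySketch {
    // A lower return value means a more urgent compaction.
    static int compactPriority(int blockingFileCount, int storefileCount) {
        return blockingFileCount - storefileCount;
    }

    // Assumption: a store is "blocking" once its priority drops to 0 or below.
    static boolean isBlocking(int blockingFileCount, int storefileCount) {
        return compactPriority(blockingFileCount, storefileCount) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(compactPriority(10, 3)); // 7
        System.out.println(isBlocking(10, 12));     // true
    }
}
```

Under these assumptions, more storefiles always push the value down, which is why the lower value is treated as the more urgent one.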

> Region may grow too big and can not be split
> 
>
> Key: HBASE-14735
> URL: https://issues.apache.org/jira/browse/HBASE-14735
> Project: HBase
>  Issue Type: Bug
>  Components: Compaction, regionserver
>Affects Versions: 1.1.2, 0.98.15
>Reporter: Shuaifeng Zhou
>Assignee: Shuaifeng Zhou
> Attachments: 14735-0.98.patch, 14735-branch-1.1.patch, 
> 14735-branch-1.2.patch, 14735-master (2).patch, 14735-master.patch, 
> 14735-master.patch
>
>
> When a compaction completes, there may still be many storefiles in the store 
> and the compact priority is <= 0. In that case CompactSplitThread issues a 
> "Recursive enqueue" compaction request instead of requesting a split:
> {code:title=CompactSplitThread.java|borderStyle=solid}
> if (completed) {
>   // degenerate case: blocked regions require recursive enqueues
>   if (store.getCompactPriority() <= 0) {
>     requestSystemCompaction(region, store, "Recursive enqueue");
>   } else {
>     // see if the compaction has caused us to exceed max region size
>     requestSplit(region);
>   }
> }
> {code}
> In some situations, however, the "Recursive enqueue" request may return null 
> and not create a new compaction runner. For example, if another compaction of 
> the same region is running, compaction selection excludes all files older 
> than the newest file currently being compacted, so the "Recursive enqueue" 
> request may not find enough files to select. When this happens, no split is 
> triggered. If the input load is high enough, compactions are always running 
> on the region, and a split is never triggered.
> This happened in our cluster: a huge region of more than 400GB with 100+ 
> storefiles appeared. The version is 0.98.10, and trunk also has the problem.





[jira] [Updated] (HBASE-14735) Region may grow too big and can not be split

2015-11-02 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-14735:
---
Attachment: 14735-master.patch

Attached the patch for master; all regionserver test cases passed. Please 
review it. Patches for the other branches will be attached later, after I run 
the test cases.

> Region may grow too big and can not be split
> 
>
> Key: HBASE-14735
> URL: https://issues.apache.org/jira/browse/HBASE-14735
> Project: HBase
>  Issue Type: Bug
>  Components: Compaction, regionserver
>Affects Versions: 1.1.2, 0.98.15
>Reporter: Shuaifeng Zhou
>Assignee: Shuaifeng Zhou
> Attachments: 14735-master.patch
>
>
> When a compaction completes, there may still be many storefiles in the store 
> and the compact priority is <= 0. In that case CompactSplitThread issues a 
> "Recursive enqueue" compaction request instead of requesting a split:
> {code:title=CompactSplitThread.java|borderStyle=solid}
> if (completed) {
>   // degenerate case: blocked regions require recursive enqueues
>   if (store.getCompactPriority() <= 0) {
>     requestSystemCompaction(region, store, "Recursive enqueue");
>   } else {
>     // see if the compaction has caused us to exceed max region size
>     requestSplit(region);
>   }
> }
> {code}
> In some situations, however, the "Recursive enqueue" request may return null 
> and not create a new compaction runner. For example, if another compaction of 
> the same region is running, compaction selection excludes all files older 
> than the newest file currently being compacted, so the "Recursive enqueue" 
> request may not find enough files to select. When this happens, no split is 
> triggered. If the input load is high enough, compactions are always running 
> on the region, and a split is never triggered.
> This happened in our cluster: a huge region of more than 400GB with 100+ 
> storefiles appeared. The version is 0.98.10, and trunk also has the problem.





[jira] [Commented] (HBASE-14735) Region may grow too big and can not be split

2015-10-31 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983921#comment-14983921
 ] 

Shuaifeng Zhou commented on HBASE-14735:


Edit the requestSplit function and remove the condition 
"&& r.getCompactPriority() >= Store.PRIORITY_USER", so that it reads:
{code:title=CompactSplitThread.java|borderStyle=solid}
  public synchronized boolean requestSplit(final HRegion r) {
    // don't split regions that are blocking
    if (shouldSplitRegion()) {
      byte[] midKey = r.checkSplit();
      if (midKey != null) {
        requestSplit(r, midKey);
        return true;
      }
    }
    return false;
  }
{code}

> Region may grow too big and can not be split
> 
>
> Key: HBASE-14735
> URL: https://issues.apache.org/jira/browse/HBASE-14735
> Project: HBase
>  Issue Type: Bug
>  Components: Compaction, regionserver
>Affects Versions: 1.1.2, 0.98.15
>Reporter: Shuaifeng Zhou
>Assignee: Shuaifeng Zhou
>
> When a compaction completes, there may still be many storefiles in the store 
> and the compact priority is <= 0. In that case CompactSplitThread issues a 
> "Recursive enqueue" compaction request instead of requesting a split:
> {code:title=CompactSplitThread.java|borderStyle=solid}
> if (completed) {
>   // degenerate case: blocked regions require recursive enqueues
>   if (store.getCompactPriority() <= 0) {
>     requestSystemCompaction(region, store, "Recursive enqueue");
>   } else {
>     // see if the compaction has caused us to exceed max region size
>     requestSplit(region);
>   }
> }
> {code}
> In some situations, however, the "Recursive enqueue" request may return null 
> and not create a new compaction runner. For example, if another compaction of 
> the same region is running, compaction selection excludes all files older 
> than the newest file currently being compacted, so the "Recursive enqueue" 
> request may not find enough files to select. When this happens, no split is 
> triggered. If the input load is high enough, compactions are always running 
> on the region, and a split is never triggered.
> This happened in our cluster: a huge region of more than 400GB with 100+ 
> storefiles appeared. The version is 0.98.10, and trunk also has the problem.





[jira] [Created] (HBASE-14735) Region may grow too big and can not be split

2015-10-31 Thread Shuaifeng Zhou (JIRA)
Shuaifeng Zhou created HBASE-14735:
--

 Summary: Region may grow too big and can not be split
 Key: HBASE-14735
 URL: https://issues.apache.org/jira/browse/HBASE-14735
 Project: HBase
  Issue Type: Bug
  Components: Compaction, regionserver
Affects Versions: 0.98.15, 1.1.2
Reporter: Shuaifeng Zhou
Assignee: Shuaifeng Zhou


When a compaction completes, there may still be many storefiles in the store 
and the compact priority is <= 0. In that case CompactSplitThread issues a 
"Recursive enqueue" compaction request instead of requesting a split:
{code:title=CompactSplitThread.java|borderStyle=solid}
if (completed) {
  // degenerate case: blocked regions require recursive enqueues
  if (store.getCompactPriority() <= 0) {
    requestSystemCompaction(region, store, "Recursive enqueue");
  } else {
    // see if the compaction has caused us to exceed max region size
    requestSplit(region);
  }
}
{code}
In some situations, however, the "Recursive enqueue" request may return null 
and not create a new compaction runner. For example, if another compaction of 
the same region is running, compaction selection excludes all files older than 
the newest file currently being compacted, so the "Recursive enqueue" request 
may not find enough files to select. When this happens, no split is triggered. 
If the input load is high enough, compactions are always running on the 
region, and a split is never triggered.
This happened in our cluster: a huge region of more than 400GB with 100+ 
storefiles appeared. The version is 0.98.10, and trunk also has the problem.
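
The starvation described here can be illustrated with a small model. This is a sketch under stated assumptions, not HBase source: the class, fields, and the boolean standing in for "compaction selection found files" are all hypothetical.

```java
// Hypothetical model of the branch quoted in the description; names are
// illustrative only and do not correspond to real HBase classes.
final class RecursiveEnqueueSketch {
    int splitRequests = 0;
    int compactionsQueued = 0;

    // Stand-in for requestSystemCompaction: returns false when file selection
    // finds nothing to compact (the "returns null" case in the description).
    boolean requestSystemCompaction(boolean filesSelectable) {
        if (filesSelectable) {
            compactionsQueued++;
        }
        return filesSelectable;
    }

    // Original logic: a blocked store (priority <= 0) never reaches the split
    // request, even when the recursive enqueue queued nothing.
    void onCompactionCompleted(int compactPriority, boolean filesSelectable) {
        if (compactPriority <= 0) {
            requestSystemCompaction(filesSelectable);
        } else {
            splitRequests++; // stand-in for requestSplit(region)
        }
    }

    public static void main(String[] args) {
        RecursiveEnqueueSketch s = new RecursiveEnqueueSketch();
        for (int i = 0; i < 5; i++) {
            s.onCompactionCompleted(-3, false);
        }
        System.out.println(s.splitRequests + " " + s.compactionsQueued); // 0 0
    }
}
```

In this model, five completed compactions on a blocked store produce neither a new compaction nor a split request, which matches the behavior reported above.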





[jira] [Commented] (HBASE-14735) Region may grow too big and can not be split

2015-10-31 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983918#comment-14983918
 ] 

Shuaifeng Zhou commented on HBASE-14735:


One solution is to remove the "else" branch, so that every completed 
compaction gets a chance to trigger a split.
{code:title=CompactSplitThread.java|borderStyle=solid}
if (completed) {
  // degenerate case: blocked regions require recursive enqueues
  if (store.getCompactPriority() <= 0) {
    requestSystemCompaction(region, store, "Recursive enqueue");
  }
  // see if the compaction has caused us to exceed max region size
  requestSplit(region);
}
{code}
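
A minimal model of this change, assuming the same simplified view of the branch: the counters below are hypothetical stand-ins for requestSystemCompaction and requestSplit and are not HBase API.

```java
// Hypothetical model of the proposed change; names are illustrative only.
final class SplitAfterCompactionSketch {
    int recursiveRequests = 0;
    int splitChecks = 0;

    // With the "else" removed, the split check runs after every completed
    // compaction, whether or not the store is still blocked.
    void onCompactionCompleted(int compactPriority) {
        if (compactPriority <= 0) {
            recursiveRequests++; // stand-in for requestSystemCompaction(...)
        }
        splitChecks++;           // stand-in for requestSplit(region)
    }

    public static void main(String[] args) {
        SplitAfterCompactionSketch s = new SplitAfterCompactionSketch();
        for (int i = 0; i < 5; i++) {
            s.onCompactionCompleted(-3);
        }
        System.out.println(s.recursiveRequests + " " + s.splitChecks); // 5 5
    }
}
```

Whether a split actually happens is still up to the split check itself; the sketch only shows that the check is no longer skipped for blocked stores.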

> Region may grow too big and can not be split
> 
>
> Key: HBASE-14735
> URL: https://issues.apache.org/jira/browse/HBASE-14735
> Project: HBase
>  Issue Type: Bug
>  Components: Compaction, regionserver
>Affects Versions: 1.1.2, 0.98.15
>Reporter: Shuaifeng Zhou
>Assignee: Shuaifeng Zhou
>
> When a compaction completes, there may still be many storefiles in the store 
> and the compact priority is <= 0. In that case CompactSplitThread issues a 
> "Recursive enqueue" compaction request instead of requesting a split:
> {code:title=CompactSplitThread.java|borderStyle=solid}
> if (completed) {
>   // degenerate case: blocked regions require recursive enqueues
>   if (store.getCompactPriority() <= 0) {
>     requestSystemCompaction(region, store, "Recursive enqueue");
>   } else {
>     // see if the compaction has caused us to exceed max region size
>     requestSplit(region);
>   }
> }
> {code}
> In some situations, however, the "Recursive enqueue" request may return null 
> and not create a new compaction runner. For example, if another compaction of 
> the same region is running, compaction selection excludes all files older 
> than the newest file currently being compacted, so the "Recursive enqueue" 
> request may not find enough files to select. When this happens, no split is 
> triggered. If the input load is high enough, compactions are always running 
> on the region, and a split is never triggered.
> This happened in our cluster: a huge region of more than 400GB with 100+ 
> storefiles appeared. The version is 0.98.10, and trunk also has the problem.





[jira] [Updated] (HBASE-14407) NotServingRegion: hbase region closed forever

2015-09-24 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-14407:
---
Attachment: 14407-branch-1.1.patch

Reattached the patch for branch-1.1.
Is it OK?
[~apurtell]

> NotServingRegion: hbase region closed forever
> -
>
> Key: HBASE-14407
> URL: https://issues.apache.org/jira/browse/HBASE-14407
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 0.98.10, 1.2.0, 1.1.2, 1.3.0
>Reporter: Shuaifeng Zhou
>Assignee: Shuaifeng Zhou
>Priority: Critical
> Attachments: 14407-0.98.patch, 14407-branch-1.1.patch, 
> 14407-branch-1.2.patch, hbase-14407-0.98.patch, hbase-14407-1.1.patch, 
> hbase-14407-1.2.patch, hs4.log, master.log
>
>
> I found a situation that may cause a region to stay closed forever. It 
> happens frequently on my cluster, which runs 0.98.10, but 1.1.2 also has the 
> problem:
> 1, master sends an open-region request to the regionserver
> 2, rs starts a handler to open the region
> 3, rs returns the response to master
> 4, master does not receive the response, or times out, and sends the 
> open-region request again
> 5, rs has already opened the region
> 6, master runs processAlreadyOpenedRegion and updates the region state to 
> open in master memory
> 7, master receives the zk "region opened" message (late for some reason, 
> e.g. network delay) and tries to update the region state to open, but finds 
> that the region is already open: ERROR!
> 8, master sends close region, and the region stays closed forever.





[jira] [Commented] (HBASE-14407) NotServingRegion: hbase region closed forever

2015-09-24 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14906064#comment-14906064
 ] 

Shuaifeng Zhou commented on HBASE-14407:


lgtm

> NotServingRegion: hbase region closed forever
> -
>
> Key: HBASE-14407
> URL: https://issues.apache.org/jira/browse/HBASE-14407
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 0.98.10, 1.2.0, 1.1.2, 1.3.0
>Reporter: Shuaifeng Zhou
>Assignee: Shuaifeng Zhou
>Priority: Critical
> Attachments: 14407-0.98.patch, 14407-branch-1.1.patch, 
> 14407-branch-1.2.patch, hbase-14407-0.98.patch, hbase-14407-1.1.patch, 
> hbase-14407-1.2.patch, hs4.log, master.log
>
>
> I found a situation that may cause a region to stay closed forever. It 
> happens frequently on my cluster, which runs 0.98.10, but 1.1.2 also has the 
> problem:
> 1, master sends an open-region request to the regionserver
> 2, rs starts a handler to open the region
> 3, rs returns the response to master
> 4, master does not receive the response, or times out, and sends the 
> open-region request again
> 5, rs has already opened the region
> 6, master runs processAlreadyOpenedRegion and updates the region state to 
> open in master memory
> 7, master receives the zk "region opened" message (late for some reason, 
> e.g. network delay) and tries to update the region state to open, but finds 
> that the region is already open: ERROR!
> 8, master sends close region, and the region stays closed forever.





[jira] [Commented] (HBASE-14407) NotServingRegion: hbase region closed forever

2015-09-24 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14907328#comment-14907328
 ] 

Shuaifeng Zhou commented on HBASE-14407:


lgtm

> NotServingRegion: hbase region closed forever
> -
>
> Key: HBASE-14407
> URL: https://issues.apache.org/jira/browse/HBASE-14407
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 0.98.10, 1.2.0, 1.1.2, 1.3.0
>Reporter: Shuaifeng Zhou
>Assignee: Shuaifeng Zhou
>Priority: Critical
> Attachments: 14407-0.98.patch, 14407-branch-1.1.patch, 
> 14407-branch-1.2.patch, hbase-14407-0.98.patch, hbase-14407-1.1.patch, 
> hbase-14407-1.2.patch, hs4.log, master.log
>
>
> I found a situation that may cause a region to stay closed forever. It 
> happens frequently on my cluster, which runs 0.98.10, but 1.1.2 also has the 
> problem:
> 1, master sends an open-region request to the regionserver
> 2, rs starts a handler to open the region
> 3, rs returns the response to master
> 4, master does not receive the response, or times out, and sends the 
> open-region request again
> 5, rs has already opened the region
> 6, master runs processAlreadyOpenedRegion and updates the region state to 
> open in master memory
> 7, master receives the zk "region opened" message (late for some reason, 
> e.g. network delay) and tries to update the region state to open, but finds 
> that the region is already open: ERROR!
> 8, master sends close region, and the region stays closed forever.





[jira] [Commented] (HBASE-14407) NotServingRegion: hbase region closed forever

2015-09-21 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901759#comment-14901759
 ] 

Shuaifeng Zhou commented on HBASE-14407:


[~apurtell]

> NotServingRegion: hbase region closed forever
> -
>
> Key: HBASE-14407
> URL: https://issues.apache.org/jira/browse/HBASE-14407
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 0.98.10, 1.2.0, 1.1.2, 1.3.0
>Reporter: Shuaifeng Zhou
>Assignee: Shuaifeng Zhou
>Priority: Critical
> Attachments: 14407-0.98.patch, 14407-branch-1.2.patch, 
> hbase-14407-0.98.patch, hbase-14407-1.1.patch, hbase-14407-1.2.patch, 
> hs4.log, master.log
>
>
> I found a situation that may cause a region to stay closed forever. It 
> happens frequently on my cluster, which runs 0.98.10, but 1.1.2 also has the 
> problem:
> 1, master sends an open-region request to the regionserver
> 2, rs starts a handler to open the region
> 3, rs returns the response to master
> 4, master does not receive the response, or times out, and sends the 
> open-region request again
> 5, rs has already opened the region
> 6, master runs processAlreadyOpenedRegion and updates the region state to 
> open in master memory
> 7, master receives the zk "region opened" message (late for some reason, 
> e.g. network delay) and tries to update the region state to open, but finds 
> that the region is already open: ERROR!
> 8, master sends close region, and the region stays closed forever.





[jira] [Commented] (HBASE-14407) NotServingRegion: hbase region closed forever

2015-09-19 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877385#comment-14877385
 ] 

Shuaifeng Zhou commented on HBASE-14407:


lgtm

Should the patch go to 0.98 and branch-1.1?

> NotServingRegion: hbase region closed forever
> -
>
> Key: HBASE-14407
> URL: https://issues.apache.org/jira/browse/HBASE-14407
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 0.98.10, 1.2.0, 1.1.2, 1.3.0
>Reporter: Shuaifeng Zhou
>Assignee: Shuaifeng Zhou
>Priority: Critical
> Attachments: 14407-branch-1.2.patch, hbase-14407-0.98.patch, 
> hbase-14407-1.1.patch, hbase-14407-1.2.patch, hs4.log, master.log
>
>
> I found a situation that may cause a region to stay closed forever. It 
> happens frequently on my cluster, which runs 0.98.10, but 1.1.2 also has the 
> problem:
> 1, master sends an open-region request to the regionserver
> 2, rs starts a handler to open the region
> 3, rs returns the response to master
> 4, master does not receive the response, or times out, and sends the 
> open-region request again
> 5, rs has already opened the region
> 6, master runs processAlreadyOpenedRegion and updates the region state to 
> open in master memory
> 7, master receives the zk "region opened" message (late for some reason, 
> e.g. network delay) and tries to update the region state to open, but finds 
> that the region is already open: ERROR!
> 8, master sends close region, and the region stays closed forever.





[jira] [Updated] (HBASE-14407) NotServingRegion: hbase region closed forever

2015-09-15 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-14407:
---
Attachment: hbase-14407-1.2.patch
hbase-14407-1.1.patch
hbase-14407-0.98.patch

A possible solution is to check the zk state in processAlreadyOpenedRegion 
before modifying master memory.
Patches for branches 0.98, 1.1, and 1.2 are attached. I tested 0.98.10 with 
this change on more than 10,000 regions and it works (before the change, the 
problem happened every time HBase was restarted).
In the master branch, assignment does not use zk, so the problem does not 
occur there.
Please review it; smarter solutions are welcome.
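
One possible reading of this idea, as a sketch only: the attached patches are not reproduced here, so the guard below is an assumption about the approach, not the actual change. The enum constants reuse the zk state names that appear in the master log (M_ZK_REGION_OFFLINE, RS_ZK_REGION_OPENED); the class, fields, and method signature are hypothetical.

```java
// Hypothetical sketch of "check the zk state before modifying master memory".
// It is not the attached patch and does not use real HBase classes.
final class CheckZkBeforeOnlineSketch {
    enum ZkNodeState { M_ZK_REGION_OFFLINE, RS_ZK_REGION_OPENED }
    enum RegionState { PENDING_OPEN, OPEN }

    // The master's in-memory view of one region.
    RegionState masterView = RegionState.PENDING_OPEN;

    // Returns true if the master marked the region online in memory.
    boolean processAlreadyOpenedRegion(ZkNodeState znode) {
        if (znode != ZkNodeState.M_ZK_REGION_OFFLINE) {
            // The regionserver has already moved the znode on, so a zk event
            // is still pending; leave the in-memory transition to its handler.
            return false;
        }
        masterView = RegionState.OPEN;
        return true;
    }

    public static void main(String[] args) {
        CheckZkBeforeOnlineSketch s = new CheckZkBeforeOnlineSketch();
        s.processAlreadyOpenedRegion(ZkNodeState.RS_ZK_REGION_OPENED);
        System.out.println(s.masterView); // PENDING_OPEN
    }
}
```

In this model the region stays in PENDING_OPEN during the retry, so a late zk "opened" event still finds the state it expects.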


> NotServingRegion: hbase region closed forever
> -
>
> Key: HBASE-14407
> URL: https://issues.apache.org/jira/browse/HBASE-14407
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 0.98.10, 1.2.0, 1.1.2, 1.3.0
>Reporter: Shuaifeng Zhou
>Assignee: Shuaifeng Zhou
>Priority: Critical
> Attachments: hbase-14407-0.98.patch, hbase-14407-1.1.patch, 
> hbase-14407-1.2.patch, hs4.log, master.log
>
>
> I found a situation that may cause a region to stay closed forever. It 
> happens frequently on my cluster, which runs 0.98.10, but 1.1.2 also has the 
> problem:
> 1, master sends an open-region request to the regionserver
> 2, rs starts a handler to open the region
> 3, rs returns the response to master
> 4, master does not receive the response, or times out, and sends the 
> open-region request again
> 5, rs has already opened the region
> 6, master runs processAlreadyOpenedRegion and updates the region state to 
> open in master memory
> 7, master receives the zk "region opened" message (late for some reason, 
> e.g. network delay) and tries to update the region state to open, but finds 
> that the region is already open: ERROR!
> 8, master sends close region, and the region stays closed forever.





[jira] [Commented] (HBASE-14407) NotServingRegion: hbase region closed forever

2015-09-15 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745283#comment-14745283
 ] 

Shuaifeng Zhou commented on HBASE-14407:


Thanks, stack.
I have extracted an analysis from the master log and attached a possible patch.

> NotServingRegion: hbase region closed forever
> -
>
> Key: HBASE-14407
> URL: https://issues.apache.org/jira/browse/HBASE-14407
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 0.98.10, 1.2.0, 1.1.2, 1.3.0
>Reporter: Shuaifeng Zhou
>Assignee: Shuaifeng Zhou
>Priority: Critical
> Attachments: hbase-14407-0.98.patch, hbase-14407-1.1.patch, 
> hbase-14407-1.2.patch, hs4.log, master.log
>
>
> I found a situation that may cause a region to stay closed forever. It 
> happens frequently on my cluster, which runs 0.98.10, but 1.1.2 also has the 
> problem:
> 1, master sends an open-region request to the regionserver
> 2, rs starts a handler to open the region
> 3, rs returns the response to master
> 4, master does not receive the response, or times out, and sends the 
> open-region request again
> 5, rs has already opened the region
> 6, master runs processAlreadyOpenedRegion and updates the region state to 
> open in master memory
> 7, master receives the zk "region opened" message (late for some reason, 
> e.g. network delay) and tries to update the region state to open, but finds 
> that the region is already open: ERROR!
> 8, master sends close region, and the region stays closed forever.





[jira] [Commented] (HBASE-14407) NotServingRegion: hbase region closed forever

2015-09-14 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744866#comment-14744866
 ] 

Shuaifeng Zhou commented on HBASE-14407:


In the master log, the error sequence looks like this (0.98.10):

1, the master's open-region call times out:
{noformat}
2015-09-06 01:35:59,521 DEBUG [hm,6,1438368907764-GeneralBulkAssigner-19] 
master.AssignmentManager(1768): Bulk assigner openRegion() to 
hs4,60020,1441213185092 has timed out, but the regions might already be opened 
on it.
java.net.SocketTimeoutException: Call to hs4/15.173.0.115:60020 failed because 
java.net.SocketTimeoutException: 6 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/15.173.0.110:33771 remote=hs4/15.173.0.115:60020]
{noformat}
2, the master retries the open, finds that the region is already open, and 
deletes the node:
{noformat}
2015-09-06 01:36:07,063 DEBUG [hm,6,1438368907764-GeneralBulkAssigner-19] 
master.AssignmentManager(2293): ALREADY_OPENED 
NB_APP_BEHAVIOR_LABEL_00,F\x180100\x18,1441015264846.2070d83bfd7c3fa6950c859ce842039e.
 to hs4,60020,1441213185092
{noformat}

3, processAlreadyOpenedRegion tries to delete the zk node, but the node state 
does not match (because this is a retry and the region was opened by the 
earlier request):
{noformat}
2015-09-06 01:36:07,073 WARN  [hm,6,1438368907764-GeneralBulkAssigner-19] 
zookeeper.ZKAssign(458): master:6-0x24ee432542d01eb, 
quorum=hs5:2181,hs4:2181,hm:2181, baseZNode=/hbase Attempting to delete 
unassigned node 2070d83bfd7c3fa6950c859ce842039e in M_ZK_REGION_OFFLINE state 
but node is in RS_ZK_REGION_OPENED state
2015-09-06 01:36:07,073 INFO  [hm,6,1438368907764-GeneralBulkAssigner-19] 
master.AssignmentManager(3614): Failed to delete the offline node for 
2070d83bfd7c3fa6950c859ce842039e. The node type may not match
{noformat}
At the same time, it modifies regionStates in master memory:
{code:title=AssignmentManager.java|borderStyle=solid}
  private void processAlreadyOpenedRegion(HRegionInfo region, ServerName sn) {
    // Remove region from in-memory transition and unassigned node from ZK
    // While trying to enable the table the regions of the table were
    // already enabled.
    LOG.debug("ALREADY_OPENED " + region.getRegionNameAsString()
      + " to " + sn);
    String encodedName = region.getEncodedName();
    deleteNodeInStates(encodedName, "offline", sn, EventType.M_ZK_REGION_OFFLINE);
    regionStates.regionOnline(region, sn);
  }
{code}

4, the master handles the delayed zk event for the earlier successful open:
{noformat}
2015-09-06 01:36:07,073 INFO  [hm,6,1438368907764-GeneralBulkAssigner-19] 
master.RegionStates(826): Transition {2070d83bfd7c3fa6950c859ce842039e 
state=PENDING_OPEN, ts=1441474499424, server=hs4,60020,1441213185092} to 
{2070d83bfd7c3fa6950c859ce842039e state=OPEN, ts=1441474567073, 
server=hs4,60020,1441213185092}
2015-09-06 01:36:07,073 INFO  [hm,6,1438368907764-GeneralBulkAssigner-19] 
master.RegionStates(371): Onlined 2070d83bfd7c3fa6950c859ce842039e on 
hs4,60020,1441213185092
2015-09-06 01:36:33,960 DEBUG [AM.ZK.Worker-pool2-t5251] 
master.AssignmentManager(926): Handling RS_ZK_REGION_OPENED, 
server=hs4,60020,1441213185092, region=2070d83bfd7c3fa6950c859ce842039e, which 
is more than 15 seconds late, current_state={2070d83bfd7c3fa6950c859ce842039e 
state=OPEN, ts=1441474567073, server=hs4,60020,1441213185092}
{noformat}

5, the master tries to modify regionStates again, finds that the region is 
already open, treats this as an error, and closes the region:
{noformat}
2015-09-06 01:36:33,961 WARN  [AM.ZK.Worker-pool2-t5251] 
master.AssignmentManager(1061): Received OPENED for 
2070d83bfd7c3fa6950c859ce842039e from hs4,60020,1441213185092 but the region 
isn't PENDING_OPEN/OPENING here: {2070d83bfd7c3fa6950c859ce842039e state=OPEN, 
ts=1441474567073, server=hs4,60020,1441213185092}
2015-09-06 01:36:33,965 DEBUG [AM.ZK.Worker-pool2-t5251] 
master.AssignmentManager(1849): Sent CLOSE to hs4,60020,1441213185092 for 
region 
NB_APP_BEHAVIOR_LABEL_00,F\x180100\x18,1441015264846.2070d83bfd7c3fa6950c859ce842039e.
{noformat}

> NotServingRegion: hbase region closed forever
> -
>
> Key: HBASE-14407
> URL: https://issues.apache.org/jira/browse/HBASE-14407
> Project: HBase
>  Issue Type: Bug
>  Components: Region Assignment
>Affects Versions: 0.98.10, 1.2.0, 1.1.2, 1.3.0
>Reporter: Shuaifeng Zhou
>Assignee: Shuaifeng Zhou
>Priority: Critical
> Attachments: hs4.log, master.log
>
>
> I found a situation that may cause a region to stay closed forever. It 
> happens frequently on my cluster, which runs 0.98.10, but 1.1.2 also has the 
> problem:
> 1, master sends an open-region request to the regionserver
> 2, rs starts a handler to open the region
> 3, rs returns the response to master
> 4, master does not receive the response, or times out, and sends 

[jira] [Updated] (HBASE-14407) NotServingRegion: hbase region closed forever

2015-09-11 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-14407:
---
Attachment: hs4.log
master.log

Attached are the logs from the master and the regionserver.

> NotServingRegion: hbase region closed forever
> -
>
> Key: HBASE-14407
> URL: https://issues.apache.org/jira/browse/HBASE-14407
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.98.10, 1.1.2
>Reporter: Shuaifeng Zhou
>Assignee: Shuaifeng Zhou
> Attachments: hs4.log, master.log
>
>
> I found a situation that may cause a region to stay closed forever. It 
> happens frequently on my cluster, which runs 0.98.10, but 1.1.2 also has the 
> problem:
> 1, master sends an open-region request to the regionserver
> 2, rs starts a handler to open the region
> 3, rs returns the response to master
> 4, master does not receive the response, or times out, and sends the 
> open-region request again
> 5, rs has already opened the region
> 6, master runs processAlreadyOpenedRegion and updates the region state to 
> open in master memory
> 7, master receives the zk "region opened" message (late for some reason, 
> e.g. network delay) and tries to update the region state to open, but finds 
> that the region is already open: ERROR!
> 8, master sends close region, and the region stays closed forever.





[jira] [Created] (HBASE-14407) NotServingRegion: hbase region closed forever

2015-09-10 Thread Shuaifeng Zhou (JIRA)
Shuaifeng Zhou created HBASE-14407:
--

 Summary: NotServingRegion: hbase region closed forever
 Key: HBASE-14407
 URL: https://issues.apache.org/jira/browse/HBASE-14407
 Project: HBase
  Issue Type: Bug
Reporter: Shuaifeng Zhou
Assignee: Shuaifeng Zhou
 Fix For: 1.1.2, 0.98.10


I found a situation that may cause a region to stay closed forever. It 
happens frequently on my cluster, which runs 0.98.10, but 1.1.2 also has the 
problem:
1, master sends an open-region request to the regionserver
2, rs starts a handler to open the region
3, rs returns the response to master
4, master does not receive the response, or times out, and sends the 
open-region request again
5, rs has already opened the region
6, master runs processAlreadyOpenedRegion and updates the region state to 
open in master memory
7, master receives the zk "region opened" message (late for some reason, 
e.g. network delay) and tries to update the region state to open, but finds 
that the region is already open: ERROR!
8, master sends close region, and the region stays closed forever.
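
The sequence above can be modeled as a small state machine. This is a sketch under stated assumptions, not HBase source: the class, the three-value state enum, and both method bodies are hypothetical simplifications of the retry path and the late zk event.

```java
// Hypothetical model of the sequence described above; not HBase source.
final class RegionOpenRaceSketch {
    enum State { PENDING_OPEN, OPEN, CLOSED }

    // The master's in-memory view of one region.
    State masterView = State.PENDING_OPEN;

    // Retry path: the regionserver reports the region as already open, so the
    // master marks it OPEN in memory right away.
    void processAlreadyOpenedRegion() {
        masterView = State.OPEN;
    }

    // ZooKeeper path: the "region opened" event for the first request. It
    // expects PENDING_OPEN; any other state is treated as an error and the
    // master closes the region.
    void handleZkRegionOpened() {
        if (masterView == State.PENDING_OPEN) {
            masterView = State.OPEN;
        } else {
            masterView = State.CLOSED;
        }
    }

    public static void main(String[] args) {
        RegionOpenRaceSketch late = new RegionOpenRaceSketch();
        late.processAlreadyOpenedRegion();   // the retry wins the race
        late.handleZkRegionOpened();         // the delayed event arrives afterwards
        System.out.println(late.masterView); // CLOSED
    }
}
```

When the zk event arrives before any retry, the same model ends in OPEN, so the outcome depends only on the order of the two calls.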





[jira] [Updated] (HBASE-14407) NotServingRegion: hbase region closed forever

2015-09-10 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-14407:
---
Affects Version/s: 0.98.10
   1.1.2

> NotServingRegion: hbase region closed forever
> -
>
> Key: HBASE-14407
> URL: https://issues.apache.org/jira/browse/HBASE-14407
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.98.10, 1.1.2
>Reporter: Shuaifeng Zhou
>Assignee: Shuaifeng Zhou
>
> I found a situation that can leave a region closed forever. It happens often 
> on my cluster (version 0.98.10), and 1.1.2 has the same problem:
> 1, master sends region open to regionserver
> 2, rs starts a handler to open the region
> 3, rs returns the response to master
> 4, master does not receive the response, or times out, and sends open region 
> again
> 5, rs has already opened the region
> 6, master runs processAlreadyOpenedRegion and updates the region state to 
> open in master memory
> 7, master receives the zk message that the region opened (late for some 
> reason, e.g. the network), which triggers another update of the region state 
> to open, but finds that the region is already opened: ERROR!
> 8, master sends close region, and the region stays closed forever.





[jira] [Updated] (HBASE-14407) NotServingRegion: hbase region closed forever

2015-09-10 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-14407:
---
Fix Version/s: (was: 1.1.2)
   (was: 0.98.10)

> NotServingRegion: hbase region closed forever
> -
>
> Key: HBASE-14407
> URL: https://issues.apache.org/jira/browse/HBASE-14407
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.98.10, 1.1.2
>Reporter: Shuaifeng Zhou
>Assignee: Shuaifeng Zhou
>
> I found a situation that can leave a region closed forever. It happens often 
> on my cluster (version 0.98.10), and 1.1.2 has the same problem:
> 1, master sends region open to regionserver
> 2, rs starts a handler to open the region
> 3, rs returns the response to master
> 4, master does not receive the response, or times out, and sends open region 
> again
> 5, rs has already opened the region
> 6, master runs processAlreadyOpenedRegion and updates the region state to 
> open in master memory
> 7, master receives the zk message that the region opened (late for some 
> reason, e.g. the network), which triggers another update of the region state 
> to open, but finds that the region is already opened: ERROR!
> 8, master sends close region, and the region stays closed forever.





[jira] [Updated] (HBASE-13528) A bug on selecting compaction pool

2015-04-21 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-13528:
---
Attachment: HBASE-13528-master.patch
HBASE-13528-1.0.patch
HBASE-13528-0.98.patch

 A bug on selecting compaction pool
 --

 Key: HBASE-13528
 URL: https://issues.apache.org/jira/browse/HBASE-13528
 Project: HBase
  Issue Type: Bug
  Components: Compaction
Affects Versions: 0.98.12
Reporter: Shuaifeng Zhou
Assignee: Shuaifeng Zhou
Priority: Minor
 Fix For: 1.0.1, 0.98.13

 Attachments: HBASE-13528-0.98.patch, HBASE-13528-1.0.patch, 
 HBASE-13528-master.patch


 When selectNow == true in requestCompactionInternal, the compaction pool 
 selection is incorrect, as discussed in:
 http://mail-archives.apache.org/mod_mbox/hbase-dev/201504.mbox/%3CCAAAYAnNC06E-pUG_Fhu9-7x5z--tm_apnFqpuqfn%3DLdNESE3mA%40mail.gmail.com%3E





[jira] [Commented] (HBASE-13528) A bug on selecting compaction pool

2015-04-21 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506298#comment-14506298
 ] 

Shuaifeng Zhou commented on HBASE-13528:


Yes, it's redundant. Would something like this be OK?
{noformat}
long size = compaction.getRequest().getSize();
ThreadPoolExecutor pool = (selectNow && s.throttleCompaction(size))
  ? largeCompactions : smallCompactions;
{noformat}
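
The pool choice in the snippet above can be exercised in isolation. The sketch 
below is a stand-alone model, not the HBase class: CompactionPoolChoice, the 
Pool enum and the throttlePoint parameter are hypothetical, and the 
size-over-threshold rule is an assumed stand-in for whatever 
s.throttleCompaction(size) actually checks.

```java
/** Hypothetical stand-alone model of the pool choice discussed above. */
final class CompactionPoolChoice {
    enum Pool { LARGE, SMALL }

    /** Assumed stand-in for s.throttleCompaction(size): true for big requests. */
    static boolean throttleCompaction(long size, long throttlePoint) {
        return size > throttlePoint;
    }

    /** Only a request that has already selected its files can go to the large pool. */
    static Pool choose(boolean selectNow, long size, long throttlePoint) {
        return (selectNow && throttleCompaction(size, throttlePoint))
            ? Pool.LARGE
            : Pool.SMALL;
    }

    public static void main(String[] args) {
        long throttlePoint = 5L;
        System.out.println(choose(true, 10L, throttlePoint));
        System.out.println(choose(true, 3L, throttlePoint));
        System.out.println(choose(false, 10L, throttlePoint));
    }
}
```

In this model a request that has not selected its files yet always falls 
through to the small pool, whatever its size.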

 A bug on selecting compaction pool
 --

 Key: HBASE-13528
 URL: https://issues.apache.org/jira/browse/HBASE-13528
 Project: HBase
  Issue Type: Bug
  Components: Compaction
Affects Versions: 0.98.12
Reporter: Shuaifeng Zhou
Assignee: Shuaifeng Zhou
Priority: Minor
 Fix For: 1.0.1, 0.98.13

 Attachments: HBASE-13528-0.98.patch, HBASE-13528-1.0.patch, 
 HBASE-13528-master.patch


 When selectNow == true in requestCompactionInternal, the compaction pool 
 selection is incorrect, as discussed in:
 http://mail-archives.apache.org/mod_mbox/hbase-dev/201504.mbox/%3CCAAAYAnNC06E-pUG_Fhu9-7x5z--tm_apnFqpuqfn%3DLdNESE3mA%40mail.gmail.com%3E





[jira] [Commented] (HBASE-13528) A bug on selecting compaction pool

2015-04-21 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506341#comment-14506341
 ] 

Shuaifeng Zhou commented on HBASE-13528:


OK, I will attach a patch soon.

 A bug on selecting compaction pool
 --

 Key: HBASE-13528
 URL: https://issues.apache.org/jira/browse/HBASE-13528
 Project: HBase
  Issue Type: Bug
  Components: Compaction
Affects Versions: 0.98.12
Reporter: Shuaifeng Zhou
Assignee: Shuaifeng Zhou
Priority: Minor
 Fix For: 1.0.1, 0.98.13

 Attachments: HBASE-13528-0.98.patch, HBASE-13528-1.0.patch, 
 HBASE-13528-master.patch


 When selectNow == true in requestCompactionInternal, the compaction pool 
 selection is incorrect, as discussed in:
 http://mail-archives.apache.org/mod_mbox/hbase-dev/201504.mbox/%3CCAAAYAnNC06E-pUG_Fhu9-7x5z--tm_apnFqpuqfn%3DLdNESE3mA%40mail.gmail.com%3E





[jira] [Updated] (HBASE-13528) A bug on selecting compaction pool

2015-04-21 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-13528:
---
Status: Patch Available  (was: Open)

 A bug on selecting compaction pool
 --

 Key: HBASE-13528
 URL: https://issues.apache.org/jira/browse/HBASE-13528
 Project: HBase
  Issue Type: Bug
  Components: Compaction
Affects Versions: 0.98.12
Reporter: Shuaifeng Zhou
Assignee: Shuaifeng Zhou
Priority: Minor
 Fix For: 1.0.1, 0.98.13

 Attachments: HBASE-13528-0.98.patch, HBASE-13528-1.0.patch, 
 HBASE-13528-master.patch


 When selectNow == true in requestCompactionInternal, the compaction pool 
 selection is incorrect, as discussed in:
 http://mail-archives.apache.org/mod_mbox/hbase-dev/201504.mbox/%3CCAAAYAnNC06E-pUG_Fhu9-7x5z--tm_apnFqpuqfn%3DLdNESE3mA%40mail.gmail.com%3E





[jira] [Created] (HBASE-13528) A bug on selecting compaction pool

2015-04-21 Thread Shuaifeng Zhou (JIRA)
Shuaifeng Zhou created HBASE-13528:
--

 Summary: A bug on selecting compaction pool
 Key: HBASE-13528
 URL: https://issues.apache.org/jira/browse/HBASE-13528
 Project: HBase
  Issue Type: Bug
  Components: Compaction
Affects Versions: 0.98.12
Reporter: Shuaifeng Zhou
Assignee: Shuaifeng Zhou
Priority: Minor
 Fix For: 1.0.1, 0.98.13


When selectNow == true in requestCompactionInternal, the compaction pool 
selection is incorrect, as discussed in:

http://mail-archives.apache.org/mod_mbox/hbase-dev/201504.mbox/%3CCAAAYAnNC06E-pUG_Fhu9-7x5z--tm_apnFqpuqfn%3DLdNESE3mA%40mail.gmail.com%3E





[jira] [Updated] (HBASE-13528) A bug on selecting compaction pool

2015-04-21 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-13528:
---
Status: Open  (was: Patch Available)

 A bug on selecting compaction pool
 --

 Key: HBASE-13528
 URL: https://issues.apache.org/jira/browse/HBASE-13528
 Project: HBase
  Issue Type: Bug
  Components: Compaction
Affects Versions: 0.98.12
Reporter: Shuaifeng Zhou
Assignee: Shuaifeng Zhou
Priority: Minor
 Fix For: 1.0.1, 0.98.13

 Attachments: HBASE-13528-0.98.patch, HBASE-13528-1.0.patch, 
 HBASE-13528-master.patch


 When selectNow == true in requestCompactionInternal, the compaction pool 
 selection is incorrect, as discussed in:
 http://mail-archives.apache.org/mod_mbox/hbase-dev/201504.mbox/%3CCAAAYAnNC06E-pUG_Fhu9-7x5z--tm_apnFqpuqfn%3DLdNESE3mA%40mail.gmail.com%3E





[jira] [Updated] (HBASE-13528) A bug on selecting compaction pool

2015-04-21 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-13528:
---
Attachment: HBASE-13528-master-1.patch
HBASE-13528-1.0-1.patch
HBASE-13528-0.98-1.patch

Refined the patch according to the comments from zhangduo.

 A bug on selecting compaction pool
 --

 Key: HBASE-13528
 URL: https://issues.apache.org/jira/browse/HBASE-13528
 Project: HBase
  Issue Type: Bug
  Components: Compaction
Affects Versions: 0.98.12
Reporter: Shuaifeng Zhou
Assignee: Shuaifeng Zhou
Priority: Minor
 Fix For: 1.0.1, 0.98.13

 Attachments: HBASE-13528-0.98-1.patch, HBASE-13528-0.98.patch, 
 HBASE-13528-1.0-1.patch, HBASE-13528-1.0.patch, HBASE-13528-master-1.patch, 
 HBASE-13528-master.patch


 When selectNow == true in requestCompactionInternal, the compaction pool 
 selection is incorrect, as discussed in:
 http://mail-archives.apache.org/mod_mbox/hbase-dev/201504.mbox/%3CCAAAYAnNC06E-pUG_Fhu9-7x5z--tm_apnFqpuqfn%3DLdNESE3mA%40mail.gmail.com%3E





[jira] [Updated] (HBASE-13528) A bug on selecting compaction pool

2015-04-21 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-13528:
---
Status: Patch Available  (was: Open)

 A bug on selecting compaction pool
 --

 Key: HBASE-13528
 URL: https://issues.apache.org/jira/browse/HBASE-13528
 Project: HBase
  Issue Type: Bug
  Components: Compaction
Affects Versions: 0.98.12
Reporter: Shuaifeng Zhou
Assignee: Shuaifeng Zhou
Priority: Minor
 Fix For: 1.0.1, 0.98.13

 Attachments: HBASE-13528-0.98-1.patch, HBASE-13528-0.98.patch, 
 HBASE-13528-1.0-1.patch, HBASE-13528-1.0.patch, HBASE-13528-master-1.patch, 
 HBASE-13528-master.patch


 When selectNow == true in requestCompactionInternal, the compaction pool 
 selection is incorrect, as discussed in:
 http://mail-archives.apache.org/mod_mbox/hbase-dev/201504.mbox/%3CCAAAYAnNC06E-pUG_Fhu9-7x5z--tm_apnFqpuqfn%3DLdNESE3mA%40mail.gmail.com%3E





[jira] [Updated] (HBASE-13122) Improve efficiency for return codes of some filters

2015-03-04 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-13122:
---
Status: Patch Available  (was: Reopened)

 Improve efficiency for return codes of some filters
 ---

 Key: HBASE-13122
 URL: https://issues.apache.org/jira/browse/HBASE-13122
 Project: HBase
  Issue Type: Improvement
  Components: Filters
Affects Versions: 0.98.10.1, 0.94.24, 1.0.1
Reporter: Shuaifeng Zhou
 Fix For: 2.0.0, 1.1.0

 Attachments: 13122-master.patch, 13122-master.patch, 13122.patch


 ColumnRangeFilter:
  when minColumnInclusive is false, none of the cells at the current row and 
 column fit the condition, so the scan should skip to the next column: the 
 return code should be NEXT_COL, not SKIP.
 FamilyFilter is in a similar situation.
 Currently, SKIP does not cause an error, but it is not efficient.





[jira] [Updated] (HBASE-13122) Improve efficiency for return codes of some filters

2015-03-04 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-13122:
---
Attachment: 13122-master.patch

 Improve efficiency for return codes of some filters
 ---

 Key: HBASE-13122
 URL: https://issues.apache.org/jira/browse/HBASE-13122
 Project: HBase
  Issue Type: Improvement
  Components: Filters
Affects Versions: 0.94.24, 1.0.1, 0.98.10.1
Reporter: Shuaifeng Zhou
 Fix For: 2.0.0, 1.1.0

 Attachments: 13122-master.patch, 13122-master.patch, 13122.patch


 ColumnRangeFilter:
  when minColumnInclusive is false, none of the cells at the current row and 
 column fit the condition, so the scan should skip to the next column: the 
 return code should be NEXT_COL, not SKIP.
 FamilyFilter is in a similar situation.
 Currently, SKIP does not cause an error, but it is not efficient.





[jira] [Commented] (HBASE-13122) Improve efficiency for return codes of some filters

2015-03-04 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347847#comment-14347847
 ] 

Shuaifeng Zhou commented on HBASE-13122:


The failure is shown below:

Failed to read test report file 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-shell/target/surefire-reports/TEST-org.apache.hadoop.hbase.client.TestShell.xml
org.dom4j.DocumentException: Error on line 706 of document 
file:///home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-shell/target/surefire-reports/TEST-org.apache.hadoop.hbase.client.TestShell.xml
 : XML document structures must start and end within the same entity. Nested 
exception: XML document structures must start and end within the same entity.

The whole test run took only 0 ms and did not run any test case. The error 
points at a malformed TestShell.xml, which I think is unrelated to the patch.

 Improve efficiency for return codes of some filters
 ---

 Key: HBASE-13122
 URL: https://issues.apache.org/jira/browse/HBASE-13122
 Project: HBase
  Issue Type: Improvement
  Components: Filters
Affects Versions: 0.94.24, 1.0.1, 0.98.10.1
Reporter: Shuaifeng Zhou
 Fix For: 2.0.0, 1.1.0

 Attachments: 13122-master.patch, 13122.patch


 ColumnRangeFilter:
  when minColumnInclusive is false, none of the cells at the current row and 
 column fit the condition, so the scan should skip to the next column: the 
 return code should be NEXT_COL, not SKIP.
 FamilyFilter is in a similar situation.
 Currently, SKIP does not cause an error, but it is not efficient.





[jira] [Commented] (HBASE-13122) Improve efficiency for return codes of some filters

2015-03-03 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346352#comment-14346352
 ] 

Shuaifeng Zhou commented on HBASE-13122:


Thanks for the review, Ram.
It works the same way when getting data from the first family. Cells from both 
families are passed through the filter.
When getting data from the second family, the scan skips the first family after 
checking one cell. Similarly, when getting data from the first family, it skips 
the second family after checking one cell in it (one check per row).

 Improve efficiency for return codes of some filters
 ---

 Key: HBASE-13122
 URL: https://issues.apache.org/jira/browse/HBASE-13122
 Project: HBase
  Issue Type: Improvement
  Components: Filters
Affects Versions: 0.94.24, 1.0.1, 0.98.10.1
Reporter: Shuaifeng Zhou
 Fix For: 2.0.0, 1.1.0

 Attachments: 13122-master.patch, 13122.patch


 ColumnRangeFilter:
  when minColumnInclusive is false, none of the cells at the current row and 
 column fit the condition, so the scan should skip to the next column: the 
 return code should be NEXT_COL, not SKIP.
 FamilyFilter is in a similar situation.
 Currently, SKIP does not cause an error, but it is not efficient.





[jira] [Commented] (HBASE-13122) Improve efficiency for return codes of some filters

2015-03-03 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346313#comment-14346313
 ] 

Shuaifeng Zhou commented on HBASE-13122:


That's a good point.
Strictly speaking, there should be a return code NEXT_FAMILY, and that is what 
the filter should return. But currently there is no such code. With multiple 
families, NEXT_ROW does the job: the scan jumps to the next family when the 
return code is NEXT_ROW.
Maybe later we should add a NEXT_FAMILY return code, but that is a big change 
...

 Improve efficiency for return codes of some filters
 ---

 Key: HBASE-13122
 URL: https://issues.apache.org/jira/browse/HBASE-13122
 Project: HBase
  Issue Type: Improvement
  Components: Filters
Affects Versions: 0.94.24, 1.0.1, 0.98.10.1
Reporter: Shuaifeng Zhou
 Fix For: 2.0.0, 1.1.0

 Attachments: 13122-master.patch, 13122.patch


 ColumnRangeFilter:
  when minColumnInclusive is false, none of the cells at the current row and 
 column fit the condition, so the scan should skip to the next column: the 
 return code should be NEXT_COL, not SKIP.
 FamilyFilter is in a similar situation.
 Currently, SKIP does not cause an error, but it is not efficient.





[jira] [Commented] (HBASE-13122) Improve efficiency for return codes of some filters

2015-03-03 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346312#comment-14346312
 ] 

Shuaifeng Zhou commented on HBASE-13122:


That's a good point.
Strictly speaking, there should be a return code NEXT_FAMILY, and that is what 
the filter should return. But currently there is no such code. With multiple 
families, NEXT_ROW does the job: the scan jumps to the next family when the 
return code is NEXT_ROW.
Maybe later we should add a NEXT_FAMILY return code, but that is a big change 
...

 Improve efficiency for return codes of some filters
 ---

 Key: HBASE-13122
 URL: https://issues.apache.org/jira/browse/HBASE-13122
 Project: HBase
  Issue Type: Improvement
  Components: Filters
Affects Versions: 0.94.24, 1.0.1, 0.98.10.1
Reporter: Shuaifeng Zhou
 Fix For: 2.0.0, 1.1.0

 Attachments: 13122-master.patch, 13122.patch


 ColumnRangeFilter:
  when minColumnInclusive is false, none of the cells at the current row and 
 column fit the condition, so the scan should skip to the next column: the 
 return code should be NEXT_COL, not SKIP.
 FamilyFilter is in a similar situation.
 Currently, SKIP does not cause an error, but it is not efficient.





[jira] [Commented] (HBASE-13122) Improve efficiency for return codes of some filters

2015-03-03 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346367#comment-14346367
 ] 

Shuaifeng Zhou commented on HBASE-13122:


NEXT_ROW works because there is a region scanner on top of the store scanners, 
and NEXT_ROW affects only the store scanner. 
When one store scanner moves to the next row, the region scanner switches to the 
next store scanner, and that second store scanner continues checking the 
current row. That is why the change is more efficient.
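
The effect described above can be counted in a toy model. The sketch below is 
an assumption-laden simplification, not HBase's scanner code: FamilyFilterCost 
and its parameters are hypothetical, each family is modelled as one store 
scanner holding a fixed number of cells for a single row, and only the number 
of filter checks is counted.

```java
/** Hypothetical cost model: filter checks for one row across its stores. */
final class FamilyFilterCost {
    enum ReturnCode { INCLUDE, SKIP, NEXT_ROW }

    /**
     * Counts how many cells the filter looks at for one row.
     * onMismatch is the code returned for cells of an unwanted family.
     */
    static long cellsChecked(String[] families, int cellsPerFamily,
                             String wantedFamily, ReturnCode onMismatch) {
        long checked = 0;
        // The region scanner walks the store scanners, one per family.
        for (String family : families) {
            for (int i = 0; i < cellsPerFamily; i++) {
                checked++;
                ReturnCode rc = family.equals(wantedFamily)
                    ? ReturnCode.INCLUDE
                    : onMismatch;
                if (rc == ReturnCode.NEXT_ROW) {
                    // This store scanner is done with the row; the next
                    // store scanner still continues on the same row.
                    break;
                }
            }
        }
        return checked;
    }

    public static void main(String[] args) {
        String[] families = {"f1", "f2"};
        System.out.println(cellsChecked(families, 3000, "f2", ReturnCode.SKIP));
        System.out.println(cellsChecked(families, 3000, "f2", ReturnCode.NEXT_ROW));
    }
}
```

In this model, with two families of 3000 cells each, SKIP checks all 6000 
cells of the row, while NEXT_ROW checks one cell of the unwanted family plus 
the 3000 wanted ones.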

 Improve efficiency for return codes of some filters
 ---

 Key: HBASE-13122
 URL: https://issues.apache.org/jira/browse/HBASE-13122
 Project: HBase
  Issue Type: Improvement
  Components: Filters
Affects Versions: 0.94.24, 1.0.1, 0.98.10.1
Reporter: Shuaifeng Zhou
 Fix For: 2.0.0, 1.1.0

 Attachments: 13122-master.patch, 13122.patch


 ColumnRangeFilter:
  when minColumnInclusive is false, none of the cells at the current row and 
 column fit the condition, so the scan should skip to the next column: the 
 return code should be NEXT_COL, not SKIP.
 FamilyFilter is in a similar situation.
 Currently, SKIP does not cause an error, but it is not efficient.





[jira] [Commented] (HBASE-13122) Improve efficiency for return codes of some filters

2015-03-02 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343269#comment-14343269
 ] 

Shuaifeng Zhou commented on HBASE-13122:


We have done a performance test; here are the results:
FamilyFilter:
The test table has two families, each with 3 qualifiers. We put 1 rows into 
the table, and each row and qualifier pair has 1000 versions.
A scan using FamilyFilter got values from the second family, scanning 2000 rows 
and 100 versions of each row and qualifier pair.
With the original FamilyFilter the scan took 309 seconds on average; with the 
improved FamilyFilter it took 38 seconds on average, an improvement of about 
700%.
ColumnRangeFilter:
With the same data but only one family, scanning 1 rows and 1000 versions, the 
original took 68 s on average and the improved version 64 s, a small 
improvement. 
The FamilyFilter change reduces the number of files read, so it improves 
significantly; the ColumnRangeFilter change cannot reduce the files read, so it 
improves only a little.
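
For reference, the ratios behind these timings can be recomputed as follows. 
This is only arithmetic on the four averages quoted above; the class and method 
names are made up for the illustration.

```java
/** Arithmetic on the averages reported above (names are illustrative). */
final class SpeedupMath {
    /** How many times faster the improved run is. */
    static double speedup(double beforeSeconds, double afterSeconds) {
        return beforeSeconds / afterSeconds;
    }

    /** Share of the original run time that was saved, in percent. */
    static double timeSavedPercent(double beforeSeconds, double afterSeconds) {
        return 100.0 * (beforeSeconds - afterSeconds) / beforeSeconds;
    }

    public static void main(String[] args) {
        System.out.printf("FamilyFilter:      %.2fx, %.1f%% time saved%n",
            speedup(309, 38), timeSavedPercent(309, 38));
        System.out.printf("ColumnRangeFilter: %.2fx, %.1f%% time saved%n",
            speedup(68, 64), timeSavedPercent(68, 64));
    }
}
```

309 s against 38 s is a speedup of roughly 8.1 times, which is where the "about 
700%" figure comes from (8.1 times the throughput is about 710% more); 68 s 
against 64 s is a speedup of roughly 1.06 times.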

 Improve efficiency for return codes of some filters
 ---

 Key: HBASE-13122
 URL: https://issues.apache.org/jira/browse/HBASE-13122
 Project: HBase
  Issue Type: Improvement
  Components: Filters
Affects Versions: 0.94.24, 1.0.1, 0.98.10.1
Reporter: Shuaifeng Zhou
 Attachments: 13122-master.patch, 13122.patch


 ColumnRangeFilter:
  when minColumnInclusive is false, none of the cells at the current row and 
 column fit the condition, so the scan should skip to the next column: the 
 return code should be NEXT_COL, not SKIP.
 FamilyFilter is in a similar situation.
 Currently, SKIP does not cause an error, but it is not efficient.





[jira] [Updated] (HBASE-13122) return codes of some filters not efficient

2015-02-26 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-13122:
---
Attachment: 13122-master.patch

patch for master branch attached

 return codes of some filters not efficient
 -

 Key: HBASE-13122
 URL: https://issues.apache.org/jira/browse/HBASE-13122
 Project: HBase
  Issue Type: Improvement
  Components: Filters
Affects Versions: 0.94.24, 1.0.1, 0.98.10.1
Reporter: Shuaifeng Zhou
 Attachments: 13122-master.patch, 13122.patch


 ColumnRangeFilter:
  when minColumnInclusive is false, none of the cells at the current row and 
 column fit the condition, so the scan should skip to the next column: the 
 return code should be NEXT_COL, not SKIP.
 FamilyFilter is in a similar situation.
 Currently, SKIP does not cause an error, but it is not efficient.





[jira] [Created] (HBASE-13122) return codes of some filters not efficient

2015-02-26 Thread Shuaifeng Zhou (JIRA)
Shuaifeng Zhou created HBASE-13122:
--

 Summary: return codes of some filters not efficient
 Key: HBASE-13122
 URL: https://issues.apache.org/jira/browse/HBASE-13122
 Project: HBase
  Issue Type: Improvement
  Components: Filters
Affects Versions: 0.98.10.1, 0.94.24, 1.0.1
Reporter: Shuaifeng Zhou


ColumnRangeFilter:
 when minColumnInclusive is false, none of the cells at the current row and 
column fit the condition, so the scan should skip to the next column: the 
return code should be NEXT_COL, not SKIP.
FamilyFilter is in a similar situation.

Currently, SKIP does not cause an error, but it is not efficient.
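
The lower-bound decision described above might look like the sketch below. It 
is a simplified, self-contained model and not the HBase filter: the class 
ColumnRangeDecision and its local ReturnCode enum are hypothetical, columns are 
plain strings, and returning SEEK_NEXT_USING_HINT for columns before the range 
is an assumption about the surrounding logic that the issue does not state.

```java
/** Hypothetical model of the lower-bound check of a column range filter. */
final class ColumnRangeDecision {
    enum ReturnCode { INCLUDE, NEXT_COL, SEEK_NEXT_USING_HINT }

    static ReturnCode lowerBound(String column, String minColumn,
                                 boolean minColumnInclusive) {
        int cmp = column.compareTo(minColumn);
        if (cmp < 0) {
            // Column sorts before the range (assumed behaviour: seek forward).
            return ReturnCode.SEEK_NEXT_USING_HINT;
        }
        if (cmp == 0 && !minColumnInclusive) {
            // Every remaining cell of this column is excluded, so jump to the
            // next column instead of skipping the cells one by one.
            return ReturnCode.NEXT_COL;
        }
        return ReturnCode.INCLUDE;
    }

    public static void main(String[] args) {
        System.out.println(lowerBound("a", "a", false));
        System.out.println(lowerBound("a", "a", true));
        System.out.println(lowerBound("b", "a", false));
    }
}
```

Returning SKIP in the excluded case would give the same scan result, but the 
filter would then be asked again for every further version of that column.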





[jira] [Updated] (HBASE-13122) return codes of some filters not efficient

2015-02-26 Thread Shuaifeng Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaifeng Zhou updated HBASE-13122:
---
Attachment: 13122.patch

Just change the return code to NEXT_COL in ColumnRangeFilter and to NEXT_ROW 
in FamilyFilter.

 return codes of some filters not efficient
 -

 Key: HBASE-13122
 URL: https://issues.apache.org/jira/browse/HBASE-13122
 Project: HBase
  Issue Type: Improvement
  Components: Filters
Affects Versions: 0.94.24, 1.0.1, 0.98.10.1
Reporter: Shuaifeng Zhou
 Attachments: 13122.patch


 ColumnRangeFilter:
  when minColumnInclusive is false, none of the cells at the current row and 
 column fit the condition, so the scan should skip to the next column: the 
 return code should be NEXT_COL, not SKIP.
 FamilyFilter is in a similar situation.
 Currently, SKIP does not cause an error, but it is not efficient.





[jira] [Commented] (HBASE-12976) Set default value for hbase.client.scanner.max.result.size

2015-02-05 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308474#comment-14308474
 ] 

Shuaifeng Zhou commented on HBASE-12976:


Lars, sorry for the confusion.
I mean that the number of KVs returned to the client by one RPC is determined by 
caching and batching, but the byte size is not controlled. Maybe this parameter 
can help?
Just an idea.
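
To illustrate the idea, here is a sketch of how a row-count limit (caching) and 
a byte limit could bound one RPC together. It is a plain model, not the HBase 
scanner: ScanRpcLimits and its parameters are hypothetical, and checking the 
byte limit after each added row is an assumption about how such a cap would be 
applied.

```java
/** Hypothetical model of one scan RPC bounded by row count and byte size. */
final class ScanRpcLimits {
    /** Returns how many rows go into one RPC under both limits. */
    static int rowsInOneRpc(long[] rowSizesBytes, int caching,
                            long maxResultSizeBytes) {
        int rows = 0;
        long bytes = 0;
        for (long size : rowSizesBytes) {
            if (rows >= caching) {
                break; // row-count limit reached
            }
            rows++;
            bytes += size;
            if (bytes >= maxResultSizeBytes) {
                break; // byte limit reached (assumed to be checked per row)
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        long[] rows = new long[100];
        java.util.Arrays.fill(rows, 1L << 20); // 100 rows of 1 MB each
        System.out.println(rowsInOneRpc(rows, 100, 2L << 20));
        System.out.println(rowsInOneRpc(rows, 100, Long.MAX_VALUE));
    }
}
```

With 1 MB rows, a caching value of 100 and a 2 MB cap, the model returns 2 rows 
per RPC instead of 100, so the cap is what keeps the response small when the 
caching value was chosen too high.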

 Set default value for hbase.client.scanner.max.result.size
 --

 Key: HBASE-12976
 URL: https://issues.apache.org/jira/browse/HBASE-12976
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 2.0.0, 1.0.1, 1.1.0, 0.98.11

 Attachments: 12976-v2.txt, 12976.txt


 Setting scanner caching is somewhat of a black art. It's hard to estimate 
 ahead of time how large the result set will be.
 I propose we set hbase.client.scanner.max.result.size to 2mb. That is a good 
 compromise between performance and buffer usage on typical networks (avoiding 
 OOMs when the caching was chosen too high).
 To an HTable client this is completely transparent.





[jira] [Commented] (HBASE-12976) Default hbase.client.scanner.max.result.size

2015-02-04 Thread Shuaifeng Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306787#comment-14306787
 ] 

Shuaifeng Zhou commented on HBASE-12976:


This setting should work together with batching and caching to control the 
result size.

 Default hbase.client.scanner.max.result.size
 

 Key: HBASE-12976
 URL: https://issues.apache.org/jira/browse/HBASE-12976
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
 Fix For: 2.0.0, 1.0.1, 0.94.27, 0.98.11

 Attachments: 12976.txt


 Setting scanner caching is somewhat of a black art. It's hard to estimate 
 ahead of time how large the result set will be.
 I propose we set hbase.client.scanner.max.result.size to 2mb. That is a good 
 compromise between performance and buffer usage on typical networks (avoiding 
 OOMs when the caching was chosen too high).
 To an HTable client this is completely transparent.


