[jira] [Commented] (HBASE-5846) HBase rpm packing is broken at multiple places

2012-04-20 Thread Shrijeet Paliwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258467#comment-13258467
 ] 

Shrijeet Paliwal commented on HBASE-5846:
-

Here is what happens if one runs an update:

{noformat}
D:   install: %post(hbase-0.92.1-2.x86_64) synchronous scriptlet start
D:   install: %post(hbase-0.92.1-2.x86_64)  execv(/bin/sh) pid 26772
+ /usr/share/hbase/sbin/update-hbase-env.sh --prefix=/usr --bin-dir=/usr/bin 
--conf-dir=/etc/hbase --log-dir=/var/log/hbase --pid-dir=/var/run/hbase
D:   install: waitpid(26772) rc 26772 status 0 secs 0.038
D: == --- hbase-0.92.1-1 x86_64-linux 0x0
D: erase: hbase-0.92.1-1 has 224 files, test = 0
D: erase: %preun(hbase-0.92.1-1.x86_64) asynchronous scriptlet start
D: erase: %preun(hbase-0.92.1-1.x86_64) execv(/bin/sh) pid 26819
+ /usr/share/hbase/sbin/update-hbase-env.sh --prefix=/usr --bin-dir=/usr/bin 
--conf-dir=/etc/hbase --log-dir=/var/log/hbase --pid-dir=/var/run/hbase 
--uninstall
{noformat}

This is the output of rpm -Uvv. Note how the new package's install %post runs, 
followed by the old package's %preun. The %preun then erases all the work that 
was done by the install %post.
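For what it's worth, the usual rpm idiom for this is to make %preun upgrade-aware via its count argument. A sketch (behavior simulated with echo; the update-hbase-env.sh invocation is the one from the log above and would replace the echo in a real scriptlet):

```shell
# Sketch of an upgrade-safe %preun scriptlet. rpm passes the number of
# package instances that will REMAIN as $1: 0 on a true erase, >=1 on an
# upgrade. The uninstall work should only run in the erase case.
preun() {
  if [ "$1" -eq 0 ]; then
    # Real removal: this is where update-hbase-env.sh --uninstall would run.
    echo "uninstall"
  else
    # Upgrade: the new package's %post has already done the setup; skip.
    echo "keep"
  fi
}
preun 0   # prints: uninstall
preun 1   # prints: keep
```

With this guard, 'yum update hbase' would no longer wipe out what the new package's %post just created.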

 HBase rpm packing is broken at multiple places
 --

 Key: HBASE-5846
 URL: https://issues.apache.org/jira/browse/HBASE-5846
 Project: HBase
  Issue Type: Bug
  Components: build
Affects Versions: 0.92.1
 Environment: CentOS release 5.7 (Final)
Reporter: Shrijeet Paliwal

 Here is how I executed rpm build: 
 {noformat}
 MAVEN_OPTS=-Xmx2g mvn clean package assembly:single -Prpm -DskipTests
 {noformat}
 The issues with the rpm build are: 
 * There is no %clean section in the hbase.spec file. A previous run can 
 leave stuff in RPM_BUILD_ROOT, which in turn will fail the build. As a fix I 
 added 'rm -rf $RPM_BUILD_ROOT' in a %clean section.
 * The BuildRoot is set to _build_dir. The build fails with this error: 
 {noformat}
 cp: cannot copy a directory, 
 `/data/9adda425-1f1e-4fe5-8a53-83bd2ce5ad45/app/jenkins/workspace/hbase.92/target/rpm/hbase/BUILD',
  into itself, 
 `/data/9adda425-1f1e-4fe5-8a53-83bd2ce5ad45/app/jenkins/workspace/hbase.92/target/rpm/hbase/BUILD/BUILD'
 {noformat}
 If we set it to '%{_tmppath}/%{name}-%{version}-root' the build passes.
 * The src/packages/update-hbase-env.sh script will leave an inconsistent 
 state if 'yum update hbase' is executed. It deletes the /etc/init.d/hbase* 
 scripts and does not put them back during the update. 
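The first two fixes could look like the following hbase.spec fragments (a sketch; macro names as in a stock RPM setup):

```
# Sketch of the spec-file fixes described above.

# Build into a scratch directory instead of _build_dir, so BUILD is never
# copied into itself:
BuildRoot: %{_tmppath}/%{name}-%{version}-root

# Make builds repeatable by scrubbing leftovers from the previous run:
%clean
rm -rf $RPM_BUILD_ROOT
```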

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3638) If a FS bootstrap, need to also ensure ZK is cleaned

2012-01-10 Thread Shrijeet Paliwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183875#comment-13183875
 ] 

Shrijeet Paliwal commented on HBASE-3638:
-

We just hit this issue today in production. We did not do an FS bootstrap (I 
assume by FS bootstrap you mean cleaning the /hbase directory from HDFS). It 
was a regular day; an RS was throwing not-serving exceptions, so I went ahead 
and restarted it. It was not a META- or ROOT-serving RS. Following this RS 
restart, hbck started reporting holes in regions. 

Later, for some unexplainable, crazy and panicky reason, I restarted the Master 
and all other region servers. This is the point where the master started 
complaining that META is in OPENED state in ZK for a server which no longer 
exists. And like Todd explained in the other Jira, the master went into an 
unending loop. 

The workaround was to clear all files from the ZK data directory. 

What do you think, Stack: can the master pick up a *stale* ZK state which is 
not a leftover from a previous HBase install, in other words a stale state 
created by itself?
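For reference, clearing HBase's stale state from ZooKeeper does not have to mean deleting the whole ZK data directory. A sketch of a narrower cleanup (the /hbase parent znode is the default zookeeper.znode.parent and may differ per install; the `hbase zkcli` wrapper may not exist in older releases):

```
# Narrower alternative to wiping the ZK data directory (paths hypothetical).
# Stop the master first, then remove only HBase's znodes:
#
#   hbase zkcli rmr /hbase               # newer releases bundle a zkcli wrapper
#   # or, with a plain ZooKeeper client:
#   zkCli.sh -server zkhost:2181 rmr /hbase
#
# Deleting the ZK dataDir (as done here) also destroys any non-HBase data
# kept in the same ensemble.
```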

 If a FS bootstrap, need to also ensure ZK is cleaned
 

 Key: HBASE-3638
 URL: https://issues.apache.org/jira/browse/HBASE-3638
 Project: HBase
  Issue Type: Bug
Reporter: stack
Priority: Minor

 In a test environment cycling through start, operate, kill hbase (repeat), we 
 noticed that we were doing a bootstrap on startup but then picking up the 
 previous cycle's zk state. It made for a mess in the test.
 Last thing seen on previous cycle was:
 {code}
 2011-03-11 06:33:36,708 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=X.X.X.60020,1299853933073, 
 region=1028785192/.META.
 {code}
 Then, in the messed up cycle I saw:
 {code}
 2011-03-11 06:42:48,530 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
 BOOTSTRAP: creating ROOT and first META regions
 .
 {code}
 Then after setting watcher on .META., we get a 
 {code}
 2011-03-11 06:42:58,301 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Processing region 
 .META.,,1.1028785192 in state RS_ZK_REGION_OPENED
 2011-03-11 06:42:58,302 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Region in transition 
 1028785192 references a server no longer up X.X.X; letting RIT timeout so 
 will be assigned elsewhere
 {code}
 We're all confused.
 Should at least clear our zk if a bootstrap happened.





[jira] [Commented] (HBASE-3638) If a FS bootstrap, need to also ensure ZK is cleaned

2012-01-10 Thread Shrijeet Paliwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183925#comment-13183925
 ] 

Shrijeet Paliwal commented on HBASE-3638:
-

Here is the relevant portion of the log. 

The master (even if you restart all the HBase services across the cluster) will 
always get stuck in this state: 
{noformat}
2012-01-10 21:28:03,382 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Region in transition 1028785192 references a server no longer up 
txa-18.rfiserve.net,60020,1326125886539; letting RIT timeout so will be 
assigned elsewhere
2012-01-10 21:28:06,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Regions in transition timed out:  .META.,,1.1028785192 state=OPENING, 
ts=1326241230066
2012-01-10 21:28:06,788 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Region has been OPENING for too long, reassigning region=.META.,,1.1028785192
2012-01-10 21:28:16,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Regions in transition timed out:  .META.,,1.1028785192 state=OPENING, 
ts=1326241230066
2012-01-10 21:28:16,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Region has been OPENING for too long, reassigning region=.META.,,1.1028785192
2012-01-10 21:28:26,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Regions in transition timed out:  .META.,,1.1028785192 state=OPENING, 
ts=1326241230066
2012-01-10 21:28:26,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Region has been OPENING for too long, reassigning region=.META.,,1.1028785192
2012-01-10 21:28:36,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Regions in transition timed out:  .META.,,1.1028785192 state=OPENING, 
ts=1326241230066
2012-01-10 21:28:36,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Region has been OPENING for too long, reassigning region=.META.,,1.1028785192
2012-01-10 21:28:46,788 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Regions in transition timed out:  .META.,,1.1028785192 state=OPENING, 
ts=1326241230066
2012-01-10 21:28:46,788 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Region has been OPENING for too long, reassigning region=.META.,,1.1028785192
2012-01-10 21:28:56,788 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Regions in transition timed out:  .META.,,1.1028785192 state=OPENING, 
ts=1326241230066
{noformat}


bq. What do you think Stack, can master pick a stale ZK state which is not a 
leftover from previous HBase install, in other words a stale state created by 
itself?

By this I was referring to a comment made by Todd in the related jira, when he 
said:

bq. Notably, it wasn't clearing ZK between runs. So some leftover RIT data from 
a previous HBase incarnation may be confusing this one's master.

He floated one possibility: leftover RIT from a previous incarnation. I am 
wondering what other possibilities there are. 

 If a FS bootstrap, need to also ensure ZK is cleaned
 

 Key: HBASE-3638
 URL: https://issues.apache.org/jira/browse/HBASE-3638
 Project: HBase
  Issue Type: Bug
Reporter: stack
Priority: Minor

 In a test environment cycling through start, operate, kill hbase (repeat), we 
 noticed that we were doing a bootstrap on startup but then picking up the 
 previous cycle's zk state. It made for a mess in the test.
 Last thing seen on previous cycle was:
 {code}
 2011-03-11 06:33:36,708 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=X.X.X.60020,1299853933073, 
 region=1028785192/.META.
 {code}
 Then, in the messed up cycle I saw:
 {code}
 2011-03-11 06:42:48,530 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
 BOOTSTRAP: creating ROOT and first META regions
 .
 {code}
 Then after setting watcher on .META., we get a 
 {code}
 2011-03-11 06:42:58,301 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Processing region 
 .META.,,1.1028785192 in state RS_ZK_REGION_OPENED
 2011-03-11 06:42:58,302 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Region in transition 
 1028785192 references a server no longer up X.X.X; letting RIT timeout so 
 will be assigned elsewhere
 {code}
 We're all confused.
 Should at least clear our zk if a bootstrap happened.





[jira] [Commented] (HBASE-5041) Major compaction on non existing table does not throw error

2012-01-04 Thread Shrijeet Paliwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179907#comment-13179907
 ] 

Shrijeet Paliwal commented on HBASE-5041:
-

{code}
 mvn clean compile test -Dtest=TestReplication
{code}

The above passes without error on branch 0.90 on my dev machine. 

-Shrijeet

 Major compaction on non existing table does not throw error 
 

 Key: HBASE-5041
 URL: https://issues.apache.org/jira/browse/HBASE-5041
 Project: HBase
  Issue Type: Bug
  Components: regionserver, shell
Affects Versions: 0.90.3
Reporter: Shrijeet Paliwal
Assignee: Shrijeet Paliwal
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 
 0002-HBASE-5041-Throw-error-if-table-does-not-exist.patch, 
 0002-HBASE-5041-Throw-error-if-table-does-not-exist.patch, 
 0003-HBASE-5041-Throw-error-if-table-does-not-exist.0.90.patch


 The following will not complain even if fubar does not exist
 {code}
 echo major_compact 'fubar' | $HBASE_HOME/bin/hbase shell
 {code}
 The downside for this defect is that major compaction may be skipped due to
 a typo by Ops.





[jira] [Commented] (HBASE-5041) Major compaction on non existing table does not throw error

2011-12-30 Thread Shrijeet Paliwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177849#comment-13177849
 ] 

Shrijeet Paliwal commented on HBASE-5041:
-

I will update this Jira with a new patch after the holidays.

 Major compaction on non existing table does not throw error 
 

 Key: HBASE-5041
 URL: https://issues.apache.org/jira/browse/HBASE-5041
 Project: HBase
  Issue Type: Bug
  Components: regionserver, shell
Affects Versions: 0.90.3
Reporter: Shrijeet Paliwal
Assignee: Shrijeet Paliwal
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 0001-HBASE-5041-Throw-error-if-table-does-not-exist.patch


 The following will not complain even if fubar does not exist
 {code}
 echo major_compact 'fubar' | $HBASE_HOME/bin/hbase shell
 {code}
 The downside for this defect is that major compaction may be skipped due to
 a typo by Ops.





[jira] [Commented] (HBASE-5041) Major compaction on non existing table does not throw error

2011-12-22 Thread Shrijeet Paliwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175132#comment-13175132
 ] 

Shrijeet Paliwal commented on HBASE-5041:
-

Our logic to check whether the name is a region name or a table name is 
designed as follows (tl;dr: if it is not an existing table, it should be a 
region): 

{code}
 /**
   * @param tableNameOrRegionName Name of a table or name of a region.
   * @return True if <code>tableNameOrRegionName</code> is *possibly* a region
   * name, else false if a verified table name (we call {@link #tableExists(byte[])});
   * else we throw an exception.
   * @throws IOException
   */
  private boolean isRegionName(final byte [] tableNameOrRegionName)
  throws IOException {
    if (tableNameOrRegionName == null) {
      throw new IllegalArgumentException("Pass a table name or region name");
    }
    return !tableExists(tableNameOrRegionName);
  }
{code}

My plan was to modify the majorCompact function's else block to check if the 
table exists and throw TableNotFoundException if it does not. But because of 
the name-check logic above, one will never reach the 'else' part, and a 
compaction request will be registered on the assumption that the name must be 
a region. 
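A self-contained sketch of the tightened check (stubbed with in-memory sets; the real code consults tableExists() and META, and the class, set, and exception names here are illustrative only):

```java
import java.util.Set;

// Sketch: resolve the argument to a known region or a known table, and fail
// loudly if it is neither, instead of assuming "not a table == a region".
public class CompactionTargetCheck {
    // Stand-ins for tableExists() and a META lookup (hypothetical data).
    static final Set<String> TABLES  = Set.of("users", "events");
    static final Set<String> REGIONS = Set.of("users,,1234567890");

    /** Returns true if the name is a region, false if a verified table. */
    static boolean isRegionName(String tableNameOrRegionName) {
        if (tableNameOrRegionName == null) {
            throw new IllegalArgumentException("Pass a table name or region name");
        }
        if (REGIONS.contains(tableNameOrRegionName)) return true;
        if (TABLES.contains(tableNameOrRegionName)) return false;
        // Neither a region nor an existing table: reject, mirroring the
        // TableNotFoundException the patch introduces.
        throw new IllegalArgumentException("Table not found: " + tableNameOrRegionName);
    }

    public static void main(String[] args) {
        System.out.println(isRegionName("users,,1234567890")); // true
        System.out.println(isRegionName("users"));             // false
        try {
            isRegionName("fubar");                             // typo by Ops
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With this shape, a typo like 'fubar' surfaces as an error instead of silently registering a compaction request for a nonexistent region.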

 Major compaction on non existing table does not throw error 
 

 Key: HBASE-5041
 URL: https://issues.apache.org/jira/browse/HBASE-5041
 Project: HBase
  Issue Type: Bug
  Components: regionserver, shell
Affects Versions: 0.90.3
Reporter: Shrijeet Paliwal

 The following will not complain even if fubar does not exist
 {code}
 echo major_compact 'fubar' | $HBASE_HOME/bin/hbase shell
 {code}
 The downside for this defect is that major compaction may be skipped due to
 a typo by Ops.





[jira] [Commented] (HBASE-5041) Major compaction on non existing table does not throw error

2011-12-22 Thread Shrijeet Paliwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175264#comment-13175264
 ] 

Shrijeet Paliwal commented on HBASE-5041:
-

@Ted, will add a unit test and upload a new one on top of trunk. 

@Ram, thanks for commenting. Do you mean to say isRegionName should throw an 
exception? I wanted to keep the semantics the same as before: it tells whether 
the name argument 'appears' to be a region name or not. When 
MetaReader.getRegion returns null we know one thing for sure: it is not a 
region. Determining if it's a valid table is left to the caller, depending on 
need.

Did you mean something else?

 Major compaction on non existing table does not throw error 
 

 Key: HBASE-5041
 URL: https://issues.apache.org/jira/browse/HBASE-5041
 Project: HBase
  Issue Type: Bug
  Components: regionserver, shell
Affects Versions: 0.90.3
Reporter: Shrijeet Paliwal
Assignee: Shrijeet Paliwal
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 0001-HBASE-5041-Throw-error-if-table-does-not-exist.patch


 The following will not complain even if fubar does not exist
 {code}
 echo major_compact 'fubar' | $HBASE_HOME/bin/hbase shell
 {code}
 The downside for this defect is that major compaction may be skipped due to
 a typo by Ops.





[jira] [Commented] (HBASE-5041) Major compaction on non existing table does not throw error

2011-12-22 Thread Shrijeet Paliwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175287#comment-13175287
 ] 

Shrijeet Paliwal commented on HBASE-5041:
-

@Stack
{quote}
I think the patch is doing the right thing. It's changing the contract for 
isRegionName, but this is a private method and you are tightening what was 
previously a sloppy contract; it looks too like all instances of isRegionName 
can benefit from this tightening (is this your thought, Shrijeet?).
{quote}
Yes that is the idea. 

{quote}
You might make a method that returns a String tablename for a table you know 
exists (else it throws the TNFE).
{quote}
Makes sense, will do.

{quote}
We are creating a new CatalogTracker instance. No one seems to be shutting it 
down? Is that a prob?
{quote}
I did not understand this one, Stack. cleanupCatalogTracker, called in finally, 
will stop the CatalogTracker, no? 


 Major compaction on non existing table does not throw error 
 

 Key: HBASE-5041
 URL: https://issues.apache.org/jira/browse/HBASE-5041
 Project: HBase
  Issue Type: Bug
  Components: regionserver, shell
Affects Versions: 0.90.3
Reporter: Shrijeet Paliwal
Assignee: Shrijeet Paliwal
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 0001-HBASE-5041-Throw-error-if-table-does-not-exist.patch


 The following will not complain even if fubar does not exist
 {code}
 echo major_compact 'fubar' | $HBASE_HOME/bin/hbase shell
 {code}
 The downside for this defect is that major compaction may be skipped due to
 a typo by Ops.





[jira] [Commented] (HBASE-5035) Runtime exceptions during meta scan

2011-12-20 Thread Shrijeet Paliwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173697#comment-13173697
 ] 

Shrijeet Paliwal commented on HBASE-5035:
-

Ted, you had mentioned the following in the email thread: 

bq. Null check for regionInfo should be added 

I could not gather why regionInfo could possibly be null. The call 
'Writables.getHRegionInfo(value)' does not seem to ever return null. Could you 
please explain your reasoning. 

Meanwhile I am still reading the code and trying to find the place where the 
NPE might occur.  

 Runtime exceptions during meta scan
 ---

 Key: HBASE-5035
 URL: https://issues.apache.org/jira/browse/HBASE-5035
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.3
Reporter: Shrijeet Paliwal

 Version: 0.90.3 + patches backported 
 The other day our client started spitting these two runtime exceptions. Not 
 all clients connected to the cluster were under impact. Only 4 of them. While 
 3 of them were throwing NPE, one of them was throwing 
 ArrayIndexOutOfBoundsException. The errors are : 
 1. http://pastie.org/2987926
 2. http://pastie.org/2987927
 Clients did not recover from this and I had to restart them. 
 Motive of this jira is to identify and put null checks at appropriate places. 
 Also, with the given stack trace I cannot tell which line caused the NPE or 
 AIOBE; hence an additional motive is to make the trace more helpful. 





[jira] [Commented] (HBASE-5035) Runtime exceptions during meta scan

2011-12-20 Thread Shrijeet Paliwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173780#comment-13173780
 ] 

Shrijeet Paliwal commented on HBASE-5035:
-

Hmm, you might be right. 

{noformat}
final String serverAddress = Bytes.toString(value);

// instantiate the location
HRegionLocation loc = new HRegionLocation(regionInfo,
new HServerAddress(serverAddress));
{noformat}

The Bytes.toString call, in theory, may return either an empty string or null. 
In the case when it returns null (see below), it tries to log an error, which I 
did not see in my log file. So I am still not 100% sure this is our guy. 
{noformat}
 try {
  return new String(b, off, len, HConstants.UTF8_ENCODING);
} catch (UnsupportedEncodingException e) {
  LOG.error("UTF-8 not supported?", e);
  return null;
}
{noformat}

Nonetheless it will be good to check the serverAddress variable for emptiness 
as well as nullness, since the HServerAddress constructor may throw a runtime 
error otherwise. The interesting point is that it can throw both 
ArrayIndexOutOfBoundsException and NPE, and I saw both cases.

{noformat}
/**
  * @param hostAndPort Hostname and port formatted as <code>&lt;hostname&gt; ':' &lt;port&gt;</code>
  */
 public HServerAddress(String hostAndPort) {
   int colonIndex = hostAndPort.lastIndexOf(':');
{noformat}
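A minimal, self-contained sketch of such a guard (the class name, parsing, and error messages are illustrative, not HBase's actual code):

```java
import java.nio.charset.StandardCharsets;

// Sketch: validate the bytes read from .META. before handing them to an
// address parser, instead of letting the parser throw an NPE or
// ArrayIndexOutOfBoundsException on empty/null input.
public class ServerAddressGuard {
    static String parseHostPort(byte[] value) {
        String serverAddress =
            (value == null) ? null : new String(value, StandardCharsets.UTF_8);
        if (serverAddress == null || serverAddress.isEmpty()) {
            // Fail with a descriptive message instead of a bare NPE/AIOBE.
            throw new IllegalStateException(
                ".META. row has no server address; region likely unassigned");
        }
        int colonIndex = serverAddress.lastIndexOf(':');
        if (colonIndex < 0) {
            throw new IllegalStateException("Malformed server address: " + serverAddress);
        }
        return serverAddress;
    }

    public static void main(String[] args) {
        System.out.println(parseHostPort("host1.example.com:60020".getBytes(StandardCharsets.UTF_8)));
        try {
            parseHostPort(new byte[0]); // empty value from .META.
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```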


I will open a subtask to make the trace more helpful. 

 Runtime exceptions during meta scan
 ---

 Key: HBASE-5035
 URL: https://issues.apache.org/jira/browse/HBASE-5035
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.3
Reporter: Shrijeet Paliwal

 Version: 0.90.3 + patches backported 
 The other day our client started spitting these two runtime exceptions. Not 
 all clients connected to the cluster were under impact. Only 4 of them. While 
 3 of them were throwing NPE, one of them was throwing 
 ArrayIndexOutOfBoundsException. The errors are : 
 1. http://pastie.org/2987926
 2. http://pastie.org/2987927
 Clients did not recover from this and I had to restart them. 
 Motive of this jira is to identify and put null checks at appropriate places. 
 Also, with the given stack trace I cannot tell which line caused the NPE or 
 AIOBE; hence an additional motive is to make the trace more helpful. 





[jira] [Commented] (HBASE-5035) Runtime exceptions during meta scan

2011-12-14 Thread Shrijeet Paliwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169955#comment-13169955
 ] 

Shrijeet Paliwal commented on HBASE-5035:
-

Here is the patched HCM: https://gist.github.com/1478070 ; it can be used to 
match line numbers.  

 Runtime exceptions during meta scan
 ---

 Key: HBASE-5035
 URL: https://issues.apache.org/jira/browse/HBASE-5035
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.3
Reporter: Shrijeet Paliwal

 Version: 0.90.3 + patches back ported 
 The other day our client started spitting these two runtime exceptions. Not 
 all clients connected to the cluster were under impact. Only 4 of them. While 
 3 of them were throwing NPE, one of them was throwing 
 ArrayIndexOutOfBoundsException. The errors are : 
 1. http://pastie.org/2987926
 2. http://pastie.org/2987927
 Clients did not recover from this and I had to restart them. 
 Motive of this jira is to identify and put null checks at appropriate places. 
 Also, with the given stack trace I cannot tell which line caused the NPE or 
 AIOBE; hence an additional motive is to make the trace more helpful. 





[jira] [Commented] (HBASE-4980) Null pointer exception in HBaseClient receiveResponse

2011-12-08 Thread Shrijeet Paliwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13165406#comment-13165406
 ] 

Shrijeet Paliwal commented on HBASE-4980:
-

Done attaching. Should I click 'cancel patch' and then click 'submit patch' 
again?

 Null pointer exception in HBaseClient receiveResponse
 -

 Key: HBASE-4980
 URL: https://issues.apache.org/jira/browse/HBASE-4980
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.92.0
Reporter: Shrijeet Paliwal
  Labels: newbie
 Attachments: 
 0001-HBASE-4980-Fix-NPE-in-HBaseClient-receiveResponse.patch, 
 0002-HBASE-4980-Fix-NPE-in-HBaseClient-receiveResponse.patch, 
 0003-HBASE-4980-Fix-NPE-in-HBaseClient-receiveResponse.patch


 Relevant Stack trace: 
 2011-11-30 13:10:26,557 [IPC Client (47) connection to 
 xx.xx.xx/172.22.4.68:60020 from an unknown user] WARN  
 org.apache.hadoop.ipc.HBaseClient - Unexpected exception receiving call 
 responses
 java.lang.NullPointerException
 at 
 org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:583)
 at 
 org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:511)
 {code}
   if (LOG.isDebugEnabled())
   LOG.debug(getName() + " got value #" + id);
 Call call = calls.remove(id);
 // Read the flag byte
 byte flag = in.readByte();
 boolean isError = ResponseFlag.isError(flag);
 if (ResponseFlag.isLength(flag)) {
   // Currently length if present is unused.
   in.readInt();
 }
 int state = in.readInt(); // Read the state.  Currently unused.
 if (isError) {
   //noinspection ThrowableInstanceNeverThrown
   call.setException(new RemoteException( WritableUtils.readString(in),
   WritableUtils.readString(in)));
 } else {
 {code}
 This line {{Call call = calls.remove(id);}} may return a null 
 'call'. This is because, if you have rpc timeout enabled, we proactively clean 
 up other calls which have exceeded their lifetime along with the call for 
 which the socket timeout exception happened.
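A stripped-down sketch of that race and the null guard (stub Call type; class and method names illustrative, not HBaseClient's actual code):

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch: with rpc timeouts enabled, a timed-out call may already have been
// purged from the map, so calls.remove(id) can return null and must be
// guarded before use.
public class ReceiveResponseGuard {
    static class Call {
        Exception error;
        void setException(Exception e) { error = e; }
    }

    static final ConcurrentHashMap<Integer, Call> calls = new ConcurrentHashMap<>();

    /** Returns true if a live call handled the response, false if it was gone. */
    static boolean handleResponse(int id, boolean isError) {
        Call call = calls.remove(id);
        if (call == null) {
            // Call was already cleaned up by the timeout sweep; skip the
            // response instead of dereferencing null.
            return false;
        }
        if (isError) {
            call.setException(new Exception("remote error"));
        }
        return true;
    }

    public static void main(String[] args) {
        calls.put(1, new Call());
        System.out.println(handleResponse(1, false)); // true: call was present
        System.out.println(handleResponse(1, false)); // false: already removed
    }
}
```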





[jira] [Commented] (HBASE-4633) Potential memory leak in client RPC timeout mechanism

2011-12-04 Thread Shrijeet Paliwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162475#comment-13162475
 ] 

Shrijeet Paliwal commented on HBASE-4633:
-

Recent updates: 
* In my case the leak/memory-hold is not in the HBase client; I could not find 
enough evidence to conclude that it is. What I did find is that our application 
holds one heavy object in memory. This object is shared between threads. Every 
N minutes the application creates a new instance of this class. Unless a thread 
is still holding on to an old instance, all old instances are GCed in time. 
Hence, in theory, at any time there should be only one active instance of the 
heavy object. 

* Under heavy load, with client operation RPC timeout enabled, some threads get 
stuck. This causes multiple instances of the heavy object; in turn the heap 
grows. 

After reading the client code multiple times I cannot gather why an application 
thread would get stuck for several minutes. We have safeguards to clean up 
calls 'forcefully' if they have been alive for more than the rpc timeout 
interval. 

I had planned to update the title of this Jira to reflect the above finding, 
but Gaojinchao observed something interesting at his end, so I am keeping the 
title the same for now. Gaojinchao's thread is here: 
http://search-hadoop.com/m/teczL8KvcH


 Potential memory leak in client RPC timeout mechanism
 -

 Key: HBASE-4633
 URL: https://issues.apache.org/jira/browse/HBASE-4633
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.3
 Environment: HBase version: 0.90.3 + Patches , Hadoop version: CDH3u0
Reporter: Shrijeet Paliwal

 Relevant Jiras: https://issues.apache.org/jira/browse/HBASE-2937,
 https://issues.apache.org/jira/browse/HBASE-4003
 We have been using the 'hbase.client.operation.timeout' knob
 introduced in 2937 for quite some time now. It helps us enforce SLA.
 We have two HBase clusters and two HBase client clusters. One of them
 is much busier than the other.
 We have seen a deterministic behavior of clients running in busy
 cluster. Their (client's) memory footprint increases consistently
 after they have been up for roughly 24 hours.
 This memory footprint almost doubles from its usual value (usual case
 == RPC timeout disabled). After much investigation nothing concrete
 came out and we had to put in a hack
 which keeps heap size in control even when RPC timeout is enabled. Also
 note, the same behavior is not observed in the 'not so busy'
 cluster.
 The patch is here : https://gist.github.com/1288023





[jira] [Commented] (HBASE-4633) Potential memory leak in client RPC timeout mechanism

2011-10-20 Thread Shrijeet Paliwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131677#comment-13131677
 ] 

Shrijeet Paliwal commented on HBASE-4633:
-

@Liyin, 
Are you using RPC timeouts for client operations? 

bq. But Not sure the leak comes from HBase Client jar itself or just our client 
code. 
In the absence of concrete evidence that the leak is indeed in the HBase client 
jar, I have a similar feeling. It could be in our application layer. 

bq. Our symptom is that the memory footprint will increase as time. But the 
actual heap size of the client is not increasing.
We observe the used memory using a collectd plugin 
http://collectd.org/wiki/index.php/Plugin:Memory

bq. So I am very interested to know when you have keep the heap size in 
control, is the memory leaking solved ?
We run with max and min memory set as -Xmx2{X}G -Xms{X}G. And when the 'leak' 
happens, the plugin shows the used memory touching the 2X value, so it does 
seem the heap size is increasing. Correct me here if I am mistaken. 

Let me know if you need more inputs. 

 Potential memory leak in client RPC timeout mechanism
 -

 Key: HBASE-4633
 URL: https://issues.apache.org/jira/browse/HBASE-4633
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.3
 Environment: HBase version: 0.90.3 + Patches , Hadoop version: CDH3u0
Reporter: Shrijeet Paliwal

 Relevant Jiras: https://issues.apache.org/jira/browse/HBASE-2937,
 https://issues.apache.org/jira/browse/HBASE-4003
 We have been using the 'hbase.client.operation.timeout' knob
 introduced in 2937 for quite some time now. It helps us enforce SLA.
 We have two HBase clusters and two HBase client clusters. One of them
 is much busier than the other.
 We have seen a deterministic behavior of clients running in busy
 cluster. Their (client's) memory footprint increases consistently
 after they have been up for roughly 24 hours.
 This memory footprint almost doubles from its usual value (usual case
 == RPC timeout disabled). After much investigation nothing concrete
 came out and we had to put in a hack
 which keeps heap size in control even when RPC timeout is enabled. Also
 note, the same behavior is not observed in the 'not so busy'
 cluster.
 The patch is here : https://gist.github.com/1288023





[jira] [Commented] (HBASE-4633) Potential memory leak in client RPC timeout mechanism

2011-10-20 Thread Shrijeet Paliwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131786#comment-13131786
 ] 

Shrijeet Paliwal commented on HBASE-4633:
-

@Stack
No, we did not run with that flag. Also, we never got to a point where the 
application had to die because of OOM. The reasons (I guess) are:
# We have GC flags to do garbage collection as fast as possible. 
# The monitoring in place starts sending out alerts, and we usually shoot the 
server in the head before it OOMs.
# The load balancer kicks in and stops sending work to the application server, 
realizing it is in a bad state. 

As mentioned earlier, I have found it hard to reproduce in a dev environment, 
failing to simulate production-like load. But I must try again.

 Potential memory leak in client RPC timeout mechanism
 -

 Key: HBASE-4633
 URL: https://issues.apache.org/jira/browse/HBASE-4633
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.3
 Environment: HBase version: 0.90.3 + Patches , Hadoop version: CDH3u0
Reporter: Shrijeet Paliwal

 Relevant Jiras: https://issues.apache.org/jira/browse/HBASE-2937,
 https://issues.apache.org/jira/browse/HBASE-4003
 We have been using the 'hbase.client.operation.timeout' knob
 introduced in 2937 for quite some time now. It helps us enforce SLA.
 We have two HBase clusters and two HBase client clusters. One of them
 is much busier than the other.
 We have seen a deterministic behavior of clients running in busy
 cluster. Their (client's) memory footprint increases consistently
 after they have been up for roughly 24 hours.
 This memory footprint almost doubles from its usual value (usual case
 == RPC timeout disabled). After much investigation nothing concrete
 came out and we had to put in a hack
 which keeps heap size in control even when RPC timeout is enabled. Also
 note, the same behavior is not observed in the 'not so busy'
 cluster.
 The patch is here : https://gist.github.com/1288023
