[jira] [Commented] (HBASE-5846) HBase rpm packing is broken at multiple places
[ https://issues.apache.org/jira/browse/HBASE-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13258467#comment-13258467 ] Shrijeet Paliwal commented on HBASE-5846: - Here is what happens if one runs an update:
{noformat}
D: install: %post(hbase-0.92.1-2.x86_64) synchronous scriptlet start
D: install: %post(hbase-0.92.1-2.x86_64) execv(/bin/sh) pid 26772
+ /usr/share/hbase/sbin/update-hbase-env.sh --prefix=/usr --bin-dir=/usr/bin --conf-dir=/etc/hbase --log-dir=/var/log/hbase --pid-dir=/var/run/hbase
D: install: waitpid(26772) rc 26772 status 0 secs 0.038
D: == --- hbase-0.92.1-1 x86_64-linux 0x0
D: erase: hbase-0.92.1-1 has 224 files, test = 0
D: erase: %preun(hbase-0.92.1-1.x86_64) asynchronous scriptlet start
D: erase: %preun(hbase-0.92.1-1.x86_64) execv(/bin/sh) pid 26819
+ /usr/share/hbase/sbin/update-hbase-env.sh --prefix=/usr --bin-dir=/usr/bin --conf-dir=/etc/hbase --log-dir=/var/log/hbase --pid-dir=/var/run/hbase --uninstall
{noformat}
This is the output of rpm -Uvv. Note how the install %post runs, followed by %preun: %preun erases all the work that the install %post just did.

HBase rpm packing is broken at multiple places -- Key: HBASE-5846 URL: https://issues.apache.org/jira/browse/HBASE-5846 Project: HBase Issue Type: Bug Components: build Affects Versions: 0.92.1 Environment: CentOS release 5.7 (Final) Reporter: Shrijeet Paliwal

Here is how I executed the rpm build:
{noformat}
MAVEN_OPTS=-Xmx2g mvn clean package assembly:single -Prpm -DskipTests
{noformat}
The issues with the rpm build are:
* There is no %clean section in the hbase.spec file. A previous run can leave files in RPM_BUILD_ROOT, which in turn will fail the build. As a fix I added 'rm -rf $RPM_BUILD_ROOT' to a %clean section.
* The BuildRoot is set to _build_dir. The build fails with this error.
{noformat}
cp: cannot copy a directory, `/data/9adda425-1f1e-4fe5-8a53-83bd2ce5ad45/app/jenkins/workspace/hbase.92/target/rpm/hbase/BUILD', into itself, `/data/9adda425-1f1e-4fe5-8a53-83bd2ce5ad45/app/jenkins/workspace/hbase.92/target/rpm/hbase/BUILD/BUILD'
{noformat}
If we set it to '%{_tmppath}/%{name}-%{version}-root' the build passes.
* The src/packages/update-hbase-env.sh script will leave an inconsistent state if 'yum update hbase' is executed. It deletes the /etc/init.d/hbase* scripts and does not put them back during the update.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
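The two spec-file fixes described above can be sketched as follows. This is a hypothetical excerpt illustrating the shape of the change, not the committed hbase.spec:

{noformat}
# Sketch of the proposed hbase.spec fixes (illustrative only)

# Fix 2: build into a scratch root instead of _build_dir, so the BUILD
# directory is never copied into itself
BuildRoot: %{_tmppath}/%{name}-%{version}-root

# Fix 1: remove a previous run's leftovers so they cannot fail the next build
%clean
rm -rf $RPM_BUILD_ROOT
{noformat}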
[jira] [Commented] (HBASE-3638) If a FS bootstrap, need to also ensure ZK is cleaned
[ https://issues.apache.org/jira/browse/HBASE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13183875#comment-13183875 ] Shrijeet Paliwal commented on HBASE-3638: - We just hit this issue today in production. We did not do an FS bootstrap (I assume by FS bootstrap you mean cleaning the /hbase directory from HDFS). It was a regular day; an RS was throwing 'not serving' exceptions and I went ahead and restarted it. It was not a META- or ROOT-serving RS. Following this RS restart, hbck started reporting holes in regions. Later, for some unexplainable, crazy and panicky reason, I restarted the Master and all the other region servers. This is the point where the master started complaining that META is in OPENED state in ZK for a server which no longer exists. And as Todd explained in the other Jira, the master went into an unending loop. The workaround was to clear all files from the ZK data directory. What do you think Stack, can the master pick up a *stale* ZK state which is not a leftover from a previous HBase install, in other words a stale state created by itself?

If a FS bootstrap, need to also ensure ZK is cleaned Key: HBASE-3638 URL: https://issues.apache.org/jira/browse/HBASE-3638 Project: HBase Issue Type: Bug Reporter: stack Priority: Minor

In a test environment with a cycle of start, operate, kill hbase (repeat), we noticed that we were doing a bootstrap on startup but then picking up the previous cycle's ZK state. It made for a mess in the test. The last thing seen on the previous cycle was:
{code}
2011-03-11 06:33:36,708 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=X.X.X.60020,1299853933073, region=1028785192/.META.
{code}
Then, in the messed-up cycle, I saw:
{code}
2011-03-11 06:42:48,530 INFO org.apache.hadoop.hbase.master.MasterFileSystem: BOOTSTRAP: creating ROOT and first META regions .
{code}
Then, after setting a watcher on .META., we get:
{code}
2011-03-11 06:42:58,301 INFO org.apache.hadoop.hbase.master.AssignmentManager: Processing region .META.,,1.1028785192 in state RS_ZK_REGION_OPENED
2011-03-11 06:42:58,302 WARN org.apache.hadoop.hbase.master.AssignmentManager: Region in transition 1028785192 references a server no longer up X.X.X; letting RIT timeout so will be assigned elsewhere
{code}
We're all confused. We should at least clear our ZK state if a bootstrap happened.
[jira] [Commented] (HBASE-3638) If a FS bootstrap, need to also ensure ZK is cleaned
[ https://issues.apache.org/jira/browse/HBASE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13183925#comment-13183925 ] Shrijeet Paliwal commented on HBASE-3638: - Here is the relevant portion of the log. The master (even if you restart all the HBase services across the cluster) will always get stuck in this state.
{noformat}
2012-01-10 21:28:03,382 WARN org.apache.hadoop.hbase.master.AssignmentManager: Region in transition 1028785192 references a server no longer up txa-18.rfiserve.net,60020,1326125886539; letting RIT timeout so will be assigned elsewhere
2012-01-10 21:28:06,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: .META.,,1.1028785192 state=OPENING, ts=1326241230066
2012-01-10 21:28:06,788 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING for too long, reassigning region=.META.,,1.1028785192
2012-01-10 21:28:16,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: .META.,,1.1028785192 state=OPENING, ts=1326241230066
2012-01-10 21:28:16,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING for too long, reassigning region=.META.,,1.1028785192
2012-01-10 21:28:26,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: .META.,,1.1028785192 state=OPENING, ts=1326241230066
2012-01-10 21:28:26,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING for too long, reassigning region=.META.,,1.1028785192
2012-01-10 21:28:36,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: .META.,,1.1028785192 state=OPENING, ts=1326241230066
2012-01-10 21:28:36,787 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING for too long, reassigning region=.META.,,1.1028785192
2012-01-10 21:28:46,788 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
out: .META.,,1.1028785192 state=OPENING, ts=1326241230066
2012-01-10 21:28:46,788 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING for too long, reassigning region=.META.,,1.1028785192
2012-01-10 21:28:56,788 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: .META.,,1.1028785192 state=OPENING, ts=1326241230066
{noformat}
bq. What do you think Stack, can master pick a stale ZK state which is not a leftover from previous HBase install, in other words a stale state created by itself?
By this I was referring to the comment Todd made in the related jira, when he said:
bq. Notably, it wasn't clearing ZK between runs. So some leftover RIT data from a previous HBase incarnation may be confusing this one's master.
He floated one possibility: leftover RIT data from a previous incarnation. I am wondering what other possibilities there are.
[jira] [Commented] (HBASE-5041) Major compaction on non existing table does not throw error
[ https://issues.apache.org/jira/browse/HBASE-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13179907#comment-13179907 ] Shrijeet Paliwal commented on HBASE-5041: - {code} mvn clean compile test -Dtest=TestReplication {code} The above passes without error for branch 0.90 on my dev machine. -Shrijeet

Major compaction on non existing table does not throw error Key: HBASE-5041 URL: https://issues.apache.org/jira/browse/HBASE-5041 Project: HBase Issue Type: Bug Components: regionserver, shell Affects Versions: 0.90.3 Reporter: Shrijeet Paliwal Assignee: Shrijeet Paliwal Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 0002-HBASE-5041-Throw-error-if-table-does-not-exist.patch, 0002-HBASE-5041-Throw-error-if-table-does-not-exist.patch, 0003-HBASE-5041-Throw-error-if-table-does-not-exist.0.90.patch

The following will not complain even if fubar does not exist:
{code}
echo "major_compact 'fubar'" | $HBASE_HOME/bin/hbase shell
{code}
The downside of this defect is that a major compaction may be skipped due to a typo by Ops.
[jira] [Commented] (HBASE-5041) Major compaction on non existing table does not throw error
[ https://issues.apache.org/jira/browse/HBASE-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177849#comment-13177849 ] Shrijeet Paliwal commented on HBASE-5041: - I will update this Jira with a new patch after the holidays.
[jira] [Commented] (HBASE-5041) Major compaction on non existing table does not throw error
[ https://issues.apache.org/jira/browse/HBASE-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175132#comment-13175132 ] Shrijeet Paliwal commented on HBASE-5041: - Our logic to check whether the name is a region name or a table name is designed as follows; tl;dr: if it is not an existing table, it should be a region.
{noformat}
/**
 * @param tableNameOrRegionName Name of a table or name of a region.
 * @return True if <code>tableNameOrRegionName</code> is *possibly* a region
 * name, else false if a verified table name (we call {@link #tableExists(byte[])});
 * else we throw an exception.
 * @throws IOException
 */
private boolean isRegionName(final byte [] tableNameOrRegionName)
throws IOException {
  if (tableNameOrRegionName == null) {
    throw new IllegalArgumentException("Pass a table name or region name");
  }
  return !tableExists(tableNameOrRegionName);
}
{noformat}
My plan was to modify the majorCompact function's else block to check whether the table exists and throw TableNotFoundException if it does not. But because of the name logic above one will never reach the 'else' part, and a compaction request will be registered on the assumption that the name must be a region.
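The fix under discussion can be sketched in a self-contained form as follows. The table set, the simplified majorCompact, and the use of IllegalArgumentException in place of HBase's TableNotFoundException are all illustrative assumptions, not the committed patch:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CompactionCheck {
    // Stand-in for the catalog lookup; a real client would consult .META.
    static final Set<String> existingTables =
        new HashSet<>(Arrays.asList("usertable"));

    static boolean tableExists(String name) {
        return existingTables.contains(name);
    }

    // Old, sloppy contract: anything that is not a known table is
    // *assumed* to be a region name.
    static boolean isRegionName(String tableNameOrRegionName) {
        if (tableNameOrRegionName == null) {
            throw new IllegalArgumentException("Pass a table name or region name");
        }
        return !tableExists(tableNameOrRegionName);
    }

    // Tightened contract: if the name does not resolve to a region and is
    // not an existing table, fail loudly instead of silently registering a
    // compaction request for a typo.
    static String majorCompact(String name, boolean resolvesToRegion) {
        if (resolvesToRegion) {
            return "compact region " + name;
        }
        if (!tableExists(name)) {
            // The real patch throws TableNotFoundException here.
            throw new IllegalArgumentException("table does not exist: " + name);
        }
        return "compact table " + name;
    }
}
```

With this tightening, `major_compact 'fubar'` would surface an error to the shell user instead of queueing a no-op request.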
[jira] [Commented] (HBASE-5041) Major compaction on non existing table does not throw error
[ https://issues.apache.org/jira/browse/HBASE-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175264#comment-13175264 ] Shrijeet Paliwal commented on HBASE-5041: - @Ted, I will add a unit test and upload a new patch on top of trunk. @Ram, thanks for commenting. Do you mean to say isRegionName should throw an exception? I wanted to keep the semantics the same as before - it tells whether the name argument 'appears' to be a region name or not. When MetaReader.getRegion returns null we know one thing for sure: it is not a region. Determining whether it is a valid table is left to the caller, depending on need. Did you mean something else?
[jira] [Commented] (HBASE-5041) Major compaction on non existing table does not throw error
[ https://issues.apache.org/jira/browse/HBASE-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175287#comment-13175287 ] Shrijeet Paliwal commented on HBASE-5041: - @Stack
{quote} I think the patch is doing the right thing. It's changing the contract for isRegionName, but this is a private method and you are tightening what was a sloppy contract previously; it looks too like all instances of isRegionName can benefit from this tightening (is this your thought, Shrijeet?). {quote}
Yes, that is the idea.
{quote} You might make a method that returns a String tablename for a table you know exists (else it throws the TNFE). {quote}
Makes sense, will do.
{quote} We are creating a new CatalogTracker instance. No one seems to be shutting it down? Is that a prob? {quote}
Did not understand this one, Stack. cleanupCatalogTracker called in finally will stop the CatalogTracker, no?
[jira] [Commented] (HBASE-5035) Runtime exceptions during meta scan
[ https://issues.apache.org/jira/browse/HBASE-5035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173697#comment-13173697 ] Shrijeet Paliwal commented on HBASE-5035: - Ted, you had mentioned the following in the email thread:
bq. Null check for regionInfo should be added
I could not gather why regionInfo could possibly be null. The call 'Writables.getHRegionInfo(value)' does not seem to ever return null. Could you please share your reasoning? Meanwhile I am still reading the code and trying to find the place where the NPE might occur.

Runtime exceptions during meta scan --- Key: HBASE-5035 URL: https://issues.apache.org/jira/browse/HBASE-5035 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.90.3 Reporter: Shrijeet Paliwal Version: 0.90.3 + patches back ported

The other day our clients started spitting these two runtime exceptions. Not all clients connected to the cluster were affected, only 4 of them; while 3 of them were throwing NPE, one was throwing ArrayIndexOutOfBoundsException. The errors are:
1. http://pastie.org/2987926
2. http://pastie.org/2987927
The clients did not recover from this and I had to restart them. The motive of this jira is to identify and put null checks at appropriate places. Also, with the given stack traces I cannot tell which line caused the NPE or AIOBE, hence an additional motive is to make the traces more helpful.
[jira] [Commented] (HBASE-5035) Runtime exceptions during meta scan
[ https://issues.apache.org/jira/browse/HBASE-5035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173780#comment-13173780 ] Shrijeet Paliwal commented on HBASE-5035: - Hmm, you might be right.
{noformat}
final String serverAddress = Bytes.toString(value);
// instantiate the location
HRegionLocation loc = new HRegionLocation(regionInfo,
  new HServerAddress(serverAddress));
{noformat}
The Bytes.toString call, in theory, may return either an empty string or null. In the case where it returns null (see below), it tries to log an error, which I did not see in my log file; so I am still not 100% sure this is our guy.
{noformat}
try {
  return new String(b, off, len, HConstants.UTF8_ENCODING);
} catch (UnsupportedEncodingException e) {
  LOG.error("UTF-8 not supported?", e);
  return null;
}
{noformat}
Nonetheless it would be good to check the serverAddress variable for emptiness as well as nullness, since the HServerAddress constructor may otherwise throw a runtime error. The interesting point is that it can throw both ArrayIndexOutOfBoundsException and NPE, and I saw both cases.
{noformat}
/**
 * @param hostAndPort Hostname and port formatted as <code>&lt;hostname&gt; ':' &lt;port&gt;</code>
 */
public HServerAddress(String hostAndPort) {
  int colonIndex = hostAndPort.lastIndexOf(':');
{noformat}
I will open a subtask to make the trace more helpful.
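The defensive check suggested above can be sketched as follows. `parseServerAddress` is a hypothetical helper, not HBase API; it models the guard under the assumption that server addresses are serialized as `host:port`:

```java
public class ServerAddressGuard {
    // Hypothetical helper (not HBase API): validate the address read from
    // .META. before handing it to an HServerAddress-style "host:port" parser.
    static String parseServerAddress(String hostAndPort) {
        // Guard the two failure modes described above: a null value
        // (NPE on lastIndexOf) and an empty or colon-less value
        // (ArrayIndexOutOfBoundsException while splitting).
        if (hostAndPort == null || hostAndPort.isEmpty()) {
            throw new IllegalArgumentException("empty server address in .META.");
        }
        int colonIndex = hostAndPort.lastIndexOf(':');
        if (colonIndex < 0) {
            throw new IllegalArgumentException("no port in: " + hostAndPort);
        }
        String host = hostAndPort.substring(0, colonIndex);
        int port = Integer.parseInt(hostAndPort.substring(colonIndex + 1));
        return host + "/" + port;
    }
}
```

Failing with a descriptive IllegalArgumentException at this point would also make the stack trace self-explanatory, which is the other motive of this jira.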
[jira] [Commented] (HBASE-5035) Runtime exceptions during meta scan
[ https://issues.apache.org/jira/browse/HBASE-5035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13169955#comment-13169955 ] Shrijeet Paliwal commented on HBASE-5035: - Here is the patched HCM, which can be used to match line numbers: https://gist.github.com/1478070
[jira] [Commented] (HBASE-4980) Null pointer exception in HBaseClient receiveResponse
[ https://issues.apache.org/jira/browse/HBASE-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13165406#comment-13165406 ] Shrijeet Paliwal commented on HBASE-4980: - Done attaching. Should I click cancel patch and then click submit patch again?

Null pointer exception in HBaseClient receiveResponse - Key: HBASE-4980 URL: https://issues.apache.org/jira/browse/HBASE-4980 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.92.0 Reporter: Shrijeet Paliwal Labels: newbie Attachments: 0001-HBASE-4980-Fix-NPE-in-HBaseClient-receiveResponse.patch, 0002-HBASE-4980-Fix-NPE-in-HBaseClient-receiveResponse.patch, 0003-HBASE-4980-Fix-NPE-in-HBaseClient-receiveResponse.patch

Relevant stack trace:
{noformat}
2011-11-30 13:10:26,557 [IPC Client (47) connection to xx.xx.xx/172.22.4.68:60020 from an unknown user] WARN org.apache.hadoop.ipc.HBaseClient - Unexpected exception receiving call responses
java.lang.NullPointerException
  at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:583)
  at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:511)
{noformat}
{code}
if (LOG.isDebugEnabled())
  LOG.debug(getName() + " got value #" + id);
Call call = calls.remove(id);
// Read the flag byte
byte flag = in.readByte();
boolean isError = ResponseFlag.isError(flag);
if (ResponseFlag.isLength(flag)) {
  // Currently length if present is unused.
  in.readInt();
}
int state = in.readInt(); // Read the state. Currently unused.
if (isError) {
  //noinspection ThrowableInstanceNeverThrown
  call.setException(new RemoteException(
      WritableUtils.readString(in), WritableUtils.readString(in)));
} else {
{code}
The line {code}Call call = calls.remove(id);{code} may return a null 'call'. This is because, with rpc timeout enabled, we proactively clean up other calls which have exceeded their lifetime along with the call for which the socket timeout exception happened.
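The null-guard the description calls for can be sketched in a self-contained form. CallTable, receiveResponse, and the Strings standing in for Call objects are hypothetical simplifications of HBaseClient, not its actual API:

```java
import java.util.concurrent.ConcurrentHashMap;

public class CallTable {
    // Models HBaseClient's table of in-flight calls; Strings stand in
    // for the real Call objects.
    static final ConcurrentHashMap<Integer, String> calls = new ConcurrentHashMap<>();

    // Sketch of the fix: calls.remove(id) can return null when the
    // rpc-timeout cleanup already purged the call, so a late response must
    // be skipped rather than dereferenced.
    static String receiveResponse(int id) {
        String call = calls.remove(id);
        if (call == null) {
            // The call expired and was proactively cleaned up; ignore the
            // late response instead of hitting an NPE on call.setException().
            return "ignored stale response for call " + id;
        }
        return "completed " + call;
    }
}
```

In the real code the same check would also need to drain the remaining bytes of the response from the stream so the connection stays usable.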
[jira] [Commented] (HBASE-4633) Potential memory leak in client RPC timeout mechanism
[ https://issues.apache.org/jira/browse/HBASE-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13162475#comment-13162475 ] Shrijeet Paliwal commented on HBASE-4633: - Recent updates:
* In my case the leak/memory hold does not appear to be in the HBase client; I could not find enough evidence to conclude that it is. What I did find is that our application holds one heavy object in memory. This object is shared between threads. Every N minutes the application creates a new instance of this class. Unless a thread is still holding on to an old instance, all old instances are GCed in time. Hence, in theory, at any time there should be only one active instance of the heavy object.
* Under heavy load, with the client operation RPC timeout enabled, some threads get stuck. This causes multiple instances of the heavy object to accumulate; in turn the heap grows. After reading the client code multiple times I cannot see why an application thread would get stuck for several minutes: we have safeguards to clean up calls 'forcefully' if they have been alive for more than the rpc timeout interval.
I had planned to update the title of this Jira to reflect the above finding, but Gaojinchao observed something interesting at his end, so I am keeping the title the same for now. Gaojinchao's thread is here: http://search-hadoop.com/m/teczL8KvcH

Potential memory leak in client RPC timeout mechanism - Key: HBASE-4633 URL: https://issues.apache.org/jira/browse/HBASE-4633 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.90.3 Environment: HBase version: 0.90.3 + Patches , Hadoop version: CDH3u0 Reporter: Shrijeet Paliwal Relevant Jiras: https://issues.apache.org/jira/browse/HBASE-2937, https://issues.apache.org/jira/browse/HBASE-4003

We have been using the 'hbase.client.operation.timeout' knob introduced in HBASE-2937 for quite some time now. It helps us enforce SLAs. We have two HBase clusters and two HBase client clusters. One of them is much busier than the other.
We have seen a deterministic behavior of clients running in the busy cluster: their memory footprint increases consistently after they have been up for roughly 24 hours, almost doubling from its usual value (usual case == RPC timeout disabled). After much investigation nothing concrete came out, and we had to put in a hack which keeps heap size in control even when RPC timeout is enabled. Also note, the same behavior is not observed in the 'not so busy' cluster. The patch is here: https://gist.github.com/1288023
[jira] [Commented] (HBASE-4633) Potential memory leak in client RPC timeout mechanism
[ https://issues.apache.org/jira/browse/HBASE-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13131677#comment-13131677 ] Shrijeet Paliwal commented on HBASE-4633: - @Liyin, are you using RPC timeouts for client operations?
bq. But Not sure the leak comes from HBase Client jar itself or just our client code.
In the absence of concrete evidence that the leak is indeed in the HBase client jar, I have a similar feeling: it could be in our application layer.
bq. Our symptom is that the memory footprint will increase as time. But the actual heap size of the client is not increasing.
We observe the used memory using a collectd plugin: http://collectd.org/wiki/index.php/Plugin:Memory
bq. So I am very interested to know when you have keep the heap size in control, is the memory leaking solved ?
We run with max and min memory set as -Xmx2{X}G -Xms{X}G, and when the 'leak' happens the plugin shows the used memory touching the 2X value, so it does seem the heap size is increasing. Correct me here if I am mistaken. Let me know if you need more inputs.
[jira] [Commented] (HBASE-4633) Potential memory leak in client RPC timeout mechanism
[ https://issues.apache.org/jira/browse/HBASE-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13131786#comment-13131786 ] Shrijeet Paliwal commented on HBASE-4633: - @Stack No, we did not run with that flag. Also, we never got to a point where the application had to die because of OOM. The reasons (I guess) are:
# We have GC flags to do garbage collection as fast as possible.
# The monitoring in place starts sending out alerts, and we usually shoot the server in the head before it OOMs.
# The load balancer kicks in and stops sending work to the application server, realizing it is in a bad state.
As mentioned earlier, I have found it hard to reproduce in the dev environment, failing to simulate production-like load. But I must try again.