[jira] [Commented] (TRAFODION-2455) Initial Update Stats on 22B row 2.5TB OE table gets 0 rowcount from estimator, fails with timeouts by doing select count (*)

2017-01-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840978#comment-15840978
 ] 

ASF GitHub Bot commented on TRAFODION-2455:
---

Github user asfgit closed the pull request at:

https://github.com/apache/incubator-trafodion/pull/929


> Initial Update Stats on 22B row 2.5TB OE table gets 0 rowcount from 
> estimator, fails with timeouts by doing select count (*)
> 
>
> Key: TRAFODION-2455
> URL: https://issues.apache.org/jira/browse/TRAFODION-2455
> Project: Apache Trafodion
> Issue Type: Bug
> Components: sql-cmp
> Affects Versions: 2.1-incubating
> Environment: A cluster large enough to host a 22 billion row table
> Reporter: David Wayne Birdsall
> Assignee: David Wayne Birdsall
>
> When loading a scale factor 73728 Order Entry database, if UPDATE STATISTICS 
> is done soon after the load on one particular table (the largest table, 
> having 22 billion rows), we get the following failure:
> SQLEXCEPTION on Statement, Error Code = -9200
> update statistics for table trafodion.javabench.oe_orderline_73728 on 
> every column, (OL_W_ID, OL_I_ID), (OL_D_ID, OL_W_ID), (OL_D_ID, OL_I_ID) 
> sample
> *** ERROR[9200] UPDATE STATISTICS for table 
> TRAFODION.JAVABENCH.OE_ORDERLINE_73728 encountered an error (8448) from 
> statement getRow(). [2017-01-09 02:07:22]
> *** ERROR[8448] Unable to access Hbase interface. Call to 
> ExpHbaseInterface::coProcAggr returned error HBASE_ACCESS_ERROR(-706). Cause: 
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
> attempts=3, exceptions:
> Mon Jan 09 01:47:21 PST 2017, 
> RpcRetryingCaller{globalStartTime=1483954641419, pause=100, retries=3}, 
> java.io.IOException: Call to nap015.esgyn.local/10.1.10.20:60020 failed on 
> local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call 
> id=73, waitTime=61, operationTimeout=60 expired.
> Mon Jan 09 01:57:21 PST 2017, 
> RpcRetryingCaller{globalStartTime=1483954641419, pause=100, retries=3}, 
> java.io.IOException: Call to nap015.esgyn.local/10.1.10.20:60020 failed on 
> local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call 
> id=185, waitTime=61, operationTimeout=60 expired.
> Mon Jan 09 02:07:22 PST 2017, 
> RpcRetryingCaller{globalStartTime=1483954641419, pause=100, retries=3}, 
> java.io.IOException: Call to nap015.esgyn.local/10.1.10.20:60020 failed on 
> local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call 
> id=310, waitTime=61, operationTimeout=60 expired.
> A subsequent update statistics command succeeds, but these failures take a 
> half hour or more.
> Enabling logging for update stats shows that getrowcount returns 0, so update 
> stats assumes the table is small enough to do a select count (*). The plan 
> for this select count (*) (perhaps suffering from the same issue that causes 
> getrowcount to return a non-estimate) chooses the HBase aggregate 
> coprocessor. The table in question has 22 billion rows, so the 
> coprocessor isn't a good choice, and the query times out. But the real issue 
> is: why can't the table get a rowcount estimate?
> Rerunning UPDATE STATS on this table a few hours later succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TRAFODION-2455) Initial Update Stats on 22B row 2.5TB OE table gets 0 rowcount from estimator, fails with timeouts by doing select count (*)

2017-01-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840444#comment-15840444
 ] 

ASF GitHub Bot commented on TRAFODION-2455:
---

Github user DaveBirdsall commented on a diff in the pull request:

https://github.com/apache/incubator-trafodion/pull/929#discussion_r98089783
  
--- Diff: core/sql/executor/HBaseClient_JNI.cpp ---
@@ -1303,12 +1304,26 @@ HBC_RetCode HBaseClient_JNI::grant(const Text& 
user, const Text& tblName, const
 HBC_RetCode HBaseClient_JNI::estimateRowCount(const char* tblName,
   Int32 partialRowSize,
   Int32 numCols,
-  Int64& rowCount)
+  Int32 retryLimitMilliSeconds,
+  Int64& rowCount,
+  Int32& breadCrumb)
 {
+  // Note: Please use HBC_ERROR_ROWCOUNT_EST_EXCEPTION only for
+  // those error returns that call getExceptionDetails(). This
+  // tells the caller that Java exception information is available.
+
   QRLogger::log(CAT_SQL_HBASE, LL_DEBUG, "HBaseClient_JNI::estimateRowCount(%s) called.", tblName);
-  if (initJNIEnv() != JOI_OK)
-     return HBC_ERROR_INIT_PARAM;
+  breadCrumb = 1;
+  if (jenv_ == NULL)
+     if (initJVM() != JOI_OK)
+        return HBC_ERROR_INIT_PARAM;
 
+  breadCrumb = 2;
+  if (jenv_->PushLocalFrame(jniHandleCapacity_) != 0) {
+     getExceptionDetails();
+     return HBC_ERROR_ROWCOUNT_EST_EXCEPTION;
+  }
+  breadCrumb = 3;
--- End diff --

My mistake... I seem to have accidentally resurrected some old code. Will 
fix.



[jira] [Commented] (TRAFODION-2455) Initial Update Stats on 22B row 2.5TB OE table gets 0 rowcount from estimator, fails with timeouts by doing select count (*)

2017-01-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840445#comment-15840445
 ] 

ASF GitHub Bot commented on TRAFODION-2455:
---

Github user DaveBirdsall commented on a diff in the pull request:

https://github.com/apache/incubator-trafodion/pull/929#discussion_r98089873
  
--- Diff: core/sql/executor/HBaseClient_JNI.cpp ---
@@ -1319,17 +1334,21 @@ HBC_RetCode HBaseClient_JNI::estimateRowCount(const 
char* tblName,
 
   jint jPartialRowSize = partialRowSize;
   jint jNumCols = numCols;
+  jint jRetryLimitMilliSeconds = retryLimitMilliSeconds;
   jlongArray jRowCount = jenv_->NewLongArray(1);
   tsRecentJMFromJNI = JavaMethods_[JM_EST_RC].jm_full_name;
   jboolean jresult = jenv_->CallBooleanMethod(javaObj_, JavaMethods_[JM_EST_RC].methodID,
   js_tblName, jPartialRowSize,
-  jNumCols, jRowCount);
+  jNumCols, jRetryLimitMilliSeconds, jRowCount);
   jboolean isCopy;
   jlong* arrayElems = jenv_->GetLongArrayElements(jRowCount, &isCopy);
   rowCount = *arrayElems;
   if (isCopy == JNI_TRUE)
      jenv_->ReleaseLongArrayElements(jRowCount, arrayElems, JNI_ABORT);
 
+  jenv_->DeleteLocalRef(js_tblName);
--- End diff --

Right. I think I accidentally resurrected some old code. Will fix.




[jira] [Commented] (TRAFODION-2455) Initial Update Stats on 22B row 2.5TB OE table gets 0 rowcount from estimator, fails with timeouts by doing select count (*)

2017-01-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840112#comment-15840112
 ] 

ASF GitHub Bot commented on TRAFODION-2455:
---

Github user zcorrea commented on a diff in the pull request:

https://github.com/apache/incubator-trafodion/pull/929#discussion_r98049125
  
--- Diff: core/sqf/monitor/linux/process.cxx ---
@@ -1317,7 +1318,11 @@ bool CProcess::Create (CProcess *parent, int & 
result)
 env = getenv ("TERM");
 STRCPY (term, (env?env:"ansi"));
 env = getenv ("TZ");
-STRCPY (tz, (env?env:""));
+tz_exists = (env != NULL);
+if (tz_exists)
+{
+  STRCPY (tz, env); // see note regarding TZ below
+}
 env = getenv ("USER");
 STRCPY (user, (env?env:""));
 env = getenv ("HOME");
--- End diff --

Change looks good. @selvaganesang has a good point about the calls to 
getenv() being made on every process create. We will have to analyze the 
impact of caching these environment variables.
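
One way the caching mentioned above could look is sketched below in C++. This is only an illustration, not code from the monitor or the pull request: the class name `EnvCache` and its interface are hypothetical. It assumes the variables of interest do not change during the life of the process, and it is not thread-safe as written.

```cpp
#include <cstdlib>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper (not part of Trafodion): looks each environment
// variable up with getenv() once and reuses the stored result afterwards.
class EnvCache
{
public:
  // Returns a pointer to the cached value, or nullptr if the variable was
  // unset when it was first looked up.
  const std::string* get(const std::string& name)
  {
    std::map<std::string, Entry>::iterator it = cache_.find(name);
    if (it == cache_.end())
    {
      const char* value = std::getenv(name.c_str());
      Entry entry;
      entry.isSet = (value != nullptr);
      if (value != nullptr)
        entry.value = value;
      it = cache_.insert(std::make_pair(name, entry)).first;
    }
    return it->second.isSet ? &it->second.value : nullptr;
  }

private:
  struct Entry
  {
    bool isSet;        // distinguishes "unset" from "set to empty string"
    std::string value;
  };
  std::map<std::string, Entry> cache_;
};
```

The trade-off is that a cached value will not reflect later changes to the environment, which is part of the impact that would need analyzing. Keeping the "unset" state separate from the empty string preserves the distinction the TZ change in this diff relies on.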




[jira] [Commented] (TRAFODION-2455) Initial Update Stats on 22B row 2.5TB OE table gets 0 rowcount from estimator, fails with timeouts by doing select count (*)

2017-01-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839019#comment-15839019
 ] 

ASF GitHub Bot commented on TRAFODION-2455:
---

Github user selvaganesang commented on a diff in the pull request:

https://github.com/apache/incubator-trafodion/pull/929#discussion_r97918415
  
--- Diff: core/sqf/monitor/linux/process.cxx ---
@@ -1317,7 +1318,11 @@ bool CProcess::Create (CProcess *parent, int & 
result)
 env = getenv ("TERM");
 STRCPY (term, (env?env:"ansi"));
 env = getenv ("TZ");
-STRCPY (tz, (env?env:""));
+tz_exists = (env != NULL);
+if (tz_exists)
+{
+  STRCPY (tz, env); // see note regarding TZ below
+}
 env = getenv ("USER");
 STRCPY (user, (env?env:""));
 env = getenv ("HOME");
--- End diff --

Though it is in the surrounding code, I thought I would mention this. I 
think getenv is an expensive call that should be avoided in this part of the code.




[jira] [Commented] (TRAFODION-2455) Initial Update Stats on 22B row 2.5TB OE table gets 0 rowcount from estimator, fails with timeouts by doing select count (*)

2017-01-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839018#comment-15839018
 ] 

ASF GitHub Bot commented on TRAFODION-2455:
---

Github user selvaganesang commented on a diff in the pull request:

https://github.com/apache/incubator-trafodion/pull/929#discussion_r97918864
  
--- Diff: core/sql/executor/HBaseClient_JNI.cpp ---
@@ -1303,12 +1304,26 @@ HBC_RetCode HBaseClient_JNI::grant(const Text& 
user, const Text& tblName, const
 HBC_RetCode HBaseClient_JNI::estimateRowCount(const char* tblName,
   Int32 partialRowSize,
   Int32 numCols,
-  Int64& rowCount)
+  Int32 retryLimitMilliSeconds,
+  Int64& rowCount,
+  Int32& breadCrumb)
 {
+  // Note: Please use HBC_ERROR_ROWCOUNT_EST_EXCEPTION only for
+  // those error returns that call getExceptionDetails(). This
+  // tells the caller that Java exception information is available.
+
   QRLogger::log(CAT_SQL_HBASE, LL_DEBUG, "HBaseClient_JNI::estimateRowCount(%s) called.", tblName);
-  if (initJNIEnv() != JOI_OK)
-     return HBC_ERROR_INIT_PARAM;
+  breadCrumb = 1;
+  if (jenv_ == NULL)
+     if (initJVM() != JOI_OK)
+        return HBC_ERROR_INIT_PARAM;
 
+  breadCrumb = 2;
+  if (jenv_->PushLocalFrame(jniHandleCapacity_) != 0) {
+     getExceptionDetails();
+     return HBC_ERROR_ROWCOUNT_EST_EXCEPTION;
+  }
+  breadCrumb = 3;
--- End diff --

I know you wanted to provide a breadcrumb for the errors reported. However, I 
would suggest leaving this initialization code as it was. I cleaned up this 
code earlier so that initialization of the JNI layer is encapsulated in one 
method, initJNIEnv(), and any changes can be made in one routine.



[jira] [Commented] (TRAFODION-2455) Initial Update Stats on 22B row 2.5TB OE table gets 0 rowcount from estimator, fails with timeouts by doing select count (*)

2017-01-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839020#comment-15839020
 ] 

ASF GitHub Bot commented on TRAFODION-2455:
---

Github user selvaganesang commented on a diff in the pull request:

https://github.com/apache/incubator-trafodion/pull/929#discussion_r97919059
  
--- Diff: core/sql/executor/HBaseClient_JNI.cpp ---
@@ -1319,17 +1334,21 @@ HBC_RetCode HBaseClient_JNI::estimateRowCount(const 
char* tblName,
 
   jint jPartialRowSize = partialRowSize;
   jint jNumCols = numCols;
+  jint jRetryLimitMilliSeconds = retryLimitMilliSeconds;
   jlongArray jRowCount = jenv_->NewLongArray(1);
   tsRecentJMFromJNI = JavaMethods_[JM_EST_RC].jm_full_name;
   jboolean jresult = jenv_->CallBooleanMethod(javaObj_, JavaMethods_[JM_EST_RC].methodID,
   js_tblName, jPartialRowSize,
-  jNumCols, jRowCount);
+  jNumCols, jRetryLimitMilliSeconds, jRowCount);
   jboolean isCopy;
   jlong* arrayElems = jenv_->GetLongArrayElements(jRowCount, &isCopy);
   rowCount = *arrayElems;
   if (isCopy == JNI_TRUE)
      jenv_->ReleaseLongArrayElements(jRowCount, arrayElems, JNI_ABORT);
 
+  jenv_->DeleteLocalRef(js_tblName);
--- End diff --

PopLocalFrame would do this for you. Again, I cleaned up this code earlier to 
remove the unnecessary calls to DeleteLocalRef when push/pop local frame is used.




[jira] [Commented] (TRAFODION-2455) Initial Update Stats on 22B row 2.5TB OE table gets 0 rowcount from estimator, fails with timeouts by doing select count (*)

2017-01-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TRAFODION-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15838748#comment-15838748
 ] 

ASF GitHub Bot commented on TRAFODION-2455:
---

GitHub user DaveBirdsall opened a pull request:

https://github.com/apache/incubator-trafodion/pull/929

[TRAFODION-2455] Add retry to row count estimation logic

This set of changes does the following:

1. Changes the stack from NATable::estimateHBaseRowCount on down to return 
detailed error information about any failure.
2. Changes NATable::estimateHBaseRowCount to return a rowcount of 100 
million instead of zero when an error occurs. It is safer to overestimate the 
size of an object than to underestimate it.
3. Changes UPDATE STATISTICS to give an error 9252 "Unable to get row count 
estimate: Error code $0int0, detail $1int1. Exception info (if any): 
$0~string0" when an error occurs in or underneath 
NATable::estimateHBaseRowCount instead of using a rowcount estimate of zero. 
The information in this message gives details such as what error path was 
taken, and any Java exception information that may be pertinent.
4. Adds a retry loop to HBaseClient.java method estimateRowCount so that we 
retry if we encounter a FileNotFoundException. UPDATE STATISTICS will do up to 
4 minutes worth of accumulated retries; normal compilation will do up to 5 
seconds worth of accumulated retries. Wait times for retries start out at 2 
seconds, doubling until topping out at 30 seconds.
5. Adds timestamps to messages in update statistics logging in local time. 
To get local time, I needed to fix a bug in the monitor (process.cxx) that was 
incorrectly setting an environment variable TZ to the empty string when TZ was 
not defined in the monitor itself.
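
The retry timing in item 4 can be sketched as follows. This is only an illustration in C++; the actual loop is in HBaseClient.java and is not shown in this thread. The function name `retryWaitSchedule` is hypothetical, and trimming the final wait so that the total never exceeds the budget is an assumption, not something the description states.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the pull request): returns the waits, in
// milliseconds, that would be taken between attempts for a given total
// retry budget. Waits start at 2 seconds, double each time, and are capped
// at 30 seconds, as described in item 4.
std::vector<int32_t> retryWaitSchedule(int32_t retryLimitMilliSeconds)
{
  const int32_t initialWait = 2000;   // 2 seconds
  const int32_t maxWait = 30000;      // 30 seconds

  std::vector<int32_t> waits;
  int32_t accumulated = 0;
  int32_t wait = initialWait;
  while (accumulated < retryLimitMilliSeconds)
  {
    // Assumption: never wait past the overall budget.
    int32_t thisWait = std::min(wait, retryLimitMilliSeconds - accumulated);
    waits.push_back(thisWait);
    accumulated += thisWait;
    wait = std::min(wait * 2, maxWait);
  }
  return waits;
}
```

Under these assumptions, a 5 second budget (normal compilation) gives waits of 2 and 3 seconds, and a 4 minute budget (UPDATE STATISTICS) gives waits of 2, 4, 8 and 16 seconds followed by seven waits of 30 seconds.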

Note: The fix in process.cxx might (or might not!) fix a similar bug in DTM 
logging where local timestamps are desired but UTC timestamps are produced.
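
The TZ handling described in item 5 amounts to passing a variable on to a child process only when the parent actually has it set. A minimal C++ sketch of that idea follows. It is not the monitor's code: `propagateIfSet` and the "NAME=value" list are hypothetical stand-ins for whatever process.cxx does with `tz` and `tz_exists`. On typical Linux/glibc systems an empty TZ is treated as UTC, which would be consistent with UTC timestamps appearing where local time was expected.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper (not from process.cxx): adds "NAME=value" to the
// child's environment list only if NAME is set in the current process.
// Returns true if an entry was added.
bool propagateIfSet(const char* name, std::vector<std::string>& childEnv)
{
  const char* value = std::getenv(name);
  if (value == nullptr)
    return false;   // unset in the parent: leave it unset in the child

  // Set in the parent (possibly to an empty string): pass it on unchanged.
  childEnv.push_back(std::string(name) + "=" + value);
  return true;
}
```

The earlier pattern, `env ? env : ""`, collapses "unset" and "empty" into the same value, which is harmless for variables such as USER but changes behavior for TZ.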

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/DaveBirdsall/incubator-trafodion Trafodion2455x

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-trafodion/pull/929.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #929


commit 932c219f0fd2a49e1c53a666a1dcbcb3578c4f91
Author: Dave Birdsall 
Date:   2017-01-19T20:10:33Z

[TRAFODION-2440] Add retry to row count estimation logic

commit f1653636dd6d411fafe79697ee75c8763ff8edf2
Author: Dave Birdsall 
Date:   2017-01-24T00:09:08Z

Comment change

commit dbc6c0876ca5af07b63d68f388bad07195d106a5
Author: Dave Birdsall 
Date:   2017-01-25T22:40:56Z

[TRAFODION-2455] More refinements to row count estimation retry logic



