You should have written to the mailing list earlier :-) The HBase community is very responsive.
On Fri, Oct 28, 2016 at 2:53 PM, Pat Ferrel <[email protected]> wrote:

> After passing in hbase-site.xml with the increased timeout it completes
> pretty fast with no errors.
>
> Thanks Ted, we've been going crazy trying to figure out what was going on.
> We moved from having HBase installed on the Spark driver machine (though
> not used) to a containerized installation, where the config was left at the
> default on the driver and only existed in the containers. We were passing
> the empty config to spark-submit, but it didn't match the containers, and
> fixing that has made the system much happier.
>
> Anyway, good call; we will be more aware of this with other services now.
> Thanks for ending our weeks-long fight! :-)
>
>
> On Oct 28, 2016, at 11:29 AM, Ted Yu <[email protected]> wrote:
>
> bq. with 400 threads hitting HBase at the same time
>
> How many regions are serving the 400 threads?
> How many region servers do you have?
>
> If the regions are spread relatively evenly across the cluster, the above
> may not be a big issue.
>
> On Fri, Oct 28, 2016 at 11:21 AM, Pat Ferrel <[email protected]> wrote:
>
> > Ok, will do.
> >
> > So the scanner does not indicate by itself that I've missed something in
> > handling the data. If not an index, then maybe a fast lookup "key"? I ask
> > because the timeout change may work but not be the optimal solution. The
> > stage that fails is very long compared to other stages. And with 400
> > threads hitting HBase at the same time, this seems like something I may
> > need to restructure, and any advice about that would be welcome.
> >
> > HBase is 1.2.3
> >
> >
> > On Oct 28, 2016, at 10:36 AM, Ted Yu <[email protected]> wrote:
> >
> > For your first question, you need to pass hbase-site.xml, which has the
> > config parameters affecting client operations, to the Spark executors.
> >
> > bq. missed indexing some column
> >
> > HBase doesn't have indexing (in the sense of a traditional RDBMS).
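[Editorial note: Ted's fix — shipping the client-side hbase-site.xml to the executors via `--files` — can be sketched roughly as below. The master URL, paths, class name, and jar name are hypothetical placeholders, not taken from this thread; the command is built and printed rather than executed.]

```shell
# Sketch only: ship the client-side HBase config to every executor so that
# HBaseConfiguration.create() on the executors sees the same timeouts as
# the driver. All paths and names below are hypothetical placeholders.
HBASE_CONF=/etc/hbase/conf/hbase-site.xml

# --files copies the file into each executor's working directory (which is
# on the executor classpath), so the HBase client picks it up automatically.
CMD=(spark-submit
  --master spark://spark-master:7077
  --files "$HBASE_CONF"
  --class com.example.MyHBaseJob
  my-hbase-job.jar)

# print the assembled command instead of running it
echo "${CMD[@]}"
```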
> > Let's see what happens after hbase-site.xml is passed to the executors.
> >
> > BTW, can you tell us the release of HBase you're using?
> >
> > On Fri, Oct 28, 2016 at 10:22 AM, Pat Ferrel <[email protected]> wrote:
> >
> >> So to clarify: there are some values in hbase/conf/hbase-site.xml that
> >> are needed by the calling code in the Spark driver and executors, and so
> >> must be passed using --files to spark-submit? If so, I can do this.
> >>
> >> But do I have a deeper issue? Is it typical to need a scan like this?
> >> Have I missed indexing some column, maybe?
> >>
> >> On Oct 28, 2016, at 9:59 AM, Ted Yu <[email protected]> wrote:
> >>
> >> Mich:
> >> bq. on table 'hbase:meta' at region=hbase:meta,,1.1588230740
> >>
> >> What you observed was a different issue.
> >> The above looks like trouble locating region(s) during the scan.
> >>
> >> On Fri, Oct 28, 2016 at 9:54 AM, Mich Talebzadeh <[email protected]> wrote:
> >>
> >>> This is an example I got:
> >>>
> >>> warning: there were two deprecation warnings; re-run with -deprecation for details
> >>> rdd1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[77] at map at <console>:151
> >>> defined class columns
> >>> dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER: string]
> >>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
> >>> Fri Oct 28 13:13:46 BST 2016, null, java.net.SocketTimeoutException:
> >>> callTimeout=60000, callDuration=68411: row 'MARKETDATAHBASE,,00000000000000'
> >>> on table 'hbase:meta' at region=hbase:meta,,1.1588230740,
> >>> hostname=rhes564,16201,1477246132044, seqNum=0
> >>>   at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276)
> >>>   at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:210)
> >>>   at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
> >>>   at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:210)
> >>>
> >>>
> >>> Dr Mich Talebzadeh
> >>>
> >>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>
> >>> http://talebzadehmich.wordpress.com
> >>>
> >>> Disclaimer: Use it at your own risk. Any and all responsibility for any
> >>> loss, damage or destruction of data or any other property which may
> >>> arise from relying on this email's technical content is explicitly
> >>> disclaimed. The author will in no case be liable for any monetary
> >>> damages arising from such loss, damage or destruction.
> >>>
> >>>
> >>> On 28 October 2016 at 17:52, Pat Ferrel <[email protected]> wrote:
> >>>
> >>>> I will check that, but if that is a server startup thing I was not
> >>>> aware I had to send it to the executors. So it's like a connection
> >>>> timeout from executor code?
> >>>>
> >>>>
> >>>> On Oct 28, 2016, at 9:48 AM, Ted Yu <[email protected]> wrote:
> >>>>
> >>>> How did you change the timeout(s)?
> >>>>
> >>>> bq. timeout is currently set to 60000
> >>>>
> >>>> Did you pass hbase-site.xml using --files to the Spark job?
> >>>>
> >>>> Cheers
> >>>>
> >>>> On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <[email protected]> wrote:
> >>>>
> >>>>> Using standalone Spark. I don't recall seeing connection lost errors,
> >>>>> but there are lots of logs. I've set the scanner and RPC timeouts to
> >>>>> large numbers on the servers.
> >>>>>
> >>>>> But I also saw in the logs:
> >>>>>
> >>>>> org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
> >>>>> passed since the last invocation, timeout is currently set to 60000
> >>>>>
> >>>>> Not sure where that is coming from. Does the driver machine making
> >>>>> queries need to have the timeout config also?
> >>>>>
> >>>>> And why so large, am I doing something wrong?
> >>>>>
> >>>>>
> >>>>> On Oct 28, 2016, at 8:50 AM, Ted Yu <[email protected]> wrote:
> >>>>>
> >>>>> Mich:
> >>>>> The OutOfOrderScannerNextException indicated a problem with reads
> >>>>> from HBase.
> >>>>>
> >>>>> How did you know the connection to the Spark cluster was lost?
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <[email protected]> wrote:
> >>>>>
> >>>>>> Looks like it lost the connection to the Spark cluster.
> >>>>>>
> >>>>>> What mode are you using with Spark: Standalone, Yarn, or others?
> >>>>>> The issue looks like a resource manager issue.
> >>>>>>
> >>>>>> I have seen this when running Zeppelin with Spark on HBase.
> >>>>>>
> >>>>>> HTH
> >>>>>>
> >>>>>> Dr Mich Talebzadeh
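[Editorial note: the scanner and RPC timeouts Pat mentions are standard client-side settings in hbase-site.xml, which is why the same file has to reach both the driver and the executors. A sketch of the relevant properties follows; the 600000 ms values are illustrative only, not a recommendation from this thread.]

```xml
<!-- hbase-site.xml fragment (sketch). These are client-side settings, so
     the file must be visible to the Spark driver AND the executors.
     The values are illustrative placeholders. -->
<property>
  <name>hbase.client.scanner.timeout.period</name>
  <value>600000</value> <!-- max ms allowed between scanner next() calls -->
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>600000</value> <!-- max ms allowed for a single client RPC -->
</property>
```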
> >>>>>>
> >>>>>>
> >>>>>> On 28 October 2016 at 16:38, Pat Ferrel <[email protected]> wrote:
> >>>>>>
> >>>>>>> I'm getting data from HBase using a large Spark cluster with
> >>>>>>> parallelism of near 400. The query fails quite often with the
> >>>>>>> message below. Sometimes a retry will work and sometimes the
> >>>>>>> ultimate failure results (below).
> >>>>>>>
> >>>>>>> If I reduce parallelism in Spark it slows other parts of the
> >>>>>>> algorithm unacceptably. I have also experimented with very large
> >>>>>>> RPC/Scanner timeouts of many minutes, to no avail.
> >>>>>>>
> >>>>>>> Any clues about what to look for or what may be set up wrong in my
> >>>>>>> tables?
> >>>>>>>
> >>>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> >>>>>>> times, most recent failure: Lost task 44.3 in stage 147.0 (TID
> >>>>>>> 24833, ip-172-16-3-9.eu-central-1.compute.internal):
> >>>>>>> org.apache.hadoop.hbase.DoNotRetryIOException:
> >>>>>>> Failed after retry of OutOfOrderScannerNextException: was there a
> >>>>>>> rpc timeout?
> >>>>>>>   at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:403)
> >>>>>>>   at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:232)
> >>>>>>>   at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138)
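[Editorial note: beyond raising timeouts, a commonly cited mitigation for this ScannerTimeoutException / OutOfOrderScannerNextException pattern — not discussed in the thread itself, so treat it as an assumption — is to lower scanner caching so each next() RPC returns fewer rows and the client calls next() again sooner, keeping the gap between invocations under the lease timeout. Sketch:]

```xml
<!-- hbase-site.xml fragment (sketch, illustrative value). Fewer rows
     fetched per next() RPC means less time between client invocations,
     which is what the scanner lease timeout actually measures. -->
<property>
  <name>hbase.client.scanner.caching</name>
  <value>100</value>
</property>
```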
