You should have written to the mailing list earlier :-)

The HBase community is very responsive.

On Fri, Oct 28, 2016 at 2:53 PM, Pat Ferrel <[email protected]> wrote:

> After passing in hbase-site.xml with the increased timeout, it completes
> pretty fast with no errors.
>
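> For reference, the client-side timeouts in play here live in
> hbase-site.xml; a minimal excerpt (the 600000 ms values are illustrative,
> not a recommendation):
>
>   <property>
>     <name>hbase.rpc.timeout</name>
>     <value>600000</value>
>   </property>
>   <property>
>     <name>hbase.client.scanner.timeout.period</name>
>     <value>600000</value>
>   </property>
>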
> Thanks Ted, we’ve been going crazy trying to figure out what was going
> on. We moved from having HBase installed on the Spark driver machine
> (though not used) to a containerized installation, where the config was
> left at its default on the driver and only existed in the containers. We
> were passing the empty driver-side config to spark-submit, but it didn’t
> match the containers, and fixing that has made the system much happier.
>
> Anyway, good call; we will be more aware of this with other services now.
> Thanks for ending our weeks-long fight!  :-)
>
>
> On Oct 28, 2016, at 11:29 AM, Ted Yu <[email protected]> wrote:
>
> bq. with 400 threads hitting HBase at the same time
>
> How many regions are serving the 400 threads?
> How many region servers do you have?
>
> If the regions are spread relatively evenly across the cluster, the above
> may not be a big issue.
>
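> One way to eyeball that spread, as a sketch against the HBase 1.2 client
> API (connection details assumed):
>
> import org.apache.hadoop.hbase.HBaseConfiguration
> import org.apache.hadoop.hbase.client.ConnectionFactory
> import scala.collection.JavaConverters._
>
> // Count regions per region server to see whether load is even.
> val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
> val status = conn.getAdmin.getClusterStatus
> status.getServers.asScala.foreach { sn =>
>   println(s"$sn -> ${status.getLoad(sn).getNumberOfRegions} regions")
> }
> conn.close()
>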
> On Fri, Oct 28, 2016 at 11:21 AM, Pat Ferrel <[email protected]>
> wrote:
>
> > Ok, will do.
> >
> > So the scanner timeout does not, by itself, indicate that I’ve missed
> > something in handling the data? If not an index, should I have made a
> > fast-lookup “key”? I ask because the timeout change may work but may not
> > be the optimal solution. The stage that fails is very long compared to
> > other stages, and with 400 threads hitting HBase at the same time, this
> > seems like something I may need to restructure; any advice about that
> > would be welcome.
> >
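> > One knob worth noting before restructuring: the TableRecordReaderImpl
> > frames in the original error (quoted at the bottom of this thread)
> > suggest a TableInputFormat scan, and lowering scanner caching shortens
> > the gap between next() calls, which is exactly what the scanner lease
> > timeout measures. A sketch, assuming an existing SparkContext `sc` and
> > a hypothetical table name:
> >
> > import org.apache.hadoop.hbase.HBaseConfiguration
> > import org.apache.hadoop.hbase.client.Result
> > import org.apache.hadoop.hbase.io.ImmutableBytesWritable
> > import org.apache.hadoop.hbase.mapreduce.TableInputFormat
> >
> > val conf = HBaseConfiguration.create()
> > conf.set(TableInputFormat.INPUT_TABLE, "my_table") // hypothetical name
> > // Fewer rows fetched per RPC means less work between next() calls on a
> > // busy executor, so the scanner lease is renewed more often.
> > conf.set(TableInputFormat.SCAN_CACHEDROWS, "100")
> > val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
> >   classOf[ImmutableBytesWritable], classOf[Result])
> >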
> > HBase is 1.2.3
> >
> >
> > On Oct 28, 2016, at 10:36 AM, Ted Yu <[email protected]> wrote:
> >
> > For your first question: you need to pass hbase-site.xml, which has the
> > config parameters affecting client operations, to the Spark executors.
> >
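> > Something along these lines (paths and class name are illustrative):
> >
> >   spark-submit \
> >     --files /etc/hbase/conf/hbase-site.xml \
> >     --class com.example.MyJob \
> >     my-job.jar
> >
> > With --files, the file lands in each executor's working directory, where
> > HBaseConfiguration.create() can typically pick it up from the classpath.
> >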
> > bq. missed indexing some column
> >
> > HBase doesn't have indexing (in the sense of a traditional RDBMS).
> >
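> > In practice the row key is the only “index”, so anything that can be
> > read by key should use a Get rather than a Scan; a minimal sketch (table
> > and key are hypothetical):
> >
> > import org.apache.hadoop.hbase.TableName
> > import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
> > import org.apache.hadoop.hbase.util.Bytes
> >
> > // A point lookup by row key opens no scanner, so there is no scanner
> > // lease to time out.
> > val conn = ConnectionFactory.createConnection()
> > val table = conn.getTable(TableName.valueOf("my_table"))
> > val result = table.get(new Get(Bytes.toBytes("some-row-key")))
> >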
> > Let's see what happens after hbase-site.xml is passed to executors.
> >
> > BTW, can you tell us which release of HBase you're using?
> >
> >
> >
> > On Fri, Oct 28, 2016 at 10:22 AM, Pat Ferrel <[email protected]>
> > wrote:
> >
> >> So to clarify: there are some values in hbase/conf/hbase-site.xml that
> >> are needed by the calling code in the Spark driver and executors, and
> >> so they must be passed using --files to spark-submit? If so, I can do
> >> this.
> >>
> >> But do I have a deeper issue? Is it typical to need a scan like this?
> >> Have I missed indexing some column, maybe?
> >>
> >>
> >> On Oct 28, 2016, at 9:59 AM, Ted Yu <[email protected]> wrote:
> >>
> >> Mich:
> >> bq. on table 'hbase:meta' at region=hbase:meta,,1.1588230740
> >>
> >> What you observed was a different issue.
> >> The above looks like trouble locating region(s) during the scan.
> >>
> >> On Fri, Oct 28, 2016 at 9:54 AM, Mich Talebzadeh <
> >> [email protected]>
> >> wrote:
> >>
> >>> This is an example I got:
> >>>
> >>> warning: there were two deprecation warnings; re-run with -deprecation
> >>> for details
> >>> rdd1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[77]
> >>> at map at <console>:151
> >>> defined class columns
> >>> dfTICKER: org.apache.spark.sql.Dataset[columns] = [KEY: string, TICKER:
> >>> string]
> >>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> >>> attempts=36, exceptions:
> >>> Fri Oct 28 13:13:46 BST 2016, null, java.net.SocketTimeoutException:
> >>> callTimeout=60000, callDuration=68411: row
> >>> 'MARKETDATAHBASE,,00000000000000' on table 'hbase:meta' at
> >>> region=hbase:meta,,1.1588230740, hostname=rhes564,16201,1477246132044,
> >>> seqNum=0
> >>>   at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276)
> >>>   at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:210)
> >>>   at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
> >>>   at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:210)
> >>>
> >>>
> >>> Dr Mich Talebzadeh
> >>> http://talebzadehmich.wordpress.com
> >>>
> >>>
> >>> On 28 October 2016 at 17:52, Pat Ferrel <[email protected]> wrote:
> >>>
> >>>> I will check that, but if that is a server startup thing, I was not
> >>>> aware I had to send it to the executors. So it's like a connection
> >>>> timeout from executor code?
> >>>>
> >>>>
> >>>> On Oct 28, 2016, at 9:48 AM, Ted Yu <[email protected]> wrote:
> >>>>
> >>>> How did you change the timeout(s)?
> >>>>
> >>>> bq. timeout is currently set to 60000
> >>>>
> >>>> Did you pass hbase-site.xml using --files to the Spark job?
> >>>>
> >>>> Cheers
> >>>>
> >>>>> On Fri, Oct 28, 2016 at 9:27 AM, Pat Ferrel <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>> Using standalone Spark. I don't recall seeing connection-lost errors,
> >>>>> but there are lots of logs. I've set the scanner and RPC timeouts to
> >>>>> large numbers on the servers.
> >>>>>
> >>>>> But I also saw in the logs:
> >>>>>
> >>>>> org.apache.hadoop.hbase.client.ScannerTimeoutException: 381788ms
> >>>>> passed since the last invocation, timeout is currently set to 60000
> >>>>>
> >>>>> Not sure where that is coming from. Does the driver machine making
> >>>>> queries need to have the timeout config also?
> >>>>>
> >>>>> And why so large? Am I doing something wrong?
> >>>>>
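> >>>>> (That ScannerTimeoutException is raised client-side, from whatever
> >>>>> configuration the driver and executors load, so one quick check is
> >>>>> to print what they actually resolve; a sketch:)
> >>>>>
> >>>>> import org.apache.hadoop.hbase.HBaseConfiguration
> >>>>>
> >>>>> // Run on the driver, or inside a task to see what executors load.
> >>>>> val conf = HBaseConfiguration.create()
> >>>>> println(conf.get("hbase.client.scanner.timeout.period"))
> >>>>> println(conf.get("hbase.rpc.timeout"))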
> >>>>>
> >>>>> On Oct 28, 2016, at 8:50 AM, Ted Yu <[email protected]> wrote:
> >>>>>
> >>>>> Mich:
> >>>>> The OutOfOrderScannerNextException indicates a problem with reads
> >>>>> from HBase.
> >>>>>
> >>>>> How did you know the connection to the Spark cluster was lost?
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> On Fri, Oct 28, 2016 at 8:47 AM, Mich Talebzadeh <
> >>>>> [email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> Looks like it lost the connection to the Spark cluster.
> >>>>>>
> >>>>>> What mode are you using with Spark: standalone, YARN, or something
> >>>>>> else? The issue looks like a resource manager issue.
> >>>>>>
> >>>>>> I have seen this when running Zeppelin with Spark on HBase.
> >>>>>>
> >>>>>> HTH
> >>>>>>
> >>>>>> Dr Mich Talebzadeh
> >>>>>> http://talebzadehmich.wordpress.com
> >>>>>>
> >>>>>>
> >>>>>> On 28 October 2016 at 16:38, Pat Ferrel <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> I’m getting data from HBase using a large Spark cluster with
> >>>>>>> parallelism of near 400. The query fails quite often with the
> >>>>>>> message below. Sometimes a retry will work, and sometimes the
> >>>>>>> ultimate failure results (below).
> >>>>>>>
> >>>>>>> If I reduce parallelism in Spark it slows other parts of the
> >>>>>>> algorithm unacceptably. I have also experimented with very large
> >>>>>>> RPC/scanner timeouts of many minutes, to no avail.
> >>>>>>>
> >>>>>>> Any clues about what to look for or what may be set up wrong in my
> >>>>>>> tables?
> >>>>>>>
> >>>>>>> Job aborted due to stage failure: Task 44 in stage 147.0 failed 4
> >>>>>>> times, most recent failure: Lost task 44.3 in stage 147.0 (TID
> >>>>>>> 24833, ip-172-16-3-9.eu-central-1.compute.internal):
> >>>>>>> org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry
> >>>>>>> of OutOfOrderScannerNextException: was there a rpc timeout?
> >>>>>>>   at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:403)
> >>>>>>>   at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:232)
> >>>>>>>   at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138) at
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >
> >
>
>
