Thanks for posting your findings back to the list. Does the 2-3 seconds include the startup time for the JVM?
J-D

On Wed, Nov 28, 2012 at 2:25 PM, Sean McNamara <[email protected]> wrote:

> Turns out there is a way to reuse the connection in Spark. I was also
> forgetting to call setCaching (that was the primary reason). So it's very
> fast now and I have the data where I need it.
>
> The first request still takes 2-3 seconds to set up and see data
> (regardless of how much), but after that it's super fast.
>
> Sean
>
>
> On 11/28/12 10:37 AM, "Sean McNamara" <[email protected]> wrote:
>
> >Hi J-D
> >
> >Really good questions. I will check for a misconfiguration.
> >
> >
> >> I'm not sure what you're talking about here. Which master
> >
> >I am using http://spark-project.org/ , so the master I am referring to
> >is really the Spark driver. Spark can read from a Hadoop InputFormat and
> >populate itself that way, but you don't have control over which
> >slave/worker data will land on using it. My goal is to use Spark to
> >reach in for slices of data that are in HBase, and be able to perform
> >set operations on the data in parallel using Spark. Being able to load a
> >partition onto the right node is important, so that I don't have to
> >reshuffle the data just to get it onto the right node that handles a
> >particular data partition.
> >
> >
> >> BTW why can't you keep the connections around?
> >
> >The Spark API is purely functional; AFAIK it's not possible to set up a
> >connection and keep it around (I am asking on that mailing list to be
> >sure).
> >
> >
> >> Since this is something done within the HBase client, doing it
> >>externally sounds terribly hacky
> >
> >Yup. The reason I am entertaining this route is that using an
> >InputFormat with Spark I was able to load in way more data, and it was
> >all sub-second. Since moving to having the Spark slaves handle pulling
> >in their data (not using the InputFormat), it seems slower for some
> >reason. I figured it might be because, using an InputFormat, the slaves
> >were told what to load, vs. each of the 40 slaves having to do more work
> >to find what to load. Perhaps my assumption is wrong? Thoughts?
> >
> >
> >I really appreciate your insights. Thanks!
> >
> >
> >
> >
> >
> >On 11/28/12 3:10 AM, "Jean-Daniel Cryans" <[email protected]> wrote:
> >
> >>Inline.
> >>
> >>J-D
> >>
> >>On Wed, Nov 28, 2012 at 7:28 AM, Sean McNamara
> >><[email protected]> wrote:
> >>
> >>> I have a table whose keys are prefixed with a byte to help distribute
> >>> the keys so scans don't hotspot.
> >>>
> >>> I also have a bunch of slave processes that work to scan the prefix
> >>> partitions in parallel. Currently each slave sets up its own HBase
> >>> connection, scanner, etc. Most of the slave processes finish their
> >>> scan and return within 2-3 seconds. It tends to take the same amount
> >>> of time regardless of whether there's lots of data or very little. So
> >>> I think that 2-sec overhead is there because each slave will set up a
> >>> new connection on each request (I am unable to reuse connections in
> >>> the slaves).
> >>>
> >>
> >>2 secs sounds way too high. I recommend you check into this and see
> >>where the time is spent, as you may find underlying issues like
> >>misconfiguration.
> >>
> >>
> >>>
> >>> I'm wondering if I could remove some of that overhead by using the
> >>> master (which can reuse its HBase connection) to determine the
> >>> splits, and then delegating that information out to each slave. I
> >>> think I could possibly use TableInputFormat/TableRecordReader to
> >>> accomplish this? Would this route make sense?
> >>>
> >>
> >>I'm not sure what you're talking about here. Which master? HBase's, or
> >>is there something in your infrastructure that's also called "master"?
> >>Then I'm not sure what you are trying to achieve by "determine the
> >>splits": you mean finding the regions you need to contact from your
> >>slaves? Since this is something done within the HBase client, doing it
> >>externally sounds terribly hacky. BTW why can't you keep the
> >>connections around? Is it a problem of JVMs being re-spawned? If so,
> >>there are techniques you can use to keep them around for reuse, and
> >>then you would also benefit from reusing connections.
> >>
> >>Hope this helps,
> >>
> >>J-D
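For readers skimming the archive, the one-byte key prefix ("salting") scheme Sean describes can be sketched roughly like this in Java; the bucket count, hash choice, and class name are illustrative assumptions, not taken from the thread:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of salting row keys with a one-byte bucket prefix so sequential
// writes and scans spread across regions instead of hotspotting one server.
public class SaltedKeys {
    static final int NUM_BUCKETS = 40; // assumption: one bucket per slave

    // Prepend a deterministic one-byte bucket prefix to a row key.
    static byte[] salt(byte[] rowKey) {
        byte prefix = (byte) Math.floorMod(Arrays.hashCode(rowKey), NUM_BUCKETS);
        byte[] salted = new byte[rowKey.length + 1];
        salted[0] = prefix;
        System.arraycopy(rowKey, 0, salted, 1, rowKey.length);
        return salted;
    }

    public static void main(String[] args) {
        byte[] key = "user-12345".getBytes(StandardCharsets.UTF_8);
        byte[] salted = salt(key);
        // The prefix is derived from the key itself, so point reads can
        // re-derive it; only full scans need to visit every bucket.
        System.out.println("stable=" + (salt(key)[0] == salted[0]));
        System.out.println("inRange=" + (salted[0] >= 0 && salted[0] < NUM_BUCKETS));
    }
}
```

Because the prefix is deterministic, a point lookup can compute its bucket directly, while one logical range scan becomes NUM_BUCKETS parallel scans, one per prefix, which is exactly the partitioning the slave processes in the thread exploit.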

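The per-prefix parallel scan the slaves perform can be sketched as follows. This is a minimal in-memory model: a ConcurrentSkipListMap stands in for the HBase table, and the bucket count is reduced from the thread's 40 for brevity.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Each worker "scans" only the rows under its bucket prefix, mimicking
// the salted-table layout described in the thread.
public class ParallelPrefixScan {
    public static void main(String[] args) throws Exception {
        int numBuckets = 4; // assumption; the thread mentions 40 slaves
        ConcurrentSkipListMap<String, String> table = new ConcurrentSkipListMap<>();
        for (int i = 0; i < 100; i++) {
            // Keys are "<bucket>|<row>", like a salted HBase row key.
            table.put((i % numBuckets) + "|row" + String.format("%03d", i), "v" + i);
        }

        ExecutorService pool = Executors.newFixedThreadPool(numBuckets);
        CompletionService<Integer> scans = new ExecutorCompletionService<>(pool);
        for (int b = 0; b < numBuckets; b++) {
            String start = b + "|";        // inclusive start row of the bucket
            String stop = b + "|\uffff";   // exclusive stop row of the bucket
            // One range "scan" per bucket, run in parallel.
            scans.submit(() -> table.subMap(start, stop).size());
        }

        int total = 0;
        for (int b = 0; b < numBuckets; b++) {
            total += scans.take().get();
        }
        pool.shutdown();
        System.out.println("rows scanned: " + total);
    }
}
```

In the real setup each worker would open an HBase Scan bounded by its prefix byte (start row = prefix, stop row = prefix + 1) and, per the fix Sean found, call setCaching on it so each RPC returns a batch of rows rather than one at a time.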