> Is 2-3 seconds including the startup time for the JVM?

Not in this case.  I put a timer wrapper around my call to
HTablePool.getTable and the scan.
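The wrapper is nothing fancy, just elapsed-time logging around the calls.
A minimal sketch of the technique; the helper name is illustrative and the
actual HBase calls are stood in by a placeholder:

```java
import java.util.concurrent.Callable;

public class Timed {
    // Runs the callable, prints elapsed milliseconds under the given
    // label, and passes the result through unchanged.
    public static <T> T time(String label, Callable<T> work) throws Exception {
        long start = System.nanoTime();
        T result = work.call();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + " took " + elapsedMs + " ms");
        return result;
    }

    public static void main(String[] args) throws Exception {
        // In the real code this wrapped HTablePool.getTable and the scan;
        // here a trivial computation stands in for that work.
        int sum = time("stand-in work", () -> {
            int s = 0;
            for (int i = 0; i < 1000; i++) s += i;
            return s;
        });
        System.out.println("result=" + sum);
    }
}
```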
Thanks J-D

Sean

On 11/29/12 3:07 PM, "Jean-Daniel Cryans" <[email protected]> wrote:

>Thanks for posting your findings back to the list.
>
>Is 2-3 seconds including the startup time for the JVM?
>
>J-D
>
>On Wed, Nov 28, 2012 at 2:25 PM, Sean McNamara
><[email protected]> wrote:
>
>> Turns out there is a way to reuse the connection in Spark.  I was also
>> forgetting to call setCaching (that was the primary reason).  So it's
>> very fast now and I have the data where I need it.
>>
>> The first request still takes 2-3 seconds to setup and see data
>> (regardless of how much), but after that it's super fast.
>>
>> Sean
>>
>>
>> On 11/28/12 10:37 AM, "Sean McNamara" <[email protected]> wrote:
>>
>> >Hi J-D
>> >
>> >Really good questions.  I will check for a misconfiguration.
>> >
>> >> I'm not sure what you're talking about here. Which master
>> >
>> >I am using http://spark-project.org/ , so the master I am referring to
>> >is really the spark driver.  Spark can read from a hadoop InputFormat
>> >and populate itself that way, but you don't have control over which
>> >slave/worker data will land on using it.  My goal is to use spark to
>> >reach in for slices of data that are in HBase, and be able to perform
>> >set operations on the data in parallel using spark.  Being able to load
>> >a partition onto the right node is important.  This is so that I don't
>> >have to reshuffle the data, just to get it onto the right node that
>> >handles a particular data partition.
>> >
>> >> BTW why can't you keep the connections around?
>> >
>> >The spark api is totally functional, AFAIK it's not possible to set up
>> >a connection and keep it around (I am asking on that mailing list to be
>> >sure).
>> >
>> >> Since this is something done within the HBase client, doing it
>> >> externally sounds terribly hacky
>> >
>> >Yup.  The reason I am entertaining this route is that using an
>> >InputFormat with spark I was able to load in way more data, and it was
>> >all sub-second.  Since moving to having the spark slaves handle pulling
>> >in their data (not using the InputFormat) it seems slower for some
>> >reason.  I figured it might be because using an InputFormat the slaves
>> >were told what to load, vs. each of the 40 slaves having to do more
>> >work to find what to load.  Perhaps my assumption is wrong?  Thoughts?
>> >
>> >I really appreciate your insights.  Thanks!
>> >
>> >
>> >On 11/28/12 3:10 AM, "Jean-Daniel Cryans" <[email protected]> wrote:
>> >
>> >>Inline.
>> >>
>> >>J-D
>> >>
>> >>On Wed, Nov 28, 2012 at 7:28 AM, Sean McNamara
>> >><[email protected]> wrote:
>> >>
>> >>> I have a table whose keys are prefixed with a byte to help
>> >>> distribute the keys so scans don't hotspot.
>> >>>
>> >>> I also have a bunch of slave processes that work to scan the prefix
>> >>> partitions in parallel.  Currently each slave sets up its own hbase
>> >>> connection, scanner, etc.  Most of the slave processes finish their
>> >>> scan and return within 2-3 seconds.  It tends to take the same
>> >>> amount of time regardless of if there's lots of data, or very
>> >>> little.  So I think that 2 sec overhead is there because each slave
>> >>> will set up a new connection on each request (I am unable to reuse
>> >>> connections in the slaves).
>> >>>
>> >>
>> >>2 secs sounds way too high.  I recommend you check into this and see
>> >>where the time is spent as you may find underlying issues like
>> >>misconfiguration.
>> >>
>> >>
>> >>>
>> >>> I'm wondering if I could remove some of that overhead by using the
>> >>> master (which can reuse its hbase connection) to determine the
>> >>> splits, and then delegating that information out to each slave.  I
>> >>> think I could possibly use TableInputFormat/TableRecordReader to
>> >>> accomplish this?  Would this route make sense?
>> >>>
>> >>
>> >>I'm not sure what you're talking about here.  Which master?  HBase's
>> >>or there's something in your infrastructure that's also called
>> >>"master"?  Then I'm not sure what you are trying to achieve by
>> >>"determine the splits", you mean finding the regions you need to
>> >>contact from your slaves?  Since this is something done within the
>> >>HBase client, doing it externally sounds terribly hacky.  BTW why
>> >>can't you keep the connections around?  Is it a problem of JVMs being
>> >>re-spawned?  If so, there are techniques you can use to keep them
>> >>around for reuse and then you would also benefit from reusing
>> >>connections.
>> >>
>> >>Hope this helps,
>> >>
>> >>J-D
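A footnote on the salt-prefix scheme discussed above: since each slave
scans exactly one single-byte prefix partition, the "splits" it needs are
just per-prefix start/stop rows, which can be computed without touching
HBase at all.  A rough sketch, assuming the salt is one leading byte in
0..N-1; the class and method names are illustrative, not from the thread:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Computes [startRow, stopRow) pairs for salt-prefixed keys, one pair per
// salt byte, so each slave can scan exactly one prefix partition.  Each
// slave would then build new Scan(start, stop) and call setCaching(...)
// on it, so rows come back in batches rather than one RPC per row (the
// setCaching omission was the main slowdown reported above).
public class SaltRanges {
    public static List<byte[][]> ranges(int numPartitions) {
        List<byte[][]> out = new ArrayList<>();
        for (int salt = 0; salt < numPartitions; salt++) {
            byte[] start = new byte[] { (byte) salt };
            // Stop row is exclusive: the next salt value, or an empty
            // array (meaning "end of table") for the last partition.
            byte[] stop = (salt + 1 < numPartitions)
                    ? new byte[] { (byte) (salt + 1) }
                    : new byte[0];
            out.add(new byte[][] { start, stop });
        }
        return out;
    }

    public static void main(String[] args) {
        for (byte[][] r : ranges(4)) {
            System.out.println(Arrays.toString(r[0]) + " -> " + Arrays.toString(r[1]));
        }
    }
}
```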
