Thanks for posting your findings back to the list. Does the 2-3 seconds include the startup time for the JVM?
J-D

On Wed, Nov 28, 2012 at 2:25 PM, Sean McNamara <[email protected]> wrote:

> Turns out there is a way to reuse the connection in Spark. I was also
> forgetting to call setCaching (that was the primary reason). So it's very
> fast now and I have the data where I need it.
>
> The first request still takes 2-3 seconds to set up and see data
> (regardless of how much), but after that it's super fast.
>
> Sean
>
>
> On 11/28/12 10:37 AM, "Sean McNamara" <[email protected]> wrote:
>
> >Hi J-D
> >
> >Really good questions. I will check for a misconfiguration.
> >
> >
> >> I'm not sure what you're talking about here. Which master
> >
> >I am using http://spark-project.org/ , so the master I am referring to
> >is really the Spark driver. Spark can read from a Hadoop InputFormat and
> >populate itself that way, but you don't have control over which
> >slave/worker data will land on using it. My goal is to use Spark to
> >reach in for slices of data that are in HBase, and be able to perform
> >set operations on the data in parallel using Spark. Being able to load a
> >partition onto the right node is important, so that I don't have to
> >reshuffle the data just to get it onto the right node that handles a
> >particular data partition.
> >
> >
> >> BTW why can't you keep the connections around?
> >
> >The Spark API is purely functional; AFAIK it's not possible to set up a
> >connection and keep it around (I am asking on that mailing list to be
> >sure).
> >
> >
> >> Since this is something done within the HBase client, doing it
> >>externally sounds terribly hacky
> >
> >Yup. The reason I am entertaining this route is that using an
> >InputFormat with Spark I was able to load in way more data, and it was
> >all sub-second. Since moving to having the Spark slaves handle pulling
> >in their data (not using the InputFormat), it seems slower for some
> >reason. I figured it might be because, using an InputFormat, the slaves
> >were told what to load, vs. each of the 40 slaves having to do more work
> >to find what to load. Perhaps my assumption is wrong? Thoughts?
> >
> >
> >I really appreciate your insights. Thanks!
> >
> >
> >
> >
> >
> >On 11/28/12 3:10 AM, "Jean-Daniel Cryans" <[email protected]> wrote:
> >
> >>Inline.
> >>
> >>J-D
> >>
> >>On Wed, Nov 28, 2012 at 7:28 AM, Sean McNamara
> >><[email protected]> wrote:
> >>
> >>> I have a table whose keys are prefixed with a byte to help distribute
> >>> the keys so scans don't hotspot.
> >>>
> >>> I also have a bunch of slave processes that work to scan the prefix
> >>> partitions in parallel. Currently each slave sets up its own HBase
> >>> connection, scanner, etc. Most of the slave processes finish their
> >>> scan and return within 2-3 seconds. It tends to take the same amount
> >>> of time regardless of whether there's lots of data or very little. So
> >>> I think that 2-sec overhead is there because each slave will set up a
> >>> new connection on each request (I am unable to reuse connections in
> >>> the slaves).
> >>>
> >>
> >>2 secs sounds way too high. I recommend you check into this and see
> >>where the time is spent, as you may find underlying issues like
> >>misconfiguration.
> >>
> >>
> >>>
> >>> I'm wondering if I could remove some of that overhead by using the
> >>> master (which can reuse its HBase connection) to determine the
> >>> splits, and then delegating that information out to each slave. I
> >>> think I could possibly use TableInputFormat/TableRecordReader to
> >>> accomplish this? Would this route make sense?
> >>>
> >>
> >>I'm not sure what you're talking about here. Which master? HBase's, or
> >>is there something in your infrastructure that's also called "master"?
> >>Then I'm not sure what you are trying to achieve by "determine the
> >>splits": you mean finding the regions you need to contact from your
> >>slaves? Since this is something done within the HBase client, doing it
> >>externally sounds terribly hacky. BTW why can't you keep the
> >>connections around? Is it a problem of JVMs being re-spawned? If so,
> >>there are techniques you can use to keep them around for reuse, and
> >>then you would also benefit from reusing connections.
> >>
> >>Hope this helps,
> >>
> >>J-D
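For readers skimming the archive, the one-byte key prefix ("salting") scheme Sean describes can be sketched roughly like this in Java; the bucket count, hash choice, and class name are illustrative assumptions, not taken from the thread:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of salting row keys with a one-byte bucket prefix so sequential
// writes and scans spread across regions instead of hotspotting one server.
public class SaltedKeys {
    static final int NUM_BUCKETS = 40; // assumption: one bucket per slave

    // Prepend a deterministic one-byte bucket prefix to a row key.
    static byte[] salt(byte[] rowKey) {
        byte prefix = (byte) Math.floorMod(Arrays.hashCode(rowKey), NUM_BUCKETS);
        byte[] salted = new byte[rowKey.length + 1];
        salted[0] = prefix;
        System.arraycopy(rowKey, 0, salted, 1, rowKey.length);
        return salted;
    }

    public static void main(String[] args) {
        byte[] key = "user-12345".getBytes(StandardCharsets.UTF_8);
        byte[] salted = salt(key);
        // The prefix is derived from the key itself, so point reads can
        // re-derive it; only full scans need to visit every bucket.
        System.out.println("stable=" + (salt(key)[0] == salted[0]));
        System.out.println("inRange=" + (salted[0] >= 0 && salted[0] < NUM_BUCKETS));
    }
}
```

Because the prefix is deterministic, a point lookup can compute its bucket directly, while one logical range scan becomes NUM_BUCKETS parallel scans, one per prefix, which is exactly the partitioning the slave processes in the thread exploit.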

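The per-prefix parallel scan the slaves perform can be sketched as follows. This is a minimal in-memory model: a ConcurrentSkipListMap stands in for the HBase table, and the bucket count is reduced from the thread's 40 for brevity.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Each worker "scans" only the rows under its bucket prefix, mimicking
// the salted-table layout described in the thread.
public class ParallelPrefixScan {
    public static void main(String[] args) throws Exception {
        int numBuckets = 4; // assumption; the thread mentions 40 slaves
        ConcurrentSkipListMap<String, String> table = new ConcurrentSkipListMap<>();
        for (int i = 0; i < 100; i++) {
            // Keys are "<bucket>|<row>", like a salted HBase row key.
            table.put((i % numBuckets) + "|row" + String.format("%03d", i), "v" + i);
        }

        ExecutorService pool = Executors.newFixedThreadPool(numBuckets);
        CompletionService<Integer> scans = new ExecutorCompletionService<>(pool);
        for (int b = 0; b < numBuckets; b++) {
            String start = b + "|";        // inclusive start row of the bucket
            String stop = b + "|\uffff";   // exclusive stop row of the bucket
            // One range "scan" per bucket, run in parallel.
            scans.submit(() -> table.subMap(start, stop).size());
        }

        int total = 0;
        for (int b = 0; b < numBuckets; b++) {
            total += scans.take().get();
        }
        pool.shutdown();
        System.out.println("rows scanned: " + total);
    }
}
```

In the real setup each worker would open an HBase Scan bounded by its prefix byte (start row = prefix, stop row = prefix + 1) and, per the fix Sean found, call setCaching on it so each RPC returns a batch of rows rather than one at a time.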