> Is 2-3 seconds including the startup time for the JVM?

Not in this case.  I put a timer wrapper around my call to
HTablePool.getTable and the scan.
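The wrapper is nothing fancy, just elapsed-time logging around the calls.
A minimal sketch of the technique; the helper name is illustrative and the
actual HBase calls are stood in by a placeholder:

```java
import java.util.concurrent.Callable;

public class Timed {
    // Runs the callable, prints elapsed milliseconds under the given
    // label, and passes the result through unchanged.
    public static <T> T time(String label, Callable<T> work) throws Exception {
        long start = System.nanoTime();
        T result = work.call();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + " took " + elapsedMs + " ms");
        return result;
    }

    public static void main(String[] args) throws Exception {
        // In the real code this wrapped HTablePool.getTable and the scan;
        // here a trivial computation stands in for that work.
        int sum = time("stand-in work", () -> {
            int s = 0;
            for (int i = 0; i < 1000; i++) s += i;
            return s;
        });
        System.out.println("result=" + sum);
    }
}
```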
Thanks J-D

Sean

On 11/29/12 3:07 PM, "Jean-Daniel Cryans" <[email protected]> wrote:

>Thanks for posting your findings back to the list.
>
>Is 2-3 seconds including the startup time for the JVM?
>
>J-D
>
>On Wed, Nov 28, 2012 at 2:25 PM, Sean McNamara
><[email protected]> wrote:
>
>> Turns out there is a way to reuse the connection in Spark.  I was also
>> forgetting to call setCaching (that was the primary reason).  So it's
>> very fast now and I have the data where I need it.
>>
>> The first request still takes 2-3 seconds to setup and see data
>> (regardless of how much), but after that it's super fast.
>>
>> Sean
>>
>>
>> On 11/28/12 10:37 AM, "Sean McNamara" <[email protected]> wrote:
>>
>> >Hi J-D
>> >
>> >Really good questions.  I will check for a misconfiguration.
>> >
>> >> I'm not sure what you're talking about here. Which master
>> >
>> >I am using http://spark-project.org/ , so the master I am referring to
>> >is really the spark driver.  Spark can read from a hadoop InputFormat
>> >and populate itself that way, but you don't have control over which
>> >slave/worker data will land on using it.  My goal is to use spark to
>> >reach in for slices of data that are in HBase, and be able to perform
>> >set operations on the data in parallel using spark.  Being able to load
>> >a partition onto the right node is important.  This is so that I don't
>> >have to reshuffle the data, just to get it onto the right node that
>> >handles a particular data partition.
>> >
>> >> BTW why can't you keep the connections around?
>> >
>> >The spark api is totally functional, AFAIK it's not possible to set up
>> >a connection and keep it around (I am asking on that mailing list to be
>> >sure).
>> >
>> >> Since this is something done within the HBase client, doing it
>> >> externally sounds terribly hacky
>> >
>> >Yup.  The reason I am entertaining this route is that using an
>> >InputFormat with spark I was able to load in way more data, and it was
>> >all sub-second.  Since moving to having the spark slaves handle pulling
>> >in their data (not using the InputFormat) it seems slower for some
>> >reason.  I figured it might be because using an InputFormat the slaves
>> >were told what to load, vs. each of the 40 slaves having to do more
>> >work to find what to load.  Perhaps my assumption is wrong?  Thoughts?
>> >
>> >I really appreciate your insights.  Thanks!
>> >
>> >
>> >On 11/28/12 3:10 AM, "Jean-Daniel Cryans" <[email protected]> wrote:
>> >
>> >>Inline.
>> >>
>> >>J-D
>> >>
>> >>On Wed, Nov 28, 2012 at 7:28 AM, Sean McNamara
>> >><[email protected]> wrote:
>> >>
>> >>> I have a table whose keys are prefixed with a byte to help
>> >>> distribute the keys so scans don't hotspot.
>> >>>
>> >>> I also have a bunch of slave processes that work to scan the prefix
>> >>> partitions in parallel.  Currently each slave sets up its own hbase
>> >>> connection, scanner, etc.  Most of the slave processes finish their
>> >>> scan and return within 2-3 seconds.  It tends to take the same
>> >>> amount of time regardless of if there's lots of data, or very
>> >>> little.  So I think that 2 sec overhead is there because each slave
>> >>> will set up a new connection on each request (I am unable to reuse
>> >>> connections in the slaves).
>> >>>
>> >>
>> >>2 secs sounds way too high.  I recommend you check into this and see
>> >>where the time is spent as you may find underlying issues like
>> >>misconfiguration.
>> >>
>> >>
>> >>>
>> >>> I'm wondering if I could remove some of that overhead by using the
>> >>> master (which can reuse its hbase connection) to determine the
>> >>> splits, and then delegating that information out to each slave.  I
>> >>> think I could possibly use TableInputFormat/TableRecordReader to
>> >>> accomplish this?  Would this route make sense?
>> >>>
>> >>
>> >>I'm not sure what you're talking about here.  Which master?  HBase's
>> >>or there's something in your infrastructure that's also called
>> >>"master"?  Then I'm not sure what you are trying to achieve by
>> >>"determine the splits", you mean finding the regions you need to
>> >>contact from your slaves?  Since this is something done within the
>> >>HBase client, doing it externally sounds terribly hacky.  BTW why
>> >>can't you keep the connections around?  Is it a problem of JVMs being
>> >>re-spawned?  If so, there are techniques you can use to keep them
>> >>around for reuse and then you would also benefit from reusing
>> >>connections.
>> >>
>> >>Hope this helps,
>> >>
>> >>J-D
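A footnote on the salt-prefix scheme discussed above: since each slave
scans exactly one single-byte prefix partition, the "splits" it needs are
just per-prefix start/stop rows, which can be computed without touching
HBase at all.  A rough sketch, assuming the salt is one leading byte in
0..N-1; the class and method names are illustrative, not from the thread:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Computes [startRow, stopRow) pairs for salt-prefixed keys, one pair per
// salt byte, so each slave can scan exactly one prefix partition.  Each
// slave would then build new Scan(start, stop) and call setCaching(...)
// on it, so rows come back in batches rather than one RPC per row (the
// setCaching omission was the main slowdown reported above).
public class SaltRanges {
    public static List<byte[][]> ranges(int numPartitions) {
        List<byte[][]> out = new ArrayList<>();
        for (int salt = 0; salt < numPartitions; salt++) {
            byte[] start = new byte[] { (byte) salt };
            // Stop row is exclusive: the next salt value, or an empty
            // array (meaning "end of table") for the last partition.
            byte[] stop = (salt + 1 < numPartitions)
                    ? new byte[] { (byte) (salt + 1) }
                    : new byte[0];
            out.add(new byte[][] { start, stop });
        }
        return out;
    }

    public static void main(String[] args) {
        for (byte[][] r : ranges(4)) {
            System.out.println(Arrays.toString(r[0]) + " -> " + Arrays.toString(r[1]));
        }
    }
}
```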
