I agree with NDimiduk: writing HFiles directly is cumbersome, and if you are forced to rebalance it causes unnecessary GC. We stream data in using Thrift for large-volume puts in real time. Storm is also used (it can sit on top of Kafka or 0MQ) to do pre-processing / calculation, but it is still Thrift that puts the data in. That way we can use any node to load data and let HBase natively distribute the data as needed.
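For reference, the batched-put path recommended further down this thread looks roughly like the sketch below. It is a minimal, untested sketch against the 0.98/1.0-era HTableInterface/HTable API; the table name "stream_events", the column family "d", the chunk size, and the 8 MB write buffer are illustrative values, not anything taken from this thread.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedPutSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "stream_events"); // hypothetical table name
    try {
      table.setAutoFlush(false, true);               // buffer puts on the client side
      table.setWriteBufferSize(8 * 1024 * 1024L);    // e.g. 8 MB instead of the 2 MB default

      List<Put> batch = new ArrayList<Put>();
      for (int i = 0; i < 100000; i++) {
        Put put = new Put(Bytes.toBytes(String.format("row-%08d", i)));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("value-" + i));
        batch.add(put);
        if (batch.size() == 1000) {                  // hand puts over in chunks, not one RPC per Put
          table.put(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        table.put(batch);
      }
      table.flushCommits();                          // push whatever is still in the write buffer
    } finally {
      table.close();
    }
  }
}

With auto-flush off, Puts accumulate in the client-side write buffer and go out when the buffer fills or flushCommits() is called, so you avoid paying one RPC per Put.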
On Fri, Feb 13, 2015 at 9:44 PM, lars hofhansl <[email protected]> wrote:

> That's pretty cool. Have you documented somewhere how exactly you do that
> (a blog post or something)? That'd be useful for other folks to know.
>
> From: Geovanie Marquez <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Friday, February 13, 2015 12:14 PM
> Subject: Re: Streaming data to htable
>
> We use Spark to convert large batches of data directly into HFiles. We've
> found it to be extremely performant, but we do not batch, since our use
> case is not streaming. We bring it in about 50 GB at a time, so we would
> not suffer from the small-files issue mentioned, but we do manually manage
> our region splits.
>
> My point is that Spark to HFile proved to be over 65% better for us than
> MapReduce to HFile.
>
> > On Fri, Feb 13, 2015 at 1:01 PM, Andrey Stepachev <[email protected]> wrote:
> >
> > Hi Jaime.
> >
> > It is a bit of magic to use HFiles directly without considering keys and
> > data layout (as mentioned by Nick, you will face the task of manually
> > splitting keys, so effectively you will be doing what HBase already does
> > for you).
> >
> > My original answer was for a concrete use case: it is known where the
> > keys go (to the last region).
> >
> > In my case the table was split manually, so we always knew where the
> > data would go, and at any given time each writer wrote data into only
> > one of the active regions. The data was pre-split by Kafka, so we had no
> > need to split it once more in HBase.
> >
> > But again, that was a very specific situation, and I'd recommend
> > sticking with Puts until you fully understand your model and find out
> > whether you can use bulk load for it.
> >
> > > On Fri, Feb 13, 2015 at 5:46 PM, Nick Dimiduk <[email protected]> wrote:
> > >
> > > Writing HFiles can become cumbersome if the data is spread evenly
> > > across regions -- you'll end up with lots of small files rather than a
> > > few big ones.
> > >
> > > You can batch writes through the client API. I would recommend you
> > > start with HTableInterface#put(List<Put>). You can tune the
> > > client-side buffer (#setWriteBufferSize(long) or
> > > hbase.client.write.buffer), which defaults to 2 MB. Be careful with
> > > your RPC timeouts: as you increase this size, you may need to tweak
> > > them as well.
> > >
> > > Also be sure you've tuned your memstores appropriately. Make sure
> > > you're getting nice 128 MB flushes. Do this by managing the
> > > write-active region count on each machine.
> > >
> > > > On Fri, Feb 13, 2015 at 8:37 AM, Jaime Solano <[email protected]> wrote:
> > > >
> > > > Hi Andrey,
> > > >
> > > > We're facing a similar situation, where we plan to load a lot of
> > > > data into HBase directly. We considered writing the HFiles without
> > > > MapReduce. Is this something you've done in the past? Is there any
> > > > sample code we could use as a guide? On another note, what would you
> > > > consider "big enough" to switch from regular Puts to writing HFiles?
> > > >
> > > > Thanks!
> > > >
> > > > > On Fri Feb 13 2015 at 10:58:28 Andrey Stepachev <[email protected]> wrote:
> > > > >
> > > > > Hi hongbin,
> > > > >
> > > > > It depends on how much data you ingest. If it is big enough, I'd
> > > > > look at creating HFiles directly without MapReduce (for example
> > > > > using HFileOutputFormat without MapReduce, or using HFileWriter
> > > > > directly).
> > > > >
> > > > > The created files can then be imported directly into HBase with
> > > > > LoadIncrementalHFiles#doBulkLoad. You also need to be sure that
> > > > > your regions will not split too fast: bulk load can only load
> > > > > HFiles whose KeyValues fall within one or two adjacent regions
> > > > > (better if splits are disabled and done externally).
> > > > >
> > > > > But you need to be sure that you actually need such
> > > > > micromanagement rather than just sticking with regular Puts. HBase
> > > > > can sustain quite a large amount of input data before you need to
> > > > > start worrying.
> > > > >
> > > > > Cheers.
> > > > >
> > > > > > On Fri, Feb 13, 2015 at 6:20 AM, hongbin ma <[email protected]> wrote:
> > > > > >
> > > > > > hi,
> > > > > >
> > > > > > I'm trying to use an HTable to store data that comes in a
> > > > > > streaming fashion. The incoming data is guaranteed to have a
> > > > > > larger KEY than ANY existing key in the table, and the data will
> > > > > > be READ-ONLY.
> > > > > >
> > > > > > The data is streaming in at a very high rate, and I don't want
> > > > > > to issue a PUT operation for each data entry, because that is
> > > > > > obviously poor in performance. I'm thinking about pooling the
> > > > > > data entries and flushing them to HBase every five minutes, and
> > > > > > AFAIK there are a few options:
> > > > > >
> > > > > > 1. Pool the data entries, and every 5 minutes run an MR job to
> > > > > > convert the data to HFile format. This approach could avoid the
> > > > > > overhead of single PUTs, but I'm afraid the MR job might be too
> > > > > > costly (waiting in the job queue) to keep pace.
> > > > > >
> > > > > > 2. Use HTableInterface.put(List<Put>); the batched version
> > > > > > should be faster, but I'm not quite sure by how much.
> > > > > >
> > > > > > 3. ?
> > > > > >
> > > > > > Can anyone give me some advice on this?
> > > > > > Thanks!
> > > > > >
> > > > > > hongbin
> > > > >
> > > > > --
> > > > > Andrey.
> >
> > --
> > Andrey.

--
Abraham Tom
Email: [email protected]
Phone: 415-515-3621
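To make the write-HFiles-then-bulk-load path described in the thread concrete, below is a rough, untested sketch against the 0.98/1.x APIs: an HFile is written directly with HFile.Writer and then handed to LoadIncrementalHFiles#doBulkLoad. The paths, the table name "stream_events", and the column family "d" are illustrative assumptions, and as Andrey notes above the keys still need to land in as few regions as possible.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileContext;
import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;

public class DirectHFileLoadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // doBulkLoad expects <dir>/<columnFamily>/<hfile>; "d" is an illustrative family.
    Path bulkDir = new Path("/tmp/bulk-load");        // must be readable by the region servers (HDFS)
    Path hfilePath = new Path(new Path(bulkDir, "d"), "part-00000");

    byte[] family = Bytes.toBytes("d");
    byte[] qualifier = Bytes.toBytes("payload");

    HFileContext context = new HFileContextBuilder()
        .withBlockSize(64 * 1024)
        .build();
    HFile.Writer writer = HFile.getWriterFactory(conf, new CacheConfig(conf))
        .withPath(fs, hfilePath)
        .withFileContext(context)
        .create();
    try {
      // Cells must be appended in sorted row-key order.
      for (int i = 0; i < 100000; i++) {
        byte[] row = Bytes.toBytes(String.format("row-%08d", i));
        writer.append(new KeyValue(row, family, qualifier,
            System.currentTimeMillis(), Bytes.toBytes("value-" + i)));
      }
    } finally {
      writer.close();
    }

    // Hand the finished files to HBase.
    HTable table = new HTable(conf, "stream_events");  // hypothetical table name
    try {
      new LoadIncrementalHFiles(conf).doBulkLoad(bulkDir, table);
    } finally {
      table.close();
    }
  }
}

doBulkLoad works off the <dir>/<family>/<hfile> layout, and the load is cheapest when each file's keys fall within a single region, which is why the advice in the thread is to manage splits externally.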
