On 13.02.2015 at 10:39, Sleiman Jneidi wrote:
> I would go with second option, HtableInterface.put(List<Put>). The first
> option sounds dodgy, where 5 minutes is a good time for things to go wrong
> and you lose your data

I agree with Sleiman. In my opinion the "multi put" option is the best plan. The time an HBase client needs for a "multi put" of 10, 100 or 1000 Puts is nearly the same, because the fixed overhead of the operation dominates. As a rule of thumb, the larger the list of Puts gets, the more efficient the application will be.

I would build a simple system of three threads. The first is the streaming thread, which consumes the streamed data, generates a Put for each record and adds it to an ArrayList. The second thread is a timer. As soon as the ArrayList holds more than 1M or 10M elements (such numbers are quite common and realistic) OR 5 minutes have passed, a new ArrayList is created for incoming data and the full one is handed to the third thread, which writes it to HBase. This way you never lose streamed data because the streaming thread is blocked by a write, and the buffered data is never older than 5 minutes. The implementation should be very easy.

As the OP asks for more options: there is a third one. You could buffer in another system that does not have the same overhead problem. E.g. you could dump the data into an SQL table first and then run over that with MapReduce or whatever. But that's not a clever option; I just added it for completeness.

Like Sleiman, I think your second option is the way to go!

Best wishes
Wilm
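P.S. A minimal sketch of the three-thread buffering scheme described above, in case it helps. It is deliberately independent of HBase: `BatchBuffer` and its `sink` callback are hypothetical names made up for this illustration, and the size and age limits are parameters you would have to tune yourself. Take it as a starting point under those assumptions, not as a finished implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper (not part of HBase): collects items and hands them to
// a sink in batches, either when the buffer is full or when the timer fires.
class BatchBuffer<T> implements AutoCloseable {
    private final int maxSize;
    private final Consumer<List<T>> sink;
    // Third thread: performs the actual (slow) batch writes, one at a time.
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    // Second thread: forces a flush at a fixed interval.
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private List<T> buffer = new ArrayList<>();

    BatchBuffer(int maxSize, long maxAgeMillis, Consumer<List<T>> sink) {
        this.maxSize = maxSize;
        this.sink = sink;
        timer.scheduleAtFixedRate(this::flush, maxAgeMillis, maxAgeMillis,
                TimeUnit.MILLISECONDS);
    }

    // Called by the streaming thread; it never waits for the write itself.
    synchronized void add(T item) {
        buffer.add(item);
        if (buffer.size() >= maxSize) {
            flush();
        }
    }

    // Swap in a fresh list and pass the filled one to the writer thread.
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<T> full = buffer;
        buffer = new ArrayList<>();
        writer.execute(() -> sink.accept(full));
    }

    // Stop the timer, hand over what is left and wait for pending writes.
    // add() must not be called after close().
    @Override
    public void close() {
        timer.shutdownNow();
        flush();
        writer.shutdown();
        try {
            writer.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With HBase, the sink would be whatever wraps the multi put from option two, i.e. a call to put(List<Put>) on the table, with the IOException handled or retried inside the callback; that wiring is left out here on purpose. Note also that the writer's queue is unbounded in this sketch, so if the writes are consistently slower than the stream, the batches pile up in memory and you would need some back-pressure.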
