Hey there,

On Sun, Feb 19, 2012 at 1:44 PM, Manuel de Ferran <[email protected]> wrote:
> Greetings,
>
> on a testing platform (running HBase-0.90.3 on top of Hadoop-0.20-append), we did the following:
> - create a dummy table
> - put a single row
> - get this row from the shell
> - wait a few minutes
> - kill -9 the datanodes
>
> Because the regionservers could not connect to the datanodes, they shut down.
>
> On restart, the row had vanished. But if we do the same and run "flush 'dummy'" from the shell before killing the datanodes, the row is still there.
>
> Is it related to the WAL? MemStores? What happened?
>
> What are the recommended settings so rows are auto-flushed, or at least flushed more frequently?
>

I can't speak for anyone other than myself, but I flush manually at sane intervals, depending on the amount of data that I put in.

I typically store time series data in HBase; financial time series in my case means intraday market data. I ran some performance tests and found that flushing after every row insert kills write performance. The same is true if I write many thousands of rows before I commit. I found a good balance (but that's data specific, I assume) in inserting 1000 rows and then flushing, the next 1000 rows and flushing again, and a final flush at the end of processing. A sketch of this pattern follows below.

By doing so, I have never had any problems with lost data so far.

Regards
--
Ulrich Staudinger
http://www.activequant.com
Connect online: https://www.xing.com/profile/Ulrich_Staudinger
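A minimal sketch of the batching pattern described above, assuming the HTable client API of the HBase 0.90 era; the table name "dummy", the column family "cf", the qualifier "q", and the batch size of 1000 are placeholders, not settings from the original thread. Note this is the client-side write-buffer flush (flushCommits), not the shell's memstore flush.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "dummy");   // hypothetical table name

            // Buffer puts client-side instead of sending one RPC per put.
            table.setAutoFlush(false);

            int batchSize = 1000;   // rows per client-side flush
            for (int i = 0; i < 10000; i++) {
                Put put = new Put(Bytes.toBytes("row-" + i));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
                table.put(put);

                // Every 1000 rows, push the buffered puts to the regionservers.
                if ((i + 1) % batchSize == 0) {
                    table.flushCommits();
                }
            }

            // Final flush for any puts still sitting in the write buffer.
            table.flushCommits();
            table.close();
        }
    }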
