Hello,

I am building an RDF store using HBase and experimenting with different index tables and schema designs.

For the input, I have a file where each line is an RDF triple in N3 format.
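Since the input is line-oriented, a plain whitespace split is enough as a sketch for the simple resource-only case (class and method names below are my own); triples with literals containing spaces would need a proper N3 parser:

```java
import java.util.Arrays;

public class TripleParser {

  /**
   * Splits one simple N3/N-Triples line of the form "<s> <p> <o> ." into its
   * three terms. Only safe for resource-only triples: a literal object that
   * contains spaces would need a real N3 parser.
   */
  public static String[] parse(String line) {
    String t = line.trim();
    if (t.endsWith(".")) {
      t = t.substring(0, t.length() - 1).trim(); // drop the trailing dot
    }
    return t.split("\\s+", 3);
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(
        parse("<http://example.org/s> <http://example.org/p> <http://example.org/o> .")));
  }
}
```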

I need to write to multiple tables, since I am building several index tables. To reduce I/O and avoid reading the file several times, I want to do it all in one map-only job. Later the file will contain a few million triples.

So far I am experimenting in pseudo-distributed mode, but I will be able to run the job on our cluster soon. Storing the data in the tables does not need to be speed-optimized at all costs; I just want to do it as simply and quickly as possible.


What is the best way to write to more than one table in a single map task?

a)
I can use MultiTableOutputFormat.class as the job's output format and write in map(), where the output key names the target table:
ImmutableBytesWritable tableName = new ImmutableBytesWritable(Bytes.toBytes("myIndexTable"));
Put put = new Put(key);
put.add(kv);
context.write(tableName, put);

Can I write to, e.g., six tables this way by creating a new Put for each table?

But how can I turn off autoFlush and set writeBufferSize in this case? I think autoFlush is not a good fit when putting lots of values.
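For what it's worth, here is a minimal sketch of variant a) against the 0.90-era API. The table names (spo_index, pos_index), the column family "t", the row-key layout, and the naive whitespace split are all placeholders of mine, not anything confirmed from your setup. On the autoFlush question: as far as I can tell, MultiTableOutputFormat's record writer opens the HTable instances itself, disables autoFlush on them, and picks up the buffer size from hbase.client.write.buffer in the job configuration, but please verify against the source of your HBase version.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.MultiTableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MultiTableLoad {

  public static class TripleMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    // Placeholder index table names -- MultiTableOutputFormat takes the
    // target table as the output key of each write().
    private static final ImmutableBytesWritable SPO =
        new ImmutableBytesWritable(Bytes.toBytes("spo_index"));
    private static final ImmutableBytesWritable POS =
        new ImmutableBytesWritable(Bytes.toBytes("pos_index"));

    private static final byte[] CF = Bytes.toBytes("t"); // placeholder column family

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Naive split; only works for resource-only triples without spaces in terms.
      String[] t = line.toString().trim().split("\\s+");
      if (t.length < 3) {
        return; // skip malformed lines
      }
      // One Put per target table, each with its own row-key ordering.
      Put spo = new Put(Bytes.toBytes(t[0]));
      spo.add(CF, Bytes.toBytes(t[1]), Bytes.toBytes(t[2]));
      context.write(SPO, spo);

      Put pos = new Put(Bytes.toBytes(t[1]));
      pos.add(CF, Bytes.toBytes(t[2]), Bytes.toBytes(t[0]));
      context.write(POS, pos);
    }
  }

  public static Job createJob(Configuration conf, String inputPath)
      throws IOException {
    Job job = new Job(HBaseConfiguration.create(conf), "n3-multi-table-load");
    job.setJarByClass(MultiTableLoad.class);
    job.setMapperClass(TripleMapper.class);
    FileInputFormat.addInputPath(job, new Path(inputPath));
    job.setOutputFormatClass(MultiTableOutputFormat.class);
    job.setNumReduceTasks(0); // map-only, as intended
    return job;
  }
}
```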


b)
I can use an instance of HTable in the Mapper class. Then I can set autoFlush and writeBufferSize and write to the table using:
HTable table = new HTable(config, tableName);
table.put(put);

But it is recommended to use only one HTable instance per table, so I would need one
table = new HTable(config, tableName);
for each table I want to write to. Is that still fine with six tables?
I also stumbled upon HTablePool. Is it meant for scenarios like this?
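And a sketch of variant b), again against the old client API (HTable, setAutoFlush, flushCommits): one HTable per index table, opened in setup() and flushed and closed in cleanup(). Six instances in one single-threaded mapper should be fine; as I understand it, the one-instance recommendation is about not constructing a new HTable per put and about sharing across threads, which is what HTablePool targets, so the pool should not be needed inside a single mapper. Table names, column family, and row-key layout are made-up placeholders again.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DirectHTableMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  // Placeholder index table names; one HTable instance per table per mapper.
  private static final String[] TABLE_NAMES = {"spo_index", "pos_index", "osp_index"};
  private static final byte[] CF = Bytes.toBytes("t"); // placeholder column family

  private HTable[] tables;

  @Override
  protected void setup(Context context) throws IOException {
    tables = new HTable[TABLE_NAMES.length];
    for (int i = 0; i < TABLE_NAMES.length; i++) {
      tables[i] = new HTable(
          HBaseConfiguration.create(context.getConfiguration()), TABLE_NAMES[i]);
      tables[i].setAutoFlush(false);                 // buffer puts client-side
      tables[i].setWriteBufferSize(8 * 1024 * 1024); // 8 MB, tune as needed
    }
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException {
    String[] t = line.toString().trim().split("\\s+");
    if (t.length < 3) {
      return; // skip malformed lines
    }
    // Same triple, a different row-key ordering for each index table.
    tables[0].put(makePut(t[0], t[1], t[2]));
    tables[1].put(makePut(t[1], t[2], t[0]));
    tables[2].put(makePut(t[2], t[0], t[1]));
  }

  private Put makePut(String row, String qualifier, String value) {
    Put put = new Put(Bytes.toBytes(row));
    put.add(CF, Bytes.toBytes(qualifier), Bytes.toBytes(value));
    return put;
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    for (HTable table : tables) {
      table.flushCommits(); // push whatever is still buffered
      table.close();
    }
  }
}
```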


Thank you and regards,
Christopher
