Hello,
I am building an RDF store with HBase and experimenting with different
index tables and schema designs.
As input I have a file where each line is an RDF triple in N3 format.
I need to write to multiple tables, since I am building several index
tables. To reduce I/O and avoid reading the file more than once, I want
to do this in a single map-only job. Later the file will contain a few
million triples.
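For context, this is roughly the kind of index row key I have in mind. The key layout, the \u0001 separator, and the naive whitespace split are just one possible choice for illustration; literals containing spaces would need a real N3 parser:

```java
// One possible way to derive row keys for SPO/POS/OSP index tables
// from a single N3 line. Separator and split strategy are assumptions.
public class TripleKeys {

    private static final char SEP = '\u0001'; // assumed key separator

    // Returns {spoKey, posKey, ospKey} for one "<s> <p> <o> ." line.
    public static String[] indexKeys(String n3Line) {
        String line = n3Line.trim();
        if (line.endsWith(".")) {
            line = line.substring(0, line.length() - 1).trim();
        }
        // naive split into subject, predicate, object
        String[] t = line.split("\\s+", 3);
        String s = t[0], p = t[1], o = t[2];
        return new String[] {
            s + SEP + p + SEP + o, // row key for the SPO table
            p + SEP + o + SEP + s, // row key for the POS table
            o + SEP + s + SEP + p  // row key for the OSP table
        };
    }
}
```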
So far I am experimenting in pseudo-distributed mode, but I will be
able to run the job on our cluster soon.
Storing the data does not need to be speed-optimized at any cost; I
just want to do it as simply and quickly as possible.
What is the best way to write to more than one table in a single map task?
a)
I can either use MultiTableOutputFormat and write in map() using:
Put put = new Put(key);
put.add(kv);
// the output key selects the target table
context.write(tableName, put); // tableName is an ImmutableBytesWritable
Can I write to, say, six tables this way by creating a new Put for each
table? But how do I turn off autoFlush and set writeBufferSize in this
case? I think autoFlush is not a good fit when putting lots of values.
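Concretely, for (a) I imagine a mapper roughly like the following. The table names ("spo", "pos"), the column family "t", and the naive line split are placeholders of mine, not anything fixed:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of option (a): one map-only job emitting Puts for several
// tables through MultiTableOutputFormat. Names are placeholders.
public class TripleIndexMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    private static final byte[] CF = Bytes.toBytes("t"); // placeholder family
    private static final ImmutableBytesWritable SPO =
            new ImmutableBytesWritable(Bytes.toBytes("spo")); // placeholder table
    private static final ImmutableBytesWritable POS =
            new ImmutableBytesWritable(Bytes.toBytes("pos"));

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // naive split; a real job would use a proper N3 parser
        String[] t = line.toString()
                .replaceAll("\\s*\\.\\s*$", "")
                .split("\\s+", 3);
        if (t.length < 3) {
            return; // skip malformed lines
        }
        byte[] s = Bytes.toBytes(t[0]);
        byte[] p = Bytes.toBytes(t[1]);
        byte[] o = Bytes.toBytes(t[2]);

        // one Put per target table, keyed by the table name
        Put spo = new Put(Bytes.add(s, p, o));
        spo.add(CF, HConstants.EMPTY_BYTE_ARRAY, HConstants.EMPTY_BYTE_ARRAY);
        context.write(SPO, spo);

        Put pos = new Put(Bytes.add(p, o, s));
        pos.add(CF, HConstants.EMPTY_BYTE_ARRAY, HConstants.EMPTY_BYTE_ARRAY);
        context.write(POS, pos);
        // ...and so on for the remaining index tables
    }
}
```

In the driver I would then set job.setOutputFormatClass(MultiTableOutputFormat.class) and job.setNumReduceTasks(0).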
b)
I can use HTable instances in the Mapper class. Then I can set
autoFlush and writeBufferSize and write to a table using:
HTable table = new HTable(config, tableName);
table.put(put);
But it is recommended to use only one HTable instance per table, so I
would need one
table = new HTable(config, tableName);
for each table I want to write to. Is that still fine with six tables?
I also stumbled upon HTablePool. Is it intended for scenarios like this?
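For (b) I picture it like this: open one HTable per target table in setup(), buffer the puts, and flush in cleanup(). The table names, the family "t", the 8 MB buffer size, and the row key are placeholder assumptions of mine:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of option (b): one buffered HTable per index table, writing
// directly from the mapper (NullOutputFormat in the driver).
public class BufferedTripleMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private static final String[] TABLE_NAMES = {"spo", "pos", "osp"}; // placeholders
    private final Map<String, HTable> tables = new HashMap<String, HTable>();

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = HBaseConfiguration.create(context.getConfiguration());
        for (String name : TABLE_NAMES) {
            HTable table = new HTable(conf, name);
            table.setAutoFlush(false);                 // buffer puts client-side
            table.setWriteBufferSize(8 * 1024 * 1024); // 8 MB, arbitrary choice
            tables.put(name, table);
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException {
        byte[] row = Bytes.toBytes(line.toString()); // placeholder row key
        Put put = new Put(row);
        put.add(Bytes.toBytes("t"), HConstants.EMPTY_BYTE_ARRAY,
                HConstants.EMPTY_BYTE_ARRAY);
        tables.get("spo").put(put); // buffered until the buffer fills or a flush
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        for (HTable table : tables.values()) {
            table.flushCommits(); // push the remaining buffered puts
            table.close();
        }
    }
}
```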
Thank you and regards,
Christopher