I have a map task that extracts documents from a flat file and
writes them into an HBase table as individual records. The key is
derived from the path of the file (so it's idempotent) and balances
key-space distribution against locality of reference. Additionally, I
have a secondary table whose key is the hash of a file's contents (e.g.,
MD5) and whose rows index back into the primary table (along with other
data). Rows aren't subject to deletion, which makes life easy.
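
To make the two key schemes concrete, here is a rough sketch using only
the JDK. The secondary key is simply the hex MD5 of the contents. The
primary key layout shown (a short hash of the parent directory, then the
full path) is purely an assumption for illustration and is not the actual
scheme; the class and method names are hypothetical as well.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class KeySketch {
    // Lower-case hex encoding of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Secondary-table key: hash of the file's contents.
    static String contentKey(byte[] contents) {
        return hex(md5(contents));
    }

    // Primary-table key (ASSUMED layout, for illustration only):
    // 4 hex chars from the hash of the parent directory spread
    // directories across the key space, while files in the same
    // directory share a prefix and so stay adjacent.
    static String pathKey(String path) {
        int slash = path.lastIndexOf('/');
        String parent = slash > 0 ? path.substring(0, slash) : "";
        String prefix = hex(md5(parent.getBytes(StandardCharsets.UTF_8)))
                .substring(0, 4);
        return prefix + ":" + path;
    }
}
```

Both functions depend only on their input, so rerunning the extraction
over the same file produces the same keys.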

I've successfully used HFileOutputFormat and KeyValueSortReducer on a
related task that prepopulates data into the secondary table and this
works great. I'd like to convert my extraction task over to writing
HFiles out in bulk, for both tables.

I have enough control over the keys for the primary table that the map
task could write rows to the primary table in order, making it
map-side only (assuming one HFile per task). The map task could then
emit KeyValue objects for the secondary hash table and let
HFileOutputFormat/KeyValueSortReducer do their thing.
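
On "in order": HBase sorts row keys lexicographically as unsigned bytes,
and an HFile has to be written in that order. Java's byte type is signed,
so a naive comparison gets bytes >= 0x80 wrong. A minimal JDK-only sketch
of a check a map task could run over its row keys before writing them
(class and method names are hypothetical, not HBase API):

```java
import java.util.List;

class RowOrderCheck {
    // Lexicographic comparison treating each byte as unsigned (0..255).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // Equal up to the shorter length: the shorter key sorts first.
        return a.length - b.length;
    }

    // True if every row key is strictly greater than the one before it.
    static boolean isStrictlyAscending(List<byte[]> rows) {
        for (int i = 1; i < rows.size(); i++) {
            if (compareUnsigned(rows.get(i - 1), rows.get(i)) >= 0) {
                return false;
            }
        }
        return true;
    }
}
```

This only covers row keys; cells within a row have their own ordering
(family, qualifier, timestamp), which this sketch does not model.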

The question is, how do I write an HFile from a map task?
HFile.Writer? What are the gotchas?

Thanks in advance,

Jon
-- 
Jon Stewart, Principal
(646) 719-0317 | [email protected] | Arlington, VA
