I have a map task that extracts documents from a flat file and writes them into an HBase table as individual records. The row key is derived from the file's path, so it is idempotent, and it is built to balance key-space distribution against locality of reference. I also have a secondary table whose key is a hash of the file's contents (e.g., MD5) and whose rows index back into the primary table (along with other data). Rows are never deleted, which makes life easy.
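To make the two key schemes concrete, here is a minimal sketch of what they could look like. The details are assumptions for illustration only, not the exact scheme in use: the primary key is assumed to be a short hash prefix of the parent directory followed by the full path (so files in one directory stay adjacent while directories spread across the key space), and the class and method names (RowKeys, primaryKey, secondaryKey) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; the names and the salt width are illustrative assumptions.
final class RowKeys {
    private RowKeys() {}

    // MD5 is available in every standard JRE, so the checked exception is unexpected.
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Lower-case hex encoding of the first `count` bytes.
    static String hex(byte[] bytes, int count) {
        StringBuilder sb = new StringBuilder(count * 2);
        for (int i = 0; i < count; i++) {
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    // Primary key: 2-byte hash of the parent directory, then the path itself.
    // Deterministic, so re-running the extraction rewrites the same rows.
    static String primaryKey(String path) {
        int slash = path.lastIndexOf('/');
        String parent = slash > 0 ? path.substring(0, slash) : "/";
        byte[] digest = md5(parent.getBytes(StandardCharsets.UTF_8));
        return hex(digest, 2) + ":" + path;
    }

    // Secondary key: full MD5 of the file contents as 32 hex characters.
    static String secondaryKey(byte[] contents) {
        byte[] digest = md5(contents);
        return hex(digest, digest.length);
    }

    public static void main(String[] args) {
        System.out.println(primaryKey("/data/docs/a.txt"));
        System.out.println(primaryKey("/data/docs/b.txt"));
        System.out.println(secondaryKey("hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```

With a scheme like this, the two sample paths share a prefix because they share a parent directory, and the secondary key depends only on the file's bytes.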

I've successfully used HFileOutputFormat and KeyValueSortReducer on a related task that prepopulates data into the secondary table, and this works great. I'd like to convert my extraction task to write HFiles in bulk for both tables. I have enough control over the primary table's keys that the map task could write its rows in sorted order, which would make the primary-table load map-side only (assuming one HFile per task). The map task could then emit KeyValue objects for the secondary hash table and let HFileOutputFormat/KeyValueSortReducer do their thing.

The question is: how do I write an HFile from a map task? Is HFile.Writer the right tool? What are the gotchas?

Thanks in advance,

Jon

--
Jon Stewart, Principal
(646) 719-0317 | [email protected] | Arlington, VA
