Hi, I want to use Spark with HBase, and I'm confused about how to ingest my data using HBase's HFileOutputFormat. The documentation recommends calling configureIncrementalLoad, which does the following:
- Inspects the table to configure a total order partitioner
- Uploads the partitions file to the cluster and adds it to the DistributedCache
- Sets the number of reduce tasks to match the current number of regions
- Sets the output key/value class to match HFileOutputFormat2's requirements
- Sets the reducer up to perform the appropriate sorting (either KeyValueSortReducer or PutSortReducer)

But in Spark, it seems I have to do the sorting and partitioning myself, right? Can anyone show me how to do it properly? Is there a better way to ingest data into HBase quickly from Spark?

Cheers,
--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
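P.S. To make the question concrete, here is my current understanding of what the partitioning step amounts to, sketched in plain Scala (no Spark or HBase dependencies). The split keys here are hypothetical; in a real job they would come from the table's region boundaries (e.g. via HBase's RegionLocator), and the bucket-then-sort logic below is what I assume an RDD's repartitionAndSortWithinPartitions with a custom Partitioner would have to reproduce before writing HFiles:

```scala
object TotalOrderSketch {
  // Region start keys; the first region's start key is empty.
  // These values are made up for illustration.
  val splitKeys: Vector[String] = Vector("", "g", "p")

  // Mimics a total-order partitioner: a row key belongs to the region
  // with the greatest start key that is <= the row key.
  def partitionFor(rowKey: String): Int = {
    val idx = splitKeys.lastIndexWhere(start => start <= rowKey)
    math.max(idx, 0)
  }

  // Group row keys into per-region buckets and sort within each bucket,
  // matching the sorted-per-region layout HFileOutputFormat2 expects.
  def bucketAndSort(rows: Seq[String]): Map[Int, Seq[String]] =
    rows.groupBy(partitionFor).map { case (p, ks) => p -> ks.sorted }

  def main(args: Array[String]): Unit = {
    val out = bucketAndSort(Seq("zebra", "apple", "mango", "grape", "kiwi"))
    out.toSeq.sortBy(_._1).foreach { case (p, ks) => println(s"region $p: $ks") }
  }
}
```

Is this the right mental model, and is there a more direct way to wire it into saveAsNewAPIHadoopFile with HFileOutputFormat2?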