Hey Oliver,

Thanks a "billion" for the response -:) I will take any code you can provide, even if it's a hack! I will even send you an Amazon gift card - not that you care or need it -:)
Can you share some performance statistics? Thanks again.

On Wed, May 9, 2012 at 8:02 AM, Oliver Meyn (GBIF) <[email protected]> wrote:

> Heya Something,
>
> I had a similar task recently and by far the best way to go about this is
> with bulk loading after pre-splitting your target table. As you know
> ImportTsv doesn't understand Avro files so I hacked together my own
> ImportAvro class to create the Hfiles that I eventually moved into HBase
> with completebulkload. I haven't committed my class anywhere because it's
> a pretty ugly hack, but I'm happy to share it with you as a starting point.
> Doing billions of puts will just drive you crazy.
>
> Cheers,
> Oliver
>
> On 2012-05-09, at 4:51 PM, Something Something wrote:
>
> > I ran the following MR job that reads AVRO files & puts them on HBase. The
> > files have tons of data (billions). We have a fairly decent size cluster.
> > When I ran this MR job, it brought down HBase. When I commented out the
> > Puts on HBase, the job completed in 45 seconds (yes that's seconds).
> >
> > Obviously, my HBase configuration is not ideal. I am using all the default
> > HBase configurations that come out of Cloudera's distribution: 0.90.4+49.
> >
> > I am planning to read up on the following two:
> >
> > http://hbase.apache.org/book/important_configurations.html
> > http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/
> >
> > But can someone quickly take a look and recommend a list of priorities,
> > such as "try this first..."? That would be greatly appreciated. As
> > always, thanks for the time.
> >
> > Here's the Mapper. (There's no reducer):
> >
> > public class AvroProfileMapper extends AvroMapper<GenericData.Record, NullWritable> {
> >     private static final Logger logger =
> >             LoggerFactory.getLogger(AvroProfileMapper.class);
> >
> >     final private String SEPARATOR = "*";
> >
> >     private HTable table;
> >
> >     private String datasetDate;
> >     private String tableName;
> >
> >     @Override
> >     public void configure(JobConf jobConf) {
> >         super.configure(jobConf);
> >         datasetDate = jobConf.get("datasetDate");
> >         tableName = jobConf.get("tableName");
> >
> >         // Open table for writing
> >         try {
> >             table = new HTable(jobConf, tableName);
> >             table.setAutoFlush(false);
> >             table.setWriteBufferSize(1024 * 1024 * 12);
> >         } catch (IOException e) {
> >             throw new RuntimeException("Failed table construction", e);
> >         }
> >     }
> >
> >     @Override
> >     public void map(GenericData.Record record,
> >             AvroCollector<NullWritable> collector,
> >             Reporter reporter) throws IOException {
> >
> >         String u1 = record.get("u1").toString();
> >
> >         GenericData.Array<GenericData.Record> fields =
> >                 (GenericData.Array<GenericData.Record>) record.get("bag");
> >         for (GenericData.Record rec : fields) {
> >             Integer s1 = (Integer) rec.get("s1");
> >             Integer n1 = (Integer) rec.get("n1");
> >             Integer c1 = (Integer) rec.get("c1");
> >             Integer freq = (Integer) rec.get("freq");
> >             if (freq == null) {
> >                 freq = 0;
> >             }
> >
> >             String key = u1 + SEPARATOR + n1 + SEPARATOR + c1 + SEPARATOR + s1;
> >             Put put = new Put(Bytes.toBytes(key));
> >             put.setWriteToWAL(false);
> >             put.add(Bytes.toBytes("info"), Bytes.toBytes("frequency"),
> >                     Bytes.toBytes(freq.toString()));
> >             try {
> >                 table.put(put);
> >             } catch (IOException e) {
> >                 throw new RuntimeException("Error while writing to " +
> >                         table + " table.", e);
> >             }
> >
> >         }
> >         logger.error("------------ Finished processing user: " + u1);
> >     }
> >
> >     @Override
> >     public void close() throws IOException {
> >         table.close();
> >     }
> >
> > }
>
> --
> Oliver Meyn
> Software Developer
> Global Biodiversity Information Facility (GBIF)
> +45 35 32 15 12
> http://www.gbif.org
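
For anyone following the pre-splitting suggestion above, here is a rough sketch of how the region split points might be generated. This is not Oliver's ImportAvro code (which was not posted), and it makes a big assumption that is not in the thread: that row keys begin with a two-character lowercase hex prefix (for example, a hash of u1), so that keys spread evenly over 00..ff. The class name HexSplitKeys and the method splitKeys are made up for illustration. The code uses only the Java standard library; the resulting keys would then be handed to whatever table-creation call your HBase version offers that accepts split keys, such as HBaseAdmin.createTable(HTableDescriptor, byte[][]).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase or of the original thread.
final class HexSplitKeys {

    // Returns numRegions - 1 split keys that divide the two-character
    // hex prefix space 00..ff into numRegions roughly equal ranges.
    // ASSUMPTION: row keys start with a lowercase two-character hex prefix.
    static List<byte[]> splitKeys(int numRegions) {
        if (numRegions < 2 || numRegions > 256) {
            throw new IllegalArgumentException("numRegions must be in [2, 256]");
        }
        List<byte[]> keys = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            // Boundary i sits at the i-th fraction of the 256 possible prefixes.
            int boundary = (i * 256) / numRegions;
            keys.add(String.format("%02x", boundary)
                    .getBytes(StandardCharsets.US_ASCII));
        }
        return keys;
    }

    public static void main(String[] args) {
        // Print the split points for a table pre-split into 4 regions.
        for (byte[] key : splitKeys(4)) {
            System.out.println(new String(key, StandardCharsets.US_ASCII));
        }
    }
}
```

With 4 regions this prints 40, 80 and c0. If the real keys start with the raw u1 value instead of a hex prefix, the split points would have to come from sampling the actual key distribution, which this sketch does not attempt.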
