I think you should not try to join the tables this way. It goes against the recommended design/patterns of HBase (joins in HBase alone go against its design) and of M/R. You should first pre-process the data, maybe through another M/R job or a Pig script, for example, and massage it into a uniform or appropriate structure that conforms to the M/R architecture (maybe convert the tables into text files first?). Have you looked into the recommended M/R join strategies?
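In case a concrete example helps, here is a very rough sketch of what a reduce-side join of two of your tables could look like once they have been exported to text files. It is only an illustration of the pattern, not code from the links below: it assumes tab-separated files keyed by contentId, and all class names, tag strings, and the file layout are made up.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ContentReduceSideJoin {

    // Tags each Table1 record with "T1" so the reducer knows where it came from.
    public static class Table1Mapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new Text("T1\t" + parts[1]));
            }
        }
    }

    // Tags each Contentidx record with "IDX".
    public static class ContentidxMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new Text("IDX\t" + parts[1]));
            }
        }
    }

    // All records sharing a contentId arrive in one reduce call; emit the joined pairs.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> table1 = new ArrayList<String>();
            List<String> contentidx = new ArrayList<String>();
            for (Text v : values) {
                String[] parts = v.toString().split("\t", 2);
                if ("T1".equals(parts[0])) {
                    table1.add(parts[1]);
                } else {
                    contentidx.add(parts[1]);
                }
            }
            for (String t1 : table1) {
                for (String idx : contentidx) {
                    context.write(key, new Text(t1 + "\t" + idx));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "contentId reduce-side join");
        job.setJarByClass(ContentReduceSideJoin.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, Table1Mapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, ContentidxMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same pattern extends to the third table by tagging its records with another marker, or by chaining a second join job after this one. Since the Contentidx and Content tables are small (150k and 93k rows), a map-side join with the small tables in the distributed cache is also worth considering.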
Some links to start with:
http://codingjunkie.net/mapreduce-reduce-joins/
http://chamibuddhika.wordpress.com/2012/02/26/joins-with-map-reduce/
http://blog.matthewrathbone.com/2013/02/09/real-world-hadoop-implementing-a-left-outer-join-in-hadoop-map-reduce.html

Regards,
Shahab

On Mon, Aug 19, 2013 at 9:43 AM, Pavan Sudheendra <[email protected]> wrote:

> I'm basically trying to do a join across 3 tables in the mapper. In the
> reducer I am doing a group by and writing the output to another table.
>
> Although I agree that my code is pathetic, what I could actually do is
> create an HTable object once and pass it as an extra argument to the map
> function. But would that solve the problem?
>
> Roughly, these are my tables, and the code flows like this:
> Mapper -> Table1 -> Contentidx -> Content -> Mapper aggregates the values -> Reducer.
>
> Table1 -> 19 million rows.
> Contentidx table -> 150k rows.
> Content table -> 93k rows.
>
> Yes, I have looked at the map-reduce example given on the HBase website,
> and that is what I am following.
>
>
> On Mon, Aug 19, 2013 at 7:05 PM, Shahab Yunus <[email protected]> wrote:
>
> > Can you please explain or show the flow of the code a bit more? Why are
> > you creating the HTable object again and again in the mapper? Where is
> > ContentidxTable (the name of the table, I believe?) defined? What is
> > your actual requirement?
> >
> > Also, have you looked into this, the API for wiring HBase tables with
> > M/R jobs?
> > http://hbase.apache.org/book/mapreduce.example.html
> >
> > Regards,
> > Shahab
> >
> >
> > On Mon, Aug 19, 2013 at 9:05 AM, Pavan Sudheendra <[email protected]> wrote:
> >
> > > Also, the same code works perfectly fine when I run it on a single-node
> > > cluster. I've added the HBase classpath to HADOOP_CLASSPATH and have
> > > set all the other env variables as well.
> > >
> > >
> > > On Mon, Aug 19, 2013 at 6:33 PM, Pavan Sudheendra <[email protected]> wrote:
> > >
> > > > Hi all,
> > > > I'm getting the following error message every time I run the
> > > > map-reduce job on a multi-node hadoop cluster:
> > > >
> > > > java.lang.NullPointerException
> > > >     at org.apache.hadoop.hbase.util.Bytes.toBytes(Bytes.java:414)
> > > >     at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:170)
> > > >     at com.company$AnalyzeMapper.contentidxjoin(MRjobt.java:153)
> > > >
> > > > Here's the code:
> > > >
> > > > public void map(ImmutableBytesWritable row, Result columns, Context context)
> > > >         throws IOException {
> > > >     ...
> > > >     ...
> > > > public static String contentidxjoin(String contentId) {
> > > >     Configuration conf = HBaseConfiguration.create();
> > > >     HTable table;
> > > >     try {
> > > >         table = new HTable(conf, ContentidxTable);
> > > >         if (table != null) {
> > > >             Get get1 = new Get(Bytes.toBytes(contentId));
> > > >             get1.addColumn(Bytes.toBytes(ContentidxTable_ColumnFamily),
> > > >                     Bytes.toBytes(ContentidxTable_ColumnQualifier));
> > > >             Result result1 = table.get(get1);
> > > >             byte[] val1 = result1.getValue(Bytes.toBytes(ContentidxTable_ColumnFamily),
> > > >                     Bytes.toBytes(ContentidxTable_ColumnQualifier));
> > > >             if (val1 != null) {
> > > >                 LOGGER.info("Fetched data from BARB-Content table");
> > > >             } else {
> > > >                 LOGGER.error("Error fetching data from BARB-Content table");
> > > >             }
> > > >             return_value = contentjoin(Bytes.toString(val1), contentId);
> > > >         }
> > > >     }
> > > >     catch (Exception e) {
> > > >         LOGGER.error("Error inside contentidxjoin method");
> > > >         e.printStackTrace();
> > > >     }
> > > >     return return_value;
> > > > }
> > > > }
> > > >
> > > > Assume all variables are defined.
> > > >
> > > > Can anyone please tell me why the table never gets instantiated or
> > > > entered? I had set up breakpoints, and this function gets called many
> > > > times while the mapper executes. Every time it says *Error inside
> > > > contentidxjoin method*. I'm 100% sure there are rows in the
> > > > ContentidxTable, so I'm not sure why it's not able to fetch the value
> > > > from it.
> > > >
> > > > Please help!
> > > >
> > > >
> > > > --
> > > > Regards-
> > > > Pavan
> > >
> > >
> > >
> > > --
> > > Regards-
> > > Pavan
> >
>
>
> --
> Regards-
> Pavan
>
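P.S. About the NullPointerException itself: HTable(conf, tableName) calls Bytes.toBytes(tableName), so the trace suggests that ContentidxTable is null inside the mapper. One common cause is setting that static field in the driver; static state set in the client JVM is not shipped to the task JVMs on a real cluster, which would also explain why the same code works in single-node/local mode. If that is the case here, one way around it is to pass the table name through the job Configuration and open the HTable once per task in setup(). A rough sketch follows; the property name "joins.contentidx.table" and the Text output types are placeholders, not from your code.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;

public class AnalyzeMapper extends TableMapper<Text, Text> {

    private HTable contentidxTable;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // Read the table name from the job configuration instead of a static field,
        // so it is available inside the task JVMs on the cluster.
        String tableName = conf.get("joins.contentidx.table");
        contentidxTable = new HTable(conf, tableName);   // created once per mapper task
    }

    @Override
    public void map(ImmutableBytesWritable row, Result columns, Context context)
            throws IOException, InterruptedException {
        // ... do the lookups with contentidxTable.get(...) here instead of
        // building a new HTable on every call ...
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        contentidxTable.close();
    }
}

The driver would then set conf.set("joins.contentidx.table", "Contentidx") before submitting the job. Even then, doing a random get against another table for each of the 19 million rows will be slow, which is why I would still recommend the pre-processing/join approach above.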
