Theoretically it is possible, but it goes against the design of HBase and the M/R architecture. And when I say 'goes against', I do not mean that it is impossible; I mean that you can face extreme performance degradation, a system that is hard to maintain and inflexible, and poor robustness...the kind of issues that come up whenever you misuse or incorrectly apply a concept/architecture/framework/paradigm/tool.
Coming to the question of using M/R, I am not clear on where exactly your supervisor wants you to use M/R. In the whole project? Anywhere in the project? Do you have to join HBase tables at all, or do you have to use M/R to join HBase tables (which would be quite surprising)?

Because as I said earlier, you can break your high-level application/system into a chain of dependent M/R jobs, where one job (or set of jobs) feeds the next with data. E.g. the first job(s) read data from HBase, perform some transformation and persist it in HDFS as flat files. Then your second job reads those files and applies more logic to them, possibly joining them with another set of data that is available. (A rough sketch of this chained-job setup is appended below the quoted thread.)

Here I am just giving you an idea that there are many options for breaking your system down into smaller chunks while still using HBase and M/R. It all depends on your requirements, and then on designing your set of jobs (application) accordingly. This might require some creative thinking on your part. These are just my 2 cents.

Regards,
Shahab

On Mon, Aug 19, 2013 at 10:22 AM, Pavan Sudheendra <[email protected]> wrote:

> But there's a lot of processing happening with the table data before it is
> sent over to the reducer.. Theoretically speaking, it should be possible..
>
> Our supervisor strictly wants an M/R application to do this..
>
> Do you want to see more code? I'm just baffled as to why it's giving a null
> pointer when there is clearly data.
>
> Regards,
> Pavan
>
> On Aug 19, 2013 7:41 PM, "Shahab Yunus" <[email protected]> wrote:
>
> > I think you should not try to join the tables this way. It will be against
> > the recommended design/pattern of HBase (joins in HBase alone go against
> > the design) and of M/R. You should first pre-process the data, maybe
> > through another M/R job or Pig script, for example, and massage it into a
> > uniform or appropriate structure conforming to the M/R architecture
> > (maybe convert the tables into text files first?). Have you looked into
> > the recommended M/R join strategies?
> >
> > Some links to start with:
> >
> > http://codingjunkie.net/mapreduce-reduce-joins/
> > http://chamibuddhika.wordpress.com/2012/02/26/joins-with-map-reduce/
> > http://blog.matthewrathbone.com/2013/02/09/real-world-hadoop-implementing-a-left-outer-join-in-hadoop-map-reduce.html
> >
> > Regards,
> > Shahab
> >
> >
> > On Mon, Aug 19, 2013 at 9:43 AM, Pavan Sudheendra <[email protected]> wrote:
> >
> > > I'm basically trying to do a join across 3 tables in the mapper.. In the
> > > reducer I am doing a group-by and writing the output to another table..
> > >
> > > Although I agree that my code is pathetic, what I could actually do is
> > > create an HTable object once and pass it as an extra argument to the map
> > > function.. But would that solve the problem?
> > >
> > > Roughly these are my tables, and the code flows like this:
> > > Mapper -> Table1 -> Contentidx -> Content -> Mapper aggregates the
> > > values -> Reducer.
> > >
> > > Table1 - 19 million rows.
> > > Contentidx table - 150k rows.
> > > Content table - 93k rows.
> > >
> > > Yes, I have looked at the map-reduce example given on the HBase website
> > > and that is what I am following.
> > >
> > >
> > > On Mon, Aug 19, 2013 at 7:05 PM, Shahab Yunus <[email protected]> wrote:
> > >
> > > > Can you please explain or show the flow of the code a bit more? Why
> > > > are you creating the HTable object again and again in the mapper?
> > > > Where is ContentidxTable (the name of the table, I believe?) defined?
> > > > What is your actual requirement?
> > > >
> > > > Also, have you looked into this, the API for wiring HBase tables to
> > > > M/R jobs?
> > > > http://hbase.apache.org/book/mapreduce.example.html
> > > >
> > > > Regards,
> > > > Shahab
> > > >
> > > >
> > > > On Mon, Aug 19, 2013 at 9:05 AM, Pavan Sudheendra <[email protected]> wrote:
> > > >
> > > > > Also, the same code works perfectly fine when I run it on a
> > > > > single-node cluster. I've added the HBase classpath to
> > > > > HADOOP_CLASSPATH and have set all the other env variables as well..
> > > > >
> > > > >
> > > > > On Mon, Aug 19, 2013 at 6:33 PM, Pavan Sudheendra <[email protected]> wrote:
> > > > >
> > > > > > Hi all,
> > > > > > I'm getting the following error messages every time I run the
> > > > > > map-reduce job across multiple hadoop clusters:
> > > > > >
> > > > > > java.lang.NullPointerException
> > > > > >     at org.apache.hadoop.hbase.util.Bytes.toBytes(Bytes.java:414)
> > > > > >     at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:170)
> > > > > >     at com.company$AnalyzeMapper.contentidxjoin(MRjobt.java:153)
> > > > > >
> > > > > > Here's the code:
> > > > > >
> > > > > > public void map(ImmutableBytesWritable row, Result columns,
> > > > > >         Context context) throws IOException {
> > > > > >     ...
> > > > > >     ...
> > > > > >
> > > > > > public static String contentidxjoin(String contentId) {
> > > > > >     Configuration conf = HBaseConfiguration.create();
> > > > > >     HTable table;
> > > > > >     try {
> > > > > >         table = new HTable(conf, ContentidxTable);
> > > > > >         if (table != null) {
> > > > > >             Get get1 = new Get(Bytes.toBytes(contentId));
> > > > > >             get1.addColumn(Bytes.toBytes(ContentidxTable_ColumnFamily),
> > > > > >                 Bytes.toBytes(ContentidxTable_ColumnQualifier));
> > > > > >             Result result1 = table.get(get1);
> > > > > >             byte[] val1 = result1.getValue(
> > > > > >                 Bytes.toBytes(ContentidxTable_ColumnFamily),
> > > > > >                 Bytes.toBytes(ContentidxTable_ColumnQualifier));
> > > > > >             if (val1 != null) {
> > > > > >                 LOGGER.info("Fetched data from BARB-Content table");
> > > > > >             } else {
> > > > > >                 LOGGER.error("Error fetching data from BARB-Content table");
> > > > > >             }
> > > > > >             return_value = contentjoin(Bytes.toString(val1), contentId);
> > > > > >         }
> > > > > >     } catch (Exception e) {
> > > > > >         LOGGER.error("Error inside contentidxjoin method");
> > > > > >         e.printStackTrace();
> > > > > >     }
> > > > > >     return return_value;
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > Assume all variables are defined.
> > > > > >
> > > > > > Can anyone please tell me why the table never gets instantiated or
> > > > > > entered? I had set up breakpoints, and this function gets called
> > > > > > many times while the mapper executes.. every time it says *Error
> > > > > > inside contentidxjoin method*.. I'm 100% sure there are rows in
> > > > > > the ContentidxTable, so I'm not sure why it's not able to fetch
> > > > > > the value from it..
> > > > > >
> > > > > > Please help!
> > > > > >
> > > > > > --
> > > > > > Regards-
> > > > > > Pavan
> > > > >
> > > > >
> > > > > --
> > > > > Regards-
> > > > > Pavan
> > >
> > >
> > > --
> > > Regards-
> > > Pavan
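
A rough, hypothetical sketch of the chained-job idea from the top of this message, written against the HBase/Hadoop M/R APIs of that era. Job 1 is a map-only job that dumps one column of an HBase table into flat text files on HDFS; job 2 (not shown) would then read those files and perform a standard reduce-side join as described in the links above. The table name "Table1", column family "cf", qualifier "q", and the output path are invented placeholders, not taken from the thread.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ChainedJobsDriver {

    // Job 1 mapper: for each HBase row, emit "rowKey <TAB> value" so the
    // data lands in HDFS as plain text that any later job can consume.
    static class ExportMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result columns, Context context)
                throws IOException, InterruptedException {
            byte[] val = columns.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"));
            if (val != null) {
                context.write(new Text(Bytes.toString(row.get())),
                              new Text(Bytes.toString(val)));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Job 1: HBase table -> flat text files on HDFS (map-only).
        Job export = new Job(conf, "export-table1");
        export.setJarByClass(ChainedJobsDriver.class);
        Scan scan = new Scan();
        scan.setCaching(500);        // fetch rows in batches, fewer RPCs
        scan.setCacheBlocks(false);  // don't pollute the block cache from M/R
        TableMapReduceUtil.initTableMapperJob("Table1", scan,
                ExportMapper.class, Text.class, Text.class, export);
        export.setNumReduceTasks(0); // map-only dump, no shuffle needed
        export.setOutputFormatClass(TextOutputFormat.class);
        export.setOutputKeyClass(Text.class);
        export.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(export, new Path("/tmp/table1-dump"));
        if (!export.waitForCompletion(true)) {
            System.exit(1);
        }

        // Job 2 would take /tmp/table1-dump (plus the other data set) as
        // input and do a normal reduce-side join; see the links above.
    }
}

Each stage checkpoints its output to HDFS, so a failed stage can be rerun on its own, which is part of why the chain-of-jobs layout tends to be more robust than per-row lookups inside a mapper.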

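As for the NullPointerException itself: the trace shows Bytes.toBytes blowing up inside the HTable constructor, which is what happens when the table-name string handed to new HTable(conf, ContentidxTable) is null. One plausible (unconfirmed) explanation is that ContentidxTable is a static field set only in the driver JVM and therefore null in the remote task JVMs, which would also explain why the job works on a single-node cluster. Below is a minimal sketch of a fix under that assumption: ship the name through the job Configuration and open the HTable once per task in setup() rather than once per map() call. The property name "contentidx.table" and the class skeleton are invented for illustration.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class AnalyzeMapper extends TableMapper<Text, Text> {

    private HTable contentidxTable; // one handle per task, reused for every row

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        // Driver side, before submission: conf.set("contentidx.table", "...");
        // reading it here guarantees the name travels with the job instead of
        // relying on a static field that only exists in the driver JVM.
        String tableName = conf.get("contentidx.table");
        if (tableName == null) {
            throw new IOException("contentidx.table is not set in the job configuration");
        }
        contentidxTable = new HTable(conf, tableName);
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result columns, Context context)
            throws IOException, InterruptedException {
        // ... derive contentId from the current row (placeholder logic) ...
        String contentId = Bytes.toString(row.get());
        Get get = new Get(Bytes.toBytes(contentId));
        Result lookup = contentidxTable.get(get); // reuses the task-wide handle
        // ... join logic and context.write(...) would go here ...
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        if (contentidxTable != null) {
            contentidxTable.close(); // release connections when the task ends
        }
    }
}

Even with the handle reused, issuing a random Get per input row against a 19-million-row scan is expensive; the pre-join through flat files sketched above is likely to be much faster.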