How much time do you think the MR application will take to process 19 million records in one table and 4.5 million records in another table?
On Tue, Aug 20, 2013 at 1:33 AM, Shahab Yunus <[email protected]> wrote:

> Theoretically it is possible, but it goes against the design of the HBase
> and M/R architecture. And when I say 'goes against', it does not mean that
> it is impossible; it means that you can face extreme performance
> degradation, difficulty in maintaining the system, loss of flexibility, and
> poor robustness... the usual issues when you misuse a
> concept/architecture/framework/paradigm/tool or use it incorrectly.
>
> Coming to the question of using M/R, I am confused about where exactly your
> supervisor wants you to use M/R. In the whole project? Anywhere in the
> project? Do you have to join HBase tables at all, or must you use M/R to
> join HBase tables (which would be quite surprising)? Because, as I said
> earlier, you can break your high-level application/system into a chain of
> dependent M/R jobs, where one job feeds the other with data. E.g. the first
> job(s) read data from HBase, perform some transformation and persist it
> into HDFS as flat files. Then your second job reads those and applies more
> logic to them, possibly joining them with another data set that is
> available. I am just giving you an idea that there are many options for
> breaking your system down into smaller chunks while still using HBase and
> M/R. It all depends on your requirements and then accordingly designing
> your set of jobs (application). This might require some creative thinking
> on your part.
>
> These are just my 2 cents.
>
> Regards,
> Shahab
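For concreteness, a rough sketch of the chained-job idea described above could look like the driver below: the first job flattens an HBase table into text files on HDFS, and the second job joins those files with another keyed data set. Everything here is illustrative; the table name "Table1", the "cf"/"q" column, the /tmp paths, and the naive key-based join in JoinMapper/JoinReducer are placeholders, not code from this thread.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobsDriver {

  // Job 1 mapper: flatten each Table1 row into a "rowkey <TAB> value" text line
  // (TextOutputFormat writes key, a tab, then value).
  public static class ExtractMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable row, Result columns, Context context)
        throws IOException, InterruptedException {
      byte[] v = columns.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q")); // placeholder column
      if (v != null) {
        String rowKey = Bytes.toString(row.get(), row.getOffset(), row.getLength());
        context.write(new Text(rowKey), new Text(Bytes.toString(v)));
      }
    }
  }

  // Job 2 mapper: re-key each flat line (from either input directory) by its first field.
  public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length == 2) {
        context.write(new Text(parts[0]), new Text(parts[1]));
      }
    }
  }

  // Job 2 reducer: all values sharing a key from both inputs meet here; the real
  // join/aggregation logic goes in this class (naive concatenation shown only as a placeholder).
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      StringBuilder joined = new StringBuilder();
      for (Text v : values) {
        joined.append(v.toString()).append('|');
      }
      context.write(key, new Text(joined.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Job 1: scan Table1 and persist a flattened copy into HDFS.
    Scan scan = new Scan();
    scan.setCaching(500);        // larger scanner caching for full-table M/R scans
    scan.setCacheBlocks(false);  // don't churn the region servers' block cache
    Job extract = new Job(conf, "extract-table1");
    extract.setJarByClass(ChainedJobsDriver.class);
    TableMapReduceUtil.initTableMapperJob("Table1", scan, ExtractMapper.class,
        Text.class, Text.class, extract);
    extract.setNumReduceTasks(0);  // map-only extract
    extract.setOutputKeyClass(Text.class);
    extract.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(extract, new Path("/tmp/table1-flat"));
    if (!extract.waitForCompletion(true)) {
      System.exit(1);              // the second job depends on the first finishing successfully
    }

    // Job 2: join the flattened Table1 data with another keyed data set already on HDFS.
    Job join = new Job(conf, "join-table1-with-other");
    join.setJarByClass(ChainedJobsDriver.class);
    FileInputFormat.addInputPath(join, new Path("/tmp/table1-flat"));
    FileInputFormat.addInputPath(join, new Path("/tmp/other-dataset"));
    join.setMapperClass(JoinMapper.class);
    join.setReducerClass(JoinReducer.class);
    join.setOutputKeyClass(Text.class);
    join.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(join, new Path("/tmp/join-output"));
    System.exit(join.waitForCompletion(true) ? 0 : 1);
  }
}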
> On Mon, Aug 19, 2013 at 10:22 AM, Pavan Sudheendra <[email protected]> wrote:
>
> > But there's a lot of processing happening with the table data before it
> > is sent over to the reducer.. Theoretically speaking, it should be
> > possible..
> >
> > Our supervisor strictly wants an MR application to do this..
> >
> > Do you want to see more code? I'm just baffled as to why it's giving a
> > null pointer when there is clearly data.
> >
> > Regards,
> > Pavan
> >
> > On Aug 19, 2013 7:41 PM, "Shahab Yunus" <[email protected]> wrote:
> >
> > > I think you should not try to join the tables this way. It goes against
> > > the recommended design/pattern of HBase (joins in HBase alone go
> > > against the design) and of M/R. You should first pre-process the data,
> > > maybe through another M/R job or a Pig script, for example, and massage
> > > it into a uniform or appropriate structure conforming to the M/R
> > > architecture (maybe convert the tables into text files first?). Have
> > > you looked into the recommended M/R join strategies?
> > >
> > > Some links to start with:
> > >
> > > http://codingjunkie.net/mapreduce-reduce-joins/
> > > http://chamibuddhika.wordpress.com/2012/02/26/joins-with-map-reduce/
> > > http://blog.matthewrathbone.com/2013/02/09/real-world-hadoop-implementing-a-left-outer-join-in-hadoop-map-reduce.html
> > >
> > > Regards,
> > > Shahab
> > >
> > > On Mon, Aug 19, 2013 at 9:43 AM, Pavan Sudheendra <[email protected]> wrote:
> > >
> > > > I'm basically trying to do a join across 3 tables in the mapper.. In
> > > > the reducer I am doing a group by and writing the output to another
> > > > table..
> > > >
> > > > Although I agree that my code is pathetic, what I could actually do
> > > > is create an HTable object once and pass it as an extra argument to
> > > > the map function.. But would that solve the problem?
> > > >
> > > > Roughly these are my tables, and the code flows like this:
> > > > Mapper -> Table1 -> Contentidx -> Content -> Mapper aggregates the
> > > > values -> Reducer.
> > > >
> > > > Table1 - 19 million rows.
> > > > Contentidx table - 150k rows.
> > > > Content table - 93k rows.
> > > >
> > > > Yes, I have looked at the map-reduce example given on the HBase
> > > > website, and that is what I am following.
> > > >
> > > > On Mon, Aug 19, 2013 at 7:05 PM, Shahab Yunus <[email protected]> wrote:
> > > >
> > > > > Can you please explain or show the flow of the code a bit more?
> > > > > Why are you creating the HTable object again and again in the
> > > > > mapper? Where is ContentidxTable (the name of the table, I
> > > > > believe?) defined? What is your actual requirement?
> > > > >
> > > > > Also, have you looked into this, the API for wiring HBase tables
> > > > > with M/R jobs?
> > > > > http://hbase.apache.org/book/mapreduce.example.html
> > > > >
> > > > > Regards,
> > > > > Shahab
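A rough sketch of the "create the HTable object once" idea from the messages above: open the lookup table in the mapper's setup(), reuse it on every map() call, and close it in cleanup(), instead of constructing a new HTable inside a helper for every row. The class name LookupMapper, the table name "Contentidx", the "cf"/"q" column, and the use of the scanned row key as the lookup id are all placeholders, not code from this thread.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class LookupMapper extends TableMapper<Text, Text> {

  private HTable contentidxTable;  // opened once per task attempt, not once per row

  @Override
  protected void setup(Context context) throws IOException {
    // One HTable (and its underlying connection) for the lifetime of this task.
    contentidxTable = new HTable(context.getConfiguration(), "Contentidx");
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result columns, Context context)
      throws IOException, InterruptedException {
    // Placeholder: derive the lookup key from the scanned row however your job needs to.
    String contentId = Bytes.toString(row.get(), row.getOffset(), row.getLength());

    Get get = new Get(Bytes.toBytes(contentId));
    get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"));
    Result lookup = contentidxTable.get(get);
    byte[] val = lookup.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"));
    if (val != null) {
      context.write(new Text(contentId), new Text(Bytes.toString(val)));
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    if (contentidxTable != null) {
      contentidxTable.close();
    }
  }
}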
> > > > > On Mon, Aug 19, 2013 at 9:05 AM, Pavan Sudheendra <[email protected]> wrote:
> > > > >
> > > > > > Also, the same code works perfectly fine when I run it on a
> > > > > > single-node cluster. I've added the HBase classpath to
> > > > > > HADOOP_CLASSPATH and have set all the other env variables as
> > > > > > well..
> > > > > >
> > > > > > On Mon, Aug 19, 2013 at 6:33 PM, Pavan Sudheendra <[email protected]> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > > I'm getting the following error message every time I run the
> > > > > > > map-reduce job across a multi-node hadoop cluster:
> > > > > > >
> > > > > > > java.lang.NullPointerException
> > > > > > >     at org.apache.hadoop.hbase.util.Bytes.toBytes(Bytes.java:414)
> > > > > > >     at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:170)
> > > > > > >     at com.company$AnalyzeMapper.contentidxjoin(MRjobt.java:153)
> > > > > > >
> > > > > > > Here's the code:
> > > > > > >
> > > > > > > public void map(ImmutableBytesWritable row, Result columns, Context context)
> > > > > > >         throws IOException {
> > > > > > >     ...
> > > > > > >     ...
> > > > > > >
> > > > > > > public static String contentidxjoin(String contentId) {
> > > > > > >     Configuration conf = HBaseConfiguration.create();
> > > > > > >     HTable table;
> > > > > > >     try {
> > > > > > >         table = new HTable(conf, ContentidxTable);
> > > > > > >         if (table != null) {
> > > > > > >             Get get1 = new Get(Bytes.toBytes(contentId));
> > > > > > >             get1.addColumn(Bytes.toBytes(ContentidxTable_ColumnFamily),
> > > > > > >                     Bytes.toBytes(ContentidxTable_ColumnQualifier));
> > > > > > >             Result result1 = table.get(get1);
> > > > > > >             byte[] val1 = result1.getValue(Bytes.toBytes(ContentidxTable_ColumnFamily),
> > > > > > >                     Bytes.toBytes(ContentidxTable_ColumnQualifier));
> > > > > > >             if (val1 != null) {
> > > > > > >                 LOGGER.info("Fetched data from BARB-Content table");
> > > > > > >             } else {
> > > > > > >                 LOGGER.error("Error fetching data from BARB-Content table");
> > > > > > >             }
> > > > > > >             return_value = contentjoin(Bytes.toString(val1), contentId);
> > > > > > >         }
> > > > > > >     } catch (Exception e) {
> > > > > > >         LOGGER.error("Error inside contentidxjoin method");
> > > > > > >         e.printStackTrace();
> > > > > > >     }
> > > > > > >     return return_value;
> > > > > > > }
> > > > > > > }
> > > > > > >
> > > > > > > Assume all variables are defined.
> > > > > > >
> > > > > > > Can anyone please tell me why the table never gets instantiated
> > > > > > > or entered? I had set up breakpoints, and this function gets
> > > > > > > called many times while the mapper executes.. every time it says
> > > > > > > *Error inside contentidxjoin method*.. I'm 100% sure there are
> > > > > > > rows in the ContentidxTable, so I'm not sure why it's not able
> > > > > > > to fetch the value from it..
> > > > > > >
> > > > > > > Please help!
> > > > > > >
> > > > > > > --
> > > > > > > Regards-
> > > > > > > Pavan

--
Regards-
Pavan
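One note on the NullPointerException quoted above: Bytes.toBytes() throwing inside the HTable constructor means the table-name String passed to new HTable(conf, ...) was null in the task JVM. Given that the same code runs fine on a single node but fails on the cluster, a common cause is a static field such as ContentidxTable being set on the driver/client side but never initialized on the remote task nodes. Below is a minimal sketch of one way to avoid that: push the name through the job Configuration and fail fast in setup() if it is missing. The key "lookup.table.name" and the table names "Contentidx" and "Table1" are made up for this example.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class TableNameViaConfDriver {

  public static class GuardedMapper extends TableMapper<NullWritable, NullWritable> {
    private HTable lookupTable;

    @Override
    protected void setup(Context context) throws IOException {
      // Read the lookup-table name that the driver stored in the job configuration.
      String name = context.getConfiguration().get("lookup.table.name");
      if (name == null) {
        // Fail fast with a clear message instead of an NPE deep inside HTable's constructor.
        throw new IOException("lookup.table.name is missing from the job configuration");
      }
      lookupTable = new HTable(context.getConfiguration(), name);
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result columns, Context context) {
      // ... per-row lookups against lookupTable go here, as in the setup() sketch further up ...
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      if (lookupTable != null) {
        lookupTable.close();
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Set the name before the Job is created so it ships to every task JVM.
    conf.set("lookup.table.name", "Contentidx");  // placeholder table name
    Job job = new Job(conf, "lookup-demo");
    job.setJarByClass(TableNameViaConfDriver.class);
    TableMapReduceUtil.initTableMapperJob("Table1", new Scan(), GuardedMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}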
