That was my initial plan too, but I was wondering whether there was any other best practice for the delete. So I will go that way.
Thanks,

JM

2012/11/2, Shrijeet Paliwal <[email protected]>:
> Not sure what exactly is happening in your job. But in one of the delete
> jobs I wrote, I was creating an instance of HTable in the setup method of my
> mapper:
>
> delTab = new HTable(conf, conf.get(TABLE_NAME));
>
> And performing the delete in the map() call using delTab. So no, you do not
> have access to the table directly *usually*.
>
> -Shrijeet
>
> On Fri, Nov 2, 2012 at 12:47 PM, Jean-Marc Spaggiari <[email protected]> wrote:
>
>> Sorry, one last question.
>>
>> In the map method, I have access to the row using the values
>> parameter. Now, based on the value content, I might want to delete it.
>> Do I have access to the table directly from one of the parameters? Or
>> should I call the delete using an HTableInterface from my pool?
>>
>> Thanks,
>>
>> JM
>>
>> 2012/11/2, Jean-Marc Spaggiari <[email protected]>:
>> > Yep, you perfectly got my question.
>> >
>> > I just tried it and it's working perfectly!
>> >
>> > Thanks a lot! I now have a lot to play with.
>> >
>> > JM
>> >
>> > 2012/11/2, Shrijeet Paliwal <[email protected]>:
>> >> JM,
>> >>
>> >> I personally would choose to put it in neither the hadoop libs nor the
>> >> hbase libs. Have them go to your application's own install directory.
>> >>
>> >> Then you could set the variable HADOOP_CLASSPATH to include your jar
>> >> (also include the hbase jars, hbase dependencies, and the dependencies
>> >> your program needs), and to execute, fire the 'hadoop jar' command.
>> >>
>> >> An example[1]:
>> >>
>> >> Set the classpath:
>> >> export HADOOP_CLASSPATH=`hbase classpath`:mycool.jar:mycooldependency.jar
>> >>
>> >> Fire the following to launch your job:
>> >> hadoop jar mycool.jar hbase.experiments.MyCoolProgram
>> >> -Dmapred.running.map.limit=50
>> >> -Dmapred.map.tasks.speculative.execution=false aCommandLineArg
>> >>
>> >> Did I get your question right?
>> >>
>> >> [1] In the example I gave, `hbase classpath` gets you set with all the
>> >> hbase jars.
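Shrijeet's setup()/map() pattern, fleshed out as a minimal sketch. This assumes the pre-1.0 HBase client API that was current in 2012 (HTable constructed directly; later releases deprecate it in favour of Connection/Table). The TABLE_NAME config key and the shouldDelete() criterion are made-up placeholders, not part of the thread.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;

public class DeleteMapper extends TableMapper<ImmutableBytesWritable, Result> {

  // Hypothetical config key naming the table to delete from.
  public static final String TABLE_NAME = "delete.job.table.name";

  private HTable delTab;

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    // One HTable per map task, opened once in setup().
    delTab = new HTable(conf, conf.get(TABLE_NAME));
  }

  @Override
  protected void map(ImmutableBytesWritable key, Result value, Context context)
      throws IOException {
    if (shouldDelete(value)) {
      // Delete the current row through the table handle, as described above.
      delTab.delete(new Delete(key.get()));
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    delTab.close(); // release the connection when the task ends
  }

  private boolean shouldDelete(Result value) {
    return false; // placeholder: inspect the row's cells and decide here
  }
}
```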
>> >> On Fri, Nov 2, 2012 at 11:56 AM, Jean-Marc Spaggiari <[email protected]> wrote:
>> >>
>> >>> Hi Shrijeet,
>> >>>
>> >>> Helped a lot! Thanks!
>> >>>
>> >>> Now, the only thing I need is to know the best place to put my
>> >>> JAR on the server. Should I put it in the hadoop lib directory? Or
>> >>> somewhere in the HBase structure?
>> >>>
>> >>> Thanks,
>> >>>
>> >>> JM
>> >>>
>> >>> 2012/10/29, Shrijeet Paliwal <[email protected]>:
>> >>> > In line.
>> >>> >
>> >>> > On Mon, Oct 29, 2012 at 8:11 AM, Jean-Marc Spaggiari <[email protected]> wrote:
>> >>> >
>> >>> >> I'm replying to myself ;)
>> >>> >>
>> >>> >> I found the "cleanup" and "setup" methods in the TableMapper class.
>> >>> >> So I think those are the methods I was looking for. I will init the
>> >>> >> HTablePool there. Please let me know if I'm wrong.
>> >>> >>
>> >>> >> Now, I still have a few other questions.
>> >>> >>
>> >>> >> 1) context.getCurrentValue() can throw an InterruptedException, but
>> >>> >> when can this occur? Is there a timeout on the Mapper side? Or is it
>> >>> >> if the region is going down while the job is running?
>> >>> >>
>> >>> >
>> >>> > You do not need to call context.getCurrentValue(). The 'value'
>> >>> > argument to the map method[1] has the information you are looking for.
>> >>> >
>> >>> >
>> >>> >> 2) How can I pass parameters to the map method? Can I use
>> >>> >> job.getConfiguration().set to add some properties there, and get them
>> >>> >> back with context.getConfiguration().get?
>> >>> >>
>> >>> >
>> >>> > Yes, that's how it is done.
>> >>> >
>> >>> >
>> >>> >> 3) What's the best way to log results/exceptions/traces from the map
>> >>> >> method?
>> >>> >>
>> >>> >
>> >>> > In most cases, you'll have mapper and reducer classes as nested
>> >>> > static classes within some enclosing class.
>> >>> > You can get a handle to the Logger from the enclosing class and do
>> >>> > your usual LOG.info, LOG.warn, yada yada.
>> >>> >
>> >>> > Hope it helps.
>> >>> >
>> >>> > [1] map(KEYIN key, *VALUEIN value*, Context context)
>> >>> >
>> >>> >>
>> >>> >> I will search on my side, but some help would be welcome because it
>> >>> >> seems there is not much documentation once we start to dig a bit :(
>> >>> >>
>> >>> >> JM
>> >>> >>
>> >>> >> 2012/10/27, Jean-Marc Spaggiari <[email protected]>:
>> >>> >> > Hi,
>> >>> >> >
>> >>> >> > I'm thinking about my first MapReduce class and I have some
>> >>> >> > questions.
>> >>> >> >
>> >>> >> > The goal of it will be to move some rows from one table to
>> >>> >> > another one based on the timestamp only.
>> >>> >> >
>> >>> >> > Since this is pretty new for me, I'm starting from the RowCounter
>> >>> >> > class to have a baseline.
>> >>> >> >
>> >>> >> > There are a few things I will have to update. First, the
>> >>> >> > createSubmittableJob method, to take a timestamp range instead of
>> >>> >> > a key range, and "play" with the parameters. This part is fine.
>> >>> >> >
>> >>> >> > Next, I need to update the map method, and this is where I have
>> >>> >> > some questions.
>> >>> >> >
>> >>> >> > I'm able to find the timestamp of all the cf:c from the
>> >>> >> > context.getCurrentValue() method, that's fine. Now, my concern is
>> >>> >> > on the way to get access to the table to store this field, and the
>> >>> >> > table to delete it from. Should I instantiate an HTable for the
>> >>> >> > source table, and execute a delete on it, then do an insert on
>> >>> >> > another HTable instance? Should I use an HTablePool? Also, since
>> >>> >> > I'm already on the row, can't I just mark it as deleted instead of
>> >>> >> > calling a new HTable?
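The two answers above, passing parameters through the job Configuration and logging through a Logger on the enclosing class, can be sketched together. This is a hedged sketch against the Hadoop/HBase APIs of that era; the property name myjob.min.timestamp and the class names are invented for illustration, not from the thread.

```java
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class MoveRowsJob {

  // Logger declared on the enclosing class; the nested static mapper reuses it.
  private static final Log LOG = LogFactory.getLog(MoveRowsJob.class);

  static final String MIN_TS_KEY = "myjob.min.timestamp"; // invented property name

  public static class MoveMapper extends TableMapper<ImmutableBytesWritable, Result> {
    private long minTs;

    @Override
    protected void setup(Context context) {
      // Read the parameter back on the task side from the Configuration.
      minTs = context.getConfiguration().getLong(MIN_TS_KEY, 0L);
      LOG.info("Moving rows with timestamp >= " + minTs);
    }

    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context)
        throws IOException {
      // ... compare the row's cell timestamps against minTs and act on it ...
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "move-rows");
    // Set the parameter before submitting; every task sees a copy of it.
    job.getConfiguration().setLong(MIN_TS_KEY, Long.parseLong(args[0]));
    // ... TableMapReduceUtil.initTableMapperJob(...), then job.waitForCompletion(true) ...
  }
}
```

Anything logged with LOG.info/LOG.warn ends up in the per-task logs, viewable from the job tracker's web UI.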
>> >>> >> >
>> >>> >> > Also, instead of calling the delete and put one by one, I would
>> >>> >> > like to put them in a list and execute it only when it's over 10
>> >>> >> > members. How can I make sure that at the end of the job, this is
>> >>> >> > flushed? Else, I will lose some operations. Is there a kind of
>> >>> >> > "dispose" method called on the region when the job is done?
>> >>> >> >
>> >>> >> > Thanks,
>> >>> >> >
>> >>> >> > JM
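One way to realize the batching idea from the question above: buffer the pending mutations in the mapper and flush both when the buffer reaches 10 entries and once more in cleanup(), which is exactly the "dispose"-style hook TableMapper provides (it runs once per task, after the last map() call). A sketch against the old HTable API; the target_table name and the makePut() helper are hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;

public class BatchingMapper extends TableMapper<ImmutableBytesWritable, Result> {

  private static final int BATCH_SIZE = 10;
  private HTable target;
  private final List<Put> pending = new ArrayList<Put>();

  @Override
  protected void setup(Context context) throws IOException {
    target = new HTable(context.getConfiguration(), "target_table"); // hypothetical table
  }

  @Override
  protected void map(ImmutableBytesWritable key, Result value, Context context)
      throws IOException {
    pending.add(makePut(key, value));
    if (pending.size() >= BATCH_SIZE) {
      flush(); // ship a full batch
    }
  }

  private void flush() throws IOException {
    target.put(pending); // HTable.put(List<Put>) sends the batch in one call
    pending.clear();
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    flush();        // push the partial batch so no operations are lost
    target.close();
  }

  private Put makePut(ImmutableBytesWritable key, Result value) {
    // hypothetical: copy the cells you want into a Put for the target table
    return new Put(key.get());
  }
}
```

Note that the HTable of that era also had its own client-side write buffer (setAutoFlush(false) plus flushCommits()), which achieves much the same thing without a hand-rolled list.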
