I wrote my own processors, following the WordCount example from Tez 0.4. Initially I based my code on WordCount from Tez 0.5. However, I use HDP 2.1, which ships Tez 0.4, and some methods used by the 0.5 WordCount are missing in Tez 0.4, so I decided to base my code on the 0.4 version instead. It worked fine until I hit the output format problem.

Nevertheless, I made a workaround just to check the performance of Tez with sessions. I generated the SequenceFile input for each iteration with the MapReduce version of the algorithm, then used that input for the Tez version of the algorithm (I saved the Tez output in a different location). The results are very promising: on a small dataset (~1 GB) Tez is 3 times faster, and on a ~40 GB dataset Tez is 30% faster.

I don't have time right now to work on the problem with SequenceFile as an output. I would rather rewrite the code according to the best practices. I think an upgrade from Tez 0.4 to 0.5 will also be required.
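For reference, the two settings discussed in the quoted messages below can be collected in one place. This is only a minimal sketch and not a verified fix for the output format problem. To keep it runnable without Hadoop on the classpath it uses a plain Map as a stand-in for the vertex Configuration; the class name SequenceFileOutputSettings is hypothetical, and the key string for MRJobConfig.OUTPUT_FORMAT_CLASS_ATTR is my assumption about its value in the Hadoop version shipped with HDP 2.1.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: gathers the settings from this thread as plain strings.
// A Map stands in for the Hadoop Configuration of the final vertex.
class SequenceFileOutputSettings {
    // Flag quoted in the thread for forcing the new (mapreduce) API.
    static final String NEW_API_KEY = "mapred.mapper.new-api";

    // Assumed string value of MRJobConfig.OUTPUT_FORMAT_CLASS_ATTR.
    static final String OUTPUT_FORMAT_KEY = "mapreduce.job.outputformat.class";

    // New-API output format (mapreduce package, not mapred).
    static final String SEQUENCE_FILE_OUTPUT =
            "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat";

    static Map<String, String> build() {
        Map<String, String> settings = new LinkedHashMap<>();
        // Configuration.setBoolean(key, true) stores the string "true".
        settings.put(NEW_API_KEY, "true");
        settings.put(OUTPUT_FORMAT_KEY, SEQUENCE_FILE_OUTPUT);
        return settings;
    }

    public static void main(String[] args) {
        // In the real job each pair would be applied to the final vertex
        // configuration, e.g. finalVertex.set(key, value).
        for (Map.Entry<String, String> e : build().entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

In the actual job these pairs would have to be set after any helper that may reset them, since MRHelpers.doJobClientMagic was mentioned in the thread as a possible cause of the flag being overwritten.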
Kindly regards
Wojciech Indyk

2014-05-21 19:31 GMT+02:00 Bikas Saha <[email protected]>:

> You are right. In fact, it’s a very interesting use case.
>
> Are you using MapProcessor and ReduceProcessor? Or have you written your
> own processor and are just using Tez inputs/outputs?
>
> If you look at the latest WordCount.java code in the tez code base, then
> you can see the current best practice for using the API. For these best
> practices on using the Tez API, you should look at compiling against the
> current master that tracks the next 0.5 release. If you are building tez
> locally then it’s the master branch. Otherwise maven artifacts (for
> dependency on 0.5.0-incubating-SNAPSHOT) are at
> https://repository.apache.org/content/groups/snapshots/org/apache/tez
>
> Let us know if this helps!
>
> Bikas
>
> *From:* Wojciech Indyk [mailto:[email protected]]
> *Sent:* Wednesday, May 21, 2014 1:58 AM
> *To:* [email protected]
> *Subject:* Re: Sequence file as an output
>
> When I remove MRHelpers.doJobClientMagic then NullPointerException in
> Configuration class occurs.
>
> Could you advise me a base class (class and branch/release) for good
> practice in TEZ for mapReduce jobs? I've rewritten my MR job to use
> Counters (not available in MapReduce on TEZ) and Sessions (to improve
> iterative processing speed). I have just Map and Reduce phase, it works in
> loop (several iterations), so I think using session can improve a
> performance. Am I right?
>
> Kindly regards
>
> Wojciech Indyk
>
> 2014-05-21 0:33 GMT+02:00 Siddharth Seth <[email protected]>:
>
> It's possible that the old Output Format is being used (mapred vs
> mapreduce).
>
> Could you try forcing this to use the new API with the following.
>
> finalVertex.setBoolean("mapred.mapper.new-api", true);
>
> Also, if you happen to be using MRHelpers.doJobClientMagic - remove that,
> since that could reset this parameter.
>
> This is a little messed up, but we're working on making this much easier
> to use in 0.5.
>
> Thanks
>
> - Sid
>
> On Tue, May 20, 2014 at 3:19 PM, Wojciech Indyk <[email protected]>
> wrote:
>
> Hi all!
>
> I use tez-0.4 on HDP 2.1. I tried to save results of DAG as a SequenceFile.
>
> I use:
>
> finalVertex.set(MRJobConfig.OUTPUT_FORMAT_CLASS_ATTR,
> SequenceFileOutputFormat.class.getName());
>
> The problem is the output is saved as TextOutputFormat. I use Sequence
> file as an input to DAG and it works fine (I use SequenceFileInputFormat).
>
> Kindly regards
>
> Wojciech Indyk
