If you see issues in your 0.5 build while running on the cluster, you may want to follow the latest instructions in BUILDING.txt to target Hadoop 2.2 (HDP 2.1).
*From:* Bikas Saha [mailto:[email protected]]
*Sent:* Thursday, May 22, 2014 2:56 PM
*To:* [email protected]
*Subject:* RE: Sequence file as an output

That’s good news. The gains with a larger data set may be lower because the time is dominated by the actual code that’s doing the work. You may want to check that.

You can actually build 0.5 and use it on your cluster because Tez is a client-side application. You only need to have the correct jars on the local client classpath and in the HDFS location pointed to by TEZ_LIB_URI in your tez-site.xml.

Bikas

*From:* Wojciech Indyk [mailto:[email protected]]
*Sent:* Thursday, May 22, 2014 2:33 PM
*To:* [email protected]
*Subject:* Re: Sequence file as an output

I wrote my own processors, as in WordCount in v.0.4. Initially I based them on the WordCount from TEZ 0.5. However, I use HDP 2.1, where TEZ 0.4 is installed, and some methods used by the 0.5 WordCount were missing in TEZ 0.4, so I decided to base them on the WordCount from version 0.4. It worked OK until the output format problem.

Nevertheless, I made a workaround just to check the performance of TEZ with sessions. I generated the SequenceFile input for each iteration with the MapReduce version of the algorithm. Then I used this input for the TEZ version of the algorithm (I saved the TEZ output in another place). The results are very promising. With a small dataset (~1GB) TEZ is 3 times faster. With a ~40GB dataset TEZ is 30% faster.

I don't have time now to work on the problem with SequenceFile as an output. I would rather rewrite the code according to best practices. I think updating TEZ 0.4 to 0.5 will also be required.

Kindly regards
Wojciech Indyk

2014-05-21 19:31 GMT+02:00 Bikas Saha <[email protected]>:

You are right. In fact, it’s a very interesting use case.

Are you using MapProcessor and ReduceProcessor? Or have you written your own processor and are just using Tez inputs/outputs? If you look at the latest WordCount.java code in the tez code base, then you can see the current best practice for using the API.
For these best practices on using the Tez API, you should look at compiling against the current master that tracks the next 0.5 release. If you are building tez locally then it’s the master branch. Otherwise maven artifacts (for dependency on 0.5.0-incubating-SNAPSHOT) are at https://repository.apache.org/content/groups/snapshots/org/apache/tez

Let us know if this helps!

Bikas

*From:* Wojciech Indyk [mailto:[email protected]]
*Sent:* Wednesday, May 21, 2014 1:58 AM
*To:* [email protected]
*Subject:* Re: Sequence file as an output

When I remove MRHelpers.doJobClientMagic, a NullPointerException occurs in the Configuration class. Could you recommend a base class (class and branch/release) that shows good practice in TEZ for MapReduce jobs?

I've rewritten my MR job to use Counters (not available in MapReduce on TEZ) and Sessions (to improve iterative processing speed). I have just a Map and a Reduce phase, which run in a loop (several iterations), so I think using a session can improve performance. Am I right?

Kindly regards
Wojciech Indyk

2014-05-21 0:33 GMT+02:00 Siddharth Seth <[email protected]>:

It's possible that the old Output Format is being used (mapred vs mapreduce). Could you try forcing this to use the new API with the following.

finalVertex.setBoolean("mapred.mapper.new-api", true);

Also, if you happen to be using MRHelpers.doJobClientMagic - remove that, since that could reset this parameter.

This is a little messed up, but we're working on making this much easier to use in 0.5.

Thanks
- Sid

On Tue, May 20, 2014 at 3:19 PM, Wojciech Indyk <[email protected]> wrote:

Hi all!

I use tez-0.4 on HDP 2.1. I tried to save the results of a DAG as a SequenceFile. I use:

finalVertex.set(MRJobConfig.OUTPUT_FORMAT_CLASS_ATTR, SequenceFileOutputFormat.class.getName());

The problem is that the output is saved as TextOutputFormat. I use a SequenceFile as the input to the DAG and it works fine (I use SequenceFileInputFormat).
Kindly regards
Wojciech Indyk

CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
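
For anyone reading this thread later, the "mapred vs mapreduce" explanation can be illustrated with a small toy model. This is only a sketch, not Hadoop or Tez code: the class `OutputFormatLookup` and its method are hypothetical, and it assumes the output side chooses between the old and new API based on the "mapred.mapper.new-api" flag mentioned in the thread. The property names and class names in the constants are the ones Hadoop uses; everything else is illustrative. It shows why a SequenceFileOutputFormat set under the new-API key may be ignored, with TextOutputFormat used by default, until the new-API flag is set.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of old-API vs new-API output format lookup; NOT Hadoop or Tez code. */
class OutputFormatLookup {
    // Property names as used by Hadoop.
    static final String NEW_API_FLAG = "mapred.mapper.new-api";
    // Value of MRJobConfig.OUTPUT_FORMAT_CLASS_ATTR (new "mapreduce" API).
    static final String NEW_API_OUTPUT_KEY = "mapreduce.job.outputformat.class";
    // Key read by the old "mapred" API (JobConf).
    static final String OLD_API_OUTPUT_KEY = "mapred.output.format.class";

    // Fully qualified class names of the formats involved.
    static final String OLD_TEXT = "org.apache.hadoop.mapred.TextOutputFormat";
    static final String NEW_TEXT = "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat";
    static final String NEW_SEQ = "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat";

    /**
     * Hypothetical helper: picks the output format the way the thread suggests,
     * i.e. the new-API key is only consulted when the new-API flag is true.
     */
    static String resolveOutputFormat(Map<String, String> conf) {
        boolean useNewApi = Boolean.parseBoolean(conf.getOrDefault(NEW_API_FLAG, "false"));
        if (useNewApi) {
            return conf.getOrDefault(NEW_API_OUTPUT_KEY, NEW_TEXT);
        }
        // Old API path: the new-API key is never read, so the text default wins.
        return conf.getOrDefault(OLD_API_OUTPUT_KEY, OLD_TEXT);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        // Only the new-API output format is set, as in the original question.
        conf.put(NEW_API_OUTPUT_KEY, NEW_SEQ);
        System.out.println(resolveOutputFormat(conf)); // old-API text format

        // Forcing the new API, as suggested in the reply.
        conf.put(NEW_API_FLAG, "true");
        System.out.println(resolveOutputFormat(conf)); // sequence file format
    }
}
```

In this model the fix from the thread corresponds to the second lookup. If some helper later resets the flag (as was suggested might happen with MRHelpers.doJobClientMagic), the lookup would fall back to the first case.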
