@Bikas, @Wojciech, I think 0.5 should likely work without changes as HDP-2.1 is based on Apache Hadoop 2.4
( http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.1/bk_releasenotes_hdp_2.1/content/ch_relnotes-hdp-2.1.1-product.html )

thanks
-- Hitesh

On May 22, 2014, at 2:59 PM, Bikas Saha <[email protected]> wrote:

> If you see issues in your 0.5 build while running on the cluster you may want
> to follow the latest instructions in BUILDING.txt to target Hadoop 2.2 (HDP
> 2.1).
>
> From: Bikas Saha [mailto:[email protected]]
> Sent: Thursday, May 22, 2014 2:56 PM
> To: [email protected]
> Subject: RE: Sequence file as an output
>
> That’s good news. The gains with a larger data set may be lower because the
> time is dominated by the actual code that’s doing the work. You may want to
> check that.
>
> You can actually build 0.5 and use it on your cluster because Tez is a
> client-side application. You only need to have the correct jars on the local
> client classpath and in the HDFS location pointed to by TEZ_LIB_URI in your
> tez-site.xml.
>
> Bikas
>
> From: Wojciech Indyk [mailto:[email protected]]
> Sent: Thursday, May 22, 2014 2:33 PM
> To: [email protected]
> Subject: Re: Sequence file as an output
>
> I wrote my own processors, as in WordCount in v0.4.
> Initially I based them on WordCount from TEZ 0.5. However, I use HDP 2.1,
> where TEZ 0.4 is installed, and some methods used by the 0.5 WordCount are
> missing in TEZ 0.4. So I decided to base them on WordCount from the 0.4
> version instead. It worked fine until the output format problem.
> Nevertheless, I made a workaround just to check the performance of TEZ with
> sessions. I generated the SequenceFile input for each iteration with the
> MapReduce version of the algorithm. Then I used this input for the TEZ
> version of the algorithm (I saved the TEZ output in another place). The
> results are very promising. On a small dataset (~1GB) TEZ is 3 times faster.
> On a ~40GB dataset TEZ is 30% faster.
> I don't have time now to work on the problem with SequenceFile as an output.
> I would rather rewrite the code according to best practices.
> I think an update from TEZ 0.4 to 0.5 will also be required.
>
> Kind regards
> Wojciech Indyk
>
> 2014-05-21 19:31 GMT+02:00 Bikas Saha <[email protected]>:
> You are right. In fact, it’s a very interesting use case.
>
> Are you using MapProcessor and ReduceProcessor? Or have you written your own
> processor and are just using Tez inputs/outputs?
>
> If you look at the latest WordCount.java code in the Tez code base, you can
> see the current best practice for using the API. To follow these best
> practices, you should compile against the current master, which tracks the
> next 0.5 release. If you are building Tez locally then it’s the master
> branch. Otherwise maven artifacts (for a dependency on
> 0.5.0-incubating-SNAPSHOT) are at
> https://repository.apache.org/content/groups/snapshots/org/apache/tez
>
> Let us know if this helps!
> Bikas
>
> From: Wojciech Indyk [mailto:[email protected]]
> Sent: Wednesday, May 21, 2014 1:58 AM
> To: [email protected]
> Subject: Re: Sequence file as an output
>
> When I remove MRHelpers.doJobClientMagic, a NullPointerException occurs in
> the Configuration class.
>
> Could you recommend a base class (class and branch/release) that shows good
> practice in TEZ for MapReduce jobs? I've rewritten my MR job to use Counters
> (not available in MapReduce on TEZ) and Sessions (to improve iterative
> processing speed). I have just a Map and a Reduce phase, and they run in a
> loop (several iterations), so I think using a session can improve
> performance. Am I right?
>
> Kind regards
> Wojciech Indyk
>
> 2014-05-21 0:33 GMT+02:00 Siddharth Seth <[email protected]>:
> It's possible that the old OutputFormat is being used (mapred vs mapreduce).
> Could you try forcing this to use the new API with the following:
> finalVertex.setBoolean("mapred.mapper.new-api", true);
> Also, if you happen to be using MRHelpers.doJobClientMagic, remove that,
> since that could reset this parameter.
>
> This is a little messed up, but we're working on making this much easier to
> use in 0.5.
>
> Thanks
> - Sid
>
> On Tue, May 20, 2014 at 3:19 PM, Wojciech Indyk <[email protected]>
> wrote:
> Hi all!
> I use tez-0.4 on HDP 2.1. I tried to save the results of a DAG as a
> SequenceFile. I use:
> finalVertex.set(MRJobConfig.OUTPUT_FORMAT_CLASS_ATTR,
> SequenceFileOutputFormat.class.getName());
> The problem is that the output is saved in TextOutputFormat. I use a
> SequenceFile as the input to the DAG and it works fine (I use
> SequenceFileInputFormat).
>
> Kind regards
> Wojciech Indyk
>
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader of
> this message is not the intended recipient, you are hereby notified that any
> printing, copying, dissemination, distribution, disclosure or forwarding of
> this communication is strictly prohibited. If you have received this
> communication in error, please contact the sender immediately and delete it
> from your system. Thank You.
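
Sid's suggestion in this thread (the old mapred output format may be picked up unless the new API is forced) can be illustrated with a small Java sketch. This is only a simplified model under stated assumptions, not Tez or Hadoop code: a plain Map stands in for the vertex configuration, and the class OutputFormatModel and its method resolveOutputFormat are hypothetical names made up for illustration. The key strings are the real Hadoop ones ("mapreduce.job.outputformat.class" is the value behind MRJobConfig.OUTPUT_FORMAT_CLASS_ATTR), but the lookup rule is an assumption about how the output might choose between the two APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the suspected cause discussed in the thread.
// NOT Tez/Hadoop code: a plain Map stands in for the vertex configuration.
final class OutputFormatModel {
    // Real Hadoop configuration key names.
    static final String NEW_API_FLAG = "mapred.mapper.new-api";
    static final String NEW_API_OUTPUT_FORMAT = "mapreduce.job.outputformat.class";
    static final String OLD_API_OUTPUT_FORMAT = "mapred.output.format.class";

    // Class names of the formats involved (text is the default in both APIs).
    static final String NEW_TEXT = "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat";
    static final String OLD_TEXT = "org.apache.hadoop.mapred.TextOutputFormat";
    static final String NEW_SEQ = "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat";

    // Assumed lookup rule: the new-API flag decides which key is consulted.
    static String resolveOutputFormat(Map<String, String> conf) {
        boolean useNewApi = Boolean.parseBoolean(conf.getOrDefault(NEW_API_FLAG, "false"));
        if (useNewApi) {
            return conf.getOrDefault(NEW_API_OUTPUT_FORMAT, NEW_TEXT);
        }
        return conf.getOrDefault(OLD_API_OUTPUT_FORMAT, OLD_TEXT);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        // Only the new-API key is set, as in the original question.
        conf.put(NEW_API_OUTPUT_FORMAT, NEW_SEQ);
        System.out.println(resolveOutputFormat(conf)); // old-API text default

        // Forcing the new API, as Sid suggests, makes the new-API key visible.
        conf.put(NEW_API_FLAG, "true");
        System.out.println(resolveOutputFormat(conf)); // SequenceFileOutputFormat
    }
}
```

Under this model, setting only the new-API key leaves the old-API default (text) in effect, which would match the symptom reported at the start of the thread. Whether Tez 0.4 behaves exactly this way is not confirmed here.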
