Re: Sequence file as an output

Hitesh Shah Thu, 22 May 2014 15:05:27 -0700

@Bikas, @Wojciech,

I think 0.5 should likely work without changes, as HDP-2.1 is based on Apache
Hadoop 2.4
(http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.1/bk_releasenotes_hdp_2.1/content/ch_relnotes-hdp-2.1.1-product.html).

thanks
-- Hitesh

On May 22, 2014, at 2:59 PM, Bikas Saha <[email protected]> wrote:

> If you see issues in your 0.5 build while running on the cluster you may want 
> to follow the latest instructions in BUILDING.txt to target Hadoop 2.2 (HDP 
> 2.1).
>  
> From: Bikas Saha [mailto:[email protected]] 
> Sent: Thursday, May 22, 2014 2:56 PM
> To: [email protected]
> Subject: RE: Sequence file as an output
>  
> That’s good news. The gains with a larger data set may be lower because the
> time is dominated by the actual code that’s doing the work. You may want to
> check that.
>  
> You can actually build 0.5 and use it on your cluster because Tez is a
> client-side application. You only need to have the correct jars on the local
> client classpath and at the HDFS location pointed to by tez.lib.uris in your
> tez-site.xml.
>  
> Bikas
>  
> From: Wojciech Indyk [mailto:[email protected]] 
> Sent: Thursday, May 22, 2014 2:33 PM
> To: [email protected]
> Subject: Re: Sequence file as an output
>  
> I wrote my own processors, as in WordCount from v0.4.
> Initially I based my code on WordCount from TEZ 0.5. However, I use HDP 2.1,
> where TEZ 0.4 is installed, and some methods used by the 0.5 WordCount are
> missing in TEZ 0.4. So I decided to base it on WordCount from 0.4 instead.
> It worked fine until the output format problem.
> Nevertheless, I made a workaround just to check the performance of TEZ with
> sessions. I generated the SequenceFile input for each iteration with the
> MapReduce version of the algorithm. Then I used this input for the TEZ
> version of the algorithm (I saved the TEZ output in another place). The
> results are very promising. On a small dataset (~1GB) TEZ is 3 times faster.
> On a ~40GB dataset TEZ is 30% faster.
> I don't have time now to work on the problem with SequenceFile as an output.
> I would rather rewrite the code according to best practices. I think an
> update from TEZ 0.4 to 0.5 will also be required.
> 
> Kind regards
> Wojciech Indyk
>  
> 2014-05-21 19:31 GMT+02:00 Bikas Saha <[email protected]>:
> You are right. In fact, it’s a very interesting use case.
>  
> Are you using MapProcessor and ReduceProcessor? Or have you written your own 
> processor and are just using Tez inputs/outputs?
>  
> If you look at the latest WordCount.java code in the Tez code base, you
> can see the current best practice for using the API. To follow it, compile
> against the current master, which tracks the next 0.5 release. If you are
> building Tez locally, that is the master branch. Otherwise, Maven artifacts
> (for a dependency on 0.5.0-incubating-SNAPSHOT) are at
> https://repository.apache.org/content/groups/snapshots/org/apache/tez
>  
>  
> Let us know if this helps!
> Bikas
>  
> From: Wojciech Indyk [mailto:[email protected]] 
> Sent: Wednesday, May 21, 2014 1:58 AM
> To: [email protected]
> Subject: Re: Sequence file as an output
>  
> When I remove MRHelpers.doJobClientMagic, a NullPointerException occurs in
> the Configuration class.
>  
> Could you recommend a reference class (class and branch/release) that shows
> good practice in TEZ for MapReduce jobs? I've rewritten my MR job to use
> Counters (not available in MapReduce on TEZ) and Sessions (to improve
> iterative processing speed). I have just a Map and a Reduce phase, run in a
> loop (several iterations), so I think using a session can improve
> performance. Am I right?
> 
> Kind regards
> Wojciech Indyk
>  
> 2014-05-21 0:33 GMT+02:00 Siddharth Seth <[email protected]>:
> It's possible that the old Output Format is being used (mapred vs mapreduce).
> Could you try forcing this to use the new API with the following.
>     finalVertex.setBoolean("mapred.mapper.new-api", true);
> Also, if you happen to be using MRHelpers.doJobClientMagic - remove that, 
> since that could reset this parameter.
>  
> This is a little messed up, but we're working on making this much easier to 
> use in 0.5.
>  
> Thanks
> - Sid
>  
>  
> On Tue, May 20, 2014 at 3:19 PM, Wojciech Indyk <[email protected]> 
> wrote:
> Hi all!
> I use tez-0.4 on HDP 2.1. I tried to save the results of a DAG as a
> SequenceFile.
> I use:
> finalVertex.set(MRJobConfig.OUTPUT_FORMAT_CLASS_ATTR, 
> SequenceFileOutputFormat.class.getName());
> The problem is that the output is written with TextOutputFormat instead. I
> use a SequenceFile as input to the DAG and it works fine (I use
> SequenceFileInputFormat).
> 
> Kind regards
> Wojciech Indyk
>  
>  
> 
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to 
> which it is addressed and may contain information that is confidential, 
> privileged and exempt from disclosure under applicable law. If the reader of 
> this message is not the intended recipient, you are hereby notified that any 
> printing, copying, dissemination, distribution, disclosure or forwarding of 
> this communication is strictly prohibited. If you have received this 
> communication in error, please contact the sender immediately and delete it 
> from your system. Thank You.
>  
> 
