Re: Sequence file as an output

Wojciech Indyk Thu, 22 May 2014 14:34:32 -0700

I wrote my own processors, as in WordCount from Tez 0.4.
Initially I based my code on WordCount from Tez 0.5. However, I use HDP 2.1,
where Tez 0.4 is installed, and some methods used by the 0.5 WordCount are
missing in Tez 0.4, so I decided to base it on WordCount from version 0.4
instead. It worked fine until the output format problem.
Nevertheless, I made a workaround just to check the performance of Tez with
sessions. I generated the SequenceFile input for each iteration with the
MapReduce version of the algorithm. Then I used this input for the Tez version
of the algorithm (I saved the Tez output in another place). The results are
very promising: on a small dataset (~1GB) Tez is 3 times faster, and on a
~40GB dataset Tez is 30% faster.
I don't have time now to work on the problem with SequenceFile as an output. I
would rather rewrite the code according to best practices. I think an upgrade
from Tez 0.4 to 0.5 will also be required.
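
For anyone following the thread, here is a minimal plain-Java sketch of the suspected cause discussed in the quoted messages below: the output format is stored under the new-API key, but the old-API key is the one being read. This is a simplified model, not Tez or Hadoop source code. The property names and output format class names are the real Hadoop ones; the class OutputFormatResolution, the method resolveOutputFormat, and the assumption that "mapred.mapper.new-api" is the flag that decides which key is consulted (taken from Siddharth's suggestion) are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the configuration lookup only; NOT Tez or Hadoop source code.
class OutputFormatResolution {
    // Hadoop property names.
    static final String NEW_API_FLAG = "mapred.mapper.new-api";
    // Value of MRJobConfig.OUTPUT_FORMAT_CLASS_ATTR (new "mapreduce" API).
    static final String NEW_API_OUTPUT_KEY = "mapreduce.job.outputformat.class";
    // Key read by the old "mapred" API.
    static final String OLD_API_OUTPUT_KEY = "mapred.output.format.class";

    // Output format class names; TextOutputFormat is the default in both APIs.
    static final String OLD_TEXT = "org.apache.hadoop.mapred.TextOutputFormat";
    static final String NEW_TEXT = "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat";
    static final String NEW_SEQ = "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat";

    // Hypothetical helper: picks the output format the way the thread suggests
    // the framework does, i.e. the flag selects which key is consulted.
    static String resolveOutputFormat(Map<String, String> conf) {
        boolean useNewApi = Boolean.parseBoolean(conf.getOrDefault(NEW_API_FLAG, "false"));
        if (useNewApi) {
            return conf.getOrDefault(NEW_API_OUTPUT_KEY, NEW_TEXT);
        }
        // The new-API key is ignored on this path, so the default wins.
        return conf.getOrDefault(OLD_API_OUTPUT_KEY, OLD_TEXT);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        // Mirrors finalVertex.set(MRJobConfig.OUTPUT_FORMAT_CLASS_ATTR, ...) from the thread.
        conf.put(NEW_API_OUTPUT_KEY, NEW_SEQ);
        System.out.println("flag unset: " + resolveOutputFormat(conf));

        // Mirrors finalVertex.setBoolean("mapred.mapper.new-api", true).
        conf.put(NEW_API_FLAG, "true");
        System.out.println("flag set:   " + resolveOutputFormat(conf));
    }
}
```

In this model, setting only the new-API key falls back to the old-API TextOutputFormat, which matches the symptom I reported. Whether Tez 0.4 behaves exactly this way I have not verified.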


Kind regards
Wojciech Indyk


2014-05-21 19:31 GMT+02:00 Bikas Saha <[email protected]>:

> You are right. In fact, it’s a very interesting use case.
>
>
>
> Are you using MapProcessor and ReduceProcessor? Or have you written your
> own processor and are just using Tez inputs/outputs?
>
>
>
> If you look at the latest WordCount.java code in the tez code base, then
> you can see the current best practice for using the API. For these best
> practices on using the Tez API, you should look at compiling against the
> current master that tracks the next 0.5 release. If you are building tez
> locally then it’s the master branch. Otherwise maven artifacts (for
> dependency on 0.5.0-incubating-SNAPSHOT) are at
> https://repository.apache.org/content/groups/snapshots/org/apache/tez
>
>
>
>
>
> Let us know if this helps!
>
> Bikas
>
>
>
> *From:* Wojciech Indyk [mailto:[email protected]]
> *Sent:* Wednesday, May 21, 2014 1:58 AM
> *To:* [email protected]
> *Subject:* Re: Sequence file as an output
>
>
>
> When I remove MRHelpers.doJobClientMagic, a NullPointerException occurs in
> the Configuration class.
>
>
>
> Could you point me to a base class (class and branch/release) that shows
> good practice in Tez for MapReduce jobs? I've rewritten my MR job to use
> Counters (not available in MapReduce on Tez) and Sessions (to improve
> iterative processing speed). I have just a Map and a Reduce phase, run in a
> loop (several iterations), so I think using a session can improve
> performance. Am I right?
>
>
> Kindly regards
>
> Wojciech Indyk
>
>
>
> 2014-05-21 0:33 GMT+02:00 Siddharth Seth <[email protected]>:
>
> It's possible that the old Output Format is being used (mapred vs
> mapreduce).
>
> Could you try forcing this to use the new API with the following.
>
>     finalVertex.setBoolean("mapred.mapper.new-api", true);
>
> Also, if you happen to be using MRHelpers.doJobClientMagic - remove that,
> since that could reset this parameter.
>
>
>
> This is a little messed up, but we're working on making this much easier
> to use in 0.5.
>
>
>
> Thanks
>
> - Sid
>
>
>
>
>
> On Tue, May 20, 2014 at 3:19 PM, Wojciech Indyk <[email protected]>
> wrote:
>
> Hi all!
>
> I use tez-0.4 on HDP 2.1. I tried to save the results of a DAG as a
> SequenceFile.
>
> I use:
>
> finalVertex.set(MRJobConfig.OUTPUT_FORMAT_CLASS_ATTR,
> SequenceFileOutputFormat.class.getName());
>
> The problem is that the output is written with TextOutputFormat instead. I
> use a SequenceFile as the input to the DAG and that works fine (I use
> SequenceFileInputFormat).
>
>
> Kindly regards
>
> Wojciech Indyk
>
