RE: Sequence file as an output

Bikas Saha Thu, 22 May 2014 15:00:04 -0700

If you see issues in your 0.5 build while running on the cluster you may
want to follow the latest instructions in BUILDING.txt to target Hadoop 2.2
(HDP 2.1).




*From:* Bikas Saha [mailto:[email protected]]
*Sent:* Thursday, May 22, 2014 2:56 PM
*To:* [email protected]
*Subject:* RE: Sequence file as an output



That’s good news. The gains with the larger data set may be lower because
the time is dominated by the actual code that’s doing the work. You may
want to check that.
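
The reasoning above can be illustrated with a small model: if each run costs a fixed overhead (job startup, scheduling) plus the time of the actual work, the ratio between two frameworks shrinks as the work grows. The sketch below is only an illustration; the class name, the method, and all the numbers are hypothetical and are not measurements from this thread.

```java
public class SpeedupModel {
    /**
     * Ratio of total run times when both frameworks do the same work
     * but pay different fixed overheads. All values are in seconds.
     */
    public static double speedup(double overheadA, double overheadB, double work) {
        return (overheadA + work) / (overheadB + work);
    }

    public static void main(String[] args) {
        // Hypothetical overheads, chosen only to show the trend.
        double slowOverhead = 60.0; // assumed per-iteration startup cost of framework A
        double fastOverhead = 10.0; // assumed per-iteration startup cost of framework B

        // Small job: overhead dominates, so the ratio is large.
        System.out.println(speedup(slowOverhead, fastOverhead, 15.0));  // prints 3.0

        // Large job: the work dominates, so the ratio moves toward 1.
        System.out.println(speedup(slowOverhead, fastOverhead, 600.0));
    }
}
```

If the measured gain on the large data set is much bigger than such a model suggests, the difference probably comes from more than startup overhead.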



You can actually build 0.5 and use it on your cluster because Tez is a
client-side application. You only need to have the correct jars on the
local client classpath and in the HDFS location pointed to by TEZ_LIB_URI
in your tez-site.xml.



Bikas



*From:* Wojciech Indyk [mailto:[email protected]]
*Sent:* Thursday, May 22, 2014 2:33 PM
*To:* [email protected]
*Subject:* Re: Sequence file as an output



I wrote my own processors, as in WordCount in v0.4.

Initially I based my code on WordCount from TEZ 0.5. However, I use HDP 2.1,
where TEZ 0.4 is installed, and some methods used by the 0.5 WordCount are
missing in TEZ 0.4. So I decided to base it on WordCount from version 0.4.
It worked fine until the output format problem.

Nevertheless, I made a workaround just to check the performance of TEZ with
sessions. I generated the SequenceFile input for each iteration with the
MapReduce version of the algorithm. Then I used this input for the TEZ
version of the algorithm (I saved the TEZ output in another place). The
results are very promising. On a small dataset (~1GB) TEZ is 3 times
faster. On a ~40GB dataset TEZ is 30% faster.

I don't have time now to work on the problem with SequenceFile as an
output. I would rather rewrite the code according to best practices. I
think an update from TEZ 0.4 to 0.5 will also be required.


Kind regards

Wojciech Indyk



2014-05-21 19:31 GMT+02:00 Bikas Saha <[email protected]>:

You are right. In fact, it’s a very interesting use case.



Are you using MapProcessor and ReduceProcessor? Or have you written your
own processor and are just using Tez inputs/outputs?



If you look at the latest WordCount.java code in the tez code base, then
you can see the current best practice for using the API. For these best
practices on using the Tez API, you should look at compiling against the
current master that tracks the next 0.5 release. If you are building tez
locally then it’s the master branch. Otherwise maven artifacts (for
dependency on 0.5.0-incubating-SNAPSHOT) are at
https://repository.apache.org/content/groups/snapshots/org/apache/tez





Let us know if this helps!

Bikas



*From:* Wojciech Indyk [mailto:[email protected]]
*Sent:* Wednesday, May 21, 2014 1:58 AM
*To:* [email protected]
*Subject:* Re: Sequence file as an output



When I remove MRHelpers.doJobClientMagic, a NullPointerException occurs in
the Configuration class.



Could you recommend a base class (class and branch/release) that shows good
practice in TEZ for MapReduce jobs? I've rewritten my MR job to use
Counters (not available in MapReduce on TEZ) and Sessions (to improve
iterative processing speed). I have just a Map and a Reduce phase, and they
run in a loop (several iterations), so I think using a session can improve
performance. Am I right?


Kind regards

Wojciech Indyk



2014-05-21 0:33 GMT+02:00 Siddharth Seth <[email protected]>:

It's possible that the old Output Format is being used (mapred vs
mapreduce).

Could you try forcing this to use the new API with the following.

    finalVertex.setBoolean("mapred.mapper.new-api", true);

Also, if you happen to be using MRHelpers.doJobClientMagic - remove that,
since that could reset this parameter.
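
To make the mapred vs mapreduce point concrete, here is a simplified model of how an output could pick its format class depending on the new-API flag. It uses a plain Map in place of a Hadoop Configuration so that it runs on its own; the class OutputFormatSelection and the method resolveOutputFormat are hypothetical and are not Tez or Hadoop code. The three key strings are the usual Hadoop property names, and the assumption that the old-API key is consulted whenever the flag is unset is just the hypothesis from this message, not something confirmed for Tez 0.4.

```java
import java.util.HashMap;
import java.util.Map;

public class OutputFormatSelection {
    // Hadoop property names (the second one is the value of MRJobConfig.OUTPUT_FORMAT_CLASS_ATTR).
    static final String NEW_API_FLAG = "mapred.mapper.new-api";
    static final String NEW_API_OUTPUT_FORMAT = "mapreduce.job.outputformat.class";
    static final String OLD_API_OUTPUT_FORMAT = "mapred.output.format.class";

    /**
     * Hypothetical, simplified model: when the new-API flag is not set,
     * only the old-API key is read, and it falls back to a text format.
     */
    public static String resolveOutputFormat(Map<String, String> conf) {
        boolean useNewApi = Boolean.parseBoolean(conf.getOrDefault(NEW_API_FLAG, "false"));
        if (useNewApi) {
            return conf.getOrDefault(NEW_API_OUTPUT_FORMAT,
                    "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat");
        }
        return conf.getOrDefault(OLD_API_OUTPUT_FORMAT,
                "org.apache.hadoop.mapred.TextOutputFormat");
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        // Only the new-API key is set, as in the original question.
        conf.put(NEW_API_OUTPUT_FORMAT,
                "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat");
        System.out.println(resolveOutputFormat(conf));
        // prints org.apache.hadoop.mapred.TextOutputFormat

        // With the flag forced on, the new-API key is honoured.
        conf.put(NEW_API_FLAG, "true");
        System.out.println(resolveOutputFormat(conf));
        // prints org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
    }
}
```

Under this model, setting only the new-API output format key has no visible effect until the flag is also set, which would match the TextOutputFormat result described in the question.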



This is a little messed up, but we're working on making this much easier to
use in 0.5.



Thanks

- Sid





On Tue, May 20, 2014 at 3:19 PM, Wojciech Indyk <[email protected]>
wrote:

Hi all!

I use tez-0.4 on HDP 2.1. I tried to save the results of a DAG as a
SequenceFile.

I use:

    finalVertex.set(MRJobConfig.OUTPUT_FORMAT_CLASS_ATTR,
        SequenceFileOutputFormat.class.getName());

The problem is that the output is saved as TextOutputFormat. I use a
SequenceFile as the input to the DAG and it works fine (I use
SequenceFileInputFormat).


Kind regards

Wojciech Indyk






CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to
which it is addressed and may contain information that is confidential,
privileged and exempt from disclosure under applicable law. If the reader
of this message is not the intended recipient, you are hereby notified that
any printing, copying, dissemination, distribution, disclosure or
forwarding of this communication is strictly prohibited. If you have
received this communication in error, please contact the sender immediately
and delete it from your system. Thank You.

