For complex workflows, Oozie (or Azkaban) is indeed the answer.
-Bharath
On Wed, Feb 15, 2012 at 1:29 PM, Bharath Mundlapudi wrote:
> Or you could use job chaining in MR.
> http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
>
> -Bharath
>
>
> On Wed, Feb 15, 2012 at 11:26 AM, John Armstrong wrote:
Or you could use job chaining in MR.
http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
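For example, here is a rough sketch of chaining two dependent jobs with the mapreduce JobControl API (the job names and configuration details are placeholders, not from this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Two placeholder jobs; set mappers/reducers/paths as usual.
        ControlledJob first = new ControlledJob(Job.getInstance(conf, "step-1"), null);
        ControlledJob second = new ControlledJob(Job.getInstance(conf, "step-2"), null);
        second.addDependingJob(first); // step-2 starts only after step-1 succeeds

        JobControl control = new JobControl("two-step-chain");
        control.addJob(first);
        control.addJob(second);

        // The controller submits jobs as their dependencies complete.
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        control.stop();
    }
}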
-Bharath
On Wed, Feb 15, 2012 at 11:26 AM, John Armstrong wrote:
> Actually, I think this is what Oozie is for. It seems to leap out as a
> great example of a forked workflow.
>
> hth
>
>
>
>
With 10 machines, if you want, you can just build 1 rack. What you might lose
is Hadoop's rack awareness. Meaning all your data will be in one rack, so
losing that rack means losing the cluster.
Network topology depends on the following factors:
1. Your replication factor requirement for HDFS blocks
2. How mu
If you don't set any replication factor, the HDFS client will use the default
setting unless you edit the client confs.
Three ways you can play around with replication (a quick sketch follows the list):
1. Java API
2. Client Confs
3. Cluster Confs (Namenode conf)
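For illustration, a minimal sketch of (1) and (2); the file path is a made-up example, and (3) would be the equivalent setting in the cluster-side configuration files:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // (2) Client conf: files created by this client will default to 2 replicas.
        conf.set("dfs.replication", "2");

        FileSystem fs = FileSystem.get(conf);

        // (1) Java API: change the replication factor of an existing file.
        fs.setReplication(new Path("/user/me/data.txt"), (short) 2);

        fs.close();
    }
}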
-Bharath
From: Ralf Heyde
To: co
True, you need a P410 controller. You can create a RAID0 volume for each disk to make it
behave like JBOD.
-Bharath
From: Koert Kuipers
To: common-user@hadoop.apache.org
Sent: Thursday, August 11, 2011 2:50 PM
Subject: Question about RAID controllers and hadoop
Hello all,
We
Another option to look at is Pig or Hive. These run on MapReduce.
-Bharath
From: Rita
To: ""
Sent: Monday, July 11, 2011 4:31 AM
Subject: large data and hbase
I have a dataset which is several terabytes in size. I would like to query
this data using hbase (sql
Slow start is an important parameter. It definitely impacts job runtime. My
experience in the past has been that setting this parameter too low or too high
can cause job latency issues. If you always run the same job then it's easy to
set the right value, but if your cluster is
m
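For reference, a sketch of setting this knob per job, assuming the parameter being discussed is the pre-YARN mapred.reduce.slowstart.completed.maps (the value and job name are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SlowStartExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Launch reducers only after 80% of maps have finished (the default is 0.05).
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);
        Job job = Job.getInstance(conf, "slowstart-tuned-job");
        // ... set mapper/reducer/paths and submit as usual.
    }
}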
Shouldn't be a problem. But making sure you disconnect this monitoring client's
connection might be helpful at peak loads.
-Bharath
From: Marc Sturlese
To: hadoop-u...@lucene.apache.org
Sent: Friday, July 8, 2011 10:49 AM
Subject: check namenode, jobtr
Short answer: Job History Logs (JHL).
Long answer: JHL stores all the meta information related to the jobs that were
executed, like Job start/finish, Task start/finish, TaskAttempt start/finish.
Job.Map_Phase = Function(Job.Map(first).StartTime, Job.Map(last).FinishTime)
Job.Reduce_Phase = Function(Job.Reduce(first).StartTime, Job.Reduce(last).FinishTime)
a new cluster... You should plan on 10GbE.
Sent from a remote device. Please excuse any typos...
Mike Segel
On Jun 29, 2011, at 1:07 AM, Bharath Mundlapudi wrote:
> One could argue that its too early for 10Gb NIC in Hadoop Cluster. Certainly
> having extra bandwidth is good but at wha
One could argue that it's too early for 10Gb NICs in a Hadoop cluster. Certainly
having extra bandwidth is good, but at what price?
Please note that all the points you mentioned can work with 1Gb NICs today,
unless you can back them with price/performance data. Typically, Map output is
compressed. If
-Bharath
From: Mark Kerzner
To: common-user@hadoop.apache.org; Bharath Mundlapudi
Sent: Tuesday, June 28, 2011 12:56 PM
Subject: Re: Hadoop Summit - Poster 49
Ah, I just came from Santa Clara! Will there be sessions online?
Thank you,
Mark
On Tue, Jun 28, 2011 at 2:43 PM, Bh
@hadoop.apache.org; Bharath Mundlapudi
Sent: Sunday, June 26, 2011 5:50 PM
Subject: Re: Comparing two logs, finding missing records
Bharath,
what would a Pig query look like?
Thank you,
Mark
On Sun, Jun 26, 2011 at 5:12 PM, Bharath Mundlapudi
wrote:
If you have Serde or PigLoader for your
Hi Mark,
Yes, this should be possible.
Solution:
Modify the FileItemSimilarity class in your Spring application to take an HDFS
file path, and use the HDFS file APIs to read the similarity data file from
HDFS. In the URI, though, you must specify the complete path to your HDFS file,
like hdfs:/
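As a rough sketch, reading the file through the HDFS API looks roughly like the following (the namenode host, port, and file location are placeholders; feed the lines into the similarity loader instead of printing them):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSimilarityFileReader {
    public static void main(String[] args) throws Exception {
        // Placeholder URI; use the complete hdfs:// path to your similarity file.
        String uri = "hdfs://namenode:8020/user/me/similarity.csv";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(new Path(uri))));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // hand each record to the similarity loader instead
            }
        } finally {
            reader.close();
            fs.close();
        }
    }
}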
If you have a SerDe or PigLoader for your log format, Pig or Hive with a join will
probably be the quicker solution.
-Bharath
From: Mark Kerzner
To: Hadoop Discussion Group
Sent: Saturday, June 25, 2011 9:39 PM
Subject: Comparing two logs, finding missing recor
One solution I am thinking of: let's say you have Kafka or some JMS
implementation that your job client is subscribed to, and it submits jobs
dynamically based on queue input size. You may need to run a final job to combine
them all.
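A very rough sketch of the idea (the queue here is a plain in-process BlockingQueue standing in for the Kafka/JMS subscription, and all names and paths are made up):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class QueueDrivenSubmitter {
    // Stand-in for the Kafka/JMS consumer; each message carries an input path to process.
    private final BlockingQueue<String> inputs = new LinkedBlockingQueue<String>();

    public void run() throws Exception {
        while (true) {
            String inputPath = inputs.take(); // block until new work arrives
            Job job = Job.getInstance(new Configuration(),
                    "dynamic-" + System.currentTimeMillis());
            FileInputFormat.addInputPath(job, new Path(inputPath));
            FileOutputFormat.setOutputPath(job, new Path(inputPath + ".out"));
            job.submit(); // fire-and-forget; a final job can combine all the outputs later
        }
    }
}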
-Bharath
From: Saumitra
To: commo
>> BTW, when using the second format, I found that bzip2 has better compression
>> rate than Lzo (2.1 M). Did I made any mistake when using Lzo compression?
It depends on your requirements, i.e. whether you prefer a high compression rate over
performance. bzip2 is orders of magnitude slower than Lz
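If it helps, a minimal sketch of picking the output codec per job (the job name is a placeholder, and LzoCodec would come from the separate hadoop-lzo package):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "bzip2-output");
        // bzip2: better ratio, much slower; swap in LzoCodec/SnappyCodec for speed.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
        // ... set input/output paths, mapper, reducer, then submit.
    }
}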
What is the fileSize you are specifying for TestDFSIO?
I think if you haven't specified this value, the default is 1MB. This might
cause a client bottleneck.
-Bharath
From: sendy.seftiandy
To: core-u...@hadoop.apache.org
Sent: Saturday, May 21, 2011 11:35 PM
Another way is executing this command:
hadoop fsck <path> -files -blocks -locations
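A third way, sketched below, is the FileSystem API itself (argument handling is simplified; pass the file path as the first argument):

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));

        // One BlockLocation per block, listing the datanodes that hold each replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(block.getOffset() + " " + block.getLength() + " "
                    + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}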
-Bharath
From: Harsh J
To: common-user@hadoop.apache.org
Sent: Saturday, May 21, 2011 6:45 AM
Subject: Re: How to see block information on NameNode ?
One way: Opening the file in the web
There are lots of other things you are not considering in this equation.
For example, each block of reducer output is written 3 times, since replication is 3.
What is the network setup? Is it a 1Gbps link? Check if there are any issues with
the network as well.
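As a back-of-the-envelope example (the 1 TB figure is made up for illustration): 1 TB of reducer output becomes ~3 TB of HDFS writes with replication 3, of which roughly 2 TB must cross the network, since only the first replica is local. At 1Gbps (~125 MB/s per link) that extra traffic alone is on the order of 16,000 link-seconds, spread across however many nodes and links you have.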
Check if there are any straggler nodes. This can
Are those all the messages in the datanode log? Do you also see a SHUTDOWN
message?
-Bharath
From: Panayotis Antonopoulos
To: common-user@hadoop.apache.org
Sent: Thursday, May 12, 2011 6:07 PM
Subject: Datanode doesn't start but there is no exception in the lo
)
Bharath Mundlapudi wrote:
> how about this?
>
> for i in $(seq 1 200); do
> exec_stream_job.sh $dir $i &
>
>
> exec_stream_job.sh
>
>
> regenerate-files $dir $i
> hadoop
> jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-
how about this?

# Driver: launch 200 streaming jobs in the background.
for i in $(seq 1 200); do
  exec_stream_job.sh $dir $i &
done

# exec_stream_job.sh (takes the directory and index as arguments):
dir=$1
i=$2
regenerate-files $dir $i
hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
  -D mapred.job.name="$i" \
  -file $dir \
  -mapper "..." -
The key message is in the first line.
11/05/02 23:59:47 WARN hdfs.DFSClient: DataStreamer Exception:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /input/i1
could only be replicated to 0 nodes, instead of 1
Zero datanodes were found to perform this operation. Check the DN log
Right, if you have hardware which supports hot-swappable disks, this might be the
easiest option. But you will still need to restart the datanode to detect the new
disk. There is an open Jira on this.
-Bharath
- Original Message -
From: "Jones, Nick"
To: "common-user@hadoop.apache.org"
C
Gridmix v3 and TeraSort are good benchmarks to test the overall Hadoop cluster. But
the synthetic load generator is specifically for the Namenode. You can refer to the
documentation below for how to run this tool.
http://hadoop.apache.org/common/docs/r0.20.2/SLG_user_guide.html
Also, there are various other
If you are running a 32-bit JVM then, depending on the OS, you can't go beyond ~3GB.
If you are on Windows, a 2GB heap is your best bet for a 32-bit JVM.
Try this:
Edit conf/hadoop-env.sh for
export HADOOP_CLIENT_OPTS="-Xmx3G ${HADOOP_CLIENT_OPTS}"
and now you run your hadoop commands.
-Bharath
F
A couple of things you can try:
1. Increase the heap size for the tasks.
2. Since your OOM is happening randomly, try setting
-XX:+HeapDumpOnOutOfMemoryError in your child JVM parameters. At least you can
detect why your heap is growing - is it due to a leak? Or do you need to increase
the heap size
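For (2), a sketch of wiring those flags into the task JVMs, assuming the pre-YARN mapred.child.java.opts property (the heap size, dump path, and job name are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class OomDebugJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Bigger task heap plus a heap dump on OOM so the failure can be analyzed offline.
        conf.set("mapred.child.java.opts",
                "-Xmx1024m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp");
        Job job = Job.getInstance(conf, "oom-debug-job");
        // ... configure and submit as usual.
    }
}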