Re: How do I synchronize Hadoop jobs?

2012-02-15 Thread Bharath Mundlapudi
For complex workflows, Oozie (or Azkaban) is indeed the answer. -Bharath On Wed, Feb 15, 2012 at 1:29 PM, Bharath Mundlapudi wrote: > Or you could use job chaining in MR. > http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining > > -Bharath > > > On Wed, Feb 15, 2

Re: How do I synchronize Hadoop jobs?

2012-02-15 Thread Bharath Mundlapudi
Or you could use job chaining in MR. http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining -Bharath On Wed, Feb 15, 2012 at 11:26 AM, John Armstrong wrote: > Actually, I think this is what Oozie is for. It seems to leap out as a > great example of a forked workflow. > > hth > > > >
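A minimal driver-side sketch of what chaining two MR jobs can look like with the new MapReduce API: waitForCompletion blocks, so the second job only starts after the first finishes. The identity Mapper/Reducer and the paths are stand-ins, not from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobsDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job first = new Job(conf, "phase-1");
    first.setJarByClass(ChainedJobsDriver.class);
    first.setMapperClass(Mapper.class);     // identity map, swap in your own
    first.setReducerClass(Reducer.class);   // identity reduce, swap in your own
    first.setOutputKeyClass(LongWritable.class);
    first.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(first, new Path(args[0]));
    FileOutputFormat.setOutputPath(first, new Path(args[1]));
    if (!first.waitForCompletion(true)) {   // synchronize: wait for job 1
      System.exit(1);
    }

    Job second = new Job(conf, "phase-2");
    second.setJarByClass(ChainedJobsDriver.class);
    second.setMapperClass(Mapper.class);
    second.setReducerClass(Reducer.class);
    second.setOutputKeyClass(LongWritable.class);
    second.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(second, new Path(args[1])); // job 1 output feeds job 2
    FileOutputFormat.setOutputPath(second, new Path(args[2]));
    System.exit(second.waitForCompletion(true) ? 0 : 1);
  }
}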

Re: 10 nodes how to build the topological graph

2012-02-03 Thread Bharath Mundlapudi
With 10 machines, if you want, you can just build 1 rack. What you might lose is Hadoop's rack awareness, meaning all your data will be in one rack, so losing that rack means losing the cluster. Network topology depends on the following factors: 1. Your replication factor requirement for HDFS blocks 2. How mu

Re: Hadoop doesnt use Replication Level of Namenode

2011-09-12 Thread Bharath Mundlapudi
If you don't set any replication factor, the HDFS client will use the default setting unless you edit the client confs. Three ways you can play around with replication: 1. Java API  2. Client Confs 3. Cluster Confs (Namenode conf) -Bharath From: Ralf Heyde To: co
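A minimal sketch of the three knobs above (the file path is illustrative, not from this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExamples {
  public static void main(String[] args) throws Exception {
    // 2. Client conf: overrides the cluster default for files this client creates.
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 2);

    FileSystem fs = FileSystem.get(conf);

    // 1. Java API: change the replication factor of an already existing file.
    fs.setReplication(new Path("/user/ralf/data.txt"), (short) 2);

    // 3. Cluster conf: dfs.replication in the namenode's hdfs-site.xml is what
    //    clients fall back to when neither of the above is set.
    fs.close();
  }
}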

Re: Question about RAID controllers and hadoop

2011-08-11 Thread Bharath Mundlapudi
True, you need a P410 controller. You can create a RAID0 volume per disk so the controller effectively presents the disks as JBOD. -Bharath From: Koert Kuipers To: common-user@hadoop.apache.org Sent: Thursday, August 11, 2011 2:50 PM Subject: Question about RAID controllers and hadoop Hello all, We

Re: large data and hbase

2011-07-11 Thread Bharath Mundlapudi
Another option to look at is Pig or Hive. These run on top of MapReduce. -Bharath From: Rita To: "" Sent: Monday, July 11, 2011 4:31 AM Subject: large data and hbase I have a dataset which is several terabytes in size. I would like to query this data using hbase (sql

Re: Cluster Tuning

2011-07-08 Thread Bharath Mundlapudi
Slow start is an important parameter and definitely impacts job runtime. My experience has been that setting this parameter too low or too high can hurt job latencies. If you are running the same job repeatedly then it's easy to set the right value, but if your cluster is m
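The knob being discussed here is presumably mapred.reduce.slowstart.completed.maps, the fraction of maps that must complete before reducers are scheduled. A hedged per-job sketch (the 0.80 value is only an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SlowStartSetting {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    // Launch reducers only after 80% of the maps have finished; the 0.20-era
    // default of 0.05 schedules reducers almost immediately.
    conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);
    return new Job(conf, "tuned-job");
  }
}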

Re: check namenode, jobtracker, datanodes and tasktracker status

2011-07-08 Thread Bharath Mundlapudi
Shouldn't be a problem. But making sure this monitoring client disconnects its connection might be helpful at peak loads. -Bharath From: Marc Sturlese To: hadoop-u...@lucene.apache.org Sent: Friday, July 8, 2011 10:49 AM Subject: check namenode, jobtr

Re: measure the time taken to complete map and reduce phase

2011-07-04 Thread Bharath Mundlapudi
Short answer: Job History Logs (JHL). Long answer: JHL stores all the meta information about executed jobs, such as job start/finish, task start/finish, and task-attempt start/finish times. Job.Map_Phase = Function ( Job.Map(first).StartTime, Job.Map(last).FinishTime) Job.Reduce_Phase = Function
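A literal reading of that Map_Phase formula as a small sketch; the per-task start/finish times would come from parsing the job history files, and the input shape here is an assumption:

import java.util.List;

public class PhaseTimes {
  // Each entry is {startTimeMillis, finishTimeMillis} for one successful task
  // of the phase (all maps, or all reduces).
  static long phaseDuration(List<long[]> tasks) {
    long firstStart = Long.MAX_VALUE;
    long lastFinish = Long.MIN_VALUE;
    for (long[] t : tasks) {
      firstStart = Math.min(firstStart, t[0]);
      lastFinish = Math.max(lastFinish, t[1]);
    }
    return lastFinish - firstStart; // wall-clock duration of the phase
  }
}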

Re: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-30 Thread Bharath Mundlapudi
a new cluster... You should plan on 10gbe. Sent from a remote device. Please excuse any typos... Mike Segel On Jun 29, 2011, at 1:07 AM, Bharath Mundlapudi wrote: > One could argue that its too early for 10Gb NIC in Hadoop Cluster. Certainly > having extra bandwidth is good but at wha

Re: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-28 Thread Bharath Mundlapudi
One could argue that it's too early for 10Gb NICs in a Hadoop cluster. Certainly having extra bandwidth is good, but at what price? Please note that all the points you mentioned can work with 1Gb NICs today, unless you can back them up with price/performance data. Typically, map output is compressed. If

Re: Hadoop Summit - Poster 49

2011-06-28 Thread Bharath Mundlapudi
arath From: Mark Kerzner To: common-user@hadoop.apache.org; Bharath Mundlapudi Sent: Tuesday, June 28, 2011 12:56 PM Subject: Re: Hadoop Summit - Poster 49 Ah, I just came from Santa Clara! Will there be sessions online? Thank you, Mark On Tue, Jun 28, 2011 at 2:43 PM, Bh

Re: Comparing two logs, finding missing records

2011-06-26 Thread Bharath Mundlapudi
@hadoop.apache.org; Bharath Mundlapudi Sent: Sunday, June 26, 2011 5:50 PM Subject: Re: Comparing two logs, finding missing records Bharath, how would a Pig query look like? Thank you, Mark On Sun, Jun 26, 2011 at 5:12 PM, Bharath Mundlapudi wrote: If you have Serde or PigLoader for your

Re: Reading HDFS files via Spring

2011-06-26 Thread Bharath Mundlapudi
Hi Mark, Yes, this should be possible. Solution: modify the FileItemSimilarity class to take an HDFS file path in your Spring application and use the HDFS file APIs to read the similarity data file from HDFS. In the URI, though, you must specify the complete path to your HDFS file, like hdfs:/
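A sketch of reading a file through the HDFS FileSystem API with a full hdfs:// URI (namenode host, port, and path below are illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileReader {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://namenode.example.com:8020/user/mark/similarity.dat";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = fs.open(new Path(uri));
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line); // hand these records to FileItemSimilarity instead
    }
    reader.close();
    fs.close();
  }
}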

Re: Comparing two logs, finding missing records

2011-06-26 Thread Bharath Mundlapudi
If you have a SerDe or PigLoader for your log format, Pig or Hive with a join will probably be the quicker solution. -Bharath From: Mark Kerzner To: Hadoop Discussion Group Sent: Saturday, June 25, 2011 9:39 PM Subject: Comparing two logs, finding missing recor

Re: Queue support from HDFS

2011-06-26 Thread Bharath Mundlapudi
One solution I am thinking of: say you have Kafka or some JMS implementation that your job client is subscribed to, and the client submits jobs dynamically based on the queue input size. You may need to run a final job to combine them all. -Bharath From: Saumitra To: commo
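A rough sketch of that idea: a long-running client drains a queue (the Kafka/JMS consumer is stubbed out as a BlockingQueue of HDFS paths) and submits a job once enough input has accumulated. The queue shape, threshold, and output path are assumptions, not from this thread:

import java.util.concurrent.BlockingQueue;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class QueueDrivenSubmitter {
  // Each queue message names an HDFS input path that has just arrived.
  public static void submitWhenReady(BlockingQueue<String> queue, int threshold)
      throws Exception {
    Job job = new Job(new Configuration(), "queue-driven-batch");
    job.setJarByClass(QueueDrivenSubmitter.class);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/queue-batch-out"));
    int accumulated = 0;
    while (accumulated < threshold) {
      String inputPath = queue.take(); // blocks until a message arrives
      FileInputFormat.addInputPath(job, new Path(inputPath));
      accumulated++;
    }
    job.submit(); // non-blocking; a later job can combine the batch outputs
  }
}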

Re: Split control in Lzo index

2011-06-24 Thread Bharath Mundlapudi
>> BTW, when using the second format, I found that bzip2 has better compression >> rate than Lzo (2.1 M).  Did I made any mistake when using Lzo compression? It depends on your requirements, for example whether you prefer a high compression rate over performance. bzip2 is orders of magnitude slower than Lz

Re: TestDFSIO delivers bad values of "throughput" and "average IO rate"

2011-05-22 Thread Bharath Mundlapudi
What is the fileSize you are specifying for TestDFSIO? I think if you haven't specified this value, the default is 1MB. This might cause a client bottleneck. -Bharath From: sendy.seftiandy To: core-u...@hadoop.apache.org Sent: Saturday, May 21, 2011 11:35 PM

Re: How to see block information on NameNode ?

2011-05-21 Thread Bharath Mundlapudi
Another way is executing this cmd: hadoop fsck -files -blocks -locations -Bharath From: Harsh J To: common-user@hadoop.apache.org Sent: Saturday, May 21, 2011 6:45 AM Subject: Re: How to see block information on NameNode ? One way: Opening the file in the web

Re: Poor performance on word count test on a 10 node cluster.

2011-05-16 Thread Bharath Mundlapudi
There are lots of other things you are not considering in this equation. For example, each block of reducer output is written 3 times, since replication is 3. What is the network setup? Is it a 1Gbps link? Check whether there are any network issues as well, and check for straggler nodes. This can

Re: Datanode doesn't start but there is no exception in the log

2011-05-12 Thread Bharath Mundlapudi
Are those all the messages in the datanode log? Do you see any SHUTDOWN message as well? -Bharath From: Panayotis Antonopoulos To: common-user@hadoop.apache.org Sent: Thursday, May 12, 2011 6:07 PM Subject: Datanode doesn't start but there is no exception in the lo

Re: can a `hadoop -jar streaming.jar` command return when a job is packaged and submitted?

2011-05-06 Thread Bharath Mundlapudi
) Bharath Mundlapudi wrote: > how about this? > > for i in $(seq 1 200); do > exec_stream_job.sh $dir $i & > > > exec_stream_job.sh > > > regenerate-files $dir $i > hadoop > jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-

Re: can a `hadoop -jar streaming.jar` command return when a job is packaged and submitted?

2011-05-06 Thread Bharath Mundlapudi
how about this?

for i in $(seq 1 200); do
exec_stream_job.sh $dir $i &

exec_stream_job.sh

regenerate-files $dir $i
hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
        -D mapred.job.name="$i" \
        -file $dir \
        -mapper "..." -

Re: hi

2011-05-02 Thread Bharath Mundlapudi
The key message is in the first line. 11/05/02 23:59:47 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /input/i1 could only be replicated to 0 nodes, instead of 1. Zero datanodes were found to perform this operation. Check the DN log

Re: Fixing a bad HD

2011-04-25 Thread Bharath Mundlapudi
Right, if you have hardware which supports hot-swappable disks, this might be the easiest option. But you will still need to restart the datanode for it to detect the new disk. There is an open Jira on this. -Bharath - Original Message - From: "Jones, Nick" To: "common-user@hadoop.apache.org" C

Re: Load testing in hadoop

2011-03-15 Thread Bharath Mundlapudi
Gridmix v3 and TeraSort are good benchmarks for testing the overall Hadoop cluster, but the synthetic load generator is specifically for the Namenode. You can refer to the documentation below for how to run this tool. http://hadoop.apache.org/common/docs/r0.20.2/SLG_user_guide.html Also, there are various other

Re: load a serialized object in hadoop

2010-10-13 Thread Bharath Mundlapudi
If you are running a 32-bit JVM then, depending on the OS, you can't go beyond ~3GB. If you are on Windows, a 2GB heap is your best bet for a 32-bit JVM. Try this: edit conf/hadoop-env.sh to add export HADOOP_CLIENT_OPTS="-Xmx3G ${HADOOP_CLIENT_OPTS}" and then run your hadoop commands. -Bharath F

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2010-09-27 Thread Bharath Mundlapudi
A couple of things you can try: 1. Increase the heap size for the tasks. 2. Since your OOM is happening randomly, try setting -XX:+HeapDumpOnOutOfMemoryError in your child JVM parameters. At least you can then detect why your heap is growing - is it due to a leak? Or do you need to increase the heap size
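A sketch of wiring both suggestions into the per-task JVM options; the property name is the 0.20-era mapred.child.java.opts, and the heap size and dump directory are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ChildHeapSettings {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    // 1. Bigger per-task heap, and 2. dump the heap when a task dies with
    // OutOfMemoryError so you can later inspect whether it was a leak or
    // simply an undersized heap.
    conf.set("mapred.child.java.opts",
        "-Xmx1024m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/task-heap-dumps");
    return new Job(conf, "oom-investigation");
  }
}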