Slides/Videos of Hadoop Summit

2009-06-22 Thread jaideep dhok
Hi all, Are the slides or videos of the talks given at Hadoop Summit available online? I checked the Yahoo! website for the summit but could not find any links. Regards, -- Jaideep

FYI, Large-scale graph computing at Google

2009-06-22 Thread Edward J. Yoon
http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html -- It sounds like Pregel is a computing framework based on dynamic programming for graph operations. I guess they may have removed the file communication/intermediate files during iterations. Anyway, what

Re: Multiple NIC Cards

2009-06-22 Thread JQ Hadoop
The address of the JobTracker (NameNode) is specified using mapred.job.tracker (fs.default.name) in the configuration. When the JobTracker (NameNode) starts, it will listen on the address specified by mapred.job.tracker (fs.default.name); and when a TaskTracker (DataNode) starts, it
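
For reference, a minimal sketch of the hadoop-site.xml entries being discussed; the hostnames and ports below are illustrative placeholders, and which NIC a daemon binds to follows from what the configured hostname resolves to:

    <!-- hadoop-site.xml (sketch; hostnames/ports are placeholders) -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode.internal:9000</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>jobtracker.internal:9001</value>
    </property>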

Can we submit a mapreduce job from another mapreduce job?

2009-06-22 Thread Ramakishore Yelamanchilli
Is there any way we can submit a mapreduce job from another map job? The requirement is: I have customers with start date and end date as follows: CustomerID Start Date End Date XXX mm/dd/yy mm/dd/yy YYY mm/dd/yy mm/dd/yy ZZZ
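
A task is an ordinary JVM with the Hadoop client libraries on its classpath, so a child job can be submitted from inside a map task with the usual client API. A rough sketch against the 0.18-era mapred API; class and path names are hypothetical, and note that the child job occupies cluster slots while the parent task waits, which can deadlock a small cluster:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ChildJobLauncher {
      // Launch one job per customer date range (names/paths hypothetical).
      public static void runForCustomer(String customerId) throws IOException {
        JobConf child = new JobConf(ChildJobLauncher.class);
        child.setJobName("per-customer-" + customerId);
        FileInputFormat.setInputPaths(child, new Path("/data/" + customerId));
        FileOutputFormat.setOutputPath(child, new Path("/out/" + customerId));
        JobClient.runJob(child); // blocks until the child job completes
      }
    }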

Re: Too many open files error, which gets resolved after some time

2009-06-22 Thread Raghu Angadi
Is this before 0.20.0? Assuming you have closed these streams, it is mostly https://issues.apache.org/jira/browse/HADOOP-4346 It is the JDK internal implementation that depends on GC to free up its cache of selectors. HADOOP-4346 avoids this by using hadoop's own cache. Raghu. Stas Oskin
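
Before blaming HADOOP-4346, it is worth ruling out unclosed streams. A minimal sketch of the close-in-finally pattern for DFS reads, with the path purely illustrative:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DfsReadExample {
      public static void read(Configuration conf) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path("/some/file")); // path illustrative
        try {
          // ... read from 'in' ...
        } finally {
          in.close(); // frees the socket/selector promptly instead of waiting for GC
        }
      }
    }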

Measuring runtime of Map-reduce Jobs

2009-06-22 Thread bharath vissapragada
Hi, are there any tools which can measure the run-time of map-reduce jobs? Any help is appreciated. Thanks in advance

Re: Too many open files error, which gets resolved after some time

2009-06-22 Thread Stas Oskin
Hi. I've started doing just that, and indeed the number of fds held by the DataNode process has dropped significantly. My problem is that my own app, which works with DFS, still has dozens of pipes and epolls open. The usual level seems to be about 300-400 fds, but when I access the DFS for

Re: Name Node HA (HADOOP-4539)

2009-06-22 Thread Steve Loughran
Andrew Wharton wrote: https://issues.apache.org/jira/browse/HADOOP-4539 I am curious about the state of this fix. It is listed as Incompatible, but is resolved and committed (according to the comments). Is the backup name node going to make it into 0.21? Will it remove the SPOF for HDFS? And if

Re: Too many open files error, which gets resolved after some time

2009-06-22 Thread Steve Loughran
jason hadoop wrote: Yes. Otherwise the file descriptors will flow away like water. I also strongly suggest having at least 64k file descriptors as the open file limit. On Sun, Jun 21, 2009 at 12:43 PM, Stas Oskin stas.os...@gmail.com wrote: Hi. Thanks for the advice. So you advise explicitly

Re: Too many open files error, which gets resolved after some time

2009-06-22 Thread Steve Loughran
Scott Carey wrote: Furthermore, if for some reason it is required to dispose of any objects after others are GC'd, weak references and a weak reference queue will perform significantly better in throughput and latency - orders of magnitude better - than finalizers. Good point. I would
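
A minimal sketch of the pattern Scott describes - a weak reference plus a reference queue, with the OS-level handle tracked out-of-band, since the referent is already gone by the time its reference is enqueued. The owner/fd types here are hypothetical stand-ins:

    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class HandleReaper {
      private final ReferenceQueue<Object> queue = new ReferenceQueue<Object>();
      private final Map<Reference<?>, Integer> handles =
          new ConcurrentHashMap<Reference<?>, Integer>();

      // Associate an owner object with the fd that must be released after it dies.
      public void track(Object owner, int fd) {
        handles.put(new WeakReference<Object>(owner, queue), fd);
      }

      // Run in a dedicated thread: blocks until an owner is collected, then cleans up.
      public void reapLoop() throws InterruptedException {
        while (true) {
          Reference<?> ref = queue.remove();
          Integer fd = handles.remove(ref);
          if (fd != null) {
            closeFd(fd);
          }
        }
      }

      private void closeFd(int fd) { /* release the OS handle (hypothetical) */ }
    }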

Re: Too many open files error, which gets resolved after some time

2009-06-22 Thread Steve Loughran
Raghu Angadi wrote: Is this before 0.20.0? Assuming you have closed these streams, it is mostly https://issues.apache.org/jira/browse/HADOOP-4346 It is the JDK internal implementation that depends on GC to free up its cache of selectors. HADOOP-4346 avoids this by using hadoop's own cache.

Specifying which systems to be used as DataNode

2009-06-22 Thread Santosh Bs1
Hi, I am very new to Hadoop and have a few basic questions: how and where do I specify which systems in a given cluster are to be used as DataNodes? Can I change this set dynamically?

java.io.IOException: Error opening job jar

2009-06-22 Thread Shravan Mahankali
Hi Group, I was having trouble getting through an example Hadoop program. I have searched the mailing list but could not find anything useful. Below is the issue: 1) Executed the below command to submit a job to Hadoop: /hadoop-0.18.3/bin/hadoop jar -libjars AggregateWordCount.jar

Re: java.io.IOException: Error opening job jar

2009-06-22 Thread Harish Mallipeddi
It cannot find your job jar file. Make sure you run this command from the directory that has the AggregateWordCount.jar (and you can lose the -libjars flag too - you need that only if you need to specify extra jar dependencies apart from your job jar file). - Harish On Mon, Jun 22, 2009 at 3:45
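
For reference, the usual shape of the invocation when the job jar sits in the current directory (the input/output arguments here are placeholders):

    bin/hadoop jar AggregateWordCount.jar org.apache.hadoop.examples.AggregateWordCount input output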

Re: problem about put a lot of files

2009-06-22 Thread stchu
Hi, Thanks for your quick responses. I tried relaxing this limit to 204800, but it still does not work. Could this be caused by fs objects? Anyway, thanks a lot! 2009/6/22 zhuweimin xim-...@tsm.kddilabs.jp Hi, the max open files setting has a limit on a Linux box. Please use ulimit to view and

RE: java.io.IOException: Error opening job jar

2009-06-22 Thread Shravan Mahankali
Thanks for your reply Harish. I am running this example from within the directory containing the AggregateWordCount.jar file, but even then I have this issue. Earlier I had an issue of java.lang.ClassNotFoundException: org.apache.hadoop.examples.AggregateWordCount$WordCountPlugInClass, so in some

RE: java.io.IOException: Error opening job jar

2009-06-22 Thread Ramakishore Yelamanchilli
Can you attach the jar file you have? -Ram -Original Message- From: Shravan Mahankali [mailto:shravan.mahank...@catalytic.com] Sent: Monday, June 22, 2009 3:49 AM To: 'Harish Mallipeddi'; core-user@hadoop.apache.org Subject: RE: java.io.IOException: Error opening job jar Thanks for

Re: Too many open files error, which gets resolved after some time

2009-06-22 Thread Stas Oskin
Hi. So what would be the recommended approach for the pre-0.20.x series? To ensure each file is used by only one thread, so that it is then safe to close the handle in that thread? Regards. 2009/6/22 Steve Loughran ste...@apache.org Raghu Angadi wrote: Is this before 0.20.0? Assuming you have closed

RE: Name Node HA (HADOOP-4539)

2009-06-22 Thread Brian.Levine
If the BackupNode doesn't promise HA, then how would additional testing on this feature aid in the HA story? Maybe you could expand on the purpose of HADOOP-4539 because now I'm confused. How does the approaching 0.21 cutoff translate into a release date for 0.21? -brian -Original

Re: Too many open files error, which gets resolved after some time

2009-06-22 Thread Raghu Angadi
64k might help in the sense that you might hit GC before you hit the limit. Otherwise, your only options are to use the patch attached to HADOOP-4346 or run System.gc() occasionally. I think it should be committed to 0.18.4 Raghu. Stas Oskin wrote: Hi. Yes, it happens with 0.18.3. I'm
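
A minimal sketch of the System.gc() stopgap Raghu mentions, run from a daemon timer so the JDK releases its cached selectors (the interval is arbitrary; the HADOOP-4346 patch is the real fix):

    import java.util.Timer;
    import java.util.TimerTask;

    public class PeriodicGc {
      public static Timer start(long intervalMillis) {
        Timer t = new Timer("periodic-gc", true); // daemon thread
        t.schedule(new TimerTask() {
          public void run() {
            System.gc(); // nudges the JDK to free cached selectors (and their fds)
          }
        }, intervalMillis, intervalMillis);
        return t;
      }
    }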

HDFS out of space

2009-06-22 Thread Kris Jirapinyo
Hi all, How does one handle a mount running out of space for HDFS? We have two disks mounted on /mnt and /mnt2 respectively on one of the machines that are used for HDFS, and /mnt is at 99% while /mnt2 is at 30%. Is there a way to tell the machine to balance itself out? I know for the

RE: java.io.IOException: Error opening job jar

2009-06-22 Thread Ramakishore Yelamanchilli
There's no file attached Shravan. Regards Ram -Original Message- From: Shravan Mahankali [mailto:shravan.mahank...@catalytic.com] Sent: Monday, June 22, 2009 4:43 AM To: core-user@hadoop.apache.org; 'Harish Mallipeddi' Subject: RE: java.io.IOException: Error opening job jar Hi

Interfaces/Implementations and Key/Values for M/R

2009-06-22 Thread Grant Ingersoll
Hi, Over at Mahout (http://lucene.apache.org/mahout) we have a Vector interface with two implementations DenseVector and SparseVector. When it comes to writing Mapper/Reducer, we have been able to just use Vector, but when it comes to actually binding real data via a Configuration, we
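
One common way to bind a concrete implementation through the job configuration is to record the class under a key and instantiate it reflectively in each task. A sketch, with the property key invented and stub types standing in for Mahout's Vector hierarchy:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ReflectionUtils;

    // Stubs so the sketch is self-contained; Mahout's real classes would be used.
    interface Vector {}
    class DenseVector implements Vector {}
    class SparseVector implements Vector {}

    public class VectorBinding {
      private static final String KEY = "example.vector.impl"; // invented key

      // Submitting side: record the concrete class in the job configuration.
      public static void setImpl(Configuration conf, Class<? extends Vector> impl) {
        conf.setClass(KEY, impl, Vector.class);
      }

      // Task side: instantiate whatever the job was configured with.
      public static Vector newVector(Configuration conf) {
        Class<? extends Vector> impl =
            conf.getClass(KEY, DenseVector.class, Vector.class);
        return ReflectionUtils.newInstance(impl, conf);
      }
    }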

Re: multiple file input

2009-06-22 Thread Erik Paulson
On Thu, Jun 18, 2009 at 01:36:14PM -0700, Owen O'Malley wrote: On Jun 18, 2009, at 10:56 AM, pmg wrote: Each line from FileA gets compared with every line from FileB1, FileB2 etc. etc. FileB1, FileB2 etc. are in a different input directory In the general case, I'd define an InputFormat
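
Along the same lines, newer releases (0.19+) also ship MultipleInputs, which gives each input directory its own InputFormat and Mapper; a sketch with hypothetical mapper classes that tag records by source:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.MultipleInputs;

    public class TwoDirJob {
      // Tag each record with its source so the reducer can tell them apart.
      public static class FileAMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text v, OutputCollector<Text, Text> out,
            Reporter r) throws IOException {
          out.collect(new Text("A"), v);
        }
      }

      public static class FileBMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text v, OutputCollector<Text, Text> out,
            Reporter r) throws IOException {
          out.collect(new Text("B"), v);
        }
      }

      public static void wire(JobConf conf) {
        MultipleInputs.addInputPath(conf, new Path("/input/dirA"),
            TextInputFormat.class, FileAMapper.class);
        MultipleInputs.addInputPath(conf, new Path("/input/dirB"),
            TextInputFormat.class, FileBMapper.class);
      }
    }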

Re: Too many open files error, which gets resolved after some time

2009-06-22 Thread Stas Oskin
Hi Raghu. A question - this issue does not influence Hadoop itself (DataNodes, etc...), but rather any application using DFS, correct? If so, without patching, I should either increase the fd limit (which might fill up as well?) or periodically launch the GC? Regards. 2009/6/22

Re: Making sure the tmp directory is cleaned?

2009-06-22 Thread Pankil Doshi
Yes, if your job completes successfully; it most likely removes them after completion of both map and reduce tasks. Pankil On Mon, Jun 22, 2009 at 3:15 PM, Qin Gao q...@cs.cmu.edu wrote: Hi All, Do you know if the tmp directory on every map/reduce task will be deleted automatically after the

Re: Making sure the tmp directory is cleaned?

2009-06-22 Thread Qin Gao
Thanks! But what if the job gets killed or fails? Does hadoop try to clean it up? We are considering bad situations - if a job gets killed, will the tmp dirs sit on local disks forever and eat up all the disk space? I guess this should be handled for the distributed cache, but those files are

Re: Slides/Videos of Hadoop Summit

2009-06-22 Thread Alex Loddengaard
The Cloudera talks are here: http://www.cloudera.com/blog/2009/06/22/a-great-week-for-hadoop-summit-west-roundup/ As for the rest, I'm not sure. Alex On Sun, Jun 21, 2009 at 11:46 PM, jaideep dhok jdd...@gmail.com wrote: Hi all, Are the slides or videos of the talks given at Hadoop Summit

Re: Measuring runtime of Map-reduce Jobs

2009-06-22 Thread Alex Loddengaard
What specific information are you interested in? The job history logs show all sorts of great information (look in the history subdirectory of the JobTracker node's log directory). Alex On Mon, Jun 22, 2009 at 1:23 AM, bharath vissapragada bhara...@students.iiit.ac.in wrote: Hi, are
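
Besides the history logs, simple wall-clock timing from the submitting client is often enough, since runJob blocks until completion; a sketch against the 0.18-era API:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class TimedRun {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TimedRun.class);
        // ... set input/output paths, mapper, reducer here ...
        long start = System.currentTimeMillis();
        JobClient.runJob(conf); // blocks until the job finishes
        System.out.println("Job took " + (System.currentTimeMillis() - start) + " ms");
      }
    }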

Re: Problem in viewing WEB UI

2009-06-22 Thread Pankil Doshi
I am not sure, but sometimes you might see from the command prompt that datanodes are working, yet when you actually look at the logs you find some kind of error. Check the datanode logs. Pankil On Wed, Jun 17, 2009 at 1:42 AM, ashish pareek pareek...@gmail.com wrote: Hi, When I run

Re: Disk Usage Overhead of Hadoop Upgrade

2009-06-22 Thread Pankil Doshi
Hi Stu, which block conversion are you talking about? If you are talking about the block size of the data, it remains the same across an upgrade unless and until you change it. Pankil On Tue, Jun 16, 2009 at 5:16 PM, Stu Hood stuart.h...@rackspace.com wrote: Hey gang, We're preparing to upgrade our cluster

Re: HDFS out of space

2009-06-22 Thread Alex Loddengaard
Are you seeing any exceptions because of the disk being at 99% capacity? Hadoop should do something sane here and write new data to the disk with more capacity. That said, it is ideal to be balanced. As far as I know, there is no way to balance an individual DataNode's hard drives (Hadoop does

Re: HDFS out of space

2009-06-22 Thread Pankil Doshi
Hey Alex, Will Hadoop balancer utility work in this case? Pankil On Mon, Jun 22, 2009 at 4:30 PM, Alex Loddengaard a...@cloudera.com wrote: Are you seeing any exceptions because of the disk being at 99% capacity? Hadoop should do something sane here and write new data to the disk with more

Re: Making sure the tmp directory is cleaned?

2009-06-22 Thread Pankil Doshi
No. If your job gets killed or fails, temp won't be cleaned up, and in that case you will have to carefully clean it on your own. If you don't clean it up yourself, it will eat up your disk space. Pankil On Mon, Jun 22, 2009 at 4:24 PM, Qin Gao q...@cs.cmu.edu wrote: Thanks! But what if the jobs
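
When cleanup does fall to you, a recursive delete against the relevant filesystem is the usual tool; a sketch with a purely hypothetical scratch path:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ScratchCleanup {
      public static void clean(Configuration conf) throws IOException {
        FileSystem local = FileSystem.getLocal(conf);
        Path scratch = new Path("/tmp/myjob-scratch"); // hypothetical path
        if (local.exists(scratch)) {
          local.delete(scratch, true); // true = recursive
        }
      }
    }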

Re: HDFS out of space

2009-06-22 Thread Matt Massie
Pankil- I'd be interested to know the size of the /mnt and /mnt2 partitions. Are they the same? Can you run the following and report the output... % df -h /mnt /mnt2 Thanks. -Matt On Jun 22, 2009, at 1:32 PM, Pankil Doshi wrote: Hey Alex, Will Hadoop balancer utility work in this

Re: HDFS out of space

2009-06-22 Thread Allen Wittenauer
On 6/22/09 10:12 AM, Kris Jirapinyo kjirapi...@biz360.com wrote: Hi all, How does one handle a mount running out of space for HDFS? We have two disks mounted on /mnt and /mnt2 respectively on one of the machines that are used for HDFS, and /mnt is at 99% while /mnt2 is at 30%. Is

Re: Too many open files error, which gets resolved after some time

2009-06-22 Thread Steve Loughran
Stas Oskin wrote: Hi. So what would be the recommended approach for the pre-0.20.x series? To ensure each file is used by only one thread, so that it is then safe to close the handle in that thread? Regards. good question - I'm not sure. For anything you get with FileSystem.get(), it's now dangerous to

Re: Disk Usage Overhead of Hadoop Upgrade

2009-06-22 Thread Raghu Angadi
The initial overhead is fairly small (an extra hard link for each file). After that, the overhead grows as you delete files (and thus their blocks) that existed before the upgrade, since the physical files for blocks are deleted only after you finalize. So the overhead == (the blocks that got

Re: HDFS out of space

2009-06-22 Thread Usman Waheed
I have used the balancer to balance the data in the cluster with the -threshold option. The bandwidth transfer was set to 1MB/sec (I think that's the default setting) in one of the config files, and it had to move 500GB of data around. It did take some time, but eventually the data got spread
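
For reference, the balancer described here is normally started with bin/start-balancer.sh -threshold N (N being the allowed percentage deviation from the cluster-wide mean), and the 1MB/sec figure matches the default of the HDFS property that caps transfer bandwidth; raising it in hadoop-site.xml speeds up rebalancing at the cost of regular traffic:

    <!-- bytes per second per datanode; the default is 1048576 (1 MB/s) -->
    <property>
      <name>dfs.balance.bandwidthPerSec</name>
      <value>10485760</value>
    </property>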

Re: Making sure the tmp directory is cleaned?

2009-06-22 Thread Allen Wittenauer
On 6/22/09 12:15 PM, Qin Gao q...@cs.cmu.edu wrote: Do you know if the tmp directory on every map/reduce task will be deleted automatically after the map task finishes or will do I have to delete them? I mean the tmp directory that automatically created by on current directory. Past

Re: Making sure the tmp directory is cleaned?

2009-06-22 Thread Qin Gao
Thanks, then I will try to keep a log of the files and clean them out. --Q On Mon, Jun 22, 2009 at 4:34 PM, Pankil Doshi forpan...@gmail.com wrote: No. If your job gets killed or fails, temp won't be cleaned up, and in that case you will have to carefully clean it on your own. If you don't

Re: HDFS out of space

2009-06-22 Thread Pankil Doshi
Matt, Kris can give that info. I am one of the users from the mailing list. Pankil On Mon, Jun 22, 2009 at 4:37 PM, Matt Massie m...@cloudera.com wrote: Pankil- I'd be interested to know the size of the /mnt and /mnt2 partitions. Are they the same? Can you run the following and report the

Re: Too many open files error, which gets resolved after some time

2009-06-22 Thread Stas Oskin
Ok, it seems this issue is already patched in the Hadoop distro I'm using (Cloudera). Any idea if I should still call GC manually/periodically to clean out all the stale pipes/epolls? 2009/6/22 Steve Loughran ste...@apache.org Stas Oskin wrote: Hi. So what would be the recommended approach

Re: HDFS out of space

2009-06-22 Thread Kris Jirapinyo
It's a typical Amazon EC2 Large instance, so 414G each. -- Kris. On Mon, Jun 22, 2009 at 1:37 PM, Matt Massie m...@cloudera.com wrote: Pankil- I'd be interested to know the size of the /mnt and /mnt2 partitions. Are they the same? Can you run the following and report the output... % df

Re: Hadoop Vaidya tool

2009-06-22 Thread Vitthal Gogate
Hello Pratik, -joblog should also be a specific job history file path, not a directory. Usually, I copy the job conf xml file and the job history log file to a local file system and then use the file:// protocol (although hdfs:// should also work) e.g., Sh

THIS WEEK: PNW Hadoop / Apache Cloud Stack Users' Meeting, Wed Jun 24th, Seattle

2009-06-22 Thread Bradford Stephens
Hey all, just a friendly reminder that this is Wednesday! I hope to see everyone there again. Please let me know if there's something interesting you'd like to talk about -- I'll help however I can. You don't even need a Powerpoint presentation -- there are many whiteboards. I'll try to have a video

Strange Exception

2009-06-22 Thread akhil1988
Hi All! I have been running Hadoop jobs through my user account on a cluster for a while now, but now I am getting this strange exception when I try to execute a job. If anyone knows why this is happening, please let me know. [akhil1...@altocumulus WordCount]$ hadoop jar

Determining input record directory using Streaming...

2009-06-22 Thread C G
Hi All: Is there any way using Hadoop Streaming to determine the directory from which an input record is being read? This is straightforward in Hadoop using InputFormats, but I am curious if the same concept can be applied to streaming. The goal here is to read in data from 2 directories, say

Re: Strange Exception

2009-06-22 Thread jason hadoop
The directory specified by the configuration parameter mapred.system.dir, defaulting to /tmp/hadoop/mapred/system, doesn't exist. Most likely your tmp cleaner task has removed it, and I am guessing it is only created at cluster start time. On Mon, Jun 22, 2009 at 6:19 PM, akhil1988

Re: When is configure and close run

2009-06-22 Thread jason hadoop
configure and close are run for each task, mapper and reducer. The configure and close are NOT run on the combiner class. On Mon, Jun 22, 2009 at 9:23 AM, Saptarshi Guha saptarshi.g...@gmail.comwrote: Hello, In a mapreduce job, a given map JVM will run N map tasks. Are the configure and close
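
A minimal sketch of where those hooks live in the old mapred API; MapReduceBase supplies no-op defaults for both, so you override only what you need:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LifecycleMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void configure(JobConf job) {
        // per-task setup: runs once in this task before any map() call
      }
      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        out.collect(new Text("k"), value); // placeholder logic
      }
      public void close() throws IOException {
        // per-task teardown: runs once after the last map() call
      }
    }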

Re: Determining input record directory using Streaming...

2009-06-22 Thread jason hadoop
Check the process environment for your streaming tasks, generally the configuration variables are exported into the process environment. The Mapper input file is normally stored as some variant of mapred.input.file. The reducer's input is the mapper output for that reduce, so the input file is

RE: java.io.IOException: Error opening job jar

2009-06-22 Thread Shravan Mahankali
Hi Ramakishore, I am unable to attach files to the mailing list! I hope Harish received the attached docs at his gmail a/c. Please find them attached here. Any help would be appreciated. Thank You, Shravan Kumar. M Catalytic Software Ltd. [SEI-CMMI Level 5 Company] - This email

Re: Slides/Videos of Hadoop Summit

2009-06-22 Thread jaideep dhok
Thanks for the link. - JD On Tue, Jun 23, 2009 at 1:55 AM, Alex Loddengaarda...@cloudera.com wrote: The Cloudera talks are here: http://www.cloudera.com/blog/2009/06/22/a-great-week-for-hadoop-summit-west-roundup/ As for the rest, I'm not sure. Alex On Sun, Jun 21, 2009 at 11:46 PM,