Is it possible? I want to group data blocks.

2009-06-23 Thread Hyunsik Choi
Hi all, I would like to achieve data locality. In other words, I want to place certain data blocks on one machine. In some problems, subsets of an entire dataset need one another to compute an answer; most graph problems are good examples. Is this possible? If not, can you advise me on this?

Re: Too many open files error, which gets resolved after some time

2009-06-23 Thread Stas Oskin
Hi. Thanks for the advice; just to clarify: the upgrade you speak of, which cleans the pipes/epolls more often, is it regarding the issue discussed (HADOOP-4346, fixed in my distribution), or is it some other issue? If the latter, does it have a ticket I can see, or should one be filed in Jira? Thanks!

UnknownHostException

2009-06-23 Thread bharath vissapragada
When I try to execute the command bin/start-dfs.sh, I get the following error. I have checked the hadoop-site.xml file on all the nodes, and they are fine. Can someone help me out? 10.2.24.21: Exception in thread main java.net.UnknownHostException: unknown host: 10.2.24.21. 10.2.24.21:

Re: UnknownHostException

2009-06-23 Thread Matt Massie
fs.default.name in your hadoop-site.xml needs to be set to a fully-qualified domain name (instead of an IP address). -Matt On Jun 23, 2009, at 6:42 AM, bharath vissapragada wrote: When I try to execute the command bin/start-dfs.sh, I get the following error. I have checked the
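
As a rough illustration of Matt's suggestion (not part of his reply): a small Java check that the value configured for fs.default.name is a hostname that actually resolves. The Configuration and InetAddress calls are standard Hadoop/JDK APIs; the hostname namenode.example.com is only a placeholder.

    import java.net.InetAddress;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;

    public class CheckFsDefaultName {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // loads hadoop-site.xml from the classpath
        String fsName = conf.get("fs.default.name",
            "hdfs://namenode.example.com:54310");          // placeholder default
        String host = URI.create(fsName).getHost();
        InetAddress addr = InetAddress.getByName(host);    // throws UnknownHostException if it cannot resolve
        System.out.println(host + " resolves to " + addr.getHostAddress());
      }
    }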

Re: UnknownHostException

2009-06-23 Thread Raghu Angadi
This is at the RPC client level, and there is a requirement for a fully qualified hostname. Maybe a "." at the end of 10.2.24.21 is causing the problem? BTW, in 0.21 even fs.default.name does not need to be a fully qualified name; anything that resolves to an IP address is fine (at least for common/FS and

EC2, Max tasks, under utilized?

2009-06-23 Thread Saptarshi Guha
Hello, I'm running a 90-node c1.xlarge cluster. No reducers, mapred.max.map.tasks=6 per machine. The AMI is my own and uses Hadoop 0.19.1. The dataset has 145K keys, and the processing time is huge. Now, when I set mapred.map.tasks=14,000, what ends up running is 49 map tasks across the machines.

Re: EC2, Max tasks, under utilized?

2009-06-23 Thread Saptarshi Guha
Hello, I should also point out that I'm using a SequenceFileInputFormat. Regards, Saptarshi Guha On Tue, Jun 23, 2009 at 10:43 AM, Saptarshi Guha saptarshi.g...@gmail.com wrote: Hello, I'm running a 90-node c1.xlarge cluster. No reducers, mapred.max.map.tasks=6 per machine. The AMI is my own

RE: UnknownHostException

2009-06-23 Thread zjffdu
I encountered this problem before. If you can ping the machine using its name but cannot ping it using its IP address, then what you have to do is add the mapping into /etc/hosts. -Original Message- From: bharathvissapragada1...@gmail.com [mailto:bharathvissapragada1...@gmail.com]
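
A purely illustrative Java check of the name/IP mapping zjffdu refers to; the hostname "master" and the address come from later in this thread, and the /etc/hosts line in the comment is only an example.

    import java.net.InetAddress;

    public class CheckHostMapping {
      public static void main(String[] args) throws Exception {
        // Example /etc/hosts entry:  10.2.24.21   master
        InetAddress byName = InetAddress.getByName("master");      // forward lookup via /etc/hosts or DNS
        InetAddress byAddr = InetAddress.getByName("10.2.24.21");  // accepts the literal address
        System.out.println("master -> " + byName.getHostAddress());
        System.out.println("10.2.24.21 -> " + byAddr.getCanonicalHostName()); // reverse lookup
      }
    }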

Re: UnknownHostException

2009-06-23 Thread Raghu Angadi
Raghu Angadi wrote: This is at the RPC client level and there is a requirement for a fully qualified ... I meant to say there is NO requirement for a fully qualified hostname. Maybe a "." at the end of 10.2.24.21 is causing the problem? BTW, in 0.21 even fs.default.name does not need to be fully qualified; that fix is

Re: Too many open files error, which gets resolved after some time

2009-06-23 Thread Raghu Angadi
Stas Oskin wrote: Hi. Any idea if calling System.gc() periodically will help reduce the number of pipes/epolls? Since you have HADOOP-4346, you should not have excessive epoll/pipe fds open. First of all, do you still have the problem? If yes, how many Hadoop streams do you have at a

Re: Is it possible? I want to group data blocks.

2009-06-23 Thread Alex Loddengaard
Hi Hyunsik, Unfortunately you can't control which servers blocks go on. Hadoop does block allocation for you, and it tries its best to distribute data evenly across the cluster, while making sure replicated blocks reside on different machines and on different racks (assuming you've made Hadoop

Re: UnknownHostException

2009-06-23 Thread bharath vissapragada
It worked fine when I updated the /etc/hosts file (on all the slaves) and wrote the fully qualified domain name in hadoop-site.xml. It worked fine for some time, then started giving a new error: 09/06/23 22:21:49 INFO ipc.Client: Retrying connect to server: master/10.2.24.21:54310. Already tried 0

Re: UnknownHostException

2009-06-23 Thread bharath vissapragada
The namenode is stopping automatically!! On Tue, Jun 23, 2009 at 10:29 PM, bharath vissapragada bharathvissapragada1...@gmail.com wrote: It worked fine when I updated the /etc/hosts file (on all the slaves) and wrote the fully qualified domain name in hadoop-site.xml. It worked fine for some time

Re: EC2, Max tasks, under utilized?

2009-06-23 Thread Hong Tang
Do you use block compression in the sequence files? How large is your total dataset? On Jun 23, 2009, at 7:50 AM, Saptarshi Guha wrote: Hello, I should also point out that I'm using a SequenceFileInputFormat. Regards, Saptarshi Guha On Tue, Jun 23, 2009 at 10:43 AM, Saptarshi Guha
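
Background, not stated in the thread: with FileInputFormat-based formats such as SequenceFileInputFormat, the number of map tasks is driven by the input splits the InputFormat computes, and mapred.map.tasks is only a hint. Splits never span files, so the split count depends on how many input files there are, how large they are, and the block size, which is why the total dataset size and compression matter here.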

Re: Determining input record directory using Streaming...

2009-06-23 Thread Bo Shi
Jason, do you know offhand when this feature was introduced? 0.18.x? Thanks, Bo On Mon, Jun 22, 2009 at 10:58 PM, jason hadoop jason.had...@gmail.com wrote: Check the process environment for your streaming tasks; generally the configuration variables are exported into the process environment.

Doing MapReduce over Har files

2009-06-23 Thread Roshan James
When I run a map reduce task over a har file as the input, I see that the input splits refer to 64 MB byte boundaries inside the part file. My mappers only know how to process the contents of each logical file inside the har file. Is there some way I can take the offset range specified by

Re: Too many open files error, which gets resolved after some time

2009-06-23 Thread Raghu Angadi
To be more accurate, once you have HADOOP-4346, fds for epoll and pipes = 3 * threads blocked on Hadoop I/O. Unless you have hundreds of threads at a time, you should not see hundreds of these. These fds stay up to 10 sec even after the threads exit. I am a bit confused about your exact

Re: THIS WEEK: PNW Hadoop / Apache Cloud Stack Users' Meeting, Wed Jun 24th, Seattle

2009-06-23 Thread Bradford Stephens
Greetings, I've gotten a few replies on this, but I'd really like to know who else is coming. Just send me a quick note :) Cheers, Bradford On Mon, Jun 22, 2009 at 5:40 PM, Bradford Stephens bradfordsteph...@gmail.com wrote: Hey all, just a friendly reminder that this is Wednesday! I hope to

Re: Too many open files error, which gets resolved after some time

2009-06-23 Thread Stas Oskin
Hi. In my tests, I typically opened between 20 and 40 concurrent streams. Regards. 2009/6/23 Raghu Angadi rang...@yahoo-inc.com Stas Oskin wrote: Hi. Any idea if calling System.gc() periodically will help reduce the number of pipes/epolls? Since you have HADOOP-4346, you should

Re: Too many open files error, which gets resolved after some time

2009-06-23 Thread Raghu Angadi
How many threads do you have? The number of active threads is very important. Normally, #fds = (3 * #threads_blocked_on_io) + #streams. 12 per stream is certainly way off. Raghu. Stas Oskin wrote: Hi. In my case it was actually ~12 fds per stream, which included pipes and epolls. Could it
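
A quick worked example of Raghu's formula, with numbers chosen only for illustration: 5 threads blocked on Hadoop I/O and 20 open streams give #fds = 3 * 5 + 20 = 35, i.e. under 2 fds per stream once the per-thread cost is spread out, nowhere near 12 per stream.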

Re: Too many open files error, which gets resolved after some time

2009-06-23 Thread Stas Oskin
Hi. So if I open one stream, it should be 4? 2009/6/23 Raghu Angadi rang...@yahoo-inc.com How many threads do you have? The number of active threads is very important. Normally, #fds = (3 * #threads_blocked_on_io) + #streams. 12 per stream is certainly way off. Raghu. Stas Oskin wrote:

Accessing stderr with Hadoop Streaming

2009-06-23 Thread S D
Is there a way to access stderr when using Hadoop Streaming? I see how stdout is written to the log files, but I'm more concerned about what happens when errors occur. Access to stderr would help debug when a run doesn't complete successfully, but I haven't been able to figure out how to retrieve
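
For context, not from this thread: a streaming task's stderr is captured in the task's log directory on the tasktracker (under logs/userlogs/<task-attempt-id>/stderr) and can be browsed from the task details page in the JobTracker web UI. A minimal illustrative mapper follows, written in Java only for the sake of an example (any executable behaves the same way); the counter group and name are made up.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Streaming "mapper": copies stdin to stdout and writes diagnostics to stderr.
    public class StderrDemoMapper {
      public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        long n = 0;
        while ((line = in.readLine()) != null) {
          System.out.println(line);                  // normal map output
          if (++n % 10000 == 0) {
            // Plain stderr lines end up in the task's stderr log.
            System.err.println("processed " + n + " records");
            // Streaming also interprets lines of this form as counter updates.
            System.err.println("reporter:counter:Demo,Records,10000");
          }
        }
      }
    }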

Re: Accessing stderr with Hadoop Streaming

2009-06-23 Thread Mayuran Yogarajah
S D wrote: Is there a way to access stderr when using Hadoop Streaming? I see how stdout is written to the log files, but I'm more concerned about what happens when errors occur. Access to stderr would help debug when a run doesn't complete successfully, but I haven't been able to figure out how

Does balancer ensure a file's replication is satisfied?

2009-06-23 Thread Stuart White
In my Hadoop cluster, I've had several drives fail lately (and they've been replaced). Each time a new empty drive is placed in the cluster, I run the balancer. I understand that the balancer will redistribute the load of file blocks across the nodes. My question is: will the balancer also look at

Can you tell if a particular mapper was data local ?

2009-06-23 Thread Suratna Budalakoti
Hi all, Is there any way to tell, from logs or by reading/setting a counter, whether a particular mapper was data-local, i.e., whether it ran on the same node as its input data? Thanks, Suratna

Re: Can you tell if a particular mapper was data local ?

2009-06-23 Thread Bradford Stephens
(Correct me if I'm wrong), but I think you can tell through the Hadoop Web UI -- it'll show a count of the map tasks that are data-local. You can then click on that to see a list of all the tasks there, and drill down to see which nodes those tasks ran on. On Tue, Jun 23, 2009 at 6:37 PM, Suratna
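
A rough sketch, not from the thread, of dumping the job-level counters that back the count Bradford mentions, using the old mapred API; on 0.19-era clusters the locality counter typically appears as "Data-local map tasks" in the "Job Counters" group, and per-task locality still has to be checked by drilling down in the web UI as he describes.

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class PrintJobCounters {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PrintJobCounters.class);
        // ... configure input, output, mapper, etc. here ...
        RunningJob job = JobClient.runJob(conf);   // blocks until the job finishes
        Counters counters = job.getCounters();
        for (Counters.Group group : counters) {
          for (Counters.Counter counter : group) {
            // Look for the "Data-local map tasks" counter in the output.
            System.out.println(group.getDisplayName() + " / "
                + counter.getDisplayName() + " = " + counter.getCounter());
          }
        }
      }
    }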

Re: THIS WEEK: PNW Hadoop / Apache Cloud Stack Users' Meeting, Wed Jun 24th, Seattle

2009-06-23 Thread Wynne, Adam S
I can't make it this time; I'm out of town. On 6/23/09 12:53 PM, Bradford Stephens bradfordsteph...@gmail.com wrote: Greetings, I've gotten a few replies on this, but I'd really like to know who else is coming. Just send me a quick note :) Cheers, Bradford On Mon, Jun 22, 2009 at

Re: Strange Exeception

2009-06-23 Thread akhil1988
Thanks, Jason! I gave your suggestion to my cluster administrator, and now it is working. The following was his reply to me: But /hadoop/tmp is not /scratch, and the only thing that I clean is /scratch. It looks like the disks in the job tracker machine died. I swapped the disks from another node

Re: Does balancer ensure a file's replication is satisfied?

2009-06-23 Thread jason hadoop
The namenode is constantly receiving reports about which datanode has which blocks, and it performs replication when a block becomes under-replicated. On Tue, Jun 23, 2009 at 6:18 PM, Stuart White stuart.whi...@gmail.com wrote: In my Hadoop cluster, I've had several drives fail lately (and they've
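
An aside, not from Jason's reply: running hadoop fsck / against the cluster reports under-replicated or missing blocks, which is a quick way to confirm that re-replication has caught up after a drive swap.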

Re: Determining input record directory using Streaming...

2009-06-23 Thread jason hadoop
I happened to have a copy of 0.18.1 lying about, and the JobConf is added to the per-process runtime environment in 0.18.1. The entire configuration from the JobConf object is added to the environment, with the jobconf key names being transformed slightly. Any character in the key name that is not
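
To illustrate the transformation Jason describes (in streaming, non-alphanumeric characters in a JobConf key are replaced with underscores, so map.input.file becomes map_input_file), a mapper can read per-task settings straight from its environment. The sketch below is written in Java purely as an example; any streaming executable sees the same variables.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Illustrative streaming mapper that reads job configuration from its environment.
    public class EnvAwareMapper {
      public static void main(String[] args) throws Exception {
        String inputFile = System.getenv("map_input_file");   // the file this map task is reading
        String inputDir  = System.getenv("mapred_input_dir"); // the job's input directory
        System.err.println("input file: " + inputFile + ", input dir: " + inputDir);
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
          // Tag each record with the directory it came from.
          System.out.println(inputDir + "\t" + line);
        }
      }
    }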