Re: Setting up a Hadoop cluster where nodes are spread over the Internet

2008-08-08 Thread Lucas Nazário dos Santos
Hello again, In fact I can get the cluster up and running with two nodes in different LANs. The problem appears when executing a job. As you can see in the piece of log below, the datanode tries to communicate with the namenode using the IP 10.1.1.5. The issue is that the datanode should be

RE: Join example

2008-08-08 Thread Wei Wu
There are some examples in $HADOOP_HOME/src/contrib/data_join, which I hope would help. Wei -Original Message- From: John DeTreville [mailto:[EMAIL PROTECTED] Sent: Friday, August 08, 2008 2:34 AM To: core-user@hadoop.apache.org Subject: Join example Hadoop ships with a few example

Hadoop + Servlet Problems

2008-08-08 Thread Kylie McCormick
Hi! I've gotten Hadoop to run a search as I want, but now I'm trying to add a servlet component to it. All of Hadoop works properly, but when I set components from the servlet instead of setting them via the command-line, Hadoop only produces temporary output files and doesn't complete. I've

access jobconf in streaming job

2008-08-08 Thread Rong-en Fan
I'm using streaming with a mapper written in Perl. However, an issue is that I want to pass some arguments via the command line. In a regular Java mapper, I can access the JobConf in the Mapper. Is there a way to do this? Thanks, Rong-En Fan

Re: java.io.IOException: Could not get block locations. Aborting...

2008-08-08 Thread Steve Loughran
Piotr Kozikowski wrote: Hi there: We would like to know what are the most likely causes of this sort of error: Exception closing file /data1/hdfs/tmp/person_url_pipe_59984_3405334/_temporary/_task_200807311534_0055_m_22_0/part-00022 java.io.IOException: Could not get block locations.

Re: access jobconf in streaming job

2008-08-08 Thread Rong-en Fan
After looking into the streaming source, the answer is via environment variables. For example, mapred.task.timeout is in the mapred_task_timeout environment variable. On Fri, Aug 8, 2008 at 4:26 PM, Rong-en Fan [EMAIL PROTECTED] wrote: I'm using streaming with a mapper written in Perl. However, an

Re: java.io.IOException: Could not get block locations. Aborting...

2008-08-08 Thread Alexander Aristov
I came across the same issue, also with Hadoop 0.17.1. It would be interesting if someone could say what the cause of the issue is. Alex 2008/8/8 Steve Loughran [EMAIL PROTECTED] Piotr Kozikowski wrote: Hi there: We would like to know what are the most likely causes of this sort of error: Exception

what is the correct usage of hdfs metrics

2008-08-08 Thread Ivan Georgiev
Hi, I have been unable to find any examples of how to use the MBeans provided by HDFS. Could anyone with experience on the topic share some info? What is the URL to use to connect to the MBeanServer? Is it done through RMI, or only within the JVM? Any help is highly appreciated.

Hadoop Pipes Job submission and JobId

2008-08-08 Thread Leon Mergen
Hello, I was wondering what the correct way to submit a job to Hadoop using the Pipes API is -- currently, I invoke a command similar to this: /usr/local/hadoop/bin/hadoop pipes -conf /usr/local/mapreduce/reports/reports.xml -input /store/requests/archive/*/*/* -output out However, this way of
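
[Editorial note] A minimal sketch of submitting a Pipes job from Java and reading back the job id, assuming the 0.17-era org.apache.hadoop.mapred.pipes.Submitter API; the paths here simply mirror the command above and are illustrative, not from the thread:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;
    import org.apache.hadoop.mapred.pipes.Submitter;

    public class PipesSubmit {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        // Load the job settings that the -conf flag would normally supply
        // (illustrative path, matching the command line above).
        conf.addResource(new Path("/usr/local/mapreduce/reports/reports.xml"));
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // Submit through the Pipes Submitter; the returned RunningJob
        // handle exposes the job id.
        RunningJob job = Submitter.runJob(conf);
        System.out.println("Job id: " + job.getJobID());
      }
    }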

Re: Setting up a Hadoop cluster where nodes are spread over the Internet

2008-08-08 Thread Lukáš Vlček
Hi, I am not an expert on Hadoop configuration, but is this safe? As far as I understand, the IP address is public and the connection to the datanode port is not secured. Am I correct? Lukas On Fri, Aug 8, 2008 at 8:35 AM, Lucas Nazário dos Santos [EMAIL PROTECTED] wrote: Hello again, In fact I

Re: Setting up a Hadoop cluster where nodes are spread over the Internet

2008-08-08 Thread Lucas Nazário dos Santos
You are completely right. It's not safe at all. But this is what I have for now: two computers distributed across the Internet. I would really appreciate it if anyone could give me a hint on how to configure the namenode's IP in a datanode. As far as I can tell from the log files, the datanode keeps trying
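
[Editorial note] For reference, the namenode address a datanode connects to normally comes from fs.default.name in its hadoop-site.xml. A sketch follows; the host name and port are placeholders, and whether this alone resolves the 10.1.1.5 problem described above is an assumption, not something confirmed in the thread:

    <!-- hadoop-site.xml on the datanode; host and port are placeholders -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://namenode.example.org:9000</value>
      </property>
    </configuration>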

Re: Distributed Lucene - from hadoop contrib

2008-08-08 Thread Ning Li
1) Katta and Distributed Lucene are different projects though, right? Both being based on roughly the same paradigm (a distributed index)? The design of Katta and that of Distributed Lucene were quite different last time I checked. I pointed out the Katta project because you can find the code for

Re: fuse-dfs

2008-08-08 Thread Pete Wyckoff
Hi Sebastian. Setting of times doesn't work, but ls, rm, rmdir, mkdir, cp, etc etc should work. Things that are not currently supported include: touch, chown, chmod, permissions in general and obviously random writes for which you would get an IO error. This is what I get on 0.17 for df -h:

Re: extracting input to a task from a (streaming) job?

2008-08-08 Thread Yuri Pradkin
On Thursday 07 August 2008 16:43:10 John Heidemann wrote: On Thu, 07 Aug 2008 19:42:05 +0200, Leon Mergen wrote: Hello John, On Thu, Aug 7, 2008 at 6:30 PM, John Heidemann [EMAIL PROTECTED] wrote: I have a large Hadoop streaming job that generally works fine, but a few (2-4) of the ~3000

How to set System property for my job

2008-08-08 Thread Tarandeep Singh
Hi, While submitting a job to Hadoop, how can I set system properties that are required by my code? Passing -Dmy.prop=myvalue to the hadoop job command is not going to work, as the hadoop command will pass this to my program as a command-line argument. Is there any way to achieve this? Thanks, Taran
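
[Editorial note] One common workaround (a sketch, not necessarily what was suggested in the thread) is to stash the value in the JobConf at submission time and read it back in the Mapper's configure(), instead of relying on a JVM system property; the property name my.prop is taken from the question, everything else is illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private String myProp;

      // At submission time: conf.set("my.prop", "myvalue");
      // Each task then reads the value from the job configuration here,
      // rather than from a -D system property.
      public void configure(JobConf job) {
        myProp = job.get("my.prop", "default");
      }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        out.collect(new Text(myProp), value);
      }
    }

If a real JVM system property is required, another option (again, an assumption rather than advice from the thread) is to append -Dmy.prop=myvalue to mapred.child.java.opts so the task JVMs are launched with it.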

Re: access jobconf in streaming job

2008-08-08 Thread Andreas Kostyrka
On Friday 08 August 2008 11:43:50 Rong-en Fan wrote: After looking into streaming source, the answer is via environment variables. For example, mapred.task.timeout is in the mapred_task_timeout environment variable. Well, another typical way to deal with that is to pass the parameters via

How to enable compression of blockfiles?

2008-08-08 Thread Michael K. Tung
Hello, I have a simple question: how do I configure DFS to store compressed block files? I've noticed by looking at the blk_ files that the text documents I am storing are uncompressed. Currently our Hadoop deployment is taking up 10x the disk space compared to our system before moving to

performance not great, or did I miss something?

2008-08-08 Thread James Graham (Greywolf)
Greetings, I'm very, very new to this (as you could probably tell from my other postings). I have 20 nodes available as a cluster, less one as the namenode and one as the jobtracker (unless I can use them too). Specs are: 226GB of available disk space on each; 4 processors (2 x dual-core)

Re: Setting up a Hadoop cluster where nodes are spread over the Internet

2008-08-08 Thread Andreas Kostyrka
On Friday 08 August 2008 15:43:46 Lucas Nazário dos Santos wrote: You are completely right. It's not safe at all. But this is what I have for now: two computers distributed across the Internet. I would really appreciate it if anyone could give me a hint on how to configure the namenode's IP in a

Re: Join example

2008-08-08 Thread Chris Douglas
The contrib/data_join framework is different from the map-side join framework, under o.a.h.mapred.join. To see what the example is doing in an outer join, generate a few sample text input files, tab-separated: join/a.txt: a0 a1 a2
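
[Editorial note] A rough sketch of wiring up the map-side join framework over two such tab-separated inputs; join/a.txt is taken from the example above, join/b.txt is a hypothetical second source, and the property and class names reflect my understanding of the o.a.h.mapred.join API rather than anything stated in the thread:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    public class OuterJoinSetup {
      public static JobConf configure(JobConf conf) {
        conf.setInputFormat(CompositeInputFormat.class);
        // Outer join of the two tab-separated sources.
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
            "outer", KeyValueTextInputFormat.class,
            new Path("join/a.txt"), new Path("join/b.txt")));
        // Each map call then receives a Text key plus a TupleWritable holding
        // one value per source (with empty positions where a source lacks the key).
        return conf;
      }
    }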

RE: Join example

2008-08-08 Thread John DeTreville
Thanks very much, Chris! Cheers, John -Original Message- From: Chris Douglas [mailto:[EMAIL PROTECTED] Sent: Friday, August 08, 2008 1:57 PM To: core-user@hadoop.apache.org Subject: Re: Join example The contrib/data_join framework is different from the map-side join framework, under

Re: what is the correct usage of hdfs metrics

2008-08-08 Thread lohit
I have tried to connect to it via jconsole. Apart from that, I have seen people on this list use Ganglia to collect metrics, or just dump them to a file. To start off you could easily use FileContext (dumping metrics to a file). Check out the metrics config file (hadoop-metrics.properties) under conf
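
[Editorial note] A sketch of the FileContext variant mentioned above, in conf/hadoop-metrics.properties; the file paths and the 10-second period are arbitrary choices, not values from the thread:

    # conf/hadoop-metrics.properties -- dump dfs and jvm metrics to local files
    dfs.class=org.apache.hadoop.metrics.file.FileContext
    dfs.period=10
    dfs.fileName=/tmp/dfs_metrics.log

    jvm.class=org.apache.hadoop.metrics.file.FileContext
    jvm.period=10
    jvm.fileName=/tmp/jvm_metrics.log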

Re: namenode jobtracker: joint or separate, which is better?

2008-08-08 Thread lohit
It depends on your machine configuration, how many resources it has, and what you can afford to lose in case of failures. It would be good to run the NameNode and jobtracker on their own dedicated nodes, and datanodes and tasktrackers on the rest of the nodes. We have seen cases where tasktrackers take

Re: How to enable compression of blockfiles?

2008-08-08 Thread lohit
I think at present only SequenceFiles can be compressed. http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.html If you have plain text files, they are stored as-is into blocks. You can store them as .gz and Hadoop recognizes them and processes the gz files. But it's not
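
[Editorial note] For illustration, a minimal sketch of writing text records into a block-compressed SequenceFile; the output path and record contents are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class CompressedWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/docs.seq");  // hypothetical output path
        // BLOCK compression compresses batches of keys and values together,
        // which usually suits plain text documents well.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, Text.class, Text.class,
            SequenceFile.CompressionType.BLOCK);
        try {
          writer.append(new Text("doc-1"), new Text("some document body"));
        } finally {
          writer.close();
        }
      }
    }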

Re: namenode jobtracker: joint or separate, which is better?

2008-08-08 Thread James Graham (Greywolf)
Thus spake lohit: It depends on your machine configuration, how many resources it has, and what you can afford to lose in case of failures. It would be good to run the NameNode and jobtracker on their own dedicated nodes, and datanodes and tasktrackers on the rest of the nodes. We have seen cases where

RE: Join example

2008-08-08 Thread John DeTreville
When I try the map-side join example (under Hadoop 0.17.1, running in standalone mode under Win32), it attempts to dereference a null pointer. $ cat One/some.txt A 1 B 1 C 1 E 1 $ cat Two/some.txt A 2 B 2 C 2 D 2 $ bin/hadoop jar *examples.jar join

Re: performance not great, or did I miss something?

2008-08-08 Thread Allen Wittenauer
On 8/8/08 1:25 PM, James Graham (Greywolf) [EMAIL PROTECTED] wrote: 226GB of available disk space on each one; 4 processors (2 x dualcore) 8GB of RAM each. Some simple stuff: (Assuming SATA): Are you using AHCI? Do you have the write cache enabled? Is the topologyProgram providing proper

Re: java.io.IOException: Could not get block locations. Aborting...

2008-08-08 Thread Dhruba Borthakur
It is possible that your namenode is overloaded and is not able to respond to RPC requests from clients. Please check the namenode logs to see if you see lines of the form "discarding calls". Dhruba On Fri, Aug 8, 2008 at 3:41 AM, Alexander Aristov [EMAIL PROTECTED] wrote: I came across the

Re: java.io.IOException: Could not get block locations. Aborting...

2008-08-08 Thread Piotr Kozikowski
Thank you for the reply. Apparently whatever it was is now gone after a hadoop restart, but I'll keep that in mind should it happen again. Piotr On Fri, 2008-08-08 at 17:31 -0700, Dhruba Borthakur wrote: It is possible that your namenode is overloaded and is not able to respond to RPC requests

Re: Setting up a Hadoop cluster where nodes are spread over the Internet

2008-08-08 Thread Lucas Nazário dos Santos
Thanks Andreas. I'll try it. On Fri, Aug 8, 2008 at 5:47 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote: On Friday 08 August 2008 15:43:46 Lucas Nazário dos Santos wrote: You are completely right. It's not safe at all. But this is what I have for now: two computers distributed across the