Re: Strange Reduce Behavior

2009-04-02 Thread jason hadoop
1) When running in pseudo-distributed mode, only 2 values for the reduce count are accepted, 0 and 1. All other positive values are mapped to 1. 2) The single reduce task spawned has several steps, and each of these steps accounts for about 1/3 of its overall progress. The 1st third is

Re: Amazon Elastic MapReduce

2009-04-02 Thread zhang jianfeng
Does it support pig? On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel ch...@wensel.net wrote: FYI, Amazon's new Hadoop offering: http://aws.amazon.com/elasticmapreduce/ And Cascading 1.0 supports it: http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html cheers, ckw -- Chris

Re: Amazon Elastic MapReduce

2009-04-02 Thread Miles Osborne
... and only in the US. Miles 2009/4/2 zhang jianfeng zjf...@gmail.com: Does it support pig? On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel ch...@wensel.net wrote: FYI, Amazon's new Hadoop offering: http://aws.amazon.com/elasticmapreduce/ And Cascading 1.0 supports it:

Re: Amazon Elastic MapReduce

2009-04-02 Thread zhang jianfeng
Seems like I would have to pay additional money, so why not configure a Hadoop cluster on EC2 myself? That has already been automated with scripts. On Thu, Apr 2, 2009 at 4:09 PM, Miles Osborne mi...@inf.ed.ac.uk wrote: ... and only in the US Miles 2009/4/2 zhang jianfeng

Announcing Amazon Elastic MapReduce

2009-04-02 Thread Sirota, Peter
Dear Hadoop community, We are excited today to introduce the public beta of Amazon Elastic MapReduce, a web service that enables developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop (0.18.3) running on the web-scale infrastructure of Amazon

mapreduce problem

2009-04-02 Thread Vishal Ghawate
Hi, I am new to the map-reduce programming model. I am writing an MR job that processes a log file; the results are written to different files on HDFS based on some values in the log file. The program works fine even though I haven't done any processing in the reducer. I am not

Re: hadoop job controller

2009-04-02 Thread Stefan Podkowinski
You can get the job progress and completion status through an instance of org.apache.hadoop.mapred.JobClient. If you really want to use Perl, I guess you still need to write a small Java application that talks to Perl on one side and JobClient on the other. There's also some support for Thrift in the
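A minimal sketch of that approach, assuming the old org.apache.hadoop.mapred API (Hadoop 0.18.x) and a job ID passed on the command line; a Perl wrapper could simply parse this program's output.

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class JobStatusPoller {
      public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        // args[0] is a job ID such as "job_200904020000_0001" (hypothetical).
        RunningJob job = client.getJob(args[0]);
        if (job != null) {
          System.out.println("map progress:    " + job.mapProgress());
          System.out.println("reduce progress: " + job.reduceProgress());
          System.out.println("complete:        " + job.isComplete());
          System.out.println("successful:      " + job.isSuccessful());
        }
      }
    }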

HELP: I wanna store the output value into a list not write to the disk

2009-04-02 Thread andy2005cst
I need to use the output of the reduce, but I don't know how. Using the wordcount program as an example: if I want to collect the word counts into a hashtable for further use, how can I do that? The example just shows how to write the result to disk. My email is: andy2005...@gmail.com Looking forward

Re: Amazon Elastic MapReduce

2009-04-02 Thread Brian Bockelman
On Apr 2, 2009, at 3:13 AM, zhang jianfeng wrote: Seems like I would have to pay additional money, so why not configure a Hadoop cluster on EC2 myself? That has already been automated with scripts. Not everyone has a support team or an operations team or enough time to learn how to

Re: mapreduce problem

2009-04-02 Thread Rasit OZDAS
MultipleOutputFormat would be what you want. It supplies multiple files as output. I can paste some code here if you want. 2009/4/2 Vishal Ghawate vishal_ghaw...@persistent.co.in Hi, I am new to the map-reduce programming model. I am writing an MR job that processes a log file and results
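A minimal sketch of the kind of code offered here (not Rasit's actual code), assuming the old org.apache.hadoop.mapred API: a MultipleTextOutputFormat subclass that derives the output file name from the reduce key, so records with different keys land in different files.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    // Records with different keys go to different files under the job
    // output directory; "name" is the default part-NNNNN leaf name.
    public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
      @Override
      protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return key.toString() + "/" + name;
      }
    }

In the driver, conf.setOutputFormat(KeyBasedOutputFormat.class) wires it in.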

Re: HELP: I wanna store the output value into a list not write to the disk

2009-04-02 Thread Rasit OZDAS
Hi, Hadoop is normally designed to write to disk. There is a special file format which writes output to RAM instead of disk, but I don't know whether it's what you're looking for. If what you describe exists, there should be a mechanism which sends output as objects rather than file content

Re: reducer in M-R

2009-04-02 Thread Rasit OZDAS
Since every file name is different, you have a unique key for each map output. That means every iterator has only one element, so you won't need to search for a given name. But it's possible that I misunderstood you. 2009/4/2 Vishal Ghawate vishal_ghaw...@persistent.co.in Hi, I just wanted

Re: Cannot resolve Datanode address in slave file

2009-04-02 Thread Rasit OZDAS
Hi Sim, I have two suggestions, if you haven't tried them yet: 1. Check whether your other hosts can ssh to the master. 2. Take a look at the logs of the other hosts. 2009/4/2 Puri, Aseem aseem.p...@honeywell.com Hi I have a small Hadoop cluster with 3 machines. One is my NameNode/JobTracker +

Re: Strange Reduce Behavior

2009-04-02 Thread Rasit OZDAS
Yes, we've constructed a local version of a Hadoop process. We needed 500 input files in Hadoop to reach the speed of the local process; total time was 82 seconds on a cluster of 6 machines. And I think that's good performance among distributed processing systems. 2009/4/2 jason hadoop

RE: Cannot resolve Datanode address in slave file

2009-04-02 Thread Puri, Aseem
Hi Rasit, Now I have a different problem: when I start my Hadoop server, the slave datanodes do not accept the password; it gives a permission denied message. I have also used these commands on all machines: $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Re: Reducer side output

2009-04-02 Thread Rasit OZDAS
I think the problem is that you have no right to access the path you define. Did you try it with a path under your user directory? You can change permissions from the console. 2009/4/1 Nagaraj K nagar...@yahoo-inc.com Hi, I am trying to do a side-effect output along with the usual output from the

Re: Running MapReduce without setJar

2009-04-02 Thread Rasit OZDAS
Yes, as additional info, you can use this code to just start the job, without waiting until it's finished: JobClient client = new JobClient(conf); client.submitJob(conf); 2009/4/1 javateck javateck javat...@gmail.com you can run from a Java program: JobConf conf = new

Re: Cannot resolve Datanode address in slave file

2009-04-02 Thread Guilherme Germoglio
You should append id_dsa.pub to ~/.ssh/authorized_keys on the other computers in the cluster. If your home directory is shared by all of them (e.g., you're mounting /home/$user using NFS), cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys might work. However, if it isn't shared, you might use

Re: what change to be done in OutputCollector to print custom writable object

2009-04-02 Thread Rasit OZDAS
There is also a good alternative: we use ObjectInputFormat and ObjectRecordReader. With them you can easily do file-to-object translations. I can send a code sample to your mail if you want.

Re: a doubt regarding an appropriate file system

2009-04-02 Thread Rasit OZDAS
If performance is important to you, look at this quote from a previous thread: HDFS is a file system for distributed storage, typically for a distributed computing scenario over Hadoop. For office purposes you will require a SAN (Storage Area Network) - an architecture to attach remote computer

Hardware - please sanity check?

2009-04-02 Thread tim robertson
Hi all, I am not a hardware guy but about to set up a 10 node cluster for some processing of (mostly) tab files, generating various indexes and researching HBase, Mahout, pig, hive etc. Could someone please sanity check that these specs look sensible? [I know 4 drives would be better but price

Re: a doubt regarding an appropriate file system

2009-04-02 Thread Rasit OZDAS
I'm not sure if I understood you correctly, but if so, there is a previous thread that helps explain what Hadoop is intended to be, and what disadvantages it has: http://www.nabble.com/Using-HDFS-to-serve-www-requests-td22725659.html 2009/4/2 Rasit OZDAS rasitoz...@gmail.com If performance is

Re: hdfs-doubt

2009-04-02 Thread Rasit OZDAS
It seems that either the NameNode or the DataNode is not started. You can take a look at the log files and paste the related lines here. 2009/3/29 deepya m_dee...@yahoo.co.in: Thanks, I have another doubt. I just want to run the examples and see how they work. I am trying to copy the file from the local file

Re: Hardware - please sanity check?

2009-04-02 Thread Miles Osborne
Make sure you also have a fast switch, since you will be transmitting data across your network and this will come back to bite you otherwise. (Roughly, you need one core per Hadoop-related job: each mapper, task tracker, etc.; the per-core memory may be too small if you are doing anything

Re: Amazon Elastic MapReduce

2009-04-02 Thread Chris K Wensel
You should check out the new pricing. On Apr 2, 2009, at 1:13 AM, zhang jianfeng wrote: Seems like I would have to pay additional money, so why not configure a Hadoop cluster on EC2 myself? That has already been automated with scripts. On Thu, Apr 2, 2009 at 4:09 PM, Miles Osborne

Re: Identify the input file for a failed mapper/reducer

2009-04-02 Thread Rasit OZDAS
Two quotes for this problem: Streaming map tasks should have a map_input_file environment variable like the following: map_input_file=hdfs://HOST/path/to/file. The value of map.input.file gives you the exact information you need. (I didn't try it.) Rasit 2009/3/26 Jason Fennell jdfenn...@gmail.com:
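A minimal sketch of the second suggestion, assuming the old org.apache.hadoop.mapred API: the mapper reads map.input.file in configure(), so a failing task can report which input file it was working on. The class name and the Text/Text output types are illustrative assumptions.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class InputFileAwareMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private String inputFile;

      @Override
      public void configure(JobConf conf) {
        // Which input split this task is reading; useful when a task fails.
        inputFile = conf.get("map.input.file");
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // Tag each record with the file it came from.
        output.collect(new Text(inputFile), value);
      }
    }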

Re: Hardware - please sanity check?

2009-04-02 Thread tim robertson
Thanks Miles, Thus far most of my work has been on EC2 large instances and *mostly* my code is not memory intensive (I sometimes do joins against polygons and hold Geospatial indexes in memory, but am aware of keeping things within the -Xmx for this). I am mostly looking to move routine data

Re: HELP: I wanna store the output value into a list not write to the disk

2009-04-02 Thread Bryan Duxbury
I don't really see what the downside of reading it from disk is. A list of word counts should be pretty small on disk so it shouldn't take long to read it into a HashMap. Doing anything else is going to cause you to go a long way out of your way to end up with the same result. -Bryan On
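A minimal sketch of that suggestion, assuming the wordcount job used the default TextOutputFormat and wrote to a hypothetical HDFS directory named wordcount-out; the part files are read back into an in-memory map.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WordCountLoader {
      public static Map<String, Long> load(Configuration conf) throws IOException {
        Map<String, Long> counts = new HashMap<String, Long>();
        FileSystem fs = FileSystem.get(conf);
        // One part-NNNNN file per reducer under the job output directory.
        for (FileStatus status : fs.listStatus(new Path("wordcount-out"))) {
          if (!status.getPath().getName().startsWith("part-")) continue;
          BufferedReader in = new BufferedReader(
              new InputStreamReader(fs.open(status.getPath())));
          String line;
          while ((line = in.readLine()) != null) {
            String[] fields = line.split("\t");  // default TextOutputFormat: key<TAB>value
            counts.put(fields[0], Long.parseLong(fields[1]));
          }
          in.close();
        }
        return counts;
      }
    }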

Re: Running MapReduce without setJar

2009-04-02 Thread Farhan Husain
Does this class need to have the mapper and reducer classes too? On Wed, Apr 1, 2009 at 1:52 PM, javateck javateck javat...@gmail.com wrote: you can run from a Java program: JobConf conf = new JobConf(MapReduceWork.class); // setting your params JobClient.runJob(conf);

Re: Amazon Elastic MapReduce

2009-04-02 Thread Kevin Peterson
So if I understand correctly, this is an automated system to bring up a hadoop cluster on EC2, import some data from S3, run a job flow, write the data back to S3, and bring down the cluster? This seems like a pretty good deal. At the pricing they are offering, unless I'm able to keep a cluster

Re: HELP: I wanna store the output value into a list not write to the disk

2009-04-02 Thread He Chen
It seems the InMemoryFileSystem class has been deprecated in Hadoop 0.19.1. Why? I want to reuse the result of one reduce as the input of the next map. Cascading does not work, because the data of each step is dependent on the previous step. I run each timestep's MapReduce job synchronously. If the

Re: Amazon Elastic MapReduce

2009-04-02 Thread Peter Skomoroch
Kevin, The API accepts any arguments you can pass in the standard jobconf for Hadoop 0.18.3; it is pretty easy to convert an existing jobflow over to a JSON job description that will run on the service. -Pete On Thu, Apr 2, 2009 at 2:44 PM, Kevin Peterson kpeter...@biz360.com wrote: So if I

Re: Multiple k,v pairs from a single map - possible?

2009-04-02 Thread Amandeep Khurana
Here's the JIRA for the Oracle fix. https://issues.apache.org/jira/browse/HADOOP-5616 Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Fri, Mar 27, 2009 at 5:18 AM, Brian MacKay brian.mac...@medecision.comwrote: Amandeep, Add this to

Re: Running MapReduce without setJar

2009-04-02 Thread Farhan Husain
I did all of them, i.e., I used setMapClass, setReduceClass, and new JobConf(MapReduceWork.class), but it still cannot run the job without a jar file. I understand the reason, that it looks for those classes inside a jar, but I think there should be some better way to find those classes without using a

Lost TaskTracker Errors

2009-04-02 Thread Bhupesh Bansal
Hey Folks, For the last 2-3 days I have been seeing many of these errors popping up in our Hadoop cluster: Task attempt_200904011612_0025_m_000120_0 failed to report status for 604 seconds. Killing! JobTracker logs don't have any more info, and task tracker logs are clean. The failures occurred

Re: Hardware - please sanity check?

2009-04-02 Thread Patrick Angeles
I had a similar curiosity, but more regarding disk speed. Can I assume linear improvement from 7200 rpm to 10k rpm to 15k rpm? How much of a bottleneck is disk access? Another question is regarding hardware redundancy. What is the relative value of the following: - RAID / hot-swappable drives -

Question about upgrading

2009-04-02 Thread Usman Waheed
Hello, I have a 5 node cluster with one master node. I am upgrading from 0.16.4 to 0.18.3 but am a little confused about whether I am doing it the right way. I read up on the documentation and how to use the -upgrade switch, but I want to make sure I haven't missed any step. First I took down the cluster by

Re: Hardware - please sanity check?

2009-04-02 Thread Philip Zeyliger
I've been assuming that RAID is generally a good idea (disks fail quite often, and it's cheaper to hotswap a drive than to rebuild an entire box). Hadoop data nodes are often configured without RAID (i.e., JBOD = Just a Bunch of Disks)--HDFS already provides for the data redundancy. Also,

Checking if a streaming job failed

2009-04-02 Thread Mayuran Yogarajah
Hello, does anyone know how I can check whether a streaming job (in Perl) has failed or succeeded? The only way I can see at the moment is to check the web interface for that job ID and parse out the 'Status:' value. Is it not possible to do this using 'hadoop job -status'? I see there is a count

Re: RPM spec file for 0.19.1

2009-04-02 Thread Christophe Bisciglia
Hey Ian, we are totally fine with this - the only reason we didn't contribute the SPEC file is that it is the output of our internal build system, and we don't have the bandwidth to properly maintain multiple RPMs. That said, we chatted about this a bit today, and were wondering if the community

Re: Checking if a streaming job failed

2009-04-02 Thread Miles Osborne
Here is how I do it (in Perl). Hadoop streaming is actually called by a shell script, which in this case expects compressed input and produces compressed output, but you get the idea (the mailer has messed up the formatting somewhat): sub runStreamingCompInCompOut { my $mapper = shift @_;

HDFS data block clarification

2009-04-02 Thread javateck javateck
Can someone tell me whether a file will occupy one or more blocks? For example, the default block size is 64MB; if I save a 4k file to HDFS, will the 4k file occupy the whole 64MB block alone? In that case, do I need to configure the block size to 10k if most of my files are less than

Re: HDFS data block clarification

2009-04-02 Thread jason hadoop
HDFS only allocates as much physical disk space as is required for a block, up to the block size for the file (plus some header data). So if you write a 4k file, the single block for that file will be around 4k. If you write a 65M file, there will be two blocks, one of roughly 64M and one of roughly
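A minimal sketch for checking this yourself on a file already in HDFS (path passed as the first argument): it prints the file's length, the block size recorded for the file, and the actual number of blocks.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfo {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        // Block locations tell us how many blocks the file actually occupies.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        System.out.println("length: " + status.getLen()
            + "  blockSize: " + status.getBlockSize()
            + "  blocks: " + blocks.length);
      }
    }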

Re: skip setting output path for a sequential MR job..

2009-04-02 Thread some speed
Removing the file programmatically is doing the trick for me. Thank you all for your answers and help :-) On Tue, Mar 31, 2009 at 12:25 AM, some speed speed.s...@gmail.com wrote: Hello everyone, Is it necessary to redirect the output of reduce to a file? When I am trying to run the same M-R
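A minimal sketch of "removing the file programmatically", assuming the old org.apache.hadoop.mapred API and a hypothetical output directory name: delete the previous job output before resubmitting, so the output-path check doesn't fail.

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class OutputCleaner {
      public static void deleteIfExists(JobConf conf, String dir) throws IOException {
        Path out = new Path(dir);   // e.g. "mr-output" from a previous run (hypothetical)
        FileSystem fs = out.getFileSystem(conf);
        if (fs.exists(out)) {
          fs.delete(out, true);     // recursive delete of the old output directory
        }
      }
    }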

Re: Multiple k,v pairs from a single map - possible?

2009-04-02 Thread 皮皮
Thank you very much. This is what I am looking for. 2009/3/27 Brian MacKay brian.mac...@medecision.com Amandeep, Add this to your driver: MultipleOutputs.addNamedOutput(conf, PHONE, TextOutputFormat.class, Text.class, Text.class); MultipleOutputs.addNamedOutput(conf, NAME,
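A minimal sketch of the reducer side that pairs with that driver snippet (not Brian's actual code), using org.apache.hadoop.mapred.lib.MultipleOutputs from the old API; sending every record to the PHONE named output is just for illustration.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.MultipleOutputs;

    public class SplitReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

      private MultipleOutputs mos;

      @Override
      public void configure(JobConf conf) {
        mos = new MultipleOutputs(conf);
      }

      @SuppressWarnings("unchecked")
      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        while (values.hasNext()) {
          // Route each record to one of the named outputs declared in the driver.
          mos.getCollector("PHONE", reporter).collect(key, values.next());
        }
      }

      @Override
      public void close() throws IOException {
        mos.close();   // flushes the named output files
      }
    }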