1) When running in pseudo-distributed mode, only 2 values for the reduce
count are accepted, 0 and 1. All other positive values are mapped to 1.
2) The single reduce task spawned has several steps, and each of these steps
accounts for about 1/3 of its overall progress.
The first third is
Does it support Pig?
On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel ch...@wensel.net wrote:
FYI
Amazon's new Hadoop offering:
http://aws.amazon.com/elasticmapreduce/
And Cascading 1.0 supports it:
http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html
cheers,
ckw
--
Chris
... and only in the US
Miles
2009/4/2 zhang jianfeng zjf...@gmail.com:
Does it support Pig?
On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel ch...@wensel.net wrote:
FYI
Amazon's new Hadoop offering:
http://aws.amazon.com/elasticmapreduce/
And Cascading 1.0 supports it:
It seems like I would have to pay additional money, so why not configure a Hadoop
cluster on EC2 myself? That has already been automated with scripts.
On Thu, Apr 2, 2009 at 4:09 PM, Miles Osborne mi...@inf.ed.ac.uk wrote:
... and only in the US
Miles
2009/4/2 zhang jianfeng
Dear Hadoop community,
We are excited today to introduce the public beta of Amazon Elastic MapReduce,
a web service that enables developers to easily and cost-effectively process
vast amounts of data. It utilizes a hosted Hadoop (0.18.3) running on the
web-scale infrastructure of Amazon
Hi,
I am new to the map-reduce programming model.
I am writing an MR job that processes a log file, and results are written to
different files on HDFS based on some values in the log file.
The program works fine even if I haven't done any
processing in the reducer, I am not
You can get the job progress and completion status through an instance
of org.apache.hadoop.mapred.JobClient. If you really want to use Perl,
I guess you still need to write a small Java application that talks to
Perl on one side and JobClient on the other.
There's also some support for Thrift in the
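For what it's worth, here is a minimal sketch of the JobClient side (old mapred API; JobStatusPoller and MyJob are placeholder names, not anything from this thread):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class JobStatusPoller {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MyJob.class);  // MyJob is a placeholder driver class
    JobClient client = new JobClient(conf);
    RunningJob job = client.submitJob(conf);  // submit without blocking
    while (!job.isComplete()) {
      System.out.printf("map %.0f%%, reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(5000);
    }
    System.out.println(job.isSuccessful() ? "SUCCEEDED" : "FAILED");
  }
}

A Perl script could shell out to a small tool like this, or parse the output of 'hadoop job -status <jobid>' instead.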
I need to use the output of the reduce, but I don't know how to do it.
Using the wordcount program as an example: if I want to collect the word counts
into a hashtable for further use, how can I do that?
The example just shows how to write the result to disk.
My email is: andy2005...@gmail.com
Looking forward to your reply.
On Apr 2, 2009, at 3:13 AM, zhang jianfeng wrote:
It seems like I would have to pay additional money, so why not configure a Hadoop
cluster on EC2 myself? That has already been automated with scripts.
Not everyone has a support team or an operations team or enough time
to learn how to
MultipleOutputFormat would be what you want. It lets a job write to multiple
output files.
I can paste some code here if you want.
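For instance, here is a rough sketch using the MultipleTextOutputFormat subclass (old mapred API; LogOutputFormat is a made-up name, and routing by key is just an assumption about how your log fields are arranged):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each record to an output file derived from its key,
// e.g. key "ERROR" ends up in files like ERROR-part-00000.
public class LogOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    return key.toString() + "-" + name;
  }
}

Then register it in the driver with conf.setOutputFormat(LogOutputFormat.class);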
2009/4/2 Vishal Ghawate vishal_ghaw...@persistent.co.in
Hi,
I am new to the map-reduce programming model.
I am writing an MR job that processes a log file and results
Hi, Hadoop is normally designed to write to disk. There is a special file
format which writes output to RAM instead of disk,
but I have no idea if it's what you're looking for.
If what you describe exists, there should be a mechanism which sends output as
objects rather than file content.
Since every file name is different, you have a unique key for each map
output.
That means every iterator has only one element, so you won't need to search
for a given name.
But it's possible that I misunderstood you.
2009/4/2 Vishal Ghawate vishal_ghaw...@persistent.co.in
Hi,
I just wanted
Hi, Sim,
I have two suggestions, if you haven't done this yet:
1. Check if your other hosts can ssh to the master.
2. Take a look at the logs of the other hosts.
2009/4/2 Puri, Aseem aseem.p...@honeywell.com
Hi
I have a small Hadoop cluster with 3 machines. One is my
NameNode/JobTracker +
Yes, we've constructed a local version of a Hadoop process.
We needed 500 input files in Hadoop to reach the speed of the local process;
total time was 82 seconds on a cluster of 6 machines.
And I think that's good performance compared to other distributed processing
systems.
2009/4/2 jason hadoop
Hi Rasit,
Now I have a different problem: when I start my Hadoop server, the slave datanodes
do not accept the password. They give a "permission denied" message.
I have also used these commands on all machines:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
I think the problem is that you have no right to access the path you defined.
Did you try it with a path under your user directory?
You can change permissions from the console.
2009/4/1 Nagaraj K nagar...@yahoo-inc.com
Hi,
I am trying to do a side-effect output along with the usual output from the
Yes, as additional info,
you can use this code to just start the job without waiting until it's finished:
JobClient client = new JobClient(conf);
RunningJob job = client.submitJob(conf);  // submitJob() returns immediately; runJob() would block
2009/4/1 javateck javateck javat...@gmail.com
you can run it from a Java program:
JobConf conf = new
You should append id_dsa.pub to ~/.ssh/authorized_keys on the other
computers in the cluster. If your home directory is shared by all of them
(e.g., you're mounting /home/$user using NFS), cat ~/.ssh/id_dsa.pub
>> ~/.ssh/authorized_keys might work. However, if it isn't shared, you might
use
There is also a good alternative:
we use ObjectInputFormat and ObjectRecordReader.
With them you can easily do file-to-object translations.
I can send a code sample to your mail if you want.
If performance is important to you, look at the quote from a previous
thread:
HDFS is a file system for distributed storage, typically for distributed
computing scenarios over Hadoop. For office purposes you will require a SAN
(Storage Area Network) - an architecture to attach remote computer
Hi all,
I am not a hardware guy but about to set up a 10 node cluster for some
processing of (mostly) tab files, generating various indexes and
researching HBase, Mahout, Pig, Hive, etc.
Could someone please sanity check that these specs look sensible?
[I know 4 drives would be better but price
I'm not sure if I understood you correctly, but if so, there is a previous
thread that helps to better understand what Hadoop is intended to be, and what
disadvantages it has:
http://www.nabble.com/Using-HDFS-to-serve-www-requests-td22725659.html
2009/4/2 Rasit OZDAS rasitoz...@gmail.com
If performance is
It seems that either NameNode or DataNode is not started.
You can take a look at log files, and paste related lines here.
2009/3/29 deepya m_dee...@yahoo.co.in:
Thanks,
I have another doubt. I just want to run the examples and see how it works. I
am trying to copy the file from the local file
Make sure you also have a fast switch, since you will be transmitting
data across your network and this will come to bite you otherwise.
(Roughly, you need one core per Hadoop-related process: each mapper, the task
tracker, etc.; the per-core memory may be too small if you are doing
anything
You should check out the new pricing.
On Apr 2, 2009, at 1:13 AM, zhang jianfeng wrote:
It seems like I would have to pay additional money, so why not configure a Hadoop
cluster on EC2 myself? That has already been automated with scripts.
On Thu, Apr 2, 2009 at 4:09 PM, Miles Osborne
Two quotes for this problem:
Streaming map tasks should have a map_input_file environment
variable like the following:
map_input_file=hdfs://HOST/path/to/file
The value of map.input.file gives you the exact information you need.
(I didn't try it.)
Rasit
2009/3/26 Jason Fennell jdfenn...@gmail.com:
Thanks Miles,
Thus far most of my work has been on EC2 large instances and *mostly*
my code is not memory intensive (I sometimes do joins against polygons
and hold Geospatial indexes in memory, but am aware of keeping things
within the -Xmx for this).
I am mostly looking to move routine data
I don't really see what the downside of reading it from disk is. A
list of word counts should be pretty small on disk so it shouldn't
take long to read it into a HashMap. Doing anything else is going to
cause you to go a long way out of your way to end up with the same
result.
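As a rough sketch (assuming the counts were written by TextOutputFormat as tab-separated word/count lines; the class name and output path are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadCounts {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Map<String, Long> counts = new HashMap<String, Long>();
    // Placeholder path; point this at your job's real output file(s).
    Path part = new Path("wordcount-output/part-00000");
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(part)));
    String line;
    while ((line = in.readLine()) != null) {
      String[] fields = line.split("\t");  // TextOutputFormat writes key TAB value
      counts.put(fields[0], Long.parseLong(fields[1]));
    }
    in.close();
    System.out.println(counts.size() + " distinct words loaded");
  }
}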
-Bryan
On
Does this class need to have the mapper and reducer classes too?
On Wed, Apr 1, 2009 at 1:52 PM, javateck javateck javat...@gmail.comwrote:
you can run from java program:
JobConf conf = new JobConf(MapReduceWork.class);
// setting your params
JobClient.runJob(conf);
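For reference, a typical driver does set the mapper and reducer on the JobConf explicitly. A rough sketch (old mapred API; MapReduceWork, MyMapper, MyReducer and the paths are placeholders, not code from the original post):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Inside your driver's main():
JobConf conf = new JobConf(MapReduceWork.class);  // used to locate the job's jar/classes
conf.setJobName("example");
conf.setMapperClass(MyMapper.class);              // set the mapper ...
conf.setReducerClass(MyReducer.class);            // ... and the reducer explicitly
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(conf, new Path("input"));   // placeholder paths
FileOutputFormat.setOutputPath(conf, new Path("output"));
JobClient.runJob(conf);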
So if I understand correctly, this is an automated system to bring up a
hadoop cluster on EC2, import some data from S3, run a job flow, write the
data back to S3, and bring down the cluster?
This seems like a pretty good deal. At the pricing they are offering, unless
I'm able to keep a cluster
It seems like the InMemoryFileSystem class has been deprecated in Hadoop
0.19.1. Why?
I want to reuse the output of one reduce as the input to the next map. Cascading
does not work for me, because the data of each step depends on the previous one. I run each
timestep's MapReduce job synchronously. If the
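One common way to do this (a rough sketch only; IterativeJob, StepMapper, StepReducer and the paths are placeholder names, old mapred API) is to chain jobs so that each iteration's input is the previous iteration's output:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Run N timesteps; step i reads what step i-1 wrote.
Path input = new Path("initial-input");           // placeholder initial input
for (int i = 0; i < 10; i++) {
  JobConf conf = new JobConf(IterativeJob.class); // placeholder driver class
  conf.setMapperClass(StepMapper.class);          // placeholder mapper
  conf.setReducerClass(StepReducer.class);        // placeholder reducer
  Path output = new Path("iter-" + i);
  FileInputFormat.setInputPaths(conf, input);
  FileOutputFormat.setOutputPath(conf, output);
  JobClient.runJob(conf);  // blocks until this step finishes
  input = output;          // next step consumes this step's output
}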
Kevin,
The API accepts any arguments you can pass in the standard JobConf for
Hadoop 0.18.3; it is pretty easy to convert an existing job flow to a JSON
job description that will run on the service.
-Pete
On Thu, Apr 2, 2009 at 2:44 PM, Kevin Peterson kpeter...@biz360.com wrote:
So if I
Here's the JIRA for the Oracle fix.
https://issues.apache.org/jira/browse/HADOOP-5616
Amandeep
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
On Fri, Mar 27, 2009 at 5:18 AM, Brian MacKay
brian.mac...@medecision.comwrote:
Amandeep,
Add this to
I did all of that, i.e. I used setMapperClass, setReducerClass and new
JobConf(MapReduceWork.class), but it still cannot run the job without a jar
file. I understand the reason is that it looks for those classes inside a jar,
but I think there should be some better way to find those classes without
using a
Hey Folks,
For the last 2-3 days I have been seeing many of these errors popping up in our
Hadoop cluster:
Task attempt_200904011612_0025_m_000120_0 failed to report status for 604
seconds. Killing
The JobTracker logs don't have any more info, and the TaskTracker logs are
clean.
The failures occurred
I had a similar curiosity, but more regarding disk speed.
Can I assume a linear improvement from 7200 rpm to 10k rpm to 15k rpm? How
much of a bottleneck is disk access?
Another question is regarding hardware redundancy. What is the relative
value of the following:
- RAID / hot-swappable drives
-
Hello,
I have a 5 node cluster with one master node. I am upgrading from 0.16.4 to
0.18.3 but am a little confused whether I am doing it the right way. I read up on
the documentation and how to use the -upgrade switch, but want to make sure
I haven't missed any step.
First I took down the cluster by
I've been assuming that RAID is generally a good idea (disks fail quite
often, and it's cheaper to hotswap a drive than to rebuild an entire box).
Hadoop data nodes are often configured without RAID (i.e., JBOD = Just a
Bunch of Disks)--HDFS already provides for the data redundancy. Also,
Hello, does anyone know how I can check if a streaming job (in Perl) has
failed or succeeded? The only way I can see at the moment is to check
the web interface for that jobID and parse out the '*Status:*' value.
Is it not possible to do this using 'hadoop job -status' ? I see there
is a count
Hey Ian, we are totally fine with this - the only reason we didn't
contribute the SPEC file is that it is the output of our internal
build system, and we don't have the bandwidth to properly maintain
multiple RPMs.
That said, we chatted about this a bit today, and were wondering if
the community
Here is how I do it (in Perl). Hadoop streaming is actually called by
a shell script, which in this case expects compressed input and
produces compressed output, but you get the idea:
(the mailer had messed up the formatting somewhat)
sub runStreamingCompInCompOut {
my $mapper = shift @_;
Can someone tell me whether a file will occupy one or more blocks? For
example, the default block size is 64MB, and if I save a 4K file to HDFS,
will the 4K file occupy the whole 64MB block alone? So in this case, do I
need to configure the block size to 10K if most of my files are less than
HDFS only allocates as much physical disk space as is required for a block, up
to the block size for the file (+ some header data).
So if you write a 4k file, the single block for that file will be around 4k.
If you write a 65M file, there will be two blocks, one of roughly 64M, and
one of roughly
Removing the file programmatically does the trick for me. Thank you all
for your answers and help :-)
On Tue, Mar 31, 2009 at 12:25 AM, some speed speed.s...@gmail.com wrote:
Hello everyone,
Is it necessary to redirect the output of reduce to a file? When I am trying
to run the same M-R
Thank you very much. This is what I am looking for.
2009/3/27 Brian MacKay brian.mac...@medecision.com
Amandeep,
Add this to your driver.
MultipleOutputs.addNamedOutput(conf, PHONE, TextOutputFormat.class,
Text.class, Text.class);
MultipleOutputs.addNamedOutput(conf, NAME,
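And on the reducer side, roughly (assuming the named outputs registered above; SplitReducer and the literal output name are illustrative, not from the original driver):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class SplitReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  private MultipleOutputs mos;

  public void configure(JobConf conf) {
    mos = new MultipleOutputs(conf);
  }

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    while (values.hasNext()) {
      // The name must match one registered with addNamedOutput in the driver.
      mos.getCollector("PHONE", reporter).collect(key, values.next());
    }
  }

  public void close() throws IOException {
    mos.close();  // flushes the named output files
  }
}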