Hello everybody
I have a problem. I installed Hadoop on a 2-node cluster and ran the WordCount
example. It takes about 20 seconds to process a 1.5 MB text file. We want to
use Map/Reduce in real time (interactively, per user request). A user can't
wait 20 seconds for his request. This is too long. Is it
Hi Igor,
I am not sure that Hadoop is designed for real-time requests. I have a feeling
that you are trying to use Hadoop in a way it is not designed for. In my
experience, a Hadoop cluster will be much slower than local (standalone) mode
when processing a small dataset, because there is always
On 01/02/11 08:19, Igor Bubkin wrote:
Try this rather small C++ program... it will more than likely be a LOT faster
than anything you could do in Hadoop. Hadoop is not the hammer for every nail.
Too many people think that any cluster solution will automagically scale
their problem... 'tain't true.
I'd appreciate hearing your
Hello,
I want to measure the performance of HDFS on a cluster. There are 16
nodes in the cluster and each node has 12 GB of memory. In order to reduce
the impact of file system caching, each file written by TestDFSIO
is 10 GB. The replication factor is 1. Then I got errors such as
Hi,
I observe that sometimes the map/reduce progress is going backward. What
does this mean?
11/02/01 12:57:51 INFO mapred.JobClient: map 100% reduce 99%
11/02/01 12:59:14 INFO mapred.JobClient: map 100% reduce 98%
11/02/01 12:59:45 INFO mapred.JobClient: map 100% reduce 99%
11/02/01
It means that the scheduler is killing off some of your reduce tasks, or some of
them are dying. Maybe they are taking too long. You should check out your
job tracker, look at some of the details, and then drill down to see if you
are getting any errors in some of your reducers.
Cheers
I have found the following abstraction for a streaming counter update in many
places on the web:
reporter:counter:group,counter,amount
to be sent to cerr, but I haven't found a description of the three parameters.
Presumably group is just a string of my own choosing, but I don't know what
A 'group' is a set of counters (like 'Map-Reduce Framework' in the default
counters; that's a group name). A 'counter' needs a name to signify
what it is about.
The increment is what the 'amount' signifies. Generally 1, but it can be
any amount you need to increment the current count by. All counters
Ah, so amount doesn't set the count, it increments it, meaning I shouldn't
pass a running total each time. That would result in some sort of strange
exponential counter. I should just pass 1 every time. I get it.
I'm still unclear on the distinction between the group and the individual
counter, but
So streaming uses stdout to organize the mapper/reducer output, one record per
line, with each key/value pair split at the first TAB.
(Presumably multiple TABs are permitted and become embedded in the value
string; I haven't experimented with this yet.)
Obviously, one must be very careful not to write
Hi
I have been running some benchmarks with some Hadoop jobs on different node
and disk configurations to see what is a good configuration to get the optimum
performance.
Here are some results that I have. Using the Hadoop job log, I added up the
timings for each of the map tasks and
Group is more of a meta-data thing. For instance, one may want all
FileSystem related counters under a single namespace (which would be a
Group, here).
On Wed, Feb 2, 2011 at 2:17 AM, Keith Wiley kwi...@keithwiley.com wrote:
Hi, Raj.
Interesting analysis...
These numbers appear to be off. For example, 405s for mappers + 751s for
reducers = 1156s for all tasks. If you have 2000 map and reduce tasks, this
means each task is spending roughly 500ms to do actual work. That is a very
low number and seems impossible.
- P
Hi Steve,
- are you asking for XL or bigger VMs to get the full physical host and
less network throttling?
I've used m1.large, m1.xlarge and cc1.4xlarge instance types and seen this
issue on all of them. Speaking specifically about cc1.4xlarge instances, I
see disk read speeds for ephemeral
Patrick, all,
Apologies. My brain must have just frozen. There are two problems here. The
first problem is that my perl script already divides the time by 1000, so the
time is in seconds, not milliseconds.
The second problem is more fundamental. You can add up all the task timings,
but that ignores the
Inline
On Wed, Feb 2, 2011 at 5:35 AM, Raj V rajv...@yahoo.com wrote:
I would really appreciate any help people can offer on the following matters.
When running a streaming job, -D, -files, -libjars, and -archives don't seem
to work, but -jobconf, -file, -cacheFile, and -cacheArchive do. With the first
four parameters anywhere in the command I always get a Streaming