Hi Peter,
Thinking aloud on this -
trade-offs may depend on:
* how much grouping would be possible (tracking a PDF would be
interesting for metrics)
* locality of key/value pairs (distributed among mapper and reducer tasks)
to that point, will there be much time spent in the shuffle? if
Ryan,
A developer on our team wrote some JSP to add to the Job Tracker, so
that job times and other stats could be accessed programmatically via
web services:
https://issues.apache.org/jira/browse/HADOOP-4559
There's another update coming for that patch in JIRA, to get task data.
Paco
Hi Lohit,
Our team collects those kinds of measurements using this patch:
https://issues.apache.org/jira/browse/HADOOP-4559
Some example Java code in the comments shows how to access the data,
which is serialized as JSON. Looks like the red_hdfs_bytes_written
value would give you that.
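To illustrate the idea, here is a minimal sketch of pulling that counter out of the patch's JSON output. The `red_hdfs_bytes_written` field name comes from the message above; the surrounding JSON structure and job id are assumptions, since the actual shape emitted by the HADOOP-4559 patch may differ.

```python
import json

# Hypothetical response shape -- only red_hdfs_bytes_written is taken
# from the discussion above; the rest of the structure is invented
# for illustration.
sample_response = '''
{
  "job_id": "job_200811170001_0042",
  "counters": {
    "red_hdfs_bytes_written": 1073741824,
    "map_input_records": 5000000
  }
}
'''

def hdfs_bytes_written(raw_json):
    """Pull the reducer HDFS-bytes-written counter out of a job-stats blob."""
    stats = json.loads(raw_json)
    return stats["counters"]["red_hdfs_bytes_written"]

print(hdfs_bytes_written(sample_response))
```

In practice the JSON would be fetched from the JobTracker web service added by the patch, then parsed the same way.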
On Mon, Nov 17, 2008 at 23:04, Devaraj Das [EMAIL PROTECTED] wrote:
On 11/18/08 3:59 AM, Paco NATHAN [EMAIL PROTECTED] wrote:
Could someone please help explain the job counters shown for Combine
records on the JobTracker JSP page?
Here's an example from one of our MR jobs. There are Combine input
and output record counters shown for both Map phase and Reduce phase.
We're not quite sure how to interpret them -
Karl Anderson [EMAIL PROTECTED] wrote:
On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote:
This workflow could be initiated from a crontab -- totally automated.
However, we still see occasional failures of the cluster and must
restart manually, though not often. Stability for that has improved much
Hi Stuart,
Yes, we do that. Ditto on most of what Chris described.
We use an AMI which pulls tarballs for Ant, Java, Hadoop, etc., from
S3 when it launches. That controls the versions for tools/frameworks,
instead of redoing an AMI each time a tool has an update.
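As a sketch of that version-pinning idea: keep the tool versions in one map and derive each S3 tarball location from it, so bumping a version means editing the map rather than rebuilding the AMI. The bucket name, paths, and version numbers below are all hypothetical.

```python
# Launch-time bootstrap sketch: tool versions live in one dict, and
# each tarball URL is derived from it. Bucket and versions are
# invented for illustration.
TOOLS = {
    "hadoop": "0.18.1",
    "ant": "1.7.1",
    "java": "1.6.0",
}
BUCKET = "s3://example-bootstrap-bucket"

def tarball_urls(tools=TOOLS, bucket=BUCKET):
    """Build the S3 URL for each tool tarball the instance should fetch."""
    return [f"{bucket}/{name}-{version}.tar.gz"
            for name, version in sorted(tools.items())]

for url in tarball_urls():
    # On a real instance one would fetch and unpack each tarball here,
    # e.g. via an S3 client plus tar -xzf.
    print(url)
```

The same map can drive environment variables (HADOOP_HOME, ANT_HOME, etc.) so the rest of the boot script stays version-agnostic.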
A remote server -- in our data
Hi Ravion,
The problem you are describing sounds like a workflow where you must
be careful to verify certain conditions before proceeding to a next
step.
We have similar kinds of use cases for Hadoop apps at work, which are
essentially ETL. I recommend that you look at http://cascading.org as
Edward,
Can you describe more about Hama, with respect to Hadoop?
I've read through the Incubator proposal and your blog -- it's a great approach.
Are there any benchmarks available? E.g., size of data sets used,
kinds of operations performed, etc.
Will this project be able to make use of
Arijit,
For workflow, check out http://cascading.org -- that works quite well
and fits what you described.
Greenplum and Aster Data have announced support for running MR within
the context of their relational databases, e.g.,
http://www.greenplum.com/resources/mapreduce/
In terms of PIG, Hive,
http://www.connectivasystems.com
-Original Message-
From: Paco NATHAN [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 24, 2008 6:10 PM
To: core-user@hadoop.apache.org; [EMAIL PROTECTED]
Subject: Re: Questions about Hadoop
Arijit,
For workflow, check out http://cascading.org
Thanks, Steve -
Another flexible approach to handling messages across firewalls,
between jt and worker nodes, etc., would be to place an AMQP message
broker on the jobtracker and another inside our local network. We're
experimenting with RabbitMQ for that.
On Tue, Sep 16, 2008 at 4:03 AM,
We use an EC2 image onto which we install Java, Ant, Hadoop, etc. To
make it simple, pull those from S3 buckets. That provides a flexible
pattern for managing the frameworks involved, more so than needing to
re-do an EC2 image whenever you want to add a patch to Hadoop.
Given that approach, you
Also, that almost always happens early in the map phase of the first
MR job which runs on our cluster.
Hadoop 0.18.1 on EC2 m1.xl instances.
We run 10 MR jobs in sequence, 6hr duration, and have not seen the
problem repeated after that one heap space exception.
Paco
On Wed, Sep 3, 2008 at 11:42 AM,
just a guess,
for a long-running sequence of MR jobs, how's the namenode behaving
during that time? if it gets corrupted, one might see that behavior.
we have a similar situation, with 9 MR jobs back-to-back, taking much
of the day.
might be good to add some notification to an external process
Here is a simple example of Hadoop application code which follows that
pattern (iterate until condition), in the jyte section:
http://code.google.com/p/ceteri-mapred/
Loop and condition test are in the same code which calls ToolRunner
and JobClient.
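The shape of that driver loop can be sketched as follows. In the real Hadoop code the job submission goes through ToolRunner/JobClient; here a stub `run_job` stands in, with invented numbers, so only the iterate-until-condition structure is shown.

```python
# Sketch of the iterate-until-condition driver pattern. run_job is a
# stub for one MR pass; in real code it would submit via ToolRunner /
# JobClient and read a job counter back.
def run_job(iteration):
    """Stub for one MR pass; returns a counter the condition test reads."""
    # Pretend each pass halves the number of unresolved records.
    return 1000 // (2 ** iteration)

def iterate_until(condition_met, max_iterations=12):
    """Re-submit the job until the condition holds or we give up."""
    for i in range(max_iterations):
        remaining = run_job(i)
        if condition_met(remaining):
            return i + 1  # number of passes it took
    raise RuntimeError("condition not met within max_iterations")

passes = iterate_until(lambda remaining: remaining == 0)
print(passes)
```

Keeping the loop and the condition test in the driver, rather than inside any one job, is what lets the same code decide between "run again" and "done".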
Best,
Paco
On Tue, Jul 29, 2008 at 10:03 AM,
PROTECTED] wrote:
Can you have a look at org.apache.hadoop.mapred.HistoryViewer and see if it
makes sense?
Thanks
Amareshwari
Paco NATHAN wrote:
We have a need to access data found in the JobTracker History link.
Specifically in the Analyse This Job analysis. Must be run in Java,
between jobs
/history/ inside the directory.
Thanks
Amareshwari
Paco NATHAN wrote:
Thank you, Amareshwari -
That helps. Hadn't noticed HistoryViewer before. It has no JavaDoc.
What is a typical usage? In other words, what would be the
outputDir value in the context of ToolRunner, JobClient, etc. ?
Paco
We have a need to access data found in the JobTracker History link.
Specifically in the Analyse This Job analysis. Must be run in Java,
between jobs, in the same code which calls ToolRunner and JobClient.
In essence, we need to collect descriptive statistics about task
counts and times for map,
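The between-jobs bookkeeping described above can be sketched like this: given per-task durations scraped from the job history, compute the descriptive statistics to track per phase. The duration values here are invented for illustration.

```python
# Descriptive-statistics sketch for per-phase task durations pulled
# from the job history. Values are invented.
from statistics import mean, median, pstdev

map_task_seconds = [42.0, 38.5, 45.2, 40.1, 39.7]

def describe(durations):
    """Summarize task durations: count, mean, median, population stddev."""
    return {
        "count": len(durations),
        "mean": mean(durations),
        "median": median(durations),
        "stddev": pstdev(durations),
    }

summary = describe(map_task_seconds)
print(summary["count"], round(summary["mean"], 2))
```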
This is merely an in-the-ballpark calculation, regarding that 10
minute / 4-node requirement...
We have a reasonably similar Hadoop job (slightly more complex in the
reduce phase) running on AWS with:
* 100+2 nodes (m1.xl config)
* approx 3x the number of rows and data size
* completes
Another site which has data sets available for study is UCI Machine
Learning Repository:
http://archive.ics.uci.edu/ml/
On Tue, Apr 15, 2008 at 8:29 AM, Chaman Singh Verma [EMAIL PROTECTED] wrote:
Does anyone have a large weblink graph? I want to experiment and benchmark
MapReduce with
Hi Stephen,
Here's a sample Hadoop app which has its build based on Ant:
http://code.google.com/p/ceteri-mapred/
Look in the jyte directory. A target called prep.jar simply uses
the <jar/> task in Ant to build a JAR for Hadoop to use.
Yeah, I agree that docs and discussions seem to lean more
More on the subject of outreach, not specific uses at companies, but...
A couple things might help get the word out:
- Add a community group in LinkedIn (shows up on profile searches)
http://www.linkedin.com/static?key=groups_faq
- Add a link on the wiki to the Facebook group