Re: Hadoop streaming performance: elements vs. vectors

2009-03-28 Thread Paco NATHAN
hi peter, thinking aloud on this - trade-offs may depend on: * how much grouping would be possible (tracking a PDF would be interesting for metrics) * locality of key/value pairs (distributed among mapper and reducer tasks) to that point, will there be much time spent in the shuffle? if

Re: EC2 Usage?

2008-12-18 Thread Paco NATHAN
Ryan, A developer on our team wrote some JSP to add to the Job Tracker, so that job times and other stats could be accessed programmatically via web services: https://issues.apache.org/jira/browse/HADOOP-4559 There's another update coming for that patch in JIRA, to get task data. Paco On

Re: Getting Reduce Output Bytes

2008-11-25 Thread Paco NATHAN
Hi Lohit, Our team collects those kinds of measurements using this patch: https://issues.apache.org/jira/browse/HADOOP-4559 Some example Java code in the comments shows how to access the data, which is serialized as JSON. Looks like the red_hdfs_bytes_written value would give you that.

Re: combiner stats

2008-11-18 Thread Paco NATHAN
the combiner. On Mon, Nov 17, 2008 at 23:04, Devaraj Das [EMAIL PROTECTED] wrote: On 11/18/08 3:59 AM, Paco NATHAN [EMAIL PROTECTED] wrote: Could someone please help explain the job counters shown for Combine records on the JobTracker JSP page? Here's an example from one of our MR jobs

combiner stats

2008-11-17 Thread Paco NATHAN
Could someone please help explain the job counters shown for Combine records on the JobTracker JSP page? Here's an example from one of our MR jobs. There are Combine input and output record counters shown for both Map phase and Reduce phase. We're not quite sure how to interpret them - Map

Re: Auto-shutdown for EC2 clusters

2008-10-24 Thread Paco NATHAN
, Karl Anderson [EMAIL PROTECTED] wrote: On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote: This workflow could be initiated from a crontab -- totally automated. However, we still see occasional failures of the cluster, and must restart manually, but not often. Stability for that has improved much

Re: Auto-shutdown for EC2 clusters

2008-10-23 Thread Paco NATHAN
Hi Stuart, Yes, we do that. Ditto on most of what Chris described. We use an AMI which pulls tarballs for Ant, Java, Hadoop, etc., from S3 when it launches. That controls the versions for tools/frameworks, instead of redoing an AMI each time a tool has an update. A remote server -- in our data

Re: Can jobs be configured to be sequential

2008-10-17 Thread Paco NATHAN
Hi Ravion, The problem you are describing sounds like a workflow where you must be careful to verify certain conditions before proceeding to a next step. We have similar kinds of use cases for Hadoop apps at work, which are essentially ETL. I recommend that you look at http://cascading.org as

Re: Questions about Hadoop

2008-09-26 Thread Paco NATHAN
Edward, Can you describe more about Hama, with respect to Hadoop? I've read through the Incubator proposal and your blog -- it's a great approach. Are there any benchmarks available? E.g., size of data sets used, kinds of operations performed, etc. Will this project be able to make use of

Re: Questions about Hadoop

2008-09-24 Thread Paco NATHAN
Arijit, For workflow, check out http://cascading.org -- that works quite well and fits what you described. Greenplum and Aster Data have announced support for running MR within the context of their relational databases, e.g., http://www.greenplum.com/resources/mapreduce/ In terms of PIG, Hive,

Re: Questions about Hadoop

2008-09-24 Thread Paco NATHAN
://www.connectivasystems.com -Original Message- From: Paco NATHAN [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 24, 2008 6:10 PM To: core-user@hadoop.apache.org; [EMAIL PROTECTED] Subject: Re: Questions about Hadoop Arijit, For workflow, check out http://cascading.org

Re: How to manage a large cluster?

2008-09-16 Thread Paco NATHAN
Thanks, Steve - Another flexible approach to handling messages across firewalls, between jt and worker nodes, etc., would be to place an AMQP message broker on the jobtracker and another inside our local network. We're experimenting with RabbitMQ for that. On Tue, Sep 16, 2008 at 4:03 AM,

Re: How to manage a large cluster?

2008-09-15 Thread Paco NATHAN
We use an EC2 image onto which we install Java, Ant, Hadoop, etc. To make it simple, pull those from S3 buckets. That provides a flexible pattern for managing the frameworks involved, more so than needing to re-do an EC2 image whenever you want to add a patch to Hadoop. Given that approach, you

Re: Hadoop (0.18) Spill Failed, out of Heap Space error

2008-09-03 Thread Paco NATHAN
Also, that almost always happens early in the map phase of the first MR job which runs on our cluster. Hadoop 0.18.1 on EC2 m1.xl instances. We run 10 MR jobs in sequence, 6hr duration, not seeing the problem repeated after that 1 heap space exception. Paco On Wed, Sep 3, 2008 at 11:42 AM,

Re: question on fault tolerance

2008-08-11 Thread Paco NATHAN
just a guess, for a long-running sequence of MR jobs, how's the namenode behaving during that time? if it gets corrupted, one might see that behavior. we have a similar situation, with 9 MR jobs back-to-back, taking much of the day. might be good to add some notification to an external process

Re: iterative map-reduce

2008-07-29 Thread Paco NATHAN
A simple example of Hadoop application code which follows that pattern (iterate until condition). In the jyte section: http://code.google.com/p/ceteri-mapred/ Loop and condition test are in the same code which calls ToolRunner and JobClient. Best, Paco On Tue, Jul 29, 2008 at 10:03 AM,
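
For readers who want the shape of that "iterate until condition" pattern without opening the linked project, here is a minimal sketch in Python. It is not taken from ceteri-mapred. The names iterate_until, run_job, converged, and fake_job are hypothetical, and run_job is a stand-in for whatever the real driver does to submit one MR job (in Hadoop, presumably a JobClient call made from a class launched through ToolRunner) and read back a metric.

```python
from typing import Callable, Tuple


def iterate_until(run_job: Callable[[int], float],
                  converged: Callable[[float], bool],
                  max_iterations: int = 20) -> Tuple[int, float]:
    """Run one job per pass until converged() holds or the pass budget is spent.

    run_job is a hypothetical stand-in: it takes the pass number, runs one
    job to completion, and returns a metric describing that pass.
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")

    metric = float("inf")
    for iteration in range(1, max_iterations + 1):
        # The loop and the condition test live in the same driver code
        # that launches each job, as described in the message above.
        metric = run_job(iteration)
        if converged(metric):
            return iteration, metric

    # Budget exhausted without meeting the condition; report the last pass.
    return max_iterations, metric


def fake_job(iteration: int) -> float:
    """Toy job for illustration: the metric halves on every pass."""
    return 1.0 / (2 ** iteration)


passes, residual = iterate_until(fake_job, lambda m: m < 0.01)
print(passes, residual)
```

The max_iterations cap is a design choice added here, not something the original message mentions: it keeps a long-running sequence of jobs from looping forever if the condition is never met.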

Re: JobTracker History data+analysis

2008-07-28 Thread Paco NATHAN
PROTECTED] wrote: Can you have a look at org.apache.hadoop.mapred.HistoryViewer and see if it makes sense? Thanks Amareshwari Paco NATHAN wrote: We have a need to access data found in the JobTracker History link. Specifically in the Analyse This Job analysis. Must be run in Java, between jobs

Re: JobTracker History data+analysis

2008-07-28 Thread Paco NATHAN
/history/ inside the directory. Thanks Amareshwari Paco NATHAN wrote: Thank you, Amareshwari - That helps. Hadn't noticed HistoryViewer before. It has no JavaDoc. What is a typical usage? In other words, what would be the outputDir value in the context of ToolRunner, JobClient, etc. ? Paco

JobTracker History data+analysis

2008-07-27 Thread Paco NATHAN
We have a need to access data found in the JobTracker History link. Specifically in the Analyse This Job analysis. Must be run in Java, between jobs, in the same code which calls ToolRunner and JobClient. In essence, we need to collect descriptive statistics about task counts and times for map,

Re: Using MapReduce to do table comparing.

2008-07-23 Thread Paco NATHAN
This is merely an in-the-ballpark calculation, regarding that 10 minute / 4-node requirement... We have a reasonably similar Hadoop job (slightly more complex in the reduce phase) running on AWS with: * 100+2 nodes (m1.xl config) * approx 3x the number of rows and data size * completes

Re: Large Weblink Graph

2008-04-15 Thread Paco NATHAN
Another site which has data sets available for study is UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/ On Tue, Apr 15, 2008 at 8:29 AM, Chaman Singh Verma [EMAIL PROTECTED] wrote: Does anyone have large Weblink graph ? I want to experiment and benchmark MapReduce with

Re: walkthrough of developing first hadoop app from scratch

2008-03-21 Thread Paco NATHAN
Hi Stephen, Here's a sample Hadoop app which has its build based on Ant: http://code.google.com/p/ceteri-mapred/ Look in the jyte directory. A target called prep.jar simply uses the jar/ task in Ant to build a JAR for Hadoop to use. Yeah, I agree that docs and discussions seem to lean more

Re: Add your project or company to the powered by page?

2008-02-21 Thread Paco NATHAN
More on the subject of outreach, not specific uses at companies, but... A couple things might help get the word out: - Add a community group in LinkedIn (shows up on profile searches) http://www.linkedin.com/static?key=groups_faq - Add a link on the wiki to the Facebook group