from:"Paco NATHAN"

Re: Hadoop streaming performance: elements vs. vectors

2009-03-28 Thread Paco NATHAN

hi peter, thinking aloud on this - trade-offs may depend on: * how much grouping would be possible (tracking a PDF would be interesting for metrics) * locality of key/value pairs (distributed among mapper and reducer tasks) to that point, will there be much time spent in the shuffle? if

Re: EC2 Usage?

2008-12-18 Thread Paco NATHAN

Ryan, A developer on our team wrote some JSP to add to the Job Tracker, so that job times and other stats could be accessed programmatically via web services: https://issues.apache.org/jira/browse/HADOOP-4559 There's another update coming for that patch in JIRA, to get task data. Paco On

Re: Getting Reduce Output Bytes

2008-11-25 Thread Paco NATHAN

Hi Lohit, Our team collects those kinds of measurements using this patch: https://issues.apache.org/jira/browse/HADOOP-4559 Some example Java code in the comments shows how to access the data, which is serialized as JSON. Looks like the red_hdfs_bytes_written value would give you that.

Re: combiner stats

2008-11-18 Thread Paco NATHAN

the combiner. On Mon, Nov 17, 2008 at 23:04, Devaraj Das [EMAIL PROTECTED] wrote: On 11/18/08 3:59 AM, Paco NATHAN [EMAIL PROTECTED] wrote: Could someone please help explain the job counters shown for Combine records on the JobTracker JSP page? Here's an example from one of our MR jobs

combiner stats

2008-11-17 Thread Paco NATHAN

Could someone please help explain the job counters shown for Combine records on the JobTracker JSP page? Here's an example from one of our MR jobs. There are Combine input and output record counters shown for both Map phase and Reduce phase. We're not quite sure how to interpret them - Map

Re: Auto-shutdown for EC2 clusters

2008-10-24 Thread Paco NATHAN

, Karl Anderson [EMAIL PROTECTED] wrote: On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote: This workflow could be initiated from a crontab -- totally automated. However, we still see occasional failures of the cluster, and must restart manually, but not often. Stability for that has improved much

Re: Auto-shutdown for EC2 clusters

2008-10-23 Thread Paco NATHAN

Hi Stuart, Yes, we do that. Ditto on most of what Chris described. We use an AMI which pulls tarballs for Ant, Java, Hadoop, etc., from S3 when it launches. That controls the versions for tools/frameworks, instead of redoing an AMI each time a tool has an update. A remote server -- in our data

Re: Can jobs be configured to be sequential

2008-10-17 Thread Paco NATHAN

Hi Ravion, The problem you are describing sounds like a workflow where you must be careful to verify certain conditions before proceeding to a next step. We have similar kinds of use cases for Hadoop apps at work, which are essentially ETL. I recommend that you look at http://cascading.org as

Re: Questions about Hadoop

2008-09-26 Thread Paco NATHAN

Edward, Can you describe more about Hama, with respect to Hadoop? I've read through the Incubator proposal and your blog -- it's a great approach. Are there any benchmarks available? E.g., size of data sets used, kinds of operations performed, etc. Will this project be able to make use of

Re: Questions about Hadoop

2008-09-24 Thread Paco NATHAN

Arijit, For workflow, check out http://cascading.org -- that works quite well and fits what you described. Greenplum and Aster Data have announced support for running MR within the context of their relational databases, e.g., http://www.greenplum.com/resources/mapreduce/ In terms of PIG, Hive,

Re: Questions about Hadoop

2008-09-24 Thread Paco NATHAN

://www.connectivasystems.com -Original Message- From: Paco NATHAN [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 24, 2008 6:10 PM To: core-user@hadoop.apache.org; [EMAIL PROTECTED] Subject: Re: Questions about Hadoop Arijit, For workflow, check out http://cascading.org

Re: How to manage a large cluster?

2008-09-16 Thread Paco NATHAN

Thanks, Steve - Another flexible approach to handling messages across firewalls, between jt and worker nodes, etc., would be to place an AMQP message broker on the jobtracker and another inside our local network. We're experimenting with RabbitMQ for that. On Tue, Sep 16, 2008 at 4:03 AM,

Re: How to manage a large cluster?

2008-09-15 Thread Paco NATHAN

We use an EC2 image onto which we install Java, Ant, Hadoop, etc. To make it simple, pull those from S3 buckets. That provides a flexible pattern for managing the frameworks involved, more so than needing to re-do an EC2 image whenever you want to add a patch to Hadoop. Given that approach, you

Re: Hadoop (0.18) Spill Failed, out of Heap Space error

2008-09-03 Thread Paco NATHAN

Also, that almost always happens early in the map phase of the first MR job which runs on our cluster. Hadoop 0.18.1 on EC2 m1.xl instances. We run 10 MR jobs in sequence, 6hr duration, not seeing the problem repeated after that 1 heap space exception. Paco On Wed, Sep 3, 2008 at 11:42 AM,

Re: question on fault tolerance

2008-08-11 Thread Paco NATHAN

just a guess, for a long-running sequence of MR jobs, how's the namenode behaving during that time? if it gets corrupted, one might see that behavior. we have a similar situation, with 9 MR jobs back-to-back, taking much of the day. might be good to add some notification to an external process

Re: iterative map-reduce

2008-07-29 Thread Paco NATHAN

A simple example of Hadoop application code which follows that pattern (iterate until condition). In the jyte section: http://code.google.com/p/ceteri-mapred/ Loop and condition test are in the same code which calls ToolRunner and JobClient. Best, Paco On Tue, Jul 29, 2008 at 10:03 AM,

Re: JobTracker History data+analysis

2008-07-28 Thread Paco NATHAN

PROTECTED] wrote: Can you have a look at org.apache.hadoop.mapred.HistoryViewer and see if it make sense? Thanks Amareshwari Paco NATHAN wrote: We have a need to access data found in the JobTracker History link. Specifically in the Analyse This Job analysis. Must be run in Java, between jobs

Re: JobTracker History data+analysis

2008-07-28 Thread Paco NATHAN

/history/ inside the directory. Thanks Amareshwari Paco NATHAN wrote: Thank you, Amareshwari - That helps. Hadn't noticed HistoryViewer before. It has no JavaDoc. What is a typical usage? In other words, what would be the outputDir value in the context of ToolRunner, JobClient, etc. ? Paco

JobTracker History data+analysis

2008-07-27 Thread Paco NATHAN

We have a need to access data found in the JobTracker History link. Specifically in the Analyse This Job analysis. Must be run in Java, between jobs, in the same code which calls ToolRunner and JobClient. In essence, we need to collect descriptive statistics about task counts and times for map,

Re: Using MapReduce to do table comparing.

2008-07-23 Thread Paco NATHAN

This is merely an in-the-ballpark calculation, regarding that 10 minute / 4-node requirement... We have a reasonably similar Hadoop job (slightly more complex in the reduce phase) running on AWS with: * 100+2 nodes (m1.xl config) * approx 3x the number of rows and data size * completes

Re: Large Weblink Graph

2008-04-15 Thread Paco NATHAN

Another site which has data sets available for study is UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/ On Tue, Apr 15, 2008 at 8:29 AM, Chaman Singh Verma [EMAIL PROTECTED] wrote: Does anyone have large Weblink graph ? I want to experiment and benchmark MapReduce with

Re: walkthrough of developing first hadoop app from scratch

2008-03-21 Thread Paco NATHAN

Hi Stephen, Here's a sample Hadoop app which has its build based on Ant: http://code.google.com/p/ceteri-mapred/ Look in the jyte directory. A target called prep.jar simply uses the jar/ task in Ant to build a JAR for Hadoop to use. Yeah, I agree that docs and discussions seem to lean more

Re: Add your project or company to the powered by page?

2008-02-21 Thread Paco NATHAN

More on the subject of outreach, not specific uses at companies, but... A couple things might help get the word out: - Add a community group in LinkedIn (shows up on profile searches) http://www.linkedin.com/static?key=groups_faq - Add a link on the wiki to the Facebook group

23 matches
