Re: Efficient algorithm for many-to-many reduce-side join?
I believe PIG, and I know Cascading, use a kind of 'spillable' list that can be re-iterated across. PIG's version is a bit more sophisticated last I looked. that said, if you were using either one of them, you wouldn't need to write your own many-to-many join. cheers, ckw On May 28, 2009, at 8:14 AM, Todd Lipcon wrote: One last possible trick to consider: If you were to subclass SequenceFileRecordReader, you'd have access to its seek method, allowing you to rewind the reducer input. You could then implement a block hash join with something like the following pseudocode:

ahash = new HashMap<Key, Val>();
while (i have ram available) {
  read a record
  if the record is from table B, break
  put the record into ahash
}
nextAPos = reader.getPos()
while (current record is an A record) {
  skip to next record
}
firstBPos = reader.getPos()
while (current record has current key) {
  read and join against ahash
  process joined result
}
if (firstBPos > nextAPos) {
  seek(nextAPos)
  go back to top
}

On Thu, May 28, 2009 at 8:05 AM, Todd Lipcon t...@cloudera.com wrote: Hi Stuart, It seems to me like you have a few options. Option 1: Just use a lot of RAM. Unless you really expect many millions of entries on both sides of the join, you might be able to get away with buffering despite its inefficiency. Option 2: Use LocalDirAllocator to find some local storage to spill all of the left table's records to disk in a MapFile format. Then as you iterate over the right table, do lookups in the MapFile. This is really the same as option 1, except that you're using disk as an extension of RAM. Option 3: Convert this to a map-side merge join. Basically what you need to do is sort both tables by the join key, and partition them with the same partitioner into the same number of partitions. This way you have an equal number of part-N files for both tables, and within each part-N file they're ordered by join key. In each map task, you open both tableA/part-N and tableB/part-N and do a sequential merge to perform the join. I believe the CompositeInputFormat class helps with this, though I've never used it. Option 4: Perform the join in several passes. Whichever table is smaller, break it into pieces that fit in RAM. Unless my relational algebra is off, A JOIN B = A JOIN (B1 UNION B2) = (A JOIN B1) UNION (A JOIN B2) if B = B1 UNION B2. Hope that helps -Todd On Thu, May 28, 2009 at 5:02 AM, Stuart White stuart.whi...@gmail.com wrote: I need to do a reduce-side join of two datasets. It's a many-to-many join; that is, each dataset can contain multiple records with any given key. Every description of a reduce-side join I've seen involves constructing your keys out of your mapper such that records from one dataset will be presented to the reducers before records from the second dataset. I should hold on to the value from the one dataset and remember it as I iterate across the values from the second dataset. This seems like it only works well for one-to-many joins (when one of your datasets will only have a single record with any given key). This scales well because you're only remembering one value. In a many-to-many join, if you apply this same algorithm, you'll need to remember all values from one dataset, which of course will be problematic (and won't scale) when dealing with large datasets with large numbers of records with the same keys. Does an efficient algorithm exist for a many-to-many reduce-side join? -- Chris K Wensel ch...@concurrentinc.com http://www.concurrentinc.com
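[Editor's note: for readers who want to see the simplest variant in code, below is a minimal, illustrative sketch of Todd's "Option 1" (buffer one side of the join in RAM) using the 0.19-era mapred API. It assumes the mappers tag each value with its source table (an "A\t"/"B\t" prefix here) and that a composite key / secondary sort delivers all A records for a key before the B records, as described in the thread. Class and field names are hypothetical, not from the original messages, and this only works while the A side of any single key fits in memory -- which is exactly the limitation the spillable lists in PIG and Cascading, and Todd's other options, are meant to remove.]

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch of "Option 1": buffer the A side in RAM, stream the B side past it.
public class ManyToManyJoinReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    List<String> aSide = new ArrayList<String>();

    while (values.hasNext()) {
      String value = values.next().toString();
      if (value.startsWith("A\t")) {
        // A records arrive first (secondary sort); buffer them in RAM.
        aSide.add(value.substring(2));
      } else {
        // A B record: emit one joined row per buffered A record.
        String bRecord = value.substring(2);
        for (String aRecord : aSide) {
          output.collect(key, new Text(aRecord + "\t" + bRecord));
        }
      }
    }
  }
}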
Re: Amazon Elastic MapReduce
You should check out the new pricing. On Apr 2, 2009, at 1:13 AM, zhang jianfeng wrote: Seems like I would have to pay additional money, so why not configure a Hadoop cluster on EC2 myself? That has already been automated using scripts. On Thu, Apr 2, 2009 at 4:09 PM, Miles Osborne mi...@inf.ed.ac.uk wrote: ... and only in the US Miles 2009/4/2 zhang jianfeng zjf...@gmail.com: Does it support Pig? On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel ch...@wensel.net wrote: FYI, Amazon's new Hadoop offering: http://aws.amazon.com/elasticmapreduce/ And Cascading 1.0 supports it: http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html cheers, ckw -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/ -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
Re: Reducers spawned when mapred.reduce.tasks=0
fwiw, we have released a workaround for this issue in Cascading 1.0.5. http://www.cascading.org/ http://cascading.googlecode.com/files/cascading-1.0.5.tgz In short, Hadoop 0.19.0 and .1 instantiate the user's Reducer class and subsequently call configure() when there is no intention to use the class (during job/task cleanup tasks). This clearly can cause havoc for users who use configure() to initialize resources used by the reduce() method. Testing whether jobConf.getNumReduceTasks() is 0 inside the configure() method seems to work out well. branch-0.19 looks like it won't instantiate the Reducer class during job/task cleanup tasks, so I don't expect this to leak into future releases. cheers, ckw On Mar 12, 2009, at 8:20 PM, Amareshwari Sriramadasu wrote: Are you seeing reducers getting spawned from the web ui? Then it is a bug. If not, there won't be reducers spawned; it could be the job-setup/job-cleanup task that is running on a reduce slot. See HADOOP-3150 and HADOOP-4261. -Amareshwari Chris K Wensel wrote: May have found the answer, waiting on confirmation from users. Turns out 0.19.0 and .1 instantiate the reducer class when the task is actually intended for job/task cleanup. branch-0.19 looks like it resolves this issue by not instantiating the reducer class in this case. I've got a workaround in the next maint release: http://github.com/cwensel/cascading/tree/wip-1.0.5 ckw On Mar 12, 2009, at 10:12 AM, Chris K Wensel wrote: Hey all Have some users reporting intermittent spawning of Reducers when the job.xml shows mapred.reduce.tasks=0 in 0.19.0 and .1. This is also confirmed when the jobConf is queried in the (supposedly ignored) Reducer implementation. In general this issue would likely go unnoticed since the default reducer is IdentityReducer. But since it should be ignored in the Mapper-only case, we don't bother not setting the value, and it subsequently comes to one's attention rather abruptly. Am happy to open a JIRA, but wanted to see if anyone else is experiencing this issue. Note the issue seems to manifest with or without speculative execution. ckw -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/ -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/ -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
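[Editor's note: a minimal sketch of the configure() guard described above, using the old mapred API. The class and the "expensive setup" step are placeholders, not code from the thread.]

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class GuardedReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void configure(JobConf job) {
    // On Hadoop 0.19.0/0.19.1 this class can be instantiated, and configure()
    // called, for the job/task cleanup task even when mapred.reduce.tasks=0.
    if (job.getNumReduceTasks() == 0) {
      return; // skip expensive setup; reduce() will never be called
    }
    // ... initialize resources used by reduce() here ...
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // ... normal reduce logic using the resources initialized above ...
  }
}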
Reducers spawned when mapred.reduce.tasks=0
Hey all Have some users reporting intermittent spawning of Reducers when the job.xml shows mapred.reduce.tasks=0 in 0.19.0 and .1. This is also confirmed when the jobConf is queried in the (supposedly ignored) Reducer implementation. In general this issue would likely go unnoticed since the default reducer is IdentityReducer. But since it should be ignored in the Mapper-only case, we don't bother not setting the value, and it subsequently comes to one's attention rather abruptly. Am happy to open a JIRA, but wanted to see if anyone else is experiencing this issue. Note the issue seems to manifest with or without speculative execution. ckw -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
Cascading support of HBase
Hey all, Just wanted to let everyone know the HBase Hackathon was a success (thanks Streamy!), and we now have Cascading Tap adapters for HBase. You can find the link here (along with additional third-party extensions in the near future). http://www.cascading.org/modules.html Please feel free to give it a test, clone the repo, and submit patches back to me via GitHub. The unit test shows how simple it is to import/export data to/from an HBase cluster. http://bit.ly/fIpAE For those not familiar with Cascading, it is an alternative processing API that is generally more natural to develop and think in than MapReduce. http://www.cascading.org/ enjoy, chris p.s. If you have any code you want to contribute back, just stick it on GitHub and send me a link. -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
Re: Control over max map/reduce tasks per job
Hey Jonathan Are you looking to limit the total number of concurrent mapper/ reducers a single job can consume cluster wide, or limit the number per node? That is, you have X mappers/reducers, but only can allow N mappers/ reducers to run at a time globally, for a given job. Or, you are cool with all X running concurrently globally, but want to guarantee that no node can run more than N tasks from that job? Or both? just reconciling the conversation we had last week with this thread. ckw On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote: All, I have a few relatively small clusters (5-20 nodes) and am having trouble keeping them loaded with my MR jobs. The primary issue is that I have different jobs that have drastically different patterns. I have jobs that read/write to/from HBase or Hadoop with minimal logic (network throughput bound or io bound), others that perform crawling (network latency bound), and one huge parsing streaming job (very CPU bound, each task eats a core). I'd like to launch very large numbers of tasks for network latency bound jobs, however the large CPU bound job means I have to keep the max maps allowed per node low enough as to not starve the Datanode and Regionserver. I'm an HBase dev but not familiar enough with Hadoop MR code to even know what would be involved with implementing this. However, in talking with other users, it seems like this would be a well-received option. I wanted to ping the list before filing an issue because it seems like someone may have thought about this in the past. Thanks. Jonathan Gray -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
Scale Unlimited Professionals Program
Hey All Just wanted to let everyone know that Scale Unlimited will start offering many of its courses heavily discounted, if not free, to independent consultants and contractors. http://www.scaleunlimited.com/programs We are doing this because we receive a number of consulting/ contracting opportunities that we wish to delegate back to trusted consultants and developers. But more importantly, many consultants don't have time to learn Hadoop and related technologies, so Hadoop is often overlooked on new projects. We would like to get more developers comfortable knowing when and when not to use Hadoop on a project, ultimately leading to more projects using Hadoop and to Hadoop becoming more stable and feature rich in the process. We plan to offer our Hadoop Boot Camp for FREE in the Bay Area in the next few weeks. If interested in participating, email me directly. http://www.scaleunlimited.com/courses/hadoop-boot-camp Note this offer is limited to professional independent consultants, contractors, and small boutique contracting firms that are looking to expand their tool base. We only ask for an industry standard referral fee for any projects that result from a referral, if any. To be added to our referral list or if you have a project that might benefit from Hadoop or related technologies, please email me directly. This course will also be announced for open public enrollment in the coming days. cheers, chris -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
Re: Windows Support
Hey Dan There is discussion/issue on this here: https://issues.apache.org/jira/browse/HADOOP-4998 ckw On Jan 19, 2009, at 8:55 AM, Dan Diephouse wrote: On Mon, Jan 19, 2009 at 11:35 AM, Steve Loughran ste...@apache.org wrote: Dan Diephouse wrote: I recognize that Windows support is, um, limited :-) But, any ideas what exactly would need to be changed to support Windows (without cygwin) if someone such as myself were so motivated? The most immediate thing I ran into was the UserGroupInformation which would need a windows implementation. I see there is an issue to switch to JAAS too, which may be the proper fix? Are there lots of other things that would need to be changed? I think it may be worth opening a JIRA for windows support and creating some subtasks for the various issues, even if no one tackles them quite yet. Thanks, Dan I think a key one you need to address is motiviation. Is cygwin that bad for a piece of server-side code? No I guess I was trying to get an idea of how much work it was. It seems easy enough to supply a WindowsUserGroupInformation class (or a platform agnostic one). I wondered how many other things like this there were before I put together a patch. Seems bad Java practices to depend on shell utilities :-). Not very platform agnostic... Dan -- Dan Diephouse http://netzooid.com/blog -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
Cascading 1.0.0 Released
Hi all Just a quick note to let everyone know that Cascading 1.0.0 is out. http://www.cascading.org/ Cascading is an API for defining and executing data processing flows without needing to think in MapReduce. This release supports only Hadoop 0.19.x. Minor releases will be available to track Hadoop's progress. A list of some of the high level features can be read about here: http://www.cascading.org/documentation/features/ Or you can peruse the User Guide: http://www.cascading.org/documentation/userguide.html Developer Support and OEM/Commercial Licensing is available through Concurrent, Inc. http://www.concurrentinc.com/ And finally, Advanced Hadoop and Cascading training (and consulting) is available through Scale Unlimited: http://www.scaleunlimited.com/ cheers, chris -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
Re: Storing/retrieving time series with hadoop
Hey Brock I used Cascading quite extensively with time series data. Along with the standard function/filter/aggregator operations in the Cascading processing model, there is what we call a buffer. Its really just a user friendly Reduce that integrates well with other operations and offers up a sliding window across your grouped data. Quite useful for running averages or filling in missing intervals etc. Plus there are handy operations for switching from text time strings to long time stamps and back etc.. YMMV cheers, ckw On Jan 7, 2009, at 5:03 PM, Brock Judkins wrote: Hi list, I am researching hadoop as a possible solution for my company's data warehousing solution. My question is whether hadoop, possibly in combination with Hive or Pig, is a good solution for time-series data? We basically have a ton of web analytics to store that we display both internally and externally. For the time being I am storing timestamped data points in a huge MySQL table, but I know this will not scale very far (although it's holding up ok at almost 90MM rows). I am aware that hadoop can scale insanely large (larger than I need), but does anyone have experience using it to draw charts based on time series with fairly low latency? Thanks! Brock -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
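[Editor's note: for the curious, below is a rough sketch of what a running-average Buffer might look like against Cascading's Java API. The field names are made up, and the exact class and method signatures should be checked against the Cascading 1.0 javadocs rather than taken from this post; treat it as an illustration of the sliding-window/grouped-iteration idea Chris describes, not as verified code.]

import java.util.Iterator;

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Buffer;
import cascading.operation.BufferCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

// Emits a running average of a single numeric argument field within each group.
public class RunningAverage extends BaseOperation implements Buffer {

  public RunningAverage() {
    super(new Fields("running-avg"));
  }

  public void operate(FlowProcess flowProcess, BufferCall bufferCall) {
    double sum = 0;
    long count = 0;

    Iterator<TupleEntry> arguments = bufferCall.getArgumentsIterator();
    while (arguments.hasNext()) {
      TupleEntry entry = arguments.next();
      sum += entry.getTuple().getDouble(0); // first (and only) argument field
      count++;
      // one output tuple per input tuple, carrying the average so far
      bufferCall.getOutputCollector().add(new Tuple(sum / count));
    }
  }
}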
Re: Lookup HashMap available within the Map
cool. If you need a hand with Cascading stuff, feel free to ping me on the mail list or #cascading irc. lots of other friendly folk there already. ckw On Nov 25, 2008, at 12:35 PM, tim robertson wrote: Thanks Chris, I have a different test running, then will implement that. Might give cascading a shot for what I am doing. Cheers Tim On Tue, Nov 25, 2008 at 9:24 PM, Chris K Wensel [EMAIL PROTECTED] wrote: Hey Tim The .configure() method is what you are looking for i believe. It is called once per task, which in the default case, is once per jvm. Note Jobs are broken into parallel tasks, each task handles a portion of the input data. So you may create your map 100 times, because there are 100 tasks, it will only be created once per jvm. I hope this makes sense. chris On Nov 25, 2008, at 11:46 AM, tim robertson wrote: Hi Doug, Thanks - it is not so much I want to run in a single JVM - I do want a bunch of machines doing the work, it is just I want them all to have this in-memory lookup index, that is configured once per job. Is there some hook somewhere that I can trigger a read from the distributed cache, or is a Mapper.configure() the best place for this? Can it be called multiple times per Job meaning I need to keep some static synchronised indicator flag? Thanks again, Tim On Tue, Nov 25, 2008 at 8:41 PM, Doug Cutting [EMAIL PROTECTED] wrote: tim robertson wrote: Thanks Alex - this will allow me to share the shapefile, but I need to one time only per job per jvm read it, parse it and store the objects in the index. Is the Mapper.configure() the best place to do this? E.g. will it only be called once per job? In 0.19, with HADOOP-249, all tasks from a job can be run in a single JVM. So, yes, you could access a static cache from Mapper.configure(). Doug -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
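[Editor's note: to make the pattern from this thread concrete, here is a hedged sketch of loading a lookup map once per JVM from the distributed cache inside Mapper.configure(), guarded by a static flag as Tim asks about (useful when tasks share a JVM, per HADOOP-249). The cache path handling and parse step are placeholders.]

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LookupMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Shared across tasks that reuse the same JVM.
  private static Map<String, String> lookup;

  public void configure(JobConf job) {
    synchronized (LookupMapper.class) {
      if (lookup != null)
        return; // already built by an earlier task in this JVM
      try {
        // Files previously added to the cache by the job driver.
        Path[] cached = DistributedCache.getLocalCacheFiles(job);
        lookup = new HashMap<String, String>();
        // ... read cached[0] from local disk and populate the map ...
      } catch (IOException e) {
        throw new RuntimeException("failed to load lookup data", e);
      }
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // ... consult 'lookup' while processing each record ...
  }
}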
Re: Auto-shutdown for EC2 clusters
fyi, the src/contrib/ec2 scripts do just what Paco suggests. minus the static IP stuff (you can use the scripts to login via cluster name, and spawn a tunnel for browsing nodes) that is, you can spawn any number of uniquely named, configured, and sized clusters, and you can increase their size independently as well. (shrinking is another matter altogether) ckw On Oct 24, 2008, at 1:58 PM, Paco NATHAN wrote: Hi Karl, Rather than using separate key pairs, you can use EC2 security groups to keep track of different clusters. Effectively, that requires a new security group for every cluster -- so just allocate a bunch of different ones in a config file, then have the launch scripts draw from those. We also use EC2 static IP addresses and then have a DNS entry named similarly to each security group, associated with a static IP once that cluster is launched. It's relatively simple to query the running instances and collect them according to security groups. One way to handle detecting failures is just to attempt SSH in a loop. Our rough estimate is that approximately 2% of the attempted EC2 nodes fail at launch. So we allocate more than enough, given that rate. In a nutshell, that's one approach for managing a Hadoop cluster remotely on EC2. Best, Paco On Fri, Oct 24, 2008 at 2:07 PM, Karl Anderson [EMAIL PROTECTED] wrote: On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote: This workflow could be initiated from a crontab -- totally automated. However, we still see occasional failures of the cluster, and must restart manually, but not often. Stability for that has improved much since the 0.18 release. For us, it's getting closer to total automation. FWIW, that's running on EC2 m1.xl instances. Same here. I've always had the namenode and web interface be accessible, but sometimes I don't get the slave nodes - usually zero slaves when this happens, sometimes I only miss one or two. My rough estimate is that this happens 1% of the time. I currently have to notice this and restart manually. Do you have a good way to detect it? I have several Hadoop clusters running at once with the same AWS image and SSH keypair, so I can't count running instances. I could have a separate keypair per cluster and count instances with that keypair, but I'd like to be able to start clusters opportunistically, with more than one cluster doing the same kind of job on different data. Karl Anderson [EMAIL PROTECTED] http://monkey.org/~kra -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Auto-shutdown for EC2 clusters
Hey Stuart I did that for a client using Cascading events and SQS. When jobs completed, they dropped a message on SQS where a listener picked up new jobs and ran with them, or decided to kill off the cluster. The currently shipping EC2 scripts are suitable for having multiple simultaneous clusters for this purpose. Cascading has always supported, and now Hadoop supports (thanks Tom), raw file access on S3, so this is quite natural. This is the best approach as data is pulled directly into the Mapper, instead of onto HDFS first, then read into the Mapper from HDFS. YMMV chris On Oct 23, 2008, at 7:47 AM, Stuart Sierra wrote: Hi folks, Anybody tried scripting Hadoop on EC2 to... 1. Launch a cluster 2. Pull data from S3 3. Run a job 4. Copy results to S3 5. Terminate the cluster ... without any user interaction? -Stuart -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Hadoop Training
Hey all Just wanted to make a quick announcement that Scale Unlimited has started delivering its 2 day Hadoop Boot Camp. More info here: http://www.scaleunlimited.com/hadoop-bootcamp.html We currently are offering the classes on-site within the US/UK/EU to those companies needing to get a team up to speed rapidly. But we are working to put together a public class in the Bay Area. Please email me if you are interested, so we can gauge interest. Also, we may be in NY for the NY Hadoop User Group, if any org out there wants to throw together a class during the week of Nov. 10, again, give me a shout. cheers, chris -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Monthly Hadoop User Group Meeting (Bay Area)
Chris K Wensel wrote: doh, conveniently collides with the GridGain and GridDynamics presentations: http://web.meetup.com/66/calendar/8561664/ Bay Area Hadoop User Group meetings are held on the third Wednesday every month. This has been on the calendar for quite a while. Doug maybe I should have said, coincidentally. -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Simple Survey
Quick reminder to take the survey. We know more than a dozen companies are using Hadoop. heh http://www.scaleunlimited.com/survey.html thanks! chris On Sep 8, 2008, at 10:43 AM, Chris K Wensel wrote: Hey all Scale Unlimited is putting together some case studies for an upcoming class and wants to get a snapshot of what the Hadoop user community looks like. If you have 2 minutes, please feel free to take the short anonymous survey below: http://www.scaleunlimited.com/survey.html All results will be public. cheers, chris -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/ -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Simple Survey
how weird. i'll forward this on and see if they can fix it. thanks for letting me know. chris On Sep 9, 2008, at 4:22 PM, John Kane wrote: Unfortunately there is a problem with the survey. I was unable to answer correctly question '9. How much data is stored on your Hadoop cluster (in GB)?' It would not let me enter more than 10TB (we currently have 45TB of data in our cluster; actual data, not a sum of disk used (with all of its replicas) but unique data). Other than that, I tried :-) On Tue, Sep 9, 2008 at 4:01 PM, Chris K Wensel [EMAIL PROTECTED] wrote: Quick reminder to take the survey. We know more than a dozen companies are using Hadoop. heh http://www.scaleunlimited.com/survey.html thanks! chris On Sep 8, 2008, at 10:43 AM, Chris K Wensel wrote: Hey all Scale Unlimited is putting together some case studies for an upcoming class and wants to get a snapshot of what the Hadoop user community looks like. If you have 2 minutes, please feel free to take the short anonymous survey below: http://www.scaleunlimited.com/survey.html All results will be public. cheers, chris -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/ -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/ -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Basic code organization questions + scheduling
If you wrote a simple URL fetcher function for Cascading, you would have a very powerful web crawler that would dwarf Nutch in flexibility. That said, Nutch is optimized for storage, has supporting tools, ranking algorithms, and has been up against some nasty html and other document types. Building a really robust crawler is non-trivial. If I was just starting out and needed to implement a proprietary process, I would use Nutch for fetching raw content, and refreshing it, then use Cascading for parsing, indexing, etc. cheers, chris On Sep 8, 2008, at 12:42 AM, tarjei wrote: Hi Alex (and others). You should take a look at Nutch. It's a search-engine built on Lucene, though it can be set up on top of Hadoop. Take a look: This didn't help me much. Although the description I gave of the basic flow of the app seems to be close to what Nutch is doing (and I've been looking at the Nutch code), the questions are more general and not related to indexing as such, but about code organization. If someone has more input to those, feel free to add it. On Mon, Sep 8, 2008 at 2:54 AM, Tarjei Huse [EMAIL PROTECTED] wrote: Hi, I'm planning to use Hadoop for a set of typical crawler/indexer tasks. The basic flow is: input: array of URLs; actions: 1. get pages, 2. extract new URLs from pages -> start new job, extract text -> index / filter (as new jobs). What I'm considering is how I should build this application to fit into the map/reduce context. I'm thinking that step 1 and 2 should be separate map/reduce tasks that then pipe things on to the next step. This is where I am a bit at a loss to see how it is smart to organize the code in logical units and also how to spawn new tasks when an old one is over. Is the usual way to control the flow of a set of tasks to have an external application running that listens to jobs ending via the endNotificationUri and then spawns new tasks, or should the job itself contain code to create new jobs? Would it be a good idea to use Cascading here? I'm also considering how I should do job scheduling (I have a lot of reoccurring tasks). Has anyone found a good framework for job control of reoccurring tasks or should I plan to build my own using Quartz? Any tips/best practices with regard to the issues described above are most welcome. Feel free to ask further questions if you find my descriptions of the issues lacking. Kind regards, Tarjei Kind regards, Tarjei -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
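[Editor's note: on the "who spawns the next job" question in this thread, the simplest arrangement is a driver program that submits the jobs sequentially, since JobClient.runJob() blocks until a job finishes. A hedged sketch follows; the class, paths, and job names are placeholders, and the path-setting helpers shown are the ones from the later mapred API, so adjust to your Hadoop version.]

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CrawlDriver {
  public static void main(String[] args) throws Exception {
    // Step 1: fetch pages.
    JobConf fetch = new JobConf(CrawlDriver.class);
    fetch.setJobName("fetch-pages");
    FileInputFormat.setInputPaths(fetch, new Path("urls/current"));
    FileOutputFormat.setOutputPath(fetch, new Path("pages/raw"));
    // ... set mapper, reducer, and input/output formats for fetching ...
    JobClient.runJob(fetch); // blocks until the job completes

    // Step 2: extract new URLs and text from the fetched pages.
    JobConf extract = new JobConf(CrawlDriver.class);
    extract.setJobName("extract");
    FileInputFormat.setInputPaths(extract, new Path("pages/raw"));
    FileOutputFormat.setOutputPath(extract, new Path("pages/parsed"));
    // ... set mapper, reducer, and input/output formats for extraction ...
    JobClient.runJob(extract);
  }
}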
Simple Survey
Hey all Scale Unlimited is putting together some case studies for an upcoming class and wants to get a snapshot of what the Hadoop user community looks like. If you have 2 minutes, please feel free to take the short anonymous survey below: http://www.scaleunlimited.com/survey.html All results will be public. cheers, chris -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Monthly Hadoop User Group Meeting (Bay Area)
doh, conveniently collides with the GridGain and GridDynamics presentations: http://web.meetup.com/66/calendar/8561664/ On Sep 8, 2008, at 4:55 PM, Ajay Anand wrote: The next Hadoop User Group (Bay Area) meeting is scheduled for Wednesday, Sept 17th from 6 - 7:30 pm at Yahoo! Mission College, Santa Clara, CA, Building 2, Training Rooms 34. Agenda: Cloud Computing Testbed - Thomas Sandholm, HP Katta on Hadoop - Stefan Groschupf Registration and directions: http://upcoming.yahoo.com/event/1075456/ Look forward to seeing you there! Ajay -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
RandomTextWriter
Hey all Has anyone had success with RandomTextWriter? I'm finding it fairly unstable on 0.16.x, haven't tried 0.17 yet though. chris -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: RandomTextWriter
In local mode, only one mapper succeeds, the remaining never start. And when on a cluster, a handful of mappers die with some exception about not finding the output directory (sorry, don't have the exception handy). I'm upgrading to 0.17.1 to see if they persist. ckw On Jul 7, 2008, at 10:08 AM, Arun C Murthy wrote: On Jul 7, 2008, at 9:46 AM, Chris K Wensel wrote: Hey all Has anyone had success with RandomTextWriter? I'm finding it fairly unstable on 0.16.x, haven't tried 0.17 yet though. What problems are you seeing? It seems to work fine for me... Arun -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: hadoop in the ETL process
If you're referring to loading an RDBMS with data from Hadoop, this is doable, but you will need to write your own JDBC adapters for your tables. But you might review what you are using the RDBMS for and see if those jobs would be better off running on Hadoop entirely, if not for most of the processing. ckw On Jul 2, 2008, at 10:51 AM, David J. O'Dell wrote: Is anyone using hadoop for any part of the ETL process? Given its ability to process large amounts of log files this seems like a good fit. -- David O'Dell Director, Operations e: [EMAIL PROTECTED] t: (415) 738-5152 180 Townsend St., Third Floor San Francisco, CA 94107 -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Using S3 Block FileSystem as HDFS replacement
by editing the hadoop-site.xml, you set the default. but I don't recommend changing the default on EC2. but you can specify the filesystem to use through the URL that references your data (jobConf.addInputPath etc) for a particular job. in the case of the S3 block filesystem, just use a s3:// url. ckw On Jun 30, 2008, at 8:04 PM, slitz wrote: Hello, I've been trying to setup hadoop to use s3 as filesystem, i read in the wiki that it's possible to choose either S3 native FileSystem or S3 Block Filesystem. I would like to use S3 Block FileSystem to avoid the task of manually transferring data from S3 to HDFS every time i want to run a job. I'm still experimenting with EC2 contrib scripts and those seem to be excellent. What i can't understand is how may be possible to use S3 using a public hadoop AMI since from my understanding hadoop-site.xml gets written on each instance startup with the options on hadoop-init, and it seems that the public AMI (at least the 0.17.0 one) is not configured to use S3 at all(which makes sense because the bucket would need individual configuration anyway). So... to use S3 block FileSystem with EC2 i need to create a custom AMI with a modified hadoop-init script right? or am I completely confused? slitz -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
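[Editor's note: a hedged illustration of the per-job approach Chris describes, leaving the cluster's default filesystem alone and pointing one job at the S3 block filesystem via an s3:// URL. The bucket name, paths, and credentials are placeholders; the jobConf.addInputPath call mentioned above was superseded by the FileInputFormat/FileOutputFormat helpers used here, so which one you have depends on your Hadoop version.]

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class S3InputExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(S3InputExample.class);

    // Credentials for the S3 block filesystem (could also live in hadoop-site.xml).
    conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
    conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY");

    // Read from the S3 block filesystem; write to the cluster's default HDFS.
    FileInputFormat.setInputPaths(conf, new Path("s3://my-bucket/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/hadoop/output"));

    // ... set mapper/reducer and submit with JobClient.runJob(conf) ...
  }
}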
Re: Using S3 Block FileSystem as HDFS replacement
How do I put something into the fs? Something like bin/hadoop fs -put input input will not work well since s3 is not the default fs, so I tried to do bin/hadoop fs -put input s3://ID:[EMAIL PROTECTED]/input (and some variations of it) but it didn't work; I always got an error complaining about not having provided the ID/secret for s3. hadoop distcp ... should work with your s3 urls 08/07/01 22:12:55 INFO mapred.FileInputFormat: Total input paths to process : 2 08/07/01 22:12:57 INFO mapred.JobClient: Running job: job_200807012133_0010 08/07/01 22:12:58 INFO mapred.JobClient: map 100% reduce 100% java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062) (...) I tried several times and with the wordcount example but the errors were always the same. unsure. many things could be conspiring against you. leave the defaults in hadoop-site and use distcp to copy things around. that's the simplest I think. ckw
Re: realtime hadoop
On Jun 23, 2008, at 9:54 PM, Matt Kent wrote: Unless you have a significant amount of work to be done, I wouldn't recommend using Hadoop because it's not worth the overhead of launching the jobs and moving the data around. I think part of the tradeoff is having a system that is resilient to failure against work that must get done, regardless of the amount of work. ckw -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Ec2 and MR Job question
well, to answer your last question first, just set the # reducers to zero. but you can't just run reducers without mappers (as far as I know, having never tried). so your local job will need to run identity mappers in order to feed your reducers. http://hadoop.apache.org/core/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html ckw On Jun 14, 2008, at 1:31 PM, Billy Pearson wrote: I have a question someone may have answered here before but I can not find the answer. Assuming I have a cluster of servers hosting a large amount of data, I want to run a large job where the maps take a lot of cpu power to run and the reduces only take a small amount of cpu to run. I want to run the maps on a group of EC2 servers and run the reduces on the local cluster of 10 machines. The problem I am seeing is the map outputs: if I run the maps on EC2 they are stored locally on the instance. What I am looking to do is have the map output files stored in hdfs so I can kill the EC2 instances since I do not need them for the reduces. The only way I can think to do this is to run two jobs: one mapper-only job that stores the output on hdfs, and then a second job to run the reduces from the map outputs stored on the hdfs. Is there a way to make the mappers store the final output in hdfs? -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
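[Editor's note: a hedged sketch of the two-job approach discussed above. The first job sets the reduce count to zero so its map output lands directly in the job's output path; the second job runs IdentityMapper to feed the real reducers. Paths are placeholders, the mapper/reducer classes are left as comments since they are hypothetical, and output key/value classes would still need to be set to match the real mapper.]

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class TwoPhaseJob {
  public static void main(String[] args) throws Exception {
    // Phase 1 (run on EC2): CPU-heavy maps, no reduce phase.
    JobConf mapPhase = new JobConf(TwoPhaseJob.class);
    // mapPhase.setMapperClass(YourExpensiveMapper.class); // hypothetical mapper
    mapPhase.setNumReduceTasks(0); // map output is written straight to the output path
    mapPhase.setOutputFormat(SequenceFileOutputFormat.class);
    FileInputFormat.setInputPaths(mapPhase, new Path("input"));
    FileOutputFormat.setOutputPath(mapPhase, new Path("map-output"));
    JobClient.runJob(mapPhase);

    // Phase 2 (run on the local cluster): identity maps feed the real reducers.
    JobConf reducePhase = new JobConf(TwoPhaseJob.class);
    reducePhase.setMapperClass(IdentityMapper.class);
    // reducePhase.setReducerClass(YourCheapReducer.class); // hypothetical reducer
    reducePhase.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(reducePhase, new Path("map-output"));
    FileOutputFormat.setOutputPath(reducePhase, new Path("final-output"));
    JobClient.runJob(reducePhase);
  }
}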
Re: does anyone have idea on how to run multiple sequential jobs with bash script
:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:155) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220) -- hustlin, hustlin, everyday I'm hustlin -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: does anyone have idea on how to run multiple sequential jobs with bash script
Thanks Ted.. Couple quick comments. At one level Cascading is a MapReduce query planner, just like PIG. Except that the API is for public consumption and fully extensible, whereas in PIG you typically interact with the PigLatin syntax. Subsequently, with Cascading, you can layer your own syntax on top of the API. Currently there is Groovy support (Groovy is used to assemble the work, it does not run on the mappers or reducers). I hear rumors about Jython elsewhere. A couple of Groovy examples (note these are obviously trivial, the dsl can absorb tremendous complexity if need be)... http://code.google.com/p/cascading/source/browse/trunk/cascading.groovy/sample/wordcount.groovy http://code.google.com/p/cascading/source/browse/trunk/cascading.groovy/sample/widefinder.groovy Since Cascading is in part a 'planner', it actually builds internally a new representation from what the developer assembled and renders out the necessary map/reduce jobs (and transparently links them) at runtime. As Hadoop evolves, the planner will incorporate the new features and leverage them transparently. Plus there are opportunities for identifying patterns and applying different strategies (hypothetically map side vs reduce side joins, for one). It is also conceivable (but untried) that different planners can exist to target different systems other than Hadoop (making your code/libraries portable). Much of this is true for PIG as well. http://www.cascading.org/documentation/overview.html Also, Cascading will at some point provide a PIG adapter, allowing PigLatin queries to participate in a larger Cascading 'Cascade' (the topological scheduler). Cascading is great with integration, connecting things outside Hadoop with stuff to be done inside Hadoop. And PIG looks like a great way to concisely represent a complex solution and execute it. There isn't any reason they can't work together (it has always been the intention). The takeaway is that with Cascading and PIG, users do not think in MapReduce. With PIG, you think in PigLatin. With Cascading, you can use the pipe/filter based API, or use your favorite scripting language and build a DSL for your problem domain. Many companies have done similar things internally, but they tend to be nothing more than a scriptable way to write a map/reduce job and glue them together. You still think in MapReduce, which in my opinion doesn't scale well. My (biased) recommendation is this. Build out your application in Cascading. If part of the problem is best represented in PIG, no worries, use PIG and feed and clean up after PIG with Cascading. And if you see a solvable bottleneck, and we can't convince the planner to recognize the pattern and plan better, replace that piece of the process with a custom MapReduce job (or more). Solve your problem first, then optimize the solution, if need be. ckw On Jun 11, 2008, at 5:00 PM, Ted Dunning wrote: Pig is much more ambitious than cascading. Because of the ambitions, simple things got overlooked. For instance, something as simple as computing a file name to load is not possible in pig, nor is it possible to write functions in pig. You can hook to Java functions (for some things), but you can't really write programs in pig. On the other hand, pig may eventually provide really incredible capabilities including program rewriting and optimization that would be incredibly hard to write directly in Java. The point of cascading was simply to make life easier for a normal Java/map-reduce programmer.
It provides an abstraction for gluing together several map-reduce programs and for doing a few common things like joins. Because you are still writing Java (or Groovy) code, you have all of the functionality you always had. But, this same benefit costs you the future in terms of what optimizations are likely to ever be possible. The summary for us (especially 4-6 months ago when we were deciding) is that cascading is good enough to use now and pig will probably be more useful later. On Wed, Jun 11, 2008 at 4:19 PM, Haijun Cao [EMAIL PROTECTED] wrote: I find cascading very similar to pig, do you care to provide your comment here? If map reduce programmers are to go to the next level (scripting/query language), which way to go? -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: does anyone have idea on how to run multiple sequential jobs with bash script
However, for continuous production data processing, hadoop+cascading sounds like a good option. This will be especially true with stream assertions and traps (as mentioned previously, and available in trunk). <grin> I've written workloads for clients that render down to ~60 unique Hadoop map/reduce jobs, all inter-related, from ~10 unique units of work (internally lots of joins, sorts and math). I can't imagine having written them by hand. ckw -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: contrib EC2 with hadoop 0.17
Thanks for the description, Chris. Now that I understand the basic model, I'm starting to see how the configuration is passed to the slaves using the -d option of ec2-run-instances. One config question: on our cluster (hadoop 0.17 with INSTANCE_TYPE=m1.small) the conf/hadoop-default.xml has mapred.reduce.tasks set to 1, and mapred.map.tasks set to 2. From experimenting and reading the FAQ, it looks like those numbers should be higher, unless you have single-machine cluster. Maybe there's something I'm missing, but by upping mapred.map.tasks and mapred.reduce.tasks to 5 and 15 (in our job jar) we're getting much better performance. Is there a reason hadoop-init doesn't build a hadoop-site.xml file with higher or configurable values for these fields? configuration values should be set in conf/hadoop-site.xml. Those particular values you are referring to probably should be set per job and generally don't have anything to do with instance sizes but more to do with cluster size and the job being run. different instance sizes have mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum set accordingly (see hadoop- init), but again might/should be tuned to your application (cpu or io bound). ckw Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
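[Editor's note: for what it's worth, the per-job values discussed above can be set on the JobConf directly rather than in hadoop-site.xml; a small hedged example follows, with placeholder numbers that should be tuned to the cluster and job rather than copied.]

import org.apache.hadoop.mapred.JobConf;

public class TaskCountExample {
  public static void configure(JobConf conf) {
    conf.setNumMapTasks(20);    // only a hint; the actual count follows the input splits
    conf.setNumReduceTasks(10); // taken literally, so size it to the cluster's reduce slots
  }
}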
Re: contrib EC2 with hadoop 0.17
The new scripts do not use the start/stop-all.sh scripts, and thus do not maintain the slaves file. This is so cluster startup is much faster and a bit more reliable (keys do not need to be pushed to the slaves). Also we can grow the cluster lazily just by starting slave nodes. That is, they are mostly optimized for booting a large cluster fast, doing work, then shutting down (allowing for huge short lived clusters, vs a smaller/cheaper long lived one). But it probably would be wise to provide scripts to build/refresh the slaves file, and push keys to slaves, so the cluster can be traditionally maintained, instead of just re-instantiated with new parameters etc. I wonder if these scripts would make sense in general, instead of being ec2 specific? ckw On Jun 7, 2008, at 11:31 AM, Chris Anderson wrote: First of all, thanks to whoever maintains the hadoop-ec2 scripts. They've saved us untold time and frustration getting started with a small testing cluster (5 instances). A question: when we log into the newly created cluster, and run jobs from the example jar (pi, etc) everything works great. We expect our custom jobs will run just as smoothly. However, when we restart the namenodes and tasktrackers by running bin/stop-all.sh on the master, it tries to stop only activity on localhost. Running start-all.sh then boots up a localhost-only cluster (on which jobs run just fine). The only way we've been able to recover from this situation is to use bin/terminate-hadoop-cluster and bin/destroy-hadoop-cluster and then start again from scratch with a new cluster. There must be a simple way to restart the namenodes and jobtrackers across all machines from the master. Also, I think understanding the answer to this question might put a lot more into perspective for me, so I can go on to do more advanced things on my own. Thanks for any assistance / insight! Chris output from stop-all.sh == stopping jobtracker localhost: Warning: Permanently added 'localhost' (RSA) to the list of known hosts. localhost: no tasktracker to stop stopping namenode localhost: no datanode to stop localhost: no secondarynamenode to stop conf files in /usr/local/hadoop-0.17.0 == # cat conf/slaves localhost # cat conf/masters localhost -- Chris Anderson http://jchris.mfdz.com Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: hadoop on EC2
if you use the new scripts in 0.17.0, just run hadoop-ec2 proxy cluster-name this starts a ssh tunnel to your cluster. installing foxy proxy in FF gives you whole cluster visibility.. obviously this isn't the best solution if you need to let many semi trusted users browse your cluster. On May 28, 2008, at 1:22 PM, Andreas Kostyrka wrote: Hi! I just wondered what other people use to access the hadoop webservers, when running on EC2? Ideas that I had: 1.) opening ports 50030 and so on = not good, data goes unprotected over the internet. Even if I could enable some form of authentication it would still plain http. 2.) Some kind of tunneling solution. The problem on this side is that each of my cluster node is in a different subnet, plus the dualism between the internal and external addresses of the nodes. Any hints? TIA, Andreas Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: hadoop on EC2
obviously this isn't the best solution if you need to let many semi trusted users browse your cluster. Actually, it would be much more secure if the tunnel service ran on a trusted server letting your users connect remotely via SOCKS and then browse the cluster. These users wouldn't need any AWS keys etc. Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Read timed out, Abandoning block blk_-5476242061384228962
I'm using the machine running the namenode to run maps as well - could that be a source of my problem? The load is fairly high, essentially no idle time. 8 cores per machine, so I've got 8 maps running. I'm guessing I'd be better off running 80 smaller machines instead of 20 larger ones for the same price, but we haven't been approved for more than 20 instances yet. Given that I'm not seeing any idle time, I'm assuming that I'm CPU not IO-bound. fwiw, I have not found the large or xlarge EC2 instances proportionally faster with Hadoop. Thus we run many small instances more cheaply. btw, the email notifying you that you have been approved may lag the actual approval (mine did for days). might be worth trying a larger cluster to see. Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Read timed out, Abandoning block blk_-5476242061384228962
Hi James Were you able to start all the nodes in the same 'availability zone'? You using the new AMI kernels? If you are using the contrib/ec2 scripts, you might upgrade (just the scripts) to http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.17/src/contrib/ec2/ These support the new kernels and availability zones. My transient errors went away when upgrading. The functional changes are documented here: http://wiki.apache.org/hadoop/AmazonEC2 fyi, you will need to build your own images (via the create-image command) with whatever version of Hadoop you are comfortable with. this will also get you a Ganglia install... ckw On May 7, 2008, at 1:29 PM, James Moore wrote: What is this bit of the log trying to tell me, and what sorts of things should I be looking at to make sure it doesn't happen? I don't think the network has any basic configuration issues - I can telnet from the machine creating this log to the destination - telnet 10.252.222.239 50010 works fine when I ssh in to the box with this error. 2008-05-07 13:20:31,194 INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: Read timed out 2008-05-07 13:20:31,194 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_-5476242061384228962 2008-05-07 13:20:31,196 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find target node: 10.252.222.239:50010 I'm seeing a fair number of these. My reduces finally complete, but there are usually a couple at the end that take longer than I think they should, and they frequently have these sorts of errors. I'm running 20 machines on ec2 right now, with hadoop version 0.16.4. -- James Moore | [EMAIL PROTECTED] blog.restphone.com Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Groovy Scripting for Hadoop
Hey all Just wanted to let interested parties know we just released 0.1.0 of our Groovy 'builder' extension. We think this will be a great tool for those groups that need to expose Hadoop to the 'casual' user who needs to get and manipulate valuable data on a Hadoop cluster, but doesn't have the time to learn Java, the Hadoop API, or to think in MapReduce to solve problems that are a notch or more above trivial. It is worthy of mentioning here that no Groovy code is run in the cluster (on the slave nodes). Groovy is only being used as a configuration language to allow for the assembly of complex workflows to be run on Hadoop. An introduction: http://www.cascading.org/documentation/groovy.html Links to samples included in the distro: The canonical word count example (.groovy) http://tinyurl.com/6gp8xp Or a wide finder example: http://tinyurl.com/5korhj These examples don't show it, but splits and joins are fully supported (as they are in Cascading). Further, local libraries can be used, but there is still work to do to make this transparent. Please feel free to join our mail-list and post feedback. http://groups.google.com/group/cascading-user cheers, ckw Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Best practices for handling many small files
are the files to be stored on HDFS long term, or do they need to be fetched from an external authoritative source? depending on how things are setup in your datacenter etc... you could aggregate them into a fat sequence file (or a few). keep in mind how long it would take to fetch the files and aggregate them (this is a serial process) and if the corpus changes often (how often will you need to make these sequence files). another option is to make a manifest (list of docs to fetch), feed that to your mapper and have it fetch each file individually. this would be useful if the corpus is reasonably arbitrary between runs and could eliminate much of the load time. but painful if the data is external to your datacenter and the cost to refetch is high. there really is no simple answer.. ckw On Apr 23, 2008, at 9:16 AM, Joydeep Sen Sarma wrote: million map processes are horrible. aside from overhead - don't do it if u share the cluster with other jobs (all other jobs will get killed whenever the million map job is finished - see https://issues.apache.org/jira/browse/HADOOP-2393) well - even for #2 - it begs the question of how the packing itself will be parallelized .. There's a MultiFileInputFormat that can be extended - that allows processing of multiple files in a single map job. it needs improvement. For one - it's an abstract class - and a concrete implementation for (at least) text files would help. also - the splitting logic is not very smart (from what i last saw). ideally - it should take the million files and form it into N groups (say N is size of your cluster) where each group has files local to the Nth machine and then process them on that machine. currently it doesn't do this (the groups are arbitrary). But it's still the way to go .. -Original Message- From: [EMAIL PROTECTED] on behalf of Stuart Sierra Sent: Wed 4/23/2008 8:55 AM To: core-user@hadoop.apache.org Subject: Best practices for handling many small files Hello all, Hadoop newbie here, asking: what's the preferred way to handle large (~1 million) collections of small files (10 to 100KB) in which each file is a single record? 1. Ignore it, let Hadoop create a million Map processes; 2. Pack all the files into a single SequenceFile; or 3. Something else? I started writing code to do #2, transforming a big tar.bz2 into a BLOCK-compressed SequenceFile, with the file names as keys. Will that work? Thanks, -Stuart, altlaw.org Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
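[Editor's note: a hedged sketch of option #2 from Stuart's question above: packing many small local files into one BLOCK-compressed SequenceFile, with the file name as key and the raw bytes as value. The "docs" directory and output path are placeholders, and error handling is kept minimal.]

import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("packed.seq"),
        Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);

    try {
      for (File file : new File("docs").listFiles()) {
        byte[] contents = new byte[(int) file.length()];
        FileInputStream in = new FileInputStream(file);
        try {
          IOUtils.readFully(in, contents, 0, contents.length);
        } finally {
          in.close();
        }
        // one record per small file: name -> bytes
        writer.append(new Text(file.getName()), new BytesWritable(contents));
      }
    } finally {
      writer.close();
    }
  }
}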
Re: adding nodes to an EC2 cluster
Stephen Check out the patch in HADOOP-2410 to the contrib/ec2 scripts https://issues.apache.org/jira/browse/HADOOP-2410 (just grab the ec2.tgz attachment) these scripts allow you to dynamically grow your cluster plus some extra goodies. you will need to use them to build your own AMI, they are not compatible with the hadoop-images AMIs. using them should be self-explanatory. ckw On Apr 15, 2008, at 7:03 PM, Stephen J. Barr wrote: Hello, Does anyone have any experience adding nodes to a cluster running on EC2? If so, is there some documentation on how to do this? Thanks, -stephen Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: adding nodes to an EC2 cluster
I'm unsure of your particular problem. but the scripts/patch I referenced previously remove any dependency on DynDNS. the recipe would be something like... make a s3 bucket and update hadoop-ec2-env.sh make an image: hadoop-ec2 create-image make a 2 node (3 machine) cluster: hadoop-ec2 launch-cluster test-group 2 upload your jar file hadoop-ec2 push test-group /path/to/your.jar login: hadoop-ec2 login test-group run your jar hadoop jar your.jar kill your cluster when done: hadoop-ec2 terminate-cluster test-group test-group is the name of the ec2 hadoop cluster group. these scripts support many concurrent groups. that all said, these scripts are untested with cygwin on windows.. work great on a mac though. On Apr 15, 2008, at 8:47 PM, Prerna Manaktala wrote: Hey Can u solve my problem also. I am badly stuck up I tried to set up hadoop with cygwin according to the paper:http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873 But I had problems working with dyndns.I created a new host there:prerna.dyndns.org and gave the ip address of it in hadoop-ec2-env.sh as a value of MASTER_HOST. But when I do bin/hadoop-ec2 start-hadoop the error comes: ssh:connect to host prerna.dyndns.org,port 22:connection refused ssh failed for [EMAIL PROTECTED] Also a warning comes that:id_rsa_gsg-keypair not accesible:No such file or directory though there is this file. As of now the status is: I am working with the EC2 environment. I registered and am being billed for EC2 and S3. Right now I have two cygwin windows open. 1 is as an administrator-server(on which sshd running) in which I have a separate folder for hadoop files and am able to do bin/hadoop 1 as a normal user-client. In the client there is no hadoop folder and hence I cant run bin/ hadoop. From here I do not know how to proceed? I basically want to implement http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873 . Hence I created a host using dyndns. If you can help me,it will be great. On Tue, Apr 15, 2008 at 11:30 PM, Chris K Wensel [EMAIL PROTECTED] wrote: it's pretty simple. just create an s3 bucket, note the bucket name in the hadoop-ec2.sh file (with the other necessary values). then call hadoop-ec2 create-image i'll try and update the wiki soon. ckw On Apr 15, 2008, at 7:28 PM, Stephen J. Barr wrote: Thank you. I will check that out. I haven't built an AMI before. Hopefully it isn't too complicated, as it is easy to use the pre-built AMI's. -stephen Chris K Wensel wrote: Stephen Check out the patch in Hadoop-2410 to the contrib/ec2 scripts https://issues.apache.org/jira/browse/HADOOP-2410 (just grab the ec2.tgz attachment) these scripts allow you do dynamically grow your cluster plus some extra goodies. you will need to use them to build your own ami, they are not compatible with the hadoop-images amis. using them should be self explanatory. ckw On Apr 15, 2008, at 7:03 PM, Stephen J. Barr wrote: Hello, Does anyone have any experience adding nodes to a cluster running on EC2? If so, is there some documentation on how to do this? Thanks, -stephen Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/ Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/ Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Hadoop performance on EC2?
What does ganglia show for load and network? You should also be able to see gc stats (count and time). Might help as well. fyi, running hadoop-ec2 proxy cluster-name will both setup a socks tunnel and list available urls you can cut/ paste into your browser. one of the urls is for the ganglia interface. On Apr 11, 2008, at 2:01 PM, Nate Carlson wrote: On Wed, 9 Apr 2008, Chris K Wensel wrote: make sure all nodes are running in the same 'availability zone', http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1347 check! and that you are using the new xen kernels. http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1353categoryID=101 http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1354categoryID=101 check! also, make sure each node is addressing its peers via the ec2 private addresses, not the public ones. check! there is a patch in jira for the ec2/contrib scripts that address these issues. https://issues.apache.org/jira/browse/HADOOP-2410 if you use those scripts, you will be able to see a ganglia display showing utilization on the machines. 8/7 map/reducers sounds like alot. Reduced - I dropped it to 3/2 for testing. I am using these scripts now, and am still seeing very poor performance on EC2 compared to my development environment. ;( I'll be capturing some more extensive stats over the weekend, and see if I can glean anything useful... | nate carlson | [EMAIL PROTECTED] | http:// www.natecarlson.com | | depriving some poor village of its idiot since 1981| Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Hadoop performance on EC2?
a few things.. make sure all nodes are running in the same 'availability zone', http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1347 and that you are using the new xen kernels. http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1353&categoryID=101 http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1354&categoryID=101 also, make sure each node is addressing its peers via the ec2 private addresses, not the public ones. there is a patch in jira for the ec2/contrib scripts that addresses these issues. https://issues.apache.org/jira/browse/HADOOP-2410 if you use those scripts, you will be able to see a ganglia display showing utilization on the machines. 8/7 map/reducers sounds like a lot. ymmv On Apr 9, 2008, at 7:07 PM, Nate Carlson wrote: Hey all, We've got a job that we're running in both a development environment and out on EC2. I've been rather displeased with the performance on EC2, and was curious whether the results we've been seeing are similar to other people's, or whether I've got something misconfigured. ;) In both environments, the load on the master node is around 1-1.5, and the load on the slave nodes is around 8-10. I have also tried cranking up the JVM memory on the EC2 nodes (since we have RAM to burn), with very little performance difference. Basically, the job takes about 3.5 hours on development, but takes 15 hours on EC2. The portion that takes all the time is not dependent on any external hosts - just the MySQL server on the master node. I benchmarked the VCPUs between our dev environment and EC2, and they are about equivalent. I would expect EC2 to take 1.5x as long, since there is one less CPU per slave, but it's taking much longer than that. Appreciate any tips! Similarities between the environments:
- 1 master node, 2 slave nodes
- 1 mapper and reducer on the master, 8 mappers and 7 reducers on the slaves
- Hadoop 0.16.2
- Local HDFS storage (we were using S3 on Amazon before, and I switched to local storage)
- MySQL database running on the master node
- Xen VMs in both environments (our own Xen for dev, Amazon's for EC2)
- Debian Etch 64-bit OS; 64-bit JVM
Development master node configuration:
- 4x VCPUs (Xeon E5335 2GHz)
- 3GB memory
- 4GB swap
Development slave node configuration:
- 3x VCPUs (Xeon E5335 2GHz)
- 2GB memory
- 4GB swap
EC2 configuration (Large instance type):
- 2x VCPUs (Opteron 2GHz)
- 8GB memory
- 4GB swap
- All nodes running in the same availability zone
| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com | | depriving some poor village of its idiot since 1981 | Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
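A quick way to check the "private addresses" point above; a sketch assuming the classic ec2-api-tools and a stock conf/ layout (file names are placeholders for wherever your master and slave addresses actually live):

    # list instances; the output shows both the public (ec2-*.amazonaws.com)
    # and private (ip-*.ec2.internal / domU-*.compute-1.internal) DNS names
    ec2-describe-instances | grep ^INSTANCE

    # the Hadoop config and slaves file should only contain the private names;
    # any hit here means a node is addressing a peer over its public address
    grep -H "amazonaws.com" conf/hadoop-site.xml conf/slaves conf/masters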
Re: EC2 contrib scripts
make that xen kernels.. btw, they scale much better (see previous post) under heavy load. So now instead of timeouts and dropped connections, jvm instances exit prematurely. unsure of the cause of this just yet, but they are so few that the impact is negligible. ckw On Mar 28, 2008, at 10:00 AM, Chris K Wensel wrote: Hey all, I pushed up a patch (and tar) for the ec2 contrib scripts that provides support for instance sizes, new zen kernels, availability zones, concurrent clusters, resizing, ganglia, etc. the patch can be found here: https://issues.apache.org/jira/browse/HADOOP-2410 I use these daily, but it is likely wise for others to give them a shot to make sure they work for someone besides me. Since I can't publish to the hadoop-images bucket, you will need to build your own image stored in your own bucket. This only takes a few minutes. Feedback against the JIRA issue is best. cheers, ckw Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/ Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Performance / cluster scaling question
FYI, just ran a 50 node cluster using one of the new kernels for Fedora with all nodes forced onto the same 'availability zone' and there were no timeouts or failed writes. On Mar 27, 2008, at 4:16 PM, Chris K Wensel wrote: If it's any consolation, I'm seeing similar behaviors on 0.16.0 when running on EC2 when I push the number of nodes in the cluster past 40. On Mar 24, 2008, at 6:31 AM, André Martin wrote: Thanks for the clarification, dhruba :-) Anyway, what can cause those other exceptions, such as "Could not get block locations" and "DataXceiver: java.io.EOFException"? Can anyone give me a little more insight about those exceptions? And does anyone have a similar workload (frequent writes and deletion of small files), and what could cause the performance degradation (see first post)? I think HDFS should be able to handle two million and more files/blocks... Also, I observed that some of my datanodes do not heartbeat to the namenode for several seconds (up to 400 :-() from time to time - when I check those specific datanodes and do a top, I see the du command running and it seems to have gotten stuck?!? Thanks and Happy Easter :-) Cu on the 'net, Bye - bye, André dhruba Borthakur wrote: The namenode lazily instructs a Datanode to delete blocks. As a response to every heartbeat from a Datanode, the Namenode instructs it to delete a maximum of 100 blocks. Typically, the heartbeat periodicity is 3 seconds. The heartbeat thread in the Datanode deletes the block files synchronously before it can send the next heartbeat. That's the reason a small number (like 100) was chosen. If you have 8 datanodes, your system will probably delete about 800 blocks every 3 seconds. Thanks, dhruba -Original Message- From: André Martin [mailto:[EMAIL PROTECTED]] Sent: Friday, March 21, 2008 3:06 PM To: core-user@hadoop.apache.org Subject: Re: Performance / cluster scaling question After waiting a few hours (without having any load), the block number and DFS Used space seem to go down... My question is: is the hardware simply too weak/slow to send the block deletion requests to the datanodes in a timely manner, or do those crappy HDDs simply cause the delay? I noticed that it can take up to 40 minutes to delete ~400,000 files at once manually using rm -r... Actually, my main concern is why the performance, i.e. the throughput, goes down - any ideas? Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
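To put rough numbers on dhruba's explanation, here is a back-of-the-envelope sketch. The 100-blocks-per-heartbeat and 3-second heartbeat are the values quoted above; the two-million-block figure is the ballpark André mentions, so treat the result as an order-of-magnitude estimate only:

    // Rough drain-time estimate for lazy block deletion.
    public class BlockDeletionEstimate {
        public static void main(String[] args) {
            long blocksToDelete = 2000000L;  // ~2M blocks, per the thread above
            int datanodes = 8;
            int blocksPerHeartbeat = 100;    // max deletions per datanode per heartbeat
            int heartbeatSeconds = 3;

            double blocksPerSecond = (double) datanodes * blocksPerHeartbeat / heartbeatSeconds;
            double hours = blocksToDelete / blocksPerSecond / 3600.0;
            // ~266 blocks/s for 8 nodes -> roughly 2 hours to drain 2M blocks,
            // which lines up with "DFS Used goes down after a few hours".
            System.out.printf("%.0f blocks/s, ~%.1f hours to drain%n", blocksPerSecond, hours);
        }
    }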
Re: Hadoop on EC2 for large cluster
you can't do this with the contrib/ec2 scripts/ami. but passing the master private dns name to the slaves on boot as 'user- data' works fine. when a slave starts, it contacts the master and joins the cluster. there isn't any need for a slave to rsync from the master, thus removing the dependency on them having the private key. and not using the start|stop-all scripts, you don't need to maintain the slaves file, and can thus lazily boot your cluster. to do this, you will need to create your own AMI that works this way. not hard, just time consuming. On Mar 20, 2008, at 11:56 AM, Prasan Ary wrote: Chris, What do you mean when you say boot the slaves with the master private name ? === Chris K Wensel [EMAIL PROTECTED] wrote: I found it much better to start the master first, then boot the slaves with the master private name. i do not use the start|stop-all scrips, so i do not need to maintain the slaves file. thus i don't need to push private keys around to support those scripts. this lets me start 20 nodes, then add 20 more later. or kill some. btw, get ganglia installed. life will be better knowing what's going on. also, setting up FoxyProxy on firefox lets you browse your whole cluster if you setup a ssh tunnel (socks). On Mar 20, 2008, at 10:15 AM, Prasan Ary wrote: Hi All, I have been trying to configure Hadoop on EC2 for large number of clusters ( 100 plus). It seems that I have to copy EC2 private key to all the machines in the cluster so that they can have SSH connections. For now it seems I have to run a script to copy the key file to each of the EC2 instances. I wanted to know if there is a better way to accomplish this. Thanks, PA - Never miss a thing. Make Yahoo your homepage. Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ - Looking for last minute shopping deals? Find them fast with Yahoo! Search. Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/
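A sketch of the "pass the master's private name as user-data" approach Chris describes, outside the contrib scripts. The AMI ID, keypair, group, and instance type are placeholders, the flags are from the classic ec2-api-tools, and the metadata URL is the standard EC2 user-data endpoint; the boot-script line is only an illustration of how a custom AMI might consume the value:

    # start the master first, then note its private *.ec2.internal name
    ec2-run-instances ami-XXXXXXXX -k gsg-keypair -g my-hadoop-group -t m1.large
    ec2-describe-instances

    # boot each slave with the master's private name passed as user-data
    ec2-run-instances ami-XXXXXXXX -k gsg-keypair -g my-hadoop-group -t m1.large \
        -d "MASTER_HOST=ip-10-x-x-x.ec2.internal"

    # in the slave AMI's boot script, read the value back from the metadata service
    # and drop it into hadoop-site.xml before starting the datanode/tasktracker:
    #   MASTER_HOST=$(wget -q -O - http://169.254.169.254/latest/user-data | cut -d= -f2)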
Re: copy - sort hanging
tried a new run, and it turns out that there are lots of failed connections here and there, but the cluster seizing seems to have different origins. I have one job using up all the available mapper tasks, and a series of queued jobs that will output data to a custom FileSystem. Note speculative execution is off. When the first job's mappers 'finish', it immediately kills 160 of them, but only one shows an error: "Already completed TIP", repeated ~40 times. The copy/sort hangs at this point. The next jobs in the queue look like they attempt to initialize their mappers, but a ClassNotFoundException is thrown when it cannot find my custom FileSystem in the classpath. This FileSystem was used about 10 jobs earlier as input to the mappers; it is now the output of future reducers. See trace below. I'm wondering if getTaskOutputPath should eat Throwable (instead of just IOException), since the only thing happening here is the path being made fully qualified. 2008-03-14 12:59:28,053 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 50002, call heartbeat([EMAIL PROTECTED], false, true, 1519) from 10.254.87.130:47919: error: java.io.IOException: java.lang.RuntimeException: java.lang.ClassNotFoundException: cascading.tap.hadoop.S3HttpFileSystem java.io.IOException: java.lang.RuntimeException: java.lang.ClassNotFoundException: cascading.tap.hadoop.S3HttpFileSystem at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:607) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:161) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175) at org.apache.hadoop.mapred.Task.getTaskOutputPath(Task.java:195) at org.apache.hadoop.mapred.Task.setConf(Task.java:400) at org.apache.hadoop.mapred.TaskInProgress.getTaskToRun(TaskInProgress.java:733) at org.apache.hadoop.mapred.JobInProgress.obtainNewMapTask(JobInProgress.java:568) at org.apache.hadoop.mapred.JobTracker.getNewTaskForTaskTracker(JobTracker.java:1409) at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1191) at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) On Mar 13, 2008, at 4:59 PM, Chris K Wensel wrote: I don't really have these logs as I've bounced my cluster, but am willing to ferret out anything in particular on my next failed run. On Mar 13, 2008, at 4:32 PM, Raghu Angadi wrote: Yeah, it's kind of hard to deal with these failures once they start occurring. Are all these logs from the same datanode? Could you separate logs from different datanodes? If the first exception stack is while replicating a block (as opposed to the initial write), then http://issues.apache.org/jira/browse/HADOOP-3007 would help there, i.e. a failure on the next datanode should not affect this datanode; you still need to check why the remote datanode failed. Another problem is that once a DataNode fails to write a block, the same block can not be written to this node for the next one hour. These are the "can not be written to" errors you see below. We should really fix this. I will file a jira. Raghu. Chris K Wensel wrote: here is a reset, followed by three attempts to write the block. 
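For the record, a sketch of the change being floated above. This is not the actual Hadoop 0.16 source, just an illustration of catching Throwable where the output path is only being made fully qualified, under the assumption that falling back to the unqualified path is acceptable when the output FileSystem class cannot be loaded on the JobTracker:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TaskOutputPathSketch {
        // Qualify the task output path, but don't let a missing FileSystem
        // implementation (a job-specific class not on the JobTracker's
        // classpath) blow up the heartbeat/getTaskToRun path.
        static String getTaskOutputPath(Path outputPath, Configuration conf) {
            try {
                FileSystem fs = outputPath.getFileSystem(conf);
                return outputPath.makeQualified(fs).toString();
            } catch (Throwable t) {  // was effectively: catch (IOException ioe)
                // the ClassNotFoundException arrives wrapped in a RuntimeException
                // inside an IOException; qualification is cosmetic, so fall back
                return outputPath.toString();
            }
        }
    }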
2008-03-13 13:40:06,892 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.26.3:35762 dest: /10.251.26.3:50010 2008-03-13 13:40:06,957 INFO org.apache.hadoop.dfs.DataNode: Exception in receiveBlock for block blk_7813471133156061911 java.net.SocketException: Connection reset 2008-03-13 13:40:06,957 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.net.SocketException: Connection reset 2008-03-13 13:40:06,958 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.net.SocketException: Connection reset at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96) at java.net.SocketOutputStream.write(SocketOutputStream.java:136) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java: 65) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java: 123) at java.io.DataOutputStream.flush(DataOutputStream.java:106) at org.apache.hadoop.dfs.DataNode $BlockReceiver.receivePacket(DataNode.java:2194) at org.apache.hadoop.dfs.DataNode $BlockReceiver.receiveBlock(DataNode.java:2244) at org.apache.hadoop.dfs.DataNode $DataXceiver.writeBlock(DataNode.java:1150) at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java: 938) at java.lang.Thread.run(Thread.java:619) 2008-03-13 13:40:11,751 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.27.148:48384 dest: /10.251.27.148:50010 2008-03-13 13:40:11,752 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though
Re: copy - sort hanging
here is a reset, followed by three attempts to write the block. 2008-03-13 13:40:06,892 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.26.3:35762 dest: / 10.251.26.3:50010 2008-03-13 13:40:06,957 INFO org.apache.hadoop.dfs.DataNode: Exception in receiveBlock for block blk_7813471133156061911 java.net.SocketException: Connection reset 2008-03-13 13:40:06,957 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.net.SocketException: Connection reset 2008-03-13 13:40:06,958 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.net.SocketException: Connection reset at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96) at java.net.SocketOutputStream.write(SocketOutputStream.java:136) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java: 65) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123) at java.io.DataOutputStream.flush(DataOutputStream.java:106) at org.apache.hadoop.dfs.DataNode $BlockReceiver.receivePacket(DataNode.java:2194) at org.apache.hadoop.dfs.DataNode $BlockReceiver.receiveBlock(DataNode.java:2244) at org.apache.hadoop.dfs.DataNode $DataXceiver.writeBlock(DataNode.java:1150) at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938) at java.lang.Thread.run(Thread.java:619) 2008-03-13 13:40:11,751 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.27.148:48384 dest: / 10.251.27.148:50010 2008-03-13 13:40:11,752 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created. 2008-03-13 13:40:11,752 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created. at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:638) at org.apache.hadoop.dfs.DataNode$BlockReceiver.init(DataNode.java: 1983) at org.apache.hadoop.dfs.DataNode $DataXceiver.writeBlock(DataNode.java:1074) at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938) at java.lang.Thread.run(Thread.java:619) 2008-03-13 13:48:37,925 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.70.210:37345 dest: / 10.251.70.210:50010 2008-03-13 13:48:37,925 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created. 2008-03-13 13:48:37,925 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created. 
at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:638) at org.apache.hadoop.dfs.DataNode$BlockReceiver.init(DataNode.java: 1983) at org.apache.hadoop.dfs.DataNode $DataXceiver.writeBlock(DataNode.java:1074) at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938) at java.lang.Thread.run(Thread.java:619) 2008-03-13 14:08:36,089 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.26.223:49176 dest: / 10.251.26.223:50010 2008-03-13 14:08:36,089 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created. 2008-03-13 14:08:36,089 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created. at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:638) at org.apache.hadoop.dfs.DataNode$BlockReceiver.init(DataNode.java: 1983) at org.apache.hadoop.dfs.DataNode $DataXceiver.writeBlock(DataNode.java:1074) at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938) at java.lang.Thread.run(Thread.java:619) On Mar 13, 2008, at 11:25 AM, Chris K Wensel wrote: should add that 10.251.65.207 (receiving end of NameSystem.pendingTransfer below) has this datanode log entry. 2008-03-13 14:08:36,089 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created. 2008-03-13 14:08:36,089 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.io.IOException: Block
Re: copy - sort hanging
I don't really have these logs as i've bounce my cluster. But am willing to ferret out anything in particular on my next failed run. On Mar 13, 2008, at 4:32 PM, Raghu Angadi wrote: Yeah, its kind of hard to deal with these failure once they start occurring. Are all these logs from the same datnode? Could you separate logs from different datanodes? If the first exception stack is while replicating a block (as opposed to initial write), then http://issues.apache.org/jira/browse/HADOOP-3007 would help there. i.e. failure on next datanode should not affect this datanode, you still need to check why the remote datanode failed. Another problem is that once DataNode fails to write a block, the same back can not be written to this node for next one hour. These are the can not be written to errors you see below. We should really fix this. I will file a jira. Raghu. Chris K Wensel wrote: here is a reset, followed by three attempts to write the block. 2008-03-13 13:40:06,892 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.26.3:35762 dest: /10.251.26.3:50010 2008-03-13 13:40:06,957 INFO org.apache.hadoop.dfs.DataNode: Exception in receiveBlock for block blk_7813471133156061911 java.net.SocketException: Connection reset 2008-03-13 13:40:06,957 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.net.SocketException: Connection reset 2008-03-13 13:40:06,958 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.net.SocketException: Connection reset at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96) at java.net.SocketOutputStream.write(SocketOutputStream.java:136) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java: 65) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java: 123) at java.io.DataOutputStream.flush(DataOutputStream.java:106) at org.apache.hadoop.dfs.DataNode $BlockReceiver.receivePacket(DataNode.java:2194) at org.apache.hadoop.dfs.DataNode $BlockReceiver.receiveBlock(DataNode.java:2244) at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java: 1150) at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java: 938) at java.lang.Thread.run(Thread.java:619) 2008-03-13 13:40:11,751 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.27.148:48384 dest: /10.251.27.148:50010 2008-03-13 13:40:11,752 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created. 2008-03-13 13:40:11,752 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created. 
at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java: 638) at org.apache.hadoop.dfs.DataNode $BlockReceiver.init(DataNode.java:1983) at org.apache.hadoop.dfs.DataNode $DataXceiver.writeBlock(DataNode.java:1074) at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java: 938) at java.lang.Thread.run(Thread.java:619) 2008-03-13 13:48:37,925 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.70.210:37345 dest: /10.251.70.210:50010 2008-03-13 13:48:37,925 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created. 2008-03-13 13:48:37,925 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created. at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java: 638) at org.apache.hadoop.dfs.DataNode $BlockReceiver.init(DataNode.java:1983) at org.apache.hadoop.dfs.DataNode $DataXceiver.writeBlock(DataNode.java:1074) at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java: 938) at java.lang.Thread.run(Thread.java:619) 2008-03-13 14:08:36,089 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.26.223:49176 dest: /10.251.26.223:50010 2008-03-13 14:08:36,089 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created. 2008-03-13 14:08:36,089 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created
Re: S3/EC2 setup problem: port 9001 unreachable
Andreas, Here are some moderately useful notes on using EC2/S3, mostly learned leveraging Hadoop. The "groups can't see themselves" issue is listed <grin>. http://www.manamplified.org/archives/2008/03/notes-on-using-ec2-s3.html enjoy, ckw On Mar 10, 2008, at 9:51 AM, Andreas Kostyrka wrote: Found it, was a security group setup problem ;( Andreas On Monday, 10.03.2008, at 16:49 +0100, Andreas Kostyrka wrote: Hi! I'm trying to set up a Hadoop 0.16.0 cluster on EC2/S3. (Manually, not using the Hadoop AMIs.) I've got the S3-based HDFS working, but I'm stumped when I try to get a test job running:

[EMAIL PROTECTED]:~/hadoop-0.16.0$ time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -mapper /tmp/test.sh -reducer cat -input testlogs/* -output testlogs-output
additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/tmp/hadoop-hadoop/hadoop-unjar17969/] [] /tmp/streamjob17970.jar tmpDir=null
08/03/10 14:01:28 INFO mapred.FileInputFormat: Total input paths to process : 152
08/03/10 14:02:58 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hadoop/mapred/local]
08/03/10 14:02:58 INFO streaming.StreamJob: Running job: job_200803101400_0001
08/03/10 14:02:58 INFO streaming.StreamJob: To kill this job, run:
08/03/10 14:02:58 INFO streaming.StreamJob: /home/hadoop/hadoop-0.16.0/bin/../bin/hadoop job -Dmapred.job.tracker=ec2-67-202-58-97.compute-1.amazonaws.com:9001 -kill job_200803101400_0001
08/03/10 14:02:58 INFO streaming.StreamJob: Tracking URL: http://ip-10-251-75-165.ec2.internal:50030/jobdetails.jsp?jobid=job_200803101400_0001
08/03/10 14:02:59 INFO streaming.StreamJob: map 0% reduce 0%

Furthermore, when I try to connect to port 9001 on 10.251.75.165 via telnet from the master host itself, it connects:

[EMAIL PROTECTED]:~/hadoop-0.16.0$ telnet 10.251.75.165 9001
Trying 10.251.75.165...
Connected to 10.251.75.165.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

When I try to do this from other VMs in my cluster, it just hangs (tcpdump on the master host shows no activity for tcp port 9001):

[EMAIL PROTECTED]:~/hadoop-0.16.0$ telnet ip-10-251-75-165.ec2.internal 9001
Trying 10.251.75.165...
[EMAIL PROTECTED]:~/hadoop-0.16.0$ telnet ip-10-251-75-165.ec2.internal 22
Trying 10.251.75.165...
Connected to ip-10-251-75-165.ec2.internal.
Escape character is '^]'.
SSH-2.0-OpenSSH_4.3p2 Debian-9
^]
telnet> quit
Connection closed.

The same thing shows up when I connect to port 50030, which shows 0 nodes ready to process the job. Furthermore, the slaves show the following messages:

2008-03-10 15:30:11,455 INFO org.apache.hadoop.ipc.RPC: Problem connecting to server: ec2-67-202-58-97.compute-1.amazonaws.com/10.251.75.165:9001
2008-03-10 15:31:12,465 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ec2-67-202-58-97.compute-1.amazonaws.com/10.251.75.165:9001. Already tried 1 time(s).
2008-03-10 15:32:13,475 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ec2-67-202-58-97.compute-1.amazonaws.com/10.251.75.165:9001. Already tried 2 time(s).

Last but not least, here is my site conf:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>s3://lookhad</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>2DFGTTFSDFDSZU5SDSD7S5202</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>RUWgsdfsd67SFDfsdflaI9Gjzfsd8789ksd2r1PtG</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>ec2-67-202-58-97.compute-1.amazonaws.com:9001</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
</configuration>

The master node is listening, and not just on localhost:

[EMAIL PROTECTED]:~/hadoop-0.16.0$ netstat -an | grep 9001
tcp        0      0 10.251.75.165:9001      0.0.0.0:*               LISTEN

Any ideas? My conclusions so far are:
1.) First, it's not a general connectivity problem, because I can connect to port 22 without any problems.
2.) OTOH, on port 9001, inside the same group, the connectivity seems to be limited.
3.) All AWS docs tell me that VMs in one group have no firewalls in place.
So what is happening here? Any ideas? Andreas Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/
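The "security group setup problem" Andreas found is typically fixed by opening the group to itself (the "groups can't see themselves" issue from Chris's notes). A sketch using the classic ec2-api-tools, with the group name and AWS account ID as placeholders:

    # allow instances in the group to reach each other on all ports
    ec2-authorize my-hadoop-group -o my-hadoop-group -u <your-aws-account-id>

    # or, more narrowly, just the NameNode/JobTracker RPC ports from the EC2 private range
    ec2-authorize my-hadoop-group -P tcp -p 9000-9001 -s 10.0.0.0/8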
Re: Question about using the metrics framework.
I have ganglia up on my cluster, and I definitely see some metrics from the map/reduce tasks. But I don't see anything from the JVM context for ganglia except from the master node. but i have gmetad running on the master and reporting from a local gmond seems flakey. On Mar 6, 2008, at 2:03 PM, Jason Venner wrote: We have started our first attempt at this, and do not see the metrics being reported. Our first cut simply is trying to report the counters at the end of the job. A theory is that the job is exiting before the metrics are flushed. This code is in the driver for our map/reduce task, and is called after the JobClient.runJob() method returns. private ContextFactory contextFactory = null; private MetricsContext myContext = null; private MetricsRecord randomizeRecord = null; { try { contextFactory = ContextFactory.getFactory(); myContext = contextFactory.getContext(myContext); randomizeRecord = myContext.createRecord( MyContextName ); } catch( Exception e ) { LOG.error( Unable to initialize metrics context factory, e ); } } void reportCounterMetrics( Counters counters ) { for( String groupName : counters.getGroupNames() ) { Group group = counters.getGroup( groupName ); for( String counterName : group.getCounterNames() ) { randomizeRecord.setMetric( counterName, group.getCounter(counterName) ); LOG.info( reporting counter + counterName + + group.getCounter(counterName) ); } } randomizeRecord.update(); } Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/
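If the "job exits before the metrics are flushed" theory is right, one low-tech workaround is to keep the driver JVM alive for at least one reporting period after the final update(). A sketch only: it assumes the context name myContext from the snippet above, that its period is configured as myContext.period in hadoop-metrics.properties (10 seconds used here as an assumption), and that the old org.apache.hadoop.metrics API is in play; it does not change how GangliaContext itself flushes:

    import org.apache.hadoop.metrics.ContextFactory;
    import org.apache.hadoop.metrics.MetricsContext;

    // in the driver, after JobClient.runJob() has returned and
    // reportCounterMetrics(...) has called randomizeRecord.update():
    try {
        MetricsContext myContext = ContextFactory.getFactory().getContext("myContext");
        int periodSecs = 10;  // assumed to match "myContext.period" in hadoop-metrics.properties
        Thread.sleep((periodSecs + 1) * 1000L);  // let the timer-driven emitter run at least once
        myContext.close();                        // then stop the metrics timer before exiting
    } catch (Exception e) {
        // worst case: the final counter record simply isn't emitted
    }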
Re: Question about using the metrics framework.
actually, I don't see any jvm metrics across the cluster. any idea how to get a local gmond to gmetad to report local statistics? it is also accumulating slave stats just fine (minus jvm). On Mar 6, 2008, at 3:09 PM, Jason Venner wrote: We have tweeked our metrics properties to cause all nodes to report to the master node, and we see the map/reduce metrics also see the jvm metrics from all nodes. The other thing is that this only runs on the master node, as it is only run from the driver which currently runs on the master node. As I look deeper, I believe the metrics are not being flushed when the context is shut down, and the flush methods are not implemented for the ganglia context. Chris K Wensel wrote: I have ganglia up on my cluster, and I definitely see some metrics from the map/reduce tasks. But I don't see anything from the JVM context for ganglia except from the master node. but i have gmetad running on the master and reporting from a local gmond seems flakey. On Mar 6, 2008, at 2:03 PM, Jason Venner wrote: We have started our first attempt at this, and do not see the metrics being reported. Our first cut simply is trying to report the counters at the end of the job. A theory is that the job is exiting before the metrics are flushed. This code is in the driver for our map/reduce task, and is called after the JobClient.runJob() method returns. private ContextFactory contextFactory = null; private MetricsContext myContext = null; private MetricsRecord randomizeRecord = null; { try { contextFactory = ContextFactory.getFactory(); myContext = contextFactory.getContext(myContext); randomizeRecord = myContext.createRecord( MyContextName ); } catch( Exception e ) { LOG.error( Unable to initialize metrics context factory, e ); } } void reportCounterMetrics( Counters counters ) { for( String groupName : counters.getGroupNames() ) { Group group = counters.getGroup( groupName ); for( String counterName : group.getCounterNames() ) { randomizeRecord.setMetric( counterName, group.getCounter(counterName) ); LOG.info( reporting counter + counterName + + group.getCounter(counterName) ); } } randomizeRecord.update(); } Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/
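For reference, the kind of setup being described (every node sending dfs, mapred, and jvm metrics to a gmond on the master) looks roughly like this in conf/hadoop-metrics.properties; the master hostname and port are placeholders, and the jvm.* section is the part that controls the JVM metrics discussed here:

    # send dfs, mapred and jvm metrics to the gmond running on the master
    dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    dfs.period=10
    dfs.servers=<master-private-dns>:8649

    mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    mapred.period=10
    mapred.servers=<master-private-dns>:8649

    jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    jvm.period=10
    jvm.servers=<master-private-dns>:8649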
Re: Question about using the metrics framework.
never mind on the jvm. just found the typo.. frown On Mar 6, 2008, at 3:19 PM, Chris K Wensel wrote: actually, I don't see any jvm metrics across the cluster. any idea how to get a local gmond to gmetad to report local statistics? it is also accumulating slave stats just fine (minus jvm). On Mar 6, 2008, at 3:09 PM, Jason Venner wrote: We have tweeked our metrics properties to cause all nodes to report to the master node, and we see the map/reduce metrics also see the jvm metrics from all nodes. The other thing is that this only runs on the master node, as it is only run from the driver which currently runs on the master node. As I look deeper, I believe the metrics are not being flushed when the context is shut down, and the flush methods are not implemented for the ganglia context. Chris K Wensel wrote: I have ganglia up on my cluster, and I definitely see some metrics from the map/reduce tasks. But I don't see anything from the JVM context for ganglia except from the master node. but i have gmetad running on the master and reporting from a local gmond seems flakey. On Mar 6, 2008, at 2:03 PM, Jason Venner wrote: We have started our first attempt at this, and do not see the metrics being reported. Our first cut simply is trying to report the counters at the end of the job. A theory is that the job is exiting before the metrics are flushed. This code is in the driver for our map/reduce task, and is called after the JobClient.runJob() method returns. private ContextFactory contextFactory = null; private MetricsContext myContext = null; private MetricsRecord randomizeRecord = null; { try { contextFactory = ContextFactory.getFactory(); myContext = contextFactory.getContext(myContext); randomizeRecord = myContext.createRecord( MyContextName ); } catch( Exception e ) { LOG.error( Unable to initialize metrics context factory, e ); } } void reportCounterMetrics( Counters counters ) { for( String groupName : counters.getGroupNames() ) { Group group = counters.getGroup( groupName ); for( String counterName : group.getCounterNames() ) { randomizeRecord.setMetric( counterName, group.getCounter(counterName) ); LOG.info( reporting counter + counterName + + group.getCounter(counterName) ); } } randomizeRecord.update(); } Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/