Great work! Thanks for sharing with us. I hope others will find your work useful. I'm definitely impressed.
Cheers,

-- Andrei Savu

On Thu, Nov 3, 2011 at 2:04 PM, Paolo Castagna <[email protected]> wrote:

> Hi,
> I wanted to share this (use case) with you. I wrote "users" instead
> of "user", so I am sending this to the right address now.
>
> Paolo
>
> -------- Original Message --------
> Subject: An experiment with tdbloader3 (i.e. how to create TDB indexes
> with MapReduce)
> Date: Thu, 03 Nov 2011 11:38:20 +0000
> From: Paolo Castagna <[email protected]>
> To: [email protected] <[email protected]>
> CC: [email protected]
>
> Hi,
> I want to share the results of an experiment I did using tdbloader3
> with a Hadoop cluster (provisioned via Apache Whirr) running on EC2.
>
> tdbloader3 generates TDB indexes with a sequence of four MapReduce jobs.
> tdbloader3 is ASL-licensed, but it is not an official module of Apache
> Jena.
>
> At the moment it is a prototype/experiment, which is why its sources
> are on GitHub:
> https://github.com/castagna/tdbloader3
>
> *If* it turns out to be useful to people and *if* there is demand,
> I have no problem contributing this to Apache Jena as a separate
> module (and supporting it). But I suspect there aren't many people with
> Hadoop clusters of ~100 nodes, RDF datasets of billions of triples
> or quads, and the need to generate TDB indexes. :-)
>
> For now GitHub works well. More testing and experiments are necessary.
>
> The dataset I used in my experiment is an Open Library version in RDF:
> 962,071,613 triples. I found it somewhere in the office in our S3
> account. I am not sure if it is publicly available. Sorry.
>
> To run the experiment I used a 17-node Hadoop cluster on EC2.
>
> This is the Apache Whirr recipe I used to provision the cluster:
>
> ---------[ hadoop-ec2.properties ]----------
> whirr.cluster-name=hadoop
> whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,16 hadoop-datanode+hadoop-tasktracker
> whirr.instance-templates-max-percent-failures=100 hadoop-namenode+hadoop-jobtracker,80 hadoop-datanode+hadoop-tasktracker
> whirr.max-startup-retries=1
> whirr.provider=aws-ec2
> whirr.identity=${env:AWS_ACCESS_KEY_ID_LIVE}
> whirr.credential=${env:AWS_SECRET_ACCESS_KEY_LIVE}
> whirr.hardware-id=c1.xlarge
> whirr.location-id=us-east-1
> whirr.image-id=us-east-1/ami-1136fb78
> whirr.private-key-file=${sys:user.home}/.ssh/whirr
> whirr.public-key-file=${whirr.private-key-file}.pub
> whirr.hadoop.version=0.20.204.0
> whirr.hadoop.tarball.url=http://archive.apache.org/dist/hadoop/core/hadoop-${whirr.hadoop.version}/hadoop-${whirr.hadoop.version}.tar.gz
> ---------
>
> The recipe specifies a 17-node cluster, and it also specifies that a
> 20% failure rate during the startup of the cluster is acceptable.
> Only 15 datanodes/tasktrackers started successfully and joined
> the Hadoop cluster. Moreover, during the processing, 2 datanodes/
> tasktrackers died. Hadoop coped with that with no problems.
>
> As you can see, I used c1.xlarge instances, which have 7 GB of RAM,
> 20 "EC2 compute units" (8 virtual cores with 2.5 EC2 compute units
> each) and 1.6 TB of disk space. The price of a c1.xlarge instance in
> US East (Virginia) is $0.68 per hour.
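>
> For completeness, launching and tearing down the cluster with the
> Whirr CLI looks roughly like this (a minimal sketch, assuming the
> recipe above is saved as hadoop-ec2.properties and Whirr's bin/
> directory is on the PATH):
>
>   # Provision the cluster described by the recipe
>   whirr launch-cluster --config hadoop-ec2.properties
>
>   # Whirr writes the client-side Hadoop config and a SOCKS proxy
>   # script under ~/.whirr/hadoop/ (the directory is named after
>   # whirr.cluster-name); the proxy lets a local hadoop client
>   # reach the namenode/jobtracker
>   sh ~/.whirr/hadoop/hadoop-proxy.sh
>
>   # Tear everything down when finished, to stop paying for instances
>   whirr destroy-cluster --config hadoop-ec2.properties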
>
> These are the elapsed times for the jobs I ran on the cluster:
>
> distcp: 1 hr, 19 mins (1)
> tdbloader3-stats: 1 hr, 22 mins (2)
> tdbloader3-first: 1 hr, 48 mins (3)
> tdbloader3-second: 2 hrs, 40 mins
> tdbloader3-third: 49 mins
> tdbloader3-fourth: 1 hr, 36 mins
> download: - (4)
>
> (1) In my case, distcp copied 112 GB (uncompressed) from S3 onto HDFS
> (see the distcp sketch in the PPS below).
>
> (2) tdbloader3-stats is optional and is not strictly required for
> the TDB indexes, but it is useful to produce the stats.opt file.
> It can probably be merged into one of the tdbloader3-first or
> tdbloader3-second jobs.
>
> (3) Considering only the four tdbloader3 jobs (first through fourth),
> they took ~7 hours, i.e. an overall speed of about 38,000 triples per
> second (~962 million triples in ~25,200 seconds). Not impressive, and
> a bit too expensive to run on a cluster in the cloud (it's a good
> business for Amazon though).
>
> (4) It depends on the bandwidth available. This step can be improved
> by compressing the nodes.dat files, and part of what download does can
> be done as a 10-map, map-only job.
>
> For those lucky enough to already have a 40-100 node Hadoop cluster,
> tdbloader3 can be an option alongside tdbloader and tdbloader2. For
> the rest of us, it is certainly not a replacement for those two great
> tools plus as much RAM as you can get. :-)
>
> Last but not least, if you are reading this and you are one of those
> lucky enough to have a (sometimes idle) Hadoop cluster of 40-100 nodes,
> and you would like to help me out with testing tdbloader3 or run a few
> experiments with it, please get in touch with me.
>
> Cheers,
> Paolo
>
>
> PS:
> Apache Whirr (http://whirr.apache.org/) is a great project for quickly
> getting a Hadoop cluster up and running for testing or experiments.
> Moreover, Whirr is not limited to Hadoop: it has support for ZooKeeper,
> HBase, Cassandra, ElasticSearch, etc.
>
> EC2 is great for short-lived bursts and easy/elastic provisioning.
> However, it is probably not the best choice for long-running clusters.
> If you use m1.small instances to run your Hadoop cluster, you are
> likely to have problems with RAM.
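>
> PPS:
> The distcp step in (1) was essentially the following; a sketch with a
> hypothetical bucket name and destination path (not the real ones), and
> with the AWS credentials embedded in the s3n URI, which is one of the
> ways Hadoop 0.20 accepts them:
>
>   # Copy the uncompressed RDF input (~112 GB here) from S3 onto HDFS.
>   # Bucket and paths below are placeholders.
>   hadoop distcp \
>     s3n://ACCESS_KEY:SECRET_KEY@my-bucket/openlibrary-rdf/ \
>     hdfs:///user/hadoop/openlibrary-rdf/
>
>   # Alternatively, keep the credentials out of the URI (distcp is a
>   # Tool, so it accepts generic -D options):
>   hadoop distcp \
>     -Dfs.s3n.awsAccessKeyId=... -Dfs.s3n.awsSecretAccessKey=... \
>     s3n://my-bucket/openlibrary-rdf/ hdfs:///user/hadoop/openlibrary-rdf/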
