Re: Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread Chris K Wensel
I believe PIG, and I know Cascading, use a kind of 'spillable' list
that can be iterated over repeatedly. PIG's version is a bit more
sophisticated, last I looked.


that said, if you were using either one of them, you wouldn't need to  
write your own many-to-many join.


cheers,
ckw

On May 28, 2009, at 8:14 AM, Todd Lipcon wrote:


One last possible trick to consider:

If you were to subclass SequenceFileRecordReader, you'd have access  
to its

seek method, allowing you to rewind the reducer input. You could then
implement a block hash join with something like the following  
pseudocode:


ahash = new HashMap<Key, Val>();
while (i have ram available) {
 read a record
 if the record is from table B, break
 put the record into ahash
}
nextAPos = reader.getPos()

// skip any A records that didn't fit into ahash this pass
while (current record is an A record) {
 skip to next record
}
firstBPos = reader.getPos()

while (current record has current key) {
 read and join against ahash
 process joined result
}

// if some A records were skipped, rewind and make another pass over the Bs
if (firstBPos > nextAPos) {
 seek(nextAPos)
 go back to top
}
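
Something like the following (untested; written against the 0.19-era
org.apache.hadoop.mapred API, and the class and method names here are just
illustrative) would expose the seek you need:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.SequenceFileRecordReader;

// Exposes SequenceFileRecordReader's protected seek() so the block hash join
// above can rewind to a position previously captured with getPos().
public class RewindableSequenceFileRecordReader<K, V>
    extends SequenceFileRecordReader<K, V> {

  public RewindableSequenceFileRecordReader(Configuration conf, FileSplit split)
      throws IOException {
    super(conf, split);
  }

  public void seekTo(long pos) throws IOException {
    seek(pos);
  }
}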


On Thu, May 28, 2009 at 8:05 AM, Todd Lipcon t...@cloudera.com  
wrote:



Hi Stuart,

It seems to me like you have a few options.

Option 1: Just use a lot of RAM. Unless you really expect many  
millions of

entries on both sides of the join, you might be able to get away with
buffering despite its inefficiency.

Option 2: Use LocalDirAllocator to find some local storage to spill  
all of
the left table's records to disk in a MapFile format. Then as you  
iterate
over the right table, do lookups in the MapFile. This is really the  
same as

option 1, except that you're using disk as an extension of RAM.

Option 3: Convert this to a map-side merge join. Basically what you need to
do is sort both tables by the join key, and partition them with the same
partitioner into the same number of partitions. This way you have an equal
number of part-N files for both tables, and within each part-N file
they're ordered by join key. In each map task, you open both tableA/part-N
and tableB/part-N and do a sequential merge to perform the join. I believe
the CompositeInputFormat class helps with this, though I've never used it.
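
Roughly, the job setup would look like this (untested sketch against the old
org.apache.hadoop.mapred.join API; the paths and the choice of
SequenceFileInputFormat are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class MergeJoinJobSetup {
  // Both inputs must already be sorted by the join key and partitioned with
  // the same partitioner into the same number of part-N files.
  public static void configure(JobConf conf) {
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", SequenceFileInputFormat.class,
        new Path("/data/tableA"), new Path("/data/tableB")));
    // Each map task then sees the join key plus a TupleWritable holding the
    // matching records from tableA/part-N and tableB/part-N.
  }
}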


Option 4: Perform the join in several passes. Whichever table is smaller,
break it into pieces that fit in RAM. Unless my relational algebra is off,
A JOIN B = A JOIN (B1 UNION B2) = (A JOIN B1) UNION (A JOIN B2) if
B = B1 UNION B2.

Hope that helps
-Todd


On Thu, May 28, 2009 at 5:02 AM, Stuart White stuart.whi...@gmail.com 
wrote:


I need to do a reduce-side join of two datasets.  It's a many-to-many
join; that is, each dataset can contain multiple records with any given
key.

Every description of a reduce-side join I've seen involves
constructing your keys in your mapper such that records from one
dataset will be presented to the reducers before records from the
second dataset.  The idea is that I hold on to the value from the first
dataset and remember it as I iterate across the values from the second
dataset.

This seems like it only works well for one-to-many joins (when one  
of

your datasets will only have a single record with any given key).
This scales well because you're only remembering one value.

In a many-to-many join, if you apply this same algorithm, you'll  
need

to remember all values from one dataset, which of course will be
problematic (and won't scale) when dealing with large datasets with
large numbers of records with the same keys.

Does an efficient algorithm exist for a many-to-many reduce-side  
join?







--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com



Re: Amazon Elastic MapReduce

2009-04-02 Thread Chris K Wensel

You should check out the new pricing.

On Apr 2, 2009, at 1:13 AM, zhang jianfeng wrote:

Seems like I would have to pay additional money, so why not configure a
Hadoop cluster on EC2 by myself? That has already been automated using
scripts.






On Thu, Apr 2, 2009 at 4:09 PM, Miles Osborne mi...@inf.ed.ac.uk  
wrote:



... and only in the US

Miles

2009/4/2 zhang jianfeng zjf...@gmail.com:

Does it support Pig?


On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel ch...@wensel.net  
wrote:




FYI

Amazon's new Hadoop offering:
http://aws.amazon.com/elasticmapreduce/

And Cascading 1.0 supports it:
http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html

cheers,
ckw

--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/








--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



Re: Reducers spawned when mapred.reduce.tasks=0

2009-03-13 Thread Chris K Wensel

fwiw, we have released a workaround for this issue in Cascading 1.0.5.

http://www.cascading.org/
http://cascading.googlecode.com/files/cascading-1.0.5.tgz

In short, Hadoop 0.19.0 and 0.19.1 instantiate the user's Reducer class and  
subsequently call configure() when there is no intention to use the  
class (during job/task cleanup tasks).


This clearly can cause havoc for users who use configure() to  
initialize resources used by the reduce() method.


Testing whether jobConf.getNumReduceTasks() is 0 inside the configure()  
method seems to work out well, as sketched below.
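
Something like this (sketch only; the class name is illustrative):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class GuardedReducerBase extends MapReduceBase {
  @Override
  public void configure(JobConf conf) {
    // On 0.19.0/0.19.1 this instance may only exist for job/task cleanup;
    // skip expensive setup when no reduces will actually run.
    if (conf.getNumReduceTasks() == 0) {
      return;
    }
    // ... initialize resources used by reduce() here ...
  }
}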


branch-0.19 looks like it won't instantiate the Reducer class during  
job/task cleanup tasks, so I expect the fix will carry into future releases.


cheers,

ckw

On Mar 12, 2009, at 8:20 PM, Amareshwari Sriramadasu wrote:

Are you seeing reducers getting spawned in the web UI? If so, it is a bug.
If not, there aren't actually reducers being spawned; it could be a
job-setup/job-cleanup task that is running on a reduce slot. See HADOOP-3150
and HADOOP-4261.

-Amareshwari
Chris K Wensel wrote:


May have found the answer, waiting on confirmation from users.

Turns out 0.19.0 and .1 instantiate the reducer class when the task  
is actually intended for job/task cleanup.


branch-0.19 looks like it resolves this issue by not instantiating  
the reducer class in this case.


I've got a workaround in the next maint release:
http://github.com/cwensel/cascading/tree/wip-1.0.5

ckw

On Mar 12, 2009, at 10:12 AM, Chris K Wensel wrote:


Hey all

Have some users reporting intermittent spawning of Reducers when  
the job.xml shows mapred.reduce.tasks=0 in 0.19.0 and .1.


This is also confirmed when jobConf is queried in the (supposedly  
ignored) Reducer implementation.


In general this issue would likely go unnoticed since the default  
reducer is IdentityReducer.


but since the reducer should be ignored in the mapper-only case, we don't  
bother to leave the value unset, and so the issue comes to one's  
attention rather abruptly.


am happy to open a JIRA, but wanted to see if anyone else is  
experiencing this issue.


note the issue seems to manifest with or without speculative execution.

ckw

--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/





--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



Reducers spawned when mapred.reduce.tasks=0

2009-03-12 Thread Chris K Wensel

Hey all

Have some users reporting intermittent spawning of Reducers when the  
job.xml shows mapred.reduce.tasks=0 in 0.19.0 and .1.


This is also confirmed when jobConf is queried in the (supposedly  
ignored) Reducer implementation.


In general this issue would likely go unnoticed since the default  
reducer is IdentityReducer.


but since the reducer should be ignored in the mapper-only case, we don't  
bother to leave the value unset, and so the issue comes to one's attention  
rather abruptly.


am happy to open a JIRA, but wanted to see if anyone else is  
experiencing this issue.


note the issue seems to manifest with or without speculative execution.

ckw

--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



Cascading support of HBase

2009-02-03 Thread Chris K Wensel

Hey all,

Just wanted to let everyone know the HBase Hackathon was a success  
(thanks Streamy!), and we now have Cascading Tap adapters for HBase.


You can find the link here (along with additional third-party  
extensions in the near future).

http://www.cascading.org/modules.html

Please feel free to give it a test, clone the repo, and submit patches  
back to me via GitHub.


The unit test shows how simple it is to import/export data to/from an  
HBase cluster.

http://bit.ly/fIpAE

For those not familiar with Cascading, it is an alternative processing  
API that is generally more natural to develop and think in than  
MapReduce.

http://www.cascading.org/

enjoy,

chris

p.s. If you have any code you want to contribute back, just stick it  
on GitHub and send me a link.


--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



Re: Control over max map/reduce tasks per job

2009-02-03 Thread Chris K Wensel

Hey Jonathan

Are you looking to limit the total number of concurrent mappers/reducers
a single job can consume cluster-wide, or limit the number per node?


That is, you have X mappers/reducers, but can only allow N mappers/reducers
to run at a time globally, for a given job.


Or, you are cool with all X running concurrently globally, but want to  
guarantee that no node can run more than N tasks from that job?


Or both?

just reconciling the conversation we had last week with this thread.

ckw

On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote:


All,



I have a few relatively small clusters (5-20 nodes) and am having  
trouble

keeping them loaded with my MR jobs.



The primary issue is that I have different jobs that have drastically
different patterns.  I have jobs that read/write to/from HBase or  
Hadoop

with minimal logic (network throughput bound or io bound), others that
perform crawling (network latency bound), and one huge parsing  
streaming job

(very CPU bound, each task eats a core).



I'd like to launch very large numbers of tasks for network latency  
bound
jobs, however the large CPU bound job means I have to keep the max  
maps
allowed per node low enough as to not starve the Datanode and  
Regionserver.




I'm an HBase dev but not familiar enough with Hadoop MR code to even  
know
what would be involved with implementing this.  However, in talking  
with

other users, it seems like this would be a well-received option.



I wanted to ping the list before filing an issue because it seems like
someone may have thought about this in the past.



Thanks.



Jonathan Gray



--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



Scale Unlimited Professionals Program

2009-02-02 Thread Chris K Wensel

Hey All

Just wanted to let everyone know that Scale Unlimited will start  
offering many of its courses heavily discounted, if not free, to  
independent consultants and contractors.

http://www.scaleunlimited.com/programs

We are doing this because we receive a number of consulting/ 
contracting opportunities that we wish to delegate back to trusted  
consultants and developers.


But more importantly, many consultants don't have time to learn Hadoop  
and related technologies, so Hadoop is often overlooked on new  
projects. We would like to get more developers comfortable knowing  
when and when not to use Hadoop on a project, ultimately leading to  
more projects using Hadoop and to Hadoop becoming more stable and  
feature rich in the process.


We plan to offer our Hadoop Boot Camp for FREE in the Bay Area in the  
next few weeks. If interested in participating, email me directly.

http://www.scaleunlimited.com/courses/hadoop-boot-camp

Note this offer is limited to professional independent consultants,  
contractors, and small boutique contracting firms that are looking to  
expand their tool base. We only ask for an industry standard referral  
fee for any projects that result from a referral, if any.


To be added to our referral list or if you have a project that might  
benefit from Hadoop or related technologies, please email me directly.


This course will also be announced for open public enrollment in the  
coming days.


cheers,
chris

--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



Re: Windows Support

2009-01-19 Thread Chris K Wensel

Hey Dan

There is discussion/issue on this here:
https://issues.apache.org/jira/browse/HADOOP-4998

ckw

On Jan 19, 2009, at 8:55 AM, Dan Diephouse wrote:

On Mon, Jan 19, 2009 at 11:35 AM, Steve Loughran ste...@apache.org  
wrote:



Dan Diephouse wrote:

I recognize that Windows support is, um, limited :-) But, any  
ideas what
exactly would need to be changed to support Windows (without  
cygwin) if
someone such as myself were so motivated? The most immediate thing  
I ran

into was the UserGroupInformation which would need a windows
implementation.
I see there is an issue to switch to JAAS too, which may be the  
proper

fix?
Are there lots of other things that would need to be changed?

I think it may be worth opening a JIRA for windows support and  
creating

some
subtasks for the various issues, even if no one tackles them quite  
yet.


Thanks,
Dan


I think a key one you need to address is motivation. Is cygwin that bad
for a piece of server-side code?



No, I guess I was trying to get an idea of how much work it was. It seems
easy enough to supply a WindowsUserGroupInformation class (or a platform
agnostic one).  I wondered how many other things like this there were before
I put together a patch. Seems like bad Java practice to depend on shell
utilities :-). Not very platform agnostic...
Dan

--
Dan Diephouse
http://netzooid.com/blog


--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



Cascading 1.0.0 Released

2009-01-15 Thread Chris K Wensel

Hi all

Just a quick note to let everyone know that Cascading 1.0.0 is out.
http://www.cascading.org/

Cascading is an API for defining and executing data processing flows  
without needing to think in MapReduce.


This release supports only Hadoop 0.19.x. Minor releases will be  
available to track Hadoop's progress.


A list of some of the high level features can be read about here:
http://www.cascading.org/documentation/features/

Or you can peruse the User Guide:
http://www.cascading.org/documentation/userguide.html

Developer Support and OEM/Commercial Licensing is available through  
Concurrent, Inc.

http://www.concurrentinc.com/

And finally, Advanced Hadoop and Cascading training (and consulting)  
is available through Scale Unlimited:

http://www.scaleunlimited.com/

cheers,
chris

--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



Re: Storing/retrieving time series with hadoop

2009-01-12 Thread Chris K Wensel

Hey Brock

I used Cascading quite extensively with time series data.

Along with the standard function/filter/aggregator operations in the  
Cascading processing model, there is what we call a buffer.


It's really just a user-friendly Reduce that integrates well with other  
operations and offers up a sliding window across your grouped data.  
Quite useful for running averages or filling in missing intervals, etc.


Plus there are handy operations for switching from text time strings  
to long timestamps and back, etc.


YMMV

cheers,
ckw

On Jan 7, 2009, at 5:03 PM, Brock Judkins wrote:


Hi list,
I am researching hadoop as a possible solution for my company's data
warehousing solution. My question is whether hadoop, possibly in  
combination
with Hive or Pig, is a good solution for time-series data? We  
basically have

a ton of web analytics to store that we display both internally and
externally.

For the time being I am storing timestamped data points in a huge  
MySQL
table, but I know this will not scale very far (although it's  
holding up ok

at almost 90MM rows). I am aware that hadoop can scale insanely large
(larger than I need), but does anyone have experience using it to draw
charts based on time series with fairly low latency?

Thanks!
Brock


--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



Re: Lookup HashMap available within the Map

2008-11-25 Thread Chris K Wensel
cool. If you need a hand with Cascading stuff, feel free to ping me on  
the mail list or #cascading irc. lots of other friendly folk there  
already.


ckw

On Nov 25, 2008, at 12:35 PM, tim robertson wrote:


Thanks Chris,

I have a different test running, then will implement that.  Might give
cascading a shot for what I am doing.

Cheers

Tim


On Tue, Nov 25, 2008 at 9:24 PM, Chris K Wensel [EMAIL PROTECTED]  
wrote:

Hey Tim

The .configure() method is what you are looking for, I believe.

It is called once per task, which in the default case is once per JVM.


Note that jobs are broken into parallel tasks, and each task handles a
portion of the input data. So while you may create your map 100 times
because there are 100 tasks, it will only be created once per JVM.

I hope this makes sense.

chris
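
Something like the following sketch (untested; class, field, and type choices
are illustrative) captures the once-per-JVM pattern, with a static guard in
case multiple tasks reuse the same JVM:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LookupMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Shared across all tasks that happen to run in the same JVM.
  private static Map<String, String> lookup;

  @Override
  public void configure(JobConf conf) {
    synchronized (LookupMapper.class) {
      if (lookup == null) {
        lookup = new HashMap<String, String>();
        // ... read the shapefile/index (e.g. from the DistributedCache)
        // and populate the map here ...
      }
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // ... use 'lookup' to enrich or filter each record ...
  }
}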

On Nov 25, 2008, at 11:46 AM, tim robertson wrote:


Hi Doug,

Thanks - it is not so much I want to run in a single JVM - I do  
want a

bunch of machines doing the work, it is just I want them all to have
this in-memory lookup index, that is configured once per job.  Is
there some hook somewhere that I can trigger a read from the
distributed cache, or is a Mapper.configure() the best place for  
this?

Can it be called multiple times per Job meaning I need to keep some
static synchronised indicator flag?

Thanks again,

Tim


On Tue, Nov 25, 2008 at 8:41 PM, Doug Cutting [EMAIL PROTECTED]  
wrote:


tim robertson wrote:


Thanks Alex - this will allow me to share the shapefile, but I need to
read it, parse it, and store the objects in the index only once per job
per JVM.
Is Mapper.configure() the best place to do this?  E.g. will it
only be called once per job?


In 0.19, with HADOOP-249, all tasks from a job can be run in a  
single

JVM.
So, yes, you could access a static cache from Mapper.configure().

Doug




--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: Auto-shutdown for EC2 clusters

2008-10-24 Thread Chris K Wensel


fyi, the src/contrib/ec2 scripts do just what Paco suggests.

minus the static IP stuff (you can use the scripts to login via  
cluster name, and spawn a tunnel for browsing nodes)


that is, you can spawn any number of uniquely named, configured, and  
sized clusters, and you can increase their size independently as well.  
(shrinking is another matter altogether)


ckw

On Oct 24, 2008, at 1:58 PM, Paco NATHAN wrote:


Hi Karl,

Rather than using separate key pairs, you can use EC2 security groups
to keep track of different clusters.

Effectively, that requires a new security group for every cluster --
so just allocate a bunch of different ones in a config file, then have
the launch scripts draw from those. We also use EC2 static IP
addresses and then have a DNS entry named similarly to each security
group, associated with a static IP once that cluster is launched.
It's relatively simple to query the running instances and collect them
according to security groups.

One way to handle detecting failures is just to attempt SSH in a loop.
Our rough estimate is that approximately 2% of the attempted EC2 nodes
fail at launch. So we allocate more than enough, given that rate.

In a nutshell, that's one approach for managing a Hadoop cluster
remotely on EC2.

Best,
Paco


On Fri, Oct 24, 2008 at 2:07 PM, Karl Anderson [EMAIL PROTECTED] wrote:


On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote:


This workflow could be initiated from a crontab -- totally  
automated.

However, we still see occasional failures of the cluster, and must
restart manually, but not often.  Stability for that has improved  
much

since the 0.18 release.  For us, it's getting closer to total
automation.

FWIW, that's running on EC2 m1.xl instances.


Same here.  I've always had the namenode and web interface be  
accessible,
but sometimes I don't get the slave nodes - usually zero slaves  
when this
happens, sometimes I only miss one or two.  My rough estimate is  
that this

happens 1% of the time.

I currently have to notice this and restart manually.  Do you have  
a good
way to detect it?  I have several Hadoop clusters running at once  
with the
same AWS image and SSH keypair, so I can't count running  
instances.  I could
have a separate keypair per cluster and count instances with that  
keypair,
but I'd like to be able to start clusters opportunistically, with  
more than

one cluster doing the same kind of job on different data.


Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra






--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



Re: Auto-shutdown for EC2 clusters

2008-10-23 Thread Chris K Wensel

Hey Stuart

I did that for a client using Cascading events and SQS.

When jobs completed, they dropped a message on SQS where a listener  
picked up new jobs and ran with them, or decided to kill off the  
cluster. The currently shipping EC2 scripts are suitable for having  
multiple simultaneous clusters for this purpose.


Cascading has always supported, and now Hadoop supports (thanks Tom), raw  
file access on S3, so this is quite natural. This is the best approach, as  
data is pulled directly into the Mapper instead of onto HDFS first and  
then read into the Mapper from HDFS.


YMMV

chris

On Oct 23, 2008, at 7:47 AM, Stuart Sierra wrote:


Hi folks,
Anybody tried scripting Hadoop on EC2 to...
1. Launch a cluster
2. Pull data from S3
3. Run a job
4. Copy results to S3
5. Terminate the cluster
... without any user interaction?

-Stuart


--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



Hadoop Training

2008-10-14 Thread Chris K Wensel

Hey all

Just wanted to make a quick announcement that Scale Unlimited has  
started delivering its 2 day Hadoop Boot Camp.


More info here:
http://www.scaleunlimited.com/hadoop-bootcamp.html

We currently are offering the classes on-site within the US/UK/EU to  
those companies needing to get a team up to speed rapidly.


But we are working to put together a public class in the Bay Area.  
Please email me if you are interested, so we can gauge interest.


Also, we may be in NY for the NY Hadoop User Group, if any org out  
there wants to throw together a class during the week of Nov. 10,  
again, give me a shout.


cheers,
chris

--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



Re: Monthly Hadoop User Group Meeting (Bay Area)

2008-09-09 Thread Chris K Wensel

Chris K Wensel wrote:
doh, conveniently collides with the GridGain and GridDynamics  
presentations:

http://web.meetup.com/66/calendar/8561664/


Bay Area Hadoop User Group meetings are held on the third Wednesday  
every month.  This has been on the calendar for quite a while.


Doug



maybe I should have said, coincidentally.

--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



Re: Simple Survey

2008-09-09 Thread Chris K Wensel


Quick reminder to take the survey. We know more than a dozen companies  
are using Hadoop. heh


http://www.scaleunlimited.com/survey.html   

thanks!
chris

On Sep 8, 2008, at 10:43 AM, Chris K Wensel wrote:


Hey all

Scale Unlimited is putting together some case studies for an  
upcoming class and wants to get a snapshot of what the Hadoop user  
community looks like.


If you have 2 minutes, please feel free to take the short anonymous  
survey below:


http://www.scaleunlimited.com/survey.html

All results will be public.

cheers,
chris

--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



Re: Simple Survey

2008-09-09 Thread Chris K Wensel

how weird. i'll forward this on and see if they can fix it.

thanks for letting me know.

chris

On Sep 9, 2008, at 4:22 PM, John Kane wrote:

Unfortunately there is a problem with the survey. I was unable to  
answer
correctly question '9. How much data is stored on your Hadoop  
cluster (in
GB)?' It would not let me enter more than 10TB (we currently have  
45TB of
data in our cluster; actual data, not a sum of disk used (with all  
of its

replicas) but unique data).

Other than that, I tried :-)

On Tue, Sep 9, 2008 at 4:01 PM, Chris K Wensel [EMAIL PROTECTED]  
wrote:




Quick reminder to take the survey. We know more than a dozen  
companies are

using Hadoop. heh

http://www.scaleunlimited.com/survey.html

thanks!
chris


On Sep 8, 2008, at 10:43 AM, Chris K Wensel wrote:

Hey all


Scale Unlimited is putting together some case studies for an  
upcoming
class and wants to get a snapshot of what the Hadoop user  
community looks

like.

If you have 2 minutes, please feel free to take the short  
anonymous survey

below:

http://www.scaleunlimited.com/survey.html

All results will be public.

cheers,
chris

--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/




--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



Re: Basic code organization questions + scheduling

2008-09-08 Thread Chris K Wensel


If you wrote a simple URL fetcher function for Cascading, you would  
have a very powerful web crawler that would dwarf Nutch in flexibility.


That said, Nutch is optimized for storage, has supporting tools and  
ranking algorithms, and has been up against some nasty HTML and other  
document types. Building a really robust crawler is non-trivial.


If I was just starting out and needed to implement a proprietary  
process, I would use Nutch for fetching raw content and refreshing  
it, then use Cascading for parsing, indexing, etc.


cheers,
chris

On Sep 8, 2008, at 12:42 AM, tarjei wrote:



Hi Alex (and others).

You should take a look at Nutch.  It's a search-engine built on  
Lucene,

though it can be setup on top of Hadoop.  Take a look:

This didn't help me much. Although the description I gave of the basic
flow of the app seems to be close to what Nutch is doing (and I've  
been

looking at the Nutch code), the questions are more general and not
related to indexing as such, but about code organization. If someone  
has

more input to those, feel free to add it.



On Mon, Sep 8, 2008 at 2:54 AM, Tarjei Huse [EMAIL PROTECTED] wrote:

Hi, I'm planning to use Hadoop in for a set of typical crawler/ 
indexer

tasks. The basic flow is

input:array of urls
actions:  |
1.  get pages
|
2.  extract new urls from pages - start new job
  extract text  - index / filter (as new jobs)

What I'm considering is how I should build this application to fit  
into the
map/reduce context. I'm thinking that step 1 and 2 should be  
separate

map/reduce tasks that then pipe things on to the next step.

This is where I am a bit at loss to see how it is smart to  
organize the
code in logical units and also how to spawn new tasks when an old  
one is

over.

Is the usual way to control the flow of a set of tasks to have an  
external
application running that listens to jobs ending via the  
endNotificationUri
and then spawns new tasks or should the job itself contain code to  
create

new jobs? Would it be a good idea to use Cascading here?

I'm also considering how I should do job scheduling (I got a lot of
reoccurring tasks). Has anyone found a good framework for job  
control of

reoccurring tasks or should I plan to build my own using quartz ?

Any tips/best practices with regard to the issues described above  
are most
welcome. Feel free to ask further questions if you find my  
descriptions of

the issues lacking.


Kind regards,
Tarjei



Kind regards,
Tarjei









--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



Simple Survey

2008-09-08 Thread Chris K Wensel

Hey all

Scale Unlimited is putting together some case studies for an upcoming  
class and wants to get a snapshot of what the Hadoop user community  
looks like.


If you have 2 minutes, please feel free to take the short anonymous  
survey below:


http://www.scaleunlimited.com/survey.html

All results will be public.

cheers,
chris

--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



Re: Monthly Hadoop User Group Meeting (Bay Area)

2008-09-08 Thread Chris K Wensel


doh, conveniently collides with the GridGain and GridDynamics  
presentations:


http://web.meetup.com/66/calendar/8561664/

On Sep 8, 2008, at 4:55 PM, Ajay Anand wrote:


The next Hadoop User Group (Bay Area) meeting is scheduled for
Wednesday, Sept 17th from 6 - 7:30 pm at Yahoo! Mission College, Santa
Clara, CA, Building 2, Training Rooms 34.



Agenda:

Cloud Computing Testbed - Thomas Sandholm, HP
Katta on Hadoop - Stefan Groschupf



Registration and directions: http://upcoming.yahoo.com/event/1075456/



Look forward to seeing you there!

Ajay





--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



RandomTextWriter

2008-07-07 Thread Chris K Wensel

Hey all

Has anyone had success with RandomTextWriter?

I'm finding it fairly unstable on 0.16.x, haven't tried 0.17 yet though.

chris

--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/








Re: RandomTextWriter

2008-07-07 Thread Chris K Wensel

In local mode, only one mapper succeeds, the remaining never start.

And when on a cluster, a handful of mappers die with some exception  
about not finding the output directory (sorry, don't have the  
exception handy).


I'm upgrading to 0.17.1 to see if they persist.

ckw

On Jul 7, 2008, at 10:08 AM, Arun C Murthy wrote:



On Jul 7, 2008, at 9:46 AM, Chris K Wensel wrote:


Hey all

Has anyone had success with RandomTextWriter?

I'm finding it fairly unstable on 0.16.x, haven't tried 0.17 yet  
though.




What problems are you seeing? It seems to work fine for me...

Arun



--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/








Re: hadoop in the ETL process

2008-07-02 Thread Chris K Wensel
If you're referring to loading an RDBMS with data stored on Hadoop, this is  
doable, but you will need to write your own JDBC adapters for your  
tables.


But you might review what you are using the RDBMS for and see if those  
jobs would be better off running on Hadoop entirely, if not for most  
of the processing.


ckw

On Jul 2, 2008, at 10:51 AM, David J. O'Dell wrote:


Is anyone using hadoop for any part of the ETL process?

Given its ability to process large amounts of log files this seems  
like

a good fit.

--
David O'Dell
Director, Operations
e: [EMAIL PROTECTED]
t:  (415) 738-5152
180 Townsend St., Third Floor
San Francisco, CA 94107



--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/








Re: Using S3 Block FileSystem as HDFS replacement

2008-07-01 Thread Chris K Wensel
by editing the hadoop-site.xml, you set the default. but I don't  
recommend changing the default on EC2.


but you can specify the filesystem to use through the URL that  
references your data (jobConf.addInputPath etc) for a particular job.  
in the case of the S3 block filesystem, just use a s3:// url.
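
For example, something like this in the job driver (untested sketch, 0.17-era
mapred API; the bucket name and paths are illustrative, and the
fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey settings are assumed to be in
your configuration):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class S3BlockFsJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(S3BlockFsJob.class);
    conf.setJobName("s3-block-fs-example");

    // HDFS stays the default filesystem; only this job's input/output
    // addresses the S3 block filesystem via s3:// URLs.
    FileInputFormat.setInputPaths(conf, new Path("s3://my-bucket/input"));
    FileOutputFormat.setOutputPath(conf, new Path("s3://my-bucket/output"));

    JobClient.runJob(conf);
  }
}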


ckw

On Jun 30, 2008, at 8:04 PM, slitz wrote:


Hello,
I've been trying to setup hadoop to use s3 as filesystem, i read in  
the wiki

that it's possible to choose either S3 native FileSystem or S3 Block
Filesystem. I would like to use S3 Block FileSystem to avoid the  
task of
manually transferring data from S3 to HDFS every time i want to  
run a job.


I'm still experimenting with EC2 contrib scripts and those seem to be
excellent.
What I can't understand is how it may be possible to use S3 with a public
Hadoop AMI, since from my understanding hadoop-site.xml gets written on each
instance startup with the options in hadoop-init, and it seems that the
public AMI (at least the 0.17.0 one) is not configured to use S3 at
all (which makes sense, because the bucket would need individual
configuration anyway).

So... to use S3 block FileSystem with EC2 i need to create a custom  
AMI with

a modified hadoop-init script right? or am I completely confused?


slitz


--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/








Re: Using S3 Block FileSystem as HDFS replacement

2008-07-01 Thread Chris K Wensel

How do I put something into the fs?
Something like bin/hadoop fs -put input input will not work well since s3
is not the default fs, so I tried to do bin/hadoop fs -put input
s3://ID:[EMAIL PROTECTED]/input (and some variations of it), but it didn't
work; I always got an error complaining about not having provided the
ID/secret for s3.



 hadoop distcp ...

should work with your s3 urls



08/07/01 22:12:55 INFO mapred.FileInputFormat: Total input paths to  
process

: 2
08/07/01 22:12:57 INFO mapred.JobClient: Running job:  
job_200807012133_0010

08/07/01 22:12:58 INFO mapred.JobClient:  map 100% reduce 100%
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
(...)

I tried several times and with the wordcount example but the error  
were

always the same.



unsure. many things could be conspiring against you. leave the  
defaults in hadoop-site and use distcp to copy things around. that's  
the simplest I think.


ckw


Re: realtime hadoop

2008-06-24 Thread Chris K Wensel

On Jun 23, 2008, at 9:54 PM, Matt Kent wrote:


Unless you have a significant amount of work to be done, I wouldn't
recommend using Hadoop because it's not worth the overhead of  
launching

the jobs and moving the data around.



I think part of the tradeoff is having a system that is resilient to  
failure against work that must get done, regardless of the amount of  
work.


ckw

--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/








Re: Ec2 and MR Job question

2008-06-14 Thread Chris K Wensel
well, to answer your last question first, just set the # reducers to  
zero.


but you can't just run reducers without mappers (as far as I know,  
having never tried). so your local job will need to run identity  
mappers in order to feed your reducers.

http://hadoop.apache.org/core/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html
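
Roughly (untested sketch against the old mapred API; class names are
illustrative):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class TwoStageJobs {
  // Stage 1 (run on EC2): a map-only job. With zero reducers the map output
  // is written directly to the job's output path on HDFS.
  public static JobConf mapOnlyJob() {
    JobConf conf = new JobConf(TwoStageJobs.class);
    conf.setNumReduceTasks(0);
    return conf;
  }

  // Stage 2 (run on the local cluster): identity mappers feed the real
  // reducers from the stage-1 output stored on HDFS.
  public static JobConf reduceJob() {
    JobConf conf = new JobConf(TwoStageJobs.class);
    conf.setMapperClass(IdentityMapper.class);
    return conf;
  }
}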

ckw

On Jun 14, 2008, at 1:31 PM, Billy Pearson wrote:

I have a question someone may have answered here before but I can  
not find the answer.


Assuming I have a cluster of servers hosting a large amount of data
I want to run a large job that the maps take a lot of cpu power to  
run and the reduces only take a small amount cpu to run.
I want to run the maps on a group of EC2 servers and run the reduces  
on the local cluster of 10 machines.


The problem I am seeing is the map outputs, if I run the maps on EC2  
they are stored local on the instance
What I am looking to do is have the map output files stored in hdfs  
so I can kill the EC2 instances sense I do not need them for the  
reduces.


The only way I can think to do this is to run two jobs: one mapper-only job
that stores its output on HDFS, and then a second job that runs the reduces
from the map outputs stored on HDFS.

Is there a way to make the mappers store the final output in HDFS?



--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/







Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Chris K Wensel
:597)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
  at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
  at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)





--
hustlin, hustlin, everyday I'm hustlin


--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/







Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Chris K Wensel

Thanks Ted..

Couple quick comments.

At one level Cascading is a MapReduce query planner, just like PIG.  
Except the API is for public consumption and fully extensible; in PIG  
you typically interact with the PigLatin syntax. Subsequently, with  
Cascading, you can layer your own syntax on top of the API. Currently  
there is Groovy support (Groovy is used to assemble the work; it does  
not run on the mappers or reducers). I hear rumors about Jython  
elsewhere.


A couple groovy examples (note these are obviously trivial, the dsl  
can absorb tremendous complexity if need be)...

http://code.google.com/p/cascading/source/browse/trunk/cascading.groovy/sample/wordcount.groovy
http://code.google.com/p/cascading/source/browse/trunk/cascading.groovy/sample/widefinder.groovy

Since Cascading is in part a 'planner', it actually builds internally  
a new representation from what the developer assembled and renders  
out  the necessary map/reduce jobs (and transparently links them) at  
runtime. As Hadoop evolves, the planner will incorporate the new  
features and leverage them transparently. Plus there are opportunities  
for identifying patterns and applying different strategies  
(hypothetically map side vs reduce side joins, for one). It is also  
conceivable (but untried) that different planners can exist to target  
different systems other than Hadoop (making your code/libraries  
portable). Much of this is true for PIG as well.

http://www.cascading.org/documentation/overview.html

Also, Cascading will at some point provide a PIG adapter, allowing  
PigLatin queries to participate in a larger Cascading 'Cascade' (the  
topological scheduler). Cascading is great with integration,  
connecting things outside Hadoop with stuff to be done inside Hadoop.  
And PIG looks like a great way to concisely represent a complex  
solution and execute it. There isn't any reason they can't work  
together (it has always been the intention).


The takeaway is that with Cascading and PIG, users do not think in  
MapReduce. With PIG, you think in PigLatin. With Cascading, you can  
use the pipe/filter based API, or use your favorite scripting language  
and build a DSL for your problem domain.


Many companies have done similar things internally, but they tend to  
be nothing more than a scriptable way to write a map/reduce job and  
glue them together. You still think in MapReduce, which in my opinion  
doesn't scale well.


My (biased) recommendation is this.

Build out your application in Cascading. If part of the problem is  
best represented in PIG, no worries use PIG and feed and clean up  
after PIG with Cascading. And if you see a solvable bottleneck, and we  
can't convince the planner to recognize the pattern and plan better,  
replace that piece of the process with a custom MapReduce job (or more).


Solve your problem first, then optimize the solution, if need be.

ckw

On Jun 11, 2008, at 5:00 PM, Ted Dunning wrote:

Pig is much more ambitious than cascading.  Because of the  
ambitions, simple
things got overlooked.  For instance, something as simple as  
computing a

file name to load is not possible in pig, nor is it possible to write
functions in pig.  You can hook to Java functions (for some things),  
but you
can't really write programs in pig.  On the other hand, pig may  
eventually

provide really incredible capabilities including program rewriting and
optimization that would be incredibly hard to write directly in Java.

The point of cascading was simply to make life easier for a normal
Java/map-reduce programmer.  It provides an abstraction for gluing  
together
several map-reduce programs and for doing a few common things like  
joins.
Because you are still writing Java (or Groovy) code, you have all of  
the
functionality you always had.  But, this same benefit costs you the  
future

in terms of what optimizations are likely to ever be possible.

The summary for us (especially 4-6 months ago when we were deciding)  
is that
cascading is good enough to use now and pig will probably be more  
useful

later.

On Wed, Jun 11, 2008 at 4:19 PM, Haijun Cao [EMAIL PROTECTED]  
wrote:




I find cascading very similar to pig, do you care to provide your  
comment
here? If map reduce programmers are to go to the next level  
(scripting/query

language), which way to go?





--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/







Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-11 Thread Chris K Wensel
However, for continuous production data processing, hadoop+cascading  
sounds like a good option.



This will be especially true with stream assertions and traps (as  
mentioned previously, and available in trunk). grin


I've written workloads for clients that render down to ~60 unique  
Hadoop map/reduce jobs, all inter-related, from ~10 unique units of  
work (internally lots of joins, sorts and math). I can't imagine  
having written them by hand.


ckw

--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/







Re: contrib EC2 with hadoop 0.17

2008-06-09 Thread Chris K Wensel

Thanks for the description, Chris. Now that I understand the basic
model, I'm starting to see how the configuration is passed to the
slaves using the -d option of ec2-run-instances.

One config question: on our cluster (hadoop 0.17 with
INSTANCE_TYPE=m1.small) the conf/hadoop-default.xml has
mapred.reduce.tasks set to 1, and mapred.map.tasks set to 2.

From experimenting and reading the FAQ, it looks like those numbers
should be higher, unless you have single-machine cluster. Maybe
there's something I'm missing, but by upping mapred.map.tasks and
mapred.reduce.tasks to 5 and 15 (in our job jar) we're getting much
better performance. Is there a reason hadoop-init doesn't build a
hadoop-site.xml file with higher or configurable values for these
fields?



configuration values should be set in conf/hadoop-site.xml. Those  
particular values you are referring to probably should be set per job  
and generally don't have anything to do with instance sizes but more  
to do with cluster size and the job being run.


different instance sizes have mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum set accordingly (see hadoop-init),
but again these might/should be tuned to your application (CPU or IO bound).
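
For example, per job (untested sketch; values are illustrative):

import org.apache.hadoop.mapred.JobConf;

public class PerJobTaskCounts {
  public static void configure(JobConf conf) {
    conf.setNumMapTasks(5);      // a hint only; the real map count follows the input splits
    conf.setNumReduceTasks(15);  // honored directly
  }
}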


ckw

Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: contrib EC2 with hadoop 0.17

2008-06-07 Thread Chris K Wensel
The new scripts do not use the start/stop-all.sh scripts, and thus do  
not maintain the slaves file. This is so cluster startup is much  
faster and a bit more reliable (keys do not need to be pushed to the  
slaves). Also we can grow the cluster lazily just by starting slave  
nodes. That is, they are mostly optimized for booting a large cluster  
fast, doing work, then shutting down (allowing for huge short lived  
clusters, vs a smaller/cheaper long lived one).


But it probably would be wise to provide scripts to build/refresh the  
slaves file, and push keys to slaves, so the cluster can be  
traditionally maintained, instead of just re-instantiated with new  
parameters etc.


I wonder if these scripts would make sense in general, instead of  
being ec2 specific?


ckw

On Jun 7, 2008, at 11:31 AM, Chris Anderson wrote:


First of all, thanks to whoever maintains the hadoop-ec2 scripts.
They've saved us untold time and frustration getting started with a
small testing cluster (5 instances).

A question: when we log into the newly created cluster, and run jobs
from the example jar (pi, etc) everything works great. We expect our
custom jobs will run just as smoothly.

However, when we restart the namenodes and tasktrackers by running
bin/stop-all.sh on the master, it tries to stop only activity on
localhost. Running start-all.sh then boots up a localhost-only cluster
(on which jobs run just fine).

The only way we've been able to recover from this situation is to use
bin/terminate-hadoop-cluster and bin/destroy-hadoop-cluster and then
start again from scratch with a new cluster.

There must be a simple way to restart the namenodes and jobtrackers
across all machines from the master. Also, I think understanding the
answer to this question might put a lot more into perspective for me,
so I can go on to do more advanced things on my own.

Thanks for any assistance / insight!

Chris


output from stop-all.sh
==

stopping jobtracker
localhost: Warning: Permanently added 'localhost' (RSA) to the list of
known hosts.
localhost: no tasktracker to stop
stopping namenode
localhost: no datanode to stop
localhost: no secondarynamenode to stop


conf files in /usr/local/hadoop-0.17.0
==

# cat conf/slaves
localhost
# cat conf/masters
localhost




--
Chris Anderson
http://jchris.mfdz.com


Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: hadoop on EC2

2008-06-02 Thread Chris K Wensel

if you use the new scripts in 0.17.0, just run

 hadoop-ec2 proxy cluster-name

this starts a ssh tunnel to your cluster.

installing foxy proxy in FF gives you whole cluster visibility..

obviously this isn't the best solution if you need to let many semi  
trusted users browse your cluster.


On May 28, 2008, at 1:22 PM, Andreas Kostyrka wrote:


Hi!

I just wondered what other people use to access the hadoop webservers,
when running on EC2?

Ideas that I had:
1.) opening ports 50030 and so on = not good, data goes unprotected
over the internet. Even if I could enable some form of  
authentication it

would still plain http.

2.) Some kind of tunneling solution. The problem on this side is that
each of my cluster node is in a different subnet, plus the dualism
between the internal and external addresses of the nodes.

Any hints? TIA,

Andreas


Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: hadoop on EC2

2008-06-02 Thread Chris K Wensel


obviously this isn't the best solution if you need to let many semi  
trusted users browse your cluster.



Actually, it would be much more secure if the tunnel service ran on a  
trusted server letting your users connect remotely via SOCKS and then  
browse the cluster. These users wouldn't need any AWS keys etc.



Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: Read timed out, Abandoning block blk_-5476242061384228962

2008-05-13 Thread Chris K Wensel



I'm using the machine running the namenode to run maps as well - could
that be a source of my problem?  The load is fairly high, essentially
no idle time.  8 cores per machine, so I've got 8 maps running.  I'm
guessing I'd be better off running 80 smaller machines instead of 20
larger ones for the same price, but we haven't been approved for more
than 20 instances yet.  Given that I'm not seeing any idle time, I'm
assuming that I'm CPU not IO-bound.



fwiw, I have not found the large or xlarge EC2 instances  
proportionally faster with Hadoop. Thus we run many small instances  
more cheaply.


btw, the email notifying you that you have been approved may lag the  
actual approval (mine did for days). might be worth trying a larger  
cluster to see.


Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: Read timed out, Abandoning block blk_-5476242061384228962

2008-05-07 Thread Chris K Wensel

Hi James

Were you able to start all the nodes in the same 'availability zone'?  
You using the new AMI kernels?


If you are using the contrib/ec2 scripts, you might upgrade (just the  
scripts) to

http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.17/src/contrib/ec2/

These support the new kernels and availability zones. My transient  
errors went away when upgrading.


The functional changes are documented here:
http://wiki.apache.org/hadoop/AmazonEC2

fyi, you will need to build your own images (via the create-image  
command) with whatever version of Hadoop you are comfortable with.  
this will also get you a Ganglia install...


ckw

On May 7, 2008, at 1:29 PM, James Moore wrote:


What is this bit of the log trying to tell me, and what sorts of
things should I be looking at to make sure it doesn't happen?

I don't think the network has any basic configuration issues - I can
telnet from the machine creating this log to the destination - telnet
10.252.222.239 50010 works fine when I ssh in to the box with this
error.

2008-05-07 13:20:31,194 INFO org.apache.hadoop.dfs.DFSClient:
Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
2008-05-07 13:20:31,194 INFO org.apache.hadoop.dfs.DFSClient:
Abandoning block blk_-5476242061384228962
2008-05-07 13:20:31,196 INFO org.apache.hadoop.dfs.DFSClient: Waiting
to find target node: 10.252.222.239:50010

I'm seeing a fair number of these.  My reduces finally complete, but
there are usually a couple at the end that take longer than I think
they should, and they frequently have these sorts of errors.

I'm running 20 machines on ec2 right now, with hadoop version 0.16.4.
--
James Moore | [EMAIL PROTECTED]
blog.restphone.com


Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Groovy Scripting for Hadoop

2008-05-05 Thread Chris K Wensel

Hey all

Just wanted to let interested parties know we just released 0.1.0 of  
our Groovy 'builder' extension.


We think this will be a great tool for those groups that need to  
expose Hadoop to the 'casual' user who needs to get and manipulate  
valuable data on a Hadoop cluster, but doesn't have the time to learn  
Java, the Hadoop API, or to think in MapReduce to solve problems that  
are a notch or more above trivial.


It is worth mentioning here that no Groovy code is run in the  
cluster (on the slave nodes). Groovy is only being used as a  
configuration language to allow for the assembly of complex workflows  
to be run on Hadoop.


An introduction:
http://www.cascading.org/documentation/groovy.html

Links to samples included in the distro:

The canonical word count example (.groovy)
http://tinyurl.com/6gp8xp

Or a wide finder example:
http://tinyurl.com/5korhj

These examples don't show it, but splits and joins are fully supported  
(as they are in Cascading). Further, local libraries can be used, but  
there is still work to do to make this transparent.


Please feel free to join our mail-list and post feedback.
http://groups.google.com/group/cascading-user

cheers,
ckw

Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: Best practices for handling many small files

2008-04-23 Thread Chris K Wensel
are the files to be stored on HDFS long term, or do they need to be  
fetched from an external authoritative source?


depending on how things are setup in your datacenter etc...

you could aggregate them into a fat sequence file (or a few). keep in  
mind how long it would take to fetch the files and aggregate them  
(this is a serial process) and if the corpus changes often (how often  
will you need to make these sequence files).


another option is to make a manifest (list of docs to fetch), feed  
that to your mapper and have it fetch each file individually. this  
would be useful if the corpus is reasonably arbitrary between runs and  
could eliminate much of the load time. but painful if the data is  
external to your datacenter and the cost to refetch is high.


there really is no simple answer..
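
for what it's worth, a rough sketch of the 'fat sequence file' packing
(untested; paths are illustrative, and it assumes a FileSystem API with
listStatus(), which newer releases have):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // One block-compressed SequenceFile, keyed by original file name.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/data/packed.seq"),
        Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      for (FileStatus status : fs.listStatus(new Path("/data/small-files"))) {
        byte[] bytes = new byte[(int) status.getLen()];
        FSDataInputStream in = fs.open(status.getPath());
        try {
          in.readFully(bytes);
        } finally {
          in.close();
        }
        writer.append(new Text(status.getPath().getName()),
                      new BytesWritable(bytes));
      }
    } finally {
      writer.close();
    }
  }
}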

ckw


On Apr 23, 2008, at 9:16 AM, Joydeep Sen Sarma wrote:
million map processes are horrible. aside from overhead - don't do  
it if you share the cluster with other jobs (all other jobs will get  
killed whenever the million-map job is finished - see https://issues.apache.org/jira/browse/HADOOP-2393)


well - even for #2 - it begs the question of how the packing itself  
will be parallelized ..


There's a MultiFileInputFormat that can be extended - that allows  
processing of multiple files in a single map job. it needs  
improvement. For one - it's an abstract class - and a concrete  
implementation for (at least)  text files would help. also - the  
splitting logic is not very smart (from what i last saw). ideally -  
it should take the million files and form it into N groups (say N is  
size of your cluster) where each group has files local to the Nth  
machine and then process them on that machine. currently it doesn't  
do this (the groups are arbitrary). But it's still the way to go ..



-Original Message-
From: [EMAIL PROTECTED] on behalf of Stuart Sierra
Sent: Wed 4/23/2008 8:55 AM
To: core-user@hadoop.apache.org
Subject: Best practices for handling many small files

Hello all, Hadoop newbie here, asking: what's the preferred way to
handle large (~1 million) collections of small files (10 to 100KB) in
which each file is a single record?

1. Ignore it, let Hadoop create a million Map processes;
2. Pack all the files into a single SequenceFile; or
3. Something else?

I started writing code to do #2, transforming a big tar.bz2 into a
BLOCK-compressed SequenceFile, with the file names as keys.  Will that
work?

Thanks,
-Stuart, altlaw.org



Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: adding nodes to an EC2 cluster

2008-04-15 Thread Chris K Wensel

Stephen

Check out the patch in Hadoop-2410 to the contrib/ec2 scripts
https://issues.apache.org/jira/browse/HADOOP-2410
(just grab the ec2.tgz attachment)

these scripts allow you to dynamically grow your cluster, plus some  
extra goodies. you will need to use them to build your own AMI; they  
are not compatible with the hadoop-images AMIs.


using them should be self explanatory.

ckw


On Apr 15, 2008, at 7:03 PM, Stephen J. Barr wrote:

Hello,

Does anyone have any experience adding nodes to a cluster running on  
EC2? If so, is there some documentation on how to do this?


Thanks,
-stephen


Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: adding nodes to an EC2 cluster

2008-04-15 Thread Chris K Wensel

I'm unsure of your particular problem.

but the scripts/patch I referenced previously remove any dependency on  
DynDNS.


the recipe would be something like...

make a s3 bucket and update hadoop-ec2-env.sh

make an image:
 hadoop-ec2 create-image

make a 2 node (3 machine) cluster:
 hadoop-ec2 launch-cluster test-group 2

upload your jar file
 hadoop-ec2 push test-group /path/to/your.jar

login:
 hadoop-ec2 login test-group

run your jar
 hadoop jar your.jar

kill your cluster when done:
 hadoop-ec2 terminate-cluster test-group

test-group is the name of the ec2 hadoop cluster group. these scripts  
support many concurrent groups.


that all said, these scripts are untested with cygwin on windows..  
work great on a mac though.


On Apr 15, 2008, at 8:47 PM, Prerna Manaktala wrote:

Hey
Can u solve my problem also.
I am badly stuck up

I tried to set up hadoop with cygwin according to the
paper:http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873
But I had problems working with dyndns.I created a new host
there:prerna.dyndns.org
and gave the ip address of it in hadoop-ec2-env.sh as a value of  
MASTER_HOST.

But when I do bin/hadoop-ec2 start-hadoop the error comes:
ssh:connect to host prerna.dyndns.org,port 22:connection refused
ssh failed for [EMAIL PROTECTED]
Also a warning comes that:id_rsa_gsg-keypair not accesible:No such
file or directory though there is this file.

As of now the status is:
I am working with the EC2 environment.
I registered and am being billed for EC2 and S3.
Right now I have two Cygwin windows open.
One is as an administrator (the server, on which sshd is running), in which I have
a separate folder for the Hadoop files and am able to run bin/hadoop.
The other is as a normal user (the client).
In the client there is no Hadoop folder and hence I can't run bin/hadoop.

From here I do not know how to proceed.
I basically want to implement
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873.

Hence I created a host using DynDNS.
If you can help me, it will be great.

On Tue, Apr 15, 2008 at 11:30 PM, Chris K Wensel [EMAIL PROTECTED]  
wrote:


It's pretty simple. Just create an S3 bucket and note the bucket name in the
hadoop-ec2.sh file (with the other necessary values). Then call

hadoop-ec2 create-image

I'll try and update the wiki soon.

ckw

On Apr 15, 2008, at 7:28 PM, Stephen J. Barr wrote:

Thank you. I will check that out. I haven't built an AMI before. Hopefully
it isn't too complicated, as it is easy to use the pre-built AMIs.


-stephen

Chris K Wensel wrote:


Stephen

Check out the patch in Hadoop-2410 to the contrib/ec2 scripts
https://issues.apache.org/jira/browse/HADOOP-2410
(just grab the ec2.tgz attachment)

These scripts allow you to dynamically grow your cluster, plus some extra
goodies. You will need to use them to build your own AMI; they are not
compatible with the hadoop-images AMIs.


Using them should be self-explanatory.

ckw


On Apr 15, 2008, at 7:03 PM, Stephen J. Barr wrote:


Hello,

Does anyone have any experience adding nodes to a cluster  
running on

EC2? If so, is there some documentation on how to do this?


Thanks,
-stephen



Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/









Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/







Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: Hadoop performance on EC2?

2008-04-11 Thread Chris K Wensel

What does ganglia show for load and network?

You should also be able to see gc stats (count and time). Might help  
as well.


FYI, running
 hadoop-ec2 proxy cluster-name

will both set up a SOCKS tunnel and list available URLs you can cut/paste
into your browser. One of the URLs is for the Ganglia interface.


On Apr 11, 2008, at 2:01 PM, Nate Carlson wrote:

On Wed, 9 Apr 2008, Chris K Wensel wrote:

make sure all nodes are running in the same 'availability zone', 
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1347


check!


and that you are using the new xen kernels.
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1353&categoryID=101
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1354&categoryID=101


check!

also, make sure each node is addressing its peers via the ec2  
private addresses, not the public ones.


check!

There is a patch in JIRA for the ec2/contrib scripts that addresses
these issues:

https://issues.apache.org/jira/browse/HADOOP-2410

If you use those scripts, you will be able to see a Ganglia display
showing utilization on the machines. 8/7 map/reducers sounds like a lot.


Reduced - I dropped it to 3/2 for testing.

I am using these scripts now, and am still seeing very poor  
performance on EC2 compared to my development environment.  ;(


I'll be capturing some more extensive stats over the weekend, and  
see if I can glean anything useful...



| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
|   depriving some poor village of its idiot since 1981   |




Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: Hadoop performance on EC2?

2008-04-09 Thread Chris K Wensel

a few things..

make sure all nodes are running in the same 'availability zone',
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1347

and that you are using the new xen kernels.
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1353&categoryID=101
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1354&categoryID=101

also, make sure each node is addressing its peers via the ec2 private  
addresses, not the public ones.


There is a patch in JIRA for the ec2/contrib scripts that addresses
these issues:

https://issues.apache.org/jira/browse/HADOOP-2410

If you use those scripts, you will be able to see a Ganglia display
showing utilization on the machines. 8/7 map/reducers sounds like a lot.


ymmv

On Apr 9, 2008, at 7:07 PM, Nate Carlson wrote:

Hey all,

We've got a job that we're running in both a development environment and
out on EC2. I've been rather displeased with the performance on EC2, and
was curious if the results that we've been seeing are similar to other
people's, or if I've got something misconfigured.  ;)  In both environments,
the load on the master node is around 1-1.5, and the load on the slave nodes
is around 8-10. I have also tried cranking up the JVM memory on the EC2 nodes
(since we've got RAM to blow), with very little performance difference.


Basically, the job takes about 3.5 hours on development, but takes 15 hours
on EC2. The portion that takes all the time is not dependent on any external
hosts - just the MySQL server on the master node. I benchmarked the VCPUs
between our dev and EC2, and they are about equivalent. I would expect EC2
to take 1.5x as long, since there is one less CPU per slave, but it's taking
much longer than that.


Appreciate any tips!

Similarities between the environments:
- 1 master node, 2 slave nodes
- 1 mapper and reducer on the master, 8 mappers and 7 reducers on the slaves
- Hadoop 0.16.2
- Local HDFS storage (we were using S3 on Amazon before, and I switched to local storage)
- MySQL database running on the master node
- Xen VMs in both environments (our own Xen for dev, Amazon's for EC2)
- Debian Etch 64-bit OS; 64-bit JVM

Development master node configuration:
- 4x VCPUs (Xeon E5335 2GHz)
- 3GB memory
- 4GB swap

Development slave nodes configuration:
- 3x VCPUs (Xeon E5335 2GHz)
- 2GB memory
- 4GB swap

EC2 Configuration (Large instance type):
- 2x VCPUs (Opteron 2GHz)
- 8GB memory
- 4GB swap
- All nodes running in the same availability zone


| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
|   depriving some poor village of its idiot since 1981   |




Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: EC2 contrib scripts

2008-03-28 Thread Chris K Wensel

Make that Xen kernels...

BTW, they scale much better (see previous post) under heavy load. So now,
instead of timeouts and dropped connections, JVM instances exit prematurely.
I'm unsure of the cause of this just yet, but it's so few that the impact is
negligible.


ckw

On Mar 28, 2008, at 10:00 AM, Chris K Wensel wrote:

Hey all

I pushed up a patch (and tar) for the ec2 contrib scripts that provides
support for instance sizes, new Xen kernels, availability zones, concurrent
clusters, resizing, Ganglia, etc.


the patch can be found here:
https://issues.apache.org/jira/browse/HADOOP-2410

I use these daily, but it is likely wise for others to give them a  
shot to make sure they work for someone besides me.


Since I can't publish to the hadoop-images bucket, you will need to  
build your own image stored in your own bucket. This only takes a  
few minutes.


Feedback against the JIRA issue is best.

cheers,
ckw

Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: Performance / cluster scaling question

2008-03-27 Thread Chris K Wensel
FYI, I just ran a 50-node cluster using one of the new kernels for Fedora,
with all nodes forced onto the same 'availability zone', and there were no
timeouts or failed writes.


On Mar 27, 2008, at 4:16 PM, Chris K Wensel wrote:
If it's any consolation, I'm seeing similar behaviors on 0.16.0 when  
running on EC2 when I push the number of nodes in the cluster past 40.


On Mar 24, 2008, at 6:31 AM, André Martin wrote:

Thanks for the clarification, dhruba :-)
Anyway, what can cause those other exceptions such as "Could not get block
locations" and "DataXceiver: java.io.EOFException"? Can anyone give me a
little more insight into those exceptions?
And does anyone have a similar workload (frequent writes and deletions of
small files), and what could cause the performance degradation (see first
post)? I think HDFS should be able to handle two million and more
files/blocks...
Also, I observed that some of my datanodes do not heartbeat to the namenode
for several seconds (up to 400 :-() from time to time - when I check those
specific datanodes and do a top, I see the du command running, which seems
to have gotten stuck?!?

Thanks and Happy Easter :-)

Cu on the 'net,
 Bye - bye,

 André   èrbnA 

dhruba Borthakur wrote:

The namenode lazily instructs a Datanode to delete blocks. As a response to
every heartbeat from a Datanode, the Namenode instructs it to delete a
maximum of 100 blocks. Typically, the heartbeat periodicity is 3 seconds.
The heartbeat thread in the Datanode deletes the block files synchronously
before it can send the next heartbeat. That's the reason a small number
(like 100) was chosen.


If you have 8 datanodes, your system will probably delete about  
800 blocks every 3 seconds.


Thanks,
dhruba

-Original Message-
From: André Martin [mailto:[EMAIL PROTECTED] Sent: Friday,  
March 21, 2008 3:06 PM

To: core-user@hadoop.apache.org
Subject: Re: Performance / cluster scaling question

After waiting a few hours (without any load), the block count and DFS Used
space seem to go down...
My question is: is the hardware simply too weak/slow to send the block
deletion requests to the datanodes in a timely manner, or do those crappy
HDDs simply cause the delay? I noticed that it can take up to 40 minutes to
delete ~400,000 files at once manually using rm -r...
Actually - my main concern is why the performance, i.e. the throughput,
goes down - any ideas?
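
As a rough sanity check against the numbers quoted above - 100 deletions per heartbeat, a 3-second heartbeat, 8 datanodes - and assuming roughly one block per small file (an assumption, not a measurement), the deletion throttle alone puts you in the same ballpark as the 40 minutes observed:

// Back-of-the-envelope only; the first three numbers are the ones quoted
// above, the "one block per file" figure is an assumption.
public class DeleteThrottleEstimate {
  public static void main(String[] args) {
    int datanodes = 8;
    int blocksPerHeartbeat = 100;   // per datanode, per heartbeat
    double heartbeatSeconds = 3.0;
    long blocksToDelete = 400000L;  // ~400,000 small files, ~1 block each
    double blocksPerSecond = datanodes * blocksPerHeartbeat / heartbeatSeconds;
    double minutes = blocksToDelete / blocksPerSecond / 60.0;
    // Prints roughly "267 blocks/s, ~25 minutes" -- the same order of
    // magnitude as the 40 minutes observed, before any disk or du overhead.
    System.out.printf("%.0f blocks/s, ~%.0f minutes%n", blocksPerSecond, minutes);
  }
}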




Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/





Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Chris K Wensel

You can't do this with the contrib/ec2 scripts/AMI.

But passing the master's private DNS name to the slaves on boot as
'user-data' works fine. When a slave starts, it contacts the master and
joins the cluster. There isn't any need for a slave to rsync from the
master, thus removing the dependency on them having the private key. And by
not using the start|stop-all scripts, you don't need to maintain the slaves
file, and can thus lazily boot your cluster.


To do this, you will need to create your own AMI that works this way.
Not hard, just time consuming.
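
A minimal sketch of the boot-time step described above, assuming the master's private DNS name is passed verbatim as the instance user-data. The class name and the config handling mentioned in the comments are illustrative, not part of the contrib/ec2 scripts:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Reads the launch 'user-data' from the EC2 instance metadata service.
// If the master's private DNS name was passed as user-data, a slave's boot
// script can substitute it into hadoop-site.xml (fs.default.name and
// mapred.job.tracker) before starting the datanode/tasktracker, which then
// contact the master and join the cluster.
public class MasterFromUserData {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://169.254.169.254/latest/user-data");
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
    try {
      String master = in.readLine().trim();
      System.out.println("master private DNS: " + master);
    } finally {
      in.close();
    }
  }
}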


On Mar 20, 2008, at 11:56 AM, Prasan Ary wrote:

Chris,
 What do you mean when you say "boot the slaves with the master private name"?



 ===

Chris K Wensel [EMAIL PROTECTED] wrote:
 I found it much better to start the master first, then boot the slaves
with the master private name.

I do not use the start|stop-all scripts, so I do not need to maintain
the slaves file. Thus I don't need to push private keys around to
support those scripts.

This lets me start 20 nodes, then add 20 more later. Or kill some.

BTW, get Ganglia installed. Life will be better knowing what's going
on.

Also, setting up FoxyProxy on Firefox lets you browse your whole
cluster if you set up an SSH tunnel (SOCKS).

On Mar 20, 2008, at 10:15 AM, Prasan Ary wrote:

Hi All,
I have been trying to configure Hadoop on EC2 for a large cluster
(100-plus machines). It seems that I have to copy the EC2 private key
to all the machines in the cluster so that they can have SSH
connections.
For now it seems I have to run a script to copy the key file to
each of the EC2 instances. I wanted to know if there is a better way
to accomplish this.

Thanks,
PA




Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/








Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/





Re: copy - sort hanging

2008-03-14 Thread Chris K Wensel
Tried a new run, and it turns out that there are lots of failed
connections here and there, but the cluster seizing up seems to have
different origins.


I have one job using up all the available mapper tasks, and a series of
queued jobs that will output data to a custom FileSystem.


Noting speculative execution is off: when the first job's mappers 'finish',
it immediately kills 160 of them, but only one shows an error -
"Already completed TIP" - repeated ~40 times.


the copy - sort hangs at this point.

The next jobs in the queue look like they attempt to initialize their
mappers, but a ClassNotFoundException is thrown when it cannot find my
custom FileSystem in the classpath. This FileSystem was used about 10
jobs previously as input to the mappers; it is now the output of future
reducers. See trace below.


I'm wondering if getTaskOutputPath should eat Throwable (instead of
IOException), since the only thing happening here is that the path is
being made fully qualified.
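
For reference, a rough sketch of the change being floated; the method shape is approximated from the stack trace below, not copied from the Hadoop source, and the class/method names are illustrative:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Illustrative only: approximates what getTaskOutputPath() appears to do
// (qualify the job output path against its FileSystem), to show the idea
// of catching Throwable instead of just IOException.
public class OutputPathQualifier {
  public static Path qualify(JobConf conf, String taskRelativePath) {
    Path p = new Path(conf.getOutputPath(), taskRelativePath);
    try {
      // Resolving the FileSystem can throw a RuntimeException wrapping a
      // ClassNotFoundException if a custom scheme's impl class is missing.
      FileSystem fs = p.getFileSystem(conf);
      return p.makeQualified(fs);
    } catch (Throwable t) {  // suggested change: was effectively IOException only
      // Fall back to the unqualified path rather than failing the heartbeat.
      return p;
    }
  }
}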


2008-03-14 12:59:28,053 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 50002, call heartbeat([EMAIL PROTECTED], false, true, 1519) from 10.254.87.130:47919: error: java.io.IOException: java.lang.RuntimeException: java.lang.ClassNotFoundException: cascading.tap.hadoop.S3HttpFileSystem
java.io.IOException: java.lang.RuntimeException: java.lang.ClassNotFoundException: cascading.tap.hadoop.S3HttpFileSystem
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:607)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:161)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
    at org.apache.hadoop.mapred.Task.getTaskOutputPath(Task.java:195)
    at org.apache.hadoop.mapred.Task.setConf(Task.java:400)
    at org.apache.hadoop.mapred.TaskInProgress.getTaskToRun(TaskInProgress.java:733)
    at org.apache.hadoop.mapred.JobInProgress.obtainNewMapTask(JobInProgress.java:568)
    at org.apache.hadoop.mapred.JobTracker.getNewTaskForTaskTracker(JobTracker.java:1409)
    at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1191)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)

On Mar 13, 2008, at 4:59 PM, Chris K Wensel wrote:



I don't really have these logs, as I've bounced my cluster. But I am
willing to ferret out anything in particular on my next failed run.


On Mar 13, 2008, at 4:32 PM, Raghu Angadi wrote:



Yeah, it's kind of hard to deal with these failures once they start
occurring.


Are all these logs from the same datanode? Could you separate the logs
from different datanodes?


If the first exception stack is from replicating a block (as opposed to
the initial write), then http://issues.apache.org/jira/browse/HADOOP-3007
would help there; i.e. a failure on the next datanode should not affect
this datanode, but you still need to check why the remote datanode
failed.


Another problem is that once a DataNode fails to write a block, the
same block can not be written to this node for the next one hour. These
are the "can not be written to" errors you see below. We should
really fix this. I will file a jira.


Raghu.

Chris K Wensel wrote:

here is a reset, followed by three attempts to write the block.
2008-03-13 13:40:06,892 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.26.3:35762 dest: /10.251.26.3:50010
2008-03-13 13:40:06,957 INFO org.apache.hadoop.dfs.DataNode: Exception in receiveBlock for block blk_7813471133156061911 java.net.SocketException: Connection reset
2008-03-13 13:40:06,957 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.net.SocketException: Connection reset
2008-03-13 13:40:06,958 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.net.SocketException: Connection reset
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
    at java.io.DataOutputStream.flush(DataOutputStream.java:106)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.receivePacket(DataNode.java:2194)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2244)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1150)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
    at java.lang.Thread.run(Thread.java:619)
2008-03-13 13:40:11,751 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.27.148:48384 dest: /10.251.27.148:50010
2008-03-13 13:40:11,752 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though

Re: copy - sort hanging

2008-03-13 Thread Chris K Wensel

here is a reset, followed by three attempts to write the block.

2008-03-13 13:40:06,892 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.26.3:35762 dest: /10.251.26.3:50010
2008-03-13 13:40:06,957 INFO org.apache.hadoop.dfs.DataNode: Exception in receiveBlock for block blk_7813471133156061911 java.net.SocketException: Connection reset
2008-03-13 13:40:06,957 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.net.SocketException: Connection reset
2008-03-13 13:40:06,958 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.net.SocketException: Connection reset
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
    at java.io.DataOutputStream.flush(DataOutputStream.java:106)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.receivePacket(DataNode.java:2194)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2244)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1150)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
    at java.lang.Thread.run(Thread.java:619)

2008-03-13 13:40:11,751 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.27.148:48384 dest: /10.251.27.148:50010
2008-03-13 13:40:11,752 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
2008-03-13 13:40:11,752 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:638)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1983)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1074)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
    at java.lang.Thread.run(Thread.java:619)

2008-03-13 13:48:37,925 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.70.210:37345 dest: /10.251.70.210:50010
2008-03-13 13:48:37,925 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
2008-03-13 13:48:37,925 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:638)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1983)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1074)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
    at java.lang.Thread.run(Thread.java:619)

2008-03-13 14:08:36,089 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.26.223:49176 dest: /10.251.26.223:50010
2008-03-13 14:08:36,089 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
2008-03-13 14:08:36,089 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:638)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1983)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1074)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
    at java.lang.Thread.run(Thread.java:619)

On Mar 13, 2008, at 11:25 AM, Chris K Wensel wrote:



Should add that 10.251.65.207 (receiving end of NameSystem.pendingTransfer below) has this datanode log entry.

2008-03-13 14:08:36,089 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
2008-03-13 14:08:36,089 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.io.IOException: Block

Re: copy - sort hanging

2008-03-13 Thread Chris K Wensel


I don't really have these logs, as I've bounced my cluster. But I am
willing to ferret out anything in particular on my next failed run.


On Mar 13, 2008, at 4:32 PM, Raghu Angadi wrote:



Yeah, it's kind of hard to deal with these failures once they start
occurring.


Are all these logs from the same datanode? Could you separate the logs
from different datanodes?


If the first exception stack is from replicating a block (as opposed to
the initial write), then http://issues.apache.org/jira/browse/HADOOP-3007
would help there; i.e. a failure on the next datanode should not affect
this datanode, but you still need to check why the remote datanode failed.


Another problem is that once a DataNode fails to write a block, the
same block can not be written to this node for the next one hour. These
are the "can not be written to" errors you see below. We should
really fix this. I will file a jira.


Raghu.

Chris K Wensel wrote:

here is a reset, followed by three attempts to write the block.
2008-03-13 13:40:06,892 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.26.3:35762 dest: /10.251.26.3:50010
2008-03-13 13:40:06,957 INFO org.apache.hadoop.dfs.DataNode: Exception in receiveBlock for block blk_7813471133156061911 java.net.SocketException: Connection reset
2008-03-13 13:40:06,957 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.net.SocketException: Connection reset
2008-03-13 13:40:06,958 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.net.SocketException: Connection reset
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
    at java.io.DataOutputStream.flush(DataOutputStream.java:106)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.receivePacket(DataNode.java:2194)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2244)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1150)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
    at java.lang.Thread.run(Thread.java:619)
2008-03-13 13:40:11,751 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.27.148:48384 dest: /10.251.27.148:50010
2008-03-13 13:40:11,752 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
2008-03-13 13:40:11,752 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:638)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1983)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1074)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
    at java.lang.Thread.run(Thread.java:619)
2008-03-13 13:48:37,925 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.70.210:37345 dest: /10.251.70.210:50010
2008-03-13 13:48:37,925 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
2008-03-13 13:48:37,925 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:638)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1983)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1074)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
    at java.lang.Thread.run(Thread.java:619)
2008-03-13 14:08:36,089 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_7813471133156061911 src: /10.251.26.223:49176 dest: /10.251.26.223:50010
2008-03-13 14:08:36,089 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_7813471133156061911 received exception java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created.
2008-03-13 14:08:36,089 ERROR org.apache.hadoop.dfs.DataNode: 10.251.65.207:50010:DataXceiver: java.io.IOException: Block blk_7813471133156061911 has already been started (though not completed), and thus cannot be created

Re: S3/EC2 setup problem: port 9001 unreachable

2008-03-10 Thread Chris K Wensel

Andreas

Here are some moderately useful notes on using EC2/S3, mostly learned
leveraging Hadoop. The "groups can't see themselves" issue is listed
<grin>.


http://www.manamplified.org/archives/2008/03/notes-on-using-ec2-s3.html

enjoy
ckw

On Mar 10, 2008, at 9:51 AM, Andreas Kostyrka wrote:


Found it, was security group setup problem ;(

Andreas

Am Montag, den 10.03.2008, 16:49 +0100 schrieb Andreas Kostyrka:

Hi!

I'm trying to set up a Hadoop 0.16.0 cluster on EC2/S3 (manually, not
using the Hadoop AMIs).

I've got the S3-based HDFS working, but I'm stumped when I try to get a
test job running:

[EMAIL PROTECTED]:~/hadoop-0.16.0$ time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -mapper /tmp/test.sh -reducer cat -input testlogs/* -output testlogs-output

additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/tmp/hadoop-hadoop/hadoop-unjar17969/] [] /tmp/streamjob17970.jar tmpDir=null
08/03/10 14:01:28 INFO mapred.FileInputFormat: Total input paths to process : 152
08/03/10 14:02:58 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hadoop/mapred/local]
08/03/10 14:02:58 INFO streaming.StreamJob: Running job: job_200803101400_0001
08/03/10 14:02:58 INFO streaming.StreamJob: To kill this job, run:
08/03/10 14:02:58 INFO streaming.StreamJob: /home/hadoop/hadoop-0.16.0/bin/../bin/hadoop job -Dmapred.job.tracker=ec2-67-202-58-97.compute-1.amazonaws.com:9001 -kill job_200803101400_0001
08/03/10 14:02:58 INFO streaming.StreamJob: Tracking URL: http://ip-10-251-75-165.ec2.internal:50030/jobdetails.jsp?jobid=job_200803101400_0001
08/03/10 14:02:59 INFO streaming.StreamJob:  map 0%  reduce 0%

Furthermore, when I try to connect to port 9001 on 10.251.75.165 via
telnet from the master host itself, it connects:

[EMAIL PROTECTED]:~/hadoop-0.16.0$ telnet 10.251.75.165 9001
Trying 10.251.75.165...
Connected to 10.251.75.165.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

When I try to do this from other VMs in my cluster, it just hangs
(tcpdump on the masterhost shows no activity for tcp port 9001):

[EMAIL PROTECTED]:~/hadoop-0.16.0$ telnet ip-10-251-75-165.ec2.internal 9001
Trying 10.251.75.165...

[EMAIL PROTECTED]:~/hadoop-0.16.0$ telnet ip-10-251-75-165.ec2.internal 22
Trying 10.251.75.165...
Connected to ip-10-251-75-165.ec2.internal.
Escape character is '^]'.
SSH-2.0-OpenSSH_4.3p2 Debian-9
^]
telnet> quit
Connection closed.

This is also shown when I connect to port 50030, which shows 0 nodes
ready to process the job.


Furthermore, the slaves show the following messages:
2008-03-10 15:30:11,455 INFO org.apache.hadoop.ipc.RPC: Problem connecting to server: ec2-67-202-58-97.compute-1.amazonaws.com/10.251.75.165:9001
2008-03-10 15:31:12,465 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ec2-67-202-58-97.compute-1.amazonaws.com/10.251.75.165:9001. Already tried 1 time(s).
2008-03-10 15:32:13,475 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ec2-67-202-58-97.compute-1.amazonaws.com/10.251.75.165:9001. Already tried 2 time(s).


Last but not least, here is my site conf:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<property>
  <name>fs.default.name</name>
  <value>s3://lookhad</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>2DFGTTFSDFDSZU5SDSD7S5202</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>RUWgsdfsd67SFDfsdflaI9Gjzfsd8789ksd2r1PtG</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>ec2-67-202-58-97.compute-1.amazonaws.com:9001</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
</configuration>

The master node is listening, and not on localhost:
[EMAIL PROTECTED]:~/hadoop-0.16.0$ netstat -an | grep 9001
tcp        0      0 10.251.75.165:9001      0.0.0.0:*               LISTEN


Any ideas? My conclusions thus are:

1.) First, it's not a general connectivity problem, because I can
connect to port 22 without any problems.
2.) OTOH, on port 9001, inside the same group, the connectivity  
seems to be limited.
3.) All AWS docs tell me that VMs in one group have no firewalls in  
place.


So what is happening here? Any ideas?

Andreas


Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/





Re: Question about using the metrics framework.

2008-03-06 Thread Chris K Wensel
I have ganglia up on my cluster, and I definitely see some metrics  
from the map/reduce tasks. But I don't see anything from the JVM  
context for ganglia except from the master node.


But I have gmetad running on the master, and reporting from a local
gmond seems flaky.


On Mar 6, 2008, at 2:03 PM, Jason Venner wrote:

We have started our first attempt at this, and do not see the  
metrics being reported.


Our first cut simply tries to report the counters at the end of
the job.


A theory is that the job is exiting before the metrics are flushed.

This code is in the driver for our map/reduce task, and is called  
after the JobClient.runJob() method returns.


    private ContextFactory contextFactory = null;
    private MetricsContext myContext = null;
    private MetricsRecord randomizeRecord = null;
    {
        try {
            contextFactory = ContextFactory.getFactory();
            myContext = contextFactory.getContext("myContext");
            randomizeRecord = myContext.createRecord("MyContextName");
        } catch (Exception e) {
            LOG.error("Unable to initialize metrics context factory", e);
        }
    }

    void reportCounterMetrics(Counters counters) {
        for (String groupName : counters.getGroupNames()) {
            Group group = counters.getGroup(groupName);
            for (String counterName : group.getCounterNames()) {
                randomizeRecord.setMetric(counterName, group.getCounter(counterName));
                LOG.info("reporting counter " + counterName + " " + group.getCounter(counterName));
            }
        }
        randomizeRecord.update();
    }
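
For comparison, a sketch of the push-style pattern this metrics API also supports: register an Updater so the context's own timer emits the record each period. The context and record names follow the snippet above and are otherwise illustrative; whether this helps a short-lived driver depends on it staying alive for at least one metrics period.

import org.apache.hadoop.metrics.ContextFactory;
import org.apache.hadoop.metrics.MetricsContext;
import org.apache.hadoop.metrics.MetricsRecord;
import org.apache.hadoop.metrics.Updater;

// Registers an Updater so the metrics context's timer thread calls back
// and emits the record every period, instead of relying on a single
// update() right before the JVM exits.
public class CounterMetricsUpdater implements Updater {
  private final MetricsRecord record;
  private volatile long lastValue = 0;

  public CounterMetricsUpdater() throws Exception {
    MetricsContext context = ContextFactory.getFactory().getContext("myContext");
    if (!context.isMonitoring()) {
      context.startMonitoring();     // make sure the context's timer is running
    }
    record = context.createRecord("MyContextName");
    context.registerUpdater(this);   // doUpdates() now runs every period
  }

  public void setCounter(long value) {
    lastValue = value;
  }

  // Called periodically by the metrics context on its own thread.
  public void doUpdates(MetricsContext unused) {
    record.setMetric("someCounter", lastValue);
    record.update();
  }
}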



Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/





Re: Question about using the metrics framework.

2008-03-06 Thread Chris K Wensel

Actually, I don't see any JVM metrics across the cluster.

Any idea how to get a local gmond/gmetad to report local statistics?
It is also accumulating slave stats just fine (minus JVM).


On Mar 6, 2008, at 3:09 PM, Jason Venner wrote:

We have tweaked our metrics properties to cause all nodes to report
to the master node, and we see the map/reduce metrics and also the
JVM metrics from all nodes.


The other thing is that this only runs on the master node, as it is  
only run from the driver which currently runs on the master node.


As I look deeper, I believe the metrics are not being flushed when  
the context is shut down, and the flush methods are not implemented  
for the ganglia context.



Chris K Wensel wrote:
I have ganglia up on my cluster, and I definitely see some metrics  
from the map/reduce tasks. But I don't see anything from the JVM  
context for ganglia except from the master node.


But I have gmetad running on the master, and reporting from a local
gmond seems flaky.


On Mar 6, 2008, at 2:03 PM, Jason Venner wrote:

We have started our first attempt at this, and do not see the  
metrics being reported.


Our first cut simply is trying to report the counters at the end  
of the job.


A theory is that the job is exiting before the metrics are flushed.

This code is in the driver for our map/reduce task, and is called  
after the JobClient.runJob() method returns.


    private ContextFactory contextFactory = null;
    private MetricsContext myContext = null;
    private MetricsRecord randomizeRecord = null;
    {
        try {
            contextFactory = ContextFactory.getFactory();
            myContext = contextFactory.getContext("myContext");
            randomizeRecord = myContext.createRecord("MyContextName");
        } catch (Exception e) {
            LOG.error("Unable to initialize metrics context factory", e);
        }
    }

    void reportCounterMetrics(Counters counters) {
        for (String groupName : counters.getGroupNames()) {
            Group group = counters.getGroup(groupName);
            for (String counterName : group.getCounterNames()) {
                randomizeRecord.setMetric(counterName, group.getCounter(counterName));
                LOG.info("reporting counter " + counterName + " " + group.getCounter(counterName));
            }
        }
        randomizeRecord.update();
    }



Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/





Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/





Re: Question about using the metrics framework.

2008-03-06 Thread Chris K Wensel


Never mind on the JVM, just found the typo.. <frown>

On Mar 6, 2008, at 3:19 PM, Chris K Wensel wrote:


Actually, I don't see any JVM metrics across the cluster.

Any idea how to get a local gmond/gmetad to report local statistics?
It is also accumulating slave stats just fine (minus JVM).


On Mar 6, 2008, at 3:09 PM, Jason Venner wrote:

We have tweaked our metrics properties to cause all nodes to report
to the master node, and we see the map/reduce metrics and also the
JVM metrics from all nodes.


The other thing is that this only runs on the master node, as it is  
only run from the driver which currently runs on the master node.


As I look deeper, I believe the metrics are not being flushed when  
the context is shut down, and the flush methods are not implemented  
for the ganglia context.



Chris K Wensel wrote:
I have ganglia up on my cluster, and I definitely see some metrics  
from the map/reduce tasks. But I don't see anything from the JVM  
context for ganglia except from the master node.


But I have gmetad running on the master, and reporting from a local
gmond seems flaky.


On Mar 6, 2008, at 2:03 PM, Jason Venner wrote:

We have started our first attempt at this, and do not see the  
metrics being reported.


Our first cut simply is trying to report the counters at the end  
of the job.


A theory is that the job is exiting before the metrics are flushed.

This code is in the driver for our map/reduce task, and is called  
after the JobClient.runJob() method returns.


    private ContextFactory contextFactory = null;
    private MetricsContext myContext = null;
    private MetricsRecord randomizeRecord = null;
    {
        try {
            contextFactory = ContextFactory.getFactory();
            myContext = contextFactory.getContext("myContext");
            randomizeRecord = myContext.createRecord("MyContextName");
        } catch (Exception e) {
            LOG.error("Unable to initialize metrics context factory", e);
        }
    }

    void reportCounterMetrics(Counters counters) {
        for (String groupName : counters.getGroupNames()) {
            Group group = counters.getGroup(groupName);
            for (String counterName : group.getCounterNames()) {
                randomizeRecord.setMetric(counterName, group.getCounter(counterName));
                LOG.info("reporting counter " + counterName + " " + group.getCounter(counterName));
            }
        }
        randomizeRecord.update();
    }



Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/





Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/





Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/