Re: Click Stream Data

2011-01-30 Thread brien colwell
You might consider starting with other sequence data like file bytes or DNA. The main difference between those and click stream is how you model the steps. On Jan 30, 2011 1:55 PM, Bruce Williams williams.br...@gmail.com wrote:

Re: Click Stream Data

2011-01-30 Thread brien colwell
Forgot to mention sheet music or tabs as another good source of sequence data ;) On Jan 30, 2011 2:45 PM, brien colwell xcolw...@gmail.com wrote: You might consider starting with other sequence data like file bytes or DNA. The main difference between those and click stream is how you model

Re: Not able to compile '.java' files

2010-02-05 Thread brien colwell
To get a feel for Hadoop, I'd recommend starting with Eclipse and a single node. If you add all the Hadoop JARs to your Eclipse build path (I think there are five), Eclipse will manage the classpath for you. The following config settings will set up Hadoop to use the local file

Re: Google has obtained the patent over mapreduce

2010-01-20 Thread brien colwell
Personally, it seems like they gave away too much information before they had the patent. I'm not a patent lawyer, but I'd expect they submitted the patent application or a provisional before they submitted their academic paper or other public disclosure. On Wed, Jan 20, 2010 at 12:09 PM,

Re: Hadoop and X11 related error

2010-01-18 Thread brien colwell
From memory, some parts of AWT won't run in headless mode. I used to run an X virtual framebuffer (Xvfb) on servers that created graphics. It's a standard package on most Linux distros. I forget if there was something special needed to set it up, but it might be worth looking into. On Sun, Jan 17, 2010

Re: Small doubt in MR

2010-01-02 Thread brien colwell
Another approach would be to use a custom InputFormat implementation, with the flag as a property of the input split. Consider wrapping your InputFormat with something like 'InputFormatWithFlag', that returns splits that combine the wrapped InputFormat's splits with your flag. Since

Re: io performance

2009-11-30 Thread brien colwell
May be of help ... In my experience there is not a single bottleneck. Even your tuple representation may disproportionately impact performance. Too-granular tuples resulting in redundant values will slow down the shuffle, which does at least 2 serialize/de-serialize ops per tuple. Benchmarks on my

Re: How to handle imbalanced data in hadoop ?

2009-11-15 Thread brien colwell
My first thought is that it depends on the reduce logic. If you could do the reduction in two passes then you could do an initial arbitrary partition for the majority key and bring the partitions together in a second reduction (or a map-side join). I would use a round robin strategy to assign the
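The two-pass idea above can be sketched in plain Java, without any Hadoop classes. This is only an illustration under stated assumptions: the reduction is a sum (so partial results can be combined safely), the majority key is known in advance, keys never contain the '#' character used here as a bucket suffix, and the names `TwoPassSkewSum`, `firstPass`, and `secondPass` are made up for the example. It needs Java 9 or later for `Map.entry`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoPassSkewSum {

    // Pass 1: spread the majority key over `buckets` sub-keys in round-robin
    // order, then sum per (sub-)key. Other keys pass through unchanged.
    static Map<String, Long> firstPass(List<Map.Entry<String, Long>> records,
                                       String hotKey, int buckets) {
        Map<String, Long> partial = new HashMap<>();
        int next = 0;
        for (Map.Entry<String, Long> r : records) {
            String key = r.getKey();
            if (key.equals(hotKey)) {
                key = key + "#" + next;          // hypothetical suffix scheme
                next = (next + 1) % buckets;     // round-robin assignment
            }
            partial.merge(key, r.getValue(), Long::sum);
        }
        return partial;
    }

    // Pass 2: strip the bucket suffix and combine the partial sums.
    static Map<String, Long> secondPass(Map<String, Long> partial) {
        Map<String, Long> total = new HashMap<>();
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            String key = e.getKey();
            int cut = key.lastIndexOf('#');
            String original = cut >= 0 ? key.substring(0, cut) : key;
            total.merge(original, e.getValue(), Long::sum);
        }
        return total;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            records.add(Map.entry("hot", 1L));
        }
        records.add(Map.entry("rare", 5L));

        Map<String, Long> partial = firstPass(records, "hot", 3);
        System.out.println(secondPass(partial));
    }
}
```

In a real job each pass would be its own reduce step, with the bucket suffix deciding which reducer receives a record. The sketch only shows why the result is unchanged when the reduction can be split this way.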

Re: Hadoop Cluster Error

2009-11-15 Thread brien colwell
Do you mean running it on your local machine from Eclipse versus running it on a distributed cluster? I would guess the Eclipse/Cygwin JVM is 32-bit and the cluster JVM is 64-bit. References occupy more space on a 64-bit JVM. You might want to check the -XX:+UseCompressedOops VM option.
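One quick way to test the 32-bit versus 64-bit guess is to print what each JVM reports about itself and compare the output from Eclipse with the output from a cluster node. A minimal sketch, assuming a HotSpot-style JVM: the `sun.arch.data.model` property is vendor-specific and may be missing on other JVMs (the code falls back to "unknown"), and the class name `JvmInfo` is made up for illustration.

```java
public class JvmInfo {

    // Summarise the JVM's pointer width, CPU architecture, and maximum heap.
    static String describeJvm() {
        // Vendor-specific property; typically "32" or "64" on HotSpot JVMs.
        String dataModel = System.getProperty("sun.arch.data.model", "unknown");
        // Standard property, e.g. "x86" or "amd64".
        String arch = System.getProperty("os.arch");
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        return "data.model=" + dataModel
                + " os.arch=" + arch
                + " maxHeapMB=" + maxHeapMb;
    }

    public static void main(String[] args) {
        System.out.println(describeJvm());
    }
}
```

If the two environments report different data models, running the same job with the same heap size leaves less usable space on the 64-bit side, which is where -XX:+UseCompressedOops may help.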

Re: what does it mean when a job fails at 100%?

2009-11-13 Thread brien colwell
It could be that the result can't be written to HDFS. Is there any hint in the log? I recently encountered this behavior when writing many files back. Mike Kendall wrote: title says it all.. this isn't the first job i've written either. very confused.

map-side join with directories

2009-10-16 Thread brien colwell
hi all, Regarding CompositeInputFormat, my experience is that when giving a directory as an input, the entries from the files in the directory do not join. Entries join as expected when giving each individual file as an input. Is this the expected behavior? I would expect both join

state of the art WebDAV + HDFS

2009-10-06 Thread brien colwell
hi all, What would you consider the state of the art for WebDAV integration with HDFS? I'm having trouble discerning the functionality that aligns with each patch on HDFS-225 (https://issues.apache.org/jira/browse/HDFS-225). I've read some patches do not support write operations. Not sure if

Re: Hadoop on Windows

2009-09-17 Thread brien colwell
Our cygwin/windows nodes are picky about the machines they work on. On some they are unreliable. On some they work perfectly. We've had two main issues with cygwin nodes. Hadoop resolves paths in strange ways, so for example /dir is interpreted as c:/dir not %cygwin_home%/dir. For SSH to a

Re: hadoop 0.20.0 jobtracker.info could only be replicated to 0 nodes

2009-09-10 Thread brien colwell
Just an idea ... we've had trouble with Hadoop using internal instead of external addresses on Ubuntu. The data nodes can't connect to the namenode if it's listening on an internal address. On the namenode can you run 'netstat -na' ? What address is the namenode daemon bound to? Steve

Re: Choosing a scheduler

2009-09-09 Thread brien colwell
We use the fair scheduler because it's easy to configure and easily understood -- all jobs get an equal share of the resources. The capacity scheduler has more complex semantics (capacities per queue, how are resources split between jobs of equal priority within the same queue?, etc -- see

building the eclipse plugin

2009-06-28 Thread brien colwell
hi all -- Just wondering how to build the eclipse plugin. ant binary does not seem to catch it. I would like to experiment with a few changes. thanks! Brien