You might consider starting with other sequence data like file bytes or DNA.
The main difference between those and click stream is how you model the
steps.
On Jan 30, 2011 1:55 PM, Bruce Williams williams.br...@gmail.com wrote:
Forgot to mention sheet music or tabs as another good source of sequence
data ;)
On Jan 30, 2011 2:45 PM, brien colwell xcolw...@gmail.com wrote:
You might consider starting with other sequence data like file bytes or
DNA.
The main difference between those and click stream is how you model
To get a feel for Hadoop, I'd recommend using Eclipse and using a single
node to start. If you add all the Hadoop JARs to your Eclipse build
path (I think there are five), Eclipse will manage the classpath for
you.
The following config settings will set up Hadoop to use the local file
Personally, it
seems like they gave away too much information before they had the
patent.
I'm not a patent lawyer, but I'd expect they submitted the patent
application or a provisional before they submitted their academic paper or
other public disclosure.
On Wed, Jan 20, 2010 at 12:09 PM,
From memory, some parts of AWT won't run in headless mode. I used to run an
X virtual framebuffer (Xvfb) on servers that created graphics. It's a standard
package on most Linux distros. I forget whether it needed any special
setup, but it might be worth looking into.
On Sun, Jan 17, 2010
Another approach would be to use a custom InputFormat implementation,
with the flag as a property of the input split. Consider wrapping your
InputFormat with something like 'InputFormatWithFlag', that returns
splits that combine the wrapped InputFormat's splits with your flag.
Since
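A minimal sketch of the wrapping idea above, for illustration only. The
Split and SplitSource interfaces here are hypothetical stand-ins, not
Hadoop's InputSplit and InputFormat, and SplitSourceWithFlag plays the role
of the 'InputFormatWithFlag' named in the message.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for an input split (hypothetical, not Hadoop's InputSplit).
interface Split {
    String describe();
}

// Stand-in for an input format that produces splits (hypothetical).
interface SplitSource {
    List<Split> getSplits();
}

// A split that combines the wrapped split with the flag.
final class FlaggedSplit implements Split {
    final Split wrapped;
    final boolean flag;

    FlaggedSplit(Split wrapped, boolean flag) {
        this.wrapped = wrapped;
        this.flag = flag;
    }

    @Override
    public String describe() {
        return wrapped.describe() + " [flag=" + flag + "]";
    }
}

// Delegates split computation to the wrapped source, then tags every split.
final class SplitSourceWithFlag implements SplitSource {
    private final SplitSource delegate;
    private final boolean flag;

    SplitSourceWithFlag(SplitSource delegate, boolean flag) {
        this.delegate = delegate;
        this.flag = flag;
    }

    @Override
    public List<Split> getSplits() {
        List<Split> out = new ArrayList<>();
        for (Split s : delegate.getSplits()) {
            out.add(new FlaggedSplit(s, flag));
        }
        return out;
    }
}
```

In a real job the combined split would presumably also have to serialize
the flag together with the wrapped split so that it reaches the task; that
part is left out of this sketch.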
May be of help ... In my experience there is not a single bottleneck.
Even your tuple representation may disproportionately impact performance.
Too-granular tuples resulting in redundant values will slow down the
shuffle, which does at least 2 serialize/de-serialize ops per tuple.
Benchmarks on my
My first thought is that it depends on the reduce logic. If you could do the
reduction in two passes then you could do an initial arbitrary partition for
the majority key and bring the partitions together in a second reduction (or
a map-side join). I would use a round robin strategy to assign the
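As a rough illustration of the two-pass idea, here is a sketch that assumes
the reduction is a simple sum (so it is associative and commutative; other
reduce logic may not split this way). The class and method names are made
up for the example. Values for the majority key are dealt out round robin
to a fixed number of partitions, each partition is reduced on its own, and
a second pass combines the partial results.

```java
import java.util.List;

final class TwoPassSum {
    // Pass 1: assign values round robin to numPartitions and sum each partition.
    static long[] partialSums(List<Long> values, int numPartitions) {
        long[] partial = new long[numPartitions];
        int next = 0;
        for (long v : values) {
            partial[next] += v;
            next = (next + 1) % numPartitions; // round-robin assignment
        }
        return partial;
    }

    // Pass 2: bring the partitions together into the final result.
    static long combine(long[] partial) {
        long total = 0;
        for (long p : partial) {
            total += p;
        }
        return total;
    }
}
```

With the values 1..5 and two partitions, the first pass yields partial sums
of 9 and 6, and the second pass combines them into 15.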
Do you mean run it on your local machine from Eclipse versus run on a
distributed cluster?
I would guess the Eclipse/Cygwin JVM is 32-bit and the cluster JVM is
64-bit. References occupy more space on a 64-bit JVM. You might want to
check the -XX:+UseCompressedOops VM option.
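One way to confirm the guess is to print what each JVM reports about
itself, both from Eclipse and on a cluster node. This is only a sketch: the
"sun.arch.data.model" property is specific to HotSpot-style JVMs and may be
absent elsewhere, so the code falls back to "unknown". The class name is
made up for the example.

```java
final class JvmBitness {
    // Returns "32" or "64" on JVMs that expose the property, else "unknown".
    static String dataModel() {
        return System.getProperty("sun.arch.data.model", "unknown");
    }

    public static void main(String[] args) {
        System.out.println("data model: " + dataModel());
        System.out.println("os.arch:    " + System.getProperty("os.arch"));
        // Maximum heap the JVM will attempt to use, in bytes.
        System.out.println("max heap:   " + Runtime.getRuntime().maxMemory());
    }
}
```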
It could be that the result can't be written to HDFS. Is there any hint
in the log? I recently encountered this behavior when writing many files
back.
Mike Kendall wrote:
title says it all.. this isn't the first job i've written either. very
confused.
hi all,
Regarding CompositeInputFormat, my experience is that when giving a
directory as an input, the entries from the files in the directory do
not join. Entries join as expected when giving each individual file as
an input. Is this the expected behavior? I would expect both join
hi all,
What would you consider the state of the art for WebDAV integration with
HDFS? I'm having trouble discerning the functionality that aligns with
each patch on HDFS-225 (https://issues.apache.org/jira/browse/HDFS-225).
I've read some patches do not support write operations. Not sure if
Our cygwin/windows nodes are picky about the machines they work on. On
some they are unreliable. On some they work perfectly.
We've had two main issues with cygwin nodes.
Hadoop resolves paths in strange ways, so for example /dir is
interpreted as c:/dir not %cygwin_home%/dir. For SSH to a
Just an idea ... we've had trouble with Hadoop using internal instead of
external addresses on Ubuntu. The data nodes can't connect to the
namenode if it's listening on an internal address. On the namenode, can
you run 'netstat -na'? What address is the namenode daemon bound to?
Steve
We use the fair scheduler because it's easy to configure and easily
understood -- all jobs get an equal share of the resources. The capacity
scheduler has more complex semantics (capacities per queue, how are
resources split between jobs of equal priority within the same queue?,
etc -- see
hi all --
Just wondering how to build the eclipse plugin. ant binary does not
seem to catch it. I would like to experiment with a few changes.
thanks!
Brien