I have a very similar question: how do I recursively list all files in a
given directory, so that all files are processed by MapReduce? If I
just copy them to the output, let's say, is there any problem dropping them
all in the same output directory in HDFS? To use a bad example, Windows
Hi,
is there an archive of the messages? I am a newcomer, granted, but Google
Groups has all the discussion capabilities, and it has a searchable archive.
It is strange to have just a mailing list. Am I missing something?
Thank you,
Mark
Hi, esteemed group,
how would I form Maps in MapReduce to recursively look at every file in a
directory, and do something with each file, such as produce a PDF or compute
its hash?
For that matter, Google builds its index using MapReduce, or so the papers
say. First the crawlers store all the
, 2009 at 10:11 PM, Mark Kerzner markkerz...@gmail.com
wrote:
Hi, esteemed group,
how would I form Maps in MapReduce to recursively look at every file in a
directory, and do something with each file, such as produce a PDF or compute
its hash?
For that matter, Google builds its index using
Hi,
there is a performance penalty in Windows (pardon the expression) if you put
too many files in the same directory. The OS becomes very slow, stops seeing
them, and lies about their status to my Java requests. I do not know if this
is also a problem in Linux, but in HDFS - do I need to balance
the correct answer, but this is working
quite well for now and even has some advantages. (No-cost replication from
work to home or offline by rsync or thumb drive, for example.)
flip
On Fri, Jan 23, 2009 at 5:49 PM, Raghu Angadi rang...@yahoo-inc.com
wrote:
Mark Kerzner wrote:
But it would seem
instead of having a lot of
small files?
On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner markkerz...@gmail.com
wrote:
Hi,
there is a performance penalty in Windows (pardon the expression) if you put
too many files in the same directory. The OS becomes very slow, stops seeing
them.
At Attributor, with 2 million blocks on a datanode under XFS, CentOS (i686)
5.1 stock kernels would take 21 minutes even with noatime, on a 6-disk RAID 5
array: 8-way 2.5 GHz Xeons, 8 GB RAM. The RAID controller was a PERC, and the
machine basically served HDFS.
On Sun, Jan 25, 2009 at 1:49 PM, Mark Kerzner markkerz
referenced previously on this list?
Brian
On Jan 25, 2009, at 8:29 PM, Mark Kerzner wrote:
Thank you, Jason, this is awesome information. I am going to use a balanced
directory tree structure, and I am going to make this independent of the
other parts of the system, so that I can
Doug,
SequenceFile looks like a perfect candidate to use in my project, but are
you saying that I better use uncompressed data if I am not interested in
saving disk space?
Thank you,
Mark
On Mon, Jan 26, 2009 at 11:30 AM, Doug Cutting cutt...@apache.org wrote:
Philip (flip) Kromer wrote:
of the box, so you don't have to compress data in your code.
Most of the time, compression not only saves disk space but also improves
performance, because there is less data to write.
Andy
On Mon, Jan 26, 2009 at 12:35 PM, Mark Kerzner markkerz...@gmail.com
wrote:
Doug,
SequenceFile looks like
Thank you, Doug, then all is clear in my head.
Mark
On Mon, Jan 26, 2009 at 3:05 PM, Doug Cutting cutt...@apache.org wrote:
Mark Kerzner wrote:
Okay, I am convinced. I only noticed that Doug, the originator, was not
happy about it - but in open source one has to give up control sometimes
Andrzej,
without a deeper understanding of exactly what you are doing, I have a gut
feeling that a different distributed system might be a better fit for this
specific task. I assume you are dealing with very large graphs if you are
using Hadoop, and you want grid processing. But the linear nature
Oh, hail to the creator of Luke!
Mark
On Thu, Jan 29, 2009 at 11:20 AM, Andrzej Bialecki a...@getopt.org wrote:
Hi,
I'm looking for advice. I need to process a directed graph encoded as a
list of (from, to) pairs. The goal is to compute a list of longest paths in
the graph. There is no
You set it in the conf/hadoop-env.sh file, with an entry like this:
export JAVA_HOME=/usr/lib/jvm/default-java
Mark
On Fri, Jan 30, 2009 at 3:49 PM, zander1013 zander1...@gmail.com wrote:
hi,
i am new to hadoop. i am trying to set it up for the first time as a single
node cluster. at
/java: No such file or directory
bin/hadoop: line 273: exec: /usr/lib/jvm/default-java/bin/java: cannot execute: No such file or directory
a...@node0:~/Hadoop/hadoop-0.19.0$
...
please advise...
Mark Kerzner-2 wrote:
You set it in the conf/hadoop-env.sh file, with an entry like
Hi,
every time I start the HDFS daemons, I need to format it first with
hadoop namenode -format
Why is this? I would expect to have to format it just once.
Thank you,
Mark
Hi,
I am writing an application to copy all files from a regular PC to a
SequenceFile. I can surely do this by simply recursing through all directories
on my PC, but I wonder if there is any way to parallelize this, even as a
MapReduce task. Tom White's book seems to imply that it will have to be a custom
write a local program to write several block compressed SequenceFiles
in parallel (to HDFS), each containing a portion of the files on your
PC.
Tom
On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner markkerz...@gmail.com
wrote:
Truly, I do not see any advantage to doing this, as opposed
the sprawl
flip
On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner markkerz...@gmail.com
wrote:
Hi,
I am writing an application to copy all files from a regular PC to a
SequenceFile. I can surely do this by simply recursing through all directories
on my PC, but I wonder if there is any way
Hi,
I am going through examples in this book (which I have obtained as an early
draft from Safari), and they all work, with occasional fixes. However, the
SequenceFileWriteDemo, even though it works without an error, does not show
the created file when I use this command
hadoop fs -ls /
I remember
Hi all,
I am copying regular binary files to a SequenceFile, and I am using
BytesWritable, to which I am giving the whole byte[] content of the file.
However, once it hits a file larger than my computer's memory, it may have
problems. Is there a better way?
Thank you,
Mark
Hi all,
I am writing to HDFS with this simple code
File[] files = new File(fileDir).listFiles();
for (File file : files) {
    key.set(file.getPath());
    byte[] bytes = new FileUtil().readCompleteFile(file); // poster's own helper, not Hadoop's FileUtil
    writer.append(key, new BytesWritable(bytes));         // assumed completion: append to the SequenceFile writer
}
Hi,
I have written binary files to a SequenceFile, seemingly successfully, but
when I read them back with the code below, after the first few reads I get the
same number of bytes for the different files. What could go wrong?
Thank you,
Mark
reader = new SequenceFile.Reader(fs, path,
#getBytes() to use.
Tom
On Fri, Feb 6, 2009 at 5:41 AM, Mark Kerzner markkerz...@gmail.com
wrote:
Hi,
I have written binary files to a SequenceFile, seemingly successfully, but
when I read them back with the code below, after the first few reads I get the
same number of bytes
It is a good and useful overview, thank you. It also mentions Stuart Sierra's
post, where Stuart mentions that the process is slow. Does anybody know why?
I have written code to write from the PC file system to HDFS, and I also
noticed that it is very slow. Instead of 40M/sec, as promised by the
Hi,
why is hadoop suddenly telling me
Retrying connect to server: localhost/127.0.0.1:8020
with this configuration
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
Brian, I have a similar question: why does transfer from a local filesystem
to a SequenceFile take so long (about 1 second per Meg)?
Thank you,
Mark
On Tue, Feb 10, 2009 at 4:46 PM, Brian Bockelman bbock...@cse.unl.edu wrote:
On Feb 10, 2009, at 4:10 PM, Wasim Bari wrote:
Hi,
Could someone
Hi, the Quick Start
(http://hadoop.apache.org/core/docs/current/quickstart.html) has
this sample configuration
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
but it does not seem to work: even though the daemons do listen on 9000, the
following command always uses 8020
hadoop fs
, at 4:53 PM, Mark Kerzner wrote:
Brian, I have a similar question: why does transfer from a local filesystem
to a SequenceFile take so long (about 1 second per Meg)?
Hey Mark,
I saw your question about speed the other day ... unfortunately, I didn't
have any specific advice so I stayed quiet
, at 11:09 PM, Mark Kerzner wrote:
Brian, large files using command-line hadoop go fast, so it is something
about my computer or network. I won't worry about this now, especially in
light of Amit reporting fast writes and reads.
You're creating files using SequenceFile, right? It might
I say, that's very interesting and useful.
On Tue, Feb 10, 2009 at 11:37 PM, Brian Bockelman bbock...@cse.unl.edu wrote:
Just to toss out some numbers (and because our users are making
interesting numbers right now)
Here's our external network router:
running and terminate them (bin/stop-all.sh should help) and then
restart your cluster with the new configuration to see if that helps.
Later,
Jeff
On Mon, Feb 9, 2009 at 9:48 PM, Amar Kamat ama...@yahoo-inc.com wrote:
Mark Kerzner wrote:
Hi,
why is hadoop suddenly telling
I once had too many open files when I was opening too many sockets and not
closing them...
On Thu, Feb 12, 2009 at 1:56 PM, Sean Knapp s...@ooyala.com wrote:
Hi all,
I'm continually running into the "Too many open files" error on 18.3:
DataXceiveServer: java.io.IOException: Too many open files
I had a problem that it listened only on 8020, even though I told it to use
9000
On Fri, Feb 13, 2009 at 7:50 AM, Norbert Burger norbert.bur...@gmail.com wrote:
On Fri, Feb 13, 2009 at 8:37 AM, Steve Loughran ste...@apache.org wrote:
Michael Lynch wrote:
Hi,
As far as I can tell I've
Hi all,
I consistently have this problem that I can run HDFS and restart it after
short breaks of a few hours, but the next day I always have to reformat HDFS
before the daemons begin to work.
Is that normal? Maybe this is treated as temporary data, and the results
need to be copied out of HDFS
16, 2009 at 8:11 PM, Mark Kerzner markkerz...@gmail.com
wrote:
Hi all,
I consistently have this problem that I can run HDFS and restart it after
short breaks of a few hours, but the next day I always have to reformat HDFS
before the daemons begin to work.
Is that normal
Exactly the same thing happened to me, and Brian gave the same answer. What
if the default is changed to the user's home directory somewhere?
On Mon, Feb 23, 2009 at 10:05 PM, Brian Bockelman bbock...@cse.unl.edu wrote:
Hello,
Where are you saving your data? If it's being written into /tmp,
of multiple computers, e.g., the flow
of data in and out of a distributed filesystem, distributed reliability,
global computations, etc.
So you might use CUDA within mapreduce to more efficiently run
compute-intensive tasks over petabytes of data.
Doug
Mark Kerzner wrote:
Hi, this from
Thank you for pointing this out!
Mark
On Sun, Mar 1, 2009 at 9:40 PM, Brock Palen bro...@umich.edu wrote:
Just want to thank Christophe Bisciglia for taking some time out to speak
with us about Hadoop on our podcast Research Computing and Engineering (
www.rce-cast.com)
You can find the
Yes, that is definitely the coolest of them all.
On Sun, Mar 8, 2009 at 5:11 PM, Jeff Hammerbacher ham...@cloudera.com wrote:
I like MarkMail's excellent service: http://hadoop.markmail.org.
On Sun, Mar 8, 2009 at 2:54 PM, Iman ielgh...@cs.uwaterloo.ca wrote:
You might also want to try the
Hi,
what would be the best place to put temporary files for a reducer? I believe
that since each reducer works on its own machine, at its own time, one can
do anything, but I would like a confirmation from the experts.
Thanks,
Mark
Hi,
does anybody know of an open-source implementation of the Broder algorithm
(http://www.std.org/%7Emsm/common/clustering.html) in Hadoop?
Monika Henzinger reports having done so in MapReduce
(http://ltaa.epfl.ch/monika/mpapers/nearduplicates2006.pdf), and I wonder if
somebody has repeated her work
Yi-Kai,
that's good to know - and I have read this article - but is your code
available?
Thank you,
Mark
On Tue, Mar 24, 2009 at 9:51 AM, Yi-Kai Tsai yi...@yahoo-inc.com wrote:
hi Mark
we had done something on top of hadoop/hbase (mapreduce for evaluation,
hbase for online serving)
by
Hi,
I ran a Hadoop MapReduce task in local mode, reading and writing from
HDFS, and it took 2.5 minutes. Essentially the same operations on the local
file system without MapReduce took 1/2 minute. Is this to be expected?
It seemed that the system lost most of the time in the MapReduce
you,
Mark
On Mon, Apr 20, 2009 at 7:42 AM, Jean-Daniel Cryans jdcry...@apache.org wrote:
Mark,
There is a setup price when using Hadoop: for each task, a new JVM must
be spawned. On such a small scale, you won't see any gain from using MR.
J-D
On Mon, Apr 20, 2009 at 12:26 AM, Mark Kerzner markkerz
for the link - I wish I were at the conference! Anyway, at
this level I have to get my hands dirty, re-read both Hadoop books, and
other articles.
Cheers,
Mark
On Mon, Apr 20, 2009 at 10:24 AM, Arun C Murthy a...@yahoo-inc.com wrote:
On Apr 20, 2009, at 9:56 AM, Mark Kerzner wrote:
Hi,
I
Hi,
in an MR step, I need to extract text from various files (using Tika). I
have put text extraction into reduce(), because I am writing the extracted
text to the output on HDFS. But now it occurs to me that I might as well
have put it into map() and have a default reduce(), which will write every
Hi all,
my guess, as good as anybody's, is that Pregel is to large graphs what
Hadoop is to large datasets. In other words, Pregel is the next natural step
for massively scalable computations after Hadoop. And, as with MapReduce,
Google will talk about the technology but not give out the code
Tom, this is so much right on time! Bravo, Karmasphere.
I installed the plugins, and nothing crashed - in fact, I get the same
screens as the manual promises.
It is worth reading this group - they released the plugin two days ago.
Mark
On Fri, Jun 26, 2009 at 10:13 AM, Tom Wheeler
Christophe,
I am writing my first Hadoop project now, and I have 20 years of consulting,
and I am in Houston. Here is my resume, http://markkerzner.googlepages.com.
I have used EC2.
Sincerely,
Mark
On Fri, Jan 23, 2009 at 4:04 PM, Christophe Bisciglia
christo...@cloudera.com wrote:
Hey all,