Re: Hadoop with many input/output files?

2009-01-22 Thread Mark Kerzner
I have a very similar question: how do I recursively list all files in a given directory, so that every file is processed by MapReduce? If I just copy them to the output, let's say, is there any problem dropping them all in the same output directory in HDFS? To use a bad example, Windows
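
A minimal sketch of the recursive listing, assuming the FileSystem/FileStatus API of that Hadoop era; the class and method names below are invented for illustration:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DirLister {

        // Recursively collect every file under the given HDFS directory.
        public static List<Path> listRecursively(FileSystem fs, Path dir)
                throws IOException {
            List<Path> files = new ArrayList<Path>();
            for (FileStatus status : fs.listStatus(dir)) {
                if (status.isDir()) {
                    files.addAll(listRecursively(fs, status.getPath()));
                } else {
                    files.add(status.getPath());
                }
            }
            return files;
        }

        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            for (Path p : listRecursively(fs, new Path(args[0]))) {
                System.out.println(p);
            }
        }
    }

The resulting list can then be written to a text file, one path per line, and used as the MapReduce input.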

Archive?

2009-01-22 Thread Mark Kerzner
Hi, is there an archive of the messages? I am a newcomer, granted, but Google Groups has all the discussion capabilities, and it has a searchable archive. It is strange to have just a mailing list. Am I missing something? Thank you, Mark

How-to in MapReduce

2009-01-23 Thread Mark Kerzner
Hi, esteemed group, how would I form Maps in MapReduce to recursively look at every file in a directory, and do something with each file, such as produce a PDF or compute its hash? For that matter, Google builds its index using MapReduce, or so the papers say. First the crawlers store all the
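
One common pattern (an assumption here, not something stated in the thread) is to feed the job a text file with one HDFS path per line, and let each map() call open its file and process it. A hypothetical mapper that computes an MD5 hash per file, using the old mapred API of that era:

    import java.io.IOException;
    import java.io.InputStream;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Input: a text file with one HDFS path per line (key = byte offset,
    // value = path). Output: (path, md5-hex) pairs.
    public class FileHashMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private FileSystem fs;

        public void configure(JobConf job) {
            try {
                fs = FileSystem.get(job);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            MessageDigest md5;
            try {
                md5 = MessageDigest.getInstance("MD5");
            } catch (NoSuchAlgorithmException e) {
                throw new IOException(e.toString());
            }
            InputStream in = fs.open(new Path(value.toString()));
            try {
                byte[] buffer = new byte[64 * 1024];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    md5.update(buffer, 0, read);
                }
            } finally {
                in.close();
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b & 0xff));
            }
            output.collect(value, new Text(hex.toString()));
        }
    }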

Re: How-to in MapReduce

2009-01-23 Thread Mark Kerzner
, 2009 at 10:11 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, esteemed group, how would I form Maps in MapReduce to recursively look at every file in a directory, and do something with each file, such as produce a PDF or compute its hash? For that matter, Google builds its index using

HDFS - millions of files in one directory?

2009-01-23 Thread Mark Kerzner
Hi, there is a performance penalty in Windows (pardon the expression) if you put too many files in the same directory. The OS becomes very slow, stops seeing them, and lies about their status to my Java requests. I do not know if this is also a problem in Linux, but in HDFS - do I need to balance

Re: HDFS - millions of files in one directory?

2009-01-24 Thread Mark Kerzner
the correct answer, but this is working quite well for now and even has some advantages. (No-cost replication from work to home or offline by rsync or thumb drive, for example.) flip On Fri, Jan 23, 2009 at 5:49 PM, Raghu Angadi rang...@yahoo-inc.com wrote: Mark Kerzner wrote: But it would seem

Re: HDFS - millions of files in one directory?

2009-01-25 Thread Mark Kerzner
instead of having a lot of small files? On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner markkerz...@gmail.com wrote: Hi, there is a performance penalty in Windows (pardon the expression) if you put too many files in the same directory. The OS becomes very slow, stops seeing them

Re: HDFS - millions of files in one directory?

2009-01-25 Thread Mark Kerzner
At Attributor, with 2 million blocks on a datanode under XFS, CentOS (i686) 5.1 stock kernels would take 21 minutes with noatime, on a 6-disk RAID 5 array; 8-way 2.5 GHz Xeons, 8 GB RAM. The RAID controller was a PERC and the machine basically served HDFS. On Sun, Jan 25, 2009 at 1:49 PM, Mark Kerzner markkerz

Re: HDFS - millions of files in one directory?

2009-01-25 Thread Mark Kerzner
referenced previously on this list? Brian On Jan 25, 2009, at 8:29 PM, Mark Kerzner wrote: Thank you, Jason, this is awesome information. I am going to use a balanced directory tree structure, and I am going to make this independent of the other parts of the system, so that I can
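
The balanced tree can be as simple as hashing each file name and using the first bytes of the hash as subdirectory names; a small, hypothetical sketch:

    import java.io.UnsupportedEncodingException;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class BalancedPath {

        // Spread files over a two-level tree of 256 x 256 directories using
        // the first two bytes of an MD5 of the name, e.g. ab/cd/report.pdf.
        public static String pathFor(String fileName) {
            try {
                byte[] d = MessageDigest.getInstance("MD5")
                        .digest(fileName.getBytes("UTF-8"));
                return String.format("%02x/%02x/%s", d[0] & 0xff, d[1] & 0xff, fileName);
            } catch (NoSuchAlgorithmException e) {
                throw new RuntimeException(e);
            } catch (UnsupportedEncodingException e) {
                throw new RuntimeException(e);
            }
        }

        public static void main(String[] args) {
            // Prints a path whose two leading directories come from the hash.
            System.out.println(pathFor("report.pdf"));
        }
    }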

Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
Doug, SequenceFile looks like a perfect candidate to use in my project, but are you saying that I'd better use uncompressed data if I am not interested in saving disk space? Thank you, Mark On Mon, Jan 26, 2009 at 11:30 AM, Doug Cutting cutt...@apache.org wrote: Philip (flip) Kromer wrote:

Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
of the box, so you don't have to compress data in your code. Most of the time, compression not only saves disk space but improves performance, because there's less data to write. Andy On Mon, Jan 26, 2009 at 12:35 PM, Mark Kerzner markkerz...@gmail.com wrote: Doug, SequenceFile looks like
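
For reference, block compression is just an option on the writer; a sketch against the 0.18/0.19-era SequenceFile API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class CompressedWriterDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // The only compression-related decision is the CompressionType
            // passed here; the application code never compresses anything.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path(args[0]), Text.class, BytesWritable.class,
                    SequenceFile.CompressionType.BLOCK);
            try {
                writer.append(new Text("example-key"),
                        new BytesWritable("example value".getBytes("UTF-8")));
            } finally {
                writer.close();
            }
        }
    }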

Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
Thank you, Doug, then all is clear in my head. Mark On Mon, Jan 26, 2009 at 3:05 PM, Doug Cutting cutt...@apache.org wrote: Mark Kerzner wrote: Okay, I am convinced. I only noticed that Doug, the originator, was not happy about it - but in open source one has to give up control sometimes

Re: Finding longest path in a graph

2009-01-29 Thread Mark Kerzner
Andrzej, without a deeper understanding of exactly what you are doing, I have a gut feeling that a different distributed system might be a better fit for this specific task. I assume you are dealing with very large graphs if you are using Hadoop, and you want grid processing. But the linear nature

Re: Finding longest path in a graph

2009-01-29 Thread Mark Kerzner
Oh, hail to the creator of Luke! Mark On Thu, Jan 29, 2009 at 11:20 AM, Andrzej Bialecki a...@getopt.org wrote: Hi, I'm looking for advice. I need to process a directed graph encoded as a list of from, to pairs. The goal is to compute a list of longest paths in the graph. There is no

Re: setting JAVA_HOME...

2009-01-30 Thread Mark Kerzner
You set it in the conf/hadoop-env.sh file, with an entry like this: export JAVA_HOME=/usr/lib/jvm/default-java Mark On Fri, Jan 30, 2009 at 3:49 PM, zander1013 zander1...@gmail.com wrote: hi, i am new to hadoop. i am trying to set it up for the first time as a single node cluster. at

Re: setting JAVA_HOME...

2009-01-30 Thread Mark Kerzner
/java: No such file or directory bin/hadoop: line 273: exec: /usr/lib/jvm/default-java/bin/java: cannot execute: No such file or directory a...@node0:~/Hadoop/hadoop-0.19.0$ ... please advise... Mark Kerzner-2 wrote: You set it in the conf/hadoop-env.sh file, with an entry like

HDFS formatting

2009-02-01 Thread Mark Kerzner
Hi, every time I start the HDFS daemons, I need to format the filesystem first with hadoop namenode -format. Why is this? I would expect to have to format it just once. Thank you, Mark

best way to copy all files from a file system to hdfs

2009-02-01 Thread Mark Kerzner
Hi, I am writing an application to copy all files from a regular PC to a SequenceFile. I can surely do this by simply recursing all directories on my PC, but I wonder if there is any way to parallelize this, a MapReduce task even. Tom White's book seems to imply that it will have to be a custom
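
As a baseline before any parallelization, a single-threaded sketch of the copy (an assumed approach, not taken from the thread): walk the local tree and append each file as one record of a block-compressed SequenceFile on HDFS.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class LocalToSequenceFile {

        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path(args[1]), Text.class, BytesWritable.class,
                    SequenceFile.CompressionType.BLOCK);
            try {
                copyDir(new File(args[0]), writer);
            } finally {
                writer.close();
            }
        }

        // Recurse through the local directory and append (path, contents) records.
        private static void copyDir(File dir, SequenceFile.Writer writer)
                throws IOException {
            for (File f : dir.listFiles()) {
                if (f.isDirectory()) {
                    copyDir(f, writer);
                } else {
                    byte[] contents = new byte[(int) f.length()];
                    FileInputStream in = new FileInputStream(f);
                    try {
                        IOUtils.readFully(in, contents, 0, contents.length);
                    } finally {
                        in.close();
                    }
                    writer.append(new Text(f.getPath()), new BytesWritable(contents));
                }
            }
        }
    }

Parallelizing it, as suggested later in the thread, would amount to running several such loops, each writing its own SequenceFile over a different subtree.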

Re: best way to copy all files from a file system to hdfs

2009-02-02 Thread Mark Kerzner
write a local program to write several block compressed SequenceFiles in parallel (to HDFS), each containing a portion of the files on your PC. Tom On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner markkerz...@gmail.com wrote: Truly, I do not see any advantage to doing this, as opposed

Re: best way to copy all files from a file system to hdfs

2009-02-02 Thread Mark Kerzner
the sprawl flip On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I am writing an application to copy all files from a regular PC to a SequenceFile. I can surely do this by simply recursing all directories on my PC, but I wonder if there is any way

Book: Hadoop-The Definitive Guide

2009-02-02 Thread Mark Kerzner
Hi, I am going through examples in this book (which I have obtained as an early draft from Safari), and they all work, with occasional fixes. However, the SequenceFileWriteDemo, even though it works without an error, does not show the created file when I use this command: hadoop fs -ls / I remember

copying binary files to a SequenceFile

2009-02-04 Thread Mark Kerzner
Hi all, I am copying regular binary files to a SequenceFile, and I am using BytesWritable, to which I am giving all the byte[] content of the file. However, once it hits a file larger than my computer's memory, it may have problems. Is there a better way? Thank you, Mark
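
One possible workaround (an assumption, not something proposed in the thread): split each large file into fixed-size chunks and write one record per chunk, keyed by path plus offset, so only one chunk has to be in memory at a time.

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ChunkedAppend {

        private static final int CHUNK_SIZE = 8 * 1024 * 1024;   // 8 MB per record

        // Append one local file as a series of (path#offset, chunk) records.
        public static void appendFile(String path, SequenceFile.Writer writer)
                throws IOException {
            InputStream in = new FileInputStream(path);
            try {
                byte[] buffer = new byte[CHUNK_SIZE];
                long offset = 0;
                int read;
                while ((read = in.read(buffer)) != -1) {
                    BytesWritable chunk = new BytesWritable();
                    chunk.set(buffer, 0, read);
                    writer.append(new Text(path + "#" + offset), chunk);
                    offset += read;
                }
            } finally {
                in.close();
            }
        }
    }

The reader side then has to reassemble chunks by key prefix, which is extra work; whether the trade-off is worth it depends on how often files exceed memory.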

slow writes to HDFS

2009-02-05 Thread Mark Kerzner
Hi all, I am writing to HDFS with this simple code:

    File[] files = new File(fileDir).listFiles();
    for (File file : files) {
        key.set(file.getPath());
        byte[] bytes = new FileUtil().readCompleteFile(file);

can't read the SequenceFile correctly

2009-02-05 Thread Mark Kerzner
Hi, I have written binary files to a SequenceFile, seemingly successfully, but when I read them back with the code below, after the first few reads I get the same number of bytes for different files. What could be going wrong? Thank you, Mark reader = new SequenceFile.Reader(fs, path,

Re: can't read the SequenceFile correctly

2009-02-06 Thread Mark Kerzner
#getBytes() to use. Tom On Fri, Feb 6, 2009 at 5:41 AM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I have written binary files to a SequenceFile, seemingly successfully, but when I read them back with the code below, after the first few reads I get the same number of bytes
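
The symptom described above (the same byte count for different files) is what one typically sees when BytesWritable.getBytes() is used without getLength(): getBytes() returns the whole backing buffer, which only grows as records are reused. A sketch of a read loop that copies only the valid bytes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileReadDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Reader reader =
                    new SequenceFile.Reader(fs, new Path(args[0]), conf);
            try {
                Text key = new Text();
                BytesWritable value = new BytesWritable();
                while (reader.next(key, value)) {
                    // Copy exactly getLength() bytes; the array returned by
                    // getBytes() is usually longer than the record itself.
                    byte[] contents = new byte[value.getLength()];
                    System.arraycopy(value.getBytes(), 0, contents, 0,
                            value.getLength());
                    System.out.println(key + ": " + contents.length + " bytes");
                }
            } finally {
                reader.close();
            }
        }
    }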

Re: using HDFS for a distributed storage system

2009-02-09 Thread Mark Kerzner
It is a good and useful overview, thank you. It also mentions Stuart Sierra's post, where Stuart mentions that the process is slow. Does anybody know why? I have written code to write from the PC file system to HDFS, and I also noticed that it is very slow. Instead of 40M/sec, as promised by the

what's going on :( ?

2009-02-09 Thread Mark Kerzner
Hi, why is Hadoop suddenly telling me Retrying connect to server: localhost/127.0.0.1:8020 with this configuration:

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>

Re: File Transfer Rates

2009-02-10 Thread Mark Kerzner
Brian, I have a similar question: why does a transfer from the local filesystem to a SequenceFile take so long (about 1 second per megabyte)? Thank you, Mark On Tue, Feb 10, 2009 at 4:46 PM, Brian Bockelman bbock...@cse.unl.edu wrote: On Feb 10, 2009, at 4:10 PM, Wasim Bari wrote: Hi, Could someone

could this be an error in hadoop documentation or a bug

2009-02-10 Thread Mark Kerzner
Hi, the Quick Start (http://hadoop.apache.org/core/docs/current/quickstart.html) has this sample configuration:

    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>

but it does not seem to work: even though the daemons do listen on 9000, the following command always uses 8020: hadoop fs

Re: File Transfer Rates

2009-02-10 Thread Mark Kerzner
, at 4:53 PM, Mark Kerzner wrote: Brian, I have a similar question: why does a transfer from the local filesystem to a SequenceFile take so long (about 1 second per megabyte)? Hey Mark, I saw your question about speed the other day ... unfortunately, I didn't have any specific advice so I stayed quiet

Re: File Transfer Rates

2009-02-10 Thread Mark Kerzner
, at 11:09 PM, Mark Kerzner wrote: Brian, large files using command-line hadoop go fast, so it is something about my computer or network. I won't worry about this now, especially in light of Amit reporting fast writes and reads. You're creating files using SequenceFile, right? It might

Re: File Transfer Rates

2009-02-10 Thread Mark Kerzner
I say, that's very interesting and useful. On Tue, Feb 10, 2009 at 11:37 PM, Brian Bockelman bbock...@cse.unl.edu wrote: Just to toss out some numbers (and because our users are making interesting numbers right now) Here's our external network router:

Re: what's going on :( ?

2009-02-12 Thread Mark Kerzner
running and terminate them (bin/stop-all.sh should help) and then restart your cluster with the new configuration to see if that helps. Later, Jeff On Mon, Feb 9, 2009 at 9:48 PM, Amar Kamat ama...@yahoo-inc.com wrote: Mark Kerzner wrote: Hi, why is Hadoop suddenly telling

Re: Too many open files in 0.18.3

2009-02-12 Thread Mark Kerzner
I once had too many open files when I was opening too many sockets and not closing them... On Thu, Feb 12, 2009 at 1:56 PM, Sean Knapp s...@ooyala.com wrote: Hi all, I'm continually running into the Too many open files error on 18.3: DataXceiveServer: java.io.IOException: Too many open files

Re: Namenode not listening for remote connections to port 9000

2009-02-13 Thread Mark Kerzner
I had a problem where it listened only on 8020, even though I told it to use 9000. On Fri, Feb 13, 2009 at 7:50 AM, Norbert Burger norbert.bur...@gmail.com wrote: On Fri, Feb 13, 2009 at 8:37 AM, Steve Loughran ste...@apache.org wrote: Michael Lynch wrote: Hi, As far as I can tell I've

Can never restart HDFS after a day or two

2009-02-16 Thread Mark Kerzner
Hi all, I consistently have this problem that I can run HDFS and restart it after short breaks of a few hours, but the next day I always have to reformat HDFS before the daemons begin to work. Is that normal? Maybe this is treated as temporary data, and the results need to be copied out of HDFS

Re: Can never restart HDFS after a day or two

2009-02-17 Thread Mark Kerzner
16, 2009 at 8:11 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi all, I consistently have this problem that I can run HDFS and restart it after short breaks of a few hours, but the next day I always have to reformat HDFS before the daemons begin to work. Is that normal

Re: hdfs disappears

2009-02-23 Thread Mark Kerzner
Exactly the same thing happened to me, and Brian gave the same answer. What if the default is changed to the user's home directory somewhere? On Mon, Feb 23, 2009 at 10:05 PM, Brian Bockelman bbock...@cse.unl.edu wrote: Hello, Where are you saving your data? If it's being written into /tmp,

Re: How does NVidia GPU compare to Hadoop/MapReduce

2009-02-27 Thread Mark Kerzner
of multiple computers, e.g., the flow of data in and out of a distributed filesystem, distributed reliability, global computations, etc. So you might use CUDA within mapreduce to more efficiently run compute-intensive tasks over petabytes of data. Doug Mark Kerzner wrote: Hi, this from

Re: Thanks to Christophe for Hadoop Featured Pod Cast

2009-03-01 Thread Mark Kerzner
Thank you for pointing this out! Mark On Sun, Mar 1, 2009 at 9:40 PM, Brock Palen bro...@umich.edu wrote: Just want to thank Christophe Bisciglia for taking some time out to speak with us about Hadoop on our podcast Research Computing and Engineering ( www.rce-cast.com) You can find the

Re: OT: How to search mailing list archives?

2009-03-08 Thread Mark Kerzner
Yes, that is definitely the coolest of them all. On Sun, Mar 8, 2009 at 5:11 PM, Jeff Hammerbacher ham...@cloudera.com wrote: I like MarkMail's excellent service: http://hadoop.markmail.org. On Sun, Mar 8, 2009 at 2:54 PM, Iman ielgh...@cs.uwaterloo.ca wrote: You might also want to try the

Temporary files for mappers and reducers

2009-03-15 Thread Mark Kerzner
Hi, what would be the best place to put temporary files for a reducer? I believe that since each reducer works on its own machine, in its own time, one can do anything, but I would like a confirmation from the experts. Thanks, Mark
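
One option (an assumption about the standard task-runner behavior, not something confirmed in the thread): create scratch files in the task's current working directory, which is private to the attempt and removed by the framework when the attempt finishes.

    import java.io.File;
    import java.io.IOException;

    public class ReducerScratch {

        // Minimal sketch: a temp file inside the task attempt's working
        // directory, so the framework's cleanup removes it automatically.
        public static File createScratchFile() throws IOException {
            File workDir = new File(".").getAbsoluteFile();
            return File.createTempFile("scratch-", ".tmp", workDir);
        }
    }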

Broder or other near-duplicate algorithms?

2009-03-23 Thread Mark Kerzner
Hi, does anybody know of an open-source implementation of the Broder algorithm (http://www.std.org/%7Emsm/common/clustering.html) in Hadoop? Monika Henzinger reports having done so in MapReduce (http://ltaa.epfl.ch/monika/mpapers/nearduplicates2006.pdf), and I wonder if somebody has repeated her work

Re: Broder or other near-duplicate algorithms?

2009-03-24 Thread Mark Kerzner
Yi-Kai, that's good to know - and I have read this article - but is your code available? Thank you, Mark On Tue, Mar 24, 2009 at 9:51 AM, Yi-Kai Tsai yi...@yahoo-inc.com wrote: hi Mark we had done something on top of hadoop/hbase (mapreduce for evaluation , hbase for online serving ) by

Performance question

2009-04-19 Thread Mark Kerzner
Hi, I ran a Hadoop MapReduce task in the local mode, reading and writing from HDFS, and it took 2.5 minutes. Essentially the same operations on the local file system without MapReduce took 1/2 minute. Is this to be expected? It seemed that the system lost most of the time in the MapReduce

Re: Performance question

2009-04-20 Thread Mark Kerzner
you, Mark On Mon, Apr 20, 2009 at 7:42 AM, Jean-Daniel Cryans jdcry...@apache.org wrote: Mark, There is a setup price when using Hadoop: for each task a new JVM must be spawned. On such a small scale, you won't see any benefit from using MR. J-D On Mon, Apr 20, 2009 at 12:26 AM, Mark Kerzner markkerz

Re: Performance question

2009-04-20 Thread Mark Kerzner
for the link - I wish I were at the conference! Anyway, at this level I have to get my hands dirty, re-read both Hadoop books, and other articles. Cheers, Mark On Mon, Apr 20, 2009 at 10:24 AM, Arun C Murthy a...@yahoo-inc.com wrote: On Apr 20, 2009, at 9:56 AM, Mark Kerzner wrote: Hi, I

Put computation in Map or in Reduce

2009-04-20 Thread Mark Kerzner
Hi, in an MR step, I need to extract text from various files (using Tika). I have put the text extraction into reduce(), because I am writing the extracted text to the output on HDFS. But now it occurs to me that I might as well have put it into map() and used the default reduce(), which will write every
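
A hypothetical job setup illustrating the map-side choice with the old mapred API; the mapper body here is only a stand-in, and real code would invoke Tika where the comment indicates:

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class ExtractTextJob {

        // Placeholder mapper: a real implementation would run Tika here and
        // emit the extracted text instead of the raw input line.
        public static class ExtractTextMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException {
                String extracted = value.toString();   // stand-in for Tika output
                output.collect(new Text("doc-" + key.get()), new Text(extracted));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(ExtractTextJob.class);
            conf.setJobName("extract-text");
            conf.setMapperClass(ExtractTextMapper.class);
            // Nothing to aggregate on the reduce side: either keep the identity
            // reducer (output grouped and sorted by key) ...
            conf.setReducerClass(IdentityReducer.class);
            // ... or drop the reduce/shuffle phase entirely:
            // conf.setNumReduceTasks(0);
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }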

Pregel

2009-06-25 Thread Mark Kerzner
Hi all, my guess, as good as anybody's, is that Pregel is to large graphs what Hadoop is to large datasets. In other words, Pregel is the next natural step for massively scalable computations after Hadoop. And, as with MapReduce, Google will talk about the technology but not give out the code

Re: grahical tool for hadoop mapreduce

2009-06-26 Thread Mark Kerzner
Tom, this is right on time! Bravo, Karmasphere. I installed the plugins, and nothing crashed - in fact, I get the same screens as the manual promises. It is worth reading this group - they released the plugin two days ago. Mark On Fri, Jun 26, 2009 at 10:13 AM, Tom Wheeler

Re: hadoop consulting?

2009-01-23 Thread Mark Kerzner - SHMSoft
Christophe, I am writing my first Hadoop project now, I have 20 years of consulting experience, and I am in Houston. Here is my resume: http://markkerzner.googlepages.com. I have used EC2. Sincerely, Mark On Fri, Jan 23, 2009 at 4:04 PM, Christophe Bisciglia christo...@cloudera.com wrote: Hey all,