Re: Direct HDFS access from a streaming job

2011-03-24 Thread Harsh J
There is a C-HDFS API + library (called libhdfs) available @ http://hadoop.apache.org/common/docs/r0.20.2/libhdfs.html. Perhaps you can make your C++ mapper program use that? On Thu, Mar 24, 2011 at 10:56 AM, Keith Wiley kwi...@keithwiley.com wrote: This webpage:

Re: Hadoop Distributed System Problems: Does not recognise any slave nodes

2011-03-24 Thread Harsh J
Also, is your Hadoop really under nutch/search, or is it under nutch/search/hadoop-0.x.x? Set HADOOP_HOME to the exact directory that Hadoop's files sit immediately under. On Thu, Mar 24, 2011 at 1:13 PM, Andy XUE andyxuey...@gmail.com wrote: Hi there: I'm a new user to Hadoop and

Re: CDH and Hadoop

2011-03-24 Thread Steve Loughran
On 23/03/11 15:32, Michael Segel wrote: Rita, It sounds like you're only using Hadoop and have no intentions to really get into the internals. I'm like most admins/developers/IT guys and I'm pretty lazy. I find it easier to set up the yum repository and then issue the yum install hadoop

Hadoop Distributed System Problems: Does not recognise any slave nodes

2011-03-24 Thread Andy XUE
Hi there: I'm a new user to Hadoop and Nutch, and I am trying to run the crawler *Nutch* on a distributed system powered by *Hadoop*. However, as it turns out, the distributed system does not recognise any slave nodes in the cluster. I've been stuck at this point for months and am desperate to look

is there a way to write rows sequentially against 60 reduce tasks?

2011-03-24 Thread JunYoung Kim
hi, I run almost 60 reduce tasks for a single job. if the outputs of a job are from part00 to part59, is there a way to write rows sequentially by sorted keys? currently my outputs are like this. part00) 1 10 12 14 part01) 2 4 6 11 13 part02) 3 5 7 8 9 but, my aim is to get the
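The usual Hadoop answer to this is a total-order partitioner (org.apache.hadoop.mapred.lib.TotalOrderPartitioner, the mechanism TeraSort uses): pick cut points for the key space so that every key sent to reducer i sorts before every key sent to reducer i+1, and concatenating the part files in order then yields a globally sorted result. A minimal Python sketch of that idea, using made-up keys similar to the post's example (the cut points here are hand-picked; Hadoop samples them from the input):

```python
import bisect

def range_partition(keys, cut_points):
    """Assign each key to a partition by comparing it against sorted
    cut points, mimicking what a total-order partitioner does.
    Partition i then receives only keys that sort before every key of
    partition i+1, so concatenating sorted parts in order is globally
    sorted."""
    parts = [[] for _ in range(len(cut_points) + 1)]
    for k in keys:
        # bisect_right: keys < cut stay left of it, keys >= cut go right
        parts[bisect.bisect_right(cut_points, k)].append(k)
    return [sorted(p) for p in parts]  # each "reducer" sorts its own keys

# Toy example with 3 "reducers": cut points 5 and 10 split the key space.
parts = range_partition([1, 10, 12, 14, 2, 4, 6, 11, 13, 3, 5], [5, 10])
flat = [k for p in parts for k in p]  # concatenate part00, part01, part02
print(parts)
print(flat == sorted(flat))  # concatenation is globally sorted
```

With Hadoop streaming the same effect is configured via the partitioner class and a partition file of sampled cut points, rather than code in the mapper.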

Re: Hadoop Distributed System Problems: Does not recognise any slave nodes

2011-03-24 Thread modemide
I'm also new to hadoop, but I was able to get my cluster up and running. I'm not familiar with Nutch though. In any case, my assumption is that Nutch relies on a working hadoop cluster as the base and adds on a few configurations to integrate the two. Here are some things that might help you: *

Re: Direct HDFS access from a streaming job

2011-03-24 Thread Keith Wiley
On Mar 23, 2011, at 11:10 PM, Harsh J wrote: There is a C-HDFS API + library (called libhdfs) available @ http://hadoop.apache.org/common/docs/r0.20.2/libhdfs.html. Perhaps you can make your C++ mapper program use that? Thanks. Actually, I think that with reference to the passage I quoted

Re: Direct HDFS access from a streaming job

2011-03-24 Thread Harsh J
Hello, On Thu, Mar 24, 2011 at 8:45 PM, Keith Wiley kwi...@keithwiley.com wrote: Thanks.  Actually, I think that with reference to the passage I quoted in my first post, the unstated intent was to simply do a system() call and invoke hadoop fs -get or hadoop fs -copyToLocal. Some would
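The system()-call approach Keith describes would look something like the sketch below from a Python streaming task. The helper builds the `hadoop fs -get` argv as a pure function so it can be inspected without a cluster; the actual copy is only illustrative, and note that each such call pays the cost of launching a new `hadoop` client process, which libhdfs avoids:

```python
import subprocess

def hdfs_get_cmd(hdfs_path, local_path):
    """Build the argv for 'hadoop fs -get' (copy an HDFS file to the
    task's local working directory). Kept separate from execution so the
    command can be checked before it is handed to subprocess."""
    return ["hadoop", "fs", "-get", hdfs_path, local_path]

def fetch_side_file(hdfs_path, local_path):
    # Shelling out works from inside a streaming task, but launches a
    # fresh JVM-based client per call; libhdfs keeps a live connection.
    subprocess.check_call(hdfs_get_cmd(hdfs_path, local_path))

# Inspect the command without running it (paths are hypothetical):
print(hdfs_get_cmd("/user/keith/model.bin", "./model.bin"))
```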

Re: Direct HDFS access from a streaming job

2011-03-24 Thread Keith Wiley
On Mar 24, 2011, at 8:31 AM, Harsh J wrote: Hello, On Thu, Mar 24, 2011 at 8:45 PM, Keith Wiley kwi...@keithwiley.com wrote: Thanks. Actually, I think that with reference to the passage I quoted in my first post, the unstated intent was to simply do a system() call and invoke hadoop fs

Re: CDH and Hadoop

2011-03-24 Thread Allen Wittenauer
On Mar 23, 2011, at 7:29 AM, Rita wrote: I have been wondering if I should use CDH (http://www.cloudera.com/hadoop/) instead of the standard Hadoop distribution. What do most people use? Is CDH free? do they provide the tars or does it provide source code and I simply compile? Can I have

Re-generate datanode storageID?

2011-03-24 Thread Marc Leavitt
I am setting up a (very) small Hadoop/CDH3 beta 4 cluster in virtual machines to do some initial feasibility work. I proceeded by progressing through the Cloudera documentation standalone -> pseudo-cluster -> cluster with a single VM and then, when I had it stable(-ish) I copied the VM to a

How do I split input on fixed length keys

2011-03-24 Thread Kevin.Leach
I'm using hadoop streaming and currently have these properties in my command line: -Dstream.map.output.field.separator=' ' \ -Dstream.num.map.output.key.fields=1 \ This works for me as my test data happens to have a space at column 14. If I want to use a fixed length split, is there a simple
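One simple option, since the separator-based properties assume a delimiter character, is to do the fixed-length split inside the streaming mapper itself and emit key<TAB>value, which the framework then sorts on normally. A hedged sketch (the column width 14 is taken from the post; the record content is invented):

```python
import io

KEY_WIDTH = 14  # fixed key length; the post's data has its split at column 14

def split_fixed(line, width=KEY_WIDTH):
    """Split a record at a fixed column instead of at a separator
    character, returning (key, value)."""
    return line[:width], line[width:]

def map_stream(stream, out):
    """Streaming-mapper loop: re-emit each record as key<TAB>value so
    Hadoop's default key/value handling takes over downstream."""
    for line in stream:
        key, value = split_fixed(line.rstrip("\n"))
        out.write("%s\t%s\n" % (key, value))

# Demo on an in-memory record instead of sys.stdin:
out = io.StringIO()
map_stream(io.StringIO("AAAABBBBCCCCDDrest-of-record\n"), out)
print(out.getvalue())
```

In a real job the loop would read sys.stdin and write sys.stdout; once the mapper emits a tab-separated key, the separator properties in the post's command line are no longer needed.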

RE: Program freezes at Map 99% Reduce 33%

2011-03-24 Thread Kevin.Leach
Shi, The key here is the 99% done mapper. Nothing can move on until all mappers complete. Is it possible your data in the larger set has an incomplete record or some such at the end? Kevin -Original Message- From: Shi Yu [mailto:sh...@uchicago.edu] Sent: Thursday, March 24, 2011 3:02

Re: Re-generate datanode storageID?

2011-03-24 Thread Niels Basjes
Hi, To solve that simply do the following on the problematic nodes: 1) Stop the datanode (probably not running) 2) Remove everything inside the .../cache/hdfs/ 3) Start the datanode again. Note: With cloudera always use service way to stop/start hadoop software! service hadoop-0.20-datanode stop

Re: Program freezes at Map 99% Reduce 33%

2011-03-24 Thread Shi Yu
Hi Kevin, thanks for the reply. I could hardly imagine an example of an incomplete record. The mapper is very simple: just reading line by line as Strings, splitting each line by tab, and outputting a Text pair for sort and secondary sort. If there were an incomplete record, there should be an error
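For readers unfamiliar with the pattern Shi is describing: in a secondary sort the framework orders records by a composite (primary, secondary) key but groups reducer input on the primary key alone, so each reduce call sees its values already ordered. A small Python simulation of that behavior (the record tuples are invented for illustration):

```python
from itertools import groupby

def secondary_sort(records):
    """Simulate Hadoop's secondary sort: records are
    (primary, secondary, value) tuples. Sort on the composite
    (primary, secondary) key, then group on the primary key alone,
    so each group's values arrive in secondary-key order."""
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    return [(primary, [r[2] for r in group])
            for primary, group in groupby(ordered, key=lambda r: r[0])]

result = secondary_sort([("b", 2, "y"), ("a", 9, "q"), ("b", 1, "x")])
print(result)  # values for key "b" come out in secondary-key order
```

In a real job the grouping is configured with a grouping comparator rather than done in user code.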

build script?

2011-03-24 Thread Daniel McEnnis
Dear, I have checked out via SVN the Hadoop core code. I am trying to compile it. Is there a build script to work from? Daniel McEnnis.

RE: Program freezes at Map 99% Reduce 33%

2011-03-24 Thread Kevin.Leach
Shi, This states "Of course, the framework discards the sub-directory of unsuccessful task-attempts." http://hadoop-karma.blogspot.com/2011/01/hadoop-cookbook-how-to-write.html So yes, the missing directory is likely a failure. If you can, narrow the problem down by looking at sections of your

Re: CDH and Hadoop

2011-03-24 Thread Eli Collins
Hey Rita, All software developed by Cloudera for CDH is Apache (v2) licensed and freely available. See these docs [1,2] for more info. We publish source packages (which includes the packaging source) and source tarballs, you can find these at http://archive.cloudera.com/cdh/3/. See the

Re: Re-generate datanode storageID?

2011-03-24 Thread Marc Leavitt
Worked perfectly. Thanks Niels! -mgl On Mar 24, 2011, at 12:48 PM, Niels Basjes wrote: Hi, To solve that simply do the following on the problematic nodes: 1) Stop the datanode (probably not running) 2) Remove everything inside the .../cache/hdfs/ 3) Start the datanode again. Note:

Re: Program freezes at Map 99% Reduce 33%

2011-03-24 Thread Shi Yu
Hi Kevin, thanks for the suggestion. I think I found the problem, because my code is a chained map/reduce. In the previous iteration there is a .lzo_deflate output which is 40 times larger than the other files. That was because of a special key value, which has significantly larger occurrences

Re: CDH and Hadoop

2011-03-24 Thread Rita
Thanks everyone for your replies. I knew Cloudera had their release but never knew Y! had one too... On Thu, Mar 24, 2011 at 5:04 PM, Eli Collins e...@cloudera.com wrote: Hey Rita, All software developed by Cloudera for CDH is Apache (v2) licensed and freely available. See these docs

Re: CDH and Hadoop

2011-03-24 Thread David Rosenstrauch
They do, but IIRC, they recently announced that they're going to be discontinuing it. DR On Thu, March 24, 2011 8:10 pm, Rita wrote: Thanks everyone for your replies. I knew Cloudera had their release but never knew Y! had one too... On Thu, Mar 24, 2011 at 5:04 PM, Eli Collins

RE: Program freezes at Map 99% Reduce 33%

2011-03-24 Thread Kevin.Leach
Good. Data skew should not look stuck. Try sending status updates so at least you can tell one mapper is still busy. Yes, adding data or including another field into the key can help reduce data skew. Kevin -Original Message- From: Shi Yu [mailto:sh...@uchicago.edu] Sent: Thursday,
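Kevin's suggestion of "including another field into the key" is usually called key salting: a known hot key is spread across several reducers by appending a salt field, and the buckets are re-merged afterwards (or partial aggregates are combined in a second pass). A hedged sketch of the idea, with hypothetical key names:

```python
import random

def salt_key(key, hot_keys, buckets=8, rng=random):
    """Spread a known hot key across several reduce partitions by
    appending a salt field, so no single reducer receives all of its
    records. Non-hot keys pass through unchanged."""
    if key in hot_keys:
        return "%s#%d" % (key, rng.randrange(buckets))
    return key

def strip_salt(salted):
    """Recover the original key when merging the salted buckets back."""
    return salted.split("#", 1)[0]

# The hot key fans out to one of 4 buckets; a normal key is untouched.
print(salt_key("special", {"special"}, buckets=4))
print(salt_key("ordinary", {"special"}, buckets=4))
```

This only helps when the skewed key's work is divisible (e.g. counting or summing); if every record for the key must be seen together, salting just defers the merge to a later step.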

Re: Test, please respond

2011-03-24 Thread Jon Lederman
yeah i got it. On Mar 22, 2011, at 1:18 PM, Aaron Baff wrote: Does anyone see this? Can someone at least respond to this to indicate that it's getting to the mailing list fine? I've just gotten 0 replies to a few previous emails, so I'm wondering whether nobody is seeing these, or if people

Re: CDH and Hadoop

2011-03-24 Thread suresh srinivas
On Thu, Mar 24, 2011 at 7:04 PM, Rita rmorgan...@gmail.com wrote: Oh! Thanks for the heads up on that... I guess I will go with the cloudera source then On Thu, Mar 24, 2011 at 8:41 PM, David Rosenstrauch dar...@darose.net wrote: They do, but IIRC, they recently announced that they're

Re: build script?

2011-03-24 Thread Harsh J
Hello, On Fri, Mar 25, 2011 at 2:27 AM, Daniel McEnnis dmcen...@gmail.com wrote: Dear, I have checked out via SVN the Hadoop core code.  I am trying to compile it.  Is there a build script to work from? There is an Apache Ant build.xml file bundled along (in the root directory of the