Java inputformat for pipes job

2008-04-08 Thread Rahul Sood
Hi, I implemented a customized input format in Java for a MapReduce job. The mapper and reducer classes are implemented in C++, using the Hadoop Pipes API. The package documentation for org.apache.hadoop.mapred.pipes states that the job may consist of any combination of Java and C++
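
A minimal sketch of what such an input format might look like against the old org.apache.hadoop.mapred API (the PriceInputFormat name is taken from later in this thread; the body below just delegates to the built-in line reader and is an illustration, not the poster's actual code). Packed into a jar, this is the kind of class the -inputformat option mentioned later in the thread points at:

package mytest;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class PriceInputFormat extends FileInputFormat<LongWritable, Text> {
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    reporter.setStatus(split.toString());
    // Delegate to the stock line reader; a real PriceInputFormat would
    // parse price records out of the split here.
    return new LineRecordReader(job, (FileSplit) split);
  }
}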

Re: Java inputformat for pipes job

2008-04-08 Thread 11 Nov.
You should use the -pipes option in the command. For the input format, you can pack it into the Hadoop core classes jar file, or put it into the cache file. 2008/4/8, Rahul Sood [EMAIL PROTECTED]: Hi, I implemented a customized input format in Java for a MapReduce job. The mapper and reducer

Re: Can't get DFS file size

2008-04-08 Thread Michaela Buergle
Maybe because you pass Strings to the LongWritables? micha 11 Nov. wrote: Hi folks, I'm writing a little test program to check the writing speed of the DFS file system, but I can't get the file size using fs.getFileStatus(file).getLen() or fs.getContentLength(file). Here is my code:
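
A minimal sketch of the size check under discussion, assuming a reachable DFS (file name and buffer size are illustrative). One detail that often bites here: the output stream has to be closed before getLen() reports the final length, which also fits the debugger observation in the follow-up below:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsSizeCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/tmp/speedtest");

    FSDataOutputStream out = fs.create(file);
    out.write(new byte[64 * 1024]);  // write some test data
    out.close();                     // the length is only reliable after close()

    long len = fs.getFileStatus(file).getLen();
    System.out.println(file + ": " + len + " bytes");
  }
}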

Re: Can't get DFS file size

2008-04-08 Thread 11 Nov.
I tried to play with the little test by attaching Eclipse to it when it started. What surprised me is that the size could be read in Eclipse, and the result file is written as expected. Can anybody explain this? 2008/4/8, 11 Nov. [EMAIL PROTECTED]: Hi folks, I'm writing a little test program

Sorting the OutputCollector

2008-04-08 Thread Aayush Garg
Hi, I have implemented key and value pairs in the following way: Key (Text class), Value (Custom class): word1 word2 class Custom { int freq; TreeMap<String, ArrayList<String>> } I construct this type of key/value pair in the OutputCollector of the reduce phase. Now I want to SORT this
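
A hedged sketch of what the Custom value class might need to look like as a Hadoop Writable, assuming the fields described above (field and class names are illustrative). Since the framework sorts map output by key only, sorting by freq would then typically mean a second job that promotes freq into the key, as in the Reduce Sort thread below:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.Writable;

public class Custom implements Writable {
  int freq;
  TreeMap<String, ArrayList<String>> entries =
      new TreeMap<String, ArrayList<String>>();

  public void write(DataOutput out) throws IOException {
    out.writeInt(freq);
    out.writeInt(entries.size());
    for (Map.Entry<String, ArrayList<String>> e : entries.entrySet()) {
      out.writeUTF(e.getKey());
      out.writeInt(e.getValue().size());
      for (String s : e.getValue()) {
        out.writeUTF(s);
      }
    }
  }

  public void readFields(DataInput in) throws IOException {
    freq = in.readInt();
    entries.clear();
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      String key = in.readUTF();
      int m = in.readInt();
      ArrayList<String> vals = new ArrayList<String>(m);
      for (int j = 0; j < m; j++) {
        vals.add(in.readUTF());
      }
      entries.put(key, vals);
    }
  }
}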

Newbie asking: ordinary filesystem above Hadoop

2008-04-08 Thread Mika Joukainen
Hi! Yes, I'm aware that it's not a good idea to build an ordinary filesystem on top of Hadoop. Let's say I'm trying to build a system for my users with 500 GB of space for every user. It seems that Hadoop can write/store 500 GB fine, but reading and altering the data later isn't easy (at least not altering). How

Re: Newbie asking: ordinary filesystem above Hadoop

2008-04-08 Thread Andreas Kostyrka
HDFS has slightly different design goals. It's not meant as a general-purpose filesystem; it's meant as the fast sequential input/output storage layer for Hadoop's MapReduce. Andreas On Tuesday, 08.04.2008, 16:24 +0300, Mika Joukainen wrote: Hi! Yes, I'm aware that it's not a good

DFS behavior when the disk goes bad

2008-04-08 Thread Murali Krishna
Hi, We had a bad disk issue in one of the boxes and I am seeing some strange behaviour. Just wanted to confirm whether this is expected... * We are running a small cluster with 10 data nodes and a name node * Each data node has 6 disks * While a job was running,

Re: Java inputformat for pipes job

2008-04-08 Thread Rahul Sood
I'm invoking hadoop with the pipes command: hadoop pipes -jar mytest.jar -inputformat mytest.PriceInputFormat -conf conf/mytest.xml -input mgr/in -output mgr/out -program mgr/bin/TestMgr I tried the -file and -cacheFile options, but when either of these is passed to hadoop pipes the command just

incorrect data check

2008-04-08 Thread Colin Freas
Running a job on my 5-node cluster, I get these intermittent exceptions in my logs: java.io.IOException: incorrect data check at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method) at

Re: secondary namenode web interface

2008-04-08 Thread Yuri Pradkin
I'd be happy to file a JIRA for the bug; I just want to make sure I understand what the bug is: is it the misleading null pointer message, or is it that someone is listening on this port and not doing anything useful? I mean, what is the configuration parameter dfs.secondary.http.address for?

RE: secondary namenode web interface

2008-04-08 Thread dhruba Borthakur
The secondary Namenode uses the HTTP interface to pull the fsimage from the primary. Similarly, the primary Namenode uses dfs.secondary.http.address to pull the checkpointed fsimage back from the secondary to the primary. So the definition of dfs.secondary.http.address is needed. However,

Re: Reduce Sort

2008-04-08 Thread Ted Dunning
On 4/8/08 10:43 AM, Natarajan, Senthil [EMAIL PROTECTED] wrote: I would like to try using Hadoop. That is good for education, probably bad for run time. It could take SECONDS longer to run (oh my). Do you mean to write another MapReduce program which takes the output of the first

RE: Reduce Sort

2008-04-08 Thread Natarajan, Senthil
Thanks Ted. I would like to try using Hadoop. Do you mean to write another MapReduce program which takes the output of the first MapReduce (the already existing file of this format):

IP Add       Count
1.2. 5. 42    27
2.8. 6. 6     24
7.9.24.13      8
7.9. 6. 9    201

And use count as the key and IP
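
A minimal sketch of that second job's mapper, assuming the first job's output lines are tab-separated "IP<TAB>count" (class name and parsing are illustrative). With a single reducer the output comes back globally sorted ascending by count; a descending sort needs a custom output key comparator:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CountAsKeyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {
  public void map(LongWritable offset, Text line,
      OutputCollector<IntWritable, Text> output, Reporter reporter)
      throws IOException {
    String[] fields = line.toString().split("\t");
    if (fields.length == 2) {
      // Emit count as the key so the framework sorts by it,
      // and the IP address as the value.
      output.collect(new IntWritable(Integer.parseInt(fields[1].trim())),
          new Text(fields[0].trim()));
    }
  }
}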

Re: secondary namenode web interface

2008-04-08 Thread Konstantin Shvachko
Yuri, The NullPointerException should be fixed as Dhruba proposed. We do not have any secondary NN web interface as of today. The HTTP server is used for transferring data between the primary and the secondary. I don't see that we can display anything useful on the secondary web UI except for the

New user, several questions/comments (MaxMapTaskFailuresPercent in particular)

2008-04-08 Thread Ian Tegebo
The wiki has been down for more than a day; any ETA? I was going to search the archives for the status, but I'm getting 403s for each of the Archive links on the mailing list page: http://hadoop.apache.org/core/mailing_lists.html My original question was about specifying

Headers and footers on Hadoop output results

2008-04-08 Thread ncardoso
Hello. I'm using Hadoop to process several XML files, each with several XML records, through a group of Linux servers. I am using an XMLInputFormat that I found here in Nabble (http://www.nabble.com/map-reduce-function-on-xml-string-td15816818.html), and I'm using the TextOutputFormat with an
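
One way to get per-file headers and footers is a custom output format whose record writer emits the extra lines when the output file is opened and closed. A hedged sketch against the old mapred API (the class name and the <collection> tags are illustrative, and FileOutputFormat method names shifted a little across 0.x releases):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

public class HeaderFooterOutputFormat extends FileOutputFormat<Text, Text> {
  public RecordWriter<Text, Text> getRecordWriter(FileSystem ignored,
      JobConf job, String name, Progressable progress) throws IOException {
    Path file = FileOutputFormat.getTaskOutputPath(job, name);
    FileSystem fs = file.getFileSystem(job);
    final FSDataOutputStream out = fs.create(file, progress);
    out.write("<collection>\n".getBytes("UTF-8"));      // header
    return new RecordWriter<Text, Text>() {
      public void write(Text key, Text value) throws IOException {
        out.write((key + "\t" + value + "\n").getBytes("UTF-8"));
      }
      public void close(Reporter reporter) throws IOException {
        out.write("</collection>\n".getBytes("UTF-8")); // footer
        out.close();
      }
    };
  }
}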

Re: secondary namenode web interface

2008-04-08 Thread Yuri Pradkin
On Tuesday 08 April 2008 11:54:35 am Konstantin Shvachko wrote: If you have anything in mind that can be displayed on the UI please let us know. You can also file a jira for the issue; it would be good if this discussion is reflected in it. Well, I guess we could have an interface to browse the

Re: New user, several questions/comments (MaxMapTaskFailuresPercent in particular)

2008-04-08 Thread Ted Dunning
Looks like it is up, to me. On 4/8/08 12:36 PM, Ian Tegebo [EMAIL PROTECTED] wrote: The wiki has been down for more than a day; any ETA? I was going to search the archives for the status, but I'm getting 403s for each of the Archive links on the mailing list page:

Re: DFS behavior when the disk goes bad

2008-04-08 Thread Raghu Angadi
The behavior seems correct, assuming blacklisted means the NameNode marked this node 'dead': Murali Krishna wrote: * We are running a small cluster with 10 data nodes and a name node * Each data node has 6 disks * While a job was running, one of the disks in one data node got corrupted

Re: incorrect data check

2008-04-08 Thread Colin Freas
So, in an attempt to track down this problem, I've stripped out most of the files for input, trying to identify which ones are causing the problem. I've narrowed it down, but I can't pinpoint it. I keep getting these incorrect data check errors below, but the .gz files test fine with gzip. Is

Re: incorrect data check

2008-04-08 Thread Norbert Burger
Colin, how about writing a streaming mapper which simply runs md5sum on each file it gets as input? Run this task along with the identity reducer, and you should be able to identify pretty quickly if there's an HDFS corruption issue. Norbert On Tue, Apr 8, 2008 at 5:50 PM, Colin Freas [EMAIL
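
The same idea in Java, sketched as a standalone checker rather than a full streaming job: read each HDFS file named on the command line and print its MD5 for comparison against md5sum of the local copies (class name is illustrative):

import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMd5 {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    for (String name : args) {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      FSDataInputStream in = fs.open(new Path(name));
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = in.read(buf)) > 0) {
        md5.update(buf, 0, n);
      }
      in.close();
      // Print the digest in the same hex format md5sum uses.
      StringBuilder hex = new StringBuilder();
      for (byte b : md5.digest()) {
        hex.append(String.format("%02x", b));
      }
      System.out.println(hex + "  " + name);
    }
  }
}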

Re: New user, several questions/comments (MaxMapTaskFailuresPercent in particular)

2008-04-08 Thread Rick Cox
On Tue, Apr 8, 2008 at 12:36 PM, Ian Tegebo [EMAIL PROTECTED] wrote: My original question was about specifying MaxMapTaskFailuresPercent as a job conf parameter on the command line for streaming jobs. Is there a conf setting like the following? mapred.taskfailure.percent The job
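
For reference, a hedged sketch of the knob being asked about: the old JobConf API exposes it as a setter, which (an assumption about this era) is backed by the mapred.max.map.failures.percent property rather than mapred.taskfailure.percent; for streaming, the same property would go on the command line with -jobconf:

import org.apache.hadoop.mapred.JobConf;

public class FailurePercentExample {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    // Let up to 10% of map tasks fail without failing the whole job.
    job.setMaxMapTaskFailuresPercent(10);
    job.setMaxReduceTaskFailuresPercent(10);
    // Shows which underlying property the setter wrote (assumed name above).
    System.out.println(job.get("mapred.max.map.failures.percent"));
  }
}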

Re: secondary namenode web interface

2008-04-08 Thread Konstantin Shvachko
Unfortunately we do not have an API for the secondary NN that would allow browsing the checkpoint. I agree it would be nice to have one. Thanks for filing the issue. --Konstantin Yuri Pradkin wrote: On Tuesday 08 April 2008 11:54:35 am Konstantin Shvachko wrote: If you have anything in mind

Fuse-j-hadoopfs

2008-04-08 Thread xavier.quintuna
Hi everybody, I have a question about fuse-j-hadoopfs. Does it handle the Hadoop permissions? I'm using Hadoop 0.16.3. Thanks X