Hadoop 0.19.2 Contrib DataJoin

2010-01-28 Thread Alex Parvulescu
Hello, I am using Hadoop 0.19.2 and DataJoin (contrib/datajoin), and I'd like to know if this is still maintained by anyone, or if there is a wiki page or something where I could get more info. I was looking at the Hadoop 0.21 release and it seems that this part of the code did not change. I'd like

Input file format doubt

2010-01-28 Thread Udaya Lakshmi
Hi all.. I have searched the documentation but could not find an input file format which will give the line number as the key and the line as the value. Did I miss something? Can someone give me a clue about how to implement such an input file format? Thanks, Udaya.

Re: Input file format doubt

2010-01-28 Thread Amogh Vasekar
Hi, For global line numbers, you would need to know the ordering within each split generated from the input file. The standard input formats provide offsets in splits, so if the records are of equal length you can compute some kind of numbering. I remember someone had implemented sequential
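To illustrate Amogh's point, here is a minimal, untested sketch of a record reader that derives a global line number from the byte offset, assuming every line (newline included) has the same fixed byte length. The config key example.fixed.record.length is hypothetical; LineRecordReader is the standard 0.19/0.20 reader it wraps.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.LineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;

    public class LineNumberRecordReader implements RecordReader<LongWritable, Text> {
        private final LineRecordReader reader;
        private final long recordLength;
        private final LongWritable byteOffset = new LongWritable();

        public LineNumberRecordReader(JobConf job, FileSplit split) throws IOException {
            reader = new LineRecordReader(job, split);
            // Hypothetical job parameter: fixed bytes per line, newline included.
            recordLength = job.getLong("example.fixed.record.length", 1);
        }

        public boolean next(LongWritable key, Text value) throws IOException {
            if (!reader.next(byteOffset, value)) {
                return false;
            }
            // With fixed-length records, line number = byte offset / record length.
            key.set(byteOffset.get() / recordLength);
            return true;
        }

        public LongWritable createKey() { return new LongWritable(); }
        public Text createValue() { return new Text(); }
        public long getPos() throws IOException { return reader.getPos(); }
        public float getProgress() throws IOException { return reader.getProgress(); }
        public void close() throws IOException { reader.close(); }
    }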

Re: Cleanup Attempt in Map Task

2010-01-28 Thread Jeff Zhang
One easy way is to increase the timeout by setting mapred.task.timeout in mapred-site.xml On Thu, Jan 28, 2010 at 5:59 PM, #YONG YONG CHENG# aarnc...@pmail.ntu.edu.sg wrote: Good Day, Is there any way to control the cleanup attempt of a failed map task without changing the Hadoop
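For a per-job rather than cluster-wide change, the same property can also be set on the job's configuration; a minimal sketch (the 20-minute value is just an example):

    import org.apache.hadoop.mapred.JobConf;

    public class TimeoutExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // Milliseconds a task may go without reading input, writing output,
            // or reporting status before the framework kills it; 0 disables it.
            conf.setLong("mapred.task.timeout", 20 * 60 * 1000L);
        }
    }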

Re: Input file format doubt

2010-01-28 Thread Ravi
Thank you Amogh. On Thu, Jan 28, 2010 at 3:44 PM, Amogh Vasekar am...@yahoo-inc.com wrote: Hi, For global line numbers, you would need to know the ordering within each split generated from the input file. The standard input formats provide offsets in splits, so if the records are of equal

Re: Input file format doubt

2010-01-28 Thread Ravi
I too had this doubt but could not find a solution. Please post the code if you can find it. On Thu, Jan 28, 2010 at 4:03 PM, Ravi ravindra.babu.rav...@gmail.com wrote: Thank you Amogh. On Thu, Jan 28, 2010 at 3:44 PM, Amogh Vasekar am...@yahoo-inc.com wrote: Hi, For global line numbers,

Re: Input file format doubt

2010-01-28 Thread Amogh Vasekar
Hi, Here's the relevant thread with Gordon, the author of the solution: I am in the process of learning Hadoop (and I think I've made a lot of progress). I have described the specific problem and solution on my blog

Re: Input file format doubt

2010-01-28 Thread Ravi
Thank you Amogh. Ravi. On 1/28/10, Amogh Vasekar am...@yahoo-inc.com wrote: Hi, Here's the relevant thread with Gordon, the author of the solution: I am in the process of learning Hadoop (and I think I've made a lot of progress). I have described the specific problem and solution on my blog

Re: Input file format doubt

2010-01-28 Thread Udaya Lakshmi
Thank you Amogh. I will go through the link. Udaya. On 1/28/10, Ravi ravindra.babu.rav...@gmail.com wrote: Thank you Amogh. Ravi. On 1/28/10, Amogh Vasekar am...@yahoo-inc.com wrote: Hi, Here's the relevant thread with Gordon, the author of the solution: I am in the process of learning

Re: Too many fetch-failures - reduce task problem

2010-01-28 Thread Nachiket Vaidya
After adding the hostnames of the master and slaves to /etc/hosts and removing the entry for 127.0.1.1, it worked. I had always been specifying IP addresses instead of hostnames in the conf files, but Hadoop uses the IP address only at startup; for all other operations it uses the hostname. So I added the IP address in
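For reference, a hypothetical /etc/hosts layout along the lines Nachiket describes (names and addresses are illustrative only):

    192.168.1.10  master
    192.168.1.11  slave1
    192.168.1.12  slave2
    # the Debian/Ubuntu-style "127.0.1.1 <hostname>" line has been removed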

map(K1 key, V1 value, OutputCollector&lt;K2, V2&gt; output, Reporter reporter) deprecated in 0.20.2?

2010-01-28 Thread steven zhuang
hello, all, As a newbie, I had become used to the (k1,v1,k2,v2)-style parameter list for the map and reduce methods in the mapper and reducer (as is written in many books), but after several failures, I found that in 0.20+, if we extend the base class org.apache.hadoop.mapreduce.Mapper, the
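For readers hitting the same wall: in the new 0.20 API the OutputCollector/Reporter pair is replaced by a single Context argument. A minimal word-count-style sketch (class and field names are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // context.write() replaces OutputCollector.collect(); progress
            // reporting and counters also hang off Context.
            for (String token : value.toString().split("\\s+")) {
                context.write(new Text(token), ONE);
            }
        }
    }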

Re: Failed to install Hadoop on WinXP

2010-01-28 Thread Yang Li
I met the same problem on WinXP+Cygwin and fixed it by either: - moving to a Linux box (VMware works very well) or: - configuring the mapred.child.tmp parameter in core-site.xml. I cannot explain why or how mapred.child.tmp is related to the problem. From the source code, it seems to be a JVM issue on

Re: fine granularity operation on HDFS

2010-01-28 Thread Gang Luo
Thanks Amogh. For the second part of my question, I actually meant loading blocks separately from HDFS. I don't know whether that is realistic. Anyway, since my goal is to process different divisions of a file separately, doing that at the split level is OK. But even if I can get the splits from

Re: When exactly is combiner invoked?

2010-01-28 Thread Gang Luo
Hi Le, I don't think mapreduce can completely combine all the records with the same key into one record. One situation is when min.num.spills.for.combine is set too high: if a map task produces fewer spills than that threshold, the combiner will not be invoked on those records, even when they share the same key. Actually, I
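The practical consequence is that a combiner must be safe to run zero or more times. A minimal sketch of wiring one in with the 0.19/0.20 mapred API (LongSumReducer is the stock summing reducer shipped with Hadoop):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.LongSumReducer;

    public class CombinerExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // The framework treats the combiner as an optional optimization,
            // so it must be associative and commutative; the job's output
            // has to be correct even if the combiner is never invoked.
            conf.setCombinerClass(LongSumReducer.class);
        }
    }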

Fileformat query

2010-01-28 Thread Udaya Lakshmi
Hi all.. I have searched the documentation but could not find an input file format which will give the line number as the key and the line as the value. Did I miss something? Can someone give me a clue about how to implement such an input file format? Thanks, Udaya.

Re: fine granularity operation on HDFS

2010-01-28 Thread Amogh Vasekar
Hi Gang, Yes, PathFilters work only on file paths. I meant you can include that type of logic at the split level. The input format's getSplits() method is responsible for computing and adding splits to a list container, from which the JobTracker initializes mapper tasks. You can override the getSplits() method
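A minimal, untested sketch of the kind of override Amogh describes; the cutoff property and the filtering criterion are hypothetical, purely to show where the logic hooks in:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class FilteringInputFormat extends TextInputFormat {
        @Override
        public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
            // Hypothetical parameter: only map the part of each file below this offset.
            long cutoff = job.getLong("example.split.offset.cutoff", Long.MAX_VALUE);
            List<InputSplit> kept = new ArrayList<InputSplit>();
            for (InputSplit split : super.getSplits(job, numSplits)) {
                if (((FileSplit) split).getStart() < cutoff) {
                    kept.add(split);
                }
            }
            return kept.toArray(new InputSplit[kept.size()]);
        }
    }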

Re: Installing in local Maven repository

2010-01-28 Thread Stuart Sierra
On Wed, Jan 27, 2010 at 3:08 PM, Ryan Smith ryan.justin.sm...@gmail.com wrote: If you just want to use hadoop jars in your maven projects, run your own caching archive repository manager like Nexus. What I really want is to publish my own projects with the correct dependencies, using artifacts

Re: Installing in local Maven repository

2010-01-28 Thread Ryan Smith
SS, Unless I'm grossly mistaken, Nexus does exactly this. I have my own projects that use Hadoop jars. I can easily add custom patched versions of Hadoop too. These Hadoop jars aren't in Maven Central, though; they're in my own instance of Nexus. When I go into my custom Hadoop project and type:

Re: Fileformat query

2010-01-28 Thread Edward Capriolo
On Thu, Jan 28, 2010 at 4:01 AM, Udaya Lakshmi udaya...@gmail.com wrote: Hi all.. I have searched the documentation but could not find an input file format which will give the line number as the key and the line as the value. Did I miss something? Can someone give me a clue about how to implement one

Re: map(K1 key, V1 value, OutputCollector&lt;K2, V2&gt; output, Reporter reporter) deprecated in 0.20.2?

2010-01-28 Thread Edward Capriolo
On Thu, Jan 28, 2010 at 8:14 AM, steven zhuang zhuangxin8...@gmail.com wrote: hello, all, As a newbie, I had become used to the (k1,v1,k2,v2)-style parameter list for the map and reduce methods in the mapper and reducer (as is written in many books), but after several failures, I found

Re: Failed to install Hadoop on WinXP

2010-01-28 Thread Yura Taras
Unfortunately, setting mapred.child.tmp doesn't help. Could you share your sample config files? What about VMWare - I am thinking about this as a last resort :) On Thu, Jan 28, 2010 at 3:41 PM, Yang Li liy...@cn.ibm.com wrote: I met the same problem on WinXP+Cygwin and fixed it by either: -

Re: Slowdown with Hadoop Sort benchmark when using Jumbo frames?

2010-01-28 Thread stephen mulcahy
Jay Booth wrote: Did you set io.file.buffer.size (or whatever the property is) to a large value? Just re-ran the benchmark with that bumped to 65536 (as proposed in http://www.cloudera.com/blog/tag/configuration/). The benchmark is still slower with jumbo frames than without (but difference

Re: Scheduling, prioritizing some jobs

2010-01-28 Thread Matei Zaharia
Hi Erik, With four priority levels like this, you should just be able to use Hadoop's priorities, because it has five of them (very high, high, normal, low and very low). You can just use the default scheduler for this (i.e. don't enable either the fair or the capacity scheduler). Or am I
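For completeness, a minimal sketch of assigning one of those five levels with the 0.19/0.20 JobConf API; in that era the priority of a running job could also be changed from the command line with hadoop job -set-priority <job-id> HIGH:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobPriority;

    public class PriorityExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // One of: VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.
            conf.setJobPriority(JobPriority.HIGH);
        }
    }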

Re: Hadoop 0.19.2 Contrib DataJoin

2010-01-28 Thread Allen Wittenauer
On 1/28/10 12:59 AM, Alex Parvulescu alex.parvule...@gmail.com wrote: I am using Hadoop 0.19.2 and DataJoin (contrib/datajoin), ... I'd like to know if I can submit a patch for this small project. It's nothing much, I just added some generics. It's not perfect, but I think it's a good start. You

Re: Slowdown with Hadoop Sort benchmark when using Jumbo frames?

2010-01-28 Thread Allen Wittenauer
We're working on a patch that monkeys with the TCP buffers because we're seeing slowdowns with big transfers as well. It might be related... On 1/28/10 9:25 AM, stephen mulcahy stephen.mulc...@deri.org wrote: Jay Booth wrote: Did you set io.file.buffer.size (or whatever the property is) to

DBOutputFormat Speed Issues

2010-01-28 Thread Nick Jones
Hi all, I have a use case for collecting several rows from MySQL of compressed/unstructured data (n rows), expanding the data set, and storing the expanded results back into a MySQL DB (100,000n rows). DBInputFormat seems to perform reasonably well but DBOutputFormat is inserting rows
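One common workaround (a sketch only, not from this thread) is to bypass DBOutputFormat and batch the inserts through plain JDBC in the reduce phase, committing every N rows; the table, columns, and connection details below are hypothetical. With MySQL Connector/J, adding rewriteBatchedStatements=true to the JDBC URL is also reported to help.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class BatchInsertExample {
        public static void main(String[] args) throws SQLException {
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://dbhost/mydb", "user", "password");
            conn.setAutoCommit(false); // commit per batch, not per row
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO expanded (id, payload) VALUES (?, ?)");
            for (int i = 0; i < 100000; i++) {
                ps.setInt(1, i);
                ps.setString(2, "row-" + i);
                ps.addBatch();
                if (i % 1000 == 999) { // flush every 1000 rows
                    ps.executeBatch();
                    conn.commit();
                }
            }
            ps.executeBatch(); // flush the final partial batch
            conn.commit();
            conn.close();
        }
    }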

Re: Data currently stored in Solr index. Should it be moved to HDFS?

2010-01-28 Thread Otis Gospodnetic
Hm, yes. See how few hits this shows: http://search-hadoop.com/?q=non-distributed&fc_project=Hadoop You can set it up on 1 box, but that's really useful only for development. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Ranganathan,

NPE in Datanode

2010-01-28 Thread sagar naik
Hi All, I got an NPE on a Hadoop 0.18.1 datanode:

    Exception in thread org.apache.hadoop.dfs.datanod...@107f7f7 java.lang.NullPointerException
        at org.apache.hadoop.dfs.FSDataset.getMetaFile(FSDataset.java:571)
        at org.apache.hadoop.dfs.FSDataset.updateBlock(FSDataset.java:801)

RE: Failed to install Hadoop on WinXP

2010-01-28 Thread #YONG YONG CHENG#
Good Day, My Hadoop version is 0.19.1. I have successfully configured it to run on a Windows machine. Here is the configuration that I performed: 1. I put the Hadoop files under the folder C:\cygwin\usr\local\hadoop. 2. Below is the hadoop-site.xml that I use. <?xml version="1.0"?>

Re: Failed to install Hadoop on WinXP

2010-01-28 Thread brian
Yura Taras wrote: Unfortunately, setting mapred.child.tmp doesn't help. Could you share your sample config files? What about VMWare - I am thinking about this as a last resort :) On Thu, Jan 28, 2010 at 3:41 PM, Yang Li liy...@cn.ibm.com wrote: I met the same problem on WinXP+Cygwin and

Re: Fileformat query

2010-01-28 Thread Udaya Lakshmi
Thank you Jeff. On 1/29/10, Jeff Zhang zjf...@gmail.com wrote: Sorry for my mistake; writing your own InputFormat does not seem like a good idea after all. The cost of getting the line number of each split is rather high. On Fri, Jan 29, 2010 at 8:40 AM, Jeff Zhang zjf...@gmail.com wrote: I'm

File split query

2010-01-28 Thread Udaya Lakshmi
Hi, When the framework splits a file, can it happen that part of a line falls in one split and the rest in another? Or does the framework take care to always split at the end of a line? Thanks, Udaya.

Re: File split query

2010-01-28 Thread .ke. sivakumar
Hadoop will take care of it. If the split boundary falls in the middle of a line, the record reader will read past the boundary to the end of the line, so the split limit may be exceeded by a few bytes. On Thu, Jan 28, 2010 at 7:34 PM, Udaya Lakshmi udaya...@gmail.com wrote: Hi, When the framework splits a file,

always have killed or failed task in job when running multi jobs concurrently

2010-01-28 Thread john li
When Hadoop runs multiple jobs concurrently, that is, when Hadoop is busy, there are always killed tasks in some jobs, although the jobs succeed in the end. Can anybody tell me why? -- Regards Junyong

Re: File split query

2010-01-28 Thread Prabhu Hari Dhanapal
The splitting does not know anything about the input file's internal logical structure; for example, line-oriented text files are split on arbitrary byte boundaries. On Fri, Jan 29, 2010 at 1:49 AM, .ke. sivakumar kesivaku...@gmail.com wrote: Hadoop will take care of it. If the split boundary falls

Re: always have killed or failed task in job when running multi jobs concurrently

2010-01-28 Thread Wang Xu
On Fri, Jan 29, 2010 at 2:52 PM, john li lij...@gmail.com wrote: When Hadoop runs multiple jobs concurrently, there are always killed tasks in some jobs, although the jobs succeed in the end. Can anybody tell me why? If the tasks are only killed (not failed), don't mind it. The JobTracker schedules idle

Re: File split query

2010-01-28 Thread Prabhu Hari Dhanapal
I guess this would be a better answer: A FileSplit is merely a description of boundaries, e.g., bytes 0 to N and bytes N+1 to M. The Mapper then interprets the boundaries described by a FileSplit in a way that makes sense at the data level. The FileSplit does not actually physically

Re: File split query

2010-01-28 Thread Amogh Vasekar
Hi, In general, the file split may break records; it is the responsibility of the record reader to present each record as a whole. If you use the standard available InputFormats, the framework will make sure complete records are presented as key,value pairs. Amogh On 1/29/10 9:04 AM, Udaya Lakshmi
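A simplified, untested sketch of the convention Amogh describes, modeled on what the standard line reader does: skip the first (possibly partial) line unless the split starts at byte 0, and let the last readLine run past the split's end, so every broken line is completed by exactly one reader:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.LineReader;

    public class SplitBoundarySketch {
        public static void readSplit(JobConf job, FileSplit split) throws IOException {
            FileSystem fs = split.getPath().getFileSystem(job);
            FSDataInputStream in = fs.open(split.getPath());
            long start = split.getStart();
            long end = start + split.getLength();
            in.seek(start);
            LineReader reader = new LineReader(in);
            Text line = new Text();
            long pos = start;
            if (start != 0) {
                // Not at file start: the previous split's reader owns this line.
                pos += reader.readLine(line);
            }
            int bytes;
            // The final iteration may read past 'end', completing a line the
            // split boundary cut in half.
            while (pos <= end && (bytes = reader.readLine(line)) > 0) {
                pos += bytes;
                // ... hand 'line' to the mapper here ...
            }
            reader.close();
        }
    }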

Re: always have killed or failed task in job when running multi jobs concurrently

2010-01-28 Thread Rekha Joshi
You can find out the reason from the JT logs (e.g. memory/timeout restrictions) and adjust the timeout (mapred.task.timeout) or the memory parameters accordingly. Refer to http://hadoop.apache.org/common/docs/r0.20.0/cluster_setup.html Cheers, /R On 1/29/10 12:22 PM, john li lij...@gmail.com wrote: