Re: How to use MapFile in C++ program

2009-02-06 Thread Anh Vũ Nguyễn
Hi, everybody. I am writing a project in C++ and want to use the power of the MapFile class (which belongs to org.apache.hadoop.io) of hadoop. Can you please tell me how I can write code in C++ using MapFile, or whether there is no way to use the org.apache.hadoop.io API in C++ (libhdfs only helps with

Hadoop job using multiple input files

2009-02-06 Thread Amandeep Khurana
Is it possible to write a map reduce job using multiple input files? For example: File 1 has data like - Name, Number File 2 has data like - Number, Address Using these, I want to create a third file which has something like - Name, Address How can a map reduce job be written to do this?

Re: Hadoop job using multiple input files

2009-02-06 Thread Jeff Hammerbacher
Hey Amandeep, You can get the file name for a task via the map.input.file property. For the join you're doing, you could inspect this property and output (number, name) and (number, address) as your (key, value) pairs, depending on the file you're working with. Then you can do the combination in

Re: can't read the SequenceFile correctly

2009-02-06 Thread Tom White
Hi Mark, Not all the bytes stored in a BytesWritable object are necessarily valid. Use BytesWritable#getLength() to determine how much of the buffer returned by BytesWritable#getBytes() to use. Tom On Fri, Feb 6, 2009 at 5:41 AM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I have written
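Tom's point can be shown in plain Java without Hadoop on the classpath. The class below, `BytesWritableLike`, is a hypothetical stand-in that mimics the relevant behavior of `org.apache.hadoop.io.BytesWritable`: the backing buffer can be larger than the valid data, so the safe pattern is to trim `getBytes()` to `getLength()`.

```java
import java.util.Arrays;

// Hypothetical stand-in mimicking BytesWritable's buffer behavior:
// the backing array may be larger than the valid data it holds.
public class BytesWritableLike {
    private byte[] buf = new byte[16]; // capacity, grows as needed
    private int length = 0;            // count of valid bytes

    public void set(byte[] data) {
        if (data.length > buf.length) {
            buf = Arrays.copyOf(buf, data.length * 2);
        }
        System.arraycopy(data, 0, buf, 0, data.length);
        length = data.length;
    }

    // Returns the whole backing array -- may include stale trailing bytes.
    public byte[] getBytes() { return buf; }

    // Number of valid bytes at the front of getBytes().
    public int getLength() { return length; }

    // The safe pattern: copy only the valid prefix.
    public byte[] copyValidBytes() {
        return Arrays.copyOf(buf, length);
    }

    public static void main(String[] args) {
        BytesWritableLike w = new BytesWritableLike();
        w.set("hello".getBytes());
        System.out.println(w.getBytes().length);       // raw buffer: 16
        System.out.println(w.copyValidBytes().length); // valid data: 5
    }
}
```

Reading past `getLength()` is what produces the "garbage bytes" symptom when deserializing a SequenceFile value.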

Re: Hadoop job using multiple input files

2009-02-06 Thread Ian Soboroff
Amandeep Khurana ama...@gmail.com writes: Is it possible to write a map reduce job using multiple input files? For example: File 1 has data like - Name, Number File 2 has data like - Number, Address Using these, I want to create a third file which has something like - Name, Address How

Re: How to use DBInputFormat?

2009-02-06 Thread Fredrik Hedberg
Well, that obviously depends on the RDBMS' implementation. And although the case is not as bad as you describe (otherwise you better ask your RDBMS vendor for your money back), your point is valid. But then again, an RDBMS is not designed for that kind of work. What do you mean by creating

Re: can't read the SequenceFile correctly

2009-02-06 Thread Mark Kerzner
Indeed, this was the answer! Thank you, Mark On Fri, Feb 6, 2009 at 4:25 AM, Tom White t...@cloudera.com wrote: Hi Mark, Not all the bytes stored in a BytesWritable object are necessarily valid. Use BytesWritable#getLength() to determine how much of the buffer returned by

Re: Hadoop job using multiple input files

2009-02-06 Thread Jeff Hammerbacher
You put the files into a common directory, and use that as your input to the MapReduce job. You write a single Mapper class that has an if statement examining the map.input.file property, outputting number as the key for both files, but address for one and name for the other. By using a common
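The single-Mapper branching Jeff describes can be sketched in plain Java, without Hadoop on the classpath. In a real job this logic would sit inside `Mapper.map()`, with the file name read from the `map.input.file` job property; the file names `names.txt` and `addresses.txt` here are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Sketch of a single mapper that tags its output by source file:
// (number, "name:<name>") from the names file,
// (number, "addr:<address>") from the addresses file.
public class JoinMapperSketch {
    static Map.Entry<String, String> map(String inputFile, String line) {
        String[] fields = line.split(",\\s*");
        if (inputFile.endsWith("names.txt")) {   // File 1: Name, Number
            return new SimpleEntry<>(fields[1], "name:" + fields[0]);
        } else {                                 // File 2: Number, Address
            return new SimpleEntry<>(fields[0], "addr:" + fields[1]);
        }
    }

    public static void main(String[] args) {
        System.out.println(map("/input/names.txt", "Alice, 10"));
        System.out.println(map("/input/addresses.txt", "10, 5 Main St"));
    }
}
```

Both records for number 10 now share a key, so they meet in the same reduce call, where the `name:`/`addr:` prefixes tell them apart.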

Re: How to use DBInputFormat?

2009-02-06 Thread Stefan Podkowinski
On Fri, Feb 6, 2009 at 2:40 PM, Fredrik Hedberg fred...@avafan.com wrote: Well, that obviously depends on the RDBMS' implementation. And although the case is not as bad as you describe (otherwise you better ask your RDBMS vendor for your money back), your point is valid. But then again, an RDBMS

Re: How to use MapFile in C++ program

2009-02-06 Thread Enis Soztutar
There is currently no way to read MapFiles in any language other than Java. You can write a JNI wrapper similar to libhdfs. Alternatively, you can write the complete stack from scratch; however, this might prove very difficult or impossible. You might want to check the ObjectFile/TFile

Tests stalling in my config

2009-02-06 Thread Michael Tolan
Hello, I recently checked out revision 741606, and am attempting to run the 'test' ant task. I'm new to building hadoop from source, so my problem is most likely somewhere in my own configuration, but I'm at a bit of a loss as to how to trace it. The only environment variable that I've set for

Re: Hadoop job using multiple input files

2009-02-06 Thread Amandeep Khurana
Thanks Jeff... I am not 100% clear about the first solution you have given. How do I get the multiple files to be read and then fed into a single reducer? Should I have multiple mappers in the same class with different job configs for them, or run two separate jobs with one outputting the key as

RE: can't read the SequenceFile correctly

2009-02-06 Thread Bhupesh Bansal
Hey Tom, I also got burned by this. Why does BytesWritable.getBytes() return non-valid bytes? Or should we add a BytesWritable.getValidBytes() kind of function? Best Bhupesh -Original Message- From: Tom White [mailto:t...@cloudera.com] Sent: Fri 2/6/2009 2:25 AM To:

Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

2009-02-06 Thread TCK
How well does the read throughput from HDFS scale with the number of data nodes? For example, if I had a large file (say 10GB) on a 10 data node cluster, would the time taken to read this whole file in parallel (i.e., with multiple reader client processes requesting different parts of the file

Re: How to use DBInputFormat?

2009-02-06 Thread Mike Olson
On Feb 6, 2009, at 7:06 AM, Stefan Podkowinski wrote: Another scenario I just recognized: what about current/realtime data? E.g. 'select * from logs where date = today()'. Working with 'offset' may turn out to return different results after the table has been updated and tasks are still

Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

2009-02-06 Thread Brian Bockelman
On Feb 6, 2009, at 11:00 AM, TCK wrote: How well does the read throughput from HDFS scale with the number of data nodes? For example, if I had a large file (say 10GB) on a 10 data node cluster, would the time taken to read this whole file in parallel (i.e., with multiple reader client

Cannot copy from local file system to DFS

2009-02-06 Thread Mithila Nagendra
Hey all, I was trying to run the word count example on one of the hadoop systems I installed, but when I try to copy the text files from the local file system to the DFS, it throws up the following exception: [mith...@node02 hadoop]$ jps 8711 JobTracker 8805 TaskTracker 8901 Jps 8419 NameNode 8642

Re: Re: Re: Regarding Hadoop multi cluster set-up

2009-02-06 Thread Amandeep Khurana
I had to change the master on my running cluster and ended up with the same problem. Were you able to fix it at your end? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Thu, Feb 5, 2009 at 8:46 AM, shefali pawar

Re: How to use DBInputFormat?

2009-02-06 Thread Fredrik Hedberg
Well, that's also implicit by design, and cannot really be solved in a generic way. As with any system, it's not foolproof; unless you fully understand what you're doing, you won't reliably get the result you're seeking. As I said before, the JDBC interface for Hadoop solves a specific

Re: Hadoop job using multiple input files

2009-02-06 Thread Amandeep Khurana
Ok. Got it. Now, how would my reducer know whether the name is coming first or the address? Is it going to be in the same order in the iterator as the files are read (alphabetically) in the mapper? Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Fri,

Heap size error

2009-02-06 Thread Amandeep Khurana
I'm getting the following error while running my hadoop job: 09/02/06 15:33:03 INFO mapred.JobClient: Task Id : attempt_200902061333_0004_r_00_1, Status : FAILED java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Unknown Source) at
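The usual first step for an `OutOfMemoryError: Java heap space` in a reduce task on Hadoop of this era is to raise the child task JVM heap via the `mapred.child.java.opts` property (the default was `-Xmx200m`). A sketch of the config fragment, with `-Xmx512m` as an assumed illustrative value:

```xml
<!-- hadoop-site.xml (or set per-job on the JobConf): raise the per-task child JVM heap -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
```

If the reducer is buffering all values for a key in memory (e.g. building a large array), restructuring the reduce logic to stream over the iterator can matter more than any heap setting.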

Re: Hadoop job using multiple input files

2009-02-06 Thread Amandeep Khurana
Ok. I was able to get this to run but have a slight problem.

*File 1*
1 10
2 20
3 30
3 35
4 40
4 45
4 49
5 50

*File 2*
a 10 123
b 20 21321
c 45 2131
d 40 213

I want to join the above two based on the second column of file 1. Here's what I am getting as the

Re: Hadoop job using multiple input files

2009-02-06 Thread Billy Pearson
If it was me, I would prefix the map value outputs with a: and n: -- a: for address and n: for name. Then in the reduce you can test the value with if statements to see whether it's the address or the name; no need to worry about which one comes first, just make sure they both have been set before
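The reduce side of Billy's tagging scheme can be sketched in plain Java, without Hadoop on the classpath. The values arriving for one key have no guaranteed order, so the reducer collects by prefix rather than relying on position; `a:`/`n:` are the tags added on the map side.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a reducer that pairs tagged values for one key:
// "n:" values are names, "a:" values are addresses.
public class JoinReducerSketch {
    // Pairs every name with every address seen for the same key,
    // regardless of the order the values arrive in.
    static List<String> reduce(Iterable<String> values) {
        List<String> names = new ArrayList<>();
        List<String> addrs = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("n:")) names.add(v.substring(2));
            else if (v.startsWith("a:")) addrs.add(v.substring(2));
        }
        List<String> out = new ArrayList<>();
        for (String n : names)
            for (String a : addrs)
                out.add(n + "\t" + a);
        return out;
    }

    public static void main(String[] args) {
        // Order doesn't matter: here the address arrives before the name.
        System.out.println(reduce(List.of("a:5 Main St", "n:Alice")));
    }
}
```

The nested loop also handles keys that legitimately have several names or several addresses, which a "remember the first value" approach would silently drop.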