Re: Questions about Hadoop

2008-09-24 Thread Enis Soztutar
Hi, Arijit Mukherjee wrote: Hi We've been thinking of using Hadoop for a decision making system which will analyze telecom-related data from various sources to take certain decisions. The data can be huge, of the order of terabytes, and can be stored as CSV files, which I understand will fit

Re: 1 file per record

2008-09-24 Thread Enis Soztutar
Yes, you can use MultiFileInputFormat. You can extend the MultiFileInputFormat to return a RecordReader, which reads a record for each file in the MultiFileSplit. Enis chandra wrote: hi.. By setting isSplitable false, we can set 1 file with n records 1 mapper. Is there any way to set 1
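
The "one record per file" idea in the reply above can be illustrated outside Hadoop. What follows is a minimal Python sketch of the concept only, not the MultiFileInputFormat or RecordReader API: the names `whole_file_records` and `demo`, and the sample file contents, are assumptions made up for illustration.

```python
import os
import tempfile


def whole_file_records(paths):
    """Yield one (path, contents) record per file, never splitting a file."""
    for path in paths:
        with open(path, "rb") as handle:
            yield path, handle.read()


def demo():
    # Build three small files in a temporary directory (illustrative data only).
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for index in range(3):
            path = os.path.join(tmp, f"part-{index}.txt")
            with open(path, "wb") as handle:
                handle.write(b"line one\nline two\n" * (index + 1))
            paths.append(path)
        # Each file produces exactly one record, however many lines it holds.
        return [(os.path.basename(p), len(data))
                for p, data in whole_file_records(paths)]
```

Calling `demo()` returns one entry per file, which is the behaviour a custom record reader over a multi-file split would presumably need to reproduce: the unit of work handed to the map function is the whole file rather than a line.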

RE: Questions about Hadoop

2008-09-24 Thread Arijit Mukherjee
Thanx Enis. By workflow, I meant something like a chain of MapReduce jobs - the first one will extract a certain amount of data from the original set and do some computation resulting in a smaller summary, which will then be the input to a further MR job, and so on...somewhat similar
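
The chain described above, where one job's summary becomes the next job's input, can be sketched with a tiny in-memory simulation. This is a rough illustration in plain Python, not Hadoop code: `run_job`, the mapper and reducer names, the CSV-like row layout, and the 100-second "heavy" threshold are all assumptions invented for the example.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one in-memory map/group/reduce pass over (key, value) records."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Job 1: total call duration per subscriber (hypothetical "subscriber,seconds" rows).
def map_calls(_, row):
    subscriber, seconds = row.split(",")
    yield subscriber, int(seconds)


def reduce_sum(key, values):
    yield key, sum(values)


# Job 2: count subscribers per usage band, reading job 1's smaller summary.
def map_band(subscriber, total):
    yield ("heavy" if total >= 100 else "light"), 1


def run_chain(rows):
    summary = run_job(enumerate(rows), map_calls, reduce_sum)
    return run_job(summary, map_band, reduce_sum)
```

The only coupling between the two stages is that the second reads what the first wrote, which is why such chains can be driven by a simple sequential driver or by a workflow tool.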

Re: 1 file per record

2008-09-24 Thread chandravadana
Thanks. Is there any built-in record reader that performs this function? Enis Soztutar wrote: Yes, you can use MultiFileInputFormat. You can extend the MultiFileInputFormat to return a RecordReader, which reads a record for each file in the MultiFileSplit. Enis chandra wrote:

RE: Questions about Hadoop

2008-09-24 Thread Arijit Mukherjee
Thanx again Enis. I'll have a look at Pig and Hive. Regards Arijit Dr. Arijit Mukherjee Principal Member of Technical Staff, Level-II Connectiva Systems (I) Pvt. Ltd. J-2, Block GP, Sector V, Salt Lake Kolkata 700 091, India Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com

Re: 1 file per record

2008-09-24 Thread Enis Soztutar
Nope, not right now. But this has come up before. Perhaps you will contribute one? chandravadana wrote: thanks is there any built in record reader which performs this function.. Enis Soztutar wrote: Yes, you can use MultiFileInputFormat. You can extend the MultiFileInputFormat to

Re: Questions about Hadoop

2008-09-24 Thread Paco NATHAN
Arijit, For workflow, check out http://cascading.org -- that works quite well and fits what you described. Greenplum and Aster Data have announced support for running MR within the context of their relational databases, e.g., http://www.greenplum.com/resources/mapreduce/ In terms of PIG, Hive,

RE: Questions about Hadoop

2008-09-24 Thread Arijit Mukherjee
That's a very good overview Paco - thanx for that. I might get back to you with more queries about cascade etc. at some time - hope you wouldn't mind. Regards Arijit

HDFS, FSDataOutputStream, flush(), sync(), close()

2008-09-24 Thread Christoph Graf
Hi everybody, I have a simple test case which creates a file, writes two lines into the FSDataOutputStream and then flushes, syncs and closes the stream. I am using hadoop 0.18.0 with cygwin. What I observe (in contrast to using java.io.DataOutputStream) is that the lines get written to

Re: Questions about Hadoop

2008-09-24 Thread Paco NATHAN
Certainly. It'd be great to talk with others working in analytics and statistical computing, who have been evaluating MapReduce as well. Paco On Wed, Sep 24, 2008 at 7:45 AM, Arijit Mukherjee [EMAIL PROTECTED] wrote: That's a very good overview Paco - thanx for that. I might get back to you

How can I get the record number of a SequenceFile?

2008-09-24 Thread Jeremy Chow
Hi list, I've generated a sequence file with a reducer, and I will use it to start the second map step, which needs the record number of that sequence file. How can I get it quickly? Thanks a lot. -- My research interests are distributed systems, parallel computing and bytecode based virtual machine.

Re: How can I get the record number of a SequenceFile?

2008-09-24 Thread Deyaa Adranale
Hi Jeremy, I think the key of the map function is the number of the record in a sequence file. I was trying to retrieve the record number in a normal TextInputFormat but could not find it. The best I could use is the byte offset of the record, which does not reflect the record number
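
The difference between a byte offset and a record number, as raised in the reply above, is easy to see on a small buffer. The following is a hedged Python sketch rather than Hadoop code: `line_offsets` and `numbered` are hypothetical helpers, and the counting approach assumes a single reader that sees the data sequentially from the start.

```python
def line_offsets(data: bytes):
    """Return (byte_offset, line) pairs, the kind of key a line reader reports."""
    pairs = []
    offset = 0
    for line in data.splitlines(keepends=True):
        pairs.append((offset, line.rstrip(b"\n")))
        offset += len(line)
    return pairs


def numbered(data: bytes):
    """Attach a 0-based record number to each line by counting in order."""
    return [(number, line)
            for number, (_, line) in enumerate(line_offsets(data))]
```

Because lines differ in length, the offsets grow irregularly while the record numbers grow by one. A reader that starts in the middle of a file knows its offset but cannot know how many records precede it without scanning them, which is one plausible reason the offset is what gets exposed.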

Re: HDFS, FSDataOutputStream, flush(), sync(), close()

2008-09-24 Thread Raghu Angadi
Hmm.. neither of these filesystems seems to implement flush(). Can you file a jira on it? HDFS implements sync() and data should be on the disk after that, but might not be available to a reader yet. How did you test whether the data is on the disk? Raghu. Christoph Graf wrote: Hi

Re: Question about Hadoop 's Feature(s)

2008-09-24 Thread Owen O'Malley
On Sep 24, 2008, at 1:50 AM, Trinh Tuan Cuong wrote: We are developing a project and we intend to use Hadoop to handle the processing of vast amounts of data. But to convince our customers about the use of Hadoop in our project, we must show them the advantages ( and maybe ? the

what does this error mean

2008-09-24 Thread Elia Mazzawi
I got these errors and I don't know what they mean; any help is appreciated. I suspect that either it's a H/W error or the cluster is out of space to store intermediate results? There is still lots of free space left on the cluster. 08/09/24 00:23:31 INFO mapred.JobClient: map 79% reduce 24%

HDFS ingest rate

2008-09-24 Thread steph
Hi, Are there any performance numbers related to how HDFS can ingest data? I am assuming a case where multiple processes outside hadoop write into hadoop in parallel. I understand that it is probably related to various hardware constraints but any existing numbers would be interesting. In

Hadoop job scheduling issue

2008-09-24 Thread Bryan Duxbury
I encountered an interesting situation today. I'm running Hadoop 0.17.1. What happened was that 3 jobs started simultaneously, which is expected in my workflow, but then resources got very mixed up. One of the jobs grabbed all the available reducers (5) and got one map task in before the

counter for number of mapper records

2008-09-24 Thread Sandy
If I understand correctly, each mapper is sent a set number of records. Is there a counter or variable that tells you how many records are sent to a particular mapper? Likewise, is there a similar thing for reducers? Thanks in advance. -SM

Re: counter for number of mapper records

2008-09-24 Thread lohit
Yes, take a look at src/mapred/org/apache/hadoop/mapred/Task_Counter.properties Those are all the counters available for a task. -Lohit - Original Message From: Sandy [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Wednesday, September 24, 2008 5:09:39 PM Subject: counter for

Re: what does this error mean

2008-09-24 Thread Mikhail Yakshin
On Wed, Sep 24, 2008 at 9:24 PM, Elia Mazzawi wrote: I got these errors and I don't know what they mean; any help is appreciated. I suspect that either it's a H/W error or the cluster is out of space to store intermediate results? There is still lots of free space left on the cluster. 08/09/24

RE: Question about Hadoop 's Feature(s)

2008-09-24 Thread Trinh Tuan Cuong
Dear Mr/Mrs Owen O'Malley, First I would like to thank you very much for your reply; it was just the answer I expected. As I read about the Query Language of Hadoop, it is a combination of Pig (Pig Latin), Hive, HBase, Jaql and more... and I could see that Hadoop has an advantage SQL-like

Re: Hadoop job scheduling issue

2008-09-24 Thread omalley
On 9/24/08, Bryan Duxbury [EMAIL PROTECTED] wrote: Does Hadoop not schedule jobs first-come, first-served? Yes, Hadoop 0.17 schedules jobs FIFO. If it isn't, that is a bug. -- Owen

Can hadoop sort by values rather than keys?

2008-09-24 Thread Jeremy Chow
Hi list, By default Hadoop sorts by keys; can it sort by values rather than keys? Regards, Jeremy -- My research interests are distributed systems, parallel computing and bytecode based virtual machine. http://coderplay.javaeye.com

Re: Hadoop for real time

2008-09-24 Thread Edward J. Yoon
What kind of real-time app? On Wed, Sep 24, 2008 at 4:50 AM, Stas Oskin [EMAIL PROTECTED] wrote: Hi. Is it possible to use Hadoop for a real-time app in the video processing field? Regards. -- Best regards, Edward J. Yoon [EMAIL PROTECTED] http://blog.udanax.org

Re: Tips on sorting using Hadoop

2008-09-24 Thread bz
Hi, Is there a way to do this with streaming? I've noticed there is a -partitioner option for streaming; does that mean I have to write a Java partitioner class to perform total order sorting? Thanks, Joseph On Sun, Sep 21, 2008 at 2:12 AM, lohit [EMAIL PROTECTED] wrote: Since this is

Re: Can hadoop sort by values rather than keys?

2008-09-24 Thread Jim Twensky
Sorting according to keys is a requirement for the map/reduce algorithm. I'd suggest running a second map/reduce phase on the output files of your application and using the values as keys in that second phase. I know that will increase the running time, but this is how I do it when I need to get my
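
The second-phase trick suggested above amounts to swapping each pair so the old value becomes the new key, then letting the usual key sort do the work. A minimal Python sketch of that idea follows; it simulates the sort with `sorted` rather than running a real job, and `swap_map` and `sort_by_value` are hypothetical names.

```python
def swap_map(key, value):
    """Second-phase mapper: emit the original value as the new key."""
    yield value, key


def sort_by_value(pairs):
    """Simulate the second pass: swap each pair, then sort on the new key.

    Python's tuple ordering breaks ties on the original key, which a real
    job would not necessarily guarantee.
    """
    swapped = [out for key, value in pairs for out in swap_map(key, value)]
    return sorted(swapped)
```

For word counts, for instance, this turns `(word, count)` pairs into `(count, word)` pairs ordered by count, at the cost of the extra pass the reply mentions.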

Re: debugging hadoop application!

2008-09-24 Thread Jim Twensky
As far as I know, there is a Hadoop plug-in for Eclipse but it is not possible to debug when running on a real cluster. If you want to add watches and expressions to trace your programs or profile your code, I'd suggest looking at the log files or use other tracing tools such as xtrace (