Text search on a PDF file using hadoop

2008-07-23 Thread GaneshG
While I search for text in a PDF file using Hadoop, the results do not come out properly. I tried to debug my program, and I could see that the lines read from the PDF file are not formatted. Please help me to resolve this.

Re: Text search on a PDF file using hadoop

2008-07-23 Thread GaneshG
Thanks Lohit, I am using only the default reader and I am very new to Hadoop. This is my map method: public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer
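
For reference, a hedged completion of the truncated mapper above, using the old org.apache.hadoop.mapred API of the 0.17 era. The class name and search term are placeholders, not GaneshG's actual code. Note that the default reader (TextInputFormat) splits raw bytes on newlines, so run directly over a binary PDF it produces exactly the unformatted "lines" GaneshG describes.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical completion of the mapper; "hadoop" is a placeholder term.
    public class SearchMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private static final String SEARCH_TERM = "hadoop";

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          if (tokenizer.nextToken().contains(SEARCH_TERM)) {
            // Emit the byte offset of the match and the matching line.
            output.collect(new Text(key.toString()), new Text(line));
            break;
          }
        }
      }
    }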

Re: Text search on a PDF file using hadoop

2008-07-23 Thread Dhruba Borthakur
One option for you is to use a pdf-to-text converter (many of them are available online) and then run map-reduce on the txt file. -dhruba On Wed, Jul 23, 2008 at 1:07 AM, GaneshG [EMAIL PROTECTED] wrote: Thanks Lohit, I am using only the default reader and I am very new to Hadoop. This is my map
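
A hedged sketch of that workflow with the command-line tools of this era; pdftotext ships with xpdf/poppler, and the file names, jar, and job class below are placeholders:

    pdftotext report.pdf report.txt
    hadoop dfs -put report.txt /user/ganesh/input/report.txt
    hadoop jar search.jar SearchJob /user/ganesh/input /user/ganesh/output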

Re: newbie install

2008-07-23 Thread Jose Vidal
Thanks! That worked. I was able to run dfs and put some files in it. However, when I go to my namenode at http://namenode:50070 I see that all the datanodes have a name of localhost. Will this cause bigger problems later on? Or should I just ignore it? Jose On Tue, Jul 22, 2008 at 6:48 PM,

DFSClient java.io.IOException: Too many open files

2008-07-23 Thread Keith Fisher
I'm running hadoop version 0.17.0 on a Red Hat Enterprise Linux 4.4 box. I'm using an IBM-provided JDK 1.5. I've configured Hadoop for localhost. I've written a simple test to open and write to files in HDFS. I close the output stream after I write 10 bytes to the file. After 471 files, I see
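
A minimal sketch of the kind of test loop described above, with paths and counts illustrative. Each stream is closed explicitly, yet on 0.17 descriptors can still accumulate inside the client, which is what the replies below discuss; raising the OS limit with ulimit -n is the usual stopgap.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ManyFilesTest {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        byte[] payload = new byte[10];
        for (int i = 0; i < 1000; i++) {
          FSDataOutputStream out = fs.create(new Path("/tmp/test/file-" + i));
          try {
            out.write(payload); // write 10 bytes, as in the report above
          } finally {
            out.close();        // closed explicitly on every iteration
          }
        }
      }
    }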

Re: newbie install

2008-07-23 Thread Edward J. Yoon
That's good. :) Will this cause bigger problems later on? Or should I just ignore it? I'm not sure, but I guess there is no problem. Does anyone have some experience with that? Regards, Edward J. Yoon On Wed, Jul 23, 2008 at 11:05 PM, Jose Vidal [EMAIL PROTECTED] wrote: Thanks! That worked.

Using MapReduce to do table comparing.

2008-07-23 Thread Amber
We have a 10 million row table exported from an AS400 mainframe every day. The table is exported as a csv text file, which is about 30GB in size; the csv file is then imported into an RDBMS table, which is dropped and recreated every day. Now we want to find how many rows are updated during each
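
One common reduce-side approach to this kind of daily diff, sketched under stated assumptions (the first CSV column is the primary key; the OLD/NEW tag would normally be derived from the input file name via the map.input.file job property rather than hard-coded): key every row of both days' dumps by primary key, then classify each key in the reducer.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class TableDiff {
      // Keys each CSV row by its first column; the "NEW" tag is hard-coded
      // here for brevity and would really come from the input file name.
      public static class DiffMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
          String line = value.toString();
          String pk = line.substring(0, line.indexOf(','));
          output.collect(new Text(pk), new Text("NEW\t" + line));
        }
      }

      // Sees at most one OLD and one NEW row per key; classifies the change.
      public static class DiffReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
          String oldRow = null, newRow = null;
          while (values.hasNext()) {
            String v = values.next().toString();
            if (v.startsWith("OLD\t")) oldRow = v.substring(4);
            else newRow = v.substring(4);
          }
          if (oldRow == null) output.collect(key, new Text("INSERTED"));
          else if (newRow == null) output.collect(key, new Text("DELETED"));
          else if (!oldRow.equals(newRow)) output.collect(key, new Text("UPDATED"));
        }
      }
    }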

Re: How to write one file per key as mapreduce output

2008-07-23 Thread James Moore
On Tue, Jul 22, 2008 at 5:04 PM, Lincoln Ritter [EMAIL PROTECTED] wrote: Greetings, I would like to write one file per key in the reduce (or map) phase of a mapreduce job. I have looked at the documentation for FileOutputFormat and MultipleTextOutputFormat but am a bit unclear on how to use
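
The usual trick with the old API is to subclass MultipleTextOutputFormat and override generateFileNameForKeyValue, as in this minimal sketch (the class name is illustrative):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class KeyBasedOutputFormat
        extends MultipleTextOutputFormat<Text, Text> {
      // "name" is the default part-NNNNN file name; return the key instead,
      // so each distinct key's records land in their own output file.
      protected String generateFileNameForKeyValue(Text key, Text value,
                                                   String name) {
        return key.toString();
      }
    }

It is wired in with conf.setOutputFormat(KeyBasedOutputFormat.class). Note that one file per key can produce a very large number of small HDFS files if the key space is big.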

Re: Using MapReduce to do table comparing.

2008-07-23 Thread Jason Venner
If you write a SequenceFile with the results from the RDBMS, you can use the join primitives to handle this rapidly. The key is that you have to write the data in the native key sort order. Since you have a primary key, you should be able to dump the table in primary key order, and you can define
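
A hedged sketch of the dump step Jason describes: append rows to a SequenceFile in primary-key order so the join primitives in org.apache.hadoop.mapred.join can consume it. The row iterable stands in for a JDBC result set ordered by primary key; all names are illustrative.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class TableDump {
      // rows must already be sorted by primary key (e.g. ORDER BY pk).
      public static void dump(Iterable<String[]> rows, Path out)
          throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
        try {
          for (String[] row : rows) {
            // key = primary key (column 0), value = the rest of the row
            writer.append(new Text(row[0]), new Text(joinTail(row)));
          }
        } finally {
          writer.close();
        }
      }

      private static String joinTail(String[] row) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < row.length; i++) {
          if (i > 1) sb.append(',');
          sb.append(row[i]);
        }
        return sb.toString();
      }
    }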

Re: Using MapReduce to do table comparing.

2008-07-23 Thread James Moore
On Wed, Jul 23, 2008 at 7:33 AM, Amber [EMAIL PROTECTED] wrote: We have a 10 million row table exported from an AS400 mainframe every day. The table is exported as a csv text file, which is about 30GB in size; the csv file is then imported into an RDBMS table, which is dropped and recreated every

Re: DFSClient java.io.IOException: Too many open files

2008-07-23 Thread Raghu Angadi
Keith Fisher wrote: I realized that this could be one alternative. But what if the process writing to HDFS is a daemon that's designed to run 24x7? In that scenario, will it eventually, over time, use up all the open file handles? Or will the DFSClient periodically close resources that it no longer

Re: DFSClient java.io.IOException: Too many open files

2008-07-23 Thread Raghu Angadi
Raghu Angadi wrote: Keith Fisher wrote: I realized that this could be one alternative. But what if the process writing to HDFS is a daemon that's designed to run 24x7? In that scenario, will it eventually, over time, use up all the open file handles? Or will the DFSClient periodically close

Re: Using MapReduce to do table comparing.

2008-07-23 Thread Paco NATHAN
This is merely an in-the-ballpark calculation regarding that 10-minute / 4-node requirement... We have a reasonably similar Hadoop job (slightly more complex in the reduce phase) running on AWS with: * 100+2 nodes (m1.xl config) * approx 3x the number of rows and data size * completes

Re: distcp skipping the file

2008-07-23 Thread Chris Douglas
The -update behavior is by design. Could you provide the command line, and the directory structure before and after issuing the copy? -C On Jul 22, 2008, at 9:46 PM, Murali Krishna wrote: Hi, I am using 0.15.3 and the destination is empty. One more behavior that I am seeing is that
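
For anyone following along, the requested detail would look something like this (namenode addresses and paths are placeholders):

    hadoop distcp -update hdfs://nn1:9000/src hdfs://nn2:9000/dst
    hadoop dfs -lsr hdfs://nn1:9000/src    # source tree before the copy
    hadoop dfs -lsr hdfs://nn2:9000/dst    # destination tree after the copy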

Fw: question on HDFS

2008-07-23 Thread Gopal Gandhi
Hi folks, Does anybody have a comment on that? Why do we let the reducer fetch local data through HTTP rather than SSH? - Forwarded Message From: Gopal Gandhi [EMAIL PROTECTED] To: core-user@hadoop.apache.org Cc: [EMAIL PROTECTED] Sent: Tuesday, July 22, 2008 6:30:49 PM Subject: Re: question on

Hadoop and Ganglia Metrics

2008-07-23 Thread Joe Williams
I have been attempting to get Hadoop metrics into Ganglia and have been unsuccessful thus far. I have seen this thread (http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200712.mbox/raw/[EMAIL PROTECTED]/) but it didn't help much. I have set up my properties file like so: [EMAIL
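
For comparison, a typical conf/hadoop-metrics.properties for the Ganglia context in this era looks like the following; the gmond host and port are assumptions for a default local setup.

    dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    dfs.period=10
    dfs.servers=localhost:8649

    mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    mapred.period=10
    mapred.servers=localhost:8649

    jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    jvm.period=10
    jvm.servers=localhost:8649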

Re: Fuse-j-hadoopfs

2008-07-23 Thread Pete Wyckoff
Hi Xavier, RE: fuse-dfs having Facebook-specific things in it, I think that the trunk version should be pretty clean. As far as permissions in fuse-dfs, the following 2 jiras relate to that, and Craig Macdonald is working on it. https://issues.apache.org/jira/browse/HADOOP-3765

Re: Text search on a PDF file using hadoop

2008-07-23 Thread Joman Chu
I've been investigating this recently, and I came across Apache PDFBox (http://incubator.apache.org/projects/pdfbox.html), which may accomplish this in native Java. Try it out and get back to us on how well it works; I'd be curious to know. Joman Chu AIM: ARcanUSNUMquam IRC: irc.liquid-silver.net
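
A minimal sketch of that extraction with incubator-era PDFBox (the packages moved to org.apache.pdfbox.* in later releases); the resulting text could then be fed to an ordinary line-oriented Hadoop job.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    public class PdfToText {
      public static String extract(File pdf) throws Exception {
        InputStream in = new FileInputStream(pdf);
        try {
          PDDocument document = PDDocument.load(in);
          try {
            // PDFTextStripper walks the page content streams and returns
            // plain text suitable for TextInputFormat.
            return new PDFTextStripper().getText(document);
          } finally {
            document.close();
          }
        } finally {
          in.close();
        }
      }
    }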

Confused about Reduce functions

2008-07-23 Thread Kylie McCormick
Hello! I have been getting NullPointerExceptions in my reduce() function, with the code below. (I have removed all the null-pointer-check if-statements, but they are there for every object.) I based my code on the Word Count example. Essentially, the reduce function is to rescore the
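
For reference, the Word Count reduce shape the code is based on, in the old API. A frequent source of NullPointerExceptions in this pattern is a field set up in configure() (or an output value object) that was never actually initialized; everything named here is illustrative, not Kylie's code.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }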