While I search for text in a PDF file using Hadoop, the results are not coming
out properly. I tried to debug my program, and I could see that the lines read
from the PDF file are not formatted. Please help me to resolve this.
Thanks Lohit, I am using only the default reader and I am very new to Hadoop.
This is my map method:
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer
One option for you is to use a PDF-to-text converter (many of them are
available online) and then run MapReduce on the resulting text file; a sketch of that pipeline follows below.
-dhruba
On Wed, Jul 23, 2008 at 1:07 AM, GaneshG
[EMAIL PROTECTED] wrote:
Thanks Lohit, I am using only the default reader and I am very new to Hadoop.
This is my map
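A minimal sketch of that convert-then-ingest pipeline, assuming poppler's pdftotext tool is available; file names and HDFS paths are illustrative:

# Convert locally, then stage the plain text into HDFS for a normal
# line-oriented MapReduce job.
pdftotext report.pdf report.txt
bin/hadoop dfs -put report.txt input/report.txt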
Thanks! That worked. I was able to run dfs and put some files in it.
However, when I go to my namenode at http://namenode:50070 I see that
all the datanodes have a name of localhost.
Will this cause bigger problems later on, or should I just ignore it?
Jose
On Tue, Jul 22, 2008 at 6:48 PM,
I'm running Hadoop version 0.17.0 on a Red Hat Enterprise Linux 4.4
box. I'm using an IBM-provided JDK 1.5. I've configured Hadoop for a
single-node (localhost) setup.
I've written a simple test to open and write to files in HDFS. I close
the output stream after I write 10 bytes to the file. After 471 files,
I see
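A minimal sketch of the kind of test described, assuming the default Configuration points at the localhost cluster; the path and loop bound are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteTest {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    for (int i = 0; i < 1000; i++) {
      // Open a fresh file, write 10 bytes, and close immediately.
      FSDataOutputStream out = fs.create(new Path("/tmp/test-" + i));
      out.write(new byte[10]);
      out.close();
    }
  }
}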
That's good. :)
Will this cause bigger problems later on, or should I just ignore it?
I'm not sure, but I guess there is no problem.
Does anyone have some experience with that?
Regards, Edward J. Yoon
On Wed, Jul 23, 2008 at 11:05 PM, Jose Vidal [EMAIL PROTECTED] wrote:
Thanks! That worked.
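For what it's worth, one common cause (a guess, not a confirmed diagnosis): each datanode's /etc/hosts maps its own hostname to 127.0.0.1, so it registers with the namenode as localhost. Mapping the hostname to the machine's real address usually fixes the reported name; the addresses and names below are illustrative:

# /etc/hosts on each datanode
127.0.0.1      localhost
192.168.1.12   datanode1.example.com   datanode1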
We have a 10 million row table exported from an AS400 mainframe every day. The
table is exported as a CSV text file, which is about 30 GB in size; the CSV
file is then imported into an RDBMS table which is dropped and recreated every day.
Now we want to find how many rows are updated during each
On Tue, Jul 22, 2008 at 5:04 PM, Lincoln Ritter
[EMAIL PROTECTED] wrote:
Greetings,
I would like to write one file per key in the reduce (or map) phase of a
MapReduce job. I have looked at the documentation for
FileOutputFormat and MultipleTextOutputFormat but am a bit unclear on
how to use
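One way this is commonly done with the 0.17-era mapred API is to subclass MultipleTextOutputFormat and derive the output file name from the key; a minimal sketch (the class name is illustrative):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  // Route each record to a file named after its key instead of part-NNNNN.
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    return key.toString();
  }
}

// In the job setup:
// conf.setOutputFormat(KeyBasedOutputFormat.class);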
If you write a SequenceFile with the results from the RDBMS you can use
the join primitives to handle this rapidly (a sketch follows below).
The key is that you have to write the data in the native key sort order.
Since you have a primary key, you should be able to dump the table in
primary key order, and you can define
On Wed, Jul 23, 2008 at 7:33 AM, Amber [EMAIL PROTECTED] wrote:
We have a 10 million row table exported from an AS400 mainframe every day. The
table is exported as a CSV text file, which is about 30 GB in size; then the
CSV file is imported into an RDBMS table which is dropped and recreated every
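A minimal sketch of that dump step, assuming Text keys appended in primary-key order; the class, key, and path names are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/export/today.seq"), Text.class, Text.class);
    // Rows must go in ascending primary-key order so the join
    // primitives can merge today's dump against yesterday's.
    writer.append(new Text("pk-000001"), new Text("col1,col2,col3"));
    writer.close();
  }
}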
Keith Fisher wrote:
I realized that this could be one alternative. But what if the process
writing to HDFS is a daemon that's designed to run 24x7? In that scenario,
will it eventually, over time, use up all the open file handles?
Or will the DFSClient periodically close resources that it no longer
Raghu Angadi wrote:
Keith Fisher wrote:
I realized that this could be one alternative. But what if the process
writing to HDFS is a daemon that's designed to run 24x7? In that scenario,
will it eventually, over time, use up all the open file handles?
Or will the DFSClient periodically close
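Whatever DFSClient does on its own, a long-running writer stays bounded if every stream is closed deterministically; a minimal sketch for a daemon's write path (method and names are illustrative):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BoundedWriter {
  // Guarantee the stream is closed even if a write fails, so file
  // handles never accumulate over a 24x7 run.
  static void writeRecord(FileSystem fs, Path path, byte[] record) throws IOException {
    FSDataOutputStream out = fs.create(path);
    try {
      out.write(record);
    } finally {
      out.close();
    }
  }
}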
This is merely an in-the-ballpark calculation, regarding that
10-minute / 4-node requirement...
We have a reasonably similar Hadoop job (slightly more complex in the
reduce phase) running on AWS with:
* 100+2 nodes (m1.xl config)
* approx 3x the number of rows and data size
* completes
The -update behavior is by design.
Could you provide the command line, and the directory structure before
and after issuing the copy? -C
On Jul 22, 2008, at 9:46 PM, Murali Krishna wrote:
Hi,
I am using 0.15.3 and the destination is empty. One more
behavior that I am seeing is that
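For reference, the general shape of a distcp invocation with -update; the namenode hosts and paths are illustrative:

bin/hadoop distcp -update hdfs://srcNN:9000/logs hdfs://dstNN:9000/logs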
Hi folks,
Does anybody have a comment on that? Why do we let the reducer fetch local data
through HTTP and not SSH?
- Forwarded Message
From: Gopal Gandhi [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Cc: [EMAIL PROTECTED]
Sent: Tuesday, July 22, 2008 6:30:49 PM
Subject: Re: question on
I have been attempting to get Hadoop metrics into Ganglia and have been
unsuccessful thus far. I have seen this thread
(http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200712.mbox/raw/[EMAIL PROTECTED]/)
but it didn't help much.
I have set up my properties file like so:
[EMAIL
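For comparison, the commonly used shape of hadoop-metrics.properties for Ganglia; the gmond host and port are illustrative:

# Send DFS and MapReduce metrics to Ganglia's gmond.
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=gmond-host:8649
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=gmond-host:8649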
Hi Xavier,
Re: fuse-dfs having Facebook-specific things in it, I think that the trunk
version should be pretty clean. As far as permissions in fuse-dfs, the
following two JIRAs relate to that, and Craig Macdonald is working on it.
https://issues.apache.org/jira/browse/HADOOP-3765
I've been investigating this recently, and I came across Apache PDFBox
(http://incubator.apache.org/projects/pdfbox.html), which may
accomplish this in native Java. Try it out and get back to us on how
well it works; I'd be curious to know.
Joman Chu
AIM: ARcanUSNUMquam
IRC: irc.liquid-silver.net
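A minimal sketch of text extraction with PDFBox, using the PDFBox 2.x API (package names have moved between releases, so treat this as illustrative):

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfToText {
  public static void main(String[] args) throws Exception {
    // Extract the text layer; the result can feed a normal
    // line-oriented MapReduce job.
    PDDocument doc = PDDocument.load(new File(args[0]));
    try {
      System.out.print(new PDFTextStripper().getText(doc));
    } finally {
      doc.close();
    }
  }
}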
Hello!
I have been getting NullPointerExceptions in my reduce() function, with the
code below. (I have removed all the null-pointer-check if statements,
but they are there for every object.)
I based my code off of the Word Count example. Essentially, the reduce
function is to rescore the