is there a separate mailing list for hadoop now

2006-03-22 Thread Raghavendra Prabhu
Hi, Is there a separate mailing list for Hadoop right now, since it has been separated from Nutch? Even then, I feel that if any critical bugs are fixed in Hadoop, it would help if the Nutch group were made aware of them. Interested people could then keep abreast of developments in Hadoop as well. Rgds Prab

Re: .08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-22 Thread Andrzej Bialecki
Richard Braman wrote: I copied in hadoop-default.xml and mapred-default.xml from hadoop trunk into conf and still same error. What input directory is it looking for? You didn't say at all what you are trying to do, or what your environment is. What was the cmd-line? Where is the data l

RE: .08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-22 Thread Richard Braman
I copied in hadoop-default.xml and mapred-default.xml from hadoop trunk into conf and still same error. What input directory is it looking for? -Original Message- From: Richard Braman [mailto:[EMAIL PROTECTED] Sent: Thursday, March 23, 2006 1:11 AM To: nutch-user@lucene.apache.org Sub

RE: .08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-22 Thread Richard Braman
But that didn't make it work. Borrowing from the Nutch Hadoop tutorial, this is my hadoop-site.xml: fs.default.name local The name of the default file system. Either the literal string "local" or a host:port for NDFS. mapred.job.trac

RE: .08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-22 Thread Richard Braman
Hadoop-site.xml from trunk is missing: fs.default.name local The name of the default file system. Either the literal string "local" or a host:port for NDFS. This should be added or posted to the Nutch 0.8 tutorial. -Original Message- Fr
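
For reference, the flattened property quoted in this message would look roughly like the fragment below once its XML tags are restored. This is only a sketch: it assumes the usual property/name/value/description layout of Hadoop configuration files and that the element is placed inside the existing root element of conf/hadoop-site.xml. The name, value, and description are taken from the message as quoted; nothing else is implied about the rest of the file.

```xml
<!-- Sketch only: assumes the standard Hadoop <property> layout.
     Place inside the existing root element of conf/hadoop-site.xml. -->
<property>
  <name>fs.default.name</name>
  <!-- "local" selects the local file system, per the quoted description -->
  <value>local</value>
  <description>The name of the default file system. Either the literal
  string "local" or a host:port for NDFS.</description>
</property>
```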

RE: .08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-22 Thread Richard Braman
I am not trying to use hadoop dfs, this is just a single nutcher and a single searcher on a single server configuration. -Original Message- From: Richard Braman [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 22, 2006 11:41 PM To: nutch-user@lucene.apache.org Subject: .08 java.io.IOExcep

.08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-22 Thread Richard Braman
Got the following error running inject on Nutch 0.8 trunk. What am I doing wrong? 060322 235123 parsing jar:file:/T:/nutch-trunk/lib/hadoop-0.1-dev.jar!/mapred-default.xml 060322 235123 parsing \tmp\hadoop\mapred\local\job_vdg9ku.xml\localRunner 060322 235123 parsing file:/T:/nutch-trunk/conf/ha

.job file

2006-03-22 Thread Richard Braman
Getting back to nutch after doing some more legwork on pdf parsing, I got nutch from HEAD and built it. I noticed that there is a .job file created by the build. Is this something new in .08? Can you run nutch as a scheduled task now? Richard Braman mailto:[EMAIL PROTECTED] 561.748.4002 (voic

RE: Can't index Japanese PDF

2006-03-22 Thread Richard Braman
I would forward this to [EMAIL PROTECTED] -Original Message- From: Teruhiko Kurosaka [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 21, 2006 12:23 PM To: nutch-user@lucene.apache.org Subject: Can't index Japanese PDF In my quick experiments, Nutch 0.7.1 (with bundled PDFBox which I thou

Re: crawling pdf and word file

2006-03-22 Thread sudhendra seshachala
In nutch-default.xml, include the plugins for Word and PDF as below. plugin.includes protocol-http|urlfilter-regex|parse-(text|html||msword|pdf)|index-basic|query-(basic|site|url|jobs) Regular expression naming plugin directory names to include. Any plugin not matching this expression is exc
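
With its tags restored, the plugin.includes setting quoted in this message would look roughly like the fragment below. This is a hedged sketch, not the poster's exact file: it assumes the standard property/name/value layout of Nutch configuration files, it collapses the doubled "|" after "html" on the assumption that it is a typo, and it keeps "jobs" in the query-(...) group only because it appears in the quoted value (it is presumably specific to the poster's setup). The description is cut to its one complete sentence because the quoted text is truncated.

```xml
<!-- Sketch only: assumes the standard Nutch <property> layout. -->
<property>
  <name>plugin.includes</name>
  <!-- parse-(...) now lists msword and pdf alongside text and html.
       Assumption: the doubled "|" in the quoted value was a typo. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url|jobs)</value>
  <description>Regular expression naming plugin directory names to include.</description>
</property>
```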

crawling pdf and word file

2006-03-22 Thread Michael Ji
hi there, Is there any specific setting that needs to be added to the configuration file in order to crawl and index PDF and Word files? thanks, Michael

Re: Removing urls from webdb

2006-03-22 Thread keren nutch
Hi sudhendra, Thanks for the reply. It's src/java/org/apache/nutch/tools.PruneDB, not src/java/org/apache/nutch/toos.PruneDB Best regards, Keren sudhendra seshachala <[EMAIL PROTECTED]> wrote: I guess the problem is with the package name src/java/org/apache/nutch/tools.PruneDB and

Re: Removing urls from webdb

2006-03-22 Thread sudhendra seshachala
I guess the problem is with the package name: src/java/org/apache/nutch/tools.PruneDB and src/java/org/apache/nutch/toos.PruneDB... Can you please verify again? It seems to be a typo. Thanks keren nutch <[EMAIL PROTECTED]> wrote: Hi Matt, Thanks for reply. I put Prun

Re: Removing urls from webdb

2006-03-22 Thread keren nutch
Hi Matt, Thanks for the reply. I put PruneDB.java in src/java/org/apache/nutch/tools and ran ant. But when I run 'nutch org.apache.nutch.toos.PruneDB db -s', I get the error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/tools/PruneDB Please let me know where I'm

Re: Removing urls from webdb

2006-03-22 Thread Matt Kangas
I'm puzzled by the claim that "It takes ~4 hours to remove a url from the webdb.". If you're removing them one at a time, yes, because you have to rewrite the entire webdb for any change. But you want to process them in bulk. So it should only take: = (time to rewrite webdb) + (time to proce

Re: Removing urls from webdb

2006-03-22 Thread keren nutch
Actually, we have 11,000,000 urls in the webdb. Keren "Insurance Squared Inc." <[EMAIL PROTECTED]> wrote: We've got a website that is causing our crawler to slow down (from 20mbits down to 3-5) - 400K pages that are basically not available, we're just getting 404's. I'd like to remove them

Removing urls from webdb

2006-03-22 Thread Insurance Squared Inc.
We've got a website that is causing our crawler to slow down (from 20mbits down to 3-5) - 400K pages that are basically not available, we're just getting 404's. I'd like to remove them from the DB to get our crawl speed back up again. Here's what our developer told me - I'm stumped, that seem

Re: Adaptive fetch schedule

2006-03-22 Thread Andrzej Bialecki
(Moved to the proper list) Raghavendra Prabhu wrote: Hi Does the inlink value problem solve the OPIC problem which was there. That is on a recrawl, the page would have a higher score. Does this fix that problem? No, it doesn't. But it prevents your linkDB from growing indefinitely, which

Tuning nutch-0.8-dev (rev-374745 of 2006-02-03)

2006-03-22 Thread monu . ogbe
Hello Team, Thanks to Andrzej for his support and a number of high level pointers in the matter of performance tuning. I am running the above version of nutch with mapred/ndfs across a cluster of five servers. One acting as namenode and jobtracker and all acting as datanodes and tasktrackers. E