Delay after 404 error?

2005-11-21 Thread Felix Joachim
Hi, is there some way of reducing the delay after a 404 error? 051121 135103 fetch of http://localhost:1234/manual/misc/FAQ.html failed with: java.lang.Exception: org.apache.nutch.protocol.http.HttpError: HTTP Error: 404 After each of those the fetching process seems to wait for a few

How to detect/manage duplicates across multiple TLDs

2005-11-21 Thread Byron Miller
I'm noticing searches returning results that have every TLD for the same site listed. For example, .org, .com and .net of the same site. Is there any way to do duplicate detection based upon X% of duplicate content and either flag/descore or delete based upon that?
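
To make the "X% of duplicate content" idea concrete, here is a minimal, hedged sketch of one common approach: compare the word shingles of two pages with Jaccard similarity and flag the pair when the score reaches a threshold. This is not Nutch's dedup implementation; the class name NearDuplicateSketch, the shingle size and the 0.9 threshold are all assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Nutch: near-duplicate check via word shingles.
final class NearDuplicateSketch {

    // Returns the set of k-word shingles (overlapping word windows) of the text.
    static Set<String> shingles(String text, int k) {
        Set<String> out = new HashSet<>();
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return out; // no words, no shingles
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i + k <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return out;
    }

    // Jaccard similarity: |A intersect B| / |A union B|, defined as 1.0 for two empty sets.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // True when the two texts share at least `threshold` of their shingles.
    static boolean isNearDuplicate(String p1, String p2, int k, double threshold) {
        return jaccard(shingles(p1, k), shingles(p2, k)) >= threshold;
    }

    public static void main(String[] args) {
        // Assumed sample page bodies for the .org and .com copies of one site.
        String org = "Welcome to our site. We sell widgets and gadgets.";
        String com = "Welcome to our site. We sell widgets and gadgets.";
        System.out.println(isNearDuplicate(org, com, 3, 0.9));
    }
}
```

A caller could then keep only the highest-scoring hit among pages flagged as near duplicates, or lower the score of the others; which of those to do is a policy choice this sketch does not make.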

sorting on multiple fields

2005-11-21 Thread James Nelson
Hello, I need to sort the search results on two fields for a project I'm working on, but Nutch only seems to support sorting on one. I'm wondering if I missed something and there is actually a way, or if there is a reason for restricting sort to one field that I'm not aware of. Thanks, James
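
For reference, sorting on two keys can be sketched in plain Java by chaining comparators: the second key only breaks ties on the first. This is an illustration of the general technique applied to hits already held in memory, not Nutch's or Lucene's search API; the Hit class, its fields and the chosen sort order are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not Nutch code: order hits by site, then newest first.
final class TwoFieldSortSketch {

    // Assumed minimal hit record with two sortable fields.
    static final class Hit {
        final String site;
        final long date; // e.g. 20051121
        final String url;

        Hit(String site, long date, String url) {
            this.site = site;
            this.date = date;
            this.url = url;
        }
    }

    // Primary key: site ascending. Secondary key: date descending.
    static final Comparator<Hit> BY_SITE_THEN_NEWEST =
            Comparator.comparing((Hit h) -> h.site)
                    .thenComparing(Comparator.comparingLong((Hit h) -> h.date).reversed());

    // Returns a sorted copy and leaves the input list untouched.
    static List<Hit> sortHits(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(BY_SITE_THEN_NEWEST);
        return copy;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("b.example", 20051101L, "http://b.example/1"));
        hits.add(new Hit("a.example", 20051101L, "http://a.example/old"));
        hits.add(new Hit("a.example", 20051121L, "http://a.example/new"));
        for (Hit h : sortHits(hits)) {
            System.out.println(h.site + " " + h.date + " " + h.url);
        }
    }
}
```

Sorting after the search only works when the full result set fits in memory; sorting inside the index is a separate question that this sketch does not address.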

Does Nutch use GCJ?

2005-11-21 Thread Victor Lee
Hi, I am using php-java-bridge to run Nutch code in a PHP file. I keep getting a weird exception, and it refers to GCJ. Does Nutch use GCJ? Thanks. The following are the errors: 051121 134507 parsing jar:file:/usr/share/java/nutch-0.7.1.jar!/nutch-site.xml 051121 134507 Plugins:

Re: Does Nutch use GCJ?

2005-11-21 Thread Stefan Groschupf
I suggest running Nutch standalone in Tomcat and using the RSS/XML feed servlet. You can then parse and show the RSS via PHP. On 21.11.2005 at 23:19, Victor Lee wrote: Hi, I am using php-java-bridge to run Nutch code in a PHP file. I keep getting a weird exception, and it refers to

Re: Does Nutch use GCJ?

2005-11-21 Thread Victor Lee
No, I will have to face it sooner or later because I need to use Lucene (Java) to index something in my PHP code. Stefan Groschupf [EMAIL PROTECTED] wrote: I suggest running Nutch standalone in Tomcat and using the RSS/XML feed servlet. You can then parse and show the RSS via PHP. On

Re: merging auto-crawls

2005-11-21 Thread Doug Cutting
Ben Halsted wrote: I've modified the auto-crawl to always use a pre-existing crawldb. If I run it multiple times I get multiple linkdb, segments, indexes, and index directories. Is it possible to merge the results using the bin/nutch commands? You should also have it use a single linkdb.

Re: merging auto-crawls

2005-11-21 Thread Ben Halsted
Thanks! (sorry about the double post, even a day apart). One other quick question for you. (Using the mapred branch): When I merge this stuff, do I need to merge the segments/* for each crawl into a single segments directory? Or is there data in the merged index file that will direct the web

Re: Filesystem structure for the web front-end.

2005-11-21 Thread Doug Cutting
Ben Halsted wrote: I was wondering what the required file structure is for the web gui to work properly. Are all of these required? /db/crawldb /db/index /db/indexes /db/segments /db/linkdb The indexes directory is not used when a merged index is present. The crawldb and

Re: merging auto-crawls

2005-11-21 Thread Doug Cutting
Ben Halsted wrote: When I merge this stuff, do I need to merge the segments/* for each crawl into a single segments directory? Or is there data in the merged index file that will direct the web component to the correct segment? Put the segments in a single directory. The index only has the

Re: sorting on multiple fields

2005-11-21 Thread Doug Cutting
James Nelson wrote: I need to sort the search results on two fields for a project I'm working on, but Nutch only seems to support sorting on one. I'm wondering if I missed something and there is actually a way, or if there is a reason for restricting sort to one field that I'm not aware of.

Re: merging auto-crawls

2005-11-21 Thread Ben Halsted
Doug, Thank you so much, I'll modify my code and see what happens. So far when I try to take a single segment/xyz directory to another location and work with it, the web gui works fine. As soon as I include segments from more than 1 run, the results page returns blank unless there are 0 results,

Has anyone gotten the date query to function properly?

2005-11-21 Thread Bryan Woliner
If people have gotten the date query to work properly, it would be great to know the steps they used to get it working. I added the following property entry to my nutch-site.xml file and used the search phrase: url:http date:19000101-20051231 (which returned zero results). property

merging auto-crawls

2005-11-21 Thread Ben Halsted
Hi Guys, Using the mapred branch I've been able to merge fetched auto-crawl content while using NDFS. On the mapred branch, it seems the bin/nutch admin tools are being removed. Maybe because they use the webdb stuff? Anyhow, how do I get a list or count of URLs in the system? Cheers, Ben

Help with Nutch and Web Data Toolkit

2005-11-21 Thread Kumar Limbu
Hi Everyone, I am new to the mailing list and I would highly appreciate your co-operation. I am new to Nutch but I really like its functionality. I am trying to develop a crawling application which will crawl only the specific links from a web page. I am thinking of using any wdbc tools like