date:20051121

Delay after 404 error?

2005-11-21 Thread Felix Joachim

Hi, is there some way of reducing the delay after a 404 error? 051121 135103 fetch of http://localhost:1234/manual/misc/FAQ.html failed with: java.lang.Exception: org.apache.nutch.protocol.http.HttpError: HTTP Error: 404 After each of those the fetching process seems to wait for a few

How to detect/manage duplicates across multiple tld's

2005-11-21 Thread Byron Miller

I'm noticing searches returning results that list every TLD for the same site, for example the .org, .com and .net versions. Is there any way to do duplicate detection based on X% of duplicate content, and then flag, descore, or delete pages based on that?

sorting on multiple fields

2005-11-21 Thread James Nelson

Hello, I need to sort the search results on two fields for a project I'm working on, but nutch only seems to support sorting on one. I'm wondering if I missed something and there is actually a way or if there is a reason for restricting sort to one field that I'm not aware of. thanks, James

Does Nutch use GCJ?

2005-11-21 Thread Victor Lee

Hi, I am using php-java-bridge to run Nutch code in a PHP file. I keep getting a weird exception, and it refers to GCJ. Does Nutch use GCJ? Thanks. The following are the errors: 051121 134507 parsing jar:file:/usr/share/java/nutch-0.7.1.jar!/nutch-site.xml 051121 134507 Plugins:

Re: Does Nutch use GCJ?

2005-11-21 Thread Stefan Groschupf

I suggest running Nutch standalone in Tomcat and using the rss - xml feed servlet. You can then parse and show the RSS via PHP. On 21.11.2005 at 23:19, Victor Lee wrote: Hi, I am using php-java-bridge to run Nutch code in a PHP file. I keep getting a weird exception, and it refers to

Re: Does Nutch use GCJ?

2005-11-21 Thread Victor Lee

No, I will have to face it sooner or later because I need to use Lucene (Java) to index something in my PHP code. Stefan Groschupf [EMAIL PROTECTED] wrote: I suggest running Nutch standalone in Tomcat and using the rss - xml feed servlet. You can then parse and show the RSS via PHP. On

Re: merging auto-crawls

2005-11-21 Thread Doug Cutting

Ben Halsted wrote: I've modified the auto-crawl to always use a pre-existing crawldb. If I run it multiple times I get multiple linkdb, segments, indexes, and index directories. Is it possible to merge the results using the bin/nutch commands? You should also have it use a single linkdb.

Re: merging auto-crawls

2005-11-21 Thread Ben Halsted

Thanks! (sorry about the double post, even a day apart). One other quick question for you. (Using the mapred branch): When I merge this stuff, do I need to merge the segments/* for each crawl into a single segments directory? Or is there data in the merged index file that will direct the web

Re: Filesystem structure for the web front-end.

2005-11-21 Thread Doug Cutting

Ben Halsted wrote: I was wondering what the required file structure is for the web gui to work properly. Are all of these required? /db/crawldb /db/index /db/indexes /db/segments /db/linkdb The indexes directory is not used when a merged index is present. The crawldb and

Re: merging auto-crawls

2005-11-21 Thread Doug Cutting

Ben Halsted wrote: When I merge this stuff, do I need to merge the segments/* for each crawl into a single segments directory? Or is there data in the merged index file that will direct the web component to the correct segment? Put the segments in a single directory. The index only has the

Re: sorting on multiple fields

2005-11-21 Thread Doug Cutting

James Nelson wrote: I need to sort the search results on two fields for a project I'm working on, but nutch only seems to support sorting on one. I'm wondering if I missed something and there is actually a way or if there is a reason for restricting sort to one field that I'm not aware of.

Re: merging auto-crawls

2005-11-21 Thread Ben Halsted

Doug, Thank you so much, I'll modify my code and see what happens. So far when I try to take a single segment/xyz directory to another location and work with it, the web gui works fine. As soon as I include segments from more than 1 run, the results page returns blank unless there are 0 results,

Has anyone gotten the date query to function properly?

2005-11-21 Thread Bryan Woliner

If people have gotten the date query to work properly, it would be great to know the steps they used to get it working. I added the following property entry to my nutch-site.xml file and used the search phrase: url:http date:19000101-20051231 (which returned zero results). property

merging auto-crawls

2005-11-21 Thread Ben Halsted

Hi Guys, Using the mapred branch I've been able to merge fetched auto-crawl content while using NDFS. On the mapred branch, it seems the bin/nutch admin tools are being removed. Maybe because they use the webdb stuff? Anyhow, how do I get a list or count of the URLs in the system? Cheers, Ben

Help with Nutch and Web Data Toolkit

2005-11-21 Thread Kumar Limbu

Hi Everyone, I am new to the mailing list and I would highly appreciate your co-operation. I am new to Nutch but I really like its functionality. I am trying to develop a crawling application which will crawl only specific links from a web page. I am thinking of using any wdbc tools like

15 matches