Hi,
Is there some way of reducing the delay after a 404 error?
051121 135103 fetch of http://localhost:1234/manual/misc/FAQ.html failed
with: java.lang.Exception: org.apache.nutch.protocol.http.HttpError:
HTTP Error: 404
After each of those the fetching process seems to wait for a few
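If the wait in question is Nutch's per-server politeness pause rather than anything 404-specific, the knob is probably the fetcher.server.delay property (in seconds), settable in nutch-site.xml. A minimal sketch of reading it the way the 0.7-era fetcher does, as I recall; the default value shown is an assumption:

  import org.apache.nutch.util.NutchConf;

  // Sketch only: fetcher.server.delay controls the pause between
  // successive requests to one host, failed fetches included.
  public class ShowServerDelay {
    public static void main(String[] args) {
      float seconds = NutchConf.get().getFloat("fetcher.server.delay", 1.0f);
      System.out.println("per-server delay: " + seconds + " seconds");
    }
  }

Lowering that value in nutch-site.xml should shrink the pause after every fetch, failed or not.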
I'm noticing searches returning results that list
every TLD of the same site:
for example, the .org, .com and .net versions of the same site.
Is there any way to do duplicate detection based on
X% duplicate content, and either flag/descore or
delete results on that basis?
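For what it's worth, Nutch's stock dedup (DeleteDuplicates) only collapses pages whose content hashes match exactly, so near-identical pages across TLDs slip through. A rough, purely illustrative sketch of the shingle-overlap measure an "X% duplicate" test would need; none of this is existing Nutch code:

  import java.util.HashSet;
  import java.util.Set;

  // Estimate the fraction of shared word 5-grams ("shingles") between
  // two page texts; 1.0 means the shingle sets are identical.
  public class ShingleSimilarity {

    private static Set shingles(String text, int size) {
      String[] words = text.toLowerCase().split("\\s+");
      Set set = new HashSet();
      for (int i = 0; i + size <= words.length; i++) {
        StringBuffer buf = new StringBuffer();
        for (int j = 0; j < size; j++) buf.append(words[i + j]).append(' ');
        set.add(buf.toString());
      }
      return set;
    }

    // Jaccard overlap of the two shingle sets.
    public static double similarity(String a, String b) {
      Set setA = shingles(a, 5);
      Set setB = shingles(b, 5);
      Set inter = new HashSet(setA);
      inter.retainAll(setB);
      Set union = new HashSet(setA);
      union.addAll(setB);
      return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }
  }

Pages scoring above a chosen threshold could then be flagged, descored, or dropped at index time.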
Hello,
I need to sort the search results on two fields for a project I'm
working on, but nutch only seems to support sorting on one. I'm
wondering if I missed something and there actually is a way, or if
there is a reason I'm not aware of for restricting sort to one
field.
thanks,
James
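Lucene itself does support compound sorts, so the one-field limit sits in Nutch's search front end rather than in the index. A sketch against the raw Lucene (1.4-era) API, bypassing the Nutch servlet; the field names "site" and "date" are assumptions:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.Sort;
  import org.apache.lucene.search.SortField;

  // Sort on two fields at once: first by site, then by date descending.
  public class TwoFieldSort {
    public static void main(String[] args) throws Exception {
      IndexSearcher searcher = new IndexSearcher("crawl/index");
      Query query = QueryParser.parse("nutch", "content", new StandardAnalyzer());
      Sort sort = new Sort(new SortField[] {
          new SortField("site", SortField.STRING),
          new SortField("date", SortField.STRING, true)  // true = reverse order
      });
      Hits hits = searcher.search(query, sort);
      System.out.println(hits.length() + " hits");
    }
  }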
Hi,
I am using php-java-bridge to run nutch code in a php file. I keep getting
a weird exception, and it refers to GCJ. Does nutch use GCJ? Thanks. The
following are the errors:
051121 134507 parsing jar:file:/usr/share/java/nutch-0.7.1.jar!/nutch-site.xml
051121 134507 Plugins:
I suggest running nutch standalone in a Tomcat and using the RSS/XML
feed servlet. You can then parse and show the RSS via PHP.
On 21.11.2005, at 23:19, Victor Lee wrote:
Hi,
I am using php-java-bridge to run nutch code in a php file. I
keep getting a weird exception, and it refers to
No, I will have to face it sooner or later because I need to use Lucene (Java)
to index something in my php code.
Stefan Groschupf [EMAIL PROTECTED] wrote: I suggest running nutch standalone
in a Tomcat and using the RSS/XML
feed servlet. You can then parse and show the RSS via PHP.
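For that use case the Java side stays small; here is a minimal Lucene (1.4-era) indexing sketch of the kind of code one would drive through php-java-bridge, with the index path and field name as placeholders:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  // Create an index with a single tokenized, stored field.
  public class MiniIndexer {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter("myindex", new StandardAnalyzer(), true);
      Document doc = new Document();
      doc.add(Field.Text("content", "hello from php"));
      writer.addDocument(doc);
      writer.optimize();
      writer.close();
    }
  }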
Ben Halsted wrote:
I've modified the auto-crawl to always use a pre-existing crawldb. If I run
it multiple times I get multiple linkdb, segments, indexes, and index
directories.
Is it possible to merge the results using the bin/nutch commands?
You should also have it use a single linkdb.
Thanks! (sorry about the double post, even a day apart).
One other quick question for you. (Using the mapred branch):
When I merge this stuff, do I need to merge the segments/* for each crawl
into a single segments directory? Or is there data in the merged index file
that will direct the web
Ben Halsted wrote:
I was wondering what the required file structure is for the web gui to work
properly.
Are all of these required?
/db/crawldb
/db/index
/db/indexes
/db/segments
/db/linkdb
The indexes directory is not used when a merged index is present.
The crawldb and
Ben Halsted wrote:
When I merge this stuff, do I need to merge the segments/* for each crawl
into a single segments directory? Or is there data in the merged index file
that will direct the web component to the correct segment?
Put the segments in a single directory. The index only has the
James Nelson wrote:
I need to sort the search results on two fields for a project I'm
working on, but nutch only seems to support sorting on one. I'm
wondering if I missed something and there actually is a way, or if
there is a reason I'm not aware of for restricting sort to one
field.
Doug,
Thank you so much, I'll modify my code and see what happens. So far, when I
try to take a single segment/xyz directory to another location and work with
it, the web gui works fine. As soon as I include segments from more than one
run, the results page returns blank unless there are 0 results,
If people have gotten the date query to work properly, it would be
great to know the steps they used to get it working.
I added the following property entry to my nutch-site.xml file and
used the search phrase:
url:http date:19000101-20051231 (which returned zero results).
property
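For reference, a working date: clause should boil down to a plain Lucene range query over an indexed "date" field; if that field was never indexed (it comes from the "more" indexing plugin, as far as I know), every range will return zero hits. A hedged sketch, assuming dates are stored as yyyymmdd strings:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.RangeQuery;

  // The Lucene-level equivalent of date:19000101-20051231, assuming
  // the "date" field holds yyyymmdd strings.
  public class DateRangeCheck {
    public static void main(String[] args) {
      RangeQuery range = new RangeQuery(
          new Term("date", "19000101"),
          new Term("date", "20051231"),
          true);  // inclusive bounds
      System.out.println(range);  // prints the range in query syntax
    }
  }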
Hi Guys,
Using the mapred branch I've been able to merge fetched auto-crawl content
while using NDFS.
On the mapred branch, it seems the bin/nutch admin tools are being
removed. Maybe because they use the webdb stuff?
Anyhow, how do I get a list or count of URLs in the system?
Cheers,
Ben
Hi Everyone,
I am new to the mailing list and I would highly appreciate your
co-operation. I am new to nutch but I really like its functionality.
I am trying to develop a crawling application which will crawl only
specific links from a web page. I am thinking of using any wdbc tools like
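If the goal is just to restrict which links get crawled, the stock regex filter (conf/crawl-urlfilter.txt) may already be enough; failing that, Nutch's URL-filter extension point takes a small class. A sketch against the 0.7-era interface, with a made-up pattern and class name (plugin registration via plugin.xml is not shown):

  import org.apache.nutch.net.URLFilter;

  // Illustrative URL filter: return the URL to keep it, null to drop
  // it, so only links under /docs/ survive.
  public class OnlyDocsFilter implements URLFilter {
    public String filter(String url) {
      return (url.indexOf("/docs/") >= 0) ? url : null;
    }
  }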