Benjamin,
you could add this to the HtmlParser and BasicIndexingFilter, but maybe
it is best to create your own plugin, along the lines of the
WritingPluginExample:
1) add a meta tag to your seed pages:
   <meta name="indexed" content="no" />
2) create a ParseFilter that implements HtmlParseFilter and
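A minimal sketch of such a filter, assuming the 0.8-era HtmlParseFilter
interface; the class name NoIndexMetaFilter and the calls marked as
assumptions below are illustrative only, so check the WritingPluginExample
against your source tree:

    import java.util.Properties;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    // Flags pages carrying <meta name="indexed" content="no"> so that a
    // companion IndexingFilter can skip them at indexing time.
    public class NoIndexMetaFilter implements HtmlParseFilter {

      public Parse filter(Content content, Parse parse,
                          HTMLMetaTags metaTags, DocumentFragment doc) {
        // General meta tags collected by the HTML parser (assumed accessor).
        Properties general = metaTags.getGeneralTags();
        if ("no".equalsIgnoreCase(general.getProperty("indexed"))) {
          // Assumed call: stash a marker in the parse metadata for the
          // indexing side to read; adapt to your Nutch version.
          parse.getData().getMetadata().setProperty("indexed", "no");
        }
        return parse;
      }
    }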
Hello,
One week ago, I launched a crawl on a list of domains with a depth of
10. My crawler is still running now. How can I get the status of the
crawl process? (number of fetched/indexed pages, current crawl depth,
percentage of tasks completed... and any other useful information)
How can I get the status of the crawl process?
In general this should be apparent from the crawl log.
- the number of fetched pages is printed to the logs at regular
intervals (along with pages/sec, etc.)
- the number of indexed pages, if you use the crawl tool; indexing is
done after all pages have been fetched
Fabrice,
Personally, I tail the crawl log to find that out. About every 100
pages it prints the total number of pages, the pages per second, and
the line speed.
Hope that helps.
r/d
-----Original Message-----
From: Fabrice Estiévenart [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, April 05,
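As a concrete example of the log tailing both replies suggest (crawl.log
is a hypothetical name here; use whatever file you redirected the crawl
output to):

    tail -f crawl.log | grep pages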
Got the following dump at 100% of generate cycle
(.8 svn release)
060405 080019 parsing
file:/home/mozdex/trunk/conf/nutch-site.xml
060405 080019 parsing
file:/home/mozdex/trunk/conf/hadoop-site.xml
Exception in thread "main" java.lang.RuntimeException:
class
Andrzej fixed it 2 hours ago.
http://svn.apache.org/viewcvs.cgi?rev=391577&view=rev
Thanks
Jérôme
On 4/5/06, Byron Miller [EMAIL PROTECTED] wrote:
Got the following dump at 100% of generate cycle
(.8 svn release)
060405 080019 parsing
file:/home/mozdex/trunk/conf/nutch-site.xml
060405
hehe, just pulled it down and trying again :)
thanks!
--- Jérôme Charron [EMAIL PROTECTED]
wrote:
Andrzej fixed it 2 hours ago.
http://svn.apache.org/viewcvs.cgi?rev=391577&view=rev
Thanks
Jérôme
On 4/5/06, Byron Miller [EMAIL PROTECTED]
wrote:
Got the following dump at
Byron Miller wrote:
Got the following dump at 100% of generate cycle
(.8 svn release)
Just fixed this. Sorry.
--
Best regards,
Andrzej Bialecki
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |
Andrzej,
Thanks for your response and patch. But I have a few more questions about
adaptive refetch. As far as I understood, the solution below is 'not to
overwrite some fields of the entries' in the db. Assume we applied the
adaptive refetch idea in your patch to the 0.7 version. We have the
Mehmet Tan wrote:
Andrzej,
Thanks for your response and patch. But I have a few more questions about
adaptive refetch. As far as I understood, the solution below is 'not to
overwrite some fields of the entries' in the db. Assume we applied the
adaptive refetch idea in your patch to the 0.7
Sorry, but I am not sure I explained the problem properly.
What I am trying to ask is this:
You have pages A, B, C, D in the webdb, and then during the crawl you
come to a page E that redirects you to page A, for example. Then you
create a new Page object in the fetcher with URL A and write
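To make the overwrite risk concrete, a toy sketch of the 'keep the
adaptive fields' merge the patch implies; PageEntry and WebDbMergeSketch
are stand-ins for illustration, not the real Nutch 0.7 classes:

    // Toy stand-in for a webdb entry; not the real Nutch 0.7 Page class.
    class PageEntry {
      String url;
      long nextFetchTime;  // when the page is due for refetch
      int fetchInterval;   // adaptive refetch interval, tuned per page

      PageEntry(String url, long nextFetchTime, int fetchInterval) {
        this.url = url;
        this.nextFetchTime = nextFetchTime;
        this.fetchInterval = fetchInterval;
      }
    }

    class WebDbMergeSketch {
      // When the fetcher writes a fresh Page for a redirect target
      // (E redirects to A), keep the adaptive fields of the existing
      // entry for A instead of overwriting them with the fresh
      // object's defaults.
      static PageEntry merge(PageEntry existing, PageEntry fresh) {
        if (existing == null) {
          return fresh;  // genuinely new URL: defaults are fine
        }
        fresh.nextFetchTime = existing.nextFetchTime;
        fresh.fetchInterval = existing.fetchInterval;
        return fresh;
      }
    }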
I had earlier posted this message to the list but haven't gotten any
response. Here are more details.
Nutch version: nutch-0.7.2
URL file: contains a single URL. File name: urls
Crawl URL filter: set to grab all URLs
Command: bin/nutch crawl urls -dir crawl.test -depth 3
Error:
Mehmet Tan wrote:
Sorry, but I am not sure I explained the problem properly.
What I am trying to ask is this:
You have pages A, B, C, D in the webdb, and then during the crawl you
come to a page E that redirects you to page A, for example. Then you
create a new Page object in the fetcher
Hi All,
Two general questions:
- I'm wondering if there are any good sources of written information on
actually writing a search engine script: things like scoring, indexing,
that kind of stuff. I bought the Lucene book, but that's Lucene-specific
technical info. Looking for something at
Hi there...
I was having a number of problems with my install, mainly because I'm
not used to Tomcat and/or Nutch, etc.
Anyway, I am running Fedora 4 and was told that the packages are a bad
idea to use, so I uninstalled all of my Java/Tomcat RPMs and installed
new binaries today from the source