Ah, OK. Thank you.
This sounds like what I had in mind :)
Regards
Stefan
On 02.10.2012 10:34, Julien Nioche wrote:
You should be able to do that with a custom scoring filter that assigns a
score based on the MIME type.
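A minimal sketch of that idea in plain Java, not tied to the Nutch plugin API. It assumes the file extension is a usable stand-in for the MIME type (the real type is only known after a fetch), and the class name, method names and boost values are all hypothetical. In an actual plugin this logic would presumably sit in the scoring filter hook that supplies the generator's sort value.

```java
import java.util.Locale;

// Hypothetical helper: boosts a URL's sort value by document type.
class TypePriority {

    // Strip query string and fragment, then lower-case the rest.
    static String pathOf(String url) {
        String path = url;
        int cut = path.indexOf('#');
        if (cut >= 0) path = path.substring(0, cut);
        cut = path.indexOf('?');
        if (cut >= 0) path = path.substring(0, cut);
        return path.toLowerCase(Locale.ROOT);
    }

    // Assumed boost factors: HTML before PDF before DOC.
    static float boostFor(String url) {
        String path = pathOf(url);
        if (path.endsWith(".html")) return 3.0f;
        if (path.endsWith(".pdf")) return 2.0f;
        return 1.0f; // .doc and anything else
    }

    // Higher values are meant to be selected earlier within topN.
    static float sortValue(String url, float initSort) {
        return initSort * boostFor(url);
    }
}
```

Note that a multiplicative boost only biases the ordering: a PDF with a much higher initial score can still outrank an HTML page.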
On 2 October 2012 08:28, Markus Jelsma <[email protected]> wrote:
Hi - There's nothing like that yet. What you can do is run a custom URL
filter for the generate step, allowing only HTML files, and use your
standard URL filter for the other steps.
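A rough sketch of the filtering logic such a generate-time filter might apply, written as standalone Java rather than as a real Nutch plugin. The class name is hypothetical, and returning the URL to accept it or null to reject it is meant to mirror the URL filter convention as I understand it; matching on the .html/.htm extension is an assumption about how the HTML pages can be recognised.

```java
import java.util.Locale;

// Hypothetical filter: lets only HTML-looking URLs through.
class HtmlOnlyFilter {

    // Returns the URL unchanged if it is accepted, or null if it is rejected.
    static String filter(String url) {
        String path = url;
        int cut = path.indexOf('#');
        if (cut >= 0) path = path.substring(0, cut); // drop fragment
        cut = path.indexOf('?');
        if (cut >= 0) path = path.substring(0, cut); // drop query string
        path = path.toLowerCase(Locale.ROOT);
        boolean isHtml = path.endsWith(".html") || path.endsWith(".htm");
        return isHtml ? url : null;
    }
}
```

The PDF and DOC files would then be picked up in a later generate run that uses the standard filter.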
-----Original message-----
From: Stefan Scheffler <[email protected]>
Sent: Tue 02-Oct-2012 09:24
To: [email protected]
Subject: priorised/scored fetching
Hi.
I am crawling a web database for *.html, *.pdf and *.doc documents with a
given topN. I want Nutch to fetch all of the HTML documents first, then
the PDFs and finally the DOC files, because HTML is more important than
PDF, and so on.
Is there a way to make Nutch follow such rules (maybe with a scoring
algorithm)?
Regards
Stefan
--
Stefan Scheffler
Avantgarde Labs GbR
Löbauer Straße 19, 01099 Dresden
Telefon: + 49 (0) 351 21590834
Email: [email protected]