Hi - There's nothing like that yet. What you can do is run a custom URL filter 
for the generate step that allows only HTML files, and use your standard URL 
filter for the other steps.
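
To illustrate the idea, here is a rough Python sketch of the two rule sets. It is not Nutch code; it only mimics the regex URL filter's rule style as I understand it (a line starting with + accepts, a line starting with - rejects, the first matching rule decides, and a URL matching no rule is rejected). The rule sets, function names and example.com URLs are made up for the example.

```python
import re

# Hypothetical rules for the generate step: accept HTML only.
GENERATE_RULES = r"""
# accept .htm / .html, case-insensitive
+(?i)\.html?$
# reject everything else
-.
"""

# Hypothetical "standard" rules for the other steps: HTML, PDF and DOC.
STANDARD_RULES = r"""
+(?i)\.(html?|pdf|doc)$
-.
"""


def parse_rules(text):
    """Turn rule text into a list of (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; no match means the URL is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


generate_rules = parse_rules(GENERATE_RULES)
standard_rules = parse_rules(STANDARD_RULES)

urls = [
    "http://example.com/a.html",
    "http://example.com/b.pdf",
    "http://example.com/c.doc",
]

# Only the HTML URL would be selected while generating the fetch list.
html_only = [u for u in urls if accepts(generate_rules, u)]
# All three URLs pass the standard rules used by the other steps.
all_docs = [u for u in urls if accepts(standard_rules, u)]
```

Note that these patterns anchor on the end of the URL, so an HTML page with a query string would not match and would need an extra rule.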

 
 
-----Original message-----
> From:Stefan Scheffler <[email protected]>
> Sent: Tue 02-Oct-2012 09:24
> To: [email protected]
> Subject: priorised/scored fetching
> 
> Hi.
> I crawl a webdatabase for *.html, *.pdf and *.doc documents, with a 
> given topN. I want nutch to fetch first all of the html documents, then 
> pdf and at last doc, because html is more important than pdf and so on.
> Is there a way to make nutch follow such rules (maybe with a scoring 
> algorithm)?
> 
> Regards
> Stefan
> 
> -- 
> Stefan Scheffler
> Avantgarde Labs GbR
> Löbauer Straße 19, 01099 Dresden
> Telefon: + 49 (0) 351 21590834
> Email: [email protected]
> 
> 
