Hi Shane,

The regex-urlfilter.txt will exclude "someurl.com" when you run one or more cycles of the "inject > generate > fetch > parse > update > solrupdate" process. It also affects the "updatedb" and "solrindex" steps when they are run with the "-filter" parameter.
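
As a rough sketch of what such a filter could look like: each rule in regex-urlfilter.txt is a "+" (accept) or "-" (reject) followed by a regular expression, and rules are checked top to bottom, with the first match deciding. The patterns below are only my assumption of a reasonable setup, using "someurl.com" as a placeholder. This variant restricts the crawl to that one site; to exclude the site instead, use "-" on the first rule and "+." on the last.

```
# Accept only URLs on someurl.com and its subdomains (placeholder domain)
+^https?://([a-z0-9-]+\.)*someurl\.com/

# Reject everything else
-.
```

A URL that matches no rule at all is dropped, so the final "-." line mainly makes the intent explicit.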
Regards,

On Thu, Apr 3, 2014 at 10:44 AM, Shane Wood <[email protected]> wrote:

> Can you choose a custom regex-urlfilter.txt too save editing it each time
> you wish too index a different site ?.
>
> I am surprised you can't enter a url when generating a fetch list. ie
>
> /bin/nutch generate --only someurl.com --job 192833-292837
>
> The you fetch job 192833-292837 parse job 192833-292837 and finally
> update dbase job 192833-292837
>
> Now that would be great..
>
> Thanks will be doing it your way for now. :)
>
> Shane.
>
> On 03/04/14 13:24, remi tassing wrote:
>
>> Hi Shane,
>>
>> You could use the same scripts as before but just modify the
>> regex-urlfilter.txt to restrict the crawling scope.
>>
>> BR, Remi
>>
>> On Thu, Apr 3, 2014 at 10:52 AM, Shane Wood<[email protected]> wrote:
>>
>>> I have indexed several site successfully.
>>> Now i wish too index a new site and not update any other sites already
>>> indexed.
>>>
>>> I use Nutch 2.21 MYSQL 5.3 and Solr 4.7.0 how would you recommend i go
>>> about indexing a new site only
>>> if someone can give examples of command lines that would be amazingly
>>> helpful.
>>>
>>> Cheers
>>> Shane.

--
wassalam,
[bayu]

