Hi Shane,

The regex-urlfilter.txt will exclude "someurl.com" when you run one or more
cycles of the "inject > generate > fetch > parse > update > solrupdate" process.
The regex-urlfilter.txt also affects the "updatedb" and "solrindex"
steps when they are run with the "-filter" parameter.
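
To make the filtering idea concrete, here is a minimal Python sketch of how
rules in the regex-urlfilter.txt style are usually evaluated, as far as I
understand it: each rule is "+" (accept) or "-" (reject) followed by a regex,
the first rule that matches decides, and a URL that matches no rule is
dropped. This is only an illustration and not Nutch's own code; the rule
text, parse_rules and accepts are made-up names for this example, and Nutch
uses Java regex rather than Python's re module.

```python
import re

# Hypothetical rules in regex-urlfilter.txt style: '+' accepts, '-' rejects.
RULES = r"""
# skip anything on someurl.com (host name taken from this thread)
-^https?://([a-z0-9-]+\.)*someurl\.com(/|$)
# accept everything else
+.
"""


def parse_rules(text):
    """Turn rule text into a list of (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ("http://www.someurl.com/page", "http://example.org/"):
        print(url, "accepted" if accepts(rules, url) else "rejected")
```

Swapping the signs (a "+" rule for the someurl.com pattern followed by "-.")
would instead restrict the crawl to that one site, which is the approach Remi
suggested further down in this thread.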

Regards,


On Thu, Apr 3, 2014 at 10:44 AM, Shane Wood <[email protected]> wrote:

> Can you choose a custom regex-urlfilter.txt to save editing it each time
> you wish to index a different site?
>
> I am surprised you can't enter a url when generating a fetch list. ie
>
> /bin/nutch generate --only  someurl.com --job 192833-292837
>
> Then you fetch job 192833-292837, parse job 192833-292837 and finally
> update dbase job 192833-292837
>
> Now that would be great..
>
> Thanks will be doing it your way for now. :)
>
> Shane.
>
>
>
> On 03/04/14 13:24, remi tassing wrote:
>
>> Hi Shane,
>>
>> You could use the same scripts as before but just modify the
>> regex-urlfilter.txt to restrict the crawling scope.
>>
>> BR, Remi
>>
>>
>> On Thu, Apr 3, 2014 at 10:52 AM, Shane Wood<[email protected]>  wrote:
>>
>>
>>
>>> I have indexed several sites successfully.
>>> Now I wish to index a new site and not update any other sites already
>>> indexed.
>>>
>>> I use Nutch 2.21, MySQL 5.3 and Solr 4.7.0. How would you recommend I go
>>> about indexing a new site only?
>>> If someone can give examples of command lines, that would be amazingly
>>> helpful.
>>>
>>> Cheers
>>> Shane.
>>>
>>>
>>>
>>
>>
>
>


-- 
wassalam,
[bayu]
