Maybe you can use bin/nutch plugin to test the protocol-httpclient plugin first. It
uses HttpClient, which handles authenticating with servers almost transparently;
the only thing a developer must do is actually provide the login
credentials. You can see this at [0]
[0]
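For reference, protocol-httpclient typically reads its login credentials from conf/httpclient-auth.xml. A minimal sketch (the host, port, username, and password below are illustrative placeholders, not real values):

```xml
<!-- conf/httpclient-auth.xml: credentials for protocol-httpclient.
     All values here are placeholders. -->
<auth-configuration>
  <credentials username="myuser" password="mypassword">
    <!-- Restrict these credentials to a specific host and port. -->
    <authscope host="intranet.example.com" port="80"/>
  </credentials>
</auth-configuration>
```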
Hi,
Can someone please explain how the following scenario works?
I need to crawl a site with 50K URLs. This is a dynamic site and will
have frequent updates. Assuming it takes 2 days to completely
crawl this site, can we have some configuration (fetch schedule or something
Hi Senthil,
I think you should take a look at this website. You can find detailed
information there.
http://wiki.apache.org/nutch/FrontPage
I will presume you are using Nutch 1.xx without Hadoop. You can then check this
site first: http://wiki.apache.org/nutch/NutchTutorial
You should
Hi Senthilkumar,
In short, search for "recrawl" on the Nutch wiki to find an external blog post
on recrawling with Nutch. If you have anything to add to the post, contact
the author. If, on the other hand, you need clarification on anything, then
ping us here.
Hth
Lewis
On Thursday, April 18, 2013,
Thank you Kiran.
Now, I am following the tutorial.
Regards,
Best regards,
Maximiliano Marin Bustos
MCTS: Windows Server 2008 R2, Virtualization
MCTS: SQL Server 2008, Implementation and Maintenance
Web: http://maximilianomarin.com
Mobile: (+56 9) 780 688 91
2013/4/17 kiran chitturi
Curious to know whether Nutch AdaptiveFetchSchedule can do recrawling
automatically?
I observed that Hadoop automatically restarts interrupted jobs. Otherwise,
Hadoop is always up and running with Nutch jobs configured on it. In this
scenario, if a page is ready to be crawled based on adaptive
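AdaptiveFetchSchedule itself is enabled through configuration; a sketch of the relevant nutch-site.xml properties (the numeric values below are illustrative, not recommendations):

```xml
<!-- Use the adaptive schedule instead of the default fetch schedule. -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<!-- How quickly the interval grows for unmodified pages... -->
<property>
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <value>0.4</value>
</property>
<!-- ...and how quickly it shrinks for modified pages. -->
<property>
  <name>db.fetch.schedule.adaptive.dec_rate</name>
  <value>0.2</value>
</property>
<!-- Bounds on the computed re-fetch interval, in seconds. -->
<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>60.0</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <value>31536000.0</value>
</property>
```

Note that this only adjusts when a page becomes due again; you still have to run the generate/fetch/updatedb cycle (e.g. from cron) for the recrawl to actually happen.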
I'm using Nutch 1.6 to crawl a variety of web pages/sites, and I'm finding
that my Solr index contains pairs of near-duplicate entries where the
main difference is that one contains a period after the hostname in the id.
For example:
entry 1: id: http://example.com/
entry 2: id:
Rodney,
Those are valid URLs, but you clearly don't need them. You can either use
filters to get rid of them or normalize them away. Use the
org.apache.nutch.net.URLNormalizerChecker or URLFilterChecker tools to test
your config.
Markus
-Original message-
From:Rodney Barnett
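To illustrate the normalization route: a rule along these lines (a sketch in the style of a conf/regex-normalize.xml pattern, not a stock Nutch rule) would strip the trailing dot from the hostname. Here the regex is applied with sed just to demonstrate it:

```shell
# Strip a single trailing dot from the hostname part of a URL.
# The pattern is a sketch of what a regex-normalize.xml rule could use.
echo "http://example.com./" | sed -E 's#^(https?://[^/]+)\.(/|$)#\1\2#'
# → http://example.com/
```

The same pattern/substitution pair could go into regex-normalize.xml, and you can then verify the effect by running URLs through the org.apache.nutch.net.URLNormalizerChecker tool Markus mentions.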
Hi Raja,
The FetchSchedule [0] defines the contract for implementations that
manipulate fetch times and re-fetch intervals. FetchScheduleFactory [1]
caches the instance in the ObjectCache.
The interface and classes (respectively) do not automate or semi-automate
actual scheduling, e.g. execute the
Thanks Lewis for your help and useful information.
-Raja
--
View this message in context:
http://lucene.472066.n3.nabble.com/Whether-Nutch-AdaptiveFetchSchedule-can-do-recrawling-automatically-tp4056979p4057179.html
Sent from the Nutch - User mailing list archive at Nabble.com.