Re: Send parameters to a url

2013-04-18 Thread feng lu
Maybe you can use bin/nutch plugin to test the httpclient plugin first. It uses HttpClient, which handles authenticating with servers almost transparently; the only thing a developer must do is provide the login credentials. You can see this [0] [0]

Re: Whether Nutch AdaptiveFetchSchedule can do recrawling automatically?

2013-04-18 Thread mesenthil1
Hi, Can someone please explain how the following scenario works? I need to crawl a site with 50K URLs. The site is dynamic and is updated frequently. Assuming it takes 2 days to completely crawl this site, can we have some configuration (fetch schedule or something

Re: Whether Nutch AdaptiveFetchSchedule can do recrawling automatically?

2013-04-18 Thread Walter Tietze
Hi Senthil, I think you should take a look at this website. You can find detailed information there. http://wiki.apache.org/nutch/FrontPage I will presume you are using Nutch 1.xx without Hadoop. You can then check this site first: http://wiki.apache.org/nutch/NutchTutorial You should

Re: Whether Nutch AdaptiveFetchSchedule can do recrawling automatically?

2013-04-18 Thread Lewis John Mcgibbney
Hi Senthilkumar, In short, search for "recrawl" on the Nutch wiki to find an external blog post on recrawling with Nutch. If you have anything to add to the post, contact the author. If, on the other hand, you need clarification on anything, then ping us here. Hth, Lewis On Thursday, April 18, 2013,

Re: Question about Nutch and Hadoop

2013-04-18 Thread Maximiliano Marin
Thank you Kiran. Now, I am following the tutorial. Regards, Atte, Maximiliano Marin Bustos MCTS: Windows Server 2008 R2, Virtualization MCTS: SQL Server 2008, Implementation and Maintenance Web: http://maximilianomarin.com Celular: (+56 9) 780 688 91 2013/4/17 kiran chitturi

Whether Nutch AdaptiveFetchSchedule can do recrawling automatically?

2013-04-18 Thread vivekvl
Curious to know whether Nutch AdaptiveFetchSchedule can do recrawling automatically? I observed Hadoop automatically reinitiates the interrupted Jobs. Otherwise Hadoop is always up and running with Nutch jobs configured to it. In this scenario if a page is ready to be crawled based on adaptive

Period-terminated hostnames

2013-04-18 Thread Rodney Barnett
I'm using nutch 1.6 to crawl a variety of web pages/sites and I'm finding that my solr database contains pairs of near-duplicate entries where the main difference is that one contains a period after the hostname in the id. For example: entry 1: id: http://example.com/ entry 2: id:

RE: Period-terminated hostnames

2013-04-18 Thread Markus Jelsma
Rodney, Those are valid URLs but you clearly don't need them. You can either use filters to get rid of them or normalize them away. Use the org.apache.nutch.net.URLNormalizerChecker or URLFilterChecker tools to test your config. Markus -Original message- From:Rodney Barnett
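
As a rough illustration of what "normalizing them away" means, the sketch below strips a trailing period from the hostname of a URL so that both forms map to the same id. It is not Nutch code and not a Nutch configuration: the function name strip_trailing_host_dot is hypothetical, and the period-terminated form http://example.com./ is an assumption based on the description in the original question. In Nutch itself the equivalent rule would live in a URL normalizer or filter.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_trailing_host_dot(url: str) -> str:
    """Return url with any trailing period(s) removed from the hostname.

    Hypothetical helper for illustration only; userinfo and port are kept.
    URLs whose host does not end in a period are returned unchanged.
    """
    parts = urlsplit(url)
    # netloc looks like [userinfo@]host[:port]
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.partition(":")
    if not host.endswith("."):
        return url
    netloc = userinfo + at + host.rstrip(".") + colon + port
    return urlunsplit(parts._replace(netloc=netloc))


if __name__ == "__main__":
    for candidate in ("http://example.com/", "http://example.com./"):
        print(candidate, "->", strip_trailing_host_dot(candidate))
```

With a rule like this applied before indexing, both variants would end up with the same id, so the near-duplicate pair collapses into one entry.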

Re: Whether Nutch AdaptiveFetchSchedule can do recrawling automatically?

2013-04-18 Thread Lewis John Mcgibbney
Hi Raja, The FetchSchedule [0] defines the contract for implementations that manipulate fetch times and re-fetch intervals. FetchScheduleFactory [1] caches the instance in the ObjectCache. The Interface and classes (respectively) do not automate or semi-automate actual scheduling e.g. execute the
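
To make the distinction concrete, here is a simplified sketch of the kind of calculation an adaptive fetch schedule performs: it only computes the next re-fetch interval for a page, and something external (a script, cron job, or similar) still has to run the crawl cycle. This is not the Nutch implementation; the function name next_interval, its parameters, and the rate and bound values are all illustrative assumptions.

```python
def next_interval(interval: float, modified: bool,
                  inc_rate: float = 0.4, dec_rate: float = 0.2,
                  min_interval: float = 60.0,
                  max_interval: float = 31536000.0) -> float:
    """Return an adjusted re-fetch interval in seconds.

    Illustrative only: shrink the interval when the page changed since
    the last fetch, grow it when it did not, and clamp to the bounds.
    """
    if modified:
        interval *= (1.0 - dec_rate)   # page changed: check sooner
    else:
        interval *= (1.0 + inc_rate)   # page unchanged: check later
    return max(min_interval, min(max_interval, interval))


if __name__ == "__main__":
    interval = 86400.0  # start at one day
    for changed in (True, True, False):
        interval = next_interval(interval, changed)
        print(f"modified={changed} -> next interval {interval:.0f}s")
```

Nothing in this calculation starts a fetch; it only decides when a page becomes due, which matches the point that the schedule classes do not automate the crawl itself.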

Re: Whether Nutch AdaptiveFetchSchedule can do recrawling automatically?

2013-04-18 Thread vivekvl
Thanks Lewis for your help and useful information. -Raja