I don't think you'll find all your answers in out-of-the-box Nutch, but you 
should study the extension points Nutch offers. As far as I can see, you 
should be able to write custom plugins that achieve your goals, though some 
programming is required. 
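
As a rough illustration of the kind of logic such a plugin could hold, here 
is a minimal sketch in plain Java. It deliberately does not use the Nutch API; 
in a real plugin, something like this would sit behind Nutch's URLFilter 
extension point, whose filter method returns the URL to keep it or null to 
drop it. The class name, the "page" query parameter and the page cap are all 
assumptions made for the example, not anything your forums necessarily use.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: caps how far into a paginated forum thread a crawl goes.
public class ForumPagingFilterSketch {

    // Assumption: thread pages are addressed as ...?page=N or ...&page=N.
    private static final Pattern PAGE_PARAM =
            Pattern.compile("[?&]page=(\\d+)(?:[&#]|$)");

    // Assumption: pages beyond this number are not worth fetching.
    private static final int MAX_PAGE = 50;

    // Returns the URL unchanged to keep it, or null to reject it.
    public static String filter(String url) {
        Matcher m = PAGE_PARAM.matcher(url);
        if (!m.find()) {
            return url; // no paging parameter, nothing to limit
        }
        String digits = m.group(1);
        if (digits.length() > 6) {
            return null; // absurdly large page number, reject without parsing
        }
        int page = Integer.parseInt(digits);
        return page <= MAX_PAGE ? url : null;
    }

    public static void main(String[] args) {
        // Example URLs are made up for the demonstration.
        System.out.println(filter("http://forum.example.org/thread/42?page=3"));
        System.out.println(filter("http://forum.example.org/thread/42?page=9999"));
    }
}
```

Bounding the page number this way is one possible answer to the "infinite 
crawling" worry without relying on crawl depth. It does nothing for paging 
links that are generated by JavaScript; those would need separate handling.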

Greetings,

On Aug 6, 2014, at 4:24 AM, Ali Nazemian <[email protected]> wrote:

> Dear all,
> Hi,
> I used Nutch to crawl some news websites and Solr to index the
> crawled pages. I was wondering how I can use Nutch to crawl web forums.
> Crawling web forums raises some problems that need to be considered
> (ones that are not a concern in the case of news websites). Here are
> some of them:
> - There should be some technique to find out how many pages each
> thread/post has and how they can be reached.
> - Some forums use JavaScript for paging, and JavaScript is
> a client-side programming language. Somehow it has to be parsed by Nutch.
> - Nutch's depth method for crawling becomes useless, since each page
> counts as a new depth level. But infinite depth is not an option either,
> because it could leave us with infinite crawling!
> - More...
> I would really appreciate it if somebody could guide me through this subject.
> Best regards.
> 
> -- 
> A.Nazemian

