Don’t expect out-of-the-box Nutch to answer all of these, but do study the extension points Nutch provides. As far as I can see, you should be able to write custom plugins that achieve your goals, though some programming is required.
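To give a rough idea of the kind of logic such a plugin could hold, here is a minimal sketch of a paging rule for forum threads. It is plain Java with no Nutch dependency, so the plugin wiring (the plugin.xml descriptor and the actual extension-point interface) is left out. The URL layout (`/thread/<id>?page=<n>`), the class name `ForumPagingRule` and the `MAX_PAGE` cap are all assumptions made for illustration. The one thing borrowed from Nutch is the URL-filter convention of returning the URL to accept it and null to reject it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ForumPagingRule {
    // Hypothetical URL layout, e.g. https://forum.example.org/thread/123?page=4
    // Adjust the pattern to whatever the target forum actually uses.
    private static final Pattern THREAD_PAGE =
        Pattern.compile("^https?://[^/]+/thread/(\\d+)(?:\\?page=(\\d+))?$");

    // Assumed upper bound on pages per thread, so that a broken or
    // generated "next page" link cannot cause endless crawling.
    private static final int MAX_PAGE = 500;

    /** Returns the URL if it should be crawled, or null to reject it. */
    public static String filter(String url) {
        Matcher m = THREAD_PAGE.matcher(url);
        if (!m.matches()) {
            return null; // not a thread page under the assumed layout
        }
        String pageGroup = m.group(2);
        if (pageGroup == null) {
            return url; // first page of a thread, no page parameter
        }
        if (pageGroup.length() > 6) {
            return null; // absurdly large page number, also avoids int overflow
        }
        int page = Integer.parseInt(pageGroup);
        return (page >= 1 && page <= MAX_PAGE) ? url : null;
    }

    public static void main(String[] args) {
        System.out.println(filter("https://forum.example.org/thread/42?page=3"));
        System.out.println(filter("https://forum.example.org/thread/42?page=9999"));
        System.out.println(filter("https://forum.example.org/login"));
    }
}
```

A rule like this only bounds the paging within a thread. A real setup would also have to let the forum index pages through so that threads can be discovered, and paging links produced by JavaScript would still need separate handling.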
Greetings,

On Aug 6, 2014, at 4:24 AM, Ali Nazemian <[email protected]> wrote:

> Dear all,
> Hi,
> I used nutch for crawling some news website and solr for indexing the
> crawled pages. I was wondering how can I use nutch for crawling web forums?
> In crawling web forums we have some problems that need to be considered.
> (The ones that are not our concern in the case of news websites) Here is
> some of them:
> - There should be some techniques to find out each thread/post has how many
> pages and how can be reached.
> - Some of forums use java script for identifying paging and java script is
> a client side programming language. Somehow it should be parsed with nutch.
> - The depth method of nutch for crawling becomes useless since each page
> consider in new depth. But also infinite depth is off the choice cause it
> can be face us with infite crawling!
> - More...
> I really appreciate if somebody guide me through this subject.
> Best regards.
>
> --
> A.Nazemian

