Dear all,
I have used Nutch to crawl some news websites and Solr to index the
crawled pages. I was wondering how I can use Nutch to crawl web forums.
Crawling web forums raises some problems that do not arise with news
websites. Here are some of them:
- There should be some technique for finding out how many pages each
thread/post has and how they can be reached.
- Some forums use JavaScript for paging, and JavaScript is a client-side
language, so Nutch would somehow have to parse or execute it.
- Nutch's depth setting for crawling becomes useless, since each page is
counted as a new depth level. Unlimited depth is not an option either,
because it could lead to infinite crawling!
- More...
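
To make the first and third points concrete, here is a minimal sketch of
what I have in mind. It is plain Python, not Nutch code, and it rests on
assumptions: the forum exposes paging as ordinary links with a "page"
query parameter, and fetch is a hypothetical callable that returns the
HTML for a URL. The names pagination_links and crawl_thread are made up
for illustration. It follows only the paging links of one thread and
stops at a page cap instead of a depth limit. JavaScript-driven paging is
not handled here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import parse_qs, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def _without_page(query, page_param):
    # Query parameters with the paging parameter removed.
    return {k: v for k, v in parse_qs(query).items() if k != page_param}


def pagination_links(page_url, html, page_param="page"):
    """Return links that point to the same thread but another page number.

    Assumption: paging uses a numeric query parameter named page_param.
    """
    collector = LinkCollector()
    collector.feed(html)
    base = urlparse(page_url)
    base_query = _without_page(base.query, page_param)
    found = set()
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        same_thread = (
            (parsed.netloc, parsed.path) == (base.netloc, base.path)
            and _without_page(parsed.query, page_param) == base_query
        )
        page_value = parse_qs(parsed.query).get(page_param, [""])[0]
        if same_thread and page_value.isdigit():
            found.add(absolute)
    return sorted(found)


def crawl_thread(start_url, fetch, max_pages=50):
    """Visit the pages of one thread breadth-first, capped at max_pages."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in pagination_links(url, fetch(url)):
            if link not in seen:  # the seen set prevents infinite loops
                seen.add(link)
                queue.append(link)
    return visited
```

The seen set and the max_pages cap are what keep the crawl finite. As far
as I understand, in Nutch itself a similar restriction would normally be
expressed through URL filter rules (for example in regex-urlfilter.txt)
rather than in code like this.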
I would really appreciate it if somebody could guide me on this subject.
Best regards.

-- 
A.Nazemian
