Hi All,

I am new to Nutch and have not been able to figure out even a simple
configuration for my use case. I am also not finding much help on the web.

Here is the simple requirement for which I am thinking of using Nutch:

I have a forum with several topics, like
http://www.coderanch.com/forums/f-15/Performance

Inside every topic there are users who participated in the discussion:
http://www.coderanch.com/t/615478/Performance/java/Code-quality-plugins-eclipse

I want to crawl every user's name and profile page. In the above example,
that would be something like

name: Navneet Sharma
profile: http://www.coderanch.com/forums/user/profile/277769

name: soundar rajan
profile: http://www.coderanch.com/forums/user/profile/283096

From each profile I want to crawl only the user's ranking, number of
messages, and registration date.

So I am only interested in this much data:

name
ranking
number_of_message
registration_date
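
To make the target output concrete, here is a minimal sketch in plain
Python (not Nutch code) of the record I have in mind per user. The class
name `UserRecord` and the field types are my own assumptions; the three
profile fields are left empty because I have not extracted them yet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserRecord:
    """One row of the data I want per forum user (hypothetical layout)."""
    name: str
    profile: str                              # profile page URL
    ranking: Optional[str] = None             # not extracted yet
    number_of_message: Optional[int] = None   # not extracted yet
    registration_date: Optional[str] = None   # kept as text; date format unknown


# Only the name and profile URL are known from the example above.
example = UserRecord(
    name="Navneet Sharma",
    profile="http://www.coderanch.com/forums/user/profile/277769",
)
```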

How can I achieve this in Nutch? Also, is there any way to tell Nutch
not to crawl unnecessary links other than these?
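
To show what I mean by "these" links, here is a rough sketch in plain
Python (again, not Nutch configuration) of the three kinds of URL I would
want followed. The patterns are guessed from the example URLs above, so
they are assumptions about the site's URL scheme, and `PATTERNS` and
`classify` are hypothetical names of my own.

```python
import re

# Patterns inferred from the example URLs in this post (an assumption,
# not something confirmed against the site or taken from Nutch).
PATTERNS = {
    "forum": re.compile(r"^http://www\.coderanch\.com/forums/f-\d+/[^/]+$"),
    "topic": re.compile(r"^http://www\.coderanch\.com/t/\d+/.+$"),
    "profile": re.compile(r"^http://www\.coderanch\.com/forums/user/profile/\d+$"),
}


def classify(url):
    """Return the kind of page a URL points to, or None if it should be skipped."""
    for kind, pattern in PATTERNS.items():
        if pattern.match(url):
            return kind
    return None
```

Anything for which `classify` returns None is a link I would like the
crawler to ignore.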

Please also mention which version works best on Windows, because I had an
issue with 1.7 (some file permission problem with Hadoop), whereas
Nutch 1.2 works well.

Any help would be highly appreciated.

Regards
Harshvardhan Ojha
