Hi,

You can put URL regular expressions in the conf/regex-urlfilter.txt file so
that Nutch accepts only the links that match the pattern of the profile
URLs.
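
As a rough sketch of what such rules might look like: the patterns below are
only my guesses based on the three example URLs in your message, and I am
assuming the filter applies the first rule that matches ("+" accepts, "-"
rejects) and rejects anything that matches no rule. Note that the forum and
topic pages probably need to be accepted as well, otherwise the crawler has
no way to reach the profile links. The small Python script just emulates
that behaviour so the patterns can be checked before they go into the
config file; parse_rules and accepts are helper names I made up, not part
of Nutch.

```python
import re

# Hypothetical contents for conf/regex-urlfilter.txt. The patterns are
# assumptions derived from the example URLs in the question, not tested
# against the real site.
RULES = r"""
# forum index pages (needed to discover topics)
+^http://www\.coderanch\.com/forums/f-15/
# topic pages (needed to discover profile links)
+^http://www\.coderanch\.com/t/[0-9]+/
# user profile pages (the pages we actually want)
+^http://www\.coderanch\.com/forums/user/profile/[0-9]+
# reject everything else
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(url, rules):
    """Emulate first-match-wins filtering (assumed behaviour, see above)."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: treat as rejected


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "http://www.coderanch.com/forums/f-15/Performance",
        "http://www.coderanch.com/forums/user/profile/277769",
        "http://www.coderanch.com/forums/user/edit/277769",
    ):
        print(accepts(url, rules), url)
```

Only the lines inside RULES would go into the config file; the rest is
there just to try the patterns out.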

I recommend not using Windows for this (unless someone has a different
opinion). Maybe you can install a virtual machine manager such as Hyper-V,
install Ubuntu on it, and go from there.

On Wed, Oct 23, 2013 at 1:57 PM, Harshvardhan Ojha <
[email protected]> wrote:

> Hi All,
>
> I am new to Nutch and have not been able to figure out a simple
> configuration for my needs. I am also not finding much help on the web.
>
> Here is my simple requirement for which I am thinking to use nutch:
>
> I have several topics in a forum like
> http://www.coderanch.com/forums/f-15/Performance
>
> then inside every topic there are users who participated in the discussion
>
> http://www.coderanch.com/t/615478/Performance/java/Code-quality-plugins-eclipse
>
> I want to crawl every user's name and profile page. In the above example,
> it would be something like
>
> name : Navneet Sharma
> profile : http://www.coderanch.com/forums/user/profile/277769
>
> name: soundar rajan
> profile:http://www.coderanch.com/forums/user/profile/283096
>
> and say, I want to crawl only their
> Ranking, Number of messages and Registration date.
>
> So, I am only interested in this much data
>
> name
> ranking
> number_of_message
> registration_date
>
> How can I achieve this in Nutch? Also, is there any way to tell Nutch
> not to crawl unnecessary links other than these?
>
> Please also mention which version works best with Windows, because I had
> an issue with 1.7 (some file permission problem with Hadoop), while
> Nutch 1.2 works well.
>
> Any help would be highly appreciated.
>
> Regards
> Harshvardhan Ojha
>
