Thanks for the reply. I will configure this in the conf/regex-urlfilter.txt file.

Regards,
Harshvardhan Ojha

On Wed, Oct 23, 2013 at 11:34 PM, A Laxmi <[email protected]> wrote:

> Hi,
>
> You can specify a URL regular expression in the conf/regex-urlfilter file
> to accept only the links that match the regular expression for profile
> URLs.
>
> I recommend not using Windows for this (unless someone has a different
> opinion). Maybe you can install a virtual machine manager like Hyper-V,
> install Ubuntu on it, and go from there.
>
> On Wed, Oct 23, 2013 at 1:57 PM, Harshvardhan Ojha <[email protected]> wrote:
>
> > Hi All,
> >
> > I am new to Nutch and have not been able to figure out the simple
> > configuration I need. I am also not finding much help on the web.
> >
> > Here is the simple requirement for which I am thinking of using Nutch:
> >
> > I have several topics in a forum like
> > http://www.coderanch.com/forums/f-15/Performance
> >
> > Inside every topic there are users who participated in the discussion:
> >
> > http://www.coderanch.com/t/615478/Performance/java/Code-quality-plugins-eclipse
> >
> > I want to crawl all the users' names and their profile pages. In the
> > example above, that would be something like:
> >
> > name: Navneet Sharma
> > profile: http://www.coderanch.com/forums/user/profile/277769
> >
> > name: soundar rajan
> > profile: http://www.coderanch.com/forums/user/profile/283096
> >
> > From each profile I want to crawl only the ranking, number of messages,
> > and registration date.
> >
> > So I am only interested in this much data:
> >
> > name
> > ranking
> > number_of_message
> > registration_date
> >
> > How can I achieve this in Nutch? Also, is there any way to tell Nutch
> > not to crawl unnecessary links other than these?
> >
> > Please also mention which version works best on Windows, because I had
> > an issue with 1.7 (some file permission problem with Hadoop), while
> > Nutch 1.2 works well.
> >
> > Any help would be highly appreciated.
> >
> > Regards
> > Harshvardhan Ojha
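
For reference, a minimal sketch of what the suggested rules might look like in conf/regex-urlfilter.txt. The three URL patterns are assumptions derived only from the example URLs in this thread (forum listing, topic page, user profile) and would need checking against the real site. The forum and topic patterns are kept so the crawler can still reach the profile links. Each rule is a Java regular expression prefixed with + (accept) or - (reject), and the first rule that matches a URL decides.

```text
# Assumed pattern: the forum listing page used as the seed
+^http://www\.coderanch\.com/forums/f-15/Performance

# Assumed pattern: topic pages, which link to the participants' profiles
+^http://www\.coderanch\.com/t/[0-9]+/Performance/

# Assumed pattern: user profile pages, the pages actually wanted
+^http://www\.coderanch\.com/forums/user/profile/[0-9]+

# Reject every URL that did not match a rule above
-.
```

Note that these rules only limit which links are followed. Pulling the ranking, message count, and registration date out of each profile page is a separate parsing step that the URL filter does not cover.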

