Re: how to parse html files while crawling

2010-04-13 Thread NareshG

Hi,

I am also a newbie with Nutch.
Our requirement is opinion crawling, i.e. we want to crawl only certain HTML
pages that contain user opinions about products, items, or movies; mainly,
the page should contain opinions. Our seed list contains review web sites
such as Amazon, MouthShut, and some others.
So during fetching of HTML pages, how should I verify whether a page is an
opinion page or not?
Nutch experts (Nutch researchers), please share your comments and suggestions.
I hope the requirement is clear.

Thanks and regards,
Naresh
-- 
View this message in context: 
http://n3.nabble.com/how-to-parse-html-files-while-crawling-tp706816p712846.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Opinion crawling

2010-04-12 Thread NareshG

Hi,

I am a newbie in Nutch. As part of learning, I have done some basic things in
Nutch such as intranet crawling and internet crawling, and tried the plugin
example. Our main objective is opinion crawling: we need to crawl only HTML
pages that contain opinions, i.e. user reviews about products, items, movies,
etc. So my question is: during fetching itself, can I find out whether an
HTML page contains user opinions or not? If the page contains opinions,
parse it. If not, discard it.
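
A rough sketch of the keep-or-discard check described above, written in
Python for illustration only. It is not Nutch code and does not use the Nutch
plugin API; the cue-word list, the min_hits threshold, and the function names
are all assumptions, and a real filter would likely need per-site tuning or a
trained classifier.

```python
import re
from html.parser import HTMLParser

# Hypothetical cue words; chosen for illustration, not taken from any library.
OPINION_CUES = {
    "review", "reviews", "rating", "rated", "recommend",
    "stars", "opinion", "pros", "cons",
}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of script/style tags."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def visible_text(html):
    """Return the page text with markup, scripts, and styles removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_like_opinion_page(html, min_hits=3):
    """Heuristic: keep the page if enough cue words occur in its text."""
    words = re.findall(r"[a-z]+", visible_text(html).lower())
    hits = sum(1 for word in words if word in OPINION_CUES)
    return hits >= min_hits
```

A crawler could call looks_like_opinion_page() on each fetched page and pass
only the pages that return True on to parsing and indexing.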

This is our approach as of now. Please share your comments and suggestions.

Thanks in advance.

Best regards,
Naresh
-- 
View this message in context: 
http://n3.nabble.com/Opinion-crawling-tp713521p713521.html