Re: how to parse html files while crawling
Well, the main question is that I need the files with the HTML elements removed. This is important, not the other things. Is this possible?

On 21 April 2010 16:38, nachonieto3 wrote:
> Thank you a lot! Now I'm working on that, I have some more doubts... I'm not able to run the readseg command. [...]
Re: how to parse html files while crawling
Thank you a lot! Now I'm working on that, but I have some more doubts... I'm not able to run the readseg command. I've been consulting some help forums, and the basic syntax is: readseg

I have the segments in this path: D:\nutch-0.9\crawl-20100420112025\segments. The directory named crawl-20100420112025 is the one where the segments are stored. So I'm trying to execute the command in these ways, but none is working:

readseg d/nutch-0.9/crawl-20100420112025/segments
readseg crawl-20100420112025/segments
readseg crawl-20100420112025

What am I doing wrong? When I try to execute it I get: bash: readseg: command not found.

Any idea? Thank you in advance.
--
View this message in context:
http://n3.nabble.com/how-to-parse-html-files-while-crawling-tp706816p739953.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: how to parse html files while crawling
To convert Nutch's crawled data, which is stored in segments, into a human-readable and interpretable form, you will have to look at the 'segread' command (renamed 'readseg' in later versions). It reads and exports the segment data. Details on the Nutch wiki:
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_segread

- Ankit Dangi

On Mon, Apr 19, 2010 at 9:15 PM, nachonieto3 wrote:
> I have a doubt related with this topic (I guess)... How are the final results of Nutch stored? [...]

--
Ankit Dangi
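The `command not found` error reported later in the thread arises because, in Nutch 0.9, the segment reader is not a standalone binary: it is a subcommand of the `bin/nutch` wrapper script and must be run from the Nutch installation directory. A minimal sketch of the invocation (the `<segment-timestamp>` placeholder and the `dump_out` output directory are illustrative assumptions, not values from the thread):

```shell
# Change into the Nutch installation (Cygwin-style path for D:\nutch-0.9).
cd /cygdrive/d/nutch-0.9

# Point readseg at one timestamped segment directory, not at segments/ itself.
# -dump writes the segment's data, including the parsed text, as plain text
# into dump_out/ (a hypothetical output directory).
bin/nutch readseg -dump crawl-20100420112025/segments/<segment-timestamp> dump_out
```

Note that `readseg` takes a single segment directory (one of the timestamped subdirectories under `segments/`), which is likely why the attempts against the parent directories in the thread would fail even with the correct `bin/nutch` prefix.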
Re: how to parse html files while crawling
I have a doubt related with this topic (I guess)... How are the final results of Nutch stored? I mean, in which format is the information contained in the analyzed links stored?

I understood that Nutch needs the information in plain text to parse it... but in which format is it finally stored? I know it is stored in "segments", but how can I access this information in order to convert it to plain text? Is it possible?

Thank you in advance
--
View this message in context:
http://n3.nabble.com/how-to-parse-html-files-while-crawling-tp706816p729943.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: how to parse html files while crawling
Hi,

Your task is clear, but the solution is not simple. That is why there are so many companies competing for users and trying to show them relevant results. Nutch is not clever enough to sort out whether a page is an opinion or just an advertisement, so you must teach it yourself.

First of all, you need to find out what distinguishes opinion pages from other pages. It might be a special structure, a special tag, or a keyword. Next you must rewrite the HTML parser (indexer) and add the necessary logic to filter opinion pages from other pages.

So that's basically what you should do.

Best Regards
Alexander Aristov

On 12 April 2010 11:23, NareshG wrote:
> Hi, I am also a newbie with nutch. Actually our requirement is to do opinion crawling. [...]
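The "find a distinguishing keyword" step above can be sketched very crudely outside of Nutch: dump fetched pages to text, then keep only the files matching review-like phrases. The keyword list, file names, and directory layout below are assumptions for illustration only; a real solution would embed equivalent logic in a rewritten HTML parser plugin, as the reply describes.

```shell
# Create two toy page dumps (stand-ins for text dumped from segments).
mkdir -p pages
printf 'I bought this phone and my opinion is it is great. 4 stars.' > pages/review1.txt
printf 'Buy now! Limited time advertisement offer.' > pages/ad1.txt

# Keep only pages whose text matches review-like keywords
# (the keyword list is a made-up example, not anything Nutch provides).
grep -l -i -E 'opinion|review|stars' pages/*.txt > opinion_pages.txt

cat opinion_pages.txt
```

Running this keeps `pages/review1.txt` and filters out the advertisement page. In practice a simple keyword match will both miss opinions and let ads through, which is why the reply suggests also looking at page structure and tags.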
Re: how to parse html files while crawling
The parsed html files are saved in "segments"

On Fri, Apr 9, 2010 at 3:40 AM, cefurkan0 cefurkan0 wrote:
> i can successfully crawl web sites with
>
> bin/nutch crawl command
>
> but i also want to save parsed html files
>
> how can i do that
>
> ty
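For context on where that data lands on disk: a crawl directory produced by `bin/nutch crawl` typically contains `crawldb/`, `linkdb/`, and `segments/`, and each timestamped segment holds several subdirectories, with the extracted page text under `parse_text/`. A sketch of the layout (subdirectory names follow Nutch's standard scheme; the crawl directory name is taken from the thread and the segment timestamp is a placeholder):

```shell
crawl-20100420112025/
├── crawldb/                 # crawl state for each known URL
├── linkdb/                  # inverted link information
└── segments/
    └── <segment-timestamp>/
        ├── content/         # raw fetched content
        ├── crawl_fetch/     # fetch status of each URL
        ├── crawl_generate/  # the fetchlist this segment was built from
        ├── crawl_parse/     # outlinks used to update the crawldb
        ├── parse_data/      # parsed metadata and outlinks
        └── parse_text/      # extracted plain text of each page
```

These are Hadoop sequence/map files rather than plain files, which is why the segment reader command discussed elsewhere in the thread is needed to turn them into readable text.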
Re: how to parse html files while crawling
Hi,

I am also a newbie with Nutch. Actually our requirement is to do opinion crawling, i.e. we are looking to crawl only certain html pages which contain user opinions about products, items or movies; mainly, the page should contain opinions. Our seed list contains some review web sites like amazon, mouthsut and some other sites.

So during the fetching of html pages, how should I verify whether an html page is an opinion page or not? Nutch experts (Nutch researchers), please post your comments and suggestions. Hope the requirement is clear.

Thanks and regards,
Naresh
--
View this message in context:
http://n3.nabble.com/how-to-parse-html-files-while-crawling-tp706816p712846.html
Sent from the Nutch - User mailing list archive at Nabble.com.