Yes, I would also implement an HtmlParseFilter plugin, but run the regex on the parse text, because that is where you are going to find phone numbers etc.

Markus
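
To illustrate the regex step only, here is a minimal sketch. It assumes the page's extracted text is already available as a plain string, and it is written in Python for brevity rather than as a Nutch plugin. The function name `extract_contacts` and both patterns are hypothetical and deliberately simple; they will miss or over-match some real-world formats and would need tuning against actual crawl data.

```python
import re

# Illustrative patterns only (assumptions, not taken from any Nutch plugin).
# Email: local part, "@", domain, and a TLD of at least two letters.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Phone: optional "+", a digit, then at least six digits or common
# separators (spaces, dots, dashes, parentheses), ending in a digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def extract_contacts(parse_text):
    """Return de-duplicated, sorted emails and phone-like strings found in text."""
    emails = sorted(set(EMAIL_RE.findall(parse_text)))
    phones = sorted(set(match.strip() for match in PHONE_RE.findall(parse_text)))
    return {"emails": emails, "phones": phones}


if __name__ == "__main__":
    sample = "Contact us at info@example.com or call +1 (555) 010-9999."
    print(extract_contacts(sample))
```

Running the patterns on the extracted text rather than on raw HTML avoids matching inside markup and attributes. The same two patterns should carry over to `java.util.regex` with little change if the logic is moved into a parse filter.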
-----Original message-----
> From: Jorge Luis Betancourt González <jlbetanco...@uci.cu>
> Sent: Tuesday 9th February 2016 19:59
> To: user@nutch.apache.org
> Subject: Re: [MASSMAIL]Extract Contact Information - Custom Parser
>
> Is there any particular requirement that prevents you from implementing your
> logic as an HtmlParser plugin? Essentially, the parsing will be done for you
> (by parse-html or parse-tika), and all you need to do is find the right nodes
> and extract the desired information (see [1]).
>
> Regards,
>
> [1] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/headings/
>
> ----- Original message -----
> From: "Bin Wang" <binwang...@gmail.com>
> To: "Apache.Nutch.User" <user@nutch.apache.org>
> Sent: Tuesday, 9 February 2016 13:19:35
> Subject: [MASSMAIL]Extract Contact Information - Custom Parser
>
> Hi there,
>
> I am working on a project that needs to identify contact points on a
> company's website, to be used for the purpose of enhancing security.
>
> Right now, I have managed to crawl several rounds of sites. The next step
> will be to parse the HTML pages and locate where the contact information is.
> In this case, I am only interested in email addresses and phone numbers....
>
> Here is what I am planning to do: write a MapReduce job to parse the HTML
> files and use regular expressions in combination with an HTML parser such as
> Jsoup/Beautifulsoup to find the matches.
>
> However, I am wondering whether there is any parser plugin that has already
> been implemented, and maybe tested, for this purpose?
>
> Also, any feedback on how to achieve this is much appreciated!
>
> Best regards,
>
> Bin
>