I have a student working on this right now. One thing: Tika has a PhoneNumber content handler that could be leveraged in this type of Nutch plugin. Tyler Palsulich worked on it as part of our DARPA work.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Markus Jelsma <markus.jel...@openindex.io>
Reply-To: "user@nutch.apache.org" <user@nutch.apache.org>
Date: Wednesday, February 10, 2016 at 6:43 AM
To: "user@nutch.apache.org" <user@nutch.apache.org>
Subject: RE: [MASSMAIL]Extract Contact Information - Custom Parser

>Yes, I would also implement an HtmlParseFilter plugin but execute the
>regex on the parseText, because that is where you are going to find phone
>numbers etc.
>Markus
>
>-----Original message-----
>> From: Jorge Luis Betancourt González <jlbetanco...@uci.cu>
>> Sent: Tuesday 9th February 2016 19:59
>> To: user@nutch.apache.org
>> Subject: Re: [MASSMAIL]Extract Contact Information - Custom Parser
>>
>> Is there any particular requirement that prevents you from implementing
>> your logic as an HtmlParser plugin? Essentially, the parsing will be done
>> for you (by parse-html or parse-tika) and all you need to do is find the
>> right nodes and extract the desired information (see [1]).
>>
>> Regards,
>>
>> [1] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/headings/
>>
>> ----- Original Message -----
>> From: "Bin Wang" <binwang...@gmail.com>
>> To: "Apache.Nutch.User" <user@nutch.apache.org>
>> Sent: Tuesday, February 9, 2016 13:19:35
>> Subject: [MASSMAIL]Extract Contact Information - Custom Parser
>>
>> Hi there,
>>
>> I am working on a project that needs to identify contact points on a
>> company's website, to be used for the purpose of enhancing security.
>>
>> Right now, I have managed to crawl several rounds of sites. The next step
>> will be to parse the HTML pages and locate the contact information. In
>> this case, I am only interested in email addresses and phone numbers.
>>
>> Here is what I am planning to do: write a MapReduce job to parse the
>> HTML files, using regular expressions in combination with an HTML parser
>> such as Jsoup or BeautifulSoup to find the matches.
>>
>> However, I am wondering whether there is any parser plugin that has
>> already been implemented, and maybe tested, for this purpose?
>>
>> Also, any feedback on how to achieve this is much appreciated!
>>
>> Best regards,
>>
>> Bin
>>
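
For illustration, the regex-on-extracted-text idea suggested in this thread might look roughly like the sketch below. It is written in plain Python rather than as a Nutch plugin, and everything in it is an assumption made for the example: the function name `extract_contacts`, both patterns, and the North American style phone format. Real pages will need broader patterns, or a dedicated phone number handler such as the Tika one mentioned above.

```python
import re

# Illustrative patterns only (assumptions, not taken from the thread).
# The email pattern is deliberately simple and does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumes North American style numbers: optional country code, then 3-3-4 digits,
# with optional parentheses around the area code and space, dot or dash separators.
PHONE_RE = re.compile(
    r"(?<![\w.])"                 # do not start in the middle of a word or number
    r"(?:\+?\d{1,3}[\s.-]?)?"     # optional country code, e.g. "+1 "
    r"(?:\(\d{3}\)|\d{3})[\s.-]?"  # area code, with or without parentheses
    r"\d{3}[\s.-]?\d{4}"          # local number
    r"(?!\d)"                     # do not stop in the middle of a longer digit run
)


def extract_contacts(parse_text):
    """Return de-duplicated, sorted emails and phone numbers found in plain text.

    `parse_text` stands in for the text a parser has already extracted from
    a page; emails are lowercased and phone numbers reduced to digits only.
    """
    emails = sorted({m.group(0).lower() for m in EMAIL_RE.finditer(parse_text)})
    phones = sorted({re.sub(r"\D", "", m.group(0)) for m in PHONE_RE.finditer(parse_text)})
    return {"emails": emails, "phones": phones}


if __name__ == "__main__":
    sample = "Contact sales@example.com or call (555) 123-4567."
    print(extract_contacts(sample))
```

Running the patterns over the extracted text rather than the raw HTML avoids matching inside markup, at the cost of missing addresses that appear only in attributes such as `mailto:` links.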