I have a student working on this right now. One thing: Tika has a PhoneNumber Content Handler, and it could be leveraged in this type of plugin in Nutch. Tyler Palsulich worked on it as part of our DARPA work.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Markus Jelsma <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, February 10, 2016 at 6:43 AM
To: "[email protected]" <[email protected]>
Subject: RE: [MASSMAIL]Extract Contact Information - Custom Parser

>Yes, I would also implement an HtmlParseFilter plugin, but execute the
>regex on the parseText, because that is where you are going to find
>phone numbers etc.
>Markus
>
>-----Original message-----
>> From: Jorge Luis Betancourt González <[email protected]>
>> Sent: Tuesday 9th February 2016 19:59
>> To: [email protected]
>> Subject: Re: [MASSMAIL]Extract Contact Information - Custom Parser
>>
>> Is there any particular requirement that prevents you from
>> implementing your logic as an HtmlParseFilter plugin? Essentially,
>> the parsing will be done for you (by parse-html or parse-tika), and
>> all you need to do is find the right nodes and extract the desired
>> information (see [1]).
>>
>> Regards,
>>
>> [1] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/headings/
>>
>> ----- Original message -----
>> From: "Bin Wang" <[email protected]>
>> To: "Apache.Nutch.User" <[email protected]>
>> Sent: Tuesday, February 9, 2016 13:19:35
>> Subject: [MASSMAIL]Extract Contact Information - Custom Parser
>>
>> Hi there,
>>
>> I am working on a project that needs to identify contact points on a
>> company's website, to be used for the purpose of enhancing security.
>>
>> So far, I have managed to crawl several rounds of sites. The next
>> step will be to parse the HTML pages and locate where the contact
>> information is. In this case, I am only interested in email addresses
>> and phone numbers....
>>
>> Here is what I am planning to do: write a MapReduce job to parse the
>> HTML files and use regular expressions, in combination with an HTML
>> parser such as Jsoup/BeautifulSoup, to find the matches.
>>
>> However, I am wondering whether there is any parser plugin that has
>> already been implemented, and maybe tested, for this purpose?
>>
>> Also, any feedback on how to achieve this is much appreciated!
>>
>> Best regards,
>>
>> Bin
>>
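
For what it's worth, the "run a regex over the parse text" suggestion quoted above could look roughly like the sketch below. It is plain Java with no Nutch or Tika dependencies. The class name `ContactExtractor`, the method names, and both patterns are assumptions made up for illustration, not anything that ships with Nutch or Tika. The patterns are deliberately loose and will produce false positives on real pages. In an actual plugin the input string would presumably be the parse text that Nutch hands to the filter.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Nutch or Tika.
final class ContactExtractor {

    // Loose email pattern (assumption): local part, '@', domain, dot, TLD of 2+ letters.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Loose phone pattern (assumption): optional '+', a digit, then at least
    // seven digits/spaces/parentheses/dots/hyphens, ending on a digit.
    private static final Pattern PHONE =
        Pattern.compile("\\+?\\d[\\d\\s().-]{7,}\\d");

    static List<String> extractEmails(String text) {
        return findAll(EMAIL, text);
    }

    static List<String> extractPhones(String text) {
        return findAll(PHONE, text);
    }

    // Collects every match in order of first appearance, dropping duplicates.
    private static List<String> findAll(Pattern pattern, String text) {
        Set<String> hits = new LinkedHashSet<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        // Stand-in for the text extracted from a crawled page.
        String parseText = "Contact sales@example.com or call +1 (626) 555-0100.";
        System.out.println(extractEmails(parseText));
        System.out.println(extractPhones(parseText));
    }
}
```

A dedicated phone-number library or the Tika handler mentioned at the top of this message would likely be more robust than a hand-written phone regex, since number formats vary a lot by country.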

