See SO => http://stackoverflow.com/questions/35299744/nutch-parser-plugin-collect-contact-information
There seem to be more and more people sending their questions to both the ML and SO. I am wondering whether we should set up a redirect so that any question asked there lands automatically on the user list. Any thoughts?

On 10 February 2016 at 14:43, Markus Jelsma <[email protected]> wrote:

> Yes, I would also implement a HtmlParserFilter plugin, but execute the
> regex on the parseText, because that is where you are going to find phone
> numbers etc.
> Markus
>
> -----Original message-----
> > From: Jorge Luis Betancourt González <[email protected]>
> > Sent: Tuesday 9th February 2016 19:59
> > To: [email protected]
> > Subject: Re: [MASSMAIL]Extract Contact Information - Custom Parser
> >
> > Is there any particular requirement that prevents you from implementing
> > your logic as a HtmlParser plugin? Essentially, the parsing will be done
> > for you (by parse-html or parse-tika) and all you need to do is find the
> > right nodes and extract the desired information (see [1]).
> >
> > Regards,
> >
> > [1] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/headings/
> >
> > ----- Original message -----
> > From: "Bin Wang" <[email protected]>
> > To: "Apache.Nutch.User" <[email protected]>
> > Sent: Tuesday, 9 February 2016 13:19:35
> > Subject: [MASSMAIL]Extract Contact Information - Custom Parser
> >
> > Hi there,
> >
> > I am working on a project that needs to identify contact points on a
> > company's website, to be used for the purpose of enhancing security.
> >
> > Right now, I have managed to crawl several rounds of sites. The next step
> > will be to parse the HTML pages and locate where the contact information
> > is. In this case, I am only interested in email addresses and phone
> > numbers....
> >
> > Here is what I am planning to do: we can write a MapReduce job to parse
> > the HTML files and use things like regular expressions in combination
> > with Jsoup/Beautifulsoup HTML parsers to find the matches.
> >
> > However, I am wondering: is there any parser plugin that has already been
> > implemented, and maybe tested, for this purpose?
> >
> > Also, any feedback on how to achieve this is much appreciated!
> >
> > Best regards,
> >
> > Bin

--
*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
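
To make the suggestion in the thread a bit more concrete (run the regular expressions over the extracted parse text rather than over the raw HTML), here is a minimal standalone sketch in Java. It deliberately does not touch the Nutch API: `ContactExtractor`, `extractEmails` and `extractPhones` are hypothetical names invented for this example, and both patterns are simplified assumptions that would need tuning for real sites. Inside a parse filter plugin, the same helper could presumably be fed the text that parse-html or parse-tika has already produced.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: pulls e-mail addresses and
// phone-like strings out of text that has already been extracted from HTML.
class ContactExtractor {

    // Simplified e-mail pattern (an assumption; it does not cover every valid address).
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");

    // Loose phone pattern (an assumption): optional '+', a digit, at least six
    // digits or separators (space, dot, dash, parentheses), then a final digit.
    private static final Pattern PHONE =
        Pattern.compile("\\+?\\d[\\d ().-]{6,}\\d");

    static List<String> extractEmails(String text) {
        return findAll(EMAIL, text);
    }

    static List<String> extractPhones(String text) {
        return findAll(PHONE, text);
    }

    // Collects every match once, preserving the order of first occurrence.
    private static List<String> findAll(Pattern pattern, String text) {
        Set<String> seen = new LinkedHashSet<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            seen.add(matcher.group());
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Stand-in for the parse text of a crawled page.
        String parseText = "Write to sales@example.com or call +1 (555) 010-9999 today.";
        System.out.println(extractEmails(parseText));
        System.out.println(extractPhones(parseText));
    }
}
```

The phone pattern is intentionally loose, so it will also pick up other long digit runs (dates, order numbers); a stricter pattern or a post-filter per country is probably needed before storing the results as metadata.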

