That’s a cool idea, but how would we set up the redirect? Wouldn’t that have to happen on the SO side?

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Julien Nioche <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, February 10, 2016 at 6:48 AM
To: "[email protected]" <[email protected]>
Subject: Re: [MASSMAIL]Extract Contact Information - Custom Parser

>See SO =>
>http://stackoverflow.com/questions/35299744/nutch-parser-plugin-collect-contact-information
>
>There seem to be more and more people sending their questions to both the
>ML and SO. I am wondering whether we should set up a redirect so that any
>question asked there lands automatically on the user list. Any thoughts?
>
>On 10 February 2016 at 14:43, Markus Jelsma <[email protected]>
>wrote:
>
>> Yes, I would also implement an HtmlParseFilter plugin but execute the
>> regex on the parseText, because that is where you are going to find
>> phone numbers etc.
>> Markus
>>
>>
>>
>> -----Original message-----
>> > From: Jorge Luis Betancourt González <[email protected]>
>> > Sent: Tuesday 9th February 2016 19:59
>> > To: [email protected]
>> > Subject: Re: [MASSMAIL]Extract Contact Information - Custom Parser
>> >
>> > Is there any particular requirement that prevents you from
>> > implementing your logic as a HtmlParser plugin? Essentially the
>> > parsing will be done for you (by parse-html or parse-tika) and all
>> > you need to do is find the right nodes and extract the desired
>> > information (see [1]).
>> >
>> > Regards,
>> >
>> > [1] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/headings/
>> >
>> > ----- Original message -----
>> > From: "Bin Wang" <[email protected]>
>> > To: "Apache.Nutch.User" <[email protected]>
>> > Sent: Tuesday, 9 February 2016 13:19:35
>> > Subject: [MASSMAIL]Extract Contact Information - Custom Parser
>> >
>> > Hi there,
>> >
>> > I am working on a project that needs to identify contact points on a
>> > company's website, to be used for the purpose of enhancing security.
>> >
>> > Right now, I have managed to crawl several rounds of sites. The next
>> > step will be to parse the HTML pages and locate where the contact
>> > information is. In this case, I am only interested in email addresses
>> > and phone numbers....
>> >
>> > Here is what I am planning to do: we can write a MapReduce job to
>> > parse the HTML files and use things like regular expressions in combo
>> > with Jsoup/Beautifulsoup HTML parsers to find the matches.
>> >
>> > However, I am wondering whether there is any parser plugin that has
>> > already been implemented, and maybe tested, for this purpose?
>> >
>> > Also, any feedback on how to achieve this is much appreciated!
>> >
>> > Best regards,
>> >
>> > Bin
>> >
>> >
>
>
>--
>
>*Open Source Solutions for Text Engineering*
>
>http://www.digitalpebble.com
>http://digitalpebble.blogspot.com/
>#digitalpebble <http://twitter.com/digitalpebble>
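
For reference, the approach Markus describes in the quoted thread (running regular expressions over the parse text rather than over the raw HTML) could look roughly like the sketch below. It is a standalone illustration only, not Nutch code: the class name ContactExtractor, its method names, and both patterns are assumptions made for the example, and the plain String stands in for whatever text the parser produced. The phone pattern only covers North American style numbers and would need adjusting for other formats.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: extracts contact details from plain text.
public class ContactExtractor {

    // Deliberately simple email pattern (assumption): local part, "@",
    // domain, a dot, and a top-level domain of at least two letters.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // North American style phone numbers only (assumption): optional "+1",
    // then 3-3-4 digits separated by an optional space, dot, or dash,
    // with optional parentheses around the area code.
    private static final Pattern PHONE =
        Pattern.compile("(?:\\+?1[\\s.-]?)?\\(?\\d{3}\\)?[\\s.-]?\\d{3}[\\s.-]?\\d{4}");

    // Collects every non-overlapping match of the pattern, in order of appearance.
    private static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static List<String> extractEmails(String parseText) {
        return findAll(EMAIL, parseText);
    }

    public static List<String> extractPhones(String parseText) {
        return findAll(PHONE, parseText);
    }

    public static void main(String[] args) {
        // Stand-in for the text extracted from a crawled page.
        String parseText = "Call us at (626) 555-0123 or email info@example.com.";
        System.out.println("Emails: " + extractEmails(parseText));
        System.out.println("Phones: " + extractPhones(parseText));
    }
}
```

Inside an actual plugin the same two methods would presumably be called on the parsed text, with the results stored in the parse metadata; that wiring is left out here because it depends on the Nutch version in use.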

