Great idea, +1.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: Julien Nioche <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, February 12, 2016 at 7:46 AM
To: "[email protected]" <[email protected]>
Subject: Re: [MASSMAIL]Extract Contact Information - Custom Parser

>We could create an account for the project at SO, give the user list as an
>email address and set up an alert so that any question tagged as [nutch]
>gets sent to [email protected].
>That should work, shouldn't it?
>
>On 12 February 2016 at 15:11, Mattmann, Chris A (3980) <[email protected]> wrote:
>
>> That’s a cool idea, but how would we set up the redirect? Wouldn’t
>> that have to occur at SO?
>>
>> -----Original Message-----
>> From: Julien Nioche <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Wednesday, February 10, 2016 at 6:48 AM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: [MASSMAIL]Extract Contact Information - Custom Parser
>>
>> >See SO =>
>> >
>> >http://stackoverflow.com/questions/35299744/nutch-parser-plugin-collect-contact-information
>> >
>> >There seem to be more and more people sending their questions to both
>> >the ML and SO. I am wondering whether we should set up a redirect so
>> >that any question asked there lands automatically on the user list.
>> >Any thoughts?
>> >
>> >On 10 February 2016 at 14:43, Markus Jelsma <[email protected]> wrote:
>> >
>> >> Yes, I would also implement an HtmlParseFilter plugin, but execute the
>> >> regex on the parseText, because that is where you are going to find
>> >> phone numbers etc.
>> >> Markus
>> >>
>> >> -----Original message-----
>> >> > From: Jorge Luis Betancourt González <[email protected]>
>> >> > Sent: Tuesday 9th February 2016 19:59
>> >> > To: [email protected]
>> >> > Subject: Re: [MASSMAIL]Extract Contact Information - Custom Parser
>> >> >
>> >> > Is there any particular requirement that prevents you from
>> >> > implementing your logic as an HtmlParseFilter plugin? Essentially,
>> >> > the parsing will be done for you (by parse-html or parse-tika) and
>> >> > all you need to do is find the right nodes and extract the desired
>> >> > information (see [1]).
>> >> >
>> >> > Regards,
>> >> >
>> >> > [1] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/headings/
>> >> >
>> >> > ----- Original Message -----
>> >> > From: "Bin Wang" <[email protected]>
>> >> > To: "Apache.Nutch.User" <[email protected]>
>> >> > Sent: Tuesday, February 9, 2016 13:19:35
>> >> > Subject: [MASSMAIL]Extract Contact Information - Custom Parser
>> >> >
>> >> > Hi there,
>> >> >
>> >> > I am working on a project that needs to identify contact points on
>> >> > companies' websites, for the purpose of enhancing security.
>> >> >
>> >> > So far, I have managed to crawl several rounds of sites. The next
>> >> > step will be to parse the HTML pages and locate where the contact
>> >> > information is. In this case, I am only interested in email
>> >> > addresses and phone numbers.
>> >> >
>> >> > Here is what I am planning to do: write a MapReduce job to parse
>> >> > the HTML files and use regular expressions in combination with an
>> >> > HTML parser such as Jsoup or BeautifulSoup to find the contact
>> >> > information.
>> >> >
>> >> > However, I am wondering whether there is any parser plugin that has
>> >> > already been implemented, and perhaps tested, for this purpose.
>> >> >
>> >> > Also, any feedback on how to achieve this is much appreciated!
>> >> >
>> >> > Best regards,
>> >> >
>> >> > Bin
>> >
>> >--
>> >
>> >*Open Source Solutions for Text Engineering*
>> >
>> >http://www.digitalpebble.com
>> >http://digitalpebble.blogspot.com/
>> >#digitalpebble <http://twitter.com/digitalpebble>
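
For anyone finding this thread later: the suggestion above (let parse-html or parse-tika do the parsing, then run a regex over the parse text) could look roughly like the sketch below. This is only an illustration, not code from any existing Nutch plugin. The class name ContactExtractor and both patterns are made up for this example, the patterns are deliberately loose, and the code works on a plain String so that it runs without Nutch on the classpath. Inside a real HtmlParseFilter you would presumably call the same methods on the text produced by the parser and store the results in the parse metadata.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: pulls email addresses and phone
// numbers out of already-extracted page text.
final class ContactExtractor {

    // Assumption: a simple email pattern, far from RFC 5322 complete.
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Assumption: a loose phone pattern. It starts with an optional '+',
    // begins and ends with a digit, and allows spaces, dots, dashes and
    // parentheses in between. It will also match things like ISO dates,
    // so real use needs tighter rules or post-filtering.
    private static final Pattern PHONE = Pattern.compile(
            "\\+?\\d[\\d\\s().-]{5,}\\d");

    private ContactExtractor() {
    }

    // Returns every distinct match of the pattern, in order of first appearance.
    private static List<String> findAll(Pattern pattern, String text) {
        Set<String> seen = new LinkedHashSet<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            seen.add(matcher.group());
        }
        return new ArrayList<>(seen);
    }

    static List<String> extractEmails(String text) {
        return findAll(EMAIL, text);
    }

    static List<String> extractPhones(String text) {
        return findAll(PHONE, text);
    }

    public static void main(String[] args) {
        // Stand-in for the text a parser would hand over for one page.
        String parseText = "Contact us: sales@example.com, phone +1 (555) 010-9999";
        System.out.println("emails: " + extractEmails(parseText));
        System.out.println("phones: " + extractPhones(parseText));
    }
}
```

Keeping the matching in a small class with no Nutch dependencies makes it easy to unit test the patterns against sample pages before wiring them into a plugin.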

