We could create an account for the project at SO, give the user list as its email address, and set up an alert so that any question tagged [nutch] gets sent to [email protected]. That should work, shouldn't it?

On 12 February 2016 at 15:11, Mattmann, Chris A (3980) <[email protected]> wrote:

> That’s a cool idea but how would we set up the redirect since
> wouldn’t that have to occur at SO?
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Julien Nioche <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, February 10, 2016 at 6:48 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: [MASSMAIL]Extract Contact Information - Custom Parser
>
> >See SO =>
> >http://stackoverflow.com/questions/35299744/nutch-parser-plugin-collect-contact-information
> >
> >There seem to be more and more people sending their questions to both the
> >ML and SO. Am wondering whether we should set up a redirect so that any
> >question asked there lands automatically on the user list. Any thoughts?
> >
> >On 10 February 2016 at 14:43, Markus Jelsma <[email protected]> wrote:
> >
> >> Yes, I would also implement a HtmlParserFilter plugin but execute the
> >> regex on the parseText, because that is where you are going to find
> >> phone numbers etc.
> >> Markus
> >>
> >> -----Original message-----
> >> > From: Jorge Luis Betancourt González <[email protected]>
> >> > Sent: Tuesday 9th February 2016 19:59
> >> > To: [email protected]
> >> > Subject: Re: [MASSMAIL]Extract Contact Information - Custom Parser
> >> >
> >> > Is there any particular requirement that prevents you from implementing
> >> > your logic as a HtmlParser plugin? Essentially the parsing will be done
> >> > for you (by parse-html or parse-tika) and all you need to do is find the
> >> > right nodes and extract the desired information (see [1]).
> >> >
> >> > Regards,
> >> >
> >> > [1] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/headings/
> >> >
> >> > ----- Original message -----
> >> > From: "Bin Wang" <[email protected]>
> >> > To: "Apache.Nutch.User" <[email protected]>
> >> > Sent: Tuesday, 9 February 2016 13:19:35
> >> > Subject: [MASSMAIL]Extract Contact Information - Custom Parser
> >> >
> >> > Hi there,
> >> >
> >> > I am working on a project that needs to identify contact points on a
> >> > company's website, to be used for the purpose of enhancing security.
> >> >
> >> > Right now, I have managed to crawl several rounds of sites. The next step
> >> > will be to parse the HTML pages and locate where the contact information
> >> > is. In this case, I am only interested in email addresses and phone
> >> > numbers....
> >> >
> >> > Here is what I am planning to do: we can write a MapReduce job to parse
> >> > the HTML files and use things like regular expressions in combination
> >> > with Jsoup/Beautifulsoup HTML parsers to find the matches.
> >> >
> >> > However, I am wondering whether there is any parser plugin that has
> >> > already been implemented, and maybe tested, for this purpose?
> >> >
> >> > Also, any feedback on how to achieve this is much appreciated!
> >> >
> >> > Best regards,
> >> >
> >> > Bin
> >
> >--
> >
> >*Open Source Solutions for Text Engineering*
> >
> >http://www.digitalpebble.com
> >http://digitalpebble.blogspot.com/
> >#digitalpebble <http://twitter.com/digitalpebble>

--
*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
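
For anyone landing on this thread later: the approach suggested in the quoted replies (run a regex over the parse text rather than the raw HTML) could look roughly like the sketch below. It is only an illustration in plain Java. The class name ContactExtractor, its methods, and both patterns are assumptions made up for this example, not part of Nutch, and the patterns are deliberately loose, so expect false positives (long digit runs, for instance). Wiring this into a parse filter plugin is not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Nutch class): pulls contact details out of plain text. */
class ContactExtractor {

    // Loose illustrative patterns (assumptions); tune them for the sites being crawled.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    // Optional '+', a digit, at least 7 digits/separators, then a closing digit.
    private static final Pattern PHONE =
        Pattern.compile("\\+?\\d[\\d\\s().-]{7,}\\d");

    /** Returns the distinct email-like strings in the text, in order of first appearance. */
    static List<String> findEmails(String text) {
        return findAll(EMAIL, text);
    }

    /** Returns the distinct phone-like strings in the text, in order of first appearance. */
    static List<String> findPhones(String text) {
        return findAll(PHONE, text);
    }

    private static List<String> findAll(Pattern pattern, String text) {
        Set<String> found = new LinkedHashSet<>(); // drops duplicates, keeps order
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return new ArrayList<>(found);
    }
}
```

Inside a plugin, the string passed to findEmails and findPhones would be the text extracted by the parser, which is where phone numbers and addresses show up without markup in the way.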

