See SO =>
http://stackoverflow.com/questions/35299744/nutch-parser-plugin-collect-contact-information

There seem to be more and more people sending their questions to both the ML
and SO. I am wondering whether we should set up a redirect so that any
question asked there lands automatically on the user list. Any thoughts?

On 10 February 2016 at 14:43, Markus Jelsma <[email protected]>
wrote:

> Yes, I would also implement an HtmlParseFilter plugin, but execute the
> regex on the parseText, because that is where you are going to find phone
> numbers etc.
> Markus
>
>
>
> -----Original message-----
> > From: Jorge Luis Betancourt González <[email protected]>
> > Sent: Tuesday 9th February 2016 19:59
> > To: [email protected]
> > Subject: Re: [MASSMAIL]Extract Contact Information - Custom Parser
> >
> > Is there any particular requirement that prevents you from implementing
> > your logic as an HtmlParser plugin? Essentially, the parsing will be done
> > for you (by parse-html or parse-tika), and all you need to do is find the
> > right nodes and extract the desired information (see [1]).
> >
> > Regards,
> >
> > [1] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/headings/
> >
> > ----- Original message -----
> > From: "Bin Wang" <[email protected]>
> > To: "Apache.Nutch.User" <[email protected]>
> > Sent: Tuesday, 9 February 2016 13:19:35
> > Subject: [MASSMAIL]Extract Contact Information - Custom Parser
> >
> > Hi there,
> >
> > I am working on a project that needs to identify contact points on
> > company websites, to be used for the purpose of enhancing security.
> >
> > So far, I have managed to crawl several rounds of sites. The next step
> > will be to parse the HTML pages and locate the contact information. In
> > this case, I am only interested in email addresses and phone numbers.
> >
> > Here is what I am planning to do: write a MapReduce job to parse the
> > HTML files and use regular expressions in combination with an HTML
> > parser such as Jsoup or BeautifulSoup to find the matches.
> >
> > However, I am wondering whether there is a parser plugin that has already
> > been implemented, and maybe tested, for this purpose.
> >
> > Also, any feedback how to achieve this is much appreciated!
> >
> > Best regards,
> >
> > Bin
> >
>
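
For reference, the regex step suggested in the quoted replies (run the
patterns over the already extracted parse text rather than the raw HTML)
could look roughly like the sketch below. It is plain Java with no Nutch
dependency; the class name ContactExtractor, the method names emails() and
phones(), and both patterns are assumptions made for illustration only. The
patterns are deliberately simple and will miss or over-match some real-world
formats. In a plugin, this logic would be called from the parse filter with
the parse text as input; that wiring is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts contact details from plain text.
final class ContactExtractor {

    // Simplified email pattern (assumption: not RFC 5322 complete).
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Simplified phone pattern (assumption): optional '+', then at least
    // nine characters of digits, spaces, parentheses, dots, or dashes,
    // starting and ending with a digit.
    private static final Pattern PHONE =
        Pattern.compile("\\+?\\d[\\d\\s().-]{7,}\\d");

    private ContactExtractor() {
    }

    // Collects every non-overlapping match of the pattern, in order.
    private static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    static List<String> emails(String text) {
        return findAll(EMAIL, text);
    }

    static List<String> phones(String text) {
        return findAll(PHONE, text);
    }
}
```

Calling ContactExtractor.emails(text) and ContactExtractor.phones(text) on
the text of a parsed page returns the matches in document order; the results
could then be stored as parse metadata or indexed as extra fields.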



-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
