Re: [MASSMAIL]Extract Contact Information - Custom Parser

Jorge Luis Betancourt González Tue, 09 Feb 2016 10:59:59 -0800

Is there any particular requirement that prevents you from implementing your
logic as an HtmlParser plugin? Essentially, the parsing will be done for you
(by parse-html or parse-tika), and all you need to do is find the right nodes
and extract the desired information (see [1]).
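
As a rough illustration of the "find the right nodes and extract" step, here is
a minimal sketch in plain Java. It is not a Nutch plugin: inside Nutch the DOM
would be handed to you by parse-html or parse-tika, whereas this sketch builds
one with the JDK's XML parser from well-formed XHTML as a stand-in. The class
name ContactExtractor and both regular expressions are assumptions made for the
example, and the patterns are deliberately loose.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; the name and patterns are illustrative assumptions.
public class ContactExtractor {
    // Loose patterns: they will miss some valid values and accept some invalid ones.
    static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    static final Pattern PHONE =
        Pattern.compile("\\+?\\d[\\d\\s().-]{7,}\\d");

    // Adds every match of the pattern found in the text to the output set.
    static void collect(Pattern pattern, String text, Set<String> out) {
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            out.add(m.group().trim());
        }
    }

    // Depth-first walk: inspects text nodes and href attributes (mailto:, tel:).
    static void walk(Node node, Pattern pattern, Set<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            collect(pattern, node.getNodeValue(), out);
        } else if (node.getNodeType() == Node.ELEMENT_NODE) {
            Node href = node.getAttributes().getNamedItem("href");
            if (href != null) {
                collect(pattern, href.getNodeValue(), out);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), pattern, out);
        }
    }

    // Parses well-formed XHTML (stand-in for the DOM a crawler's parser provides)
    // and returns the distinct matches in document order.
    public static Set<String> extract(String xhtml, Pattern pattern) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            Set<String> out = new LinkedHashSet<>();
            walk(root, pattern, out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Call +1 (555) 010-9999 today</p>"
            + "<a href=\"mailto:sales@example.com\">Email us</a></body></html>";
        System.out.println("emails: " + extract(page, EMAIL));
        System.out.println("phones: " + extract(page, PHONE));
    }
}
```

Checking href attributes as well as text nodes matters because many sites only
expose the address inside a mailto: link. Real-world pages are rarely
well-formed, which is one more reason to let the crawler's own parser build the
DOM rather than parsing the raw HTML yourself.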


Regards,

[1] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/headings/

----- Original Message -----
From: "Bin Wang" <[email protected]>
To: "Apache.Nutch.User" <[email protected]>
Sent: Tuesday, 9 February 2016 13:19:35
Subject: [MASSMAIL]Extract Contact Information - Custom Parser

Hi there,

I am working on a project that needs to identify contact points on company
websites, to be used for the purpose of enhancing security.

Right now, I managed to crawl several rounds of sites. The next step will
be to parse the HTML pages and locate where the contact information is. In
this case, I am only interested in email addresses and phone numbers....

Here is what I am planning to do: write a MapReduce job to parse the HTML
files, using regular expressions in combination with an HTML parser such as
Jsoup or BeautifulSoup to find the matches.

However, I am wondering whether there is any parser plugin that has already
been implemented, and maybe tested, for this purpose?

Also, any feedback on how to achieve this is much appreciated!

Best regards,

Bin
