Yes, I would also implement an HtmlParseFilter plugin, but execute the regex on
the parseText, because that is where you are going to find phone numbers etc.
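
A rough sketch of the regex part, in plain Java so it runs without Nutch on the
classpath. The class name ContactExtractor and both patterns are made up for
illustration and are deliberately loose (the email pattern is not RFC-complete,
and the phone pattern will also match other long digit runs). Inside a real
filter you would presumably pass in the parse text of the page instead of the
hard-coded sample string:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: pulls contact details out of plain text.
final class ContactExtractor {

    // Simplified email pattern (assumption: good enough for a first pass).
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Loose phone pattern: optional '+', a digit, then at least six digits or
    // separators (space, parentheses, dot, dash), ending on a digit.
    private static final Pattern PHONE =
        Pattern.compile("\\+?\\d[\\d\\s().-]{6,}\\d");

    // Collects every non-overlapping match of the pattern, in document order.
    private static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static List<String> extractEmails(String text) {
        return findAll(EMAIL, text);
    }

    public static List<String> extractPhones(String text) {
        return findAll(PHONE, text);
    }

    public static void main(String[] args) {
        // Stand-in for the extracted text of a crawled page.
        String text = "Contact us at info@example.com or call +1 (555) 123-4567.";
        System.out.println(extractEmails(text));
        System.out.println(extractPhones(text));
    }
}
```

The matches could then be stored in the parse metadata so an indexing filter
can pick them up later; that wiring is left out here.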
Markus

 
 
-----Original message-----
> From:Jorge Luis Betancourt González <jlbetanco...@uci.cu>
> Sent: Tuesday 9th February 2016 19:59
> To: user@nutch.apache.org
> Subject: Re: [MASSMAIL]Extract Contact Information - Custom Parser
> 
> Is there any particular requirement that prevents you from implementing your
> logic as an HtmlParseFilter plugin? Essentially, the parsing will be done for
> you (by parse-html or parse-tika) and all you need to do is find the right
> nodes and extract the desired information (see [1]).
> 
> Regards,
> 
> [1] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/headings/
> 
> ----- Original message -----
> From: "Bin Wang" <binwang...@gmail.com>
> To: "Apache.Nutch.User" <user@nutch.apache.org>
> Sent: Tuesday, 9 February 2016 13:19:35
> Subject: [MASSMAIL]Extract Contact Information - Custom Parser
> 
> Hi there,
> 
> I am working on a project that needs to identify contact points on company
> websites, to be used for the purpose of enhancing security.
> 
> Right now, I managed to crawl several rounds of sites. The next step will
> be to parse the HTML pages and locate where the contact information is. In
> this case, I am only interested in email addresses and phone numbers....
> 
> Here is what I am planning to do: write a MapReduce job to parse the HTML
> files and use regular expressions in combination with an HTML parser such as
> Jsoup or BeautifulSoup to find the matches.
> 
> However, I am wondering whether there is any parser plugin that has already
> been implemented, and maybe tested, for this purpose?
> 
> Also, any feedback how to achieve this is much appreciated!
> 
> Best regards,
> 
> Bin
> 
