Is there any particular requirement that prevents you from implementing your logic as an HtmlParser plugin? Essentially, the parsing will be done for you (by parse-html or parse-tika), and all you need to do is find the right nodes and extract the desired information (see [1]).
Regards,

[1] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/headings/

----- Original message -----
From: "Bin Wang" <[email protected]>
To: "Apache.Nutch.User" <[email protected]>
Sent: Tuesday, February 9, 2016 13:19:35
Subject: [MASSMAIL]Extract Contact Information - Custom Parser

Hi there,

I am working on a project that needs to identify contact points on a company's website, to be used for enhancing security. So far, I have managed to crawl several rounds of sites. The next step is to parse the HTML pages and locate the contact information. In this case, I am only interested in email addresses and phone numbers.

Here is what I am planning to do: write a MapReduce job that parses the HTML files and uses regular expressions in combination with an HTML parser such as Jsoup or BeautifulSoup to find the matches. However, I am wondering whether there is a parser plugin that has already been implemented, and maybe tested, for this purpose.

Any feedback on how to achieve this is much appreciated!

Best regards,
Bin
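
For illustration, the approach described in the message above (an HTML parser combined with regular expressions) could look roughly like the sketch below. It is not taken from Nutch or from any existing plugin. It uses only the Python standard library, and the names `ContactExtractor`, `extract_contacts`, `EMAIL_RE` and `PHONE_RE` are made up for this example. The two patterns are deliberately loose assumptions and would need tuning for real pages and local phone formats.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately simple patterns (assumptions for this sketch,
# not a complete matcher for every valid email address or phone format).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")


class ContactExtractor(HTMLParser):
    """Collects visible text and mailto:/tel: link targets from a page."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(("mailto:", "tel:")):
                # Keep the target only: drop the scheme and any ?subject=... part.
                self.chunks.append(href.split(":", 1)[1].split("?", 1)[0])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore text that belongs to scripts and stylesheets.
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_contacts(html):
    """Return (emails, phones) found in the page, each sorted and de-duplicated."""
    parser = ContactExtractor()
    parser.feed(html)
    parser.close()
    # Join with newlines so a match cannot span two separate text nodes.
    text = "\n".join(parser.chunks)
    emails = sorted(set(EMAIL_RE.findall(text)))
    phones = sorted(set(PHONE_RE.findall(text)))
    return emails, phones
```

Calling `extract_contacts(page_html)` on the raw HTML of one fetched page returns two lists. Inside Nutch the same idea would presumably live in a Java parse filter that walks the already parsed DOM, as in the headings plugin referenced in [1].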

