I have a student working on this right now.

One thing: Tika has a PhoneNumber Content Handler, and it could be
leveraged in this kind of plugin in Nutch. Tyler Palsulich worked on
it as part of our DARPA work.
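
As a rough illustration of the regex-on-parse-text idea suggested further
down in this thread, here is a minimal Python sketch. The two patterns, the
function name extract_contacts, and the sample text are all assumptions made
for illustration; none of them come from Nutch or Tika, and the phone pattern
only covers simple North American formats. Inside Nutch the same logic would
sit in the parse filter plugin and be written in Java.

```python
import re

# Illustrative patterns only (assumptions, not taken from Nutch or Tika).
# Email: a pragmatic approximation, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Phone: North American style with separators, e.g. 626-555-0100 or
# (626) 555-0100, optionally prefixed with +1. International formats
# are not handled.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?(?:\(\d{3}\)\s?|\d{3}[\s.-])\d{3}[\s.-]\d{4}(?!\d)"
)


def extract_contacts(parse_text):
    """Return the unique emails and phone numbers found in plain text.

    parse_text stands in for the text a parser has already extracted
    from the HTML, so no markup handling is needed here.
    """
    emails = sorted(set(EMAIL_RE.findall(parse_text)))
    phones = sorted(set(PHONE_RE.findall(parse_text)))
    return {"emails": emails, "phones": phones}


if __name__ == "__main__":
    sample = "Contact us at info@example.com or call (626) 555-0100."
    print(extract_contacts(sample))
```

Running the patterns on already-extracted text rather than on raw HTML
keeps them from matching inside tags, attributes, or scripts.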

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Markus Jelsma <markus.jel...@openindex.io>
Reply-To: "user@nutch.apache.org" <user@nutch.apache.org>
Date: Wednesday, February 10, 2016 at 6:43 AM
To: "user@nutch.apache.org" <user@nutch.apache.org>
Subject: RE: [MASSMAIL]Extract Contact Information - Custom Parser

>Yes, I would also implement an HtmlParseFilter plugin, but run the
>regex on the parse text, because that is where you are going to find
>phone numbers etc.
>Markus
>
> 
> 
>-----Original message-----
>> From:Jorge Luis Betancourt González <jlbetanco...@uci.cu>
>> Sent: Tuesday 9th February 2016 19:59
>> To: user@nutch.apache.org
>> Subject: Re: [MASSMAIL]Extract Contact Information - Custom Parser
>> 
>> Is there any particular requirement that prevents you from implementing
>> your logic as an HtmlParseFilter plugin? Essentially, the parsing will be
>> done for you (by parse-html or parse-tika), and all you need to do is
>> find the right nodes and extract the desired information (see [1]).
>> 
>> Regards,
>> 
>> [1] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/headings/
>> 
>> ----- Original message -----
>> From: "Bin Wang" <binwang...@gmail.com>
>> To: "Apache.Nutch.User" <user@nutch.apache.org>
>> Sent: Tuesday, February 9, 2016 13:19:35
>> Subject: [MASSMAIL]Extract Contact Information - Custom Parser
>> 
>> Hi there,
>> 
>> I am working on a project that needs to identify contact points on a
>> company's website, to be used for the purpose of enhancing security.
>> 
>> Right now, I have managed to crawl several rounds of sites. The next step
>> will be to parse the HTML pages and locate where the contact information
>> is. In this case, I am only interested in email addresses and phone
>> numbers.
>> 
>> Here is what I am planning to do: write a MapReduce job to parse the
>> HTML files and use regular expressions, in combination with an HTML
>> parser such as Jsoup or Beautiful Soup, to find the matches.
>> 
>> However, I am wondering whether there is any parser plugin that has
>> already been implemented, and maybe tested, for this purpose.
>> 
>> Also, any feedback on how to achieve this is much appreciated!
>> 
>> Best regards,
>> 
>> Bin
>> 
