That’s a cool idea, but how would we set up the redirect? Wouldn’t
that have to happen on the SO side?

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Julien Nioche <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, February 10, 2016 at 6:48 AM
To: "[email protected]" <[email protected]>
Subject: Re: [MASSMAIL]Extract Contact Information - Custom Parser

>See SO =>
>http://stackoverflow.com/questions/35299744/nutch-parser-plugin-collect-contact-information
>
>There seem to be more and more people sending their questions to both
>the ML and SO. I am wondering whether we should set up a redirect so
>that any question asked there lands automatically on the user list.
>Any thoughts?
>
>On 10 February 2016 at 14:43, Markus Jelsma <[email protected]>
>wrote:
>
>> Yes, I would also implement an HtmlParseFilter plugin but execute the
>> regex on the parseText, because that is where you are going to find
>> phone numbers etc.
>> Markus
>>
>>
>>
>> -----Original message-----
>> > From:Jorge Luis Betancourt González <[email protected]>
>> > Sent: Tuesday 9th February 2016 19:59
>> > To: [email protected]
>> > Subject: Re: [MASSMAIL]Extract Contact Information - Custom Parser
>> >
>> > Any particular requirement that prevents you from implementing your
>> > logic as an HtmlParser plugin? Essentially, the parsing will be done
>> > for you (by parse-html or parse-tika) and all you need to do is find
>> > the right nodes and extract the desired information (see [1]).
>> >
>> > Regards,
>> >
>> > [1] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/headings/
>> >
>> > ----- Original message -----
>> > From: "Bin Wang" <[email protected]>
>> > To: "Apache.Nutch.User" <[email protected]>
>> > Sent: Tuesday, 9 February 2016 13:19:35
>> > Subject: [MASSMAIL]Extract Contact Information - Custom Parser
>> >
>> > Hi there,
>> >
>> > I am working on a project that needs to identify contact points on
>> > company websites, to be used for the purpose of enhancing security.
>> >
>> > Right now, I managed to crawl several rounds of sites. The next step
>> > will be to parse the HTML pages and locate where the contact
>> > information is. In this case, I am only interested in email addresses
>> > and phone numbers.
>> >
>> > Here is what I am planning to do: write a MapReduce job to parse the
>> > HTML files and use regular expressions in combination with an HTML
>> > parser such as Jsoup or BeautifulSoup to find the matches.
>> >
>> > However, I am wondering whether there is any parser plugin that has
>> > already been implemented, and maybe tested, for this purpose?
>> >
>> > Also, any feedback on how to achieve this is much appreciated!
>> >
>> > Best regards,
>> >
>> > Bin
>> >
>>
>
>
>
>-- 
>
>*Open Source Solutions for Text Engineering*
>
>http://www.digitalpebble.com
>http://digitalpebble.blogspot.com/
>#digitalpebble <http://twitter.com/digitalpebble>
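
As a rough illustration of the suggestion earlier in the thread to run the regex over the extracted parse text rather than the raw HTML, here is a minimal Python sketch. It is not Nutch code and does not use any Nutch API: the two patterns, the function name extract_contacts and the sample string are all assumptions made for illustration, and real-world phone formats would need more careful handling.

```python
import re

# Deliberately simple, illustrative patterns (assumptions, not from the thread).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A digit, then at least 7 digits/separators, ending on a digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_contacts(text):
    """Return de-duplicated, sorted emails and phone-like strings from plain text."""
    emails = sorted(set(EMAIL_RE.findall(text)))
    phones = sorted(set(m.strip() for m in PHONE_RE.findall(text)))
    return {"emails": emails, "phones": phones}


# Hypothetical stand-in for the text a parser would produce for one page.
sample = "Contact sales at sales@example.com or call +1 (555) 010-9999."
result = extract_contacts(sample)
print(result)
```

In a Nutch plugin the same matching would presumably be done in Java inside the filter, with the matches stored as parse metadata; the sketch only shows the text-matching step.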
