You simply need to write an HtmlParseFilter. Filters receive the DOM
representation of the page from parse-tika (or parse-html). See JIRA for the
entry on the metatag parser for an example and discussion. There is usually
no need to modify parse-html or Tika at all.
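
As a rough illustration of the DOM walk such a filter would perform, here is
a minimal sketch in plain Java using only the JDK's org.w3c.dom classes. The
class and method names (MetaTagWalker, collect, fromXhtml) are hypothetical
and not part of Nutch; in a real plugin the same traversal would presumably
run on the DOM handed to the filter, and the collected values would be added
to the parse metadata.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Nutch class: collects <meta name/content> pairs from a DOM.
public class MetaTagWalker {

    // Recursively visit the node and its descendants, recording meta tags.
    public static void collect(Node node, Map<String, String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "meta".equalsIgnoreCase(node.getNodeName())) {
            Element meta = (Element) node;
            // getAttribute returns an empty string when the attribute is absent.
            String name = meta.getAttribute("name");
            String content = meta.getAttribute("content");
            if (!name.isEmpty() && !content.isEmpty()) {
                out.put(name.toLowerCase(Locale.ROOT), content);
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    // Stand-in for the DOM a parser would supply: parses a well-formed XHTML string.
    public static Map<String, String> fromXhtml(String xhtml) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            Map<String, String> tags = new LinkedHashMap<>();
            collect(root, tags);
            return tags;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head>"
                + "<meta name=\"geo:long\" content=\"55.1244\"/>"
                + "</head><body/></html>";
        System.out.println(fromXhtml(page));
    }
}
```

Matching on the element name case-insensitively is a deliberate choice here,
since different HTML parsers may report element names in upper or lower case.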

Julien

On 17 July 2011 16:23, lewis john mcgibbney <[email protected]> wrote:

> Hi,
>
> Is this currently possible with Tika 0.9 in Nutch branch 1.4? I would have
> thought that this would have been dealt with in Tika, however I have seen no
> mention of anyone having problems extracting this from web documents when
> fetching with Nutch or even discussing it.
>
> For example, say I had some geographical location in a meta tag such
> as "geo:long=55.1244", is it possible to extract with parse-tika or would I
> need to extend parse-html?
>
> Or the other part, is it possible to extract hash tags from twitter via the
> above?
>
> --
> *Lewis*
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
