You simply need to write an HtmlParseFilter. These filters receive the DOM representation of the page from parse-tika (or parse-html). See the JIRA entry on the metatag parser for an example and discussion. There is usually no need to modify parse-html or Tika at all.
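
As a rough illustration of the DOM walk such a filter would perform, here is a minimal sketch that uses only the JDK's org.w3c.dom classes, not the Nutch API. The class name MetaTagWalker, its methods, and the sample page are hypothetical, and the name="geo:long" content="55.1244" form of the meta tag is an assumption about how the value appears in the page. In a real plugin the same recursion would run on the DOM node that the parser hands to the filter.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical stand-alone sketch: not a Nutch plugin, only the DOM traversal.
public class MetaTagWalker {

    // Recursively collects name/content pairs from <meta> elements under node.
    static void collectMeta(Node node, Map<String, String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "meta".equalsIgnoreCase(node.getNodeName())) {
            Element meta = (Element) node;
            String name = meta.getAttribute("name"); // "" when the attribute is absent
            if (!name.isEmpty()) {
                out.put(name, meta.getAttribute("content"));
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectMeta(children.item(i), out);
        }
    }

    // Parses a well-formed XHTML string with the JDK parser and walks the result.
    // Assumption: in Nutch the DOM is already built by parse-tika or parse-html.
    static Map<String, String> extract(String xhtml) throws Exception {
        Node root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> out = new LinkedHashMap<>();
        collectMeta(root, out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head>"
                + "<meta name=\"geo:long\" content=\"55.1244\"/>"
                + "</head><body/></html>";
        for (Map.Entry<String, String> e : extract(page).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

A filter built this way would then copy the collected pairs into the parse metadata so that an indexing plugin can pick them up; that step depends on the Nutch version and is left out here.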

Julien

On 17 July 2011 16:23, lewis john mcgibbney <[email protected]> wrote:
> Hi,
>
> Is this currently possible with Tika 0.9 in Nutch branch 1.4? I would have
> thought that this would have been dealt with in Tika; however, I have seen
> no mention of anyone having problems extracting this from web documents
> when fetching with Nutch, or even discussing it.
>
> For example, say I had some geographical location in a meta tag such as
> "geo:long=55.1244": is it possible to extract it with parse-tika, or would
> I need to extend parse-html?
>
> Or, on the other part, is it possible to extract hash tags from Twitter
> via the above?
>
> --
> *Lewis*

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

