Hi Nick,
On Jan 19, 2010, at 5:41am, Nick Burch wrote:
> Hi All
>
> I've been taking a look at the HtmlParser, and I can't spot anything
> in there that extracts any of the dublin core metadata that could be
> there. It seems that it's only things like location and encoding
> that get set onto the metadata object. Nothing like description,
> author etc seems to get set.

Only location & encoding are explicitly looked for, but all meta tag
values get put into the metadata map.
See HtmlHandler.startElement(), where it has:

    if (bodyLevel == 0 && discardLevel == 0) {
        if ("META".equals(name) && atts.getValue("content") != null) {
            if (atts.getValue("http-equiv") != null) {
                metadata.set(
                        atts.getValue("http-equiv"),
                        atts.getValue("content"));
            }
            if (atts.getValue("name") != null) {
                metadata.set(
                        atts.getValue("name"),
                        atts.getValue("content"));
            }

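Note, though, that the names defined in Tika's DublinCore enum seem to be
missing the "dc." prefix, so a tag such as <meta name="dc.title" ...> would
be stored under the key "dc.title" and presumably would not match the
corresponding DublinCore name.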
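
To make the mismatch concrete, here is a rough sketch in plain Java. It is not Tika code: a Map stands in for the metadata object, and the class MetaTagSketch and the methods recordMeta and lookupDublinCore are hypothetical names invented for this illustration. It copies the branching from the excerpt above, then shows one possible way a caller could fall back to a "dc."-prefixed key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: a plain Map stands in for Tika's metadata object.
class MetaTagSketch {

    // The prefix that pages commonly put in front of Dublin Core names.
    static final String DC_PREFIX = "dc.";

    // Same branching as the excerpt: a meta tag is recorded only when it has
    // a content attribute, keyed by http-equiv and/or name exactly as written.
    static void recordMeta(Map<String, String> metadata,
                           String httpEquiv, String name, String content) {
        if (content == null) {
            return;
        }
        if (httpEquiv != null) {
            metadata.put(httpEquiv, content);
        }
        if (name != null) {
            metadata.put(name, content);
        }
    }

    // Hypothetical helper: try the unprefixed name first, then any key that
    // is "dc." (in any letter case) followed by that name.
    static String lookupDublinCore(Map<String, String> metadata, String unprefixedName) {
        String value = metadata.get(unprefixedName);
        if (value != null) {
            return value;
        }
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String key = entry.getKey();
            boolean hasPrefix =
                    key.regionMatches(true, 0, DC_PREFIX, 0, DC_PREFIX.length());
            if (hasPrefix
                    && key.substring(DC_PREFIX.length()).equalsIgnoreCase(unprefixedName)) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        recordMeta(metadata, "Content-Type", null, "text/html; charset=UTF-8");
        recordMeta(metadata, null, "dc.title", "Sample page");
        recordMeta(metadata, null, "DC.creator", "Example Author");
        recordMeta(metadata, null, "description", null); // skipped: no content

        // A direct lookup by the unprefixed name finds nothing.
        System.out.println(metadata.get("title"));
        // The fallback lookup finds the value stored under "dc.title".
        System.out.println(lookupDublinCore(metadata, "title"));
    }
}
```

Whether to normalize the key when the tag is parsed or at lookup time, as done here, is a design choice; the sketch does it at lookup time only so that the recording step stays identical to the excerpt.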
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g