Hi Nick,

On Jan 19, 2010, at 5:41am, Nick Burch wrote:

> Hi All
>
> I've been taking a look at the HtmlParser, and I can't spot anything in there that extracts any of the Dublin Core metadata that could be there. It seems that only things like location and encoding get set on the metadata object. Nothing like description, author etc. seems to get set.

Only location & encoding are explicitly looked for, but all meta tag values get put into the metadata map.

See HtmlHandler.startElement(), where it has:

        if (bodyLevel == 0 && discardLevel == 0) {
            if ("META".equals(name) && atts.getValue("content") != null) {
                if (atts.getValue("http-equiv") != null) {
                    metadata.set(
                            atts.getValue("http-equiv"),
                            atts.getValue("content"));
                }
                if (atts.getValue("name") != null) {
                    metadata.set(
                            atts.getValue("name"),
                            atts.getValue("content"));
                }


Though the names defined in Tika's DublinCore enum seem to be missing the "dc." prefix.
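
To make the prefix issue concrete, here is a minimal standalone sketch of the behaviour described above. It uses a plain Map rather than Tika's Metadata class, and the names MetaTagSketch, recordMeta and lookupDublinCore are made up for illustration; none of this is Tika code. It assumes a page carries a tag such as <meta name="dc.creator" content="...">, so the value lands under "dc.creator" and a lookup by the bare name "creator" would miss it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the handler logic quoted above; not Tika code.
class MetaTagSketch {

    // Mimics the quoted startElement() logic for a single META element:
    // the content value is stored under the http-equiv and/or name attribute.
    static void recordMeta(Map<String, String> metadata,
                           String httpEquiv, String name, String content) {
        if (content == null) {
            return; // the quoted code skips META tags without a content attribute
        }
        if (httpEquiv != null) {
            metadata.put(httpEquiv, content);
        }
        if (name != null) {
            metadata.put(name, content);
        }
    }

    // Assumed workaround: try the bare key first, then the "dc."-prefixed
    // form that an HTML meta tag may have used.
    static String lookupDublinCore(Map<String, String> metadata, String bareKey) {
        String value = metadata.get(bareKey);
        return value != null ? value : metadata.get("dc." + bareKey);
    }
}
```

With this sketch, recordMeta(map, null, "dc.creator", "Nick") stores the value under "dc.creator" only, so map.get("creator") returns null while lookupDublinCore(map, "creator") finds it. Whether the real fix belongs in the parser or in the DublinCore names is a separate question.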

-- Ken



--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



