You can write a simple parse filter plugin instead of rewriting a whole parser.
With the NodeWalker (org.apache.nutch.util.NodeWalker) you can walk all nodes
of the DOM and read the alt attribute of each img tag:
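NodeWalker walker = new NodeWalker(doc);
// Visit every node of the parsed DOM.
while (walker.hasNext()) {
  Node currentNode = walker.nextNode();
  if (currentNode.getNodeType() == Node.ELEMENT_NODE) {
    if ("img".equalsIgnoreCase(currentNode.getNodeName())) {
      HashMap<String,String> atts = getAttributes(currentNode);
      // getAttributes() lower-cases the keys, so the alt text is atts.get("alt").
    }
  }
}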

protected HashMap<String,String> getAttributes(Node node) {
  HashMap<String,String> attribMap = new HashMap<String,String>();
  NamedNodeMap attributes = node.getAttributes();
  for (int i = 0; i < attributes.getLength(); i++) {
    Attr attribute = (Attr) attributes.item(i);
    attribMap.put(attribute.getName().toLowerCase(), attribute.getValue());
  }
  return attribMap;
}
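
If you want to try the idea outside of Nutch first, here is a rough standalone
sketch. It is not Nutch code: it uses only the JDK DOM classes, replaces
NodeWalker with a plain stack-based walk, and the class name AltTextCollector
and its two methods are made up for this example. It also assumes well-formed
XHTML input and a lower-case "alt" attribute, which real-world HTML does not
guarantee.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
public class AltTextCollector {

    // Walks the tree depth-first with an explicit stack (a stand-in for
    // NodeWalker) and returns the alt text of every <img> that has one.
    public static List<String> collectAlts(Node root) {
        List<String> alts = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node current = stack.pop();
            if (current.getNodeType() == Node.ELEMENT_NODE
                    && "img".equalsIgnoreCase(current.getNodeName())) {
                Element img = (Element) current;
                if (img.hasAttribute("alt")) {
                    alts.add(img.getAttribute("alt"));
                }
            }
            // Push children last-to-first so they are popped in document order.
            for (Node child = current.getLastChild(); child != null;
                    child = child.getPreviousSibling()) {
                stack.push(child);
            }
        }
        return alts;
    }

    // Parses a well-formed XML/XHTML string into a DOM document.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><img src=\"a.png\" alt=\"Logo\"/>"
                + "<p><IMG src=\"b.png\" alt=\"Chart\"/></p>"
                + "<img src=\"c.png\"/></body></html>");
        System.out.println(collectAlts(doc)); // prints [Logo, Chart]
    }
}
```

In a real parse filter you would run the same kind of walk over the DOM that
Nutch hands to the plugin and store the collected text wherever your indexing
setup expects it.
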
-----Original message-----
> From:Alexandre <[email protected]>
> Sent: Mon 01-Oct-2012 15:05
> To: [email protected]
> Subject: Re: Parsing/Indexing alt tag
>
> Hi Patrick,
>
> I have the same problem.
> Did you find a way to parse the alt attributes without rewriting a complete
> parse plugin?
>
> Alex.