I'm a little surprised that Tika doesn't have something that extracts
element attribute values. The closest I've found to what I want is
extending XMLParser and overriding its getContentHandler method:
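import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.xml.AttributeMetadataHandler;
import org.apache.tika.parser.xml.XMLParser;
import org.apache.tika.sax.TeeContentHandler;
import org.xml.sax.ContentHandler;

public class RdfParser extends XMLParser
{
    private static final String RDF_NS =
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Copies the RDF-namespaced attribute `localName` into metadata property `name`.
    private static ContentHandler getRdfHandler(Metadata metadata, String name,
            String localName)
    {
        return new AttributeMetadataHandler(RDF_NS, localName, metadata, name);
    }

    @Override
    protected ContentHandler getContentHandler(ContentHandler handler,
            Metadata metadata, ParseContext context)
    {
        // Tee the default XML handling with one attribute handler per RDF attribute.
        return new TeeContentHandler(
            super.getContentHandler(handler, metadata, context),
            getRdfHandler(metadata, "about", "about"),
            getRdfHandler(metadata, "resource", "resource"));
    }
}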
But that still doesn't give me what I want, because it puts every
RDF:resource value into the single "resource" metadata property. I'd like
a way to extract the resource attribute value for the <MAF:title> element
and, separately, the one for the <MAF:originalurl> element.
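
For illustration, per-element extraction could look roughly like the sketch
below. It uses plain SAX from the JDK rather than Tika classes, and it
collects values into a Map instead of Tika's Metadata. The class name
ResourcePerElement and the extract helper are made up for this example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of Tika.
class ResourcePerElement {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Maps each element's local name (e.g. "title") to its RDF:resource value.
    static Map<String, String> extract(String xml) {
        final Map<String, String> values = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // Namespace-qualified lookup; returns null if the attribute is absent.
                String resource = attrs.getValue(RDF_NS, "resource");
                if (resource != null) {
                    values.put(localName, resource);
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the (uri, localName) lookup
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return values;
    }

    public static void main(String[] args) {
        String xml =
            "<RDF:RDF xmlns:MAF=\"http://maf.mozdev.org/metadata/rdf#\""
          + " xmlns:RDF=\"" + RDF_NS + "\">"
          + "<RDF:Description RDF:about=\"urn:root\">"
          + "<MAF:originalurl RDF:resource=\"http://news.google.com/\"/>"
          + "<MAF:title RDF:resource=\"Google News\"/>"
          + "</RDF:Description></RDF:RDF>";
        System.out.println(extract(xml));
    }
}
```

Inside Tika, the same startElement logic could presumably live in a handler
passed to TeeContentHandler, calling metadata.add(localName, resource)
instead of writing to a Map. That wiring is an assumption and is untested here.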
=============================output================================
handler:
metadata: resource=http://news.google.com/, Google News, Thu, 17 Nov 2011
09:12:39 -0500, index.html, UTF-8 about=urn:root
Content-Type=application/xml
======================================================================
On Mon, Nov 21, 2011 at 11:57 AM, Nick Burch wrote:
> On Sun, 20 Nov 2011, b m wrote:
>
>> <?xml version="1.0"?>
>> <RDF:RDF
>> xmlns:MAF="http://maf.mozdev.org/metadata/rdf#"
>> xmlns:NC="http://home.netscape.com/NC-rdf#"
>> xmlns:RDF="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
>> <RDF:Description RDF:about="urn:root">
>> <MAF:originalurl RDF:resource="http://news.google.com/"/>
>> <MAF:title RDF:resource="Google News"/>
>> <MAF:archivetime RDF:resource="Thu, 17 Nov 2011 09:12:39 -0500"/>
>> <MAF:indexfilename RDF:resource="index.html"/>
>> <MAF:charset RDF:resource="UTF-8"/>
>> </RDF:Description>
>> </RDF:RDF>
>>
>
> This xml doesn't have any text nodes, it only has attributes. (i.e.
> nothing like <foo>this is text</foo>)
>
>
> Here's my source code...
>>
>> File file = new File("/tmp/test.rdf");
>> InputStream is = new FileInputStream(file);
>> Metadata metaData = new Metadata();
>> AbstractParser parser = new RdfParser();
>> DefaultHandler handler = new ToTextContentHandler();
>>
>
> This handler will only give you the contents of text nodes, but you don't
> have any!
>
> Nick
>