I'm a little surprised that Tika doesn't have something that extracts
element attribute values.  The only thing I've found that comes close to
what I want is extending XMLParser and overriding its getContentHandler
method:


public class RdfParser extends XMLParser
{
  private static final String RDF_NS =
      "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

  private static ContentHandler getRdfHandler(Metadata metadata,
      String name, String localName)
  {
    return new AttributeMetadataHandler(RDF_NS, localName, metadata, name);
  }

  @Override
  protected ContentHandler getContentHandler(ContentHandler handler,
      Metadata metadata, ParseContext context)
  {
    return new TeeContentHandler(super.getContentHandler(handler, metadata,
        context), getRdfHandler(metadata, "about", "about"), getRdfHandler(
        metadata, "resource", "resource"));
  }
}


But even that doesn't give me what I want, since it puts every RDF:resource
value into the Metadata object's single "resource" property.  I'd like a way
to extract the resource attribute value for the <MAF:title> element and,
separately, the resource attribute value for the <MAF:originalurl> element.
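
To make the per-element extraction concrete, here is a minimal sketch of the
kind of handler I have in mind.  It uses only the JDK's SAX API, not anything
from Tika; the class name ElementAttributeHandler, the extract() helper and the
choice of a plain Map (instead of Tika's Metadata) are my own assumptions for
illustration.  It keys each RDF:resource value by the local name of the MAF
element that carries it:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: records the RDF:resource attribute
// of every element in the MAF namespace, keyed by the element's local name.
public class ElementAttributeHandler extends DefaultHandler {
  static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
  static final String MAF_NS = "http://maf.mozdev.org/metadata/rdf#";

  // LinkedHashMap keeps the elements in document order.
  private final Map<String, String> values = new LinkedHashMap<>();

  @Override
  public void startElement(String uri, String localName, String qName,
      Attributes attributes) {
    if (!MAF_NS.equals(uri)) {
      return; // ignore RDF:RDF, RDF:Description and anything else
    }
    String resource = attributes.getValue(RDF_NS, "resource");
    if (resource != null) {
      values.put(localName, resource);
    }
  }

  public Map<String, String> getValues() {
    return values;
  }

  // Parses the given XML text with a namespace-aware SAX parser and returns
  // the collected element-name -> resource-value pairs.
  public static Map<String, String> extract(String xml) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true); // required for uri/localName lookups
    ElementAttributeHandler handler = new ElementAttributeHandler();
    factory.newSAXParser().parse(
        new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
        handler);
    return handler.getValues();
  }
}
```

With the RDF file quoted below, ElementAttributeHandler.extract(xml) should
return one entry per MAF element (originalurl, title, archivetime, and so on)
rather than a single merged "resource" value.  Inside a Tika parser the same
startElement logic could presumably write to the Metadata object instead of a
Map, but I haven't tried that.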


=============================output================================

handler:





metadata: resource=http://news.google.com/, Google News, Thu, 17 Nov 2011
09:12:39 -0500, index.html, UTF-8 about=urn:root
Content-Type=application/xml
======================================================================

On Mon, Nov 21, 2011 at 11:57 AM, Nick Burch <[email protected]> wrote:

> On Sun, 20 Nov 2011, b m wrote:
>
>> <?xml version="1.0"?>
>> <RDF:RDF xmlns:MAF="http://maf.mozdev.org/metadata/rdf#"
>>          xmlns:NC="http://home.netscape.com/NC-rdf#"
>>          xmlns:RDF="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
>>  <RDF:Description RDF:about="urn:root">
>>   <MAF:originalurl RDF:resource="http://news.google.com/"/>
>>   <MAF:title RDF:resource="Google News"/>
>>   <MAF:archivetime RDF:resource="Thu, 17 Nov 2011 09:12:39 -0500"/>
>>   <MAF:indexfilename RDF:resource="index.html"/>
>>   <MAF:charset RDF:resource="UTF-8"/>
>>  </RDF:Description>
>> </RDF:RDF>
>>
>
> This xml doesn't have any text nodes, it only has attributes. (i.e.
> nothing like <foo>this is text</foo>)
>
>
>> Here's my source code...
>>
>> File file = new File("/tmp/test.rdf");
>> InputStream is = new FileInputStream(file);
>> Metadata metaData = new Metadata();
>> AbstractParser parser = new RdfParser();
>> DefaultHandler handler = new ToTextContentHandler();
>>
>
> This handler will only give you the contents of text nodes, but you don't
> have any!
>
> Nick
>
