RE: Questions about using AutoDetect and DigestParser

Allison, Timothy B. Tue, 05 Jan 2016 05:37:55 -0800

>>Question1) Shouldn't this be more specific? Like PdfParser, 
>>OpenDocumentParser and so on.


Yes. Make sure to call metadata.getValues("X-Parsed-By"), which returns an
array of values, and then iterate through that array to see the parsers that
actually processed your doc; the more specific parser shows up after
DefaultParser. If you call metadata.get(Property p), you only get the first
value in the array.

>> Question2) I understand that there is the DigestingParser to add Md5 and 
>> Sha1 hashes to the metadata. But how can I "combine" the AutoDetectParser 
>> and the DigestingParser?

See DigestingParserTest [0] for exact code, but basically something like this:

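Metadata m = new Metadata();
CommonsDigester.DigestAlgorithm[] algos = CommonsDigester.parse("md5,sha512");
// Wrap the AutoDetectParser so the digests are computed during the parse.
Parser d = new DigestingParser(new AutoDetectParser(),
        new CommonsDigester(1000000, algos));

// Pass m as the Metadata argument; the digests are added to it.
d.parse(InputStream....)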
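
If you want to cross-check the values that end up in the metadata, the same
kind of hex digest can be computed with the JDK alone. This is only a rough
sketch and not what Tika does internally: the class and method names
(DigestSketch, hexDigest) are made up for illustration, and it assumes the
digests are stored as lowercase hex strings (I believe under keys such as
X-TIKA:digest:MD5, but check that against your Tika version).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of Tika.
public class DigestSketch {

    /** Reads the whole stream and returns its digest as lowercase hex. */
    public static String hexDigest(InputStream in, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] buf = new byte[8192];
        int n;
        // Feed the stream to the digest chunk by chunk.
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            // %02x on a byte prints two unsigned lowercase hex digits.
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        // prints 5d41402abc4b2a76b9719d911017c592
        System.out.println(hexDigest(new ByteArrayInputStream(data), "MD5"));
    }
}
```

Note that JDK algorithm names use the "SHA-512" spelling rather than the
"sha512" string passed to CommonsDigester.parse above.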



[0] http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/DigestingParserTest.java?view=markup

-----Original Message-----
From: [email protected] [mailto:[email protected]] 
Sent: Tuesday, January 05, 2016 3:33 AM
To: [email protected]
Subject: Questions about using AutoDetect and DigestParser

Happy New Year everyone,
I have a small program for simple text and metadata extraction. It is really 
not more than this (in Scala):

        val fileParser : AutoDetectParser = new AutoDetectParser()
        val handler : WriteOutContentHandler = new WriteOutContentHandler(-1)
        val metadata : Metadata = new Metadata()
        val context : ParseContext = new ParseContext()

        try {
            fileParser.parse(stream, handler, metadata, context)
        } catch ...

When I look at the metadata I always have this line: X-Parsed-By: 
org.apache.tika.parser.DefaultParser
Question1) Shouldn't this be more specific? Like PdfParser, OpenDocumentParser 
and so on.

Question2) I understand that there is the DigestingParser to add Md5 and Sha1 
hashes to the metadata. But how can I "combine" the AutoDetectParser and the 
DigestingParser?

Thanks so far
Kind regards
