On Jul 11, 2010, at 6:09 PM, Paul Jakubik wrote:
> Hi,
> 
> I want to be able to parse zip, tar.gz, etc. files and extract metadata from 
> each file in the package. When I looked through the code, it looks like the 
> package parser creates a separate metadata object for each file in the 
> package, and I don't see a way to get to that object.
> 
> - Are there any plans for adding the ability to extract metadata from each 
> file in the package?
> 
> Here are the first two ways I thought of that this could be implemented:
> - Add metadata to the context, and add a clear method so the user can clear 
> the metadata after each file in the package is parsed (in the ContentHandler 
> when a "div" element is closed).
> - Write all metadata to the header section of the generated XHTML for each 
> document.
> 
> Will there be a way to get this metadata anytime soon?


I asked about this same thing almost exactly a year ago.
http://mail-archives.apache.org/mod_mbox/lucene-tika-dev/200906.mbox/%[email protected]%3e

and got unceremoniously shot down.
http://mail-archives.apache.org/mod_mbox/lucene-tika-dev/200907.mbox/%[email protected]%3e

        That would confuse the distinction between metadata and normal
        document text, especially when just the character stream is extracted.
        If you need access to the entry metadata, it would probably be better
        to expose it as attributes of the package-entry div.

Still really really want this feature.  

I don't like the idea of writing all the metadata in the HEAD, because then you 
don't know which file had which metadata.  I guess you could always make the ID 
attribute of XHTML's META tag correspond to the filename of the contained file, 
but it feels kind of sloppy to do that.

Mostly, I think the problem comes from trying to shove everything into XHTML 
instead of just fully embracing XML.  XHTML isn't necessarily the best choice, 
especially for package files.  If XML were used, there's no reason why you 
couldn't have something that looked like:
        <FILE>
                <META key="" value="" />
                <CONTENT>
                        <FILE>
                                <META key="" value="" />
                                <CONTENT>
                                </CONTENT>
                        </FILE>
                        <FILE>
                                <META key="" value="" />
                                <CONTENT>
                                </CONTENT>
                        </FILE>
                </CONTENT>
        </FILE>

But the XHTML-vs-XML ship has sailed, so there's no point in re-litigating 
that.  Perhaps it's something to consider for version 2.0.

An alternative way of handling this would be to create a nonrecursive version 
of AutoDetectParser.  When the parser returned the metadata for the package, 
it could set a metadata key such as isPackage=TRUE.  The user could then get 
an iterator over the files contained in the package and manually call 
AutoDetectParserNonRecursive on each one, thus getting the metadata as needed.
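
To make the nonrecursive idea concrete, here is a rough sketch that uses only 
java.util.zip, not Tika's classes.  Everything in it is hypothetical: the names 
PackageEntryMetadata, Entry, listEntries and zipOf are made up for illustration, 
and guessing isPackage from a ".zip" suffix is a crude stand-in for real type 
detection.  It only shows the shape of the API: one metadata map per contained 
file, with the caller deciding whether to descend into nested packages.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class PackageEntryMetadata {

    /** Hypothetical holder: one contained file with its own metadata map. */
    public static class Entry {
        public final Map<String, String> metadata = new LinkedHashMap<>();
        public final byte[] content;
        Entry(byte[] content) { this.content = content; }
    }

    /** Lists the direct entries of a zip. Nested packages are NOT descended into. */
    public static List<Entry> listEntries(byte[] zipBytes) {
        List<Entry> result = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry ze;
            byte[] buf = new byte[4096];
            while ((ze = zip.getNextEntry()) != null) {
                if (ze.isDirectory()) continue;
                // read() returns -1 at the end of the current entry, not of the archive
                ByteArrayOutputStream body = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(buf)) != -1) body.write(buf, 0, n);
                Entry e = new Entry(body.toByteArray());
                e.metadata.put("name", ze.getName());
                e.metadata.put("size", Integer.toString(e.content.length));
                // Assumption: suffix check stands in for real content-type detection
                e.metadata.put("isPackage", Boolean.toString(ze.getName().endsWith(".zip")));
                result.add(e);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return result;
    }

    /** Test helper: builds an in-memory zip from parallel name/content arrays. */
    public static byte[] zipOf(String[] names, byte[][] contents) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (int i = 0; i < names.length; i++) {
                zip.putNextEntry(new ZipEntry(names[i]));
                zip.write(contents[i]);
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] inner = zipOf(new String[] {"b.txt"},
                new byte[][] {"world!".getBytes(StandardCharsets.UTF_8)});
        byte[] outer = zipOf(new String[] {"a.txt", "inner.zip"},
                new byte[][] {"hello".getBytes(StandardCharsets.UTF_8), inner});
        for (Entry e : listEntries(outer)) {
            System.out.println(e.metadata);
            // The caller, not the parser, chooses to go one level deeper
            if ("true".equals(e.metadata.get("isPackage"))) {
                for (Entry nested : listEntries(e.content)) {
                    System.out.println("  " + nested.metadata);
                }
            }
        }
    }
}
```

Because each contained file carries its own map, there is never any question 
about which file a given piece of metadata belongs to.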

--
Jonathan Koren
[email protected]
http://www.soe.ucsc.edu/~jonathan/

