Oh, ok...

As for "does for directories"... yes, I've been thinking about a modification of 
-z for tar/zip files, PST and, I guess, now WARC: files that can be so 
enormous that you'd want to unpack them before indexing.  No one would really 
want to index the Enron PST (if it actually existed) as a single file; rather, 
they'd want to unpack it and index the individual files.  And, while 
you can attach a bunch of files inside a PDF or MSOffice file, in practice 
there seems to be a fundamental difference between how users might want to deal 
with embedded files in, say, a PDF and in a PST.
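
To make the unpack-then-index idea concrete, here is a rough sketch in Python 
(standard library only, not Tika code, and not a proposal for how -z should 
actually be implemented).  The function name iter_container_entries is made up 
for illustration, and it only handles zip and tar; PST and WARC would need 
their own readers.

```python
import io
import tarfile
import zipfile
from typing import Iterator, Tuple


def iter_container_entries(data: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (entry name, entry bytes) for each regular file in a zip or tar.

    Hypothetical helper: each yielded entry could be handed to an indexer
    as its own document instead of indexing the container as a single file.
    """
    buf = io.BytesIO(data)

    # Try zip first; is_zipfile() accepts a file-like object.
    if zipfile.is_zipfile(buf):
        buf.seek(0)
        with zipfile.ZipFile(buf) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield info.filename, zf.read(info)
        return

    # Otherwise assume tar (optionally compressed); raises tarfile.ReadError
    # if the bytes are not a tar archive either.
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r:*") as tf:
        for member in tf:
            if member.isfile():
                yield member.name, tf.extractfile(member).read()
```

Something like "for name, content in iter_container_entries(raw): index(name, 
content)" is the shape I have in mind, where index() stands in for whatever 
does the per-file parsing.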

Depending on interest, it might make sense to add disk images (e.g. AFF) to the 
list of zip/PST/etc.?



-----Original Message-----
From: Nick Burch [mailto:[email protected]] 
Sent: Monday, July 10, 2017 2:45 PM
To: [email protected]
Subject: Re: Adding a WARC parser to Tika

On Mon, 10 Jul 2017, Allison, Timothy B. wrote:
> Sorry, I can't tell if this is tongue-in-cheek...

No, I do think we should add a WARC parser to Tika Parsers.

Once done, I'd suggest we figure out a way for Tika Batch to run over a 
collection of WARC files just as it does for directories, to make it easier to 
run over crawl collections without having to unpack them first!

Nick
