Re: fetching content from archives and images

Dave Meikle Mon, 07 Jan 2013 15:54:51 -0800

Hi Maciej,

On 7 Jan 2013, at 20:53, Maciej Liżewski wrote:


> Hi,
> 
> I downloaded the Tika sources and noticed that the tests (ZipParserTest) check
> that AutoDetectParser, run on a ZIP file, returns all file names and the text
> content extracted from those files, and this test passes without errors.
> However, when trying with tika-app (and in Solr) I do not get this content. I
> tried to debug, and for ZIP files PackageParser is used. The parser iterates
> through all archived entries, but in tika-app the output is empty even for
> the same ZIP file as in the tests. There is also one difference between
> tika-app and Solr: Solr returns at least the file names, while tika-app shows
> nothing at all.
> 
> I simply do not get it. If the tests confirm that extracting the content of
> archived files works, then why do I not get any content in the application
> or in Solr?
> 

By default the Tika GUI app does not extract the archived files. The 
DocumentSelector set on the parse context, which decides whether an embedded 
entry should be parsed, only allows images to be processed. The Tika CLI app 
doesn't have this problem because it uses its own EmbeddedDocumentExtractor.
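
As a rough illustration of the selector behaviour described above, here is a 
standalone sketch that uses only java.util.zip. It is not Tika's code: the 
class EmbeddedSelectorSketch, the methods zipOf and extract, and the use of a 
Predicate<String> in place of a DocumentSelector are all assumptions made for 
the example. It shows how every entry name can still be emitted (as a div) 
while the entry content only appears when the selector accepts that entry.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Predicate;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical model of "selector decides whether an embedded entry is parsed".
public class EmbeddedSelectorSketch {

    // Builds a small in-memory ZIP from alternating name/content arguments.
    static byte[] zipOf(String... namesAndContents) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (int i = 0; i < namesAndContents.length; i += 2) {
                zip.putNextEntry(new ZipEntry(namesAndContents[i]));
                zip.write(namesAndContents[i + 1].getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    // Walks every entry. The name is always written as a div, but the
    // content is only read and written when the selector accepts the entry.
    static String extract(byte[] zipBytes, Predicate<String> selector) throws IOException {
        StringBuilder out = new StringBuilder();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                out.append("<div>").append(entry.getName()).append("</div>\n");
                if (selector.test(entry.getName())) {
                    out.append(new String(zip.readAllBytes(), StandardCharsets.UTF_8));
                    out.append("\n");
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] archive = zipOf("notes.txt", "hello from notes",
                               "logo.png", "fake image bytes");

        // Stand-in for a selector that only lets images through.
        Predicate<String> imagesOnly = name -> name.endsWith(".png");
        // Stand-in for a selector that accepts every embedded entry.
        Predicate<String> everything = name -> true;

        System.out.println(extract(archive, imagesOnly));
        System.out.println(extract(archive, everything));
    }
}
```

With the images-only predicate the text of notes.txt never shows up, although 
its name does. In Tika itself the corresponding hook is the DocumentSelector 
on the parse context mentioned above; the Predicate here is only a stand-in 
for it.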

Furthermore, I suspect you will find that the Tika App shows the file names 
only in the structured text view, since they are output as div tags.

I would need to check the Solr code to see what it is doing.

Cheers,
Dave
