Hi,
Thank you all for the suggestions. They led me to the solution. The problem was in how Solr initializes the parser context (described here: https://issues.apache.org/jira/browse/SOLR-2416), which caused EmptyParser to be used for archived files. Applying the patch and recompiling the solr-cell jar solved my problem.

Btw, Tika App does not show even the file names. The <body> tag in the structured output is empty.

Maciek

From: Dave Meikle [mailto:[email protected]]
Sent: Tuesday, January 08, 2013 12:54 AM
To: [email protected]
Subject: Re: fetching content from archives and images

Hi Maciej,

On 7 Jan 2013, at 20:53, Maciej Liżewski <[email protected]> wrote:

> Hi,
>
> I downloaded the Tika sources and noticed that the tests (ZipParserTest) check whether AutoDetectParser, run on a ZIP file, returns all file names and the text content extracted from those files... and this test passes without errors. However, when trying with tika-app (and in Solr) I do not get this content.
>
> I tried to debug, and for ZIP files PackageParser is used. The parser iterates through all archived entries, but then in tika-app the output is empty, even for the same ZIP file as in the tests...
>
> There is also one difference between tika-app and Solr: Solr returns at least the file names, while tika-app shows nothing at all.
>
> I simply do not get it... If the tests confirm that extracting the content of archived files works, then why do I not get any content in the application/Solr?

By default the Tika GUI app does not extract the files, because the DocumentSelector set on the context, which defines whether an embedded entry should be parsed, only allows images to be processed. Note that the Tika CLI app doesn't have this problem, as it uses its own EmbeddedDocumentExtractor. Furthermore, I suspect you will find that the Tika App shows the file names only in the structured text view, as they are output as div tags.

I would need to check the Solr code to see what it is doing.

Cheers,
Dave
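
For illustration, here is a toy Java model of the symptom discussed in this thread: entry names come back, but their content is empty. It assumes only what the thread states, namely that an empty parser ends up being used when the context is not set up with a real one. The types and names below (EntryParser, Context, parseArchive) are hypothetical stand-ins and are not Tika or Solr classes.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: every type here is hypothetical, not a Tika or Solr class.
final class ContextFallbackSketch {

    /** Turns the raw content of one archive entry into extracted text. */
    interface EntryParser {
        String parse(String entryContent);
    }

    /** Stand-in for an "empty" parser: accepts anything, extracts nothing. */
    static final EntryParser EMPTY = content -> "";

    /** Stand-in for a real parser: here it simply returns the content unchanged. */
    static final EntryParser REAL = content -> content;

    /** Minimal type-keyed context; lookups fall back to a caller-supplied default. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T fallback) {
            Object value = entries.get(key);
            return value == null ? fallback : key.cast(value);
        }
    }

    /** Emits each entry name, followed by whatever the context's parser extracts. */
    static String parseArchive(Map<String, String> archive, Context context) {
        // If nobody registered a parser, the empty one is silently used.
        EntryParser parser = context.get(EntryParser.class, EMPTY);
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : archive.entrySet()) {
            out.append(entry.getKey())
               .append(" -> [")
               .append(parser.parse(entry.getValue()))
               .append("]\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> archive = new LinkedHashMap<>();
        archive.put("a.txt", "hello");

        // Context without a parser: names only.
        System.out.print(parseArchive(archive, new Context()));

        // Context with a parser registered: names and content.
        Context configured = new Context();
        configured.set(EntryParser.class, REAL);
        System.out.print(parseArchive(archive, configured));
    }
}
```

In this model the only difference between the two runs is whether a parser was registered in the context before parsing, which is the kind of initialization issue the SOLR-2416 report is described as addressing.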
