On 03/08/2012 04:43 PM, Nick Burch wrote:
> On Thu, 8 Mar 2012, Harry Simons wrote:
>> I tried the BFF Validator, and it is indeed failing!
>
> If you're able to share the error log, that could be helpful
--------------------------------------------
<BFFValidation path="failing.doc" datetime="03/08/12 07:14:27" result="FAILED">
  <ParseStack>
    <Type builtinType="Docfile" docName="MS-DOC" sectionTitle="File Structure" msdnLink="http://msdn.microsoft.com/en-us/library/4eaddc8f-4abd-43bb-8fd4-aef9c6121737">
      <Info>Built-in type "Docfile": The root storage object of an OLE compound file. For more information, see http://msdn.microsoft.com/en-us/library/dd942138.aspx.</Info>
    </Type>
    <Type builtinType="Stream" docName="MS-DOC" sectionTitle="File Structure" msdnLink="http://msdn.microsoft.com/en-us/library/4eaddc8f-4abd-43bb-8fd4-aef9c6121737" streamName="WordDocument" streamOffset="0" hexStreamOffset="0x0">
      <Info>Built-in type "Stream": Any stream object for OLE compound files. The entire file contents for other files.</Info>
    </Type>
    <Type docName="MS-DOC" sectionTitle="Fib" sectionNumber="2.5.1" msdnLink="http://msdn.microsoft.com/en-us/library/9AEAA2E7-4A45-468E-AB13-3F6193EB9394" streamName="WordDocument" streamOffset="0" hexStreamOffset="0x0"/>
    <Type docName="MS-DOC" sectionTitle="FibBase" sectionNumber="2.5.2" msdnLink="http://msdn.microsoft.com/en-us/library/26FB6C06-4E5C-4778-AB4E-EDBF26A545BB" streamName="WordDocument" streamOffset="0" hexStreamOffset="0x0"/>
    <Type builtinType="USHORT" streamName="WordDocument" bitfield="True" bitOffsetWithinStruct="84" hexBitOffsetWithinStruct="0x54" bitCount="4" streamOffsetOfStruct="0" hexStreamOffsetOfStruct="0x0" streamOffset="10" hexStreamOffset="0xa" childId="10" hexChildId="0xa">
      <Info>Built-in type "USHORT": Unsigned 2-byte integer.</Info>
    </Type>
  </ParseStack>
  <LastData><![CDATA[
EC A5 01 01 4D 20 09 04 00 00 08 12 BF 00 00 00  ....M...........
00 00 00 30 00 00 00 00 00 08 00 00 66 EF 00 00  ...0........f...
]]></LastData>
</BFFValidation>
--------------------------------------------
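For anyone who wants to see what the validator stopped on, here is a minimal sketch that decodes the flagged bitfield by hand from the LastData bytes in the log. It assumes the dump starts at stream offset 0 (the leading EC A5 looks like the Word binary signature 0xA5EC), that the USHORT is little-endian, and that bits are counted from the least significant bit. The helper name `bitfield` is made up for illustration. If I read MS-DOC section 2.5.2 correctly, these four bits are the cQuickSaves field, but treat that as an assumption rather than a diagnosis.

```python
import struct

# The two LastData rows from the validator log, as raw bytes.
# Assumption: the dump begins at offset 0 of the WordDocument stream.
last_data = bytes.fromhex(
    "EC A5 01 01 4D 20 09 04 00 00 08 12 BF 00 00 00"
    "00 00 00 30 00 00 00 00 00 08 00 00 66 EF 00 00"
)


def bitfield(data, byte_offset, bit_offset_in_word, bit_count):
    """Read a little-endian 16-bit word and extract bit_count bits from it.

    Hypothetical helper: bits are numbered from the least significant bit.
    """
    (word,) = struct.unpack_from("<H", data, byte_offset)
    return (word >> bit_offset_in_word) & ((1 << bit_count) - 1)


# Values copied from the last <Type> element in the log.
STREAM_OFFSET = 10             # streamOffset
BIT_OFFSET_WITHIN_STRUCT = 84  # bitOffsetWithinStruct
BIT_COUNT = 4                  # bitCount

# Position of the field inside the 16-bit word at byte 10.
bit_in_word = BIT_OFFSET_WITHIN_STRUCT - STREAM_OFFSET * 8

magic = struct.unpack_from("<H", last_data, 0)[0]
value = bitfield(last_data, STREAM_OFFSET, bit_in_word, BIT_COUNT)

print(hex(magic), bit_in_word, value)
```

The same helper can be pointed at other offsets reported by the validator, provided the bytes in question fall inside the 32 bytes that the log includes.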
>> However, the file got created by MS Word only, and I doubt if it's 'corrupt'... since both MS Word and LibreOffice can load it fine without any errors or even warnings of any kind -- everything seems to be normal with these apps. I can even use LibreOffice 3.5 to convert it to pdf or to a .zip of xml's.
>
> If you load it in word, and do a save-as, does the new .doc file show the same problem?
No, then it /is/ able to extract the text, but it appends the following to the extracted text:
--------------------------------------------
_-1388201556/ole-[42, 4D, 0E, 0A, 00, 00, 00, 00]
_-1388203796/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]
_-1388843352/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]
_-1388845272/ole-[42, 4D, BA, 09, 00, 00, 00, 00]
_-1388297360/ole-[42, 4D, BA, 09, 00, 00, 00, 00]
_-1388297680/ole-[42, 4D, D6, 09, 00, 00, 00, 00]
_-1388296720/ole-[42, 4D, BA, 09, 00, 00, 00, 00]
_-1388203476/ole-[42, 4D, 66, 09, 00, 00, 00, 00]
_-1382869532/ole-[42, 4D, 36, 0C, 00, 00, 00, 00]
_-1388200596/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]
_-1388200916/ole-[42, 4D, BA, 09, 00, 00, 00, 00]
_-1383036196/ole-[42, 4D, 12, 09, 00, 00, 00, 00]
_-1382867932/ole-[42, 4D, 86, 0A, 00, 00, 00, 00]
_-1382868252/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]
_-1380808936/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]
--------------------------------------------

I have 1000s of such documents, and I'm hoping I won't have to repeat this process for each one of them. :-(
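As a stopgap while the underlying cause is investigated, lines of this shape could be filtered out of the extracted text after the fact. The sketch below is a workaround under stated assumptions, not a fix in Tika or POI. It assumes each residue entry sits on its own line and always has the form `_<number>/ole-[hex, hex, ...]` seen above. The names `OLE_LINE` and `strip_ole_residue` are made up for illustration. (Each entry starts with 42 4D, which is ASCII "BM", so these may be embedded bitmaps, but that is a guess.)

```python
import re

# Matches one residue line such as:
#   _-1388201556/ole-[42, 4D, 0E, 0A, 00, 00, 00, 00]
# Assumption: the pattern is inferred only from the sample output above.
OLE_LINE = re.compile(
    r"^_-?\d+/ole-\[[0-9A-Fa-f]{2}(?:, [0-9A-Fa-f]{2})*\]$"
)


def strip_ole_residue(text):
    """Return text with every line that looks like OLE residue removed."""
    kept = [
        line
        for line in text.splitlines()
        if not OLE_LINE.match(line.strip())
    ]
    return "\n".join(kept)


sample = (
    "First paragraph of the document.\n"
    "_-1388201556/ole-[42, 4D, 0E, 0A, 00, 00, 00, 00]\n"
    "_-1388203796/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]\n"
    "Last paragraph of the document."
)

print(strip_ole_residue(sample))
```

Because the filter only drops whole lines that match the pattern exactly, ordinary text that merely mentions "ole" is left alone.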
I don't know what version of Word the original document got created with, but I used MS Word 2007 for the 'Save as' you just suggested.
>> Do you/others still feel it could be addressed by a POI upgrade?
>
> You could try with the Tika 1.1 release candidate, that has the latest POI release in it. You could also try dropping in a recent POI nightly build to see if that helps - Tika will upgrade shortly to POI 3.8 beta 6 once that's out

The Tika 1.1 release candidate made no difference. It gave the same behavior with both files:
original file: same exceptions
re-saved : same extraneous text appended (pasted above).
>> Also, I thought Tika uses POI and would be using POI as a .jar. But looking in Tika sources, I could find only *POI*.java files but no *POI*.jar or *poi*.jar file(s).
>
> Depends how you use Tika. The Tika-App inlines all the dependencies, the Tika OSGi Bundle has them individually as jars in the bundle, or Maven will download them for you

It seems the OSGi bundle may be the right packaging choice for me, since it would allow POI upgrades independent of Tika. I have never used Maven or OSGi... is there a link I can download the OSGi bundle from, so that I can then follow the instructions that come with it? I can't see it anywhere on the Tika site.
