Re: 3.6.0 regression: non-deterministic filter selection ...
Tor - I had a feeling you were looking at some (MathML ?) detection issue that may be related to this (and/or fixed it already); Don't think so; the problem I looked at was caused by the MathML filter detecting *any* XML file as MathML, but then obviously failing to actually load it (and after a load attempt has failed there is no loop back to try other filters). In this case the user *wanted* the ASCII filter to be the one invoked. (Surely there must be better arbitrary-XML editors than Writer, that at least make sure XML start and end tags match etc automatically, but oh well). --tml ___ LibreOffice mailing list LibreOffice@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/libreoffice
Re: 3.6.0 regression: non-deterministic filter selection ...
On Wed, Aug 15, 2012 at 12:40 AM, Kohei Yoshida kyosh...@novell.com wrote: On Tue, Aug 14, 2012 at 3:15 PM, Michael Meeks michael.me...@suse.com wrote: Thoughts ? It appears that the type detection asks SwFilterDetect::detect() to detect its type, and it returns empty handed, which eventually leads to it being detected as ascii text. I've tried a simple Word document I created from Word XP, and the same code detects the file type just fine. I'm right now looking in to see why SwFilterDetect::detect() fails to detect your particular document. I'm starting to suspect that maybe it's the filter itself failing to parse this document, rather than the type detection system failing to detect it. Here is my reasoning: 1) Type detection correctly prioritize the writer_MS_Word_97 file type at the top of the list, which gets tested first. With normal .doc file, this test succeeds, the type detection ends with the correct filter type detected 'MS Word 97'. 2) But with Large-Word.doc, this test fails, and the type detection continues down the list of all types to test. They all fail, and the last one on the list is the ascii text, which always succeeds. 3) Now, if you launch the file open dialog and select this file, manually set the file type to 'Microsoft Word 97/2000/XP/2003', select Large-Word.doc and click Open, internally it bypasses the type detection and proceeds to open the file using the pre-selected filter 'MS Word 97'. Even in this scenario, you'll get 'Read-Error. Error reading file'. The last point indicates that the correct filter for this file type fails to load this file for whatever reason. Does this make sense? At this point, I'm inclined to ask the Writer folks to see if any changes in writer's word import filter has ended up not loading this file correctly... I'm CC'ing Cedric who is our writer guru, and Caolan who is generally very knowledgeable about this sort of thing. :-) I'll go a little deeper in the type detection chain (I'm at SwIoSystem::IsFileFilter at the moment) to confirm this theory. Kohei ___ LibreOffice mailing list LibreOffice@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/libreoffice
Re: 3.6.0 regression: non-deterministic filter selection ...
On 08/15/2012 11:05 AM, Kohei Yoshida wrote: I'll go a little deeper in the type detection chain (I'm at SwIoSystem::IsFileFilter at the moment) to confirm this theory. I've now deduced it to aHdr.Load( *pStream ) (aHdr being StgHeader) in Storage::IsStorageFile() failing with Large-Word.doc, whereas this succeeds with normal .doc files. Kohei -- Kohei Yoshida, LibreOffice hacker, Calc ___ LibreOffice mailing list LibreOffice@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/libreoffice
Re: 3.6.0 regression: non-deterministic filter selection ...
On 08/15/2012 12:02 PM, Kohei Yoshida wrote: On 08/15/2012 11:05 AM, Kohei Yoshida wrote: I'll go a little deeper in the type detection chain (I'm at SwIoSystem::IsFileFilter at the moment) to confirm this theory. I've now deduced it to aHdr.Load( *pStream ) (aHdr being StgHeader) in Storage::IsStorageFile() failing with Large-Word.doc, whereas this succeeds with normal .doc files. Here is a link to the method (for your convenience) http://opengrok.libreoffice.org/xref/core/sot/source/sdstor/stg.cxx#344 The culprit is on line 352, where bRet = 0 though it should be bRet = 1. -- Kohei Yoshida, LibreOffice hacker, Calc ___ LibreOffice mailing list LibreOffice@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/libreoffice
Re: 3.6.0 regression: non-deterministic filter selection ...
On 08/15/2012 12:05 PM, Kohei Yoshida wrote: On 08/15/2012 12:02 PM, Kohei Yoshida wrote: On 08/15/2012 11:05 AM, Kohei Yoshida wrote: I'll go a little deeper in the type detection chain (I'm at SwIoSystem::IsFileFilter at the moment) to confirm this theory. I've now deduced it to aHdr.Load( *pStream ) (aHdr being StgHeader) in Storage::IsStorageFile() failing with Large-Word.doc, whereas this succeeds with normal .doc files. Here is a link to the method (for your convenience) http://opengrok.libreoffice.org/xref/core/sot/source/sdstor/stg.cxx#344 The culprit is on line 352, where bRet = 0 though it should be bRet = 1. Ah. Now I know. Ultimately the failure is due to StgHeader::Check() returning false on Large-Word.doc. http://opengrok.libreoffice.org/xref/core/sot/source/sdstor/stgelem.cxx#186 Changing this method to always return true leads to Writer loading the file correctly. Now, what would be the right fix for this? I'm not sure Kohei -- Kohei Yoshida, LibreOffice hacker, Calc ___ LibreOffice mailing list LibreOffice@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/libreoffice
Re: 3.6.0 regression: non-deterministic filter selection ...
On 08/15/2012 12:12 PM, Kohei Yoshida wrote: Ah. Now I know. Ultimately the failure is due to StgHeader::Check() returning false on Large-Word.doc. http://opengrok.libreoffice.org/xref/core/sot/source/sdstor/stgelem.cxx#186 Changing this method to always return true leads to Writer loading the file correctly. And the check that fails is this: ( nMasterChain == -2 || ( nMasterChain =0 nMaster 109 ) ) where nMasterChain == 24269 and nMaster == 2. Since nMaster 109 is false, it returns false, and the whole thing starts to crumble... -- Kohei Yoshida, LibreOffice hacker, Calc ___ LibreOffice mailing list LibreOffice@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/libreoffice
Re: 3.6.0 regression: non-deterministic filter selection ...
On 08/15/2012 12:26 PM, Kohei Yoshida wrote: On 08/15/2012 12:12 PM, Kohei Yoshida wrote: Ah. Now I know. Ultimately the failure is due to StgHeader::Check() returning false on Large-Word.doc. http://opengrok.libreoffice.org/xref/core/sot/source/sdstor/stgelem.cxx#186 Changing this method to always return true leads to Writer loading the file correctly. And the check that fails is this: ( nMasterChain == -2 || ( nMasterChain =0 nMaster 109 ) ) where nMasterChain == 24269 and nMaster == 2. Since nMaster 109 is false, it returns false, and the whole thing starts to crumble... So, we have a choice to make. This patch diff --git a/sot/source/sdstor/stgelem.cxx b/sot/source/sdstor/stgelem.cxx index 5fb3a09..dfcc28a 100644 --- a/sot/source/sdstor/stgelem.cxx +++ b/sot/source/sdstor/stgelem.cxx @@ -194,7 +194,7 @@ sal_Bool StgHeader::Check() nTOCstrm = 0 nThreshold 0 ( nDataFAT == -2 || ( nDataFAT = 0 nDataFATSize 0 ) ) - ( nMasterChain == -2 || ( nMasterChain =0 nMaster 109 ) ) + ( nMasterChain == -2 || nMasterChain =0 ) nMaster = 0; } which removes the check for nMaster being greater than 109 fixes this load issue. The documentation: http://msdn.microsoft.com/en-us/library/dd941946%28v=prot.13%29 basically says that number represents the number of DIFAT sectors that follows, and the number of the DIFAT sectors is always 109. That's probably where the check came from (though strictly the check should be nMaster == 109 instead of nMaster 109). But we always load 109 DIFAT sectors regardless of the value of nMaster, and nMaster *may* actually represents the number of *used* DIFAT sectors... Who knows? The documentation is not very clear about this. IMO, it's safe to drop this check since it doesn't do anything significant anyway, and it obviously prevents legitimate Word files from being loaded. Any opinions? Kohei -- Kohei Yoshida, LibreOffice hacker, Calc ___ LibreOffice mailing list LibreOffice@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/libreoffice
Re: 3.6.0 regression: non-deterministic filter selection ...
Hi Kohei, On 15.08.2012 21:00, Kohei Yoshida wrote: On 08/15/2012 12:26 PM, Kohei Yoshida wrote: On 08/15/2012 12:12 PM, Kohei Yoshida wrote: Ah. Now I know. Ultimately the failure is due to StgHeader::Check() returning false on Large-Word.doc. http://opengrok.libreoffice.org/xref/core/sot/source/sdstor/stgelem.cxx#186 Changing this method to always return true leads to Writer loading the file correctly. And the check that fails is this: ( nMasterChain == -2 || ( nMasterChain =0 nMaster 109 ) ) where nMasterChain == 24269 and nMaster == 2. Since nMaster 109 is false, it returns false, and the whole thing starts to crumble... So, we have a choice to make. This patch diff --git a/sot/source/sdstor/stgelem.cxx b/sot/source/sdstor/stgelem.cxx index 5fb3a09..dfcc28a 100644 --- a/sot/source/sdstor/stgelem.cxx +++ b/sot/source/sdstor/stgelem.cxx @@ -194,7 +194,7 @@ sal_Bool StgHeader::Check() nTOCstrm = 0 nThreshold 0 ( nDataFAT == -2 || ( nDataFAT = 0 nDataFATSize 0 ) ) - ( nMasterChain == -2 || ( nMasterChain =0 nMaster 109 ) ) + ( nMasterChain == -2 || nMasterChain =0 ) nMaster = 0; } which removes the check for nMaster being greater than 109 fixes this load issue. The documentation: http://msdn.microsoft.com/en-us/library/dd941946%28v=prot.13%29 basically says that number represents the number of DIFAT sectors that follows, and the number of the DIFAT sectors is always 109. That's probably where the check came from (though strictly the check should be nMaster == 109 instead of nMaster 109). But we always load 109 DIFAT sectors regardless of the value of nMaster, and nMaster *may* actually represents the number of *used* DIFAT sectors... Who knows? The documentation is not very clear about this. first 109 sectors are in the header anyway and nMaster are in FAT (if needed). nMaster should not be checked for 109. http://msdn.microsoft.com/en-us/library/dd941958%28v=prot.13%29 Best regards, Ivan ___ LibreOffice mailing list LibreOffice@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/libreoffice
Re: 3.6.0 regression: non-deterministic filter selection ...
On 08/15/2012 01:19 PM, Ivan Timofeev wrote: first 109 sectors are in the header anyway and nMaster are in FAT (if needed). nMaster should not be checked for 109. Hi Ivan, Thanks. Pushed the change to master. Kohei -- Kohei Yoshida, LibreOffice hacker, Calc ___ LibreOffice mailing list LibreOffice@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/libreoffice
Re: 3.6.0 regression: non-deterministic filter selection ...
On Tue, Aug 14, 2012 at 3:15 PM, Michael Meeks michael.me...@suse.com wrote: Thoughts ? It appears that the type detection asks SwFilterDetect::detect() to detect its type, and it returns empty handed, which eventually leads to it being detected as ascii text. I've tried a simple Word document I created from Word XP, and the same code detects the file type just fine. I'm right now looking in to see why SwFilterDetect::detect() fails to detect your particular document. Kohei ___ LibreOffice mailing list LibreOffice@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/libreoffice