Re: 3.6.0 regression: non-deterministic filter selection ...

2012-08-15 Thread Tor Lillqvist
 Tor - I had a feeling you were looking at some (MathML ?) detection
 issue that may be related to this (and/or fixed it already);

Don't think so; the problem I looked at was caused by the MathML
filter detecting *any* XML file as MathML, but then obviously
failing to actually load it (and after a load attempt has failed there
is no loop back to try other filters). In this case the user *wanted*
the ASCII filter to be the one invoked. (Surely there must be better
arbitrary-XML editors than Writer, that at least make sure XML start
and end tags match etc automatically, but oh well).

--tml
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice


Re: 3.6.0 regression: non-deterministic filter selection ...

2012-08-15 Thread Kohei Yoshida
On Wed, Aug 15, 2012 at 12:40 AM, Kohei Yoshida kyosh...@novell.com wrote:
 On Tue, Aug 14, 2012 at 3:15 PM, Michael Meeks michael.me...@suse.com wrote:

 Thoughts ?

 It appears that the type detection asks SwFilterDetect::detect() to
 detect its type, and it returns empty handed, which eventually leads
 to it being detected as ascii text.

 I've tried a simple Word document I created from Word XP, and the same
 code detects the file type just fine.  I'm right now looking in to see
 why SwFilterDetect::detect() fails to detect your particular document.

I'm starting to suspect that maybe it's the filter itself failing to
parse this document, rather than the type detection system failing to
detect it.

Here is my reasoning:

1) Type detection correctly prioritize the writer_MS_Word_97 file type
at the top of the list, which gets tested first.  With normal .doc
file, this test succeeds, the type detection ends with the correct
filter type detected 'MS Word 97'.

2) But with Large-Word.doc, this test fails, and the type detection
continues down the list of all types to test.  They all fail, and the
last one on the list is the ascii text, which always succeeds.

3) Now, if you launch the file open dialog and select this file,
manually set the file type to 'Microsoft Word 97/2000/XP/2003', select
Large-Word.doc and click Open, internally it bypasses the type
detection and proceeds to open the file using the pre-selected filter
'MS Word 97'.  Even in this scenario, you'll get 'Read-Error. Error
reading file'.

The last point indicates that the correct filter for this file type
fails to load this file for whatever reason.

Does this make sense?  At this point, I'm inclined to ask the Writer
folks to see if any changes in writer's word import filter has ended
up not loading this file correctly...  I'm CC'ing Cedric who is our
writer guru, and Caolan who is generally very knowledgeable about this
sort of thing. :-)

I'll go a little deeper in the type detection chain (I'm at
SwIoSystem::IsFileFilter at the moment) to confirm this theory.

Kohei
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice


Re: 3.6.0 regression: non-deterministic filter selection ...

2012-08-15 Thread Kohei Yoshida

On 08/15/2012 11:05 AM, Kohei Yoshida wrote:

I'll go a little deeper in the type detection chain (I'm at
SwIoSystem::IsFileFilter at the moment) to confirm this theory.


I've now deduced it to aHdr.Load( *pStream ) (aHdr being StgHeader) in 
Storage::IsStorageFile() failing with Large-Word.doc, whereas this 
succeeds with normal .doc files.


Kohei

--
Kohei Yoshida, LibreOffice hacker, Calc
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice


Re: 3.6.0 regression: non-deterministic filter selection ...

2012-08-15 Thread Kohei Yoshida

On 08/15/2012 12:02 PM, Kohei Yoshida wrote:

On 08/15/2012 11:05 AM, Kohei Yoshida wrote:

I'll go a little deeper in the type detection chain (I'm at
SwIoSystem::IsFileFilter at the moment) to confirm this theory.


I've now deduced it to aHdr.Load( *pStream ) (aHdr being StgHeader) in
Storage::IsStorageFile() failing with Large-Word.doc, whereas this
succeeds with normal .doc files.


Here is a link to the method (for your convenience)
http://opengrok.libreoffice.org/xref/core/sot/source/sdstor/stg.cxx#344

The culprit is on line 352, where bRet = 0 though it should be bRet = 1.

--
Kohei Yoshida, LibreOffice hacker, Calc
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice


Re: 3.6.0 regression: non-deterministic filter selection ...

2012-08-15 Thread Kohei Yoshida

On 08/15/2012 12:05 PM, Kohei Yoshida wrote:

On 08/15/2012 12:02 PM, Kohei Yoshida wrote:

On 08/15/2012 11:05 AM, Kohei Yoshida wrote:

I'll go a little deeper in the type detection chain (I'm at
SwIoSystem::IsFileFilter at the moment) to confirm this theory.


I've now deduced it to aHdr.Load( *pStream ) (aHdr being StgHeader) in
Storage::IsStorageFile() failing with Large-Word.doc, whereas this
succeeds with normal .doc files.


Here is a link to the method (for your convenience)
http://opengrok.libreoffice.org/xref/core/sot/source/sdstor/stg.cxx#344

The culprit is on line 352, where bRet = 0 though it should be bRet = 1.


Ah.  Now I know.  Ultimately the failure is due to StgHeader::Check() 
returning false on Large-Word.doc.


http://opengrok.libreoffice.org/xref/core/sot/source/sdstor/stgelem.cxx#186

Changing this method to always return true leads to Writer loading the 
file correctly.


Now, what would be the right fix for this?  I'm not sure

Kohei

--
Kohei Yoshida, LibreOffice hacker, Calc
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice


Re: 3.6.0 regression: non-deterministic filter selection ...

2012-08-15 Thread Kohei Yoshida

On 08/15/2012 12:12 PM, Kohei Yoshida wrote:

Ah.  Now I know.  Ultimately the failure is due to StgHeader::Check()
returning false on Large-Word.doc.

http://opengrok.libreoffice.org/xref/core/sot/source/sdstor/stgelem.cxx#186

Changing this method to always return true leads to Writer loading the
file correctly.


And the check that fails is this:

( nMasterChain == -2 || ( nMasterChain =0  nMaster  109 ) )

where nMasterChain == 24269 and nMaster == 2.  Since nMaster  109 is 
false, it returns false, and the whole thing starts to crumble...


--
Kohei Yoshida, LibreOffice hacker, Calc
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice


Re: 3.6.0 regression: non-deterministic filter selection ...

2012-08-15 Thread Kohei Yoshida

On 08/15/2012 12:26 PM, Kohei Yoshida wrote:

On 08/15/2012 12:12 PM, Kohei Yoshida wrote:

Ah.  Now I know.  Ultimately the failure is due to StgHeader::Check()
returning false on Large-Word.doc.

http://opengrok.libreoffice.org/xref/core/sot/source/sdstor/stgelem.cxx#186


Changing this method to always return true leads to Writer loading the
file correctly.


And the check that fails is this:

( nMasterChain == -2 || ( nMasterChain =0  nMaster  109 ) )

where nMasterChain == 24269 and nMaster == 2.  Since nMaster  109 is
false, it returns false, and the whole thing starts to crumble...


So, we have a choice to make.  This patch

diff --git a/sot/source/sdstor/stgelem.cxx b/sot/source/sdstor/stgelem.cxx
index 5fb3a09..dfcc28a 100644
--- a/sot/source/sdstor/stgelem.cxx
+++ b/sot/source/sdstor/stgelem.cxx
@@ -194,7 +194,7 @@ sal_Bool StgHeader::Check()
  nTOCstrm = 0
  nThreshold  0
  ( nDataFAT == -2 || ( nDataFAT = 0  nDataFATSize  0 ) )
- ( nMasterChain == -2 || ( nMasterChain =0  nMaster  
109 ) )

+ ( nMasterChain == -2 || nMasterChain =0 )
  nMaster = 0;
 }

which removes the check for nMaster being greater than 109 fixes this 
load issue.


The documentation:

http://msdn.microsoft.com/en-us/library/dd941946%28v=prot.13%29

basically says that number represents the number of DIFAT sectors that 
follows, and the number of the DIFAT sectors is always 109.  That's 
probably where the check came from (though strictly the check should be 
nMaster == 109 instead of nMaster  109).


But we always load 109 DIFAT sectors regardless of the value of nMaster, 
and nMaster *may* actually represents the number of *used* DIFAT 
sectors... Who knows?  The documentation is not very clear about this.


IMO, it's safe to drop this check since it doesn't do anything 
significant anyway, and it obviously prevents legitimate Word files from 
being loaded.


Any opinions?

Kohei

--
Kohei Yoshida, LibreOffice hacker, Calc
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice


Re: 3.6.0 regression: non-deterministic filter selection ...

2012-08-15 Thread Ivan Timofeev

Hi Kohei,

On 15.08.2012 21:00, Kohei Yoshida wrote:

On 08/15/2012 12:26 PM, Kohei Yoshida wrote:

On 08/15/2012 12:12 PM, Kohei Yoshida wrote:

Ah.  Now I know.  Ultimately the failure is due to StgHeader::Check()
returning false on Large-Word.doc.

http://opengrok.libreoffice.org/xref/core/sot/source/sdstor/stgelem.cxx#186



Changing this method to always return true leads to Writer loading the
file correctly.


And the check that fails is this:

( nMasterChain == -2 || ( nMasterChain =0  nMaster  109 ) )

where nMasterChain == 24269 and nMaster == 2.  Since nMaster  109 is
false, it returns false, and the whole thing starts to crumble...


So, we have a choice to make.  This patch

diff --git a/sot/source/sdstor/stgelem.cxx b/sot/source/sdstor/stgelem.cxx
index 5fb3a09..dfcc28a 100644
--- a/sot/source/sdstor/stgelem.cxx
+++ b/sot/source/sdstor/stgelem.cxx
@@ -194,7 +194,7 @@ sal_Bool StgHeader::Check()
   nTOCstrm = 0
   nThreshold  0
   ( nDataFAT == -2 || ( nDataFAT = 0  nDataFATSize  0
) )
- ( nMasterChain == -2 || ( nMasterChain =0  nMaster 
109 ) )
+ ( nMasterChain == -2 || nMasterChain =0 )
   nMaster = 0;
  }

which removes the check for nMaster being greater than 109 fixes this
load issue.

The documentation:

http://msdn.microsoft.com/en-us/library/dd941946%28v=prot.13%29

basically says that number represents the number of DIFAT sectors that
follows, and the number of the DIFAT sectors is always 109.  That's
probably where the check came from (though strictly the check should be
nMaster == 109 instead of nMaster  109).

But we always load 109 DIFAT sectors regardless of the value of nMaster,
and nMaster *may* actually represents the number of *used* DIFAT
sectors... Who knows?  The documentation is not very clear about this.


first 109 sectors are in the header anyway and nMaster are in FAT (if 
needed). nMaster should not be checked for 109.


http://msdn.microsoft.com/en-us/library/dd941958%28v=prot.13%29

Best regards,
Ivan
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice


Re: 3.6.0 regression: non-deterministic filter selection ...

2012-08-15 Thread Kohei Yoshida

On 08/15/2012 01:19 PM, Ivan Timofeev wrote:


first 109 sectors are in the header anyway and nMaster are in FAT (if
needed). nMaster should not be checked for 109.


Hi Ivan,

Thanks. Pushed the change to master.

Kohei

--
Kohei Yoshida, LibreOffice hacker, Calc
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice


Re: 3.6.0 regression: non-deterministic filter selection ...

2012-08-14 Thread Kohei Yoshida
On Tue, Aug 14, 2012 at 3:15 PM, Michael Meeks michael.me...@suse.com wrote:

 Thoughts ?

It appears that the type detection asks SwFilterDetect::detect() to
detect its type, and it returns empty handed, which eventually leads
to it being detected as ascii text.

I've tried a simple Word document I created from Word XP, and the same
code detects the file type just fine.  I'm right now looking in to see
why SwFilterDetect::detect() fails to detect your particular document.

Kohei
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice