Re: [REVIEW 3-5] fdo#47644 performance regression on largish .doc

2012-05-11 Thread Michael Meeks

On Thu, 2012-05-10 at 21:56 +0100, Caolán McNamara wrote:
> with a scenario of nBgn = 11, nRel = 2, desired result is 13

Ah ! how stupid of me, I missed that - but this case is almost totally
pointless ;-) we are just victims of this optimisation of starting in
the middle - which means we have to go look up where we are, when we
actually precisely know our offset from the beginning of the chain/array
( which is nNew/nPageSize ).
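
(Purely as an illustrative aside with made-up numbers: with the usual 512
byte pages, a page-aligned byte position of 5120 is simply entry
5120 / 512 = 10 in the flat page vector - no FAT walk needed at all.)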

So - how about the attached diff, hopefully rather easier to review &
back-port :-) should I knock a version up for -3-5 ? the #ifdef bravery
is never triggered if enabled (FWIW) during import / export of the
slow .doc anyhow.

All the best,

Michael.

-- 
michael.me...@suse.com  <><, Pseudo Engineer, itinerant idiot
diff --git a/sot/source/sdstor/stgstrms.cxx b/sot/source/sdstor/stgstrms.cxx
index 88bc416..67da8ee 100644
--- a/sot/source/sdstor/stgstrms.cxx
+++ b/sot/source/sdstor/stgstrms.cxx
@@ -334,27 +334,12 @@ void StgStrm::SetEntry( StgDirEntry& r )
     r.SetDirty();
 }
 
-namespace lcl
-{
-#if defined(__GXX_EXPERIMENTAL_CXX0X__) || __cplusplus >= 201103L
-using std::is_sorted;
-#else
-template<typename iter> bool is_sorted(iter aStart, iter aEnd)
-{
-    if (aStart == aEnd)
-        return true;
-
-    for (iter aNext = aStart + 1; aNext != aEnd; aStart = aNext, ++aNext)
-    {
-        if (*aNext < *aStart)
-            return false;
-    }
-
-    return true;
-}
-#endif
-}
-
+/*
+ * The page chain is basically a singly linked list of slots, each
+ * pointing to the next page. Instead of traversing the file structure
+ * for this each time, build a simple flat in-memory vector of the
+ * pages.
+ */
 bool StgStrm::buildPageChainCache()
 {
     if (nSize > 0)
@@ -370,10 +355,6 @@ bool StgStrm::buildPageChainCache()
             return false;
     }
 
-    m_bSortedPageChain = lcl::is_sorted(m_aPagesCache.begin(), m_aPagesCache.end());
-
-    SAL_WARN_IF(!m_bSortedPageChain, "sot", "unsorted page chain, that's suspicious");
-
     return true;
 }
 
@@ -403,6 +384,35 @@ sal_Bool StgStrm::Pos2Page( sal_Int32 nBytePos )
     nPos = nBytePos;
     if( nOld == nNew )
         return sal_True;
+
+    if (m_aPagesCache.empty() && nNew > ARBITRARY_LARGE_AMOUNT_OF_PAGES)
+    {
+        SAL_WARN("sot", "kicking off large seek helper\n");
+        buildPageChainCache();
+    }
+
+    if (!m_aPagesCache.empty())
+    {
+        size_t nIdx = nNew / nPageSize;
+
+#ifdef BRAVE_BUT_UN_NECESSARY
+        // special case: seek to 1st byte of new, unallocated page
+        // (in case the file size is a multiple of the page size)
+        if( nBytePos == nSize && !nOffset )
+        {
+            nIdx--;
+            nOffset = nPageSize;
+        }
+#endif
+
+        if (nIdx < m_aPagesCache.size())
+        {
+            nPage = m_aPagesCache[ nIdx ];
+            //fprintf (stderr, "got page ! %d offset %d\n", (int)nPage, (int)nOffset);
+            return sal_Bool( nPage >= 0 );
+        }
+    }
+
     if( nNew > nOld )
     {
         // the new position is after the current, so an incremental
@@ -420,62 +430,19 @@ sal_Bool StgStrm::Pos2Page( sal_Int32 nBytePos )
     // now, traverse the FAT chain.
     nRel /= nPageSize;
 
-    sal_Int32 nLast = STG_EOF;
-    if (m_aPagesCache.empty() && nRel < ARBITRARY_LARGE_AMOUNT_OF_PAGES)
-    {
+    sal_Int32 nLast = STG_EOF;
     while (nRel && nBgn >= 0)
     {
         nLast = nBgn;
         nBgn = pFat->GetNextPage( nBgn );
         nRel--;
     }
-    }
-    else if (nBgn >= 0)
-    {
-        //Seeking large distances is slow, so if we're starting seeking (some
-        //fairly arbitrary) large distances, build a cache and re-use it for
-        //subsequent seeks
-        if (m_aPagesCache.empty())
-        {
-            SAL_WARN("sot", "kicking off large seek helper\n");
-            buildPageChainCache();
-        }
-
-        std::vector<sal_Int32>::iterator aI;
-
-        if (m_bSortedPageChain)
-            aI = std::lower_bound(m_aPagesCache.begin(), m_aPagesCache.end(), nBgn);
-        else
-            aI = std::find(m_aPagesCache.begin(), m_aPagesCache.end(), nBgn);
-
-        if (aI == m_aPagesCache.end())
-        {
-            SAL_WARN("sot", "Unknown page position");
-            nBgn = STG_EOF;
-        }
-        else
-        {
-            size_t nBgnDistance = std::distance(m_aPagesCache.begin(), aI);
-
-            size_t nIndex = nBgnDistance + nRel;
-
-            if (nIndex >= m_aPagesCache.size())
-            {
-                nRel = m_aPagesCache.size() - nBgnDistance;
-                nIndex = m_aPagesCache.size() - 1;
-            }
-            else
-                nRel = 0;
-
-            nLast = nIndex ? m_aPagesCache[nIndex - 1] : STG_EOF;
-            nBgn = m_aPagesCache[nIndex];
-        }
-    }
 
     // special case: seek to 1st byte of new, unallocated page
     // (in case the file size is a multiple of the page size)
 

Re: [REVIEW 3-5] fdo#47644 performance regression on largish .doc

2012-05-11 Thread Caolán McNamara
On Fri, 2012-05-11 at 10:09 +0100, Michael Meeks wrote:
> we actually precisely know our offset from the beginning of the chain/array
> ( which is nNew/nPageSize ).

hmm, that escaped me entirely. Been trying to find a loophole that
invalidates that, e.g. 10 20 30 40 where 20 is the first entry. But
anything like that seems to be covered in that the ::Inits pass the
first entry, which ends up as nStart, and we build the chain only from
there. So I can't see any reason it wouldn't work.

> So - how about the attached diff, hopefully rather easier to review &
> back-port :-)

Seems sane, though it probably makes sense to tweak the trigger for
calling buildPageChainCache as the units for nNew there are in bytes
while the define is in pages. And we might want to avoid calling it in
the case of a small relative jump in pages forward.
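
(A purely illustrative sketch of the kind of adjustment meant here, reusing
the names from the patch above - compare a page count rather than a byte
position against the page-denominated threshold:

    if (m_aPagesCache.empty()
        && nNew / nPageSize > ARBITRARY_LARGE_AMOUNT_OF_PAGES)
    {
        buildPageChainCache();
    }

The exact threshold, and whether to also skip it for small forward jumps,
is left open as above.)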

C.



[REVIEW 3-5] fdo#47644 performance regression on largish .doc

2012-05-10 Thread Caolán McNamara
fdo#47644 big 18 meg .doc is super dooper slow to load, mostly because
of the extra checks I added to sanity test the .doc file during parsing.
Turns out that seeking backwards in those ole2 formats is incredibly
slow; this might affect other formats like xls and ppt that use some of
the shared msfilter code.

1st attachment gets us back to where we came from in terms of
performance there.

The second one is maybe a bit more contentious for 3-5, but I include it
for a look-over. Basically there's a page chain in the ole2 format and
to seek to a location you walk through the chain until you get to the
correct page. I see that in practice most documents have their chain in
a presorted order (I have no idea if having them unsorted is actually a
sign of a broken document or if it's legal) so we can do a far faster
binary_search if we walk the chain once and hold on to the results.

So second attachment keeps the chain if it's one of the circumstances
where we have to parse the whole thing anyway, and if not (the usual
one) then, if we're seeking a (fairly arbitrary) large number of steps
where it's likely that the cost-benefit is in favour of it, generate
the chain once and hold on to it for reuse until there's any event which
might invalidate the chain.

Second patch knocks about 50 seconds off load time. First patch knocks
some unknown but > 10 mins off load time.
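
To make the shape of that concrete, here's a small standalone sketch
(illustrative only, not LibreOffice code - the FAT layout, the sentinel
value and all the names below are invented for the example) of walking a
FAT-style page chain once into a flat vector and then seeking by plain
index:

#include <cassert>
#include <cstdio>
#include <vector>

// Toy model of an OLE2-style FAT: aFat[nPage] holds the number of the next
// page in the chain, with a negative sentinel marking the end of the chain.
static const int MY_STG_EOF = -2;

// Walk the chain once from its first page into a flat vector - the moral
// equivalent of buildPageChainCache() in the patch.
static std::vector<int> buildChainCache(const std::vector<int>& rFat, int nStart)
{
    std::vector<int> aCache;
    for (int nPage = nStart; nPage >= 0; nPage = rFat[nPage])
        aCache.push_back(nPage);
    return aCache;
}

int main()
{
    // A FAT describing the four page chain 10 -> 11 -> 12 -> 13.
    std::vector<int> aFat(16, MY_STG_EOF);
    aFat[10] = 11; aFat[11] = 12; aFat[12] = 13; aFat[13] = MY_STG_EOF;

    std::vector<int> aCache = buildChainCache(aFat, 10);
    assert(aCache.size() == 4);

    // Seeking is now a plain index lookup: page nRel of the stream is just
    // aCache[nRel], instead of nRel calls to GetNextPage().
    for (size_t nRel = 0; nRel < aCache.size(); ++nRel)
        printf("stream page %d lives in FAT page %d\n", (int)nRel, aCache[nRel]);
    return 0;
}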

C.

Don't even think about measuring on a gcc dbgutil build btw; the concept
checking of the stl lower_bound is a huge cost, so the times aren't even
roughly uniformly slower on dbgutil vs product but spike dramatically on
some stl stuff.
From 1e89ba85d89858ae8c3263a4e6b5b522016fe445 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Caol=C3=A1n=20McNamara?= caol...@redhat.com
Date: Thu, 10 May 2012 15:41:17 +0100
Subject: [PATCH] Resolves: fdo#47644 compound storage backend is poor at
 knowing stream size

---
 sot/inc/sot/stg.hxx  |3 +++
 sot/inc/sot/storage.hxx  |1 +
 sot/source/sdstor/stg.cxx|7 +++
 sot/source/sdstor/storage.cxx|7 +++
 sot/source/sdstor/ucbstorage.cxx |5 +
 tools/inc/tools/stream.hxx   |3 ++-
 tools/source/stream/stream.cxx   |6 +++---
 7 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/sot/inc/sot/stg.hxx b/sot/inc/sot/stg.hxx
index c8d18da..3071017 100644
--- a/sot/inc/sot/stg.hxx
+++ b/sot/inc/sot/stg.hxx
@@ -90,6 +90,7 @@ public:
     virtual sal_uLong   Tell() = 0;
     virtual void        Flush() = 0;
     virtual sal_Bool    SetSize( sal_uLong nNewSize ) = 0;
+    virtual sal_uLong   GetSize() const = 0;
     virtual sal_Bool    CopyTo( BaseStorageStream * pDestStm ) = 0;
     virtual sal_Bool    Commit() = 0;
     virtual sal_Bool    Revert() = 0;
@@ -171,6 +172,7 @@ public:
     virtual sal_uLong   Tell() { return nPos; }
     virtual void        Flush();
     virtual sal_Bool    SetSize( sal_uLong nNewSize );
+    virtual sal_uLong   GetSize() const;
     virtual sal_Bool    CopyTo( BaseStorageStream * pDestStm );
     virtual sal_Bool    Commit();
     virtual sal_Bool    Revert();
@@ -266,6 +268,7 @@ public:
     virtual sal_uLong   Tell();
     virtual void        Flush();
     virtual sal_Bool    SetSize( sal_uLong nNewSize );
+    virtual sal_uLong   GetSize() const;
     virtual sal_Bool    CopyTo( BaseStorageStream * pDestStm );
     virtual sal_Bool    Commit();
     virtual sal_Bool    Revert();
diff --git a/sot/inc/sot/storage.hxx b/sot/inc/sot/storage.hxx
index 26258bc..4f602e9 100644
--- a/sot/inc/sot/storage.hxx
+++ b/sot/inc/sot/storage.hxx
@@ -96,6 +96,7 @@ public:
     sal_Bool                GetProperty( const String& rName, ::com::sun::star::uno::Any& rValue );
     ::com::sun::star::uno::Reference< ::com::sun::star::io::XInputStream >
                             GetXInputStream() const;
+    virtual sal_Size        remainingSize();
 };
 
 #ifndef SOT_DECL_SOTSTORAGESTREAM_DEFINED
diff --git a/sot/source/sdstor/stg.cxx b/sot/source/sdstor/stg.cxx
index 4c4be65..18e95f6 100644
--- a/sot/source/sdstor/stg.cxx
+++ b/sot/source/sdstor/stg.cxx
@@ -252,6 +252,13 @@ sal_Bool StorageStream::SetSize( sal_uLong nNewSize )
         return sal_False;
 }
 
+sal_uLong StorageStream::GetSize() const
+{
+    if( Validate() )
+        return pEntry->GetSize();
+    return 0;
+}
+
 sal_Bool StorageStream::Commit()
 {
     if( !Validate() )
diff --git a/sot/source/sdstor/storage.cxx b/sot/source/sdstor/storage.cxx
index 897f9fe..4d24f1f 100644
--- a/sot/source/sdstor/storage.cxx
+++ b/sot/source/sdstor/storage.cxx
@@ -298,6 +298,13 @@ sal_uInt32 SotStorageStream::GetSize() const
     return nSize;
 }
 
+sal_Size SotStorageStream::remainingSize()
+{
+    if (pOwnStm)
+        return pOwnStm->GetSize() - Tell();
+    return SvStream::remainingSize();
+}
+
 /*
 |*    SotStorageStream::CopyTo()
 |*
diff --git 

Re: [REVIEW 3-5] fdo#47644 performance regression on largish .doc

2012-05-10 Thread Michael Meeks
Hi there,

On Thu, 2012-05-10 at 16:06 +0100, Caolán McNamara wrote:
> fdo#47644 big 18 meg .doc is super dooper slow to load, mostly because
> of the extra checks I added to sanity test the .doc file during parsing.

:-)

> 1st attachment gets us back to where we came from in terms of
> performance there.

Great - I've pushed that to libreoffice-3-5.

> The second one is maybe a bit more contentious for 3-5, but I include it
> for a look-over. Basically there's a page chain in the ole2 format and
> to seek to a location you walk through the chain until you get to the
> correct page.

Yep - it's a FAT structure; nasty one-slot linked list.

> I see that in practice most documents have their chain in
> a presorted order (I have no idea if having them unsorted is actually a
> sign of a broken document or if it's legal)

It is legal, but it's highly unlikely.

> So second attachment keeps the chain if it's one of the circumstances
> where we have to parse the whole thing anyway, and if not (the usual
> one) then, if we're seeking a (fairly arbitrary) large number of steps
> where it's likely that the cost-benefit is in favour of it, generate
> the chain once and hold on to it for reuse until there's any event which
> might invalidate the chain.

So - I'm a bit lost with the second patch. The FAT is essentially a
linked list; and we walk it to find the block we want, so:

nRel = 3; -> go three steps along the linked list.
A -> B -> C == C ...

Your pageCache looks like the right approach, and turns this into a
simple vector we can rapidly look up:

[0] = A, [1] = B, [2] = C

And so on - hopefully I'm there so far :-) With the 512 byte block size
we expect, that turns a 100Mb OLE2 stream into a 800k vector (which is
just fine).
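
(For the record, the arithmetic behind that: 100Mb / 512 bytes per page is
roughly 200,000 pages, and at 4 bytes per sal_Int32 entry that is roughly
800k of vector.)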

> Second patch knocks about 50 seconds off load time. First patch knocks
> some unknown but > 10 mins off load time.

My question is thus about:

[snip]
std::vector<sal_Int32>::iterator aI;

if (m_bSortedPageChain)
    aI = std::lower_bound(m_aPagesCache.begin(), m_aPagesCache.end(), nBgn);
else
    aI = std::find(m_aPagesCache.begin(), m_aPagesCache.end(), nBgn);

if (aI == m_aPagesCache.end())
{
    SAL_WARN("sot", "Unknown page position");
    nBgn = STG_EOF;
}
else
{
    size_t nBgnDistance = std::distance(m_aPagesCache.begin(), aI);

    size_t nIndex = nBgnDistance + nRel;

    if (nIndex >= m_aPagesCache.size())
    {
        nRel = m_aPagesCache.size() - nBgnDistance;
        nIndex = m_aPagesCache.size() - 1;
    }
    else
        nRel = 0;

    nLast = nIndex ? m_aPagesCache[nIndex - 1] : STG_EOF;
    nBgn = m_aPagesCache[nIndex];
}
[/snip]

Surely that should read:

nBgn = m_aPageCache[nRel];

or did I go a bit crazy ? [ I'm trying to turn the above into that
mentally and giving up ;-].

Otherwise the clearing of the cache looks sensible; though the:

StgFATStrm::SetPage(); should prolly just update that vector as well as
doing the nasty fat walk so:

m_aPageCache[nBlocks] = nNewPage;

or somesuch (assuming it is long enough for that).

Otherwise it looks really lovely.

Thanks !

Michael.

-- 
michael.me...@suse.com  <><, Pseudo Engineer, itinerant idiot



Re: [REVIEW 3-5] fdo#47644 performance regression on largish .doc

2012-05-10 Thread Caolán McNamara
On Thu, 2012-05-10 at 17:37 +0100, Michael Meeks wrote:
> Surely that should read:
> 
>     nBgn = m_aPageCache[nRel];
> 
> or did I go a bit crazy ? [ I'm trying to turn the above into that
> mentally and giving up ;-].

how about a 4 element chain of 10 11 12 13

with a scenario of nBgn = 11, nRel = 2, desired result is 13

so nBgn of 11 gives a nBgnDistance of 1
so desired nIndex is 3, nBgnDistance (1) + nRel (2);

so nBgn = m_aPagesCache[nIndex]; and set nRel to 0 to indicate
successfully stepped to desired page

with an alternative scenario of nBgn = 11, nRel = 8
get nBgnDistance of 1 again, so desired nIndex of 9, clip it as it's out
of range to the max of 3, reduce nRel by how many valid steps we took (2)
to give 6 un-achievable steps... oops, my out-of-bounds code is all
wrong :-) fixed with 75e7643e22bcf674739ca890bfc06f80e872624b
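
For what it's worth, the clamping arithmetic of that second scenario as a
tiny standalone check (illustrative only; this isn't the code of the commit
above, just the same numbers):

#include <cstdio>
#include <vector>

int main()
{
    // The flat cache for the chain 10 -> 11 -> 12 -> 13, as in the example.
    std::vector<int> aPagesCache = { 10, 11, 12, 13 };

    size_t nBgnDistance = 1;                    // nBgn == 11 lives at index 1
    size_t nRel = 8;                            // we want to step 8 pages on
    size_t nIndex = nBgnDistance + nRel;        // 9, which is off the end
    size_t nLastValid = aPagesCache.size() - 1; // 3, the last cached entry

    if (nIndex > nLastValid)
    {
        nRel = nIndex - nLastValid;  // 6 steps the cache could not cover
        nIndex = nLastValid;         // stop on the last cached page (13)
    }
    else
        nRel = 0;                    // target reached purely via the cache

    printf("landed on page %d with %d steps still to walk\n",
           aPagesCache[nIndex], (int)nRel);
    return 0;
}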

> StgFATStrm::SetPage(); should prolly just update that vector as well as
> doing the nasty fat walk so:
> 
>     m_aPageCache[nBlocks] = nNewPage;

would have to check around the insertion point whether the sort order is
still valid afterwards I guess; throwing away everything is my current
cowardly conservative approach on any modifications
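
(Illustrative only: the cowardly conservative route amounts to little more
than something like

    // any write that can change the page chain: don't try to patch the
    // cache up, just throw it away and let the next large seek rebuild it
    m_aPagesCache.clear();

in every path that modifies the FAT, with buildPageChainCache() lazily
repopulating it later.)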

C.
