[Libreoffice] [PATCH] Fix for bug / feature request 30550 - Character count without spaces

2010-10-27 Thread Mattias Johnsson
Hi all,

Here are my patches for the easy hack / programming task "Count
characters without whitespace in the Writer statistics". Since it's
something translators have apparently been asking for OO.org to have
for eight years (see
http://www.openoffice.org/issues/show_bug.cgi?id=10356 and
https://bugs.freedesktop.org/show_bug.cgi?id=30550) it'd be nice for
it to appear in LibreOffice :-)

I've added an extra couple of lines to the word count dialog box which
give the number of characters excluding whitespace in both selected
text and the entire document. As far as the UI decision goes, I
checked in MS Word, and that's what it does, so I figure if it's good
enough for MS it's good enough for us.
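
For illustration only, here is a minimal standalone sketch of the two numbers the dialog would show. It is not the Writer code: `CharCounts`, `isSketchWhitespace` and `countChars` are hypothetical names, and the whitespace set (space, tab, CR, LF) is an assumption rather than the list the patch actually uses.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the two character counters; not the real SwDocStat.
struct CharCounts
{
    std::size_t nChar = 0;                 // every character, spaces included
    std::size_t nCharExcludingSpaces = 0;  // characters that are not whitespace
};

// Assumption: only space, tab, CR and LF count as whitespace in this sketch.
inline bool isSketchWhitespace(char32_t c)
{
    return c == U' ' || c == U'\t' || c == U'\n' || c == U'\r';
}

// Walks the text once and fills in both counters.
inline CharCounts countChars(const std::u32string& text)
{
    CharCounts counts;
    for (char32_t c : text)
    {
        ++counts.nChar;
        if (!isSketchWhitespace(c))
            ++counts.nCharExcludingSpaces;
    }
    return counts;
}
```

Calling `countChars(U"a b  c")` would give 6 characters in total and 3 excluding whitespace.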

Note: I'm still getting started with the LibreOffice code base, and
I'm not entirely certain what I'm doing. For example, I have no idea
about what is supposed to happen with regard to internationalisation,
or whether this also works under Windows, given that it affects the
UI.

Still, it builds and works under Ubuntu 10.10 x86_64. I've tested it
with a number of documents and it seems to give the right answers
(with a caveat, see below), and the answers with respect to the
standard word/character count are the same as before, so I at least
haven't broken anything. If my patches aren't up to scratch, hopefully
other people can at least use them as a starting point.

Now the caveat. There seems to be a bug, in that at least one document
(www.oasis-open.org/committees/download.php/25054/07-08-22-MetaData-Examples.odt
) gives the wrong word/character count if you open it and check the
document statistics. However if you edit it at all, such as adding a
character, and then check the statistics, they're then correct. It's
as though the load doesn't mark the word count as dirty or something.
Documents I create, save and open seem to work fine.

The reason I'm submitting these patches despite this bug is that that
bug was present before I made my changes. I just pulled the latest
git, built and checked to make sure. So my feature change works, and
moves things forward, but doesn't fix this original bug, which I found
while testing my changes. Note that OpenOffice.org 3.2 (Ubuntu 10.10
x86_64 repository version) doesn't seem to have the bug.

It's not immediately obvious to me how to fix it, but hopefully it'll
be blindingly obvious to someone else.

Please examine, test, and tell me if I've done stuff that's horribly wrong :-P

Patches contributed under MPL 1.1 / GPLv3+ / LGPLv3+ licenses.

Cheers,
Mattias
From 5ac50b845feab1ab1901cd52593237c3676e097b Mon Sep 17 00:00:00 2001
From: Mattias Johnsson <m.t.johns...@gmail.com>
Date: Wed, 27 Oct 2010 18:01:43 +1100
Subject: [PATCH] Add character count exclusive of whitespace to document statistics part 1

---
 sw/inc/docstat.hxx                      |    1 +
 sw/inc/ndtxt.hxx                        |    2 +
 sw/source/core/doc/docstat.cxx          |    2 +
 sw/source/core/txtnode/txtedt.cxx       |  115 +-
 sw/source/ui/dialog/wordcountdialog.cxx |    6 ++
 sw/source/ui/dialog/wordcountdialog.hrc |   30 +
 sw/source/ui/dialog/wordcountdialog.src |   42 +---
 sw/source/ui/inc/wordcountdialog.hxx    |    4 +
 8 files changed, 130 insertions(+), 72 deletions(-)

diff --git a/sw/inc/docstat.hxx b/sw/inc/docstat.hxx
index a818e2f..8b156bf 100644
--- a/sw/inc/docstat.hxx
+++ b/sw/inc/docstat.hxx
@@ -44,6 +44,7 @@ struct SW_DLLPUBLIC SwDocStat
 ULONG   nAllPara;
 ULONG			nWord;
 ULONG			nChar;
+ULONG			nCharExcludingSpaces;
 BOOL			bModified;
 
 SwDocStat();
diff --git a/sw/inc/ndtxt.hxx b/sw/inc/ndtxt.hxx
index 08410b0..713a30b 100644
--- a/sw/inc/ndtxt.hxx
+++ b/sw/inc/ndtxt.hxx
@@ -189,6 +189,8 @@ class SW_DLLPUBLIC SwTxtNode: public SwCntntNode, public ::sfx2::Metadatable
 SW_DLLPRIVATE ULONG GetParaNumberOfWords() const;
 SW_DLLPRIVATE void SetParaNumberOfChars( ULONG nTmpChars ) const;
 SW_DLLPRIVATE ULONG GetParaNumberOfChars() const;
+SW_DLLPRIVATE void SetParaNumberOfCharsExcludingSpaces( ULONG nTmpChars ) const;
+SW_DLLPRIVATE ULONG GetParaNumberOfCharsExcludingSpaces() const;
 SW_DLLPRIVATE void InitSwParaStatistics( bool bNew );
 
 /** create number for this text node, if not already existing
diff --git a/sw/source/core/doc/docstat.cxx b/sw/source/core/doc/docstat.cxx
index b75a057..e2bef7f 100644
--- a/sw/source/core/doc/docstat.cxx
+++ b/sw/source/core/doc/docstat.cxx
@@ -46,6 +46,7 @@ SwDocStat::SwDocStat() :
 nAllPara(1),
 nWord(0),
 nChar(0),
+nCharExcludingSpaces(0),
 bModified(TRUE)
 {}
 
@@ -63,6 +64,7 @@ void SwDocStat::Reset()
 nAllPara= 1;
 nWord 	= 0;
 nChar	= 0;
+nCharExcludingSpaces = 0;
 bModified = TRUE;
 }
 
diff --git a/sw/source/core/txtnode/txtedt.cxx b/sw/source/core/txtnode/txtedt.cxx
index aa8faaa..ad2eb8b 100644
--- a/sw/source/core/txtnode/txtedt.cxx
+++ b/sw/source/core/txtnode/txtedt.cxx
@@ -2,7 +2,7 @@
 

Re: [Libreoffice] [PATCH] Fix for bug / feature request 30550 - Character count without spaces

2010-10-27 Thread Cedric Bosdonnat
Hi Mattias,

On Wed, 2010-10-27 at 18:26 +1100, Mattias Johnsson wrote:
> Here are my patches for the easy hack / programming task "Count
> characters without whitespace in the Writer statistics". Since it's
> something translators have apparently been asking for OO.org to have
> for eight years (see
> http://www.openoffice.org/issues/show_bug.cgi?id=10356 and
> https://bugs.freedesktop.org/show_bug.cgi?id=30550) it'd be nice for
> it to appear in LibreOffice :-)

Thanks for your patches. I reviewed them and cleaned up the
SwTxtNode::WordCount() method a bit, as you had added some commented-out
code. I also kept the aScanner.GetLen() > 1 condition... is there any
reason to remove that?

I removed the task from the EasyTasks list. Keep providing nice patches
like those ;)

Regards,

-- 
Cédric Bosdonnat
LibreOffice hacker
http://documentfoundation.org
OOo Eclipse Integration developer
http://cedric.bosdonnat.free.fr



___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice


Re: [Libreoffice] [PATCH] Fix for bug / feature request 30550 - Character count without spaces

2010-10-27 Thread Mattias Johnsson
On 27 October 2010 23:38, Cedric Bosdonnat <cedric.bosdonnat@free.fr> wrote:
> Hi Mattias,
>
> On Wed, 2010-10-27 at 18:26 +1100, Mattias Johnsson wrote:
>> Here are my patches for the easy hack / programming task "Count
>> characters without whitespace in the Writer statistics". Since it's
>> something translators have apparently been asking for OO.org to have
>> for eight years (see
>> http://www.openoffice.org/issues/show_bug.cgi?id=10356 and
>> https://bugs.freedesktop.org/show_bug.cgi?id=30550) it'd be nice for
>> it to appear in LibreOffice :-)
>
> Thanks for your patches. I reviewed them and cleaned up the
> SwTxtNode::WordCount() method a bit, as you had added some commented-out
> code. I also kept the aScanner.GetLen() > 1 condition... is there any
> reason to remove that?
>
> I removed the task from the EasyTasks list. Keep providing nice patches
> like those ;)



Re: [Libreoffice] [PATCH] Fix for bug / feature request 30550 - Character count without spaces

2010-10-27 Thread Mattias Johnsson
Drat, I meant to send:

I removed the aScanner.GetLen() > 1 check because if you leave that
in, it doesn't count words consisting of a single character as words.
So it wasn't counting words like "a" or "i".
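
To make the reasoning concrete, here is a hedged sketch of how a bare minimum-length test drops one-letter words. It does not use SwScanner: tokens come from plain whitespace splitting with `std::istringstream`, and `countWordsWithMinLen` is a hypothetical helper, so it only shows the effect described above under that assumption.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Counts whitespace-separated tokens that are at least minLen characters long.
// minLen = 2 mimics a "length > 1" test; minLen = 1 counts every token.
inline std::size_t countWordsWithMinLen(const std::string& text, std::size_t minLen)
{
    std::istringstream in(text);
    std::string token;
    std::size_t words = 0;
    while (in >> token)
    {
        if (token.size() >= minLen)
            ++words;
    }
    return words;
}
```

With the text "I have a dog", a minimum length of 2 counts two words, while a minimum length of 1 counts all four.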



Re: [Libreoffice] [PATCH] Fix for bug / feature request 30550 - Character count without spaces

2010-10-27 Thread LeMoyne

Using the following sample from a git patch one can see one way in which the
current counting method comes up with fewer words than other methods do.  
+1747,9
1.7.0.4
14 characters on two lines: either 2, 3 or 6 words depending on how you
count

Gedit says:   2 lines, 6 words, 15 chars, 14 chars (no spaces)
LibOdev says: 2 words, 14 chars, 14 chars excl. spaces (no stat line for
lines, though it has paragraph counts)

Gedit takes each number as a word, breaking the words on punctuation.
Gedit also counts the newline as whitespace.
LibOdev counts any block of contiguous non-whitespace characters as a word.
The LibOdev per-node word counter never sees the newline.
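
The two behaviours described above can be sketched as follows. This is a guess at the rules from the observed numbers, not code from gedit or LibreOffice; all three function names are hypothetical, and the sketch only handles plain ASCII.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Words as maximal runs of non-whitespace (the behaviour described for LibOdev).
inline std::size_t countWordsByWhitespace(const std::string& text)
{
    std::size_t words = 0;
    bool inWord = false;
    for (unsigned char c : text)
    {
        const bool isSpace = std::isspace(c) != 0;
        if (!isSpace && !inWord)
            ++words;          // a new run of non-whitespace starts here
        inWord = !isSpace;
    }
    return words;
}

// Words as maximal runs of letters or digits, so punctuation also breaks
// words (the behaviour described for gedit).
inline std::size_t countWordsByAlnum(const std::string& text)
{
    std::size_t words = 0;
    bool inWord = false;
    for (unsigned char c : text)
    {
        const bool isWordChar = std::isalnum(c) != 0;
        if (isWordChar && !inWord)
            ++words;          // a new run of letters/digits starts here
        inWord = isWordChar;
    }
    return words;
}

// Characters that are not whitespace (the newline counts as whitespace here).
inline std::size_t countNonWhitespace(const std::string& text)
{
    std::size_t n = 0;
    for (unsigned char c : text)
    {
        if (!std::isspace(c))
            ++n;
    }
    return n;
}
```

For the sample `"+1747,9\n1.7.0.4"` the string holds 15 characters including the newline, 14 of them non-whitespace; the whitespace rule finds 2 words and the letters-or-digits rule finds 6, matching the figures reported above.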

Over the diff part (from qgit) of Mattias' part 1 - sw patch file, shown as
gedit / LibOdev:
Words: 2418 / 2414
Chars: 24241 / 24241
Chars (excl. spaces): 16830 / 16830
Now a near match on words and a perfect match on chars excl. spaces.

Testing with a different, entire patch file, the major difference is in the
word counts: 1338 versus 1533, or ~200 out of 1400 words. The total chars and
chars excl. spaces agree completely, at 13 459 and 10 157.
Taking into account the different word handling (top) and the way the counts
match and then don't match, I suspect a second difference in the counting
method between gedit and LibOdev, plus differences in the line breaks in the
files after cut and paste.

So far gedit and LibOdev agree completely ONLY on the non-space counts.  

I didn't check results on your reference odt because gedit won't open odt and
cut and paste just dumps the XML into the text...
Words: 3997 / 18
Chars: 33429 / 125
Chars (excl. spaces): 28469 / 107
where the second, smaller numbers are a page footer's counts. AFAIR
LibOdev doesn't count the footer content, and that might be the difference:
there are 20+ pages, so that's 360+ words and ~2500 chars in the footers.

I also saw that the LibOdev count is zero when the odt is loaded. Perhaps the
count is made somewhere else and saved on the doc without this code, or it is
stored in the doc and loaded. Either way, the word count is marked clean, so
it is not re-counted when the dialog box calls updateStats, and the excl.
spaces count remains zero. Just clicking in the document causes a full
recount though, and that seems too busy somehow... more than enough guessing
there.
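
The guess above can be modelled with a small dirty-flag cache. This is a sketch under stated assumptions, not the Writer implementation: `CachedStats`, `recount`, `updateStats` and `loadWithStoredCounts` are hypothetical, and the idea that loading restores old counts and marks them clean is exactly the speculation in the paragraph above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical cached statistics, loosely modelled on the fields in the patch.
struct CachedStats
{
    std::size_t nChar = 0;
    std::size_t nCharExcludingSpaces = 0;
    bool bModified = true;  // true means the cached numbers are stale
};

// Full recount; clears the stale flag afterwards.
inline void recount(const std::string& text, CachedStats& stats)
{
    stats.nChar = text.size();
    stats.nCharExcludingSpaces = 0;
    for (char c : text)
    {
        if (c != ' ')
            ++stats.nCharExcludingSpaces;
    }
    stats.bModified = false;
}

// Recounts only when the cache is marked stale.
inline void updateStats(const std::string& text, CachedStats& stats)
{
    if (stats.bModified)
        recount(text, stats);
}

// Assumption: a load restores counts saved before the new field existed
// and marks them clean, so nothing forces a recount.
inline CachedStats loadWithStoredCounts(std::size_t storedChars)
{
    CachedStats stats;
    stats.nChar = storedChars;
    stats.bModified = false;
    return stats;
}
```

In this model, calling `updateStats` right after `loadWithStoredCounts` leaves `nCharExcludingSpaces` at zero; setting `bModified` to true (as an edit would) and calling `updateStats` again produces the real count.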

All these tests are with the aScanner.GetLen() > 1 check in place. With
that Len >= 2 check, the new counting routine has no problem with
single-letter words like "A", "a", "1", "-", or just ",".
It is puzzling that Mattias removed the check to handle single-char words on
his machine, but a build out of master/LibOdev works (at least for me) with
that same check in …

I will test changing back to Mattias' simpler submission (building now).
I must note that the block immediately after this count area word-counts the
outline numbers (and counts the bullets as words!?!), and it does not have any
such length check at all... I think all the len=1 strings that the scanner
might give back are just CH_TXTATR_BREAKWORD = 0x01, and they are probably
the scanner's zero-length string. The scanner's GetEnd points one slot past
the end of the string, i.e. for SwScanner GetEnd() = GetBegin() + GetLen()
(no -1 there), and that end spot likely has a break marker.
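
As a sketch of that reading (an assumption, not the SwScanner source), a token can be modelled as a half-open range whose end is begin + length, and a token is skipped only when it is a lone break marker. `Token`, `kBreakWord` and `countsAsWord` are hypothetical names; the value 0x01 is the one quoted above.

```cpp
#include <cstddef>
#include <string>

// Value quoted in the thread for the break-word marker.
constexpr char kBreakWord = '\x01';

// Hypothetical token using the half-open convention: end = begin + len.
struct Token
{
    std::size_t begin;
    std::size_t len;
    std::size_t end() const { return begin + len; }  // one past the last character
};

// Counts a token as a word unless it is empty or a lone break marker.
inline bool countsAsWord(const std::string& text, const Token& tok)
{
    return tok.len > 1 || (tok.len == 1 && text[tok.begin] != kBreakWord);
}
```

Under this rule a one-character token such as "a" still counts as a word, and only a one-character token holding the break marker is dropped, which would be consistent with the counts agreeing with and without the plain length test.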

Again gedit and LibOdev agree completely ONLY on the non-space counts.  

-- 
View this message in context: 
http://nabble.documentfoundation.org/PATCH-Fix-for-bug-feature-request-30550-Character-count-without-spaces-tp1778667p1782965.html
Sent from the Dev mailing list archive at Nabble.com.


Re: [Libreoffice] [PATCH] Fix for bug / feature request 30550 - Character count without spaces

2010-10-27 Thread Mattias Johnsson
On 28 October 2010 08:42, LeMoyne <j...@mail2lee.com> wrote:
>
> Using the following sample from a git patch one can see one way in which the
> current counting method comes up with fewer words than other methods do.
> +1747,9
> 1.7.0.4
> 14 characters on two lines: either 2, 3 or 6 words depending on how you
> count
>
> Gedit says:   2 lines, 6 words, 15 chars, 14 chars (no spaces)
> LibOdev says: 2 words, 14 chars, 14 chars excl. spaces (no stat line for
> lines, though it has paragraph counts)
>
> Gedit takes each number as a word, breaking the words on punctuation.
> Gedit also counts the newline as whitespace.
> LibOdev counts any block of contiguous non-whitespace characters as a word.
> The LibOdev per-node word counter never sees the newline.
>
> Over the diff part (from qgit) of Mattias' part 1 - sw patch file, shown as
> gedit / LibOdev:
> Words: 2418 / 2414
> Chars: 24241 / 24241
> Chars (excl. spaces): 16830 / 16830
> Now a near match on words and a perfect match on chars excl. spaces.
>
> Testing with a different, entire patch file, the major difference is in the
> word counts: 1338 versus 1533, or ~200 out of 1400 words. The total chars and
> chars excl. spaces agree completely, at 13 459 and 10 157.
> Taking into account the different word handling (top) and the way the counts
> match and then don't match, I suspect a second difference in the counting
> method between gedit and LibOdev, plus differences in the line breaks in the
> files after cut and paste.
>
> So far gedit and LibOdev agree completely ONLY on the non-space counts.
>
> I didn't check results on your reference odt because gedit won't open odt and
> cut and paste just dumps the XML into the text...
> Words: 3997 / 18
> Chars: 33429 / 125
> Chars (excl. spaces): 28469 / 107
> where the second, smaller numbers are a page footer's counts. AFAIR
> LibOdev doesn't count the footer content, and that might be the difference:
> there are 20+ pages, so that's 360+ words and ~2500 chars in the footers.
>
> I also saw that the LibOdev count is zero when the odt is loaded. Perhaps the
> count is made somewhere else and saved on the doc without this code, or it is
> stored in the doc and loaded. Either way, the word count is marked clean, so
> it is not re-counted when the dialog box calls updateStats, and the excl.
> spaces count remains zero. Just clicking in the document causes a full
> recount though, and that seems too busy somehow... more than enough guessing
> there.
>
> All these tests are with the aScanner.GetLen() > 1 check in place. With
> that Len >= 2 check, the new counting routine has no problem with
> single-letter words like "A", "a", "1", "-", or just ",".
> It is puzzling that Mattias removed the check to handle single-char words on
> his machine, but a build out of master/LibOdev works (at least for me) with
> that same check in …

Hmm, I originally left that check in because it was in Norbert's
sketch code, and I figured he knew what was going on. But I definitely
didn't get the right word count with it in place, and I did when I
removed it. I was quite puzzled as to its purpose - your explanation
about the leading spaces and the SwScanner makes sense, though, and I
guess that's the reason it was there.

> I will test changing back to Mattias' simpler submission (building now).
> I must note that the block immediately after this count area word-counts the
> outline numbers (and counts the bullets as words!?!), and it does not have any
> such length check at all... I think all the len=1 strings that the scanner
> might give back are just CH_TXTATR_BREAKWORD = 0x01, and they are probably
> the scanner's zero-length string. The scanner's GetEnd points one slot past
> the end of the string, i.e. for SwScanner GetEnd() = GetBegin() + GetLen()
> (no -1 there), and that end spot likely has a break marker.
>
> Again gedit and LibOdev agree completely ONLY on the non-space counts.

Nice analysis! I'm at work now, but with your explanations I'll look
into things again when I get home, unless you've solved all the
problems by then.

I did notice the problem LO has with counting things like isolated
punctuation as a word (and its deliberate choice to count bullets as
words), but decided not to try and change it, since I figured step 1
was to add the feature without breaking the current behaviour :-P I
also couldn't see a way to make it robust for all languages,
especially those with non-Latin alphabets and weird punctuation
markers.


Re: [Libreoffice] [PATCH] Fix for bug / feature request 30550 - Character count without spaces

2010-10-27 Thread LeMoyne

Mattias, 

No problem at all. I recompiled with your original, simpler if-then statement
and I get the same counts for your reference OASIS MetaData Examples odt:
http://nabble.documentfoundation.org/file/n1783515/07-08-22-MetaData-Examples.odt

  Just one quick test but absolute agreement with and without the length
 test. 

 So, it really does seem that the len=1 strings are just the break char,
and one-char words must come through as char+break.
 You were correct to just slip in the minimal fix and then look at all
the other problems. I really got bogged down in the greater context and in
the scanner weirdness. I didn't really get less confused until doing the
simple test of switching between your patch and Cedric's patch.
On one hand, the current method will count the same in other languages as
long as their space char has a uint val of 32. In other words, the present
counter can't tell an upside-down exclamation point from an A: it's all
not-a-space. On the other hand, there is almost certainly implicit casting
involved in the whitespace tests (' ' == unicodeCharVar), and that could
really break it on a different code page. On the gripping hand, I don't
really know. It does still over-count a leading double quote (") as its own
word, and I'm pretty clueless on that pre-existing condition except to
strongly suspect the scanner ;-)   - the double quote isn't in the
whitespace list at the top of the file.
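
To show what a space-only test misses, here is a hedged sketch comparing it with a slightly wider whitespace list. The code points chosen (tab, no-break space, thin space, ideographic space) are an illustrative subset picked for this example, not the list from txtedt.cxx, and both function names are hypothetical.

```cpp
#include <algorithm>
#include <iterator>

// Illustrative subset of whitespace code points; an assumption, not Writer's list.
constexpr char32_t kSketchWhitespace[] = {
    0x0020,  // SPACE
    0x0009,  // CHARACTER TABULATION
    0x00A0,  // NO-BREAK SPACE
    0x2009,  // THIN SPACE
    0x3000   // IDEOGRAPHIC SPACE
};

// The narrow test: only code point 32 is a space.
inline bool isSpaceOnly(char32_t c)
{
    return c == U' ';
}

// The wider test: any code point in the list above is whitespace.
inline bool isListedWhitespace(char32_t c)
{
    return std::find(std::begin(kSketchWhitespace),
                     std::end(kSketchWhitespace), c)
           != std::end(kSketchWhitespace);
}
```

With the narrow test a no-break space (U+00A0) is counted like a letter, just as an inverted exclamation mark (U+00A1) or an "A" is; the wider list treats U+00A0 as whitespace while still counting the other two.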
 I will try to look closer to see what the scanner is actually starting
with and giving back as it expands and breaks up the node text.  I may not
get to that for a while so please don't let me stop you.  For clarity and
completion you may want to pull the numbering/bullets stuff into line with
your fix on the main node text and just re-submit your simpler test.  
 The documentation folks will laugh if/when they find out we count
bullets as a word.  But only because they are in a good mood: they will be
happy with your patch. 
- LeMoyne

-- 
View this message in context: 
http://nabble.documentfoundation.org/PATCH-Fix-for-bug-feature-request-30550-Character-count-without-spaces-tp1778667p1783515.html
Sent from the Dev mailing list archive at Nabble.com.


Re: [Libreoffice] [PATCH] Fix for bug / feature request 30550 - Character count without spaces

2010-10-27 Thread Norbert Thiebaud
On 10/27/10, Mattias Johnsson <m.t.johns...@gmail.com> wrote:
> On 28 October 2010 08:42, LeMoyne <j...@mail2lee.com> wrote:
>>
>> All these tests are with the aScanner.GetLen() > 1 check in place. With
>> that Len >= 2 check, the new counting routine has no problem with
>> single-letter words like "A", "a", "1", "-", or just ",".
>> It is puzzling that Mattias removed the check to handle single-char words
>> on his machine, but a build out of master/LibOdev works (at least for me)
>> with that same check in …
>
> Hmm, I originally left that check in because it was in Norbert's
> sketch code, and I figured he knew what was going on.

Not at all. I had no clue. I just found the place where the magic was
happening so I mentioned it in the bug report.
The 'if' was already in the original code. I just had to move things a
bit so that the 'if' doesn't apply when counting chars, while still
applying the way it did before for words... and I may very well have
botched that.

Norbert