[Libreoffice] [PATCH] Fix for bug / feature request 30550 - Character count without spaces
Hi all,

Here are my patches for the easy hack / programming task "Count characters without whitespace in the Writer statistics". Since it's something translators have apparently been asking for OO.org to have for eight years (see http://www.openoffice.org/issues/show_bug.cgi?id=10356 and https://bugs.freedesktop.org/show_bug.cgi?id=30550) it'd be nice for it to appear in LibreOffice :-)

I've added an extra couple of lines to the word count dialog box which give the number of characters excluding whitespace in both the selected text and the entire document. As far as the UI decision goes, I checked in MS Word, and that's what it does, so I figure if it's good enough for MS it's good enough for us.

Note: I'm still getting started with the LibreOffice code base, and I'm not entirely certain what I'm doing. For example, I have no idea what is supposed to happen with regard to internationalisation, or whether this also works under Windows, given that it affects the UI. Still, it builds and works under Ubuntu 10.10 x86_64. I've tested it with a number of documents and it seems to give the right answers (with a caveat, see below), and the answers for the standard word/character count are the same as before, so I at least haven't broken anything. If my patches aren't up to scratch, hopefully other people can at least use them as a starting point.

Now the caveat. There seems to be a bug, in that at least one document (www.oasis-open.org/committees/download.php/25054/07-08-22-MetaData-Examples.odt) gives the wrong word/character count if you open it and check the document statistics. However, if you edit it at all, such as by adding a character, and then check the statistics, they're then correct. It's as though the load doesn't mark the word count as dirty or something. Documents I create, save and open seem to work fine. The reason I'm submitting these patches despite this bug is that the bug was present before I made my changes.
I just pulled the latest git, built and checked to make sure. So my feature change works, and moves things forward, but doesn't fix this original bug, which I found while testing my changes. Note that OpenOffice.org 3.2 (Ubuntu 10.10 x86_64 repository version) doesn't seem to have the bug. It's not immediately obvious to me how to fix it, but hopefully it'll be blindingly obvious to someone else.

Please examine, test, and tell me if I've done stuff that's horribly wrong :-P

Patches contributed under MPL 1.1 / GPLv3+ / LGPLv3+ licenses.

Cheers,
Mattias

From 5ac50b845feab1ab1901cd52593237c3676e097b Mon Sep 17 00:00:00 2001
From: Mattias Johnsson m.t.johns...@gmail.com
Date: Wed, 27 Oct 2010 18:01:43 +1100
Subject: [PATCH] Add character count exclusive of whitespace to document statistics part 1

---
 sw/inc/docstat.hxx                      |    1 +
 sw/inc/ndtxt.hxx                        |    2 +
 sw/source/core/doc/docstat.cxx          |    2 +
 sw/source/core/txtnode/txtedt.cxx       |  115 +-
 sw/source/ui/dialog/wordcountdialog.cxx |    6 ++
 sw/source/ui/dialog/wordcountdialog.hrc |   30 +
 sw/source/ui/dialog/wordcountdialog.src |   42 +---
 sw/source/ui/inc/wordcountdialog.hxx    |    4 +
 8 files changed, 130 insertions(+), 72 deletions(-)

diff --git a/sw/inc/docstat.hxx b/sw/inc/docstat.hxx
index a818e2f..8b156bf 100644
--- a/sw/inc/docstat.hxx
+++ b/sw/inc/docstat.hxx
@@ -44,6 +44,7 @@ struct SW_DLLPUBLIC SwDocStat
     ULONG nAllPara;
     ULONG nWord;
     ULONG nChar;
+    ULONG nCharExcludingSpaces;
     BOOL bModified;
 
     SwDocStat();
diff --git a/sw/inc/ndtxt.hxx b/sw/inc/ndtxt.hxx
index 08410b0..713a30b 100644
--- a/sw/inc/ndtxt.hxx
+++ b/sw/inc/ndtxt.hxx
@@ -189,6 +189,8 @@ class SW_DLLPUBLIC SwTxtNode: public SwCntntNode, public ::sfx2::Metadatable
     SW_DLLPRIVATE ULONG GetParaNumberOfWords() const;
     SW_DLLPRIVATE void SetParaNumberOfChars( ULONG nTmpChars ) const;
     SW_DLLPRIVATE ULONG GetParaNumberOfChars() const;
+    SW_DLLPRIVATE void SetParaNumberOfCharsExcludingSpaces( ULONG nTmpChars ) const;
+    SW_DLLPRIVATE ULONG GetParaNumberOfCharsExcludingSpaces() const;
     SW_DLLPRIVATE void InitSwParaStatistics( bool bNew );
 
     /** create number for this text node, if not already existing
diff --git a/sw/source/core/doc/docstat.cxx b/sw/source/core/doc/docstat.cxx
index b75a057..e2bef7f 100644
--- a/sw/source/core/doc/docstat.cxx
+++ b/sw/source/core/doc/docstat.cxx
@@ -46,6 +46,7 @@ SwDocStat::SwDocStat() :
     nAllPara(1),
     nWord(0),
     nChar(0),
+    nCharExcludingSpaces(0),
     bModified(TRUE)
     {}
 
@@ -63,6 +64,7 @@ void SwDocStat::Reset()
     nAllPara= 1;
     nWord = 0;
     nChar = 0;
+    nCharExcludingSpaces = 0;
     bModified = TRUE;
 }
 
diff --git a/sw/source/core/txtnode/txtedt.cxx b/sw/source/core/txtnode/txtedt.cxx
index aa8faaa..ad2eb8b 100644
--- a/sw/source/core/txtnode/txtedt.cxx
+++ b/sw/source/core/txtnode/txtedt.cxx
@@ -2,7 +2,7 @@
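The txtedt.cxx hunk is cut off in this archive, so the counting logic itself is not visible here. Purely as an illustration of what a counter like nCharExcludingSpaces has to do, here is a minimal sketch. DocStatSketch, isSketchWhitespace and countParagraph are hypothetical names, and the whitespace list is an assumption, not the list used in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the statistics struct touched by the patch;
// only the two character counters are modelled.
struct DocStatSketch
{
    std::size_t nChar = 0;                 // every character
    std::size_t nCharExcludingSpaces = 0;  // characters that are not whitespace
};

// Assumption: whitespace means space, tab, LF, CR and the no-break space
// U+00A0. The list in the real code may well differ.
inline bool isSketchWhitespace(char16_t c)
{
    return c == u' ' || c == u'\t' || c == u'\n' || c == u'\r' || c == u'\u00A0';
}

// Adds the counts for one paragraph of text to rStat.
inline void countParagraph(const std::u16string& rText, DocStatSketch& rStat)
{
    for (char16_t c : rText)
    {
        ++rStat.nChar;
        if (!isSketchWhitespace(c))
            ++rStat.nCharExcludingSpaces;
    }
    // The non-space count can never exceed the total.
    assert(rStat.nCharExcludingSpaces <= rStat.nChar);
}
```

Calling countParagraph once per paragraph accumulates both totals in a single pass, which is roughly the shape one would expect for a per-node counter feeding a document-wide statistic.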
Re: [Libreoffice] [PATCH] Fix for bug / feature request 30550 - Character count without spaces
Hi Mattias,

On Wed, 2010-10-27 at 18:26 +1100, Mattias Johnsson wrote:
> Here are my patches for the easy hack / programming task "Count
> characters without whitespace in the Writer statistics". Since it's
> something translators have apparently been asking for OO.org to have
> for eight years (see
> http://www.openoffice.org/issues/show_bug.cgi?id=10356 and
> https://bugs.freedesktop.org/show_bug.cgi?id=30550) it'd be nice for
> it to appear in LibreOffice :-)

Thanks for your patches. I reviewed them and cleaned up the SwTxtNode::WordCount() method a bit, as you had added some commented-out code. I also kept the aScanner.GetLen() > 1 condition... is there any reason to remove that?

I removed the task from the EasyTasks list. Keep providing nice patches like those ;)

Regards,

--
Cédric Bosdonnat
LibreOffice hacker
http://documentfoundation.org
OOo Eclipse Integration developer
http://cedric.bosdonnat.free.fr
_______________________________________________
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice
Re: [Libreoffice] [PATCH] Fix for bug / feature request 30550 - Character count without spaces
Drat, I meant to send:

I removed the aScanner.GetLen() > 1 check because if you leave that in, it doesn't count words consisting of a single character as words. So it wasn't counting words like "a" or "i".

On 27 October 2010 23:38, Cedric Bosdonnat cedric.bosdonnat@free.fr wrote:
> Thanks for your patches. I reviewed them and cleaned up the
> SwTxtNode::WordCount() method a bit, as you had added some
> commented-out code. I also kept the aScanner.GetLen() > 1
> condition... is there any reason to remove that?
>
> I removed the task from the EasyTasks list. Keep providing nice
> patches like those ;)
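To make the concern concrete: if a word is only counted when its length exceeds 1, single-letter words drop out. The sketch below is a simplified illustration that splits on whitespace with a stream; it is not the SwScanner logic, and countWords and nMinLen are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Counts whitespace-separated tokens that are at least nMinLen characters long.
// nMinLen == 1 counts every token; nMinLen == 2 mimics a bare "length > 1" test.
inline std::size_t countWords(const std::string& rText, std::size_t nMinLen)
{
    assert(nMinLen >= 1);  // a minimum length of 0 would be meaningless here
    std::istringstream aStream(rText);
    std::string aToken;
    std::size_t nWords = 0;
    while (aStream >> aToken)
    {
        if (aToken.size() >= nMinLen)
            ++nWords;
    }
    return nWords;
}
```

For the text "I saw a cat", countWords with nMinLen 1 returns 4, while nMinLen 2 returns 2 because "I" and "a" are skipped. Whether the real check behaves this way depends on what the scanner hands back for one-character words, which is the open question in this thread.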
Re: [Libreoffice] [PATCH] Fix for bug / feature request 30550 - Character count without spaces
Using the following sample from a git patch, one can see one way in which the current counting method comes up with fewer words than other methods do.

+1747,9
1.7.0.4

14 characters on two lines: either 2, 3 or 6 words depending on how you count.

Gedit says: 2 lines, 6 words, 15 chars, 14 chars (no spaces)
LibOdev says: 2 words, 14 chars, 14 chars excl. spaces (no stat line for lines, though it has para counts)

Gedit takes each number as a word, breaking the words on punctuation.
Gedit also counts the newline as whitespace.
LibOdev counts all of any block of contiguous characters as a word.
LibOdev's in-node word counter never sees the newline.

Over the diff part (from qgit) of Mattias' "part 1 - sw" patch file, showing gedit / LibOdev:

Words: 2418 / 2414
Chars: 24241 / 24241
Chars (excl. spaces): 16830 / 16830

Now a near match in words and a perfect match on chars excl. spaces.

Testing with a different entire patch file, the major difference is in words, 1338 versus 1533, or ~200 out of 1400 words, but the total chars and chars excl. spaces agree completely: 13 459 and 10 157.

Taking into account the different word handling (top) and the way they match then don't match, I suspect a second difference in the counting method between gedit and LibOdev, and differences in the line breaks in the files after cut and paste. So far gedit and LibOdev agree completely ONLY on the non-space counts.

I didn't check results on your reference odt because gedit won't open odt, and cut and paste just dumps the XML into the text...

Words: 3997 / 18
Chars: 33429 / 125
Chars (excl. spaces): 28469 / 107

where the second, smaller numbers are a page footer's counts. AFAIR LibOdev doesn't count the footer content and that might be the difference. There are 20+ pages, so that's 360+ words and ~2500 chars in the footers.

I also saw how the LibOdev count is zero at load of the odt.
Perhaps the count is made somewhere else and saved on the doc without this code, or it is stored in the doc and loaded - either way the word count is marked clean, so it is not re-counted when the dialog box calls updateStats and the excl. spaces count remains zero. Just clicking in the document causes a full recount though, and that seems too busy somehow... more than enough guessing there.

All these tests are with the aScanner.GetLen() > 1 check in place. With that Len >= 2 check, the new counting routine has no problem with single letter words like A, a, 1, -, or just ,

It is puzzling that Mattias removed the check to handle single char words on his machine, but a build out of master/LibOdev works (at least for me) with that same check in…

I will test changing back to Mattias' simpler submission (building now).

I must note that the block immediately after this count area word counts the outline numbers (and counts the bullets as words!?!) - it does not have any such length check at all...

I think all the len=1 strings that the scanner might give back are just CH_TXTATR_BREAKWORD = 0x01. And they are probably Scanner's zero length string. Scanner's GetEnd points one slot past the end of the string, i.e. for SwScanner GetEnd() = GetBegin() + GetLen() (no -1 there). And that end spot likely has a break marker.

Again gedit and LibOdev agree completely ONLY on the non-space counts.

--
View this message in context: http://nabble.documentfoundation.org/PATCH-Fix-for-bug-feature-request-30550-Character-count-without-spaces-tp1778667p1782965.html
Sent from the Dev mailing list archive at Nabble.com.
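The two-line sample at the top of this message can be checked with a small sketch that applies both word definitions in one pass. This is only one plausible reading of the two behaviours described (a word as a run of non-whitespace characters versus a word as a run of letters and digits); it is not taken from gedit or from the Writer code, and TextCounts and countText are hypothetical names.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

struct TextCounts
{
    std::size_t nChars = 0;          // every character, newline included
    std::size_t nCharsNoSpaces = 0;  // non-whitespace characters
    std::size_t nBlockWords = 0;     // runs of non-whitespace characters
    std::size_t nAlnumWords = 0;     // runs of letters/digits, split on punctuation
};

inline TextCounts countText(const std::string& rText)
{
    TextCounts aCounts;
    bool bInBlock = false;  // previous character was non-whitespace
    bool bInAlnum = false;  // previous character was a letter or digit
    for (unsigned char c : rText)
    {
        const bool bSpace = std::isspace(c) != 0;
        const bool bAlnum = std::isalnum(c) != 0;
        ++aCounts.nChars;
        if (!bSpace)
            ++aCounts.nCharsNoSpaces;
        if (!bSpace && !bInBlock)
            ++aCounts.nBlockWords;  // a new non-whitespace run starts here
        if (bAlnum && !bInAlnum)
            ++aCounts.nAlnumWords;  // a new letter/digit run starts here
        bInBlock = !bSpace;
        bInAlnum = bAlnum;
    }
    assert(aCounts.nCharsNoSpaces <= aCounts.nChars);
    return aCounts;
}
```

For the string "+1747,9\n1.7.0.4" this yields 15 characters, 14 non-space characters, 2 block words and 6 alphanumeric words, which lines up with the 2 versus 6 word split and the matching non-space count reported above.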
Re: [Libreoffice] [PATCH] Fix for bug / feature request 30550 - Character count without spaces
On 28 October 2010 08:42, LeMoyne j...@mail2lee.com wrote:
> I also saw how the LibOdev count is zero at load of the odt. Perhaps
> the count is made somewhere else and saved on the doc without this
> code, or it is stored in the doc and loaded - either way the word
> count is marked clean, so it is not re-counted when the dialog box
> calls updateStats and the excl. spaces count remains zero. Just
> clicking in the document causes a full recount though, and that seems
> too busy somehow... more than enough guessing there.
>
> All these tests are with the aScanner.GetLen() > 1 check in place.
> With that Len >= 2 check, the new counting routine has no problem
> with single letter words like A, a, 1, -, or just ,
>
> It is puzzling that Mattias removed the check to handle single char
> words on his machine, but a build out of master/LibOdev works (at
> least for me) with that same check in…

Hmm, I originally left that check in because it was in Norbert's sketch code, and I figured he knew what was going on. But I definitely didn't get the right word count with it in place, and I did when I removed it. I was quite puzzled as to its purpose - your explanation about the leading spaces and the SwScanner makes sense, though, and I guess that's the reason it was there.

> I will test changing back to Mattias' simpler submission (building
> now).
>
> I must note that the block immediately after this count area word
> counts the outline numbers (and counts the bullets as words!?!) - it
> does not have any such length check at all...
>
> I think all the len=1 strings that the scanner might give back are
> just CH_TXTATR_BREAKWORD = 0x01. And they are probably Scanner's zero
> length string. Scanner's GetEnd points one slot past the end of the
> string, i.e. for SwScanner GetEnd() = GetBegin() + GetLen() (no -1
> there). And that end spot likely has a break marker.
>
> Again gedit and LibOdev agree completely ONLY on the non-space counts.

Nice analysis! I'm at work now, but with your explanations I'll look into things again when I get home, unless you've solved all the problems by then.
I did notice the problem LO has with counting things like isolated punctuation as a word (and its deliberate choice to count bullets as words), but decided not to try and change it, since I figured step 1 was to add the feature without breaking the current behaviour :-P I also couldn't see a way to make it robust for all languages, especially those with non-Latin alphabets and weird punctuation markers.
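LeMoyne's guess about the zero count at load can be pictured with a small cache sketch. It only illustrates the suspected mechanism (statistics marked clean on load, so a counter that was never stored stays at zero until something marks them dirty); nobody in the thread has confirmed that this is what the code does. CachedStats, loadStoredStats and updateStatsSketch are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct CachedStats
{
    std::size_t nChar = 0;
    std::size_t nCharExcludingSpaces = 0;
    bool bModified = true;  // true means the cached counts are stale
};

// Simulates loading statistics that were stored before the new counter existed.
inline CachedStats loadStoredStats(std::size_t nStoredChars)
{
    CachedStats aStats;
    aStats.nChar = nStoredChars;
    aStats.bModified = false;  // suspected problem: marked clean on load
    return aStats;
}

// Recounts only when the cache is marked stale.
inline void updateStatsSketch(const std::string& rText, CachedStats& rStats)
{
    if (!rStats.bModified)
        return;  // cached values are trusted as they are
    rStats.nChar = rText.size();
    rStats.nCharExcludingSpaces = 0;
    for (char c : rText)
    {
        if (c != ' ')
            ++rStats.nCharExcludingSpaces;
    }
    assert(rStats.nCharExcludingSpaces <= rStats.nChar);
    rStats.bModified = false;
}
```

With stats loaded via loadStoredStats(5) for the text "ab cd", the first updateStatsSketch call leaves nCharExcludingSpaces at 0. After setting bModified to true, as an edit would, the next call gives 4. Under this assumption the fix would be to treat loaded statistics as stale whenever the new counter is absent.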
Re: [Libreoffice] [PATCH] Fix for bug / feature request 30550 - Character count without spaces
Mattias,

No problem at all. Recompiled with your original, simpler if-then statement and I get the same counts for your reference Oasis Metadata Examples odt (http://nabble.documentfoundation.org/file/n1783515/07-08-22-MetaData-Examples.odt). Just one quick test, but absolute agreement with and without the length test. So it really does seem that the len=1 strings are just the break char, and one-char words must come through as char+break.

You were correct to just slip in the minimal fix and then look at all the other problems. I really bogged down in the greater context and in the scanner weirdness. Didn't really get less confused until doing the simple test of switching between your patch and Cedric's patch.

On one hand, the current method will count the same in other languages as long as their space char has a uint val of 32. In other words, the present counter can't tell an upside-down exclamation point from an A: it's all not-a-space. On the other hand, there is almost certainly implicit casting involved in the whitespace tests (' ' == unicodeCharVar ) and that could really break it on a different code page. On the gripping hand, I don't really know.

It does still over-count a leading double quote (") as its own word, and I'm pretty clueless on that pre-existing condition except to strongly suspect the scanner ;-) - the double quote isn't in the whitespace list at the top of the file. I will try to look closer to see what the scanner is actually starting with and giving back as it expands and breaks up the node text. I may not get to that for a while, so please don't let me stop you.

For clarity and completeness you may want to pull the numbering/bullets stuff into line with your fix on the main node text and just re-submit your simpler test. The documentation folks will laugh if/when they find out we count bullets as a word. But only because they are in a good mood: they will be happy with your patch.
- LeMoyne
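The break-marker hypothesis can be written down as a guard that counts every token except one that is nothing but the marker. This is a sketch under stated assumptions: Token and countWordsSkippingBreaks are hypothetical, the token list is supplied by hand rather than by SwScanner, and only the marker value 0x01 and the relation end = begin + len come from the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Value quoted in the thread for CH_TXTATR_BREAKWORD.
constexpr char kBreakWord = '\x01';

// Hypothetical scanner token covering the half-open range [begin, begin + len).
struct Token
{
    std::size_t begin;
    std::size_t len;
    std::size_t end() const { return begin + len; }  // one past the last character
};

// Counts tokens as words, skipping length-1 tokens that are only the break marker.
inline std::size_t countWordsSkippingBreaks(const std::string& rText,
                                            const std::vector<Token>& rTokens)
{
    std::size_t nWords = 0;
    for (const Token& rTok : rTokens)
    {
        assert(rTok.end() <= rText.size());  // the token must lie inside the text
        const bool bLoneBreak = rTok.len == 1 && rText[rTok.begin] == kBreakWord;
        if (rTok.len >= 1 && !bLoneBreak)
            ++nWords;
    }
    return nWords;
}
```

A guard of this shape gives the same answer whether a one-character word arrives with length 1 or, as suspected above, as the character plus a trailing break. For the text "a", marker, "I" plus marker, with tokens {0,1}, {2,1} and {4,2}, it counts 2 words.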
Re: [Libreoffice] [PATCH] Fix for bug / feature request 30550 - Character count without spaces
On 10/27/10, Mattias Johnsson m.t.johns...@gmail.com wrote:
> On 28 October 2010 08:42, LeMoyne j...@mail2lee.com wrote:
>> All these tests are with the aScanner.GetLen() > 1 check in place.
>> With that Len >= 2 check, the new counting routine has no problem
>> with single letter words like A, a, 1, -, or just ,
>>
>> It is puzzling that Mattias removed the check to handle single char
>> words on his machine, but a build out of master/LibOdev works (at
>> least for me) with that same check in…
>
> Hmm, I originally left that check in because it was in Norbert's
> sketch code, and I figured he knew what was going on.

Not at all. I had no clue. I just found the place where the magic was happening, so I mentioned it in the bug report. The 'if' was already in the original code. I just had to move things a bit so that the 'if' doesn't apply to counting chars, while still keeping it the way it was before for words... and I may very well have botched that.

Norbert