Re: [NTG-context] PDF document statistics (character count incl. spaces)?

2015-02-02 Thread Alan BRASLAU
On Mon, 2 Feb 2015 17:55:35 +0100
Hans Hagen  wrote:

> this feature relates to (simple) spell checking and collecting words
> for dedicated spell check lists, and anything under 4 chars is nearly
> always a valid word, which is why we discard them

English is rich in "four-letter words"!

Alan ;-)
___
If your question is of interest to others as well, please add an entry to the 
Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : http://foundry.supelec.fr/projects/contextrev/
wiki : http://contextgarden.net
___

Re: [NTG-context] PDF document statistics (character count incl. spaces)?

2015-02-02 Thread Jörg Weger
So I hope you might get bored once in a while before I have to write my 
bachelor thesis :)


Greetings Jörg

On 02.02.2015 00:56, Hans Hagen wrote:

> On 2/1/2015 10:06 PM, Jörg Weger wrote:
>
>> Does the character count that “wc --char ” returns include blank
>> spaces or not? (That is important for me.) “man wc” doesn’t say.
>>
>> I had hoped there was a better way than to edit the result of
>> “pdftotext” in my text editor or in LibreOffice Writer (deleting
>> unnecessary carriage returns and spaces with regular-expression
>> searches), both of which can do the count I need. In fact I had hoped
>> that ConTeXt could count the characters and spaces it renders to
>> PDF (is that theoretically possible?) …
>
> it's not too hard so maybe when i'm bored or see a good reason ..
>
> Hans


Re: [NTG-context] PDF document statistics (character count incl. spaces)?

2015-02-02 Thread Marcin Borkowski

On 2015-02-01, at 22:06, Jörg Weger  wrote:

> Does the character count that “wc --char ” returns include blank
> spaces or not? (That is important for me.) “man wc” doesn’t say.
>
> I had hoped there was a better way than to edit the result of
> “pdftotext” in my text editor or in LibreOffice Writer (deleting
> unnecessary carriage returns and spaces with regular-expression
> searches), both of which can do the count I need. In fact I had hoped
> that ConTeXt could count the characters and spaces it renders to
> PDF (is that theoretically possible?) …

I am pretty sure that you can make sed filter out blank characters.  So
then you can just chain pdftotext, sed and wc.
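
The pdftotext, filter, wc chain can be sketched as two small shell functions. This is only a sketch under stated assumptions: `tr` stands in for sed here (sed works line by line and never sees the newlines), the text is UTF-8 and read in a UTF-8 locale so that `wc -m` counts characters rather than bytes, and "including spaces" is taken to mean one space between consecutive words. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: count characters in plain text read from stdin.
# Function names are illustrative, not part of any tool.

# Characters excluding all whitespace (spaces, tabs, newlines, form feeds).
count_without_spaces() {
    tr -d '[:space:]' | wc -m | tr -d ' '
}

# Characters including spaces, counting every run of whitespace
# between two words as a single space (an assumed convention).
count_with_spaces() {
    text=$(cat)
    visible=$(printf '%s' "$text" | count_without_spaces)
    words=$(printf '%s' "$text" | wc -w | tr -d ' ')
    if [ "$words" -eq 0 ]; then
        echo 0
    else
        echo $((visible + words - 1))
    fi
}
```

It might be used as `pdftotext thesis.pdf - | count_with_spaces`, where `thesis.pdf` is a placeholder name and the `-` makes pdftotext write to standard output. Running headers, footers and page numbers in the PDF end up in the count as well.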

OTOH, here's a relevant question (and a simple answer) on SO.  (It seems
to count newlines, though.)

JFF, I've just coded this in Emacs Lisp:

--8<---cut here---start->8---
;; Count non-blank characters in a buffer

(defun how-many-visible-chars ()
  "Count visible (i.e., other than spaces, tabs and newlines)
characters in the buffer."
  (interactive)
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (not (eobp))
        (unless (looking-at-p "[ \t\n]")
          (setq count (1+ count)))
        (forward-char)))
    (message "%d visible characters" count)))
--8<---cut here---end--->8---

It's terribly unoptimized, but I ran it on a 300+ kB file on my low-end
netbook and it ran in something like 2 seconds, so it's not that bad in
practice.  Also, it's not well-coded: it should e.g. return the number
instead of displaying the message when called non-interactively, it
might take active region into account etc. - but as a proof-of-concept,
it works surprisingly well (i.e., fast).

> Greetings Jörg

Best,

-- 
Marcin Borkowski
http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski
Faculty of Mathematics and Computer Science
Adam Mickiewicz University

Re: [NTG-context] PDF document statistics (character count incl. spaces)?

2015-02-02 Thread Hans Hagen

On 2/2/2015 4:39 PM, Alan BRASLAU wrote:


> > ConTeXt has an option to count the words (you find the result in
> > .words) in a document, but words shorter than four letters aren’t
> > taken into account.
>
> word length under 4 characters  :   10
> word length >= 4 chars :   20
>
> here you are missing a third of the words! That is about 33%


this feature relates to (simple) spell checking and collecting words for
dedicated spell check lists, and anything under 4 chars is nearly always
a valid word, which is why we discard them


Hans

-
  Hans Hagen | PRAGMA ADE
  Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com
 | www.pragma-pod.nl
-

Re: [NTG-context] PDF document statistics (character count incl. spaces)?

2015-02-02 Thread Alan BRASLAU
On Mon, 2 Feb 2015 10:20:15 +0100
Keith Schultz  wrote:

> Hello All,
> 
> As a linguist, I can say that not counting words that are shorter is
> an absolute NO-GO for an accurate word count and thereby character
> count!
> 
> See below, for a non representative proof !
> 
> > On 01.02.2015 at 22:12, Wolfgang Schuster wrote:
> >
> [snip, snip]
>
> > ConTeXt has an option to count the words (you find the result in
> > .words) in a document, but words shorter than four letters aren’t
> > taken into account.
> word length under 4 characters  :   10
> word length >= 4 chars :   20
>
> here you are missing a third of the words! That is about 33%
> 
> regards
>   Keith



See also:
Zipf, G. K. (1949), "Human Behavior and the Principle of Least Effort",
Cambridge, MA: Addison-Wesley.

in particular, Chapter 2: On the Economy of Words.


As well as:
Shannon, C. E. (1951), "The redundancy of English", Cybernetics,
248-272.

54% for English, so we can afford to be sloppy (wch s wy txt compr qte
ll).


Alan

Re: [NTG-context] PDF document statistics (character count incl. spaces)?

2015-02-02 Thread Keith Schultz
Hello All,

As a linguist, I can say that not counting words that are shorter is an 
absolute NO-GO
for an accurate word count and thereby character count!

See below, for a non representative proof !

> On 01.02.2015 at 22:12, Wolfgang Schuster wrote:
>
[snip, snip]

> ConTeXt has an option to count the words (you find the result in
> .words) in a document, but words shorter than four letters aren’t
> taken into account.

word length under 4 characters  :   10
word length >= 4 chars :   20

here you are missing a third of the words! That is about 33%
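
A split like the one above can be reproduced for any plain-text input with a rough sketch such as the following. It does not reproduce whatever logic ConTeXt applies when writing its .words file; the function name and the default threshold of 4 are assumptions for illustration, punctuation (including hyphens and apostrophes) is simply deleted, and some awk implementations measure length in bytes, so non-ASCII words may come out longer than expected.

```shell
#!/bin/sh
# Sketch: count words shorter than a threshold versus all other words.
# Usage: word_length_split [min_length] < text
# Prints "<short> <long>"; min_length defaults to 4 (illustrative choice).

word_length_split() {
    # 1. delete punctuation so that "home." counts as 4 letters
    # 2. put one word per line
    # 3. tally words below the threshold and words at or above it
    tr -d '[:punct:]' |
    tr -s '[:space:]' '\n' |
    awk -v min="${1:-4}" '
        length($0) > 0 {
            if (length($0) < min + 0) n_short++; else n_long++
        }
        END { printf "%d %d\n", n_short, n_long }'
}
```

For instance, `word_length_split < chapter.txt` (with `chapter.txt` a placeholder file) prints the two counts, from which the share of words a four-letter cutoff would skip follows directly.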

regards
Keith


Re: [NTG-context] PDF document statistics (character count incl. spaces)?

2015-02-01 Thread Hans Hagen

On 2/1/2015 10:06 PM, Jörg Weger wrote:

> Does the character count that “wc --char ” returns include blank
> spaces or not? (That is important for me.) “man wc” doesn’t say.
>
> I had hoped there was a better way than to edit the result of
> “pdftotext” in my text editor or in LibreOffice Writer (deleting
> unnecessary carriage returns and spaces with regular-expression
> searches), both of which can do the count I need. In fact I had hoped
> that ConTeXt could count the characters and spaces it renders to
> PDF (is that theoretically possible?) …


it's not too hard so maybe when i'm bored or see a good reason ..

Hans

---
  Hans Hagen | PRAGMA ADE
  Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com
 | www.pragma-pod.nl
-

Re: [NTG-context] PDF document statistics (character count incl. spaces)?

2015-02-01 Thread Idris Samawi Hamid ادريس سماوي حامد
On Sun, 01 Feb 2015 15:11:48 -0700, Wolfgang Schuster wrote:

>> On 01.02.2015 at 22:32, Idris Samawi Hamid ادريس سماوي حامد wrote:
>>
>>> words shorter than four letters aren’t taken into account.
>>
>> I get *some* words shorter than four letters in the output, so there
>> must be some other logic going on…
>
> Do you have a few examples?


A quick one:

===
\setupspellchecking[state=start,method=2]

\starttext
Dār is the Arabic word for home.
\stoptext
===
--
Idris Samawi Hamid
Professor of Philosophy
Colorado State University
Fort Collins, CO 80523

Re: [NTG-context] PDF document statistics (character count incl. spaces)?

2015-02-01 Thread Wolfgang Schuster

> On 01.02.2015 at 22:32, Idris Samawi Hamid ادريس سماوي حامد wrote:
> 
>> words shorter than four letters aren’t taken into account.
> 
> I get *some* words shorter than four letters in the output, so there must be 
> some other logic going on…

Do you have a few examples?

Wolfgang

Re: [NTG-context] PDF document statistics (character count incl. spaces)?

2015-02-01 Thread Idris Samawi Hamid ادريس سماوي حامد
On Sun, 01 Feb 2015 14:12:54 -0700, Wolfgang Schuster wrote:

> \setupspellchecking[state=start,method=2]
> \starttext
> \input knuth
> \stoptext


Slightly off-topic: just as Wolfgang's reply came in, I was setting up a
new version of http://tinyspell.com/

Editor-based spell-checkers are usually not very useful (although some  
LaTeX-centric editors are pretty good at it.) I never knew about  
\setupspellchecking before now. Perhaps it could evolve into something  
very useful.


Part of spell-checking involves getting uppercase vs lowercase right. I  
see that the .words output of \setupspellchecking ignores case, and treats  
'-' (the simple dash) as a word separator. I'd like to see this evolve  
into something more precise.



> words shorter than four letters aren’t taken into account.


I get *some* words shorter than four letters in the output, so there must  
be some other logic going on...


Thanks for pointing out this utility, Wolfgang, and

Best wishes
Idris
--
Idris Samawi Hamid
Professor of Philosophy
Colorado State University
Fort Collins, CO 80523

Re: [NTG-context] PDF document statistics (character count incl. spaces)?

2015-02-01 Thread Wolfgang Schuster

> On 01.02.2015 at 22:06, Jörg Weger wrote:
>
> Does the character count that “wc --char ” returns include blank
> spaces or not? (That is important for me.) “man wc” doesn’t say.
>
> I had hoped there was a better way than to edit the result of “pdftotext” in
> my text editor or in LibreOffice Writer (deleting unnecessary carriage
> returns and spaces with regular-expression searches), both of which can do
> the count I need. In fact I had hoped that ConTeXt could count the
> characters and spaces it renders to PDF (is that theoretically possible?) …

ConTeXt has an option to count the words (you find the result in
.words) in a document, but words shorter than four letters aren’t
taken into account.

\setupspellchecking[state=start,method=2]

\starttext
\input knuth
\stoptext

Wolfgang

Re: [NTG-context] PDF document statistics (character count incl. spaces)?

2015-02-01 Thread Jörg Weger
Does the character count that “wc --char ” returns include blank
spaces or not? (That is important for me.) “man wc” doesn’t say.


I had hoped there was a better way than to edit the result of
“pdftotext” in my text editor or in LibreOffice Writer (deleting
unnecessary carriage returns and spaces with regular-expression
searches), both of which can do the count I need. In fact I had hoped
that ConTeXt could count the characters and spaces it renders to
PDF (is that theoretically possible?) …


Greetings Jörg

On 01.02.2015 20:11, Aditya Mahajan wrote:

> On Sun, 1 Feb 2015, Jörg Weger wrote:
>
>> Is there a way to report the “character count including spaces” of the
>> resulting PDF in ConTeXt?
>
> Given that these counts are never accurate, how about
>
>    pdftotext filename
>
> followed by
>
>    wc filename
>
> Aditya



Re: [NTG-context] PDF document statistics (character count incl. spaces)?

2015-02-01 Thread Aditya Mahajan

On Sun, 1 Feb 2015, Jörg Weger wrote:

> Is there a way to report the “character count including spaces” of the
> resulting PDF in ConTeXt?


Given that these counts are never accurate, how about

  pdftotext filename

followed by

   wc filename
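
On what wc reports: called without options it prints line, word and byte counts, and its character count (`-m`) includes spaces and newlines. The sketch below shows this on a made-up sample string; the function name is illustrative only.

```shell
#!/bin/sh
# Sketch: what wc counts for the input "a b" followed by a newline.
# tr -d ' ' only strips the padding that some wc implementations print.

sample_counts() {
    # characters, including the space and the trailing newline
    printf 'a b\n' | wc -m | tr -d ' '
    # words
    printf 'a b\n' | wc -w | tr -d ' '
    # lines
    printf 'a b\n' | wc -l | tr -d ' '
}
```

Calling `sample_counts` prints 4, 2 and 1 on separate lines. Note that `wc -c` counts bytes rather than characters, so for UTF-8 text with accented letters `-c` and `-m` give different numbers.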

Aditya

[NTG-context] PDF document statistics (character count incl. spaces)?

2015-02-01 Thread Jörg Weger
Is there a way to report the “character count including spaces” of the 
resulting PDF in ConTeXt?


Greetings Jörg