Re: Handy word list program for indexing

2007-03-12 Thread Alan G Isaac
Did Steve's question about order ever get answered?
I think he wanted something like this?

fmt -1 file_name | sort | uniq -c | sort -dk2 | sort -srnk1

Cheers,
Alan Isaac

PS Here's another Python implementation, which adds a couple
features: minimum frequency and minimum size requirements.
(Also word counts.) Public domain.

import sys,string

chars2strip = string.punctuation

word_hash = dict()
CT_ALLWORDS = 0
CT_WORDS = 0
WORDSIZE_MIN = 3
FREQ_MIN = 2
for line in sys.stdin:
    line = line.strip()
    for word in line.split():
        word = word.strip(chars2strip)
        if word:
            CT_ALLWORDS += 1
            if len(word) >= WORDSIZE_MIN:
                CT_WORDS += 1
                word_hash[word] = word_hash.get(word,0) + 1

print "="
print "=== WORD COUNT ==="
print "="

print "Total number of words: %d"%(CT_ALLWORDS)
print "Total number of words (len >= %d): %d"%(WORDSIZE_MIN, CT_WORDS)

print "="
print "=== ALPHA ORDER ==="
print "="

for key in sorted(word_hash):
    if word_hash[key] >= FREQ_MIN:
        print "%24s %6d"%(key, word_hash[key])

print "="
print "=== OCCURRENCE ORDER ==="
print "="

# count descending, then word ascending (case-insensitive)
for word, freq in sorted(word_hash.iteritems(), cmp=lambda a,b:
        cmp((-a[1],a[0].lower()),(-b[1],b[0].lower()))):
    if freq >= FREQ_MIN:
        print "%7d   %s"%(freq,word)
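
The script above is Python 2 (print statements, dict.iteritems(), the cmp= argument to sorted()). For readers on Python 3, the same idea might be sketched roughly as follows. This is only an illustration, not Alan's code: the function name word_report, its return shape, and the use of collections.Counter are assumptions made here, while the two thresholds mirror WORDSIZE_MIN and FREQ_MIN.

```python
import string
from collections import Counter


def word_report(lines, wordsize_min=3, freq_min=2):
    """Return (total_words, counted_words, alpha_rows, freq_rows).

    Hypothetical helper: the name and return shape are illustrative only.
    """
    counts = Counter()
    total = 0
    for line in lines:
        for word in line.split():
            word = word.strip(string.punctuation)
            if not word:
                continue
            total += 1
            # Only words meeting the minimum size are tallied.
            if len(word) >= wordsize_min:
                counts[word] += 1
    # Keep only words that occur often enough.
    kept = [(w, n) for w, n in counts.items() if n >= freq_min]
    alpha_rows = sorted(kept)
    # Count descending, then word ascending (case-insensitive).
    freq_rows = sorted(kept, key=lambda wn: (-wn[1], wn[0].lower()))
    return total, sum(counts.values()), alpha_rows, freq_rows


total, counted, alpha_rows, freq_rows = word_report(
    ["the cat, the dog; a Cat", "dog the"])
for word, freq in freq_rows:
    print("%7d   %s" % (freq, word))
```

To process a real file, passing sys.stdin (or any open text file) as the lines argument should work, since the function only iterates over lines.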





Re: Handy word list program for indexing

2007-03-09 Thread Michael Wojcik

Steve Litt wrote:


> fmt -1 < tsjustfacts.txt  |  sed -e "s/^[[:space:][:punct:]]*//" | 
> sed -e "s/[[:space:][:punct:]]*$//" | tr [:upper:] [:lower:] |  sort | 
> uniq -c | sort -rn
>
> The one thing this doesn't do is, upon final sort, sort by count descending 
> but name ascending. Can you think of a way to do that with standard Linux 
> commands?


Do you mean you want a sort key composed of the count, sorted 
numerically and descending, and the name, sorted lexically and 
ascending?  So that words with the same count will be grouped together 
in the output and, within that group, sorted lexically?


Change the final sort to specify a multipart key:

   sort -k 1nr -k 2

That says "sort by a key composed of the first field, taken as numeric, 
in reverse order; and the second field, using the default options 
(lexicographic and ascending)".
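
For anyone who wants to see the two-part key spelled out, here is a rough Python illustration. It is a sketch under assumptions, not a description of how sort(1) works internally: it parses lines shaped like uniq -c output (a count, then a word), the helper name sort_counts is made up, and Python compares strings by code point rather than by locale collation.

```python
def sort_counts(lines):
    """Order 'count word' lines by count (descending), then word (ascending)."""
    rows = []
    for line in lines:
        # Split off the leading count; the rest of the line is the word.
        count, word = line.split(None, 1)
        rows.append((int(count), word.strip()))
    # Negating the count reverses only the first key part, as "nr" does in -k 1nr.
    rows.sort(key=lambda row: (-row[0], row[1]))
    return rows


sample = ["   2 pear", "   5 the", "   2 apple", "   5 and"]
for count, word in sort_counts(sample):
    print("%7d %s" % (count, word))
```

Words with the same count ("and"/"the", "apple"/"pear") come out grouped together and alphabetical within the group.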


This syntax is standard for sort(1) as of SUSv3, by the way - it's not 
specific to Linux.


--
Michael Wojcik



Re: Handy word list program for indexing

2007-03-06 Thread Jeremy C. Reed
> The one thing this doesn't do is, upon final sort, sort by count descending 
> but name ascending. Can you think of a way to do that with standard Linux 
> commands?

I am not sure I understand (or maybe I should read this again when I 
wake up :)

Can you give a short example?


Re: Handy word list program for indexing

2007-03-06 Thread William Adams
While such utilities can be useful for the naïve user, they don't  
result in an index, so much as a concordance, and the difference  
between the two should be kept in mind.

Rather than relying on such, if the project and budget warrant it,  
far better to employ a human indexer (who is _not_ also the author).

William

-- 
William Adams
senior graphic designer
Fry Communications



This email message and any files transmitted with it contain information
which is confidential and intended only for the addressee(s). If you are
not the intended recipient(s), any usage,  dissemination, disclosure, or
action taken in  reliance on it is prohibited.  The reliability of  this
method of communication cannot be guaranteed.  Email can be intercepted,
corrupted, delayed, incompletely transmitted, virus-laden,  or otherwise
affected during transmission. Reasonable steps have been taken to reduce
the risk of viruses, but we cannot accept liability for damage sustained
as a result of this message. If you have received this message in error,
please immediately delete it and all copies of it and notify the sender.


Re: Handy word list program for indexing

2007-03-06 Thread Steve Litt
On Tuesday 06 March 2007 02:58, [EMAIL PROTECTED] wrote:
> On Tue, 6 Mar 2007, Steve Litt wrote:
> > Indexing is the most distasteful, boring, and tedious part of writing a
> > book. Making word lists like this at least makes it a brainless
> > activity.
>
> I've linked to this thread from the following page
>
>   http://wiki.lyx.org/Tips/Indexing

Unable to connect
Firefox can't establish a connection to the server at wiki.lyx.org.

SteveT


Re: Handy word list program for indexing

2007-03-06 Thread Steve Litt
On Tuesday 06 March 2007 08:25, William Adams wrote:
> While such utilities can be useful for the naïve user, they don't
> result in an index, so much as a concordance, and the difference
> between the two should be kept in mind.
>
> Rather than relying on such, if the project and budget warrant it,
> far better to employ a human indexer (who is _not_ also the author).
>
> William

There's budget for a human indexer, as long as the indexer is me (the author). 
So as the human indexer, how do I make this thing an index instead of a 
concordance? My plan is to use the word list program to make sure I don't 
leave out things that shouldn't be left out, not to give every term page 
numbers.

How do I make it a real index?

Thanks

SteveT

Steve Litt
Author: Universal Troubleshooting Process books and courseware
http://www.troubleshooters.com/


Re: Handy word list program for indexing

2007-03-06 Thread William Adams
On Mar 6, 2007, at 8:57 AM, Steve Litt wrote:

> There's budget for a human indexer, as long as the indexer is me  
> (the author).

Got it.

> So as the human indexer, how do I make this thing an index instead  
> of a
> concordance?

A concordance is just a list of words in a document w/ reference to  
where they occur. An index is a structured, ordered list of the  
concepts and ideas and terminology in a document which allows one to  
determine if a desired bit of information is present in a document,  
and if so, where to find it.
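
To make the distinction concrete: a concordance in this sense can be produced mechanically. The Python sketch below is only an illustration; the function name build_concordance and the use of line numbers as "where they occur" are assumptions. It maps each word to the lines on which it appears. No comparable few lines produce an index, because choosing and grouping concepts takes judgment.

```python
import string
from collections import defaultdict


def build_concordance(lines):
    """Map each lowercased word to the sorted line numbers where it occurs."""
    locations = defaultdict(set)
    for lineno, line in enumerate(lines, start=1):
        for token in line.split():
            word = token.strip(string.punctuation).lower()
            if word:
                locations[word].add(lineno)
    return {word: sorted(nums) for word, nums in locations.items()}


text = ["Check the fuse first.", "If the fuse is good, check the relay."]
for word, nums in sorted(build_concordance(text).items()):
    print(word, nums)
```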

> My plan is to use the word list program to make sure I don't
> leave out things that shouldn't be left out, not to give every term  
> page
> numbers.

Okay.

> How do I make it a real index?

The traditional thing to do is to read the text twice, once to  
familiarize yourself w/ it and to make notes on what people might  
need / want to look for, the second time, to flag all terms /  
concepts as desired (usually using post it notes, or index cards).

You may want to look up tools like the showidx package which will  
help you to consider the index as you're working w/ the text.

William

-- 
William Adams
senior graphic designer
Fry Communications





Handy word list program for indexing

2007-03-05 Thread Steve Litt
Hi all,

In preparation to create my index for my book, I created a Ruby program to 
list every word in a file (in this case the .lyx file).

Now of course this could be done with a simple one-liner using sed and 
sort -u, but my program lists the words in 2 different orders, first in alpha 
order, which of course could be done by the 1 liner, and then in descending 
order of occurrence, which can't be.

The order of occurrence is very handy because the most frequently occurring 
words are usually garbage like "the", "he", "a" and the like. Therefore, you 
can scan and delete those words very quickly.

The words used only once or twice comprise the majority of words, and because 
they're used only once or twice, they're typically not important and you can 
scan them very quickly.

The words in the middle typically include many that are useful in constructing 
an index, and should be perused more carefully.
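
The three bands described above could be separated automatically once the counts exist. The following Python sketch is an illustration only: the function name split_by_frequency and both cutoffs are arbitrary assumptions that would need tuning for any particular book.

```python
from collections import Counter


def split_by_frequency(counts, low=2, high=50):
    """Split a word->count mapping into (common, middle, rare) word lists.

    The cutoffs are placeholders, not recommendations.
    """
    common = sorted(w for w, n in counts.items() if n >= high)
    rare = sorted(w for w, n in counts.items() if n <= low)
    middle = sorted(w for w, n in counts.items() if low < n < high)
    return common, middle, rare


counts = Counter({"the": 120, "fuse": 9, "relay": 7, "gasket": 1})
common, middle, rare = split_by_frequency(counts)
print("candidates to read closely:", middle)
```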

My program, which is written in Ruby, is licensed GNU GPL version 2, and is 
included as the remainder of the body of this document. Have fun with it.

SteveT



#!/usr/bin/ruby
# Copyright (C) 2007 by Steve Litt, all rights reserved
# This program is licensed under the GNU GPL version 2 -- only version 2

require 'set'

# Punctuation stripped from the start and end of each word
$punct=Set.new([",", ".", "/", "<", ">", "?", ";", "'", ":", '"', "[", "]",
    "{", "}", "|", "`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_",
    "+"])

# Sort by frequency descending, then by word ascending
def by_freq_then_name(a, b)
    if a[1] < b[1]
        return 1
    elsif a[1] > b[1]
        return -1
    elsif a[0] > b[0]
        return 1
    elsif a[0] < b[0]
        return -1
    else
        return 0
    end
end


word_hash = Hash.new()
STDIN.each do
    |line|
    line.chomp!
    line.strip!
    temparr = line.split(/\s\s*/)
    temparr.each do
        |word|
        while word.length > 0 and $punct.include?(word[0].chr)
            word = word[1..-1]
        end
        while word.length > 0 and $punct.include?(word[-1].chr)
            word = word[0..-2]
        end
        # Skip tokens that were nothing but punctuation
        next if word.empty?

        if word_hash.has_key?(word)
            word_hash[word] += 1
        else
            word_hash[word] = 1
        end
    end
end

puts "="
puts "=== ALPHA ORDER ==="
puts "="

keys = word_hash.keys.sort
keys.each do
    |key|
    printf "%24s %6d\n", key, word_hash[key]
end

puts "="
puts "=== OCCURRENCE ORDER ==="
puts "="

temparray = word_hash.sort{|a,b| by_freq_then_name(a, b)}
temparray.each do
    |word_freq|
    printf "%7d   %s\n", word_freq[1], word_freq[0]
end



Re: Handy word list program for indexing

2007-03-05 Thread Jeremy C. Reed
On Mon, 5 Mar 2007, Steve Litt wrote:

> In preparation to create my index for my book, I created a Ruby program to 
> list every word in a file (in this case the .lyx file).
> 
> Now of course this could be done with a simple one-liner using sed and 
> sort -u, but my program lists the words in 2 different orders, first in alpha 
> order, which of course could be done by the 1 liner, and then in descending 
> order of occurrence, which can't be.

fmt -1 | sort | uniq -c | sort -rn

Also could add some tr and sed to clean out junk spacing and to lowercase 
everything.

By the way, I did something similar when doing some indexing.

Another thing I used is a spell checker -- words unknown to my dictionary 
I made sure were in the index.
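
The spell-checker trick can be approximated with a set difference against a word list. The Python sketch below is an illustration under assumptions: the function name unknown_words is made up, and the tiny in-line dictionary stands in for whatever word list or spell-checker dictionary you actually have.

```python
import string


def unknown_words(lines, dictionary):
    """Return the sorted words in lines that are absent from dictionary."""
    known = {w.lower() for w in dictionary}
    found = set()
    for line in lines:
        for token in line.split():
            word = token.strip(string.punctuation)
            # Compare case-insensitively, but report the word as written.
            if word and word.lower() not in known:
                found.add(word)
    return sorted(found)


dictionary = ["check", "the", "before", "you", "replace"]
print(unknown_words(["Check the solenoid before you replace the ECU."],
                    dictionary))
```

Words that fall out of this filter tend to be jargon, acronyms and proper nouns, which is presumably why they are worth checking against the index.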

  Jeremy C. Reed


Re: Handy word list program for indexing

2007-03-05 Thread Steve Litt
Hi Jeremy,

On Monday 05 March 2007 21:05, Jeremy C. Reed wrote:
> On Mon, 5 Mar 2007, Steve Litt wrote:
> > In preparation to create my index for my book, I created a Ruby program
> > to list every word in a file (in this case the .lyx file).
> >
> > Now of course this could be done with a simple one-liner using sed and
> > sort -u, but my program lists the words in 2 different orders, first in
> > alpha order, which of course could be done by the 1 liner, and then in
> > descending order of occurrence, which can't be.
>
> fmt -1 | sort | uniq -c | sort -rn

Knowing that would have saved me two hours :-) I wasn't familiar with the fmt 
and the uniq commands. Thanks.

> Also could add some tr and sed to clean out junk spacing and to lowercase
> everything.

Yes. Here's my final answer, merging everything into lower case, and blowing 
off leading and trailing space and punctuation:

fmt -1 < tsjustfacts.txt  |  sed -e "s/^[[:space:][:punct:]]*//" | 
sed -e "s/[[:space:][:punct:]]*$//" | tr [:upper:] [:lower:] |  sort | 
uniq -c | sort -rn

That's sweet. Simpler than the Ruby, and probably faster, especially on 
multicore/multiprocessor machines. Thanks!

The one thing this doesn't do is, upon final sort, sort by count descending 
but name ascending. Can you think of a way to do that with standard Linux 
commands?

Another filter that might be useful in this chain is:

grep -v "^[[:space:][:digit:][:punct:]]*$"

In other words, if a line is consumed with nothing but space, digits and 
punctuation, it's probably not an index candidate and can be deleted, saving 
future processing and reducing extraneous output.
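
For comparison, the same filter could be sketched in Python. This is an approximation, not an exact equivalent: grep's [:punct:] and [:space:] classes depend on the locale, whereas the sketch uses ASCII punctuation from string.punctuation, and the names NOISE and drop_noise are made up for illustration.

```python
import re
import string

# Matches lines made up only of whitespace, digits and ASCII punctuation
# (or nothing at all), roughly like the bracket expression above.
NOISE = re.compile(r"^[\s\d%s]*$" % re.escape(string.punctuation))


def drop_noise(lines):
    """Keep only lines containing at least one other character."""
    return [line for line in lines if not NOISE.match(line)]


print(drop_noise(["fuse", "1,234", "--", "", "3rd", "x-ray"]))
```

Note that "3rd" and "x-ray" survive because each contains at least one letter; only lines with no letters at all are dropped.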

I'm not sure whether it's a good idea to lowercase everything. I think 
sometimes case serves as a reminder of the meaning of a word. To not force 
everything to lower case, simply remove the tr [:upper:] [:lower:].


> By the way, I did something similar when doing some indexing.

Indexing is the most distasteful, boring, and tedious part of writing a book. 
Making word lists like this at least makes it a brainless activity.

Thanks

SteveT

Steve Litt
Author: Universal Troubleshooting Process books and courseware
http://www.troubleshooters.com/


Re: Handy word list program for indexing

2007-03-05 Thread christian . ridderstrom

On Tue, 6 Mar 2007, Steve Litt wrote:

> Indexing is the most distasteful, boring, and tedious part of writing a 
> book. Making word lists like this at least makes it a brainless 
> activity.


I've linked to this thread from the following page

http://wiki.lyx.org/Tips/Indexing

Maybe you could copy the useful script snippets to this page?

Best regards
/Christian

--
Christian Ridderström, +46-8-768 39 44   http://www.md.kth.se/~chr

Handy word list program for indexing

2007-03-05 Thread Steve Litt
Hi all,

In preparation to create my index for my book, I created a Ruby program to 
list every word in a file (in this case the .lyx file).

Now of course this could be done with a simple one-liner using sed and 
sort -u, but my program lists the words in 2 different orders, first in alpha 
order, which of course could be done by the 1 liner, and then in descending 
order of occurrence, which can't be.

The order of occurrence is very handy because the most occurring words are 
usually garbage like the he, a and the like. Therefore, you can scan and 
delete those words very quickly.

The words used only once or twice comprise the majority of words, and because 
they're used only once or twice, they're typically not important and you can 
scan them very quickly.

The words in the middle typically contain many words useful in construction of 
an index, and should be perused more quickly.

My program, which is written in Ruby, is licensed GNU GPL version 2, and is 
included as the remainder of the body of this document. Have fun with it.

SteveT



#!/usr/bin/ruby
# Copyright (C) 2007 by Steve Litt, all rights reserved
# This program is licensed under the GNU GPL version 2 -- only version 2

require 'set'

$punct=Set.new([,, ., /, , , ?, ;, ', :, '', [, ], 
{, }, |, `, ~, !, @, #, $, %, ^, , *, (, ), _, 
+])

def by_freq_then_name(a, b)
if a[1]  b[1]
return 1
elsif a[1]  b[1]
return -1
elsif a[0]  b[0]
return 1
elsif a[0]  b[0]
return -1
else
return 0
end
end


word_hash = Hash.new()
word_hash['junk'] = 25
STDIN.each do
|line|
line.chomp!
line.strip!
temparr = line.split(/\s\s*/)
temparr.each do
|word|
while word.length  0 and $punct.include?(word[0].chr)
word = word[1..-1]
end
while word.length  0 and $punct.include?(word[-1].chr) 
word = word[0..-2]
end

if word_hash.has_key?(word)
word_hash[word] += 1
else
word_hash[word] = 1
end
end
end

puts =
puts === ALPHA ORDER =
puts =

keys = word_hash.keys.sort
keys.each do
|key|
printf %24s %6d\n, key, word_hash[key]
end

puts =
puts  OCCURRENCE ORDER ===
puts =

temparray = word_hash.sort{|a,b| by_freq_then_name(a, b)}
temparray.each do
|word_freq|
printf %7d   %s\n, word_freq[1], word_freq[0]
end



Re: Handy word list program for indexing

2007-03-05 Thread Jeremy C. Reed
On Mon, 5 Mar 2007, Steve Litt wrote:

 In preparation to create my index for my book, I created a Ruby program to 
 list every word in a file (in this case the .lyx file).
 
 Now of course this could be done with a simple one-liner using sed and 
 sort -u, but my program lists the words in 2 different orders, first in alpha 
 order, which of course could be done by the 1 liner, and then in descending 
 order of occurrence, which can't be.

fmt -1 | sort | uniq -c | sort -rn

Also could had some tr and sed to clean out junk spacing and to lowercase 
everything.

By the way, I did something similar when doing some indexing.

Another thing I used is a spell checker -- words unknown to my dictionary 
I made sure were in the index.

  Jeremy C. Reed


Re: Handy word list program for indexing

2007-03-05 Thread Steve Litt
Hi Jeremy,

On Monday 05 March 2007 21:05, Jeremy C. Reed wrote:
 On Mon, 5 Mar 2007, Steve Litt wrote:
  In preparation to create my index for my book, I created a Ruby program
  to list every word in a file (in this case the .lyx file).
 
  Now of course this could be done with a simple one-liner using sed and
  sort -u, but my program lists the words in 2 different orders, first in
  alpha order, which of course could be done by the 1 liner, and then in
  descending order of occurrence, which can't be.

 fmt -1 | sort | uniq -c | sort -rn

Knowing that would have saved me two hours :-) I wasn't familiar with the fmt 
and the uniq commands. Thanks.

 Also could had some tr and sed to clean out junk spacing and to lowercase
 everything.

Yes. Here's my final answer, merging everything into lower case, and blowing 
off leading and trailing space and punctuation:

fmt -1  tsjustfacts.txt  |  sed -e s/^[[:space:][:punct:]]*// | 
sed -e s/[[:space:][:punct:]]*$// | tr [:upper:] [:lower:] |  sort | 
uniq -c | sort -rn

That's sweet. Simpler than the Ruby, and probably faster, expecially on 
multicore/multiprocessor machines. Thanks!

The one thing this doesn't do is, upon final sort, sort by count descending 
but name ascending. Can you think of a way to do that with standard Linux 
commands?
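
For reference, the ordering asked about here (count descending, then word ascending) can be sketched in a few lines of Python. This is only an illustration of the sort key, not an answer in standard Linux commands; the function name count_then_name and the sample words are made up for the example.

```python
from collections import Counter


def count_then_name(words):
    """Return (word, count) pairs, highest count first, ties in alpha order."""
    counts = Counter(words)
    # Negating the count lets one ascending sort produce descending counts
    # while the word itself still sorts ascending within each count.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = "pear apple pear fig apple pear kiwi".split()
    for word, count in count_then_name(sample):
        print("%7d   %s" % (count, word))
```

The same idea (one composite key instead of two separate sorts) is what the comparator in the Ruby program does by hand.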

Another filter that might be useful in this chain is:

grep -v '^[[:space:][:digit:][:punct:]]*$'

In other words, if a line consists of nothing but space, digits and 
punctuation, it's probably not an index candidate and can be deleted, saving 
future processing and reducing extraneous output.
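
A rough Python equivalent of that filter, as a sketch only: it assumes ASCII text, where string.punctuation covers the same characters as [:punct:], and the name index_candidates is invented for the example.

```python
import re
import string

# Matches text made up only of whitespace, digits and ASCII punctuation
# (including the empty string), approximating the grep pattern above.
_JUNK = re.compile(r"[\s\d%s]*" % re.escape(string.punctuation))


def index_candidates(lines):
    """Keep only the lines that contain something besides junk characters."""
    return [line for line in lines if not _JUNK.fullmatch(line)]


if __name__ == "__main__":
    for line in index_candidates(["word", "1234", "--", "", "x2", "3.14"]):
        print(line)
```

Like grep -v with that pattern, this also drops empty lines, since the pattern matches the empty string.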

I'm not sure whether it's a good idea to lowercase everything. I think 
sometimes case serves as a reminder of the meaning of a word. To not force 
everything to lower case, simply remove the tr [:upper:] [:lower:].


> By the way, I did something similar when doing some indexing.

Indexing is the most distasteful, boring, and tedious part of writing a book. 
Making word lists like this at least makes it a brainless activity.

Thanks

SteveT

Steve Litt
Author: Universal Troubleshooting Process books and courseware
http://www.troubleshooters.com/


Re: Handy word list program for indexing

2007-03-05 Thread christian . ridderstrom

On Tue, 6 Mar 2007, Steve Litt wrote:

> Indexing is the most distasteful, boring, and tedious part of writing a 
> book. Making word lists like this at least makes it a brainless 
> activity.


I've linked to this thread from the following page

http://wiki.lyx.org/Tips/Indexing

Maybe you could copy the useful script snippets to this page?

Best regards
/Christian

--
Christian Ridderström, +46-8-768 39 44   http://www.md.kth.se/~chr

Handy word list program for indexing

2007-03-05 Thread Steve Litt
Hi all,

In preparation to create my index for my book, I created a Ruby program to 
list every word in a file (in this case the .lyx file).

Now of course this could be done with a simple one-liner using sed and 
sort -u, but my program lists the words in 2 different orders, first in alpha 
order, which of course could be done by the 1 liner, and then in descending 
order of occurrence, which can't be.

The order of occurrence is very handy because the most frequently occurring 
words are usually garbage like "the", "he", "a" and the like. Therefore, you 
can scan and delete those words very quickly.

The words used only once or twice comprise the majority of words, and because 
they're used only once or twice, they're typically not important and you can 
scan them very quickly.

The words in the middle typically contain many words useful in construction of 
an index, and should be perused more carefully.
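
The three bands described above could be separated mechanically. A minimal Python sketch, under stated assumptions: rare_max=2 follows the "once or twice" remark, while common_min=50 is an arbitrary placeholder that would need tuning per book. The function name and the sample counts are invented for illustration.

```python
def split_by_frequency(counts, rare_max=2, common_min=50):
    """Split a {word: count} mapping into (common, middle, rare) word lists.

    rare_max and common_min are assumed cut-offs, not values from the thread.
    """
    common, middle, rare = [], [], []
    for word in sorted(counts):
        n = counts[word]
        if n >= common_min:
            common.append(word)   # likely filler words: skim and delete
        elif n <= rare_max:
            rare.append(word)     # used once or twice: skim
        else:
            middle.append(word)   # most index candidates are expected here
    return common, middle, rare


if __name__ == "__main__":
    demo = {"the": 120, "he": 75, "troubleshooting": 14,
            "bottleneck": 6, "zebra": 1}
    common, middle, rare = split_by_frequency(demo)
    print("common:", common)
    print("middle:", middle)
    print("rare:  ", rare)
```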

My program, which is written in Ruby, is licensed GNU GPL version 2, and is 
included as the remainder of the body of this document. Have fun with it.

SteveT



#!/usr/bin/ruby
# Copyright (C) 2007 by Steve Litt, all rights reserved
# This program is licensed under the GNU GPL version 2 -- only version 2

require 'set'

$punct=Set.new([",", ".", "/", "<", ">", "?", ";", "'", ":", '"', "[", "]", 
"{", "}", "|", "`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", 
"+"])

# Sort comparator: higher count first, then word in ascending order
def by_freq_then_name(a, b)
  if a[1] < b[1]
    return 1
  elsif a[1] > b[1]
    return -1
  elsif a[0] > b[0]
    return 1
  elsif a[0] < b[0]
    return -1
  else
    return 0
  end
end


# Count every word read from standard input
word_hash = Hash.new()
STDIN.each do
  |line|
  line.chomp!
  line.strip!
  temparr = line.split(/\s\s*/)
  temparr.each do
    |word|
    # Trim leading punctuation
    while word.length > 0 and $punct.include?(word[0].chr)
      word = word[1..-1]
    end
    # Trim trailing punctuation
    while word.length > 0 and $punct.include?(word[-1].chr)
      word = word[0..-2]
    end

    # Skip tokens that were nothing but punctuation
    next if word.empty?

    if word_hash.has_key?(word)
      word_hash[word] += 1
    else
      word_hash[word] = 1
    end
  end
end

puts "="
puts "=== ALPHA ORDER ==="
puts "="

keys = word_hash.keys.sort
keys.each do
  |key|
  printf "%24s %6d\n", key, word_hash[key]
end

puts "="
puts "=== OCCURRENCE ORDER ==="
puts "="

temparray = word_hash.sort{|a,b| by_freq_then_name(a, b)}
temparray.each do
  |word_freq|
  printf "%7d   %s\n", word_freq[1], word_freq[0]
end


