Re: Handy word list program for indexing
Did Steve's question about order ever get answered? I think he wanted something like this?

    fmt -1 file_name | sort | uniq -c | sort -dk2 | sort -srnk1

Cheers,
Alan Isaac

PS Here's another Python implementation, which adds a couple of features: minimum frequency and minimum size requirements. (Also word counts.) Public domain.

    # Python 2 script: reads text on stdin and prints word counts.
    import sys, string

    chars2strip = string.punctuation
    word_hash = dict()
    CT_ALLWORDS = 0    # every non-empty word seen
    CT_WORDS = 0       # words that meet the minimum size
    WORDSIZE_MIN = 3
    FREQ_MIN = 2

    for line in sys.stdin:
        for word in line.split():
            word = word.strip(chars2strip)
            if word:
                CT_ALLWORDS += 1
                if len(word) >= WORDSIZE_MIN:
                    CT_WORDS += 1
                    word_hash[word] = word_hash.get(word, 0) + 1

    print "="
    print "=== WORD COUNT =="
    print "="
    print "Total number of words: %d" % (CT_ALLWORDS)
    print "Total number of words (len >= %d): %d" % (WORDSIZE_MIN, CT_WORDS)
    print "="
    print "=== ALPHA ORDER ="
    print "="
    for key in sorted(word_hash):
        if word_hash[key] >= FREQ_MIN:
            print "%24s %6d" % (key, word_hash[key])
    print "="
    print " OCCURRENCE ORDER ==="
    print "="
    # Count descending, then word ascending (case-insensitive).
    for word, freq in sorted(word_hash.iteritems(),
                             cmp=lambda a, b: cmp((-a[1], a[0].lower()),
                                                  (-b[1], b[0].lower()))):
        if freq >= FREQ_MIN:
            print "%7d %s" % (freq, word)
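For readers on Python 3, where the print statement, dict.iteritems() and the cmp= argument to sorted() are no longer available, the same idea might be sketched roughly as follows. This is only an illustration, not Alan's code: the function names word_counts and by_occurrence and the sample sentences are made up here, and the thresholds are passed as parameters rather than module constants.

```python
import string
from collections import Counter


def word_counts(lines, min_size=3):
    """Count words of at least min_size characters, ignoring edge punctuation."""
    counts = Counter()
    for line in lines:
        for raw in line.split():
            word = raw.strip(string.punctuation)
            if len(word) >= min_size:
                counts[word] += 1
    return counts


def by_occurrence(counts, min_freq=2):
    """Return (word, count) pairs: count descending, then word ascending."""
    kept = [(w, n) for w, n in counts.items() if n >= min_freq]
    # A tuple key replaces the old cmp= comparator.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0].lower()))


if __name__ == "__main__":
    # Hypothetical sample text; a real run would iterate over sys.stdin.
    sample = ["The index, the index!", "An index lists terms; terms matter."]
    for word, n in by_occurrence(word_counts(sample)):
        print("%7d %s" % (n, word))
```

Negating the count inside the key tuple gives "descending count, ascending word" in a single sort, which is what the cmp lambda in the Python 2 version expresses.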
Re: Handy word list program for indexing
Steve Litt wrote:

> fmt -1 < tsjustfacts.txt | sed -e "s/^[[:space:][:punct:]]*//" | sed -e "s/[[:space:][:punct:]]*$//" | tr [:upper:] [:lower:] | sort | uniq -c | sort -rn
>
> The one thing this doesn't do is, upon final sort, sort by count descending
> but name ascending. Can you think of a way to do that with standard Linux
> commands?

Do you mean you want a sort key composed of the count, sorted numerically and descending, and the name, sorted lexically and ascending? So that words with the same count will be grouped together in the output and, within that group, sorted lexically?

Change the final sort to specify a multipart key:

    sort -k 1nr -k 2

That says "sort by a key composed of the first field, taken as numeric, in reverse order; and the second field, using the default options (lexicographic and ascending)".

This syntax is standard for sort(1) as of SUSv3, by the way - it's not specific to Linux.

--
Michael Wojcik
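To make the multipart key concrete, here is a small Python sketch (not part of Michael's reply) that applies the same ordering to (count, word) pairs shaped like uniq -c output. It also shows the alternative of two passes, sorting by word first and then stable-sorting by count, which should give the same result. The function names and the sample pairs are invented for the illustration.

```python
def sort_multikey(pairs):
    """One pass with a composite key: count descending, then word ascending."""
    return sorted(pairs, key=lambda p: (-p[0], p[1]))


def sort_two_pass(pairs):
    """Two passes: sort by word, then stable-sort by count descending."""
    by_word = sorted(pairs, key=lambda p: p[1])
    # sorted() is stable, so pairs with equal counts stay in word order.
    return sorted(by_word, key=lambda p: -p[0])


if __name__ == "__main__":
    # Hypothetical (count, word) pairs, as "uniq -c" would report them.
    sample = [(2, "pear"), (5, "the"), (2, "apple"), (1, "zebra"), (5, "a")]
    for count, word in sort_multikey(sample):
        print("%7d %s" % (count, word))
```

The two-pass form only works because the second sort is stable; with an unstable sort the words inside each count group could come out in any order.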
Re: Handy word list program for indexing
> The one thing this doesn't do is, upon final sort, sort by count descending
> but name ascending. Can you think of a way to do that with standard Linux
> commands?

I am not sure I understand (or maybe I should read this again when I wake up :) Can you give a short example?
Re: Handy word list program for indexing
While such utilities can be useful for the naïve user, they don't result in an index, so much as a concordance, and the difference between the two should be kept in mind.

Rather than relying on such, if the project and budget warrant it, far better to employ a human indexer (who is _not_ also the author).

William

--
William Adams
senior graphic designer
Fry Communications

This email message and any files transmitted with it contain information which is confidential and intended only for the addressee(s). If you are not the intended recipient(s), any usage, dissemination, disclosure, or action taken in reliance on it is prohibited. The reliability of this method of communication cannot be guaranteed. Email can be intercepted, corrupted, delayed, incompletely transmitted, virus-laden, or otherwise affected during transmission. Reasonable steps have been taken to reduce the risk of viruses, but we cannot accept liability for damage sustained as a result of this message. If you have received this message in error, please immediately delete it and all copies of it and notify the sender.
Re: Handy word list program for indexing
On Tuesday 06 March 2007 02:58, [EMAIL PROTECTED] wrote:
> On Tue, 6 Mar 2007, Steve Litt wrote:
> > Indexing is the most distasteful, boring, and tedious part of writing a
> > book. Making word lists like this at least makes it a brainless
> > activity.
>
> I've linked to this thread from the following page
>
> http://wiki.lyx.org/Tips/Indexing

Unable to connect

Firefox can't establish a connection to the server at wiki.lyx.org.

SteveT
Re: Handy word list program for indexing
On Tuesday 06 March 2007 08:25, William Adams wrote:
> While such utilities can be useful for the naïve user, they don't
> result in an index, so much as a concordance, and the difference
> between the two should be kept in mind.
>
> Rather than relying on such, if the project and budget warrant it,
> far better to employ a human indexer (who is _not_ also the author).
>
> William

There's budget for a human indexer, as long as the indexer is me (the author).

So as the human indexer, how do I make this thing an index instead of a concordance? My plan is to use the word list program to make sure I don't leave out things that shouldn't be left out, not to give every term page numbers. How do I make it a real index?

Thanks

SteveT

Steve Litt
Author: Universal Troubleshooting Process books and courseware
http://www.troubleshooters.com/
Re: Handy word list program for indexing
On Mar 6, 2007, at 8:57 AM, Steve Litt wrote:

> There's budget for a human indexer, as long as the indexer is me
> (the author).

Got it.

> So as the human indexer, how do I make this thing an index instead
> of a concordance?

A concordance is just a list of words in a document w/ reference to where they occur.

An index is a structured, ordered list of the concepts and ideas and terminology in a document which allows one to determine if a desired bit of information is present in a document, and if so, where to find it.

> My plan is to use the word list program to make sure I don't
> leave out things that shouldn't be left out, not to give every term
> page numbers.

Okay.

> How do I make it a real index?

The traditional thing to do is to read the text twice, once to familiarize yourself w/ it and to make notes on what people might need / want to look for, the second time, to flag all terms / concepts as desired (usually using post it notes, or index cards).

You may want to look up tools like the showidx package which will help you to consider the index as you're working w/ the text.

William

--
William Adams
senior graphic designer
Fry Communications
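The definition of a concordance given above is mechanical enough to sketch in a few lines of Python, which may help show why it falls short of an index. The sketch below is an assumption-laden illustration: the function name concordance is made up, line numbers stand in for page numbers, and a "word" is taken to be any run of ASCII letters.

```python
import re
from collections import defaultdict


def concordance(lines):
    """Map each lowercased word to the sorted line numbers where it occurs."""
    where = defaultdict(set)
    for lineno, line in enumerate(lines, start=1):
        # Treat any run of ASCII letters as a word (a simplifying assumption).
        for word in re.findall(r"[A-Za-z]+", line):
            where[word.lower()].add(lineno)
    return {word: sorted(nums) for word, nums in where.items()}


if __name__ == "__main__":
    # Hypothetical two-line document.
    text = ["Check the fuse.", "The fuse may be blown."]
    for word, nums in sorted(concordance(text).items()):
        print(word, nums)
```

Every word gets an entry, including "the" and "be", and nothing groups related terms or records a concept that the text discusses without naming it. Those decisions are the part an indexer supplies.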
Re: Handy word list program for indexing
On Tue, 6 Mar 2007, Steve Litt wrote:

> Indexing is the most distasteful, boring, and tedious part of writing a
> book. Making word lists like this at least makes it a brainless activity.

I've linked to this thread from the following page

    http://wiki.lyx.org/Tips/Indexing

Maybe you could copy the useful script snippets to this page?

Best regards
/Christian

--
Christian Ridderström, +46-8-768 39 44 http://www.md.kth.se/~chr
Handy word list program for indexing
Hi all,

In preparation for creating the index for my book, I wrote a Ruby program to list every word in a file (in this case the .lyx file).

Now of course this could be done with a simple one-liner using sed and sort -u, but my program lists the words in 2 different orders: first in alpha order, which of course could be done by the one-liner, and then in descending order of occurrence, which can't be.

The order of occurrence is very handy because the most frequently occurring words are usually garbage like "the", "he", "a" and the like. Therefore, you can scan and delete those words very quickly. The words used only once or twice comprise the majority of words, and because they're used only once or twice, they're typically not important and you can scan them very quickly. The words in the middle typically include many words useful in constructing an index, and should be perused more carefully.

My program, which is written in Ruby, is licensed GNU GPL version 2, and is included as the remainder of the body of this message. Have fun with it.

SteveT

#!/usr/bin/ruby
# Copyright (C) 2007 by Steve Litt, all rights reserved
# This program is licensed under the GNU GPL version 2 -- only version 2

require 'set'

# Punctuation characters stripped from the start and end of each word
$punct = Set.new([",", ".", "/", "<", ">", "?", ";", "'", ":", '"', "[",
                  "]", "{", "}", "|", "`", "~", "!", "@", "#", "$", "%",
                  "^", "&", "*", "(", ")", "_", "+"])

# Comparator for [word, count] pairs:
# count descending, then word ascending
def by_freq_then_name(a, b)
  if a[1] < b[1]
    return 1
  elsif a[1] > b[1]
    return -1
  elsif a[0] > b[0]
    return 1
  elsif a[0] < b[0]
    return -1
  else
    return 0
  end
end

word_hash = Hash.new()

STDIN.each do |line|
  line.chomp!
  line.strip!
  temparr = line.split(/\s\s*/)
  temparr.each do |word|
    # Strip leading, then trailing, punctuation
    while word.length > 0 and $punct.include?(word[0].chr)
      word = word[1..-1]
    end
    while word.length > 0 and $punct.include?(word[-1].chr)
      word = word[0..-2]
    end
    # Skip tokens that were nothing but punctuation
    next if word.empty?
    if word_hash.has_key?(word)
      word_hash[word] += 1
    else
      word_hash[word] = 1
    end
  end
end

# First listing: every word in alphabetical order
puts "="
puts "=== ALPHA ORDER ==="
puts "="
keys = word_hash.keys.sort
keys.each do |key|
  printf "%24s %6d\n", key, word_hash[key]
end

# Second listing: most frequent words first
puts "="
puts "=== OCCURRENCE ORDER ==="
puts "="
temparray = word_hash.sort{|a,b| by_freq_then_name(a, b)}
temparray.each do |word_freq|
  printf "%7d %s\n", word_freq[1], word_freq[0]
end
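For comparison, the "count descending, then name ascending" ordering can also be written with Ruby's sort_by and a compound key instead of a hand-written comparator. This is only a sketch, not Steve's code: the helper names count_words and freq_then_name are invented for illustration, and it assumes the words are simply whitespace-separated with no punctuation to strip.

```ruby
# Hypothetical helpers for illustration; not part of the original program.

# Tally whitespace-separated words in a string.
def count_words(text)
  counts = Hash.new(0)                  # missing keys start at 0
  text.split.each { |word| counts[word] += 1 }
  counts
end

# Order [word, count] pairs by count descending, then word ascending.
# Negating the count lets one ascending sort handle both keys.
def freq_then_name(counts)
  counts.sort_by { |word, count| [-count, word] }
end

counts = count_words("the cat and the dog and the bird")
freq_then_name(counts).each do |word, count|
  printf "%7d %s\n", count, word
end
```

Negating the count only works because the first key is numeric. A descending sort on a string key would still need a comparator block.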
Re: Handy word list program for indexing
On Mon, 5 Mar 2007, Steve Litt wrote:

> In preparation to create my index for my book, I created a Ruby program to
> list every word in a file (in this case the .lyx file).
>
> Now of course this could be done with a simple one-liner using sed and
> sort -u, but my program lists the words in 2 different orders, first in alpha
> order, which of course could be done by the 1 liner, and then in descending
> order of occurrence, which can't be.

fmt -1 | sort | uniq -c | sort -rn

Also could add some tr and sed to clean out junk spacing and to lowercase everything.

By the way, I did something similar when doing some indexing. Another thing I used is a spell checker -- words unknown to my dictionary I made sure were in the index.

Jeremy C. Reed
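The spell-checker trick Jeremy mentions might look roughly like this in Ruby. It is a sketch under assumptions, not his actual method: the helper name unknown_words is invented, the dictionary is taken to be a plain collection of known words, and the comparison is made case-insensitive by choice.

```ruby
require 'set'

# Hypothetical helper: return the sorted, unique words that do not
# appear in the dictionary. Matching ignores case; the returned
# words keep their original spelling.
def unknown_words(words, dictionary)
  known = Set.new(dictionary.map { |w| w.downcase })
  words.reject { |w| known.include?(w.downcase) }.uniq.sort
end

sample = %w[the LyX index uses makeindex the]
dict   = %w[the index uses]
puts unknown_words(sample, dict)
```

In practice the dictionary could be read from a system word list where one is installed (often /usr/share/dict/words on Unix-like systems, though that path is an assumption about the machine).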
Re: Handy word list program for indexing
Hi Jeremy,

On Monday 05 March 2007 21:05, Jeremy C. Reed wrote:
> On Mon, 5 Mar 2007, Steve Litt wrote:
> > In preparation to create my index for my book, I created a Ruby program
> > to list every word in a file (in this case the .lyx file).
> >
> > Now of course this could be done with a simple one-liner using sed and
> > sort -u, but my program lists the words in 2 different orders, first in
> > alpha order, which of course could be done by the 1 liner, and then in
> > descending order of occurrence, which can't be.
>
> fmt -1 | sort | uniq -c | sort -rn

Knowing that would have saved me two hours :-) I wasn't familiar with the fmt and uniq commands. Thanks.

> Also could add some tr and sed to clean out junk spacing and to lowercase
> everything.

Yes. Here's my final answer, merging everything into lower case, and blowing off leading and trailing space and punctuation:

fmt -1 < tsjustfacts.txt |
  sed -e 's/^[[:space:][:punct:]]*//' |
  sed -e 's/[[:space:][:punct:]]*$//' |
  tr '[:upper:]' '[:lower:]' |
  sort | uniq -c | sort -rn

That's sweet. Simpler than the Ruby, and probably faster, especially on multicore/multiprocessor machines. Thanks!

The one thing this doesn't do is, upon final sort, sort by count descending but name ascending. Can you think of a way to do that with standard Linux commands?

Another filter that might be useful in this chain is:

grep -v '^[[:space:][:digit:][:punct:]]*$'

In other words, if a line consists of nothing but space, digits and punctuation, it's probably not an index candidate and can be deleted, saving future processing and reducing extraneous output.

I'm not sure whether it's a good idea to lowercase everything. I think sometimes case serves as a reminder of the meaning of a word. To avoid forcing everything to lower case, simply remove the tr '[:upper:]' '[:lower:]' stage.

>
> By the way, I did something similar when doing some indexing.

Indexing is the most distasteful, boring, and tedious part of writing a book. Making word lists like this at least makes it a brainless activity.

Thanks

SteveT

Steve Litt
Author: Universal Troubleshooting Process books and courseware
http://www.troubleshooters.com/
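The clean-up and filter stages of that pipeline could be mirrored in Ruby along these lines. This is a rough sketch, not a drop-in replacement: the helper names clean_token, junk? and index_candidates are made up here, lowercasing is left optional to reflect the doubt raised above, and Ruby's reading of the POSIX bracket classes may not match sed and grep exactly for every symbol.

```ruby
# Hypothetical helpers mirroring the sed, tr and grep -v stages.

# Strip leading and trailing whitespace and punctuation from a token,
# optionally folding it to lower case.
def clean_token(token, lowercase = false)
  cleaned = token.gsub(/\A[[:space:][:punct:]]+|[[:space:][:punct:]]+\z/, '')
  lowercase ? cleaned.downcase : cleaned
end

# True when a token holds nothing but whitespace, digits and
# punctuation (including the empty string).
def junk?(token)
  token.match?(/\A[[:space:][:digit:][:punct:]]*\z/)
end

# Split text on whitespace, clean each token, drop the junk.
def index_candidates(text, lowercase = false)
  text.split.map { |t| clean_token(t, lowercase) }.reject { |t| junk?(t) }
end

puts index_candidates("The (quick) fox, 42 -- the Fox!", true)
```

Passing false (the default) as the second argument keeps the original capitalization, which corresponds to dropping the tr stage.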
Re: Handy word list program for indexing
On Tue, 6 Mar 2007, Steve Litt wrote:

> Indexing is the most distasteful, boring, and tedious part of writing a book.
> Making word lists like this at least makes it a brainless activity.

I've linked to this thread from the following page

http://wiki.lyx.org/Tips/Indexing

Maybe you could copy the useful script snippets to this page?

Best regards
/Christian

--
Christian Ridderström, +46-8-768 39 44
http://www.md.kth.se/~chr