RE: [PHP] How do I do count the occurrence of each word?
-----Original Message-----
From: Marco Behnke [mailto:ma...@behnke.biz]
Sent: 19 August 2012 06:39
To: php-general@lists.php.net
Subject: Re: [PHP] How do I do count the occurrence of each word?

> On 19.08.12 06:59, tamouse mailing lists wrote:
>> On Sat, Aug 18, 2012 at 6:44 PM, John Taylor-Johnston jt.johns...@usherbrooke.ca wrote:
>>> I want to parse this text and count the occurrence of each word:
>>>
>>> Sample Output:
>>> determined = 4
>>> fire = 7
>>> patrol = 3
>>> theft = 6
>>> witness = 1
>>> witnessed = 1
>>
>> [...]
>> and then you just run through the words, building an associative array by incrementing the count of each word as the key to the array:
>>
>>     foreach ($words as $word) {
>>         $freq[$word]++;
>>     }
>
> Please add an existence check to avoid incrementing array keys that are not set:
>
>     foreach ($words as $word) {
>         if (!array_key_exists($word, $freq)) {
>             $freq[$word] = 1;
>         } else {
>             $freq[$word]++;
>         }
>     }

Erm...

    $freq = array_count_values($words);

(http://php.net/array_count_values)

Cheers!

Mike

--
Mike Ford, Electronic Information Developer,
Libraries and Learning Innovation,
Portland PD507, City Campus, Leeds Metropolitan University,
Portland Way, LEEDS, LS1 3HE, United Kingdom
E: m.f...@leedsmet.ac.uk T: +44 113 812 4730
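As a minimal sketch of the array_count_values() suggestion, the following counts a small hand-written word list. The contents of $words are an illustrative assumption and are not taken from the test page in the thread; in practice $words would come from the preg_split() step discussed elsewhere in the thread.

```php
<?php
// Hypothetical sample input (assumption, not data from the thread).
$words = array('fire', 'theft', 'fire', 'patrol', 'theft', 'fire');

// array_count_values() returns an array whose keys are the distinct
// values and whose values are the number of occurrences.
$freq = array_count_values($words);

// Sort alphabetically by word before printing.
ksort($freq);

foreach ($freq as $word => $count) {
    echo $word, ' = ', $count, "\n";
}
```

This replaces the explicit loop and its existence check with a single built-in call, so there is no notice to guard against.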
Re: [PHP] How do I do count the occurrence of each word?
On Sun, Aug 19, 2012 at 12:38 AM, Marco Behnke ma...@behnke.biz wrote:
> On 19.08.12 06:59, tamouse mailing lists wrote:
>> On Sat, Aug 18, 2012 at 6:44 PM, John Taylor-Johnston jt.johns...@usherbrooke.ca wrote:
>>> I want to parse this text and count the occurrence of each word:
>>>
>>> $text = "http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html"; #Can I do this?
>>> $stripping = strip_tags($text); #get rid of html
>>> $stripping = strtolower($stripping); #put in lowercase
>>>
>>> First of all I want to start AFTER the expression "News Releases" and stop BEFORE the next occurrence of "-30-"
>>>
>>> #This may occur an undetermined number of times on http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html
>>>
>>> Second, do I put $stripping into an array to separate each word by each space " "?
>>>
>>> $stripping = implode(" ", $stripping);
>>>
>>> Third how do I count the number of occurrences of each word?
>>>
>>> Sample Output:
>>> determined = 4
>>> fire = 7
>>> patrol = 3
>>> theft = 6
>>> witness = 1
>>> witnessed = 1
>>>
>>> <?php
>>> $text = "http://www.cegepsherbrooke.qc.ca/~languesmodernes/test/test.html";
>>> #echo strip_tags($text);
>>> #echo "\n";
>>> $stripping = strip_tags($text);
>>> #Get text between "News Releases" and stop before the next occurrence of "-30-"
>>> #$stripping = str_replace("\r", "", $stripping);# getting rid of \r
>>> #$stripping = str_replace("\n", "", $stripping);# getting rid of \n
>>> #$stripping = str_replace("  ", " ", $stripping);# getting rid of the occurrences of double spaces
>>> #$stripping = strtolower($stripping);
>>> #Where do I go now?
>>> ?>
>>>
>>> --
>>> PHP General Mailing List (http://www.php.net/)
>>> To unsubscribe, visit: http://www.php.net/unsub.php
>>
>> This is usually a first-year CS programming problem (word frequency counts), complicated a little bit by needing to extract the text. You've started off fine by stripping tags and converting to lower case. You'll also want to either convert or strip HTML entities, and decide what you want to do with plurals and words like "you're", "Charlie's", "it's", etc., as well as whether something like RFC822 (mixed letters and numbers) is a word or not.
>>
>> When you've arranged all that, splitting on white space is trivial:
>>
>>     $words = preg_split('/[[:space:]]+/', $text);
>>
>> and then you just run through the words, building an associative array by incrementing the count of each word as the key to the array:
>>
>>     foreach ($words as $word) {
>>         $freq[$word]++;
>>     }
>
> Please add an existence check to avoid incrementing array keys that are not set:
>
>     foreach ($words as $word) {
>         if (!array_key_exists($word, $freq)) {
>             $freq[$word] = 1;
>         } else {
>             $freq[$word]++;
>         }
>     }

Ah, yes, good point -- as written, my code will raise two notices. In addition, declare the $freq array before the foreach loop to ensure notice-free operation:

    $freq = array();

>> For output, you may want to sort the array:
>>
>>     ksort($freq);
>
> --
> Marco Behnke
> Dipl. Informatiker (FH), SAE Audio Engineer Diploma
> Zend Certified Engineer PHP 5.3
> Tel.: 0174 / 9722336
> e-Mail: ma...@behnke.biz
> Softwaretechnik Behnke
> Heinrich-Heine-Str. 7D
> 21218 Seevetal
> http://www.behnke.biz
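Putting the two corrections from this exchange together (declare $freq first, and only increment keys that already exist), a minimal sketch of the loop might look like this. The sample $words array is an assumption made for illustration.

```php
<?php
// Hypothetical sample input (assumption, not data from the thread).
$words = array('witness', 'fire', 'witnessed', 'fire');

// Declare the array up front so the first access does not raise an
// "undefined variable" notice.
$freq = array();

foreach ($words as $word) {
    if (!array_key_exists($word, $freq)) {
        // First time this word is seen: start its count at 1.
        $freq[$word] = 1;
    } else {
        // Word already has an entry: increment it.
        $freq[$word]++;
    }
}

// Sort by word for output.
ksort($freq);
```

Note the negation in the condition: the count is initialised only when the key is absent.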
Re: [PHP] How do I do count the occurrence of each word?
On Sat, Aug 18, 2012 at 6:44 PM, John Taylor-Johnston jt.johns...@usherbrooke.ca wrote:
> I want to parse this text and count the occurrence of each word:
> [...]

This is usually a first-year CS programming problem (word frequency counts), complicated a little bit by needing to extract the text. You've started off fine by stripping tags and converting to lower case. You'll also want to either convert or strip HTML entities, and decide what you want to do with plurals and words like "you're", "Charlie's", "it's", etc., as well as whether something like RFC822 (mixed letters and numbers) is a word or not.

When you've arranged all that, splitting on white space is trivial:

    $words = preg_split('/[[:space:]]+/', $text);

and then you just run through the words, building an associative array by incrementing the count of each word as the key to the array:

    foreach ($words as $word) {
        $freq[$word]++;
    }

For output, you may want to sort the array:

    ksort($freq);
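The clean-up and splitting steps described in this reply could be sketched as follows. This is one possible arrangement, not the poster's own code: the $html fragment is a hypothetical stand-in for the fetched page, and the rule that a "word" is a run of lowercase letters, digits, or apostrophes is an assumption that would need adjusting for plurals, possessives, and non-ASCII text.

```php
<?php
// Hypothetical HTML fragment standing in for the fetched page (assumption).
$html = '<p>Fire &amp; theft: a <b>Fire</b> patrol witnessed the theft.</p>';

$text = strip_tags($html);                               // drop markup
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');  // e.g. &amp; becomes &
$text = strtolower($text);                               // case-insensitive counting

// Assumption: anything that is not a letter, digit, or apostrophe is a
// separator, so punctuation is replaced by a single space.
$text = preg_replace("/[^a-z0-9']+/", ' ', $text);

// PREG_SPLIT_NO_EMPTY drops the empty strings that leading or trailing
// whitespace would otherwise produce.
$words = preg_split('/[[:space:]]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

$freq = array_count_values($words);
ksort($freq);
```

Without PREG_SPLIT_NO_EMPTY, text that begins or ends with whitespace yields an empty-string "word" that would then be counted.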
Re: [PHP] How do I do count the occurrence of each word?
On 19.08.12 06:59, tamouse mailing lists wrote:
> On Sat, Aug 18, 2012 at 6:44 PM, John Taylor-Johnston jt.johns...@usherbrooke.ca wrote:
>> I want to parse this text and count the occurrence of each word:
>> [...]
>
> [...]
> and then you just run through the words, building an associative array by incrementing the count of each word as the key to the array:
>
>     foreach ($words as $word) {
>         $freq[$word]++;
>     }

Please add an existence check to avoid incrementing array keys that are not set:

    foreach ($words as $word) {
        if (!array_key_exists($word, $freq)) {
            $freq[$word] = 1;
        } else {
            $freq[$word]++;
        }
    }

> For output, you may want to sort the array:
>
>     ksort($freq);

--
Marco Behnke
Dipl. Informatiker (FH), SAE Audio Engineer Diploma
Zend Certified Engineer PHP 5.3
Tel.: 0174 / 9722336
e-Mail: ma...@behnke.biz
Softwaretechnik Behnke
Heinrich-Heine-Str. 7D
21218 Seevetal
http://www.behnke.biz
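The replies in this thread do not cover the first part of the original question, which asks to keep only the text between each "News Releases" marker and the next "-30-". One possible approach, offered purely as a sketch, is a non-greedy regular expression. The $text value below is a hypothetical stand-in for the tag-stripped page, and the markers are assumed to appear exactly as written in the question.

```php
<?php
// Hypothetical plain text standing in for the tag-stripped page (assumption).
$text = "Header News Releases fire on main street -30- filler "
      . "News Releases theft at the mall -30- footer";

// Non-greedy capture from each "News Releases" up to the next "-30-".
// The s modifier lets . match newlines, since releases span several lines.
preg_match_all('/News Releases(.*?)-30-/s', $text, $matches);

// $matches[1] holds one captured body per release; join them and
// lowercase only after matching, because the markers are case-sensitive.
$body = strtolower(implode(' ', $matches[1]));

$words = preg_split('/[[:space:]]+/', $body, -1, PREG_SPLIT_NO_EMPTY);
```

The resulting $words array can then be fed to either counting approach discussed in the thread.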