Re: [PHP] Another parse problem
Rob and Daniel:

As expected, both of your submissions were excellent. If this were an assignment in one of my classes (as if I could teach either of you anything), you would both receive an A+.

Daniel's routine also returned the .ie TLD, but that was not stated as a requirement. Daniel's routine also allows for full-link parsing, but again that was not stated as a requirement.

How to deal with duplicate domains was not addressed in the given, and the two routines differed on that point.

The given was to parse domain names, but both routines pulled out sub-domains as well. Perhaps I am wrong in my understanding of what a domain name is, but I would normally look at sub-domains as not part of the domain name. Sub-domains are simply extensions of the domain name; am I right or wrong?

In any event, I will be examining both of your routines, because neither is the way I solved the problem. Mine was a bit more verbose and clumsy in comparison. It's always nice to see how the top dogs do it.

Cheers,

tedd

PS: I've been away for the last couple of days.

--
http://sperling.com http://ancientstones.com http://earthstones.com

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
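Since the handling of duplicates was left open, here is a minimal sketch of one way to collapse them after extraction. The helper name `unique_domains()` is hypothetical and is not part of either posted routine; it assumes that domain names should be compared case-insensitively.

```php
<?php

// Hypothetical helper (not from the thread): lower-case every match,
// drop repeats, and re-index the result from zero.
function unique_domains(array $domains)
{
    $lowered = array_map('strtolower', $domains);
    return array_values(array_unique($lowered));
}

$found = array('Example.com', 'www.php.net', 'example.com', 'WWW.PHP.NET');
print_r(unique_domains($found)); // example.com, www.php.net
```

The first occurrence wins, so the order in which domains appear in the text is preserved.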
Re: [PHP] Another parse problem
On Wed, Jun 16, 2010 at 13:22, tedd tedd.sperl...@gmail.com wrote:
> The given was to parse domain names, but both routines pulled out
> sub-domains as well. Perhaps I am wrong in my understanding of what a
> domain name is, but I would normally look at sub-domains as not part of
> the domain name. Sub-domains are simply extensions of the domain name;
> am I right or wrong?

Technically, a domain name is anything from the TLD and SLD levels and below. An FQDN (commonly called a hostname) is in the format cname.sld.tld.

--
/Daniel P. Brown
daniel.br...@parasane.net || danbr...@php.net
http://www.parasane.net/ || http://www.pilotpig.net/
We now offer SAME-DAY SETUP on a new line of servers!
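To make the cname.sld.tld labelling concrete, here is a minimal sketch that splits a hostname into those levels. The function name `split_hostname()` and the array keys are hypothetical, and the sketch assumes the TLD is always the single right-most label, which does not hold for every ccTLD.

```php
<?php

// Hypothetical helper: name the levels of a hostname, reading right to left.
// Assumes a single-label TLD (e.g. "net"), not a two-part suffix like "co.uk".
function split_hostname($hostname)
{
    $labels = explode('.', strtolower(trim($hostname, '.')));
    if (count($labels) < 2) {
        return false; // nothing that looks like sld.tld
    }

    $tld = array_pop($labels); // right-most label
    $sld = array_pop($labels); // next label to the left

    return array(
        'tld'       => $tld,
        'sld'       => $sld,
        'subdomain' => implode('.', $labels), // whatever is left, may be ''
    );
}

print_r(split_hostname('january.pilotpig.net'));
print_r(split_hostname('ca2.php.parasane.net'));
```

Under this reading, "pilotpig.net" is the sld.tld pair and "january" is the part tedd calls a sub-domain.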
Re: [PHP] Another parse problem
Daniel P. Brown wrote:
> On Wed, Jun 16, 2010 at 13:22, tedd tedd.sperl...@gmail.com wrote:
>> The given was to parse domain names, but both routines pulled out
>> sub-domains as well. Perhaps I am wrong in my understanding of what a
>> domain name is, but I would normally look at sub-domains as not part of
>> the domain name. Sub-domains are simply extensions of the domain name;
>> am I right or wrong?
>
> Technically, a domain name is anything from the TLD and SLD levels and
> below. An FQDN (commonly called a hostname) is in the format
> cname.sld.tld.

Additionally, extracting top-level domains is not so simple, since they may have two or more parts.

Cheers,
Rob.
--
E-Mail Disclaimer: Information contained in this message and any attached documents is considered confidential and legally protected. This message is intended solely for the addressee(s). Disclosure, copying, and distribution are prohibited unless authorized.
Re: [PHP] Another parse problem
On Wed, Jun 16, 2010 at 15:52, Robert Cummings rob...@interjinn.com wrote:
> Additionally, extracting top-level domains is not so simple, since they
> may have two or more parts.

*Gasp!* The Great Cummings is incorrect. /me faints.

Actually, ccTLDs are just the very last group of letters: for example, .il, .uk, and .br. However, ICANN, registrar policies, or sponsorship requirements for some of them require the use of an SLD as well: for example, .co.il, .org.uk, and .com.br, respectively. Some ccTLDs offer SLD options but don't require them. For example, you can register under .co.in, .firm.in, .gen.in, or any other available SLD+ccTLD, or directly under the ccTLD .in itself. Still others have no such requirement or even official SLD endorsements, such as good ol' Canada (Land of Clan Cummings), Ireland, and here in the US.

--
/Daniel P. Brown
Re: [PHP] Another parse problem
Daniel P. Brown wrote:
> On Wed, Jun 16, 2010 at 15:52, Robert Cummings rob...@interjinn.com wrote:
>> Additionally, extracting top-level domains is not so simple, since they
>> may have two or more parts.
>
> *Gasp!* The Great Cummings is incorrect. /me faints.
>
> Actually, ccTLDs are just the very last group of letters: for example,
> .il, .uk, and .br. However, ICANN, registrar policies, or sponsorship
> requirements for some of them require the use of an SLD as well: for
> example, .co.il, .org.uk, and .com.br, respectively. Some ccTLDs offer
> SLD options but don't require them. For example, you can register under
> .co.in, .firm.in, .gen.in, or any other available SLD+ccTLD, or directly
> under the ccTLD .in itself. Still others have no such requirement or
> even official SLD endorsements, such as good ol' Canada (Land of Clan
> Cummings), Ireland, and here in the US.

Hahah, I can't be right all the time :D I didn't mean to say TLD; I meant the domain name, not including sub-domains :) I don't even know what the name without its sub-domains is rightly called. Anyway, by virtue of your description above, that part can have two or more labels after the registered name, and there's no simple way to extract it without also extracting the sub-domain portions.

Cheers,
Rob.
Re: [PHP] Another parse problem
On Wed, Jun 16, 2010 at 21:42, Robert Cummings rob...@interjinn.com wrote:
[snip!]
> Anyway, by virtue of your description above, that part can have two or
> more labels after the registered name, and there's no simple way to
> extract it without also extracting the sub-domain portions.

True. Not without some static rules and logic, including knowledge of which ccTLDs have required or potential country-wide SLDs. Though I think the solutions provided by yourself, Shawn, and myself would suffice for most situations. I'm hoping Tedd will share his as well.

--
/Daniel P. Brown
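The "static rules and logic" idea can be sketched as a lookup against a hand-written list of two-part suffixes. Everything here is an assumption for illustration: the function name `registrable_domain()` is hypothetical, and the suffix list is deliberately tiny (the entries named earlier in the thread plus co.uk). A real implementation would need a complete, maintained list, such as the Public Suffix List.

```php
<?php

// Hypothetical and incomplete: two-part suffixes under which names are registered.
$multiPartSuffixes = array(
    'co.il', 'org.uk', 'co.uk', 'com.br', 'co.in', 'firm.in', 'gen.in',
);

// Strip sub-domains from a hostname, keeping the suffix plus one label.
function registrable_domain($hostname, array $multiPartSuffixes)
{
    $labels = explode('.', strtolower(trim($hostname, '.')));
    $count  = count($labels);
    if ($count < 2) {
        return false;
    }

    // By default the suffix is only the final label (the TLD).
    $suffixLabels = 1;

    // With three or more labels, check whether the last two form a known suffix.
    if ($count >= 3) {
        $lastTwo = $labels[$count - 2] . '.' . $labels[$count - 1];
        if (in_array($lastTwo, $multiPartSuffixes, true)) {
            $suffixLabels = 2;
        }
    }

    // Keep the suffix and the single label to its left.
    return implode('.', array_slice($labels, -($suffixLabels + 1)));
}

echo registrable_domain('ca2.php.parasane.net', $multiPartSuffixes), PHP_EOL;
echo registrable_domain('www.ashleysheridan.co.uk', $multiPartSuffixes), PHP_EOL;
```

Because .in accepts registrations both directly and under SLDs such as .co.in, a name like "co.in" on its own is ambiguous to this sketch; it simply returns it unchanged.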
[PHP] Another parse problem
Hi gang:

Considering all the recent parsing, here's another problem to consider: given any text, parse the domain names out of it.

You may limit the parsing to the most popular TLDs, such as .com, .net, and .org, but the finished result should be an array containing all the domain names found in a text file.

Cheers,

tedd
Re: [PHP] Another parse problem
On Mon, 2010-06-14 at 09:14 -0400, tedd wrote:
> Hi gang:
>
> Considering all the recent parsing, here's another problem to consider:
> given any text, parse the domain names out of it.
>
> You may limit the parsing to the most popular TLDs, such as .com, .net,
> and .org, but the finished result should be an array containing all the
> domain names found in a text file.
>
> Cheers,
>
> tedd

I'm assuming it won't be anything as simple as assuming all the domains begin with the http:// prefix? :p

Thanks,
Ash
http://www.ashleysheridan.co.uk
Re: [PHP] Another parse problem
At 2:18 PM +0100 6/14/10, Ashley Sheridan wrote:
> On Mon, 2010-06-14 at 09:14 -0400, tedd wrote:
>> Hi gang:
>>
>> Considering all the recent parsing, here's another problem to consider:
>> given any text, parse the domain names out of it.
>>
>> You may limit the parsing to the most popular TLDs, such as .com, .net,
>> and .org, but the finished result should be an array containing all the
>> domain names found in a text file.
>
> I'm assuming it won't be anything as simple as assuming all the domains
> begin with the http:// prefix? :p
>
> Thanks,
> Ash

Ash:

Nope, just a text file containing arbitrary text with domain names in it. The only domain-name indicator would be the period followed by an approved TLD, such as .com, .net, or .org.

Cheers,

tedd
Re: [PHP] Another parse problem
tedd wrote:
> At 2:18 PM +0100 6/14/10, Ashley Sheridan wrote:
>> On Mon, 2010-06-14 at 09:14 -0400, tedd wrote:
>>> Considering all the recent parsing, here's another problem to
>>> consider: given any text, parse the domain names out of it.
>>>
>>> You may limit the parsing to the most popular TLDs, such as .com,
>>> .net, and .org, but the finished result should be an array containing
>>> all the domain names found in a text file.
>>
>> I'm assuming it won't be anything as simple as assuming all the domains
>> begin with the http:// prefix? :p
>
> Nope, just a text file containing arbitrary text with domain names in
> it. The only domain-name indicator would be the period followed by an
> approved TLD, such as .com, .net, or .org.

<?php

function rip_domains( $text )
{
    $domains = false;

    $pattern =
        '[^-[:alnum:]]*'
       .'('
           . '[-[:alnum:]][-.[:alnum:]]*'
           . '\.(com|net|org)'
       .')'
       .'[^-_[:alnum:]]*';

    if( preg_match_all( "#$pattern#", $text, $matches ) )
    {
        // Use the matches as array keys so that duplicates collapse.
        $domains = array();
        foreach( $matches[1] as $domain )
        {
            $domains[$domain] = true;
        }

        $domains = array_keys( $domains );
    }

    return $domains;
}

?>

Naive implementation. I'm sure I've missed edge cases someplace.

Cheers,
Rob.
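One edge case in the pattern above: its leading and trailing character classes are followed by `*`, so they may match nothing, and a word such as "example.community" yields "example.com". The following is a sketch of one possible variation, not Rob's method. It swaps those optional classes for lookaround assertions; the function name `find_domains_strict()` is hypothetical, and the TLD list is the same three from the original problem.

```php
<?php

// Hypothetical variation: require a real boundary on both sides of the match.
function find_domains_strict($text)
{
    $pattern = '#'
        . '(?<![-_.[:alnum:]])'          // not preceded by a name character
        . '('
        .   '[-[:alnum:]][-.[:alnum:]]*' // labels and dots
        .   '\.(?:com|net|org)'          // approved TLDs only
        . ')'
        . '(?![-_[:alnum:]])'            // TLD must not run on into more letters
        . '#i';

    if (!preg_match_all($pattern, $text, $matches)) {
        return array();
    }

    // Drop repeats while keeping first-seen order.
    return array_values(array_unique($matches[1]));
}

$sample = 'Visit example.community or www.php.net, not '
        . 'fake_underscored_domain.net; mail bob@example.org.';
print_r(find_domains_strict($sample)); // www.php.net, example.org
```

The lookbehind includes the underscore, so the tail of an underscored name is not picked up as a domain of its own.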
Re: [PHP] Another parse problem
On Mon, Jun 14, 2010 at 09:14, tedd t...@sperling.com wrote:
> Hi gang:
>
> Considering all the recent parsing, here's another problem to consider:
> given any text, parse the domain names out of it.
>
> You may limit the parsing to the most popular TLDs, such as .com, .net,
> and .org, but the finished result should be an array containing all the
> domain names found in a text file.

<?php

$text =<<<TXT
To test example.com and www.php.net and other domain names
such as january.pilotpig.net and ca2.php.parasane.net, we need
a reliable method of checking. We don't want to match on regular
periods, nor on the 2.2million or 2.2 million or just 2,200,000
other potential matches. And not when we are double-spacing
or single-spacing, just when oidk.net and similar domains are
found. We'll match hyphen domains like l-i-e.com, but not
fake_underscored_domain.net. We also want to match http://-fronted
domains like http://php1.net/, which also contains a number.

If we wanted to match domains plus paths, but there was no leading
http:// to indicate that it should be a URL, we could extend this
to grab things like www.facebook.com/parasane, so long as we don't
ignore the rare one-character SLDs like x.com, as well as the domains
in email addresses like danbr...@php.net

So if everything works as expected, we should see eleven domains
matched here, because ccTLDs like guthr.ie should be matched as well.
TXT;

/**
 * $fromText can be defined via a file_get_contents() or
 * similar function, while $fullLink should be anything
 * but false to enable link-matching, which will return
 * only link-like domains with paths attached.
 */
function extract_domains($fromText,$fullLink=false) {

    // If we only want to match the domain names.
    if ($fullLink === false) {
        preg_match_all('/\b([a-z0-9\-\.]{1,}\.[a-z]{2,5})\b/',$fromText,$matches);
        return $matches[1];
    }

    // If we want to match just domain names with trailing paths.
    preg_match_all('/\b([a-z0-9\-\.]{1,}\.[a-z]{2,5}\/.+?)\b/',$fromText,$matches);
    return $matches[1];
}

// Demo
echo "<pre>".PHP_EOL;
echo "Just domains:".PHP_EOL;
var_dump(extract_domains($text));
echo PHP_EOL;
echo "Full links:".PHP_EOL;
var_dump(extract_domains($text,true));
echo "</pre>".PHP_EOL;
?>

--
/Daniel P. Brown