Re: [PHP] Another parse problem

2010-06-16 Thread tedd

Rob and Daniel:

As expected, both of your submissions were excellent. If this were an
assignment in one of my classes (as if I could teach either of you
anything), you would both receive an A+.


Daniel's routine also returned the .ie TLD, but that was not stated as a
requirement.


Daniel's routine also allows for full-link parsing, but again that was
not stated as a requirement.


How to deal with duplicate domains was not addressed in the given and 
both routines differed on that point.
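
One way to settle the duplicate question is to normalize case and then
collapse repeats. The following is only a sketch; the helper name
unique_domains() is hypothetical and is not taken from either submitted
routine.

```php
<?php
// Hypothetical helper: collapse duplicate matches, ignoring letter case.
function unique_domains(array $domains)
{
    // Domain names are case-insensitive, so compare them in lowercase.
    $lower = array_map('strtolower', $domains);

    // array_unique() keeps the first occurrence; array_values() reindexes.
    return array_values(array_unique($lower));
}

$found = array('Example.com', 'php.net', 'example.com', 'php.net');
print_r(unique_domains($found));
```

Whether "Example.com" and "example.com" should count as one entry is a
choice the given leaves open; this sketch assumes they should.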


The given was to parse domain-names, but both routines pulled out
sub-domains as well. Perhaps I am wrong in my understanding of what a
domain name is, but I would normally look at sub-domains as not part
of the domain name. Sub-domains are simply extensions of the domain
name. Am I right or wrong?


In any event, I will be examining both of your routines because neither
is the way I solved the problem. Mine was a bit more verbose and clumsy
in comparison. It's always nice to see how the top dogs do it.


Cheers,

tedd

PS: I've been away for the last couple of days.

--
---
http://sperling.com  http://ancientstones.com  http://earthstones.com

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Another parse problem

2010-06-16 Thread Daniel P. Brown
On Wed, Jun 16, 2010 at 13:22, tedd <tedd.sperl...@gmail.com> wrote:

> The given was to parse domain-names, but both routines pulled out
> sub-domains as well. Perhaps I am wrong in my understanding of what a domain
> name is, but I would normally look at sub-domains as not part of the domain
> name. Sub-domains are simply extensions of the domain name. Am I right or
> wrong?

Technically, a domain name is anything from the TLD and SLD levels
and below.  An FQDN (commonly called a hostname) is in the format
cname.sld.tld.

-- 
/Daniel P. Brown
daniel.br...@parasane.net || danbr...@php.net
http://www.parasane.net/ || http://www.pilotpig.net/
We now offer SAME-DAY SETUP on a new line of servers!




Re: [PHP] Another parse problem

2010-06-16 Thread Robert Cummings



Daniel P. Brown wrote:
> On Wed, Jun 16, 2010 at 13:22, tedd <tedd.sperl...@gmail.com> wrote:
>> The given was to parse domain-names, but both routines pulled out
>> sub-domains as well. Perhaps I am wrong in my understanding of what a domain
>> name is, but I would normally look at sub-domains as not part of the domain
>> name. Sub-domains are simply extensions of the domain name. Am I right or
>> wrong?
>
> Technically, a domain name is anything from the TLD and SLD levels
> and below.  An FQDN (commonly called a hostname) is in the format
> cname.sld.tld.


Additionally, extracting top level domains is not so simple since it may 
have 2 or more parts.


Cheers,
Rob.
--
E-Mail Disclaimer: Information contained in this message and any
attached documents is considered confidential and legally protected.
This message is intended solely for the addressee(s). Disclosure,
copying, and distribution are prohibited unless authorized.




Re: [PHP] Another parse problem

2010-06-16 Thread Daniel P. Brown
On Wed, Jun 16, 2010 at 15:52, Robert Cummings <rob...@interjinn.com> wrote:

> Additionally, extracting top level domains is not so simple since it may
> have 2 or more parts.

*Gasp!*  The Great Cummings is incorrect.

/me faints.

Actually, ccTLDs are just the very last group of letters.  For
example, .il, .uk, and .br.  However, ICANN, registrar policies,
or sponsorship requirements for some of them require the use of an SLD
as well.  For example, .co.il, .org.uk, and .com.br, respectively.
Some ccTLDs offer the SLD options, but don't require them.  For
example, you can register .co.in, .firm.in, .gen.in, or any other
available SLD+ccTLD, or just the ccTLD .in itself.

Still others have no such requirement or even official SLD
endorsements, such as good ol' Canada (Land of Clan Cummings),
Ireland, and here in the US.




Re: [PHP] Another parse problem

2010-06-16 Thread Robert Cummings

Daniel P. Brown wrote:
> On Wed, Jun 16, 2010 at 15:52, Robert Cummings <rob...@interjinn.com> wrote:
>> Additionally, extracting top level domains is not so simple since it may
>> have 2 or more parts.
>
> *Gasp!*  The Great Cummings is incorrect.
>
> /me faints.
>
> Actually, ccTLDs are just the very last group of letters.  For
> example, .il, .uk, and .br.  However, ICANN, registrar policies,
> or sponsorship requirements for some of them require the use of an SLD
> as well.  For example, .co.il, .org.uk, and .com.br, respectively.
> Some ccTLDs offer the SLD options, but don't require them.  For
> example, you can register .co.in, .firm.in, .gen.in, or any other
> available SLD+ccTLD, or just the ccTLD .in itself.
>
> Still others have no such requirement or even official SLD
> endorsements, such as good ol' Canada (Land of Clan Cummings),
> Ireland, and here in the US.


Hahah, I can't be right all the time :D I didn't mean to use TLD, I 
meant to use domain name, but not including sub-domained names :) I 
don't even know what that is rightly called to exclude sub-domains. 
Anyways, those, by virtue of your above description can have two or more 
parts and there's not a simple way to extract that part without also 
extracting the sub-domain portions.


Cheers,
Rob.



Re: [PHP] Another parse problem

2010-06-16 Thread Daniel Brown
On Wed, Jun 16, 2010 at 21:42, Robert Cummings <rob...@interjinn.com> wrote:
[snip!]
> Anyways, those, by
> virtue of your above description can have two or more parts and there's not
> a simple way to extract that part without also extracting the sub-domain
> portions.

True.  Not without some static rules and logic, including
knowledge of which ccTLDs have required or potential country-wide
SLDs.  Though I think the solutions provided by yourself, Shawn, and
myself would suffice for most situations.  I'm hoping Tedd will share
his as well.
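
The static rules mentioned above could be sketched roughly as follows.
The suffix list here is a hypothetical, deliberately tiny sample built
from the examples in this thread (plus co.uk), and registrable_domain()
is an illustrative name, not a function from any of the posted solutions.

```php
<?php
// Hypothetical sample of SLD+ccTLD suffixes; a real list would be far longer.
$twoPartSuffixes = array(
    'co.il', 'org.uk', 'co.uk', 'com.br', 'co.in', 'firm.in', 'gen.in',
);

// Return the domain without sub-domains, or null if none can be formed.
function registrable_domain($host, array $twoPartSuffixes)
{
    $labels = explode('.', strtolower(trim($host, '.')));
    $count  = count($labels);
    if ($count < 2) {
        return null;
    }

    // If the last two labels form a known suffix, keep three labels.
    $lastTwo = $labels[$count - 2] . '.' . $labels[$count - 1];
    $keep    = in_array($lastTwo, $twoPartSuffixes, true) ? 3 : 2;
    if ($count < $keep) {
        return null; // the host is only a suffix, e.g. "co.il"
    }

    return implode('.', array_slice($labels, -$keep));
}

echo registrable_domain('ca2.php.parasane.net', $twoPartSuffixes), PHP_EOL;
echo registrable_domain('www.example.co.il', $twoPartSuffixes), PHP_EOL;
```

Any suffix missing from the list is treated as a plain one-part TLD, so
the result is only as good as the list itself.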




[PHP] Another parse problem

2010-06-14 Thread tedd

Hi gang:

Considering all the recent parsing, here's another problem to 
consider -- given any text, parse the domain-names out of it.


You may limit the parsing to the most popular TLDs, such as .com,
.net, and .org, but the finished result should be an array containing
all the domain-names found in a text file.


Cheers,

tedd



Re: [PHP] Another parse problem

2010-06-14 Thread Ashley Sheridan
On Mon, 2010-06-14 at 09:14 -0400, tedd wrote:

> Hi gang:
>
> Considering all the recent parsing, here's another problem to
> consider -- given any text, parse the domain-names out of it.
>
> You may limit the parsing to the most popular TLDs, such as .com,
> .net, and .org, but the finished result should be an array containing
> all the domain-names found in a text file.
>
> Cheers,
>
> tedd
> --
> ---
> http://sperling.com  http://ancientstones.com  http://earthstones.com
 


I'm assuming it won't be anything as simple as assuming all the domains
begin with the http:// prefix? :p

Thanks,
Ash
http://www.ashleysheridan.co.uk




Re: [PHP] Another parse problem

2010-06-14 Thread tedd

At 2:18 PM +0100 6/14/10, Ashley Sheridan wrote:
> On Mon, 2010-06-14 at 09:14 -0400, tedd wrote:
>> Hi gang:
>>
>> Considering all the recent parsing, here's another problem to
>> consider -- given any text, parse the domain-names out of it.
>>
>> You may limit the parsing to the most popular TLDs, such as .com,
>> .net, and .org, but the finished result should be an array containing
>> all the domain-names found in a text file.
>>
>> Cheers,
>>
>> tedd
>> --
>> ---
>> http://sperling.com  http://ancientstones.com  http://earthstones.com
>
> I'm assuming it won't be anything as simple as assuming all the
> domains begin with the http:// prefix? :p
>
> Thanks,
> Ash


Ash:

Nope, just a text file containing whatever and domain-names. The only
domain-name indicator would be the period followed by an approved
TLD, such as .com, .net, or .org.
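
Read literally, that rule could be sketched like this. The function name
find_domains() and its $tlds parameter are assumptions made for
illustration, and this is not necessarily how anyone on the list solved it.

```php
<?php
// Hypothetical helper: match labels followed by a period and an approved TLD.
function find_domains($text, array $tlds = array('com', 'net', 'org'))
{
    // Escape each TLD so it is safe inside the "/"-delimited pattern.
    $quoted = array_map(function ($tld) {
        return preg_quote($tld, '/');
    }, $tlds);

    // One or more "label." groups, then an approved TLD at a word boundary.
    $pattern = '/\b((?:[a-z0-9-]+\.)+(?:' . implode('|', $quoted) . '))\b/i';

    preg_match_all($pattern, $text, $matches);
    return $matches[1];
}

$sample = 'Mail bob@example.com or see www.php.net and x.org; '
        . '2.2 million is not one.';
print_r(find_domains($sample));
```

The trailing word boundary keeps a word such as "example.community" from
being cut short at ".com", and sub-domains are returned as part of each
match.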


Cheers,

tedd




Re: [PHP] Another parse problem

2010-06-14 Thread Robert Cummings

tedd wrote:
> At 2:18 PM +0100 6/14/10, Ashley Sheridan wrote:
>> On Mon, 2010-06-14 at 09:14 -0400, tedd wrote:
>>> Hi gang:
>>>
>>> Considering all the recent parsing, here's another problem to
>>> consider -- given any text, parse the domain-names out of it.
>>>
>>> You may limit the parsing to the most popular TLDs, such as .com,
>>> .net, and .org, but the finished result should be an array containing
>>> all the domain-names found in a text file.
>>>
>>> Cheers,
>>>
>>> tedd
>>> --
>>> ---
>>> http://sperling.com  http://ancientstones.com  http://earthstones.com
>>
>> I'm assuming it won't be anything as simple as assuming all the
>> domains begin with the http:// prefix? :p
>>
>> Thanks,
>> Ash
>
> Ash:
>
> Nope, just a text file containing whatever and domain-names. The only
> domain-name indicator would be the period followed by an approved
> TLD, such as .com, .net, or .org.


<?php

function rip_domains( $text )
{
    $domains = false;

    $pattern =
        '[^-[:alnum:]]*'
       .'('
       .  '[-[:alnum:]][-.[:alnum:]]*'
       .  '\.(com|net|org)'
       .')'
       .'[^-_[:alnum:]]*';

    // Wrap the pattern in "#" delimiters so preg_match_all() accepts it.
    if( preg_match_all( "#$pattern#", $text, $matches ) )
    {
        // Store matches as array keys to collapse duplicates.
        $domains = array();
        foreach( $matches[1] as $domain )
        {
            $domains[$domain] = true;
        }
        $domains = array_keys( $domains );
    }

    return $domains;
}

?>

Naive implementation. I'm sure I've missed edge cases someplace.

Cheers,
Rob.



Re: [PHP] Another parse problem

2010-06-14 Thread Daniel P. Brown
On Mon, Jun 14, 2010 at 09:14, tedd <t...@sperling.com> wrote:
> Hi gang:
>
> Considering all the recent parsing, here's another problem to consider --
> given any text, parse the domain-names out of it.
>
> You may limit the parsing to the most popular TLDs, such as .com, .net, and
> .org, but the finished result should be an array containing all the
> domain-names found in a text file.

<?php
$text = <<<TXT
To test example.com and www.php.net and other domain names
such as january.pilotpig.net and ca2.php.parasane.net, we need a
reliable method of checking.  We don't want to match on regular
periods, nor on the 2.2million or 2.2 million or just 2,200,000
other potential matches. And not when we are double-spacing or
single-spacing, just when oidk.net and similar domains are found.
We'll match hyphen domains like l-i-e.com, but not fake_underscored_domain.net.
We also want to match http://-fronted domains like http://php1.net/,
which also contains a number.  If we wanted to match domains plus
paths, but there was no leading http:// to indicate that it should
be a URL, we could extend this to grab things like www.facebook.com/parasane,
so long as we don't ignore the rare one-character SLDs like x.com,
as well as the domains in email addresses like danbr...@php.net
So if everything works as expected, we should see eleven domains
matched here, because ccTLDs like guthr.ie should be matched as well.

TXT;

/**
 * $fromText can be defined via a file_get_contents() or
 * similar function, while $fullLink should be anything
 * but false to enable link-matching, which will return
 * only link-like domains with paths attached.
 */
function extract_domains($fromText,$fullLink=false) {

    // If we only want to match the domain names.
    if ($fullLink === false) {
        preg_match_all('/\b([a-z0-9\-\.]{1,}\.[a-z]{2,5})\b/',$fromText,$matches);
        return $matches[1];
    }

    // If we want to match just domain names with trailing paths.
    preg_match_all('/\b([a-z0-9\-\.]{1,}\.[a-z]{2,5}\/.+?)\b/',$fromText,$matches);
    return $matches[1];
}

// Demo
echo "<pre>".PHP_EOL;

echo "Just domains:".PHP_EOL;
var_dump(extract_domains($text));

echo PHP_EOL;

echo "Full links:".PHP_EOL;
var_dump(extract_domains($text,true));

echo "</pre>".PHP_EOL;
?>

