Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls

2013-05-30 Thread Sebastian Krebs
2013/5/29 Matijn Woudt tijn...@gmail.com



 On Wed, May 29, 2013 at 10:51 PM, Sebastian Krebs krebs@gmail.com wrote:




 2013/5/29 Matijn Woudt tijn...@gmail.com

 On Wed, May 29, 2013 at 6:08 PM, Sean Greenslade zootboys...@gmail.com
 wrote:

  On Wed, May 29, 2013 at 9:57 AM, Jonesy gm...@jonz.net wrote:
   On Tue, 28 May 2013 14:17:06 -0700, Daevid Vincent wrote:
   I'm adding some minification to our cache.class.php and am running into an
   edge case that is causing me grief.
  
   I want to remove all comments of the // variety, HOWEVER I don't want to
   remove URLs...
  
   KISS.
  
   To make it simple, straight-forward, and understandable next year when I
   have to re-read what I've written:
  
   I'd change all :// to QqQ  -- or any unlikely text string.
  
   Then I'd do whatever needs to be done to the // occurrences.
  
   Finally, I'd change all QqQ back to ://.
  
   Jonesy
 
  Wow. This is just a spectacularly bad suggestion.
 
  First off, this task is probably a bit beyond the capabilities of a
  regex. Yes, you may be able to come up with something that works 99%
  of the time, but this is really a job for a parser of some sort. I'm
  sorry I don't have any suggestions on exactly where to go with that,
  however I'm sure Google can be of assistance. The main problem is that
  regex doesn't understand context. It just blindly finds patterns. A
  parser understands context, and can figure out which //'s are comments
  and which are something else. As a bonus, it can probably understand
  other forms of comments like /* */, which regex would completely die
  on.
 
 
 It is possible to write a whole parser as a single regex, albeit a terribly
 long and complex one.


 No, it isn't.



 It's better if you throw some smart words on the screen if you want to
 convince someone. Just thinking about it, it makes sense, as a true regular
 expression can only describe a regular language, and I think no
 programming language is a regular language.
 But we have PHP's PCRE with extensions like recursive patterns[1] and back
 references[2], which can describe much more than just a regular language.
 And I do believe it would be able to handle it.
 Too bad it would probably take months to complete a regular expression like
 this.


Then you should start as soon as possible, so that you do not realize it is
wrong only when it is too late. I am not going to start explaining this again,
because it becomes a waste of time. You call it smart words on the
screen, I call it advice.


 - Matijn

 [1] http://php.net/manual/en/regexp.reference.recursive.php
 [2] http://php.net/manual/en/regexp.reference.back-references.php




-- 
github.com/KingCrunch


Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls

2013-05-30 Thread David Harkness
On Wed, May 29, 2013 at 10:20 AM, Matijn Woudt tijn...@gmail.com wrote:

 It is possible to write a whole parser as a single regex, albeit a terribly
 long and complex one.


While regular expressions are often used in the lexer--the part that scans
the input stream and breaks it up into meaningful tokens like

{ keyword: function }
{ operator: + }

and

{ identifier: $foo }

that form the building blocks of the language--they aren't combined into a
single expression. Instead, a lexer generator is used to build a state
machine that switches the active expressions to check based on the previous
tokens and context. Each expression recognizes a different type of token,
and many times these aren't even regular expressions.
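
As a rough illustration of the lexer stage described above, here is a minimal JavaScript sketch that tries one small regex per token type at the current position. The rule list and the tokenize helper are hypothetical names, and unlike a generated lexer this sketch has no states that switch the active expressions based on context.

```javascript
// Hypothetical token rules. The sticky flag (y) makes each regex match
// only at exactly lastIndex, which is what a lexer needs.
const rules = [
  { type: "whitespace", re: /\s+/y },
  { type: "keyword", re: /function\b/y },
  { type: "identifier", re: /\$[A-Za-z_]\w*/y },
  { type: "operator", re: /[+\-*\/=]/y },
];

function tokenize(source) {
  const tokens = [];
  let pos = 0;
  while (pos < source.length) {
    let matched = false;
    for (const { type, re } of rules) {
      re.lastIndex = pos;
      const m = re.exec(source);
      if (m) {
        // Whitespace is consumed but not reported as a token.
        if (type !== "whitespace") tokens.push({ type, text: m[0] });
        pos = re.lastIndex;
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error("Unexpected character at " + pos);
  }
  return tokens;
}

const tokens = tokenize("function $foo + $bar");
```

Each regex only recognizes one kind of token. Deciding how the tokens fit together is left to the second stage, the parser.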

The second stage--combining tokens based on the rules of the grammar--is
more complex and beyond the abilities of regular expressions. There are
plenty of books on the subject and tools [1] to build the pieces such as
Lex, Yacc, Flex, and Bison. Someone even asked this question on Stack
Overflow [2] a few years ago. And I'm sure if you look you can find someone
who did a master's thesis proving that regular expressions cannot handle a
context-free grammar. And finally, I leave you with Jeff Atwood's article
about (not) parsing HTML with regex [3].

Peace,
David

[1] http://dinosaur.compilertools.net/
[2] http://stackoverflow.com/questions/3487089/are-regular-expressions-used-to-build-parsers
[3] http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html


[PHP] Re: need some regex help to strip out // comments but not http:// urls

2013-05-29 Thread Jonesy
On Tue, 28 May 2013 14:17:06 -0700, Daevid Vincent wrote:
 I'm adding some minification to our cache.class.php and am running into an
 edge case that is causing me grief.

 I want to remove all comments of the // variety, HOWEVER I don't want to
 remove URLs...

KISS.

To make it simple, straight-forward, and understandable next year when I 
have to re-read what I've written:

I'd change all :// to QqQ  -- or any unlikely text string.

Then I'd do whatever needs to be done to the // occurrences.

Finally, I'd change all QqQ back to ://.

Jonesy
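
For reference, the three steps above can be sketched in JavaScript roughly as follows. The function name is made up for illustration, and the sketch assumes that the placeholder never occurs in the input and that every remaining // really starts a comment.

```javascript
// Hypothetical helper illustrating the placeholder idea from this message.
function stripCommentsWithPlaceholder(source, placeholder = "QqQ") {
  // Step 1: hide every "://" so URL slashes are not mistaken for comments.
  const hidden = source.split("://").join(placeholder);
  // Step 2: drop everything from "//" to the end of each line.
  const stripped = hidden.replace(/\/\/.*$/gm, "");
  // Step 3: restore the hidden "://" sequences.
  return stripped.split(placeholder).join("://");
}

const input = 'var u = "http://example.com"; // homepage';
const output = stripCommentsWithPlaceholder(input);
```

The URL survives and the trailing comment is dropped, but only as long as both assumptions hold for the input.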


-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls

2013-05-29 Thread Sean Greenslade
On Wed, May 29, 2013 at 9:57 AM, Jonesy gm...@jonz.net wrote:
 On Tue, 28 May 2013 14:17:06 -0700, Daevid Vincent wrote:
 I'm adding some minification to our cache.class.php and am running into an
 edge case that is causing me grief.

 I want to remove all comments of the // variety, HOWEVER I don't want to
 remove URLs...

 KISS.

 To make it simple, straight-forward, and understandable next year when I
 have to re-read what I've written:

 I'd change all :// to QqQ  -- or any unlikely text string.

 Then I'd do whatever needs to be done to the // occurrences.

 Finally, I'd change all QqQ back to ://.

 Jonesy

Wow. This is just a spectacularly bad suggestion.

First off, this task is probably a bit beyond the capabilities of a
regex. Yes, you may be able to come up with something that works 99%
of the time, but this is really a job for a parser of some sort. I'm
sorry I don't have any suggestions on exactly where to go with that,
however I'm sure Google can be of assistance. The main problem is that
regex doesn't understand context. It just blindly finds patterns. A
parser understands context, and can figure out which //'s are comments
and which are something else. As a bonus, it can probably understand
other forms of comments like /* */, which regex would completely die
on.
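
To illustrate the kind of context a parser tracks, here is a minimal hand-written scanner in JavaScript. It is only a sketch under stated assumptions: stripLineComments is a hypothetical name, and it knows about single- and double-quoted strings only, not regex literals, template literals, /* */ comments, or URLs that appear outside of strings.

```javascript
// Hypothetical scanner: remembers whether it is inside a string literal,
// which is exactly the context a plain pattern match does not have.
function stripLineComments(source) {
  let out = "";
  let quote = null; // the open quote character, or null outside strings
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (quote !== null) {
      out += ch;
      if (ch === "\\" && i + 1 < source.length) {
        out += source[++i]; // copy the escaped character untouched
      } else if (ch === quote) {
        quote = null; // string literal closed
      }
    } else if (ch === '"' || ch === "'") {
      quote = ch;
      out += ch;
    } else if (ch === "/" && source[i + 1] === "/") {
      // Comment start outside a string: skip to the end of the line.
      while (i < source.length && source[i] !== "\n") i++;
      if (i < source.length) out += "\n"; // keep the line break itself
    } else {
      out += ch;
    }
  }
  return out;
}

const cleaned = stripLineComments('var u = "http://x.y"; // note\nfoo(); // bar');
```

Because the decision depends on state carried along from earlier characters, the // inside the quoted URL is left alone while the two real comments are removed.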

Blindly replacing a string with any unlikely text string is just
bad. I don't care how unlikely your text string is, it _will_
eventually show up in a page. It may take 5 years, but it'll happen.
And when it does, this little hack will blow up spectacularly.
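
A small JavaScript sketch of that failure mode, using QqQ as the placeholder. The helper name and the sample input are invented for illustration.

```javascript
// The swap-and-restore round trip on its own, with no comment stripping,
// so that any difference in the result is caused purely by the placeholder.
function placeholderRoundTrip(source, placeholder) {
  return source.split("://").join(placeholder).split(placeholder).join("://");
}

// The placeholder already occurs in the input as ordinary text.
const page = 'var code = "QqQ"; var u = "http://example.com";';
const roundTripped = placeholderRoundTrip(page, "QqQ");

// The literal "QqQ" comes back as "://", so the data is silently changed.
const corrupted = roundTripped !== page;
```

Nothing raises an error here. The output is simply different from the input, which is what makes this kind of bug hard to track down.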

I'm sorry to rain on your parade, but this is not KISS. This may seem
simple, but the submarine bugs it introduces will be a nightmare to
track down, and then you'll be in the same boat that you are in right
now. Don't do that to yourself. Do it right the first time.


-- 
--Zootboy

Sent from some sort of computing device.




Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls

2013-05-29 Thread Matijn Woudt
On Wed, May 29, 2013 at 6:08 PM, Sean Greenslade zootboys...@gmail.com wrote:

 On Wed, May 29, 2013 at 9:57 AM, Jonesy gm...@jonz.net wrote:
  On Tue, 28 May 2013 14:17:06 -0700, Daevid Vincent wrote:
  I'm adding some minification to our cache.class.php and am running into an
  edge case that is causing me grief.
 
  I want to remove all comments of the // variety, HOWEVER I don't want to
  remove URLs...
 
  KISS.
 
  To make it simple, straight-forward, and understandable next year when I
  have to re-read what I've written:
 
  I'd change all :// to QqQ  -- or any unlikely text string.
 
  Then I'd do whatever needs to be done to the // occurrences.
 
  Finally, I'd change all QqQ back to ://.
 
  Jonesy

 Wow. This is just a spectacularly bad suggestion.

 First off, this task is probably a bit beyond the capabilities of a
 regex. Yes, you may be able to come up with something that works 99%
 of the time, but this is really a job for a parser of some sort. I'm
 sorry I don't have any suggestions on exactly where to go with that,
 however I'm sure Google can be of assistance. The main problem is that
 regex doesn't understand context. It just blindly finds patterns. A
 parser understands context, and can figure out which //'s are comments
 and which are something else. As a bonus, it can probably understand
 other forms of comments like /* */, which regex would completely die
 on.


It is possible to write a whole parser as a single regex, albeit a terribly
long and complex one.
That said, no other simple approach would work either. For example, in
JavaScript you could do the following:
var http = 5;
switch(value) {
case http:// Http case here! (this would not be deleted)
// Do something
}
But most likely you wouldn't care about that.

- Matijn


Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls

2013-05-29 Thread Sean Greenslade
 It is possible to write a whole parser as a single regex, albeit a terribly
 long and complex one.
 That said, no other simple approach would work either. For example, in
 JavaScript you could do the following:
 var http = 5;
 switch(value) {
 case http:// Http case here! (this would not be deleted)
 // Do something
 }
 But most likely you wouldn't care about that.

 - Matijn

I would have to disagree. There are things that regex just can't grok at a
fundamental level, such as nested brackets (e.g. the standard block
syntax of C, JavaScript, PHP, etc.). It's not a parser, and
despite all the little lookahead/lookbehind tricks that enhanced regex can
do, it can't at a fundamental level _interpret_ the text it sees. This
task involves interpreting what the text you're looking for actually
means, and should therefore be handled by something that can
interpret.
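
To make the nested-brackets point concrete: a true regular expression has no way to count to an arbitrary nesting depth, while a few lines of code with a stack can. The following JavaScript sketch uses a hypothetical isBalanced helper and, for brevity, ignores brackets that appear inside strings or comments.

```javascript
// Hypothetical helper: checks that {} () [] are balanced and properly nested.
function isBalanced(source) {
  const closers = new Map([
    ["}", "{"],
    [")", "("],
    ["]", "["],
  ]);
  const stack = []; // currently open brackets, innermost last
  for (const ch of source) {
    if (ch === "{" || ch === "(" || ch === "[") {
      stack.push(ch);
    } else if (closers.has(ch)) {
      // A closer must match the most recently opened bracket.
      if (stack.pop() !== closers.get(ch)) return false;
    }
  }
  return stack.length === 0; // anything left open means unbalanced
}

const ok = isBalanced("function f() { if (a[0]) { g(); } }");
const bad = isBalanced("{ ( } )");
```

The stack is the piece of memory that a pattern without recursion does not have.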

Also, (I haven't tested it, but) I don't think that example you gave
would work. Without any sort of quoting around the http://, I would
assume the JS interpreter would take that double slash as a
comment starter. Do tell me if I'm wrong, though.

-- 
--Zootboy

Sent from some sort of computing device.




Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls

2013-05-29 Thread Matijn Woudt
On Wed, May 29, 2013 at 7:27 PM, Sean Greenslade zootboys...@gmail.com wrote:

  It is possible to write a whole parser as a single regex, albeit a terribly
  long and complex one.
  That said, no other simple approach would work either. For example, in
  JavaScript you could do the following:
  var http = 5;
  switch(value) {
  case http:// Http case here! (this would not be deleted)
  // Do something
  }
  But most likely you wouldn't care about that.
 
  - Matijn

 I would have to disagree. There are things that regex just can't grok at a
 fundamental level, such as nested brackets (e.g. the standard block
 syntax of C, JavaScript, PHP, etc.). It's not a parser, and
 despite all the little lookahead/lookbehind tricks that enhanced regex can
 do, it can't at a fundamental level _interpret_ the text it sees. This
 task involves interpreting what the text you're looking for actually
 means, and should therefore be handled by something that can
 interpret.


I think it should be possible, but as I said, very very complex. Let's not
try it;)


 Also, (I haven't tested it, but) I don't think that example you gave
 would work. Without any sort of quoting around the http://, I would
 assume the JS interpreter would take that double slash as a
 comment starter. Do tell me if I'm wrong, though.

Which is exactly what I meant. Because http is a variable set to 5, it is a
valid case statement, equivalent to:
switch(value) {
case 5: // Http case here! (this would not be deleted)
// Do something
}

But any regex given above would treat the first one as an http URL and
won't strip the // and everything after it, though in this modified case it
will strip the comments.
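
This can be checked with a short JavaScript sketch. The regex below is an assumed stand-in for the "leave // alone when it follows a colon" style of pattern, not one quoted from this thread.

```javascript
// Assumed stand-in regex: match "//" to end of line unless preceded by ":".
const naiveCommentRegex = /(?<!:)\/\/.*$/gm;

const snippet = [
  "var http = 5;",
  "switch (value) {",
  "case http:// Http case here!",
  "// Do something",
  "}",
].join("\n");

// The comment after "case http:" looks like a URL to the regex and is kept,
// while the comment on the following line is removed as intended.
const result = snippet.replace(naiveCommentRegex, "");
```

The leftover text is harmless to the JavaScript engine, which still sees it as a comment, but the minifier has failed to remove it.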

- Matijn


Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls

2013-05-29 Thread Sean Greenslade
On Wed, May 29, 2013 at 1:33 PM, Matijn Woudt tijn...@gmail.com wrote:



 On Wed, May 29, 2013 at 7:27 PM, Sean Greenslade zootboys...@gmail.com
 wrote:

  It is possible to write a whole parser as a single regex, albeit a terribly
  long and complex one.
  That said, no other simple approach would work either. For example, in
  JavaScript you could do the following:
  var http = 5;
  switch(value) {
  case http:// Http case here! (this would not be deleted)
  // Do something
  }
  But most likely you wouldn't care about that.
 
SNIP
 I think it should be possible, but as I said, very very complex. Let's not
 try it;)


 Also, (I haven't tested it, but) I don't think that example you gave
 would work. Without any sort of quoting around the http://, I would
 assume the JS interpreter would take that double slash as a
 comment starter. Do tell me if I'm wrong, though.

 Which is exactly what I meant. Because http is a variable set to 5, it is a
 valid case statement, equivalent to:
 switch(value) {
 case 5: // Http case here! (this would not be deleted)
 // Do something
 }

 But any regex given above would treat the first one as an http URL and
 won't strip the // and everything after it, though in this modified case it
 will strip the comments.

 - Matijn

Sorry, I slightly misinterpreted what that code was intended to do.
Regardless, it is still something that should be done by an
interpreter. So this is another edge case where regexes would more
than likely break down but an interpreter should (I do say should) do
The Right Thing.

-- 
--Zootboy

Sent from some sort of computing device.




Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls

2013-05-29 Thread Sebastian Krebs
2013/5/29 Matijn Woudt tijn...@gmail.com

 On Wed, May 29, 2013 at 6:08 PM, Sean Greenslade zootboys...@gmail.com
 wrote:

  On Wed, May 29, 2013 at 9:57 AM, Jonesy gm...@jonz.net wrote:
   On Tue, 28 May 2013 14:17:06 -0700, Daevid Vincent wrote:
   I'm adding some minification to our cache.class.php and am running into an
   edge case that is causing me grief.
  
   I want to remove all comments of the // variety, HOWEVER I don't want to
   remove URLs...
  
   KISS.
  
   To make it simple, straight-forward, and understandable next year when I
   have to re-read what I've written:
  
   I'd change all :// to QqQ  -- or any unlikely text string.
  
   Then I'd do whatever needs to be done to the // occurrences.
  
   Finally, I'd change all QqQ back to ://.
  
   Jonesy
 
  Wow. This is just a spectacularly bad suggestion.
 
  First off, this task is probably a bit beyond the capabilities of a
  regex. Yes, you may be able to come up with something that works 99%
  of the time, but this is really a job for a parser of some sort. I'm
  sorry I don't have any suggestions on exactly where to go with that,
  however I'm sure Google can be of assistance. The main problem is that
  regex doesn't understand context. It just blindly finds patterns. A
  parser understands context, and can figure out which //'s are comments
  and which are something else. As a bonus, it can probably understand
  other forms of comments like /* */, which regex would completely die
  on.
 
 
 It is possible to write a whole parser as a single regex, albeit a terribly
 long and complex one.


No, it isn't.


 That said, no other simple approach would work either. For example, in
 JavaScript you could do the following:
 var http = 5;
 switch(value) {
 case http:// Http case here! (this would not be deleted)
 // Do something
 }
 But most likely you wouldn't care about that.

 - Matijn




-- 
github.com/KingCrunch


Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls

2013-05-29 Thread Matijn Woudt
On Wed, May 29, 2013 at 10:51 PM, Sebastian Krebs krebs@gmail.com wrote:




 2013/5/29 Matijn Woudt tijn...@gmail.com

 On Wed, May 29, 2013 at 6:08 PM, Sean Greenslade zootboys...@gmail.com
 wrote:

  On Wed, May 29, 2013 at 9:57 AM, Jonesy gm...@jonz.net wrote:
   On Tue, 28 May 2013 14:17:06 -0700, Daevid Vincent wrote:
   I'm adding some minification to our cache.class.php and am running into an
   edge case that is causing me grief.
  
   I want to remove all comments of the // variety, HOWEVER I don't want to
   remove URLs...
  
   KISS.
  
   To make it simple, straight-forward, and understandable next year when I
   have to re-read what I've written:
  
   I'd change all :// to QqQ  -- or any unlikely text string.
  
   Then I'd do whatever needs to be done to the // occurrences.
  
   Finally, I'd change all QqQ back to ://.
  
   Jonesy
 
  Wow. This is just a spectacularly bad suggestion.
 
  First off, this task is probably a bit beyond the capabilities of a
  regex. Yes, you may be able to come up with something that works 99%
  of the time, but this is really a job for a parser of some sort. I'm
  sorry I don't have any suggestions on exactly where to go with that,
  however I'm sure Google can be of assistance. The main problem is that
  regex doesn't understand context. It just blindly finds patterns. A
  parser understands context, and can figure out which //'s are comments
  and which are something else. As a bonus, it can probably understand
  other forms of comments like /* */, which regex would completely die
  on.
 
 
 It is possible to write a whole parser as a single regex, albeit a terribly
 long and complex one.


 No, it isn't.



It's better if you throw some smart words on the screen if you want to
convince someone. Just thinking about it, it makes sense, as a true regular
expression can only describe a regular language, and I think no
programming language is a regular language.
But we have PHP's PCRE with extensions like recursive patterns[1] and back
references[2], which can describe much more than just a regular language.
And I do believe it would be able to handle it.
Too bad it would probably take months to complete a regular expression like this.

- Matijn

[1] http://php.net/manual/en/regexp.reference.recursive.php
[2] http://php.net/manual/en/regexp.reference.back-references.php


Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls

2013-05-29 Thread Ashley Sheridan


Matijn Woudt tijn...@gmail.com wrote:

On Wed, May 29, 2013 at 10:51 PM, Sebastian Krebs krebs@gmail.com wrote:




 2013/5/29 Matijn Woudt tijn...@gmail.com

 On Wed, May 29, 2013 at 6:08 PM, Sean Greenslade zootboys...@gmail.com wrote:

  On Wed, May 29, 2013 at 9:57 AM, Jonesy gm...@jonz.net wrote:
   On Tue, 28 May 2013 14:17:06 -0700, Daevid Vincent wrote:
   I'm adding some minification to our cache.class.php and am running into an
   edge case that is causing me grief.
  
   I want to remove all comments of the // variety, HOWEVER I don't want to
   remove URLs...
  
   KISS.
  
   To make it simple, straight-forward, and understandable next year when I
   have to re-read what I've written:
  
   I'd change all :// to QqQ  -- or any unlikely text string.
  
   Then I'd do whatever needs to be done to the // occurrences.
  
   Finally, I'd change all QqQ back to ://.
  
   Jonesy
 
  Wow. This is just a spectacularly bad suggestion.
 
  First off, this task is probably a bit beyond the capabilities of a
  regex. Yes, you may be able to come up with something that works 99%
  of the time, but this is really a job for a parser of some sort. I'm
  sorry I don't have any suggestions on exactly where to go with that,
  however I'm sure Google can be of assistance. The main problem is that
  regex doesn't understand context. It just blindly finds patterns. A
  parser understands context, and can figure out which //'s are comments
  and which are something else. As a bonus, it can probably understand
  other forms of comments like /* */, which regex would completely die
  on.
 
 
 It is possible to write a whole parser as a single regex, albeit a terribly
 long and complex one.


 No, it isn't.



It's better if you throw some smart words on the screen if you want to
convince someone. Just thinking about it, it makes sense, as a true regular
expression can only describe a regular language, and I think no
programming language is a regular language.
But we have PHP's PCRE with extensions like recursive patterns[1] and back
references[2], which can describe much more than just a regular language.
And I do believe it would be able to handle it.
Too bad it would probably take months to complete a regular expression like this.

- Matijn

[1] http://php.net/manual/en/regexp.reference.recursive.php
[2] http://php.net/manual/en/regexp.reference.back-references.php

Sometimes when all you know is regex, everything looks like a nail...

Thanks,
Ash
