Re: [PHP] Detecting Binaries

2004-02-23 Thread Axel IS Main
Guys, this isn't THAT stupid of a question is it? From my perspective, 
the way PHP seems to see it is that I should already know what kind of 
file I'm looking at. In most cases that's not an unreasonable 
assumption. Unfortunately, that's only good for most cases. PHP is rich 
in ways to work with the HTTP protocol, but has no way of detecting 
whether it's opening a text file or a binary file. To me this is a 
glaring omission. There has to be a way to do it, even if it's a 
roundabout or backdoor kind of way. Nothing is impossible.

Nick

Axel IS Main wrote:

I'm using file_get_contents() to open URLs. Does anyone know if there 
is a way to look at the result and determine if the file is binary? 
I'd like to be able to block binaries from being processed without 
having to try to think of all the possible binary extensions and omit 
them with a function that looks for these extensions.

Nick

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP] Detecting Binaries

2004-02-23 Thread Adam Voigt
Couldn't you just check the extension on the file?


-- 

Adam Voigt
[EMAIL PROTECTED]




Re: [PHP] Detecting Binaries

2004-02-23 Thread Axel IS Main
Yes, and in fact that is what I am doing now. This is a spider bot 
though, so I'm having to think of every single type of binary file that 
could be linked to on the web. So far I'm up to 28 with no end in sight. 
What about a .com file? I can't omit links that end in .com can I? That 
would be counterproductive to say the least. Also, the function that 
does the checking just keeps getting longer and longer, which makes the 
spider go slower and slower. Granted, the thing is pretty fast if it has 
enough BW to work with, but still. This could eventually turn into a 
script killer. Detecting whether the stream from file_get_contents(), or 
fopen() for that matter, is binary or not and going with that result is 
the elegant solution to this problem. There has to be a way to do it.

Nick



Re: [PHP] Detecting Binaries

2004-02-23 Thread Jas
Well, you can do a check on the MIME type of the file, e.g.

$mimes = array(1 => 'application/octet-stream',
   2 => 'image/jpeg',
etc.

For more info...
http://us4.php.net/manual/en/ref.filesystem.php

Just like the file upload function, you can check for the MIME types...
http://us4.php.net/manual/en/features.file-upload.php

Just a thought; it might not be a complete solution, however.

HTH
Jas
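
For a spider that fetches over HTTP, one way to apply the MIME type idea is to read the Content-Type header the server sent. The sketch below is only an illustration: the function name looks_like_text() and the short list of text-like types are assumptions, not anything from this thread, and a server can report the wrong type.

```php
<?php
// Hypothetical helper: decide from raw HTTP header lines whether the body
// was declared as text. $headers has the same shape as the
// $http_response_header array PHP fills in after file_get_contents()
// on an http:// URL.
function looks_like_text(array $headers): bool
{
    $type = null;
    foreach ($headers as $line) {
        // After redirects there can be several header sets; the last
        // Content-Type seen is the one that describes the final body.
        if (stripos($line, 'Content-Type:') === 0) {
            $type = strtolower(trim(substr($line, strlen('Content-Type:'))));
        }
    }
    if ($type === null) {
        return false; // no declared type: treat as unknown, not as text
    }
    // Drop parameters such as "; charset=utf-8".
    $type = trim(explode(';', $type)[0]);
    // Assumed whitelist: anything under text/ plus two XML-based types.
    return strpos($type, 'text/') === 0
        || in_array($type, ['application/xhtml+xml', 'application/xml'], true);
}
```

In a script like the one discussed here it could be called right after the fetch, for example as looks_like_text($http_response_header) following $read = file_get_contents($url).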


Re: [PHP] Detecting Binaries

2004-02-23 Thread Adam Voigt
Well, actually, to check for .com files, just make sure the URL contains a /
before the .com. That will skip yahoo.com but still catch
yahoo.com/downloadme.com
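
A rough sketch of that check, looking only at the path part of the URL so the host name never counts as an extension. The helper name url_path_extension() is made up for illustration, and it assumes absolute URLs that include a scheme such as http://.

```php
<?php
// Hypothetical helper: return the lowercase extension of the file named in
// the URL's path, or '' when the URL has no path or no extension.
function url_path_extension(string $url): string
{
    // parse_url() returns null here when the URL has no path component.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        return '';
    }
    return strtolower(pathinfo($path, PATHINFO_EXTENSION));
}
```

With this, http://yahoo.com yields an empty string while http://yahoo.com/downloadme.com yields "com", and a query string after the file name is ignored. A bare "yahoo.com" with no scheme would be parsed as a path, so such links need to be made absolute first.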


-- 

Adam Voigt
[EMAIL PROTECTED]




Re: [PHP] Detecting Binaries

2004-02-23 Thread Adam Bregenzer
You could try writing a script to check the first several bytes of
the file for control characters.  If the first 1 KB is >= 20% (a figure
pulled at random from my head) control characters, it's a safe bet it is a
binary file.  This is not 100% accurate, but it's something to play with that
doesn't rely on MIME types or file extensions, both of which can easily
be inaccurate.
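
A minimal sketch of that heuristic, under stated assumptions: the name looks_binary() is hypothetical, "control characters" is taken to mean bytes 0-31 other than tab, line feed and carriage return, and the 1 KB sample and 20% threshold are the off-the-cuff figures from the paragraph above rather than tuned values.

```php
<?php
// Hypothetical helper: guess "binary" when at least $threshold of the first
// $sample bytes are control characters.
function looks_binary(string $data, int $sample = 1024, float $threshold = 0.20): bool
{
    if ($data === '') {
        return false; // nothing to inspect
    }
    $head = substr($data, 0, $sample);
    // Count bytes in 0x00-0x08, 0x0B-0x0C and 0x0E-0x1F, which leaves out
    // tab (0x09), line feed (0x0A) and carriage return (0x0D).
    $control = preg_match_all('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $head);
    return $control / strlen($head) >= $threshold;
}
```

Both parameters are there so the sample size and threshold can be adjusted after trying it on real pages.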

-- 
Adam Bregenzer
[EMAIL PROTECTED]
http://adam.bregenzer.net/




Re: [PHP] Detecting Binaries

2004-02-23 Thread Marek Kilimajer
Generally, binaries have \0 in them, but not necessarily.
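
As a quick illustration of the \0 test (the function name has_nul_byte() is made up for this sketch):

```php
<?php
// Hypothetical helper: true when the data contains at least one NUL byte.
function has_nul_byte(string $data): bool
{
    return strpos($data, "\0") !== false;
}
```

It is cheap, but it has the limits noted above: a binary file without any NUL byte slips through, and text stored as UTF-16 would be flagged because that encoding puts NUL bytes into ordinary ASCII characters.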



Re: [PHP] Detecting Binaries

2004-02-23 Thread Axel IS Main
Thanks, that's very helpful. It beats the heck out of doing it the way 
I've been doing it.

Richard Davey wrote:

Hello Axel,

Monday, February 23, 2004, 7:38:25 PM, you wrote:

AIM Thanks, you just gave me the solution, I think. I don't have to strip
AIM out every character above standard ascii, I just have to look for them.
AIM If one is there, then just get rid of it. It's true that an OS can't
AIM tell the difference between a jpg and an exe file, but that's to be
AIM expected. But the file_get_contents() function DOES open the file. Since
AIM there is a definite difference between a text file and a binary file, it
AIM should be able to detect that.

The difference isn't as obvious as you might think. Opening a binary
file in a hex editor will show you this. Your brain can determine if
the codes in front of you are English or not, but from a pure logic
point of view that's a little harder.

Also bear in mind that on Unix ALL files are binary files. It is up to
you to determine the type of the file contents as you see fit. For
example, you can check for line-terminated data.

It would be wise to check for characters from 0 to 31; if they appear,
then it's almost certainly (but not guaranteed) binary.
 



Re: [PHP] Detecting Binaries

2004-02-23 Thread Axel IS Main
That's not bad, but I found a way to do it simply using chr() and 
passing it a value. It turns out that if I go 0-31, almost nothing will 
get through. Even the simplest HTML has something in there from that 
list. However, by just looking between 14 and 26, one more than carriage 
return and one less than escape, it worked really well. I crawled a 
site with a large number of jpg, gif, mp3, wav, and pdf files. Of the 
hundreds of binaries there, only one pdf got through. Not a bad record. I 
also found that in order for this to work I have to process the URLs. 
This makes things really slow, so I'm going to have to use both this and 
the extension-checking function together. Still, I can worry a lot 
less about getting my index weighed down by binary files. The code is 
pretty basic at this point, but here it is:

   // Check for binaries: look for any control character from 14 to 26
   $ckbin = 14;
   while($ckbin <= 26){
      $ck = chr($ckbin);
      $cbin = substr_count($read, $ck);
      if($cbin > 0){
         echo "Killing off binary file URL: $url\n";
         $kill = mysql_unbuffered_query("DELETE FROM search WHERE url_id='$url_id'");
         // Skip to the next URL in the enclosing loop (not shown here)
         continue 2;
      }
      ++$ckbin;
   }
I know it looks kind of funky out of context, but it works really great.
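
For comparison, the same 14-to-26 test can be written as one pass over the data with a character class instead of thirteen substr_count() calls. This is only a sketch; the function name has_control_14_to_26() is hypothetical, and whether it is actually faster would need measuring.

```php
<?php
// Hypothetical helper: true when $read contains any byte from chr(14) to
// chr(26), the same range the loop above tests one character at a time.
function has_control_14_to_26(string $read): bool
{
    // 14 is 0x0E and 26 is 0x1A; preg_match() stops at the first hit.
    return preg_match('/[\x0E-\x1A]/', $read) === 1;
}
```

Inside the spider's loop the result could drive the same delete-and-continue step as the original code.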

Nick

Richard Davey wrote:

Hello Evan,

Monday, February 23, 2004, 8:57:43 PM, you wrote:

 

It would be wise to check for characters from 0 to 31, if they appear
then it's almost certainly (but not guaranteed) binary.
 

EN Assuming that's decimal, you're including 0x09 0x0a and 0x0d which are,
EN respectively, tab, line feed, and carriage return. That's off the top of my
EN head, which means two things: (1) I may be forgetting something, and (2) I
EN need a life ;)

Let me rephrase - check for the existence of characters 0 through 31
and count how many there are. Set a percentage weight yourself and
figure out in your script if you deem the quantity too many or too
few.

The count_chars() function will be absolutely ideal for this.
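
A sketch of how the count_chars() suggestion might look. The function name control_char_percent() is an assumption for illustration, and leaving tab, line feed and carriage return out of the count follows Evan's remark about 0x09, 0x0a and 0x0d; the cutoff percentage is left to the caller.

```php
<?php
// Hypothetical helper: percentage of bytes in $data that are control
// characters 0-31, not counting tab (9), line feed (10) and carriage
// return (13).
function control_char_percent(string $data): float
{
    $length = strlen($data);
    if ($length === 0) {
        return 0.0;
    }
    // Mode 1 returns byte value => occurrences, only for bytes that occur.
    $counts = count_chars($data, 1);
    $control = 0;
    foreach ($counts as $byte => $n) {
        if ($byte < 32 && !in_array($byte, [9, 10, 13], true)) {
            $control += $n;
        }
    }
    return 100.0 * $control / $length;
}
```

A script would then compare the result against whatever weight it settles on, for example treating anything above a few percent as binary.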

 



Re: [PHP] Detecting Binaries

2004-02-23 Thread Evan Nemerson
On Monday 23 February 2004 03:02 pm, Axel IS Main wrote:
 That's not bad, but I found a way to do it simply using chr() and
 passing it a value. It turns out that if I go 0-31, almost nothing will
 get through. Even the simplest HTML has something in there from that
 list. However, by just looking between 14 and 26, one more than carriage
 return and one less than escape, it worked really well. I crawled a
 site with a large number of jpg, gif, mp3, wav, and pdf files. Of the
 hundreds of binaries there, only one pdf got through. Not a bad record.

It should be noted that a PDF isn't necessarily binary. It's just that most 
people like to use compression and embed images, sounds, etc. But if you want 
to, you can fire up emacs and create a PDF from scratch. So really the record 
is better than you think ;)

-- 
Evan Nemerson
[EMAIL PROTECTED]
http://coeusgroup.com/en




Re: [PHP] Detecting Binaries

2004-02-23 Thread Shane Nelson


Richard Davey wrote:

Hello Axel,

Monday, February 23, 2004, 7:03:38 PM, you wrote:

AIM Guys, this isn't THAT stupid of a question is it? From my perspective,
AIM the way PHP seems to see it is that I should already know what kind of
AIM file I'm looking at. In most cases that's not an unreasonable 
AIM assumption. Unfortunately, that's only good for most cases. PHP is rich

Even Windows doesn't *know* what type of file you've got until you
actually try and open it. You could rename a jpg to mp3 and you won't
know about it until Winamp moans at you as you open it.

FTP programs seem to know what kind of file you are transferring. Hence 
the ability to switch the transfer between AUTO, BINARY and ASCII modes. 
I usually leave mine in AUTO mode and it seems to figure it out OK. 
I'd suggest looking at the source of an FTP client to see how they do 
it. FileZilla is open source:

  http://filezilla.sourceforge.net/

Good luck.

Shane
