Re: [PHP] Detecting Binaries
Guys, this isn't THAT stupid of a question, is it? From my perspective, PHP seems to assume that I already know what kind of file I'm looking at. In most cases that's not an unreasonable assumption. Unfortunately, it only holds for most cases. PHP is rich in ways to work with HTTP, but it has no way of detecting whether it's opening a text file or a binary file. To me this is a glaring omission. There has to be a way to do it, even if it's a roundabout or backdoor kind of way. Nothing is impossible.

Nick

Axel IS Main wrote:
> I'm using file_get_contents() to open URLs. Does anyone know if there is a way to look at the result and determine if the file is binary? I'd like to be able to block binaries from being processed without having to try to think of all the possible binary extensions and omit them with a function that looks for these extensions.
>
> Nick

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Detecting Binaries
Couldn't you just check the extension on the file?

On Mon, 2004-02-23 at 14:03, Axel IS Main wrote:
> Guys, this isn't THAT stupid of a question is it? [...]

--
Adam Voigt [EMAIL PROTECTED]
Re: [PHP] Detecting Binaries
Yes, and in fact that is what I am doing now. This is a spider bot, though, so I have to think of every single type of binary file that could be linked to on the web. So far I'm up to 28 with no end in sight. What about a .com file? I can't omit links that end in .com, can I? That would be counterproductive, to say the least. Also, the function that does the checking just keeps getting longer and longer, which makes the spider go slower and slower. Granted, the thing is pretty fast if it has enough bandwidth to work with, but still. This could eventually turn into a script killer.

Detecting whether the stream from file_get_contents(), or fopen() for that matter, is binary or not, and going with that result, is the elegant solution to this problem. There has to be a way to do it.

Nick

Adam Voigt wrote:
> Couldn't you just check the extension on the file?
Re: [PHP] Detecting Binaries
Well, you can check the MIME type of the file, e.g.

    $mimes = array(1 => 'application/octet-stream', 2 => 'image/jpeg');

and so on. For more info:
http://us4.php.net/manual/en/ref.filesystem.php

Just like with the file upload function, you can check for the MIME types:
http://us4.php.net/manual/en/features.file-upload.php

Just a thought; it might not be a complete solution, however.

HTH
Jas

Axel Is Main wrote:
> Guys, this isn't THAT stupid of a question is it? [...]
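For URLs fetched with file_get_contents(), one way to act on the MIME type suggestion is to look at the Content-Type response header. PHP's HTTP wrapper fills the local variable $http_response_header after such a call. The sketch below only parses a header array; the function name isTextContentType() and the list of accepted types are assumptions made for illustration, not something from this thread, and it uses PHP 7 syntax. Servers can send a wrong or missing Content-Type, so this is a first filter at best.

```php
<?php
// Hypothetical helper: true when the last Content-Type header names a text type.
function isTextContentType(array $headers): bool
{
    $type = null;
    foreach ($headers as $header) {
        // After redirects the array can hold several responses; keep the last match.
        if (stripos($header, 'Content-Type:') === 0) {
            $type = strtolower(trim(substr($header, strlen('Content-Type:'))));
        }
    }
    if ($type === null) {
        return false; // no header at all: treat as "not known to be text"
    }
    // Drop parameters such as "; charset=utf-8".
    $type = trim(explode(';', $type)[0]);

    // Assumed whitelist; adjust to whatever the spider should index.
    return strpos($type, 'text/') === 0
        || in_array($type, ['application/xhtml+xml', 'application/xml'], true);
}

// Intended use inside the spider (not executed here):
//   $read = @file_get_contents($url);
//   if ($read === false || !isTextContentType($http_response_header)) { /* skip URL */ }
```

Checking the header first is cheap, and the content checks discussed later in the thread can still run on whatever passes.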
Re[2]: [PHP] Detecting Binaries
Hello Axel,

Monday, February 23, 2004, 7:03:38 PM, you wrote:

AIM> Guys, this isn't THAT stupid of a question is it? From my perspective,
AIM> the way PHP seems to see it is that I should already know what kind of
AIM> file I'm looking at. In most cases that's not an unreasonable
AIM> assumption. Unfortunately, that's only good for most cases. PHP is rich

Even Windows doesn't *know* what type of file you've got until you actually try to open it. You could rename a jpg to mp3 and you won't know about it until Winamp moans at you as you open it.

AIM> in ways to work with the HTTP protocol, but has no way of detecting
AIM> whether it's opening a text file or a binary file. To me this is a
AIM> glaring omission. There has to be a way to do it, even if it's a
AIM> round-a-bout or backdoor kind of way. Nothing is impossible.

You could say it's a glaring omission from operating systems too, because most succumb to this. The only way to tell for sure is to read in the header of the file and parse it. If you are blanket rejecting all binaries - good luck, it'll take ages.

Another solution might be to just treat the file as text regardless and strip out every byte that is above the standard ASCII range. Hello CPU upgrade requirement.

--
Best regards,
 Richard Davey
 http://www.phpcommunity.org/wiki/296.html
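To illustrate what "read in the header of the file and parse it" can look like, here is a minimal sketch that compares the first bytes of the downloaded data against a handful of well-known file signatures. The function name sniffSignature() and the choice of formats are assumptions for the example, and it uses PHP 7.1 syntax. As Richard says, covering every binary format this way would take ages, so a table like this can only catch the common cases.

```php
<?php
// Hypothetical helper: returns a label for a few well-known signatures, or null.
function sniffSignature(string $data): ?string
{
    // Leading "magic" bytes of some common formats.
    $signatures = [
        "\xFF\xD8\xFF"      => 'jpeg',
        "GIF87a"            => 'gif',
        "GIF89a"            => 'gif',
        "\x89PNG\r\n\x1A\n" => 'png',
        "%PDF-"             => 'pdf',
        "PK\x03\x04"        => 'zip',
    ];
    foreach ($signatures as $magic => $label) {
        // Compare only as many bytes as the signature is long.
        if (strncmp($data, $magic, strlen($magic)) === 0) {
            return $label;
        }
    }
    return null; // unknown: could be text, could be an unlisted binary format
}
```

A null result says nothing either way, so this works better as a quick reject list in front of a content-based check than as the only test.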
Re: [PHP] Detecting Binaries
Well, actually, to check for .com files, just make sure the URL contains a / before the .com. That will let yahoo.com through but still catch yahoo.com/downloadme.com.

On Mon, 2004-02-23 at 14:19, Axel IS Main wrote:
> [...] What about a .com file? I can't omit links that end in .com can I? That would be counterproductive to say the least. [...]

--
Adam Voigt [EMAIL PROTECTED]
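The "/ before the .com" idea amounts to looking at the extension of the URL's path rather than at the end of the whole URL. A small sketch using PHP's parse_url() and pathinfo() follows; the function name urlPathExtension() is made up for this example.

```php
<?php
// Hypothetical helper: lower-cased extension of the URL's path, or '' if none.
function urlPathExtension(string $url): string
{
    // parse_url() returns null when the URL has no path, false when it is malformed.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return '';
    }
    return strtolower(pathinfo($path, PATHINFO_EXTENSION));
}

// The host name never counts as an extension:
//   urlPathExtension('http://yahoo.com')                is ''
//   urlPathExtension('http://yahoo.com/downloadme.com') is 'com'
```

Because the query string is not part of the path, something like ?dl=1 after the file name does not hide the extension.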
Re: [PHP] Detecting Binaries
On Mon, 2004-02-23 at 14:19, Axel IS Main wrote:
> [...] Detecting whether the stream from file_get_contents(), or fopen() for that matter, is binary or not and going with that result is the elegant solution to this problem. There has to be a way to do it.

You could try writing a script to check the first several bytes of the file for control characters. If the first 1 KB is >= 20% (randomly pulled from my head) control characters, it's a safe bet it is a binary file. This is not 100% accurate, but it's something to play with that doesn't rely on MIME types or file extensions, both of which can easily be inaccurate.

--
Adam Bregenzer [EMAIL PROTECTED] http://adam.bregenzer.net/
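A minimal sketch of that heuristic, assuming the data is already in a string as returned by file_get_contents(). The function names are invented for the example, the 20% threshold is the off-the-cuff figure from the message above, and leaving tab, line feed and carriage return out of the count is an added assumption, since ordinary text is full of them.

```php
<?php
// Hypothetical helper: share of control bytes in the first $sampleSize bytes.
function controlCharRatio(string $data, int $sampleSize = 1024): float
{
    $sample = substr($data, 0, $sampleSize);
    $length = strlen($sample);
    if ($length === 0) {
        return 0.0; // nothing to judge
    }
    // Bytes 0-31 except tab (0x09), line feed (0x0A) and carriage return (0x0D).
    $control = preg_match_all('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $sample);
    return $control / $length;
}

// Hypothetical helper: applies the threshold to the ratio above.
function looksBinaryByRatio(string $data, float $threshold = 0.20): bool
{
    return controlCharRatio($data) >= $threshold;
}
```

Sampling only the first kilobyte keeps the cost constant no matter how large the download is, at the price of missing files whose binary part starts later.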
Re: [PHP] Detecting Binaries
Generally, binaries have \0 (NUL) bytes in them, but not necessarily.

Axel IS Main wrote:
> Guys, this isn't THAT stupid of a question is it? [...]
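The NUL-byte test is about the cheapest check there is. A sketch, with an invented function name and an arbitrary default sample size:

```php
<?php
// Hypothetical helper: true if a NUL byte occurs in the first $sampleSize bytes.
function containsNulByte(string $data, int $sampleSize = 8192): bool
{
    return strpos(substr($data, 0, $sampleSize), "\0") !== false;
}
```

As the message says, this is not conclusive: a binary file may happen to have no NUL byte in the sampled range, and text encoded as UTF-16 contains NUL bytes even though it is text.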
Re[2]: [PHP] Detecting Binaries
Hello Axel,

Monday, February 23, 2004, 7:38:25 PM, you wrote:

AIM> Thanks, you just gave me the solution, I think. I don't have to strip
AIM> out every character above standard ascii, I just have to look for them.
AIM> If one is there, then just get rid of it. It's true that an OS can't
AIM> tell the difference between a jpg and an exe file, but that's to be
AIM> expected. But the file_get_contents() function DOES open the file. Since
AIM> there is a definite difference between a text file and a binary file, it
AIM> should be able to detect that.

The difference isn't as obvious as you might think. Opening a binary file in a hex editor will show you this. Your brain can determine whether the codes in front of you are English or not, but from a pure logic point of view that's a little harder. Also bear in mind that on Unix ALL files are binary files. It is up to you to determine the type of the file contents as you see fit. For example, you can check for line-terminated data.

It would be wise to check for characters from 0 to 31; if they appear, then it's almost certainly (but not guaranteed) binary.

--
Best regards,
 Richard Davey
 http://www.phpcommunity.org/wiki/296.html
Re: [PHP] Detecting Binaries
Thanks, that's very helpful. It beats the heck out of doing it the way I've been doing it.

Richard Davey wrote:
> [...] It would be wise to check for characters from 0 to 31, if they appear then it's almost certainly (but not guaranteed) binary.
Re: Re[2]: [PHP] Detecting Binaries
On Monday 23 February 2004 11:55 am, Richard Davey wrote:
> [...]
> It would be wise to check for characters from 0 to 31, if they appear
> then it's almost certainly (but not guaranteed) binary.

Assuming that's decimal, you're including 0x09, 0x0a, and 0x0d, which are, respectively, tab, line feed, and carriage return. That's off the top of my head, which means two things: (1) I may be forgetting something, and (2) I need a life ;)

I'm not up to speed on this thread, but perhaps you could (ab)use some techniques from natural language processing? May be overkill, though ;)

--
Evan Nemerson [EMAIL PROTECTED] http://coeusgroup.com/en
Re[4]: [PHP] Detecting Binaries
Hello Evan,

Monday, February 23, 2004, 8:57:43 PM, you wrote:

>> It would be wise to check for characters from 0 to 31, if they appear
>> then it's almost certainly (but not guaranteed) binary.

EN> Assuming that's decimal, you're including 0x09 0x0a and 0x0d which are,
EN> respectively, tab, line feed, and carriage return. That's off the top of my
EN> head, which means two things: (1) i may be forgetting something, and (2) I
EN> need a life ;)

Let me rephrase - check for the existence of characters 0 through 31 and count how many there are. Set a percentage weight yourself and figure out in your script if you deem the quantity too many or too few. The count_chars() function will be absolutely ideal for this.

--
Best regards,
 Richard Davey
 http://www.phpcommunity.org/wiki/296.html
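A sketch of the count_chars() approach follows. In mode 1, count_chars() returns an array keyed by byte value that holds only the bytes that actually occur. The function name controlCharShare() is invented here, and skipping tab, line feed and carriage return follows Evan's remark rather than anything Richard specified; the weight to compare the result against is left to the caller, as suggested.

```php
<?php
// Hypothetical helper: fraction of bytes in $data that are control characters.
function controlCharShare(string $data): float
{
    $length = strlen($data);
    if ($length === 0) {
        return 0.0;
    }
    $control = 0;
    // Mode 1: only byte values with a count greater than zero are listed.
    foreach (count_chars($data, 1) as $byte => $count) {
        // Assumption: tab (9), line feed (10) and carriage return (13) count as text.
        if ($byte < 32 && !in_array($byte, [9, 10, 13], true)) {
            $control += $count;
        }
    }
    return $control / $length;
}

// Example decision with a caller-chosen weight (the 0.05 is purely illustrative):
//   $isBinary = controlCharShare($read) > 0.05;
```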
Re: [PHP] Detecting Binaries
That's not bad, but I found a way to do it simply using chr() and passing it a value. It turns out that if I go 0-31, almost nothing will get through. Even the simplest HTML has something in there from that list. However, by just looking between 14 and 26, one more than carriage return and one less than escape, it worked really well. I crawled a site with a large number of jpg, gif, mp3, wav, and pdf files. Of the hundreds of binaries there, only one pdf got through. Not a bad record.

I also found that in order for this to work I have to process the URLs. This makes things really slow, so I'm going to have to use both this and the check-for-extension function together. Still, I can worry a lot less about getting my index weighed down by binary files. The code is pretty basic at this point, but here it is:

    // Check for binaries
    $ckbin = 14;
    while ($ckbin <= 26) {
        $ck = chr($ckbin);
        $cbin = substr_count($read, $ck);
        if ($cbin > 0) {
            echo "Killing off binary file URL: $url\n";
            $kill = mysql_unbuffered_query("DELETE FROM search WHERE url_id='$url_id'");
            continue 2;
        }
        ++$ckbin;
    }

I know it looks kind of funky out of context, but it works really great.

Nick

Richard Davey wrote:
> [...] The count_chars() function will be absolutely ideal for this.
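For readers who want to try the 14-to-26 check outside the spider's loop, the same test can be written as a standalone function with a single regular expression instead of thirteen substr_count() passes. This is a sketch, not Nick's code: the function name is made up, and the database cleanup is left to the caller.

```php
<?php
// Hypothetical helper: true if any byte with value 14..26 (0x0E..0x1A) occurs.
function hasSuspectControlBytes(string $data): bool
{
    // One pass over the data; stops at the first matching byte.
    return preg_match('/[\x0E-\x1A]/', $data) === 1;
}

// Intended use in the crawl loop (not executed here):
//   if (hasSuspectControlBytes($read)) { /* delete the URL and continue */ }
```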
Re: [PHP] Detecting Binaries
On Monday 23 February 2004 03:02 pm, Axel IS Main wrote:
> [...] I crawled a site with a large number of jpg, gif, mp3, wav, and
> pdf files. Of the 100's of binaries there only one pdf got through.
> Not a bad record.

It should be noted that PDF isn't necessarily a binary format. It's just that most people like to use compression and embed images, sounds, etc. But if you want to, you can fire up emacs and create a PDF from scratch. So really the record is better than you think ;)

> [...]

--
Evan Nemerson [EMAIL PROTECTED] http://coeusgroup.com/en
Re: Re[4]: [PHP] Detecting Binaries
Alternatively, count unigrams in the first 1000 characters and get the Euclidean distance to a sample from, e.g., an English text, a French text, a Chinese text, etc.

- Lucas
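One possible reading of this suggestion, sketched at the byte level: build a relative frequency table of single bytes for the first 1000 bytes, do the same for a reference text, and compare the two tables by Euclidean distance. The function names are invented, treating "unigrams" as single bytes is an assumption (multi-byte encodings such as those used for Chinese would need more care), and the distance below which a download counts as text would have to be tuned on real samples.

```php
<?php
// Hypothetical helper: relative frequency of each byte value in the sample.
function byteFrequencies(string $data, int $sampleSize = 1000): array
{
    $sample = substr($data, 0, $sampleSize);
    $length = strlen($sample);
    $freq = array_fill(0, 256, 0.0); // one slot per possible byte value
    if ($length === 0) {
        return $freq;
    }
    foreach (count_chars($sample, 1) as $byte => $count) {
        $freq[$byte] = $count / $length;
    }
    return $freq;
}

// Hypothetical helper: Euclidean distance between two 256-entry tables.
function euclideanDistance(array $a, array $b): float
{
    $sum = 0.0;
    for ($i = 0; $i < 256; $i++) {
        $diff = $a[$i] - $b[$i];
        $sum += $diff * $diff;
    }
    return sqrt($sum);
}
```

With one reference table per language, the smallest distance gives a rough guess at the language, and a large distance to every reference hints at binary data.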
Re: [PHP] Detecting Binaries
Richard Davey wrote:
> Even Windows doesn't *know* what type of file you've got until you
> actually try and open it. You could rename a jpg to mp3 and you won't
> know about it until Winamp moans at you as you open it.

FTP programs seem to know what kind of file you are transferring, hence the ability to switch the transfer between AUTO, BINARY and ASCII modes. I usually leave mine in AUTO mode and it seems to figure it out OK.

I'd suggest looking at the source of an FTP client to see how they do it. FileZilla is open source: http://filezilla.sourceforge.net/

Good luck.

Shane
[PHP] Detecting Binaries
I'm using file_get_contents() to open URLs. Does anyone know if there is a way to look at the result and determine if the file is binary? I'd like to be able to block binaries from being processed without having to try to think of all the possible binary extensions and omit them with a function that looks for these extensions.

Nick