Re: FW: [PHP] Accessing Files Outside the Web Root
On Fri, 2013-03-15 at 09:11 -0400, Dale H. Cook wrote:
> At 09:44 PM 3/14/2013, tamouse mailing lists wrote:
>
> >If you are delivering files to a (human) user via their browser, by whatever
> >mechanism, that means someone can write a script to scrape them.
>
> That script, however, would have to be running on my host system in order to
> access the script which actually delivers the file, as the latter script is
> located outside of the web root.
>
> Dale H. Cook, Market Chief Engineer, Centennial Broadcasting,
> Roanoke/Lynchburg, VA
> http://plymouthcolony.net/starcityeng/index.html

Not really. Your script is web accessible, right? It just opens a file and delivers it to the browser of your visitor. It's easy to make a script that pretends to be a browser and makes the same request of your script to grab the file.

Thanks,
Ash
http://www.ashleysheridan.co.uk
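[Editor's note: Ash's point can be sketched in a few lines. Nothing in an HTTP request proves it came from a human's browser; any client can send browser-like headers. The User-Agent string and the commented-out URL below are illustrative assumptions, not details from the thread.]

```php
<?php
// Hedged sketch: build a stream context whose request headers look like
// an ordinary browser's. The User-Agent value here is just an example.
function browserLikeContext()
{
    return stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:19.0) " .
                        "Gecko/20100101 Firefox/19.0\r\n",
        ],
    ]);
}

// A scraper would simply do something like:
//   $pdf = file_get_contents('http://example.com/getfile.php?f=report.pdf',
//                            false, browserLikeContext());
// and the delivery script cannot tell it apart from a real visitor.
```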
Re: FW: [PHP] Accessing Files Outside the Web Root
At 09:44 PM 3/14/2013, tamouse mailing lists wrote:

>If you are delivering files to a (human) user via their browser, by whatever
>mechanism, that means someone can write a script to scrape them.

That script, however, would have to be running on my host system in order to access the script which actually delivers the file, as the latter script is located outside of the web root.

Dale H. Cook, Market Chief Engineer, Centennial Broadcasting,
Roanoke/Lynchburg, VA
http://plymouthcolony.net/starcityeng/index.html

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
Re: FW: [PHP] Accessing Files Outside the Web Root
At 04:06 AM 3/14/2013, tamouse mailing lists wrote:

>If the files are delivered via the web, by php or some other means, even if
>located outside webroot, they'd still be scrapeable.

Bots, however, being "mechanical" (i.e., hard-wired or programmed), behave in different ways than humans, and that difference can be exploited in a script. Part of the rationale for putting the files outside the root is that they have no URLs, eliminating one vulnerability (you can't scrape the URL of a file if it has no URL). Late last night I figured out why I was having trouble accessing those external files from my script, and now I'm working out the parsing details that enable one script to access multiple external files. My approach probably won't defeat all bad bots, but it will likely defeat most of them. You can't make code bulletproof, but you can wrap it in Kevlar.

Dale H. Cook, Member, NEHGS and MA Society of Mayflower Descendants;
Plymouth Co. MA Coordinator for the USGenWeb Project
Administrator of http://plymouthcolony.net
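[Editor's note: the "one script, multiple external files" arrangement Dale describes is commonly done with a whitelist lookup. The keys and paths below are hypothetical; the point is that a request can only ever name a key in the map, never an arbitrary file on disk.]

```php
<?php
// Hedged sketch: map short request keys to documents stored outside the
// web root. Unknown keys (including "../" tricks) resolve to null.
const FILE_MAP = [
    'plymouth' => '/home/user/private_docs/plymouth_colony.pdf',
    'starcity' => '/home/user/private_docs/star_city_eng.pdf',
];

function resolveRequestedFile(string $key)
{
    // Only keys present in the whitelist resolve; everything else is
    // rejected, so guessed filenames and path traversal both fail.
    return FILE_MAP[$key] ?? null;
}
```

The delivery script would call `resolveRequestedFile($_GET['f'])` and refuse the request when the result is null.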
Re: FW: [PHP] Accessing Files Outside the Web Root
On Mar 13, 2013 7:06 PM, "David Robley" wrote:
>
> "Dale H. Cook" wrote:
>
> > Dan and Marc are correct. [snip] I need to implement a solution that
> > allows humans to access my files but prevents scrapers from accessing
> > them. I will undoubtedly have to implement some type of
> > challenge-and-response in the system (such as a captcha), but as long as
> > those files are stored below the web root a scraper that has a valid URL
> > can probably grab them. That is part of what the "public" in public_html
> > implies.
>
> readfile() is probably where you want to start, in conjunction with a
> captcha or similar
>
> --
> Cheers
> David Robley

If the files are delivered via the web, by php or some other means, even if located outside webroot, they'd still be scrapeable.
Re: FW: [PHP] Accessing Files Outside the Web Root
"Dale H. Cook" wrote:

> At 05:04 PM 3/13/2013, Dan McCullough wrote:
>
>>Web bots can ignore the robots.txt file, most scrapers would.
>
> and at 05:06 PM 3/13/2013, Marc Guay wrote:
>
>>These don't sound like robots that would respect a txt file to me.
>
> Dan and Marc are correct. Although I used the terms "spiders" and
> "pirates," I believe that the correct term, as employed by Dan, is
> "scrapers," and that term might be applied to either the robot or the
> site which displays its results. One blogger has called scrapers "the
> arterial plaque of the Internet." I need to implement a solution that
> allows humans to access my files but prevents scrapers from accessing
> them. I will undoubtedly have to implement some type of
> challenge-and-response in the system (such as a captcha), but as long as
> those files are stored below the web root a scraper that has a valid URL
> can probably grab them. That is part of what the "public" in public_html
> implies.
>
> One of the reasons why this irks me is that the scrapers are all
> commercial sites, but they haven't offered me a piece of the action for
> the use of my files. My domain is an entirely non-commercial domain, and I
> provide free hosting for other non-commercial genealogical works,
> primarily pages that are part of the USGenWeb Project, which is perhaps
> the largest of all non-commercial genealogical projects.

readfile() is probably where you want to start, in conjunction with a captcha or similar.

--
Cheers
David Robley

Catholic (n.) A cat with a drinking problem.
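[Editor's note: David's readfile() suggestion might look like the minimal delivery script below. The directory path is hypothetical, and the captcha step he mentions is omitted; this only shows the delivery half.]

```php
<?php
// Hedged sketch: serve a PDF that lives outside public_html, so it has
// no URL of its own and can only be reached through this script.

function pdfHeaders(string $path): array
{
    // Headers that tell the browser what is coming and what to call it.
    return [
        'Content-Type: application/pdf',
        'Content-Disposition: inline; filename="' . basename($path) . '"',
    ];
}

function servePdf(string $path): bool
{
    if (!is_file($path)) {
        http_response_code(404);   // unknown file: plain 404, no details
        return false;
    }
    foreach (pdfHeaders($path) as $h) {
        header($h);
    }
    header('Content-Length: ' . filesize($path));
    readfile($path);               // streams the file without loading it all into memory
    return true;
}

// e.g. servePdf('/home/user/private_docs/plymouth_colony.pdf');
```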
Re: FW: [PHP] Accessing Files Outside the Web Root
At 05:04 PM 3/13/2013, Dan McCullough wrote:

>Web bots can ignore the robots.txt file, most scrapers would.

and at 05:06 PM 3/13/2013, Marc Guay wrote:

>These don't sound like robots that would respect a txt file to me.

Dan and Marc are correct. Although I used the terms "spiders" and "pirates," I believe that the correct term, as employed by Dan, is "scrapers," and that term might be applied to either the robot or the site which displays its results. One blogger has called scrapers "the arterial plaque of the Internet." I need to implement a solution that allows humans to access my files but prevents scrapers from accessing them. I will undoubtedly have to implement some type of challenge-and-response in the system (such as a captcha), but as long as those files are stored below the web root a scraper that has a valid URL can probably grab them. That is part of what the "public" in public_html implies.

One of the reasons why this irks me is that the scrapers are all commercial sites, but they haven't offered me a piece of the action for the use of my files. My domain is an entirely non-commercial domain, and I provide free hosting for other non-commercial genealogical works, primarily pages that are part of the USGenWeb Project, which is perhaps the largest of all non-commercial genealogical projects.

Dale H. Cook, Member, NEHGS and MA Society of Mayflower Descendants;
Plymouth Co. MA Coordinator for the USGenWeb Project
Administrator of http://plymouthcolony.net
Re: FW: [PHP] Accessing Files Outside the Web Root
At 04:58 PM 3/13/2013, Jen Rasmussen wrote:

>Have you tried keeping all of your documents in one directory and blocking
>that directory via a robots.txt file?

A spider used by a pirate site does not have to honor robots.txt, just as a non-Adobe PDF utility does not have to honor security settings imposed by Acrobat Pro. The use of robots.txt would succeed mainly in blocking major search engines, which are not the problem.

Dale H. Cook, Member, NEHGS and MA Society of Mayflower Descendants;
Plymouth Co. MA Coordinator for the USGenWeb Project
Administrator of http://plymouthcolony.net
Re: FW: [PHP] Accessing Files Outside the Web Root
> Have you tried keeping all of your documents in one directory and blocking
> that directory via a robots.txt file?

These don't sound like robots that would respect a txt file to me.
Re: FW: [PHP] Accessing Files Outside the Web Root
Web bots can ignore the robots.txt file; most scrapers would.

On Mar 13, 2013 4:59 PM, "Jen Rasmussen" wrote:
> -----Original Message-----
> From: Dale H. Cook [mailto:radiot...@plymouthcolony.net]
> Sent: Wednesday, March 13, 2013 3:38 PM
> To: php-general@lists.php.net
> Subject: [PHP] Accessing Files Outside the Web Root
>
> [snip]
>
> Have you tried keeping all of your documents in one directory and blocking
> that directory via a robots.txt file?
>
> Jen
FW: [PHP] Accessing Files Outside the Web Root
-----Original Message-----
From: Dale H. Cook [mailto:radiot...@plymouthcolony.net]
Sent: Wednesday, March 13, 2013 3:38 PM
To: php-general@lists.php.net
Subject: [PHP] Accessing Files Outside the Web Root

Let me preface my question by noting that I am virtually a PHP novice. Although I am a long-time webmaster and have used PHP for some years to give visitors access to information in my SQL database, this is my first attempt to use it for another purpose. I have browsed the mailing list archives and have searched online, but have not yet succeeded in teaching myself how to do what I want to do. This need not provoke a lengthy discussion or involve extensive hand-holding - if someone can point to an appropriate code sample or online tutorial that might do the trick.

I am the author of a number of PDF files that serve as genealogical reference works. My problem is that there are a number of sites which pose as search engines and which display my PDF files in their entirety on their own sites. These pirate sites are not simply opening a window that displays my files as they appear on my site. They are using Google Docs to display copies of my files that are cached or stored elsewhere online. The proof of that is that I can modify one of my files and upload it to my site. The file, as seen on my site, immediately displays the modification. The same file, as displayed on the pirate sites, is unmodified and may remain unmodified for weeks.

It is obvious that my files, which are stored under public_html, are being spidered and then stored or cached. This displeases me greatly. I want my files, some of which have cost an enormous amount of work over many years, to be available only on my site. Legitimate search engines, such as Google, may display a snippet, but they do not display the entire file - they link to my site so the visitor can get the file from me.

A little study has indicated to me that if I store those files in a folder outside the web root and use PHP to provide access, they will not be spidered. Writing a PHP script to provide access to the files in that folder is what I need help with. I have experimented with a number of code samples but have not been able to make things work. Could any of you point to code samples or tutorials that might help me? Remember that, aside from the code I have written to handle my SQL database, I am a PHP novice.

Dale H. Cook, Member, NEHGS and MA Society of Mayflower Descendants;
Plymouth Co. MA Coordinator for the USGenWeb Project
Administrator of http://plymouthcolony.net

--

Have you tried keeping all of your documents in one directory and blocking that directory via a robots.txt file?

Jen
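[Editor's note: concretely, Jen's suggestion would be a robots.txt at the site root along these lines, assuming the documents sit in a directory named /docs/ (a hypothetical path). As the earlier replies in this thread point out, only well-behaved crawlers honor it; scrapers are free to ignore it.]

```
User-agent: *
Disallow: /docs/
```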