Rich suggested I send this to taglibs-dev -- it's a fix (I think) for problems with 
proxy use and following redirects with the scrape taglib.  This is my first suggestion 
to an open source project; be kind.

Ken

--
It seemed easier to modify your code slightly to cast (perhaps a bad idea) the result 
of URL.openConnection instead of subclassing java.net.HttpURLConnection.  Here's the 
bit I changed (marked with a KJM):
(from jakarta-taglibs\scrape\src\org\apache\taglibs\scrape; you would also remove 
HttpConnection.java from that directory)
[...]
 /**
  * Create an HTTP request for the specified URL: check whether enough time
  * has elapsed to fetch the page again, check the Last-Modified header of
  * the page, and make the full request only if necessary.
  */
class Page extends Thread {

    private java.net.HttpURLConnection connection; // object to create an http request
    private long lastmodified;         // time the page was last modified
    private long expires;              // http header = time the page expires
    private URL url;                   // url from the page to be scraped
    private PageData pagedata;    // pagedata object that holds data on this url
    // char array to hold the source page from the http request
    private char source[];
    // max size of the buffer that the http request is read into
    private final long MAX_BUFFER_SIZE = 50000;
    // pagecontext that the servlet resides in, used for logging to the server
    private PageContext pageContext;

    Page(URL url, PageData page, PageContext pc) {
        this.url = url;
        pagedata = page;
        pageContext = pc;
    }

    public void run() {
        long current = new Date().getTime();  // get current time

        // make http connection to url
        try {
            // create new HttpURLConnection --KJM
            connection = (java.net.HttpURLConnection) url.openConnection();
            connection.setRequestMethod("HEAD");
            connection.connect();

            // set current time to time of last scrape
            pagedata.setLastScrape(current);

            // check response status code; a code below 300 indicates a
            // successful connection
            if (connection.getResponseCode() >= 300) {
                pageContext.getServletContext().
                  log("Error Occurred: " + connection.getResponseMessage());
            } else {
                // get expires header
                // getExpiration returns 0 if header does not exist
                if ((expires = connection.getExpiration()) == 0)
                    // do this if header does not exist
                    expires = current - 1;

                // check for a new scrape for this page or that the Expires
                // time for the page has passed
                if ((current > expires) || pagedata.getnewFlag() ||
                    pagedata.getChangeFlag()) {

                    // get lastmodified header
                    // getLastModified returns 0 if header does not exist
                    if ((lastmodified = connection.getLastModified()) == 0)
                        // do this if header does not exist
                        lastmodified = pagedata.getLastScrape() - 1;

                    // drop the HEAD connection; a URLConnection cannot be
                    // reused once connected, so a fresh one is opened below
                    // for the GET
                    connection.disconnect();

                    // check for a new scrape for this page or that Last-
                    // Modified time for the page has passed
                    if ((pagedata.getLastScrape() < lastmodified) ||
                        pagedata.getnewFlag() || pagedata.getChangeFlag()) {

                        // open a new connection with the request method
                        // set to GET
                        connection = (java.net.HttpURLConnection) url.openConnection();
                        connection.setRequestMethod("GET");
                        // make the connection
                        connection.connect();

                        // check response code from connection
                        if (connection.getResponseCode() >= 300) {
                            pageContext.getServletContext().
                               log("Error Occurred: " +
                                   connection.getResponseMessage());
                            // the connection did not occur; return cached data
                            return;
                        }

                        // read http response into buffer; return value is
                        // false if an error occurred
                        if (streamtochararray(connection.getInputStream())) {
                            // perform the scrapes on this page
                            scrape();
                        }
                    }
                }
            }
        }
        catch (IOException ee) {
            pageContext.getServletContext().log(ee.toString());
        }
    }
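For what it's worth, a quick standalone check (the class name below is mine, not part of the taglib, and the proxy host/port are placeholders) suggests why the cast fixes both reported problems: the stock java.net.HttpURLConnection follows 3xx redirects by default, and proxy support comes from the standard http.proxyHost/http.proxyPort system properties with no extra code:

```java
import java.net.HttpURLConnection;

// Standalone check; class name is illustrative, not from the taglib.
public class StockConnectionDefaults {
    public static void main(String[] args) {
        // 3xx redirects are followed automatically unless explicitly disabled
        System.out.println("followRedirects = "
                           + HttpURLConnection.getFollowRedirects());

        // Proxy support comes from standard system properties, usually set
        // on the JVM command line (-Dhttp.proxyHost=... -Dhttp.proxyPort=...);
        // the values here are placeholders
        System.setProperty("http.proxyHost", "proxy.example.com");
        System.setProperty("http.proxyPort", "8080");
        System.out.println("proxyHost = " + System.getProperty("http.proxyHost"));
    }
}
```

Running it prints `followRedirects = true`, which matches the documented default.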

-----Original Message-----
From: Rich Catlett [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 16, 2001 11:34 AM
To: [EMAIL PROTECTED]
Subject: [Fwd: Scrape: doesn't respect http.proxy settings, redirects,
etc;] (fwd)


It does use java.net.HttpURLConnection; that is the superclass.  The
connect and disconnect methods are abstract and have to be written.  As
far as redirects go, there is a setFollowRedirects method that I didn't
bother with, since automatically following redirects is supposed to be the
default behavior.  As far as using a proxy goes, there is another abstract
method, usingProxy, that I believe would have to be fleshed out.  Currently
it simply returns false.  I imagine that to use a proxy, an attribute
would have to be added to the page tag to determine whether a proxy is to be
used, and then the usingProxy method called.  I am not very strong in this
area and have no place to test it currently, so if you would
like to flesh out the usingProxy method and submit the fix to the
taglibs-dev list I would be happy to add the change to CVS.
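A side note on the usingProxy point above: with the stock class there is nothing to flesh out, since usingProxy is already concrete there. If a per-tag proxy attribute were added later, one option (available only in JDK 5.0 and later) is to pass a java.net.Proxy to openConnection explicitly. A rough sketch under that assumption, with placeholder proxy host, port, and URL:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;

// Sketch only; the proxy host/port and target URL are placeholders.
public class PerTagProxySketch {
    public static void main(String[] args) throws IOException {
        // createUnresolved avoids a DNS lookup for the placeholder host
        Proxy proxy = new Proxy(Proxy.Type.HTTP,
                InetSocketAddress.createUnresolved("proxy.example.com", 8080));

        // openConnection does no network I/O yet, so this is safe to
        // construct even when the proxy host does not exist
        HttpURLConnection conn = (HttpURLConnection)
            new URL("http://jakarta.apache.org/").openConnection(proxy);
        System.out.println("request method = " + conn.getRequestMethod());
    }
}
```

The default request method on a fresh connection is GET, so the sketch prints `request method = GET` without touching the network.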

---------------------------------------------------------------------
Rich Catlett        [EMAIL PROTECTED] |  Confuscious say "Man who stand |
Programmer                        |   on toilet, high on pot!"      |
                                  |                                 |
---------------------------------------------------------------------

-------- Original Message --------
Subject: Scrape: doesn't respect http.proxy settings, redirects, etc;
Date: Thu, 12 Jul 2001 12:07:33 -0400
From: "Meltsner, Kenneth" <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]

I figured this one out: Scrape doesn't follow redirects, use system proxy settings, 
etc., because it has its own implementation of HttpURLConnection.  Was there a reason 
not to use the standard object from java.net?  If not, it'd be relatively simple to fix...

Ken


Ken Meltsner
Computer Associates
Senior Architect, Portal TAG Team
