On May 09, 2006, at 12:57 PM, Dave Cragg wrote:

In that case it would be useful to see the http request that Rev (libUrl) is sending, using

 liburlSetLogField the long id of field <field>

Done; see the libUrl log at the end of this message.

But I think I found the problem. I'm parsing an index page, and in the message box each URL (held in the variable "tOneGenus") appears to be a single unbroken string like this:

http://www.pacsoa.org.au/palms/Acanthophoenix/index.html

But if I paste my test list of ten URLs from the message box into this email, each one breaks across two lines:

tURLs (of course, these are not working):

http://www.pacsoa.org.au/palms/Acoelorrhaphe
/index.html
http://www.pacsoa.org.au/palms/Acrocomia
/index.html
http://www.pacsoa.org.au/palms/Actinokentia
/index.html
http://www.pacsoa.org.au/palms/Actinorhytis
/index.html
http://www.pacsoa.org.au/palms/Adonidia
/index.html
http://www.pacsoa.org.au/palms/Aiphanes
/index.html
http://www.pacsoa.org.au/palms/Allagoptera
/index.html
http://www.pacsoa.org.au/palms/Alloschmidia
/index.html
http://www.pacsoa.org.au/palms/Alsmithia
/index.html




It looks like I have some kind of CRLF in the data following the folder name, so my GET request fails in Revolution, although pasting the same string into Firefox works. It appears my script may be generating this, but I can't see it.
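A minimal sketch for making the hidden character visible, assuming the stray break is a control character such as a carriage return (ASCII 13) left over from CRLF line endings in the page. The handler name showHidden is hypothetical and not part of libUrl:

```livecode
-- Hypothetical helper: returns pText with every control character
-- (ASCII code below 32) replaced by a visible tag such as <13> or <10>.
function showHidden pText
   local tOut
   repeat for each char c in pText
      if charToNum(c) < 32 then
         put "<" & charToNum(c) & ">" after tOut
      else
         put c after tOut
      end if
   end repeat
   return tOut
end showHidden
```

Calling "put showHidden(tOneGenus)" inside the repeat loop should show which character, if any, sits between the folder name and "/index.html".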

Now, if I chop "/index.html" off the end to give URLs like this:

http://www.pacsoa.org.au/palms/Acoelorrhaphe

these work just fine. Here is my script:

on mouseup
  set the cursor to busy
  getPalms

# previous crawlers
  --getGaneshas
end mouseup


ON getPalms
    --> site is: http://www.pacsoa.org.au/palms/index.html

    # we need to dig every */palms/*.html  file on this page
    # so first is to extract all the URL's

    -- put fld "MainURL" into tStartURL
    put "http://www.pacsoa.org.au/palms/index.html" into tStartURL

    put URL tStartURL into tMainListing # this works..
    REPEAT for each line x in tMainListing
        IF x contains "/palms/"  THEN # we got one for sure
            put x & cr after tPalmList
        END IF

    END REPEAT

    delete line 1 to 2 of tPalmList
    delete line -1 of tPalmList

# clean out tags:
    put  "<[^><]*>" into tRex
    put replacetext(tPalmList, tRex, "") into tPalmsList
    replace " " with "" in tPalmsList

    REPEAT for each line x in tPalmslist
        put "http://www.pacsoa.org.au/palms/" before x
        # I think I am getting an extra CR introduced here... I don't know why
        put "/index.html" after x
        put x & cr after tGenusListing
    END REPEAT


    --> Step through Genus listing

    put line 1 to 10 of tGenusListing into tTestList

    liburlSetLogField the long id of field "logField"
    --repeat for each line tOneGenus in tGenusListing
    REPEAT for each line tOneGenus in tTestList
        --> extract the genus name first
        set the itemdel to "/"
        put item 5 of tOneGenus into tGenus
       -- delete item -1 of tOneGenus
        put tOneGenus & cr after tURLs


        put url (tOneGenus) into tOneGenusPage
        wait 5 ticks
        put tOneGenusPage into fld "previewer"



        --> from each page we have to extract the species URLs
        --repeat for each line x in tOneGenusPage
--if x contains ("/" & tGenus & "/") then put x & cr after tSpeciesPages
        --end repeat

    END REPEAT
    --put tSpeciesPages

    --> Load the Species URLs and then save any .jpg file therein

     put tURLs

END getPalms
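
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the index page is served with CRLF line endings, and because Revolution's line delimiter is the linefeed (ASCII 10), each line x keeps a trailing carriage return (ASCII 13) that ends up in front of "/index.html". A minimal sketch of a workaround under that assumption (the function name stripCR is hypothetical):

```livecode
-- Hypothetical helper: removes every carriage return (ASCII 13) from pText,
-- leaving linefeeds (ASCII 10) in place as the line delimiter.
function stripCR pText
   replace numToChar(13) with empty in pText
   return pText
end stripCR
```

In getPalms this could be applied right after the download, for example "put stripCR(URL tStartURL) into tMainListing", so that no line carries a trailing carriage return into the generated URLs.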


tURLs (of course, these are not working):

http://www.pacsoa.org.au/palms/Acoelorrhaphe
/index.html
http://www.pacsoa.org.au/palms/Acrocomia
/index.html
http://www.pacsoa.org.au/palms/Actinokentia
/index.html
http://www.pacsoa.org.au/palms/Actinorhytis
/index.html
http://www.pacsoa.org.au/palms/Adonidia
/index.html
http://www.pacsoa.org.au/palms/Aiphanes
/index.html
http://www.pacsoa.org.au/palms/Allagoptera
/index.html
http://www.pacsoa.org.au/palms/Alloschmidia
/index.html
http://www.pacsoa.org.au/palms/Alsmithia
/index.html

These work, but you can see the extra CR coming in from somewhere. I don't see these extra lines in the message box, though:

http://www.pacsoa.org.au/palms/Acanthophoenix

http://www.pacsoa.org.au/palms/Acoelorrhaphe

http://www.pacsoa.org.au/palms/Acrocomia

http://www.pacsoa.org.au/palms/Actinokentia

http://www.pacsoa.org.au/palms/Actinorhytis

http://www.pacsoa.org.au/palms/Adonidia

http://www.pacsoa.org.au/palms/Aiphanes

http://www.pacsoa.org.au/palms/Allagoptera

http://www.pacsoa.org.au/palms/Alloschmidia

http://www.pacsoa.org.au/palms/Alsmithia



socket selected: 209.15.79.148:80|6956
GET /palms/index.html HTTP/1.1

Host: www.pacsoa.org.au

User-Agent: Revolution (MacOS)


HTTP/1.1 200 OK

Date: Wed, 10 May 2006 03:00:57 GMT

Server: Apache/1.3.26 (Unix) FrontPage/5.0.2.2510 PHP/4.2.3

Last-Modified: Sat, 25 Mar 2006 08:39:13 GMT

ETag: "88fc5f-6154-442501b1"

Accept-Ranges: bytes

Content-Length: 24916

Content-Type: text/html


socket selected: 209.15.79.148:80|6956
GET /palms/Acanthophoenix
/index.html HTTP/1.1

Host: www.pacsoa.org.au

User-Agent: Revolution (MacOS)


HTTP/1.1 400 Bad Request

Date: Wed, 10 May 2006 03:00:58 GMT

Server: Apache/1.3.26 (Unix) FrontPage/5.0.2.2510 PHP/4.2.3

Connection: close

Content-Type: text/html; charset=iso-8859-1


CLOSED 209.15.79.148:80|6956
socket selected: 209.15.79.148:80|6957
GET /palms/Acoelorrhaphe
/index.html HTTP/1.1

Host: www.pacsoa.org.au

User-Agent: Revolution (MacOS)


HTTP/1.1 400 Bad Request

Date: Wed, 10 May 2006 03:00:58 GMT

Server: Apache/1.3.26 (Unix) FrontPage/5.0.2.2510 PHP/4.2.3

Connection: close

Content-Type: text/html; charset=iso-8859-1


CLOSED 209.15.79.148:80|6957
socket selected: 209.15.79.148:80|6958
GET /palms/Acrocomia
/index.html HTTP/1.1

Host: www.pacsoa.org.au

User-Agent: Revolution (MacOS)


HTTP/1.1 400 Bad Request

Date: Wed, 10 May 2006 03:00:59 GMT

Server: Apache/1.3.26 (Unix) FrontPage/5.0.2.2510 PHP/4.2.3

Connection: close

Content-Type: text/html; charset=iso-8859-1


CLOSED 209.15.79.148:80|6958
socket selected: 209.15.79.148:80|6959
GET /palms/Actinokentia
/index.html HTTP/1.1

Host: www.pacsoa.org.au

User-Agent: Revolution (MacOS)


HTTP/1.1 400 Bad Request

Date: Wed, 10 May 2006 03:00:59 GMT

Server: Apache/1.3.26 (Unix) FrontPage/5.0.2.2510 PHP/4.2.3

Connection: close

Content-Type: text/html; charset=iso-8859-1


CLOSED 209.15.79.148:80|6959
socket selected: 209.15.79.148:80|6960
GET /palms/Actinorhytis
/index.html HTTP/1.1

Host: www.pacsoa.org.au

User-Agent: Revolution (MacOS)


HTTP/1.1 400 Bad Request

Date: Wed, 10 May 2006 03:01:00 GMT

Server: Apache/1.3.26 (Unix) FrontPage/5.0.2.2510 PHP/4.2.3

Connection: close

Content-Type: text/html; charset=iso-8859-1


CLOSED 209.15.79.148:80|6960
socket selected: 209.15.79.148:80|6961
GET /palms/Adonidia
/index.html HTTP/1.1

Host: www.pacsoa.org.au

User-Agent: Revolution (MacOS)


HTTP/1.1 400 Bad Request

Date: Wed, 10 May 2006 03:01:00 GMT

Server: Apache/1.3.26 (Unix) FrontPage/5.0.2.2510 PHP/4.2.3

Connection: close

Content-Type: text/html; charset=iso-8859-1


CLOSED 209.15.79.148:80|6961
socket selected: 209.15.79.148:80|6962
GET /palms/Aiphanes
/index.html HTTP/1.1

Host: www.pacsoa.org.au

User-Agent: Revolution (MacOS)


HTTP/1.1 400 Bad Request

Date: Wed, 10 May 2006 03:01:00 GMT

Server: Apache/1.3.26 (Unix) FrontPage/5.0.2.2510 PHP/4.2.3

Connection: close

Content-Type: text/html; charset=iso-8859-1


CLOSED 209.15.79.148:80|6962
socket selected: 209.15.79.148:80|6963
GET /palms/Allagoptera
/index.html HTTP/1.1

Host: www.pacsoa.org.au

User-Agent: Revolution (MacOS)


HTTP/1.1 400 Bad Request

Date: Wed, 10 May 2006 03:01:01 GMT

Server: Apache/1.3.26 (Unix) FrontPage/5.0.2.2510 PHP/4.2.3

Connection: close

Content-Type: text/html; charset=iso-8859-1


CLOSED 209.15.79.148:80|6963
socket selected: 209.15.79.148:80|6964
GET /palms/Alloschmidia
/index.html HTTP/1.1

Host: www.pacsoa.org.au

User-Agent: Revolution (MacOS)


HTTP/1.1 400 Bad Request

Date: Wed, 10 May 2006 03:01:01 GMT

Server: Apache/1.3.26 (Unix) FrontPage/5.0.2.2510 PHP/4.2.3

Connection: close

Content-Type: text/html; charset=iso-8859-1


CLOSED 209.15.79.148:80|6964
socket selected: 209.15.79.148:80|6965
GET /palms/Alsmithia
/index.html HTTP/1.1

Host: www.pacsoa.org.au

User-Agent: Revolution (MacOS)


HTTP/1.1 400 Bad Request

Date: Wed, 10 May 2006 03:01:02 GMT

Server: Apache/1.3.26 (Unix) FrontPage/5.0.2.2510 PHP/4.2.3

Connection: close

Content-Type: text/html; charset=iso-8859-1


CLOSED 209.15.79.148:80|6965


Cheers
Dave
