I'm trying to get some web data. In the past I used something like:

wget -O page.html url
links -dump page.html > page.txt
mail [email protected] < page.txt

That worked well until the server got redeveloped.

With the new server I can NOT screen-scrape in the browser: ONLY the labels
get copied, not the contents.
Each value sits in its own 'individual field' that can only be copied
individually, one by one (I have never come across this before).

When I run the script, page.html DOES contain the desired data, BUT page.txt
does not.

Looking at page.html, it has something like [1]:

readonly? Is this some sort of attempt to prevent copying of the data?

Any thoughts on how that sort of HTML/PHP can be processed to text?

Or do I need to manually get rid of everything up to 'value="'?

If that's the way, what do I need to extract the data from 'value="' up to
the next '"'?
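
One possible approach, as a rough sketch only: join the wrapped lines, pick
out each value="..." attribute, then strip the surrounding 'value="' and '"'.
The function name extract_values is made up for illustration, and this assumes
a grep that supports -o (GNU and BSD grep do) and that no value contains a
double quote.

```shell
# extract_values (hypothetical name): read HTML on stdin and print the
# contents of every value="..." attribute, one per line.
extract_values() {
    # Join lines first, because tags in page.html wrap across several lines.
    tr '\n' ' ' |
        # Keep only the value="..." attributes, one match per output line.
        grep -o 'value="[^"]*"' |
        # Drop the leading value=" and the trailing quote.
        sed 's/^value="//; s/"$//'
}
```

It could be slotted into the old script in place of the links step, e.g.
extract_values < page.html > page.txt. Note that hidden inputs are picked up
too, so a value such as the state code would appear twice.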

Thanks for any pointers.


[1]/snip/
<label class="pfbc-label">Suburb</label><input type="text"
name="SYS_Addresses_e_address_i_0_e_district_tx" value="SYDNEY"
readonly="readonly" class="ro pfbc-textbox"/>

<label class="pfbc-label">State</label><input type="hidden" value="NSW"
name="SYS_Addresses_e_address_i_0_e_state_cd"><input type="text"
name="SYS_Addresses_e_address_i_0_e_state_cd_d" value="NSW"
readonly="readonly" class="ro pfbc-textbox"/>

<label class="pfbc-label">Postcode</label><input type="text"
name="SYS_Addresses_e_address_i_0_e_postcode_tx" value="2000"
readonly="readonly" class="ro pfbc-textbox"/>
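
For markup shaped like the snippet above, each label could be paired with the
value of the visible input that follows it. This is a sketch under
assumptions, not a general HTML parser: it assumes every field is a <label>
followed by <input> tags as shown, that hidden inputs should be skipped, and
that values contain no double quotes. The function name labels_and_values is
made up for illustration.

```shell
# labels_and_values (hypothetical name): read HTML on stdin and print
# "Label: value" for each non-hidden <input> that follows a <label>.
labels_and_values() {
    # Join wrapped lines, then start a new line before every "<" so that
    # each tag (plus any text after it) sits on a line of its own.
    tr '\n' ' ' | sed 's/</\
</g' | awk '
        # Opening label tag: remember the text that follows the ">".
        /^<label/ { label = $0; sub(/^[^>]*>/, "", label) }
        # Visible input: print the remembered label with its value.
        /^<input/ && !/type="hidden"/ {
            if (match($0, /value="[^"]*"/)) {
                # RSTART + 7 skips value=" and RLENGTH - 8 also drops
                # the closing quote.
                print label ": " substr($0, RSTART + 7, RLENGTH - 8)
            }
        }'
}
```

Used as labels_and_values < page.html > page.txt, the output could then be
mailed as before.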






-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
