Roy Hinkelman wrote:
Thank you very much!
I had forgotten that Unix URLs are case-sensitive.
Also, I changed my `for` statements to your suggestion, tweaked the
exception code a little, and it's working.
So, there are obviously several ways to open files. Do you have a
standard practice, or does it depend on the file format?
I will eventually be working with Excel and possibly mssql tables.
Thanks again for your help.
Roy
On Thu, Dec 3, 2009 at 3:46 AM, Christian Witts
<[email protected] <mailto:[email protected]>> wrote:
Roy Hinkelman wrote:
Your list is great. I've been lurking for the past two weeks
while I learned the basics. Thanks.
I am trying to loop through 2 files and scrape some data, but the
loops are not working.
The script is not getting past the first URL from state_list,
as the test print shows.
If someone could point me in the right direction, I'd
appreciate it.
I would also like to know the difference between open() and
csv.reader(). I had similar issues with csv.reader() when
opening these files.
Any help greatly appreciated.
Roy
Code:
# DOWNLOAD USGS MISSING FILES
import mechanize
import BeautifulSoup as B_S
import re
# import urllib
import csv

# OPEN FILES
# LOOKING FOR THESE SKUs
_missing = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\missing_topo_list.csv', 'r')
# IN THESE STATES
_states = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\state_list.csv', 'r')
# IF NOT FOUND, LIST THEM HERE
_missing_files = []
# APPEND THIS FILE WITH META
_topo_meta = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\topo_meta.csv', 'a')

# OPEN PAGE
for each_state in _states:
    each_state = each_state.replace("\n", "")
    print each_state
    html = mechanize.urlopen(each_state)
    _soup = B_S.BeautifulSoup(html)
    # SEARCH THRU PAGE AND FIND ROW CONTAINING META MATCHING SKU
    _table = _soup.find("table", "tabledata")
    print _table  # test: this is returning 'None'
If you take a look at the webpage you open up, you will notice
there are no tables. Are you certain you are using the correct
URLs for this?
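That `None` is exactly what triggers the AttributeError further down: BeautifulSoup's find() returns None when no matching tag exists, and None has no .find() of its own. A cheap guard is to check the result before using it. A minimal sketch in modern Python, where `find_table` is a hypothetical stand-in for the `_soup.find("table", "tabledata")` lookup (the real pages aren't available here):

```python
def find_table(html):
    # Stand-in for _soup.find("table", "tabledata"): like BeautifulSoup,
    # it returns None when nothing matches.
    return "tabledata" if "tabledata" in html else None

pages = ["<p>no tables here</p>", '<table class="tabledata"></table>']
for page in pages:
    _table = find_table(page)
    if _table is None:
        # Skip pages without the expected table instead of crashing later
        print("skipping page with no tabledata table")
        continue
    print("found table")
```

The same `if _table is None: continue` guard dropped into your outer loop would let the script record the problem URL and move on rather than die on the first bad page.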
    for each_sku in _missing:
The for loop `for each_sku in _missing:` will only iterate once;
you can either pre-read it into a list / dictionary / set
(whichever you prefer) or change it to

_missing_filename = 'C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\missing_topo_list.csv'
for each_sku in open(_missing_filename):
    # carry on here
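The exhaustion is easy to demonstrate: an open file object is an iterator, so a second pass over it yields nothing, which is why only the first state URL ever did any SKU work. A minimal sketch in modern Python, with io.StringIO standing in for the real file:

```python
import io

data = "33087c2\n34087b2\n"

f = io.StringIO(data)                        # behaves like an open file
first_pass = [line.strip() for line in f]    # consumes the iterator
second_pass = [line.strip() for line in f]   # already exhausted: empty

print(first_pass)
print(second_pass)

# Pre-reading into a set keeps the SKUs available on every outer iteration,
# and membership tests on a set are fast.
missing_skus = set(line.strip() for line in io.StringIO(data))
print('33087c2' in missing_skus)
```

With the set built once before the state loop, the inner loop can iterate over `missing_skus` on every state without touching the disk again.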
        each_sku = each_sku.replace("\n", "")
        print each_sku  # test
        try:
            _row = _table.find('tr', text=re.compile(each_sku))
        except (IOError, AttributeError):
            _missing_files.append(each_sku)
            continue
        else:
            _row = _row.previous
            _row = _row.parent
            _fields = _row.findAll('td')
            _name = _fields[1].string
            _state = _fields[2].string
            _lat = _fields[4].string
            _long = _fields[5].string
            _sku = _fields[7].string
            _topo_meta.write(_name + "|" + _state + "|" + _lat + "|" + _long + "|" + _sku + "||")
            print each_sku + ': ' + _name

print "Missing Files:"
print _missing_files

_topo_meta.close()
_missing.close()
_states.close()
The message I am getting is:
Code:
>>>
http://libremap.org/data/state/Colorado/drg/
None
33087c2
Traceback (most recent call last):
File "//Dc1/Data/SharedDocs/Roy/_Coding Vault/Python code samples/usgs_missing_file_META.py", line 34, in <module>
_row = _table.find('tr', text=re.compile(each_sku))
AttributeError: 'NoneType' object has no attribute 'find'
And the files look like:
Code:
state_list
http://libremap.org/data/state/Colorado/drg/
http://libremap.org/data/state/Connecticut/drg/
http://libremap.org/data/state/Pennsylvania/drg/
http://libremap.org/data/state/South_Dakota/drg/
missing_topo_list
33087c2
34087b2
33086b7
34086c2
------------------------------------------------------------------------
_______________________________________________
Tutor maillist - [email protected] <mailto:[email protected]>
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
Hope the comments above help in your endeavours.
--
Kind Regards,
Christian Witts
Generally I just open files in read or read-binary mode, depending on
the data in them. The only time I use the file object directly in a for
loop is for situations similar to yours, when you need to iterate over
the file a lot (although if the file is small enough I generally prefer
loading it into a dictionary, as that is faster: you build it once and
never have to read it off the disk again, since it is in memory).
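As for open() versus csv.reader(): iterating an open file yields raw strings (trailing newline included), while csv.reader wraps an open file and yields each row already split into fields, handling quoting and embedded commas for you. A small sketch in modern Python, with io.StringIO standing in for a real file:

```python
import csv
import io

raw = 'name,state\n"Alamosa, East",CO\n'

# Iterating the file object yields raw lines, newline and quoting intact
lines = list(io.StringIO(raw))

# csv.reader parses each line into a list of fields, honouring the quotes
rows = list(csv.reader(io.StringIO(raw)))

print(lines[1])   # the raw second line, comma inside quotes untouched
print(rows[1])    # the parsed fields
```

For plain one-value-per-line files like your state_list, open() alone is fine; csv.reader earns its keep once rows have multiple, possibly quoted, fields. In later Pythons a `with open(path) as f:` block also closes the file for you, which would tidy up the three explicit close() calls at the end of your script.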
For the Excel work you want to do later, take a look at
http://www.python-excel.org/. xlrd is the one I still use (it works in
the UNIX environment); I haven't had a need to change to anything else.
For MS SQL you can look at http://pymssql.sourceforge.net/, which is
also supported under UNIX.
--
Kind Regards,
Christian Witts