Roy Hinkelman wrote:
Thank you very much!
I had forgotten that Unix URLs are case-sensitive.
Also, I changed my `for` statements to your suggestion, tweaked the
exception code a little, and it's working.
So, there are obviously several ways to open files. Do you have a
standard practice, or does it depend on the file format?
I will eventually be working with Excel and possibly mssql tables.
Thanks again for your help.
Roy
On Thu, Dec 3, 2009 at 3:46 AM, Christian Witts
<[email protected] <mailto:[email protected]>> wrote:
Roy Hinkelman wrote:
Your list is great. I've been lurking for the past two weeks
while I learned the basics. Thanks.
I am trying to loop through 2 files and scrape some data, but the
loops are not working.
The script is not getting past the first URL from state_list,
as the test print shows.
If someone could point me in the right direction, I'd
appreciate it.
I would also like to know the difference between open() and
csv.reader(). I had similar issues with csv.reader() when
opening these files.
Any help greatly appreciated.
Roy
Code:
# DOWNLOAD USGS MISSING FILES
import mechanize
import BeautifulSoup as B_S
import re
# import urllib
import csv

# OPEN FILES
# LOOKING FOR THESE SKUs
_missing = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\missing_topo_list.csv', 'r')
# IN THESE STATES
_states = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\state_list.csv', 'r')
# IF NOT FOUND, LIST THEM HERE
_missing_files = []
# APPEND THIS FILE WITH META
_topo_meta = open('C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\topo_meta.csv', 'a')

# OPEN PAGE
for each_state in _states:
    each_state = each_state.replace("\n", "")
    print each_state
    html = mechanize.urlopen(each_state)
    _soup = B_S.BeautifulSoup(html)
    # SEARCH THRU PAGE AND FIND ROW CONTAINING META MATCHING SKU
    _table = _soup.find("table", "tabledata")
    print _table  # test: this is returning 'None'
If you take a look at the webpage you open up, you will notice
there are no tables. Are you certain you are using the correct
URLs for this?
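That `None` is exactly what triggers the AttributeError further down: BeautifulSoup's find() returns None when no matching tag exists, and None has no .find() of its own. A cheap guard is to check the result before using it. A minimal sketch in modern Python, where `find_table` is a hypothetical stand-in for the `_soup.find("table", "tabledata")` lookup (the real pages aren't available here):

```python
def find_table(html):
    # Stand-in for _soup.find("table", "tabledata"): like BeautifulSoup,
    # it returns None when nothing matches.
    return "tabledata" if "tabledata" in html else None

pages = ["<p>no tables here</p>", '<table class="tabledata"></table>']
for page in pages:
    _table = find_table(page)
    if _table is None:
        # Skip pages without the expected table instead of crashing later
        print("skipping page with no tabledata table")
        continue
    print("found table")
```

The same `if _table is None: continue` guard dropped into your outer loop would let the script record the problem URL and move on rather than die on the first bad page.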
    for each_sku in _missing:
The for loop `for each_sku in _missing:` will only iterate once;
you can either pre-read it into a list / dictionary / set
(whichever you prefer) or change it to

_missing_filename = 'C:\\Documents and Settings\\rhinkelman\\Desktop\\working DB files\\missing_topo_list.csv'
for each_sku in open(_missing_filename):
    # carry on here
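The exhaustion is easy to demonstrate: an open file object is an iterator, so a second pass over it yields nothing, which is why only the first state URL ever did any SKU work. A minimal sketch in modern Python, with io.StringIO standing in for the real file:

```python
import io

data = "33087c2\n34087b2\n"

f = io.StringIO(data)                        # behaves like an open file
first_pass = [line.strip() for line in f]    # consumes the iterator
second_pass = [line.strip() for line in f]   # already exhausted: empty

print(first_pass)
print(second_pass)

# Pre-reading into a set keeps the SKUs available on every outer iteration,
# and membership tests on a set are fast.
missing_skus = set(line.strip() for line in io.StringIO(data))
print('33087c2' in missing_skus)
```

With the set built once before the state loop, the inner loop can iterate over `missing_skus` on every state without touching the disk again.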
        each_sku = each_sku.replace("\n", "")
        print each_sku  # test
        try:
            _row = _table.find('tr', text=re.compile(each_sku))
        except (IOError, AttributeError):
            _missing_files.append(each_sku)
            continue
        else:
            _row = _row.previous
            _row = _row.parent
            _fields = _row.findAll('td')
            _name = _fields[1].string
            _state = _fields[2].string
            _lat = _fields[4].string
            _long = _fields[5].string
            _sku = _fields[7].string
            _topo_meta.write(_name + "|" + _state + "|" + _lat + "|" + _long + "|" + _sku + "||")
            print each_sku + ': ' + _name

print "Missing Files:"
print _missing_files

_topo_meta.close()
_missing.close()
_states.close()
The message I am getting is:
Code:
>>>
http://libremap.org/data/state/Colorado/drg/
None
33087c2
Traceback (most recent call last):
File "//Dc1/Data/SharedDocs/Roy/_Coding Vault/Python code samples/usgs_missing_file_META.py", line 34, in <module>
_row = _table.find('tr', text=re.compile(each_sku))
AttributeError: 'NoneType' object has no attribute 'find'
And the files look like:
Code:
state_list
http://libremap.org/data/state/Colorado/drg/
http://libremap.org/data/state/Connecticut/drg/
http://libremap.org/data/state/Pennsylvania/drg/
http://libremap.org/data/state/South_Dakota/drg/
missing_topo_list
33087c2
34087b2
33086b7
34086c2
------------------------------------------------------------------------
_______________________________________________
Tutor maillist - [email protected] <mailto:[email protected]>
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
Hope the comments above help in your endeavours.
--
Kind Regards,
Christian Witts
Generally I just open files in read or read-binary mode, depending on
the data in them. The only time I use the file object directly in a for
loop is for situations similar to yours, when you need to iterate over
the file a lot (although if the file is small enough I generally prefer
loading it into a dictionary, as that is faster: you build it once and
never have to read it off the disk again, since it is in memory).
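As for open() versus csv.reader(): iterating an open file yields raw strings (trailing newline included), while csv.reader wraps an open file and yields each row already split into fields, handling quoting and embedded commas for you. A small sketch in modern Python, with io.StringIO standing in for a real file:

```python
import csv
import io

raw = 'name,state\n"Alamosa, East",CO\n'

# Iterating the file object yields raw lines, newline and quoting intact
lines = list(io.StringIO(raw))

# csv.reader parses each line into a list of fields, honouring the quotes
rows = list(csv.reader(io.StringIO(raw)))

print(lines[1])   # the raw second line, comma inside quotes untouched
print(rows[1])    # the parsed fields
```

For plain one-value-per-line files like your state_list, open() alone is fine; csv.reader earns its keep once rows have multiple, possibly quoted, fields. In later Pythons a `with open(path) as f:` block also closes the file for you, which would tidy up the three explicit close() calls at the end of your script.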
For the Excel work you want to do later, take a look at
http://www.python-excel.org/. xlrd is the one I still use (it works in
the UNIX environment); I haven't had a need to change to anything else.
For MS SQL you can look at http://pymssql.sourceforge.net/, which is
also supported under UNIX.
--
Kind Regards,
Christian Witts