I'm having trouble making this script work to scrape information from a series of Wikipedia articles.
What I'm trying to do is iterate over a series of wiki URLs and pull out the page links on a wiki portal category (e.g. https://en.wikipedia.org/wiki/Category:Electronic_design). I know that all the wiki pages I'm going through have a page links section. However, when I try to iterate through them I get this error message:

    Traceback (most recent call last):
      File "./wiki_parent.py", line 37, in <module>
        cleaned = pages.get_text()
    AttributeError: 'NoneType' object has no attribute 'get_text'

Why do I get this error?

The file I'm reading in the first part looks like this:

    1 Category:Abrahamic_mythology
    2 Category:Abstraction
    3 Category:Academic_disciplines
    4 Category:Activism
    5 Category:Activists
    6 Category:Actors
    7 Category:Aerobics
    8 Category:Aerospace_engineering
    9 Category:Aesthetics

and it is stored in the port_ID dict like so:

    {1: 'Category:Abrahamic_mythology', 2: 'Category:Abstraction',
     3: 'Category:Academic_disciplines', 4: 'Category:Activism',
     5: 'Category:Activists', 6: 'Category:Actors', 7: 'Category:Aerobics',
     8: 'Category:Aerospace_engineering', 9: 'Category:Aesthetics',
     10: 'Category:Agnosticism', 11: 'Category:Agriculture'...}

The desired output is:

    parent_num, page_ID, page_num

I realize the code is a little hackish, but I'm just trying to get this working:

    #!/usr/bin/env python
    import os, re, nltk
    from bs4 import BeautifulSoup
    from urllib import urlopen

    url = "https://en.wikipedia.org/wiki/" + 'Category:Furniture'

    # Build port_ID by walking a local mirror of the category pages
    rootdir = '/Users/joshuavaldez/Desktop/L1/en.wikipedia.org/wiki'
    reg = re.compile(r'[\w]+:[\w]+')
    number = 1
    port_ID = {}
    for root, dirs, files in os.walk(rootdir):
        for file in files:
            if reg.match(file):
                port_ID[number] = file
                number += 1

    test_file = open('test_file.csv', 'w')

    for key, value in port_ID.iteritems():
        url = "https://en.wikipedia.org/wiki/" + str(value)
        raw = urlopen(url).read()
        soup = BeautifulSoup(raw)
        pages = soup.find("div", {"id": "mw-pages"})
        cleaned = pages.get_text()          # line 37, where the traceback points
        cleaned = cleaned.encode('utf-8')
        pages = cleaned.split('\n')
        pages = pages[4:-2]
        test = port_ID.items()[0]
        page_ID = 1
        for item in pages:
            test_file.write('%s %s %s\n' % (test[0], item, page_ID))
            page_ID += 1
        page_ID = 1

Hi, I posted this on Stack Overflow and didn't really get any help, so I thought I would ask here to see if someone could help! Thanks!

*Joshua Valdez*
*Computational Linguist : Cognitive Scientist*
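For what it's worth, here is a minimal reproduction I put together while debugging (the helper name and the sample HTML snippets are made up for illustration, not from my actual data). It suggests that soup.find() returns None whenever a page has no "mw-pages" div at all, which would explain the AttributeError:

```python
from bs4 import BeautifulSoup

def page_links_text(raw_html):
    """Return the text of the mw-pages div, or None if the page has none."""
    soup = BeautifulSoup(raw_html, "html.parser")
    pages = soup.find("div", {"id": "mw-pages"})
    if pages is None:  # find() returns None when no matching tag exists
        return None
    return pages.get_text()

# A category page with a page-links section, and one without it
# (e.g. a category containing only subcategories):
with_links = '<div id="mw-pages"><a>Chair</a></div>'
without_links = '<div id="mw-subcategories"></div>'
```

Calling pages.get_text() without that None check on the second kind of page raises exactly the AttributeError in my traceback.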