[wtr-general] Re: Pulling hair out on screen scraping
Anyone have any idea why this isn't working? (Or should I be making a new topic for this?)

    for x in 0..2 do
      for y in 0..9 do
        numstring = x.to_s + y.to_s
        puts numstring
        if (browser.span(:id, Regexp.new("rptCourses_ctl00_rptItems_ctl" + numstring + "_lblItemTxtTitle")).text) then
          var = browser.span(:id, Regexp.new("rptCourses_ctl00_rptItems_ctl" + numstring + "_lblItemTxtTitle")).text
        end
      end
    end

In theory (assuming 2 books on the page), when it reaches the 3rd book the if will evaluate as false and the var = statement never gets executed. I'm getting the first 2 books returned fine. Then, on the 3rd time around, puts numstring executes and the program ends with exit code 1 (it should go to the next page after uneventfully finishing the 2 for loops).

The only thing I can think of is that it's trying to call the above with numstring equal to 03, not finding it on the page, and crashing. However, that's what the if is there to prevent. Any ideas or tips?
[wtr-general] Re: Pulling hair out on screen scraping
    if (browser.span(:id, Regexp.new("rptCourses_ctl00_rptItems_ctl" + numstring + "_lblItemTxtTitle")).exists?)

Otherwise, calling text will throw an exception trying to locate the element.

-c
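A minimal sketch of why the exists? guard matters, written so it runs without a browser. FakePage, FakeSpan, ElementNotFound and collect_titles are hypothetical stand-ins invented for illustration, not Watir classes; they only mimic the behaviour described in this thread (text raises when the element is missing, exists? does not). The early break assumes the ids on a page are consecutive, which is also an assumption.

```ruby
# Hypothetical error class standing in for whatever the real library raises.
class ElementNotFound < StandardError; end

# Stand-in for a span element: NOT the real Watir class.
class FakeSpan
  def initialize(text)
    @text = text
  end

  # Safe to call even when nothing was found.
  def exists?
    !@text.nil?
  end

  # Raises when the element is missing, like the crash described above.
  def text
    raise ElementNotFound, "unable to locate element" unless exists?
    @text
  end
end

# Stand-in for the browser: maps span ids to their text.
class FakePage
  def initialize(titles_by_id)
    @titles_by_id = titles_by_id
  end

  # Mimics browser.span(:id, /regex/): first span whose id matches.
  def span(how, pattern)
    raise ArgumentError, "this sketch only supports :id" unless how == :id
    _id, title = @titles_by_id.find { |id, _| id =~ pattern }
    FakeSpan.new(title)
  end
end

# Walks ctl00, ctl01, ... and stops at the first id that is not on the page.
def collect_titles(page)
  titles = []
  (0..29).each do |n|
    numstring = format("%02d", n)
    span = page.span(:id, Regexp.new("rptCourses_ctl00_rptItems_ctl" + numstring + "_lblItemTxtTitle"))
    break unless span.exists?   # guard first, so text is never called on a missing span
    titles << span.text
  end
  titles
end
```

The point of the design is that the condition only asks whether the element is there; the original if called text, so the exception was raised before the if had anything to evaluate.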
[wtr-general] Re: Pulling hair out on screen scraping
Thank you very much, you are awesome.
[wtr-general] Re: Pulling hair out on screen scraping
Bissquitt,

Thank you for poking the proverbial 'beehive'. I am a Watir/Ruby newbie too. I've posted some really basic questions on this group as well. My background is mainly in hardware test automation, BASIC, and VBScript.

Kudos to all the responses to this post. I too am trying to run before walking. I'm just so excited about the potential of Watir that I can't sit still. After one month of tinkering, with what little I know, I've already automated 30% of our web configuration tests at my job. The links included here will keep me busy for a while.

Thanks again to all,
Gem (newbie) Dandy
[wtr-general] Re: Pulling hair out on screen scraping
OK, thank you all so much. I got the majority of the code working. This is what I have so far:

    while contLoop do
      colVal = worksheet.Cells(row, 'a').Value
      if (colVal) then
        browser.goto("http://bookstore.umbc.edu/SelectCourses.aspx?src=2&type=2&stoid=9&trm=Spring%2009&cid=" + colVal)
        var = browser.span(:id, /rptCourses_ctl00_rptItems_ctl\d\d_lblItemTxtTitle/).text
        worksheet.Cells(row, 'b').value = var
      else
        contLoop = false
      end
      row += 1
      sleep 1
    end

Do you know of an easy way to iterate through each span that matches the above regex, and only the ones that match, or do I need to go through all of them and parse each individually? I was trying something like this but I couldn't get it to work (are span and spans the same? I only saw documentation for spans):

    browser.spans.each(:id, /rptCourses_ctl00_rptItems_ctl\d\d_lblItemTxtTitle/).text

If that can't be done, I guess I will just store each span in a string, look for the regex, and go to the next.

Thanks again guys

On Jan 3, 3:41 pm, Charley Baker charley.ba...@gmail.com wrote:
> It can be a bit overwhelming to learn Ruby and various libraries at the same time. I'd recommend taking a look at the Pickaxe book: http://whytheluckystiff.net/ruby/pickaxe/ just to get some general familiarity. There are other Ruby tutorials online as well as some good books - The Ruby Way, Everyday Scripting, O'Reilly's Ruby book.
>
> succ!, as you mention below, is a Ruby core method. Gotapi also has a good searchable reference to the Ruby standard API: http://www.gotapi.com/html (click on the Ruby Standard Packages). The Pickaxe book from the link above also has an index of the core API, many entries with examples.
>
> Here's a link to the Watir rdocs in case you might find that useful: http://wtr.rubyforge.org/rdoc/ and a link to supported elements (though openqa is down right now): http://wiki.openqa.org/display/WTR/Methods+supported+by+Element
>
> Strange that the hpricot site is down now as well. Another useful way to learn how to use libraries in Ruby is by taking a look at their unit tests. Watir has a large number of unit tests, hpricot has some too. They're located under your Ruby install directory in gems.
>
> Ruby comes with a few documentation systems: ri and rdoc. For the gems you have installed locally you can see all of the rdocs by going to the command line and typing:
>
>     gem server
>
> Then browse to http://localhost:8808
>
> ri can also be used from the command line:
>
>     ri String::succ!
>
> Additional responses inline:
>
> On Sat, Jan 3, 2009 at 10:31 AM, Bissquitt bissqu...@gmail.com wrote:
>> A simple "ok, here is the excel class, here are the commands in it and what they do, here is a syntax example" would be far more helpful as it doesn't leave anything out. I'm still not sure if it's possible to return what row the active cell is on.
>
> Excel is a strange one. :) Agreed that most sites assume a basic familiarity with Ruby, and with the links above you should be able to get into it fairly quickly. Accessing Excel is done through its COM interface, so one of the best sources of documentation is actually the Excel VBA Microsoft help file. There's a link to the standalone version of it somewhere on the internets if you don't have it installed. There are some Excel libraries on our wiki as well as a project on Rubyforge called Rasta which use Excel. You
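Since succ! comes up in this message, here is a small sketch of what it does, using only Ruby core String methods. The first_empty_row helper and the plain Hash standing in for a worksheet column are hypothetical, introduced only so the loop can run without Excel.

```ruby
# String#succ returns the "next" string; String#succ! changes the string in place.
# Digits roll over with a carry, just like counting: "9" -> "10".
puts '1'.succ    # prints 2
puts '9'.succ    # prints 10
puts 'az'.succ   # prints ba
puts 'a9'.succ   # prints b0

# Hypothetical helper: column_values is a plain Hash standing in for a
# worksheet column, keyed by row number as a string ('1', '2', ...).
def first_empty_row(column_values)
  line = '1'.dup                         # dup so the string is safe to mutate
  line.succ! while column_values[line]   # advance until a row has no value
  line                                   # row number of the first empty row
end

puts first_empty_row({ '1' => 'CMSC 201', '2' => 'CMSC 202' })   # prints 3
```

So in the Excel loop quoted in this thread, line.succ! simply moves from row '1' to row '2' and so on; using an Integer and += 1 would work just as well.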
[wtr-general] Re: Pulling hair out on screen scraping
Forgot to include the code I have thus far (currently not working due to the Hpricot portion):

    require 'win32ole'
    require 'open-uri'
    require 'hpricot'

    excel = WIN32OLE.new('excel.application')
    excel.visible = true
    workbook = excel.workbooks.open('E:\books\spring 09 classes.xls')
    worksheet = workbook.worksheets(1)
    contLoop = true
    row = 1
    while contLoop do
      colVal = worksheet.Cells(row, 'a').Value
      if (colVal) then
        doc = Hpricot(open("http://bookstore.umbc.edu/SelectCourses.aspx?src=2&type=2&stoid=9&trm=Spring%2009&cid=" + colVal.to_s))
        a = doc.search("sp...@id='rptCourses_ctl00_rptItems_ctl\d\d_lblItemTxtTitle']").inner_text
        worksheet.Cells(row, 'f').value = a
      else
        contLoop = false
      end
      row += 1
      sleep 1
    end

On Jan 3, 8:32 am, Bissquitt bissqu...@gmail.com wrote:
> Granted I am new to Watir and Ruby in general, but I do have a background in programming. My brief experience has been that Watir and Ruby are awesome but VERY poorly documented, which is odd considering the massive number of web pages dedicated to them.
>
> Anyway, here is the issue I am having. I am trying to screen scrape book information from a college bookstore's website. My first attempt was PHP (and I had a full script done for it), then I realized that the site uses JavaScript to get info from their database, so all I was scraping was the static HTML and I missed the generated stuff I need.
>
> The script in theory: opens an Excel document, looks at (A1) and goes to www.website.com/(A1) where (A1) is a course number, then stores Title, ISBN and other info into B1, C1, D1 etc. (I also have to take into account more than 1 book per class, though once I get the first I should be able to do this.) It then goes to (A2) and repeats.
>
> From what I have seen there are 2 ways to do this, each with its own problem.
>
> 1) Use hpricot or some other parser to find the proper tag. This has 2 issues. The first is that the span ids change from book to book:
>
>     <span id="rptCourses_ctl00_rptItems_ctl00_lblItemTxtISBN">9780324574289</span>
>
> The second ctl00 iterates to ctl01 for the second book (I am hoping I can just use a regexp inline). The second issue is that I have not been able to figure out how to pick out a span tag. There are all sorts of commands for finding links and tables and such, but I can't figure out how to pick out that particular tag (specifically with hpricot).
>
> 2) Load the entire page into a variable, strip out all newlines and tabs, and scan the entire page for a specific regexp:
>
>     <span id="rptCourses_ctl00_rptItems_ctl\d\d_lblItemTxtTitle" style="font-weight:bold;">[^<]+<\/span>
>
> I know this works, I used rubulator to test it. It returns all titles of books on the page. I do foresee an issue of which title belongs to which other info if I do it that way, though.
>
> If an exact example is required I can give out all the info you require, though I figured it would be more clutter than helpful. An actual syntax example would be most helpful rather than just referring me to a class definition, though I will take whatever is offered.
>
> Many thanks,
> Michael

--
You received this message because you are subscribed to the Google Groups Watir General group.
To post to this group, send email to watir-general@googlegroups.com
Before posting, please read the following guidelines: http://wiki.openqa.org/display/WTR/Support
To unsubscribe from this group, send email to watir-general-unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/watir-general
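One way to approach the "which title belongs to which other info" worry in option 2 is to capture the two-digit ctl index along with each value and group by it. The sketch below is only an illustration under assumptions: SAMPLE_HTML is made-up markup modelled on the span shown in this message (the titles and the second ISBN are invented), and books_from_html is a hypothetical helper. It works on a plain string, so the page source would still have to come from something that runs the site's JavaScript.

```ruby
# Made-up page fragment; only the id pattern and the first ISBN come from the thread.
SAMPLE_HTML = <<~HTML
  <span id="rptCourses_ctl00_rptItems_ctl00_lblItemTxtTitle" style="font-weight:bold;">Intro to Widgets</span>
  <span id="rptCourses_ctl00_rptItems_ctl00_lblItemTxtISBN">9780324574289</span>
  <span id="rptCourses_ctl00_rptItems_ctl01_lblItemTxtTitle" style="font-weight:bold;">Advanced Widgets</span>
  <span id="rptCourses_ctl00_rptItems_ctl01_lblItemTxtISBN">9780000000002</span>
HTML

# Group 1: two-digit book index, group 2: field name, group 3: the span's text.
SPAN_PATTERN = /<span id="rptCourses_ctl00_rptItems_ctl(\d\d)_lblItemTxt(Title|ISBN)"[^>]*>([^<]+)<\/span>/

# Returns one Hash per book, ordered by index, e.g. {"Title"=>..., "ISBN"=>...}.
def books_from_html(html)
  books = Hash.new { |hash, index| hash[index] = {} }
  html.scan(SPAN_PATTERN) do |index, field, value|
    books[index][field] = value.strip   # same index means same book
  end
  books.sort.map { |_index, fields| fields }
end

books_from_html(SAMPLE_HTML).each do |book|
  puts "#{book['Title']} (ISBN #{book['ISBN']})"
end
```

Other fields could be added to the (Title|ISBN) alternation once their id suffixes are known; those suffixes are not given in the thread, so none are guessed here.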
[wtr-general] Re: Pulling hair out on screen scraping
Hi there,

I'm not sure what you mean by Ruby and Watir being poorly documented. For Ruby, the first edition of the Pickaxe book, which is comprehensive, is free and available online. There are dozens of other tutorials, sites and blogs about Ruby. Watir also has a lot of examples, a tutorial (http://wiki.openqa.org/display/WTR/Tutorial) and other information on the wiki. If there's something you feel is missing, don't hesitate to suggest it or add it yourself.

Oddly, your example doesn't use Watir at all. If you wanted to use Watir to do the same thing, here are some possibilities:

    browser.spans.each {|s| puts s.text}   # do something else with the span in the block if you want, e.g. assign some variables
    var = browser.span(:id, /ctl/).text    # find the span by a regex and assign its text to a variable

An interesting example using hpricot and regexes to find book information - ISBN, price, etc. Scrubyt is another library for screen scraping which internally uses either Firewatir or Mechanize. Here's a link to some examples: http://wiki.scrubyt.org/index.php?title=Tutorials

HTH,

Charley Baker
blog: http://charleybakersblog.blogspot.com/
Project Manager, Watir, http://wtr.rubyforge.org
QA Architect, Gap Inc Direct
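A sketch of how the first suggestion (iterate over every span) can be narrowed to only the spans whose id matches the title pattern. SpanStub is a hypothetical stand-in so the example runs without a browser; the assumption is that each object yielded by browser.spans responds to id and text in the same way.

```ruby
# Stand-in for a span element: NOT a Watir class, just an object with id and text.
SpanStub = Struct.new(:id, :text)

# The id pattern used earlier in this thread for book titles.
TITLE_ID = /rptCourses_ctl00_rptItems_ctl\d\d_lblItemTxtTitle/

# Collects the text of every span whose id matches TITLE_ID.
# spans can be anything that responds to each.
def matching_titles(spans)
  titles = []
  spans.each do |s|
    titles << s.text if s.id =~ TITLE_ID   # skip spans with other ids
  end
  titles
end

page_spans = [
  SpanStub.new("header", "Bookstore"),
  SpanStub.new("rptCourses_ctl00_rptItems_ctl00_lblItemTxtTitle", "Intro to Widgets"),
  SpanStub.new("rptCourses_ctl00_rptItems_ctl00_lblItemTxtISBN", "9780324574289"),
  SpanStub.new("rptCourses_ctl00_rptItems_ctl01_lblItemTxtTitle", "Advanced Widgets")
]

puts matching_titles(page_spans)
```

With a real browser object the call would presumably be matching_titles(browser.spans), which avoids guessing in advance how many books are on the page.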
[wtr-general] Re: Pulling hair out on screen scraping
Regarding documentation, I read the Tutorial all the way through, but it only hit on a few specific examples, leaving out other commands altogether. I've visited MANY Ruby and Watir sites and never once saw the .span command (does it just search for span tags? guess I'll google it after this post). I never even found a site listing all the Watir commands (http://us.php.net/manual/en/function.abs.php as an example of what I mean).

In addition, there are SO MANY tutorials and such online that are all very poorly done that finding a good one via Google is a needle-in-a-haystack scenario, i.e. (oh great, you showed me that specific command, but showed me nothing about how that command works, so unless I want to use it exactly the way you used it, it's useless).

My example here is the Ruby on Windows site. If I google for anything regarding Ruby and Excel I either get that site, or another site that just provides me a link to that site, and am forced to make do with that site in order to teach myself how to interact with Excel. The site itself lists a BUNCH of examples but leaves it up to you to try and pick apart the syntax to understand what it is doing. For example:

line = '1'
while worksheet.Range("a#{line}")['Value']
  line.succ!
end
#line now holds row number of first empty row

What on earth does .succ! do? It never tells me. The site, and most that I've seen, are written not to target new people and tutor them but to target advanced users, with more of a "so here's a cool way to approach the problem" approach. A simple "ok, here is the Excel class, here are the commands in it and what they do, here is a syntax example" would be far more helpful, as it doesn't leave anything out. I'm still not sure if it's possible to return what row the active cell is on.

...Which is when I decided to ask actual people and ended up here. (thanks again btw)

...After that long-winded response: I was trying to use Watir to scrape the page because I was having issues with the JavaScript not being executed before the scrape (when I did it in PHP) and figured that a driven web browser would be sure to get it...hence Watir.

The reason my example was not using Watir is that I was unable to find any documentation on how to do what I needed. I saw browser.links and browser.table, but those were the only 2 I found; there was no "here is a list of the commands" as I mentioned above. Consequently I found even less on hpricot, since all I get is a 404 on its main site, and every other site links to it, so whether or not it was documented is irrelevant; all I have to work with is trying to piece together other people's code and work with it.

I don't quite follow your first example since I am barely familiar with Ruby syntax (though it appears to be similar to Java). What is the |s|?

Your second example seems to be much closer to what I need, since there are MANY spans on the page but only a handful matching the regexp pattern I gave above. Would you be able to break down the second example for me?

var = browser.span(:id, /ctl/).text

I know:
var is the variable being stored into
browser is the Watir browser object being driven
I'm guessing span just looks for span tags?
I'm also guessing that (:id, /ctl/) looks for any span tag with an id matching /ctl/? (this is where I'm not following you as much)
What does the : in your example do? What exactly is the second argument doing; what are the slashes? And what does the .text at the end do?

Sorry for being rather dense, but I have barely dealt with web programming before. I've spent my life doing C++, Java, and BASIC, so I'm pretty much trying to stumble into a final product as gracefully as I can.

Michael

On Jan 3, 12:37 pm, Charley Baker charley.ba...@gmail.com wrote:
> Hi there,
> I'm not sure what you mean by Ruby and Watir being poorly documented. For Ruby, the first edition of the Pickaxe book, which is comprehensive, is free and available online. There are dozens of other tutorials, sites and blogs about Ruby. Watir also has a lot of examples, a tutorial (http://wiki.openqa.org/display/WTR/Tutorial) and other information on the wiki; if there's something you feel is missing, don't hesitate to suggest it or add it yourself.
>
> Oddly, your example doesn't use Watir at all. If you wanted to use Watir to do the same thing here are some possibilities:
>
> browser.spans.each {|s| puts s.text} #do something else with the span in the block if you want - e.g. assign some variables, etc
>
> var = browser.span(:id, /ctl/).text #find the span by a regex and assign it to a variable
>
> An interesting example using hpricot and regexs to find book information - ISBN, price, etc. Scrubyt is another library for screen scraping which internally uses either Firewatir or Mechanize, here's a link to some examples: http://wiki.scrubyt.org/index.php?title=Tutorials
>
> HTH,
> Charley Baker
> blog: http://charleybakersblog.blogspot.com/
> Project Manager, Watir, http://wtr.rubyforge.org
> QA Architect, Gap
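To make the two quoted possibilities a little more concrete, here is a rough sketch in plain Ruby. It does not use Watir: Span is a hypothetical Struct standing in for whatever browser.spans yields, and SPANS, TITLE_ID and matching_texts are illustrative names. It only shows what the block ({ |s| ... }) and the regexp on the id are doing.

```ruby
# Stand-in for a page's spans: a plain Struct, not the Watir API.
Span = Struct.new(:id, :text)

SPANS = [
  Span.new("header", "Course materials"),
  Span.new("rptCourses_ctl00_rptItems_ctl00_lblItemTxtTitle", "First Book"),
  Span.new("rptCourses_ctl00_rptItems_ctl01_lblItemTxtTitle", "Second Book")
]

TITLE_ID = /rptCourses_ctl00_rptItems_ctl\d\d_lblItemTxtTitle/

# The block runs once per element; |s| names the current element.
SPANS.each { |s| puts s.text }

# Keep only the spans whose id matches the pattern, then collect their text.
def matching_texts(spans, pattern)
  spans.select { |s| s.id =~ pattern }.map { |s| s.text }
end

puts matching_texts(SPANS, TITLE_ID).inspect
```

With a real browser object the same select/map idea should apply to the collection that browser.spans returns, assuming each element responds to id and text; that assumption is worth checking against the Watir rdocs.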
[wtr-general] Re: Pulling hair out on screen scraping
Michael,

A fairly rapid reply, so my apologies if it sounds a little terse. A clearer, succinct email would be helpful, rather than unduly elaborating on your difficulties finding things.

My immediate thought is that you are trying to run before you have learnt to walk. I would:
- Learn the basics of Ruby using a Ruby tutorial, e.g. http://poignantguide.net/ruby/ (quirky)
- Read through parts of the Pickaxe book, e.g. http://www.rubycentral.com/book/
- Learn to use IRB (interactive Ruby) to see what methods do
- Look at the RDoc for the libraries (gems) you want to use, e.g. Watir, Excel

Personally, I consider Ruby's documentation and tutorials very good. Ruby is a scripting language, so most people will provide examples of problems they have solved. Plenty of documentation is available through RDoc (Ruby documentation).

Your question about Ruby's succ! method is easily answered by searching Google for "Ruby succ!". You will find that it "Returns the successor". If you run the command irb, having installed Ruby, and try typing:

a = '1'
a.succ!
=> "2" (is returned by irb)
a.succ!
=> "3" (is returned by irb)

Charley's first example might be more simply written as:

browser.spans.each do | span |
  puts span.text
end

This could be understood as:
For each span within the browser
put the span object into the span variable
within the following block of code
print the output of span.text to screen

Your understanding of the second example is pretty much correct. In answer to your questions:
- Searching for "ruby colon character" will tell you that the : before a name indicates that it is a symbol. A Ruby tutorial will help you understand this.
- The // symbols are less easy to find, but they delimit a regular expression or RegExp. These are common across many languages and also available in Java.
- The .text method returns the text contained within the span

Hope this helps,
Alex
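The succ! behaviour described above can be tried outside Excel as well. A small sketch, assuming a plain Hash (CELLS, hypothetical) in place of the worksheet and a made-up helper name first_empty_row; the loop has the same shape as the Ruby on Windows snippet quoted in the thread.

```ruby
# Hypothetical stand-in for the worksheet: cell address => value.
CELLS = { "a1" => "ISBN", "a2" => "Title", "a3" => "Price" }

# Walk down column "a" until a cell has no value, as in the quoted loop.
def first_empty_row(cells)
  line = '1'.dup                        # dup so the string is safe to modify
  line.succ! while cells["a#{line}"]    # "1" -> "2" -> "3" -> "4"
  line
end

puts first_empty_row(CELLS)   # prints 4

# succ (without !) returns a new string instead of changing the receiver.
puts "9".succ                 # prints 10
puts "az".succ                # prints ba
```

The trailing ! is a naming convention for methods that modify the object they are called on, which is why the loop needs no assignment.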
[wtr-general] Re: Pulling hair out on screen scraping
Hi Michael,

On Sat, Jan 03, 2009 at 10:31:38AM -0800, Bissquitt wrote:
> Regarding documentation, I read the Tutorial all the way through, but it only hit on a few specific examples, leaving out other commands altogether. I've visited MANY Ruby and Watir sites and never once saw the .span command (does it just search for span tags? guess I'll google it after this post). I never even found a site listing all the Watir commands (http://us.php.net/manual/en/function.abs.php as an example of what I mean).

This is in the Watir wiki, which seems to be down at the moment, so here's an alternative link: http://tinyurl.com/watirmethods

> In addition, there are SO MANY tutorials and such online that are all very poorly done that finding a good one via Google is a needle-in-a-haystack scenario, i.e. (oh great, you showed me that specific command, but showed me nothing about how that command works, so unless I want to use it exactly the way you used it, it's useless).

I wouldn't go so far as to call the tutorials poorly done, but most Watir tutorials do seem to be written with non-programmers in mind. That said, Ruby is a full-fledged language, much like Java and C++ are, so it would be out of scope for a Watir tutorial or Excel automation tutorial to teach you language basics too.

However, if you installed Ruby using the one-click installer, you already have most of the documentation you need on your computer. To see the APIs, you can use either fxri (a graphical interface to the language documentation) or the rubygems rdoc server (a daemon that you can access at http://localhost:8808/ through your browser). I can't give the exact location since I don't have access to a Windows computer right now, but all of these are somewhere in the Ruby folder in the Start Menu. A copy of the Pickaxe book mentioned earlier should be in there too.

> What on earth does .succ! do? It never tells me. The site, and most that I've seen, are written not to target new people and tutor them but to target advanced users, with more of a "so here's a cool way to approach the problem" approach. A simple "ok, here is the Excel class, here are the commands in it and what they do, here is a syntax example" would be far more helpful, as it doesn't leave anything out. I'm still not sure if it's possible to return what row the active cell is on.

Ruby interacts with Excel using an OLE automation object, which is more of a Microsoft thing than a Ruby thing. It is documented here: http://msdn.microsoft.com/en-us/library/aa272268(office.11).aspx

> I don't quite follow your first example since I am barely familiar with Ruby syntax (though it appears to be similar to Java). What is the |s|?

This is part of block syntax. Look it up in the Pickaxe, in the chapter called "Containers, Blocks, and Iterators".

> I'm guessing span just looks for span tags?

Yes.

> I'm also guessing that (:id, /ctl/) looks for any span tag with an id matching /ctl/? (this is where I'm not following you as much)

Yes.

> What does the : in your example do?

It references the id symbol. Without going into too much detail, span(:id, /ctl/) is more efficient than span('id', /ctl/) due to the way Ruby allocates memory for strings. Don't worry too much about this, just use it (and don't mix up symbols and strings).

> What exactly is the second argument doing; what are the slashes?

The slashes denote a regular expression, which means that it will match any span whose id attribute contains 'ctl'. You can compare this to span(:id, 'ctl'), which will match only the span whose id attribute is exactly equal to 'ctl'.

> And what does the .text at the end do?

It's a method call that returns the text inside the span tag.

> Sorry for being rather dense, but I have barely dealt with web programming before. I've spent my life doing C++, Java, and BASIC, so I'm pretty much trying to stumble into a final product as gracefully as I can.

Don't overthink it. With Ruby, you're still dealing with objects, classes and methods, so your experience with OOP concepts should help you.

HTH,
Anna
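The difference between a regular expression and a plain string as the second argument can be seen in plain Ruby, without a browser. This is only an illustration of the distinction described above, not a claim about how Watir implements its lookup; IDS and matching are made-up names.

```ruby
IDS = ["ctl", "rptCourses_ctl00", "header"]

# === asks "does this value fit the pattern?":
# a Regexp fits if it matches anywhere, a String only if it is equal.
def matching(ids, what)
  ids.select { |id| what === id }
end

puts matching(IDS, /ctl/).inspect   # ["ctl", "rptCourses_ctl00"]
puts matching(IDS, 'ctl').inspect   # ["ctl"]

# A symbol such as :id is a single shared object each time it is written,
# while String.new builds a fresh string object every time.
puts :id.equal?(:id)                                # true
puts String.new("id").equal?(String.new("id"))      # false
```

So /ctl/ is the right choice when the id merely contains the fragment, and a tighter pattern such as /ctl\d\d_lblItemTxtTitle/ narrows the match further.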
[wtr-general] Re: Pulling hair out on screen scraping
It can be a bit overwhelming to learn Ruby and various libraries at the same time. I'd recommend taking a look at the Pickaxe book: http://whytheluckystiff.net/ruby/pickaxe/ just to get some general familiarity. There are other Ruby tutorials online as well as some good books: The Ruby Way, Everyday Scripting, O'Reilly's Ruby book.

succ!, as you mention below, is a Ruby core method. Gotapi also has a good searchable reference to the Ruby standard API: http://www.gotapi.com/html (click on the Ruby Standard Packages). The Pickaxe book from the link above also has an index of the core API, many with examples.

Here's a link to the Watir rdocs in case you might find that useful: http://wtr.rubyforge.org/rdoc/ and a link to supported elements (though openqa is down right now): http://wiki.openqa.org/display/WTR/Methods+supported+by+Element

Strange that the hpricot site is down now as well.

Another useful way to learn how to use libraries in Ruby is by taking a look at their unit tests. Watir has a large number of unit tests; hpricot has some too. They're located under your Ruby install directory in gems.

Ruby comes with a few documentation systems: ri and rdoc. For the gems you have installed locally you can see all of the rdocs by going to the command line and typing:

gem server

Then browse to http://localhost:8808

ri can also be used from the command line:

ri String::succ!

Additional responses inline:

On Sat, Jan 3, 2009 at 10:31 AM, Bissquitt bissqu...@gmail.com wrote:
> Regarding documentation, I read the Tutorial all the way through, but it only hit on a few specific examples, leaving out other commands altogether. I've visited MANY Ruby and Watir sites and never once saw the .span command (does it just search for span tags? guess I'll google it after this post). I never even found a site listing all the Watir commands (http://us.php.net/manual/en/function.abs.php as an example of what I mean).
>
> In addition, there are SO MANY tutorials and such online that are all very poorly done that finding a good one via Google is a needle-in-a-haystack scenario, i.e. (oh great, you showed me that specific command, but showed me nothing about how that command works, so unless I want to use it exactly the way you used it, it's useless).
>
> My example here is the Ruby on Windows site. If I google for anything regarding Ruby and Excel I either get that site, or another site that just provides me a link to that site, and am forced to make do with that site in order to teach myself how to interact with Excel. The site itself lists a BUNCH of examples but leaves it up to you to try and pick apart the syntax to understand what it is doing. For example:
>
> line = '1'
> while worksheet.Range("a#{line}")['Value']
>   line.succ!
> end
> #line now holds row number of first empty row
>
> What on earth does .succ! do? It never tells me. The site, and most that I've seen, are written not to target new people and tutor them but to target advanced users, with more of a "so here's a cool way to approach the problem" approach. A simple "ok, here is the Excel class, here are the commands in it and what they do, here is a syntax example" would be far more helpful, as it doesn't leave anything out. I'm still not sure if it's possible to return what row the active cell is on.

Excel is a strange one. :) Agreed that most sites assume a basic familiarity with Ruby, and with the links above you should be able to get into it fairly quickly. Accessing Excel is done through its COM interface, so one of the best sources of documentation is actually the Excel VBA Microsoft help file. There's a link to the standalone version of it somewhere on the internets if you don't have it installed. There are some Excel libraries on our wiki, as well as a project on Rubyforge called Rasta, which use Excel. You can browse through the source code for those.

> ...Which is when I decided to ask actual people and ended up here. (thanks again btw)
>
> ...After that long-winded response: I was trying to use Watir to scrape the page because I was having issues with the JavaScript not being executed before the scrape (when I did it in PHP) and figured that a driven web browser would be sure to get it...hence Watir.

Yep, makes sense. Watir is great at testing heavy JS sites, ajaxy stuff and the generated DOM instead of the page source.

> The reason my example was not using Watir is that I was unable to find any documentation on how to do what I needed. I saw browser.links and browser.table, but those were the only 2 I found; there was no "here is a list of the commands" as I mentioned above. Consequently I found even less on hpricot, since all I get is a 404 on its main site, and every other site links to it, so whether or not it was documented is irrelevant; all I have to work with is trying to piece together other people's code and work with it.
>
> I don't quite follow your first example since I am barely familiar with Ruby syntax (though it appears to be similar to