On Tuesday, March 3, 2015 at 12:28:05 AM UTC-8, Gao Boswell wrote:
>
> I have a spider that needs to crawl some data from a movie site, but I found 
> that the data is generated through Ajax after the page has loaded. My 
> code is:
> require 'nokogiri'
> require 'watir-webdriver'
>
> browser = Watir::Browser.new 
> browser.goto 'http://www.tudou.com/albumplay/2Dk1-JIVpzo/yp927-uKGMs.html?FR=LIAN'
> browser.element(:css => "#digBury .dig_container").wait_until_present
> puts '***************************************'
> puts browser.html
> puts '***************************************'
> doc = Nokogiri::HTML(browser.html)
> content = doc.css(".dig_container .num")
> browser.close
>
> From the output I can get the content I want:
>
> <div id="digBury" class="dig_wrap">
> <div class="dig_container">
> <a title="喜欢就挖一下吧,登录后双倍威力" class="btn" href="#">
> <i class="iconfont"></i>
> <i class="tip">+1</i>
> <span class="num">332</span>
> </a>
>
> </div>
> </div>
>
> But I know that on the server I must use headless, so I changed my code to:
> require 'nokogiri'
> require 'watir-webdriver'
> require 'headless'
>
> headless = Headless.new 
> headless.start
>
> browser = Watir::Browser.new 
> browser.goto 'http://www.tudou.com/albumplay/2Dk1-JIVpzo/yp927-uKGMs.html?FR=LIAN'
> browser.element(:css => "#digBury .dig_container").wait_until_present
> puts '***************************************'
> puts browser.html
> puts '***************************************'
> doc = Nokogiri::HTML(browser.html)
> content = doc.css(".dig_container .num")
> browser.close
> headless.destroy
>
> This time I can't get my result. The output is:
> <div id="digBury" class="dig_wrap disabled">
> <a title="挖" class="btn" href="#">
> <i class="iconfont"></i>
> <span class="btn_desc">挖</span>
> </a>
> </div>
>
> The difference is that I added headless, and the effect is that either the 
> Ajax request is not sent or I miss the Ajax response. How can I fix this 
> problem?
>

You might try using PhantomJS instead of headless. Download the 
appropriate PhantomJS <http://phantomjs.org/> executable, place it on your 
PATH, and then pass :phantomjs as the browser (Watir::Browser.new :phantomjs).
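
A rough sketch of what that change could look like, assuming the
watir-webdriver gem is installed and a phantomjs executable is on the PATH.
The method names phantomjs_browser and read_dig_count are made up for this
example; the selectors and URL are taken from your post. It reads the counter
straight from the element instead of going through Nokogiri, which is a
simplification on my part and not something you have to do.

```ruby
# Sketch only: assumes the watir-webdriver gem and a phantomjs binary on PATH.
URL = 'http://www.tudou.com/albumplay/2Dk1-JIVpzo/yp927-uKGMs.html?FR=LIAN'

# Hypothetical helper: start a PhantomJS-backed browser (no Xvfb/headless gem).
def phantomjs_browser
  require 'watir-webdriver' # loaded lazily, only when a real browser is needed
  Watir::Browser.new :phantomjs
end

# Hypothetical helper: open the page, wait for the Ajax-rendered container,
# and return the text of the counter inside it.
def read_dig_count(url, browser = nil)
  browser ||= phantomjs_browser
  begin
    browser.goto url
    browser.element(:css => '#digBury .dig_container').wait_until_present
    browser.element(:css => '#digBury .dig_container .num').text
  ensure
    browser.close # always release the browser, even if the wait times out
  end
end
```

Calling read_dig_count(URL) would then print nothing by itself; wrap it in
puts to see the value. Whether PhantomJS actually triggers the site's Ajax
call is something you will have to verify against the live page.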

Also: if you are spidering, be well behaved. Get the robots.txt file from 
the site and only access allowed pages. In addition, review the user 
agreement or terms of service and make sure that using a robot or any kind 
of automation to access the site is not forbidden.
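
For the robots.txt part, something along these lines might be a starting
point. It is a deliberately simplified sketch using only the Ruby standard
library: it honours only Disallow prefixes in groups addressed to every agent
("User-agent: *") and ignores Allow lines, wildcards, and agent-specific
groups. The method names fetch_robots_txt, disallowed_prefixes, and
path_allowed? are invented for this example.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: download robots.txt for a host (makes a network request).
def fetch_robots_txt(host)
  Net::HTTP.get(URI("http://#{host}/robots.txt"))
end

# Collect the Disallow prefixes from groups that apply to all agents ("*").
def disallowed_prefixes(robots_txt)
  prefixes = []
  applies = false   # true while inside a group that names "*"
  in_agents = false # true while reading consecutive User-agent lines
  robots_txt.each_line do |line|
    line = line.sub(/#.*/, '').strip # drop comments and surrounding whitespace
    next if line.empty?
    field, value = line.split(':', 2).map(&:strip)
    next if value.nil? # not a "field: value" line
    case field.downcase
    when 'user-agent'
      applies = false unless in_agents # a new group starts here
      applies ||= (value == '*')
      in_agents = true
    when 'disallow'
      in_agents = false
      prefixes << value if applies && !value.empty? # empty Disallow allows all
    else
      in_agents = false
    end
  end
  prefixes
end

# True when no collected Disallow prefix matches the start of the path.
def path_allowed?(robots_txt, path)
  disallowed_prefixes(robots_txt).none? { |prefix| path.start_with?(prefix) }
end
```

You would call path_allowed?(fetch_robots_txt('www.tudou.com'), '/albumplay/...') 
before each request and skip the page when it returns false. For anything 
beyond a quick check, a proper robots.txt parser is the safer choice.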
