Re: Javascript website scraping using WebKit and Selenium tools

2015-07-02 Thread Veek M
dieter wrote:

> Once the problems with getting the final HTML code are solved,
> I would use lxml and its XPath support to locate any
> relevant HTML information.

Hello Dieter, yes, you are correct (though I don't think any auth is needed
to browse; nice that you actually tried). He's using JSONP to update his
HTML. I decided to mangle it manually:

urllib to download, re to strip the jsonp(..stuff I want..) wrapper, and
then lxml. It works and I got the text. Now I need to translate it. Many
thanks.
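
A minimal sketch of the unwrap-then-parse step, not the exact script I ran. Assumptions: the callback name `jsonp128`, the sample response, and the idea that the payload is a JSON string holding the HTML are all made up for illustration, and the download with urllib is left out because the JSONP endpoint URL is not given in this thread.

```python
import json
import re

# Matches a JSONP wrapper such as  jsonp128( ... )  or  jsonp128( ... );
JSONP_RE = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def unwrap_jsonp(body):
    """Return the text between the outer parentheses of a JSONP response."""
    match = JSONP_RE.match(body)
    if match is None:
        raise ValueError('not a JSONP response')
    return match.group(1)


def image_alts(fragment):
    """Return the alt text of each product image in an HTML fragment."""
    # lxml is third-party; imported here so unwrap_jsonp works without it.
    from lxml import html
    root = html.fromstring(fragment)
    # Match "item" as a whole class token, ignoring stray whitespace.
    return root.xpath(
        '//dl[contains(concat(" ", normalize-space(@class), " "), " item ")]'
        '/dt[@class="photo"]/a/img/@alt')


# Hypothetical response body; the real one would come from urllib.
sample = (r'jsonp128("<dl class=\"item \"><dt class=\"photo\">'
          r'<a><img alt=\"GY-68 BMP180\"></a></dt></dl>");')

# Assumption: the payload is a JSON string, so json.loads undoes the escaping.
fragment = json.loads(unwrap_jsonp(sample))
```

Calling `image_alts(fragment)` then hands the unwrapped HTML to lxml. If the real payload is a JSON object rather than a bare string, the `json.loads` result would need one more lookup to reach the HTML.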

I should have checked the HTTP headers first to see what he was
downloading. I'm an ass. Oh well, solved :)
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Javascript website scraping using WebKit and Selenium tools

2015-07-01 Thread dieter
Veek M <vek.m1...@gmail.com> writes:

> I tried scraping a JavaScript website using two tools; neither worked. The
> website link is: http://xdguo.taobao.com/category-499399872.htm The relevant
> text I'm trying to extract is 'GY-68...':
>
> <div class="item3line1">
>
> <dl class="item " data-id="38952795780">
> <dt class="photo">
> <a target="_blank" href="//item.taobao.com/item.htm?spm=a1z10.5-c.w4002-6778075404.11.54MDOI&id=38952795780" data-spm-wangpu-module-id="4002-6778075404" data-spm-anchor-id="a1z10.5-c.w4002-6778075404.11">
> <img src="//img.alicdn.com/bao/uploaded/i4/TB1HMt3FFaFaVXX_!!0-item_pic.jpg_240x240.jpg" alt="GY-68 BMP180 ?? BOSCH?? ??? ?? BMP085"></img>
> </a>
> </dt>
>
> ...

When I try to access the link above, I am redirected to a
login page, which, of course, may look very different from what you expect.
You may need to pass authentication information along with
your request in order to get the page you are expecting.
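
One way to notice such a redirect from a script is to compare the URL you requested with the URL that was finally served. The following Python 3 sketch shows the idea; the helper names are mine, and comparing only host names is an assumption (a login page served from the same host would slip through).

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def redirected_elsewhere(requested_url, final_url):
    """True if the final URL is on a different host than the requested one."""
    return urlsplit(requested_url).hostname != urlsplit(final_url).hostname


def fetch(url):
    """Download url; also report whether redirects moved it to another host."""
    response = urlopen(url)        # follows HTTP redirects automatically
    final_url = response.geturl()  # URL actually served after any redirects
    return response.read(), redirected_elsewhere(url, final_url)
```

For example, `body, moved = fetch('http://xdguo.taobao.com/category-499399872.htm')` would set `moved` to True if the request ended up on a separate login host.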

Note also that today's sites often rely heavily on JavaScript, which
means that a page only reaches its final form once the JavaScript
has been executed.


Once the problems with getting the final HTML code are solved,
I would use lxml and its XPath support to locate any
relevant HTML information.



Javascript website scraping using WebKit and Selenium tools

2015-07-01 Thread Veek M
I tried scraping a JavaScript website using two tools; neither worked. The
website link is: http://xdguo.taobao.com/category-499399872.htm The relevant
text I'm trying to extract is 'GY-68...':

<div class="item3line1">

<dl class="item " data-id="38952795780">
<dt class="photo">
<a target="_blank" href="//item.taobao.com/item.htm?spm=a1z10.5-c.w4002-6778075404.11.54MDOI&id=38952795780" data-spm-wangpu-module-id="4002-6778075404" data-spm-anchor-id="a1z10.5-c.w4002-6778075404.11">
<img src="//img.alicdn.com/bao/uploaded/i4/TB1HMt3FFaFaVXX_!!0-item_pic.jpg_240x240.jpg" alt="GY-68 BMP180 ?? BOSCH?? ??? ?? BMP085"></img>
</a>
</dt>

I'm trying to match the class="item " bit as a first step:

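from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Run Firefox inside an invisible virtual X display.
display = Display(visible=0, size=(800, 600))
display.start()

browser = webdriver.Firefox()
browser.get('http://xdguo.taobao.com/category-499399872.htm')
print(browser.title)

# Give the page's JavaScript two minutes to finish.
time.sleep(120)
# Note: 'item ' keeps the trailing space from the markup; a class-name
# lookup expects a single class token such as 'item'.
content = browser.find_element(By.CLASS_NAME, 'item ')
print(content)
browser.quit()

display.stop()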


I get:
selenium.common.exceptions.NoSuchElementException: Message: Unable to
locate element: {"method":"class name","selector":"item "}

I also tried using WebKit. I know the site renders okay in WebKit because I
tested with rekonq. Here I get the page (in Chinese), but the relevant
data is not there. WebKit is supposed to run the JavaScript and give me the
final results, but I don't think that's happening.

import sys
from io import StringIO
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html
from lxml import etree

# Take this class for granted. Just use the result of rendering.
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # blocks until _loadFinished quits the event loop

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://xdguo.taobao.com/category-499399872.htm'
r = Render(url)  # returns a Render object
result = r.frame.toHtml()  # returns a QString
result_utf8 = result.toUtf8()  # returns a QByteArray of UTF-8 data

# QByteArray -> str -> unicode
#contents = StringIO(unicode(result_utf8.data(), "utf-8"))
data = result_utf8.data()  # returns a byte string
print(data)

element = html.fromstring(data)
print(element.tag)

for img in element.xpath('//dl[@class="item "]/dt[@class="photo"]/a/img'):
    print(img.get('alt'))

#archive_links = html.fromstring(str(result.toAscii()))
#print archive_links.xpath("/html/body/div[2]/div[3]/div[2]/div[2]/div[1]/div/div/div/div/div/div[2]/div[2]/dl[1]/dt/a/img")

Basically I want a list of parts the seller has to offer that I can grep, 
sort, uniq. I also tried elinks and lynx with ECMAScript but that was too 
basic and didn't work.
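
Once the alt texts are extracted, the sort/uniq part could be done in Python as well. A small sketch, assuming the titles are already collected in a list; the sample titles below are placeholders, not real scrape output.

```python
def unique_sorted(titles):
    """Strip whitespace, drop blanks and duplicates, then sort (sort | uniq)."""
    return sorted({t.strip() for t in titles if t.strip()})


# Placeholder titles for illustration only.
titles = ['GY-68 BMP180', 'GY-521 MPU6050', 'GY-68 BMP180 ', '']

# One title per line, so the output can still be piped to grep.
for title in unique_sorted(titles):
    print(title)
```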
