Re: Urllib.request vs. Requests.get
Cloudflare, for whatever reason, appears to be rejecting the `User-Agent` header that urllib is providing: `Python-urllib/3.9`. Using a different `User-Agent` seems to get around the issue:

    import urllib.request

    req = urllib.request.Request(
        url="https://juno.sh/direct-connection-to-jupyter-server/",
        method="GET",
        headers={"User-Agent": "Workaround/1.0"},
    )
    res = urllib.request.urlopen(req)

Paul

On Tue, 2021-12-07 at 12:35 +0100, Julius Hamilton wrote:
> Hey,
>
> I am currently working on a simple program which scrapes text from
> webpages via a URL, then segments it (with Spacy).
>
> I’m trying to refine my program to use just the right tools for the
> job, for each of the steps.
>
> Requests.get works great, but I’ve seen people use
> urllib.request.urlopen() in some examples. It appealed to me because
> it seemed lower level than requests.get, so it just makes the program
> feel leaner and purer and more direct.
>
> However, requests.get works fine on this url:
>
> https://juno.sh/direct-connection-to-jupyter-server/
>
> But urllib returns a “403 Forbidden”.
>
> Could anyone please comment on what the fundamental differences are
> between urllib vs. requests, why this would happen, and if urllib has
> any option to prevent this and get the page source?
>
> Thanks,
> Julius

-- 
https://mail.python.org/mailman/listinfo/python-list
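[For what it's worth, the default header Paul mentions can be confirmed locally without making any request: urllib attaches its `User-Agent` via the opener, not the `Request` object. A minimal sketch (no network access needed):]

```python
import urllib.request

# urllib's default User-Agent lives on the opener's addheaders list,
# and is only added at open time if the Request doesn't set one itself.
opener = urllib.request.build_opener()
default_ua = dict(opener.addheaders)["User-agent"]
print(default_ua)  # e.g. 'Python-urllib/3.9'
```

This is the string Cloudflare appears to be matching on, which is why overriding the header in the `Request` is enough to get a 200 back.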
Re: Urllib.request vs. Requests.get
On Wed, Dec 8, 2021 at 4:51 AM Julius Hamilton wrote:
>
> Hey,
>
> I am currently working on a simple program which scrapes text from webpages
> via a URL, then segments it (with Spacy).
>
> I’m trying to refine my program to use just the right tools for the job,
> for each of the steps.
>
> Requests.get works great, but I’ve seen people use urllib.request.urlopen()
> in some examples. It appealed to me because it seemed lower level than
> requests.get, so it just makes the program feel leaner and purer and more
> direct.
>
> However, requests.get works fine on this url:
>
> https://juno.sh/direct-connection-to-jupyter-server/
>
> But urllib returns a “403 forbidden”.
>
> Could anyone please comment on what the fundamental differences are between
> urllib vs. requests, why this would happen, and if urllib has any option to
> prevent this and get the page source?
>

*Fundamental* differences? Not many. The requests module is designed to
be easy to use, whereas urllib is designed to be basic and simple. Not
really a fundamental difference, but perhaps indicative.

I'd recommend doing the query with requests, and seeing exactly what
headers are being sent. Most likely, there'll be something that you need
to add explicitly when using urllib that the server is looking for
(maybe a user agent or something). Requests uses Python's logging module
to configure everything, so it should be a simple matter of setting the
log level to DEBUG and sending the request.

TBH though, I'd just recommend using requests, unless you specifically
need to avoid the dependency :)

ChrisA
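[One offline way to follow Chris's suggestion and see exactly which headers requests would send, without firing off the request at all, is to build a prepared request from a `Session`. A minimal sketch:]

```python
import requests

# Prepare (but don't send) a GET to see the default headers requests
# would attach -- handy for comparing against urllib's headers.
sess = requests.Session()
prepared = sess.prepare_request(
    requests.Request("GET", "https://juno.sh/direct-connection-to-jupyter-server/")
)
print(prepared.headers["User-Agent"])  # e.g. 'python-requests/2.28.1'
```

The `User-Agent` here differs from urllib's `Python-urllib/3.x`, which is consistent with the server accepting one client and rejecting the other.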
Urllib.request vs. Requests.get
Hey,

I am currently working on a simple program which scrapes text from
webpages via a URL, then segments it (with Spacy).

I’m trying to refine my program to use just the right tools for the job,
for each of the steps.

Requests.get works great, but I’ve seen people use
urllib.request.urlopen() in some examples. It appealed to me because it
seemed lower level than requests.get, so it just makes the program feel
leaner and purer and more direct.

However, requests.get works fine on this url:

https://juno.sh/direct-connection-to-jupyter-server/

But urllib returns a “403 forbidden”.

Could anyone please comment on what the fundamental differences are
between urllib vs. requests, why this would happen, and if urllib has
any option to prevent this and get the page source?

Thanks,
Julius