Re: Urllib.request vs. Requests.get

2021-12-07 Thread Paul Bryan
Cloudflare, for whatever reason, appears to be rejecting the
`User-Agent` header that urllib is providing: `Python-urllib/3.9`.
Using a different `User-Agent` seems to get around the issue:

import urllib.request

req = urllib.request.Request(
    url="https://juno.sh/direct-connection-to-jupyter-server/",
    method="GET",
    headers={"User-Agent": "Workaround/1.0"},
)

res = urllib.request.urlopen(req)
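As a quick offline check, you can see the default header urllib would otherwise send without making any request at all (a sketch; the exact version suffix depends on your interpreter):

```python
import urllib.request

# build_opener() carries the default headers urlopen() would send,
# including the User-Agent that Cloudflare appears to be rejecting.
opener = urllib.request.build_opener()
print(dict(opener.addheaders))  # e.g. {'User-agent': 'Python-urllib/3.9'}
```

Overriding that header per-request, as above, is simpler than installing a custom opener.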

Paul

On Tue, 2021-12-07 at 12:35 +0100, Julius Hamilton wrote:
> [...]
> However, requests.get works fine on this url:
> 
> https://juno.sh/direct-connection-to-jupyter-server/
> 
> But urllib returns a “403 forbidden”.
> 
> Could anyone please comment on what the fundamental differences are
> between urllib vs. requests, why this would happen, and if urllib has
> any option to prevent this and get the page source?
> 
> Thanks,
> Julius

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Urllib.request vs. Requests.get

2021-12-07 Thread Chris Angelico
On Wed, Dec 8, 2021 at 4:51 AM Julius Hamilton wrote:
> [...]
>
> However, requests.get works fine on this url:
>
> https://juno.sh/direct-connection-to-jupyter-server/
>
> But urllib returns a “403 forbidden”.
>
> Could anyone please comment on what the fundamental differences are between
> urllib vs. requests, why this would happen, and if urllib has any option to
> prevent this and get the page source?
>

*Fundamental* differences? Not many. The requests module is designed
to be easy to use, whereas urllib is designed to be basic and simple.
Not really a fundamental difference, but perhaps indicative.

I'd recommend doing the query with requests, and seeing exactly what
headers are being sent. Most likely, there'll be something that you
need to add explicitly when using urllib that the server is looking
for (maybe a user agent or something). Requests (via urllib3) uses
Python's logging module, so it should be a simple matter of setting
the log level to DEBUG and sending the request.
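One way to see those headers without any logging setup is to prepare the request and inspect it before (or instead of) sending it — a sketch, assuming the requests package is installed:

```python
import requests

# Prepare the request without sending it, so the exact headers
# requests would transmit can be examined offline.
session = requests.Session()
prepared = session.prepare_request(
    requests.Request("GET", "https://juno.sh/direct-connection-to-jupyter-server/")
)
for name, value in prepared.headers.items():
    print(f"{name}: {value}")  # User-Agent will be python-requests/<version>
```

Comparing this list against what urllib sends should show which header the server is keying on.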

TBH though, I'd just recommend using requests, unless you specifically
need to avoid the dependency :)

ChrisA


Urllib.request vs. Requests.get

2021-12-07 Thread Julius Hamilton
Hey,

I am currently working on a simple program which scrapes text from webpages
via a URL, then segments it (with Spacy).

I’m trying to refine my program to use just the right tools for the job,
for each of the steps.

Requests.get works great, but I’ve seen people use urllib.request.urlopen()
in some examples. It appealed to me because it seemed lower level than
requests.get, so it just makes the program feel leaner and purer and more
direct.

However, requests.get works fine on this url:

https://juno.sh/direct-connection-to-jupyter-server/

But urllib returns a “403 forbidden”.

Could anyone please comment on what the fundamental differences are between
urllib vs. requests, why this would happen, and if urllib has any option to
prevent this and get the page source?

Thanks,
Julius