Python Core Modules: A Detailed Guide to urllib [Python Knowledge Point of the Day, No. 79]

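urllib is one of Python's built-in core modules, used mainly to construct web requests of all kinds. It is simple to use yet fairly powerful, which makes it a good starting point for writing crawlers. Below we cover the core usage of urllib to help you pick it up quickly. Note that the requests library commonly used in Python crawler projects is built on urllib3, a separate third-party package, not on the standard-library urllib.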


Get

urllib's request module makes it easy to fetch the content of a URL: it sends a GET request to the specified page and returns the HTTP response.

For example, fetch the Douban URL https://api.douban.com/v2/book/2129650 and print the response:



from urllib import request

# Send a GET request; the with block closes the connection when it exits.
with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
    data = f.read()  # response body as bytes
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', data.decode('utf-8'))

The output shows the HTTP response headers followed by the JSON data:



Status: 200 OK
Server: nginx
Date: Tue, 26 May 2015 10:02:27 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 2049
Connection: close
Expires: Sun, 1 Jan 2006 01:00:00 GMT
Pragma: no-cache
Cache-Control: must-revalidate, no-cache, private
X-DAE-Node: pidl1
Data: {"rating":{"max":10,"numRaters":16,"average":"7.4","min":0},"subtitle":"","author":["廖雪峰编著"],"pubdate":"2007-6",...}
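The body returned above is JSON, so it can be turned into Python objects with the standard json module. The following is a minimal sketch that works on a hard-coded, shortened excerpt of that response instead of a live request; the helper name summarize_book and the sample constant are assumptions made for illustration.

```python
import json

# Shortened, hard-coded excerpt of the response body shown above
# (bytes, which is what f.read() returns).
SAMPLE_BODY = (
    b'{"rating":{"max":10,"numRaters":16,"average":"7.4","min":0},'
    b'"subtitle":"","pubdate":"2007-6"}'
)


def summarize_book(raw: bytes) -> dict:
    """Decode a UTF-8 JSON body and pick out a few fields (hypothetical helper)."""
    book = json.loads(raw.decode('utf-8'))
    return {
        'average': float(book['rating']['average']),  # the API sends this as a string
        'raters': book['rating']['numRaters'],
        'pubdate': book['pubdate'],
    }


print(summarize_book(SAMPLE_BODY))
```

In a real script, the bytes passed to the helper would be the `data` value read inside the `with request.urlopen(...)` block.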

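To simulate a browser sending a GET request, use a Request object. By adding HTTP headers to the Request object, we can disguise the request as one coming from a browser. For example, to simulate an iPhone 6 requesting the Douban home page: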



from urllib import request

req = request.Request('http://www.douban.com/')
# Present a mobile Safari User-Agent so the server treats the request as coming from an iPhone.
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))

Douban then returns the mobile version of the page, suited to an iPhone:



...
    <meta name="viewport" content="width=device-width, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0">
    <meta name="format-detection" content="telephone=no">
    <link rel="apple-touch-icon" sizes="57x57" href="http://img4.douban.com/pics/cardkit/launcher/57.png" />
...
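Headers can also be supplied all at once through the headers argument of Request, which is convenient when several headers are copied from a browser. The sketch below only builds the request and inspects it, without sending anything; the header values are illustrative assumptions, not values taken from a real browser session.

```python
from urllib import request

# Hypothetical header values for illustration; copy real ones from
# your browser's developer tools.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X)',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}

# Passing headers= has the same effect as calling add_header() for each pair.
req = request.Request('http://www.douban.com/', headers=HEADERS)

print(req.get_method())              # GET, because no data was supplied
print(req.get_header('User-agent'))  # Request stores header names capitalized, e.g. 'User-agent'
print(req.header_items())
```

Note that Request normalizes header names with str.capitalize(), so lookups through get_header() or has_header() need the 'User-agent' spelling.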

Post

To send a request with POST, simply pass the data parameter as bytes.
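As a small illustration of what "as bytes" means in practice, the sketch below encodes form fields with parse.urlencode, converts the result to bytes, and attaches it to a Request. Nothing is sent over the network; the credentials and the URL are placeholders assumed for this example.

```python
from urllib import parse, request

# Placeholder credentials and URL, used only to show the encoding step.
form = [('username', 'user@example.com'), ('password', 'p@ss word')]

encoded = parse.urlencode(form)  # str: percent-encoded key=value pairs joined by '&'
body = encoded.encode('utf-8')   # bytes: the type urlopen() and Request expect for data
print(encoded)

# Attaching data switches the default method from GET to POST.
req = request.Request('http://www.example.com/login', data=body)
print(req.get_method())
```

urlencode escapes reserved characters such as @ and encodes spaces as +, so values typed by a user can be passed through safely.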

Let's simulate a Weibo login: first read the login email and password, then pass them encoded as username=xxx&password=xxx, following the format used by the weibo.cn login page:



from urllib import request, parse

print('Login to weibo.cn...')
email = input('Email: ')
passwd = input('Password: ')
# Build the form body in the field order the login page uses.
login_data = parse.urlencode([
    ('username', email),
    ('password', passwd),
    ('entry', 'mweibo'),
    ('client_id', ''),
    ('savestate', '1'),
    ('ec', ''),
    ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3A%2F%2Fm.weibo.cn%2F')
])

req = request.Request('https://passport.weibo.cn/sso/login')
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')

# Supplying data (as bytes) makes urlopen send a POST request.
with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))

If the login succeeds, the response looks like this:



Status: 200 OK
Server: nginx/1.2.0
...
Set-Cookie: SSOLoginState=1432620126; path=/; domain=weibo.cn
...
Data: {"retcode":20000000,"msg":"","data":{...,"uid":"1658384301"}}

If the login fails, the response looks like this:



...
Data: {"retcode":50011015,"msg":"\u7528\u6237\u540d\u6216\u5bc6\u7801\u9519\u8bef","data":{"username":"example@python.org","errline":536}}
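The two response bodies differ in their retcode field, and the msg field arrives as \uXXXX escapes. A minimal sketch of telling them apart with the json module follows. It assumes, based only on the samples shown here and not on any Weibo documentation, that retcode 20000000 means success; the helper check_login is hypothetical, and the success body is trimmed to the fields visible above.

```python
import json

# Failure body copied from the sample above (raw string keeps the \u escapes literal).
FAILED = (r'{"retcode":50011015,"msg":"\u7528\u6237\u540d\u6216\u5bc6\u7801\u9519\u8bef",'
          r'"data":{"username":"example@python.org","errline":536}}')
# Success body trimmed to the fields visible in the sample; the elided part is left out.
SUCCEEDED = '{"retcode":20000000,"msg":"","data":{"uid":"1658384301"}}'

# Assumption: inferred from the successful sample, not from official documentation.
SUCCESS_CODE = 20000000


def check_login(body: str) -> tuple:
    """Return (succeeded, message) for a login response body (hypothetical helper)."""
    result = json.loads(body)  # json.loads turns \uXXXX escapes into real characters
    return result['retcode'] == SUCCESS_CODE, result['msg']


ok, msg = check_login(FAILED)
print(ok, msg)
```

After decoding, the failure message is readable Chinese text saying that the username or password is incorrect.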

Handler

For more complex control, such as accessing a site through a proxy, we need ProxyHandler. Sample code:



import urllib.request

# Route HTTP requests through a proxy that requires basic authentication.
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
with opener.open('http://www.example.com/login.html') as f:
    pass
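To see what build_opener actually assembles, the handler chain can be inspected without opening any connection. The sketch below reuses the placeholder proxy address from the example above and only lists the handlers; the commented-out install_opener call is an optional step, shown as one possible way to make the opener the default.

```python
import urllib.request

# Placeholder proxy address, as in the example above.
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
opener = urllib.request.build_opener(proxy_handler)

# build_opener() adds the default handlers as well; list them to confirm
# that the custom proxy handler replaced the default one.
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)

# Optional: make this opener the default so that plain
# urllib.request.urlopen() calls also go through the proxy.
# urllib.request.install_opener(opener)
```

Because a ProxyHandler instance is passed in, build_opener skips its own default ProxyHandler, so only one appears in the chain.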

Summary

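What urllib offers is a way to perform HTTP requests from a program. To imitate a browser for a particular task, the request has to be disguised as a browser request: first observe the requests the browser actually sends, then copy its request headers. The User-Agent header is the one that identifies the browser.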


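The "Python for Beginners: One Knowledge Point a Day" column is run by the Python learner community at Magedu (马哥教育). It shares Python tools, syntax, and project know-how to help readers get up to speed with Python quickly.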
