Python Web Scraping Explained: How to Use Beautiful Soup
Author: 小狐狸梦想去童话镇 Published: 2021-02-21 15:20:48
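I. Introduction to Beautiful Soup
Beautiful Soup is a powerful parsing tool that uses a page's structure and attributes to parse it.
It provides functions for navigating, searching, and modifying the parse tree. You normally do not need to worry about the document's encoding: Beautiful Soup converts incoming documents to Unicode and outgoing documents to UTF-8. Beautiful Soup relies on an underlying parser to do the actual parsing; lxml is a common choice.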
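As a minimal sketch of the encoding handling described above, the snippet below feeds UTF-8 bytes to Beautiful Soup and reads back an ordinary str. The inline HTML is made up for illustration, and Python's built-in html.parser is used instead of lxml only so that the snippet runs without extra packages; passing "lxml" should behave the same way for this input.

```python
from bs4 import BeautifulSoup

# Hypothetical document, encoded to bytes as if read from disk or the network
raw = '<html><head><meta charset="utf-8"><title>百度一下</title></head></html>'.encode("utf-8")

# "html.parser" ships with Python; "lxml" is an alternative if it is installed
soup = BeautifulSoup(raw, "html.parser")

title_text = soup.title.string
print(type(raw).__name__)  # the input was bytes
print(title_text)          # the parsed text is already decoded
```

The parser name is just the second argument, so switching parsers does not change the rest of the code.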
II. Using Beautiful Soup
Sample file test03.html:
<!DOCTYPE html>
<html>
<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type" />
<meta content="IE=Edge" http-equiv="X-UA-Compatible" />
<meta content="always" name="referrer" />
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css" />
<title>百度一下,你就知道 </title>
</head>
<body link="#0000cc">
<div id="wrapper">
<div id="head">
<div class="head_wrapper">
<div id="u1">
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>
</div>
</div>
</div>
</div>
</body>
</html>
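1. Node selectors
As we saw earlier, a web page is made up of element nodes, and extracting the content of a particular node gives us the data shown on the page. Node selectors simplify this: they let us pick out data precisely without resorting to regular expressions.
from bs4 import BeautifulSoup
# Read the sample page as bytes; Beautiful Soup works out the encoding itself
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
print(soup.head)        # the whole head node
print(soup.head.title)  # nested selection: title inside head
print(soup.a)           # only the first a node
[Output]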
<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下,你就知道 </title>
</head>
<title>百度一下,你就知道 </title>
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
Analysis:
The first print statement gets the head node of the page.
The second gets the title node inside head. It uses nested selection, because title is nested inside head.
The third gets an a node. The source contains several a nodes, but only the first one is returned. When several nodes match, this style of selection returns only the first match and ignores the rest.
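The first-match behaviour can be sketched on a smaller document. The inline HTML below is a made-up fragment modelled on the sample, and html.parser is assumed only to keep the snippet dependency-free. It also shows that selecting a tag that does not exist yields None, which is worth checking before chaining further.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two a nodes and no table node
html = (
    '<div id="u1">'
    '<a href="http://news.baidu.com">新闻</a>'
    '<a href="http://map.baidu.com">地图</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a   # only the first <a> in document order
missing = soup.table  # there is no <table>, so this is None

print(first_link["href"])
print(missing)

# Guard before chaining; missing.tr would raise AttributeError on None
if missing is not None:
    print(missing.tr)
```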
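2. Extracting information
The data we want usually sits in a node's name, its attribute values, or its text. The following code shows how to read each of the three:
from bs4 import BeautifulSoup
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
print(soup.body.name)              # node name
print(soup.body.a.attrs['class'])  # class can hold several values, so it is a list
print(soup.body.a.attrs['href'])   # ordinary attributes are strings
print(soup.body.a.string)          # text of the node
[Output]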
body
['mnav']
http://news.baidu.com
新闻
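Analysis:
The first statement gets the name of the body node.
The second gets the class attribute of the a node.
The third gets the href attribute of the a node.
The fourth gets the text of the a node.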
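A small sketch of attribute access, assuming a made-up a tag with two class values and using html.parser for portability. It shows why class comes back as a list, and that get() returns None for a missing attribute where square brackets would raise KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical tag; the second class value is added for illustration
html = '<a class="mnav bri" href="http://news.baidu.com" name="tj_trnews">新闻</a>'
link = BeautifulSoup(html, "html.parser").a

classes = link["class"]       # multi-valued attribute -> list of strings
href = link["href"]           # ordinary attribute -> str
target = link.get("target")   # missing attribute -> None instead of KeyError
all_attrs = dict(link.attrs)  # attrs behaves like a dictionary

print(classes, href, target)
print(sorted(all_attrs))
```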
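3. Selecting related nodes
(1) Children and descendants
Direct children are available through the contents and children attributes, and all descendants through the descendants attribute. contents returns a list, children returns an iterator, and descendants returns a generator; all three can be traversed with a for loop.
from bs4 import BeautifulSoup
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
# print(soup.body.contents)
# contents lists the direct children of body, whitespace text nodes included
for i, content in enumerate(soup.body.contents):
    print(i, content)
[Output]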
0
1 <div id="wrapper">
<div id="head">
<div class="head_wrapper">
<div id="u1">
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>
</div>
</div>
</div>
</div>
2
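The difference between the three attributes can be sketched as follows. The fragment is hypothetical (a b tag is nested inside the first link so that descendants differ from children), and html.parser is assumed so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical fragment without whitespace between tags
html = (
    '<div id="u1">'
    '<a href="http://news.baidu.com"><b>新闻</b></a>'
    '<a href="http://map.baidu.com">地图</a>'
    '</div>'
)
div = BeautifulSoup(html, "html.parser").div

contents_is_list = isinstance(div.contents, list)
# children walks the direct children only
child_names = [child.name for child in div.children]
# descendants walks the whole subtree, text nodes included
descendant_tags = [node.name for node in div.descendants if isinstance(node, Tag)]
descendant_count = len(list(div.descendants))

print(child_names)
print(descendant_tags, descendant_count)
```

Filtering with isinstance(node, Tag) is one way to skip the text nodes that descendants also yields.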
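(2) Parent and ancestors
To get a node's parent, use the parent attribute. For example, get the parent of the title node in the sample:
from bs4 import BeautifulSoup
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
print(soup.title.parent)
[Output]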
<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下,你就知道 </title>
</head>
Similarly, to get all of a node's ancestors, use the parents attribute, which returns a generator.
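A short sketch of parents, assuming a made-up four-level document and html.parser. Iterating the generator walks outward from the node, and the last item is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

# Hypothetical document: html > body > div > a
html = '<html><body><div id="u1"><a href="http://news.baidu.com">新闻</a></div></body></html>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

parent_name = link.parent.name
# parents yields the nearest ancestor first and the document object last
ancestor_names = [ancestor.name for ancestor in link.parents]

print(parent_name)
print(ancestor_names)
```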
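(3) Siblings
next_sibling returns the node's next sibling;
previous_sibling returns the node's previous sibling;
next_siblings returns a generator over all siblings that follow the node;
previous_siblings returns a generator over all siblings that precede the node.
Siblings include text nodes, so when tags are separated by line breaks the immediate sibling is often a whitespace string rather than a tag.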
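The sibling attributes can be sketched on a fragment that, like the sample file, has a line break between tags. The HTML is made up and html.parser is an assumption for portability. find_next_sibling() is used here as one way to jump straight to the next tag.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical fragment; the "\n" between tags becomes a text node
html = (
    '<div id="u1">\n'
    '<a name="tj_trnews">新闻</a>\n'
    '<a name="tj_trmap">地图</a>\n'
    '<a name="tj_trvideo">视频</a>\n'
    '</div>'
)
first = BeautifulSoup(html, "html.parser").a

raw_next = first.next_sibling            # the whitespace text node
next_tag = first.find_next_sibling("a")  # the next a tag, skipping text nodes
# next_siblings yields every following sibling; keep only the tags
later_names = [s["name"] for s in first.next_siblings if isinstance(s, Tag)]

print(repr(raw_next))
print(next_tag["name"])
print(later_names)
```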
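4. Method selectors
find_all()
Finds all elements that match the given conditions. Its signature is:
find_all(name, attrs, recursive, string, limit, **kwargs)
The string parameter was called text before Beautiful Soup 4.4.0; the old name is still accepted.
(1) name
Searches for elements by node name. For example, find the a tags in the sample:
from bs4 import BeautifulSoup
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
print(soup.find_all(name="a"))  # the whole result list at once
for a in soup.find_all(name="a"):
    print(a)                    # one matching tag per line
[Output]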
[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>]
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>
(2) attrs
A search can also filter on tag attributes. The attrs parameter takes a dictionary.
from bs4 import BeautifulSoup
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
# Keep only the a tags whose class attribute contains "bri"
print(soup.find_all(name="a", attrs={"class": "bri"}))
[Output]
[<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>]
As you can see, once class="bri" is added, only one a tag remains in the result.
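As a hedged sketch of other ways to filter on attributes, the snippet below uses the class_ keyword and the limit parameter of find_all() on a made-up fragment (html.parser assumed). It also filters on an HTML attribute called name, which has to go through the attrs dictionary because name is already the first parameter of find_all().

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the sample's link list
html = (
    '<div id="u1">'
    '<a class="mnav" name="tj_trnews">新闻</a>'
    '<a class="mnav" name="tj_trmap">地图</a>'
    '<a class="mnav" name="tj_trvideo">视频</a>'
    '<a class="bri" name="tj_briicon">更多产品</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# class is a Python keyword, so the keyword argument is spelled class_
by_class = soup.find_all("a", class_="mnav")
# limit stops the search after the given number of matches
first_two = soup.find_all("a", class_="mnav", limit=2)
# an attribute literally called "name" must be passed through attrs
by_name_attr = soup.find_all(attrs={"name": "tj_briicon"})

print(len(by_class), len(first_two), by_name_attr)
```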
(3) text (string)
The text parameter matches against a node's text; it accepts either a string or a compiled regular expression. Since Beautiful Soup 4.4.0 the parameter is named string, and text is kept as the older name.
import re
from bs4 import BeautifulSoup
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
# string= is the current name of the older text= argument
print(soup.find_all(name="a", string=re.compile('新闻')))
[Output]
[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>]
The result contains only the a tag whose text contains "新闻".
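The difference between passing a plain string and a regular expression can be sketched as follows. The fragment is made up but keeps the trailing space that the sample's link text has, and html.parser plus the string keyword (Beautiful Soup 4.4.0 or later) are assumptions. A plain string must equal the whole text, while a pattern only needs to be found within it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment; note the trailing space inside each tag
html = '<a name="tj_trnews">新闻 </a><a name="tj_trmap">地图 </a>'
soup = BeautifulSoup(html, "html.parser")

# A plain string is compared with the full text, so "新闻" != "新闻 "
exact = soup.find_all("a", string="新闻")
# A compiled pattern is searched for inside the text
by_regex = soup.find_all("a", string=re.compile("新闻"))

print(exact)
print(by_regex)
```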
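find()
find() is used in the same way as find_all(). The only difference is that find() matches just the first element found and returns that single element, whereas find_all() matches every element that meets the conditions and returns a list.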
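A minimal sketch of the two return types, on a made-up fragment with html.parser assumed. It also shows what each method gives back when nothing matches: find() returns None, while find_all() returns an empty list.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two a tags and no table
html = '<a href="http://news.baidu.com">新闻</a><a href="http://map.baidu.com">地图</a>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")             # a single Tag
every = soup.find_all("a")         # a list of Tags
no_tag = soup.find("table")        # None when nothing matches
no_tags = soup.find_all("table")   # empty list when nothing matches

print(first["href"], len(every))
print(no_tag, no_tags)
```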
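5. CSS selectors
To use CSS selectors, call the select() method and pass in a CSS selector.
For example, select the a tags in the sample with a CSS selector:
from bs4 import BeautifulSoup
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
print(soup.select('a'))  # select() returns a list of matching tags
for a in soup.select('a'):
    print(a)
[Output]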
[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>]
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>
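Selectors can be more specific than a bare tag name. The sketch below, on a made-up fragment with html.parser assumed, combines an id, a class, and an attribute prefix, and uses select_one() to get a single element instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the sample's link list
html = (
    '<div id="u1">'
    '<a class="mnav" href="http://news.baidu.com">新闻</a>'
    '<a class="mnav" href="https://www.hao123.com">hao123</a>'
    '<a class="bri" href="//www.baidu.com/more/">更多产品</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

mnav_links = soup.select("#u1 a.mnav")          # a tags with class mnav inside #u1
first_bri = soup.select_one("a.bri")            # first match only, or None
http_links = soup.select('a[href^="http://"]')  # href starting with http://

print(len(mnav_links), first_bri["href"], len(http_links))
```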
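Getting attributes
Get the href attribute of the a tags above:
from bs4 import BeautifulSoup
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
for a in soup.select('a'):
    print(a['href'])  # attribute access works like a dictionary lookup
[Output]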
http://news.baidu.com
https://www.hao123.com
http://map.baidu.com
http://v.baidu.com
http://tieba.baidu.com
//www.baidu.com/more/
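The last href in the output is protocol-relative. One possible way to turn such values into absolute URLs is urljoin() from the standard library, sketched below. BASE_URL is an assumption standing in for the address the page was fetched from, the fragment is made up, and html.parser is used for portability.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Assumed address of the page the links came from
BASE_URL = "https://www.baidu.com/"

# Hypothetical fragment with one absolute and one protocol-relative link
html = (
    '<a href="http://news.baidu.com">新闻</a>'
    '<a href="//www.baidu.com/more/">更多产品</a>'
)
soup = BeautifulSoup(html, "html.parser")

# a[href] skips any a tag that has no href attribute
absolute = [urljoin(BASE_URL, a["href"]) for a in soup.select("a[href]")]

for url in absolute:
    print(url)
```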
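Getting text
To get the text of the a tags above, use the get_text() method or the string attribute:
from bs4 import BeautifulSoup
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
for a in soup.select('a'):
    print(a.get_text())  # all text inside the tag
    print(a.string)      # the tag's single text child
[Output]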
新闻
新闻
hao123
hao123
地图
地图
视频
视频
贴吧
贴吧
更多产品
更多产品
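In the sample the two calls print the same thing because every a tag holds a single text node. A hedged sketch of where they differ, using a made-up fragment in which the second link contains a nested b tag (html.parser assumed):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the second link has a tag plus text as children
html = '<a class="mnav">新闻 </a><a class="bri"><b>更多</b>产品 </a>'
soup = BeautifulSoup(html, "html.parser")
simple, nested = soup.find_all("a")

simple_string = simple.string              # single text child
nested_string = nested.string              # None: more than one child
nested_text = nested.get_text()            # text of all descendants, joined
nested_clean = nested.get_text(strip=True) # same, with whitespace stripped

print(repr(simple_string), nested_string)
print(repr(nested_text), repr(nested_clean))
```

get_text() is therefore the safer choice when a tag may contain nested markup.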
Source: https://blog.csdn.net/gets_s/article/details/120372061