Python Beautiful Soup library: getting started and installation tutorial
Author: Cachel wood  Published: 2023-03-04 06:24:27
Tags: python, beautiful, soup, library
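Contents
Installing the Beautiful Soup library
Understanding the Beautiful Soup library
Importing the Beautiful Soup library
The BeautifulSoup class
Reviewing demo.html
Tag
Tag name
Tag attrs (attributes)
Tag NavigableString
Basic HTML structure
Traversing down the tag tree
Traversing up the tag tree
Traversing sideways across siblings
The bs4 library's prettify() method
Encoding in the bs4 library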
Installing the Beautiful Soup library
pip install beautifulsoup4
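One quick way to confirm that the installation worked is to import the package and print its version. This is only a sanity-check sketch and not part of the original tutorial. Note that the package is installed as `beautifulsoup4` but imported as `bs4`.

```python
import bs4

# The distribution is named "beautifulsoup4", but the importable module is "bs4".
version = bs4.__version__
print(version)
```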
Understanding the Beautiful Soup library
Beautiful Soup is a library for parsing, traversing, and maintaining a "tag tree".
Importing the Beautiful Soup library
from bs4 import BeautifulSoup
import bs4
The BeautifulSoup class
A BeautifulSoup object corresponds to the entire content of an HTML/XML document.
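As a minimal sketch of this idea, the snippet below builds a BeautifulSoup object from an inline string, so it runs without network access. The HTML string is a made-up example rather than the demo page used in the rest of this tutorial.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used only for illustration.
html = "<html><head><title>Example</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(type(soup))          # the object that stands for the whole document
print(soup.name)           # the document root reports the special name "[document]"
print(soup.title.string)   # the text inside <title>
```

The second argument names the parser; "html.parser" is the one bundled with Python, which is why the tutorial's examples need no extra parser package.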
Reviewing demo.html
import requests
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
print(demo)
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>
Tag
Basic element | Description |
---|---|
Tag | A tag, the most basic unit of information organization; it is opened with <> and closed with </> |
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.title)
tag = soup.a
print(tag)
<title>This is a python demo page</title>
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
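Any tag that exists in HTML syntax can be accessed with soup.<tag>. When the HTML document contains several tags with the same name, soup.<tag> returns only the first one.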
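Because soup.<tag> stops at the first match, collecting every tag of a given name needs a different call. The sketch below contrasts `soup.a` with `find_all("a")`, using a shortened inline copy of the demo page's second paragraph so that it runs offline.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the demo page's link paragraph (assumed content).
html = (
    '<p class="course">'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a              # only the first <a> tag in the document
all_links = soup.find_all("a")   # every <a> tag, in document order

print(first_link["id"])
print([link["id"] for link in all_links])
```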
Tag name
Basic element | Description |
---|---|
Name | The tag's name; the name of <p>…</p> is 'p'. Format: <tag>.name |
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.a.name)
print(soup.a.parent.name)
print(soup.a.parent.parent.name)
a
p
body
Tag attrs (attributes)
Basic element | Description |
---|---|
Attributes | The tag's attributes, organized as a dictionary. Format: <tag>.attrs |
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
tag = soup.a
print(tag.attrs)
print(tag.attrs['class'])
print(tag.attrs['href'])
print(type(tag.attrs))
print(type(tag))
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
['py1']
http://www.icourse163.org/course/BIT-268001
<class 'dict'>
<class 'bs4.element.Tag'>
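Since `.attrs` is an ordinary dictionary, looking up an attribute that the tag does not have raises a KeyError. The sketch below shows the indexing shorthand together with the safer `get()` and `has_attr()` calls. The `<a>` tag is copied from the demo page, and `title` is just an example of an attribute that it lacks.

```python
from bs4 import BeautifulSoup

html = '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
tag = BeautifulSoup(html, "html.parser").a

href = tag["href"]            # shorthand for tag.attrs["href"]
classes = tag.get("class")    # class is multi-valued, so it comes back as a list
missing = tag.get("title")    # attribute not present: None instead of a KeyError
has_id = tag.has_attr("id")   # membership test without reading the value

print(href, classes, missing, has_id)
```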
Tag NavigableString
Basic element | Description |
---|---|
NavigableString | The non-attribute string inside a tag, i.e. the string between <> and </>. Format: <tag>.string |
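Tag Comment
Basic element | Description |
---|---|
Comment | The comment part of the string inside a tag; Comment is a special string type (a subclass of NavigableString) |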
import requests
from bs4 import BeautifulSoup
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")
print(newsoup.b.string)
print(type(newsoup.b.string))
print(newsoup.p.string)
print(type(newsoup.p.string))
This is a comment
<class 'bs4.element.Comment'>
This is not a comment
<class 'bs4.element.NavigableString'>
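The output above shows that `.string` prints a comment without its `<!-- -->` markers, so the printed text alone does not tell a comment apart from ordinary text. One way to separate them is a type check, sketched below. The helper name `visible_text` is hypothetical and not part of bs4.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

def visible_text(tag):
    """Return tag.string as a plain str, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

print(visible_text(soup.b))   # the <b> tag holds only a comment
print(visible_text(soup.p))   # the <p> tag holds ordinary text
```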
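Basic HTML structure
Traversing down the tag tree
Attribute | Description |
---|---|
.contents | A list of child nodes; all of the tag's children are stored in a list |
.children | An iterator over child nodes; similar to .contents, used to loop over the children |
.descendants | An iterator over descendant nodes; includes all descendants, used for loop traversal |
The BeautifulSoup object is the root node of the tag tree.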
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.head)
print(soup.head.contents)
print(soup.body.contents)
print(len(soup.body.contents))
print(soup.body.contents[1])
<head><title>This is a python demo page</title></head>
[<title>This is a python demo page</title>]
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
5
<p class="title"><b>The demo python introduces several python courses.</b></p>
for child in soup.body.children:
    print(child)  # iterate over the child nodes
for child in soup.body.descendants:
    print(child)  # iterate over all descendant nodes
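Child nodes include the '\n' strings between tags, which is why `soup.body.contents` has 5 entries even though `<body>` holds only two `<p>` tags. A common way to keep only the tags is an `isinstance` check against `Tag`, sketched below on a shortened inline copy of the demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened inline copy of the demo page (assumed content), so no request is needed.
html = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python courses: <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Keep only Tag nodes; plain strings such as '\n' are skipped.
child_tags = [child.name for child in soup.body.children if isinstance(child, Tag)]
descendant_tags = [node.name for node in soup.body.descendants if isinstance(node, Tag)]

print(len(soup.body.contents))
print(child_tags)
print(descendant_tags)
```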
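Traversing up the tag tree
Attribute | Description |
---|---|
.parent | The node's parent tag |
.parents | An iterator over the node's ancestor tags, used to loop over the ancestors |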
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.title.parent)
print(soup.html.parent)
<head><title>This is a python demo page</title></head>
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
p
body
html
[document]
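The last name printed, `[document]`, belongs to the BeautifulSoup object itself, which sits above `<html>` as the root of the tree. As a sketch, the ancestor names can be joined into a path from the root down to a tag. The inline HTML here is a reduced stand-in for the demo page.

```python
from bs4 import BeautifulSoup

# Reduced stand-in for the demo page (assumed content).
html = "<html><body><p><a id='link1'>Basic Python</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .parents walks upward: nearest ancestor first, document root last.
ancestor_names = [parent.name for parent in soup.a.parents]

# Reverse the list to read the path from the root down to the tag.
path = " > ".join(list(reversed(ancestor_names)) + [soup.a.name])

print(ancestor_names)
print(path)
print(soup.parent)   # the root has no parent
```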
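Traversing sideways across siblings
Attribute | Description |
---|---|
.next_sibling | Returns the next sibling node in HTML text order |
.previous_sibling | Returns the previous sibling node in HTML text order |
.next_siblings | An iterator over all following sibling nodes in HTML text order |
.previous_siblings | An iterator over all preceding sibling nodes in HTML text order |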
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.a.next_sibling)
print(soup.a.next_sibling.next_sibling)
print(soup.a.previous_sibling)
print(soup.a.previous_sibling.previous_sibling)
print(soup.a.parent)
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
None
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
for sibling in soup.a.next_siblings:
    print(sibling)  # iterate over the following siblings
for sibling in soup.a.previous_siblings:
    print(sibling)  # iterate over the preceding siblings
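As the output above shows, a sibling is not always a tag: the node right after the first `<a>` is the plain string " and ". Sibling traversal also stays within one parent. The sketch below lists the siblings of the first link and then keeps only the tags, using a shortened inline paragraph rather than the full demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened inline paragraph (assumed content).
html = (
    '<p class="course">Courses: '
    '<a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

following = list(soup.a.next_siblings)       # strings and tags after the first <a>
preceding = list(soup.a.previous_siblings)   # strings and tags before it
following_tags = [node for node in following if isinstance(node, Tag)]

print(following)
print(preceding)
print([tag["id"] for tag in following_tags])
```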
The bs4 library's prettify() method
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
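.prettify() adds a '\n' after each tag <> and its content in the HTML text, which makes the output easier to read.
.prettify() can also be called on a single tag. Usage: <tag>.prettify()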
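A minimal sketch of calling `prettify()` on a single tag rather than on the whole document, using the first paragraph of the demo page as inline input:

```python
from bs4 import BeautifulSoup

html = '<p class="title"><b>The demo python introduces several python courses.</b></p>'
soup = BeautifulSoup(html, "html.parser")

# Format only the <p> tag; each tag and each string gets its own line.
pretty = soup.p.prettify()
print(pretty)

# Split into lines to inspect the structure programmatically.
lines = [line.strip() for line in pretty.strip().splitlines()]
print(lines)
```

The result is a regular string, so it can be written to a file or logged like any other text.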
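Encoding in the bs4 library
The bs4 library converts any HTML input to Unicode when parsing and writes its output as UTF-8 by default.
Python 3.x uses UTF-8 as its default encoding, so parsing non-ASCII text works without trouble.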
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>中文</p>","html.parser")
print(soup.p.string)
print(soup.p.prettify())
中文
<p>
 中文
</p>
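The example above passes a Python string, which is already Unicode. Encoding matters more when the input arrives as raw bytes. The sketch below assumes the bytes are known to be GBK-encoded and passes that through the `from_encoding` argument; the GBK input is an illustration and does not come from the original tutorial.

```python
from bs4 import BeautifulSoup

# Raw bytes in a non-UTF-8 encoding, as a server might send them (assumed input).
raw = "<p>中文</p>".encode("gbk")

# from_encoding tells bs4 how to decode the bytes; this assumes the encoding is known.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string                # a Unicode string after parsing
utf8_bytes = soup.encode("utf-8")   # serialize the tree back to UTF-8 bytes

print(text)
print(utf8_bytes)
```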
Source: https://blog.csdn.net/weixin_46530492/article/details/119960182