Python crawler: automatically collect your wrong answers from the Huatu and Fenbi websites (recommended)
Author: KelvinChunggg  Published: 2023-08-14 02:05:42
This post is for civil-service exam candidates and anyone else who practices on Huatu or Fenbi: paste in a practice URL and the script automatically collects the questions you answered incorrectly.
The Fenbi website
We collect the wrong answers from an exercise set you have already completed. Open one of your finished sets and capture the network traffic to see where the data lives.
The data is not in the page itself; it actually sits in these two responses:
https://tiku.fenbi.com/api/xingce/questions
https://tiku.fenbi.com/api/xingce/solutions
The first returns the questions, the second the explanations. Both need an ids parameter identifying the question set before they return anything, and that parameter comes from another request whose URL is built from the second-to-last number string in the page address. The URL pattern is
'https://tiku.fenbi.com/api/xingce/exercises/'+str(id_)+'?app=web&kav=12&version=3.0.0.0'
where id_ is that number. Request this endpoint first to get the question ids, then pass them to the two URLs above to fetch the question data. Your own answers are also included in this exercises response.
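Before the full script, here is a minimal sketch of the id extraction and URL construction just described. The practice URL below is a made-up example and no request is actually sent:

```python
import re

# Made-up example of a practice URL copied from the browser address bar
sample_url = 'https://www.fenbi.com/spa/tiku/guide/xingce/xingce/987654321/1'

# The exercise id is the number in the second-to-last path segment
id_ = re.findall(r'https://www.fenbi.com/spa/tiku.*?/xingce/xingce/(.*?)/', sample_url, re.S)[0]

# This endpoint returns the question ids and your own answers for the exercise
mid_url = ('https://tiku.fenbi.com/api/xingce/exercises/' + str(id_)
           + '?app=web&kav=12&version=3.0.0.0')
print(mid_url)
```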
Some Fenbi questions contain images, in both the stem and the options. Assembling that into a Word document with the docx library is awkward, so instead I build HTML directly and convert it to a PDF with pdfkit (you will need to download wkhtmltopdf.exe; search online for installation instructions). The resulting PDF is a wrong-answer booklet you can read on a tablet or any other device.
(Always send complete headers with every request, otherwise you will very likely get no data back.)
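What "complete headers" means in practice is, at minimum, a browser User-Agent plus the Cookie from your logged-in session, both copied out of the captured request. The values below are placeholders, not working credentials:

```python
# Placeholder values -- replace each one with the value shown in your own
# captured request; the Cookie is what carries your Fenbi login session
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36',
    'Cookie': 'sess=<your-session-cookie>',
    'Referer': 'https://www.fenbi.com/',
}
```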
The details are in the annotated code below.
### Clean up question stems and explanations: drop the LaTeX flag attribute
### and the stray backslashes left over in the JSON strings
def jiexi(liebiao):
    new = []
    timu_last = []
    for each in liebiao:
        new.append(re.sub(r'flag=\\"tex\\" ', '', each))
    for each in new:
        timu_last.append(re.sub(r'\\', '', each))
    return timu_last

### Clean up the answer options: strip the paragraph tags,
### then re-append a closing </p> to each option
def xuanxiang(liebiao):
    xuanxiang_v2 = []
    xuanxiang_v3 = []
    for each in liebiao:
        a = re.sub('<p>', '', each)
        a = re.sub('</p>', '', a)
        xuanxiang_v2.append(a)
    for each in xuanxiang_v2:
        xuanxiang_v3.append(each + '</p>')
    return xuanxiang_v3
import requests
import re
import pdfkit
import os

url = str(input("Enter the practice URL: "))
### Extract this exercise's id from the page URL
id_ = re.findall(r'https://www.fenbi.com/spa/tiku.*?/xingce/xingce/(.*?)/', url, re.S)[0]
mid_url = 'https://tiku.fenbi.com/api/xingce/exercises/' + str(id_) + '?app=web&kav=12&version=3.0.0.0'
headers = {
    ##### Fill in your complete request headers here
}
response = requests.get(url=mid_url, headers=headers)
response.encoding = 'utf-8'
page_text = response.text
### The ids of the questions in this exercise
id_list = re.findall(r'"questionIds":\[(.*?)\],', page_text, re.S)
### Your own answers
your_answer = re.findall(r'"answer":{"choice":"(.*?)",', page_text, re.S)
### The name of this exercise
name = re.findall(r'"name":"(.*?)",', page_text, re.S)[0]
### The endpoint that actually holds the question data
timu_url = 'https://tiku.fenbi.com/api/xingce/questions'
params = {
    'ids': id_list
}
response = requests.get(url=timu_url, headers=headers, params=params)
response.encoding = 'utf-8'
page_text = response.text
### The correct answers
true_answer = re.findall(r'"correctAnswer":{"choice":"(.*?)"', page_text, re.S)
### The endpoint that holds the explanations
solution_url = 'https://tiku.fenbi.com/api/xingce/solutions'
response = requests.get(url=solution_url, headers=headers, params=params)
response.encoding = 'utf-8'
page_text = response.text
### The explanations
solution_list = re.findall(r'"solution":"(.*?)","userAnswer"', page_text, re.S)
solution_last = jiexi(solution_list)
cailiao = []
timu = []
### Collect each question stem, plus the shared material for compound questions
for each in response.json():
    timu.append(each['content'])
    try:
        cailiao.append(each['material']['content'])
    except (KeyError, TypeError):
        cailiao.append('none')
### Pull out the four options
A_option = re.findall(r'"options":\["(.*?)",".*?",".*?",".*?"\]', page_text, re.S)
B_option = re.findall(r'"options":\[".*?","(.*?)",".*?",".*?"\]', page_text, re.S)
C_option = re.findall(r'"options":\[".*?",".*?","(.*?)",".*?"\]', page_text, re.S)
D_option = re.findall(r'"options":\[".*?",".*?",".*?","(.*?)"\]', page_text, re.S)
A_option = jiexi(xuanxiang(A_option))
B_option = jiexi(xuanxiang(B_option))
C_option = jiexi(xuanxiang(C_option))
D_option = jiexi(xuanxiang(D_option))
### Build the HTML document
count = 0
all_content = "<!DOCTYPE html>\n<meta charset='utf-8'>\n<html>"
for each in true_answer:
    if each != your_answer[count]:
        ### For compound questions, add the shared material only once
        if cailiao[count] != 'none' and cailiao[count] not in all_content:
            all_content += cailiao[count]
        all_content += str(count + 1)
        all_content += '、'
        all_content += timu[count][3:]
        all_content += 'A、'
        all_content += A_option[count]
        all_content += 'B、'
        all_content += B_option[count]
        all_content += 'C、'
        all_content += C_option[count]
        all_content += 'D、'
        all_content += D_option[count]
        all_content += '<br>'
    count += 1
count = 0
all_content += '<br><br><br><br><br><br><br><br><br>'
for each in true_answer:
    if each != your_answer[count]:
        all_content += 'The correct answer to question ' + str(count + 1) + ' is '
        if true_answer[count] == '0':
            all_content += 'A'
        elif true_answer[count] == '1':
            all_content += 'B'
        elif true_answer[count] == '2':
            all_content += 'C'
        elif true_answer[count] == '3':
            all_content += 'D'
        all_content += solution_last[count]
        all_content += '<br>'
    count += 1
all_content += '</html>'
path_name = name + '.html'
### Save the HTML file
with open(path_name, 'w', encoding='utf-8') as fp:
    fp.write(all_content)
### Convert the HTML to PDF (point wkhtmltopdf at your local install)
confg = pdfkit.configuration(wkhtmltopdf=r'path to your wkhtmltopdf.exe')
pdfkit.from_url(path_name, name + '.pdf', configuration=confg)
print('Wrong-answer PDF saved')
### Delete the intermediate HTML file
os.remove(path_name)
The Huatu website
Again we start from a set of questions you have already completed in your answer history. Huatu is slightly different: the data is visible directly in the captured page response, so one request to that page gets everything, and the rest is parsing. This time I store the result in a Word document; if you find that inconvenient, you can build HTML exactly as above.
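The parsing below leans on one idiom throughout: pull a fragment out of the page with re.findall, then strip any residual tags with re.sub(r'<[^>]+>', '', ...). A self-contained illustration on a made-up fragment shaped like Huatu's answer markup:

```python
import re

# Made-up page fragment in the shape the script expects
reptext = '<span class="g-right-answer-color">B</span><span class="yellowWord">C</span>'

# Grab the correct-answer span, then strip the remaining HTML tags
corrected = re.findall(r'<span class="g-right-answer-color">[a-zA-Z]{1,4}</span>', reptext)
correct = [re.sub(r'<[^>]+>', '', ee) for ee in corrected]
print(correct)  # -> ['B']
```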
## Imports
import requests
import lxml.etree
import re
import os
from docx import Document
from docx.shared import Inches, Pt
from docx.oxml.ns import qn
from docx.enum.text import WD_ALIGN_PARAGRAPH

url = str(input("Enter the practice URL: "))
headers = {
    ### Fill in your complete headers, or the request returns no data
}
response = requests.get(url=url, headers=headers)
response.encoding = 'utf-8'
reptext = response.text
tree = lxml.etree.HTML(reptext)  # parse the page source
dirName = "考公图片"
if not os.path.exists(dirName):
    os.mkdir(dirName)  # folder for downloaded question images
jiexi = re.findall(r'<div class="jiexi-item-title">解析.*?。</div>.*?</div>', reptext, re.S)  # explanation blocks
imgg = []
for each in jiexi:
    imgg.append(re.findall(r'<img src="(.*?)".*?>', each))  # image URLs inside each explanation
imgt = []
for each in imgg:
    if each == []:
        imgt.append([1])  # placeholder for explanations without images
    else:
        imgt.append(each)
jiexilast = []
for qq in jiexi:
    jiexilast.append(re.sub(r'<[^>]+>', '', qq))  # strip tags from the explanations
corrected = re.findall(r'<span class="g-right-answer-color">[a-zA-Z]{1,4}</span>', reptext)  # correct answers
correct = []
for ee in corrected:
    correct.append(re.sub(r'<[^>]+>', '', ee))
yoursed = re.findall(r'<span class="yellowWord">[a-zA-Z]{1,4}</span>', reptext)  # your own answers
yours = []
for ee in yoursed:
    yours.append(re.sub(r'<[^>]+>', '', ee))
timuleixing = re.findall(r'<span class="greenWord">(.*?)</span>.*?</div>', reptext, re.S)  # question types
find1 = re.findall(r'<span class="greenWord">.*?</span>(.*?)</div>', reptext, re.S)
find5 = []  # final question stems, tags stripped
for each in find1:
    find5.append(re.sub(r'<[^>]+>', '', each))
img = []
for each in find1:
    img.append(re.findall(r'<img src="(.*?)".*?>', each))
imgx = []
for each in img:
    if each == []:
        imgx.append([1])  # placeholder for stems without images
    else:
        imgx.append(each)  # final stem image URLs
v = tree.xpath('//div[@class="exercise-main-title"]//text()')  # the exercise title
try:
    ### Case: the page mixes compound and single questions
    fuheti = re.findall(r'<!--复合题-->(.*?)<div class="exercise-main-topics"', reptext, re.S)[0].split('<!--复合题-->')
except IndexError:
    try:
        ### Case: only compound questions, or the compound ones come last
        fuheti = re.findall(r'<!--复合题-->(.*?)<!-- 纠错的弹窗 -->', reptext, re.S)[0].split('<!--复合题-->')
    except IndexError:
        pass
count = 0
### Write the document title
document = Document()
p = document.add_paragraph()
p.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
run = p.add_run(v[0][5:-3])
run.font.size = Pt(14)
run.font.name = u'宋体'
r = run._element
r.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')
choose = []
### Parse the answer options
axuanxiang = []
bxuanxiang = []
cxuanxiang = []
dxuanxiang = []
xuanxiang = re.findall(r'<div class="main-topic-choices">(.*?)<div class="main-topic-letters clearfix pl14">', reptext, re.S)
for everything in xuanxiang:
    try:  ## some questions have fewer than four options
        axuanxiang.append(re.sub("<.*?>", "", re.findall(r'<div.*?class.*?main-topic-choice.*?>(A.*?)</div>', everything, re.S)[0]))
    except IndexError:
        axuanxiang.append('--')
    try:
        bxuanxiang.append(re.sub("<.*?>", "", re.findall(r'<div.*?class.*?main-topic-choice.*?>(B.*?)</div>', everything, re.S)[0]))
    except IndexError:
        bxuanxiang.append('--')
    try:
        cxuanxiang.append(re.sub("<.*?>", "", re.findall(r'<div.*?class.*?main-topic-choice.*?>(C.*?)</div>', everything, re.S)[0]))
    except IndexError:
        cxuanxiang.append('--')
    try:
        dxuanxiang.append(re.sub("<.*?>", "", re.findall(r'<div.*?class.*?main-topic-choice.*?>(D.*?)</div>', everything, re.S)[0]))
    except IndexError:
        dxuanxiang.append('--')
for every in correct:
    if every != yours[count]:
        ### Compound questions: write the shared material first, and only once
        try:
            for eacy in fuheti:
                if find5[count] in eacy:
                    fuheti_URL = re.findall(r'<img src="(.*?)".*?>', re.findall(r'.*?<p>(.*?)</p>', eacy, re.S)[0], re.S)
                    fuheti_last = re.sub(r'<.*?>', '', re.findall(r'.*?<p>(.*?)</p>', eacy, re.S)[0])
                    fuheti_last = re.sub(r'\xa0\xa0\xa0\xa0\xa0\xa0\xa0', '\n', fuheti_last)
                    if fuheti_last not in choose:
                        p = document.add_paragraph()
                        run = p.add_run(fuheti_last)
                        run.font.size = Pt(14)
                        run.font.name = u'宋体'
                        r = run._element
                        r.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')
                        headers = {
                            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'
                        }
                        for eacu in fuheti_URL:
                            img_data = requests.get(url=eacu, headers=headers).content
                            img_path = dirName + '/' + 'tupian' + '.jpg'
                            with open(img_path, 'wb') as fp:
                                fp.write(img_data)
                                print("image saved")
                            document.add_picture(img_path, width=Inches(5))
                        choose.append(fuheti_last)
        except:
            pass  # no compound questions on this page
        ### Write the question stem
        p = document.add_paragraph()
        run = p.add_run(str(count + 1) + "、" + timuleixing[count] + find5[count][3:])
        run.font.size = Pt(14)
        run.font.name = u'宋体'
        r = run._element
        r.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')
        url = imgx[count][0]
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'
        }
        try:
            img_data = requests.get(url=url, headers=headers).content
            img_path = dirName + '/' + 'tupian' + '.jpg'
            with open(img_path, 'wb') as fp:
                fp.write(img_data)
                print("image saved")
            document.add_picture(img_path, width=Inches(5))
            count += 1
        except:
            count += 1
        ### Write the options
        p = document.add_paragraph()
        run = p.add_run(axuanxiang[count - 1])
        run.font.size = Pt(14)
        run.font.name = u'宋体'
        r = run._element
        r.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')
        p = document.add_paragraph()
        run = p.add_run(bxuanxiang[count - 1])
        run.font.size = Pt(14)
        run.font.name = u'宋体'
        r = run._element
        r.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')
        p = document.add_paragraph()
        run = p.add_run(cxuanxiang[count - 1])
        run.font.size = Pt(14)
        run.font.name = u'宋体'
        r = run._element
        r.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')
        p = document.add_paragraph()
        run = p.add_run(dxuanxiang[count - 1])
        run.font.size = Pt(14)
        run.font.name = u'宋体'
        r = run._element
        r.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')
        p = document.add_paragraph()
        run = p.add_run("\n")
        run.font.size = Pt(14)
        run.font.name = u'宋体'
        r = run._element
        r.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')
    else:
        count += 1
### Spacing before the answer section
p = document.add_paragraph()
run = p.add_run("\n\n\n\n\n")
run.font.size = Pt(14)
run.font.name = u'宋体'
r = run._element
r.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')
### Tidy the explanations
counting = 0
jiexilast2 = []
for ok in jiexilast:
    jiexilast2.append(re.sub(r'\n\t\t', ':', ok))
for every in correct:
    if every != yours[counting]:
        ### Write the answer and its explanation
        p = document.add_paragraph()
        run = p.add_run(str(counting + 1) + "、" + "Correct answer: " + correct[counting] + "\n" + jiexilast2[counting])
        run.font.size = Pt(14)
        run.font.name = u'宋体'
        r = run._element
        r.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')
        url = imgt[counting][0]
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'
        }
        try:
            img_data = requests.get(url=url, headers=headers).content
            img_path = dirName + '/' + 'tupian' + '.jpg'
            with open(img_path, 'wb') as fp:
                fp.write(img_data)
                print("image saved")
            document.add_picture(img_path, width=Inches(5))
            print("image added to document")
            counting += 1
        except:
            counting += 1
    else:
        counting += 1
### Save the document
document.save(v[0][5:-3] + '.docx')
print(v[0][5:-3] + ' saved!')
Source: https://blog.csdn.net/KelvinChunggg/article/details/112313749