Python搭建Keras CNN模型破解网站验证码的实现
作者:不脱发的程序猿 发布时间:2023-04-16 15:51:08
在本项目中,将会用Keras来搭建一个稍微复杂的CNN模型来破解以上的验证码。验证码如下:
利用Keras可以快速方便地搭建CNN模型,本项目搭建的CNN模型如下:
将数据集分为训练集和测试集,占比为8:2,该模型训练的代码如下:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from keras.utils import np_utils, plot_model
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.callbacks import EarlyStopping
from keras.layers import Conv2D, MaxPooling2D
# 读取数据
df = pd.read_csv('./data.csv')
# 标签值
vals = range(31)
keys = ['1','2','3','4','5','6','7','8','9','A','B','C','D','E','F','G','H','J','K','L','N','P','Q','R','S','T','U','V','X','Y','Z']
label_dict = dict(zip(keys, vals))
x_data = df[['v'+str(i+1) for i in range(320)]]
y_data = pd.DataFrame({'label':df['label']})
y_data['class'] = y_data['label'].apply(lambda x: label_dict[x])
# 将数据分为训练集和测试集
X_train, X_test, Y_train, Y_test = train_test_split(x_data, y_data['class'], test_size=0.3, random_state=42)
x_train = np.array(X_train).reshape((1167, 20, 16, 1))
x_test = np.array(X_test).reshape((501, 20, 16, 1))
# 对标签值进行one-hot encoding
n_classes = 31
y_train = np_utils.to_categorical(Y_train, n_classes)
y_val = np_utils.to_categorical(Y_test, n_classes)
input_shape = x_train[0].shape
# CNN模型
model = Sequential()
# 卷积层和池化层
model.add(Conv2D(32, kernel_size=(3, 3), input_shape=input_shape, padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(32, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
# Dropout层
model.add(Dropout(0.25))
model.add(Conv2D(64, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Dropout(0.25))
model.add(Conv2D(128, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(128, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Dropout(0.25))
model.add(Flatten())
# 全连接层
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dense(n_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# plot model
##plot_model(model, to_file=r'./model.png', show_shapes=True)
# 模型训练
callbacks = [EarlyStopping(monitor='val_acc', patience=5, verbose=1)]
batch_size = 64
n_epochs = 100
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epochs, \
verbose=1, validation_data=(x_test, y_val), callbacks=callbacks)
mp = './verifycode_Keras.h5'
model.save(mp)
# 绘制验证集上的准确率曲线
val_acc = history.history['val_acc']
plt.plot(range(len(val_acc)), val_acc, label='CNN model')
plt.title('Validation accuracy on verifycode dataset')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.legend()
plt.show()
在上述代码中,训练模型的时候采用了early stopping技巧。early stopping是用于提前停止训练的callbacks。具体地,可以达到当训练集上的loss不在减小(即减小的程度小于某个阈值)的时候停止继续训练。
运行上述模型训练代码,输出的结果如下:
......(忽略之前的输出)
Epoch 22/100
64/1167 [>.............................] - ETA: 3s - loss: 0.0399 - acc: 1.0000
128/1167 [==>...........................] - ETA: 3s - loss: 0.1195 - acc: 0.9844
192/1167 [===>..........................] - ETA: 2s - loss: 0.1085 - acc: 0.9792
256/1167 [=====>........................] - ETA: 2s - loss: 0.1132 - acc: 0.9727
320/1167 [=======>......................] - ETA: 2s - loss: 0.1045 - acc: 0.9750
384/1167 [========>.....................] - ETA: 2s - loss: 0.1006 - acc: 0.9740
448/1167 [==========>...................] - ETA: 2s - loss: 0.1522 - acc: 0.9643
512/1167 [============>.................] - ETA: 1s - loss: 0.1450 - acc: 0.9648
576/1167 [=============>................] - ETA: 1s - loss: 0.1368 - acc: 0.9653
640/1167 [===============>..............] - ETA: 1s - loss: 0.1353 - acc: 0.9641
704/1167 [=================>............] - ETA: 1s - loss: 0.1280 - acc: 0.9659
768/1167 [==================>...........] - ETA: 1s - loss: 0.1243 - acc: 0.9674
832/1167 [====================>.........] - ETA: 0s - loss: 0.1577 - acc: 0.9639
896/1167 [======================>.......] - ETA: 0s - loss: 0.1488 - acc: 0.9665
960/1167 [=======================>......] - ETA: 0s - loss: 0.1488 - acc: 0.9656
1024/1167 [=========================>....] - ETA: 0s - loss: 0.1427 - acc: 0.9668
1088/1167 [==========================>...] - ETA: 0s - loss: 0.1435 - acc: 0.9669
1152/1167 [============================>.] - ETA: 0s - loss: 0.1383 - acc: 0.9688
1167/1167 [==============================] - 4s 3ms/step - loss: 0.1380 - acc: 0.9683 - val_loss: 0.0835 - val_acc: 0.9760
Epoch 00022: early stopping
可以看到,花费几分钟,一共训练了21次,最近一次的训练后,在测试集上的准确率为96.83%。在测试集的准确率曲线如下图:
模型训练完后,我们对新的验证码进行预测。新的100张验证码如下图:
使用训练好的CNN模型,对这些新的验证码进行预测,预测的Python代码如下:
# -*- coding: utf-8 -*-
import os
import cv2
import numpy as np
def split_picture(imagepath):
# 以灰度模式读取图片
gray = cv2.imread(imagepath, 0)
# 将图片的边缘变为白色
height, width = gray.shape
for i in range(width):
gray[0, i] = 255
gray[height-1, i] = 255
for j in range(height):
gray[j, 0] = 255
gray[j, width-1] = 255
# 中值滤波
blur = cv2.medianBlur(gray, 3) #模板大小3*3
# 二值化
ret,thresh1 = cv2.threshold(blur, 200, 255, cv2.THRESH_BINARY)
# 提取单个字符
chars_list = []
image, contours, hierarchy = cv2.findContours(thresh1, 2, 2)
for cnt in contours:
# 最小的外接矩形
x, y, w, h = cv2.boundingRect(cnt)
if x != 0 and y != 0 and w*h >= 100:
chars_list.append((x,y,w,h))
sorted_chars_list = sorted(chars_list, key=lambda x:x[0])
for i,item in enumerate(sorted_chars_list):
x, y, w, h = item
cv2.imwrite('test_verifycode/%d.jpg'%(i+1), thresh1[y:y+h, x:x+w])
def remove_edge_picture(imagepath):
image = cv2.imread(imagepath, 0)
height, width = image.shape
corner_list = [image[0,0] < 127,
image[height-1, 0] < 127,
image[0, width-1]<127,
image[ height-1, width-1] < 127
]
if sum(corner_list) >= 3:
os.remove(imagepath)
def resplit_with_parts(imagepath, parts):
image = cv2.imread(imagepath, 0)
os.remove(imagepath)
height, width = image.shape
file_name = imagepath.split('/')[-1].split(r'.')[0]
# 将图片重新分裂成parts部分
step = width//parts # 步长
start = 0 # 起始位置
for i in range(parts):
cv2.imwrite('./test_verifycode/%s.jpg'%(file_name+'-'+str(i)), \
image[:, start:start+step])
start += step
def resplit(imagepath):
image = cv2.imread(imagepath, 0)
height, width = image.shape
if width >= 64:
resplit_with_parts(imagepath, 4)
elif width >= 48:
resplit_with_parts(imagepath, 3)
elif width >= 26:
resplit_with_parts(imagepath, 2)
# rename and convert to 16*20 size
def convert(dir, file):
imagepath = dir+'/'+file
# 读取图片
image = cv2.imread(imagepath, 0)
# 二值化
ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
img = cv2.resize(thresh, (16, 20), interpolation=cv2.INTER_AREA)
# 保存图片
cv2.imwrite('%s/%s' % (dir, file), img)
# 读取图片的数据,并转化为0-1值
def Read_Data(dir, file):
imagepath = dir+'/'+file
# 读取图片
image = cv2.imread(imagepath, 0)
# 二值化
ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
# 显示图片
bin_values = [1 if pixel==255 else 0 for pixel in thresh.ravel()]
return bin_values
def predict(VerifyCodePath):
dir = './test_verifycode'
files = os.listdir(dir)
# 清空原有的文件
if files:
for file in files:
os.remove(dir + '/' + file)
split_picture(VerifyCodePath)
files = os.listdir(dir)
if not files:
print('查看的文件夹为空!')
else:
# 去除噪声图片
for file in files:
remove_edge_picture(dir + '/' + file)
# 对黏连图片进行重分割
for file in os.listdir(dir):
resplit(dir + '/' + file)
# 将图片统一调整至16*20大小
for file in os.listdir(dir):
convert(dir, file)
# 图片中的字符代表的向量
files = sorted(os.listdir(dir), key=lambda x: x[0])
table = np.array([Read_Data(dir, file) for file in files]).reshape(-1,20,16,1)
# 模型保存地址
mp = './verifycode_Keras.h5'
# 载入模型
from keras.models import load_model
cnn = load_model(mp)
# 模型预测
y_pred = cnn.predict(table)
predictions = np.argmax(y_pred, axis=1)
# 标签字典
keys = range(31)
vals = ['1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'L', 'N',
'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Y', 'Z']
label_dict = dict(zip(keys, vals))
return ''.join([label_dict[pred] for pred in predictions])
def main():
dir = './VerifyCode/'
correct = 0
for i, file in enumerate(os.listdir(dir)):
true_label = file.split('.')[0]
VerifyCodePath = dir+file
pred = predict(VerifyCodePath)
if true_label == pred:
correct += 1
print(i+1, (true_label, pred), true_label == pred, correct)
total = len(os.listdir(dir))
print('\n总共图片:%d张\n识别正确:%d张\n识别准确率:%.2f%%.'\
%(total, correct, correct*100/total))
main()
以下是该CNN模型的预测结果:
Using TensorFlow backend.
2018-10-25 15:13:50.390130: I C: f_jenkinsworkspace
el-winMwindowsPY35 ensorflowcoreplatformcpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
1 ('ZK6N', 'ZK6N') True 1
2 ('4JPX', '4JPX') True 2
3 ('5GP5', '5GP5') True 3
4 ('5RQ8', '5RQ8') True 4
5 ('5TQP', '5TQP') True 5
6 ('7S62', '7S62') True 6
7 ('8R2Z', '8R2Z') True 7
8 ('8RFV', '8RFV') True 8
9 ('9BBT', '9BBT') True 9
10 ('9LNE', '9LNE') True 10
11 ('67UH', '67UH') True 11
12 ('74UK', '74UK') True 12
13 ('A5T2', 'A5T2') True 13
14 ('AHYV', 'AHYV') True 14
15 ('ASEY', 'ASEY') True 15
16 ('B371', 'B371') True 16
17 ('CCQL', 'CCQL') True 17
18 ('CFD5', 'GFD5') False 17
19 ('CJLJ', 'CJLJ') True 18
20 ('D4QV', 'D4QV') True 19
21 ('DFQ8', 'DFQ8') True 20
22 ('DP18', 'DP18') True 21
23 ('E3HC', 'E3HC') True 22
24 ('E8VB', 'E8VB') True 23
25 ('DE1U', 'DE1U') True 24
26 ('FK1R', 'FK1R') True 25
27 ('FK91', 'FK91') True 26
28 ('FSKP', 'FSKP') True 27
29 ('FVZP', 'FVZP') True 28
30 ('GC6H', 'GC6H') True 29
31 ('GH62', 'GH62') True 30
32 ('H9FQ', 'H9FQ') True 31
33 ('H67Q', 'H67Q') True 32
34 ('HEKC', 'HEKC') True 33
35 ('HV2B', 'HV2B') True 34
36 ('J65Z', 'J65Z') True 35
37 ('JZCX', 'JZCX') True 36
38 ('KH5D', 'KH5D') True 37
39 ('KXD2', 'KXD2') True 38
40 ('1GDH', '1GDH') True 39
41 ('LCL3', 'LCL3') True 40
42 ('LNZR', 'LNZR') True 41
43 ('LZU5', 'LZU5') True 42
44 ('N5AK', 'N5AK') True 43
45 ('N5Q3', 'N5Q3') True 44
46 ('N96Z', 'N96Z') True 45
47 ('NCDG', 'NCDG') True 46
48 ('NELS', 'NELS') True 47
49 ('P96U', 'P96U') True 48
50 ('PD42', 'PD42') True 49
51 ('PECG', 'PEQG') False 49
52 ('PPZF', 'PPZF') True 50
53 ('PUUL', 'PUUL') True 51
54 ('Q2DN', 'D2DN') False 51
55 ('QCQ9', 'QCQ9') True 52
56 ('QDB1', 'QDBJ') False 52
57 ('QZUD', 'QZUD') True 53
58 ('R3T5', 'R3T5') True 54
59 ('S1YT', 'S1YT') True 55
60 ('SP7L', 'SP7L') True 56
61 ('SR2K', 'SR2K') True 57
62 ('SUP5', 'SVP5') False 57
63 ('T2SP', 'T2SP') True 58
64 ('U6V9', 'U6V9') True 59
65 ('UC9P', 'UC9P') True 60
66 ('UFYD', 'UFYD') True 61
67 ('V9NJ', 'V9NH') False 61
68 ('V35X', 'V35X') True 62
69 ('V98F', 'V98F') True 63
70 ('VD28', 'VD28') True 64
71 ('YGHE', 'YGHE') True 65
72 ('YNKD', 'YNKD') True 66
73 ('YVXV', 'YVXV') True 67
74 ('ZFBS', 'ZFBS') True 68
75 ('ET6X', 'ET6X') True 69
76 ('TKVC', 'TKVC') True 70
77 ('2UCU', '2UCU') True 71
78 ('HNBK', 'HNBK') True 72
79 ('X8FD', 'X8FD') True 73
80 ('ZGNX', 'ZGNX') True 74
81 ('LQCU', 'LQCU') True 75
82 ('JNZY', 'JNZVY') False 75
83 ('RX34', 'RX34') True 76
84 ('811E', '811E') True 77
85 ('ETDX', 'ETDX') True 78
86 ('4CPR', '4CPR') True 79
87 ('FE91', 'FE91') True 80
88 ('B7XH', 'B7XH') True 81
89 ('1RUA', '1RUA') True 82
90 ('UBCX', 'UBCX') True 83
91 ('KVT5', 'KVT5') True 84
92 ('HZ3A', 'HZ3A') True 85
93 ('3XLR', '3XLR') True 86
94 ('VC7T', 'VC7T') True 87
95 ('7PG1', '7PQ1') False 87
96 ('4F21', '4F21') True 88
97 ('3HLJ', '3HLJ') True 89
98 ('1KT7', '1KT7') True 90
99 ('1RHE', '1RHE') True 91
100 ('1TTA', '1TTA') True 92
总共图片:100张
识别正确:92张
识别准确率:92.00%.
可以看到,该训练后的CNN模型,其预测新验证的准确率在90%以上。
Demo及数据集下载网站:CNN_4_Verifycode_jb51.rar
来源:https://blog.csdn.net/m0_38106923/article/details/86703037


猜你喜欢
- 关于数据库的逻辑设计,是一个很广泛的问题。本文主要针对开发应用中遇到在MS SQL Server上进行表设计时,对表的主键设计应注意的问题以
- 涉及到的函数为import matplotlib.pyplot as pltfrom skimage import measure, col
- 为何选Nuxt.js?在前后端分离出现之前,传统的web页面都是服务端渲染的,如JSP、PHP、Python Django,还有各种模板技术
- 问题详情:偶然在根目录创建文件夹的时候,突然显示错误,当时很惊讶,以前没遇见过这样的问题。当时界面是这样的。 用了一个 cd / 命令从用户
- 方式一: os.fork()# -*- coding:utf-8 -*-"""pid=os.fork() &n
- /*不同服务器数据库之间的数据操作*/ --创建链接服务器 exec sp_addlinkedserver 'ITSV ',
- Vue-validator 是Vue的表单验证插件,供大家参考,具体内容如下Vue版本: 1.0.24 Vue-validator版本: 2
- Python对不可变序列进行重复拼接操作效率会很低,因为每次都会生成一个新的对象,解释器需要把原来对象中的元素先复制到新的对象里,然后再追加
- 抛出一个问题,其实所谓的熟悉原生JS,怎样的程度才是熟悉呢?最近都在做原生JS熟悉的练习。。。用原生Js封装了一个Ajax插件,引入一般的项
- 注意:由于博文已经很多这方面的内容,这里不介绍安装vscode和anaconda的步骤,只讨论如下情况:已经安装了anaconda,也从官网
- 本篇文章将提供有关 Python 中异步 for 循环的完整详细信息。 我们将讨论异步函数、异步 for 循环和睡眠概念。接下来,我们将讨论
- 为了保护系统或数据安全,我们需要最佳随机密码。这里使用unix系统定义的文件设备/dev/random,从中获取随机数生成器的种子。需求说明
- 可变参数顾名思义,函数的可变参数是传入的参数可以变化的,1个,2个到任意个。当然可以将这些 参数封装成一个 list 或者 tuple 传入
- 创建项目和应用django-admin startproject zqxt_views(项目名)cd zqxt_viewspython ma
- 1.JOIN和UNION区别 join 是两张表做交连后里面条件相同的部分记录产生一个记录集, union是产生的两个记录集(字段要一样的)
- 在 MySQL 下,在进行中文模糊检索时,经常会返回一些与之不相关的记录,如查找 "%a%" 时,返回的可能有中文字符,
- 前言本文主要给大家介绍了关于SQL Server中bit类型字段增删查改的一些事,话说BIT类型字段之前,先看“诡异”的一幕,执行Updat
- 使用keras进行训练,默认使用单显卡,即使设置了os.environ['CUDA_VISIBLE_DEVICES']为两张
- 目录1. 字符串的翻转2. 判断字符串是不是回文串3. 单词大小写4. 字符串的拆分5. 字符串的合并6. 将元素进行重复7. 列表的拓展8
- 通过python+splinter,实现在12306网站刷票并自动购票流程(无法自动识别验证码)。此类程序只是提高了12306网站的 <