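Python crawler example: looking up the home location for the first 7 digits of a phone number
Author: wanli001 Published: 2021-01-23 05:20:34
Tags: python, mobile phone, home location, crawler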
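Requirements analysis
The project needs the first 7 digits of a phone number to check whether the number is valid and to look up its home location. The old data is several years out of date, so I decided to crawl a fresh copy with a Python crawler.
Single-threaded version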
# coding:utf-8
import requests
from datetime import datetime


class PhoneInfoSpider:
    def __init__(self, phoneSections):
        self.phoneSections = phoneSections

    def phoneInfoHandler(self, textData):
        # Fields are read by line position; each value sits between single quotes
        text = textData.splitlines(True)
        # print("text length:" + str(len(text)))
        if len(text) >= 9:
            number = text[1].split('\'')[1]
            province = text[2].split('\'')[1]
            mobile_area = text[3].split('\'')[1]
            postcode = text[5].split('\'')[1]
            line_text = number + "," + province + "," + mobile_area + "," + postcode
            print(line_text)
            # print("province:" + province)
            try:
                # "with" closes the file after every append
                with open('./result.txt', 'a', encoding='utf-8') as f:
                    f.write(line_text + '\n')
            except Exception as e:
                print(type(e).__name__, ":", e)

    def requestPhoneInfo(self, phoneNum):
        try:
            url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
            # The timeout keeps one stalled request from blocking the whole crawl
            response = requests.get(url, timeout=10)
            self.phoneInfoHandler(response.text)
        except Exception as e:
            print(type(e).__name__, ":", e)

    def requestAllSections(self):
        # last: the index to resume from after an abnormal exit
        last = 0
        # last = 4
        # Generate phone numbers automatically, padding the last four digits with 0
        for head in self.phoneSections:
            head_begin = datetime.now()
            print(head + " begin time:" + str(head_begin))
            # Use range(last, 10) for a quick test run
            for i in range(last, 10000):
                middle = str(i).zfill(4)
                phoneNum = head + middle + "0000"
                self.requestPhoneInfo(phoneNum)
            # Only the first segment resumes from last; later segments start at 0
            last = 0
            head_end = datetime.now()
            print(head + " end time:" + str(head_end))


if __name__ == '__main__':
    task_begin = datetime.now()
    print("phone check begin time:" + str(task_begin))
    # China Telecom, China Unicom, China Mobile, virtual operators
    dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
    lt = ['130', '131', '132', '145', '146', '155', '156', '166', '171', '175', '176', '185', '186']
    yd = ['134', '135', '136', '137', '138', '139', '147', '148', '150', '151', '152', '157', '158', '159', '172',
          '178', '182', '183', '184', '187', '188', '198']
    add = ['170']
    all_num = dx + lt + yd + add
    # print(all_num)
    print(len(all_num))
    # Number segments to crawl
    spider = PhoneInfoSpider(all_num)
    spider.requestAllSections()
    task_end = datetime.now()
    print("phone check end time:" + str(task_end))
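Crawling one number segment takes 10,000 queries, and the single-threaded version needs roughly an hour and a half for that, which is too slow.
Multi-threaded version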
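The handler above reads fields by line position, so it stops working if the response gains or loses a line. A minimal sketch of a position-independent alternative follows, assuming the response consists of key:'value' pairs with single-quoted values. The sample text and its key names (prefix, province, carrier) are placeholders for illustration, not the documented format of the real endpoint.

```python
import re

# Hypothetical sample: placeholder keys and values, shaped like quoted key:'value' lines.
SAMPLE = """
__Result_ = {
    prefix:'1380000',
    province:'ExampleProvince',
    carrier:'ExampleCarrier'
}
"""

# A word, a colon, then a value wrapped in single quotes.
FIELD_RE = re.compile(r"(\w+)\s*:\s*'([^']*)'")


def parse_fields(text):
    """Return a dict of every key:'value' pair found in text."""
    return dict(FIELD_RE.findall(text))


fields = parse_fields(SAMPLE)
print(fields.get("prefix"), fields.get("province"))
```

Looking fields up by key with .get() returns None for a missing field instead of raising IndexError, which makes malformed responses easier to skip.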
# coding:utf-8
import requests
from datetime import datetime
import queue
import threading

threadNum = 32
# Serializes appends to result.txt across worker threads
write_lock = threading.Lock()


class MyThread(threading.Thread):
    def __init__(self, func):
        threading.Thread.__init__(self)
        self.func = func

    def run(self):
        self.func()


def requestPhoneInfo():
    # q, lock and phone_head are module-level globals set in the main block
    while True:
        lock.acquire()
        if q.qsize() != 0:
            print("queue size:" + str(q.qsize()))
            p = q.get()  # take a task
            lock.release()
            # Build the number from the task value itself; reading q.qsize()
            # after releasing the lock would race with the other threads
            middle = str(p).zfill(4)
            phoneNum = phone_head + middle + "0000"
            print("phoneNum:" + phoneNum)
            try:
                url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
                # print(url)
                response = requests.get(url, timeout=10)
                # print(response.text)
                phoneInfoHandler(response.text)
            except Exception as e:
                print(type(e).__name__, ":", e)
        else:
            lock.release()
            break


def phoneInfoHandler(textData):
    # Fields are read by line position; each value sits between single quotes
    text = textData.splitlines(True)
    if len(text) >= 9:
        number = text[1].split('\'')[1]
        province = text[2].split('\'')[1]
        mobile_area = text[3].split('\'')[1]
        postcode = text[5].split('\'')[1]
        line_text = number + "," + province + "," + mobile_area + "," + postcode
        print(line_text)
        # print("province:" + province)
        try:
            with write_lock:
                with open('./result.txt', 'a', encoding='utf-8') as f:
                    f.write(line_text + '\n')
        except Exception as e:
            print(type(e).__name__, ":", e)


if __name__ == '__main__':
    task_begin = datetime.now()
    print("phone check begin time:" + str(task_begin))
    # China Telecom, China Unicom, China Mobile
    dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
    lt = ['130', '131', '132', '145', '155', '156', '166', '171', '175', '176', '185', '186']
    yd = ['134', '135', '136', '137', '138', '139', '147', '150', '151', '152', '157', '158', '159', '172', '178',
          '182', '183', '184', '187', '188', '198']
    all_num = dx + lt + yd
    print(len(all_num))
    for head in all_num:
        head_begin = datetime.now()
        print(head + " begin time:" + str(head_begin))
        q = queue.Queue()
        threads = []
        lock = threading.Lock()
        # Tasks 0..9999 become the four middle digits of the number
        for p in range(10000):
            q.put(p)
        print(q.qsize())
        # All threads of this round share the same segment head
        phone_head = head
        for i in range(threadNum):
            thread = MyThread(requestPhoneInfo)
            thread.start()
            threads.append(thread)
        # Wait for the whole segment to finish before starting the next one
        for thread in threads:
            thread.join()
        head_end = datetime.now()
        print(head + " end time:" + str(head_end))
    task_end = datetime.now()
    print("phone check end time:" + str(task_end))
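With the multi-threaded version, the 10,000 queries for one number segment finish in about 2 to 3 minutes. CPU usage spikes and then stays at around 70%.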
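The same fan-out can also be written with the standard library's concurrent.futures.ThreadPoolExecutor instead of a hand-managed queue and lock. The sketch below is an alternative, not the approach used above. fake_lookup is a hypothetical stand-in for the HTTP request so the example runs without network access, and build_numbers and crawl_segment are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor


def build_numbers(head, count=10000):
    """Yield 11-digit numbers: 3-digit head + 4-digit middle + '0000'."""
    for i in range(count):
        yield head + str(i).zfill(4) + "0000"


def fake_lookup(phone_num):
    """Hypothetical stand-in for the HTTP request; returns the 7-digit prefix."""
    return phone_num[:7]


def crawl_segment(head, lookup=fake_lookup, workers=32, count=10000):
    # pool.map hands each number to exactly one worker and returns
    # the results in input order, so no explicit queue or lock is needed.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lookup, build_numbers(head, count)))


results = crawl_segment("138", workers=4, count=20)
print(results[:3])
```

To crawl for real, a function that performs the request and returns the parsed line would be passed as lookup; collecting the results and writing them once from the main thread also avoids concurrent appends to result.txt.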
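There are a little over 40 number segments in total. Crawling all of them takes about 1 to 2 hours and yields roughly 410,000 records.
Source: https://www.cnblogs.com/wanli002/p/11413281.html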
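Once result.txt exists, the original goal (validity check plus home location lookup) comes down to a dictionary lookup on the first 7 digits. The following is a minimal sketch, assuming each line has the form number,province,mobile_area,postcode as written by the handler and that the first column is the 7-digit prefix. The sample rows and the 11-digit validity rule (a leading 1 followed by ten digits) are assumptions for illustration.

```python
import re


def load_prefix_table(lines):
    """Map each 7-digit prefix to (province, mobile_area, postcode)."""
    table = {}
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) == 4:  # skip blank or malformed lines
            number, province, mobile_area, postcode = parts
            table[number] = (province, mobile_area, postcode)
    return table


def lookup_location(phone_num, table):
    """Return the location tuple, or None if the number is malformed or unknown."""
    if not re.fullmatch(r"1\d{10}", phone_num):
        return None
    return table.get(phone_num[:7])


# Hypothetical rows; real data would come from open('./result.txt', encoding='utf-8')
sample_rows = ["1380000,ProvinceA,CarrierA,00000\n", "malformed line\n"]
table = load_prefix_table(sample_rows)
print(lookup_location("13800001234", table))
```

A number whose prefix is missing from the table returns None, which can serve as the "not valid" signal the project needs.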

