IT Technology | 2025-10-03 22:01:57

Several Ways to Handle CAPTCHAs in a Python Crawler (Source Code at the End)

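There has been a lot going on lately. I competed in the Lanqiao Cup and I am still preparing for a few certification exams, so my crawler-related posts have been on hold for a while. I have honestly slipped a bit, which I shouldn't have. I keep asking myself, now that I'm in my third year of university: should I take the postgraduate entrance exam, keep chasing one new technology after another as I do now, or dig deeper into crawlers? I don't know how smooth that last road will be, and I can't quite see where it leads. Over roughly the past month of evenings I finally worked through Django at a high level; what remains is working through the official documentation and some real projects. I plan to start a separate column to record the bumpy road of learning Django. Learning is always like this: if you stop, you lose ground. Some things still baffle me, and they really are strange: someone's scholarship came through connections, and the winner of a certain project was simply a name swapped in at the last minute...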

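Never mind all that; none of it matters. In this post I wrap two of the better ways to handle image CAPTCHAs: Baidu's aip, and muggle-ocr, a recognition library that has recently become popular.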

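Baidu's aip deserves a special mention, because it really does offer a lot. I also added a function that detects pornographic images, which you can play with if you're interested. Once you start writing crawlers you run into more of these images than you can keep up with, and the sites hosting them are countless; I hope the "Clean Net" campaign (净网行动) keeps up the pressure. Enough talk, let's get to the hands-on part.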

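This article covers how to handle CAPTCHAs in a crawler and wraps the functionality in a reusable class. It covers how to call Baidu AIP and how to use muggle-ocr, a recently released open-source recognition library.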

Calling the Baidu aip API

Extending with Baidu's pornography-detection API

Calling the muggle_ocr API

Wrapped source code

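Calling the Baidu aip API:

1. First, register an account: https://login.bce.baidu.com/

Log in once registration is complete.

2. Create a project

Among the listed technologies, find Text Recognition (文字识别), then click to create a project.

Once it has been created:

The AppID, API Key, and Secret Key shown in the screenshot will be needed shortly.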
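The three values are best kept out of the source file. Below is a minimal sketch of one way to do that, reading them from environment variables. The variable names BAIDU_APP_ID, BAIDU_API_KEY, and BAIDU_SECRET_KEY and the helper load_baidu_credentials are my own assumptions for illustration, not something Baidu prescribes.

```python
import os


def load_baidu_credentials(env=None):
    """Return (app_id, api_key, secret_key) read from environment variables.

    The variable names are hypothetical; pick whatever suits your setup.
    """
    env = os.environ if env is None else env
    names = ("BAIDU_APP_ID", "BAIDU_API_KEY", "BAIDU_SECRET_KEY")
    # Collect every missing or empty variable so the error names them all at once.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in names)
```

The wrapper class at the end of this article builds its client as AipOcr(APP_ID, API_KEY, SECRET_KEY), so the returned tuple could be unpacked straight into that call.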

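Next, you can consult the official documentation or simply use the code I wrote.

3. Install the dependency: pip install baidu-aip

The following is just one method; it depends on the setup described above.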

def return_ocr_by_baidu(self, test_image):
    """
    Note: fill in your own baidu_aip settings in __init__ first.
    This test uses the high-accuracy version; if it is too slow, switch
    back to the general version: self.client.basicGeneral(image, options)
    Reference: https://cloud.baidu.com/doc/OCR/s/3k3h7yeqa
    :param test_image: name of the image file to recognize
    :return: the recognized CAPTCHA text; if it is wrong, call again
    """
    image = self.return_image_content(test_image=self.return_path(test_image))

    # General text recognition (high-accuracy version), without options:
    # self.client.basicAccurate(image)
    # The optional parameters are documented at the URL above.
    options = {}
    options["detect_direction"] = "true"
    options["probability"] = "true"

    # Call the API
    result = self.client.basicAccurate(image, options)
    result_s = result["words_result"][0]["words"]
    # Remove this print if you don't want the output
    print(result_s)
    if result_s:
        return result_s.strip()
    else:
        raise Exception("The result is None , try it !")
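The method above indexes result["words_result"][0]["words"] directly, which fails with a KeyError or IndexError when the request is rejected or no text is found. Here is a hedged sketch of a more defensive parser. The helper name first_recognized_text is mine, and it assumes error responses carry error_code and error_msg keys, which is how I understand Baidu's responses to look; check the linked documentation before relying on it.

```python
def first_recognized_text(result):
    """Return the first recognized line of an OCR response dict, stripped.

    Raises ValueError when the response reports an error or holds no text.
    """
    # Assumption: failed requests carry "error_code" / "error_msg" keys.
    if "error_code" in result:
        raise ValueError(
            "OCR request failed: {} {}".format(
                result.get("error_code"), result.get("error_msg")
            )
        )
    words_result = result.get("words_result") or []
    if not words_result:
        raise ValueError("OCR returned no text")
    text = words_result[0].get("words", "").strip()
    if not text:
        raise ValueError("OCR returned empty text")
    return text
```

With this in place, the body of return_ocr_by_baidu could end with return first_recognized_text(result) instead of indexing the dictionary by hand.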

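Extending with Baidu's pornography-detection API:

We write code partly for fun, after all; it can't all be this dry, can it?

The pornography-detection API lives under Content Moderation (内容审核); look around there and you will find it.

Source code for calling it: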

# -*- coding : utf-8 -*-
# @Time : 2020/10/22 17:30
# @author : 沙漏在下雨
# @Software : PyCharm
# @CSDN : https://me.csdn.net/qq_45906219
from aip import AipContentCensor
from ocr import MyOrc


class Auditing(MyOrc):
    """
    Calls Baidu's content-moderation aip API, mainly to screen for
    pornographic, terrorism-related, or disgusting content.
    URL: https://ai.baidu.com/ai-doc/ANTIPORN/tk3h6xgkn
    """

    def __init__(self):
        # super().__init__()
        APP_ID = "your APP_ID"
        API_KEY = "your API_KEY"
        SECRET_KEY = "your SECRET_KEY"
        self.client = AipContentCensor(APP_ID, API_KEY, SECRET_KEY)

    def return_path(self, test_image):
        return super().return_path(test_image)

    def return_image_content(self, test_image):
        return super().return_image_content(test_image)

    def return_Content_by_baidu_of_image(self, test_image, mode=0):
        """
        Reuses some methods from ocr, since everything lives together
        and it saves a little code.
        Content moderation: checks whether an image contains illegal or
        harmful content. The API can also moderate text, but I found that
        of little use and did not wrap it.
        url: https://ai.baidu.com/ai-doc/ANTIPORN/Wk3h6xg56
        :param test_image: image to check, either a local file or a URL
        :param mode: 0 (default) treats test_image as a local file,
                     1 treats it as an image URL
        :return: the moderation result
        """
        if mode == 0:
            filepath = self.return_image_content(self.return_path(test_image=test_image))
        elif mode == 1:
            filepath = test_image
        else:
            raise Exception("The mode is 0 or 1 but your mode is ", mode)
        # Call the pornography-detection API
        result = self.client.imageCensorUserDefined(filepath)
        # If the image is a URL, the call looks like this:
        # result = self.client.imageCensorUserDefined("http://www.example.com/image.jpg")
        print(result)
        return result


a = Auditing()
a.return_Content_by_baidu_of_image("test_image/2.jpg", mode=0)
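The mode argument asks the caller to say whether test_image is a local file or a URL. As a small sketch, the choice could be guessed from the string itself. guess_image_mode is a hypothetical helper of mine, and it only treats http and https addresses as URLs.

```python
from urllib.parse import urlparse


def guess_image_mode(test_image):
    """Return 1 for an http(s) URL, otherwise 0 (treated as a local path)."""
    parsed = urlparse(test_image)
    # Require both a web scheme and a host, so Windows drive letters such as
    # "C:\\..." (parsed as scheme "c") are not mistaken for URLs.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return 1
    return 0
```

A call such as a.return_Content_by_baidu_of_image(path, mode=guess_image_mode(path)) would then work for both kinds of input.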

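Calling the muggle_ocr API:

This package has become popular recently. It is very simple to use and has few other functions.

Installation: pip install muggle-ocr. The download is somewhat slow, so a mobile hotspot works best. At the moment the mirror sites (Tsinghua/Aliyun) have not yet picked up this package, because it is a very new OCR model.

Calling the API:

def return_ocr_by_muggle(self, test_image, mode=1):
    """
    Recognize an image with muggle_ocr.
    :param test_image: name of the image file to recognize, preferably an absolute path
    :param mode: model selector. mode = 0 is ModelType.OCR, for ordinary printed text;
                 mode = 1 (default) is ModelType.Captcha, for simple 4-6 character
                 alphanumeric CAPTCHAs
    Project page: https://pypi.org/project/muggle-ocr/
    :return: the recognized CAPTCHA text; if it is wrong, call again
    """
    # Choose the model
    if mode == 1:
        sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.Captcha)
    elif mode == 0:
        sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.OCR)
    else:
        raise Exception("The mode is 0 or 1 , but your mode == ", mode)

    filepath = self.return_path(test_image=test_image)

    with open(filepath, "rb") as fr:
        captcha_bytes = fr.read()
        result = sdk.predict(image_bytes=captcha_bytes)
        # Remove this print if you don't want the output
        print(result)
        return result.strip()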
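The docstrings above say to call again when the result is wrong. Here is a rough sketch of that retry loop, assuming the callable fetches a fresh CAPTCHA image each time, since re-running a model on the same bytes would presumably return the same answer. The names recognize_with_retry and looks_like_captcha are hypothetical. The 4-6 alphanumeric pattern mirrors the Captcha model description above and should be adjusted to the target site.

```python
import re

# Matches the "simple 4-6 character alphanumeric" shape described above.
CAPTCHA_PATTERN = re.compile(r"[A-Za-z0-9]{4,6}")


def looks_like_captcha(text):
    """Return True when text is 4 to 6 ASCII letters or digits."""
    return CAPTCHA_PATTERN.fullmatch(text) is not None


def recognize_with_retry(recognize, is_valid=looks_like_captcha, attempts=3):
    """Call recognize() up to `attempts` times and return the first valid result.

    recognize is any zero-argument callable that returns a string, for example
    a function that downloads a new CAPTCHA and runs it through an OCR method.
    """
    last = None
    for _ in range(attempts):
        last = recognize().strip()
        if is_valid(last):
            return last
    raise ValueError(
        "no valid result after {} attempts (last: {!r})".format(attempts, last)
    )
```

Passing a lambda that wraps return_ocr_by_muggle or return_ocr_by_baidu keeps the retry logic independent of which recognizer is used.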

Wrapped source code:

# -*- coding : utf-8 -*-
# @Time : 2020/10/22 14:12
# @author : 沙漏在下雨
# @Software : PyCharm
# @CSDN : https://me.csdn.net/qq_45906219
import muggle_ocr
import os
from aip import AipOcr

"""
PS: This module wraps two common image/CAPTCHA recognition methods in one
place; how you use them is up to you.

API 1: muggle_ocr
    pip install muggle-ocr
    The download is somewhat slow, so a mobile hotspot works best.
    At the moment the mirror sites (Tsinghua/Aliyun) have not yet picked up
    this package, because it is a very new OCR model.

API 2: baidu-aip
    pip install baidu-aip
    Plenty of people know this one already, but I still think the new
    muggle package is remarkably strong.
    For usage, see the official docs: https://cloud.baidu.com/doc/OCR/index.html
    or use it the way I do below; either is fine.

:param image_path: path of the image to recognize; use an absolute path
                   if the directory is deeply nested
"""


class MyOrc:
    def __init__(self):
        # Required settings: use the values from your own Baidu aip project
        APP_ID = "your APP_ID"
        API_KEY = "your API_KEY"
        SECRET_KEY = "your SECRET_KEY"

        self.client = AipOcr(APP_ID, API_KEY, SECRET_KEY)

    def return_path(self, test_image):
        """:return abs image_path"""
        # Resolve the path
        if os.path.isabs(test_image):
            filepath = test_image
        else:
            filepath = os.path.abspath(test_image)
        return filepath

    def return_image_content(self, test_image):
        """:return the image content """
        with open(test_image, "rb") as fr:
            return fr.read()

    def return_ocr_by_baidu(self, test_image):
        """
        Note: fill in your own baidu_aip settings in __init__ first.
        This test uses the high-accuracy version; if it is too slow, switch
        back to the general version: self.client.basicGeneral(image, options)
        Reference: https://cloud.baidu.com/doc/OCR/s/3k3h7yeqa
        :param test_image: name of the image file to recognize
        :return: the recognized CAPTCHA text; if it is wrong, call again
        """
        image = self.return_image_content(test_image=self.return_path(test_image))

        # General text recognition (high-accuracy version), without options:
        # self.client.basicAccurate(image)
        # The optional parameters are documented at the URL above.
        options = {}
        options["detect_direction"] = "true"
        options["probability"] = "true"

        # Call the API
        result = self.client.basicAccurate(image, options)
        result_s = result["words_result"][0]["words"]
        # Remove this print if you don't want the output
        print(result_s)
        if result_s:
            return result_s.strip()
        else:
            raise Exception("The result is None , try it !")

    def return_ocr_by_muggle(self, test_image, mode=1):
        """
        Recognize an image with muggle_ocr.
        :param test_image: name of the image file to recognize, preferably an absolute path
        :param mode: model selector. mode = 0 is ModelType.OCR, for ordinary printed text;
                     mode = 1 (default) is ModelType.Captcha, for simple 4-6 character
                     alphanumeric CAPTCHAs
        Project page: https://pypi.org/project/muggle-ocr/
        :return: the recognized CAPTCHA text; if it is wrong, call again
        """
        # Choose the model
        if mode == 1:
            sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.Captcha)
        elif mode == 0:
            sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.OCR)
        else:
            raise Exception("The mode is 0 or 1 , but your mode == ", mode)

        filepath = self.return_path(test_image=test_image)

        with open(filepath, "rb") as fr:
            captcha_bytes = fr.read()
            result = sdk.predict(image_bytes=captcha_bytes)
            # Remove this print if you don't want the output
            print(result)
            return result.strip()


# a = MyOrc()
# a.return_ocr_by_baidu(test_image="test_image/digit_img_1.png")
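Since the class exposes two recognizers with the same call shape, one option is to try them in order and fall back when one fails. The sketch below is only an illustration of that idea. recognize_with_fallback is a hypothetical helper that is not part of the class above, and the order in the commented usage is just a guess at a sensible default (local model first, cloud API second).

```python
def recognize_with_fallback(recognizers, test_image):
    """Try each (name, callable) pair in order; return (name, text) on success.

    Each callable takes the image path and returns the recognized string.
    A raised exception or an empty result moves on to the next recognizer.
    """
    errors = []
    for name, recognize in recognizers:
        try:
            text = recognize(test_image)
        except Exception as exc:  # any failure just triggers the fallback
            errors.append("{}: {}".format(name, exc))
            continue
        if text:
            return name, text
        errors.append("{}: empty result".format(name))
    raise RuntimeError("all recognizers failed: " + "; ".join(errors))


# Possible usage with the class above (needs both libraries and real credentials):
# orc = MyOrc()
# recognize_with_fallback(
#     [("muggle", orc.return_ocr_by_muggle), ("baidu", orc.return_ocr_by_baidu)],
#     "test_image/digit_img_1.png",
# )
```

Returning the recognizer's name alongside the text makes it easy to log which backend actually produced each answer.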
