requests.Session()

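requests.get() and requests.post() each send a one-off GET or POST request. requests.session() handles cookies automatically and keeps state across requests. Cookie values can be obtained this way without logging in.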

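import requests

session = requests.session()

# The first request lets the session pick up the site's cookies
response = session.get(url=initial_url, headers=headers, timeout=120, proxies=proxies).json()
# Later requests through the same session send those cookies automatically
html = session.get(url=url, headers=headers, timeout=120, proxies=proxies).json()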
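As a minimal offline sketch of the state keeping described above: the cookie name, value, and URL below are made up for illustration, and in real use the cookie would be set by a server response rather than by hand. The request is only prepared, not sent, to show what the session attaches on its own.

```python
import requests

session = requests.Session()

# Hypothetical cookie; normally a server sets this via a Set-Cookie header.
session.cookies.set("sessionid", "abc123", domain="example.com", path="/")

# Prepare (but do not send) a follow-up request to see what the session adds.
request = requests.Request("GET", "https://example.com/profile")
prepared = session.prepare_request(request)

cookie_header = prepared.headers.get("Cookie")
print(cookie_header)
```

A plain requests.get() call has no such cookie jar, which is why it cannot carry state from one request to the next.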

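json library

The json library encodes and decodes JSON objects in Python.

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for people to read and write.

It is handy for quickly processing some web pages.

json.dumps: encodes a Python object into a JSON string.

json.loads: decodes a JSON string into a Python object.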

import json
import requests

def checkkk(url):
    data = {
        "url": url
    }
    # thisurl is the API endpoint and must be defined elsewhere
    response = json.loads(requests.post(thisurl, json.dumps(data)).text)
    # print(response["info"])
    if response["info"] == 2:
        return 1
    else:
        return 0
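A minimal round trip through json.dumps and json.loads, using a made-up payload, shows how the two calls mirror each other:

```python
import json

payload = {"url": "https://example.com", "info": 2}  # made-up data

encoded = json.dumps(payload)   # Python dict -> JSON string
decoded = json.loads(encoded)   # JSON string -> Python dict

print(type(encoded).__name__)
print(decoded["info"] == 2)
```

With requests, passing json=data to requests.post() and calling response.json() performs the same encoding and decoding steps, and additionally sets the Content-Type header to application/json.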

Selenium library

Driver

After the browser updates, the driver no longer matches it. The error:

selenium.common.exceptions.SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 118
Current browser version is 131.0.6778.140 with binary path xxx\chrome.exe

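My driver was version 118 and my Chrome browser was version 131, so I went to download a new driver. Driver source 1 is an original collection by an experienced developer; the driver from it works on Windows, but on Ubuntu it fails with: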

The chromedriver version (131.0.6775.0) detected in PATH at /usr/local/bin/chromedriver might not be compatible with the detected chrome version (131.0.6778.139); currently, chromedriver 131.0.6778.108 is recommended for chrome 131.*, so it is advised to delete the driver in PATH and retry

(Error after a later update) The chromedriver version (131.0.6778.108) detected in PATH at /usr/local/bin/chromedriver might not be compatible with the detected chrome version (131.0.6778.139); currently, chromedriver 131.0.6778.204 is recommended for chrome 131.*, so it is advised to delete the driver in PATH and retry

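Following the error message, I found the official driver source 2, which in my testing works on Ubuntu.

Unzip the driver and copy the files inside. Find the browser's path from the error message and replace the driver there; find the Python path from the current interpreter and replace the driver there as well. On Ubuntu, replace the driver at the browser's location and in /usr/local/bin/, then add execute permission with chmod +x.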
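Before replacing files, it can help to check whether the two versions actually disagree. The sketch below uses hypothetical helper names (major_version, versions_match) and only compares the major version; as the Ubuntu messages above show, a matching major version can still produce a warning about the build number.

```python
def major_version(version: str) -> int:
    """Return the major component of a dotted version such as '131.0.6778.140'."""
    return int(version.strip().split(".")[0])


def versions_match(driver_version: str, browser_version: str) -> bool:
    """True when ChromeDriver and Chrome share the same major version."""
    return major_version(driver_version) == major_version(browser_version)


# Version strings taken from the error messages above.
print(versions_match("118", "131.0.6778.140"))
print(versions_match("131.0.6778.108", "131.0.6778.139"))
```

The version strings themselves could be read from the output of `chromedriver --version` and `google-chrome --version`.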

Then the next error appeared:

File "D:\Python\Python38\lib\site-packages\selenium\webdriver\chrome\webdriver.py", line 84, in __init__
super().__init__(
File "D:\Python\Python38\lib\site-packages\selenium\webdriver\chromium\webdriver.py", line 104, in __init__
super().__init__(
File "D:\Python\Python38\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 286, in __init__
self.start_session(capabilities, browser_profile)
File "D:\Python\Python38\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 378, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "D:\Python\Python38\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 440, in execute
self.error_handler.check_response(response)
File "D:\Python\Python38\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 245, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created: DevToolsActivePort file doesn't exist
Stacktrace:
GetHandleVerifier [0x00007FF713938E92+54786]
(No symbol) [0x00007FF7138A55B2]
(No symbol) [0x00007FF71375A64B]
(No symbol) [0x00007FF71378CA50]
(No symbol) [0x00007FF713787C46]
(No symbol) [0x00007FF7137853BE]
(No symbol) [0x00007FF7137C3FBB]
(No symbol) [0x00007FF7137C3A30]
(No symbol) [0x00007FF7137BBC43]
(No symbol) [0x00007FF713790941]
(No symbol) [0x00007FF713791B84]
GetHandleVerifier [0x00007FF713C87EE2+3524178]
GetHandleVerifier [0x00007FF713CDD790+3874560]
GetHandleVerifier [0x00007FF713CD5D0F+3843199]
GetHandleVerifier [0x00007FF7139D5026+694166]
(No symbol) [0x00007FF7138B0A28]
(No symbol) [0x00007FF7138ACA34]
(No symbol) [0x00007FF7138ACB62]
(No symbol) [0x00007FF71389CC23]
BaseThreadInitThunk [0x00007FFA21F374B4+20]
RtlUserThreadStart [0x00007FFA232826A1+33]

Code change:

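from selenium import webdriver
from selenium.webdriver import ChromeOptions

options_ = ChromeOptions()
# Commented out: specifying user-data-dir is what triggered the error
#options_.add_argument(r"user-data-dir=./Insagram/selenium")

driver = webdriver.Chrome("chromedriver", options=options_)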

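The error appeared once the program specified a user data folder with --user-data-dir, because the browser using that folder was not one started by Selenium.

Fixes: close all browser processes before running the code, or do not specify --user-data-dir, or make a copy of the User Data folder.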
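The third fix (working on a copy of the user data folder) could be sketched as follows. The helper names copy_profile and user_data_dir_argument are hypothetical, and the source path is whatever folder your own Chrome profile lives in.

```python
import shutil
import tempfile
from pathlib import Path


def copy_profile(source_dir: str) -> str:
    """Copy a Chrome user data folder into a fresh temporary directory."""
    target = Path(tempfile.mkdtemp(prefix="chrome-profile-")) / "user-data"
    shutil.copytree(source_dir, target)
    return str(target)


def user_data_dir_argument(profile_dir: str) -> str:
    """Build the launch argument that points Chrome at the copied profile."""
    return f"--user-data-dir={profile_dir}"
```

The resulting string would then be passed to options_.add_argument(), so that Selenium starts Chrome on the copy while the everyday browser keeps using the original folder.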

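lxml library

Parses HTML. etree is used together with XPath: it turns the string passed in into an _Element object, on which methods such as getparent(), remove(), and xpath() are convenient to use.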

import requests
from lxml import etree

res = requests.get(image_url, headers=headers).text
html = etree.HTML(res)  # parse the HTML string into an _Element
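To illustrate the three methods named above, here is a small sketch on a made-up HTML snippet (the markup, ids, and class names are invented for the example, so no network request is needed):

```python
from lxml import etree

# Made-up HTML snippet standing in for a downloaded page.
html_text = """
<html><body>
  <div id="gallery">
    <img src="a.jpg"/>
    <p class="ad">sponsored</p>
    <img src="b.jpg"/>
  </div>
</body></html>
"""

html = etree.HTML(html_text)  # parse the string into an _Element tree

# xpath(): collect the src attribute of every <img> in the gallery.
image_urls = html.xpath('//div[@id="gallery"]/img/@src')

# getparent() + remove(): drop the unwanted <p class="ad"> node.
for ad in html.xpath('//p[@class="ad"]'):
    ad.getparent().remove(ad)

remaining_tags = [child.tag for child in html.xpath('//div[@id="gallery"]')[0]]
print(image_urls)
print(remaining_tags)
```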

Chrome has crashed & ChromeOptions launch arguments

The browser and the driver were both installed and their versions matched, but running the code still failed:

selenium.common.exceptions.SessionNotCreatedException: Message: session not created: Chrome failed to start: exited normally.
(session not created: DevToolsActivePort file doesn't exist)
(The process started from chrome location /opt/google/chrome/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)

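Solutions from other people that I consulted:

Fixing "DevToolsActivePort file doesn't exist"

How to fix "session not created: Chrome failed to start: exited normally." when calling Selenium on Linux

First make sure that the browser and driver versions match and that the path used to launch the browser is correct. Both solutions turned out to be about launch arguments, so I changed the launch arguments (mainly by adding headless):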

from selenium import webdriver

options = webdriver.ChromeOptions()

# Add launch arguments
options.add_argument('--headless')  # headless (no UI) mode; this is the item that mainly fixed the problem
options.add_argument('--disable-gpu')  # disable GPU rendering
options.add_argument("--disable-blink-features=AutomationControlled")  # hide Selenium launch fingerprint 1: mask the webdriver feature
options.add_argument(
    '--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36')
options.add_argument('--disable-java')
options.add_argument('--disable-extensions')
options.add_argument('--no-sandbox')

# Add experimental options
options.add_experimental_option('excludeSwitches', ['enable-automation'])  # hide Selenium launch fingerprint 2: clears window.navigator.webdriver; no longer effective on recent Chrome versions, use 1 instead

options.add_argument('--proxy-server=socks5://xxxx')  # proxy

# Driver path passed positionally, as in my setup; newer Selenium releases take it through a Service object instead
driver = webdriver.Chrome("chromedriver", options=options)

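The author of that solution concluded that launching Selenium needs a desktop, and my Linux machine has none, hence the error. Their fix was to make Selenium start without a browser window; what I did was enable headless mode.

However, running without a browser window can leave some elements unloaded. Setting the browser window size, for example 1366x768, solves this.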
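A small sketch of the window-size fix: the helper name headless_arguments is hypothetical, while --headless and --window-size are ordinary Chrome launch switches. The function only builds the argument strings, so it runs without a browser.

```python
def headless_arguments(width: int = 1366, height: int = 768) -> list:
    """Return Chrome launch arguments for headless mode with a fixed window size."""
    return [
        "--headless",
        f"--window-size={width},{height}",
    ]


for argument in headless_arguments():
    print(argument)
```

Each string would be passed to options.add_argument() before the driver is created. Calling driver.set_window_size(1366, 768) after the driver starts is another way to get the same size.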

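403 status code with Selenium on Ubuntu

I checked and reinstalled Chrome and the driver without finding any problem, and headless mode was on, yet the page content returned by Selenium's get() was the 403 access-denied page shown below. The same crawl succeeded on Windows but got a 403 on Ubuntu.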

ERROR: The request could not be satisfied

403 ERROR

The request could not be satisfied.


Request blocked. We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner.
If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.

Generated by cloudfront (CloudFront)
Request ID: OCKGdsv8zSPr8Vpi-JubSXpxV9SzCniI6bBxx1QGDq9MRuo3nYQtHQ==
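A 403 means the server received the request but refused it. Usually the crawler has been detected because its disguise was not good enough.

I first tried visiting other sites from Ubuntu. They all worked, so this particular site's anti-crawling measures were blocking me. From what I read, these approaches were worth trying:

1. Get around the check: increase the wait time? Refresh a few more times?

2. Change the request headers for a better disguise

3. Switch to another browser

The fix that finally worked was odd: changing the User-Agent request header from a Linux one to a Windows one was enough. Presumably the site checks Linux and Windows clients differently, and this happened to get through.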

options_ = webdriver.ChromeOptions()
# options_.add_argument(
# 'User-Agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36')
options_.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')

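429 Too Many Requests

When requests are too frequent, the server refuses the connection and returns a page with status code 429. Close the excess connections and retry in a loop: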

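import time
import requests

response = None
for i in range(20):  # retry the request up to 20 times
    with requests.Session() as session:  # closing the session releases its connections
        session.headers["Connection"] = "close"  # ask the server not to keep the connection alive
        response = session.get(downurl, headers=headers, timeout=20, proxies=proxies, cookies=cookie_dict)
    print(response.status_code)
    if response.status_code == 200:
        break
    time.sleep(20)  # wait before the next attempt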
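As a variation on the fixed 20-second wait, the sketch below retries with exponential backoff and honors a numeric Retry-After header when the server sends one. This is a swapped-in technique, not the loop I actually ran; fetch_with_retry and its parameters are hypothetical names, and fetch can be any zero-argument callable that returns a response object.

```python
import time


def fetch_with_retry(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it stops returning 429 or attempts run out.

    fetch is any zero-argument callable returning an object with
    .status_code and .headers (for example a requests.Response).
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code != 429:
            return response
        # Prefer the server's Retry-After hint (in seconds) when it is numeric;
        # otherwise back off exponentially: base_delay, 2x, 4x, ...
        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * (2 ** attempt)
        if attempt < max_attempts - 1:
            sleep(delay)
    return response
```

With requests it could be called as fetch_with_retry(lambda: requests.get(downurl, headers=headers, timeout=20, proxies=proxies, cookies=cookie_dict)).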
dvid image error HTTPConnectionPool(host='172.25.76.14', port=8001): Max retries exceeded with url: /bullet/dvidshubpost (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f50d1737100>: Failed to establish a new connection: [Errno 113] No route to host'))

SSLEOFError

requests.exceptions.SSLError: HTTPSConnectionPool(host='xxx', port=443): Max retries exceeded with url: xxx.mp4 (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)')))

Reference solutions

Option 1: downgrade urllib3 to a version <= 1.25.11 (most people in the comments report that this works)

pip install urllib3==1.25.11

Option 2: change the proxy setting (this worked for me)

# {"http": "http://xxx", "https": "https://xxx"}  change the https proxy URL so that it starts with http
{"http": "http://xxx", "https": "http://xxx"}
