Scraping CSDN Articles with Python and a Proxy

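This article shows how to use Python, BeautifulSoup, and a proxy (SOCKS5 or HTTP) to collect the H1 title and body text of CSDN blog posts in bulk. It is aimed at readers who are new to web scraping and want a practical starting point.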

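Applications such as data analysis, content aggregation, and recommendation systems often need to fetch article content from blog sites in bulk. CSDN is a well-known Chinese technical community with a large number of high-quality posts. To protect privacy and keep collection efficient, many setups route requests through a proxy server so that the real IP address stays hidden.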

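I. Tool Selection

This approach uses the following stack:

requests: sends HTTP/HTTPS requests and supports SOCKS5 and HTTP proxies
BeautifulSoup: parses HTML and extracts content
Proxy server: speaks SOCKS5 or HTTP and lets you fetch the target content anonymously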

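II. Environment Setup

1. Install the dependencies

pip install "requests[socks]" beautifulsoup4

> Note: the [socks] extra installs PySocks, which lets requests use SOCKS proxies. If you only use HTTP proxies, you can omit it. The quotes matter in shells such as zsh, where square brackets are glob characters.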
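Before going further, it can help to confirm that the packages are importable. The snippet below is a minimal sketch; `missing_modules` and `REQUIRED_MODULES` are hypothetical names introduced here for illustration. It only checks that the modules can be located and does not import them.

```python
import importlib.util

# Import names differ from the pip package names:
# beautifulsoup4 is imported as "bs4", and PySocks (pulled in by requests[socks]) as "socks".
REQUIRED_MODULES = ["requests", "bs4", "socks"]


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical helper, not part of requests or BeautifulSoup.
print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
```

If "socks" shows up as missing, only SOCKS proxies are affected; HTTP proxies work without it.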

III. Core Scraping Workflow

1. Configure the proxy and the target URL

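# Proxy configuration: uncomment the pair you need and fill in your proxy host and port.
# If every entry stays commented out, the dict is empty and requests connects without an explicit proxy.
proxies = {
    # SOCKS5 proxy (with credentials: socks5h://user:pass@host:port)
    # "http": "socks5h://127.0.0.1:1080",
    # "https": "socks5h://127.0.0.1:1080",

    # HTTP proxy
    # "http": "http://127.0.0.1:8080",
    # "https": "http://127.0.0.1:8080",
}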

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
}

# Example CSDN article URL; replace it with any CSDN post you want to fetch
blog_url = "https://blog.csdn.net/qq_46145584/article/details/138422946"
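Editing the dictionary by hand gets error-prone once credentials are involved, because characters such as "@" or ":" in a username or password break the proxy URL. The following is a minimal sketch of a builder function; `build_proxies` is a hypothetical helper introduced here, not part of requests, and it assumes the same proxy should handle both http and https traffic.

```python
from urllib.parse import quote


def build_proxies(scheme, host, port, user=None, password=None):
    """Build a requests-style proxies dict that sends http and https traffic through one proxy.

    Hypothetical helper for illustration; not part of the requests API.
    """
    if scheme not in {"http", "https", "socks5", "socks5h"}:
        raise ValueError(f"unsupported proxy scheme: {scheme}")
    auth = ""
    if user is not None:
        # Percent-encode credentials so "@" or ":" inside them cannot break the URL.
        auth = quote(user, safe="")
        if password is not None:
            auth += ":" + quote(password, safe="")
        auth += "@"
    proxy_url = f"{scheme}://{auth}{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


# Same local SOCKS5 endpoint as in the commented-out configuration above.
proxies = build_proxies("socks5h", "127.0.0.1", 1080)
print(proxies)
```

The returned dict can be passed directly as the `proxies` argument of `requests.get`.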

2. Send the request and parse the page

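import requests
from bs4 import BeautifulSoup

try:
    resp = requests.get(blog_url, headers=headers, proxies=proxies, timeout=15)
    resp.raise_for_status()
    resp.encoding = resp.apparent_encoding  # use the detected encoding so Chinese text displays correctly
    soup = BeautifulSoup(resp.text, "html.parser")
except requests.RequestException as e:
    # Covers connection errors, timeouts, proxy errors, and HTTP error statuses
    print("Request failed:", e)
    raise SystemExit(1)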
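A single `requests.get` call gives up on the first transient failure, which is common when traffic goes through a proxy. One possible refinement, sketched below, is a `Session` that retries automatically with backoff. `make_session` is a hypothetical helper name, and the retry count, backoff factor, and status list are assumptions to tune, not recommended values.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=3, backoff_factor=1.0):
    """Create a Session that retries transient failures with exponential backoff.

    Hypothetical helper; the retry settings are illustrative assumptions.
    """
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,
        # Retry on rate limiting and common server-side errors.
        status_forcelist=(429, 500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


session = make_session()
# Usage (not executed here):
# resp = session.get(blog_url, headers=headers, proxies=proxies, timeout=15)
```

A session also reuses connections, which helps when many articles are fetched from the same host.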

3. Extract the H1 title and the body text

# Extract the H1 title
h1 = soup.find("h1")
h1_text = h1.get_text(strip=True) if h1 else "H1 title not found"

# Extract the body (on CSDN it is usually inside <article> or <div id="article_content">)
content = soup.find("article")
if not content:
    content = soup.find("div", id="article_content")
content_text = content.get_text(separator='\n', strip=True) if content else "Body content not found"

print("H1 title:\n", h1_text)
print("\nBody (first 300 characters):\n", content_text[:300])
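The extraction logic is easier to reuse and to test offline when it lives in a function that takes HTML and returns a dictionary. The sketch below assumes the same two containers as above; `parse_article` and the inline `sample_html` are hypothetical, and the sample is not real CSDN markup.

```python
from bs4 import BeautifulSoup


def parse_article(html):
    """Return the H1 title and body text of a page; a missing part is reported as None.

    Hypothetical helper for illustration.
    """
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    body = soup.find("article")
    if body is None:
        # Fall back to the div that CSDN commonly uses for the article body.
        body = soup.find("div", id="article_content")
    return {
        "title": h1.get_text(strip=True) if h1 else None,
        "content": body.get_text(separator="\n", strip=True) if body else None,
    }


# Made-up HTML used only to exercise the function without a network request.
sample_html = """
<html><body>
  <h1> Sample title </h1>
  <div id="article_content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""
print(parse_article(sample_html))
```

Returning None instead of a placeholder string lets the caller tell "not found" apart from real content.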

IV. Complete Example

import requests
from bs4 import BeautifulSoup

# Proxy configuration: uncomment one http/https pair
proxies = {
    # "http": "socks5h://127.0.0.1:1080",
    # "https": "socks5h://127.0.0.1:1080",
    # "http": "http://127.0.0.1:8080",
    # "https": "http://127.0.0.1:8080",
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
}

blog_url = "https://blog.csdn.net/qq_46145584/article/details/138422946"

try:
    resp = requests.get(blog_url, headers=headers, proxies=proxies, timeout=15)
    resp.raise_for_status()
    resp.encoding = resp.apparent_encoding
    soup = BeautifulSoup(resp.text, "html.parser")
except requests.RequestException as e:
    print("Request failed:", e)
    raise SystemExit(1)

h1 = soup.find("h1")
h1_text = h1.get_text(strip=True) if h1 else "H1 title not found"

content = soup.find("article")
if not content:
    content = soup.find("div", id="article_content")
content_text = content.get_text(separator='\n', strip=True) if content else "Body content not found"

print("H1 title:\n", h1_text)
print("\nBody (first 300 characters):\n", content_text[:300])
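The example above only prints the result. For bulk collection you will usually want to persist each article. One simple option, sketched here, is a JSON Lines file with one record per line. `append_record` and `load_records` are hypothetical helpers, and the record fields are an assumption about what you want to keep.

```python
import json
from pathlib import Path


def append_record(path, url, title, content):
    """Append one article as a single JSON line.

    ensure_ascii=False keeps Chinese text readable in the file instead of \\uXXXX escapes.
    """
    record = {"url": url, "title": title, "content": content}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_records(path):
    """Read every record back from a JSON Lines file, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

With the variables from the complete example, a call could look like `append_record("articles.jsonl", blog_url, h1_text, content_text)`, where the file name is arbitrary.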

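V. Common Problems and Optimization Tips

Content is incomplete or empty?
Some CSDN content may be loaded dynamically by JavaScript, so it is absent from the HTML that requests receives. In that case, use a browser automation tool such as Selenium or Playwright to render the page first.
Proxy not working?
Check the proxy address, port, credentials, and local firewall settings. Prefer socks5h over socks5: with socks5h the proxy resolves hostnames, whereas with socks5 they are resolved locally, which can leak DNS queries or fail when local DNS is restricted.
Page structure changed?
The container tag for the body may differ between articles or change in a future CSDN redesign. Inspect the live page structure with the browser developer tools (F12) and adjust the selectors accordingly.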
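When a proxy seems unavailable, a useful first check is whether its port accepts TCP connections at all, before looking at credentials or protocol settings. The following is a minimal sketch using only the standard library; `proxy_reachable` is a hypothetical helper, and a True result only shows that something is listening, not that it is a working proxy.

```python
import socket


def proxy_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within the timeout.

    Hypothetical helper: it does not speak SOCKS or HTTP, it only tests connectivity.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For the local SOCKS5 endpoint used in this article, the call would be `proxy_reachable("127.0.0.1", 1080)`. If it returns False, the problem is the proxy process or the firewall rather than the scraper.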

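VI. Conclusion

With this approach we can quickly collect the title and body of CSDN blog posts as structured data while keeping the real IP address hidden behind a proxy. For actual bulk collection, combining it with multithreading and a pool of proxy IPs can further improve throughput and stability. A natural next step is to gather all article links from the CSDN home page or search results and then fetch each one in turn.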
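The combination of multithreading and a proxy pool can be sketched as follows. This is one possible design, not a definitive implementation: `crawl_many` is a hypothetical function, the caller supplies `fetch(url, proxies)` (for example a wrapper around the request and parsing steps shown earlier), and proxies are assigned round-robin.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def crawl_many(urls, proxy_pool, fetch, max_workers=4):
    """Fetch every URL concurrently, assigning proxies from the pool round-robin.

    fetch(url, proxies) is supplied by the caller and returns the parsed result.
    Returns a dict mapping each URL to its result, or to the exception it raised.
    Hypothetical helper for illustration.
    """
    # An empty pool falls back to a single empty proxies dict (no explicit proxy).
    pool_list = list(proxy_pool) or [{}]
    # Pair URLs with proxies up front so the rotation is deterministic.
    jobs = list(zip(urls, cycle(pool_list)))
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch, url, proxies): url for url, proxies in jobs}
        for future, url in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed page should not stop the batch
                results[url] = exc
    return results
```

Keep `max_workers` small and add delays inside `fetch` if needed, since aggressive concurrency makes it more likely that the site blocks the proxy IPs.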
