How to Read a URL List Line by Line in Python and Parse Each Web Page in Turn

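This article explains how to fix a logic error in a readlines() loop so that a Python script issues a separate HTTP request and HTML parse for every URL in a text file, rather than processing only the last line.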

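Your original code contains a classic indentation and scope error: the body of the for link in linksList: loop consists only of the line url = link, while all subsequent network requests, parsing, and writing sit outside the loop. The loop therefore just reassigns url on each pass, so once it finishes url holds only the last value and only that URL is ever processed.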
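The effect can be reproduced without any network access. The following is a minimal sketch, not the original script: the sample URLs and the function names buggy and fixed are made up for illustration, and "processing" is reduced to appending the URL to a list.

```python
# Hypothetical sample data standing in for the lines read from list.txt
links_list = [
    "https://example.com/a\n",
    "https://example.com/b\n",
    "https://example.com/c\n",
]


def buggy(links):
    """Processing is dedented out of the loop, so it runs only once."""
    processed = []
    for link in links:
        url = link  # the loop body only reassigns url
    processed.append(url.strip())  # runs after the loop, with the last value
    return processed


def fixed(links):
    """Processing is inside the loop, so it runs once per URL."""
    processed = []
    for link in links:
        url = link.strip()
        processed.append(url)
    return processed


print(buggy(links_list))  # ['https://example.com/c']
print(fixed(links_list))  # all three URLs, in order
```

The only difference between the two functions is the indentation of the processing step, which is exactly the change the corrected script makes.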

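To parse every URL one by one, the whole request-parse-save flow must sit inside the for loop. Below is a complete, improved implementation, followed by notes on the key changes: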

✅ Correct structure: the complete workflow inside the loop

import requests
from bs4 import BeautifulSoup

def news():
    # Read the file safely with a with statement (it is closed automatically)
    with open('list.txt', 'r') as links_file:
        links_list = links_file.readlines()

    # Process each URL independently
    for link in links_list:
        link = link.strip()  # remove the newline and surrounding whitespace so the request does not fail
        if not link:  # skip blank lines
            continue

        print(f"Processing: {link}")

        try:
            resp = requests.get(link, timeout=10)
            resp.raise_for_status()  # raise an HTTPError for 4xx/5xx responses

            soup = BeautifulSoup(resp.text, 'html.parser')
            target_div = soup.find("div", {"class": "m-exhibitor-entry__item__body__contacts__additional__website"})

            if target_div:
                # Extract the text of every <a> tag
                with open("Websites.txt", "a", encoding="utf-8") as f:
                    for anchor in target_div.find_all("a"):
                        f.write(anchor.get_text(strip=True) + "\n")
                print(f"✓ Extracted from {link}")
            else:
                print(f"⚠ Warning: Target div not found on {link}")

        except requests.exceptions.RequestException as e:
            print(f"✗ Failed to fetch {link}: {e}")
        except Exception as e:
            print(f"✗ Error parsing {link}: {e}")

if __name__ == "__main__":
    news()

Key improvements:

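Indentation fix: all request, parsing, and writing logic now sits inside the for loop, so each iteration handles exactly one URL;
Improved robustness:
- link.strip() removes the trailing \n and surrounding whitespace, so a call such as requests.get("https://...\n") does not fail;
- try/except catches network errors (timeouts, refused connections, and so on) as well as parsing errors;
- resp.raise_for_status() explicitly checks for HTTP error statuses;
- blank lines are skipped and the target element is checked for existence, which avoids an AttributeError;
Better resource management:
- with open(...) replaces manual open/close and prevents leaked file handles;
- the output file is reopened in "a" (append) mode for each page that yields results; opening it once before the loop and reusing the handle is the preferable alternative;
Explicit encoding: encoding="utf-8" prevents garbled output when writing Chinese or other non-ASCII characters.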
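The resource-management point about opening the output once can be sketched as follows. This is an illustrative outline under stated assumptions, not part of the original script: write_all and fake_extract are hypothetical names, fake_extract stands in for the requests + BeautifulSoup step so the sketch runs offline, and an in-memory buffer replaces Websites.txt to keep the demo free of side effects.

```python
import io
from typing import Callable, Iterable, List, TextIO


def write_all(links: Iterable[str],
              extract: Callable[[str], List[str]],
              out: TextIO) -> int:
    """Run extract on each link and write every result to one open stream."""
    written = 0
    for link in links:
        link = link.strip()
        if not link:  # skip blank lines, as in the main script
            continue
        for text in extract(link):
            out.write(text + "\n")
            written += 1
    return written


def fake_extract(link: str) -> List[str]:
    """Hypothetical stand-in for fetching and parsing a page."""
    return [f"site-for-{link.rsplit('/', 1)[-1]}"]


# In a real script the stream would be opened once, before the loop:
#     with open("Websites.txt", "a", encoding="utf-8") as out:
#         write_all(links_list, extract, out)
buffer = io.StringIO()
count = write_all(
    ["https://example.com/a\n", "\n", "https://example.com/b\n"],
    fake_extract,
    buffer,
)
print(count)              # 2
print(buffer.getvalue())  # one extracted line per non-blank URL
```

Passing the open stream into the function keeps the file handling in one place and makes the loop easy to test with an in-memory buffer.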

⚠ Notes:

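Each line of list.txt should contain one valid URL (for example https://www.enlit-europe.com/exhibitors/precept) with no extra characters;
The target site may have anti-scraping measures; if frequent requests get rejected, add time.sleep(1) between requests or set headers (such as 'User-Agent');
To improve performance, consider issuing requests concurrently with concurrent.futures.ThreadPoolExecutor (while respecting robots.txt and the site's terms of service).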
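The last two notes can be combined in one rough sketch: a small thread pool whose workers pause before each request. Everything here is an assumption made for illustration: the names fetch_all and fake_fetch, the User-Agent string, and the default worker count and delay are not from the original script. fake_fetch replaces the real HTTP call so the sketch runs offline; a real fetch function would call requests.get(url, headers=HEADERS, timeout=10).

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable

# Assumed header value; replace it with one that identifies your script.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}


def fetch_all(urls: Iterable[str],
              fetch: Callable[[str], str],
              max_workers: int = 4,
              delay: float = 1.0) -> Dict[str, object]:
    """Fetch URLs concurrently; map each URL to its result or its exception."""

    def worker(url: str) -> str:
        time.sleep(delay)  # pause before each request made by this worker
        return fetch(url)

    results: Dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failing URL must not stop the rest
                results[url] = exc
    return results


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for requests.get(url, headers=HEADERS, timeout=10).text."""
    if url.endswith("/bad"):
        raise ValueError("simulated failure")
    return f"<html>{url}</html>"


demo = fetch_all(
    ["https://example.com/a", "https://example.com/bad"],
    fake_fetch,
    delay=0.0,  # no pause in the offline demo
)
for url, outcome in sorted(demo.items()):
    print(url, "->", outcome)
```

Note that the delay applies per worker, so with max_workers threads the site can still receive up to that many requests per delay interval; lower max_workers or raise delay if the server starts rejecting requests.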