Kimi-CLI: Search and Page Content Fetching Implementation
Technical Research · Artificial Intelligence · AI Agent
Overview
Kimi-CLI provides two primary web interaction capabilities:
- SearchWeb: a web search tool
- FetchURL: a web page content fetching tool
Both tools support two modes:
- Service mode: retrieves optimized results via the Moonshot API
- Local mode: performs the HTTP request and content extraction directly
1. SearchWeb Search Tool
1.1 Tool Definition
src/kimi_cli/tools/web/search.py
from pathlib import Path

from kosong.tooling import CallableTool2, ToolReturnValue
from pydantic import BaseModel, Field

# load_desc, Config, Runtime, and SkipThisTool come from kimi_cli internals (imports elided)


class Params(BaseModel):
    query: str = Field(description="The search query text")
    limit: int = Field(
        description="Number of results to return",
        default=5,
        ge=1,
        le=20,
    )
    include_content: bool = Field(
        description="Whether to include page content",
        default=False,
    )


class SearchWeb(CallableTool2[Params]):
    name: str = "SearchWeb"
    description: str = load_desc(Path(__file__).parent / "search.md", {})
    params: type[Params] = Params

    def __init__(self, config: Config, runtime: Runtime):
        super().__init__()
        if config.services.moonshot_search is None:
            raise SkipThisTool()
        self._runtime = runtime
        self._base_url = config.services.moonshot_search.base_url
        self._api_key = config.services.moonshot_search.api_key
        self._oauth_ref = config.services.moonshot_search.oauth
        self._custom_headers = config.services.moonshot_search.custom_headers or {}
1.2 Parameters
- query: the search query text (required)
- limit: number of results to return; defaults to 5, range 1-20
- include_content: whether to include full page content; defaults to False. Enabling it consumes a large number of tokens.
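The range constraints are enforced by pydantic at validation time. A minimal, self-contained sketch reproducing the Params model (descriptions trimmed):

```python
from pydantic import BaseModel, Field, ValidationError

class Params(BaseModel):
    query: str = Field(description="The search query text")
    limit: int = Field(default=5, ge=1, le=20)
    include_content: bool = Field(default=False)

# Defaults apply when only the required field is given.
p = Params(query="python 3.13")
print(p.limit, p.include_content)  # 5 False

# Values outside the 1-20 range are rejected at validation time.
try:
    Params(query="python", limit=0)
except ValidationError as e:
    print("rejected:", e.errors()[0]["type"])
```

Because validation happens before the tool body runs, the model never has to defend against an out-of-range limit itself.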
1.3 Search Flow
User query → SearchWeb → resolve API key → call search API
→ parse results → format output → Agent
1.4 Core Implementation
@override
async def __call__(self, params: Params) -> ToolReturnValue:
    builder = ToolResultBuilder(max_line_length=None)

    # 1. Resolve the API key (supports OAuth)
    api_key = self._runtime.oauth.resolve_api_key(
        self._api_key,
        self._oauth_ref,
    )
    if not self._base_url or not api_key:
        return builder.error(
            "Search service is not configured. You may want to try other methods to search.",
            brief="Search service not configured",
        )

    tool_call = get_current_tool_call_or_none()
    assert tool_call is not None, "Tool call is expected to be set"

    # 2. Call the search API
    async with (
        new_client_session() as session,
        session.post(
            self._base_url,
            headers={
                "User-Agent": USER_AGENT,
                "Authorization": f"Bearer {api_key}",
                "X-Msh-Tool-Call-Id": tool_call.id,
                **self._runtime.oauth.common_headers(),
                **self._custom_headers,
            },
            json={
                "text_query": params.query,
                "limit": params.limit,
                "enable_page_crawling": params.include_content,
                "timeout_seconds": 30,
            },
        ) as response,
    ):
        if response.status != 200:
            return builder.error(
                (
                    f"Failed to search. Status: {response.status}. "
                    "This may indicate that the search service is currently unavailable."
                ),
                brief="Failed to search",
            )
        try:
            results = Response(**await response.json()).search_results
        except ValidationError as e:
            return builder.error(
                (
                    f"Failed to parse search results. Error: {e}. "
                    "This may indicate that the search service is currently unavailable."
                ),
                brief="Failed to parse search results",
            )

    # 3. Format the output
    for i, result in enumerate(results):
        if i > 0:
            builder.write("---\n\n")
        builder.write(
            f"Title: {result.title}\nDate: {result.date}\n"
            f"URL: {result.url}\nSummary: {result.snippet}\n\n"
        )
        if result.content:
            builder.write(f"{result.content}\n\n")
    return builder.ok()
1.5 API Response Model
class SearchResult(BaseModel):
    site_name: str
    title: str
    url: str
    snippet: str
    content: str = ""
    date: str = ""
    icon: str = ""
    mime: str = ""


class Response(BaseModel):
    search_results: list[SearchResult]
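Because the optional fields carry defaults, a service payload only needs `search_results` entries with the four required fields. A small illustration with a made-up payload:

```python
from pydantic import BaseModel

class SearchResult(BaseModel):
    site_name: str
    title: str
    url: str
    snippet: str
    content: str = ""
    date: str = ""
    icon: str = ""
    mime: str = ""

class Response(BaseModel):
    search_results: list[SearchResult]

# A minimal payload in the shape the tool expects from the service.
payload = {
    "search_results": [
        {
            "site_name": "docs.python.org",
            "title": "What's New In Python 3.13",
            "url": "https://docs.python.org/3/whatsnew/3.13.html",
            "snippet": "Summary of changes in Python 3.13.",
        }
    ]
}
resp = Response(**payload)
print(resp.search_results[0].title)
print(resp.search_results[0].content == "")  # optional fields fall back to defaults
```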
1.6 Configuration
# config.toml
[services.moonshot_search]
base_url = "https://api.kimi.com/coding/v1/search"
api_key = "sk-xxx"
oauth = "provider-name" # optional: use OAuth
custom_headers = { "X-Custom-Header" = "value" } # optional
2. FetchURL Web Fetching Tool
2.1 Tool Definition
src/kimi_cli/tools/web/fetch.py
import aiohttp
import trafilatura
from kosong.tooling import CallableTool2, ToolReturnValue
from pydantic import BaseModel, Field


class Params(BaseModel):
    url: str = Field(description="The URL to fetch content from")


class FetchURL(CallableTool2[Params]):
    name: str = "FetchURL"
    description: str = load_desc(Path(__file__).parent / "fetch.md", {})
    params: type[Params] = Params

    def __init__(self, config: Config, runtime: Runtime):
        super().__init__()
        self._runtime = runtime
        self._service_config = config.services.moonshot_fetch
2.2 Dual-Mode Architecture
FetchURL supports two modes:
- Service mode: retrieves optimized content via the Moonshot Fetch API
- Local mode: a direct HTTP GET followed by trafilatura content extraction
@override
async def __call__(self, params: Params) -> ToolReturnValue:
    # 1. Prefer service mode
    if self._service_config:
        ret = await self._fetch_with_service(params)
        if not ret.is_error:
            return ret
        logger.warning(
            "Failed to fetch URL via service: {error}",
            error=ret.message,
        )
        # Service failed; fall through to local mode

    # 2. Fall back to local mode
    return await self.fetch_with_http_get(params)
2.3 Service Mode Implementation
async def _fetch_with_service(self, params: Params) -> ToolReturnValue:
    assert self._service_config is not None
    tool_call = get_current_tool_call_or_none()
    assert tool_call is not None, "Tool call is expected to be set"
    builder = ToolResultBuilder(max_line_length=None)

    # 1. Resolve the API key
    api_key = self._runtime.oauth.resolve_api_key(
        self._service_config.api_key,
        self._service_config.oauth,
    )
    if not api_key:
        return builder.error(
            "Fetch service is not configured. You may want to try other methods to fetch.",
            brief="Fetch service not configured",
        )

    # 2. Build the request headers
    headers = {
        "User-Agent": USER_AGENT,
        "Authorization": f"Bearer {api_key}",
        "Accept": "text/markdown",
        "X-Msh-Tool-Call-Id": tool_call.id,
        **self._runtime.oauth.common_headers(),
        **(self._service_config.custom_headers or {}),
    }

    # 3. Call the Fetch API
    try:
        async with (
            new_client_session() as session,
            session.post(
                self._service_config.base_url,
                headers=headers,
                json={"url": params.url},
            ) as response,
        ):
            if response.status != 200:
                return builder.error(
                    f"Failed to fetch URL via service. Status: {response.status}.",
                    brief="Failed to fetch URL via fetch service",
                )
            content = await response.text()
            builder.write(content)
            return builder.ok(
                "The returned content is the main content extracted from the page."
            )
    except aiohttp.ClientError as e:
        return builder.error(
            (
                f"Failed to fetch URL via service due to network error: {str(e)}. "
                "This may indicate the service is unreachable."
            ),
            brief="Network error when calling fetch service",
        )
2.4 Local Mode Implementation
@staticmethod
async def fetch_with_http_get(params: Params) -> ToolReturnValue:
    builder = ToolResultBuilder(max_line_length=None)
    try:
        # 1. Send the HTTP GET request
        async with (
            new_client_session() as session,
            session.get(
                params.url,
                headers={
                    "User-Agent": (
                        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                        "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
                    ),
                },
            ) as response,
        ):
            if response.status >= 400:
                return builder.error(
                    (
                        f"Failed to fetch URL. Status: {response.status}. "
                        f"This may indicate the page is not accessible or the server is down."
                    ),
                    brief=f"HTTP {response.status} error",
                )
            resp_text = await response.text()

            # 2. Return plain-text content as-is
            content_type = response.headers.get(
                aiohttp.hdrs.CONTENT_TYPE, ""
            ).lower()
            if content_type.startswith(("text/plain", "text/markdown")):
                builder.write(resp_text)
                return builder.ok(
                    "The returned content is the full content of the page."
                )
    except aiohttp.ClientError as e:
        return builder.error(
            (
                f"Failed to fetch URL due to network error: {str(e)}. "
                "This may indicate the URL is invalid or the server is unreachable."
            ),
            brief="Network error",
        )

    # 3. Extract the main content with trafilatura
    if not resp_text:
        return builder.ok(
            "The response body is empty.",
            brief="Empty response body",
        )
    extracted_text = trafilatura.extract(
        resp_text,
        include_comments=True,
        include_tables=True,
        include_formatting=False,
        output_format="txt",
        with_metadata=True,
    )
    if not extracted_text:
        return builder.error(
            (
                "Failed to extract meaningful content from the page. "
                "This may indicate the page content is not suitable for text extraction, "
                "or the page requires JavaScript to render its content."
            ),
            brief="No content extracted",
        )
    builder.write(extracted_text)
    return builder.ok(
        "The returned content is the main text content extracted from the page."
    )
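The content-type check above decides whether extraction runs at all: plain text and Markdown pass through verbatim, everything else goes to trafilatura. Isolated as a small helper (a sketch, not code from the project):

```python
def should_extract(content_type: str) -> bool:
    # Plain text and Markdown are returned verbatim; everything else
    # (typically text/html) goes through the extraction step.
    main_type = content_type.lower().split(";")[0].strip()
    return not main_type.startswith(("text/plain", "text/markdown"))

print(should_extract("text/html; charset=utf-8"))      # True
print(should_extract("text/markdown; charset=utf-8"))  # False
print(should_extract("TEXT/PLAIN"))                    # False
```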
2.5 Trafilatura Configuration
Trafilatura is a Python library for extracting the main content of web pages:
extracted_text = trafilatura.extract(
    resp_text,
    include_comments=True,     # include comments
    include_tables=True,       # include tables
    include_formatting=False,  # strip formatting
    output_format="txt",       # plain-text output
    with_metadata=True,        # include metadata
)
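For intuition only, here is a crude stdlib approximation of the first thing such extraction does, dropping script and style content; trafilatura goes much further, using readability heuristics to isolate the main article text, drop navigation boilerplate, and recover metadata:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Naive extractor: collects text outside <script>/<style> tags."""

    def __init__(self):
        super().__init__()
        self.chunks: list[str] = []
        self._skip = 0  # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

html = (
    "<html><head><style>p{}</style></head>"
    "<body><p>Hello</p><script>x=1</script><p>World</p></body></html>"
)
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.chunks))  # Hello World
```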
3. Service Configuration
3.1 Search Service Configuration
# config.toml
[services.moonshot_search]
base_url = "https://api.kimi.com/coding/v1/search"
api_key = "sk-xxx"
oauth = "provider-name" # optional
custom_headers = { "X-Custom-Header" = "value" } # optional
3.2 Fetch Service Configuration
[services.moonshot_fetch]
base_url = "https://api.kimi.com/coding/v1/fetch"
api_key = "sk-xxx"
oauth = "provider-name" # optional
custom_headers = { "X-Custom-Header" = "value" } # optional
4. OAuth Integration
4.1 OAuth API Key Resolution
# Resolve the OAuth-backed API key at runtime
api_key = self._runtime.oauth.resolve_api_key(
    self._api_key,
    self._oauth_ref,
)
4.2 OAuth Common Headers
# Merge in the OAuth common headers
headers = {
    "User-Agent": USER_AGENT,
    "Authorization": f"Bearer {api_key}",
    "X-Msh-Tool-Call-Id": tool_call.id,
    **self._runtime.oauth.common_headers(),  # OAuth headers
    **self._custom_headers,                  # custom headers
}
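Since later entries in a dict literal overwrite earlier ones, the custom headers take precedence over both the defaults and the OAuth headers. A self-contained illustration (all header names and values here are made up):

```python
base_headers = {
    "User-Agent": "kimi-cli",
    "Authorization": "Bearer sk-xxx",
}
oauth_headers = {"X-Msh-Device-Id": "device-123"}  # hypothetical OAuth header
custom_headers = {"User-Agent": "my-proxy-agent"}  # user-supplied override

headers = {
    **base_headers,
    **oauth_headers,
    **custom_headers,  # later unpacking wins on key collision
}
print(headers["User-Agent"])  # my-proxy-agent
print("X-Msh-Device-Id" in headers)  # True
```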
5. Tool Descriptions
5.1 SearchWeb Description
search.md:
WebSearch tool allows you to search on the internet to get latest information, including news, documents, release notes, blog posts, papers, etc.
5.2 FetchURL Description
fetch.md:
FetchURL tool allows you to fetch the content of a web page, extracting the main text content.
6. Usage Examples
6.1 Search Example
# User: search for the latest Python 3.13 release notes
params = Params(
    query="Python 3.13 release notes",
    limit=5,
    include_content=False,
)
# Output:
Title: Python 3.13.0 Release Notes
Date: 2024-10-07
URL: https://docs.python.org/3/whatsnew/3.13.html
Summary: Python 3.13 is the latest stable release of the Python programming language.

---

Title: What's New in Python 3.13
Date: 2024-10-07
URL: https://realpython.com/python3-13-whats-new/
Summary: A comprehensive guide to the new features and improvements in Python 3.13.
6.2 Fetch Example
# User: fetch the content of the Python 3.13 release notes page
params = Params(
    url="https://docs.python.org/3/whatsnew/3.13.html"
)
# Output:
The returned content is the main text content extracted from the page.
# Actual content:
Python 3.13.0 Release Notes
===========================
Release Date: October 7, 2024
This article explains the new features in Python 3.13, compared to 3.12. Python 3.13 was released on October 7, 2024. For full details, see the change log.
...
7. Error Handling
7.1 Search Errors
if response.status != 200:
    return builder.error(
        (
            f"Failed to search. Status: {response.status}. "
            "This may indicate that the search service is currently unavailable."
        ),
        brief="Failed to search",
    )
7.2 Fetch Errors
except aiohttp.ClientError as e:
    return builder.error(
        (
            f"Failed to fetch URL due to network error: {str(e)}. "
            "This may indicate the URL is invalid or the server is unreachable."
        ),
        brief="Network error",
    )
7.3 Content Extraction Failure
if not extracted_text:
    return builder.error(
        (
            "Failed to extract meaningful content from the page. "
            "This may indicate the page content is not suitable for text extraction, "
            "or the page requires JavaScript to render its content."
        ),
        brief="No content extracted",
    )
8. Performance Optimizations
8.1 Async I/O
All network requests are asynchronous, using aiohttp:
async with new_client_session() as session, session.post(...) as response:
    content = await response.text()
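The payoff of async I/O comes when several requests run concurrently instead of back to back. A stdlib-only sketch, with `fetch` standing in for an aiohttp request:

```python
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for an aiohttp request; sleeps instead of doing I/O.
    await asyncio.sleep(0.01)
    return f"content of {url}"

async def main() -> list[str]:
    urls = ["https://a.example", "https://b.example", "https://c.example"]
    # All three "requests" run concurrently rather than sequentially.
    return await asyncio.gather(*(fetch(u) for u in urls))

results = asyncio.run(main())
print(len(results))  # 3
```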
8.2 Timeout Control
The search API request carries a 30-second timeout:
json={
    "text_query": params.query,
    "limit": params.limit,
    "enable_page_crawling": params.include_content,
    "timeout_seconds": 30,  # 30-second timeout
}
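Note that `timeout_seconds` travels in the request body, so it bounds the work done on the server side. A client-side deadline can be layered on top with `asyncio.wait_for` (a sketch; `call_search_api` is a hypothetical stand-in):

```python
import asyncio

async def call_search_api(query: str) -> str:
    # Stand-in for the HTTP round-trip to the search service.
    await asyncio.sleep(0.01)
    return f"results for {query}"

async def search_with_deadline(query: str, seconds: float) -> str:
    try:
        return await asyncio.wait_for(call_search_api(query), timeout=seconds)
    except asyncio.TimeoutError:
        return "error: search timed out"

print(asyncio.run(search_with_deadline("python 3.13", 30.0)))  # results for python 3.13
```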
8.3 Result Caching
Service-mode results are cached on the server side; local mode does not cache.
9. Security Considerations
9.1 URL Validation
There is no explicit URL validation; invalid URLs are instead caught by aiohttp's error handling:
except aiohttp.ClientError as e:
    return builder.error(
        f"Failed to fetch URL due to network error: {str(e)}",
        brief="Network error",
    )
9.2 Content Filtering
Trafilatura automatically filters out scripts, styles, and other irrelevant content:
extracted_text = trafilatura.extract(
    resp_text,
    include_formatting=False,  # strip formatting
    output_format="txt",       # plain text
)
9.3 User-Agent
A realistic browser User-Agent is used to avoid being rejected by servers:
headers={
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    ),
}
10. Extensibility
10.1 Custom Search Service
A custom search service can be configured:
[services.custom_search]
base_url = "https://api.custom.com/search"
api_key = "your-api-key"
10.2 Custom Fetch Service
A custom fetch service can be configured:
[services.custom_fetch]
base_url = "https://api.custom.com/fetch"
api_key = "your-api-key"
10.3 Extending Local Fetching
The fetch_with_http_get method can be extended to support additional content extraction strategies.
11. Integration with Other Tools
11.1 Agent Integration
SearchWeb and FetchURL are invoked by the Agent as tools:
# Configured in agent.yaml
tools:
  - "kimi_cli.tools.web:SearchWeb"
  - "kimi_cli.tools.web:FetchURL"
11.2 Skill Integration
Skills can reference these tools:
# search-code.md
---
name: search-code
description: Search code and documentation
---
Use SearchWeb to find relevant documentation and code examples.
## Strategy
1. Search for relevant keywords
2. Fetch the most relevant pages
3. Extract and summarize the information
12. Tool Chain Examples
12.1 Search + Fetch Chain
User: find the new features in Python 3.13
1. SearchWeb("Python 3.13 new features")
   → returns several result links
2. FetchURL("https://docs.python.org/3/whatsnew/3.13.html")
   → returns the detailed content
3. Summarize and answer the user
12.2 Multi-Search + Multi-Fetch Chain
User: compare the performance of several frameworks
1. SearchWeb("Python web framework performance comparison", limit=10)
   → returns several comparison articles
2. FetchURL on each of the returned URLs
   → retrieves the detailed content
3. Analyze and summarize
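The chains above can be sketched end to end with stubbed tools; `search_web` and `fetch_url` below are stand-ins that do no network I/O:

```python
import asyncio

async def search_web(query: str, limit: int = 5) -> list[str]:
    # Stand-in for SearchWeb: returns a list of result URLs.
    return [f"https://example.com/{query.replace(' ', '-')}/{i}" for i in range(limit)]

async def fetch_url(url: str) -> str:
    # Stand-in for FetchURL: returns extracted page text.
    return f"extracted text of {url}"

async def answer(question: str) -> str:
    urls = await search_web(question, limit=3)
    # Fetch the top results concurrently, then summarize.
    pages = await asyncio.gather(*(fetch_url(u) for u in urls))
    return f"summary built from {len(pages)} pages"

print(asyncio.run(answer("python web framework performance")))  # summary built from 3 pages
```

In the real system the agent model drives each step, deciding which URLs from the search output are worth fetching before it summarizes.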