Kimi-CLI: Search and Page Content Fetching Implementation
Technical Research · Artificial Intelligence · AI Agent
Overview
Kimi-CLI provides two primary web interaction capabilities:
- SearchWeb: a web search tool
- FetchURL: a web page content fetching tool
Both tools support two modes:
- Service mode: retrieves optimized results via the Moonshot API
- Local mode: performs the HTTP request and content extraction directly
1. SearchWeb Search Tool
1.1 Tool Definition
src/kimi_cli/tools/web/search.py
from pathlib import Path

from kosong.tooling import CallableTool2, ToolReturnValue
from pydantic import BaseModel, Field

# load_desc, Config, Runtime, and SkipThisTool come from kimi_cli internals (imports elided)


class Params(BaseModel):
    query: str = Field(description="The search query text")
    limit: int = Field(
        description="Number of results to return",
        default=5,
        ge=1,
        le=20,
    )
    include_content: bool = Field(
        description="Whether to include page content",
        default=False,
    )


class SearchWeb(CallableTool2[Params]):
    name: str = "SearchWeb"
    description: str = load_desc(Path(__file__).parent / "search.md", {})
    params: type[Params] = Params

    def __init__(self, config: Config, runtime: Runtime):
        super().__init__()
        if config.services.moonshot_search is None:
            raise SkipThisTool()
        self._runtime = runtime
        self._base_url = config.services.moonshot_search.base_url
        self._api_key = config.services.moonshot_search.api_key
        self._oauth_ref = config.services.moonshot_search.oauth
        self._custom_headers = config.services.moonshot_search.custom_headers or {}
1.2 Parameters
- query: the search query text (required)
- limit: number of results to return; defaults to 5, range 1-20
- include_content: whether to include full page content; defaults to False. Enabling it consumes a large number of tokens.
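The range constraints are enforced by pydantic at validation time. A minimal, self-contained sketch reproducing the Params model (descriptions trimmed):

```python
from pydantic import BaseModel, Field, ValidationError

class Params(BaseModel):
    query: str = Field(description="The search query text")
    limit: int = Field(default=5, ge=1, le=20)
    include_content: bool = Field(default=False)

# Defaults apply when only the required field is given.
p = Params(query="python 3.13")
print(p.limit, p.include_content)  # 5 False

# Values outside the 1-20 range are rejected at validation time.
try:
    Params(query="python", limit=0)
except ValidationError as e:
    print("rejected:", e.errors()[0]["type"])
```

Because validation happens before the tool body runs, the model never has to defend against an out-of-range limit itself.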
1.3 Search Flow
User query → SearchWeb → resolve API key → call search API
→ parse results → format output → Agent
1.4 Core Implementation
@override
async def __call__(self, params: Params) -> ToolReturnValue:
    builder = ToolResultBuilder(max_line_length=None)

    # 1. Resolve the API key (supports OAuth)
    api_key = self._runtime.oauth.resolve_api_key(
        self._api_key,
        self._oauth_ref,
    )
    if not self._base_url or not api_key:
        return builder.error(
            "Search service is not configured. You may want to try other methods to search.",
            brief="Search service not configured",
        )

    tool_call = get_current_tool_call_or_none()
    assert tool_call is not None, "Tool call is expected to be set"

    # 2. Call the search API
    async with (
        new_client_session() as session,
        session.post(
            self._base_url,
            headers={
                "User-Agent": USER_AGENT,
                "Authorization": f"Bearer {api_key}",
                "X-Msh-Tool-Call-Id": tool_call.id,
                **self._runtime.oauth.common_headers(),
                **self._custom_headers,
            },
            json={
                "text_query": params.query,
                "limit": params.limit,
                "enable_page_crawling": params.include_content,
                "timeout_seconds": 30,
            },
        ) as response,
    ):
        if response.status != 200:
            return builder.error(
                (
                    f"Failed to search. Status: {response.status}. "
                    "This may indicate that the search service is currently unavailable."
                ),
                brief="Failed to search",
            )
        try:
            results = Response(**await response.json()).search_results
        except ValidationError as e:
            return builder.error(
                (
                    f"Failed to parse search results. Error: {e}. "
                    "This may indicate that the search service is currently unavailable."
                ),
                brief="Failed to parse search results",
            )

    # 3. Format the output
    for i, result in enumerate(results):
        if i > 0:
            builder.write("---\n\n")
        builder.write(
            f"Title: {result.title}\nDate: {result.date}\n"
            f"URL: {result.url}\nSummary: {result.snippet}\n\n"
        )
        if result.content:
            builder.write(f"{result.content}\n\n")
    return builder.ok()
1.5 API Response Model
class SearchResult(BaseModel):
    site_name: str
    title: str
    url: str
    snippet: str
    content: str = ""
    date: str = ""
    icon: str = ""
    mime: str = ""


class Response(BaseModel):
    search_results: list[SearchResult]
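Because the optional fields carry defaults, a service payload only needs `search_results` entries with the four required fields. A small illustration with a made-up payload:

```python
from pydantic import BaseModel

class SearchResult(BaseModel):
    site_name: str
    title: str
    url: str
    snippet: str
    content: str = ""
    date: str = ""
    icon: str = ""
    mime: str = ""

class Response(BaseModel):
    search_results: list[SearchResult]

# A minimal payload in the shape the tool expects from the service.
payload = {
    "search_results": [
        {
            "site_name": "docs.python.org",
            "title": "What's New In Python 3.13",
            "url": "https://docs.python.org/3/whatsnew/3.13.html",
            "snippet": "Summary of changes in Python 3.13.",
        }
    ]
}
resp = Response(**payload)
print(resp.search_results[0].title)
print(resp.search_results[0].content == "")  # optional fields fall back to defaults
```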
1.6 Configuration
# config.toml
[services.moonshot_search]
base_url = "https://api.kimi.com/coding/v1/search"
api_key = "sk-xxx"
oauth = "provider-name" # optional: use OAuth
custom_headers = { "X-Custom-Header" = "value" } # optional
2. FetchURL Web Fetching Tool
2.1 Tool Definition
src/kimi_cli/tools/web/fetch.py
import aiohttp
import trafilatura
from kosong.tooling import CallableTool2, ToolReturnValue
from pydantic import BaseModel, Field


class Params(BaseModel):
    url: str = Field(description="The URL to fetch content from")


class FetchURL(CallableTool2[Params]):
    name: str = "FetchURL"
    description: str = load_desc(Path(__file__).parent / "fetch.md", {})
    params: type[Params] = Params

    def __init__(self, config: Config, runtime: Runtime):
        super().__init__()
        self._runtime = runtime
        self._service_config = config.services.moonshot_fetch
2.2 Dual-Mode Architecture
FetchURL supports two modes:
- Service mode: retrieves optimized content via the Moonshot Fetch API
- Local mode: a direct HTTP GET followed by trafilatura content extraction
@override
async def __call__(self, params: Params) -> ToolReturnValue:
    # 1. Prefer service mode
    if self._service_config:
        ret = await self._fetch_with_service(params)
        if not ret.is_error:
            return ret
        logger.warning(
            "Failed to fetch URL via service: {error}",
            error=ret.message,
        )
        # Service failed; fall through to local mode

    # 2. Fall back to local mode
    return await self.fetch_with_http_get(params)
2.3 Service Mode Implementation
async def _fetch_with_service(self, params: Params) -> ToolReturnValue:
    assert self._service_config is not None
    tool_call = get_current_tool_call_or_none()
    assert tool_call is not None, "Tool call is expected to be set"
    builder = ToolResultBuilder(max_line_length=None)

    # 1. Resolve the API key
    api_key = self._runtime.oauth.resolve_api_key(
        self._service_config.api_key,
        self._service_config.oauth,
    )
    if not api_key:
        return builder.error(
            "Fetch service is not configured. You may want to try other methods to fetch.",
            brief="Fetch service not configured",
        )

    # 2. Build the request headers
    headers = {
        "User-Agent": USER_AGENT,
        "Authorization": f"Bearer {api_key}",
        "Accept": "text/markdown",
        "X-Msh-Tool-Call-Id": tool_call.id,
        **self._runtime.oauth.common_headers(),
        **(self._service_config.custom_headers or {}),
    }

    # 3. Call the Fetch API
    try:
        async with (
            new_client_session() as session,
            session.post(
                self._service_config.base_url,
                headers=headers,
                json={"url": params.url},
            ) as response,
        ):
            if response.status != 200:
                return builder.error(
                    f"Failed to fetch URL via service. Status: {response.status}.",
                    brief="Failed to fetch URL via fetch service",
                )
            content = await response.text()
            builder.write(content)
            return builder.ok(
                "The returned content is the main content extracted from the page."
            )
    except aiohttp.ClientError as e:
        return builder.error(
            (
                f"Failed to fetch URL via service due to network error: {str(e)}. "
                "This may indicate the service is unreachable."
            ),
            brief="Network error when calling fetch service",
        )
2.4 Local Mode Implementation
@staticmethod
async def fetch_with_http_get(params: Params) -> ToolReturnValue:
    builder = ToolResultBuilder(max_line_length=None)
    try:
        # 1. Send the HTTP GET request
        async with (
            new_client_session() as session,
            session.get(
                params.url,
                headers={
                    "User-Agent": (
                        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                        "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
                    ),
                },
            ) as response,
        ):
            if response.status >= 400:
                return builder.error(
                    (
                        f"Failed to fetch URL. Status: {response.status}. "
                        f"This may indicate the page is not accessible or the server is down."
                    ),
                    brief=f"HTTP {response.status} error",
                )
            resp_text = await response.text()

            # 2. Return plain-text content as-is
            content_type = response.headers.get(
                aiohttp.hdrs.CONTENT_TYPE, ""
            ).lower()
            if content_type.startswith(("text/plain", "text/markdown")):
                builder.write(resp_text)
                return builder.ok(
                    "The returned content is the full content of the page."
                )
    except aiohttp.ClientError as e:
        return builder.error(
            (
                f"Failed to fetch URL due to network error: {str(e)}. "
                "This may indicate the URL is invalid or the server is unreachable."
            ),
            brief="Network error",
        )

    # 3. Extract the main content with trafilatura
    if not resp_text:
        return builder.ok(
            "The response body is empty.",
            brief="Empty response body",
        )
    extracted_text = trafilatura.extract(
        resp_text,
        include_comments=True,
        include_tables=True,
        include_formatting=False,
        output_format="txt",
        with_metadata=True,
    )
    if not extracted_text:
        return builder.error(
            (
                "Failed to extract meaningful content from the page. "
                "This may indicate the page content is not suitable for text extraction, "
                "or the page requires JavaScript to render its content."
            ),
            brief="No content extracted",
        )
    builder.write(extracted_text)
    return builder.ok(
        "The returned content is the main text content extracted from the page."
    )
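The content-type check above decides whether extraction runs at all: plain text and Markdown pass through verbatim, everything else goes to trafilatura. Isolated as a small helper (a sketch, not code from the project):

```python
def should_extract(content_type: str) -> bool:
    # Plain text and Markdown are returned verbatim; everything else
    # (typically text/html) goes through the extraction step.
    main_type = content_type.lower().split(";")[0].strip()
    return not main_type.startswith(("text/plain", "text/markdown"))

print(should_extract("text/html; charset=utf-8"))      # True
print(should_extract("text/markdown; charset=utf-8"))  # False
print(should_extract("TEXT/PLAIN"))                    # False
```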
2.5 Trafilatura Configuration
Trafilatura is a Python library for extracting the main content of web pages:
extracted_text = trafilatura.extract(
    resp_text,
    include_comments=True,     # include comments
    include_tables=True,       # include tables
    include_formatting=False,  # strip formatting
    output_format="txt",       # plain-text output
    with_metadata=True,        # include metadata
)
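For intuition only, here is a crude stdlib approximation of the first thing such extraction does, dropping script and style content; trafilatura goes much further, using readability heuristics to isolate the main article text, drop navigation boilerplate, and recover metadata:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Naive extractor: collects text outside <script>/<style> tags."""

    def __init__(self):
        super().__init__()
        self.chunks: list[str] = []
        self._skip = 0  # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

html = (
    "<html><head><style>p{}</style></head>"
    "<body><p>Hello</p><script>x=1</script><p>World</p></body></html>"
)
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.chunks))  # Hello World
```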
3. Service Configuration
3.1 Search Service Configuration
# config.toml
[services.moonshot_search]
base_url = "https://api.kimi.com/coding/v1/search"
api_key = "sk-xxx"
oauth = "provider-name" # optional
custom_headers = { "X-Custom-Header" = "value" } # optional
3.2 Fetch Service Configuration
[services.moonshot_fetch]
base_url = "https://api.kimi.com/coding/v1/fetch"
api_key = "sk-xxx"
oauth = "provider-name" # optional
custom_headers = { "X-Custom-Header" = "value" } # optional
4. OAuth Integration
4.1 OAuth API Key Resolution
# Resolve the OAuth-backed API key at runtime
api_key = self._runtime.oauth.resolve_api_key(
    self._api_key,
    self._oauth_ref,
)
4.2 OAuth Common Headers
# Merge in the OAuth common headers
headers = {
    "User-Agent": USER_AGENT,
    "Authorization": f"Bearer {api_key}",
    "X-Msh-Tool-Call-Id": tool_call.id,
    **self._runtime.oauth.common_headers(),  # OAuth headers
    **self._custom_headers,                  # custom headers
}
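Since later entries in a dict literal overwrite earlier ones, the custom headers take precedence over both the defaults and the OAuth headers. A self-contained illustration (all header names and values here are made up):

```python
base_headers = {
    "User-Agent": "kimi-cli",
    "Authorization": "Bearer sk-xxx",
}
oauth_headers = {"X-Msh-Device-Id": "device-123"}  # hypothetical OAuth header
custom_headers = {"User-Agent": "my-proxy-agent"}  # user-supplied override

headers = {
    **base_headers,
    **oauth_headers,
    **custom_headers,  # later unpacking wins on key collision
}
print(headers["User-Agent"])  # my-proxy-agent
print("X-Msh-Device-Id" in headers)  # True
```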
5. Tool Descriptions
5.1 SearchWeb Description
search.md:
WebSearch tool allows you to search on the internet to get latest information, including news, documents, release notes, blog posts, papers, etc.
5.2 FetchURL Description
fetch.md:
FetchURL tool allows you to fetch the content of a web page, extracting the main text content.
6. Usage Examples
6.1 Search Example
# User: search for the latest Python 3.13 release notes
params = Params(
    query="Python 3.13 release notes",
    limit=5,
    include_content=False,
)
# Output:
Title: Python 3.13.0 Release Notes
Date: 2024-10-07
URL: https://docs.python.org/3/whatsnew/3.13.html
Summary: Python 3.13 is the latest stable release of the Python programming language.

---

Title: What's New in Python 3.13
Date: 2024-10-07
URL: https://realpython.com/python3-13-whats-new/
Summary: A comprehensive guide to the new features and improvements in Python 3.13.
6.2 Fetch Example
# User: fetch the content of the Python 3.13 release notes page
params = Params(
    url="https://docs.python.org/3/whatsnew/3.13.html"
)
# Output:
The returned content is the main text content extracted from the page.
# Actual content:
Python 3.13.0 Release Notes
===========================
Release Date: October 7, 2024
This article explains the new features in Python 3.13, compared to 3.12. Python 3.13 was released on October 7, 2024. For full details, see the change log.
...
7. Error Handling
7.1 Search Errors
if response.status != 200:
    return builder.error(
        (
            f"Failed to search. Status: {response.status}. "
            "This may indicate that the search service is currently unavailable."
        ),
        brief="Failed to search",
    )
7.2 Fetch Errors
except aiohttp.ClientError as e:
    return builder.error(
        (
            f"Failed to fetch URL due to network error: {str(e)}. "
            "This may indicate the URL is invalid or the server is unreachable."
        ),
        brief="Network error",
    )
7.3 Content Extraction Failure
if not extracted_text:
    return builder.error(
        (
            "Failed to extract meaningful content from the page. "
            "This may indicate the page content is not suitable for text extraction, "
            "or the page requires JavaScript to render its content."
        ),
        brief="No content extracted",
    )
8. Performance Optimizations
8.1 Async I/O
All network requests are asynchronous, using aiohttp:
async with new_client_session() as session, session.post(...) as response:
    content = await response.text()
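The payoff of async I/O comes when several requests run concurrently instead of back to back. A stdlib-only sketch, with `fetch` standing in for an aiohttp request:

```python
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for an aiohttp request; sleeps instead of doing I/O.
    await asyncio.sleep(0.01)
    return f"content of {url}"

async def main() -> list[str]:
    urls = ["https://a.example", "https://b.example", "https://c.example"]
    # All three "requests" run concurrently rather than sequentially.
    return await asyncio.gather(*(fetch(u) for u in urls))

results = asyncio.run(main())
print(len(results))  # 3
```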
8.2 Timeout Control
The search API request carries a 30-second timeout:
json={
    "text_query": params.query,
    "limit": params.limit,
    "enable_page_crawling": params.include_content,
    "timeout_seconds": 30,  # 30-second timeout
}
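Note that `timeout_seconds` travels in the request body, so it bounds the work done on the server side. A client-side deadline can be layered on top with `asyncio.wait_for` (a sketch; `call_search_api` is a hypothetical stand-in):

```python
import asyncio

async def call_search_api(query: str) -> str:
    # Stand-in for the HTTP round-trip to the search service.
    await asyncio.sleep(0.01)
    return f"results for {query}"

async def search_with_deadline(query: str, seconds: float) -> str:
    try:
        return await asyncio.wait_for(call_search_api(query), timeout=seconds)
    except asyncio.TimeoutError:
        return "error: search timed out"

print(asyncio.run(search_with_deadline("python 3.13", 30.0)))  # results for python 3.13
```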
8.3 Result Caching
Service-mode results are cached on the server side; local mode does not cache.
9. Security Considerations
9.1 URL Validation
There is no explicit URL validation; invalid URLs are instead caught by aiohttp's error handling:
except aiohttp.ClientError as e:
    return builder.error(
        f"Failed to fetch URL due to network error: {str(e)}",
        brief="Network error",
    )
9.2 Content Filtering
Trafilatura automatically filters out scripts, styles, and other irrelevant content:
extracted_text = trafilatura.extract(
    resp_text,
    include_formatting=False,  # strip formatting
    output_format="txt",       # plain text
)
9.3 User-Agent
A realistic browser User-Agent is used to avoid being rejected by servers:
headers={
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    ),
}
10. Extensibility
10.1 Custom Search Service
A custom search service can be configured:
[services.custom_search]
base_url = "https://api.custom.com/search"
api_key = "your-api-key"
10.2 Custom Fetch Service
A custom fetch service can be configured:
[services.custom_fetch]
base_url = "https://api.custom.com/fetch"
api_key = "your-api-key"
10.3 Extending Local Fetching
The fetch_with_http_get method can be extended to support additional content extraction strategies.
11. Integration with Other Tools
11.1 Agent Integration
SearchWeb and FetchURL are invoked by the Agent as tools:
# Configured in agent.yaml
tools:
  - "kimi_cli.tools.web:SearchWeb"
  - "kimi_cli.tools.web:FetchURL"
11.2 Skill Integration
Skills can reference these tools:
# search-code.md
---
name: search-code
description: Search code and documentation
---
Use SearchWeb to find relevant documentation and code examples.
## Strategy
1. Search for relevant keywords
2. Fetch the most relevant pages
3. Extract and summarize the information
12. Tool Chain Examples
12.1 Search + Fetch Chain
User: find the new features in Python 3.13
1. SearchWeb("Python 3.13 new features")
   → returns several result links
2. FetchURL("https://docs.python.org/3/whatsnew/3.13.html")
   → returns the detailed content
3. Summarize and answer the user
12.2 Multi-Search + Multi-Fetch Chain
User: compare the performance of several frameworks
1. SearchWeb("Python web framework performance comparison", limit=10)
   → returns several comparison articles
2. FetchURL on each of the returned URLs
   → retrieves the detailed content
3. Analyze and summarize
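The chains above can be sketched end to end with stubbed tools; `search_web` and `fetch_url` below are stand-ins that do no network I/O:

```python
import asyncio

async def search_web(query: str, limit: int = 5) -> list[str]:
    # Stand-in for SearchWeb: returns a list of result URLs.
    return [f"https://example.com/{query.replace(' ', '-')}/{i}" for i in range(limit)]

async def fetch_url(url: str) -> str:
    # Stand-in for FetchURL: returns extracted page text.
    return f"extracted text of {url}"

async def answer(question: str) -> str:
    urls = await search_web(question, limit=3)
    # Fetch the top results concurrently, then summarize.
    pages = await asyncio.gather(*(fetch_url(u) for u in urls))
    return f"summary built from {len(pages)} pages"

print(asyncio.run(answer("python web framework performance")))  # summary built from 3 pages
```

In the real system the agent model drives each step, deciding which URLs from the search output are worth fetching before it summarizes.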