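AI-Enhanced Features
1. AI Query Optimizer
1.1 Traditional Query Optimizer vs. AI Optimizer
MySQL 9.x introduces a machine-learning-based query optimizer that analyzes historical query patterns and statistics to select the best execution plan automatically.
-- The traditional optimizer relies on a cost model
-- The AI optimizer relies on historical data and a predictive model
-- Enable the AI optimizer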
SET GLOBAL optimizer_switch = 'ai_optimization=on';
-- View the current optimizer settings
SHOW VARIABLES LIKE 'optimizer_switch';
-- Enable AI optimization for a specific query
SELECT /*+ AI_OPTIMIZE */ *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2024-01-01'
ORDER BY o.total_amount DESC
LIMIT 100;
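1.2 How the AI Optimizer Works
-- AI optimizer workflow:
-- 1. Collect historical query execution data
-- 2. Build a query feature model
-- 3. Predict the cost of each candidate execution plan
-- 4. Choose the plan with the lowest predicted cost
-- View the AI optimizer's analysis results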
SELECT * FROM sys.ai_query_analysis
WHERE query_time > NOW() - INTERVAL 1 DAY
ORDER BY execution_time DESC;
-- Compare the optimizer's estimates with actual execution (EXPLAIN ANALYZE runs the query)
EXPLAIN ANALYZE SELECT * FROM users WHERE age > 25;
-- AI optimizer statistics
SELECT * FROM sys.ai_optimizer_stats;
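Step 1 of the workflow above, collecting historical query execution data, can be approximated without any AI feature, because the documented Performance Schema digest table already aggregates per-statement statistics. This is a minimal sketch, assuming the Performance Schema is enabled and statement digests are being collected; the 20-row limit is an arbitrary choice.

```sql
-- Historical per-statement statistics, heaviest total latency first.
-- Timer columns are reported in picoseconds, so divide by 1e12 for seconds.
SELECT
    SCHEMA_NAME,
    DIGEST_TEXT,
    COUNT_STAR                      AS exec_count,
    ROUND(SUM_TIMER_WAIT / 1e12, 3) AS total_latency_s,
    ROUND(AVG_TIMER_WAIT / 1e12, 6) AS avg_latency_s,
    SUM_ROWS_EXAMINED,
    SUM_ROWS_SENT
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;
```

Statements with a high `SUM_ROWS_EXAMINED` relative to `SUM_ROWS_SENT` are the usual candidates for a closer look at the execution plan.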
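2. Machine Learning Integration
2.1 Built-in ML Models
MySQL 9.x supports creating, training, and using machine learning models directly in the database, giving an end-to-end ML workflow. Data does not need to be exported to an external ML platform: everything from data preparation to model deployment happens inside the database.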
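| Task type | Supported algorithms | Typical use cases |
|---|---|---|
| Classification | Logistic regression, random forest, gradient boosting, XGBoost | Churn prediction, fraud detection, spam filtering |
| Regression | Linear regression, ridge regression, random forest regression | Sales forecasting, price prediction, rating prediction |
| Clustering | K-Means, DBSCAN, Gaussian mixture models | Customer segmentation, anomaly detection, document grouping |
| Recommendation | Collaborative filtering, matrix factorization | Product recommendation, content recommendation, user similarity |
| Time series | ARIMA, Prophet, LSTM | Demand forecasting, trend analysis, anomaly detection |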
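flowchart LR
subgraph prep["Data Preparation"]
A[Raw data] --> B[Data cleaning]
B --> C[Feature engineering]
C --> D[Training data]
end
subgraph train["Model Training"]
D --> E[Choose algorithm]
E --> F[Hyperparameter tuning]
F --> G[Cross-validation]
G --> H[Model training]
end
subgraph eval["Model Evaluation"]
H --> I[Evaluate model]
I --> J{Evaluation passed?}
end
subgraph deploy["Deployment and Prediction"]
J -->|Yes| K[Deploy]
K --> L[Real-time prediction]
L --> M[Business application]
end
J -->|No| E
style A fill:#e3f2fd,stroke:#1565c0
style H fill:#c8e6c9,stroke:#2e7d32
style K fill:#fff3e0,stroke:#e65100
style M fill:#e1f5fe,stroke:#01579b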
-- List the supported ML algorithms
SELECT * FROM sys.ml_available_algorithms;
-- Create the training data table
CREATE TABLE customer_features (
customer_id INT PRIMARY KEY,
age INT,
income DECIMAL(10,2),
purchase_frequency INT,
avg_order_value DECIMAL(10,2),
days_since_last_purchase INT,
total_orders INT,
avg_review_score DECIMAL(3,2),
churn_label TINYINT
);
-- Feature engineering: add a derived feature
ALTER TABLE customer_features
ADD COLUMN customer_lifetime_value DECIMAL(10,2)
GENERATED ALWAYS AS (total_orders * avg_order_value);
-- Create a machine learning model: gradient boosting classifier
CREATE ML MODEL customer_churn_model
FROM customer_features
FEATURES (age, income, purchase_frequency, avg_order_value, days_since_last_purchase, total_orders, avg_review_score, customer_lifetime_value)
TARGET (churn_label)
ALGORITHM 'gradient_boosting'
OPTIONS (
n_estimators = 100,
max_depth = 5,
learning_rate = 0.1,
min_samples_split = 20,
test_size = 0.2
);
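2.2 Model Prediction
A trained model can be used for prediction right away, in either real-time or batch mode.
flowchart TD
subgraph realtime["Real-Time Prediction Flow"]
A[Front-end request] --> B[API gateway]
B --> C[Extract features]
C --> D[ML model inference]
D --> E[Return prediction]
E --> F[Business processing]
end
subgraph batch["Batch Prediction Flow"]
G[Scheduled job] --> H[Batch data]
H --> I[Process in chunks]
I --> J[Batch inference]
J --> K[Store results]
K --> L[Generate reports]
end
style D fill:#c8e6c9,stroke:#2e7d32
style J fill:#fff3e0,stroke:#e65100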
-- Real-time prediction: churn probability for a single customer
-- (feature values are passed as literals, so no table columns are selected)
SELECT
    PREDICT(customer_churn_model USING
        age = 35,
        income = 50000,
        purchase_frequency = 12,
        avg_order_value = 150.00,
        days_since_last_purchase = 15,
        total_orders = 50,
        avg_review_score = 4.5
    ) AS churn_probability;
-- Batch prediction: score every customer that has no label yet
-- (PREDICT is written once in the derived table, then bucketed into risk levels)
SELECT
    customer_id,
    churn_probability,
    CASE
        WHEN churn_probability > 0.7 THEN 'high'
        WHEN churn_probability > 0.3 THEN 'medium'
        ELSE 'low'
    END AS risk_level
FROM (
    SELECT
        customer_id,
        PREDICT(customer_churn_model USING
            age, income, purchase_frequency, avg_order_value,
            days_since_last_purchase, total_orders, avg_review_score
        ) AS churn_probability
    FROM customer_features
    WHERE churn_label IS NULL
) AS scored;
-- Create a table for prediction results
CREATE TABLE churn_predictions (
customer_id INT PRIMARY KEY,
predicted_churn_probability DECIMAL(5,4),
risk_level VARCHAR(20),
predicted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (customer_id) REFERENCES customer_features(customer_id)
);
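-- Run batch prediction and store the results
-- (a column alias cannot be referenced in the same SELECT list, so the
--  probability is computed in a derived table first)
INSERT INTO churn_predictions (customer_id, predicted_churn_probability, risk_level)
SELECT
    customer_id,
    prob,
    CASE
        WHEN prob > 0.7 THEN 'high'
        WHEN prob > 0.3 THEN 'medium'
        ELSE 'low'
    END
FROM (
    SELECT
        customer_id,
        PREDICT(customer_churn_model USING
            age, income, purchase_frequency, avg_order_value,
            days_since_last_purchase, total_orders, avg_review_score
        ) AS prob
    FROM customer_features
    WHERE churn_label IS NULL
) AS scored;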
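2.3 Model Evaluation and Tuning
MySQL exposes a range of evaluation metrics that show how well a model performs and where to tune it.
| Metric | Description | When to use |
|---|---|---|
| Accuracy | Share of all predictions that are correct | Classification with balanced classes |
| Precision | Share of predicted positives that are actually positive | When false positives are costly |
| Recall | Share of actual positives that are predicted correctly | When false negatives are costly |
| F1-Score | Harmonic mean of precision and recall | Classification with imbalanced classes |
| AUC-ROC | Area under the ROC curve; measures how well the classifier separates classes | General-purpose metric for binary classification |
| MSE/RMSE | Mean squared error / root mean squared error | Regression |
| MAE | Mean absolute error | Regression; less sensitive to outliers than MSE |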
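The classification metrics in the table can be derived from the four confusion-matrix counts with ordinary SQL. The following is a minimal sketch, assuming a hypothetical `churn_eval` table that holds the true label and a 0/1 prediction per customer; the table name, columns, and sample rows are illustrative and not part of the original tutorial.

```sql
-- Hypothetical evaluation data: one row per customer.
CREATE TEMPORARY TABLE churn_eval (
    customer_id INT PRIMARY KEY,
    actual      TINYINT NOT NULL,   -- true label (1 = churned)
    predicted   TINYINT NOT NULL    -- model output thresholded to 0/1
);

INSERT INTO churn_eval VALUES
    (1, 1, 1), (2, 1, 0), (3, 0, 0), (4, 0, 1),
    (5, 1, 1), (6, 0, 0), (7, 0, 0), (8, 1, 1);

-- Count true/false positives and negatives, then apply the definitions.
WITH counts AS (
    SELECT
        SUM(actual = 1 AND predicted = 1) AS tp,
        SUM(actual = 0 AND predicted = 1) AS fp,
        SUM(actual = 1 AND predicted = 0) AS fn,
        SUM(actual = 0 AND predicted = 0) AS tn
    FROM churn_eval
)
SELECT
    (tp + tn) / (tp + fp + fn + tn)      AS accuracy,
    tp / NULLIF(tp + fp, 0)              AS precision_score,
    tp / NULLIF(tp + fn, 0)              AS recall,
    2 * tp / NULLIF(2 * tp + fp + fn, 0) AS f1_score
FROM counts;
```

With these eight rows there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so all four metrics evaluate to 0.75. `NULLIF` keeps the query from dividing by zero when a class is never predicted.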
-- View the model's evaluation metrics
SELECT * FROM sys.ml_model_evaluation
WHERE model_name = 'customer_churn_model';
-- Example evaluation output:
-- +------------------+-------------+-------------+------------+
-- | metric | value | train_score | test_score |
-- +------------------+-------------+-------------+------------+
-- | accuracy | 0.9234 | 0.9567 | 0.9234 |
-- | precision | 0.8912 | 0.9456 | 0.8912 |
-- | recall | 0.8678 | 0.9234 | 0.8678 |
-- | f1_score | 0.8793 | 0.9344 | 0.8793 |
-- | auc_roc | 0.9456 | 0.9876 | 0.9456 |
-- +------------------+-------------+-------------+------------+
-- View the confusion matrix
SELECT * FROM sys.ml_confusion_matrix
WHERE model_name = 'customer_churn_model';
-- View feature importance
SELECT feature_name, importance_score
FROM sys.ml_feature_importance
WHERE model_name = 'customer_churn_model'
ORDER BY importance_score DESC;
-- Example feature importance output:
-- +-----------------------------+------------------+
-- | feature_name | importance_score |
-- +-----------------------------+------------------+
-- | days_since_last_purchase | 0.3245 |
-- | purchase_frequency | 0.2876 |
-- | avg_order_value | 0.1987 |
-- | total_orders | 0.1234 |
-- | age | 0.0456 |
-- | income | 0.0202 |
-- +-----------------------------+------------------+
-- Hyperparameter tuning
CREATE ML MODEL customer_churn_tuned
FROM customer_features
FEATURES (age, income, purchase_frequency, avg_order_value, days_since_last_purchase, total_orders)
TARGET (churn_label)
ALGORITHM 'gradient_boosting'
OPTIONS (
n_estimators = 200,
max_depth = 7,
learning_rate = 0.05,
min_samples_split = 10,
min_samples_leaf = 5,
subsample = 0.8,
colsample_bytree = 0.8,
grid_search = TRUE, -- enable grid search tuning
cv_folds = 5 -- 5-fold cross-validation
);
2.4 Practical Use Cases
-- Case 1: churn early-warning system for an e-commerce site
-- Create the user behavior feature table
CREATE TABLE user_behavior_features (
user_id INT PRIMARY KEY,
registration_date DATE,
last_login_date DATE,
total_login_days INT,
avg_session_duration DECIMAL(10,2),
browse_count INT,
cart_add_count INT,
purchase_count INT,
wishlist_count INT,
review_count INT,
avg_rating_given DECIMAL(3,2),
coupon_usage_count INT,
referral_count INT,
support_ticket_count INT,
is_churned TINYINT -- target variable: 1 = churned
);
-- Train the churn prediction model
CREATE ML MODEL churn_prediction_model
FROM user_behavior_features
FEATURES (
DATEDIFF(NOW(), registration_date),
DATEDIFF(NOW(), last_login_date),
total_login_days,
avg_session_duration,
browse_count,
cart_add_count,
purchase_count,
wishlist_count,
review_count,
avg_rating_given,
coupon_usage_count,
referral_count,
support_ticket_count
)
TARGET (is_churned)
ALGORITHM 'random_forest'
OPTIONS (n_estimators = 150, max_depth = 10);
-- Predict high-risk users in real time
SELECT
user_id,
PREDICT(churn_prediction_model USING
DATEDIFF(NOW(), registration_date) = 90,
DATEDIFF(NOW(), last_login_date) = 15,
total_login_days = 25,
avg_session_duration = 1800,
browse_count = 50,
cart_add_count = 10,
purchase_count = 3,
wishlist_count = 8,
review_count = 2,
avg_rating_given = 4.5,
coupon_usage_count = 5,
referral_count = 1,
support_ticket_count = 0
) as churn_risk
FROM user_behavior_features
WHERE is_churned IS NULL
ORDER BY churn_risk DESC
LIMIT 100;
-- Case 2: product sales forecasting
CREATE TABLE sales_data (
product_id INT,
date DATE,
unit_price DECIMAL(10,2),
discount_rate DECIMAL(5,4),
advertising_budget DECIMAL(10,2),
competitor_price DECIMAL(10,2),
holiday_flag TINYINT,
season VARCHAR(20),
quantity_sold INT
);
-- Train the sales forecasting model
CREATE ML MODEL sales_forecast_model
FROM sales_data
FEATURES (unit_price, discount_rate, advertising_budget, competitor_price, holiday_flag)
TARGET (quantity_sold)
ALGORITHM 'linear_regression';
-- Forecast future sales
SELECT
product_id,
PREDICT(sales_forecast_model USING
unit_price = 99.00,
discount_rate = 0.15,
advertising_budget = 5000,
competitor_price = 109.00,
holiday_flag = 1
) as predicted_sales
FROM sales_data
GROUP BY product_id
ORDER BY predicted_sales DESC;
2.5 Model Management
-- List all models
SELECT * FROM sys.ml_models;
-- Example model list output:
-- +---------------------------+----------------+----------------+----------------+
-- | model_name | algorithm | created_at | status |
-- +---------------------------+----------------+----------------+----------------+
-- | customer_churn_model | gradient_boost | 2024-01-15 | ready |
-- | sales_forecast_model | linear_reg | 2024-01-20 | ready |
-- | user_clustering_model | kmeans | 2024-01-25 | training |
-- +---------------------------+----------------+----------------+----------------+
-- Show model details
DESCRIBE ML MODEL customer_churn_model;
-- Model versioning: retrain the model to produce a new version
ALTER ML MODEL customer_churn_model
RETRAIN FROM customer_features
FEATURES (age, income, purchase_frequency, avg_order_value, days_since_last_purchase)
TARGET (churn_label);
-- Roll back to a previous version
ALTER ML MODEL customer_churn_model
VERSION TO 'v1';
-- Export the model for deployment
EXPORT ML MODEL customer_churn_model
TO '/backup/models/'
FORMAT 'onnx';
-- Import an externally trained model
IMPORT ML MODEL customer_churn_model_v2
FROM '/backup/models/external_model.onnx';
-- Drop the model
DROP ML MODEL customer_churn_model;
-- Manage model privileges
GRANT ML MODEL 'customer_churn_model' TO 'data_scientist'@'%';
REVOKE ML MODEL 'customer_churn_model' FROM 'readonly_user'@'%';
2.6 Common Issues and Best Practices
-- Best practice 1: ensure data quality
-- Check for missing values
SELECT
COUNT(*) as total_rows,
SUM(CASE WHEN age IS NULL THEN 1 ELSE 0 END) as missing_age,
SUM(CASE WHEN income IS NULL THEN 1 ELSE 0 END) as missing_income
FROM customer_features;
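-- Handle missing values: fill with the column mean
-- (MySQL has no built-in MEDIAN() function, so the mean is used for income as well)
CREATE VIEW customer_features_clean AS
SELECT
customer_id,
COALESCE(age, AVG(age) OVER()) AS age,
COALESCE(income, AVG(income) OVER()) AS income
FROM customer_features;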
-- Best practice 2: standardize features
-- Standardize numeric features (z-score)
CREATE VIEW customer_features_scaled AS
SELECT
customer_id,
(age - AVG(age) OVER()) / STDDEV(age) OVER() as age_scaled,
(income - AVG(income) OVER()) / STDDEV(income) OVER() as income_scaled
FROM customer_features;
-- Best practice 3: prevent overfitting
-- Use a validation split and early stopping
CREATE ML MODEL customer_churn_balanced
FROM customer_features
FEATURES (age, income, purchase_frequency, avg_order_value)
TARGET (churn_label)
ALGORITHM 'gradient_boosting'
OPTIONS (
n_estimators = 100,
early_stopping_rounds = 10, -- early stopping
validation_fraction = 0.1, -- share of data held out for validation
n_iter_no_change = 5 -- stop after 5 consecutive rounds without improvement
);
-- Best practice 4: monitor the model
CREATE TABLE model_performance_log (
model_name VARCHAR(100),
evaluation_date DATE,
accuracy DECIMAL(5,4),
precision_score DECIMAL(5,4),
recall DECIMAL(5,4),
f1_score DECIMAL(5,4),
sample_size INT
);
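-- Log model performance periodically
-- (PRECISION is a reserved word in MySQL, so the source column is quoted with backticks)
INSERT INTO model_performance_log
(model_name, evaluation_date, accuracy, precision_score, recall, f1_score, sample_size)
SELECT
'customer_churn_model',
CURDATE(),
accuracy,
`precision`,
recall,
f1_score,
sample_size
FROM sys.ml_model_evaluation
WHERE model_name = 'customer_churn_model';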
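3. Vector Search and Semantic Queries
3.1 Vector Indexes
MySQL 9.x natively supports a vector data type and vector indexes, which can be used for semantic search, similarity matching, and other AI applications.
-- Create a table with a vector column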
CREATE TABLE documents (
id INT PRIMARY KEY,
title VARCHAR(255),
content TEXT,
embedding VECTOR(768), -- 768-dimensional vector (fits BERT-style models)
INDEX idx_embedding (embedding) USING HNSW
);
-- Insert a row with a vector
INSERT INTO documents (id, title, content, embedding)
VALUES (
1,
'MySQL 教程',
'MySQL 是一种关系型数据库管理系统...',
'[0.12, -0.34, 0.56, ...]' -- in practice, produced by an embedding model
);
-- Create a vector index (HNSW algorithm)
ALTER TABLE documents
ADD INDEX idx_vector (embedding) USING HNSW
OPTIONS (m = 16, ef_construction = 200);
3.2 Vector Similarity Search
-- Vector distance functions:
-- VECTOR_DISTANCE(v1, v2, 'cosine') - cosine distance
-- VECTOR_DISTANCE(v1, v2, 'euclidean') - Euclidean distance
-- VECTOR_DISTANCE(v1, v2, 'dot_product') - dot product
-- Find the most similar documents
SELECT id, title,
VECTOR_DISTANCE(embedding, '[0.12, -0.34, 0.56, ...]', 'cosine') as distance
FROM documents
ORDER BY distance
LIMIT 5;
-- Use the vector index to speed up the search (approximate nearest neighbors)
SELECT id, title,
VECTOR_DISTANCE(embedding, '[query_vector]', 'cosine') as distance
FROM documents
ORDER BY distance
LIMIT 10;
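-- Filter condition + vector search
-- (assumes documents has a category column, which the table definition above does not include)
SELECT id, title
FROM documents
WHERE category = 'database'
ORDER BY VECTOR_DISTANCE(embedding, '[query_vector]', 'cosine')
LIMIT 5;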
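To make the cosine metric concrete, the distance can be computed by hand in plain SQL, independent of any vector function. This is a sketch under the assumption that the two vectors are given as small JSON arrays; it uses only `JSON_TABLE` (available since MySQL 8.0) and is meant for illustration, not for searching large tables.

```sql
-- Cosine distance = 1 - (a . b) / (|a| * |b|).
-- JSON_TABLE expands each array into (position, value) rows; joining on
-- the position pairs up the matching components of the two vectors.
SELECT
    1 - SUM(a.v * b.v)
        / (SQRT(SUM(a.v * a.v)) * SQRT(SUM(b.v * b.v))) AS cosine_distance
FROM JSON_TABLE('[1, 0, 1]', '$[*]'
         COLUMNS (i FOR ORDINALITY, v DOUBLE PATH '$')) AS a
JOIN JSON_TABLE('[1, 1, 0]', '$[*]'
         COLUMNS (i FOR ORDINALITY, v DOUBLE PATH '$')) AS b
  ON a.i = b.i;
```

For these two vectors the dot product is 1 and both norms are √2, so the distance is about 0.5. Identical directions give a distance near 0, and orthogonal vectors give 1.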
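4. Natural Language Queries
4.1 NL2SQL Conversion
Natural language processing is used to convert a natural-language question into an optimized SQL statement.
-- Query in natural language
CALL sys.natural_language_query(
'Find customers whose sales in 2024 exceeded 10,000 yuan'
);
-- Example of the AI-generated SQL: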
-- SELECT c.name, SUM(o.total_amount) as total_sales
-- FROM customers c
-- JOIN orders o ON c.id = o.customer_id
-- WHERE YEAR(o.order_date) = 2024
-- GROUP BY c.id, c.name
-- HAVING SUM(o.total_amount) > 10000
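-- Ask a more complex question in natural language
SELECT NL('Which products show a declining monthly sales trend?');
-- View the natural-language query history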
SELECT * FROM sys.nl_query_history
ORDER BY query_time DESC
LIMIT 20;
4.2 Query Explanation and Optimization Suggestions
-- Have the AI explain the intent of a query
SELECT sys.explain_query(
'SELECT * FROM orders WHERE status = "pending"'
) as explanation;
-- Get performance optimization suggestions
SELECT sys.get_optimization_suggestions(
'SELECT o.*, c.name FROM orders o JOIN customers c ON o.customer_id = c.id'
);
-- Suggested SQL rewrites
SELECT * FROM sys.query_rewrites
WHERE original_query LIKE '%orders%';
5. Intelligent Anomaly Detection
5.1 Data Anomaly Detection
-- Configure anomaly detection
CREATE TABLE anomaly_detection_config (
table_name VARCHAR(64),
column_name VARCHAR(64),
detection_method VARCHAR(20), -- 'statistical', 'ml', 'isolation_forest'
sensitivity DECIMAL(3,2) -- 0.0-1.0
);
-- Enable anomaly detection for the users table
INSERT INTO anomaly_detection_config
VALUES ('users', 'age', 'isolation_forest', 0.8);
-- Trigger anomaly detection manually
CALL sys.detect_data_anomalies('users');
-- View detected anomalies
SELECT * FROM sys.detected_anomalies
WHERE table_name = 'users'
AND detected_at > NOW() - INTERVAL 1 DAY;
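The 'statistical' detection method listed in the configuration can be illustrated with a plain z-score check built on window functions. This is a minimal sketch, assuming a hypothetical `users_sample` table and a cutoff of 2 standard deviations; both are illustrative choices rather than anything the feature above prescribes.

```sql
-- Hypothetical sample data: five plausible ages and one obvious entry error.
CREATE TEMPORARY TABLE users_sample (
    id  INT PRIMARY KEY,
    age INT
);

INSERT INTO users_sample VALUES
    (1, 25), (2, 31), (3, 28), (4, 35), (5, 29), (6, 240);

-- Score each row by its distance from the mean in standard deviations,
-- then keep the rows beyond the cutoff. The window functions are computed
-- in a derived table because their result cannot be filtered in WHERE directly.
SELECT id, age, ROUND(z_score, 2) AS z_score
FROM (
    SELECT
        id,
        age,
        (age - AVG(age) OVER ()) / NULLIF(STDDEV_POP(age) OVER (), 0) AS z_score
    FROM users_sample
) AS scored
WHERE ABS(z_score) > 2;
```

Only the row with age 240 exceeds the cutoff here. On small samples a single extreme value inflates the standard deviation, so a robust alternative such as a percentile-based rule may be a better fit for real data.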
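5.2 Performance Anomaly Detection
-- Enable performance anomaly detection
SET GLOBAL performance_anomaly_detection = 'ON';
-- Schedule an hourly anomaly check
-- (a compound BEGIN ... END body needs a custom delimiter in the mysql client)
DELIMITER //
CREATE EVENT hourly_performance_check
ON SCHEDULE EVERY 1 HOUR
DO
BEGIN
CALL sys.detect_performance_anomalies();
CALL sys.detect_slow_queries();
END //
DELIMITER ;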
-- View performance anomalies
SELECT
timestamp,
metric_name,
expected_value,
actual_value,
anomaly_score
FROM sys.performance_anomalies
WHERE timestamp > NOW() - INTERVAL 24 HOUR
ORDER BY anomaly_score DESC;
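-- Raise an alert automatically for severe anomalies
-- (a compound BEGIN ... END body needs a custom delimiter in the mysql client)
DELIMITER //
CREATE TRIGGER performance_anomaly_alert
AFTER INSERT ON sys.performance_anomalies
FOR EACH ROW
BEGIN
IF NEW.anomaly_score > 0.9 THEN
INSERT INTO alert_log VALUES(NEW.timestamp, NEW.metric_name);
END IF;
END //
DELIMITER ;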
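6. Intelligent Index Recommendations
6.1 Automatic Index Analysis
-- Enable collection of query analysis data
SET GLOBAL optimizer_trace = 'enabled=on';
-- performance_schema is read-only at runtime and cannot be changed with SET GLOBAL;
-- enable it at startup instead (performance_schema=ON in the server configuration)
-- After the workload has run for a while, analyze the query patterns
CALL sys.analyze_workload('your_database');
-- View index recommendations
SELECT
table_name,
recommended_index,
columns,
potential_improvement,
estimated_size
FROM sys.index_recommendations
WHERE table_schema = 'your_database'
ORDER BY potential_improvement DESC;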
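Two documented `sys` schema views give a rough, manual version of this kind of workload analysis: one lists statements that ran without a usable index, the other lists indexes that have not been used since the server started. This is a sketch assuming the Performance Schema is enabled; `your_database` is the same placeholder used above, and the 10-row limit is arbitrary.

```sql
-- Statements that performed full table scans, most frequent offenders first.
-- These are candidates for a new index.
SELECT query, db, exec_count, no_index_used_count, rows_examined_avg
FROM sys.statements_with_full_table_scans
WHERE db = 'your_database'
ORDER BY no_index_used_count DESC
LIMIT 10;

-- Indexes with no recorded use since server startup.
-- These are candidates for removal after the server has seen a full workload cycle.
SELECT object_schema, object_name, index_name
FROM sys.schema_unused_indexes
WHERE object_schema = 'your_database';
```

Both views reflect activity since the last restart only, so results from a freshly started server should not be used to drop indexes.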
6.2 Applying Indexes Automatically
-- Preview the index creation statement
SELECT create_index_statement
FROM sys.index_recommendations
WHERE table_name = 'orders'
AND recommended_index = 'idx_customer_date';
-- Apply a recommended index manually
CALL sys.apply_index_recommendation(
'orders',
'idx_customer_date'
);
-- Apply safe indexes automatically (can be rolled back quickly)
CALL sys.apply_recommended_indexes(
'your_database',
'auto',
1000 -- threshold for the maximum number of affected rows
);
-- Roll back an index change
CALL sys.rollback_index_change('idx_customer_date', 'orders');
7. AI-Assisted Operations
7.1 Intelligent Monitoring Dashboard
-- View the AI health check report
SELECT * FROM sys.ai_health_check\G
-- AI-generated optimization recommendations
SELECT
category,
issue,
recommendation,
priority,
estimated_impact
FROM sys.ai_recommendations
WHERE status = 'pending'
ORDER BY priority DESC;
-- Automatic performance tuning
CALL sys.auto_tune('your_database');
-- View tuning history
SELECT * FROM sys.auto_tune_history
WHERE tuned_at > NOW() - INTERVAL 7 DAY;
7.2 Capacity Forecasting
-- Storage capacity forecast
SELECT
table_name,
current_size_mb,
predicted_size_30d,
predicted_size_90d,
days_until_full
FROM sys.capacity_predictions
WHERE table_schema = 'your_database';
-- Performance trend forecast
SELECT
metric_name,
current_value,
predicted_value_7d,
trend
FROM sys.performance_predictions;
-- Scaling recommendations based on the forecasts
SELECT * FROM sys.scaling_recommendations
WHERE recommendation_type = 'storage';
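The current-size input to a capacity forecast can be cross-checked against the standard `information_schema.TABLES` view. This is a minimal sketch; `your_database` is the same placeholder used above, and for InnoDB tables `TABLE_ROWS` is only an estimate.

```sql
-- Current on-disk footprint per table (data plus indexes), largest first.
SELECT
    TABLE_NAME,
    ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024, 2) AS current_size_mb,
    TABLE_ROWS AS approx_rows
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database'
  AND TABLE_TYPE = 'BASE TABLE'
ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;
```

Sampling this query on a schedule and storing the results gives the historical series that any growth projection, automated or manual, has to start from.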
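8. Vector Database Support
8.1 What Is a Vector Database?
A vector database is a database specialized for storing and retrieving high-dimensional vector data, and it is widely used in AI and machine learning. MySQL introduced a native VECTOR data type in version 9.0, which makes it usable for building semantic search, recommendation systems, and other AI applications.
🔢 Vector embeddings
Text, images, audio, and other data can be converted into high-dimensional vector representations
🔍 Similarity search
Supports several similarity measures, including cosine similarity and Euclidean distance
🤖 AI applications
Supports memory storage for large language models, semantic search, and RAG applications
⚡ High performance
Purpose-built vector indexes support fast retrieval over millions of vectors
8.2 Creating a Vector Column
-- Create a table with a vector column (the VECTOR type requires MySQL 9.0+)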
CREATE TABLE document_embeddings (
id INT AUTO_INCREMENT PRIMARY KEY,
document_id INT NOT NULL,
content TEXT,
embedding VECTOR(1536), -- 1536-dimensional vector (e.g. an OpenAI embedding)
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_embedding (embedding)
);
-- Insert a vector written as a JSON-style array string
INSERT INTO document_embeddings (document_id, content, embedding)
VALUES (
1,
'MySQL 是一个开源关系型数据库',
'[0.12, -0.34, 0.56, ...]' -- a full 1536-dimensional vector in practice
);
8.3 Vector Similarity Search
-- Cosine distance search
SELECT
id,
document_id,
content,
COSINE_DISTANCE(embedding, '[0.1, -0.2, 0.3, ...]') AS distance
FROM document_embeddings
ORDER BY distance ASC
LIMIT 5;
-- Euclidean distance search
SELECT
id,
document_id,
content,
L2_DISTANCE(embedding, '[0.1, -0.2, 0.3, ...]') AS distance
FROM document_embeddings
ORDER BY distance ASC
LIMIT 5;
-- Inner product similarity search
SELECT
id,
document_id,
content,
INNER_PRODUCT(embedding, '[0.1, -0.2, 0.3, ...]') AS similarity
FROM document_embeddings
ORDER BY similarity DESC
LIMIT 5;
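8.4 Accelerating Retrieval with an HNSW Index
-- Create an HNSW vector index
ALTER TABLE document_embeddings
ADD INDEX idx_hnsw (embedding)
USING HNSW (
16, -- number of neighbors per node (m)
200 -- candidate list size during index construction (ef_construction)
);
-- Approximate nearest-neighbor search using the index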
SELECT
id,
document_id,
content,
COSINE_DISTANCE(embedding, '[0.1, -0.2, 0.3, ...]') AS distance
FROM document_embeddings
ORDER BY distance ASC
LIMIT 5;
8.5 RAG (Retrieval-Augmented Generation) Example
-- 1. Create the knowledge base table
CREATE TABLE knowledge_base (
id INT AUTO_INCREMENT PRIMARY KEY,
title VARCHAR(500),
content TEXT,
embedding VECTOR(1536),
INDEX idx_embedding (embedding) USING HNSW
);
-- 2. Semantic search routine
-- (MySQL stored functions cannot return a table, so this is written as a
--  stored procedure that returns a result set)
DELIMITER //
CREATE PROCEDURE semantic_search(
IN query_vector VECTOR(1536),
IN top_k INT
)
BEGIN
SELECT
id,
title,
content,
1 - COSINE_DISTANCE(embedding, query_vector) AS similarity
FROM knowledge_base
ORDER BY similarity DESC
LIMIT top_k;
END //
DELIMITER ;
-- 3. Run a semantic search
CALL semantic_search('[query_vector_here]', 3);
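8.6 Vector Database Comparison
| Feature | MySQL Vector | Pinecone | Milvus | Weaviate |
|---|---|---|---|---|
| Type | Relational + vector | Dedicated vector | Dedicated vector | Dedicated vector |
| Deployment | Self-hosted / cloud | Cloud service | Self-hosted / cloud | Self-hosted / cloud |
| Vector dimensions | ≤16383 | Unlimited | Unlimited | Unlimited |
| Index types | HNSW, IVF | HNSW | HNSW, IVF, PQ | HNSW, BF |
| Best suited for | Small to medium vector workloads; SQL capabilities required | Large-scale cloud deployments; rapid prototyping | Large-scale production environments; fine-grained control | Semantic search; knowledge graphs |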