情感分析之使用PaddleGPU文章正负面分析完整版
[toc]
# 相关文档
官方说明文档:点击跳转 (opens new window)
# 使用Paddle
登录网站:https://aistudio.baidu.com/index/creations (opens new window)
点击我的创作
创建notebook
项目
选择配置,并创建
启动环境
选择GPU启动
点击进入环境
进入终端
输入命令查看安装组件版本
pip list |grep paddle
paddle2onnx 1.1.0
paddlefsl 1.1.0
paddlehub 2.4.0
paddlenlp 2.6.1.post0
paddlepaddle-gpu 2.5.2
2
3
4
5
6
# 下载源码
cd /home/aistudio/work && git clone https://gitee.com/paddlepaddle/PaddleNLP.git
# 标注数据
参考:https://www.hadoop.wiki/pages/598a4a (opens new window)
https://datamining.blog.csdn.net/article/details/134964167 (opens new window)
标注数据分类如下
# 上传标注数据
可以直接下载我标注好的数据,大概9k条左右文章和评论数据
链接: https://pan.baidu.com/s/1nZLbOXVNrY6HoaWVVm9eaA?pwd=5fei 提取码: 5fei
将标注数据上传到/home/aistudio/work/PaddleNLP/applications/sentiment_analysis/unified_sentiment_extraction/data
目录下
cd /home/aistudio/work/PaddleNLP/applications/sentiment_analysis/unified_sentiment_extraction
mkdir data
aistudio@jupyter-3127826-7249891:~/work/PaddleNLP/applications/sentiment_analysis/unified_sentiment_extraction/data$ ls -lh
total 12M
-rw-r--r-- 1 aistudio aistudio 12M Dec 12 17:40 project-31-at-2023-12-12-08-07-661bbaf0.json
2
3
# 训练模型
# 构建样本数据
# 修改源码
构建之前需要先修改下处理逻辑不然会报错
[2023-12-12 16:34:26,931] [ INFO] - [Train] Start to convert annotation data.
Traceback (most recent call last):
File "label_studio.py", line 738, in <module>
do_convert()
File "label_studio.py", line 698, in do_convert
train_examples = convertor.convert_cls_examples(raw_examples[:p1], data_flag="Train")
File "label_studio.py", line 103, in convert_cls_examples
items = self.process_text_tag(line, task_type="cls")
File "label_studio.py", line 93, in process_text_tag
items["label"] = line["annotations"][0]["result"][0]["value"]["choices"]
IndexError: list index out of range
2
3
4
5
6
7
8
9
10
11
原因:可能是因为标注的数据有些是为空的,我们修改下label_studio.py
的代码,将异常数据直接捕获跳过
修改代码第103行左右的convert_cls_examples
方法
修改前
def convert_cls_examples(self, raw_examples, data_flag="Data"):
"""
Convert labeled data for classification task.
"""
examples = []
logger.info("{0:7} Start to convert annotation data.".format("[" + data_flag + "]"))
for line in raw_examples:
items = self.process_text_tag(line, task_type="cls")
text, labels = items["text"], items["label"]
example = self.generate_cls_example(text, labels, self.sentiment_prompt_prefix, self.options)
examples.append(example)
logger.info("{0:7} End to convert annotation data.\n".format(""))
return examples
2
3
4
5
6
7
8
9
10
11
12
13
14
修改后
def convert_cls_examples(self, raw_examples, data_flag="Data"):
"""
Convert labeled data for classification task.
"""
examples = []
logger.info("{0:7} Start to convert annotation data.".format("[" + data_flag + "]"))
for line in raw_examples:
try:
items = self.process_text_tag(line, task_type="cls")
text, labels = items["text"], items["label"]
example = self.generate_cls_example(text, labels, self.sentiment_prompt_prefix, self.options)
examples.append(example)
except Exception :
logger.info("数据解析错误")
logger.info("{0:7} End to convert annotation data.\n".format(""))
return examples
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Github上也有相关讨论:https://github.com/PaddlePaddle/PaddleNLP/issues/6292 (opens new window)
修改完后执行命令
python label_studio.py \
--label_studio_file ./data/project-31-at-2023-12-12-08-07-661bbaf0.json \
--task_type cls \
--save_dir ./data \
--splits 0.8 0.1 0.1 \
--options "正向" "负向" "中性" \
--negative_ratio 5 \
--is_shuffle True \
--seed 1000
2
3
4
5
6
7
8
9
构建成功后会提示
[2023-12-13 21:35:42,773] [ INFO] - End to convert annotation data.
[2023-12-13 21:35:42,819] [ INFO] - Save 5642 examples to ./data/train.json.
[2023-12-13 21:35:42,824] [ INFO] - Save 717 examples to ./data/dev.json.
[2023-12-13 21:35:42,830] [ INFO] - Save 686 examples to ./data/test.json.
[2023-12-13 21:35:42,830] [ INFO] - Finished! It takes 0.47 seconds
2
3
4
5
# 执行命令训练模型
python -u -m paddle.distributed.launch --gpus "0" finetune.py \
--train_path ./data/train.json \
--dev_path ./data/dev.json \
--save_dir ./checkpoint \
--learning_rate 1e-5 \
--batch_size 16 \
--max_seq_len 512 \
--num_epochs 10 \
--model uie-senta-base \
--seed 1000 \
--logging_steps 10 \
--valid_steps 100 \
--device gpu
2
3
4
5
6
7
8
9
10
11
12
13
参数说明:
可配置参数说明:
train_path
:必须,训练集文件路径。dev_path
:必须,验证集文件路径。save_dir
:模型 checkpoints 的保存目录,默认为"./checkpoint"。learning_rate
:训练最大学习率,UIE 推荐设置为 1e-5;默认值为1e-5。batch_size
:训练集训练过程批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为 16。max_seq_len
:模型支持处理的最大序列长度,默认为512。num_epochs
:模型训练的轮次,可以视任务情况进行调整,默认为10。model
:训练使用的预训练模型。可选择的有uie-senta-base
,uie-senta-medium
,uie-senta-mini
,uie-senta-micro
,uie-senta-nano
,默认为uie-senta-base
。logging_steps
: 训练过程中日志打印的间隔 steps 数,默认10。valid_steps
: 训练过程中模型评估的间隔 steps 数,默认100。seed
:全局随机种子,默认为 42。device
: 训练设备,可选择 'cpu'、'gpu' 其中的一种;默认为 GPU 训练。
模型训练完成后会在save_dir
配置的目录下生成保存的模型,model_best
就是最佳模型
aistudio@jupyter-3127826-7249891:~/work/PaddleNLP/applications/sentiment_analysis/unified_sentiment_extraction$ ls checkpoint/
model_100 model_1200 model_1500 model_1800 model_2000 model_2300 model_2600 model_2900 model_3100 model_3400 model_3700 model_400 model_600 model_900
model_1000 model_1300 model_1600 model_1900 model_2100 model_2400 model_2700 model_300 model_3200 model_3500 model_3800 model_4000 model_700 model_best
model_1100 model_1400 model_1700 model_200 model_2200 model_2500 model_2800 model_3000 model_3300 model_3600 model_3900 model_500 model_800
2
3
4
# 测试模型准确度
python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path ./data/test.json \
--batch_size 16 \
--max_seq_len 512 \
--debug
2
3
4
5
6
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:14<00:00, 3.48it/s]
[2023-12-13 10:42:10,473] [ INFO] - -----------------------------
[2023-12-13 10:42:10,473] [ INFO] - Class Name: 情感倾向[中性,正向,负向]
[2023-12-13 10:42:10,473] [ INFO] - Evaluation Precision: 0.76893 | Recall: 0.74746 | F1: 0.75804
2
3
4
# 应用训练好的模型
from paddlenlp import Taskflow
schema = ['情感倾向[正向,负向,中性]']
senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema, task_path="./checkpoint/model_best")
print(senta("这家点的房间很大,店家服务也很热情,就是房间隔音不好"))
2
3
4
# 服务部署
将./checkpoint/model_best
目录压缩打包
$ tar -czvf model_best.tar.gz model_best/
model_best/
model_best/config.json
model_best/special_tokens_map.json
model_best/model_config.json
model_best/static/
model_best/static/inference.pdiparams.info
model_best/static/inference.pdiparams
model_best/static/inference.pdmodel
model_best/tokenizer_config.json
model_best/model_state.pdparams
model_best/vocab.txt
model_best/.cache_info
2
3
4
5
6
7
8
9
10
11
12
13
下载到本地
放到需要部署的服务器上,然后解压
注意新服务器也要有Paddle的环境
# tar -zxvf model_best.tar.gz
model_best/
model_best/config.json
model_best/special_tokens_map.json
model_best/model_config.json
model_best/static/
model_best/static/inference.pdiparams.info
model_best/static/inference.pdiparams
model_best/static/inference.pdmodel
model_best/tokenizer_config.json
model_best/model_state.pdparams
model_best/vocab.txt
model_best/.cache_info
2
3
4
5
6
7
8
9
10
11
12
13
进入deploy
目录
cd PaddleNLP/applications/sentiment_analysis/unified_sentiment_extraction/deploy
(nlp) [root@centos7 deploy]# ls
client.py __pycache__ server.py
2
修改server.py
选择使用指定模型
from paddlenlp import SimpleServer, Taskflow
# The schema changed to your defined schema
schema = schema = ['情感倾向[正向,负向,中性]']
# define taskflow to perform sentiment analysis
senta = Taskflow("sentiment_analysis", schema=schema, model="uie-senta-base",task_path="/root/PaddleNLP/applications/sentiment_analysis/unified_sentiment_extraction/checkpoint/model_best")
# define your server
app = SimpleServer()
app.register_taskflow("taskflow/senta", senta)
2
3
4
5
6
7
8
9
# 启动服务
paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189 --app_dir /root/PaddleNLP/applications/sentiment_analysis/unified_sentiment_extraction/deploy
# python代码调用
import json
import requests
url = "http://0.0.0.0:8189/taskflow/senta"
headers = {"Content-Type": "application/json"}
texts = ["蛋糕味道不错,店家的服务也很热情"]
data = {
"data": {
"text": texts,
}
}
r = requests.post(url=url, headers=headers, data=json.dumps(data))
datas = json.loads(r.text)
print(datas)
2
3
4
5
6
7
8
9
10
11
12
13
14
15