情感分析之使用PaddleGPU文章正负面分析完整版

[toc]

# 相关文档

官方说明文档:点击跳转 (opens new window)

# 使用Paddle

登录网站:https://aistudio.baidu.com/index/creations (opens new window)

点击我的创作

创建notebook项目

选择配置，并创建

启动环境

选择GPU启动

点击进入环境

进入终端

输入命令查看安装组件版本

pip list |grep paddle
paddle2onnx                1.1.0
paddlefsl                  1.1.0
paddlehub                  2.4.0
paddlenlp                  2.6.1.post0
paddlepaddle-gpu           2.5.2

1
2
3
4
5
6

# 下载源码

cd /home/aistudio/work && git clone https://gitee.com/paddlepaddle/PaddleNLP.git

# 标注数据

参考：https://www.hadoop.wiki/pages/598a4a (opens new window)

https://datamining.blog.csdn.net/article/details/134964167 (opens new window)

标注数据分类如下

# 上传标注数据

可以直接下载我标注好的数据，大概9k条左右文章和评论数据

链接: https://pan.baidu.com/s/1nZLbOXVNrY6HoaWVVm9eaA?pwd=5fei 提取码: 5fei

将标注数据上传到/home/aistudio/work/PaddleNLP/applications/sentiment_analysis/unified_sentiment_extraction/data目录下

cd /home/aistudio/work/PaddleNLP/applications/sentiment_analysis/unified_sentiment_extraction

mkdir data

aistudio@jupyter-3127826-7249891:~/work/PaddleNLP/applications/sentiment_analysis/unified_sentiment_extraction/data$ ls -lh
total 12M
-rw-r--r-- 1 aistudio aistudio 12M Dec 12 17:40 project-31-at-2023-12-12-08-07-661bbaf0.json

1
2
3

# 训练模型

# 构建样本数据

# 修改源码

构建之前需要先修改下处理逻辑不然会报错

[2023-12-12 16:34:26,931] [    INFO] - [Train] Start to convert annotation data.
Traceback (most recent call last):
  File "label_studio.py", line 738, in <module>
    do_convert()
  File "label_studio.py", line 698, in do_convert
    train_examples = convertor.convert_cls_examples(raw_examples[:p1], data_flag="Train")
  File "label_studio.py", line 103, in convert_cls_examples
    items = self.process_text_tag(line, task_type="cls")
  File "label_studio.py", line 93, in process_text_tag
    items["label"] = line["annotations"][0]["result"][0]["value"]["choices"]
IndexError: list index out of range

1
2
3
4
5
6
7
8
9
10
11

原因：可能是因为标注的数据有些是为空的，我们修改下label_studio.py的代码，将异常数据直接捕获跳过

修改代码第103行左右的convert_cls_examples方法

修改前

    def convert_cls_examples(self, raw_examples, data_flag="Data"):
        """
        Convert labeled data for classification task.
        """
        examples = []
        logger.info("{0:7} Start to convert annotation data.".format("[" + data_flag + "]"))
        for line in raw_examples:
             items = self.process_text_tag(line, task_type="cls")
             text, labels = items["text"], items["label"]
             example = self.generate_cls_example(text, labels, self.sentiment_prompt_prefix, self.options)
             examples.append(example)

        logger.info("{0:7} End to convert annotation data.\n".format(""))
        return examples

1
2
3
4
5
6
7
8
9
10
11
12
13
14

修改后

    def convert_cls_examples(self, raw_examples, data_flag="Data"):
        """
        Convert labeled data for classification task.
        """
        examples = []
        logger.info("{0:7} Start to convert annotation data.".format("[" + data_flag + "]"))
        for line in raw_examples:
            try:
                items = self.process_text_tag(line, task_type="cls")
                text, labels = items["text"], items["label"]
                example = self.generate_cls_example(text, labels, self.sentiment_prompt_prefix, self.options)
                examples.append(example)
            except Exception :
                logger.info("数据解析错误")
        logger.info("{0:7} End to convert annotation data.\n".format(""))
        return examples

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

Github上也有相关讨论：https://github.com/PaddlePaddle/PaddleNLP/issues/6292 (opens new window)

修改完后执行命令

python label_studio.py \
    --label_studio_file ./data/project-31-at-2023-12-12-08-07-661bbaf0.json \
    --task_type cls \
    --save_dir ./data \
    --splits 0.8 0.1 0.1 \
    --options "正向" "负向" "中性" \
    --negative_ratio 5 \
    --is_shuffle True \
    --seed 1000

1
2
3
4
5
6
7
8
9

构建成功后会提示

[2023-12-13 21:35:42,773] [    INFO] -         End to convert annotation data.
[2023-12-13 21:35:42,819] [    INFO] - Save 5642 examples to ./data/train.json.
[2023-12-13 21:35:42,824] [    INFO] - Save 717 examples to ./data/dev.json.
[2023-12-13 21:35:42,830] [    INFO] - Save 686 examples to ./data/test.json.
[2023-12-13 21:35:42,830] [    INFO] - Finished! It takes 0.47 seconds

1
2
3
4
5

# 执行命令训练模型

python -u -m paddle.distributed.launch --gpus "0" finetune.py \
  --train_path ./data/train.json \
  --dev_path ./data/dev.json \
  --save_dir ./checkpoint \
  --learning_rate 1e-5 \
  --batch_size 16 \
  --max_seq_len 512 \
  --num_epochs 10 \
  --model uie-senta-base \
  --seed 1000 \
  --logging_steps 10 \
  --valid_steps 100 \
  --device gpu

1
2
3
4
5
6
7
8
9
10
11
12
13

参数说明：

可配置参数说明：

train_path：必须，训练集文件路径。

dev_path：必须，验证集文件路径。

save_dir：模型 checkpoints 的保存目录，默认为"./checkpoint"。

learning_rate：训练最大学习率，UIE 推荐设置为 1e-5；默认值为1e-5。

batch_size：训练集训练过程批处理大小，请结合显存情况进行调整，若出现显存不足，请适当调低这一参数；默认为 16。

max_seq_len：模型支持处理的最大序列长度，默认为512。

num_epochs：模型训练的轮次，可以视任务情况进行调整，默认为10。

model：训练使用的预训练模型。可选择的有uie-senta-base, uie-senta-medium, uie-senta-mini, uie-senta-micro, uie-senta-nano，默认为uie-senta-base。

logging_steps: 训练过程中日志打印的间隔 steps 数，默认10。

valid_steps: 训练过程中模型评估的间隔 steps 数，默认100。

seed：全局随机种子，默认为 42。

device: 训练设备，可选择 'cpu'、'gpu' 其中的一种；默认为 GPU 训练。

模型训练完成后会在save_dir配置的目录下生成保存的模型,model_best就是最佳模型

aistudio@jupyter-3127826-7249891:~/work/PaddleNLP/applications/sentiment_analysis/unified_sentiment_extraction$ ls checkpoint/
model_100   model_1200  model_1500  model_1800  model_2000  model_2300  model_2600  model_2900  model_3100  model_3400  model_3700  model_400   model_600  model_900
model_1000  model_1300  model_1600  model_1900  model_2100  model_2400  model_2700  model_300   model_3200  model_3500  model_3800  model_4000  model_700  model_best
model_1100  model_1400  model_1700  model_200   model_2200  model_2500  model_2800  model_3000  model_3300  model_3600  model_3900  model_500   model_800

1
2
3
4

# 测试模型准确度

python evaluate.py \
    --model_path ./checkpoint/model_best \
    --test_path ./data/test.json \
    --batch_size 16 \
    --max_seq_len 512 \
    --debug

1
2
3
4
5
6

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:14<00:00,  3.48it/s]
[2023-12-13 10:42:10,473] [    INFO] - -----------------------------
[2023-12-13 10:42:10,473] [    INFO] - Class Name: 情感倾向[中性,正向,负向]
[2023-12-13 10:42:10,473] [    INFO] - Evaluation Precision: 0.76893 | Recall: 0.74746 | F1: 0.75804

1
2
3
4

# 应用训练好的模型

from paddlenlp import Taskflow
schema = ['情感倾向[正向,负向,中性]']
senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema, task_path="./checkpoint/model_best")
print(senta("这家点的房间很大，店家服务也很热情，就是房间隔音不好"))

1
2
3
4

# 服务部署

将./checkpoint/model_best目录压缩打包

$ tar -czvf  model_best.tar.gz model_best/
model_best/
model_best/config.json
model_best/special_tokens_map.json
model_best/model_config.json
model_best/static/
model_best/static/inference.pdiparams.info
model_best/static/inference.pdiparams
model_best/static/inference.pdmodel
model_best/tokenizer_config.json
model_best/model_state.pdparams
model_best/vocab.txt
model_best/.cache_info

1
2
3
4
5
6
7
8
9
10
11
12
13

下载到本地

放到需要部署的服务器上，然后解压

注意新服务器也要有Paddle的环境

参考：https://www.hadoop.wiki/pages/75395e (opens new window)

# tar -zxvf model_best.tar.gz 
model_best/
model_best/config.json
model_best/special_tokens_map.json
model_best/model_config.json
model_best/static/
model_best/static/inference.pdiparams.info
model_best/static/inference.pdiparams
model_best/static/inference.pdmodel
model_best/tokenizer_config.json
model_best/model_state.pdparams
model_best/vocab.txt
model_best/.cache_info

1
2
3
4
5
6
7
8
9
10
11
12
13

进入deploy目录

cd PaddleNLP/applications/sentiment_analysis/unified_sentiment_extraction/deploy

(nlp) [root@centos7 deploy]# ls
client.py  __pycache__  server.py

1
2

修改server.py选择使用指定模型

from paddlenlp import SimpleServer, Taskflow

# The schema changed to your defined schema
schema = schema =  ['情感倾向[正向,负向,中性]']
# define taskflow to perform sentiment analysis
senta = Taskflow("sentiment_analysis", schema=schema, model="uie-senta-base",task_path="/root/PaddleNLP/applications/sentiment_analysis/unified_sentiment_extraction/checkpoint/model_best")
# define your server
app = SimpleServer()
app.register_taskflow("taskflow/senta", senta)

1
2
3
4
5
6
7
8
9

# 启动服务

paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189 --app_dir /root/PaddleNLP/applications/sentiment_analysis/unified_sentiment_extraction/deploy

# python代码调用

import json

import requests

url = "http://0.0.0.0:8189/taskflow/senta"
headers = {"Content-Type": "application/json"}
texts = ["蛋糕味道不错，店家的服务也很热情"]
data = {
 "data": {
     "text": texts,
 }
}
r = requests.post(url=url, headers=headers, data=json.dumps(data))
datas = json.loads(r.text)
print(datas)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

# HTTP请求

# 服务Docker化

https://www.hadoop.wiki/pages/2eda6a (opens new window)

上次更新: 2023/12/19, 21:00:20

← 情感分析服务化部署之构建Docker服务情感分析之文章正负面分析完整版→