
Running TensorRT-LLM Inference on Triton Server in Containers

Published: 2024-02-04 09:01:48


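1. Compiling a model with TensorRT-LLM

1.1 Introduction to TensorRT-LLM

With TensorRT, the usual workflow is to convert the model to ONNX, convert the ONNX model to a TensorRT engine, and then run inference in TensorRT or Triton Server.

This conversion is not easy. It frequently fails with errors, and fixing them takes some command of the model architecture and of the operators the platform supports, plus the ability to convert and debug models. TensorRT-LLM aims to reduce this complexity and make it easier to run large language models on the TensorRT engine.

Note that TensorRT targets specific hardware: each GPU model needs its own compiled TensorRT engine. This is a marked contrast with ONNX, which is positioned as a portable model format.

Also, TensorRT-LLM does not support every GPU model. It supports only cards such as H100, L40S, A100, A30, and V100.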

1.2 Setting up the build environment

docker run --gpus device=0 -v $PWD:/app/tensorrt_llm/models -it --rm hubimage/nvidia-tensorrt-llm:v0.7.1 bash

--gpus device=0 selects the GPU with index 0. The image hubimage/nvidia-tensorrt-llm:v0.7.1 corresponds to the TensorRT-LLM v0.7.1 release.

Building the image yourself is tedious, so here are a few prebuilt versions to choose from:

hubimage/nvidia-tensorrt-llm:v0.7.1
hubimage/nvidia-tensorrt-llm:v0.7.0
hubimage/nvidia-tensorrt-llm:v0.6.1

1.3 Compiling the TensorRT engine

Inside the container started above, run:

python examples/baichuan/build.py --model_version v2_7b \
                --model_dir ./models/Baichuan2-7B-Chat \
                --dtype float16 \
                --parallel_build \
                --use_inflight_batching \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./models/Baichuan2-7B-trt-engines

The build produces three main files:

baichuan_float16_tp1_rank0.engine: the model's computation graph with the weights embedded
config.json: detailed configuration, including model structure, precision, and plugins
model.cache: the build cache, which speeds up subsequent builds
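
Before moving on, it can help to confirm that the build wrote what you expect. The following is a minimal Python sketch, not part of TensorRT-LLM. The helper name check_engine_dir is made up for illustration, and it only looks for the file names listed above, treating model.cache as optional because it is just a build cache.

```python
from pathlib import Path


def check_engine_dir(engine_dir):
    """Return a list of problems found in a TensorRT-LLM engine directory.

    Hypothetical helper: it only checks for the files listed in this article
    (at least one *.engine file and a config.json).
    """
    root = Path(engine_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not list(root.glob("*.engine")):
        problems.append("no *.engine file found")
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    return problems


if __name__ == "__main__":
    issues = check_engine_dir("./models/Baichuan2-7B-trt-engines")
    print("OK" if not issues else "\n".join(issues))
```

An empty list means the directory looks complete enough to copy into the Triton model repository later.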

1.4 Testing inference

python examples/run.py --input_text "世界上第二高的山峰是哪座？" \
                 --max_output_len=200 \
                 --tokenizer_dir ./models/Baichuan2-7B-Chat \
                 --engine_dir=./models/Baichuan2-7B-trt-engines

[02/03/2024-10:02:58] [TRT-LLM] [W] Found pynvml==11.4.1. Please use pynvml>=11.5.0 to get accurate memory usage
Input [Text 0]: "世界上第二高的山峰是哪座？"
Output [Text 0 Beam 0]: "珠穆朗瑪峰（Mount Everest）是地球上最高的山峰，海拔高度為8,848米（29,029英尺）。第二高的山峰是喀喇昆侖山脈的喬戈里峰（K2），海拔高度為8,611米（28,251英尺）。"

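1.5 Checking for serious quality regression

Inference optimization can use techniques such as operator replacement, quantization, and stripping out backpropagation, but one baseline must hold: the model's quality must not degrade much.

Inference optimization is only worthwhile when the accuracy loss is acceptable. The summarize.py script in the TensorRT-LLM project runs a summarization test and scores the model. rouge1, rouge2, and rougeLsum are metrics for the quality of generated text, and they can be used to assess the quality of the model's inference output.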
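
To give a feel for what rouge1 and rouge2 measure, here is a simplified ROUGE-N F1 in plain Python. It is a sketch under stated assumptions (lowercase whitespace tokenization, no stemming), not the implementation in the rouge_score package, so its numbers will not match summarize.py exactly.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N F1 between two strings.

    Assumptions: lowercase whitespace tokenization and no stemming. The
    rouge_score package applies its own tokenizer, so real scores can differ.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(rouge_n("the cat lay on the mat", reference, n=1))
    print(rouge_n("the cat lay on the mat", reference, n=2))
```

With n=1 the score counts shared words; with n=2 it counts shared word pairs, so it is stricter about word order.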

Get the ROUGE scores of the original model

pip install datasets nltk rouge_score -i https://pypi.tuna.tsinghua.edu.cn/simple

Because optimum does not currently support the Baichuan model, you need to edit examples/summarize.py and comment out model.to_bettertransformer(). This has been fixed in the latest TensorRT-LLM code; I am using the latest release at the time of writing (v0.7.1).

python examples/summarize.py --test_hf \
                    --hf_model_dir ./models/Baichuan2-7B-Chat \
                    --data_type fp16 \
                    --engine_dir ./models/Baichuan2-7B-trt-engines

Output:

[02/03/2024-10:21:45] [TRT-LLM] [I] Hugging Face (total latency: 31.27020287513733 sec)
[02/03/2024-10:21:45] [TRT-LLM] [I] HF beam 0 result
[02/03/2024-10:21:45] [TRT-LLM] [I]   rouge1 : 28.847385241217726
[02/03/2024-10:21:45] [TRT-LLM] [I]   rouge2 : 9.519352831698162
[02/03/2024-10:21:45] [TRT-LLM] [I]   rougeL : 20.85486489462602
[02/03/2024-10:21:45] [TRT-LLM] [I]   rougeLsum : 24.090111126907733

Get the ROUGE scores of the TensorRT engine

python examples/summarize.py --test_trt_llm \
                    --hf_model_dir ./models/Baichuan2-7B-Chat \
                    --data_type fp16 \
                    --engine_dir ./models/Baichuan2-7B-trt-engines

Output:

[02/03/2024-10:23:16] [TRT-LLM] [I] TensorRT-LLM (total latency: 28.360705375671387 sec)
[02/03/2024-10:23:16] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[02/03/2024-10:23:16] [TRT-LLM] [I]   rouge1 : 26.557043897453102
[02/03/2024-10:23:16] [TRT-LLM] [I]   rouge2 : 8.28672928021811
[02/03/2024-10:23:16] [TRT-LLM] [I]   rougeL : 19.13639628365737
[02/03/2024-10:23:16] [TRT-LLM] [I]   rougeLsum : 22.0436013250798

After compilation with TensorRT-LLM, rougeLsum drops from about 24 to 22, which shows some loss in capability. As long as the loss stays within an acceptable range the engine is still usable, because inference is faster: in this run, total latency fell from about 31.3 s to 28.4 s.
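
The comparison above can be made less error-prone by parsing the score lines and computing the relative drop per metric. The sketch below uses only the numbers from the two logs in this section; the helper names parse_scores and relative_drop are made up for illustration.

```python
import re

# Score lines copied from the two logs above (timestamps omitted).
HF_LOG = """\
[TRT-LLM] [I]   rouge1 : 28.847385241217726
[TRT-LLM] [I]   rouge2 : 9.519352831698162
[TRT-LLM] [I]   rougeL : 20.85486489462602
[TRT-LLM] [I]   rougeLsum : 24.090111126907733
"""
TRT_LOG = """\
[TRT-LLM] [I]   rouge1 : 26.557043897453102
[TRT-LLM] [I]   rouge2 : 8.28672928021811
[TRT-LLM] [I]   rougeL : 19.13639628365737
[TRT-LLM] [I]   rougeLsum : 22.0436013250798
"""

SCORE_RE = re.compile(r"(rouge\w+)\s*:\s*([0-9.]+)")


def parse_scores(log_text):
    """Extract {metric name: score} from summarize.py-style log lines."""
    return {name: float(value) for name, value in SCORE_RE.findall(log_text)}


def relative_drop(baseline, optimized):
    """Percentage drop of each metric relative to the baseline."""
    return {
        name: (baseline[name] - optimized[name]) / baseline[name] * 100
        for name in baseline
        if name in optimized
    }


if __name__ == "__main__":
    drops = relative_drop(parse_scores(HF_LOG), parse_scores(TRT_LOG))
    for name, pct in drops.items():
        print(f"{name}: -{pct:.1f}%")
```

What counts as an acceptable drop depends on the application, so the sketch deliberately reports the numbers without applying a threshold.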

Once this step is done you can exit the container; serving runs in a different container.

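2. Triton Server configuration

2.1 Introduction to Triton Server

Triton Server is an inference serving framework that lets users run inference at scale. Its features include:

Multiple backends: tensorrt, onnxruntime, pytorch, python, vllm, tensorrtllm, and others. You can also write a custom backend; all it takes is the corresponding shared library.
HTTP and gRPC endpoints for clients
Batching: inference can run in batches, and with dynamic batching enabled, multiple incoming batches can be merged and run together for higher throughput
Pipelines: one Triton Server can serve several models at once, the models can be orchestrated with each other, and Concurrent Model Execution lets pipeline stages run in parallel
Observability: metrics are exposed so that inference can be monitored in real time

[Figure: Triton Server architecture diagram]

The figure above shows the Triton Server architecture. Put simply, Triton Server is an end (model) to end (application) inference framework that manages the life cycle around inference. Once the models are configured, it can serve the application layer directly.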

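2.2 Configuring Triton Server

Examples from the Triton community usually contain these four directories:

.
├── ensemble
│   ├── 1
│   └── config.pbtxt
├── postprocessing
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
├── preprocessing
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
└── tensorrt_llm
    ├── 1
    └── config.pbtxt

9 directories, 6 files

To Triton Server, this layout defines four models: preprocessing, tensorrt_llm, postprocessing, and ensemble. The ensemble model is a composite that chains several models together.

The ensemble exists because tensorrt_llm inference is not text-to-text. With Triton Server's pipeline capability, preprocessing tokenizes the input and postprocessing detokenizes the output, which gives end-to-end inference. Otherwise, a client that called TensorRT-LLM directly would have to handle the mapping between tokens and token IDs in both directions itself.

The four models do the following:

preprocessing: preprocesses the input text, including tokenization and conversion of tokens to IDs, a text2vec-style step.
tensorrt_llm: runs vec2vec inference on the TensorRT engine
postprocessing: postprocesses the generated output, for example alignment and truncation, and turns it back into text, a vec2text-style step.
ensemble: chains the three models above to provide text2text inference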
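
The division of labor among the four models can be illustrated with a toy pipeline in plain Python. Everything here is a stand-in: the five-word vocabulary replaces the real tokenizer, and the fake tensorrt_llm function just repeats the last token instead of running an engine. The point is only the data flow, text to IDs to IDs to text.

```python
# Toy vocabulary: a stand-in for the real tokenizer shipped with the model.
VOCAB = ["<unk>", "hello", "world", "triton", "server"]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}


def preprocessing(text):
    """text -> token IDs (the real model.py uses the model's tokenizer)."""
    return [TOKEN_TO_ID.get(token, 0) for token in text.lower().split()]


def tensorrt_llm(input_ids, max_new_tokens=2):
    """token IDs -> token IDs.

    A fake "model" that repeats the last input token; the real backend runs
    the compiled TensorRT engine at this step.
    """
    if not input_ids:
        return []
    return [input_ids[-1]] * max_new_tokens


def postprocessing(output_ids):
    """token IDs -> text."""
    return " ".join(VOCAB[i] for i in output_ids)


def ensemble(text):
    """text -> text, by chaining the three steps in order."""
    return postprocessing(tensorrt_llm(preprocessing(text)))


if __name__ == "__main__":
    print(preprocessing("Hello Triton"))
    print(ensemble("Hello Triton"))
```

In Triton the chaining is declared in the ensemble's config.pbtxt rather than written as function calls, but the order of the steps is the same.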

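Each model above has a directory named 1, meaning version 1. Model files go in the version directory, and the config.pbtxt in the model directory describes inference parameters such as input, output, and version.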
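
A quick way to catch layout mistakes before starting the server is to walk the repository and check each model directory. This is a hedged sketch, not a Triton tool: the helper name check_model_repository is hypothetical, and it only checks the two things described above (a config.pbtxt per model and at least one numerically named version directory).

```python
from pathlib import Path


def check_model_repository(repo_dir):
    """Map each model directory to a list of layout problems (empty if none).

    Only checks what this section describes: a config.pbtxt per model and at
    least one version directory with a numeric name. Plain files at the top
    level of the repository are ignored.
    """
    report = {}
    for model_dir in sorted(p for p in Path(repo_dir).iterdir() if p.is_dir()):
        problems = []
        if not (model_dir / "config.pbtxt").is_file():
            problems.append("missing config.pbtxt")
        versions = [
            p.name for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()
        ]
        if not versions:
            problems.append("no numeric version directory")
        report[model_dir.name] = problems
    return report
```

Calling check_model_repository on the directory that holds ensemble, preprocessing, postprocessing, and tensorrt_llm should return an empty problem list for each of the four models.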

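2.3 Controlling model loading

Triton Server uses the --model-control-mode option to control how models are loaded. There are currently three modes:

none: load all models in the repository
explicit: load only the specified models, named with the --load-model option
poll: periodically poll the repository and load all models in it; the --repository-poll-secs option sets the polling interval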
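
In explicit mode, Triton's model repository HTTP API can also load a model while the server is running, via POST /v2/repository/models/<name>/load. The sketch below only builds that request with the standard library so it can be inspected without a server. The base URL with host port 38000 is the mapping used later in this article, and build_load_request is a made-up helper name.

```python
import json
import urllib.request


def build_load_request(base_url, model_name):
    """Build (but do not send) the HTTP request that asks Triton to load a model."""
    url = f"{base_url.rstrip('/')}/v2/repository/models/{model_name}/load"
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),  # empty JSON body
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_load_request("http://localhost:38000", "tensorrt_llm")
    print(request.get_method(), request.full_url)
    # To send it to a server started with --model-control-mode=explicit:
    #   with urllib.request.urlopen(request) as response:
    #       print(response.status)
```

Sending the request is left commented out because it only succeeds against a running server in explicit mode.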

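2.4 Controlling model versions

Triton Server provides a Version Policy in each model's config.pbtxt, so several versions of a model can coexist. If no policy is set, only the latest version (the highest version number) is served; in this article that is version 1, the only version present. There are three version policies:

Serve all versions

version_policy: { all: {}}

Serve only the latest n versions

version_policy: { latest: { num_versions: 3}}

Serve only the specified versions

version_policy: { specific: { versions: [1, 3, 5]}}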
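
The effect of the three policies can be simulated in a few lines of Python. This is an illustrative sketch rather than Triton code: the policy is passed as a plain dict that mirrors the config.pbtxt forms above, "latest" is taken to mean the numerically highest version numbers, and versions requested by a specific policy but absent from the repository are simply skipped.

```python
def served_versions(available, policy=None):
    """Return the versions that would be served, in ascending order.

    `policy` mirrors the three version_policy forms as a plain dict:
      {"all": {}}
      {"latest": {"num_versions": n}}
      {"specific": {"versions": [...]}}
    With no policy, this sketch assumes latest with num_versions = 1.
    """
    versions = sorted(set(available))
    policy = policy or {"latest": {"num_versions": 1}}
    if "all" in policy:
        return versions
    if "latest" in policy:
        count = policy["latest"].get("num_versions", 1)
        return versions[-count:] if count > 0 else []
    if "specific" in policy:
        wanted = set(policy["specific"]["versions"])
        return [v for v in versions if v in wanted]
    raise ValueError(f"unknown version policy: {policy}")


if __name__ == "__main__":
    on_disk = [1, 2, 3, 5]
    print(served_versions(on_disk))
    print(served_versions(on_disk, {"latest": {"num_versions": 3}}))
    print(served_versions(on_disk, {"specific": {"versions": [1, 3, 5]}}))
```

With version directories 1, 2, 3, and 5 on disk, the default serves only 5, and latest with num_versions 3 serves 2, 3, and 5.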

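3. Using TensorRT-LLM in Triton Server

3.1 Cloning the configuration files

The configuration used in this article is collected in a GitHub repository. After copying the models into the designated directories, you can run inference directly.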

git clone https://github.com/shaowenchen/modelops

3.2 Organizing the inference directory

Copy the TensorRT engine

cp Baichuan2-7B-trt-engines/* modelops/triton-tensorrtllm/Baichuan2-7B-Chat/tensorrt_llm/1/

Copy the source model

cp -r Baichuan2-7B-Chat modelops/triton-tensorrtllm/downloads

The directory structure is now:

tree modelops/triton-tensorrtllm
modelops/triton-tensorrtllm
├── Baichuan2-7B-Chat
│   ├── end_to_end_grpc_client.py
│   ├── ensemble
│   │   ├── 1
│   │   └── config.pbtxt
│   ├── postprocessing
│   │   ├── 1
│   │   │   ├── model.py
│   │   │   └── __pycache__
│   │   │       └── model.cpython-310.pyc
│   │   └── config.pbtxt
│   ├── preprocessing
│   │   ├── 1
│   │   │   ├── model.py
│   │   │   └── __pycache__
│   │   │       └── model.cpython-310.pyc
│   │   └── config.pbtxt
│   └── tensorrt_llm
│       ├── 1
│       │   ├── baichuan_float16_tp1_rank0.engine
│       │   ├── config.json
│       │   └── model.cache
│       └── config.pbtxt
└── downloads
    └── Baichuan2-7B-Chat
        ├── Baichuan2 模型社區許可協議.pdf
        ├── Community License for Baichuan2 Model.pdf
        ├── config.json
        ├── configuration_baichuan.py
        ├── generation_config.json
        ├── generation_utils.py
        ├── modeling_baichuan.py
        ├── pytorch_model.bin
        ├── quantizer.py
        ├── README.md
        ├── special_tokens_map.json
        ├── tokenization_baichuan.py
        ├── tokenizer_config.json
        └── tokenizer.model

13 directories, 26 files

3.3 Starting the inference service

docker run --gpus device=0 --rm -p 38000:8000 -p 38001:8001 -p 38002:8002 \
    -v $PWD/modelops/triton-tensorrtllm:/models \
    hubimage/nvidia-triton-trt-llm:v0.7.1 \
    tritonserver --model-repository=/models/Baichuan2-7B-Chat \
    --disable-auto-complete-config \
    --backend-config=python,shm-region-prefix-name=prefix0_:

If several Triton servers run on one machine, use shm-region-prefix-name=prefix0_ to give each one a distinct shared-memory prefix. See https://github.com/triton-inference-server/server/issues/4145 for details.

Startup log:

I0129 10:27:31.658112 1 server.cc:619]
+-------------+-----------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                                                                                              |
+-------------+-----------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0_:","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}                                      |
+-------------+-----------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0129 10:27:31.658192 1 server.cc:662]
+----------------+---------+--------+
| Model          | Version | Status |
+----------------+---------+--------+
| ensemble       | 1       | READY  |
| postprocessing | 1       | READY  |
| preprocessing  | 1       | READY  |
| tensorrt_llm   | 1       | READY  |
+----------------+---------+--------+
...
I0129 10:27:31.745587 1 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I0129 10:27:31.745810 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I0129 10:27:31.787129 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002

Once all four models are in the READY state, inference can proceed normally.

View the model configuration

curl localhost:38000/v2/models/ensemble/config
{"name":"ensemble","platform":"ensemble","backend":"","version_policy":{"latest":{"num_versions":1}},"max_batch_size":32,"input":[{"name":"text_input","data_type":"TYPE_STRING",...

This shows the model's inference parameters. If you use auto-complete-config, this endpoint can export the configuration that Triton Server generated automatically, which you can then modify and debug.
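
The JSON returned by this endpoint is long, so it can be convenient to pull out just the fields you care about. The sketch below parses a response string with the standard library; summarize_config is a hypothetical helper, and the sample in the usage part is shortened to the fields visible in the output above (the real response has more inputs and outputs).

```python
import json


def summarize_config(config_json):
    """Pick a few fields out of the JSON returned by /v2/models/<name>/config."""
    cfg = json.loads(config_json)
    return {
        "name": cfg.get("name"),
        "max_batch_size": cfg.get("max_batch_size"),
        "version_policy": cfg.get("version_policy"),
        "inputs": [
            (tensor.get("name"), tensor.get("data_type"))
            for tensor in cfg.get("input", [])
        ],
    }


if __name__ == "__main__":
    # Shortened sample built from the fields visible in the curl output above.
    sample = (
        '{"name":"ensemble","platform":"ensemble","backend":"",'
        '"version_policy":{"latest":{"num_versions":1}},"max_batch_size":32,'
        '"input":[{"name":"text_input","data_type":"TYPE_STRING"}]}'
    )
    print(summarize_config(sample))
```

Against a live server, the input string would come from reading http://localhost:38000/v2/models/ensemble/config, for example with urllib.request.urlopen.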

Check that Triton is running

curl -v localhost:38000/v2/health/ready
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
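
In scripts it is often useful to wait for this endpoint to return 200 before sending requests, since loading the engine takes a while. Below is a minimal standard-library sketch of that polling loop. The function names and the default attempt count and delay are illustrative choices, and the opener parameter exists only so the logic can be exercised without a server.

```python
import time
import urllib.request


def is_ready(base_url, opener=urllib.request.urlopen, timeout=2.0):
    """True if GET <base_url>/v2/health/ready answers with HTTP 200."""
    url = f"{base_url.rstrip('/')}/v2/health/ready"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are subclasses of OSError, so this covers
        # refused connections, timeouts, and non-2xx responses.
        return False


def wait_until_ready(base_url, attempts=30, delay=2.0, opener=urllib.request.urlopen):
    """Poll the readiness endpoint until it succeeds or attempts run out."""
    for attempt in range(attempts):
        if is_ready(base_url, opener=opener):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For the container started above, wait_until_ready("http://localhost:38000") would return True once the server reports ready, or False after the attempts are used up.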

3.4 Calling from a client

Install the dependency

pip install tritonclient[grpc] -i https://pypi.tuna.tsinghua.edu.cn/simple

Triton's gRPC interface performs noticeably better than its HTTP interface, and I did not find an HTTP example in the container either, so gRPC is used here.

Test inference

wget https://raw.githubusercontent.com/shaowenchen/modelops/master/triton-tensorrtllm/Baichuan2-7B-Chat/end_to_end_grpc_client.py

python3 ./end_to_end_grpc_client.py -u 127.0.0.1:38001 -p "世界上第三高的山峰是哪座？" -S -o 128
珠穆朗瑪峰（Mount Everest）是世界上最高的山峰，海拔高度為8,848米（29,029英尺）。在世界上，珠穆朗瑪峰之后，第二高的山峰是喀喇昆侖山脈的喬戈里峰（K2，又稱K2峰），海拔高度為8,611米（28,251英尺）。第三高的山峰是喜馬拉雅山脈的坎欽隆加峰（Kangchenjunga），海拔高度為8,586米（28,169英尺）。</s>

3.5 Viewing metrics

Triton Server exposes inference metrics on port 8002, which in this example is mapped to host port 38002.

curl -v localhost:38002/metrics
nv_inference_request_success{model="ensemble",version="1"} 1
nv_inference_request_success{model="tensorrt_llm",version="1"} 1
nv_inference_request_success{model="preprocessing",version="1"} 1
nv_inference_request_success{model="postprocessing",version="1"} 128
# HELP nv_inference_request_failure Number of failed inference requests, all batch sizes
# TYPE nv_inference_request_failure counter
nv_inference_request_failure{model="ensemble",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm",version="1"} 0
nv_inference_request_failure{model="preprocessing",version="1"} 0
nv_inference_request_failure{model="postprocessing",version="1"} 0
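
The metrics are in the Prometheus text format, which is easy to read from a script when you only need a number or two. The following sketch extracts one metric per model using regular expressions. It is deliberately simplified (it assumes label values contain no escaped quotes or braces), and parse_metrics is a made-up helper name; for anything serious a Prometheus client library is the safer choice.

```python
import re

LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][\w:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text, metric_name):
    """Return {model: value} for one labeled metric in Prometheus text format.

    Simplified: skips comment lines and samples without labels, and assumes
    label values contain no escaped quotes or braces.
    """
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match.group("name") != metric_name:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        result[labels.get("model", "")] = float(match.group("value"))
    return result


if __name__ == "__main__":
    sample = '''\
nv_inference_request_success{model="ensemble",version="1"} 1
nv_inference_request_success{model="postprocessing",version="1"} 128
# TYPE nv_inference_request_failure counter
nv_inference_request_failure{model="ensemble",version="1"} 0
'''
    print(parse_metrics(sample, "nv_inference_request_success"))
```

The sample lines are taken from the curl output above; against a live server the text would come from http://localhost:38002/metrics.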

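In Grafana you can import the dashboard https://grafana.com/grafana/dashboards/18737-triton-inference-server/ to view the metrics, as shown below:

[Figure: Grafana dashboard for Triton Inference Server]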

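4. Summary

This article is a record of learning to run inference with TensorRT and Triton Server. The main points are:

TensorRT is a more efficient inference engine for Nvidia GPU hardware
TensorRT-LLM makes it quicker to get large language models running on the TensorRT engine
Triton Server is an end-to-end inference framework that supports most model frameworks and helps users build inference services at scale quickly
An example of running TensorRT-LLM inference in Triton Server

5. References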

https://mmdeploy.readthedocs.io/zh-cn/latest/tutorial/03_pytorch2onnx.html
https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/running.html#running
https://github.com/NVIDIA/TensorRT-LLM
https://github.com/triton-inference-server/triton-tensorrtllm
https://zhuanlan.zhihu.com/p/663748373
