[ Server ] Architecture

benlee73 2021. 6. 24. 11:05

https://github.com/triton-inference-server/server/blob/main/docs/architecture.md

Triton Architecture

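The figure below shows the high-level architecture of Triton Inference Server.

The model repository is a file-system based store of the models that Triton makes available for inference.

Inference requests arrive at the server via HTTP/REST, gRPC, or the C API and are routed to the appropriate per-model scheduler.

Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis.

Each model's scheduler optionally batches the incoming requests and then passes them to the backend that matches the model type.

The backend performs inference and returns the outputs.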

Triton supports a backend C API that allows Triton to be extended with new functionality, such as custom pre- and post-processing operations or even a new deep-learning framework.

The models being served by Triton can be queried and controlled through a dedicated model management API that is available over HTTP/REST, gRPC, or the C API.

Readiness and liveness health endpoints, together with utilization, throughput, and latency metrics, ease the integration of Triton into a deployment framework such as Kubernetes.
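As a rough illustration of how a deployment framework might probe these endpoints, the sketch below builds the health and metrics URLs and checks whether one of them answers. It assumes Triton's default ports (8000 for HTTP, 8002 for metrics) and the v2 health paths /v2/health/live and /v2/health/ready. The host name and the helper names probe_urls and is_ok are hypothetical and not part of Triton.

```python
import urllib.error
import urllib.request


def probe_urls(host="localhost", http_port=8000, metrics_port=8002):
    """Build the URLs a liveness/readiness probe and a metrics scraper would hit.

    Host and ports are assumptions (Triton's defaults); adjust them to
    match the actual deployment.
    """
    base = f"http://{host}:{http_port}"
    return {
        "live": f"{base}/v2/health/live",
        "ready": f"{base}/v2/health/ready",
        "metrics": f"http://{host}:{metrics_port}/metrics",
    }


def is_ok(url, timeout=2.0):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not OK".
        return False
```

With a server running, is_ok(probe_urls()["ready"]) would be expected to return True once the models are loaded. In Kubernetes the same two paths would typically be wired into the livenessProbe and readinessProbe of the pod.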

Concurrent Model Execution

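Triton can run multiple models, and multiple instances of the same model, in parallel on the same system.

That system may have zero, one, or many GPUs.

The figure below shows an example with two models, model0 and model1, where two requests arrive at the same time, one for each model.

Triton immediately schedules both onto the GPU, and the GPU's hardware scheduler starts working on both computations in parallel.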

Models executing on the system's CPU are handled similarly by Triton, except that the scheduling of the CPU threads executing each model is handled by the system's OS.

By default, if multiple requests for the same model arrive at the same time, Triton serializes their execution by scheduling only one at a time on the GPU, as shown in the following figure.

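Triton provides a model configuration option called instance-group (the instance_group setting in the model configuration) that lets each model specify how many parallel executions of that model are allowed.

By default each model has a single instance. Raising the count through instance-group changes execution as shown in the figure below.

In the figure below, model1 is configured with three instances, so up to three requests for model1 can execute at the same time.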
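The effect of the instance count can be sketched with a toy timing model. This is not Triton code: it assumes every request arrives at time 0, takes the same fixed time, and that each instance runs one request at a time. The function name finish_times is hypothetical.

```python
import heapq


def finish_times(num_requests, instance_count, duration=1.0):
    """Toy model: all requests arrive at t=0 and each takes `duration` seconds.

    Each model instance runs one request at a time, so at most
    `instance_count` requests execute concurrently.
    """
    # Min-heap of the times at which each instance becomes free.
    free_at = [0.0] * instance_count
    heapq.heapify(free_at)
    done = []
    for _ in range(num_requests):
        start = heapq.heappop(free_at)  # earliest available instance
        end = start + duration
        done.append(end)
        heapq.heappush(free_at, end)
    return done


print(finish_times(4, instance_count=1))  # [1.0, 2.0, 3.0, 4.0]: serialized
print(finish_times(4, instance_count=3))  # [1.0, 1.0, 1.0, 2.0]: the fourth request waits
```

With one instance the four requests finish one after another. With three instances the first three finish together and only the fourth has to wait for a free instance.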

Models And Schedulers

Triton supports multiple scheduling and batching algorithms that can be selected independently for each model.

Which schedulers Triton supports depends on the model type: stateless, stateful, or ensemble.

Stateless Models

A stateless model keeps no state between inference requests, and each inference is independent of all others.

Examples include CNNs for image classification and object detection.

The default scheduler or the dynamic batcher can be used for stateless models.

In other words, because a model such as a CNN does not depend on earlier requests, a request can be served by any instance of the model.

Stateful Models

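A stateful model, by contrast, maintains state between inference requests.

The model expects multiple inference requests that together form a sequence, and every request in that sequence must be routed to the same model instance.

That is, because state must be carried across consecutive requests, a later request in the sequence cannot be sent to a different model instance. It must go to the same one.

The model may also require Triton to provide control signals that mark, for example, the start and end of a sequence.

Stateful models use the sequence batcher.

The sequence batcher ensures that all inference requests in a sequence are routed to the same model instance, so that the model can maintain state correctly.

The sequence batcher also communicates with the model to indicate when a sequence starts, when it ends, and what the correlation ID of the sequence is.

When a client sends inference requests to a stateful model, all requests in the same sequence must carry the same correlation ID, and the client must mark the start and end of the sequence.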
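The routing rule described above can be sketched with a small toy class. It is a stand-in for illustration only and not Triton's sequence batcher: it assumes one active sequence per instance (a simplification) and rejects a new sequence when no instance is free, where a real batcher would queue it. The class name ToySequenceRouter and its method are hypothetical.

```python
class ToySequenceRouter:
    """Toy stand-in for sequence routing; not Triton's implementation."""

    def __init__(self, instance_count):
        self.instance_count = instance_count
        self.active = {}  # correlation ID -> instance index

    def route(self, correlation_id, start=False, end=False):
        """Return the index of the instance that should handle this request."""
        if start:
            if correlation_id in self.active:
                raise ValueError(f"sequence {correlation_id} already started")
            busy = set(self.active.values())
            free = [i for i in range(self.instance_count) if i not in busy]
            if not free:
                raise RuntimeError("no free instance; a real batcher would queue the sequence")
            self.active[correlation_id] = free[0]
        elif correlation_id not in self.active:
            raise ValueError(f"sequence {correlation_id} has no start request")
        instance = self.active[correlation_id]
        if end:
            del self.active[correlation_id]  # release the instance for a new sequence
        return instance


router = ToySequenceRouter(instance_count=2)
trace = [
    router.route(42, start=True),  # sequence 42 is pinned to instance 0
    router.route(7, start=True),   # sequence 7 is pinned to instance 1
    router.route(42),              # same correlation ID, same instance
    router.route(7, end=True),
    router.route(42, end=True),
]
print(trace)  # [0, 1, 0, 1, 0]
```

Two interleaved sequences stay on their own instances for their whole lifetime, which is the property that lets the model keep per-sequence state.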

Ensemble Models

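An ensemble model represents a pipeline of one or more models and the connection of input and output tensors between those models.

Ensemble models are meant for procedures that involve several models, such as "data preprocessing -> inference -> data postprocessing".

Using an ensemble avoids the overhead of transferring intermediate tensors and reduces the number of requests that must be sent to Triton.

An ensemble model uses the ensemble scheduler, regardless of the schedulers used by the models inside the ensemble.

The dataflow between models is specified by the ModelEnsembling::Step entries in the model configuration.

The scheduler collects the output tensors of each step and passes them as input tensors to the steps specified there.

Although an ensemble is not an actual model, it is still seen as a single model from the outside.

An ensemble model for image classification and segmentation is written as follows.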

name: "ensemble_model"
platform: "ensemble"
max_batch_size: 1
input [
  {
    name: "IMAGE"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "CLASSIFICATION"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  },
  {
    name: "SEGMENTATION"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "image_preprocess_model"
      model_version: -1
      input_map {
        key: "RAW_IMAGE"
        value: "IMAGE"
      }
      output_map {
        key: "PREPROCESSED_OUTPUT"
        value: "preprocessed_image"
      }
    },
    {
      model_name: "classification_model"
      model_version: -1
      input_map {
        key: "FORMATTED_IMAGE"
        value: "preprocessed_image"
      }
      output_map {
        key: "CLASSIFICATION_OUTPUT"
        value: "CLASSIFICATION"
      }
    },
    {
      model_name: "segmentation_model"
      model_version: -1
      input_map {
        key: "FORMATTED_IMAGE"
        value: "preprocessed_image"
      }
      output_map {
        key: "SEGMENTATION_OUTPUT"
        value: "SEGMENTATION"
      }
    }
  ]
}

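The ensemble scheduler recognizes the ensemble model's input (IMAGE), its outputs (CLASSIFICATION, SEGMENTATION), and all the values in each input_map and output_map.

The ensemble model as seen by the ensemble scheduler therefore looks like the figure below.

When a request arrives for the ensemble model, the ensemble scheduler does the following.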

1. Recognize that the "IMAGE" tensor in the request is mapped to the input "RAW_IMAGE" of the preprocess model.
2. Check the models within the ensemble and send an internal request to the preprocess model, because all of its required input tensors are ready.
3. Recognize the completion of the internal request, collect the output tensor, and map its content to "preprocessed_image", a unique name known within the ensemble.
4. Map the newly collected tensor to the inputs of the models within the ensemble. In this case, the inputs of "classification_model" and "segmentation_model" are mapped and marked as ready.
5. Check the models that require the newly collected tensor and send internal requests to those whose inputs are ready, here the classification model and the segmentation model. The responses arrive in arbitrary order, depending on the load and computation time of each model.
6. Repeat steps 3-5 until no more internal requests need to be sent, then respond to the inference request with the tensors mapped to the ensemble output names.
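The scheduling loop described above can be sketched as a small dataflow simulation. This is a toy under stated assumptions and not Triton's ensemble scheduler: the three "models" are hypothetical stand-in functions that operate on strings, execution is synchronous (so the arbitrary response order is not modeled), and the function name run_ensemble is made up for illustration. The step names and tensor mappings are taken from the configuration shown earlier.

```python
def run_ensemble(steps, models, request, ensemble_outputs):
    """Toy dataflow loop: run each step once all of its mapped inputs exist."""
    tensors = dict(request)  # ensemble tensor name -> value
    pending = list(steps)
    order = []
    while pending:
        ready = [s for s in pending
                 if all(t in tensors for t in s["input_map"].values())]
        if not ready:
            raise RuntimeError("no step is ready; the dataflow cannot complete")
        for step in ready:
            # Map ensemble tensor names to the model's own input names.
            inputs = {k: tensors[v] for k, v in step["input_map"].items()}
            outputs = models[step["model_name"]](inputs)
            # Map the model's output names back to ensemble tensor names.
            for model_out, ensemble_name in step["output_map"].items():
                tensors[ensemble_name] = outputs[model_out]
            order.append(step["model_name"])
            pending.remove(step)
    return {name: tensors[name] for name in ensemble_outputs}, order


# Hypothetical stand-ins for the real models; they only transform strings.
models = {
    "image_preprocess_model": lambda x: {"PREPROCESSED_OUTPUT": x["RAW_IMAGE"].lower()},
    "classification_model": lambda x: {"CLASSIFICATION_OUTPUT": "class-of:" + x["FORMATTED_IMAGE"]},
    "segmentation_model": lambda x: {"SEGMENTATION_OUTPUT": "mask-of:" + x["FORMATTED_IMAGE"]},
}
steps = [
    {"model_name": "image_preprocess_model",
     "input_map": {"RAW_IMAGE": "IMAGE"},
     "output_map": {"PREPROCESSED_OUTPUT": "preprocessed_image"}},
    {"model_name": "classification_model",
     "input_map": {"FORMATTED_IMAGE": "preprocessed_image"},
     "output_map": {"CLASSIFICATION_OUTPUT": "CLASSIFICATION"}},
    {"model_name": "segmentation_model",
     "input_map": {"FORMATTED_IMAGE": "preprocessed_image"},
     "output_map": {"SEGMENTATION_OUTPUT": "SEGMENTATION"}},
]

result, order = run_ensemble(steps, models, {"IMAGE": "CAT.JPG"},
                             ["CLASSIFICATION", "SEGMENTATION"])
print(result)  # {'CLASSIFICATION': 'class-of:cat.jpg', 'SEGMENTATION': 'mask-of:cat.jpg'}
print(order)
```

The preprocess step runs first because it is the only one whose input is present in the request. Once "preprocessed_image" exists, both downstream steps become ready, and the final response contains only the tensors mapped to the ensemble output names.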
