Serving EXAONE 4.0 with LitServe

Posted Jul 19, 2025 Updated Jul 19, 2025

By fransoaardi 4 min read

Introduction

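I was curious how the tooling and frameworks for model serving have changed recently, and while looking around I came across a project called LitServe and gave it a quick try. FastAPI still seems to be widely used, but my impression was that LitServe is the project positioned to take over from TorchServe in the PyTorch ecosystem. While looking for a model small enough to serve comfortably, I remembered reading that EXAONE 4.0 had been released, so I served it via Hugging Face.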

Configurations

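I run the pip commands through uv, a tool I have been comfortable with lately, to install litserve. EXAONE 4.0 support comes from the add-exaone4 branch of the lgai-exaone fork of transformers.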

  
$ mkdir testmlops
$ cd testmlops
$ uv venv
$ uv pip install litserve
$ uv pip install git+https://github.com/lgai-exaone/transformers@add-exaone4

Code

$ tree 
.
├── client.py
└── server.py

server.py

  
import litserve as ls
from transformers import AutoModelForCausalLM, AutoTokenizer


class InferencePipeline(ls.LitAPI):
    def setup(self, device):
        model_name = "LGAI-EXAONE/EXAONE-4.0-1.2B"

        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype="bfloat16",
        ).to(device)

        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def decode_request(self, request):
        # Wrap the raw prompt in the chat-message format that apply_chat_template expects.
        prompt = request["input"]
        messages = [
            {"role": "user", "content": prompt}
        ]
        return messages

    def predict(self, messages):
        input_ids = self.tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt"
        )

        output = self.model.generate(
            input_ids.to(self.model.device),
            max_new_tokens=128,
            do_sample=False,
        )

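        # output[0] holds the prompt tokens followed by the generated tokens,
        # so the decoded text echoes the prompt before the answer.
        return self.tokenizer.decode(output[0])

    def encode_response(self, output):
        return {"output": output}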

if __name__ == "__main__":
    server = ls.LitServer(InferencePipeline(), accelerator="cuda")
    server.run(port=8000)

client.py

  
# This file is auto-generated by LitServe.
# Disable auto-generation by setting `generate_client_file=False` in `LitServer.run()`.

import requests
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('prompt')
    args = parser.parse_args()

    response = requests.post("http://127.0.0.1:8000/predict", json={"input": args.prompt})
    print(f"Status: {response.status_code}\nResponse:\n {response.text}")

Run

  
$ uv run server.py
(...)
$ uv run client.py "임진왜란은 언제일어났니? 일찍일어났어?"
Status: 200
Response:
 {"output":"[|user|]\n임진왜란은 언제일어났니? 일찍일어났어?[|endofturn|]\n[|assistant|]\n<think>\n\n</think>\n\n임진왜란은 대략 **1592년 1월~1598년 10월** 사이에 일어났습니다.  \n\n- **시작(1592년 1월)**: 조선의 대표적인 장군인 권율이 이끄는 왜군이 일본에 침공했습니다.  \n- **종말(1598년 10월)**: 왜군이 패하고 조선이 회복되기 시작했습니다.  \n\n따라서 임진왜란은 **\"일찍\"이라고 보기 어렵습니다**. 고려 말~조선 초기(15"}

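Although this was my first time using the library, serving the model was easier than I expected. I was a little worried about serving the 32B model locally, so I served the 1.2B model instead.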
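To put a rough number on that worry, here is a minimal back-of-the-envelope sketch. It assumes the nominal parameter counts implied by the model names (1.2B and 32B) and 2 bytes per parameter for bfloat16, and it counts weights only. The helper name `weight_memory_gb` is my own, not part of any library.

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
# Ignores activations, KV cache, and framework overhead, so real usage is higher.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(num_params: float, dtype: str = "bfloat16") -> float:
    """Return the approximate size of the model weights in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Parameter counts are nominal values read off the model names (assumption).
    for name, params in [("EXAONE-4.0-1.2B", 1.2e9), ("EXAONE-4.0-32B", 32e9)]:
        print(f"{name}: ~{weight_memory_gb(params):.1f} GB of weights in bfloat16")
```

Under these assumptions the 1.2B model needs about 2.4 GB for weights while the 32B model needs about 64 GB, which is why the smaller one was the safer choice for a local test.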

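I have not tried it, but apparently a separate binary called lightning can even deploy the server to the cloud. In the end, the idea seems to be to hide serving-related optimizations inside the framework so that you can focus as much as possible on the model implementation.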

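The inference result is somewhat odd (the answer gets the history wrong, echoes the prompt, and stops mid-sentence), but I count it as a win that I served a model locally and got a response back.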
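The echoed prompt in the response is at least easy to explain: for a decoder-only model, `generate()` returns the prompt ids followed by the new ids, and server.py decodes the whole sequence. A minimal sketch of the usual fix is to slice off the prompt before decoding. The helper `new_token_ids` and the token ids below are made up for illustration; I have not run this variant against EXAONE itself.

```python
from typing import List, Sequence


def new_token_ids(output_ids: Sequence[int], prompt_len: int) -> List[int]:
    """Drop the first `prompt_len` ids, which generate() copies from the prompt."""
    return list(output_ids[prompt_len:])


# In predict() the same idea would look roughly like this (untested assumption):
#   new_ids = output[0][input_ids.shape[-1]:]
#   return self.tokenizer.decode(new_ids, skip_special_tokens=True)

if __name__ == "__main__":
    prompt = [101, 7, 8, 9]            # made-up ids standing in for the templated prompt
    generated = prompt + [42, 43, 2]   # prompt ids followed by the continuation
    print(new_token_ids(generated, len(prompt)))  # prints [42, 43, 2]
```

Passing `skip_special_tokens=True` to `decode` should also drop turn markers, provided the tokenizer registers them as special tokens. The cut-off ending is a separate matter and is presumably down to `max_new_tokens=128`.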

Things to try next

Experiment with the parameters passed to LitServe, such as batching, to see whether latency changes
Stream the response
Integrate observability
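For the first item, a small measurement harness would be needed before changing any server parameter. The following is only a sketch: `measure_latency` and `post_prompt` are hypothetical helpers of my own, and the URL and payload shape are copied from client.py. Requests are sent concurrently because server-side batching can only kick in when several requests are in flight at once.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def _timed(send: Callable[[], object]) -> float:
    """Run `send` once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    send()
    return time.perf_counter() - start


def measure_latency(send: Callable[[], object], n: int = 16, concurrency: int = 4) -> Dict[str, float]:
    """Call `send` n times using `concurrency` threads and summarize per-request latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        samples = sorted(pool.map(lambda _: _timed(send), range(n)))
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "max": samples[-1],
    }


def post_prompt(url: str = "http://127.0.0.1:8000/predict", prompt: str = "hello") -> None:
    """Send one request shaped like the one in client.py (needs the server running)."""
    import requests  # imported lazily so measure_latency stays dependency-free

    response = requests.post(url, json={"input": prompt}, timeout=120)
    response.raise_for_status()


# Usage, with server.py running in another terminal:
#   print(measure_latency(post_prompt, n=16, concurrency=4))
```

The plan would be to run this once as a baseline, then restart the server with different batching settings and compare. In the LitServe versions I am aware of, those settings are called `max_batch_size` and `batch_timeout`, but where they are passed may differ by version, so check the documentation for the installed release.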

From a broader MLOps perspective, I had in mind a larger project that also integrates a workflow tool to continuously train, deploy, and monitor models, but I will come back to that later.

References

LitServe documentation:

https://github.com/Lightning-AI/LitServe

EXAONE 4.0 documentation:

https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B


This post is licensed under CC BY 4.0 by the author.