
A Brief Introduction to Optimized Batched Inference with vLLM

In a previous article, we talked about how off-the-shelf, pre-trained models made available through Hugging Face's model hub could be leveraged to fulfill a wide range of Natural Language Processing (NLP) tasks. These included text classification, question answering, and content generation — either by taking advantage of the models' base knowledge or by fine-tuning them to create specialized models that homed in on particular subject areas or contexts.

In this article, we will introduce the vLLM library for optimizing the performance of these models, and present a mechanism through which we can take advantage of a large language model (LLM)'s text generation capabilities to perform more specific, context-sensitive tasks.
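As a quick taste of what the full article covers, here is a minimal sketch of batched generation with vLLM. The model name, prompts, and sampling settings below are illustrative placeholders, not values from the article itself:

```python
from vllm import LLM, SamplingParams

# Load a pre-trained model from the Hugging Face hub.
# "facebook/opt-125m" is just a small example model.
llm = LLM(model="facebook/opt-125m")

# Illustrative sampling settings; tune these for your use case.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# vLLM schedules these prompts together, batching them
# automatically for higher throughput.
prompts = [
    "The capital of France is",
    "Write a haiku about the sea:",
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

The key point is that `generate` accepts a whole list of prompts at once, letting vLLM's scheduler keep the GPU saturated instead of processing requests one at a time.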

Read the rest of the article here.
