November 18, 2024
Distilling Llama 3.1 8B into 1B in torchtune
In this blog, we present a case study on distilling a Llama 3.1 8B model into a Llama 3.2 1B model using torchtune’s knowledge distillation recipe. We demonstrate how knowledge distillation (KD) can be used in post-training to improve performance on instruction-following tasks, and we showcase how users can leverage the recipe.
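The details of torchtune’s recipe are in the post itself; as a rough illustration of the underlying idea only, a minimal Hinton-style distillation loss in plain PyTorch might look like the sketch below (the temperature and weighting values are illustrative, not torchtune’s defaults):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    # Soft-target term: KL divergence between the teacher's and the student's
    # temperature-scaled distributions, scaled by T^2 as in standard KD.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```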
October 31, 2024
Deploying LLMs with TorchServe + vLLM
The vLLM engine is currently one of the top-performing ways to execute large language models (LLMs). It provides the vllm serve command as an easy option to deploy a model on a single machine. While this is convenient, serving these LLMs in production and at scale requires some more advanced features.
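For orientation (this is not taken from the post), a server started with vllm serve exposes an OpenAI-compatible HTTP API, so a deployed model can be smoke-tested with a few lines of Python; the host, port, and model name below are placeholder assumptions:

```python
# Minimal client sketch for an endpoint started with, e.g.:
#   vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct
# Host, port, and model name are assumptions for illustration.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```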
October 30, 2024
Triton Kernel Compilation Stages
The Triton open-source programming language and compiler offers a high-level, Python-based approach to creating efficient GPU code. In this blog, we highlight the underlying details of how a Triton program is compiled and the intermediate representations it passes through. For an introduction to Triton, we refer readers to this blog.
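For readers who have not written Triton before, a toy kernel like the following sketch is the kind of program whose compilation pipeline the post walks through; the note about the on-disk cache location reflects default Triton behavior and is an assumption, not something covered here:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
# The first launch triggers JIT compilation through Triton's intermediate
# representations; compiled artifacts are cached on disk (by default under
# ~/.triton/cache).
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)
```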
October 28, 2024
Unleashing the Power of AI on Mobile: LLM Inference for Llama 3.2 Quantized Models with ExecuTorch and KleidiAI
October 28, 2024
Getting started with PyTorch, ExecuTorch, and Ethos-U85 in three easy steps
ExecuTorch support for Ethos-U85