NVIDIA Dynamo: Distributed LLM Inference

Dynamo is NVIDIA’s open-source, datacenter-scale distributed inference framework for serving generative AI and reasoning models. Built in Rust for performance and Python for extensibility, it supports disaggregated prefill and decode, dynamic GPU scheduling, and LLM-aware request routing across multi-node, multi-GPU topologies. The project has 6k+ GitHub stars and supports backends including TensorRT-LLM, vLLM, and SGLang. ...

October 1, 2025 · 10 min · 2045 words · PeaBrane