kv-cache
Here are 168 public repositories matching this topic...
A Redis server and distributed cluster implemented in Go.
Updated Sep 14, 2025 - Go
Unified KV Cache Compression Methods for Auto-Regressive Models
Updated Jan 4, 2025 - Python
LLM KV cache compression made easy
Updated Apr 1, 2026 - Python
LLM notes covering model inference, transformer model structure, and LLM framework code analysis.
Updated Apr 2, 2026 - Python
Implement Llama 3 inference step by step: grasp the core concepts, work through the derivations, and write the code.
Updated Feb 24, 2025 - Jupyter Notebook
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation (NeurIPS 2025)
Updated Sep 26, 2025 - Python
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
Updated Aug 1, 2024 - Python
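The H2O entry above refers to a published eviction policy: under a fixed cache budget, keep a small window of recent tokens plus the "heavy hitters", i.e. the cached tokens that have accumulated the most attention. A minimal sketch of that idea (not the repository's actual code; all names are illustrative):

```python
import numpy as np

def heavy_hitter_evict(keys, values, attn_scores, budget, recent=4):
    """Sketch of H2O-style KV eviction: retain the most recent tokens
    plus the tokens with the highest cumulative attention scores.

    keys, values: (seq_len, head_dim) cached tensors
    attn_scores:  (seq_len,) attention mass each cached token has
                  received, summed over decoding steps
    budget:       total number of KV entries to keep
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values, attn_scores
    # Always keep a short window of the most recent tokens.
    recent_idx = np.arange(seq_len - recent, seq_len)
    # Fill the remaining budget with the heaviest hitters among older tokens.
    older = np.argsort(attn_scores[: seq_len - recent])[::-1]
    heavy_idx = older[: budget - recent]
    keep = np.sort(np.concatenate([heavy_idx, recent_idx]))
    return keys[keep], values[keep], attn_scores[keep]
```

The key design point is that the scores are cumulative across decoding steps, so a token that mattered early stays resident even if a single recent step ignored it.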
Awesome-LLM-KV-Cache: A curated list of 📙Awesome LLM KV Cache Papers with Codes.
Updated Mar 3, 2025
[ICLR'26] The official code implementation for "Cache-to-Cache: Direct Semantic Communication Between Large Language Models"
Updated Mar 13, 2026 - Python
Run larger LLMs with longer contexts on Apple Silicon by using differentiated precision for KV cache quantization. KVSplit enables 8-bit keys & 4-bit values, reducing memory by 59% with <1% quality loss. Includes benchmarking, visualization, and one-command setup. Optimized for M1/M2/M3 Macs with Metal support.
Updated May 21, 2025 - Python
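KVSplit's headline number is easy to sanity-check: FP16 spends 16 bits per key element and 16 per value element, while 8-bit keys plus 4-bit values spend 12 bits per pair, a 62.5% raw reduction before quantization metadata (per-group scales and zero points), which plausibly lands near the reported 59%. A back-of-the-envelope calculator (the model shape below is a hypothetical 8B-class configuration, not taken from the repository):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   key_bits=16, value_bits=16):
    """Bytes used by a transformer KV cache, ignoring quantization
    metadata such as per-group scales."""
    elems = seq_len * n_layers * n_kv_heads * head_dim
    return elems * (key_bits + value_bits) / 8

# Hypothetical 8B-class model: 32 layers, 8 KV heads, head_dim 128.
fp16 = kv_cache_bytes(4096, 32, 8, 128)
split = kv_cache_bytes(4096, 32, 8, 128, key_bits=8, value_bits=4)
print(f"raw reduction: {1 - split / fp16:.1%}")  # 62.5% before metadata
```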
[Survey] Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization
Updated Mar 24, 2026 - Python
LLM inference with 7x longer context. Pure C, zero dependencies. Lossless KV cache compression + single-header library.
Updated Apr 5, 2026 - C
HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of HierarchicalKV is to store key-value feature embeddings on the high-bandwidth memory (HBM) of GPUs and in host memory. It can also be used as a generic key-value store.
Updated Feb 27, 2026 - Cuda
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
Updated Apr 13, 2025 - Python
LLM inference engine from scratch — paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graph, tensor parallelism, MoE expert parallelism, OpenAI-compatible serving
Updated Mar 28, 2026 - Python
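Of the features listed in the entry above, a paged KV cache is the most structural: the cache is split into fixed-size physical blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so memory is allocated on demand and freed blocks are reused. A toy allocator sketch, with all names illustrative and no relation to the repository's actual API:

```python
class PagedKVCache:
    """Toy paged KV cache allocator: fixed-size blocks, per-sequence
    block tables, free-list reuse. Stores bookkeeping only, no tensors."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # available physical block ids
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens cached

    def append_token(self, seq_id):
        """Reserve a KV slot for the next token of a sequence."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block is full
            if not self.free:
                raise MemoryError("KV cache exhausted")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def slot(self, seq_id, pos):
        """Map a logical token position to (physical block, offset)."""
        table = self.tables[seq_id]
        return table[pos // self.block_size], pos % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because allocation happens one block at a time, internal fragmentation is bounded by one block per sequence, which is what lets such engines pack many concurrent sequences into fixed GPU memory.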
[NeurIPS'25] KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
Updated Nov 3, 2025 - Python
Completion After Prompt Probability: make your LLM make a choice.
Updated Nov 2, 2024 - Python
TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% FP16 speed.
Updated Apr 2, 2026 - Python