kv-cache
Here are 168 public repositories matching this topic...
A Redis server and distributed cluster implemented in Go.
Updated Sep 14, 2025 - Go
Unified KV Cache Compression Methods for Auto-Regressive Models
Updated Jan 4, 2025 - Python
LLM KV cache compression made easy
Updated Apr 1, 2026 - Python
LLM notes covering model inference, transformer model structure, and LLM framework code analysis.
Updated Apr 2, 2026 - Python
Implement Llama 3 inference step by step: grasp the core concepts, work through the derivations, and write the code.
Updated Feb 24, 2025 - Jupyter Notebook
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation (NeurIPS 2025)
Updated Sep 26, 2025 - Python
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
Updated Aug 1, 2024 - Python
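The H2O entry above refers to a published eviction policy: under a fixed cache budget, keep a small window of recent tokens plus the "heavy hitters", i.e. the cached tokens that have accumulated the most attention. A minimal sketch of that idea (not the repository's actual code; all names are illustrative):

```python
import numpy as np

def heavy_hitter_evict(keys, values, attn_scores, budget, recent=4):
    """Sketch of H2O-style KV eviction: retain the most recent tokens
    plus the tokens with the highest cumulative attention scores.

    keys, values: (seq_len, head_dim) cached tensors
    attn_scores:  (seq_len,) attention mass each cached token has
                  received, summed over decoding steps
    budget:       total number of KV entries to keep
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values, attn_scores
    # Always keep a short window of the most recent tokens.
    recent_idx = np.arange(seq_len - recent, seq_len)
    # Fill the remaining budget with the heaviest hitters among older tokens.
    older = np.argsort(attn_scores[: seq_len - recent])[::-1]
    heavy_idx = older[: budget - recent]
    keep = np.sort(np.concatenate([heavy_idx, recent_idx]))
    return keys[keep], values[keep], attn_scores[keep]
```

The key design point is that the scores are cumulative across decoding steps, so a token that mattered early stays resident even if a single recent step ignored it.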
Awesome-LLM-KV-Cache: A curated list of 📙Awesome LLM KV Cache Papers with Codes.
Updated Mar 3, 2025
[ICLR'26] The official code implementation for "Cache-to-Cache: Direct Semantic Communication Between Large Language Models"
Updated Mar 13, 2026 - Python
Run larger LLMs with longer contexts on Apple Silicon by using differentiated precision for KV cache quantization. KVSplit enables 8-bit keys & 4-bit values, reducing memory by 59% with <1% quality loss. Includes benchmarking, visualization, and one-command setup. Optimized for M1/M2/M3 Macs with Metal support.
Updated May 21, 2025 - Python
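KVSplit's headline number is easy to sanity-check: FP16 spends 16 bits per key element and 16 per value element, while 8-bit keys plus 4-bit values spend 12 bits per pair, a 62.5% raw reduction before quantization metadata (per-group scales and zero points), which plausibly lands near the reported 59%. A back-of-the-envelope calculator (the model shape below is a hypothetical 8B-class configuration, not taken from the repository):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   key_bits=16, value_bits=16):
    """Bytes used by a transformer KV cache, ignoring quantization
    metadata such as per-group scales."""
    elems = seq_len * n_layers * n_kv_heads * head_dim
    return elems * (key_bits + value_bits) / 8

# Hypothetical 8B-class model: 32 layers, 8 KV heads, head_dim 128.
fp16 = kv_cache_bytes(4096, 32, 8, 128)
split = kv_cache_bytes(4096, 32, 8, 128, key_bits=8, value_bits=4)
print(f"raw reduction: {1 - split / fp16:.1%}")  # 62.5% before metadata
```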
[Survey] Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization
Updated Mar 24, 2026 - Python
LLM inference with 7x longer context. Pure C, zero dependencies. Lossless KV cache compression + single-header library.
Updated Apr 5, 2026 - C
HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of HierarchicalKV is to store key-value feature embeddings on the high-bandwidth memory (HBM) of GPUs and in host memory. It can also be used as a generic key-value store.
Updated Feb 27, 2026 - Cuda
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
Updated Apr 13, 2025 - Python
LLM inference engine from scratch — paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graph, tensor parallelism, MoE expert parallelism, OpenAI-compatible serving
Updated Mar 28, 2026 - Python
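Of the features listed in the entry above, a paged KV cache is the most structural: the cache is split into fixed-size physical blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so memory is allocated on demand and freed blocks are reused. A toy allocator sketch, with all names illustrative and no relation to the repository's actual API:

```python
class PagedKVCache:
    """Toy paged KV cache allocator: fixed-size blocks, per-sequence
    block tables, free-list reuse. Stores bookkeeping only, no tensors."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # available physical block ids
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens cached

    def append_token(self, seq_id):
        """Reserve a KV slot for the next token of a sequence."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block is full
            if not self.free:
                raise MemoryError("KV cache exhausted")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def slot(self, seq_id, pos):
        """Map a logical token position to (physical block, offset)."""
        table = self.tables[seq_id]
        return table[pos // self.block_size], pos % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because allocation happens one block at a time, internal fragmentation is bounded by one block per sequence, which is what lets such engines pack many concurrent sequences into fixed GPU memory.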
[NeurIPS'25] KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
Updated Nov 3, 2025 - Python
Completion After Prompt Probability: make your LLM make a choice.
Updated Nov 2, 2024 - Python
TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% FP16 speed.
Updated Apr 2, 2026 - Python