LocalAI - Models

lfm2.5-1.2b-instruct

Try LFM • Docs • LEAP • Discord # LFM2.5-1.2B-Instruct LFM2.5 is a new family of hybrid models designed for **on-device deployment**. It builds on the LFM2 architecture with extended pre-training and reinforcement learning. - **Best-in-class performance**: A 1.2B model rivaling much larger models, bringing high-quality AI to your pocket. - **Fast edge inference**: 239 tok/s decode on AMD CPU, 82 tok/s on mobile NPU. Runs under 1GB of memory with day-one support for llama.cpp, MLX, and vLLM. - **Scaled training**: Extended pre-training from 10T to 28T tokens and large-scale multi-stage reinforcement learning. Find more information about LFM2.5 in our blog post. ## 🗒️ Model Details LFM2.5-1.2B-Instruct is a general-purpose text-only model with the following features: ...

Links

https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct-GGUF

Tags

lfm2.5-8b-a1b

Try LFM • Docs • LEAP • Discord # LFM2.5-8B-A1B LFM2.5 is a new family of hybrid models designed for on-device deployment. It builds on the LFM2 architecture with extended pre-training and reinforcement learning. - **On-device personal assistant**: Designed to power real-life applications, chaining tool calls, and following complex instructions on all devices. - **Compressed performance**: Competitive with much larger dense and MoE models on instruction following and agentic tasks. - **Unmatched throughput**: Fastest in its size class on both CPU and GPU inference, with day-one support for llama.cpp, MLX, vLLM, and SGLang. Find more information about LFM2.5-8B-A1B in our blog post. **AA-Omniscience Index (higher is better) rewards correct answers and penalizes hallucinations. Scores range from -100 to 100. See more results on Artificial Analysis.* ## 🗒️ Model Details LFM2.5-8B-A1B is a general-purpose text-only model with the following features: ...

Links

https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF

Tags

ced-base-f16

CED (Consistent Ensemble Distillation, Xiaomi) is a sound-event classifier that tags everyday sounds (baby cry, footsteps, glass breaking, alarms, dog bark, ...) into the 527-class AudioSet ontology. This is the f16 GGUF for the ced backend (a standalone C++/ggml port). Recommended default: fastest on CPU and near-lossless. Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

Links

Tags

ced-tiny-f16

CED-tiny (5.5M params, Pi-class / edge) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended (fastest on CPU)). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

Links

Tags

ced-mini-f16

CED-mini (9.6M params, low-power) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended (fastest on CPU)). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

Links

Tags

ced-small-f16

CED-small (22M params, balanced size/accuracy) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended (fastest on CPU)). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

Links

Tags

allenai_olmo-3.1-32b-think

The **Olmo-3.1-32B-Think** model is a large language model (LLM) optimized for efficient inference using quantized versions. It is a quantized version of the original **allenai/Olmo-3.1-32B-Think** model, developed by **bartowski** using the **imatrix** quantization method. ### Key Features: - **Base Model**: `allenai/Olmo-3.1-32B-Think` (unquantized version). - **Quantized Versions**: Available in multiple formats (e.g., `Q6_K_L`, `Q4_1`, `bf16`) with varying precision (e.g., Q8_0, Q6_K_L, Q5_K_M). These are derived from the original model using the **imatrix calibration dataset**. - **Performance**: Optimized for low-memory usage and efficient inference on GPUs/CPUs. Recommended quantization types include `Q6_K_L` (near-perfect quality) or `Q4_K_M` (default, balanced performance). - **Downloads**: Available via Hugging Face CLI. Split into multiple files if needed for large models. - **License**: Apache-2.0. ### Recommended Quantization: - Use `Q6_K_L` for highest quality (near-perfect performance). - Use `Q4_K_M` for balanced performance and size. - Avoid lower-quality options (e.g., `Q3_K_S`) unless specific hardware constraints apply. This model is ideal for deploying on GPUs/CPUs with limited memory, leveraging efficient quantization for practical use cases.

Links

https://huggingface.co/bartowski/allenai_Olmo-3.1-32B-Think-GGUF

Tags

lfm2-1.2b

LFM2-1.2B is a hybrid liquid model designed for edge AI and on-device deployment, offering fast inference and multilingual support across 8 languages. It's optimized for agentic tasks, data extraction, and multi-turn conversations with efficient CPU/GPU/NPU compatibility.

Links

Tags

insightface-buffalo-s

Small insightface pack (SCRFD-500MF detector + MBF 512-d embedder + genderage, ~159MB). Good fit for mid-range CPU deployments. NON-COMMERCIAL RESEARCH USE ONLY.

Links

https://github.com/deepinsight/insightface

Tags

insightface-opencv-int8

Int8-quantized OpenCV Zoo face pair (YuNet int8 + SFace int8, ~12MB). Roughly 3x smaller and noticeably faster on CPU than the fp32 variant at comparable accuracy for face tasks. APACHE 2.0 — commercial-safe. Weights are downloaded on install via LocalAI's gallery mechanism.

Links

https://github.com/opencv/opencv_zoo

Tags

face-detect-buffalo-s

Face recognition with insightface's `buffalo_s` pack (SCRFD-500MF detector + MBF 512-d embedder), converted to a C++/ggml GGUF for the `face-detect` backend. Small and CPU-friendly: a good fit for mid-range and edge deployments. The architecture (`facedetect.arch`) is read from the GGUF metadata, so this entry alone selects the buffalo_s engine. This GGUF also embeds the MiniFASNet anti-spoof ensemble, available via the FaceVerify `anti_spoof` request flag. NON-COMMERCIAL RESEARCH USE ONLY.

Links

Tags

wespeaker-resnet34

Speaker recognition with WeSpeaker's ResNet34 trained on VoxCeleb, exported to ONNX. 256-d embeddings, CPU-friendly — avoids the PyTorch runtime entirely (onnxruntime only). APACHE 2.0. Pair with the `speaker-recognition` backend's OnnxDirectEngine. Use when ECAPA-TDNN's torch dependency is undesirable (small images, edge deployments).

Links

https://github.com/wenet-e2e/wespeaker

Tags

voice-detect-wespeaker-resnet34

Speaker recognition with WeSpeaker's ResNet34 trained on VoxCeleb, converted to a C++/ggml GGUF for the `voice-detect` backend. 256-d embeddings, CPU-friendly and runtime-free (no onnxruntime or torch). CC-BY-4.0. Use when you want WeSpeaker's ResNet34 topology instead of ECAPA-TDNN. The embedding architecture (`voicedetect.arch`) is read from the GGUF metadata, so this entry alone selects the engine.

Links

Tags

voice-detect-campplus

Speaker recognition with 3D-Speaker's CAM++ trained on VoxCeleb, converted to a C++/ggml GGUF for the `voice-detect` backend. 192-d embeddings, a fast context-aware masking topology well-suited to CPU and edge deployments. APACHE 2.0. The embedding architecture (`voicedetect.arch`) is read from the GGUF metadata, so this entry alone selects the CAM++ engine.

Links

Tags

rfdetr-cpp-nano

RF-DETR Nano object detection model, served via the native rfdetr.cpp backend (ggml + purego, no Python). Q8_0 quantization is the recommended default for CPU: same accuracy as F16/F32, ~20MB on disk, fastest CPU latency. Pure C++/ggml runtime; no Python dependencies. Drop-in for the /v1/detection endpoint.

Links

Tags

locate-anything-3b

NVIDIA LocateAnything-3B open-vocabulary object detection (visual grounding), served via the native locate-anything.cpp backend (C++/ggml + purego, no Python). Describe what to find in a text prompt and get labeled boxes back; separate multiple categories with . Q8_0 is the recommended default: box-identical to F16/F32, ~6.3GB, fastest CPU latency. Drop-in for the /v1/detection endpoint (pass the prompt).

Links

Tags

depth-anything-3-base

Depth Anything 3 (base) monocular metric depth + camera pose, served via the native depth-anything.cpp backend (C++/ggml + purego, no Python at inference). Given an image it returns a dense depth map plus the recovered camera extrinsics (3x4) and intrinsics (3x3). Use GenerateImage (src -> normalized depth PNG at dst) or Predict (JSON depth stats + pose). q4_k is the recommended CPU default.

Links

Tags

depth-anything-3-small

Depth Anything 3 (small / vits), f32 — the smallest backbone (~131 MB) for fast CPU depth + camera pose. Same output as base at lower latency.

Links

Tags

depth-anything-2-base

Depth Anything V2 (base / ViT-B) monocular depth, served via the native depth-anything.cpp backend (C++/ggml + purego, no Python at inference). Given an image it returns a dense monocular depth map only — no camera pose, no confidence. This is the relative variant (relative inverse depth). Use GenerateImage (src -> normalized depth PNG at dst) or the Depth endpoint. q4_k is the recommended CPU default.

Links

Tags

depth-anything-2-small

Depth Anything V2 (small / ViT-S), f32 — the smallest, fastest backbone for relative monocular depth on CPU. Depth only (no pose). Use GenerateImage (src -> depth PNG) or the Depth endpoint.

Links

Tags

rfdetr-cpp-base

RF-DETR Base object detection model, served via the native rfdetr.cpp backend. F16 quantization is recommended on CPU: identical accuracy to F32, half the size, fastest.

Links

Tags

Model Gallery

Filter by type:

Filter by tags:

lfm2.5-1.2b-instruct

lfm2.5-8b-a1b

ced-base-f16

ced-tiny-f16

ced-mini-f16

ced-small-f16

allenai_olmo-3.1-32b-think

lfm2-1.2b

insightface-buffalo-s

insightface-opencv-int8

face-detect-buffalo-s

wespeaker-resnet34

voice-detect-wespeaker-resnet34

voice-detect-campplus

rfdetr-cpp-nano

locate-anything-3b

depth-anything-3-base

depth-anything-3-small

depth-anything-2-base

depth-anything-2-small

rfdetr-cpp-base