Model Gallery

55 models from 1 repository

ced-base-f16

CED (Consistent Ensemble Distillation, Xiaomi) is a sound-event classifier that tags everyday sounds (baby cry, footsteps, glass breaking, alarms, dog bark, ...) into the 527-class AudioSet ontology. This is the f16 GGUF for the ced backend (a standalone C++/ggml port). Recommended default: fastest on CPU and near-lossless. Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

Repository: localai | License: apache-2.0

vibevoice-cpp

VibeVoice Realtime 0.5B (C++ / GGML, Q8_0) - native C++ port of Microsoft VibeVoice via the vibevoice-cpp backend. 24kHz mono TTS with voice cloning from a single reference voice prompt. Default voice prompt: en-Carter_man.

Repository: localai | License: mit

vibevoice-cpp-asr

VibeVoice ASR 7B (C++ / GGML, Q4_K) - long-form speech-to-text with speaker diarization. Returns per-speaker JSON segments with start/end timestamps. English-only. ~10 GB download.

Repository: localai | License: mit

face-detect-buffalo-l

Face recognition with insightface's `buffalo_l` pack (SCRFD-10GF detector + ResNet50 ArcFace 512-d embedder), ported to C++/ggml and shipped as a single GGUF for the `face-detect` backend. Highest accuracy of the buffalo line. No Python / onnxruntime / torch runtime: face-detect.cpp reads the detector and embedder architecture (`facedetect.arch`) directly from the GGUF metadata, so installing this entry is all that is needed to select buffalo_l. Drives the Embedding / Detect / FaceVerify / FaceAnalyze gRPC rpcs and the /v1/face/{verify,analyze,embed,detect} REST endpoints. This GGUF also embeds the MiniFASNet anti-spoof ensemble, available via the FaceVerify `anti_spoof` request flag. NON-COMMERCIAL RESEARCH USE ONLY: for commercial use see `face-detect-yunet-sface`.

Repository: localai | License: insightface-non-commercial

face-detect-buffalo-m

Face recognition with insightface's `buffalo_m` pack (SCRFD-2.5GF detector + ResNet50 ArcFace embedder), converted to a C++/ggml GGUF for the `face-detect` backend. Same recognition accuracy as `buffalo_l` with a cheaper detector: a good balance on mid-range hardware. The architecture (`facedetect.arch`) is read from the GGUF metadata, so this entry alone selects the buffalo_m engine. This GGUF also embeds the MiniFASNet anti-spoof ensemble, available via the FaceVerify `anti_spoof` request flag. NON-COMMERCIAL RESEARCH USE ONLY.

Repository: localai | License: insightface-non-commercial

face-detect-buffalo-s

Face recognition with insightface's `buffalo_s` pack (SCRFD-500MF detector + MBF 512-d embedder), converted to a C++/ggml GGUF for the `face-detect` backend. Small and CPU-friendly: a good fit for mid-range and edge deployments. The architecture (`facedetect.arch`) is read from the GGUF metadata, so this entry alone selects the buffalo_s engine. This GGUF also embeds the MiniFASNet anti-spoof ensemble, available via the FaceVerify `anti_spoof` request flag. NON-COMMERCIAL RESEARCH USE ONLY.

Repository: localai | License: insightface-non-commercial

face-detect-buffalo-sc

Face recognition with insightface's `buffalo_sc` pack (SCRFD-500MF detector + a small ArcFace embedder), converted to a C++/ggml GGUF for the `face-detect` backend. This is the smallest insightface pack: the lightest option for low-resource and edge deployments. The architecture (`facedetect.arch`) is read from the GGUF metadata, so this entry alone selects the buffalo_sc engine. If this GGUF embeds the MiniFASNet anti-spoof ensemble, it is available via the FaceVerify `anti_spoof` request flag. NON-COMMERCIAL RESEARCH USE ONLY.

Repository: localai | License: insightface-non-commercial

face-detect-antelopev2

Face recognition with insightface's `antelopev2` pack (SCRFD-10GF detector + ArcFace glint360k R100, 512-d embedder), converted to a C++/ggml GGUF for the `face-detect` backend. The higher-accuracy insightface pack: heavier, but the best fit when recognition quality matters more than speed. The architecture (`facedetect.arch`) is read from the GGUF metadata, so this entry alone selects the antelopev2 engine. If this GGUF embeds the MiniFASNet anti-spoof ensemble, it is available via the FaceVerify `anti_spoof` request flag. NON-COMMERCIAL RESEARCH USE ONLY.

Repository: localai | License: insightface-non-commercial

face-detect-yunet-sface

Face recognition with OpenCV Zoo weights: YuNet detector + SFace 128-d recognizer, converted to a C++/ggml GGUF for the `face-detect` backend. APACHE 2.0: safe for commercial use. Lower accuracy than the buffalo packs and no demographic head, but the commercial-friendly alternative to the insightface buffalo line. The architecture (`facedetect.arch`) is read from the GGUF metadata, so this entry alone selects the YuNet + SFace engine.

Repository: localai | License: apache-2.0

voice-detect-ecapa-tdnn

Speaker (voice) recognition with SpeechBrain's ECAPA-TDNN trained on VoxCeleb, ported to C++/ggml and shipped as a single GGUF for the `voice-detect` backend. 192-d L2-normalised embeddings, ~1.9% Equal Error Rate on VoxCeleb1-O. APACHE 2.0 - commercial-safe. No Python / torch runtime: voice-detect.cpp reads the embedding architecture (`voicedetect.arch`) directly from the GGUF metadata, so installing this entry is all that is needed to select ECAPA-TDNN. Drives the VoiceVerify / VoiceEmbed gRPC rpcs and the /v1/voice/{verify,embed,register,identify,forget} REST endpoints.

Repository: localai | License: apache-2.0
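Speaker verification with embeddings like the ones described above usually comes down to comparing two vectors. The following is a minimal sketch of that comparison using cosine similarity. It is not the backend's own scoring code: the function names are hypothetical and the `threshold` default is an illustrative placeholder that would need tuning on real data.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit length (a no-op for already normalised embeddings)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity of two embeddings of equal dimension."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    a_n, b_n = l2_normalise(a), l2_normalise(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def same_speaker(a, b, threshold=0.25):
    """Hypothetical decision rule: the threshold is an assumption, not a documented value."""
    return cosine_similarity(a, b) >= threshold
```

For embeddings that are already L2-normalised, the normalisation step changes nothing and the similarity reduces to a plain dot product.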

voice-detect-wespeaker-resnet34

Speaker recognition with WeSpeaker's ResNet34 trained on VoxCeleb, converted to a C++/ggml GGUF for the `voice-detect` backend. 256-d embeddings, CPU-friendly and runtime-free (no onnxruntime or torch). CC-BY-4.0. Use when you want WeSpeaker's ResNet34 topology instead of ECAPA-TDNN. The embedding architecture (`voicedetect.arch`) is read from the GGUF metadata, so this entry alone selects the engine.

Repository: localai | License: cc-by-4.0

voice-detect-eres2net

Speaker recognition with 3D-Speaker's ERes2Net trained on VoxCeleb, converted to a C++/ggml GGUF for the `voice-detect` backend. 192-d embeddings with strong verification accuracy. APACHE 2.0. The embedding architecture (`voicedetect.arch`) is read from the GGUF metadata, so this entry alone selects the ERes2Net engine.

Repository: localai | License: apache-2.0

voice-detect-campplus

Speaker recognition with 3D-Speaker's CAM++ trained on VoxCeleb, converted to a C++/ggml GGUF for the `voice-detect` backend. 192-d embeddings, a fast context-aware masking topology well-suited to CPU and edge deployments. APACHE 2.0. The embedding architecture (`voicedetect.arch`) is read from the GGUF metadata, so this entry alone selects the CAM++ engine.

Repository: localai | License: apache-2.0

voice-detect-emotion-wav2vec2

Voice analysis (age / gender / emotion) with audEERING's wav2vec2 model, converted to a C++/ggml GGUF for the `voice-detect` backend. Drives the VoiceAnalyze gRPC rpc and the /v1/voice/analyze REST endpoint, returning a continuous age estimate plus gender and emotion class scores for a single utterance. CC-BY-NC-SA-4.0 - research / non-commercial use only. The analysis architecture (`voicedetect.arch`) is read from the GGUF metadata, so this entry alone selects the wav2vec2 analyze head.

Repository: localai | License: cc-by-nc-sa-4.0

voice-detect-age-gender-wav2vec2

wav2vec2-large-robust age + gender analysis head (audeering/wav2vec2-large-robust-24-ft-age-gender), converted to a C++/ggml GGUF for the `voice-detect` backend. Drives the VoiceAnalyze gRPC rpc and the /v1/voice/analyze REST endpoint, returning a continuous age estimate plus gender class scores for a single utterance. CC-BY-NC-SA-4.0 - research / non-commercial use only. The analysis architecture (`voicedetect.arch`) is read from the GGUF metadata, so this entry alone selects the wav2vec2 analyze head.

Repository: localai | License: cc-by-nc-sa-4.0

rfdetr-cpp-nano

RF-DETR Nano object detection model, served via the native rfdetr.cpp backend (ggml + purego, no Python). Q8_0 quantization is the recommended default for CPU: same accuracy as F16/F32, ~20MB on disk, fastest CPU latency. Drop-in for the /v1/detection endpoint.

Repository: localai | License: apache-2.0

locate-anything-3b

NVIDIA LocateAnything-3B open-vocabulary object detection (visual grounding), served via the native locate-anything.cpp backend (C++/ggml + purego, no Python). Describe what to find in a text prompt and get labeled boxes back; separate multiple categories with . Q8_0 is the recommended default: box-identical to F16/F32, ~6.3GB, fastest CPU latency. Drop-in for the /v1/detection endpoint (pass the prompt).

Repository: localai | License: other
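As a rough illustration of calling the /v1/detection endpoint named in the detection entries above, the sketch below builds (but does not send) a JSON POST request. The request shape is an assumption: the field names `model`, `image`, and `prompt`, the base64 image encoding, and the `build_detection_request` helper itself are not taken from this page and should be checked against the server's API documentation.

```python
import base64
import json
import urllib.request


def build_detection_request(base_url, model, image_bytes, prompt=None):
    """Build a POST request for /v1/detection (payload field names are assumed)."""
    payload = {
        "model": model,
        # Assumption: the image is sent as a base64-encoded string.
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }
    if prompt is not None:
        # Assumption: open-vocabulary models take the text prompt in a "prompt" field.
        payload["prompt"] = prompt
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/detection",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen` against a running server. Keeping request construction separate from sending makes the payload easy to inspect before any network call.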

depth-anything-3-base

Depth Anything 3 (base) monocular metric depth + camera pose, served via the native depth-anything.cpp backend (C++/ggml + purego, no Python at inference). Given an image it returns a dense depth map plus the recovered camera extrinsics (3x4) and intrinsics (3x3). Use GenerateImage (src -> normalized depth PNG at dst) or Predict (JSON depth stats + pose). q4_k is the recommended CPU default.

Repository: localai | License: apache-2.0

depth-anything-2-base

Depth Anything V2 (base / ViT-B) monocular depth, served via the native depth-anything.cpp backend (C++/ggml + purego, no Python at inference). Given an image it returns a dense monocular depth map only — no camera pose, no confidence. This is the relative variant (relative inverse depth). Use GenerateImage (src -> normalized depth PNG at dst) or the Depth endpoint. q4_k is the recommended CPU default.

Repository: localai | License: apache-2.0

rfdetr-cpp-small

RF-DETR Small object detection model (DINOv2-small backbone, 512px input, 3 decoder layers), served via the native rfdetr.cpp backend (ggml + purego, no Python). A step up from Nano in accuracy while staying lightweight on CPU. F16 quantization is the recommended default: identical accuracy to F32 at roughly half the size. Drop-in for the /v1/detection endpoint.

Repository: localai | License: apache-2.0

wan-2.1-t2v-1.3b-ggml

Wan 2.1 T2V 1.3B — text-to-video diffusion model, GGUF-quantized for the stable-diffusion.cpp backend. Generates short (33-frame) 832x480 clips from a text prompt. Cheapest Wan variant, suitable for CPU-offloaded inference with ~10 GB of usable RAM.

Repository: localai | License: apache-2.0
