LocalAI - Models

voxtral-mini-4b-realtime

Voxtral Mini 4B Realtime is a speech-to-text model from Mistral AI. It is a 4B parameter model optimized for fast, accurate audio transcription with low latency, making it ideal for real-time applications. The model uses the Voxtral architecture for efficient audio processing.

ace-step-turbo

ACE-Step 1.5 Turbo is a music generation model that can create music from text descriptions, lyrics, or audio samples. It supports both simple text-to-music generation and advanced generation driven by metadata such as BPM, key scale, and time signature.

Links

https://huggingface.co/ACE-Step/Ace-Step1.5

acestep-cpp-turbo-4b

ACE-Step 1.5 Turbo (C++ / GGML) with the 4B LM: higher-quality music generation from text and lyrics. It uses the larger 4B-parameter LM for better metadata/code generation and produces stereo 48 kHz output.

face-detect-buffalo-l

Face recognition with insightface's `buffalo_l` pack (SCRFD-10GF detector + ResNet50 ArcFace 512-d embedder), ported to C++/ggml and shipped as a single GGUF for the `face-detect` backend. Highest accuracy of the buffalo line. No Python / onnxruntime / torch runtime: face-detect.cpp reads the detector and embedder architecture (`facedetect.arch`) directly from the GGUF metadata, so installing this entry is all that is needed to select buffalo_l. Drives the Embedding / Detect / FaceVerify / FaceAnalyze gRPC rpcs and the /v1/face/{verify,analyze,embed,detect} REST endpoints. This GGUF also embeds the MiniFASNet anti-spoof ensemble, available via the FaceVerify `anti_spoof` request flag. NON-COMMERCIAL RESEARCH USE ONLY: for commercial use see `face-detect-yunet-sface`.
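The brace shorthand `/v1/face/{verify,analyze,embed,detect}` in the entry above stands for four separate REST paths. A minimal sketch that expands it into full URLs follows. The base URL is an assumption about a local deployment, and `face_endpoints` is a hypothetical helper, not part of LocalAI. Request and response bodies are not described in this listing, so they are left out.

```python
# Assumption: LocalAI is reachable at this address; adjust for your deployment.
BASE_URL = "http://localhost:8080"

# The four actions named in the entry: /v1/face/{verify,analyze,embed,detect}.
FACE_ACTIONS = ("verify", "analyze", "embed", "detect")


def face_endpoints(base_url: str = BASE_URL) -> dict:
    """Map each face action to its full REST URL (hypothetical helper)."""
    root = base_url.rstrip("/")
    return {action: f"{root}/v1/face/{action}" for action in FACE_ACTIONS}


if __name__ == "__main__":
    for action, url in face_endpoints().items():
        print(f"{action:8s} {url}")
```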

face-detect-buffalo-m

Face recognition with insightface's `buffalo_m` pack (SCRFD-2.5GF detector + ResNet50 ArcFace embedder), converted to a C++/ggml GGUF for the `face-detect` backend. Same recognition accuracy as `buffalo_l` with a cheaper detector: a good balance on mid-range hardware. The architecture (`facedetect.arch`) is read from the GGUF metadata, so this entry alone selects the buffalo_m engine. This GGUF also embeds the MiniFASNet anti-spoof ensemble, available via the FaceVerify `anti_spoof` request flag. NON-COMMERCIAL RESEARCH USE ONLY.

face-detect-buffalo-s

Face recognition with insightface's `buffalo_s` pack (SCRFD-500MF detector + MBF 512-d embedder), converted to a C++/ggml GGUF for the `face-detect` backend. Small and CPU-friendly: a good fit for mid-range and edge deployments. The architecture (`facedetect.arch`) is read from the GGUF metadata, so this entry alone selects the buffalo_s engine. This GGUF also embeds the MiniFASNet anti-spoof ensemble, available via the FaceVerify `anti_spoof` request flag. NON-COMMERCIAL RESEARCH USE ONLY.

face-detect-buffalo-sc

Face recognition with insightface's `buffalo_sc` pack (SCRFD-500MF detector + a small ArcFace embedder), converted to a C++/ggml GGUF for the `face-detect` backend. This is the smallest insightface pack: the lightest option for low-resource and edge deployments. The architecture (`facedetect.arch`) is read from the GGUF metadata, so this entry alone selects the buffalo_sc engine. If this GGUF embeds the MiniFASNet anti-spoof ensemble, it is available via the FaceVerify `anti_spoof` request flag. NON-COMMERCIAL RESEARCH USE ONLY.

face-detect-antelopev2

Face recognition with insightface's `antelopev2` pack (SCRFD-10GF detector + ArcFace glint360k R100, 512-d embedder), converted to a C++/ggml GGUF for the `face-detect` backend. This is the higher-accuracy insightface pack: heavier, but the best fit when recognition quality matters more than speed. The architecture (`facedetect.arch`) is read from the GGUF metadata, so this entry alone selects the antelopev2 engine. If this GGUF embeds the MiniFASNet anti-spoof ensemble, it is available via the FaceVerify `anti_spoof` request flag. NON-COMMERCIAL RESEARCH USE ONLY.

face-detect-yunet-sface

Face recognition with OpenCV Zoo weights: YuNet detector + SFace 128-d recognizer, converted to a C++/ggml GGUF for the `face-detect` backend. APACHE 2.0: safe for commercial use. Lower accuracy than the buffalo packs and no demographic head, but the commercial-friendly alternative to the insightface buffalo line. The architecture (`facedetect.arch`) is read from the GGUF metadata, so this entry alone selects the YuNet + SFace engine.

voice-detect-ecapa-tdnn

Speaker (voice) recognition with SpeechBrain's ECAPA-TDNN trained on VoxCeleb, ported to C++/ggml and shipped as a single GGUF for the `voice-detect` backend. 192-d L2-normalised embeddings, ~1.9% Equal Error Rate on VoxCeleb1-O. APACHE 2.0 - commercial-safe. No Python / torch runtime: voice-detect.cpp reads the embedding architecture (`voicedetect.arch`) directly from the GGUF metadata, so installing this entry is all that is needed to select ECAPA-TDNN. Drives the VoiceVerify / VoiceEmbed gRPC rpcs and the /v1/voice/{verify,embed,register,identify,forget} REST endpoints.
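Because the embeddings are L2-normalised, the cosine similarity of two of them is simply their dot product. The sketch below illustrates that comparison in plain Python. It is not the backend's actual scoring code, and the `threshold` value is a hypothetical placeholder that would need tuning on real data.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity; for unit vectors this equals the dot product."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    a_unit, b_unit = l2_normalise(a), l2_normalise(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


def same_speaker(a, b, threshold=0.5):
    """Accept the pair when similarity reaches the threshold.

    The default threshold is an illustrative assumption, not a value
    taken from the model or the voice-detect backend.
    """
    return cosine_similarity(a, b) >= threshold


if __name__ == "__main__":
    # Toy 4-d vectors stand in for real 192-d embeddings.
    enrolled = [0.2, 0.1, 0.9, 0.3]
    probe = [0.25, 0.05, 0.85, 0.35]
    print(cosine_similarity(enrolled, probe), same_speaker(enrolled, probe))
```

Normalising inside `cosine_similarity` keeps the function correct even if a caller passes embeddings that were not normalised beforehand.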

voice-detect-wespeaker-resnet34

Speaker recognition with WeSpeaker's ResNet34 trained on VoxCeleb, converted to a C++/ggml GGUF for the `voice-detect` backend. 256-d embeddings, CPU-friendly and runtime-free (no onnxruntime or torch). CC-BY-4.0. Use when you want WeSpeaker's ResNet34 topology instead of ECAPA-TDNN. The embedding architecture (`voicedetect.arch`) is read from the GGUF metadata, so this entry alone selects the engine.

voice-detect-eres2net

Speaker recognition with 3D-Speaker's ERes2Net trained on VoxCeleb, converted to a C++/ggml GGUF for the `voice-detect` backend. 192-d embeddings with strong verification accuracy. APACHE 2.0. The embedding architecture (`voicedetect.arch`) is read from the GGUF metadata, so this entry alone selects the ERes2Net engine.

voice-detect-campplus

Speaker recognition with 3D-Speaker's CAM++ trained on VoxCeleb, converted to a C++/ggml GGUF for the `voice-detect` backend. 192-d embeddings, a fast context-aware masking topology well-suited to CPU and edge deployments. APACHE 2.0. The embedding architecture (`voicedetect.arch`) is read from the GGUF metadata, so this entry alone selects the CAM++ engine.

voice-detect-emotion-wav2vec2

Voice analysis (age / gender / emotion) with audEERING's wav2vec2 model, converted to a C++/ggml GGUF for the `voice-detect` backend. Drives the VoiceAnalyze gRPC rpc and the /v1/voice/analyze REST endpoint, returning a continuous age estimate plus gender and emotion class scores for a single utterance. CC-BY-NC-SA-4.0 - research / non-commercial use only. The analysis architecture (`voicedetect.arch`) is read from the GGUF metadata, so this entry alone selects the wav2vec2 analyze head.

voice-detect-age-gender-wav2vec2

wav2vec2-large-robust age + gender analysis head (audeering/wav2vec2-large-robust-24-ft-age-gender), converted to a C++/ggml GGUF for the `voice-detect` backend. Drives the VoiceAnalyze gRPC rpc and the /v1/voice/analyze REST endpoint, returning a continuous age estimate plus gender class scores for a single utterance. CC-BY-NC-SA-4.0 - research / non-commercial use only. The analysis architecture (`voicedetect.arch`) is read from the GGUF metadata, so this entry alone selects the wav2vec2 analyze head.

google-gemma-3-27b-it-qat-q4_0-small

This is a requantized version of https://huggingface.co/google/gemma-3-27b-it-qat-q4_0-gguf. The official QAT weights released by Google use fp16 (instead of Q6_K) for the embeddings table, which makes the official model take significantly more memory (and storage) than a Q4_0 quant is supposed to take. Requantizing with llama.cpp achieves a very similar result. Note that this model ends up smaller than the Q4_0 from Bartowski. This is because llama.cpp sets some tensors to Q4_1 when quantizing models to Q4_0 with imatrix, but this is a static quant. The perplexity score for this model is even lower than that of the original model by Google, but the results are within the margin of error, so it's probably just luck. I also fixed the control token metadata, which was slightly degrading the performance of the model in instruct mode.
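To get a feel for why the embeddings table matters, here is a rough size comparison between fp16 and Q6_K storage. Q6_K in ggml packs 256 weights into a 210-byte block, which works out to 6.5625 bits per weight. The table shape used below (`VOCAB_SIZE`, `HIDDEN_SIZE`) is an assumption chosen for illustration and is not taken from this listing; substitute the real tensor shape when checking an actual file.

```python
FP16_BITS = 16.0
# Q6_K block: 256 weights stored in 210 bytes.
Q6_K_BITS = 210 * 8 / 256  # 6.5625 bits per weight


def tensor_bytes(rows: int, cols: int, bits_per_weight: float) -> float:
    """Approximate storage size of a rows x cols tensor in bytes."""
    return rows * cols * bits_per_weight / 8


# Assumed embeddings table shape, for illustration only.
VOCAB_SIZE = 262_144
HIDDEN_SIZE = 5_376

if __name__ == "__main__":
    gib = 1024 ** 3
    fp16 = tensor_bytes(VOCAB_SIZE, HIDDEN_SIZE, FP16_BITS)
    q6k = tensor_bytes(VOCAB_SIZE, HIDDEN_SIZE, Q6_K_BITS)
    print(f"fp16 table: {fp16 / gib:.2f} GiB")
    print(f"Q6_K table: {q6k / gib:.2f} GiB")
    print(f"saved:      {(fp16 - q6k) / gib:.2f} GiB")
```

Whatever the exact shape, the ratio between the two formats is fixed at 16 / 6.5625, so an fp16 table takes roughly 2.4 times the space of the same table in Q6_K.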

meta-llama_llama-4-scout-17b-16e-instruct

The Llama 4 collection of models are natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. These Llama 4 models mark the beginning of a new era for the Llama ecosystem. We are launching two efficient models in the Llama 4 series, Llama 4 Scout, a 17 billion parameter model with 16 experts, and Llama 4 Maverick, a 17 billion parameter model with 128 experts.

l3.3-70b-magnum-v4-se

The Magnum v4 series is complete, but here's something a little extra I wanted to tack on as I wasn't entirely satisfied with the results of v4 72B. "SE" for Special Edition - this model is finetuned from meta-llama/Llama-3.3-70B-Instruct as an rsLoRA adapter. The dataset is a slightly revised variant of the v4 data with some elements of the v2 data re-introduced. The objective, as with the other Magnum models, is to emulate the prose style and quality of the Claude 3 Sonnet/Opus series of models on a local scale, so don't be surprised to see "Claude-isms" in its output.

steelskull_l3.3-mokume-gane-r1-70b

Named after the Japanese metalworking technique 'Mokume-gane' (木目金), meaning 'wood grain metal', this model embodies the artistry of creating distinctive layered patterns through the careful mixing of different components. Just as Mokume-gane craftsmen blend various metals to create unique visual patterns, this model combines specialized AI components to generate creative and unexpected outputs.

steelskull_l3.3-mokume-gane-r1-70b-v1.1

Named after the Japanese metalworking technique 'Mokume-gane' (木目金), meaning 'wood grain metal', this model embodies the artistry of creating distinctive layered patterns through the careful mixing of different components. Just as Mokume-gane craftsmen blend various metals to create unique visual patterns, this model combines specialized AI components to generate creative and unexpected outputs.

llama-3.2-3b-instruct:q8_0

The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text-only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks. Model Developer: Meta. Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Links

https://huggingface.co/hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF
