LocalAI - Models

gemmable-4-12b-mtp

## Gemmable 4 12B Gemmable 4 12B is a GGUF export of Gemma 4 12B fine-tuned on Fable-5 style reasoning and assistant traces. ## Highlights - Base model: `google/gemma-4-12B` - Format: GGUF - Training style: Fable-5 style reasoning and assistant traces - Distribution: fp16 GGUF plus matching assistant GGUFs for each quant - Intended use: local inference, coding, reasoning, and assistant workflows ## How to use ### llama.cpp Standard load: ```bash llama-server -m "gemmable-4-12b-fp16.gguf" ``` Speculative / draft-MTP load: ```bash llama-server -m "gemmable-4-12b-Q4_K_M.gguf" \ --spec-draft-model "gemmable-4-12b-Q4_K_M-mtp.gguf" \ --spec-type draft-mtp \ --spec-draft-n-max 4 ``` Use the matching fp16 or quantized main file with its `-mtp` companion. ### LM Studio 1. Search this repo, download target + mtp file. 2. Load target. 3. Load settings → Speculative Decoding → select mtp file file. (Requires LM Studio with am17an's PR merged or custom llama.cpp runtime. As of 2026-05, mainline LM Studio runtime doesn't yet have `draft-mtp` for Gemma-4 — track upstream merge.) ## GGUF / local inference notes ...

Links

https://huggingface.co/Mia-AiLab/Gemmable-4-12B-MTP-GGUF

Tags

lfm2.5-1.2b-instruct

Try LFM • Docs • LEAP • Discord # LFM2.5-1.2B-Instruct LFM2.5 is a new family of hybrid models designed for **on-device deployment**. It builds on the LFM2 architecture with extended pre-training and reinforcement learning. - **Best-in-class performance**: A 1.2B model rivaling much larger models, bringing high-quality AI to your pocket. - **Fast edge inference**: 239 tok/s decode on AMD CPU, 82 tok/s on mobile NPU. Runs under 1GB of memory with day-one support for llama.cpp, MLX, and vLLM. - **Scaled training**: Extended pre-training from 10T to 28T tokens and large-scale multi-stage reinforcement learning. Find more information about LFM2.5 in our blog post. ## 🗒️ Model Details LFM2.5-1.2B-Instruct is a general-purpose text-only model with the following features: ...

Links

https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct-GGUF

Tags

lfm2.5-8b-a1b

Try LFM • Docs • LEAP • Discord # LFM2.5-8B-A1B LFM2.5 is a new family of hybrid models designed for on-device deployment. It builds on the LFM2 architecture with extended pre-training and reinforcement learning. - **On-device personal assistant**: Designed to power real-life applications, chaining tool calls, and following complex instructions on all devices. - **Compressed performance**: Competitive with much larger dense and MoE models on instruction following and agentic tasks. - **Unmatched throughput**: Fastest in its size class on both CPU and GPU inference, with day-one support for llama.cpp, MLX, vLLM, and SGLang. Find more information about LFM2.5-8B-A1B in our blog post. **AA-Omniscience Index (higher is better) rewards correct answers and penalizes hallucinations. Scores range from -100 to 100. See more results on Artificial Analysis.* ## 🗒️ Model Details LFM2.5-8B-A1B is a general-purpose text-only model with the following features: ...

Links

https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF

Tags

qwen_qwen3.5-35b-a3b

Qwen3.5-35B-A3B is a quantized multimodal language model with 35B parameters using an A3B MoE architecture. It supports image-text understanding and chat interactions via llama-cpp backend.

Links

https://huggingface.co/bartowski/Qwen_Qwen3.5-35B-A3B-GGUF

Tags

qwen_qwen3.5-0.8b

Qwen 3.5 0.8B parameter model quantized for llama-cpp backend. Supports chat interactions and multimodal image-text inputs.

Links

https://huggingface.co/bartowski/Qwen_Qwen3.5-0.8B-GGUF

Tags

qwen_qwen3.5-2b

Qwen3.5-2B is a highly efficient, instruction-tuned multilingual language model available in various quantized GGUF formats. Optimized for llama-cpp inference, it supports chat and completion tasks with strong performance on low-RAM hardware. The model is available in multiple quantization levels ranging from Q8_0 to IQ2_M to balance quality and resource usage.

Links

https://huggingface.co/bartowski/Qwen_Qwen3.5-2B-GGUF

Tags

qwen_qwen3.5-4b

Qwen3.5-4B is a multimodal LLM with 4 billion parameters, optimized for chat and vision tasks. This GGUF quantized version enables efficient local inference via llama-cpp backend. Supports both text and image input for enhanced conversational capabilities.

Links

https://huggingface.co/bartowski/Qwen_Qwen3.5-4B-GGUF

Tags

qwen_qwen3-next-80b-a3b-thinking

Links

https://huggingface.co/bartowski/Qwen_Qwen3-Next-80B-A3B-Thinking-GGUF

Tags

acestep-cpp-turbo

ACE-Step 1.5 Turbo (C++ / GGML) — native C++ music generation from text descriptions and lyrics. Two-stage pipeline: text-to-code (Qwen3 LM) + code-to-audio (DiT-VAE). Stereo 48kHz output. Uses Q8_0 quantized models for a good balance of quality and speed.

Links

Tags

acestep-cpp-turbo-4b

ACE-Step 1.5 Turbo (C++ / GGML) with 4B LM — higher quality music generation from text and lyrics. Uses the larger 4B parameter LM for better metadata/code generation. Stereo 48kHz output.

Links

Tags

vibevoice-cpp

VibeVoice Realtime 0.5B (C++ / GGML, Q8_0) - native C++ port of Microsoft VibeVoice via the vibevoice-cpp backend. 24kHz mono TTS with voice cloning from a single reference voice prompt. Default voice prompt: en-Carter_man.

Links

Tags

vibevoice-cpp-asr

VibeVoice ASR 7B (C++ / GGML, Q4_K) - long-form speech-to-text with speaker diarization. Returns per-speaker JSON segments with start/end timestamps. English-only. ~10 GB download.

Links

Tags

qwen3-tts-cpp

Qwen3-TTS 0.6B Base (C++ / GGML, qwentts.cpp). Native C++ text-to-speech with streaming output and zero-shot voice cloning (set `voice` to a 24kHz reference .wav). 24kHz mono, 11 languages with Mandarin dialects. Q8_0 (~0.95 GB talker).

Links

Tags

qwen3-tts-cpp-0.6b-base-q4

Qwen3-TTS 0.6B Base (C++ / GGML, qwentts.cpp), Q4_K_M (~0.6 GB talker). Streaming + voice cloning, 24kHz mono, 11 languages.

Links

Tags

qwen3-tts-cpp-1.7b-base

Qwen3-TTS 1.7B Base (C++ / GGML, qwentts.cpp), Q8_0 (~2.0 GB talker). Higher-quality streaming + voice cloning, 24kHz mono, 11 languages.

Links

Tags

qwen3-tts-cpp-1.7b-base-q4

Qwen3-TTS 1.7B Base (C++ / GGML, qwentts.cpp), Q4_K_M (~1.2 GB talker). Streaming + voice cloning, 24kHz mono, 11 languages.

Links

Tags

qwen3-tts-cpp-customvoice

Qwen3-TTS 0.6B CustomVoice (C++ / GGML, qwentts.cpp), Q8_0. Named speakers selected via the `voice` field: serena, vivian, uncle_fu, ryan, aiden, ono_anna, sohee, eric (sichuan dialect), dylan (beijing dialect). Streaming, 24kHz mono, 11 languages.

Links

Tags

qwen3-tts-cpp-customvoice-q4

Qwen3-TTS 0.6B CustomVoice (C++ / GGML, qwentts.cpp), Q4_K_M. Named speakers via the `voice` field (serena, vivian, ryan, aiden, eric, dylan, ...). Streaming, 24kHz mono, 11 languages.

Links

Tags

qwen3-tts-cpp-1.7b-customvoice

Qwen3-TTS 1.7B CustomVoice (C++ / GGML, qwentts.cpp), Q8_0. Named speakers via the `voice` field (serena, vivian, ryan, aiden, eric, dylan, ...). Streaming, 24kHz mono, 11 languages.

Links

Tags

qwen3-tts-cpp-1.7b-customvoice-q4

Qwen3-TTS 1.7B CustomVoice (C++ / GGML, qwentts.cpp), Q4_K_M. Named speakers via the `voice` field. Streaming, 24kHz mono, 11 languages.

Links

Tags

qwen3-tts-cpp-1.7b-voicedesign

Qwen3-TTS 1.7B VoiceDesign (C++ / GGML, qwentts.cpp), Q8_0. Synthesises a speaker from a free-text attribute instruction - REQUIRES the OpenAI `instructions` field (e.g. "male, young adult, moderate pitch"); requests without it are rejected. Streaming, 24kHz mono, 11 languages.

Links

Tags

Model Gallery

Filter by type:

Filter by tags:

gemmable-4-12b-mtp

lfm2.5-1.2b-instruct

lfm2.5-8b-a1b

qwen_qwen3.5-35b-a3b

qwen_qwen3.5-0.8b

qwen_qwen3.5-2b

qwen_qwen3.5-4b

qwen_qwen3-next-80b-a3b-thinking

acestep-cpp-turbo

acestep-cpp-turbo-4b

vibevoice-cpp

vibevoice-cpp-asr

qwen3-tts-cpp

qwen3-tts-cpp-0.6b-base-q4

qwen3-tts-cpp-1.7b-base

qwen3-tts-cpp-1.7b-base-q4

qwen3-tts-cpp-customvoice

qwen3-tts-cpp-customvoice-q4

qwen3-tts-cpp-1.7b-customvoice

qwen3-tts-cpp-1.7b-customvoice-q4

qwen3-tts-cpp-1.7b-voicedesign