Model Gallery

106 models from 1 repository

gemmable-4-12b-mtp
## Gemmable 4 12B

Gemmable 4 12B is a GGUF export of Gemma 4 12B fine-tuned on Fable-5 style reasoning and assistant traces.

## Highlights

- Base model: `google/gemma-4-12B`
- Format: GGUF
- Training style: Fable-5 style reasoning and assistant traces
- Distribution: fp16 GGUF plus matching assistant GGUFs for each quant
- Intended use: local inference, coding, reasoning, and assistant workflows

## How to use

### llama.cpp

Standard load:

```bash
llama-server -m "gemmable-4-12b-fp16.gguf"
```

Speculative / draft-MTP load:

```bash
llama-server -m "gemmable-4-12b-Q4_K_M.gguf" \
  --spec-draft-model "gemmable-4-12b-Q4_K_M-mtp.gguf" \
  --spec-type draft-mtp \
  --spec-draft-n-max 4
```

Use the matching fp16 or quantized main file with its `-mtp` companion.

### LM Studio

1. Search for this repo and download the target file and its mtp file.
2. Load the target.
3. Load settings → Speculative Decoding → select the mtp file.

(Requires LM Studio with am17an's PR merged or a custom llama.cpp runtime. As of 2026-05, the mainline LM Studio runtime doesn't yet have `draft-mtp` for Gemma-4; track the upstream merge.)

## GGUF / local inference notes

...
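The pairing rule above (each main file is loaded with its `-mtp` companion) can be sketched in Python. This is a minimal sketch that only assembles the command line. The helper names `mtp_companion` and `spec_server_argv` are hypothetical, and the flags are copied from the llama.cpp example above rather than verified against any particular llama.cpp build.

```python
from pathlib import PurePosixPath


def mtp_companion(main_file: str) -> str:
    """Derive the companion file name: x-Q4_K_M.gguf -> x-Q4_K_M-mtp.gguf."""
    path = PurePosixPath(main_file)
    if path.suffix != ".gguf":
        raise ValueError(f"expected a .gguf file, got {main_file!r}")
    return str(path.with_name(f"{path.stem}-mtp.gguf"))


def spec_server_argv(main_file: str, draft_n_max: int = 4) -> list[str]:
    """Build a llama-server argv for a draft-MTP load (flags as in the example above)."""
    return [
        "llama-server",
        "-m", main_file,
        "--spec-draft-model", mtp_companion(main_file),
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(draft_n_max),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(spec_server_argv("gemmable-4-12b-Q4_K_M.gguf")))
```

Deriving the companion name from the main file keeps the two files on the same quant, which is the pairing the entry asks for.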

Repository: localai

lfm2.5-1.2b-instruct
# LFM2.5-1.2B-Instruct

LFM2.5 is a new family of hybrid models designed for **on-device deployment**. It builds on the LFM2 architecture with extended pre-training and reinforcement learning.

- **Best-in-class performance**: A 1.2B model rivaling much larger models, bringing high-quality AI to your pocket.
- **Fast edge inference**: 239 tok/s decode on AMD CPU, 82 tok/s on mobile NPU. Runs under 1GB of memory with day-one support for llama.cpp, MLX, and vLLM.
- **Scaled training**: Extended pre-training from 10T to 28T tokens and large-scale multi-stage reinforcement learning.

Find more information about LFM2.5 in our blog post.

## 🗒️ Model Details

LFM2.5-1.2B-Instruct is a general-purpose text-only model with the following features: ...

Repository: localai
License: other

lfm2.5-8b-a1b
# LFM2.5-8B-A1B

LFM2.5 is a new family of hybrid models designed for on-device deployment. It builds on the LFM2 architecture with extended pre-training and reinforcement learning.

- **On-device personal assistant**: Designed to power real-life applications, chaining tool calls, and following complex instructions on all devices.
- **Compressed performance**: Competitive with much larger dense and MoE models on instruction following and agentic tasks.
- **Unmatched throughput**: Fastest in its size class on both CPU and GPU inference, with day-one support for llama.cpp, MLX, vLLM, and SGLang.

Find more information about LFM2.5-8B-A1B in our blog post.

*AA-Omniscience Index (higher is better) rewards correct answers and penalizes hallucinations. Scores range from -100 to 100. See more results on Artificial Analysis.*

## 🗒️ Model Details

LFM2.5-8B-A1B is a general-purpose text-only model with the following features: ...

Repository: localai
License: other

qwen_qwen3.5-35b-a3b
Qwen3.5-35B-A3B is a quantized multimodal mixture-of-experts (MoE) language model with 35B total parameters, of which about 3B are active per token (A3B). It supports image-text understanding and chat interactions via the llama-cpp backend.

Repository: localai
License: apache-2.0

qwen_qwen3.5-0.8b
Qwen3.5-0.8B is a 0.8B-parameter model quantized for the llama-cpp backend. It supports chat interactions and multimodal image-text inputs.

Repository: localai
License: apache-2.0

qwen_qwen3.5-2b
Qwen3.5-2B is a highly efficient, instruction-tuned multilingual language model available in various quantized GGUF formats. Optimized for llama-cpp inference, it supports chat and completion tasks with strong performance on low-RAM hardware. The model is available in multiple quantization levels ranging from Q8_0 to IQ2_M to balance quality and resource usage.
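To make the quality-versus-resources trade-off concrete, a weights-only size estimate is parameters × bits per weight ÷ 8. The sketch below assumes a nominal 2.0e9 parameters and approximate bits-per-weight figures (8.5 for Q8_0, roughly 2.7 for IQ2_M). Both are assumptions rather than values from this listing, and real GGUF files differ because some tensors are stored at other precisions. The function name `approx_weight_gb` is hypothetical.

```python
# Approximate bits per weight for each quantization level.
# Assumed values for illustration; not taken from this gallery entry.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "IQ2_M": 2.7}


def approx_weight_gb(n_params: float, quant: str) -> float:
    """Weights-only estimate in GB (1e9 bytes): params * bits / 8."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


if __name__ == "__main__":
    nominal_params = 2.0e9  # assumed nominal size for a "2B" model
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{approx_weight_gb(nominal_params, quant):.2f} GB")
```

The estimate excludes the KV cache and runtime buffers, so actual memory use during inference is higher.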

Repository: localai
License: apache-2.0

qwen_qwen3.5-4b
Qwen3.5-4B is a multimodal LLM with 4 billion parameters, optimized for chat and vision tasks. This GGUF quantized version enables efficient local inference via the llama-cpp backend and accepts both text and image input.

Repository: localai
License: apache-2.0

qwen_qwen3-next-80b-a3b-thinking

Repository: localai
License: apache-2.0

acestep-cpp-turbo
ACE-Step 1.5 Turbo (C++ / GGML) — native C++ music generation from text descriptions and lyrics. Two-stage pipeline: text-to-code (Qwen3 LM) + code-to-audio (DiT-VAE). Stereo 48kHz output. Uses Q8_0 quantized models for a good balance of quality and speed.

Repository: localai
License: mit

acestep-cpp-turbo-4b
ACE-Step 1.5 Turbo (C++ / GGML) with 4B LM — higher quality music generation from text and lyrics. Uses the larger 4B parameter LM for better metadata/code generation. Stereo 48kHz output.

Repository: localai
License: mit

vibevoice-cpp
VibeVoice Realtime 0.5B (C++ / GGML, Q8_0) - native C++ port of Microsoft VibeVoice via the vibevoice-cpp backend. 24kHz mono TTS with voice cloning from a single reference voice prompt. Default voice prompt: en-Carter_man.

Repository: localai
License: mit

vibevoice-cpp-asr
VibeVoice ASR 7B (C++ / GGML, Q4_K) - long-form speech-to-text with speaker diarization. Returns per-speaker JSON segments with start/end timestamps. English-only. ~10 GB download.

Repository: localai
License: mit

qwen3-tts-cpp
Qwen3-TTS 0.6B Base (C++ / GGML, qwentts.cpp). Native C++ text-to-speech with streaming output and zero-shot voice cloning (set `voice` to a 24kHz reference .wav). 24kHz mono, 11 languages with Mandarin dialects. Q8_0 (~0.95 GB talker).

Repository: localai
License: mit

qwen3-tts-cpp-0.6b-base-q4
Qwen3-TTS 0.6B Base (C++ / GGML, qwentts.cpp), Q4_K_M (~0.6 GB talker). Streaming + voice cloning, 24kHz mono, 11 languages.

Repository: localai
License: mit

qwen3-tts-cpp-1.7b-base
Qwen3-TTS 1.7B Base (C++ / GGML, qwentts.cpp), Q8_0 (~2.0 GB talker). Higher-quality streaming + voice cloning, 24kHz mono, 11 languages.

Repository: localai
License: mit

qwen3-tts-cpp-1.7b-base-q4
Qwen3-TTS 1.7B Base (C++ / GGML, qwentts.cpp), Q4_K_M (~1.2 GB talker). Streaming + voice cloning, 24kHz mono, 11 languages.

Repository: localai
License: mit

qwen3-tts-cpp-customvoice
Qwen3-TTS 0.6B CustomVoice (C++ / GGML, qwentts.cpp), Q8_0. Named speakers selected via the `voice` field: serena, vivian, uncle_fu, ryan, aiden, ono_anna, sohee, eric (sichuan dialect), dylan (beijing dialect). Streaming, 24kHz mono, 11 languages.
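As a minimal sketch of how the `voice` field selects a named speaker, the snippet below builds a JSON body for an OpenAI-style speech request. The `model`, `input`, and `voice` keys follow the OpenAI speech API shape, which is assumed here to be what the server accepts. The helper name `speech_payload` is hypothetical, and the speaker list is copied from the description above.

```python
import json

# Named speakers as listed in the model description.
SPEAKERS = (
    "serena", "vivian", "uncle_fu", "ryan", "aiden",
    "ono_anna", "sohee", "eric", "dylan",
)


def speech_payload(text: str, voice: str,
                   model: str = "qwen3-tts-cpp-customvoice") -> dict:
    """Build a request body, rejecting speaker names the model does not list."""
    if voice not in SPEAKERS:
        raise ValueError(f"unknown voice {voice!r}; expected one of {SPEAKERS}")
    return {"model": model, "input": text, "voice": voice}


if __name__ == "__main__":
    print(json.dumps(speech_payload("Hello from the gallery.", "serena"), indent=2))
```

The body would typically be sent as a POST to the server's speech endpoint (`/v1/audio/speech` in the OpenAI API); check the server's documentation for the exact path.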

Repository: localai
License: mit

qwen3-tts-cpp-customvoice-q4
Qwen3-TTS 0.6B CustomVoice (C++ / GGML, qwentts.cpp), Q4_K_M. Named speakers via the `voice` field (serena, vivian, ryan, aiden, eric, dylan, ...). Streaming, 24kHz mono, 11 languages.

Repository: localai
License: mit

qwen3-tts-cpp-1.7b-customvoice
Qwen3-TTS 1.7B CustomVoice (C++ / GGML, qwentts.cpp), Q8_0. Named speakers via the `voice` field (serena, vivian, ryan, aiden, eric, dylan, ...). Streaming, 24kHz mono, 11 languages.

Repository: localai
License: mit

qwen3-tts-cpp-1.7b-customvoice-q4
Qwen3-TTS 1.7B CustomVoice (C++ / GGML, qwentts.cpp), Q4_K_M. Named speakers via the `voice` field. Streaming, 24kHz mono, 11 languages.

Repository: localai
License: mit

qwen3-tts-cpp-1.7b-voicedesign
Qwen3-TTS 1.7B VoiceDesign (C++ / GGML, qwentts.cpp), Q8_0. Synthesises a speaker from a free-text attribute instruction - REQUIRES the OpenAI `instructions` field (e.g. "male, young adult, moderate pitch"); requests without it are rejected. Streaming, 24kHz mono, 11 languages.
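Because requests without `instructions` are rejected, a client can check for the field before sending anything. The sketch below is one possible client-side guard, not the server's own validation. The helper name `voicedesign_payload` is hypothetical, and the `model` and `input` keys assume the OpenAI speech request shape.

```python
import json


def voicedesign_payload(text: str, instructions: str,
                        model: str = "qwen3-tts-cpp-1.7b-voicedesign") -> dict:
    """Build a request body; fail early if the attribute instruction is missing."""
    if not instructions or not instructions.strip():
        raise ValueError("VoiceDesign needs a non-empty `instructions` field")
    return {
        "model": model,
        "input": text,
        "instructions": instructions.strip(),
    }


if __name__ == "__main__":
    body = voicedesign_payload(
        "Welcome back.",
        "male, young adult, moderate pitch",  # example attributes from the description
    )
    print(json.dumps(body, indent=2))
```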

Repository: localai
License: mit
