LocalAI - Models

gemma-4-12b-agentic-fable5-composer2.5-v2-3.5x-tau2

Hugging Face | GitHub | Launch Blog | Documentation License: Apache 2.0 | Authors: Google DeepMind > [!Note] > This model card is for the Gemma 4 12B Unified model, which is part of the Gemma 4 family of open models. Built with the same multimodal functionality as Gemma 4 E2B and E4B (text, audio, image, and video inputs), it brings native audio and vision understanding directly to local environments without the need for separate encoders. This unified approach to multimodality makes the model encoder-free, offering a deployment size that is perfect for consumer devices and streamlined local execution. Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages. ...

Links

https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF

Tags

gemma-4-12b-coder-fable5-composer2.5-v1

Hugging Face | GitHub | Launch Blog | Documentation License: Apache 2.0 | Authors: Google DeepMind > [!Note] > This model card is for the Gemma 4 12B Unified model, which is part of the Gemma 4 family of open models. Built with the same multimodal functionality as Gemma 4 E2B and E4B (text, audio, image, and video inputs), it brings native audio and vision understanding directly to local environments without the need for separate encoders. This unified approach to multimodality makes the model encoder-free, offering a deployment size that is perfect for consumer devices and streamlined local execution. Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages. ...

Links

https://huggingface.co/yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF

Tags

dark-scarlett-v0.3-26b-a4b

Hugging Face | GitHub | Launch Blog | Documentation License: Apache 2.0 | Authors: Google DeepMind Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages. Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: **E2B**, **E4B**, **26B A4B**, and **31B**. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI. Gemma 4 introduces key **capability and architectural advancements**: * **Reasoning** – All models in the family are designed as highly capable reasoners, with configurable thinking modes. ...

Links

https://huggingface.co/ReadyArt/Dark-Scarlett-v0.3-26B-A4B-GGUF

Tags

privacy-filter-multilingual

A multilingual PII token-classification model: a fine-tune of openai/privacy-filter by OpenMed. It labels every token with a BIOES tag over 54 PII categories (217 classes) across 16 languages (ar, bn, de, en, es, fr, hi, it, ja, ko, nl, pt, te, tr, vi, zh), spanning identity, contact, address, financial, vehicle, digital, and crypto entities. In LocalAI this is a PII detector for the NER redactor tier: set known_usecases to [token_classify] (as below), and any model opts into redaction by listing this one under pii.detectors. The detection policy (which categories to mask vs block, and the score threshold) lives on this model's own pii_detection block - see the overrides below. It runs locally with no Python, served by the standalone privacy-filter backend's TokenClassify RPC (constrained BIOES Viterbi decode into UTF-8 byte-offset entity spans). Architecture: gpt-oss-style sparse MoE (8 layers, 128 experts top-4, ~50M active per token), bidirectional banded attention, o200k tokenizer; served via the openai-privacy-filter architecture. F16, ~2.7 GB.

Links

Tags

privacy-filter-nemotron

A fine-grained English PII token-classification model: a fine-tune of openai/privacy-filter by OpenMed on NVIDIA's Nemotron-PII dataset. It labels every token with a BIOES tag over 55 PII categories (221 classes), trading the multilingual sibling's language breadth for category depth - identity, contact, address, dates, government IDs, financial, healthcare, enterprise, vehicle and digital entities (including api_key, ipv4/ipv6 and mac_address). For multilingual text prefer privacy-filter-multilingual instead. In LocalAI this is a PII detector for the NER redactor tier: set known_usecases to [token_classify] (as below), and any model opts into redaction by listing this one under pii.detectors. The detection policy (which categories to mask vs block, and the score threshold) lives on this model's own pii_detection block - see the overrides below. It runs locally with no Python, served by the standalone privacy-filter backend's TokenClassify RPC (constrained BIOES Viterbi decode into UTF-8 byte-offset entity spans). Architecture: gpt-oss-style sparse MoE (8 layers, d_model 640, 128 experts top-4, ~1.5B total / ~50M active per token), bidirectional banded attention, o200k tokenizer and a 221-way token-classification head; served via the openai-privacy-filter architecture. F16, ~2.8 GB. (A smaller Q8_0 quant exists on the GGUF repo for RAM-constrained use - validate it on your own data, since for PII a single dropped span is a leak.)

Links

Tags

qwopus3.5-9b-coder-mtp

# 🌟 Qwopus3.5-9B-v3.5 ## 💡 Model Overview & v3.5 Design Qwopus3.5-9B-v3.5 is a **data-scaled continuation** of the Qwopus3.5-9B-v3 model. The training data in v3.5 is expanded to cover a broader range of domains, including mathematics, programming, puzzle-solving, multilingual dialogue, instruction-following, multi-turn interactions, and STEM-related tasks. Qwopus3.5-9B-v3.5 is a reasoning-enhanced model based on **Qwen3.5-9B**, designed for: - 🧩 Structured reasoning - 🔧 Tool-augmented workflows - 🔁 Multi-step agentic tasks - ⚡ Token-efficient inference Compared with Qwopus3.5-9B-v3, **3.5 version does not introduce a new architecture, RL stage, or template redesign**. This version is trained with approximately **2× more SFT data**. ## 🎯 Motivation & Generalization Insight The motivation behind v3.5 comes from a simple observation: > This work is motivated by the hypothesis that scaling high-quality SFT data may further enhance the generalization ability of large language models. In earlier Qwopus3.5 experiments, structured reasoning was observed to improve both **accuracy and efficiency**: ...

Links

https://huggingface.co/Jackrong/Qwopus3.5-9B-Coder-MTP-GGUF

Tags

supergemma4-26b-uncensored-v2

Hugging Face | GitHub | Launch Blog | Documentation License: Apache 2.0 | Authors: Google DeepMind Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages. Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: **E2B**, **E4B**, **26B A4B**, and **31B**. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI. Gemma 4 introduces key **capability and architectural advancements**: * **Reasoning** – All models in the family are designed as highly capable reasoners, with configurable thinking modes. ...

Links

https://huggingface.co/Jiunsong/supergemma4-26b-uncensored-gguf-v2

Tags

qwen_qwen3.5-2b

Qwen3.5-2B is a highly efficient, instruction-tuned multilingual language model available in various quantized GGUF formats. Optimized for llama-cpp inference, it supports chat and completion tasks with strong performance on low-RAM hardware. The model is available in multiple quantization levels ranging from Q8_0 to IQ2_M to balance quality and resource usage.

Links

https://huggingface.co/bartowski/Qwen_Qwen3.5-2B-GGUF

Tags

qwen3.5-397b-a17b

Links

https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF

Tags

qwen3.5-27b

Links

https://huggingface.co/unsloth/Qwen3.5-27B-GGUF

Tags

qwen3.5-122b-a10b

Links

https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF

Tags

nanbeige4.1-3b-q8

Nanbeige4.1-3B is built upon Nanbeige4-3B-Base and represents an enhanced iteration of our previous reasoning model, Nanbeige4-3B-Thinking-2511, achieved through further post-training optimization with supervised fine-tuning (SFT) and reinforcement learning (RL). As a highly competitive open-source model at a small parameter scale, Nanbeige4.1-3B illustrates that compact models can simultaneously achieve robust reasoning, preference alignment, and effective agentic behaviors. Key features: Strong Reasoning: Capable of solving complex, multi-step problems through sustained and coherent reasoning within a single forward pass, reliably producing correct answers on benchmarks like LiveCodeBench-Pro, IMO-Answer-Bench, and AIME 2026 I. Robust Preference Alignment: Outperforms same-scale models (e.g., Qwen3-4B-2507, Nanbeige4-3B-2511) and larger models (e.g., Qwen3-30B-A3B, Qwen3-32B) on Arena-Hard-v2 and Multi-Challenge. Agentic Capability: First general small model to natively support deep-search tasks and sustain complex problem-solving with >500 rounds of tool invocations; excels in benchmarks like xBench-DeepSearch (75), Browse-Comp (39), and others.

Links

Tags

nanbeige4.1-3b-q4

Nanbeige4.1-3B is built upon Nanbeige4-3B-Base and represents an enhanced iteration of our previous reasoning model, Nanbeige4-3B-Thinking-2511, achieved through further post-training optimization with supervised fine-tuning (SFT) and reinforcement learning (RL). As a highly competitive open-source model at a small parameter scale, Nanbeige4.1-3B illustrates that compact models can simultaneously achieve robust reasoning, preference alignment, and effective agentic behaviors. Key features: Strong Reasoning: Capable of solving complex, multi-step problems through sustained and coherent reasoning within a single forward pass, reliably producing correct answers on benchmarks like LiveCodeBench-Pro, IMO-Answer-Bench, and AIME 2026 I. Robust Preference Alignment: Outperforms same-scale models (e.g., Qwen3-4B-2507, Nanbeige4-3B-2511) and larger models (e.g., Qwen3-30B-A3B, Qwen3-32B) on Arena-Hard-v2 and Multi-Challenge. Agentic Capability: First general small model to natively support deep-search tasks and sustain complex problem-solving with >500 rounds of tool invocations; excels in benchmarks like xBench-DeepSearch (75), Browse-Comp (39), and others.

Links

Tags

omnilingual-0.3b-ctc-q8-sherpa

Omnilingual ASR CTC 300M (int8) is a multilingual automatic speech recognition model supporting 1,600+ languages. Based on Meta's omniASR_CTC_300M architecture (Wav2Vec2 with CTC head), quantized to int8 for efficient inference. Uses the sherpa-onnx backend with ONNX Runtime.

Links

Tags

streaming-zipformer-en-sherpa

Streaming English ASR: sherpa-onnx zipformer transducer (int8, chunk-16 left-128). Low-latency real-time transcription with endpoint detection via sherpa-onnx's online recognizer. English-only; for multilingual offline ASR see omnilingual-0.3b-ctc-q8-sherpa.

Links

Tags

kokoro-multi-lang-v1.0-sherpa

Kokoro multi-lingual TTS (v1.0, int8) served through the sherpa-onnx backend with native streaming TTS. A single model covers many languages and speakers (English, Italian, Spanish, French, German and more) via a built-in voices bank; espeak-ng data and per-language lexicons ship with it. Select a speaker with the `voice` parameter (numeric speaker id) and optionally pass `language=` to hint the language.

Links

Tags

supertonic-3

Supertonic multilingual text-to-speech (Supertone/supertonic-3), served through the native supertonic backend via ONNX Runtime. Lightning-fast on-device flow-matching TTS with 44.1 kHz output, 31 languages, and 10 preset voice styles (F1-F5, M1-M5). No espeak-ng dependency. Defaults to voice F1; override per request with the OpenAI `voice` field, and optionally pass `language=` (e.g. en, ko, ja, it; "na" for language-agnostic).

Links

Tags

glm-4.7-flash-derestricted

This model is a quantized version of the original GLM-4.7-Flash-Derestricted model, derived from the base model `koute/GLM-4.7-Flash-Derestricted`. It is designed for restricted use, featuring tags like "derestricted," "uncensored," and "unlimited." The quantized versions (e.g., Q2_K, Q4_K_S, Q6_K) offer varying trade-offs between accuracy and efficiency, with the Q4_K_S and Q6_K variants being recommended for balanced performance. The model is optimized for fast inference and supports multiple quantization schemes, though some advanced quantization options (like IQ4_XS) are not available. It is intended for use in environments with specific constraints or restrictions.

Links

https://huggingface.co/mradermacher/GLM-4.7-Flash-Derestricted-GGUF

Tags

huihui-glm-4.7-flash-abliterated-i1

The model is a quantized version of **huihui-ai/Huihui-GLM-4.7-Flash-abliterated**, optimized for efficiency and deployment. It uses GGUF files with various quantization levels (e.g., IQ1_M, IQ2_XXS, Q4_K_M) and is designed for tasks requiring low-resource deployment. Key features include: - **Base Model**: Huihui-GLM-4.7-Flash-abliterated (unmodified, original model). - **Quantization**: Supports IQ1_M to Q4_K_M, balancing accuracy and efficiency. - **Use Cases**: Suitable for applications needing lightweight inference, such as edge devices or resource-constrained environments. - **Downloads**: Available in GGUF format with varying quality and size (e.g., 0.2GB to 18.2GB). - **Tags**: Abliterated, uncensored, and optimized for specific tasks. This model is a modified version of the original GLM-4.7, tailored for deployment with quantized weights.

Links

https://huggingface.co/mradermacher/Huihui-GLM-4.7-Flash-abliterated-i1-GGUF

Tags

glm-4.7-flash

**GLM-4.7-Flash** is a 30B-A3B MoE (Model Organism Ensemble) model designed for efficient deployment. It outperforms competitors in benchmarks like AIME 25, GPQA, and τ²-Bench, offering strong accuracy while balancing performance and efficiency. Optimized for lightweight use cases, it supports inference via frameworks like vLLM and SGLang, with detailed deployment instructions in the official repository. Ideal for applications requiring high-quality text generation with minimal resource consumption.

Links

https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF

Tags

qwen3-vl-reranker-2b-i1

**Model Name:** Qwen3-VL-Reranker-2B-i1 **Base Model:** Qwen/Qwen3-VL-Reranker-2B **Description:** A high-performance multimodal reranking model for state-of-the-art cross-modal search. It supports 30+ languages and handles text, images, screenshots, videos, and mixed modalities. With 8B parameters and a 32K context length, it refines retrieval results by combining embedding vectors with precise relevance scores. Optimized for efficiency, it supports quantized versions (e.g., Q8_0, Q4_K_M) and is ideal for applications requiring accurate multimodal content matching. **Key Features:** - **Multimodal**: Text, images, videos, and mixed content. - **Language Support**: 30+ languages. - **Quantization**: Available in Q8_0 (best quality), Q4_K_M (fast, recommended), and lower-precision options. - **Performance**: Outperforms base models in retrieval tasks (e.g., JinaVDR, ViDoRe v3). - **Use Case**: Enhances search pipelines by refining embeddings with precise relevance scores. **Downloads:** - [GGUF Files](https://huggingface.co/mradermacher/Qwen3-VL-Reranker-2B-i1-GGUF) (e.g., `Qwen3-VL-Reranker-2B.i1-Q4_K_M.gguf`). **Usage:** - Requires `transformers`, `qwen-vl-utils`, and `torch`. - Example: `from scripts.qwen3_vl_reranker import Qwen3VLReranker; model = Qwen3VLReranker(...)` **Citation:** @article{qwen3vlembedding, ...} This description emphasizes its capabilities, efficiency, and versatility for multimodal search tasks.

Links

https://huggingface.co/mradermacher/Qwen3-VL-Reranker-2B-i1-GGUF

Tags

Model Gallery

Filter by type:

Filter by tags:

gemma-4-12b-agentic-fable5-composer2.5-v2-3.5x-tau2

gemma-4-12b-coder-fable5-composer2.5-v1

dark-scarlett-v0.3-26b-a4b

privacy-filter-multilingual

privacy-filter-nemotron

qwopus3.5-9b-coder-mtp

supergemma4-26b-uncensored-v2

qwen_qwen3.5-2b

qwen3.5-397b-a17b

qwen3.5-27b

qwen3.5-122b-a10b

nanbeige4.1-3b-q8

nanbeige4.1-3b-q4

omnilingual-0.3b-ctc-q8-sherpa

streaming-zipformer-en-sherpa

kokoro-multi-lang-v1.0-sherpa

supertonic-3

glm-4.7-flash-derestricted

huihui-glm-4.7-flash-abliterated-i1

glm-4.7-flash

qwen3-vl-reranker-2b-i1