LocalAI - Models

qwythos-9b-claude-mythos-5-1m

# Qwythos-9B **Developed by Empero** **Qwythos-9B** is a full-parameter reasoning model built on top of a **deeply uncensored Qwen3.5-9B base** and post-trained on **over 500 million tokens** of high-quality Claude Mythos and Claude Fable traces, with chain-of-thought generated in-house by Empero AI's internal tool **rethink**. The result is a compact, fast, **dramatically more capable** 9B reasoning model. Headline capabilities: ...

Links

https://huggingface.co/empero-ai/Qwythos-9B-Claude-Mythos-5-1M-GGUF

Tags

glm-5.2

# GLM-5.2 👋 Join our WeChat or Discord community. 📖 Check out the GLM-5.2 blog and GLM-5 Technical report. 📍 Use GLM-5.2 API services on Z.ai API Platform. 🔜 Try GLM-5.2 here. [Paper] [GitHub] ## Introduction We're introducing GLM-5.2, our latest flagship model for long-horizon tasks. It marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and, for the first time, delivers that capability on a **solid 1M-token context**. GLM-5.2's new capabilities include: - **Solid 1M Context:** A solid 1M-token context that stably sustains long-horizon work - **Advanced Coding with Flexible Effort**: Stronger coding capabilities with multiple thinking effort levels to balance performance and latency - **Improved Architecture**: We propose IndexShare, which reuses the same indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9× at a 1M context length. We also improve GLM-5.2’s MTP layer for speculative decoding, increasing the acceptance length by up to 20% - **Pure Open**: An MIT open-source license — no regional limits, technical access without borders ## Benchmark ## Serve GLM-5.2 Locally ...

Links

https://huggingface.co/unsloth/GLM-5.2-GGUF

Tags

step-3.7-flash

**[ModelPage]**: https://static.stepfun.com/blog/step-3.7-flash/ ## 1. Introduction Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. Engineered for high-frequency production workloads, it activates approximately 11B parameters per token and delivers a throughput of up to 400 tokens per second. Step 3.7 Flash supports a 256k context window and offers three selectable reasoning levels (low, medium, and high) so developers can easily balance speed, cost, and cognitive depth. We built Step 3.7 Flash for developers who need to scale agentic workflows that combine perception, search, and reasoning. It is designed to handle intensive tasks such as parsing massive financial reports in one pass, running multi-step search loops with cross-source verification, or operating concurrent coding agents in high-throughput pipelines. ## 2. Capabilities & Performance ### Multimodal Perception and Verification ...

Links

https://huggingface.co/unsloth/Step-3.7-Flash-GGUF

Tags

privacy-filter-multilingual

A multilingual PII token-classification model: a fine-tune of openai/privacy-filter by OpenMed. It labels every token with a BIOES tag over 54 PII categories (217 classes) across 16 languages (ar, bn, de, en, es, fr, hi, it, ja, ko, nl, pt, te, tr, vi, zh), spanning identity, contact, address, financial, vehicle, digital, and crypto entities. In LocalAI this is a PII detector for the NER redactor tier: set known_usecases to [token_classify] (as below), and any model opts into redaction by listing this one under pii.detectors. The detection policy (which categories to mask vs block, and the score threshold) lives on this model's own pii_detection block - see the overrides below. It runs locally with no Python, served by the standalone privacy-filter backend's TokenClassify RPC (constrained BIOES Viterbi decode into UTF-8 byte-offset entity spans). Architecture: gpt-oss-style sparse MoE (8 layers, 128 experts top-4, ~50M active per token), bidirectional banded attention, o200k tokenizer; served via the openai-privacy-filter architecture. F16, ~2.7 GB.

Links

Tags

privacy-filter-nemotron

A fine-grained English PII token-classification model: a fine-tune of openai/privacy-filter by OpenMed on NVIDIA's Nemotron-PII dataset. It labels every token with a BIOES tag over 55 PII categories (221 classes), trading the multilingual sibling's language breadth for category depth - identity, contact, address, dates, government IDs, financial, healthcare, enterprise, vehicle and digital entities (including api_key, ipv4/ipv6 and mac_address). For multilingual text prefer privacy-filter-multilingual instead. In LocalAI this is a PII detector for the NER redactor tier: set known_usecases to [token_classify] (as below), and any model opts into redaction by listing this one under pii.detectors. The detection policy (which categories to mask vs block, and the score threshold) lives on this model's own pii_detection block - see the overrides below. It runs locally with no Python, served by the standalone privacy-filter backend's TokenClassify RPC (constrained BIOES Viterbi decode into UTF-8 byte-offset entity spans). Architecture: gpt-oss-style sparse MoE (8 layers, d_model 640, 128 experts top-4, ~1.5B total / ~50M active per token), bidirectional banded attention, o200k tokenizer and a 221-way token-classification head; served via the openai-privacy-filter architecture. F16, ~2.8 GB. (A smaller Q8_0 quant exists on the GGUF repo for RAM-constrained use - validate it on your own data, since for PII a single dropped span is a leak.)

Links

Tags

nemotron-3-nano-omni-30b-a3b-reasoning-apex

# Model Overview ### Description: NVIDIA Nemotron 3 Nano Omni is a multimodal large language model that unifies video, audio, image, and text understanding to support enterprise-grade Q&A, summarization, transcription, and document intelligence workflows. It extends the Nemotron Nano family with integrated video+speech comprehension, Graphical User Interface (GUI), Optical Character Recognition (OCR), and speech transcription capabilities, enabling end-to-end processing of rich enterprise content such as meeting recordings, M&E assets, training videos, and complex business documents. NVIDIA Nemotron 3 Nano Omni was developed by NVIDIA as part of the Nemotron model family. This model is available for commercial use. This model was improved using Qwen3-VL-30B-A3B-Instruct, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B, Qwen2.5-VL-72B-Instruct, and gpt-oss-120b. For more information, please see the Training Dataset section below. ### License/Terms of Use Governing Terms: Use of this model is governed by the NVIDIA Open Model Agreement ### Deployment Geography: Global ...

Links

https://huggingface.co/mudler/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-APEX-GGUF

Tags

kimi-k2.6

🤗 huggingchat | 📰 Tech Blog ## 1. Model Introduction Kimi K2.6 is an open-source, native multimodal agentic model that advances practical capabilities in long-horizon coding, coding-driven design, proactive autonomous execution, and swarm-based task orchestration. ### Key Features - **Long-Horizon Coding**: K2.6 achieves significant improvements on complex, end-to-end coding tasks, generalizing robustly across programming languages (Rust, Go, Python) and domains spanning front-end, DevOps, and performance optimization. - **Coding-Driven Design**: K2.6 is capable of transforming simple prompts and visual inputs into production-ready interfaces and lightweight full-stack workflows, generating structured layouts, interactive elements, and rich animations with deliberate aesthetic precision. - **Elevated Agent Swarm**: Scaling horizontally to 300 sub-agents executing 4,000 coordinated steps, K2.6 can dynamically decompose tasks into parallel, domain-specialized subtasks, delivering end-to-end outputs from documents to websites to spreadsheets in a single autonomous run. - **Proactive & Open Orchestration**: For autonomous tasks, K2.6 demonstra ...

Links

https://huggingface.co/unsloth/Kimi-K2.6-GGUF

Tags

qwen_qwen3.5-4b

Qwen3.5-4B is a multimodal LLM with 4 billion parameters, optimized for chat and vision tasks. This GGUF quantized version enables efficient local inference via llama-cpp backend. Supports both text and image input for enhanced conversational capabilities.

Links

https://huggingface.co/bartowski/Qwen_Qwen3.5-4B-GGUF

Tags

qwen3.5-27b-claude-4.6-opus-reasoning-distilled-i1

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-i1-GGUF - A GGUF quantized model optimized for local inference. Specialized for reasoning and chain-of-thought tasks. Based on Qwen 3.5 architecture with enhanced language understanding. Available in multiple quantization levels for various hardware requirements. Distilled from Claude-style reasoning models for enhanced logical reasoning capabilities.

Links

https://huggingface.co/mradermacher/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-i1-GGUF

Tags

qwen3.5-4b-claude-4.6-opus-reasoning-distilled

Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled-GGUF - A GGUF quantized model optimized for local inference. Specialized for reasoning and chain-of-thought tasks. Based on Qwen 3.5 architecture with enhanced language understanding. Available in multiple quantization levels for various hardware requirements. Distilled from Claude-style reasoning models for enhanced logical reasoning capabilities.

Links

https://huggingface.co/Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled-GGUF

Tags

whisperx-tiny

WhisperX Tiny is a fast and accurate speech recognition model with speaker diarization capabilities. Built on OpenAI's Whisper with additional features for alignment and speaker segmentation.

Links

https://github.com/m-bain/whisperX

Tags

voxcpm-1.5

VoxCPM 1.5 is an end-to-end text-to-speech (TTS) model from ModelBest. It features zero-shot voice cloning and high-quality speech synthesis capabilities.

Links

https://huggingface.co/openbmb/VoxCPM1.5

Tags

vllm-omni-qwen3-omni-30b

Qwen3-Omni-30B-A3B-Instruct via vLLM-Omni - A large multimodal model (30B active, 3B activated per token) from Alibaba Qwen team. Supports text, image, audio, and video understanding with text and speech output. Features native multimodal understanding across all modalities.

Links

https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct

Tags

vllm-omni-qwen3-tts-custom-voice

Qwen3-TTS-12Hz-1.7B-CustomVoice via vLLM-Omni - Text-to-speech model from Alibaba Qwen team with custom voice cloning capabilities. Generates natural-sounding speech with voice personalization.

Links

https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice

Tags

qwen3-vl-embedding-8b

**Model Name:** Qwen3-VL-Embedding-8B **Base Model:** Qwen/Qwen3-VL-8B-Instruct **Description:** The **Qwen3-VL-Embedding** and **Qwen3-VL-Reranker** model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities. **Key Features:** - Model Type: MultiModal Embedding - Supported Languages: 30+ Languages - Supported Input Modalities: Text, images, screenshots, videos, and arbitrary multimodal combinations (e.g., text + image, text + video) - Number of Parameters: 8B - Context Length: 32k - Embedding Dimension: Up to 4096, supports user-defined output dimensions ranging from 64 to 4096 **Downloads:** - [GGUF Files](https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B) (e.g., `Qwen3-VL-Embedding-8B-Q8_0.gguf`). **Usage:** - Requires `transformers`, `qwen-vl-utils`, and `torch`. - Example: `from scripts.qwen3_vl_embedding import Qwen3VLEmbedder model = Qwen3VLEmbedder(...)` **Citation:** @article{qwen3vlembedding, ...} This description emphasizes its capabilities, efficiency, and versatility for multimodal search tasks.

Links

Tags

qwen3-vl-embedding-2b

**Model Name:** Qwen3-VL-Embedding-2B **Base Model:** Qwen/Qwen3-VL-2B-Instruct **Description:** The **Qwen3-VL-Embedding** and **Qwen3-VL-Reranker** model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities. **Key Features:** - Model Type: MultiModal Embedding - Supported Languages: 30+ Languages - Supported Input Modalities: Text, images, screenshots, videos, and arbitrary multimodal combinations (e.g., text + image, text + video) - Number of Parameters: 2B - Context Length: 32k - Embedding Dimension: Up to 2048, supports user-defined output dimensions ranging from 64 to 2048 **Downloads:** - [GGUF Files](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B) (e.g., `Qwen3-VL-Embedding-2B-Q8_0.gguf`). **Usage:** - Requires `transformers`, `qwen-vl-utils`, and `torch`. - Example: `from scripts.qwen3_vl_embedding import Qwen3VLEmbedder model = Qwen3VLEmbedder(...)` **Citation:** @article{qwen3vlembedding, ...} This description emphasizes its capabilities, efficiency, and versatility for multimodal search tasks.

Links

Tags

qwen3-vl-reranker-8b

**Model Name:** Qwen3-VL-Reranker-8B **Base Model:** Qwen/Qwen3-VL-Reranker-8B **Description:** A high-performance multimodal reranking model for state-of-the-art cross-modal search. It supports 30+ languages and handles text, images, screenshots, videos, and mixed modalities. With 8B parameters and a 32K context length, it refines retrieval results by combining embedding vectors with precise relevance scores. Optimized for efficiency, it supports quantized versions (e.g., Q8_0, Q4_K_M) and is ideal for applications requiring accurate multimodal content matching. **Key Features:** - **Multimodal**: Text, images, videos, and mixed content. - **Language Support**: 30+ languages. - **Quantization**: Available in Q8_0 (best quality), Q4_K_M (fast, recommended), and lower-precision options. - **Performance**: Outperforms base models in retrieval tasks (e.g., JinaVDR, ViDoRe v3). - **Use Case**: Enhances search pipelines by refining embeddings with precise relevance scores. **Downloads:** - [GGUF Files](https://huggingface.co/mradermacher/Qwen3-VL-Reranker-8B-GGUF) (e.g., `Qwen3-VL-Reranker-8B.Q8_0.gguf`). **Usage:** - Requires `transformers`, `qwen-vl-utils`, and `torch`. - Example: `from scripts.qwen3_vl_reranker import Qwen3VLReranker; model = Qwen3VLReranker(...)` **Citation:** @article{qwen3vlembedding, ...} This description emphasizes its capabilities, efficiency, and versatility for multimodal search tasks.

Links

https://huggingface.co/mradermacher/Qwen3-VL-Reranker-8B-GGUF

Tags

qwen3-vl-reranker-2b-i1

**Model Name:** Qwen3-VL-Reranker-2B-i1 **Base Model:** Qwen/Qwen3-VL-Reranker-2B **Description:** A high-performance multimodal reranking model for state-of-the-art cross-modal search. It supports 30+ languages and handles text, images, screenshots, videos, and mixed modalities. With 8B parameters and a 32K context length, it refines retrieval results by combining embedding vectors with precise relevance scores. Optimized for efficiency, it supports quantized versions (e.g., Q8_0, Q4_K_M) and is ideal for applications requiring accurate multimodal content matching. **Key Features:** - **Multimodal**: Text, images, videos, and mixed content. - **Language Support**: 30+ languages. - **Quantization**: Available in Q8_0 (best quality), Q4_K_M (fast, recommended), and lower-precision options. - **Performance**: Outperforms base models in retrieval tasks (e.g., JinaVDR, ViDoRe v3). - **Use Case**: Enhances search pipelines by refining embeddings with precise relevance scores. **Downloads:** - [GGUF Files](https://huggingface.co/mradermacher/Qwen3-VL-Reranker-2B-i1-GGUF) (e.g., `Qwen3-VL-Reranker-2B.i1-Q4_K_M.gguf`). **Usage:** - Requires `transformers`, `qwen-vl-utils`, and `torch`. - Example: `from scripts.qwen3_vl_reranker import Qwen3VLReranker; model = Qwen3VLReranker(...)` **Citation:** @article{qwen3vlembedding, ...} This description emphasizes its capabilities, efficiency, and versatility for multimodal search tasks.

Links

https://huggingface.co/mradermacher/Qwen3-VL-Reranker-2B-i1-GGUF

Tags

mistral-nemo-instruct-2407-12b-thinking-m-claude-opus-high-reasoning-i1

The model described in this repository is the **Mistral-Nemo-Instruct-2407-12B** (12 billion parameters), a large language model optimized for instruction tuning and high-level reasoning tasks. It is a **quantized version** of the original model, compressed for efficiency while retaining key capabilities. The model is designed to generate human-like text, perform complex reasoning, and support multi-modal tasks, making it suitable for applications requiring strong language understanding and output.

Links

https://huggingface.co/mradermacher/Mistral-Nemo-Instruct-2407-12B-Thinking-M-Claude-Opus-High-Reasoning-i1-GGUF

Tags

glm-4.5v-i1

The model in question is a **quantized version** of the **GLM-4.5V** large language model, originally developed by **zai-org**. This repository provides multiple quantized variants of the model, optimized for different trade-offs between size, speed, and quality. The base model, **GLM-4.5V**, is a multilingual (Chinese/English) large language model, and this quantized version is designed for efficient inference on hardware with limited memory. Key features include: - **Quantization options**: IQ2_M, Q2_K, Q4_K_M, IQ3_M, IQ4_XS, etc., with sizes ranging from 43 GB to 96 GB. - **Performance**: Optimized for inference, with some variants (e.g., Q4_K_M) balancing speed and quality. - **Vision support**: The model is a vision model, with mmproj files available in the static repository. - **License**: MIT-licensed. This quantized version is ideal for applications requiring compact, efficient models while retaining most of the original capabilities of the base GLM-4.5V.

Links

https://huggingface.co/mradermacher/GLM-4.5V-i1-GGUF

Tags

qwen3-vl-30b-a3b-instruct

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on-demand deployment. #### Key Enhancements: * **Visual Agent**: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. * **Visual Coding Boost**: Generates Draw.io/HTML/CSS/JS from images/videos. * **Advanced Spatial Perception**: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. * **Long Context & Video Understanding**: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. * **Enhanced Multimodal Reasoning**: Excels in STEM/Math—causal analysis and logical, evidence-based answers. * **Upgraded Visual Recognition**: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. * **Expanded OCR**: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. * **Text Understanding on par with pure LLMs**: Seamless text–vision fusion for lossless, unified comprehension. #### Model Architecture Updates: 1. **Interleaved-MRoPE**: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. **DeepStack**: Fuses multi‑level ViT features to capture fine-grained details and sharpen image–text alignment. 3. **Text–Timestamp Alignment:** Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-30B-A3B-Instruct.

Links

https://huggingface.co/unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF

Tags

Model Gallery

Filter by type:

Filter by tags:

qwythos-9b-claude-mythos-5-1m

glm-5.2

step-3.7-flash

privacy-filter-multilingual

privacy-filter-nemotron

nemotron-3-nano-omni-30b-a3b-reasoning-apex

kimi-k2.6

qwen_qwen3.5-4b

qwen3.5-27b-claude-4.6-opus-reasoning-distilled-i1

qwen3.5-4b-claude-4.6-opus-reasoning-distilled

whisperx-tiny

voxcpm-1.5

vllm-omni-qwen3-omni-30b

vllm-omni-qwen3-tts-custom-voice

qwen3-vl-embedding-8b

qwen3-vl-embedding-2b

qwen3-vl-reranker-8b

qwen3-vl-reranker-2b-i1

mistral-nemo-instruct-2407-12b-thinking-m-claude-opus-high-reasoning-i1

glm-4.5v-i1

qwen3-vl-30b-a3b-instruct