Model Gallery

25 models from 1 repository

supergemma4-26b-uncensored-v2
Hugging Face | GitHub | Launch Blog | Documentation

License: Apache 2.0 | Authors: Google DeepMind

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages. Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning.

The models are available in four distinct sizes: **E2B**, **E4B**, **26B A4B**, and **31B**. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.

Gemma 4 introduces key **capability and architectural advancements**:

* **Reasoning** – All models in the family are designed as highly capable reasoners, with configurable thinking modes.

...

Repository: localai | License: gemma

nemo-parakeet-tdt-0.6b
NVIDIA NeMo Parakeet TDT 0.6B v3 is an automatic speech recognition (ASR) model from NVIDIA's NeMo toolkit. Parakeet models are state-of-the-art ASR models trained on large-scale English audio data.

Repository: localai | License: apache-2.0

voxtral-mini-4b-realtime
Voxtral Mini 4B Realtime is a speech-to-text model from Mistral AI. It is a 4B parameter model optimized for fast, accurate audio transcription with low latency, making it ideal for real-time applications. The model uses the Voxtral architecture for efficient audio processing.

Repository: localai | License: apache-2.0

moonshine-tiny
Moonshine Tiny is a lightweight speech-to-text model optimized for fast transcription. It is designed for efficient on-device ASR with high accuracy relative to its size.

Repository: localai | License: apache-2.0

whisperx-tiny
WhisperX Tiny is a fast and accurate speech recognition model with speaker diarization capabilities. Built on OpenAI's Whisper with additional features for alignment and speaker segmentation.

Repository: localai | License: mit

omnilingual-0.3b-ctc-q8-sherpa
Omnilingual ASR CTC 300M (int8) is a multilingual automatic speech recognition model supporting 1,600+ languages. Based on Meta's omniASR_CTC_300M architecture (Wav2Vec2 with CTC head), quantized to int8 for efficient inference. Uses the sherpa-onnx backend with ONNX Runtime.

Repository: localai | License: apache-2.0

streaming-zipformer-en-sherpa
Streaming English ASR: sherpa-onnx zipformer transducer (int8, chunk-16 left-128). Low-latency real-time transcription with endpoint detection via sherpa-onnx's online recognizer. English-only; for multilingual offline ASR see omnilingual-0.3b-ctc-q8-sherpa.

Repository: localai | License: apache-2.0

silero-vad-sherpa
Silero VAD served through the sherpa-onnx backend. Uses the same ONNX weights as the dedicated silero-vad backend, loaded through sherpa-onnx's C VAD API. Pairs with the sherpa-onnx ASR entries for round-trip audio pipelines.

Repository: localai | License: mit

vits-ljs-sherpa
VITS-LJS English single-speaker TTS served through the sherpa-onnx backend. Trained on the LJSpeech corpus at 22.05 kHz. Pairs with the sherpa-onnx ASR entries for round-trip audio pipelines.

Repository: localai | License: mit

vllm-omni-qwen3-omni-30b
Qwen3-Omni-30B-A3B-Instruct via vLLM-Omni - A large multimodal Mixture-of-Experts model (30B total parameters, ~3B activated per token) from the Alibaba Qwen team. Supports text, image, audio, and video understanding with text and speech output. Features native multimodal understanding across all modalities.

Repository: localai | License: apache-2.0

ace-step-turbo
ACE-Step 1.5 Turbo is a music generation model that can create music from text descriptions, lyrics, or audio samples. Supports both simple text-to-music and advanced music generation with metadata like BPM, key scale, and time signature.

Repository: localai | License: mit

acestep-cpp-turbo
ACE-Step 1.5 Turbo (C++ / GGML) — native C++ music generation from text descriptions and lyrics. Two-stage pipeline: text-to-code (Qwen3 LM) + code-to-audio (DiT-VAE). Stereo 48kHz output. Uses Q8_0 quantized models for a good balance of quality and speed.

Repository: localai | License: mit

acestep-cpp-turbo-4b
ACE-Step 1.5 Turbo (C++ / GGML) with 4B LM — higher quality music generation from text and lyrics. Uses the larger 4B parameter LM for better metadata/code generation. Stereo 48kHz output.

Repository: localai | License: mit

qwen3-tts-cpp
Qwen3-TTS 0.6B (C++ / GGML) — native C++ text-to-speech from text input. Generates 24kHz mono audio. Supports 10 languages (en, zh, ja, ko, de, fr, es, it, pt, ru). Uses F16 GGUF models (~2 GB total).

Repository: localai | License: apache-2.0

qwen3-tts-cpp-customvoice
Qwen3-TTS 0.6B Custom Voice (C++ / GGML) — text-to-speech with voice cloning support. Generates 24kHz mono audio with optional reference audio for voice cloning via ECAPA-TDNN speaker embeddings. Supports 10 languages (en, zh, ja, ko, de, fr, es, it, pt, ru).

Repository: localai | License: apache-2.0

fish-speech-s2-pro
Fish Speech S2-Pro is a high-quality text-to-speech model supporting voice cloning via reference audio. Uses a two-stage pipeline: text to semantic tokens (LLaMA-based) then semantic to audio (DAC decoder).

Repository: localai | License: apache-2.0

qwen3-omni-30b-a3b-instruct
Qwen3-Omni is the natively end-to-end multilingual omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. This GGUF build runs on llama.cpp with the bundled mmproj for multimodal inputs.

Repository: localai | License: apache-2.0

qwen3-omni-30b-a3b-thinking
Qwen3-Omni-30B-A3B-Thinking is the reasoning-enhanced variant of Qwen3-Omni, a natively end-to-end multilingual omni-modal foundation model. It processes text, images, and audio and produces chain-of-thought reasoning before the final answer. This GGUF build runs on llama.cpp with the bundled mmproj.

Repository: localai | License: apache-2.0

qwen3-asr-0.6b
Qwen3-ASR 0.6B is a compact automatic speech recognition model from the Qwen3 family, distributed as a GGUF for llama.cpp. It accepts audio input through the paired mmproj and transcribes it to text, supporting multilingual speech.

Repository: localai | License: apache-2.0

qwen3-asr-1.7b
Qwen3-ASR 1.7B is the larger automatic speech recognition model from the Qwen3 family, distributed as a GGUF for llama.cpp. It accepts audio input through the paired mmproj and produces higher-quality multilingual transcriptions than the 0.6B variant.

Repository: localai | License: apache-2.0

ltx-2
**LTX-2** is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.

**Key Features:**

- **Joint Audio-Video Generation**: Generates synchronized video and audio in a single model
- **Image-to-Video**: Converts static images into dynamic videos with matching audio
- **High Quality**: Produces realistic video with natural motion and synchronized audio
- **Open Weights**: Available under the LTX-2 Community License Agreement

**Model Details:**

- **Model Type**: Diffusion-based audio-video foundation model
- **Architecture**: DiT (Diffusion Transformer) based
- **Developed by**: Lightricks
- **Paper**: [LTX-2: Efficient Joint Audio-Visual Foundation Model](https://arxiv.org/abs/2601.03233)

**Usage Tips:**

- Width and height settings must be divisible by 32
- Frame count must be of the form 8n + 1 (e.g., 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 105, 113, 121)
- Recommended settings: width=768, height=512, num_frames=121, frame_rate=24.0
- For best results, use detailed prompts describing motion and scene dynamics

**Limitations:**

- This model is not intended or able to provide factual information
- Prompt following is heavily influenced by prompting style
- When generating audio without speech, the audio may be of lower quality

**Citation:**

```bibtex
@article{hacohen2025ltx2,
  title={LTX-2: Efficient Joint Audio-Visual Foundation Model},
  author={HaCohen, Yoav and Brazowski, Benny and Chiprut, Nisan and others},
  journal={arXiv preprint arXiv:2601.03233},
  year={2025}
}
```

Repository: localai | License: ltx-2-community-license-agreement
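The LTX-2 parameter constraints above (dimensions divisible by 32, frame count of the form 8n + 1) can be checked before submitting a generation request. A minimal sketch; the helper name is illustrative and not part of any LTX-2 or LocalAI API:

```python
def valid_ltx2_settings(width: int, height: int, num_frames: int) -> bool:
    """Return True if the settings satisfy the LTX-2 usage tips:
    width and height divisible by 32, num_frames equal to 8n + 1."""
    return (
        width % 32 == 0
        and height % 32 == 0
        and num_frames >= 1
        and (num_frames - 1) % 8 == 0
    )

# Recommended settings from the model card pass the check:
assert valid_ltx2_settings(768, 512, 121)
# 120 frames is rejected (not of the form 8n + 1):
assert not valid_ltx2_settings(768, 512, 120)
# 770 px width is rejected (not divisible by 32):
assert not valid_ltx2_settings(770, 512, 121)
```

Validating these up front avoids a failed or silently degraded generation run when a client passes arbitrary resolutions.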
