AI pulse last 7 days
Daily AI pulse from YouTube, blogs, Reddit, HN. Ruthlessly filtered.
Sources (41)
- [critical] Andrej Karpathy
Former Tesla AI director, OpenAI cofounder. Every video is gold.
- [critical] Anthropic
Official Anthropic channel. Every Claude release.
- [critical] ComfyUI Blog
Release log for ComfyUI integrations: Luma Uni-1, GPT Image 2, ACE-Step music gen, Seedance. Covers video + image + music + workflow.
- [critical] OpenAI Blog
Official OpenAI blog. All releases.
- [critical] Simon Willison's Weblog
The best AI 'thinker' around. Daily posts, deep insights, low hype rate.
- [high] AI Explained
Deep analysis of papers and benchmarks, low hype rate.
- [high] AI Jason
Practical tutorials: Claude Code, MCP, vibe-coding workflows.
- [high] Ben's Bites
Daily AI digest, creator-friendly tone. Codex, model releases, agentic AI.
- [high] Cole Medin
Vibe coding + agentic workflows + Claude Code MCP integrations.
- [high] Fal AI Blog
Fal hosts most new AI image/video models; their blog is an early signal of launches.
- [high] HN: 3D & Gaussian Splatting
HN signal for generative 3D: Gaussian Splatting, NeRF, image-to-3D. Threshold of 20 points because the category is niche (historic top: 182 pts).
- [high] HN: AI agents / MCP
HN posts about agents, MCP, and vibe coding with at least 100 points.
- [high] HN: Claude / Anthropic
HN posts mentioning 'Claude' or 'Anthropic' with at least 100 points.
- [high] Hugging Face Blog
Releases of image, video, audio, and 3D models. Partly tech-heavy; the Gemini relevance filter screens out the noise. Downgraded from critical: too much volume for 'must-read' status.
- [high] IndyDevDan
Claude Code power user: prompts, hooks.
- [high] Interconnects (Nathan Lambert)
AI policy + research analysis. Low hype rate, opinionated.
- [high] Latent Space
Swyx's podcast + blog: founder interviews and engineering deep dives.
- [high] Matt Wolfe
Comprehensive weekly digest of AI tools. ~700K subs.
- [high] Matthew Berman
AI news, model release reviews, agent demos. High output.
- [high] r/aivideo
The AI video community: Sora, Veo, Runway, Kling, LTX. What genuinely surprises creators.
- [high] r/ClaudeAI
The Claude community: power users, tips, problems.
- [high] r/LocalLLaMA
Open-source LLMs, local inference, benchmarks without the hype.
- [high] r/StableDiffusion
The largest open-source image-gen community (700k+ users). Model launches, LoRAs, ComfyUI workflows.
- [high] Riley Brown
Vibe coding, AI builder workflows, Cursor + Claude tutorials.
- [high] The Decoder
German AI news outlet publishing in English, good breaking news.
- [high] Theo - t3.gg
TypeScript + AI dev workflows. Hot takes, narrative-driven.
- [high] Yannic Kilcher
Paper reviews and deep dives into AI research.
- [low] AI Weirdness
Janelle Shane: playful AI experiments, image-gen quirks. Low volume, unique perspective.
- [medium] bycloud
AI papers made digestible: somewhere between Two Minute Papers and Yannic Kilcher.
- [medium] Creative Bloq
The design industry: where AI is encroaching on classic graphic disciplines.
- [medium] Fireship
100-second format, often AI/LLM + tech news.
- [medium] fxguide
The VFX and film industry: ever more AI in the pipeline. A professional perspective.
- [medium] Greg Isenberg
Solo-founder vibe: builds products with AI, podcasts with indie hackers.
- [medium] r/ChatGPTCoding
Vibe-coding tips, IDE setups, prompts. A mix of all models.
- [medium] r/comfyui
ComfyUI workflows: custom nodes, JSON workflows, optimizations.
- [medium] r/midjourney
The Midjourney community: v7+ launches, style references, prompt patterns.
- [medium] r/runwayml
The Runway-specific community: feature launches, prompt patterns, comparisons with competitors.
- [medium] r/SunoAI
The Suno music-gen community: new model versions, lyric-prompting techniques. Audio AI has a weak RSS ecosystem.
- [medium] Tina Huang
AI workflows for data science, practical applications.
- [medium] Two Minute Papers
Short summaries of AI papers, great for a quick scan.
- [medium] Wes Roth
AI news with a more clickbaity tone; the Gemini filter weeds out the hype.

feat: Add Mimo v2.5 model support by AesSedai · Pull Request #22493 · ggml-org/llama.cpp
A new, powerful multimodal AI model, Mimo v2.5, with a massive 1M token context window and MoE architecture, is now supported by `llama.cpp`, making it accessible for local experimentation.
The popular `llama.cpp` project, known for enabling local inference of large language models, has officially added support for the new Mimo v2.5 model through a recent pull request. This significant update allows hobbyists and creative non-developers to run a highly advanced, multimodal Mixture of Experts (MoE) model on their consumer hardware. Mimo v2.5 features a sparse MoE architecture with 310B total parameters (15B activated), an exceptional 1M token context length, and comprehensive multimodal capabilities spanning text, image, video, and audio, supported by dedicated 729M-param vision and 261M-param audio encoders. This integration democratizes access to cutting-edge AI, making powerful local experimentation more feasible.
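For readers who want to try it: once support like this is merged, the usual llama.cpp pattern is a stock build plus a GGUF download. A minimal sketch, assuming a merged build and a published GGUF conversion (the filename below is hypothetical); note that an MoE model still loads all 310B weights, so only 15B parameters being active per token does not shrink the memory footprint.

```bash
# Minimal sketch: build llama.cpp and serve a (hypothetical) Mimo v2.5 GGUF.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build -j

# MoE loads every expert: a ~4.5-bit quant of 310B total parameters
# needs on the order of 170 GB of RAM/VRAM despite 15B active params.
./build/bin/llama-server -m mimo-v2.5-q4_k_m.gguf -c 32768 --port 8080
```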
r/LocalLLaMA·model_release·05/07/2026, 11:23 AM·/u/jacek2023
Running Qwen3.5 / Qwen3.6 with NextN MTP (Multi-Token Prediction) speculative decode in llama.cpp — single RTX 3090 Ti GPU guide
Speed up Qwen 3.5/3.6 models by nearly 3x on a single GPU using NextN Multi-Token Prediction in llama.cpp with this specific build and quantization guide.
This technical guide details how to implement NextN Multi-Token Prediction (MTP) for the Qwen 3.5 and 3.6 model families using llama.cpp. By leveraging MTP, users can achieve approximately 2.9x faster decoding speeds with zero loss in output quality, as the prediction heads are natively integrated into these models. The process currently requires building llama.cpp from specific pull requests (#22400 and #22673) or using a provided fork. A critical step involves a specific quantization override (`--tensor-type nextn=q8_0`) to prevent output corruption. Benchmarks show the 35B MoE variant reaching an impressive ~150 tokens per second on a single RTX 3090 Ti.
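A minimal sketch of the build-from-PR step the guide describes, using GitHub's standard pull-request refs and the `--tensor-type` override that `llama-quantize` provides (PR numbers are from the post; file names are hypothetical):

```bash
# Check out the unmerged PR and build (CUDA backend assumed here).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/22673/head && git checkout FETCH_HEAD
cmake -B build -DGGML_CUDA=ON && cmake --build build -j

# The critical step from the guide: keep the NextN/MTP tensors at q8_0
# while quantizing the rest, to avoid output corruption.
./build/bin/llama-quantize --tensor-type nextn=q8_0 \
    qwen3.6-f16.gguf qwen3.6-q4_k_m.gguf q4_k_m   # hypothetical filenames
```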
r/LocalLLaMA·tutorial·05/07/2026, 09:56 AM·/u/yes_i_tried_google
why llama.cpp can’t combine speculative decode methods?
Users want to combine MTP and ngram speculative decoding in llama.cpp to maximize speed on coding tasks, but the current implementation limits them to one method at a time.
A technical discussion on r/LocalLLaMA highlights a current limitation of speculative decoding in llama.cpp. A user testing Qwen 3.6 27B with Multi-Token Prediction (MTP) found that while MTP is effective, combining it with ngram speculation would be ideal for agentic coding: ngram lookup is particularly fast at predicting repeated code blocks, which appear frequently during file edits. Currently, llama.cpp only supports one speculative method at a time via command-line arguments. The community is exploring whether this is a fundamental architectural constraint or a temporary implementation hurdle that could be resolved to further boost local inference speeds.
r/LocalLLaMA·tooling·05/07/2026, 07:53 AM·/u/Qwoctopussy
Get faster qwen 3.6 27b
Achieve 50 t/s on Qwen 3.6 27B with 100k context on a single RTX 3090 by using MTP GGUFs and a specific llama.cpp branch.
A user on r/LocalLLaMA shared a method to significantly boost inference speeds for the Qwen 3.6 27B model on consumer hardware. By utilizing Multi-Token Prediction (MTP) GGUF files and a specific pull request for llama.cpp, they achieved speeds of 50 tokens per second on an RTX 3090. The setup involves using Q4_K_M quantization for the model and Q4_0 for the K/V cache to fit a 100k context within 19GB of VRAM. The post includes a step-by-step guide for applying the PR and the exact server configuration flags needed. It also mentions a Mac-specific installation via Homebrew for similar performance gains.
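The post's exact flags aren't quoted above, but a hedged reconstruction of such a launch looks like the sketch below (model filename hypothetical). Note that llama.cpp requires flash attention for a quantized V cache, and the flag spelling varies between versions:

```bash
# Sketch: Q4_K_M weights, q4_0 K/V cache, 100k context on a 24 GB RTX 3090.
./build/bin/llama-server \
  -m qwen3.6-27b-mtp-q4_k_m.gguf \
  -c 100000 \
  -ngl 99 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --flash-attn on \
  --port 8080
```

The q4_0 cache types roughly halve KV memory versus q8_0, which is what lets 100k tokens fit alongside the weights in 19 GB.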
r/LocalLLaMA·tooling·05/06/2026, 11:33 PM·/u/admajic
Uploaded Unsloth Qwen3.6-35B-A3B UD XL models with MTP grafted, here are the results
MTP (Multi-Token Prediction) can significantly speed up local LLM inference, but its effectiveness varies greatly depending on the model architecture and hardware setup.
User /u/havenoammo released GGUF versions of the Qwen3.6-35B-A3B model featuring 'grafted' Multi-Token Prediction (MTP) layers. While MTP previously showed 2-2.5x speedups on dense models like the 27B variant, results for this MoE (Mixture of Experts) version are more modest, ranging from a 6% to 50% increase in tokens per second. The performance seems highly dependent on the specific GPU configuration and quantization level (Q4 vs Q8). The release includes the isolated MTP layers and conversion scripts on HuggingFace, allowing the community to experiment with speculative decoding. These preliminary results suggest that MoE architectures might not benefit as uniformly from MTP as dense models do in current llama.cpp implementations.
r/LocalLLaMA·tooling·05/06/2026, 09:51 PM·/u/havenoammo
Great results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot
A complete, reproducible configuration for running Qwen 3.6-35B locally in VS Code, achieving ~100 t/s for high-quality coding tasks on consumer hardware.
A user on r/LocalLLaMA shared a highly successful local coding setup using the Qwen 3.6-35B model (MoE architecture) via llama.cpp on an AMD R9700 GPU. The post includes the exact startup command for the Vulkan server, a VS Code chatLanguageModels.json configuration, and a complex React/TypeScript prompt that generated a fully functional website. Performance metrics show generation speeds of ~100 tokens/second, though large 38k token prompts cause a 17-second prefill delay. The setup utilizes context checkpointing and flash attention to maintain efficiency. This serves as a practical blueprint for developers looking to replace paid coding assistants with local LLMs.
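The Vulkan backend is a standard llama.cpp build option, and llama-server exposes an OpenAI-compatible endpoint that editor integrations can point at. A minimal sketch (model filename hypothetical; the post's chatLanguageModels.json contents are not reproduced here):

```bash
# Build with the Vulkan backend (works on AMD GPUs such as the R9700).
cmake -B build -DGGML_VULKAN=ON && cmake --build build -j
./build/bin/llama-server -m qwen3.6-35b-a3b-ud-q5_k_xl.gguf -c 40000 --port 8080 &

# Any OpenAI-compatible client (including VS Code) can then hit the server:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Build a React + TypeScript landing page."}]}'
```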
r/LocalLLaMA·tooling·05/06/2026, 08:47 PM·/u/supracode
Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR
Boost your local Qwen3.6-27B inference speed by 2.5x using MTP-enabled GGUFs and a custom llama.cpp build.
A community developer has successfully implemented Multi-Token Prediction (MTP) for the Qwen3.6-27B model in GGUF format, achieving a 2.5x increase in token throughput. By 'grafting' Q8-quantized MTP draft heads onto Unsloth UD XL base models, the setup allows for speculative decoding where four tokens are predicted per forward pass. This implementation utilizes an unmerged llama.cpp pull request (#22673) to enable MTP support locally, a feature previously limited to server-side engines like vLLM. The method adds minimal VRAM overhead while significantly improving inference speed on consumer hardware. Detailed build instructions and the conversion script are provided on HuggingFace.
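As a rough sanity check on the 2.5x figure (an idealized model, not a measurement from the post): if each of the $k = 4$ drafted tokens is accepted independently with probability $\alpha$, the expected number of tokens emitted per target-model forward pass is

$$\mathbb{E}[\text{tokens per pass}] = \sum_{i=0}^{k} \alpha^{i} = \frac{1 - \alpha^{k+1}}{1 - \alpha},$$

which equals 2.5 at roughly $\alpha \approx 0.65$, ignoring the small extra cost of the draft heads.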
r/LocalLLaMA·tooling·05/06/2026, 11:45 AM·/u/havenoammo
2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints
Run Qwen 3.6 27B locally with 2.5x speedup (up to 28 tok/s) using new MTP support in llama.cpp and optimized GGUF quants.
A new optimization for Qwen 3.6 27B leverages Multi-Token Prediction (MTP) via a llama.cpp Pull Request to achieve 2.5x faster inference. User /u/ex-arman68 shared custom GGUF quants that include fixed chat templates and support for massive context windows, reaching up to 262k on 48GB RAM using q4_0 KV cache compression. The setup requires compiling a specific experimental branch of llama.cpp but delivers approximately 28 tokens per second on Apple Silicon. Detailed hardware recommendations for both Mac and NVIDIA users are provided, covering various RAM configurations from 16GB to 80GB. Note that vision capabilities currently conflict with MTP in this experimental build.
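To see why 262k context fits in 48 GB, a back-of-the-envelope KV-cache estimate helps. Assuming, hypothetically, around 48 layers of grouped-query attention with 8 KV heads at head dimension 128 (the post doesn't give the model's dimensions), a q4_0 cache at about 0.56 bytes per value gives

$$2 \times 48 \times 8 \times 128 \times 0.56\,\text{B} \approx 55\,\text{KB/token}, \qquad 262{,}144 \times 55\,\text{KB} \approx 14.5\,\text{GB},$$

which sits comfortably alongside roughly 15 GB of Q4 weights; the real numbers depend on the actual architecture.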
r/LocalLLaMA·tooling·05/06/2026, 09:35 AM·/u/ex-arman68
Qwen 3.6 27b Q4.0 MTP GGUF
Multi-Token Prediction (MTP) allows running a 27b model at the speed of a 9b model on integrated GPUs using llama.cpp.
A user report on r/LocalLLaMA highlights the performance benefits of Multi-Token Prediction (MTP) for the Qwen 3.6 27b model. Using the Q4.0 GGUF quantization in llama.cpp, the 27b model achieves inference speeds comparable to the smaller 9b Qwen 3.5 model. This test was conducted on an AMD iGPU with 64GB of unified memory, demonstrating that MTP significantly lowers the hardware barrier for running larger models locally. The results suggest that MTP is a viable path for making mid-sized models feel as responsive as small models on consumer-grade integrated graphics.
r/LocalLLaMA·tooling·05/06/2026, 03:01 AM·/u/Available_Hornet3538
Qwen 3.6 27B MTP on v100 32GB: 54 t/s
Multi-Token Prediction (MTP) nearly doubles inference speed for Qwen 3.6 27B on older V100 hardware, making it a highly viable local coding assistant.
A user report demonstrates a significant performance boost for Qwen 3.6 27B using Multi-Token Prediction (MTP) on a Tesla V100 32GB GPU. By utilizing a specific MTP branch of llama.cpp, inference speeds jumped from approximately 30 t/s to 54 t/s, nearly doubling the output rate. The setup utilized a q8_0 KV cache and supported a 200k context limit, effectively serving as a high-speed VS Code Copilot replacement. While performance dipped slightly to 40-45 t/s at higher context depths (50k+ tokens), the model remained highly effective for complex tasks like tool calls and code refactoring. This highlights the potential of MTP to extend the lifecycle of older enterprise hardware for modern local LLM workloads.
r/LocalLLaMA·tooling·05/06/2026, 02:18 AM·/u/m94301
MTP on strix halo with llama.cpp (PR #22673)
Multi-Token Prediction (MTP) in llama.cpp nearly doubles inference speeds on AMD Strix Halo hardware, reaching up to 80 t/s on 35B models.
A user on r/LocalLLaMA demonstrated a significant performance boost using the new Multi-Token Prediction (MTP) support in llama.cpp. Testing on an AMD Strix Halo (AI Max 395) with 128GB of fast DDR5-8000 RAM, inference speeds for a Qwen 35B model jumped from approximately 40 t/s to between 60 and 80 t/s. The setup utilized a specific pull request (#22673) and specialized GGUF files designed for MTP. While prompt processing (PP) speeds remained stable, the generation speed benefit is nearly double in some scenarios. This highlights the potential of speculative decoding techniques to make large local models much more responsive on high-end unified memory APUs.
r/LocalLLaMA·tooling·05/05/2026, 10:26 PM·/u/Edenar
Relevance auto-scored by LLM (0–10). List shows top 30 from the last 7 days.