AI pulse: last 7 days
Daily AI pulse from YouTube, blogs, Reddit, HN. Ruthlessly filtered.
Sources (41)
- critical · Andrej Karpathy
Former AI director at Tesla, OpenAI co-founder. Every video is gold.
- critical · Anthropic
Anthropic's official channel. Every Claude release.
- critical · ComfyUI Blog
Release log for ComfyUI integrations: Luma Uni-1, GPT Image 2, ACE-Step music gen, Seedance. Covers video, image, music, and workflows.
- critical · OpenAI Blog
OpenAI's official blog. All releases.
- critical · Simon Willison's Weblog
The best AI 'thinker'. Daily posts, deep insights, low hype rate.
- high · AI Explained
Deep analysis of papers and benchmarks, low hype rate.
- high · AI Jason
Practical tutorials on Claude Code, MCP, and vibe-coding workflows.
- high · Ben's Bites
Daily AI digest, creator-friendly tone. Codex, model releases, agentic AI.
- high · Cole Medin
Vibe coding + agentic workflows + Claude Code MCP integrations.
- high · Fal AI Blog
Fal hosts most new AI image/video models; their blog is an early signal for launches.
- high · HN: 3D & Gaussian Splatting
HN signal for generative 3D: Gaussian Splatting, NeRF, image-to-3D. 20-point threshold because the category is niche (historic top: 182 pts).
- high · HN: AI agents / MCP
HN posts about agents, MCP, and vibe coding with at least 100 points.
- high · HN: Claude / Anthropic
HN posts mentioning 'Claude' or 'Anthropic' with at least 100 points.
- high · Hugging Face Blog
Releases for image, video, audio, and 3D models. Partly tech-heavy; Gemini relevance scoring filters out the noise. Downgraded from critical: too much volume for 'must-read' status.
- high · IndyDevDan
Claude Code power user, prompts, hooks.
- high · Interconnects (Nathan Lambert)
AI policy + research analysis. Low hype rate, opinionated.
- high · Latent Space
Swyx's podcast + blog: founder interviews and engineering deep dives.
- high · Matt Wolfe
Comprehensive weekly digest of AI tools. ~700K subs.
- high · Matthew Berman
AI news, model release reviews, agent demos. High output.
- high · r/aivideo
The AI video community: Sora, Veo, Runway, Kling, LTX. What genuinely surprises creators.
- high · r/ClaudeAI
The Claude community: power users, tips, problems.
- high · r/LocalLLaMA
Open-source LLMs, local inference, benchmarks without the hype.
- high · r/StableDiffusion
The largest open-source image-gen community (700k+ users). Model launches, LoRAs, ComfyUI workflows.
- high · Riley Brown
Vibe coding, AI builder workflows, Cursor + Claude tutorials.
- high · The Decoder
German AI news outlet in English, good breaking news.
- high · Theo - t3.gg
TypeScript + AI dev workflows. Hot takes, narrative-driven.
- high · Yannic Kilcher
Paper reviews and deep dives into AI research.
- low · AI Weirdness
Janelle Shane: playful AI experiments, image-gen quirks. Low volume, unique perspective.
- medium · bycloud
Digestible AI papers, somewhere between Two Minute Papers and Yannic Kilcher.
- medium · Creative Bloq
The design industry: where AI is encroaching on classic graphic-design disciplines.
- medium · Fireship
100-second format, often AI/LLM + tech news.
- medium · fxguide
The VFX and film industry: more and more AI in the pipeline. A professional perspective.
- medium · Greg Isenberg
Solo-founder vibe: builds products with AI, podcasts with indie hackers.
- medium · r/ChatGPTCoding
Vibe-coding tips, IDE setups, prompts. A mix of all models.
- medium · r/comfyui
ComfyUI workflows: custom nodes, JSON workflows, optimizations.
- medium · r/midjourney
The Midjourney community: v7+ launches, style references, prompt patterns.
- medium · r/runwayml
Runway-specific community: feature launches, prompt patterns, comparisons with competitors.
- medium · r/SunoAI
The Suno music-gen community: new model versions, lyric-prompting techniques. Audio AI has a weak RSS ecosystem.
- medium · Tina Huang
AI workflows for data science, practical applications.
- medium · Two Minute Papers
Short summaries of AI papers, great for a quick scan.
- medium · Wes Roth
AI news with a more clickbaity tone; the Gemini filter sifts out the hype.
Running Qwen3.5 / Qwen3.6 with NextN MTP (Multi-Token Prediction) speculative decode in llama.cpp — single RTX 3090 Ti GPU guide
Speed up Qwen 3.5/3.6 models by nearly 3x on a single GPU using NextN Multi-Token Prediction in llama.cpp with this specific build and quantization guide.
This technical guide details how to implement NextN Multi-Token Prediction (MTP) for the Qwen 3.5 and 3.6 model families using llama.cpp. By leveraging MTP, users can achieve approximately 2.9x faster decoding speeds with zero loss in output quality, as the prediction heads are natively integrated into these models. The process currently requires building llama.cpp from specific pull requests (#22400 and #22673) or using a provided fork. A critical step involves a specific quantization override (--tensor-type nextn=q8_0) to prevent output corruption. Benchmarks show the 35B MoE variant reaching an impressive ~150 tokens per second on a single RTX 3090 Ti.
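For orientation, here is a minimal sketch of the steps the guide describes, written as a Python driver script. Only the PR numbers (#22400, #22673) and the --tensor-type nextn=q8_0 override come from the post; the repository URL, build flags, and GGUF file names are assumptions for illustration.

```python
# Hedged sketch of the build + quantize flow; not the guide's literal commands.
import subprocess

def run(cmd, cwd=None):
    print("+", " ".join(cmd))
    subprocess.run(cmd, cwd=cwd, check=True)

run(["git", "clone", "https://github.com/ggml-org/llama.cpp"])
for pr in (22400, 22673):  # the MTP pull requests referenced in the post
    run(["git", "fetch", "origin", f"pull/{pr}/head:pr-{pr}"], cwd="llama.cpp")
    run(["git", "merge", "--no-edit", f"pr-{pr}"], cwd="llama.cpp")
run(["cmake", "-B", "build", "-DGGML_CUDA=ON"], cwd="llama.cpp")
run(["cmake", "--build", "build", "--config", "Release", "-j"], cwd="llama.cpp")

# Keep the NextN (MTP) tensors at q8_0 during quantization; the post reports
# corrupted output without this override. File names below are placeholders.
run(["llama.cpp/build/bin/llama-quantize",
     "--tensor-type", "nextn=q8_0",
     "qwen3.6-35b-a3b-f16.gguf", "qwen3.6-35b-a3b-q4_k_m.gguf", "q4_k_m"])
```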
r/LocalLLaMA·tutorial·05/07/2026, 09:56 AM·/u/yes_i_tried_google
Why llama.cpp can’t combine speculative decode methods?
Users are seeking to combine MTP and ngram speculative decoding in llama.cpp to maximize speed in coding tasks, but current implementation limits them to one method.
A technical discussion on r/LocalLLaMA highlights a current limitation of llama.cpp's speculative decoding. A user testing Qwen 3.6 27B with Multi-Token Prediction (MTP) found that while MTP is effective, combining it with ngram speculation would be ideal for agentic coding: ngram drafting is particularly fast at predicting repeated code blocks, which come up constantly during file edits. Currently, llama.cpp supports only one speculative method at a time via command-line arguments. The community is exploring whether this is a fundamental architectural constraint or a temporary implementation hurdle that could be resolved to further boost local inference speeds.
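To make the appeal concrete, here is a self-contained Python sketch of ngram drafting (often called prompt-lookup decoding); the names and parameters are illustrative, not llama.cpp internals.

```python
def ngram_draft(tokens: list[int], n: int = 3, k: int = 4) -> list[int]:
    """Propose up to k draft tokens by matching the last n tokens
    against an earlier occurrence in the context."""
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Scan backwards for the most recent earlier occurrence of the tail.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            # Draft whatever followed that occurrence last time.
            return tokens[i + n:i + n + k]
    return []

# A re-emitted code block makes the tail match earlier context, so whole
# chunks get drafted in a single verify step.
print(ngram_draft([1, 2, 3, 9, 8, 1, 2, 3], n=3, k=2))  # -> [9, 8]
```

Since both MTP and this lookup merely produce candidate tokens for the same parallel verification pass, combining them looks plausible on paper, which is exactly what the thread is probing.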
r/LocalLLaMA·tooling·05/07/2026, 07:53 AM·/u/Qwoctopussy
Uploaded Unsloth Qwen3.6-35B-A3B UD XL models with MTP grafted, here are the results
MTP (Multi-Token Prediction) can significantly speed up local LLM inference, but its effectiveness varies greatly depending on the model architecture and hardware setup.
User /u/havenoammo released GGUF versions of the Qwen3.6-35B-A3B model featuring 'grafted' Multi-Token Prediction (MTP) layers. While MTP previously showed 2-2.5x speedups on dense models like the 27B variant, results for this MoE (Mixture of Experts) version are more modest, ranging from a 6% to 50% increase in tokens per second. The performance seems highly dependent on the specific GPU configuration and quantization level (Q4 vs Q8). The release includes the isolated MTP layers and conversion scripts on HuggingFace, allowing the community to experiment with speculative decoding. These preliminary results suggest that MoE architectures might not benefit as uniformly from MTP as dense models do in current llama.cpp implementations.
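A back-of-envelope model hints at why gains vary (this is our simplification, not from the post): assume each drafted token is accepted independently with probability a; the expected number of tokens committed per forward pass is then a short geometric sum.

```python
def expected_tokens(accept_rate: float, draft_len: int) -> float:
    """Expected tokens committed per verify pass: the accepted prefix of
    draft_len drafted tokens plus the one token the verifier always emits."""
    return sum(accept_rate ** i for i in range(draft_len + 1))

for a in (0.5, 0.7, 0.9):                        # hypothetical acceptance rates
    print(a, round(expected_tokens(a, 4), 2))    # -> 1.94, 2.77, 4.1
```

If A3B denotes roughly 3B active parameters, as in earlier Qwen MoE naming, the base decode step is already cheap, so each saved pass buys less wall-clock time and the MTP overhead weighs relatively more, consistent with the modest observed gains.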
r/LocalLLaMA·tooling·05/06/2026, 09:51 PM·/u/havenoammo
Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR
Boost your local Qwen3.6-27B inference speed by 2.5x using MTP-enabled GGUFs and a custom llama.cpp build.
A community developer has successfully implemented Multi-Token Prediction (MTP) for the Qwen3.6-27B model in GGUF format, achieving a 2.5x increase in token throughput. By 'grafting' Q8-quantized MTP draft heads onto Unsloth UD XL base models, the setup allows for speculative decoding where four tokens are predicted per forward pass. This implementation utilizes an unmerged llama.cpp pull request (#22673) to enable MTP support locally, a feature previously limited to server-side engines like vLLM. The method adds minimal VRAM overhead while significantly improving inference speed on consumer hardware. Detailed build instructions and the conversion script are provided on HuggingFace.
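For readers new to the mechanics, here is a minimal sketch of the greedy accept/verify loop that such a 4-token draft head feeds; base_forward and draft_head are stand-ins, not the llama.cpp API.

```python
def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

def speculative_step(tokens, base_forward, draft_head, k=4):
    """Commit between 1 and k+1 tokens per base-model forward pass."""
    draft = draft_head(tokens)[:k]          # MTP head proposes k tokens
    logits = base_forward(tokens + draft)   # one pass scores every position
    out = []
    for i, proposed in enumerate(draft):
        expected = argmax(logits[len(tokens) - 1 + i])
        if proposed != expected:
            out.append(expected)            # fix the first mismatch, stop
            return out
        out.append(proposed)
    out.append(argmax(logits[-1]))          # all accepted: free bonus token
    return out
```

Because every mismatch is replaced with the verifier's own choice, the output is identical to plain greedy decoding; the speedup comes entirely from committing several tokens per pass.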
r/LocalLLaMA·tooling·05/06/2026, 11:45 AM·/u/havenoammo
Gemma 4 MTP released
Get up to 2x faster inference on Gemma 4 models using the newly released Multi-Token Prediction draft checkpoints for speculative decoding.
Google has officially released Multi-Token Prediction (MTP) draft models for the Gemma 4 family, including variants for the 31B and smaller models. These draft models are designed for Speculative Decoding, where a smaller model predicts multiple future tokens that the main model then validates in parallel. This technique can achieve up to 2x speedups in generation latency while maintaining identical output quality compared to standard autoregressive generation. The release includes specialized checkpoints on Hugging Face tuned as assistants for the main Gemma 4 weights. This is a significant update for local LLM users and on-device applications where inference speed is often the primary bottleneck.
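If these checkpoints plug into standard assisted generation, usage could look like the following Hugging Face transformers sketch; the assistant_model hook is a real transformers feature, but both model IDs are placeholders, not confirmed repo names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-4-31b")                      # placeholder ID
main = AutoModelForCausalLM.from_pretrained("google/gemma-4-31b")              # placeholder ID
draft = AutoModelForCausalLM.from_pretrained("google/gemma-4-31b-mtp-draft")   # placeholder ID

inputs = tok("Explain speculative decoding in one paragraph.", return_tensors="pt")
# The draft proposes several tokens; the main model verifies them in one
# parallel pass, so the output matches what it would generate on its own.
out = main.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```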
r/LocalLLaMA·model_release·05/05/2026, 04:01 PM·/u/rerri
Supercharging LLM inference on Google TPUs: Achieving 3X speedups with diffusion-style speculative decoding (Google Developers Blog)
Google achieved a 3X speedup in LLM inference on TPUs by using a new 'diffusion-style' parallel token drafting technique.
Google researchers have introduced a novel approach to speculative decoding inspired by diffusion models, specifically optimized for TPU architectures. Traditional speculative decoding relies on a smaller draft model to predict tokens sequentially, but this new method generates multiple draft tokens in parallel, similar to how diffusion models refine images. This shift addresses the memory bandwidth bottlenecks common in LLM inference, resulting in up to 3X faster generation speeds. While the benchmarks focus on Google's proprietary hardware, the move toward non-autoregressive drafting represents a significant evolution in inference strategy. This technique could eventually influence local model optimization if adapted for consumer GPUs.
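The post shares no code, but the "draft a block in parallel, then refine it" idea is close in spirit to Jacobi-style parallel decoding; the toy sketch below is our illustration of that family, not Google's method.

```python
def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

def jacobi_draft(tokens, forward, k=8, iters=4, pad=0):
    """Guess k future tokens at once, then re-predict all of them in
    parallel until the guesses stop changing (a fixed point)."""
    draft = [pad] * k                        # arbitrary initial block
    for _ in range(iters):
        logits = forward(tokens + draft)     # one wide pass per iteration
        new = [argmax(logits[len(tokens) - 1 + i]) for i in range(k)]
        if new == draft:
            break                            # self-consistent draft found
        draft = new
    return draft                             # then verified like any draft
```

On bandwidth-bound hardware the appeal is that each refinement is one wide pass over the weights rather than k narrow sequential ones, which fits the memory-bandwidth framing in the article.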
r/LocalLLaMA·news·05/05/2026, 03:50 PM·/u/eternviking
Relevance auto-scored by LLM (0–10). List shows top 30 from the last 7 days.