AI pulse last 7 days
Daily AI pulse from YouTube, blogs, Reddit, HN. Ruthlessly filtered.
Sources (41)
- critical · Andrej Karpathy
Former Tesla AI director, OpenAI cofounder. Every video is gold.
- critical · Anthropic
Anthropic's official channel. Every Claude release.
- critical · ComfyUI Blog
Release log for ComfyUI integrations — Luma Uni-1, GPT Image 2, ACE-Step music gen, Seedance. Covers video + image + music + workflows.
- critical · OpenAI Blog
OpenAI's official blog. All releases.
- critical · Simon Willison's Weblog
The best AI 'thinker'. Daily posts, deep insights, low hype rate.
- high · AI Explained
Deep analysis of papers and benchmarks, low hype rate.
- high · AI Jason
Practical tutorials on Claude Code, MCP, and vibe-coding workflows.
- high · Ben's Bites
Daily AI digest, creator-friendly tone. Codex, model releases, agentic AI.
- high · Cole Medin
Vibe coding + agentic workflows + Claude Code MCP integrations.
- high · Fal AI Blog
Fal hosts most new AI image/video models — their blog is an early signal for launches.
- high · HN: 3D & Gaussian Splatting
HN signal for generative 3D — Gaussian Splatting, NeRF, image-to-3D. Threshold of 20 points because it's a niche category (historic top: 182 pts).
- high · HN: AI agents / MCP
HN posts about agents, MCP, and vibe coding with a 100-point minimum.
- high · HN: Claude / Anthropic
HN posts mentioning 'Claude' or 'Anthropic' with a 100-point minimum.
- high · Hugging Face Blog
Releases for image, video, audio, and 3D models. Partly tech-heavy — Gemini relevance scoring filters out the noise. Downgraded from critical: too much volume for 'must-read' status.
- high · IndyDevDan
Claude Code power user, prompts, hooks.
- high · Interconnects (Nathan Lambert)
AI policy + research analysis. Low hype rate, opinionated.
- high · Latent Space
Swyx's podcast + blog — founder interviews and engineering deep dives.
- high · Matt Wolfe
Comprehensive weekly digest of AI tools. ~700K subs.
- high · Matthew Berman
AI news, model release reviews, agent demos. High output.
- high · r/aivideo
AI video community — Sora, Veo, Runway, Kling, LTX. What genuinely surprises creators.
- high · r/ClaudeAI
The Claude community — power users, tips, problems.
- high · r/LocalLLaMA
Open-source LLMs, local deployment, hype-free benchmarks.
- high · r/StableDiffusion
The largest open-source image-gen community (700k+ users). Model launches, LoRAs, ComfyUI workflows.
- high · Riley Brown
Vibe coding, AI builder workflows, Cursor + Claude tutorials.
- high · The Decoder
German AI news outlet publishing in English, good breaking news.
- high · Theo - t3.gg
TypeScript + AI dev workflows. Hot takes, narrative-driven.
- high · Yannic Kilcher
Paper reviews and deep dives into AI research.
- low · AI Weirdness
Janelle Shane — playful AI experiments, image-gen quirks. Low volume, unique perspective.
- medium · bycloud
AI papers made digestible — somewhere between Two Minute Papers and Yannic Kilcher.
- medium · Creative Bloq
Design industry — where AI is encroaching on classic graphic disciplines.
- medium · Fireship
100-second format, often AI/LLM + tech news.
- medium · fxguide
VFX and film industry — ever more AI in the pipeline. A professional perspective.
- medium · Greg Isenberg
Solo-founder vibe — builds products with AI, podcasts with indie hackers.
- medium · r/ChatGPTCoding
Vibe-coding tips, IDE setups, prompts. A mix of all models.
- medium · r/comfyui
ComfyUI workflows — custom nodes, JSON workflows, optimizations.
- medium · r/midjourney
Midjourney community — v7+ launches, style references, prompt patterns.
- medium · r/runwayml
Runway-specific community — feature launches, prompt patterns, comparisons with competitors.
- medium · r/SunoAI
Suno music-gen community — new model versions, lyric-prompting techniques. Audio AI has a weak RSS ecosystem.
- medium · Tina Huang
AI workflows for data science, practical applications.
- medium · Two Minute Papers
Short summaries of AI papers, great for a quick scan.
- medium · Wes Roth
AI news with a more clickbaity tone — the Gemini filter weeds out the hype.

ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference
ParoQuant is a new quantization method that preserves the reasoning and logic capabilities of LLMs at low bitrates better than standard techniques.
ParoQuant introduces Pairwise Rotation Quantization, a novel technique designed to minimize information loss during the compression of reasoning-heavy LLMs. Unlike standard quantization methods that often degrade complex logic chains, ParoQuant uses a pairwise approach to handle outlier weights more effectively. The release includes a dedicated GitHub repository and pre-quantized models on Hugging Face for immediate testing. This is particularly significant for users running large reasoning models on consumer hardware where VRAM is limited. Initial benchmarks suggest superior performance in maintaining Chain of Thought (CoT) coherence compared to traditional 4-bit methods.
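The post links code but doesn't reproduce the math, so as a rough illustration of the family of techniques ParoQuant belongs to: rotation-based methods spread an outlier channel's energy across a partner channel before quantization, so that one extreme value no longer dictates the quantization scale. The sketch below is a toy Givens rotation over a single column pair with naive per-tensor int4 rounding; it is not ParoQuant's actual algorithm, and every name in it is made up for the demo.

```python
import numpy as np

def givens_rotate_pair(W: np.ndarray, i: int, j: int, theta: float) -> np.ndarray:
    """Rotate weight columns i and j into each other (2x2 Givens rotation)."""
    c, s = np.cos(theta), np.sin(theta)
    wi, wj = W[:, i].copy(), W[:, j].copy()
    W[:, i] = c * wi - s * wj
    W[:, j] = s * wi + c * wj
    return W

def fake_quant_int4(W: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor 4-bit quantize-dequantize, for error measurement."""
    scale = np.abs(W).max() / 7.0
    return np.clip(np.round(W / scale), -8, 7) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 8))
W[:, 0] *= 20.0                      # one outlier channel dominates the scale

# Pairing the outlier column with a quiet neighbour and rotating by pi/4
# splits its energy across both columns, shrinking the per-tensor scale.
W_rot = givens_rotate_pair(W.copy(), 0, 1, np.pi / 4)

print("plain   4-bit error:", np.abs(W - fake_quant_int4(W)).mean())
print("rotated 4-bit error:", np.abs(W_rot - fake_quant_int4(W_rot)).mean())
# In a real network the inverse rotation must be folded into the adjacent
# layer so the model's function is unchanged; that bookkeeping is omitted here.
```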
r/LocalLLaMA·tooling·05/07/2026, 02:07 AM·/u/Total-Resort-3120
Get faster Qwen 3.6 27B
Achieve 50 t/s on Qwen 3.6 27B with 100k context on a single RTX 3090 by using MTP GGUFs and a specific llama.cpp branch.
A user on r/LocalLLaMA shared a method to significantly boost inference speeds for the Qwen 3.6 27B model on consumer hardware. By utilizing Multi-Token Prediction (MTP) GGUF files and a specific pull request for llama.cpp, they achieved speeds of 50 tokens per second on an RTX 3090. The setup involves using Q4_K_M quantization for the model and Q4_0 for the K/V cache to fit a 100k context within 19GB of VRAM. The post includes a step-by-step guide for applying the PR and the exact server configuration flags needed. It also mentions a Mac-specific installation via Homebrew for similar performance gains.
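Once a server built from that branch is running, it speaks llama.cpp's standard OpenAI-compatible API, so the claimed throughput is easy to sanity-check from Python. A minimal client sketch; the port, model name, and prompt are placeholders, not values from the post:

```python
import time
from openai import OpenAI

# llama-server exposes an OpenAI-compatible endpoint; port 8080 is the usual
# default but depends on the --port flag you started the server with.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

t0 = time.time()
resp = client.chat.completions.create(
    model="qwen3.6-27b",  # hypothetical id; use whatever the server reports
    messages=[{"role": "user", "content": "Summarize speculative decoding in two sentences."}],
    max_tokens=256,
)
elapsed = time.time() - t0

out_tokens = resp.usage.completion_tokens
print(resp.choices[0].message.content)
print(f"~{out_tokens / elapsed:.1f} tok/s over {out_tokens} tokens")
```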
r/LocalLLaMA·tooling·05/06/2026, 11:33 PM·/u/admajic
Atlas, the GB10-focused inference engine built for the community, is now open source, with breakneck inference speeds (Qwen3.6-35B-FP8 at 100+ tok/s)
Atlas is a high-performance, Rust-based open-source inference engine that delivers 3x faster speeds than vLLM on Blackwell hardware by removing Python overhead.
Atlas is a newly open-sourced inference engine written in pure Rust and CUDA, designed to bypass the performance bottlenecks of the standard Python/PyTorch stack. Optimized for NVIDIA Blackwell (GB10) architecture, it achieves over 100 tokens per second on Qwen3.6-35B models using NVFP4 precision and Multi-Token Prediction (MTP). The engine features a lightweight 2.5GB Docker image with sub-2-minute cold starts and provides native support for OpenAI and Anthropic API formats. By rewriting the stack from HTTP handlers to kernel dispatch, the developers claim a 3x throughput increase over vLLM. Future updates aim to bring these optimizations to AMD Strix Halo and RTX 6000 Blackwell hardware.
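If the Anthropic-format support works as advertised, the stock anthropic SDK should be able to target a local Atlas instance with nothing more than a base_url override. A hedged sketch, with the port and model id assumed rather than taken from the project's docs:

```python
from anthropic import Anthropic

# Point the standard Anthropic client at a local Atlas server. Port 8000 and
# the model id below are assumptions for illustration, not documented values.
client = Anthropic(base_url="http://localhost:8000", api_key="local")

msg = client.messages.create(
    model="qwen3.6-35b-fp8",  # hypothetical id; use whatever Atlas registers
    max_tokens=512,
    messages=[{"role": "user", "content": "Write a Rust hello world."}],
)
print(msg.content[0].text)
```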
r/LocalLLaMA·tooling·05/06/2026, 08:36 PM·/u/Live-Possession-6726
vLLM V0 to V1: Correctness Before Corrections in RL
vLLM V1 is a major upgrade optimized for RL and reasoning models, focusing on output correctness and significantly better inference performance.
vLLM is transitioning from V0 to V1, marking a major architectural overhaul focused on Reinforcement Learning (RL) workflows. The update emphasizes a 'Correctness Before Corrections' philosophy, addressing the critical need for high-fidelity outputs in complex reasoning tasks. This shift is particularly relevant for serving modern models like DeepSeek-R1 that rely on long-chain reasoning and RL-based optimization. The new version aims to significantly reduce overhead and improve throughput while maintaining strict output validation. It represents a move towards more robust, production-ready inference for the next generation of agentic and reasoning LLMs.
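For readers who haven't used vLLM's offline API, a minimal run has roughly this shape. The engine toggle is an assumption: during the V0-to-V1 transition the V1 engine was opt-in via an environment variable, and newer builds may already default to it. The model id is just a public reasoning model used as an example.

```python
import os
# Must be set before vllm is imported; a no-op on builds where V1 is default.
os.environ.setdefault("VLLM_USE_V1", "1")

from vllm import LLM, SamplingParams

# Example long-chain reasoning model; swap in whatever you actually serve.
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
params = SamplingParams(temperature=0.6, max_tokens=1024)

outputs = llm.generate(["Prove that the sum of two even numbers is even."], params)
print(outputs[0].outputs[0].text)
```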
Hugging Face Blog·tooling·05/06/2026, 07:06 PM
2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints
Run Qwen 3.6 27B locally with 2.5x speedup (up to 28 tok/s) using new MTP support in llama.cpp and optimized GGUF quants.
A new optimization for Qwen 3.6 27B leverages Multi-Token Prediction (MTP) via a llama.cpp Pull Request to achieve 2.5x faster inference. User /u/ex-arman68 shared custom GGUF quants that include fixed chat templates and support for massive context windows, reaching up to 262k on 48GB RAM using q4_0 KV cache compression. The setup requires compiling a specific experimental branch of llama.cpp but delivers approximately 28 tokens per second on Apple Silicon. Detailed hardware recommendations for both Mac and NVIDIA users are provided, covering various RAM configurations from 16GB to 80GB. Note that vision capabilities currently conflict with MTP in this experimental build.
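A quick back-of-the-envelope calculation shows why cache quantization is the lever here. The layer and head counts below are hypothetical placeholders (the post doesn't state Qwen 3.6 27B's architecture), but the shape of the arithmetic holds:

```python
def kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # 2x for the K and V tensors, one entry per layer per position.
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 2**30

ctx = 262_144
layers, kv_heads, head_dim = 60, 8, 128  # assumed GQA-style config

print(f"fp16 cache: {kv_cache_gib(ctx, layers, kv_heads, head_dim, 2.0):.1f} GiB")
print(f"q4_0 cache: {kv_cache_gib(ctx, layers, kv_heads, head_dim, 0.5625):.1f} GiB")
# q4_0 packs 32 values into 18 bytes (nibbles + an fp16 scale per block),
# i.e. ~0.5625 bytes/element: roughly 17 GiB here versus ~60 GiB at fp16,
# which is the difference between fitting in 48GB next to the weights or not.
```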
r/LocalLLaMA·tooling·05/06/2026, 09:35 AM·/u/ex-arman68
Dense Model Shoot-Off: Gemma 4 31B vs Qwen 3.6/3.5 27B... The Result: Slower Is Faster.
Gemma 4 31B proves that token efficiency beats raw speed: it completes tasks faster than Qwen 3.6 by being smarter with every token generated.
A performance comparison between Google's Gemma 4 31B and Alibaba's Qwen 3.6/3.5 27B highlights a critical distinction between raw inference speed and task completion time. While Qwen models often achieve higher scores on synthetic benchmarks, Gemma 4 demonstrates superior token efficiency, requiring fewer tokens to generate accurate responses. This creates a 'slower is faster' scenario where Gemma, despite having lower tokens-per-second due to its larger size, finishes complex tasks more quickly than its competitors. The analysis suggests that Qwen may be 'benchmaxxed'—optimized specifically for test scores—whereas Gemma offers higher intelligence density for real-world use. Local LLM enthusiasts are now looking forward to further optimizations like DFlash and MTP to enhance Gemma's performance.
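The underlying arithmetic is simple enough to sketch: wall-clock completion time is tokens generated divided by throughput, so a model that needs far fewer tokens can finish first despite a lower tokens-per-second figure. The numbers below are invented for illustration, not measurements from the post:

```python
# Toy "slower is faster" comparison: invented throughput and token counts.
tasks = {
    "Gemma 4 31B (efficient)": {"tok_s": 22, "tokens_used": 900},
    "Qwen 3.6 27B (verbose)":  {"tok_s": 35, "tokens_used": 2200},
}
for name, t in tasks.items():
    seconds = t["tokens_used"] / t["tok_s"]
    print(f"{name}: {seconds:.0f} s to finish the task")
# -> ~41 s vs ~63 s: the lower-throughput model completes first.
```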
r/LocalLLaMA·news·05/05/2026, 06:12 PM·/u/MiaBchDave
Supercharging LLM inference on Google TPUs: Achieving 3X speedups with diffusion-style speculative decoding - Google Developers Blog
Google achieved a 3X speedup in LLM inference on TPUs by using a new 'diffusion-style' parallel token drafting technique.
Google researchers have introduced a novel approach to speculative decoding inspired by diffusion models, specifically optimized for TPU architectures. Traditional speculative decoding relies on a smaller draft model to predict tokens sequentially, but this new method generates multiple draft tokens in parallel, similar to how diffusion models refine images. This shift addresses the memory bandwidth bottlenecks common in LLM inference, resulting in up to 3X faster generation speeds. While the benchmarks focus on Google's proprietary hardware, the move toward non-autoregressive drafting represents a significant evolution in inference strategy. This technique could eventually influence local model optimization if adapted for consumer GPUs.
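For intuition, here is the generic verify-and-accept skeleton that speculative decoding variants share, written as a hedged sketch: a drafter proposes k tokens, the target model scores the extended sequence in one parallel pass, and the longest agreeing prefix is kept. Google's contribution lies in how the draft itself is produced (in parallel, diffusion-style, rather than token by token), which this toy greedy version does not reproduce; draft_fn and target_argmax_fn are hypothetical stand-ins for real model calls.

```python
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_fn: Callable[[List[int], int], List[int]],     # proposes k draft tokens
    target_argmax_fn: Callable[[List[int]], List[int]],  # target's greedy next token at each position
    k: int = 4,
) -> List[int]:
    """One round of greedy speculative decoding: draft, verify, accept."""
    draft = draft_fn(prefix, k)
    # A single target forward pass over prefix+draft yields, for every
    # position, the token the target itself would have emitted next.
    target_next = target_argmax_fn(prefix + draft)  # length == len(prefix) + k
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        predicted = target_next[len(prefix) + i - 1]  # target's choice before draft[i]
        if predicted == tok:
            accepted.append(tok)        # target agrees; keep the draft token
        else:
            accepted.append(predicted)  # take the target's token instead
            break                       # and discard the rest of the draft
    return prefix + accepted
```

The payoff is that one expensive target pass can validate several cheap draft tokens, trading a little wasted draft compute for fewer sequential target steps; parallel drafting attacks the remaining sequential bottleneck on the draft side.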
r/LocalLLaMA·news·05/05/2026, 03:50 PM·/u/eternviking
Relevance auto-scored by LLM (0–10). List shows top 30 from the last 7 days.