AI pulse last 7 days
Daily AI pulse from YouTube, blogs, Reddit, HN. Ruthlessly filtered.
Sources (41)
- critical · Andrej Karpathy
Former AI director at Tesla, OpenAI co-founder. Every video is gold.
- critical · Anthropic
Official Anthropic channel. Every Claude release.
- critical · ComfyUI Blog
Release log for ComfyUI integrations: Luma Uni-1, GPT Image 2, ACE-Step music gen, Seedance. Covers video + image + music + workflow.
- critical · OpenAI Blog
Official OpenAI blog. All releases.
- critical · Simon Willison's Weblog
The best AI 'thinker'. Daily posts, deep insights, low hype rate.
- high · AI Explained
Deep analysis of papers and benchmarks, low hype rate.
- high · AI Jason
Practical tutorials on Claude Code, MCP, and vibe coding workflows.
- high · Ben's Bites
Daily AI digest, creator-friendly tone. Codex, model releases, agentic AI.
- high · Cole Medin
Vibe coding + agentic workflows + Claude Code MCP integrations.
- high · Fal AI Blog
Fal hosts most new AI image/video models; their blog is an early signal for launches.
- high · HN: 3D & Gaussian Splatting
HN signal for generative 3D: Gaussian Splatting, NeRF, image-to-3D. Threshold of 20 points because it's a niche category (historic top: 182 pts).
- high · HN: AI agents / MCP
HN posts about agents, MCP, and vibe coding with at least 100 points.
- high · HN: Claude / Anthropic
HN posts mentioning 'Claude' or 'Anthropic' with at least 100 points.
- high · Hugging Face Blog
Releases for image, video, audio, and 3D models. Partly tech-heavy; the Gemini relevance filter removes the noise. Downgraded from critical: too much volume for 'must-read' status.
- high · IndyDevDan
Claude Code power user, prompts, hooks.
- high · Interconnects (Nathan Lambert)
AI policy + research analysis. Low hype rate, opinionated.
- high · Latent Space
Swyx's podcast + blog: founder interviews and engineering deep dives.
- high · Matt Wolfe
Comprehensive AI tools weekly digest. ~700K subs.
- high · Matthew Berman
AI news, model release reviews, agent demos. High output.
- high · r/aivideo
AI video community: Sora, Veo, Runway, Kling, LTX. What actually surprises creators.
- high · r/ClaudeAI
The Claude community: power users, tips, problems.
- high · r/LocalLLaMA
Open-source LLMs, local inference, benchmarks without the hype.
- high · r/StableDiffusion
The largest open-source image gen community (700k+ users). Model launches, LoRAs, ComfyUI workflows.
- high · Riley Brown
Vibe coding, AI builder workflows, Cursor + Claude tutorials.
- high · The Decoder
German AI news outlet in English, good breaking news.
- high · Theo - t3.gg
TypeScript + AI dev workflows. Hot takes, narrative-driven.
- high · Yannic Kilcher
Paper reviews and deep dives into AI research.
- low · AI Weirdness
Janelle Shane: playful AI experiments, image gen quirks. Low volume, unique perspective.
- medium · bycloud
AI papers made digestible: somewhere between Two Minute Papers and Yannic Kilcher.
- medium · Creative Bloq
Design industry: where AI is pushing into the classic graphic design disciplines.
- medium · Fireship
100-sec format, often AI/LLM + tech news.
- medium · fxguide
VFX and film industry, with ever more AI in the pipeline. A professional perspective.
- medium · Greg Isenberg
Solo founder vibe: builds products with AI, podcasts with indie hackers.
- medium · r/ChatGPTCoding
Vibe coding tips, IDE setups, prompts. A mix of all models.
- medium · r/comfyui
ComfyUI workflows: custom nodes, JSON workflows, optimizations.
- medium · r/midjourney
Midjourney community: v7+ launches, style references, prompt patterns.
- medium · r/runwayml
Runway-specific community: feature launches, prompt patterns, comparisons with competitors.
- medium · r/SunoAI
Suno music gen community: new model versions, lyric prompting techniques. Audio AI has a weak RSS ecosystem.
- medium · Tina Huang
AI workflows for data science, practical applications.
- medium · Two Minute Papers
Short summaries of AI papers, great for a quick scan.
- medium · Wes Roth
AI news with a more clickbait tone; the Gemini filter sifts out the hype.
why llama.cpp can’t combine speculative decode methods?
Users are seeking to combine MTP and ngram speculative decoding in llama.cpp to maximize speed in coding tasks, but the current implementation limits them to one method at a time.
A technical discussion on r/LocalLLaMA highlights a current limitation in llama.cpp regarding speculative decoding methods. A user testing Qwen 3.6 27B with Multi-Token Prediction (MTP) found that while MTP is effective, combining it with ngram speculation would be ideal for agentic coding. Ngram is particularly fast at predicting repeated code blocks, which occur frequently during file edits. Currently, llama.cpp only supports one speculative method at a time via command-line arguments. The community is exploring whether this is a fundamental architectural constraint or a temporary implementation hurdle that could be resolved to further boost local inference speeds.
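For intuition, here is a toy sketch of what an ngram drafter does: when the last few tokens already appeared earlier in the context, it proposes whatever followed them last time, and the main model only verifies. This is an illustration of the general idea, not llama.cpp's actual implementation; token IDs and window sizes are arbitrary.

```python
def ngram_draft(context: list[int], n: int = 3, max_draft: int = 8) -> list[int]:
    """Toy ngram drafter: if the last n tokens already occurred earlier in the
    context, propose the tokens that followed that earlier occurrence."""
    if len(context) < n:
        return []
    tail = context[-n:]
    # Scan backwards for a previous occurrence of the current n-gram.
    for i in range(len(context) - n - 1, -1, -1):
        if context[i:i + n] == tail:
            return context[i + n:i + n + max_draft]  # draft = what followed last time
    return []

# Example: an "edit this file" style context where a token sequence repeats.
ctx = [5, 7, 9, 11, 13, 5, 7, 9]
print(ngram_draft(ctx, n=3, max_draft=2))  # -> [11, 13]
```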
r/LocalLLaMA·tooling·05/07/2026, 07:53 AM·/u/Qwoctopussy
ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference
ParoQuant is a new quantization method that preserves the reasoning and logic capabilities of LLMs at low bitrates better than standard techniques.
ParoQuant introduces Pairwise Rotation Quantization, a novel technique designed to minimize information loss during the compression of reasoning-heavy LLMs. Unlike standard quantization methods that often degrade complex logic chains, ParoQuant uses a pairwise approach to handle outlier weights more effectively. The release includes a dedicated GitHub repository and pre-quantized models on HuggingFace for immediate testing. This is particularly significant for users running large reasoning models on consumer hardware where VRAM is limited. Initial benchmarks suggest superior performance in maintaining Chain of Thought (CoT) coherence compared to traditional 4-bit methods.
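Very loosely, the trick behind rotation-based quantization is to rotate a pair of channels so an outlier's magnitude is shared between two coordinates before rounding to a low-bit grid. The NumPy toy below illustrates only that general intuition; it is not the ParoQuant algorithm, and the pairing, angle, and 4-bit grid are made up.

```python
import numpy as np

def fake_quant(x, bits=4):
    """Symmetric round-to-nearest on a uniform per-row grid."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(1, 8))
w[0, 3] = 12.0  # a single outlier channel forces a coarse quantization grid

# 45-degree Givens rotation mixing the outlier channel with a partner channel,
# so its magnitude is split across two coordinates before quantization.
c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
R = np.eye(8)
R[np.ix_([3, 4], [3, 4])] = [[c, -s], [s, c]]

plain_err = np.abs(w - fake_quant(w)).mean()
rot_err   = np.abs(w - fake_quant(w @ R) @ R.T).mean()  # rotate, quantize, undo rotation
print(f"plain RTN error:      {plain_err:.4f}")
print(f"pairwise-rotated err: {rot_err:.4f}")
```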
r/LocalLLaMA·tooling·05/07/2026, 02:07 AM·/u/Total-Resort-3120
Need advice on hardware purchasing decision: RTX 5090 vs. M5 Max 128GB for agentic software development
Choosing between Nvidia and Apple for local AI coding: RTX 5090 wins on raw speed for fast iterations, while M5 Max wins on memory capacity for massive codebases.
This discussion evaluates the trade-offs between the RTX 5090 and M5 Max (128GB) for local agentic software development using models like Qwen 3.6 27B. The RTX 5090 provides approximately 3x faster token generation, which is vital for rapid code iteration, but its 32GB VRAM limits context windows and quantization levels (Q4/Q5). Conversely, the M5 Max's 128GB of unified memory supports massive context and higher precision models, though at significantly lower speeds. The author considers a multi-agent setup where a high-level orchestrator manages faster sub-agents for codebase exploration. Technical factors like Multi-Token Prediction (MTP) and MLX optimizations are highlighted as potential game-changers for Apple Silicon's usability in agentic workflows.
r/LocalLLaMA·tooling·05/07/2026, 12:34 AM·/u/BawbbySmith
Great results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot
A complete, reproducible configuration for running Qwen 3.6-35B locally in VS Code, achieving ~100 t/s for high-quality coding tasks on consumer hardware.
A user on r/LocalLLaMA shared a highly successful local coding setup using the Qwen 3.6-35B model (MoE architecture) via llama.cpp on an AMD R9700 GPU. The post includes the exact startup command for the Vulkan server, a VS Code chatLanguageModels.json configuration, and a complex React/TypeScript prompt that generated a fully functional website. Performance metrics show generation speeds of ~100 tokens/second, though large 38k token prompts cause a 17-second prefill delay. The setup utilizes context checkpointing and flash attention to maintain efficiency. This serves as a practical blueprint for developers looking to replace paid coding assistants with local LLMs.
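Since llama.cpp's llama-server exposes an OpenAI-compatible /v1 API, any client speaking that protocol can target it. A minimal sketch, assuming the server runs on the default port 8080; the model name and prompt are placeholders, and the post's chatLanguageModels.json is not reproduced here.

```python
from openai import OpenAI

# llama-server ignores the API key, but the client library requires a value.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

resp = client.chat.completions.create(
    model="qwen3.6-35b",  # whatever model name the local server accepts
    messages=[{"role": "user", "content": "Write a React hook that debounces a value."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
```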
r/LocalLLaMA·tooling·05/06/2026, 08:47 PM·/u/supracode
Has anyone tried Zyphra 1 - 8B MoE?
Zyphra released ZAYA1-8B, a reasoning MoE that uses less than 1B active parameters to deliver high-end math and logic performance on local hardware.
Zyphra has announced the release of ZAYA1-8B, a new Mixture of Experts (MoE) model focused on reasoning and intelligence density. Despite having 8 billion total parameters, it utilizes fewer than 1 billion active parameters during inference, making it exceptionally efficient for local deployment. The developers claim it outperforms much larger open-weight models in mathematics and logic benchmarks. Notably, the model was trained using AMD hardware and leverages test-time compute to narrow the gap with frontier models like DeepSeek-V3.2. This release highlights a trend toward hyper-efficient, specialized reasoning models that prioritize logic over raw parameter count.
r/LocalLLaMA·model_release·05/06/2026, 08:39 PM·/u/appakaradi
Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM
You can now run Qwen3.6 27B with a massive 200k context window on a single RTX 5090 using NVFP4 quantization and vLLM.
A user successfully ran Qwen3.6 27B on a single RTX 5090 with 32GB VRAM, achieving a stable 200k context window. The setup utilizes NVFP4 quantization via the compressed-tensors library and vLLM's MTP (Multi-Token Prediction) for speculative decoding. Benchmarks show generation speeds between 65-75 tokens/second at 200k context, with TTFT (Time To First Token) dropping significantly when using prefix caching. This configuration demonstrates the potential of Blackwell's FP4 support for handling large-scale local inference. The author provides exact vLLM parameters and stability data for others to replicate the results on consumer hardware.
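For reference, the shape of a long-context vLLM setup on a single GPU looks roughly like the sketch below. This is a generic illustration: the checkpoint path is a placeholder, and the post's exact NVFP4 and MTP speculative-decoding flags are not reproduced.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/qwen3.6-27b-nvfp4",  # placeholder path to a compressed-tensors quant
    max_model_len=200_000,              # the 200k context window from the post
    gpu_memory_utilization=0.95,        # leave just enough headroom on the 32 GB card
    enable_prefix_caching=True,         # big TTFT win on repeated long prefixes
)

out = llm.generate(
    ["Summarize the repository layout described above."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(out[0].outputs[0].text)
```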
r/LocalLLaMA·tooling·05/06/2026, 02:05 PM·/u/Maheidem
Decoupled Attention from Weights - Gemma 4 26B
Run massive models like Gemma 4 26B by splitting attention and weights across multiple cheap local machines, bypassing single-GPU VRAM limits.
Larql introduces a method to decouple attention mechanisms from model weights, specifically demonstrated with Gemma 4 26B. This approach allows users to split the memory load across multiple local machines, keeping the attention mechanism on a primary device while offloading the massive weight matrices to a secondary, cheaper server like an old Xeon. This effectively bypasses the VRAM bottleneck that typically limits local LLM performance and model size. The repository includes functional code to implement this distributed inference strategy. It represents a significant shift for home lab enthusiasts who want to run large-scale models without investing in high-end enterprise GPUs.
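A toy torch sketch of the general split (not Larql's code): the attention block stays on the primary device while the weight-heavy MLP lives on a second, cheaper device, with activations shuttled between them each layer.

```python
import torch
import torch.nn as nn

fast = "cuda" if torch.cuda.is_available() else "cpu"  # primary device
slow = "cpu"                                            # stand-in for the old Xeon box

d = 512
attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True).to(fast)
mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d)).to(slow)

x = torch.randn(1, 128, d, device=fast)
h, _ = attn(x, x, x)           # attention runs on the primary device
h = mlp(h.to(slow)).to(fast)   # the big weight matrices never leave the cheap machine
print(h.shape)                 # torch.Size([1, 128, 512])
```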
r/LocalLLaMA·tooling·05/06/2026, 11:56 AM·/u/yeah-ok
Protip if you want to squeeze most out of your VRAM if you have a CPU with iGPU
Free up hundreds of MBs of VRAM for your models by plugging your monitor into the motherboard and using your iGPU for the OS display.
This practical tip for local LLM enthusiasts explains how to maximize available VRAM on dedicated GPUs by offloading system tasks. By enabling the integrated GPU (iGPU) in the BIOS and connecting the display cable directly to the motherboard, the system uses the iGPU for GUI rendering instead of the primary graphics card. This simple hardware adjustment can reclaim several hundred megabytes of VRAM, which is often critical when trying to fit a specific model or a larger context window into memory. The method is especially effective for users on Windows or Linux distributions with a desktop environment. It offers a straightforward way to optimize hardware resources without needing complex software tweaks.
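To see how much you actually reclaimed, compare free VRAM before and after moving the display to the iGPU; a quick check, assuming an NVIDIA card and a working CUDA install:

```python
import torch

free, total = torch.cuda.mem_get_info()  # bytes on the current CUDA device
print(f"free VRAM: {free / 2**20:.0f} MiB of {total / 2**20:.0f} MiB")
```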
r/LocalLLaMA·tutorial·05/06/2026, 11:35 AM·/u/Th3Sim0n
Bad news: Apple drops high-memory Mac Studio configs
Apple has capped Mac Studio RAM at 96GB, removing the 256GB/512GB options that were essential for running the largest local LLMs efficiently.
Apple has quietly discontinued high-memory configurations for the Mac Studio, removing the 256GB and 512GB RAM options. The M3 Ultra Mac Studio is now capped at 96GB of unified memory, while the Mac mini remains limited to 48GB. This shift is reportedly due to supply chain constraints and rising production costs for high-capacity memory chips. For the local LLM community, this is a major blow, as these machines were the most cost-effective way to run massive models like Qwen 397B on a single device. Future users needing high VRAM equivalents will now have to look toward the secondary market or far more expensive enterprise hardware.
r/LocalLLaMA·news·05/06/2026, 11:13 AM·/u/jzn21
2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints
Run Qwen 3.6 27B locally with 2.5x speedup (up to 28 tok/s) using new MTP support in llama.cpp and optimized GGUF quants.
A new optimization for Qwen 3.6 27B leverages Multi-Token Prediction (MTP) via a llama.cpp Pull Request to achieve 2.5x faster inference. User /u/ex-arman68 shared custom GGUF quants that include fixed chat templates and support for massive context windows, reaching up to 262k on 48GB RAM using q4_0 KV cache compression. The setup requires compiling a specific experimental branch of llama.cpp but delivers approximately 28 tokens per second on Apple Silicon. Detailed hardware recommendations for both Mac and NVIDIA users are provided, covering various RAM configurations from 16GB to 80GB. Note that vision capabilities currently conflict with MTP in this experimental build.
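As a rough sanity check on the 262k-context claim, KV cache size grows linearly with context length, layer count, and KV-head width. The back-of-the-envelope calculation below uses assumed architecture numbers (Qwen 3.6 27B's real config is not given in the post) and treats q4_0 as roughly 0.56 bytes per element including block scales.

```python
# Assumed architecture numbers, not Qwen 3.6 27B's published config.
n_layers   = 48
n_kv_heads = 8        # GQA
head_dim   = 128
ctx        = 262_144
bytes_per  = 18 / 32  # q4_0: 32 four-bit values + fp16 scale per block ≈ 0.5625 B/elem

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per  # K and V
print(f"KV cache ≈ {kv_bytes / 2**30:.1f} GiB")  # ≈ 13.5 GiB with these assumptions
```

With these assumptions, the cache plus 4-5 bit weights for a 27B model fits comfortably inside a 48GB budget.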
r/LocalLLaMA·tooling·05/06/2026, 09:35 AM·/u/ex-arman68
Qwen 3.6 27b Q4.0 MTP GGUF
Multi-Token Prediction (MTP) allows running a 27b model at the speed of a 9b model on integrated GPUs using llama.cpp.
A user report on r/LocalLLaMA highlights the performance benefits of Multi-Token Prediction (MTP) for the Qwen 3.6 27b model. Using the Q4.0 GGUF quantization in llama.cpp, the 27b model achieves inference speeds comparable to the smaller 9b Qwen 3.5 model. This test was conducted on an AMD iGPU with 64GB of unified memory, demonstrating that MTP significantly lowers the hardware barrier for running larger models locally. The results suggest that MTP is a viable path for making mid-sized models feel as responsive as small models on consumer-grade integrated graphics.
r/LocalLLaMA·tooling·05/06/2026, 03:01 AM·/u/Available_Hornet3538
Qwen 3.6 27B MTP on v100 32GB: 54 t/s
Multi-Token Prediction (MTP) nearly doubles inference speed for Qwen 3.6 27B on older V100 hardware, making it a highly viable local coding assistant.
A user report demonstrates a significant performance boost for Qwen 3.6 27B using Multi-Token Prediction (MTP) on a Tesla V100 32GB GPU. By utilizing a specific MTP branch of llama.cpp, inference speeds jumped from approximately 30 t/s to 54 t/s, nearly doubling the output rate. The setup utilized a q8_0 KV cache and supported a 200k context limit, effectively serving as a high-speed VS Code Copilot replacement. While performance dipped slightly to 40-45 t/s at higher context depths (50k+ tokens), the model remained highly effective for complex tasks like tool calls and code refactoring. This highlights the potential of MTP to extend the lifecycle of older enterprise hardware for modern local LLM workloads.
r/LocalLLaMA·tooling·05/06/2026, 02:18 AM·/u/m94301
Bleeding Llama: Critical Unauthenticated Memory Leak in Ollama
If you use Ollama, update it immediately to the latest version to prevent a critical memory leak that could expose your private data to remote attackers.
A critical security vulnerability dubbed "Bleeding Llama" has been discovered in Ollama, the most popular tool for running local LLMs. This unauthenticated memory leak allows remote attackers to extract sensitive information directly from the host's RAM without any credentials. The flaw stems from improper handling of specific API requests, potentially exposing user prompts, model weights, or system environment variables. Security researchers at Cyera identified the issue, emphasizing the extreme risk of exposing Ollama instances to the public internet. Users are urged to update to the latest version immediately and ensure their instances are behind a firewall or VPN.
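A quick way to confirm which build you are running is Ollama's version endpoint on its default port 11434 (a sketch; the post does not state the exact patched version number, so compare against the latest release notes):

```python
import json
import urllib.request

# Query the local Ollama daemon; if this answers on a public interface too,
# bind OLLAMA_HOST to 127.0.0.1 or put the box behind a firewall/VPN.
with urllib.request.urlopen("http://127.0.0.1:11434/api/version", timeout=5) as r:
    print("local Ollama version:", json.load(r)["version"])
```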
r/LocalLLaMA·tooling·05/06/2026, 02:02 AM·/u/exintrovert420
Claude Code @ Opus 4.7 vs OpenCode @ qwen3.6:27b. Both shipped a playable cozy roguelite.
Local models like Qwen 3.6:27b have reached parity with top-tier Claude models for building and shipping entire playable games.
A direct comparison between Anthropic's Claude Code (running Opus 4.7) and the open-source OpenCode (using Qwen 3.6:27b) reveals that local models are closing the gap in complex software development. Both agents successfully generated a fully playable 'cozy roguelite' game, managing game logic, state, and basic assets. While Opus 4.7 produced slightly more optimized and cleaner code architecture, the Qwen-based local setup demonstrated that high-tier coding capabilities are no longer exclusive to proprietary cloud APIs. This benchmark is significant for developers prioritizing privacy and cost-efficiency, as a 27b parameter local model can now handle end-to-end project shipping.
r/LocalLLaMA·tooling·05/05/2026, 10:58 PM·/u/rm-rf-rm
DeepSeek V4 being 17x cheaper got me to actually measure what I send to cloud vs what I could run locally. the results are stupid.
Most coding tasks don't need expensive cloud models; routing simple tasks to a local LLM can cut your API bill by 75% without losing quality.
A developer conducted a 10-day experiment comparing a local Qwen 3.6 27b model (running on an RTX 3090) against frontier cloud models like GPT-5.2. The analysis revealed that 65% of daily coding tasks, such as project scanning and boilerplate generation, performed identically on local hardware. For debugging with multi-file context, local models reached 61% accuracy, while complex architecture decisions still required cloud intervention, representing only 15% of total tasks. By implementing a task-routing strategy, the author reduced their monthly API costs from $85 to $22. This case study suggests that, for routine work, the performance difference rarely justifies paying cloud prices.
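The routing idea boils down to a cheap classifier sitting in front of two OpenAI-compatible clients. A minimal sketch of the pattern; the endpoints, model names, and the heuristic are illustrative, not the author's actual setup.

```python
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8080/v1", api_key="local")  # llama.cpp / vLLM
cloud = OpenAI()                                                      # uses OPENAI_API_KEY

def route(task: str, files_touched: int) -> str:
    """Crude heuristic: boilerplate and single-file work stays local,
    multi-file architecture decisions go to the frontier model."""
    hard = files_touched > 3 or any(
        k in task.lower() for k in ("architecture", "design", "migrate")
    )
    client, model = (cloud, "gpt-5.2") if hard else (local, "qwen3.6-27b")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": task}],
    )
    return resp.choices[0].message.content

print(route("Generate a pytest fixture for the database session.", files_touched=1))
```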
r/LocalLLaMA·tooling·05/05/2026, 08:55 PM·/u/spencer_kw
Dense Model Shoot-Off: Gemma 4 31B vs Qwen3.6/5 27B... Result is Slower is Faster.
Gemma 4 31B proves that token efficiency beats raw speed: it completes tasks faster than Qwen 3.6 by being smarter with every token generated.
A performance comparison between Google's Gemma 4 31B and Alibaba's Qwen 3.6/3.5 27B highlights a critical distinction between raw inference speed and task completion time. While Qwen models often achieve higher scores on synthetic benchmarks, Gemma 4 demonstrates superior token efficiency, requiring fewer tokens to generate accurate responses. This creates a 'slower is faster' scenario where Gemma, despite having lower tokens-per-second due to its larger size, finishes complex tasks more quickly than its competitors. The analysis suggests that Qwen may be 'benchmaxxed' (optimized specifically for test scores), whereas Gemma offers higher intelligence density for real-world use. Local LLM enthusiasts are now looking forward to further optimizations like DFlash and MTP to enhance Gemma's performance.
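The 'slower is faster' effect is plain arithmetic: wall-clock time is tokens needed divided by tokens per second. A worked example with made-up numbers:

```python
# Illustrative numbers only. A model that answers in fewer tokens can finish
# first despite lower raw throughput.
models = {
    "Gemma 4 31B (dense, token-efficient)": {"tok_per_sec": 22, "tokens_needed": 600},
    "Qwen 3.6 27B (faster, more verbose)":  {"tok_per_sec": 35, "tokens_needed": 1400},
}
for name, m in models.items():
    print(f"{name}: {m['tokens_needed'] / m['tok_per_sec']:.0f} s to finish the task")
```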
r/LocalLLaMA·news·05/05/2026, 06:12 PM·/u/MiaBchDave
Gemma 4 MTP released
Google released MTP draft models for Gemma 4, enabling up to 2x faster generation through speculative decoding without sacrificing output quality.
Google has officially released Multi-Token Prediction (MTP) draft models for the Gemma 4 family, including the 31B and various MoE variants. MTP works by pairing the base model with a smaller, faster draft model that predicts multiple tokens ahead. These predictions are then verified in parallel by the main model using a Speculative Decoding pipeline. This approach achieves up to a 2x inference speedup, which is critical for local and on-device deployments. The release includes specialized draft checkpoints on Hugging Face tuned to assist the main Gemma 4 weights. Crucially, the final output remains identical to standard generation, offering a significant performance boost for supported hardware and software stacks without sacrificing quality.
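Schematically, the draft-and-verify loop behind MTP and speculative decoding looks like the sketch below (a toy greedy version, not Google's implementation; draft_model and main_model stand in for next-token functions):

```python
def speculative_step(main_model, draft_model, tokens, k=4):
    """Draft k tokens with the small model, keep only the prefix the main model
    would also have produced greedily, then emit one token from the main model.
    The accepted output is identical to plain greedy decoding."""
    draft, ctx = [], list(tokens)
    for _ in range(k):               # cheap: k calls to the small draft model
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    accepted, ctx = [], list(tokens)
    for t in draft:                  # a real engine verifies these in one batched pass
        if main_model(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    accepted.append(main_model(ctx))
    return accepted
```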
r/LocalLLaMA·model_release·05/05/2026, 04:01 PM·/u/rerri
Use Qwen3.6 right way -> send it to pi coding agent and forget
Combine Qwen 3.6 with the pi.dev agent and Exa search to create a local coding and research powerhouse that rivals Perplexity.
A user on r/LocalLLaMA shares a highly effective local workflow centered around the Qwen 3.6 35B model. By integrating the model with the pi.dev coding agent, Exa web search, and browser extensions, they claim to have automated 80% of their coding and system administration tasks. The setup excels in Python, Rust, and C++, while also serving as a viable, high-quality replacement for Perplexity in web research. For complex logic, the user delegates planning to Kimi 2.6 while leaving the execution to Qwen. This highlights the growing importance of the 'harness' or interface in maximizing LLM performance.
r/LocalLLaMA·tooling·05/05/2026, 03:53 PM·/u/Willing-Toe1942
Heretic 1.3 released: Reproducible models, integrated benchmarking system, reduced peak VRAM usage, broader model support, and more
Heretic 1.3 brings byte-for-byte reproducibility to model abliteration, integrated benchmarking, and lower VRAM requirements for processing large models like Qwen 3.5.
Heretic 1.3, the leading tool for LLM abliteration (decensoring), introduces several major technical updates focused on transparency and efficiency. The headline feature is a reproducibility system that allows users to generate byte-for-byte identical models by capturing environment metadata, including GPU drivers and library versions. A new integrated benchmarking suite based on lm-evaluation-harness enables running MMLU and GSM8K tests directly within the tool to verify model quality. Additionally, peak VRAM usage has been significantly reduced, and support has been expanded to include latest-generation architectures like Qwen 3.5 and Gemma 4. This release solidifies Heretic's position as a professional-grade utility for the local LLM community.
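The environment fingerprint that makes byte-for-byte reproduction possible is easy to capture yourself; a small sketch of the kind of metadata involved (fields are illustrative, not Heretic's actual schema):

```python
import json
import platform

import torch

fingerprint = {
    "python": platform.python_version(),
    "os": platform.platform(),
    "torch": torch.__version__,
    "cuda": torch.version.cuda,
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    "cudnn": torch.backends.cudnn.version() if torch.cuda.is_available() else None,
}
print(json.dumps(fingerprint, indent=2))  # store this next to the abliterated model
```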
r/LocalLLaMA·tooling·05/05/2026, 02:57 PM·/u/-p-e-w-
Relevance auto-scored by LLM (0–10). List shows top 30 from the last 7 days.