AI pulse: last 7 days
Daily AI pulse from YouTube, blogs, Reddit, HN. Ruthlessly filtered.
Sources (41)
- critical · Andrej Karpathy
Former Tesla AI director, OpenAI cofounder. Every video is gold.
- critical · Anthropic
Anthropic's official channel. Every Claude release.
- critical · ComfyUI Blog
Release log for ComfyUI integrations: Luma Uni-1, GPT Image 2, ACE-Step music gen, Seedance. Covers video + image + music + workflow.
- critical · OpenAI Blog
OpenAI's official blog. All releases.
- critical · Simon Willison's Weblog
The best AI 'thinker'. Daily posts, deep insights, low hype rate.
- high · AI Explained
Deep analysis of papers and benchmarks, low hype rate.
- high · AI Jason
Practical tutorials on Claude Code, MCP, and vibe-coding workflows.
- high · Ben's Bites
Daily AI digest, creator-friendly tone. Codex, model releases, agentic AI.
- high · Cole Medin
Vibe coding + agentic workflows + Claude Code MCP integrations.
- high · Fal AI Blog
Fal hosts most of the new AI image/video models; their blog gives early signals of launches.
- high · HN: 3D & Gaussian Splatting
HN signal for generative 3D: Gaussian Splatting, NeRF, image-to-3D. Threshold of 20 points because it is a niche category (historic top: 182 pts).
- high · HN: AI agents / MCP
HN posts about agents, MCP, and vibe coding with a 100-point minimum.
- high · HN: Claude / Anthropic
HN posts mentioning 'Claude' or 'Anthropic' with a 100-point minimum.
- high · Hugging Face Blog
Releases of image, video, audio, and 3D models. Partly tech-heavy; the Gemini relevance filter removes the noise. Downgraded from critical: too much volume for 'must-read' status.
- high · IndyDevDan
Claude Code power user, prompts, hooks.
- high · Interconnects (Nathan Lambert)
AI policy + research analysis. Low hype rate, opinionated.
- high · Latent Space
Swyx's podcast + blog: founder interviews and engineering deep dives.
- high · Matt Wolfe
Comprehensive weekly digest of AI tools. ~700K subs.
- high · Matthew Berman
AI news, model release reviews, agent demos. High output.
- high · r/aivideo
AI video community: Sora, Veo, Runway, Kling, LTX. What genuinely surprises creators.
- high · r/ClaudeAI
The Claude community: power users, tips, problems.
- high · r/LocalLLaMA
Open-source LLMs, local inference, benchmarks without the hype.
- high · r/StableDiffusion
The largest open-source image-gen community (700k+ users). Model launches, LoRAs, ComfyUI workflows.
- high · Riley Brown
Vibe coding, AI builder workflows, Cursor + Claude tutorials.
- high · The Decoder
German AI news outlet publishing in English, good breaking news.
- high · Theo - t3.gg
TypeScript + AI dev workflows. Hot takes, narrative-driven.
- high · Yannic Kilcher
Paper reviews and deep dives into AI research.
- low · AI Weirdness
Janelle Shane — playful AI experiments, image gen quirks. Niski volume, unikalna perspektywa.
- medium · bycloud
Makes AI papers digestible: sits between Two Minute Papers and Yannic Kilcher in depth.
- medium · Creative Bloq
Design industry: where AI is encroaching on classic graphic disciplines.
- medium · Fireship
100-second format, often AI/LLM + tech news.
- medium · fxguide
VFX and film industry, with ever more AI in the pipeline. A professional perspective.
- medium · Greg Isenberg
Solo-founder vibe: builds products with AI, podcasts with indie hackers.
- medium · r/ChatGPTCoding
Vibe-coding tips, IDE setups, prompts. A mix of all models.
- medium · r/comfyui
ComfyUI workflows: custom nodes, JSON workflows, optimizations.
- medium · r/midjourney
Midjourney community: v7+ launches, style references, prompt patterns.
- medium · r/runwayml
Runway-specific community: feature launches, prompt patterns, comparisons with competitors.
- medium · r/SunoAI
Suno music-gen community: new model versions, lyric-prompting techniques. Audio AI has a weak RSS ecosystem.
- medium · Tina Huang
AI workflows for data science, practical applications.
- medium · Two Minute Papers
Short summaries of AI papers, great for a quick scan.
- medium · Wes Roth
AI news with a more clickbaity tone; the Gemini filter sifts out the hype.
Running Qwen3.5 / Qwen3.6 with NextN MTP (Multi-Token Prediction) speculative decode in llama.cpp — single RTX 3090 Ti GPU guide
Speed up Qwen 3.5/3.6 models by nearly 3x on a single GPU using NextN Multi-Token Prediction in llama.cpp with this specific build and quantization guide.
This technical guide details how to implement NextN Multi-Token Prediction (MTP) for the Qwen 3.5 and 3.6 model families using llama.cpp. By leveraging MTP, users can achieve approximately 2.9x faster decoding speeds with zero loss in output quality, as the prediction heads are natively integrated into these models. The process currently requires building llama.cpp from specific pull requests (#22400 and #22673) or using a provided fork. A critical step involves a specific quantization override (--tensor-type nextn=q8_0) to prevent output corruption. Benchmarks show the 35B MoE variant reaching an impressive ~150 tokens per second on a single RTX 3090 Ti.
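A minimal sketch of the build-and-quantize flow the guide describes, assuming the PRs are fetched from the upstream llama.cpp repository (model filenames are placeholders; the post's linked fork can be used instead):
  # fetch and combine the two MTP pull requests named in the post
  git clone https://github.com/ggml-org/llama.cpp
  cd llama.cpp
  git fetch origin pull/22400/head:pr-22400 pull/22673/head:pr-22673
  git checkout pr-22400 && git merge pr-22673   # or check out the provided fork
  # build with CUDA for the RTX 3090 Ti
  cmake -B build -DGGML_CUDA=ON
  cmake --build build --config Release -j
  # the critical step: keep the NextN prediction heads at q8_0
  # to avoid the output corruption the guide warns about
  ./build/bin/llama-quantize --tensor-type nextn=q8_0 \
      qwen3.6-35b-f16.gguf qwen3.6-35b-q4_k_m.gguf q4_k_m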
r/LocalLLaMA·tutorial·05/07/2026, 09:56 AM·/u/yes_i_tried_google
Qwen3.6 27B uncensored heretic v2 Native MTP Preserved is Out Now With KLD 0.0021, 6/100 Refusals and the Full 15 MTPs Preserved and Retained, Available in Safetensors, GGUFs and NVFP4s formats.
A high-performance, uncensored 27B model that successfully retains advanced Multi-Token Prediction (MTP) features for better local inference.
LLMFan46 has released 'heretic v2', an uncensored fine-tune of the Qwen3.6 27B model. This release is notable for preserving all 15 native Multi-Token Prediction (MTP) modules, which are frequently lost or degraded during the fine-tuning process. The model achieves a very low Kullback–Leibler divergence (KLD) of 0.0021, suggesting it maintains the original model's reasoning capabilities while eliminating refusals. With a refusal rate of only 6%, it is optimized for unrestricted local use. The model is available in multiple formats including Safetensors, GGUF, and NVFP4 to support various hardware setups.
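For context on the headline number: the KLD here is presumably the mean per-token Kullback-Leibler divergence between the fine-tune's next-token distribution Q and the base model's P, which is how such figures are typically reported for quants and fine-tunes:
  \[ D_{\mathrm{KL}}(P \parallel Q) = \sum_{x \in V} P(x) \log \frac{P(x)}{Q(x)} \]
summed over the vocabulary V. A mean of 0.0021 means the fine-tune's predictions are nearly indistinguishable from the original's on typical text.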
r/LocalLLaMA·model_release·05/07/2026, 02:59 AM·/u/LLMFan46
Need advice on hardware purchasing decision: RTX 5090 vs. M5 Max 128GB for agentic software development
Choosing between Nvidia and Apple for local AI coding: RTX 5090 wins on raw speed for fast iterations, while M5 Max wins on memory capacity for massive codebases.
This discussion evaluates the trade-offs between the RTX 5090 and M5 Max (128GB) for local agentic software development using models like Qwen 3.6 27B. The RTX 5090 provides approximately 3x faster token generation, which is vital for rapid code iteration, but its 32GB VRAM limits context windows and quantization levels (Q4/Q5). Conversely, the M5 Max's 128GB of unified memory supports massive context and higher precision models, though at significantly lower speeds. The author considers a multi-agent setup where a high-level orchestrator manages faster sub-agents for codebase exploration. Technical factors like Multi-Token Prediction (MTP) and MLX optimizations are highlighted as potential game-changers for Apple Silicon's usability in agentic workflows.
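A rough way to sanity-check the speed claim: single-stream decoding is usually memory-bandwidth bound, so the ceiling is approximately
  \[ \text{decode t/s} \approx \frac{\text{memory bandwidth}}{\text{bytes read per token}} \]
where the bytes per token are roughly the active weights plus KV cache. With illustrative figures (assumptions, not specs from the post), a ~14 GB Q4 quant of a dense 27B on ~1.8 TB/s GDDR7 tops out near 128 t/s, while the same model on ~0.5 TB/s unified memory tops out near 36 t/s, roughly consistent with the cited 3x gap.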
r/LocalLLaMA·tooling·05/07/2026, 12:34 AM·/u/BawbbySmith
Get faster qwen 3.6 27b
Achieve 50 t/s on Qwen 3.6 27B with 100k context on a single RTX 3090 by using MTP GGUFs and a specific llama.cpp branch.
A user on r/LocalLLaMA shared a method to significantly boost inference speeds for the Qwen 3.6 27B model on consumer hardware. By utilizing Multi-Token Prediction (MTP) GGUF files and a specific pull request for llama.cpp, they achieved speeds of 50 tokens per second on an RTX 3090. The setup involves using Q4_K_M quantization for the model and Q4_0 for the K/V cache to fit a 100k context within 19GB of VRAM. The post includes a step-by-step guide for applying the PR and the exact server configuration flags needed. It also mentions a Mac-specific installation via Homebrew for similar performance gains.
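The post carries the exact flags; a plausible reconstruction of the server command under the stated constraints (Q4_K_M weights, q4_0 KV cache, 100k context, full offload to the 3090) might look like:
  # hypothetical reconstruction; model filename and port are placeholders
  ./build/bin/llama-server \
      -m qwen3.6-27b-q4_k_m.gguf \
      -c 100000 -ngl 99 \
      --cache-type-k q4_0 --cache-type-v q4_0 \
      --flash-attn on \
      --port 8080
Flash attention is required for a quantized V cache; older llama.cpp builds take a bare --flash-attn toggle instead of on/off/auto.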
r/LocalLLaMA·tooling·05/06/2026, 11:33 PM·/u/admajic
Uploaded Unsloth Qwen3.6-35B-A3B UD XL models with MTP grafted, here are the results
MTP (Multi-Token Prediction) can significantly speed up local LLM inference, but its effectiveness varies greatly depending on the model architecture and hardware setup.
User /u/havenoammo released GGUF versions of the Qwen3.6-35B-A3B model featuring 'grafted' Multi-Token Prediction (MTP) layers. While MTP previously showed 2-2.5x speedups on dense models like the 27B variant, results for this MoE (Mixture of Experts) version are more modest, ranging from a 6% to 50% increase in tokens per second. The performance seems highly dependent on the specific GPU configuration and quantization level (Q4 vs Q8). The release includes the isolated MTP layers and conversion scripts on HuggingFace, allowing the community to experiment with speculative decoding. These preliminary results suggest that MoE architectures might not benefit as uniformly from MTP as dense models do in current llama.cpp implementations.
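A hedged way to read the spread: in standard speculative decoding, if a draft of gamma tokens is accepted at per-token rate alpha, the expected output per verification pass is
  \[ \mathbb{E}[\text{tokens per step}] = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha} \]
but the wall-clock gain also depends on how cheap verification is. On a dense model, verifying several drafted tokens in one batch re-reads essentially the same weights as decoding one token; on an MoE, each drafted token can route to different experts, so batched verification touches more weight data and the bandwidth saving shrinks. That is one plausible explanation for the 6-50% MoE gains versus 2-2.5x on the dense 27B.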
r/LocalLLaMA·tooling·05/06/2026, 09:51 PM·/u/havenoammo
Great results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot
A complete, reproducible configuration for running Qwen 3.6-35B locally in VS Code, achieving ~100 t/s for high-quality coding tasks on consumer hardware.
A user on r/LocalLLaMA shared a highly successful local coding setup using the Qwen 3.6-35B model (MoE architecture) via llama.cpp on an AMD R9700 GPU. The post includes the exact startup command for the Vulkan server, a VS Code chatLanguageModels.json configuration, and a complex React/TypeScript prompt that generated a fully functional website. Performance metrics show generation speeds of ~100 tokens/second, though large 38k token prompts cause a 17-second prefill delay. The setup utilizes context checkpointing and flash attention to maintain efficiency. This serves as a practical blueprint for developers looking to replace paid coding assistants with local LLMs.
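The post includes the full commands; as a generic sketch of the server side only (filename and flags here are assumptions, not the author's command):
  # build llama.cpp with the Vulkan backend for the AMD R9700
  cmake -B build -DGGML_VULKAN=ON
  cmake --build build --config Release -j
  # serve the model, sized for ~38k-token prompts
  ./build/bin/llama-server -m Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf -c 40960 -ngl 99 --port 8080
VS Code's OpenAI-compatible model configuration then points at http://localhost:8080/v1.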
r/LocalLLaMA·tooling·05/06/2026, 08:47 PM·/u/supracode
HOT TAKE: local models + agent harnesses are now capable enough to hand off junior-level IT professional tasks to [human written]
Local models like Qwen3.6 combined with agent harnesses are now capable of autonomously handling complex, multi-step IT administration tasks previously reserved for humans.
An IT veteran with 30 years of experience reports that local LLMs have reached a tipping point for practical automation. Using Qwen3.6 27b within the Hermes Agent harness, the user successfully automated a series of junior-level tasks: system patching, Docker installation, and setting up multiple GitHub repositories with local model services. The agent completed in 90 minutes what typically takes a human three hours, demonstrating the ability to troubleshoot errors and request approvals autonomously. The post suggests a future where 'admin agents' are embedded in infrastructure, fundamentally changing the labor ratio in IT departments. This highlights the shift from simple chat interfaces to tenacious agentic loops that can execute real-world system commands.
r/LocalLLaMA·tooling·05/06/2026, 03:21 PM·/u/Porespellar
Thanks to the sub my silly node and workflow got 3k downloads overnight, therefore I fixed some bugs, unified some features, and uploaded the latest and the greatest version to HF.
A new ComfyUI node that automates character consistency and scene composition using a structured Qwen-based procedural prompting system.
The ComfyUI Character Composer is a procedural prompt system designed to streamline character consistency and scene composition. Built upon the Qwen-Image-Edit-Rapid-AIO ecosystem, it provides a structured approach to generation, reducing the need for manual LLM prompting or copy-pasting. The tool features a unified txt2img and img2img workflow and utilizes a SFW JSON library for managing assets. Following a viral reception on Reddit with over 3,000 downloads, the developer has updated the node with bug fixes and unified features. It aims to offer more controllable generation for users working with complex character-driven workflows.
r/StableDiffusion·tooling·05/06/2026, 03:14 PM·/u/Mundane-Ad-5737
2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints
Run Qwen 3.6 27B locally with 2.5x speedup (up to 28 tok/s) using new MTP support in llama.cpp and optimized GGUF quants.
A new optimization for Qwen 3.6 27B leverages Multi-Token Prediction (MTP) via a llama.cpp Pull Request to achieve 2.5x faster inference. User /u/ex-arman68 shared custom GGUF quants that include fixed chat templates and support for massive context windows, reaching up to 262k on 48GB RAM using q4_0 KV cache compression. The setup requires compiling a specific experimental branch of llama.cpp but delivers approximately 28 tokens per second on Apple Silicon. Detailed hardware recommendations for both Mac and NVIDIA users are provided, covering various RAM configurations from 16GB to 80GB. Note that vision capabilities currently conflict with MTP in this experimental build.
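Because the endpoints are drop-in OpenAI-style, a local smoke test is a one-liner once llama-server is up (URL and model name assumed, not taken from the post):
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3.6-27b", "messages": [{"role": "user", "content": "Say hello in five words."}]}'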
r/LocalLLaMA·tooling·05/06/2026, 09:35 AM·/u/ex-arman68
Solidity LM surpasses Opus
A new 27B local model specifically fine-tuned for Solidity claims to outperform Claude Opus in smart contract coding benchmarks.
Developer /u/swingbear has released Qwen3.6-Solidity-27B, a fine-tuned model specifically optimized for the Solidity programming language. According to the author, the model achieved a higher pass@1 score on the 'soleval' benchmark compared to Claude Opus 4.7. This 27B parameter model represents a significant achievement for local LLMs in specialized coding tasks, outperforming a much larger frontier model in a niche domain. The project involved substantial compute investment to bridge the gap between general-purpose models and domain-specific tools. The model is currently available on HuggingFace for testing and community feedback.
r/LocalLLaMA·model_release·05/06/2026, 06:59 AM·/u/swingbear
Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...)
For 16GB VRAM users, Qwen 3.6 27B at IQ4_XS quantization is the ideal choice, balancing high-quality reasoning (like SVG generation) with usable local performance.
A detailed community benchmark by /u/bobaburger compares various quantization levels of the Qwen 3.6 27B model to find the optimal balance for 16GB VRAM hardware. The test uses a creative and difficult task: tracking a non-standard chess game from PGN and rendering the board state as functional SVG code. Results show that while BF16 and Q8 are near-perfect, IQ4_XS emerges as the recommended 'sweet spot' for consumer GPUs, maintaining spatial reasoning where lower quants (Q3 and below) fail. The author also demonstrates significant performance gains using the TurboQuant fork of llama.cpp, reaching 22 tokens per second on an RTX 5060 Ti.
r/LocalLLaMA·tooling·05/06/2026, 05:10 AM·/u/bobaburger
Chromium AI Image Description Plugin [ComfyUI Powered]
Analyze web images, detect AI artifacts, and generate motion prompts directly from your browser using your local ComfyUI setup and VLM models.
This Chromium plugin bridges the gap between web browsing and local ComfyUI workflows, allowing users to analyze images on any website. It leverages Vision Language Models (VLM) like Qwen 3.5 and Gemma 3 to provide detailed descriptions, OCR, and AI artifact detection. A standout feature is 'Motion Aware prompt', which suggests animation instructions for video generation based on a still image. The plugin requires a running ComfyUI backend and specific workflows provided by the author on GitHub. It also supports custom prompts for specialized image analysis tasks, making it a powerful tool for prompt engineering and quality control.
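The plugin needs a reachable ComfyUI backend; assuming it talks to ComfyUI's standard HTTP API, the server just has to be running so the extension can connect:
  # from the ComfyUI checkout; serves the API at http://127.0.0.1:8188
  python main.py --listen 127.0.0.1 --port 8188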
r/comfyui·tooling·05/06/2026, 02:26 AM·/u/deadsoulinside
Qwen 3.6 27B MTP on v100 32GB: 54 t/s
Multi-Token Prediction (MTP) nearly doubles inference speed for Qwen 3.6 27B on older V100 hardware, making it a highly viable local coding assistant.
A user report demonstrates a significant performance boost for Qwen 3.6 27B using Multi-Token Prediction (MTP) on a Tesla V100 32GB GPU. By utilizing a specific MTP branch of llama.cpp, inference speeds jumped from approximately 30 t/s to 54 t/s, nearly doubling the output rate. The setup utilized a q8_0 KV cache and supported a 200k context limit, effectively serving as a high-speed VS Code Copilot replacement. While performance dipped slightly to 40-45 t/s at higher context depths (50k+ tokens), the model remained highly effective for complex tasks like tool calls and code refactoring. This highlights the potential of MTP to extend the lifecycle of older enterprise hardware for modern local LLM workloads.
r/LocalLLaMA·tooling·05/06/2026, 02:18 AM·/u/m94301
Claude Code @ Opus 4.7 vs OpenCode @ qwen3.6:27b. Both shipped a playable cozy roguelite.
Local models like Qwen 3.6:27b have reached parity with top-tier Claude models for building and shipping entire playable games.
A direct comparison between Anthropic's Claude Code (running Opus 4.7) and the open-source OpenCode (using Qwen 3.6:27b) reveals that local models are closing the gap in complex software development. Both agents successfully generated a fully playable 'cozy roguelite' game, managing game logic, state, and basic assets. While Opus 4.7 produced slightly more optimized and cleaner code architecture, the Qwen-based local setup demonstrated that high-tier coding capabilities are no longer exclusive to proprietary cloud APIs. This benchmark is significant for developers prioritizing privacy and cost-efficiency, as a 27b parameter local model can now handle end-to-end project shipping.
r/LocalLLaMA·tooling·05/05/2026, 10:58 PM·/u/rm-rf-rm
MTP on strix halo with llama.cpp (PR #22673)
Multi-Token Prediction (MTP) in llama.cpp nearly doubles inference speeds on AMD Strix Halo hardware, reaching up to 80 t/s on 35B models.
A user on r/LocalLLaMA demonstrated a significant performance boost using the new Multi-Token Prediction (MTP) support in llama.cpp. Testing on an AMD Strix Halo (AI Max 395) with 128GB of fast DDR5-8000 RAM, inference speeds for a Qwen 35B model jumped from approximately 40 t/s to between 60 and 80 t/s. The setup utilized a specific pull request (#22673) and specialized GGUF files designed for MTP. While prompt processing (PP) speeds remained stable, the generation speed benefit is nearly double in some scenarios. This highlights the potential of speculative decoding techniques to make large local models much more responsive on high-end unified memory APUs.
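For reproducing this kind of before/after comparison, llama.cpp's bundled benchmark reports prompt-processing (pp) and token-generation (tg) speeds separately (model path is a placeholder; any MTP-specific switches depend on the PR build):
  ./build/bin/llama-bench -m qwen3.6-35b-mtp.gguf -p 512 -n 128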
r/LocalLLaMA·tooling·05/05/2026, 10:26 PM·/u/Edenar
DeepSeek V4 being 17x cheaper got me to actually measure what I send to cloud vs what I could run locally. the results are stupid.
Stop overpaying for cloud AI: 65% of coding tasks can be handled locally with zero quality loss, potentially cutting your API bills by 75%.
A developer conducted a 10-day experiment comparing a local Qwen 3.6 27b model on an RTX 3090 against cloud frontier models like GPT-5.2 for daily coding tasks. The results revealed that 65% of tasks, including file scanning and boilerplate generation, were handled identically by the local model, and debugging with multi-file context reached 61% accuracy locally. Complex architectural decisions still favored cloud models, but these accounted for only 15% of the total workload. By routing simpler tasks to local hardware and reserving cloud for high-complexity work, the author reduced their monthly API bill from $85 to $22. This highlights a significant 'laziness tax' where users overpay for cloud intelligence on tasks that local hardware can easily manage.
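A minimal sketch of the routing idea (endpoints, model names, and the simple/complex split are assumptions, not the author's setup):
  # usage: route_task simple|complex "prompt"; prompts containing quotes need escaping
  route_task() {
    if [ "$1" = "simple" ]; then
      # file scanning, boilerplate -> local llama-server
      curl -s http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d "{\"model\":\"qwen3.6-27b\",\"messages\":[{\"role\":\"user\",\"content\":\"$2\"}]}"
    else
      # debugging, architecture -> cloud frontier model
      curl -s https://api.openai.com/v1/chat/completions \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $OPENAI_API_KEY" \
        -d "{\"model\":\"gpt-5.2\",\"messages\":[{\"role\":\"user\",\"content\":\"$2\"}]}"
    fi
  }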
r/LocalLLaMA·tooling·05/05/2026, 08:55 PM·/u/spencer_kw
Relevance auto-scored by LLM (0–10). List shows top 30 from the last 7 days.