3RecursiveIntelligence.io

Practical AI Methodology Meets Cognitive Science

Looking for Ricursive (the AI chip design company)? You want ricursive.com

AI/ML Reading List

Curated links with summaries. RSS feed ↗

  • Nvidia unveils Ising AI models for quantum error correction and calibration

    Nvidia released Ising AI models designed to improve quantum error correction and calibration workflows, applying deep learning to a critical bottleneck in quantum computing hardware development. This represents a novel intersection of classical AI/ML infrastructure and quantum computing that could accelerate the practical timeline for fault-tolerant quantum systems—a long-horizon signal for how AI tooling may unlock adjacent compute paradigms.

    ai-ml · community · long-signal:rdd
  • ClawBench: Can AI Agents Complete Everyday Online Tasks? 153 tasks, 144 live websites, best model at 33.3% [R]

    ClawBench introduces a benchmark evaluating AI agents on 153 real-world tasks across 144 live websites with human ground truth and multi-layer behavioral telemetry; the best model (Claude Sonnet 4.6) achieves only 33.3% success, signaling that agentic web automation remains a hard, unsolved capability frontier with direct implications for autonomous task execution research and deployment feasibility.

    ai-ml · community
  • HDNA Workbench: open-box AI research platform where every neuron is inspectable with one-line PyTorch inspection wrapper [P]

    Developer released HDNA Workbench, an open-source Python framework for neural network interpretability featuring per-neuron inspection, a PyTorch wrapper enabling layer-level tracing on existing models, and built-in training curricula. Addresses mechanistic interpretability bottleneck by making transparency architectural rather than post-hoc, relevant to both compliance auditing and fundamental AI cognition research.

    ai-ml · community
  • Layerwise “surprise” signal for OOD detection [R]

    Nervecode proposes a lightweight OOD detection method based on layerwise 'surprise' signals computed from network activations, achieving 99.2% AUROC on MNIST→FashionMNIST benchmarks. The approach offers interpretability advantages over output-only baselines by identifying which layers diverge under distribution shift, and is relevant to practitioners building robust ML systems.

    ai-ml · community
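
The mechanism above can be sketched in a few lines, assuming (my reading, not the paper's exact formulation) that "surprise" measures per-layer deviation from in-distribution activation statistics; `fit_layer_stats` and the mean-|z| aggregation are illustrative choices:

```python
import numpy as np

def fit_layer_stats(id_activations):
    """id_activations: list of (n_samples, dim) arrays, one per layer.
    Returns per-layer (mean, std) of in-distribution activations."""
    return [(a.mean(axis=0), a.std(axis=0) + 1e-6) for a in id_activations]

def surprise(stats, sample_acts):
    """sample_acts: one (dim,) activation vector per layer.
    Returns per-layer surprise (mean |z-score|) and a max-over-layers score;
    the argmax layer is the one diverging under distribution shift."""
    per_layer = [float(np.abs((x - mu) / sd).mean())
                 for (mu, sd), x in zip(stats, sample_acts)]
    return per_layer, max(per_layer)
```

The per-layer vector is what gives the interpretability win: it localizes *where* the shift shows up, not just *that* it does.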
  • Reducing LLM hallucination by using a model-agnostic control layer [R]

    A Reddit user presents a model-agnostic gating layer that constrains LLM output to reduce hallucinations, achieving 95% accuracy and a 1.5% hallucination rate on a 200-question benchmark versus a plain LLM (28% acc, 16% hallucination) and RAG (31% acc, 29% hallucination). The approach is relevant to production LLM systems seeking reliability, though the methodology (small benchmark, single-source reporting, no peer review) and the incremental nature of answer gating limit its novelty.

    ai-ml · community
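
A minimal sketch of the gating idea, with a placeholder token-overlap support score standing in for whatever confidence check the post actually uses:

```python
def gated_answer(question, llm_answer_fn, support_fn, threshold=0.5):
    """Pass the LLM answer through only when its support score clears the gate;
    otherwise abstain rather than risk a hallucinated answer."""
    answer = llm_answer_fn(question)
    return answer if support_fn(question, answer) >= threshold else "I don't know"

def overlap_support(question, answer, evidence="paris is the capital of france"):
    """Toy support score: fraction of answer tokens found in the evidence."""
    tokens = set(answer.lower().split())
    return len(tokens & set(evidence.split())) / max(len(tokens), 1)
```

The point of the shape is model-agnosticism: `llm_answer_fn` can wrap any model, and the gate never touches its weights.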
  • Added 8 Indian languages to Chatterbox TTS via LoRA — 1.4% of parameters, no phoneme engineering [P]

    Practitioner successfully extended Resemble AI's open-source multilingual TTS to 8 Indian languages (500M+ speakers) using parameter-efficient LoRA (1.4% of model) + grapheme-level tokenization + Brahmic script warm-start, achieving intelligible output without phoneme engineering. Demonstrates practical democratization of speech synthesis for underrepresented languages via efficient fine-tuning — directly applicable to practitioners building localized voice systems, with full model and training code released on HuggingFace.

    ai-ml · community
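
The "1.4% of parameters" figure is easy to sanity-check: LoRA adds rank-r factors A (r×d_in) and B (d_out×r) per adapted weight matrix. A back-of-envelope helper, with illustrative shapes rather than Chatterbox's actual architecture:

```python
def lora_fraction(layers, r):
    """layers: list of (d_out, d_in) weight shapes; r: LoRA rank.
    Returns trainable LoRA parameters as a fraction of the base weights."""
    base = sum(o * i for o, i in layers)          # full weight matrices
    lora = sum(r * (o + i) for o, i in layers)    # A and B factors per layer
    return lora / base
```

For square 1024-dim projections at rank 8, the fraction works out to about 1.6%, the same ballpark as the post's 1.4%.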
  • CHI PLAY reviews [R]
    ai-ml · community
  • From Plan to Action: How Well Do Agents Follow the Plan?

    A new arXiv paper systematically evaluates whether code-generation agents (SWE-agent) actually follow instructed plans across 16,991 trajectories, finding that agents default to training-internalized workflows when plans aren't reinforced, and that poorly-designed plans hurt more than help. This matters because it reveals a fundamental training/alignment gap: agents need fine-tuning to follow explicit plans rather than baking task-specific workflows into model weights—a key insight for building more controllable and generalizable AI agents.

    ai-ml · research · velocity:hn-medium
  • The Effect of Document Selection on Query-focused Text Analysis

    Researchers systematically evaluated seven document selection strategies across four text analysis methods (LDA, BERTopic, TopicGPT, HiCode) on 26 queries, finding that semantic or hybrid retrieval consistently outperforms simpler baselines while avoiding unnecessary compute. This establishes document selection as a deliberate methodological choice rather than a side effect, offering practical guidance for large-scale text analysis workflows.

    ai-ml · research
  • Learning Chain Of Thoughts Prompts for Predicting Entities, Relations, and even Literals on Knowledge Graphs

    RALP reformulates knowledge graph link prediction as prompt learning, using LLMs with chain-of-thought reasoning to handle unseen entities, relations, and literals where traditional embedding models fail. Achieves 5%+ MRR improvements over SOTA and demonstrates practical reasoning on complex OWL class expressions, with open-source implementation released.

    ai-ml · research · velocity:hn-high
  • ReasonXL: Shifting LLM Reasoning Language Without Sacrificing Performance

    ReasonXL introduces a 2M+ sample parallel corpus of reasoning traces across five European languages and demonstrates that LLMs can be adapted via SFT+RLVR to reason in non-English languages without performance loss. Mechanistic analysis reveals language identity encoded in early layers and efficient representational rerouting in upper layers. This matters to practitioners building multilingual systems and researchers studying how language identity is encoded in model architecture—a concrete technical solution to a widely acknowledged limitation in non-English LLM deployment.

    ai-ml · research
  • HiCoLoRA: Addressing Context-Prompt Misalignment via Hierarchical Collaborative LoRA for Zero-Shot DST

    HiCoLoRA proposes a hierarchical LoRA framework to improve zero-shot dialog state tracking by aligning dynamic contexts with static prompts using spectral clustering and semantic-aware initialization. Relevant to practitioners building task-oriented dialogue systems and researchers exploring parameter-efficient fine-tuning approaches that generalize across domains without retraining.

    ai-ml · research · long-signal:rdd · velocity:hn-medium
  • Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector

    EnergyGPT applies supervised fine-tuning and LoRA to LLaMA 3.1-8B for energy-sector Q&A, showing improvements over base model on domain benchmarks. Relevant for practitioners exploring parameter-efficient adaptation in regulated/specialized domains, though the core technique (fine-tuning) and architectural choices are conventional.

    ai-ml · research · velocity:hn-medium
  • AdaMCoT: Rethinking Cross-Lingual Factual Reasoning through Adaptive Multilingual Chain-of-Thought

    AdaMCoT introduces an adaptive multilingual chain-of-thought framework that dynamically routes reasoning through intermediary 'thinking languages' to improve factual reasoning consistency across high and low-resource languages without additional pretraining. This addresses a core limitation in production multilingual AI systems: performance degradation in underrepresented languages, offering practitioners a scalable approach to bridge cross-lingual capability gaps.

    ai-ml · research
  • Accelerating Speculative Decoding with Block Diffusion Draft Trees

    DDTree extends block diffusion drafting for speculative decoding by constructing draft trees from per-position distributions, improving token acceptance rates over single-trajectory methods like DFlash while maintaining single-pass verification. This advances a critical inference bottleneck for practitioners deploying large language models at scale, with competitive performance against strong baselines like EAGLE-3.

    ai-ml · research
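
A toy sketch of the draft-tree idea under my own simplifications: per-position distributions expand into top-k candidates, candidate paths form a tree, and verification accepts the longest prefix the target model agrees with. `draft_dists` and the greedy verification rule stand in for the paper's block-diffusion machinery:

```python
import itertools

def build_draft_tree(draft_dists, k=2):
    """draft_dists: one {token: prob} dict per drafted position.
    Returns all root-to-leaf paths of the top-k-per-position draft tree."""
    per_pos = [sorted(d, key=d.get, reverse=True)[:k] for d in draft_dists]
    return [list(p) for p in itertools.product(*per_pos)]

def verify(paths, target_next_token):
    """Accept the longest path prefix that the target model would itself emit."""
    best = []
    for path in paths:
        accepted = []
        for tok in path:
            if target_next_token(accepted) != tok:
                break
            accepted = accepted + [tok]
        if len(accepted) > len(best):
            best = accepted
    return best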
  • MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

    MoshiRAG introduces asynchronous retrieval-augmented generation for full-duplex speech language models, enabling real-time conversational AI to access external knowledge sources without sacrificing interactivity. This matters because it solves a key tradeoff: maintaining factuality in conversational systems while preserving the naturalness advantages of full-duplex interaction—a capability gap between compact real-time models and larger offline systems.

    ai-ml · research · velocity:hn-medium
  • Latent Planning Emerges with Scale

    Researchers introduce a framework for detecting latent planning in LLMs—internal representations that shape token generation ahead of time—and demonstrate via mechanistic analysis that planning ability increases with model scale across the Qwen-3 family. This contributes to understanding of how LLMs generate coherent long-horizon outputs and provides tools for interpreting emergent reasoning capabilities, advancing the cognitive science of language models.

    ai-ml · research
  • KoCo: Conditioning Language Model Pre-training on Knowledge Coordinates

    KoCo conditions LLM pre-training by mapping documents to semantic coordinates and prepending them as context, improving downstream task performance, accelerating convergence by ~30%, and reducing hallucinations. This addresses a real gap in how LLMs currently treat training corpora as flat token sequences, offering practitioners a concrete conditioning approach that could influence how foundation models are trained.

    ai-ml · research
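
One way to picture the conditioning step, with nearest-centroid IDs over toy vectors standing in for whatever semantic-coordinate mapping KoCo actually uses:

```python
import numpy as np

def coordinate_token(doc_vec, coarse_centroids, fine_centroids):
    """Nearest-centroid IDs at two granularities, packed into a pseudo-token."""
    c = int(np.argmin(np.linalg.norm(coarse_centroids - doc_vec, axis=1)))
    f = int(np.argmin(np.linalg.norm(fine_centroids - doc_vec, axis=1)))
    return f"<coord:{c}:{f}>"

def condition_document(doc_tokens, doc_vec, coarse_centroids, fine_centroids):
    """Prepend the coordinate token so pre-training sees it as leading context."""
    return [coordinate_token(doc_vec, coarse_centroids, fine_centroids)] + doc_tokens
```

The prepended token is what turns a flat token sequence into one conditioned on where the document sits in semantic space.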
  • From Myopic Selection to Long-Horizon Awareness: Sequential LLM Routing for Multi-Turn Dialogue

    DialRouter introduces sequential routing for multi-turn dialogue using Monte Carlo tree search and learned routing policies to optimize cumulative performance across LLM selections, moving beyond myopic single-turn decisions. This matters to practitioners and researchers building production dialogue systems, where model selection can account for conversation dynamics and long-horizon reward accumulation rather than greedy per-turn optimization.

    ai-ml · research · long-signal:rdd · velocity:hn-medium
  • Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations

    Researchers propose cooperative memory paging—replacing evicted conversation segments with compact keyword bookmarks (~8-24 tokens) and a recall() tool—achieving superior performance on multi-session LLM conversations versus truncation and retrieval baselines across four model families. The work identifies bookmark discrimination as the remaining bottleneck (96% recall rate but only 57% page accuracy) and provides a detailed ablation study (3,176+ probes) characterizing design tradeoffs, offering practitioners actionable guidance for implementing long-horizon conversation systems.

    ai-ml · research · long-signal:rdd
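
The paging loop described above might look like this; keyword extraction here is deliberately naive (most frequent long words), purely to make the evict/bookmark/recall() shape concrete:

```python
from collections import Counter

class PagedMemory:
    """Evicted conversation segments move to a store; a compact keyword
    bookmark stays behind, and recall() pages the best match back in."""
    def __init__(self, n_keywords=3):
        self.pages, self.bookmarks, self.n = {}, {}, n_keywords

    def evict(self, page_id, text):
        words = [w for w in text.lower().split() if len(w) > 3]
        self.pages[page_id] = text
        self.bookmarks[page_id] = [w for w, _ in Counter(words).most_common(self.n)]

    def recall(self, query):
        q = set(query.lower().split())
        best = max(self.bookmarks,
                   key=lambda p: len(q & set(self.bookmarks[p])), default=None)
        return self.pages.get(best)
```

The paper's "bookmark discrimination" bottleneck maps directly onto the `recall` step: matching the right bookmark is much harder than knowing that *some* page should be recalled.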
  • Coding-Free and Privacy-Preserving MCP Framework for Clinical Agentic Research Intelligence System

    Researchers developed CARIS, a clinical research system that combines LLMs with the Model Context Protocol to automate the full research pipeline—from study design to IRB documentation to ML model development—without requiring coding or exposing raw patient data. This work matters because it demonstrates how agentic AI can lower barriers to data-driven clinical research while maintaining privacy, a critical capability for scaling AI adoption across regulated domains with heterogeneous data environments.

    ai-ml · research · velocity:hn-medium
  • Temporal Flattening in LLM-Generated Text: Comparing Human and LLM Writing Trajectories

    Researchers find that LLMs exhibit 'temporal flattening'—reduced semantic and cognitive-emotional drift over time compared to humans—even when conditioned on generation history, achieving 94-98% accuracy in distinguishing human from LLM text based on temporal variability patterns alone. This reveals a structural limitation of current LLMs relevant to synthetic data generation and suggests fundamental differences in how models versus humans evolve writing over extended periods.

    ai-ml · research
  • Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

    Self-Distillation Zero converts sparse binary rewards into dense token-level supervision by training a single model to both generate and revise responses, then distilling the reviser back into the generator. This addresses a key post-training bottleneck—practitioners can improve models on math/code tasks by 10%+ without external teachers or high-quality demonstrations, making the approach accessible across model scales and domains.

    ai-ml · research · velocity:hn-medium
  • Graph-Based Chain-of-Thought Pruning for Reducing Redundant Reflections in Reasoning LLMs

    Researchers introduce a graph-based chain-of-thought pruning framework that converts linear reasoning traces into DAGs and applies dual pruning strategies to eliminate redundant reflections in reasoning LLMs, achieving 42% token reduction without accuracy loss. This addresses a practical pain point in RL-fine-tuned models (overthinking/inefficiency) with a structured, reproducible approach combining SFT, DPO, and GRPO—directly useful for practitioners scaling reasoning workloads.

    ai-ml · research
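
The DAG-pruning step can be sketched as reverse reachability from the answer node; extracting dependency edges from the reasoning trace is assumed given:

```python
def prune_cot(steps, edges, answer_id):
    """steps: {id: text}; edges: (src, dst) pairs meaning dst depends on src.
    Keeps only steps the answer transitively depends on, dropping
    redundant reflections that feed into nothing."""
    deps = {}
    for src, dst in edges:
        deps.setdefault(dst, set()).add(src)
    keep, stack = set(), [answer_id]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(deps.get(node, ()))
    return {i: steps[i] for i in steps if i in keep}
```

Any reflection node with no path to the answer is dropped, which is where the token savings come from.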
  • Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism

    Double introduces a novel speculative decoding framework that breaks theoretical speedup ceilings through iterative draft-model retrieval and authoritative target-model guidance, achieving 5.3× speedup on LLaMA3.3-70B without model retraining. This matters to practitioners because speculative decoding is a deployed inference optimization; a training-free method that outperforms EAGLE-3 has immediate practical application for LLM serving at scale.

    ai-ml · research
  • Retrieval as a Decision: Training-Free Adaptive Gating for Efficient RAG

    TARG introduces a training-free adaptive gating mechanism that decides when to retrieve in RAG systems using only lightweight uncertainty signals from draft model outputs, reducing retrieval calls by 70-90% while maintaining or improving accuracy. This directly addresses a critical production constraint—RAG latency and token inflation—with a model-agnostic method that requires no retraining or auxiliary infrastructure, making it immediately applicable to deployed LLM systems.

    ai-ml · research · velocity:hn-high
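
A minimal sketch of training-free gating: a cheap draft pass yields a next-token distribution, and retrieval fires only when its entropy crosses a threshold. The entropy signal and the threshold value are illustrative, not TARG's exact statistics:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def answer_with_gate(question, draft_probs_fn, answer_fn, retrieve_fn, tau=1.0):
    if entropy(draft_probs_fn(question)) > tau:
        context = retrieve_fn(question)     # uncertain: pay for retrieval
        return answer_fn(question, context)
    return answer_fn(question, None)        # confident: skip retrieval entirely
```

Skipped retrievals are where the 70-90% call reduction would come from: confident questions never touch the retriever.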
  • Joint Flashback Adaptation for Forgetting-Resistant Instruction Tuning

    Joint Flashback Adaptation proposes a method to reduce catastrophic forgetting in LLMs during incremental task learning using limited prompts from prior tasks and latent task interpolation. Addresses a core challenge in continual learning for language models with demonstrated improvements across reasoning and instruction-following benchmarks, relevant to practitioners building adaptive systems.

    ai-ml · research
  • The Enforcement and Feasibility of Hate Speech Moderation on Twitter

    Researchers conducted a global audit of Twitter's hate speech moderation using 540K annotated tweets across 8 languages, finding 80% of hateful content remains online and is removed at rates no higher than non-hateful tweets regardless of severity. The study demonstrates that human-AI moderation pipelines could substantially reduce hate speech exposure at costs below regulatory penalties, suggesting enforcement gaps reflect resource allocation choices rather than technical impossibility—a key signal for policy makers and platforms on the feasibility and economics of content moderation systems.

    ai-ml · research · velocity:hn-medium
  • Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss

    Researchers show that existing multilingual benchmarks (like those used by frontier models) primarily measure reasoning and factual recall rather than actual multilingual capability, and propose round-trip translation as a more aligned evaluation method with 0.94 correlation to LMArena user ratings. This matters because it exposes a systematic misalignment in how the field evaluates multilingual models and provides practitioners a simpler, more valid alternative for assessing real-world multilingual proficiency.

    ai-ml · research
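
Round-trip evaluation is simple enough to sketch end-to-end; the translators below are toy stand-ins, and token-level F1 is one reasonable similarity choice rather than necessarily the paper's metric:

```python
def token_f1(a, b):
    """Token-set F1 between two strings, as a crude similarity proxy."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    common = len(ta & tb)
    if common == 0:
        return 0.0
    p, r = common / len(tb), common / len(ta)
    return 2 * p * r / (p + r)

def round_trip_score(text, to_target, to_english):
    """Translate out to the target language and back; score what survives."""
    return token_f1(text, to_english(to_target(text)))
```

A model with genuine target-language competence preserves the text through the round trip; one leaning on English reasoning loses content on the way back.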
  • Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

    Researchers propose an Item Response Theory framework using anchor items to calibrate new LLM benchmarks while fixing parameters from prior evaluations, enabling efficient comparison across models and time periods with only ~100 anchor questions per dataset. This solves a critical infrastructure problem: as models and benchmarks proliferate, maintaining commensurable evaluation scores across studies becomes computationally prohibitive and methodologically messy. The framework reduces evaluation cost while preserving ranking reliability (ρ≥0.9), enabling sustainable benchmarking infrastructure as the field scales.

    ai-ml · research
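
The fixed-parameter idea can be sketched with a Rasch (1PL) model: abilities θ are pinned from the anchor items, and a new item's difficulty b is fit by maximum likelihood against them. Grid-search MLE keeps the sketch dependency-free; the paper's estimator is presumably more refined:

```python
import math

def rasch_p(theta, b):
    """Probability a model of ability theta answers an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def fit_difficulty(thetas, responses, grid=None):
    """thetas: abilities fixed from prior calibration; responses: 0/1 per model
    on the new item. Returns the grid point maximizing the log-likelihood."""
    grid = grid or [g / 10 for g in range(-40, 41)]
    def loglik(b):
        return sum(math.log(rasch_p(t, b)) if r else math.log(1 - rasch_p(t, b))
                   for t, r in zip(thetas, responses))
    return max(grid, key=loglik)
```

Because the θ values stay frozen, new items land on the same scale as every prior evaluation, which is what makes scores commensurable across studies.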
  • InsightFlow: LLM-Driven Synthesis of Patient Narratives for Mental Health into Causal Models

    InsightFlow uses LLMs to automatically generate 5P causal case formulation graphs from therapy transcripts, evaluated against 46 expert-annotated clinical transcripts using structural, semantic, and expert criteria. The work demonstrates feasible LLM-driven automation of a time-consuming clinical workflow with clinically meaningful outputs, signaling broader potential for AI-augmented clinical decision support in mental health.

    ai-ml · research · velocity:hn-medium
  • Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs

    Tri-RAG proposes converting unstructured external knowledge into structured triplets (Condition, Proof, Conclusion) to improve retrieval precision and reduce token consumption in RAG systems. This addresses a core RAG bottleneck—naive concatenation of retrieved text fragments—with a lightweight, frozen-parameter approach that maintains semantic alignment while reducing context overhead.

    ai-ml · research · velocity:hn-high
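
A toy version of the triplet retrieval path, with word-overlap matching standing in for real dense retrieval:

```python
def retrieve_triplet(query, triplets):
    """triplets: list of dicts with 'condition', 'proof', 'conclusion' fields.
    Matches the query against the Condition field and returns the compact
    triple instead of a raw passage, or None when nothing matches."""
    q = set(query.lower().split())
    def score(t):
        return len(q & set(t["condition"].lower().split()))
    best = max(triplets, key=score)
    return best if score(best) > 0 else None
```

Returning a three-field triple rather than a concatenated passage is where the token savings and tighter semantic alignment come from.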
  • SCRIPT: A Subcharacter Compositional Representation Injection Module for Korean Pre-Trained Language Models

    Researchers propose SCRIPT, a model-agnostic module that injects Korean subcharacter (Jamo) compositional structure into pre-trained language models without architectural changes, improving performance on NLU and NLG tasks while reshaping embeddings to capture grammatical regularities. This work signals growing attention to linguistically-informed tokenization and the gap between subword schemes and morphologically-rich languages—relevant for practitioners building multilingual or non-English NLP systems.

    ai-ml · research
  • Knowledge Is Not Static: Order-Aware Hypergraph RAG for Language Models

    Researchers propose Order-Aware Knowledge Hypergraph RAG (OKH-RAG), a method that treats retrieval order as a structural property rather than an unordered set, using hypergraphs with precedence constraints to recover coherent reasoning trajectories. This addresses a fundamental assumption violation in existing RAG systems and demonstrates measurable improvements on order-sensitive reasoning tasks, directly relevant to improving LLM grounding and reasoning capabilities.

    ai-ml · research · velocity:hn-medium
  • Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

    Safe-SAIL introduces a sparse autoencoder interpretation framework to identify and explain safety-related latent features in LLMs across four domains (pornography, politics, violence, terror), reducing explanation cost by 55% and releasing 1,758 annotated features plus open-source toolkit. This advances mechanistic understanding of how models encode safety-critical concepts and supports alignment researchers in auditing model internals for risky behavior patterns.

    ai-ml · research