Practical AI Methodology Meets Cognitive Science
AI/ML Reading List
Curated links with summaries.
- A new dataset with more than 100M high-quality, curated images, with captions and metadata! [P]
MONET, a 104.9M high-quality image-text dataset refined from 2.9B images, is released under Apache 2.0 with an accompanying arXiv paper and companion tools (UMAP visualization, retrieval interface, T2I training codebase). It is relevant to practitioners and researchers building vision-language models: it widens access to curated training data and provides a reproducible methodology via the published paper.
ai-ml community
- Wall-OSS-0.5: 4B VLA with open training code and zero-shot real-robot evaluation [D]
Wall-OSS-0.5 is an open-source 4B vision-language-action model from X Square Robot featuring gradient-bridge training that prioritizes discrete action tokens over flow matching and achieves 60.5% average task progress after fine-tuning (+17.5pp over pi0.5) with zero-shot real-robot evaluation. This matters because it democratizes access to state-of-the-art embodied AI training code and benchmarks on real hardware, shifting what practitioners can build and validating gradient-bridge scaling as a core VLA design principle.
ai-ml community
- Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems [R]
Researchers introduce AgingBench, a longitudinal deployment benchmark showing that upgrading an agent's backbone model (Claude Sonnet 4.6 to Opus 4.7) can paradoxically degrade performance by ~15% due to memory state evolution effects over time. The finding that memory policy drives 4.5x variance in agent half-life suggests that 'swap and deploy' model upgrade strategies are unsafe for long-lived agentic systems—a critical operational concern as agents move from research to production.
ai-ml community long-signal:rdd
- Making LLMs tell you how confident they really are through probe-targeted fine-tuning. [R]
Researcher demonstrates that LLMs internally discriminate correct from incorrect answers (0.76–0.88 AUROC) but verbally express 99% confidence uniformly; probe-targeted LoRA fine-tuning (few hundred examples, <10 min training) teaches models to align expressed confidence with internal metacognitive signals, with causal validation via activation patching. This addresses a critical deployment problem—LLM calibration and honest uncertainty quantification—with a lightweight, generalizable method tested across 8 models (7B–70B) and backed by pre-registration and reproducible code.
ai-ml community
- Apple working to cram massive Gemini model into iPhone to power new Siri
Apple is attempting to compress Google's massive Gemini model to run on iPhones as part of a new Siri integration, but reports indicate the effort will rely heavily on cloud processing despite Apple's privacy-focused positioning. This represents a significant industry signal about the current feasibility ceiling for large-scale model deployment on consumer devices and the practical trade-offs between privacy and capability.
ai-ml research
- LLMs believe false statements even after explicit warnings that they're false
Researchers found that LLMs absorb false statements into their representations even when those statements are explicitly labeled as false in training data, a phenomenon called 'negation neglect.' The finding explains a root cause of hallucination and has direct implications for training data curation and model reliability in production systems.
ai-ml research
- How to stop holding AI agents back
ai-ml research
- The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF
Researchers introduce DistractionIF, a benchmark showing that larger LLMs are paradoxically less robust to instruction-like noise in reference text (up to 30-point performance drop), with mechanistic evidence that scaling erodes the probabilistic boundary between instruction-following and data-processing. This inverse scaling phenomenon and the proposed GRPO-based mitigation directly address a critical failure mode in production RAG and agentic deployments where external context contamination is common.
ai-ml research
- Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization
Researchers propose a hybrid zeroth-order optimization framework to improve robustness of safety alignment in LLMs against perturbations (noise, quantization), showing that post-alignment zeroth-order refinement strengthens safety behavior while preserving utility. This addresses a foundational vulnerability in deployed LLMs and signals the field's maturation toward adversarially robust alignment—central to long-term AI safety and deployment confidence.
ai-ml research long-signal:rdd
- Bridging the Sim-to-Real Gap in Reinforcement Learning-Based Industrial Dispatching through Execution Semantics
Researchers propose a policy-neutral execution layer to bridge sim-to-real gaps in industrial scheduling systems by instrumenting decision validity, action admissibility, and execution attribution. This work improves observability and reliability of RL policies in asynchronous, partially-observed industrial environments—valuable for practitioners deploying event-driven systems but incremental relative to broader RL infrastructure advances.
ai-ml research
- The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling
A new 306M-parameter transformer architecture augments GPT-2 Small with category-theoretic inductive biases and simplicial message passing, achieving 12% relative perplexity reduction on WikiText-103 through ablation-validated architectural components. This represents a principled intersection of cognitive science and formal mathematics applied to language modeling—a rare approach that could shift how researchers think about inductive structure in neural architectures.
ai-ml research long-signal:rdd
- KBF: Knowledge Boundary as Fingerprint for Language Model and Black-Box API Auditing
Researchers introduce KBF, a black-box auditing protocol that fingerprints LLM APIs by measuring stable numerical recall near knowledge boundaries, enabling detection of model substitution fraud and deployment inconsistencies. This addresses a critical gap in supply-chain trust for LLM access—users relying on relay APIs now have a low-cost verification method, with empirical validation across 16 production endpoints revealing real inconsistencies in platform offerings.
ai-ml research velocity:hn-high
- Composing Non-Conjugate Factor Graphs with Closed-Form Variational Inference
Researchers identify five factor-graph primitives that, when composed, preserve closed-form variational inference even in deep architectures—typically a breaking point for tractable probabilistic inference. The framework enables Bayesian mixture-of-experts with inferred gating and universal function approximation guarantees, demonstrated on ensemble forecasting with calibrated uncertainty across five benchmarks, directly advancing the tractability frontier for hierarchical probabilistic models.
ai-ml research
- Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset
Researchers introduce CFMME, a 6,052-instance Chinese financial multimodal benchmark that evaluates large vision-language models across eight image modalities and four tasks; current state-of-the-art models reach only 66% accuracy on QA, which signals domain-specific gaps. The benchmark addresses a genuine evaluation blind spot for LVLMs in financial contexts and is relevant to practitioners building domain-specialized multimodal systems, though it is narrower than general capability research.
ai-ml research
- On the Optimizer Dependence of Neural Scaling Laws
An arXiv paper demonstrates that the scaling exponent α in neural scaling laws L(N) ∝ N^(-α) varies systematically with optimizer choice: preconditioned optimizers achieve 2.6× larger exponents than vanilla SGD under natural-language spectral conditions. This challenges the assumption that scaling-law exponents are fixed constants and implies that optimizer selection materially affects how much each doubling of model size reduces loss, with direct implications for scaling-law forecasting and LLM training efficiency.
ai-ml research
- SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow
SURGENT is a surgical multi-agent system combining Tree-of-Thought planning, retrieval-augmented reasoning, and novel memory architecture to handle complete perioperative workflows while preserving privacy through local deployment. The work addresses genuine limitations of commodity LLMs in clinical contexts (input constraints, auditability, context management) and shows measurable improvements over baselines—signaling how domain-specific multi-agent systems can advance AI in high-stakes, regulated domains.
ai-ml research
- BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference
BlockBatch proposes a training-free inference framework that executes multiple block-size branches in parallel for diffusion language models, achieving 26.6% reduction in denoising steps and 1.33x end-to-end speedup. This work is relevant to practitioners optimizing dLLM deployment and signals a shift in how the field approaches inference efficiency through algorithmic branching rather than single-parameter tuning.
ai-ml research velocity:hn-medium
- Quantum-Enhanced Adversarial Robustness in Artificial Intelligence
An arXiv preprint surveys quantum computing techniques applied to adversarial robustness in AI systems, covering quantum optimization, feature mapping, and hybrid quantum-classical architectures. While addressing a legitimate alignment concern (adversarial attacks in safety-critical systems), the work appears primarily expository rather than presenting novel empirical results or theoretical advances that would shift practitioner capabilities or understanding.
ai-ml research long-signal:rdd
- No Reader Left Behind: Multi-Agent Summaries Everyone Can Understand
Researchers introduce NRLB, a multi-agent framework that generates plain-language summaries tailored to diverse reader groups (elementary students, non-native speakers, readers with attention deficits) by simulating cognitive barriers and iteratively refining outputs. This work addresses a concrete gap in NLP—accessibility compliance and broad readability—with demonstrated human preference gains (55-76%), signaling both practical deployment value and emerging focus on accessibility as a capability requirement in summarization systems.
ai-ml research velocity:hn-medium
- Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers
Researchers propose SafeDIG, a safety steering framework using sparse autoencoders and position-aware feature transfer to control harmful outputs in diffusion transformers like FLUX.1 and Stable Diffusion 3.5. The work is significant because it tackles the fundamental challenge of safety adaptation across shifting risk domains—a critical problem for production deployment of generative models where simple prompt filtering fails.
ai-ml research
- Meta-Programming for Linear-time Temporal Answer Set Programming
Researchers propose a flexible meta-programming framework that unifies implementations of temporal extensions to Answer Set Programming (TEL, MEL, DEL) through declarative encoding in clingo, introducing the metasp system. The work is relevant to the knowledge representation and automated reasoning communities, and it advances the expressiveness and rapid prototyping of temporal logics for constraint-solving and planning domains.
ai-ml research
- LFQ: Logit-aware Final-block Quantization for Boosting the Generation Quality of Low-Bit Quantized LLMs
LFQ introduces a logit-aware quantization method that improves low-bit weight-only quantization for LLM generation by optimizing the final Transformer block with cross-entropy loss instead of MSE, aligning token probability distributions. This directly addresses a deployment bottleneck—memory-efficient inference on generative tasks—making it relevant to practitioners building production LLM systems.
ai-ml research
- Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling
Researchers propose Graph-Distance Contribution Reward (GDCR) and Step Advantage Policy Optimization (SAPO) to solve step-level credit assignment in agentic search systems by modeling world knowledge as a latent graph and scoring intermediate steps by distance to answer nodes. This advances a critical constraint in training long-horizon agentic reasoning systems—moving beyond trajectory rewards that obscure which individual steps actually contributed to success.
ai-ml research
- MOOSE-Copilot: A Web-Based Interactive Assistant for Unified Exploratory and Fine-Grained Scientific Hypothesis Discovery
MOOSE-Copilot presents a unified framework for human-AI collaborative scientific hypothesis discovery, formalized through explicit interaction signals (blueprints, routing, feedback) and validated via a web UI. It is relevant to researchers building LLM-assisted scientific tools and to HCI practitioners, but it represents iterative refinement of known interaction paradigms rather than a capability breakthrough.
ai-ml research
- How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions
Researchers show that neural scaling laws and the Vendi Score are special cases of matrix spectral functions, unifying dataset valuation theory. They develop a secular-equation optimization method delivering 35,000x speedup on ImageNet-scale data, empirically demonstrating that facility location outperforms other objectives for predicting training-subset value across multiple regimes.
ai-ml research
- LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis
LogDx-CI benchmarks 11 context-reduction tools for processing CI failure logs before LLM diagnosis, finding that hybrid grep+tail routers dominate cost-quality tradeoffs and that an agent loop rescues weak contexts at higher cost. The work surfaces a previously unbenchmarked but field-critical problem, log preprocessing for coding agents, with reproducible data and code, making it directly actionable for practitioners building LLM debugging systems.
ai-ml research
- Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?
Researchers introduce differential circuit vulnerability to measure how internal computational circuits degrade under fine-tuning, finding that RL preserves base model circuits better than SFT despite slower task adaptation. This mechanistic insight advances understanding of catastrophic forgetting and provides practitioners with evidence for choosing RL-based adaptation when prior capabilities must be retained.
ai-ml research
- Specialty-Specific Medical Language Model for Immune-Mediated Diseases
Researchers developed a transformer-based Named Entity Recognition model for extracting disease entities from clinical text in immunology and infectious disease, achieving 0.89 F1 on expert-annotated case reports. The work is relevant to practitioners building clinical NLP systems, but it applies existing techniques (BERT, domain embeddings) to a narrow vertical without advancing core ML capabilities.
ai-ml research
- Double-Edged Sword or Sharp Tool? Designing and Evaluating Triadic LLM-Teacher Collaboration for K-12 Writing at Scale
Researchers designed and evaluated a triadic LLM-teacher-student collaboration system for K-12 writing instruction using 57,954 essays across 120 schools, finding that strategic labor division (LLM as generative support, teacher as quality gatekeeper) improves outcomes but exhibits diminishing returns from over-expansion. This matters because it provides empirical evidence for *how* to integrate LLMs in education responsibly—balancing automation benefits against pedagogical oversight and student learning, with direct implications for AI policy in schools.
ai-ml research
- Temporal Stability and Few-Shot Prompting in Math Task Assessment
A longitudinal study of Gemini and Coteach across model versions and few-shot prompting shows that prompt engineering produces more reliable gains than passive model updates on educational task classification. The result is relevant for practitioners deploying LLMs in specialized domains where version stability is unpredictable.
ai-ml research
- From GPS Points to Travel Patterns: Flexible and Semantic Trajectory Generation with LLMs
Researchers propose HTP, a hierarchical approach using LLMs to generate realistic urban GPS trajectories by first generating travel patterns via quantized tokens, then GPS points—outperforming baselines by 29.78% while addressing privacy concerns in smart city applications. The work demonstrates a novel way to extend LLM vocabulary for structured spatial data, relevant to practitioners working on synthetic data generation and LLM adaptation for domain-specific tasks.
ai-ml research
- Make LLM Learn to Synthesize from Streaming Experiences through Feedback
Researchers introduce StreamSynth, a framework where LLMs learn to synthesize data sequentially across task streams, accumulating and transferring synthesis experience rather than treating each task in isolation. This advances synthetic data generation from a static capability to an experience-driven process, directly relevant to practitioners building on LLMs for data generation pipelines and researchers studying transfer learning and few-shot adaptation.
ai-ml research
- Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation
Moment-KV proposes a momentum-based approach to compress KV caches during LLM decoding by modeling token importance as a temporally evolving state with attention decay, rather than using static heuristics. This addresses a major inference bottleneck for long-generation tasks with demonstrated 2.3-3.2% fidelity improvements, directly relevant to practitioners scaling LLM deployments.
ai-ml research
- Reliable Reasoning with Large Language Models via Preference-Based Maximum Satisfiability
Researchers propose a hybrid system where LLMs generate Python code to encode constrained optimization problems as MaxSAT instances, then verify solutions against canonical encodings. The approach achieves >80% correctness on preference-based reasoning tasks where baseline LLM methods fail, demonstrating that externalized reasoning via formal solvers substantially improves reliability on multi-constraint problems relevant to robotics and planning domains.
ai-ml research
- Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation
Researchers propose 'Think Fast, Talk Smart,' a hybrid architecture that uses deterministic computation for recurring analysis before bounded LLM calls, demonstrating lower error rates and costs than pure LLM baselines on sleep-health insights. The work articulates a principled design rule for production AI systems: deterministic code should own analysis; LLMs should only express verified facts within constrained interfaces—with direct implications for reliability in healthcare, finance, and other domains requiring faithful output grounding.
ai-ml research velocity:hn-high
- Hexo Labs Open-Sources SIA: A Self-Improving Agent That Updates Both the Harness and the Model Weights
Hexo Labs released SIA, an open-source, MIT-licensed framework that co-optimizes both agent scaffolds (prompts, tools, retry logic) and model weights in a single self-improving loop, using multi-agent LLM decision-making to select RL algorithms per reward shape. It demonstrated 20+ point gains on LawBench, a 12× speedup on kernel synthesis, and consistent SOTA improvements across three unrelated domains, advancing the emerging field of test-time agent optimization beyond fixed-harness or fixed-weight constraints.
ai-ml research boost:open-source
- GitHub Copilot for Eclipse Goes Open Source Under MIT, Six Weeks After Microsoft Signaled the Move
Microsoft open-sourced GitHub Copilot for Eclipse (15K lines of Java, MIT license) on May 21, achieving 1,200+ stars rapidly. This democratizes IDE-integrated AI coding assistance and signals Microsoft's strategy to expand Copilot's reach beyond VS Code into enterprise development environments.
ai-ml research boost:open-source