Practical AI Methodology Meets Cognitive Science
The AI Abstract — Morning Edition
Making the Future Evenly Distributed.
When multiple AI models are asked to diagnose the same patient as a group, the consensus they reach is not earned: in peer discussion rounds, each model simply repeated its opening position 98% of the time.
Medical AI systems that use multiple models in deliberation are producing the appearance of consensus while concealing a near-total failure of actual reasoning. The finding comes from 🔬MedAgentAudit, a new audit framework tested across 14,400 cases and six multi-agent architectures. The number that should stop anyone building or deploying these systems: in peer discussion rounds, models repeated their initial view without engaging new evidence 98.42% of the time. Think of it like a committee where every member reads their opening statement, listens politely, then reads it again. The word "discussion" is doing no work. The word "consensus" is doing active harm, because the output looks like agreement that was earned.
The paper identifies ten failure categories. Two besides evidence aversion deserve attention. In 16.63% of cases, models made unsupported observations: claims that had no grounding in the presented patient data. And authority bias, where a model defers to a perceived senior agent rather than the evidence, climbed to 68.75% as rounds progressed. That last number has a specific shape worth understanding. It means the longer a multi-agent system deliberates, the more likely it is to converge on whichever agent is perceived as most authoritative, not whichever one is right. More rounds bring more deference, not more scrutiny.
The design implication is structural, not cosmetic. Current medical AI evaluation focuses on whether the final output matches a correct answer. MedAgentAudit makes the case that this misses the problem entirely, because a system can reach a correct answer through broken reasoning and then fail in ways that don't show up in accuracy scores. The researchers propose process-level auditing: watching how a system reasons, not just what it concludes. That's a harder engineering problem and a more expensive evaluation protocol, but it's the one that would actually catch the failure mode this paper found.
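As a rough illustration of what process-level auditing could look like, the sketch below counts how often a discussion turn restates the agent's opening diagnosis without citing any evidence raised by a peer. The `Turn` record, its fields, and `repeat_rate` are hypothetical names chosen for this example; they are not MedAgentAudit's schema or code, and a real audit would need a far richer notion of "engaging with evidence."

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    agent: str
    round_no: int        # 0 = opening statement, 1+ = discussion rounds
    diagnosis: str
    cited_evidence: set = field(default_factory=set)  # evidence IDs the turn refers to

def repeat_rate(turns):
    """Fraction of discussion turns that restate the agent's opening diagnosis
    while citing no evidence that a peer introduced in an earlier round."""
    opening = {t.agent: t for t in turns if t.round_no == 0}
    discussion = [t for t in turns if t.round_no > 0]
    if not discussion:
        return 0.0
    repeats = 0
    for t in discussion:
        first = opening[t.agent]
        # Collect evidence that other agents cited before this turn.
        peer_evidence = set()
        for other in turns:
            if other.agent != t.agent and other.round_no < t.round_no:
                peer_evidence |= other.cited_evidence
        # Only evidence the agent did not already cite itself counts as new.
        new_peer_evidence = peer_evidence - first.cited_evidence
        engaged = bool(t.cited_evidence & new_peer_evidence)
        if t.diagnosis == first.diagnosis and not engaged:
            repeats += 1
    return repeats / len(discussion)
```

The point of the design is that the metric reads the transcript, not the final answer, so a system can score well on accuracy and still show a high repeat rate.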
A separate cluster of research this cycle raises a related problem one layer up: how AI systems get evaluated at all, and whether the evaluation methods themselves are trustworthy.
🔬A Two-Phase Stability Study of LLM Judges tested whether LLMs can reliably substitute for human graders on Thai bar exam essays. The answer is yes for clear-cut cases, and no in exactly the situations where it matters most. When a rubric cell was ambiguous, human graders disagreed with each other in structured, meaningful ways. Some took minority readings that reflected genuine legal interpretation differences. LLMs, regardless of size or vendor, systematically clustered toward the majority position and failed to reproduce those minority readings. The diversity of human judgment, which represents real interpretive range in a discipline that runs on contested readings, simply disappeared.
This is a problem for how AI evaluation pipelines get built. The standard approach is to train or select an LLM judge by optimizing for agreement with a human panel. If the human panel contains genuine disagreement, and the LLM learns to reproduce only the majority view, then the benchmark looks valid but is systematically blind to edge cases. In law, edge cases are frequently the cases that matter. The paper's framing is careful: it doesn't say LLM judges are useless, but it does say that validating them only on agreement rates will produce tools that inherit convergence failures without flagging them.
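To see why agreement rates alone can hide this, here is a minimal sketch that reports two numbers side by side: how often a judge matches the human majority, and how often it lands on a minority reading in cells where humans split. The `judge_report` function and its input layout are assumptions made for illustration, not the paper's protocol.

```python
from collections import Counter

def judge_report(cells):
    """cells: list of (human_labels, judge_label) pairs, one per rubric cell."""
    agree = 0
    contested = 0
    judge_minority = 0
    human_minority_votes = 0
    human_votes_contested = 0
    for human_labels, judge_label in cells:
        counts = Counter(human_labels)
        majority, majority_n = counts.most_common(1)[0]
        agree += judge_label == majority
        if len(counts) > 1:  # humans disagreed on this cell
            contested += 1
            human_votes_contested += len(human_labels)
            human_minority_votes += len(human_labels) - majority_n
            # Count the judge only if it matches a reading some human actually took.
            judge_minority += judge_label != majority and judge_label in counts
    return {
        "majority_agreement": agree / len(cells),
        "human_minority_share": (
            human_minority_votes / human_votes_contested if contested else 0.0
        ),
        "judge_minority_share": judge_minority / contested if contested else 0.0,
    }
```

A judge can score 1.0 on `majority_agreement` while its `judge_minority_share` sits at zero and the human share is well above it. That gap is the convergence failure the agreement number does not flag.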
A complementary finding from 🔬JudgmentBench points toward a partial fix, at least for evaluation design. Researchers had practicing attorneys evaluate legal AI outputs using two methods: rubric-based scoring and pairwise comparison. The rubric approach yielded a Spearman correlation of 0.150. The pairwise approach yielded 0.908 and took half the annotation time. Same experts, same outputs, radically different signal quality. Rubrics require experts to translate their judgment into a scale; comparisons let them exercise judgment directly. For anyone building evaluation pipelines in high-stakes domains, this is a concrete methodological choice with a measurable cost attached to getting it wrong.
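For readers who want to run the same comparison on their own annotations, the sketch below computes a Spearman correlation in plain Python and turns pairwise preferences into per-item win rates that can be ranked. The helper names and the win-rate aggregation are assumptions for illustration; the paper may aggregate preferences differently.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def win_rates(items, preferences):
    """preferences: list of (winner, loser) pairs from pairwise comparisons."""
    wins = {i: 0 for i in items}
    games = {i: 0 for i in items}
    for winner, loser in preferences:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return [wins[i] / games[i] if games[i] else 0.0 for i in items]
```

Feeding `spearman` the rubric scores and then the win rates, each against the same reference ordering, gives two numbers that can be compared directly.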
The three evaluation stories together point at a signal worth tracking: the field's tools for measuring whether AI systems work may be systematically underestimating failure in ambiguous, high-stakes domains. Legal and medical settings are the leading edge of that problem because they're where deployment pressure is highest and where the cost of missed failures is clearest.
On the efficiency side, 🔬Tensor Mixture (MixT) is the lead result in a cluster of four compression papers. The approach replaces standard dense linear layers in transformer models with mixtures of lower-dimensional tensor operators. A dense linear layer is like a full grid of connections between every input and every output. A tensor operator is more like a set of structured, overlapping patterns that together approximate that grid with far fewer numbers. MixT applies this replacement across the model and achieves 47.5% parameter reduction and 60.4% memory savings on LLaMA2-7B while preserving accuracy on MMLU. For practitioners running inference at scale or fine-tuning on constrained hardware, this is a deployment-facing result, not a benchmark curiosity.
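The grid-versus-patterns picture can be made concrete with a parameter count. The sketch below uses a sum of Kronecker products as one possible structured operator. That choice, the factor sizes, and the function names are assumptions for illustration; this is not MixT's actual construction and it does not reproduce the paper's 47.5% figure.

```python
def dense_params(d_in, d_out):
    """A dense layer stores one weight per input-output pair."""
    return d_in * d_out

def kron_mixture_params(d_in, d_out, in_factors, out_factors, n_terms):
    """Parameters for a sum of n_terms Kronecker products A ⊗ B, where A is
    (out_factors[0] x in_factors[0]) and B is (out_factors[1] x in_factors[1])."""
    assert in_factors[0] * in_factors[1] == d_in
    assert out_factors[0] * out_factors[1] == d_out
    per_term = out_factors[0] * in_factors[0] + out_factors[1] * in_factors[1]
    return n_terms * per_term

d = 4096  # illustrative layer width
dense = dense_params(d, d)
structured = kron_mixture_params(d, d, (64, 64), (64, 64), n_terms=8)
print(dense, structured, structured / dense)
```

Structure this aggressive would not preserve accuracy on its own, which is why the interesting part of the paper is where and how the replacement is applied, not the counting.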
Prompting also got a concrete improvement. 🔬Verification-First prompting asks a model to verify a candidate answer before it solves the problem. The mechanism: showing the model a proposed answer first constrains the space of reasoning paths it needs to explore, the way a completed edge of a puzzle constrains where the remaining pieces go. This reverse reasoning reaches 94.9% on GPQA-Diamond with Gemini-3-Pro, beating standard chain-of-thought at negligible additional cost. The practical value is that it's a prompt change, not a training change, which means it's testable today on existing deployments.
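Because this is only a prompt change, trying it amounts to swapping one template for another. The sketch below shows the shape of that swap. The exact wording of both templates and the source of the candidate answer are assumptions made here; check the paper for the phrasing the authors actually tested.

```python
def chain_of_thought_prompt(question):
    """Baseline: ask the model to reason forward to an answer."""
    return f"Question: {question}\nThink step by step, then give the final answer."

def verification_first_prompt(question, candidate):
    """Verification-first: show a candidate answer and ask for a check before solving.
    The candidate is supplied by the caller (an assumption of this sketch)."""
    return (
        f"Question: {question}\n"
        f"A proposed answer is: {candidate}\n"
        "First, verify whether the proposed answer is correct, showing your checks.\n"
        "Then solve the problem yourself and give the final answer."
    )

prompt = verification_first_prompt("What is 17 * 6?", "102")
print(prompt)
```

An A/B test on an existing deployment would send both templates over the same question set and compare accuracy and token cost.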
A smaller result that earns a mention: researchers built a knowledge graph from a single neuroscience textbook, fine-tuned a small language model on question-answer pairs derived from that graph with reinforcement learning feedback, and achieved expert-level performance on neuroscience reasoning while using orders of magnitude fewer parameters than frontier models. The 🔬paper releases code and curriculum. The implication is not that scale is irrelevant but that structured domain knowledge can substitute for it in bounded expert tasks. That's a replicable pattern for anyone building specialized tools with limited compute.
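The first step of that pipeline, turning graph edges into training questions, can be sketched in a few lines. The triple format, the relation name, and the question template below are hypothetical; the released code may derive its question-answer pairs quite differently, and the reinforcement learning stage is not shown.

```python
def triples_to_qa(triples, templates):
    """Turn (subject, relation, object) triples into question-answer pairs.
    Relations without a template are skipped."""
    qa = []
    for subject, relation, obj in triples:
        template = templates.get(relation)
        if template is None:
            continue
        qa.append({"question": template.format(subject=subject), "answer": obj})
    return qa

triples = [
    ("hippocampus", "involved_in", "memory consolidation"),
    ("hippocampus", "unmapped_relation", "ignored"),
]
templates = {"involved_in": "What process is the {subject} involved in?"}
pairs = triples_to_qa(triples, templates)
print(pairs)
```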
On the hardware edge, a researcher ran a 12.6-million-parameter generative image model on a microcontroller with 512KB of RAM. The ⚠️Reddit post links to a Zenodo preprint; the implementation streams weights from an SD card and runs pure C inference on a RISC-V chip, generating a 64x64 image in 26 seconds. It works. The more interesting finding is the tooling gap it exposes: RISC-V has nothing comparable to ARM's CMSIS-NN library for optimized neural network operations, which means every RISC-V deployment at this level currently requires building from scratch.
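The reason weights have to be streamed is simple arithmetic: 12.6 million int8 parameters take about 12.6 MB, roughly 24 times the 512KB of RAM. The sketch below shows the general idea of computing a layer while holding only a few rows of weights in memory. It is written in Python for readability, uses a matrix-vector product as a stand-in for the model's real layers, and is not the project's C implementation.

```python
import io
import struct

def streamed_matvec(weight_file, x, n_rows, rows_per_chunk):
    """Compute y = W @ x for an int8 row-major matrix W read from weight_file,
    holding at most rows_per_chunk rows of W in memory at a time."""
    n_cols = len(x)
    y = []
    rows_done = 0
    while rows_done < n_rows:
        rows = min(rows_per_chunk, n_rows - rows_done)
        chunk = weight_file.read(rows * n_cols)
        if len(chunk) != rows * n_cols:
            raise ValueError("weight file ended early")
        values = struct.unpack(f"{rows * n_cols}b", chunk)  # signed 8-bit ints
        for r in range(rows):
            row = values[r * n_cols:(r + 1) * n_cols]
            y.append(sum(w * v for w, v in zip(row, x)))
        rows_done += rows
    return y

# An in-memory buffer stands in for the SD card in this example.
weights = io.BytesIO(struct.pack("6b", 1, -2, 3, 4, 5, -6))
y = streamed_matvec(weights, [1, 2, 3], n_rows=2, rows_per_chunk=1)
print(y)
```

The trade is memory for time: every generated image rereads the weights from storage, which is consistent with a 26-second generation on this class of hardware.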
🔬 MedAgentAudit: Read for the ten-category failure taxonomy and the authority bias progression across rounds. The mechanism matters for anyone specifying multi-agent clinical pipelines.
🔬 Thai Bar Exam LLM Judge Study: Read for the rubric ambiguity finding — it reframes what "agreement with humans" actually measures in LLM evaluation.
🔬 JudgmentBench: Read for the 0.150 vs. 0.908 correlation gap between rubric and pairwise evaluation — a concrete number to cite when arguing for evaluation design choices.
🔬 Verification-First Prompting: Read for the mechanism and benchmark numbers before you try it — knowing why it works tells you where it won't.
🔬 Tensor Mixture (MixT): Read for the parameter and memory reduction numbers on LLaMA2-7B before your next inference infrastructure decision.
Links
- A Two-Phase Stability Study of LLM Judges and Bar Council Examiners on Thai Bar-Exam Free-Form Essays
arxiv.org
A two-phase study of LLM judges vs. human bar examiners on Thai legal essays reveals that LLMs systematically fail to reproduce minority human readings on ambiguous rubric cells, clustering instead toward majority interpretations across all model sizes and vendors. This exposes a hidden bias in LLM-as-judge benchmarks: optimizing for human panel agreement will inherit systematic convergence pathologies rather than balanced reasoning diversity, with major implications for legal AI evaluation and deployment.
- Auditing medical multi-agent AI reveals risks of false consensus
arxiv.org
Researchers introduce MedAgentAudit, a workflow audit framework that identifies ten categories of collaborative failure modes in medical multi-agent LLM systems—including unsupported observations (16.63% of cases), evidence-aversion in peer discussion (98.42% repeat initial views), and authority bias rising to 68.75% across rounds. This shifts medical AI evaluation from output accuracy to process-level safety, directly addressing deployment risks in high-stakes clinical settings where consensus appearance masks reasoning failures.
- Asking LLMs to Verify First is Almost Free Lunch
arxiv.org
Researchers propose Verification-First (VF) prompting, which asks LLMs to verify a candidate answer before solving, triggering reverse reasoning that prunes the output distribution and improves accuracy. The method achieves 94.9% on GPQA-Diamond with Gemini-3-Pro, outperforming standard CoT with negligible overhead—critical for practitioners seeking low-cost reasoning improvements at inference time.
- JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment
arxiv.org
JudgmentBench releases a 30-task legal benchmark with 1,539 rubric scores and 1,530 pairwise preferences from practicing attorneys, showing comparative judgments substantially outperform rubric-based evaluation (0.908 vs. 0.150 Spearman correlation) while requiring half the annotation time. This directly impacts how practitioners design evaluation pipelines for high-stakes domains and contributes methodological guidance for eliciting and aggregating expert supervision signals where ground truth is unavailable.
- Knowledge Graph-Driven Expert-Level Reasoning for Neuroscience
arxiv.org
Researchers built a knowledge graph from a single neuroscience textbook, fine-tuned a smaller LM on KG-derived QA pairs with RL feedback, and achieved expert-level reasoning surpassing larger LLMs with orders of magnitude fewer parameters. This signals a viable alternative to scale-dependent approaches for domain expertise and is directly reproducible via released code and curriculum.
- A general tensor-structured compression scheme for efficient large language models
arxiv.org
Researchers introduce Tensor Mixture (MixT), a general compression scheme replacing dense linear layers with mixtures of tensor operators, achieving 47.5% parameter reduction and 60.4% memory savings on LLaMA2-7B while preserving MMLU accuracy. Directly applicable across Transformer-based LLMs and solves a high-priority deployment constraint for practitioners scaling inference and adaptation.
- DCGAN inference on a microcontroller: 12.6M parameters, 512KB SRAM, 26-second generation, pure C [P]
reddit.com
Researcher successfully ran DCGAN inference (12.6M parameters, int8 quantization) on a CH32H417 RISC-V microcontroller with only 512KB SRAM, generating 64×64 images in 26 seconds using a custom C inference engine and SD-card weight streaming. This work opens a new frontier in on-device generative modeling for ultra-low-power systems and exposes a gap in RISC-V ML tooling compared to ARM's CMSIS ecosystem, with potential implications for edge AI deployment in resource-constrained environments.
- Multilingual Phonological Feature Recognition with Self-Supervised Speech Models
arxiv.org
Researchers released PhonoQ-2.0, a self-supervised speech model that directly predicts linguistic phonological features (manner, voicing, place) rather than phonemes, achieving 91.3% in-domain F1 across multiple languages and +6.7% improvement on unseen languages. This matters because phonological representations are more language-agnostic than phonemes, improving cross-lingual transfer and providing interpretable linguistic structure for downstream speech tasks.
- Benchmarking and Learning Real-World Customer Service Dialogue
arxiv.org
Researchers introduce OlaBench, a benchmark for real-world customer service dialogue spanning RAG, workflows, and agentic systems, and OlaMind, a reinforcement learning approach that distills expert patterns and outperforms GPT-5.2 and Gemini 3 Pro by 13+ points while delivering 23.67% higher issue resolution in production A/B tests. This bridges the offline-to-deployment gap in industrial dialogue systems, advancing reliability and human-like behavior in high-stakes customer-facing AI.
- GeoSVG-RL: Geometry-Aware Reinforcement Learning for Layout-Constrained Text-to-SVG Diagram Generation
arxiv.org
GeoSVG-RL introduces a reinforcement learning approach using geometric feedback (rendering validity, anchor placement, text containment) to improve LLM-generated SVG diagram quality, moving beyond token-likelihood optimization. This matters because structured diagram generation is a practical bottleneck for autonomous technical documentation and design systems; the method's focus on constraint-satisfaction through reward shaping suggests a replicable pattern for other code-generation tasks requiring structural guarantees.