Logical Thinking Performance Task

Omni Calculator Publishes ORCA V3 Research Report on AI Model Performance in Quantitative Reasoning

Omni Calculator announced the publication of the third iteration of its Omni Research on Calculation in AI (ORCA) Benchmark, an independent benchmarking initiative designed to evaluate the ...

Hosted on MSN

AI beats doctors in reasoning tasks, but trials urged

A Harvard-led study found a large language model outperformed physicians in diverse clinical reasoning tasks, including emergency department diagnoses. Researchers called the results a potential ...

How DeepSeek AI Uses 90% Fewer Tokens to Match Billion-Dollar Models

Explore how DeepSeek AI's new visual pointing method reduces computational costs by 90 percent while matching the performance ...

Geeky Gadgets

Deepseek-r1 vs OpenAI-o1 – AI Reasoning Performance Comparison

DeepSeek, a Chinese company, has introduced its DeepSeek R1 model, attracting attention for its potential to rival OpenAI’s latest offerings. Reportedly outperforming OpenAI’s o1 Preview in benchmarks ...

EurekAlert!

Large language models demonstrate strong performance in physicians’ clinical reasoning tasks

A cutting-edge large language model (LLM) outperformed human doctors in common clinical reasoning tasks including emergency room decisions, identifying likely diagnoses, and choosing next steps in ...

VentureBeat

New technique helps LLMs rein in CoT lengths, optimizing reasoning without exploding compute costs

Reasoning through chain-of-thought (CoT) ...

Euronews

AI models rival doctors on complex medical reasoning tasks, study finds

Researchers have found that an AI model outperformed human doctors on most medical reasoning tasks, from diagnoses to patient management advice. Artificial intelligence models outperformed physicians ...


Alibaba's Metis agent cuts redundant AI tool calls from 98% to 2% — and gets more accurate doing it

Alibaba's HDPO framework trains AI agents to skip unnecessary tool calls, cutting redundant invocations from 98% to 2% while boosting reasoning accuracy.

The Conversation

Popular AIs head‑to‑head: OpenAI beats DeepSeek on sentence‑level reasoning

ChatGPT and other AI chatbots based on large language models are known to occasionally make things up, including scientific and legal citations. It turns out that measuring how accurate an AI model’s ...
