As AI continues to revolutionise the way businesses operate, one challenge has risen to the forefront of enterprise adoption: trust. While large language models (LLMs) have made massive strides in language processing, automation, and decision-making, their performance remains inconsistent across tasks, a phenomenon Salesforce Research has termed "jagged intelligence."
In enterprise environments, this inconsistency and lack of predictability can limit AI’s value and create persistent barriers to adoption. To address this, Salesforce is leading the charge in setting reliable, business-grade performance benchmarks for AI agents. Their work is shaping how companies assess, deploy, and ultimately trust AI systems, and it's changing the enterprise AI landscape for the better.
What Is Jagged Intelligence, and Why Does It Matter?
First, let’s define jagged intelligence. “Jagged intelligence” refers to the erratic performance of AI models across different types of tasks. For example, a generative AI agent may write a compelling blog post but struggle with a simple arithmetic query or a basic database lookup: tasks that human professionals would consider trivial.
This jagged performance isn’t just a technical quirk. In enterprise settings, where AI agents are expected to handle repetitive workflows, client communication, reporting, and decision support, these inconsistencies can lead to mistakes, misinterpretations, or even (in worst-case scenarios) regulatory breaches. That’s why Salesforce Chief Scientist Silvio Savarese has championed the concept of Enterprise General Intelligence (EGI): a realistic alternative to the theoretical goal of Artificial General Intelligence (AGI). EGI prioritises consistent, dependable intelligence tailored to business environments.
Salesforce’s Blueprint for Smarter, More Reliable AI
To combat jagged intelligence, Salesforce Research has released a suite of benchmarks and models that aim to quantify AI reliability and improve performance in enterprise use cases.
1. SIMPLE Benchmark
The SIMPLE (Simple, Interpretable, Multi-task Performance Evaluation) benchmark was designed to expose inconsistencies in AI models. It includes 225 easy-to-understand reasoning tasks: the kind of questions humans solve without effort but that often trip up AI agents. This allows teams to measure how "jagged" a model’s intelligence really is, and where gaps might cause friction in daily business use.
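To make the idea concrete, here is a minimal, hypothetical sketch of how a team might probe for jaggedness on a handful of deliberately easy tasks. The task list, the `model_answer()` stub, and the scoring are illustrative placeholders, not Salesforce's SIMPLE benchmark code.

```python
# Minimal sketch of a SIMPLE-style consistency check (illustrative only).
from collections import defaultdict

# A handful of "trivially easy" tasks grouped by category, in the spirit
# of SIMPLE's 225 reasoning questions (these examples are made up).
TASKS = [
    {"category": "arithmetic", "prompt": "What is 17 + 26?", "answer": "43"},
    {"category": "arithmetic", "prompt": "What is 9 * 8?", "answer": "72"},
    {"category": "string reasoning", "prompt": "Spell 'benchmark' backwards.", "answer": "kramhcneb"},
    {"category": "string reasoning", "prompt": "How many letters are in 'trust'?", "answer": "5"},
]

def model_answer(prompt: str) -> str:
    """Placeholder for a call to whichever LLM you want to evaluate."""
    return "43" if "17 + 26" in prompt else "unsure"

def jaggedness_report(tasks):
    """Score accuracy per category; a wide spread between the best and
    worst category is one crude signal of 'jagged' performance."""
    per_category = defaultdict(list)
    for task in tasks:
        correct = model_answer(task["prompt"]).strip() == task["answer"]
        per_category[task["category"]].append(correct)
    scores = {cat: sum(results) / len(results) for cat, results in per_category.items()}
    spread = max(scores.values()) - min(scores.values())
    return scores, spread

if __name__ == "__main__":
    scores, spread = jaggedness_report(TASKS)
    print("Per-category accuracy:", scores)
    print("Best-to-worst spread (higher = more jagged):", spread)
```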
2. CRMArena: Business-Specific Testing
Salesforce also introduced CRMArena, a benchmark suite designed specifically for customer relationship management (CRM) scenarios. Unlike academic benchmarks that test theoretical performance, CRMArena evaluates how AI agents perform in practical, real-world business workflows, from managing client communications to automating report generation.
For companies considering deploying AI agents in sales, marketing, or support environments, CRMArena offers a reliable, transparent standard for evaluation.
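As a rough illustration of what "benchmarking against business workflows" can look like in practice, the sketch below shows a hypothetical CRM test case plus a tiny pass-rate harness. The case structure, field names, and checks are invented for this example and are not the actual CRMArena schema or code.

```python
# Hypothetical shape of a CRM-focused test suite (illustrative only).
from typing import Callable

# Each case pairs a realistic CRM instruction and context with a programmatic check.
CASES = [
    {
        "name": "escalation_routing",
        "instruction": "Which queue should this case be routed to?",
        "context": {"case_subject": "Refund not received after 30 days", "tier": "Gold"},
        "check": lambda answer: "billing" in answer.lower(),
    },
    {
        "name": "opportunity_summary",
        "instruction": "Summarise the account's open opportunities in one sentence.",
        "context": {"open_opportunities": 2, "total_value": "USD 120,000"},
        "check": lambda answer: "120,000" in answer,
    },
]

def run_suite(agent: Callable[[str, dict], str]) -> float:
    """Return the fraction of cases where the agent's answer passes its check."""
    passed = sum(case["check"](agent(case["instruction"], case["context"])) for case in CASES)
    return passed / len(CASES)

if __name__ == "__main__":
    # Stand-in agent that always routes to billing; replace with a real agent call.
    stub_agent = lambda instruction, context: "Route this to the Billing queue."
    print(f"Pass rate: {run_suite(stub_agent):.0%}")
```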
3. SFR-Guard and the Trust Layer
To further reinforce safety and consistency, Salesforce developed SFR-Guard, a family of models trained on both open-source and CRM-specific datasets. These models act as a Trust Layer that wraps around generative AI agents to ensure outputs align with enterprise values, guardrails, and regulatory standards.
In essence, this layer helps filter, constrain, or reshape AI responses to make them enterprise-safe. It’s a must-have for organisations operating in regulated industries or dealing with sensitive customer data.
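The wrapping pattern itself is straightforward to illustrate. Below is a simplified, hypothetical sketch of a guardrail wrapper in the spirit described above; the policy check and redaction rule are crude stand-ins, and this is not SFR-Guard's or the Einstein Trust Layer's actual interface.

```python
# Illustrative sketch of the "guardrail wrapper" pattern (not SFR-Guard's real API).
import re
from typing import Callable

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
BLOCKED_TOPICS = ("medical diagnosis", "legal advice")  # placeholder policy

def redact_pii(text: str) -> str:
    """Mask obvious PII (here, just email addresses) before it leaves the system."""
    return EMAIL_RE.sub("[REDACTED EMAIL]", text)

def violates_policy(text: str) -> bool:
    """Very rough stand-in for a trained guard model's policy check."""
    lowered = text.lower()
    return any(topic in lowered for topic in BLOCKED_TOPICS)

def guarded(generate: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap any text-generating function with filter / constrain / reshape steps."""
    def wrapper(prompt: str) -> str:
        if violates_policy(prompt):
            return "I can't help with that request. Please contact a qualified specialist."
        draft = generate(prompt)
        if violates_policy(draft):
            return "The generated response was withheld by policy."
        return redact_pii(draft)
    return wrapper

if __name__ == "__main__":
    # Usage: wrap a placeholder model and call it exactly like the original.
    fake_model = lambda p: f"Sure! Contact jane.doe@example.com about '{p}'."
    safe_model = guarded(fake_model)
    print(safe_model("Draft a follow-up note for the Acme renewal."))
```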
Why Trustworthy AI Matters More Than Ever
With AI tools like Einstein Copilot and Agentforce becoming more prevalent in Salesforce’s ecosystem and (by extension) across enterprises, trust is a necessity for success in AI deployment.
An AI that misinterprets a client’s request or incorrectly summarises a business report can’t be deployed at scale, and every such mistake erodes customer trust in the organisation. That’s why jagged intelligence is such a critical issue. It’s not enough for AI to be impressive in demonstrations; it must be reliable in real-world enterprise workflows.
Salesforce’s approach is setting the standard for what enterprise-ready AI should look like: consistent performance, transparent benchmarking, and built-in trust mechanisms that companies can rely on.
How CloudSmiths Can Help
At CloudSmiths, we help organisations make AI work for business — not the other way around. As trusted Salesforce partners, we:
- Assist in identifying the right AI tools from the Salesforce ecosystem
- Help implement and fine-tune models for your unique business context
- Provide training and change management to ensure successful adoption
- Integrate Salesforce’s Trust Layer into your existing workflows for safe deployment
Our focus is on operational AI that delivers value, not hype. Whether you're deploying Einstein Copilot or building your own AI-powered agents, CloudSmiths helps you move forward with confidence.
Final Thought
AI is no longer just about innovation; it's about dependability. Thanks to Salesforce Research and its work on jagged intelligence, enterprise leaders can finally evaluate and adopt AI with clearer benchmarks, better guardrails, and greater peace of mind.