What’s the difference between calibration and Inter-Rater Reliability? Part 2 of a 3-Part Series on IRR



In my 14 years in the call center industry, I have had many occasions to visit call centers in nearly every industry imaginable. I’ve come across different examples of calibration, each intended to reduce the risk that customer service interactions pose to the organization:

  • A group of Quality Assurance (QA) folks sitting in a room listening to calls and then discussing them,
  • A group of agents sharing their opinions with QAs on how they think their calls should be graded,
  • QAs and agents debating the attributes that separate a good call from an excellent call, from a mediocre or bad call,
  • A lead QA, manager, trainer, consultant or client instructing members of the QA team on how to evaluate calls,
  • A lead QA, manager or trainer playing examples of (pre-selected) good and bad calls.

While these may be common call center practices, they are far from best practices. In order to drive long-term improvement in the consistency and accuracy of your QA team, the outcome of any calibration process must be quantifiable, repeatable and actionable.

Inter-Rater Reliability (IRR) versus Calibration

Inter-rater reliability studies are more than structured or unstructured conversations. IRR studies demand a rigorous approach to quantitative measurement. IRR studies require that an adequate number of calls be monitored, given the size of the Quality Assurance team, variability in scoring, the complexity of calls, complexity of the monitoring form, etc. Inter-rater reliability testing also requires that call scoring be completed individually (in seclusion if possible). While discussion is key in reducing scoring variability within any Quality Assurance team, scoring and discussion of scoring variations must become separate activities which are conducted at different points in time.

Inter-Rater Reliability testing aims to answer two key questions:

1. “How consistent are we in scoring calls?” and,

2. “Are we evaluating calls in the right way?”

In other words, the goal of IRR is certainly to ensure that each member of the Quality Assurance staff grades calls consistently with his / her peers. However, a high degree of consistency among the members of the Quality Assurance staff does not necessarily ensure that calls are being scored correctly, in view of organizational goals and objectives. A further step is needed to ensure that call scoring is conducted with regard for brand image, organizational goals, corporate objectives, etc. This step requires that a member of the management team take part in each IRR study, acting as the standard of proper scoring for each call.
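The two-part check described above can be sketched in a few lines of code. All rater names and scores below are hypothetical, invented purely for illustration: peer consistency is measured as the score spread per call, and correctness is measured as each rater's average deviation from the management "standard" scorer.

```python
# Hypothetical call scores (0-100) from three QA raters, plus a manager
# acting as the standard of proper scoring for the same five calls.
scores = {
    "QA1": [85, 90, 70, 95, 60],
    "QA2": [80, 92, 72, 90, 65],
    "QA3": [88, 89, 68, 96, 58],
}
standard = [82, 91, 75, 93, 70]  # management's "correct" scores

# Question 1: "How consistent are we in scoring calls?"
# Spread (max - min) across raters for each call.
per_call_spread = [max(call) - min(call) for call in zip(*scores.values())]

# Question 2: "Are we evaluating calls in the right way?"
# Mean absolute deviation of each rater from the management standard.
deviation = {
    qa: sum(abs(s, ) if False else abs(s - t) for s, t in zip(vals, standard)) / len(vals)
    for qa, vals in scores.items()
}

print(per_call_spread)  # spread among peers, call by call
print(deviation)        # each rater versus the management standard
```

A large spread on a call signals a calibration discussion topic, while a rater with low peer spread but high deviation from the standard is consistent yet consistently wrong, which is exactly the case peer-only calibration misses.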

Efforts to attain high degrees of inter-rater reliability are necessary to ensure fairness to your agents whose calls are being evaluated. Your agents deserve to know, with a high level of confidence, that their monitored calls will be scored consistently, no matter which member of the Quality Assurance team scores them. And they need to know that those scores are fair. Without valid and reliable methods of evaluating rep performance, you risk making bad decisions because you are basing them on faulty data; you risk lowering the morale of your agents through your very efforts to improve it; and you open yourself to possible lawsuits for wrongful termination or discriminatory promotion and reward practices. You, too, need to know that your quality monitoring scores give reliable insight into the performance of your call center and into the performance of your agents on any individual call.

Sample Reports from Inter-Rater Reliability Study

Based on the figures above, it is very clear that the members of the QA team are relatively equal in scoring accuracy (defect rate) but that QA#1 struggles to accurately score in an area that is critical not only to the internal perception of agent performance but to the customer experience as well (auto-fail questions). QA#1 also tends to be the most consistent in his / her scoring relative to the remaining members of the team (correlation). From a call perspective, it is clear that calls 6 and 10 included scenarios or situations that were difficult for the QA team to accurately assess. Improving upon the current standing may mean redefining what qualifies as excellent, good or poor, adding exemptions or special circumstances to the scoring guidelines, or simply adhering better to the scoring guidelines that already exist.

Does your calibration process deliver results that are this quantifiable and specific?

A few tips from our Inter-Rater Reliability Standard Operating Procedure:

1. Include in your IRR studies any individual who may monitor and provide feedback to agents on calls, regardless of their title or department.

2. Each IRR should include scoring by an individual outside of the QA team who has responsibility for call quality as well as visibility to how the call center fits with larger organization objectives.

3. Make sure each IRR includes a sufficient sample size – 10 calls at minimum!

Republished with author's permission from original post.

Carmit DiAndrea
Carmit DiAndrea is the Vice President of Research and Client Services for Customer Relationship Metrics. Prior to joining Metrics, Carmit served as the Vice President of Behavior Analytics at TPG Telemanagement, a leading provider of quality management services for Fortune 500 companies. While at TPG she assisted clients in measuring behaviors, and provided management services to assist in effecting change based on newly created intelligence.


  1. There is a very fine line between objective and subjective scoring of an agent in a call center environment.

    It is a fact that the same call monitor has daily interactions with the agent being scored within the same work environment.

    Many of these interactions affect how the agent is being scored.
    That human contact between the agent and the Quality Team of the company DOES affect the final grade.

    It is essential that the QA team be held to the highest standards of accountability for objectivity, adhering to ethical and moral standards and complying with the rules of call monitoring.

    Key to this is remembering that when a company empowers an agent to represent it and to decide the outcome of a call, that same empowerment is what will ultimately be used to resolve the call.

    The line that defines an agent's authority to act on behalf of the company should never be re-negotiated or questioned unless a set of specific guidelines is in place for the agent to refer to.

    Not every individual customer is the same. No two calls are alike. Every situation is unique.

    The call center agent should not be the only one to be scored.
    A company, its products, its services, the clear and spelled-out processes of its call center infrastructure, and the knowledge, training, reliability, performance and objectivity of its QA team are as much a factor in the success of a call center.

