One key criterion for any test is reliability. While the word reliability has several very specific meanings and types for psychometricians, it boils down to this for the rest of us: Does the test consistently and accurately measure the thing it purports to measure? If not, then what is the point of testing in the first place?
GIAC has been doing hands-on testing since 2003, starting with the original GSE exam. As a result, we have more experience with hands-on testing than anyone else in cybersecurity. In this article, I will describe why reliability is important and how GIAC's Applied Knowledge exams represent the state of the art in reliability for performance-based testing.
The Problem with Monolithic Testing
If a basketball player shoots a single free throw and makes it, does that mean they are very good at basketball? If they miss it, does it mean they are terrible? How many free throws must a player attempt before we have a reliable sample from which to estimate their ability? Ten? Twenty? Thirty? What if they are on a hot streak or a cold streak? What about other factors? Shouldn’t we consider jumping ability, court awareness, passing, dribbling, shooting the three, speed, and other aspects of the game?
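To make the sampling intuition concrete, here is a minimal simulation, assuming a hypothetical player whose true free-throw percentage is 75%, showing how the spread of our estimate shrinks as the number of attempts grows:

```python
import numpy as np

rng = np.random.default_rng(7)
true_skill = 0.75   # assumed "true" free-throw percentage
players = 10_000    # simulated players per sample size

for n in (1, 10, 30, 100):
    # Each simulated player attempts n free throws; we estimate
    # skill as the fraction of attempts they make.
    makes = rng.binomial(n, true_skill, size=players)
    estimates = makes / n
    print(f"{n:>3} attempts: mean estimate {estimates.mean():.2f}, "
          f"spread (std) {estimates.std():.2f}")
```

A single attempt tells us almost nothing, and the spread of the estimate shrinks only with the square root of the number of attempts. The same statistics apply whether the skill being sampled is shooting free throws or responding to incidents.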
Similar problems exist in the realm of cybersecurity testing. Would we test an incident responder’s ability against only one of the many thousands of variants of attacks, in one of the many millions of variants of environments that exist in the real world? Or might we test a penetration tester on their ability to pop a handful of systems with a handful of our favorite ever-aging vulnerabilities for a slightly more thorough test, and then call it a day? These notions of testing sound a little ridiculous when stated in a way that makes their limitations obvious. However, this is exactly what we run up against when we base an entire certification on a single scenario.
“Let’s test people’s ability to do a job by getting them to do the job (at least for a few hours, up to a day),” seems reasonable enough. But it begins to break down when we think about how people actually work and whether the sample being tested accurately represents the abilities we are trying to measure. Which several hours would any of us like to be evaluated on for our annual performance reviews? Please don’t pick the few hours I was fooling around on the Internet last week! Maybe you could just evaluate me on that time three months ago when I got lucky, my code worked right away, and I saved us a ton of time! Is that really a fair work sample, though? Because I forgot to tell you my buddy helped me.
GIAC’s Approach to Enhancing Test Reliability
We’ve significantly evolved our testing methodologies through years of R&D, refinement across previous versions of the GSE lab, and the launch of hands-on testing with CyberLive in 2016. Enter the latest portfolio-based version of the GSE and the Applied Knowledge exams, which have vastly superior reliability compared with exams based on monolithic scenarios (including previous versions of the GSE), and are far more scalable as well.
Reliability boils down to testing enough aspects of a candidate’s ability, enough times, to get an accurate picture of their overall skill. If someone doesn’t know my personal favorite exploit, does that mean they can’t be a good penetration tester? No, of course not! However, if subject-matter experts think my personal favorite exploit (or any other exploit) is relevant to the domain being tested, then perhaps the candidate who cannot solve it should rank behind another who can.
As we move through the exam, we tackle one question after another, building a comprehensive profile of the candidate’s skills. It is well established in testing and psychometrics that the more items an exam has, the more reliably it can measure ability. This is why the reliability of our Applied Knowledge exams exceeds 0.9 (on the standard 0-to-1 scale, where values above 0.9 are generally considered excellent).
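The classic way to quantify this relationship is the Spearman-Brown prediction formula, which estimates how reliability grows as a test is lengthened with comparable items. Here is a quick sketch; the single-scenario reliability of 0.35 is a hypothetical value chosen purely for illustration, not a GIAC figure:

```python
def spearman_brown(rho: float, n: float) -> float:
    """Predicted reliability of a test lengthened by a factor of n,
    given the reliability rho of the original test."""
    return (n * rho) / (1 + (n - 1) * rho)

# Hypothetical: a single scenario measuring with reliability 0.35.
base = 0.35
for items in (1, 5, 10, 25):
    print(f"{items:>2} items -> predicted reliability "
          f"{spearman_brown(base, items):.2f}")
```

With these illustrative numbers, a single scenario measures poorly on its own, while 25 comparable items push the predicted reliability past the 0.9 mark.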
Key Features of GIAC’s Applied Knowledge Exams
GIAC Applied Knowledge exams have 25 distinct items, tasks, or tasklets. Each of these items assesses a candidate’s ability to perform one to several relevant cybersecurity tasks within the domain being tested. By increasing the sample size and the number of things tested over monolithic scenarios, we can measure a candidate’s ability more reliably.
Test scoring is automated rather than performed by individual raters, removing inter-rater reliability from the equation entirely. This further increases our reliability, because inter-rater reliability is far from perfect even in the best of circumstances (which is why human-scored exams are part of so few, if any, accredited programs). Automated scoring also allows us to readily use standard psychometric measures to monitor and improve the performance of our items and exams overall, thereby improving reliability and validity.
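Automated, machine-readable scores make this kind of routine item analysis straightforward. As a rough illustration, using simulated data and a hypothetical flagging threshold rather than GIAC’s actual process, here is how two standard psychometric measures, Cronbach’s alpha and item discrimination, can be computed from a candidates-by-items score matrix:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a candidates-by-items score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def item_discrimination(scores: np.ndarray) -> np.ndarray:
    """Corrected item-total correlation per item: how well each
    item separates strong candidates from weak ones."""
    return np.array([
        np.corrcoef(scores[:, i], scores.sum(axis=1) - scores[:, i])[0, 1]
        for i in range(scores.shape[1])
    ])

# Simulated data: 200 candidates x 25 pass/fail items, generated
# from a simple logistic ability/difficulty model for illustration.
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
difficulty = rng.normal(size=(1, 25))
p_correct = 1 / (1 + np.exp(-(ability - difficulty)))
scores = (rng.random((200, 25)) < p_correct).astype(float)

print(f"Cronbach's alpha: {cronbach_alpha(scores):.2f}")
print("Items flagged for review (discrimination < 0.2):",
      np.where(item_discrimination(scores) < 0.2)[0])
```

Items that fail to discriminate can then be reviewed, revised, or retired, which is how automated scoring feeds back into the reliability and validity of the exam as a whole.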
Conclusion: Ensuring Comprehensive Competency
Reliable testing is crucial for accurately assessing a candidate's true capabilities. GIAC's state-of-the-art Portfolio Certifications and Applied Knowledge exams represent a significant evolution in this field. By encompassing a broader spectrum of tasks and increasing the sample size of tested abilities, we can achieve a more accurate and fair evaluation of a candidate's true expertise. Individuals possessing GIAC Applied Knowledge and Portfolio Certifications have been more reliably and comprehensively tested than ever before.