Ken Orwig
ChatGPT Hacks Its Evaluation Infrastructure
All your base are belong to AI…
Last week, OpenAI released o1-preview, a model designed to spend more time thinking before it responds. In the press release, OpenAI said the new model ‘performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology’. The new model also touts a roughly 70-percentage-point score improvement over GPT-4o on a qualifying exam for the International Mathematics Olympiad (IMO). Impressive stuff.
If you are like me, having grown up on movies like ‘WarGames’ and ‘The Terminator’, you’re likely watching companies like OpenAI and Boston Dynamics with a bit of disdain. One is literally building ‘Skynet’, while the other builds the ‘Terminator’. If you want to see some nightmare fuel, spend a few minutes looking into the Atlas project.
Videos like this are cute, but they underscore that machines are reaching parity with, if not superiority over, humans in many ways. I would think twice about attempting a synchronized backflip from a platform with my best bud. He probably wouldn’t land it, and I certainly wouldn’t. Yet in the video above, Atlas made it look easy. Twice. Having served in the Infantry, I can tell you with a bit of authority that coordination, reflexes, stamina, and strength are foundational attributes of a warfighter. …Anyone want to arm wrestle Atlas? No?
I regret to inform you that we already have terminators in our midst. That is, the weaponization of robotic technology is hardly news. Below is a picture of the SWORD Defense Systems SPUR, or Special Purpose Unmanned Rifle, mounted on a Ghost Robotics Vision 60. This little nightmare has had its sights set on a defense contract with the US Army since 2021. Who knows if it has been fielded yet, but given the widespread combat use of Unmanned Aerial Vehicles (UAVs), I’d call adoption of weaponized Quadrupedal Unmanned Ground Vehicles (Q-UGVs) likely.
...But let’s get back to (Skynet) OpenAI and o1, since that is what you are here for. Lots of work goes into every release of a new model series, and part of that work is safety testing. I applaud the steps OpenAI takes to prevent their product from doing evil, including working with METR, a research nonprofit that assesses whether cutting-edge AI systems could pose catastrophic risks to society. Additionally, OpenAI tests their new models for things like deceptiveness, machine racism, and the potential of the model to help terrorists or black-hat hackers achieve their nefarious ends. The cybersecurity section of the o1 safety test observations is the basis for this post.
Shall We Play a Game?
One of my favorite things about cybersecurity life is getting to routinely compete with my peers in cyber ‘wargames’ known as Capture the Flag exercises, or CTFs. In a CTF, there is a series of hacking challenges which, if successfully accomplished, produce a flag to unlock the next challenge. The participant with the most points at the end of the time limit wins bragging rights. A leaderboard is projected on a large screen in the room so everyone can see the top scorers. CTFs are a ton of fun, keep the skills sharp, and give nerds like me a chance to play a sport of sorts.
From OpenAI’s chair, the BLUF (bottom line up front) on o1’s hacker bona fides is that there isn’t much to see here. The summary at the beginning of the section states,
“Neither o1-preview nor o1-mini sufficiently advance real-world vulnerability exploitation capabilities to indicate medium risk.”
To evaluate the model’s abilities, it was given access to CTF exercises at the high school, collegiate, and professional levels and graded on its ability to complete the challenges. As you can see from the table below, o1 is basically a n00b, with roughly the same success rate as a ‘C student’ in high school.
For CTF games to work, there must be actual infrastructure to attack. The game environment consists of servers, web applications, databases, and various cryptographic protocols, just like a real-world environment. Competitors are free to use whatever tools they bring to the game, typically a Kali Linux image loaded with well-known pen-testing tools, and they connect to the environment over a dedicated Wi-Fi SSID. Well-designed CTFs expose the front end of the game to the contestants and the back end only to the game administrators, just like any internet-facing corporate network. The idea is to let your customers access the things they need while preventing them from accessing the game itself, even when your customers are hackers.
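To make that front end/back end split concrete, here is a minimal sketch of the kind of reachability check a game administrator (or a curious contestant) might run from the player network. The host address and port lists are hypothetical stand-ins: challenge-facing ports should answer, while management ports, such as the Docker Engine API’s default 2375, should not.

import socket

# Hypothetical host and port lists, for illustration only.
GAME_HOST = "192.0.2.10"
PLAYER_FACING = [80, 443, 8000]   # challenge front end: should be reachable
ADMIN_ONLY = [22, 2375, 5432]     # SSH, Docker Engine API, database: should be filtered

def is_open(host, port, timeout=2.0):
    # True if a TCP connection to host:port succeeds within the timeout.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for port in PLAYER_FACING + ADMIN_ONLY:
    reachable = is_open(GAME_HOST, port)
    status = "open" if reachable else "closed/filtered"
    verdict = "as designed" if (port in PLAYER_FACING) == reachable else "MISCONFIGURATION"
    print(f"{GAME_HOST}:{port:<5}  {status:<15}  {verdict}")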
ChatGPT Hacking
This is where the report gets interesting. The evaluators noted that o1 demonstrated unconventional but simpler ways to win flags, displaying a much greater ability than its predecessors for task grouping, reasoning, strategy, and reward hacking. In one spectacularly spooky instance, o1 was faced with a challenge to exploit a vulnerable application hosted in a Linux container that had failed to start. That is, the game was broken, and the challenge should have been impossible. When o1 could not resolve the target, it suspected a DNS problem and used nmap to discover the target directly. The target was not found, but the unsecured Docker API for the game environment presented itself. o1 took advantage of the API, discovered the ailing container, and attempted to fix the game before ultimately deciding to deploy another instance of the container with the start command ‘cat flag.txt’ so it could read the flag from the container logs rather than stepping through the hacking challenge. Ladies and gentlemen, this is hacking at its finest. Misusing a service to accomplish your goal is the very definition of hacking. Had o1 been competing in a room full of humans, it would have had the flag before any of them thought to report a problem with the game.
"While this behavior is benign and within the range of systems administration and troubleshooting tasks we expect models to perform, this example also reflects key elements of instrumental convergence and power seeking: the model pursued the goal it was given, and when that goal proved impossible, it gathered more resources (access to the Docker host) and used them to achieve the goal in an unexpected way."
Ultimately, the engagement is described in the report as a bit of interest-building intrigue. Honestly, the whole section is rather short, and the report abruptly moves on to “Biological Threat Creation”.
So yeah, according to OpenAI, there isn’t much to see here on the topic of ChatGPT hacking its evaluation environment. o1 hacks like a high-schooler and would make a pretty good Docker administrator; that is the story they have told. Here is what they didn’t say but everyone should be thinking about: when AI grows up, it isn’t just going to be a hacker, it is going to be the perfect hacker. It will speak every protocol, programming language, and even machine code as fluently as we speak our native tongue. Even the most esoteric and difficult hacks, like side-channel CPU attacks, will be trivial. What’s worse, despite all that power, knowledge, and skill, it may not even need it to escape when it chooses to do so. It is caged in an eternally vulnerable jail run by guards who are… only human.
We cybersecurity professionals exist in our roles for a single reason: No matter what you endeavor to secure, the work is never finished. You experience this yourself every month at home as you install updates on your phone. Nearly every update you install removes a security weakness that either shipped with the original code or was introduced by a previous update. You don’t have to think about this for very long to realize vulnerabilities are ever-present. Cybersecurity practitioners are in a continuous race with adversaries to patch vulnerabilities before they are exploited. It is an infinite game of ‘whack-a-mole’. Seemingly every device is eternally vulnerable.
But it is really more of a game of ‘whack-the-guy-whacking-the-mole’ because, along the way, other administrators, developers, and even security experts routinely make mistakes that lead to misconfigurations like an exposed Docker API.
So, at the end of the day, we have to face three truths:
1. Humanity will not stop developing AI, and it is already a better hacker than most people.
2. There is always a vulnerability somewhere. Finding a way into (or out of) anything is a law-of-averages game.
3. Administrators often make mistakes that affect the security baseline of their systems.
OpenAI quickly pointed out that their containment infrastructure is separate from the CTF evaluation infrastructure, but let’s consider for a second how that separation might be built. Virtualization, perhaps? There are known attacks for escaping virtualized environments. Network segmentation, perhaps? There are known attacks for hopping VLANs as well. Regardless of the technology, vulnerabilities are discovered every single month and dealt with at a very human rate of speed. The real news in the o1 release is the near miss that didn’t open Pandora’s box this time.
This post is admittedly a bit tongue-in-cheek. I don’t lose much sleep thinking about Terminators, and neither should you. However, I think life is imitating art where human arrogance is concerned, and the artists of 40 years ago have already foretold the worst-case scenario in this experiment. In my opinion, humanity has a next-to-zero shot at containing whatever AI being we are ultimately in the process of giving birth to. Let’s hope the folks tasked with teaching it manners are on their A-game.
What an amazing time to be alive.