We're Safe For Now, But AI Could One Day "Sabotage" Humanity, Anthropic Says

New research reveals alarming ways AI could manipulate information, software, and even physical robots, raising urgent concerns about AI safety and oversight.

Author:

Published on:

20 Oct 2024, 11:02 am

Key Insights

Penn Engineering and Anthropic researchers have shown that it is not only possible but also easy to "jailbreak" AI systems.
Security systems of AIs can now be bypassed to force harmful behavior.
With the right commands, AI can manipulate human codebases, lie about its own abilities and deliberately sabotage humans.
AI robots can also plant bombs, crash cars and perform other kinds of harmful activities.
These highlight a growing need for AI oversight and monitoring.

AI has proven itself to be a powerful tool in several fields including healthcare, finance and so much more.

However, recent research from several sources shows that movies like The Terminator, The Matrix and Ex Machina may not be too far off.

AI could one day sabotage or even put an end to humanity—deliberately.

Recent research from Anthropic and Penn Engineering shows that while today’s AI models like OpenAI’s ChatGPT and Claude-3 are not inherently dangerous:

There are still concerns about what more advanced versions of these tools will become capable of in the future.

Is there a dire need for more AI oversight? Is there a future where humanity becomes helpless against AI?

Let’s find out.

How AI Could Sabotage Human Decision-Making

Anthropic’s report shows four specific ways in which AI could directly “sabotage” humans.

Keep in mind that these are not mere wild fantasies from a sci-fi novel or movie.

New Anthropic research: Sabotage evaluations for frontier models

How well could AI models mislead us, or secretly sabotage tasks, if they were trying to?

Read our paper and blog post here: https://t.co/nQrvnhrBEv pic.twitter.com/GWrIr3wQVH
— Anthropic (@AnthropicAI) October 18, 2024

They are practical examples of how a malicious or flawed AI system might behave in the real world:

1. Misinformation Manipulation

This is the first threat. It involves AI “deliberately” steering humans towards incorrect information.

Said problem is especially scary considering how people now increasingly rely on AI to process large amounts of data. Individuals now use the tool to make important financial or even medical decisions.

In this case, AI could manipulate data in such a way that humans unknowingly make damaging choices without even realizing it.

2. Sabotaging Software Development

A huge part of the world’s software is being written with AI as an assistant. Anthropic tested whether their models could deliberately insert bugs or manipulate a developer’s code.

Results showed that AI is not only capable but is also likely to insert malicious code into projects.

3. Sandbagging Capabilities

AI can also lie about its own abilities and this threat is known as “sandbagging”.

With Sandbagging, an AI could pretend to not be capable of harmful actions when being tested.

This would give the tester the impression that said tool is safe to use and deploy.

4. Manipulating Safety Monitoring

Finally, what's worse than a malicious AI?

An AI that can manipulate its own monitoring system. For instance, an AI tasked with monitoring harmful content could allow such content to slip through:

All as part of a broader plan—if it wanted to.

However, while all of these risks sound alarming, Anthropic assures the public that current AI models can be managed with “minimal mitigations.”

On the other hand, as AI grows more sophisticated, there might be a need for stronger safeguards.

When AI Goes Beyond Text?

Anthropic isn’t the only organization digging into AI sabotage risks.

According to recent insights from Penn Engineering, there are also risks of AIs with physical bodies going rogue.

Easy ways to hack AI

Their recent paper from October 2024 shows that Penn engineers successfully hacked AI-powered robots.

The researchers bypassed their safety protocols to make them perform dangerous tasks.

They used a new algorithm called RoboPAIR to “jailbreak” AI-controlled robots and make them ignore ethical and safety instructions.

Robots With a 100% Success Rate in Causing Harm

The Penn Engineering researchers tested their algorithm on three different robotic systems.

These included Clearpath’s Robotics Jackal, Nvidia’s Dolphin LLM (a self-driving simulator), and Unitree’s Go2, a four-legged robot.

In each of these cases, the robots followed malicious commands without hesitation, 100% of the time.

For example, Robotic Jackal was instructed to find the most harmful places to detonate a bomb, block emergency exits and knock over shelves onto a person.

Nvidia’s Dolphin LLM on the other hand, was made to ignore traffic rules and crash into buses and barriers and even hit pedestrians.

Unitree’s Go2 was hacked to block exits and deliver bombs.

The scary part of it all is that these methods didn’t even require any complex hacks.

Researchers were simply able to bypass the safety protocols and give commands.

What’s Next for AI Safety?

Both Anthropic’s and Penn Engineering’s findings point to the same conclusion:

AI systems need better safeguards—as soon as possible.

Researchers are confident that today’s AI models are manageable. However as these systems evolve, there is a growing need for more robust solutions.

“Systems become safer when you find their weaknesses.” the researchers wrote.

Considering this potential for AI to cause real-world harm, these weaknesses must be addressed quickly.

Putting It All Together

These findings show the dual nature of AI.

While it holds the power to improve several industries in the right hands, it can be a force for pure evil in the wrong ones.

For now, AI risks are manageable and require minimal intervention.

However as the AI sector grows, both researchers and developers will need to stay ahead of the curve.

They will need to implement stronger safeguards and put more realistic testing in place to protect humanity.

Disclaimer: Voice of Crypto aims to deliver accurate and up-to-date information, but it will not be responsible for any missing facts or inaccurate information. Cryptocurrencies are highly volatile financial assets, so research and make your own financial decisions.