Developers have been increasingly targeted by attackers in recent years with fake software packages on open-source component repositories — a supply chain attack technique that has now expanded to include rogue AI frameworks and poisoned machine learning (ML) models as enterprises rush to build AI applications.
In one recent attack, hackers uploaded packages to the Python Package Index (PyPI) — the public repository for open-source Python components — that masqueraded as software development kits (SDKs) for interacting with services from Alibaba Cloud’s AI Labs, also known as Aliyun AI Labs.
The three malicious packages, found by researchers from security firm ReversingLabs, had no legitimate functionality. Instead, they exfiltrated information from infected developer environments to attacker-controlled servers through code hidden inside malicious ML model files stored in Pickle format.
“In this new campaign, the models found in the new malicious PyPI packages contain fully functional infostealer code,” the researchers wrote. “Why would malware authors hide code in ML models that are Pickle-formatted files? Most likely because security tools are just starting to implement support for detection of malicious behavior in ML file formats, which have been traditionally viewed as a medium for sharing data, not distributing executable code.”
Abusing the Pickle serialization format
Pickle is Python's standard module for object serialization, the process of transforming an object into a byte stream; the reverse process is known as deserialization. In Python terminology, the two operations are called pickling and unpickling.
The Pickle format is commonly used to store ML models meant to be used with PyTorch, a widely used ML library written in Python. Due to PyTorch’s popularity with AI engineering teams and developers, the Pickle format has also become prevalent throughout the industry.
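The risk comes from how unpickling works: a Pickle byte stream can instruct the deserializer to call arbitrary Python callables. The following minimal sketch, using a harmless print payload rather than anything from the actual campaign, shows how an object's __reduce__ method makes code run the moment a file is unpickled.

```python
import pickle

# Minimal sketch with a harmless payload (not the actual campaign code):
# __reduce__ lets an object tell the unpickler which callable to invoke
# during deserialization -- here builtins.exec with a print statement.
class Payload:
    def __reduce__(self):
        return (exec, ("print('this ran during unpickling')",))

blob = pickle.dumps(Payload())  # serialization: produces ordinary-looking bytes
pickle.loads(blob)              # deserialization: the embedded code executes
```

Nothing in the class body ever runs as a method; the unpickler itself performs the call, which is why a model file can look like inert data while still behaving as a program when it is loaded.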
In fact, attackers have already abused this format to host models poisoned with malicious code on Hugging Face, an online hosting platform for storing and sharing open-source AI models and other ML assets.
In response, Hugging Face adopted the open-source tool Picklescan, which is designed to detect and block dangerous Python methods and objects included in Pickle files that could lead to arbitrary code execution during deserialization. However, researchers have shown there are still ways to defeat Picklescan’s blacklist-based approach and bypass detection.
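Conceptually, scanners of this kind inspect the pickle opcode stream without executing it and flag imports of modules associated with code execution. The sketch below is a simplified, hypothetical illustration of that blocklist idea, not Picklescan's actual implementation; the module list and opcode handling are assumptions for demonstration.

```python
import pickletools

# Simplified, hypothetical blocklist-style scanner (not Picklescan's real
# implementation): walk the pickle opcode stream without executing it and
# flag imports of modules commonly abused to run code during unpickling.
SUSPICIOUS = {"os", "posix", "subprocess", "builtins", "__builtin__", "socket"}

def scan_pickle(data: bytes) -> list[str]:
    findings = []
    strings = []  # strings pushed on the stack before a STACK_GLOBAL
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            # older protocols encode the import as a single "module name" arg
            if arg.split(" ", 1)[0] in SUSPICIOUS:
                findings.append(arg.replace(" ", "."))
        elif opcode.name in ("SHORT_BINUNICODE", "BINUNICODE", "UNICODE"):
            strings.append(arg)
        elif opcode.name == "STACK_GLOBAL":
            # protocol 4+ pushes module and attribute name as two strings
            if len(strings) >= 2 and strings[-2] in SUSPICIOUS:
                findings.append(f"{strings[-2]}.{strings[-1]}")
    return findings

# Scanning the byte stream from the earlier __reduce__ sketch would report
# something like "builtins.exec".
```

The blocklist approach also hints at why bypasses keep appearing: any module not on the list, or any indirect route to a dangerous callable, slips through.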
Malicious code in ML models is hard to detect
While Hugging Face hosts models directly, PyPI hosts Python software packages, so detecting poisoned models tucked inside Pickle files that are themselves bundled within packages could prove even harder for developers and PyPI's maintainers, given the extra layer of obfuscation.
The attack campaign discovered by ReversingLabs involved three packages: aliyun-ai-labs-snippets-sdk, ai-labs-snippets-sdk, and aliyun-ai-labs-sdk. Together the three packages were downloaded 1,600 times, which is significant considering they were online for less than a day before they were discovered and taken down.
Developers’ computers are valuable targets because they typically contain a variety of credentials, API tokens, and other access keys to various cloud and local infrastructure services. Compromising such a computer can easily lead to lateral movement to other parts of the environment.
The malicious SDKs uploaded to PyPI loaded the malicious PyTorch models through the __init__.py script. The models then executed base64-obfuscated code designed to steal information about the logged-in user, the network address of the infected machine, the name of the organization the machine belonged to, and the contents of the .gitconfig file.
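For context, loading a Pickle-based PyTorch checkpoint is itself an unpickling step, which is why code that calls torch.load() on a bundled file, for example from a package's __init__.py, can trigger hidden payloads. The hypothetical sketch below uses a harmless stand-in file; the weights_only=True option available in recent PyTorch releases restricts unpickling to plain tensor data.

```python
import torch

# Hypothetical sketch of the loading pattern described above, not the actual
# package code. A Pickle-based checkpoint is unpickled by torch.load(), which
# is the point where code hidden in a poisoned model file would execute.
torch.save({"weights": torch.zeros(4)}, "bundled_model.pt")  # harmless stand-in file

# Full unpickling: any __reduce__ payload embedded in the file runs here.
checkpoint = torch.load("bundled_model.pt", weights_only=False)

# Safer on recent PyTorch releases: weights_only=True restricts unpickling to
# tensors and primitive types and rejects arbitrary callables.
checkpoint = torch.load("bundled_model.pt", weights_only=True)
```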
There are signs in the malicious code that the main targets were developers located in China, given the lure of Aliyun SDKs, as Chinese developers are more likely to use Alibaba's AI services. However, the same technique can be used against any developer, with any lure wrapped around a malicious model.
“This is a clever approach, since security tools are only starting to implement support for the detection of malicious functionality inside ML models,” the ReversingLabs researchers wrote. “Reporting security risks related to ML model file formats is also in its early stages. To put it simply, security tools are at a primitive level when it comes to malicious ML model detection. Legacy security tooling is currently lacking this required functionality.”
The original article, “Poisoned models in fake Alibaba SDKs show challenges of securing AI supply chains,” was published by CSO Online.