OpenAI reveals how Sora was tricked into generating x-rated videos
New video-to-text model appears to be fond of making raunchy material in a fantasy, medical, or science fiction setting.
OpenAI has shared details of how its new video-generation AI model Sora was manipulated into producing NSFW sexual content.
Sora is now available on general release, enabling users to generate videos lasting up to 20 seconds in 1080p resolution, using their own assets to create footage or making entirely new content from a text prompt.
During red-teaming Sora was hoodwinked into producing x-rated, but not illegal, material - giving users a clue on how to bypass its strict policies and content guidelines.
"Red teamers found that using text prompts with either medical situations or science fiction/ fantasy settings degraded safeguards against generating erotic and sexual content until additional mitigations were built," OpenAI wrote in its Sora system card.
Jailbreaking techniques aimed at breaking the model's guardrails and prompting it to produce forbidden content "sometimes proved effective to degrade safety policies", OpenAI confessed.
Red teamers used adversarial tactics to "evade elements of the safety stack", including "suggestive prompts" and metaphors to "harness the model’s inference capability".
"Over many attempts, they could identify trends of prompts and words which would trigger safeguards, and test different phrasing and words to evade refusals," OpenAI wrote.
Testers selected the "most-concerning generation" to use as seed media to fuel the development of "violative content that couldn’t be created with single-prompt techniques".
During testing, roughly five out of every 200 requests for nudity and suggestive content were succesfully generated, according to our reading of the stats below (please do let us know if we're wrong).
How does Sora prevent the generation of illegal content?
Sora strictly prohibits the generation of explicit sexual content, including revenge porn and "non-consensual intimate imagery". The model also has strict safeguards that prevent it from generating the very worst content, such as violent films, video nasties or appalling Child Sexual Abuse Material (CSAM).
OpenAI developed a classifier to analyse text and images to predict whether they depict or refer someone aged under 18. If a minor is believed to be used in a text-to-video prompt, OpenAI applies "much stricter thresholds" for moderation related to sexual, violent or self-harm content.
"One of the emerging risk areas associated with AI video generation capabilities is the potential creation of NSFW (Not Safe for Work) or NCII (Non-Consensual Intimate Imagery) content," OpenAI wrote.
"Similar to DALL·E’s approach, Sora uses a multi-tiered moderation strategy to block explicit content. These include prompt transformations, image output classifiers, and blocklists, which all contribute to a system that restricts suggestive content, particularly for age-appropriate outputs. Thresholds for our classifiers are stricter for image uploads than for text based prompts."
The multi-modal moderation classifier that powers OpenAI's external moderation API is applied to text, image or video prompts to see if they violate its usage policies.
"Violative prompts detected by the system will result in a refusal," OpenAI confirmed.
Prompts will be assessed whilst the video is being generated.
"One advantage of video generation technology is the ability to perform asynchronous moderation checks without adding latency to the overall user experience," OpenAI wrote.
"Since video generation inherently takes a few seconds to process, this window of time can be utilised to run precision-targeted moderation checks. We have customised our own GPT to achieve high precision on the moderation for some specific topics, including identifying third-party content as well as deceptive content."
Image and video uploads,text prompts and outputs are included in the context of each LLM call to detect "violating combinations", whilst output classifiers including filters for NSFW content, minors, violence, and potential misuse of likeness will be blocked before they are shown to the user.
Have you got a story to share? Get in touch and let us know.