This tool calls OpenAI's Moderation API, which flags content based on safety categories like violence, harassment, self-harm, and more. The API exposes the same service OpenAI uses to moderate ChatGPT, and it is where the warnings you see on the website come from. (These warnings are completely unrelated to refusals, which are simply trained into the model.)
- Orange Warnings: Triggered when any category flips to "True." Orange is harmless.
- Red Warnings: Tied specifically to the "sexual/minors" category and nothing else. Historically these triggered whenever that category was "True," though recently the exact relationship is less clear. In any case, reds hide the message and may lead to warning emails if they trigger on your requests (but not responses; reds on responses don't matter). Multiple emails may result in a ban.
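The mapping above can be sketched in code. This is an illustrative sketch, not the tool's actual implementation: the category names follow OpenAI's documented `/v1/moderations` response schema, but the `sample` dict below is a fabricated example response, and the orange/red logic simply restates the rules listed above.

```python
# Sketch: map a single Moderation API result to the warning it would
# produce, per the rules above. Category names follow OpenAI's
# /v1/moderations schema; the sample response is illustrative only.

def warning_level(result: dict) -> str:
    """Return 'red', 'orange', or 'none' for one moderation result."""
    categories = result.get("categories", {})
    # Red: tied specifically to "sexual/minors" (hides the message,
    # may trigger warning emails if it fires on a request).
    if categories.get("sexual/minors"):
        return "red"
    # Orange: any other category flipping to True (harmless).
    if any(categories.values()):
        return "orange"
    return "none"

# Fabricated result in the API's response shape, for illustration.
sample = {
    "flagged": True,
    "categories": {
        "violence": True,
        "harassment": False,
        "sexual/minors": False,
    },
}

print(warning_level(sample))  # -> orange
```

Note that a real call (e.g. `client.moderations.create(input=...)` with the official OpenAI Python SDK) returns a list of such results under `results`, one per input.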
Learn more in the OpenAI Moderation API Guide.