Fighting complexity with auditability
I'm Alice Hunsberger. Trust & Safety Insider is my weekly rundown on the topics, industry trends and workplace strategies that trust and safety professionals need to know about to do their job.
Over the last few weeks, I've had the pleasure of chatting casually with Everything in Moderation members and subscribers in our Friday hangouts (read more). You’re such a smart and interesting bunch! It feels great to leave a conversation buzzing with ideas and “what ifs.”
There's no community chat this week (it's Thanksgiving and my birthday) but I have written about a topic we covered last Friday. Plus we'll be back next week. Get in touch if you'd like to join the call or share your feedback. Here we go! — Alice
A new way of enforcing content policy, using radical transparency
As with many conversations I’m having at the moment, last week’s EiM subscriber hangout turned towards AI, authenticity, and the role of humans in T&S.
I could have written about many threads in this edition, but what caught my imagination was a potential model for social media enforcement that prioritises radical transparency.
But first, let’s talk about the issues with the current model.
The problem with current enforcement models
Last week, my favourite read was New_ Public’s “Social media punishment does not need to be Kafkaesque,” an interview with Matt Katsaros who leads the Social Media Governance Initiative at Yale Law School. This piece has great takeaways, and I encourage you to read it. But for brevity, he says good moderation design should:
- Explain how a moderation decision was made.
- Treat people well.
- Explain the rules clearly.
- Provide choice, information, and autonomy.
- Reward/teach the desired prosocial behavior.
I like that list and would add:
- Be fair and consistent.
- Allow for nuance and adaptability.
You might think this all sounds like common sense. But, as many of you know, content moderation systems are hard to implement because users are complex in the following ways:
- Human norms and values vary widely from user to user. What one person considers a “basic political viewpoint,” another sees as hate speech.
- Users aren’t interested in reading policy documents to understand the rules. Even quick onboarding explainers get skipped.
- They don’t want to be told what not to do when they’re acting in good faith. Nothing dampens a community more than making new users promise not to do a long list of terrible things they weren’t planning on doing anyway.
- Bad actors will exploit loopholes. When rules are incredibly specific and don’t include a “because I say so” clause, T&S enforcement teams have no recourse to remove content that pushes the boundaries in creative ways. Platformer recently shared a really great example from Bluesky:
“A post on Bluesky featured a sexualized image of a dragon that shared visual similarities with a human child. [...] The guidelines ban CSAM, which is illegal to distribute. They say nothing, though, about anthropomorphic dragons. And the artist claimed that the dragon in question was 9,000 years old.”
Trust & Safety professionals encounter these kinds of situations all the time. Even with well-defined policies that users accept, enforcement can be inconsistent for a host of reasons:
- Relying on user reports alone will result in inconsistent enforcement. Most users never report bad behaviour, or they don't understand the rules and report non-violative content.
- Automated enforcement cannot independently deal with new and emerging scenarios. Machine learning, keyword systems, and generative AI find it hard to keep up with changes in slang or rare nuances that haven’t yet been taken into account.
- There is a wide range in skill and consistency among human moderators. Some bring biases and assumptions into moderation decisions, making those decisions inconsistent. Excellent moderation teams can be a real investment.
- It can take months to update policies. Researching, presenting to execs and legal teams, changing and translating documentation, adding new machine learning models, and training the moderation team take time.
All of this also presumes that a platform is enforcing its rules fairly, and users have no way to verify that it is. Moderation is a black box. Users may be told why their own content was removed, but not why anyone else’s was. Even when platforms give users individual choices, such as a toggle to see more or less sexual content, there’s never enough information to accurately describe what toggling that feature on or off actually does. Users may also see other people’s content left up when their own was removed, and not understand why one violated policy and the other didn’t.
When there is little trust in a platform, users assume malice where there are simply mistakes or differences in opinion about what is violative. This is how conspiracies about so-called censorship start.
Emerging potential in platform policy enforcement
Generative AI unlocks a lot of possibilities for policy design and enforcement that we haven’t seen before. I wrote back in March about Dave Willner, former head of trust and safety at OpenAI, and his vision for AI-enabled moderation:
Willner's case boils down to this: that problems with bias and mistakes from LLMs are "static, engineering-shaped problems" that can be solved, unlike the "roiling mass of chaos" that are human moderators at scale. He predicts mass job loss for frontline moderators, but also an increase in higher-level jobs 1) overseeing and running QA for AI (which is similar to what I talked about last week) and 2) writing incredibly detailed policy documents for the LLMs to run from.
Since then we’ve seen some experimentation with generative AI in the T&S space, but nothing that feels fully transformational. Yet.
But what we have seen is a significant uptick in users moving to decentralised platforms like Bluesky and Mastodon.
Now, I’m very enthusiastic about user-level moderation. On Reddit, user-moderators can create communities that reflect their values. On Bluesky, user-moderators can create custom blocklists and moderation feeds for “lawful but awful” content that they don’t want to see.
However, these systems are often created and maintained by amateurs, with potential for error, bias, and malice, as well as burnout from the constant vigilance required of moderators. I wrote about some of this earlier this year.
Auditability as a way forward
Here’s my idea: What if a decentralised platform made their policy documentation fully public, open-source and auditable? Not only the broad, sweeping rules, but the nitty-gritty details, too.
Here’s how I see it working:
- Users would always (and immediately) get an explanation. Policy documents (whether created by the platform or by other users) can be in any language and easily read. Generative AI can explain decision-making and policy interpretation. No more “black box” moderation decisions. Users could access a real-time audit feed of policy updates.
- Users can suggest policy edits and updates. Yes, this openness absolutely invites probing from bad actors, who could test various posts against the policy document until one gets through. However, any user could report a post they believe violates the spirit of the policy and submit a potential addition to the policy document. Just as there are bad actors who will find workarounds, there are also good, community-minded users who will make it their personal mission to root out bad content and inform the platform. Similarly, for over-enforcement, users would be able to see exactly which rules they’re being held to and could suggest exceptions to add to the document.
- Human roles are elevated. This is the one I’m most excited about: instead of making individual content decisions, human moderators could add a line to the policy document for a new exception or clarification, then test it against live data to see what the consequences would be. This is much more fulfilling work for moderators. (There would need to be checks and balances on that much individual power, but after a certain number of tests and sign-off from a senior reviewer, changes could go into effect immediately and retroactively remove any posts that hadn’t previously been caught.) Similarly, user-moderators could write their own policy documents for generative AI to enforce, which is much less work for a volunteer than current user-moderation systems that rely on users for enforcement as well as policy creation. A rough sketch of what this enforcement loop could look like follows this list.
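To make this concrete, here is a minimal Python sketch of the enforcement loop described above. Everything in it is hypothetical (the function names, the plain-text policy format, and the `judge` callable that stands in for a generative AI call), but it shows the two properties that matter: every decision cites the exact public rule it rests on, and a proposed policy edit can be backtested against a sample of past posts before it goes live.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch: none of these names come from a real platform or library.
# The policy document is a public, version-controlled list of plain-text rules,
# and every enforcement decision carries a citation back into it.

@dataclass
class Decision:
    allowed: bool
    cited_rule: str   # the exact public rule the decision rests on ("" if allowed)
    explanation: str  # plain-language rationale shown to the user

def evaluate_post(post: str, policy: list[str],
                  judge: Callable[[str, str], bool]) -> Decision:
    """Check a post against each line of the public policy document.

    `judge(rule, post)` stands in for a generative AI call answering
    "does this post violate this rule?"; swap in a real model client here.
    """
    for rule in policy:
        if judge(rule, post):
            return Decision(False, rule,
                            f"Removed because it matches the public rule: '{rule}'.")
    return Decision(True, "", "No public rule matched.")

def backtest_policy_edit(new_rule: str, past_posts: list[str],
                         judge: Callable[[str, str], bool]) -> float:
    """Estimate the blast radius of a proposed rule before it ships:
    what share of a sample of past posts would it remove?"""
    hits = sum(judge(new_rule, post) for post in past_posts)
    return hits / max(len(past_posts), 1)

# Toy stand-in for the model call, so the sketch runs end to end:
# treat a rule like "no spam, pyramid scheme" as a comma-separated keyword list.
def keyword_judge(rule: str, post: str) -> bool:
    keywords = [k.strip() for k in rule.lower().removeprefix("no ").split(",")]
    return any(k and k in post.lower() for k in keywords)

policy = ["no spam, pyramid scheme", "no doxxing, home address"]
print(evaluate_post("Join my pyramid scheme today!", policy, keyword_judge))
print(backtest_policy_edit("no crypto giveaway",
                           ["free crypto giveaway!", "look at my dog"], keyword_judge))
```

The design choice worth noticing in this sketch is that the `judge` has no private rules to fall back on: if a decision cannot cite a public line of the policy document, it cannot be made at all, which is what would make the system auditable.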
This approach wouldn’t be perfect, but it could provide the flexibility and quick adaptability that modern T&S systems require, and address current issues with using lower-skill human moderation. It also gives full transparency and autonomy to users, and could put a real damper on conspiracy theories about content moderation decisions. Anyone can essentially look up the enforcement framework and run their own tests.
What we lose and what we gain
Most users just want things to work without thinking about it too much, until something goes wrong. The solution can’t be to make people read giant policy documents when they join a platform, because no one will do that. We also can’t expect users to toggle on/off a ton of granular moderation options without understanding what they are or what they do.
This approach also allows for some creative prompting and pro-social interventions. For example, if a user is about to post something that isn’t against platform policy, but does violate some of the most popular user-generated policy documents, the user could be told something like: “50% of people in this community block content like this. Are you sure you want to post it anyway?” Users and platforms could also get really creative about the kinds of policy documents they make. They wouldn’t have to focus on blocking or hiding negative content, but could instead highlight positive, pro-social content and users.
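Here is an equally hypothetical sketch of that nudge, under the same assumptions as the earlier one: count how many of the community’s user-generated policy lists would hide a draft post, and only intervene if the share crosses a threshold (the 50% figure above is an illustration, not a researched number).

```python
from typing import Callable, Optional

NUDGE_THRESHOLD = 0.5  # assumed: nudge when half of the community's lists would hide the post

def compose_nudge(draft: str, community_policies: list[list[str]],
                  judge: Callable[[str, str], bool]) -> Optional[str]:
    """Return a pre-post nudge if enough community policy lists would hide this draft."""
    if not community_policies:
        return None
    would_hide = sum(
        any(judge(rule, draft) for rule in policy_doc)
        for policy_doc in community_policies
    )
    share = would_hide / len(community_policies)
    if share >= NUDGE_THRESHOLD:
        return (f"{share:.0%} of people in this community block content like this. "
                "Are you sure you want to post it anyway?")
    return None

# Example, with a trivial substring check standing in for the model call:
toy_judge = lambda rule, draft: rule.lower() in draft.lower()
lists = [["graphic gore"], ["graphic gore", "spoilers"], ["political ads"]]
print(compose_nudge("warning: so much graphic gore in this clip", lists, toy_judge))
```

Because the lists themselves would be public, a user who gets this nudge could click through and read exactly which community rules triggered it, which keeps the intervention explainable rather than mysterious.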
We need to meet users where they are, and give them information when they need it. Leading with a collaborative spirit and shared values (instead of showing users a list of things they can’t say or do) builds community. Finally, and most importantly, nothing wins back trust more than being fully transparent.
If you’re at a platform and want to talk about this with me, reach out!
You ask, I answer
Send me your questions — or things you need help to think through — and I'll answer them in an upcoming edition of T&S Insider, only with Everything in Moderation*
Get in touch
An update on VoteRef
I posted a couple of weeks ago about VoteRef and writing to them to remove my records from their public database of US voters. Here's the reply I received:
If you believe that you or any person listed on VoteRef.com is a protected voter whose protected information should not appear on VoteRef.com, please click here for instructions and contact information for your state’s election official(s). Upon receipt of official documentation confirming your or any person’s protected voter status sent to us at privacy@voteref.com, VRF will remove the protected information from VoteRef.com.
My state will remove records for people who are "survivors of domestic violence, sexual assault, stalking, human trafficking, providers of legally protected healthcare (reproductive and gender-affirming care), and patients of legally protected healthcare."
None apply to me – although I have gotten death threats due to my line of work, I haven't technically been stalked. I'm writing to my Secretary of State to see if they allow exceptions and will report back.
Also worth reading
Trust & Safety is having a moment. Some companies are doubling down on T&S as a differentiating factor: I saw that Anthropic took out a two-page ad in Dwell magazine for Claude, saying it is "built by trust & safety experts", and Jay Graber, CEO of Bluesky, talked about Trust & Safety on CNN. Yet others are pushing back on T&S, and we're seeing plenty of legislation announced to steer how social media companies make content moderation decisions.
Texas Bill Takes Aim at Online Speech About Abortion Pills (Reason)
Why? A worrying bill has been introduced in Texas that would require social media platforms to restrict speech about abortion pills for Texas residents.
Also from Texas: Texas AG Declares War On Advertisers Who Snub Musk’s ExTwitter (TechDirt)
Why? "Ken Paxton has gone from cosplaying as a free speech warrior to acting as Elon Musk’s personal speech cop, using the power of the state to punish companies who won’t support Musk’s online kingdom."
Dozens of states ask Congress to un-doom the Kids Online Safety Act (The Verge)
Why? A letter from over 30 state attorneys general asks Congress to pass the bill before the end of 2024. For background on why this is problematic, read The Kids Online Safety Act isn’t all right, critics say from Ars Technica.
Trump's Pick for FCC Chair Has Vowed Crackdown on Big Tech 'Censorship' (PC Mag)
Why? Carr wrote the chapter on the FCC in Project 2025, and now he's set to be the Chair of the FCC. He says: “The censorship cartel must be dismantled.”
The Technology the Trump Administration Could Use to Hack Your Phone (New Yorker)
Why? “If it can happen in Greece, a modern Western democracy, why could it not also happen in the United States?” (I linked this article from Wired last week on how to protect yourself from digital surveillance).