It's hard in T&S right now but it's not all bad
I'm Alice Hunsberger. Trust & Safety Insider is my weekly rundown on the topics, industry trends and workplace strategies that T&S professionals need to know about to do their job.
It's easy to get weighed down with what's going on in the wider world, so today I'm highlighting a bunch of interesting work happening in the T&S industry:
- Some cool projects using LLMs from Reddit and Hinge (plus a way you can test out a cutting-edge safety model for yourself).
- A look at the effectiveness of Community Notes vs. Fact Checkers (the results may surprise you).
- Neat projects from T&S practitioners over the last few months, including a way to find T&S jobs and a guide to creating your own digital public space.
Get in touch and let me know when you're doing cool stuff so I can share it in a future edition. Here we go! — Alice
Today’s edition is in partnership with Safer by Thorn, a purpose-built solution for detection of online sexual harms against children.
Thorn is committed to equipping trust and safety teams with tools and expert guidance to address online sexual harms against children. Our latest reference guide looks at the emerging trend of AI-generated child sexual abuse material (AIG-CSAM).
While the benefits of generative AI are being realised more every day, like any powerful tool it carries an equally high potential for misuse and harm, and that threat is already being exposed. Learn about the types of challenges related to AIG-CSAM and the recommendations to mitigate them.
Using LLMs for good
I'm bullish on the idea of using LLMs for online safety work. Although I've noted in the past that they can't be used for everything, I do think there are a lot of really cool possibilities.
Let's look at two uses of LLMs for promoting safe, authentic spaces online that I came across this week:
Reddit's LLM alignment for Safety
What they did: Fine-tuned open-source LLMs to enforce content moderation policies on Reddit, and compared them to an out-of-the-box model from a safety tech vendor.
Results: They found that their in-house fine-tuned model delivered an accuracy increase of over 11% compared to the vendor’s model (unfortunately, they don’t give us the absolute accuracy of that baseline). Notably, they said that:
“A crucial component in the success of this continued work has been the close collaboration between our policy, operational and machine learning teams.”
In my experience, this kind of collaboration is key — engineers must be willing to work with policy and operations teams to continually improve models. AI solutions aren’t “set it and forget it”.
Related study: A May 2024 paper by academics at the University of Illinois — LLM-Mod: Can Large Language Models Assist Content Moderation? — showed what happens when you use an LLM on Reddit posts without access to the policy and operations experts employed by Reddit. Their results weren’t so good:
“We find that while LLM-Mod has a good true-negative rate (92.3%), it has a bad true-positive rate (43.1%), performing poorly when flagging rule-violating posts. LLM-Mod is likely to flag keyword-matching-based rule violations, but cannot reason about posts with higher complexity.”
It’s hard to compare these two studies because we don’t know the absolute accuracy of the Reddit engineering team’s models, only the improvement over the vendor baseline. However, it’s no surprise to me that the better-performing model was the one calibrated with policy and operations experts.
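To make the general pattern concrete, here's a rough sketch of what "point an out-of-the-box LLM at a written policy" looks like in practice. Everything in it is illustrative: the policy text, labels and model choice are mine, not Reddit's or LLM-Mod's, and a production system would be fine-tuned and calibrated with policy and ops teams rather than naively prompted like this.

```python
# Illustrative sketch only: classify a post against a written policy with a
# general-purpose LLM via the OpenAI Python SDK (needs OPENAI_API_KEY set).
# The policy text, labels and model choice are assumptions, not any platform's real setup.
from openai import OpenAI

POLICY = (
    "No harassment: do not insult, threaten or demean another user.\n"
    "No spam: do not post repetitive promotional or off-topic content."
)

client = OpenAI()

def classify_post(post: str) -> str:
    """Return 'VIOLATING' or 'OK' for a single post, judged against POLICY."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "You are a content moderator. Apply this policy:\n"
                f"{POLICY}\n"
                "Reply with exactly one word: VIOLATING or OK."
            )},
            {"role": "user", "content": post},
        ],
    )
    return response.choices[0].message.content.strip()

print(classify_post("Nobody wants you here, just leave."))
```

The LLM-Mod result above is essentially a warning about stopping here: prompted models catch keyword-style violations but miss the context-heavy ones, which is exactly the gap Reddit's fine-tuning plus policy and ops calibration is trying to close.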
Try it yourself: CoPE
Dave Willner and Samidh Chakrabarti have released CoPE, a safety LLM (short for COntent Policy Evaluator), which they describe as “a significant advancement in content classification technology.” Their system card shows a good balance between precision and recall (for example, their hate speech classification has a precision of 89% and recall of 93%). I'm curious whether it performs better than the out-of-the-box vendor model Reddit benchmarked against.
You can play around with CoPE yourself, including entering your own policy-violating terms and editing their policies or adding your own. If you're at a platform, they're looking for folks who want to partner with them on a pilot.
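As a refresher on what those two numbers mean if you're piloting something like this on your own data: precision is the share of flagged posts that were actually violating, recall is the share of violating posts that got flagged. A tiny sketch with made-up counts (they just happen to land near CoPE's reported figures):

```python
# Tiny sketch: precision and recall from hand-labelled counts. Numbers are invented.
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    precision = true_pos / (true_pos + false_pos)  # of everything flagged, how much was truly violating
    recall = true_pos / (true_pos + false_neg)     # of everything truly violating, how much got flagged
    return precision, recall

p, r = precision_recall(true_pos=89, false_pos=11, false_neg=7)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.89, recall=0.93
```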
Hinge Launches Prompt Feedback to Help Daters Create Unique and Authentic Profiles
What they did: Used LLMs to provide three levels of feedback to daters on their profile prompt answers.
“In a survey with Hinge users, more than half (63%) of daters on the app expressed challenges in knowing what to include on their profile. As a result, daters often resort to adding generic, one-word, or even cliche answers, which can make it hard for others to get a sense of who they truly are.”
Results: No results have been announced yet, but I love the use of LLMs to nudge people to improve their own answers rather than fully writing them.
Try it yourself: If you're looking for feedback on your writing but aren't a dater, try using Lex for feedback on a draft or edits for brevity.
Battle of the interventions: Fact checking vs Community Notes
You almost certainly heard that Meta recently announced the end of its fact-checking program in the US (EiM #276) and plans to replace it with Community Notes.
What you might not know is that the Prosocial Design Network just published a review of public papers related to strategies used to reduce the spread and impact of misinformation. Helpfully, it ranked their potential effectiveness.
Interestingly, Community Notes was found to have more potential than fact checking. However, even then the overall reduction in reshares of misleading posts was modest, showing how difficult this problem is. Here's what the review said:
Community Notes Results: Convincing.
Once a Community Note was displayed on a post, retweets of that post dropped by 50-60%. However, given that it normally takes hours for a Community Note to be displayed, and on average 80% of retweets have already occurred by then, the researchers estimate that a Community Note overall only reduces reshares of a misleading post by about 10%.
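That ~10% figure is just the two numbers in the quote multiplied together: if roughly 80% of reshares happen before a note shows up, only the remaining ~20% can be affected, and cutting those by 50-60% removes about 10-12% of the total. A quick back-of-envelope check with illustrative round numbers:

```python
# Back-of-envelope check of the Community Notes estimate (illustrative numbers only).
total_reshares = 1000
share_before_note = 0.80      # ~80% of reshares happen before a note is displayed
cut_after_note = 0.55         # notes reduce subsequent reshares by ~50-60%

prevented = total_reshares * (1 - share_before_note) * cut_after_note
print(prevented / total_reshares)  # 0.11 -> roughly a 10% overall reduction
```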
Fact Checking Results: Tentative.
There is mixed evidence of the effectiveness of fact-checking labels on reducing the spread of misinformation. Labels may also have the undesirable indirect effect that, depending on the prevalence of fake news, they either increase the credibility of potentially false but unlabeled content (implied truth effect) or increase general skepticism toward all news.
Personally, I think a combination of interventions is the best course of action, as each intervention has significant pros and cons. Also featured in the report are accuracy prompts (convincing), literacy (likely), pre-bunking (tentative) and source credibility (tentative). Worth digging into.
Read more
- Why fact-checking may not have been working on Meta anyway (Better Conflict Bulletin)
- An interesting proposal to use LLMs to create AI-written "supernotes" based on human-written community notes (arXiv)
Alice Asks: are you thinking about DEI?
I'm thinking a lot about why Diversity, Equity and Inclusion (DEI) is important for T&S teams and the different ways it has impact.
If you have experience in building diverse teams and proving their ROI, drop me a line (or get in touch about anything else you're excited or worried about).
Also worth reading
I've been keeping a running list of cool projects from T&S practitioners over the last few months. Here are some of my favourites:
- Meta's former head of youth policy Vaishnavi J gives advice for platforms focused on youth safety.
- Ex-Twitter T&S head honcho Del Harvey has been expanding her T&S toolkit.
- Jeff Dunn (Hinge T&S) and Katie Dunn (who runs her own wellness consultancy) are matching job seekers with open roles through their Hidden Talent Program.
- New_Public released a guide called 'Creating a Flourishing Digital Public Space for Your Local Area' including a guide to moderation.
- The Oversight Board's Pearlé Nwaezeigwe wrote a book on personal branding called Embrace The Cringe.
Plus: some (old) resources from me that seem timely:
- If you work with people in LA, here are some tips for leading during disasters.
- If you work at Meta and are feeling conflicted, listen to this conversation on workplace ethics and activism with Berkman Klein Center affiliate Nadah Feteih.
- Job hunting in Q1? Here are all my career resources in one place.