Do messages criticizing Donald Trump’s performance on Covid-19 reduce his support among swing voters? What if the message comes from a conservative personality, like Tucker Carlson?
It was our job at ACRONYM to answer questions like these during the lead-up to the 2020 election. Doing so would allow us to spend more efficiently: to put more money behind the most effective ads, and to “first do no harm” by taking down ads that caused backlash and increased Trump’s approval. But understanding persuasion is much harder than understanding conventional digital advertising, which relies on engagement metrics such as clicks, likes, and video views.
In this post, we’ll talk about how we developed an in-house field experimentation platform to understand persuasive political ads, and then used the underlying data to develop near-real-time metrics of persuasive impact, long thought to be elusive in political advertising.
Ground Truth Field Experiments and Engagement
Our team started by building Barometer: a field experimentation platform to test the effectiveness of persuasive messages. We would first survey our audience to understand their baseline political preferences, then randomly assign each member to either be served an ad on Facebook (“treatment”) or not (“control”). Finally, we issued a follow-up survey to measure how the treated audience’s preferences changed after seeing the ad.
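The core of a Barometer-style test can be sketched in a few lines of Python. This is a simplified illustration, not ACRONYM’s actual code: the function names, the 50/50 split, and the use of a simple difference in mean survey scores are all assumptions.

```python
import random
import statistics

def assign_groups(user_ids, seed=0):
    """Randomly split a surveyed audience into treatment and control.

    Each member independently has a 50% chance of being served the ad.
    """
    rng = random.Random(seed)
    treatment, control = [], []
    for uid in user_ids:
        (treatment if rng.random() < 0.5 else control).append(uid)
    return treatment, control

def average_treatment_effect(treated_scores, control_scores):
    """Estimate the ad's effect as the difference in mean follow-up
    survey scores between the treated and control groups."""
    return statistics.mean(treated_scores) - statistics.mean(control_scores)
```

Because assignment is random, the control group's follow-up responses estimate what the treated group would have said had it never seen the ad, so the difference in means is an unbiased estimate of the ad's persuasive effect.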
Running randomized controlled field experiments like these is the most rigorous way to answer the questions we were interested in, but it is slow and expensive — it can take weeks to design and run a well-designed message test. A campaign’s worst-performing ads could run for days before an advertiser would be able to realize what was happening and take them down. Our team needed a way to get a faster, less expensive read on which messages were resonating with our audience, and which were causing backlash.
Tracking engagement metrics such as likes, shares, and click-through-rate (CTR) to get a quick read on how well an ad is doing is common practice in most digital advertising operations. While these metrics may align well with the goal of getting the word out about a product or brand, they are far less useful in the world of political advertising. In 2018, a study from Swayable examined the relationship between engagement and persuasive impact of political ads; the results were discouraging, suggesting little if any relationship. Using Barometer, we conducted 82 message tests, consisting of more than one million surveys, and we found the same thing: conventional ad engagement metrics were not effective guides when it came to deciding which ads to axe and which to keep live in our quest to persuade swing voters away from Donald Trump.
However, there was one kind of engagement that no one had looked at yet: Facebook’s emoji reactions — “haha”, “angry”, “sad”, “wow”, and “love”.
It might seem almost comical that those cartoon faces could say something about how we communicate attitudes about politics, but emojis have been used to train machine learning models that attain state-of-the-art performance detecting sentiment, emotion, and sarcasm. What’s more, in our past roles in tech we saw how, in aggregate, these signals often served as leading indicators of real-world political developments.
We turned to our body of 82 completed field and in-survey experiments, along with the thousands of emoji reactions attached to them, and set to work.
How we built DOROTHY
While we typically think of the “like” and “love” reactions as “good” and the “sad” and “angry” reactions as “bad,” political psychology research shows that advertising that induces fear or anxiety often has a persuasive effect. That work suggests that a greater share of “angry” and “sad” reactions might indicate that our ads were working as intended: decreasing Donald Trump’s approval.
We also took a close look at some of our worst and best ads. We found that the ads that backfired on us seemed to have an unusually high number of “haha” reactions. These ads weren’t supposed to be funny; the comments sections were filled with sarcastic and derisive remarks.
A correlational analysis confirmed what we suspected: ads that prompted a negative emotional response, i.e. sadness or frustration, tended to be persuasive, while ads that garnered many haha reactions caused backlash. Internally, we called this the “Haha Ratio.”
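The correlational check above can be sketched as follows. This is an illustrative reconstruction, not our production analysis: the function names and the use of a plain Pearson correlation between an ad’s haha share and its experimentally measured effect are assumptions.

```python
def haha_ratio(reactions):
    """Share of "haha" among all emoji reactions on an ad.

    `reactions` maps a reaction name (e.g. "haha", "sad") to its count.
    """
    total = sum(reactions.values())
    return reactions.get("haha", 0) / total if total else 0.0

def pearson_r(xs, ys):
    """Pearson correlation between an engagement signal (e.g. the haha
    ratio of each ad) and the treatment effects measured in experiments."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

A strongly negative correlation between haha ratio and measured persuasion across the 82 tests would be the quantitative version of “lots of haha reactions means the ad is backfiring.”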
With a few strong predictors (“haha”, “sad”) and a few weaker ones (“share”, “angry”), we built a machine learning model that would use these reactions to predict whether an ad or promoted news story was persuasive.
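A minimal version of such a model might map reaction counts to share-of-total features and fit a linear predictor of the experimentally measured effect. This sketch is an assumption about the approach, not DOROTHY itself; the feature set, the plain SGD least-squares fit, and all names are hypothetical.

```python
def reaction_features(reactions):
    """Turn raw reaction counts into share-of-total features."""
    total = sum(reactions.values()) or 1
    keys = ("haha", "sad", "angry", "share")
    return [reactions.get(k, 0) / total for k in keys]

def fit_linear(X, y, lr=0.1, epochs=3000):
    """Fit a linear model of persuasion score on reaction shares
    with per-sample gradient descent on squared error."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = b + sum(wj * xj for wj, xj in zip(w, xi)) - yi
            b -= lr * err
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
    return w, b

def predict(w, b, reactions):
    """Predicted persuasion score for a new ad's reaction mix."""
    x = reaction_features(reactions)
    return b + sum(wj * xj for wj, xj in zip(w, x))
```

Trained on labeled experiments, a model like this would learn a negative weight on the haha share and positive weights on sad and angry shares, matching the correlational findings above.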
We called it DOROTHY, after American journalist Dorothy Thompson. Just as Thompson exposed the rise of fascism in Germany in the 1930s, we hoped our model would help us identify the best journalism to convey Trump’s anti-democratic tendencies and incompetence.
But how good was DOROTHY? We used a common strategy for evaluating machine learning models called leave-one-out cross-validation. Cross-validation provides a sense of how well the model performs on data it hasn’t seen; in the leave-one-out variant, the model is repeatedly trained on all but one observation and tested on the held-out one. We calculated the cross-validated performance of both the engagement model and a baseline model based only on ad CTR, and found that DOROTHY performed 34% better.
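Leave-one-out cross-validation is simple enough to show directly. The sketch below is generic (the `fit` and `predict` callables and mean-squared-error scoring are assumptions, not details of our pipeline): each of the experiments takes a turn as the held-out test case.

```python
def loocv_mse(X, y, fit, predict):
    """Leave-one-out cross-validation.

    For each observation i, train on every observation except i,
    predict the held-out point, and accumulate its squared error.
    """
    errors = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = fit(train_X, train_y)
        errors.append((predict(model, X[i]) - y[i]) ** 2)
    return sum(errors) / len(errors)
```

Running this for both DOROTHY and a CTR-only baseline yields two comparable out-of-sample error estimates, which is how a relative improvement like “34% better” can be computed.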
While this may seem like a reasonable test of the model’s performance, cross-validation is not a panacea. How could we devise a test that would allow us to feel confident enough to use the model to make real-time decisions about how much money to invest in new ads or news content — in perhaps the most important presidential election in our lifetime?
We started by conducting additional exploratory analyses. For example, we examined the model’s predictions for Black Lives Matter ads across different demographic groups. What we found matched our intuition: Black Lives Matter content had large predicted persuasive effects on young people and almost no effect on older people.
But the most important test would be if we could use DOROTHY to identify persuasive news stories up front, and then run a field experiment to gauge their persuasive effects. In September, we did just that. Our DOROTHY models predicted that an article about the precarity of the Affordable Care Act (ACA), in particular the coverage for pre-existing conditions, would be persuasive in the wake of Ruth Bader Ginsburg’s passing. DOROTHY’s prediction and the treatment effect observed in our experiment were relatively similar, further increasing our confidence in the model.
At ACRONYM we’ve used DOROTHY for a wide range of applications, from sourcing persuasive content to an ad optimization framework that automatically reallocates budget based on its predictions.
We also wanted to share this tool with the wider progressive advertising community ahead of the election, so we built a web analytics dashboard that would take ad data and display DOROTHY predictions.
As of today, DOROTHY has been used to rank nearly 20,000 ads and articles across a dozen different organizations.
Next, to truly realize the tool’s potential, we went deep down the ad tech rabbit hole and built a system that automatically reallocated ads, audiences, and budget based on the performance of this metric. Given the variation we saw in our DOROTHY measure among different demographic groups (see above), we wanted to get the most persuasive content in front of the right audiences. We collected dozens of current articles grouped by event and topic, then used DOROTHY to select the most persuasive for each audience. Our system was able to adjust the ad spend up or down for each article-audience segment based on these predictions about persuasiveness once the article had received at least 25,000 impressions.
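The gating-and-reallocation logic described above can be sketched like this. The data shapes and function names are assumptions for illustration; only the 25,000-impression floor comes from the text.

```python
MIN_IMPRESSIONS = 25_000  # only act once a segment has enough data

def reallocate(segments, total_budget):
    """Shift budget toward article-audience segments with the highest
    predicted persuasion, once each clears the impression floor.

    `segments` maps a segment id to (impressions, predicted_effect).
    Segments predicted to backfire (negative effect) get no budget.
    """
    eligible = {
        seg: max(effect, 0.0)
        for seg, (impressions, effect) in segments.items()
        if impressions >= MIN_IMPRESSIONS
    }
    total_score = sum(eligible.values())
    if total_score == 0:
        # No clear winner yet: split evenly among eligible segments.
        n = len(eligible) or 1
        return {seg: total_budget / n for seg in eligible}
    return {seg: total_budget * s / total_score for seg, s in eligible.items()}
```

Proportional allocation is only one possible rule; a production system would also need pacing constraints and minimum spends, but the principle is the same: spend follows predicted persuasion, not raw engagement.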
DOROTHY isn’t a perfect model — we saw plenty of hiccups early on, and the system still makes some mistakes — but on average it works fairly well. Of course, there are still many questions left unanswered. Would this type of tool work beyond anti-Trump advertising? We used the same, obviously left-leaning Facebook page to distribute advertisements; how much does this source cue affect the reaction mix that we used to train our model? What about funny content that might get a lot of haha reactions — how might we account for that potential confound?
At the end of the day, through both DOROTHY and our larger Barometer measurement work, our team was proud of what we built this cycle – and how we were able to share learnings and resources with the progressive community. It is our hope that this style of innovation and experimental research will continue to form the foundation of progressive ad programs in years and election cycles to come.