Evidence in Plain Sight: Building a News Article Explorer


Here’s the paradox once again. Implementation science is very top-down. That is, ideas and best practices are pushed out into community-based settings, and the field studies what makes certain settings more effective at taking them up.

But but but.

So many ideas never make their way into academic literature because it is incredibly difficult to write and publish an article. For example, take this article that Dr. Victoria Scott and I worked on earlier this year. It probably took 20 hours to code, ~40 hours to draft the manuscript, then probably 8 hours to make the revisions once it came back from peer review. Oh, and then I paid $2,890 to get it published. For what? 1,500 views (as of this writing). You can see why many community-based settings that are doing incredibly innovative stuff don’t even bother. Frankly, I’m not sure why we did either…

So, if the real, ground-level stuff doesn’t make its way into the academic literature, where can we find it? One potential source is the good old-fashioned news. I had always wanted to find a way to incorporate what’s known as the “grey literature” into PubTrawlr. This is where our Twitter and Reddit research comes in, as well.

While working on a project over the last few days, I came across the development version of the quicknews R package by Jason Timm. This was much more approachable than the GDELT database, so I coded up a quick app. Now, quicknews only pulls 100 articles, but beggars can’t be choosers here.

To try this out, let’s stick with a topic that’s still pretty hot: vaccine hesitancy. What are news organizations saying about why people are or are not choosing to get the COVID vaccine?

What’s in the recent news?

First off, yes, vaccine hesitancy is in the news. People are still writing articles about it. This plot shows the number of articles per day over the past month (keeping in mind the 100-article cap on quicknews).
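The per-day count behind a plot like this is a simple tally. Here is a minimal sketch in Python (quicknews itself is an R package, so this only illustrates the idea and is not the app's code). The `articles` records and their `published` field are made up for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical article records; a real run would use whatever the news search returns.
articles = [
    {"title": "Mandates expand", "published": date(2021, 8, 27)},
    {"title": "Clinic sees uptick", "published": date(2021, 8, 27)},
    {"title": "Incentives debated", "published": date(2021, 8, 28)},
    {"title": "Rumors online", "published": date(2021, 8, 30)},
]

def articles_per_day(records):
    """Return (day, count) pairs sorted by day."""
    counts = Counter(record["published"] for record in records)
    return sorted(counts.items())

daily = articles_per_day(articles)
for day, n in daily:
    # A crude text bar chart: one '#' per article.
    print(day.isoformat(), "#" * n)
```

Days with no articles simply do not appear in the tally, so a real time-series plot would want to fill those gaps with zeros.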

Furthermore, these articles are distributed across different sites, so it’s not just one outlet dominating the narrative. News24 is a South African site.

Looking into the article content using the bag-of-words approach, we can see some interesting phrases appear that reflect the diversity of discourse on this issue. In the network plot specifically, we see concepts like herd immunity, cash incentive, conspiracy theory, social media, and housing insecurity. There are a lot of different concepts bundled up in vaccine hesitancy (which is generally the case with most public health and social issues). Click on the images to see a bigger version of them.
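To give a feel for where phrases like "herd immunity" and "social media" come from, here is a rough Python sketch of one common way to build such a network: drop stopwords, count adjacent word pairs, and keep the pairs that recur. The stopword list, the sample sentences, and the threshold of 2 are all assumptions for illustration, and the app may build its network differently.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real pipelines use a much longer one.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "for", "on", "with", "are"}

def tokenize(text):
    """Lowercase, split into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def bigram_counts(docs):
    """Count adjacent word pairs (after stopword removal) across all documents."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        counts.update(zip(tokens, tokens[1:]))
    return counts

docs = [
    "Herd immunity is the goal of the campaign.",
    "Officials say herd immunity depends on social media messaging.",
    "A cash incentive spread on social media.",
]

edges = bigram_counts(docs)
# Keep only pairs seen at least twice; these become the edges of the network plot.
strong_edges = {pair: n for pair, n in edges.items() if n >= 2}
print(strong_edges)
```

Because stopwords are removed before pairing, words that were separated only by a stopword count as neighbors. That is a deliberate shortcut here, not a requirement of the method.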

Topics in the Articles

The limitations of quicknews constrain how many topics we can look at. Since the total number of articles is relatively small, I fixed the number of topics at five. This means that the algorithm starts with five groups, and then works out the best way to sort the articles among them. This graph then shows the words that best distinguish these topics. There’s stuff about mandates, and possibly nursing homes, but the other topics aren’t clearly interpretable. So, some tweaking is still to be done.
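The post doesn't name the topic-modeling algorithm, so the following is only a sketch of one standard option: an LDA-style model fit by collapsed Gibbs sampling, with the number of topics fixed at five. The function names, hyperparameters (`alpha`, `beta`, `iters`), and toy documents are all assumptions made for illustration.

```python
import random

def fit_topics(docs, k=5, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Fit an LDA-style model by collapsed Gibbs sampling.

    docs is a list of token lists. Returns (doc_topic, topic_word, vocab):
    smoothed topic proportions per document, raw word counts per topic,
    and the vocabulary that indexes those counts.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)
    doc_topic_n = [[0] * k for _ in docs]       # words per topic, per document
    topic_word_n = [[0] * v for _ in range(k)]  # count of each word, per topic
    topic_n = [0] * k                           # total words per topic

    def bump(d, z, wi, delta):
        doc_topic_n[d][z] += delta
        topic_word_n[z][wi] += delta
        topic_n[z] += delta

    # Start from a random topic for every word occurrence.
    assign = [[rng.randrange(k) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, z in zip(doc, assign[d]):
            bump(d, z, index[w], +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi = index[w]
                bump(d, assign[d][i], wi, -1)   # remove the current assignment
                weights = [
                    (doc_topic_n[d][t] + alpha)
                    * (topic_word_n[t][wi] + beta) / (topic_n[t] + v * beta)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]
                assign[d][i] = z
                bump(d, z, wi, +1)              # record the resampled topic

    doc_topic = [
        [(n + alpha) / (len(doc) + k * alpha) for n in row]
        for row, doc in zip(doc_topic_n, docs)
    ]
    return doc_topic, topic_word_n, vocab

def top_words(topic_word, vocab, n=3):
    """List the n most frequent words in each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]

# Toy, hand-written "articles" already split into tokens.
docs = [
    "vaccine mandate employer mandate".split(),
    "nursing home staff vaccine".split(),
    "mandate employer court ruling".split(),
    "nursing home residents staff".split(),
    "cash incentive lottery vaccine".split(),
    "social media conspiracy theory".split(),
]

doc_topic, topic_word, vocab = fit_topics(docs, k=5, iters=50)
for t, words in enumerate(top_words(topic_word, vocab)):
    print("topic", t, words)
```

With only a handful of short documents the topics will be noisy, which mirrors the interpretability problem described above: five topics is a lot to ask of 100 articles.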

In other posts, I’ve been plotting correlation networks between the topics. I made some visual enhancements in this version: green lines are positive correlations, and red lines are negative ones. Like before, the thickness of the line corresponds to the strength of the relationship.
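The mapping from correlations to line color and thickness can be sketched as follows. This assumes each article has a proportion for each topic and that the network uses plain Pearson correlations between those columns. The toy proportions, the 0.2 cutoff, and the width scaling are invented for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def correlation_edges(doc_topic, threshold=0.2):
    """Turn topic-by-topic correlations into styled network edges."""
    k = len(doc_topic[0])
    columns = [[row[t] for row in doc_topic] for t in range(k)]
    edges = []
    for i in range(k):
        for j in range(i + 1, k):
            r = pearson(columns[i], columns[j])
            if abs(r) >= threshold:
                edges.append({
                    "from": i,
                    "to": j,
                    "color": "green" if r > 0 else "red",  # sign -> color
                    "width": round(abs(r) * 5, 2),         # strength -> thickness
                })
    return edges

# Toy topic proportions: one row per article, one column per topic.
toy_doc_topic = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.1, 0.7],
    [0.1, 0.2, 0.7],
]

topic_edges = correlation_edges(toy_doc_topic)
for edge in topic_edges:
    print(edge)
```

One caveat: because each article's proportions sum to one, topics are pushed toward negative correlations with each other, so red lines should be read with some caution.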

Next Steps

Right now, PubTrawlr is focused on health and social science. But really, you could pull news articles on any topic. For example, Kanye West’s new album dropped (maybe?) over the weekend. Here’s a gallery of the same visualizations for that search.

One idea that I have is to come up with a list of synonyms using BERT, GloVe, or some other vectorization process to pull articles that are conceptually similar to the search terms. This would yield a set of articles several times larger. More data may be better, but more testing is needed.
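As a sketch of what that expansion step might look like: given word vectors, rank the other words by cosine similarity to the search term and keep the closest few as extra queries. The three-dimensional vectors below are invented for the example. Real ones would come from a pretrained model such as GloVe or BERT and have hundreds of dimensions.

```python
from math import sqrt

# Toy vectors, made up for illustration only.
VECTORS = {
    "hesitancy": [0.9, 0.1, 0.0],
    "reluctance": [0.8, 0.2, 0.1],
    "skepticism": [0.7, 0.3, 0.0],
    "mandate": [0.3, 0.9, 0.1],
    "album": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def expand_term(term, vectors, top_n=2):
    """Return the top_n words whose vectors are closest to the search term."""
    target = vectors[term]
    scored = [
        (cosine(target, vec), word)
        for word, vec in vectors.items()
        if word != term
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]

expanded = expand_term("hesitancy", VECTORS)
print(expanded)  # ['reluctance', 'skepticism']
```

Each expanded term could then be run as its own search and the results de-duplicated, which is one way to get past a 100-article limit per query.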

I plan to deploy this app over the next few days, so stay tuned for when and where you’ll be able to access it.

Oh, and what else can you do?

Be sure to check out PubTrawlr for the latest in the scientific space. It’s open, free, and worth your precious time!

