Harnessing Google Health Trends Data for Epidemiologic Research

Krista Neumann; Susan M. Mason; Kriszta Farkas; N. Jeanie Santaularia; Jennifer Ahern; Corinne A. Riddell

Am J Epidemiol. 2023;192(3):430-437. 

Abstract and Introduction

Abstract

Interest in using internet search data, such as that from the Google Health Trends Application Programming Interface (GHT-API), to measure epidemiologically relevant exposures or health outcomes is growing due to their accessibility and timeliness. Researchers enter search term(s), geography, and time period, and the GHT-API returns a scaled probability of that search term, given all searches within the specified geographic-time period. In this study, we detailed a method for using these data to measure a construct of interest in 5 iterative steps: first, identify phrases the target population may use to search for the construct of interest; second, refine candidate search phrases with incognito Google searches to improve sensitivity and specificity; third, craft the GHT-API search term(s) by combining the refined phrases; fourth, test search volume and choose geographic and temporal scales; and fifth, retrieve and average multiple samples to stabilize estimates and address missingness. An optional sixth step involves accounting for changes in total search volume by normalizing. We present a case study examining weekly state-level child abuse searches in the United States during the coronavirus disease 2019 pandemic (January 2018 to August 2020) as an application of this method and describe limitations.

Introduction

There is growing interest in using internet search data to characterize epidemiologic patterns of exposure and disease because they are accessible, free, and available in near real time. The Google Health Trends (Google, Mountain View, California) Application Programming Interface (GHT-API) is one source of such data. After obtaining an API key,[1] researchers specify the search term(s), geographic region, and time period of interest, and the GHT-API returns an estimated scaled probability of the search term(s) given a random sample of all Google searches within the specified geographic-time period. Google searches, which can be accessed through either the GHT-API or a separate, publicly available Google Trends website (GT)[2] (or its associated GT API), have been used to measure variables that are difficult to capture via traditional data sources, such as abuse,[3–5] racism,[6] and public sentiment around drinking water contamination and birth control.[7,8] These data can also be used to examine trends when real-time data are beneficial, such as during influenza seasons.[9–11]

The GHT-API is distinct from GT[2] and less cited in academic research: PubMed returned only 5 results for "Google Health Trends" compared with 780 results for "Google Trends." While both draw on a random sample of all Google searches and allow comparisons of multiple search terms over geographic-temporal periods, their outputs differ.[12] GT rank-orders search volumes within the specified geographic-time period and returns a search volume indexed between 0 and 100, representing the relative popularity of the search term(s);[13] the GHT-API returns the probability of the specified search terms, based on a random sample of all Google searches in the specified geographic-time period, scaled by 10 million for readability (2020 Google Health Trends API Getting Started Guide (unpublished document provided with API key)). Note that Google does not disclose the total number of searches used to calculate this probability, and thus returned results can only be interpreted as relative volume with an unknown denominator. The GHT-API has an advantage over GT: because its values are not scaled to the highest result, search data extracted at different points in time can be compared directly. To compare trends over time using GT, the entire time period of interest must be extracted at once so that the scaling does not change. In contrast, GHT-API values remain comparable across time periods regardless of the interval over which they were extracted, which is useful when the time period of interest is expanded later.
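The difference between the two output scales can be illustrated with a minimal sketch. It makes no calls to either service: the weekly counts and the sample size are hypothetical (Google does not disclose the denominator), "scaled by 10 million" is assumed to mean multiplied by 10 million, and the GT-style index is simplified to a rescaling against the highest value in the extracted window. The function names are illustrative only.

```python
SCALE = 10_000_000  # assumption: "scaled by 10 million" means multiplied by 10**7


def ght_style_value(matching, sampled):
    """Share of sampled searches that match the term, times 10 million."""
    return matching * SCALE / sampled


def gt_style_index(values):
    """Rescale values to 0-100 relative to the highest value in the window."""
    peak = max(values)
    return [round(100 * v / peak) for v in values]


# Hypothetical counts of matching searches in three consecutive weeks,
# each drawn from a hypothetical sample of 10 million searches.
weekly_matches = [30, 45, 60]
sampled_per_week = 10_000_000

ght_values = [ght_style_value(m, sampled_per_week) for m in weekly_matches]

# Extracting weeks 1-2 alone versus weeks 1-3 together.
index_two_weeks = gt_style_index(ght_values[:2])
index_three_weeks = gt_style_index(ght_values)

print(ght_values)         # week 1 keeps the same value in any extraction
print(index_two_weeks)    # week 1 indexed against the week 2 peak
print(index_three_weeks)  # week 1 indexed against the week 3 peak
```

In this sketch the week 1 index changes once week 3 is added to the extraction window, whereas the scaled probability for week 1 is unaffected by which other weeks are requested.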

There is little formal guidance about how to craft a GHT-API search strategy to accurately measure a construct of interest.[12,14–17] Searches that are too broad can yield high search volumes but may capture many searches that are less relevant. Searches that are too narrow might be highly specific but result in missing values because the GHT-API suppresses data when the number of searches is below an undocumented threshold. Additionally, sampling variation needs to be considered, as the GHT-API estimates probabilities from a random sample that is updated once per day (2020 Google Health Trends API Getting Started Guide (unpublished document provided with API key)).
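One way to reduce the impact of sampling variation and suppressed values is to retrieve several samples of the same query and average them period by period. The following is a minimal sketch under stated assumptions, not the authors' implementation: each sample is represented as a list of scaled probabilities, a suppressed value is represented as `None`, and suppressed values are skipped rather than treated as zero. The function name and the example values are hypothetical.

```python
from statistics import mean


def average_samples(samples):
    """Average each period across samples, skipping suppressed (None) values.

    A period stays None only if it is suppressed in every sample.
    """
    averaged = []
    for period_values in zip(*samples):
        observed = [v for v in period_values if v is not None]
        averaged.append(mean(observed) if observed else None)
    return averaged


# Three hypothetical samples of the same query over four weeks.
samples = [
    [12.0, None, 8.0, None],
    [10.0, None, 9.0, None],
    [14.0, 5.0, None, None],
]

weekly_estimates = average_samples(samples)
print(weekly_estimates)
```

Skipping suppressed values is a design choice: it fills a gap whenever at least one sample returns a value, but the resulting average may overstate search volume in periods where most samples fell below the suppression threshold.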

The objective of this paper is to describe best practices when using the GHT-API to measure a construct of interest, using a motivating case study.
