Beyond demographics: How search engine data can enhance the understanding of determinants of suicide in India and inform prevention

Paper by Daniela Paolotti, Elad Yom-Tov, Natalia Adler, Ciro Cattuto, Kyriaki Kalimeri, Michele Tizzoni, Stefaan Verhulst, and Andrew Young in the Journal of Medical Internet Research: “India is home to 20% of the world’s suicide deaths. In India, and around the world, young people are especially at risk of suicide. While statistics regarding suicide in India are distressingly high, data and cultural issues likely contribute to widespread underreporting of the problem. Social stigma and the only recent decriminalization of suicide are but two factors hampering official agencies’ collection and reporting of suicide rates.

As the product of a data collaborative – the cross-sector exchange of data to create new public value – this paper leverages private-sector search engine data toward gaining a fuller, more accurate picture of the suicide issue among young people in India. By combining official statistics on suicide with data generated through search queries, this paper seeks to: 1) add a layer of information to more accurately represent the magnitude of the problem; 2) determine whether search query data can serve as an effective proxy for factors contributing to suicide that are not represented in traditional datasets; and 3) consider how data collaboratives built on search query data could inform future suicide prevention efforts in India and beyond.

We combined official statistics on demographic information with data generated through search queries from Bing to predict suicide rates per state in India as reported by the National Crime Records Bureau of India. We extracted English-language queries on five topics (“suicide”, “depression”, “hanging”, “pesticide”, “poison”). For each query, we recorded the time and date of the query, the state in India from which the user made the query, and the text of the query. We then collected state-level demographic data for India: Urbanization, Growth Rate, Sex Ratio, Internet Penetration, and Population. We modeled the suicide rate per state as a function of the queries on each of the five topics, treated as independent variables in a linear model. We also built a second model that adds the demographic information on Urbanization, Growth Rate, Sex Ratio, Internet Penetration, and Population as further independent variables in the linear model.
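The two-model setup described above can be illustrated with a small sketch. This is not the authors' code: it assumes ordinary least squares with an intercept, it uses randomly generated placeholder numbers rather than Bing or official data, and the helper names `solve` and `fit_ols` are hypothetical. It only shows how a query-only model and a query-plus-demographics model might be fitted and compared by R².

```python
import random

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, y):
    """Ordinary least squares with an intercept; returns (coefficients, R^2)."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    pred = [sum(b * v for b, v in zip(beta, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot

# SYNTHETIC placeholder data (assumption): 30 hypothetical states, values scaled to [0, 1].
random.seed(0)
n_states = 30
# Columns: query fractions for suicide, depression, hanging, pesticide, poison.
queries = [[random.random() for _ in range(5)] for _ in range(n_states)]
# Columns: urbanization, growth rate, sex ratio, internet penetration, population.
demographics = [[random.random() for _ in range(5)] for _ in range(n_states)]
# Made-up generating rule, chosen only so the example has some signal to fit.
rates = [10 + 8 * q[3] + 5 * q[4] + 6 * d[0] + random.gauss(0, 1)
         for q, d in zip(queries, demographics)]

# Model 1: query fractions only. Model 2: query fractions plus demographics.
_, r2_queries = fit_ols(queries, rates)
_, r2_both = fit_ols([q + d for q, d in zip(queries, demographics)], rates)
print(f"R^2 queries only: {r2_queries:.3f}")
print(f"R^2 queries + demographics: {r2_both:.3f}")
```

Because the second model contains every variable of the first, its in-sample R² can never be lower; a fair comparison of predictive value would need held-out data or an adjusted measure.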

Results of the first model fit (R²) when predicting the suicide rates from the fraction of queries in each of the 5 topics, as well as the fraction of all suicide methods, show a correlation of about 0.5. The correlation increases significantly with the removal of even 3 outliers, and improves slightly when 5 outliers are removed. In all cases, statistically significant correlation is reached, but the best correlation is obtained for suicide methods (hanging, pesticide, and poison), and only to a lesser extent for depression. Results for the second model fit using both query data and demographic data show that for all categories, if no outliers are removed, demographic data predict suicide rates better than query data. However, when 3 outliers are removed, query data about pesticides or poisons improves the model over using demographic data.
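The effect of removing a few outliers on the fit can be sketched as follows. The abstract does not say how outliers were identified, so this example assumes one simple rule: fit a line, drop the k points with the largest absolute residuals, and refit. The data are synthetic, with three outliers planted on purpose, and `fit_line` and `drop_largest_residuals` are hypothetical helper names.

```python
import random

def fit_line(x, y):
    """Least-squares line y = a + b*x; returns (a, b, R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1.0 - ss_res / ss_tot

def drop_largest_residuals(x, y, k):
    """Assumed rule: remove the k points farthest from the initial fitted line."""
    a, b, _ = fit_line(x, y)
    ranked = sorted(range(len(x)),
                    key=lambda i: abs(y[i] - (a + b * x[i])),
                    reverse=True)
    dropped = sorted(ranked[:k])
    kept = [i for i in range(len(x)) if i not in dropped]
    return [x[i] for i in kept], [y[i] for i in kept], dropped

# SYNTHETIC data (assumption): one query fraction per hypothetical state.
random.seed(1)
x = [i / 29 for i in range(30)]
y = [10 + 20 * xi + random.gauss(0, 1) for xi in x]
# Plant three outlier "states" so the effect of removal is visible.
for i, shift in ((3, 30.0), (14, -30.0), (25, 30.0)):
    y[i] += shift

_, _, r2_all = fit_line(x, y)
x_trim, y_trim, dropped = drop_largest_residuals(x, y, 3)
_, _, r2_trimmed = fit_line(x_trim, y_trim)
print(f"R^2 with all points: {r2_all:.3f}")
print(f"R^2 after dropping points {dropped}: {r2_trimmed:.3f}")
```

In this toy setting the dropped points are the planted ones and the fit improves sharply; with real data, the removed states would themselves merit inspection, since they are where the model and the reported rates disagree most.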

Conclusions: Internet search data has been shown in previous work to serve as a proxy for many health-related behaviors, enabling the measurement of rates of different conditions ranging from influenza to suicide. In this work, we used both search data and demographics to predict suicide rates. In this way, search data serves as a proxy for unmeasured (hidden) factors corresponding to suicide rates. Moreover, our procedure for outlier rejection serves to single out states where the suicide rates have substantially different correlations with both demographic factors and query rates…”