Machine Learning Human Rights and Wrongs
Department of Political Science, Michigan State University
TextLab uses digitally available, unstructured text along with computational tools to address core social science questions: which countries are violating which human rights over time; on what issues relations between China and the US are improving or growing more strained; where and when the next civil war is likely to break out; and how democracies keep national security secrets while still holding their leaders accountable. Dr. Colaresi's lab makes progress on these questions by deeply integrating substantive knowledge of global processes with research designs inspired by machine learning and with Bayesian tools for credibly updating our understanding from digitally available data.
Using Text as Data
Almost all social science research uses text in some form. Text is an incredibly dense and ubiquitous source of information on how people and groups relate to each other. Until relatively recently, however, almost all of this text was turned into measures through human reading: the only way to get numeric scores for analysis was to have a team of individuals read every line. This research design does not scale well to the avalanche of text that is now digitally available. Where graduate students used to code a printed book or newspaper edition, we now have access to the Google Books corpus, Wikipedia, or the whole Congressional Record over multiple centuries.
At the same time that social scientists were gaining access to terabytes of text from around the world, computer scientists were creating the computational tools to systematically analyze the aspects and sentiments expressed in that text, with breakthroughs in natural language processing and machine learning. "Today, even though we are social scientists by training, we have the digital access and computational tools necessary to use large-scale corpora to better understand society, politics and international relations," said Dr. Colaresi.
Exploring Human Rights Violations Over Time
Dr. Colaresi and his team are currently working with human rights reports of government and armed group repression, records of state interests in United Nations speeches, and flight manifests. Because of the scale of this unstructured information — often millions of words, with more arriving over time, scattered across hundreds of thousands of documents — the high performance computing resources at iCER have been crucial to completing this research. HPCC allows him to collect, clean, parse and analyze text in new and exciting ways.
One ongoing project Dr. Colaresi is particularly excited about is a collaboration with Baekkwan Park, a post-doc here at MSU, and Kevin Greene, a graduate student. They explore the evidence for claims that there are even more violations now than in past decades. One counterargument to data that show a decline in human rights over time is that the apparent decline simply reflects more information on violations and changes in measurement. They show that when you build a set of automated algorithms that map the words in the State Department Human Rights reports onto the human-labeled scores for each country in a given year, there is credible evidence of slippage over time. Different information in the text is relevant to coding human rights behaviors at different times, suggesting that there have indeed been changes in how we measure and record human rights protections in recent years.
The research design for this project involves re-analyzing millions of words in rolling windows using dozens of computationally intensive machine learning algorithms. Dr. Colaresi and his lab compute not just a single mathematical representation of how the natural language in the text led to good or poor human rights scores, but also how that representation changes over time.
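The rolling-window idea can be illustrated with a toy sketch. It is not the lab's actual pipeline: the corpus below is invented for illustration, the function names `word_weights` and `rolling_weights` are hypothetical, and a smoothed log-odds ratio stands in for the dozens of machine learning algorithms the project uses. The point is only to show how a word's association with poor scores can be re-estimated in each window and then compared across windows.

```python
import math
from collections import Counter


def word_weights(docs, alpha=1.0):
    """Smoothed log-odds that a word appears in poorly scored vs. well scored reports.

    docs: list of (text, label) pairs, where label 1 = poor score, 0 = good score.
    alpha: additive smoothing so unseen words do not produce log(0).
    """
    counts = {0: Counter(), 1: Counter()}
    for text, label in docs:
        counts[label].update(text.lower().split())
    vocab = set(counts[0]) | set(counts[1])
    totals = {c: sum(counts[c].values()) + alpha * len(vocab) for c in (0, 1)}
    return {
        w: math.log((counts[1][w] + alpha) / totals[1])
        - math.log((counts[0][w] + alpha) / totals[0])
        for w in vocab
    }


def rolling_weights(corpus, window):
    """Re-estimate word weights in each run of `window` consecutive observed years.

    corpus: list of (year, text, label) triples.
    Returns a dict mapping each window's first year to its word weights.
    """
    years = sorted({year for year, _, _ in corpus})
    result = {}
    for i in range(len(years) - window + 1):
        span = set(years[i:i + window])
        docs = [(text, label) for year, text, label in corpus if year in span]
        result[years[i]] = word_weights(docs)
    return result


# Invented toy reports (NOT real State Department text or real scores).
corpus = [
    (1990, "reports of killings and torture", 1),
    (1990, "elections were free and fair", 0),
    (1991, "killings continued", 1),
    (1991, "courts remained independent", 0),
    (2010, "extrajudicial killings by security forces", 1),
    (2010, "judicial killings followed fair trials", 0),
    (2011, "extrajudicial killings reported", 1),
    (2011, "free elections and independent courts", 0),
]

weights = rolling_weights(corpus, window=2)
for start in sorted(weights):
    print(start, round(weights[start].get("killings", 0.0), 3))
```

In this toy data the generic word "killings" is a weaker signal of a poor score in the later window, because it now appears in both classes, while the more specific "extrajudicial" carries the signal instead. A shift of that kind in the fitted representation is what a rolling-window design is meant to surface.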
In a follow-up to this project, they use the evolving organization of the human rights country reports (the outline of sections and sub-sections), along with the framework of aspect-based sentiment analysis from natural language processing, to identify exactly what information is present in recent reports that was absent in the past. To accomplish this, they move beyond lexical variation in words and use computational tools that automatically track the syntax of the sentences in the reports.
Dr. Colaresi's work shows, for the first time, that the information embedded in these human rights reports, written over four decades on countries around the world, has grown much more specific over time. While in the past the discussion was of general civil rights or killings, more recently these categories have been disaggregated into gender versus religious rights and extrajudicial versus judicial killings. These findings substantiate a growing perspective that satellite technology, cell phones with cameras, and the growing number of human rights non-governmental organizations around the world have changed the information environment in which we live and study international relations. Naive data collection efforts that fail to take this fact into account are likely to mischaracterize important trends over time.
From Word Order to World Order
In another project with Zuhaib Mahmood, a graduate student here, Dr. Colaresi's lab is using speeches in the United Nations Security Council and General Assembly to create better measures of state preferences and to identify the dimensions of disagreement between states over time. They are building models that automatically detect both the topic being discussed in a speech and the position that the speech represents. Given the difficulty of the inferential task and the size of the data, they use approximate Bayesian inference in this research. Dr. Colaresi's lab is able to estimate the increasing distance between the US and Russia since 9/11 as well as increasing variance on a new dimension of conflict between the US and China since the end of the Cold War. These results have strong implications for understanding the opportunities for coordination and the potential for conflict around the world.