How does WHIV differ from The World Handbook of Political and Social Indicators III?
What makes WHIV different from other contemporary event datasets?
How does WHIV compare against King & Lowe (2003a)?
How should I aggregate the events?
How was WHIV data produced?
WHIV builds directly on the tradition of political event analysis laid down by the World Handbook of Political and Social Indicators: Third Edition (Taylor and Jodice 1983) by providing global country-level counts of contentious politics events. It differs in three ways from its predecessor. First, WHIV was constructed by automated coding applied to Reuters international newswire, as opposed to human coding of The New York Times and the other international newspapers used by Taylor and Jodice (1983). Second, WHIV includes a revised set of event forms based on the Integrated Data for Event Analysis (IDEA) event framework that was developed by Virtual Research Associates (www.vranet.com) and members of this research team. Third, WHIV adds transnational civil contention, such as protests and violent attacks by non-state actors and by states where civilian targets may be across international borders.
WHIV time coverage begins on January 1, 1990. Because of the tremendous growth of Reuters newswire distribution during the 1980s, we concluded that 1990 was the earliest year for which comparable event data could be constructed from coding Reuters. Due to the profound differences between Reuters newswire and the multiple news sources used by Taylor and Jodice (1983), we concluded that it is not possible to splice WHIV with World Handbook III data.
The availability of event data has grown tremendously over the past decade (see Schrodt 2012 for a recent overview). What makes WHIV different is its global coverage of a broad set of contentious politics events. Most other event data projects focus on one or a small number of countries, address a smaller set of event forms, and/or cover shorter time periods. WHIV is derived from automated coding of the entirety of Reuters newswire stories. Automated coding reduces the cost of training coders and increases coding consistency, both of which matter for large event datasets. International newswires like Reuters have an advantage over traditional newspapers because they operate around the clock, seven days a week, and generate a much larger number of news stories, many of which journalists use to draft longer stories for traditional newspapers. Reuters newswire is the world’s largest English language news source and, from the viewpoint of coding, has the advantages of consistent grammar and an inverted pyramid writing style.
WHIV is extracted from the same larger IDEA dataset distributed by Gary King through the IQSS Dataverse Network to accompany his article with Will Lowe assessing automated coding reliability (King and Lowe 2003a).
We have done several things to make these estimates more reliable and better adapted to the study of civil contention. First, we removed all non-contentious and all state-to-state (or inter-state) interactions. Second, we removed erroneous computer coding present in the underlying IDEA dataset. One problem was the use of metaphorical language to describe sporting events, extreme weather, and business conflicts (e.g. company take-overs) in terms that sound like contentious politics. A second problem was a set of duplicate events that VRA used to map trans-border contentious events to both countries, which resulted in some events being counted twice, once in each country, when in fact there was one event. These two cleaning measures reduced the total count of contentious events in the dataset by about 14.4 percent. Third, we simplified the actor/target categories for physical objects and abstract concepts, such as “political ideals,” “ancient beliefs” and the like, so that they are more useful for event data researchers. While grammatically useful for the automated coding of events, these are not useful categories for contentious politics research and are potentially confusing to users. Fourth, we conducted numerous reliability checks on these data, both internally against a basecode and externally by comparing them with other event datasets. These comparisons will appear in future publications.
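The duplicate problem described above can be illustrated with a minimal sketch. It is not the cleaning procedure actually used for WHIV; the record layout (`date`, `actor`, `event_form`, `target`, `location`) and the toy records are hypothetical, and the rule simply assumes that a cross-border duplicate matches an earlier record on everything except the country it is located in.

```python
from collections import namedtuple

# Hypothetical record layout; the real WHIV/IDEA field names may differ.
Event = namedtuple("Event", "date actor event_form target location")


def drop_cross_border_duplicates(events):
    """Drop records that repeat an earlier event under a different location.

    Two records are treated as one event when date, actor, event form and
    target all match but the location differs. Repeats in the SAME location
    are kept, since they may be genuinely separate events.
    """
    first_location = {}  # event key -> location of the first record seen
    kept = []
    for ev in events:
        key = (ev.date, ev.actor, ev.event_form, ev.target)
        if key in first_location and first_location[key] != ev.location:
            continue  # same event already recorded under another country
        first_location.setdefault(key, ev.location)
        kept.append(ev)
    return kept


# Invented toy records, for illustration only.
events = [
    Event("day-1", "group A", "sit-in", "country Y", "country X"),
    Event("day-1", "group A", "sit-in", "country Y", "country Y"),  # duplicate
    Event("day-2", "group B", "demonstration", "government X", "country X"),
]

cleaned = drop_cross_border_duplicates(events)
percent_removed = 100 * (len(events) - len(cleaned)) / len(events)
```

A rule this simple would miss duplicates whose coded fields differ slightly, so it should be read only as a picture of the double-counting issue.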
The lion’s share of event data research has focused on annual aggregation of event counts. But the increasing availability of daily event data means that analysts can now make decisions about how they want to organize their data. Previous research shows that aggregation impacts model fit, masks variability, and influences our ability to draw inferences (Freeman 1989; Alt, King, and Signorino 2001; Shellman 2004). However, researchers must make trade-offs. Data aggregated annually aligns with many commonly available social, economic, and political variables, but can oversimplify models and create spurious inferences about causation (Freeman 1989). On the other hand, daily data makes statistical inference difficult by significantly increasing the number of zeros, leading some to use larger time units, such as weeks, months or quarters. These trade-offs show that researchers must be mindful of issues of aggregation. We echo the advice of Shellman (2004) and Freeman (1989) who state that (1) decisions about aggregation should be “theoretically driven” where the level of aggregation matches the expectations about the temporal dynamics of the contentious process, (2) researchers need to pay attention to the underlying structure of the data (i.e. aggregation should not mask peaks and valleys that are relevant to the question of study), and (3) researchers should conduct robustness checks across multiple aggregations in order to ensure proper results.
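To make the trade-off concrete, the following sketch aggregates the same set of event dates at several time units and reports the share of empty periods at each level. It is an illustration under stated assumptions, not part of the WHIV distribution: the function names (`period_key`, `aggregate`, `zero_share`) and the four event dates are invented, and events are assumed to fall inside the window being aggregated.

```python
from collections import Counter
from datetime import date, timedelta


def period_key(d, unit):
    """Label the day, ISO week, month or year that a date falls in."""
    if unit == "day":
        return d.isoformat()
    if unit == "week":
        iso = d.isocalendar()
        return f"{iso[0]}-W{iso[1]:02d}"
    if unit == "month":
        return f"{d.year}-{d.month:02d}"
    if unit == "year":
        return str(d.year)
    raise ValueError(f"unknown unit: {unit}")


def aggregate(event_dates, start, end, unit):
    """Count events per period, including periods with zero events."""
    counts = Counter()
    d = start
    while d <= end:
        counts[period_key(d, unit)] += 0  # register every period in the window
        d += timedelta(days=1)
    for d in event_dates:  # assumed to lie between start and end
        counts[period_key(d, unit)] += 1
    return dict(counts)


def zero_share(counts):
    """Fraction of periods that contain no events."""
    return sum(1 for v in counts.values() if v == 0) / len(counts)


# Invented event dates for one hypothetical country.
event_dates = [date(1990, 1, 3), date(1990, 1, 3),
               date(1990, 1, 20), date(1990, 3, 5)]
start, end = date(1990, 1, 1), date(1990, 3, 31)

daily = aggregate(event_dates, start, end, "day")
weekly = aggregate(event_dates, start, end, "week")
monthly = aggregate(event_dates, start, end, "month")
yearly = aggregate(event_dates, start, end, "year")

for name, counts in [("day", daily), ("week", weekly),
                     ("month", monthly), ("year", yearly)]:
    print(name, len(counts), round(zero_share(counts), 3))
```

Running the same model on several of these series is one simple way to carry out the robustness check recommended in point (3).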
WHIV data were produced with an automated parser, the VRA® Reader, a commercial natural language processing system owned by Virtual Research Associates. The input was Reuters international newswire. Reuters is preferred because it is the world’s largest English language news agency. It also has the advantages of a consistent lexicon, consistent grammar and an inverted pyramid writing style, which make it easier to identify key events in lead lines and to apply automated coding techniques. We code all events in the first two sentences of Reuters news stories (or the first 364 characters, whichever is less). The specific Reuters source input has varied over time. From January 1, 1990 to May 30, 2003, Reuters Business Briefings (RBB) were used. From June 22, 2003 to September 9, 2003, Factiva World News was the source, and from September 10, 2003 to December 31, 2004, Reuters World News, including all 8 product channels, was used.
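The coding window rule (first two sentences or first 364 characters, whichever is less) can be sketched as follows. This is only an approximation for illustration: the VRA® Reader's actual sentence segmentation is not documented here, so the regular-expression splitter and the function name `coding_window` are assumptions.

```python
import re

MAX_SENTENCES = 2  # number of lead sentences stated in the text
MAX_CHARS = 364    # character limit stated in the text


def coding_window(story, max_sentences=MAX_SENTENCES, max_chars=MAX_CHARS):
    """Return the part of a story that would fall inside the coding window.

    Sentences are split naively after '.', '!' or '?' followed by whitespace,
    which will mis-split abbreviations such as 'U.S. officials'.
    """
    sentences = re.split(r"(?<=[.!?])\s+", story.strip())
    lead = " ".join(sentences[:max_sentences])
    return lead[:max_chars]  # apply whichever limit is shorter


short_story = "Protesters blocked a road. Police intervened. Talks resumed later."
long_story = "A" * 500 + ". Second sentence. Third sentence."

short_window = coding_window(short_story)
long_window = coding_window(long_story)
```

For the short story the sentence limit binds and the third sentence is dropped; for the long story the character limit binds first.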
The major advantage of automated coding is the ability to generate large amounts of consistently coded event data, far more than traditional human coding can generate with comparable effort and consistency. Automated coding is also more transparent than traditional human coding; one can review the output and correct the computer code. This allows for continuous improvement and refinement with regard to coding accuracy, precision and reliability. Several studies have found that the accuracy of machine coding is comparable to that of trained undergraduate coders (King and Lowe 2003a). There are also limitations to automated coding. Developing new computer protocols and dictionaries is a continuous process that requires considerable time and expertise. Even in a standardized text like Reuters, there are irregularities in grammar and language use that create coding challenges.
The VRA® Reader uses frame parsing techniques, which rely on the grammatical structure of natural language sentences to identify the actor, event form and target of contentious events. Similar to sparse parsing (Schrodt and Gerner 2001/2012), this system identifies an actor (typically the leading noun), an event form (typically the leading verb), and a target (typically the direct or indirect object). For example, the following Reuters lead line is parsed into a protest obstruction or sit-in protest (event form) by Japanese dissidents (actor) against France (target):
“Scores of Japanese anti-nuclear campaigners staged a sit-in on Friday near Hiroshima's atom-bomb memorial to protest France's decision to resume testing nuclear weapons.”
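As an illustration of what a coded record looks like, the sketch below stores the actor, event form and target for this lead line in a small data structure. The values are entered by hand to mirror the description above; this is not output from the VRA® Reader, and the class name `CodedEvent` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedEvent:
    """One actor / event form / target triple (hypothetical layout)."""
    actor: str
    event_form: str
    target: str


lead = ("Scores of Japanese anti-nuclear campaigners staged a sit-in on Friday "
        "near Hiroshima's atom-bomb memorial to protest France's decision to "
        "resume testing nuclear weapons.")

# Hand-coded triple following the text; a real parser derives it from grammar.
coded = CodedEvent(actor="Japanese dissidents",
                   event_form="sit-in protest",
                   target="France")

# Each coded element can be traced back to wording in the lead line.
evidence = {
    "actor": "Japanese anti-nuclear campaigners" in lead,
    "event_form": "staged a sit-in" in lead,
    "target": "France" in lead,
}
```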
For further information about frame parsing and the VRA® Reader, please contact VRA.
Alt, James E., Gary King, and Curtis S. Signorino. 2001. “Aggregation among binary, count, and duration models: Estimating the same quantities from different levels of data.” Political Analysis 9(1):21–44.
Freeman, John R. 1989. “Systematic Sampling, Temporal Aggregation, and the Study of Political Relationships.” Political Analysis 1(1):61–98.
King, Gary and Will Lowe. 2003a. “An Automated Information Extraction Tool for International Conflict Data with Performance as Good as Human Coders.” International Organization 57:617–642.
King, Gary and Will Lowe. 2003b. “10 Million International Dyadic Events.” http://hdl.handle.net/1902.1/FYXLAWZRIA UNF:3:dSE0bsQK2o6xXlxeaDEhcg== IQSS Dataverse Network [Distributor] V3 [Version].
McClelland, Charles. 1978. “World Event/Interaction Survey (WEIS) Project, 1966-1978.” Third ICPSR Edition. Ann Arbor, MI: Inter-University Consortium for Political and Social Research.
Schrodt, Philip A. 2012. “Precedents, Progress, and Prospects in Political Event Data.” International Interactions 38(4):546–569.
Schrodt, Philip A. and Deborah Gerner. 2001/2012. Analyzing International Event Data: A Handbook of Computer-Based Techniques. State College, PA: Dept. of Political Science, Pennsylvania State University (eventdata.psu.edu/papers.dir/AIED.Preface.pdf)
Shellman, Stephen M. 2008. “Coding Disaggregated Intrastate Conflict: Machine Processing the Behavior of Substate Actors Over Time and Space.” Political Analysis 16(4):464–477.
Taylor, Charles Lewis and David A. Jodice. 1983. The World Handbook of Political and Social Indicators, III. New Haven, CT: Yale University Press.