Essay DOI: 10.21627/2019cd/data

A Note Regarding Datasets, Accuracy, and Errors


Due to the nature of the historical sources used for this study, and perhaps of humanistic sources more generally, the user may encounter spatial data that at first seems inaccurate or imprecise. In the vast majority of cases, however, this results not from inaccurate coordinate data but from the unavoidable unevenness of our dataset. For example, in certain instances we were able to locate a grave relocation only to the level of a relatively large, mid-level administrative unit, such as a county. In such cases, the XY coordinates for the relocation are pinned to what is known as the “centroid” of the administrative unit in question, a spatial/mathematical abstraction that corresponds to the “middle point” of the multisided polygon representing the administrative unit on the map. If a user zooms in closely on the map, below the level of the county, the point will therefore not correspond to any meaningful location or population center. Indeed, the “centroid” of a county might be in the middle of a body of water or a desert.
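To make the notion of a centroid concrete, the following is a minimal Python sketch of the standard area-weighted centroid of a polygon. It is offered as an illustration only: the source does not state how the platform's centroids were computed, the function name is our own, the U-shaped outline is invented, and the calculation assumes flat (planar) coordinates rather than latitude and longitude on a globe.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon.

    `vertices` is a list of (x, y) pairs in order around the boundary.
    Assumes planar coordinates; this is an illustrative sketch only.
    """
    area2 = 0.0  # twice the signed area of the polygon
    cx = 0.0
    cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to the first vertex
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3.0 * area2), cy / (3.0 * area2)


# An invented U-shaped "county" that wraps around a lake.
# The notch between x = 1 and x = 2 (above y = 1) is the lake.
u_shaped_county = [
    (0, 0), (3, 0), (3, 3), (2, 3),
    (2, 1), (1, 1), (1, 3), (0, 3),
]

print(polygon_centroid(u_shaped_county))
```

For this invented outline the centroid is roughly (1.5, 1.36), which falls inside the notch, that is, in the lake rather than on the county's own territory. This is the same effect a user may notice when zooming in on a centroid-pinned datapoint.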


In other cases, we can confidently pin grave relocations to lower-level administrative units, such as a particular township or village. Once again, however, the XY coordinates of this location are pinned not to a specific street corner or address but rather to the centroid of that township or village. So, again, users who zoom in closely, expecting to find the exact location of the relocation, will be misled.


Such cases, it is important to clarify, are not to be considered “errors” in the spatial data, but rather outcomes of the unevenness of the dataset, an unevenness that is the rule rather than the exception in humanistic and historical analyses.


There is a further issue, specific to the challenges and politics of doing spatial and cartographic research in the People's Republic of China (PRC). Namely, the PRC has long outlawed the publication of maps, base layers, and geographic information systems (GIS) that provide highly precise spatial data about the territory of the PRC. Indeed, as is well known, the PRC enforces an “offset” to ensure that any geospatial data about the PRC is rendered slightly inaccurately when projected on conventional mapping tools. For this reason, even the most fine-grained geospatial data in our dataset will, when a user zooms in very closely, appear to have drifted away from the population centers in question, sometimes appearing in the middle of a creek or on the outskirts of a town. These kinds of discrepancies are products of the unavoidable unevenness of humanistic and historical data and of the particular regulatory hurdles that govern cartography in the People's Republic of China.


This is not to say that the user might not find real errors, of course. Although the editor and the publisher have gone to considerable lengths to check and recheck the spatial data, the possibility remains of incorrect datapoints having made their way into the final work. If you suspect that you have found an incorrect datapoint, you are encouraged to contact Stanford University Press to report the error. When doing so, please provide as much description as possible of the datapoint in question, so that the datapoint can be checked and, if erroneous, corrected.


A Note on Quantitative Data


A further point pertains to quantitative data on the platform, specifically datapoints about the number of graves relocated in any particular locale. First and most obviously, the user will notice that only a subset of the datapoints in the platform carry quantitative values. The majority of grave relocation datapoints are assigned an “unknown” value: while we know that a relocation took place in a particular place at a particular time, we do not know how many bodies were exhumed and moved. In other instances, we are in possession of all three kinds of data: time, location, and scale. Such datapoints appear as larger circles, with the diameter of each circle corresponding to the number of bodies relocated.


The reason for this diversity is simple: the datasets upon which the platform is based are themselves diverse, comprising many different types of primary source material. Some of these sources make it possible to identify and confirm the number of bodies moved, while others do not. Given the historical value of both kinds of data, the editor and platform designers decided to preserve this diversity.


A second point about quantitative data pertains to attribution. As outlined in the introductory essay by Mullaney, the scale of contemporary China's grave relocation initiatives is immense, involving the exhumation of more than ten million bodies. The most dramatic case, examined in the essay by Mullaney, is that of Zhoukou, in Henan Province. Here, approximately 2.5 million bodies were exhumed and reburied.


Were one to isolate all of the datapoints specific to Zhoukou, however (that is to say, all datapoints in which grave relocations took place in one or another sub-administrative territory of Zhoukou), the sum of these relocation events would not equal 2.5 million. Why is this so?


The reason for this seeming discrepancy between quantitative values in the dataset and certain statements made in the essay narratives derives from the threshold below which we felt it inappropriate to assign quantitative values to a datapoint. Although multiple records confirm the aggregate scale of the Zhoukou relocation, we decided that it would have been both useless to the user and analytically irresponsible to create a single, immense “Zhoukou” datapoint and then to assign it the value of “2.5 million.” Instead, since our goal has been to exploit the dynamic functionality and granular possibilities of the platform to the greatest extent possible, the objective was to zoom in as closely as we could, to identify as many specific instances of relocation as possible in the many sub-administrative units of Zhoukou (i.e., villages, townships, etc.), and, where possible, to supplement these datapoints with temporal and scalar data. For this reason, one will not find a single, immense circle labeled “Zhoukou,” but rather a set of datapoints that collectively begins to fill out the picture of the Zhoukou relocation. Some of these datapoints have quantitative values assigned to them, while others are categorized as “unknown.” With that in mind, the reason the overall numbers for Zhoukou (as just one example) do not “add up” to the value of 2.5 million is, quite simply, that we wanted to be as faithful as possible to the gaps in the available primary source material.
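The arithmetic behind this point can be sketched in a few lines of Python. The records below are entirely invented (the place names, counts, and the `graves` field name are assumptions for illustration, not values from the dataset); the sketch only shows why summing datapoints that include “unknown” values yields a partial total rather than the aggregate figure.

```python
# Hypothetical records: names and counts are invented for illustration
# and are not taken from the actual dataset.
records = [
    {"place": "Township A", "graves": 1200},
    {"place": "Village B", "graves": None},  # relocation known, scale unknown
    {"place": "Village C", "graves": 350},
    {"place": "Township D", "graves": None},  # relocation known, scale unknown
]


def summarize(rows):
    """Return (sum of known counts, number of datapoints of unknown scale)."""
    known = [r["graves"] for r in rows if r["graves"] is not None]
    unknown = sum(1 for r in rows if r["graves"] is None)
    return sum(known), unknown


known_total, unknown_count = summarize(records)
print(f"Known total: {known_total}; datapoints of unknown scale: {unknown_count}")
```

The known total is best read as a lower bound on the documented relocations: every datapoint of unknown scale represents bodies that were moved but could not be counted from the sources.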


A Note on Data Structures


A final point pertains to data structures and to the challenges of building ones that are compatible with heterogeneous datasets while also being versatile.


For the three essays in this volume, the datasets upon which each is based vary a great deal from one another in terms of structural requirements.


For example, in the contemporary period—as examined in the essay by Mullaney—the availability of fine-grained temporal data about grave relocations necessitated the inclusion of datapoints pertaining to, among other things, the “deadline” by which a grave or set of graves was required to be relocated, as per government decrees. For many of the datapoints in the Mullaney essay, then, the user will encounter temporal data specific to the year, month, and often day.


For the essays by Henriot and Snyder-Reinke, however, the very notion of a “grave relocation deadline” is either irrelevant or irretrievable. In many cases, there was no central decree requiring that a particular grave be moved, with relocation being undertaken by other historical agents with other motivations.


This heterogeneity is preserved in the platform. In the essay by Mullaney, for example, users will find meaningful data in the “deadline” field, while in the essays by Henriot and Snyder-Reinke, they will encounter either placeholder data or no data at all.
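One way to picture a shared structure of this kind is as a record in which the “deadline” field is optional. The Python sketch below is an assumption-laden illustration, not the platform's actual schema: the class name, field names, and the set of placeholder strings are all hypothetical, and the downloadable datasets may encode missing deadlines differently.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumed placeholder strings; the actual datasets may use different ones.
PLACEHOLDERS = {"", "unknown", "n/a", "none"}


@dataclass
class Relocation:
    """Hypothetical record shared by all three essays."""
    place: str
    deadline: Optional[date]  # None where no deadline exists or survives


def parse_deadline(raw: Optional[str]) -> Optional[date]:
    """Turn a raw field value into a date, or None if it carries no deadline."""
    if raw is None or raw.strip().lower() in PLACEHOLDERS:
        return None
    try:
        return date.fromisoformat(raw.strip())
    except ValueError:
        # Values that are not a full ISO date are treated as missing here.
        return None


contemporary = Relocation("Village A", parse_deadline("2012-05-01"))
historical = Relocation("Village B", parse_deadline(""))
print(contemporary.deadline, historical.deadline)
```

Treating placeholders and blanks uniformly as "no deadline" lets the same code handle records from any of the three essays without assuming that the field is meaningful in each.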


In theory, the platform could have been tailored to reveal or conceal particular data fields according to their relevance to each essay, but this would have been both costly (perhaps prohibitively so) and ultimately unnecessary. Instead, users should understand that, by virtue of the perhaps inherently heterogeneous properties of historical and humanistic data, there will be times when certain frictions or even incompatibilities emerge between particular datapoints and the overall data structure.


The Datasets


At the same time, in the interest of making these datasets available to users with their original data structures preserved, you will find links to download any or all of the three essay datasets, so that you can explore them in their original forms. What is more, the original datasets contain surplus data, in the sense of datapoints that, while rich and potentially insightful, are not exploited by the online platform.


For the Snyder-Reinke dataset, download the linked CSV file here.

For the Henriot dataset, download the linked CSV file here.

For the Mullaney dataset, download the linked CSV file here.

Downloadable datasets should be opened with the encoding set to UTF-8.
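For users who open the files programmatically, a minimal Python sketch of reading a CSV file as UTF-8 follows. The function name and the filename in the usage comment are hypothetical, and the column names in any given file are not assumed here. The `utf-8-sig` codec is used because it reads ordinary UTF-8 and also discards a leading byte-order mark should one be present.

```python
import csv


def load_rows(path):
    """Read a CSV file as a list of dicts, decoding it as UTF-8."""
    # "utf-8-sig" reads plain UTF-8 and also drops a leading
    # byte-order mark, if the file happens to start with one.
    with open(path, encoding="utf-8-sig", newline="") as f:
        return list(csv.DictReader(f))


# Usage (hypothetical filename):
# rows = load_rows("dataset.csv")
# print(len(rows), "records loaded")
```

Opening the files with a different encoding may cause Chinese place names to appear as garbled characters, which is why the encoding is set explicitly rather than left to the system default.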