Transforming texts into databases and maps

Early modern databases

The Spanish Empire created a vast 'database' of information about the Americas and their Indigenous peoples. How does the transformation of this record into a digital database enable new insights into the empire and the lands and peoples it claimed?

The Spanish Empire's many administrators, bureaucrats, and notaries were prolific record-keepers. Unlike the records of other early modern empires, this vast paper trail was not created solely in top-down fashion. The Crown built a decentralized bureaucratic empire that allowed Indigenous communities, African slaves, and people from all strata of society to petition the government, engage in litigation, and, in essence, negotiate their participation in the imperial project.

How can digital text analysis transform historians' study and understanding of the past?

The problem of scale

Many social scientists compile data, construct large datasets, and then share this data with other scholars.

Some literary scholars engage in 'distant reading', creating large corpora of digitized texts.

Historians - especially pre-modern historians - rarely work with large digitized datasets or text corpora. There are good reasons for this: the labor required to compile such datasets, the fact that many of our historical sources are unpublished, and the difficulty of digitizing the ones that are.

Nonetheless, I believe there is a middle way for historians: a compromise between distant and close reading and between macro- and microanalysis.

Towards a medium reading and meso-analysis

Distant reading often involves mining thousands of texts for patterns.

Historians often work at different scales. They may closely read and re-read a few key texts while also skimming thousands of other documents looking for a keyword here or a clue there.

My construction and analysis of this digital text corpus involves a mixture of both approaches. My corpus includes many of the most important historical texts for the study of the sixteenth- and seventeenth-century Andes. As a corpus that will ultimately include about 200 volumes, or approximately 50,000 pages, it is too large for a careful close reading but small enough to allow some manual editing.

Borrowing approaches from close and distant reading / macro and micro-analysis, a meso-analysis of this corpus involves:

  • Manually checking OCR'd texts of poorer quality while programmatically correcting common errors in higher-quality scans. By contrast, scholars mining large corpora often have to tolerate OCR'd texts with frequent errors.
  • Structuring texts with a semi-automated approach. Using Python, I write programs that automatically encode structure (chapters, page breaks, paragraphs, footnotes, etc.) in XML/TEI. I then manually check for errors and re-run the program until I reach 100% accuracy. In contrast, large-corpus text-mining techniques often rely on "raw," unstructured texts that make targeted queries and searches more difficult.
  • Encoding content information with a semi-automated approach. This involves automatically identifying named entities (names of places, people, and organizations) and other basic content information, like dates. On early modern Spanish documents, the accuracy of named entity recognition (NER) is often fairly poor: roughly 50-60%. I therefore export frequency lists of tagged entities to check for recurring errors (e.g., the incorrect tagging of "Cusco" as a person name 110 times can be fixed in one step). In this way, I iterate back and forth between summary tables and the texts until I reach 80-90% accuracy. Researchers working with large numbers of texts often have to spend significant time training NER and other natural language processing tools on hand-encoded training data; this improves accuracy but is a time-consuming process.
  • Performing complex and targeted queries of this encoded dataset. The searches and queries one can perform on raw texts are far more limited.
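The semi-automated structuring step can be sketched as follows. The input conventions (bracketed page numbers on their own line) and the minimal TEI-style tags are simplified assumptions for illustration, not the project's actual encoding scheme.

```python
import re

def encode_structure(raw: str) -> str:
    """Encode basic structure (page breaks, paragraphs) as TEI-style XML.

    Assumes page numbers appear on their own line in square brackets,
    e.g. "[42]" -- a hypothetical convention for illustration.
    """
    # Turn bracketed page numbers into TEI <pb/> (page-break) milestones
    text = re.sub(r"^\[(\d+)\]$", r'<pb n="\1"/>', raw, flags=re.MULTILINE)
    # Wrap blank-line-separated blocks in <p> elements, leaving milestones alone
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    body = "\n".join(b if b.startswith("<pb") else f"<p>{b}</p>" for b in blocks)
    return f"<text><body>\n{body}\n</body></text>"

sample = "El primer capitulo trata de la tierra.\n\n[42]\n\nProsigue la relacion."
print(encode_structure(sample))
```

After each run, the output can be spot-checked by hand and the rules adjusted until the encoding is error-free.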
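The frequency-list approach to correcting NER errors can be sketched like this; the tagged entities and their counts are invented for illustration:

```python
from collections import Counter

# Hypothetical NER output: (entity text, assigned tag) pairs
tagged = ([("Cusco", "persName")] * 110
          + [("Toledo", "persName")] * 40
          + [("Lima", "placeName")] * 75)

# 1. Export a frequency list of (entity, tag) pairs to spot recurring errors
freq = Counter(tagged)
print(freq.most_common(2))  # the mis-tagged "Cusco" tops the list

# 2. A correction table built by hand after reviewing the frequency list
corrections = {("Cusco", "persName"): ("Cusco", "placeName")}

# 3. Apply each correction in one step across the whole corpus
cleaned = [corrections.get(pair, pair) for pair in tagged]
```

One manual decision thus corrects all 110 occurrences at once, which is what makes iterating between summary tables and texts efficient.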
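Once the texts carry such markup, targeted queries become straightforward with the Python standard library; the TEI snippet below is a simplified, hypothetical example:

```python
import xml.etree.ElementTree as ET

# A hypothetical, minimal TEI-encoded fragment
tei = """<text><body>
<div type="chapter" n="1"><p>Fueron a <placeName>Cusco</placeName> con <persName>Toledo</persName>.</p></div>
<div type="chapter" n="2"><p>Llegaron a <placeName>Lima</placeName>.</p></div>
</body></text>"""

root = ET.fromstring(tei)
# Targeted query: place names mentioned in chapter 1 only --
# impossible to express reliably against raw, unstructured text
places = [el.text for el in root.findall(".//div[@n='1']/p/placeName")]
print(places)  # ['Cusco']
```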

The problem of reproducibility

Social scientists commonly create and share datasets that allow their research to be tested and reproduced by others. By making these datasets publicly available, they also allow scholars to use the same data to answer different questions.

Historians, however, commonly have to start their work from scratch. This project seeks to conceptualize and demonstrate how a searchable and visualizable text corpus serves not as a replacement for traditional research but as a complement.

Exploratory Data Analysis and Pattern Discovery

Applying data visualization to uncover hidden patterns

In my research, I use data visualization techniques to identify and examine patterns within texts and to identify and explain the gaps and silences in those texts. In this way, data visualization allows the interrogation of historical texts and of the ways certain stories are privileged while others are marginalized or erased altogether.
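As a minimal, invented illustration of this kind of pattern-and-gap finding (all decades and counts below are made up), even a simple frequency chart can flag the silences worth investigating:

```python
from collections import Counter

# Invented example: decade of each mention of a place in the corpus
mention_decades = [1550, 1550, 1560, 1580, 1580, 1580, 1600]
counts = Counter(mention_decades)

# A decade with no mentions can signal a silence worth investigating
for d in range(1550, 1610, 10):
    bar = "#" * counts.get(d, 0)
    print(f"{d}s {bar or '(silence)'}")
```

The same counts can of course feed a proper plotting library; the point is that visualizing frequency over time surfaces absences as readily as peaks.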

The creation of the Early Colonial Andes corpus allows a variety of studies answering historical questions, both old and new. A few studies I am working on include:

Indigenous Geographies: Mapping and Decoding Andean Spatial Information encoded in Historical Texts

Only a few maps from the sixteenth- and seventeenth-century Andes survive today. Nearly all spatial and geographic information from this period was recorded in texts.

This project is the first effort to decode the spatial information encoded in these texts for centuries. It begins with some simple but still unanswered questions. Who lived where? How did information and power travel across Andean mountain passes and vast distances? Where were the refuges and zones of escape and resistance to Spanish colonial power? How did the Spanish colonial project alter Andeans' environments and their relation with the land?
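A first step in such decoding can be sketched as resolving extracted place names against a gazetteer and emitting mappable points. The gazetteer below is a tiny invented sample; the coordinates are approximate and for illustration only.

```python
# Hypothetical gazetteer (approximate coordinates, for illustration only)
GAZETTEER = {
    "Cusco": (-13.53, -71.97),
    "Lima": (-12.05, -77.04),
    "Quito": (-0.18, -78.47),
}

def to_geojson(place_counts: dict) -> dict:
    """Turn place-name mention counts into GeoJSON points for mapping."""
    features = []
    for place, count in place_counts.items():
        if place not in GAZETTEER:
            continue  # unresolved names are set aside for manual review
        lat, lon = GAZETTEER[place]
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": place, "mentions": count},
        })
    return {"type": "FeatureCollection", "features": features}

# "Vilcabamba" is deliberately absent from the toy gazetteer and is skipped
print(to_geojson({"Cusco": 110, "Vilcabamba": 12}))
```

The resulting GeoJSON can be loaded directly into standard web-mapping tools; the unresolved names themselves become a useful worklist.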

The Databases of a Bureaucratic Empire: An Early Modern Information Revolution

In the fifteenth and sixteenth centuries, Europe experienced a series of monumental events that created an information revolution, ending the Middle Ages and inaugurating the modern age: the invention of the printing press, the discovery of the Americas, and a broader cultural transformation that included the spread of humanism, the rise of literacy, and the Reformation. Other information revolutions followed, including the publishing revolution of the nineteenth century and the current dawn of the digital age.

A study of the ECA corpus permits a comparison of these three historical moments. Many of the sources included in this corpus were written in the first, rediscovered and printed in the second, and now stand to benefit from being digitized in the third.

Moreover, the ECA corpus brings the dark sides of the onset of the global age into the conversation: conquest, colonialism, and exploitation.

The Corpus

The Early Colonial Andes digital text corpus

Genre / type, key examples, and approximate number of volumes:

Imperial Surveys and Censuses of Indigenous Lands, Provinces, and Peoples (10-20 volumes)
  • the Relaciones Geográficas (responses to the 1577, 1604, 1648, and other questionnaires)
  • provincial visitas, tasa (tribute) records, etc.
  • the 1689 ecclesiastical visita or survey of the Cusco region

Historical Chronicles: chronicles of pre-contact and post-contact history (~30 volumes)
  • Garcilaso de la Vega
  • Martín de Murúa
  • Pedro de Cieza de León
  • Guaman Poma
  • and many more...

Document Anthologies and Collections (100+ volumes)
  • Colección de documentos inéditos (various)
  • Gobernantes del Perú
  • Cartas del Perú

Summary Information of Cosmographers (5-10 volumes)
  • Juan López de Velasco
  • Vázquez de Espinoza

Traveler Documents (~10 volumes)
  • Lizarraga
  • Cosme Bueno
  • Alexander von Humboldt
  • Cieza de León, Part 1

Town Council Records: libros del cabildo de (~10 volumes)
  • Cusco
  • Lima
  • Quito
  • Huamanga
  • etc.