CLSL 6014

Thanks to Prof. Dilley, Sara Hales, Peter Miller, Laura Moser, Marissa Sarver, Andrea Scardina, Echo Smith, and Jonathan Young.

I have prepared a number of visualizations to interact with data collected by the Spring 2019 Latin Seminar, CLSL 6014. The data has been collected both by close readings of the letters in the Augustinian Heritage Institute translation and through digital text analysis of the Latin text from the Corpus Corporum housed by the Universität Zürich. The following sections will discuss the preparation of the visualizations, their current state of development, what the next steps will be for each, and their potential usefulness to research on Augustine’s epistolary corpus.

Epistolary Map

By far the most polished visualization, this map represents each of the 200 letters as two points joined by a link. A blue circle shows the sender location and a red circle the recipient location. The thickness of the link between them is determined by the length of the letter as measured by the file size in bytes of the Latin text of each letter. These files have been cleaned of references, punctuation, and non-Roman characters (a handful of Greek Unicode words are also removed). Because only plain Roman characters remain, each character occupies a single byte, so the length of these files in bytes corresponds directly to the number of characters.
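A minimal sketch of the cleaning and scaling described above, in Python. The bracketed reference format, the function names, and the stroke-width range are assumptions made for illustration and are not taken from the project files.

```python
import re

def clean_latin(text):
    """Keep only unaccented Roman letters separated by single spaces."""
    # Drop bracketed references such as "[Rom 8:1]" (assumed reference format).
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Replace everything that is not an ASCII letter or whitespace:
    # this removes punctuation, digits, and Greek words.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    return " ".join(text.split())

def link_width(n_bytes, min_bytes, max_bytes, lo=1.0, hi=8.0):
    """Linearly scale a file size onto a stroke width between lo and hi."""
    if max_bytes == min_bytes:
        return lo
    return lo + (hi - lo) * (n_bytes - min_bytes) / (max_bytes - min_bytes)

# Invented sample line, not a quotation from the corpus.
sample = "Domino beatissimo [Rom 8:1] λόγος et fratri, salutem."
cleaned = clean_latin(sample)
size_in_bytes = len(cleaned.encode("utf-8"))
```

Once the text is reduced to ASCII letters and spaces, `size_in_bytes` equals `len(cleaned)`, which is why file size can stand in for character count.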

The visualization initializes with the pure geographic representation over an abstract modern country map of the Mediterranean. While the AWMC offers options and data to render the map over terrain tiles of the ancient world, the abstraction of the modern map avoids presenting the points as overly precise. The geospatial component is deliberately representative, rather than making a visual claim to exactness. There is a place-holder location for unknown points in an uninhabited region of the West Algerian desert. Hovering the mouse over a point shows much of the metadata collected by the seminar participants. Many points overlap in this representation, so the user may click a button in the top right-hand corner of the map to convert the fixed geographic points to a network graph. The network graph does not allow overlap, but each point still feels a simulated gravity towards its set GPS location.
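The idea of points that repel each other while being pulled toward a fixed location can be sketched in a few lines of Python. This is an illustrative toy simulation, not the code behind the map: the function name, the radius, the pull strength, and the step count are all assumptions, and the simple pairwise pass is not guaranteed to remove every overlap in a dense cluster.

```python
import math

def geo_gravity_layout(anchors, radius=1.0, strength=0.1, steps=300):
    """Spread overlapping points apart while pulling each toward its anchor."""
    # Start each point on its anchor, with a tiny deterministic offset so that
    # points sharing one anchor have a direction in which to separate.
    pos = [[x + 1e-3 * math.cos(i), y + 1e-3 * math.sin(i)]
           for i, (x, y) in enumerate(anchors)]
    for _ in range(steps):
        # Gravity: move each point a fraction of the way back to its anchor.
        for p, (ax, ay) in zip(pos, anchors):
            p[0] += strength * (ax - p[0])
            p[1] += strength * (ay - p[1])
        # Collision: push apart any pair closer than two radii.
        for i in range(len(pos)):
            for j in range(i + 1, len(pos)):
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                dist = math.hypot(dx, dy)
                overlap = 2 * radius - dist
                if overlap > 0 and dist > 0:
                    ux, uy = dx / dist, dy / dist
                    pos[i][0] -= ux * overlap / 2
                    pos[i][1] -= uy * overlap / 2
                    pos[j][0] += ux * overlap / 2
                    pos[j][1] += uy * overlap / 2
    return pos

# Two points that share one location and a third far away (invented coordinates).
layout = geo_gravity_layout([(0.0, 0.0), (0.0, 0.0), (10.0, 0.0)])
```

The two co-located points settle side by side around their shared anchor, while the isolated point stays where its coordinates place it.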

This behavior creates a large cloud of points around Roman North Africa, which stretches over part of Sardinia and almost to Sicily. While the metadata box for each point will tell the user the name of the location for its sender or recipient, a filter allows one to reduce the number of points to a more manageable portion. After clicking another box in the top right-hand corner to activate the year slider filter along the top, users may choose to show only letters that have a date before or after the year specified by the slider. In the next stage of this visualization, two sliders or a brushable band will be used to allow for filtering in both directions at once and to allow better use of the data for letters that have wide date ranges. Currently, the filter uses the earliest year in a letter’s range as a single fixed point. By using two sliders, the filter for showing letters before a certain year would use the start of the date range as its cutoff, while the filter for showing letters after a certain year would use the end of the range. I have noticed that the filter can cause problems with the metadata displayed for an individual node: sometimes the metadata from a removed point temporarily and incorrectly replaces that of a visible point. These errors are at the top of the to-do list for this visualization.
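The difference between the current single-year filter and the planned two-slider filter can be sketched as follows. The `Letter` type, the function names, and the three sample date ranges are invented for illustration and do not come from the seminar dataset.

```python
from typing import NamedTuple

class Letter(NamedTuple):
    label: str
    start_year: int  # earliest possible year of composition
    end_year: int    # latest possible year of composition

def in_window(letter, after, before):
    """Planned behaviour: keep a letter whose date range overlaps the window."""
    # "Before" tests the start of the range; "after" tests the end of the range.
    return letter.start_year <= before and letter.end_year >= after

def single_point(letter, after, before):
    """Current behaviour: only the earliest year of the range is tested."""
    return after <= letter.start_year <= before

# Placeholder letters with made-up date ranges.
letters = [Letter("X", 386, 387), Letter("Y", 395, 410), Letter("Z", 420, 420)]
ranged = [l.label for l in letters if in_window(l, 400, 415)]
fixed = [l.label for l in letters if single_point(l, 400, 415)]
```

With a window of 400 to 415, the range-aware filter keeps letter Y, whose range of 395 to 410 overlaps the window, while the single-point filter drops it because its earliest year falls before 400.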

The map allows a user to investigate the wealth of data collected by the seminar class both geospatially and chronologically. It must be noted that this visualization is not truly a blend of a network graph and a map. Each point is only connected to a single other point and so it is much more map than network. Still, I feel that rendering points to a GPS gravity is a large step forward in joining networks to geographic representations. It would be possible, for example, to render the correspondent network using this same process (either with each correspondent as a single point with a single GPS gravity, which could be chosen by their most frequent location or a midpoint among several, or with correspondents having multiple points for each of their locations - as with any network graph, the question would be decided by what a user wished to show).[1] After identifying and correcting the unusual behavior between the filter and the metadata popup boxes, the next step for this visualization involves collecting the data for the remaining hundred letters. Additionally, Prof. Marley’s Cicero dataset, on which this project is based, could be rendered through this process. On the distant wishlist of features, I would like the ability to filter the map by searching the metadata. I would like to be able to type ‘Aeneid’, ‘Rom 8’, or ‘Donatist’ into a box and display only points that include those phrases in their appropriate metadata field.

Correspondents Network Graph

This is a network graph showing each unique sender and recipient field from the collected metadata as a node, with links determined by the presence of a letter connection, thicker and shorter for more letter exchanges. Node size is determined by number of letters. Augustine alone is a square and the rest are circles (Prof. Marley assigned shapes to gender, Cicero’s household, and slave status; I can imagine clerical rank as a possibility for this graph). The nodes are currently colored by a centrality measure (red most central, blue least; Prof. Marley assigned color to sentiment scores). Ideally, I would like color to associate with the presence of different types of content such as Platonic, Manichaean, or Catholic, though I ran into problems defining those content ideas through word counts or topic models for discovery of a vocabulary. Another possibility for color would be to indicate time of correspondence, with entries that span longer timeframes needing their own separate color. Generally, I am not sure that these highly centralized letter networks, focused as they are on the publisher of the letters, are terribly interesting as far as networks go. It seems to me that their key usefulness lies in comparing different variables through the symbology. The network itself is perhaps unexciting, but the nodes become ways to deliver visual representations of metadata variables. A letter network that was less dependent upon a single person would make for more interesting network shapes and analysis.
As mentioned above, I think mounting this network over the geogravity framework of the Epistolary Map would be significantly more interesting to explore. The spatial component of this graph bears little in the way of surprising information: Augustine is at the center. Spreading the nodes over a map might allow a user to explore trends that might connect the many single correspondent nodes.
With some data work, this graph could condense some of its nodes so that ‘Augustine, Alypius’ is not treated as separate from ‘Augustine’ and ‘Alypius’. On this question, we may find one of the few interesting network features of this graph. By not condensing those entries, we see that Augustine does not correspond with Innocent directly by himself (nor with Peregrinus, Naucelio, Castorius, or Maximus). Still, not all distinctions would be helpful. The difference between ‘Paulinus and Therasia’ and ‘Paulinus, Therasia’ reflects only the different approaches of whoever collected the data. Even so, whether to treat Paulinus and Therasia as a single entry is a choice to be made by the composer of the network, depending on the question to be explored. After Therasia dies, Paulinus will correspond without her, and a person may want to see if there is a difference between ‘Paulinus and Therasia’ and ‘Paulinus’. Collapsing all of the Paulinus entries into a single node would hide such a difference. Alternatively, it would only be visible when comparing the ‘Paulinus’ connections to the ‘Therasia’ connections, though that network graph would disassociate ‘Paulinus’ and ‘Therasia’ from each other. In that case, we can once again find a compelling reason to add a geographic component. If ‘Paulinus’ and ‘Therasia’ were split into separate entries, their nodes would still be drawn at Nola and a relationship would be suggested even if someone didn’t know that they had been married.
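The data work discussed here could begin with a small normalizing step like the following sketch. The function names and the splitting rule (commas and the word ‘and’) are assumptions about how the sender and recipient fields are written, based only on the examples quoted above.

```python
import re

def split_correspondents(field):
    """Split a sender or recipient field into a sorted tuple of names."""
    # Split on commas (optionally followed by "and") or on a standalone "and".
    parts = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", field.strip())
    return tuple(sorted(p for p in parts if p))

def group_label(field):
    """One canonical label per group, so spelling variants share a node."""
    return ", ".join(split_correspondents(field))

# Option 1: keep co-senders together as one node, but harmonize the variants.
same_node = group_label("Paulinus and Therasia") == group_label("Paulinus, Therasia")

# Option 2: explode a group into individual nodes.
individuals = split_correspondents("Augustine, Alypius")
```

Keeping the group label preserves the observation that some correspondents are only reached jointly, while exploding the field into individuals serves a graph built around single persons. Which option is appropriate depends on the question being asked.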

Stylo Network Graphs

These network graphs use a similar process to the Correspondents Network Graph to show each letter’s relationship to the rest as organized by the R package stylo. The data for these graphs comes from the Corpus Corporum Latin, so it includes 279 texts, and I have chosen not to include the seminar metadata with the stylo results because a third of them would be blank. Node size is determined by the file size of each letter (again roughly number of characters), shape by author (Augustine by himself: triangle point up, Augustine with collaborator(s): triangle point down, Jerome: circle, anyone else: square), and color by modularity class (Gephi uses an abridged Louvain method to determine clusters). Link thickness and distance are determined by the weight assigned by stylo’s output of an edge list (stronger associations are drawn closer with a thicker link). The texts were unlemmatized: the CLTK Latin lemmatizer does not perform well on later Latin generally, and understandably struggles to properly identify the many proper nouns in the letters, so it would take a serious investment of time to prepare an acceptably lemmatized text of the letters. I used stylo’s default 100 most frequent words, which should lead to caution when reading results for the many brief letters. For example, letter 74 is only 131 words long, with the highly formulaic address accounting for 29 of them. I have had success clustering unlemmatized texts by increasing the MFW parameter for stylo, casting the net deeper to account for case variations, but the length of the short letters makes that impossible here. I used Gephi’s modularity measure to help see communities better, since the physics of the network connections often draws nodes into close proximity that do not have strong or even any connections to each other. The color clusters help to see the groupings even in the midst of densely overlapping connections.
These graphs might be better displayed with a different process[2] (probably these should be done with the table method, see below). The labels seem to tax the browser’s rendering capabilities, and letter number by itself gives me fairly little information to work from. However challenging these are to read, I still argue that the web versions are far more comprehensible than the dendrogram outputs from R (stylo edge lists and dendrogram pdfs). One thing that I do notice across all the graphs is that, to the degree that letter number roughly corresponds to date (generally low being early, high being late), time doesn’t seem to have much impact on associations between letters. I have found Eder’s Simple distance measure to be successful at grouping authors of unlemmatized text in the past (see footnote 3), and was pleased to see it perform well in this case too. If one filters out the non-Jerome entries in the edersimple.html page by pressing ‘t’, ‘d’, ‘x’, and ‘s’, all of Jerome’s letters, save 75, have connections, and most of them strong. I suspect that lengthy biblical quotations and reframed statements of Augustine’s own letter cause 75 to stand apart. If I were going to probe the author signals more deeply, I might tie color to authors to see the top 8 senders more clearly and label the nodes by sender name followed by letter number.
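For readers unfamiliar with the measure, the following Python sketch shows the general shape of the calculation: relative frequencies of the most frequent words, then a distance between two letters. It assumes, following my understanding of the stylo documentation, that Eder’s Simple distance is the Manhattan distance taken after a square-root transformation of the frequencies. It is not stylo’s own code, and the toy strings are invented.

```python
from collections import Counter
import math

def relative_freqs(texts, n_mfw=100):
    """Relative frequencies of the corpus-wide most frequent words (MFW)."""
    counts = {name: Counter(t.lower().split()) for name, t in texts.items()}
    total = Counter()
    for c in counts.values():
        total.update(c)
    mfw = [w for w, _ in total.most_common(n_mfw)]
    table = {}
    for name, c in counts.items():
        n_tokens = sum(c.values()) or 1  # avoid division by zero on empty texts
        table[name] = [c[w] / n_tokens for w in mfw]
    return table

def eder_simple(u, v):
    """Manhattan distance between square-root-transformed frequency vectors."""
    return sum(abs(math.sqrt(a) - math.sqrt(b)) for a, b in zip(u, v))

# Toy texts, not drawn from the letters.
freqs = relative_freqs({"a": "et in et non", "b": "et in et non", "c": "sed sed sed sed"})
same = eder_simple(freqs["a"], freqs["b"])
different = eder_simple(freqs["a"], freqs["c"])
```

In a letter of only 131 words, most of the 100 MFW occur once or not at all, so each relative frequency rests on very little evidence. That is the reason for caution with the brief letters.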
I think the absence of a strong time signal could suggest that the letters were prepared for publication all around the same time with editing or that Augustine’s letter style did not change over time. In either case, that question would require a good deal of close reading and traditional philology to investigate further. If labels may be abandoned, these results may best be viewed in a visualization that provides popup metadata boxes or with a display table as in the following case.

All Entity Network Graphs

These graphs were demonstrated in class on 4/19. They are graphs of directed networks between letters and individual values from the following metadata fields: ‘Other Personal Names’, ‘Inferred Names’, ‘Groups Mentioned’, and ‘Literary/Historical References’. A good deal of preparing the dataset for the visualizations involved cleaning the entries themselves and harmonizing spelling errors and citation style variations. The search version allows the user to type a string into the box at the bottom, and if that string is included in the name of one of the nodes, it remains visible, while the rest disappear and then slowly return. There are over 1100 node entries (~200 letters and ~910 entities) and links are determined by the presence of an entity in a letter. I did not include weights, so multiple occurrences do not affect the structure of this graph. A letter referencing John 10 times will show the same connection as a letter that references it once. Each graph allows the user some choice regarding the size of the nodes; however, the centrality measures of the searchable table are not very helpful because they do not work with directed graphs like this. The InDegree or OutDegree of the table graph is more helpful because it allows a user to see the difference between entities and letters: large InDegree nodes must be persons, groups, or references, while large OutDegree nodes must be letters. Colors are assigned by Gephi’s clustering method once again. I do not have the ability to assign shapes or labels to this visualization, though both versions show popup boxes and the table version can show as much metadata as desired. The fields must be included in the dataset, and then when a user holds shift, they can select a rectangular area of the graph to display the metadata values in a table on the right. Currently the values are uninterestingly limited to the node’s name, type (letter or person, etc.), and modularity class (the Gephi-assigned cluster number).
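The structure described in this paragraph (directed, unweighted links from letters to entities) and the resulting InDegree and OutDegree pattern can be sketched as follows. The letter labels and mentions are placeholders, not entries from the seminar dataset.

```python
from collections import Counter

# Each pair is (letter, entity mentioned); repeats represent multiple mentions.
mentions = [
    ("Ep. A", "John"),
    ("Ep. A", "John"),       # a second mention in the same letter
    ("Ep. A", "Donatists"),
    ("Ep. B", "John"),
]

# Unweighted: a set keeps one edge per (letter, entity) pair.
edges = set(mentions)

# Edges always point from a letter to an entity.
out_degree = Counter(letter for letter, _ in edges)
in_degree = Counter(entity for _, entity in edges)
```

Because every edge runs from a letter to an entity, letters have only OutDegree and entities have only InDegree, which is why those two measures separate the node types so cleanly.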
During the in-class demonstration I had suggested that the next step was to collapse this network into two graphs: one of just letters, related by their connections through these entities, and another of just persons, groups, and references, related through their shared letters. I ran into a problem when the resulting networks had too many edges. The person, group, and reference network has over 23000 edges and the letter-letter network has over 19000 edges. They will not render in a browser. I will have to start these over in their specific categories: a persons mentioned or inferred network, a groups network, and a references network. Several classmates suggested splitting the references into the subcategories of biblical and non-biblical, and I think that will be helpful as well. It should also be possible to cull the weaker connections from the dataset and only load connections over a certain threshold into the browser. I think the table version has good potential for displaying all the metadata that was collected and would like to add the search capability to it. As I mentioned above, the table version would likely be an effective choice for a method to render the stylo graphs as well.
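Collapsing the letter-entity network into a letter-letter network, and culling weak connections, might look like the sketch below. The function name, the `min_shared` threshold parameter, and the sample edges are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations

def project_letters(edges, min_shared=1):
    """Link two letters when they share at least min_shared entities."""
    by_entity = defaultdict(set)
    for letter, entity in edges:
        by_entity[entity].add(letter)
    weights = defaultdict(int)
    for letters in by_entity.values():
        # An entity shared by k letters yields k * (k - 1) / 2 letter pairs.
        for a, b in combinations(sorted(letters), 2):
            weights[(a, b)] += 1
    return {pair: w for pair, w in weights.items() if w >= min_shared}

# Placeholder edges, not taken from the seminar data.
edges = {
    ("A", "John"), ("B", "John"), ("C", "John"),
    ("A", "Donatists"), ("B", "Donatists"),
}
all_links = project_letters(edges)
strong_links = project_letters(edges, min_shared=2)
```

The comment inside the loop points to why the edge count grows so quickly: a single reference shared by many letters links every pair of them. Raising `min_shared` keeps only the letter pairs that share several entities.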
This network group requires the most work, but it does not just display all the data collected by the seminar; it begins proper network analysis of it. Traditional scholarship has already begun some of this work in the form of prosopography or indices of biblical references. The Augustinian Heritage Institute’s English translation already has a table of biblical references in each letter. I feel that seeing that data in an interactive visual format allows a person to process that information in a different way than a 9-page list, and I hope that the fruits of these visualizations will lead to new research questions and approaches in the future.

[1] I am reminded of Marissa’s work on Pliny’s letters for an example of how the multiple points for each correspondent would be useful. She looked at differences in style for Pliny’s letters depending on location, suggesting that there was a City-Pliny and a Villa-Pliny. I can also imagine a graph/map of scriptural references by location of sender with links between references by number of letters in which they co-occur.

[2] For a much more readable version of a stylo cluster result rendered as a web network visualization, see https://edbkeogh.github.io/datavisusquecano/BAM/Networks/eyaler/glabels.html - Greek drama clustered by Eder’s Delta over 300 MFW, circles-tragedy, squares-comedy, blue-Aeschylus, red-Euripides, purple-Sophocles, green-Aristophanes. Author signal is very clear, also with a pleasing separation of comedy and tragedy.