Discworld’s Who’s Who – Building Social Graphs from Ebooks

Long time no post; a lot has been going on in the last few years. But here’s a short one I did a while ago. It was inspired by an analysis of Game of Thrones character occurrences I did as an exercise during a data science tutorial. I thought it might be interesting to combine my newfound graph skills with my rather large collection of eBooks and try my hand at programmatically analyzing relationships.
I chose one of the Discworld novels by Terry Pratchett as a start. I love this series for its beautifully satiric view of fantasy worlds, social dynamics and politics. Also, it has a very large cast of characters spanning well over 40 novels, with around 10 recurring sets of characters. In fact, there are static guides and even an interactive guide on which book to start with. This wealth of material makes the Discworld series a very interesting candidate for this project.

Strike the Earth – Mining the Data

Roughly speaking, there were three main challenges in the project in terms of data preparation:

  • Extract text from the .epub file
  • Identify specific characters within the text
  • Assess how close the relationship between characters is

For extracting data I used a combination of a library called ebooklib for reading files and the almighty BeautifulSoup package for parsing and scraping the XML used within .epub files. With these two I managed to get a list of one clean string per paragraph relatively quickly. From here, the interesting part of the project began: Finding the Protagonists!
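For the curious, a minimal sketch of that extraction step could look something like this (the function and file names are my own inventions for illustration, not the project’s exact code):

import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

def extract_paragraphs(path):
    """Return one clean string per paragraph of an .epub file."""
    book = epub.read_epub(path)
    paragraphs = []
    # every ITEM_DOCUMENT is one XHTML chapter inside the epub container
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        soup = BeautifulSoup(item.get_content(), "html.parser")
        for p in soup.find_all("p"):
            text = p.get_text().strip()
            if text:
                paragraphs.append(text)
    return paragraphs

paragraphs = extract_paragraphs("jingo.epub")  # hypothetical file name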

To identify characters, I used a technique called NER (Named Entity Recognition) which, as far as I understand, is a pretty standard task in natural language processing. I used the nltk package for Python and built a crude function to extract the names of persons from text in about 10 lines of code. I’ll link the (albeit rough and unpackaged) source code at the end of the post.
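For reference, here is roughly what such a function can look like with nltk (a sketch, not my exact code; the tokenizer, tagger and chunker models need a one-time download):

import nltk

# one-time model downloads:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

def extract_persons(text):
    """Return all PERSON entities nltk's NE chunker finds in a text."""
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
    return [
        " ".join(token for token, _ in subtree.leaves())
        for subtree in tree.subtrees()
        if subtree.label() == "PERSON"
    ]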

For detecting relationships between characters, I fiddled around for a bit and settled on the naive approach of “If two characters occur in the same paragraph, they must have a relationship”. Essentially this means I just counted all tuples of characters that co-occurred in a given paragraph. Of course, I corrected for permutations at the end. After all of this, I got a list resembling the following as a result:

("Character A", "Character B"): 9 Occurrences,
("Character A", "Character C"): 7 Occurrences,
("Character A", "Character D"): 5 Occurrences,
...
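Reusing the two helpers sketched above, the counting itself boils down to a few lines; sorting the names up front folds (A, B) and (B, A) into one key, which is the permutation correction mentioned above:

from collections import Counter
from itertools import combinations

# count co-occurring pairs of characters per paragraph
pair_counts = Counter()
for paragraph in paragraphs:
    # set() drops duplicate mentions within one paragraph
    persons = sorted(set(extract_persons(paragraph)))
    pair_counts.update(combinations(persons, 2))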

I collected all of this data into a table with every observed combination and their respective weights and stored it in a .csv file for further analysis.

The Motherlode

So now I had a dataset containing approximate relationship indicators for all characters in my book. The next step was to visualize it. I used the networkx package to fill my data into a graph structure and then exported it to gephi, an open-source graph visualizer, to finally come up with a graph of the relationships in my novel:

Relationship map for the Discworld novel “Jingo”
Clusters of characters are color coded, line strength is proportional to strength of relationship
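By the way, the hand-off from networkx to gephi is pleasantly small. Schematically it can look like this (the file names are made up; gephi opens .gexf files directly):

import csv
import networkx as nx

# build a weighted graph from the co-occurrence table
G = nx.Graph()
with open("cooccurrences.csv") as f:  # hypothetical file name
    for a, b, weight in csv.reader(f):
        G.add_edge(a, b, weight=int(weight))

# gephi reads .gexf files directly via File > Open
nx.write_gexf(G, "jingo.gexf")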

The book I chose for the graph above was the novel “Jingo”, in which the main character Samuel Vimes, commander of the Ankh-Morpork city watch, embarks on a diplomatic mission to the fictional country of Klatch. The graph clearly identifies this main character, as well as some clusters of supporting cast. I used a built-in feature of gephi to classify the characters into clusters, and I think it did a good job in most cases. Anyone familiar with the series will immediately appreciate the strong connections between Vimes, Carrot and Angua, all of whom are senior officers of the watch and recurring characters. What might seem odd is that the cluster of Nobby and Sergeant Colon is not connected as strongly to the other members of the city watch as expected. It is instead connected to Lord Vetinari and Leonard (of Quirm), with whom they share a significant side arc in the book. There is quite a lot of insight to be gained from the graph, even with its rather shaky foundation of data.

I think this is pretty cool. Without any domain knowledge, we can still get a substantial amount of information just from extracting unsupervised statistics from a body of text. That’s a pretty nifty trick to have up one’s sleeve!

What’s the Catch?

There are still a lot of issues in the project. One of the main ones is that the algorithm does not pick up on slightly different titles for the same person. For instance, “Lord Vetinari” and “Vetinari” obviously refer to the same person but are still listed as separate entries. I am planning to use an NLP model to disambiguate and deduplicate these entities properly, but haven’t gotten it to work well enough yet. Another problem is that the algorithm only catches explicit mentions of a character’s name, but misses references by pronouns.
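Until the proper model works, a crude rule-based merge along these lines would already catch the “Lord Vetinari”/“Vetinari” case (a sketch of the idea, not what I’m actually planning to use):

def merge_aliases(names):
    """Map each name onto the longest other name that contains all of
    its tokens, e.g. "Vetinari" -> "Lord Vetinari"."""
    canonical = {}
    by_length = sorted(names, key=len, reverse=True)
    for name in names:
        tokens = set(name.split())
        for candidate in by_length:
            if candidate != name and tokens < set(candidate.split()):
                canonical[name] = candidate
                break
        else:
            canonical[name] = name  # no longer variant found
    return canonical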

What’s Next?

There’s still a lot of interesting stuff to do: I’ve already run the extraction on my full library of Discworld novels and am planning to make a full network graph of all the characters. This will probably require some refinement and deduplication of characters, as it will get very messy with the huge cast. Other ideas include incorporating the chapter number into the analysis to see how character occurrence and relationships change over time.
In general, there are still far too many interesting leads for me to follow in my limited time. If something neat does come out of it, I’ll be sure to post here.

References:
– Discworld on Wikipedia
– Source code (very rough, but the main notebook might be interesting)

Over-Engineered Songbook with Flask

So as you might have noticed from my previous posts, I play guitar. A lot. I picked it up back in 2014, mainly because I love singing, which is much more fun with a guitar. Over the years, I’ve gotten modestly better at playing and learned a couple of songs by heart. Well, not just a couple: after sitting down with my girlfriend and compiling a list, we counted 80-ish different songs.

Jukebox Hero

Once we had the list, the next step seemed obvious: Hanging around the house, my girlfriend would pick a random number and I’d try to play that song, usually without too many problems. But of course there’s always room for improvement. I started taking notes along with the songs in order to track what I needed to practice or how well a certain song was seated in my repertoire.

I Engineer

Having a list of my performance skills immediately struck a chord (pun intended) with me. If I could track how well I knew each song, I might be able to devise a system for efficiently practicing the songs I was doing poorly at. But manually going through a list and updating that data is cumbersome and gets in the way of what’s actually fun: playing guitar. So I decided to throw some of my technology skills at the problem: I was going to write a web application.


The goals I set for myself were the following:

  • Store and display title, artist, play count and skill level for each song
  • Add new songs from a form
  • Add and edit individual notes per song
  • Reproduce the jukebox-like game
  • Work from my smartphone

How Far We’ve Come

After a couple of sessions of coding I came up with a first working version (code on my GitHub). The application is written in my go-to language Python using the flask framework (which I had almost no prior experience with) and uses Semantic UI for the styling and interface. I got to play around with a couple of nice new tools such as SQLAlchemy, and with much more JavaScript than I was comfortable with at the time. Some features, like the searchable list, required a lot more effort than I’d thought. But I soldiered through and finally got my first version deployed to the Raspberry Pi plugged in next to my router.
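To give an idea of the scale, the core of such an app fits in very little code. Here is a minimal sketch in the same spirit (not my actual code; the model fields mirror the goal list above, and the route and template names are made up):

import random
from flask import Flask, render_template
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///songs.db"
db = SQLAlchemy(app)

class Song(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String(200), nullable=False)
    artist = db.Column(db.String(200))
    play_count = db.Column(db.Integer, default=0)
    skill = db.Column(db.Integer, default=0)  # self-rated level, e.g. 0-5

@app.route("/jukebox")
def jukebox():
    # the jukebox game: serve a random song from the repertoire
    song = random.choice(Song.query.all())
    return render_template("song.html", song=song)  # hypothetical template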

The application is configured to start up along with the Pi and has been accessible from within my own WiFi ever since. I might upload an instance to Heroku or similar at some point, but for now you’ll have to make do with screenshots:

How Far I’ll Go

The web application works perfectly: I’ve been using it regularly ever since I set it up, and it has only crashed once in the three months I’ve been using it. But there is something that bothers me: even though I’m mostly confined to my apartment due to the ongoing pandemic right now, I still want to be able to take my list (and my guitar) along when I go out. This gave me a good excuse to sit down and build a clone as an Android app. That will probably warrant another post 🙂

Looking at the Sun

Who hasn’t looked at the sun and been frustrated that it’s really hard to make out what it actually looks like, given that it is so far away and so stunningly bright? Well, the institute I work for has a really nice solution to the problem: we operate solar telescopes on a nearby mountain as well as some building-sized ones on the Canary Islands.

I work in the data processing section of the operation, so I get in touch with a lot of different kinds of data. The image I’m sharing today is a plot of data from the telescope control database: every couple of seconds, the system records where the telescope is pointing. Below, I’ve plotted all of these data points on top of the region covered by the sun:
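Conceptually, the plot is just a 2D histogram of pointing positions. A sketch of how it could be produced (the column names here are pure assumptions; the real control-database schema looks different):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("pointing_log.csv")  # hypothetical export of the DB table

fig, ax = plt.subplots(figsize=(6, 6))
# log-scaled 2D histogram: bright bins are positions visited often
ax.hexbin(df["x_arcsec"], df["y_arcsec"], gridsize=400, bins="log", cmap="inferno")
ax.set_xlabel("solar x [arcsec]")
ax.set_ylabel("solar y [arcsec]")
ax.set_aspect("equal")
plt.show()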

The image contains about four years’ worth of observations. Isn’t it fascinating to think that behind each of the mesmerizing patterns we can see, someone had a scientific question they were trying to answer? There are actually quite a few interesting features you can spot, for instance the striking circle patterns:

They are artifacts of a process called flatfielding: when taking data, tiny imperfections of the optical setup, such as motes of dust, leave traces on the image. To account for that, observers take flatfield images, for which the telescope is intentionally de-focused and moved over the solar surface. This yields a very homogeneous, flat image containing only the optical artifacts. These images can then be used to correct actual data taken with a focused telescope.
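Schematically, applying the correction is just a normalization and a division (a simplified sketch; real pipelines involve more calibration steps and bookkeeping):

import numpy as np

def flatfield_correct(raw, flat):
    """Divide out the optical artifacts captured in the flat image."""
    gain = flat / np.mean(flat)  # normalize so intensities keep their scale
    return raw / gain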

This, among other techniques, allows scientists to take high-resolution images of features on the solar surface, such as this image of a sunspot:

Image of a Sunspot observed with the GREGOR Telescope
Sep. 2017 Copyright Leibniz Institute for Solar Physics

Buying a Guitar with Python – Part 1: Data Acquisition

One of my favorite activities in my free time is playing guitar. I already have more guitars in my apartment than someone with my modest skills should reasonably own (eight at the time of writing), so clearly the next step is to buy even more.

As another of my passions is programming and data analysis I set out not only to buy a nice new steel string guitar, but to also learn something from the process: I decided to analyze the market while I was at it.

The Setup

My go-to merchant for musical equipment has been the German company Thomann, currently Europe’s largest online retailer in the field. They list around 500 different models that match my criteria, far too many to look at by hand. So, wanting to play around with some web scraping, I fired up my editor and got to work.

Of course there are some things that I’d like to get from the upgrade:

  • Steel string all the way!
  • Different body shape than the Dreadnought I currently have
  • Within my budget: 400€ to 1000€
  • Integrated pickup

The result is a scraper that goes through the individual listings and pulls information like the model name, manufacturer, model characteristics and sales data. The source code is hosted on my GitHub.
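The gist of such a scraper fits in a dozen lines. A sketch with a made-up URL and selectors (the real page structure, and my actual code, differ):

import requests
from bs4 import BeautifulSoup

def scrape_listing(url):
    """Pull name and price from one search-results page."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    guitars = []
    for item in soup.select("div.product"):  # hypothetical selector
        guitars.append({
            "name": item.select_one("span.title").get_text(strip=True),
            "price": item.select_one("span.price").get_text(strip=True),
        })
    return guitars

results = scrape_listing("https://www.thomann.de/example-search")  # made up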

A quick overview of the price distribution yields:

This still leaves a couple hundred potential matches, so I’ll have to dig deeper. This post is the first in a series of posts exploring the dataset.

Ice Saints, do they exist?

My flatmates and I have been tending to a small garden plot over the last few years, and every year we have the same discussion: when can we safely move all of the fragile plants we’ve reared on our windowsills out to the garden? I’m usually in favor of transplanting as soon as possible, while my flatmates want to be a bit more careful.
An argument that always comes up is some form of the sentence: “We can’t plant them yet, the Eisheiligen haven’t passed!” Where I live in southern Germany, the Eisheiligen or Ice Saints are a rule of thumb for when the last cold days with ground frost can occur. Wikipedia writes:

In parts of the Northern Hemisphere, the period from May 12 to May 15 is often believed to bring a brief spell of colder weather in many years, including the last nightly frosts of the spring.

Wikipedia – Ice Saints

This year, during the discussion, I decided to look at the data to see whether I could disprove the statistical significance of the Ice Saints and thereby gain an edge in this year’s argument.

The Setup

The German weather agency provides historical weather data from hundreds of weather stations, reaching back to 1950. The data comes as a .csv file containing day-wise measurements, which I will be analyzing with Python’s pandas library. According to the manual provided by the data source, the interesting column is “TMK”, the mean temperature over the course of a day.

I took the temperature for each day and determined a mean as well as a standard deviation for each day of the year. Incidentally, this requires some interpolation due to leap years. Plotted below is the mean temperature for each day of the year, along with a 1-sigma band derived from the temperature distribution of that day.
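In pandas, the per-day statistics boil down to a groupby over the day of the year. A sketch (the file name and column layout follow the DWD daily climate products, but treat the details as assumptions):

import pandas as pd

df = pd.read_csv("produkt_klima_tag.txt", sep=";", skipinitialspace=True)
df["date"] = pd.to_datetime(df["MESS_DATUM"].astype(str), format="%Y%m%d")

# day-of-year folds leap years onto up to 366 bins, hence the
# interpolation headache mentioned above
df["doy"] = df["date"].dt.dayofyear

# mean and 1-sigma spread of the daily mean temperature per day of year
stats = df.groupby("doy")["TMK"].agg(["mean", "std"])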

Temperature by day of year. Upper and lower confidence intervals are given by the standard deviation. The region where the Ice Saints phenomenon is expected according to Wikipedia is marked with a dashed line.

If the Ice Saints were a statistically significant rule, I’d expect to see some sort of dip in the trend at the expected date. Looking at our data, there does not seem to be a visible systematic dip in the mean temperature. So I tried to see whether the phenomenon shows up directly in the individual years. To do this, I checked every day of each year and counted the percentage of years in which that day had a minimum temperature below freezing. The result is shown below; again, the expected date of the phenomenon is marked with a dotted line.
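Continuing from the sketch above, this count is another short groupby; note that “TNK” as the name of the daily minimum temperature column is my assumption:

# percentage of years in which a given day of the year dipped below freezing
df["frost"] = df["TNK"] < 0
frost_fraction = df.groupby("doy")["frost"].mean() * 100  # percent of years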

In this analysis, a small peak in the probability of cold days shows up exactly when it would be expected. Also, after this date, none of the days up to autumn went below freezing. In summary, we can see:

  • For all years in our sample, transplanting after the Ice Saints would have been safe
  • There was a non-zero chance of sub-zero temperatures for most days earlier than the Ice Saints

So after all, waiting for the Ice Saints is clearly the safest bet you can make. But there are also arguments to be made for planting earlier: the chance of freezing temperatures is very small, and your plants get more of a head start over weeds, among other considerations.
Whether my analysis helps me in discussions with flatmates who don’t really care for data remains to be seen 😉

It might also be interesting to see whether, due to climate change, there is a trend in when the last freezing temperature of the year occurs, but that will be left for another post. Thanks for reading!
