Long time no post; a lot of stuff’s been going on in the last few years. But here’s a short one I did a while ago. It was inspired by an analysis of Game of Thrones character occurrences I did as an exercise during a data science tutorial. I thought it might be interesting to combine my newfound graph skills with my rather large collection of eBooks and try my hand at programmatically analyzing relationships.
I chose one of the Discworld novels by Terry Pratchett as a start. I love this series for its beautifully satiric view of fantasy worlds, social dynamics and politics. Also, it has a very large cast of characters spanning well over 40 novels with around 10 recurring sets of characters. In fact, there are some static guides and even an interactive guide on which book to start with. This amount of available data makes the Discworld series a very interesting candidate for this project.
Strike the Earth – Mining the Data
Roughly, there were three main challenges in the project in terms of data preparation:
- Extract text from the .epub file
- Identify specific characters within the text
- Assess how close the relationship between characters is
For extracting data I used a combination of a library called ebooklib for reading files and the almighty BeautifulSoup package for parsing and scraping the XML used within .epub files. With these two I managed to get a list of one clean string per paragraph relatively quickly. From here, the interesting part of the project began: Finding the Protagonists!
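To give an idea of what "one clean string per paragraph" means in practice, here is a minimal sketch. Note that it is a stand-in, not the ebooklib + BeautifulSoup code I actually used: it relies only on the standard library and on the fact that an .epub file is a zip archive of (X)HTML documents. The names `ParagraphCollector` and `paragraphs_from_epub` are made up for this example.

```python
import zipfile
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of every <p> element as one whitespace-normalised string."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside = False  # True while we are between <p> and </p>
        self._chunks = []

    def _flush(self):
        # Join the collected text pieces and collapse runs of whitespace.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)


def paragraphs_from_epub(path_or_file):
    """Return one clean string per <p> element in the archive's (X)HTML files."""
    paragraphs = []
    with zipfile.ZipFile(path_or_file) as archive:
        # Assumption: archive order is good enough. The real reading order
        # is defined by the book's package document, which is ignored here.
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                collector = ParagraphCollector()
                collector.feed(archive.read(name).decode("utf-8", errors="replace"))
                collector.close()
                paragraphs.extend(collector.paragraphs)
    return paragraphs
```

Calling `paragraphs_from_epub("some_book.epub")` (the file name is a placeholder) returns a plain list of strings. Headings and other non-paragraph elements are skipped, which is fine for counting co-occurrences per paragraph.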
To identify characters, I used a technique called NER (Named Entity Recognition) which, as far as I understand, is a pretty standard task in natural language processing. I used the nltk package for Python and built a crude function to extract the names of persons from text in about 10 lines of code. I’ll link the source code, rough and unpackaged as it is, at the end of the post.
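A function of that kind could look roughly like the sketch below. It is an illustration of the usual nltk pipeline (tokenize, tag parts of speech, chunk named entities, keep the PERSON chunks), not a copy of my notebook code, and the function names are invented for this example.

```python
import nltk
from nltk.tree import Tree


def person_names_from_tree(tree):
    """Collect the text of every subtree labelled PERSON in a chunked sentence."""
    return [
        " ".join(token for token, _tag in subtree.leaves())
        for subtree in tree.subtrees(lambda t: t.label() == "PERSON")
    ]


def person_names(text):
    """Tokenise, POS-tag and chunk `text`, then return the PERSON entities found.

    Needs nltk's tokenizer, tagger and chunker data, which has to be fetched
    once with nltk.download(); the exact resource names depend on the
    installed nltk version.
    """
    names = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        names.extend(person_names_from_tree(nltk.ne_chunk(tagged)))
    return names
```

Running `person_names` on each paragraph gives one list of names per paragraph. The chunker is far from perfect on fantasy names, so expect some misses and some false positives.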
For detecting relationships between characters, I fiddled around for a bit and settled on the naive approach of “If two characters occur in the same paragraph, they must have a relationship”. Essentially this means I just counted all tuples of characters that co-occurred in a given paragraph. Of course, I corrected for permutations at the end. After all of this, I got a list resembling the following as a result:
("Character A", "Character B"): 9 Occurrences, ("Character A", "Character C"): 7 Occurrences, ("Character A", "Character D"): 5 Occurrences, ...
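The counting itself fits in a few lines. The sketch below is a simplified take on the idea, with two assumptions that may differ from what my notebook does: each pair is counted at most once per paragraph, and the permutation problem is avoided up front by sorting the names instead of being corrected at the end. The function name is made up for this example.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(names_per_paragraph):
    """Count how many paragraphs each unordered pair of names shares."""
    counts = Counter()
    for names in names_per_paragraph:
        # sorted(set(...)) drops repeats within a paragraph and fixes the
        # order, so (A, B) and (B, A) always land on the same key.
        for pair in combinations(sorted(set(names)), 2):
            counts[pair] += 1
    return counts
```

For example, `count_cooccurrences([["Vimes", "Carrot"], ["Carrot", "Vimes", "Angua"]])` counts the pair `("Carrot", "Vimes")` twice and the two pairs involving Angua once each.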
I collected all of this data into a table with every observed combination and their respective weights and stored it in a .csv file for further analysis.
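Writing such a table is straightforward with the standard csv module. This is a minimal sketch; the column names `source`, `target` and `weight` are my assumption for this example and not necessarily what the original file uses.

```python
import csv
import io


def write_edge_table(counts, handle):
    """Write (name_a, name_b) -> weight pairs as CSV rows, heaviest pairs first."""
    writer = csv.writer(handle)
    writer.writerow(["source", "target", "weight"])
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    for (name_a, name_b), weight in ordered:
        writer.writerow([name_a, name_b, weight])


# In-memory demo; for a real file use open("edges.csv", "w", newline="").
demo_output = io.StringIO()
write_edge_table({("Carrot", "Vimes"): 9, ("Angua", "Vimes"): 7}, demo_output)
```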
The Motherlode
So now I have a dataset containing approximate relationship indicators of all characters in my book. The next step was to visualize it. I used the networkx package to fill my data into a graph structure and then exported the data to Gephi, an open-source graph visualizer, to finally come up with a graph of the relationships in my novel:
The book I chose for the graph above was the novel “Jingo” in which the main character Samuel Vimes, commander of the Ankh-Morpork city watch, embarks on a diplomatic mission to the fictional country of Klatch. The graph clearly identifies this main character, as well as some clusters of supporting cast. I used a built-in feature of Gephi to classify the characters into clusters and I think it did a good job in most cases. Anyone familiar with the series will immediately appreciate the strong connections between Vimes, Carrot and Angua, all of whom are senior officers of the watch and recurring characters. What might seem odd is that the cluster of Nobby and Colon/Sergeant Colon is not connected as strongly to the other members of the city watch as expected. It is instead connected to Lord Vetinari and Leonard (of Quirm) with whom they share a significant side-arc in the book. There is quite some insight to be gained from the graph, even with its rather shaky foundation of data.
I think this is pretty cool. Without any domain knowledge, we can still get a substantial amount of information just from extracting unsupervised statistics from a body of text. That’s a pretty nifty trick to have up one’s sleeve!
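For anyone who wants to try the graph step, the networkx side could look roughly like this. It is a sketch under assumptions: the helper names are invented, and GEXF is simply one export format that Gephi can open, not necessarily the one I used.

```python
import networkx as nx


def build_graph(counts):
    """Turn (name_a, name_b) -> count pairs into a weighted, undirected graph."""
    graph = nx.Graph()
    for (name_a, name_b), weight in counts.items():
        graph.add_edge(name_a, name_b, weight=weight)
    return graph


def most_connected(graph):
    """Return the node with the highest weighted degree (sum of its edge weights)."""
    weighted_degree = dict(graph.degree(weight="weight"))
    return max(weighted_degree, key=weighted_degree.get)


# Export for Gephi (hypothetical file name):
# nx.write_gexf(build_graph(counts), "jingo.gexf")
```

The weighted degree is a quick way to check what the picture suggests: the character with the largest sum of edge weights is the one the book keeps putting in the same paragraph as everyone else.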
What’s the Catch?
There are still a lot of issues in the project. One of the main issues is that the algorithm does not pick up on slightly different titles for the same person. For instance “Lord Vetinari” and “Vetinari” obviously refer to the same person but are still listed as separate entries. I am planning to use an NLP/ML model to disambiguate and deduplicate these entities better, but haven’t gotten it to work well enough for now. Another problem is that the algorithm only catches explicit mentions of a character’s name, but does not catch references by pronouns.
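To make the title problem concrete, here is a crude baseline that strips a leading title before merging counts. This is not the model-based approach mentioned above, just a rule-of-thumb sketch; the title list and the function names are invented for illustration and are nowhere near complete.

```python
from collections import Counter

# Illustrative, hand-picked list of titles; an assumption, not an exhaustive set.
TITLES = {"lord", "lady", "sergeant", "captain", "commander", "corporal", "mr", "mrs", "sir"}


def canonical_name(name):
    """Strip leading titles, e.g. 'Lord Vetinari' -> 'Vetinari'."""
    words = name.split()
    # Keep at least one word so a bare 'Lord' is not reduced to nothing.
    while len(words) > 1 and words[0].lower().rstrip(".") in TITLES:
        words = words[1:]
    return " ".join(words)


def merge_aliases(counts):
    """Re-key pair counts by canonical name and add up the merged weights."""
    merged = Counter()
    for (name_a, name_b), weight in counts.items():
        name_a, name_b = sorted((canonical_name(name_a), canonical_name(name_b)))
        if name_a != name_b:  # drop self-pairs such as (Lord Vetinari, Vetinari)
            merged[(name_a, name_b)] += weight
    return merged
```

This would fold “Sergeant Colon” into “Colon”, but it does nothing for first-name variants such as “Sam Vimes” versus “Vimes”, and nothing at all for pronouns.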
What’s Next?
There’s still a lot of interesting stuff to do: I’ve already run the extraction on my full library of Discworld novels and am planning to make a full network graph of all the characters. This will probably require some refinement and deduplication of characters as it will get very messy with the huge cast. Other ideas are incorporating the chapter number into the analysis and seeing how character occurrence and relationships change over time.
In general, there are still way too many interesting leads for me to follow in my limited time. If something neat does come out of it, I’ll be sure to post here.
References:
– Discworld on Wikipedia
– Source Code (very rough! but the main notebook might be interesting)