Graphing the sentences of your favorite authors
I’ve been reading through On the Road this week (which is great, by the way, even if you’re worried about being seen as the kind of person who reads On the Road), and was impressed by Kerouac’s characteristic stream-of-consciousness writing style. Curious about how it compared mechanically to other beloved authors of mine, I decided to fire up Python and do some basic analysis. Below are the numbers of syllables per sentence in the first hundred sentences of writing samples from a bunch of likable dudes:
(Note: as Reddit user /u/cincodenada points out, these should really be bar graphs since they aren’t reflecting time-series or continuous data. Unfortunately, I was having some issues getting the data to render as bar graphs at an acceptable speed.)
Without becoming a full-blown armchair literary critic, there are a bunch of fun observations you can make here:
- Dickens and Hemingway fulfill their stereotype of economical prose, with a few elongated digressions. In particular, Hemingway’s massive spike – from The Snows of Kilimanjaro – is worth quoting:
The cot the man lay on was in the wide shade of a mimosa tree and as he looked out past the shade onto the glare of the plain there were three of the big birds squatted obscenely, while in the sky a dozen more sailed, making quick-moving shadows as they passed.
- Kerouac’s overwhelming prose comes less from long sentences and more from long thoughts (though there are plenty of both).
- Retroactive vindication for thinking Milton was absurdly dense when I had to read him in British Literature. Averaging one hundred syllables a sentence is absurd.
I also put this into a table for a significantly more compact (yet significantly less pretty!) view of the data. Be sure to try clicking the headers:
author | words | sentences | words per sentence | flesch-kincaid | syllables per word |
---|---|---|---|---|---|
kerouac | 21829 | 1281 | 17 | 81.2110779239 | 1.28095652572 |
joyce | 7609 | 686 | 11 | 94.1033026679 | 1.20055197792 |
hemingway | 8930 | 601 | 14 | 89.8165789474 | 1.21522956327 |
milton | 5381 | 83 | 64 | 49.7284473146 | 1.08920275042 |
dickens | 5542 | 288 | 19 | 85.2575784915 | 1.20913027788 |
fitzgerald | 5651 | 323 | 17 | 80.5177101398 | 1.28915236241 |
nabokov | 9942 | 407 | 24 | 66.8328756789 | 1.36692818346 |
vonnegut | 2417 | 237 | 10 | 83.9083347125 | 1.33305750931 |
More proof that Milton is ridiculous and that Flesch-Kincaid is not exactly the most airtight of measurements – Vonnegut harder to read than Joyce? Really? (And Ulysses was the writing sample, mind you.)
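For reference, the flesch-kincaid column looks consistent with the standard Flesch reading-ease formula (higher means easier to read), fed the whole-number words-per-sentence figure shown in the table. Here is a minimal sketch under that assumption; the function name is illustrative and not taken from the original script:

```python
def flesch_reading_ease(words_per_sentence, syllables_per_word):
    """Standard Flesch reading-ease score; higher means easier to read."""
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Figures copied from the table above: (words per sentence, syllables per word).
rows = {
    "kerouac": (17, 1.28095652572),
    "joyce": (11, 1.20055197792),
    "milton": (64, 1.08920275042),
    "vonnegut": (10, 1.33305750931),
}

for author, (wps, spw) in rows.items():
    print(author, round(flesch_reading_ease(wps, spw), 2))
```

Since sentence length and syllables per word are the only inputs, Vonnegut's slightly longer words are enough to outweigh Joyce's slightly longer sentences.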
From a programming perspective, nothing of note – feel free to check out the script yourself (though you’ll have to grab the writing samples on your own, as I’m too lazy to upload them). I am, however, rather proud of my method of finding the number of syllables in a word:
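from nltk.corpus import cmudict

pronunciations = cmudict.dict()  # needs the data: nltk.download("cmudict")

def count_syllables(word):
    # Vowel phonemes end in a stress digit (0, 1 or 2), so counting them counts syllables.
    counts = [sum(p[-1].isdigit() for p in pron) for pron in pronunciations[word.lower()]]
    return max(counts)  # a word can have several pronunciations; take the longest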
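The digit-counting trick can be tried without downloading the CMU data. The sketch below is not the actual script: it swaps in a tiny hand-written dictionary in CMU style (the entries are illustrative), and the sentence splitter, the `limit` parameter and the choice to count unknown words as zero syllables are all assumptions made for the example.

```python
import re

# Hand-written stand-in for cmudict.dict(): word -> list of pronunciations.
# Entries are illustrative; vowel phonemes carry a trailing stress digit.
TOY_PRONUNCIATIONS = {
    "the": [["DH", "AH0"], ["DH", "IY0"]],
    "fire": [["F", "AY1", "ER0"], ["F", "AY1", "R"]],
    "burned": [["B", "ER1", "N", "D"]],
    "slowly": [["S", "L", "OW1", "L", "IY0"]],
    "it": [["IH1", "T"]],
    "died": [["D", "AY1", "D"]],
}


def count_syllables(word, pronunciations=TOY_PRONUNCIATIONS):
    """Largest syllable count over the known pronunciations; 0 if the word is unknown."""
    prons = pronunciations.get(word.lower(), [])
    return max((sum(p[-1].isdigit() for p in pron) for pron in prons), default=0)


def syllables_per_sentence(text, limit=100):
    """Naively split on ., ! and ? and total the syllables in the first `limit` sentences."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [
        sum(count_syllables(w) for w in re.findall(r"[A-Za-z']+", s))
        for s in sentences[:limit]
    ]


print(syllables_per_sentence("The fire burned slowly. It died."))
```

Swapping `TOY_PRONUNCIATIONS` for the real `cmudict.dict()` should give the same kind of per-sentence list that the graphs plot, though a proper sentence tokenizer would handle abbreviations and dialogue better than this regex.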