January 19, 2014

Graphing the Sentences of your Favorite Authors

data

I've been reading through On the Road this week (which is great, by the way, even if you're worried about being seen as the kind of person who reads On the Road), and was impressed by Kerouac's characteristic stream-of-consciousness writing style. Curious about how it compared mechanically to other beloved authors of mine, I decided to fire up Python and do some basic analysis. Below are the syllable counts per sentence for the first hundred sentences of writing samples from a bunch of likable dudes:

(Note: as Reddit user /u/cincodenada points out, these should really be bar graphs since they aren't reflecting time-series or continuous data. Unfortunately, I was having some issues getting the data to render as bar graphs at an acceptable speed.)

Without becoming a full-blown armchair literary critic, there are a bunch of fun observations you can make here:

  • Dickens and Hemingway fulfill their stereotype of economical prose, with a few elongated digressions. In particular, Hemingway's massive spike -- from The Snows of Kilimanjaro -- is worth quoting:

The cot the man lay on was in the wide shade of a mimosa tree and as he looked out past the shade onto the glare of the plain there were three of the big birds squatted obscenely, while in the sky a dozen more sailed, making quick-moving shadows as they passed.

  • Kerouac's overwhelming prose comes less from long sentences and more from long thoughts (though there is plenty of both).

  • Retroactive vindication for thinking Milton was absurdly dense when I had to read him in British Literature. Averaging one hundred syllables a sentence is absurd.

I also put this into a table for a significantly more compact (yet significantly less pretty!) view of the data:

| author | words | sentences | words per sentence | flesch-kincaid | syllables per word |
|---|---|---|---|---|---|
| kerouac | 21829 | 1281 | 17 | 81.2110779239 | 1.28095652572 |
| joyce | 7609 | 686 | 11 | 94.1033026679 | 1.20055197792 |
| hemingway | 8930 | 601 | 14 | 89.8165789474 | 1.21522956327 |
| milton | 5381 | 83 | 64 | 49.7284473146 | 1.08920275042 |
| dickens | 5542 | 288 | 19 | 85.2575784915 | 1.20913027788 |
| fitzgerald | 5651 | 323 | 17 | 80.5177101398 | 1.28915236241 |
| nabokov | 9942 | 407 | 24 | 66.8328756789 | 1.36692818346 |
| vonnegut | 2417 | 237 | 10 | 83.9083347125 | 1.33305750931 |

More proof that Milton is ridiculous and that Flesch-Kincaid is not exactly the most airtight of measurements -- Vonnegut is harder to read than Joyce? Really? (And Ulysses was the writing sample, mind you.)
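
The scores in the Flesch-Kincaid column look like the Flesch Reading Ease variant, where a higher score means easier reading; treat that as an assumption, since the script isn't reproduced here. A minimal sketch of the standard formula (the function name `flesch_reading_ease` is made up for illustration):

```python
def flesch_reading_ease(words, sentences, syllables_per_word):
    """Standard Flesch Reading Ease score; higher means easier to read.

    Assumption: this is the variant behind the table's column.
    """
    words_per_sentence = words / float(sentences)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Kerouac's row: 21829 words, 1281 sentences, ~1.281 syllables per word.
kerouac = flesch_reading_ease(21829, 1281, 1.28095652572)
print(round(kerouac, 2))
```

Plugging in Kerouac's row gives about 81.17, close to the 81.21 in the table; the small gap presumably comes from rounding or counting differences. The formula also shows why Vonnegut loses to Joyce: syllables per word carry a heavy weight, and short sentences only help so much.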

From a programming perspective, nothing of note -- feel free to check out the script yourself (though you'll have to grab the writing samples on your own, as I'm too lazy to upload them). I am, however, rather proud of my method of finding the number of syllables in a word:

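from nltk.corpus import cmudict

pronunciations = cmudict.dict()  # word -> list of pronunciations (phoneme lists)

def syllables(word):
    # In CMUdict, vowel phonemes end in a stress digit (0, 1, or 2).
    is_vowel = lambda phoneme: phoneme[-1].isdigit()
    num_syllables = lambda phonemes: sum(map(is_vowel, phonemes))
    # A word may have several pronunciations; take the longest one.
    return max(map(num_syllables, pronunciations[word.lower()]))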
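
For the graphs themselves, the per-sentence totals could be computed along these lines. This is a sketch, not the script from the post: the sentence splitter is a naive regex, `rough_syllables` is a hypothetical vowel-run heuristic standing in for the CMUdict lookup so the snippet runs without any NLTK data, and every name here is invented for illustration.

```python
import re


def rough_syllables(word):
    """Stand-in heuristic (assumption): count runs of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def syllables_per_sentence(text, count=rough_syllables, limit=100):
    """Return syllable totals for the first `limit` sentences of `text`."""
    # Naive split: sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())[:limit]
    totals = []
    for sentence in sentences:
        words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", sentence)
        if words:  # skip fragments with no actual words
            totals.append(sum(count(w) for w in words))
    return totals


print(syllables_per_sentence("The cat sat. I like big dogs!"))
```

Passing a dictionary-backed counter as `count` swaps the heuristic out for something more accurate, at the cost of having to decide what to do with words the dictionary doesn't know.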
