Sonnets by the numbers

To warm up for larger projects on poetry and data, I decided to look at sonnets by 7 poets and chart a few of the numbers. Anthologists usually present poets alphabetically or chronologically by the poets' dates of birth. I wondered, however, what might happen if poets were instead presented according to the word counts of select poems.

Here's the order if we arranged the sonnets by number of words, from fewest to most:
• 100 words -- Countee Cullen's "Yet Do I Marvel"
• 101 words -- Helene Johnson's "Sonnet to a Negro in Harlem"
• 109 words -- Gwendolyn Brooks's "the sonnet-ballad"
• 111 words -- Paul Laurence Dunbar's "Douglass"
• 115 words -- Claude McKay's "If We Must Die"
• 121 words -- Margaret Walker's "For Malcolm X"
• 126 words -- Robert Hayden's "Frederick Douglass"

If we arranged the poems by number of unique words, the lineup would look like this:
• 75 unique words -- Gwendolyn Brooks's "the sonnet-ballad"
• 80 unique words -- Robert Hayden's "Frederick Douglass"
• 80 unique words -- Helene Johnson's "Sonnet to a Negro in Harlem"
• 81 unique words -- Paul Laurence Dunbar's "Douglass"
• 82 unique words -- Countee Cullen's "Yet Do I Marvel"
• 82 unique words -- Claude McKay's "If We Must Die"
• 95 unique words -- Margaret Walker's "For Malcolm X"
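
Counts like these can be reproduced with a few lines of Python. The sketch below is only one way to do it: this post does not say which tool or tokenizing rule produced the numbers above, so the rule used here (lowercase everything, keep internal apostrophes and hyphens) is an assumption, and different rules will give slightly different counts. The names `tokenize`, `word_stats`, and `samples` are made up for illustration, and the sample lines are placeholders, not the sonnets themselves.

```python
import re

def tokenize(text):
    """Lowercase the text and pull out words.

    Assumption: a word is a run of letters that may contain internal
    straight apostrophes or hyphens (so "city's" is one word).
    """
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())

def word_stats(text):
    """Return (total words, unique words) for one poem."""
    tokens = tokenize(text)
    return len(tokens), len(set(tokens))

# Placeholder lines for illustration only; swap in the full text of each sonnet.
samples = {
    "sample a": "The rose is red, the rose is sweet.",
    "sample b": "O sea, o sky, o endless sea!",
    "sample c": "Night falls; the city's lamps awake.",
}

stats = {title: word_stats(text) for title, text in samples.items()}

# Two different orderings of the same poems.
by_total = sorted(stats, key=lambda title: stats[title][0])
by_unique = sorted(stats, key=lambda title: stats[title][1])

for title in by_total:
    total, unique = stats[title]
    print(f"{total} words, {unique} unique -- {title}")
```

Even with three placeholder lines, the two orderings differ: the sample with the fewest words overall has the most unique words, which is the same kind of reshuffling the two lists above show.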

The 7 sonnets contain a total of 783 words, and taken together they contain 444 unique words. I'm fascinated that Walker's poem has so many more unique words than the others: 95, or 13 more than the next-highest poems. Walker's poem, published during the late 1960s, is the most recent of the 7 sonnets in this set.
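
The arithmetic behind those two totals is worth spelling out, because they behave differently. The total word count is a plain sum of the per-poem counts, but the collective unique count is smaller than the sum of the per-poem unique counts, since a word that appears in several poems is counted only once across the set. The sketch below uses the per-poem figures listed in this post; the 444 figure is taken as reported, and the two short phrases at the end are placeholders used only to show the set-union idea.

```python
# Per-poem counts as listed in this post.
total_words = [100, 101, 109, 111, 115, 121, 126]
unique_words = [75, 80, 80, 81, 82, 82, 95]

corpus_total = sum(total_words)        # total words across the 7 sonnets
sum_of_unique = sum(unique_words)      # per-poem unique counts, simply added up
reported_corpus_unique = 444           # collective unique count reported in this post

# How many per-poem "unique" entries are words already counted in another poem.
repeated_across_poems = sum_of_unique - reported_corpus_unique

print(corpus_total, sum_of_unique, repeated_across_poems)

# The same effect with placeholder phrases: a set union counts shared words once.
a = set("the rose is red".split())
b = set("the sea is wide".split())
combined = a | b
print(len(a), len(b), len(combined))
```

Adding up the per-poem unique counts gives 575, so 131 of those entries are words that also appear in at least one other sonnet in the set.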

Initially, I wondered whether the number of unique words increased over time across groups of poets, but the findings on Dunbar disrupt that notion. His poem is the earliest one in my set, yet it falls in the middle of both lists (number of words and unique words).

Moving forward, I'll need to build a much larger dataset of sonnets to come to any solid conclusions. I'll also be interested in seeing how a single poet's sonnets change over time. What changed, for instance, between the sonnets Walker published in the late 1930s and early 1940s and those she published during the 1960s and 1970s?

I'm learning that text-mining provides one opportunity for thinking about sonnets in new ways, including how we might present and organize various poems.

