What follows are the methods for a project on recordings of Black women poets, conducted by Marit MacArthur, Lee M. Miller, Xiaoliu Wu, Qin Ding, and me.
Selection of Recordings
For this study, we selected two recordings each for 101 poets.1 The variety of the recordings—in terms of recording quality and age, media format (mostly MP3s or MP4s), date of recording, and the variety of venues, audiences, and size and character of the physical spaces in which the readings took place—would be a linguist’s experimental design nightmare. A typical linguistic study aims to use the same recording equipment in the same setting, such as a sound booth, and make recording quality as consistent as possible, both worthy and desirable goals. However, given the wealth of recordings online and our interests in the possible influences of contextual factors on a poet’s performance, it would not make experimental sense to use only recordings from one venue, of one media format, before one audience, etc.
Nevertheless, we want to distinguish between the contextual factors that might well affect a poet’s vocal performance choices (the audience, the venue, the physical character of the space) and the technical factors over which the poet would have no control but which could influence the prosodic data (the recording quality and age, media type, and date of recording). For instance, performance in a relatively noisy space (e.g. with many background talkers) may cause a performer to unconsciously or half-consciously alter vocal properties such as pitch, timbre, and timing, a phenomenon known as the Lombard effect, which improves listener comprehension. Differences in background noise and microphone idiosyncrasies can also interact with the prosodic measures themselves, for instance by biasing estimates of intensity or volume. We therefore try to limit the influence of such technical factors (e.g. microphone quality), for instance by our use of state-of-the-art, noise-tolerant pitch estimates, so that relevant effects (such as venue type) can be clearly discerned. As much as possible, we control for and consider these factors before drawing even the most tentative conclusions about individual recordings, the dataset as a whole, and trends in poetry performance among Black women poets in the U.S.
Audience & Recording Metadata
Self-recorded (no audience present): 15 (13 for Poets.Org)
Studio Audience: 57 (including 5 for PBS and 1 for NPR)
Live with an Academic Audience: 48 (likely to be predominantly white audiences, at various universities, 2 from AWP)
Live with a Spoken Word Audience: 45 (unlikely to be predominantly white, including 16 from slam competitions, 5 from DefJam, 5 from Camp Bar in St. Paul, Minnesota)
Live with a General Public Audience: 25 (galleries, museums, libraries, bookstores, urban poetry reading series, a church, one at the Apollo Theatre, one at the Library of Congress [Amanda Gorman], and one in front of the Spirit of Detroit statue [jessica Care moore])
Live at Book Festivals: 4
Live at a Presidential Inauguration: 3
(Maya Angelou for President Bill Clinton in 1993, Elizabeth Alexander for President Barack Obama in 2009, and Amanda Gorman for President Joe Biden in 2021)
Academic Venue: 41
Participatory Setting (Nightclubs, Spoken Word spaces, and auditoriums hosting slam poetry contests; the audience audibly engages the poet and poem with shouting affirmations during the recitations/performances): 46
Public Reading Space (Public libraries, poetry festivals, galleries, etc., where poets typically read from books and where the audience does not audibly engage the poet during the reading): 39
Self-recorded (usually the poet’s home): 22
Studio (radio, television, film, etc.): 45
Unknown Venue: 7
Presidential Inauguration outdoors: 3
The 101 Poets: Background Metadata
As we selected 101 poets, we gathered some basic background information about each poet, including: birth year, city and state/country of birth, the region in which they grew up, and the colleges where they pursued undergraduate and graduate study—including whether they attended private or public colleges or universities, an Ivy League institution, an HBCU, or the Iowa MFA program. (For graduate school, we did not distinguish among degrees; though many poets received MFAs, some received a law degree, a PhD, etc.) We noted whether they belonged to Cave Canem, a non-profit literary organization “founded by Toi Derricotte and Cornelius Eady in 1996 to remedy the under-representation and isolation of African American poets in the literary landscape,” which we interpret as especially benefiting poets working within the print-based, academically sponsored literary landscape.
We also noted whether they had served as U.S. Poet Laureate, and whether they had received major literary awards (the Pulitzer Prize for Poetry, the Ruth Lilly Poetry Prize, the Robert Frost Medal, the Ruth Lilly and Dorothy Sargent Rosenberg Poetry Fellowships, the Kate Tufts Discovery Award, the Kingsley Tufts Poetry Award, the Jackson Poetry Prize, the MacArthur Fellowship, the National Book Award for Poetry, the National Book Critics Circle Award, and the Yale Series of Younger Poets award). In their article, “Who Gets to Be a Writer?” Claire Grossman, Stephanie Young, and Juliana Spahr point out that “a Harvard degree boosts the odds of winning for all writers, but a Black writer who attended Harvard is 19 times more likely to win, a significant odds increase in comparison to Black writers who did not attend Harvard.” They go on to note that “a Black writer with an elite degree is about 13 percent more likely to win a prize than a Black writer without an elite degree.”
We also categorized the poets as Spoken Word or not. For the study, we follow the Poetry Foundation’s definition of Spoken Word, with some reservations:
A broad designation for poetry intended for performance. Though some Spoken Word poetry may also be published on the page, the genre has its roots in oral traditions and performance. Spoken word can encompass or contain elements of rap, hip-hop, storytelling, theater, and jazz, rock, blues, and folk music. Characterized by rhyme, repetition, improvisation, and word play, Spoken Word poems frequently refer to issues of social justice, politics, race, and community. Related to slam poetry, Spoken Word may draw on music, sound, dance, or other kinds of performance to connect with audiences.
The entire genre of poetry, of course, “has its roots in oral traditions and performance,” and much poetry that is not Spoken Word is also “[c]haracterized by rhyme, repetition… and word play.” However, Spoken Word poetry, as we understand it, is written for performance first and the page second. This is not at all to imply that Spoken Word poets do not care about publication. Spoken Word poets typically perform or recite memorized poems that frequently take on “issues of social justice, politics, race, and community,” sometimes but not always use rhyme, and often use repetition. Typically they also recite in a more animated style, and use hand gestures and sometimes bodily movement, compared to print-based poets, who typically stand still behind a podium and read from a book. Seven of the older poets we assigned to the Spoken Word category we call proto-Spoken Word: Maya Angelou, Wanda Coleman, Jayne Cortez, Nikky Finney, June Jordan, Pat Parker, and Sonia Sanchez. These elder poets, some of whom have a background in theatre and a few of whom participated in the Black Arts movement, developing styles in participatory settings with mainly Black audiences, have influenced younger generations to do the same.
Spoken Word & Proto-Spoken Word Poets: 36
Non-Spoken Word Poets: 65
Major Award Winners: 31
Private college/university graduates (at least one private for undergraduate or graduate): 36
Public college/university graduates: 65
Ivy League graduates: 20
Harvard University graduates:
Eve L. Ewing, PhD after BA at University of Chicago
Amanda Gorman, BA
Robin Coste Lewis, MA in Theological Studies
Tracy K. Smith, BA
Simone White, Harvard Law, after BA at Wesleyan University
June Jordan, BA
Harmony Holiday, MFA after BA at UC Berkeley
Audre Lorde, MA in Library Science
Morgan Parker, BA
Claudia Rankine, MFA after BA at Williams College
Tracy K. Smith, MFA
Crystal Williams, MFA
Carolyn Beard Whitlow, BA
Elizabeth Alexander, BA
Alysia Nicole Harris, MA, Linguistics
Zora Howard, BA
Emi Mahmoud, BA
University of Pennsylvania:
Elizabeth Alexander, PhD
Alysia Nicole Harris, BA, Linguistics
Marilyn Nelson, MA
Airea D. Smith, BS in Economics
Nicole Terez Dutton, BA and MFA
Jamila Woods, BA
Carolyn Beard Whitlow, MFA
Brenda Marie Osbey
Fisk University: Nikki Giovanni and Vievee Francis
Hampton University: Sonya Renee Taylor
Howard University: Cheryl Clarke, Yona Harvey, and Alison C. Rollins
Jackson State University: Treasure Redmond
Paine College: Kamilah Aisha Moon
Spelman College: Bettina Judd
Talladega College: Nikky Finney
Iowa MFA graduates:
Margaret Walker, Rita Dove
Signal Processing and Statistical Methods
In the distant listening phase of the analysis, we processed the recordings using Voxit, an open-source toolkit that allows automated analysis of vocal parameters in recorded speech, specifically data related to pitch, timing and intensity patterns. While Voxit can provide data about many vocal parameters (more than 50), for this study we generated data for 16 prosodic measures related to pitch, timing and intensity. And then, after an exploratory phase in which we mapped correlations between background metadata about the poets and Voxit data, we chose just five prosodic measures to analyze for statistical significance: Average Pitch, Pitch Speed, Intensity Speed, Average Pause Length, and Dynamism.
The choice of these five was motivated in part by our prior work with the independent data set in “Beyond Poet Voice,” and also by the exploratory phase with this data. For instance, three other measures we had analyzed in “Beyond Poet Voice” each turned out to have a close correlation with one of the five we focused on here. A close correlation (.93) was found between Average Pause Length and Rhythmic Complexity of Pauses, another timing measure of interest in “Beyond Poet Voice.” Similarly, Average Pitch Speed and Average Pitch Acceleration had a close correlation (.70), and Dynamism turned out to have a fairly close correlation (.51 to .62) with Pitch Range, a measure we analyzed in “Beyond Poet Voice.” Pitch Entropy, also a measure of interest for expressive pitch, is a variable in the formula for Dynamism, so we did not test it independently. All of these measures are defined below.
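As a rough illustration of the kind of check described here, the following sketch computes a Pearson correlation between two measures in plain Python. The variable names and all the numbers are invented for illustration and are not Voxit output or study data; the source does not say which correlation coefficient was used, so Pearson's r is an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for six recordings (invented, not study data).
pause_length = [0.42, 0.55, 0.38, 0.71, 0.60, 0.47]
pause_complexity = [1.9, 2.4, 1.7, 3.1, 2.5, 2.2]

r = pearson_r(pause_length, pause_complexity)
print(f"r = {r:.2f}")
```

When two measures track each other this closely, testing both adds a statistical test without adding much information, which is one reason to keep only one of them.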
Next we applied statistical analysis to the Voxit data about the 203 recordings. We used linear mixed models to explore the relationship between background metadata characteristics about the 101 poets and the recordings, and the Voxit data for each of the 203 recordings on the five prosodic measures. The background metadata characteristics included: the age of the poet (based on birth year), whether they are a Spoken Word poet or not, where the poet grew up (usually a region of the U.S.), whether the poet has an undergraduate degree, whether the poet has a graduate degree, whether the poet attended at least one private college or university or only attended public colleges or universities, whether the poet attended an Ivy League institution, whether the poet attended the Iowa MFA program, whether the poet attended a Historically Black College or University (HBCU), whether the poet was a U.S. Poet Laureate, whether the poet had received at least one Major Award, whether the poet is affiliated with the Cave Canem Foundation (as a fellow or faculty), the year of the recording, the recording type, the audience type, the venue type, and whether the venue was a poetry slam or not.
A linear mixed model is a mode of statistical analysis, specifically an extension of simple linear models to allow for both fixed and random effects. Let’s unpack that. If each of your data samples – let’s say Poet A’s average pitch – is independent from the other samples – Poet B’s average pitch, and Poet C’s, Poet D’s, etc. – then a simple linear model suffices. In that case, we have one or more explanatory variables or metadata characteristics (age, graduate degree, HBCU, etc.) which might help explain the outcome variable: Average Pitch. Across all poets, each explanatory variable or metadata characteristic is called a covariate.
When a model is made to “fit” the data as best it can, each covariate’s “weight” or coefficient describes how much the outcome variable changes with changes in that covariate, usually while also considering all the other covariates’ influence. Basically it means the explanatory variable or metadata characteristic (age) and the outcome value (Average Pitch) “covary” or are well correlated, either positively (coefficient > 0) or negatively (coefficient < 0). So, if a poet’s age has a strong effect on Average Pitch, with older age being associated with lower pitch, then the age covariate will have a large negative weight. And provided the variance is also low enough, meaning that the correlation is strong and clear, then the age covariate will be statistically significant and we conclude it does indeed have an effect on Average Pitch.
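To make the idea of a covariate's weight concrete, here is a minimal sketch of a simple linear model with a single covariate, fitted by ordinary least squares in plain Python. The ages and pitch values are invented for illustration, and the helper name fit_line is hypothetical; this is not the analysis code used in the study.

```python
def fit_line(x, y):
    """Ordinary least squares with one covariate: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data (invented, not study data).
ages = [25, 35, 45, 55, 65, 75]            # poet age in years
pitch_hz = [220, 214, 209, 205, 198, 194]  # Average Pitch in Hz

intercept, slope = fit_line(ages, pitch_hz)
print(f"slope = {slope:.2f} Hz per year of age")
```

A negative slope here corresponds to the "large negative weight" described above: in this toy data, each additional year of age is associated with a slightly lower Average Pitch. Whether such a weight is statistically significant depends additionally on the scatter of the data around the fitted line.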
As mentioned above, the simple linear model suffices when all the data samples of the outcome variable (Average Pitch, in our example here) are independent. However, a mixed model is appropriate when there is some intrinsic structure or hierarchy to the data that is not controlled for or captured by the covariates or metadata characteristics, for instance when we have data samples from each individual poet over multiple time points, such as two recordings. In this case, Poet A’s Average Pitch measurement for recording #1 is likely correlated with Poet A’s Average Pitch measurement for recording #2 (when the poet is somewhat older), because after all it’s the same poet, even if she is older or reading in a different context. And this correlation may be qualitatively different from that which the covariates can explain. So we have two levels of effects: 1) the covariate’s effect across the entire population of poets, in this case the effect of age – this is called the “fixed” effect, and 2) differences specific to an individual poet across multiple recordings – this is called the “random” effect, because it reflects additional, uncontrolled variance that’s not part of our fixed effect. Unlike a simple linear model, which would treat the individual poet variability as noise, a mixed effects model captures both of those levels.
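In symbols, one common way to write such a model is the random-intercept form below. The notation is ours, introduced only for illustration, and the random-intercept structure is an assumption about the simplest model consistent with the description, not a statement of the exact specification used in the study. Here $i$ indexes poets, $j$ indexes a poet's recordings, $y_{ij}$ is the outcome (e.g. Average Pitch), and $x_{ij}$ is a covariate (e.g. age at the time of recording):

$$
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2).
$$

The coefficient $\beta_1$ is the fixed effect shared by all poets, $u_i$ is the random effect specific to poet $i$, and $\varepsilon_{ij}$ is the leftover noise. Under these assumptions, two recordings by the same poet share the same $u_i$, so after accounting for the covariate they are correlated by $\sigma_u^2 / (\sigma_u^2 + \sigma^2)$, whereas recordings by different poets are uncorrelated.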
Using a mixed effects model for hierarchical (or otherwise structured, correlated) data gives us two advantages over the simple linear model. First, it gives us the same fixed “population” effect that we’d get from the simple model, which for us is how our poets’ background metadata characteristics relate to the five prosodic measures. But because it also models the random (within-subject, with more than one recording by the same poet) effect explicitly, the estimate of all the leftover, unexplained noise or variance is more accurate and generally lower. Lower unexplained variance means our fixed-effect hypothesis test is potentially more powerful. Or to put it another way, with a lower and more accurate variance estimate, we can obtain results with higher confidence. The second benefit of a mixed model versus a simple linear model is that it explicitly recognizes any trends within-subject, as with more than one recording by the same poet, in the random effect, for instance if Average Pitch tends to increase on the second performance relative to the first. So it gives us more refined, secondary interpretations than a simple fixed model could support.
In this project, we are primarily interested in the fixed or population effect of how poets’ background metadata characteristics might relate to their performance style. Specifically, does a poet’s educational background, region of upbringing, age, membership in Cave Canem, Spoken Word status, the recording type, or the audience type, etc., have a relationship to any of the five Voxit prosodic measures? Although we are not hypothesizing any within-subject (random) effects, we still use a mixed model to capture any within-subject correlations, so the significance of our fixed effects is evaluated appropriately.
Avoiding the Multiple Comparisons Problem and False Discovery Rate
Investigating whether any of the Voxit prosodic measures relate to this panoply of background metadata entails a large number of statistical tests. This presents a challenge known as the multiple comparisons problem, as it radically increases the likelihood of false positives, i.e. declaring an effect significant when it is truly not (i.e. the null hypothesis holds). In other words, the more tests an investigator conducts, the more likely it is their conclusions are wrong!
To illustrate the magnitude of the problem, consider the simple case when we perform a single hypothesis test with a significance level of 0.05: we will reject the null hypothesis if the p-value, that is, the probability of observing data at least as extreme as ours if the null hypothesis were true, is less than 0.05. This caps the probability of falsely rejecting a true null hypothesis at 0.05. But in most real-world situations, we have more than one hypothesis we would like to test. The naïve approach is to test all of them with the usual significance level, i.e. 0.05, and conclude that all tests with p-value < 0.05 reflect real effects. Now consider a situation where we are testing 10 hypotheses and there is no real effect at all (all the null hypotheses are true). For simplicity we assume the tests are independent of one another. What is the probability that our naïve approach will find at least one “significant” effect, i.e. a false positive?
• If, as is conventional, the probability of finding a false positive result for one test is $p = 0.05$,
• then the probability of not finding a false positive for one test is $1 - p = 0.95$,
• and the probability of NOT finding a false positive for $N$ tests (so we multiply all the probabilities) is $(1 - p)^N$.
• This means the probability of finding a false positive for $N$ tests is $P(\text{false positive}) = 1 - (1 - p)^N$.
• In our case of 10 tests, the probability that we will find a false positive is $1 - (1 - 0.05)^{10} \approx 0.4$.
In other words, the naïve approach will find a false positive with a 40% probability. If we do 100 tests, we are virtually guaranteed to get a false positive, with 99.4% probability. It is still common in many fields of inquiry, including in some areas of the digital humanities, to report p-values for numerous tests without correction for false positives. This convention lacks rigor, inevitably leading to conclusions unsupported by the evidence, and may undermine progress in the fields of cultural analytics and the digital humanities.
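The figures quoted above are easy to verify. The short sketch below evaluates the same formula in Python; the function name is ours, and the calculation assumes, as the text does, that the tests are independent and that every null hypothesis is true.

```python
def familywise_false_positive_rate(alpha, n_tests):
    """P(at least one false positive) across n independent tests when every null is true."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 10, 100):
    print(n, round(familywise_false_positive_rate(0.05, n), 3))
# prints:
# 1 0.05
# 10 0.401
# 100 0.994
```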
We therefore strongly advocate using some form of correction. The classical correction for multiple comparisons is a Bonferroni correction. This simply imposes a lower, more stringent significance threshold by dividing the nominal, single-test threshold by the number of tests. If our nominal threshold is 0.05 and we conduct 10 tests, then we would instead require p < 0.05/10 = 0.005 to guard against false positives. The good news is that Bonferroni does guard against false positives. Specifically, Bonferroni controls what is called the Family-Wise Error Rate (FWER): the probability of finding one or more false positives when performing multiple tests. The bad news is that for large numbers of tests or if the tests are correlated, it can be very conservative, leading to many false negatives. In that case, we may miss many real effects instead.
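A minimal sketch of the Bonferroni rule, applied to a set of ten p-values that we have invented purely for illustration (they are not results from this study):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return one True/False flag per test, using the threshold alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Hypothetical p-values from 10 tests (invented for illustration).
p_values = [0.001, 0.004, 0.012, 0.020, 0.030, 0.040, 0.200, 0.450, 0.700, 0.900]

naive = [p < 0.05 for p in p_values]        # uncorrected, single-test threshold
corrected = bonferroni_significant(p_values)

print(sum(naive), "tests pass the uncorrected threshold of 0.05")
print(sum(corrected), "tests pass the Bonferroni threshold of 0.05 / 10")
```

With these made-up values, six tests look significant without correction but only two survive Bonferroni, which illustrates both the protection it offers and how conservative it can be.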
One widely used technique to correct for multiple comparisons without missing too many real effects is to take a different approach to significance entirely, called False Discovery Rate (FDR). Whereas Bonferroni correction controls the probability of finding one or more false positives when performing multiple tests (FWER), FDR controls the proportion of false positives among all the statistically significant results. Particularly with large numbers of tests, it may be more reasonable to tolerate a certain proportion of false positives (say 10% or 0.1) in order to maintain enough power to find the true positives. This is the approach we adopt below.
There are numerous ways of calculating FDR that differ in their assumptions and complexity. We use the function qvalue from the R package of the same name, which follows the well-established work of J.D. Storey. We should note that technically, many FDR procedures assume independence or weak dependence among the p-values from the multiple tests. Although this condition may not always hold, the most common FDR procedures have been shown to be robust across a wide variety of common dependencies and data types, including in genomics, neuroimaging, and many other “big data” sciences. Moreover, alternate or modified FDR procedures are available when there are strong, known dependencies. We are therefore confident recommending FDR for multiple comparisons correction in the digital humanities and cultural analytics.
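To give a feel for how an FDR procedure works, the sketch below implements the Benjamini-Hochberg step-up procedure in plain Python. Note that this is a different, simpler FDR method than Storey's q-value approach used in this study; we show it only because it fits in a few lines. The p-values are invented for illustration and the function name is ours.

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Benjamini-Hochberg step-up procedure: one True/False flag per test."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose p-value is <= (k / m) * fdr.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            largest_k = rank
    # Every test ranked at or below that k is declared significant.
    significant = [False] * m
    for idx in order[:largest_k]:
        significant[idx] = True
    return significant

# Hypothetical p-values from 10 tests (invented for illustration).
p_values = [0.001, 0.004, 0.012, 0.020, 0.030, 0.040, 0.200, 0.450, 0.700, 0.900]

flags = benjamini_hochberg(p_values, fdr=0.10)
print(sum(flags), "of", len(p_values), "tests are significant at an FDR of 0.10")
```

With these made-up values, six tests are declared significant at an FDR of 0.10, on the understanding that roughly one in ten such discoveries may be false. A Bonferroni threshold of 0.05/10 = 0.005 would retain only the two smallest p-values from the same list.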
In this study, we aim to highlight the multiple comparisons problem and demonstrate how to deal with it rigorously. We therefore present the results below with both Bonferroni correction and FDR, so the reader can appreciate how the different assumptions of these approaches inform one’s conclusions.
Typically, one would adopt only a single correction method a priori, based on the type of hypothesis test (FWER vs FDR) and tolerance for false positives vs false negatives.
1. For Jayne Cortez, we analyzed just one recording, of her poem “Rape,” because we only found one recording online without musical accompaniment. Musical accompaniment, unfortunately, makes it nearly impossible to separately analyze the pitch and timing of a voice. For Amanda Gorman, we analyzed three recordings, adding “The Hill We Climb” from President Joe Biden’s Inauguration in 2021 when she was chosen to read at the event after we had compiled the rest of the recordings. We also then added a third recording by Maya Angelou, “On the Pulse of Morning” from President Bill Clinton’s Inauguration from 1993, so that we could make comparisons among the three Black women poets chosen to read at presidential inaugurations, as we had already included Elizabeth Alexander’s recording of “Praise Song for the Day” from President Barack Obama’s Inauguration in 2009.