# Sentence length across ELTeC collections and Gutenberg Fiction
**Christof Schöch (Trier, Germany)** *** Distant Reading Closing Conference, April 21-22, 2022
https://christofs.github.io/krakow22/ ***
::
- Hi everyone, it's so great to speak here.

--

### Overview

1. [The issue with sentence length](#/2)
2. [Methods used](#/3)
3. [Findings from Gutenberg Fiction](#/4)
4. [Findings from ELTeC collections](#/5)
5. [Influence of direct speech](#/6)
6. [Conclusion](#/7)

--

## The issue with sentence length

---

### Why care about sentence length?

* It is a proxy for syntactic complexity and one aspect of readability
* It probably interacts with other features of texts, like narrator / character speech
* It might also vary with first-person vs. third-person narration
* People have long assumed that sentence length is declining
  * but is it true, generally?
  * and specifically for different European literatures?

---

### Biber and Conrad: English fiction
* 500 words from each of 17 novels, 1720-1989
* Pretty clear decline, but very small sample
* Reference: Biber and Conrad 2009

::
- Nice chronological spread
- But tiny sample

---

### Hathi Trust
* Fiction (red) vs. non-fiction (blue)
* Clear decline for fiction between ca. 1820 and 1940
* Source: Hathi 1M dataset; Bagga and Piper 2022

::
- Typical of DH / CLS: much larger dataset
- Interesting that fiction declines and non-fiction does not

--

## Method(s)

---

### Sampling

* For ELTeC corpora, the full corpus is used
* For Gutenberg Fiction, random sampling is performed, partly with stratification by decade

---

### How to establish average sentence length?

* Basic approach: divide the number of tokens by the number of sentences
* Either: Using level2 encoding, use @type="SENT" as marker of sentence boundaries
* Or: Using level1 encoding or plain text, use language-specific spaCy tokenizer + sentencizer
* Strong correlation between these two approaches
* See Viera, Picoli and Mendes 2018 for a comparison of approaches
* Scatterplot: Novels by publication year and average sentence length

---

### Test for significant difference

* Equal-sized samples from an early and a later time slice
* Density plot: for visual check of overlap and range
* Significance test: Mann-Whitney U test

--

## Findings based on Gutenberg Fiction

---

### What do I mean by 'Gutenberg Fiction'?

* Sample from the Project Gutenberg corpus
* Downloaded using tool by Gerlach and Font-Clos 2018: 63,208 items
* Filtered out everything except English-language narrative fiction: 18,738 texts
* Established year of publication for many of them using several heuristics
  * Information from Wikidata
  * Information from Worldcat
  * Years of author's birth and death
  * Sanity checks

::
- The crazy thing is that year of publication is not included on PG!

---

### Gutenberg Fiction (100, 1840-1920)
* Sample size: 100 novels
* Suitable for comparison with ELTeC-eng
* Significant difference in average sentence length

---

### Gutenberg Fiction (1150, 1820-1920)

* Sample size: 1150 novels
* Longer period, chronologically-stratified sample
* Significant difference in average sentence length

---

### Gutenberg Fiction (4080, 1820-1940)

* Sample size: 4080 novels
* Longer period, unbalanced sample
* Significant difference in average sentence length

--

## Findings based on ELTeC

::
- Smaller corpora, but many more languages

---

### ELTeC-eng (1840-1920)
* Corpus size: 100 novels
* Standard period, quite even spread
* Significant decline in average sentence length
* Overall very similar to Gutenberg Fiction results

---

### ELTeC-deu (1840-1920)

* Corpus size: 100 novels
* Standard period, quite even spread
* Significant decline in average sentence length

---

### ELTeC-hun (1840-1920)

* Corpus size: 100 novels
* Standard period, quite even spread
* Significant decline in average sentence length

---

### ELTeC-por (1840-1920)

* Corpus size: 100 novels
* Standard period, quite even spread
* Significant decline in average sentence length

---

### ELTeC-fra (1840-1920)

* Corpus size: 100 novels
* Standard period, quite even spread
* No significant decline in average sentence length (!)

---

### French (ELTeC+, 1750-2010)
* Corpus size: 1079 novels (fra + ext1 + ext2 + cligs-rv)
* Enlarged period, with gap, uneven spread
* Significant decline in average sentence length

--

## Influence of direct speech

::
- Remember the results from the Hathi Trust dataset
- The fiction declined in sentence length; the non-fiction did not
- Could this be connected to the increasing amount of direct speech?

---

### The case of French (1): overall
* Overall proportion of character vs. narrator speech
* 56% narrator, 37% character speech

::
- Reminder: No significant decrease over time during the ELTeC period
- But: Significant difference between 1750-1800 vs. ELTeC period
- Sample: 10 sentences from each of the 100 novels in ELTeC-fra

---

### The case of French (2): sentence length

* Average sentence length by speech type
* Character speech typically does have shorter sentences

---

### The case of French (3): per decade
* Proportion of speech type per decade
* Some variation, but no clear trend

--

## Conclusions

---

### Sentence length

* Sentences do get shorter over time, at least:
  * in novels / narrative fiction
  * for several languages
  * between 1840 and 1920
* Further data needed for link to direct speech
  * French: stable character speech proportion may explain stable sentence length
  * German, English, Hungarian, Portuguese and many more: lack of data

---

### More general issues

* We need larger datasets, including for the 20th century, in multiple languages
* For existing larger datasets, we need much better metadata (publication data, narrative perspective, subgenre labels)
* For ELTeC, annotation of character speech (or: modes of enunciation) would be important

---

### Thank you!

---

### References
* Bagga, Sunyam, and Andrew Piper. 2022. "HATHI 1M: Introducing a Million Page Historical Prose Dataset in English from the Hathi Trust". Harvard Dataverse. https://doi.org/10.7910/DVN/HAKKUA.
* Biber, Douglas, and Susan Conrad. 2009. Register, Genre, and Style. Cambridge Textbooks in Linguistics. Cambridge, UK; New York: Cambridge University Press.
* Byszuk, Joanna, Michał Woźniak, Mike Kestemont, Albert Leśniak, Wojciech Łukasik, Artjoms Šeļa, and Maciej Eder. 2020. "Detecting Direct Speech in Multilingual Collection of 19th-Century Novels". In Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages, 100–104. Marseille, France: European Language Resources Association (ELRA). https://www.aclweb.org/anthology/2020.lt4hala-1.15.
* Gerlach, Martin, and Francesc Font-Clos. 2018. "A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics". arXiv:1812.08092 [physics], December. http://arxiv.org/abs/1812.08092.
---

### Data and code

* Data and code: https://github.com/christofs/sentence-length
* Direct speech analysis and data: https://github.com/dh-trier/directspeech2022
* A short write-up: https://dragonfly.hypotheses.org/1152
* Project Gutenberg Metadata: https://github.com/dh-trier/pg-fiction
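
---

### Sketch: average sentence length and significance test

The basic approach from the methods slides (tokens per sentence, then a Mann-Whitney U test on an early and a later sample) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code from the repositories listed above. The regex-based sentence splitter and tokenizer are crude stand-ins for the spaCy pipeline or the level2 sentence markers, the function names `average_sentence_length` and `mann_whitney_u` are hypothetical, and the p-value uses the normal approximation with tie correction, which is only reasonable for larger samples.

```python
import math
import re

# ASSUMPTION: naive, language-agnostic stand-ins for a real tokenizer and
# sentence splitter. They will mis-split abbreviations such as "Mr." and
# count "it's" as two tokens.
SENTENCE_END = re.compile(r"[.!?]+(?:\s+|$)")
TOKEN = re.compile(r"\w+")


def average_sentence_length(text):
    """Return the number of tokens divided by the number of sentences."""
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    if not sentences:
        return 0.0
    n_tokens = sum(len(TOKEN.findall(s)) for s in sentences)
    return n_tokens / len(sentences)


def mann_whitney_u(xs, ys):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U, p), where U is the smaller of the two U statistics.
    No continuity correction is applied.
    """
    n1, n2 = len(xs), len(ys)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of equal values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    u = min(u1, u2)

    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # all values identical: no evidence of a difference
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u, p


if __name__ == "__main__":
    # Toy strings for illustration only, not taken from any corpus.
    early = ["It was a long and winding sentence that went on. And on it went."]
    later = ["Short one. Another. And a third!"]
    early_lengths = [average_sentence_length(t) for t in early]
    later_lengths = [average_sentence_length(t) for t in later]
    print(early_lengths, later_lengths)
    print(mann_whitney_u([21.0, 19.5, 24.2, 18.9], [15.1, 16.3, 14.8, 17.0]))
```

In practice one would compute one average per novel, draw equal-sized samples from the early and the later time slice, and pass the two lists of averages to the test. A library implementation such as SciPy's `mannwhitneyu` is likely the better choice for real analyses.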