Here is an example of using topic modeling on a real corpus. I used Mallet to extract 25 topics from the 2,473 texts in the ECCO collection of eighteenth-century English texts.
http://devadorner.northwestern.edu/eccotopics/
What do these tell you? Some are structural, e.g., topic 22 seems to capture foreign language words, and topic 3 captures titles of people. Topic 14 captures a “classical literature” theme with words such as Homer, Achilles, war, and Greece. Topic 16 captures religious themes with words such as god, lord, and religion.
What do you think about some of the other topics? Do they make sense? Do they tell you something about the works in the corpus that might not be obvious otherwise? Would the topics prove useful in automatically categorizing texts for indexed searches?
Click on a topic’s list of words to see the top-ranked documents in which that topic appears. The ranking is based on the percentage of the words comprising the topic that appear in a given work.
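The per-topic ranking described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the site: it assumes a document-to-topic-proportions table of the kind Mallet’s doc-topics output provides (one row per document, one proportion per topic), and the titles and numbers below are invented toy data.

```python
# Toy doc-topics table: each document maps to the share of its words
# assigned to topics 0..3 (4 topics instead of 25, to keep it small).
# All titles and proportions are invented for illustration.
doc_topics = {
    "Work A": [0.70, 0.10, 0.15, 0.05],
    "Work B": [0.05, 0.60, 0.20, 0.15],
    "Work C": [0.40, 0.05, 0.45, 0.10],
}

def top_documents(doc_topics, topic, n=3):
    """Return up to n (document, proportion) pairs, ranked by the
    share of the document's words assigned to the given topic."""
    ranked = sorted(doc_topics.items(),
                    key=lambda item: item[1][topic],
                    reverse=True)
    return [(doc, props[topic]) for doc, props in ranked[:n]]

# For topic 0, Work A (0.70) ranks first, then Work C, then Work B.
print(top_documents(doc_topics, topic=0))
```

The same function works for any topic index, so clicking a different topic’s word list simply re-sorts the table on a different column.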
This is just great, Pib, thanks for sharing this experiment with us. Fascinating stuff!
I’ll leave the question of automating bibliographical processes to librarians and real literary scholars, but what’s intriguing to me is that topic modeling seems most useful as another mechanism for confirming hunches and hypotheses about bodies of evidence. It provides another way (not a more true or definitive way, I would argue, but simply another way) of measuring what we come to know from both broad and close readings of texts.
The question I have (I welcome being convinced about this, really I do!) is whether topic modeling can provide patterns that point to new ideas and arguments: surprises, unexpected dimensions of texts or archival holdings, striking new glimpses of the past and of how we might interpret it from its evidentiary base.
In other words, I’m coming down here against the position put forward by Tom Scheinfeldt on his blog Found History about the justifications for DH research (http://www.foundhistory.org/2010/05/12/wheres-the-beef-does-digital-humanities-have-to-answer-questions/). I suspect that Ted Underwood, Matthew Jockers, Ben Schmidt, and others are starting to offer responses about how we understand genre, literary style, and historical data. I’m game to think through the question of how DH tactics such as topic modeling might lead to new arguments, interpretations, and questions, and I think that question is the key to keeping the contemporary humanities part of the digital humanities.
It might sound like I am offering nothing but negative critique here, but I really do offer this response in the spirit of continuing to explore the topic of topic modeling! Thanks again!
Michael