Social networking for digital humanities nerds? Which DH bloggers are you most compatible with? Let’s get the right nerds with the right nerds: matchmaking made in digital humanities heaven.
After seeing Stefan Sinclair’s Voyeuristic analysis of the Day of DH blog posts, I wrote and asked him how to get access to the “corpus” of posts. He hooked me up. I pre-processed the data with a few PHP scripts, ran an LDA topic modeling process, and then did some post-processing in R, both to see the most important themes of the day and to cluster the 117 bloggers by thematic similarity.
So, here’s the what and then the how. As for the why? Why not?
What:
10 unsupervised topics (10 is arbitrary; I could have picked 100). These topics are generated by an analysis of the words and word sequences in the individual bloggers’ sites. The purpose is to harvest out the most prominent “themes” or topics. These themes are presented as a series of word lists, and it is up to the researcher to then “label” the word clusters. I have labeled a few of them (in [brackets] at the beginning of the word lists below; you might use another label, since this is the subjective part). Here they are:
- [human interaction in DH] work today people time working things email year week days bit good meeting tomorrow
- day thing mail dh de image based fact called things change ago encoding house
- [Academic Writing–including Grants] day time dh start post blog proposal google write great posts lunch nice articles
- [Digital publishing and archives] http talk future collection making online version publishing field morning life traditional daily large
- conference university blog morning read internet access couple computers archive involved including great written
- [DH Teaching] students dh teaching humanities class technology scholars university lab group library support scholarship student
- [DH Projects] digital project humanities work projects room meeting collections office building task database spent st
- data project xml working projects web interesting user set spend system ways couple time
- digital day humanities media writing post computing twitter english humanist real phd web rest
- [reading and text-analysis] book text tools software books today reading literary texts coffee edition search tool textual
Unfortunately, the Day of DH corpus isn’t really big enough to yield the sort of crystal-clear topics that I have harvested from much larger collections. Still, the topics above, seen in aggregate, do give us a sense of what’s “hot” to talk about in the field.
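For readers curious about how word lists like the ones above come about, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, written in plain Python. It is not the tool or the scripts used for this post (the post does not say which LDA implementation was run); the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the four tiny "blogs" are all assumptions made up for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, n_top=3, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not the post's tool).

    docs is a list of token lists. Returns (proportions, top_words):
    proportions[d][k] is the share of document d assigned to topic k,
    top_words[k] is the list of the n_top most frequent words in topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assign = []                                            # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z = [rng.randrange(n_topics) for _ in doc]
        assign.append(z)
        for w, k in zip(doc, z):
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], assign[d][i]
                # Remove the token from the counts, then resample its topic.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions; each row sums to 1.
    proportions = [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    top_words = [
        [vocab[i] for i in sorted(range(n_words), key=lambda i: -topic_word[t][i])[:n_top]]
        for t in range(n_topics)
    ]
    return proportions, top_words

# Hypothetical toy "blogs", far smaller than the real corpus.
docs = [
    "students teaching class students lab teaching".split(),
    "xml data project xml system data".split(),
    "class students teaching library students".split(),
    "data xml project web system xml".split(),
]
proportions, top_words = lda_gibbs(docs, n_topics=2)
for k, words in enumerate(top_words):
    print("topic", k, ":", " ".join(words))
```

The sampler only produces the word lists and the per-document proportions; deciding what to call each topic is still the human, subjective step described above.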
But let’s get to the sexy part. . .
In addition to harvesting out the prominent topics, the modeling tool outputs data indicating how much (what proportion) of each blog is about each topic. The resulting matrix is of dimension 117×10 (117 blogs and 10 topics). Each cell holds the percentage of one author’s blog devoted to one topic, so the values in each row add up to 100%. With a little massaging in R, I read in the matrix and then used some simple distance and clustering functions to sort the bloggers into 10 groups (again, an arbitrary number) based on shared themes. Using this data, I then output a matrix showing which authors have the most in common; thus, I do a little subtle matchmaking in advance of our digital rendezvous in London. Birds of a feather blog together?
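The distance-and-clustering step can be sketched roughly as follows. This is a plain-Python illustration, not the R code used for the post: the choice of Euclidean distance and complete-linkage agglomerative clustering is an assumption (the post only says "simple distance and clustering functions"), and the blogger names and topic proportions are invented toy values.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two topic-proportion rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(rows):
    """Pairwise distances between all rows of the blog-by-topic matrix."""
    n = len(rows)
    return [[euclidean(rows[i], rows[j]) for j in range(n)] for i in range(n)]

def cluster(dist, k):
    """Agglomerative clustering (complete linkage), stopped at k groups."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def best_match(dist, i):
    """Index of the blogger whose topic mix is closest to blogger i."""
    return min((j for j in range(len(dist)) if j != i), key=lambda j: dist[i][j])

# Hypothetical data: 4 bloggers x 3 topics, each row summing to 1 (i.e. 100%).
bloggers = ["blogger_a", "blogger_b", "blogger_c", "blogger_d"]
theta = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.10, 0.10, 0.80],
    [0.20, 0.10, 0.70],
]

dist = distance_matrix(theta)
groups = cluster(dist, k=2)
for g, members in enumerate(groups, start=1):
    print("Group", g, [bloggers[i] for i in members])
for i, name in enumerate(bloggers):
    print(name, "is most compatible with", bloggers[best_match(dist, i)])
```

Changing `k` changes how coarse the grouping is, and other distance measures or linkage rules would be just as defensible here; like the number of topics, these are judgment calls.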
Here are the groups:
- Group1
- aimeemorrison
- ariefwidodo
- barbarabordalejo
- caraleitch
- carlosmartinez
- carlwhithaus
- clairewarwick
- craigharkema
- ellimylonas
- geoffreyrockwell
- glenworthey
- guydaarmstrong
- henrietterouedcunliffe
- ianjohnson
- janrybicki
- jenterysayers
- jonbath
- juliaflanders
- juliannenyhan
- justinerichardson
- kai-christianbruhn
- kathleenfitzpatrick
- keithlawson
- lauramandell
- lauraweakly
- malterehbein
- matthewjockers
- meganmeredith-lobay
- melissaterras
- milenaradzikowska
- miranhladnik
- patricksahle
- paulspence
- peterrobinson
- pouyllau
- rafaelalvarado
- raysiemens
- reneaudet
- rogerosborne
- rudymcdaniel
- stanruecker
- stephanieschlitz
- susangreenberg
- victoriasmith
- vikazafrin
- williamturkel
- Group2
- alejandrogiacometti
- annacaprarelli
- danasolomon
- ernestopriego
- karensmith
- leedurbin
- matthewcarlos
- paolosordi
- sarasteger
- stephanethibault
- yinliu
- Group3
- alialbarran
- amandagailey
- cyrilbriquet
- federicomeschini
- nt2lab
- stefansinclair
- torstenschassan
- Group4
- aligrotkowski
- ashtonnichols
- calenhenry
- devonfitzgerald
- enricasalvatori
- ericforcier
- garrywong
- jameschartrand
- joelyuvienco
- johnnewman
- peterorganisciak
- shannonlucky
- silviarussell
- simonmahony
- sophiahoosein
- stevenhayes
- taraandrews
- violalasmana
- willardmccarty
- Group5
- alunedwards
- hopegreenberg
- lewisulman
- Group6
- amandavisconti
- jamessmith
- martinholmes
- sperberg-mcqueen
- waynegraham
- Group7
- bethanynowviskie
- josephgilbert
- katherineharris
- kellyjohnston
- kirstenuszkalo
- margaretgraham
- matthewgold
- paulyoungman
- Group8
- charlestravis
- craigbellamy
- franzfischer
- jeremyboggs
- johnwall
- kathrynbarre
- shawnday
- teresadobson
- Group9
- jasonboyd
- jolanda-pieta
- joriszundert
- michaelmaguire
- thomascrombez
- williamallen
- Group10
- louburnard
- nevenjovanovic
- sharongoetz
- stephenramsay
Twitterers @sramsay and @mattwilkens were poking around here today and wondered what the topics would look like if there were only five topics and five clusters instead of 10 and 10. Here are the topics:
- data work time text working tools people thing system xml mail software things texts
- day time morning lot work bit find web class teaching student days dh real
- digital humanities day tomorrow book twitter university blog computing reading books writing tei emails
- day dh today time post things write start online writing working computer year hours
- project digital work projects students meeting today people humanities dh scholars library year lab
And here are the Blogger-Mates clusters when I set n=5:
- Group1
- aimeemorrison
- alejandrogiacometti
- alialbarran
- amandagailey
- annacaprarelli
- ashtonnichols
- barbarabordalejo
- carlosmartinez
- carlwhithaus
- clairewarwick
- craigbellamy
- craigharkema
- danasolomon
- devonfitzgerald
- enricasalvatori
- ernestopriego
- garrywong
- glenworthey
- guydaarmstrong
- henrietterouedcunliffe
- ianjohnson
- jameschartrand
- janrybicki
- jenterysayers
- joelyuvienco
- johnnewman
- jonbath
- juliannenyhan
- justinerichardson
- karensmith
- kathleenfitzpatrick
- keithlawson
- leedurbin
- lewisulman
- malterehbein
- matthewgold
- matthewjockers
- meganmeredith-lobay
- melissaterras
- michaelmaguire
- miranhladnik
- nevenjovanovic
- patricksahle
- peterrobinson
- raysiemens
- reneaudet
- rogerosborne
- shannonlucky
- silviarussell
- simonmahony
- sophiahoosein
- stefansinclair
- stephanieschlitz
- susangreenberg
- taraandrews
- thomascrombez
- torstenschassan
- vikazafrin
- violalasmana
- willardmccarty
- williamallen
- williamturkel
- yinliu
- Group2
- aligrotkowski
- ariefwidodo
- calenhenry
- caraleitch
- charlestravis
- ericforcier
- geoffreyrockwell
- jolanda-pieta
- juliaflanders
- lauraweakly
- margaretgraham
- matthewcarlos
- milenaradzikowska
- nt2lab
- paolosordi
- peterorganisciak
- rudymcdaniel
- sarasteger
- sharongoetz
- stanruecker
- stevenhayes
- victoriasmith
- Group3
- alunedwards
- hopegreenberg
- katherineharris
- stephanethibault
- teresadobson
- Group4
- amandavisconti
- cyrilbriquet
- federicomeschini
- jamessmith
- joriszundert
- martinholmes
- rafaelalvarado
- sperberg-mcqueen
- stephenramsay
- waynegraham
- Group5
- bethanynowviskie
- ellimylonas
- franzfischer
- jasonboyd
- jeremyboggs
- johnwall
- josephgilbert
- kai-christianbruhn
- kathrynbarre
- kellyjohnston
- kirstenuszkalo
- lauramandell
- louburnard
- paulspence
- paulyoungman
- pouyllau
- shawnday