Topic Analysis of SoWi BA Thesis


June 22, 2023


Last week, I was asked whether I could provide a (quick) analysis of the BA theses of our faculty. The question was whether determining and clustering the topics of the theses computationally would provide a good-enough overview, so that around 330 theses would not have to be clustered manually. I really liked the challenge and gave it a try. Our most recent DeLab API analytics services (which we will publish sooner or later) already include a procedure for computationally analyzing topics, so for this task I only adapted that procedure slightly. In this post, I provide only a brief description of the data and procedure:

  • The data consists of around 330 titles of BA theses of the Faculty of Social Sciences from the last few years. Each title comes with the institute that (first-)supervised the thesis. For the topic modeling procedure, I removed numbers and punctuation from the titles.
  • I rely on the excellent BERTopic Python package. And with the great R reticulate package, there is not even a need to write Python code.
  • Because the titles are in both German and English, I use a multilingual base model.
  • Mostly, the default parameters were used for training the model; I did not optimize the model.
  • To further shorten the list of top words, FLAN-T5 was used to provide a short label of each topic.
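
The preprocessing and model-fitting steps above can be sketched roughly as follows. This is only a minimal sketch, not the code I actually ran: it is written directly in Python rather than through reticulate, the helper names `clean_title` and `fit_topics` are made up for illustration, and the example titles are invented. It assumes that BERTopic's `language="multilingual"` option is an acceptable stand-in for the multilingual base model mentioned above.

```python
import re


def clean_title(title: str) -> str:
    """Remove digits and punctuation from a title, then collapse whitespace."""
    # [^\w\s] matches punctuation and symbols, \d matches digits;
    # the underscore counts as a word character, so it is dropped explicitly.
    stripped = re.sub(r"[^\w\s]|\d|_", " ", title)
    return " ".join(stripped.split())


def fit_topics(titles):
    """Fit a BERTopic model with (mostly) default parameters.

    Hypothetical helper: needs the bertopic package and a reasonably
    large list of titles (a handful of documents is not enough).
    """
    # Imported lazily so that clean_title works without bertopic installed.
    from bertopic import BERTopic

    topic_model = BERTopic(language="multilingual")
    topics, _ = topic_model.fit_transform([clean_title(t) for t in titles])
    return topic_model, topics


# Invented example titles (German and English), not taken from the data.
example_titles = [
    "Fußball im Sportunterricht: Eine Studie (2019)",
    "Social-Media use & political participation, 2010-2020",
]
cleaned = [clean_title(t) for t in example_titles]
```

Calling `fit_topics` on the full list of titles would return the fitted model and one topic id per title; the cleaning helper can be checked on its own without installing anything.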

Due to data protection, I don’t show the interactive version of the results, but only provide a screenshot of the topic clustering:

The model works well: there are clearly distinct clusters of BA theses separating the topic space. A manual inspection also reveals that these clusters indeed consist of similar titles. For instance, the lower right red cluster labeled “2_sport_sportunterricht_fußball” consists of BA theses focussing on sport-related aspects in their titles.

In a further step, I took the generated labels and mapped them to the institutes. The respective figure then visualizes the frequency of topics by institute. Not surprisingly, large institutes supervise more theses. And almost all of the red dots (topic “sports”) were first-supervised by the Institute of Sport Science. This is yet another indicator of the quality of the (default) model.
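
The mapping of topic labels to institutes boils down to counting (institute, topic) pairs. A minimal sketch of that counting step, using only the Python standard library, could look like this. The pairs below are invented for illustration and are not the real data; apart from the Institute of Sport Science, the institute names and all topic labels are hypothetical.

```python
from collections import Counter

# Invented (institute, topic label) pairs, one per thesis.
assignments = [
    ("Institute of Sport Science", "sports"),
    ("Institute of Sport Science", "sports"),
    ("Institute of Political Science", "elections"),
    ("Institute of Sociology", "sports"),
    ("Institute of Sociology", "migration"),
]

# Frequency of each topic within each institute.
counts = Counter(assignments)

# Total number of theses per institute.
theses_per_institute = Counter(institute for institute, _ in assignments)

# Share of all "sports" theses supervised by the Institute of Sport Science.
sports_total = sum(n for (_, topic), n in counts.items() if topic == "sports")
sports_share = counts[("Institute of Sport Science", "sports")] / sports_total
```

A high value of `sports_share` on the real data would correspond to the observation that almost all "sports" theses sit with one institute.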