On 19 December 2024 at 14:15, Kristiina Vaik will defend her doctoral thesis "Beyond Genres: A Dimensional Text Model for Text Classification".
Supervisors:
Associate Professor Kadri Muischnek
Associate Professor Kairit Sirts
Opponent:
Professor Veronika Laippala (University of Turku)
Summary:
The internet is a huge repository of different texts. It's a goldmine of information, covering everything from casual chats to academic articles, and a great resource for many fields of science. Huge text collections, known as Web corpora, are transforming how we study language. They're like time capsules, capturing the ever-changing way we talk and write. The thing is, we don't actually know what's in these digital collections. Is it casual conversations, formal writing, or something else entirely? It's like trying to categorize every book in a giant library without knowing what's in them. Some researchers have focused on broad categories like news or fiction, while others make more fine-grained distinctions, such as dividing the news category into opinion pieces, sports reports, and interviews. Over the years, lots of different classifications have been created, but they all have one thing in common: agreement among the annotators is low. This raises a question: if even people can't agree on what kind of writing something is, how can we expect computers to do it? To make the most of this linguistic goldmine, we need a better roadmap.
This research aims to offer an alternative way of categorizing texts found online. Rather than forcing texts into fixed categories (like news or fiction), it looks at the underlying qualities (i.e., dimensions) of the text itself. For example, is the text formal or casual, factual or opinionated, complex or simple, and does it talk about abstract or concrete phenomena? The thesis set out to determine whether the proposed dimensions are recognizable to humans and, if so, whether and how they differ from one another. It found that human annotators agreed on the proposed dimensions at a consistent level, which suggests that the dimensions have clear communicative functions and definitions, and that each dimension can be set apart by its own linguistic fingerprint. Interestingly, the results show a clear divide between dimensions that resemble spoken language in written form (spontaneous, personal, subjective) and those with more planned and formal language (impersonal, informational). Other dimensions fall somewhere in between or have their own distinctive linguistic characteristics. Understanding how these dimensions relate to each other and recognizing the unique linguistic patterns within them sets the stage for future research on uncovering the hidden structures in Web corpora.
The defence can be followed via Zoom:
https://ut-ee.zoom.us/j/93964820129?pwd=jwy3phIgrSwenM9cOyFTNbSbMY9105.1
Meeting ID: 939 6482 0129
Passcode: 388469