
Kristiina Vaik “Beyond Genres: A Dimensional Text Model for Text Classification”

On 19 December 2024 at 14:15, Kristiina Vaik will defend her doctoral thesis “Beyond Genres: A Dimensional Text Model for Text Classification”.

The internet is a huge repository of different texts. It’s a goldmine of information, covering everything from casual chats to academic articles, and a great resource for many fields of science. Huge text collections, known as Web corpora, are transforming how we study language. They’re like time capsules, capturing the ever-changing way we talk and write.

The thing is, we don’t actually know what’s in these digital collections. Is it casual conversations, formal writing, or something else entirely? It’s like trying to categorize every book in a giant library without knowing what’s in them. Some researchers have focused on broad categories like news or fiction, while others make more fine-grained distinctions, such as dividing the news category into opinion pieces, sports reports, and interviews.

Over the years, many different classifications have been created, but they all have one thing in common: agreement among annotators is low. This raises a question: how can we expect computers to do it if even people can’t agree on what kind of writing something is? To make the most of this linguistic goldmine, we need a better roadmap.

Photo: Keit Mõisavald.

This research aims to offer an alternative way of categorizing texts found online. Rather than forcing texts into fixed categories (like news or fiction), it looks at the underlying qualities (i.e., dimensions) of the text itself. For example, is the text formal or casual, factual or opinionated, complex or simple, and about abstract or concrete phenomena? The thesis set out to determine whether the proposed dimensions are recognizable to humans and, if so, to identify whether and how they differ from one another.

The thesis found that the proposed dimensions showed a consistent level of agreement among humans, suggesting that they have clear communicative functions and definitions, and that the dimensions can be set apart by their unique linguistic fingerprints. Interestingly, the results show a clear divide between dimensions that resemble spoken language in written form (spontaneous, personal, subjective) and dimensions whose language is more planned and formal (impersonal, informational). Other dimensions fall somewhere in between or have their own distinct linguistic characteristics. Understanding how these dimensions relate to each other and recognizing the unique linguistic patterns within them sets the stage for future research into uncovering the hidden structures in Web corpora.

This article was originally published on the University of Tartu website.

