Daniel Gaszewski

Vikings language, the speech of the king Vasa or today's Swedish? Text classification with ML.NET.

This model could classify historical Swedish texts perfectly. But when it saw Polish, it thought it was reading Viking runes. Here’s why.

#1 about 2 minutes

Classifying historical Swedish text with ML.NET

The project aims to build a system using ML.NET to classify Swedish text into its correct historical period, from Viking runes to modern language.

#2 about 4 minutes

The personal inspiration behind the project

The idea for the project came from a university exam on the history of the Swedish language and from noticing linguistic differences on a Nobel Prize diploma.

#3 about 1 minute

Understanding how all languages evolve over time

Language evolution is a natural process for living languages, illustrated by comparing Old English with modern English, and older C# syntax with newer pattern matching.

#4 about 6 minutes

An overview of Swedish language history

The Swedish language is divided into distinct historical periods, including Runic, Old Swedish, and Modern Swedish, each with unique alphabets, grammar, and vocabulary.

#5 about 2 minutes

Getting started with the ML.NET framework

ML.NET is an open-source framework that allows .NET developers to build machine learning models without needing deep expertise in underlying algorithms.

#6 about 3 minutes

The critical process of data collection and cleaning

Preparing the dataset is the most time-consuming step: it involves cleaning up inconsistent formats, removing irrelevant characters, and standardizing the text units used for training.
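
A minimal sketch of what such a cleaning step could look like, written in Python for brevity rather than in ML.NET. The rules here (dropping bracketed editorial notes, keeping only letters and basic punctuation, using short sentences as the text unit) and the names `clean_text` and `to_samples` are assumptions for illustration; the talk summary does not list the exact rules that were applied.

```python
import re
import unicodedata

# Hypothetical cleaning rules, not the ones from the talk.
_BRACKETED = re.compile(r"\[[^\]]*\]")       # editorial notes such as [sic] or [12]
_NOT_TEXT = re.compile(r"[^\w\s.,;:!?'-]")   # anything except letters, digits, spaces, basic punctuation
_SPACES = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Normalize one raw passage into a consistent form."""
    text = unicodedata.normalize("NFC", raw)  # one canonical form for characters like å, ä, ö
    text = _BRACKETED.sub(" ", text)
    text = _NOT_TEXT.sub(" ", text)           # \w is Unicode-aware, so runes and å/ä/ö survive
    return _SPACES.sub(" ", text).strip().lower()


def to_samples(raw: str, label: str, min_words: int = 3):
    """Split a passage into sentence-sized (label, text) training samples."""
    samples = []
    for sentence in re.split(r"[.!?]+", clean_text(raw)):
        sentence = sentence.strip()
        if len(sentence.split()) >= min_words:  # skip fragments too short to be informative
            samples.append((label, sentence))
    return samples
```

Calling `to_samples` on each source passage with its period label would yield rows of a uniform shape, which is the kind of standardized text unit the training step needs.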

#7 about 3 minutes

How to train a model using the ML.NET UI

The ML.NET Model Builder in Visual Studio provides a simple UI to select a scenario, load data, and train a model with a single button click.

#8 about 3 minutes

Demo results and identifying model limitations

While the model successfully classifies valid Swedish text, it incorrectly categorizes any garbage or non-Swedish input as Runic Swedish, highlighting a data quality issue.
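
One way to see why this happens: a classifier trained only on Swedish periods has no "none of the above" answer, so it always returns the period with the highest score, even when every score is weak. The Python sketch below illustrates that behavior and one possible mitigation, a confidence threshold. The labels, the scores, and the threshold are made up for illustration and do not come from the talk's model; adding non-Swedish examples to the training data would be another way to address the same data quality issue.

```python
import math

# Illustrative labels only; the real model may use more periods.
PERIODS = ["Runic Swedish", "Old Swedish", "Modern Swedish"]


def softmax(scores):
    """Turn raw scores into probabilities that always sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores, threshold=None):
    """Return (label, probability); fall back to 'Unknown' below the threshold."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    if threshold is not None and probs[best] < threshold:
        return "Unknown", probs[best]
    return PERIODS[best], probs[best]


# Hypothetical scores for a non-Swedish sentence: no period is a good match.
garbage_scores = [0.4, 0.1, 0.0]
print(classify(garbage_scores))                 # still picks a period
print(classify(garbage_scores, threshold=0.6))  # rejected as "Unknown"
```

A threshold only helps if the model really is unsure about such input, which would need to be checked against its actual scores.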

#9 about 4 minutes

Q&A on ML.NET, data, and model capabilities

The Q&A covers topics like using ML.NET versus Python, the importance of balanced training data, and the model's inability to extrapolate future language changes.
