Layout Optimization Best Practices: The Corpus (Part 1)
In this series we are designing our own custom keymaps, logical layouts, you name it. We’ve laid the groundwork by looking into how good/bad QWERTY is, the power of layers (SpaceFN), and also the huge potential of alternative layouts and custom keymaps. Today, we take the first step in designing your ultimate keymap by exploring our options for compiling a corpus. What's a corpus? Essentially, it's just a fancy term for a big chunk of text. In this context it means a usually large collection of textual data used directly or indirectly as an input for our layout optimization algorithms. Often literally a single text file. Why does it matter to you? Because a well-crafted, personalized corpus is crucial for keymap wizards. If you're aiming to design your own custom logical layout, the corpus plays a key role in determining the language statistics that reflect your typing habits, thus the outcome of the optimization. These statistics, which we extract through analysis of...
Aug 19, 2024
Custom corpora for logical layout design

There are different approaches to obtaining or compiling corpora. This can quickly become a deep rabbit hole in itself, and some people seem downright obsessive about it, so let's focus on how to put together a corpus effectively without spending weeks on the process. The goal is to gather as many texts as you need into a single file (or database). The most common sources include:
- general language statistics
- downloading existing corpora
- grabbing online articles, ebooks, etc.
- keylogging
- aggregating your own texts
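
Whichever sources you pick, the end product is the same: one file. As a minimal sketch, assuming your texts already sit as UTF-8 `.txt` files in one folder, merging them could look like this. The function name `build_corpus` and the folder and file names in the usage note are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def build_corpus(source_dir, output_file, pattern="*.txt"):
    """Concatenate every matching text file under source_dir into one corpus file.

    Returns the number of characters written.
    Assumption: the inputs are plain text, readable as UTF-8.
    """
    output_path = Path(output_file).resolve()
    total_chars = 0
    with open(output_path, "w", encoding="utf-8") as corpus:
        # sorted() keeps the result reproducible between runs
        for path in sorted(Path(source_dir).rglob(pattern)):
            if path.resolve() == output_path:
                continue  # never feed the corpus back into itself
            # Undecodable bytes are replaced rather than aborting the run
            text = path.read_text(encoding="utf-8", errors="replace").strip()
            if not text:
                continue  # skip empty files
            corpus.write(text + "\n")
            total_chars += len(text) + 1
    return total_chars
```

Calling `build_corpus("my_texts", "corpus.txt")` would walk the `my_texts` folder recursively and write everything into `corpus.txt`. Other formats (documents, emails, chat exports) would first need to be converted to plain text, which this sketch does not attempt.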
I'm an advocate for personalized corpora based on texts you've personally typed out. One reason is that many people type in multiple languages, while available corpora typically focus on a single language. In addition, "average English" (or the average of any other language) doesn't really exist on the personal level. Nonetheless, let's walk through the most common options.

General language statistics

Layout optimization algorithms simulate typing in various ways, but in general they need the text of our corpus only to generate language statistics: letter, bigram and trigram frequencies. From these they calculate a score based on finger movement, hurdles, rolls, and various other metrics. So you could use the language stats directly, skipping the corpus step.

(Figure: Letter frequencies of various languages differ a lot! Source: Wikipedia)

However, it's highly unlikely that general letter frequencies found online would accurately represent your typing habits or provide enough data beyond basic statistics. It's interesting and often useful to contemplate the differences in letter frequencies between languages, but these numbers won't really help you when compiling your corpus, so I recommend not wasting time on them, at least for now.

Existing corpora

Similarly, downloading massive corpora like the full Wikipedia or Project Gutenberg is not only overkill but also involves someone else's text, often from hundreds of years ago and on topics you'll never think about, let alone type about. In conclusion, if I were in your shoes, I'd disregard these easily accessible but too general, and thus irrelevant, corpora.

Grabbing articles, e-books, etc.

A more targeted selection of online content, which is easy to crawl and grab automatically or even to harvest manually, might be a better idea. If you're an aspiring writer, ebooks in the genre and style you're pursuing could be useful, as they better represent the words and character frequencies you'll encounter in your work.
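
Whichever source you lean towards, the optimizer ultimately consumes the statistics described under General language statistics: letter, bigram and trigram frequencies. To see what a candidate text would actually contribute, here is a minimal sketch of counting them. It makes two simplifying assumptions that real layout optimizers may not share: text is lowercased, and any n-gram containing whitespace is skipped. The function names are hypothetical.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count n-character sequences in text.

    Assumptions (simplifications, not any particular optimizer's rules):
    case is folded to lowercase, and n-grams containing whitespace are skipped.
    """
    text = text.lower()
    counts = Counter()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if not any(ch.isspace() for ch in gram):
            counts[gram] += 1
    return counts


def top_frequencies(counts, k=10):
    """Return the k most common n-grams as (ngram, relative_frequency) pairs."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [(gram, count / total) for gram, count in counts.most_common(k)]
```

For example, `top_frequencies(ngram_counts(text, 2), 20)` lists the twenty most frequent bigrams in `text`. Running this on your own writing and on a generic text side by side is a quick way to check how far "average" statistics drift from yours.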
Journalists and bloggers can take the same targeted approach: compiling a corpus from articles and blog posts similar to your main topics and writing style may be quicker and easier than creating a fully personalized corpus. However, the result will only be as good as the effort you put into it. And if you already have a portfolio of published articles or books, use those instead.

Keyloggers vs aggregating

When compiling a personal corpus tailored to your own typing habits, you are left with two main approaches: using a simple and trustworthy keylogger, or meticulously creating a text file from your own written content.
Aggregating text manually or semi-manually is a cleaner, safer and quicker approach. However, this method may result in losing some useful data, such as modifiers, hotkeys, form data, passwords, navigation, etc.

Pros and cons of aggregating text, i.e. copy-pasting your texts into a single file:
Summary

After trying pretty much all of the options above, I've found that manually compiling a corpus from text I've typed myself works best for me. However, I'm aware that some people prefer using a year's worth of keylogged data. Both approaches can work as long as you're aware of their advantages and disadvantages, and keep your corpus clean and sanitized. Either way, if I were in your shoes, I'd start compiling my own personal corpus right away to aim for more optimal keymaps. Next time, in Part 2, we will look into exactly how to do it, and how long a corpus you need for the best results.

(Cover image by Michael D Beckwith.)