Oct 6, 2024
Image 1: Letter frequencies – the most basic use of corpora – in this very post

We've seen that a well-crafted, personalized corpus plays a key role in determining the outcome of the optimization process. We've also explored some of its common sources: general language statistics, existing corpora, online articles, e-books, etc. Or – and this is my preference – you can compile your own personal corpus by using a keylogger, aggregating your own texts, or, even better, both methods combined.

Keyloggers AND aggregating!

As already mentioned, you have two main options to get your personal language statistics: using a trustworthy keylogger or creating a text file by aggregating your own written content. Both approaches have their pros and cons, which we covered in more detail last time. As a quick reminder:
- The keylogger registers every keypress (hotkeys, modifiers, passwords – and typos too), but it takes some time to collect a decent amount of data this way.
- Aggregating text is a cleaner, safer and quicker approach since you're working with existing content. However, this way you lose some data, such as modifiers, backspace, hotkeys, function keys, form data, navigation, etc.
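Whichever route you take, the end product is the same kind of statistic the optimizer consumes: character frequencies, as in Image 1. A minimal sketch of computing them in Python – the function name and file handling are mine, not taken from any particular optimizer:

```python
from collections import Counter

def letter_frequencies(corpus_path):
    """Relative letter frequencies of a corpus file --
    the most basic statistic a layout optimizer consumes."""
    with open(corpus_path, encoding="utf-8") as f:
        text = f.read().lower()
    # keep letters only; punctuation and whitespace are ignored here
    counts = Counter(ch for ch in text if ch.isalpha())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.most_common()}
```

A real optimizer also cares about bigrams, modifiers, and punctuation, but this is the shape of the data everything else builds on.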
Anyway, let's continue with the latter one!

Sources of personal corpora

Open up your favorite text editor, create a new document, and consider all the types of written content you've produced over the last couple of years. Your sources may be quite varied, but here are some obvious starting points:

Personal diary

Your diary, particularly if you've been typing it on your computer for years, is likely the best source of text for layout optimization. It's readily available and probably long enough to accurately represent your typing habits. Over time, regular writing accumulates and can yield surprisingly lengthy texts in just a few months. Moreover, your diary is highly relevant to the topics you think and write about. Reflecting your own life and impressions, it is one of the best sources for calculating language statistics that are perfectly characteristic of your typing habits.

Image 2: My never-released in-browser layout optimizer from 2018 – with "Language & Corpus" as a main setting

Keep in mind your main goal here: the optimal keymap – and being able to feed relevant data to the algorithm (Image 2) is an essential part of the project. That's why combining a keylogger with already existing text may be a good idea.

Emails (sent!)

Maybe I'm old-fashioned, but email is still my main method of online communication, especially when it comes to serious business matters. If you're not a programmer, the most straightforward way is to copy & paste the text of your sent emails manually. (Obviously, it doesn't make sense to process incoming mail written by others.) While this may seem tedious, the copy & paste method doesn't involve coding, and it lets you drop copied, quoted (not typed) parts like URLs, tables, etc. along the way. Some email providers make exporting your Sent folder quite easy (e.g. Google), you can use the export functions or extensions of your favorite email client, or you can simply copy the mails into a file one by one.

Posts & comments (social media, forums, etc.)

If you regularly post and comment on social media or various platforms (e.g. keyboard communities on Discord, Reddit, or here on Drop), you may already be sitting on a huge amount of typed text. Some sites offer an export option. Others can be harvested via their RSS or Ajax features. You can look for third-party tools too. Either way, you can collect all your content almost instantly.

Blog posts

I mean your own posts, so exporting them or accessing the database directly should be easy. Watch out for formatting, markup, tags, and non-typed parts. The entries may require serious cleanup: embedded tweets messed up my statistics, for example, so you have to get rid of those and similar snippets before running any calculations.

Professional documents

Many people produce a substantial amount of text daily through their work – technical documents, articles, scientific publications, book chapters, PhD theses, novels, etc. These texts can quickly accumulate into a valuable corpus.

Other sources

There are undoubtedly other excellent sources I haven't mentioned. Feel free to include them (after proper cleanup) and combine all the text into a single file for further processing. But when should you stop? How long should your corpus be?

Optimal corpus length

Let me cover this only briefly here; we may get back to it in detail later. In short: forget gigabyte-sized corpora. They are inefficient, slow, may require a lot of resources and even special tools to work with – and they are no better at providing personalized data.
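If your provider hands you the Sent folder as an mbox file (Google Takeout does, for instance), Python's standard mailbox module can do the copy & paste for you. A rough sketch – the file paths are placeholders, and the quote-stripping is deliberately naive (it only drops "> "-prefixed lines, not signatures or forwarded blocks):

```python
import mailbox

def extract_sent_text(mbox_path, out_path):
    """Append the plain-text bodies of an mbox export to a corpus file,
    skipping quoted ('> ...') lines that you didn't actually type."""
    box = mailbox.mbox(mbox_path)
    with open(out_path, "w", encoding="utf-8") as out:
        for msg in box:
            # walk multipart messages; keep only the text/plain parts
            for part in msg.walk():
                if part.get_content_type() != "text/plain":
                    continue
                payload = part.get_payload(decode=True)
                if not payload:
                    continue
                charset = part.get_content_charset() or "utf-8"
                text = payload.decode(charset, errors="replace")
                typed = [line for line in text.splitlines()
                         if not line.lstrip().startswith(">")]
                out.write("\n".join(typed) + "\n")
```

You'll still want a manual pass afterwards for signatures, URLs, and pasted snippets, but this gets the bulk of the typed text out in seconds.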
Image 3: Frequencies of common characters stabilizing (these are my reddit comments :)

A few megabytes are more than enough. In the next post I'll show why an even shorter corpus of a few hundred kilobytes (the length of a single novel) can serve our purposes just as well as a much larger one. Basically, this is because the frequencies of common letters stabilize very quickly (and rare characters below a certain threshold have minimal impact on the optimization). For the image above, I cut the corpus of my reddit comments into chunks of 1,000 characters. Looking at the cumulative letter frequencies – depending on how you squint – you can see that after the early hectic part, at about 120,000 characters, most frequencies are practically constant.

Changes over time

A common criticism of the personal corpus approach is that it doesn't account for temporal changes, whereas a general corpus may represent hundreds of years of a given language. My response: you are designing the optimal keymap for your current typing habits, which may shift slightly over time. With a general corpus, however, you'll never achieve such an optimal keymap – neither now nor in the future. Besides, nothing is stopping you from reevaluating your keymap, e.g. annually.

Summary

If you haven't done so yet, I'd start compiling my own personal corpus by applying the methods and using the sources mentioned above. Keep in mind that the ultimate goal is to design your optimal keymap! Find your best sources, sanitize your text by removing non-typed parts, and know when to stop! Tweaking your corpus may be addictive, but a few megabytes of data is more than enough.

PS: I will share links to the tools mentioned above later, but for now, I want you to focus on building your corpus. ;)
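PPS: if you want to reproduce the stabilization check behind Image 3 on your own corpus, here is a sketch – the chunk size matches the 1,000-character chunks mentioned above, while the letter set is just an illustrative handful of common English letters:

```python
from collections import Counter

def frequency_trajectory(text, chunk_size=1000, letters="etaoins"):
    """After each chunk, snapshot the running relative frequency of a few
    common letters. If consecutive snapshots barely move, the corpus is
    long enough for layout optimization."""
    counts = Counter()
    snapshots = []
    for start in range(0, len(text), chunk_size):
        counts.update(text[start:start + chunk_size].lower())
        total = sum(counts.values())
        snapshots.append({c: counts[c] / total for c in letters})
    return snapshots
```

Plot the snapshots against characters consumed and you should get the kind of curve shown in Image 3: hectic at first, then flat.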
(Cover image by Michael D Beckwith.)