Layout Optimization Best Practices: The Corpus (Part 1)

Aug 19, 2024

In this series we are designing our own custom keymaps and logical layouts. We've laid the groundwork by looking at how good or bad QWERTY is, the power of layers (SpaceFN), and the huge potential of alternative layouts and custom keymaps. Today, we take the first step in designing your ultimate keymap by exploring our options for compiling a corpus.

What's a corpus? Essentially, it's just a fancy term for a big chunk of text. In this context it means a usually large collection of textual data used directly or indirectly as input for our layout optimization algorithms, often literally a single text file.

Why does it matter to you? Because a well-crafted, personalized corpus is crucial for keymap wizards. If you're aiming to design your own custom logical layout, the corpus determines the language statistics that reflect your typing habits, and thus the outcome of the optimization. These statistics, which we extract by analyzing the corpus, are fundamental to achieving the best optimization results, so let's explore how to compile a tailor-fit corpus with this in mind.

TL;DR: A good personal corpus is the best way to ensure you end up with a great layout after optimization. Sources for your personal corpus can include your diary, emails, posts and comments, and professional writings. Surprisingly, you don't need a large amount of text for layout optimization: a few megabytes are more than enough.

Custom corpora for logical layout design

There are different approaches to obtaining or compiling corpora. This can quickly become a deep rabbit hole in itself (some people seem downright obsessive), so let's focus on how to put together a corpus effectively, without spending weeks on the process. Basically, the goal here is to gather as many texts as you can or need into a single file (or database). The most common sources include:

general language statistics
downloading existing corpora
grabbing online articles, ebooks, etc.
keylogging
aggregating your own texts

I'm an advocate for personalized corpora based on texts you've personally typed out. One reason is that the overwhelming majority of people type in multiple languages, while available corpora typically focus on a single language. In addition, "average English" (or any other language) doesn't really exist on the personal level. Nonetheless, let's walk through the most common options.

General language statistics

Layout optimization algorithms simulate typing in various ways, but in general they need the text of our corpus only to generate language statistics (letter, bigram, and trigram frequencies), which they then use to calculate a score based on finger movement, hurdles, rolls, and various other metrics. So in principle you could use the language stats directly, skipping the corpus step.
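To make the letter, bigram, and trigram statistics concrete, here is a minimal Python sketch of how they could be extracted from a piece of text. The function names `ngram_counts` and `relative_frequencies` are made up for this illustration, and real analyzers may treat case, whitespace, and punctuation differently; this is only one simple way to do it.

```python
from collections import Counter

def ngram_counts(text: str, n: int) -> Counter:
    """Count every run of n consecutive characters in text (hypothetical helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def relative_frequencies(counts: Counter) -> dict:
    """Turn raw counts into shares of the total (hypothetical helper)."""
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

# A tiny stand-in for a corpus; in practice you would read your corpus file instead.
sample = "the theme"

letters = ngram_counts(sample, 1)
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)

# Show the three most common n-grams of each size.
print(letters.most_common(3))
print(bigrams.most_common(3))
print(trigrams.most_common(3))
```

Note that this sketch counts characters exactly as they appear, spaces included, since the spacebar is a key you press too. Whether to fold uppercase into lowercase is a design choice that depends on how your optimizer models Shift.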

(Letter frequencies of various languages differ a lot! Source: Wikipedia)

However, it's highly unlikely that general letter frequencies found online would accurately represent your typing habits or provide enough data beyond basic statistics. It's interesting and often useful to consider how letter frequencies differ between languages, but these numbers won't really help you when compiling your corpus, so I recommend not wasting time on them, at least for now.

Existing corpora

Similarly, downloading massive corpora like the full Wikipedia or Project Gutenberg is not only overkill but also involves someone else's text, often from hundreds of years ago and on topics you'll never think about, let alone type about. In conclusion, if I were in your shoes, I'd disregard these easily accessible but too general and thus irrelevant corpora.

Grabbing articles, ebooks, etc.

A more targeted selection of online content, easy to crawl and grab automatically or even to harvest manually, might be a better idea. If you're an aspiring writer, ebooks in the genre and style you're pursuing could be useful, as they better represent the words and character frequencies you'll encounter in your work. The same goes for journalists and bloggers: compiling a corpus based on articles and blog posts similar to your main topics and writing style may be quicker and easier than creating a fully personalized corpus. However, the result will only be as good as the effort you put into it. And if you happen to have a portfolio of published articles or books already, use those instead.

Keyloggers vs aggregating

When compiling a personal corpus tailored to your own unique typing habits, you are left with two main approaches: using a simple and trustworthy keylogger or meticulously creating a text file from your own written content.

Yikes, keyloggers are dangerous and scary. Yep. But trustworthy, basic keyloggers (open-source options or, ideally, ones you code yourself) are an easy way to harvest keystrokes over time. Ensure that the keylogger does not transmit any data externally, and prefer one that counts keystrokes and bigrams/trigrams rather than saving the entire text to log files.

Pros: Registers every keypress (including hotkeys, modifiers, form input, and yep, passwords too).
Cons: Registers typing errors. Gaming input, typing practice (Monkeytype), and language learning (Duolingo) may completely mess up the statistics. Security is a concern, of course, and it may take some time to gather enough data this way.
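The "count, don't store" idea mentioned above can be sketched in a few lines of Python. This is only an illustration under assumptions: the class `KeystrokeStats` and its methods are hypothetical names, and the sketch deliberately contains no keyboard hook at all. You would feed `record()` from whatever key-event source you trust.

```python
from collections import Counter, deque

class KeystrokeStats:
    """Keeps n-gram counts only; the typed text itself is never stored (hypothetical class)."""

    def __init__(self, max_n: int = 3):
        self.max_n = max_n
        # One counter per n-gram size: 1 = single keys, 2 = bigrams, 3 = trigrams.
        self.counts = {n: Counter() for n in range(1, max_n + 1)}
        # Only the last max_n keys are ever held in memory.
        self._recent = deque(maxlen=max_n)

    def record(self, key: str) -> None:
        """Register one keypress and update every n-gram that ends with it."""
        self._recent.append(key)
        recent = tuple(self._recent)
        for n in range(1, len(recent) + 1):
            self.counts[n][recent[-n:]] += 1

    def reset_context(self) -> None:
        """Forget recent keys, e.g. after a long pause, so unrelated keys do not form n-grams."""
        self._recent.clear()

# Simulated key events standing in for a real key-event source.
stats = KeystrokeStats()
for key in ["h", "e", "l", "l", "o", "Backspace"]:
    stats.record(key)

print(stats.counts[1].most_common(2))
print(stats.counts[2][("l", "l")])
```

Keys are stored as names rather than characters, so non-printing keys such as Backspace or modifiers can be counted the same way as letters.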

Aggregating text manually or semi-manually is a cleaner, safer and quicker approach. However, this method may result in losing some useful data, such as modifiers, hotkeys, form data, passwords, navigation, etc. Pros and cons of aggregating text, i.e. copy-pasting your texts into a single file:

Pros: More controlled input, safer, almost instant results if the appropriate content is available.
Cons: Labor-intensive. No Backspace, Delete, modifiers, function keys, hotkeys, etc.
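If your texts already live as plain-text files in one folder, the copy-pasting can be semi-automated. The following Python sketch shows one possible way to do it; the function name `build_corpus`, the `*.txt` pattern, and UTF-8 encoding are all assumptions made for this example, not a prescribed workflow.

```python
from pathlib import Path

def build_corpus(source_dir: str, output_file: str, pattern: str = "*.txt") -> int:
    """Concatenate all matching files under source_dir into one corpus file.

    Returns the number of characters written. Hypothetical helper for illustration.
    """
    out_path = Path(output_file).resolve()
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Sorted so that repeated runs produce the same corpus.
        for path in sorted(Path(source_dir).rglob(pattern)):
            if path.resolve() == out_path:
                continue  # never feed the corpus file back into itself
            # Undecodable bytes are replaced rather than aborting the whole run.
            text = path.read_text(encoding="utf-8", errors="replace").strip()
            if not text:
                continue  # skip empty files
            out.write(text + "\n\n")  # blank line between source texts
            written += len(text) + 2
    return written
```

A call such as `build_corpus("my_texts", "corpus.txt")`, with folder and file names of your choosing, would produce a single file ready for analysis. Emails, chat exports, and other formats would need to be converted to plain text first.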

Summary

After trying pretty much all of the options above, I've found that manually compiling a corpus from my own typed text works best for me. However, I'm aware that some people prefer using yearlong keylogged data. Both approaches can work as long as you're aware of their advantages and disadvantages, and keep your corpus clean and sanitized. Either way, if I were in your shoes, I'd quickly start compiling my own personal corpus to aim for more optimal keymaps. Next time, in Part 2, we will look into exactly how to do it, and how large a corpus you need for the best results.

(Cover image by Michael D Beckwith.)

dovenyi