Layout optimization best practices: sources of your personal corpus (part 2)
Welcome back to our series on designing custom keymaps! After looking into how good (or bad) QWERTY really is, the power of layers, and the potential of custom keymaps, last time we took the first real step by examining your options for compiling a corpus. As a recap: the corpus is simply a big chunk of text. We use this collection of textual data, often a single text file, to characterize your typing habits (by calculating various language statistics) and feed it, directly or indirectly, into layout optimization algorithms to find the optimal keymap for you! Today we'll expand on this idea by exploring your options if, like me, you prefer a personalized corpus rather than grabbing some general (and mostly irrelevant) data available online.
Sep 4, 2024
Image 1: Letter frequencies – the most basic use of corpora – in this very post

We've seen that a well-crafted, personalized corpus plays a key role in determining the outcome of the optimization process. We've also explored what some of its common sources are: general language statistics, existing corpora, grabbing online articles, e-books, etc. Or, and this is my preference, you can compile your own personal corpus by either using a keylogger, aggregating your own texts, or even better: both of these methods combined.

Keyloggers AND aggregating!

As already mentioned, you have two main options to get your personal language statistics: using a trustworthy keylogger or creating a text file by aggregating your own written content. Both approaches have their pros and cons, which we covered in more detail last time. As a quick reminder:
- The keylogger registers every keypress (hotkeys, modifiers, passwords – and typos too), but it takes some time to collect a decent amount of data this way.
- Aggregating text is a cleaner, safer and quicker approach since you're working with existing content. However, this way you lose some data, such as modifiers, backspace, hotkeys, function keys, form data, navigation, etc.
Anyway, let's continue with the latter!

Sources of personal corpora

Open up your favorite text editor, create a new document, and consider all the types of written content you've produced during the last couple of years. Your sources may be quite varied, but here are some obvious starting points:

Personal diary

Your diary, particularly if you've been typing it on your computer for years, is likely the best source of text for layout optimization. It's readily available and probably long enough to accurately represent your typing habits. Over time, regular writing accumulates and can yield surprisingly lengthy texts in just a few months. Moreover, your diary is highly relevant to the topics you think and write about. Reflecting your own life and impressions, it is one of the best sources for calculating language statistics that are truly characteristic of your typing habits.

Image 2: My never released in-browser layout optimizer from 2018, with "Language & Corpus" as a main setting

Keep in mind your main goal here: the optimal keymap. Being able to feed relevant data to the algorithm (Image 2) is an essential part of the project. That's why combining a keylogger with already existing text may be a good idea.

Emails (sent!)

Maybe I'm old-fashioned, but email is still my main method of online communication, especially when it comes to serious business matters. If you're not a programmer, the most straightforward way is to copy & paste the text of your sent emails manually. (Obviously, it doesn't make sense to process incoming mails, written by others.) While this may seem quite tedious, the copy & paste method doesn't involve coding and is a simple way to get rid of copied or quoted (not typed) parts like URLs, tables, etc. along the way. Some email providers (e.g. Google) make exporting your Sent mail folder quite easy. You can also use the export functions or extensions of your favorite email client, or simply copy the mails into a file one by one.

Posts & comments (social media, forums etc.)

If you regularly post and comment on social media or various platforms (e.g. keyboard communities on Discord, Reddit, or here on Drop), you may already be sitting on a huge amount of typed text. Some sites offer an export option. Others can be grabbed using their RSS or Ajax features. You may look for third-party tools too. Either way, you can harvest all your content almost instantly.

Blog posts

I mean your own posts, so exporting or accessing the database directly should be easy. Watch out for formatting, markup, tags and non-typed parts. The entries may require a serious cleanup: embedded tweets, for example, messed up my statistics, so you have to get rid of those and similar snippets before running any calculations.

Professional documents

Many people produce a substantial amount of text daily through their work, such as technical documents, articles, scientific publications, book chapters, PhD theses, novels, etc. These texts can quickly accumulate to form a valuable corpus.

Other sources

There are undoubtedly other excellent sources I haven't mentioned. Feel free to include them (after proper cleanup) and combine all the text into a single file for further processing. But when should you stop? How long should your corpus be?

Optimal corpus length

Let me cover this very briefly here, and we may get back to this in detail later. In summary, forget about gigabyte-sized corpora. They are inefficient and slow, may require a lot of resources and even special tools to work with, and they are no better at providing personalized data.
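Whatever length you settle on, the input is a single cleaned-up file. The cleanup-and-combine step described in the sections above could be sketched in Python roughly like this. It is only an illustration: the function names `clean_text` and `combine` are hypothetical, and the three rules (drop quoted lines, strip tag-like markup, strip URLs) are assumptions about what typically clutters exported emails, posts and blog entries. Your own sources will likely need different or additional rules.

```python
import re

# Hypothetical cleanup rules; adjust them to whatever clutters your own sources.
URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")  # crude: also removes any other <...> span


def clean_text(raw: str) -> str:
    """Drop parts that were most likely pasted or generated, not typed."""
    kept = []
    for line in raw.splitlines():
        # Lines starting with '>' are usually quoted replies in emails and forums.
        if line.lstrip().startswith(">"):
            continue
        line = TAG_RE.sub("", line)  # leftover HTML-like markup
        line = URL_RE.sub("", line)  # URLs are typically pasted, not typed
        line = re.sub(r"[ \t]+", " ", line).strip()  # collapse leftover gaps
        if line:
            kept.append(line)
    return "\n".join(kept)


def combine(sources) -> str:
    """Clean every source text and join the non-empty results into one corpus."""
    cleaned = (clean_text(s) for s in sources)
    return "\n".join(c for c in cleaned if c)
```

In practice you would read each exported file into a string (for example with `pathlib.Path.read_text`), pass the strings to `combine`, and write the result to one corpus file. Note that this sketch also throws away things you may have typed yourself, such as a URL entered by hand, so check a sample of the output before trusting the statistics.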
Image 3: Frequencies of common characters stabilizing (these are my Reddit comments :)

A few megabytes are more than enough. In the next post I'll give you examples of why an even shorter corpus of a few hundred kilobytes (the length of a single novel) can serve our purposes just as well as a much larger one. Basically, this is because the frequencies of common letters stabilize very quickly (and rare characters below a certain threshold have minimal impact on the optimization). In the image above I cut the corpus of my Reddit comments into chunks of 1,000 characters. Taking a look at the cumulative letter frequencies, depending on how you squint, you can see that after the early hectic part, at about 120,000 characters, most frequencies are practically constant.

Changes over time

A common criticism of the personal corpus approach is that it doesn't account for temporal changes, whereas a general corpus may represent hundreds of years of a given language. My response is that you are designing the optimal keymap for your current typing habits, which may shift slightly over time. However, with a general corpus, you'll never achieve such an optimal keymap, neither now nor in the future. In addition, nothing is stopping you from reevaluating your keymap, e.g. annually.

Summary

If you haven't done so yet, I'd suggest you start compiling your own personal corpus by applying the methods and using the sources we mentioned above. Keep in mind that the ultimate goal is to design your optimal keymap! Find your best sources, sanitize your text by removing non-typed parts, and know when to stop! Tweaking your corpus may be addictive, but a few megabytes of data is more than enough.

PS: I will share the links to the tools mentioned above later, but for now, I want you to focus on building your corpus. ;)
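If you'd like to try the chunking check from the "Optimal corpus length" section on your own text in the meantime, here is a minimal Python sketch. It is not the script behind Image 3. The function names, the tracked letters, the tolerance, and the choice to measure frequencies relative to letters only (ignoring spaces, digits and punctuation) are all assumptions made for illustration.

```python
from collections import Counter


def cumulative_frequencies(text, chunk_size=1000, letters="etaoin"):
    """After each chunk, record the cumulative relative frequency of each
    tracked letter (assumption: relative to all letters seen so far)."""
    counts = Counter()
    total = 0
    history = []
    for start in range(0, len(text), chunk_size):
        chunk = text[start:start + chunk_size].lower()
        typed = [ch for ch in chunk if ch.isalpha()]
        counts.update(typed)
        total += len(typed)
        if total:  # skip leading chunks that contain no letters at all
            history.append({ch: counts[ch] / total for ch in letters})
    return history


def stable_after(history, tolerance=0.005):
    """Index of the first snapshot from which every later snapshot stays
    within `tolerance` of the final frequencies; None for an empty history."""
    if not history:
        return None
    final = history[-1]
    first_stable = len(history) - 1
    # Walk backwards from the end until a snapshot drifts too far from the final one.
    for i in range(len(history) - 1, -1, -1):
        if all(abs(history[i][ch] - final[ch]) <= tolerance for ch in final):
            first_stable = i
        else:
            break
    return first_stable
```

Multiplying `stable_after(history) + 1` by the chunk size gives a rough number of characters after which the tracked frequencies stop moving, assuming every chunk contained letters. Keep in mind that "stable" here is measured against the final value of the same corpus, so a short corpus can look stable simply because it ends early.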
(Cover image by Michael D Beckwith.)