Oct 6, 2024
Image 1: Letter frequencies – the most basic use of corpora – in this very post

We've seen that a well-crafted, personalized corpus plays a key role in determining the outcome of the optimization process. We've also explored some of its common sources: general language statistics, existing corpora, online articles, e-books, etc. Or – and this is my preference – you can compile your own personal corpus by using a keylogger, aggregating your own texts, or, even better, both methods combined.

Keyloggers AND aggregating!

As already mentioned, you have two main options to get your personal language statistics: using a trustworthy keylogger or creating a text file by aggregating your own written content. Both approaches have their pros and cons, which we covered in more detail last time. As a quick reminder:
- The keylogger registers every keypress (hotkeys, modifiers, passwords – and typos too), but it takes some time to collect a decent amount of data this way.
- Aggregating text is a cleaner, safer and quicker approach since you're working with existing content. However, this way you lose some data, such as modifiers, backspace, hotkeys, function keys, form data, navigation, etc.
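Whichever route you take, the end product is the same kind of statistic the optimizer consumes: character frequencies, as in Image 1. A minimal sketch of computing them in Python – the function name and file handling are mine, not taken from any particular optimizer:

```python
from collections import Counter

def letter_frequencies(corpus_path):
    """Relative letter frequencies of a corpus file --
    the most basic statistic a layout optimizer consumes."""
    with open(corpus_path, encoding="utf-8") as f:
        text = f.read().lower()
    # keep letters only; punctuation and whitespace are ignored here
    counts = Counter(ch for ch in text if ch.isalpha())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.most_common()}
```

A real optimizer also cares about bigrams, modifiers, and punctuation, but this is the shape of the data everything else builds on.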
Anyway, let's continue with the latter one!

Sources of personal corpora

Open up your favorite text editor, create a new document, and consider all the types of written content you've produced over the last couple of years. Your sources may be quite varied, but here are some obvious starting points:

Personal diary

Your diary, particularly if you've been typing it on your computer for years, is likely the best source of text for layout optimization. It's readily available and probably long enough to accurately represent your typing habits. Over time, regular writing accumulates and can yield surprisingly lengthy texts in just a few months. Moreover, your diary is highly relevant to the topics you think and write about. Reflecting your own life and impressions, it is one of the best sources for calculating language statistics that are perfectly characteristic of your typing habits.

Image 2: My never-released in-browser layout optimizer from 2018 – with "Language & Corpus" as a main setting

Keep in mind your main goal here: the optimal keymap – and being able to feed relevant data to the algorithm (Image 2) is an essential part of the project. That's why combining a keylogger with already existing text may be a good idea.

Emails (sent!)

Maybe I'm old-fashioned, but email is still my main method of online communication, especially when it comes to serious business matters. If you're not a programmer, the most straightforward way is to copy & paste the text of your sent emails manually. (Obviously, it doesn't make sense to process incoming mail written by others.) While this may seem tedious, the copy & paste method doesn't involve coding, and it lets you drop copied, quoted (not typed) parts like URLs, tables, etc. along the way. Some email providers make exporting your Sent folder quite easy (e.g. Google), you can use the export functions or extensions of your favorite email client, or you can simply copy the mails into a file one by one.

Posts & comments (social media, forums, etc.)

If you regularly post and comment on social media or various platforms (e.g. keyboard communities on Discord, Reddit, or here on Drop), you may already be sitting on a huge amount of typed text. Some sites offer an export option. Others can be harvested via their RSS or Ajax features. You can look for third-party tools too. Either way, you can collect all your content almost instantly.

Blog posts

I mean your own posts, so exporting them or accessing the database directly should be easy. Watch out for formatting, markup, tags, and non-typed parts. The entries may require serious cleanup: embedded tweets messed up my statistics, for example, so you have to get rid of those and similar snippets before running any calculations.

Professional documents

Many people produce a substantial amount of text daily through their work – technical documents, articles, scientific publications, book chapters, PhD theses, novels, etc. These texts can quickly accumulate into a valuable corpus.

Other sources

There are undoubtedly other excellent sources I haven't mentioned. Feel free to include them (after proper cleanup) and combine all the text into a single file for further processing. But when should you stop? How long should your corpus be?

Optimal corpus length

Let me cover this only briefly here; we may get back to it in detail later. In short: forget gigabyte-sized corpora. They are inefficient, slow, may require a lot of resources and even special tools to work with – and they are no better at providing personalized data.
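If your provider hands you the Sent folder as an mbox file (Google Takeout does, for instance), Python's standard mailbox module can do the copy & paste for you. A rough sketch – the file paths are placeholders, and the quote-stripping is deliberately naive (it only drops "> "-prefixed lines, not signatures or forwarded blocks):

```python
import mailbox

def extract_sent_text(mbox_path, out_path):
    """Append the plain-text bodies of an mbox export to a corpus file,
    skipping quoted ('> ...') lines that you didn't actually type."""
    box = mailbox.mbox(mbox_path)
    with open(out_path, "w", encoding="utf-8") as out:
        for msg in box:
            # walk multipart messages; keep only the text/plain parts
            for part in msg.walk():
                if part.get_content_type() != "text/plain":
                    continue
                payload = part.get_payload(decode=True)
                if not payload:
                    continue
                charset = part.get_content_charset() or "utf-8"
                text = payload.decode(charset, errors="replace")
                typed = [line for line in text.splitlines()
                         if not line.lstrip().startswith(">")]
                out.write("\n".join(typed) + "\n")
```

You'll still want a manual pass afterwards for signatures, URLs, and pasted snippets, but this gets the bulk of the typed text out in seconds.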
Image 3: Frequencies of common characters stabilizing (these are my reddit comments :)

A few megabytes are more than enough. In the next post I'll show why an even shorter corpus of a few hundred kilobytes (the length of a single novel) can serve our purposes just as well as a much larger one. Basically, this is because the frequencies of common letters stabilize very quickly (and rare characters below a certain threshold have minimal impact on the optimization). For the image above, I cut the corpus of my reddit comments into chunks of 1,000 characters. Looking at the cumulative letter frequencies – depending on how you squint – you can see that after the early hectic part, at about 120,000 characters, most frequencies are practically constant.

Changes over time

A common criticism of the personal corpus approach is that it doesn't account for temporal changes, whereas a general corpus may represent hundreds of years of a given language. My response: you are designing the optimal keymap for your current typing habits, which may shift slightly over time. With a general corpus, however, you'll never achieve such an optimal keymap – neither now nor in the future. Besides, nothing is stopping you from reevaluating your keymap, e.g. annually.

Summary

If you haven't done so yet, I'd start compiling my own personal corpus by applying the methods and using the sources mentioned above. Keep in mind that the ultimate goal is to design your optimal keymap! Find your best sources, sanitize your text by removing non-typed parts, and know when to stop! Tweaking your corpus may be addictive, but a few megabytes of data is more than enough.

PS: I will share links to the tools mentioned above later, but for now, I want you to focus on building your corpus. ;)
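PPS: if you want to reproduce the stabilization check behind Image 3 on your own corpus, here is a sketch – the chunk size matches the 1,000-character chunks mentioned above, while the letter set is just an illustrative handful of common English letters:

```python
from collections import Counter

def frequency_trajectory(text, chunk_size=1000, letters="etaoins"):
    """After each chunk, snapshot the running relative frequency of a few
    common letters. If consecutive snapshots barely move, the corpus is
    long enough for layout optimization."""
    counts = Counter()
    snapshots = []
    for start in range(0, len(text), chunk_size):
        counts.update(text[start:start + chunk_size].lower())
        total = sum(counts.values())
        snapshots.append({c: counts[c] / total for c in letters})
    return snapshots
```

Plot the snapshots against characters consumed and you should get the kind of curve shown in Image 3: hectic at first, then flat.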
(Cover image by Michael D Beckwith.)