Fink Different: Keyboards as counter-culture.
Imagine watching Star Wars for the first time without seeing images of the Empire’s thousands of perfectly spaced, goose-stepping minions in spotless white-lacquered armor. If you didn’t see the fleets of black and grey TIE fighters, the immaculately designed star cruisers, the evil moon-shaped flagship… you wouldn’t know that the rebels were rebels. After all, rebels don’t look like rebels if they have nothing to be contrasted against. They just look like normal people. That’s probably why, when you see Luke Skywalker, Han Solo or Finn (all rebels) dressed in stormtrooper garb, they somehow seem even more rebellious than they were before. It’s not what they’re wearing, it’s how they wear it. Dirty, scuffed, broken. Helmet missing or askew. An out-of-place, beat-up weapon slung diagonally across their body. It’s the simple act of defacing the uniform that identifies them in our minds as counter-cultural. Funnily enough, it works in reverse. To the dismay of...
Oct 6, 2024
Custom corpora for logical layout design

There are different approaches to obtaining or compiling corpora. This can quickly become a deep rabbit hole in itself (some people seem downright obsessive), so let's focus on how to put together a corpus effectively, without spending weeks on the process. Basically, the goal here is to gather as many texts as you can or need into a single file (or database). The most common sources include:
- general language statistics
- downloading existing corpora
- grabbing online articles, ebooks, etc.
- keylogging
- aggregating your own texts
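
Whichever sources you pick, the goal stated above is the same: everything ends up in a single file. Here is a minimal sketch of that step in Python, assuming your texts are already saved as UTF-8 `.txt` files in one folder. The function name `build_corpus`, the folder layout and the file pattern are illustrative assumptions, not part of any layout tool.

```python
from pathlib import Path


def build_corpus(source_dir: Path, output_file: Path, pattern: str = "*.txt") -> int:
    """Concatenate every matching text file under source_dir into output_file.

    Hypothetical helper for illustration. Returns the number of characters written.
    """
    total = 0
    with output_file.open("w", encoding="utf-8") as out:
        # Sort the paths so the corpus is built in a reproducible order.
        for path in sorted(source_dir.rglob(pattern)):
            if path.resolve() == output_file.resolve():
                continue  # never read the corpus file into itself
            # Undecodable bytes are replaced rather than aborting the whole run.
            text = path.read_text(encoding="utf-8", errors="replace").strip()
            if not text:
                continue  # skip empty files
            out.write(text + "\n")
            total += len(text) + 1
    return total
```

Calling `build_corpus(Path("my_texts"), Path("corpus.txt"))` would write one file containing every non-empty text found under the (hypothetical) `my_texts` folder, one after another.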
I'm an advocate for personalized corpora based on texts you've typed yourself. One reason is that the overwhelming majority of people type in multiple languages, while available corpora typically focus on a single language. In addition, "average English" (or any other language) doesn't really exist on the personal level. Nonetheless, let's walk through the most common options.

General language statistics

Layout optimization algorithms simulate typing in various ways, but in general they need the text of our corpus only for generating language statistics: letter, bigram and trigram frequencies, which are used to calculate a score based on finger movement, hurdles, rolls, and all the various metrics. So basically you could use the language stats directly, skipping the corpus step.

(Letter frequencies of various languages differ a lot! Source: Wikipedia)

However, it's highly unlikely that general letter frequencies found online would accurately represent your typing habits or provide enough data beyond basic statistics. It’s interesting and often useful to contemplate the differences in letter frequencies between languages, but these numbers won’t really help you when compiling your corpus, so I recommend not wasting time on them, at least for now.

Existing corpora

Similarly, downloading massive corpora like the full Wikipedia or Project Gutenberg is not only overkill but also involves someone else's text, often from hundreds of years ago and on topics you'll never think about, let alone type about. In conclusion, if I were in your shoes, I'd disregard these easily accessible but too general and thus irrelevant corpora.

Grabbing articles, e-books, etc.

A more targeted selection of online content, easy to crawl and grab automatically or even to harvest manually, might be a better idea. If you're an aspiring writer, ebooks in the genre and style you're pursuing could be useful, as they better represent the words and character frequencies you'll encounter in your work.
The same goes for journalists and bloggers: compiling a corpus based on articles and blog posts similar to your main topics and writing style may be quicker and easier than creating a fully personalized corpus. However, the result will only be as good as the effort you put into it. And if you happen to have a portfolio of published articles or books already, use those instead.

Keyloggers vs aggregating

When compiling a personal corpus tailored to your own unique typing habits, you are left with two main approaches: using a simple and trustworthy keylogger or meticulously creating a text file from your own written content.
Aggregating text manually or semi-manually is a cleaner, safer and quicker approach. However, this method may result in losing some useful data, such as modifiers, hotkeys, form data, passwords, navigation, etc.

Pros and cons of aggregating text, i.e. copypasting your texts into a single file:
Summary

After trying pretty much all of the options above, I've found that manually compiling a corpus from my own typewritten text works best for me. However, I'm aware that some people prefer using a year's worth of keylogged data. Both approaches can work as long as you’re aware of their advantages and disadvantages, and keep your corpus clean and sanitized. Either way, if I were in your shoes, I'd quickly start compiling my own personal corpus to aim for more optimal keymaps.

Next time, in Part 2, we will look into how to do it exactly, and how long a corpus you need for the best results.

(Cover image by Michael D Beckwith.)
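
As a closing illustration, here is a rough sketch of the language statistics mentioned earlier: the letter, bigram and trigram frequencies that an optimizer derives from a corpus. It is not how any particular layout analyzer does it. Lowercasing the text and skipping n-grams that contain whitespace are simplifying assumptions made here, and the name `ngram_frequencies` is hypothetical.

```python
from collections import Counter


def ngram_frequencies(text: str, n: int) -> Counter:
    """Count overlapping character n-grams of length n.

    Assumptions for this sketch: the text is lowercased, and any n-gram
    that contains whitespace is skipped. Real analyzers may treat the
    space bar and Shift as keys in their own right.
    """
    text = text.lower()
    counts = Counter()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if not any(ch.isspace() for ch in gram):
            counts[gram] += 1
    return counts


sample = "the theme"
letters = ngram_frequencies(sample, 1)   # single-letter counts
bigrams = ngram_frequencies(sample, 2)   # two-letter sequences
trigrams = ngram_frequencies(sample, 3)  # three-letter sequences

print(letters.most_common(3))
print(bigrams.most_common(3))
print(trigrams.most_common(3))
```

Running the same function over a personal corpus and over a generic one gives a quick way to see how far your own typing departs from "average" language statistics.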