Reverse ungineering

Title with apologies to Glyph.

Recently, some friends of mine suggested that "software engineer" is not a good job title. While they are of course free to call their profession whatever they like, I respectfully disagree: I think "engineer" is a perfectly cromulent description of what we do.

This is an opinion piece. Despite arriving at opposite conclusions, the disagreement is feathery at best.

What if buildings failed as often as software projects?

When it comes to getting things done, we're just not very good, as Glyph points out:

Most software projects fail; as of 2009, 44% are late, over budget, or out of specification, and an additional 24% are canceled entirely. Only a third of projects succeed according to those criteria of being under budget, within specification, and complete.

He then rightly points out that those shenanigans would never be accepted in civil engineering, a Serious Engineering Discipline:

Would you want to live in a city where almost a quarter of all the buildings were simply abandoned half-constructed, or fell down during construction? Where almost half of the buildings were missing floors, had rents in the millions of dollars, or both?

I certainly wouldn't. Computers are terrible, but not quite that bad, as Glyph points out. "Failure" simply means something different for software projects than it does for construction projects. Many of those "failed" software projects were quite successful by other measures; the problem isn't with software projects, it's with applying civil engineering standards to a project that isn't.

I emphatically agree that software projects aren't civil engineering projects. Attempts to treat the former like the latter have done much more harm than good. That said, I think reality is more nuanced.

Firstly, civil engineering is the outlier here. Other engineering disciplines don't do as well by the civil engineering success yardstick either. In other fields, the few engineering endeavors that do are typically civil engineering in disguise, such as nuclear and chemical plants. Rank-and-file projects in most fields of engineering operate a lot more like a software project than the construction of a skyscraper. Projects are late and over budget, often highly experimental in nature, and in many cases also subject to changing requirements. It's true that we just can't plan ahead in software, but we're not alone.

On a related note, we may be confounding cause and effect here, even if we overlook that not all engineering is civil engineering. Are software projects unable to stick to these standards because it's not engineering, or is civil engineering the only thing that sticks to them because they have no other choice? Conversely, do we fail early and often because we're not engineering, or because, unlike civil engineering projects, we can? [1]

Finally, software has existed for decades, buildings for millennia. Bridges used to collapse all the time. Tacoma Narrows wasn't so long ago. If the tour guide on my trip to Paris is to be believed, one of those bridges has collapsed four times already.

But this isn't science!

Supposedly, software engineering isn't "real" engineering because, unlike "real" engineering, it is not backed by "real" science or math. This statement is usually paired with a dictionary definition of the word "engineering".

I feel this characterization is incongruent with what engineers do daily.

Consider the civil engineer, presumably the engineeringest engineer there is. [2] If you ask me to dimension an I-beam for you, I would:

• draw a free-body diagram,
• probably draw a shear and moment diagram,
• and pick the smallest standard beam that'll do what you want.

If you want to know how far that beam is going to go, I'll draw you some conjugate beams. I would also definitely not use the moment-area theorem, even though it wouldn't be too difficult for the reasonable uses of an I-beam.

Once upon a time, someone inflicted a variety of theories on me. Euler-Bernouilli beam theory, for example. Very heavy textbooks with very heavy math. Neither my physical therapist nor my regular one expect me to ever truly recover. Nonetheless, area moments and section moduli are the only way to understand where the I in I-beams comes from.

Nasty math didn't prevent me from dimensioning that I-beam. And I do really mean math, not physics: Euler-Bernouilli is a math hack. You get it by taking Hooke's law and throwing some calculus at it. Hooke's law itself is more math than physics, too: it's a first-order approximation based only on the observation that stuff stretches when you pull it. It's wrong all the time, even for fairly germane objects. [3] Both theories were put together long before we had serious materials science. [4] We use them because they (mostly) work, not because they are a consequence of a physical model.

That was just one example from a single discipline, but it holds more generally, too. I analyze circuits by recognizing subsections. If you show me a piece that looks like a low-pass filter, I am not distracted by Maxwell's equations to figure out what that little capacitor is doing. I could certainly derive its behavior that way: in fact, someone made me do that once, and it was quite instructive. But I'm not bothered with the electrodynamics of a capacitor right now; I'm just trying to understand this circuit.

This isn't just how engineers happen to do their jobs in practice, either. Engineering breakthroughs live on both sides of science's cutting edge. When Shockley et al. first managed to get a transistor to work, we didn't really understand what was going on. [5] Carnot was building engines long before anyone realized he had stumbled upon one of the most fundamental properties of the universe. Nobody was doing metaphysics. Sadi wanted a better steam engine.

To me, saying that I-beam was dimensioned with the help of beam theory is about as far from the truth as saying that a software project was built with the help of category theory. I'm sure that there's some way that that thing I just wrote is a covariant functor and you can co-Yoneda your way to proving natural isomorphism, but I don't have to care in order to successfully produce some software. It's easy to reduce an applied field to just the application of that field, but that doesn't make it so; especially if we haven't even really figured out the field yet.

So, even if the math and science behind computer engineering is somehow less real than that other math and science, I think that difference is immaterial, and certainly not enough to make us an entirely different profession.

But that isn't art!

Many people smarter than I have made the argument that programming is an art, not dissimilar from painting, music or even cooking. I'm inclined to agree: many talented programmers are also very talented artists in other fields. However, I do disagree that those things are art-like unlike engineering, which is supposedly just cold, hard science.

There's a not-so-old adage that science is everything we understand well enough to explain to a computer, and art is everything else. If that's true, there's definitely plenty of art to be found in engineering.

Nobody wants to get dragged into a semantic argument about what art is. Even with a much narrower view of art, engineers do plenty of it, as I've tried to argue before. Not all engineering calls are direct consequences of quantum mechanics; sometimes, it is really just down to what the engineer finds most platable. Even civil engineers, the gray predictable stalwarts of our story, care about making beautiful things.

Conclusion

I think the similarities run deep. I hope we don't throw that away essentially just because our field is a little younger. We're all hackers here; and we're all engineers, too.

Footnotes

 [1] I suppose this is really analogous to the anthropic principle, except applied to engineering disciplines instead of humans.
 [2] I'm using civil engineer here in the strict American sense of person who builds targets, as opposed to the military engineer, who builds weapons. Jokes aside, perhaps this is related to the disagreement. Where I come from, "civil engineer" means "advanced engineering degree", and encompasses many disciplines, including architectural (for lack of better word; I mean the American "civil engineer" here), chemical, electrical, and yes, computer.
 [3] Got a rubber band?
 [4] I don't mean to characterize previous efforts as not serious. They simply didn't have the tools to do what we can do today.
 [5] While it is very easy to make up a sensible-sounding narrative time line after the fact for the breakthroughs in physics and engineering that eventually made the transistor possible, this ignores the strong disagreements between theoretical predictions and practical measurements of the time. Regardless of their cause, it would be foolish to assume that Shockley just sat down and applied some theory. The theory just wasn't there yet.

On multiplayer turn-based game mechanics

Most classic turn-based games, from chess all the way to Civilization V, are sequential in nature. A player makes a move, then the next player makes a move, and so on. The details can vary, for example:

• There could be two players, or multiple. This number is tightly bound for scaling reasons, which we'll discuss later.
• The game could have perfect information, like chess, where all players see a move as soon as it is played. The game could also have imperfect information, like Civilization V, where players see part of a move, but the effets may be obscured by fog of war.
• The players may play in a consistent order (chess, Civilization V), or in a somewhat random one (D&D's initiative system).

All of those things are more or less orthogonal to the turn system. Players play turns sequentially, so I'm going to call these sequential turn-based games.

Sequential turns make scaling the number of players up difficult. Even with only 8 players, any given player will spend most of their time waiting. While 8 players are a lot for most turn-based games, it's nothing compared to an MMORPG.

An alternative to sequential turn-based play is simultaneous turn-based play. In simultaneous turn-based play all players issue their moves at the same time, and all moves are played out at the same time. The simplest example is rock-paper-scissors, but Diplomacy works the same way. More recently, this system has been explored by the top-down tactical game Frozen Synapse.

While simultaneous turn-based play gets us closer to making massively multiplayer turn-based games feasible by turning a linear scaling problem into a constant time one, we're not quite out of the woods yet.

Consider what happens when a player does not make a move. There are a few reasons that might happen:

• The player is not playing the game right now.
• The player has stopped playing the game altogether.
• The player may be in a hopeless position, where stalling is better than losing. (Stalling may tie up lots of enemy resources.)

If you've ever gotten frustrated at a multiplayer game that has a "ready" system before you begin a game, but had to wait because one of the players disappeared; this is essentially the problem turn-based games face every turn.

There are a number of ways to mitigate this problem. Games can duplicate playing fields. That works for both sequential games like Hero Academy and simultaneous ones like Frozen Synapse. If a player doesn't make a move, that particular instance of the game world doesn't go anywhere; but you can play any number of games simultaneously.

For this strategy to work, the playing fields have to be independent. You don't lose heroes or soldiers because they're stuck on some stale game. The worst possible outcome is that your game statistics don't reflect reality.

That works, but rules out a permanent game world with shared resources. If there's a larger story being told, you would want these worlds to be linked somehow: be it through shared resources, or because they're literally the same game world.

There's a number of creative ways to get out from under that problem, usually by involving wall-clock time. For example, if a player doesn't respond within a fixed amount of time, they may forfeit their turn. Fuel consumption might be based on wall-clock time, not turns. [1] There's a lot of degrees of freedom here. Do you use a global clock, or one local to a particular area?

A global clock is probably simpler, but poses some game play challenges. How long is the tick? Too fast, and a player may see their empire annihilated while they're sleeping. Too slow, and the most trivial action takes forever. There isn't necessarily one right answer, either. In an all-out cataclysmic struggle between two superpowers, a complete tactical battle plan may take a long time. Any timescale that isn't frustratingly short for that situation will be frustratingly long for anyone trying to guide their spaceship (or kodo, depending which universe you're in) across the Barrens.

Local clocks have their own share of difficulties. You still need to answer what happens for anything that isn't in a particular battle; you still need to answer what happens when battles merge or diverge.

I'm currently exploring the shared global clock. In order to mitigate the issues I described, I'm contemplating two ideas:

• Allow programmable units; a la Screeps, CodeWars...
• Allow players to plan several turns ahead of time.

These are, of course, not mutually exclusive.

Footnotes

 [1] I don't particularly like this, because it "breaks the fourth wall" in a sense. If my engines are still consuming fuel real time, why can't the enemy fire missiles? Either time is stopped, or it isn't. Sure, games can be abstract, but that feels like an undue inconsistency.

hypercathexis dev notes part 1

hy·per·ca·thex·is, n, pl hy·per·ca·thex·es \-kə-ˈthek-səs, -ka-\: excessive concentration of desire upon a particular object

I'm considering renaming the project to its plural, hypercathexes, because then it can be about a hyper cat in a bunch of hexes.

The only real constraints I started with was that I wanted a simultaneous turn-based space game on a hex grid.

Amit Patel from Red Blob Games has basically the awesomest page about hex grids, and a ton of awesome pages about many other areas of game development. I think it's a fantastic resource for programmers like myself who don't do game dev as a day job, but just want to make a little game on the side.

Simultaneous turn-based means that all players plan their moves simultaneously, and they are then also executed simultaneously. This has an interesting scaling effect. On the one hand, it clearly scales better to many players, because players "play" simultaneously. On the other hand, you start getting interesting problems. For example, clock synchronization. Does the entire world advance with the same tick-tock pattern? What happens when a player does not make a move within the allotted time? If you allow different clocks in the world, does time advance faster if both players submit a move, or do you always wait until the maximum timeout?

I wanted an excuse to play with Om and found Chestnut.

The first version had a working hex grid, but displayed it using offset coordinates. I wanted to work using axial coordinates as much as possible, because it makes a lot of math so much easier. Axial coordinates work together with the "grain" of the hex map:

Second thing I did was move from individual <img> tags produced by Om to a single <svg> with hexes (<polygon>) inside it. This fixed a number of annoying placement issues with CSS. CSS really wants to position things based on bounding box, not midpoints. That's great for web pages, not so much for my hex grid. I ran into a number of annoying issues where certain hex borders would be wider than others. Browsers aren't made to draw hex grids based on left/right offsets, I guess...

I started by expressing distances in the SVG in terms of the hex width, which would become my unit. The height would then be $\sqrt{3}/2$. Then, I realized that I could make my life easier by expressing all x coordinates in terms of a single hex width, and all y coordinates in terms of a single hex height; then I could just scale differently across x and y at the end.

Developing in Firefox was mostly painless, but I discovered many discrepancies once trying it in Chrome. Things that should be the same aren't, particularly when it comes to transforms. For example, Chrome would occasionally literally do the inverse of the scaling it was supposed to:

I guess trying to implement things in browsers was a mistake.

What the heck is a clojure.lang.IFn$LO? It's no secret that I love Clojure. Like any tool though, it isn't perfect. Today, I was trying to write unit tests that use clojure.core.async/timeout, so I wrote a test double analogous to Twisted's Clock. As I tried to with-redefs it in, I got the most inscrutable error message out: java.lang.ClassCastException: icecap.handlers.delay_test$fake_timeout$timeout__22934 cannot be cast to clojure.lang.IFn$LO.

Wha? I know clojure.lang.IFn, Clojure's function type, but what the heck is a clojure.lang.IFn$LO? Searching for the term didn't give any particularly useful results. It was clear this happened when I was redeffing the original timeout, so I looked at its documentation: clojure.core.async/timeout ([msecs]) Returns a channel that will close after msecs  Doesn't look too special to me. What's the type of that thing, anyway? Let's find out: > (parents (type timeout)) #{clojure.lang.IFn$LO clojure.lang.AFunction}


Aha! So that is actually part of timeout, not something else wonky going on. What does the source say? It's a pretty lame shim:

(defn timeout
"Returns a channel that will close after msecs"
[^long msecs]
(timers/timeout msecs))


I mean, nothing interesting there, just a type hint.

Oh. Wait. That's not just a type hint. long is a primitive. Testing:

> (parents (type (fn [^long x] x)))
#{clojure.lang.IFn$LO clojure.lang.AFunction}  Aha! Due to a JVM quirk, functions with a primitive type hint are special. That works for doubles, too: > (parents (type (fn [^double x] x))) #{clojure.lang.IFn$DO clojure.lang.AFunction}


And multiple arguments:

> (parents (type (fn [^double x ^double y] x)))
#{clojure.lang.IFn$DDO clojure.lang.AFunction} > (parents (type (fn [^double x ^long y] x))) #{clojure.lang.IFn$DLO clojure.lang.AFunction}


Adding a simple type hint to the function fixed it. Success!

On discussing software security improvements

A common criticism of information security folks is that they tend to advise people to not do any crypto. Through projects like Crypto 101, I've attempted to make a small contribution towards fixing that.

In the open source world, various people often try to improve the security of a project. Because designing secure systems is pretty hard, they often produce flawed proposals. The aforemetioned tendency for infosec-conscious people to tell them to stop doing crypto is experienced as unwelcoming, even dismissive. Typically, the only thing that's accomplished is that a lot of feelings get hurt; it seems to only rarely result in improved software.

I think that's quite unfortunate. I think open source is great, and we should be not just welcoming and inclusive, but aiming to produce secure software. Furthermore, even if a flawed proposal is unsalvageable, a clear description of why it is flawed will presumably result in fewer negative interactions. Best case scenario, the issues with a proposal can be discussed and potentially rectified.

In an effort to improve this situation, I'm documenting what I believe to be a useful way to discuss security changes and their tradeoffs. As Zooko has taught me:

This document aims to be the equivalent of an SSCCE for generic bug reports: a blueprint for making suggestions likely to lead to productive discourse, as long as we can agree that we're trying to produce more secure software, as well as provide a welcoming development environment.

Important points

A good proposal should contain:

1. A brief description of what you're suggesting.
2. A description of the attack model you're considering, why the current system does not address this issue, and why the suggested system does address this issue.
3. A motivation of the attack model. Why is it important that this issue is actually addressed?
4. How does this change affect the attack surface (i.e. all of the ways an attacker can attempt to attack a system)?
5. What does the user experience for all users of the system look like? Many cryptosystems fall over because they're simply unusable for most users of the system.

An example

Wul (the widely underestimated language, pronounced /wool/) is a general purpose programming language. It has a package repository, WuPI (the Wul package index, pronounced /woopie/), the de facto standard for distributing and installing Wul software.

WuPI uses TLS as a secure transport. The WuF (Wul foundation, pronounced /woof/), maintains a root certificate, distributed with Wul. Thanks to a well-managed system of intermediary CAs run by a tireless army of volunteers, this means that both package authors and consumers know they're talking to the real WuPI.

Alice is the WuPI BDFL. Bob is a Wul programmer, and would like to improve the security of WuPI.

While consumers and authors know that they're talking to the real WuPI, there is no protection against a malicious WuPI endpoint. (This problem was recently made worse because WuPI introduced a CDN, greatly increasing the number of people who could own a node.). You know that you're talking to something with a WuF-signed certificate (presumably WuPI, provided the WuF has done a good job managing that certificate), but you have no idea if that thing is being honest about the packages it serves you.

Bob believes WuPI could solve this by using GPG signatures.

He starts with a brief description of the suggestion:

I would like to suggest that WuPI grows support for GPG signatures of packages. These signatures would be created when a package author uploads a package. They would optionally be verified when the user downloads a package.

He continues with the attack model being considered:

I believe this would secure WuPI consumers against a malicious WuPI endpoints. A malicious WuPI endpoint (assuming it acquires an appropriate certificate) is currently free to deliver whatever packages it wants.

He explains why the current model doesn't address this:

The current system assures authenticity and secrecy of the stream (through TLS), and it ensures that the server authenticates itself with a WuPI/WuF certificate. It does not ensure that the package is what the author uploaded.

He explains why he believes his model does address this:

Because the signatures are produced by the author's GPG key, a malicious WuPI endpoint would not be able to forge them. Therefore, a consumer is sure that a package with a valid signature is indeed from the author.

He explains why this attack model is important:

With the new CDN support, the number of people with access to such a certificate has greatly increased. While I certainly trust all of the volunteers involved, it would be nice if we didn't have to. Furthermore, the software on the servers can always be vulnerable to attack; as a high-value target, it certainly isn't inconceivable that an attacker would use an unknown vulnerability to take over a WuPI endpoint.

He (believes to) address the attack surface:

Because the signatures are optional, the attack surface remains the same.

Finally, he addresses the user experience:

The weak point of this scheme is most likely the user experience, because users historically seem to dislike using GPG.

I am hopeful that this increased value of participating in the GPG web of trust will mean that more people participate.

Alice reviews this, and notes a flaw in the proposal:

This proposal aims to address a security flaw when the WuPI endpoint is malicious by adding signatures. However, a malicious WuPI endpoint can lie by omission, and claim a package was never signed by the author.

Bob now realizes this issue, and suggests an improvement:

This could be rectified if the user insists on a signature for packages they expect to be signed.

As a side note, Alice notes that the attack surface does increase:

This places trust in author's ability to manage private keys, which has historically been shown to be problematic. That introduces a new attack vector: an attacker can attempt to go after the author's private key.

Regardless of the outcome of this conversation, there actually was a conversation. I believe this to be an improvement over the overall status quo.

Switched to Nikola

I've migrated from Octopress to Nikola.

Nothing personal. Octopress is fine software, but:

• External packages, like themes, made my installation quickly impossible to understand and manage. What's the difference between javascripts and js, css and style? I don't know.
• Performance. Nikola builds sites nearly instantly. Even with a moderate amount of pages, I found Octopress too slow. Maybe I just need a better laptop.
• Python. While I'm sure the Ruby tools are of similar quality to Python's; I know Python's tools. Porting to Nikola was easier than learning about A-grade Ruby developer installations.

Anyway, I'm on Nikola now. Pretty happy with it. I used nikola-octopress-import. Did most of what I wanted; I contributed some minor PRs so that it would also handle extended date formats (with seconds and timezones) and legacy HTML posts.

On TrueCrypt and full-disk encryption

Since the early hours of May 29th (CEST), the TrueCrypt website (https://www.truecrypt.org) has pointed to http://truecrypt.sourceforge.net/, with an ominous-looking error message:

WARNING: Using TrueCrypt is not secure as it may contain unfixed security issues

This page exists only to help migrate existing data encrypted by TrueCrypt.

The development of TrueCrypt was ended in 5/2014 after Microsoft terminated support of Windows XP. Windows 8/7/Vista and later offer integrated support for encrypted disks and virtual disk images. Such integrated support is also available on other platforms (click here for more information). You should migrate any data encrypted by TrueCrypt to encrypted disks or virtual disk images supported on your platform.

The website then explains how you can install BitLocker, a proprietary disk encryption system available in many versions of Windows, as well as how you could "rescue" existing TrueCrypt volumes.

People were pretty unhappy, for a variety of reasons:

• Lack of trust in proprietary full-disk encryption software.
• Many versions of Windows didn't even ship with BitLocker, only a few "premium" versions did.

As a proprietary system, BitLocker is not susceptible to the same amount of public scrutiny as a publicly available system. This is in stark contrast with TrueCrypt, where Matt Green recently raised around 70k USD to perform an audit.

There are variety of scenarios that could've caused the TrueCypt website to suddenly sport that message. The live Internet audience has speculated wildly. I'll share my thoughts on that near the end of this post, but there are two points I'd like to make that I think are far more important:

1. Consider if full-disk encryption is really what you want.
2. If it is, consider if it should be TrueCrypt.

Full-disk encryption

Something I learned from the inimitable Zooko:

You can't just add more crypto junk to something and expect to somehow get better security. You have to consider what it is buying you, and what it's costing you.

Full-disk encryption buys you one simple thing: if someone steals your device while the encrypted volume is locked, they probably can't read it.

If the encrypted volume is unlocked, it's over; and that's the state it's probably usually in. If the key lives in RAM (it usually does), there's a variety of ways that can be extracted. There's devices that have complete direct memory access, including everything that speaks FireWire or has FireWire-compatibility built-in. Even if you shut off your machine, cold boot attacks mean that the key can be extracted for a limited about of time. Various jurisdictions can try to force you to hand over the keys.

If an attacker can write to the (encrypted!) volume, it's probably over. Virtually all sector-level full-disk encryption formats are unauthenticated (with good reason), they're all malleable and vulnerable to (adaptive) chosen-ciphertext attacks.

If you're encrypting files or blobs of data within other files, an authenticated encryption scheme like GPG is far more useful.

Don't get me wrong. Full-disk encryption is a good idea, and you should do it. I'm just saying that what it actually protects against is pretty limited. If there's files you want to keep secret, full-disk encryption is probably not enough.

For more details, try Thomas & Erin Ptacek's blog post on why XTS isn't what you want. XTS is a way to build tweakable ciphers that can be used to create full-disk encryption; but the post applies to full-disk encryption generically as well.

TrueCrypt as a full-disk encryption mechanism

So, you probably want full-disk encryption, and you probably want to make sure that sensitive data is encrypted on top of that. Great. What full-disk encryption scheme do you use?

I don't want to bash TrueCrypt. It is (was, perhaps?) a great go-to project for full-disk encryption. It got a lot of things right. It also got a bunch of things wrong.

TrueCrypt was made by a bunch of people we don't know occasionally throwing a bunch of binaries over the wall. The source is available (under a non-OSI license), so you could audit that and compile your own binaries, but the truth is that the vast majority of TrueCrypt users never actually did that.

The TrueCrypt disk format and encryption standards are a bit iffy. It has terrible file system support. It has major performance issues, partially due to questionable cryptographic practices such as cipher cascades. It was great for when it was originally conceived and full-disk encryption was still new and exciting, it's still pretty decent now, but we can certainly do better.

What actually happened to the website?

We don't really know. There's a couple of guesses.

First of all, it looks like it are the original authors that folded: the key is the same one that was being used to sign releases months ago. Of course, that could mean that it was compromised or that they were forced to hand it over.

Since the DNS records changed, the e-mail server behavior changed, the same key was used, the Sourceforge client was involved, and fairly major changes to the source code were involved, it's unlikely that it's a simple defacement.

It's unlikely that they folded because they felt discovery of some backdoor was imminent. Folding wouldn't actually stop that discovery, because the source was already open.

It's strange that they would point to alternatives like Bitlocker for Windows and even more dubious alternatives for other operating systems. The authors certainly knew that this would not be an acceptable alternative for the vast majority of their users.

Additionally, the support window for XP ending isn't a particularly convincing impetus for migrating away from TrueCrypt right now. This has lead some to believe that it's an automated release. Worse, it could be a gagged response: a big and powerful three-letter agency might be twisting their arm and forcing them to fold, in return for something else (like, say, not being thrown in a Gitmo cell and forgotten about). Again: pure speculation.

Another option is that some of the developers just really, genuinely folded. They felt that for atechnical users, BitLocker was good enough, while the particularly discerning TrueCrypt user would be able to find some other alternative.

Bottom line is that none of this really matters. What matters is what you should do next if you want to have full-disk encryption.

So what do I do now?

I think LUKS is probably the best system that we have right now, and one of the few that actually improves on TrueCrypt.

If you're on Linux, dm-crypt + cryptsetup + LUKS is probably what you want. If you're on OS X, FileVault 2 is probably what you want. If you're on Windows, your options are looking a bit thin right now:

• Keep using old versions of TrueCrypt.
• Give up and use BitLocker.
• Use a Linux virtual machine to use dm-crypt/LUKS, eventually migrating to native support.

Personally, I think the first option is the most reasonable right now, and then wait until someone actually writes the native LUKS support.

Oh, and you should probably install GPG to encrypt some of those files stored on that encrypted volume.

More on financial aid grant optimization

In a previous post, I talked about ways to optimize PyCon financial aid grants. This is a follow-up on those efforts. Quick recap:

• There is a fixed budget $b$ available for grants, between 100k and 200k USD.
• There are a number of people (approximately 300) requesting various amounts $r_i$ (approximately between 100 USD and 2000 USD) of financial aid, and receive a grant $g_i$ so that $0 \le g_i \le r_i$.
• Financial aid applicants can be assigned scores $s_i$, a relative value describing how much we'd like to have them at PyCon.
• PyCon wants to optimize the total expected value of scores. That means getting as many people as possible to come, weighted by score.
• We've conjectured that we can estimate the probability $p_i$ that someone attends as either $g_i/r_i$, or $(g_i/r_i)^2$. The former prefers to spread the budget across a larger number of smaller grants, whereas the latter prefers to focus the budget into a smaller number of larger grants.

If any of that doesn't make sense, you should read the previous blog post for more details.

In short, we're trying to solve the optimization problem:

$$\max \sum E[S_i] = \sum s_i \cdot p_i$$

Since most optimization texts appear to prefer minimization, alternatively:

$$\min - \sum s_i \cdot p_i$$

Subject to a budget constraint and an individual grant constraint:

$$\sum g_i \le b$$

$$0 \le g_i \le r_i$$

The greater-than-zero constraint for individual grants is fairly important, otherwise the algorithm might casually give you answers like:

k = 1: [ 10.  20. -20.  30.  50.  10.  50.], sum: 150.0


That list in the middle are the per-person grants. Notice how the algorithm feels that the third person really ought to pony up some cash so that some of the other people can go to PyCon ;-)

Squared problems

In the previous blog post I ran into issues using a very generic constraint solver. I ended that post saying that I would try to remeedy that by applying a more specific solver that takes advantage of a particular structure of the problem.

When you set $p_i = g_i/r_i$, this turns into a linear programming problem, since $r_i$ is a constant. When you set $p_i = \left(g_i/r_i\right)^2$, it turns into a quadratic programming problem. Turns out there's two things I missed about the quadratic problem:

• The resulting problem is not convex. That means it's (probably) difficult to solve.
• The estimated probability that someone will attend falls off sharply as soon as they don't receive the full amount they requested. At 50% of the requested grant, the estimated probability of attending is only 25%; at 90%, it's 81%. This is the opposite of what we want.

Fixing the model

That doesn't mean we should put the $(g_i/r_i)^k$ out to pasture: it just means that I didn't pick the $k$ I really wanted. Specifically, if I were to pick $k=1/2$, I'd get:

$$p_i = \left(\frac{g_i}{r_i}\right)^{1/2} = \sqrt{\frac{g_i}{r_i}}$$

In general, if $k = 1/n$:

$$p_i = \left(\frac{g_i}{r_i}\right)^{1/n} = \sqrt[n]{\frac{g_i}{r_i}}$$

That problem is convex, but it isn't linear, quadratic, or some other easy specific problem. It's just constrained multivariate convex optimization. That's okay, there are still a couple of applicable optimization algorithms.

Some of these algorithms require the derivative of the goal function with respect to a particular grant size $g_j$ at a particular point. In case we don't have the real derivative, we can still provide a numerical approximation. In our case, we don't really need to approximate, since the derivatives are fairly easy to compute analytically:

$$\frac{\partial}{\partial g_j} \sum_i s_i \cdot \sqrt[n]{\frac{g_i}{r_i}} = \frac{s_j \cdot {g_j}^{\frac{1}{n} - 1}}{\sqrt[n]{r_j} \cdot n}$$

Finding a solution with Python

There are a few Python packages that contain optimization algorithms:

• SciPy
• cvxopt
• pyopt

Someone originally pointed me towards cvxopt. While I'm sure it's excellent software, I already knew SciPy, so I went with that.

SciPy's optimization module provides the following algorithms:

• fmin_l_bfgs_b - Zhu, Byrd, and Nocedal's constrained optimizer
• fmin_tnc - Truncated Newton code
• fmin_cobyla - Constrained optimization by linear approximation
• fmin_slsqp - Minimization using sequential least-squares programming
• nnls - Linear least-squares problem with non-negativity constraint

The first two are not applicable because they only appear to support bounds on individual variables. I also need a constraint over the sum of variables for the budget: $\sum g_i \le b$. The last one isn't applicable because this isn't a linear least-squares problem. That leaves cobyla and slsqp.

Experimenting with linear approximation (COBYLA)

Making COBYLA work ended up being pretty easy. The only non-trivial part is expressing all your constraints as expressions greater than or equal to zero.

I've uploaded my IPython notebook (viewer, gist).

This produced the following results:

scores: [1 1 1 2 3 5 5]
requested: [10 20 30 30 50 10 50] total: 200, budget: 150
k = 1/0.5: [ 10.  20.  30.  30.  50.  10.   0.], sum: 150.0
k = 1/1: [ 10.  -0.   0.  30.  50.  10.  50.], sum: 150.0
k = 1/2: [ 10.  10.   7.  27.  36.  10.  50.], sum: 150.0
k = 1/3: [ 10.  11.   9.  25.  35.  10.  50.], sum: 150.0
k = 1/5: [ 10.  11.  10.  24.  35.  10.  50.], sum: 150.0
k = 1/10: [ 10.  11.  11.  23.  35.  10.  50.], sum: 150.0
k = 1/100: [ 10.  11.  11.  23.  34.  10.  50.], sum: 150.0


Some takeaways:

• As predicted, as $1/k$ increases, the optimization gradually starts spreading the budget out more evenly; preferring to give many partial grants rather than a few large ones.
• This effect is mostly only pronounced for $1/k = 2, 3$; after that, increasing $1/k$ doesn't make much of a difference anymore. This is what I'd expect when I visualize the $p_i$ functions in my head, but I haven't ruled out numerical instability.
• Even at high $1/k$, the two high scorers get their full grant amount. That's quite understandable for the one that's only asking for 10, but even the one asking for a large grant gets it unconditionally. This probably means that tweaking the scoring functions is going to be very important.
• The linear version is apparently already quite brutal: the person with score 3 requesting 50 simply gets it entirely, leaving no budget left over for the people with score 1.

Since I'm so pleased with these results, I'm skipping slsqp until I feel like it. Next up, I'll try to compare the COBYLA results above with the results of the greedy algorithm under various fitness metrics.

Optimal hotel room pairing

The greedy hotel pairing algorithm from my previous post gives good solutions. If you make some fairly reasonable-sounding assumptions about the problem, very good solutions. At any rate nicer than forcing a human to drudge through it; both in terms of person-hours spent and quality of the solutions.

However, it turns out I was wrong about them being optimal solutions. There's no guarantee they would be, and in fact a good chance that they won't.

Finding optimal solutions

The good news is that there is a way to find the optimal solution. This problem can be modeled as a graph matching problem; what we're trying to find is a weighted maximal matching.

This is really easy if the graph is bipartite, but in our case we're actually dealing with two complete graphs: we try to pair people with similar gender. Within that compatible group, anyone can theoretically pair with anyone.

Picking an algorithm

There are several feasible algorithms for this.

The current state of the art in computerized algorithms appears to be Blossom V by Vladimir Kolmogorov. This algorithm finds perfect matchings (i.e. all nodes are included), which may be impossible for us, e.g. because of an odd number of pairs.

Issues

One issue with this approach is that it becomes pretty much impossible to adapt this to rooms for more than two people. In order to support that case my previous greedy algorithm remains the best thing I've found.

I was unable to find an implementation of this in Java or Clojure. Fortunately, there is a Python library called NetworkX that does have an implementation.

I have not yet turned this into a fully functional program, because:

• it's too late to apply it for this year
• there's a good chance we won't be doing this again for next year

Credits

I'd like to thank Tor Myklebust (tmyklebu on Freenode) for pointing out I was wrong, and shoving me in the direction of the answers.

In this blog post I used several freely available graph illustrations from Wikipedia. The complete graph illustrations were made by koko90.

Optimization problems and PyCon financial aid

This year, I have partly taken on some new responsibilities in organizing PyCon. Specifically, I've done some work on financial aid. (Next year, I will be taking on the position of financial aid chair.) There have been some challenges that I've thrown some code at. Some of it stuck.

I'm very much interested in your comments, thoughts, alternative approaches, et cetera. Hit me up on Twitter.

Hotel room allocation

The first issue I solved was one of hotel room allocation. We provide the financial aid applicants with hotel rooms. To keep costs down, we pair people up, two by two.

We want to pair people so that we have to book a minimal number of days. If there's days on the start of the room booking where only one person is booking the room, we have to pay for the rest.

A greedy algorithm

I start out with generating every possible pairing. Assuming we have 300 financial aid applicants, that's:

$${300 \choose 2} = \frac{300!}{2! \cdot 298!} = 44850$$

That's more than we can check by hand, but still pretty easy for a computer. Besides, there are actually fewer pairs than that: some people request to be paired together explicitly, and we only pair people of the same gender.

Then, I sort them by number of unmatched days, that is, how many days would have a single person in the hotel room if we paired them together. Finally, I just greedily start grabbing pairs, keeping track of the people that have been allocated rooms as I go along. As soon as everyone is accounted for, we stop.

Other than perhaps minding the edge conditions where you have an odd number of people to pair, this isn't too tricky. The implementation for this is on Github, in lvh/pairing.

Solution, meet reality

Before I wrote this, humans did this by hand. That took pretty long, and the solutions were decent, but suboptimal. All the humans that have tackled this problem seem to take the same approach: sort by check-in date, find exact or near matches in check-out dates, rinse, repeat.

The algorithm above did quite well, coming up with a significantly better result than the previous human allocation. It also produced very interesting solutions. Turns out that by taking a big hit on one of the pairs (say, a pair that's 4 days mismatched four days), you can limit the damage on a large number of other pairs, and still come out on top. After reviewing with the person who previously had to do this manually, we quickly agreed that a human would most likely not have come up with this.

Furthermore, a lot of the pairs have matching check-out dates but with a mismatched check-in date; whereas humans typically only look for solutions in the other direction. This is easy to explain in hindsight; the algorithm has no concept of an ordering of events in time. It just tries to minimize the number of mismatched days.

I'm quite happy with the result. Any dollar I'm not devoting to a room that isn't being fully used will instead be going into an increased grant.

Unfortunately, finding this solution isn't the end of the story. It rarely survives the clash with reality. Many complex factors affect it: people will drastically change their room dates, people's visas get denied, et cetera. All sorts of unforseeable circumstances have lead to changes to the answer the algorithm originally produced. I don't expect that to change until PyCon is over; unfortunately, I also don't think that's a problem I can fix in a few lines of Clojure.

Grant allocation

The second issue is a much thornier one. Given the financial aid budget, figure out how to optimally distribute it. The budget is quite limited, a good deal smaller than the sum of what everyone requests.

There's a few reasons why this problem is complex. First of all, "optimal" is highly subjective. As I'm sure you'll notice very soon after thinking about it, it's very easy to verge off from math and straight into the realm of politics.

A very simple greedy algorithm

You might say that you want to get as many people as possible to come to PyCon, and you just greedily allocate the smallest grant requests first. This has a few disadvantages:

1. It biases against the people that ostensibly need the money the most.
2. It is ordering-sensitive: equivalent applications that happen to come later in the queue are disadvantaged.
3. It oversimplifies how people react to grants, by making grant allocation binary. Someone receiving 90% of their requested grant amount will likely still be able to attend. Suppose you have 11 people, all requesting 100 USD, and you have 1000 USD to spend. The greedy algorithm would give the first ten of them 100 USD and find its purse empty for the eleventh. It would almost certainly be better to give them all 90.91 USD instead.
4. Game-theoretically, it makes a lot of sense for everyone to apply for a small grant that they will almost certainly get. (This is not a real issue for us, because we still have humans eyeballing all the applications.)
5. The only difference between applications is the amount they're requesting, but there are people that we would like to bias in favor of. For example, we'd like to try very hard to get speakers or tutorial presenters to the conference, we might want to reward people doing open source work, et cetera.

Selecting these critera is definitely squarely in the realm of politics, and I'd like to stick to the realm of math; so let's just assume that we can assign scores to applicants and that those scores are fair and just. If you feel like we shouldn't bias in favor of anyone at all, just assume that whenever I say "the score of an applicant" I mean "the number 1".

A slightly smarter greedy algorithm

One of the key ideas (which a lot of people seem to have when tackling this problem) is to translate an application's score into an amount of the budget. So, for example, let's say that the sum of all scores is 100, and your score is 10, that means you can have up to 10% of the budget allocated to you.

So, I make a pass over all the remaining applicants. If you're asking less than your budget slice, you get your requested grant. If you're asking for more, I save you for a future pass. Even though the budget goes down between passes, budget slices will go up, because the total score drops.

You can't do this in one pass, increasing the budget as you go along if people request less than their slice, because you might introduce order dependence. For example, consider what happens when Alice and Charlie both need more than the budget allows for (and they have the same score and requested amount), and Bob needs less. If you process them in the order Alice - Bob - Charlie, Charlie may get his requested grant and Alice may not, just because Bob's in the middle making the budget bigger for Charlie.

Once everyone left is requesting more than their slice, they just get their slice instead of their requested amount.

This solution is adequate, but I'm not completely satisfied; everyone besides the last pass will get full grant amounts. You can find my implementation of this algorithm on Github as well.

A better model

I think we can do better by applying some elementary probability theory. What we really want is to maximize the total score of the applicants that we expect to see at PyCon as a consequence of financial aid. Probability theory has a thing called the expected value of a random variable, which is pretty much just the integral of the random variable with respect to the probability. In our case, that just means that we want to maximize the sum of the probabiliies that a given applicant will come to PyCon given a particular grant amount, weighted by their score:

$$\max \sum s_i \cdot p_i$$

That just restates problems we can't solve in terms of problems we can't solve. What on earth could the probability that someone shows up be? We can't know the real value, since it's specific to each individual, and dependent on a lot of random, unknowable events. We can, however, make an educated guess. If we don't give them any money, they probably won't come. If we give them the amount they're asking for, they probably will.

We could conjecture that the probability someone will come to the conference is approximately the fraction of the amount they requested that they actually received:

$$p_i = \frac{g_i}{r_i}$$

For example, if someone gets 90% of their requested grant, we estimate that the odds they will attend are also 90%; if we give them only 10%, we estimate it at just 10%.

Alternatively, we could conjecture that it's closer to the square of that ratio; when giving someone a value that's very close to what they asked for, their probability of attending is still going to be very high; but if we only give them half, the odds they will attend will be closer to 25%, much lower than 50%. So, the expression becomes:

$$p^{\prime}_i = \left(\frac{g_i}{r_i}\right)^{2}$$

Update: I now realize that this is probably not the expression that I actually want: it falls off very quickly as soon as an applicant gets anything less than the full amount. I've elaborated on this in a new blog post.

We will see how the square model emphasizes focusing the available funds on fewer grants, whereas the linear model will spread smaller grants more liberally.

An attempt with constraint solvers

(I would like to thank Mark Engelberg for helping me with the stuff in this chapter. It may not have panned out, but it was a very educational experience nonetheless.)

It'd be nice if we could just declaratively describe the problem, throw it into a computer program, and have it magically tell us the answer. Of course, there are programs that do exactly that, called solvers.

When I reached out on the Clojure mailing list, Mark Engelberg helpfully reached out and immediately handed out some cool runnable example for me to play with. Sure enough, they worked super, and with a carefully crafted three-person example we could clearly see the difference between the linear and quadratic models I was talking about above: under the square model, the concentrated grants, trying harder to give high-scoring applications the amount they requested. The linear model spread money out more evenly. That is, with the following applications:

(def alice {:name "Alice", :score 5, :requested 120})
(def bob {:name "Bob", :score 4, :requested 100})
(def carol {:name "Carol", :score 3, :requested 80})


In a linear model, it comes up with:

{alice 120
bob 80
carol 0}


However, in the quadratic model, the optimizer rather gives Carol (who has a lower score) a complete grant than give Bob a partial one:

{alice 120
bob 0
carol 80}


You can find the code on Github. The tests that confirm the difference in behavior for the linear and quadratic models are in the tests.

There's one issue though; it didn't scale. Not bad enough that you'd notice on the three-person example (I thought my REPL was just being laggy), but bad enough that any real-world problem would be completely unsolvable. Why? Clearly constraint solvers are awesome real world tools with real world usage, it's not like they're supposed to fall over as soon as you throw a problem of reasonable size at it.

Well, let's do some napkin math. If we have 200 financial aid recipients that are all getting 1000 possible options (somewhere between 0 USD and 1000 USD, in whole-dollar increments), there's 200,000 possible places to put any given dollar. If we have 100,000 USD to spend, that's:

$${2 \cdot 10^{5} \choose 10^{5}} = \frac{(2 \cdot 10^{5})!}{10^{5}! \cdot 10^{5}!} \approx 10^{60203}$$

That is a very big number. I needed logarithms to compute it. It is so huge that it makes the number of atoms in the observable universe (about $10^{80}$) look like rounding error.

In an attempt to "fix" that problem, I tried only handing out funds in chunks of 100 USD each. Then, we have 1000 chunks to hand out and 2000 places to put them:

$${2000 \choose 1000} = \frac{2000!}{1000! \cdot 1000!} \approx 10^{600}$$

I have made the problem 59,603 decimal orders of magnitude easier! Rejoice!

A constraint solver really doesn't "know" anything about the structure of the problem you're feeding it. That's a trade-off: in return, it grants you a lot of freedom in expressing your problem. Constraint solvers like LoCo are very valuable, they just aren't a good fit for this problem. They are designed for problems where the constraints really constraining the solution space. We're clearly not in that territory. We have many valid solutions, and we're struggling to look for a good one.

There are two ways we can fix this:

1. Compromising on the freedom of expressing the problem. By pouring our problem into a specific structured shape, we may be able to use a much faster solver, specific to that class of problems. Also, by allowing arbitrary real numbers instead of just integers, we can use much faster solvers.
2. Compromising on finding the optimal solution, instead searching for a decent solution. You can do that with algorithms like tabu search or simulated annealing.

I will be elaborating on these problems in a future blog post.