hypercathexis dev notes part 1

hy·per·ca·thex·is, n, pl hy·per·ca·thex·es \-kə-ˈthek-səs, -ka-\: excessive concentration of desire upon a particular object

I'm considering renaming the project to its plural, hypercathexes, because then it can be about a hyper cat in a bunch of hexes.

The only real constraints I started with was that I wanted a simultaneous turn-based space game on a hex grid.

Amit Patel from Red Blob Games has basically the awesomest page about hex grids, and a ton of awesome pages about many other areas of game development. I think it's a fantastic resource for programmers like myself who don't do game dev as a day job, but just want to make a little game on the side.

Simultaneous turn-based means that all players plan their moves simultaneously, and they are then also executed simultaneously. This has an interesting scaling effect. On the one hand, it clearly scales better to many players, because players "play" simultaneously. On the other hand, you start getting interesting problems. For example, clock synchronization. Does the entire world advance with the same tick-tock pattern? What happens when a player does not make a move within the allotted time? If you allow different clocks in the world, does time advance faster if both players submit a move, or do you always wait until the maximum timeout?

I wanted an excuse to play with Om and found Chestnut.

The first version had a working hex grid, but displayed it using offset coordinates. I wanted to work using axial coordinates as much as possible, because it makes a lot of math so much easier. Axial coordinates work together with the "grain" of the hex map:

axial base vectors

Second thing I did was move from individual <img> tags produced by Om to a single <svg> with hexes (<polygon>) inside it. This fixed a number of annoying placement issues with CSS. CSS really wants to position things based on bounding box, not midpoints. That's great for web pages, not so much for my hex grid. I ran into a number of annoying issues where certain hex borders would be wider than others. Browsers aren't made to draw hex grids based on left/right offsets, I guess...

I started by expressing distances in the SVG in terms of the hex width, which would become my unit. The height would then be \(\sqrt{3}/2\). Then, I realized that I could make my life easier by expressing all x coordinates in terms of a single hex width, and all y coordinates in terms of a single hex height; then I could just scale differently across x and y at the end.

Simple SVG hex grid

Developing in Firefox was mostly painless, but I discovered many discrepancies once trying it in Chrome. Things that should be the same aren't, particularly when it comes to transforms. For example, Chrome would occasionally literally do the inverse of the scaling it was supposed to:

Differences in scaling behavior across browsers

I guess trying to implement things in browsers was a mistake.

What the heck is a clojure.lang.IFn$LO?

It's no secret that I love Clojure. Like any tool though, it isn't perfect. Today, I was trying to write unit tests that use clojure.core.async/timeout, so I wrote a test double analogous to Twisted's Clock. As I tried to with-redefs it in, I got the most inscrutable error message out: java.lang.ClassCastException: icecap.handlers.delay_test$fake_timeout$timeout__22934 cannot be cast to clojure.lang.IFn$LO.

Wha? I know clojure.lang.IFn, Clojure's function type, but what the heck is a clojure.lang.IFn$LO?

Searching for the term didn't give any particularly useful results. It was clear this happened when I was redeffing the original timeout, so I looked at its documentation:

clojure.core.async/timeout
([msecs])
Returns a channel that will close after msecs

Doesn't look too special to me. What's the type of that thing, anyway? Let's find out:

> (parents (type timeout))
#{clojure.lang.IFn$LO clojure.lang.AFunction}

Aha! So that is actually part of timeout, not something else wonky going on. What does the source say? It's a pretty lame shim:

(defn timeout
"Returns a channel that will close after msecs"
[^long msecs]
(timers/timeout msecs))

I mean, nothing interesting there, just a type hint.

Oh. Wait. That's not just a type hint. long is a primitive. Testing:

> (parents (type (fn [^long x] x)))
#{clojure.lang.IFn$LO clojure.lang.AFunction}

Aha! Due to a JVM quirk, functions with a primitive type hint are special. That works for doubles, too:

> (parents (type (fn [^double x] x)))
#{clojure.lang.IFn$DO clojure.lang.AFunction}

And multiple arguments:

> (parents (type (fn [^double x ^double y] x)))
#{clojure.lang.IFn$DDO clojure.lang.AFunction}
> (parents (type (fn [^double x ^long y] x)))
#{clojure.lang.IFn$DLO clojure.lang.AFunction}

Adding a simple type hint to the function fixed it. Success!

On discussing software security improvements

A common criticism of information security folks is that they tend to advise people to not do any crypto. Through projects like Crypto 101, I've attempted to make a small contribution towards fixing that.

In the open source world, various people often try to improve the security of a project. Because designing secure systems is pretty hard, they often produce flawed proposals. The aforemetioned tendency for infosec-conscious people to tell them to stop doing crypto is experienced as unwelcoming, even dismissive. Typically, the only thing that's accomplished is that a lot of feelings get hurt; it seems to only rarely result in improved software.

I think that's quite unfortunate. I think open source is great, and we should be not just welcoming and inclusive, but aiming to produce secure software. Furthermore, even if a flawed proposal is unsalvageable, a clear description of why it is flawed will presumably result in fewer negative interactions. Best case scenario, the issues with a proposal can be discussed and potentially rectified.

In an effort to improve this situation, I'm documenting what I believe to be a useful way to discuss security changes and their tradeoffs. As Zooko has taught me:

Security isn't about perfect versus imperfect or about better versus worse, it's about this attack surface versus that attack surface.

This document aims to be the equivalent of an SSCCE for generic bug reports: a blueprint for making suggestions likely to lead to productive discourse, as long as we can agree that we're trying to produce more secure software, as well as provide a welcoming development environment.

Important points

A good proposal should contain:

  1. A brief description of what you're suggesting.
  2. A description of the attack model you're considering, why the current system does not address this issue, and why the suggested system does address this issue.
  3. A motivation of the attack model. Why is it important that this issue is actually addressed?
  4. How does this change affect the attack surface (i.e. all of the ways an attacker can attempt to attack a system)?
  5. What does the user experience for all users of the system look like? Many cryptosystems fall over because they're simply unusable for most users of the system.

An example

Wul (the widely underestimated language, pronounced /wool/) is a general purpose programming language. It has a package repository, WuPI (the Wul package index, pronounced /woopie/), the de facto standard for distributing and installing Wul software.

WuPI uses TLS as a secure transport. The WuF (Wul foundation, pronounced /woof/), maintains a root certificate, distributed with Wul. Thanks to a well-managed system of intermediary CAs run by a tireless army of volunteers, this means that both package authors and consumers know they're talking to the real WuPI.

Alice is the WuPI BDFL. Bob is a Wul programmer, and would like to improve the security of WuPI.

While consumers and authors know that they're talking to the real WuPI, there is no protection against a malicious WuPI endpoint. (This problem was recently made worse because WuPI introduced a CDN, greatly increasing the number of people who could own a node.). You know that you're talking to something with a WuF-signed certificate (presumably WuPI, provided the WuF has done a good job managing that certificate), but you have no idea if that thing is being honest about the packages it serves you.

Bob believes WuPI could solve this by using GPG signatures.

He starts with a brief description of the suggestion:

I would like to suggest that WuPI grows support for GPG signatures of packages. These signatures would be created when a package author uploads a package. They would optionally be verified when the user downloads a package.

He continues with the attack model being considered:

I believe this would secure WuPI consumers against a malicious WuPI endpoints. A malicious WuPI endpoint (assuming it acquires an appropriate certificate) is currently free to deliver whatever packages it wants.

He explains why the current model doesn't address this:

The current system assures authenticity and secrecy of the stream (through TLS), and it ensures that the server authenticates itself with a WuPI/WuF certificate. It does not ensure that the package is what the author uploaded.

He explains why he believes his model does address this:

Because the signatures are produced by the author's GPG key, a malicious WuPI endpoint would not be able to forge them. Therefore, a consumer is sure that a package with a valid signature is indeed from the author.

He explains why this attack model is important:

With the new CDN support, the number of people with access to such a certificate has greatly increased. While I certainly trust all of the volunteers involved, it would be nice if we didn't have to. Furthermore, the software on the servers can always be vulnerable to attack; as a high-value target, it certainly isn't inconceivable that an attacker would use an unknown vulnerability to take over a WuPI endpoint.

He (believes to) address the attack surface:

Because the signatures are optional, the attack surface remains the same.

Finally, he addresses the user experience:

The weak point of this scheme is most likely the user experience, because users historically seem to dislike using GPG.

I am hopeful that this increased value of participating in the GPG web of trust will mean that more people participate.

Alice reviews this, and notes a flaw in the proposal:

This proposal aims to address a security flaw when the WuPI endpoint is malicious by adding signatures. However, a malicious WuPI endpoint can lie by omission, and claim a package was never signed by the author.

Bob now realizes this issue, and suggests an improvement:

This could be rectified if the user insists on a signature for packages they expect to be signed.

As a side note, Alice notes that the attack surface does increase:

This places trust in author's ability to manage private keys, which has historically been shown to be problematic. That introduces a new attack vector: an attacker can attempt to go after the author's private key.

Regardless of the outcome of this conversation, there actually was a conversation. I believe this to be an improvement over the overall status quo.

Switched to Nikola

I've migrated from Octopress to Nikola.

Nothing personal. Octopress is fine software, but:

  • External packages, like themes, made my installation quickly impossible to understand and manage. What's the difference between javascripts and js, css and style? I don't know.
  • Performance. Nikola builds sites nearly instantly. Even with a moderate amount of pages, I found Octopress too slow. Maybe I just need a better laptop.
  • Python. While I'm sure the Ruby tools are of similar quality to Python's; I know Python's tools. Porting to Nikola was easier than learning about A-grade Ruby developer installations.

Anyway, I'm on Nikola now. Pretty happy with it. I used nikola-octopress-import. Did most of what I wanted; I contributed some minor PRs so that it would also handle extended date formats (with seconds and timezones) and legacy HTML posts.

On TrueCrypt and full-disk encryption

Since the early hours of May 29th (CEST), the TrueCrypt website (https://www.truecrypt.org) has pointed to http://truecrypt.sourceforge.net/, with an ominous-looking error message:

WARNING: Using TrueCrypt is not secure as it may contain unfixed security issues

This page exists only to help migrate existing data encrypted by TrueCrypt.

The development of TrueCrypt was ended in 5/2014 after Microsoft terminated support of Windows XP. Windows 8/7/Vista and later offer integrated support for encrypted disks and virtual disk images. Such integrated support is also available on other platforms (click here for more information). You should migrate any data encrypted by TrueCrypt to encrypted disks or virtual disk images supported on your platform.

The website then explains how you can install BitLocker, a proprietary disk encryption system available in many versions of Windows, as well as how you could "rescue" existing TrueCrypt volumes.

People were pretty unhappy, for a variety of reasons:

  • Lack of trust in proprietary full-disk encryption software.
  • Many versions of Windows didn't even ship with BitLocker, only a few "premium" versions did.

As a proprietary system, BitLocker is not susceptible to the same amount of public scrutiny as a publicly available system. This is in stark contrast with TrueCrypt, where Matt Green recently raised around 70k USD to perform an audit.

There are variety of scenarios that could've caused the TrueCypt website to suddenly sport that message. The live Internet audience has speculated wildly. I'll share my thoughts on that near the end of this post, but there are two points I'd like to make that I think are far more important:

  1. Consider if full-disk encryption is really what you want.
  2. If it is, consider if it should be TrueCrypt.

Full-disk encryption

Something I learned from the inimitable Zooko:

Security isn't about perfect versus imperfect or about better versus worse, it's about this attack surface versus that attack surface.

You can't just add more crypto junk to something and expect to somehow get better security. You have to consider what it is buying you, and what it's costing you.

Full-disk encryption buys you one simple thing: if someone steals your device while the encrypted volume is locked, they probably can't read it.

If the encrypted volume is unlocked, it's over; and that's the state it's probably usually in. If the key lives in RAM (it usually does), there's a variety of ways that can be extracted. There's devices that have complete direct memory access, including everything that speaks FireWire or has FireWire-compatibility built-in. Even if you shut off your machine, cold boot attacks mean that the key can be extracted for a limited about of time. Various jurisdictions can try to force you to hand over the keys.

If an attacker can write to the (encrypted!) volume, it's probably over. Virtually all sector-level full-disk encryption formats are unauthenticated (with good reason), they're all malleable and vulnerable to (adaptive) chosen-ciphertext attacks.

If you're encrypting files or blobs of data within other files, an authenticated encryption scheme like GPG is far more useful.

Don't get me wrong. Full-disk encryption is a good idea, and you should do it. I'm just saying that what it actually protects against is pretty limited. If there's files you want to keep secret, full-disk encryption is probably not enough.

For more details, try Thomas & Erin Ptacek's blog post on why XTS isn't what you want. XTS is a way to build tweakable ciphers that can be used to create full-disk encryption; but the post applies to full-disk encryption generically as well.

TrueCrypt as a full-disk encryption mechanism

So, you probably want full-disk encryption, and you probably want to make sure that sensitive data is encrypted on top of that. Great. What full-disk encryption scheme do you use?

I don't want to bash TrueCrypt. It is (was, perhaps?) a great go-to project for full-disk encryption. It got a lot of things right. It also got a bunch of things wrong.

TrueCrypt was made by a bunch of people we don't know occasionally throwing a bunch of binaries over the wall. The source is available (under a non-OSI license), so you could audit that and compile your own binaries, but the truth is that the vast majority of TrueCrypt users never actually did that.

The TrueCrypt disk format and encryption standards are a bit iffy. It has terrible file system support. It has major performance issues, partially due to questionable cryptographic practices such as cipher cascades. It was great for when it was originally conceived and full-disk encryption was still new and exciting, it's still pretty decent now, but we can certainly do better.

What actually happened to the website?

We don't really know. There's a couple of guesses.

First of all, it looks like it are the original authors that folded: the key is the same one that was being used to sign releases months ago. Of course, that could mean that it was compromised or that they were forced to hand it over.

Since the DNS records changed, the e-mail server behavior changed, the same key was used, the Sourceforge client was involved, and fairly major changes to the source code were involved, it's unlikely that it's a simple defacement.

It's unlikely that they folded because they felt discovery of some backdoor was imminent. Folding wouldn't actually stop that discovery, because the source was already open.

It's strange that they would point to alternatives like Bitlocker for Windows and even more dubious alternatives for other operating systems. The authors certainly knew that this would not be an acceptable alternative for the vast majority of their users.

Additionally, the support window for XP ending isn't a particularly convincing impetus for migrating away from TrueCrypt right now. This has lead some to believe that it's an automated release. Worse, it could be a gagged response: a big and powerful three-letter agency might be twisting their arm and forcing them to fold, in return for something else (like, say, not being thrown in a Gitmo cell and forgotten about). Again: pure speculation.

Another option is that some of the developers just really, genuinely folded. They felt that for atechnical users, BitLocker was good enough, while the particularly discerning TrueCrypt user would be able to find some other alternative.

Bottom line is that none of this really matters. What matters is what you should do next if you want to have full-disk encryption.

So what do I do now?

I think LUKS is probably the best system that we have right now, and one of the few that actually improves on TrueCrypt.

If you're on Linux, dm-crypt + cryptsetup + LUKS is probably what you want. If you're on OS X, FileVault 2 is probably what you want. If you're on Windows, your options are looking a bit thin right now:

  • Keep using old versions of TrueCrypt.
  • Give up and use BitLocker.
  • Use a Linux virtual machine to use dm-crypt/LUKS, eventually migrating to native support.

Personally, I think the first option is the most reasonable right now, and then wait until someone actually writes the native LUKS support.

Oh, and you should probably install GPG to encrypt some of those files stored on that encrypted volume.

More on financial aid grant optimization

In a previous post, I talked about ways to optimize PyCon financial aid grants. This is a follow-up on those efforts. Quick recap:

  • There is a fixed budget \(b\) available for grants, between 100k and 200k USD.
  • There are a number of people (approximately 300) requesting various amounts \(r_i\) (approximately between 100 USD and 2000 USD) of financial aid, and receive a grant \(g_i\) so that \(0 \le g_i \le r_i\).
  • Financial aid applicants can be assigned scores \(s_i\), a relative value describing how much we'd like to have them at PyCon.
  • PyCon wants to optimize the total expected value of scores. That means getting as many people as possible to come, weighted by score.
  • We've conjectured that we can estimate the probability \(p_i\) that someone attends as either \(g_i/r_i\), or \((g_i/r_i)^2\). The former prefers to spread the budget across a larger number of smaller grants, whereas the latter prefers to focus the budget into a smaller number of larger grants.

If any of that doesn't make sense, you should read the previous blog post for more details.

In short, we're trying to solve the optimization problem:

$$\max \sum E[S_i] = \sum s_i \cdot p_i$$

Since most optimization texts appear to prefer minimization, alternatively:

$$\min - \sum s_i \cdot p_i$$

Subject to a budget constraint and an individual grant constraint:

$$\sum g_i \le b$$

$$0 \le g_i \le r_i$$

The greater-than-zero constraint for individual grants is fairly important, otherwise the algorithm might casually give you answers like:

k = 1: [ 10.  20. -20.  30.  50.  10.  50.], sum: 150.0

That list in the middle are the per-person grants. Notice how the algorithm feels that the third person really ought to pony up some cash so that some of the other people can go to PyCon ;-)

Squared problems

In the previous blog post I ran into issues using a very generic constraint solver. I ended that post saying that I would try to remeedy that by applying a more specific solver that takes advantage of a particular structure of the problem.

When you set \(p_i = g_i/r_i\), this turns into a linear programming problem, since \(r_i\) is a constant. When you set \(p_i = \left(g_i/r_i\right)^2\), it turns into a quadratic programming problem. Turns out there's two things I missed about the quadratic problem:

  • The resulting problem is not convex. That means it's (probably) difficult to solve.
  • The estimated probability that someone will attend falls off sharply as soon as they don't receive the full amount they requested. At 50% of the requested grant, the estimated probability of attending is only 25%; at 90%, it's 81%. This is the opposite of what we want.

Fixing the model

That doesn't mean we should put the \((g_i/r_i)^k\) out to pasture: it just means that I didn't pick the \(k\) I really wanted. Specifically, if I were to pick \(k=1/2\), I'd get:

$$p_i = \left(\frac{g_i}{r_i}\right)^{1/2} = \sqrt{\frac{g_i}{r_i}}$$

In general, if \(k = 1/n\):

$$p_i = \left(\frac{g_i}{r_i}\right)^{1/n} = \sqrt[n]{\frac{g_i}{r_i}}$$

That problem is convex, but it isn't linear, quadratic, or some other easy specific problem. It's just constrained multivariate convex optimization. That's okay, there are still a couple of applicable optimization algorithms.

Some of these algorithms require the derivative of the goal function with respect to a particular grant size \(g_j\) at a particular point. In case we don't have the real derivative, we can still provide a numerical approximation. In our case, we don't really need to approximate, since the derivatives are fairly easy to compute analytically:

$$\frac{\partial}{\partial g_j} \sum_i s_i \cdot \sqrt[n]{\frac{g_i}{r_i}} = \frac{s_j \cdot {g_j}^{\frac{1}{n} - 1}}{\sqrt[n]{r_j} \cdot n}$$

Finding a solution with Python

There are a few Python packages that contain optimization algorithms:

  • SciPy
  • cvxopt
  • pyopt

Someone originally pointed me towards cvxopt. While I'm sure it's excellent software, I already knew SciPy, so I went with that.

SciPy's optimization module provides the following algorithms:

  • fmin_l_bfgs_b - Zhu, Byrd, and Nocedal's constrained optimizer
  • fmin_tnc - Truncated Newton code
  • fmin_cobyla - Constrained optimization by linear approximation
  • fmin_slsqp - Minimization using sequential least-squares programming
  • nnls - Linear least-squares problem with non-negativity constraint

The first two are not applicable because they only appear to support bounds on individual variables. I also need a constraint over the sum of variables for the budget: \(\sum g_i \le b\). The last one isn't applicable because this isn't a linear least-squares problem. That leaves cobyla and slsqp.

Experimenting with linear approximation (COBYLA)

Making COBYLA work ended up being pretty easy. The only non-trivial part is expressing all your constraints as expressions greater than or equal to zero.

I've uploaded my IPython notebook (viewer, gist).

This produced the following results:

scores: [1 1 1 2 3 5 5]
requested: [10 20 30 30 50 10 50] total: 200, budget: 150
k = 1/0.5: [ 10.  20.  30.  30.  50.  10.   0.], sum: 150.0
k = 1/1: [ 10.  -0.   0.  30.  50.  10.  50.], sum: 150.0
k = 1/2: [ 10.  10.   7.  27.  36.  10.  50.], sum: 150.0
k = 1/3: [ 10.  11.   9.  25.  35.  10.  50.], sum: 150.0
k = 1/5: [ 10.  11.  10.  24.  35.  10.  50.], sum: 150.0
k = 1/10: [ 10.  11.  11.  23.  35.  10.  50.], sum: 150.0
k = 1/100: [ 10.  11.  11.  23.  34.  10.  50.], sum: 150.0

Some takeaways:

  • As predicted, as \(1/k\) increases, the optimization gradually starts spreading the budget out more evenly; preferring to give many partial grants rather than a few large ones.
  • This effect is mostly only pronounced for \(1/k = 2, 3\); after that, increasing \(1/k\) doesn't make much of a difference anymore. This is what I'd expect when I visualize the \(p_i\) functions in my head, but I haven't ruled out numerical instability.
  • Even at high \(1/k\), the two high scorers get their full grant amount. That's quite understandable for the one that's only asking for 10, but even the one asking for a large grant gets it unconditionally. This probably means that tweaking the scoring functions is going to be very important.
  • The linear version is apparently already quite brutal: the person with score 3 requesting 50 simply gets it entirely, leaving no budget left over for the people with score 1.

Since I'm so pleased with these results, I'm skipping slsqp until I feel like it. Next up, I'll try to compare the COBYLA results above with the results of the greedy algorithm under various fitness metrics.

Optimal hotel room pairing

The greedy hotel pairing algorithm from my previous post gives good solutions. If you make some fairly reasonable-sounding assumptions about the problem, very good solutions. At any rate nicer than forcing a human to drudge through it; both in terms of person-hours spent and quality of the solutions.

However, it turns out I was wrong about them being optimal solutions. There's no guarantee they would be, and in fact a good chance that they won't.

Finding optimal solutions

The good news is that there is a way to find the optimal solution. This problem can be modeled as a graph matching problem; what we're trying to find is a weighted maximal matching.

This is really easy if the graph is bipartite, but in our case we're actually dealing with two complete graphs: we try to pair people with similar gender. Within that compatible group, anyone can theoretically pair with anyone.

People interested in additional reading might be interested in:

Picking an algorithm

There are several feasible algorithms for this.

The current state of the art in computerized algorithms appears to be Blossom V by Vladimir Kolmogorov. This algorithm finds perfect matchings (i.e. all nodes are included), which may be impossible for us, e.g. because of an odd number of pairs.

Issues

One issue with this approach is that it becomes pretty much impossible to adapt this to rooms for more than two people. In order to support that case my previous greedy algorithm remains the best thing I've found.

I was unable to find an implementation of this in Java or Clojure. Fortunately, there is a Python library called NetworkX that does have an implementation.

I have not yet turned this into a fully functional program, because:

  • it's too late to apply it for this year
  • there's a good chance we won't be doing this again for next year

Credits

I'd like to thank Tor Myklebust (tmyklebu on Freenode) for pointing out I was wrong, and shoving me in the direction of the answers.

In this blog post I used several freely available graph illustrations from Wikipedia. The complete graph illustrations were made by koko90.

Optimization problems and PyCon financial aid

This year, I have partly taken on some new responsibilities in organizing PyCon. Specifically, I've done some work on financial aid. (Next year, I will be taking on the position of financial aid chair.) There have been some challenges that I've thrown some code at. Some of it stuck.

I'm very much interested in your comments, thoughts, alternative approaches, et cetera. Hit me up on Twitter.

Hotel room allocation

The first issue I solved was one of hotel room allocation. We provide the financial aid applicants with hotel rooms. To keep costs down, we pair people up, two by two.

We want to pair people so that we have to book a minimal number of days. If there's days on the start of the room booking where only one person is booking the room, we have to pay for the rest.

A greedy algorithm

I start out with generating every possible pairing. Assuming we have 300 financial aid applicants, that's:

$${300 \choose 2} = \frac{300!}{2! \cdot 298!} = 44850$$

That's more than we can check by hand, but still pretty easy for a computer. Besides, there are actually fewer pairs than that: some people request to be paired together explicitly, and we only pair people of the same gender.

Then, I sort them by number of unmatched days, that is, how many days would have a single person in the hotel room if we paired them together. Finally, I just greedily start grabbing pairs, keeping track of the people that have been allocated rooms as I go along. As soon as everyone is accounted for, we stop.

Other than perhaps minding the edge conditions where you have an odd number of people to pair, this isn't too tricky. The implementation for this is on Github, in lvh/pairing.

Solution, meet reality

Before I wrote this, humans did this by hand. That took pretty long, and the solutions were decent, but suboptimal. All the humans that have tackled this problem seem to take the same approach: sort by check-in date, find exact or near matches in check-out dates, rinse, repeat.

The algorithm above did quite well, coming up with a significantly better result than the previous human allocation. It also produced very interesting solutions. Turns out that by taking a big hit on one of the pairs (say, a pair that's 4 days mismatched four days), you can limit the damage on a large number of other pairs, and still come out on top. After reviewing with the person who previously had to do this manually, we quickly agreed that a human would most likely not have come up with this.

Furthermore, a lot of the pairs have matching check-out dates but with a mismatched check-in date; whereas humans typically only look for solutions in the other direction. This is easy to explain in hindsight; the algorithm has no concept of an ordering of events in time. It just tries to minimize the number of mismatched days.

I'm quite happy with the result. Any dollar I'm not devoting to a room that isn't being fully used will instead be going into an increased grant.

Unfortunately, finding this solution isn't the end of the story. It rarely survives the clash with reality. Many complex factors affect it: people will drastically change their room dates, people's visas get denied, et cetera. All sorts of unforseeable circumstances have lead to changes to the answer the algorithm originally produced. I don't expect that to change until PyCon is over; unfortunately, I also don't think that's a problem I can fix in a few lines of Clojure.

Grant allocation

The second issue is a much thornier one. Given the financial aid budget, figure out how to optimally distribute it. The budget is quite limited, a good deal smaller than the sum of what everyone requests.

There's a few reasons why this problem is complex. First of all, "optimal" is highly subjective. As I'm sure you'll notice very soon after thinking about it, it's very easy to verge off from math and straight into the realm of politics.

A very simple greedy algorithm

You might say that you want to get as many people as possible to come to PyCon, and you just greedily allocate the smallest grant requests first. This has a few disadvantages:

  1. It biases against the people that ostensibly need the money the most.
  2. It is ordering-sensitive: equivalent applications that happen to come later in the queue are disadvantaged.
  3. It oversimplifies how people react to grants, by making grant allocation binary. Someone receiving 90% of their requested grant amount will likely still be able to attend. Suppose you have 11 people, all requesting 100 USD, and you have 1000 USD to spend. The greedy algorithm would give the first ten of them 100 USD and find its purse empty for the eleventh. It would almost certainly be better to give them all 90.91 USD instead.
  4. Game-theoretically, it makes a lot of sense for everyone to apply for a small grant that they will almost certainly get. (This is not a real issue for us, because we still have humans eyeballing all the applications.)
  5. The only difference between applications is the amount they're requesting, but there are people that we would like to bias in favor of. For example, we'd like to try very hard to get speakers or tutorial presenters to the conference, we might want to reward people doing open source work, et cetera.

Selecting these critera is definitely squarely in the realm of politics, and I'd like to stick to the realm of math; so let's just assume that we can assign scores to applicants and that those scores are fair and just. If you feel like we shouldn't bias in favor of anyone at all, just assume that whenever I say "the score of an applicant" I mean "the number 1".

A slightly smarter greedy algorithm

One of the key ideas (which a lot of people seem to have when tackling this problem) is to translate an application's score into an amount of the budget. So, for example, let's say that the sum of all scores is 100, and your score is 10, that means you can have up to 10% of the budget allocated to you.

So, I make a pass over all the remaining applicants. If you're asking less than your budget slice, you get your requested grant. If you're asking for more, I save you for a future pass. Even though the budget goes down between passes, budget slices will go up, because the total score drops.

You can't do this in one pass, increasing the budget as you go along if people request less than their slice, because you might introduce order dependence. For example, consider what happens when Alice and Charlie both need more than the budget allows for (and they have the same score and requested amount), and Bob needs less. If you process them in the order Alice - Bob - Charlie, Charlie may get his requested grant and Alice may not, just because Bob's in the middle making the budget bigger for Charlie.

Once everyone left is requesting more than their slice, they just get their slice instead of their requested amount.

This solution is adequate, but I'm not completely satisfied; everyone besides the last pass will get full grant amounts. You can find my implementation of this algorithm on Github as well.

A better model

I think we can do better by applying some elementary probability theory. What we really want is to maximize the total score of the applicants that we expect to see at PyCon as a consequence of financial aid. Probability theory has a thing called the expected value of a random variable, which is pretty much just the integral of the random variable with respect to the probability. In our case, that just means that we want to maximize the sum of the probabiliies that a given applicant will come to PyCon given a particular grant amount, weighted by their score:

$$\max \sum s_i \cdot p_i$$

That just restates problems we can't solve in terms of problems we can't solve. What on earth could the probability that someone shows up be? We can't know the real value, since it's specific to each individual, and dependent on a lot of random, unknowable events. We can, however, make an educated guess. If we don't give them any money, they probably won't come. If we give them the amount they're asking for, they probably will.

We could conjecture that the probability someone will come to the conference is approximately the fraction of the amount they requested that they actually received:

$$p_i = \frac{g_i}{r_i}$$

For example, if someone gets 90% of their requested grant, we estimate that the odds they will attend are also 90%; if we give them only 10%, we estimate it at just 10%.

Alternatively, we could conjecture that it's closer to the square of that ratio; when giving someone a value that's very close to what they asked for, their probability of attending is still going to be very high; but if we only give them half, the odds they will attend will be closer to 25%, much lower than 50%. So, the expression becomes:

$$p^{\prime}_i = \left(\frac{g_i}{r_i}\right)^{2}$$

Update: I now realize that this is probably not the expression that I actually want: it falls off very quickly as soon as an applicant gets anything less than the full amount. I've elaborated on this in a new blog post.

We will see how the square model emphasizes focusing the available funds on fewer grants, whereas the linear model will spread smaller grants more liberally.

An attempt with constraint solvers

(I would like to thank Mark Engelberg for helping me with the stuff in this chapter. It may not have panned out, but it was a very educational experience nonetheless.)

It'd be nice if we could just declaratively describe the problem, throw it into a computer program, and have it magically tell us the answer. Of course, there are programs that do exactly that, called solvers.

When I reached out on the Clojure mailing list, Mark Engelberg helpfully reached out and immediately handed out some cool runnable example for me to play with. Sure enough, they worked super, and with a carefully crafted three-person example we could clearly see the difference between the linear and quadratic models I was talking about above: under the square model, the concentrated grants, trying harder to give high-scoring applications the amount they requested. The linear model spread money out more evenly. That is, with the following applications:

(def alice {:name "Alice", :score 5, :requested 120})
(def bob {:name "Bob", :score 4, :requested 100})
(def carol {:name "Carol", :score 3, :requested 80})

In a linear model, it comes up with:

{alice 120
 bob 80
 carol 0}

However, in the quadratic model, the optimizer rather gives Carol (who has a lower score) a complete grant than give Bob a partial one:

{alice 120
 bob 0
 carol 80}

You can find the code on Github. The tests that confirm the difference in behavior for the linear and quadratic models are in the tests.

There's one issue though; it didn't scale. Not bad enough that you'd notice on the three-person example (I thought my REPL was just being laggy), but bad enough that any real-world problem would be completely unsolvable. Why? Clearly constraint solvers are awesome real world tools with real world usage, it's not like they're supposed to fall over as soon as you throw a problem of reasonable size at it.

Well, let's do some napkin math. If we have 200 financial aid recipients that are all getting 1000 possible options (somewhere between 0 USD and 1000 USD, in whole-dollar increments), there's 200,000 possible places to put any given dollar. If we have 100,000 USD to spend, that's:

$${2 \cdot 10^{5} \choose 10^{5}} = \frac{(2 \cdot 10^{5})!}{10^{5}! \cdot 10^{5}!} \approx 10^{60203}$$

That is a very big number. I needed logarithms to compute it. It is so huge that it makes the number of atoms in the observable universe (about \(10^{80}\)) look like rounding error.

In an attempt to "fix" that problem, I tried only handing out funds in chunks of 100 USD each. Then, we have 1000 chunks to hand out and 2000 places to put them:

$${2000 \choose 1000} = \frac{2000!}{1000! \cdot 1000!} \approx 10^{600}$$

I have made the problem 59,603 decimal orders of magnitude easier! Rejoice!

A constraint solver really doesn't "know" anything about the structure of the problem you're feeding it. That's a trade-off: in return, it grants you a lot of freedom in expressing your problem. Constraint solvers like LoCo are very valuable, they just aren't a good fit for this problem. They are designed for problems where the constraints really constraining the solution space. We're clearly not in that territory. We have many valid solutions, and we're struggling to look for a good one.

There are two ways we can fix this:

  1. Compromising on the freedom of expressing the problem. By pouring our problem into a specific structured shape, we may be able to use a much faster solver, specific to that class of problems. Also, by allowing arbitrary real numbers instead of just integers, we can use much faster solvers.
  2. Compromising on finding the optimal solution, instead searching for a decent solution. You can do that with algorithms like tabu search or simulated annealing.

I will be elaborating on these problems in a future blog post.

External drive lossage on Mavericks

Welp, my WD 1TB Passport drive borked. Seems to be a common issue on Mavericks with a select number of drives, including this one, apparently...

Common themes:

  1. Started with OS X complaining at me because I supposedly ejected the drive unsafely, even though the drive was still in the USB slot.
  2. Drive isn't mounted automatically anymore. Console says filesystem is busted: 17/12/13 20:18:35,000 kernel[0]: hfs: Runtime corruption detected on My Passport, fsck will be forced on next mount.
  3. Drive can't be /unmounted/ using Disk Utility either both in regular desktop mode and in recovery mode (booting with ⌘+R). This means that even trying to just nuke the partition and starting over doesn't work; it just sits there for a while and then eventually times out saying it can't unmount the drive.

I did not, at any point, have WD's drive management software installed (unless it managed to do it behind my back). First thing I did was put a new GUID table on it with one partition.

Apparently the solution is to boot into Linux to fix the drive. I'll let you know how that pans out.

UPDATE: Booted into Linux, was able to read everything. fsck does report something wrong with the filesystem, but not bad enough that I can't mount or umount it. Was able to salvage what looks to be all the data I wanted; didn't bother to try and get e.g. Time Machine or Spotlight index data off, which is where I suspect the issues to stem from.

Using protractor with AngularJS and Yeoman

In writing the website for Crypto 101, my upcoming book, I've been toying around with AngularJS. The basic setup was done using Yeoman.

First of all, the Yeoman Angular generator gives you a karma-e2e.conf.js, but doesn't really configure grunt to do any end-to-end testing. Turns out that the new hotness for Angular scenario testing is based on Protractor. Karma isn't deprecated, it's just for unit tests. Protractor is based on Selenium, which is probably an upgrade :-)

Anyway, here's what I had to do to get started.

  1. In the root of your app, install protractor and the Selenium standalone server.
npm install --save-dev protractor
./node_modules/protractor/bin/install_selenium_standalone
  1. Optionally, add selenium to your .gitignore, since it contains a ~30MB binary that doesn't belong in your git repo.

  2. Add a protractor.conf.js file. Here's mine. Remember that there's some stuff in there for you to uncomment, so don't just copy paste. On my machine, the default driver was Internet Explorer for Windows, which is strange given that I'm running on OS X. Oh well. This uses Chrome instead.

exports.config = {
  seleniumAddress: 'http://localhost:4444/wd/hub',
  specs: [
    // To run plain JS files, uncomment the following line:
    // './test/integration/*.js',
    // To run Coffeescript files, uncomment the following line:
    // '.tmp/integration/*.js'
  ],

  // Set capabilities.
  capabilities: {
    'browserName': 'chrome'
  },

  jasmineNodeOpts: {
    showColors: true,
    defaultTimeoutInterval: 30000
  }
}
  1. If you're using Coffeescript, change your Gruntfile to build your integration tests:
  grunt.initConfig({
    // ...
    coffee: {
      // ...
      test: {
        files: [
        // ...
        {
          expand: true,
          cwd: 'test/integration',
          src: '{,*/}*.coffee',
          dest: '.tmp/integration',
          ext: '.js'
        }]
      }
    },
  // ...
})
  1. Write some tests in test/integration.

  2. You can now manually run your tests. First run Selenium:

./node_modules/protractor/bin/install_selenium_standalone

Then run your tests:

./node_modules/protractor/bin/protractor protractor.conf.js

Once I figure out the proper way to do this in Grunt, I'll let you know.

Share