TrueCrypt and full-disk encryption

Since the early hours of May 29th (CEST), the TrueCrypt website (https://www.truecrypt.org) has pointed to http://truecrypt.sourceforge.net/, with an ominous-looking error message:

WARNING: Using TrueCrypt is not secure as it may contain unfixed security issues

This page exists only to help migrate existing data encrypted by TrueCrypt.

The development of TrueCrypt was ended in 5/2014 after Microsoft terminated support of Windows XP. Windows 8/7/Vista and later offer integrated support for encrypted disks and virtual disk images. Such integrated support is also available on other platforms (click here for more information). You should migrate any data encrypted by TrueCrypt to encrypted disks or virtual disk images supported on your platform.

The website then explains how you can install BitLocker, a proprietary disk encryption system available in many versions of Windows, as well as how you could "rescue" existing TrueCrypt volumes.

People were pretty unhappy, for a variety of reasons:

  • Lack of trust in proprietary full-disk encryption software.
  • Many versions of Windows didn't even ship with BitLocker, only a few "premium" versions did.

As a proprietary system, BitLocker is not susceptible to the same amount of public scrutiny as a publicly available system. This is in stark contrast with TrueCrypt, where Matt Green recently raised around 70k USD to perform an audit.

There are variety of scenarios that could've caused the TrueCypt website to suddenly sport that message. The live Internet audience has speculated wildly. I'll share my thoughts on that near the end of this post, but there are two points I'd like to make that I think are far more important:

  1. Consider if full-disk encryption is really what you want.
  2. If it is, consider if it should be TrueCrypt.

Full-disk encryption

Something I learned from the inimitable Zooko:

Security isn't about perfect versus imperfect or about better versus worse, it's about this attack surface versus that attack surface.

You can't just add more crypto junk to something and expect to somehow get better security. You have to consider what it is buying you, and what it's costing you.

Full-disk encryption buys you one simple thing: if someone steals your device while the encrypted volume is locked, they probably can't read it.

If the encrypted volume is unlocked, it's over; and that's the state it's probably usually in. If the key lives in RAM (it usually does), there's a variety of ways that can be extracted. There's devices that have complete direct memory access, including everything that speaks FireWire or has FireWire-compatibility built-in. Even if you shut off your machine, cold boot attacks mean that the key can be extracted for a limited about of time. Various jurisdictions can try to force you to hand over the keys.

If an attacker can write to the (encrypted!) volume, it's probably over. Virtually all sector-level full-disk encryption formats are unauthenticated (with good reason), they're all malleable and vulnerable to (adaptive) chosen-ciphertext attacks.

If you're encrypting files or blobs of data within other files, an authenticated encryption scheme like GPG is far more useful.

Don't get me wrong. Full-disk encryption is a good idea, and you should do it. I'm just saying that what it actually protects against is pretty limited. If there's files you want to keep secret, full-disk encryption is probably not enough.

For more details, try Thomas & Erin Ptacek's blog post on why XTS isn't what you want. XTS is a way to build tweakable ciphers that can be used to create full-disk encryption; but the post applies to full-disk encryption generically as well.

TrueCrypt as a full-disk encryption mechanism

So, you probably want full-disk encryption, and you probably want to make sure that sensitive data is encrypted on top of that. Great. What full-disk encryption scheme do you use?

I don't want to bash TrueCrypt. It is (was, perhaps?) a great go-to project for full-disk encryption. It got a lot of things right. It also got a bunch of things wrong.

TrueCrypt was made by a bunch of people we don't know occasionally throwing a bunch of binaries over the wall. The source is available (under a non-OSI license), so you could audit that and compile your own binaries, but the truth is that the vast majority of TrueCrypt users never actually did that.

The TrueCrypt disk format and encryption standards are a bit iffy. It has terrible file system support. It has major performance issues, partially due to questionable cryptographic practices such as cipher cascades. It was great for when it was originally conceived and full-disk encryption was still new and exciting, it's still pretty decent now, but we can certainly do better.

What actually happened to the website?

We don't really know. There's a couple of guesses.

First of all, it looks like it are the original authors that folded: the key is the same one that was being used to sign releases months ago. Of course, that could mean that it was compromised or that they were forced to hand it over.

Since the DNS records changed, the e-mail server behavior changed, the same key was used, the Sourceforge client was involved, and fairly major changes to the source code were involved, it's unlikely that it's a simple defacement.

It's unlikely that they folded because they felt discovery of some backdoor was imminent. Folding wouldn't actually stop that discovery, because the source was already open.

It's strange that they would point to alternatives like Bitlocker for Windows and even more dubious alternatives for other operating systems. The authors certainly knew that this would not be an acceptable alternative for the vast majority of their users.

Additionally, the support window for XP ending isn't a particularly convincing impetus for migrating away from TrueCrypt right now. This has lead some to believe that it's an automated release. Worse, it could be a gagged response: a big and powerful three-letter agency might be twisting their arm and forcing them to fold, in return for something else (like, say, not being thrown in a Gitmo cell and forgotten about). Again: pure speculation.

Another option is that some of the developers just really, genuinely folded. They felt that for atechnical users, BitLocker was good enough, while the particularly discerning TrueCrypt user would be able to find some other alternative.

Bottom line is that none of this really matters. What matters is what you should do next if you want to have full-disk encryption.

So what do I do now?

I think LUKS is probably the best system that we have right now, and one of the few that actually improves on TrueCrypt.

If you're on Linux, dm-crypt + cryptsetup + LUKS is probably what you want. If you're on OS X, FileVault 2 is probably what you want. If you're on Windows, your options are looking a bit thin right now:

  • Keep using old versions of TrueCrypt.
  • Give up and use BitLocker.
  • Use a Linux virtual machine to use dm-crypt/LUKS, eventually migrating to native support.

Personally, I think the first option is the most reasonable right now, and then wait until someone actually writes the native LUKS support.

Oh, and you should probably install GPG to encrypt some of those files stored on that encrypted volume.

More on financial aid grant optimization

In a previous post, I talked about ways to optimize PyCon financial aid grants. This is a follow-up on those efforts. Quick recap:

  • There is a fixed budget \(b\) available for grants, between 100k and 200k USD.
  • There are a number of people (approximately 300) requesting various amounts \(r_i\) (approximately between 100 USD and 2000 USD) of financial aid, and receive a grant \(g_i\) so that \(0 \le g_i \le r_i\).
  • Financial aid applicants can be assigned scores \(s_i\), a relative value describing how much we'd like to have them at PyCon.
  • PyCon wants to optimize the total expected value of scores. That means getting as many people as possible to come, weighted by score.
  • We've conjectured that we can estimate the probability \(p_i\) that someone attends as either \(g_i/r_i\), or \((g_i/r_i)^2\). The former prefers to spread the budget across a larger number of smaller grants, whereas the latter prefers to focus the budget into a smaller number of larger grants.

If any of that doesn't make sense, you should read the previous blog post for more details.

In short, we're trying to solve the optimization problem:

$$\max \sum E[S_i] = \sum s_i \cdot p_i$$

Since most optimization texts appear to prefer minimization, alternatively:

$$\min - \sum s_i \cdot p_i$$

Subject to a budget constraint and an individual grant constraint:

$$\sum g_i \le b$$

$$0 \le g_i \le r_i$$

The greater-than-zero constraint for individual grants is fairly important, otherwise the algorithm might casually give you answers like:

k = 1: [ 10.  20. -20.  30.  50.  10.  50.], sum: 150.0

That list in the middle are the per-person grants. Notice how the algorithm feels that the third person really ought to pony up some cash so that some of the other people can go to PyCon ;-)

Squared problems

In the previous blog post I ran into issues using a very generic constraint solver. I ended that post saying that I would try to remeedy that by applying a more specific solver that takes advantage of a particular structure of the problem.

When you set \(p_i = g_i/r_i\), this turns into a linear programming problem, since \(r_i\) is a constant. When you set \(p_i = \left(g_i/r_i\right)^2\), it turns into a quadratic programming problem. Turns out there's two things I missed about the quadratic problem:

  • The resulting problem is not convex. That means it's (probably) difficult to solve.
  • The estimated probability that someone will attend falls off sharply as soon as they don't receive the full amount they requested. At 50% of the requested grant, the estimated probability of attending is only 25%; at 90%, it's 81%. This is the opposite of what we want.

Fixing the model

That doesn't mean we should put the \((g_i/r_i)^k\) out to pasture: it just means that I didn't pick the \(k\) I really wanted. Specifically, if I were to pick \(k=1/2\), I'd get:

$$p_i = \left(\frac{g_i}{r_i}\right)^{1/2} = \sqrt{\frac{g_i}{r_i}}$$

In general, if \(k = 1/n\):

$$p_i = \left(\frac{g_i}{r_i}\right)^{1/n} = \sqrt[n]{\frac{g_i}{r_i}}$$

That problem is convex, but it isn't linear, quadratic, or some other easy specific problem. It's just constrained multivariate convex optimization. That's okay, there are still a couple of applicable optimization algorithms.

Some of these algorithms require the derivative of the goal function with respect to a particular grant size \(g_j\) at a particular point. In case we don't have the real derivative, we can still provide a numerical approximation. In our case, we don't really need to approximate, since the derivatives are fairly easy to compute analytically:

$$\frac{\partial}{\partial g_j} \sum_i s_i \cdot \sqrt[n]{\frac{g_i}{r_i}} = \frac{s_j \cdot {g_j}^{\frac{1}{n} - 1}}{\sqrt[n]{r_j} \cdot n}$$

Finding a solution with Python

There are a few Python packages that contain optimization algorithms:

  • SciPy
  • cvxopt
  • pyopt

Someone originally pointed me towards cvxopt. While I'm sure it's excellent software, I already knew SciPy, so I went with that.

SciPy's optimization module provides the following algorithms:

  • fmin_l_bfgs_b - Zhu, Byrd, and Nocedal's constrained optimizer
  • fmin_tnc - Truncated Newton code
  • fmin_cobyla - Constrained optimization by linear approximation
  • fmin_slsqp - Minimization using sequential least-squares programming
  • nnls - Linear least-squares problem with non-negativity constraint

The first two are not applicable because they only appear to support bounds on individual variables. I also need a constraint over the sum of variables for the budget: \(\sum g_i \le b\). The last one isn't applicable because this isn't a linear least-squares problem. That leaves cobyla and slsqp.

Experimenting with linear approximation (COBYLA)

Making COBYLA work ended up being pretty easy. The only non-trivial part is expressing all your constraints as expressions greater than or equal to zero.

I've uploaded my IPython notebook (viewer, gist).

This produced the following results:

scores: [1 1 1 2 3 5 5]
requested: [10 20 30 30 50 10 50] total: 200, budget: 150
k = 1/0.5: [ 10.  20.  30.  30.  50.  10.   0.], sum: 150.0
k = 1/1: [ 10.  -0.   0.  30.  50.  10.  50.], sum: 150.0
k = 1/2: [ 10.  10.   7.  27.  36.  10.  50.], sum: 150.0
k = 1/3: [ 10.  11.   9.  25.  35.  10.  50.], sum: 150.0
k = 1/5: [ 10.  11.  10.  24.  35.  10.  50.], sum: 150.0
k = 1/10: [ 10.  11.  11.  23.  35.  10.  50.], sum: 150.0
k = 1/100: [ 10.  11.  11.  23.  34.  10.  50.], sum: 150.0

Some takeaways:

  • As predicted, as \(1/k\) increases, the optimization gradually starts spreading the budget out more evenly; preferring to give many partial grants rather than a few large ones.
  • This effect is mostly only pronounced for \(1/k = 2, 3\); after that, increasing \(1/k\) doesn't make much of a difference anymore. This is what I'd expect when I visualize the \(p_i\) functions in my head, but I haven't ruled out numerical instability.
  • Even at high \(1/k\), the two high scorers get their full grant amount. That's quite understandable for the one that's only asking for 10, but even the one asking for a large grant gets it unconditionally. This probably means that tweaking the scoring functions is going to be very important.
  • The linear version is apparently already quite brutal: the person with score 3 requesting 50 simply gets it entirely, leaving no budget left over for the people with score 1.

Since I'm so pleased with these results, I'm skipping slsqp until I feel like it. Next up, I'll try to compare the COBYLA results above with the results of the greedy algorithm under various fitness metrics.

Optimal hotel room pairing

The greedy hotel pairing algorithm from my previous post gives good solutions. If you make some fairly reasonable-sounding assumptions about the problem, very good solutions. At any rate nicer than forcing a human to drudge through it; both in terms of person-hours spent and quality of the solutions.

However, it turns out I was wrong about them being optimal solutions. There's no guarantee they would be, and in fact a good chance that they won't.

Finding optimal solutions

The good news is that there is a way to find the optimal solution. This problem can be modeled as a graph matching problem; what we're trying to find is a weighted maximal matching.

This is really easy if the graph is bipartite, but in our case we're actually dealing with two complete graphs: we try to pair people with similar gender. Within that compatible group, anyone can theoretically pair with anyone.

People interested in additional reading might be interested in:

Picking an algorithm

There are several feasible algorithms for this.

The current state of the art in computerized algorithms appears to be Blossom V by Vladimir Kolmogorov. This algorithm finds perfect matchings (i.e. all nodes are included), which may be impossible for us, e.g. because of an odd number of pairs.

Issues

One issue with this approach is that it becomes pretty much impossible to adapt this to rooms for more than two people. In order to support that case my previous greedy algorithm remains the best thing I've found.

I was unable to find an implementation of this in Java or Clojure. Fortunately, there is a Python library called NetworkX that does have an implementation.

I have not yet turned this into a fully functional program, because:

  • it's too late to apply it for this year
  • there's a good chance we won't be doing this again for next year

Credits

I'd like to thank Tor Myklebust (tmyklebu on Freenode) for pointing out I was wrong, and shoving me in the direction of the answers.

In this blog post I used several freely available graph illustrations from Wikipedia. The complete graph illustrations were made by koko90.

Optimization problems and PyCon financial aid

This year, I have partly taken on some new responsibilities in organizing PyCon. Specifically, I've done some work on financial aid. (Next year, I will be taking on the position of financial aid chair.) There have been some challenges that I've thrown some code at. Some of it stuck.

I'm very much interested in your comments, thoughts, alternative approaches, et cetera. Hit me up on Twitter.

Hotel room allocation

The first issue I solved was one of hotel room allocation. We provide the financial aid applicants with hotel rooms. To keep costs down, we pair people up, two by two.

We want to pair people so that we have to book a minimal number of days. If there's days on the start of the room booking where only one person is booking the room, we have to pay for the rest.

A greedy algorithm

I start out with generating every possible pairing. Assuming we have 300 financial aid applicants, that's:

$${300 \choose 2} = \frac{300!}{2! \cdot 298!} = 44850$$

That's more than we can check by hand, but still pretty easy for a computer. Besides, there are actually fewer pairs than that: some people request to be paired together explicitly, and we only pair people of the same gender.

Then, I sort them by number of unmatched days, that is, how many days would have a single person in the hotel room if we paired them together. Finally, I just greedily start grabbing pairs, keeping track of the people that have been allocated rooms as I go along. As soon as everyone is accounted for, we stop.

Other than perhaps minding the edge conditions where you have an odd number of people to pair, this isn't too tricky. The implementation for this is on Github, in lvh/pairing.

Solution, meet reality

Before I wrote this, humans did this by hand. That took pretty long, and the solutions were decent, but suboptimal. All the humans that have tackled this problem seem to take the same approach: sort by check-in date, find exact or near matches in check-out dates, rinse, repeat.

The algorithm above did quite well, coming up with a significantly better result than the previous human allocation. It also produced very interesting solutions. Turns out that by taking a big hit on one of the pairs (say, a pair that's 4 days mismatched four days), you can limit the damage on a large number of other pairs, and still come out on top. After reviewing with the person who previously had to do this manually, we quickly agreed that a human would most likely not have come up with this.

Furthermore, a lot of the pairs have matching check-out dates but with a mismatched check-in date; whereas humans typically only look for solutions in the other direction. This is easy to explain in hindsight; the algorithm has no concept of an ordering of events in time. It just tries to minimize the number of mismatched days.

I'm quite happy with the result. Any dollar I'm not devoting to a room that isn't being fully used will instead be going into an increased grant.

Unfortunately, finding this solution isn't the end of the story. It rarely survives the clash with reality. Many complex factors affect it: people will drastically change their room dates, people's visas get denied, et cetera. All sorts of unforseeable circumstances have lead to changes to the answer the algorithm originally produced. I don't expect that to change until PyCon is over; unfortunately, I also don't think that's a problem I can fix in a few lines of Clojure.

Grant allocation

The second issue is a much thornier one. Given the financial aid budget, figure out how to optimally distribute it. The budget is quite limited, a good deal smaller than the sum of what everyone requests.

There's a few reasons why this problem is complex. First of all, "optimal" is highly subjective. As I'm sure you'll notice very soon after thinking about it, it's very easy to verge off from math and straight into the realm of politics.

A very simple greedy algorithm

You might say that you want to get as many people as possible to come to PyCon, and you just greedily allocate the smallest grant requests first. This has a few disadvantages:

  1. It biases against the people that ostensibly need the money the most.
  2. It is ordering-sensitive: equivalent applications that happen to come later in the queue are disadvantaged.
  3. It oversimplifies how people react to grants, by making grant allocation binary. Someone receiving 90% of their requested grant amount will likely still be able to attend. Suppose you have 11 people, all requesting 100 USD, and you have 1000 USD to spend. The greedy algorithm would give the first ten of them 100 USD and find its purse empty for the eleventh. It would almost certainly be better to give them all 90.91 USD instead.
  4. Game-theoretically, it makes a lot of sense for everyone to apply for a small grant that they will almost certainly get. (This is not a real issue for us, because we still have humans eyeballing all the applications.)
  5. The only difference between applications is the amount they're requesting, but there are people that we would like to bias in favor of. For example, we'd like to try very hard to get speakers or tutorial presenters to the conference, we might want to reward people doing open source work, et cetera.

Selecting these critera is definitely squarely in the realm of politics, and I'd like to stick to the realm of math; so let's just assume that we can assign scores to applicants and that those scores are fair and just. If you feel like we shouldn't bias in favor of anyone at all, just assume that whenever I say "the score of an applicant" I mean "the number 1".

A slightly smarter greedy algorithm

One of the key ideas (which a lot of people seem to have when tackling this problem) is to translate an application's score into an amount of the budget. So, for example, let's say that the sum of all scores is 100, and your score is 10, that means you can have up to 10% of the budget allocated to you.

So, I make a pass over all the remaining applicants. If you're asking less than your budget slice, you get your requested grant. If you're asking for more, I save you for a future pass. Even though the budget goes down between passes, budget slices will go up, because the total score drops.

You can't do this in one pass, increasing the budget as you go along if people request less than their slice, because you might introduce order dependence. For example, consider what happens when Alice and Charlie both need more than the budget allows for (and they have the same score and requested amount), and Bob needs less. If you process them in the order Alice - Bob - Charlie, Charlie may get his requested grant and Alice may not, just because Bob's in the middle making the budget bigger for Charlie.

Once everyone left is requesting more than their slice, they just get their slice instead of their requested amount.

This solution is adequate, but I'm not completely satisfied; everyone besides the last pass will get full grant amounts. You can find my implementation of this algorithm on Github as well.

A better model

I think we can do better by applying some elementary probability theory. What we really want is to maximize the total score of the applicants that we expect to see at PyCon as a consequence of financial aid. Probability theory has a thing called the expected value of a random variable, which is pretty much just the integral of the random variable with respect to the probability. In our case, that just means that we want to maximize the sum of the probabiliies that a given applicant will come to PyCon given a particular grant amount, weighted by their score:

$$\max \sum s_i \cdot p_i$$

That just restates problems we can't solve in terms of problems we can't solve. What on earth could the probability that someone shows up be? We can't know the real value, since it's specific to each individual, and dependent on a lot of random, unknowable events. We can, however, make an educated guess. If we don't give them any money, they probably won't come. If we give them the amount they're asking for, they probably will.

We could conjecture that the probability someone will come to the conference is approximately the fraction of the amount they requested that they actually received:

$$p_i = \frac{g_i}{r_i}$$

For example, if someone gets 90% of their requested grant, we estimate that the odds they will attend are also 90%; if we give them only 10%, we estimate it at just 10%.

Alternatively, we could conjecture that it's closer to the square of that ratio; when giving someone a value that's very close to what they asked for, their probability of attending is still going to be very high; but if we only give them half, the odds they will attend will be closer to 25%, much lower than 50%. So, the expression becomes:

$$p^{\prime}_i = \left(\frac{g_i}{r_i}\right)^{2}$$

Update: I now realize that this is probably not the expression that I actually want: it falls off very quickly as soon as an applicant gets anything less than the full amount. I've elaborated on this in a new blog post.

We will see how the square model emphasizes focusing the available funds on fewer grants, whereas the linear model will spread smaller grants more liberally.

An attempt with constraint solvers

(I would like to thank Mark Engelberg for helping me with the stuff in this chapter. It may not have panned out, but it was a very educational experience nonetheless.)

It'd be nice if we could just declaratively describe the problem, throw it into a computer program, and have it magically tell us the answer. Of course, there are programs that do exactly that, called solvers.

When I reached out on the Clojure mailing list, Mark Engelberg helpfully reached out and immediately handed out some cool runnable example for me to play with. Sure enough, they worked super, and with a carefully crafted three-person example we could clearly see the difference between the linear and quadratic models I was talking about above: under the square model, the concentrated grants, trying harder to give high-scoring applications the amount they requested. The linear model spread money out more evenly. That is, with the following applications:

(def alice {:name "Alice", :score 5, :requested 120})
(def bob {:name "Bob", :score 4, :requested 100})
(def carol {:name "Carol", :score 3, :requested 80})

In a linear model, it comes up with:

{alice 120
 bob 80
 carol 0}

However, in the quadratic model, the optimizer rather gives Carol (who has a lower score) a complete grant than give Bob a partial one:

{alice 120
 bob 0
 carol 80}

You can find the code on Github. The tests that confirm the difference in behavior for the linear and quadratic models are in the tests.

There's one issue though; it didn't scale. Not bad enough that you'd notice on the three-person example (I thought my REPL was just being laggy), but bad enough that any real-world problem would be completely unsolvable. Why? Clearly constraint solvers are awesome real world tools with real world usage, it's not like they're supposed to fall over as soon as you throw a problem of reasonable size at it.

Well, let's do some napkin math. If we have 200 financial aid recipients that are all getting 1000 possible options (somewhere between 0 USD and 1000 USD, in whole-dollar increments), there's 200,000 possible places to put any given dollar. If we have 100,000 USD to spend, that's:

$${2 \cdot 10^{5} \choose 10^{5}} = \frac{(2 \cdot 10^{5})!}{10^{5}! \cdot 10^{5}!} \approx 10^{60203}$$

That is a very big number. I needed logarithms to compute it. It is so huge that it makes the number of atoms in the observable universe (about \(10^{80}\)) look like rounding error.

In an attempt to "fix" that problem, I tried only handing out funds in chunks of 100 USD each. Then, we have 1000 chunks to hand out and 2000 places to put them:

$${2000 \choose 1000} = \frac{2000!}{1000! \cdot 1000!} \approx 10^{600}$$

I have made the problem 59,603 decimal orders of magnitude easier! Rejoice!

A constraint solver really doesn't "know" anything about the structure of the problem you're feeding it. That's a trade-off: in return, it grants you a lot of freedom in expressing your problem. Constraint solvers like LoCo are very valuable, they just aren't a good fit for this problem. They are designed for problems where the constraints really constraining the solution space. We're clearly not in that territory. We have many valid solutions, and we're struggling to look for a good one.

There are two ways we can fix this:

  1. Compromising on the freedom of expressing the problem. By pouring our problem into a specific structured shape, we may be able to use a much faster solver, specific to that class of problems. Also, by allowing arbitrary real numbers instead of just integers, we can use much faster solvers.
  2. Compromising on finding the optimal solution, instead searching for a decent solution. You can do that with algorithms like tabu search or simulated annealing.

I will be elaborating on these problems in a future blog post.

External drive lossage on Mavericks

Welp, my WD 1TB Passport drive borked. Seems to be a common issue on Mavericks with a select number of drives, including this one, apparently...

Common themes:

  1. Started with OS X complaining at me because I supposedly ejected the drive unsafely, even though the drive was still in the USB slot.
  2. Drive isn't mounted automatically anymore. Console says filesystem is busted: 17/12/13 20:18:35,000 kernel[0]: hfs: Runtime corruption detected on My Passport, fsck will be forced on next mount.
  3. Drive can't be /unmounted/ using Disk Utility either both in regular desktop mode and in recovery mode (booting with ⌘+R). This means that even trying to just nuke the partition and starting over doesn't work; it just sits there for a while and then eventually times out saying it can't unmount the drive.

I did not, at any point, have WD's drive management software installed (unless it managed to do it behind my back). First thing I did was put a new GUID table on it with one partition.

Apparently the solution is to boot into Linux to fix the drive. I'll let you know how that pans out.

UPDATE: Booted into Linux, was able to read everything. fsck does report something wrong with the filesystem, but not bad enough that I can't mount or umount it. Was able to salvage what looks to be all the data I wanted; didn't bother to try and get e.g. Time Machine or Spotlight index data off, which is where I suspect the issues to stem from.

Using protractor with AngularJS and Yeoman

In writing the website for Crypto 101, my upcoming book, I've been toying around with AngularJS. The basic setup was done using Yeoman.

First of all, the Yeoman Angular generator gives you a karma-e2e.conf.js, but doesn't really configure grunt to do any end-to-end testing. Turns out that the new hotness for Angular scenario testing is based on Protractor. Karma isn't deprecated, it's just for unit tests. Protractor is based on Selenium, which is probably an upgrade :-)

Anyway, here's what I had to do to get started.

  1. In the root of your app, install protractor and the Selenium standalone server.
npm install --save-dev protractor
./node_modules/protractor/bin/install_selenium_standalone
  1. Optionally, add selenium to your .gitignore, since it contains a ~30MB binary that doesn't belong in your git repo.

  2. Add a protractor.conf.js file. Here's mine. Remember that there's some stuff in there for you to uncomment, so don't just copy paste. On my machine, the default driver was Internet Explorer for Windows, which is strange given that I'm running on OS X. Oh well. This uses Chrome instead.

exports.config = {
  seleniumAddress: 'http://localhost:4444/wd/hub',
  specs: [
    // To run plain JS files, uncomment the following line:
    // './test/integration/*.js',
    // To run Coffeescript files, uncomment the following line:
    // '.tmp/integration/*.js'
  ],

  // Set capabilities.
  capabilities: {
    'browserName': 'chrome'
  },

  jasmineNodeOpts: {
    showColors: true,
    defaultTimeoutInterval: 30000
  }
}
  1. If you're using Coffeescript, change your Gruntfile to build your integration tests:
  grunt.initConfig({
    // ...
    coffee: {
      // ...
      test: {
        files: [
        // ...
        {
          expand: true,
          cwd: 'test/integration',
          src: '{,*/}*.coffee',
          dest: '.tmp/integration',
          ext: '.js'
        }]
      }
    },
  // ...
})
  1. Write some tests in test/integration.

  2. You can now manually run your tests. First run Selenium:

./node_modules/protractor/bin/install_selenium_standalone

Then run your tests:

./node_modules/protractor/bin/protractor protractor.conf.js

Once I figure out the proper way to do this in Grunt, I'll let you know.

Thoughts on RDRAND in Linux

This is a response to Linus' response to the petition to remove RDRAND from /dev/random. RDRAND is a CPU instruction introduced by Intel on recent CPUs. It (supposedly) uses a hardware entropy source, and runs it through AES in CBC-MAC mode, to produce random numbers. Out of fear that RDRAND may somehow be backdoored, someone petitioned to remove RDRAND support to "improve the overall security of the kernel". If RDRAND contains a back door, and an unknown attacker can control the output, that could break pretty much all userland crypto.

Linus fulminated, as he does. He suggested we go read drivers/char/random.c. I quote (expletives and insults omitted):

we use rdrand as one of many inputs into the random pool, and we use it as a way to improve that random pool. So even if rdrand were to be back-doored by the NSA, our use of rdrand actually improves the quality of the random numbers you get from /dev/random.

I went ahead and read random.c. You can read it for yourself in Linus' tree. The function I'm interested in is extract_buf:

    /*
     * If we have a architectural hardware random number
     * generator, mix that in, too.
     */
    for (i = 0; i < LONGS(EXTRACT_SIZE); i++) {
        unsigned long v;
        if (!arch_get_random_long(&v))
            break;
        hash.l[i] ^= v;
    }

This is in the extraction phase. This is after the hash is being mixed back in to the pool (and that's for backtracking attacks: not intended as an input to the pool). The output of arch_get_random_long is being XORed in with the extracted output, not with the pool.

If I were to put on my tin-foil hat, I would suggest that the difficulty has now been moved from being able to subvert the pool as one of its entropy sources (which we believe is impossible), versus being able to see what you're about to be XORed with. The latter seems a lot closer to the realm of stuff a microcode instruction can do.

To put it into Python:

from inspect import currentframe
from random import getrandbits

def extract_buf():
    """Gets 16 bytes from the pool, and mixes them with RDRAND output.

    """
    pool_bits = extract_from_pool()
    rdrand_bits = rdrand()
    return  pool_bits ^ rdrand_bits

def extract_from_pool():
    """Pretend to get some good, unpredictable bytes from the pool.

    Actually gets a long with some non-cryptographically secure random
    bits from random.getrandbits, which is usually a Mersenne Twister.

    """
    return getrandbits(32)

def rdrand():
    """
    A malicious hardware instruction.
    """
    pool_bits = currentframe().f_back.f_locals["pool_bits"]
    return pool_bits ^ 0xabad1dea

if __name__ == "__main__":
    assert extract_buf() == 0xabad1dea

Why can't RDRAND work like this?

Some comments based on feedback I've gotten so far:

  1. This attack does not need to know where the PRNG state lives in memory. First of all, this isn't an attack on the PRNG state, it's on the PRNG output. Secondly, the instruction only needs to peek ahead at what is about to happen (specifically, what's about to be XORed with) the RDRAND output. That doesn't require knowing where the PRNG state (or its output) is being stored in memory; we're already talking register level at that point.

  2. While it's certainly true that if you can't trust the CPU, you can't trust anything, that doesn't really make this problem go away. RDRAND being broken wouldn't make software crash, which is a lot harder for almost all other instructions. RDRAND being broken wouldn't result in measurable side-effects, unlike what would happen if PCLMULDQ contained a back door. Furthermore, it's a lot easier to backdoor one single microcode instruction and a lot more plausible and feasible for a CSPRNG to be backdoored than it is to think of a CPU as some kind of intelligent being that's actively malicious or being remotely controlled.

For what it's worth, it seems Zooko agrees with me.

Securing against timing attacks with Twisted

What are timing attacks?

Timing attacks are side-channel attacks that rely on inferring secret information from operations by measuring how long they take to execute.

A complete explanation is outside of the scope of this article, but the Wikipedia article might be a good starting point for the interested reader.

Why should I care about them?

Because they can creep in before you know it, and break your otherwise fine system.

A common way they're introduced are string comparisons. String comparisons in many langauge implementations, including all implementations of Python I know of, short-circuit. They work roughly like this:

def strcmp(s1, s2):
    if len(s1) != len(s2):
        return False
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            return False
    return True

This means that as soon as they can prove the two strings are not equal, they return False and ignore the rest of the string. This is a very simple and effective performance optimization. And that's precisely why, from a security point of view, it's a liability.

Since it takes longer to compare "The quick brown fox jumps over the lazy dog" to "The quick brown fox jumps over the lazy god" than it does to compare it to "Lorem ipsum", or even a Lipsum of the same length, an attacker can use that timing information to figure out what the original string is that is being compared to, even when he shouldn't.

Why did you care?

Password resets.

The typically recommended way of doing them is to generate a random number, one that's sufficiently long that an attacker can't guess it. Then, you relay that number to the user using some alternative channel (usually a URL in an e-mail).

Bar the random number, those URLs are predictable. That means an attacker can trivially try any number he wants. If an attacker is allowed to do that enough, he could measure timing differences between different numbers (introduced by string comparison functions that short-circuit or even by the database's index) to efficiently deduce the value of a "good" number.

Also, if membership is private, an attacker may exploit timing differences in the password reset request function to farm e-mail addresses.

How do I protect against timing attacks?

Just to be clear: timing attacks are a complicated problem, and this article describes just one strategy I've applied to help secure against them.

The easiest way to prevent a timing attack is to make sure that the timing your hypothetical attacker can measure is unrelated to the work you have to do.

Since I was using Twisted and AMP, this was actually quite easy. I wrote a decorator for AMP responder functions that does exactly that:

def _immediateResponder(f):
    """
    A decorator for responder functions that should return immediately and
    execute asynchronously, as a defense against timing attacks.

    The responder decorator should be applied after (above) this decorator::

        @SomeCommand.responder
        @_immediateResponder
        def responder(...):
            ....

    This should be timing attack resistant since it is unconditional: the
    the AMP response is returned immediately, and the real responder is
    scheduled to run at the next chance the reactor has to do so.

    This only works with AMP commands with empty responses. That's probably a
    good idea anyway: almost all information you could add to the response
    is liable to introduce a timing attack vulnerability.

    Since this precludes your ability to communicate success or failure to
    the caller, the decorated function should return quite quickly (or, if it
    can't, that should be clearly documented). Otherwise, you may end up in a
    a race condition, where the caller assumes the operation has completed,
    but it is in progress or hasn't started yet.

    The original responder function is available on the decorated function as
    the ``responderFunction`` attribute.
    """
    @functools.wraps(f)
    def wrapped(self, *args, **kwargs):
        reactor.callLater(0, f, self, *args, **kwargs)
        return {}

    wrapped.responderFunction = f
    return wrapped

When I get an incoming RPC call, the function doing the actual work is scheduled to run at the next reactor iteration. Then, an empty response is returned. All of this happens in amortized constant time, and independent of any secrets. As a result, it can't really leak much about them.

An abstraction too high

The above was an effective response to a proof-of-concept timing attack exploit. Unfortunately, that doesn't mean you've fixed every timing attack.

In particular, this example is a few layers of abstraction removed from the grit of real-world I/O. Just because I returned {} (an empty response) immediately, doesn't mean the underlying IO happens immediately.

In particular, the write output latency could be coerced to depend on the computation time, because the function passed to it could be executed before the write. If that happens, and the time it takes is dependant on some secret, delaying the write, the latency on the attacker's side could be used to measure the work done.

I have not yet been able to turn the above into a working exploit.

There are a number of ways this could be mitigated. Since there's no working exploit, it's unclear if this mitigation would render a timing attack infeasible.

One way I've considered to mitigate this is to limit the time that the reactor is blocked. In my concrete example, this was fortunately already the case, since one of the first things it did was defer to a thread that released the GIL (to compute the key from the password using scrypt). Alternatively, if you're doing this for Python code, you could write your function cooperatively.

Designing a REST API: logging in

Hey. As many of you probably know, I'm building a business, and that business involves a REST API being consumed by browsers (for now).

The problem I'm hitting is registration and logging in. It just doesn't seem to be a problem that fits REST very well -- maybe I'm thinking too much about services and not enough about resources, so I'm writing this blog posts to get my thoughts in order and hopefully get some useful feedback.

This post is about user registration and login, and how they're hard to fit into REST.

Context: users are uniquely identified by a user id, which is a random string of sufficient size. These ids are immutable and uniquely refer to this user from registration until forever. User emails are also unique, but they can change, so they are not a good way to refer to a user. Users register and log in using email and password.

Option 1: separate authentication endpoint

The idea here is that there are two things:
  • the REST API, which authenticated things talk to
  • the authentication endpoint, where unauthenticated things go to become authenticated
The downside is that there are two things. One of them speaks REST, and the other speaks gnarly ad-hoc RPC or maybe JSON-RPC or something.

The upside is that the REST API only needs to understand one kind of token.

Option 2: everything in REST

Special cases aren't special enough to break the rules.
-- Zen of Python

The advantage is that there is one canonical source for everything. Everything including access grants (which should be nice for OAuth later). You can use the REST API to perform any action, see anything in the system...

The downsides:
  • the REST API has to understand more than one kind of authentication. This also might make it a bit uglier to interoperate with OAuth later.
  • Logging in takes too many round trips. First I have to query users/?email={EMAIL} to go from e-mail address to user id. Then, I have to POST to users/accessGrants -- which is if I'm allowed to guess the URL, which I really shouldn't, I should GET users/ first (HATEOAS and all that -- but I'm willing to ignore that for now). All of this happens over HTTPS with HTTP Basic auth. TLS handshakes are slow, and verifying securely stored passwords is even slower.

Option 2b: everything in REST, with shortcuts
Although practicality beats purity.
-- Zen of Python 
... but, unfortunately, also:
There should be one-- and preferably only one --obvious way to do it.
-- Zen of Python 

The upsides are the same as for the REST API, except we get rid of the downside that logging in takes too many round trips, because there's a shortcut for that.

That shortcut would be something that speaks some kind of ad-hoccy RPC, or possibly JSON-RPC. One of the exported methods/procedures would be "getGrant", which takes a username and password and returns the key. These endpoints talk to the same database behind the scenes. Killer feature: one round trip.

Downside is that the API should probably still be able to take user credentials, since this is is a shortcut....

Yes, I know OAuth exists. For now, we have no interest in opening this API up yet. We'll do that as soon as we have a platform worth caring about, an API that doesn't change every three seconds, and a security expert that's reviewed it.