charlie's blog

Thursday, May 22, 2014

identity is hard

Equivalence

Take a look at the following pairs and decide whether the two things are equivalent (the same):

"cat" vs. "cat"
"dog" vs. "Dog"
4 vs. 2+2
"color" vs. "colour"
"wanna" vs. "want to"
1+2 vs. 2+1

Your specific answers probably differ from mine, but I bet you said "the same" for some, "different" for some, and maybe "it depends" for some.

For instance, I'm sure we agree that "cat" and "cat" are the same. We would probably say that "dog" and "Dog" are the same thing too, at least most of the time. What about "color" and "colour"? I'm betting that most English speakers would say they're essentially the same thing, just two ways of spelling the same word. An etymologist might disagree and say that they differ in some interesting technical sense. Someone with strong American or British pride might give you an earful.

Likewise, the difference between 4 and 2+2 is arguable. In some senses they are the same thing, since they can both be seen to represent the quantity 4. However, if you wanted to tell someone what time your kids come home from school, you probably wouldn't say they come home at "2 plus 2 o'clock", so they obviously aren't totally interchangeable.

The crux of the matter is that equivalence is context dependent. Whether 4 is the same as 2+2 depends on whether you're in a mathematical context or a social context. The equivalence of "dog" and "Dog" might depend on whether you're typing the words into a search engine, where you'll probably get the same results for both, or using them to start a sentence, where one is correct and the other is wrong.

So why do I care about equivalence being fuzzy? Mainly because computers aren't very good at "fuzzy".

Fuzzy problems

One of the most common things computers do is compare things. For instance, a tax program might compare your income against some threshold value to determine whether you owe more or less money. Or it might check to see if your age is greater than 65, to help determine eligibility for retirement benefits. Comparisons that involve numbers are generally pretty easy, so computers do reasonably well at this. But what happens when we try to compare something like a name?

Imagine you have a customer account with www.bunnyslippers.com. When you registered the account last year, you provided your name (Fred Flanders) to create a customer account. Each time you log in, the server looks through the list of known customers and decides whether any of those names matches "Fred Flanders". If it finds one, it can verify the associated password and allow you to proceed.

Now what happens if you try to log in and accidentally type "FreD Flanders" instead of "Fred Flanders"? If a human were handling this sort of request, they would probably not even notice that you accidentally capitalized the D in "FreD". On the other hand, a computer might or might not see the two as the same, depending on how careful the programmer was being. By default the computer actually sees the letters as numbers; in ASCII, 'd' is 100 and 'D' is 68. So when you ask a computer whether "Fred" and "FreD" are the same, it sees two different sets of numbers, and it says they aren't the same.
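Here's a small Python sketch of what the computer sees. The variable names are mine, chosen just for illustration; the point is only that the default comparison works on the underlying numbers.

```python
# Each character is stored as a number (its code point).
code_lower_d = ord("d")
code_upper_d = ord("D")
print(code_lower_d, code_upper_d)  # 100 68

# A string is just a sequence of those numbers.
fred_codes = [ord(c) for c in "Fred"]
typo_codes = [ord(c) for c in "FreD"]
print(fred_codes)  # [70, 114, 101, 100]
print(typo_codes)  # [70, 114, 101, 68]

# The default equality test compares the numbers one by one,
# so a single differently-cased letter makes the strings unequal.
names_match = "Fred" == "FreD"
print(names_match)  # False
```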

So why not just tell the computer that "d" and "D" are the same? Okay, done. Hopefully you don't mind if Microsoft Word now occasionally replaces all the d's in your term paper with D's. They're the same now, so what difference does it make?

Hopefully the problem is becoming clearer now. Sometimes we want our computers to see "d" and "D" as the same thing, and sometimes we don't. The difference is in the context.

Computers aren't fuzzy

The fundamentals of computing are deeply rooted in mathematics. The earliest computers were designed to calculate solutions to complex ballistics problems, and for the most part they remain glorified calculators. The only things they can really do are math and various operations on individual bits. In this sort of basic math, there's not a lot of room for context. The number four is always the same, so there's not much fuzziness to deal with.

As befits their mathematical underpinnings, most programming languages support operations like addition, subtraction, and the like. They also support comparisons, including tests for equality (many languages use "==" for this instead of "=", since the latter is often employed for another purpose). These operations make good sense for working with numbers, but it gets trickier when they get applied to other sorts of data.

For dealing with textual data, computers use something called a "string", which is a sequence of characters (single letters). "Fred" can be treated as a string, composed of the characters "F", "r", "e", and "d". Most languages allow the equality test (==) to be used on strings, and here's where things get tricky: by default this test looks for a very strict numeric equivalence, the kind that says "d" and "D" are not the same. The human programmer may not have intended that, though; the programmer is quite often aiming for some fuzzier form of equivalence.

To deal with this, computer languages often provide various specialized ways to compare strings, which can be used in different contexts. One way is the super-strict "no differences whatsoever", but you can also specify a comparison that ignores case (so that "d" and "D" become the same), or even a comparison that first applies all kinds of interesting linguistic normalizations to smooth out variations. The strict comparison can be used in strict contexts (e.g. verifying passwords), and the looser comparisons can be used for things like finding names in customer databases.
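As a rough sketch, here is what three such comparison modes might look like in Python. The function names are hypothetical (they aren't from any library), and the "normalized" mode shown here handles only one kind of variation (Unicode composition plus case); real linguistic comparison can go much further.

```python
import unicodedata


def equal_strict(a: str, b: str) -> bool:
    """No differences whatsoever: compare code point by code point."""
    return a == b


def equal_ignore_case(a: str, b: str) -> bool:
    """Treat 'd' and 'D' as the same by case-folding both sides first."""
    return a.casefold() == b.casefold()


def equal_normalized(a: str, b: str) -> bool:
    """Also smooth out different Unicode spellings of the same text."""
    def norm(s: str) -> str:
        return unicodedata.normalize("NFC", s).casefold()
    return norm(a) == norm(b)


print(equal_strict("Fred", "FreD"))       # False
print(equal_ignore_case("Fred", "FreD"))  # True

# "é" as one code point vs. "e" followed by a combining accent.
composed = "caf\u00e9"
decomposed = "cafe\u0301"
print(equal_ignore_case(composed, decomposed))  # False
print(equal_normalized(composed, decomposed))   # True
```

A password check would presumably use something like equal_strict, while a customer-name lookup could use one of the looser modes.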

Identity is fuzzy too

So far, all the examples I've used to talk about equivalence have been simple, interchangeable things, like apples. Two totally identical red apples are more or less the same, and you'd probably be equally happy to have one vs. the other. There are lots of things like this in life, but not everything fits that description. If I were to replace your favorite old leather jacket, which smells like your dad and has years of old memories attached to it, with an old leather jacket I found at Goodwill, you likely wouldn't be happy at all. As another example, if you repainted your red Ford Mustang to be chartreuse, you'd still expect everyone to know it was your car, right?

What I'm getting at is the concept of identity. Just like equivalence, identity is something we understand naturally and deal with all the time. Everyone understands that the lime green car you're driving today is the very car you were driving yesterday: it's your car, the particular car that you own. Likewise, your old leather jacket has specific unique value to you; it has a particular identity which distinguishes it from other similar leather jackets. Identity is closely related to continuity over time; the significance of your jacket's identity has a lot to do with it being the very jacket that you had on all those previous occasions in your life.

The word "same" can refer to both equivalence and identity, even when the two concepts are in opposition (that pesky fuzziness again). For instance, if we're talking about identity, I might say that the green car I see today is the same car that you were driving yesterday (i.e. both cars were your particular Mustang). However, visually the cars are not equivalent, so I could just as correctly say that the car is not the same today as it was yesterday.

Computers and identity

Programmers often refer to the "identity" concept above as "reference equality" (as distinguished from "value equality", which is the equivalence concept). For reference equality we usually pick some stable identifier, such as a VIN or an email address, and use that for identity. Reference equality is often less fuzzy than value equality, but there's still room for error; for example, if you use an email address as the identifier, do "fred@domain.com" and "Fred@domain.com" identify the same account? Probably they should.
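To make the two kinds of "same" concrete, here's a minimal Python sketch using the Mustang example. The Car class, the helper functions, and the VIN values are all made up for illustration, and lowercasing the whole email address is an assumption about what a given system wants rather than a universal rule.

```python
from dataclasses import dataclass


@dataclass
class Car:
    vin: str    # stable identifier; the values used below are invented
    color: str


def same_identity(a: Car, b: Car) -> bool:
    """Identity: is this the very same car, whatever it looks like?"""
    return a.vin == b.vin


def same_value(a: Car, b: Car) -> bool:
    """Equivalence: the generated == compares every field."""
    return a == b


def account_key(email: str) -> str:
    """Assumed policy: email identifiers match regardless of case."""
    return email.strip().casefold()


mustang_yesterday = Car(vin="VIN-0001", color="red")
mustang_today = Car(vin="VIN-0001", color="chartreuse")

print(same_identity(mustang_yesterday, mustang_today))  # True
print(same_value(mustang_yesterday, mustang_today))     # False
print(account_key("Fred@domain.com") == account_key("fred@domain.com"))  # True
```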

Sometimes computers and programmers don't pick the right type of equality to match our expectations, or they don't implement it the way you'd expect. A good example is that term paper you wrote last week on your computer: "The Strange World of Ocelots". You probably view that paper as having a particular identity, after slaving over it for hours. If I asked you where the paper is right now, you could presumably tell me what computer and folder it sits in. You might even have created a shortcut to it on your desktop.

So what happens when you rename the file to "Ocelots - Strange but Wonderful"? I'll tell you what happens: the shortcut on your desktop may not work anymore. That's because most computers consider the identity of a file to be solely a matter of the file's name (as well as the names of the folders it lives in), and you just changed the name. Of course, this behavior seems totally wrong to the average person, because in our minds, the paper on ocelots is still perfectly well identifiable as itself.

By the way, this example does actually work in newer versions of Windows, thanks to the Distributed Link Tracking service. However, the need for a specialized service to make this work only emphasizes that this is a tricky problem.
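The rename problem can be sketched in a few lines of Python. This is only an illustration under some assumptions: the "shortcut" here is just a remembered path (not a real Windows shortcut), and the file ID comes from st_ino, which on most file systems stays the same when a file is renamed within a folder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    old_path = os.path.join(folder, "The Strange World of Ocelots.txt")
    new_path = os.path.join(folder, "Ocelots - Strange but Wonderful.txt")

    with open(old_path, "w") as paper:
        paper.write("Ocelots are strange.")

    # A name-based "shortcut": all it remembers is the path.
    shortcut = old_path

    # An ID the file system assigns to the file itself.
    file_id_before = os.stat(old_path).st_ino

    os.rename(old_path, new_path)

    # The name-based shortcut no longer finds anything...
    shortcut_still_works = os.path.exists(shortcut)

    # ...even though the file system still sees the same file.
    file_id_after = os.stat(new_path).st_ino
    same_file = file_id_before == file_id_after

print(shortcut_still_works)
print(same_file)
```

In other words, the system often does have a notion of the file's identity that survives a rename; the shortcut just isn't using it.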

Where am I going with all this?

The point of all this is that equivalence, identity, and sameness are hard; hard to describe, and hard to correctly implement. I don't have any magic solutions to offer, but I believe that thinking more about this topic can help programmers write better software with less effort and pain.

Some specific ideas for programmers to keep in mind (including my future self):

  • When designing new systems, think about what types of equivalence/identity might be involved, and what behavior the user will expect.
  • Be careful with standard/easy ways of comparing things (e.g. operator== and Object.Equals). Does the easy way actually have the semantics you want?
  • Better yet, be explicit about how two things will be compared. For instance, use function overloads which explicitly specify string comparison modes.
  • Make sure you understand the language and use it as intended. For example, C# provides a very simple reference equality for reference types. It also uses value equality for built-in value types, which means you should usually do the same with your own value types (so you don't surprise people).
  • Find or build automated tools to help verify your code's correctness. For instance, you can use FxCop to verify that you're explicitly specifying string comparison modes whenever possible.
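The point about default comparisons isn't specific to C#. As a rough analogue, here's how the "easy way" can surprise you in Python; the class names are hypothetical. A plain class compares by identity unless you say otherwise, while a dataclass generates a field-by-field comparison.

```python
from dataclasses import dataclass


class PlainPoint:
    """No __eq__ defined, so == falls back to identity (same object)."""

    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y


@dataclass(frozen=True)
class ValuePoint:
    """The dataclass decorator generates an __eq__ that compares fields."""

    x: int
    y: int


plain_equal = PlainPoint(1, 2) == PlainPoint(1, 2)
value_equal = ValuePoint(1, 2) == ValuePoint(1, 2)

print(plain_equal)  # False: two distinct objects
print(value_equal)  # True: same field values
```

Neither behavior is wrong; the trouble only starts when the one you got isn't the one you meant.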
