One aspect of computer science I really like, but that I don’t get to talk about often, is that it makes the map-territory distinction clear. What we represent in a computer program (the map) is necessarily a simplification of the real thing we are trying to represent (the territory). This means we must pick our representations carefully, so that everything we want to capture can be expressed with that representation. The opposite problem – of having too expressive a representation – is only a problem if efficiency matters.
I think the general idea is something that even beginner students realize relatively quickly. In my previous post about diversity, I asked students to verify that a chapter and verse number from the Quran is valid, which means that their program (the map) must represent the chapter and the verse (the territory). This example is somewhat trivial, and even my current students will dismiss it as such. What I don’t think they realize, however, is that the principle runs deeper than that, and their choice of representation may have consequences that they did not foresee.
I decided to try to get this message across in the lab last week. We had just covered classes and exception handling, so the first part of the lab asked students to write a program to survey visitors on campus, with functions that verify the input (name, age, gender, phone number, etc.). I helpfully provided short vignettes like this one:
- Rudolph Feierabend has lived in Eagle Rock/north-east LA since 1961, when his parents immigrated to the US, bringing then-seventeen Rolf in tow. He has been retired for a couple years now, and takes his husky lab Winston on walks through campus every so often. Although he has an email account (email@example.com), Rolf says he doesn’t check it anymore and that you should just call him at home at (323) 827-6316.
That was about the level of instruction I gave to students, leaving it up to them as to what they should check for. The object-oriented programming was straightforward, but students had to think through how they would verify the various fields. Some students made sure that names did not have punctuation, some checked that a phone number had 10 or 11 digits (depending on whether the user typed “+1”), and some asked for the city and state of origin separately. It was a fairly standard programming exercise that took 1.5-2 hours, with just enough creative freedom to not be boring.
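To make the kind of check concrete, here is a minimal sketch of what a typical submission might have looked like. It is my own illustration and not any student's actual code; I am assuming Python, and the function names `check_name` and `check_phone` are hypothetical.

```python
import re


def check_name(name: str) -> str:
    """Accept a first and last name made of unaccented letters only."""
    # Assumption for illustration: exactly two words, no punctuation.
    if not re.fullmatch(r"[A-Za-z]+ [A-Za-z]+", name):
        raise ValueError(f"invalid name: {name!r}")
    return name


def check_phone(phone: str) -> str:
    """Accept 10 digits, or 11 digits when the user typed a leading +1."""
    digits = re.sub(r"\D", "", phone)  # drop spaces, dashes, parentheses, "+"
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # strip the country code
    if len(digits) != 10:
        raise ValueError(f"invalid phone number: {phone!r}")
    return digits


print(check_name("Rudolph Feierabend"))  # Rudolph Feierabend
print(check_phone("(323) 827-6316"))     # 3238276316
```

Raising `ValueError` rather than returning `False` lets the survey loop wrap each prompt in a `try`/`except` and ask again, which fits the exception handling we had just covered.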
Once students showed me that their code worked (with me playing devil’s advocate on what they might want to rule out), the second part of the lab sprang the trap. I gave students two new vignettes:
- Chun Ying Tsang (20) is an exchange student from Birmingham in England. He is here for the summer program, and only just moved into his dorm yesterday. Although Chun Ying has an email address (Terrif.Ying@yahoo.com), he only has his UK cell phone +44 075 9921 9264.
- Ash Reid-Chapman (33) has been invited to speak at an event co-hosted by CODE and Project SAFE. Drawing on personal experience growing up in Portage, MI, Ash will be talking about the discrimination faced by the transgender community, and how allies can support those who are transitioning. Due to previous harassment, Ash refused to give out a phone number or an email, instead directing you to the National Center for Transgender Equality website.
The twist, of course, is that people are more complicated and diverse than students thought. My original vignettes deliberately depicted a homogeneous sample of visitors: single-word first and last names, easily identifiable as male or female, all with US phone numbers and places of origin. I had to be a little evasive in how I answered student questions – asking if they themselves would be happy with their checks if they needed the data – but everyone made at least one assumption that didn’t hold. The verification for gender caught the fewest students (although one or two pairs did enforce a binary gender), while the verification for phone number caught most groups (some groups allowed anything with seven or more digits). The hyphenated last name caught some groups that only allowed letters (I should have had a name with an apostrophe as well), and I don’t think a single group allowed the user to not provide an answer.
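For illustration, here is one way the checks could be loosened to admit the new visitors. This is a sketch under my own assumptions, not the lab's reference solution: the function names are hypothetical, a blank answer is treated as "declined to answer" and stored as `None`, and the 15-digit ceiling on phone numbers comes from the E.164 numbering standard.

```python
import re
from typing import Optional

# Letters in any script, joined by spaces, hyphens, or apostrophes.
NAME_RE = re.compile(r"[^\W\d_]+(?:[ '’\-][^\W\d_]+)*")


def check_name(name: str) -> str:
    """Allow multi-word, hyphenated, and apostrophe-containing names."""
    name = name.strip()
    if not NAME_RE.fullmatch(name):
        raise ValueError(f"invalid name: {name!r}")
    return name


def check_phone(phone: Optional[str]) -> Optional[str]:
    """Allow international numbers, and allow no answer at all."""
    if phone is None or not phone.strip():
        return None  # the visitor declined to give a number
    digits = re.sub(r"\D", "", phone)
    if not 7 <= len(digits) <= 15:
        raise ValueError(f"invalid phone number: {phone!r}")
    return phone.strip()


def check_gender(gender: Optional[str]) -> Optional[str]:
    """Record whatever the visitor says, including nothing."""
    if gender is None or not gender.strip():
        return None
    return gender.strip()


print(check_name("Ash Reid-Chapman"))    # Ash Reid-Chapman
print(check_phone("+44 075 9921 9264"))  # +44 075 9921 9264
print(check_phone(""))                   # None
```

Even this version still imposes assumptions: it rejects names containing digits or periods, and it would happily accept 7 digits that do not form a real number anywhere. Loosening a check trades one kind of error for another.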
The lab actually goes one step further, by asking students if they can think of other kinds of information that may be distorted. Many students picked up on the issues of race (i.e., what if you are bi-/multi-racial?); other students found more interesting examples, such as how Facebook limits the types of “reactions” you can have, or how filters for swearing could accidentally ban real people/location names (a.k.a. the Scunthorpe problem). The last question in the lab asks generally what students learned about computer science. From reading the students’ answers, I’m not convinced that even half of them realize how deep this map/territory disconnect could go, but I do think that at least some students understood the difficulty – or even the impossibility – of accurately representing the range of diversity that exists, or of not imposing any assumptions on the data you might get. And this is not even touching on the tradeoff between data validity and data representativeness.
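The Scunthorpe problem mentioned above is easy to reproduce. The sketch below is my own illustration, with a deliberately mild, hypothetical one-word ban list; it contrasts a substring filter with one that only matches whole words.

```python
import re

BANNED = {"hell"}  # hypothetical ban list, kept mild for illustration


def substring_filter(text: str) -> bool:
    """Flag text if a banned word appears anywhere, even inside a name."""
    lowered = text.lower()
    return any(word in lowered for word in BANNED)


def whole_word_filter(text: str) -> bool:
    """Flag text only if a banned word appears as a complete word."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return any(word in BANNED for word in words)


print(substring_filter("Michelle"))        # True: a real name is rejected
print(whole_word_filter("Michelle"))       # False
print(whole_word_filter("What the hell"))  # True
```

Matching whole words fixes this particular case, but it still rejects any person or place whose name happens to be on the list, so the representation problem moves rather than disappears.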
I tweeted the night before that this may be my best lab yet, and I stand by that assessment now. This is an exercise that I’ve wanted to do since the beginning of the semester, and I’m really happy that it turned out well. To me, this is exactly the right mix of technical know-how and broader impact that a computer science course (introductory or not) should have, especially in a liberal arts setting.
The instructions and code for this lab can be found on GitHub.