My first semester is done! Last week was finals, and grades were due yesterday. I want to do a more holistic review of my semester, but the student evaluations are not released until Friday, so I’ll wait for next week.

Instead, I want to talk about evaluating students. I re-read a particular Math with Bad Drawings comic – the Church of the Right Answer – and realized there’s something in there that relates to both of my classes.

Let me start with Intro to Cog Sci, which is what I had planned (weeks in advance) to write about. I had written about my experience teaching A* before, and how I was going to apply what I learned to teaching the perceptron algorithm. As I laid out in that post, I made sure that the surface structure of the algorithm was kept constant, that I emphasized the unintuitive parts of the algorithm (the activation threshold tripped up some students), and so on. I even provided them with a web app where they could play around with the algorithm and check their work, in the same vein as what I did for my Topics in AI students. I made students do several trials of perceptron learning in the homework, which went well in general (some students thought that all connection weights would change by the same amount every time).
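For concreteness, here’s a minimal sketch of the classic perceptron update rule as I taught it (the weights, inputs, and threshold values are made up for illustration). It also shows why the misconception above is a misconception: each weight’s change scales with its own input, so connection weights generally change by *different* amounts on a single trial.

```python
def perceptron_output(weights, inputs, threshold):
    """Fire (1) if the weighted sum of inputs reaches the activation threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

def train_step(weights, inputs, target, threshold, lr=0.1):
    """One trial of perceptron learning."""
    output = perceptron_output(weights, inputs, threshold)
    # Each weight moves in proportion to *its own* input, so weights
    # on different connections change by different amounts.
    return [w + lr * (target - output) * x for w, x in zip(weights, inputs)]

weights = [0.2, 0.5]
inputs = [1, 0]  # only the first input is active on this trial
new_weights = train_step(weights, inputs, target=1, threshold=1.0)
# Only the weight on the active connection changes: [0.3, 0.5]
```

A trial like this (active input, target 1, below-threshold sum) nudges only the first weight upward, leaving the second untouched.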

Switching it up for the exam, I instead asked them how a perceptron model might explain tip-of-the-tongue states (as explained on SciShow) – and was promptly surprised that many students didn’t get the connection. I later realized that while students could do the math, they didn’t understand how the math relates to cognitive science; they know what to do with the numbers, but don’t understand what they mean.

Which is where the Math with Bad Drawings post comes in. The author talks about how students can get the right answers (in this case, the mathematical operations) without really understanding what is going on (in this case, why we’re studying perceptrons in a cognitive science class). The most interesting quote, however, is his suggestion that students can *always* do this, no matter what assessment method we use. “As for our tests – no matter how well-intentioned, no matter how clever and fair, there will always be a back-road to the right answer. There will be something to memorize – a procedure, a set of buzzwords, whatever – that will function as a fake ID, a convincing charade of understanding.”

Which is what made me think of my Topics in AI course. 80% of the grade for the course is from the projects, with another 10% coming from participation and peer evaluation. The last 10%, however, is what I called a “grad student chat” – essentially, they come in and have a 30-45 minute conversation with me about what they learned. I stole this from a Northwestern professor of mine, who made students do the same in his compilers course. He called them “code walks”, where they actually talked about the structure and design of their compilers; you can peek at how he does them in his current courses. (This is actually more elaborate than when I did them; notably, other students are now doing part of the evaluation, and are themselves evaluated on how well they evaluate.) Most of my students didn’t code, so our conversation was at a level up, but the idea is roughly the same. I even have them give themselves a grade for the conversation, before I reveal what grade I will actually give them.

There are several downsides to this method of evaluation – notably, it’s somewhat subjective and *extraordinarily* time consuming – but I find that it works well in general. The resulting grades roughly correlate with my perception of the students, and I get to see them think through answers to unfamiliar questions (and nudge them with hints if I have to). I like to think that it’s less stressful than an exam as well, and also provides students more flexibility in terms of scheduling.

But, to go back to the church of the right answer – can students *cheat* through a code walk or this kind of chat/interview/whatever? I suppose due to time constraints it’s possible that they are not asked about a topic on which they are weak, but the same could be said for any assessment, and it’s not really *cheating*. It’s hard for me to imagine what it would mean to cheat through a conversation, given that the questions are really hard to prepare for, and I can always push students for a deeper explanation if I think they’re just faking it.

Which raises the question of *why* it’s so hard to get away with just a surface understanding. You could argue that a PhD defense takes this format for the same reason. When I first contemplated this post, my thoughts immediately jumped to the Turing Test, which also uses conversation as the medium, but for an entirely different reason (because it is a sufficient demonstration of intelligence – at least, that’s how the argument would go). Although, now that I think about it, maybe the underlying reason actually *is* the same: there’s something about the potential breadth and depth of a conversation that makes it amenable as a measure of both topical understanding and intelligence. I still can’t put my finger on what it is about conversations that gives them this property, though.