Step 29: Design an Autograder

The best answer to the question, “What is the most effective method of teaching?” is that it depends on the goal, the student, the content, and the teacher. But the next best answer is, “students teaching other students”. – Wilbert J. McKeachie

One thing I’ve been working on – in the copious free time that I have – is an autograder. I’m mildly surprised that Oxy does not already have one, since we do have existing computer science courses. I’m not writing the autograder for lack of alternatives; there are a number of open-source/free autograders out there, and I also have access to the autograder that Michigan uses. No, the reason I wanted/needed to write my own is that none of the existing autograders do what I want.

My idea is to build an autograder that allows students to use the previous submissions of other students. As per the McKeachie quotation above, I think students can learn a lot from using other students’ work. I did a little bit of that this semester, when I workshopped the papers from a botched assignment, but the potential for an actual programming class is much higher. I am currently planning two things:

  • Students first write test cases, then write programs that must pass each other’s test cases. One of the problems we faced at Michigan was getting students to write test cases (for details, see [1]). Instead, the idea here is students have an earlier deadline to submit test cases, then when they submit their code, it’s run against the entire test case collection (which they will have access to). They get points not only for writing correct code, but also for submitting test cases that other students fail. I’m still a little fuzzy on the exact point system, but I am definitely deducting points for incorrect tests.
  • Students first write a library of functions, then write programs that use someone else’s library. This is something I’ve never heard of being done anywhere, but it seems like a good way for students to learn to write good code (and to experience bad code). Aside from the obvious lesson about coding style and commenting, there is also a deeper lesson about having good representations. I suspect students will be completely surprised that other people represented things entirely differently, and furthermore, that the API can completely change how easy or hard it is to do something. I am also unsure of the exact grading scheme here, but again the ease of use of the library should be part of the grade.
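To make the first idea concrete, here is a minimal sketch of how such test scoring might work. All the names and the penalty value are my placeholders, not settled Demograder design: a test earns points in proportion to how many classmates’ submissions it catches, and loses points if the test itself is incorrect (i.e., a known-correct reference solution fails it).

```python
# Placeholder sketch, not actual Demograder code: score each submitted test
# by the fraction of the class it catches; deduct points for incorrect tests.

def score_tests(results, reference, penalty=2.0):
    """results[test][student] -> did that student's code pass the test?
    reference[test] -> does the known-correct solution pass the test?"""
    scores = {}
    for test, by_student in results.items():
        if not reference[test]:
            scores[test] = -penalty  # incorrect test: deduct points
        else:
            failed = sum(1 for passed in by_student.values() if not passed)
            scores[test] = failed / len(by_student)  # fraction of class caught
    return scores

# Toy example: t1 catches two of three classmates; t2 is an incorrect test.
scores = score_tests(
    results={"t1": {"alice": True, "bob": False, "carol": False},
             "t2": {"alice": True, "bob": True, "carol": True}},
    reference={"t1": True, "t2": False},
)
```

One nice property of scoring by catch rate is that trivially-true tests (which everyone passes) earn nothing, without needing any special-casing.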

These are the two main ones I can think of, although other crowd-sourcing techniques can be used as well (e.g., having students come up with small coding questions, then solving each other’s as homework).

On the autograder side, the main accommodation is that the instructor needs to be able to specify which projects depend on which previous ones, and to specify which students can see which other students’ files. There may or may not also be settings to allow instructor test cases, which are hidden from the students – this would then require them to come up with test cases good enough to help each other pass my tests. (Side note: the dependencies are not just a tree, but a directed acyclic graph – I can imagine having students write both the test cases and the libraries, then using those for the final product. I can also imagine a library that’s general enough to be used in multiple projects, again with an eye towards the issue of data representation.)
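As a sketch of that dependency setting (the project names and the use of Python’s standard-library graphlib are my own illustration, not actual Demograder code), a topological sort over the project DAG both yields a valid assignment order and rejects cyclic configurations:

```python
# Illustrative sketch: projects form a DAG, not a tree. A topological sort
# gives a valid release order and raises CycleError on a bad configuration.
from graphlib import TopologicalSorter  # Python 3.9+

def release_order(dependencies):
    """dependencies[project] -> set of earlier projects it builds on."""
    return list(TopologicalSorter(dependencies).static_order())

# Hypothetical course: a library general enough to feed two later projects.
deps = {
    "tests":    set(),
    "library":  set(),
    "project":  {"tests", "library"},
    "capstone": {"library", "project"},
}
order = release_order(deps)
```

The same structure could drive the visibility rules: a student may see other students’ files only for projects that are ancestors of the one currently open.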

I’m calling this project the Demograder – demo-, from Ancient Greek demos, meaning of or pertaining to people, a la democracy. I have barely started the coding – using a Django middle layer with some scripts for the actual compilation and running. If anyone has any cool ideas for how this could be used, I would love to hear them.

[1] The system at Michigan had the staff write buggy code, then gave students points if their test suite would produce different output between the buggy code and the correct code. There are several issues with this. The first is that students don’t get to see what the buggy code is, and finding bugs is really specific to the program in which said bugs exist and to the representation used. So students were sort of shooting in the dark as to what would be a complete enough test case (and I was devious enough to have a test case that almost no one found). The second issue is that this kind of “testing” only works if the correct code already exists – great for regression testing, not so great for developing new software (which was the case for the students).

This led to the third issue: students saw the test cases and their code as completely separate parts of the assignment. Since they had no output to regress upon, they would have to manually look through their test cases to see if all the outputs were what they expected. This is tedious and error-prone, and I don’t know of a single student who went through this process. The result is that students barely tested their code at all, the opposite of the intended effect of making them write test cases.

7 thoughts on “Step 29: Design an Autograder”

  1. Bryce says:

    I’ve had this open in a browser tab for the last week because I’m interested in this project and would be happy to contribute. Let me know if there’s a way I can help.

    I think your plan is a good start, but won’t necessarily incentivise the quality of test cases you’re hoping for. One of the other visitors at Swarthmore tried something very similar, and reported that a few students submitted really good test cases, while everybody else did just enough to pass the bar and then free-rode on those good ones.


    1. What was their incentive structure (i.e., how were the tests graded)?

      I’ve put some thought into the issue. My current solution is to grade the tests based on how many other students got that test wrong, which is a proxy for how good an edge case it is. There are some scaling issues here – e.g., if the whole class passes all tests on the first try – but in general that should separate the test cases a little. I might also make only the “best” n cases count, so students can write as much coverage as they want, but are graded only on the trickiest ones. Making tests worth 20% of their project grade, with the other 80% being their actual code, would incentivize students to write good tests.
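      Concretely, the kind of formula I have in mind looks something like this – every number and name below is a placeholder, not a settled scheme:

```python
# Placeholder grading sketch: each test is scored by the fraction of the
# class it caught, only the best n tests count, and tests are 20% of the
# project grade (code is the other 80%).

def project_grade(catch_rates, code_score, n=3, test_weight=0.2):
    """catch_rates: fraction of classmates each of the student's tests caught.
    code_score: fraction of the class's test collection this student passed."""
    best = sorted(catch_rates, reverse=True)[:n]  # only the best n count
    test_score = sum(best) / n if best else 0.0
    return test_weight * test_score + (1 - test_weight) * code_score

# A student with one tricky test, two middling ones, and solid code:
grade = project_grade([0.9, 0.5, 0.1, 0.0], code_score=0.8)
```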

      It’s not mentioned here, but one thing that EECS 183 did at Michigan is limit the number of submissions students can make per day. I’m undecided on whether to do something similar, to force students to manually examine the test cases and run them themselves. On the other hand, this is an unrealistic constraint, so I don’t know how I feel about it yet.

