One of the main reasons I wrote my own autograder is that I wanted students to be able to test their code on other people’s test cases. This is my second semester teaching CS1, and although I have used this feature in both semesters, I have not found it as flexible as I would like. There is only a small window when students are sufficiently comfortable with the concepts – especially functions, but also loops – that they can write good test cases, but may still struggle with writing the code. I suppose crowdsourced test cases are useful beyond that point, but that is also where CS1 starts becoming more creative, meaning I start reducing the use of the autograder. Together with the schedule of the course (weekly assignments), I really only get to use the crowdsourcing feature once.
This semester, the crowdsourcing was used on a Connect 4 lab. Students had to write a function that determined where a token would land in a column (lowest_empty_row), a function that asked for and verified user input (get_input), and a function that checked for the winning condition (has_connect_four). I motivated test-driven development at the beginning, then asked students to write and submit test cases for has_connect_four, using provided helper functions to easily create a board. The autograder was set up so that, during the lab, each student’s code would only run against their own test cases. (Of course, they could also do this directly on their computer.) After the lab was over, I toggled the autograder so everyone’s test cases were used, then re-ran all submissions on all the test cases. These results – both the correctness of the code and the discrimination of the test cases – then became the students’ grades.
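To make the setup concrete, here is a minimal sketch of what a student-submitted test case might look like. The helper name (make_board), the board representation, and the has_connect_four signature are my assumptions for illustration, not the course’s actual API; the reference implementation exists only so the sketch is runnable.

```python
def make_board(rows):
    """Hypothetical helper: build a board from a list of strings,
    where '.' is an empty cell and a letter is a player's token."""
    return [list(row) for row in rows]

def has_connect_four(board, token):
    """Reference implementation (assumed, for illustration): scan every
    cell and check the four directions for four tokens in a row."""
    n_rows, n_cols = len(board), len(board[0])
    for r in range(n_rows):
        for c in range(n_cols):
            # Right, down, down-right, down-left.
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                if all(
                    0 <= r + i * dr < n_rows
                    and 0 <= c + i * dc < n_cols
                    and board[r + i * dr][c + i * dc] == token
                    for i in range(4)
                ):
                    return True
    return False

# An edge-hugging test case of the kind students found valuable:
# X's run of four ends exactly at the right edge, where off-by-one
# bounds checks tend to break.
board = make_board([
    ".......",
    ".......",
    ".......",
    ".......",
    "....OOO",
    "...XXXX",
])
```

A test case is then just a pair of assertions against this board, such as expecting a win for X but not for O (who only has three in a row).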
I thought the lab was mostly successful. The success came from students having “better” test cases than I did – they caught a number of off-by-one errors that I missed, and some near-hits that I had not looked too closely into. The failure is that, although students are rewarded or punished for the quality of their test cases, I do not have any deliberate process for students to reflect on and improve their own test cases. The main constraint is time: although it’s a three-hour lab, students can only afford to spend an hour (more realistically, 30-45 minutes) on the test cases, so they still have time to finish the functions. Even if students had more time, however, I am not sure how I would guide this reflection. One idea would be for them to see the test cases that others (including themselves) failed, and see if they can identify the common programming error behind each failure. I am not convinced that this is feasible, nor that it would transfer to awareness of likely mistakes in the future.
I have mentioned this test case crowdsourcing idea to several people, including the folks at zyBooks (which I use for my course), and interestingly I have had minor pushback on its utility. Beyond the concern that students may not write good enough test cases – to which this lab in particular offers a counterexample – the argument was essentially that testing (that is, quality assurance) is a whole different skill set, and therefore it should not play as big a role in CS1. To me, the fact that quality assurance exists as a separate department is evidence for the opposite – that writing good test cases is so hard we need experts to do it, which is all the more reason students should start learning early. That said, my inability to fully take advantage of the autograder, even in the second iteration of the course, suggests a mismatch between what I want to do and how I organize the course.
I want to end on some slightly technical thoughts on how test cases should be graded. For the Connect 4 lab, I graded not by test case but by test suite – that is, by how many students failed any of the test cases that a student submitted. In particular, the student whose suite fails the most people gets full marks, then everyone else gets a proportionally lower score based on how many people their suite failed. This means that if everyone’s code were perfect, students would not be penalized for not finding bugs with their test cases.
While this gets the rough correlation correct, it doesn’t contribute to the goal of making students reflect on how to write good tests. I have a vague idea of grading individual test cases based on how many other students failed each test. The problem is that I don’t have a good way of combining this information: simply summing these numbers would favor quantity over quality of tests, while counting each failed student only once would revert the grading to what I do now.
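The two aggregation options and their failure modes can be made concrete. As before, the data shape – each test case in a suite mapped to the set of students it failed – is assumed for illustration.

```python
def sum_score(case_fails):
    """Naive sum over test cases: rewards many near-duplicate tests,
    favoring quantity over quality."""
    return sum(len(failed) for failed in case_fails)

def union_score(case_fails):
    """Counting each failed student only once collapses back to the
    suite-level rule: duplicate tests add nothing, but neither do
    genuinely distinct tests that happen to catch the same student."""
    return len(set().union(*case_fails)) if case_fails else 0

# A suite of three tests, two of which are near-duplicates that
# both catch only "bob".
suite = [{"bob", "carol"}, {"bob"}, {"bob"}]
```

On this suite the sum gives 4 while the union gives 2, illustrating the gap between the two extremes that a better combining rule would need to navigate.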
I still believe in the idea of crowdsourced test cases. All the issues I brought up here are logistical rather than technological, and I would love to hear success stories from others about how they teach test-writing skills.