Automatic Grading of Code Submissions in MOOCs isn’t Perfect Either


There are many obvious problems with the “peer review” model of essay grading Coursera is piloting. On the other hand, one may think that automatic grading of code submissions in MOOCs is much more reasonable. It’s certainly more practical, and it might explain why the majority of current online courses are related to computer science. Obviously, some piece of code either works, or it doesn’t. Yet, a few important issues seem to be neglected. Working code should of course be a necessity, but it’s not all there is to writing “good” code. While one may want to argue that the points I am going to raise are more relevant for real-life software development and less of an issue in university courses, bad practices can still lead to bad habits that only take more time to correct later on.

A big issue is that auto-graders do not recognize poorly structured code. If the code works but is sloppily written, without any comments and with poor visual structure, the software won’t notice. To some degree this could be rectified, but the current versions of the auto-graders on Coursera, EdX and Udacity are oblivious to it. Questions of coding style are, of course, to some degree subjective. Yet, following well-thought-out style guides can greatly increase readability. Even just blank lines to separate functions and variables can go a long way, and certainly few people doubt that it’s a good idea not to write more than a certain number of characters per line. (It may not be 80, though.) A good tutor, either in industry or at university, will point this out to you. The autograder won’t because it can’t.
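As a rough illustration of how some of this could be checked automatically, here is a minimal sketch of a style pass that could run alongside the functional tests. The function name `style_warnings` and the 80-character default are assumptions made for this example; none of it is taken from the graders used by Coursera, EdX or Udacity.

```python
def style_warnings(source, max_line_length=80):
    """Return a list of human-readable warnings about simple style issues.

    Hypothetical helper for illustration only; the checks are crude heuristics.
    """
    warnings = []
    lines = source.splitlines()

    # Flag every line that exceeds the configured length limit.
    for number, line in enumerate(lines, start=1):
        if len(line) > max_line_length:
            warnings.append(
                "line %d: longer than %d characters" % (number, max_line_length)
            )

    # Flag submissions without a single full-line comment.
    # (Inline comments and docstrings are not detected by this heuristic.)
    if not any(line.lstrip().startswith("#") for line in lines):
        warnings.append("no comments found")

    return warnings


submission = "def f(x):\n    return x + 1\n"
print(style_warnings(submission))
```

A check like this cannot judge whether code is well structured, but it could at least nudge students towards the conventions a human tutor would point out.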

Hand in hand with poor structure goes over-engineering. I’m currently going through MITx’s 6.00x Introduction to Computer Science and Programming. Among the introductory CS courses I’ve either taken or had a look at, it seems to be the best one by a wide margin. One reason is that the exercises extend beyond the merely mechanical application of knowledge. For instance, in an early exercise (lecture 4: problem 5), you are asked to clip a number to a given range without using conditional statements. Instead, you have to use Python’s built-in min() and max() functions.

Here is the code skeleton:

def clip(lo, x, hi):
    '''
    Takes in three numbers and returns a value based on the value of x.
    Returns:
     - lo, when x < lo
     - hi, when x > hi
     - x, otherwise
    '''
    # Your code here

This is not an overly difficult exercise, but you may have to think about it for a minute or two.

One possible solution is:

return max(min(hi,x),lo)
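Put together with the skeleton, the one-liner gives the following complete function. The sample calls are added here for illustration and are not part of the original exercise.

```python
def clip(lo, x, hi):
    """Return lo when x < lo, hi when x > hi, and x otherwise."""
    # min() caps x at hi; max() then raises the result to at least lo.
    return max(min(hi, x), lo)


print(clip(0, -5, 10))  # 0  (x is below lo)
print(clip(0, 5, 10))   # 5  (x is in range)
print(clip(0, 15, 10))  # 10 (x is above hi)
```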

One of the students, though, presented a fabulous case for overthinking the problem. He writes:

The trick here is to know that the boolean True and False have values you can use:
True == 1
False == 0
So you can test for the three conditions (lower, in-range, higher) and store each in a separate boolean variable. Then you can use those booleans (think now of 1 or 0) in a return statement that multiplies the low-test boolean by lo, the in-range boolean by x, and the high-test boolean by hi.
The return value is simply the booleans multiplied by each argument, and added together. That will result in math something like these
1 * lo + 0 * x + 0 * hi # too low
0 * lo + 1 * x + 0 * hi # in range
0 * lo + 0 * x + 1 * hi # too high

This is only possible on a weakly moderated platform, where any post with the veneer of plausibility, and written in an authoritative tone, tends to impress the inexperienced. Indeed, other students were busy thanking that guy for his “help”, until finally someone chimed in and clarified that this solution was entirely missing the point.
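For comparison, the approach the student describes can be sketched roughly as follows. The function and variable names are made up for this example, and the sketch assumes that lo <= hi, so that exactly one of the three tests is true.

```python
def clip_boolean(lo, x, hi):
    """Clip x to the range [lo, hi] using boolean arithmetic (assumes lo <= hi)."""
    too_low = x < lo
    in_range = lo <= x <= hi
    too_high = x > hi
    # In Python, True == 1 and False == 0, so only one term is non-zero.
    return too_low * lo + in_range * x + too_high * hi


print(clip_boolean(0, -5, 10))  # 0
print(clip_boolean(0, 5, 10))   # 5
print(clip_boolean(0, 15, 10))  # 10
```

It returns the same values as the min()/max() one-liner, but needs three intermediate variables to do so and does not use the functions the exercise asked for.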

Poor structure and complicated code are not the only issues you may face. If you’ve ever worked with legacy code, then you’ve probably come across poorly chosen variable names. This may not be a big issue in a 50-line script someone writes for his CS 101 class. However, in any context where someone else has to work with your code, this can quickly lead to problems because the other person may need much more time to familiarize himself with it. Besides, that other person may well be the original author in a few months, once he is no longer familiar with the problem he wanted to solve.

What you also sometimes see is people writing code in languages other than English. Yet, English is the de facto Lingua Franca of programming. It’s not so uncommon that someone posts a code snippet somewhere, and asks for help. If a variable is called, say, “naam” instead of name, I don’t necessarily need to know that this is Dutch to correctly guess its meaning. It remains a minor annoyance, though. Besides, there are enough languages out there, and a plethora of possible variable names that bear little to no resemblance to their English counterparts, that guessing won’t help much in the long run.

This is no trifling matter. Just think of Open Office, which is based on an office suite called Star Office. Star Office was originally developed by a team of German developers who, you guessed it, did not comment their code in English. The company behind Star Office was acquired by Sun in 1999, and the software was finally open sourced in 2000. In 2012, though, there are still German comments left in the source code, waiting to be translated into English. The German programmers could have written their comments in English, which may have taken a bit longer. However, by neglecting this proverbial “stitch in time” the problem got compounded. I don’t even want to speculate how much time was wasted on fixing this issue subsequently. Eric S. Raymond writes about this as well in “How To Become A Hacker”, where he points out that English is the “working language of the hacker culture and the Internet.” This situation hasn’t changed.

On a side note, the EdX team recently had some technical issues with their auto-grader. It couldn’t evaluate code that contained Unicode characters. However, this wasn’t a bug but a feature! If anything, you want to discourage students from putting special characters from all the languages in the world in their code. It may even make sense to artificially restrict the auto-grader to the ASCII standard. This doesn’t apply to the real world, but for CS 101 it’s probably appropriate.
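An ASCII-only restriction of this kind would be simple to implement. The following is a minimal sketch under the assumption that the grader receives the submission as a string; the function name `non_ascii_lines` is hypothetical and does not describe how the EdX grader actually works.

```python
def non_ascii_lines(source):
    """Return the 1-based numbers of all lines containing non-ASCII characters."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        # ASCII covers the code points 0 to 127; anything above is rejected.
        if any(ord(character) > 127 for character in line)
    ]


submission = "naam = 'Zoë'\nname = 'Zoe'\n"
print(non_ascii_lines(submission))  # [1]
```

A grader could report these line numbers back to the student instead of failing with an encoding error.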

In general, I think that having a few units on how to write clean code would be very beneficial as it would help students to acquire good habits. This doesn’t just apply to beginners. There are also some seemingly experienced programmers on those platforms, presumably using online courses to learn a new programming language or to brush up on their computer science knowledge. At least that’s my guess when I see someone posting nicely written code in Python, in which the lines end with semicolons. This only illustrates my point: Old habits die hard, so you had better acquire good ones as early as you can.
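Habits like the stray semicolons are easy to detect mechanically. Here is a minimal sketch that uses the tokenize module from Python’s standard library, so that semicolons inside string literals or comments are not counted. The function name `semicolon_lines` is an assumption for this example, and the sketch expects syntactically valid Python as input.

```python
import io
import tokenize


def semicolon_lines(source):
    """Return the sorted 1-based line numbers on which a ';' token appears."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sorted(
        {
            token.start[0]
            for token in tokens
            # Only real operator tokens count, not ';' inside strings or comments.
            if token.type == tokenize.OP and token.string == ";"
        }
    )


submission = "x = 1;\ny = 'a;b'\nprint(x);\n"
print(semicolon_lines(submission))  # [1, 3]
```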

20 thoughts on “Automatic Grading of Code Submissions in MOOCs isn’t Perfect Either”

proakis, October 28, 2012 at 12:42 pm

Ah! When you said functional code, I thought you were talking about functional programming. Please consider changing it to “working code”, if that is what you mean.

1. Gregor Ulm (post author), October 28, 2012 at 1:33 pm
  
  Thanks for pointing this out! I’ve changed the two instances of “functional code” to “working code” to avoid any confusion.
  
elssar, October 28, 2012 at 1:28 pm

Well, I agree that the auto-graders aren’t as good as they should be, but the problem you talk about is mitigated in two ways (I have only taken Udacity courses, so I can only talk about them):
1. Answer videos – The problem is solved by the instructor or the TA in the video and the solution is explained. So the students can go through the solution provided and compare it with theirs.
2. Forums – If a student doesn’t understand the solution provided in the answer video, he can always ask in the forums. The TAs and students are eager to answer questions. Some even like to post their solution to a particular problem in a post and encourage others to post their solutions as answers in that post, so there is a good corpus of both good and bad solutions in the forums.
So if a student carefully watches the solution videos and participates in the forum, the shortcomings of the automatic grader can be overcome.

1. Gregor Ulm (post author), October 28, 2012 at 1:43 pm
  
  Those are good points. I agree that, in theory, great explanatory videos and a well-moderated forum could help tremendously. However, in practice, both are not necessarily given. Especially on Udacity I had a far from satisfying experience.
  In fact, I made a couple of posts on the CS101 forum to provide alternative explanations that were much shorter and, in my humble opinion, clearer than the ones provided by the instructors. I am not saying that they were all bad, but especially Sarah Norell added a couple of videos that were more confusing than helpful. The feedback I had received seemed to confirm that I wasn’t the only one having this perception.
  The forums could be a great asset, too. However, as I point out in an example in the blog post, the risk is that answers that merely look competent get easily accepted, even though they may be misleading. This issue is discussed in more detail in the comments section of this post:
  http://www.angrymath.com/2012/09/udacity-statistics-101.html
  Probably the most relevant quote from Angry Math is:
  “I did briefly browse the discussion forums a few times, not heavily (there’s one quote from it up in the blog post). There were some people spending a lot of time trying to clarify stuff, but in my opinion the student explanations were really super-shaky (c.f. “a little knowledge is a dangerous thing”). I have a hard time seeing how, or any examples of, someone truly qualified to answer tough questions also being a student in the class (excepting a case like my own).”
  Neither of those two issues is unsolvable, though. However, the current state is far from ideal.
  
elssar, October 28, 2012 at 1:31 pm

Also, there is a case to be made about learning from mistakes. A lot of the solutions are far from perfect, but they are creative nonetheless, with students trying hard to come up with the solutions, stretching their abilities and their minds to get to them. That isn’t a bad thing, as long as there is some way for them to learn how to improve.

1. Gregor Ulm (post author), October 28, 2012 at 1:45 pm
  
  Learning from mistakes can be great. However, if the official explanations are lacking, or if solutions pass that work but are stylistically insufficient, you may reinforce bad habits.
  
Nick, October 28, 2012 at 1:48 pm

I am taking the edX CS169.1 course and I find that I will consistently have a “less than elegant” solution that the auto-grader accepts but that I feel is sub-par. The irony is that this class has a large BDD/TDD aspect and is teaching RED-GREEN-REFACTOR, but with an auto-grader, once it’s green there is little reason to go back and refactor.

Sam, October 28, 2012 at 3:08 pm

Good points. I’m taking the Coursera python courses and think “I am the problem” here. My code works but I know it’s not really well written code. However, the grader gives 100% as you note. That devalues the certificate learning experience at this point in time. Interestingly, there is a follow-on course already created to address this issue.
Having said that, I think the courses are excellent introductions. They are well thought out, well paced for neophytes, and a resource I wish I had had when I was a teenager. Nothing would please me more than to see a hundred of them leading up to topics of greater and greater complexity.

1. Gregor Ulm (post author), October 30, 2012 at 9:36 pm
  
  Are you referring to Peter Norvig’s “Design of Computer Programs (CS212)”? This one is on my to-do list, but it has a lower priority.
  I do agree that many of those courses seem to be excellent. The course catalogue at Coursera looks certainly very promising. I am amazed at the great strides that have been made in such a short amount of time. We’re already at a level where you could get, say, an undergraduate level education in computer science online. While the information was available before, the great benefit of those courses is structure, but also pacing.
  One aspect where I see Udacity lacking, though, is that the courses aim to be self-contained and don’t even refer to textbooks. This can quickly lead to rather shaky foundations, or hand-waving explanations (cf. Sebastian Thrun’s “Statistics 101” course). As I said in the article, my experience with MITx’s 6.00x has been excellent so far. In terms of presentation, it’s certainly the most mature course I’ve come across yet.
  
Aniket, October 28, 2012 at 3:54 pm

Not all courses suffer from this problem though. For instance, Coursera’s Functional Programming in Scala incorporates a style checker that you can run to make sure your code hygiene is up to the mark. There are also a couple of points on all the programming assignments for making sure that your solution adheres to the code cleanliness standard.
Agreed that these are more an anomaly than a norm, and in general more can be done about improving the auto-graders in MOOCs, but the point is that if you are in a MOOC, you have proactively taken steps to learn something new, and you can possibly show the same proactivity in learning how to write “good” code rather than relying on auto-graders chastising you for writing bad code.

1. Gregor Ulm (post author), October 30, 2012 at 9:37 pm
  
  From the perspective of a motivated autodidact, I certainly agree. However, as MOOCs become more popular, I don’t think this is necessarily the main audience. Skimming the discussion forums on Coursera, EdX and Udacity certainly gives the impression that quite a few people require some more hand-holding.
  
Fisker, October 28, 2012 at 4:32 pm

Perhaps this isn’t an issue that an auto-grader needs to solve. I took one of these classes online and I was able to solve the problems but wasn’t sure if the implementation was clear and followed standards/conventions. I thought of a few ways that would have helped me get a better understanding of coding:
1. Have a coding style guide specifically for the set of problems you’ll be working on.
2. After you submit your solution, show the instructor’s solution. This could simply be a video of the instructor solving the problem and talking through the solution.
3. After submitting your solution and getting a pass, allow peer review of other solutions. Allow others, including TAs, to give feedback and vote on solutions that they find elegant. This could be some percentage of your overall grade.

1. Gregor Ulm (post author), October 30, 2012 at 9:41 pm
  
  Especially your third point sounds excellent because you can normally learn a lot from looking at alternative solutions. Right now this is done in a mostly informal way on the associated discussion forums of those courses. Having solutions that are “approved” by the teaching assistants would certainly lead to an even better learning experience.
  
Somebody, October 28, 2012 at 6:28 pm

It’s depressing to see several HNers respond to this with “don’t check code quality because everyone else is sloppy about it too.”

Zach, October 28, 2012 at 6:47 pm

I’m making http://codehs.com to teach beginners how to code. We’re focusing on high schoolers and promoting good style and good practices.
We have a mixture of an autograder for functionality and human grading for style.
It’s really important to get both. Our class uses a mastery model rather than grades, so you shouldn’t move on until you’ve mastered an exercise, and mastery does not just stop at functionality. Style is included.
Making your code readable to other people is really important, and it can and should be taught and stressed even on small exercises.
At Stanford, code quality is half your grade in the first two intro classes because it’s just as important that someone else understand your code as it is to just make it work.

Scott, October 28, 2012 at 8:03 pm

The multiply-by-boolean-result trick is used extensively in high-performance code to avoid branching in inner loops. This prevents pipeline flushes. It’s not overthinking at all; it’s the correct answer to not using an if statement. Calling Python’s built-in min and max functions uses an if statement and a procedure call. In summary, that particular guy was the only one who got it right.

1. Gregor Ulm (post author), October 30, 2012 at 9:45 pm
  
  Thank you for providing some background information.
  However, please note that the exercise specifically asked to make use of the in-built min() and max() functions. Thus, that guy clearly missed the mark.
  
Somebody, October 29, 2012 at 4:49 pm

The multiply-by-boolean way is actually quite succinct, and can be written as a 1 line lambda:
clip = lambda lo, x, hi: lo*(x=lo)*(xhi)
It is definitely a different approach than doing an if-else block, however it should not be referred to as incorrect or over complicated. Using boolean algebra to solve a problem can be quite useful.

1. Somebody, October 29, 2012 at 4:50 pm
  
  Looks like some formatting got messed up, I’ll try again:
  clip = lambda lo, x, hi: lo*(x<lo) + x*(x>=lo)*(x<=hi) + hi*(x>hi)
  
  1. Gregor Ulm (post author), October 30, 2012 at 9:48 pm
    
    I hadn’t thought of that. However, using lambda goes beyond the syllabus of that course:
    https://www.edx.org/static/content-mit-600x~2012_Fall/handouts/6.00x_syllabus.5c9cae040ec5.pdf
    6.00 is not 6.001.
    