If programming language design is to become a science, we need more experiments like this one.
The author measured the time for 49 subjects to build a simple parser in Purity, a language similar to Smalltalk implemented in two variants for this experiment. Twenty-five subjects implemented the parser in the dynamically-typed variant, and twenty-four used the statically-typed variant. Two measurements were taken: the time at which the lexical scanner passed all its tests, and the percentage of tests passed by the parser after 27 hours. The tests were equally divided between accept and reject cases, so a program that always accepted (or always rejected) its input would pass 50% of the tests.
- Subjects using the dynamically-typed variant completed the scanner in an average of 5.2 hours; subjects using the statically-typed variant took an average of 7.7 hours. The difference is statistically significant (p = 0.04).
- Fourteen subjects using the dynamically-typed language failed to complete the parser (that is, passed only 50% of the tests); eleven subjects using the statically-typed language failed.
- Subjects using the dynamically-typed language passed an average of 60.2% of the tests; subjects using the statically-typed language passed an average of 64.5%. The difference is not statistically significant (p = 0.40).
"It is not clear why Mann-Whitney's U-Test is used to assess significance rather than Student's T-Test."
I think the note in section 5.2 explains this (along with the Wikipedia pages for t-test and non-parametric statistics):
"At the current point, we cannot assume any underlying distribution for the measured data. Hence, it is necessary to apply a so-called non-parametric significance test"
The t-test assumes normally distributed data.
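For intuition, the U statistic itself can be sketched in a few lines of Python. The data below are invented for illustration, not the paper's raw measurements, and real analysis would use a library routine such as scipy.stats.mannwhitneyu:

```python
# Minimal sketch of the Mann-Whitney U statistic. Being rank-based, it
# makes no normality assumption, which is why it suits data whose
# underlying distribution is unknown.

def mann_whitney_u(xs, ys):
    """Count, over all pairs (x, y), how often x < y; ties count 0.5."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x < y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Invented scanner-completion times (hours) for the two variants.
dynamic = [4.0, 5.0, 5.5, 6.0]
static = [6.5, 7.0, 8.0, 9.0]

u = mann_whitney_u(dynamic, static)
print(u)  # 16.0: every dynamic time is below every static time
```

An extreme U (relative to what random group assignment would produce) is what the test converts into a p-value; no bell curve is assumed anywhere.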
I'm a former student of yours currently working as a senior engineer.
I should say up front that I've not read the paper.
It's an interesting subject. Most of my work is in a statically typed language (C++), but my current project uses a lot of Lua.
My concern with the results here is that the assigned task is pretty much nothing like a real-world software project.
If I understand correctly, only one person was working on each parser, so this one person had the program's data structures already in their head.
Additionally, the project is so small that it doesn't yet have the code complexity where static checks really pay off.
Even in C++, static analysis (PC-lint, Visual Studio with /analyze) is a big win for catching bugs before they even get into source control.
There was some interesting discussion about it in John Carmack's QuakeCon 2011 keynote.
As PeterM says, it sounds like the program is small enough and written in a short enough period of time that people can actually keep it all in their head.
I find that static types and the relatively flat structure of Haskell (you've got types and functions... and not much else) really shine when you are doing things like:
1. writing a program that uses dozens of independently designed and implemented libraries
2. making big changes to existing code
3. working on code that you wrote a long time ago
4. modifying code that someone else wrote
Why would it be interesting to compare a statically typed language to another (richer) statically typed language?
I work in web development. I started out with PHP -- the barriers are low, low, low whichever way you look.
At first it was great and I really, really liked PHP. But after a while I realised that PHP quickly became amazingly expensive for any non-trivial long-term project.
Once confronted with a serious project that would have to be built from scratch, I took a deep breath, explained the options to the client and we agreed to use Haskell.
To put it mildly, I have not regretted this decision!
I agree with the need for studies, but...
I'll be happy to design a different study with you. And I agree with many of the comments here regarding the importance of the long-term maintainability variable.
I *did* read the paper, and attended the talk.
There were some people in the audience who, even after listening to the talk, did not understand the purpose of the experiment. One person complained that they worked on aviation software and that building a parser is trivial compared to that; another complained that most specifications are ambiguous. Stefan's reply was polite, but I will paraphrase it bluntly:
PAY ATTENTION TO WHAT STEFAN IS ARGUING. SHEESH.
Stefan is not arguing that you should use this study to extrapolate to projects of 10 million lines of code. What he did do is revisit an old study by Walter Tichy et al. and see whether slightly different conditions would change the result.
That's what you need to understand.
Disadvantages of any study measuring single programmer productivity:
* Does not measure typing speed during the task, or the use of other input devices such as the mouse
* Does not measure how much the participant typed
* Does not compare the task duration to the typing duration (e.g., notice that nobody appeared to finish sooner than X seconds; but suppose John Doe had typed without stopping at 100 adjusted words per minute: what would his finish time have been?) In other words, I want to know how much time statically typed languages actually cost students. I don't care as much that statically typed was slower. I need a self-comparison baseline, and typing speed appears to be it.
I spoke with Stefan for what must have been 2-4 hours on the last day of the conference. His story about why he wants to do this stuff is just insanely inspiring. We exchanged a ton of ideas for future studies. Stefan's objective is to re-do classical experiments over and over and see if he can gradually improve scientific rigor.
Why not ask Stefan to collaborate? He mentioned to me some advantages to collaborating with him, but you are best hearing that from him rather than second hand from me!
I'm not convinced many researchers have argued that static types decrease development time (other than some ML or Haskell true believers) - the arguments have been in favour of increased execution speed, runtime error prevention, and reduced memory use. The received wisdom is that dynamically typed languages are generally implemented less efficiently, but provide quicker development times.
The question is: how large do programs have to be before static typing's benefits become manifest? Inasmuch as this paper has any results, those results suggest that a toy parser for a simple language is large enough to benefit from static types.
My main piece of advice to anyone thinking about working in this area is certainly to collaborate - but collaborate with an HCI researcher, or ideally an experimental psychologist. Researchers in these disciplines have much more experience in designing, conducting, and reporting on human studies than computer science / programming language people.
Regarding your proposed experiment, I don't think that "Java without generics" counts as dynamically typed.
My anecdotal observation is that most structural problems in real-world software are due to functional decomposition rather than problem decomposition. Functional decomposition breaks a problem into a tree data structure. Re-use of subtrees is then promoted, but this turns the overall tree into a lattice. Therefore, I have never understood the argument that static typing in and of itself helps as systems grow larger (in KSLOC). Doing maintenance edits to a lattice has to be mathematically more complicated than doing maintenance edits to a tree, due to the apparent increase in required type judgments.
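The tree-versus-lattice point can be made concrete with a small sketch. The module names below are invented; the point is just that once a subtree is reused, a single edit must be judged against every upward path to a root:

```python
# Each module maps to the modules that use it (reverse dependencies).
# In a tree every module has one parent; reuse gives some modules
# several parents, turning the tree into a DAG/lattice.

def usage_contexts(parents, module):
    """Count distinct paths from `module` up to the roots: each path is
    a context in which a maintenance edit to `module` may need re-checking."""
    ps = parents.get(module, [])
    if not ps:
        return 1  # a root is its own single context
    return sum(usage_contexts(parents, p) for p in ps)

# Pure tree: one parent each, so one context per module.
tree = {"lexer": ["parser"], "parser": ["compiler"]}
print(usage_contexts(tree, "lexer"))  # 1

# Reused subtree: "strings" serves both the lexer and the reporter,
# so one edit must be judged against two upward paths.
dag = {
    "strings": ["lexer", "reporter"],
    "lexer": ["compiler"],
    "reporter": ["compiler"],
}
print(usage_contexts(dag, "strings"))  # 2
```

In a pure tree the count is always 1; with sharing it grows with the number of distinct client paths, which is the claimed source of extra type judgments.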
I am certainly not the expert you are, but would you say my argument above to Greg is (a) clear and (b) correct?
[Tangent: There are whole classes of problems that a "static type system" (as used in this discussion) does not necessarily solve, such as coordination. Fitness for a particular purpose is the raison d'être of language design, and the sweet spot for a given domain should be based on (a) mathematical underpinnings, (b) human factors, and (c) how (a) and (b) *interact*.]
A reason for Hanenberg's results could also be that dynamically typed languages do indeed have advantages over statically typed languages.
Sure. Unfortunately the conduct and results of this study don't really let us come to any firm conclusion.
Another approach to this problem is Capers Jones-style lines-per-function-point metrics (see e.g. http://www.sigada.org/wg/cauwg/Expediting%20Commercial%20Ada%20Use%202.pdf). The old tables on the web rate Smalltalk and Eiffel both around 15, about the same as scripting languages and as good as it gets before DSLs. But again: would you rather program, say, a file archiver utility in Lisp 1.5 or in Go?
Why would large programs benefit more from static typing than small programs?
A good question! Certainly quite small programs benefit from static types for memory layout --- while really small programs want to reuse memory at different types and are best done in assembly...
I think I said this was the received wisdom; it's certainly a hypothesis I've heard. Presumably because in a large program there will be many more "entities" (objects/functions/structures/whatever) that can be typed, and thus more opportunities for type errors. Static typing also makes some kinds of IDE support much easier to build, especially things like cross-referencers and autocomplete, and again it seems reasonable to surmise that such features are more useful on larger programs.
But many of these effects can be difficult to disentangle. In CLOS, type specifiers can encode type information explicitly, but it is checked dynamically rather than statically. In Smalltalk, variable names by convention encode types - so text-based matching may do as well as type-based matching. In Smalltalk, again, the message syntax militates against parameter arity, position, and type errors, and then the spelling checker catches typos --- one (Bob Harper?) could argue Smalltalk programs are statically checked against a single interface of all the messages defined in the program. Certainly when I used to program Tcl, and now when I'm playing with the dynamically typed Grace prototypes, I think I make more arity, typo, parameter errors than I would in Smalltalk. But these errors are also almost always caught quickly.
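The way keyword-style calls militate against arity and position errors has a rough analogue in Python's keyword-only parameters. This is a hedged illustration with an invented function, not Smalltalk itself:

```python
# Forcing callers to name each argument (the `*` marker) rules out
# position mix-ups, and a misspelled name fails fast rather than
# silently binding the wrong value.

def copy_from_to(*, source, destination):
    return (source, destination)

print(copy_from_to(source="a.txt", destination="b.txt"))  # ('a.txt', 'b.txt')

try:
    copy_from_to("a.txt", "b.txt")  # positional call: rejected immediately
except TypeError as e:
    print("caught:", type(e).__name__)  # caught: TypeError
```

The check happens dynamically, at the call, which matches the observation that in such systems these errors are almost always caught quickly rather than at compile time.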
On the other hand, many large programs do not have large test suites, and to take advantage of compile-time optimisation, statically typed languages can also take a long time to build (modulo Go and CERN's C++ interpreter), and then report any errors badly. Most dynamically-typed systems build quickly and (should!) report good error messages in terms of a straightforward operational model.
It is difficult to isolate all these factors in experiments, more so considering the variability of programmers, and even more so to support statistically valid claims. Empirical experiments on large programs are even more difficult to carry out well than experiments on small programs.
That doesn't mean there aren't experiments we can do: it just means we need to be careful about experimental design, and about how we report and interpret the results. This is not surprising: this is just "science" or "engineering" or "good practice".