An experiment about static and dynamic type systems

Stefan Hanenberg, An experiment about static and dynamic type systems: doubts about the positive impact of static type systems on development time, OOPSLA, Reno/Tahoe, Nevada, pp. 22–35, October 2010.

If programming language design is to become a science, we need more experiments like this one.

The author measured the time taken by 49 subjects to build a simple parser in Purity, a language similar to Smalltalk implemented in two variants for this experiment. Twenty-five subjects implemented the parser in the dynamically typed variant, and twenty-four used the statically typed variant. Two measurements were taken: the time at which the lexical scanner passed all its tests, and the percentage of tests passed by the parser after 27 hours; the tests were equally divided between accept and reject cases, so a random program would pass 50% of the tests.

There are many potential objections to these results. Subjects were not told to complete the scanner before working on the parser. It is not clear why Mann-Whitney's U-Test is used to assess significance rather than Student's T-Test. (I didn't check whether the T-Test yielded different significance.)

Any test of this kind is fraught with difficulties. One issue is that the experimenter, to reduce confounding variables such as familiarity or differences between IDEs, developed his own language, Purity, in two variants. One way around that problem would be to use Java to compare dynamically and statically typed programming: that is, compare subjects using Java 4 with Java 5, Java without and with generics. In Java before generics, a collection would have type List, while in Java with generics a collection would have type List<Integer> or List<List<String>>. This would let one compare the effects of the presence or absence of type information without the need to create an artificial language. It would fail to measure the benefits claimed for completely dynamic languages such as JavaScript or Python, but perhaps it is OK to walk before we run. Anyone interested in collaborating on an empirical comparison of two versions of Java?
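To make the contrast concrete, here is a minimal sketch of the difference the proposed experiment would probe (the class name and values are illustrative, not from the paper). With a raw pre-generics List, a type mistake surfaces only at run time as a ClassCastException; with a Java 5 generic List<Integer>, the same mistake is rejected at compile time:

```java
import java.util.ArrayList;
import java.util.List;

public class RawVsGeneric {
    public static void main(String[] args) {
        // Java 4 style: a raw List accepts any object.
        List raw = new ArrayList();
        raw.add("forty-two");              // a String sneaks in unnoticed
        try {
            // The mistake only shows up when we try to use the element.
            Integer n = (Integer) raw.get(0);
            System.out.println(n);
        } catch (ClassCastException e) {
            System.out.println("error caught at run time");
        }

        // Java 5 style: the element type is declared up front...
        List<Integer> typed = new ArrayList<Integer>();
        typed.add(42);
        // typed.add("forty-two");         // ...so this line would not compile
        Integer n = typed.get(0);          // and no cast is needed
        System.out.println("value read without a cast: " + n);
    }
}
```

The programs are otherwise character-for-character similar, which is what makes the comparison attractive: the only variable is whether the element type is written down and checked.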

You note:

"It is not clear why Mann-Whitney's U-Test is used to assess significance rather than Student's T-Test."

I think the note in section 5.2 explains this (along with the Wikipedia pages for t-test and non-parametric statistics):

"At the current point, we cannot assume any underlying distribution for the measured data. Hence, it is necessary to apply a so-called non-parametric significance test"

The t-test assumes normally distributed data.
Hi Philip,

I'm a former student of yours currently working as a senior engineer.

I should say up front that I've not read the paper.

It's an interesting subject. Most of my work is in a statically typed language (C++), but my current project uses a lot of Lua.

My concern with the results here is that the assigned task is pretty much nothing like a real-world software project.

If I understand correctly, only one person was working on each parser, so this one person had the program's data structures already in their head.

Additionally, the project is so small that it doesn't yet have the code complexity where static checks really pay off.

Even in C++, static analysis (PC Lint, Visual Studio with /analyze) is a big win for catching bugs before they even get into source control.

There was some interesting discussion about it in John Carmack's QuakeCon 2011 keynote.

Best wishes,
I have not read the paper either. Based on the summary -- I am not surprised by the results.

As PeterM says, it sounds like the program is small enough and written in a short enough period of time that people can actually keep it all in their head.

I find static types and the relatively flat structure of Haskell (you've got types and functions, and not much else) really shine when you are doing things like:

1. writing a program that uses dozens of independently designed and implemented libraries

2. making big changes to existing code

3. working on code that you wrote a long time ago

4. modifying code that someone else wrote
I think these comments defending static types (on Wadler's blog of all places) are kind of missing the point. Don't just logic your way into your beliefs, do studies. If you think static types are better at this or that, do an experiment to check your beliefs.
Development time isn't a useful measure on its own: you also need to compare long-term support costs, including the number of bugs and the time/cost to identify and repair them.
You should repeat this test with a huge project that needs maintenance.
> that is, compare subjects using Java 4 with Java 5, that is, Java with and without generics.

Why would it be interesting to compare a statically typed language to another (richer) statically typed language?
I would agree with a previous comment regarding the size of the project. I would posit that the benefits of static typing are not linear with respect to project size and/or code complexity; in particular, the benefits increase significantly as the project grows beyond some measure.
Where are all the unit tests and the instructions for using the language? I want to know exactly how to replicate this experiment.
I agree with Philip. It's a toy problem.

I work in web development. I started out with PHP -- the barriers are low, low, low whichever way you look.

At first it was great and I really, really liked PHP. But after a while I realised that PHP quickly became amazingly expensive for any non-trivial long-term project.

Once confronted with a serious project that would have to be built from scratch, I took a deep breath, explained the options to the client and we agreed to use Haskell.

To put it mildly, I have not regretted this decision!

I agree with the need for studies, but...
Phil, thank you for pointing your spotlight to this study. The study is small and has many weaknesses, but this is the kind of study that absolutely needs to be done if the PL community is to engage in scientific argumentation.
I'll be happy to design a different study with you. And I agree with many of the comments here regarding the importance of the long-term maintainability variable.

I *did* read the paper, and attended the talk.

There were some people in the audience who, even after listening to the talk, did not understand the purpose of the experiment. One person complained that they worked on aviation software and that building a parser is trivial compared to that, and another person complained that most specifications are ambiguous. Stefan's reply was polite, but I will paraphrase it heavily:


Stefan is not arguing that you should use this study to extrapolate to projects of 10 million lines of code. What he did do is revisit an old study by Walter Tichy et al. and see if slightly different conditions would change the result.

That's what you need to understand.
A dynamically typed language is not just a language where type information is absent. There is a lot more to it than that!

Disadvantages of any study measuring single programmer productivity:

* Does not measure typing speed during the activity, or the use of other input devices such as a mouse
* Does not measure how much the participant typed
* Does not compare the task duration to the typing duration (e.g., notice that nobody appeared to finish sooner than X seconds; but suppose John Doe typed without stopping at 100 adjusted words per minute, what would his finish time have been?). In other words, I want to know how much time statically typed languages actually cost students. I don't care as much that statically typed was slower. I need a self-comparison baseline, and typing speed appears to be it.

I spoke with Stefan for what must have been 2-4 hours on the last day of the conference. His story about why he wants to do this stuff is just insanely inspiring. We exchanged a ton of ideas for future studies. Stefan's objective is to re-do classical experiments over and over and see if he can gradually improve scientific rigor.

Why not ask Stefan to collaborate? He mentioned to me some advantages to collaborating with him, but you are best hearing that from him rather than second hand from me!
This OOPSLA paper has certainly generated a large amount of interest, in spite of many issues with its methodology and presentation.

I'm not convinced many researchers have argued that static types decrease development time (other than some ML or Haskell true believers); the arguments have been in favour of increased execution speed, runtime error prevention, and reduced memory use. The received wisdom is that dynamically typed languages are generally implemented less efficiently, but provide quicker development times.

The question is: how large do programs have to be before static typing's benefits are manifest? Inasmuch as this paper has any results, those results suggest that a toy parser for a simple language is large enough to benefit from static types.

My main piece of advice to anyone thinking about working in this area is certainly to collaborate - but collaborate with an HCI researcher, or ideally an experimental psychologist. Researchers in these disciplines have much more experience in designing, conducting, and reporting on human studies than computer science / programming language people.
In my opinion, the benefits of statically typed languages in an industrial setting have nothing to do with initial, from-scratch development time, and everything to do with scalability and maintainability. Most industrial development involves wading around large masses of code and trying to inject or modify a relatively small portion of the whole; static typing provides a useful scaffolding for that process.

Regarding your proposed experiment, I don't think that "Java without generics" counts as dynamically typed.

It is my anecdotal observation that most structural problems in real-world software are due to functional decomposition, rather than problem decomposition. Functional decomposition decomposes problems into a tree data structure. Then re-use of subtrees is promoted, but this turns the overall tree data structure into a lattice. Therefore, I have never understood the argument that static typing in and of itself helps as systems grow larger (in KSLOC). Doing maintenance edits to a lattice has to be mathematically more complicated than doing maintenance edits to a tree, due to the seeming increase required in type judgments.


I am certainly not as expert as you, but would you say my argument above to Greg is (a) clear and (b) correct?

[Tangent: There are whole classes of problems that a "static type system" (as used in this discussion) does not necessarily solve, such as coordination. Fitness for a particular purpose is the raison d'être of language design, and the sweet spot for a given domain should be based on (a) mathematical underpinnings, (b) human factors, and (c) how (a) and (b) *interact*.]
As if static vs dynamic typing was the only- or even the most important reason to choose a language. I smell the taint of language bigotry here.
Yeah, there's a lot left unmeasured/unspecified. To be scientific you have to define (or measure) not just the factors you assume determine the outcome (experimenter's bias) but also anything else that might do so (which is admittedly difficult). Many so-called scientific experiments try to isolate the experiment from external factors (so as to exclude them from consideration), but unless those factors really are excluded they will sneak in and alter the experiment. The problem is that they then go unrepresented in the experiment's interpretation, and the interpretation becomes useless.
A reason for Hanenberg's results could also be that dynamically typed languages do indeed have advantages over statically typed languages. This would also be in line with several other studies that have been carried out in the past.

In Hudak and Jones, "Haskell vs. Ada vs. C++ vs. Awk vs. ...", Relational Lisp, a dynamically typed language, yielded the best result with regard to productivity. (This was closely followed by Haskell, so maybe the important difference is not static vs. dynamic, but rather manifest vs. implicit typing.) In Prechelt, "An empirical comparison of C, C++, Java, Perl, Python, Rexx, and Tcl", the dynamically typed scripting languages yielded the best results with regard to productivity, while still faring pretty well with regard to performance, memory usage, and reliability (sometimes better than, sometimes as well as, the statically typed languages used in that study). In Gat, "Lisp as an alternative to Java", Prechelt's experiment was repeated with Common Lisp and Scheme, and again the productivity of the dynamically typed Lisp dialects was better, with good performance but similarly poor memory usage compared to Java. So there seems to be a pattern here.

@James: Why would large programs benefit more from static typing than small programs? What you absolutely need in large programs are good test suites, and they probably cover most problems that static type systems also cover. This seems to suggest that the added value of a static type system is probably not very high. (I find it hard to imagine that you can do without a test suite.)
If you are interested in this kind of research, I encourage you to participate in this year's Evaluation and Usability of Programming Languages and Tools (PLATEAU) 2011 workshop, co-located with SPLASH and Onward!


The program is to be announced shortly.

Kind regards,
Pascal writes:

A reason for Hanenberg's results could also be that dynamically typed languages do indeed have advantages over statically typed languages

Sure. Unfortunately the conduct and results of this study don't really let us come to any firm conclusion.

Another approach to this problem is Capers Jones-style lines-per-function-point metrics (see e.g. http://www.sigada.org/wg/cauwg/Expediting%20Commercial%20Ada%20Use%202.pdf). The old tables on the web rate Smalltalk and Eiffel both at around 15, about the same as scripting languages: as good as it gets short of DSLs. But again: would you rather program, say, a file archiver utility in Lisp 1.5 or in Go?

Why would large programs benefit more from static typing than small programs?

A good question! Certainly quite small programs benefit from static types for memory layout, while really small programs want to reuse memory at different types and are best done in assembly...

I think I said this was the received wisdom; it's certainly a hypothesis I've heard. Presumably it's because in a large program there will be many more "entities" (objects/functions/structures/whatever) that can be typed, and thus more opportunities for type errors. Static typing also makes some kinds of IDE support much easier to build, especially things like cross-referencers and autocomplete, and again it seems reasonable to surmise that such features are more useful on larger programs.

But many of these effects can be difficult to disentangle. In CLOS, type specifiers can encode type information explicitly, but it is checked dynamically rather than statically. In Smalltalk, variable names by convention encode types, so text-based matching may do as well as type-based matching. In Smalltalk, again, the message syntax militates against parameter arity, position, and type errors, and the spelling checker catches typos; one (Bob Harper?) could argue that Smalltalk programs are statically checked against a single interface of all the messages defined in the program. Certainly when I used to program in Tcl, and now when I'm playing with the dynamically typed Grace prototypes, I think I make more arity, typo, and parameter errors than I would in Smalltalk. But these errors are also almost always caught quickly.

On the other hand, many large programs do not have large test suites; and to take advantage of compile-time optimisation, statically typed languages can also take a long time to build (modulo Go and CERN's C++ interpreter), and then report any errors badly. Most dynamically typed systems build quickly and (should!) report good error messages in terms of a straightforward operational model.

It is difficult to isolate all these factors in experiments, more so considering the variability of programmers, and even more so to support statistically valid claims. Empirical experiments on large programs are even more difficult to carry out well than experiments on small programs.

That doesn't mean there aren't experiments we can do: it just means we need to be careful about experimental design, and about how we report and interpret the results. This is not surprising: this is just "science" or "engineering" or "good practice".
Certainly, the current state of PL "experts" putting forth untested, rationalistic arguments for or against particular language features is embarrassing to the field. However, studies like this make me feel the creep of scientism and the cargo cult into an area that is probably best left as art. Mathematicians and physicists do not need statistical studies to introduce new notations and theories, so why should programmers? If your language is useless, nobody will use it.