Scholarly writing suggests software defects are holding big data back.
We've heard a lot about scientific fraud recently, and it's a serious concern. But how reliable are honest research results? On the science website iSGTW, the journalist Adrian Giordani points to a growing concern with software defects:
In October 2012, a workshop about maintainable software practices in e-science highlighted that unchecked errors in code have caused retractions in major research papers. For example, in December 2006, Geoffrey Chang from the Department of Molecular Biology at the Scripps Research Institute, California, US, was horrified when a hand-written program flipped two columns of data, inverting an electron-density map. As a result, a number of papers had to be retracted from the journal Science.
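The mechanics of such a flip are mundane, which is exactly why it went unnoticed. Here is a minimal sketch (hypothetical Python, not Chang's actual hand-written program, which processed crystallographic data) of how an export routine can silently swap two columns without crashing or warning:

```python
# Hypothetical sketch of a column-flip defect: a hand-written
# exporter that writes two data columns in the wrong order.
rows = [(1.0, -2.0, 3.0),
        (4.0, -5.0, 6.0)]  # (x, y, z) measurements

def export(rows):
    # Bug: x and y are emitted in swapped order, so every
    # downstream consumer silently reads the wrong columns.
    return [(y, x, z) for (x, y, z) in rows]

flipped = export(rows)
print(flipped)  # [(-2.0, 1.0, 3.0), (-5.0, 4.0, 6.0)]
```

Nothing in the language or the runtime flags this; the output is well-formed, plausible data that happens to be wrong, which is why such defects surface only when someone tries, and fails, to reproduce the results.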
I (Hatton) have worked for 40 years in meteorology, seismology, and computing, and most of the software I've used has been corrupted to some extent by such defects, no matter how earnestly the programmers performed their feats of testing. When the defects eventually surface, they always seem to come as a big surprise.
The defects themselves arise from many causes: a requirement might be misunderstood; the physics could be wrong; there could be a simple typographical error in the code, such as a + instead of a - in a formula; the programmer may rely on a subtle, poorly defined feature of a programming language, such as uninitialized variables; there may be numerical instabilities such as overflow, underflow, or rounding errors; or there may be basic logic errors in the code. The list is very long. All are essentially human in one form or another, but all are exacerbated by the complexity of programming languages, the complexity of algorithms, and the sheer size of the computations.
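Two of these defect classes are easy to demonstrate in a few lines. The snippet below (a hypothetical illustration, not drawn from any of the codes discussed) shows how a single-character sign typo and ordinary floating-point rounding can each corrupt results without raising any error:

```python
import math

# 1. A sign typo: the discriminant of a quadratic is b^2 - 4ac,
#    but a single '+' in place of '-' changes every result.
def discriminant_buggy(a, b, c):
    return b * b + 4 * a * c   # typo: should be b*b - 4*a*c

def discriminant_correct(a, b, c):
    return b * b - 4 * a * c

print(discriminant_buggy(1, 2, 1))    # 8  -- wrong
print(discriminant_correct(1, 2, 1))  # 0  -- right

# 2. Rounding error: naively summing many small floats accumulates
#    error; math.fsum performs a compensated, exact summation.
values = [0.1] * 10
print(sum(values) == 1.0)        # False: 0.9999999999999999
print(math.fsum(values) == 1.0)  # True
```

Both programs run to completion and print plausible numbers; only a test against a known-correct answer, or an attempted replication, reveals that anything is amiss.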
A site called RunMyCode, developed by the Columbia University computer scientist Victoria Stodden, helps scientists discover errors by sharing code and data, accelerating replication of experiments.