As a scientist and a graduate student in public health, I do a lot of coding to solve computational problems. Few people would deny that the code we write for data analyses is rife with bugs (problems that make a routine break or not work properly) and errors (incorrect values and data entered by an analyst). Wrong results from bad code can be consequential; some have led to the retraction of publications. Despite the ubiquity of bugs and errors, code review is something I have never heard of in our school.
Code review is the process of having code checked by someone other than its author(s), with pre-specified aims such as reducing bugs and errors or improving code readability and documentation. It is a well-established routine in software engineering.
The problem is that sharing data and code along with the corresponding scientific publication is the exception, not the norm, in fields like medicine and public health. This gives scientists little incentive to spend time checking their code thoroughly. Moreover, many studies that deal with human participants (understandably) cannot make their data and code publicly available because of confidentiality concerns. In these cases, a routine internal code review mechanism within a lab would be useful.
Compared to a decade ago, we now have many tools at our disposal to make the code review process easier. There are style guides to improve code readability, platforms (such as GitHub and GitLab) to share code and files (both online and offline), and version control systems (such as Git) to help organise the code review process systematically.
In the process, though, we need to be careful not to turn code review into personal criticism. Some ground rules need to be established beforehand: feedback should be constructive and useful, and comments should stick as closely as possible to the objectives of the review laid out by the author. In general, be nice to your colleagues!
Scientists often work separately on their own tasks or projects with relatively little interaction with colleagues. For some, the only opportunity for interaction is the regular lab meeting, and some labs don’t even have that. Because it is highly interactive, code review may help bridge this gap by fostering trust and building rapport within a lab or between labs. It is also a good way to learn how to communicate about code and technical subjects. Further, it helps spread best practices and reusable code, bring new members up to speed, and spur new ideas for analysis and collaboration. These social benefits tend to be overlooked in conversations about code review.
The most important ingredient in implementing routine code review in science, I would argue, is commitment from the leaders of labs and institutions. Although the process of code review has become easier, there is still a learning curve in mastering the tools and techniques. Most scientists don’t have a background in software engineering. Though they may have learned to code in their data analysis courses, they are not necessarily trained to troubleshoot and document code. Besides, some scientists still use Microsoft Excel and other point-and-click software to solve their data problems without any coding. This approach leaves no code to review and also limits the reproducibility of results. These are significant barriers that need to be tackled through support (investment of time and resources) and insistence from leaders.
The discussion about code review in science is not new; there is no shortage of literature and blog posts on this topic. The question now is not whether we should do code review in science but how to do it. It may seem difficult at first, but once it becomes the norm in a lab, the payoff in productivity, collaboration, and work quality will be great in the long run.