One hundred days and one hundred lines of code
Last week I attended ‘100 days and 100 lines of code’, which was organised by the Epiverse team at LSHTM. The overall idea was to think about what the first 100 lines of code written would be when the next pandemic happens (I think more as a cute reference to similar thoughts about vaccine development than as a totally serious concept).
The format was over three days:
- Talks from academics, public health and field epidemiologists on their thoughts and experiences with epidemiology software.
- Exercise: starting with data simulated from an outbreak, create problems in the data and a list of questions to answer. Then swap with another group and try to answer their questions. (You can find our exercise response on GitHub)
- Summarise common experiences and problems with software from the second day.
I had only a couple of reservations about the event. No software developers or research software engineers spoke, which I found odd considering it was ostensibly an event about writing code. I think we missed their perspective, including on whether problems with epidemiology software are similar to those in other scientific fields. There was also more of a focus on outbreak response and field epi than on pandemic response, but maybe that’s reasonable.
While I personally learned a lot about the early stages of outbreak response, afterwards I tried to think about whether there were any things the community at large could do, and came away with five thoughts from the final group discussion:
1. Searchable curated package indexes would be useful
A lot of the groups knew many useful tools and packages to approach the exercise, due to years of experience in the field. As a relative newcomer, I hadn’t heard of the existence of many of these.
An index of trusted epidemiology packages, possibly incorporating user rankings, and short/keyword explanations of their use would be really useful. A search over the package documentation for terms such as ‘case fatality’ might be even more helpful. I think this would be really useful in bioinformatics too!
Applied Epi seemed to be somewhat on top of this, and the Epiverse have their own set of packages. They have also made a map – a more up-to-date version of this was shown at the meeting, but I can’t find it online.
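A minimal sketch of what a keyword search over such an index could look like, written in Python for brevity. Everything here is an assumption for illustration: the package names, descriptions and ratings are made up, and `PackageEntry` and `search_index` are hypothetical names rather than part of any existing tool.

```python
from dataclasses import dataclass


@dataclass
class PackageEntry:
    """One curated entry in a hypothetical package index."""
    name: str
    description: str
    keywords: tuple = ()
    user_rating: float = 0.0  # hypothetical mean user rating, 0 to 5


def search_index(index, query):
    """Return entries matching every query term, highest rated first."""
    terms = query.lower().split()
    hits = []
    for entry in index:
        # Search the description and the keywords together.
        text = " ".join([entry.description, *entry.keywords]).lower()
        if all(term in text for term in terms):
            hits.append(entry)
    return sorted(hits, key=lambda e: e.user_rating, reverse=True)


# Made-up entries: these are not real packages.
INDEX = [
    PackageEntry("exampleserial", "Estimate serial intervals from case pairs",
                 ("delays", "transmission"), 3.5),
    PackageEntry("exampleseverity",
                 "Adjust fatality estimates for delays between case onset and death",
                 ("severity",), 3.0),
    PackageEntry("examplecfr", "Estimate case fatality risk from line lists",
                 ("severity", "cfr"), 4.5),
    PackageEntry("examplelinelist", "Tag and validate case line list columns",
                 ("cleaning",), 4.0),
]

results = [entry.name for entry in search_index(INDEX, "case fatality")]
print(results)
```

A real index would more likely search the full package documentation and use proper text ranking, but even plain term matching with a trust or rating signal would help a newcomer find the right tool.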
2. Data cleaning – unloved but crucial
Everyone seemed to agree more packages to help automate and make data cleaning reproducible would be helpful. In particular, tools which keep manual control of the cleaning, but help make the process smoother.
Ultimately my feeling was this doesn’t particularly fit into the current models of academic funding. Should academics even be working on this? Perhaps we need to think more about who should do this work: professional RSEs seem the natural choice, but they seem to be as rare as hen’s teeth.
Overall, it felt like not a huge amount of progress in this area had been made since Hackout3, the last similar event I attended. Possibly because everyone’s been in emergency response mode for the last three years!
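To make the idea of cleaning that stays under manual control but is still reproducible a bit more concrete, here is a rough Python sketch. It assumes corrections are written down by a person as data (which record, which column, old value, new value, and why) and then applied by code that logs every change. The function `apply_corrections`, the column names and the records are all hypothetical.

```python
import copy


def apply_corrections(records, corrections):
    """Apply human-authored corrections to a copy of the data and log each change."""
    cleaned = copy.deepcopy(records)  # never modify the raw data in place
    log = []
    for fix in corrections:
        for row in cleaned:
            # Only apply a fix if the record still holds the value the human saw.
            if row["case_id"] == fix["case_id"] and row.get(fix["column"]) == fix["old"]:
                row[fix["column"]] = fix["new"]
                log.append(
                    f'{fix["case_id"]}: {fix["column"]} '
                    f'{fix["old"]!r} -> {fix["new"]!r} ({fix["reason"]})'
                )
    return cleaned, log


# Simulated raw line list with two deliberate problems.
raw = [
    {"case_id": "c1", "date_onset": "2023-01-02", "sex": "f"},
    {"case_id": "c2", "date_onset": "2023-13-01", "sex": "m"},
    {"case_id": "c3", "date_onset": "2023-01-09", "sex": "femal"},
]

# Corrections decided manually, but stored as data so they can be reviewed and rerun.
corrections = [
    {"case_id": "c2", "column": "date_onset", "old": "2023-13-01",
     "new": "2023-01-13", "reason": "day and month swapped"},
    {"case_id": "c3", "column": "sex", "old": "femal",
     "new": "f", "reason": "typo"},
    {"case_id": "c1", "column": "sex", "old": "x",
     "new": "f", "reason": "stale fix that no longer matches"},
]

cleaned, log = apply_corrections(raw, corrections)
for line in log:
    print(line)
```

The design choice is that the judgement stays with the person, while the code only guarantees that the same decisions are applied the same way every time and leave an audit trail.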
3. Don’t make one big package to do it all
Seems like the packages available are fairly modular, and it should stay this way.
One thing I definitely got out of the COVID software response was that it’s better to make smaller packages which can be independently tested, updated, and deployed. Making them easily interoperable is helpful too.
I therefore liked the Epiverse idea of having interoperable packages similar to the tidyverse. How exactly should this be achieved? I don’t think epi data is particularly special: usually case line lists (a data table), contact tracing data (a network with attributes), and sometimes genomic data (which is special 🙃). So maybe just a tibble, with some ‘special’ columns with consistent names such as date of onset etc, that certain packages require?
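As a sketch of how a shared column convention might work in practice, the Python below treats a line list as a plain column-oriented table (standing in for a tibble) and lets each package declare which standard columns it needs. The names in `STANDARD_COLUMNS` and the function `validate_line_list` are assumptions made up for this example, not an existing Epiverse convention or API.

```python
# Shared column names that packages would agree on (hypothetical convention).
STANDARD_COLUMNS = ("case_id", "date_onset", "date_report", "outcome")


def validate_line_list(columns, required):
    """Check a column-oriented table before analysis and return its row count."""
    unknown = [name for name in required if name not in STANDARD_COLUMNS]
    if unknown:
        raise ValueError(f"not part of the shared convention: {unknown}")
    missing = [name for name in required if name not in columns]
    if missing:
        raise ValueError(f"line list is missing required columns: {missing}")
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("columns have different lengths")
    return lengths.pop() if lengths else 0


line_list = {
    "case_id": ["c1", "c2", "c3"],
    "date_onset": ["2023-01-02", "2023-01-05", "2023-01-09"],
    "age": [34, 7, 61],  # extra columns are allowed and simply ignored
}

# A package that only needs onset dates accepts this table.
n_rows = validate_line_list(line_list, required=["case_id", "date_onset"])

# A package that also needs outcomes fails early with a clear message.
try:
    validate_line_list(line_list, required=["date_onset", "outcome"])
    error_message = None
except ValueError as err:
    error_message = str(err)

print(n_rows)
print(error_message)
```

The point is that interoperability here needs no special data structure, only agreed names and an early, readable error when a required column is absent.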
4. Help users engage with the software community more
Software developers in this area have a hard time knowing where their efforts are most effectively spent. User feedback on things that didn’t work, difficulties with basic tasks, and feature requests often doesn’t make it back to them.
GitHub and GitLab basically have all the tools to manage this, and software developers use them effectively internally.
One suggestion: as well as teaching people how to use git, maybe we should also think about adding more tuition on how to use GitHub and its community features?
5. Add more case studies in documentation
Most documentation for research software is fairly poor. R packages/CRAN enforce a minimum standard, which I’d say is positive, and the relative ease of writing vignettes in R has always felt like a strength to me.
I’m quite taken by ‘the four types of documentation’:
- Tutorials
- How to guides
- Reference
- Explanation
We usually do reference ok (because we’re forced to), explanation is usually in papers (for better or worse), but I think we don’t pay enough attention to the first two. More case studies and examples would be a good starting point. Generally I think time spent on documentation is some of the most valuable time you can spend on software, and so I promise to try and practise what I preach in future!
I should also mention the R4Epi handbook here, which is a good effort.
These are all quite focused on outbreak response. The next pandemic needs more systemic changes: guaranteed funding and career progression for RSEs; more software and modelling expertise in public health agencies (in the UK at least); funding for software maintenance. I doubt any of these things will really happen without funders and governments rethinking their strategy for supporting epidemiology software. But maybe some of the above will still help.