Some thoughts on bioinformatics software maintenance

2019-09-16

Overall thoughts

I released my first software package for bioinformatics about four years ago. I now have four, all of which see some usage, but certainly nothing like the heavy usage of the most popular utilities. Despite this, I would guess on average I spend around 20-30% of my time maintaining these packages.

I love that people find our software of some use, and it’s still exciting getting messages from users from countries all around the world. I want to maintain and improve our software, and help people use it as much as I can. After all, one of the great things about working at a University is that everything I work on goes into the public domain, rather than being kept closed-source.

I do feel like this effort is not likely to be rewarded by typical indicators such as publications or funding. With analysis papers I’ve published, the paper itself has usually been the end of the story, and it is then time to move on to the next thing, whereas with software it is really only the start of the work. That’s ok for me right now, mostly because it’s one of the things I enjoy most about my job, but maybe that will change as I spend more time doing it. Perhaps this is a broader problem in genomics/bioinformatics that needs change from funders, publishers etc? (I enjoyed this recent article which talked about this and proposed some solutions.)

Here are some things which have helped me lighten this load, or I think I’ve gotten better at, so far:

Make sure people can install it

I really think this is the number one thing to do! Spending a little time on this at the start has saved me so much more time later, and it’s one of the hardest things to diagnose via email. The first package I wrote was really hard to install, and realistically you probably needed to be a C++ developer to do it. Possibly, you needed to be me. This was a big mistake!

On conda

I’d come to view bioconda as a panacea for all such woes: you just need to spend a bit of time sorting out all your compile lines and dependencies, the maintainers are really helpful in getting it to work and maintaining good standards within their project, and it’s easy to use.

But I’ve seen a lot of posts/banter on twitter about conda being bad or not working, so clearly it doesn’t solve these issues for everyone. As more gets installed, dependencies get more complex, and things like version pinning come to the forefront. I’ve so far managed to resolve any issues I’ve encountered, but I presume this is because I have become competent with the system from having written so many recipes for it. It’s probably incorrect to assume that most users have the same level of familiarity with the system. I still think conda is the best solution out there, but I think perhaps we need better guides and documentation for when things go wrong.

My three biggest tips to fix stuff in it:

1. Create a new environment if something isn’t working when installing:
   ```
   conda create -n pyseer_env pyseer
   ```
   If the environment solving is taking a while, try with a new environment too. Another option is mamba (itself installable through conda) which worked well when I tried it.
2. If you are getting a library error, run conda update on the package listed. Or, try a new environment.
3. Run on Linux (if you are using OS X, you can use a virtual machine).

Write documentation

It’s a false economy not to. I’ve spent at least 3-5 days on this for anything I want other people to use. reST, sphinx and readthedocs make it really easy to make something that’s easy to write, looks nice, and has a website. Update the docs when you change the code too!

Fully documenting/commenting code (e.g. including full valid docstrings, maintaining PEP standards) has felt less beneficial for most projects; I think this becomes useful when you have more people working on the code. Comment it however is best for you.
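
For reference, a "full valid docstring" in the reST field-list style that sphinx can render looks roughly like the sketch below. The function `kmer_count` is a hypothetical example made up for illustration, not code from any of the packages mentioned here.

```python
def kmer_count(seq, k):
    """Count the k-mers in a sequence.

    :param seq: input sequence
    :type seq: str
    :param k: k-mer length, must be positive
    :type k: int
    :return: mapping from each k-mer to its number of occurrences
    :rtype: dict
    :raises ValueError: if k is not positive
    """
    if k <= 0:
        raise ValueError("k must be positive")
    counts = {}
    # Slide a window of width k along the sequence
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts
```

Whether this level of detail is worth it for a given function is the judgment call described above; the fields mostly pay off once the docstrings are being pulled into generated API pages.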

Write some tests, add continuous integration

Probably obvious, and certainly not the most exciting revelation I have had. This has however saved me from the biggest issues when I’ve been updating code. travis-CI, circleCI or azure pipelines are all easy ways to automatically run these with each new push.

Though, I don’t feel I have time to write unit tests.

I’m sure they would help prevent some regressions I’ve seen in my code. I think these would be easiest to write if I made them each time I wrote a new function. But at what point do I decide/know this code is going to be used enough to make this time worthwhile? It feels like something for projects with multiple developers. (Perhaps this is blasphemy?)
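
As a rough sense of scale, a unit test written alongside a new function can be only a few lines. The sketch below assumes a hypothetical `reverse_complement` helper (not taken from any real package) and uses plain `assert` statements in a `test_` function, which is the form pytest collects automatically.

```python
# Hypothetical helper, used only to illustrate the shape of a unit test.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def test_reverse_complement():
    # A known input/output pair
    assert reverse_complement("AAGC") == "GCTT"
    # A palindromic site is its own reverse complement
    assert reverse_complement("ACGT") == "ACGT"
    # Edge case: empty input
    assert reverse_complement("") == ""
    # Applying the function twice should give back the original
    assert reverse_complement(reverse_complement("GATTACA")) == "GATTACA"
```

Saved in a file named something like `test_sequences.py` (an assumed name), this would be picked up by running `pytest` in the same directory, and the same command can be run by a continuous integration service on each push.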

We are not all Heng Li (or, Make Use of Dependencies)

Heng Li is clearly a legendary bioinformatician, and his software is some of the best. Heng has advocated avoiding maintenance of code by not using external dependencies, making a simple and clean user interface, and duplicating code bases: https://lh3.github.io/2019/03/11/on-maintaining-bioinformatics-software

I agree with two of these points, but not with avoiding dependencies. If, like me, you are not Heng Li, you may find it hard to bash out an aligner from scratch rather than including a library as a dependency. It saves me so much time relying heavily on dependencies for many common tasks, and there are some things I simply don’t think I could reimplement (either as efficiently, or correctly). If you are making a file format or something fundamental this seems more doable, but for most things, why reinvent the wheel?

With conda, I’m happy knowing that I can sort out any dependency installation difficulties when I write the recipe to install the package. I save a lot of time overall.

Engage users

Previously I have ‘advertised’ software through papers, conference presentations and twitter. This seems to work quite well, but doesn’t really help beyond the initial release. I have recently tried to get feedback on a more ongoing basis. I wish I’d done this earlier.

Most of the feedback I get is through github issues or emails saying something doesn’t work or is missing. But I rarely hear from people for whom the code worked but who had suggestions to improve things. Or from people who were put off trying in the first place. And maybe I’m missing out on some positive feedback through this too!

I tried running a survey, which although probably far from ideal, definitely helped fill some of these gaps. I did try promoting it to see if this got more responses – I felt a bit icky giving money to twitter, but at least it would have pushed some other ads out of people’s feeds. I am still unsure whether this was good value for money or not.

A little money can go a long way

While I can give things like graphic design the old college try, it generally looks ok at best, and takes me all day. I tried paying for some images from the noun project ($2.99 each, or $39.99 for unlimited use) and an HTML/CSS template ($29). These both saved me loads and loads of time, and looked much better than anything I could have done.

I’m pretty sure these costs pale in comparison to most other things in science, and definitely your own time, so I didn’t find them hard to justify.

There’s also lots of free stuff to use out there. I am a growing fan of using emojis in figures - universal, well-designed images that put across a concept efficiently to a wide range of audiences 🙏.

Write your package in R

R is one of my less favourite languages, but I’ve never had a problem installing something from CRAN. Maybe R developers are just better coders?