Question: How To Organize A Pipeline Of Small Scripts Together?
Asked 10.7 years ago by Giovanni M Dall'Olio (Barcelona, Spain) · 4 votes

In bioinformatics it is very common to end up with a lot of small scripts, each with a different scope - plotting a chart, converting a file from one format to another, executing a small operation - so it is very important to have a good way to glue them together, to define which should be executed before the others, and so on.

How do you deal with the problem? Do you use Makefiles, Taverna workflows, batch scripts, or some other solution?

Answered 10.7 years ago by Giovanni M Dall'Olio (Barcelona, Spain) · 5 votes

My favorite way of defining pipelines is by writing Makefiles; you can find a very good introduction in Software Carpentry for Bioinformatics: http://swc.scipy.org/lec/build.html

Although Makefiles were originally developed for compiling programs, they let you define the operations needed to create each file, with a declarative syntax that is a bit old-fashioned but still does its job. Each Makefile is composed of a set of rules, each of which defines the operations needed to compute one file, and which can be combined to make a pipeline. Another advantage of Makefiles is conditional execution of tasks: you can stop a pipeline and get back to it later, and make will only repeat the calculations whose inputs have changed. However, one of the big disadvantages of Makefiles is the old syntax... in particular, rules are identified by the names of the files they create, and there is no such thing as a 'title' for a rule, which makes complex pipelines trickier to follow.
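
A minimal sketch of what this looks like (the file names, tools, and commands here are invented for illustration; note that recipe lines must be indented with a tab):

    # Hypothetical two-step pipeline: align reads, then plot a summary.
    all: coverage_plot.png

    # Each rule is identified by the file it creates.
    alignment.sam: reads.fastq
            bwa mem reference.fa reads.fastq > alignment.sam

    coverage_plot.png: alignment.sam
            python plot_coverage.py alignment.sam coverage_plot.png

    clean:
            rm -f alignment.sam coverage_plot.png

Running make a second time skips the alignment step as long as alignment.sam is newer than reads.fastq - that is the conditional execution mentioned above.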

I think one of the best solutions would be to use BioMake, which allows you to define tasks with titles that are not the names of the output files. To understand it better, look at this example: you see that each rule has a title and a series of parameters, like its output, inputs, comments, etc.

Unfortunately, I can't get BioMake to run on my computer, as it requires very old dependencies and is written in very difficult Perl. I have tried many alternatives, and I think rake is the one closest to BioMake, but unfortunately I don't understand Ruby's syntax.

So, I am still looking for a good alternative... Maybe one day I will have the time to re-write BioMake in Python :-)

Answered 10.7 years ago by István Albert (University Park) · 4 votes

I don't have personal experience with this package, but it is something that I plan to explore in the near future:

Ruffus, a lightweight Python module for running computational pipelines.
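
For reference, a minimal Ruffus sketch looks something like the following (the sample file names and the trivial task body are invented for illustration):

    # Minimal Ruffus sketch: turn each .fastq into a .sam.
    import os
    from ruffus import transform, suffix, pipeline_run

    starting_files = ["sample1.fastq", "sample2.fastq"]

    # Create empty inputs so the sketch actually runs.
    for name in starting_files:
        if not os.path.exists(name):
            open(name, "w").close()

    @transform(starting_files, suffix(".fastq"), ".sam")
    def align(input_file, output_file):
        # A real task would call an aligner here; this only
        # records the input -> output dependency.
        with open(output_file, "w") as out:
            out.write("aligned " + input_file + "\n")

    pipeline_run([align])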

Answered 10.7 years ago by Chris (Munich) · 3 votes

Since I work a lot with Python, I usually write a wrapper method that embeds the external script/program, i.e. calls it, parses its output, and returns the desired information. The 'gluing' of several such methods then takes place within the Python code that calls all these wrappers. I guess that's a very common thing to do.
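
Roughly like this (the script name, its arguments, and its output format are all hypothetical):

    # Sketch of a wrapper around an external script.
    import subprocess

    def convert_file(input_path, output_path):
        """Call a (hypothetical) conversion script, return its parsed output."""
        result = subprocess.run(
            ["python", "convert_format.py", input_path, output_path],
            capture_output=True, text=True, check=True,
        )
        # Parse whatever the script prints; here we assume one record per line.
        return result.stdout.splitlines()

    # The 'gluing' is then plain Python: call one wrapper after another.
    # records = convert_file("data.gff", "data.bed")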

Chris

Answered 10.7 years ago by Michael Barton (Akron, Ohio, United States) · 3 votes

My answer would be: don't bother. I've often found that many of the scripts I write are never used again after the initial run. Spending time on a complex framework that tracks dependencies between scripts is therefore a waste, because the results might be negative and you may never revisit the analysis. Even if you do end up running a script multiple times, a simple hacky bash script might be more than enough to meet the requirements.

There will, however, be the 1-2% of initial analyses that return an interesting result and therefore need to be expanded with deeper investigation. I think that is the point at which to invest more time in organising the project. Personally, I use Rake because it's simple and lets me write in the language I'm used to (Ruby).
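
For the curious, a small Rakefile looks roughly like this (file names and commands invented for illustration); file tasks declare dependencies much like Make rules:

    # Hypothetical Rakefile: rebuild the plot only when the data changes.
    file "results.csv" => ["raw_data.txt"] do
      sh "python summarise.py raw_data.txt results.csv"
    end

    file "plot.png" => ["results.csv"] do
      sh "Rscript plot.R results.csv plot.png"
    end

    task :default => "plot.png"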

Overall, I think pragmatism is the important factor in computational biology. Just do enough to get the results you need, and only invest more time when it's necessary. There are so many blind alleys in the computational analysis of biological data that it's not worth investing too much of your time until you have to.

Answered 10.6 years ago by Etal (Athens, GA) · 3 votes

The most important thing for me has been keeping a README file at the top of each project directory, where I write down not just how to run the scripts, but why I wrote them in the first place -- coming back to a project after a several-month lull, it's remarkably difficult to figure out what all the half-finished results mean without detailed notes.

That said:

  • make is pretty handy for simple pipelines that need to be re-run a lot
  • I'm also intrigued by waf and scons, since I use Python a lot
  • If a pipeline only takes a couple of minutes to run, and you only re-run it every few days, coercing it into a build system doesn't really save time overall for that project
  • But once you're used to working with a build system, the threshold where it pays off to use it on a new project drops dramatically