Monday, 28 May 2012

Am I really a bioinformatician?

A bioinformatician is someone who creates computational tools that address biological questions.

A computational biologist is someone who uses computational tools to address biological questions.

There are four classes of computational tools: algorithms, workflows, databases and interfaces. I have created each of them in the past. I was a bioinformatician! These days most of my time is spent deploying and maintaining the computational tools of others; so am I still a bioinformatician? I'd like to hope so, and therefore propose a new classification - the operational bioinformatician - to describe myself and those like me, as distinguished from the research bioinformaticians who get all the fun creative jobs. Oh well, let's hope any future bifx union treats Perl scripting as an operational rather than research activity.

Saturday, 7 January 2012

The various semantics of Open-*

Several 'Open' movements have gained traction since 'Open Source' was coined in the late nineties. Those important to me, an applied informatics scientist, include:

The meaning of 'Open' can, however, differ somewhat in each case. For Open Source, it basically means that the end product of the (software) development process is available for anyone to see (if not always to distribute) free of charge. The same openness of the end product holds for Open Access and Open Data. One difference is that reuse of Open Data is generally associated with fewer restrictions than Open Access or Open Source, which are typically limited somewhat by copyright and software licenses.

At about the same time as the rise of Open Source, it was recognised that openness of the software development process itself, in addition to that of the end product, could be a useful approach to the creation of free software. This was most famously described in Eric Raymond's The Cathedral and the Bazaar. The paradigm has been seized upon by the Open Science movement, which contends that the very process of science should be conducted in the open (e.g. blogs, live lab notebooks). It logically follows that the products of such endeavours should be Open (Access, Data, Source), but this is a secondary attribute.

Open Innovation, simply looking outside your organisation for innovative ideas, seems like the square peg in this discussion. Surely it's just outsourcing of research and development activities? Where is the 'open' in that? However, if one accepts that the approach is epitomised by 'Open Innovation' challenge companies such as InnoCentive, then the openness lies in the clear statement of the requirement/specification associated with each challenge. If one accepts 'open specification' as a key attribute of Open Innovation, then it's easy to see the fit with translational research: unmet R&D requirements of domestic industries provide a justified focus for domestic science funding agencies.

It seems that we have 'Open' movements for all stages of the development lifecycle, from inception (Open Innovation), to implementation (Open Science), to end product (Open Source, Access, Data). The commonality is that the outputs of each stage are freely available for everyone to see. It does not, however, mandate that one should necessarily follow another.

Tuesday, 30 August 2011


I've just returned from two weeks living under canvas in Spain. Although my Spanish is, erm, basic, communication was simply not a problem. That's in stark contrast to my day job in science and business, even with all the tools of the modern internet at my disposal: there's so much I want to say that is intuitively obvious (to me) but defies logical description (by me). Compare the following: "I'd like bread, cheese and a litre of red wine in a plastic bottle" with "commercial involvement will make open science more successful". Everyone understands intuitively why I want to poison my body with saturated fats and alcohol, but polluting open communities with business ethics requires special justification, for which I have yet to develop the language. But I'm working on it.

Friday, 1 July 2011

The battle between "Open Science" and "Open Innovation"

"Open innovation" is a term that describes the sourcing of new methods, ideas, solutions etc. from outside the organisation.

I hate "open innovation"! I don't hate the process of "open innovation", I just hate the term, because there's nothing "open" about it. The final "innovation" is just as closed as if it had been invented in-house.

The poster child of open innovation is, of course, InnoCentive. Clever name, brilliant business model: "seekers" invite "solvers" to provide solutions to their problems for a cash reward. InnoCentive represents open-something for sure, but if not innovation, then what? Open questions? Not quite. Open quandaries? Better. Open befuddlement? Too far! I therefore humbly suggest:
InnoCentive: crowdsourced solutions to open quandaries.

However, so-called "open innovation" extends way beyond crowdsourcing of the InnoCentive mould. A great number of acquisitions and licensing deals, particularly in the pharmaceutical industry, can be seen in this light. Although such transactions release very little into the public domain, it is clear that the innovation has technically come from outside the purchaser/licensee, so "open innovation" still fits.

More on "open innovation": pharma companies are increasingly looking to bypass the biotech middleman by partnering directly with academia. This amounts to the funding of public research by multinational corporations in exchange for first dibs on any intellectual property that may emerge. It could be argued that such initiatives pit "open innovation" in mortal combat with "open science". "Open science", remember, asserts that public research belongs in the public domain, for free, and for the good of all.

Semantic posturing aside, innovation is the key to progress, no matter how it is couched. There's a nice recent example from Henry Chesbrough (who coined "open innovation") on how it can help pharma. So let's call a spade a spade: I hate the term "open innovation" because it can be twisted into conflict with "open science", a movement that I truly value. I would rather "open innovation" revert to plain old "contract research", perhaps reserving "open quandary" to describe the crowdsourcing of the same.

Thursday, 16 June 2011

The brilliant Genome Analysis Crowdsourcing repository

In the days following the deadly German E. coli outbreak, various 'rapid response' sequencing, assembly and annotation efforts washed across my radar (mainly via Twitter). In isolation, each of these efforts represents little more than a shop-front for its creators' (albeit impressive) capabilities. There was always the nagging feeling that a coordinated effort would have been more credible, and ultimately more useful.

Having perused a couple of the available data sets to see which file formats were being distributed, I was hoping to find a blog post that summarised them all. That's when I found the E. coli O104:H4 Genome Analysis Crowdsourcing repository on GitHub. It goes way beyond a simple blog: it is a living repository linking all of the data generation efforts to date. And if that were not enough, there is also a day-by-day listing of analysis reports (mainly blog posts).

I now contend that "Genome Analysis Crowdsourcing", by pooling various independent data sets and analyses, makes them as credible and useful as, if not more so than, any coordinated project could possibly have been. The quantity and variety of data in the public domain, all generated within two weeks and linked from a central location, is staggering!

Thursday, 2 June 2011

Open communities - build or reuse?

I have drafted this blog post a few times and I'm bored with the narrative, so I'm going to spit out the conclusion right at the start: if you want to engage a developer community for your project, go to them; don't ask them to come to you (they almost certainly won't)...

As funding for open databases like NCBI's OMIM is cut, there tend to be fairly rational calls for database curation to be opened up to the community (e.g. Manuel Corpas's recent blog post). The typical method is to add a wiki interface to the project that accepts community annotation. I've been present at the birth of a few such projects. No names named, and here's why: I've also sadly attended their inevitable deaths from neglect when no bugger ever used the darned things. Whilst notable successes exist (EcoliWiki, SNPedia, the Polymath Project), building an open community from scratch is hard, very hard, and most such projects are doomed to failure. I, for one, limit myself to participating in two or three projects at any one time, and need a very compelling reason to start contributing to a new one.

So, given that collaborative development does produce valuable products and that individuals can be motivated to contribute, how do we go about finding our contributors? The solution is actually pretty obvious: don't build an open community from scratch; use an existing one! The shining example of this approach is Rfam's adoption of Wikipedia itself as the source of community-derived annotation, with advantages described in this NAR paper including:
  • access to a large existing community of curators,
  • access to well-maintained, user-friendly curation tools,
  • entries subjected to automated QC tools (bots),
  • improved database content (around 2,500 contributions per year),
  • plus the side effect of improved discoverability of the resource via Wikipedia itself.

It will be interesting to see whether other annotation projects cotton on to this idea; Pfam already has, but it comes from the same Bateman stable as Rfam, so it might not count (I've already been chided for mixing the two over at this Tree of Life blog post on a similar subject). Away from annotation, for active and inclusive bioinformatics-specific open communities you have the OBF leading the way, and also Debian Med (now blogging here), which is leveraging the wider Debian Linux community for the benefit of the life sciences. Whether open science projects will successfully leverage Twitter and other social media communities remains to be seen.

So, what's the point of all this? Oh yes: if you're serious about engaging a developer community for your project, go to them; don't ask them to come to you. Got that?

Thursday, 19 May 2011

Well that KEGGing sucks - but how much?

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a hugely important database of genomic pathways and interactions that has been used daily by countless molecular biologists over the past 15 years (up to 200,000 unique website visitors per month).

Even though the data sources that KEGG integrates to build its database are predominantly available to all, free of restriction, the KEGG database itself has traditionally carried a dual license - free for academic use, but non-free for commercial use through their licensing agent, Pathway Solutions. I'm no great lover of dual licenses, as they discourage commercial use and thereby restrict translational application of the resource 'for the good of humanity'. Well, two days ago KEGG announced that it would go even further, charging academics up to $5,000 to download the database (starting July 1st).

Can we use this unfortunate circumstance to assess the impact of limiting access to an established resource such as KEGG, and do so using a scientific measure that really matters: citations? Given that KEGG accrues around 1,000 citations per year and has a 15-year track record, the numbers for the next few years should be very revealing.
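The measurement I have in mind is simple: compare the mean annual citation count in the years before the licensing change with the years after. A minimal sketch of that comparison (all the citation figures below are hypothetical, made up purely for illustration, not real KEGG numbers):

```python
def mean_annual_citations(counts):
    """Average yearly citation count over a period."""
    return sum(counts) / len(counts)

def relative_change(before, after):
    """Fractional change in mean annual citations, after vs before."""
    base = mean_annual_citations(before)
    return (mean_annual_citations(after) - base) / base

# Hypothetical citations per year: ~1,000/year before the change
# (as in the post), made-up figures for the years after.
before = [980, 1010, 1020, 990]
after = [950, 900, 870]

print(f"{relative_change(before, after):+.1%}")  # prints -9.3%
```

Of course real citation data would need correcting for publication lag and the general growth of the literature, but even this crude before/after ratio would be telling.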

Footnote - funding for large integrated databases is notoriously difficult to maintain over the long term, even though the resources themselves are enormously valuable. In February, NCBI tackled budget challenges by throwing their SRA toys out of the pram (on which I have commented before), whereas KEGG has been far more pragmatic in looking for alternative funding sources. I have huge respect for both projects.