The challenges of storing scientific data

(Image credit: Pixabay)

Karen Ambrose is the database team lead at the Francis Crick Institute in St Pancras, London. We caught up with her at the Percona Live 2019 conference in Amsterdam to understand the complexities of managing databases in a scientific institution. Karen has been with the Crick for about five years. She has a background in bioinformatics, and it was during her Master's that she became interested in applying technology to gain better insights into scientific data.

Karen started her career at the Sanger Institute in Cambridge, at the time it was mapping the human genome, before moving on to the Francis Crick Institute. The Crick itself came about as a merger of several research organisations, including the National Institute for Medical Research (NIMR) and the London Research Institute (LRI).

Her first task was to migrate the data from the different databases in the various organisations: “We initially had a time frame I think of about nine months to a year to physically migrate and move into the Francis Crick. And so we have to migrate about 300 databases. But that was in a landscape where the groups weren't entirely moving in one go. So you might have a group, which essentially will talk to a cluster of databases at one site. Half of that group is then moved into the Francis Crick, and the other half is staying in because they have to shut down their lab in order to move. And we've got to make that data available at the new site and the old site.”

What made it even more challenging was that it wasn't just a set of databases assigned to one group that was moving; some of those databases were shared between five groups moving at different times. Karen describes the migration as shuffling chess pieces: she had to make sure no data was corrupted and that it remained available to the teams still working on it, with the least amount of downtime, if any.
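
The interview doesn't detail the team's tooling, but one sanity check a phased migration like this depends on is confirming that a table copied to the new site still matches its source. A minimal sketch in Python with psycopg2 (the hostnames, credentials and table names here are all hypothetical):

```python
# Minimal sketch: check that migrated tables match their source copies.
# All hostnames, credentials and table names here are hypothetical.
import psycopg2

OLD_SITE = dict(host="db.old-site.example", dbname="labdata",
                user="migrator", password="secret")
NEW_SITE = dict(host="db.new-site.example", dbname="labdata",
                user="migrator", password="secret")

def row_count(conn_params, table):
    """Return the number of rows in a table at one site."""
    with psycopg2.connect(**conn_params) as conn:
        with conn.cursor() as cur:
            # Table names come from a fixed list below, so string
            # interpolation is acceptable for a one-off check.
            cur.execute(f"SELECT count(*) FROM {table}")
            return cur.fetchone()[0]

for table in ("samples", "sequencing_runs", "experiments"):
    old = row_count(OLD_SITE, table)
    new = row_count(NEW_SITE, table)
    status = "OK" if old == new else "MISMATCH"
    print(f"{table}: old={old} new={new} {status}")
```

In practice a real migration would compare checksums as well as row counts, but the shape of the check, source against destination, table by table, is the same.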

It sounds like a Herculean task, one that, given the strict deadlines, would surely have required an army of database wranglers: “There's four of us in the team, including me.”

(Image credit: Shutterstock)

Strategising storage

“Over the years we've basically been building a scientific data mountain. Data doesn't get smaller, it just seems to get more complex and large.”

The institute has about 1,500 people, including roughly 1,300 scientists and 200 operational staff. There are some 130 lab groups, supported by about 18 to 20 Scientific Technology Platforms (STPs) that provide core services to help the lab groups further their science: “So things like structural biology, and electron microscopy, high throughput sequencing, scientific computing, of which the database team which I manage is part. So we provide a core service to the rest of the Institute.”

“For us, it's very much about the data that comes off these instruments”, Karen tells us. Besides making sure they provide the right platform to help scientists investigate the raw data that comes off the machines, a major task for Karen and her team is to store the data efficiently: “We need to work out what can we contain within the storage that we have within the institute, and also what other strategies do we need to incorporate, in terms of maybe looking at cloud, to help us provide the scientific insights that a particular lab group requires.”

The first challenge, she tells us, is to manage and secure all the generated data: “If people generate data, they generally want to keep everything, because you never quite know when you might need it. But we can't physically keep everything.” So her team works with the lab groups to identify the important data and separate it from data that can be regenerated if needed.

The next challenge is performance. For some scientists throughput doesn't matter much as long as they can access the data; for others it's critical: “We're always looking at how we can best design their database, how their data needs to be structured so that it will be performant.” Once again, Karen says, the solution comes out of discussions with the labs to understand what they need to achieve from the data.
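
To make that concrete: in a relational database such as Postgres, a common first step when queries are slow is to compare the planner's strategy before and after adding an index. A minimal sketch, in which the samples table, its columns and the connection details are all invented for illustration:

```python
# Minimal sketch: inspect a query plan before and after adding an index.
# The table, columns and connection details are invented examples.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="labdata",
                        user="postgres", password="secret")
conn.autocommit = True

def explain(cur, query):
    """Print the planner's strategy for a query."""
    cur.execute("EXPLAIN " + query)
    for (line,) in cur.fetchall():
        print(line)

query = "SELECT * FROM samples WHERE lab_group = 'structural-biology'"

with conn.cursor() as cur:
    explain(cur, query)  # without an index: a sequential scan of the table
    cur.execute("CREATE INDEX IF NOT EXISTS idx_samples_lab_group "
                "ON samples (lab_group)")
    explain(cur, query)  # the planner can now use the index

conn.close()
```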

(Image credit: Shutterstock / Imilian)

The open source advantage

The Francis Crick Institute uses various types of databases. While for the enterprise side of things, they use Oracle or SQL Server, Karen tends to steer the science groups towards open source databases. The Institute uses relational databases like MySQL and Postgres, but is starting to explore NoSQL databases like MongoDB, Neo4j, Cassandra, and others. She’s particularly keen on investigating Neo4j because “it's interesting in terms of how it graphs the relationships between data.” 
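
As a rough illustration of what makes that interesting: in Neo4j, relationships are stored as first-class data and queried by traversal rather than by joining tables. A minimal sketch using the official Python driver, with invented nodes and relationships:

```python
# Minimal sketch of Neo4j's relationship-centric model, using the
# official Python driver. The nodes and relationships are invented.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "secret"))

with driver.session() as session:
    # Model a scientist, a sample and the link between them as a graph.
    session.run(
        "MERGE (s:Scientist {name: $name}) "
        "MERGE (m:Sample {id: $sample}) "
        "MERGE (s)-[:ANALYSED]->(m)",
        name="Ada", sample="SAMPLE-0001")

    # Traversing the relationship is a first-class query, not a join.
    result = session.run(
        "MATCH (s:Scientist)-[:ANALYSED]->(m:Sample) "
        "RETURN s.name AS scientist, m.id AS sample")
    for record in result:
        print(record["scientist"], "->", record["sample"])

driver.close()
```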

Karen also likes working with open source databases because of their open development model: “If you come up with something, a new problem that you want to solve, it's a lot easier to talk to the community to come up with a solution. They're always innovating, always pushing things forward. So you never feel like you're going to be confined by a stagnant release process.”

Mayank Sharma

With almost two decades of writing and reporting on Linux, Mayank Sharma would like everyone to think he’s TechRadar Pro’s expert on the topic. Of course, he’s just as interested in other computing topics, particularly cybersecurity, cloud, containers, and coding.