Database management for "big science" applications

I recently attended an invitation-only, one-day workshop at the Stanford Linear Accelerator Center. Attendees included representatives from:

  • The database research community (including me)
  • The "big science" community, which has BIG database problems
  • Commercial DBMS vendors
  • Other "power users" of database technology, including eBay, Yahoo!, and Google

The point of the workshop was to look for approaches that would solve the DBMS problems of big science in a better way. The conventional wisdom today -- and I am generalizing a bit -- is to store science data in the file system with metadata about the files stored in a relational DBMS. In both astronomy and particle physics, projected data size is well into the petabyte range.


The top three DBMS issues for big science

The big science community has a variety of DBMS problems, including:

  1. Consistency of data and metadata. Since metadata is stored separately from the data, the programmer is responsible for keeping the two consistent. This reminds me of the DBMS community in the 1970s, which lamented the same issue.

  2. A differing view of DBMS requirements. Science data is stored in the file system because DBMSs don't "do the right thing." However, there seems to be no common statement of what the right thing is. For example, the particle physics folks want time series support for observation data and particle tracks, the astronomy folks want indexing of 3D objects in several coordinate systems, and the remote sensing and astronomy communities want built-in support for multi-dimensional arrays.

  3. No automated lineage support. Support for lineage (provenance) is crucial. It is important for scientists to know how any given data set was derived. In other words, they want to keep track of the sequence of processing steps that has previously been applied. As with the first problem, the programmer currently handles this issue manually.

Obviously, the best solution to these three problems would be to put everything in a next-generation DBMS -- one capable of keeping track of data, metadata, and lineage. Supporting the latter would require all operations on the data to be done inside the DBMS with user-defined functions -- Postgres-style.
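
As a rough illustration of why lineage tracking requires operations to run inside the system, here is a minimal Python sketch. It is a toy under stated assumptions: `ToyStore`, `Dataset`, and the function names `calibrate` and `average` are hypothetical, and an in-memory dictionary stands in for the DBMS. It is not how Postgres implements user-defined functions. The point is only that when every derived data set is produced by a registered function called through the store, data, metadata, and lineage can be recorded in one step.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class Dataset:
    """A stored value together with its metadata and the step that produced it."""
    name: str
    value: Any
    metadata: Dict[str, Any]
    produced_by: Optional[str] = None                 # None means raw (ingested) data
    inputs: List[str] = field(default_factory=list)   # names of the input data sets


class ToyStore:
    """Hypothetical in-memory stand-in for a DBMS that owns data, metadata, and lineage."""

    def __init__(self) -> None:
        self._datasets: Dict[str, Dataset] = {}
        self._functions: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        """Register a user-defined function so the store can run it itself."""
        self._functions[name] = fn

    def ingest(self, name: str, value: Any, **metadata: Any) -> None:
        """Store raw data and its metadata together, in one operation."""
        self._datasets[name] = Dataset(name, value, dict(metadata))

    def apply(self, fn_name: str, output: str, *input_names: str, **metadata: Any) -> Any:
        """Run a registered function on stored inputs and record how the output was derived."""
        fn = self._functions[fn_name]
        result = fn(*(self._datasets[n].value for n in input_names))
        self._datasets[output] = Dataset(
            output, result, dict(metadata), fn_name, list(input_names)
        )
        return result

    def lineage(self, name: str) -> List[Tuple[str, Tuple[str, ...], str]]:
        """Return the processing steps that led to `name`, oldest first."""
        ds = self._datasets[name]
        if ds.produced_by is None:
            return []
        steps: List[Tuple[str, Tuple[str, ...], str]] = []
        for parent in ds.inputs:
            for step in self.lineage(parent):
                if step not in steps:
                    steps.append(step)
        steps.append((ds.produced_by, tuple(ds.inputs), name))
        return steps


# Usage with made-up data: two processing steps applied to one raw data set.
store = ToyStore()
store.register("calibrate", lambda xs: [x - 1.0 for x in xs])
store.register("average", lambda xs: sum(xs) / len(xs))
store.ingest("raw", [2.0, 4.0, 6.0], instrument="hypothetical-sensor")
store.apply("calibrate", "calibrated", "raw", units="counts")
store.apply("average", "mean", "calibrated")
print(store.lineage("mean"))
# [('calibrate', ('raw',), 'calibrated'), ('average', ('calibrated',), 'mean')]
```

If the same functions were run by external code against files, the store would see only the finished outputs and would have no way to reconstruct these steps.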


A previous effort with Postgres failed

Clearly big science would like somebody else to take over its storage issues. But that has not happened yet. I am reminded of the Sequoia 2000 project in the mid 1990s, which I co-led with Jeff Dozier of UC/Santa Barbara while I was at Berkeley. This was a DEC-sponsored collaborative project between computer scientists and earth scientists to build tools and systems for earth scientists. In the database arena, the goal was to use Postgres for storage. But this part of the project failed because:

  • Postgres had no support for big arrays, which was the predominant data type.

  • Postgres had no notion of a processing pipeline whereby raw imagery is "cooked" into finished data products. Hence, there was no way for it to automatically keep track of lineage.

  • Postgres was not particularly easy to use for the operations earth scientists wanted to use it for, such as coordinate transformations. Hence they did not see the value of a DBMS over custom C or C++ code operating against the file system.

The Sequoia experience convinced me that big science would not be happy with anything remotely like what was offered in commercial DBMSs. That leads to the question of the day: "What do they want?"


A call for help: Let the research community help develop a science database

At CIDR 2007, some of us reported on a prototype called ASAP that we thought might appeal to the science community. This system proposed a real-time processing pipeline, lineage, and good support for large arrays. In other words, we fixed all the problems we saw in the Sequoia project a decade ago.

ASAP is languishing because we cannot find any scientists willing to work with us. The ones we have talked to are typically too busy and don't see the near-term value of collaboration. In a sense they are right -- the value of the collaboration would be to define a good science DBMS that could then be commercialized. But that process and the benefits would probably take at least five years.

A better solution requires input from big science on the initial ideas from the DBMS research community. The research community would be thrilled to try to define the data types and operations a science DBMS needs, but it cannot do so without help from big science. This is a plea to big science for that help.

Where could we start working together? A significant problem in developing a science DBMS will be the definition of a small collection of primitives. Relational DBMSs succeeded in business data processing because essentially all users were willing to use SQL engines based on a single data type (table) and a small set of operations (filter, join, aggregate, etc.). To have a chance to succeed, a science DBMS must also have a small set of data types and operations. A small collection of primitive operations is crucial; otherwise, the run-time system will be hopelessly complex. It appears to be a big challenge to come up with one small set of operations, given the diversity of needs I saw at the workshop.
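
To make the relational precedent concrete, here is a minimal Python sketch of one data type (a table, represented as a list of rows) and three operations on it. The function names, the sample observations, and the catalog are all made up for illustration, and a real SQL engine is far more elaborate. The sketch only shows how a small, closed set of primitives composes: each operation takes tables and returns a table.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Table = List[Row]   # the single data type: a table is a list of rows


def filter_rows(table: Table, predicate: Callable[[Row], bool]) -> Table:
    """Keep only the rows for which the predicate holds."""
    return [row for row in table if predicate(row)]


def join(left: Table, right: Table, key: str) -> Table:
    """Equi-join two tables on a shared column, using a simple hash index."""
    index: Dict[Any, List[Row]] = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def aggregate(table: Table, group_by: str, column: str,
              fn: Callable[[List[Any]], Any]) -> Table:
    """Group rows by one column and reduce another column with fn."""
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for row in table:
        groups[row[group_by]].append(row[column])
    return [{group_by: k, column: fn(v)} for k, v in groups.items()]


# Usage with made-up data: the output of each primitive feeds the next.
observations = [
    {"star": "A", "night": 1, "flux": 10.0},
    {"star": "A", "night": 2, "flux": 14.0},
    {"star": "B", "night": 1, "flux": 3.0},
]
catalog = [
    {"star": "A", "kind": "variable"},
    {"star": "B", "kind": "dwarf"},
]
bright = filter_rows(observations, lambda r: r["flux"] > 5.0)
peak = aggregate(join(bright, catalog, "star"), "kind", "flux", max)
print(peak)   # [{'kind': 'variable', 'flux': 14.0}]
```

The open question for a science DBMS is what plays the role of `Table` and these three functions when the data are arrays, time series, and 3D objects.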


Big science can't do it on its own

The big Web companies have storage issues at least at the scale of big science, and perhaps bigger. Several are "rolling their own" solutions, or have already done so, having given up on DBMS technology for their immediate needs. However, these companies have much bigger budgets, and far more skilled manpower for developing custom DBMS solutions, than appear to be available to big science. Without that money and those resources, it will be crucial for big science to agree on some common standards and then foster their implementation.

 



4 Comments

Tom Davis said:

I know this isn't exactly what you're looking for, but the HDF5 format has been developed to address many of these issues. Data and metadata are easily handled atomically, and lineage can be shown in the tree with the raw data as a parent node and each processing step as a descendant node.

I believe a neuroscience lab at Stanford was working on something like this two or three years ago.

http://en.wikipedia.org/wiki/Hierarchical_Data_Format

Floris Sluiter said:

I think we qualify as your typical "big" scientists:
In the Dutch EcoGRID project (www.ecogrid.nl) we have a system with 14 distributed databases containing geo-referenced field observations on biodiversity. In total it now contains 20 million records; in the next year it will grow towards 300 million records as we add more datasets. These field observations are collected by more than 5000 qualified volunteers, each of whom will have a database account to enter and retrieve their personal observations. The system is currently implemented in PostgreSQL 8.1 with the PostGIS 1.1 library.
A record can be, for example, a datapoint, a complete GPS track, or a whole grid (satellite imagery, aerial photography).


Datatypes we lack:
o GIS rasters (or multidimensional matrices with neighbourhood indices and functions), as in ISO 19123.

o A time-interval datatype for timestamps, including operators for before, after, and overlap.

o Space/time datatypes for moving species (GPS tracks), or time series for GPS tracks that can grow while measuring, which we can couple with GIS linestrings.

o Functionality to synchronize logins/passwords between PostgreSQL clusters.

o Remote tables: efficient querying of large tables stored in remote clusters (SQL/MED, ISO/IEC 9075-9:2003).

o Parallel query processing: if we could access several remote tables on different clusters/servers in parallel, it could provide a significant speedup. I am thinking of remote tables that are children inherited from one parent table. The children would be stored on different servers.

o Multi-master replication with automated recovery after restoring backups. Replication of only parts (schema/table) of a database is a requirement.

I am the chief engineer of this system and work in the Computational GeoEcology group at the University of Amsterdam (http://www.science.uva.nl/ibed-cbpg).
My group and I are very interested in helping to develop any of the above functionality (preferably in PostgreSQL).

Regards



Floris Sluiter

fsluiter _a_ science.uva _dot_ nl

Jacek Becla said:

Mike,

It was discussed at the workshop that we should try to organize a dedicated, focused meeting between DB research/academia and the scientific community to better understand common scientific database requirements. We have started to draft a list of possible requirements, and I believe we will be able to assemble a solid group of representatives from science for a meeting with you and other database researchers around April or May of this year. I'm also planning to present at this year's SSDBM conference (also suggested at the workshop) and discuss the need for collaborative work on common scientific requirements.

For those who didn't attend the workshop, the full report is available at:
http://www-conf.slac.stanford.edu/xldb07/xldb07_report.pdf
and will be published soon in Data Science Journal.

Jacek Becla

Michael Stonebraker said:

Jacek,

Such a meeting would be great. Let us know how your work progresses. We can help organize the meeting you talk about.

Mike


About this Post

This page contains a single post by Michael Stonebraker published on November 6, 2007 12:26 PM.
