To ETL or federate ... that is the question
It's a common problem. Enterprises must integrate data in a number of operational systems. But how should they do it? There are two technical approaches:
For the rest of this relatively short post, we will explain the pros and cons of each approach.
Data element "heat": Hot data favors ETL
In the ETL approach, data transformation occurs when a data element is extracted, while in the federation approach transformations occur at query time. If a data element is queried multiple times, it is obviously cheaper to perform the transformation once, thereby favoring the ETL approach. On the other hand, if a data element is never or only queried once, then the federation approach makes more sense. In summary, if a data element is rarely queried (i.e., it is cold) then federation is desirable. In contrast, hot data elements are better with ETL solutions.
Indexing: Federation is harder to optimize
The data indexing requirements of OLTP are typically quite different from those of data warehouses. Hence, in an ETL approach the warehouse workload can be optimized separately from the OLTP workload on different hardware. In the federation approach, a DBA must balance the needs of both workloads in a single database -- a task that will be much more complex than optimizing two separate workloads.
In general, the federation approach will have significantly worse performance because the needs of the two environments must be optimized together, rather than separately.
Resource management: Faster BI query responses for ETL shops
In a data warehouse, there is a dedicated machine with optimized indexing for BI users. In contrast, the BI user will typically be prioritized behind OLTP transactions in a data federation. This will lead to poor response time for BI queries (i.e., more recommendations to "go out for lunch" while waiting for the result of a query).
Complexity of the schema change: ETL approach performs less joins
Most data warehouses implement star or snowflake schemas. In contrast, most OLTP systems utilize non-snowflake schemas. As a result, the global schema is quite different from the various operational schemas. In this case, a single record in the global schema may come from several records in the operational schema. Therefore, a federator must perform this join on every query. In contrast, an ETL system will do the join once at load time. Again, the ETL approach should have much better performance when the schema mapping becomes complex.
Contention (concurrency control): Federation contention challenges
In an ETL system, data elements must be extracted from the operational systems periodically. Once loaded into a central warehouse, they become read-only. Hence, there is essentially no contention for locks in the ETL approach. In contrast, the federation approach will mix business intelligence queries and transactions in the operational systems. The result is lock contention, as well as contention for other resources.
Timeliness: ETL approaches must deal with out-of-date data issues
A data warehouse is fundamentally out of date by one-half of the periodicity of the load process. On the other hand, a federator gives up-to-the-second information.
To alleviate this disadvantage, some newer warehouse systems, such as Vertica, allow data loading in parallel with querying, a process called "trickle loading."
Mapping: Federations can't handle some transformations
A common situation is for the operational databases to have customer information, such as customer names. In an ETL approach, whenever a customer datum is encountered, it can be looked up in a steadily growing table containing the mapping from operational system names to global schema names. If a name is not present a new entry can be added. Name mapping is thereby a global operation, supported by a mapping table. In a data federation, name mapping is done on data access. It is difficult to guarantee that the same mapping is applied to each operational system, unless the same table -- discussed above -- is maintained. However, a federator has no facility to perform mappings that require state information. As such, there are some transformations that are very difficult to perform on the fly.
Summary: The ETL approach makes sense in most cases
In summary, virtually all enterprises use the ETL approach for data integration. The data federation market is, in contrast, quite small. The place where I see federations as most viable is when there are many, many data sources (e.g., more than 5,000 sources) and BI users utilize only a small number of them at any given time. In this extreme case, the average data element is accessed zero times before it is updated or deleted. In this instance, one is better off leaving the data where it originates. On the other -- more common -- hand, when most data elements get used several times, the ETL approach will continue to be preferred.
- Extract, transform, and load (ETL). In this approach, an enterprise sets up a centralized data warehouse and then constructs a global schema for the data of interest. For each operational system, they will employ some sort of ETL process to transform data instances into the global schema and then load them into the centralized warehouse.
- Federate. As an alternative, enterprises can construct a global schema as described above but leave the data where it resides. Instead of building a central warehouse, they can employ a data federator, such as MetaMatrix or Aqualogics. Queries (and perhaps updates) can be submitted to the federator. In turn, the federator figures out what queries or updates need to be run at each of the operational sites to construct the correct outcome to the submitted commands.
For the rest of this relatively short post, we will explain the pros and cons of each approach.
Data element "heat": Hot data favors ETL
In the ETL approach, data transformation occurs when a data element is extracted, while in the federation approach transformations occur at query time. If a data element is queried multiple times, it is obviously cheaper to perform the transformation once, thereby favoring the ETL approach. On the other hand, if a data element is never or only queried once, then the federation approach makes more sense. In summary, if a data element is rarely queried (i.e., it is cold) then federation is desirable. In contrast, hot data elements are better with ETL solutions.
Indexing: Federation is harder to optimize
The data indexing requirements of OLTP are typically quite different from those of data warehouses. Hence, in an ETL approach the warehouse workload can be optimized separately from the OLTP workload on different hardware. In the federation approach, a DBA must balance the needs of both workloads in a single database -- a task that will be much more complex than optimizing two separate workloads.
In general, the federation approach will have significantly worse performance because the needs of the two environments must be optimized together, rather than separately.
Resource management: Faster BI query responses for ETL shops
In a data warehouse, there is a dedicated machine with optimized indexing for BI users. In contrast, the BI user will typically be prioritized behind OLTP transactions in a data federation. This will lead to poor response time for BI queries (i.e., more recommendations to "go out for lunch" while waiting for the result of a query).
Complexity of the schema change: ETL approach performs less joins
Most data warehouses implement star or snowflake schemas. In contrast, most OLTP systems utilize non-snowflake schemas. As a result, the global schema is quite different from the various operational schemas. In this case, a single record in the global schema may come from several records in the operational schema. Therefore, a federator must perform this join on every query. In contrast, an ETL system will do the join once at load time. Again, the ETL approach should have much better performance when the schema mapping becomes complex.
Contention (concurrency control): Federation contention challenges
In an ETL system, data elements must be extracted from the operational systems periodically. Once loaded into a central warehouse, they become read-only. Hence, there is essentially no contention for locks in the ETL approach. In contrast, the federation approach will mix business intelligence queries and transactions in the operational systems. The result is lock contention, as well as contention for other resources.
Timeliness: ETL approaches must deal with out-of-date data issues
A data warehouse is fundamentally out of date by one-half of the periodicity of the load process. On the other hand, a federator gives up-to-the-second information.
To alleviate this disadvantage, some newer warehouse systems, such as Vertica, allow data loading in parallel with querying, a process called "trickle loading."
Mapping: Federations can't handle some transformations
A common situation is for the operational databases to have customer information, such as customer names. In an ETL approach, whenever a customer datum is encountered, it can be looked up in a steadily growing table containing the mapping from operational system names to global schema names. If a name is not present a new entry can be added. Name mapping is thereby a global operation, supported by a mapping table. In a data federation, name mapping is done on data access. It is difficult to guarantee that the same mapping is applied to each operational system, unless the same table -- discussed above -- is maintained. However, a federator has no facility to perform mappings that require state information. As such, there are some transformations that are very difficult to perform on the fly.
Summary: The ETL approach makes sense in most cases
In summary, virtually all enterprises use the ETL approach for data integration. The data federation market is, in contrast, quite small. The place where I see federations as most viable is when there are many, many data sources (e.g., more than 5,000 sources) and BI users utilize only a small number of them at any given time. In this extreme case, the average data element is accessed zero times before it is updated or deleted. In this instance, one is better off leaving the data where it originates. On the other -- more common -- hand, when most data elements get used several times, the ETL approach will continue to be preferred.
Categories
Database architecture

Leave a comment