Manage Data Across Your Cloud(s) with Data Federation, Data Hubs, or Data Lakes
As more and more businesses move their applications and associated data to the cloud, managing all that information becomes more complicated.
IT no longer has complete control over, or insight into, every aspect of the datastore. Instead, as multiple cloud providers are brought in and endpoint data is served to and collected from far-flung users and workstations, you’re likely to run into compatibility and versioning issues between various databases and storage platforms. The data management problem grows even larger as multicloud, the Internet of Things, and Big Data initiatives rise in popularity and real-world applicability.
Three ways to get all your ever-growing databases and datastores on the same page are data federation, data hubs, and data lakes. How do they differ, and what are the pros and cons of each?
Federated Databases
Also known as a virtual database, a federated database uses a software abstraction layer to combine various database sources into a single view. The system accepts queries and essentially presents itself as a single database, while the software is actually querying the production databases or data warehouses in real time. Each underlying source database is a “federate.”
With a virtual database, the source data is not moved or copied into a central repository. Every query has to reach back to the sources, so operations can be more costly and time-consuming when using federation. However, it is relatively easy to configure a federated database on top of existing databases in different environments and locations.
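As a rough illustration of the idea, the sketch below fans a single query out to two separate sources at request time and merges the rows into one result. It assumes in-memory SQLite databases as stand-ins for the production sources; the source names, the customers table, and the federated_query helper are all hypothetical and not taken from any particular federation product.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory SQLite database standing in for one federate (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, region TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def federated_query(sources, sql, params=()):
    """Run the same query against every source at request time and merge the rows.

    Nothing is copied ahead of time: each call goes back to the live sources.
    """
    combined = []
    for name, conn in sources.items():
        for row in conn.execute(sql, params):
            # Tag each row with the source it came from.
            combined.append((name,) + tuple(row))
    return combined


# Two independent databases that the caller sees as a single view.
sources = {
    "crm_us": make_source([(1, "Acme", "US"), (2, "Globex", "US")]),
    "crm_eu": make_source([(3, "Initech", "EU")]),
}

rows = federated_query(
    sources, "SELECT id, name FROM customers WHERE id >= ? ORDER BY id", (2,)
)
print(rows)
```

Because every call touches every source, the cost of a query grows with the number of federates, which is the trade-off described above.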
A federated database can harmonize data, but only as it is processed upon return. (Data harmonization is taking all data from all sources, comparing similar records and combining them where possible, throwing out bad data, and presenting the most accurate items as a whole.)
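A minimal sketch of harmonization might look like the following. It assumes records are matched on a normalized email address and that the first non-empty value seen for a field is the one to keep; real products use far richer matching and survivorship rules, and the field names and the harmonize function here are hypothetical.

```python
def harmonize(records):
    """Merge records that share a normalized email and drop records without a usable one."""
    merged = {}
    for rec in records:
        email = (rec.get("email") or "").strip().lower()
        if "@" not in email:
            continue  # treat as bad data and discard it
        best = merged.setdefault(email, {"email": email})
        for field, value in rec.items():
            if field == "email":
                continue
            # Assumption: the first non-empty value for a field wins.
            if value not in (None, "") and field not in best:
                best[field] = value
    return list(merged.values())


raw = [
    {"email": "Ann@Example.com ", "name": "Ann Lee", "phone": ""},
    {"email": "ann@example.com", "name": "", "phone": "555-0100"},
    {"email": "not-an-email", "name": "Broken"},
    {"email": "bob@example.com", "name": "Bob Ray", "phone": None},
]

clean = harmonize(raw)
print(clean)
```

Here the two records for Ann collapse into one combined record, and the record with no valid email is thrown out.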
Virtual databases do not index data (indexing involves the creation of an index, which is saved separately to the storage media, allowing quicker record retrieval and analytics). Federated databases instead rely on the siloed source databases to index. Any retrieval or analytics request is performed on the source database that holds the data.
Federated databases can run into trouble if you are running a query that is not recognized by one of the source systems. Because your federation platform is tied to the source databases, it can take a lot of effort to integrate them, and making changes is also development-heavy. If any one of the source systems is degraded, the entire federation might be inaccessible.
One advantage of federation is real-time access to all of your data, rather than waiting for it to be moved into a central location and then harmonized and indexed. Scaling can be difficult, but for real-time, web-service-style workloads, federation can be a good option.
Data Lakes
Many articles have been popping up recently about how to manage data lakes, perhaps partially due to the popularity of Hadoop. Data lakes involve moving all data into a single location. The data can be structured or unstructured.
Strong management is necessary for data lakes to ensure they do not sprawl too much. If you continue to push more and more data into your data lake without regard to which data is useful for business insights or production applications, you risk wasting money on storage and systems, on top of reduced performance for anything tied to that data lake. Data lakes can index data, but their indexing is inherently limited because they deal with so many different data formats.
Data lakes are good options for batch processing and analytics, but because of their limited indexing you might have to sort through every single record to get an answer. Data lakes are ideal for storing and mining large volumes of unstructured data, but not for real-time processing.
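To make the "sort through every single record" point concrete, here is a small sketch in which a lake is modeled as a list of raw objects in mixed formats. The paths, contents, and scan_lake function are made up for illustration; the point is only that, without an index, answering a question means reading and parsing every object.

```python
import csv
import io
import json

# A toy "lake": raw objects in whatever format they arrived in (hypothetical data).
lake = [
    ("events/a.json", '{"user": "ann", "action": "login"}'),
    ("exports/b.csv", "user,action\nbob,logout\nann,purchase\n"),
    ("notes/c.txt", "ann called support about billing"),
]


def scan_lake(lake, user):
    """Read every object in the lake and return (hits, objects_read)."""
    hits, objects_read = [], 0
    for path, body in lake:
        objects_read += 1
        if path.endswith(".json"):
            rec = json.loads(body)
            if rec.get("user") == user:
                hits.append((path, rec))
        elif path.endswith(".csv"):
            for rec in csv.DictReader(io.StringIO(body)):
                if rec["user"] == user:
                    hits.append((path, rec))
        elif user in body.split():
            # Unstructured text: fall back to a plain word match.
            hits.append((path, body))
    return hits, objects_read


hits, objects_read = scan_lake(lake, "ann")
print(objects_read, [path for path, _ in hits])
```

That full pass is acceptable for a nightly batch job, but it is the reason a lake is a poor fit for real-time lookups.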
Data Hubs
Data hubs follow, as you might imagine, a hub-and-spoke model. In this case the data is moved or copied to a central hub datastore using data discovery, indexing, and analytics tools. The main difference between a hub and a lake is that the hub re-indexes the data so it can be easily queried.
Data hubs harmonize data as it is imported and then build their indexes from the harmonized records. Data hubs present a “best of both worlds” approach, as they homogenize data and can return it in various desired formats. There are a variety of data hub product providers on the market. Data hubs have the most overhead when it comes to configuration and ensuring interoperability between data sources, but they perhaps offer the most capabilities of the three.
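The import-then-index pattern can be sketched roughly as follows. This assumes a single in-process store and a simple normalized customer name as the index key; the DataHub class, its methods, and the record fields are hypothetical and far simpler than what a commercial hub product does.

```python
from collections import defaultdict


class DataHub:
    """Toy hub: normalizes records on import and keeps an index by customer key."""

    def __init__(self):
        self.records = []
        self.index = defaultdict(list)  # normalized key -> positions in self.records

    def ingest(self, source, record):
        # Harmonize the key and the amount type on the way in.
        key = record["customer"].strip().lower()
        self.records.append(
            {"source": source, "customer": key, "amount": float(record["amount"])}
        )
        self.index[key].append(len(self.records) - 1)

    def lookup(self, customer):
        """Answer from the index instead of scanning every record."""
        positions = self.index.get(customer.strip().lower(), [])
        return [self.records[i] for i in positions]


hub = DataHub()
hub.ingest("billing", {"customer": "Acme ", "amount": "120.50"})
hub.ingest("crm", {"customer": "ACME", "amount": 30})
hub.ingest("crm", {"customer": "Globex", "amount": 75})

acme = hub.lookup("acme")
print(acme)
```

The work is paid up front at import time, which mirrors the configuration overhead noted above, and in exchange later queries go straight to the matching records.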
Increasingly complex and distributed workloads are generating more, and more diverse, data every day. You must assess your use cases, budget, existing environment, and future infrastructure goals in order to determine which solution, or combination of architectures, is best for your data management.