Here’s the second post in our series about data warehouses. If you missed the previous one, check it out here.
In our series, we’ll show you:
- What data warehouses are + their short history
- Who needs a data warehouse
- Key components of data warehouses
- Tips about implementing and designing data warehouses
- How to make sure that your data warehouse is secure
Ready to learn more about data warehouses and how to implement them? Let’s get started.
Designing data warehouses
Structured vs. semi-structured vs. unstructured
Another important issue we should consider when designing our data warehouse is whether the data we plan to store will be highly structured, partially structured, or unstructured. In general, each of these data types requires a different kind of data warehouse:
Data in a structured data warehouse will always assume the same form, and the data structure will rarely change. Relational database systems are usually used as such warehouses; ones that facilitate the storage of large amounts of data in a way allows finding data we need quickly. The advantage of structured databases is their efficiency in providing us with already pre-transformed data. Downsides? The extension and change of data structure are quite costly and dangerous.
Data warehouses of this type usually contain data that are grouped, categorized and segregated, but the structure of the target data may assume different forms. In the past, two types of databases – the so-called relational and non-relational databases – were used for this type of solution. In a relational database, there used to be mainly an index of possessed data along with information where it’s located. In a non-relational database, on the other hand, we would find target data in an unstructured form. Currently, database systems such as MySQL and PostgreSQL provide functionalities that allow storing unstructured data in one table cell in a way to make accessing this data similar to the rest of the structured data (adding a new data type called JSON was the solution).
This type of data warehousing contains entirely unstructured data, and one of its applications are ELT systems. To explain this type of warehouse, it’s worth first to describe the difference between the ETL and ELT process briefly. The ETL process involves downloading data, transforming it, and saving it to the database (read more about it in this article), while the ELT process consists in downloading data, saving it to the database, and transforming it in real time when it’s downloaded. In the case of the ELT process, our data warehouse will get pre-unified but ultimately unprocessed data, so the format and structure of the same data may be different, depending on the source from which they come.
This type of division is quite significant – perhaps not so much from the business point of view, but rather from the technical perspective. The choice between a structured, semi-structured and unstructured data warehouse has a massive impact on the kind of technological solutions we choose for implementing our data warehouse.
Here are some examples of real-world applications:
A database that contains data about employees, finances, and infrastructure resources is usually structured. Such data warehouses supply HR, as well as logistics and inventory systems. You’ll also find them in systems used in banking, healthcare, or telecommunications; that is, in systems that require data reliability and consistency.
All databases that contain incomplete data, analytical and statistical data, temporary data. You’ll find this type of data warehouse in systems based on Machine Learning, used in the Business Intelligence processed or, in general, related to the field of Data Science (Data mining, Visualization, Data processing). These are usually databases that result from a migration and data processing process. They’re often just temporary. It can also be an ODS data warehouse.
There are usually crawling databases that contain data from many different sources in their original or pre-unified/processed form. Unstructured databases are also created during the ELT process at the Load (L) stage.
Physical vs. virtual data warehouses
Looking at the problem of cataloging our data, we can also review solutions such as iPaaS, Data Virtualization or Data Federation. In that case we need to distinguish between physical data and virtual warehouses.
That’s a data warehouse with physically stored data. On the one hand, this solution ensures the durability and consistency of data. On the other hand, it also requires more effort to maintain and develop.
Data in this type of warehouse is metadata, including information about the structure and location of physically stored data. These types of data warehouses are created when instead of building an ETL- or ELT-based system, we pick Data Virtualization or Data Federation solutions.
Build a data warehouse in accordance with its goal
When deciding to build a data warehouse, remember that a data warehouse is not an end in itself – instead, it’s a means to achieving a goal. That’s why it’s worth considering what data we will use in our warehouse – and how. Looking at the problem from this perspective, we can divide data warehouses further into:
All data in our warehouse will be grouped in relation to subjects of the enterprise – existing entities in our organization. If we save the costs of employing our staff there, the database will be designed to be easily saved and read as “HR.” The issues related to human resources form a separate topic in every company – most of the time, organizations have different departments that deal with HR and the same will be reflected in the database.
Data in integrated data warehouses are unified, and the database itself is quite standardized. To illustrate that better, let’s focus on one example: employee salary. In the case of the Subject-Oriented warehouse, that information can be seen in the case of “HR” as “employee -> remuneration.” In the case of “Finance,” the same data can be seen as “fixed expenses -> employee.” In the case of an integrated warehouse, however, these data will be visible always as “Finance -> Expenses -> Employees -> Employee.”
In this database, data are sorted according to the chronology of events. These are usually historical data saved in such a way that we can easily download and sort them after a while. These can be, for example, transaction data for micro-payments.
These are read-only data warehouses. That means we can only download data – and not create, update, or delete it.
The data in the warehouse are saved and available as summary information, with the possibility of obtaining details. Here’s an example: “Set of all employees -> costs of the software development department -> costs of one software developer.” In that case, it’s best to first ask about the total cost of employees and then go more in-depth.
To sum up, data warehouses can be divided into many different kinds depending on the type and format of data we plan to store, as well as their application and purpose.
There are still other ways to classify data warehouses, but the ones mentioned in this series are enough to approach the topic of data warehouse design and implementation successfully.
Stay tuned for the third and last part of our series where I talk about different methods for keeping data warehouses secure.