If you've been watching the data science scene, you probably spotted this term mentioned quite a lot: the Extract, Transform, and Load (ETL) process.
ETL is a process that takes advantage of databases and especially in data warehousing. In particular, it’s about extracting data from multiple sources, processing it according to individual requirements, and storing it to databases. Organizations can use such data in many different ways: as databases for different systems, website content, or analytics. The most widespread use case among enterprises is Business Intelligence (BI) solutions.
Understanding what it means is essential to know how you can make use of it for your project.
In this article, we explain what ETL is and why it's so important.
What is the Extract, Transform, and Load process?
To put it simply, ETL describes the method for collecting data from various sources and delivering it to further use in a standardized form - all thanks to storing in databases that form data warehouses. It’s a general description that tells us how that process should be accomplished.
Extracting data from different files, transforming it, and then saving to different files is a kind of ETL process too - even though we eventually don’t create a data warehouse but a loose set of files containing information we can use further.
There exist many different patterns for building data warehouses apart from ETL. Depending on individual requirements, we can bet on ETL (creating an ETL-based warehouse), but also alternatives such as the Enterprise Service Bus (ESB), Enterprise Application Integration, the new version of ETL called ELT, as well as Data Virtualization, Data Federation, and PaaS.
ETL systems usually integrate data from multiple systems that are most of the time developed and supported by different vendors. What's more is that these separate systems which contain the original data are often managed and operated by different employees. Take the cost accounting system as an example – it usually combines data coming from sources such as payroll, purchasing, and sales.
But note that while implementing such a system looks promising at the beginning, as more data pours in we need more space and the processing stage takes a lot of time. Each change becomes expensive and it takes a long time to get results. Developing and maintaining such a system might become challenging because a single mistake can bring catastrophic consequences.
Implementing a system based on ETL in small- and mid-sized companies usually doesn’t make sense - it’s just not very cost-effective. To reduce expenses in this area, companies can try implementing the ELT paradigm - but accomplishing that is quite challenging even for the most experienced developers.
The value of ETL
Getting data from various sources, transforming it, applying business rules, loading to the right destinations, and validating the results - each of these steps is a cog in the complex machinery that keeps the correct data flowing within an organization.
ETL processes are often complex, and their design is critical to an organization's operational efficiency. After all, they're at the center of every organization’s data management strategy.
Note that the three phases of ETL are often executed in parallel. That's because data extraction takes a lot of time and we can't afford to lose any of it. While data is extracted, we can set up a transformation process to execute the data already received and prepare it for loading. The loading process can then start – without waiting for the previous phases to be completed.
Why is ETL important?
Information was critical to business success way before computers were invented. For instance, artisans who passed their knowledge from generation to generation made their products more refined - and merchants who had the information about where to buy and sell them made fortunes.
As soon as we learned how to write, we started storing information - and technological development caused the amount of data to grow exponentially, forcing us to come up with solutions for processing and analyzing it.
It’s not enough to store data today. Businesses need to know how to extract and process it - and that’s where ETL helps.
The most common enterprise use case of ETL are systems that store and process data which are critical in the decision-making process - one of the core elements of Business Intelligence. ETL systems allow organizations to collect a massive amount of data and then process and store in a form that allows easy analysis or presentation.
Example use case 1
Let’s imagine that we’re a global organization producing clothing. To come up with our business objectives for the following year, we collect data such as price, color, and size about relevant products which are sold by 10 of the most popular online stores. By passing this data through our system, we can see how pricing changed over the years, which color was most popular during a given season, and which sizes were most widespread on the market. We can compare this information with our products and see how that affected our revenue. That type of data is essential for planning - and delivering it is the main job of ETL systems.
Example use case 2
Let’s imagine that we’re an investor interested in creating an online platform that acts as an intermediary for purchasing flights. We have met with a company that creates such services and now we know that the approximate cost would be $120 thousand, costing us $30 thousand per month for maintenance. We have also met with several large airlines and know that we could gain $20 per each sold ticket. We acquired historical data about sold tickets and - since we own an intermediary for hotel booking - we know how many people use such platforms. With that knowledge, it seems that we’re prepared to make a decision. But historical data airlines shared with us are saved in completely different formats - while some sent us simple Excel files, others shared an API that allows downloading the information. Our hotel booking system saves statistical data in a way that makes understanding it really difficult. Moreover, we need a list of all airlines together with their phone numbers. To pull all this information together in the right format, we can take advantage of an ETL system.
A well-designed ETL process enables organizations to extract data from source systems, improves data quality and consistency standards, and delivers data in a format easy to understand by developers who build applications and other stakeholders who need data to support their decision-making process.
Have you got any questions about the ETL process and how to design it to help your organization make better use of its data?
Reach out to us in comments; we're always happy to offer advice on proven data engineering practices that comply with industry standards and help organizations take full advantage of their data.