What is ETL and Why Do I Need It? Is that the same as a data warehouse? The world of data is almost unrecognizable even from as little as five years ago. In reality, most organizations today understand the value of storing and managing their data to optimize their performance and to remain competitive in their market space. We all recognize that better information leads to better decisions, and an effective data solution, along with a new data “culture” makes this possible. Most businesses have no shortage of data, but organizing that data for easy access and new insights is a challenge that requires more than just data storage in a warehouse.
In an ideal state, you can find value in all of the information you've collected about your business. Anything you have, that has ever been recorded in a computer record, forms part of your data solution. This can potentially differ from the scope of data required for “big data” – that is, volumes of data typically requiring highly-scalable cloud storage. When combined, all your advertising campaigns in the last 20 years (and the results from them), the individual sales records of all the sales people you've had in the last 20 years, and everything else, might suddenly reveal a new and valuable insight.
Choose any profession at all...say Farm Equipment & Supplies. Those facts you collected reveal that no salesman under the age of 30 has ever sold a tractor within his first two years. They can sell arc welders, harrows, seed planters, or irrigation systems, but never a tractor. The "why" of that might lead you to change your hiring practices and new employee training, or it might change the makeup of your sales teams. A simple change might result in you doubling your annual tractor sales. It works the same for any business model. Companies have made the same mistakes for years without even knowing that they were doing it. Having a data solution means you are no longer ignorant; mistakes like that stand out in the light of new insights.
You own a collection of invaluable information, but heretofore it's been virtually inaccessible. It has been like trying to watch a construction project through the little viewing holes provided in the surrounding fence. You can only see a little bit at a time, and not necessarily in relation to any other bit. ETL processes are relevant to the data discussion because those processes can often consume 70% of the time and effort to build a data solution (Ralph Kimball’s The Data Warehouse Lifecycle Toolkit).
ETL processing can be viewed as a control system that manages every scrap of data that you own, and orders those scraps into a useful model. A control system like this often includes extracting, storing, and securing data from multiple sources, combining and cleansing the data, and preparing the data for effective analysis. The ETL acronym stands for Extract, Transform, and Load. Often, the solution may be better served using ELT, but the concepts are similar. Through ETL, information from many varying sources is converted into a homogeneous whole. Now, finally, your data becomes useful as a tool to forecast performance, isolate trends, and predict future outcomes. Let's look at each step of the ETL process individually.
The most important step, extraction, involves accessing all your systems of data whether they're operational repositories, Excel spreadsheets, external file drops, or just completely unstructured pieces of data. If some poor soul in your company is doing screen scrapes to mine the details of a legacy app, that data can be included in your extracts and stored for use. Public marketing “big data” may be part of your collection and analysis. One important piece of the extract step is to land the data into flexible, scalable storage with some level of repeatable automation and consistent governance. This is your all-important archive of source data.
The transformation step then integrates your extracted source data and does the work of integrating and preparing that data for analysis use. The transformation processes typically are hosted within a staging environment where the data can be managed and worked through the various stages of preparation (i.e., in the “kitchen” of a restaurant). Combining, cleaning, and reformatting of the data are steps along the transformation path. Pivoting and de-normalizing and de-duping; turning strings into date fields; changing codes (1,2) into consistent understandable values (“East”, “West”). Matching records from varying unkeyed sources to customer “Mary Smith” accurately, and then cleaning her contact information from lookup tables and reference files. Calculating profit the correct way every time – unique to your business. It’s critical that your data is transformed into one or more table models – possibly a star-schema dimensional model – that are optimized for analysis and insight. Once your data is fully prepared, you’re ready for the load step.
In the load step, the transformed information is loaded into your final destination – from the restaurant “kitchen” and into the “dining” area. The data has been formed into a consistent, accessible model from which you can select any particular piece of data and compare it in relation to any other pieces of data. Your final destination can be scalable cloud-hosted storage or landed in an on-premises database. This final resting place can be secured for end-user access only and tuned for top performance. It can be refreshed as often as you need – automated push as new source data presents itself, or on a periodic schedule – nightly, etc..
If you're transitioning from a diverse collection of programs that have acted as your ERP (Enterprise Resource Planning) solution, to an all-in-one solution, you're going to be faced with some challenges. Moving your information from old systems to a new system is going to be a living nightmare…ETL to the rescue! Once again it can collect all your information from all your disparate systems and make it completely homogeneous. That means it can be fed straight into your brand new system with its integrity intact.
Sorting through all your information, deciding what to keep, what to compile, what to consolidate, and what to throw away might be beyond the ability and understanding of your own IT staff. Often this process is outside of the scope of the DBA or developer skillset, and you may need to partner with a software development company with a specialization in business intelligence. You have a much better chance of ending up with a usable, scalable analytics solution.
The future belongs to those who are in control of their data. Simply put: “Better Information + Better Decisions = Better Performance”. The right business intelligence toolset allows you to create never before imagined business insights by giving you full visibility into the treasures hidden in your business data.
Discover the key elements of the ultimate data strategy.