To get this blog started, I thought I would try something that I hope you find as entertaining as I did. In the Spring of 1998, I did a short write-up that discussed data warehousing. I am reposting that write-up here for a couple of reasons. First, it’s interesting to see how some things have changed in 14 years, yet many are still quite the same. The second reason for posting it is to show a few of you that I have been messing around with warehousing on the Mainframe for much longer than it has been considered fashionable.
This write-up comes from a simpler time when many just “assumed” the Mainframe, using DB2 for z/OS, would be the place to put a data warehouse. Remember also that this write-up is over 14 years old…
Read on, and I hope you enjoy.
The issues facing today’s data warehouse administrator can be divided into “before” – those involved in creating or implementing a data warehouse, and “after” – those involving maintenance issues that arise in a data warehouse once it is actively used.
First, the “Before”
How large will your data warehouse be? Can you settle for a series of data marts? For an administrator, both options have good and bad points. A full data warehouse may become completely unmanageable in size. Data marts, on the other hand, may grow independently, creating the same mainframe data issues you are trying to solve by implementing a data warehouse. In either case, the amount of raw and summarized data may be tremendous. The rapid growth of the data may raise capacity planning to an art.
How many data sources will you combine to form a data warehouse? From where will they come? Are they all on the mainframe or scattered across multiple platforms? Are they in databases or flat files? Is, in fact, all the information you need for a successful warehouse even in an automated system? Might some of the information be on paper documents, or is some of it in an end user’s head?! Regardless of the source, you will need to combine all pertinent information into the warehouse.

Once you have selected the data sources that need combining, how do you clean the data? I hope you won’t assume that your data is already clean and that you can simply combine it and load it into a warehouse. Your company will quickly learn that its warehouse is only as good as the information it holds. If, for example, the analysts believe the data is unreliable, your company could end up with a warehouse that no one will use. Remember – the reason for cleaning data is not always because something is wrong with it. If data from multiple OLTP systems is being combined, it may contain information, such as indicators and flags, that is simply of no use in a warehouse. The warehouse may even need data that does not currently exist. In some cases, the data warehouse could actually end up being larger than the sum of its original inputs.
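To make the combine-and-clean step concrete, here is a minimal sketch in Python. Everything in it is invented for illustration – the field names (`cust_id`, `region`, `vip_flag`) and the two source extracts are hypothetical, and a real warehouse load would of course run against DB2 rather than in-memory lists. The point is just the shape of the work: normalize the values, drop OLTP-only flags that carry no analytical meaning, and merge records from multiple sources under one key.

```python
# Hypothetical sketch of combining and cleaning two source extracts.
# All field names and source data are invented for illustration.

def clean_record(rec):
    """Normalize one source record for the warehouse."""
    return {
        # Normalize the key so "a100 " and "A100" match across sources.
        "cust_id": rec["cust_id"].strip().upper(),
        "region": rec.get("region", "UNKNOWN").strip().title(),
        # OLTP-only indicators (e.g. vip_flag) are deliberately dropped:
        # they mean something to the operational system, not the warehouse.
    }

def merge_sources(*sources):
    """Combine records from several extracts; later sources win on key collisions."""
    merged = {}
    for source in sources:
        for rec in source:
            key = rec["cust_id"].strip().upper()
            merged[key] = clean_record(rec)
    return list(merged.values())

billing = [{"cust_id": " a100 ", "region": "northeast", "vip_flag": "Y"}]
crm = [{"cust_id": "A100", "region": "Northeast"}]
print(merge_sources(billing, crm))
```

Note the design choice: the merge policy (later sources win) is itself a business decision, exactly the kind that has to be made explicit before the load is repeatable.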
Now that you have determined which data to combine to form the warehouse and how to scrub the data, you face a tough decision. Where does it all happen? Do you perform all activities on the mainframe where you have the horsepower, or on the workstation where a wealth of tools to assist you now resides? Either way, there are still questions: which tools to use, how to unload the data for combining and cleaning, and finally, how to reload the data into the warehouse. This could be a huge ongoing task if your warehouse is at all large. The right tools will make a dramatic impact on how you maintain the correct information in the warehouse and how you keep that information fresh. And remember, all these tasks must be predictable and repeatable. Eventually, the warehouse will have to be reloaded, probably a number of times.
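The “predictable and repeatable” requirement above can be sketched as a truncate-and-load pipeline: because each run starts by clearing the target, rerunning it always produces the same warehouse state. This is only an illustrative toy – the step names and the in-memory `warehouse` list are stand-ins for real unload, transform, and load tooling.

```python
# Hypothetical sketch of a repeatable full-reload pipeline.
# extract/transform are stand-ins for real unload and cleansing steps.

def extract():
    """Pretend unload from an operational source."""
    return [{"id": 1, "amt": 10.0}, {"id": 2, "amt": 20.0}]

def transform(rows):
    """Pretend cleansing/conversion step."""
    return [{"id": r["id"], "amt": round(r["amt"], 2)} for r in rows]

warehouse = []  # stand-in for the warehouse table

def full_reload():
    warehouse.clear()  # truncate first: this is what makes the run repeatable
    warehouse.extend(transform(extract()))

full_reload()
full_reload()  # rerunning the load does not duplicate rows
print(len(warehouse))
```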
Some of the new hardware now available prompts yet another question. Must you move the data at all? At least one hardware vendor allows you to move the data between two different relational databases in the hardware, completely eliminating the need to unload and reload any data at all.
And now for the “After”
So you built it, and they really did come. Now, how do you keep them coming back? This is a major “after” issue. All this data consolidation and movement has been accomplished at tremendous corporate expense. It must now be treated like any other corporate investment, any other corporate resource. If an analyst believes the information in the warehouse is accurate, and the data is used to make important corporate decisions, the administrator has a heavy responsibility to ensure that the data remains accurate.
First among the issues she faces are the volumes (literally) of data. Moving all this information into a data warehouse is clearly not a one-time event. The warehouse’s data has meaning and accuracy only to a certain point in time. How do you plan to keep it accurate? Will you use data replication/propagation or perhaps a complete data refresh? Meanwhile, as more users find more ways to use the data, the “success” of the warehouse prompts a never-ending cycle of demands for more data. Now the administrator must determine what to do with the data once it arrives at the warehouse. Should it be archived or simply deleted when it becomes old? How should the data be summarized? Indeed, what level of summary is necessary?
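The summarization question can be made concrete with a small sketch. Choosing the grouping keys *is* choosing the level of summary: group by `(region, month)` and you can never again answer a per-day question from the summary table. The detail rows and column names below are invented for illustration.

```python
# Hypothetical sketch of summarizing detail rows before loading them.
# The detail data and column names are invented for illustration.
from collections import defaultdict

detail = [
    {"region": "East", "month": "1998-03", "amount": 120.0},
    {"region": "East", "month": "1998-03", "amount": 80.0},
    {"region": "West", "month": "1998-03", "amount": 50.0},
]

def summarize(rows, keys, measure):
    """Roll detail rows up to the level defined by `keys`."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row[measure]
    return dict(totals)

# The choice of keys fixes the summary level permanently:
print(summarize(detail, ("region", "month"), "amount"))
```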
Because the data held in a warehouse is not changed, a warehouse is administered differently from an OLTP system. Here, the administrator won’t have to worry about backups and recoveries, for example, or about time-outs and deadlocks. Referential integrity and check constraints may be less of a concern because the data is not being modified. Instead, the administrator can focus on ensuring that the data is clean and sufficiently summarized; his task will be to devise better and faster ways to access the data.
The next challenge: how do business rules apply to a data warehouse? Data is being combined in many cases from multiple sources. What if the operational OLTP system has used a set of rules that have no bearing on the information one is attempting to maintain in the data warehouse? What if the OLTP systems use different rules for similar data because the data will eventually be used in different ways? Many facts can be kept about a piece of data because different organizations have different ideas of what the “right” answer is. When you combine this “different” data, how do you decide which is correct? You may not be able to rely on the business rules because there may be more than one rule for the same piece of data!
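One common way out of the “more than one rule for the same piece of data” problem is to stop relying on the source systems’ own rules and instead declare an explicit source-precedence order for the warehouse. The sketch below illustrates that idea; the source names and status values are invented, and real reconciliation would be attribute-by-attribute rather than one global list.

```python
# Hypothetical sketch: resolving conflicting values for the same attribute
# by explicit source precedence. Source names and values are invented.

PRECEDENCE = ["billing", "crm", "legacy"]  # most trusted source first

def resolve(values):
    """values: dict mapping source name -> that source's value.
    Return the value from the most trusted source present."""
    for source in PRECEDENCE:
        if source in values:
            return values[source]
    return None

# Two systems disagree; the precedence list decides, not either system's rules.
print(resolve({"crm": "ACTIVE", "legacy": "CLOSED"}))
```

The benefit is that the reconciliation decision is now written down in one place, instead of being implicit in whichever OLTP system happened to load last.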
Then, of course, there is schema management. Here the administrator may need to manage and coordinate schema changes across heterogeneous relational database systems, possibly on different platforms. The tools the administrator uses will have a significant impact on the success of this task. Ensuring that changes made to the originating data sources are reflected in the data warehouse is as important as the data itself, so a tool is needed that makes this maintenance task easy and efficient. The tool should also be usable by someone who is not as skilled in the RDBMS requiring the schema changes; this allows schema maintenance tasks to be spread across administrators with skills in different relational databases.
And finally there is meta data. Was it there when you began the data warehouse design? Or will the meta data be created as a result of your data warehouse’s implementation? Whatever its origin, meta data can play an important role in the future success of your data warehouse. It should describe the relationship between the data in the warehouse and the operational data, contain any formulas or rules used to summarize the data in the warehouse, and include a description of how the data is structured. Once created, it must be kept up to date just like the schema information.
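The three things the meta data should capture – the link back to operational data, the summarization rule, and the structure – can be sketched as a simple catalog entry. Every name below (the warehouse column, the source columns, the dates) is invented purely to show the shape such an entry might take.

```python
# Hypothetical sketch of one meta data entry for a warehouse column.
# All column names, formulas, and dates are invented for illustration.
import datetime

meta_entry = {
    # Relationship between warehouse data and operational data:
    "warehouse_column": "MONTHLY_SALES.TOTAL_AMT",
    "source_columns": ["BILLING.INVOICE.AMT", "CRM.ORDER.AMT"],
    # Formula/rule used to summarize the data:
    "derivation": "SUM of invoice and order amounts per (region, month)",
    # Currency of the entry, which must be maintained like the schema itself:
    "last_refreshed": datetime.date(1998, 3, 31).isoformat(),
}

def describe(entry):
    """Render a meta data entry as a human-readable lineage statement."""
    return (f"{entry['warehouse_column']} = {entry['derivation']} "
            f"from {', '.join(entry['source_columns'])} "
            f"(refreshed {entry['last_refreshed']})")

print(describe(meta_entry))
```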
A data warehouse is a completely different beast from an operational OLTP system. Its problems and the tools needed to solve them are different. But administrators also need to be concerned with warehouse availability and performance during access. Data size, placement, indexing, and all those performance issues you have been fighting with DB2 will still be there to challenge you. In addition, new functionality is available in the database world that may also enhance how a data warehouse performs. Such features as parallel processing in the RDBMS and parallel processing in hardware are only the beginning. New analytical techniques made available through OLAP, both relational and multi-dimensional, are now on the scene. Then there’s object technology – it can improve the operational usefulness of a data warehouse.