Maintenance of Data Warehouse
Relatively little attention is paid in the literature on the maintenance of data warehouses. This business segment is not only irrelevant in the life cycle of any IT system but the Warehouse perhaps more pronounced than in other applications. Article will present some of the problems in the ‘life’ of data warehouses. Dealing with problems depends primarily on the environment in which the warehouse is in general about and the possibility to actually make anything.
When submitting the product for maintenance it is necessary to have in mind the complexity of data warehouses. Although most procedures and codes are essentially the same thing, it is important and necessary to understand the issues and learn about the production systems from which data are retrieved for the data warehouse.
Person responsible for maintenance will have to learn a lot about all the systems where data are retrieved. One thing should be kept on mind, these production systems are living systems and are subject to change for various reasons, and that they also have their own lifecycle of changes, smaller or bigger. Any changes to the source system can cause error getting data from the source. It should be considered that the standard data warehouse is full of different applications and that the person who maintains the data warehouse should be at least generally familiar with business and process logic of these production systems and is able to evaluate how changes in the source applications, cause changes in the way of the reach and quality of data for data warehousing. A person who deals with maintenance of the warehouse will be very exposed to changes, much more than someone who deals with the maintenance of typical transactional applications. Monitoring possible changes in source systems and responding to them is more complicated than responding to business issues that come into ordinary transactional applications for several reasons:
- Lack of familiarity with the work of transactional sources
- Not being informed about changes
- The complexity of the code change
- Too little time or opportunity to test
For these reasons it is recommended that the DWH developers and administrators working on data warehouse maintenance to be present at all meetings concerning changes in the sources for the data warehouse.
Next important issue of data warehouse maintenance is proper periodical data delete exercise. At a certain moment, DWH will certainly need to delete part of the information. Prudent is not deleted, but consider the possibilities of alternative methods of storing such data. One possibility is in keeping all data in storage system, but the greater the volume of data is - worse is performance, response to inquiries. Although, there are ways to circumvent the problem. Another option is to pull old data from tables and storage in case of possible subsequent calculations. Comparisons of actual archival data is often difficult to make a quality manner, because the business concept has changed, completed the restructuring of income, expenses, etc. If there are no quality transformation tables, comparison will be possible only on the basis of estimations. The process of restoring the old data is often time consuming and restoration of the old data may require several days with a number of employees engaged to create the desired report. Certain databases allow partition table on n logical unit. Some of these units can be virtually excluded in order to shorten response time to inquiries, if necessary parts included. It is therefore not necessary to load table with data and to use parts of the table, which is of course much more elegant solution.
The third way is to eject parts from a warehouse of information in aggregated form. It is now the most common way to delete data. The decision to retain the aggregated information is based upon response of the most business users of data and in collaboration with the people who maintain the data warehouse.
Creating Reports in Data Warehouse Maintenance Mode
The fact is that some users, because they are more agile than others, may be able to make themselves some reports. Even though, these users can become so skilful in knowing reports and business logic with the reporting tool that they can eventually be better than the staff responsible for maintenance of Data warehouse system. On the other hand it is expected that after the introduction of DWH and by going in operation IT will no longer use the reports and they will no longer have to write reports. But, there will be always reports that users will not know to create or to do them without errors.
It is also often the case, that the appetite after the introduction of DWH systems significantly increases and that the initial process of creating reports and publishing of the results to customers or other business segment. Because users including management can get the wrong impression that it is sometimes very easy to create a report by the maintainers just as it is simple to activate it. It is of course wrong image, right picture would be for proper interpretation of DWH efforts = invested great DWH effort by staff enable ease of use of data warehouses for customers. Greater IT effort means better simplicity of reports for users. In reality efforts are just reciprocal.
It is expected over a time that DWH becomes a bag for any information or data that is found. It is required to be brought into the warehouse ‘and such as you find’ data sources for a sense of completeness of the staff that maintains DWH systems. DWH system thus becomes a bag in which to throw all sorts of information, many of which will be never used, and which will be further burden for the system. Unnecessary data can be split into those where the warehouse is not the place for them and to those who are ineligible for entry into the model. As an illustration, certain users requested information from the newspapers should be found in the data warehouse…
Endless Possibilities and Requirements for ‘Tuning’ DWH System
There are users who are sensitive to the work of DWH system just as well as production users who can not tolerate such as more than half a minute unavailability of the production system. The reason for poor response of DWH may be certain ad hoc queries. DWH system is more vulnerable to performance slowdown then a production system that is often resistant to such things, secured and such queries rarely appear on production due to nature of work. One “poor” ad hoc query can block DWH system to the extent that for few hours is much slower response. Normal DWH users who regularly use this system will be unhappy with performance decrease.
It’s often impossible to predict from designers point of view all combinations what users can do to DWH performance, and it can be seen only through practice. Active education of users is therefore very important to avoid customer dissatisfaction and in favor to IT staff to which this dissatisfaction are reflected.
Aggregation of Data
Many DWH systems imply the need for the construction of aggregated structures that contain less data and thus have a better response to ad hoc inquiries. These structures can be technically implemented in various ways such as tables in the database, aggregated in OLAP systems, or something else. Common to all these is that daat warehouse maintenance experts must spend time for building them in useful way on regular basis and without many errors (‘up to date’). The growth of aggregated reports can complicate maintenance of DWH. To avoid too complicated maintenance it is required to somewhere draw a line and to say that certain statements are no longer worthwhile to build, and it is sometimes necessary to wait a longer time to respond in a query to avoid customer dissatisfaction.
Reports from the DWH system Vs. Reports from the Production System
Erroneous is the idea that the OLAP tools can completely replace the reports of the production. According to the experience, aggregated reports should be made with the data from the DWH system (statements produced from global experience) and reports that provide a single image should be provided from the production systems.
Once noticed the error should be corrected if the ratio of information importance and resources availability is positive. Filters and data-cleansing process will observe errors from the production system. Errors should be corrected sooner or later in a production system, and communication about error correction is important whether they can be corrected. Rectification can be done in the software procedures which data are retrieved after they are observed. Typical errors can be made such as wrong date (eg, 4.3.0002). DWH systems are “read – only ‘systems, in comparison to production systems. DWH can with little effort create a specific table of recorded errors, table as a writable location and description of the observed error, and this is important information for cleaning data in production system. With such table of error information production system can accept and act on the basis of them.
DWH tools often overlap in their functionalities. Sometimes, company is investing considerable money in OLAP and multidimensional database. This investment might look irational if the relational model was satisfactory with common reporting tools for most tasks. But, company should be aware that with the relational database, old star schemes and together and with various reporting tools, often come to the insoluble problems, even though they created barely 5% of all reports. For this reason relational databases are replaced and investment is made into multidimensional databases and OLAP. The importance of unsolvable problems is a measure and trigger for the introduction of DWH system.
The fact is, however, that investments in complex OLAP and other reporting tools needs to be somehow justified and that is not always easy since this few percent of reports sometimes can be made with extra expert work and without DWH tools.
Testing Structural Changes In the Source Systems
Very soon one of the sources in the life of DWH system will change. Testing the transfer of the modified information from production system might not be easy at all. The implication of problems are described in a few steps:
- Temporary suspension of data transfer
- Developer intervention on procedures for data retrieval
- Testing the data transfer - accuracy and quality
- Inclusion of the modified data transfer in the corresponding extraction
The third point can be sometimes very problematic, particularly if implemented data warehouses do not have testing environments. Testing DWH can be sometimes build as smaller replica of productive environment (partly with pieces of data, etc.), but usually testing requires an identical copy of the productive DWH and for such requirement there are usually no resources.
Consequences of Problems from Source Systems
If the report is made based on the data from several source applications, it is very likely that most of the report will still be correct despite the possibility that some reporting parts contain errors! If there is a dimension introduced that contains a list of sources and associated attributes, example of table can be as presented:
In this case, it can be said that the report was good until 15.4.
There can be a display format which immediately targets error as presented in the following table:
where the error are expressed in bold font.
Problems in Changing the Attributes and Their Behavior
For example, if the code of products are marked for years with the prefix code 100 – xxxxx and suddenly from a particular reason there is a need to start products with the 120 - xxxxxx. Question is, shall all changes be appropriately implemented and performed in a manner that final reports will not show mix of products. Despite the possibility that this is resolved with slowly changing dimension, it is possible that some users of the tool just took the code as a criterion for the report. They will have errors in reports. All variations of errors can not be predicted when attributes are changed.
The fact is that employees within the company change jobs, organization units and responsibilities. New employees are hired and others are living. There is a fluctuation of DWH users. If for example there are hundreds of DWH users there is a great opportunity DWH system administrators will be addressed almost every day with access requirements: who must see what, who should be restricted form access or why XY user sees what should not be seen. The solution may be in a maximum granting rights to specific users – for example, managers who have the right to decide who you can watch that attributed to themselves the same rights to other users – with their subordinates.
“Dying” of certain DWH parts
Some parts of the DWH system – data marts in time are not used, and simply there is no benefit of them despite the fact that these parts were presented as great value during the development phase. A number of reasons that were previously described cause the death of some datamarts. Keeping the data updated, unfortunately, due to time and memory limits can become impossible. Then there can be raised certain negative questions – why was it done at all? Because of this, it is a good approach to record and document all project requirements and to keep it for a long time.
At the end it is easy to notice that it is easier and cheaper to build a data warehouse, than to maintain it for a longer period of time. The cost of maintenance should be included in the price calculation of costs and profitability of DWH system, otherwise the DWH system look like a bag without a bottom. Typical life looks something like the image:
From the figure it is evident that after a release and maintenance can consume resources that were not intended for maintenance purpose.