Nowadays, many organizations are use Business Intelligence (BI) to make strategic decisions with expectations of achieving competitive benefits for their business. BI software is usually created with the help of data warehouses and data marts which collect data from other applications, flat files, CRMs, websites, big data stores and data sources. BI is a procedure of collecting raw data which required for business and converting it into useful and meaningful information which can be further used for data analysis, mining, reporting, predicting and interpretive analysis.
A data warehouse is a collection of data from various heterogeneous sources which enables organizations to have consolidated data for separate analysis from transactions. It is a database mainly designed for query and analysis rather than transactions.
Role of ETL in the organizations
ETL (Extract Transformation and Load) is process which collects data from different data sources, converts it into preferred format of data based on business rules and loads it into a cohesive database known as data warehouse. ETL transforms data in a unified set of data from various dissimilar formats of data. As multiple data sources are involved, integrity and implementation of business intelligence is a complex task. Success of a project depends on the various aspects of the quality of data which makes testing a critical and extremely important task throughout the ETL process. This guarantees data transformation, data integrity, data completeness and data consistency of the migrated data.
ETL testing is different than conventional testing in terms of difficulty and scope. You need a different approach to ETL. The testing procedures primarily target identifying and mitigating data defects in the transformed data and overall errors occurred prior to processing of data for analysis and reporting.
ETL testing process
ETL testing process follows the steps below:
- Identifying data sources
- Data acquisition
- Build and populate data
- Summary reports and result analysis
ETL Testing approaches
- Verifying Data integrity of the transformed data
As data is moved into the production systems from multiple data sources, flat files, spreadsheets among others, it has to be in the correct order and arranged according to the customer requirement.
Verifying completeness of the data into production system
This step includes confirming whether data transformed from the source is expected data in the target/production database. Verification of all the data which is loaded into target database to validate compare counts, totals the actual data with that of processed or non-processed target data.
- Verifying Metadata
This mainly involves verifying the data types, data length, indexes and constraints into the transformed data.
- Verifying data transformation based on business rules and BI requirements
Verification of transformed data for business rules validation with the help of writing multiple SQL queries and comparing the output with the target data.
- Verifying quality of data into target database
This test includes syntax and reference checks to avoid any errors due to invalid formats of dates and unique ids from a business perspective. The syntax test includes invalid characters, uppercase or lower case letters. In addition, the data quality check consists of number check, date check, precision and null values verification.
- Verification of inserts and updates during ETL process
Includes verification of inserts and upgrades getting processed correctly during the ETL process.
Challenges in ETL testing
ETL testing is dissimilar to conventional testing. There are multiple challenges testers face during ETL testing. Some of them are listed below:
- Many times, a huge delay is caused in development and testing due to an unstable environment.
- Access to insufficient data for testing as many clients are not comfortable sharing access to production data.
- Complete testing practically may not be possible as data is huge and complex.
- Data warehouse deals with large data volumes from different multiple data sources which can cause a lot of issues. If these issues are not fixed early, it may impact project schedules.
- In conventional testing, testers can assure the quality based on data of input/output using user interface and compared with expected behavior depending upon the input data and actions taken.
- Incorrect, duplicate and null records during the data transformation in ETL process.
- Testers have limited or no access to ETL jobs. Hence, there is a dependency on developers to start the job which leads to delay in deliverables.
- Misinterpretation of business procedures leads in delay in testing.
The above challenges can be overcome by taking precautionary actions with the help of the following measures:
- Proper test planning, defining responsibility of making data available to the testing team.
- Defining back-up procedures.
- Using sampling techniques for optimized testing coverage.
- Using automation tools for comparison of huge data.
- Regression testing should be properly defined.
- Use of modular approach to minimize challenges and proceeding with quality assurance
For an organization, data warehouse is an important asset which contains invaluable data used for real-time analysis, data mining, analysis of historical data, and corporate decision making. Thus, ETL testing plays a substantial role in validating completeness, integrity, reliability, accuracy and security of data. Defects found in later stages of application development are costlier to resolve than earlier stages of software applications. Therefore, ETL testing is crucial to ensure that the developed data warehouse is a quality product.
References
http://queforum.com/etl-testing/93052-testing-challenges-etl-testing-process.html
http://qainsights.com/challenges-in-data-warehouse-testing/
http://www.developer.com/db/big-data-practical-tips-for-etl-testing.html
http://www.pqatesting.com/our_ideas/blog/etl_and_data_warehouse_testing