BI is built on a foundation of data.  Therefore, testing in BI requires test data, and that means we’ve got to get it from somewhere!

I’ve seen two general strategies here:

  1. Load the test system directly with production data, meaning that the contents of the test system change with each “refresh”.
  2. Generate test data sets (usually based on production data) that can be loaded on demand.

In the first scenario, tooling required to load the test systems is limited and simple – maybe even as simple as restoring a backup of production onto the test server.  However, this can break down:

  • Production data is what you’re testing!  So using it as the expected set can be counter-productive, at least initially.  Well tested production data can be useful later.
  • Production volume may not be appropriate for the test system.
  • Production data may be sensitive and not appropriate for the test system (although masking can be applied if available).
  • Production data doesn’t (shouldn’t!) contain data for negative test cases.
  • Production data isn’t available until the system has created it.  Classic chicken/egg problem.

Additionally, using production data for tests means that the test platform is fluid and forces test to account for this.  This means that test cases must define the expected result in terms of a query or some measure of the production data rather than “expect 42″.

Generating test data sets allows for:

  • Static data sets than can be specified in test fixtures – “Load test set 4A.”
  • Fabricated or masked data to protect sensitive production data.
  • Negative tests with erroneous data intentionally included.

The downside – creation of test sets can be labor intensive and need to be included in the overall work estimate of a project.  So, here’s my first recommendation:

Consider the creation of test data a critical deliverable of the development process.

Test Data Sets

If data masking is needed, collect masking rules during modeling.  As the model changes, update test data sets accordingly.  And version test data to match the model version.  I’ve used a naming convention as follows: 4A-1.1.4-DimCustomer.csv.  This is test set 4 (a test set contains a complete load of the system).  “A” versions are the base load.  A “B”, “C”, etc. would be an incremental load to introduce additional data and/or negative conditions.  1.1.4 is the model version, and DimCustomer is the table.  If you have multiple schemas, include the schema name as well.

Tooling for test data creation may include:

  • ETL job(s) to create test data sets by applying masking rules to production data.  These should be as generic as possible and read masking rules from the model via metadata.
  • Scripts to aid in the creation of a new test data version.  These may highlight data sets that need to be updated due to model changes, and may copy and rename unchanged files.
  • A load script to read test data sets and load the system using the bulk load utility.  Again, this should be generalized and take advantage of model metadata available and the consistent naming of test data files.
  • A “rollback” script to remove incremental (“B”, “C”, etc.) data files.  This may be a generalized ETL job that reads data from the incremental files and DELETEs those values from the database. IDENTITY and database sequences also need to be considered.

These depend on the capability to write scripts and/or jobs and the availability of system metadata either from a metadata system or directly from the DBMS catalogs.

In the commercial space, some of the data integration/ETL tools directly or indirectly address test data generation, often as part of data masking solutions geared toward protecting “leakage” of sensitive data from test and development environments.  There are also a variety of generalized test data generation tools – just search “test data generation”.  While many of these can be very useful, I prefer the following strategy:

  1. Capture test data profile metadata as part of the modeling process.  For instance, for each table/column is the data a) static such as a list of states, b) not-sensitive but variable, or c) sensitive.  Data validation patterns are also useful as generation patterns.
  2. Use the model metadata to drive sourcing of test data – static from source control, non-sensitive from a production source, and sensitive via data masking.
  3. Script as much as possible, and then manually create the rest.  Hopefully, only a few table/scenarios will require manual intervention.
  4. Maintain the test sets over time.

Putting this all together resulting in the capability to rapidly manipulate the contents of your test data repositories.  It enables things like regression testing and test automation.  And it will (over time) reduce the cost of test data manipulation as compared to manual data manipulation which allows you to spend that time (money) actually testing.