Oracle® Warehouse Builder Data Modeling, ETL, and Data Quality Guide 11g Release 2 (11.2) Part Number E10935-05 |
|
|
PDF · Mobi · ePub |
This chapter describes the data cleansing features of Oracle Warehouse Builder and how to use them.
This chapter contains the following topics:
"Overview of Data Rules in ETL and Automatic Data Correction"
"Generating Correction Mappings from Data Profiling Results"
"Cleansing and Transforming Source Data Based on Data Profiling Results"
After you derive data rules from profiling results, you can automate the process of correcting source data based on profiling results. You can create the schema and mapping corrections. The schema correction creates scripts that you can use to create a corrected set of data objects with the same structure as the source objects, but with the derived data rules applied. The mapping correction creates new correction mappings to take your data from the source objects and load them into new objects.
For a given set of data objects (tables, views and so on) and a given set of data rules applied to those objects, Oracle Warehouse Builder can automatically generate the following data correction objects and logic:
Definitions for corrected schema objects, that is, tables that have the same columns as the source tables and the same data rules bound to them, but which have constraints, stricter data types and other structures that enforce the data rules being corrected. Details of how individual data rules are enforced on corrected schema objects are described in Table 21-1, "Data Rules Implementation for Schema Correction"
Cleansing ETL mappings for loading a clean version of the source data into the new corrected tables. Compliant rows can be passed through to the clean tables without change. Noncompliant data can be filtered out, reported on, or corrected to be made compliant. Many common data correction algorithms are built into Oracle Warehouse Builder, or you can implement your own cleansing logic. Details of the available correction strategies are described in Table 21-2, "Cleansing Strategies for Data Correction"
To actually create your corrected data, you must then deploy the corrected schema objects, mappings and relevant data rules to the target location and either run the mappings or schedule them to run as needed. You can then implement further ETL using the cleansed schema objects as a source instead of the original dirty data.
You can also apply data rules in ETL mappings manually. When adding a data object such as a table to a mapping, you can select one or more data rules to be applied to the object. Error tables are created for the object, and you can define separate flows within your mapping for compliant and noncompliant data.
When automatically generating corrections for source tables or other objects based on data rules, the objects generated include the following.
Definitions for corrected schema objects, that is, tables that have the same columns as the source tables, but which have constraints, types and other structures that correspond to the data rules being corrected.
Cleansing ETL mappings that move compliant source data into the target tables and either filter out noncompliant data or generate corrected, compliant data from the noncompliant data based on algorithms you specify.
To actually create your corrected data, you must then deploy the corrected schema objects, mappings and relevant data rules to the target database and either run the mappings or schedule them to run as needed.
The prerequisites for creating corrections are:
You must have a data profile where you have profiled the source data objects (tables, views and so on) to be corrected.
You must have data rules to be used to identify noncompliant data for correction.
The Data Profile Editor enables you to create mappings that performs schema correction and data cleansing based on your data profiling results.
To create corrections:
If the data profile is not open, open it by right-clicking the data profile in the Projects Navigator and selecting Open.
From the Profile menu, select Create Correction.
The Create Correction Wizard is displayed.
On the Welcome page, click Next.
On the Select Target Module page, specify the target module that contains the corrections and click Next.
You can either create a new module or use an existing module.
To store the corrections in an existing target module, choose Select an existing module. The Available list displays the existing modules in which corrections can be stored. Select the module from this list.
To store the corrections in a new target module, select Create a new target module. The Create Module Wizard guides you through the steps of creating a new target module.
To remove correction objects created because of previous corrections, select Remove previous correction objects.
On the Select Objects page, select the objects for which corrections should be generated by moving them to the Selected list. Click Next.
The Filter list enables you to filter the objects that are available for selection. The default selection is All Objects. You can display only particular types of data objects such as tables or views.
On the Select Data Rules and Data Types page, select the corrections that must be generated to perform schema correction. Click Next.
See "Selecting the Data Rules and Data Types for Corrected Schema Objects" for information about specifying data corrections.
(Optional) On the Data Rules Validation page, note the validation errors, if any, and correct them before proceeding.
If correction objects from a previous data correction action exist for the objects selected for correction, this page displays a message. Click Next to remove previously created correction objects.
On the Verify and Accept Corrected Tables page, select the objects to correct and click Next.
See "Selecting the Objects to Be Corrected" for more information about how to specify how objects should be corrected.
On the Choose Data Correction Actions page, specify the correction actions to be performed to cleanse source data and click Next.
See "Choosing Data Correction and Cleansing Actions" for more details about specifying the actions that perform data correction and cleansing.
On the Summary page, click Finish to create the correction objects.
The correction schema is created and added to the Projects Navigator. The correction objects and mappings are displayed under the module that you specify as the target module on the Select Target Module page of the Create Correction Wizard. The correction object uses the same name as the source object. The name of the correction mapping is the object name prefixed with M_. The correction mapping is used to cleanse source data and load it into the corrected target object.
Use the Data Rules and Data Types page to select the schema corrections that should be generated for the corrected data objects. Based on the data profiling results, Oracle Warehouse Builder populates this page with data type corrections and data rules that you can apply to the data object.
Schema correction consists of correcting data type definitions and defining data rules that should be applied to the corrected objects. The objects selected for correction are displayed on the left side of the page and are organized into a tree by modules. The panel on the right contains two tabs: "Data Rules" and "Data Types". Select an object by clicking the object name and then define how schema correction should be performed for this object using the Data Rules and Data Types tabs.
Data Rules The Data Rules tab displays the available data rules for the object selected in the object tree. Specify the data rules that should be generated for the corrected object by selecting the check box to the left of the data rule. Oracle Warehouse Builder uses these data rules to create constraints on the tables during the schema generation.
The Bindings section contains details about the table column to which the rule is bound. Click a rule name to display the bindings for that rule.
Oracle Warehouse Builder uses different methods of enforcing data rules on corrected schema objects. The method used depends on the type of data rule that you are implementing.
Table 21-1 describes the methods used for object schema correction. It also lists the data rule types for which each correction is used.
Table 21-1 Data Rules Implementation for Schema Correction
Schema Correction Method | Description | Data Rule Types for which Correction Method Can be Used |
---|---|---|
Create Constraints |
Creates a constraint reflecting the data rule on the correction table. If a constraint cannot be created, then a validation message is displayed on the Data Rules Validation page of the Apply Data Rule Wizard. |
Custom Domain List Domain Pattern List Domain Range Common Format No Nulls Unique Key |
Change the data type |
Changes the data type of the column to |
Is Number Is Date |
Create a lookup table |
Creates a lookup table and adds the appropriate foreign key or unique key constraints to the corrected table and the lookup table. |
Functional Dependency |
Name and Address Parse |
Adds additional name and address attributes to the correction table. The name and address attributes correspond to a selection of the output values of the Name and Address operator. In the map that is created to cleanse data, a Name and Address operator is used to perform name and address cleansing. |
Name and Address |
Data Types The Data Types tab displays the columns that are selected for correction. The change could be a modification of the data type, precision, or from fixed-length to variable-length. The Documented Data Type column on this tab displays the existing column definition and the New Data Type column displays the proposed correction to the column definition.
To correct a column definition, select the check box to the left of the column name.
Use the Verify and Accept Corrected Tables page to confirm the objects to correct and to provide additional details about how data correction should be performed. This page contains the objects you selected for schema correction on the Data Rules and Data Types page.
Use the following steps to specify how your data objects should be corrected.
In the Verify and Accept Corrected Tables that is Generated section, select Create to the left of a data object to create this data object in the corrected schema.
The Definition of the Corrected Table section displays the corrections details for the selected data object. The Columns tab displays the details of columns that is created in the corrected data object. The Constraints tab displays details of constraints that is created on the corrected data object. The Data Rules tab displays details of data rules that is created on the corrected data object.
On the Columns tab of the Definition of the Corrected Table section:
Select Create to the left of a column name to create this column in the corrected data object.
Deselect Create to the left of a column name to remove this column from the corrected object.
Edit the Data Type, Length, Precision, Seconds Precision, and Scale for a column by clicking the value and entering the new value. However, you cannot modify a column name.
On the Constraints tab of the Definition of the Corrected Table section:
Click Add Constraint to create additional constraints.
Select the constraint and click Delete to remove a constraint from the corrected data object.
On the Data Rules tab of the Definition of the Corrected Table section:
Select the check box to the left of a data rule to apply this derived data rule to the corrected data object.
Ensure that the Bindings column contains the column to which the data rule should be applied.
Click Apply Rule to apply a new data rule to the corrected object. The Apply Data Rule Wizard guides you through the process of applying a data rule.
When you decide to automatically generate corrected objects based on data profiling results, you must specify how inconsistent data from the source object should be cleansed before being stored in the corrected object. To do this, you specify a cleansing strategy for each data rule that is applied to the correction object.
The Choose Data Correction Actions page enables you to specify how to correct source data. This page contains two sections: Select a Corrected Table and Choose Data Correction Actions. The Select a Corrected Table section lists the objects that you selected for corrections. This section contains the following columns for each data object that you selected for correction:
Correct: Select this option to enable generation of correction objects for the data object listed in the Table column.
Table: Represents the name of the data object for which correction actions are being specified.
Load Option: Indicates which records should be loaded by the correction mapping. Select All Records to indicate that the generated correction mapping should load all records. Select Corrected Objects to indicate that the generated correction mapping should load only the records being corrected.
Audit Option: Select this option to create a data auditor for the table represented by the Table column.
Description: Represents a description for the correction mapping that is created.
Select a data object in the Select a Corrected Table section to display the affiliated data rules in the Choose Data Correction Actions section.
For each data rule, you must choose a correction action that specifies how data values that violate data rules set for the data object should be handled. Use the list in the Action column to specify the correction action to perform.
The correction actions that you can choose are:
Ignore: The data rule is ignored and, therefore, no values are rejected based on this data rule.
Report: The data rule is run only after the data has been loaded for reporting purposes. It is similar to the Ignore option, except that a report containing values that do not adhere to the data rules is created. This action can be used for some rule types only.
Cleanse: The values rejected by this data rule are moved to an error table where cleansing strategies are applied. When you select this option, you must specify a cleansing strategy as described in "Specifying the Cleansing Strategy".
For each data rule, use the Cleansing Strategy list to specify how data that violates a set data rule should be cleansed. This option is enabled only if you select Cleanse in the Action column. The cleansing strategy depends on the type of data rule and the rule configuration. Error tables are used to store the records that do not conform to the data rule.
Table 21-2 describes the cleansing strategies and lists the types of data rules for which each strategy is applicable.
Table 21-2 Cleansing Strategies for Data Correction
Cleansing Strategy | Description | Applicable to Data Rule Types |
---|---|---|
Remove |
Does not populate the target table with error records |
All |
Custom |
Creates a function in the target table that contains a header, but no implementation details. You must add the implementation details to this function. |
Domain List Domain Pattern List Domain Range Common Format No Nulls Name and Address Custom |
Use Existing Function |
Select from a list of existing functions to perform the correction |
Domain List Domain Pattern List Domain Range Common Format No Nulls Name and Address Custom |
Set to Min |
Sets the attribute value of the error record to the minimum value defined in the data rule |
Domain Range rules that have a minimum value defined |
Set to Max |
Sets the attribute value of the error record to the maximum value defined in the data rule |
Domain Range rules that have a maximum value defined |
Similarity |
Uses a similarity algorithm based on permitted domain values to find a value that is similar to the error record. If no similar value is found, then the original value is used. |
Domain List rules with character data types |
Soundex |
Uses a soundex algorithm based on permitted domain values to find a value that is similar to the error record. If no soundex value is found, then the original value is used. |
Domain List rules with character data types |
Merge |
Uses the match-merge algorithm to merge duplicate records into a single row |
Unique Key |
Set to Mode |
Uses the mode value to correct the error records if a mode value exists for the functional dependency partition that fails |
Functional Dependency |
See Also:
"Types of Data Rules".You can view the correction tables in the Table Editor to see the data rules and constraints created as part of the design of your table. You can also view the correction mappings as you can view any other ETL mapping.
To view the correction mappings:
Double-click the mapping to open the object in the Mapping Editor.
After the mapping is open, select View and then Auto Layout to view the entire mapping.
Figure 21-1 displays a correction map generated by the Create Correction Wizard.
Select the submapping ATTR_VALUE_1 and click Visit Child Graph on the toolbar to view the submapping.
Figure 21-2 displays the submapping that is displayed.
The submapping is the element in the mapping that performs the actual correction cleansing that you specified in the Create Correction Wizard. In the middle of this submap is the DOMAINSIMILARITY transformation that was generated as a function by the Create Correction Wizard.
After you generate correction objects, you must deploy and run the correction objects to perform schema correction and data cleansing. Your data is corrected after you run the correction mappings with the data rules. The relevant data rules also remain bound to the objects in the corrected schema for optional use in data auditors.
Correcting your schema and cleansing data requires the following steps:
When you perform schema correction based on data profiling results, Oracle Warehouse Builder generates the schema correction actions that you specified and generates corrected data objects. The name of the corrected data object is the name of the original source object prefixed with TMP_.
When deploying schema corrections, deploy all corrected data objects, along with any data rules that were defined for the corrected objects as part of the data correction process.
When you generate correction mappings to cleanse source data based on data profiling results, Oracle Warehouse Builder creates the correction mappings in the workspace. The name of the correction mapping for a particular data object is the name of the data object prefixed with M_. For example, the correction mapping generated to cleanse the DEPT
table is called M_DEPT
.
To deploy the correction mappings created as part of the data correction process:
Grant the SELECT
privilege on the source tables to PUBLIC
.
For example, your correction mapping contains the table EMPLOYEES
from the HR
schema. You can successfully deploy this correction mapping only if the SELECT
privilege is granted to PUBLIC
on the HR.EMPLOYEES
table.
Deploy the correction tables created because of data profiling.
You can right-click the table in the Projects Navigator and select Deploy. Or you can use the Control Center to deploy data objects.
Deploy the correction mappings generated to cleanse source data.
To cleanse source data and load it into the corrected tables, run the correction mapping as you would any other ETL mapping.
To run the mapping, right-click the mapping in the Projects Navigator and select Start.
You can also schedule this mapping to run like any other mapping or include it in process flows.