Oracle® Data Mining Concepts 11g Release 2 (11.2) Part Number E16808-06 |
|
|
PDF · Mobi · ePub |
This chapter explains how to use features of Oracle Data Mining to prepare data for mining.
This chapter contains the following sections:
The quality of a model depends to a large extent on the quality of the data used to build (train) it. Much of the time spent in any given data mining project is devoted to data preparation. The data must be carefully inspected, cleansed, and transformed, and algorithm-appropriate data preparation methods must be applied.
The process of data preparation is further complicated by the fact that any data to which a model is applied, whether for testing or for scoring, must undergo the same transformations as the data used to train the model.
Oracle Data Mining offers several features that significantly simplify the process of data preparation.
Embedded data preparation — The transformations used in training the model are embedded in the model and automatically executed whenever the model is applied to new data. If you specify transformation for the model, you only have to specify them once.
Automatic Data Preparation (ADP) — Oracle Data Mining supports an automated data preparation mode. When ADP is active, Oracle Data Mining automatically performs the data transformations required by the algorithm. The transformation instructions are embedded in the model along with any user-specified transformation instructions.
Tools for custom data preparation — Oracle Data Mining provides a PL/SQL package of transformation routines that you can use to build your own transformation instructions. You can use these transformation instructions along with ADP or instead of ADP.
Automatic management of missing values and sparse data — Oracle Data Mining uses consistent methodology across mining algorithms to handle sparsity and missing values.
Transparency — Oracle Data Mining provides model details, which are a view of the categorical and numerical attributes internal to the model. This insight into the inner details of the model is possible because of reverse transformations, which map the transformed attribute values to a form that can be interpreted by a user. Where possible, attribute values are reversed to the original column values. Reverse transformations are also applied to the target of a supervised model, thus the results of scoring are in the same units as the units of the original target.
The first step in preparing data for mining is the creation of a case table. If all the data resides in a single table and all the information for each case (record) is included in a single row (single-record case), this process is already taken care of.
If the data resides in several tables, creating the data source involves the creation of a view. For the sake of simplicity, the term "case table" refers to either a table or a view.
When the data source includes transactional data (multi-record case), it must be aggregated to the case level, using nested columns when desired. In transactional data, the information for each case is contained in multiple rows. An example is sales data in a star schema when mining at the product level. Sales is stored in many rows for a single product (the case) since the product is sold in many stores to many customers over a period of time.
Once you have created a case table that contains all the pertinent data, you should cleanse the data of any inconsistent formats within columns. Pay special attention to such items as phone numbers, zip codes, and dates, as described in the following section.
See Also:
Oracle Data Mining Application Developer's Guide for detailsOracle Data Mining supports a limited number of column data types. Numeric data is interpreted as numerical attributes and character data is interpreted as categorical attributes.
You must convert the data type of a column if its type is not supported by Oracle Data Mining or if its type will cause Oracle Data Mining to interpret it incorrectly. For example, zip codes identify different postal zones; they do not imply order. If the zip codes are stored in a numeric column, it will be interpreted as a numerical attribute. You must convert the data type so that the column data can be used as a categorical attribute by the model. You can do this using the TO_CHAR
function to convert the digits 1-9 and the LPAD
function to retain the leading 0, if there is one.
LPAD(TO_CHAR(ZIPCODE),5,'0')
The Oracle Data Mining APIs do not support DATE
and TIMESTAMP
data. Date columns must be converted to simple numeric or character data types for data mining.
In most cases, DATE
and TIMESTAMP
should be converted to NUMBER
, but you should evaluate each case individually. A TIMESTAMP
column should generally be converted to a number since it represents a unique point in time.
Alternatively, a column of dates in a table of annual sales data might indicate the month when a product was sold. This DATE
column would be converted to VARCHAR2
and treated as a categorical. You can use the TO_CHAR
function to convert a DATE
data type to VARCHAR2
.
You can convert dates to numbers by selecting a starting date and subtracting it from each date value. Another approach would be to parse the date and distribute its components over several columns. This approach is used by DBMS_PREDICTIVE_ANALYTICS
, which does support DATE
and TIMESTAMP
data types.
See Also:
Oracle Database SQL Language Reference for information on data type conversion
Oracle Database PL/SQL Packages and Types Reference for information about date data types supported by DBMS_PREDICTIVE_ANALYTICS
You can use Oracle Data Mining to mine text. Columns of text in the case table can be mined once they have undergone the proper transformation.
The text column must be in a table, not a view. The transformation process uses several features of Oracle Text; it treats the text in each row of the table as a separate document. Each document is transformed to a set of text tokens known as terms, which have a numeric value and a text label. The text column is transformed to a nested column of DM_NESTED_NUMERICALS
.
See Also:
Oracle Data Mining Application Developer's Guide for detailsSome transformations are dictated by the definition of the business problem. For example, you might want to build a model to predict high-revenue customers. Since your revenue data for current customers is in dollars you need to define what "high-revenue" means. Using some formula that you have developed from past experience, you might recode the revenue attribute into ranges Low, Medium, and High before building the model.
Another common business transformation is the conversion of date information into elapsed time. For example, date of birth might be converted to age.
In some cases, the data might need to be transformed to minimize an unwanted interpretation by the model. An example is logarithmic transformations. You might take the log of a numerical attribute when the values fall within a very wide range. For instance, commissions might range from a few hundred to a million. Converting to the log scale would minimize the skewing effect on the model.
Domain knowledge can be very important in deciding how to prepare the data. For example, some algorithms might produce unreliable results if the data contains values that fall far outside of the normal range. In some cases, these values represent errors or abnormalities. In others, they provide meaningful information. See "Outlier Treatment".
Most algorithms require some form of data transformation. During the model training process, Oracle Data Mining can automatically perform the transformations required by the algorithm. You can choose to supplement the automatic transformations with additional transformations of your own, or you can choose to manage all the transformations yourself.
In calculating automatic transformations, Oracle Data Mining uses heuristics that address the common requirements of a given algorithm. This process results in reasonable model quality in most cases.
Binning, normalization, and outlier treatment are transformations that are commonly needed by data mining algorithms.
Binning, also called discretization, is a technique for reducing the cardinality of continuous and discrete data. Binning groups related values together in bins to reduce the number of distinct values.
Binning can improve resource utilization and model build response time dramatically without significant loss in model quality. Binning can improve model quality by strengthening the relationship between attributes.
Normalization is the most common technique for reducing the range of numerical data. Most normalization methods map the range of a single variable to another range (often 0,1).
A value is considered an outlier if it deviates significantly from most other values in the column. The presence of outliers can have a skewing effect on the data and can interfere with the effectiveness of transformations such as normalization or binning.
Outlier treatment methods such as trimming or clipping can be implemented to minimize the effect of outliers.
Outliers may represent problematic data, for example a bad reading due to the abnormal condition of an instrument. However, in some cases, especially in the business arena, outliers may be perfectly valid. For example, in census data, the earnings for some of the richest individuals may vary significantly from the general population. This information should not be treated as an outlier, since it is an important part of the data. Domain knowledge is usually needed to determine outlier handling.
Table 19-1 shows how ADP prepares the data for each algorithm.
Note:
Many algorithms incorporate some form of data preparation. For example, algorithms that operate natively on numeric attributes explode each non-numeric input column into a set of numerical attributes.Transformations encapsulated within the algorithm are transparent to the user and occur independently of ADP.
Also, the handling of nested data, sparsity, and missing values is standard across algorithms and occurs independently of ADP. (See Oracle Data Mining Application Developer's Guide.)
Table 19-1 Oracle Data Mining Algorithms With ADP
Algorithm | Mining Function | Treatment by ADP |
---|---|---|
Classification |
All attributes are binned with supervised binning. |
|
Classification |
The ADP setting has no effect on Decision Tree. Data preparation is handled by the algorithm. |
|
Classification and Regression |
Numerical attributes are normalized. |
|
Classification, Anomaly Detection, and Regression |
Numerical attributes are normalized. |
|
Clustering |
Numerical attributes are normalized with outlier-sensitive normalization. |
|
Clustering |
Numerical attributes are binned with a specialized form of equi-width binning, which computes the number of bins per attribute automatically. Numerical columns with all nulls or a single value are removed. |
|
Attribute Importance |
All attributes are binned with supervised binning.. |
|
Association Rules |
The ADP setting has no effect on association rules. |
|
Feature Extraction |
Numerical attributes are normalized. |
See Also:
The chapters on the individual algorithms in Part III for more information about algorithm-specific data preparationTransformations can be embedded in a model automatically by ADP or they can be embedded as a result of user-specified transformation instructions. To specify your own embedded transformations, create a transformation list and pass it to DBMS_DATA_MINING.CREATE_MODEL
.
PROCEDURE create_model(
model_name IN VARCHAR2,
mining_function IN VARCHAR2,
data_table_name IN VARCHAR2,
case_id_column_name IN VARCHAR2,
target_column_name IN VARCHAR2 DEFAULT NULL,
settings_table_name IN VARCHAR2 DEFAULT NULL,
data_schema_name IN VARCHAR2 DEFAULT NULL,
settings_schema_name IN VARCHAR2 DEFAULT NULL,
xform_list IN TRANSFORM_LIST DEFAULT NULL);
See Also:
For details about transformation lists, see Oracle Database PL/SQL Packages and Types ReferenceIf you enable ADP and you specify a transformation list, the transformation list is embedded with the automatic, system-generated transformations. The transformation list is executed before the automatic transformations.
If you enable ADP and do not specify a transformation list, only the automatic transformations are embedded in the model.
If ADP is disabled (the default) and you specify a transformation list, your custom transformations are embedded in the model. No automatic transformations are performed.
If ADP is disabled (the default) and you do not specify a transformation list, no transformations will be embedded in the model. You will have to transform the build, test, and scoring data sets yourself. You must take care to apply the same transformations to each data set. This method of data preparation was required in previous releases of Oracle Data Mining.
Oracle Data Mining provides routines that implement various transformation techniques in the DBMS_DATA_MINING_TRANSFORM
package. Details about the package are in Oracle Database PL/SQL Packages and Types Reference.
A number of factors go into deciding a binning strategy. Having fewer values typically leads to a more compact model and one that builds faster, but it can also lead to some loss in accuracy.
Model quality can improve significantly with well-chosen bin boundaries. For example, an appropriate way to bin ages might be to separate them into groups of interest, such as children 0-13, teenagers 13-19, youth 19-24, working adults 24-35, and so on.
Table 19-2 lists the binning techniques provided by Oracle Data Mining.
Table 19-2 Binning Methods in DBMS_DATA_MINING_TRANSFORM
Binning Method | Description |
---|---|
Top-N Most Frequent Items |
You can use this technique to bin categorical attributes. You specify the number of bins. The value that occurs most frequently is labeled as the first bin, the value that appears with the next frequency is labeled as the second bin, and so on. All remaining values are in an additional bin. |
Supervised Binning |
Supervised binning is a form of intelligent binning, where bin boundaries are derived from important characteristics of the data. Supervised binning builds a single-predictor decision tree to find the interesting bin boundaries with respect to a target. It can be used for numerical or categorical attributes. |
Equi-Width Binning |
You can use equi-width binning for numerical attributes. The range of values is computed by subtracting the minimum value from the maximum value, then the range of values is divided into equal intervals. You can specify the number of bins or it can be calculated automatically. Equi-width binning should usually be used with outlier treatment. (See "Routines for Outlier Treatment".) |
Quantile Binning |
Quantile binning is a numerical binning technique. Quantiles are computed using the SQL analytic function |
Most normalization methods map the range of a single attribute to another range, typically 0 to 1 or -1 to +1.
Normalization is very sensitive to outliers. Without outlier treatment, most values will be mapped to a tiny range, resulting in a significant loss of information. (See"Routines for Outlier Treatment".)
Table 19-3 Normalization Methods in DBMS_DATA_MINING_TRANSFORM
Transformation | Description |
---|---|
This technique computes the normalization of an attribute using the minimum and maximum values. The shift is the minimum value, and the scale is the difference between the maximum and minimum values. |
|
This normalization technique also uses the minimum and maximum values. For scale normalization, shift = 0, and scale = max{abs(max), abs(min)}. |
|
This technique computes the normalization of an attribute using the mean and the standard deviation. Shift is the mean, and scale is the standard deviation. |
Outliers are extreme values, typically several standard deviations from the mean. To minimize the effect of outliers, you can Winsorize or trim the data.
Winsorizing involves setting the tail values of an attribute to some specified value. For example, for a 90% Winsorization, the bottom 5% of values are set equal to the minimum value in the 5th percentile, while the upper 5% of values are set equal to the maximum value in the 95th percentile.
Trimming sets the tail values to NULL. The algorithm treats them as missing values.
Outliers affect the different algorithms in different ways. In general, outliers cause distortion with equi-width binning and min-max normalization.
Table 19-4 Outlier Treatment Methods in DBMS_DATA_MINING_TRANSFORM
Transformation | Description |
---|---|
This technique trims the outliers in numeric columns by sorting the non-null values, computing the tail values based on some fraction, and replacing the tail values with nulls. |
|
This technique trims the outliers in numeric columns by sorting the non-null values, computing the tail values based on some fraction, and replacing the tail values with some specified value. |
Oracle Data Mining support for model transparency ensures that information returned by the model is expressed in a format that is similar to or the same as the format of the data that was used to train the model. Internal transformations are reversed in the model details and in the predictions generated by supervised models.
Some of the attributes used by the model correspond to columns in the build data. However, because of logic specific to the algorithm, nested data, and transformations, many attributes do not correspond to columns.
For example, a nested column in the training data is not interpreted as an attribute by the model. During the model build, Oracle Data Mining explodes nested columns, and each row (an attribute name/value pair) becomes an attribute.
Some algorithms, for example SVM and GLM, only operate on numeric attributes. Any non-numeric column in the build data is exploded into binary numerical attributes, one for each distinct value in the column (SVM). GLM does not generate a new attribute for the most frequent value in the original column. These binary attributes are set to one only if the column value for the case is equal to the value associated with the binary attribute.
Algorithms that generate coefficients present challenges in regards to interpretability of results. Examples are SVN and NMF. These algorithms produce coefficients that are used in combination with the transformed attributes. The coefficients are relevant to the data on the transformed scale, not the original data scale.
Algorithms do not necessarily use all the columns in the training data. Some columns might be deemed unnecessary or harmful to the quality of the model. These columns are not used as attributes.
For all these reasons, the attributes listed in the model details might not resemble the columns of data used to train the model. However, attributes that undergo embedded transformations, whether initiated by ADP or by a user-specified transformation list, appear in the model details in their pre-transformed state, as close as possible to the original column values. Although the attributes are transformed when they are used by the model, they are visible in the model details in a form that can be interpreted by a user.
See Also:
The following in Oracle Database PL/SQL Packages and Types Reference :GET_MODEL_DETAILS
GET_MODEL_TRANSFORMATIONS
ALTER_REVERSE_EXPRESSION