This chapter describes Non-Negative Matrix Factorization (NMF), the unsupervised algorithm used by Oracle Data Mining for feature extraction.
Note: Non-Negative Matrix Factorization (NMF) is described in the paper "Learning the Parts of Objects by Non-Negative Matrix Factorization" by D. D. Lee and H. S. Seung in Nature (401, pages 788-791, 1999).
This chapter contains the following topics:
Non-Negative Matrix Factorization is a state-of-the-art feature extraction algorithm. NMF is useful when there are many attributes and the attributes are ambiguous or have weak predictability. By combining attributes, NMF can produce meaningful patterns, topics, or themes.
Each feature created by NMF is a linear combination of the original attribute set. Each feature has a set of coefficients, which are a measure of the weight of each attribute on the feature. There is a separate coefficient for each numerical attribute and for each distinct value of each categorical attribute. The coefficients are all non-negative.
Non-Negative Matrix Factorization uses techniques from multivariate analysis and linear algebra. It decomposes the data, represented as a matrix M, into the product of two lower-rank matrices W and H. The sub-matrix W contains the NMF basis; the sub-matrix H contains the associated coefficients (weights).
The algorithm iteratively modifies the values of W and H so that their product approaches M. The technique preserves much of the structure of the original data and guarantees that both basis and weights are non-negative. The algorithm terminates when the approximation error converges or a specified number of iterations is reached.
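In matrix terms, the factorization can be written as follows. The multiplicative update rules shown are the published scheme from the Lee and Seung paper cited above (for the Euclidean objective); they are given here for illustration, not as a statement of Oracle Data Mining's internal implementation:

$$
M \approx WH, \qquad M \in \mathbb{R}^{m \times n}_{\ge 0},\; W \in \mathbb{R}^{m \times k}_{\ge 0},\; H \in \mathbb{R}^{k \times n}_{\ge 0},\; k \ll \min(m, n)
$$

$$
H \leftarrow H \circ \frac{W^{\top} M}{W^{\top} W H}, \qquad W \leftarrow W \circ \frac{M H^{\top}}{W H H^{\top}}
$$

Here the symbol between factors and the fractions denote element-wise multiplication and division. Each update keeps all entries non-negative, and the product WH approaches M as the approximation error ||M - WH|| converges.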
The NMF algorithm must be initialized with a seed to indicate the starting point for the iterations. Because of the high dimensionality of the processing space and the fact that there is no global minimization algorithm, the appropriate initialization can be critical in obtaining meaningful results. Oracle Data Mining uses a random seed that initializes the values of W and H based on a uniform distribution. This approach works well in most cases.
NMF can be used as a dimensionality reduction pre-processing step in classification, regression, clustering, and other mining tasks. Scoring an NMF model produces data projections in the new feature space. The magnitude of a projection indicates how strongly a record maps to a feature.
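Scoring is exposed through the SQL feature extraction functions. A minimal sketch, assuming an NMF model named nmf_model and the apply view used elsewhere in this manual; FEATURE_ID returns the identifier of the best-matching feature for each case, and FEATURE_VALUE returns the strength of that projection:

```sql
-- Project each case onto the NMF feature space; the model and view
-- names are placeholders.
SELECT cust_id,
       FEATURE_ID(nmf_model USING *)    AS best_feature,
       FEATURE_VALUE(nmf_model USING *) AS best_feature_value
  FROM mining_data_apply_v;
```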
NMF is especially well-suited for text mining. In a text document, the same word can occur in different places with different meanings. For example, "hike" can be applied to the outdoors or to interest rates. By combining attributes, NMF introduces context, which is essential for explanatory power.
Oracle Data Mining supports five configurable parameters for NMF. All of them have default values that are appropriate for most applications of the algorithm. A sketch showing how these settings can be specified follows the list. The NMF settings are:
Number of features. By default, the number of features is determined by the algorithm.
Convergence tolerance. The default is .05.
Number of iterations. The default is 50.
Random seed. The default is -1.
Non-negative scoring. You can specify whether negative numbers should be allowed in scoring results. By default they are allowed.
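The following is a minimal sketch of specifying these settings when building an NMF model through the DBMS_DATA_MINING PL/SQL package. The table, model, and data names (nmf_settings, nmf_model, mining_data) are placeholders, and the setting constants should be verified against the PL/SQL Packages and Types Reference for your release:

```sql
-- A sketch, not a verbatim recipe: confirm setting constants for your release.
CREATE TABLE nmf_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000));

BEGIN
  -- Choose the NMF algorithm; NMF performs feature extraction
  INSERT INTO nmf_settings VALUES
    (dbms_data_mining.algo_name, dbms_data_mining.algo_nonnegative_matrix_factor);
  -- Number of features (otherwise determined by the algorithm)
  INSERT INTO nmf_settings VALUES (dbms_data_mining.feat_num_features, '10');
  -- Convergence tolerance (default .05) and number of iterations (default 50)
  INSERT INTO nmf_settings VALUES (dbms_data_mining.nmfs_conv_tolerance, '0.05');
  INSERT INTO nmf_settings VALUES (dbms_data_mining.nmfs_num_iterations, '50');
  -- Random seed (default -1)
  INSERT INTO nmf_settings VALUES (dbms_data_mining.nmfs_random_seed, '-1');

  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'nmf_model',
    mining_function     => dbms_data_mining.feature_extraction,
    data_table_name     => 'mining_data',
    case_id_column_name => 'cust_id',
    settings_table_name => 'nmf_settings');
END;
/
```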
Automatic Data Preparation normalizes numerical attributes for NMF.
When there are missing values in columns with simple data types (not nested), NMF interprets them as missing at random. The algorithm replaces missing categorical values with the mode and missing numerical values with the mean.
When there are missing values in nested columns, NMF interprets them as sparse. The algorithm replaces sparse numerical data with zeros and sparse categorical data with zero vectors.
If you choose to manage your own data preparation, keep in mind that outliers can significantly impact NMF. Use a clipping transformation before binning or normalizing. NMF typically benefits from normalization. However, when outliers are present, min-max normalization leads to poor matrix factorization. To improve the factorization, you need to decrease the error tolerance, which in turn leads to longer build times.
See Also: Chapter 19, "Automatic and Embedded Data Preparation"; Oracle Data Mining Application Developer's Guide for information about nested columns and missing data
This chapter presents an overview of Oracle Data Mining predictive analytics, an automated form of predictive data mining.
See Also: Oracle Data Mining Administrator's Guide for installation instructions; Oracle Database PL/SQL Packages and Types Reference for predictive analytics syntax in PL/SQL
This chapter includes the following sections:
Predictive Analytics is a technology that captures data mining processes in simple routines. Sometimes called "one-click data mining," predictive analytics simplifies and automates the data mining process.
Predictive analytics develops profiles, discovers the factors that lead to certain outcomes, predicts the most likely outcomes, and identifies a degree of confidence in the predictions.
Predictive analytics uses data mining technology, but knowledge of data mining is not needed to use predictive analytics.
You can use predictive analytics simply by specifying an operation to perform on your data. You do not need to create or use mining models or understand the mining functions and algorithms summarized in Chapter 2 of this manual.
The predictive analytics routines analyze the input data and create mining models. These models are trained and tested and then used to generate the results returned to the user. The models and supporting objects are not preserved after the operation completes.
When you use data mining technology directly, you create a model or use a model created by someone else. Usually, you apply the model to new data (different from the data used to train and test the model). Predictive analytics routines apply the model to the same data used for training and testing.
Oracle Data Mining predictive analytics operations are described in Table 3-1.
Table 3-1 Oracle Predictive Analytics Operations
| Operation | Description |
|---|---|
| EXPLAIN | Explains how the individual attributes affect the variation of values in a target column |
| PREDICT | For each case, predicts the values in a target column |
| PROFILE | Creates a set of rules for cases that imply the same target value |
The Oracle Spreadsheet Add-In for Predictive Analytics provides predictive analytics operations within a Microsoft Excel spreadsheet. You can analyze Excel data or data that resides in an Oracle database.
Figure 3-1 shows the EXPLAIN operation using Microsoft Excel 7.0. EXPLAIN shows the predictors of a given target ranked in descending order of importance. In this example, RELATIONSHIP is the most important predictor, and MARITAL STATUS is the second most important predictor.
Figure 3-1 EXPLAIN in Oracle Spreadsheet Add-In for Predictive Analytics
Figure 3-2 shows the PREDICT operation for a binary target. PREDICT shows the actual and predicted classification for each case. It includes the probability of each prediction and the overall predictive confidence for the data set.
Figure 3-2 PREDICT in Oracle Spreadsheet Add-In for Predictive Analytics
Figure 3-3 shows the PROFILE operation. This example shows five profiles for a binary classification problem. Each profile includes a rule, the number of cases to which it applies, and a score distribution. Profile 1 describes 319 cases. Its members are husbands or wives with bachelors, masters, Ph.D., or professional degrees; they have capital gains <= 5095.5. The probability of a positive prediction for this group is 68.7%; the probability of a negative prediction is 31.3%.
Figure 3-3 PROFILE in Oracle Spreadsheet Add-In for Predictive Analytics
You can download the latest version of the Spreadsheet Add-In from the Oracle Technology Network.
Oracle Data Mining implements predictive analytics in the DBMS_PREDICTIVE_ANALYTICS PL/SQL package. The following SQL DESCRIBE statement shows the predictive analytics procedures with their parameters.
```
SQL> describe dbms_predictive_analytics
PROCEDURE EXPLAIN
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 DATA_TABLE_NAME                VARCHAR2                IN
 EXPLAIN_COLUMN_NAME            VARCHAR2                IN
 RESULT_TABLE_NAME              VARCHAR2                IN
 DATA_SCHEMA_NAME               VARCHAR2                IN     DEFAULT
PROCEDURE PREDICT
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 ACCURACY                       NUMBER                  OUT
 DATA_TABLE_NAME                VARCHAR2                IN
 CASE_ID_COLUMN_NAME            VARCHAR2                IN
 TARGET_COLUMN_NAME             VARCHAR2                IN
 RESULT_TABLE_NAME              VARCHAR2                IN
 DATA_SCHEMA_NAME               VARCHAR2                IN     DEFAULT
PROCEDURE PROFILE
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 DATA_TABLE_NAME                VARCHAR2                IN
 TARGET_COLUMN_NAME             VARCHAR2                IN
 RESULT_TABLE_NAME              VARCHAR2                IN
 DATA_SCHEMA_NAME               VARCHAR2                IN     DEFAULT
```
Example 3-1 shows how a simple PREDICT operation can be used to find the customers most likely to increase spending if given an affinity card.

The customer data, including current affinity card usage and other information such as gender, education, age, and household size, is stored in a view called MINING_DATA_APPLY_V. The results of the PREDICT operation are written to a table named p_result_tbl.

The PREDICT operation calculates both the prediction and the accuracy of the prediction. Accuracy, also known as predictive confidence, is a measure of the improvement over predictions that would be generated by a naive model. In the case of classification, a naive model would always guess the most common class. In Example 3-1, the improvement is almost 50%.
Example 3-1 Predict Customers Most Likely to Increase Spending with an Affinity Card
```sql
DECLARE
  p_accuracy NUMBER(10,9);
BEGIN
  DBMS_PREDICTIVE_ANALYTICS.PREDICT(
    accuracy            => p_accuracy,
    data_table_name     => 'mining_data_apply_v',
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    result_table_name   => 'p_result_tbl');
  DBMS_OUTPUT.PUT_LINE('Accuracy: ' || p_accuracy);
END;
/
```

Accuracy: .492433267
The following query returns the gender and average age of customers most likely to respond favorably to an affinity card.
```sql
SELECT cust_gender, COUNT(*) AS cnt, ROUND(AVG(age)) AS avg_age
  FROM mining_data_apply_v a, p_result_tbl b
 WHERE a.cust_id = b.cust_id
   AND b.prediction = 1
 GROUP BY a.cust_gender
 ORDER BY a.cust_gender;
```

```
C        CNT    AVG_AGE
- ---------- ----------
F         90         45
M        443         45
```
This section provides some high-level information about the inner workings of Oracle predictive analytics. If you know something about data mining, you will find this information straightforward and easy to understand. If you are unfamiliar with data mining, you can skip this section. You do not need to know this information to use predictive analytics.
EXPLAIN creates an attribute importance model. Attribute importance uses the Minimum Description Length algorithm to determine the relative importance of attributes in predicting a target value. EXPLAIN returns a list of attributes ranked in relative order of their impact on the prediction. This information is derived from the model details for the attribute importance model.

Attribute importance models are not scored against new data. They simply return information (model details) about the data you provide.
Attribute importance is described in "Feature Selection".
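As a sketch, using the procedure signature shown in the DESCRIBE output earlier, an EXPLAIN call looks like this (e_result_tbl is a hypothetical result table name; the procedure creates it):

```sql
-- Ranks the predictors of affinity_card; results go to e_result_tbl
BEGIN
  DBMS_PREDICTIVE_ANALYTICS.EXPLAIN(
    data_table_name     => 'mining_data_apply_v',
    explain_column_name => 'affinity_card',
    result_table_name   => 'e_result_tbl');
END;
/
```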
PREDICT creates a Support Vector Machine (SVM) model for classification or regression.

PREDICT creates a Receiver Operating Characteristic (ROC) curve to analyze the per-case accuracy of the predictions. PREDICT optimizes the probability threshold for binary classification models. The probability threshold is the probability that the model uses to make a positive prediction. The default is 50%.

PREDICT returns a value indicating the accuracy, or predictive confidence, of the prediction. The accuracy is the improvement gained over a naive prediction. For a categorical target, a naive prediction would be the most common class; for a numerical target, it would be the mean. For example, if a categorical target can have values small, medium, or large, and small is predicted more often than medium or large, a naive model would return small for all cases. Predictive analytics uses the accuracy of a naive model as the baseline accuracy.

The accuracy metric returned by PREDICT is a measure of improved maximum average accuracy versus a naive model's maximum average accuracy. Maximum average accuracy is the average per-class accuracy achieved at a specific probability threshold that is greater than the accuracy achieved at all other possible thresholds.
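Formalizing the definition above (a restatement of the prose, not a quotation from the product documentation): for the set of target classes $C$ and a probability threshold $t$,

$$
\text{maximum average accuracy} = \max_{t} \; \frac{1}{|C|} \sum_{c \in C} \text{accuracy}_c(t)
$$

where $\text{accuracy}_c(t)$ is the per-class accuracy for class $c$ at threshold $t$. PREDICT reports the improvement of the model's maximum average accuracy over that of the naive model.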
SVM is described in Chapter 18.
PROFILE creates a Decision Tree model to identify the characteristics of the attributes that predict a common target. For example, if the data has a categorical target with values small, medium, or large, PROFILE would describe how certain attributes typically predict each size.
The Decision Tree algorithm creates rules that describe the decisions that affect the prediction. The rules, expressed in XML as if-then-else statements, are returned in the model details. PROFILE returns XML that is derived from the model details generated by the algorithm.
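A sketch of a PROFILE call, following the signature shown in the DESCRIBE output earlier (pr_result_tbl is a hypothetical result table name):

```sql
-- Generates profiles (rules) for the affinity_card target
BEGIN
  DBMS_PREDICTIVE_ANALYTICS.PROFILE(
    data_table_name    => 'mining_data_apply_v',
    target_column_name => 'affinity_card',
    result_table_name  => 'pr_result_tbl');
END;
/
```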
Decision Tree is described in Chapter 11.
This chapter describes classification, the supervised mining function for predicting a categorical target.
This chapter includes the following topics:
Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks.
A classification task begins with a data set in which the class assignments are known. For example, a classification model that predicts credit risk could be developed based on observed data for many loan applicants over a period of time. In addition to the historical credit rating, the data might track employment history, home ownership or rental, years of residence, number and type of investments, and so on. Credit rating would be the target, the other attributes would be the predictors, and the data for each customer would constitute a case.
Classifications are discrete and do not imply order. Continuous, floating-point values would indicate a numerical, rather than a categorical, target. A predictive model with a numerical target uses a regression algorithm, not a classification algorithm.
The simplest type of classification problem is binary classification. In binary classification, the target attribute has only two possible values: for example, high credit rating or low credit rating. Multiclass targets have more than two values: for example, low, medium, high, or unknown credit rating.
In the model build (training) process, a classification algorithm finds relationships between the values of the predictors and the values of the target. Different classification algorithms use different techniques for finding relationships. These relationships are summarized in a model, which can then be applied to a different data set in which the class assignments are unknown.
Classification models are tested by comparing the predicted values to known target values in a set of test data. The historical data for a classification project is typically divided into two data sets: one for building the model; the other for testing the model. See "Testing a Classification Model".
Scoring a classification model results in class assignments and probabilities for each case. For example, a model that classifies customers as low, medium, or high value would also predict the probability of each classification for each customer.
Classification has many applications in customer segmentation, business modeling, marketing, credit analysis, and biomedical and drug response modeling.
A classification model is tested by applying it to test data with known target values and comparing the predicted values with the known values.
The test data must be compatible with the data used to build the model and must be prepared in the same way that the build data was prepared. Typically the build data and test data come from the same historical data set. A percentage of the records is used to build the model; the remaining records are used to test the model.
Test metrics are used to assess how accurately the model predicts the known values. If the model performs well and meets the business requirements, it can then be applied to new data to predict the future.
A confusion matrix displays the number of correct and incorrect predictions made by the model compared with the actual classifications in the test data. The matrix is n-by-n, where n is the number of classes.
Figure 5-1 shows a confusion matrix for a binary classification model. The rows present the number of actual classifications in the test data. The columns present the number of predicted classifications made by the model.
Figure 5-1 Confusion Matrix for a Binary Classification Model
In this example, the model correctly predicted the positive class for affinity_card 516 times and incorrectly predicted it 25 times. The model correctly predicted the negative class for affinity_card 725 times and incorrectly predicted it 10 times. The following can be computed from this confusion matrix (a query sketch for producing such a matrix follows the list):
The model made 1241 correct predictions (516 + 725).
The model made 35 incorrect predictions (25 + 10).
There are 1276 total scored cases (516 + 25 + 10 + 725).
The error rate is 35/1276 = 0.0274.
The overall accuracy rate is 1241/1276 = 0.9725.
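When actual and predicted values are available side by side, a confusion matrix is a simple cross-tabulation. A minimal sketch, assuming a hypothetical table test_results with columns actual and predicted:

```sql
-- Counts each actual/predicted pairing; with a binary target this
-- returns up to four rows (the four cells of the matrix).
SELECT actual, predicted, COUNT(*) AS cnt
  FROM test_results
 GROUP BY actual, predicted
 ORDER BY actual, predicted;
```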
Lift measures the degree to which the predictions of a classification model are better than randomly-generated predictions. Lift applies to binary classification only, and it requires the designation of a positive class. (See "Positive and Negative Classes".) If the model itself does not have a binary target, you can compute lift by designating one class as positive and combining all the other classes together as one negative class.
Numerous statistics can be calculated to support the notion of lift. Basically, lift can be understood as a ratio of two percentages: the percentage of correct positive classifications made by the model to the percentage of actual positive classifications in the test data. For example, if 40% of the customers in a marketing survey have responded favorably (the positive classification) to a promotional campaign in the past and the model accurately predicts 75% of them, the lift would be obtained by dividing .75 by .40. The resulting lift would be 1.875.
Lift is computed against quantiles that each contain the same number of cases. The data is divided into quantiles after it is scored. It is ranked by probability of the positive class from highest to lowest, so that the highest concentration of positive predictions is in the top quantiles. A typical number of quantiles is 10.
Lift is commonly used to measure the performance of response models in marketing applications. The purpose of a response model is to identify segments of the population with potentially high concentrations of positive responders to a marketing campaign. Lift reveals how much of the population must be solicited to obtain the highest percentage of potential responders.
Oracle Data Mining computes the following lift statistics:
Probability threshold for a quantile n is the minimum probability for the positive target to be included in this quantile or any preceding quantiles (quantiles n-1, n-2,..., 1). If a cost matrix is used, a cost threshold is reported instead. The cost threshold is the maximum cost for the positive target to be included in this quantile or any of the preceding quantiles. (See "Costs".)
Cumulative gain is the ratio of the cumulative number of positive targets to the total number of positive targets.
Target density of a quantile is the number of true positive instances in that quantile divided by the total number of instances in the quantile.
Cumulative target density for quantile n is the target density computed over the first n quantiles.
Quantile lift is the ratio of target density for the quantile to the target density over all the test data.
Cumulative percentage of records for a quantile is the percentage of all cases represented by the first n quantiles, starting at the end that is most confidently positive, up to and including the given quantile.
Cumulative number of targets for quantile n is the number of true positive instances in the first n quantiles.
Cumulative number of nontargets is the number of actually negative instances in the first n quantiles.
Cumulative lift for a quantile is the ratio of the cumulative target density to the target density over all the test data.
ROC is another metric for comparing predicted and actual target values in a classification model. ROC, like lift, applies to binary classification and requires the designation of a positive class. (See "Positive and Negative Classes".)
You can use ROC to gain insight into the decision-making ability of the model. How likely is the model to accurately predict the negative or the positive class?
ROC measures the impact of changes in the probability threshold. The probability threshold is the decision point used by the model for classification. The default probability threshold for binary classification is .5. When the probability of a prediction is 50% or more, the model predicts that class. When the probability is less than 50%, the other class is predicted. (In multiclass classification, the predicted class is the one predicted with the highest probability.)
ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis.
The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate.
The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with unbalanced target distribution (one target class dominates the other).
Changes in the probability threshold affect the predictions made by the model. For instance, if the threshold for predicting the positive class is changed from .5 to .6, fewer positive predictions will be made. This will affect the distribution of values in the confusion matrix: the number of true and false positives and true and false negatives will all be different.
The ROC curve for a model represents all the possible combinations of values in its confusion matrix. You can use ROC to find the probability thresholds that yield the highest overall accuracy or the highest per-class accuracy. For example, if it is important to you to accurately predict the positive class, but you don't care about prediction errors for the negative class, you could lower the threshold for the positive class. This would bias the model in favor of the positive class.
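The effect of raising the threshold can be seen directly in SQL with the PREDICTION_PROBABILITY function. A sketch, assuming a binary classification model named clas_model with a positive class of 1:

```sql
-- Predicts the positive class only when its probability is at least .6,
-- instead of the default .5; model and view names are placeholders.
SELECT cust_id,
       CASE WHEN PREDICTION_PROBABILITY(clas_model, 1 USING *) >= 0.6
            THEN 1 ELSE 0
       END AS prediction_at_60
  FROM mining_data_apply_v;
```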
A cost matrix is a convenient mechanism for changing the probability thresholds for model scoring.
Oracle Data Mining computes the following ROC statistics:
Probability threshold: The minimum predicted positive class probability resulting in a positive class prediction. Different threshold values result in different hit rates and different false alarm rates.
True negatives: Negative cases in the test data with predicted probabilities strictly less than the probability threshold (correctly predicted).
True positives: Positive cases in the test data with predicted probabilities greater than or equal to the probability threshold (correctly predicted).
False negatives: Positive cases in the test data with predicted probabilities strictly less than the probability threshold (incorrectly predicted).
False positives: Negative cases in the test data with predicted probabilities greater than or equal to the probability threshold (incorrectly predicted).
True positive fraction: Hit rate. (true positives/(true positives + false negatives))
False positive fraction: False alarm rate. (false positives/(false positives + true negatives))
Costs, prior probabilities, and class weights are methods for biasing classification models.
A cost matrix is a mechanism for influencing the decision making of a model. A cost matrix can cause the model to minimize costly misclassifications. It can also cause the model to maximize beneficial accurate classifications.
For example, if a model classifies a customer with poor credit as low risk, this error is costly. A cost matrix could bias the model to avoid this type of error. The cost matrix might also be used to bias the model in favor of the correct classification of customers who have the worst credit history.
ROC is a useful metric for evaluating how a model behaves with different probability thresholds. You can use ROC to help you find optimal costs for a given classifier given different usage scenarios. You can use this information to create cost matrices to influence the deployment of the model.
Like a confusion matrix, a cost matrix is an n-by-n matrix, where n is the number of classes. Both confusion matrices and cost matrices include each possible combination of actual and predicted results based on a given set of test data.
A confusion matrix is used to measure accuracy, the ratio of correct predictions to the total number of predictions. A cost matrix is used to specify the relative importance of accuracy for different predictions. In most business applications, it is important to consider costs in addition to accuracy when evaluating model quality. (See "Confusion Matrix".)
The positive class is the class that you care the most about. Designation of a positive class is required for computing lift and ROC. (See "Lift" and "Receiver Operating Characteristic (ROC)").
In the confusion matrix in Figure 5-2, the value 1 is designated as the positive class. This means that the creator of the model has determined that it is more important to accurately predict customers who will increase spending with an affinity card (affinity_card = 1) than to accurately predict non-responders (affinity_card = 0). If you give affinity cards to some customers who are not likely to use them, there is little loss to the company since the cost of the cards is low. However, if you overlook the customers who are likely to respond, you miss the opportunity to increase your revenue.
The true and false positive rates in this confusion matrix are:
False positive rate — 10/(10 + 725) = .01

True positive rate — 516/(516 + 25) = .95
In a cost matrix, positive numbers (costs) can be used to weigh against undesirable outcomes. Negative costs are interpreted as benefits, so negative numbers (benefits) can be used to weigh in favor of desirable outcomes.
Suppose you have calculated that it costs your business $1500 when you do not give an affinity card to a customer who would increase spending. Using the model with the confusion matrix shown in Figure 5-2, each false negative (misclassification of a responder) would cost $1500. Misclassifying a non-responder is less expensive to your business. You figure that each false positive (misclassification of a non-responder) would only cost $300.
You want to keep these costs in mind when you design a promotion campaign. You estimate that it will cost $10 to include a customer in the promotion. For this reason, you associate a benefit of $10 with each true negative prediction, because you can simply eliminate those customers from your promotion. Each customer that you eliminate represents a savings of $10. In your cost matrix, you would specify this benefit as -10, a negative cost.
Figure 5-3 shows how you would represent these costs and benefits in a cost matrix.
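As a sketch of how such costs could be supplied for scoring, the values from this example can be placed in a cost matrix table and attached to a model with ADD_COST_MATRIX (described later in this manual). The table layout follows the documented actual/predicted/cost convention; dt_model and affinity_costs are hypothetical names:

```sql
CREATE TABLE affinity_costs (
  actual_target_value    NUMBER,
  predicted_target_value NUMBER,
  cost                   NUMBER);

INSERT INTO affinity_costs VALUES (1, 0, 1500);  -- false negative: missed responder
INSERT INTO affinity_costs VALUES (0, 1,  300);  -- false positive: wasted card
INSERT INTO affinity_costs VALUES (0, 0,  -10);  -- true negative: $10 saved (a benefit)
INSERT INTO affinity_costs VALUES (1, 1,    0);  -- true positive: no adjustment
COMMIT;

BEGIN
  DBMS_DATA_MINING.ADD_COST_MATRIX('dt_model', 'affinity_costs');
END;
/
```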
With Oracle Data Mining you can specify costs to influence the scoring of any classification model. Decision Tree models can also use a cost matrix to influence the model build.
With Bayesian models, you can specify prior probabilities to offset differences in distribution between the build data and the real population (scoring data).
In many problems, one target value dominates in frequency. For example, the positive responses for a telephone marketing campaign may be 2% or less, and the occurrence of fraud in credit card transactions may be less than 1%. A classification model built on historic data of this type may not observe enough of the rare class to be able to distinguish the characteristics of the two classes; the result could be a model that when applied to new data predicts the frequent class for every case. While such a model may be highly accurate, it may not be very useful. This illustrates that it is not a good idea to rely solely on accuracy when judging the quality of a classification model.
To correct for unrealistic distributions in the training data, you can specify priors for the model build process. Other approaches to compensating for data distribution issues include stratified sampling and anomaly detection. (See Chapter 6.)
Oracle Data Mining provides the following algorithms for classification:
Decision Tree
Decision trees automatically generate rules, which are conditional statements that reveal the logic used to build the tree. See Chapter 11, "Decision Tree".
Naive Bayes
Naive Bayes uses Bayes' Theorem, a formula that calculates a probability by counting the frequency of values and combinations of values in the historical data. See Chapter 15, "Naive Bayes".
Generalized Linear Models (GLM)
GLM is a popular statistical technique for linear modeling. Oracle Data Mining implements GLM for binary classification and for regression.
GLM provides extensive coefficient statistics and model statistics, as well as row diagnostics. GLM also supports confidence bounds.
Support Vector Machine
Support Vector Machine (SVM) is a powerful, state-of-the-art algorithm based on linear and nonlinear regression. Oracle Data Mining implements SVM for binary and multiclass classification. See Chapter 18, "Support Vector Machines".
The nature of the data determines which classification algorithm will provide the best solution to a given problem. The algorithm can differ with respect to accuracy, time to completion, and transparency. In practice, it sometimes makes sense to develop several models for each algorithm, select the best model for each algorithm, and then choose the best of those for deployment.
This section describes new features in Oracle Data Mining. It includes the following sections:
The Oracle Data Mining Java API is deprecated in this release.
Note: Oracle recommends that you not use deprecated features in new applications. Support for deprecated features is for backward compatibility only.
Oracle Data Mining supports a new release of Oracle Data Miner. The earlier release, Oracle Data Miner Classic, is still available for download on OTN, but it is no longer under active development.
To download Oracle Data Miner 11g Release 2 (11.2.0.2), go to:
http://www.oracle.com/technetwork/database/options/odm/dataminerworkflow-168677.html
To download Oracle Data Miner Classic, go to:
http://www.oracle.com/technetwork/database/options/odm/downloads/odminer-097463.html
In Oracle Data Mining 11g Release 2 (11.2.0.2), you can import externally-created GLM models when they are presented as valid PMML documents. PMML is an XML-based standard for representing data mining models.
The IMPORT_MODEL procedure in the DBMS_DATA_MINING package is overloaded with syntax that supports PMML import. When invoked with this syntax, the IMPORT_MODEL procedure accepts a PMML document and translates the information into an Oracle Data Mining model. This includes creating and populating model tables as well as SYS model metadata.
External models imported in this way will be automatically enabled for Exadata scoring offload.
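A sketch of the PMML import path follows. The pmmldoc parameter name and the XMLType loading details are assumptions to verify against the PL/SQL Packages and Types Reference; the directory object, file name, and model name are hypothetical:

```sql
-- Loads a PMML document from a directory object and imports it
-- as an Oracle Data Mining model.
DECLARE
  v_pmml XMLTYPE;
BEGIN
  v_pmml := XMLTYPE(BFILENAME('PMML_DIR', 'glm_model.xml'),
                    NLS_CHARSET_ID('AL32UTF8'));
  DBMS_DATA_MINING.IMPORT_MODEL(
    model_name => 'PMML_GLM_MODEL',
    pmmldoc    => v_pmml);
END;
/
```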
See Also: Oracle Database PL/SQL Packages and Types Reference for details about the IMPORT_MODEL procedure
Mining Model schema objects
In Oracle 11g, Data Mining models are implemented as data dictionary objects in the SYS schema. A set of new data dictionary views present mining models and their properties. New system and object privileges control access to mining model objects.

In previous releases, Data Mining models were implemented as a collection of tables and metadata within the DMSYS schema. In Oracle 11g, the DMSYS schema no longer exists.
Automatic Data Preparation (ADP)
In most cases, data must be transformed using techniques such as binning, normalization, or missing value treatment before it can be mined. Data for build, test, and apply must undergo the exact same transformations.
In previous releases, data transformation was the responsibility of the user. In Oracle Database 11g, the data preparation process can be automated. Algorithm-appropriate transformation instructions are embedded in the model and automatically applied to the build data and scoring data. The automatic transformations can be complemented by or replaced with user-specified transformations.
Because they contain the instructions for their own data preparation, mining models are known as supermodels.
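A sketch of enabling ADP in a model settings table before calling CREATE_MODEL; the PREP_AUTO constants follow the documented convention and should be verified for your release, and model_settings is a placeholder name:

```sql
BEGIN
  -- Turn on Automatic Data Preparation for the model build
  INSERT INTO model_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);
  COMMIT;
END;
/
```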
Scoping of Nested Data and Enhanced Handling of Sparse Data
Oracle Data Mining supports nested data types for both categorical and numerical data. Multi-record case data must be transformed to nested columns for mining.
In Oracle Data Mining 10gR2, nested columns were processed as top-level attributes; the user was burdened with the task of ensuring that two nested columns did not contain an attribute with the same name. In Oracle Data Mining 11g, nested attributes are scoped with the column name, which relieves the user of this burden.
Handling of sparse data and missing values has been standardized across algorithms in Oracle Data Mining 11g. Data is sparse when a high percentage of the cells are empty but all the values are assumed to be known. This is the case in market basket data. When some cells are empty and their values are not known, they are assumed to be missing at random. Oracle Data Mining interprets missing data in a nested column as a sparse representation, and missing data in a non-nested column as missing at random.
In Oracle Data Mining 11g, Decision Tree and O-Cluster algorithms do not support nested data.
Generalized Linear Models
A new algorithm, Generalized Linear Models, is introduced in Oracle 11g. It supports two mining functions: classification (logistic regression) and regression (linear regression).
New SQL Data Mining Function
A new SQL Data Mining function, PREDICTION_BOUNDS, has been introduced for use with Generalized Linear Models. PREDICTION_BOUNDS returns the confidence bounds on predicted values (regression models) or predicted probabilities (classification).
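A sketch of PREDICTION_BOUNDS in a query; glm_model is a hypothetical GLM regression model, and the LOWER/UPPER attribute access follows the documented usage (verify against the SQL Language Reference):

```sql
-- Returns the prediction together with its lower and upper confidence bounds
SELECT cust_id,
       PREDICTION(glm_model USING *)              AS pred,
       PREDICTION_BOUNDS(glm_model USING *).LOWER AS lower_bound,
       PREDICTION_BOUNDS(glm_model USING *).UPPER AS upper_bound
  FROM mining_data_apply_v;
```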
Enhanced Support for Cost-Sensitive Decision Making
Cost matrix support is significantly enhanced in Oracle 11g. A cost matrix can be added to or removed from any classification model using the new procedures DBMS_DATA_MINING.ADD_COST_MATRIX and DBMS_DATA_MINING.REMOVE_COST_MATRIX.
The SQL Data Mining functions support new syntax for specifying an in-line cost matrix. With this new feature, cost-sensitive model results can be returned within a SQL statement even if the model does not have an associated cost matrix for scoring.
Only Decision Tree models can be built with a cost matrix.
Features Not Available in 11g Release 1 (11.1)
DMSYS schema
Oracle Data Mining Scoring Engine
In Oracle 10.2, you could use Database Configuration Assistant (DBCA) to configure the Data Mining option. In Oracle 11g, you do not need to use DBCA to configure the Data Mining option.
Basic Local Alignment Search Tool (BLAST)
Features Deprecated in 11g Release 1 (11.1)
Adaptive Bayes Network classification algorithm (replaced with Decision Tree)
The DM_USER_MODELS view and the functions that provide information about models, model signature, and model settings (for example, GET_MODEL_SETTINGS, GET_DEFAULT_SETTINGS, and GET_MODEL_SIGNATURE) are replaced by data dictionary views. See Oracle Data Mining Application Developer's Guide.
Part III provides basic conceptual information to help you understand the algorithms supported by Oracle Data Mining. In cases where more than one algorithm is available for a given mining function, the information in these chapters should help you make the most appropriate choice. Also, if you have a general understanding of the workings of an algorithm, you will be better prepared to optimize its use with tuning parameters and data preparation.
Part III contains the following chapters:
Concepts
11g Release 2 (11.2)
E16808-06
July 2011
Oracle Data Mining Concepts, 11g Release 2 (11.2)
E16808-06
Copyright © 2005, 2011, Oracle and/or its affiliates. All rights reserved.
Primary Author: Kathy L. Taylor
This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited.
The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing.
If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, the following notice is applicable:
U.S. GOVERNMENT RIGHTS Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are "commercial computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, the use, duplication, disclosure, modification, and adaptation shall be subject to the restrictions and license terms set forth in the applicable Government contract, and, to the extent applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software License (December 2007). Oracle America, Inc., 500 Oracle Parkway, Redwood City, CA 94065.
This software or hardware is developed for general use in a variety of information management applications. It is not developed or intended for use in any inherently dangerous applications, including applications that may create a risk of personal injury. If you use this software or hardware in dangerous applications, then you shall be responsible to take all appropriate fail-safe, backup, redundancy, and other measures to ensure its safe use. Oracle Corporation and its affiliates disclaim any liability for any damages caused by use of this software or hardware in dangerous applications.
Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.
Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. AMD, Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open Group.
This software or hardware and documentation may provide access to or information on content, products, and services from third parties. Oracle Corporation and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party content, products, and services. Oracle Corporation and its affiliates will not be responsible for any loss, costs, or damages incurred due to your access to or use of third-party content, products, or services.
Part I presents an introduction to Oracle Data Mining and Oracle predictive analytics. The first chapter is a general, high-level overview for those who are new to these technologies.
Part I contains the following chapters:
This chapter describes the feature selection and extraction mining functions. Oracle Data Mining supports a supervised form of feature selection and an unsupervised form of feature extraction.
This chapter contains the following sections:
Sometimes too much information can reduce the effectiveness of data mining. Some of the attribute columns assembled for building and testing a model may not contribute meaningful information to the model. Some may actually detract from the quality and accuracy of the model.
For example, you might collect a great deal of data about a given population because you want to predict the likelihood of a certain illness within this group. Some of this information, perhaps much of it, will have little or no effect on susceptibility to the illness. Attributes such as the number of cars per household may have no effect whatsoever.
Irrelevant attributes add noise to the data and affect model accuracy. Noise increases the size of the model and the time and system resources needed for model building and scoring.
Moreover, data sets with many attributes may contain groups of attributes that are correlated. These attributes may actually be measuring the same underlying feature. Their presence together in the build data can skew the logic of the algorithm and affect the accuracy of the model.
Wide data (many attributes) generally presents processing challenges for data mining algorithms. Model attributes are the dimensions of the processing space used by the algorithm. The higher the dimensionality of the processing space, the higher the computation cost involved in algorithmic processing.
To minimize the effects of noise, correlation, and high dimensionality, some form of dimension reduction is sometimes a desirable preprocessing step for data mining. Feature selection and extraction are two approaches to dimension reduction.
Feature selection — Selecting the most relevant attributes
Feature extraction — Combining attributes into a new reduced set of features
Oracle Data Mining supports feature selection in the attribute importance mining function. Attribute importance is a supervised function that ranks attributes according to their significance in predicting a target.
Finding the most significant predictors is the goal of some data mining projects. For example, a model might seek to find the principal characteristics of clients who pose a high credit risk.
Attribute importance is also useful as a preprocessing step in classification modeling, especially for models that use Naive Bayes or Support Vector Machine. The Decision Tree algorithm includes components that rank attributes as part of the model build.
Oracle Data Mining does not support the scoring operation for attribute importance. The results of attribute importance are the attributes of the build data ranked according to their predictive influence. The ranking and the measure of importance can be used for selecting attributes.
Feature extraction is an attribute reduction process. Unlike feature selection, which ranks the existing attributes according to their predictive significance, feature extraction actually transforms the attributes. The transformed attributes, or features, are linear combinations of the original attributes.
The feature extraction process results in a much smaller and richer set of attributes. The maximum number of features may be user-specified or determined by the algorithm. By default, it is determined by the algorithm.
Models built on extracted features may be of higher quality, because the data is described by fewer, more meaningful attributes.
Feature extraction projects a data set with higher dimensionality onto a smaller number of dimensions. As such it is useful for data visualization, since a complex data set can be effectively visualized when it is reduced to two or three dimensions.
Some applications of feature extraction are latent semantic analysis, data compression, data decomposition and projection, and pattern recognition. Feature extraction can also be used to enhance the speed and effectiveness of supervised learning.
Feature extraction can be used to extract the themes of a document collection, where documents are represented by a set of key words and their frequencies. Each theme (feature) is represented by a combination of keywords. The documents in the collection can then be expressed in terms of the discovered themes.
Oracle Data Mining uses the Minimum Description Length (MDL) algorithm for feature selection (attribute importance).
Oracle Data Mining uses the Non-Negative Matrix Factorization (NMF) algorithm for feature extraction.
See Oracle Data Mining Application Developer's Guide for information about feature extraction for text mining.
active learning
A feature of the Support Vector Machine algorithm that provides a way to deal with large training data sets.
aggregation
The process of consolidating data values into a smaller number of values. For example, sales data could be collected on a daily basis and then be totalled to the week level.
algorithm
A sequence of steps for solving a problem. See data mining algorithm. The Oracle Data Mining programmatic interfaces support the following algorithms: MDL, Apriori, Decision Tree, k-Means, Naive Bayes, GLM, O-Cluster, and Support Vector Machine.
anomaly detection
The detection of outliers or atypical cases. To build an anomaly detection model using the Data Mining programmatic interfaces, specify classification as the mining function, SVM as the algorithm, and pass a NULL or empty string as the target column name.
apply
The data mining operation that scores data, that is, uses the model with new data to predict results.
association rules
A mining function that captures co-occurrence of items among transactions. A typical rule is an implication of the form A -> B, which means that the presence of itemset A implies the presence of itemset B with certain support and confidence. The support of the rule is the ratio of the number of transactions where the itemsets A and B are present to the total number of transactions. The confidence of the rule is the ratio of the number of transactions where the itemsets A and B are present to the number of transactions where itemset A is present. Oracle Data Mining uses the Apriori algorithm for association models.
attribute
An attribute is a predictor in a predictive model or an item of descriptive information in a descriptive model. Data attributes are the columns used to build a model. Data attributes undergo transformations so that they can be used as categoricals or numericals by the model. Categoricals and numericals are model attributes. See also target.
attribute importance
A mining function providing a measure of the importance of an attribute in predicting a specified target. The measure of different attributes of a training data table enables users to select the attributes that are found to be most relevant to a mining model. A smaller set of attributes results in a faster model build; the resulting model could be more accurate. Oracle Data Mining uses the Minimum Description Length to discover important attributes. Sometimes referred to as feature selection or key fields.
Automatic Data Transformation
Mining models can be created in Automatic Data Preparation (ADP) mode. ADP transforms the build data according to the requirements of the algorithm, embeds the transformation instructions in the model, and uses the instructions to transform the test or scoring data when the model is applied.
case
All the data collected about a specific transaction or related set of values. A data set is a collection of cases. Cases are also called records or examples. In the simplest situation, a case corresponds to a row in a table.
case table
A table or view in single-record case format. All the data for each case is contained in a single row. The case table may include a case ID column that holds a unique identifier for each row. Mining data must be presented as a case table.
categorical attribute
An attribute whose values correspond to discrete categories. For example, state is a categorical attribute with discrete values (CA, NY, MA). Categorical attributes are either non-ordered (nominal) like state or gender, or ordered (ordinal) such as high, medium, or low temperatures.
classification
A mining function for predicting categorical target values for new records using a model built from records with known target values. Oracle Data Mining supports the following algorithms for classification: Naive Bayes, Decision Tree, and Support Vector Machines.
cluster centroid
The vector that encodes, for each attribute, either the mean (if the attribute is numerical) or the mode (if the attribute is categorical) of the cases in the training data assigned to a cluster. A cluster centroid is often referred to as "the centroid."
clustering
A mining function for finding naturally occurring groupings in data. More precisely, given a set of data points, each having a set of attributes, and a similarity measure among them, clustering is the process of grouping the data points into different clusters such that data points in the same cluster are more similar to one another and data points in different clusters are less similar to one another. Oracle Data Mining supports two algorithms for clustering, k-Means and Orthogonal Partitioning Clustering.
confusion matrix
Measures the correctness of predictions made by a model from a test task. The row indexes of a confusion matrix correspond to actual values observed and provided in the test data. The column indexes correspond to predicted values produced by applying the model to the test data. For any pair of actual/predicted indexes, the value indicates the number of records classified in that pairing.
When predicted value equals actual value, the model produces correct predictions. All other entries indicate errors.
cost matrix
An n by n table that defines the cost associated with a prediction versus the actual value. A cost matrix is typically used in classification models, where n is the number of distinct values in the target, and the columns and rows are labeled with target values. The rows are the actual values; the columns are the predicted values.
counterexample
Negative instance of a target. Counterexamples are required for classification models, except for one-class Support Vector Machines.
data mining
Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also known as Knowledge Discovery in Data (KDD).
A data mining model implements a data mining algorithm to solve a given type of problem for a given set of data.
data mining algorithm
A specific technique or procedure for producing a data mining model. An algorithm uses a specific data representation and a specific mining function.
The algorithms in the Oracle Data Mining programming interfaces are Naive Bayes, Support Vector Machine, Generalized Linear Model, and Decision Tree for classification; Support Vector Machine and Generalized Linear Model for regression; k-Means and O-Cluster for clustering; Minimum Description Length for attribute importance; Non-Negative Matrix Factorization for feature extraction; Apriori for associations, and one-class Support Vector Machine for anomaly detection.
data mining server
The component of the Oracle database that implements the data mining engine and persistent metadata repository. You must connect to a data mining server before performing data mining tasks.
descriptive model
A descriptive model helps in understanding underlying processes or behavior. For example, an association model describes consumer behavior. See also mining model.
discretization
Discretization groups related values together under a single value (or bin). This reduces the number of distinct values in a column. Fewer bins result in models that build faster. Many Oracle Data Mining algorithms (for example NB) may benefit from input data that is discretized prior to model building, testing, computing lift, and applying (scoring). Different algorithms may require different types of binning. Oracle Data Mining includes transformations that perform top N frequency binning for categorical attributes and equi-width binning and quantile binning for numerical attributes.
distance-based (clustering algorithm)
Distance-based algorithms rely on a distance metric (function) to measure the similarity between data points. Data points are assigned to the nearest cluster according to the distance metric used.
Decision Tree
A decision tree is a representation of a classification system or supervised model. The tree is structured as a sequence of questions; the answers to the questions trace a path down the tree to a leaf, which yields the prediction.
Decision trees are a way of representing a series of questions that lead to a class or value. The top node of a decision tree is called the root node; terminal nodes are called leaf nodes. Decision trees are grown through an iterative splitting of data into discrete groups, where the goal is to maximize the distance between groups at each split.
An important characteristic of the decision tree models is that they are transparent; that is, there are rules that explain the classification.
See also rule.
equi-width binning
Equi-width binning determines bins for numerical attributes by dividing the range of values into a specified number of bins of equal size.
explode
For a categorical attribute, replace a multi-value categorical column with several binary categorical columns. To explode the attribute, create a new binary column for each distinct value that the attribute takes on. In the new columns, 1 indicates that the value of the attribute takes on the value of the column; 0, that it does not. For example, suppose that a categorical attribute takes on the values {1, 2, 3}. To explode this attribute, create three new columns, col_1, col_2, and col_3. If the attribute takes on the value 1, the value in col_1 is 1; the values in the other two columns are 0.
feature
A combination of attributes in the data that is of special interest and that captures important characteristics of the data. See feature extraction.
See also text feature.
feature extraction
Creates a new set of features by decomposing the original data. Feature extraction lets you describe the data with a number of features that is usually far smaller than the number of original attributes. See also Non-Negative Matrix Factorization.
Generalized Linear Model
A statistical technique for linear modeling. Generalized linear models (GLM) include and extend the class of simple linear models. Oracle Data Mining supports logistic regression for GLM classification and linear regression for GLM regression.
k-Means
A distance-based clustering algorithm that partitions the data into a predetermined number of clusters (provided there are enough distinct cases). Distance-based algorithms rely on a distance metric (function) to measure the similarity between data points. Data points are assigned to the nearest cluster according to the distance metric used. Oracle Data Mining provides an enhanced version of k-Means.
lift
A measure of how much better prediction results are using a model than could be obtained by chance. For example, suppose that 2% of the customers mailed a catalog make a purchase; suppose also that when you use a model to select catalog recipients, 10% make a purchase. Then the lift for the model is 10/2 or 5. Lift may also be used as a measure to compare different data mining models. Since lift is computed using a data table with actual outcomes, lift compares how well a model performs with respect to this data on predicted outcomes. Lift indicates how well the model improved the predictions over a random selection given actual results. Lift allows a user to infer how a model will perform on new data.
lineage
The sequence of transformations performed on a data set during the data preparation phase of the model build process.
min-max normalization
Normalize numerical attributes using this transformation:
x_new = (x_old - min) / (max - min)
Minimum Description Length
Given a sample of data and an effective enumeration of the appropriate alternative theories to explain the data, the best theory is the one that minimizes the sum of
The length, in bits, of the description of the theory
The length, in bits, of the data when encoded with the help of the theory
This principle is used to select the attributes that most influence target value discrimination in attribute importance.
mining function
A major subdomain of data mining that shares common high level characteristics. The Oracle Data Mining programming interfaces support the following mining functions: classification, regression, attribute importance, feature extraction, and clustering. In both programming interfaces, anomaly detection is supported as classification.
mining model
An important function of data mining is the production of a model. A model can be a supervised model or an unsupervised model. Technically, a mining model is the result of building a model from mining settings. The representation of the model is specific to the algorithm specified by the user or selected by the DMS. A model can be used for direct inspection, for example, to examine the rules produced from an association model, or to score data.
mining result
The end product(s) of a mining task. For example, a build task produces a mining model; a test task produces a test result.
missing value
A data value that is missing at random. It could be missing because it is unavailable, unknown, or because it was lost. Oracle Data Mining interprets missing values in columns with simple data types (not nested) as missing at random. Oracle Data Mining interprets missing values in nested columns as sparse.
Data mining algorithms vary in the way they treat missing values. There are several typical ways to treat them: ignore them, omit any records containing missing values, replace missing values with the mode or mean, or infer missing values from existing values. See also sparse data.
multi-record case
Each case in the data table is stored in multiple rows. Also known as transactional data. See also single-record case.
Naive Bayes
An algorithm for classification that is based on Bayes's theorem. Naive Bayes makes the assumption that each attribute is conditionally independent of the others: given a particular value of the target, the distribution of each predictor is independent of the other predictors.
nested data
Oracle Data Mining supports transactional data in nested columns of name/value pairs. Multidimensional data that expresses a one-to-many relationship can be loaded into a nested column and mined along with single-record case data in a case table.
Non-Negative Matrix Factorization
A feature extraction algorithm that decomposes multivariate data by creating a user-defined number of features, which results in a reduced representation of the original data.
normalization
Normalization consists of transforming numerical values into a specific range, such as [-1.0, 1.0] or [0.0, 1.0], using a transformation of the form x_new = (x_old - shift) / scale. Normalization applies only to numerical attributes. Oracle Data Mining provides transformations that perform min-max normalization, scale normalization, and z-score normalization.
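For illustration, a minimal Python sketch of the shift/scale pattern behind these three normalizations (the data values are hypothetical):

```python
import statistics

values = [10.0, 20.0, 35.0, 50.0, 100.0]  # hypothetical numerical attribute

# Min-max normalization: shift = min, scale = max - min; result lies in [0.0, 1.0].
lo, hi = min(values), max(values)
min_max = [(x - lo) / (hi - lo) for x in values]

# Scale normalization: shift = 0, scale = max(|max|, |min|).
scale = max(abs(hi), abs(lo))
scaled = [x / scale for x in values]

# Z-score normalization: shift = mean, scale = standard deviation.
mu, sigma = statistics.mean(values), statistics.stdev(values)
z_scores = [(x - mu) / sigma for x in values]
```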
numerical attribute
An attribute whose values are numbers. The numeric value can be either an integer or a real number. Numerical attribute values can be manipulated as continuous values. See also categorical attribute.
one-class Support Vector Machine
The version of the Support Vector Machine model used to solve anomaly detection problems. The Oracle Data Mining programmatic interfaces implement the one-class algorithm as classification.
Orthogonal Partitioning Clustering
An Oracle proprietary clustering algorithm that creates a hierarchical grid-based clustering model, that is, it creates axis-parallel (orthogonal) partitions in the input attribute space. The algorithm operates recursively. The resulting hierarchical structure represents an irregular grid that tessellates the attribute space into clusters.
outlier
A data value that does not come from the typical population of data; in other words, an extreme value. In a normal distribution, outliers are typically at least 3 standard deviations from the mean.
positive target value
In binary classification problems, you may designate one of the two classes (target values) as positive, the other as negative. When Oracle Data Mining computes a model's lift, it calculates the density of positive target values among a set of test instances for which the model predicts positive values with a given degree of confidence.
predictive model
A predictive model is an equation or set of rules that makes it possible to predict an unseen or unmeasured value (the dependent variable or output) from other, known values (independent variables or input). The form of the equation or rules is suggested by mining data collected from the process under study. Some training or estimation technique is used to estimate the parameters of the equation or rules. A predictive model is a supervised model.
prepared data
Data that is suitable for model building using a specified algorithm. Data preparation often accounts for much of the time spent in a data mining project. Oracle Data Mining supports transformations such as binning, normalization, and missing value treatment. Oracle Data Mining can automatically perform algorithm-appropriate transformations when Automatic Data Preparation is enabled.
prior probabilities
The set of prior probabilities specifies the distribution of examples of the various classes in the original source data. Also referred to as priors, these could be different from the distribution observed in the data set provided for model build.
quantile binning
A numerical attribute is divided into bins such that each bin contains approximately the same number of cases.
random sample
A sample in which every element of the data set has an equal chance of being selected.
recode
Literally "change or rearrange the code." Recoding can be useful in many instances in data mining. Here are some examples:
Missing values treatment: Missing values may be indicated by something other than NULL, such as "0000", "9999", "NA", or some other string. One way to treat the missing value is to recode, for example, "0000" to NULL. Then the Oracle Data Mining algorithms and the database recognize the value as missing.
Change data type of variable: For example, change "Y" or "Yes" to 1 and "N" or "No" to 0.
Establish a cutoff value: For example, recode all incomes less than $20,000 to the same value.
Group items: For example, group individual US states into regions. The "New England region" might consist of ME, VT, NH, MA, CT, and RI; to implement this, recode the six states to, say, NE (for New England).
regression
A data mining function for predicting continuous target values for new records using a model built from records with known target values. Oracle Data Mining supports linear regression (GLM) and Support Vector Machine algorithms for regression.
rule
An expression of the general form if X, then Y. An output of certain algorithms, such as clustering, association, and decision tree. The predicate X may be a compound predicate.
scale normalization
Normalize numerical attributes using this transformation:
x_new = x_old / max(abs(max), abs(min))
schema
A collection of objects in an Oracle database, including logical structures such as tables, views, sequences, stored procedures, synonyms, indexes, clusters, and database links. A schema is associated with a specific database user.
single-record case
Each case in the data table is stored in one row. Contrast with multi-record case.
sparse data
Data for which only a small fraction of the attributes are non-zero or non-null in any given case. Market basket data and text mining data are typically sparse. Oracle Data Mining interprets nested data as sparse. See also missing value.
split
Divide a data set into several disjoint subsets. For example, in a classification problem, a data set is often divided into a training data set and a test data set.
stratified sample
Divide the data set into disjoint subsets (strata) and then take a random sample from each of the subsets. This technique is used when the distribution of target values is greatly skewed. For example, response to a marketing campaign may have a positive target value 1% of the time or less. A stratified sample provides the data mining algorithms with enough positive examples to learn the factors that differentiate positive from negative target values. See also random sample.
supermodel
A mining model that contains instructions for its own data preparation. Oracle Data Mining provides Automatic Data Preparation and embedded data transformation, which together provide support for supermodels.
supervised model
A data mining model that is built using a known dependent variable, also referred to as the target. Classification and regression techniques are examples of supervised mining. See unsupervised model. Also referred to as predictive model.
Support Vector Machine
An algorithm that uses machine learning theory to maximize predictive accuracy while automatically avoiding over-fit to the data. Support Vector Machines can make predictions with sparse data, that is, in domains that have a large number of predictor columns and relatively few rows, as is the case with bioinformatics data. Support Vector Machines can be used for classification, regression, and anomaly detection.
table
The basic unit of data storage in an Oracle database. Table data is stored in rows and columns.
target
In supervised learning, the identified attribute that is to be predicted. Sometimes called target value or target attribute. See also attribute.
text feature
A combination of words that captures important attributes of a document or class of documents. Text features are usually keywords, frequencies of words, or other document-derived features. A document typically contains a large number of words and a much smaller number of features.
text mining
Conventional data mining done using text features. Text features are usually keywords, frequencies of words, or other document-derived features. Once you derive text features, you mine them just as you would any other data. Both Oracle Data Mining and Oracle Text support text mining.
top N frequency binning
Top N frequency binning bins categorical attributes. The bin definitions for each attribute are computed from the occurrence frequencies of the values observed in the data. The user specifies the number of bins, say N. The bins bin_1, ..., bin_N correspond to the N values with the highest frequencies; the bin bin_N+1 corresponds to all remaining values.
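A minimal sketch of the technique (the data is hypothetical; Oracle Data Mining computes the bins internally):

```python
from collections import Counter

values = ["CA", "NY", "CA", "TX", "CA", "NY", "WA", "OR", "TX", "CA"]
n = 3  # number of top-frequency bins requested by the user

# The N most frequent values each get their own bin; everything else
# falls into the extra bin bin_N+1 (labeled "OTHER" here).
top_n = {v for v, _ in Counter(values).most_common(n)}
bins = [v if v in top_n else "OTHER" for v in values]
print(bins)
```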
transactional data
The data for one case is contained in several rows. An example is market basket data, in which a case represents one basket that contains multiple items. Oracle Data Mining supports transactional data in nested columns of attribute name/value pairs. See also nested data, multi-record case, and single-record case.
transformation
A function applied to data resulting in a new representation of the data. For example, discretization and normalization are transformations on data.
trimming
A technique used for dealing with outliers. Trimming removes values in the tails of a distribution; the trimmed values are ignored in further computations. This is achieved by setting the tails to NULL.
unstructured data
Images, audio, video, geospatial mapping data, and documents or text data are collectively known as unstructured data. Oracle Data Mining supports the mining of unstructured text data.
unsupervised model
A data mining model built without the guidance (supervision) of a known, correct result. In supervised learning, this correct result is provided in the target attribute. Unsupervised learning has no such target attribute. Clustering and association are examples of unsupervised mining functions. See supervised model.
view
A view takes the output of a query and treats it as a table. Therefore, a view can be thought of as a stored query or a virtual table. You can use views in most places where a table can be used.
winsorizing
A way of dealing with outliers. Winsorizing sets the tail values of a particular attribute to a specified value. For example, in a 90% winsorization, the bottom 5% of values are set to the value at the 5th percentile, and the top 5% of values are set to the value at the 95th percentile.
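The contrast with trimming can be illustrated in a short sketch (assumes NumPy; the data and percentile choices are hypothetical):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
p5, p95 = np.percentile(x, [5, 95])

# Winsorizing: clamp the tails to the percentile boundary values.
winsorized = np.clip(x, p5, p95)

# Trimming: set the tails to NaN (standing in for NULL) so that they
# are ignored in further computations.
trimmed = np.where((x < p5) | (x > p95), np.nan, x)
```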
This chapter describes association, the unsupervised mining function for discovering association rules.
This chapter contains the following topics:
Association is a data mining function that discovers the probability of the co-occurrence of items in a collection. The relationships between co-occurring items are expressed as association rules.
The results of an association model are the rules that identify patterns of association within the data. Oracle Data Mining does not support the scoring operation for association modeling.
Association rules are ranked by two metrics: support and confidence. Support is the fraction of transactions that contain all the items in the rule; confidence is the conditional probability that a transaction containing the antecedent also contains the consequent.
Association rules are often used to analyze sales transactions. For example, it might be noted that customers who buy cereal at the grocery store often buy milk at the same time. In fact, association analysis might find that 85% of the checkout sessions that include cereal also include milk. This relationship could be formulated as the following rule.
Cereal implies milk with 85% confidence
This application of association modeling is called market-basket analysis. It is valuable for direct marketing, sales promotions, and for discovering business trends. Market-basket analysis can also be used effectively for store layout, catalog design, and cross-sell.
Association modeling has important applications in other domains as well. For example, in e-commerce applications, association rules may be used for Web page personalization. An association model might find that a user who visits pages A and B is 70% likely to also visit page C in the same session. Based on this rule, a dynamic link could be created for users who are likely to be interested in page C. The association rule could be expressed as follows.
A and B imply C with 70% confidence
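Rules like these can be checked against raw transactions by counting co-occurrence. A minimal sketch (the baskets are hypothetical):

```python
# Each basket is the set of items in one transaction (hypothetical data).
baskets = [
    {"cereal", "milk", "bread"},
    {"cereal", "milk"},
    {"cereal", "eggs"},
    {"milk", "bread"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

# Confidence of "cereal implies milk" = support(cereal, milk) / support(cereal).
confidence = support({"cereal", "milk"}) / support({"cereal"})
print(confidence)  # 2/3 for these baskets
```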
Unlike other data mining functions, association is transaction-based. In transaction processing, a case includes a collection of items such as the contents of a market basket at the checkout counter. The collection of items in the transaction is an attribute of the transaction. Other attributes might be a timestamp or user ID associated with the transaction.
Transactional data, also known as market-basket data, is said to be in multi-record case format because a set of records (rows) constitute a case. For example, in Figure 8-1, case 11 is made up of three rows while cases 12 and 13 are each made up of four rows.
Nontransactional data is said to be in single-record case format because a single record (row) constitutes a case. In Oracle Data Mining, association models can be built using either transactional or nontransactional data. If the data is nontransactional, it must be transformed to a nested column before association mining can be performed.
This manual describes the features of Oracle Data Mining, a comprehensive data mining solution within Oracle Database. It explains the data mining algorithms and lays a conceptual foundation for much of the procedural information contained in other manuals. (See "Related Documentation".)
The preface contains these topics:
Oracle Data Mining Concepts is intended for analysts, application developers, and data mining specialists.
For information about Oracle's commitment to accessibility, visit the Oracle Accessibility Program website at http://www.oracle.com/pls/topic/lookup?ctx=acc&id=docacc.
Access to Oracle Support
Oracle customers have access to electronic support through My Oracle Support. For information, visit http://www.oracle.com/pls/topic/lookup?ctx=acc&id=info or visit http://www.oracle.com/pls/topic/lookup?ctx=acc&id=trs if you are hearing impaired.
The documentation set for Oracle Data Mining is part of the Oracle Database 11g Release 2 (11.2) Online Documentation Library. The Oracle Data Mining documentation set consists of the following:
Oracle Data Mining Java API Reference (javadoc)
Oracle Data Mining API Reference (virtual book)
Oracle Database PL/SQL Packages and Types Reference
DBMS_DATA_MINING
DBMS_DATA_MINING_TRANSFORM
DBMS_PREDICTIVE_ANALYTICS
Oracle Database SQL Language Reference
Data Mining Functions
The following text conventions are used in this document:
Convention | Meaning |
---|---|
boldface | Boldface type indicates graphical user interface elements associated with an action, or terms defined in text or the glossary. |
italic | Italic type indicates book titles, emphasis, or placeholder variables for which you supply particular values. |
monospace | Monospace type indicates commands within a paragraph, URLs, code in examples, text that appears on the screen, or text that you enter. |
In Part V, you will learn how to use Oracle Data Mining to mine text and other forms of unstructured data.
Part V contains the following chapters:
This chapter describes Support Vector Machines, a powerful algorithm based on statistical learning theory. Support Vector Machines is implemented by Oracle Data Mining for classification, regression, and anomaly detection.
Reference: Milenova, B.L., Yarmus, J.S., Campos, M.M., "SVM in Oracle Database 10g: Removing the Barriers to Widespread Adoption of Support Vector Machines", Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005. |
This chapter contains the following sections:
Support Vector Machines (SVM) is a powerful, state-of-the-art algorithm with strong theoretical foundations based on the Vapnik-Chervonenkis theory. SVM has strong regularization properties; regularization refers to techniques that enable the model to generalize well to new data.
SVM models have similar functional form to neural networks and radial basis functions, both popular data mining techniques. However, neither of these algorithms has the well-founded theoretical approach to regularization that forms the basis of SVM. The quality of generalization and ease of training of SVM is far beyond the capacities of these more traditional methods.
SVM can model complex, real-world problems such as text and image classification, hand-writing recognition, and bioinformatics and biosequence analysis.
SVM performs well on data sets that have many attributes, even if there are very few cases on which to train the model. There is no upper limit on the number of attributes; the only constraints are those imposed by hardware. Traditional neural nets do not perform well under these circumstances.
Oracle Data Mining has its own proprietary implementation of SVM, which exploits the many benefits of the algorithm while compensating for some of the limitations inherent in the SVM framework. Oracle Data Mining SVM provides the scalability and usability that are needed in a production quality data mining system.
Usability is a major enhancement, because SVM has often been viewed as a tool for experts. The algorithm typically requires data preparation, tuning, and optimization. Oracle Data Mining minimizes these requirements. You do not need to be an expert to build a quality SVM model in Oracle Data Mining. For example:
Data preparation is not required in most cases. (See "Data Preparation for SVM" .)
Default tuning parameters are generally adequate. (See "Tuning an SVM Model" .)
When dealing with very large data sets, sampling is often required. However, sampling is not required with Oracle Data Mining SVM, because the algorithm itself uses stratified sampling to reduce the size of the training data as needed.
Oracle Data Mining SVM is highly optimized. It builds a model incrementally by optimizing small working sets toward a global solution. The model is trained until convergence on the current working set, then the model adapts to the new data. The process continues iteratively until the convergence conditions are met. The Gaussian kernel uses caching techniques to manage the working sets. See "Kernel-Based Learning".
Oracle Data Mining SVM supports active learning, an optimization method that builds a smaller, more compact model while reducing the time and memory resources required for training the model. See "Active Learning".
SVM is a kernel-based algorithm. A kernel is a function that transforms the input data to a high-dimensional space where the problem is solved. Kernel functions can be linear or nonlinear.
Oracle Data Mining supports linear and Gaussian (nonlinear) kernels.
In Oracle Data Mining, the linear kernel function reduces to a linear equation on the original attributes in the training data. A linear kernel works well when there are many attributes in the training data.
The Gaussian kernel transforms each case in the training data to a point in an n-dimensional space, where n is the number of cases. The algorithm attempts to separate the points into subsets with homogeneous target values. The Gaussian kernel uses nonlinear separators, but within the kernel space it constructs a linear equation.
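For illustration, both kernels have simple closed forms. A minimal Python sketch (the standard deviation value is hypothetical; Oracle Data Mining selects the kernel and its parameters internally):

```python
import math

def linear_kernel(x, y):
    # Dot product: equivalent to a linear equation on the original attributes.
    return sum(a * b for a, b in zip(x, y))

def gaussian_kernel(x, y, std_dev=1.0):
    # Similarity decays with squared distance; std_dev controls the spread.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * std_dev ** 2))
```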
Active learning is an optimization method for controlling model growth and reducing model build time. Without active learning, SVM models grow as the size of the build data set increases, which effectively limits SVM models to small and medium size training sets (less than 100,000 cases). Active learning provides a way to overcome this restriction. With active learning, SVM models can be built on very large training sets.
Active learning forces the SVM algorithm to restrict learning to the most informative training examples and not to attempt to use the entire body of data. In most cases, the resulting models have predictive accuracy comparable to that of a standard (exact) SVM model.
Active learning provides a significant improvement in both linear and Gaussian SVM models, whether for classification, regression, or anomaly detection. However, active learning is especially advantageous for the Gaussian kernel, because nonlinear models can otherwise grow to be very large and can place considerable demands on memory and other system resources.
SVM has built-in mechanisms that automatically choose appropriate settings based on the data. You may need to override the system-determined settings for some domains.
The build settings described in Table 18-1 are available for configuring SVM models. Settings pertain to regression, classification, and anomaly detection unless otherwise specified.
Table 18-1 Build Settings for Support Vector Machines
Setting | Description |
---|---|
Kernel | Linear or Gaussian. The algorithm automatically uses the kernel function that is most appropriate to the data. SVM uses the linear kernel when there are many attributes (more than 100) in the training data; otherwise it uses the Gaussian kernel. See "Kernel-Based Learning". The number of attributes does not correspond to the number of columns in the training data. SVM explodes categorical attributes to binary, numeric attributes. In addition, Oracle Data Mining interprets each row in a nested column as a separate attribute. See "Data Preparation for SVM". |
Standard deviation for Gaussian kernel | Controls the spread of the Gaussian kernel function. SVM uses a data-driven approach to find a standard deviation value that is on the same scale as distances between typical cases. |
Cache size for Gaussian kernel | Amount of memory allocated to the Gaussian kernel cache maintained in memory to improve model build time. The default cache size is 50 MB. |
Active learning | Whether or not to use active learning. This setting is especially important for nonlinear (Gaussian) SVM models. By default, active learning is enabled. See "Active Learning". |
Complexity factor | Regularization setting that balances the complexity of the model against model robustness to achieve good generalization on new data. SVM uses a data-driven approach to find the complexity factor. |
Convergence tolerance | The criterion for completing the model training process. The default is 0.001. |
Epsilon factor for regression | Regularization setting for regression, similar to the complexity factor. Epsilon specifies the allowable residuals, or noise, in the data. |
Outliers for anomaly detection | The expected outlier rate in anomaly detection. The default rate is 0.1. |
The SVM algorithm operates natively on numeric attributes. The algorithm automatically "explodes" categorical data into a set of binary attributes, one per category value. For example, a character column for marital status with values married or single would be transformed to two numeric attributes: married and single. The new attributes could have the value 1 (true) or 0 (false).
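A minimal sketch of this explosion (the column values are hypothetical; the algorithm performs the transformation automatically):

```python
rows = ["married", "single", "single", "married"]  # character column

categories = sorted(set(rows))  # ["married", "single"]

# One binary attribute per category value: 1 (true) or 0 (false).
exploded = [{c: int(value == c) for c in categories} for value in rows]
print(exploded[0])  # {'married': 1, 'single': 0}
```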
When there are missing values in columns with simple data types (not nested), SVM interprets them as missing at random. The algorithm automatically replaces missing categorical values with the mode and missing numerical values with the mean.
When there are missing values in nested columns, SVM interprets them as sparse. The algorithm automatically replaces sparse numerical data with zeros and sparse categorical data with zero vectors.
SVM requires the normalization of numeric input. Normalization places the values of numeric attributes on the same scale and prevents attributes with a large original scale from biasing the solution. Normalization also minimizes the likelihood of overflows and underflows. Furthermore, normalization brings the numerical attributes to the same scale (0,1) as the exploded categorical data.
The SVM algorithm automatically handles missing value treatment and the transformation of categorical data, but normalization and outlier detection must be handled by ADP or prepared manually. ADP performs min-max normalization for SVM.
Note: Oracle recommends that you use Automatic Data Preparation with SVM. The transformations performed by ADP are appropriate for most models. |
SVM classification is based on the concept of decision planes that define decision boundaries. A decision plane is one that separates a set of objects having different class memberships. SVM finds the vectors ("support vectors") that define the separators giving the widest separation of classes.
SVM classification supports both binary and multiclass targets.
In SVM classification, weights are a biasing mechanism for specifying the relative importance of target values (classes).
SVM models are automatically initialized to achieve the best average prediction across all classes. However, if the training data does not represent a realistic distribution, you can bias the model to compensate for class values that are under-represented. If you increase the weight for a class, the percent of correct predictions for that class should increase.
Oracle Data Mining uses SVM as the one-class classifier for anomaly detection. When SVM is used for anomaly detection, it has the classification mining function but no target.
One-class SVM models, when applied, produce a prediction and a probability for each case in the scoring data. If the prediction is 1, the case is considered typical. If the prediction is 0, the case is considered anomalous. This behavior reflects the fact that the model is trained with normal data.
You can specify the percentage of the data that you expect to be anomalous with the SVMS_OUTLIER_RATE build setting. If you have some knowledge that the number of "suspicious" cases is a certain percentage of your population, then you can set the outlier rate to that percentage. The model will identify approximately that many "rare" cases when applied to the general population. The default is 10%, which is probably high for many anomaly detection problems.
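Oracle's one-class implementation is internal to the database. As an analogy only, scikit-learn's OneClassSVM exposes a nu parameter that plays a role similar to the outlier rate (note that scikit-learn labels anomalies -1 rather than 0):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal_data = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # "typical" cases

# nu ~ expected fraction of anomalous cases, analogous to a 10% outlier rate.
model = OneClassSVM(kernel="rbf", nu=0.1).fit(normal_data)

predictions = model.predict(normal_data)  # +1 = typical, -1 = anomalous
print((predictions == -1).mean())         # roughly 0.1 by construction
```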
SVM uses an epsilon-insensitive loss function to solve regression problems.
SVM regression tries to find a continuous function such that the maximum number of data points lie within the epsilon-wide insensitivity tube. Predictions falling within epsilon distance of the true target value are not interpreted as errors.
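A minimal sketch of the epsilon-insensitive loss (the epsilon value is hypothetical):

```python
def epsilon_insensitive_loss(actual, predicted, epsilon=0.1):
    # Zero inside the epsilon-wide tube; linear in the residual outside it.
    return max(0.0, abs(actual - predicted) - epsilon)

print(epsilon_insensitive_loss(5.0, 5.05))  # 0.0 -- inside the tube, not an error
print(epsilon_insensitive_loss(5.0, 5.50))  # 0.4 -- penalized beyond the tube
```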
The epsilon factor is a regularization setting for SVM regression. It balances the margin of error with model robustness to achieve the best generalization to new data. See Table 18-1 for descriptions of build settings for SVM.
This chapter describes Minimum Description Length, the supervised technique used by Oracle Data Mining for calculating attribute importance.
This chapter includes the following topics:
Minimum Description Length (MDL) is an information theoretic model selection principle. It is an important concept in information theory (the study of the quantification of information) and in learning theory (the study of the capacity for generalization based on empirical data).
MDL assumes that the simplest, most compact representation of the data is the best and most probable explanation of the data. The MDL principle is used to build Oracle Data Mining attribute importance models.
Data compression is the process of encoding information using fewer bits than the original representation would use. The MDL Principle is based on the notion that the shortest description of the data is the most probable. In typical instantiations of this principle, a model is used to compress the data by reducing the uncertainty (entropy) as discussed below. The description of the data includes a description of the model and the data as described by the model.
Entropy is a measure of uncertainty. It quantifies the uncertainty in a random variable as the information required to specify its value. Information in this sense is measured in bits: the number of yes/no questions (each answer encoded as 0 or 1) that must be answered for a complete specification. Thus, the information depends on the number of values that the variable can assume.
For example, if the variable represents the sex of an individual, then the number of possible values is two: female and male. If the variable represents the salary of individuals expressed in whole dollar amounts, it may have values in the range $0-$10B, or billions of unique values. Clearly it will take more information to specify an exact salary than to specify an individual's sex.
Information (the number of bits) depends on the statistical distribution of the values of the variable as well as on the number of values of the variable. If we are judicious in the choice of yes/no questions, the amount of information needed to specify a salary may not be as much as it first appears. Most people do not have billion dollar salaries. If most people have salaries in the range $32,000-$64,000, then most of the time we would require only about 15 questions to discover a salary, rather than the 30 required if every salary from $0 to $1,000,000,000 were equally likely. As another example, if a person is known to be pregnant, then that person's sex is known to be female. There is no uncertainty and no yes/no questions need be asked: the entropy is 0.
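The arithmetic behind these examples is just base-2 logarithms; a minimal sketch:

```python
import math

# Bits needed when every value is equally likely: log2(number of values).
print(math.log2(2))              # sex: 1 bit (one yes/no question)
print(math.log2(64000 - 32000))  # ~15 bits for a salary in $32,000-$64,000
print(math.log2(1_000_000_000))  # ~30 bits if every salary in $0-$1B were equally likely

# Shannon entropy of a distribution: the expected number of bits.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit
print(entropy([1.0]))       # 0.0 -- no uncertainty (the pregnancy example)
```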
Suppose that for some random variable there is a predictor that when its values are known reduces the uncertainty of the random variable. For example, knowing whether a person is pregnant or not, reduces the uncertainty of the random variable sex-of-individual. This predictor seems like a valuable feature to include in a model. How about name? Imagine that if you knew the name of the person, you would also know the person's sex. If so, the name predictor would seemingly reduce the uncertainty to zero. However, if names are unique, then what was gained? Is the person named Sally? Is the person named George?... We would have as many Yes/No predictors in the name model as there are people. Therefore, specifying the name model would require as many bits as specifying the sex of each person.
MDL takes into consideration the size of the model as well as the reduction in uncertainty due to using the model. Both model size and entropy are measured in bits. For our purposes, both numeric and categorical predictors are binned. Thus the size of each single predictor model is the number of predictor bins. The uncertainty is reduced to the within-bin target distribution.
MDL considers each attribute as a simple predictive model of the target class. Model selection refers to the process of comparing and ranking the single-predictor models.
MDL uses a communication model for solving the model selection problem. In the communication model there is a sender, a receiver, and data to be transmitted.
These single predictor models are compared and ranked with respect to the MDL metric, which is the relative compression in bits. MDL penalizes model complexity to avoid over-fit. It is a principled approach that takes into account the complexity of the predictors (as models) to make the comparisons fair.
Attribute importance uses a two-part code as the metric for transmitting each unit of data. The first part (preamble) transmits the model. The parameters of the model are the target probabilities associated with each value of the predictor.
For a target with j values and a predictor with k values, with ni (i = 1, ..., k) rows for the ith predictor value, there are Ci = C(ni + j - 1, j - 1) possible distributions of target values, where C(n, m) denotes the number of combinations of n things taken m at a time. The size of the preamble in bits can be shown to be Sum(log2(Ci)), where the sum is taken over the k predictor values. Computations like this represent the penalties associated with each single-predictor model. The second part of the code transmits the target values using the model.
It is well known that the most compact encoding of a sequence is the encoding that best matches the probability of the symbols (target class values). Thus, the model that assigns the highest probability to the sequence has the smallest target class value transmission cost. In bits, this cost is -Sum(log2(pi)), where pi is the probability predicted by the model for the target value in row i.
The predictor rank is the position in the list of associated description lengths, smallest first.
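As an illustration (not Oracle's internal implementation), a minimal sketch of the two-part description length for a single binned predictor with a binary target; math.comb supplies the C(n + j - 1, j - 1) preamble term:

```python
import math

def description_length(bins, j=2):
    """Two-part code length in bits for a single-predictor model.

    bins: for each predictor bin, the list of target-value counts in that bin.
    j: number of target values.
    """
    preamble = 0.0   # part 1: transmit the per-bin target probabilities
    data_cost = 0.0  # part 2: transmit the target values using the model
    for counts in bins:
        n = sum(counts)
        preamble += math.log2(math.comb(n + j - 1, j - 1))
        for c in counts:
            if c > 0:
                data_cost += -c * math.log2(c / n)  # -Sum(log2(pi)) over rows
    return preamble + data_cost

# Hypothetical predictor with two bins; predictors are ranked smallest first.
print(description_length([[18, 2], [3, 17]]))
```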
Automatic Data Preparation performs supervised binning for MDL. Supervised binning uses decision trees to create the optimal bin boundaries. Both categorical and numerical attributes are binned.
MDL handles missing values naturally as missing at random. Missing values in columns with simple data types are interpreted as missing at random; missing values in nested columns are interpreted as sparse. The algorithm replaces sparse numerical data with zeros and sparse categorical data with zero vectors.
If you choose to manage your own data preparation, keep in mind that MDL usually benefits from binning. However, the discriminating power of an attribute importance model can be significantly reduced when there are outliers in the data and external equal-width binning is used. This technique can cause most of the data to concentrate in a few bins (a single bin in extreme cases). In this case, quantile binning is a better solution.
See Also: Chapter 19, "Automatic and Embedded Data Preparation", and Oracle Data Mining Application Developer's Guide for information about nested data and missing values |