Oracle® R Enterprise User's Guide Release 1.3 for Linux and Windows Part Number E36761-04
The Oracle Advanced Analytics option consists of both Oracle Data Mining and Oracle R Enterprise. Oracle R Enterprise provides a familiar R interface for the predictive analytics and data mining functions available in Oracle Data Mining; these are exposed through the OREdm package within Oracle R Enterprise.
Data mining uses sophisticated mathematical algorithms to segment data and evaluate the probability of future events. Oracle Data Mining can mine tables, views, star schemas, transactional data, and unstructured data.
For more information about Oracle Data Mining and the algorithms that it supports, see Oracle Data Mining Concepts 11g Release 2 (11.2) (http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html).
See OREdm Models for a complete list of supported algorithms and brief descriptions of the algorithms.
Note:
The CRAN package RODM also supports many Oracle Data Mining algorithms. RODM is different from OREdm: the OREdm interface is designed to provide a standard R interface for corresponding predictive analytics and data mining functions.
This section provides an overview of the algorithms supported by OREdm. For detailed information about a specific model, see the R help associated with the specific OREdm function.
In order to build a model, you must have build (training) data that satisfies OREdm Requirements.
Oracle Data Mining models are somewhat different from OREdm models; see OREdm Models and Oracle Data Mining Models.
For a list of the models available in this release and brief overview information, see OREdm Models.
Examples of using OREdm to build models are included in the description of each function. For example, Attribute Importance Example shows how to build an AI model.
OREdm requires that the data used to train (build) models exists in a single table or view containing columns of the following types: VARCHAR2, CHAR, NUMBER, and FLOAT.
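For illustration, here is a minimal sketch of staging build data with ore.push, which materializes an R data frame as a single database table; the data frame and its column names are invented for this example:
df <- data.frame(id  = 1:10,          # invented example data
                 grp = letters[1:10],
                 val = runif(10))
DF <- ore.push(df)   # DF is an ore.frame backed by a single table
class(DF)            # "ore.frame"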
All privileges required by Oracle Data Mining are automatically granted during Oracle R Enterprise installation.
Oracle Data Mining must be enabled for the database that you connect to.
Within OREdm, Oracle Data Mining models are given generated names. As long as the OREdm R model object exists, these model names can be used to access Oracle Data Mining models through other interfaces, including:
Oracle Data Miner
Any SQL interface, such as SQL*Plus or SQL Developer
In particular, the models can be used with the Oracle Data Mining SQL Prediction functions.
Oracle Data Miner can be useful in a number of ways:
Get a list of available models
Use Model viewers to inspect model details
Score appropriately transformed data
Note:
Any transformations performed in the R space will not be carried over into Oracle Data Miner or SQL scoring.
Similarly, SQL can be used to get a list of models, inspect model details, and score appropriately transformed data with these models.
Models created using OREdm are transient objects; they usually are not persisted past the R session that created them. Oracle Data Mining models created using Data Miner or SQL, on the other hand, exist until they are explicitly dropped.
Model objects can be saved or persisted, as described in Persist and Manage R Objects in the Database; this allows OREdm-generated model objects to exist across R sessions and keeps the corresponding Oracle Data Mining object in place, as sketched below.
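For example, here is a minimal sketch of persisting an OREdm model in a datastore; the datastore name "mydatastore" is arbitrary, and dt.mod stands for any OREdm model object, such as the Decision Tree model built later in this chapter:
ore.save(dt.mod, name = "mydatastore")  # save the R model object in the database
# In a later R session, after connecting with ore.connect():
ore.load("mydatastore")                 # restores dt.mod to the workspace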
While the OREdm model exists, you can export and import it; you can then use it independently of the Oracle R Enterprise R object.
OREdm supports these Oracle Data Mining models:
Attribute Importance (AI)
Decision Tree (DT)
Generalized Linear Models (GLM)
k-Means (KM)
Naive Bayes (NB)
Support Vector Machine (SVM)
Oracle Data Mining and open-source R use different terminology; see Data Mining Terminology.
Note that there are several Overloaded Functions that perform common actions such as predict (score), summary, and print.
Oracle Data Mining and the Oracle R Enterprise OREdm package, which creates statistical models, use somewhat different terminology. These are the most important differences:
Oracle R Enterprise fits models, whereas Oracle Data Mining builds or trains models.
Oracle R Enterprise predicts using new data, whereas Oracle Data Mining scores new data, or applies a model to new data.
Oracle R Enterprise uses formula, as described in Formula, in the API calls; Oracle Data Mining does not support formula.
R model definitions require a formula that expresses relationships between variables. The formula class is included in the R stats package; a formula provides a symbolic description of the model to be fitted. For more information, see the R help associated with ?formula.
The formula specification has the form (response ~ terms), where:
response is the numeric or character response vector.
terms is a series of terms, that is, the column names to include in the model. Multiple terms are specified using + between column names.
Use response ~ . if all columns in data should be used for model building.
Functions can be applied to response and terms to effect transformations.
To exclude columns, use - before the name of each column to exclude.
There is no equivalent of formula in the Oracle Data Mining API. The examples of model builds in this document and in the R help all contain sample formulas; a few illustrative forms follow.
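These sketches show common formula forms as they might be passed to an OREdm function; the column names are taken from the longley data set used in the GLM examples later in this chapter:
Employed ~ GNP + Population   # model Employed using the two named terms
Employed ~ .                  # use all other columns as terms
Employed ~ . - Year           # use all other columns except Year
log(Employed) ~ log(GNP)      # apply functions to response and terms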
predict(), summary(), and print() are defined across all OREdm algorithms, as illustrated in GLM Examples.
summary() returns detailed information about the model created, such as details of the generated decision tree.
Oracle Data Mining uses the Minimum Description Length algorithm to calculate Attribute Importance. Attribute importance ranks attributes according to their significance in predicting a target.
Minimum Description Length (MDL) is an information theoretic model selection principle. It is an important concept in information theory (the study of the quantification of information) and in learning theory (the study of the capacity for generalization based on empirical data).
MDL assumes that the simplest, most compact representation of the data is the best and most probable explanation of the data. The MDL principle is used to build Oracle Data Mining attribute importance models.
Attribute Importance models built using Oracle Data Mining cannot be applied to new data.
ore.odmAI produces a ranking of attributes and their importance values.
Note:
OREdm AI models differ from Oracle Data Mining AI models in these ways: a model object is not retained, and an R model object is not returned. Only the importance ranking created by the model is returned.
For details about parameters, see the R help associated with ore.odmAI.
For an example, see Attribute Importance Example.
This example creates a table by pushing the data frame iris to the table IRIS and then builds an attribute importance model:
IRIS <- ore.push(iris)
ore.odmAI(Species ~ ., IRIS)   # analyze the column Species
The Decision Tree algorithm is based on conditional probabilities. Decision trees generate rules. A rule is a conditional statement that can easily be understood by humans and easily used within a database to identify a set of records.
Decision Tree models are classification models.
A decision tree predicts a target value by asking a sequence of questions. At a given stage in the sequence, the question that is asked depends upon the answers to the previous questions. The goal is to ask questions that, taken together, uniquely identify specific target values. Graphically, this process forms a tree structure.
During the training process, the Decision Tree algorithm must repeatedly find the most efficient way to split a set of cases (records) into two child nodes. ore.odmDT offers two homogeneity metrics, gini and entropy, for calculating the splits. The default metric is gini; a sketch that selects entropy instead follows the Decision Tree example below.
OREdm includes these functions for Decision Tree (DT):
ore.odmDT creates (builds) a DT model.
predict predicts classifications on new data using the DT model.
summary provides a summary of the DT model. The summary includes node details that describe the tree that the model generates, and a symbolic description of the model. Returns an instance of summary.ore.odmDT.
print.ore.odmDT prints selected components of the ore.odmDT model.
For details about parameters, see the R help associated with ore.odmDT.
For an example, see Decision Tree Example.
This example creates an input table, builds a model, makes predictions, and generates a confusion matrix.
# Create MTCARS, the input data
m <- mtcars
m$gear <- as.factor(m$gear)
m$cyl <- as.factor(m$cyl)
m$vs <- as.factor(m$vs)
m$ID <- 1:nrow(m)
MTCARS <- ore.push(m)
row.names(MTCARS) <- MTCARS$ID
# Build the model
dt.mod <- ore.odmDT(gear ~ ., MTCARS)
summary(dt.mod)
# Make predictions and generate a confusion matrix
dt.res <- predict(dt.mod, MTCARS, "gear")
with(dt.res, table(gear, PREDICTION))  # generate confusion matrix
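The example above uses the default gini metric. The following sketch requests entropy instead; it assumes that ore.odmDT exposes the metric through an impurity.metric argument, so check the R help (?ore.odmDT) for the exact argument name in your release:
# Assumed argument name; verify with ?ore.odmDT
dt.mod.ent <- ore.odmDT(gear ~ ., MTCARS, impurity.metric = "entropy")
summary(dt.mod.ent)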
Generalized Linear Models (GLM) include and extend the class of linear models (linear regression). Generalized linear models relax the restrictions on linear models, which are often violated in practice. For example, binary (yes/no or 0/1) responses do not have the same variance across classes.
Oracle Data Mining's GLM is a parametric modeling technique. Parametric models make assumptions about the distribution of the data. When the assumptions are met, parametric models can be more efficient than non-parametric models. The challenge in developing models of this type involves assessing the extent to which the assumptions are met. For this reason, quality diagnostics are key to developing quality parametric models.
In addition to the classical weighted least squares estimation for linear regression and iteratively re-weighted least squares estimation for logistic regression, both solved via Cholesky decomposition and matrix inversion, Oracle Data Mining GLM provides a conjugate gradient-based optimization algorithm that does not require matrix inversion and is very well suited to high-dimensional data. (This approach is similar to the approach in Komarek's paper of 2004.) The choice of algorithm is handled internally and is transparent to the user.
GLM can be used to create classification or regression models as follows:
Classification: Binary logistic regression is the GLM classification algorithm. The algorithm uses the logit link function and the binomial variance function.
For an example, see GLM Examples.
Regression: Linear regression is the GLM regression algorithm. The algorithm assumes no target transformation and constant variance over the range of target values.
For an example, see GLM Examples.
ore.odmGLM allows you to build two different types of models. Some arguments apply to classification models only, and some to regression models only.
OREdm provides these functions for Generalized Linear Models (GLM):
ore.odmGLM creates (builds) a GLM model; note that some arguments apply to classification models only, and some to regression models only.
residuals is an ore.frame containing three types of residuals: deviance, pearson, and response.
fitted is fitted.values, an ore.vector containing the fitted values.
rank is the numeric rank of the fitted model.
type is the type of model fit.
predict.ore.odmGLM predicts new data using the GLM model. confint is a logical indicator for whether to produce confidence intervals for the predicted values.
deviance is minus twice the maximized log-likelihood, up to a constant.
coef.ore.odmGLM retrieves coefficients for GLM models with a linear kernel.
extractAIC.ore.odmGLM extracts Akaike's An Information Criterion (AIC) from the global details of the GLM model.
logLik extracts the log-likelihood for an OREdm GLM model.
nobs extracts the number of observations from a model fit. nobs is used in computing BIC, which is defined as AIC(object, ..., k = log(nobs(object))). See the sketch following the linear regression example below.
summary creates a summary of the GLM model. The summary includes fit details for the model. Also returns formula, a symbolic description of the model. Returns an object of type summary.ore.odmGLM.
print prints selected components of the GLM model.
For details about parameters and methods, see the R help associated with ore.odmGLM.
These examples build several models using GLM. The input tables are R data sets pushed to the database.
Linear regression using the longley data set:
LONGLEY <- ore.push(longley)
longfit1 <- ore.odmGLM(Employed ~ ., data = LONGLEY)
summary(longfit1)
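As a sketch of the accessors listed above, applied to longfit1 (exact output shapes may vary by release):
res <- residuals(longfit1)              # ore.frame: deviance, pearson, response
head(fitted(longfit1))                  # fitted values
extractAIC(longfit1)                    # equivalent degrees of freedom and AIC
logLik(longfit1)                        # log-likelihood
nobs(longfit1)                          # number of observations
AIC(longfit1, k = log(nobs(longfit1)))  # BIC, per the definition above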
Ridge regression using the longley data set:
longfit2 <- ore.odmGLM(Employed ~ ., data = LONGLEY, ridge = TRUE, ridge.vif = TRUE)
summary(longfit2)
Logistic regression (classification) using the infert data set:
INFERT <- ore.push(infert)
infit1 <- ore.odmGLM(case ~ age + parity + education + spontaneous + induced,
                     data = INFERT, type = "logistic")
infit1
Changing the reference value to 1 for infit1:
infit2 <- ore.odmGLM(case ~ age + parity + education + spontaneous + induced,
                     data = INFERT, type = "logistic", reference = 1)
infit2
The k-Means (KM) algorithm is a distance-based clustering algorithm that partitions data into a specified number of clusters. The Oracle Data Mining implementation is an enhanced version with these features:
Several distance functions: Euclidean, Cosine, and Fast Cosine. The default is Euclidean.
For each cluster, the algorithm returns the centroid, a histogram for each attribute, and a rule describing the hyperbox that encloses the majority of the data assigned to the cluster. The centroid reports the mode for categorical attributes and the mean and variance for numerical attributes.
OREdm includes these functions for k-Means (KM) models:
ore.odmKMeans creates (builds) a KM model.
predict predicts new data using the KM model.
rules.ore.odmKMeans extracts rules generated by the KM model.
clusterhists.ore.odmKMeans generates a data.frame with histogram data for each cluster and variable combination in the model. Numerical variables are binned.
histograms.ore.odmKMeans produces lattice-based histograms from a clustering model.
summary returns a summary of the KM model, including rules. Also returns formula, a symbolic description of the model. Returns an object of type summary.ore.odmKMeans.
print prints selected components of the KM model.
For details about parameters, see the R help associated with ore.odmKMeans.
For an example, see k-Means Example.
This example creates the table X, builds a cluster model, plots the clusters via histogram(), and makes predictions:
# Create input table X
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
X <- ore.push(data.frame(x))
# Build the clustering model and plot the results
km.mod1 <- ore.odmKMeans(~., X, num.centers = 2)
km.mod1
summary(km.mod1)
rules(km.mod1)
clusterhists(km.mod1)
histogram(km.mod1)
# Make predictions
km.res1 <- predict(km.mod1, X, type = "class", supplemental.cols = c("x", "y"))
head(km.res1, 3)
km.res1.local <- ore.pull(km.res1)
plot(data.frame(x = km.res1.local$x, y = km.res1.local$y),
     col = km.res1.local$CLUSTER_ID)
points(km.mod1$centers2, col = rownames(km.mod1$centers2), pch = 8, cex = 2)
head(predict(km.mod1, X))
head(predict(km.mod1, X, type = c("class", "raw")), 3)
head(predict(km.mod1, X, type = c("class", "raw"), supplemental.cols = c("x", "y")), 3)
head(predict(km.mod1, X, type = "class"), 3)
head(predict(km.mod1, X, type = "class", supplemental.cols = c("x", "y")), 3)
head(predict(km.mod1, X, type = "raw"), 3)
head(predict(km.mod1, X, type = "raw", supplemental.cols = c("x", "y")), 3)
The Naive Bayes algorithm is based on conditional probabilities. Naive Bayes looks at the historical data and calculates conditional probabilities for the target values by observing the frequency of attribute values and of combinations of attribute values.
Naive Bayes assumes that each predictor is conditionally independent of the others. (Bayes' Theorem requires that the predictors be independent.)
OREdm includes these functions for Naive Bayes (NB) models:
ore.odmNB creates (builds) an NB model.
predict scores new data using the NB model.
summary provides a summary of the NB model. Also returns formula, a symbolic description of the model. Returns an instance of summary.ore.odmNB.
print prints selected components of the NB model.
For details about parameters, see the R help associated with ore.odmNB.
For an example, see Naive Bayes Example.
This example creates MTCARS, builds a Naive Bayes model, and then uses the model to make predictions:
# Create MTCARS
m <- mtcars
m$gear <- as.factor(m$gear)
m$cyl <- as.factor(m$cyl)
m$vs <- as.factor(m$vs)
m$ID <- 1:nrow(m)
MTCARS <- ore.push(m)
row.names(MTCARS) <- MTCARS$ID
# Build the model
nb.mod <- ore.odmNB(gear ~ ., MTCARS)
summary(nb.mod)
# Make predictions
nb.res <- predict(nb.mod, MTCARS, "gear")
with(nb.res, table(gear, PREDICTION))  # generate confusion matrix
Support Vector Machine (SVM) is a powerful, state-of-the-art algorithm with strong theoretical foundations based on the Vapnik-Chervonenkis theory. SVM has strong regularization properties. Regularization refers to the generalization of the model to new data.
SVM models have similar functional form to neural networks and radial basis functions, both popular data mining techniques.
SVM can be used to solve the following problems:
Classification: SVM classification is based on decision planes that define decision boundaries. A decision plane is one that separates a set of objects having different class memberships. SVM finds the vectors ("support vectors") that define the separators giving the widest separation of classes.
SVM classification supports both binary and multiclass targets.
For an example, see SVM Classification.
Regression: SVM uses an epsilon-insensitive loss function to solve regression problems.
SVM regression tries to find a continuous function such that the maximum number of data points lie within the epsilon-wide insensitivity tube. Predictions falling within epsilon distance of the true target value are not interpreted as errors.
For an example, see SVM Regression.
Anomaly Detection: Anomaly detection identifies cases that are unusual within data that is seemingly homogeneous. Anomaly detection is an important tool for detecting fraud, network intrusion, and other rare events that may have great significance but are hard to find.
Anomaly detection is implemented as one-class SVM classification. An anomaly detection model predicts whether a data point is typical for a given distribution or not.
For an example, see SVM Anomaly Detection.
ore.odmSVM builds each of these three different types of models. Some arguments apply to classification models only; some, to regression models only; and some, to anomaly detection models only.
OREdm provides these functions for SVM models:
ore.odmSVM creates (builds) an SVM model.
predict predicts (scores) new data using the SVM model.
coef retrieves the coefficients of an SVM model. SVM has two kernels, Linear and Gaussian; the Linear kernel generates coefficients.
summary creates a summary of the SVM model. Also returns formula, a symbolic description of the model. Returns an object of type summary.ore.odmSVM.
print prints selected components of the SVM model.
For details about parameters, see the R help associated with ore.odmSVM.
These examples build three models:
This example creates MTCARS in the database from the R mtcars data set, builds a classification model, makes predictions, and finally generates a confusion matrix:
m <- mtcars
m$gear <- as.factor(m$gear)
m$cyl <- as.factor(m$cyl)
m$vs <- as.factor(m$vs)
m$ID <- 1:nrow(m)
MTCARS <- ore.push(m)
# Build the classification model, excluding the ID column
svm.mod <- ore.odmSVM(gear ~ . - ID, MTCARS, "classification")
summary(svm.mod)
coef(svm.mod)
# Make predictions and generate a confusion matrix
svm.res <- predict(svm.mod, MTCARS, "gear")
with(svm.res, table(gear, PREDICTION))  # generate confusion matrix
This example creates a data frame, pushes it to a table, and then builds a regression model; note that ore.odmSVM specifies a linear kernel:
x <- seq(0.1, 5, by = 0.02)
y <- log(x) + rnorm(x, sd = 0.2)
dat <- ore.push(data.frame(x = x, y = y))
# Build the model with a linear kernel
svm.mod <- ore.odmSVM(y ~ x, dat, "regression", kernel.function = "linear")
summary(svm.mod)
coef(svm.mod)
svm.res <- predict(svm.mod, dat, supplemental.cols = "x")
head(svm.res, 6)
This example uses MTCARS created in the classification example and builds an anomaly detection model:
svm.mod <- ore.odmSVM(~ . - ID, MTCARS, "anomaly.detection")
summary(svm.mod)
svm.res <- predict(svm.mod, MTCARS, "ID")
head(svm.res)
table(svm.res$PREDICTION)