2 Introducing Oracle Data Mining

This chapter introduces the basics you will need to start using Oracle Data Mining.

This chapter includes the following sections:

Data Mining in the Database Kernel
Data Mining in Oracle Exadata
Data Mining Functions
Data Mining Algorithms
Data Preparation
Model Transparency
How Do I Use Oracle Data Mining?
Where Do I Find Information About Oracle Data Mining?
Oracle Data Mining and Oracle Database Analytics

Data Mining in the Database Kernel

Oracle Data Mining provides comprehensive, state-of-the-art data mining functionality within Oracle Database.

Oracle Data Mining is implemented in the Oracle Database kernel, and mining models are first class database objects. Oracle Data Mining processes use built-in features of Oracle Database to maximize scalability and make efficient use of system resources.

Data mining within Oracle Database offers many advantages:

No Data Movement. Some data mining products require that the data be exported from a corporate database and converted to a specialized format for mining. With Oracle Data Mining, no data movement or conversion is needed. This makes the entire mining process less complex, time-consuming, and error-prone.
Security. Your data is protected by the extensive security mechanisms of Oracle Database. Moreover, specific database privileges are needed for different data mining activities. Only users with the appropriate privileges can score (apply) mining models.
Data Preparation and Administration. Most data must be cleansed, filtered, normalized, sampled, and transformed in various ways before it can be mined. Up to 80% of the effort in a data mining project is often devoted to data preparation. Oracle Data Mining can automatically manage key steps in the data preparation process. Additionally, Oracle Database provides extensive administrative tools for preparing and managing data.
Ease of Data Refresh. Mining processes within Oracle Database have ready access to refreshed data. Oracle Data Mining can easily deliver mining results based on current data, thereby maximizing its timeliness and relevance.
Oracle Database Analytics. Oracle Database offers many features for advanced analytics and business intelligence. Oracle Data Mining can easily be integrated with other analytical features of the database, such as statistical analysis and OLAP. See "Oracle Data Mining and Oracle Database Analytics".
Oracle Technology Stack. You can take advantage of all aspects of Oracle's technology stack to integrate data mining within a larger framework for business intelligence or scientific inquiry.
Domain Environment. Data mining models have to be built, tested, validated, managed, and deployed in their appropriate application domain environments. Data mining results may need to be post-processed as part of domain specific computations (for example, calculating estimated risks and response probabilities) and then stored into permanent repositories or data warehouses. With Oracle Data Mining, the pre- and post-mining activities can all be accomplished within the same environment.
Application Programming Interfaces. PL/SQL API and SQL language operators provide direct access to Oracle Data Mining functionality in Oracle Database.

Note:
The Oracle Data Mining Java API is deprecated in this release.
Oracle recommends that you not use deprecated features in new applications. Support for deprecated features is for backward compatibility only

Data Mining in Oracle Exadata

Scoring refers to the process of applying a data mining model to data to generate predictions. The scoring process may require significant system resources. Vast amounts of data may be involved, and algorithmic processing may be very complex.

With Oracle Data Mining, scoring can be off-loaded to intelligent Oracle Exadata Storage Servers where processing is extremely performant.

Oracle Exadata Storage Servers combine Oracle's smart storage software and Oracle's industry-standard Sun hardware to deliver the industry's highest database storage performance. For more information about Oracle Exadata, visit the Oracle Technology Network at:

http://www.oracle.com/us/products/database/exadata/index.htm

Data Mining Functions

A basic understanding of data mining functions and algorithms is required for using Oracle Data Mining. This section introduces the concept of data mining functions. Algorithms are introduced in "Data Mining Algorithms".

Each data mining function specifies a class of problems that can be modeled and solved. Data mining functions fall generally into two categories: supervised and unsupervised. Notions of supervised and unsupervised learning are derived from the science of machine learning, which has been called a sub-area of artificial intelligence.

Artificial intelligence refers to the implementation and study of systems that exhibit autonomous intelligence or behavior of their own. Machine learning deals with techniques that enable devices to learn from their own performance and modify their own functioning. Data mining applies machine learning concepts to data.

See Also:

Part II, "Mining Functions" for more details about data mining functions

Supervised Data Mining

Supervised learning is also known as directed learning. The learning process is directed by a previously known dependent attribute or target. Directed data mining attempts to explain the behavior of the target as a function of a set of independent attributes or predictors.

Supervised learning generally results in predictive models. This is in contrast to unsupervised learning where the goal is pattern detection.

The building of a supervised model involves training, a process whereby the software analyzes many cases where the target value is already known. In the training process, the model "learns" the logic for making the prediction. For example, a model that seeks to identify the customers who are likely to respond to a promotion must be trained by analyzing the characteristics of many customers who are known to have responded or not responded to a promotion in the past.

Supervised Learning: Testing

Separate data sets are required for building (training) and testing some predictive models. The build data (training data) and test data must have the same column structure. Typically, one large table or view is split into two data sets: one for building the model, and the other for testing the model.

The process of applying the model to test data helps to determine whether the model, built on one chosen sample, is generalizable to other data. In particular, it helps to avoid the phenomenon of overfitting, which can occur when the logic of the model fits the build data too well and therefore has little predictive power.

Supervised Learning: Scoring

Apply data, also called scoring data, is the actual population to which a model is applied. For example, you might build a model that identifies the characteristics of customers who frequently buy a certain product. To obtain a list of customers who shop at a certain store and are likely to buy a related product, you might apply the model to the customer data for that store. In this case, the store customer data is the scoring data.

Most supervised learning can be applied to a population of interest. Scoring is the purpose of classification and regression, the principal supervised mining techniques.

Oracle Data Mining does not support the scoring operation for attribute importance, another supervised function. Models of this type are built on a population of interest to obtain information about that population; they cannot be applied to separate data. An attribute importance model returns and ranks the attributes that are most important in predicting a target value.

Unsupervised Data Mining

Unsupervised learning is non-directed. There is no distinction between dependent and independent attributes. There is no previously-known result to guide the algorithm in building the model.

Unsupervised learning can be used for descriptive purposes. It can also be used to make predictions.

Unsupervised Learning: Scoring

Although unsupervised data mining does not specify a target, most unsupervised learning can be applied to a population of interest. For example, clustering models use descriptive data mining techniques, but they can be applied to classify cases according to their cluster assignments. Anomaly detection, although unsupervised, is typically used to predict whether a data point is typical among a set of cases.

Oracle Data Mining supports the scoring operation for clustering and feature extraction, both unsupervised mining functions. Oracle Data Mining does not support the scoring operation for association rules, another unsupervised function. Association models are built on a population of interest to obtain information about that population; they cannot be applied to separate data. An association model returns rules that explain how items or events are associated with each other. The association rules are returned with statistics that can be used to rank them according to their probability.

Oracle Data Mining Functions

Oracle Data Mining supports the supervised data mining functions described in Table 2-1.

Table 2-1 Oracle Data Mining Supervised Functions

Function	Description	Sample Problem
Attribute Importance	Identifies the attributes that are most important in predicting a target attribute	Given customer response to an affinity card program, find the most significant predictors
Classification	Assigns items to discrete classes and predicts the class to which an item belongs	Given demographic data about a set of customers, predict customer response to an affinity card program
Regression	Approximates and forecasts continuous values	Given demographic and purchasing data about a set of customers, predict customers' age

Oracle Data Mining supports the unsupervised functions described in Table 2-2.

Table 2-2 Oracle Data Mining Unsupervised Functions

Function	Description	Sample Problem
Anomaly Detection (implemented through one-class classification)	Identifies items (outliers) that do not satisfy the characteristics of "normal" data	Given demographic data about a set of customers, identify customer purchasing behavior that is significantly different from the norm
Association Rules	Finds items that tend to co-occur in the data and specifies the rules that govern their co-occurrence	Find the items that tend to be purchased together and specify their relationship
Clustering	Finds natural groupings in the data	Segment demographic data into clusters and rank the probability that an individual will belong to a given cluster
Feature Extraction	Creates new attributes (features) using linear combinations of the original attribute	Given demographic data about a set of customers, group the attributes into general characteristics of the customers

See Also:

Part II for details about the mining functions supported by Oracle Data Mining

Data Mining Algorithms

An algorithm is a mathematical procedure for solving a specific kind of problem. Oracle Data Mining supports at least one algorithm for each data mining function. For some functions, you can choose among several algorithms. For example, Oracle Data Mining supports four classification algorithms.

Each data mining model is produced by a specific algorithm. Some data mining problems can best be solved by using more than one algorithm. This necessitates the development of more than one model. For example, you might first use a feature extraction model to create an optimized set of predictors, then a classification model to make a prediction on the results.

Note:

You can be successful at data mining without understanding the inner workings of each algorithm. However, it is important to understand the general characteristics of the algorithms and their suitability for different kinds of applications.

Oracle Data Mining Supervised Algorithms

Oracle Data Mining supports the supervised data mining algorithms described in Table 2-3. The algorithm abbreviations are used throughout this manual.

Table 2-3 Oracle Data Mining Algorithms for Supervised Functions

Algorithm	Function	Description
Decision Tree (DT)	Classification	Decision trees extract predictive information in the form of human-understandable rules. The rules are if-then-else expressions; they explain the decisions that lead to the prediction.
Generalized Linear Models (GLM)	Classification and Regression	GLM implements logistic regression for classification of binary targets and linear regression for continuous targets. GLM classification supports confidence bounds for prediction probabilities. GLM regression supports confidence bounds for predictions.
Minimum Description Length (MDL)	Attribute Importance	MDL is an information theoretic model selection principle. MDL assumes that the simplest, most compact representation of data is the best and most probable explanation of the data.
Naive Bayes (NB)	Classification	Naive Bayes makes predictions using Bayes' Theorem, which derives the probability of a prediction from the underlying evidence, as observed in the data.
Support Vector Machine (SVM)	Classification and Regression	Distinct versions of SVM use different kernel functions to handle different types of data sets. Linear and Gaussian (nonlinear) kernels are supported. SVM classification attempts to separate the target classes with the widest possible margin. SVM regression tries to find a continuous function such that the maximum number of data points lie within an epsilon-wide tube around it.

Oracle Data Mining Unsupervised Algorithms

Oracle Data Mining supports the unsupervised data mining algorithms described in Table 2-4. The algorithm abbreviations are used throughout this manual.

Table 2-4 Oracle Data Mining Algorithms for Unsupervised Functions

Algorithm	Function	Description
Apriori (AP)	Association	Apriori performs market basket analysis by discovering co-occurring items (frequent itemsets) within a set. Apriori finds rules with support greater than a specified minimum support and confidence greater than a specified minimum confidence.
k-Means (KM)	Clustering	k-Means is a distance-based clustering algorithm that partitions the data into a predetermined number of clusters. Each cluster has a centroid (center of gravity). Cases (individuals within the population) that are in a cluster are close to the centroid. Oracle Data Mining supports an enhanced version of k-Means. It goes beyond the classical implementation by defining a hierarchical parent-child relationship of clusters.
Non-Negative Matrix Factorization (NMF)	Feature Extraction	NMF generates new attributes using linear combinations of the original attributes. The coefficients of the linear combinations are non-negative. During model apply, an NMF model maps the original data into the new set of attributes (features) discovered by the model.
One Class Support Vector Machine (One- Class SVM)	Anomaly Detection	One-class SVM builds a profile of one class and when applied, flags cases that are somehow different from that profile. This allows for the detection of rare cases that are not necessarily related to each other.
Orthogonal Partitioning Clustering (O-Cluster or OC)	Clustering	O-Cluster creates a hierarchical, grid-based clustering model. The algorithm creates clusters that define dense areas in the attribute space. A sensitivity parameter defines the baseline density level.

See Also:

Part III for details about the algorithms supported by Oracle Data Mining

Data Preparation

Data for mining must exist within a single table or view. The information for each case (record) must be stored in a separate row.

A unique capability of Oracle Data Mining is its support for dimensioned data (for example, star schemas) through nested table transformations. Additionally, Oracle Data Mining can mine unstructured data.

See Also:

Oracle Data Mining Application Developer's Guide to learn how to construct a table or view for data mining

Chapter 20, "Text Mining" for information about mining unstructured data

Proper preparation of the data is a key factor in any data mining project. The data must be properly cleansed to eliminate inconsistencies and support the needs of the mining application. Additionally, most algorithms require some form of data transformation, such as binning or normalization.

The data mining development process may require several data sets. One data set may needed for building (training) the model; a separate data set may be used for scoring. Classification models should also have a test data set. Each of these data sets must be prepared in exactly the same way.

Automatic Data Preparation

Oracle Data Mining supports automatic and embedded data transformation, which can significantly reduce the time and effort involved in developing a data mining model. In Automatic Data Preparation (ADP) mode, the model itself transforms the build data according to the requirements of the algorithm. The transformation instructions are embedded in the model and reused whenever the model is applied.

You can choose to add your own transformations to those performed automatically by Oracle Data Mining. These are embedded along with the automatic transformation instructions and reused with them whenever the model is applied. In this case, you only have to specify your transformations once — for the build data. The model itself will transform the data appropriately when it is applied.

Mining models are known as supermodels, because they contain the instructions for their own data preparation.

Model Transparency

Oracle Data Mining provides a high degree of model transparency. All models support a set of model details that you can query to observe the effect of the algorithm on the training data. Additionally, some algorithms produce rules, which express the logic used by the model. Clustering algorithms, as well as Decision Tree and Association Rules, all produce rules.

The data used internally by the model often does not look like the data used to build the model. This is because the model transforms much of the data prior to algorithmic processing. Model details reverse these transformations and present the training data as it is understood by a user. Similarly, any transformations applied to the target of a supervised model are reversed before the results are returned to the user.

See Also:

Oracle Data Mining Application Developer's Guide for a summary of information returned in model details

Chapters on O-Cluster, k-Means, Decision Tree, and Apriori in Part III, "Algorithms"

How Do I Use Oracle Data Mining?

Oracle Data Mining supports programmatic interfaces for PL/SQL, SQL, and R.

Note:

The Oracle Data Mining Java API is deprecated in this release.

Certain models that are PMML-compliant can be imported into Oracle Database for scoring.

The Oracle Data Miner graphical user interface provides access to the full range of in-database capabilities of Oracle Data Mining. A spreadsheet add-in provides access to predictive analytics that are powered by Oracle Data Mining.

Oracle Data Miner

Oracle Data Miner is the graphical user interface to Oracle Data Mining. Oracle Data Miner, an extension to Oracle SQL Developer 3.0, uses a work flow paradigm to perform data mining tasks.

You can use Oracle Data Miner to explore data, build and evaluate multiple mining models, and apply the models to new data. By building work flows, you can capture and document the methodology you use to perform a range of mining tasks. You can save and share work flows.

Note:

To download Oracle Data Miner 11g Release 2 (11.2.0.2), go to:

http://www.oracle.com/technetwork/database/options/odm/dataminerworkflow-168677.html

The previous release of Oracle Data Miner is no longer under active development, but it is still available for download on the Oracle Technology Network. To download the earlier release, go to:

http://www.oracle.com/technetwork/database/options/odm/downloads/odminer-097463.html

PL/SQL Packages

The Oracle Data Mining PL/SQL API is implemented in the following PL/SQL packages:

DBMS_DATA_MINING — Contains routines for building, testing, and applying data mining models.
DBMS_DATA_MINING_TRANSFORM — Contains routines for transforming the data sets prior to building or applying a model. You can use these routines, adaptations of these routines, or any other SQL-based method for implementing transformations.

Note that user-defined transformations may not be required. With ADP, Oracle Data Mining can automatically perform most transformations required by a given algorithm.
DBMS_PREDICTIVE_ANALYTICS — Contains automated data mining routines for PREDICT, EXPLAIN, and PROFILE operations.

The following example shows the PL/SQL routine for creating an SVM classification model called my_model. The algorithm is specified in a settings table called my_settings. The algorithm must be specified as a setting because Naive Bayes, not SVM, is the default classifier.

CREATE TABLE my_settings(
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000));

BEGIN
    INSERT INTO my_settings VALUES
        (dbms_data_mining.algo_name, 
         dbms_data_mining.algo_support_vector_machines);
    COMMIT;
END;
/

BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'my_model',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'mining_data_build',
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    settings_table_name => 'my_settings');
END;
/

SQL Scoring Functions

A family of SQL language operators supports the deployment of data mining models. These operators allow scoring to be easily incorporated into SQL queries, and thus into SQL-based applications.

The following example illustrates the Data Mining PREDICTION_PROBABILITY operator. The operator applies the classification model nb_sh_clas_sample to the data set mining_data_apply_v.

SELECT cust_id, prob
  FROM (SELECT cust_id,
          PREDICTION_PROBABILITY (nb_sh_clas_sample, 1 USING *) prob
          FROM  mining_data_apply_v
          WHERE cust_id < 100011)
  ORDER BY cust_id;

The SELECT statement returns ten customers, listed by customer ID, along with the likelihood that they will accept (1) an affinity card.

CUST_ID       PROB
---------- ----------
    100001 .025622714
    100002 .090424232
    100003 .028064789
    100004 .048458859
    100005 .989335775
    100006 .000151844
    100007  .05749942
    100008 .108750373
    100009 .538512886
    100010 .186426058

See Also:

Oracle Database SQL Language Reference

R-ODM

The R Interface to Oracle Data Mining (R-ODM) provides access to Oracle Data Mining from the R programming environment. R-ODM consists of a set of function wrappers written in source R language. These functions pass data and parameters from the R environment to Oracle Database through an ODBC connection.

The R-ODM interface is completely external to Oracle Database. It does not use or expose any Oracle product code. R-ODM is packaged as a standard R source package and is distributed freely as part of the Comprehensive R Archive Network ( CRAN).

http://www.r-project.org for information about the R environment, R packages, and CRAN

PMML Import

Using the PL/SQL API, you can import a GLM model represented in Predictive Model Markup Language (PMML) into an Oracle database.This functionality is available starting with Oracle Database 11g Release 2 (11.2.0.2) Data Mining.

PMML is an XML-based standard specified by the Data Mining Group (http://www.dmg.org). Applications that are PMML-compliant can deploy PMML-compliant models that were created by any vendor. Oracle Data Mining supports the core features of PMML 3.1 for regression models.

See Also:

Oracle Data Mining Administrator's Guide for information about exporting and importing data mining models

http://www.dmg.org/faq.html for information about PMML

Oracle Spreadsheet Add-In for Predictive Analytics

Predictive Analytics automates the data mining process with routines for PREDICT, EXPLAIN, and PROFILE. The Oracle Spreadsheet Add-In for Predictive Analytics implements these routines for Microsoft Excel.You can use the Spreadsheet Add-In to analyze Excel data or data that resides in an Oracle database.

You can download the Spreadsheet Add-In, including a readme file, from the Oracle Technology Network.

http://www.oracle.com/technetwork/database/options/odm/index.html

Java API

The Oracle Data Mining Java API is an Oracle implementation of the JDM standard Java API for data mining (JSR-73). The Java API is layered on the PL/SQL API, and the two APIs are fully interoperable.

Note:

The Oracle Data Mining Java API is deprecated in this release.

Oracle recommends that you not use deprecated features in new applications. Support for deprecated features is for backward compatibility only

The following code fragment creates a Decision Tree model that models customer affinity card response patterns and applies this model to predict new customers' affinity card responses.

//Step-1: Create connection to a database with the Data Mining option
 OraConnectionFactory m_dmeConnFactory = new OraConnectionFactory();
 ConnectionSpec connSpec = m_dmeConnFactory.getConnectionSpec();
 connSpec.setURI("jdbc:oracle:thin:@<hostName>:<port>:<sid>");
 connSpec.setName("<user name>");
 connSpec.setPassword("password");
 m_dmeConn = m_dmeConnFactory.getConnection(connSpec);
 
//Step-2: Create object factories
 PhysicalDataSetFactory m_pdsFactory =
   (PhysicalDataSetFactory)m_dmeConn.getFactory(
      "javax.datamining.data.PhysicalDataSet");
 PhysicalAttributeFactory m_paFactory =
   (PhysicalAttributeFactory)m_dmeConn.getFactory(
      "javax.datamining.data.PhysicalAttribute");
 TreeSettingsFactory m_treeFactory =
   (TreeSettingsFactory)m_dmeConn.getFactory(
      "javax.datamining.algorithm.tree.TreeSettings");
 ClassificationSettingsFactory m_clasFactory =
   (ClassificationSettingsFactory)m_dmeConn.getFactory(
      "javax.datamining.supervised.classification.ClassificationSettings");
BuildTaskFactory m_buildFactory =
   (BuildTaskFactory)m_dmeConn.getFactory(
      "javax.datamining.task.BuildTask");
ClassificationApplySettingsFactory m_applySettingsFactory =
   (ClassificationApplySettingsFactory)m_dmeConn.getFactory(
     "javax.datamining.supervised.classification.ClassificationApplySettings");
DataSetApplyTaskFactory m_dsApplyFactory =
   (DataSetApplyTaskFactory)m_dmeConn.getFactory(
      "javax.datamining.task.apply.DataSetApplyTask");
ClassificationApplySettingsFactory m_applySettingsFactory =
   (ClassificationApplySettingsFactory)m_dmeConn.getFactory(
     "javax.datamining.supervised.classification.ClassificationApplySettings");
 
//Step-3: Create and save model build task input objects 
//        (training data, build settings)
    //Create & save model input data specification (PhysicalDataSet)
    PhysicalDataSet buildData =
      m_pdsFactory.create("MINING_DATA_BUILD_V", false);
    PhysicalAttribute pa =
      m_paFactory.create("CUST_ID", AttributeDataType.integerType,
                         PhysicalAttributeRole.caseId);
    buildData.addAttribute(pa);
    m_dmeConn.saveObject("treeBuildData_jdm", buildData, true);
    //Create & save Mining Function Settings
    ClassificationSettings buildSettings = m_clasFactory.create();
    TreeSettings treeAlgo = m_treeFactory.create();
    buildSettings.setAlgorithmSettings(treeAlgo);
    buildSettings.setTargetAttributeName("AFFINITY_CARD");
    m_dmeConn.saveObject("treeBuildSettings_jdm", buildSettings, true);

//Step-4: Create and save model build task
 BuildTask buildTask =
    m_buildFactory.create("treeBuildData_jdm", "treeBuildSettings_jdm",
                            "treeModel_jdm");
 m_dmeConn.saveObject("treeBuildTask_jdm", buildTask, true);
 
//Step-5: Create and save model apply task input objects (apply settings)
  //Create & save PhysicalDataSpecification
    PhysicalDataSet applyData =
      m_pdsFactory.create("MINING_DATA_APPLY_V", false);
    PhysicalAttribute pa =
      m_paFactory.create("CUST_ID", AttributeDataType.integerType,
                         PhysicalAttributeRole.caseId);
    applyData.addAttribute(pa);
    m_dmeConn.saveObject("treeApplyData_jdm", applyData, true);
    //Create & save ClassificationApplySettings
    ClassificationApplySettings clasAS = m_applySettingsFactory.create();

//Step-6: Create and save model apply task with build task as dependent
    DataSetApplyTask applyTask =
      m_dsApplyFactory.create("treeApplyData_jdm", "treeModel_jdm",
                              "treeApplySettings_jdm",
                              "TREE_APPLY_OUTPUT_JDM");
    ((OraTask)applyTask).addDependency("treeBuildTask_jdm");
    m_dmeConn.saveObject("treeApplyTask_jdm", applyTask, true);

//Step-7: Execute build task which executes build task and then after 
//        successful completion triggers the execution of its dependent 
//        task(s). In this example, there is only one dependent task.
m_dmeConn.execute("treeBuildTask_jdm");

Where Do I Find Information About Oracle Data Mining?

Oracle Data Mining documentation is included in the Oracle Database Online Documentation Library.

For your convenience, the Oracle Data Mining and related Oracle Database manuals are listed in Table 2-5.

Table 2-5 Oracle Data Mining Documentation

Document	Description
Oracle Data Mining Concepts	This manual. An overview of mining functions, algorithms, data preparation, predictive analytics, and other special features supported by Oracle Data Mining
Oracle Data Mining Application Developer's Guide	How to use the programmatic interfaces to Oracle Data Mining.
Oracle Data Mining Administrator's Guide	How to install and administer a database for Data Mining. How to install and use the demo programs
Oracle Database PL/SQL Packages and Types Reference	How to use the Data Mining PL/SQL API syntax
Oracle Database SQL Language Reference	How to use the Data Mining SQL function syntax
Oracle Database Reference	How to query data dictionary views to obtain information about mining models, mining model attributes, and mining model settings

See Also:

Oracle Database online documentation library at http://www.oracle.com/pls/db112/homepage

Oracle Data Mining Resources on the Oracle Technology Network

The Oracle Technology Network (OTN) is easily accessible and provides a wealth of information. You can visit the Oracle Data Mining home page at:

http://www.oracle.com/technetwork/database/options/odm/index.html

This site provides news and discussion forums as well as tools and educational materials for download. On this site, you will find:

Oracle Data Miner (download)
Oracle Spreadsheet Add-In for Predictive Analytics (download)
Sample code (download)
R-ODM information
White papers and web casts
Oracle Data Mining discussion forum
Blogs on data mining and analytics in the Oracle Database

Oracle Data Mining and Oracle Database Analytics

As described in "Data Mining in the Database Kernel", in-database analytics offer significant advantages. When analytical capabilities are implemented where the data is stored, the data does not have to be exported to an external server for analysis. The results of analysis do not need to be imported; they reside in the database where they can be easily accessed, refreshed, and combined with other data.

Along with data mining and predictive analytics, Oracle Database supports a wide array of analytical features. Since these features are part of a common server it is possible to combine them efficiently. The results of analytical processing can be integrated with Oracle Business Intelligence Suite Enterprise Edition and other BI tools and applications. Taken as a whole, these features make the Oracle Database a powerful platform for developing analytical applications.

The possibilities for combining different analytics are virtually limitless. Example 2-1 shows data mining and text processing within a single SQL query. The query selects all customers who have a high propensity to attrite (> 80% chance), are valuable customers (customer value rating > 90), and have had a recent conversation with customer services regarding a Checking Plus account. The propensity to attrite information is computed using a Data Mining model called tree_model. The query uses the Oracle Text CONTAINS operator to search call center notes for references to Checking Plus accounts.

Example 2-1 Combine Oracle Data Mining and Oracle Text in a SQL Query

SELECT A.cust_name, A.contact_info
  FROM customers A
 WHERE PREDICTION_PROBABILITY(tree_model,
            'attrite' USING A.*) > 0.8
   AND A.cust_value > 90
   AND A.cust_id IN
       (SELECT B.cust_id
          FROM call_center B
         WHERE B.call_date BETWEEN '01-Jan-2005'
                               AND '30-Jun-2005'   
         AND CONTAINS(B.notes, 'Checking Plus', 1) > 0);

Some of the analytics supported by Oracle Database are described in Table 2-6. Use the links in the Documentation column to find related documentation.

Table 2-6 Overview of Analytics in Oracle Database

Analytical Feature	Description	Documentation
Data Mining	Oracle Data Mining implements complex algorithms that sift through large volumes of data to find hidden information. Data Mining models discover patterns, predict probable outcomes, identify key predictors, and find other kinds of valuable information	Oracle Data Mining Concepts (this manual)
Complex data transformations	Data transformation is a key aspect of analytical applications and ETL (extract, transform, and load). You can use SQL expressions to implement data transformations, or you can use the `DBMS_DATA_MINING_TRANSFORM` package. `DBMS_DATA_MINING_TRANSFORM` is a flexible data transformation package that includes a variety of missing value and outlier treatments, as well as binning and normalization capabilities.	Oracle Database PL/SQL Packages and Types Reference
Statistical functions	Oracle Database provides a long list of SQL statistical functions with support for: hypothesis testing (such as t-test, F-test), correlation computation (such as pearson correlation), cross-tab statistics, and descriptive statistics (such as median and mode). The `DBMS_STAT_FUNCS` package adds distribution fitting procedures and a summary procedure that returns descriptive statistics for a column.	Oracle Database SQL Language Reference and Oracle Database PL/SQL Packages and Types Reference
Window and analytic SQL functions	Oracle Database supports analytic and windowing functions for computing cumulative, moving, and centered aggregates. With windowing aggregate functions, you can calculate moving and cumulative versions of `SUM`, `AVERAGE`, `COUNT`, `MAX`, `MIN`, and many more functions.	Oracle Database Data Warehousing Guide
Frequent Itemsets	The `DBMS_FREQUENT_ITEMSET` supports frequent itemset counting, a mechanism for counting how often multiple events occur together. `DBMS_FREQUENT_ITEMSET` is used as a building block for the Association Rules algorithm used by Oracle Data Mining.	Oracle Database PL/SQL Packages and Types Reference
Linear algebra	The `UTL_NLA` package exposes a subset of the popular `BLAS` and `LAPACK` (Version 3.0) libraries for operations on vectors and matrices represented as VARRAYs. This package includes procedures to solve systems of linear equations, invert matrices, and compute eigenvalues and eigenvectors.	Oracle Database PL/SQL Packages and Types Reference
OLAP	Oracle OLAP supports multidimensional analysis and can be used to improve performance of multidimensional queries. Oracle OLAP provides functionality previously found only in specialized OLAP databases. Moving beyond drill-downs and roll-ups, Oracle OLAP also supports time-series analysis, modeling, and forecasting.	Oracle OLAP User's Guide
Spatial analytics	Oracle Spatial provides advanced spatial features to support high-end GIS and LBS solutions. Oracle Spatial's analysis and mining capabilities include functions for binning, detection of regional patterns, spatial correlation, colocation mining, and spatial clustering. Oracle Spatial also includes support for topology and network data models and analytics. The topology data model of Oracle Spatial allows one to work with data about nodes, edges, and faces in a topology. It includes network analysis functions for computing shortest path, minimum cost spanning tree, nearest-neighbors analysis, traveling salesman problem, among others.	Oracle Spatial Developer's Guide
Text Mining	Oracle Text uses standard SQL to index, search, and analyze text and documents stored in the Oracle database, in files, and on the web. It also supports automatic classification and clustering of document collections. Many of these analytical features are layered on top of ODM functionality	Oracle Text Application Developer's Guide