Oracle® Data Mining Application Developer's Guide 11g Release 2 (11.2) Part Number E12218-07 |
|
|
PDF · Mobi · ePub |
Oracle Data Mining supports the mining of data sets that have one or more text columns. These columns must undergo a special preprocessing step whereby text tokens known as terms are extracted and stored in a nested column. The transformed text can then be used as any other attribute in the building, testing, and scoring of models. Any algorithm that supports nested data can be used for text mining. (See Table 3-2 in Chapter 3.)
This chapter explains how to use PL/SQL to prepare a column of text for mining.
Note:
Oracle Data Mining includes sample programs that illustrate text transformation and text mining using the PL/SQL API. Refer to Oracle Data Mining Administrator's Guide for information about the Oracle Data Mining sample programs.See Also:
Oracle Data Mining Concepts for information about text miningThis chapter contains the following sections.
Oracle Data Mining uses specialized Oracle Text routines to preprocess text data. Oracle Text is a technology within the database for building text querying and classification applications.
Oracle Text provides the following facilities for the Oracle Data Mining term extraction process:
SVM_CLASSIFIER
, defined in the CTX_DLL
Oracle Text PL/SQL package, specifies an index preference for Oracle Data Mining term extraction.
The CTXSYS.DRVODM
Oracle Text PL/SQL package defines the table functions, FEATURE_PREP
and FEATURE_EXPLAIN
, which generate intermediate and final tables of text terms for Oracle Data Mining.
In PL/SQL, the term extraction process requires the use of these Oracle Text facilities. (See "Using Text Transformation".)
See Also:
Oracle Text Application Developer's Guide and Oracle Text Reference for information on Oracle TextNote:
The Oracle Text facilities for Oracle Data Mining are documented in this chapter. They are not documented in the Oracle Text manuals.Note on Terminology:
Text terms are also known as features. In text mining, a feature is a word or group of words extracted from a text document. Features (terms) are the fundamental unit of text that can be manipulated and analyzed.The feature extraction mining function (NMF algorithm in Oracle Data Mining) and text mining transformation both perform a kind of feature extraction.
A feature extraction model creates features that represent fundamental qualities of multiple attributes. The model operates on the features instead of the original attributes.
Text transformation changes a text document into a collection of features, each one representing a fundamental characteristic of the document. The model operates on the text features instead of the original document.
A good place to start in learning the term extraction process is with the sample programs. The following sample programs contain term extraction code for text mining:
dmsh.sql
— Prepares the build, test, and scoring data for the sample programs, including the text mining programs. dmsh.sql
creates views for data mining and tables and indexes for text mining.
dmtxtfe.sql
— Using the indexed text column created by dmsh.sql
, creates a nested column suitable for text mining.
The dmtxtfe.sql
program is a sample term extractor. It contains extensive comments that explain the code in a step-by-step fashion. You can expand this program into a complete term extraction solution by adding index creation and the preparation of test and scoring data (as in dmsh.sql
).
Two sample PL/SQL programs use the data prepared by dmsh.sql
to mine text.
dmtxtnmf.sql
creates and applies a model that uses Non-Negative Matrix Factorization.
dmtxtsvm.sql
creates and applies a model that uses SVM classification.
Both these programs mine a table of customer data that includes a nested column of text data called COMMENTS
. The COMMENTS
column has been pre-processed by dmsh.sql
. The models created by these programs are referenced in the following example from a Linux system.
-- Run the programs to create the models SQL> @ $ORACLE_HOME/rdbms/demo/dmtxtnmf.sql SQL> @ $ORACLE_HOME/rdbms/demo/dmtxtsvm.sql -- List the models created by the programs SQL> SELECT model_name, mining_function, algorithm FROM user_mining_models; MODEL_NAME MINING_FUNCTION ALGORITHM ------------------------ --------------------------- ------------------------ T_SVM_CLAS_SAMPLE CLASSIFICATION SUPPORT_VECTOR_MACHINES T_NMF_SAMPLE FEATURE_EXTRACTION NONNEGATIVE_MATRIX_FACTOR
See Also:
Oracle Data Mining Administrator's Guide. This manual provides instructions for obtaining and running the sample programs. It includes information about the build, training, and scoring data used by these programs.The pre-processing steps for text mining create nested table columns of type DM_NESTED_NUMERICALS
from columns of type VARCHAR2
or CLOB
. Each row of the nested table specifies an attribute name and a value. The DM_NESTED_NUMERICALS
type defines the following columns.
attribute_name VARCHAR2(4000) value NUMBER)
The term extraction process treats the text in each row of the original table as a separate document. Each document is transformed to a set of terms that have a numeric value and a text label. Within the nested table column, the attribute_name
column holds the text and the value
column holds the numeric value of the term, which is derived using the term frequency in the document and in the document collection (other rows).
For example, the following query returns various attributes of customer 102998, including a text column of comments. The text column has not been transformed.
SQL> select cust_id, cust_gender, cust_income_level, affinity_card, comments from mining_build_text where cust_id = 102998; CUST_ID C CUST_INCOME_LEVEL AFFINITY_CARD COMMENTS ------- -- -------------------- ------------- -------------------------------- 102998 M J: 190,000 - 249,999 1 I wanted to write you to let you know that I've purchased several items at your store recently and have been very satisfied with my purchases. Keep up the good work.
The following query returns the same attributes of customer 102998, but the text in the comments column has been transformed. The query extracts the ATTRIBUTE_NAME
and VALUE
columns from the nested table that holds the transformed text.
SQL> select b.cust_id, b.cust_gender, b.cust_income_level, b.affinity_card, n.* from mining_build_nested_text b, table(b.comments) n where b.cust_id = 102998 order by n.attribute_name; CUST_ID C CUST_INCOME_LEVEL AFFINITY_CARD ATTRIBUTE_NAME VALUE ------- -- ------------------- ------------- -------------- -------- 102998 M J: 190,000 - 249,999 1 GOOD .26894 102998 M J: 190,000 - 249,999 1 ITEMS 158062 102998 M J: 190,000 - 249,999 1 KEEP 238765 102998 M J: 190,000 - 249,999 1 KNOW .2006 102998 M J: 190,000 - 249,999 1 LET 299856 102998 M J: 190,000 - 249,999 1 PURCHASED 142743 102998 M J: 190,000 - 249,999 1 PURCHASES 173146 102998 M J: 190,000 - 249,999 1 RECENTLY .195223 102998 M J: 190,000 - 249,999 1 SATISFIED .355851 102998 M J: 190,000 - 249,999 1 SEVERAL .355851 102998 M J: 190,000 - 249,999 1 STORE .0712537 102998 M J: 190,000 - 249,999 1 UP .159838 102998 M J: 190,000 - 249,999 1 WANTED .355851 102998 M J: 190,000 - 249,999 1 WORK .299856 102998 M J: 190,000 - 249,999 1 WRITE .355851
The ATTRIBUTE_NAME
column holds an item of text from the original comments column. The VALUE
column holds the term value. Note that not all words from the original comments column are extracted as terms. For example, the articles the
and to
are not included.
The steps in the term extraction process are summarized in this section. Further details and specific syntax requirements are explained later in this chapter.
First transform the text in the build data. During this process you will generate the text term definitions, which you will reuse for the test and apply data. Perform the following steps:
Create an index on the text column in the build table.
Define a table to hold the categories specified by the SVM_CLASSIFIER
index.
Use the FEATURE_PREP
table function to create the term definitions and populate an intermediate terms table.
Use the FEATURE_EXPLAIN
table function to populate the final terms table.
Replicate the columns of the original build table (using a view or another table), replacing the text column with a nested table column. Load the terms from the final terms table into the nested table column.
The test and apply data must undergo the same pre-processing as the build data. To transform the test and apply data, you will reuse the term definitions generated for the build data. Perform the following steps:
Create an index on the text column in the test or apply table.
Use the FEATURE_PREP
table function to populate an intermediate terms table. Use the term definitions previously generated for the build data.
Use the FEATURE_EXPLAIN
table function to populate the final terms table.
Replicate the columns of the original test or apply table, replacing the text column with a nested table column. Load the terms from the final terms table into the nested table column.
Oracle Text processing requires a text index. Oracle Text supports several types of indexes for querying, cataloging, and classifying text documents. The Oracle Data Mining term extraction process requires a CONTEXT
index for text querying.
You must create an index for each text column to be transformed. Use the following syntax to create the index.
SQL>CREATE INDEX index_name ON table_name(column_name) INDEXTYPE IS ctxsys.context PARAMETERS ('nopopulate');
Note:
This statement creates a basicCONTEXT
index. You can further define the characteristics of the index by specifying additional arguments to the CREATE INDEX
statement. Refer to Oracle Text Reference for details.Oracle Text supports index preferences for overriding the default characteristics of an index. The CREATE_PREFERENCE
procedure in the Oracle Text package CTX_DDL
creates a preference with the name and type that you specify. The SVM_CLASSIFIER
preference type defines the characteristics of an index for Oracle Data Mining.
You must create an index preference when you prepare the build data. It will be reused when you prepare the test and apply data. Use the following syntax to create the index preference.
SQL>EXECUTE ctx_ddl.create_preference('preference_name', 'SVM_CLASSIFIER');
The SVM_CLASSIFIER
index preference uses a predefined table with two numeric columns: ID
and CAT
. ID
holds the case ID; CAT
holds the category. The category table is used for internal processing. You must create the category table using the following syntax.
SQL>CREATE TABLE category_table_name(id NUMBER, cat NUMBER);
The FEATURE_PREP
table function in the CTXSYS.DRVODM
Oracle Text package extracts terms from a text column using an index preference of type SVM_CLASSIFIER
. FEATURE_PREP
creates a table of term definitions from the build data and reuses these definitions for the test and apply data.
FEATURE_PREP
returns an intermediate terms table.
FEATURE_PREP
is an over-loaded function that accepts two different sets of arguments. You will specify one set of arguments for the build data and another set for the test and apply data.
--- syntax for build data --- CTXSYS.DRVODM.FEATURE_PREP ( text_index IN VARCHAR2, case_id IN VARCHAR2, category_tbl IN VARCHAR2, category_tbl_id_col IN VARCHAR2, category_tbl_cat_col IN VARCHAR2, feature_definition_tbl IN VARCHAR2, index_preference IN VARCHAR2) RETURN DRVODM; --- syntax for test/apply data --- CTXSYS.DRVODM.FEATURE_PREP ( text_index IN VARCHAR2, case_id IN VARCHAR2, feature_definition_tbl IN VARCHAR2, RETURN DRVODM;
FEATURE_PREP
returns the following columns. The SEQUENCE_ID
column holds the case ID; the ATTRIBUTE_ID
column holds the term ID.
Name NULL? Type ---------------------- ------- ------ SEQUENCE_ID NUMBER ATTRIBUTE_ID NUMBER VALUE NUMBER
FEATURE_PREP
accepts the arguments described in Table 4-1.
Table 4-1 FEATURE_PREP Table Function Arguments
Argument Name | Data Type | |
---|---|---|
|
|
Name of the index on the text column in the build, test, or apply table. |
|
|
Name of the case ID column in the build, test, or apply table. |
|
|
Name of the table used by the Specify this argument only for build data. |
|
|
Specify 'id'. This is the name of the Specify this argument only for build data. |
|
|
Specify 'cat'. This is the name of the Specify this argument only for build data. |
|
|
Name of the term definition table created by Name Null? Type --------------------------------- CAT_ID NUMBER TYPE NUMBER RULE BLOB |
|
|
Name of the Specify this argument only for build data. |
The following example creates an intermediate terms table called txt_term_out
. The FEATURE_PREP
table function extracts terms from a text column with an index called build_text_idx
. The text column is in a build table with a case ID column called cust_id
. The index preference txt_pref
is applied to the index using the id
and cat
columns in the table cat_tbl
. FEATURE_PREP
creates a table of term definitions called txt_pref_terms
.
CREATE TABLE txt_term_out AS SELECT * FROM TABLE(ctxsys.drvodm.feature_prep ( 'build_text_idx', 'cust_id', 'cat_tbl', 'id', 'cat', 'txt_pref_terms', 'txt_pref'));
The FEATURE_EXPLAIN
table function in the CTXSYS.DRVODM
Oracle Text package extracts the term values from the definitions created by FEATURE_PREP
and appends the associated word to each value.
FEATURE_EXPLAIN
returns the final terms table.
The calling syntax of FEATURE_EXPLAIN
is described as follows.
CTXSYS.DRVODM.FEATURE_EXPLAIN ( feature_definition_tbl IN VARCHAR2, RETURN DRVODM;
FEATURE_EXPLAIN
returns the following columns.
Name Type --------------- --------------- text VARCHAR2(160) type NUMBER(3) ID NUMBER score NUMBER
FEATURE_EXPLAIN
accepts a single argument: the terms definition table created by FEATURE_PREP
.
The following example creates a final terms table called txt_final_terms
using the intermediate terms table txt_term_out
. The FEATURE_EXPLAIN
table function returns the terms specified in the terms definition table txt_pref_terms
.
SQL> create table txt_final_terms as select A.sequence_id, B.text, A.value FROM txt_term_out A, TABLE(ctxsys.drvodm.feature_explain( 'txt_pref_terms')) B WHERE A.attribute_id = B.id;
Use the final terms table to populate a nested table column of type DM_NESTED_NUMERICALS
.
The following example creates the table mining_build_nested_text
. (Alternatively, you could create a view.) The table has a case ID column of customer IDs and three customer attribute columns: age, education, and occupation. It also includes a comments column of type DM_NESTED_NUMERICALS
created from the terms table txt_final_terms
.
SQL> CREATE TABLE mining_build_nested_text NESTED TABLE comments store AS build_comments AS SELECT non_text.cust_id, non_text.age, non_text.education, non_text.occupation, txt.comments FROM mining_build_text non_text, ( SELECT features.sequence_id, cast(COLLECT(dm_nested_numerical(features.text,features.value)) as dm_nested_numericals) comments FROM txt_final_terms features group by features.sequence_id) txt WHERE non_text.cust_id = txt.sequence_id(+);
In the following example, a text column in MINING_BUILD_TEXT
is transformed to a nested table column in MINING_BUILD_NESTED_TEXT
. The same text column in MINING_APPLY_TEXT
is transformed to a nested table column in MINING_APPLY_NESTED_TEXT
.
Both MINING_BUILD_TEXT
and MINING_APPLY_TEXT
have the following columns.
Name Null? Type --------------------------------- -------- --------------------------- CUST_ID NOT NULL NUMBER AGE NUMBER EDUCATION VARCHAR2(21) OCCUPATION VARCHAR2(21) COMMENTS VARCHAR2(4000)
The following statements create the indexes.
SQL> create index build_text_idx on mining_build_text (comments) indextype is ctxsys.context parameters ('nopopulate'); SQL> create index apply_text_idx ON mining_apply_text (comments) indextype is ctxsys.context parameters ('nopopulate');
The following statements create the index preference and its table.
SQL> execute ctx_ddl.create_preference('idx_pref', 'SVM_CLASSIFIER'); SQL> create table idx_pref_cat (id number, cat number);
The following statement returns the intermediate terms in the table BUILD_TERMS_OUT
. It also creates the table FEATURE_DEFS
and populates it with the term definitions.
SQL> create table build_terms_out as select * from table (ctxsys.drvodm.feature_prep ('build_text_idx', 'cust_id', 'idx_pref_cat', 'id', 'cat', 'feature_defs', 'idx_pref'));
The following statement returns the final terms in the table BUILD_EXPLAIN_OUT
.
SQL> create table build_explain_out as select a.sequence_id, b.text, a.value from build_terms_out a, table (ctxsys.drvodm.feature_explain('feature_defs')) b where a.attribute_id = b.id;
The following statement creates the table MINING_BUILD_NESTED_TEXT
. This table contains the non-text attributes from the original build table and a nested table of comments. This table can be used to build a model.
SQL> create table mining_build_nested_text nested table comments store as build_comments as select non_text.cust_id, non_text.age, non_text.education, non_text.occupation, txt.comments from mining_build_text non_text, (select features.sequence_id, cast(collect(dm_nested_numerical(features.text,features.value)) as dm_nested_numericals) comments from build_explain_out features group by features.sequence_id) txt where non_text.cust_id = txt.sequence_id(+);
The following statement creates the intermediate terms table for the comments column in the apply table, MINING_APPLY_TEXT
. It uses the term definitions in the FEATURE_DEFS
table, which was created during the pre-processing of the comments column in MINING_BUILD_TEXT
.
SQL> create table apply_terms_out as select * from table (ctxsys.drvodm.feature_prep ('apply_text_idx', 'cust_id', 'feature_defs'));
The following statement creates the final terms table for apply.
SQL> create table apply_explain_out as select a.sequence_id, b.text, a.value from apply_terms_out a, table (ctxsys.drvodm.feature_explain('feature_defs')) b where a.attribute_id = b.id;
The following statement creates the table MINING_APPLY_NESTED_TEXT
. This table contains the non-text attributes from the original apply table and a nested table of comments. This table can be used to apply the model.
SQL> create table mining_apply_nested_text nested table comments store as apply_comments as select non_text.cust_id, non_text.age, non_text.education, non_text.occupation, txt.comments from mining_apply_text non_text, (select features.sequence_id, cast(collect(dm_nested_numerical(features.text,features.value)) as dm_nested_numericals) comments from apply_explain_out features group by features.sequence_id) txt where non_text.cust_id = txt.sequence_id(+);