Oracle® R Enterprise User's Guide Release 1.3 for Linux and Windows Part Number E36761-04 |
|
|
PDF · Mobi · ePub |
This chapter explains how to use Oracle R Enterprise to analyze data stored in tables or views in an Oracle Database.
This chapter discusses these topics:
We assume familiarity with R in the remainder of this section.
For additional examples of using Oracle R Enterprise functionality, see Oracle R Enterprise Statistical Functions. For examples of building statistical models, including models created using Oracle Data Mining algorithm, see In-Database Predictive Models in Oracle R Enterprise.
Before you can use Oracle R Enterprise to analyze data stored in database tables, you must install Oracle R Enterprise, start a client, and connect to the database, as described in Oracle R Enterprise Administrator's Guide.
By convention, most of the functions and methods defined in Oracle R Enterprise begin with the prefix ore
. This is done to avoid name collisions with other R software. However, the objects created by those functions and methods can be anything the end user wants them to be. The end user has complete control over object naming.
Pick any object returned by ore.ls()
and type either class(OBJECTNAME)
or class(OBJECTNAME$COLUMN_NAME)
.
For an example, see Example: Load Data. This example loads a data frame into the database, creating the table DF_TABLE. DF_TABLE has class ore.frame
:
R> class(DF_TABLE) [1] "ore.frame"
The prefix ore
indicates that the object is an Oracle R Enterprise created object that holds metadata for the corresponding object in Oracle Database.
ore.frame is the Oracle R Enterprise metadata object that maps to a database table. The ore.frame
object is the counterpart to an R data.frame.
ore.frame
or can be returned by the class()
function. For an example of creating ore.frame
data, see Load an R Data Frame into the Database.
Use this command to view the Oracle R Enterprise documentation library:
R> OREShowDoc()
Oracle R Enterprise supports this functionality:
Oracle R Enterprise handles R naming conventions for ore.frame
columns, instead of a more restrictive Database names. ore.frame
column names can be longer than 30 bytes, contain double quotes, and be non-unique.
Follow these steps to load data from R data frames on your system to the Oracle Database:
Load contents of the file to an R data frame using read.table()
or read.csv()
functions documented in R online help.
Then use ore.create()
to load a data frame to a table:
ore.create(data_frame, table="TABLE_NAME")
loads data_frame into the database table TABLE_NAME.
For an example, see Example: Load Data.
This example creates an R data frame df
consisting of pairs of numbers and letters and then loads the data frame into the Oracle table DF_TABLE. The example shows that the data frame and the table have the same dimensions and the same first few elements, but different values for class. The class for DF_TABLE is ore.frame
. At the end of the example is a check that DF_TABLE exists in the current schema.
R> df <- data.frame(A=1:26, B=letters[1:26]) R> dim(df) [1] 26 2 R> class(df) [1] "data.frame" R> head(df) A B 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e 6 6 f R> ore.create(df, table="DF_TABLE") R> ore.ls() [1] "DF_TABLE" R> class(DF_TABLE) [1] "ore.frame" attr(,"package") [1] "OREbase" R> dim(DF_TABLE) [1] 26 2 R> head(DF_TABLE) A B 0 1 a 1 2 b 2 3 c 3 4 d 4 5 e 5 6 f R> exists("DF_TABLE")[1] TRUE
If you connect to the database using a tool such as SQL Developer, you can view DF_TABLE directly in the database.
ore.push(data.frame)
stores an R object in the database as a temporary object, and returns a handle to that object. It converts data frame, matrix, and vector to a table, and list, model, and others to a serialized object.
The object that you create exists during the R session; to store the data in a permanent way, see Persist and Manage R Objects in the Database
This example pushes the numerical vector created by the R command c(1,2,3,4,5)
to v, an Oracle R Enterprise object:
R> v <- ore.push(c(1,2,3,4,5)) R> class(v) [1] "ore.numeric" attr(,"package") [1] "OREbase" R> head(v) [1] 1 2 3 4 5
ore.exists()
checks for the existence of an ore.frame
object in the ORE schema environment. For ore.exists()
to find an ore.frame
object the object must have been synchronized with ore.sync()
first.
The objects available in the ORE environment are not necessarily the same as the database objects. One should not use ore.exists
() to check for table existence.
For an example, see Example: Load Data.
ore.exists(name, schema)
has these arguments:
name
: A character string specifying the name of the ore.frame
object
schema
: A character string specifying the name of database schema to check
ore.exists()
returns TRUE if the object exists in the ORE schema and FALSE, if it does not exist.
To drop a table in the database use
ore.drop(table="NAMEOFTABLE")
For example, these commands drop the table v
and verifies that it does not exist:
R> ore.drop(table="v") R> ore.exists("v") [1] FALSE
If you drop a table that does not exist, there is no error message.
To pull the contents of an Oracle Database table or view into an in-memory R data frame use ore.pull(OBJECT_NAME)
for the name of an object returned by ore.ls()
.
Note:
You can pull a table or view to an R frame only if the data can fit into R's memory.Suppose that your Oracle Database contains the table NARROW. Then ore.pull()
creates the data frame df_narrow
from the table NARROW. When you verify that df_narrow
is a data frame. The warning message appears because the table NARROW is not indexed:
R> df_narrow <- ore.pull(NARROW) Warning message: ORE object has no unique key - using random order R> class(df_narrow) [1] "data.frame"
Almost all data in R is a vector or is based on vectors (vectors themselves, lists, matrices, data frames, and so forth). The elements of a vector have an explicit order. Each element has an index. R code actively uses this order of elements.
However, database-backed relational data (tables and views) does not define any order of rows and thus cannot be directly mapped to R data structures. You can define an explicit order on database tables and views via an ORDER BY clause. The order is usually achieved by having a unique identifier (single- or multi- column key). Ordering in this way can be inefficient and slow for some operations that lead to unnecessary sorting.
row.names<-
defines ordering but doesn't actually index a table. The assignment option provides a way to specify a unique column. Initially it supports at least one column but may support multi-column specifications as well. When row.names<-
is applied to unordered frames, it returns an error.
You can use the integer indexing created by the ordering infrastructure to perform sampling and partitioning, as described in Sampling and Partitioning.
Suppose that the table NARROW is not indexed. The following example illustrates using row.names
to create an indexed table:
R> row.names(head(NARROW)) Error: ORE object has no unique key In addition: Warning message: ORE object has no unique key - using random order R> R> row.names(NARROW) <- NARROW$ID R> R> row.names(head(NARROW[,1:3])) [1] "101501" "101502" "101503" "101504" "101505" "101506" R> R> head(NARROW[,1:3]) ID GENDER AGE 101501 101501 <NA> 41 101502 101502 <NA> 27 101503 101503 <NA> 20 101504 101504 <NA> 45 101505 101505 <NA> 34 101506 101506 <NA> 38
The ordering (indexing) for tables described in Order in Tables can be used to perform sampling and partitioning.
This section provides examples of
R supports powerful constructions using vectors as indices. Oracle R Enterprise supports similar functionality with these differences:
Integer indexing is not supported for ore.vector
objects.
Negative integer indexes are not supported.
Row order is not preserved.
This example illustrates indexing:
R> tmp <- ASTHMA R> tmp[c(1L, 2L, 1L),] Error: ORE object has no unique key R> rownames(tmp) <- tmp R> tmp[c(1L, 2L, 1L),] CITY ASTHMA COUNT 1|0|65 1 0 65 1|0|65.1 1 0 65 1|1|35 1 1 35 R> tmp[c(1L, 2L, 1L),]@dataQry
This code illustrates several sampling techniques:
# Generate random data set.seed(123) N <- 1000000 mydata <- data.frame(x = rnorm(N, mean = 20, sd = 2), group = sample(letters, N, replace = TRUE, prob = (26:1)/sum(26:1))) mydata$y <- rbinom(N, 1, 1/(1+exp(-(.5 - 0.25 * mydata$x + .1 * as.integer(mydata$group))))) MYDATA <- ore.push(mydata) rm(mydata) # Create a function that creates random row indices from large tables mysampler <- function(n, size, replace = FALSE) { #' Random Whole Number Sampler #' @param n number of observations in sample #' @param size total number of observations #' @param replace indicator for sampling with replacement #' @return numeric vector containing the sample indices n <- round(n) size <- round(size) if (n < 0) stop("'n' must be a non-negative number") if (size < 1) stop("'size' must be a positive number") if (!replace && (n > size)) stop("'n' cannot exceed 'size' when 'replace = FALSE'") if (n == 0) numeric() else if (replace) round(runif(n, min = 0.5, max = size + 0.5)) else { maxsamp <- seq(size + 0.5, by = -1, length.out = n) samp <- round(runif(n, min = 0.5, max = maxsamp)) while(length(bump1 <- which(duplicated(samp)))) samp[bump1] <- samp[bump1] + 1 samp } } # Data set and sample size N <- nrow(MYDATA) sampleSize <- 500 # 1. Simple random sampling srs <- mysampler(sampleSize, N) simpleRandomSample <- ore.pull(MYDATA[srs, , drop = FALSE]) # 2. Systematic sampling systematic <- round(seq(1, N, length.out = sampleSize)) systematicSample <- ore.pull(MYDATA[systematic, , drop = FALSE]) # 3. Stratified sampling stratifiedSample <- do.call(rbind, lapply(split(MYDATA, MYDATA$group), function(y) { ny <- nrow(y) y[mysampler(sampleSize * ny/N, ny), , drop = FALSE] })) # 4. Cluster sampling clusterSample <- do.call(rbind, sample(split(MYDATA, MYDATA$group), 2)) # 5a. Accidental/Convenience sampling (via row order access) convenientSample1 <- head(MYDATA, sampleSize) # 5b. Accidental/Convenience sampling (via hashing) maxHash <- 2^32 # maximum allowed in ore.hash convenient2 <- (ore.hash(rownames(MYDATA), maxHash)/maxHash) <= (sampleSize/N) convenientSample2 <- ore.pull(MYDATA[convenient2, , drop = FALSE]) Random
For Oracle R Enterprise random partitions can be generated in the transparency layer by adding a partition or group column to an ore.frame object in the following manner:
nrowX <- nrow(x) x$partition <- sample(rep(1:k, each = nrowX/k, length.out = nrowX), replace = TRUE)
After these partitions have been joined to the original data set, the ore.groupApply function can be used to perform the little bootstraps:
results <- ore.groupApply(x, x$partition, function(y) {...}, parallel = TRUE)
R objects exist for the duration of the current session, unless they are explicitly saved. For example, if you build a model in a particular R session, the model is not available when the session is closed, unless the model was explicitly saved.
Oracle R Enterprise supports persistence for R objects onto the database.
Persistence provides these advantages:
You can access the same R and Oracle R Enterprise object (for example, a model) among different R sessions.
You can build a model in R and use it for prediction and scoring in embedded Oracle R Enterprise.
Oracle R Enterprise creates datastores to contain persisted objects.
Persisted objects reside in a datastore. The following Oracle R Enterprise functionality allows you manage persistence:
ore.save()
saves an R object or a list of R objects to the specified datastore in the connected database in the current user's schema:
ore.save({...}, list = character(0), name, envir = parent.frame(), overwrite = FALSE, append = FALSE, description = character(0)))
The parameters for ore.save()
are as follows:
{...}
is the list of R objects to save; the names of the objects to be saved (as symbols or character strings)
list
is a character vector containing the names of objects to be saved
envir
is the environment to search for objects to be saved
overwrite
is a logical value specifying whether to overwrite the datastore if already exists; the default is FALSE (do not overwrite)
name
is the name of the datastore; name
must be specified
description
is a comment describing the datastore
append
is a logical value specifying whether to append objects to the datastore if already exists; the default is FALSE (do not append)
Save all objects in the current workspace environment to the datastore ds_1
in the user's current schema:
ore.save(list=ls(), name="ds_1", description = "example datastore")
Overwrite existing datastore ds_2
with objects x
, y
, and z
in the current workspace environment:
ore.save(x, y, z, name="ds_2", overwrite=TRUE)
Add objects x
, y
, and z
in the current workspace environment to the existing datastore ds_3
(that is append the objects to the datastore):
ore.save(x, y, z, name="ds_3", append=TRUE)
ore.load()
loads all of the R objects stored in a specified datastore in the current user schema in the connected database to R:
ore.load(name, list = character(0), envir = parent.frame())
The parameters for ore.load()
are
name
is a character string specifying the name of datastore to load the objects from; you must specify a name
list
is a character vector containing the names of objects to be loaded
envir
is the R environment that objects are loaded to
ore.load()
returns a character vector containing the names of objects loaded from the datastore.
ore.delete()
deletes the specified datastore (and all of the R objects in it) from the current user schema in the connected database:
ore.delete(name)
The parameter for ore.delete()
is
name
is a character string specifying the name of datastore to delete; you must specify a name
Use ore.datastore() to list the datastores that exist in the user's Oracle Database schema.
ore.datastore()
lists the datastores and basic information about each datastore in the current schema:
ore.datastore(name, pattern)
The parameters for ore.datastore()
are
name
is a character string specifying the name of datastore to list
pattern
is a regular expression character string specifying the names of the datastores to list.
ore.datastore()
lists information about the datastore with name specified in name
or information about the datastores whose names match the regular expression specified in pattern
.
If neither name
nor pattern
is provided, ore.datastore()
returns information about all datastores in user's schema.
Either name
or pattern
can be specified but not both.
ore.datastore()
returns a data.frame
object with these columns:
datastore.name
name of the datastore
object.count
number of objects in the datastore identified by datastore.name
size
size of the datastore in bytes
creation.date
date of datastore creation
description
comment for datastore (comment is specified in the description
parameter of ore.save
)
Each row of the data.frame
lists one datastore. Rows are sorted by column datastore.name
in alphabetical order.
ore.datastoreSummary()
returns a data.frame
that lists the names and summary information for the R objects saved in the specified datastore in the schema in the connected database:
ore.datastoreSummary(name)
The parameter for ore.datastoreSummary()
is
name
is a character string specifying the name of datastore to summarize; you must specify a name
If the specified datastore does not exist, an error is returned.
ore.datastoreSummary()
returns a data.frame
object with these columns:
object.name
is the name of the R object
class.name
is the class name of the R object
size
is the size of the R object in bytes
length
is the length of the R object
row.count
is the number of rows for the R object
col.count
is number of columns of the R object
Each row of the data.frame
lists one R object. Rows are sorted by column datastore.name
in alphabetical order.
The following examples illustrate using R with Oracle R Enterprise data types:
Simple column and row selection in R:
# Push built-in R data set iris to database R> ore.create(iris, table="IRIS") R> head(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa R> iris_projected = IRIS[, c("PETAL_LENGTH", "SPECIES")] R> head (iris_projected) PETAL_LENGTH SPECIES 0 1.4 setosa 1 1.4 setosa 2 1.3 setosa 3 1.5 setosa 4 1.4 setosa 5 1.7 setosa
df1 <- data.frame(x1=1:5, y1=letters[1:5]) df2 <- data.frame(x2=5:1, y2=letters[11:15]) merge (df1, df2, by.x="x1", by.y="x2") x1 y1 y2 1 1 a o 2 2 b n 3 3 c m 4 4 d l 5 5 e k # Create database objects to correspond to in-memory R objects df1 and df2 ore.df1 <- ore.create(df1, table="DF1") ore.df2 <- ore.create(df2, table="DF2") # Compare results R> merge (DF1, DF2, by.x="X1", by.y="X2") X1 Y1 Y2 0 1 a o 1 2 b n 2 3 c m 3 4 d l 4 5 e k
# Push built-in data set iris to database ore.create(iris, table="IRIS") aggdata <- aggregate(IRIS, by = list(IRIS$SPECIES), FUN = summary) class(aggdata) head(aggdata)
Data formatting and creating derived columns in R
Note that adding derived columns does not change the database table. See Derived Columns in Oracle R Enterprise.
diverted_fmt <- function (x) { ifelse(x==0, 'Not Diverted', ifelse(x==1, 'Diverted','')) } cancellationCode_fmt <- function(x) { ifelse(x=='A', 'A CODE', ifelse(x=='B', 'B CODE', ifelse(x=='C', 'C CODE', ifelse(x=='D', 'D CODE', 'NOT CANCELLED')))) } delayCategory_fmt <- function(x) { ifelse(x>200,'LARGE', ifelse(x>=30,'MEDIUM','SMALL')) } zscore <- function(x) { (x-mean(x,na.rm=TRUE))/sd(x,na.rm=TRUE) # ONTIME_S is a database table ONTIME_S$DIVERTED <- diverted_fmt(DIVERTED) ONTIME_S$CANCELLATIONCODE <- cancellationCode_fmt(CANCELLATIONCODE) ONTIME_S$ARRDELAY <- delayCategory_fmt(ARRDELAY) ONTIME_S$DEPDELAY <- delayCategory_fmt(DEPDELAY) ONTIME_S$DISTANCE_ZSCORE <- zscore(DISTANCE)
When you add derived columns using Oracle R Enterprise, the derived columns do not affect the underlying table in the database. A SQL query is generated that has the additional derived columns in the select list, but the table is not changed.
This example illustrates using Oracle R Enterprise with a standard R package downloaded from CRAN:
This example illustrates building a regression model using a CRAN package. You can prepare the data used for training in the database (filtering out observations that are not of interest, selecting attributes, imputing missing values, and so forth). Suppose that the preprocessed data is in the table ONTIME_S_PREPROCESSED_SUBSET. Then pull the prepared training set (which is usually small enough to fit in desktop R memory) into the R client to execute the model build.
You can use the resulting model to score (predict) large numbers of rows, in parallel, in Oracle Database. The data are stored in ONTIME_S_FINAL_DATA_TO_BE_SCORED
.
Note that scoring is a trivially parallelizable operation because one row can be scored independent of and in parallel with another row. The model built on the desktop is shipped to the database to perform scoring on vast numbers of rows in the database.
The computations are divided into these steps:
dat <- ore.pull(ONTIME_S_PREPROCESSED_SUBSET) mod <- glm(ARRDELAY ~ DISTANCE + DEPDELAY, dat) mod summary(mod)
Score in-parallel in the database using embedded R:
prd <- predict(mod, newdata=ONTIME_S_FINAL_DATA_TO_BE_SCORED) class(prd) # Add predictions as a new column res <- cbind(newdat, PRED = prd) head(res)
R provides many other ways to build regression models, such as lm()
.
For other ways to build regression models, see Oracle R Enterprise Versions of R Models and In-Database Predictive Models in Oracle R Enterprise.
The embedded R engine in the Oracle Database allows R users to off load desktop calculations that may require either more resources such as those available to Oracle Database or database-driven data parallelism. The embedded R engine also executes R scripts embedded in SQL or PL/SQL programs (lights-out processing).
These examples illustrate using Oracle R Enterprise embedded R engine with standard R packages downloaded from CRAN:
This example illustrates off loading R computation to execute in the embedded R engine. To off load an R computation, simply include the R code within a closure (that is, function() {}
) and invoke ore.doEval()
. ore.doEval()
schedules execution of the R code with the database-embedded R engine and returns the results back to the desktop for continued analysis:
library(biglm) mod <- ore.doEval( function() { library(biglm) dat <- ore.pull(ONTIME_S) mod <- biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat) mod }, ore.connect = TRUE); print(mod) mod=ore.pull(mod) print(mod)
This example illustrates database-driven data parallelism at work in building a series of regression models using a CRAN package. One model is built per unique value of a factor. The database orchestrates the parallel and concurrent building of the models, one per factor and brings the list of all models built to the user desktop for further analysis:
modList <- ore.groupApply( # Organize input to the R script – This is always an Oracle R Enterprise # data frame X=ONTIME_S, # Specify the grouping column. Here we request one model per unique value of # ONTIME_S$DEST INDEX=ONTIME_S$DEST, # Model building code goes inside the closure. Input and grouping # conditions can be referenced as parameters to the function function(x) { library(biglm) biglm(ARRDELAY ~ DISTANCE + DEPDELAY, x) }); modList_local <- ore.pull(modList) # Print the model for just one destination - BOSTON summary(modList_local$BOS)
Oracle R Enterprise is shipped with a collection of demos, examples that illustrate how to use Oracle R Enterprise. These examples are a collection of self-contained R scripts.
Most of the sample programs use the data frame iris
, which is included in the R distribution. iris
is loaded into a table as described in Load Data Frame to a Table.
The rest of this section describes two examples in detail and includes a list of all of the examples:
Start R, load the ORE packages via library(ORE)
, and then connect to the database.
Follow these steps to load an R data frame to a database table:
This example uses the R data set iris
.
The iris
data set is located in the datasets package that is part of the R distribution:
R> find("iris") [1] "package:datasets"
Use the R command class
to verify that iris
is an R data frame:
R> class(iris) [1] "data.frame"
iris
consist of measurements of parts of iris flowers. Use the R command head
to see a small sample of the data in iris
.
R> head(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa
Now load the data frame iris
into the database that you are connected to.
Suppose that the database table version of iris
is named IRIS_TABLE. Drop IRIS_TABLE to make sure that no table of this name exists in the connected schema:
ore.drop(table = "IRIS_TABLE")
If IRIS_TABLE doesn't exist, you do not get a message.
Now create a database table with the data contained in iris
:
ore.create(iris, table = "IRIS_TABLE")
Use ore.ls()
to verify that the table was created:
R> ore.ls() [1] "IRIS_TABLE" "NARROW" "ONTIME_S"
IRIS_TABLE is a database-resident table with just metadata on the R side:
R> class(IRIS_TABLE) [1] "ore.frame" attr(,"package") [1] "OREbase"
Use head
to see the column names and the first few values in IRIS_TABLE:
R> head(IRIS_TABLE) SEPAL_LENGTH SEPAL_WIDTH PETAL_LENGTH PETAL_WIDTH SPECIES 0 5.1 3.5 1.4 0.2 setosa 1 4.9 3.0 1.4 0.2 setosa 2 4.7 3.2 1.3 0.2 setosa 3 4.6 3.1 1.5 0.2 setosa 4 5.0 3.6 1.4 0.2 setosa 5 5.4 3.9 1.7 0.4 setosa
Use class
to see the data type of the column SPECIES.
R > class(IRIS_TABLE$SPECIES) [1] "raw"
You can use R functions to analyze the data in the table. Here are some simple examples taken from the example basic.R
:
Use unique
to get a list of the unique entries in a column. This example finds the unique SPECIES:
R> unique(IRIS_TABLE$SPECIES) [1] setosa versicolor virginica Levels: setosa versicolor virginica
Find the minimum, maximum, and mean of PETAL_LENGTH:
R> min(IRIS_TABLE$PETAL_LENGTH) [1] 1 R> max(IRIS_TABLE$PETAL_LENGTH) [1] 6.9 R> mean(IRIS_TABLE$PETAL_LENGTH) [1] 3.758
If you need information about an R function, use the command help(function-name)
.
Oracle Database has logical values: TRUE, FALSE, NULL. There is a 3x3 table that defines truth values for propositions with AND and OR. NULL is treated as unknown value. For some operations the result is either deterministic (for example TRUE OR NULL) or unknown (TRUE AND NULL). If logical values are used in a WHERE clause, only rows with the condition TRUE are selected; FALSE and NULL are ignored. R, on the other hand, keeps TRUE and NA. Rows with NA are selected with value NA.
The option ore.na.extract
controls if NAs are selected or not. The default is to treat NA as SQL treats FALSE.
The demo nulls.R
is the only sample that does not use iris
as data. nulls.R
compares the handling of NULLs in SQL with the handling of NAs in R.
In R, NA is a logical constant of length 1 which contains a missing value indicator. In the database, null refers to the absence of a value in a column of a row. Nulls indicate missing, unknown, or inapplicable data.
Follow these steps to understand the demo nulls.R
:
This demo uses the data frame airquality
. Verify that the data set is a data frame and look at the few rows of the data frame:
R> class(airquality) [1] "data.frame" R> head(airquality) Ozone Solar.R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9 66 5 6
Load airquality
into the database as "AIRQUALITY":
ore.drop(table = "AIRQUALITY") ore.create(airquality, table = "AIRQUALITY")
Use ore.ls()
to verify that the table was created. If you wish, use class(AIRQUALITY)
to verify that AIRQUALITY
is a database-resident table with just metadata on the R side.
Examine how R handles NAs. Return all observations where ozone < 30 is TRUE:
R> nrow(airquality[airquality$Ozone < 30,])[1] 92
Compare this with the results when NAs are explicitly excluded:
R> nrow(airquality[airquality$Ozone < 30 & !is.na(airquality$Ozone),]) [1] 55
The default behavior for SQL tables is to exclude NULLS in output:
nrow(AIRQUALITY[AIRQUALITY$OZONE < 30,]) [1] 55
To handle NULLs the same way that R handles NA, request the behavior explicitly:
options(ore.na.extract = TRUE) nrow(AIRQUALITY[AIRQUALITY$OZONE < 30,]) [1] 92
These scripts have been added as demos to the ORE package.
To list all of the demos included with Oracle R Enterprise, type
R> demo(package = "ORE")
To run one of these scripts, specify the name of the demo in a demo
function call. For example, to run aggregate.R
, type
R> demo("aggregate", package = "ORE")
These demos are shipped with Oracle R Enterprise:
aggregate Aggregation analysis Basic analysis & data processing operations basic Basic connectivity to database binning Binning logic columnfns Column functions cor Correlation matrix crosstab Frequency cross tabulations datastore DataStore operations datetime Date/Time operations derived Handling of derived columns distributions Distribution, density, and quantile functions do_eval Embedded R processing freqanalysis Frequency cross tabulations graphics Demonstrates visual analysis group_apply Embedded R processing by group hypothesis Hyphothesis testing functions matrix Matrix related operations nulls Handling of NULL in SQL vs. NA in R odm_ai Oracle Data Mining: attribute importance odm_dt Oracle Data Mining: decision trees odm_glm Oracle Data Mining: generalized linear models odm_kmeans Oracle Data Mining: enhanced k-means clustering odm_nb Oracle Data Mining: naive Bayes classification odm_svm Oracle Data Mining: support vector machines push_pull RDBMS <-> R data transfer rank Attributed-based ranking of observations reg Ordinary least squares linear regression row_apply Embedded R processing by row chunks sampling Random row sampling and partitioning of an ore.frame sql_like Mapping of R to SQL commands stepwise Stepwise OLS linear regression summary Summary functionality table_apply Embedded R processing of entire table