This chapter contains the following topics:
The best practices discussed in this section apply to Oracle Database 11g with Oracle Real Application Clusters (Oracle RAC). These best practices build on the Oracle Database 11g configuration best practices described in Chapter 5, "Configuring Oracle Database" and Chapter 6, "Configuring Oracle Database with Oracle Clusterware." These best practices are identical for the primary and standby databases if they are used with Data Guard in Oracle Database 11g with Oracle RAC and Data Guard—MAA. Some best practices may use your system resources more aggressively to reduce or eliminate downtime. This can, in turn, affect performance service levels, so be sure to assess the impact in a test environment before implementing these practices in a production environment.
Instance recovery is the process of recovering the redo thread from the failed instance. Instance recovery is different from crash recovery, which occurs when all instances accessing a database have failed. Crash recovery is the only type of recovery when an instance fails using a single-instance Oracle Database.
When using Oracle RAC, the SMON process in one surviving instance performs instance recovery of the failed instance.
In both Oracle RAC and single-instance environments, checkpointing is the internal mechanism used to bound Mean Time To Recover (MTTR). Checkpointing is the process of writing dirty buffers from the buffer cache to disk. With more aggressive checkpointing, less redo is required for recovery after a failure. Although the objective is the same, the parameters and metrics used to tune MTTR are different in a single-instance environment versus an Oracle RAC environment.
In a single-instance environment, you can set the FAST_START_MTTR_TARGET initialization parameter to the number of seconds crash recovery should take. Note that crash recovery time includes the time to start up, mount, recover, and open the database.
Oracle provides several ways to help you understand the MTTR target your system is currently achieving and what your potential MTTR target could be, given the I/O capacity.
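For example, a minimal sketch of setting the target and checking the current estimate follows; the 60-second value is only an illustrative assumption and must be validated against your I/O capacity:

ALTER SYSTEM SET FAST_START_MTTR_TARGET=60 SCOPE=BOTH;

SELECT TARGET_MTTR, ESTIMATED_MTTR FROM V$INSTANCE_RECOVERY;

The V$INSTANCE_RECOVERY view reports the effective target (TARGET_MTTR) and the current estimated recovery time (ESTIMATED_MTTR) for the instance.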
See Also: The MAA white paper "Best Practices for Optimizing Availability During Unplanned Outages Using Oracle Clusterware and Oracle Real Application Clusters" in the MAA Best Practices area for Oracle Database
The FAST_START_PARALLEL_ROLLBACK parameter determines how many processes are used for transaction recovery, which is done after redo application. Optimizing transaction recovery is important to ensure an efficient workload after an unplanned failure. If the system is not CPU bound, setting this parameter to HIGH is a best practice. This causes Oracle to use four times the CPU_COUNT (4 X CPU_COUNT) parallel processes for transaction recovery. The default setting for this parameter is LOW, or two times the CPU_COUNT (2 X CPU_COUNT). Set the parameter as follows:
ALTER SYSTEM SET FAST_START_PARALLEL_ROLLBACK=HIGH SCOPE=BOTH;
See Also: Oracle Database VLDB and Partitioning Guide for information about Parameters Affecting Resource Consumption for Parallel DML and Parallel DDL
Using asynchronous I/O is a best practice that is recommended for all Oracle Databases. For more information, see Section 5.1.7, "Set DISK_ASYNCH_IO Initialization Parameter".
Use redundant dedicated connections and sufficient bandwidth for public traffic, Oracle RAC interconnects, and I/O.
In Oracle terms, an extended cluster is a configuration of two or more nodes that are separated across two physical locations. For an extended cluster, and for other Oracle RAC configurations, separate dedicated channels on one fiber may be needed, or you can optionally configure Dense Wavelength Division Multiplexing (DWDM) to allow communication between the sites without repeaters and to allow distances greater than 10 km between the sites. The disadvantage is that DWDM can be prohibitively expensive.
See Also: Oracle Database 2 Day + Real Application Clusters Guide for more information About Network Hardware Requirements
Oracle RAC One Node is a single instance of an Oracle Real Application Clusters (Oracle RAC) database that runs on one node in a cluster with an option to failover or migrate to other nodes in the same cluster. This option adds to the flexibility that Oracle offers for database consolidation. You can consolidate many databases into one cluster with minimal overhead while also providing the high availability benefits of failover protection, online rolling patch application, and rolling upgrades for the operating system and Oracle Clusterware.
See Also: Oracle Real Application Clusters Administration and Deployment Guide for more information about Administering Oracle RAC One Node
An Oracle RAC extended cluster is an architecture that provides extremely fast recovery from a site failure and allows for all nodes, at all sites, to actively process transactions as part of single database cluster. An extended cluster provides greater high availability than a local Oracle RAC cluster, but because the sites are typically in the same metropolitan area, this architecture may not fulfill all disaster recovery requirements for your organization.
The best practices discussed in this section apply to Oracle Database 11g with Oracle RAC on extended clusters, and build on the best practices described in Section 7.1, "Configuring Oracle Database with Oracle RAC."
Use the following best practices when configuring an Oracle RAC database for an extended cluster environment:
Spread the Workload Evenly Across the Sites in the Extended Cluster
Configure the Nodes to Be Within the Proximity of a Metropolitan Area
Use Host-Based Storage Mirroring with Oracle ASM Normal or High Redundancy
A typical Oracle RAC architecture is designed primarily as a scalability and availability solution that resides in a single data center. To build and deploy an Oracle RAC extended cluster, the nodes in the cluster are separated by greater distances. When configuring an Oracle RAC database for an extended cluster environment, you must:
Configure one set of nodes at Site A and another set of nodes at Site B.
Spread the cluster workload evenly across both sites to avoid introducing additional contention and latency into the design. For example, avoid client/server application workloads that run across sites, such that the client component is in site A and the server component is in site B.
Most extended clusters have only two storage systems (one at each site). During normal processing, each node writes and reads a disk heartbeat at regular intervals; if the heartbeat cannot complete, the affected nodes are evicted from the cluster, forcing them to restart and rejoin the cluster before they can safely access the shared resources again. Thus, the site that houses the majority of the voting disks is a potential single point of failure for the entire cluster. For availability reasons, you should add a third site that can act as the arbitrator when one site fails or when a communication failure occurs between the sites.
In some cases, you can also use standard NFS to support a third voting disk on an extended cluster. You can configure the quorum disk on an inexpensive, low-end, standard NFS-mounted device somewhere on the network. Oracle recommends putting the NFS voting disk on a dedicated server that belongs to the production environment.
If you have an extended cluster and do not configure a third site, you must determine which of the two sites is the primary site. Then, if the primary site fails, you must manually restart the secondary site.
Note: Oracle Clusterware supports NFS, iSCSI, Direct Attached Storage (DAS), Storage Area Network (SAN) storage, and Network Attached Storage (NAS). If your system does not support NFS, use an alternative. For example, on Windows systems you can use iSCSI.
See Also: For more information, see the Technical Article "Using standard NFS to support a third voting file for extended cluster configurations"
Extended clusters provide the highest level of availability for server and site failures when data centers are in close enough proximity to reduce latency and complexity. The preferred distance between sites in an extended cluster is within a metropolitan area. High internode and interstorage latency can have a major effect on performance and throughput. Performance testing is mandatory to assess the impact of latency. In general, distances of 50 km or less are recommended.
Testing has shown the distance (greatest cable stretch) between Oracle RAC cluster nodes generally affects the configuration, as follows:
Distances less than 10 km can be deployed using normal network cables.
Distances equal to or more than 10 km require Dense Wavelength Division Multiplexing (DWDM) links.
Distances from 10 km to 50 km require storage area network (SAN) buffer credits to minimize the performance impact of the distance; otherwise, the performance degradation can be significant.
For distances greater than 50 km, there are not yet enough proof points to indicate the effect of deployments. More testing is needed to identify what types of workloads could be supported and what the effect of the chosen distance would have on performance.
Use host-based mirroring with Oracle ASM normal or high redundancy configured disk groups so that a storage array failure does not affect the application and database availability.
Oracle recommends host-based mirroring using Oracle ASM to internally mirror across the two storage arrays. Implementing mirroring with Oracle ASM provides an active/active storage environment in which system write I/Os are propagated to both sets of disks, making the disks appear as a single set of disks that is independent of location. Do not use array-based mirroring because only one storage site is active, which makes the architecture vulnerable to this single point of failure and longer recovery times.
The Oracle ASM volume manager provides flexible host-based mirroring redundancy options. You can choose to use external redundancy to defer the mirroring protection function to the hardware RAID storage subsystem. The Oracle ASM normal and high-redundancy options allow two-way and three-way mirroring, respectively.
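For example, a disk group mirrored across the two sites can be created with one failure group per storage array; the disk group name, failure group names, and disk paths below are illustrative assumptions only:

CREATE DISKGROUP DATA NORMAL REDUNDANCY
  FAILGROUP sitea DISK '/dev/rdsk/site_a_disk1', '/dev/rdsk/site_a_disk2'
  FAILGROUP siteb DISK '/dev/rdsk/site_b_disk1', '/dev/rdsk/site_b_disk2';

With this layout, Oracle ASM writes each extent to both failure groups, so the loss of either storage array leaves a complete copy of the data at the surviving site.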
Note: Array-based mirroring can be used in an Oracle RAC extended cluster. However, with this approach the two mirrored sites operate in an active-passive configuration, so a complete outage occurs if one site fails; service becomes available only after the remaining mirror site is brought up. For this reason, array-based mirroring is not recommended from an HA perspective. To operate with two active sites, host-based mirroring is recommended.
Beginning with Oracle Database Release 11g, Oracle ASM includes a preferred read capability that ensures that a read I/O accesses the local storage instead of unnecessarily reading from a remote failure group. When you configure Oracle ASM failure groups in extended clusters, you can specify that a particular node reads from a failure group extent that is closest to the node, even if it is a secondary extent. This is especially useful in extended clusters where remote nodes have asymmetric access for performance, thus leading to better usage and lower network loading. Using preferred read failure groups is most useful in extended clusters.
The ASM_PREFERRED_READ_FAILURE_GROUPS initialization parameter value is a comma-delimited list of strings that specifies the failure groups that should be preferentially read by the given instance. This parameter is instance specific, and it is generally used only for clustered Oracle ASM instances. Its value can be different on different nodes. For example:
diskgroup_name1.failure_group_name1, ...
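For example, on the Oracle ASM instance running at the first site, a setting such as the following could be used; the DATA disk group, SITEA failure group, and +ASM1 SID are assumptions for this sketch:

ALTER SYSTEM SET ASM_PREFERRED_READ_FAILURE_GROUPS = 'DATA.SITEA' SCOPE=BOTH SID='+ASM1';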
Consider the following additional factors when implementing an extended cluster architecture:
Network, storage, and management costs increase.
Write performance incurs the overhead of network latency. Test the workload performance to assess the impact of this overhead.
Because this is a single database without Oracle Data Guard, there is no protection from data corruption or data failures.
The Oracle release, the operating system, and the clusterware used for an extended cluster all factor into the viability of extended clusters.
When choosing to mirror data between sites:
Host-based mirroring requires a clustered logical volume manager to allow active/active mirrors and thus a primary/primary site configuration. Oracle recommends using Oracle ASM as the clustered logical volume manager.
Array-based mirroring allows active/passive mirrors and thus a primary/secondary configuration.
Extended clusters need additional destructive testing, covering:
Site failure
Communication failure
For full disaster recovery, complement the extended cluster with a remote Data Guard standby database, because this architecture:
Maintains an independent physical replica of the primary database
Protects against regional disasters
Protects against data corruption and other potential failures
Provides options for performing rolling database upgrades and patch set upgrades
A data protection plan is not complete without a sound backup and recovery strategy to protect against system and storage failures. Oracle delivers a comprehensive data protection suite for backup and recovery of Oracle database and unstructured, application files.
The primary focus of this chapter is best practice configuration for backup and recovery for the Oracle database. File system data protection offerings are introduced along with pointers on where to find more information.
This chapter contains the following topics:
Backup and Recovery Configuration and Administration Best Practices
Backup and Recovery Operations and Maintenance Best Practices
Table 8-1 provides a quick reference summary of the Oracle backup and recovery suite.
Table 8-1 Backup and Recovery Summary
| Technology | Recommended use with Oracle Database | Recommended use with File System Data | Comments |
|---|---|---|---|
| Recovery Manager (RMAN) | Yes | No | Native backup utility for the Oracle database |
| Oracle Secure Backup | Yes | Yes | Tape backup management software |
| Oracle Secure Backup Cloud Module | Yes | No | Backup to Amazon S3 storage |
| Flashback Technologies | Yes | No | Logical error correction leveraging undo data of the Oracle database |
| Flashback Database | Yes | No | Continuous Data Protection (CDP) leveraging flashback logs |
| Automatic Clustered File System (ACFS) Snapshots (for file system clones and disaster recovery) | No | Yes | Read-only or read/write copy-on-write version of the file system. Replication available. |
| ZFS Snapshots (for database clones such as dev/test) | Yes | Yes | Read-only or read/write copy-on-write version of the database for testing and development |
| ZFS Snapshots for backup/restore | No | Yes | Read-only or read/write copy-on-write version of the file system for testing, development, and backup |
This section discusses the motivation and tools for maintaining good database backups, for using Oracle database recovery features, and for using backup options and strategies made possible with Oracle database features.
Using backups to resolve an unscheduled outage of a production database may not allow you to meet your Recovery Time Objective (RTO) and Recovery Point Objective (RPO) or service-level requirements. For example, some outages are handled best by using Flashback Database or a standby database. However, some situations require using database backups, including the sample situations shown in Table 8-2.
Table 8-2 Sample Situations that Require Database Backup
| Situations that require Database Backup | Description |
|---|---|
| Setting Up the Initial Data Guard Environment | During initial setup of a standby database, you can either use a backup of the primary database that is made accessible to the secondary site to create the initial standby database, or use RMAN network-enabled database duplication to create the standby database without the need for a pre-existing backup. To perform an over-the-network duplication, you must include the FROM ACTIVE DATABASE clause in the RMAN DUPLICATE command. |
| Recovering from Data Failures Using File or Block Media Recovery | When a block corruption, media failure, or other physical data failure occurs in an environment that does not include Data Guard, the only method of recovery is to restore from existing backups. |
| Resolving a Double Failure | A double failure scenario is a situation that affects the availability of both the production and the standby databases. An example of a double failure scenario is a site outage at the secondary site, which eliminates fault tolerance, followed by a media failure on the production database. Whether the standby must be re-created depends on the type of outage at the secondary site. If the secondary site outage was temporary and did not involve the physical destruction of files, then after the secondary site is brought back online it can continue to receive redo data from the production database. Otherwise, the resolution of this situation is to re-create the production database from an available backup and then re-create the standby database. Some multiple failures, or more appropriately disasters, such as a primary site outage followed by a secondary site outage, might require the use of backups that exist only in an offsite location. Developing and following a process to deliver and maintain backup tapes at an offsite location is necessary to restore service in this worst-case scenario. |
Recovery Manager (RMAN) is Oracle's utility for backing up and recovering the Oracle database. Because of its tight integration with the database, RMAN automatically determines which files must be backed up. More importantly, RMAN knows which files must be restored for media recovery operations. RMAN uses server sessions to perform backup and recovery operations and stores metadata about backups in a repository. RMAN offers many advantages over typical user-managed backup methods, including:
Online database backups without placing tablespaces in backup mode
Efficient block-level incremental backups
Data block integrity checks during backup and restore operations
Test backups and restores without actually performing the operation
Synchronize a physical standby database with the primary database
RMAN automates backup and recovery. User-managed methods require you to:
Locate backups for each data file
Copy backups to the correct place using operating system commands
Choose which logs to apply
RMAN fully automates these backup and recovery tasks.
There are also capabilities of Oracle backup and recovery that are only available when using RMAN, such as automated tablespace point-in-time recovery and block media recovery.
Oracle Secure Backup delivers unified data protection for heterogeneous environments with a common management interface across the spectrum of servers. Protecting both Oracle databases and unstructured data, Oracle Secure Backup provides centralized tape backup management for your entire IT environment, including:
Oracle database through the Oracle Secure Backup built-in integration with Recovery Manager (RMAN)
File system data protection: For UNIX, Windows, and Linux servers
Network Attached Storage (NAS) data protection leveraging the Network Data Management Protocol (NDMP)
Oracle Secure Backup is integrated with RMAN providing the media management layer (MML) for Oracle database tape backup and restore operations. The tight integration between these two products delivers high-performance Oracle database tape backup.
Specific performance optimizations between RMAN and Oracle Secure Backup that reduce tape consumption and improve backup performance are:
Unused block compression: Eliminates the time and space needed to back up unused blocks.
Backup undo optimization: Eliminates the time and space needed to back up undo that is not required to recover the current backup.
You can manage the Oracle Secure Backup environment using the command line, the Oracle Secure Backup Web tool, and Oracle Enterprise Manager.
Using the combination of RMAN and Oracle Secure Backup provides an end-to-end tape backup solution, eliminating the need for third-party backup software.
Users can take advantage of the Internet-based data storage services offered by Amazon Web Services (AWS) Simple Storage Service (S3) for their database backup needs. The OSB Cloud Module enables RMAN to use S3 as a repository for Oracle Database backups. This provides an easy-to-manage, cost-efficient, and scalable alternative to maintaining in-house data storage and a local, fully configured backup infrastructure. RMAN with the OSB Cloud module is also the recommended means of backing up an Oracle Database that is running on AWS's Elastic Compute Cloud (EC2).
Users must establish an account with AWS to pay for their storage and network transfer costs as appropriate. Additionally, users must consider their security requirements and network resources, and configure their backups appropriately. Since backups to S3 may travel to AWS on the public Internet, and will be stored on a public cloud facility, it may be necessary to encrypt them to protect the data in transit and at rest at S3. This requirement may not apply for certain databases or in certain configurations, for example when the database runs on EC2 or inside AWS's Virtual Private Cloud (VPC).
Users must also configure the degree of network parallelism (using RMAN channels) to match the network capabilities between the database and S3 (which may include a portion of the public Internet). Multiple simultaneous channels may achieve the highest throughput overall, as RMAN takes full advantage of such parallelism.
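For example, a hedged starting point is to configure SBT parallelism in RMAN and then adjust it after measuring throughput to S3; the value 4 is only an assumption:

CONFIGURE DEVICE TYPE SBT PARALLELISM 4;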
Oracle provides restore points and guaranteed restore points:
Restore points protect against logical failures at risky points during database maintenance. Creating a normal restore point assigns a restore point name to a specific point in time or SCN, marking the state of the data as of that time. Normal restore points are available with Flashback Table, Flashback Database, and all RMAN recovery-related operations.
Guaranteed restore points are recommended for database-wide maintenance such as database or application upgrades, or running batch processes. Guaranteed restore points are integrated with Flashback Database and enforce the retention of all flashback logs required for flashing back to the guaranteed restore point. After maintenance activities complete and the results are verified, you should delete guaranteed restore points that are no longer needed to reclaim flashback log space.
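For example, the following statements show the two kinds of restore points; the restore point names are hypothetical:

CREATE RESTORE POINT before_batch_run;

CREATE RESTORE POINT before_app_upgrade GUARANTEE FLASHBACK DATABASE;

DROP RESTORE POINT before_app_upgrade;

Drop a guaranteed restore point as soon as it is no longer needed so that the retained flashback logs can be reclaimed.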
See Also: Oracle Database Backup and Recovery User's Guide for more information about using restore points and guaranteed restore points with Flashback Database
This section describes best practices for determining backup frequency, using the RMAN recovery catalog, and for using Oracle database backup options such as Block Change Tracking.
It is important to determine a backup frequency policy and to perform regular backups. A backup retention policy helps ensure that needed data is not destroyed.
Factors Determining Backup Frequency Frequent backups are essential for any recovery scheme. You should base the frequency and content of backups on the following criteria:
Criticality of the data: The Recovery Point Objective (RPO) determines how much data your business can acceptably lose if a failure occurs. The more critical the data, the lower the RPO and the more frequently data should be backed up. If you are going to back up certain tablespaces more often than others, with the goal of getting better RPO for those tablespaces, then you also must plan for doing TSPITR as part of your recovery strategy. This requires considerably more planning and practice than DBPITR, because you must ensure that the tablespaces you plan to TSPITR are self-contained.
Estimated repair time: The Recovery Time Objective (RTO) determines the acceptable amount of time needed for recovery. Repair time is dictated by restore time plus recovery time. The lower the RTO, the higher the frequency of backups, that is, backups are more current, thereby reducing recovery time.
Volume of changed data: The rate of database change affects how often data is backed up:
For read-only data, perform backups frequently enough to adhere to retention policies.
For frequently changing data, perform backups more often to reduce the RTO.
To simplify database backup and recovery, the Oracle Suggested Backup Strategy uses the fast recovery area (FRA), incremental backups, and incrementally updated backup features. After the initial image copy backup to the FRA, only the changed blocks are captured in subsequent incremental backups and are then applied to the image copy, thereby updating the copy to the most current incremental backup time (that is, incrementally updating the backup).
Establishing a Backup Retention Policy A backup retention policy is a rule set regarding which backups must be retained, on disk or other backup media, to meet recovery and other requirements. It may be safe to delete a specific backup because it has been superseded by more recent backups or because it has been stored on tape. You may also have to retain a specific backup on disk for other reasons such as archival or regulatory requirements. A backup that is no longer needed to satisfy the backup retention policy is said to be obsolete.
Base your backup retention policy on redundancy or on a recovery window:
In a redundancy-based retention policy, specify a number n such that you always keep at least n distinct backups of each file in your database.
In a recovery window-based retention policy, specify an earlier time interval, for example, one week or one month, and keep all backups required to let you perform point-in-time recovery to any point during that window.
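For example, either of the following RMAN configurations implements a retention policy; the values 3 and 7 are illustrative assumptions, not recommendations:

CONFIGURE RETENTION POLICY TO REDUNDANCY 3;

CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;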
Keeping Archival Backups Some businesses must retain some backups for much longer than their day-to-day backup retention policy. RMAN allows for this with the Long-term Archival Backup feature. Rather than becoming obsolete according to the database's backup retention policy, archival backups either never become obsolete or become obsolete when their time limit expires.
You can use the RMAN BACKUP command with the KEEP option to retain backups for longer than your ordinary retention policy. This option marks the backup as an archival backup, which is a self-contained backup that is exempt from the configured retention policy. This allows you to retain certain backups for much longer than usual, for example, to satisfy statutory retention requirements. The KEEP FOREVER option requires a recovery catalog, because the backup records eventually age out of the control file; without a recovery catalog, those records may be lost when you retain backups for much longer than usual. Only the archived redo log files required to make an archival backup consistent are retained. For more information about the RMAN recovery catalog, see Section 8.2.2, "Use an RMAN Recovery Catalog".
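For example, a sketch of creating archival backups follows; the tags, the one-year KEEP duration, and the restore point name are assumptions chosen only for illustration:

BACKUP DATABASE TAG 'FY12_ARCHIVE' KEEP UNTIL TIME 'SYSDATE+365' RESTORE POINT FY12_ARCHIVE;

BACKUP DATABASE TAG 'PERMANENT_ARCHIVE' KEEP FOREVER;

The KEEP FOREVER form requires a connection to a recovery catalog, as noted above.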
See Also: Oracle Database Backup and Recovery User's Guide for information about Archival Backups for Long-Term Storage
To protect and keep backup metadata for longer retention times than can be accommodated by the control file, you can create a recovery catalog. You should create the recovery catalog schema in a dedicated standalone database. Do not locate the recovery catalog with other production data. If you use Oracle Enterprise Manager, you can create the recovery catalog schema in the Oracle Enterprise Manager repository database.
The advantages of using a recovery catalog include:
Storing backup information for a longer retention period than can feasibly be stored in the control file. If the control file is too small to hold additional backup metadata, then existing backup information is overwritten, making it difficult to restore and recover using those backups.
Storing metadata for multiple databases.
Offloading backups to a physical standby database and using those backups to restore and recover the primary database. Similarly, you can back up a tablespace on a primary database and restore and recover it on a physical standby database. Note that backups of logical standby databases are not usable at the primary database.
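As a sketch of the setup, you might create the catalog owner in the dedicated catalog database and then create and populate the catalog from RMAN; the rco user, rcat_ts tablespace, catdb service name, and password placeholder are assumptions for this illustration:

In the recovery catalog database (SQL):

CREATE USER rco IDENTIFIED BY <password> DEFAULT TABLESPACE rcat_ts QUOTA UNLIMITED ON rcat_ts;
GRANT RECOVERY_CATALOG_OWNER TO rco;

From the target database host (RMAN):

rman TARGET / CATALOG rco@catdb
CREATE CATALOG;
REGISTER DATABASE;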
See Also: Oracle Database Backup and Recovery User's Guide for more information about RMAN repository and the recovery catalog
When creating backups to disk or tape, use the target database control file as the RMAN repository so that the success of the backup does not depend on the availability of the database connection to the recovery catalog. To use the target database control file as the RMAN repository, run RMAN with the NOCATALOG option. Immediately after the backup is complete, the new backup information stored in the target database control file should be synchronized to the recovery catalog using the RESYNC CATALOG command.
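For example, a sketch of this approach; the rco@catdb catalog connection is the same assumption used above:

rman TARGET / NOCATALOG
BACKUP DATABASE PLUS ARCHIVELOG;

Then, after the backup completes, connect to the catalog and synchronize it:

rman TARGET / CATALOG rco@catdb
RESYNC CATALOG;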
See Also: Oracle Database Backup and Recovery Reference for more information about the RESYNC CATALOG command
Oracle Database includes the BLOCK CHANGE TRACKING feature for incremental backups, which improves incremental backup performance by keeping track of which database blocks have changed since the previous backup. If BLOCK CHANGE TRACKING is enabled, then RMAN uses the block change tracking file to identify which blocks to include in an incremental backup. This avoids the need to scan every block in the data file, reducing the number of disk reads during backup.
Starting with Oracle Database 11g, you can enable BLOCK CHANGE TRACKING on both the primary and physical standby databases. You should enable change tracking for any database where incremental backups are being performed. For example, if backups have been completely offloaded to a physical standby database, then Block Change Tracking should be enabled for that database (this requires Active Data Guard). If backups are being performed on both the primary and physical standby databases, then enable Block Change Tracking for both databases.
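For example, a sketch of enabling and checking block change tracking; the +DATA disk group used for the tracking file is an assumption:

ALTER DATABASE ENABLE BLOCK CHANGE TRACKING USING FILE '+DATA';

SELECT STATUS, FILENAME FROM V$BLOCK_CHANGE_TRACKING;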
You should configure RMAN to automatically back up the control file and the server parameter file (SPFILE) whenever the database structure metadata in the control file changes or when a backup record is added.
The control file autobackup option enables RMAN to recover the database even if the current control file, catalog, and SPFILE are lost. Enable the RMAN autobackup feature with the CONFIGURE CONTROLFILE AUTOBACKUP ON statement.
You should enable autobackup for both the primary and standby databases. For example, after connecting to the primary database, as the target database, and the recovery catalog, issue the following command:
CONFIGURE CONTROLFILE AUTOBACKUP ON;
In an Oracle Data Guard configuration you can offload the process of backing up control files, data files, and archived redo log files to a physical standby database system, thereby minimizing the effect of performing backups on the primary system. You can use these backups to recover the primary or standby database.
Note: Backups of logical standby databases are not usable on the primary database.
See Also: Oracle Data Guard Concepts and Administration for information about using RMAN to back up and restore files
To ensure a database is enabled to use Flashback Query, Flashback Versions Query, and Flashback Transaction Query, implement the following:
Set the UNDO_MANAGEMENT initialization parameter to AUTO. This ensures the database is using an undo tablespace.
Set the UNDO_RETENTION initialization parameter to a value that keeps undo long enough for your longest query back in time to succeed, or to recover from human errors.
Set the RETENTION GUARANTEE clause for the undo tablespace to guarantee that unexpired undo is not overwritten.
Flashback Table also relies on undo data to recover tables. Enabling Automatic Undo Management is recommended, and the UNDO_RETENTION parameter must be set to cover the period for which Flashback Table may be needed. If a given table does not contain the required data after a Flashback Table operation, it can be flashed back further, flashed forward, or returned to its original state, provided there is sufficient undo data.
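For example, a sketch of these settings; the one-hour retention value and the undotbs1 tablespace name are assumptions, and a change to UNDO_MANAGEMENT takes effect only after an instance restart:

ALTER SYSTEM SET UNDO_MANAGEMENT=AUTO SCOPE=SPFILE;

ALTER SYSTEM SET UNDO_RETENTION=3600 SCOPE=BOTH;

ALTER TABLESPACE undotbs1 RETENTION GUARANTEE;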
Review the following priorities to determine your disk backup strategy:
Overall backup time
Impact to resource consumption
Space used by the backup
Recovery time
Table 8-3 compares different backup alternatives against the different priorities you might have. Using Table 8-3 as a guide, you can choose the best backup approach for your specific business requirements. You might want to minimize backup space while sacrificing recovery time. Alternatively, you might choose to place a higher priority on recovery and backup times while space is not an issue.
Table 8-3 Comparing Backup to Disk Options
Backup to Disk: Best Practices for Optimizing Recovery Times If restore time is your primary concern then perform either a database copy or an incremental backup with immediate apply of the incremental to the copy. These are the only options that provide an immediate usable backup of the database, which you then must recover only to the time of the failure using archived redo log files created since the last incremental backup was performed.
Backup to Disk: Best Practices for Minimizing Space Usage If space usage is your primary concern then perform an incremental backup with a deferred apply of the incremental to the copy. If you perform a cumulative level 1 incremental backup, then it stores only those blocks that have been changed since the last level 0 backup:
With a cumulative incremental backup apply only the last level 1 backup to the level 0 backup.
With a differential incremental backup apply all level 1 backups to the level 0 backup.
A cumulative incremental backup usually consumes more space in the fast recovery area than a differential incremental backup.
Backup to Disk: Best Practices for Minimizing System Resource Consumption (I/O and CPU) If system resource consumption is your primary concern, then an incremental backup with Block Change Tracking enabled consumes the least amount of resources on the database.
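For example, a sketch of the incrementally updated backup approach described above, run daily against the fast recovery area; the tag name is an assumption:

RUN {
  RECOVER COPY OF DATABASE WITH TAG 'incr_update';
  BACKUP INCREMENTAL LEVEL 1 FOR RECOVER OF COPY WITH TAG 'incr_update' DATABASE;
}

On the first run this creates a level 0 image copy; on later runs it creates a level 1 incremental backup and applies any previously created incremental to the copy.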
Example
For many applications, only a small percentage of the entire database is changed each day even if the transaction rate is very high. In many cases, applications repeatedly modify the same set of blocks; so, the total unique, changed block set is small.
For example, a database contains about 600 GB of user data, not including temp files and redo logs. Every 24 hours, approximately 2.5% of the database is changed, which is approximately 15 GB of data. In this example, MAA testing recorded the following results:
Level 0 backup takes 180 minutes, including READs from the data area and WRITEs to the fast recovery area
Level 1 backup takes 20 minutes, including READs from the data area and WRITEs to the fast recovery area
Rolling forward and merging an existing image copy in the fast recovery area with a newly created incremental backup takes only 45 minutes, including READs and WRITEs from the fast recovery area.
In this example, the level 0 backup (image copy) takes 180 minutes. This is the same amount of time it takes to perform a full backup set.
Subsequent backups are level 1 (incremental), which take 20 minutes, so the potential impact on the data area is reduced. That backup is then applied to the existing level 0 backup, which takes 45 minutes. This process does not perform I/O to the data area, so there is no impact (assuming the fast recovery area and data area use separate storage). The total time to create the incremental backup and apply it to the existing level 0 backup is 65 minutes (20+45).
The result is the same whether you use incrementally updated backups or full backup sets: a full backup of the database is created. The incremental approach takes 115 minutes less (64% less) than simply creating a full backup set. In addition, the I/O impact is lower, particularly against the data area, which should have less detrimental effect on production database performance.
Thus, for this example when you compare always taking full backups versus starting with a level 0 backup, performing only incremental backups, and then rolling forward the level 0 backup, the net savings are:
115 minutes or 64% time savings to create a complete backup
Reduced I/O on the database during backups
See Also: Oracle Database Backup and Recovery User's Guide for more information about backing up the database
Recovery Manager (RMAN) provides automated disk backup for the Oracle database and is integrated with media management products such as Oracle Secure Backup for backup to tape. Whether your Oracle database backup strategy uses disk, tape, or both, the combination of RMAN and Oracle Secure Backup delivers a comprehensive solution to meet your specific requirements.
When installing Oracle Secure Backup, the System Backup Tape (SBT) libraries for RMAN tape backups are automatically linked. Using Oracle Enterprise Manager Database Control you can manage the Oracle Secure Backup backup domain from tape vaulting to backup and restore operations. The tight integration between RMAN, Oracle Enterprise Manager Database Control, and Oracle Secure Backup makes initial configuration a simple process.
Perform the following four steps to complete the initial configuration and prepare to back up a database to tape:
Define your Oracle Secure Backup Administrative Server in Oracle Enterprise Manager Database Control, enabling the Oracle Secure Backup domain to be managed through Oracle Enterprise Manager.
Preauthorize an Oracle Secure Backup user for use with RMAN, allowing RMAN backup and restore operations to be performed without having to explicitly log in to Oracle Secure Backup.
Set up media policies in Oracle Secure Backup to be used for RMAN backups.
Establish RMAN backup settings such as parallelism and compression.
Note: If you use Oracle Secure Backup or tape-side compression, do not also use RMAN compression.
See Also: Oracle Secure Backup Administrator's Guide for more information about using Recovery Manager with Oracle Secure Backup
Once backup data stored on tape is no longer needed, its lifecycle is complete and the tape media can be reused. Management requirements during a tape's lifecycle (retention period) may include duplication and vaulting across multiple storage locations. Oracle Secure Backup provides effective media lifecycle management through user-defined media policies, including:
Retention
Tape duplication
Vaulting: rotation of tapes between multiple locations
Media lifecycle management may be as simple as defining appropriate retention settings, or more complex, including tape duplication where the original and duplicate(s) have different retention periods and vaulting requirements. Oracle Secure Backup media families, often referred to as tape pools, provide the foundation for media lifecycle management.
The best practice recommendation is to leverage content-managed media families, which use the RMAN retention parameters defined for the database to determine when a tape may be reused (effectively, an expired tape). A specific expiration date is not associated with content-managed tapes, as is done with time-managed tapes. The expiration or recycling of these tapes is based on the attribute associated with the backup images on the tape. All backup images written to content-managed tapes automatically have an associated "content manages reuse" attribute. Because the recycling of content-managed tapes adheres to user-defined RMAN retention settings, RMAN instructs Oracle Secure Backup when to change the backup image attribute to "deleted".
The RMAN DELETE OBSOLETE command communicates which backup pieces (images) are no longer required to meet the user-defined RMAN retention periods. Once Oracle Secure Backup receives this communication, the backup image attribute is changed to "deleted". The actual backup image is not deleted, but the attribute is updated within the Oracle Secure Backup catalog. Once all backup images on a tape have a deleted attribute, Oracle Secure Backup considers the tape eligible for reuse, similar to an expired time-managed tape.
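For example, a sketch of the maintenance commands typically run once a retention policy is configured:

REPORT OBSOLETE;

DELETE OBSOLETE;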
Oracle Secure Backup provides policy-based media management for RMAN backup operations through user-defined Database Backup Storage Selectors. One Database Backup Storage Selector (SSEL) may apply to multiple databases, or multiple SSELs may be associated with a single database. For example, you would create two SSELs for a database when using RMAN duplexing and each copy should be written to a different media family. The SSEL contains the following information:
Database name / ID or applicable to all databases
Hostname or applicable to all hosts
Content: archive logs, full, incremental, autobackup or applicable to all
RMAN copy number (applicable when RMAN duplexing is configured)
Media family name
Name(s) of devices to which operations are restricted (if no device restrictions are configured, Oracle Secure Backup uses any available device)
Wait time (duration) for available tape resources
Encryption setting
Oracle Secure Backup automatically uses the storage selections defined within a SSEL without further user intervention. To override the storage selections for one time backup operations or other exceptions, define alternate media management parameters in the RMAN backup script. For more information, see:
http://www.oracle.com/technetwork/database/secure-backup/documentation/index.html
You can easily back up the fast recovery area (FRA) to tape with the RMAN command BACKUP RECOVERY AREA. Using this disk-to-tape backup method instead of performing a separate backup of the production database to tape provides a few distinct advantages:
Saves tape by creating an optimized backup of the FRA, thereby eliminating unnecessary backup of files already protected on tape
Enables RMAN to restore from disk first and then from tape as necessary; otherwise, RMAN would restore from the most recent backup regardless of media type
Reduces I/O on the production database, because the FRA uses a separate disk group
Upon restoration, RMAN automatically selects the most appropriate backup to restore, from disk or tape. If the required backup is on tape, RMAN restores or recovers the database directly from tape media through integration with Oracle Secure Backup. Because RMAN has intimate knowledge of which files are necessary for recovery, restoration from disk or tape is an automated process.
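For example, a sketch of backing up the FRA to tape through an SBT channel; the channel name is an assumption, and an SBT channel configured for Oracle Secure Backup is required:

RUN {
  ALLOCATE CHANNEL t1 DEVICE TYPE SBT;
  BACKUP RECOVERY AREA;
}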
While it is possible to back up the FRA or other RMAN disk backups to tape outside of RMAN, by performing a file system backup of the disk area using the media management software, it is not recommended. If RMAN is not aware of the tape backup, then restoration is an error-prone, manual process:
The DBA must determine which files are needed for the restoration.
The media manager administrator then restores the designated files from tape backups to a disk location.
Once the files are on disk, the DBA initiates an RMAN restore or recovery from the disk location.
The combination of RMAN and Oracle Secure Backup provides an integrated Oracle database tape backup solution.
Backup tapes are highly portable and often stored at offsite locations for disaster recovery purposes. Tapes are created within a tape device but are frequently removed from that device and stored at an on-site or offsite location. You can effectively manage tape movement between multiple locations using Oracle Secure Backup rotation policies.
For Oracle database restoration, a restore request is submitted from RMAN to Oracle Secure Backup. If the tapes are within the library, the restore begins immediately, assuming device availability. However, if the tapes needed for the restore might be offsite, you may want to confirm the location of the tapes before you issue the restore command. With RMAN and Oracle Secure Backup you can easily do so by issuing the following RMAN commands:
The RESTORE DATABASE PREVIEW command provides a list of the tapes needed for restoration, including any that are offsite.
The RESTORE DATABASE PREVIEW RECALL command initiates a recall operation through Oracle Secure Backup to return the tapes from offsite to the tape device for restoration. Once the tapes are on-site, you can begin the RMAN restore operation.
This section outlines procedures to regularly check for corruption of data files using the Data Recovery Advisor, for testing recovery procedures, and for backing up the recovery catalog database.
Data Recovery Advisor is an Oracle Database tool that automatically diagnoses data failures, determines and presents appropriate repair options, and executes repairs at the user's request. In this context, a data failure is a corruption or loss of persistent data on disk. By providing a centralized tool for automated data repair, Data Recovery Advisor improves the manageability and reliability of an Oracle database and thus helps reduce the mean time to recover (MTTR).
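For example, a sketch of the Data Recovery Advisor workflow from the RMAN command line:

LIST FAILURE;

ADVISE FAILURE;

REPAIR FAILURE PREVIEW;

REPAIR FAILURE;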
See Also: Oracle Database Backup and Recovery User's Guide for information about using Data Recovery Advisor
Use the RMAN VALIDATE command to regularly check database files for block corruption that has not yet been reported by a user session or by normal backup operations. RMAN scans the specified files and checks for physical and logical errors, but does not actually perform the backup or recovery operation. Oracle Database records the address of the corrupt block and the type of corruption in the control file. Access these records through the V$DATABASE_BLOCK_CORRUPTION view, which can be used by RMAN block media recovery.
To detect all types of corruption that are possible to detect, specify the CHECK LOGICAL option.
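For example, a sketch of a periodic corruption check and the follow-up query:

VALIDATE CHECK LOGICAL DATABASE;

SELECT * FROM V$DATABASE_BLOCK_CORRUPTION;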
See Also: Oracle Database Backup and Recovery User's Guide for information about Validating Database Files and Backups
Complete, successful, and tested backups are fundamental to the success of any recovery. Create test plans for different outage types. Start with the most common outage types and progress to the least probable. Using the RMAN DUPLICATE command is a good way to perform recovery testing, because it requires restoring from backups and performing media recovery.
Monitor the backup procedure for errors, and validate backups by testing your recovery procedures periodically. Also, validate the ability to restore the database using the RMAN RESTORE ... VALIDATE command.
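For example, a sketch of validating that current backups are restorable, without actually restoring anything:

RESTORE DATABASE VALIDATE;

RESTORE ARCHIVELOG ALL VALIDATE;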
Include the recovery catalog database in your backup and recovery strategy. If you do not back up the recovery catalog and a disk failure occurs that destroys the recovery catalog database, then you may lose the metadata in the catalog. Without the recovery catalog contents, recovery of your other databases is likely to be more difficult.
The Oracle Secure Backup catalog maintains backup metadata, scheduling and configuration details for the backup domain. Just as it's important to protect the RMAN catalog or control file, the Oracle Secure Backup catalog should be backed up on a regular basis.
Oracle Secure Backup provides tape backup for non-Oracle files. Oracle Secure Backup does not perform the proactive checking that the database does. Use the features available to back up files outside the database. For more information, see Section 8.6, "Backup Files Outside the Database".
Oracle IT environments include both database and application files that must be protected to meet short-term and long-term retention requirements. Differences exist between backing up database files and unstructured files. In addition, managing backup and recovery often crosses organizational areas, such as database administration for the database and system administration for file system data. The Oracle data protection suite offers a cohesive solution that meets your complete needs for both Oracle database and non-database storage.
An Oracle ACFS snapshot is an online, read-only, point-in-time copy of an Oracle ACFS file system. The snapshot copy is space-efficient and uses copy-on-write functionality. Before an Oracle ACFS file extent is modified or deleted, its current value is copied to the snapshot to maintain the point-in-time view of the file system.
An Oracle ACFS snapshot can support the online recovery of files inadvertently modified or deleted from a file system. With up to 63 snapshot views supported for each file system, flexible online file recovery solutions spanning multiple views can be employed. An Oracle ACFS snapshot can also be used as the source of a file system backup, because it can be created on demand to deliver a current, consistent, online view of an active file system.
Oracle's Sun ZFS Storage Appliance provides an integrated, high-performance backup solution and is also a cost-effective platform for disaster recovery of non-database files. Ever-growing amounts of data present system and database administrators with many challenges, the thorniest of which are associated with the complex process of backup and recovery. Without reliable data protection and processes, mission-critical data is at risk.
Oracle's Sun ZFS Storage Appliance is an easy-to-deploy Unified Storage System that ensures that backup window and recovery time objectives (RTO) are met by providing timely recovery in the event of a disaster.
Oracle's Sun ZFS Storage Appliance supports unlimited snapshot capability. A snapshot, similar to an Oracle ACFS snapshot, is a read-only, point-in-time copy of a file system (for information about Oracle ACFS, see Section 8.6.1, "ACFS Snapshots"). It is created instantaneously and no space is allocated initially; blocks are allocated as changes are made to the base file system (copy-on-write). Snapshots are either initiated manually or automated by scheduling at specific intervals. The snapshot data can be accessed directly for any backup purpose. Reads of snapshot blocks are served by the base file system's blocks; when changes are made to the base file system, the older block is referenced by the snapshot and the new, changed block is referenced by the file system.
Oracle's Sun ZFS Storage Appliance is also recommended for development and test systems that are created from snapshots of the standby database in a Data Guard environment.
Snapshot rollback is the process of returning the base file system to the point in time when the snapshot was taken. The rollback process discards all changes made to the base file system between the time of the snapshot and the time of the rollback. This removes the need for a data restore process.
Oracle Secure Backup provides centralized tape backup management for heterogeneous file system data and the Oracle database. Oracle Secure Backup offers multiple backup levels, with full, cumulative, and differential incrementals. For more information, see Section 8.4, "Backup to Tape Best Practices". In addition, a full offsite backup level may be scheduled without interfering with the regular full/incremental schedule. File system backups can be performed at the file, directory, file system, or raw partition level, meeting even the most stringent requirements within user-defined backup windows.
For file system backup operations, you define Oracle Secure Backup "datasets", which describe what to back up. A dataset is a textual description, written in a lightweight language, that communicates how to build and organize the set of files to be protected. Because it is Oracle Database aware, Oracle Secure Backup can skip database files during file system backups when the "exclude oracle database files" directive is used within the dataset.
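For example, a minimal sketch of a dataset description; the host name and path are assumptions, and the exact dataset statements should be confirmed against the Oracle Secure Backup documentation:

include host apphost1 {
    include path /u01/app/reports
    exclude oracle database files
}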
See Also: The Oracle Secure Backup Administrator's Guide for more information about file system backup operations
High Availability Best Practices
11g Release 2 (11.2)
E10803-02
May 2012
Oracle Database High Availability Best Practices 11g Release 2 (11.2)
E10803-02
Copyright © 2005, 2012, Oracle and/or its affiliates. All rights reserved.
Primary Authors: Lawrence To, Viv Schupmann, Thomas Van Raalte, Virginia Beecher
Contributing Author: Janet Stern
Contributors: Andrew Babb, Janet Blowney, Larry Carpenter, Timothy Chien, Jay Davison, Senad Dizdar, Ray Dutcher, Mahesh Girkar, Stephan Haisley, Wei Ming Hu, Holger Kalinowski, Nitin Karkhanis, Frank Kobylanski, Rene Kundersma, Joydip Kundu, Barb Lundhild, Roderick Manalac, Pat McElroy, Robert McGuirk, Joe Meeks, Markus Michalewicz, Valarie Moore, Michael Nowak, David Parker, Darryl Presley, Hector Pujol, Michael T. Smith, Vinay Srihari, Mark Townsend, Douglas Utzig, Thomas Van Raalte, James Viscusi, Vern Wagman, Steve Wertheimer, Shari Yamaguchi
This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited.
The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing.
If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, the following notice is applicable:
U.S. GOVERNMENT RIGHTS Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are "commercial computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, the use, duplication, disclosure, modification, and adaptation shall be subject to the restrictions and license terms set forth in the applicable Government contract, and, to the extent applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software License (December 2007). Oracle USA, Inc., 500 Oracle Parkway, Redwood City, CA 94065.
This software or hardware is developed for general use in a variety of information management applications. It is not developed or intended for use in any inherently dangerous applications, including applications that may create a risk of personal injury. If you use this software or hardware in dangerous applications, then you shall be responsible to take all appropriate fail-safe, backup, redundancy, and other measures to ensure its safe use. Oracle Corporation and its affiliates disclaim any liability for any damages caused by use of this software or hardware in dangerous applications.
Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.
Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. AMD, Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open Group.
This software or hardware and documentation may provide access to or information on content, products, and services from third parties. Oracle Corporation and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party content, products, and services. Oracle Corporation and its affiliates will not be responsible for any loss, costs, or damages incurred due to your access to or use of third-party content, products, or services.
Use operational best practices to provide a successful MAA implementation.
This chapter contains the following topics:
Provide a Plan to Test and Apply Recommended Patches and Software
Configure Monitoring and Service Request Infrastructure for High Availability
Understand and document your high availability and performance service-level agreements (SLAs) and create an outage and solution matrix:
Document the business's cost of downtime, Recovery Time Objectives (RTO or recovery time) and Recovery Point Objectives (RPO or data loss tolerance) for the outages described in Oracle Database High Availability Overview.
Build an outage and solution matrix similar to those shown in Table 13-1, "Recovery Times and Steps for Unscheduled Outages on the Primary Site" and Table 14-1, "Solutions for Scheduled Outages on the Primary Site".
Implement a high availability environment to achieve the optimal high availability architecture:
Install or update your software with the latest certified patch sets
Configure your software using best practices
Document your choices and configuration
Validate and automate repair operations to ensure that you meet your target HA service-level agreements (SLAs). You should validate the backup, restore, and recovery operations and periodically evaluate all repair operations for various types of possible outages (see Table 13-1 for more information).
If you use Oracle Data Guard for disaster recovery and data protection, Oracle recommends that you periodically perform switchover operations, or conduct full application and database failover tests, to validate all role transition procedures end to end. For more information see Section 2.8, "Execute Data Guard Role Transitions".
Corporate data can be at grave risk if placed on a system or database that does not have proper security measures in place. A well-defined security policy can help protect your systems from unwanted access and protect sensitive corporate information from sabotage. Proper data protection reduces the chance of outages due to security breaches. For more information, see the Oracle Database Security Guide.
Institute procedures that manage and control changes as a way to maintain the stability of the system and to ensure that no changes are incorporated in the primary database unless they have been rigorously evaluated on your test systems.
Review the changes and get feedback and approval from your change management team, which should include representatives for any component that affects the business requirements, functionality, performance, and availability of your system. For example, the team can include representatives for end-users, applications, databases, networks, and systems.
By periodically testing and applying the latest recommended patches and software versions, you ensure that your system has the latest security and software fixes required to maintain stability and avoid many known issues. Remember to validate all updates and changes on a test system before performing the upgrade on the production system. For more information, see "Oracle Recommended Patches -- Oracle Database" in My Oracle Support Note 756671.1 at
https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=756671.1
Proper testing and patching are important prerequisites for preventing instability. You must validate any change in your test systems thoroughly before applying it to your production environment. These practices involve the following:
The test system should be a close replica of the production and standby environments and workload so that you can execute functional, performance, and availability tests (for more information, see Table 13-1). Validate any change in the test environment first, including evaluating the effect of the change and the fallback procedures, before introducing the change in the production environment.
With a properly configured test system, many problems can be avoided because changes are validated with an equivalent production and standby database configuration containing a full data set and using a workload framework to mimic production (for example, using Oracle Real Application Testing).
Do not try to reduce costs by eliminating the test system because that decision ultimately affects the stability and the availability of your production applications. Using only a subset of system resources for testing and QA has the tradeoffs shown in Table 2-1.
Table 2-1 Tradeoffs for Different Test and QA Environments
Test Environment | Benefits and Tradeoffs |
---|---|
Full Replica of Production and Standby Systems |
Validate all patches and software changes. Validate all functional tests. Full performance validation at production scale. Full HA validation. |
Full Replica of Production Systems |
Validate all patches and software changes. Validate all functional tests. Full performance validation at production scale. Full HA validation minus the standby system. No functional, performance, HA and disaster recovery validation with standby database. |
Standby System |
Validate most patches and software changes. Validate all functional tests. Full performance validation if using Data Guard Snapshot Standby, but this can extend recovery time if a failover is required. Role transition validation. Resource management and scheduling is required if standby and test databases exist on the same system. |
Shared System Resource |
Validate most patches and software changes. Validate all functional tests. This environment may be suitable for performance testing if enough system resources can be allocated to mimic production. Typically, however, the environment includes a subset of production system resources, compromising performance testing/validation. Resource management and scheduling is required. |
Smaller or Subset of the system resources |
Validate all patches and software changes. Validate all functional tests. No performance testing at production scale. Limited full-scale high availability evaluations. |
Different hardware or platform system resources but same operating system |
Validate most patches and software changes. Limited firmware patching test. Validate all functional tests unless limited by new hardware features. Limited production scale performance tests. Limited full-scale high availability evaluations. |
Pre-production validation and testing of software patches or any change is an important way to maintain stability. The high-level pre-production validation steps are:
Review the patch or upgrade documentation and any other document relevant to the change. If your SLAs require zero or minimal downtime, evaluate any rolling upgrade opportunities to minimize or eliminate planned downtime. Also evaluate whether the patch qualifies for Standby-First Patching.
Note: Standby-First Patching enables you to apply a patch initially to a physical standby database while the primary database remains at the previous software release (this applies to certain types of patches and does not apply to Oracle patch sets and major release upgrades; use the Data Guard transient logical standby method for patch sets and major releases). Once you are satisfied with the change, perform a switchover to the standby database; the fallback is to switch back if required. Alternatively, you can proceed to the following step and apply the change to your production environment. For more information, see "Oracle Patch Assurance - Data Guard Standby-First Patch Apply" in My Oracle Support Note 1265700.1 at
https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1265700.1 |
Validate the application in a test environment and ensure the change meets or exceeds your functionality, performance, and availability requirements. Automate the procedure, and document and test a fallback procedure. This requires comparing metrics captured before and after patch application on the test system against metrics captured on the production system. Real Application Testing may be used to capture the workload on the production system and replay it on the test system. AWR and SQL Performance Analyzer may be used to assess performance improvement or regression resulting from the patch (a sample AWR comparison query appears after these steps).
Validate the new software on a test system that mimics your production environment, and ensure the change meets or exceeds your functionality, performance, and availability requirements. Automate the patch or upgrade procedure and ensure fallback. Being thorough during this step eliminates most critical issues during and after the patch or upgrade.
See Section 2.7.1, "Configuring the Test System and QA Environments" for more information about configuring your test system.
Optionally, use the Oracle Real Application Testing option, which enables you to perform real-world testing of Oracle Database. Oracle Real Application Testing captures production workloads and assesses the impact of system changes before production deployment, minimizing the risk of instabilities associated with changes. Oracle GoldenGate can also be used to maintain an additional logical replica on which changes can be applied and validated.
If applicable, perform final pre-production validation of all changes on a Data Guard standby database before applying them to production. For more information about the Data Guard transient logical standby method, see Section 14.2.6, "Database Upgrades".
Apply the change in your production environment.
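As referenced in the validation steps above, you can compare AWR data from the pre-change and post-change test runs directly in SQL. The following is a minimal sketch using the DBMS_WORKLOAD_REPOSITORY package; the database ID, instance number, and snapshot IDs are placeholders that you would look up in V$DATABASE and DBA_HIST_SNAPSHOT for your own runs.

SELECT *
  FROM TABLE(DBMS_WORKLOAD_REPOSITORY.AWR_DIFF_REPORT_TEXT(
         1234567890, 1, 100, 110,    -- placeholder DBID, instance, begin/end snapshots of the pre-change run
         1234567890, 1, 200, 210));  -- placeholder DBID, instance, begin/end snapshots of the post-change run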
See Also:
|
When you have one or more standby databases, it is important to ensure that the operations and DBA teams are well prepared to use a standby database at any time when the primary database is down or underperforming according to a predetermined threshold. By reacting and executing efficiently, which includes detecting the problem and making the decision to fail over, you can reduce overall downtime from hours to minutes.
If you use Oracle Data Guard for disaster recovery and data protection, Oracle recommends that you perform switchover operations every quarter or conduct full application and database failover tests. For more information about configuring Oracle Data Guard and role transition best practices, see Chapter 9, "Configuring Oracle Data Guard" and Section 9.4.1, "Oracle Data Guard Switchovers Best Practices."
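Before each rehearsal, it can also help to confirm the role and switchover readiness of each database directly from SQL. A minimal check, run on the primary and then on the standby, is shown below; consult the Data Guard documentation for the full list of SWITCHOVER_STATUS values (for example, TO STANDBY or SESSIONS ACTIVE).

-- Run on each database in the Data Guard configuration
SELECT database_role, switchover_status FROM v$database;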
See: My Oracle Support provides notes for Data Guard switchovers:
|
Establish escalation management procedures so repair is not hindered. Most repair solutions, when conducted properly, are automatic and transparent with the MAA solution. The challenges occur when the primary database or system is not meeting availability or performance SLAs and failover procedures are not automatic, as is the case with some Data Guard failover scenarios. Downtime can be prolonged if proper escalation policies are not followed and decisions are not made quickly.
Availability is the top priority, and a contingency plan should be created to gather sufficient data for future Root Cause Analysis (RCA).
For more information about MAA outage and repair, check the MAA web page on the Oracle Technology Network (OTN) at
To maintain your High Availability environment, you should configure the monitoring infrastructure that can detect and react to performance and high availability related thresholds. Also, where available, Oracle can detect failures, dispatch field engineers, and replace failing hardware without customer involvement.
You should configure and use Enterprise Manager and the monitoring infrastructure that detects and reacts to performance and high availability related thresholds to avoid potential downtime. The monitoring infrastructure assists you with monitoring for High Availability and enables you to do the following:
Monitor system, network, application, database and storage statistics
Monitor performance and service statistics
Create performance and high availability thresholds as early warning indicators of system or application problems
See Also:
|
In addition to the monitoring infrastructure provided by Enterprise Manager in the Oracle HA environment, where available, Oracle can detect failures, dispatch field engineers, and replace failing hardware without customer involvement. For example, Oracle Auto Service Request (ASR) is a secure, scalable, customer-installable software solution that resolves problems faster by using auto-case generation for Oracle's Sun server and storage systems when specific hardware faults occur.
See Also: See "Oracle Auto Service Request" in My Oracle Support Note 1185493.1 at
https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1185493.1 |
Using Oracle Enterprise Manager 11g is the MAA best practice recommendation for configuring your entire high availability environment. Oracle Enterprise Manager is Oracle's single, integrated solution for managing all aspects of the Oracle Grid and the applications running on it. Oracle Enterprise Manager Grid Control couples top-down monitoring for applications with automated configuration management, provisioning, and administration. This powerful combination provides unequaled management for any size Oracle data center.
Using Oracle Enterprise Manager you can perform most configuration tasks. For example, you can:
Migrate to Oracle Automatic Storage Management (Oracle ASM)
Migrate a single-instance Oracle Database to Oracle Clusterware and Oracle Real Application Clusters (Oracle RAC)
Create Oracle Data Guard standby databases
Configure backup and recovery
Implement Oracle Active Data Guard
Use the MAA Advisor to implement Oracle's best practices and achieve a high availability architecture
For information about the configuration Best Practices for Oracle Database, see the following chapters:
See Also: Oracle Enterprise Manager online help system, and the documentation set available at
|
By implementing and using Oracle Maximum Availability Architecture (MAA) best practices, you can provide high availability for the Oracle database and related technology.
This chapter contains the following topics:
Designing and implementing a high availability architecture can be a daunting task given the broad range of Oracle technologies and hardware, software, and deployment options. A successful effort begins with clearly defined and thoroughly understood business requirements. Thorough analysis of the business requirements enables you to make intelligent design decisions and develop an architecture that addresses your business needs in the most cost effective manner. The architecture you choose must achieve the required levels of availability, performance, scalability, and security. Moreover, the architecture you choose should have a clearly defined plan for deployment and ongoing management that minimizes complexity and business risk.
Once your business requirements are understood, you should begin designing your high availability architecture by reading the Oracle Database High Availability Overview to get a high-level view of the various Oracle solutions that comprise the Oracle Maximum Availability Architecture (MAA). This should result in a design for an architecture that can be fully examined and validated using the best practices documented in this book.
Oracle High Availability (HA) best practices help you deploy a highly available architecture throughout your enterprise. Having a set of configuration and operational best practices helps you achieve high availability and reduces the cost associated with the implementation and ongoing maintenance of your enterprise. Also, employing best practices can optimize usage of system resources.
By implementing the HA best practices you can:
Reduce the cost of creating an Oracle Database high availability system by following detailed guidelines on configuring your database, storage, application failover, backup and recovery. See Chapter 3, "Overview of Configuration Best Practices" for more information.
Use operational best practices to maintain your system. See Chapter 2, "Operational Prerequisites to Maximizing Availability" for more information.
Detect and quickly recover from unscheduled outages caused by computer failure, storage failure, human error, or data corruption. For more information, see Section 5.1.6, "Protect Against Data Corruption" and Chapter 13, "Recovering from Unscheduled Outages".
Eliminate or reduce downtime due to scheduled maintenance such as database patches or application upgrades as described in Chapter 14, "Reducing Downtime for Planned Maintenance".
Oracle Maximum Availability Architecture (MAA) is Oracle's best practices blueprint based on Oracle High Availability (HA) technologies, extensive validation performed by the Oracle MAA development team, and the accumulated production experience of customers who have successfully deployed business critical applications on Oracle.
MAA covers Oracle products within the following technologies:
Oracle Database as described in this book
Oracle Exadata Database Machine and Oracle Exalogic Elastic Cloud
Oracle Fusion Middleware and Oracle WebLogic Server
Oracle Applications (Siebel, Peoplesoft, E-Business Suite)
Oracle Collaboration Suite
Oracle Enterprise Manager
This book, Oracle Database High Availability Best Practices primarily focuses on high availability best practices for the Oracle Database. There are also other components for which you might want to consider Oracle Maximum Availability Architecture (MAA) best practices. For more information go to:
http://www.oracle.com/goto/MAA
See:
|
The goal of MAA is to achieve the optimal HA architecture at the lowest cost and complexity. MAA:
Provides best practices that span the Exadata Database Machine, Oracle Database, Oracle Fusion Middleware, Oracle Applications, Oracle Enterprise Manager, and solutions provided by Oracle Partners.
Accommodates a range of business requirements to make these best practices as widely applicable as possible.
Leverages lower-cost servers and storage.
Uses hardware and operating system independent features and evolves with new Oracle versions and features. The only exception is Exadata MAA, which has specific and customized configuration and operating practices for Exadata Database Machine.
Makes high availability best practices as widely applicable as possible considering the various business service level agreements (SLA).
Uses the Oracle Grid Infrastructure with Database Server Grid and Database Storage Grid to provide highly resilient, scalable, and lower cost infrastructures.
Provides the ability to control the length of time to recover from an outage and the amount of acceptable data loss from any outage.
For more information about MAA and documentation about best practices for all MAA components, visit the MAA website at
This chapter provides best practices for using Enterprise Manager to monitor and maintain a highly available environment across all tiers of the application stack.
This chapter contains the following topics:
Continuous monitoring of the system, network, database operations, application, and other system components ensures early detection of problems. Early detection improves the user's experience because problems can be avoided or resolved faster. In addition, monitoring captures system metrics that indicate trends in system performance, growth, and recurring problems. This information can facilitate prevention, enforce security policies, and manage job processing. For the database server, a sound monitoring system must measure availability, detect events that can cause the database server to become unavailable, and provide immediate notification to responsible parties about critical failures.
The monitoring system itself must be highly available and adhere to the same operational best practices and availability practices as the resources it monitors. Failure of the monitoring system leaves all monitored systems unable to capture diagnostic data or alert the administrator about problems.
Enterprise Manager provides management and monitoring capabilities with many different notification options. Recommendations are available for methods of monitoring the environment's availability and performance, and for using the tools in response to changes in the environment.
A major benefit of Enterprise Manager is its ability to manage components across the entire application stack, from the host operating system to a user or packaged application. Enterprise Manager treats each of the layers in the application as a target. Targets—such as databases, application servers, and hardware—can then be viewed along with other targets of the same type, or can be grouped by application type. You can also review all targets in a single view from the High Availability Console (for more information, see Section 12.3.3, "Manage Database Availability with the High Availability Console"). Each target type has a default generated home page that displays a summary of relevant details for a specific target. You can group different types of targets by function; that is, as resources that support the same application.
Every target is monitored by an Oracle Management Agent. Every Management Agent runs on a system and is responsible for a set of targets. The targets can be on a system that is different from the one that the Management Agent is on. For example, a Management Agent can monitor a storage array that cannot host an agent natively. When a Management Agent is installed on a host, the host is automatically discovered along with other targets that are on the machine.
Moreover, to help you implement the Maximum Availability Architecture (MAA) best practices, Enterprise Manager provides the MAA Advisor (for more information, see Section 12.3.4, "Configure High Availability Solutions with MAA Advisor"). The MAA Advisor page recommends Oracle solutions for most outage types and describes the benefits of each solution.
In addition to monitoring infrastructure with Enterprise Manager in the Oracle HA environment, Oracle Auto Service Request (ASR) can be used to resolve problems faster by using auto-case generation for Oracle's Sun server and storage systems when specific hardware faults occur. For more information, see "Oracle Auto Service Request" in My Oracle Support Note 1185493.1 at
https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1185493.1
See Also: Oracle Enterprise Manager Concepts for information about Enterprise Manager Architecture and the Oracle Management Agent |
The Enterprise Manager home page in Figure 12-1 shows the availability of all discovered targets.
The Enterprise Manager home page includes the following information:
A snapshot of the current availability of all targets. The All Targets Status pie chart gives the administrator an immediate indication of any target that is Available (Up), unavailable (Down), or has lost communication with the console (Unknown).
An overview of how many alerts and problems (for jobs) are known in the entire monitored system. You can display detailed information by clicking the links, or by navigating to the Alerts tab from any Enterprise Manager page.
A view of the severity and total number of policy violations for all managed targets. Drill down to determine the source and type of violation.
All Targets Jobs lists the number of scheduled, running, suspended, and problem (stopped/failed) executions for all Enterprise Manager jobs. Click the number next to the status group to view a list of those jobs.
An overview of what is actually discovered in the system. This list can be shown at the hardware level and the Oracle level.
Alerts are generated by a combination of factors and are defined on specific metrics. A metric is a data point sampled by a Management Agent and sent to the Oracle Management Repository. An alert can be based on the availability of a component (through a simple heartbeat test) or on the evaluation of a specific performance measurement, such as "disk busy" or the percentage of processes waiting for a specific wait event.
There are four states that can be checked for any metric: error, warning, critical, and clear. The administrator must make policy decisions such as:
What objects should be monitored (databases, nodes, listeners, or other services)?
What instrumentation should be sampled (such as availability, CPU percent busy)?
How frequently should the metric be sampled?
What should be done when the metric exceeds a predefined threshold?
All of these decisions are predicated on the business needs of the system. For example, all components might be monitored for availability, but some systems might be monitored only during business hours. Systems with specific performance problems can have additional performance tracing enabled to debug a problem.
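For database metrics that are evaluated by the server itself (such as tablespace space usage), these threshold decisions can also be expressed directly in the database with the DBMS_SERVER_ALERT package. The following sketch sets warning and critical thresholds of 85% and 97% for all tablespaces; the values are illustrative only and should reflect your own SLAs.

BEGIN
  DBMS_SERVER_ALERT.SET_THRESHOLD(
    metrics_id              => DBMS_SERVER_ALERT.TABLESPACE_PCT_FULL,
    warning_operator        => DBMS_SERVER_ALERT.OPERATOR_GE,
    warning_value           => '85',
    critical_operator       => DBMS_SERVER_ALERT.OPERATOR_GE,
    critical_value          => '97',
    observation_period      => 1,     -- minutes
    consecutive_occurrences => 1,
    instance_name           => NULL,  -- tablespace metrics are database-wide
    object_type             => DBMS_SERVER_ALERT.OBJECT_TYPE_TABLESPACE,
    object_name             => NULL); -- NULL applies the threshold to all tablespaces
END;
/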
See Also: Oracle Enterprise Manager Cloud Control Introduction for more information about monitoring and using metrics in Enterprise Manager |
Notification Rules are defined sets of alerts on metrics that are automatically applied to a target when it is discovered by Enterprise Manager. For example, an administrator can create a rule that monitors the availability of database targets and generates an e-mail message if a database fails. After that rule is generated, it is applied to all existing databases and any database created in the future. Access these rules by navigating to Preferences and then choosing Rules.
The rules monitor problems that require immediate attention, such as those that can affect service availability, and Oracle or application errors. Service availability can be affected by an outage in any layer of the application stack: node, database, listener, and critical application data. A service availability failure, such as the inability to connect to the database, or the inability to access data critical to the functionality of the application, must be identified, reported, and reacted to quickly. Potential service outages such as a full archive log directory also must be addressed correctly to avoid a system outage.
Enterprise Manager provides a series of default rules that provide a strong framework for monitoring availability. A default rule is provided for each of the preinstalled target types that come with Enterprise Manager. You can modify these rules to conform to the policies of each individual site, and you can create rules for site-specific targets or applications. You can also set the rules to notify users during specific time periods to create an automated coverage policy.
Use the following best practices:
Modify each rule for high-value components in the target architecture to suit your availability requirements by using the rules modification wizard. For the database rule, set the metrics in Table 12-1, Table 12-2, and Table 12-3 for each target. The frequency of the monitoring is determined by the service-level agreement (SLA) for each component.
Use Beacon functionality to track the performance of individual applications. A Beacon can be set to perform a user transaction representative of normal application work. Enterprise Manager can then break down the response time of that transaction into its component pieces for analysis. In addition, an alert can be triggered if the execution time of that transaction exceeds a predefined limit.
Add Notification Methods and use them in each Notification Rule. By default, the easiest method for alerting an administrator to a potential problem is to send e-mail. Supplement this notification method by adding a callout to an SNMP trap or operating system script that sends an alert by some method other than e-mail. This avoids problems that might occur if a component of the e-mail system fails. Set additional Notification Methods by using the Setup link at the top of any Enterprise Manager page.
Modify Notification Rules to notify the administrator when there are errors in computing target availability. This might generate a false positive reading on the availability of the component, but it ensures the highest level of notification to system administrators.
See Also:
|
Figure 12-2 shows the Edit Notification Rule property page for choosing availability states, with the Down option chosen.
Figure 12-2 Setting Notification Rules for Availability
In addition, ensure that the metrics listed in Table 12-1, Table 12-2, and Table 12-3 are added to the database notification rule. Configure those metrics using the Metrics and Policy Settings page, which can be accessed from the Related Links section of the Database Homepage.
Use the metrics shown in Table 12-1 to monitor space management conditions that have the potential to cause a service outage.
Table 12-1 Recommendations for Monitoring Space
Metric | Recommendation |
---|---|
Tablespace Space Used (%) |
Set this database-level metric to check the Available Space Used (%) for each tablespace. For cluster databases, this metric is monitored at the cluster database target level and not by member instances. This metric enables the administrator to choose the threshold percentages that Enterprise Manager tests against, and the number of samples that must occur in error before a message is generated and sent to the administrator. If the percentage of used space is greater than the values specified in the threshold arguments, then a warning or critical alert is generated. The recommended default settings are 85% for a warning and 97% for a critical space usage threshold, but you should adjust these values appropriately, depending on system usage. Also, you can customize this metric to monitor specific tablespaces. Note: there is an Enterprise Manager Job in the Job Library named:
Use this Job to disable alerts for all |
Dump Area Used (%) |
Set this metric to monitor the dump directory destinations. Dump space must be available so that the maximum amount of diagnostic information is saved the first time an error occurs. The recommended default settings are 70% for a warning and 90% for an error, but these should be adjusted depending on system usage. Set this metric in the Dump Area metric group. |
Recovery Area Free Space (%) |
This is a database-level metric that is evaluated by the server every 15 minutes or during a file creation, whichever occurs first. The metric is also printed in the alert log. For cluster databases, this metric is monitored at the cluster database target level and not by member instances. The Critical Threshold is set for < 3% and the Warning Threshold is set for < 15%. You cannot customize these thresholds. An alert is returned the first time the alert occurs, and the alert is not cleared until the available space rises above 15%. |
File System Available(%) |
By default, this metric monitors the root file system per host. The default warning level is 20% and the critical warning is 5%. |
Archive Area Used (%) |
Set this metric to return the percentage of space used on the archive area destination. If the space used is more than the threshold value given in the threshold arguments, then a warning or critical alert is generated. If the database is not running in |
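As a cross-check outside Enterprise Manager, the same space conditions can be examined with simple queries against the data dictionary and dynamic performance views, for example:

-- Tablespace usage as a percentage of maximum size
SELECT tablespace_name, used_percent
  FROM dba_tablespace_usage_metrics
 ORDER BY used_percent DESC;

-- Fast recovery area limit, usage, and reclaimable space
SELECT name, space_limit, space_used, space_reclaimable, number_of_files
  FROM v$recovery_file_dest;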
In Enterprise Manager 11g, the mechanism for monitoring the database alert log is tightly integrated with the Support Workbench, which lets you generate packages for each reported problem or incident and quickly upload them to Oracle Support.
As part of integrating with the Support Workbench, errors are categorized into different classes and groups, each served by a separate metric. At the highest level of categorization there are two different classes of errors: incidents and operational errors.
Incidents are errors recorded in the database alert log file that signify that the database being monitored has detected a critical error condition. For example, a critical error condition could be a generic internal error or an access violation.
Operational errors are errors recorded in the database alert log file that signify that the database being monitored has detected an error that may affect the operation of the database. For example, an operational error could be an indication that the archiver is hung or that a media failure has occurred.
Configure the metrics that raise alerts for errors reported in the Alert Log as shown in Table 12-2.
Note: For more information about the Alert Log metrics in Table 12-2 and Alert Log Monitoring for 11g database targets in Enterprise Manager, see "Monitoring 11g Database Alert Log Errors in Enterprise Manager" in My Oracle Support Note 949858.1 at
https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=949858.1 |
Table 12-2 Recommendations for Monitoring the Alert Log
Metric | Recommendation |
---|---|
Generic Internal Error Status |
Set the Critical threshold to 0 to ensure an alert is raised each time one or more |
Access Violation Status |
Set the Critical threshold to 0 to ensure an alert is raised each time one or more |
Session Terminated Status |
Set the Critical threshold to 0 to ensure an alert is raised each time one or more |
Out of Memory Status |
Set the Critical threshold to 0 to ensure an alert is raised each time one or more |
Redo Log Corruption Status |
Set the Critical threshold to 0 to ensure an alert is raised each time one or more |
Inconsistent DB State Status |
Set the Critical threshold to 0 to ensure an alert is raised each time one or more |
Deadlock Status |
Set the Critical threshold to 0 to ensure an alert is raised each time one or more Note: This metric does not raise alerts when application level deadlocks ( |
Internal SQL Error Status |
Set the Critical threshold to 0 to ensure an alert is raised each time one or more |
Cluster Error Status |
Set the Critical threshold to 0 to ensure an alert is raised each time one or more |
Data Block Corruption Status |
Set the Critical threshold to 0 to ensure an alert is raised each time one or more |
Media Failure Status |
Set the Critical threshold to 0 to ensure an alert is raised each time one or more |
Generic Incident Status |
Set the Critical threshold to 0 to ensure an alert is raised each time one or more of the errors: |
Generic Operational Error Status |
Set the Critical threshold to 0 to ensure an alert is raised each time one or more generic operational errors have been reported in the Alert Log since the last time the metric was collected. |
Monitor the system to ensure that the processing capacity is not exceeded. The warning and critical thresholds for these metrics should be modified based on the usage pattern of the system, following the recommendations in Table 12-3.
Table 12-3 Recommendations for Monitoring Processing Capacity
Metric | Recommendation |
---|---|
Process limit |
Set thresholds for this metric to warn if the number of current processes approaches the value of the |
Session limit |
Set thresholds for this metric to warn if the instance is approaching the maximum number of concurrent connections allowed by the database. |
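To see how close an instance currently is to its process and session limits, you can also query V$RESOURCE_LIMIT, for example:

SELECT resource_name, current_utilization, max_utilization, limit_value
  FROM v$resource_limit
 WHERE resource_name IN ('processes', 'sessions');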
Figure 12-3 shows the Metric and Policy settings page for setting and editing metrics. The online help contains complete reference information for every metric. To access reference information for a specific metric, use the online help search feature.
Figure 12-3 Setting Notification Rules for Metrics
See Also:
|
The Database target Home page in Figure 12-4 shows system performance, space usage, and the configuration of important availability components, such as the percentage of space used in the Fast Recovery Area and the Flashback Database Logging status (the Fast Recovery Area is labeled Flash Recovery Area (%) in Enterprise Manager 11g).
You can see the most recent alerts for the target under the Alerts table, as shown in Figure 12-4. You can access further information about alerts by clicking the links in the Message column.
Performance Analysis and Performance Baseline
Many of the metrics for Database targets in Enterprise Manager pertain to performance. A system that is not meeting performance service-level agreements is not meeting High Availability system requirements. While performance problems seldom cause a major system outage, they can still cause an outage to a subset of customers. Outages of this type are commonly referred to as application service brownouts. The primary cause of brownouts is the intermittent or partial failure of one or more infrastructure components. IT managers must be aware of how the infrastructure components are performing (their response time, latency, and availability), and how they are affecting the quality of application service delivered to the end user.
A performance baseline, derived from normal operations that meet the service-level agreement, should determine what constitutes a performance metric alert. Baseline data should be collected from the first day that an application is in production and should include the following:
Application statistics (transaction volumes, response time, web service times)
Database statistics (transaction rate, redo rate, hit ratios, top 5 wait events, top 5 SQL transactions)
Operating system statistics (CPU, memory, I/O, network)
You can use Enterprise Manager to capture a baseline snapshot of database performance and create an Automatic Workload Repository (AWR) baseline. Enterprise Manager compares these values against system performance and displays the result on the database Target page. Enterprise Manager can also send alerts if the values deviate too far from the established baseline. See "Use Automatic Performance Tuning Features" for more information about Automatic Workload Repository.
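For example, a baseline covering a representative period can be created from two existing AWR snapshots with the DBMS_WORKLOAD_REPOSITORY package. The snapshot IDs and baseline name below are placeholders; look up the actual snapshot IDs in DBA_HIST_SNAPSHOT.

BEGIN
  DBMS_WORKLOAD_REPOSITORY.CREATE_BASELINE(
    start_snap_id => 100,   -- placeholder: first snapshot of the representative period
    end_snap_id   => 112,   -- placeholder: last snapshot of the representative period
    baseline_name => 'peak_period_baseline');
END;
/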
Set the database notification rule to capture the metrics listed in Table 12-4 for all database targets.
Table 12-4 Recommendations for Performance Related Metrics
Metric | Level | Recommendation |
---|---|---|
I/O Requests (per second) |
Instance |
This metric represents the total rate of I/O read and write requests for the database. It sends an alert when the number of operations exceeds a user-defined threshold. Use this metric with operating system-level metrics that are also available with Enterprise Manager. Set this metric based on the total I/O throughput available to the system, the number of I/O channels available, network bandwidth (in a SAN environment), the effects of the disk cache if you are using a storage array device, and the maximum I/O rate and number of spindles available to the database. |
Database CPU Time (%) |
Instance |
This metric represents the percentage of database call time that is spent on the CPU. It can be used to detect a change in the operation of a system, for example, a drop in Database CPU time from 50% to 25%. The |
Wait Time (%) |
Instance |
Excessive idle time indicates that a bottleneck for one or more resources is occurring. Set this instance-level metric based on the system wait time when the application is performing as expected. |
Network Bytes per Second |
Instance |
This metric reports network traffic that Oracle generates. This metric can indicate a potential network bottleneck. Set this metric based on actual usage during peak periods. |
Pages Paged-in (per second) |
Host |
For UNIX-based systems, this metric represents the number of pages paged in (read from disk to resolve fault memory references) per second for the CPU(s) specified by the Host CPU(s) parameter. For Microsoft Windows, this metric is the rate at which pages are read from disk to resolve hard page faults. Hard page faults occur when a process refers to a page in virtual memory that is not in its working set or elsewhere in physical memory, and must be retrieved from disk. When a page is faulted, the system tries to read multiple contiguous pages into memory to maximize the benefit of the read operation. |
Run Queue Length |
Host |
For UNIX-based systems, the Run Queue Length metrics represent the average number of processes in memory and subject to be run in the last interval (1 minute average, 5 minute average, and 15 minute average). It is recommended to alert when Run Queue Length = # of CPUs. (An alternative way to do this is to monitor the Load Average metric and compare it to Maximum CPU.) This metric is not available on Microsoft Windows. |
See Also:
|
Set Enterprise Manager metrics to monitor the availability of Data Guard configurations. Table 12-5 shows the metrics that are available for monitoring Data Guard databases.
Table 12-5 Recommendations for Setting Data Guard Metrics
Metric | Recommendation |
---|---|
Notifies you about system problems in a Data Guard configuration. | |
Displays (in seconds) how far the standby is behind the primary database. This metric generates an alert on the standby database if it falls behind more than the user-specified threshold (if any). | |
Displays the approximate number of seconds required to failover to this standby database. | |
Displays the Redo Apply rate in KB/second on this standby database. | |
Displays the approximate number of seconds of redo that is not yet available on this standby database. The lag may be because the redo data has not yet been transported or there may be a gap. This metric generates an alert on the standby database if it falls behind more than the user-specified threshold (if any). |
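The lag information surfaced by these metrics can also be queried directly on the standby database, for example:

-- Run on the standby database
SELECT name, value, unit, time_computed
  FROM v$dataguard_stats
 WHERE name IN ('transport lag', 'apply lag', 'apply finish time');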
Use Enterprise Manager as a proactive part of administering any system and for problem notification and analysis, with the following recommendations:
Use Enterprise Manager to Manage Oracle Patches and Maintain System Baselines
Manage Database Availability with the High Availability Console
Enterprise Manager comes with a pre-installed set of policies and recommendations of best practices for all databases. These policies are checked by default and the number of violations is displayed on the Targets page in the Policy Violations area, as shown in Figure 12-5.
Figure 12-5 Database Home Page with Targets Showing Policy Violations
To see more details on violations, select a link in the Policy Violations area. Figure 12-6 shows the Policy Trend Overview page.
Figure 12-6 Database Targets Policy Trend Overview Page
To see Policy Violations, select Violations from the Compliance tab, as shown in Figure 12-7.
Figure 12-7 Shows Compliance Tab with Policy Violations
For any monitored system in the application environment, you can use Enterprise Manager to download and manage patches from My Oracle Support at https://support.oracle.com
You can set up a job to routinely check for patches that are relevant to the user environment. Those patches can be downloaded and stored directly in the Management Repository. Patches can be staged from the Management Repository to multiple systems and applied during maintenance windows.
You can examine patch levels for one system and compare them between systems in either a one-to-one or one-to-many relationship. In this case, a system can be identified as a baseline and used to demonstrate maintenance requirements in other systems. This can be done for operating system patches and database patches.
See Also:
|
The High Availability (HA) Console is a one-stop, dashboard-style page for monitoring the availability of each database. You can use it with any database, and if a database is part of a Data Guard configuration, the HA Console allows you to switch your view from the primary database to any of the standby databases.
Use the HA Console to:
Display high availability events including events from related targets such as standby databases
View the high availability summary that includes the status of the database
View the last backup status
View the Fast Recovery Area Usage, if configured
If Oracle Data Guard is configured: View the Data Guard summary, set up Data Guard standby databases for any database target, manage switchover and failover of database targets other than the database that contains the Management Repository, and monitor the health of a Data Guard configuration at a glance
If Oracle RAC is configured: View the Oracle RAC Services summary including Top Services
Note: Oracle Enterprise Manager Database Control uses the name Fast Recovery Area for the renamed Flash Recovery Area. In places, the HA Console and Enterprise Manager use the name Flash Recovery Area. For more information about the Fast Recovery Area, see Section 5.1.3, "Use a Fast Recovery Area". |
Figure 12-8 shows the HA Console. This figure shows summary information, details, and historical statistics for the primary database and shows the standby databases for the primary target, various Data Guard standby performance metrics and settings, and the data protection mode.
Figure 12-8 Monitoring a Primary Database in the High Availability Console
In Figure 12-8, the Availability Summary shows that the primary database is up and its availability is currently 100%. The Availability Summary also shows Oracle ASM instances status. The Availability Events table shows specific high availability events (alerts). You can click the message to obtain more details (or to suppress the event). To set up, manage, and configure a specific solution area for this database, under Availability Summary, next to MAA Advisor, click Details to go to the Maximum Availability Architecture (MAA) Advisor page (described in more detail in Section 12.3.4, "Configure High Availability Solutions with MAA Advisor").
The Backup/Recovery Summary area displays the Last Backup and Next Backup information. The Fast Recovery Area Usage chart indicates about 83% of the fast recovery area is currently used. The Used (Non-reclaimable) Fast Recovery Area (%) chart shows the usage over the last 2 hours. You can click the chart to display the page with the metric details.
The Data Guard Summary area shows the primary database is running in Maximum Availability mode and has Fast-Start Failover enabled. You can click the link next to Protection Mode to modify the data protection mode. In the Standby Databases table, the physical standby database (north) is caught up with the primary database (the Apply/Transport Lag metrics show 0 seconds), and the Used Fast Recovery Area (FRA) is 16.02%. The Primary Database Redo Rate chart shows the redo trend over the past 2 hours. Note that if Data Guard is not configured, the "Switch To" box in the corner of the console is not displayed.
Figure 12-9 shows information similar to Figure 12-8, but for the standby database (north), which is a physical standby database running real-time query. In the Standby Databases table, the Apply/Transport Lag metrics indicate that the physical standby database is caught up with the primary database, and the Used Fast Recovery Area (FRA) is 16%. Note that if Data Guard is not configured, the "Switch To" box in the corner of the console is not displayed.
Figure 12-9 Monitoring the Standby Database in the High Availability Console
Figure 12-10 shows sample values for Services Summary and Services Details areas. These areas show summary and detail information about Oracle RAC Services, including Top Services and problem services.
Figure 12-10 Monitoring the Cluster in the High Availability Console Showing Services
See Also: Oracle Enterprise Manager Cloud Control Introduction for information about Database Management |
The goal of the MAA Advisor is to help you implement Oracle's best practices to achieve the optimal high availability architecture.
From the Availability Summary section on the High Availability Console, you can link to the MAA Advisor to:
View recommended Oracle solutions for each outage type (site failures, computer failures, storage failures, human errors, and data corruptions)
View the configuration status and use the links in the Oracle Solution column to go to the Enterprise Manager page where the solution can be configured.
Understand the benefits of each solution
Link to the MAA website for white papers, documentation, and other information
The MAA Advisor page contains a table that lists the outage type, Oracle solutions for each outage, configuration status, and benefits. The MAA Advisor allows you to view High Availability solutions in the following ways:
Primary Database Recommendations Only—This condensed view shows only the recommended solutions (the default view) for the primary database.
All Solutions —This expanded view shows all configuration recommendations and status for all primary and standby databases in this configuration. It includes an extra column Target Name:Role that provides the database name and shows the role (Primary, Physical Standby, or Logical Standby) of the database.
Figure 12-11 shows an example of the MAA Advisor page with the Show All Solutions view selected.
Figure 12-11 Maximum Availability Architecture (MAA) Advisor Page in Enterprise Manager
You can click the link in the Oracle Solution column to go to a page where you can set up, manage, and configure the specific solution area. Once a solution has been configured, click Refresh to update the configuration status. Once the page is refreshed, click Advisor Details on the Console page to see the updated values.
The Cluster Health Monitor (CHM) gathers operating system metrics in real time and stores them in its repository for later analysis to determine the root cause of many Oracle Clusterware and Oracle RAC issues with the assistance of Oracle Support. It also works with Oracle Database Quality of Service Management (Oracle Database QoS Management) by providing metrics to detect memory over-commitment on a node. With this information, Oracle Database QoS Management can take action to relieve the stress and preserve existing workloads.
See: Oracle Clusterware Administration and Deployment Guide for an Overview of Managing Oracle Clusterware Environments and for more information about Cluster Health Monitor (CHM) |
This chapter describes best practices for configuring a fault-tolerant storage subsystem that protects data while providing manageability and performance. These practices apply to all Oracle Database high availability architectures described in Oracle Database High Availability Overview.
This chapter contains the following topics:
Evaluate Database Performance and Storage Capacity Requirements
Use Automatic Storage Management (Oracle ASM) to Manage Database Files
Characterize your database performance requirements using different application workloads. Extract statistics during your target workloads by gathering the beginning and ending statistical snapshots. Some examples of target workloads include:
Average load
Peak load
Application workloads such as batch processing, Online Transaction Processing (OLTP), decision support systems (DSS) and reporting, Extraction, Transformation, and Loading (ETL)
Evaluating Database Performance Requirements
You can gather the necessary statistics by using Automatic Workload Repository (AWR) reports or by querying the GV$SYSSTAT view. Along with understanding the database performance requirements, you must evaluate the performance capabilities of a storage array.
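For example, capturing the following statistics at the start and end of a target workload, and dividing the deltas by the elapsed time, yields the redo rate and the read and write I/O rates for each instance (the statistic names are standard GV$SYSSTAT entries):

SELECT inst_id, name, value
  FROM gv$sysstat
 WHERE name IN ('redo size',
                'physical read total IO requests',
                'physical write total IO requests',
                'physical read total bytes',
                'physical write total bytes')
 ORDER BY inst_id, name;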
Choosing Storage
When you understand the performance and capacity requirements, choose a storage platform to meet those requirements.
See Also: Oracle Database Performance Tuning Guide for Overview of the Automatic Workload Repository (AWR) and on Generating Automatic Workload Repository Reports |
Oracle ASM is a vertical integration of both the file system and the volume manager built specifically for Oracle database files. Oracle ASM extends the concept of stripe and mirror everything (SAME) to optimize performance, while removing the need for manual I/O tuning (distributing the data file layout to avoid hot spots). Oracle ASM helps manage a dynamic database environment by letting you grow the database size without shutting down the database to adjust the storage allocation. Oracle ASM also enables low-cost modular storage to deliver higher performance and greater availability by supporting mirroring and striping.
Oracle ASM provides data protection against drive and SAN failures, the best possible performance, and extremely flexible configuration and reconfiguration options. Oracle ASM automatically distributes data across all available drives, and transparently and dynamically redistributes data when storage is added to or removed from the database.
Oracle ASM manages all of your database files. You can phase Oracle ASM into your environment by initially supporting only the fast recovery area.
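For example, assuming a disk group named +RECO has already been created, the fast recovery area can be placed on Oracle ASM with two dynamic parameters (the size shown is illustrative; the size must be set before the destination):

ALTER SYSTEM SET DB_RECOVERY_FILE_DEST_SIZE = 750G SCOPE=BOTH;
ALTER SYSTEM SET DB_RECOVERY_FILE_DEST = '+RECO' SCOPE=BOTH;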
Note: Oracle recommends host-based mirroring using Oracle ASM. |
See:
|
The Grid Infrastructure is the software that provides the infrastructure for an enterprise grid architecture. In a cluster, this software includes Oracle Clusterware and Oracle ASM.
You can use clustered Oracle ASM with both Oracle single-instance databases and Oracle Real Application Clusters (Oracle RAC). In an Oracle RAC environment, there is one Oracle ASM instance for each node, and the Oracle ASM instances communicate with each other on a peer-to-peer basis. Only one Oracle ASM instance is required and supported for each node regardless of the number of database instances on the node. Clustering Oracle ASM instances provides fault tolerance, flexibility, and scalability to your storage pool.
See Also: Oracle Automatic Storage Management Administrator's Guide for more information about clustered Oracle ASM |
Oracle Restart improves the availability of your Oracle database. When you install the Oracle Grid Infrastructure for a standalone server, it includes both Oracle ASM and Oracle Restart. Oracle Restart runs out of the Oracle Grid Infrastructure home, which you install separately from Oracle Database homes.
Oracle Restart provides managed startup and restart of a single-instance (non-clustered) Oracle Database, Oracle ASM instance, service, listener, and any other process running on the server. If an interruption of a service occurs after a hardware or software failure, Oracle Restart automatically takes the necessary steps to restart the component.
With the Server Control Utility (SRVCTL), you can add a component, such as an Oracle ASM instance, to Oracle Restart. You then enable Oracle Restart protection for the Oracle ASM instance. With SRVCTL, you can also remove or disable Oracle Restart protection.
See Also:
|
Use the following Oracle ASM strategic best practices:
Oracle recommends using Oracle ASM high redundancy disk groups (3-way mirroring) or an external redundancy disk group with equivalent mirroring resiliency for mission-critical applications. This higher level of mirroring provides greater protection and better tolerance of different storage failures. This is especially true during planned maintenance windows, when a subset of the storage is offline for patching or upgrading. For more information about redundancy, see Chapter 4, "Use Redundancy to Protect from Disk Failure."
When you use Oracle ASM for database storage, create two disk groups: one disk group for the data area and another disk group for the fast recovery area:
data area: contains the active database files and other files depending on the level of Oracle ASM redundancy. If Oracle ASM with high redundancy is used, then the data area can also contain OCR, Voting, spfiles, control files, online redo log files, standby redo log files, broker metadata files, and change tracking files used for RMAN incremental backup.
For example (high redundancy):
CREATE DISKGROUP data HIGH REDUNDANCY
  FAILGROUP controller1 DISK '/devices/c1data01' NAME c1data01, '/devices/c1data02' NAME c1data02
  FAILGROUP controller2 DISK '/devices/c2data01' NAME c2data01, '/devices/c2data02' NAME c2data02
  FAILGROUP controller3 DISK '/devices/c3data01' NAME c3data01, '/devices/c3data02' NAME c3data02
  ATTRIBUTE 'au_size'='4M', 'compatible.asm'='11.2', 'compatible.rdbms'='11.2', 'compatible.advm'='11.2';
fast recovery area: contains recovery-related files, such as a copy of the current control file, a member of each online redo log file group, archived redo log files, RMAN backups, and flashback log files.
For example (normal redundancy):
CREATE DISKGROUP reco NORMAL REDUNDANCY
  FAILGROUP controller1 DISK '/devices/c1reco01' NAME c1reco01, '/devices/c1reco02' NAME c1reco02
  FAILGROUP controller2 DISK '/devices/c2reco01' NAME c2reco01, '/devices/c2reco02' NAME c2reco02
  ATTRIBUTE 'au_size'='4M', 'compatible.asm'='11.2', 'compatible.rdbms'='11.2', 'compatible.advm'='11.2';
Note 1: If you are using ASMLib in a Linux environment, then create the disks using the ORACLEASM CREATEDISK command. ASMLib is a support library for Oracle ASM and is not supported on all platforms. For more information about ASMLib, see Section 4.4.6, "Use ASMLib On Supported Platforms".
For example:
/etc/init.d/oracleasm createdisk lun1 /devices/lun01
Then, create the disk groups. For example:
CREATE DISKGROUP DATA DISK 'ORCL:lun01','ORCL:lun02','ORCL:lun03','ORCL:lun04'; |
Note 2: Oracle recommends using four (4) or more disks in each disk group. Having multiple disks in each disk group spreads kernel contention accessing and queuing for the same disk. |
To simplify file management, use Oracle Managed Files to control file naming. Enable Oracle Managed Files by setting the following initialization parameters: DB_CREATE_FILE_DEST and DB_CREATE_ONLINE_LOG_DEST_n.
For example:
DB_CREATE_FILE_DEST=+DATA
DB_CREATE_ONLINE_LOG_DEST_1=+RECO
You have two options when partitioning disks for Oracle ASM:
Allocate entire disks to the data area and fast recovery area disk groups. Figure 4-1 illustrates allocating entire disks.
Partition each disk into two partitions, one for the data area and another for the fast recovery area. Figure 4-2 illustrates partitioning each disk into two partitions.
The advantages of the option shown in Figure 4-1 are:
Easier management of the disk partitions at the operating system level because each disk is partitioned as just one large partition.
Quicker completion of Oracle ASM rebalance operations following a disk failure because there is only one disk group to rebalance.
Fault isolation, where storage failures only cause the affected disk group to go offline.
Patching isolation, where you can patch disks or firmware for individual disks without impacting every disk.
The disadvantage of the option shown in Figure 4-1 is:
Less I/O bandwidth, because each disk group is spread over only a subset of the available disks.
Figure 4-2 illustrates the partitioning option where each disk has two partitions. This option requires partitioning each disk into two partitions: a smaller partition on the faster outer portion of each drive for the data area, and a larger partition on the slower inner portion of each drive for the fast recovery area. The ratio for the size of the inner and outer partitions depends on the estimated size of the data area and the fast recovery area.
The advantages of the option shown in Figure 4-2 for partitioning are:
More flexibility and easier to manage from a performance and scalability perspective.
Higher I/O bandwidth is available, because both disk groups are spread over all available spindles. This advantage is considerable for the data area disk group for I/O intensive applications.
There is no need to create a separate disk group with special, isolated storage for online redo logs or standby redo logs if you have sufficient I/O capacity.
You can use the slower regions of the disk for the fast recovery area and the faster regions of the disk for data.
The disadvantages of the option shown in Figure 4-2 for partitioning are:
A double partner disk failure will result in loss of both disk groups, requiring the use of a standby database or tape backups for recovery. This problem is eliminated when using high redundancy ASM disk groups.
An Oracle ASM rebalance operation following a disk failure is longer, because both disk groups are affected.
See Also:
|
When setting up redundancy to protect from hardware failures, there are two options to consider:
See Also:
|
If you are using a high-end storage array that offers robust built-in RAID solutions, then Oracle recommends that you configure redundancy in the storage array by enabling RAID protection, such as RAID 1 (mirroring) or RAID 5 (striping plus parity). For example, to create an Oracle ASM disk group where redundancy is provided by the storage array, first create the RAID-protected logical unit numbers (LUNs) in the storage array, and then create the Oracle ASM disk group using the EXTERNAL REDUNDANCY clause:
CREATE DISKGROUP DATA EXTERNAL REDUNDANCY DISK '/devices/lun1','/devices/lun2','/devices/lun3','/devices/lun4';
See Also:
|
Oracle ASM provides redundancy with the use of failure groups, which are defined during disk group creation. The disk group type determines how Oracle ASM mirrors files. When you create a disk group, you indicate whether the disk group is a normal redundancy disk group (2-way mirroring for most files by default), a high redundancy disk group (3-way mirroring), or an external redundancy disk group (no mirroring by Oracle ASM). You use an external redundancy disk group if your storage system does mirroring at the hardware level, or if you have no need for redundant data. The default disk group type is normal redundancy. After a disk group is created the redundancy level cannot be changed.
Failure group definition is specific to each storage setup, but you should follow these guidelines:
If every disk is available through every I/O path, as would be the case if using disk multipathing software, then keep each disk in its own failure group. This is the default Oracle ASM behavior if creating a disk group without explicitly defining failure groups.
For an array with two controllers where every disk is seen through both controllers, create a disk group with each disk in its own failure group:
CREATE DISKGROUP DATA NORMAL REDUNDANCY DISK '/devices/diska1','/devices/diska2','/devices/diska3','/devices/diska4', '/devices/diskb1','/devices/diskb2','/devices/diskb3','/devices/diskb4';
If every disk is not available through every I/O path, then define failure groups to protect against the piece of hardware that you are concerned about failing. Here are some examples:
For an array with two controllers where each controller sees only half the drives, create a disk group with two failure groups, one for each controller, to protect against controller failure:
CREATE DISKGROUP DATA NORMAL REDUNDANCY FAILGROUP controller1 DISK '/devices/diska1','/devices/diska2','/devices/diska3','/devices/diska4' FAILGROUP controller2 DISK '/devices/diskb1','/devices/diskb2','/devices/diskb3','/devices/diskb4';
For a storage network with multiple storage arrays where you want to mirror across storage arrays, create a disk group with two failure groups, one for each array, to protect against array failure:
CREATE DISKGROUP DATA NORMAL REDUNDANCY FAILGROUP array1 DISK '/devices/diska1','/devices/diska2','/devices/diska3','/devices/diska4' FAILGROUP array2 DISK '/devices/diskb1','/devices/diskb2','/devices/diskb3','/devices/diskb4';
When determining the proper size of a disk group that is protected with Oracle ASM redundancy, enough free space must exist in the disk group so that when a disk fails, Oracle ASM can automatically reconstruct the contents of the failed drive to other drives in the disk group while the database remains online. The amount of space required to ensure Oracle ASM can restore redundancy following disk failure is shown in the REQUIRED_MIRROR_FREE_MB column of the V$ASM_DISKGROUP view. The amount of free space that you can use safely in a disk group, taking mirroring into account, and still be able to restore redundancy after a disk failure is shown in the USABLE_FILE_MB column of the V$ASM_DISKGROUP view. The value of the USABLE_FILE_MB column should always be greater than zero. If USABLE_FILE_MB falls below zero, then add more disks to the disk group.
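For example, the following query, run against the Oracle ASM instance, is a minimal sketch that reports these space figures for each mounted disk group (the view and column names are the ones described above; how often you run the check is up to you):
SELECT name, type, total_mb, free_mb, required_mirror_free_mb, usable_file_mb
  FROM v$asm_diskgroup;
If USABLE_FILE_MB is negative for any disk group, add disks to that disk group before another disk failure occurs.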
Oracle Automatic Storage Management (Oracle ASM) and Oracle Clusterware are installed into a single home directory, which is called the Grid home. Oracle Grid Infrastructure for a cluster software refers to the installation of the combined products. The Grid home is separate from the home directories of other Oracle software products installed on the same server.
See Also: Oracle Database 2 Day + Real Application Clusters Guide for information about Oracle ASM and the Grid home. |
Although ensuring that all disks in the same disk group have the same size and performance characteristics is not required, doing so provides more predictable overall performance and space utilization. When possible, present physical disks (spindles) to Oracle ASM rather than Logical Unit Numbers (LUNs) that add a layer of abstraction between the disks and Oracle ASM.
If the disks are the same size, then Oracle ASM spreads the files evenly across all of the disks in the disk group. This allocation pattern maintains every disk at the same capacity level and ensures that all of the disks in a disk group have the same I/O load. Because Oracle ASM load balances workload among all of the disks in a disk group, different Oracle ASM disks should not share the same physical drive.
See Also: Oracle Automatic Storage Management Administrator's Guide for complete information about administering Oracle ASM disk groups |
Using failure groups to define a common failure component ensures continuous access to data when that component fails. For maximum protection, use at least three failure groups for normal redundancy and at least five failure groups for high redundancy. Doing so enables Oracle ASM to tolerate multiple failure group failures and avoids the confusing state of having Oracle ASM run without full redundancy.
Intelligent Data Placement enables you to specify disk regions on Oracle ASM disks for best performance. Using the disk region settings you can ensure that frequently accessed data is placed on the outermost (hot) tracks which have greater speed and higher bandwidth. In addition, files with similar access patterns are located physically close, reducing latency. Intelligent Data Placement also enables the placement of primary and mirror extents into different hot or cold regions.
See Also: Oracle Automatic Storage Management Administrator's Guide for more information about Intelligent Data Placement |
Oracle Automatic Storage Management Cluster File System (Oracle ACFS) is a multi-platform, scalable file system and storage management technology that extends Oracle Automatic Storage Management (Oracle ASM) functionality to support customer files maintained outside of Oracle Database. Oracle ACFS includes a volume management service and provides fine-grained security policies, encryption, snapshots, and replication.
Oracle ACFS supports many database and application files, including executables, database trace files, database alert logs, application reports, BFILEs, and configuration files. Other supported files are video, audio, text, images, engineering drawings, and other general-purpose application file data.
Note: Oracle Database binaries can be placed on Oracle ACFS, but binaries in the Grid Infrastructure home cannot. |
See Also: Oracle Automatic Storage Management Administrator's Guide for more information about Oracle ACFS |
Use the following Oracle ASM configuration best practices:
Disk multipathing software aggregates multiple independent I/O paths into a single logical path. The path abstraction provides I/O load balancing across host bus adapters (HBA) and nondisruptive failovers when there is a failure in the I/O path. You should use disk multipathing software with Oracle ASM.
When specifying disk names during disk group creation in Oracle ASM, use the logical device representing the single logical path. For example, when using Device Mapper on Linux 2.6, a logical device path of /dev/dm-0 may be the aggregation of physical disks /dev/sdc and /dev/sdh. Within Oracle ASM, the ASM_DISKSTRING parameter should contain /dev/dm-* to discover the logical device /dev/dm-0, and that logical device is the one specified during disk group creation:
asm_diskstring='/dev/dm-*'
CREATE DISKGROUP DATA DISK '/dev/dm-0','/dev/dm-1','/dev/dm-2','/dev/dm-3';
Note: For more information about using the combination of ASMLib and Multipath Disks, see "Configuring Oracle ASMLib on Multipath Disks" in My Oracle Support Note 309815.1 at
|
See Also:
|
Use the SGA_TARGET and PGA_AGGREGATE_TARGET initialization parameters in the Oracle ASM instance to manage Oracle ASM process memory using the automatic shared memory management (ASMM) functionality.
To use automatic shared memory management, set the MEMORY_TARGET and MEMORY_MAX_TARGET initialization parameters to 0.
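For example, a minimal sketch run while connected to the Oracle ASM instance (the memory values are illustrative only and must be sized for your environment):
ALTER SYSTEM SET MEMORY_TARGET=0 SCOPE=SPFILE;
ALTER SYSTEM SET SGA_TARGET=1G SCOPE=SPFILE;
ALTER SYSTEM SET PGA_AGGREGATE_TARGET=400M SCOPE=SPFILE;
Restart the Oracle ASM instance for the server parameter file changes to take effect.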
See Also: Oracle Automatic Storage Management Administrator's Guide for information about memory-related initialization parameters for Oracle ASM |
The PROCESSES initialization parameter affects Oracle ASM, but the default value is usually suitable. However, if multiple database instances are connected to an Oracle ASM instance, you can use the following formulas:

For < 10 instances per node | For > 10 instances per node
---|---
PROCESSES = 50 * (n + 1) | PROCESSES = 50 * MIN(n + 1, 11) + 10 * MAX(n - 10, 0)
where n is the number of database instances connecting to the Oracle ASM instance.
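For example, a minimal sketch assuming 4 database instances per node (an illustrative value): the first formula gives PROCESSES = 50 * (4 + 1) = 250, which you could set in the Oracle ASM instance as follows:
ALTER SYSTEM SET PROCESSES=250 SCOPE=SPFILE;
Because PROCESSES is a static parameter, the new value takes effect after the Oracle ASM instance is restarted.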
See Also:
|
Disk labels ensure consistent access to disks across restarts. ASMLib is the preferred tool for disk labeling. For more information about ASMLib, see Section 4.4.6, "Use ASMLib On Supported Platforms".
The DISK_REPAIR_TIME disk group attribute specifies how long a disk remains offline before Oracle ASM drops the disk. If a disk is made available before the DISK_REPAIR_TIME attribute has expired, the storage administrator can issue the ONLINE DISK command and Oracle ASM resynchronizes the stale data from the mirror side. In Oracle Database 11g, the online disk operation does not restart if the instance on which the operation is running fails; you must reissue the command manually to bring the disk online.
You can set the DISK_REPAIR_TIME attribute on your disk group to specify how long disks remain offline before being dropped. The appropriate setting for your environment depends on how long you expect a typical transient failure to persist. Set the DISK_REPAIR_TIME disk group attribute to the maximum amount of time before a disk is definitely considered to be out of service.
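For example, a minimal sketch assuming the data disk group and the c1data01 disk from the earlier examples (the 8.5h value is illustrative and should reflect your expected repair window):
ALTER DISKGROUP data SET ATTRIBUTE 'disk_repair_time' = '8.5h';
After the transient failure is repaired, bring the disk back online so that Oracle ASM resynchronizes the stale extents:
ALTER DISKGROUP data ONLINE DISK c1data01;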
See Also: Oracle Automatic Storage Management Administrator's Guide for information about restoring the redundancy of an Oracle ASM disk group after a transient disk path failure |
To improve manageability use ASMLib on platforms where it is available. ASMLib is a support library for Oracle ASM.
Although ASMLib is not required to run Oracle ASM, using ASMLib is beneficial because ASMLib:
Eliminates the need for every Oracle process to open a file descriptor for each Oracle ASM disk, thus improving system resource usage.
Simplifies the management of disk device names, makes the discovery process simpler, and removes the challenge of disks being added on one node without being known to the other nodes in the cluster.
Eliminates the impact when the mappings of disk device names change upon system restart.
Note: ASMLib is not supported on all platforms. |
See Also:
|
A general rule of thumb is to disable variable sized extents if the amount of space managed by a single Oracle ASM cluster is less than or equal to 330TB Raw.
Note: This rule of thumb assumes that read-only tablespaces are not being shared across multiple databases. |
Use the following Oracle ASM operational best practices:
The Oracle ASM instance is managed by a privileged role called SYSASM, which grants full access to Oracle ASM disk groups. Using SYSASM enables the separation of authentication for the storage administrator and the database administrator. By configuring a separate operating system group for Oracle ASM authentication, you can have users that have SYSASM access to the Oracle ASM instances and do not have SYSDBA access to the database instances.
See Also: Oracle Automatic Storage Management Administrator's Guide for information about authentication to access Oracle ASM instances |
Higher Oracle ASM rebalance power limits make a rebalance operation run faster but can also affect application service levels. Rebalancing takes longer with lower power values, but consumes fewer processing and I/O resources that are shared by other applications, such as the database.
After performing planned maintenance, for example, adding or removing storage, a rebalance must subsequently be performed to spread data across all of the disks. A power limit is associated with the rebalance, and you can set it to control how many processes perform the rebalance operation. If you do not want the rebalance to impact applications, then set the power limit lower. However, if you want the rebalance to finish quickly, then set the power limit higher. To determine the default power limit for rebalances, check the value of the ASM_POWER_LIMIT initialization parameter in the Oracle ASM instance.
If the POWER clause is not specified in an ALTER DISKGROUP statement, or when a rebalance runs implicitly because you add or drop a disk, then the rebalance power defaults to the value of the ASM_POWER_LIMIT initialization parameter. You can adjust the value of this parameter dynamically.
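For example, a minimal sketch assuming the data disk group from the earlier examples (the power values are illustrative; choose values that your service levels can tolerate):
ALTER SYSTEM SET ASM_POWER_LIMIT=4;          -- default power for implicit rebalances
ALTER DISKGROUP data REBALANCE POWER 8;      -- run an explicit rebalance at a higher power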
See Also: Oracle Automatic Storage Management Administrator's Guide for more information about rebalancing Oracle ASM disk groups |
Mounting multiple disk groups in the same command ensures that disk discovery runs only one time, thereby increasing performance. Disk groups that are specified in the ASM_DISKGROUPS initialization parameter are mounted automatically at Oracle ASM instance startup. To mount disk groups manually, use the ALTER DISKGROUP...MOUNT statement and specify the ALL keyword:
ALTER DISKGROUP ALL MOUNT;
Note: The ALTER DISKGROUP...MOUNT command only works on one node. For cluster installations, use the following command:
srvctl start diskgroup -g |
See Also: Oracle Automatic Storage Management Administrator's Guide for information about mounting and dismounting disk groups |
Oracle ASM permits you to add or remove disks from your disk storage system while the database is operating. When you add a disk to a disk group, Oracle ASM automatically redistributes the data so that it is evenly spread across all disks in the disk group, including the new disk. This process of redistributing data so that it is also spread across the newly added disks is known as rebalancing. By executing storage maintenance operations in a single command, you ensure that only one rebalance is required, minimizing the impact on database performance. A sketch of such a combined operation follows.
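For example, a minimal sketch assuming the data disk group from the earlier examples (the disk path, disk names, and power value are illustrative). Adding and dropping disks in one statement triggers a single rebalance rather than two:
ALTER DISKGROUP data
  ADD FAILGROUP controller1 DISK '/devices/c1data03' NAME c1data03
  DROP DISK c1data01
  REBALANCE POWER 8;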
See Also: Oracle Automatic Storage Management Administrator's Guide for information about Altering Disk Groups |
You should periodically check disk groups for imbalance. Occasionally, disk groups can become unbalanced if certain operations fail, such as a failed rebalance operation. Periodically checking the balance of disk groups and running a manual rebalance, if needed, ensures optimal Oracle ASM space utilization and performance.
Use the following methods to check for disk group imbalance:
To check for an imbalance on all mounted disk groups, see "Script to Report the Percentage of Imbalance in all Mounted Diskgroups" in My Oracle Support Note 367445.1 at
https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=367445.1
To check for an imbalance from an I/O perspective, query the statistics in the V$ASM_DISK_IOSTAT view before and after running a large SQL statement. For example, if you run a large query that performs only read I/O, the READS and BYTES_READ columns should be approximately the same for all disks in the disk group.
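The following query is a minimal sketch of a space-based check (inspired by, but not identical to, the script in the My Oracle Support note referenced above); it estimates the percentage imbalance for each mounted disk group from the V$ASM_DISK_STAT and V$ASM_DISKGROUP_STAT views:
SELECT g.name AS diskgroup,
       100 * (MAX((d.total_mb - d.free_mb) / d.total_mb) -
              MIN((d.total_mb - d.free_mb) / d.total_mb)) /
             MAX((d.total_mb - d.free_mb) / d.total_mb) AS pct_imbalance
  FROM v$asm_disk_stat d, v$asm_diskgroup_stat g
 WHERE d.group_number = g.group_number
   AND d.group_number <> 0
   AND d.state = 'NORMAL'
   AND d.mount_status = 'CACHED'
 GROUP BY g.name;
A value near zero indicates an even distribution; larger values suggest that a manual rebalance may be needed.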
You should proactively mine vendor logs for disk errors and have Oracle ASM move data off the bad disk spots. Disk vendors usually provide disk-scrubbing utilities that notify you if any part of the disk is experiencing problems, such as a media sense error. When a problem is found, use the ASMCMD REMAP command to move Oracle ASM extents from the bad spot to a good spot.
Note that the REMAP command is needed only for data that has not been accessed by the database or Oracle ASM instances; when such data is accessed, Oracle ASM automatically moves the extent experiencing the media sense error to a different location on the same disk. In other words, use the ASMCMD REMAP command to proactively move data from a bad disk spot to a good disk spot before that data is accessed by the application.
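For example, a minimal sketch (the disk group name, disk name, and block range are illustrative and would come from the vendor utility's report):
ASMCMD> remap DATA DATA_0001 5000-5999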
See Also: Oracle Automatic Storage Management Administrator's Guide for information about the ASMCMD utility |
Use the ASMCMD utility to ease the manageability of day-to-day storage administration. Use the ASMCMD utility to view and manipulate files and directories in Oracle ASM disk groups, list the contents of disk groups, perform searches, create and remove directories and aliases, and display space usage. Also, use the ASMCMD utility to back up and restore the metadata of disk groups (using the md_backup and md_restore commands).
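For example, a minimal sketch of backing up the metadata of the data disk group to a file (the file path is illustrative):
ASMCMD> md_backup /tmp/data_dg_backup -G data
If the disk group ever has to be re-created, the same backup file can be supplied to the md_restore command.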
Note: As a best practice to create and drop Oracle ASM disk groups, use SQL*Plus, ASMCA, or Oracle Enterprise Manager. |
See Also: Oracle Automatic Storage Management Administrator's Guide for more information about ASMCMD Disk Group Management Commands |
Oracle ASM Configuration Assistant (ASMCA) supports installing and configuring Oracle ASM instances, disk groups, volumes, and Oracle Automatic Storage Management Cluster File System (Oracle ACFS). In addition, you can use the ASMCA command-line interface as a silent mode utility.
The Oracle Storage Grid consists of either:
Oracle ASM and third-party storage using external redundancy.
Oracle ASM and Oracle Exadata or third-party storage using Oracle ASM redundancy. The Oracle Storage Grid with Exadata seamlessly supports MAA-related technology, improves performance, provides unlimited I/O scalability, is easy to use and manage, and delivers mission-critical availability and reliability to your enterprise.
To protect storage against unplanned outages:
Set the DB_BLOCK_CHECKSUM initialization parameter to TYPICAL (default) or FULL. For more information, see Section 9.3.8.3, "Set DB_BLOCK_CHECKSUM=FULL and DB_BLOCK_CHECKING=MEDIUM or FULL".
Note: Oracle Exadata Database Machine also prevents corruptions from being written to disk by incorporating the hardware assisted resilient data (HARD) technology in its software. HARD uses block checking, in which the storage subsystem validates the Oracle block contents, preventing corrupted data from being written to disk. HARD checks in Oracle Exadata operate completely transparently and no parameters must be set for this purpose at the database or storage tier. For more information see the White Paper "Optimizing Storage and Protecting Data with Oracle Database 11g" at
|
Choose the Oracle ASM redundancy type (NORMAL or HIGH) based on your desired protection level and capacity requirements. The NORMAL setting stores two copies of Oracle ASM extents, while the HIGH setting stores three copies of Oracle ASM extents. Normal redundancy provides more usable capacity and high redundancy provides more protection.
If a storage component is to be taken offline while one or more databases are running, then verify that taking the storage component offline does not impact Oracle ASM disk group and database availability. Before dropping a failure group or taking a storage component offline, perform the appropriate checks, such as the query shown below.
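For example, a minimal sketch that lists any Oracle ASM disks that are not fully online (run against the Oracle ASM instance; combine it with the V$ASM_DISKGROUP space query shown earlier to confirm that USABLE_FILE_MB remains positive):
SELECT group_number, name, failgroup, mode_status, state
  FROM v$asm_disk
 WHERE mode_status <> 'ONLINE' OR state <> 'NORMAL';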
Ensure I/O performance can be sustained after an outage
Ensure that you have enough I/O bandwidth to support your service-level agreement if a failure occurs. For example, a typical case for a Storage Grid with n storage components would be to ensure that n-1 storage components could support the application service levels (for example, to handle a storage component failure).
Use the following list of best practices for planned maintenance:
Size I/O for performance first, and then set it for capacity:
When building your Oracle Storage Grid, make sure you have enough drives to support the I/Os per second (IOPS) and megabytes per second that meet your service-level requirements. Then, make sure you also have enough capacity. The order is important because you do not want to buy enough drives to support capacity only to find that the system cannot meet your performance requirements.
When you are sizing, you must also consider what happens to performance when you take a subset of the storage offline for planned maintenance. For example, when a subset of the overall storage is offline, you must still make sure you get the required number of IOPS if that is important to meeting your SLAs. Also, if taking storage offline means the system cannot add more databases, then you must account for that up front.
Set the Oracle ASM power limit for faster rebalancing. For more information, see Section 4.5.2, "Set Rebalance to the Maximum Limit that Does Not Affect Service Levels".
See Also:
|
This chapter describes the Oracle operational best practices that can tolerate or manage each unscheduled outage type and minimize downtime.
This chapter contains the following topics:
See Also: Chapter 14, "Reducing Downtime for Planned Maintenance" for information about scheduled outages. |
This section describes unscheduled outages that affect the primary or secondary site components, and describes the recommended methods to repair or minimize the downtime associated with each outage.
Unscheduled outages are unanticipated failures in any part of the technology infrastructure that supports the application, including the following components:
Hardware
Software
Network infrastructure
Naming services infrastructure
Database
Your monitoring and high availability infrastructure should provide rapid detection and recovery from downtime. Chapter 12, "Monitoring for High Availability" describes detection, while this chapter focuses on reducing downtime.
Solutions for unscheduled outages are critical for maximum availability of the system. Table 13-1 compares the most common Oracle high availability architectures and summarizes the recovery steps for unscheduled outages on the primary site. For outages that require multiple recovery steps, the table includes links to the detailed descriptions in Section 13.2, "Recovering from Unscheduled Outages".
Table 13-1 Recovery Times and Steps for Unscheduled Outages on the Primary Site
Outage Scope | Oracle Database 11g | Oracle Database 11g with Oracle RAC and Oracle Clusterware | Oracle Database 11g with Data Guard (Footnote 1) | Oracle Database 11g MAA
---|---|---|---|---
Site failure | Hours to days | Hours to days | Seconds to 5 minutes (Footnote 2) | Seconds to 5 minutes (Footnote 2)
Clusterwide failure | Not applicable | Hours to days | Not applicable | Seconds to 5 minutes
Computer failure (node) | Minutes to hours (Footnote 3) | No downtime (Footnote 4); managed automatically, see Section 13.2.3, "Oracle RAC Recovery for Unscheduled Outages (for Node or Instance Failures)" | Seconds to 5 minutes (Footnote 3) | No downtime (Footnote 4); managed automatically, see Section 13.2.3
Computer failure (instance) | Minutes (Footnote 3) | No downtime (Footnote 4); managed automatically, see Section 13.2.3 | Minutes (Footnote 3) or seconds to 5 minutes (Footnote 2) | No downtime (Footnote 4); managed automatically, see Section 13.2.3
Storage failure | No downtime (Footnote 5); see Section 13.2.5, "Oracle ASM Recovery After Disk and Storage Failures" | No downtime (Footnote 5); see Section 13.2.5 | No downtime (Footnote 5); see Section 13.2.5 | No downtime (Footnote 5); see Section 13.2.5
Data corruption | Minutes to hours | Minutes to hours | Possible no downtime with Active Data Guard (see Section 13.2.6.2, "Use Active Data Guard"); otherwise seconds to 5 minutes | Possible no downtime with Active Data Guard (see Section 13.2.6.2); otherwise seconds to 5 minutes
Human error | < 30 minutes (Footnote 6); see Section 13.2.7, "Recovering from Human Error (Recovery with Flashback)" | < 30 minutes (Footnote 6); see Section 13.2.7 | < 30 minutes (Footnote 6); see Section 13.2.7 | < 30 minutes (Footnote 6); see Section 13.2.7
Hang or application slowdown | Customized and configurable (Footnote 7) | Customized and configurable (Footnote 7) | Customized and configurable (Footnote 8) | Customized and configurable (Footnotes 7 and 8)
Footnote 1 While Data Guard physical replication is the most common data protection and availability solution used for Oracle Database, there are cases where active-active logical replication may be preferred, especially when control over the application makes it possible to implement. You may use Oracle GoldenGate in place of Data Guard for these requirements. See the topic, "Oracle Active Data Guard and Oracle GoldenGate" for additional discussion of the trade-offs between physical and logical replication at http://www.oracle.com/technetwork/database/features/availability/dataguardgoldengate-096557.html
Footnote 2 Recovery time indicated applies to database and existing connection failover. Network connection changes and other site-specific failover activities may lengthen overall recovery time.
Footnote 3 Recovery time consists largely of the time it takes to restart the failed system.
Footnote 4 Database is still available, but portion of application connected to failed system is temporarily affected.
Footnote 5 Storage failures are prevented by using Oracle ASM with mirroring and its automatic rebalance capability.
Footnote 6 Recovery times from human errors depend primarily on detection time. If it takes seconds to detect a malicious DML or DDL transaction, then it typically only requires seconds to flash back the appropriate transactions, if properly rehearsed. Referential or integrity constraints must be considered.
Footnote 7 Oracle Enterprise Manager or a customized application heartbeat can be configured to detect application or response time slowdown and react to these SLA breaches. For example, you can configure the Enterprise Manager Beacon to monitor and detect application response times. Then, after a certain threshold expires, Enterprise Manager can alert and possibly restart the database.
Footnote 8 Oracle Enterprise Manager or a customized application heartbeat can be configured to detect application or response time slowdown and react to these SLA breaches. For example, you can configure the Enterprise Manager Beacon to monitor and detect application response times. Then, after a certain threshold expires, Enterprise Manager can call the Oracle Data Guard DBMS_DG.INITIATE_FS_FAILOVER PL/SQL procedure to initiate a failover.
See:
|
Outages on the standby site do not impact the availability of the primary database when using Data Guard Maximum Availability (synchronous communication with net_timeout) or Maximum Performance (asynchronous communication).
Note: Outages to a system that uses the Active Data Guard option with the standby database can affect applications that are using the standby database for read activity, but such outages do not impact the availability of the primary database (the availability is based on the mode you specify). |
Data Guard Maximum Protection, however, has an impact on availability if the primary database does not receive acknowledgment from a standby database running in SYNC transport mode (net_timeout does not apply to Maximum Protection). For this reason, if you are using Maximum Protection you should follow the MAA best practice of deploying two SYNC standby databases, each at its own site. With two standby databases a single standby outage does not impact primary availability or zero data loss protection.
If limited system resources make it impractical to deploy two standby databases, then the availability of the primary database can be restored simply by downgrading the data protection mode to Maximum Availability and restarting the primary database.
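For example, a minimal sketch of the downgrade, issued on the primary database (with the broker, the equivalent is the DGMGRL command EDIT CONFIGURATION SET PROTECTION MODE AS MaxAvailability):
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY;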
Table 13-2 summarizes the recovery steps for unscheduled outages of the standby database on the secondary site. For outages that require multiple recovery steps, the table includes links to the detailed descriptions in Section 13.2, "Recovering from Unscheduled Outages".
Table 13-2 Recovery Steps for Unscheduled Outages on the Secondary Site
Outage Type | Recovery Steps for Single-Instance or Oracle RAC Standby Database
---|---
Computer failure (instance) | The broker automatically restarts the log apply services. Note 1: If there is only one standby database and Maximum Protection is configured, then the primary database shuts down to ensure that there is no data divergence with the standby database (no unprotected data). Note 2: If this is an Oracle RAC standby database, then there is no effect on primary database availability if you configured the primary database Oracle Net descriptor to use connect-time failover to an available standby instance. If you are using the broker, connect-time failover is configured automatically.
Data corruption | Section 13.3.5, "Restoring Fault Tolerance After a Standby Database Data Failure"
Primary database opens with RESETLOGS | Section 13.3.6, "Restoring Fault Tolerance After the Primary Database Was Opened Resetlogs"
See Also:
|
This section describes best practices for recovering from various types of unscheduled outages.
With complete site failover, the database, the middle-tier application server, and all user connections fail over to a secondary site that is prepared to handle the production load.
If the standby site meets the prerequisites, then complete site failover is recommended for the following scenarios:
Primary site disaster, such as natural disasters or malicious attacks
Primary network-connectivity failures
Primary site power failures
To expedite site failover in minutes:
Use the Data Guard configuration best practices in Section 9.3, "General Data Guard Configuration Best Practices"
Use Data Guard fast-start failover to automatically fail over to the standby database, with a recovery time objective (RTO) of less than 30 seconds (described in Section 9.4.2.3, "Fast-Start Failover Best Practices")
Maintain a running middle-tier application server on the secondary site to avoid the startup time, or redirect existing applications to the new primary database using the Fast Connection Failover best practices described in:
The MAA white paper: "Client Failover Best Practices for Data Guard 11g Release 2" from the MAA Best Practices area for Oracle Database at
Configure an automatic Domain Name System (DNS) failover procedure. Automatic DNS failover occurs after a primary site becomes inaccessible: the wide-area traffic manager at the secondary site returns the virtual IP address of a load balancer at the secondary site, and clients are directed there automatically on the subsequent reconnect.
The potential for data loss is dependent on the Data Guard protection mode used: Maximum Protection, Maximum Availability, or Maximum Performance.
A wide-area traffic manager on the primary and secondary sites provides the site failover function. The wide-area traffic manager can redirect traffic automatically if the primary site, or a specific application on the primary site, is not accessible. It can also be triggered manually to switch to the secondary site for switchovers. Traffic is directed to the secondary site only when the primary site cannot provide service due to an outage or after a switchover. If the primary site fails, then user traffic is directed to the secondary site automatically.
Figure 13-1 illustrates the possible network routes before site failover:
Client requests enter the client tier of the primary site and travel through the WAN traffic manager.
Client requests are sent through the firewall into the demilitarized zone (DMZ) to the application server tier.
Requests are forwarded through the active load balancer to the application servers.
Requests are sent through another firewall and into the database server tier.
The application requests, if required, are routed to an Oracle RAC instance.
Responses are sent back to the application and clients by a similar path.
Figure 13-1 Network Routes Before Site Failover
Figure 13-2 illustrates the network routes after site failover. Client or application requests enter the secondary site at the client tier and follow the same path on the secondary site that they followed on the primary site.
Figure 13-2 Network Routes After Site Failover
The following steps describe the effect of a failover or switchover on network traffic:
The administrator has failed over or switched over the primary database to the secondary site. This is automatic if you are using Data Guard fast-start failover.
The administrator starts the middle-tier application servers on the secondary site, if they are not running.
The wide-area traffic manager selection of the secondary site can be automatic for an entire site failure. The wide-area traffic manager at the secondary site returns the virtual IP address of a load balancer at the secondary site and clients are directed automatically on the subsequent reconnect. In this scenario, the site failover is accomplished by an automatic domain name system (DNS) failover.
Alternatively, a DNS administrator can manually change the wide-area traffic manager selection to the secondary site for the entire site or for specific applications. The following is an example of a manual DNS failover:
Change the DNS to point to the secondary site load balancer:
The master (primary) DNS server is updated with the zone information, and the change is announced with a DNS NOTIFY announcement.
The slave DNS servers are notified of the zone update with a DNS NOTIFY announcement, and the slave DNS servers pull the zone information.
Note: The master and slave servers are authoritative name servers. Therefore, they contain trusted DNS information. |
Clear affected records from caching DNS servers.
A caching DNS server is used primarily for performance and fast response. The caching server obtains information from an authoritative DNS server in response to a host query and then saves (caches) the data locally. On a second or subsequent request for the same data, the caching DNS server responds with its locally stored data (the cache) until the time-to-live (TTL) value of the response expires. At this time, the server refreshes the data from the zone master. If the DNS record is changed on the primary DNS server, then the caching DNS server does not pick up the change for cached records until TTL expires. Flushing the cache forces the caching DNS server to go to an authoritative DNS server again for the updated DNS information.
Flush the cache if the DNS server being used supports such a capability. The following is the flush capability of common DNS BIND versions:
BIND 9.3.0: The command rndc flushname name flushes individual entries from the cache.
BIND 9.2.0 and 9.2.1: The entire cache can be flushed with the command rndc flush.
BIND 8 and BIND 9 up to 9.1.3: Restarting the named server clears the cache.
Refresh local DNS service caching.
Some operating systems might cache DNS information locally in the local name service cache. If so, this cache must also be cleared so that DNS updates are recognized quickly.
Solaris: nscd
Linux: /etc/init.d/nscd restart
Microsoft Windows: ipconfig /flushdns
Apple Mac OS X: lookupd -flushcache
The secondary site load balancer directs traffic to the secondary site middle-tier application server.
The secondary site is ready to take client requests.
Failover also depends on the client's web browser. Most browser applications cache the DNS entry for a period. Consequently, sessions in progress during an outage might not fail over until the cache timeout expires. To resume service to such clients, close the browser and restart it.
Failover is the operation of transitioning one standby database to the role of primary database. A failover operation is invoked when an unplanned failure occurs on the primary database and there is no possibility of recovering the primary database in a timely fashion.
With Oracle Data Guard, you can automate the failover process using the broker and fast-start failover, or you can perform the failover manually:
Fast-start failover eliminates the uncertainty of a process that requires manual intervention and automatically executes a zero-loss or minimum-loss failover (that you configure using the FastStartFailoverLagLimit property) within seconds of an outage being detected. See Section 9.4.2.3, "Fast-Start Failover Best Practices" for configuration best practices.
Manual failover allows for a failover process where decisions are user driven using any of the following methods:
Oracle Enterprise Manager
The broker command-line interface (DGMGRL)
SQL*Plus statements
See Section 13.2.2.3, "Best Practices for Performing Manual Failover".
A database failover is accompanied by an application failover and, in some cases, preceded by a site failover. After the Data Guard failover, the secondary site hosts the primary database. You must reinstate the original primary database as a new standby database to restore fault tolerance of the configuration. See Section 13.3.2, "Restoring a Standby Database After a Failover."
A failover operation typically occurs in under a minute, and with little or no data loss.
See Also:
|
When a primary database failure cannot be repaired in time to meet your Recovery Time Objective (RTO) using local backups or Flashback technology, you should perform a failover using Oracle Data Guard.
You should perform a failover manually due to an unplanned outage such as:
A site disaster, which results in the primary database becoming unavailable
Damage resulting from user errors that cannot be repaired in a timely fashion
Data failures, which impact the production application
A failover requires that you reinstate the initial primary database as a standby database to restore fault tolerance to your environment. You can quickly reinstate the standby database using Flashback Database provided the original primary database has not been damaged. See Section 13.3.2, "Restoring a Standby Database After a Failover."
A fast-start failover is completely automated and requires no user intervention.
There are no procedural best practices to consider when performing a fast-start failover. However, it is important to address all of the configuration best practices described in Section 9.4.2.3, "Fast-Start Failover Best Practices".
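For example, a minimal sketch of enabling fast-start failover in DGMGRL (the 30-second lag limit is illustrative and applies to the Maximum Performance mode; a broker configuration must already exist):
DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverLagLimit = 30;
DGMGRL> ENABLE FAST_START FAILOVER;
An observer process must also be running (started with the DGMGRL START OBSERVER command) for the automatic failover to occur.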
See Also: The MAA white paper "Data Guard Switchover and Failover Best Practices" from the MAA Best Practices area for Oracle Database at |
When performing a manual failover:
Follow the configuration best practices outlined in Section 9.4.2.4, "Manual Failover Best Practices."
Choose from the following methods:
Oracle Enterprise Manager
See Oracle Data Guard Broker for complete information about how to perform a manual failover using Oracle Enterprise Manager. The procedure is the same for both physical and logical standby databases.
Oracle Data Guard broker command-line interface (DGMGRL)
See Oracle Data Guard Broker for complete information about how to perform a manual failover using DGMGRL. The procedure is the same for both physical and logical standby databases.
SQL*Plus statements:
Oracle Data Guard Concepts and Administration for information about Physical standby database steps for "Performing a Failover to a Physical Standby Database"
Oracle Data Guard Concepts and Administration for information about Logical standby database steps for "Performing a Failover to a Logical Standby Database"
Oracle RAC recovery is performed automatically when there is a node or instance failure. In regular multi-instance Oracle RAC environments, surviving instances automatically recover the failed instances and potentially aid in the automatic client failover. Recovery times can be bounded by adopting the database and Oracle RAC configuration best practices, and usually result in instance recovery times of seconds to minutes on very large, busy systems, with no data loss. For Oracle RAC One Node configurations, recovery times are expected to be longer than for full Oracle RAC; with Oracle RAC One Node, a replacement instance must be started before it can perform the instance recovery.
For instance or node failures with Oracle RAC and Oracle RAC One Node, use the following recovery methods:
Instance failure occurs when software or hardware problems cause an instance to shutdown or abort. After instance failure, Oracle automatically uses the online redo log file to perform database recovery.
Instance recovery in Oracle RAC does not include restarting the failed instance or the recovery of applications that were running on the failed instance. Applications will run continuously using service relocation and fast application notification (as described in Section 13.2.3.2, "Automatic Service Relocation").
When one instance performs recovery for another instance, the recovering instance:
Reads redo log entries generated by the failed instance and uses that information to ensure that committed transactions are recorded in the database. Thus, data from committed transactions is not lost
Rolls back uncommitted transactions that were active at the time of the failure and releases resources used by those transactions
When multiple instances fail, if one instance survives Oracle RAC performs instance recovery for any other instances that fail. If all instances of an Oracle RAC database fail, then on subsequent restart of any instance a crash recovery occurs and all committed transactions are recovered. Data Guard is the recommended solution to survive outages when all instances of a cluster fail.
Service reliability is achieved by configuring services to fail over among the surviving instances. A service can be made available by multiple database instances to provide the needed service. If a hardware failure occurs and the failure adversely affects an Oracle RAC database instance, then, depending on the configuration, Oracle Clusterware does one of the following:
Oracle Clusterware automatically moves any services on the failed database instance to another available instance, as configured with DBCA or Enterprise Manager. Oracle Clusterware recognizes when a failure affects a service and automatically fails over the service across the surviving instances supporting the service.
Note: With Oracle RAC One Node the relocation occurs when another instance on a different node is started and enabled for the appropriate services. Thus, Oracle RAC One Node starts a new instance when an instance fails but the new instance is not a "surviving instance." |
A service can be made available on multiple instances by default. In this case, when one of those instances is lost, clients continue to use the service on the surviving instances, but there are fewer resources to do the work.
In parallel, Oracle Clusterware attempts to restart and integrate the failed instances and dependent resources back into the system, and Cluster Ready Services (CRS) tries to restart the database instance three times. Clients can "subscribe" to node failure events; in this way, clients are notified of instance problems quickly and new connections can be set up (Oracle Clusterware does not set up the new connections; the clients set up the new connections). Notification of failures using fast application notification (FAN) events occurs at various levels within the Oracle Server architecture. The response can include notifying external parties through Oracle Notification Service (ONS), advanced queuing, or FAN callouts, recording the fault for tracking, event logging, and interrupting applications. Notification occurs from a surviving node when the failed node is out of service. The location and number of nodes serving a service are transparent to applications. Restart and recovery after a node shutdown or clusterware restart are done automatically.
Loss of the Oracle Cluster Registry (OCR) file affects the availability of Oracle RAC and Oracle Clusterware. The OCR file can be restored from a backup that is created automatically or from an export file that is created manually by using the ocrconfig tool (also use ocrconfig to restore the backup). Additionally, Oracle can optionally mirror the OCR so that a single OCR device failure can be tolerated. Ensure the OCR mirror is on a physically separate device and preferably on a separate controller. For more information, see Section 6.2.7, "Mirror Oracle Cluster Registry (OCR) and Configure Multiple Voting Disks with Oracle ASM".
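For example, a minimal sketch of locating and restoring an automatic OCR backup (the backup file path is illustrative; the restore is typically run as root with the Oracle Clusterware stack stopped on all nodes):
ocrconfig -showbackup
ocrconfig -restore /u01/app/11.2.0/grid/cdata/mycluster/backup00.ocr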
If all of the voting disks are corrupted, then you must restore them using the crsctl command. The steps you use depend on where you store your voting files. If the voting disks are stored in Oracle ASM, then run the crsctl replace votedisk command to migrate the voting disks to the Oracle ASM disk group you specify. If you did not store the voting disks in Oracle ASM, then run the crsctl delete css votedisk and crsctl add css votedisk commands to delete and add the voting disks.
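For example, a minimal sketch (the disk group name, voting disk GUID, and device path are illustrative). The first command covers the Oracle ASM case; the remaining commands cover the non-ASM case, where crsctl query css votedisk reports the GUID to delete:
crsctl replace votedisk +DATA
crsctl query css votedisk
crsctl delete css votedisk 26f7271ca8b34fd0bfcdc2031805581e
crsctl add css votedisk /dev/disk1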
See Also:
|
With a minimal configuration, applications can receive fast and efficient notification when instances providing services become unavailable. When notified, application reconnects occur transparently to the surviving instances of an Oracle RAC database or to a standby database that has assumed the primary role following a failover.
In an Oracle RAC configuration, services are essential to achieving fast and transparent application failover. Clients are notified of a service relocation through Fast Application Notification (FAN).
In an Oracle Data Guard configuration, you can configure services for client failover across sites. After a site failure in a Data Guard configuration, the new primary database can automatically publish the production service while notifying affected clients, through FAN events, that the services are no longer available on the failed primary database.
For hangs or situations in which the response time is unacceptable, you can configure Oracle Enterprise Manager or a custom application heartbeat to detect application or response time slowdown and react to these situations. For example, you can configure the Enterprise Manager Beacon to monitor and detect application response times. Then, after a certain time threshold expires, Enterprise Manager can call the Oracle Data Guard DBMS_DG.INITIATE_FS_FAILOVER PL/SQL procedure to initiate a database failover, immediately followed by an application failover using FAN notifications and service relocation.
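For example, a minimal sketch of a PL/SQL block that such a monitoring action might run on the primary database (the condition string is illustrative; it assumes a broker configuration with fast-start failover enabled and an observer running, and DBMS_DG.INITIATE_FS_FAILOVER returns 0 when the request is accepted):
DECLARE
  status BINARY_INTEGER;
BEGIN
  -- Request a fast-start failover for an application-level condition
  status := DBMS_DG.INITIATE_FS_FAILOVER('Application response time exceeded threshold');
  DBMS_OUTPUT.PUT_LINE('DBMS_DG.INITIATE_FS_FAILOVER returned: ' || status);
END;
/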
FAN notifications and service relocation enable automatic and fast redirection of clients if any failure or planned maintenance results in an Oracle RAC or Oracle Data Guard fail over.
See Also:
|
Table 13-3 summarizes the impacts and recommended repairs for various Oracle ASM failure types.
Table 13-3 Types of Oracle ASM Failures and Recommended Repair
Failure | Description | Impact | Recommended Repair
---|---|---|---
Oracle ASM instance failure | The Oracle ASM instance fails | All database instances accessing Oracle ASM storage from the same node shut down | Automatic; see Section 13.2.3, "Oracle RAC Recovery for Unscheduled Outages (for Node or Instance Failures)". If Oracle RAC is not used, use Data Guard failover (see Section 13.2.2.2, "Best Practices for Implementing Fast-Start Failover"). If neither Oracle RAC nor Data Guard is used, fix the underlying problem and then restart Oracle ASM and the database instances.
Oracle ASM disk failure | One or more Oracle ASM disks fail, but all disk groups remain online | All data remains accessible (possible only with normal or high redundancy disk groups) | Oracle ASM automatically rebalances to the remaining disk drives and reestablishes redundancy. There must be enough free disk space on the remaining disk drives to restore the redundancy, or the rebalance operation may fail.
Data area disk group failure | One or more Oracle ASM disks fail, and the data area disk group goes offline | Databases accessing the data area disk group shut down | Perform Data Guard failover or local recovery as described in Section 13.2.5.3, "Data Area Disk Group Failure"
Fast recovery area disk group failure | One or more Oracle ASM disks fail, and the fast recovery area disk group goes offline | Databases accessing the fast recovery area disk group shut down | Perform local recovery or Data Guard failover as described in Section 13.2.5.4, "Fast Recovery Area Disk Group Failure"
If the Oracle ASM instance fails, then database instances accessing Oracle ASM storage from the same node shut down. The following list describes failover processing:
If the primary database is an Oracle RAC database, then application failover occurs automatically and clients connected to the database instance reconnect to remaining instances. Thus, the service is provided by other instances in the cluster and processing continues. The recovery time typically occurs in seconds.
If the primary database is not an Oracle RAC database, then an Oracle ASM instance failure shuts down the entire database.
If the configuration uses Oracle Data Guard and fast-start failover is enabled, a database failover is triggered automatically and clients automatically reconnect to the new primary database after the failover completes. The recovery time is the amount of time it takes to complete an automatic Data Guard fast-start failover operation. If fast-start failover is not configured, then you must recover from this outage by either restarting the Oracle ASM and database instances manually, or by performing a manual Data Guard failover.
If the configuration includes neither Oracle RAC nor Data Guard, then you must manually restart the Oracle ASM instance and database instances. The recovery time depends on how long it takes to perform these tasks.
If the Oracle ASM disk fails, then failover processing is as follows:
If an Oracle ASM disk group is configured as an external redundancy type, then a failure of a single disk is handled by the storage array and should not be seen by the Oracle ASM instance. All Oracle ASM and database operations using the disk group continue normally.
However, if the failure of an external redundancy disk group is seen by the Oracle ASM instance, then the Oracle ASM instance takes the disk group offline immediately, causing Oracle instances accessing the disk group to crash. If the disk failure is temporary, then you can restart Oracle ASM and the database instances and crash recovery occurs after the disk group is brought back online.
If an Oracle ASM disk group is configured as a normal or a high-redundancy type, then disk failure is handled transparently by Oracle ASM and the databases accessing the disk group are not affected.
An Oracle ASM instance automatically starts an Oracle ASM rebalance operation to distribute the data of one or more failed disks to the remaining, intact disks of the Oracle ASM disk group. While the rebalance operation is in progress, subsequent disk failures may affect disk group availability if the disk contains data that has yet to be remirrored. When the rebalance operation completes successfully, the Oracle ASM disk group is no longer at risk in the event of a subsequent failure. Multiple disk failures are handled similarly, provided the failures affect only one failure group in an Oracle ASM disk group with normal redundancy.
The failure of multiple disks in multiple failure groups where a primary extent and all of its mirrors have been lost causes the disk group to go offline.
When Oracle ASM disks fail, use the following recovery methods:
Figure 13-3 shows Enterprise Manager reporting disk failures. Five of 14 alerts are shown; all five are Offline messages for disk RECO2.