Oracle® Database Administrator's Guide 11g Release 2 (11.2) Part Number E25494-02
This section contains background information on the Oracle Database fault diagnosability infrastructure.
The fault diagnosability infrastructure aids in preventing, detecting, diagnosing, and resolving problems. It targets critical errors in particular, such as those caused by code bugs, metadata corruption, and customer data corruption.
When a critical error occurs, it is assigned an incident number, and diagnostic data for the error (such as trace files) is immediately captured and tagged with this number. The data is then stored in the Automatic Diagnostic Repository (ADR), a file-based repository outside the database, where it can later be retrieved by incident number and analyzed.
The goals of the fault diagnosability infrastructure are the following:
First-failure diagnosis
Problem prevention
Limiting damage and interruptions after a problem is detected
Reducing problem diagnostic time
Reducing problem resolution time
Simplifying customer interaction with Oracle Support
The keys to achieving these goals are the following technologies:
Automatic capture of diagnostic data upon first failure—For critical errors, the ability to capture error information at first-failure greatly increases the chance of a quick problem resolution and reduced downtime. An always-on memory-based tracing system proactively collects diagnostic data from many database components, and can help isolate root causes of problems. Such proactive diagnostic data is similar to the data collected by airplane "black box" flight recorders. When a problem is detected, alerts are generated and the fault diagnosability infrastructure is activated to capture and store diagnostic data. The data is stored in a repository that is outside the database (and therefore available when the database is down), and is easily accessible with command line utilities and Enterprise Manager.
Standardized trace formats—Standardizing trace formats across all database components enables DBAs and Oracle Support personnel to use a single set of tools for problem analysis. Problems are more easily diagnosed, and downtime is reduced.
Health checks—Upon detecting a critical error, the fault diagnosability infrastructure can run one or more health checks to perform deeper analysis of a critical error. Health check results are then added to the other diagnostic data collected for the error. Individual health checks look for data block corruptions, undo and redo corruption, data dictionary corruption, and more. As a DBA, you can manually invoke these health checks, either on a regular basis or as required.
Incident packaging service (IPS) and incident packages—The IPS enables you to automatically and easily gather the diagnostic data—traces, dumps, health check reports, and more—pertaining to a critical error and package the data into a zip file for transmission to Oracle Support. Because all diagnostic data relating to a critical error are tagged with that error's incident number, you do not have to search through trace files and other files to determine the files that are required for analysis; the incident packaging service identifies the required files automatically and adds them to the zip file. Before creating the zip file, the IPS first collects diagnostic data into an intermediate logical structure called an incident package (package). Packages are stored in the Automatic Diagnostic Repository. If you choose to, you can access this intermediate logical structure, view and modify its contents, add or remove additional diagnostic data at any time, and when you are ready, create the zip file from the package. After these steps are completed, the zip file is ready to be uploaded to Oracle Support.
Data Recovery Advisor—The Data Recovery Advisor integrates with database health checks and RMAN to display data corruption problems, assess the extent of each problem (critical, high priority, low priority), describe the impact of a problem, recommend repair options, conduct a feasibility check of the customer-chosen option, and automate the repair process.
SQL Test Case Builder—For many SQL-related problems, obtaining a reproducible test case is an important factor in problem resolution speed. The SQL Test Case Builder automates the sometimes difficult and time-consuming process of gathering as much information as possible about the problem and the environment in which it occurred. After quickly gathering this information, you can upload it to Oracle Support to enable support personnel to easily and accurately reproduce the problem.
See Also:
Oracle Database Performance Tuning Guide for more information on SQL Test Case Builder
About Incidents and Problems
To facilitate diagnosis and resolution of critical errors, the fault diagnosability infrastructure introduces two concepts for Oracle Database: problems and incidents.
A problem is a critical error in a database instance, Oracle Automatic Storage Management (Oracle ASM) instance, or other Oracle product or component. Critical errors manifest as internal errors, such as ORA-00600, or other severe errors, such as ORA-07445 (operating system exception) or ORA-04031 (out of memory in the shared pool). Problems are tracked in the ADR. Each problem has a problem key, which is a text string that describes the problem. It includes an error code (such as ORA 600) and in some cases, one or more error parameters.
An incident is a single occurrence of a problem. When a problem (critical error) occurs multiple times, an incident is created for each occurrence. Incidents are timestamped and tracked in the Automatic Diagnostic Repository (ADR). Each incident is identified by a numeric incident ID, which is unique within the ADR. When an incident occurs, the database:
Makes an entry in the alert log.
Sends an incident alert to Oracle Enterprise Manager (Enterprise Manager).
Gathers first-failure diagnostic data about the incident in the form of dump files (incident dumps).
Tags the incident dumps with the incident ID.
Stores the incident dumps in an ADR subdirectory created for that incident.
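The relationship between problems and incidents can be pictured as a grouping of incident IDs under a shared problem key. The following is a minimal sketch only: the `Incident` class, its fields, and `group_by_problem` are hypothetical names introduced for illustration and are not part of any Oracle interface.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Incident:
    """Hypothetical record for one occurrence of a problem."""
    incident_id: int   # numeric ID, unique within the ADR
    problem_key: str   # error code plus, in some cases, error parameters


def group_by_problem(incidents: Iterable[Incident]) -> Dict[str, List[int]]:
    """Map each problem key to the IDs of the incidents recorded for it."""
    problems: Dict[str, List[int]] = defaultdict(list)
    for incident in incidents:
        problems[incident.problem_key].append(incident.incident_id)
    return dict(problems)
```

Each key in the result stands for one problem, and the length of its list is the number of incidents for that problem.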
Diagnosis and resolution of a critical error usually starts with an incident alert. Incident alerts are displayed on the Enterprise Manager Database Home page or Oracle Automatic Storage Management Home page. The Database Home page also displays in its Related Alerts section any critical alerts in the Oracle ASM instance or other Oracle products or components. After viewing an alert, you can then view the problem and its associated incidents with Enterprise Manager or with the ADRCI command-line utility.
The following sections provide more information about incidents and problems.
It is conceivable that a problem could generate dozens or perhaps hundreds of incidents in a short period of time. This would generate too much diagnostic data, which would consume too much space in the ADR and could possibly slow down your efforts to diagnose and resolve the problem. For these reasons, the fault diagnosability infrastructure applies flood control to incident generation after certain thresholds are reached. A flood-controlled incident is an incident that generates an alert log entry, is recorded in the ADR, but does not generate incident dumps. Flood-controlled incidents provide a way of informing you that a critical error is ongoing, without overloading the system with diagnostic data. You can choose to view or hide flood-controlled incidents when viewing incidents with Enterprise Manager or the ADRCI command-line utility.
Threshold levels for incident flood control are predetermined and cannot be changed. They are defined as follows:
After five incidents occur for the same problem key in one hour, subsequent incidents for this problem key are flood-controlled. Normal (non-flood-controlled) recording of incidents for that problem key begins again in the next hour.
After 25 incidents occur for the same problem key in one day, subsequent incidents for this problem key are flood-controlled. Normal recording of incidents for that problem key begins again on the next day.
In addition, after 50 incidents for the same problem key occur in one hour, or 250 incidents for the same problem key occur in one day, subsequent incidents for this problem key are not recorded at all in the ADR. In these cases, the database writes a message to the alert log indicating that no further incidents will be recorded. As long as incidents continue to be generated for this problem key, this message is added to the alert log every ten minutes until the hour or the day expires. Upon expiration of the hour or day, normal recording of incidents for that problem key begins again.
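The thresholds above can be summarized as a small decision function. This is an illustrative sketch, not Oracle's implementation: the names `Recording` and `classify_incident` are hypothetical, and it assumes the caller already knows how many incidents for the same problem key occurred earlier in the current hour and day, counting flood-controlled ones.

```python
from enum import Enum


class Recording(Enum):
    NORMAL = "normal"                      # alert log entry, ADR record, incident dumps
    FLOOD_CONTROLLED = "flood-controlled"  # alert log entry and ADR record, no dumps
    NOT_RECORDED = "not recorded"          # incident is not recorded in the ADR


# Threshold values taken from the text; they are predetermined and cannot be changed.
FLOOD_PER_HOUR, FLOOD_PER_DAY = 5, 25
DROP_PER_HOUR, DROP_PER_DAY = 50, 250


def classify_incident(prior_in_hour: int, prior_in_day: int) -> Recording:
    """Classify a new incident for a problem key.

    prior_in_hour and prior_in_day are the numbers of incidents that already
    occurred for the same problem key in the current hour and the current day
    (an assumption of this sketch).
    """
    if prior_in_hour >= DROP_PER_HOUR or prior_in_day >= DROP_PER_DAY:
        return Recording.NOT_RECORDED
    if prior_in_hour >= FLOOD_PER_HOUR or prior_in_day >= FLOOD_PER_DAY:
        return Recording.FLOOD_CONTROLLED
    return Recording.NORMAL
```

For example, the sixth incident for a problem key within one hour is classified as flood-controlled, and the fifty-first is not recorded.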
For any problem identified in a database instance, the diagnosability framework can identify related problems across the topology of your Oracle Database installation. In a single instance environment, a related problem could be identified in the local Oracle ASM instance. In an Oracle RAC environment, a related problem could be identified in any database instance or Oracle ASM instance on any other node. When investigating problems, you are able to view and gather information on any related problems.
A problem is related to the original problem if it occurs within a designated time period or shares the same execution context identifier. An execution context identifier (ECID) is a globally unique identifier used to tag and track a single call through the Oracle software stack, for example, a call to Oracle Fusion Middleware that then calls into Oracle Database to retrieve data. The ECID is typically generated in the middle tier and is passed to the database as an Oracle Call Interface (OCI) attribute. When a single call has failures on multiple tiers of the Oracle software stack, problems that are generated are tagged with the same ECID so that they can be correlated. You can then determine the tier on which the originating problem occurred.
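The relatedness rule described above can be sketched as follows. The `Problem` class, the `is_related` function, and the `window` parameter are hypothetical. The text does not state the length of the designated time period, so the caller must supply it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Problem:
    """Hypothetical record of a problem on some tier or instance."""
    problem_key: str
    occurred_at: datetime
    ecid: Optional[str] = None  # execution context identifier, if one was passed


def is_related(original: Problem, other: Problem, window: timedelta) -> bool:
    """Return True if `other` shares the ECID of `original` or occurred
    within `window` of it (the window length is an assumption)."""
    same_ecid = original.ecid is not None and original.ecid == other.ecid
    close_in_time = abs(other.occurred_at - original.occurred_at) <= window
    return same_ecid or close_in_time
```

Problems that share an ECID are treated as related regardless of how far apart they occurred, which matches the case of a single call failing on multiple tiers.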
The following are the key components of the fault diagnosability infrastructure:
The ADR is a file-based repository for database diagnostic data such as traces, dumps, the alert log, health monitor reports, and more. It has a unified directory structure across multiple instances and multiple products. Beginning with Release 11g, the database, Oracle Automatic Storage Management (Oracle ASM), the listener, and other Oracle products or components store all diagnostic data in the ADR. Each instance of each product stores diagnostic data underneath its own home directory within the ADR. For example, in an Oracle Real Application Clusters environment with shared storage and Oracle ASM, each database instance and each Oracle ASM instance has an ADR home directory. ADR's unified directory structure, consistent diagnostic data formats across products and instances, and a unified set of tools enable customers and Oracle Support to correlate and analyze diagnostic data across multiple instances.
Note:
Beginning with Release 11g of Oracle Database, because all diagnostic data, including the alert log, are stored in the ADR, the initialization parameters BACKGROUND_DUMP_DEST and USER_DUMP_DEST are deprecated. They are replaced by the initialization parameter DIAGNOSTIC_DEST, which identifies the location of the ADR.
See Also:
"Structure, Contents, and Location of the Automatic Diagnostic Repository" for more information on the DIAGNOSTIC_DEST parameter and on ADR homes.
The alert log is an XML file that is a chronological log of database messages and errors. It is stored in the ADR and includes messages about the following:
Critical errors (incidents)
Administrative operations, such as starting up or shutting down the database, recovering the database, creating or dropping a tablespace, and others.
Errors during automatic refresh of a materialized view
Other database events
You can view the alert log in text format (with the XML tags stripped) with Enterprise Manager and with the ADRCI utility. There is also a text-formatted version of the alert log stored in the ADR for backward compatibility. However, Oracle recommends that any parsing of the alert log contents be done with the XML-formatted version, because the text format is unstructured and may change from release to release.
Trace files, dumps, and core files contain diagnostic data that are used to investigate problems. They are stored in the ADR.
Each server and background process can write to an associated trace file. Trace files are updated periodically over the life of the process and can contain information on the process environment, status, activities, and errors. In addition, when a process detects a critical error, it writes information about the error to its trace file. The SQL trace facility also creates trace files, which provide performance information on individual SQL statements. You can enable SQL tracing for a session or an instance.
Trace file names are platform-dependent. Typically, database background process trace file names contain the Oracle SID, the background process name, and the operating system process number, while server process trace file names contain the Oracle SID, the string "ora", and the operating system process number. The file extension is .trc. An example of a server process trace file name is orcl_ora_344.trc. Trace files are sometimes accompanied by corresponding trace map (.trm) files, which contain structural information about trace files and are used for searching and navigation.
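As an illustration of the typical server process naming pattern, the following sketch splits a name such as orcl_ora_344.trc into its SID and operating system process number. The function and class names are hypothetical, and because trace file names are platform-dependent, the pattern is an assumption that may not hold everywhere.

```python
import re
from typing import NamedTuple, Optional


class ServerTraceName(NamedTuple):
    sid: str     # Oracle SID
    os_pid: int  # operating system process number


# Assumed pattern: <SID>_ora_<process number>.trc
_SERVER_TRACE = re.compile(r"^(?P<sid>.+)_ora_(?P<pid>\d+)\.trc$")


def parse_server_trace_name(file_name: str) -> Optional[ServerTraceName]:
    """Return the SID and process number, or None if the name does not
    follow the assumed server process pattern."""
    match = _SERVER_TRACE.match(file_name)
    if match is None:
        return None
    return ServerTraceName(match.group("sid"), int(match.group("pid")))
```

Names that do not match, such as trace map (.trm) files, yield `None` rather than an error.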
Oracle Database includes tools that help you analyze trace files. For more information on application tracing, SQL tracing, and tracing tools, see Oracle Database Performance Tuning Guide.
See Also:
"Finding Trace Files"
A dump is a specific type of trace file. A dump is typically a one-time output of diagnostic data in response to an event (such as an incident), whereas a trace tends to be continuous output of diagnostic data. When an incident occurs, the database writes one or more dumps to the incident directory created for the incident. Incident dumps also contain the incident number in the file name.
In addition to files mentioned in the previous sections, the ADR contains health monitor reports, data repair records, SQL test cases, incident packages, and more. These components are described later in the chapter.
The Enterprise Manager Support Workbench (Support Workbench) is a facility that enables you to investigate, report, and in some cases, repair problems (critical errors), all with an easy-to-use graphical interface. The Support Workbench provides a self-service means for you to gather first-failure diagnostic data, obtain a support request number, and upload diagnostic data to Oracle Support with a minimum of effort and in a very short time, thereby reducing time-to-resolution for problems. The Support Workbench also recommends and provides easy access to Oracle advisors that help you repair SQL-related problems, data corruption problems, and more.
The ADR Command Interpreter (ADRCI) is a utility that enables you to investigate problems, view health check reports, and package first-failure diagnostic data, all within a command-line environment. You can then upload the package to Oracle Support. ADRCI also enables you to view the names of the trace files in the ADR, and to view the alert log with XML tags stripped, with and without content filtering.
For more information on ADRCI, see Oracle Database Utilities.
Structure, Contents, and Location of the Automatic Diagnostic Repository
The Automatic Diagnostic Repository (ADR) is a directory structure that is stored outside of the database. It is therefore available for problem diagnosis when the database is down.
The ADR root directory is known as ADR base. Its location is set by the DIAGNOSTIC_DEST initialization parameter. If this parameter is omitted or left null, the database sets DIAGNOSTIC_DEST upon startup as follows:
If environment variable ORACLE_BASE is set, DIAGNOSTIC_DEST is set to the directory designated by ORACLE_BASE.
If environment variable ORACLE_BASE is not set, DIAGNOSTIC_DEST is set to ORACLE_HOME/log.
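The default rule above can be expressed as a short function. This is a sketch for reasoning about where ADR base will be, not something the database exposes: the function name `resolve_adr_base` is hypothetical, and it assumes POSIX-style paths and that ORACLE_HOME is present in the environment.

```python
import os
import posixpath
from typing import Mapping, Optional


def resolve_adr_base(diagnostic_dest: Optional[str],
                     env: Mapping[str, str] = os.environ) -> str:
    """Return the ADR base implied by DIAGNOSTIC_DEST and the environment."""
    # An explicitly set DIAGNOSTIC_DEST is used as is.
    if diagnostic_dest:
        return diagnostic_dest
    # Parameter omitted or null: fall back to ORACLE_BASE if it is set.
    oracle_base = env.get("ORACLE_BASE")
    if oracle_base:
        return oracle_base
    # Otherwise use ORACLE_HOME/log (raises KeyError if ORACLE_HOME is unset).
    return posixpath.join(env["ORACLE_HOME"], "log")
```

Passing the environment as a parameter makes it possible to check the outcome for a given set of variables without changing the current process environment.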
Within ADR base, there can be multiple ADR homes, where each ADR home is the root directory for all diagnostic data—traces, dumps, the alert log, and so on—for a particular instance of a particular Oracle product or component. For example, in an Oracle Real Application Clusters environment with Oracle ASM, each database instance, Oracle ASM instance, and listener has an ADR home.
ADR homes reside in ADR base subdirectories that are named according to the product or component type. Figure 9-1 illustrates these top-level subdirectories.
Figure 9-1 Product/Component Type Subdirectories in the ADR
The location of each ADR home is given by the following path, which starts at the ADR base directory:
diag/product_type/product_id/instance_id
As an example, Table 9-1 lists the values of the various path components for an Oracle Database instance.
Table 9-1 ADR Home Path Components for Oracle Database
Path Component | Value for Oracle Database |
---|---|
product_type | rdbms |
product_id | Database unique name |
instance_id | SID |
For example, for a database with a SID and database unique name both equal to orclbi, the ADR home would be in the following location:
ADR_base/diag/rdbms/orclbi/orclbi/
Similarly, the ADR home path for the Oracle ASM instance in a single-instance environment would be:
ADR_base/diag/asm/+asm/+asm/
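The path rule above can be sketched as a helper that assembles an ADR home from its components. The function name `adr_home` is hypothetical, and POSIX-style path separators are assumed.

```python
import posixpath


def adr_home(adr_base: str, product_type: str,
             product_id: str, instance_id: str) -> str:
    """Build the path diag/product_type/product_id/instance_id under ADR base."""
    return posixpath.join(adr_base, "diag", product_type, product_id, instance_id)


# Database instance whose SID and database unique name are both orclbi
database_home = adr_home("/u01/oracle", "rdbms", "orclbi", "orclbi")

# Oracle ASM instance in a single-instance environment
asm_home = adr_home("/u01/oracle", "asm", "+asm", "+asm")
```

Here `/u01/oracle` stands in for ADR base and is an example value only.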
Within each ADR home directory are subdirectories that contain the diagnostic data. Table 9-2 lists some of these subdirectories and their contents.
Table 9-2 ADR Home Subdirectories
Subdirectory Name | Contents |
---|---|
alert | The XML-formatted alert log |
cdump | Core files |
incident | Multiple subdirectories, where each subdirectory is named for a particular incident, and where each contains dumps pertaining only to that incident |
trace | Background and server process trace files, SQL trace files, and the text-formatted alert log |
(others) | Other subdirectories of ADR home, which store incident packages, health monitor reports, and other information |
Figure 9-2 illustrates the complete directory hierarchy of the ADR for a database instance.
Figure 9-2 ADR Directory Structure for a Database Instance
ADR in an Oracle Real Application Clusters Environment
In an Oracle Real Application Clusters (Oracle RAC) environment, each node can have ADR base on its own local storage, or ADR base can be set to a location on shared storage. You can use ADRCI to view aggregated diagnostic data from all instances in a single report.
Each installation of Oracle Client includes an ADR for diagnostic data associated with critical failures in any of the Oracle Client components. The ADRCI utility is installed with Oracle Client so that you can examine diagnostic data and package it for upload to Oracle Support.
Viewing ADR Locations with the V$DIAG_INFO View
The V$DIAG_INFO view lists all important ADR locations for the current Oracle Database instance.
    SELECT * FROM V$DIAG_INFO;

    INST_ID NAME                  VALUE
    ------- --------------------- -------------------------------------------------------------
          1 Diag Enabled          TRUE
          1 ADR Base              /u01/oracle
          1 ADR Home              /u01/oracle/diag/rdbms/orclbi/orclbi
          1 Diag Trace            /u01/oracle/diag/rdbms/orclbi/orclbi/trace
          1 Diag Alert            /u01/oracle/diag/rdbms/orclbi/orclbi/alert
          1 Diag Incident         /u01/oracle/diag/rdbms/orclbi/orclbi/incident
          1 Diag Cdump            /u01/oracle/diag/rdbms/orclbi/orclbi/cdump
          1 Health Monitor        /u01/oracle/diag/rdbms/orclbi/orclbi/hm
          1 Default Trace File    /u01/oracle/diag/rdbms/orclbi/orclbi/trace/orcl_ora_22769.trc
          1 Active Problem Count  8
          1 Active Incident Count 20
The following table describes some of the information displayed by this view.
Table 9-3 Data in the V$DIAG_INFO View
Name | Description |
---|---|
ADR Base | Path of ADR base |
ADR Home | Path of ADR home for the current database instance |
Diag Trace | Location of background process trace files, server process trace files, SQL trace files, and the text-formatted version of the alert log |
Diag Alert | Location of the XML-formatted version of the alert log |
Default Trace File | Path to the trace file for the current session |
Viewing Critical Errors with the V$DIAG_CRITICAL_ERROR View
The V$DIAG_CRITICAL_ERROR view lists all of the non-internal errors designated as critical errors for the current Oracle Database release. The view does not list internal errors because internal errors are always designated as critical errors.
The following example shows the output for the V$DIAG_CRITICAL_ERROR view in Oracle Database 11g Release 2 (11.2.0.2):
    SELECT * FROM V$DIAG_CRITICAL_ERROR;

    FACILITY   ERROR
    ---------- ----------------------------------------------------------------
    ORA        7445
    ORA        4030
    ORA        4031
    ORA        29740
    ORA        255
    ORA        355
    ORA        356
    ORA        239
    ORA        240
    ORA        494
    ORA        3137
    ORA        227
    ORA        353
    ORA        1578
    ORA        32701
    ORA        32703
    ORA        29770
    ORA        29771
    ORA        445
    ORA        25319
    OCI        3106
    OCI        3113
    OCI        3135
The following table describes the information displayed by this view.
Table 9-4 Data in the V$DIAG_CRITICAL_ERROR View
Column | Description |
---|---|
FACILITY | The facility that can report the error, such as Oracle Database (ORA) or Oracle Call Interface (OCI) |
ERROR | The error number |
See Also:
"About Incidents and Problems" for more information about internal errors