Oracle® Text Application Developer's Guide 11g Release 2 (11.2) Part Number E24435-01
In Oracle Text, alternate spelling refers specifically to the use of spelling variations in German, Swedish, and Dutch; these variations may be indexed if the BASIC_LEXER attribute ALTERNATE_SPELLING has been specified.
An attribute is an optional parameter associated with a preference. For example, the BASIC_LEXER preference includes the base_letter attribute, which can have a value of either YES (perform base-letter conversions) or NO (do not perform such conversions). Attributes are set with the CTX_DDL.SET_ATTRIBUTE procedure or with the ALTER INDEX statement. See also: preference, base-letter conversion.
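As a minimal sketch of setting an attribute on a preference (the preference name my_lexer is a hypothetical placeholder, and the calling user is assumed to have the CTXAPP role):

```sql
BEGIN
  -- Create a lexer preference of type BASIC_LEXER (the name is illustrative).
  CTX_DDL.CREATE_PREFERENCE('my_lexer', 'BASIC_LEXER');
  -- Turn on base-letter conversion for this preference.
  CTX_DDL.SET_ATTRIBUTE('my_lexer', 'base_letter', 'YES');
END;
/
```

The preference takes effect only once it is named in the PARAMETERS clause of an index, for example 'LEXER my_lexer'.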
A user-defined section, representing an attribute of an XML document, such as AUTHOR or TITLE. Attribute sections are added to section groups with CTX_DDL.ADD_ATTR_SECTION or with the ALTER INDEX statement. See also: AUTO_SECTION_GROUP, section, XML_SECTION_GROUP.
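The following sketch shows one way an attribute section might be declared. The group name my_xml_group, the section name author, and the BOOK tag with its AUTHOR attribute are all assumptions made for illustration.

```sql
BEGIN
  CTX_DDL.CREATE_SECTION_GROUP('my_xml_group', 'XML_SECTION_GROUP');
  -- Map the AUTHOR attribute of the <BOOK> tag to a section named author.
  -- The third argument has the form tag@attribute.
  CTX_DDL.ADD_ATTR_SECTION('my_xml_group', 'author', 'BOOK@AUTHOR');
END;
/
```

Once an index uses this section group, a query such as 'Twain WITHIN author' searches only the attribute values.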
A section group used to automatically create a zone section for each start- and end-tag pair in an XML document; attribute sections are automatically created for XML tags that have attributes. See also: attribute section, section, section group, XML_SECTION_GROUP, zone section.
The conversion of a letter with alternate forms (such as accents, umlauts, or cedillas) to its basic form (for example, without an accent).
A section group used to define sections where the start and end tags are of the form <tag> and </tag>. It does not support non-basic tags, such as comment tags or those with attributes or unbalanced parentheses. See also: HTML_SECTION_GROUP, section, section group.
Case refers to the capitalization of a word or letter, where upper-case letters are capitals (M instead of m, for example). Not all languages have case. Mixed-case indexing is supported for some languages, notably those of Western Europe.
Also known as document classification. The conceptual separation of source documents into groups, or clusters, based on their content. For example, a group of documents might be separated into clusters concerning medicine, finance, and sports.
Oracle Text includes rule-based classification, in which a person writes the rules for classifying documents (in the form of queries), and Oracle Text performs the document classification according to the rules; supervised classification, in which Oracle Text creates classification rules based on a set of sample documents; and clustering (also known as unsupervised classification), in which the clusters and rules are both created by Oracle Text.
Also known as unsupervised classification. See: classification.
Also known as the CDI type of index. An Oracle Text index that not only indexes and processes a specified text column, but also indexes and processes FILTER BY and ORDER BY structured columns that are specified during index creation. See also: domain index.
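A composite domain index could be created along these lines. The table articles and its columns id, text, category, and pub_date are hypothetical.

```sql
-- Index the text column and also carry two structured columns in the index.
CREATE INDEX articles_cdi_idx ON articles (text)
  INDEXTYPE IS CTXSYS.CONTEXT
  FILTER BY category
  ORDER BY pub_date;

-- A mixed query: the text predicate, the filter, and the sort can all be
-- considered for processing by the Oracle Text index.
SELECT id
  FROM articles
 WHERE CONTAINS(text, 'oracle', 1) > 0
   AND category = 'news'
 ORDER BY pub_date;
```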
The basic type of Oracle Text index; an index on a text column. A CONTEXT index is useful when your source text consists of many large, coherent documents. Applications making use of CONTEXT indexes use the CONTAINS query operator to retrieve text.
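A minimal sketch of creating and querying a CONTEXT index, assuming a hypothetical table docs with columns id and text:

```sql
CREATE INDEX docs_text_idx ON docs (text)
  INDEXTYPE IS CTXSYS.CONTEXT;

-- The label 1 ties the CONTAINS predicate to the SCORE(1) relevance value.
SELECT SCORE(1) AS relevance, id
  FROM docs
 WHERE CONTAINS(text, 'oracle', 1) > 0
 ORDER BY SCORE(1) DESC;
```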
A role for application developers that enables a user to create Oracle Text indexes and index preferences, and to use PL/SQL packages. This role should be granted to Oracle Text users.
A combined index on a text column and one or more other columns. Typically used to index small documents or text fragments, such as the item names, prices, and descriptions found in catalogs. The CTXCAT index usually has better mixed-query performance than the CONTEXT index.
Applications query this index with the CATSEARCH operator. This index is transactional, which means that it automatically updates itself with DML to the base table.
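The sketch below shows how a CTXCAT index might be set up for a catalog. The table auction with columns title and price, and the index set name auction_iset, are assumptions.

```sql
BEGIN
  -- An index set lists the structured columns to combine with the text column.
  CTX_DDL.CREATE_INDEX_SET('auction_iset');
  CTX_DDL.ADD_INDEX('auction_iset', 'price');
END;
/

CREATE INDEX auction_title_idx ON auction (title)
  INDEXTYPE IS CTXSYS.CTXCAT
  PARAMETERS ('INDEX SET auction_iset');

-- Second argument: text query. Third argument: structured clause.
SELECT title, price
  FROM auction
 WHERE CATSEARCH(title, 'camera', 'price < 200 order by price') > 0;
```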
A CTXRULE index is used to build a document classification application. The CTXRULE index is an index created on a table of queries, where the queries serve as rules to define the classification criteria. This index is queried with the MATCHES operator.
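As an illustrative sketch of rule-based classification, the following creates a small table of query rules and classifies one document string. The table doc_rules, its columns, and the sample rules are hypothetical.

```sql
CREATE TABLE doc_rules (
  rule_id  NUMBER PRIMARY KEY,
  category VARCHAR2(30),
  query    VARCHAR2(2000)
);

INSERT INTO doc_rules VALUES (1, 'sports',  'football OR tennis');
INSERT INTO doc_rules VALUES (2, 'finance', 'stocks OR bonds');
COMMIT;

-- The index is built on the column that holds the rules (queries).
CREATE INDEX doc_rules_idx ON doc_rules (query)
  INDEXTYPE IS CTXSYS.CTXRULE;

-- Return the categories whose rule matches the supplied document text.
SELECT category
  FROM doc_rules
 WHERE MATCHES(query, 'The tennis final was played on Sunday.') > 0;
```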
The CTXSYS user is created at install time. The CTXSYS user can view all indexes; sync all indexes; run ctxkbtc, the knowledge base extension compiler; query all system-defined views; and perform all the tasks of a user with the CTXAPP role.
In Oracle Text, datastore refers to the method of storing text. The method is determined by specifying a storage preference of a particular type. For example, the DIRECT_DATASTORE type stores data directly into the text column, while the URL_DATASTORE specifies that data is stored externally in a location specified by a URL.
An Oracle Database domain index that indexes and processes a specified text column. See also: composite domain index.
One or more non-alphanumeric characters that, when encountered as the last character in a token, explicitly identify the end of the token. The characters, as well as any startjoin characters that immediately follow them, are included in the Text index entry for the token. For example, if ++ is specified as an endjoin, then C++ will be recognized and indexed as a single token. See also: printjoin, skipjoin, startjoin.
A field section is similar to a zone section, with the main difference that the content between the start and end tags of a field section can be indexed separately from the rest of the document. This enables field section content to be "hidden" from a normal query. (The INPATH and WITHIN operators may be used to find the term in such a section.) Field sections are useful when there is a single occurrence of a section in a document, such as a field in a news header. Field sections are added to section groups with the CTX_DDL.ADD_FIELD_SECTION procedure or with the ALTER INDEX statement. See also: INPATH operator, section, WITHIN operator, zone section.
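A hedged sketch of declaring a hidden field section follows. The names news_group, headline, and news_items are hypothetical, as is the HEADLINE tag.

```sql
BEGIN
  CTX_DDL.CREATE_SECTION_GROUP('news_group', 'BASIC_SECTION_GROUP');
  -- Fourth argument (visible) is FALSE: text inside <HEADLINE> is indexed
  -- separately and is found only by a query that names the section.
  CTX_DDL.ADD_FIELD_SECTION('news_group', 'headline', 'HEADLINE', FALSE);
END;
/

CREATE INDEX news_items_idx ON news_items (body)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('SECTION GROUP news_group');

SELECT id
  FROM news_items
 WHERE CONTAINS(body, 'election WITHIN headline') > 0;
```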
One of the steps in the Oracle Text index-creation process. Depending on the filtering preferences associated with the creation of the index, one of three things happens during filtering: Formatted documents are filtered into marked-up text; text is converted from a non-database character set to a database character set; or no filtering takes place (HTML, XML, and plain-text documents are not filtered).
A fuzzy-matching query is one in which the query is expanded to include words that are spelled similarly to the specified term. This type of expansion is helpful for finding more accurate results when there are frequent misspellings in a document set. Fuzzy matching is invoked with the FUZZY query operator.
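For illustration, a fuzzy query against a hypothetical docs table with a CONTEXT index on its text column might look like this:

```sql
-- fuzzy(term, similarity score, maximum number of expansions, weight):
-- expand "government" to at most 6 similarly spelled terms with a
-- similarity score of at least 70, weighting results by similarity.
SELECT id
  FROM docs
 WHERE CONTAINS(text, 'fuzzy(government, 70, 6, weight)', 1) > 0;
```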
A CONTAINS query operator used to find XML documents that contain a section path exactly as specified in the query. See also: PATH_SECTION_GROUP.
Generically, in Oracle Text, highlighting refers to generating a version of a document, or document fragments, with query terms displayed or called out in a special way.
Specifically, there are three forms of highlighting. First, CTX_DOC.MARKUP returns a document with the query term surrounded by plaintext or HTML tags. Second, CTX_DOC.HIGHLIGHT returns offsets for the query terms, allowing the user to mark up the document as desired. Third, CTX_DOC.SNIPPET produces a concordance, with the query term displayed in fragments of surrounding text. See also: markup.
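The following sketch calls two of these services from a SQL*Plus session. It assumes a CONTEXT index named docs_text_idx (hypothetical), a document whose primary key is 1, and the default primary-key text key mode.

```sql
SET SERVEROUTPUT ON

DECLARE
  marked_up CLOB;
BEGIN
  -- In-memory variant: the marked-up document is returned in a CLOB.
  CTX_DOC.MARKUP('docs_text_idx', '1', 'oracle', marked_up,
                 starttag => '<b>', endtag => '</b>');
  DBMS_OUTPUT.PUT_LINE(DBMS_LOB.SUBSTR(marked_up, 200, 1));
  DBMS_LOB.FREETEMPORARY(marked_up);

  -- SNIPPET is a function that returns keyword-in-context fragments.
  DBMS_OUTPUT.PUT_LINE(CTX_DOC.SNIPPET('docs_text_idx', '1', 'oracle'));
END;
/
```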
A section group type used for defining sections in HTML documents. See also: BASIC_SECTION_GROUP, section, section group.
A CONTAINS query operator used to search within tags, or paths, of an XML document. It enables more generic path denomination than the WITHIN operator. See also: WITHIN operator.
In Oracle Text, a presentation of a query term with the text that surrounds it in the source document. This presentation may consist of a single instance of the query term, several instances, or every instance in the source document. The CTX_DOC.SNIPPET procedure produces such a presentation. Also known as Key Word in Context (KWIC).
Oracle Text includes a knowledge base, which is a hierarchical tree of concepts used for theme indexing, ABOUT queries, and deriving themes for document services. The knowledge base may be optionally installed. You can create your own knowledge base or extend the standard Oracle Text knowledge base.
The Oracle Text lexer breaks source text into tokens—usually words—in accordance with a specified language. To extract tokens, the lexer uses parameters as defined by a lexer preference. These parameters include the definitions for the characters that separate tokens, such as whitespace, and whether to convert text to all uppercase or not. When theme indexing is enabled, the lexer analyses text to create theme tokens.
When an application needs to index a table containing documents in more than one language, it can use the MULTI_LEXER (the multilingual lexer) and create sub-lexers to handle each language. Each sub-lexer is added to the main multi-lexer with the CTX_DDL.ADD_SUB_LEXER procedure.
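A minimal sketch of a multi-lexer setup is shown below. The preference names, the table globaldoc, and its language column lang are hypothetical.

```sql
BEGIN
  CTX_DDL.CREATE_PREFERENCE('english_lexer', 'BASIC_LEXER');

  CTX_DDL.CREATE_PREFERENCE('german_lexer', 'BASIC_LEXER');
  CTX_DDL.SET_ATTRIBUTE('german_lexer', 'alternate_spelling', 'GERMAN');

  CTX_DDL.CREATE_PREFERENCE('global_lexer', 'MULTI_LEXER');
  -- The DEFAULT sub-lexer handles any language without its own sub-lexer.
  CTX_DDL.ADD_SUB_LEXER('global_lexer', 'default', 'english_lexer');
  -- 'ger' is an alternate value accepted in the language column.
  CTX_DDL.ADD_SUB_LEXER('global_lexer', 'german', 'german_lexer', 'ger');
END;
/

CREATE INDEX globaldoc_idx ON globaldoc (text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('LEXER global_lexer LANGUAGE COLUMN lang');
```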
A form of highlighting. The CTX_DOC.MARKUP and CTX_DOC.POLICY_MARKUP procedures take a query term and a document, and return the document with the query terms marked up; that is, surrounded either by plaintext characters or HTML tags. You can use predefined markup tags or specify your own. In comparison, CTX_DOC.HIGHLIGHT and CTX_DOC.POLICY_HIGHLIGHT return offsets for query terms, so you can add your own highlighting tags. See also: highlighting.
See: metadata.
An MDATA section contains user-defined index metadata. Use of this metadata can speed up mixed CONTAINS queries. See also: metadata, mixed query, section.
Metadata is information about a document that is not part of a document's regular content. For example, if an HTML document contains <author>Mark Twain</author>, author is considered the metadata type and Mark Twain is considered the value for author.
Sections containing metadata, known as MDATA sections, can be added to a document with the CTX_DDL.ADD_MDATA_SECTION procedure. Taking advantage of metadata can speed up mixed queries. Such queries can be made with the MDATA operator. See also: mixed query, section.
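As a sketch of the author example above, an MDATA section could be declared and queried as follows. The group name book_group and the table books with columns id and text are hypothetical.

```sql
BEGIN
  CTX_DDL.CREATE_SECTION_GROUP('book_group', 'BASIC_SECTION_GROUP');
  -- Treat the content of <author>...</author> as metadata named author.
  CTX_DDL.ADD_MDATA_SECTION('book_group', 'author', 'author');
END;
/

CREATE INDEX books_idx ON books (text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('SECTION GROUP book_group');

-- A mixed query: a text term combined with an equality test on metadata.
SELECT id
  FROM books
 WHERE CONTAINS(text, 'river AND MDATA(author, Mark Twain)') > 0;
```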
A query that searches for two different types of information; for example, text content and document type. For example, a search for Romeo and Juliet in <title> metadata is a mixed query.
Name searching (also called name matching) provides a solution to match proper names that might differ in spelling due to orthographic variation. It also enables you to search for somewhat inaccurate data, such as might occur when a record's first name and surname are not properly segmented.
A section group type used for defining sections in newsgroup-formatted documents as defined by RFC 1036. See also: section, section group.
The form of a word after it has been transformed for indexing, according to transformational rules in effect. Depending on the rules in effect, the normalized form of a word may be the same as the form found in the source document. The normalized form of a word may also include both the original and transformed versions. For example, if New German Spelling has been specified, the word Potential is normalized to both Potenzial and Potential.
The default section group type when no sections are defined or when only SENTENCE or PARAGRAPH sections are defined. See also: section, section group, special section.
A section group type used for indexing XML documents. It is similar to the AUTO_SECTION_GROUP type, except that it enables the use of the HASPATH and INPATH operators. See also: AUTO_SECTION_GROUP, HASPATH operator, INPATH operator, section, section group.
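The path operators can be sketched as follows, assuming a hypothetical table xml_docs with columns id and doc, and documents shaped like <book><title>...</title><author>...</author></book>.

```sql
BEGIN
  CTX_DDL.CREATE_SECTION_GROUP('xml_path_group', 'PATH_SECTION_GROUP');
END;
/

CREATE INDEX xml_docs_idx ON xml_docs (doc)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('SECTION GROUP xml_path_group');

-- HASPATH: documents that contain the path /book/title.
SELECT id FROM xml_docs
 WHERE CONTAINS(doc, 'HASPATH(/book/title)') > 0;

-- INPATH: documents where the term occurs inside /book/author.
SELECT id FROM xml_docs
 WHERE CONTAINS(doc, 'twain INPATH(/book/author)') > 0;
```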
A preference is an optional parameter that affects the way Oracle Text creates an index. For example, a lexer preference specifies the lexer to use when processing documents, such as the JAPANESE_VGRAM_LEXER. There are preferences for storage, filtering, lexers, classifiers, wordlist, section types, and more. A preference may or may not have attributes associated with it. Preferences are set with the CTX_DDL.CREATE_PREFERENCE procedure. See also: attribute.
One or more non-alphanumeric characters that, when they appear anywhere in a word (beginning, middle, or end), are processed as alphanumeric and included with the token in an Oracle Text index. This includes printjoins that occur consecutively.
For example, if the hyphen (-) and underscore (_) characters are defined as printjoins, terms such as pseudo-intellectual and _file_ are stored in the Oracle Text index as pseudo-intellectual and _file_.
Printjoins differ from endjoins and startjoins in that position does not matter. For example, $35 will be indexed as one token if $ is defined as a startjoin or a printjoin, but as two tokens if it is an endjoin. See also: endjoin, skipjoin, startjoin.
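The hyphen and underscore example above corresponds to a lexer preference along these lines (join_lexer is a hypothetical name):

```sql
BEGIN
  CTX_DDL.CREATE_PREFERENCE('join_lexer', 'BASIC_LEXER');
  -- Treat underscore and hyphen as part of the token wherever they occur,
  -- so pseudo-intellectual and _file_ are each indexed as written.
  CTX_DDL.SET_ATTRIBUTE('join_lexer', 'printjoins', '_-');
END;
/
```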
A page of search results in an application can consist of many disparate elements: metadata of the first few documents, total hit counts, per-word hit counts, and so on. Generating these results in earlier versions of Oracle Text required several queries and calls. Each extra call takes time to reparse the query and look up index metadata. Additionally, some search operations, such as iterative query refinement or top-ten breakdowns, are difficult to express in SQL.
The result set interface enables you to produce the various kinds of data needed for a page of search results all at once, thus improving performance by sharing overhead. The result set interface can also return data views that are difficult to express in SQL, such as top N by category queries.
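A rough sketch of a result set call follows, assuming a CONTEXT index named docs_text_idx (hypothetical). The descriptor asks for the total hit count plus the score and rowid of the first five hits, and the output comes back as XML in a CLOB.

```sql
SET SERVEROUTPUT ON

DECLARE
  rs CLOB;
BEGIN
  DBMS_LOB.CREATETEMPORARY(rs, TRUE, DBMS_LOB.SESSION);

  -- One call returns both the hit count and the first page of hits.
  CTX_QUERY.RESULT_SET('docs_text_idx', 'oracle', '
    <ctx_result_set_descriptor>
      <count/>
      <hitlist start_hit_num="1" end_hit_num="5" order="score desc">
        <score/>
        <rowid/>
      </hitlist>
    </ctx_result_set_descriptor>', rs);

  DBMS_OUTPUT.PUT_LINE(DBMS_LOB.SUBSTR(rs, 1000, 1));
  DBMS_LOB.FREETEMPORARY(rs);
END;
/
```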
See: classification.
Structured/Sort Data section. Unlike the MDATA section type, which only supports equality searches, SDATA sections are designed to also support range searches. By default, all FILTER BY and ORDER BY columns are mapped as SDATA sections. An SDATA section contains user-defined index metadata. Use of this type of section can speed up mixed CONTAINS queries. See also: mixed query, section.
A section is a subdivision of a document; for example, everything within an <a>...</a> section of an HTML page.
Dividing a document into sections and then searching within sections enables you to narrow text queries down to blocks of text within documents. Section searching is useful when your documents have internal structure, such as HTML and XML documents. You can also search for text at the sentence and paragraph level.
Section searching is performed with the HASPATH, INPATH, or WITHIN operator. Section searching is enabled by the use of a section group when indexing.
The various section types include attribute, field, HTML, MDATA, special, stop, XML, and zone sections.
A section group identifies a type of document set and implicitly indicates the tag structure for indexing. For instance, to index HTML-tagged documents, you use the HTML_SECTION_GROUP section group type. Likewise, to index XML-tagged documents, you can use the XML_SECTION_GROUP section group type. Section groups are declared with the CTX_DDL.CREATE_SECTION_GROUP procedure or with the ALTER INDEX statement. See also: section.
A non-alphanumeric character that, when it appears within a word, identifies the word as a single token; however, the character is not stored with the token in the Text index. For example, if the hyphen character '-' is defined as a skipjoin, the word pseudo-intellectual is stored in the Text index as pseudointellectual. See also: endjoin, printjoin, startjoin.
One or more non-alphanumeric characters that, when encountered as the first character in a token, explicitly identify the start of the token. The characters, as well as any other startjoin characters that immediately follow them, are included in the Text index entry for the token. For example, if '$' is defined as a startjoin, then $35 is indexed as a single token. In addition, the first startjoin character in a string of startjoin characters implicitly ends the previous token. See also: endjoin, printjoin, skipjoin.
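The $35 and C++ examples from the startjoin and endjoin entries could be configured as sketched here (currency_lexer is a hypothetical preference name):

```sql
BEGIN
  CTX_DDL.CREATE_PREFERENCE('currency_lexer', 'BASIC_LEXER');
  -- '$' at the start of a token is kept, so $35 is a single token.
  CTX_DDL.SET_ATTRIBUTE('currency_lexer', 'startjoins', '$');
  -- '+' at the end of a token is kept, so C++ is a single token.
  CTX_DDL.SET_ATTRIBUTE('currency_lexer', 'endjoins', '+');
END;
/
```

A skipjoin is set the same way through the skipjoins attribute, with the difference that the character is dropped from the stored token.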
The expansion of a query term to include all terms having the same root word. For example, stemming the verb talk yields talking, talks, and talked, as well as talk (but not talkie). Stemming is distinct from wildcard expansion, in which results are related only through spelling, not through morphology. See also: wildcard expansion.
A document section that is not bounded by tags. Instead, sections are formed by plaintext document structures such as sentences and paragraphs. Special sections are added to a section group with the CTX_DDL.ADD_SPECIAL_SECTION procedure. See also: section, section group.
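A brief sketch of sentence-level searching, assuming a hypothetical table plain_docs with columns id and text:

```sql
BEGIN
  CTX_DDL.CREATE_SECTION_GROUP('plain_group', 'NULL_SECTION_GROUP');
  CTX_DDL.ADD_SPECIAL_SECTION('plain_group', 'SENTENCE');
END;
/

CREATE INDEX plain_docs_idx ON plain_docs (text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('SECTION GROUP plain_group');

-- Both terms must occur in the same sentence.
SELECT id
  FROM plain_docs
 WHERE CONTAINS(text, '(jumbo AND shrimp) WITHIN SENTENCE') > 0;
```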
A section that, when added to an AUTO_SECTION_GROUP, causes the information for document sections of that type to be ignored during indexing; the section content may still be searched, however. Stop sections are added to section groups with the CTX_DDL.ADD_STOP_SECTION procedure. See also: AUTO_SECTION_GROUP, section, section group.
A class of tokens, such as NUMBERS, that are to be skipped over during indexing. Stopclasses are specified by adding them to stoplists with CTX_DDL.ADD_STOPCLASS. See also: stoplist.
A list of words, known as stopwords, themes (stopthemes), and data classes (stopclasses) that are not to be indexed. By default, the system indexes text using the system-supplied stoplist that corresponds to a given database language.
Oracle Text provides default stoplists for most common languages including English, French, German, Spanish, Chinese, Dutch, and Danish. These default stoplists contain only stopwords. Stoplists are created with CTX_DDL.CREATE_STOPLIST or with the ALTER INDEX statement. See also: stopclass, stoptheme, stopword.
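A custom stoplist might be built and attached to an index as sketched here. The stoplist name my_stoplist and the table docs with its text column are hypothetical.

```sql
BEGIN
  CTX_DDL.CREATE_STOPLIST('my_stoplist', 'BASIC_STOPLIST');
  -- Skip an individual word.
  CTX_DDL.ADD_STOPWORD('my_stoplist', 'because');
  -- Skip a whole class of tokens: all numbers.
  CTX_DDL.ADD_STOPCLASS('my_stoplist', 'NUMBERS');
END;
/

CREATE INDEX docs_stop_idx ON docs (text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('STOPLIST my_stoplist');
```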
A theme to be skipped over during indexing. Stopthemes are specified by adding them to stoplists with CTX_DDL.ADD_STOPTHEME. See also: stoplist.
A word to be skipped over during indexing. Stopwords are specified by adding them to stoplists with CTX_DDL.ADD_STOPWORD. They can also be dynamically added to an index using the ALTER INDEX statement. See also: stoplist.
See: lexer.
See: classification.
A topic associated with a given document. A document may have many themes. A theme does not have to appear in a document; for example, a document containing the words San Francisco may have California as one of its themes.
Theme components are added to indexes with the INDEX_THEMES attribute of the BASIC_LEXER preference; they may be extracted from a document with CTX_DOC.THEMES and queried with the ABOUT operator.
Also known as clustering. See: classification.
The expansion of a query term to return words that fit a given pattern. For example, expansion of the query term %rot% would return both trot and rotten. Wildcard expansion is distinct from stemming. See also: stemming.
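The contrast between wildcard expansion and stemming can be seen in the query syntax. Both statements assume a hypothetical docs table with a CONTEXT index on its text column.

```sql
-- Wildcard expansion: any term containing "rot", such as trot or rotten.
SELECT id FROM docs
 WHERE CONTAINS(text, '%rot%') > 0;

-- Stemming: terms sharing the root of talk, such as talks or talked.
SELECT id FROM docs
 WHERE CONTAINS(text, '$talk') > 0;
```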
Characters that are treated as blank spaces between tokens. The predefined default values for whitespace are 'space' and 'tab'. The BASIC_LEXER uses whitespace characters (in conjunction with punctuation and newline characters) to identify character strings that serve as sentence delimiters for sentence and paragraph searching.
A CONTAINS query operator used to search for query terms within a given XML document section. It is similar to the INPATH operator, but less generic. See also: INPATH operator.
An Oracle Text preference that enables features such as fuzzy matching, stemming, and substring and prefix indexing for better wildcard searching. The wordlist preference improves performance for wildcard queries with CONTAINS and CATSEARCH. Create a wordlist preference with the CTX_DDL.CREATE_PREFERENCE procedure, specifying a wordlist type such as BASIC_WORDLIST, and associate it with an index at creation time or with the ALTER INDEX statement. See also: preference.
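As a minimal sketch, a wordlist preference that turns on prefix and substring indexing could look like this (my_wordlist and the docs table are hypothetical, and the length limits are arbitrary illustrative values):

```sql
BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_wordlist', 'BASIC_WORDLIST');
  -- Index token prefixes of 3 to 4 characters to speed up queries like 'ora%'.
  CTX_DDL.SET_ATTRIBUTE('my_wordlist', 'PREFIX_INDEX', 'YES');
  CTX_DDL.SET_ATTRIBUTE('my_wordlist', 'PREFIX_MIN_LENGTH', '3');
  CTX_DDL.SET_ATTRIBUTE('my_wordlist', 'PREFIX_MAX_LENGTH', '4');
  -- Index substrings to speed up left-truncated queries like '%rot%'.
  CTX_DDL.SET_ATTRIBUTE('my_wordlist', 'SUBSTRING_INDEX', 'YES');
END;
/

CREATE INDEX docs_wl_idx ON docs (text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('WORDLIST my_wordlist');
```

Prefix and substring indexing trade a larger index and slower indexing for faster wildcard queries.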
A section that is defined by XML tags, enabling XML section searching. Indexing with XML sections permits automatic sectioning as well as declaring document-type-sensitive sections. XML section searching includes attribute searching as well as path section searching with the INPATH, HASPATH, and WITHIN operators. See also: section.
A section group used for identifying XML documents for indexing. See also: section, section group.
The basic type of document section; a body of text delimited by start and end tags in a document. Zone sections are well suited for defining sections in HTML and XML documents. Zone sections are added to section groups with the CTX_DDL.ADD_ZONE_SECTION procedure or with the ALTER INDEX statement. See also: field section, section, section group.
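A zone section over HTML headings can be sketched as follows. The group name html_group, the section name heading, and the table web_pages with columns id and body are hypothetical.

```sql
BEGIN
  CTX_DDL.CREATE_SECTION_GROUP('html_group', 'HTML_SECTION_GROUP');
  -- Text between <H1> and </H1> becomes the zone section named heading.
  CTX_DDL.ADD_ZONE_SECTION('html_group', 'heading', 'H1');
END;
/

CREATE INDEX web_pages_idx ON web_pages (body)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('SECTION GROUP html_group');

SELECT id
  FROM web_pages
 WHERE CONTAINS(body, 'oracle WITHIN heading') > 0;
```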