Jenna Johnson
IRLS 601
Spring, 2002

Thesaurus


User Guide

The Information Extraction Thesaurus may be used to search for relevant terms within the field of Information Extraction. Users are assumed to have prior knowledge of Computer Science or Linguistics. Many first-level terms are very general within these fields and do not contain scope notes. Terms more specifically tailored to Information Extraction are second-, third- or fourth-level terms and contain scope notes for proper usage. This user guide presents the layout of the Information Extraction Thesaurus as it relates to users, including an explanation of its hierarchical structures and relationships.

Explanation of Concepts
The concepts chosen for the organization of this thesaurus are: Constraints, Elements, Measurements, Processes, Products and Tasks. These are all inserted into the hierarchy as node labels (facet indicators) but do not have term records.

Example:
<Measurements>
Algorithm
Benchmark
Evaluation

The following is an explanation of the scope of the facets:

Constraints: This category refers to that which the extraction system is tied to when performing the extraction task. Constraints include: Lexical, Pragmatic, Semantic and Syntactic.

Elements: Information Extraction uses many elements to complete the process. It is these elements that provide data and information to the final product. In this thesaurus, elements are that which can be given to the system to be processed or extracted. Elements include: Domain, Fact, Input, Named Entities, Lexicons, Scenario, Systems, Token, Query.

Measurements: The success of searching or extraction needs to be measured. Information Extraction uses different types of measurements for evaluating extraction systems, including the Message Understanding Conferences. Other, more general terms for measurement include: Algorithms, Benchmark, Effectiveness, Evaluation, Heuristics, Metrics, Patterns, Phases, Precision, Recall, Relevance and Similarity.

Processes: This was the largest hierarchy within the thesaurus because of the active role that different processes play in extraction. Extraction (the term) is included in this section but is not a parent term. It is as significant as terms from other fields, such as Retrieval (from Information Retrieval) and Analysis (from Sentence Analysis). Processes here refer to the computerized act of working with different elements, such as: Analysis, Classification, Definition, Extraction, Filtering, Generation, Learning, Merging, Mining, Matching, Parsing, Processing, Recognition, Retrieval, Summarization, Tokenization, Training and Understanding. Most terms in this section were compounded. Those that were not can be post-coordinated with Boolean searching.
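Post-coordination with Boolean searching can be illustrated with a minimal Python sketch. The postings table, the document IDs and the function names `search_and` and `search_or` are hypothetical and exist only for illustration; the thesaurus itself does not prescribe a retrieval system.

```python
# Hypothetical postings: descriptor -> set of document IDs indexed with it.
# The document IDs are invented purely for illustration.
POSTINGS = {
    "Filtering": {1, 2, 5},
    "Text": {2, 3, 5, 8},
    "Summarization": {3, 8},
}

def search_and(*terms):
    """Boolean AND: documents indexed with every given descriptor."""
    sets = [POSTINGS.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def search_or(*terms):
    """Boolean OR: documents indexed with at least one given descriptor."""
    sets = [POSTINGS.get(term, set()) for term in terms]
    return set.union(*sets) if sets else set()

# Combine two single-word descriptors at search time.
print(sorted(search_and("Text", "Filtering")))
```

In this sketch the searcher combines the separate descriptors at query time, which is the role post-coordination plays for terms that were not compounded in the thesaurus.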

Products: Products refers to that which is produced as a result of an extraction task or process. In this case, there are five products that are significant to the collection and interpretation of extracted material: Database, Report Generators, Spreadsheets, Summarizers and Templates. These products can be coordinated with processes because they are a direct result of each other.

Tasks: While tasks and processes are closely related, tasks in this thesaurus refer to the assignment of a computer program to perform a specific job. Two tasks are noted at this stage as having high relevance: the Extraction Task and the Search. Five very specific extraction tasks are noted here instead of under processes because of their specificity to the task and performance of Information Extraction.

 

Relationships
Three kinds of major relationships were established in this thesaurus. They include the equivalence relationship, the hierarchical relationship and the associative relationship.

Equivalence Relationships (USE, UF)
Equivalence relationships are noted with USE or UF (used for) indicators. These relationships identify synonyms, which are cross-referenced with each other. UF terms are italicized as non-preferred terms and should not be used as final search terms. The searcher should choose the preferred term, which is the term that follows the USE indicator.

Example:
Text Skimming
USE
Selective Content Extraction

Selective Content Extraction
UF
Text Skimming

Lexical and spelling variants are also accounted for as equivalence relationships. Several times there was a choice of British or American English spelling. To stay consistent, American English was chosen.

Example:
Tokenization
UF
Tokenisation

Tokenisation
USE
Tokenization
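
The USE/UF cross-references in this section can be modeled as a simple lookup. The following is a minimal Python sketch, not part of the thesaurus itself: the names `USE`, `used_for` and `preferred_term` are hypothetical, only the two pairs from the examples are included, and each non-preferred term is assumed to point to a single preferred term.

```python
# Hypothetical USE mapping: non-preferred term -> preferred term.
# Only the pairs shown in this guide's examples are included.
USE = {
    "Text Skimming": "Selective Content Extraction",
    "Tokenisation": "Tokenization",
}

def used_for(use_map):
    """Derive the reciprocal UF entries from the USE entries."""
    uf = {}
    for non_preferred, preferred in use_map.items():
        uf.setdefault(preferred, []).append(non_preferred)
    return uf

def preferred_term(term, use_map=USE):
    """Return the preferred term; a term with no USE entry is returned unchanged."""
    return use_map.get(term, term)

UF = used_for(USE)
print(preferred_term("Tokenisation"))
print(UF["Selective Content Extraction"])
```

Deriving UF from USE, rather than maintaining both by hand, is one way to keep the two sides of each cross-reference in agreement.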

Hierarchical Relationships (BT, NT)
Hierarchical relationships refer to the subordinate (Narrower Term, NT) and superordinate (Broader Term, BT) relationships within the established hierarchy. These relationships can be easily identified by examining the hierarchical structure and finding more general or more specific versions of a term. Users are advised to choose the most specific term possible when referring to specific tasks within Information Extraction. Broader terms are always cross-referenced with narrower terms, and narrower terms are always cross-referenced with broader terms.

Example:
Input
NT
Text

Text
BT
Input
NT
Formatted Text

Formatted Text
BT
Text
NT
Complex Formatted Text


Terms were structured beginning with the most generic or broad term at the first level of the hierarchy, becoming more specific at each lower level. Instance and generic indicators (NISO 5.3.2) were not used in this thesaurus to distinguish relationships.
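
The reciprocal BT/NT cross-references described in this section can be sketched in Python as follows. This is an illustration under stated assumptions: the names `NT`, `broader_terms` and `path_from_top` are hypothetical, the records are copied from the Input example, and each term is assumed to have at most one broader term.

```python
# Hypothetical NT records, copied from the Input example in this guide.
NT = {
    "Input": ["Text"],
    "Text": ["Formatted Text"],
    "Formatted Text": ["Complex Formatted Text"],
}

def broader_terms(nt_map):
    """Derive BT entries so that every NT link has its reciprocal BT link."""
    bt = {}
    for broader, narrower_terms in nt_map.items():
        for narrower in narrower_terms:
            bt[narrower] = broader
    return bt

def path_from_top(term, bt_map):
    """List the chain of terms from the first-level term down to `term`."""
    path = [term]
    while path[-1] in bt_map:
        path.append(bt_map[path[-1]])
    return list(reversed(path))

BT = broader_terms(NT)
print(" > ".join(path_from_top("Complex Formatted Text", BT)))
```

Reading the printed chain from right to left gives progressively broader terms, which mirrors how a searcher would move up the hierarchy when the most specific term returns too little.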

Associative Relationships (RT)
This field was meant to cover associative relationships (related terms, RT) between terms outside of their own concept. Often the terms are semantically or conceptually related to each other and are associated by the nature of the terms, not by hierarchical order. The choice was made not to associate broader terms with each other when a narrower term was available. The exception was when the descriptor was itself a broader term and established a fitting relationship with a corresponding broader term. As a result, not all sibling terms were related to each other.

Example:
Similarity
RT
Matching (rather than Pattern Matching)

As stated in NISO 5.4.2, relationships can be formed between descriptors that belong to different hierarchies but share the same etymological root. This was often the case when working with compound and singular terms.

Example:
Text
RT
Text Analysis
Text Generation
Text Processing

Other Record Indicators
Scope Note (SN) is used to identify a definition or the scope of a term. Often, the scope notes in this thesaurus provide an explanation of the term and how it relates to Information Extraction. Other scope notes identify the context of the term in relation to other terms.

Example:
Sentence Analysis
SN
The processing of natural language sentences from other representations.

Non-preferred terms are listed within the hierarchy and term records in italics.

 

Example:
Information Retrieval
USE
Document Retrieval
Text Retrieval

References for scope note definitions are listed in parentheses at the end of each scope note.

Example:
Named Entity Recognition
SN
Recognition of entity names, place names, temporal expressions, and certain types of numerical expressions (Cunningham).

Top Term (TT) is used to indicate the node label under which a term appears in the hierarchical structure. The TT indicator alerts the searcher that the term has no broader terms in this thesaurus.

Example:
Summarizers
TT
<Products>
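
One way to picture the TT indicator is as the node label reached by climbing the hierarchy. The Python sketch below is illustrative only: the names `BT`, `FACET` and `top_term` are hypothetical, the BT links come from the Input example, and the facet assignments follow the facet descriptions in this guide. Whether the thesaurus records a TT for terms below the first level is an assumption made here.

```python
# Hypothetical records. BT links come from the Input example in this guide;
# facet assignments follow the facet descriptions (Input under Elements,
# Summarizers under Products).
BT = {
    "Text": "Input",
    "Formatted Text": "Text",
    "Complex Formatted Text": "Formatted Text",
}
FACET = {"Input": "Elements", "Summarizers": "Products"}

def top_term(term):
    """Climb BT links to the first-level term, then return its node label."""
    while term in BT:
        term = BT[term]
    return "<" + FACET[term] + ">"

print(top_term("Summarizers"))
print(top_term("Complex Formatted Text"))
```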