User Guide
The Information Extraction Thesaurus can be used to search for relevant
terms within the field of Information Extraction. Users are assumed to
have prior knowledge of Computer Science or Linguistics. Many first level
terms are very general within these fields and do not contain scope
notes. Terms that are more specifically tailored to Information Extraction
are second, third or fourth level terms and contain scope notes for
proper usage. This user guide presents the layout of the Information
Extraction Thesaurus as it relates to users, including an explanation
of its hierarchical structures and relationships.
Explanation of Concepts
The concepts chosen for the organization of this thesaurus are: Constraints,
Elements, Measurements, Processes, Products and Tasks. These are all
inserted into the hierarchy as node labels (facet indicators) but do
not have term records.
Example:
<Measurements>
Algorithm
Benchmark
Evaluation
The following is an explanation of the scope of the
facets:
Constraints: This category covers whatever binds the extraction
system when it performs the extraction task. Constraints include:
Lexical, Pragmatic, Semantic and Syntactic.
Elements: Information Extraction uses many elements to complete
the process. These elements provide the data and information
for the final product. In this thesaurus, an element is anything that can
be given to the system to be processed or extracted. Elements include:
Domain, Fact, Input, Named Entities, Lexicons, Scenario, Systems, Token
and Query.
Measurements: The success of searching or extracting needs to be
measured. Information Extraction uses different types of measurements
for evaluating extraction systems, including the Message Understanding Conferences.
Other more general terms for measurement include: Algorithms, Benchmark,
Effectiveness, Evaluation, Heuristics, Metrics, Patterns, Phases, Precision,
Recall, Relevance and Similarity.
Processes: This is the largest hierarchy within
the thesaurus because of the active role that different processes
play in extraction. Extraction (the term) is included in
this section but is not a parent term. It carries the same weight as
terms from other fields, like Retrieval (from Information Retrieval)
and Analysis (from Sentence Analysis). Processes here refer to the computerized
acts of working with different elements, such as: Analysis, Classification,
Definition, Extraction, Filtering, Generation, Learning, Merging,
Mining, Matching, Parsing, Processing, Recognition, Retrieval, Summarization,
Tokenization, Training and Understanding. Most terms in this section
were compounded. Those that were not can be post-coordinated with Boolean
searching.
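Post-coordination can be pictured as a Boolean AND over the sets of documents indexed under each single term. The following is a minimal sketch only: the index contents and document numbers are invented for illustration, and the thesaurus does not prescribe any particular search implementation.

```python
# Hypothetical inverted index: descriptor -> set of document ids (invented data).
index = {
    "Text": {1, 2, 3, 5},
    "Filtering": {2, 3, 4},
    "Retrieval": {3, 5, 6},
}


def post_coordinate(index, *terms):
    """Return the ids of documents indexed under ALL given terms (Boolean AND)."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)


# Combine two single descriptors at search time.
print(sorted(post_coordinate(index, "Text", "Filtering")))
```

A term that is absent from the index contributes an empty set, so the combined result is empty, which matches the usual behavior of a Boolean AND.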
Products: Products are what is produced as a result
of an extraction task or process. In this case, there are five products
that are significant to the collection and interpretation of what
is extracted: Database, Report Generators, Spreadsheets, Summarizers
and Templates. These products can be coordinated with processes,
as products are the direct result of processes.
Tasks: While tasks and processes are closely related,
tasks in this thesaurus refer to the assignment of a computer program
to perform a specific task. Two tasks are noted at this stage as having
high relevance: the Extraction Task and the Search. Five very specific
extraction tasks are noted here instead of under processes because they
are specific to the task and performance of Information Extraction.
Relationships
Three major kinds of relationships were established in this thesaurus:
the equivalence relationship, the hierarchical relationship
and the associative relationship.
Equivalence Relationships (USE, UF)
Equivalence relationships are noted with USE or UF (used for) indicators.
These relationships identify synonyms, which are cross-referenced with
each other. Terms listed under UF are non-preferred terms, shown in italics,
and should not be used as the final choice. The searcher should choose the
preferred term, which is the term listed under USE.
Example:
Text Skimming
USE
Selective Content Extraction
Selective Content Extraction
UF
Text Skimming
Lexical and spelling variants are also accounted for as equivalence
relationships. Several times there was a choice of British or American
English spelling. To stay consistent, American English was chosen.
Example:
Tokenization
UF
Tokenisation
Tokenisation
USE
Tokenization
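The reciprocal USE/UF references can be thought of as a single lookup table read in two directions. The sketch below is an illustration under that assumption; the variable and function names are hypothetical, and the data comes only from the two examples above.

```python
# Preferred term -> non-preferred (UF) terms, from the examples in this guide.
USED_FOR = {
    "Selective Content Extraction": ["Text Skimming"],
    "Tokenization": ["Tokenisation"],
}

# Derive the reciprocal USE references so the two directions cannot drift apart.
USE = {
    non_preferred: preferred
    for preferred, variants in USED_FOR.items()
    for non_preferred in variants
}


def preferred_term(term):
    """Return the preferred term for a search term (unchanged if already preferred)."""
    return USE.get(term, term)


print(preferred_term("Text Skimming"))   # Selective Content Extraction
print(preferred_term("Tokenization"))    # Tokenization
```

Deriving USE from UF, rather than maintaining both by hand, is one way to guarantee that every non-preferred term points back to exactly one preferred term.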
Hierarchical Relationships (BT, NT)
Hierarchical relationships refer to the subordinate (Narrower Term,
NT) and superordinate (Broader Term, BT) relationships within the established
hierarchy. These relationships can be easily identified by examining
the hierarchical structure and finding more general or specific versions
of a term. In this thesaurus, it is suggested that users choose the most
specific term possible when referring to specific tasks within Information
Extraction. Broader terms are always cross-referenced with narrower
terms and narrower terms are always cross-referenced with broader terms.
Example:
Input
NT
Text
Text
BT
Input
NT
Formatted Text
Formatted Text
BT
Text
NT
Complex Formatted Text
Terms were structured beginning with the most generic or broad term
at the first level of the hierarchy, becoming more specific at each lower level.
The Instance and Generic distinctions (NISO 5.3.2) were not used in this
thesaurus to distinguish relationships.
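The reciprocal BT/NT cross-references can be sketched as one mapping and its inverse. The example below uses only the Input/Text chain shown above; the names are hypothetical, and it assumes each term has a single broader term, which holds for this example but is not stated as a rule of the thesaurus.

```python
# Narrower-term (NT) links from the example: broader term -> narrower terms.
NT = {
    "Input": ["Text"],
    "Text": ["Formatted Text"],
    "Formatted Text": ["Complex Formatted Text"],
}

# Derive BT as the inverse of NT so every NT entry has a matching BT entry.
# Assumption: one broader term per term.
BT = {narrow: broad for broad, narrows in NT.items() for narrow in narrows}


def broader_chain(term):
    """List the broader terms of `term`, nearest first, up to the first level."""
    chain = []
    while term in BT:
        term = BT[term]
        chain.append(term)
    return chain


print(broader_chain("Complex Formatted Text"))
# ['Formatted Text', 'Text', 'Input']
```

Walking the chain in the other direction, from Input downward through NT, leads a searcher to the most specific term available.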
Associative Relationships (RT)
This field was meant to cover associative relationships (related terms,
RT) between terms outside of their concept. Often the terms are semantically
or conceptually related to each other and are associated by the nature
of the terms and not by hierarchical order. The choice was made to exclude
broader terms from being associated with each other when there is a choice
of a narrower term. The exception to this was when the descriptor was
itself a broader term and established a fitting relationship with a
correlating broader term. In this way, not all sibling terms were related
to each other.
Example:
Similarity
RT
Matching (rather than Pattern Matching)
As stated in NISO 5.4.2, relationships can be formed between descriptors
that belong to different hierarchies but share the same etymological
root. This was often the case when working with compound and singular
terms.
Example:
Text
RT
Text Analysis
Text Generation
Text Processing
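Related-term references can be sketched as an undirected graph. The example below assumes that RT references are reciprocal, as is conventional in thesaurus construction; this guide does not state that explicitly, and the names used are hypothetical. The pairs come from the two examples above.

```python
from collections import defaultdict

# RT pairs taken from the examples in this guide.
RT_PAIRS = [
    ("Similarity", "Matching"),
    ("Text", "Text Analysis"),
    ("Text", "Text Generation"),
    ("Text", "Text Processing"),
]

# Store each pair in both directions (assumption: RT is reciprocal).
related = defaultdict(set)
for first, second in RT_PAIRS:
    related[first].add(second)
    related[second].add(first)


def related_terms(term):
    """Return the related terms of `term` in alphabetical order."""
    return sorted(related.get(term, ()))


print(related_terms("Text"))
# ['Text Analysis', 'Text Generation', 'Text Processing']
print(related_terms("Matching"))
# ['Similarity']
```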
Other Record Indicators
A Scope Note (SN) gives a definition or the scope of
a term. Often the scope notes in this thesaurus provide an
explanation of the term and how it relates to Information Extraction.
Other scope notes identify the context of the term in relation
to other terms.
Example:
Sentence Analysis
SN
The processing of natural language sentences from other representations.
Non-preferred terms are listed in italics within the hierarchy
and the term records.
Example:
Information Retrieval
USE
Document Retrieval
Text Retrieval
References for scope note definitions are listed in parentheses
at the end of each scope note.
Example:
Named Entity Recognition
SN
Recognition of entity names, place names, temporal expressions, and
certain types of numerical expressions (Cunningham).
Top Terms (TT) indicate the node label under which a term sits
in the hierarchical structure. The TT indicator alerts the searcher
that the term has no broader terms in this thesaurus.
Example:
Summarizers
TT
<Products>
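The record indicators described in this guide can be gathered into a single term record. The sketch below is one possible representation and not the format used to build the thesaurus: the class, its field names and the order in which indicators are printed are all assumptions. The sample data comes from the examples in this guide.

```python
from dataclasses import dataclass, field


@dataclass
class TermRecord:
    """One term record; field names are hypothetical, indicators follow the guide."""
    term: str
    scope_note: str = ""                           # SN
    use: str = ""                                  # USE (non-preferred terms only)
    used_for: list = field(default_factory=list)   # UF
    top_term: str = ""                             # TT (node label)
    broader: list = field(default_factory=list)    # BT
    narrower: list = field(default_factory=list)   # NT
    related: list = field(default_factory=list)    # RT

    def render(self):
        """Lay the record out as in the examples: term, indicator, values."""
        sections = [
            ("SN", [self.scope_note]),
            ("USE", [self.use]),
            ("UF", self.used_for),
            ("TT", [self.top_term]),
            ("BT", self.broader),
            ("NT", self.narrower),
            ("RT", self.related),
        ]
        lines = [self.term]
        for indicator, values in sections:
            values = [value for value in values if value]  # skip empty fields
            if values:
                lines.append(indicator)
                lines.extend(values)
        return "\n".join(lines)


summarizers = TermRecord(term="Summarizers", top_term="<Products>")
text = TermRecord(term="Text", broader=["Input"], narrower=["Formatted Text"])
print(summarizers.render())
print(text.render())
```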