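Authors: Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al.
Ricardo Baeza-Yates received his Ph.D. in computer science from the University of Waterloo, Canada. He has served as president of the Chilean Computer Science Society and is currently a full professor in the Department of Computer Science at the University of Chile. He is also a member of ACM, AMS, EATCS, IEEE, SCCC, and SIAM. His main research interests are algorithms and data structures, text retrieval, graphical interfaces, and applications of visualization to databases. Berthier Ribeiro-Neto received his Ph.D. in computer science from the University of California, Los Angeles. He is currently an associate professor in the Department of Computer Science at the Federal University of Minas Gerais, Brazil, and is also a member of ACM, ASIS, and IEEE. His main research interests are information retrieval systems, digital libraries, Web interfaces, and video on demand.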
1 Introduction
1.1 Motivation
1.1.1 Information versus Data Retrieval
1.1.2 Information Retrieval at the Center of the Stage
1.1.3 Focus of the Book
1.2 Basic Concepts
1.2.1 The User Task
1.2.2 Logical View of the Documents
1.3 Past, Present, and Future
1.3.1 Early Developments
1.3.2 Information Retrieval in the Library
1.3.3 The Web and Digital Libraries
1.3.4 Practical Issues
1.4 The Retrieval Process
1.5 Organization of the Book
1.5.1 Book Topics
1.5.2 Book Chapters
1.6 How to Use this Book
1.6.1 Teaching Suggestions
1.6.2 The Book's Web Page
1.7 Bibliographic Discussion
2 Modeling
2.1 Introduction
2.2 A Taxonomy of Information Retrieval Models
2.3 Retrieval: Ad hoc and Filtering
2.4 A Formal Characterization of IR Models
2.5 Classic Information Retrieval
2.5.1 Basic Concepts
2.5.2 Boolean Model
2.5.3 Vector Model
2.5.4 Probabilistic Model
2.5.5 Brief Comparison of Classic Models
2.6 Alternative Set Theoretic Models
2.6.1 Fuzzy Set Model
2.6.2 Extended Boolean Model
2.7 Alternative Algebraic Models
2.7.1 Generalized Vector Space Model
2.7.2 Latent Semantic Indexing Model
2.7.3 Neural Network Model
2.8 Alternative Probabilistic Models
2.8.1 Bayesian Networks
2.8.2 Inference Network Model
2.8.3 Belief Network Model
2.8.4 Comparison of Bayesian Network Models
2.8.5 Computational Costs of Bayesian Networks
2.8.6 The Impact of Bayesian Network Models
2.9 Structured Text Retrieval Models
2.9.1 Model Based on Non-Overlapping Lists
2.9.2 Model Based on Proximal Nodes
2.10 Models for Browsing
2.10.1 Flat Browsing
2.10.2 Structure Guided Browsing
2.10.3 The Hypertext Model
2.11 Trends and Research Issues
2.12 Bibliographic Discussion
3 Retrieval Evaluation
3.1 Introduction
3.2 Retrieval Performance Evaluation
3.2.1 Recall and Precision
3.2.2 Alternative Measures
3.3 Reference Collections
3.3.1 The TREC Collection
3.3.2 The CACM and ISI Collections
3.3.3 The Cystic Fibrosis Collection
3.4 Trends and Research Issues
3.5 Bibliographic Discussion
4 Query Languages
4.1 Introduction
4.2 Keyword-Based Querying
4.2.1 Single-Word Queries
4.2.2 Context Queries
4.2.3 Boolean Queries
4.2.4 Natural Language
4.3 Pattern Matching
4.4 Structural Queries
4.4.1 Fixed Structure
4.4.2 Hypertext
4.4.3 Hierarchical Structure
4.5 Query Protocols
4.6 Trends and Research Issues
4.7 Bibliographic Discussion
5 Query Operations
5.1 Introduction
5.2 User Relevance Feedback
5.2.1 Query Expansion and Term Reweighting for the Vector Model
5.2.2 Term Reweighting for the Probabilistic Model
5.2.3 A Variant of Probabilistic Term Reweighting
5.2.4 Evaluation of Relevance Feedback Strategies
5.3 Automatic Local Analysis
5.3.1 Query Expansion Through Local Clustering
5.3.2 Query Expansion Through Local Context Analysis
5.4 Automatic Global Analysis
5.4.1 Query Expansion based on a Similarity Thesaurus
5.4.2 Query Expansion based on a Statistical Thesaurus
5.5 Trends and Research Issues
5.6 Bibliographic Discussion
6 Text and Multimedia Languages and Properties
6.1 Introduction
6.2 Metadata
6.3 Text
6.3.1 Formats
6.3.2 Information Theory
6.3.3 Modeling Natural Language
6.3.4 Similarity Models
6.4 Markup Languages
6.4.1 SGML
6.4.2 HTML
6.4.3 XML
6.5 Multimedia
6.5.1 Formats
6.5.2 Textual Images
6.5.3 Graphics and Virtual Reality
6.5.4 HyTime
6.6 Trends and Research Issues
6.7 Bibliographic Discussion
7 Text Operations
7.1 Introduction
7.2 Document Preprocessing
7.2.1 Lexical Analysis of the Text
7.2.2 Elimination of Stopwords
7.2.3 Stemming
7.2.4 Index Terms Selection
7.2.5 Thesauri
7.3 Document Clustering
7.4 Text Compression
7.4.1 Motivation
7.4.2 Basic Concepts
7.4.3 Statistical Methods
7.4.4 Dictionary Methods
7.4.5 Inverted File Compression
7.5 Comparing Text Compression Techniques
7.6 Trends and Research Issues
7.7 Bibliographic Discussion
8 Indexing and Searching
8.1 Introduction
8.2 Inverted Files
8.2.1 Searching
8.2.2 Construction
8.3 Other Indices for Text
8.3.1 Suffix Trees and Suffix Arrays
8.3.2 Signature Files
8.4 Boolean Queries
8.5 Sequential Searching
8.5.1 Brute Force
8.5.2 Knuth-Morris-Pratt
8.5.3 Boyer-Moore Family
8.5.4 Shift-Or
8.5.5 Suffix Automaton
8.5.6 Practical Comparison
8.5.7 Phrases and Proximity
8.6 Pattern Matching
8.6.1 String Matching Allowing Errors
8.6.2 Regular Expressions and Extended Patterns
8.6.3 Pattern Matching Using Indices
8.7 Structural Queries
8.8 Compression
8.8.1 Sequential Searching
8.8.2 Compressed Indices
8.9 Trends and Research Issues
8.10 Bibliographic Discussion
9 Parallel and Distributed IR
9.1 Introduction
9.1.1 Parallel Computing
9.1.2 Performance Measures
9.2 Parallel IR
9.2.1 Introduction
9.2.2 MIMD Architectures
9.2.3 SIMD Architectures
9.3 Distributed IR
9.3.1 Introduction
9.3.2 Collection Partitioning
9.3.3 Source Selection
9.3.4 Query Processing
9.3.5 Web Issues
9.4 Trends and Research Issues
9.5 Bibliographic Discussion
10 User Interfaces and Visualization
10.1 Introduction
10.2 Human-Computer Interaction
10.2.1 Design Principles
10.2.2 The Role of Visualization
10.2.3 Evaluating Interactive Systems
10.3 The Information Access Process
10.3.1 Models of Interaction
10.3.2 Non-Search Parts of the Information Access Process
10.3.3 Earlier Interface Studies
10.4 Starting Points
10.4.1 Lists of Collections
10.4.2 Overviews
10.4.3 Examples, Dialogs, and Wizards
10.4.4 Automated Source Selection
10.5 Query Specification
10.5.1 Boolean Queries
10.5.2 From Command Lines to Forms and Menus
10.5.3 Faceted Queries
10.5.4 Graphical Approaches to Query Specification
10.5.5 Phrases and Proximity
10.5.6 Natural Language and Free Text Queries
10.6 Context
10.6.1 Document Surrogates
10.6.2 Query Term Hits Within Document Content
10.6.3 Query Term Hits Between Documents
10.6.4 SuperBook: Context via Table of Contents
10.6.5 Categories for Results Set Context
10.6.6 Using Hyperlinks to Organize Retrieval Results
10.6.7 Tables
10.7 Using Relevance Judgements
10.7.1 Interfaces for Standard Relevance Feedback
10.7.2 Studies of User Interaction with Relevance Feedback Systems
10.7.3 Fetching Relevant Information in the Background
10.7.4 Group Relevance Judgements
10.7.5 Pseudo-Relevance Feedback
10.8 Interface Support for the Search Process
10.8.1 Interfaces for String Matching
10.8.2 Window Management
10.8.3 Example Systems
10.8.4 Examples of Poor Use of Overlapping Windows
10.8.5 Retaining Search History
10.8.6 Integrating Scanning, Selection, and Querying
10.9 Trends and Research Issues
10.10 Bibliographic Discussion
11 Multimedia IR: Models and Languages
11.1 Introduction
11.2 Data Modeling
11.2.1 Multimedia Data Support in Commercial DBMSs
11.2.2 The MULTOS Data Model
11.3 Query Languages
11.3.1 Request Specification
11.3.2 Conditions on Multimedia Data
11.3.3 Uncertainty, Proximity, and Weights in Query Expressions
11.3.4 Some Proposals
11.4 Trends and Research Issues
11.5 Bibliographic Discussion
12 Multimedia IR: Indexing and Searching
12.1 Introduction
12.2 Background -- Spatial Access Methods
12.3 A Generic Multimedia Indexing Approach
12.4 One-dimensional Time Series
12.4.1 Distance Function
12.4.2 Feature Extraction and Lower-bounding
12.4.3 Experiments
12.5 Two-dimensional Color Images
12.5.1 Image Features and Distance Functions
12.5.2 Lower-bounding
12.5.3 Experiments
12.6 Automatic Feature Extraction
12.7 Trends and Research Issues
12.8 Bibliographic Discussion
13 Searching the Web
13.1 Introduction
13.2 Challenges
13.3 Characterizing the Web
13.3.1 Measuring the Web
13.3.2 Modeling the Web
13.4 Search Engines
13.4.1 Centralized Architecture
13.4.2 Distributed Architecture
13.4.3 User Interfaces
13.4.4 Ranking
13.4.5 Crawling the Web
13.4.6 Indices
13.5 Browsing
13.5.1 Web Directories
13.5.2 Combining Searching with Browsing
13.5.3 Helpful Tools
13.6 Metasearchers
13.7 Finding the Needle in the Haystack
13.7.1 User Problems
13.7.2 Some Examples
13.7.3 Teaching the User
13.8 Searching using Hyperlinks
13.8.1 Web Query Languages
13.8.2 Dynamic Search and Software Agents
13.9 Trends and Research Issues
13.10 Bibliographic Discussion
14 Libraries and Bibliographical Systems
14.1 Introduction
14.2 Online IR Systems and Document Databases
14.2.1 Databases
14.2.2 Online Retrieval Systems
14.2.3 IR in Online Retrieval Systems
14.2.4 'Natural Language' Searching
14.3 Online Public Access Catalogs (OPACs)
14.3.1 OPACs and Their Content
14.3.2 OPACs and End Users
14.3.3 OPACs: Vendors and Products
14.3.4 Alternatives to Vendor OPACs
14.4 Libraries and Digital Library Projects
14.5 Trends and Research Issues
14.6 Bibliographic Discussion
15 Digital Libraries
15.1 Introduction
15.2 Definitions
15.3 Architectural Issues
15.4 Document Models, Representations, and Access
15.4.1 Multilingual Documents
15.4.2 Multimedia Documents
15.4.3 Structured Documents
15.4.4 Distributed Collections
15.4.5 Federated Search
15.4.6 Access
15.5 Prototypes, Projects, and Interfaces
15.5.1 International Range of Efforts
15.5.2 Usability
15.6 Standards
15.6.1 Protocols and Federation
15.6.2 Metadata
15.7 Trends and Research Issues
15.8 Bibliographical Discussion
Appendix: Porter's Algorithm