Book Details
High Performance Spark (Reprint Edition)
Authors: Holden Karau, Rachel Warren
Publisher: Southeast University Press
Publication date: 2018-02-01
ISBN: 9787564175184
List price: ¥88.00
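Synopsis
This book describes techniques for reducing data infrastructure costs and development time. It is aimed at software engineers, data engineers, developers, and system administrators. You will not only gain a comprehensive understanding of Spark, but also learn how to make it run smoothly.
In this book you will discover:
* How Spark SQL's new interfaces improve performance over SQL's RDD data structure
* The choice between data joins in Core Spark and Spark SQL
* Techniques for getting the most out of standard RDD transformations
* How to work around performance issues in Spark's key/value pair paradigm
* Writing high-performance Spark code without Scala or the JVM
* How to test for functionality and performance when applying the suggested improvements
* Using the Spark MLlib and Spark ML machine learning libraries
* Spark's streaming components and external community packages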
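About the Authors
Holden Karau is a transgender Canadian and a software development engineer at the IBM Spark Technology Center. She is a Spark code contributor who contributes frequently, particularly to PySpark and the machine learning components. Holden has spoken on Spark topics at many international events.
Rachel Warren is a software engineer and data scientist at Alpine Data. In her day-to-day work she uses Spark to tackle real-world data and machine learning problems. She has also worked as an analyst and tutor in both industry and academia.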
Table of Contents
Preface
1.Introduction to High Performance Spark
What Is Spark and Why Performance Matters
What You Can Expect to Get from This Book
Spark Versions
Why Scala
To Be a Spark Expert You Have to Learn a Little Scala Anyway
The Spark Scala API Is Easier to Use Than the Java API
Scala Is More Performant Than Python
Why Not Scala
Learning Scala
Conclusion
2.How Spark Works
How Spark Fits into the Big Data Ecosystem
Spark Components
Spark Model of Parallel Computing: RDDs
Lazy Evaluation
In-Memory Persistence and Memory Management
Immutability and the RDD Interface
Types of RDDs
Functions on RDDs: Transformations Versus Actions
Wide Versus Narrow Dependencies
Spark Job Scheduling
Resource Allocation Across Applications
The Spark Application
The Anatomy of a Spark Job
The DAG
Jobs
Stages
Tasks
Conclusion
3.DataFrames, Datasets, and Spark SQL
Getting Started with the SparkSession (or HiveContext or SQLContext)
Spark SQL Dependencies
Managing Spark Dependencies
Avoiding Hive JARs
Basics of Schemas
DataFrame API
Transformations
Multi-DataFrame Transformations
Plain Old SQL Queries and Interacting with Hive Data
Data Representation in DataFrames and Datasets
Tungsten
Data Loading and Saving Functions
DataFrameWriter and DataFrameReader
Formats
Save Modes
Partitions (Discovery and Writing)
Datasets
Interoperability with RDDs, DataFrames, and Local Collections
Compile-Time Strong Typing
Easier Functional (RDD 'like') Transformations
Relational Transformations
Multi-Dataset Relational Transformations
Grouped Operations on Datasets
Extending with User-Defined Functions and Aggregate Functions (UDFs,UDAFs)
Query Optimizer
Logical and Physical Plans
Code Generation
Large Query Plans and Iterative Algorithms
Debugging Spark SQL Queries
JDBC/ODBC Server
Conclusion
4.Joins (SQL and Core)
Core Spark Joins
Choosing a Join Type
Choosing an Execution Plan
Spark SQL Joins
DataFrame Joins
Dataset Joins
Conclusion
5.Effective Transformations
Narrow Versus Wide Transformations
Implications for Performance
Implications for Fault Tolerance
The Special Case of coalesce
What Type of RDD Does Your Transformation Return
Minimizing Object Creation
Reusing Existing Objects
Using Smaller Data Structures
Iterator-to-Iterator Transformations with mapPartitions
What Is an Iterator-to-Iterator Transformation
Space and Time Advantages
An Example
Set Operations
Reducing Setup Overhead
Shared Variables
Broadcast Variables
Accumulators
Reusing RDDs
Cases for Reuse
Deciding if Recompute Is Inexpensive Enough
Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files
Alluxio (nee Tachyon)
LRU Caching
Noisy Cluster Considerations
Interaction with Accumulators
Conclusion
6.Working with Key/Value Data
The Goldilocks Example
Goldilocks Version 0: Iterative Solution
How to Use PairRDDFunctions and OrderedRDDFunctions
Actions on Key/Value Pairs
What's So Dangerous About the groupByKey Function
Goldilocks Version 1: groupByKey Solution
Choosing an Aggregation Operation
Dictionary of Aggregation Operations with Performance Considerations
Multiple RDD Operations
Co-Grouping
Partitioners and Key/Value Data
Using the Spark Partitioner Object
Hash Partitioning
Range Partitioning
Custom Partitioning
Preserving Partitioning Information Across Transformations
Leveraging Co-Located and Co-Partitioned RDDs
Dictionary of Mapping and Partitioning Functions PairRDDFunctions
Dictionary of OrderedRDDOperations
Sorting by Two Keys with SortByKey
Secondary Sort and repartitionAndSortWithinPartitions
Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function
How Not to Sort by Two Orderings
Goldilocks Version 2: Secondary Sort
A Different Approach to Goldilocks
Goldilocks Version 3: Sort on Cell Values
Straggler Detection and Unbalanced Data
Back to Goldilocks (Again)
Goldilocks Version 4: Reduce to Distinct on Each Partition
Conclusion
7.Going Beyond Scala
Beyond Scala within the JVM
Beyond Scala, and Beyond the JVM
How PySpark Works
How SparkR Works
Spark.jl (Julia Spark)
How Eclair JS Works
Spark on the Common Language Runtime (CLR): C# and Friends
Calling Other Languages from Spark
Using Pipe and Friends
JNI
Java Native Access (JNA)
Underneath Everything Is FORTRAN
Getting to the GPU
The Future
Conclusion
8.Testing and Validation
Unit Testing
General Spark Unit Testing
Mocking RDDs
Getting Test Data
Generating Large Datasets
Sampling
Property Checking with ScalaCheck
Computing RDD Difference
Integration Testing
Choosing Your Integration Testing Environment
Verifying Performance
Spark Counters for Verifying Performance
Projects for Verifying Performance
Job Validation
Conclusion
9.Spark MLlib and ML
Choosing Between Spark MLlib and Spark ML
Working with MLlib
Getting Started with MLlib (Organization and Imports)
MLlib Feature Encoding and Data Preparation
Feature Scaling and Selection
MLlib Model Training
Predicting
Serving and Persistence
Model Evaluation
Working with Spark ML
Spark ML Organization and Imports
Pipeline Stages
Explain Params
Data Encoding
Data Cleaning
Spark ML Models
Putting It All Together in a Pipeline
Training a Pipeline
Accessing Individual Stages
Data Persistence and Spark ML
Extending Spark ML Pipelines with Your Own Algorithms
Model and Pipeline Persistence and Serving with Spark ML
General Serving Considerations
Conclusion
10.Spark Components and Packages
Stream Processing with Spark
Sources and Sinks
Batch Intervals
Data Checkpoint Intervals
Considerations for DStreams
Considerations for Structured Streaming
High Availability Mode (or Handling Driver Failure or Checkpointing)
GraphX
Using Community Packages and Libraries
Creating a Spark Package
Conclusion
A.Tuning, Debugging, and Other Things Developers Like to Pretend Don't Exist
Index