Data Partitioning and RDD Partitions

https://www.jianshu.com/p/3aa52ee3a802

https://blog.csdn.net/hjw199089/article/details/77938688

https://blog.csdn.net/mys_35088/article/details/80864092

https://blog.csdn.net/dmy1115143060/article/details/82620715

https://blog.csdn.net/xxd1992/article/details/85254666

https://blog.csdn.net/m0_46657040/article/details/108737350

https://blog.csdn.net/heiren_a/article/details/111954523

https://blog.csdn.net/u011564172/article/details/53611109

https://blog.csdn.net/qq_22473611/article/details/107822168

1 Application, Job, Stage, Task, Record

0 Application

A complete Spark program submitted to the cluster (one SparkContext instance).

1 Job

Each action triggers one job.

2 Stage

A job builds a DAG from the RDD dependency relationships, and the DAG is divided into stages; one job contains one or more stages.

3 Task

The number of tasks in a stage is determined by the partition count of the stage's RDD.

4 Record

One task processes many records, i.e. the rows of data in its partition.
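
A minimal sketch of the hierarchy, assuming an existing SparkContext `sc` (e.g. in spark-shell):

```scala
// Each action triggers one job; each stage runs one task per partition.
val nums = sc.parallelize(1 to 100, 4)    // RDD with 4 partitions
nums.count()                              // action #1 -> job 0, 4 tasks
nums.map(_ * 2).collect()                 // action #2 -> job 1, 4 tasks
// The Spark UI (port 4040) lists the jobs, their stages, and tasks.
```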

Example

More precisely, for a file-based RDD: the requested partition count (the minPartitions hint) feeds into how many InputSplits are computed, which determines how the underlying blocks are combined into splits; each resulting InputSplit becomes one RDD partition, and each partition is processed by one task (see the sketch after the reference below).

https://juejin.cn/post/6844903848536965134
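
A minimal sketch of that chain, assuming an existing SparkContext `sc`; the HDFS path is hypothetical:

```scala
// minPartitions is passed down to the InputFormat when splits are
// computed; the resulting InputSplits become the RDD's partitions,
// and each partition is processed by one task.
val lines = sc.textFile("hdfs:///data/events.log", minPartitions = 8)
println(lines.getNumPartitions)   // typically >= the minPartitions hint
```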

2 RDD Partitions

Why partition?

Partitioning exists mainly to enable parallel computation and thus improve efficiency: each partition is processed by its own task, so partitions are the unit of parallelism (see the sketch below).
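
A minimal sketch of partition-level parallelism, assuming an existing SparkContext `sc`:

```scala
// Each of the 4 partitions is handled by its own task, in parallel.
val rdd = sc.parallelize(1 to 8, numSlices = 4)
rdd.mapPartitionsWithIndex { (idx, it) =>
  it.map(x => s"partition $idx -> element $x")
}.collect().foreach(println)
```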

Partitioning schemes

Spark ships with two built-in data partitioners for key-value RDDs: HashPartitioner (hash partitioning) and RangePartitioner (range partitioning), sketched below.
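
A minimal sketch of both partitioners on a key-value RDD, assuming an existing SparkContext `sc`:

```scala
import org.apache.spark.{HashPartitioner, RangePartitioner}

val pairs = sc.parallelize(Seq(("a", 1), ("c", 2), ("b", 3)), 3)

// HashPartitioner: partition chosen by the key's hashCode modulo
// the partition count (made non-negative).
val hashed = pairs.partitionBy(new HashPartitioner(2))

// RangePartitioner: samples the keys and assigns contiguous, ordered
// key ranges to partitions (this is what sortByKey uses internally).
val ranged = pairs.partitionBy(new RangePartitioner(2, pairs))

println(hashed.partitioner)   // Some(HashPartitioner)
println(ranged.partitioner)   // Some(RangePartitioner)
```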

Setting the number of partitions

https://justdodt.github.io/2018/04/23/Spark-RDD-%E7%9A%84%E5%88%86%E5%8C%BA%E6%95%B0%E9%87%8F%E7%A1%AE%E5%AE%9A/
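
A minimal sketch of the common ways to set or change the partition count, assuming an existing SparkContext `sc`:

```scala
val rdd = sc.parallelize(1 to 100, 4)   // explicit count at creation
println(rdd.getNumPartitions)           // 4

val more = rdd.repartition(8)           // increase: always shuffles
val less = more.coalesce(2)             // decrease: can avoid a shuffle
println(more.getNumPartitions)          // 8
println(less.getNumPartitions)          // 2
```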

Spark vs MapReduce

Comparison

https://www.educba.com/mapreduce-vs-spark/

Product's category
- MapReduce: primarily a data processing engine.
- Spark: a framework that drives complete analytical solutions or applications, making it the obvious choice as a data analytics engine.

Performance and data processing
- MapReduce: reading and writing go through disk, slowing down processing.
- Spark: minimizes read/write cycles and keeps data in memory, allowing it to be up to 10 times faster; performance may degrade badly if the data does not fit in memory.

Latency
- MapReduce: higher latency in computing, as a result of the lower performance.
- Spark: faster, enabling low-latency computing.

Manageability
- MapReduce: a batch engine only; other components must be handled separately yet kept in sync, which makes it difficult to manage.
- Spark: a complete data analytics engine covering batch, interactive, and streaming workloads under the same cluster umbrella, and thus easier to manage.

Real-time analysis
- MapReduce: built mainly for batch processing; fails for real-time analytics use cases.
- Spark: efficiently manages and processes data from real-time live streams such as Facebook or Twitter.

Interactive mode
- MapReduce: provides no interactive mode.
- Spark: supports processing data interactively.

Security
- MapReduce: has access to all Hadoop security features and integrates easily with other Hadoop security projects; also supports ACLs.
- Spark: security is off by default, which can be a major security fallback; for authentication, only the shared-secret method is available.

Fault tolerance
- MapReduce: if the process crashes, it can resume from where it left off, since it relies on hard drives rather than RAM.
- Spark: if the process crashes, processing restarts from the beginning, making it less fault-tolerant than MapReduce, since it relies on RAM.

Why is Spark faster than MapReduce?

https://blog.csdn.net/JENREY/article/details/84873874

1 Spark is memory-based; MapReduce is disk-based

This refers to intermediate results.

MapReduce: intermediate results are typically written to disk and then read back, causing frequent disk I/O.

Spark: intermediate results do not have to be written to disk at every step (see the sketch below).
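
A minimal sketch of reusing an in-memory intermediate result, assuming an existing SparkContext `sc`; the HDFS path is hypothetical:

```scala
val words = sc.textFile("hdfs:///input/text")
  .flatMap(_.split("\\s+"))
  .cache()                          // keep the intermediate RDD in memory

println(words.count())              // job 1: computes and caches `words`
println(words.distinct().count())   // job 2: reads the cached data,
                                    // no disk round-trip in between
```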

2 Spark uses coarse-grained resource allocation; MapReduce uses fine-grained resource allocation

Spark: tasks do not request resources themselves; the resources are requested once for the whole application at submission time.

MapReduce: each task requests its own resources when it runs.

3 Spark is thread-based; MapReduce is process-based

Spark runs tasks as threads inside long-lived Executor processes, whereas MapReduce starts a separate JVM process for each task, paying the process start-up cost every time.

Spark on Hive & Hive on Spark

https://cloud.tencent.com/developer/article/1624245

https://blog.csdn.net/weixin_41290471/article/details/106203419

https://www.cnblogs.com/qingyunzong/p/8992664.html

https://www.cnblogs.com/liugp/p/16209394.html#1spark-on-hive

1 Hive on Spark

Hive serves as both the storage layer and the SQL parser/optimizer; Spark is only the execution engine.

2 Spark on Hive

Hive acts only as the storage layer; Spark handles SQL parsing, optimization, and execution.

Hive on Spark is structurally similar to Spark on Hive; the difference is which side provides the SQL engine. A sketch of Spark on Hive follows.
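
A minimal Spark on Hive sketch: Spark owns the SQL engine while Hive supplies only the metastore and table storage; the table name is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object SparkOnHiveDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-on-hive")
      .enableHiveSupport()   // connect to the Hive metastore
      .getOrCreate()

    // Parsed, optimized, and executed by Spark; stored in Hive.
    spark.sql("SELECT COUNT(*) FROM default.some_table").show()
    spark.stop()
  }
}
```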

RDD Dependencies

1 Why define wide and narrow dependencies?

A job's DAG is built from the RDD dependency relationships, and the DAG is divided into stages; the wide/narrow distinction tells the scheduler where those stage boundaries are.

2 How to tell wide from narrow dependencies

(Figure: the blue boxes are RDD partitions; arrows point from parent RDD to child RDD.)

To distinguish them, look at where a parent RDD partition's data flows: if each parent partition flows to exactly one child partition, the dependency is narrow; if it flows to multiple child partitions, it is wide.

3 Dividing stages

https://blog.csdn.net/weixin_40271036/article/details/79996516

The overall idea: traverse backwards from the final RDD; whenever a wide dependency is encountered, cut the graph there to close off a stage; whenever a narrow dependency is encountered, add that RDD to the current stage (see the sketch below).
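
A minimal sketch, assuming an existing SparkContext `sc`: `map` is a narrow dependency and `reduceByKey` a wide one, so the lineage splits into two stages at the shuffle:

```scala
val words  = sc.parallelize(Seq("a", "b", "a"), 2)
val pairs  = words.map(w => (w, 1))      // narrow: stays in the same stage
val counts = pairs.reduceByKey(_ + _)    // wide: shuffle, new stage begins

// toDebugString prints the lineage; the ShuffledRDD marks the boundary
// where the job is cut into two stages.
println(counts.toDebugString)
```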

4 References

https://www.jianshu.com/p/5c2301dfa360

https://zhuanlan.zhihu.com/p/67068559

https://blog.csdn.net/m0_49834705/article/details/113111596

https://lmrzero.blog.csdn.net/article/details/106015264?spm=1001.2014.3001.5502

5 Question

Why does a wide dependency involve a shuffle while a narrow dependency does not? In a wide dependency, a child partition needs data from several parent partitions, which generally live on different nodes, so the data must be redistributed across the network (a shuffle); in a narrow dependency, each child partition is computed entirely from its single parent partition, locally.

Spark Deployment Modes

https://blog.csdn.net/qq_37163925/article/details/106260434

https://spark.apache.org/docs/latest/cluster-overview.html

https://book.itheima.net/course/1269935677353533441/1270998166728089602/1270999667882074115

The deployment mode is chosen by setting master: the URL of the cluster manager the Spark program connects to. If it is not set when the application is submitted, the program reads the MASTER environment variable via System.getenv("MASTER"); if that is unset too, master defaults to local[*].

local: single-machine mode

standalone: Spark's own built-in resource scheduler

yarn: use Hadoop YARN for resource scheduling (master URL examples in the sketch below)
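
A minimal sketch of selecting the deployment mode via the master URL (normally passed as --master to spark-submit; shown here on SparkConf):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("deploy-demo")
  .setMaster("local[*]")   // or "spark://host:7077" (standalone),
                           // "yarn", "k8s://https://host:443"
val sc = new SparkContext(conf)
```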

1 local

Local mode runs the entire Spark runtime as a single independent process, simulating the cluster with multiple threads inside that process.

local[N]: N is the number of threads, usually set to the number of CPU cores.

local[*]: use as many threads as there are CPU cores.

Local mode can run without a Hadoop installation.

https://blog.csdn.net/wangmuming/article/details/37695619

https://blog.csdn.net/bettesu/article/details/68512570

2 Standalone

https://sfzsjx.github.io/2019/08/26/spark-standalone-%E8%BF%90%E8%A1%8C%E5%8E%9F%E7%90%86

2.1 client

Execution flow

  1. In client mode, the Driver process starts on the client as soon as the job is submitted.
  2. The Driver applies to the Master for the resources needed to launch the application.
  3. Once the resources are granted, the Driver sends tasks to the Worker nodes for execution.
  4. When the Workers finish, they return the execution results to the Driver.

2.2 cluster

Execution flow:

  1. Submitting with spark-submit --deploy-mode cluster starts a spark-submit process on the client.
  2. This process applies to the Master for resources on behalf of the Driver.
  3. The Master picks a random Worker node and starts the Driver process there.
  4. Once the Driver starts successfully, the spark-submit process exits, and the Driver applies to the Master for resources.
  5. On receiving the request, the Master starts Executor processes on Worker nodes with sufficient resources.
  6. The Driver distributes Tasks to the Executors for execution.

2.3 High availability (HA)

3 Mesos

A general cluster manager that can also run Hadoop MapReduce and service applications. (Deprecated)

4 YARN

Why YARN? Running on YARN lets Spark share cluster resources with other Hadoop workloads under one resource manager, instead of maintaining a dedicated Spark cluster.

Spark on YARN has two run modes, Cluster and Client; the difference between them is where the Driver runs.

Cluster mode: the Driver runs inside a YARN container, in the same container as the ApplicationMaster.

Client mode: the Driver runs in the client process, for example inside the spark-submit process.

4.1 cluster

The flow, step by step:
1) After the job is submitted, it contacts the ResourceManager to request starting an ApplicationMaster;
2) The ResourceManager then allocates a container and starts the ApplicationMaster on a suitable NodeManager; in this mode the ApplicationMaster is the Driver;
3) Once started, the Driver requests Executor resources from the ResourceManager; on receiving the ApplicationMaster's request, the ResourceManager allocates containers and starts Executor processes on suitable NodeManagers;
4) The Executor processes register themselves back with the Driver;
5) Once all Executors have registered, the Driver runs the main function; when an action operator is reached, a job is triggered, stages are divided at wide dependencies, each stage generates a corresponding TaskSet, and the tasks are then distributed to the Executors for execution.

4.2 client

The flow, step by step:
1) The Driver runs on the local machine where the job is submitted; once started, it contacts the ResourceManager to request starting an ApplicationMaster;
2) The ResourceManager then allocates a container and starts the ApplicationMaster on a suitable NodeManager; here the ApplicationMaster acts merely as an ExecutorLauncher, responsible only for requesting Executor resources from the ResourceManager;
3) On receiving the ApplicationMaster's request, the ResourceManager allocates containers, and the ApplicationMaster starts Executor processes on the assigned NodeManagers;
4) The Executor processes register themselves back with the Driver; once all have registered, the Driver runs the main function;
5) When an action operator is reached, a Job is triggered, Stages are divided at wide dependencies, each Stage generates a corresponding TaskSet, and the Tasks are then distributed to the Executors for execution.

5 Kubernetes

An open-source system for automating deployment, scaling, and management of containerized applications.

