Data Partitioning and RDD Partitions

https://www.jianshu.com/p/3aa52ee3a802

https://blog.csdn.net/hjw199089/article/details/77938688

https://blog.csdn.net/mys_35088/article/details/80864092

https://blog.csdn.net/dmy1115143060/article/details/82620715

https://blog.csdn.net/xxd1992/article/details/85254666

https://blog.csdn.net/m0_46657040/article/details/108737350

https://blog.csdn.net/heiren_a/article/details/111954523

https://blog.csdn.net/u011564172/article/details/53611109

https://blog.csdn.net/qq_22473611/article/details/107822168

1 Application, Job, Stage, Task, Record

0 Application

A complete Spark program submitted to the cluster (one SparkContext instance).

1 Job

Each action triggers one job.

2 Stage

A job builds a DAG from the RDD dependency relationships, and the DAG is divided into stages; one job contains one or more stages.

3 Task

The number of tasks in a stage is determined by the partition count of the stage's RDD.

4 Record

One task processes many records, i.e. the rows of data in its partition.
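
A minimal sketch of the hierarchy, assuming an existing SparkContext `sc` (e.g. in spark-shell):

```scala
// Each action triggers one job; each stage runs one task per partition.
val nums = sc.parallelize(1 to 100, 4)    // RDD with 4 partitions
nums.count()                              // action #1 -> job 0, 4 tasks
nums.map(_ * 2).collect()                 // action #2 -> job 1, 4 tasks
// The Spark UI (port 4040) lists the jobs, their stages, and tasks.
```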

Example

More precisely, for a file-based RDD: the requested partition count (the minPartitions hint) feeds into how many InputSplits are computed, which determines how the underlying blocks are combined into splits; each resulting InputSplit becomes one RDD partition, and each partition is processed by one task (see the sketch after the reference below).

https://juejin.cn/post/6844903848536965134
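
A minimal sketch of that chain, assuming an existing SparkContext `sc`; the HDFS path is hypothetical:

```scala
// minPartitions is passed down to the InputFormat when splits are
// computed; the resulting InputSplits become the RDD's partitions,
// and each partition is processed by one task.
val lines = sc.textFile("hdfs:///data/events.log", minPartitions = 8)
println(lines.getNumPartitions)   // typically >= the minPartitions hint
```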

2 RDD Partitions

Why partition?

Partitioning exists mainly to enable parallel computation and thus improve efficiency: each partition is processed by its own task, so partitions are the unit of parallelism (see the sketch below).
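
A minimal sketch of partition-level parallelism, assuming an existing SparkContext `sc`:

```scala
// Each of the 4 partitions is handled by its own task, in parallel.
val rdd = sc.parallelize(1 to 8, numSlices = 4)
rdd.mapPartitionsWithIndex { (idx, it) =>
  it.map(x => s"partition $idx -> element $x")
}.collect().foreach(println)
```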

Partitioning schemes

Spark ships with two built-in data partitioners for key-value RDDs: HashPartitioner (hash partitioning) and RangePartitioner (range partitioning), sketched below.
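
A minimal sketch of both partitioners on a key-value RDD, assuming an existing SparkContext `sc`:

```scala
import org.apache.spark.{HashPartitioner, RangePartitioner}

val pairs = sc.parallelize(Seq(("a", 1), ("c", 2), ("b", 3)), 3)

// HashPartitioner: partition chosen by the key's hashCode modulo
// the partition count (made non-negative).
val hashed = pairs.partitionBy(new HashPartitioner(2))

// RangePartitioner: samples the keys and assigns contiguous, ordered
// key ranges to partitions (this is what sortByKey uses internally).
val ranged = pairs.partitionBy(new RangePartitioner(2, pairs))

println(hashed.partitioner)   // Some(HashPartitioner)
println(ranged.partitioner)   // Some(RangePartitioner)
```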

Setting the number of partitions

https://justdodt.github.io/2018/04/23/Spark-RDD-%E7%9A%84%E5%88%86%E5%8C%BA%E6%95%B0%E9%87%8F%E7%A1%AE%E5%AE%9A/
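
A minimal sketch of the common ways to set or change the partition count, assuming an existing SparkContext `sc`:

```scala
val rdd = sc.parallelize(1 to 100, 4)   // explicit count at creation
println(rdd.getNumPartitions)           // 4

val more = rdd.repartition(8)           // increase: always shuffles
val less = more.coalesce(2)             // decrease: can avoid a shuffle
println(more.getNumPartitions)          // 8
println(less.getNumPartitions)          // 2
```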

Spark vs MapReduce

Comparison

https://www.educba.com/mapreduce-vs-spark/

Product's category
- MapReduce: primarily a data processing engine.
- Spark: a framework that drives complete analytical solutions or applications, making it the obvious choice as a data analytics engine.

Performance and data processing
- MapReduce: reading and writing go through disk, slowing down processing.
- Spark: minimizes read/write cycles and keeps data in memory, allowing it to be up to 10 times faster; performance may degrade badly if the data does not fit in memory.

Latency
- MapReduce: higher latency in computing, as a result of the lower performance.
- Spark: faster, enabling low-latency computing.

Manageability
- MapReduce: a batch engine only; other components must be handled separately yet kept in sync, which makes it difficult to manage.
- Spark: a complete data analytics engine covering batch, interactive, and streaming workloads under the same cluster umbrella, and thus easier to manage.

Real-time analysis
- MapReduce: built mainly for batch processing; fails for real-time analytics use cases.
- Spark: efficiently manages and processes data from real-time live streams such as Facebook or Twitter.

Interactive mode
- MapReduce: provides no interactive mode.
- Spark: supports processing data interactively.

Security
- MapReduce: has access to all Hadoop security features and integrates easily with other Hadoop security projects; also supports ACLs.
- Spark: security is off by default, which can be a major security fallback; for authentication, only the shared-secret method is available.

Fault tolerance
- MapReduce: if the process crashes, it can resume from where it left off, since it relies on hard drives rather than RAM.
- Spark: if the process crashes, processing restarts from the beginning, making it less fault-tolerant than MapReduce, since it relies on RAM.

Why is Spark faster than MapReduce?

https://blog.csdn.net/JENREY/article/details/84873874

1 Spark is memory-based; MapReduce is disk-based

This refers to intermediate results.

MapReduce: intermediate results are typically written to disk and then read back, causing frequent disk I/O.

Spark: intermediate results do not have to be written to disk at every step (see the sketch below).
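
A minimal sketch of reusing an in-memory intermediate result, assuming an existing SparkContext `sc`; the HDFS path is hypothetical:

```scala
val words = sc.textFile("hdfs:///input/text")
  .flatMap(_.split("\\s+"))
  .cache()                          // keep the intermediate RDD in memory

println(words.count())              // job 1: computes and caches `words`
println(words.distinct().count())   // job 2: reads the cached data,
                                    // no disk round-trip in between
```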

2 Spark uses coarse-grained resource allocation; MapReduce uses fine-grained resource allocation

Spark: tasks do not request resources themselves; the resources are requested once for the whole application at submission time.

MapReduce: each task requests its own resources when it runs.

3 Spark is thread-based; MapReduce is process-based

Spark runs tasks as threads inside long-lived Executor processes, whereas MapReduce starts a separate JVM process for each task, paying the process start-up cost every time.

Spark on Hive & Hive on Spark

https://cloud.tencent.com/developer/article/1624245

https://blog.csdn.net/weixin_41290471/article/details/106203419

https://www.cnblogs.com/qingyunzong/p/8992664.html

https://www.cnblogs.com/liugp/p/16209394.html#1spark-on-hive

1 Hive on Spark

Hive serves as both the storage layer and the SQL parser/optimizer; Spark is only the execution engine.

2 Spark on Hive

Hive acts only as the storage layer; Spark handles SQL parsing, optimization, and execution.

Hive on Spark is structurally similar to Spark on Hive; the difference is which side provides the SQL engine. A sketch of Spark on Hive follows.
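
A minimal Spark on Hive sketch: Spark owns the SQL engine while Hive supplies only the metastore and table storage; the table name is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object SparkOnHiveDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-on-hive")
      .enableHiveSupport()   // connect to the Hive metastore
      .getOrCreate()

    // Parsed, optimized, and executed by Spark; stored in Hive.
    spark.sql("SELECT COUNT(*) FROM default.some_table").show()
    spark.stop()
  }
}
```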

RDD Dependencies

1 Why define wide and narrow dependencies?

A job's DAG is built from the RDD dependency relationships, and the DAG is divided into stages; the wide/narrow distinction tells the scheduler where those stage boundaries are.

2 How to tell wide from narrow dependencies

(Figure: the blue boxes are RDD partitions; arrows point from parent RDD to child RDD.)

To distinguish them, look at where a parent RDD partition's data flows: if each parent partition flows to exactly one child partition, the dependency is narrow; if it flows to multiple child partitions, it is wide.

3 Dividing stages

https://blog.csdn.net/weixin_40271036/article/details/79996516

The overall idea: traverse backwards from the final RDD; whenever a wide dependency is encountered, cut the graph there to close off a stage; whenever a narrow dependency is encountered, add that RDD to the current stage (see the sketch below).
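
A minimal sketch, assuming an existing SparkContext `sc`: `map` is a narrow dependency and `reduceByKey` a wide one, so the lineage splits into two stages at the shuffle:

```scala
val words  = sc.parallelize(Seq("a", "b", "a"), 2)
val pairs  = words.map(w => (w, 1))      // narrow: stays in the same stage
val counts = pairs.reduceByKey(_ + _)    // wide: shuffle, new stage begins

// toDebugString prints the lineage; the ShuffledRDD marks the boundary
// where the job is cut into two stages.
println(counts.toDebugString)
```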

4 References

https://www.jianshu.com/p/5c2301dfa360

https://zhuanlan.zhihu.com/p/67068559

https://blog.csdn.net/m0_49834705/article/details/113111596

https://lmrzero.blog.csdn.net/article/details/106015264?spm=1001.2014.3001.5502

5 Question

Why does a wide dependency involve a shuffle while a narrow dependency does not? In a wide dependency, a child partition needs data from several parent partitions, which generally live on different nodes, so the data must be redistributed across the network (a shuffle); in a narrow dependency, each child partition is computed entirely from its single parent partition, locally.

Spark Deployment Modes

https://blog.csdn.net/qq_37163925/article/details/106260434

https://spark.apache.org/docs/latest/cluster-overview.html

https://book.itheima.net/course/1269935677353533441/1270998166728089602/1270999667882074115

The deployment mode is chosen by setting master: the URL of the cluster manager the Spark program connects to. If it is not set when the application is submitted, the program reads the MASTER environment variable via System.getenv("MASTER"); if that is unset too, master defaults to local[*].

local: single-machine mode

standalone: Spark's own built-in resource scheduler

yarn: use Hadoop YARN for resource scheduling (master URL examples in the sketch below)
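
A minimal sketch of selecting the deployment mode via the master URL (normally passed as --master to spark-submit; shown here on SparkConf):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("deploy-demo")
  .setMaster("local[*]")   // or "spark://host:7077" (standalone),
                           // "yarn", "k8s://https://host:443"
val sc = new SparkContext(conf)
```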

1 local

Local mode runs the entire Spark runtime as a single independent process, simulating the cluster with multiple threads inside that process.

local[N]: N is the number of threads, usually set to the number of CPU cores.

local[*]: use as many threads as there are CPU cores.

Local mode can run without a Hadoop installation.

https://blog.csdn.net/wangmuming/article/details/37695619

https://blog.csdn.net/bettesu/article/details/68512570

2 Standalone

https://sfzsjx.github.io/2019/08/26/spark-standalone-%E8%BF%90%E8%A1%8C%E5%8E%9F%E7%90%86

2.1 client

Execution flow

  1. In client mode, the Driver process starts on the client as soon as the job is submitted.
  2. The Driver applies to the Master for the resources needed to launch the application.
  3. Once the resources are granted, the Driver sends tasks to the Worker nodes for execution.
  4. When the Workers finish, they return the execution results to the Driver.

2.2 cluster

Execution flow:

  1. Submitting with spark-submit --deploy-mode cluster starts a spark-submit process on the client.
  2. This process applies to the Master for resources on behalf of the Driver.
  3. The Master picks a random Worker node and starts the Driver process there.
  4. Once the Driver starts successfully, the spark-submit process exits, and the Driver applies to the Master for resources.
  5. On receiving the request, the Master starts Executor processes on Worker nodes with sufficient resources.
  6. The Driver distributes Tasks to the Executors for execution.

2.3 High availability (HA)

3 Mesos

A general cluster manager that can also run Hadoop MapReduce and service applications. (Deprecated)

4 YARN

Why YARN? Running on YARN lets Spark share cluster resources with other Hadoop workloads under one resource manager, instead of maintaining a dedicated Spark cluster.

Spark on YARN has two run modes, Cluster and Client; the difference between them is where the Driver runs.

Cluster mode: the Driver runs inside a YARN container, in the same container as the ApplicationMaster.

Client mode: the Driver runs in the client process, for example inside the spark-submit process.

4.1 cluster

The flow, step by step:
1) After the job is submitted, it contacts the ResourceManager to request starting an ApplicationMaster;
2) The ResourceManager then allocates a container and starts the ApplicationMaster on a suitable NodeManager; in this mode the ApplicationMaster is the Driver;
3) Once started, the Driver requests Executor resources from the ResourceManager; on receiving the ApplicationMaster's request, the ResourceManager allocates containers and starts Executor processes on suitable NodeManagers;
4) The Executor processes register themselves back with the Driver;
5) Once all Executors have registered, the Driver runs the main function; when an action operator is reached, a job is triggered, stages are divided at wide dependencies, each stage generates a corresponding TaskSet, and the tasks are then distributed to the Executors for execution.

4.2 client

The flow, step by step:
1) The Driver runs on the local machine where the job is submitted; once started, it contacts the ResourceManager to request starting an ApplicationMaster;
2) The ResourceManager then allocates a container and starts the ApplicationMaster on a suitable NodeManager; here the ApplicationMaster acts merely as an ExecutorLauncher, responsible only for requesting Executor resources from the ResourceManager;
3) On receiving the ApplicationMaster's request, the ResourceManager allocates containers, and the ApplicationMaster starts Executor processes on the assigned NodeManagers;
4) The Executor processes register themselves back with the Driver; once all have registered, the Driver runs the main function;
5) When an action operator is reached, a Job is triggered, Stages are divided at wide dependencies, each Stage generates a corresponding TaskSet, and the Tasks are then distributed to the Executors for execution.

5 Kubernetes

An open-source system for automating deployment, scaling, and management of containerized applications.

