Spark vs MapReduce

Comparison

https://www.educba.com/mapreduce-vs-spark/

| Aspect | MapReduce | Spark |
| --- | --- | --- |
| Product category | Primarily a data processing engine. | A framework that drives complete analytical solutions and applications, which makes it a natural choice for data scientists as a data analytics engine. |
| Performance and data processing | Reads and writes intermediate data from and to disk, which slows down processing. | Minimizes read/write cycles and keeps data in memory, which can make it up to 10x faster; performance can degrade sharply when the data does not fit in memory. |
| Latency | Higher computing latency as a consequence of the lower performance. | Faster processing gives developers low-latency computing. |
| Manageability | Only a batch engine; other components must be handled separately yet kept in sync, which makes the stack harder to manage. | A complete data analytics engine that runs batch, interactive, and streaming workloads under the same cluster umbrella, so it is easier to manage. |
| Real-time analysis | Built mainly for batch processing, so it is a poor fit for real-time analytics use cases. | Data from real-time streams such as Facebook or Twitter feeds can be managed and processed efficiently. |
| Interactive mode | No interactive mode. | Data can be processed interactively. |
| Security | Has access to all Hadoop security features and therefore integrates easily with other Hadoop security projects; it also supports ACLs. | Security is off by default, which can be a significant gap; for authentication, only the shared-secret (password) method is available. |
| Fault tolerance | If a process crashes, it can resume from where it left off, because it relies on hard drives rather than RAM. | If a process crashes, processing has to restart from the beginning, making it less fault tolerant than MapReduce because it relies on RAM. |

Why is Spark faster than MapReduce?

https://blog.csdn.net/JENREY/article/details/84873874

1. Spark is memory-based, MapReduce is disk-based

This refers to intermediate results.

MapReduce: intermediate results usually have to be written to disk and then read back, which causes frequent disk I/O.

Spark: intermediate results do not have to be written to disk at every step (a sketch of this follows after point 3).

2. Spark uses coarse-grained resource allocation, MapReduce uses fine-grained resource allocation

Spark: tasks do not request resources themselves; the resources are requested up front when the application is submitted.

MapReduce: each task requests its own resources as it runs.

3. Spark runs tasks as threads within an executor process, MapReduce runs each task as a separate process
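As a concrete illustration of point 1, here is a minimal PySpark sketch (local mode; the input path and column names are made up for illustration): the tokenized intermediate result is cached in executor memory and reused by a second job without another round trip to disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").master("local[*]").getOrCreate()

raw = spark.read.text("/tmp/demo-input.txt")               # hypothetical input file
words = raw.selectExpr("explode(split(value, ' ')) AS w")  # intermediate result

words.cache()                        # keep the intermediate result in executor memory
print(words.count())                 # first action materializes and caches it

freq = words.groupBy("w").count()                # second job reuses the cached data,
freq.orderBy("count", ascending=False).show(10)  # no extra disk round trip

spark.stop()
```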

Spark Configuration

1. Ways to set configuration

https://blog.51cto.com/u_16213328/7866422

Precedence: properties set directly in code (on SparkConf / the SparkSession builder) take the highest priority, then flags passed to spark-submit, then values in spark-defaults.conf.

How Hadoop-level settings interact with this???

2. Setting configuration in code (SparkSession, SparkContext, HiveContext, SQLContext)

https://blog.csdn.net/weixin_43648241/article/details/108917865

SparkSession > SparkContext > HiveContext > SQLContext

A SparkSession wraps a SparkContext.

HiveContext and SQLContext are both built on top of a SparkContext.

HiveContext extends SQLContext with Hive support; since Spark 2.x this functionality is unified in SparkSession (a sketch follows below).
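A small sketch of the relationship (local mode, nothing cluster-specific assumed): the session wraps a SparkContext, and the SQL entry points that SQLContext/HiveContext used to provide are available directly on the session.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ctx-demo").master("local[*]").getOrCreate()

sc = spark.sparkContext               # the SparkContext wrapped by the session
print(sc.appName, sc.master)

spark.sql("SELECT 1 AS x").show()     # what SQLContext/HiveContext.sql() used to do

spark.stop()
```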

from pyspark.sql import SparkSession

# addfile and zipfile are path variables defined elsewhere (extra files / zipped Python dependencies)
spark = SparkSession.builder.\
    config("hive.metastore.uris", "thrift://xxx.xx.x.xx:xxxx").\
    config("spark.pyspark.python", "/opt/dm_python3/bin/python").\
    config("spark.default.parallelism", 10).\
    config("spark.sql.shuffle.partitions", 200).\
    config("spark.driver.maxResultSize", "16g").\
    config("spark.port.maxRetries", "100").\
    config("spark.driver.memory", "16g").\
    config("spark.yarn.queue", "dcp").\
    config("spark.executor.memory", "16g").\
    config("spark.executor.cores", 20).\
    config("spark.files", addfile).\
    config("spark.executor.instances", 6).\
    config("spark.speculation", False).\
    config("spark.submit.pyFiles", zipfile).\
    appName("testing").\
    master("yarn").\
    enableHiveSupport().\
    getOrCreate()
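Continuing from the session built above, the values that actually took effect can be checked at runtime, which helps when the same keys are also set in spark-defaults.conf or on the spark-submit command line (a small sketch):

```python
# runtime check of the effective configuration
print(spark.conf.get("spark.executor.memory"))         # "16g" unless overridden elsewhere
print(spark.conf.get("spark.sql.shuffle.partitions"))  # "200" from the builder above
```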

Submitting Spark Jobs

1.spark-submit

https://spark.apache.org/docs/latest/submitting-applications.html

The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application especially for each one.

./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
  • --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
  • --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
  • --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
  • --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown). Multiple configurations should be passed as separate arguments. (e.g. --conf <key>=<value> --conf <key2>=<value2>)
  • application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
  • application-arguments: Arguments passed to the main method of your main class, if any

The machine where spark-submit is run acts as the client; where the driver ends up running depends on the deploy mode.

2.python file.py

This should only support local and client mode.

If the code specifies cluster deploy mode in this situation, an error is raised:

config("spark.submit.deployMode", "cluster")

Exception in thread "main" org.apache.spark.SparkException: Cluster deploy mode is not applicable to Spark shells.

3.jupyter notebook

This should also only support local and client mode.

Database Classification

Common relational databases

  1. Oracle
  2. MySQL
  3. Microsoft SQL Server
  4. SQLite
  5. PostgreSQL
  6. IBM DB2

Common non-relational databases

  1. Key-value stores: Redis, Memcached, Riak
  2. Column-family stores: Bigtable, HBase, Cassandra
  3. Document stores: MongoDB, CouchDB, MarkLogic
  4. Graph databases: Neo4j, InfoGrid

Differences between the two

Non-relational databases have no fixed table schema and generally do not support SQL (a sketch of the contrast follows below).
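A hedged sketch of the difference in data model, using SQLite for the relational side and redis-py for the key-value side (the Redis part assumes a server on localhost:6379 and is left commented out):

```python
import sqlite3  # relational: fixed schema, queried with SQL
# import redis  # non-relational key-value store (pip install redis, server required)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
print(conn.execute("SELECT name FROM users WHERE id = 1").fetchone())

# r = redis.Redis(host="localhost", port=6379)
# r.set("user:1:name", "alice")    # schema-less: just a key and a value
# print(r.get("user:1:name"))
```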

Hive Architecture

https://cwiki.apache.org/confluence/display/hive/design#Design-HiveArchitecture

https://zhuanlan.zhihu.com/p/87545980

https://blog.csdn.net/oTengYue/article/details/91129850

https://jiamaoxiang.top/2020/06/27/Hive%E7%9A%84%E6%9E%B6%E6%9E%84%E5%89%96%E6%9E%90/

https://www.javatpoint.com/hive-architecture

Hive Client

Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients, such as:

  • Thrift Server - A cross-language service provider platform that serves requests from any programming language that supports Thrift.
  • JDBC Driver - Used to establish a connection between Hive and Java applications. The JDBC driver is in the class org.apache.hadoop.hive.jdbc.HiveDriver.
  • ODBC Driver - It allows the applications that support the ODBC protocol to connect to Hive.
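For example, a Python client can talk to Hive through the Thrift service; a minimal sketch assuming PyHive is installed and HiveServer2 is listening on localhost:10000 (host, port, and user are placeholders):

```python
from pyhive import hive  # pip install pyhive[hive]

conn = hive.Connection(host="localhost", port=10000, username="hive", database="default")
cursor = conn.cursor()
cursor.execute("SHOW TABLES")
print(cursor.fetchall())
cursor.close()
conn.close()
```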

Hive Services

The following are the services provided by Hive:

  • Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
  • Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
  • Hive MetaStore - A central repository that stores the structural information of the tables and partitions in the warehouse, including column and type metadata, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
  • Hive Server - Also referred to as the Apache Thrift Server, it accepts requests from different clients and forwards them to the Hive Driver.
  • Hive Driver - Receives queries from sources such as the web UI, CLI, Thrift, and JDBC/ODBC drivers, and passes them to the compiler.
  • Hive Compiler - Parses the query and performs semantic analysis on the different query blocks and expressions, converting HiveQL statements into MapReduce jobs.
  • Hive Execution Engine - The optimizer generates the execution plan as a DAG of map-reduce tasks and HDFS tasks; the execution engine then runs these tasks in the order of their dependencies.

Execution Engines

Hive supports MapReduce, Tez, and Spark as its execution engine; a sketch of switching engines follows the links below.

https://cloud.tencent.com/developer/article/1893808

https://blog.csdn.net/kwu_ganymede/article/details/52223133
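The engine is selected per session through the hive.execution.engine property; a hedged sketch reusing a PyHive connection like the one above (the endpoint and table name are placeholders, and the chosen engine must actually be installed and configured on the cluster):

```python
from pyhive import hive

cursor = hive.Connection(host="localhost", port=10000).cursor()  # assumed HiveServer2 endpoint
cursor.execute("SET hive.execution.engine=spark")                # or "mr" / "tez"
cursor.execute("SELECT COUNT(*) FROM some_table")                # hypothetical table
print(cursor.fetchall())
```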

Data Storage

https://cloud.tencent.com/developer/article/1411821

Hive is built on HDFS: its data is stored in the Hadoop distributed file system. Hive has no storage format of its own and builds no indexes over the data; you only need to tell Hive the column delimiter and row delimiter when creating a table, and Hive can then parse the data.

Tables in the default database are stored under /user/hive/warehouse.
Tables in other databases are stored at a location you specify yourself.
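A sketch of telling Hive the column delimiter at table-creation time, here issued through a Hive-enabled SparkSession (a reachable metastore is assumed; the table name is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_logs (
        id  INT,
        msg STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','   -- column delimiter; the row delimiter defaults to '\\n'
    STORED AS TEXTFILE
""")
# With no LOCATION clause, a table in the default database ends up under
# /user/hive/warehouse/demo_logs on HDFS.
```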

SQL

1. Categories of SQL statements

SQL statements fall into three main categories.

DDL (Data Definition Language) statements: data definition statements that define database objects such as databases, tables, columns, and indexes. Common keywords include create, drop, and alter.

DML (Data Manipulation Language) statements: data manipulation statements used to add, delete, update, and query database records, and to check data integrity. Common keywords include insert, delete, update, and select.

DCL (Data Control Language) statements: data control statements that manage permissions and access levels, defining access rights and security levels for databases, tables, fields, and users. The main keywords include grant and revoke.
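A small runnable illustration of the first two categories using Python's built-in sqlite3 (SQLite has no user/permission system, so the DCL part is shown only as a comment with MySQL-style syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define database objects
conn.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("ALTER TABLE emp ADD COLUMN dept TEXT")

# DML: add, change, query, and delete rows
conn.execute("INSERT INTO emp (name, dept) VALUES ('alice', 'dcp')")
conn.execute("UPDATE emp SET dept = 'data' WHERE name = 'alice'")
print(conn.execute("SELECT * FROM emp").fetchall())
conn.execute("DELETE FROM emp WHERE id = 1")

# DCL (not supported by SQLite); in MySQL it would look like:
#   GRANT SELECT ON db.emp TO 'reader'@'%';
#   REVOKE SELECT ON db.emp FROM 'reader'@'%';
```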

2. SQL statement execution order

https://www.cnblogs.com/Qian123/p/5669259.html


https://cloud.tencent.com/developer/article/1600323

1. The FROM clause assembles the data from the different data sources;
2. The WHERE clause filters rows according to the specified conditions;
3. The GROUP BY clause partitions the data into groups;
4. Aggregate functions are evaluated;
5. The HAVING clause filters the groups;
6. All remaining expressions are computed;
7. The SELECT list is evaluated;
8. ORDER BY sorts the result set.

It feels as if SELECT ran before HAVING, though(?)
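In standard SQL the HAVING clause is logically evaluated before the SELECT list; some engines (e.g. MySQL) let HAVING reference SELECT aliases, which is probably why it can look the other way around. A runnable sketch annotating the order on a toy Spark table (table and column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-order").master("local[*]").getOrCreate()
spark.range(100).selectExpr("id", "id % 3 AS grp").createOrReplaceTempView("t")

spark.sql("""
    SELECT grp, COUNT(*) AS cnt      -- 6/7: expressions and SELECT list
    FROM t                           -- 1: FROM assembles the source rows
    WHERE id > 10                    -- 2: WHERE filters individual rows
    GROUP BY grp                     -- 3: GROUP BY forms groups
    HAVING COUNT(*) > 5              -- 4/5: aggregates computed, HAVING filters groups
    ORDER BY cnt DESC                -- 8: ORDER BY sorts the final result
""").show()

spark.stop()
```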

Atlas

1. Overview

Apache Atlas provides open metadata management and governance capabilities for organizations to build a catalog of their data assets, classify and govern these assets, and provide collaboration capabilities around these assets for data scientists, analysts, and the data governance team.

2. Atlas features

  • Metadata classification: supports classifying and managing metadata, e.g. personal information, sensitive information, etc.
  • Metadata search: metadata can be searched by type and by classification, and full-text search is supported.
  • Lineage: supports table-to-table and column-to-column lineage, which helps with tracing problems back to their source and with impact analysis.

1) Table-to-table lineage

2) Column-to-column lineage

3. Atlas architecture

4. Usage

4.1 Initial full import of Hive metadata

Steps:

Atlas provides a script for importing Hive metadata; running it once performs the initial full import of the Hive metadata.

/opt/module/atlas/hook-bin/import-hive.sh

Problem:

Failed to import Hive Meta Data!!!

Note: make sure the Hive metastore service is running first: hive --service metastore &

4.2 Incremental sync of Hive metadata

Incremental synchronization of Hive metadata needs no manual intervention: whenever the metadata in Hive changes (a DDL statement is executed), the Hive hook notifies Atlas of the change. In addition, Atlas derives lineage between data sets from DML statements.

