Submitting Spark Jobs

1.spark-submit

https://spark.apache.org/docs/latest/submitting-applications.html

The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application especially for each one.

./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
  • --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
  • --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
  • --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
  • --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap "key=value" in quotes (as shown). Multiple configurations should be passed as separate arguments. (e.g. --conf <key>=<value> --conf <key2>=<value2>)
  • application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
  • application-arguments: Arguments passed to the main method of your main class, if any

Here the submitting machine acts as the client; where the driver runs depends on the deploy mode.
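
As a concrete illustration, submitting the bundled SparkPi example with the class and master URL from the options above (the jar file name, memory setting, and argument are placeholders; check your own Spark distribution):

./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://23.195.26.187:7077 \
--deploy-mode client \
--conf spark.executor.memory=2g \
examples/jars/spark-examples_2.12-3.3.0.jar \
1000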

2.python file.py

This should only support local and client deploy modes.

If the code specifies cluster mode in this case, an error is raised:

config("spark.submit.deployMode", "cluster")

Exception in thread "main" org.apache.spark.SparkException: Cluster deploy mode is not applicable to Spark shells.
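
A minimal PySpark sketch of this situation (assuming the script is started directly with python rather than spark-submit; the app name and the yarn master are assumptions):

from pyspark.sql import SparkSession

# Started as `python file.py`, the driver is the local client process,
# so requesting cluster deploy mode raises the SparkException quoted above.
spark = (SparkSession.builder
         .appName("deploy-mode-demo")
         .master("yarn")
         .config("spark.submit.deployMode", "cluster")
         .getOrCreate())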

3.jupyter notebook

This should also only support local and client deploy modes.

Database Classification

Common relational databases

  1. Oracle
  2. MySQL
  3. Microsoft SQL Server
  4. SQLite
  5. PostgreSQL
  6. IBM DB2

Common non-relational (NoSQL) databases

  1. Key-value stores: Redis, Memcached, Riak
  2. Column-family stores: Bigtable, HBase, Cassandra
  3. Document stores: MongoDB, CouchDB, MarkLogic
  4. Graph databases: Neo4j, InfoGrid

Differences between the two

Non-relational databases have no fixed table schema and generally do not support SQL.
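
A small sketch of the contrast, using Python's built-in sqlite3 for the relational side; the Redis commands in the comment assume a locally running Redis server and the redis package:

import sqlite3

# Relational: data must fit a declared table schema and is queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
print(conn.execute("SELECT name FROM users WHERE id = 1").fetchone())

# Key-value (e.g. Redis): no table schema, no SQL, just keys and values.
# import redis
# r = redis.Redis(host="localhost", port=6379)
# r.set("user:1:name", "alice")
# print(r.get("user:1:name"))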

Data Collection

1 Offline

1. User behavior data

jar -> log -> Flume -> Kafka -> Flume -> HDFS

User behavior data is stored on the log servers as .log files: log -> Flume -> Kafka -> Flume -> HDFS.
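
A minimal sketch of the first Flume hop (log file -> Kafka) using the standard TAILDIR source and Kafka channel; the file paths, broker address, and topic name are assumptions, not values from this document:

# log-to-Kafka agent (hypothetical file log_to_kafka.conf)
a1.sources = r1
a1.channels = c1

# tail the .log files written by the log servers
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/applog/log/app.*
a1.sources.r1.positionFile = /opt/module/flume/taildir_position.json
a1.sources.r1.channels = c1

# Kafka channel: events go straight into the Kafka topic, no separate sink needed here
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = hadoop102:9092
a1.channels.c1.kafka.topic = topic_log
a1.channels.c1.parseAsFlumeEvent = false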

2 Business data

jar -> MySQL -> Sqoop -> HDFS

Business data is stored in MySQL and imported into HDFS with Sqoop.
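
A sketch of such a Sqoop import (the host, database, table, credentials, and target directory are all hypothetical):

sqoop import \
--connect jdbc:mysql://hadoop102:3306/gmall \
--username root \
--password 'your_password' \
--table order_info \
--target-dir /origin_data/gmall/db/order_info \
--num-mappers 1 \
--fields-terminated-by '\t'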

2 Real-time

1. User behavior data

Front end -> Nginx -> log servers -> (.log files, Kafka (ODS))

1 Front-end tracking data

Generated by a mock jar package.

2 Nginx

https://blog.csdn.net/qq_40036754/article/details/102463099

Nginx provides load balancing across the log servers.
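
A minimal nginx.conf sketch of this load balancing (host names and ports are assumptions); requests from the front end are distributed round-robin across the log servers:

events {}

http {
    upstream logservers {
        server hadoop102:8081;
        server hadoop103:8081;
        server hadoop104:8081;
    }

    server {
        listen 80;
        location / {
            # forward front-end tracking requests to one of the log servers
            proxy_pass http://logservers;
        }
    }
}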

3 Log servers

Built with Spring Boot

First, Spring is a Java framework; Spring Boot evolved on top of Spring.

4 Writing to disk and integrating Kafka

"Writing to disk" here means the logs are persisted on the log server.

Producer -> Kafka -> Consumer (the log server produces; downstream components consume)


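A minimal producer/consumer sketch using the kafka-python library (illustration only; the real log server described above is a Spring Boot application, and the broker address and topic name are assumptions):

from kafka import KafkaConsumer, KafkaProducer

BROKERS = "hadoop102:9092"   # assumed broker address
TOPIC = "topic_log"          # assumed topic name

# producer side: the log server publishes each log line to Kafka
producer = KafkaProducer(bootstrap_servers=BROKERS)
producer.send(TOPIC, b'{"ts": 1650000000, "event": "page_view"}')
producer.flush()

# consumer side: e.g. the downstream Flume agent or a streaming job
consumer = KafkaConsumer(TOPIC, bootstrap_servers=BROKERS, auto_offset_reset="earliest")
for msg in consumer:
    print(msg.value)
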
2 Business data

jar -> MySQL -> Flink CDC -> Kafka (ODS)

Sqoop cannot be used here because it is built on MapReduce and is too slow; use Canal, Maxwell, or Flink CDC instead.

Here the data is read from MySQL into Kafka, not into HDFS.

flink-cdc

https://cloud.tencent.com/developer/article/1801766

Change Data Capture (CDC)

ETL

https://www.cnblogs.com/yjd_hycf_space/p/7772722.html

https://blog.csdn.net/qq_33269009/article/details/90522087

https://blog.csdn.net/Stubborn_Cow/article/details/48420997

Note: many people think of ETL as finished once the data has passed through the first two steps and been loaded into the data warehouse database. In fact, ETL is not limited to the source data -> ODS step; the ODS -> DW and DW -> DM steps involve even more important and complex ETL work.

Data Synchronization

1. Data synchronization strategies

1 Full

Store a complete copy of the data.

2 Incremental

Store only newly added data.

3 New and changed

Store newly added and changed data.

4 Special

Some special tables do not have to follow the strategies above.

1. For example, tables that never change

Region, province, and ethnicity tables, for example, can be stored as a single fixed copy.

2. Zipper tables

When a zipper table is synchronized on the first day, a full load is taken and endtime is set to 9999. After the first day, only the new and changed data is synchronized into the zipper table each day. A partition strategy can be used: the partition 9999 (endtime = 9999) holds the currently valid records, plus one partition per expiration date for the expired records.
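
A toy Python sketch of the daily zipper-table update described above (in practice this is done in Hive SQL; the field names and dates here are assumptions):

# previous day's zipper table; end_date "9999-12-31" plays the role of the 9999 sentinel
old_zipper = [
    {"id": 1, "name": "alice", "start_date": "2024-01-01", "end_date": "9999-12-31"},
    {"id": 2, "name": "bob",   "start_date": "2024-01-01", "end_date": "9999-12-31"},
]
today = "2024-01-02"
changed = [{"id": 1, "name": "alicia"}]   # today's new-and-changed records

changed_ids = {r["id"] for r in changed}
new_zipper = []
for row in old_zipper:
    if row["end_date"] == "9999-12-31" and row["id"] in changed_ids:
        # close out the old version: it moves to the partition of its expiration date
        row = {**row, "end_date": today}
    new_zipper.append(row)
for r in changed:
    # the latest version stays in the "9999" (valid) partition
    new_zipper.append({**r, "start_date": today, "end_date": "9999-12-31"})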

2. First-day vs. daily synchronization

First-day synchronization: depends on the situation; not necessarily a full load.

Daily synchronization: also depends on the situation.

Hive Architecture

https://cwiki.apache.org/confluence/display/hive/design#Design-HiveArchitecture

https://zhuanlan.zhihu.com/p/87545980

https://blog.csdn.net/oTengYue/article/details/91129850

https://jiamaoxiang.top/2020/06/27/Hive%E7%9A%84%E6%9E%B6%E6%9E%84%E5%89%96%E6%9E%90/

https://www.javatpoint.com/hive-architecture

Hive Client

Hive allows writing applications in various languages, including Java, Python, and C++. It supports several types of clients, such as:

  • Thrift Server - A cross-language service provider platform that serves requests from any programming language that supports Thrift.
  • JDBC Driver - It is used to establish a connection between hive and Java applications. The JDBC Driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver.
  • ODBC Driver - It allows the applications that support the ODBC protocol to connect to Hive.

Hive Services

Hive provides the following services:

  • Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
  • Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
  • Hive MetaStore - A central repository that stores all the structural information of the various tables and partitions in the warehouse. It also includes metadata for columns and their types, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
  • Hive Server - Also referred to as the Apache Thrift Server. It accepts requests from different clients and forwards them to the Hive Driver.
  • Hive Driver - Receives queries from different sources such as the web UI, CLI, Thrift, and JDBC/ODBC drivers, and transfers them to the compiler.
  • Hive Compiler - Parses the query and performs semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
  • Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. Finally, the execution engine executes these tasks in the order of their dependencies.

Compute Engines

Hive supports MapReduce, Tez, and Spark as execution engines.

https://cloud.tencent.com/developer/article/1893808

https://blog.csdn.net/kwu_ganymede/article/details/52223133
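
The engine is selected with the hive.execution.engine property, which can be set per session in the Hive CLI or Beeline, for example:

set hive.execution.engine=spark;
-- valid values: mr, tez, spark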

Data Storage

https://cloud.tencent.com/developer/article/1411821

Hive is built on HDFS: its data lives in the Hadoop distributed file system. Hive has no dedicated storage format of its own and does not build indexes on the data; you only need to tell Hive the column and row delimiters when creating a table, and Hive can then parse the data.

Tables in the default database are stored under /user/hive/warehouse.
Tables in other databases use storage locations specified by the user.
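
A minimal DDL sketch showing how the delimiters and the storage location are declared (the table name, columns, and location are hypothetical):

CREATE EXTERNAL TABLE ods_log (
    user_id STRING,
    page_id STRING,
    ts      BIGINT
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/warehouse/gmall/ods/ods_log';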

Data Quality

The Challenges of Data Quality and Data Quality Assessment in the Big Data Era

https://pdfs.semanticscholar.org/0fb3/7330a4170ec63d60eec7dbb2b86e6829a3de.pdf

A Data Quality in Use model for Big Data

Automating Large-Scale Data Quality Verification

Data Sets and Data Quality in Software Engineering

Discovering Data Quality Rules

Context-aware Data Quality Assessment for Big Data

