HQL CRUD operations

1 Insert

insert

load

2 Delete

1 Tables

DROP removes the table entirely, structure included

DROP TABLE IF EXISTS employee;

2 Records

There is no row-level DELETE in classic Hive (DELETE is only supported on ACID transactional tables); to remove records, use:

TRUNCATE

Removes all records:

truncate table employees;

INSERT OVERWRITE rewrites the table so that only the rows returned by the SELECT remain, which effectively deletes the rest:

INSERT OVERWRITE TABLE dpc_test SELECT * FROM dpc_test WHERE age is not null;

3 Update

1 update

Operates on records (UPDATE requires an ACID transactional table)

update student set id='444' where name='tom';

2 Alter

Operates on the table structure

4 Select

select

Hive optimization

https://blog.csdn.net/yu0_zhang0/article/details/81776459

1 Indexes

https://www.jianshu.com/p/28b825367ba1

https://www.jianshu.com/p/d53f528daca7

The goal of Hive indexes is to speed up queries that filter on certain columns of a table.

The query speedup an index provides comes at the cost of the disk space needed to store the index.

Hive 3.0 removed the index feature. Its replacement is the materialized view, available since Hive 2.3: materialized views with automatic query rewriting cover the use cases that indexes served.

2 Materialized views

https://blog.csdn.net/u011447164/article/details/105790713

Unlike an ordinary view, a materialized view physically stores its query result:

create materialized view view2
as
select dept.deptno,dept.dname,emp.ename
from emp,dept
where emp.deptno=dept.deptno;
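
The optimizer can then rewrite matching queries to read from the materialized view instead of the base tables. A minimal sketch of turning rewriting on, assuming Hive 3.x (the property and statement are standard; view2 is the view created above):

-- allow the optimizer to rewrite queries against materialized views
set hive.materializedview.rewriting=true;
-- mark this view as eligible for automatic rewriting
alter materialized view view2 enable rewrite;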

Common HQL operations

1 with…as…

https://www.jianshu.com/p/d518e9f5d5f9

1 Benefits

a. Improves code readability

   the structure is clearer

b. Can speed up execution

   the subquery result can be kept and reused instead of being recomputed

2 Usage

with table_name as (subquery) rest_of_the_query;
with temp as (
select * from xxx
)
select * from temp;

2 Views

Unlike a base table, a view is a virtual table. The database stores only the view's definition, not the data the view presents; that data still lives in the underlying base tables.

Views are read-only: you cannot insert, load, or change data through a view.

Purpose:

1 Convenience

With views, users can focus on just the data they care about rather than the full dataset, which greatly improves efficiency and satisfaction. When that data spans multiple base tables, or comes partly from other views, and the search conditions are complex, the queries become tedious to write; defining a view makes such queries simple and practical.

2 Security

A view hides the complex joins and search conditions between tables from users, who only need to query the view itself. This increases data security, but it does not improve query performance.

Create

CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT table_comment]
AS SELECT ...

hive> CREATE VIEW emp_30000 AS
> SELECT * FROM employee
> WHERE salary>30000;

Drop

DROP VIEW view_name

Functions

1 Built-in functions

https://www.studytime.xin/article/hive-knowledge-function.html

1 get_json_object

https://blog.csdn.net/weixin_43412569/article/details/105290637
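
Extracts a value from a JSON string by path. A quick illustration (the JSON literal is made up for the example):

select get_json_object('{"name":"tom","age":18}', '$.age');   -- returns '18' (as a string)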

2 nvl

Null-handling function: nvl(value, default_value) returns default_value when value is NULL, otherwise value.

https://blog.csdn.net/a850661962/article/details/101209028
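
A minimal sketch (the emp table and comm column are made up for the example):

select nvl(comm, 0) from emp;   -- returns 0 for rows where comm is NULL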

3 coalesce

https://blog.csdn.net/yilulvxing/article/details/86595725

select coalesce(success_cnt, period, 1) from tableA;

When success_cnt is not null, its actual value is returned regardless of whether period is null (success_cnt is the first argument). When success_cnt is null and period is not, the actual value of period is returned. Only when both success_cnt and period are null is 1 returned.

4 collect_list and collect_set

https://blog.csdn.net/weixin_30230059/article/details/113324945

https://blog.csdn.net/qq_44104303/article/details/117551807

Both collect a column's values within each group into an array; the difference is that collect_list keeps duplicates while collect_set removes them.
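
A minimal sketch, assuming a table orders(user_id, item) invented for the example:

select user_id,
       collect_list(item) as items_all,    -- keeps duplicates
       collect_set(item)  as items_dedup   -- duplicates removed
from orders
group by user_id;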

5 named_struct

https://blog.csdn.net/weixin_43597208/article/details/117554838

Assembles fields into a named struct

Distinct from struct: STRUCT is a collection data type, generally used when defining tables, whereas named_struct is a function that assembles fields into a struct, generally used in queries.
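
A minimal sketch (the student table and its columns are made up):

select named_struct('id', id, 'name', name) as stu_info from student;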

6 array_contains()

array_contains(array, value)

Returns true if the array contains the given value, false otherwise.
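
For example:

select array_contains(array(1, 2, 3), 2);   -- true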

7 cast

https://www.jianshu.com/p/999176fa2730

Explicitly converts a value from one data type to another.

cast(column as target_type)
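
For example:

select cast('123' as int);   -- 123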

2 User-defined functions

UDF (User-Defined Function): one row in, one row out
UDTF (User-Defined Table-Generating Function): one row in, multiple rows out
UDAF (User-Defined Aggregation Function): multiple rows in, one row out
https://blog.csdn.net/qq_40579464/article/details/105903405

1. Write the code

The jar cannot be written arbitrarily: it must match Hive's interface, via import org.apache.hadoop.hive.ql.exec.UDF;

1. public class ClassName extends UDF
2. implement an evaluate() method

https://blog.csdn.net/eyeofeagle/article/details/83904147

2. Package it into a jar
3. Import it into Hive:
   copy it to HDFS, or into
   the lib directory under the Hive installation directory
4. Register it with Hive:
add jar hdfs://localhost:9000/user/root/hiveudf.jar;
create temporary function my_lower as 'com.example.hive.udf.LowerCase';
5. Use it

In HQL a UDF is used just like a built-in function; for example, given a UDF named udf1:

select udf1(col1) from table1

Loading data

https://www.cnblogs.com/bjlhx/p/6946422.html

https://blog.csdn.net/m0_49092046/article/details/109251015

1 load

load data [local] inpath '/opt/module/datas/student.txt' [overwrite] into table tabName [partition (partcol1=val1,…)];

(1) load data: load data
(2) local: load from the local file system into the Hive table; without it, load from HDFS
(3) inpath: the path of the data to load
(4) overwrite: overwrite the data already in the table; otherwise append
(5) into table: the table to load into
(6) tabName: the specific table name
(7) partition: load into the specified partition

Example:

load data inpath '/origin_data/gmall/log/topic_log/2020-06-14' into table ods_log partition(dt='2020-06-14')

2 INSERT

https://help.aliyun.com/document_detail/73775.html

insert into vs. insert overwrite:

-- 1. insert into
insert into table student partition(month='20201022') values(1,'zhangsan');
-- 2. insert overwrite
insert overwrite table student partition(month='20201023')
select id, name from student where month='20201023';

Creating tables

https://www.jianshu.com/p/4f60f3c923fe

0 CREATE TABLE

https://blog.csdn.net/Thomson617/article/details/86153924

CREATE EXTERNAL TABLE dim_sku_info (
`id` STRING COMMENT '商品id',
`price` DECIMAL(16,2) COMMENT '商品价格',
`sku_name` STRING COMMENT '商品名称',
`sku_desc` STRING COMMENT '商品描述',
`weight` DECIMAL(16,2) COMMENT '重量',
`is_sale` BOOLEAN COMMENT '是否在售',
`spu_id` STRING COMMENT 'spu编号',
`spu_name` STRING COMMENT 'spu名称',
`category3_id` STRING COMMENT '三级分类id',
`category3_name` STRING COMMENT '三级分类名称',
`category2_id` STRING COMMENT '二级分类id',
`category2_name` STRING COMMENT '二级分类名称',
`category1_id` STRING COMMENT '一级分类id',
`category1_name` STRING COMMENT '一级分类名称',
`tm_id` STRING COMMENT '品牌id',
`tm_name` STRING COMMENT '品牌名称',
`sku_attr_values` ARRAY<STRUCT<attr_id:STRING,value_id:STRING,attr_name:STRING,value_name:STRING>> COMMENT '平台属性',
`sku_sale_attr_values` ARRAY<STRUCT<sale_attr_id:STRING,sale_attr_value_id:STRING,sale_attr_name:STRING,sale_attr_value_name:STRING>> COMMENT '销售属性',
`create_time` STRING COMMENT '创建时间'
) COMMENT '商品维度表'

1 EXTERNAL

This keyword lets the user create an external table; by default a table is internal (managed).
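
A minimal sketch of the practical difference (names and the path are made up): dropping a managed table deletes both metadata and data, while dropping an external table deletes only the metadata and leaves the files on HDFS.

create table managed_student (id int, name string);
create external table external_student (id int, name string)
location '/warehouse/test/external_student';
-- drop table managed_student;    -- metadata and data both deleted
-- drop table external_student;   -- metadata deleted, files under the location remain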

2 Column data types

https://blog.csdn.net/weixin_46941961/article/details/108551512

https://blog.csdn.net/weixin_43215250/article/details/90034169

Collection data types: ARRAY, MAP, and STRUCT
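
A minimal sketch of declaring and accessing the three collection types (table and column names are made up):

create table t_complex (
  names array<string>,
  props map<string,string>,
  addr  struct<city:string,zip:string>
);
select names[0], props['k1'], addr.city from t_complex;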

1. Partitions

https://www.jianshu.com/p/5dbbaea8ff41

PARTITIONED BY (dt string)

0 Categories

Static partitions (SP, static partition)
Dynamic partitions (DP, dynamic partition)

The main difference is that static partitions are specified by hand, while dynamic partitions are determined from the data.

1 Static partitions

-- create the table
DROP TABLE IF EXISTS dwd_display_log;
CREATE EXTERNAL TABLE dwd_display_log(
`area_code` STRING COMMENT '地区编码',
`brand` STRING COMMENT '手机品牌',
`channel` STRING COMMENT '渠道',
`is_new` STRING COMMENT '是否首次启动',
`model` STRING COMMENT '手机型号',
`mid_id` STRING COMMENT '设备id',
`os` STRING COMMENT '操作系统',
`user_id` STRING COMMENT '会员id',
`version_code` STRING COMMENT 'app版本号',
`during_time` BIGINT COMMENT '持续时间',
`page_item` STRING COMMENT '目标id ',
`page_item_type` STRING COMMENT '目标类型',
`last_page_id` STRING COMMENT '上页类型',
`page_id` STRING COMMENT '页面ID ',
`source_type` STRING COMMENT '来源类型',
`ts` BIGINT COMMENT '时间戳',
`display_type` STRING COMMENT '曝光类型',
`item` STRING COMMENT '曝光对象id ',
`item_type` STRING COMMENT '曝光对象类型',
`order` BIGINT COMMENT '曝光顺序',
`pos_id` BIGINT COMMENT '曝光位置'
) COMMENT '曝光日志表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_display_log'
TBLPROPERTIES('parquet.compression'='lzo');
-- load data
insert overwrite table dwd_display_log partition(dt='2020-06-14')
select
get_json_object(line,'$.common.ar'),
get_json_object(line,'$.common.ba'),
get_json_object(line,'$.common.ch'),
get_json_object(line,'$.common.is_new'),
get_json_object(line,'$.common.md'),
get_json_object(line,'$.common.mid'),
get_json_object(line,'$.common.os'),
get_json_object(line,'$.common.uid'),
get_json_object(line,'$.common.vc'),
get_json_object(line,'$.page.during_time'),
get_json_object(line,'$.page.item'),
get_json_object(line,'$.page.item_type'),
get_json_object(line,'$.page.last_page_id'),
get_json_object(line,'$.page.page_id'),
get_json_object(line,'$.page.source_type'),
get_json_object(line,'$.ts'),
get_json_object(display,'$.display_type'),
get_json_object(display,'$.item'),
get_json_object(display,'$.item_type'),
get_json_object(display,'$.order'),
get_json_object(display,'$.pos_id')
from ods_log lateral view explode_json_array(get_json_object(line,'$.displays')) tmp as display
where dt='2020-06-14'
and get_json_object(line,'$.displays') is not null;

2 Dynamic partitions

Note that the partition column dt is derived from date_format(create_time,'yyyy-MM-dd').

Compared with static partitioning, table creation is identical; the difference is in how data is loaded, as the example below shows.
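
Dynamic partition inserts usually need these session settings first (standard Hive configuration properties; the values shown are the usual choices):

set hive.exec.dynamic.partition=true;           -- enable dynamic partitioning
set hive.exec.dynamic.partition.mode=nonstrict; -- allow all partition columns to be dynamic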

-- create the table
DROP TABLE IF EXISTS dwd_comment_info;
CREATE EXTERNAL TABLE dwd_comment_info(
`id` STRING COMMENT '编号',
`user_id` STRING COMMENT '用户ID',
`sku_id` STRING COMMENT '商品sku',
`spu_id` STRING COMMENT '商品spu',
`order_id` STRING COMMENT '订单ID',
`appraise` STRING COMMENT '评价(好评、中评、差评、默认评价)',
`create_time` STRING COMMENT '评价时间'
) COMMENT '评价事实表'
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/warehouse/gmall/dwd/dwd_comment_info/'
TBLPROPERTIES ("parquet.compression"="lzo");
-- load data
insert overwrite table dwd_comment_info partition (dt)
select
id,
user_id,
sku_id,
spu_id,
order_id,
appraise,
create_time,
date_format(create_time,'yyyy-MM-dd')
from ods_comment_info
where dt='2020-06-14';

2 LOCATION

LOCATION '/warehouse/gmall/ods/ods_log'

Specifies where the table's data is stored on HDFS.

3 ROW FORMAT

https://www.imooc.com/article/12213

https://blog.csdn.net/S_Running_snail/article/details/84258162

Specifies how the data is delimited.

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
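
In context, a minimal sketch (the table name is made up):

create table t_csv (id int, name string)
row format delimited
  fields terminated by ','
  lines terminated by '\n'
stored as textfile;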

4 STORED AS

https://blog.csdn.net/ZZQHELLO2018/article/details/106175887

Specifies the storage format.

Row-oriented formats: TEXTFILE, SEQUENCEFILE. Column-oriented formats: ORC, PARQUET.

5 TBLPROPERTIES

https://blog.csdn.net/yangguosb/article/details/83651073

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable

TBLPROPERTIES holds table-level properties. Hive predefines some, and users can also define their own when creating a table.

TBLPROPERTIES ("parquet.compression"="lzo");

Hive MetaStore

1 Description

Hive MetaStore - a central repository that stores the structure information of the tables and partitions in the warehouse, including column and type metadata, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.

2 Hive metadata storage (three Metastore deployment modes)

Embedded, Local, Remote

https://blog.csdn.net/epitomizelu/article/details/117091656

https://zhuanlan.zhihu.com/p/473378621

https://blog.csdn.net/qq_40990732/article/details/80914873

3 The Hive metadata database

https://blog.csdn.net/victorzzzz/article/details/81874674
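
When the metastore is backed by MySQL, the metadata can be inspected directly in that database; a sketch assuming the standard metastore schema (tables such as TBLS and DBS):

-- run against the metastore's MySQL database, not inside Hive
select d.NAME as db_name, t.TBL_NAME, t.TBL_TYPE
from TBLS t join DBS d on t.DB_ID = d.DB_ID;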

Hive vs. traditional databases

                     Hive                         Traditional database
Query language       HQL                          SQL
Data storage         HDFS                         raw device or local FS
Data format          user-defined                 system-determined
Data updates         not supported                supported
Execution            MapReduce                    Executor
Execution latency    high                         low
Data scale           large                        small
Indexes              bitmap indexes since 0.8     sophisticated indexes
Scalability          high                         limited

https://cloud.tencent.com/developer/article/1785857

Hive architecture

https://cwiki.apache.org/confluence/display/hive/design#Design-HiveArchitecture

https://zhuanlan.zhihu.com/p/87545980

https://blog.csdn.net/oTengYue/article/details/91129850

https://jiamaoxiang.top/2020/06/27/Hive%E7%9A%84%E6%9E%B6%E6%9E%84%E5%89%96%E6%9E%90/

https://www.javatpoint.com/hive-architecture

Hive Client

Hive allows writing applications in various languages, including Java, Python, and C++. It supports several types of clients:

  • Thrift Server - a cross-language service provider platform that serves requests from all programming languages that support Thrift.
  • JDBC Driver - used to establish a connection between Hive and Java applications. The JDBC driver is the class org.apache.hadoop.hive.jdbc.HiveDriver.
  • ODBC Driver - It allows the applications that support the ODBC protocol to connect to Hive.

Hive Services

Hive provides the following services:

  • Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
  • Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
  • Hive MetaStore - a central repository that stores the structure information of the tables and partitions in the warehouse, including column and type metadata, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
  • Hive Server - also referred to as the Apache Thrift Server. It accepts requests from different clients and forwards them to the Hive Driver.
  • Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and JDBC/ODBC driver. It transfers the queries to the compiler.
  • Hive Compiler - The purpose of the compiler is to parse the query and perform semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
  • Hive Execution Engine - the optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks; the execution engine then runs these tasks in the order of their dependencies.

Execution engines

Hive supports MapReduce, Tez, and Spark; see the sketch after the links below.

https://cloud.tencent.com/developer/article/1893808

https://blog.csdn.net/kwu_ganymede/article/details/52223133
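
Switching engines is a session-level setting (hive.execution.engine is the standard property; the chosen engine must be installed and configured):

set hive.execution.engine=mr;    -- MapReduce
set hive.execution.engine=tez;   -- Tez
set hive.execution.engine=spark; -- Spark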

Data storage

https://cloud.tencent.com/developer/article/1411821

Hive is built on HDFS; its data lives in the Hadoop distributed file system. Hive has no storage format of its own and builds no indexes on the data; you only need to tell Hive the column and row delimiters when creating a table, and Hive can then parse the data.

Tables in the default database are stored under /user/hive/warehouse; for tables in other databases, you specify the location yourself.

