Common LEFT JOIN Pitfalls in HiveSQL

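Preface

SQL logic is often complex, and when the execution semantics are not fully understood it is easy to get results that look baffling. They seem to contradict expectations, yet they are exactly what the query asked for. This article collects a few common SQL problems of this kind that deserve attention when writing SQL scripts in practice. I hope you find it helpful.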

About LEFT JOIN

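Outer joins are among the multi-table joins we use most often when writing SQL, and they are very simple to use. But the simpler something is, the easier it is to overlook its details. LEFT JOIN is usually understood like this: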

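Rows that satisfy the JOIN ON condition are returned directly. For a left row with no match, the LEFT OUTER JOIN still returns all columns of the left table and fills every column of the right table with NULL.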
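As a quick illustration of that definition, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive. The tables `l` and `r` and their rows are made up for this example, and it assumes only that both engines follow the standard LEFT JOIN semantics.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive (assumption:
# the standard LEFT JOIN semantics are the same in both engines).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE l (id INTEGER, value INTEGER);  -- hypothetical left table
    CREATE TABLE r (id INTEGER, value INTEGER);  -- hypothetical right table
    INSERT INTO l VALUES (1, 10), (2, 20);
    INSERT INTO r VALUES (1, 100);
""")

rows = conn.execute("""
    SELECT l.id, l.value, r.id, r.value
    FROM l
    LEFT JOIN r ON l.id = r.id
    ORDER BY l.id
""").fetchall()

# The left row with id = 2 has no partner in r, so its right-side
# columns come back as None (SQL NULL).
for row in rows:
    print(row)
```

Both left rows appear in the output; only the right-side columns differ between the matched and the unmatched row.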

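This understanding of LEFT JOIN is perfectly correct, but it hides a pitfall: predicate pushdown. Consider the following example.

Suppose we have the following three tables: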

    -- Create tables
    create table t1(id int, value int) partitioned by (ds string);
    create table t2(id int, value int) partitioned by (ds string);
    create table t3(c1 int, c2 int, c3 int);

    -- Load data into t1
    insert overwrite table t1 partition(ds=20220120) select 1,2022;
    insert overwrite table t1 partition(ds=20220121) select 2,2022;
    insert overwrite table t1 partition(ds=20220122) select 2,2022;

    -- Load data into t2
    insert overwrite table t2 partition(ds=20220120) select 1,120;

What does the following query return?

    SELECT *
    FROM t1
    LEFT JOIN t2
    ON t1.id = t2.id
    AND t1.ds = 20220120;

Result 1:

    1 2022 20220120 1 120 20220120

Result 2:

    1 2022 20220120 1 120 20220120
    2 2022 20220121 NULL NULL NULL
    2 2022 20220122 NULL NULL NULL

Many beginners, and even experienced developers, would pick result 1. In fact result 1 is wrong; the query actually returns result 2.
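The behaviour can be checked outside Hive as well. The sketch below replays the query with Python's built-in sqlite3 module. It rests on a few assumptions: SQLite stands in for Hive, `ds` is modeled as an ordinary TEXT column because SQLite has no partitions, and the literal is quoted to avoid relying on implicit type conversion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- ds is a plain column here; SQLite has no Hive-style partitions.
    CREATE TABLE t1 (id INTEGER, value INTEGER, ds TEXT);
    CREATE TABLE t2 (id INTEGER, value INTEGER, ds TEXT);
    INSERT INTO t1 VALUES (1, 2022, '20220120'),
                          (2, 2022, '20220121'),
                          (2, 2022, '20220122');
    INSERT INTO t2 VALUES (1, 120, '20220120');
""")

# The condition on t1.ds sits in the ON clause, so it is part of the
# join condition and cannot remove rows of the left table.
on_rows = conn.execute("""
    SELECT *
    FROM t1
    LEFT JOIN t2
      ON t1.id = t2.id
     AND t1.ds = '20220120'
    ORDER BY t1.ds
""").fetchall()

for row in on_rows:
    print(row)
```

All three rows of t1 come back; only the row from partition 20220120 carries values from t2.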

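Is that different from what you expected? Many beginners assume that AND t1.ds = 20220120 in the query above will be pushed down as a predicate, filtering t1 before the join and producing result 1. That is not what the SQL means, though. To get the data in result 1, the query should be written in one of the following ways: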

    -- Option 1:
    SELECT *
    FROM t1
    LEFT OUTER JOIN t2
    ON t1.id = t2.id
    WHERE t1.ds = 20220120;

    -- Option 2:
    SELECT *
    FROM (
        SELECT *
        FROM t1
        WHERE ds = 20220120
    ) t1
    LEFT OUTER JOIN t2
    ON t1.id = t2.id;

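Did you spot the difference? The key is the filter WHERE t1.ds = 20220120. The first query put the condition in the ON clause (AND t1.ds = 20220120), where it is part of the join condition rather than a filter. Under LEFT JOIN semantics, when there is no filter every row of the left table must be returned, with NULL filled in for the right table wherever no match is found.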
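To confirm that the WHERE form and the subquery form are interchangeable, here is a small sketch under the same assumptions as before: Python's built-in sqlite3 module stands in for Hive, `ds` is an ordinary TEXT column rather than a partition, and the literal is quoted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- ds is a plain column here; SQLite has no Hive-style partitions.
    CREATE TABLE t1 (id INTEGER, value INTEGER, ds TEXT);
    CREATE TABLE t2 (id INTEGER, value INTEGER, ds TEXT);
    INSERT INTO t1 VALUES (1, 2022, '20220120'),
                          (2, 2022, '20220121'),
                          (2, 2022, '20220122');
    INSERT INTO t2 VALUES (1, 120, '20220120');
""")

# Filter in WHERE: applied to the joined rows, so left rows can be dropped.
where_rows = conn.execute("""
    SELECT *
    FROM t1
    LEFT OUTER JOIN t2
      ON t1.id = t2.id
    WHERE t1.ds = '20220120'
""").fetchall()

# Filter in a subquery: the left input is reduced before the join.
subquery_rows = conn.execute("""
    SELECT *
    FROM (SELECT * FROM t1 WHERE ds = '20220120') t1
    LEFT OUTER JOIN t2
      ON t1.id = t2.id
""").fetchall()

print(where_rows)
print(subquery_rows)
```

Both queries return the single row from partition 20220120, which is result 1.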

Execution plans

Let's first look at the execution plan of the query without predicate pushdown.

Plain LEFT JOIN

View the execution plan

    EXPLAIN
    SELECT *
    FROM t1
    LEFT JOIN t2
    ON t1.id = t2.id
    AND t1.ds = 20220120;

Execution plan output

    hive> EXPLAIN
        > SELECT *
        > FROM t1
        > LEFT JOIN t2
        > ON t1.id = t2.id
        > AND t1.ds = 20220120
        > ;
    OK
    STAGE DEPENDENCIES:
      Stage-4 is a root stage
      Stage-3 depends on stages: Stage-4
      Stage-0 depends on stages: Stage-3

    STAGE PLANS:
      Stage: Stage-4
        Map Reduce Local Work
          Alias -> Map Local Tables:
            $hdt$_1:t2
              Fetch Operator
                limit: -1
          Alias -> Map Local Operator Tree:
            $hdt$_1:t2
              TableScan
                alias: t2
                Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: id (type: int), value (type: int), ds (type: string)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
                  HashTable Sink Operator
                    filter predicates:
                      0 {(_col2 = 20220120)}
                      1
                    keys:
                      0 _col0 (type: int)
                      1 _col0 (type: int)

      Stage: Stage-3
        Map Reduce
          Map Operator Tree:
              TableScan
                alias: t1
                Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: id (type: int), value (type: int), ds (type: string)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE
                  Map Join Operator
                    condition map:
                         Left Outer Join0 to 1
                    filter predicates:
                      0 {(_col2 = 20220120)}
                      1
                    keys:
                      0 _col0 (type: int)
                      1 _col0 (type: int)
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                    Statistics: Num rows: 3 Data size: 19 Basic stats: COMPLETE Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 3 Data size: 19 Basic stats: COMPLETE Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
          Local Work:
            Map Reduce Local Work

      Stage: Stage-0
        Fetch Operator
          limit: -1
          Processor Tree:
            ListSink

The execution plan shows three stages in total:

    STAGE DEPENDENCIES:
      Stage-4 is a root stage
      Stage-3 depends on stages: Stage-4
      Stage-0 depends on stages: Stage-3

Stage-4 is a local task (Map Reduce Local Work) that reads t2 and loads it into a hash table for the map-side join. t2 contains 1 row.

    Select Operator
      expressions: id (type: int), value (type: int), ds (type: string)
      outputColumnNames: _col0, _col1, _col2
      Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
      HashTable Sink Operator

Stage-3 is the map task that reads t1 and performs the map-side join. It reads all 3 rows of t1, so no filter has been applied to the table.

    Map Operator Tree:
        TableScan
          alias: t1
          Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: int), value (type: int), ds (type: string)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE

Stage-0 outputs the result. No rows were filtered out along the way.

    Stage: Stage-0
      Fetch Operator
        limit: -1
        Processor Tree:
          ListSink
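The `filter predicates: 0 {(_col2 = 20220120)}` entry of the Map Join Operator hints at why all three rows survive. One rough mental model, which is an illustrative assumption and not Hive's actual operator code, is that the left-side condition from the ON clause only decides whether a left row may look for matches; a row that fails it is still emitted, padded with NULLs. The function name `left_outer_map_join` and the tuple layout below are hypothetical.

```python
def left_outer_map_join(left_rows, right_rows, left_filter):
    """Toy model of a left outer map join with an ON-clause filter on the left side.

    Rows are (id, value, ds) tuples. Illustrative sketch only,
    not Hive's actual implementation.
    """
    # Build phase: hash the small (right) table on the join key, as Stage-4 does.
    hash_table = {}
    for row in right_rows:
        hash_table.setdefault(row[0], []).append(row)

    # Probe phase: stream the left table, as Stage-3 does.
    output = []
    for row in left_rows:
        # The ON-clause filter only controls whether matching is attempted.
        matches = hash_table.get(row[0], []) if left_filter(row) else []
        if matches:
            output.extend(row + match for match in matches)
        else:
            # Left rows are never dropped; unmatched ones are padded with NULLs.
            output.append(row + (None, None, None))
    return output


t1 = [(1, 2022, "20220120"), (2, 2022, "20220121"), (2, 2022, "20220122")]
t2 = [(1, 120, "20220120")]

result = left_outer_map_join(t1, t2, lambda row: row[2] == "20220120")
for row in result:
    print(row)
```

In this model a left row whose id matches t2 but whose ds fails the condition would also be kept with NULLs on the right, which is the behaviour that separates an ON condition from a WHERE filter.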

LEFT JOIN with predicate pushdown

View the execution plan

    EXPLAIN
    SELECT *
    FROM t1
    LEFT OUTER JOIN t2
    ON t1.id = t2.id
    WHERE t1.ds = 20220120;

Execution plan output

    STAGE DEPENDENCIES:
      Stage-4 is a root stage
      Stage-3 depends on stages: Stage-4
      Stage-0 depends on stages: Stage-3

    STAGE PLANS:
      Stage: Stage-4
        Map Reduce Local Work
          Alias -> Map Local Tables:
            $hdt$_1:t2
              Fetch Operator
                limit: -1
          Alias -> Map Local Operator Tree:
            $hdt$_1:t2
              TableScan
                alias: t2
                Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: id (type: int), value (type: int), ds (type: string)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
                  HashTable Sink Operator
                    keys:
                      0 _col0 (type: int)
                      1 _col0 (type: int)

      Stage: Stage-3
        Map Reduce
          Map Operator Tree:
              TableScan
                alias: t1
                Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: id (type: int), value (type: int)
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
                  Map Join Operator
                    condition map:
                         Left Outer Join0 to 1
                    keys:
                      0 _col0 (type: int)
                      1 _col0 (type: int)
                    outputColumnNames: _col0, _col1, _col3, _col4, _col5
                    Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: _col0 (type: int), _col1 (type: int), 20220120 (type: string), _col3 (type: int), _col4 (type: int), _col5 (type: string)
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                      Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
                      File Output Operator
                        compressed: false
                        Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
                        table:
                            input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
          Local Work:
            Map Reduce Local Work

      Stage: Stage-0
        Fetch Operator
          limit: -1
          Processor Tree:
            ListSink

The execution plan again shows three stages in total:

    STAGE DEPENDENCIES:
      Stage-4 is a root stage
      Stage-3 depends on stages: Stage-4
      Stage-0 depends on stages: Stage-3

Stage-4 is a local task (Map Reduce Local Work) that reads t2 and loads it into a hash table for the map-side join. t2 contains 1 row.

    TableScan
      alias: t2
      Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
      Select Operator
        expressions: id (type: int), value (type: int), ds (type: string)
        outputColumnNames: _col0, _col1, _col2
        Statistics: Num rows: 1 Data size: 5 Basic stats: COMPLETE Column stats: NONE
        HashTable Sink Operator

Stage-3 is the map task that reads t1 and performs the map-side join. Only 1 row of t1 is read, so the filter on ds was applied before the join.

    TableScan
      alias: t1
      Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
      Select Operator
        expressions: id (type: int), value (type: int)
        outputColumnNames: _col0, _col1
        Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
        Map Join Operator
          condition map:
               Left Outer Join0 to 1
          keys:
            0 _col0 (type: int)
            1 _col0 (type: int)
          outputColumnNames: _col0, _col1, _col3, _col4, _col5
          Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE

Stage-0 outputs the result and applies no further filtering.

    Stage: Stage-0
      Fetch Operator
        limit: -1
        Processor Tree:
          ListSink
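The decisive difference between the two plans is the row estimate on the t1 TableScan (3 versus 1). When comparing longer plans, a small helper can pull those numbers out of the EXPLAIN text. The sketch below is a hypothetical convenience function, not part of Hive, and it assumes the plan is laid out with `alias:` and `Statistics:` each on its own line, as in the excerpts above.

```python
import re


def scan_row_estimates(plan_text):
    """Map each TableScan alias in an EXPLAIN output to its 'Num rows' estimate."""
    estimates = {}
    alias = None
    for raw_line in plan_text.splitlines():
        line = raw_line.strip()
        if line.startswith("alias:"):
            alias = line.split(":", 1)[1].strip()
        elif alias is not None and line.startswith("Statistics:"):
            match = re.search(r"Num rows: (\d+)", line)
            if match:
                estimates[alias] = int(match.group(1))
            alias = None  # only the first Statistics line belongs to the scan
    return estimates


# Excerpts of the two Stage-3 plans shown above.
on_plan = """
TableScan
  alias: t1
  Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE
  Select Operator
    Statistics: Num rows: 3 Data size: 18 Basic stats: COMPLETE Column stats: NONE
"""
where_plan = """
TableScan
  alias: t1
  Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
"""

print(scan_row_estimates(on_plan))
print(scan_row_estimates(where_plan))
```

A scan estimate that does not shrink after a condition is added is a hint that the condition was not applied as a filter on that table.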

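Summary

This article used concrete examples to explain the LEFT JOIN operation in HiveSQL in detail. It covered two common ways of writing a LEFT JOIN. The first is the plain LEFT JOIN with only an ON condition: nothing is filtered, so every row of the left table is returned. The second involves predicate pushdown: the join is combined with a WHERE condition, and the data is filtered. Pay close attention to details like these when writing SQL, to avoid unexpectedly wrong results.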
