Replica Watches: HiveSQL Essentials (Part 2: Tips and Optimization)

2021-06-30

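HiveSQL Essentials (Part 2: Tips and Optimization)

A collection of practical HiveSQL material drawn from day-to-day work: common functions and statements, usage tips and optimization, and other points to watch. It is split into two parts; this is Part 2: tips and optimization.

Nice to meet you~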

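(1) Common Hive date format handling

(2) Common Hive functions

(3) Common (practical) Hive statements

Data loading, cleanup, and table creation
Table lookup and table structure queries

(4) HiveSQL tips and optimization

(5) HiveSQL usage notes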

HiveSQL Tips and Optimization

SQL execution order: FROM -> JOIN -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT
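The logical order above can be sketched in plain Python to make it concrete. This is only an illustrative model, not how Hive executes a query: the table `rows`, its columns `dept` and `salary`, and the function `run_query` are all hypothetical names invented for this example.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a table t(dept, salary).
rows = [
    {"dept": "a", "salary": 10},
    {"dept": "a", "salary": 30},
    {"dept": "b", "salary": 5},
    {"dept": "b", "salary": 7},
    {"dept": "c", "salary": 50},
]


def run_query(table):
    """Model of: SELECT dept, SUM(salary) AS total FROM t WHERE salary > 6
    GROUP BY dept HAVING SUM(salary) > 10 ORDER BY total DESC LIMIT 2."""
    # FROM: start from the source rows (no JOIN in this sketch).
    current = list(table)
    # WHERE: filter individual rows before any grouping happens.
    current = [r for r in current if r["salary"] > 6]
    # GROUP BY: bucket the surviving rows by key.
    groups = defaultdict(list)
    for r in current:
        groups[r["dept"]].append(r["salary"])
    # HAVING: filter whole groups using an aggregate.
    kept = {k: v for k, v in groups.items() if sum(v) > 10}
    # SELECT: only now are output columns and aliases computed.
    selected = [{"dept": k, "total": sum(v)} for k, v in kept.items()]
    # ORDER BY: can use the alias because SELECT has already run.
    selected.sort(key=lambda r: r["total"], reverse=True)
    # LIMIT: truncate last.
    return selected[:2]


result = run_query(rows)
print(result)
```

One practical consequence of this order is that filters placed in WHERE shrink the data before grouping, whereas HAVING only sees the already-aggregated groups.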

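DISTINCT de-duplication and COUNT

-- DISTINCT with NULLs present can give unexpected results, because NULL is not handled like an ordinary value; convert NULL explicitly first
select distinct nvl(column1,''), nvl(column2,0) from t;

-- count(*) and count(1) count every row, including rows with NULLs; count(column_name) counts only the rows where that column is not NULL

-- Avoid count(distinct) in Hive where possible: the de-duplication cannot be partially aggregated and tends to land on a single reducer, which easily becomes a bottleneck or even causes an out-of-memory (OOM) error; use group by instead

-- count distinct
select col1, count(distinct id) as did
from t
group by col1;

-- optimized replacement using group by
select col1, count(id) as did
from (select col1, id from t group by col1, id) as temp
group by col1;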
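The counting rules and the group by rewrite can be checked on a small data set. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, on the assumption that both follow the same standard SQL semantics for these particular queries; the sample rows are made up, and the subquery alias `dedup` is chosen here for illustration. It says nothing about Hive's performance, only about result equivalence.

```python
import sqlite3

# SQLite stands in for Hive here (assumption: same NULL and COUNT semantics).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, id INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", 1), ("a", 1), ("a", 2), ("b", 3), ("b", None), (None, 4)],
)

# count(*) and count(1) count all rows; count(id) skips rows where id is NULL.
count_all, count_one, count_id = conn.execute(
    "SELECT count(*), count(1), count(id) FROM t"
).fetchone()

# Original form: count(distinct id) per col1.
distinct_counts = conn.execute(
    "SELECT col1, count(DISTINCT id) AS did FROM t GROUP BY col1 ORDER BY col1"
).fetchall()

# Rewritten form: de-duplicate with an inner group by, then count.
group_by_counts = conn.execute(
    "SELECT col1, count(id) AS did "
    "FROM (SELECT col1, id FROM t GROUP BY col1, id) AS dedup "
    "GROUP BY col1 ORDER BY col1"
).fetchall()

print(count_all, count_one, count_id)
print(distinct_counts)
print(group_by_counts)
```

Both forms ignore NULL ids, which is why the rewrite keeps `count(id)` rather than `count(*)` in the outer query.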

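Subqueries, EXISTS/IN, and LEFT SEMI JOIN

-- Subqueries: Hive only supports subqueries in the FROM and WHERE clauses
-- If the subquery result contains NULL, do not use NOT IN: under standard SQL semantics NOT IN matches no rows once the list contains a NULL, while IN is not affected
-- IN / NOT IN are not recommended; use EXISTS / NOT EXISTS instead, which also support matching on multiple columns in the subquery
-- NOT EXISTS and LEFT JOIN have equivalent forms

-- not exists
select a, b
from t1
where not exists (select 1 from t2 where t1.a = t2.a and t1.b = t2.b);

-- left join equivalent of not exists
select t1.a, t1.b
from t1
left join t2
on (t1.a = t2.a and t1.b = t2.b)
where t2.a is null;

-- LEFT SEMI JOIN as a replacement for IN and EXISTS, with better efficiency
-- LEFT SEMI JOIN is a more efficient implementation of an IN/EXISTS subquery
-- Restriction: the right-hand table may only be filtered in the ON clause; referencing it in the WHERE clause, the SELECT clause, or anywhere else is not allowed
-- LEFT SEMI JOIN only returns columns of the left table, and it drops duplicate records from the right table, so a duplicated key on the right does not multiply the joined rows

-- in/exists
SELECT a.key, a.value
FROM a
WHERE a.key in (SELECT b.key FROM b);

-- left semi join replacing in/exists
SELECT a.key, a.value
FROM a LEFT SEMI JOIN b on (a.key = b.key);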
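The NULL pitfall with NOT IN and the duplicate-free behaviour of a semi join can be seen on a tiny example. This sketch again uses Python's sqlite3 as a stand-in, assuming Hive follows the same standard SQL semantics here. SQLite has no LEFT SEMI JOIN keyword, so the semi join is expressed through IN and contrasted with a plain inner join; the sample data is hypothetical.

```python
import sqlite3

# SQLite stands in for Hive (assumption: same three-valued NULL logic).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (key INTEGER, value TEXT);
    CREATE TABLE b (key INTEGER);
    INSERT INTO a VALUES (1, 'one'), (2, 'two'), (3, 'three');
    INSERT INTO b VALUES (1), (1), (NULL);
    """
)


def q(sql):
    """Run a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# NOT IN against a list containing NULL matches nothing.
not_in = q("SELECT a.key FROM a WHERE a.key NOT IN (SELECT b.key FROM b) ORDER BY a.key")

# NOT EXISTS and the LEFT JOIN ... IS NULL form both return the unmatched keys.
not_exists = q(
    "SELECT a.key FROM a WHERE NOT EXISTS "
    "(SELECT 1 FROM b WHERE b.key = a.key) ORDER BY a.key"
)
anti_join = q(
    "SELECT a.key FROM a LEFT JOIN b ON a.key = b.key "
    "WHERE b.key IS NULL ORDER BY a.key"
)

# Semi-join behaviour (via IN): one output row per left row, even though
# key 1 appears twice on the right. A plain inner join duplicates it.
semi = q("SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM b)")
inner = q("SELECT a.key, a.value FROM a JOIN b ON a.key = b.key")

print(not_in, not_exists, anti_join, semi, inner)
```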

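SORT BY, DISTRIBUTE BY, CLUSTER BY, and ORDER BY

-- ORDER BY performs a global sort and forces the number of reducers to 1; with a single reduce task it is inefficient, and running it on a large data set can become a bottleneck with a very long reduce phase
-- In strict mode, an ORDER BY statement must also include LIMIT, because ORDER BY runs on a single reducer and an oversized result set would take a very long time to sort
set hive.mapred.mode=nonstrict; -- default in Hive 0.x and 1.x
set hive.mapred.mode=strict;
-- ORDER BY triggers a global sort; it is fine for small data (in Hive, avoid ORDER BY unless you are sure the result set is very small)
-- In practice it is usually more efficient to use SORT BY first and ORDER BY afterwards: use DISTRIBUTE BY and SORT BY to sort within groups, where the SORT BY stage can run with n reducers, then let ORDER BY do one final full sort over the n reducer outputs

-- sort by & distribute by
-- SORT BY only guarantees ordering within a single reducer
select * from baidu_click distribute by product_line sort by click desc;
select * from t distribute by id sort by id;
-- DISTRIBUTE BY controls how map output is assigned to reducers: in the statement above, rows with the same id go to the same reducer, and SORT BY then sorts the ids on each reducer (the DISTRIBUTE BY column is the key, rows are hash-distributed across reducers, and SORT BY sorts the data locally on each reducer)

-- cluster by (a replacement for distribute by + sort by)
-- When DISTRIBUTE BY and SORT BY use exactly the same columns, the pair is equivalent to CLUSTER BY; however, CLUSTER BY always sorts ascending and does not accept ASC or DESC
-- CLUSTER BY and DISTRIBUTE BY are very similar; the main difference is that CLUSTER BY also sorts the rows inside each bucket it produces
select * from emp cluster by deptno;
select * from emp distribute by deptno sort by deptno;

-- Two common efficient sorting patterns
-- First reduce the data to a small result set with a GROUP BY subquery, then sort that result set globally
select * from (
select id, count(id) as cnt
from t
group by id
) as temp
order by temp.cnt;

-- Efficient top-N
-- Take the top n of each partial result set first, then take the global top n
select temp.id, temp.salary
from (select id, salary from t1 sort by salary desc limit 10) as temp
order by temp.salary desc limit 10;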
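The distribute-then-sort idea and the two-stage top-N can be modelled in a few lines of Python. This is a simplified simulation, not Hive's implementation: `distribute_by`, `sort_by`, and `top_n` are hypothetical helper names, reducers are plain lists, and the hash function is replaced by a simple modulo on an integer key so the output is deterministic.

```python
import heapq


def distribute_by(rows, key, num_reducers):
    """Send each row to a reducer chosen from its key, like DISTRIBUTE BY.
    A modulo on an integer key stands in for Hive's hash partitioning."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[key(row) % num_reducers].append(row)
    return reducers


def sort_by(reducers, key, reverse=False):
    """Sort inside each reducer only, like SORT BY (no global order)."""
    return [sorted(part, key=key, reverse=reverse) for part in reducers]


def top_n(rows, n, num_reducers):
    """Two-stage top-N: local top n per reducer, then a global top n."""
    parts = sort_by(
        distribute_by(rows, lambda r: r[0], num_reducers),
        key=lambda r: r[1],
        reverse=True,
    )
    # Only n rows per reducer reach the final, single-threaded stage.
    candidates = [row for part in parts for row in part[:n]]
    return heapq.nlargest(n, candidates, key=lambda r: r[1])


# Hypothetical (id, salary) rows.
rows = [(1, 300), (2, 900), (3, 100), (4, 700), (5, 500), (6, 800)]
parts = sort_by(distribute_by(rows, lambda r: r[0], 2), key=lambda r: r[1], reverse=True)
top3 = top_n(rows, 3, 2)
print(parts)
print(top3)
```

The point of the second stage is that the final global sort sees at most n rows per reducer instead of the whole table, which is what keeps the single-reducer step cheap.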

HiveSQL Usage Notes

创......
Reposted from: http://www.shaoqun.com/a/836132.html