MapReduce-like systems (Shark/Hive) target flexible and large-scale computation, supporting complex User Defined Functions (UDFs), tolerating failures, and scaling to thousands of nodes. Traditional MPP databases are strictly SQL compliant and heavily optimized for relational queries. The workload here is simply one set of queries that most of these systems can complete.

We changed the Hive configuration from Hive 0.10 on CDH4 to Hive 0.12 on HDP 2.0.6. As a result, direct comparisons between the current and previous Hive results should not be made. It is difficult to account for changes resulting from modifications to Hive as opposed to changes in the underlying Hadoop distribution.

We have added Tez as a supported platform. It is important to note that Tez is currently in a preview state.

Hive has improved its query optimization, which is also inherited by Shark. This set of queries does not test the improved optimizer.

We have changed the underlying filesystem from Ext3 to Ext4 for Hive, Tez, Impala, and Shark benchmarking.

This work builds on the benchmark developed by Pavlo et al. In particular, it uses the schema and queries from that benchmark. However, results obtained with this software are not directly comparable with results in the Pavlo et al. paper, because we use different datasets, a different data generator, and have modified one of the queries (query 4 below).

Our dataset and queries are inspired by the benchmark contained in a comparison of approaches to large scale analytics. The input dataset consists of a set of unstructured HTML documents and two SQL tables which contain summary information. It was generated using Intel's Hadoop benchmark tools and data sampled from the Common Crawl document corpus. There are three datasets with the following schemas:

For Impala, Hive, Tez, and Shark, this benchmark uses the m2.4xlarge EC2 instance type.