Take advantage of Spark 2.0
Nive Biljani, Business Analyst.
Apache Spark is one of the fastest-developing open-source data engines, and its latest major release brings a host of new features. As soon as Spark 2.0 was out, we couldn't wait to try it and savour its delights…
The first pleasant surprise is that Spark 2.0 supports subqueries in the SELECT and WHERE clauses (both correlated and uncorrelated). Previous versions of Spark supported subqueries only in the FROM clause, so before Spark 2.0 every query containing such a subquery had to be rewritten to use JOINs instead. This is an important benefit when migrating existing systems from other databases to Spark.
Another improvement on the SQL side is that in Spark 2.0 the built-in analytic/window functions are available directly via SparkSession; in previous versions they were available only through a HiveContext.
The next thing you will notice as soon as you start the Spark shell is that Spark 2.0 comes with a new, consolidated API model built around a new entry point, SparkSession. Does that mean your Spark 1.6 client code no longer works with Spark 2.0? It still works, but it might be a good time to start thinking about migration.
While the improvements above are a natural evolution from previous versions, performance-wise the new release is revolutionary. A new technique called "whole-stage code generation" produces tailored Java code at runtime for every SQL statement. As a result, the queries we measured ran on average 10 times faster than on Spark 1.6. If you are curious what code Spark generated for your SQL, you can see it by switching the log level to DEBUG. Just be aware that there is a threshold on the number of output fields: once it is exceeded, code generation is switched off. You can control this threshold via the "spark.sql.codegen.maxFields" configuration parameter.
The only disappointment is that data cached in memory no longer brings a performance improvement. Running the same queries against parquet files and against cached dataframes, the cached data takes on average 1.5 times longer. Both execution plans show whole-stage code generation, but the one reading the cached in-memory data has many more steps overall, something we plan to investigate more closely.
To measure the performance of a typical data warehouse report we decided to compare the same query executed on Spark 2.0 vs Oracle 12c on Exadata.
Our test case uses as source data two datamarts with 27 million rows and 250 columns each; each table is 20GB uncompressed. The query joins three subqueries: the first is a join between the two datamarts, and the other two are aggregations over each datamart. The query outputs 75 calculated columns, 50 of which use analytic/windowing functions, and the result is then grouped and aggregated by two dimensions.
The source data for Spark is stored in parquet files with snappy compression. The source data for Oracle is stored in tables with Exadata Hybrid Columnar Compression (Query High compression).
Both snappy-compressed parquet and Query High are hybrid columnar storage formats, and both achieve roughly 10x compression here. Both the Spark and Oracle machines are limited to 4 cores; parallel query is enforced for Oracle, and cell smart scan is enabled.
Now the result: 145 seconds for Oracle on Exadata vs 165 seconds for Spark 2.0.
Pretty impressive for free open-source software.