Which Cloud Native Data Warehouse is right for your business?
Companies understand that data is a key driver of competitive advantage today. Organizations collect, purchase, and store petabytes of data in data lakes or data warehouses to gain business insights. Whether you are comparing past performance, using analytics to predict the future, or simply running ad-hoc queries against large data sets, you need a data warehouse that is cost-effective, fast, and developer-friendly.
The Targeted Warehouses
While the wide selection of data warehouses gives organizations choices, it can also make the task of choosing which to use daunting. Fortunately, this article will give a brief overview of four popular options native to the cloud.
Amazon Redshift – A fully-managed, petabyte-scale data warehouse native to Amazon Web Services (AWS) and released in October 2012
Azure Synapse – A limitless analytics service native to Microsoft Azure and first released in July 2016 under the name SQL Data Warehouse
Google BigQuery – A serverless and highly scalable multi-cloud data warehouse released by Google in May 2010
Snowflake – A fully-managed, cloud-agnostic data analytics platform released in 2014
Before we dive into the specifics of each data warehouse, it helps to identify which metrics we will compare:
Price vs. performance – Which warehouse provides the best bang for the buck?
Ease of development – How easy is it to load data, create tables, and perform queries? Does the warehouse play nicely with other systems?
Scaling and Maintenance – Can the data warehouse scale quickly and easily? How much maintenance does the data warehouse require?
Further, below are the test steps we used to compare the data warehouses:
Create table schema using SQL
Load 1 million records into the table from a raw CSV file
Perform a simple read operation on all records
Perform a simple aggregation on all records
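As a rough sketch, the four steps map to SQL along these lines. Table and column names are illustrative, and the bulk-load syntax varies by warehouse; the COPY shown here is Redshift-style:

```sql
-- 1. Create the table schema (illustrative columns)
CREATE TABLE orders (
    order_id   BIGINT,
    customer   VARCHAR(100),
    amount     DECIMAL(10, 2),
    ordered_at TIMESTAMP
);

-- 2. Bulk-load 1 million records from a raw CSV file
--    (Redshift-style COPY from S3; Synapse, BigQuery, and
--    Snowflake each use their own load mechanism)
COPY orders
FROM 's3://my-bucket/orders.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-load-role'
CSV;

-- 3. Simple read over all records
SELECT * FROM orders;

-- 4. Simple aggregation over all records
SELECT customer, SUM(amount) AS total_spend
FROM orders
GROUP BY customer;
```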
Price vs. Performance
A key point of comparison for any type of service, including data warehouses, is price. However, price alone does not tell the whole story. It is important to look at how much a service costs versus the performance it provides.
It is also important to note that each of the four data warehouses has a different pricing model. Each warehouse separates compute and storage pricing at some layer, with varying costs. Of these, Google BigQuery is the least predictable, given that it charges per terabyte processed. The table below gives an overview of the pricing relevant to the tests we ran:
Table notes:
Redshift separates storage costs for RA3 nodes only. Other node types include storage in their monthly costs.
Requires a minimum of 2 nodes, for a minimum of $6.52 per hour.
Google defines long-term storage as a table not edited for 90 consecutive days.
Assumes Standard Edition pricing ($2 per credit).
We cannot measure performance in a vacuum, as different use cases produce different results. The numbers can vary with warehouse instance sizes, the type and size of the data, and a myriad of other factors. Nevertheless, we distilled our tests into three main comparison points against a single large table: load speeds, read speeds, and simple aggregations.
Of the warehouses we evaluated, Amazon Redshift was significantly slower than the other three data warehouses. Synapse did best in data load and read, while Snowflake did the best in aggregation.
From a pricing perspective, Google BigQuery was the clear winner, costing next to nothing to load, read, and aggregate a million records.
Ease of Development
Another principal factor to consider is ease of development. How long it takes to build your data warehouse and how much effort it takes to interact with it can be just as important as the pricing. Much of this is subjective and depends on developer preferences, but there are key areas in which these data warehouses differ distinctly.
All the evaluated data warehouses use familiar SQL to load and query data. Azure Synapse uses SQL Server-style syntax while Redshift, BigQuery, and Snowflake each have their own unique flavor of SQL. Since each data warehouse follows industry standards, the effort to learn the differences between them is minimal.
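A small example of these dialect differences: limiting a result set uses SQL Server-style TOP in Synapse, while the other three use LIMIT (the table name here is hypothetical):

```sql
-- Azure Synapse (SQL Server-style syntax)
SELECT TOP 10 * FROM orders;

-- Redshift, BigQuery, and Snowflake
SELECT * FROM orders LIMIT 10;
```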
From an ingestion perspective, each data warehouse has a similar mechanism for loading raw CSV files from a storage service. Redshift is among the simplest, with little configuration needed between S3 and Redshift. BigQuery lacks a SQL command for copying data from a CSV, but it can still load raw data through the console and its API.
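For example, Snowflake's version of this mechanism loads from a named stage pointing at cloud storage (the stage and table names here are hypothetical):

```sql
-- Snowflake: load a CSV from a named stage
-- (@my_stage would point at an S3, Azure, or GCS location)
COPY INTO orders
FROM @my_stage/orders.csv
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```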
Regarding error reporting, Redshift and Synapse have the worst reporting. Redshift hides errors behind a “stl_load_errors” table, rather than returning the results with the load query. Synapse is more troubling, as it does not report any error when it fails to find a file and does not report the number of records ingested.
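In Redshift, for instance, surfacing those hidden load errors means querying the system table yourself; a sketch, with an illustrative column selection:

```sql
-- Inspect the most recent Redshift load failures
SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;
```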
From an ease of access perspective, Redshift, Synapse, and BigQuery are accessible through their respective cloud platforms, whereas Snowflake is available as a standalone web application. Redshift, BigQuery, and Snowflake allow developers to run their queries in the browser, while Synapse requires a third-party tool. However, BigQuery’s console does not behave consistently on non-Chrome browsers, with some operations silently failing.
When it comes to supporting complex data types, however, not all services are equal. Each data warehouse supports semi-structured data in some form, but only Redshift and Snowflake natively support JSON data types, whereas Synapse and BigQuery store JSON as strings. All four data warehouses provide utility functions for interacting with JSON, however.
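As an illustration of the difference (table and field names are hypothetical): Snowflake can store and traverse JSON natively via its VARIANT type, while BigQuery-style string storage relies on extraction functions at query time:

```sql
-- Snowflake: native semi-structured storage with VARIANT
CREATE TABLE events (payload VARIANT);
SELECT payload:user.name::STRING AS user_name FROM events;

-- BigQuery: JSON kept as a STRING and parsed per query
SELECT JSON_EXTRACT_SCALAR(payload, '$.user.name') AS user_name
FROM events;
```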
Scaling and Maintenance
Ease of development is not the only factor to consider. How much effort it takes to scale the compute layer, and how easy the system is to administer, matter too. When loading data, building models, and performing ad-hoc queries, the data warehouse should be able to scale to your organization's needs. This is an area where the data warehouse options have some key differences.
Synapse and BigQuery provide automatic scaling options, which are great for short workloads; both also offer dedicated options for more predictable pricing or intense workloads. Snowflake ties its scaling to warehouse sizes, which an admin can increase or decrease to match business needs in less than a minute. On the other end, Redshift has the most complex scaling: it provides no automatic scaling, admins must manage node clusters themselves (most cluster types require a minimum number of active nodes), and provisioning takes minutes rather than seconds.
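Snowflake's resizing, for example, is a single statement (the warehouse name here is hypothetical) and takes effect in well under a minute:

```sql
-- Scale a Snowflake virtual warehouse up for a heavy workload...
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';
-- ...and back down when the load subsides
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'XSMALL';
```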
From a maintenance point of view, Snowflake has a focused and cleaner interface than the other three. Provisioning user access in Snowflake is much more straightforward than the other three providers, and all admin functions are scriptable in SQL.
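For example, provisioning a new analyst in Snowflake can be scripted end-to-end in SQL; a sketch, with illustrative role, warehouse, database, and user names:

```sql
-- Create a role, grant it read access, and assign it to a new user
CREATE ROLE analyst;
GRANT USAGE ON WAREHOUSE my_wh TO ROLE analyst;
GRANT USAGE ON DATABASE analytics TO ROLE analyst;
GRANT USAGE ON SCHEMA analytics.public TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.public TO ROLE analyst;
CREATE USER jdoe PASSWORD = 'TempPassw0rd!' DEFAULT_ROLE = analyst;
GRANT ROLE analyst TO USER jdoe;
```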
The choice of which data warehouse works best for your organization may depend on several factors. Selecting the right data warehouse should involve a detailed look into which aligns best with your organization's needs. Is development speed the main concern? Or is price the most crucial factor? Whichever the case, Scalesology can help your organization not only decide on what platform to use, but also help you every step of the way, from ingestion to analytics. Contact us now, and let's move forward with gaining insights from your data.