May 27, 2019 - azure big-data

Big Data Platform Comparisons

Xavier Geerinck

@XavierGeerinck

Choosing a platform for your Big Data processing tasks is not easy. On the one hand you want to be flexible and open; on the other you want a stable, robust platform that can handle your critical business workloads.

This is the reason that I decided to create a comparison of the different Big Data platforms from a Microsoft perspective.

Note: I am confident that other major cloud vendors such as Google and AWS, as well as other vendors, also have excellent Big Data platform products. Since I am not a specialist in those, I will keep this comparison to the ones I am proficient in.

Note 2: This will also include a lot of assumptions to make the comparison as fair as possible.

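The products that we will be covering are:

  • Apache Spark on Kubernetes (K8S)
  • Azure Databricks
  • HDInsight
  • Cloudera Cloudbreak
  • Azure Machine Learning Services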

Note: I will not include a detailed overview of the services but rather a comparison. For a detailed overview, feel free to check the associated links for each service.

Assumptions

As in any comparison, some assumptions were made. In this case:

  • 1 Web Node was utilized where a web interface is required
    • Size: D2v3
    • Solutions: Cloudera Cloudbreak
  • 2 Head Nodes were utilized where required for HA purposes
    • Size: A3
    • Solutions: HDInsight, Cloudera Cloudbreak
  • 3 Worker Nodes were utilized
    • Size: D13v2
    • Solutions: Azure Databricks, HDInsight, Machine Learning Services, Cloudera Cloudbreak, Apache Spark on Kubernetes (K8S)
  • No Data Disks were selected
  • For clusters that support pausing, 8h per day (240h/mo) was assumed; for the others, 24h per day (720h/mo)
    • Note: 8h pricing is also listed for the others, but wrapped in () for clarity. They can be paused as well, but extra work is required to support this.
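The hour counts above follow from a 30-day month. As a minimal sketch of how a monthly estimate can be derived from these assumptions: the hourly rates below are hypothetical placeholders, not the prices used in this comparison, and the function names are purely illustrative.

```python
# Hypothetical hourly VM rates in USD: placeholders, NOT actual Azure prices.
HOURLY_RATE_USD = {"D2v3": 0.10, "A3": 0.30, "D13v2": 0.60}

DAYS_PER_MONTH = 30  # assumption implied by 240h/mo and 720h/mo


def monthly_hours(hours_per_day):
    """Billable hours per month for a cluster running hours_per_day."""
    return hours_per_day * DAYS_PER_MONTH


def monthly_cost(nodes, hours_per_day):
    """Monthly VM cost; nodes maps a VM size to the number of nodes."""
    hours = monthly_hours(hours_per_day)
    return sum(HOURLY_RATE_USD[size] * count * hours for size, count in nodes.items())


# Node layout from the assumptions for a solution with head and worker nodes.
head_and_workers = {"A3": 2, "D13v2": 3}

print(monthly_hours(8), monthly_hours(24))
print(round(monthly_cost(head_and_workers, 24), 2))
```

Swapping in real rates for the chosen region gives the VM part of the bill; platform surcharges are not modeled here.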

Comparison Matrix

| - | Spark on K8S | Azure Databricks | HDInsight | Cloudera Cloudbreak | Azure Machine Learning Services |
| --- | --- | --- | --- | --- | --- |
| Multi Cloud | Yes | Yes | No | Yes | Yes (1) |
| Deployment Model | IaaS / Half PaaS | PaaS | PaaS (with full cluster control) | IaaS | PaaS, with integrated support for compute on ML Services, VMs, Databricks, HDI and K8S |
| Auto Scale | Yes (will require manual configuration) | Yes | Yes (preview) | Yes | Yes (on Machine Learning Compute or Databricks) |
| Compute Pause Support | No (but scale-down yes, and can be automated) | Yes | No (but scale-down yes, and can be automated) | Yes | Yes |
| Language Support | Scala, Python, R, SQL, Java, .NET | Scala, Python, R, SQL, Java | Scala, Python, R, SQL, Java | Scala, Python, R, SQL, Java | Python & REST |
| Notebook Support | No | Yes | Yes | Yes | Yes |
| Scheduling Support | No | Yes | Yes, through Oozie | Yes, through Oozie | Yes, through Platform or SDK integration |
| Tooling Re-training Required | Server management through K8S | Databricks Interface | HDP Components | HDP Components & Cloudbreak Interface | SDK Interface or GUI Interface in Azure Portal |
| Extensibility | No | No | Yes | Yes | Yes |
| Performance Gain Out-Of-The-Box | 0% | 40% | 0% | 0% | N/A |
| Cost (per month) | 24h: 1,662.21 USD (8h: 546.48 USD) | 24h: 2,409.00 USD; 8h: 803.88 USD; note: perf increase added (2) | 24h: 2,084.00 USD (8h: 685.15 USD) | 24h: 2,100.36 USD; 8h: 749.43 USD; +375 USD license cost (3) / mo | Depends on K8S, HDI, Databricks VMs implementation |
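To put the cost figures side by side, here is a small sketch that computes how much of the 24h monthly cost is saved when a cluster only runs 8h per day. The numbers are copied from the cost figures above (reading "803,88" as 803.88); Azure Machine Learning Services is left out because its cost depends on the underlying compute. The helper name `pause_saving` is illustrative only.

```python
# Monthly cost in USD per platform, copied from the cost figures: (24h, 8h).
# The Cloudbreak figures exclude the separate license cost from note (3).
MONTHLY_COST_USD = {
    "Spark on K8S": (1662.21, 546.48),
    "Azure Databricks": (2409.00, 803.88),
    "HDInsight": (2084.00, 685.15),
    "Cloudera Cloudbreak": (2100.36, 749.43),
}


def pause_saving(cost_24h, cost_8h):
    """Fraction of the 24h monthly cost saved by running only 8h per day."""
    return 1 - cost_8h / cost_24h


savings = {name: pause_saving(*costs) for name, costs in MONTHLY_COST_USD.items()}
cheapest_24h = min(MONTHLY_COST_USD, key=lambda name: MONTHLY_COST_USD[name][0])

for name, saving in savings.items():
    print(f"{name}: {saving:.1%} saved at 8h/day")
print("Lowest 24h cost:", cheapest_24h)
```

Keep in mind that for Spark on K8S and HDInsight the 8h figure is only reachable with extra automation, as noted in the assumptions.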

Notes:

  • (1): Counted as multi-cloud because this offering is consumed through an SDK and focuses on model training and operationalization. Notebook support was added recently, which makes it a viable solution now. For Spark workloads, however, I recommend pairing it with another service.
  • (2): Databricks offers an out-of-the-box performance increase; see website1 and website2 for more details.
  • (3): Enterprise support requires licenses. See this website for more information. For our comparison, we assumed a price of 1,500 USD per license, for the worker nodes only (3 worker nodes * 1,500 USD / 12 months = 375 USD per month). Exact pricing needs to be checked with Cloudera; this figure is purely indicative!
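The license arithmetic from note (3) can be written out as a short sketch. The 1,500 USD price is the indicative figure from the note, assumed here to be an annual price per node (as implied by the division by 12); the variable names are illustrative.

```python
WORKER_NODES = 3
# Indicative license price from note (3); assumed to be per node per year.
LICENSE_USD_PER_NODE_PER_YEAR = 1500
MONTHS_PER_YEAR = 12

license_per_month = WORKER_NODES * LICENSE_USD_PER_NODE_PER_YEAR / MONTHS_PER_YEAR

# Cloudbreak monthly VM cost from the comparison, in USD.
CLOUDBREAK_24H = 2100.36
CLOUDBREAK_8H = 749.43

total_24h = CLOUDBREAK_24H + license_per_month
total_8h = CLOUDBREAK_8H + license_per_month

print(f"License per month: {license_per_month:.2f} USD")
print(f"Cloudbreak incl. license, 24h: {total_24h:.2f} USD")
print(f"Cloudbreak incl. license, 8h: {total_8h:.2f} USD")
```

Note that the license cost stays fixed when the cluster is paused, so it weighs more heavily on the 8h scenario.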

References

More references can be found for the following products at these links:

HDInsight

Spark on K8S

Cloudbreak

Azure Machine Learning Services

Xavier Geerinck © 2020
