Data Warehouse

Runs on a cluster.

A place to store structured data.

Source: an offline ETL pipeline that inserts data in batches

Goal: let data scientists run SQL queries
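A minimal sketch of that flow in PySpark (the bucket path, column names, and table name are hypothetical): a scheduled batch job appends transformed records to a warehouse table, which data scientists then query with plain SQL.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("nightly-etl").getOrCreate()

# Extract: read one day's batch of raw records (hypothetical path).
raw = spark.read.csv("s3a://my-bucket/raw/2024-01-01/", header=True)

# Transform: cast types and drop bad rows.
clean = raw.withColumn("amount", F.col("amount").cast("double")).dropna()

# Load: append the batch into a warehouse table.
clean.write.mode("append").saveAsTable("sales")

# Data scientists can then query the table with plain SQL.
spark.sql("SELECT count(*) AS n FROM sales").show()
```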

Characteristics:

  1. It is a distributed database, so it has many of the features a regular database has
  2. Stores structured data

Snowflake

A managed data warehouse that runs in the cloud.

Databricks and the data lakehouse

Data lake -> data lakehouse

No schema required: friendly to semi-structured and unstructured data
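A minimal sketch of that schema-on-read idea in PySpark (the `s3a://my-lake/events/` path is hypothetical): raw JSON lands in object storage with no predeclared schema, and structure is inferred only when it is read.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Raw, semi-structured JSON dropped into object storage with no
# predeclared schema (hypothetical path).
events = spark.read.json("s3a://my-lake/events/")

# The schema is inferred from the data at read time, not enforced on write.
events.printSchema()
events.show(5)
```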

Products

Delta Lake

Stores both the data itself and its metadata.

The metadata enables ACID transactions and schema support, with a write-ahead log that allows rolling the data back when necessary.

Data can be stored on top of object storage such as S3; Delta Lake adds the metadata that provides ACID transactions and a schema, both of which SQL workloads require.
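A minimal sketch using the delta-spark package (the `s3a://my-lake/orders` path is hypothetical): each write is logged as an ACID transaction, and the log lets you read back an earlier version of the table, which is what makes rollback possible.

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is installed and on the classpath.
spark = (
    SparkSession.builder.appName("delta-sketch")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "s3a://my-lake/orders"  # hypothetical object-storage path

# Every write is an ACID transaction recorded in the Delta log.
df = spark.createDataFrame([(1, "widget"), (2, "gadget")], ["id", "item"])
df.write.format("delta").mode("overwrite").save(path)

# The log retains old table versions, so earlier states can be read
# back (time travel), which is what enables rollback.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```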

Databricks SQL

Uses the Photon execution engine (Databricks' vectorized query engine, written in C++).

Databricks Runtime

Create a machine -> you get a tuned Spark cluster, and Spark is ready to go

Notebook

Connects to a cluster that the user has access to. Use Scala, Python, SQL, or R to run Spark commands and manipulate the data on top of S3.
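A minimal PySpark sketch of that notebook workflow (the bucket path and column names are hypothetical): the notebook's session talks to the attached cluster and works directly on data in S3, mixing the DataFrame API with SQL.

```python
from pyspark.sql import SparkSession, functions as F

# In a Databricks notebook, `spark` is already provided by the attached
# cluster; getOrCreate() returns that same session, so this also runs
# standalone.
spark = SparkSession.builder.getOrCreate()

# Hypothetical raw data sitting in S3.
sales = spark.read.parquet("s3a://my-bucket/sales/")

# Typical manipulation: aggregate with the DataFrame API, then expose
# the result so it can also be queried with SQL.
daily = sales.groupBy("sale_date").agg(F.sum("amount").alias("total"))
daily.createOrReplaceTempView("daily_sales")

spark.sql("SELECT * FROM daily_sales ORDER BY sale_date").show()
```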