Databricks
This article references:
一亩三分地 - newgpu
Data Warehouse
Runs on a cluster.
A place to store structured data.
Source: an offline ETL pipeline inserts data in batches.
Goal: let data scientists run SQL queries against the warehouse.
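The batch-insert-then-query flow above can be sketched in a few lines. This is only an illustration: `sqlite3` stands in for the warehouse (a real one is a distributed cluster), and the `sales` table and its rows are invented for the example.

```python
import sqlite3

# sqlite3 stands in for the warehouse; the batch-insert + SQL-query
# flow is the same, just not distributed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, region TEXT, amount REAL)")

# Offline ETL: a batch of already-structured rows arrives on a schedule
# (e.g. nightly) and is inserted in one bulk operation.
batch = [(1, "us-east", 120.0), (2, "us-west", 75.5), (3, "us-east", 40.0)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", batch)

# A data scientist then explores the data with plain SQL.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('us-east', 160.0), ('us-west', 75.5)]
```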
Characteristics:
- It is a distributed database, so it has many of the features a database provides.
- Structured data
Snowflake
A fully managed data warehouse on the cloud.
Databricks and the data lakehouse
data lake -> data lakehouse
No schema required up front: friendly to semi-structured and unstructured data.
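The schema-on-read idea can be shown with plain JSON: records with different shapes land in the lake as-is, and a schema is only imposed when someone reads them. The event fields below are made up for illustration.

```python
import json

# Semi-structured events land in the lake as raw JSON lines; no schema is
# declared at write time, and records may carry different fields.
raw_events = [
    '{"user": "a", "action": "click", "page": "/home"}',
    '{"user": "b", "action": "purchase", "amount": 9.99}',
]

# Schema-on-read: each consumer projects the fields it cares about at
# query time, tolerating fields that are missing from some records.
parsed = [json.loads(line) for line in raw_events]
actions = [(e["user"], e.get("amount", 0.0)) for e in parsed]
print(actions)  # [('a', 0.0), ('b', 9.99)]
```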
产品
Delta Lake
Stores both the data itself and its metadata.
The metadata enables ACID transactions and schema enforcement, backed by a write-ahead transaction log that allows rolling data back when necessary.
The data itself can sit on top of object storage such as S3; Delta Lake adds the metadata layer that supplies the ACID transactions and schema that SQL workloads require.
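A toy sketch of the log idea, in the spirit of Delta Lake's `_delta_log`: each commit records "add file" actions, and replaying the log up to an older version reconstructs an earlier table state, which is what makes rollback possible. The file names and log layout here are simplified inventions, not the real Delta format.

```python
import json

log = []  # log[v] = JSON string for commit version v

def commit(files):
    """Append one commit: a list of 'add file' actions."""
    log.append(json.dumps([{"add": f} for f in files]))

def table_state(as_of=None):
    """Replay the log up to a version to reconstruct the set of live files.
    Reading at an old version is what enables rollback / time travel."""
    upto = len(log) if as_of is None else as_of + 1
    files = set()
    for entry in log[:upto]:
        for action in json.loads(entry):
            files.add(action["add"])
    return files

commit(["part-0000.parquet"])  # version 0
commit(["part-0001.parquet"])  # version 1 (a bad write, say)
full = table_state()           # {'part-0000.parquet', 'part-0001.parquet'}
rolled_back = table_state(as_of=0)  # {'part-0000.parquet'}
```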
Databricks SQL
Runs on the Photon execution engine.
Databricks Runtime
Create machines -> a tuned Spark cluster is provisioned, and Spark is ready to go.
Notebook
Connects to a cluster the user has access to. Use Scala, Python, SQL, or R to run Spark commands and manipulate the data on top of S3.