Data Ingestion and Engineering
Open-Source Scripts
We believe in transparency and collaboration. That's why many of our data engineering and ingestion scripts are open-source, available under the Apache 2.0 license. You can explore and contribute to these scripts on our GitHub page.
Repository Structure
Our GitHub repositories are organized for easy navigation:
- Repository Naming: Each repository is named after its data source.
- Example: The
mimi-cdc
repository contains scripts for datasets from the Centers for Disease Control and Prevention.
- Example: The
- Script Types: Within each repository, you'll find two main types of Python/PySpark scripts:
download*
: These scripts fetch source files from various websites or APIs.ingest*
: These scripts take the downloaded files and load them into corresponding delta tables in our workspace.
Column Naming Convention
To ensure consistency and readability:
- We use Snake case for all column names.
- Example: "Provider Name" becomes "provider_name".
- This makes it easier to work with the data across different programming languages and platforms.
Data Alignment
Data publishers occasionally change their file formats or column names. To handle this:
- We've aligned columns by carefully editing the original names.
- Always refer to the column descriptions in the Data Catalog on Databricks for the most up-to-date and accurate information about each column.
Additional Variables
To enhance traceability and version control, we've added three special variables to all tables:
mimi_src_file_name
: Shows the source file's name.mimi_src_file_date
: Indicates the date mark of the source file (creation date or measurement cut-off date).mimi_dlt_load_date
: Shows when the data was loaded into the lake house as a delta table.
For more detailed information about these variables and how they can be used, check the Catalog section in the workspace.
Accessing Data (Basic)
You can access the data using SQL or PySpark. Here's a basic SQL example:
-- The first level maps to {schema}
-- and the next level maps to {table} fields.
SELECT * FROM mimi_ws_1.{schema}.{table} LIMIT 10;
Accessing the Latest Data
Often, you'll only need the most recent data. We've made this easy with a utility function.
-- The first level maps to {schema}
-- and the next level maps to {table} fields.
SELECT * FROM mimi_ws_1.{schema}.{table}
WHERE mimi_src_file_date = get_max_date('mimi_ws_1.{schema}.{table}');
This ensures you're always working with the most up-to-date information.
Finding Examples and Getting Help
We want to make your data exploration as smooth as possible. Here are some resources:
- Databricks Search: Use the search function on Databricks to find relevant notebooks and examples.
- Mimibot on Slack: Our intelligent bot can answer many of your questions instantly.
- Public Example Scripts: We've made many example scripts publicly accessible to help you get started quickly.
Feedback and Contributions
Your input is valuable! We're always looking to improve:
- Found a bug?
- Have an idea for improvement?
- Want to contribute to our open-source scripts?
Please email us at info@mimilabs.ai. We're excited to hear from you and work together to make our data tools even better. Happy data exploring!