# Drift Happens

> Navigating Unstable File Formats in Modern Data Architecture

**Published by:** [ByteByByte](https://bytebybyte.tech/)
**Published on:** 2025-02-23
**Categories:** cloud data platforms, data schema management, healthcare analytics, data best practices
**URL:** https://bytebybyte.tech/drift-happens-1

## Content

Today, I want to dive into a topic that comes up all the time when working with data platforms: handling schema drift. The term describes the constant (and often unexpected) changes in file formats or data structures that can quickly break data ingestion processes.

Recently, I was chatting with a client about their struggles with ever-changing file formats from vendors, third-party partners, and other external sources. When a format changes, pipelines break: alerts go off, hotfixes are needed, and engineers scramble to patch things up (I was one of those engineers for a long time). They asked me how I've handled these issues while keeping solutions flexible, structured, and easy to manage.

I've run into this challenge repeatedly, especially while leading builds of healthcare analytics data platforms and AI use cases. In one project, we had to process hundreds of claims payment documents in PDF, Excel, and flat file formats from multiple payers, and those files changed on what felt like a monthly schedule. Each time a payer tweaked the layout or added new columns, our ETL jobs would fail. We'd scramble to update mappings, rework transformations, and redeploy quickly to avoid pipeline downtime. It wasn't sustainable, and we had to iterate to find the right approach.

### Tools That Can Help

Before diving into architectural solutions, it's worth noting a few tools that can ease the burden (and no, none of Databricks, DataForge, or Fivetran is paying me here, so take these recommendations for what they're worth). Databricks Auto Loader can dynamically ingest files and handle schema drift. I've seen it used a few times, and it's great for anyone using Databricks who is willing to do some development to solve this problem. Additionally, Fivetran (a favorite in Databricks environments) can automate ingestion, manage schema drift, and alert users to attribute changes in these files.
It's a bit more low-code/no-code. DataForge is another tool I've used. The founders are former colleagues of mine, and they've written extensively on this topic. It provides an effective way to handle schema drift in data ingestion workflows.

There are many other tools that do this, but these have been used by several of my clients with really good results. They don't eliminate the need for a solid data architecture and strategy, but they can fit into that architecture.

Outside of tools, there are a few other approaches to consider. Below, I look at the standard third normal form (3NF) approach and two others. I recognize that there are many other ways to solve this problem (schema-on-read, views, etc.), but I focused on these for the purposes of this post.

### Approaches to Managing File Format Changes

#### 1. The Traditional 3NF Data Model

This approach follows standard relational modeling principles: atomic values, minimal redundancy, and key-based relationships.

**Pros**

- Removes redundant data, reducing storage costs and improving consistency.
- Enforces data integrity through key relationships.
- Efficient for querying with well-defined schemas and indexing.
- Great for transactional systems requiring strong consistency.
- Plays nicely with star schema modeling when data is well-structured.

**Cons**

- Can get complex with many dimensions, fact tables, and crosswalk tables.
- Schema changes require updates, which can break ETL processes.
- Needs ongoing maintenance (performance tuning, indexing, etc.).
- Less flexible for handling semi-structured data.
- Not ideal for API-driven architectures that prefer JSON.

**Key Takeaway:** Use 3NF when data structures are relatively stable, or when strong consistency and integrity are paramount. It's powerful, but schema changes can be painful, so plan for regular maintenance cycles and version control to handle evolving requirements.

#### 2. JSON (Denormalized Approach)

Storing data as JSON objects offers more flexibility, reducing schema-related ETL failures when fields are added or removed.

**Pros**

- Reduces schema update requirements; easier for data to evolve over time.
- Improves query performance by reducing the need for joins, assuming a "One Large Table" (OLT) approach.
- Supports modern applications that natively work with JSON.
- Can store precomputed measures to optimize query times.

**Cons**

- Can get messy if users aren't familiar with "wide-table" (OLT) models.
- JSON querying can be slower due to nested structures.
- Storage costs can go up due to duplicated data.
- Requires extra processing for JSON parsing and transformation.

**Key Takeaway:** JSON is a powerful option for managing semi-structured data and adapting to frequent schema changes, but it comes with trade-offs. Performance and cost should not be overlooked, as querying large nested structures can be inefficient. Working with JSON and wide-table models also requires a different mindset, so developers and power users will need training to navigate this paradigm effectively. If your workflow relies heavily on self-joins, be prepared for potential complexity and performance overhead.

#### 3. The Hybrid Approach (Structured & Unstructured Data)

A hybrid approach blends structured data with flexible JSON storage, aiming to strike a balance between data integrity and adaptability.

**When to Consider This Approach**

- Some attributes are stable, while others change frequently.
- Core data and frequently queried attributes live in structured tables.
- Rarely queried or dynamic attributes are stored in JSON.
- Your database supports mixed data types.
- Your team is comfortable with performance trade-offs and query complexity.

**Key Takeaway:** The hybrid approach is often a sweet spot for teams dealing with frequent schema changes on certain attributes but still needing a robust relational backbone.
You get the best of both worlds, but it demands solid governance to track where each piece of data resides.

### Common Pitfalls & Governance Tips

- **Versioning:** Maintain a version history of your schemas. This way, you know exactly which schema was in use when data was ingested.
- **Documentation:** Keep clear documentation of which fields are in your structured tables vs. your JSON columns. This reduces confusion when changes inevitably occur.
- **Alerting & Monitoring:** Even with flexible storage, you want alerts when new fields appear. Tools like Databricks Auto Loader or Fivetran can notify you of schema changes immediately.
- **Data Governance:** Have a plan for how new fields or attributes get validated and labeled, and for deciding whether they belong in structured or unstructured sections. This prevents "sprawl" over time.

### How to Decide Which Approach Is Right for You

Before picking an approach, ask yourself:

- **How often is this data queried?** Frequent queries may justify a structured approach for performance.
- **Does it need to integrate with APIs?** JSON-friendly storage might be better if API integration is key.
- **How many records are we dealing with?** Large volumes of semi-structured data might need a scalable, flexible design.
- **How frequently does the schema change?** A very dynamic schema pushes you toward JSON or hybrid solutions.

Answering these questions will help you choose the best model. Remember, there's no one-size-fits-all. The hybrid approach often provides the right balance, but you need a team comfortable with managing both structured and semi-structured data efficiently.

### Final Thoughts

Schema drift is an unavoidable challenge in data engineering, but there are proven strategies to tackle it. Whether you choose a traditional relational model, a flexible JSON approach, or a hybrid solution, the key is understanding your data's usage patterns and anticipating future evolution. At the end of the day, data architecture is all about trade-offs.
I love digging into these kinds of challenges, and I hope this breakdown helps you think through the best approach for your own platform needs.

Got thoughts or experiences dealing with schema drift? What's the trickiest schema drift issue you've faced, and how did you solve it? Do you have a favorite tool or framework for managing unexpected file format changes? Drop a comment. I'd love to hear how you're tackling it!

P.S. At the top of the post is a photo I took of a piece of art currently on display at the Brooklyn Museum. It's one of those pieces that makes me think, "I could have done that," but I didn't. I don't have the experience, background, or understanding of art to have created it. The artist is Jaye Moon... and I like her work!

### UPDATE 2025/02/24

I received some feedback from my colleague and Databricks MVP, Doug MacWilliams. Doug suggests leveraging Delta Lake schema enforcement within Databricks to manage schema drift. This method works well when handling frequently changing file formats in a Medallion architecture (bronze to silver to gold).

#### Two Main Approaches to Handling Schema Drift in Delta Lake

**1. Schema Enforcement (Default)**

- **Strict Schema Matching:** If incoming data doesn't align with the existing Delta Lake table schema, an error is triggered.
- **Ensures Stability:** This prevents unintended schema alterations, maintaining a stable model/schema.

**2. Schema Evolution**

- **Automatic Adaptation:** When enabled, new columns from incoming data are added without overwriting existing records.
- **Manages Changes:** It accommodates renamed or removed columns, preserving historical data integrity while taking on new attributes.
- **Requires Maintenance:** Periodic cleanup is necessary to maintain consistency.
- **Consistent Progression:** As data moves from bronze to silver to gold layers, mapping into a consistent schema supports consistent insights and querying. There is work to do there as well.

This approach offers flexibility while keeping data clean and consistent (with a little work) for end users, which makes it a great choice for Databricks-based platforms. Appreciate the feedback, Doug!
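
To make the difference between enforcement and evolution concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not Delta Lake's API: `ToyTable`, `SchemaMismatchError`, the `evolve` flag, and the column names (`claim_id`, `amount`, `payer_code`) are all hypothetical, and it only compares column names, whereas a real engine also checks data types. In Delta Lake itself, evolution on write is typically switched on with the `mergeSchema` write option.

```python
from typing import Any

Row = dict[str, Any]


class SchemaMismatchError(ValueError):
    """Raised when a batch does not match the table schema under enforcement."""


class ToyTable:
    """Hypothetical in-memory stand-in for a table; not a real Delta Lake class."""

    def __init__(self, columns: list[str]) -> None:
        self.columns = list(columns)
        self.rows: list[Row] = []

    def append(self, batch: list[Row], evolve: bool = False) -> list[str]:
        """Append a batch and return the names of any columns that were added."""
        # Collect unseen columns in first-seen order so the result is deterministic.
        new_columns: list[str] = []
        for row in batch:
            for name in row:
                if name not in self.columns and name not in new_columns:
                    new_columns.append(name)

        if new_columns and not evolve:
            # Enforcement: reject the whole batch and leave the table untouched.
            raise SchemaMismatchError(f"unexpected columns: {new_columns}")

        # Evolution: widen the schema, then backfill existing rows with None.
        self.columns.extend(new_columns)
        for row in self.rows:
            for name in new_columns:
                row.setdefault(name, None)

        # Columns absent from an incoming row are stored as None as well.
        for row in batch:
            self.rows.append({name: row.get(name) for name in self.columns})
        return new_columns


if __name__ == "__main__":
    table = ToyTable(["claim_id", "amount"])
    table.append([{"claim_id": "C1", "amount": 120.0}])

    drifted = [{"claim_id": "C2", "amount": 80.0, "payer_code": "ABC"}]
    try:
        table.append(drifted)  # enforcement is the default
    except SchemaMismatchError as err:
        print(f"rejected: {err}")

    print(f"added columns: {table.append(drifted, evolve=True)}")
```

Under enforcement the drifted batch fails loudly and nothing is written, which is the kind of alert you want at the bronze layer. With `evolve=True` the new column is added and older rows are backfilled with `None`, so the cleanup work moves downstream instead of disappearing.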

