![[IMG_2678 (1).jpeg]]
<small>*Longs Peak, Sept. 2025... or thematically, top of the data mountain. Not there yet!*</small>
I’m writing this as someone who blurred a number of lines in 2025 -- and will likely do so again. Some days, I'm an individual contributor. I get hours of time to work on a few problems. This is probably what I am most naturally attracted to -- the dopamine of solving a problem.
Other days, I'm a leader -- setting roadmaps, planning sprints, grooming backlogs. I'm presenting and defending the roadmap to leadership (our C-suite, VPs, etc.). Still other days, I'm an analyst: I get to go deep on a question, try to answer it, and then present and discuss it with whoever asked. These roles stretch me, and sometimes I fall flat. Oh well.
This fits me; I like wearing a lot of hats. I'm lucky that right now, that's what Suvida needs.
### What Worked?
#### Flexible, expandable, adaptable data modeling
Fundamentally, the work of a data team is to create structures that scale -- patients, records, facts, dimensions -- while minimizing friction to change, remaining resilient to errors, and staying automated and testable enough to support fast iteration cycles. It's about having a priority stack and constantly juggling those priorities when they inevitably pull in opposite directions.
The underlying data models must be able to adapt as needs change and winds shift. The team must be ruthless with complexity, especially for data models as expansive as a typical value-based care (VBC) healthcare company needs.
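To make the facts-and-dimensions point concrete, here is a toy star-schema sketch (all table and column names are hypothetical): descriptive attributes live on dimensions, while the fact table stays narrow and append-only, so adding a new patient attribute never touches the fact table or anything that reads it.

```python
import pandas as pd

# Dimension: one row per patient. New attributes land here,
# without touching the fact table or its downstream readers.
dim_patient = pd.DataFrame({
    "patient_id": [1, 2],
    "preferred_language": ["es", "en"],  # e.g., an attribute added later
})

# Fact: one narrow, append-only row per visit, keyed to the dimension.
fct_visit = pd.DataFrame({
    "visit_id": [10, 11, 12],
    "patient_id": [1, 1, 2],
    "visit_date": pd.to_datetime(["2025-01-05", "2025-02-10", "2025-02-11"]),
})

# Analyses join facts to dimensions at query time.
visits_by_language = (
    fct_visit.merge(dim_patient, on="patient_id")
             .groupby("preferred_language")["visit_id"]
             .count()
)
```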
#### Tooling for fast iteration cycles
Nothing the data team shipped in 2025 was perfect from the start.
We shipped a comprehensive Quality management workflow solution in February/March. It updates automatically with new information multiple times per day. While its MVP functionality was solid, it evolved week by week, month by month. Weekly meetings came with asks, changes, questions. These were mostly easy, five- to thirty-minute changes.
The data team worked hard with our finance team to create sophisticated medical economics reporting. This data follows a monthly tempo: new data arrives, and the finance team receives rolled-up financial statements that we can use to QA our more granular information. I messed up our numbers for months -- it's really hard to make everything tie. We met weekly. Each week, we knocked down one more problem. Today, we have sophisticated medical economics reporting that can be combined seamlessly with all of our other data -- and we know that it ties exceptionally well.
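The tie-out work itself boils down to a simple, repeatable check: sum the granular detail by month and compare it against the rolled-up statements. A minimal sketch of that check, with hypothetical table and column names:

```python
import pandas as pd

def tie_out(claim_lines: pd.DataFrame, statements: pd.DataFrame,
            tolerance: float = 0.005) -> pd.DataFrame:
    """Flag months where granular claim-line totals drift from the
    finance team's rolled-up statements by more than `tolerance`.

    Hypothetical schemas:
      claim_lines: one row per claim line ('service_month', 'paid_amount')
      statements:  one row per month ('service_month', 'reported_paid')
    """
    granular = (
        claim_lines.groupby("service_month", as_index=False)["paid_amount"]
        .sum()
        .rename(columns={"paid_amount": "granular_paid"})
    )
    merged = statements.merge(granular, on="service_month", how="outer")
    merged = merged.fillna(0.0)  # a month missing from either side should surface
    merged["pct_diff"] = (
        (merged["granular_paid"] - merged["reported_paid"]).abs()
        / merged["reported_paid"]
    )
    return merged[merged["pct_diff"] > tolerance].sort_values("service_month")
```

In practice, "knocking down one more problem" each week often meant one more month dropping out of a report like this.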
In each case, and many more beyond these, tight iteration cycles were enabled by smart upfront tooling selection. (Quality: [Airtable](https://www.airtable.com), Python "reverse ETL", Snowflake, [dbt](https://github.com/dbt-labs/dbt-core); Finance: [Hex](https://hex.tech), [dbt](https://github.com/dbt-labs/dbt-core), Snowflake, Excel, [Tuva Project](https://github.com/tuva-health/tuva))
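As a concrete example from the Quality stack, the Python "reverse ETL" step is conceptually simple: query the curated dbt model in Snowflake, then upsert the rows into Airtable in batches of 10 (the Airtable API's per-request limit). A rough sketch, with hypothetical warehouse, model, base, and field names:

```python
import os
import requests
import snowflake.connector

AIRTABLE_URL = "https://api.airtable.com/v0/{base_id}/{table}"

def fetch_gaps() -> list[dict]:
    """Pull open care gaps from a curated model in Snowflake."""
    with snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse="REPORTING_WH",  # hypothetical
    ) as conn:
        cur = conn.cursor()
        cur.execute("select patient_id, measure, status from analytics.quality_gaps")
        cols = [c[0].lower() for c in cur.description]
        return [dict(zip(cols, row)) for row in cur.fetchall()]

def push_to_airtable(rows: list[dict]) -> None:
    """Upsert rows into an Airtable table, 10 records per request."""
    url = AIRTABLE_URL.format(base_id=os.environ["AIRTABLE_BASE"], table="Quality%20Gaps")
    headers = {"Authorization": f"Bearer {os.environ['AIRTABLE_TOKEN']}"}
    for i in range(0, len(rows), 10):
        resp = requests.patch(
            url,
            headers=headers,
            json={
                # performUpsert matches on a key field instead of creating duplicates.
                "performUpsert": {"fieldsToMergeOn": ["patient_id"]},
                "records": [{"fields": r} for r in rows[i : i + 10]],
            },
            timeout=30,
        )
        resp.raise_for_status()

if __name__ == "__main__":
    push_to_airtable(fetch_gaps())
```

Put a script like this behind CI and a schedule, and most of the five- to thirty-minute changes from the weekly meetings are just edits to the underlying dbt model.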
#### Open source bootstraps
All ~~data~~ technology teams stand on the shoulders of giants. We stand on [dbt-core](https://github.com/dbt-labs/dbt-core) (please -- [stay open and supported/updated!](https://www.fivetran.com/press/fivetran-and-dbt-labs-unite-to-set-the-standard-for-open-data-infrastructure-2025)), [Airbyte](https://airbyte.com), [Tuva](https://github.com/tuva-health/tuva), [Lightdash](https://www.lightdash.com), [HCCinFHIR](https://github.com/mimilabs/hccinfhir), and more broadly -- the [Python project](https://www.python.org).
#### LLMs
Despite this coming in down here, 2025 is likely the year that I remember most for LLMs (I'm intentionally avoiding "AI" here). While I was an early adopter of ChatGPT (and of the original Gemini releases) for information lookup, I hadn't yet jumped into their coding capabilities. I dabbled with some of the VSCode-based IDEs -- but I've never really been a fan of these IDEs. I like my (simpler, basic, Mac-native) text editors, like [BBEdit](https://www.barebones.com/products/bbedit/index.html) and [Nova](https://nova.app).
[Claude Code](https://github.com/anthropics/claude-code) was the missing piece for me. To this day, I *love* the concept and it's the primary way I work with these tools in my development workflow (though I do also use [Codex CLI](https://openai.com/codex/) now).
The timing worked out well here -- Suvida's data environment had thousands of lines of Python, SQL, and YAML documentation just waiting to be fed into an LLM. By choosing tools and best practices that enable fast iteration cycles (e.g., dbt, CI/CD workflows, everything living in version-controlled repos), we have a strong foundation to build on with LLMs. A significant amount of the work I've done in 2025 has been augmented by these tools.

I know that these tools are controversial for some. For me, it feels like the handcuffs have been taken off. I can think about higher-level architecture and design decisions, and focus on the most important pieces, while these tools work next to me to complete the less exciting work. And let's be honest... a *lot* of data engineering and analytics engineering is plumbing. It's critical work -- but it's the kind of work that can be robustly accomplished by these kinds of tools, surrounded by good CI/CD and testing practices.
### What Didn't Work?
#### Long, slow cycles: lots of activity, but no work gets done
Long iteration cycles are not necessarily bad. Complex problems require thought, and thought takes time.
The "smell" here is when long, slow iteration cycles result in lots of activity, but no actual work. It's a sign that although activity is happening, it may not actually be accomplishing the right thing. The project may not be structured correctly. Incentives may not be aligned. There may not be clarity on what you're actually trying to do. People may dip in and dip out, moving the goal posts with them as they go.
These types of projects likely need a reset, and a sharp focus -- or else need to be purposefully put on pause. Leaders should keep their eyes open for these types of cycles. In 2026, I will need to step in earlier when I see the telltale signs start.
#### Keeping point solutions and compounding complexity
Complexity will drown us all (though maybe not the artificial superintelligences...).
The core scaling mechanism is abstraction. How can you generalize a problem? What are the primitives? What is the invariant? What is the variant?
It takes discernment and experience to know if the problem you're looking at deserves a generalized approach, or a specific, targeted solution. I get it wrong all the time. But I'm getting better at finding the edges.
The data team must level up its ability to work with and through complexity, and know when we should be refactoring rather than "just bolting on another feature". The code base is a toddler, and it has all of the quirks of one.
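A contrived sketch of what that refactoring looks like in practice (all names hypothetical): two vendor-file point solutions collapse into one primitive once you notice that the load pattern is the invariant and the file layout is the variant.

```python
import csv
from dataclasses import dataclass

# Before: load_vendor_a() and load_vendor_b(), two near-identical
# point solutions, each bolted on for one vendor's file format.

# After: the variant is captured as data...
@dataclass
class VendorFeed:
    name: str
    delimiter: str
    column_map: dict[str, str]  # vendor column -> warehouse column

FEEDS = [
    VendorFeed("vendor_a", ",", {"MemberID": "patient_id", "Amt": "paid_amount"}),
    VendorFeed("vendor_b", "|", {"MBR_NBR": "patient_id", "PAID": "paid_amount"}),
]

# ...and the invariant is one generic loader instead of N copy-pasted ones.
def load(feed: VendorFeed, path: str) -> list[dict]:
    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter=feed.delimiter)
        return [
            {feed.column_map[k]: v for k, v in row.items() if k in feed.column_map}
            for row in reader
        ]
```

The third vendor becomes a one-line `VendorFeed` entry rather than a third function -- if, and only if, the problem really is general.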
### What I'm Focusing on in 2026
#### Organizing for fast iteration cycles
Choosing tools for fast iteration cycles is necessary but insufficient. The support structures must be in place to keep the feedback and iteration from drowning a lean, focused team. Meetings must be prepared for in advance and have a purpose. Notes should be kept, with focused, feasible next steps captured. The loop only works if communication happens: "Hey *Person*, we just shipped *x feature*; look for it in the next hour!"
I am not naturally good at the focused, precise, and regular implementation of this kind of practice. I generally do these things for myself, but getting a team to follow them is a goal of mine for 2026. And I'm hopeful that the practice becomes self-reinforcing: the team starts to hold itself to these standards. If we don't, we will have difficulty keeping pace with the systems we have set up. If we do, this is how you build the system that builds the business.
#### Interop
Augmenting workflows with more advanced LLMs will require greater interoperability. The "batch" nature of much of our data ecosystem doesn't cut it -- people want speed (e.g., a patient update entered in system X should immediately be visible in system Y). And LLM "agents" demand the same: an agent that has to wait on a nightly batch job loses most of its usefulness.
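A minimal sketch of the shape we're aiming for (all names hypothetical): instead of an agent polling a table that refreshes nightly, the source system emits an event the moment a record changes, and a small handler fans it out to the systems and agents that care.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PatientUpdate(BaseModel):
    patient_id: str
    field: str
    value: str

# Hypothetical downstream sinks; in practice these might publish to a queue.
def update_warehouse(evt: PatientUpdate) -> None: ...
def notify_agent(evt: PatientUpdate) -> None: ...

@app.post("/webhooks/patient-updated")
def patient_updated(evt: PatientUpdate) -> dict:
    """System X calls this the moment a record changes -- no nightly batch in the path."""
    update_warehouse(evt)  # system Y sees the change in seconds, not tomorrow
    notify_agent(evt)      # an LLM agent can act on fresh state
    return {"status": "ok"}
```

(Served with something like `uvicorn app:app`; the point is the event-driven shape, not the framework.)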
"Speed" in these integrations is a natural extension of the fast-iteration theme above. Recently, a coworker made an off-the-cuff remark about one of our systems that has stuck with me: "It's fast! Nothing else we use in healthcare is *fast*!" And he's right! I'm not naive enough to think that speed is an accident -- it takes purposeful design, and an [emphasis on speed](https://webkit.org/performance/), to stay fast. The bar is so low, though, that I think this is a great design "north star" to aim for.
#### Doing the basics, better
Suvida's data ecosystem was largely built on Microsoft Synapse. We migrated from SQL Server to Snowflake in January 2024, and from Power BI to Lightdash in June 2024. But about 50% of data ingestion into Snowflake still runs through Microsoft Synapse, and Synapse violates many of the above principles about choosing tools for fast iteration cycles. I've spent much of 2025 working to replace it (Airbyte, Fivetran, and Python classes as glue between systems where the first two don't suffice).
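The "Python classes as glue" pattern is deliberately boring: a small base class owns the invariants (retries, logging, landing data in Snowflake), and each awkward source system only has to implement its own extraction. A rough sketch with hypothetical names:

```python
import logging
from abc import ABC, abstractmethod

logger = logging.getLogger(__name__)

class GlueIngestor(ABC):
    """Base class for sources Airbyte and Fivetran can't cover. Subclasses
    implement extract(); the base class owns retries and landing."""

    def __init__(self, target_table: str, max_retries: int = 3):
        self.target_table = target_table
        self.max_retries = max_retries

    @abstractmethod
    def extract(self) -> list[dict]:
        """Pull records from the source system."""

    def land(self, records: list[dict]) -> None:
        # In the real pipeline this would write to a Snowflake stage; stubbed here.
        logger.info("landing %d records into %s", len(records), self.target_table)

    def run(self) -> None:
        for attempt in range(1, self.max_retries + 1):
            try:
                self.land(self.extract())
                return
            except Exception:
                logger.exception("attempt %d/%d failed", attempt, self.max_retries)
        raise RuntimeError(f"ingestion failed for {self.target_table}")

class LegacyPortalIngestor(GlueIngestor):  # hypothetical source
    def extract(self) -> list[dict]:
        return [{"patient_id": "123", "visit_date": "2025-01-15"}]
```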
A strong foundation is vital -- it will create more time for other work, enable faster processing of important information, and ultimately be the foundation for increasingly sophisticated BI, workflow (human and LLM), and systems interop work.
*The above are my own opinions and do not reflect the opinions of Suvida Healthcare*