When we talk about implementing an ‘agile’ methodology our goal is to implement incremental work cadences, known as sprints that allow us to deliver early, improve continuously and be able to respond to change throughout the development cycle. The end goal of such processes is not to take shortcuts in what we deliver from a quality perspective. It’s really about getting to the finish line sooner than if we had followed a waterfall methodology with all of the ‘bells and whistles’ we had thought about including with a more traditional approach.
But how do we implement such an approach for data warehouse solutions when implementing a data architecture and integration strategy takes considerable effort to build? How do we address data quality issues and their impact on the underlying components we are building? How can we respond to changes for information quickly when new requirements for data and analysis arise?
This blog discusses one approach to employing an agile methodology when building an agile data warehouse architecture. Below we summarize 5 core components of this approach. In a later article, we will take a deeper dive into how Cervello implemented this methodology for a recent global management consulting organization.
Component 1: Take a High Level Architecture Approach
Developing a high level architecture with considerations for data architecture, data integration, data quality and flexibility as requirements for new data and analyses arise is a critical first step. First, create data models that focus on the integration of individual source systems and not reporting requirements alone. Reporting requirements change over time, but the model relationships within individual source systems do not. This will simplify data integration processes and the sourcing of new systems in the future. You should also address integration challenges across individual source systems as additional model components. For example, map similar, but inconsistent data between different source systems.
Next, bring in all the attributes when sourcing data if those attributes are used. Simple data dictionary queries can help identify attributes that are not populated and can be ignored. This will alleviate having to change models and ETL processes as new attributes are requested. During this exercise some minimal data analysis is needed. We will address this in a future article. Furthermore, you’ll want to use views to simplify access to the underlying data models and consolidate tables when performance for analyses becomes a concern. This provides the ability to react quickly to new or changing reporting requirements in the short term. Be sure to capture common metadata (across all data model components) that can be used to track data lineage and support data quality and audit processes.
Component 2 – Develop a Data Modeling Methodology
The next step is to develop a consistent data model methodology that will lead to the development of reusable data integration processes and the ability to respond quickly to change.
First, develop consistency across data models and relationships. For example, this would include the use of business and surrogate keys, the modeling of relationships (e.g. Parent-Child, 1 to Many, Many to Many) and rules for when and how to model historical data changes. Implement consistent naming standards across all model components. When possible, and as a recommended practice use a robust data modeling tool for developing and managing model changes. Next, implement common metadata across every component of the data mode that can be used to support data quality, data lineage, audit and traceability.
Component 3 – Develop a Flexible Data Quality Strategy
Data quality challenges exist within and across source systems. They may go un-noticed due to lack of data integrity and enforcement of business rules which can impact the integrity of downstream data models (and process failures) and quality of reporting if not addressed as part of the solution architecture. In an agile approach, data quality is still a design consideration so coming up with an agile solution for implementing this strategy is critical. Several steps can be taken to implement a robust and flexible data quality strategy.
First, include common data quality metadata across data models. For example, error filter indicators on staging tables and single record keys can be used as pointers back to data quality logs. Next, develop data quality processes that are dynamic and rule driven. Be sure they have the following capabilities: (1) ability to identify and capture data quality issues (2) provide data lineage back to source tables in error (3) ability to flag error records to prevent severe data from loading downstream and (4) provide mechanisms for communication and resolution of data issues. Lastly, be sure to implement data quality processes throughout the integration lifecycle, but not embedded inside core ETL processes. Keeping ETL processes simple is key to being agile. Data quality rules can be added or changed without impact to underlying ETL code.
Component 4 – Create a Common Data Integration Framework
Creating a standard and consistent data architecture with a common metadata approach allows for rapid development of simple and reusable ETL processes. This allows ETL processes (based on type of tables loaded) to be developed once and used as templates for new ETL processes that can be developed quickly. Additionally, data quality logic should not be embedded in ETL processes thereby keeping ETL logic simple and ‘cookie cutter’. Processes of different types are similar in nature and are therefore easy to replicate, understand and maintain. Lastly, common metadata driven processes that include process logging, index management and partition management can be encapsulated outside of ETL processes and plugged in as necessary without impact to underlying code thereby providing extensive capabilities on as ‘as needed’ basis that can be leveraged with minimal effort.
Component 5 – Develop Metadata Driven and Reusable Processes
The key to implementing an agile data warehouse is to simplify processes and leverage code reuse. This in turn requires a consistent approach to data and integration architecture activities which then leads to the ability to create and leverage common reusable processes that speed up the development life cycle.
Data model metadata helps define when data was created or updated, the process or person responsible and the load cycle the update occurred in. Common processes such as event and row level logging can leverage this metadata to capture detailed process statistics without duplication of code across the solution. Data quality metadata defines the rules to be exercised and actions to be taken. When stored in the database they can be executed when and where as needed. Rules can be added, modified or removed without impacting underlying ETL code. ETL metadata can be used to encapsulate common and process specific metadata. Common processes can be used to make that metadata available to ETL processes seamlessly with no custom ETL logic required.
In summary, an agile approach to agile data warehouse architecture development is not an impossible task. As with any architecture the foundation is key. Development of common standards and processes and adherence to those standards can ensure a quality product delivered in a timely manner.
Interested in learning more? Join our mailing list!