One thing is certain: an informed decision is a decision made from a position of strength. Data is what informs decisions, and therefore managing data is key to decision making in the modern company. This post is the first in a series that explains what the term Modern Data Stack means, what each component is and what it is used for.
Whether you are talking about gym supplements or technologies, the group of items being used together is referred to as a stack. Essentially, it means the items you use together to get your results, whether it's bigger muscles or better informed decisions.
At the core of the Modern Data Stack is the Data Warehouse. A Data Warehouse is where the data is stored at the various stages of its transformation journey. Generally speaking, there are three layers: Raw, Cleansed and Transformed, and Presentation.
Raw data is like meat before anything has been done to make it edible. Unlike meat it won't go bad, but we do refer to its freshness as it transitions from up-to-date to historical over time. In its raw form it is hard to digest, but keeping it is very useful in the event that you need to reprocess the data at a later date. Storage is pretty cheap these days, which makes this an affordable option.
Cleansed and Transformed data is raw data that has been cleansed of erroneous data, such as duplicate entries, and then transformed using the analysis techniques prescribed by data scientists and analysts. At this point it is changing from data into information, which IS digestible.
Presentation data is what is consumed by visualization tools and sometimes directly by people. This is the finished product: the plate of food is at the pass, has passed the chef's inspection and is ready to be delivered to the diners by the server. In that analogy the kitchen team is the data warehouse, the server is the data visualization tool and the diner is the business person who is now about to make a more informed, data-driven decision.
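To make the three layers a little more concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular warehouse product works: the order records, the field names and the functions cleanse and present are all made up for this example. The raw list is kept untouched, the cleansed list drops a duplicate entry, and the presentation result is a small per-region summary of the kind a dashboard would read.

```python
from collections import defaultdict

# Hypothetical raw layer: order events exactly as they arrived, duplicate included.
raw_orders = [
    {"order_id": 1, "region": "East", "amount": 120.0},
    {"order_id": 2, "region": "West", "amount": 80.0},
    {"order_id": 2, "region": "West", "amount": 80.0},  # duplicate entry
    {"order_id": 3, "region": "East", "amount": 50.0},
]


def cleanse(rows):
    """Cleansed layer: drop duplicate entries, keeping the first row per order_id."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            cleaned.append(dict(row))  # copy, so the raw layer is never modified
    return cleaned


def present(rows):
    """Presentation layer: total sales per region, ready for a visualization tool."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


cleansed_orders = cleanse(raw_orders)
sales_by_region = present(cleansed_orders)
print(sales_by_region)  # {'East': 170.0, 'West': 80.0}
```

Because the raw list is never changed, the cleansing and transformation steps can be rerun later with different rules, which is the reprocessing benefit described above.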
The purpose of the three layers is to allow work to be done efficiently and without affecting the other parts of the data journey. For instance, complex number crunching is a fairly heavy operation and with a potentially large data pool to work with, it can take both time and effort on the part of the system. The last thing you want to be doing is trying to change or read that data during this time period. As such, the reading of the latest data from the Presentation layer remains unaffected by the crunching going on because it is separated. Likewise, you have a flow of data and you don't want the new data to affect the processing of the existing data. This applies even in the case of real-time data. As you can see, by separating data at the various stages of its journey, each stage can happen safely.
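One simple way to picture this separation is a presentation layer that keeps serving its last published snapshot while the next one is being calculated elsewhere. The sketch below is an assumption-laden toy, not a description of any real product: the PresentationLayer class, the crunch function and the sales figures are all hypothetical.

```python
class PresentationLayer:
    """Hypothetical presentation layer that serves the last published snapshot."""

    def __init__(self, snapshot):
        self._snapshot = snapshot

    def read(self):
        # Readers only ever see a fully published snapshot.
        return self._snapshot

    def publish(self, new_snapshot):
        # Swap in the new result once the heavy processing has finished.
        self._snapshot = new_snapshot


def crunch(cleansed_rows):
    """Stand-in for the heavy number crunching done away from the readers."""
    return {"total_sales": sum(row["amount"] for row in cleansed_rows)}


presentation = PresentationLayer({"total_sales": 200.0})
new_rows = [{"amount": 120.0}, {"amount": 80.0}, {"amount": 50.0}]

before = presentation.read()   # what a dashboard sees before processing starts
staged = crunch(new_rows)      # processing happens on separate data
during = presentation.read()   # the dashboard is unaffected by the crunching
presentation.publish(staged)   # the new result becomes visible in one step
after = presentation.read()

print(before, during, after)
```

The reader gets the old total until the new one is published, and never sees a half-finished calculation.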
This separation is powerful because it means that the performance of the system can be scaled by adding more cloud-based processing nodes as needed. The raw data is stored in what is called a data lake. The data lake is a large, cloud-based data storage area that can be visualized as a large body of water from which the data can be consumed. Because it is in the cloud, its storage size can be scaled as needed. Likewise, the transformed data and presentation data are stored within the data warehouse in scalable cloud-based areas. The data warehouse can consume nearly any type of data, from real-time streams to files to videos.
The modern Data Warehouse is different from a database or a data mart because each is made for a different purpose. A database is built for transactional data and handles it very quickly: recording eCommerce transactions, user management, shopping cart management and a product catalogue. A data mart, unlike the database, is built for analysis of data, something that a database would be much slower at handling. However, the data mart is built with department-level analysis in mind and would not be suitable for company-wide analysis. To continue with our example, the sales and marketing department would employ a data mart to analyze the eCommerce data over time. The data warehouse is built for company-wide analysis and often consumes data from databases and data marts, in addition to many other sources, to provide higher-level analysis. The cost effectiveness of modern, cloud-based data warehouse products virtually eliminates the benefit of using a data mart, which would have been a start-small, cost-saving approach. Now, companies can affordably get both company-wide and departmental analysis using a data warehouse.
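The difference between transactional and analytical work is easier to see side by side. The sketch below uses Python's built-in sqlite3 module purely as a stand-in; a real data warehouse would run on a different engine, and the orders table, its columns and the sample figures are invented for this example. The first part records individual eCommerce orders, which is the kind of work a database is built for. The second part summarizes sales over time, which is the kind of question a data mart or data warehouse is built to answer.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, order_month TEXT, amount REAL)"
)

# Transactional workload: small writes, one order at a time.
with conn:  # commits the inserts as a single transaction
    conn.executemany(
        "INSERT INTO orders (order_month, amount) VALUES (?, ?)",
        [("2024-01", 120.0), ("2024-01", 80.0), ("2024-02", 50.0)],
    )

# Analytical workload: scan many rows and aggregate them over time.
monthly_sales = conn.execute(
    "SELECT order_month, SUM(amount) FROM orders "
    "GROUP BY order_month ORDER BY order_month"
).fetchall()

print(monthly_sales)  # [('2024-01', 200.0), ('2024-02', 50.0)]
conn.close()
```

With three rows the two workloads look equally cheap. The gap the paragraph describes shows up at scale, when the aggregate has to read millions of rows while new orders keep arriving.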
As the cloud has matured, each cloud provider has created their own data warehouse products. Microsoft has Synapse, Google has BigQuery and AWS (Amazon) has Redshift. Other companies have created products such as Databricks and Snowflake which are agnostic to the cloud in which they are hosted. The leaders of the pack are Snowflake and BigQuery with Databricks in third place. Your cloud strategy, team make-up and long term goals will determine which product is selected.
It is helpful to engage a cloud specialist to help you navigate the pitfalls and complexities of the cloud and cloud-based data management. TMH Solutions works with all three of the major players within the data warehouse world and can take the challenge out of cloud-based data management with our managed cloud services.
TMH Solutions brings Enterprise Grade Expertise coupled with Boutique Firm Service to our customers' data and software development solutions.
Would you like to know more about how we can help you navigate the complexities of cloud-based data management? Mike Conway would love to have that conversation with you.
1 (289) 430-0419