Data is a crucial building block in every modern organization. Especially now that companies are exploring the possibilities of generative AI, it is important to feed such systems high-quality data. If you don’t feed the models accurate data, you can’t expect reliable results. What is the best data preparation strategy? And where is the best place to store this data? Here is an overview of some trends in building an effective data architecture.
Data analytics and AI are high on the agenda at most companies, but they do not deliver the desired results and productivity gains everywhere. A previous article in this series covered the main stumbling blocks and how to avoid them. The basis of every successful analytics project is the data on which we build a model. Without good, easily accessible data, something is guaranteed to go wrong.
There is a growing realization that AI technology is not a miracle cure that always delivers good results. If your data is only 80% accurate, the output of a model will be wrong roughly one in five times. Worse, if you train models on historical data that contains human bias, that bias will resurface in your models. For example, if a bank has historically given certain population groups or minorities poor credit ratings, your models will reproduce that behavior. In a follow-up article we will go into this in more detail and explain why trust is so important when using data.
What can you do to increase the quality of the data?
Clean up data instead of throwing everything together
In practice, two extremes are possible. At one extreme, we could try to clean all raw data before applying analytics to it: we extract the data from the known silos and store it in a central data warehouse. Given the volume of data and the quality and speed requirements, this approach is time-consuming and practically unfeasible. At the other extreme stands the data lake, in which we bring together all data, including raw and unstructured data, and do nothing with it until we actually need it. While this is certainly feasible given the current capacity of hard drives and the cloud, it merely postpones the problem. As is often the case, the answer lies somewhere in the middle…
Silos are so stubborn that we have to accept they will always exist. With the right data architecture, however, we can keep them under control. That is why companies are increasingly looking for a pragmatic, diversified way to make data available. While some data needs to be ready for use immediately, other data can safely wait. The result is a dynamic mix of a data warehouse strategy and other forms of data architecture.
Data as a product & self-service
The architecture that aims to combine the best of both worlds is Data Mesh. Some pillars of this concept are certainly here to stay. For example, Data Mesh is built on the idea that data is a product. Instead of trying to centralize everything, responsibility stays with whoever manages the data. Just like a physical product, data has a life cycle and a certain quality that determines its value. And just like a physical product, you can offer data by listing everything in a single catalog.
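To make the idea concrete, here is a minimal sketch of what a data product entry in such a catalog could look like. The DataProduct class, its field names and the example values are illustrative assumptions, not tied to any specific Data Mesh tooling.

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative sketch: a catalog entry that treats a dataset as a product,
# with an owner, a life cycle and a quality indicator. All names here are
# hypothetical, not part of any specific Data Mesh implementation.
@dataclass
class DataProduct:
    name: str                # e.g. "customer_transactions"
    owner: str               # the team that stays responsible for the data
    description: str
    quality_score: float     # 0.0 - 1.0, e.g. share of validated records
    last_updated: date
    retired: bool = False    # products have a life cycle, like physical goods
    tags: list = field(default_factory=list)

# A catalog is then simply a searchable collection of such products.
catalog = [
    DataProduct(
        name="customer_transactions",
        owner="payments-team",
        description="Cleaned card transactions, refreshed daily",
        quality_score=0.97,
        last_updated=date(2023, 5, 1),
        tags=["finance", "daily"],
    ),
]

# Consumers browse the catalog instead of hunting through silos.
usable = [p for p in catalog if p.quality_score >= 0.95 and not p.retired]
print([p.name for p in usable])
```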
This brings us to another trend that will only increase productivity when working with data: self-service. The classic Excel sheet is in fact a form of self-service, but what we need are modern tools that offer the user as much transparency as possible. Where does the data come from? How complete and accurate is it? These are all indicators that provide insight into an often complex network and make data more traceable. The SAS catalog takes this into account and automates the process by enriching itself as the data is used. Combined with feedback from the users themselves, such an approach can significantly enrich your data catalog, so that everyone can easily find and use the relevant data.
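As an illustration of the kind of transparency indicator such a catalog can surface, here is a minimal sketch, assuming pandas, that computes per-column completeness. The dataset is made up; real catalogs compute many more indicators, such as freshness and lineage.

```python
import pandas as pd

# Hypothetical extract of a dataset a business user wants to assess.
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "email":       ["a@x.be", None, "c@x.be", None],
    "balance":     [250.0, 120.5, None, 80.0],
})

# Completeness: the share of non-missing values per column. Indicators
# like this, shown next to the data in a catalog, tell users at a glance
# how reliable each field is before they build anything on top of it.
completeness = df.notna().mean()
print(completeness)
# customer_id    1.00
# email          0.50
# balance        0.75
```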
Cloud or no cloud?
All of this makes it clear that it no longer has to matter where you store your data. Since data constantly moves back and forth anyway, we no longer need to bring platforms as close as possible to the data’s center of gravity. This frees companies to explore alternatives such as a multi-cloud approach. Today, a variety of cloud databases offer cost-effective storage without forcing an architecture on their users. Such platforms are very open and focus mainly on performance, scalability and security.
A good example of such a rapidly growing cloud data platform is SAS partner Snowflake. Since you as a user don’t have to set up or manage any infrastructure yourself, such solutions lower the hurdle of putting your data in the cloud. Snowflake goes a step further and offers companies the opportunity to share data with external users via a marketplace, with or without payment. This could be interesting for app developers. Consider, for example, a smart application to better coordinate the production, transport and consumption of energy. It would need a lot of data that is distributed everywhere. What capacity do solar installations have? What weather can we expect? What are the current electricity prices? How many cars are charging right now? Cloud platforms allow organizations to leverage and enrich each other’s data.
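A minimal sketch of what consuming such shared data could look like, assuming the snowflake-connector-python package. The connection parameters are placeholders, and the shared databases and tables are hypothetical examples, not real marketplace listings.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder credentials; in practice these come from your environment.
conn = snowflake.connector.connect(
    account="my_org-my_account",
    user="analyst",
    password="...",
    warehouse="ANALYTICS_WH",
)

try:
    cur = conn.cursor()
    # Once a provider shares a dataset, it appears as a regular database
    # that you can join with your own data in plain SQL. The database and
    # table names below are hypothetical.
    cur.execute("""
        SELECT w.forecast_date, w.temperature, p.spot_price_eur_mwh
        FROM WEATHER_SHARE.PUBLIC.FORECASTS AS w
        JOIN ENERGY_SHARE.PUBLIC.SPOT_PRICES AS p
          ON w.forecast_date = p.price_date
        WHERE w.forecast_date >= CURRENT_DATE
    """)
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()
```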
For those not interested in the public cloud, the world of open source offers interesting data storage solutions. Parquet and ORC are two columnar storage formats supported by a wide community. Both are easy to get started with and offer excellent performance for analytical workloads. These formats are also very easy to access from the SAS platform.
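To show how low the barrier is, here is a short sketch using the open source pyarrow library, one widely used way to work with both formats; the table contents and file names are arbitrary examples.

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.orc as orc  # pip install pyarrow

# A small in-memory table; in practice this would come from your pipeline.
table = pa.table({
    "meter_id": [1, 2, 3],
    "kwh":      [12.4, 7.9, 20.1],
})

# Write the same data in both columnar formats.
pq.write_table(table, "readings.parquet")
orc.write_table(table, "readings.orc")

# Reading back is equally simple; both formats store column statistics
# that let engines skip irrelevant data during analytical queries.
print(pq.read_table("readings.parquet").to_pydict())
print(orc.read_table("readings.orc").to_pydict())
```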
Stable and dynamic
All of these examples fit the philosophy that the user should be able to access as much data as possible, regardless of where exactly it is stored. After all, no one has all their data in one place. Especially now that business users are working with data and analytics more often, solutions need to be robust, simple and dynamic. Only then will the productivity of your data projects skyrocket.
This article is a contribution from SAS. In the third and final article in this series on working productively with analytics, we focus on the importance of trust. Without trust in the data, the algorithms and the model results, all you get are discussions, which in turn undermine the productivity of projects.