Without high-quality data, there can be no correct insights or predictions. But maintaining data quality is a tall order, as we hear at Smals, where it receives constant attention. A database that is correct today may already be out of date tomorrow.
Data is the basis for insights, predictions and, today, even for artificially intelligent assistants and bots. If the foundation isn’t of the right quality, anything you build on it will fail. Poor-quality data can cause problems, and this is not a new phenomenon, explains Professor Isabelle Boydens. She is an expert in data quality, both at the Université libre de Bruxelles and at Smals, where she works in the research team.
A bomb without the latest updates
She takes us to Belgrade during the Kosovo War, on May 7, 1999. To be more precise: to the Chinese embassy in the Serbian capital. On that day, three journalists from the Chinese state media are killed by NATO bombs that accidentally hit the building. The bombing is accurate, but the target is wrong. The target for the bombs had been checked against a no-strike database of hospital, center and embassy locations. “However, the map was not updated correctly,” says Boydens. “The Chinese embassy had moved, but the database still contained the old address.”
The Chinese embassy had moved, but the database still contained the old address.
Prof. Isabelle Boydens, Smals & ULB
The no-strike database itself was of excellent quality, but the data it contained was overtaken by reality at a crucial moment, with all the consequences that entailed. “Once we’ve put together a database, the reality it represents can evolve,” she says.
“In many projects, past and present, the technical aspect of the data is considered very important,” notes Boydens. “Of course, the technology is essential, otherwise nothing works. But data is connected to reality, and that reality evolves. You have to take that into account as well.”
How important do you think quality is?
This brings the professor to a second problem, which is not really technical either: whoever enters the data does not always have a great interest in its quality. Vandy Berten, ICT researcher at Smals, agrees. “In the construction industry today, you have to register who is present on a construction site. That site may be on the Turnhoutsebaan, but when registering, the person responsible can write Turnhoutsestraat. That’s not correct, but he or she isn’t losing any sleep over it. In retrospect, it is very difficult to determine whether the builders working on the Turnhoutsebaan and the Turnhoutsestraat are actually working on the same site.”
The quality of the data thus depends heavily on who enters it and what interest that person has in getting it right. That distinction leads to databases that are naturally of good quality and datasets that are inherently more problematic. In the construction industry, the correctness of administrative data is perhaps less important than in the financial world.
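As an illustration of how such near-duplicates might be traced afterwards, here is a minimal sketch of fuzzy matching on street names in Python; the normalization and the threshold are assumptions for the example, not a description of how Smals actually works.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough similarity between two normalized street names (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

# Hypothetical registrations that may refer to the same construction site.
entries = ["Turnhoutsebaan", "Turnhoutsestraat", "Kerkstraat"]

THRESHOLD = 0.75  # assumption: in practice this would be tuned on real data
for i in range(len(entries)):
    for j in range(i + 1, len(entries)):
        score = similarity(entries[i], entries[j])
        if score >= THRESHOLD:
            print(f"Possible duplicate: {entries[i]} ~ {entries[j]} ({score:.2f})")
```

On millions of records, a pairwise comparison like this becomes too slow; dedicated data-matching tools use blocking and indexing to keep such checks feasible.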
Obvious and less obvious errors
Before you can start working with data, you need to assess its quality. “There are different types of errors that can occur,” says Boydens. “First, there are clear, formal errors. Think of a field that is empty but should actually be filled in. Such a mistake is obvious.”
Next, a database may contain suspected errors. Sometimes it is possible to identify and fix them. Boydens: “A suspected anomaly can indicate a real error, or a valid value that was not included in the domain because it was not known when the model was built.”
Boydens continues: “Such cases are common in all rapidly evolving areas: the evolution of the categories of social security activities, mutations of viruses in medicine, effects of global warming … In such a case, you need to turn to specialists in the field to determine whether it is an error or a development that should be reflected in the database schema and the associated standard. This step is very important, as it may give rise to a new version of the database schema and the associated standard.”
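As a minimal sketch of how the first two error types could be flagged automatically, assume a register with a mandatory activity code and a list of codes that were known when the model was built; all field names and codes here are hypothetical.

```python
# Hypothetical records from an activity register; field names are illustrative.
records = [
    {"company": "Acme NV", "activity_code": "F43.2"},
    {"company": "Beta BV", "activity_code": ""},        # formal error: mandatory field left empty
    {"company": "Gamma VOF", "activity_code": "Z99.9"},  # suspected anomaly: code outside the known domain
]

# Domain of codes known when the model was built (assumption: may be incomplete).
KNOWN_CODES = {"F43.2", "F41.1", "G47.1"}

for rec in records:
    code = rec["activity_code"]
    if not code:
        print(f"{rec['company']}: formal error, activity_code is empty")
    elif code not in KNOWN_CODES:
        # Needs a domain specialist: is this a real error, or a new category
        # that should lead to a new version of the schema and standard?
        print(f"{rec['company']}: suspected anomaly, unknown code {code}")
```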
Companies going bankrupt have other things on their mind than reporting it correctly.
Prof. Isabelle Boydens, Smals & ULB
The third type of error is much more problematic. “These are mistakes that you can’t easily spot,” says Boydens. “Companies that go bankrupt have other things on their mind than reporting it correctly. Such errors cannot be detected from the data in the database alone.”
Solutions
According to Boydens, there are two main lines of action to spot problems, both in the data being entered and in the quality of the data over time. “You can work curatively, for example by conducting an audit of the values, or you can compare a database with other sources to look for unwanted differences. There are tools that help process millions of records in batches or evaluate them directly online.”
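A minimal sketch of such a curative comparison between two sources, keyed on a shared identifier; the sources and fields are hypothetical, and in practice a comparison like this would run in batch over millions of records with dedicated tools.

```python
# Hypothetical extracts of the same entities from two sources, keyed by enterprise number.
internal_db = {
    "0123.456.789": {"address": "Turnhoutsebaan 10", "status": "active"},
    "0987.654.321": {"address": "Kerkstraat 5", "status": "active"},
}
reference_db = {
    "0123.456.789": {"address": "Turnhoutsebaan 10", "status": "active"},
    "0987.654.321": {"address": "Kerkstraat 5", "status": "bankrupt"},
}

# Curative audit: report records that diverge between the two sources.
for key, internal in internal_db.items():
    reference = reference_db.get(key)
    if reference is None:
        print(f"{key}: missing in reference source")
        continue
    for field in internal:
        if internal[field] != reference.get(field):
            print(f"{key}: {field} differs ({internal[field]!r} vs {reference.get(field)!r})")
```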
Curative approaches are very important, but they do not necessarily provide a definitive solution. Equally important is a preventive approach. “For example, you can adapt the database schema to a more suitable model. The synergy between data suppliers and administrators makes it possible to monitor anomalies and their handling, in order to track down the causes of errors and fix them structurally at the source. Such a cause could be, for example, a misinterpretation or insufficient documentation.”
“Curative and preventive approaches make it possible to improve the quality of information and, in certain cases, to achieve a significant ROI. As technologies, the perceived reality and standards evolve, data quality requires constant attention.”
How accurate does data need to be to be effective? Boydens puts this into perspective as well. “The data must be fit for use. There are different needs for different applications.”
Is your data biased?
This applies to the accuracy of datasets, but also to other problems such as bias. In that case, a dataset may be accurate yet colored by the people who created it or by the circumstances. This is particularly problematic when data is used for more advanced applications, such as training algorithms.
A classic example is an algorithm trained to distinguish dogs from wolves based on correct datasets. The algorithm works fine at first, until it suddenly labels a dog as a wolf with great conviction. This is due to unsuitable training data: the photos of wolves mostly have snow in the background, those of dogs do not. Although the data was correct, it introduced a snow bias during the training of the image recognition algorithm, which accidentally learned to label images with snow as wolves and images without snow as dogs, without paying attention to the animal itself.
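A minimal sketch of how such a spurious correlation could be spotted in the training data before training; the has_snow annotation is a hypothetical metadata field added for this example, not part of the original anecdote.

```python
from collections import Counter

# Hypothetical training metadata: label plus an annotated background feature.
samples = [
    {"label": "wolf", "has_snow": True},
    {"label": "wolf", "has_snow": True},
    {"label": "wolf", "has_snow": False},
    {"label": "dog", "has_snow": False},
    {"label": "dog", "has_snow": False},
    {"label": "dog", "has_snow": False},
]

# How often does snow co-occur with each label?
snow_by_label = Counter()
total_by_label = Counter()
for s in samples:
    total_by_label[s["label"]] += 1
    snow_by_label[s["label"]] += s["has_snow"]

for label in total_by_label:
    share = snow_by_label[label] / total_by_label[label]
    print(f"{label}: {share:.0%} of images contain snow")
# A large gap between the labels is a warning that the model may learn
# to recognize snow rather than the animal itself.
```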
Sometimes very important, sometimes less
It gets worse when, for example, such otherwise correct datasets contain mostly photos of white men, so that an algorithm cannot recognize black women. Data used to train such algorithms must not only be factually correct, but also representative of what you are trying to achieve. That doesn’t mean that every dataset suddenly has to be absolutely accurate and representative before it can be used as training data.
Katy Fokou, researcher at Smals, illustrates this. “If you use your data to build algorithms that provide purchase suggestions in a web shop, bias is not a big catastrophe. When it comes to training an algorithm used by the police or the judiciary, data of insufficient quality is much more problematic.” At Smals, a provider of IT for the state social security agencies, data quality is therefore monitored much more strictly than at many (international) private companies.
Whatever your plans with data, it all starts with quality. And a very important lesson is that this quality is not static. The real world evolves, and your data is a representation of that real world. So it makes sense for the data to evolve along with it.