What is Data Discovery?

Drawing value from the vast amount of data your business accumulates is made more difficult by the time it takes to understand your data sources and their underlying technical structure. Organisations can face long start-up times when implementing a data management platform, because it often takes a long time to understand a file's structure before it can be ingested into the data warehouse. Innovation in data discovery processes makes creating data workflows and cleansing data easier and quicker.

In any data management platform, there is a step where the data should be described or metadata should be created. A relational database needs to have the columns described. Big Data and data warehouses need metadata to be created on top of each Parquet file in order to use it effectively. It often falls to data users to describe the format of the data for each data source that feeds into the data management platform. This takes valuable time, time that could be better spent analysing and visualising data and getting valuable insights from data outputs.

Data Discovery creates metadata, so you don’t have to

Data Discovery automatically analyses the data and creates the metadata. The results are confirmed in a review step, where the user can check them and edit the definitions created by the system. The user doesn’t need prior knowledge of the data or its data types because data discovery infers them. For example, column names are derived from the existing Excel header rows, and the remaining metadata is derived directly from the content.
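To make the idea concrete, here is a minimal sketch of how column metadata might be inferred from a file's header row and content. It is an illustration only and not the Finworks implementation: the function names `infer_type` and `discover_metadata`, the CSV input, the four type names and the single `YYYY-MM-DD` date format are all assumptions made for this example.

```python
import csv
import io
from datetime import datetime


def infer_type(values):
    """Return the narrowest type name that fits every non-empty value.

    Hypothetical helper: a real product would support many more types
    and date formats than this sketch does.
    """
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "string"

    def all_parse(parser):
        # True only if the parser accepts every non-empty value.
        try:
            for v in non_empty:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "string"


def discover_metadata(csv_text):
    """Derive column names from the header row and types from the content."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    metadata = []
    for name in reader.fieldnames or []:
        # Short rows yield None for missing cells, so normalise to "".
        values = [row[name] or "" for row in rows]
        metadata.append({
            "name": name,
            "type": infer_type(values),
            "nullable": any(not v.strip() for v in values),
        })
    return metadata


sample = (
    "isin,price,trade_date,notes\n"
    "GB0001,10.5,2024-01-02,\n"
    "GB0002,11,2024-01-03,ok\n"
)
for column in discover_metadata(sample):
    print(column)
```

The output of a step like this is what a user would then confirm or correct in the review step, rather than writing the definitions by hand.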

Automating the data discovery process has an important impact on your resources: it eliminates the need for your team to do any coding, and it mitigates risk by reducing human error. Data discovery optimises data quality and data governance before connecting to Business Intelligence tools.

There are two key benefits of data discovery:

Data discovery reduces the time needed to create an automated cleaning line. As the start-up time is reduced, data management processes are put in place more quickly.
Resources are released to focus on the value of data. Data and business users do not need detailed technical knowledge about the underlying input files or databases in order to define their data types and create data cleansing workflows.

The Finworks Data Platform includes Data Discovery to accelerate data ingestion by auto-generating metadata from any source and in any format. Data discovery is a preparatory step: in order to ingest data into the data warehouse, you need a definition of the data. Based on the data definition, data types and min/max values, we generate a cleaning workflow for data ingestion. This makes the creation of the cleaning line easier and the automation of the cleaning lines quicker. Data is then ingested through this automated cleaning workflow.
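As a rough illustration of generating a cleaning workflow from a data definition, the sketch below turns a list of column definitions (name, type, min and max) into per-column checks and then runs incoming rows through them. Everything here is hypothetical and written for this example: the metadata layout, the function names `make_step`, `build_cleaning_workflow` and `run_workflow`, and the choice to reject a whole row on its first failing column are assumptions, not a description of the product's internals.

```python
# Assumed mapping from metadata type names to Python converters.
CASTS = {"integer": int, "float": float, "string": str}


def make_step(col):
    """Build one check for one column from its (hypothetical) definition."""
    name, type_name = col["name"], col["type"]
    cast = CASTS[type_name]
    lo, hi = col.get("min"), col.get("max")

    def step(row):
        raw = row.get(name)
        if raw is None or raw == "":
            # Empty values pass through as None in this sketch.
            row[name] = None
            return None
        try:
            value = cast(raw)
        except (TypeError, ValueError):
            return f"{name}: cannot convert {raw!r} to {type_name}"
        if lo is not None and value < lo:
            return f"{name}: {value} below minimum {lo}"
        if hi is not None and value > hi:
            return f"{name}: {value} above maximum {hi}"
        row[name] = value  # Store the typed value back on the row.
        return None

    return step


def build_cleaning_workflow(metadata):
    """One step per column, in the order the columns are defined."""
    return [make_step(col) for col in metadata]


def run_workflow(steps, rows):
    """Split rows into clean rows and (row, errors) rejects."""
    clean, rejected = [], []
    for original in rows:
        row = dict(original)  # Work on a copy; leave the input untouched.
        errors = [e for e in (step(row) for step in steps) if e]
        if errors:
            rejected.append((row, errors))
        else:
            clean.append(row)
    return clean, rejected


metadata = [
    {"name": "quantity", "type": "integer", "min": 0, "max": 1000},
    {"name": "price", "type": "float", "min": 0.0, "max": None},
]
rows = [
    {"quantity": "10", "price": "9.99"},
    {"quantity": "-5", "price": "1.0"},
    {"quantity": "abc", "price": "2.0"},
]
clean, rejected = run_workflow(build_cleaning_workflow(metadata), rows)
print(clean)
for row, errors in rejected:
    print(errors)
```

Because the checks are built from the definition rather than coded by hand, a corrected definition from the review step only requires regenerating the workflow.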

Our self-service data solution accelerates the process of data ingestion by gathering data from multiple sources in any format to cleanse, consolidate and aggregate into a database of your own choosing. The predefined data workflows enable users to proxy upstream systems and to validate and normalise data. Legacy data is added quickly and easily: just point the solution to any data source, including your existing data warehouse. What’s more, our data platform continuously monitors data ingestion and ongoing data integrity, so you can be sure of maintaining the trust and goodwill of your data consumers.

The Finworks Data Platform goes beyond the isolated identification of metadata by providing a holistic workflow for data ingestion. The data discovery process reduces the setup time for your data management solution so businesses can gain control of their data and quickly move from data ingestion to data insights. By accelerating the data discovery process, the focus can shift to exploiting the full value of data, not only to improve decision-making but also to directly optimise business processes. With the Finworks Data Platform, business users, data analysts and data scientists can quickly and easily query data and visualise their findings.

Our expertise includes building data cleaning for a universe of 40 million financial instruments for a major public European institution. The solution includes daily processes to update 2 million dynamic data sets and perform the associated complex calculations. The system detects and reports on anomalies across 12 months of data history, comprising 480 million entries.

So instead of setting up and coding the metadata each time, we took the approach of automating the process by defining the rules and then letting the system automate the cleaning line creation. The implementation achieved a massive improvement in both volume and quality of outputs. The solution eliminated duplication of work across affiliates and departments.