In the following article, we tell the story of data using this infographic as a framework, and talk about how we work with clients to leverage their data assets for positive business change.
This article will grow over time as we publish each component of the story.
Keep an eye out for updates as we publish them here and on LinkedIn.
Follow us on LinkedIn to keep informed, or Contact Us to chat about your data challenges and opportunities.
Part 2: Data Collection: Discovery
Part 3: Data Collection: Acquisition
You may have seen something like this infographic before. We like it – not because it is searingly accurate, but because it is a simple aid to telling a story. As with any analogy, if you push it too hard it will fall over. However, each section is meaningful, so we can use it as a framework to tell the story of how we can help transform your business by turning your raw data into actionable, impactful insights that deliver positive business change.
Over the coming weeks, we will visit each stage in the journey from raw, unrefined data through to analysis, insight and action – using this simple device to tie the story together. We will talk about moving from the realm of enablement into the realm of positive business change and how you can make this happen. In this series of posts, we want to shed some light on how we help organisations at any level of data maturity make the best use of the data they have, and how this can be tied to business goals.
We will also be talking about the techniques, analyses, models and approaches we use to deliver real benefit in the realm of positive business change - the real game-changing things which sit within the Data Analysis & Modelling phase. We will talk about why a particular model or analysis can be of use, and how you can start to identify these opportunities within your business.
As we post updates to this series on LinkedIn, we will bring each element together here, so you can see the story building and how each of those elements fits together.
The Data Collection phase in the data story is concerned with two key concepts – Discovery and Acquisition. Discovery is the process through which we gain a deep understanding of our client’s business, the people involved with the processing and use of data, the current systems that perform that processing, the formal and informal processes in place, and current and future requirements. Acquisition is the process of putting in place systems and processes to gather each data set from around the business into a central location.
Using the analogy, you could imagine that this is the process of asking your household where all the Lego sets are, going around and finding them, and then putting them all in a pile in the living room ready to sort.
Data Collection: Discovery
The importance of the discovery phase cannot be overstated – this part of the story lays the foundations on which the remaining parts are built. By diligently working with the client to really understand the what, where, who, why and when of data within the organisation, we can prepare all parties for the road ahead. In general, there are five areas of work in the discovery phase that we concentrate on: People, Systems, Processes, Data and Access. However, this may change depending on the size, type and maturity of the business. A key asset from the discovery process is a clear understanding of requirements – sometimes the client has a clear vision of what they want, and sometimes we partner with the business to help define that vision.
Data Collection: Acquisition
This phase is concerned with using all of the knowledge gained during the discovery phase to create a central “collection point” for all appropriate data. Early in the engagement the central collection point may be conceptual (a document listing all data sources, responsible people, processing rules, etc) and in later stages this may be materialised as an actual system (for example, a new cloud-based data storage facility with relevant ingress tooling). With modern tooling, this central collection point may remain “virtual” – i.e. data remains where it is, but becomes queryable from a single central location.
As part of the Acquisition process, we may be asked to recommend (and potentially install) a system that is appropriate for the task. In these instances we take a wide view of the market and recommend appropriate systems that fit the requirements and budget in place. We can also run RFI/RFP processes for our clients, asking the difficult questions to discover what is behind the shiny PowerPoint sales decks.
Data Collection: Discovery
In some cases, a first engagement with a new client may be a tightly bounded brief with specific deliverables and timelines. However, if this is not the case, we always recommend starting new relationships with a discovery phase. This can lay the foundations for all future projects that leverage organisational data for positive business change. Although not imperative, it is incredibly useful to have an internal sponsor to work with who can make introductions to relevant people and help coordinate requests.
The objective of the discovery phase is to understand the key motivations behind the engagement, where the organisation is aiming to get to, and the key people, processes and systems that inform how data is currently used within the business.
For some organisations – particularly small businesses or those with high data maturity – the discovery phase can be fairly quick, as the scope may be narrow or documentation good. However, for other organisations this phase may take some time and will bring to the surface new insights into their people, systems, processes and data (valuable assets in their own right). Of course, one of the key insights from this work is discovering where the data actually is.
While we always seek to tailor our approach to discovery to the bespoke needs of the organisation we are working with, there are some key commonalities that feature in most discovery exercises.
People
Conducting a series of interviews with stakeholders and domain experts within the business will shed light on who owns what, who needs what, what challenges exist and the aspirations of teams and individuals when it comes to data and insight. It will also bring data champions to the fore – those who are already personally advancing up the data maturity ladder and who wish to bring their colleagues with them. These individuals are invaluable, as they can be key agents of change. From these interviews (and there may be many of them) a core group of people with the right knowledge and skills will emerge. This internal group will be critical in helping with the remaining aspects of discovery – and likely instrumental in delivery beyond the Discovery phase.
What we learn from these interviews will inform the remaining discovery exercise – systems, processes, data and access.
Systems
During the interviews, we aim to uncover as many of the business systems and linkages between them as possible. These will be mapped out and presented back, with data flows, ownership and responsibilities, external partners and suppliers, and other information attached. There can then be an iterative approach to completing the systems maps – once the internal teams see this new asset, it is likely to spark more conversations about which systems are in place and how they are connected and used (and what the opportunities and challenges are with them). Quite often there will be important components that are understood by a single person – and this may be the first time that many in the organisation have seen the full depth and breadth of systems in operation.
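To make the idea of a systems map concrete, here is a minimal sketch in Python of how one might be held as data rather than only as a diagram. Everything in it is an illustrative assumption: the `System` and `SystemsMap` classes, the example system names and the people are hypothetical, and a real map would carry far more detail (suppliers, data sets, responsibilities).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class System:
    """One business system uncovered during the interviews (hypothetical shape)."""
    name: str
    owner: str
    experts: Set[str] = field(default_factory=set)  # people who understand it


@dataclass
class SystemsMap:
    """Systems plus the directed data flows between them."""
    systems: Dict[str, System] = field(default_factory=dict)
    flows: Dict[str, Set[str]] = field(default_factory=dict)

    def add_system(self, system: System) -> None:
        self.systems[system.name] = system

    def add_flow(self, source: str, target: str) -> None:
        # Only allow flows between systems that have already been recorded.
        if source not in self.systems or target not in self.systems:
            raise KeyError(f"unknown system in flow {source!r} -> {target!r}")
        self.flows.setdefault(source, set()).add(target)

    def single_person_risks(self) -> List[str]:
        """Systems understood by at most one person."""
        return sorted(
            name for name, system in self.systems.items()
            if len(system.experts) <= 1
        )

    def downstream(self, name: str) -> Set[str]:
        """Every system that directly or indirectly receives data from `name`."""
        seen: Set[str] = set()
        to_visit = [name]
        while to_visit:
            current = to_visit.pop()
            for target in self.flows.get(current, ()):
                if target not in seen:
                    seen.add(target)
                    to_visit.append(target)
        return seen


def build_example_map() -> SystemsMap:
    """A made-up three-system map used purely for illustration."""
    example = SystemsMap()
    example.add_system(System("CRM", owner="Sales", experts={"Asha", "Ben"}))
    example.add_system(System("Warehouse", owner="IT", experts={"Ben"}))
    example.add_system(System("Dashboard", owner="Finance", experts={"Cara", "Dev"}))
    example.add_flow("CRM", "Warehouse")
    example.add_flow("Warehouse", "Dashboard")
    return example
```

With a structure like this, `build_example_map().single_person_risks()` picks out the components that only one person understands, and `downstream("CRM")` shows which systems would be affected by a change at the source.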
Processes
All organisations will have some processes in place to govern the implementation, development and use of their systems and data – whether these are well documented or live inside a single team member’s head. It is important that we pick out and document the processes that are pertinent to the data project – care should be taken here not to turn this effort into a full business process mapping exercise if that is not needed (for large organisations this can be a multi-month or multi-year project in its own right). We are mostly concerned with the processes that define how data systems operate, and how data flows into, through and out of the organisation. Where no process exists it may be necessary to help the organisation create one, particularly if it helps with good data governance.
The discovery of processes helps us to understand many aspects of the organisation’s operations, but it also helps to define how data is used and how it can be analysed and presented back to drive growth.
Data
Often the most complex part of discovery, data can be anywhere within an organisation. More often than not there will be a mix of well-governed and wild-west data – nicely configured and maintained databases alongside hundreds or thousands of individual, unmanaged and uncontrolled spreadsheets, CSV files and other structured and unstructured documents. During this part of discovery it is critical that we focus on those data repositories and sources that are going to add value to the process. Often, a spreadsheet that a team member uses for their day-to-day work, and that would appear to be critical to them, is derived from deeper, more “truthful” data sources. We need to find that source of truth and work with it. How that team member then uses and manipulates the data for their own requirements still needs to be understood, but this becomes more of a “process” than a data source.
Access
A common trip hazard in data projects is a lack of access to the systems and data required for discovery. During the interviews and development of the systems and process maps, we need to make sure that we find out how to gain access to those systems and data assets. This could be as simple as the provision of an endpoint, username and password, or it may require a more in-depth process where no direct access is provided and all requests for data are channelled through an internal team. Either way, we work diligently with the security requirements of the organisation – we are being entrusted with access to the organisation’s crown jewels and always ensure we treat them as such. A security breach can spell serious trouble for both the organisation and us, so good governance is paramount.
Data Collection: Acquisition
Once we have discovered where the data is, what systems there are, who the relevant people in the organisation are and what processes and access requirements are in place, we can move on to the act of actually acquiring the data.
Acquisition is the final part of the “Collection” phase of the Data Story – and it may be surprising to learn that “acquisition” in this case doesn’t necessarily mean “gathering together”, or in fact moving the data at all. What we mean by acquisition is the process of getting the data into a state where it is ready to be prepared for analysis.
The process we go through for this depends partly on the data maturity of the organisation we are working with. Those that have a high level of data maturity may already have strong data environments that can be used to process the data and prepare it for analysis – sometimes in situ. For organisations that have a low level of data maturity, however, we may need to bring all of the data we have discovered into a new data environment to enable the preparation process to start. Sometimes this will necessitate the implementation of a temporary or permanent (depending on the type of engagement we have) system that will act as a central data store. Acquisition is then about bringing all of the relevant data into this new environment.
Either way, the outcome of the data acquisition process should be an environment that contains all relevant data (still quite raw), structured in a way that allows it to be prepared for analysis. This data preparation is then the next phase of the Data Story, which we’ll talk about next time.
With thousands of different potential data sources, it is critical that we have experience with a good range of them. Acquiring data from a nicely constructed database might be relatively easy; however, extracting tabular data from poorly scanned documents in PDF format presents some unique challenges.
Some common data acquisition methods are:
Querying structured files/databases
Making up the vast majority of data acquisition tasks, querying structured databases or files involves automated methods to extract data from systems using the relevant interfaces for those systems. Extracting large data sets from an existing database using SQL via a script is one example; automatically processing 10,000 CSV files stored in a document store, recognising and extracting the appropriate data, is another. Having a wide range of experience across many different types of datastore is critical – as well as the experience and skills to know how to extract that data in the most efficient way.
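As an illustration of extracting a large data set using SQL via a script, the sketch below reads a query result in fixed-size batches so the whole table never has to sit in memory at once, and writes it out as CSV. It uses Python's built-in `sqlite3` and `csv` modules purely so that it is self-contained; the `orders` table, the batch size and the function names are assumptions for the example, and a real engagement would use the driver for whatever database is in place.

```python
import csv
import io
import sqlite3
from typing import Iterator, List, TextIO, Tuple


def extract_in_batches(
    conn: sqlite3.Connection, query: str, batch_size: int = 1000
) -> Iterator[List[Tuple]]:
    """Yield the result of `query` as lists of at most `batch_size` rows."""
    cursor = conn.execute(query)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


def dump_to_csv(
    conn: sqlite3.Connection, query: str, out: TextIO, batch_size: int = 1000
) -> int:
    """Write the query result to a file-like object as CSV; return the row count."""
    writer = csv.writer(out)
    total = 0
    for batch in extract_in_batches(conn, query, batch_size):
        writer.writerows(batch)
        total += len(batch)
    return total


def build_demo_db(row_count: int = 2500) -> sqlite3.Connection:
    """Create an in-memory database with a made-up `orders` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany(
        "INSERT INTO orders (amount) VALUES (?)",
        [(float(i),) for i in range(row_count)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    buffer = io.StringIO()
    written = dump_to_csv(demo, "SELECT id, amount FROM orders ORDER BY id", buffer)
    print(f"extracted {written} rows")
```

Batching is a design choice rather than a requirement: for small tables a single fetch is simpler, but for very large extracts it keeps memory use predictable.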
Web scraping
Sometimes we find that an organisation has a critical but poorly maintained and documented application that they use to enter and store data. If there are no data export capabilities, then web scraping may be the only way to extract the data. If the application is not web based, then there are tools we can use to interact automatically with it in a scripted way and creatively extract the required data.
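A minimal sketch of the scraping idea, assuming the application renders its records as an ordinary HTML table: the parser below uses only Python's standard `html.parser` module to pull the cell text out row by row. The `TableScraper` class and the sample markup are hypothetical, and fetching the page itself (logging in, paging through screens) is deliberately left out, as that part is always specific to the application.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None   # cells of the row being read
        self._cell: Optional[List[str]] = None  # text fragments of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text can arrive in several fragments, so collect and join later.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html: str) -> List[List[str]]:
    """Return the rows of any tables found in `html` as lists of cell text."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


SAMPLE_PAGE = """
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td> Widget </td><td><b>3</b></td></tr>
  <tr><td>Gadget</td><td>12</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in scrape_table(SAMPLE_PAGE):
        print(row)
```

Real pages are rarely this tidy, so a scraper like this usually needs adjusting for nested tables, merged cells and markup that changes between screens.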
Using APIs
Where an API to an application is available, we can leverage that to extract data in a formal way. Sometimes this might require a bit of creativity to work with the limitations of the API, but usually there is a way! This requires an understanding of the technology used for the API and how to construct efficient queries against it. APIs are sometimes “rate limited” (they only allow a certain number of queries per second) or “response size” limited (they will only return a response up to a maximum size), so for large extracts we may again need to be creative in how we construct the scripts.
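One common way of living within both limits is to request the data a page at a time and pause between calls. The sketch below shows that pattern under stated assumptions: `fetch_page` is a hypothetical stand-in for whatever call a given API offers, assumed here to return a list of records plus a cursor for the next page (or `None` when there are no more). Real APIs differ in how they page and how they report their limits.

```python
import time
from typing import Callable, Iterator, List, Optional, Tuple

# A page is (records, cursor for the next page or None). Assumed shape.
Page = Tuple[List[dict], Optional[str]]


def fetch_all(
    fetch_page: Callable[[Optional[str]], Page],
    max_calls_per_second: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[dict]:
    """Yield every record, one page at a time, without exceeding the call rate."""
    min_interval = 1.0 / max_calls_per_second
    cursor: Optional[str] = None
    last_call: Optional[float] = None
    while True:
        if last_call is not None:
            # Wait out whatever remains of the minimum gap between calls.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        records, cursor = fetch_page(cursor)
        yield from records
        if cursor is None:
            break


def make_fake_api(records: List[dict], page_size: int) -> Callable[[Optional[str]], Page]:
    """Build an in-memory stand-in for a paged API (for demonstration only)."""

    def fetch_page(cursor: Optional[str]) -> Page:
        start = int(cursor) if cursor else 0
        next_start = start + page_size
        next_cursor = str(next_start) if next_start < len(records) else None
        return records[start:next_start], next_cursor

    return fetch_page


if __name__ == "__main__":
    api = make_fake_api([{"id": i} for i in range(7)], page_size=3)
    for record in fetch_all(api, max_calls_per_second=5.0):
        print(record)
```

Passing `sleep` and `clock` in as parameters is only there to make the pacing easy to test; in normal use the defaults are fine.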
Extracting from PDF files
Many PDF files contain both the “image” of the document and the underlying text (which can easily be extracted) – but this isn’t always the case. Even if the underlying text can be extracted it’s usually not presented in a nice table (even though that is what the document shows). And if the text layer isn’t available then we need to train OCR (Optical Character Recognition) software to do the heavy lifting for us.
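To show why extracted text still needs work before it looks like a table, here is a small sketch that rebuilds rows from the plain text of a page. It assumes the text layer has already been pulled out by a PDF library with the layout preserved, so that columns are separated by runs of two or more spaces. That assumption often does not hold, and both the function name and the sample text are made up for the example.

```python
import re
from typing import List

# Two or more whitespace characters are treated as a column boundary (assumption).
_COLUMN_GAP = re.compile(r"\s{2,}")


def lines_to_rows(text: str, expected_columns: int) -> List[List[str]]:
    """Split each line on wide gaps and keep lines with the expected column count."""
    rows: List[List[str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        cells = _COLUMN_GAP.split(line)
        # Titles, footers and page numbers rarely split into the right number
        # of cells, so this simple filter drops most of them.
        if len(cells) == expected_columns:
            rows.append(cells)
    return rows


SAMPLE_TEXT = """Quarterly report

Region        Units   Revenue
North East    120     4,500.00
South         95      3,100.50
Page 1 of 3
"""

if __name__ == "__main__":
    for row in lines_to_rows(SAMPLE_TEXT, expected_columns=3):
        print(row)
```

Values that contain a single space, such as “North East”, survive because only wider gaps are treated as column breaks; documents where columns are separated by a single space need a different approach, such as using the position of each piece of text on the page.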
Manual Data Entry
It might sound archaic, but sometimes manual data entry is the best that’s available. If you have a cabinet of handwritten documents that need to be processed, it is sometimes much more efficient (and accurate) to transcribe them manually than to configure a cutting-edge scanning and text recognition system to do the job. Humans are still (currently) superior when context is key and content representing similar things is presented in different formats. There are many companies around the world that specialise in providing this service.
And many others
There are thousands of different ways that data can be stored, from hand-written documents to super-fast, niche, in-memory databases and they all have their idiosyncrasies. It is crucial that you have a data partner with the experience and creativity to deal with them all.
At the end of the Acquisition stage, we should have a well-documented set of data sources ready for processing. We may have moved some of the raw data into more appropriate locations, but some may be left where it is (if that location remains the most appropriate for that data). This documentation is now a hugely valuable asset for the organisation, and should be widely distributed, particularly among the technical teams. It may contain information about the organisation’s data that has never before been seen and will support projects well beyond the current scope of work.
Data Preparation
Preparing data to be useful. Content arriving in the coming weeks.
Data Visualisation and Exploration
Discovering what the data can potentially tell us and showing this visually. Content arriving in the coming weeks.
Data Analysis and Modelling
Extracting actionable insight from the data. Content arriving in the coming weeks.
Data Action and Storytelling
Putting analytics to work transforming your organisation and creating a data culture. Content arriving in the coming weeks.