We often think of Big Data or Dark Data as a thing – one big, amorphous blob of stuff that we either can’t do anything with, or must deal with as a single undifferentiated mass.
That isn’t true, of course. What we really mean by a big data environment is a very large repository, or group of repositories, containing a very large volume of material over which we may have very little understanding or control. In the case of dark data, the acknowledgment that we have little understanding or control is explicit – that’s what’s dark about it.
Both of these assumptions are untrue, at least to some extent. Neither Big Data nor Dark Data is ever one thing; each is instead a collection of many things in many forms. Nor is it true that we have no control – or at least, that we can gain none. Even in a very large-scale environment, where very little is known about the contents initially, we can be sure that it has some or all of the following characteristics:
- There are multiple file types in multiple formats;
- The information content of the files is varied, and likely covers the entire spectrum of activities of the owner;
- Some of the data objects will reside in well organized, well identified repositories, but much or most will not;
- Some of the data objects will be well identified with respect to their metadata and other characteristics, but much or most will not;
- There is not a great deal of institutional knowledge about the environment as a whole – instead, parts of it will be understood by some people, parts by others, and some by no one at all; and
- The environment as a whole is not subject to overall rules of management, governance and retention.
None of these characteristics is either/or. Instead, each operates on a sliding scale, and any environment could have any one of these characteristics in any measure. To the extent that the sliders are all pushed to the right, it’s merely a big data environment. To the extent that the sliders are all pushed to the left, it has also become a dark data environment.
So, now that we know the environment is a bit of a mess, maybe quite a lot of mess, the question is what to do about it. The short answer is, more than a lot of people think. The key here is to recognize that:
- First, within any organization, there are a finite number of activities generating a finite number of data types.
- Second, if we understand, even broadly, the kinds of data types we are dealing with, we are in a position to begin applying some sort of governance to them.
- Third, finding out what data types we have, and where they might reside, is not impossible. There are tools available that can assist us with this, provided we understand what they do and what their limitations are.
Let’s consider point number 1. Your organization may be big, and it may be complicated, but analytically, and at a high level, it really only does a few things: it makes or markets products and services, it ships things, it receives things, it runs physical plants, be they offices or factories. And so on. Even for a very large, complex organization, the number of these topics is finite and remarkably small. And then of course, it has all of the support activities necessary to do these things – human resources, tax and accounting, and so on. And again, these things are all pretty well understood and finite in number. And it’s a pretty good bet that most or all of the data in your Big, Bad, Dark environment falls into these buckets.
Now let’s consider point number 2. If you know what kinds of data you have, you’re on track to building some of the basic governance tools necessary to get a handle on that big, ugly environment. Most or all of the basic governance tools – records retention schedules, privacy policies, data loss prevention policies, and so on – are content-based. This means that they apply to data objects based upon the information contained in those data objects. So if you know what kind of data objects you have, you’re in a position to craft the basic set of governance tools necessary to begin effective management of them.
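To make the idea of content-based governance concrete, here is a minimal sketch in Python. The categories, retention periods and flags below are purely illustrative assumptions, not recommendations; a real schedule would be set by counsel and records managers.

```python
# Hypothetical content categories mapped to governance rules.
# Every value here is an illustrative assumption, not legal guidance.
RETENTION_SCHEDULE = {
    "tax_and_accounting": {"retain_years": 7, "dlp_sensitive": True},
    "human_resources":    {"retain_years": 6, "dlp_sensitive": True},
    "marketing":          {"retain_years": 2, "dlp_sensitive": False},
}

# Conservative fallback so unclassified content doesn't silently
# escape governance.
DEFAULT_RULE = {"retain_years": 10, "dlp_sensitive": True}

def rule_for(category):
    """Look up the governance rule attached to a content category."""
    return RETENTION_SCHEDULE.get(category, DEFAULT_RULE)
```

The design point is simply that once data objects are sorted into a finite set of content categories, the rules become a lookup rather than a case-by-case judgment.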
Finally, point number 3. Those governance tools aren’t much use if you can’t identify the data objects to which they apply. Fortunately, you can. There is a wide variety of software available to do precisely this in very large environments. It goes by various names – e-discovery software, data loss prevention software, predictive coding software and so on – but it’s all really different flavors of the same thing: you feed some characteristics into it, be they keywords, character patterns such as those found in taxpayer identification numbers, or whatever, and it crawls your system and finds data objects that match those characteristics. For data objects that can’t be read in this fashion, there is even something called glyph recognition software, which does a similar sort of analysis based upon the visual characteristics of each page. Each of these technologies permits you to sort that gigantic pile of stuff into smaller, reasonably well-organized piles to which you can then apply your governance rules.
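As a rough sketch of what such tools do under the hood, here is a hypothetical pattern scan in Python: it walks a directory tree and flags files matching simple character patterns, such as a U.S. taxpayer-ID-style number. The patterns and file handling are deliberately simplified assumptions; commercial products add indexing, format-aware text extraction, OCR, and far more robust matching.

```python
import os
import re

# Illustrative character patterns; a real deployment would tune these.
PATTERNS = {
    # U.S. SSN-style taxpayer ID: three digits, two digits, four digits.
    "taxpayer_id": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def classify_text(text):
    """Return the set of pattern names that match the given text."""
    return {name for name, rx in PATTERNS.items() if rx.search(text)}

def crawl(root):
    """Walk a directory tree and map each matching file to its hits."""
    results = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    matches = classify_text(f.read())
            except OSError:
                continue  # unreadable object; a real tool would log it
            if matches:
                results[path] = matches
    return results
```

Even this toy version illustrates the principle: feed in characteristics, crawl the environment, and come back with smaller, better-identified piles of data objects.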
Is any of this perfect? Of course not! Not even close. But that’s not the point. In a large-scale environment, perfect application of the rules is impossible and always has been. Even in the good old days, when boxes were carefully packed and labeled, and all of your records were placed into warehouses with barcodes and other finding aids, there was a lot of leakage and loss, and knowledgeable insiders have always been aware of this. This is no different – in fact, it’s exactly the same conceptually: a big data environment that cannot be fully controlled. So your goal shouldn’t be perfection. It should instead be based upon reasonable risk and cost/benefit analysis, and implemented with the recognition that you always retain some risk, which becomes a cost of doing business. How you go about analyzing that risk is the subject of my next blog post.