Will Human-Generated Big Data ‘Digital Exhaust’ Give Us Digital Indigestion?
By Rob Sobers
The potential of big data (by which I mean the petabytes and exabytes of structured and unstructured information being generated and assiduously gathered at this very moment by users and networks across the world) is almost always portrayed as a bright, shiny new wave of opportunity, presenting endless new methods for continued growth and innovation.
Big data sometimes reminds me of those corporations that announce bigger profits every quarter without fail. I can't help feeling that at some point the bubble has to burst and they will slip into the red. That got me thinking: if we carry on producing big data at this rate, there may come a point where we simply cannot analyze it all.
I thought it was just me until I saw a recent survey whose respondents draw the same conclusion. The experts believe the vast quantities of data (which they have delicately called "digital exhaust") that humans and computers will be creating by the year 2020 could enhance productivity, improve organizational transparency, and expand our "knowable future." But they also worry about "humanity's dashboard" being in government and corporate hands, and they doubt our ability to analyze it properly. Are their worries well-founded?
Human generated content comprises all the files and e-mails that we create every day: the presentations, word processing documents, spreadsheets, audio files and other documents our employers ask us to produce hour by hour. These files take up the vast majority of digital storage space in most organizations; they are kept for significant amounts of time and carry huge amounts of metadata. Human generated content is huge, and its metadata is even bigger. Metadata is the information about a file: who created it, what type of file it is, what folder it is stored in, who has been reading it and who has access to it. Together, the content and the metadata make up human generated big data.
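To make the idea concrete, here is a minimal sketch in Python of the kind of metadata a single file carries. It only gathers what a standard `stat()` call exposes (name, folder, type, size, owner, last-modified time); real data-governance tools also track read events and access-control lists, which are not visible here. The function name `file_metadata` is my own illustration, not anything from the survey, and the `pwd` owner lookup assumes a Unix-like system.

```python
import pwd  # Unix-only: maps numeric user IDs to user names
from datetime import datetime, timezone
from pathlib import Path

def file_metadata(path):
    """Collect a few basic pieces of metadata for one file.

    An illustrative sketch only: who created the file, who has read it,
    and who *should* have access require audit logs and ACLs that a
    plain stat() call cannot see.
    """
    p = Path(path)
    st = p.stat()
    return {
        "name": p.name,
        "folder": str(p.parent),
        "type": p.suffix or "(no extension)",
        "size_bytes": st.st_size,
        "owner": pwd.getpwuid(st.st_uid).pw_name,
        "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    print(file_metadata(__file__))
```

Even this handful of fields, multiplied across every file in an organization, hints at why the metadata can outgrow the content it describes.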
The problem is that most of us, organizations and governments alike, are not yet equipped with the tools to exploit human generated big data. A recent survey of more than 1,000 Internet experts and other Internet users, published by the Pew Research Center and the Imagining the Internet Center at Elon University, concludes that the world may not be ready to properly handle and understand big data. These experts believe the huge quantities of data (the "digital exhaust" mentioned above) that will be created by the year 2020 could well enhance productivity, improve organizational transparency and expand the frontier of the "knowable future." However, they are concerned about whose hands this information rests in and whether governments or corporations will use it wisely.
The survey found that “…human and machine analysis of big data could improve social, political and economic intelligence by 2020. The rise of what is known as big data will facilitate things like real-time forecasting of events; the development of “inferential software” that assesses data patterns to project outcomes; and the creation of algorithms for advanced correlations that enable new understanding of the world.”
Of those surveyed, 39% of the Internet experts asked agreed with the counter-argument to big data’s benefits, which posited that “Human and machine analysis of big data will cause more problems than it solves by 2020. The existence of huge data sets for analysis will engender false confidence in our predictive powers and will lead many to make significant and hurtful mistakes. Moreover, analysis of big data will be misused by powerful people and institutions with selfish agendas that manipulate findings to make the case for what they want.”
As one of the study’s participants, entrepreneur Bryan Trogdon put it: “Big data is the new oil,” observing that, “…the companies, governments, and organizations that are able to mine this resource will have an enormous advantage over those that don’t. With speed, agility, and innovation determining the winners and losers, big data allows us to move from a mindset of ‘measure twice, cut once’ to one of ‘place small bets fast.’”
Sean Mead, Director of Analytics at Mead, Mead & Clark, Interbrand said: “Large, publicly available data sets, easier tools, wider distribution of analytic skills, and early stage artificial intelligence software will lead to a burst of economic activity and increased productivity comparable to that of the Internet and PC revolutions of the mid to late 1990s. Social movements will arise to free up access to large data repositories, to restrict the development and use of AIs, and to ‘liberate’ AIs.”
These are very interesting arguments and they do begin to get to the heart of the matter—which is that our data sets have grown beyond our ability to analyze and process them without sophisticated automation. We simply have to rely on technology to analyze and cope with this enormous wave of content and metadata.
Analyzing human generated big data has enormous potential. More than potential, harnessing the power of metadata has become essential to manage and protect human generated content. File shares, emails, and intranets have made it so easy for end users to save and share files that organizations now have more human generated content than they can sustainably manage and protect using small data thinking. Many organizations face real problems because questions that could be answered 15 years ago on smaller, more static data sets can no longer be answered. These questions include: Where does critical data reside, who accesses it, and who should have access to it? As a consequence, IDC estimates that only half the data that should be protected is protected.
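The question "who has access to it, and who should?" can be illustrated with a toy permissions audit. The sketch below, entirely my own construction, walks a directory tree and flags files whose Unix mode bits let any user on the system read them; real products inspect ACLs, group memberships, and actual access logs rather than mode bits alone.

```python
import os
import stat

def find_overexposed_files(root):
    """Walk a directory tree and flag files readable by every user.

    A toy stand-in for a data-governance access review: it checks only
    the 'other' read bit, not ACLs, shares, or who actually opened what.
    """
    flagged = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.stat(path).st_mode
            except OSError:
                continue  # broken symlink, permission denied, etc.
            if mode & stat.S_IROTH:  # world-readable
                flagged.append(path)
    return flagged

if __name__ == "__main__":
    for path in find_overexposed_files("."):
        print(path)
```

Running even this crude check across a large file share shows why the problem resists manual effort: the answer changes daily, and only automation can keep up.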
And there is yet another problem that the more pessimistic experts worry about: cloud-based file sharing. We love the cloud in the same way we love our iPhones and iPads, and that is why it keeps expanding. In fact, the cloud is expanding so fast that it is becoming a problem in itself. Not only do we generate far more data, but we like to keep it on different devices and, increasingly, to back it up. These good habits are causing the cloud to expand dramatically.
And these cloud services create yet another growing store of human generated content requiring management and protection, one that lies outside corporate infrastructure with varying levels of management control; in fact, many would say with no controls at all.
To my thinking, we have been here before. Those who said the planet was so overpopulated that we would soon be falling off its edges have been proven wrong, as have those who said our carbon emissions would one day kill us. Neither catastrophe has arrived, because A) the scale of each problem was exaggerated and B) once we recognize a problem, we keep finding new ways to turn it into the solution. I think big data falls into this category: it is a solution, not a problem, and one that will help us through our more difficult ventures in the years to come.
Maybe the difference between us and the dinosaurs was that we can see the threats coming and act to avoid them. Now, about that huge meteor hurtling towards Earth…
Rob Sobers is a designer, web developer, and Technical Strategist for Varonis—the leader in data governance and secure collaboration solutions—where he oversees the company’s online strategy. He writes a popular blog on software development and security at accidentalhacker.com and is co-author of the book “Learn Ruby the Hard Way”, which has been used by thousands of students to learn the Ruby programming language. As a former developer and technical consultant, Mr. Sobers has helped many companies design and implement enterprise network architectures, enterprise security solutions, and enterprise management systems.