How to Stop Data Leakage in a Google World
The causes of data leakage range from simple misconfiguration or improper classification of data, which makes it possible for Web servers to publish private and/or sensitive information, to users unwittingly (or not) storing sensitive data where they shouldn’t. Even accepted business practices can expose information. Site scraping is one: a known method for gathering competitive intelligence, in which company A automatically reads company B’s website for available data like price tables and uses that information to cut its own prices and remain on top.
By their very nature, search engines are the Internet’s biggest and most public indexers. They analyze websites and index them for the benefit of anyone who runs an Internet search. One urban legend even states that Google has a complete copy of the entire public Internet on file for data mining and analysis purposes.
As consumers and users of the Internet, we like to believe that most organizations do their best to remove sensitive information from their websites, FTP sites and other front-facing business applications. As it turns out, however, this is not always the case.
Google Tables Search
Google has long been a pioneer of search algorithms, search visibility and advanced indexing. It has stayed one step ahead of other search engines by introducing new ways to tag images by context and even to index FTP sites for content. The recently added Table Search capability, however, really brings home the impact of data leakage in an indexed world.
This is not to say that Google is causing data leakage, but capabilities like “Indexed FTP”, “Search by image”, and now “Table Search” offer new ways to discover and extract data that would otherwise have remained undiscovered.
Here’s one scary example involving passwords publicly available on the Internet (you can think of other interesting tables: PII, salaries, credit card numbers, and so on)
Which is a structured representation of:
What is the security takeaway?
The takeaway here is that Web security is more important now than ever before. Obviously, it doesn’t make sense to block Google from indexing your site (indexing is a business driver), but you should be aware of what content you are exposing and who is accessing it. To that end, organizations should:
- Implement web application security to mitigate hacker risk.
- Validate the content that is accessible via their web servers on a regular basis and/or implement policies to check for outgoing data.
- Implement policies to mitigate bots that may scrape their website for available content.
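As a rough illustration of the second recommendation, here is a minimal sketch of a content check that flags HTML tables whose column headers look sensitive, the kind of structured data a table search could surface. It uses only Python’s standard library. The keyword list and the names `TableHeaderScanner` and `find_sensitive_headers` are assumptions made for this example, not part of any product, and a real audit would need to cover far more than `<th>` cells.

```python
from html.parser import HTMLParser

# Hypothetical keyword list; tune it to whatever your organization treats as sensitive.
SENSITIVE_KEYWORDS = ("password", "passwd", "ssn", "salary", "credit card", "card number")


class TableHeaderScanner(HTMLParser):
    """Collects the text of every <th> cell in an HTML document."""

    def __init__(self):
        super().__init__()
        self._in_header = False
        self.headers = []

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._in_header = True
            self.headers.append("")

    def handle_endtag(self, tag):
        if tag == "th":
            self._in_header = False

    def handle_data(self, data):
        # Text inside a header cell may arrive in several chunks, so accumulate it.
        if self._in_header:
            self.headers[-1] += data


def find_sensitive_headers(html):
    """Return the normalized table headers in `html` that contain a sensitive keyword."""
    scanner = TableHeaderScanner()
    scanner.feed(html)
    scanner.close()
    flagged = []
    for header in scanner.headers:
        text = " ".join(header.split()).lower()  # collapse whitespace, ignore case
        if any(keyword in text for keyword in SENSITIVE_KEYWORDS):
            flagged.append(text)
    return flagged


# Example: a page that publishes a user/password table.
page = (
    "<table><tr><th>User</th><th>Password</th></tr>"
    "<tr><td>alice</td><td>example</td></tr></table>"
)
print(find_sensitive_headers(page))  # prints ['password']
```

Run on a regular schedule against the pages your web servers actually publish, a check along these lines could catch a leaked table before a search engine indexes it. It only matches keywords in header cells, so treat it as a starting point rather than a complete control.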
This post originally appeared on Imperva’s Data Security Blog.