How to Stop Data Leakage in a Google World

In the past, information leakage conjured images of securing data from physical theft (remember the allegedly stolen FBI laptop?), but thanks to the web, organizations must now also secure information from the ever-growing “search giants.” In short, data leakage has taken on a whole new meaning in the Google age.

The causes of data leakage range from simple misconfiguration or improper classification of data, which can lead web servers to publish private and/or sensitive information, to users storing sensitive data, unwittingly or not, where they shouldn’t. Even common business practices can leak information. Consider site scraping, a well-known method for gathering competitive intelligence in which company A automatically reads company B’s website for available data, such as price tables, and uses that information to undercut company B’s prices and stay on top.
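
To make the mechanics concrete, here is a minimal sketch of such a scraper in Python. The URL and page markup are hypothetical placeholders, and the third-party requests and BeautifulSoup libraries are assumed to be installed; this is an illustration, not a recommendation to scrape anyone’s site.

```python
# A minimal illustration of price-table scraping (hypothetical URL and markup).
import requests
from bs4 import BeautifulSoup


def scrape_price_table(url: str) -> list[dict]:
    """Fetch a page and extract the rows of its first HTML table."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    table = soup.find("table")
    if table is None:
        return []

    # Assumes a simple table: one header row of <th> cells, data rows of <td> cells.
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(dict(zip(headers, cells)))
    return rows


# Example (hypothetical competitor page):
# prices = scrape_price_table("https://competitor.example.com/pricing")
```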

Search Engines

By their very nature, search engines are the Internet’s biggest and most public indexers. They analyze websites, indexing them for the benefit of everyone who has ever run an Internet search. One urban legend even states that Google keeps a complete copy of the entire public Internet on file for data mining and analysis purposes.

As consumers and users of the Internet, we like to believe that most organizations do their best to remove sensitive information from their websites, FTP sites and other front-facing business applications. As it turns out, however, this is not always the case.

Google Table Search

Google has always been a pioneer of search algorithms, search visibility and advanced indexing, remaining one step ahead of other search engines and introducing new ways to index images by context and even FTP sites by content. The recently added Table Search capability, however, really brings to the surface the impact of data leakage in an indexed world.

This is not to say that Google is causing data leakage, but capabilities like indexed FTP, “Search by image” and now “Table Search” offer new ways to discover and extract data that would otherwise have remained undiscovered.

Here’s one scary example involving passwords publicly available on the Internet (you can imagine other interesting tables: PII, salaries, credit card numbers, …).

We used: http://research.google.com/tables?hl=en&q=username+password

[Screenshot: Table Search results for the query, showing an indexed table of usernames and passwords]

which is a structured representation of:

[Screenshot: the public web page that published the original table]
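
For readers who want to experiment, the same query can be issued programmatically. Below is a minimal sketch using only the Python standard library; note that Table Search was a Google Research service, so the endpoint may no longer be live.

```python
# Issue the example Table Search query programmatically (endpoint may be offline today).
import urllib.parse
import urllib.request

BASE_URL = "http://research.google.com/tables"


def table_search(query: str) -> str:
    """Fetch the raw HTML of a Table Search results page."""
    params = urllib.parse.urlencode({"hl": "en", "q": query})
    with urllib.request.urlopen(f"{BASE_URL}?{params}", timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


# html = table_search("username password")  # returns the raw results page
```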

What is the Security Takeaway?

The takeaway here is that web security is more important now than ever before. Obviously, it doesn’t make sense to block Google from indexing your site entirely (indexing is a business driver), but you should be aware of what content you are allowing access to, and who is accessing it.
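
One low-cost first step is auditing what your own robots.txt tells crawlers. The sketch below uses Python’s standard urllib.robotparser; the site and paths are placeholders to adapt to your own environment.

```python
# Audit which (placeholder) sensitive paths robots.txt leaves crawlable by Googlebot.
from urllib.robotparser import RobotFileParser

SITE = "https://www.example.com"  # placeholder: your own site
SENSITIVE_PATHS = ["/internal/", "/backups/", "/reports/salaries.html"]  # placeholders

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()

for path in SENSITIVE_PATHS:
    if parser.can_fetch("Googlebot", f"{SITE}{path}"):
        print(f"WARNING: {path} is crawlable by Googlebot")
```

Keep in mind that robots.txt is advisory only: disallowing a path keeps well-behaved crawlers out, but it does not protect the content itself. Sensitive data still needs authentication or outright removal.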

Companies should:

  1. Implement web application security to mitigate the risk of hackers.
  2. Validate, on a regular basis, the content that is accessible via their web servers, and/or implement policies to check for outgoing data (a scanning sketch follows this list).
  3. Implement policies to mitigate bots that may scrape their website for available content (a rate-limiting sketch follows as well).
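
As a rough illustration of item 2, the following sketch fetches a list of your own public pages and flags content matching sensitive-looking patterns. The URLs and regular expressions are placeholders to tune for your own data.

```python
# Sketch: periodically scan publicly served pages for sensitive-looking content.
import re
import urllib.request

PAGES = [  # placeholders: pages your web server exposes
    "https://www.example.com/",
    "https://www.example.com/reports/",
]

PATTERNS = {  # illustrative patterns only; adapt to your own data
    "password field": re.compile(r"password\s*[:=]", re.IGNORECASE),
    "SSN-like number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card-like number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

for url in PAGES:
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    for label, pattern in PATTERNS.items():
        if pattern.search(body):
            print(f"{url}: possible {label} exposed")
```

And for item 3, one common building block for mitigating scrapers is per-client rate limiting. Here is a minimal in-memory sliding-window sketch; the thresholds are arbitrary, and production setups usually enforce this at the proxy or WAF layer rather than in application code.

```python
# Minimal sliding-window rate limiter keyed by client IP (arbitrary thresholds).
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 120  # placeholder threshold per client per window

_hits: dict = defaultdict(deque)


def allow_request(client_ip: str) -> bool:
    """Return False once a client exceeds MAX_REQUESTS in the sliding window."""
    now = time.monotonic()
    window = _hits[client_ip]
    # Drop timestamps that have aged out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False
    window.append(now)
    return True
```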

The bottom line is that no one really wants to block Google from indexing their website. However, controlling the content that your website serves is important. Organizations should note that once content is up there, the “search giants” will index it, and with ever-evolving search mechanisms it will only become easier to find leaked information.


This post originally appeared on Imperva’s Data Security Blog.