Identity theft is one of the fastest growing crimes worldwide. In the US, for example, it now occurs every 79 seconds and has become the fastest growing crime in the country. Given the technology safety nets available to mitigate the risk of unintended personal data disclosure, the continuing wave of data breaches fuelling identity theft is simply unacceptable. The breadth of the problem is also rapidly expanding: personal data breaches are no longer limited to credit card clearing firms, online banks, brokerage firms and off-shore data clearing firms. This article explores recent unintentional data disclosures at three different organisations that would perhaps not traditionally be considered 'at risk' for personal data disclosure issues.
The Best of Intentions
The best of intentions have resulted in several serious incidents of unintentional personal data leakage involving organisations not traditionally thought of as 'at risk' for personal data breaches. The incidents involve publicly released court documents, Internet web search research data and a URL database in a security tool intended to reduce the risk of phishing.
Case in point: the Federal Energy Regulatory Commission (FERC) released a massive amount of information resulting from its investigation of Enron and the Western Energy Crisis.
The released information included:
- 92 per cent of Enron staff e-mails
- Over 85,000 records and 150,000 scanned pages of information that was provided to the FERC during the investigation
- 40 transcripts related to the case
Anyone with an Internet connection can simply go to the FERC web site and either order copies of the data on CD or follow a link that permits browsing and searching through the data online.
At issue is that the data was never sanitised before it was publicly released. A cursory review of the e-mail data alone (200,000 e-mails) found:
- Searching for the term 'password' returned 3840 hits
- Searching for the term 'username' returned 767 hits
- Searching for specific banking, credit card and brokerage firm names resulted in complete sets of user credentials
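A cursory review of this kind takes very little effort. The following is a minimal sketch, assuming the e-mails are available as plain text strings; the function name and the sample messages are made up for illustration, and it counts messages containing a term, which may differ from how the hit counts above were produced.

```python
from collections import Counter


def count_keyword_hits(messages, terms):
    """Count how many messages mention each term (case-insensitive)."""
    hits = Counter()
    for text in messages:
        lowered = text.lower()
        for term in terms:
            if term.lower() in lowered:
                hits[term] += 1
    return hits


# Made-up sample messages standing in for an e-mail corpus.
sample = [
    "Your new Password is attached.",
    "Lunch at noon?",
    "username: jdoe, password: see me",
]
hits = count_keyword_hits(sample, ["password", "username"])
print(hits["password"], hits["username"])
```

The same loop run before publication, rather than after, would have flagged the messages that needed review.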
The FERC data disclosure issue is further compounded by Trampoline Systems in its well-intentioned effort to showcase the capabilities of its Sonar product using the Enron data from FERC. Trampoline Systems has placed a copy of the 200,000-message Enron e-mail database online, searchable with Sonar.
Simply hitting the explore tab and entering a term in the search dialog gives anyone the ability to quickly search through all 200,000 Enron e-mails. Further, in its efforts to showcase the Sonar product's capabilities, Trampoline Systems has also mapped the e-mails into both themes and social networks. From the Sonar interface at Trampoline Systems:
How to Use It
Click a name in the panel below to open an Enron executive's mailbox. Read any e-mail by clicking the subject. Enron Explorer analyses each person's main contacts and the themes they're talking about. Launch the Visualiser to see each person's social network (Java required). Click a contact in the Visualiser to shift the focus to them and load their mailbox, contacts and themes. Click a theme to access all relevant e-mails.
Clearly the good intentions of both the FERC and Trampoline Systems have gone bad in that neither had considered the exposure of the personal information of perhaps innocent parties that were simply a part of the Enron e-mail system.
AOL's User Search Data Released
AOL provided another example of good intentions gone bad when, in an effort to gain recognition from the academic and research community, it released search data for 685,000 of its users. The data was quickly mirrored across the Internet on multiple servers and was easily downloadable by anyone with an Internet connection.
AOL reacted by removing the data from their web site, apologising for the mistake and later firing their CTO.
AOL had apparently thought that simply removing the user's ID number from the respective search string sanitised the data enough for release. Unfortunately, it never considered that the search data might also contain other personal data on its users. A cursory analysis of the 20,000,000 search queries released by AOL revealed:
- 223 hits for valid social security numbers
- 70 hits for valid credit card numbers
- Complete names, addresses, telephone numbers and even driver's license numbers were also easily found in the released AOL data.
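An analysis along these lines can be sketched in a few lines of code. The example below is illustrative only: the regular expressions, the function names and the sample queries are assumptions, the SSN pattern checks only the common 3-2-4 digit format rather than whether a number was actually issued, and card candidates are kept only if they pass the Luhn checksum used by payment card numbers.

```python
import re

# Assumed formats: SSNs written as 123-45-6789, cards as 13 to 19 digits
# optionally separated by spaces or hyphens.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def scan_queries(queries):
    """Count queries containing SSN-like or Luhn-valid card-like numbers."""
    counts = {"ssn_like": 0, "card_like": 0}
    for query in queries:
        if SSN_RE.search(query):
            counts["ssn_like"] += 1
        if any(luhn_valid(m.group()) for m in CARD_RE.finditer(query)):
            counts["card_like"] += 1
    return counts


# Made-up queries; 4111 1111 1111 1111 is a widely used test card number.
sample = [
    "my card 4111 1111 1111 1111",
    "ssn 123-45-6789 lookup",
    "order 1234 5678 9012 3456",
    "cheap flights",
]
print(scan_queries(sample))
```

The third sample query is deliberately not counted, because its sixteen digits fail the checksum; this kind of validation keeps order numbers and similar digit runs from inflating the count.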
The technology to properly sanitise the search data was readily available to AOL; hence the 'mistaken' disclosure is simply inexcusable. Three AOL users in Northern California have filed a class action lawsuit seeking USD 1,000 in damages per affected user and an additional USD 4,000 for each user residing in California.
Google's Safe Browsing Initiative
Google provides a free product as a toolbar add-on to alert users that a web page they are visiting may be asking for personal or financial information under false pretenses.
While Google's intention to thwart phishing with a free product is noble, the data provided by Google in the form of URL updates in support of that effort itself exposes personal information.
A cursory examination of data updates for Google Safe Browsing reveals little has been done to sanitise the data collected and made publicly available by Google. In fact the Google Safe Browsing data actually contains the personal information of persons that had previously visited the URL of phishing sites while Google was collecting data.
A quick web search for 'goog-black-url' returns a Google Safe Browsing update. Searching within the update data quickly reveals user names and passwords for PayPal accounts, online bank accounts and MySpace accounts, all from victims who had apparently visited a phishing web site and mistakenly entered their user name and password. For example:
+http://mail.mordecainet.net/manual/en/www.paypal.com/cgi-bin/us/cmd/webscr-cmd=_login/formular.php?user=XXXXohnstudinva@yahoo.com&pass=XXXXinianTomlinson
+http://www.agnes.netsons.org/bnakofamerica/update.html?Access_ID=XXXX5345345&Current_Passcode=XXXXdfsdf
+http://www.ebuell.com/gadgets/myspace.asp?up_Username=XXXXdstick@comacst.net&up_Password=XXXX1764&lang= en&country=us&.lang=en&.country=us&synd=ig&mid=56&parent= http://www.google.com&&libs=dsxAwmPdoAA/lib/libcore.js
(The above data has been sanitised.)
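Entries like these could be cleaned before publication with standard URL handling. The sketch below is one possible approach, not a description of how Google processes its lists; the function names and the example URL are hypothetical. It shows two options: dropping the query string entirely, or keeping the parameter names while replacing every value.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_query(url):
    """Drop the query string and fragment, keeping scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def redact_query_values(url, placeholder="REDACTED"):
    """Keep parameter names but replace every value with a placeholder."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode([(name, placeholder) for name, _ in pairs])
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, query, parts.fragment)
    )


# Hypothetical phishing URL carrying a victim's credentials.
url = "http://phish.example/login.php?user=alice@example.com&pass=secret"
print(strip_query(url))
print(redact_query_values(url))
```

Stripping the query is the more conservative choice, at the cost of matching phishing pages only by host and path.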
Another issue has recently been raised regarding the Google Safe Browsing product. When running in enhanced mode, each request to visit a web page sends the entire GET request to Google in the clear (without the use of encryption). More disturbing is that even when you are visiting a web page that uses Secure Sockets Layer (SSL) to encrypt the data, a copy of the unencrypted GET request is sent to Google's server. Hence, if you were submitting a credit card number to an SSL web server in a GET request, the entire request would be sent to Google in the clear. Effectively, anyone on the wire between you and Google would be able to see your credit card number.
Simply put, in its effort to protect users from phishing exploits that may expose personal information, Google's product, when operating in enhanced mode, itself exposes the personal information found within the user's GET requests. That information travels in the clear and is easily intercepted, even when the user is visiting legitimate web sites that use SSL encryption to protect personal information.
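The reason a GET request is so sensitive is that submitted form fields become part of the URL itself. The short sketch below illustrates this with a made-up shop address and a well-known test card number; it does not reproduce the toolbar's behaviour, only the general property of GET versus POST.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical form data; 4111111111111111 is a widely used test card number.
form = {"card_number": "4111111111111111", "expiry": "12/30"}

# With GET, the form fields are appended to the URL as a query string.
get_url = "https://shop.example/pay?" + urlencode(form)

# Anything that receives the full URL can read the fields back out.
leaked = parse_qs(urlsplit(get_url).query)
print(leaked["card_number"])

# With POST, the same fields travel in the request body instead,
# so the URL alone carries no form data.
post_url = "https://shop.example/pay"
post_body = urlencode(form).encode("ascii")
print(urlsplit(post_url).query == "")
```

Any component that forwards complete URLs to a third party therefore forwards GET form data with them, regardless of how the original connection was protected.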
The good intentions of Google to protect users from phishing sites have gone bad in a number of respects:
- The data collected by Google and used with the Google Safe Browsing product is available to anyone with an Internet connection, who can therefore search through it to harvest personal data that Google has perhaps inadvertently collected.
- When operating in enhanced mode, the product transmits the user's GET requests in the clear, even for legitimate web sites. This exposes the user's personal data to anyone along the connection path between the user and Google, even when the user is doing business on a web server that uses SSL to protect that data.
While Google maintains that users are warned that data may be sent in the clear before they are able to enable enhanced mode, there is simply no excuse for Google to make the personal information found within the URL updates available for harvesting by the Internet-connected public through a simple Google search.
Current Tech to Plug Personal Data Leakage
The technology is readily available to mitigate the risk of personal data exposure. We can quickly examine three methodologies that are commonly in use today and how they could have affected the data leakage examples above:
- Digital rights management
- Traditional secure content management
- Adaptive secure content management
Digital Rights Management (DRM)
DRM-based content management is effective only in maintaining control over specified documents and is not effective in securing data in general (Figure 1). Further, DRM provides no safety net for user error in rights assignment. Hence a wayward or disgruntled document owner, or a user with access to an unprotected document, could potentially assign rights to a third party in order to pass along personal information.
Workflow in a typical DRM implementation for content security (Figure 1):
- Author receives a Client Licensor Certificate (CLC) the first time they rights-protect information.
- Author defines a set of usage rights and rules for their file; the application creates a 'publishing license' and encrypts the file.
- Author distributes file.
- Recipient clicks file to open; the application calls the Rights Management Server (RMS), which validates the user and issues a 'use license'.
- Application renders file and enforces rights.
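The workflow above can be modelled in a few lines to show where the control sits and where it can fail. This is a toy sketch under stated assumptions: the class and function names are hypothetical, encryption is omitted entirely, and no real rights management product or API is represented.

```python
from dataclasses import dataclass, field


@dataclass
class PublishingLicense:
    """Hypothetical record of who may do what with a protected file."""
    owner: str
    rights: dict = field(default_factory=dict)  # user -> set of rights


class RightsServer:
    """Toy stand-in for a Rights Management Server (RMS)."""

    def __init__(self):
        self.certified_authors = set()

    def issue_clc(self, author):
        # Step 1: the author is certified the first time they protect a file.
        self.certified_authors.add(author)

    def issue_use_license(self, publishing_license, user):
        # Step 4: validate the user; return their rights, or None if unknown.
        return publishing_license.rights.get(user)


def open_file(server, publishing_license, user, action):
    """Step 5: the application allows only the rights that were granted."""
    granted = server.issue_use_license(publishing_license, user)
    return granted is not None and action in granted


rms = RightsServer()
rms.issue_clc("author@example.com")
lic = PublishingLicense(
    "author@example.com", {"researcher@example.com": {"view"}}
)

print(open_file(rms, lic, "researcher@example.com", "view"))
print(open_file(rms, lic, "stranger@example.com", "view"))

# Nothing stops the owner from granting rights to a third party.
lic.rights["stranger@example.com"] = {"view", "print"}
print(open_file(rms, lic, "stranger@example.com", "print"))
```

The last three lines illustrate the weakness noted earlier: the server enforces whatever rights it is given, so a mistaken or malicious assignment is honoured like any other.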
Applied to the examples given earlier, DRM could potentially have restricted access to the AOL data to only the researchers for whom it was originally intended, but it would not have mitigated the risk of exposure for either FERC or Google, where the data was intended to be made generally available to the public.
Traditional Secure Content Management (SCM)
The security afforded by a traditional SCM implementation is based in part on the administrative development of a data dictionary. In the simplest of terms, the data dictionary contains information such as watermarks and keywords (for example, 'password' and 'user name'), as well as generic templates describing the format of credit card numbers, social security numbers, driver's license numbers and other personal information. All content is then filtered against the data dictionary to enforce compliance.
The action taken by a traditional SCM is typically administratively configurable. It can include blocking an entire document or file that contains administratively prohibited information, or obscuring the prohibited data within a given file or document, as determined by a test against the data dictionary.
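A data dictionary and its two configurable actions can be sketched as follows. The dictionary entries, the patterns and the function names are assumptions made for illustration, and a production dictionary would be far larger, which is exactly the administrative cost of this approach.

```python
import re

# Hypothetical data dictionary: a name for each rule and the pattern it uses.
DATA_DICTIONARY = {
    "credential": re.compile(
        r"\b(?:password|user ?name)\s*[:=]\s*\S+", re.IGNORECASE
    ),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
}


def find_violations(text):
    """Return the names of all dictionary rules that match the text."""
    return [
        name for name, pattern in DATA_DICTIONARY.items()
        if pattern.search(text)
    ]


def apply_policy(text, action="obscure"):
    """Block the whole text (return None) or obscure the matching data."""
    violations = find_violations(text)
    if not violations:
        return text
    if action == "block":
        return None
    for name in violations:
        text = DATA_DICTIONARY[name].sub(f"[{name.upper()} REMOVED]", text)
    return text


message = "password: hunter2 and SSN 123-45-6789"
print(apply_policy(message))
print(apply_policy(message, action="block"))
```

Content that matches no rule passes through unchanged, so the quality of the filtering depends entirely on how complete the dictionary is.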
Current generation SCM offerings also include the ability to decrypt and enforce policy against SSL encrypted communications, which effectively eliminates a blind spot found in traditional SCM solutions.
Traditional SCM could have afforded effective risk mitigation in each of the three examples of data leakage given earlier. However, that security would have come at the cost of a high administrative burden in developing an effective data dictionary.
Adaptive Secure Content Management (ASCM)
ASCM provides the granular filtering capability of a traditional SCM without the administrative burden of creating an extensive data dictionary. While still using traditional pattern-matching content analysis, ASCM also introduces many additional capabilities that further enhance SCM risk mitigation and operational efficiency (Figure 2), including but not limited to:
- Fingerprinting: The fingerprinting engine decomposes a document into a series of algorithm-generated hashes. This collection of hashes is referred to as the document 'fingerprint'. The engine then creates hashes for all data being tested and compares them to the known hashes. Fingerprinting looks for exact replicas of protected documents and can also detect modifications to protected documents.
- Adaptive Lexical Analysis: Documents fed into this engine are examined for lexical structures such as the frequency of words and the position of words with respect to each other. Once the engine is trained on protected documents, it analyses data looking for lexical structures similar to those within the documents or data it was trained on.
- Clustering: The clustering engine is trained on groups of documents or data sets that are similar in nature. Clustering considers the individual words, the counts of those words, the correlations between the words in a document or data set, and the correlation of the documents and data with others within the group. In this way, documents and data are placed in mathematical clusters. The clustering engine scans documents and data to determine whether they are similar to known clusters, which would indicate protected content.
- Advanced Content Filtering: Allows for searching content using 'and' and 'or' expressions so that multiple dictionaries and Boolean expressions can be used in combination. Therefore, advanced content filtering can search for combinations of expressions that when used together could constitute a violation, but used individually would not.
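To make the fingerprinting idea in the list above concrete, here is a minimal sketch. It assumes one common way of building a fingerprint, hashing every run of five consecutive words, which is a choice made for this example and not necessarily what any particular ASCM product does; the function names and sample texts are likewise hypothetical.

```python
import hashlib
import re


def fingerprint(text, k=5):
    """Return the set of SHA-256 hashes of every run of k consecutive words."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        runs = [" ".join(words)] if words else []
    else:
        runs = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    return {hashlib.sha256(r.encode("utf-8")).hexdigest() for r in runs}


def overlap(protected_fp, candidate_fp):
    """Fraction of the protected document's hashes found in the candidate."""
    if not protected_fp:
        return 0.0
    return len(protected_fp & candidate_fp) / len(protected_fp)


protected = (
    "The quarterly payroll file lists every employee salary "
    "and bank account number"
)
modified = protected.replace("salary", "bonus")
unrelated = "Lunch menu for the office party on Friday afternoon"

known = fingerprint(protected)
print(overlap(known, fingerprint(protected)))
print(overlap(known, fingerprint(modified)))
print(overlap(known, fingerprint(unrelated)))
```

An exact replica shares every hash, a lightly edited copy shares some of them, and unrelated text shares none, which is how a single comparison can cover both replicas and modifications.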
ASCM could have afforded effective risk mitigation in each of the three examples of data leakage given earlier without the high administrative burden of traditional SCM offerings in the development of an effective data dictionary.
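The advanced content filtering capability listed above, where expressions are harmless alone but a violation in combination, can also be sketched briefly. The helper names, the word lists and the rule itself are hypothetical and chosen only to show 'and' and 'or' being combined across several small dictionaries.

```python
import re


def contains_any(words):
    """Build a rule that matches if any word or phrase in the list appears."""
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE
    )
    return lambda text: bool(pattern.search(text))


def all_of(*rules):
    """Boolean 'and': every rule must match."""
    return lambda text: all(rule(text) for rule in rules)


def any_of(*rules):
    """Boolean 'or': at least one rule must match."""
    return lambda text: any(rule(text) for rule in rules)


# Three small hypothetical dictionaries.
names = contains_any(["account holder", "customer name"])
identifiers = contains_any(["account number", "routing number"])
secrets = contains_any(["pin", "passcode"])

# A violation needs a name AND (an identifier OR a secret).
violation = all_of(names, any_of(identifiers, secrets))

print(violation("Customer name and account number attached"))
print(violation("Please reset my passcode"))
print(violation("Customer name list for the newsletter"))
```

Only the first message trips the rule; the other two each match a single dictionary, which on its own is not treated as a violation.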
Organisations that perhaps would not typically be considered to be at risk for personal data disclosure are finding themselves inadvertently in the middle of serious data disclosure issues. Even with the best of intentions things can go horribly wrong when technology safety nets are not utilised to support the security of personal data.