Monday, June 11, 2012

Hash Collisions in Oracle: SQL Signature and SQL_ID

Topic: Discussion and implementation of a method for finding hash collisions for sql_id and SQL signature in Oracle.

Introduction:

MD5 hashing is used by Oracle to compute sql_id and SQL signature (see also a previous blog post on the topic). Those hash values are normally used as if they were a unique representation of a given SQL statement. However collisions (2 different statements with the same hash value) can happen or, as it is the case of this post, will happen!

The underlying math is a short discussion of the birthday attack: Let's say you walk into a room an meet 30 people: shake hand with the first person you see and also disclose each other's the birthday. There is about 1 chance out 365 that that particular person has the same birthday as you. If you repeat this process with the rest of the 30 people in the room, chances of a 'positive event' increase but still stay below 10%. However when all the people in the room repeat the procedure of shaking hands and sharing birthday details, probabilities add up quickly (there are N(N-1)/2 different handshakes possible in a room of N people). The non-intuitive result is that it is likely that at least 2 people share a common birthday already in such a small group. See the link for the details.

The main point is that to find collisions by brute force on a hash function of n bits we don't need to generate O(2^n) hash values but rather the square root of that O(2^n/2). In particular this means that the complexity of finding hash collisions for 32-bit and 64-bit hash functions by brute force are down to numbers that are easily computable with current HW. sql_id and signatures are indeed 64-bit hashes, as discussed here, so let's give it a try!

Warm-up, collisions on hash_value:

The hash_value of a a given SQL statement (as shown in V$SQL.hash_value for example) is a 32-bit hash. Collisions can be found using about 100K random hashes. A simple algorithm using Oracle and PL/SQL is:
  • Take a couple of different SQL statements and append to them a randomly-generated string
  • Compute the MD5 hash and take the number of bytes that are needed (4 bytes in this case)
  • Insert the hashes in a table
  • Repeat ~100K times
  • When finished look in the output table for duplicates in the hash value
  • If duplicates are found, those are the hash_value collisions
The code can be found at this link. The execution takes just a few seconds and displays the collisions that have been found (results varies for each execution, normally 2-5 collisions are found for this size of search).

SQL> @hash_birthday_sql hashtab 10 4
SQL> @find_dup hashtab

-- these 2 statements are different but have the same hash_value (=3577689134)
SQL> select owner, index_name from dba_indexes where rownum<=9 --CTUUewnYYPJXFuFyAzcMtXTVNmPtNgAV;
SQL> select owner, table_name from dba_tables where rownum<=10 --ooXUUTNfekKkRZGlCJfiPlxKcoCfpDAh;

Scaling up to 64-bit hash: sql_id collisions

The same algorithm used for hash_value collisions can be used for sql_id, although we need to scale up the calculation: the hash is 8 byte long now and we need ~10 billion random hashes to find a few meaningful collisions. SQL> @hash_birthday_sql hashtab 500000 8 would take about 60 hours to do such calculation (mileage may vary). A faster way is to split the execution in smaller chunks and parallelize  the execution across multiple CPU and possibly RAC nodes. See the link here for an example on 2 RAC nodes and parallelism 10 each.

Results: the following two SQL statements are different but have the same sql_id (ayr58apvbz37z) and hash_value (1992264959).

SQL> select owner, index_name from dba_indexes where rownum<=9 --BaERRzEYqyYphBAvEbIrbYYDKkemLaib;
SQL> select owner, table_name from dba_tables where rownum<=10 --XhiidvehudXqDpCMZokNkScXlQiIUkUq;

--confirm this by running:
SQL> select sql_text from v$sql where sql_id='ayr58apvbz37z';

Another 64 bit hash, SQL signature collisions

A variation from the theme above is to calculate statements that have a SQL signature collision. The signature is used by SQL profiles, baselines and SQL patches. It is calculated in a similar but different way than sql_id, in particular SQL text is normalized before hashing (see this link for additional details). The base script is hash_birthday_signature.sql again parallel computation can be used to speed up the computation.

Results: the following two SQL statements are different but have the same signature (6284201623763853025), note sql_id values are different.

SQL> SELECT TABLE_NAME FROM USER_TABLES --JO7HWQAPNFYEALIOA3N5ODTR46INWY6N;
SQL> SELECT INDEX_NAME FROM USER_INDEXES --PAOZY8VWGDHDS9NICVREGTR22J6C24DQ;

--confirm this by running:
select sql_text from v$sql where force_matching_signature=6284201623763853025;

How does Oracle cope with collisions

In a few tests I have seen dbms_xplan throwing ORA-01422 when displaying plans for sql_id with collisions. There may be other packages/tools that make the assumption that sql_id values don't have collisions and they may throw errors.
These are a few more problems with SQL signature collisions. In particular a source of potential trouble is that one of the main (and common) underlying tables for  SQL plan management (baselines, profiles, patches) which stores the mappings of sql_text to signature, SYS.SQL$TEXT, has a unique index on signature. However we have seen that it is quite unlikely to have a signature collision just by pure chance (i.e. in the normal usage of the DB).

Conclusions:

This post reports on an exercise to compute collisions for sql_id and SQL signatures of Oracle statements using the Oracle database and PL/SQL. We have seen how some simple math, called birthday attack, can be of help and how an implementation of this in Oracle is quite easy up to hash length of 64 bit. We also report on the findings, notably collisions of sql_id and SQL signature.

Additional comments: a collision on full_hash_value (as reported in v$db_object_cache) could not be calculated with this method as it is a 128 bit hash and the brute force method would require resources out of a 'reasonable scale' . However, since 2005 there are methods known to find collisions for MD5 hashes on the full 128 bits that use weaknesses in the algorithm (that's why MD5 should not be used for security purposes any more). Those methods could be used to find full_hash_value collision.

PS: SQL and scripts mentioned in this article can be found following this link

No comments:

Post a Comment