Designing Tables with Integer Keys

While the “natural” key for a table is often a relatively stable string, better database designs leave the natural string keys as table columns/attributes and use integer “tag” values not containing any real world data for primary keys and foreign key relationships.   These integer keys values offer better maintainability and performance.  When using integer key tags, a unique key constraint and index should always be created on the natural primary key.

First “relatively stable” often isn’t as stable as believed during application/database design, particularly if business operations decides to implement some type of “coding” into the value like an leading “R” for residential customers.  During the next business reorganization, company merger, software upgrade or maybe when your company purchases a portfolio of customers, you may need to update that customer number or on-board additional customers from another system.  When merging data from two or more systems there are often “collisions” where the same customer number is in use in both systems or the number needs to change to meet the “encoding” standard business operations favors.  If the customer number used in customer table primary key changes then all the relationships into other tables like order, address, etc will be “broken” and require updating to re-establish the relationship – maybe requiring thousands or millions of updates in the related tables and their indexes.  If the primary key for the customer table is a integer “tag” and the character based value is an column/attribute, then the actual customer number only needs to be updated once in the customer table; none of the relationships to other tables have been broken and will not need repairs.

The integer tag key also provides better data integrity.  The preceding paragraph demonstrated the orphan record problem caused by updating the natural key in the parent table.  When data is duplicated in multiple tables, there are lots of variations where integrity is loss.  The update that fails on one table while updating the other nine or the developer missing one table requiring the update during the last application upgrade are just a couple that come to mind.  The best practice is to only store “key” information once and use non-changing integer tags to reference it where needed with appropriate unique and foreign key constraints.

The tag key also protects sensitive information.  It’s not uncommon for business tables containing information about people to use  a social security number for the identifying key.  When the SSN is an attribute in a single table, it is fairly easy to protect with role based security, data masking or VPDB.  If the SSN is a key value scattered through several related tables, then it needs to be protected in each table.  Keeping a value “hidden” while making it available for follow-up or drill-down queries is also a challenge that can be avoided by using tag keys.

Second, comparing integer values is faster than comparing character values.  While the capacity and coding available today enables computers to quickly search and compare all kinds of data, comparison operations ultimately reduce all that data to a numeric value.  Searches on large data sets require hundreds to thousands of comparisons to find or filter down to a particular data set.  Integer comparisons may be a single operation while character string comparisons usually require at least one operation per character used.

Third, using the integer key reduces the size of table rows in related and child tables.  Each row in the order table contains the integer customer id instead of the longer customer number string.  An order_item table would use the shorter integer order id instead of the longer string order number.  The deeper the design encompasses nested one to many relationships, the greater the size reduction.

Fourth, child tables and their integer indexes are smaller with each table or index block containing more records.  Even if you are not concerned with your disk storage, consider the blocks loaded into the database cache cover a greater number of table/index entries and are more likely to be “hit” during query operations.  Active blocks remain the cache longer too so the table/index block may very well still be sitting in memory for the next query.   Smaller databases also need fewer backup resources and can be cloned or restored quicker.

NOTE: The vsize and dump functions can be used to determine storage size and see the storage values byte by byte.

create table customer (cust_nbr varchar2(30), cust_id number(38));
begin
  for i in 1..10 loop
    insert into customer values ( to_char(power(10,i)), power(10,i) );
  end loop;
end;
/
column cust_nbr format a12
column cust_id format 999999999999
column cust_nbr_dmp format a48
column cust_id_dmp format a20
select cust_nbr, vsize(cust_nbr), dump(cust_nbr) as cust_nbr_dmp, 
       cust_id,  vsize(cust_id),  dump(cust_id)  as cust_id_dmp
  from customer;

NOTE: Use the integer data type for keys of less than 2 billion values or number(38) for hundreds of quadrillions (999,999,999,999,999,999).  Changing number(16) to number(38) doesn’t require any changes to current stored values it just increases the size future values may use.

 

Advertisements