Null values and their pitfalls

Learning: Best to not permit NULL values in columns which are used for comparisons

If you compare a NULL value with any other character the output will always be false which sometimes can lead to confusion since it is not the desired behavior.

SELECT CASE WHEN NULL = 'test' THEN 1 ELSE 0 END
-- result: 0
SELECT CASE WHEN NULL != 'test' THEN 1 ELSE 0 END
-- result: still 0

The case

I am currently involved in developing an open source datawarehouse stack. Within the ETL process we rather want to load only new data into our warehouse. To detect changes we can calculate a hash over a concatinated string of all columns and save it in a separate column. If the hash changes then we know that some entry changed as well and the new row gets inserted (very simplified).

old entries
ColAColBConcatinationmd5()
helloworldhello_world99b1ff8f11781541f7f89f9bd41c4a17
bingobongobingo_bongod0946da002ae63f07dc9e1c4489b2c91
new entries
ColAColBConcatinationmd5()
helloworldhello_world99b1ff8f11781541f7f89f9bd41c4a17
bingobongibingo_bongia303447dc3952485141a4acb6f096f45
The hash of the second row changed because bongo turned into bongi

Note that we at first concatinated the strings using postgres’ || such as

SELECT md5(ColA||'_'|| ColB) FROM table

The pitfall here is, that a concatination with a NULL value results in a NULL value. This leads to a value of NULL in our hash column. This hash column in return is used to detect changes in a way such as:

INSERT INTO table
SELECT * FROM table
WHERE old.hash != new.hash -- if this statement is true, meaning the hashes differ from each other, changes occured in that row
AND old.pk = new.pk

What it does is that it will only insert rows, which have been changed based on the hash. However as mentioned above, a comparison with a NULL value always yields a false result.
This means:

  • If the old hash is NULL and our new hash is a legit string, the row does not get updated although changes occured
  • A string concatination in such a way as above will result into NULL if any column is NULL. Given 10 columns with the value of NULL. If at any point in time 8 of them would receive entries, the resulting hash would still be NULL therefore detecting no changes even if we would wrap the hash with a COALESCE() function.

Given our insert logic this particular entry will never see an update value again.

Solution

  • COALESCE(): By using COALESCE() or ISNULL() (t-sql) it is possible to replace the NULL value with a avlue which can be used to compare to each other, however as above mentioned,that might not work in all cases for example, when both hashes are NULL.
  • IS DISTINCT FROM: This operator does exactly what we need – it compares two values and can discern NULL values from any character strings giving us a result of true if that’s the case. Note that this operator is not available on all DBMS.
  • CONCAT(): All of this would not happen in the first place, if we would concatinate our strings by using the function CONCAT(). If a NULL value occurs it would replace it with an empty string, therefore preventing NULL values in our updatehash.

Note: Sometimes you might want a result returning NULL if any of its component have a NULL value, e.g. in the case of a compound key. Add the condition of NOT NULL to the key’s column and we have somewhat of a integrity check. Now the components of a compound key should not be NULL but if the case does exist, the concatination of the key results into a NULL value, which then throws an error in the process of insertion into the NOT NULL column which should be the expected behaviour.

Leave a Comment

Your email address will not be published.