Learning: Best to not permit NULL values in columns which are used for comparisons
If you compare a NULL value with any other character the output will always be false which sometimes can lead to confusion since it is not the desired behavior.
SELECT CASE WHEN NULL = 'test' THEN 1 ELSE 0 END -- result: 0 SELECT CASE WHEN NULL != 'test' THEN 1 ELSE 0 END -- result: still 0
I am currently involved in developing an open source datawarehouse stack. Within the ETL process we rather want to load only new data into our warehouse. To detect changes we can calculate a hash over a concatinated string of all columns and save it in a separate column. If the hash changes then we know that some entry changed as well and the new row gets inserted (very simplified).
Note that we at first concatinated the strings using postgres’ || such as
SELECT md5(ColA||'_'|| ColB) FROM table
The pitfall here is, that a concatination with a NULL value results in a NULL value. This leads to a value of NULL in our hash column. This hash column in return is used to detect changes in a way such as:
INSERT INTO table SELECT * FROM table WHERE old.hash != new.hash -- if this statement is true, meaning the hashes differ from each other, changes occured in that row AND old.pk = new.pk
What it does is that it will only insert rows, which have been changed based on the hash. However as mentioned above, a comparison with a NULL value always yields a false result.
- If the old hash is NULL and our new hash is a legit string, the row does not get updated although changes occured
- A string concatination in such a way as above will result into NULL if any column is NULL. Given 10 columns with the value of NULL. If at any point in time 8 of them would receive entries, the resulting hash would still be NULL therefore detecting no changes even if we would wrap the hash with a COALESCE() function.
Given our insert logic this particular entry will never see an update value again.
- COALESCE(): By using COALESCE() or ISNULL() (t-sql) it is possible to replace the NULL value with a avlue which can be used to compare to each other, however as above mentioned,that might not work in all cases for example, when both hashes are NULL.
- IS DISTINCT FROM: This operator does exactly what we need – it compares two values and can discern NULL values from any character strings giving us a result of true if that’s the case. Note that this operator is not available on all DBMS.
- CONCAT(): All of this would not happen in the first place, if we would concatinate our strings by using the function CONCAT(). If a NULL value occurs it would replace it with an empty string, therefore preventing NULL values in our updatehash.
Note: Sometimes you might want a result returning NULL if any of its component have a NULL value, e.g. in the case of a compound key. Add the condition of NOT NULL to the key’s column and we have somewhat of a integrity check. Now the components of a compound key should not be NULL but if the case does exist, the concatination of the key results into a NULL value, which then throws an error in the process of insertion into the NOT NULL column which should be the expected behaviour.