How Data is Standardized

Attribute data values are standardized (normalized) both to remove invalid values and to ensure a consistent format for matching algorithm decisions. The Universal Identity platform standardizes incoming data values from both web service requests and batch files, and persists them in their standardized form.

When those values are retrieved later (by retrieving the identity to which the attribute values belong), the values returned are the standardized values rather than the original values. Standardization (or normalization) can vary depending on the nature of each data element.

Data Standardization

Attribute or attribute cluster Standardization performed by Universal Identity
All

For all attributes, extended ASCII characters with a logical ASCII equivalent are converted to their ASCII equivalent. For example, the extended ASCII character Ñ is converted to the ASCII character N.

For all attributes, alphabetic characters are converted to uppercase by default. A subset of the Extended Attribute fields can optionally be kept in their original case.
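
The two rules for all attributes can be sketched as follows. This is a minimal illustration, not the platform's actual implementation: the function name `fold_and_uppercase` is hypothetical, and the sketch assumes that "logical ASCII equivalent" can be approximated by Unicode decomposition, which only covers accented Latin letters.

```python
import unicodedata


def fold_and_uppercase(value: str) -> str:
    """Strip accents from Latin letters, then uppercase (illustrative only)."""
    # NFD splits an accented letter into a base letter plus combining marks,
    # e.g. "Ñ" becomes "N" followed by a combining tilde.
    decomposed = unicodedata.normalize("NFD", value)
    # Drop the combining marks and keep everything else, including
    # characters that have no ASCII equivalent.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).upper()


print(fold_and_uppercase("Ñ"))
print(fold_and_uppercase("José Muñoz"))
```

Characters that do not decompose into a base letter plus a mark are left as they are in this sketch; how the platform treats them is not covered here.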
Name
  • Numeric characters are not removed from name attribute values (though this can be changed with a configuration setting by contacting Verato support).
  • The position of name strings is not changed during normalization, but position is factored into matching decisions. For example, one patient record might have the name strings First Name=“JOHN PAUL” and Last Name=“SMITH”, while a second patient record might have the name strings First Name=“JOHN”, Middle Name=“PAUL”, and Last Name=“SMITH”. In this case, the positions of the name strings are not changed or standardized, but the matching algorithm is designed to recognize that there is still a match between the names JOHN-JOHN and PAUL-PAUL.
  • Multi-byte characters, if present, are NOT removed from name strings for many non-Latin character sets, such as Chinese or Arabic. However, this preservation does not cover the entire set of UTF-8 multi-byte characters, so some multi-byte characters can still be removed during standardization.
Birth Date
  • Birth date values are standardized to ISO format of YYYYMMDD.
  • Input birth date values are allowed with or without hyphen or slash characters. Input birth date values are also allowed in MMDDYYYY format, but they will be converted to YYYYMMDD format.
  • Input birth date values in the format of MMM DD, YYYY (e.g., Jan 1, 2024) are also allowed.
  • Input birth date values with only a 2-digit year (YYMMDD or MMDDYY) are converted to YYYYMMDD format only if the conversion is clear and unambiguous. For example, an input date value of 030405 is ambiguous – it could represent March 4, 2005, or April 5, 2003, so it is rejected in standardization.
  • Invalid dates and future dates are also rejected in standardization.
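
The birth date rules above can be sketched as follows. This is a hedged illustration, not the platform's code: the function and helper names are hypothetical, the two-digit-year pivot is an assumption, and treating an 8-digit value that is valid under both layouts as ambiguous is also an assumption. The `today` parameter exists only to make the example deterministic.

```python
import re
from datetime import date, datetime


def _try_date(year, month, day):
    """Return a date, or None when the parts are not a real calendar date."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


def _expand_year(yy, today):
    # Assumption: read YY as 20YY unless that year is still in the future.
    return 2000 + yy if 2000 + yy <= today.year else 1900 + yy


def standardize_birth_date(raw, today=None):
    """Return YYYYMMDD, or None when the value is rejected."""
    today = today or date.today()
    text = raw.strip()
    candidates = set()

    # Textual month form such as "Jan 1, 2024" (%b depends on the locale).
    try:
        candidates.add(datetime.strptime(text, "%b %d, %Y").date())
    except ValueError:
        pass

    digits = re.sub(r"[-/]", "", text)  # hyphens and slashes are optional
    if re.fullmatch(r"[0-9]{8}", digits):
        layouts = [(digits[0:4], digits[4:6], digits[6:8]),  # YYYYMMDD
                   (digits[4:8], digits[0:2], digits[2:4])]  # MMDDYYYY
        for y, m, d in layouts:
            candidates.add(_try_date(int(y), int(m), int(d)))
    elif re.fullmatch(r"[0-9]{6}", digits):
        layouts = [(digits[0:2], digits[2:4], digits[4:6]),  # YYMMDD
                   (digits[4:6], digits[0:2], digits[2:4])]  # MMDDYY
        for yy, m, d in layouts:
            year = _expand_year(int(yy), today)
            candidates.add(_try_date(year, int(m), int(d)))

    # Reject invalid dates, future dates, and ambiguous readings.
    valid = {c for c in candidates if c is not None and c <= today}
    if len(valid) != 1:
        return None
    only = valid.pop()
    return f"{only.year:04d}{only.month:02d}{only.day:02d}"
```

Collecting every valid reading and accepting the value only when exactly one remains is one simple way to express "clear and unambiguous"; with this approach 030405 yields two readings and is rejected.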
SSN

Alphabetic characters and punctuation symbols are removed. The last four digits are kept. 

Address

Address components are standardized in several ways consistent with US Postal Service conventions. Examples include:

  • Street direction values presented as a full text string are converted to their 1- or 2-character abbreviation (N, S, W, E, NW, NE, SW, SE)
  • Street type values presented as a full text string are converted to their USPS short format (Street becomes ST, Avenue becomes AVE, and so on)
  • Ordinal street names are converted to their numeric format (Second Street becomes 2 ND ST)
  • Unit, suite, or apartment information is parsed into address line 2, even if it is presented as part of address line 1
  • State names presented as a full text string are converted to their 2-character abbreviation (California becomes CA)
  • Multi-byte characters, if present, are NOT removed from city or state attributes, but they are removed from address line 1/address line 2 attributes for many non-Latin character sets, such as Chinese or Arabic. However, this behavior does not cover the entire set of UTF-8 multi-byte characters, so some multi-byte characters can still be removed during standardization.
  • Fully spelled-out country names that are valid are converted to a 3-character code (Germany becomes DEU, Canada becomes CAN, and so on). Any invalid country names will simply be uppercased. 
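
A few of the address conversions above can be sketched with lookup tables. This is only an illustration: the function names are hypothetical, the tables hold just the values mentioned in the text rather than the full USPS or country lists, and returning an unrecognized state name uppercased is an assumption. Ordinal conversion and unit parsing are left out.

```python
# Tiny illustrative tables; real lists are far longer.
DIRECTIONS = {
    "NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
    "NORTHWEST": "NW", "NORTHEAST": "NE",
    "SOUTHWEST": "SW", "SOUTHEAST": "SE",
}
STREET_TYPES = {"STREET": "ST", "AVENUE": "AVE"}
STATES = {"CALIFORNIA": "CA"}
COUNTRIES = {"GERMANY": "DEU", "CANADA": "CAN"}


def _normalize(value):
    """Uppercase and collapse repeated whitespace."""
    return " ".join(value.upper().split())


def standardize_street_line(line):
    """Abbreviate direction and street type words, token by token."""
    tokens = _normalize(line).split()
    # Naive: a street that is itself named "North" would also be shortened.
    return " ".join(DIRECTIONS.get(t, STREET_TYPES.get(t, t)) for t in tokens)


def standardize_state(state):
    key = _normalize(state)
    return STATES.get(key, key)  # assumption: unknown names pass through


def standardize_country(country):
    key = _normalize(country)
    return COUNTRIES.get(key, key)  # invalid names are simply uppercased


print(standardize_street_line("123 North Main Street"))
print(standardize_state("California"))
print(standardize_country("Germany"))
```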
Gender

Gender values are standardized to the single-character gender codes M, F, U, O, T, A, X, and N when the full string is provided as input. For example, an input gender of FEMALE is standardized to F. An input gender value that does not correspond to M, F, U, O, T, A, X, or N is not standardized; it is ignored and nothing is stored.

M = Male, F = Female, U = Unknown, O = Other, T = Transgender, A = Ambiguous, N = Not Applicable, X = Non Binary.
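
The gender mapping can be sketched as a simple lookup. The function name is hypothetical, and the exact spellings accepted as a "full string" (for example, hyphenated or mixed-case input) are assumptions made for illustration.

```python
GENDER_CODES = {
    "MALE": "M", "FEMALE": "F", "UNKNOWN": "U", "OTHER": "O",
    "TRANSGENDER": "T", "AMBIGUOUS": "A",
    "NON BINARY": "X", "NOT APPLICABLE": "N",
}
VALID_CODES = set(GENDER_CODES.values())


def standardize_gender(value):
    """Return a single-character code, or None when the value is ignored."""
    # Assumption: case, hyphens, and extra spaces do not matter.
    key = " ".join(value.upper().replace("-", " ").split())
    if key in VALID_CODES:
        return key  # already a single-character code
    return GENDER_CODES.get(key)  # None means nothing is stored


print(standardize_gender("FEMALE"))
```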

Phone
  • Alphabetic characters and punctuation symbols are removed.
  • A 10-digit phone number can be submitted either in its entirety (e.g., 1112223333) or broken down into an area code and the remaining digits (e.g., 111 and 2223333).
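
A minimal sketch of the phone rules, assuming a hypothetical `standardize_phone` helper with an optional `area_code` argument. Keeping only the digits 0-9 (which also drops spaces) is an assumption about how "alphabetic characters and punctuation symbols are removed" is applied.

```python
import re


def standardize_phone(number, area_code=""):
    """Join the optional area code and number, keeping digits only."""
    return re.sub(r"[^0-9]", "", area_code + number)


print(standardize_phone("(111) 222-3333"))
print(standardize_phone("2223333", area_code="111"))
```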
Email
  • The prefix “mailto:” is removed if present.
  • “.com” is appended to common consumer email domains (gmail, yahoo, hotmail) when it is missing.
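
The two email rules can be sketched as follows. The function name is hypothetical, the domain set contains only the three examples named above, and case conversion is left out of this sketch.

```python
# Only the domains named in the text; the real list may be longer.
CONSUMER_DOMAINS = {"gmail", "yahoo", "hotmail"}


def standardize_email(value):
    """Strip a leading "mailto:" and complete bare consumer domains."""
    email = value.strip()
    if email[:7].lower() == "mailto:":
        email = email[7:]
    # Split on the last "@"; sep is empty when there is no "@" at all.
    local, sep, domain = email.rpartition("@")
    if sep and domain.lower() in CONSUMER_DOMAINS:
        email = f"{local}@{domain}.com"
    return email


print(standardize_email("mailto:jane@gmail"))
```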