Cleaning up text in Excel: CLEAN, TRIM, and SUBSTITUTE

In Excel, you will occasionally deal with text fields containing line breaks, multiple spaces, hyphenation, currency and letterlike symbols, etc. This is common when dealing with unrestricted user-inputted data and can cause issues.

In particular, these kinds of values will be hard to connect with related values without cleanup. When we MATCH those values with related data tables, errors are bound to occur.

 

Basic cleanup approaches include manual replacement of all of the unnecessary characters with a desired or no character using the Find and Replace feature and/or the Text to Columns wizard in some cases.

The CLEAN function, the TRIM function, and the SUBSTITUTE function can be used to perform cleanup as well. Those functions are also often used in combination.

There are no simple or straightforward solutions to these issues, and the process often can’t be automated. All of the Excel features and functions (which we will cover here) come with their own limitations.

 

The CLEAN function removes nonprintable characters and formatting from text.

The CLEAN function syntax is as follows:

= CLEAN ( text )

Text can be any text string, located in a cell or generated by another function. Nonprintable or control characters which can be removed are ASCII characters, as the CLEAN function as of yet does not support the removal of nonprintable Unicode characters. A typical example of a character that can be removed is the line break.

 

In the following example, we are applying the CLEAN function on several different text strings with varying degrees of success:

CLEAN function

If no characters can be removed, the original text string is returned.

If there is formatting, it is removed.

If our font is Webdings, the font is reset to default.

If our text contains line breaks, all line breaks are removed.

If our text contains ASCII nonprintable characters, those are removed.

If our text contains nonprintable characters supported in Unicode but not ASCII, in this case, character that has a decimal value of 127 in Excel, those are not removed.

The underscore [_] is not removed.

The hyphen [‐] is not removed.

 

The TRIM function removes extra spaces from text, leaving only single spaces between words.

The TRIM function syntax is as follows:

= TRIM ( text )

Text can be any text string, located in a cell or generated by another function. Repeating spaces which can be removed are ASCII spaces, as the TRIM function as of yet does not support the removal of all Unicode supported spaces. Space occurring after the last character will also be removed.

 

In the following example we are applying the TRIM function on several different text strings with varying degrees of success:

TRIM function
If no extra spaces exist, the original text string is returned.

If there are multiple spaces, those are reduced to one space.

If our text contains multiple spaces supported in Unicode but not ASCII (characters that have a decimal value of 160 in Excel, commonly used in Web pages as the HTML entity,  ), those are not removed.

 

The SUBSTITUTE function replaces existing text with new text in a text string.

The SUBSTITUTE function syntax is as follows:

= SUBSTITUTE ( text; old_text; new_text; [instance_num] )

Text can be any text string, located in a cell or generated by another function.

Old text is a (sequence of) characters to be replaced.

New text is a (sequence of) replacement characters. If we just want to remove one or more characters, this part of the argument can be omitted. Alternatively, with “” we can specify that there is no replacement.

An instance number specifies which occurrence one or more characters is to be replaced. If specified, only the specified occurrence will be removed. If we just want to remove all occurrences of one or more characters, this part of the argument can be omitted.

The SUBSTITUTE function is case-sensitive.

 

In the following example, we are applying the SUBSTITUTE function on several different text strings with varying results:

SUBSTITUTE function
As shown, we can SUBSTITUTE several spaces with nothing or with a single space. We can SUBSTITUTE underscores or other characters with nothing or with spaces, and vice versa.

We can SUBSTITUTE multiple sequences of characters with other sequences of characters in the same formula. We can accomplish this by first substituting one set of characters, and then nesting the resulting text string into the next SUBSTITUTE function.

We can also SUBSTITUTE characters by nesting the UNICHAR functions into the SUBSTITUTE function. In this particular example, two successful ways of removing unnecessary Unicode spaces (present in text strings in cells A10, A11, and A12) are shown.

 

In the following example, we are tasked with cleaning up several IBANs. Account numbers are formatted in multiple ways, and text strings contain spaces, spaces in Unicode format, and hyphens:

SUBSTITUTE function IBAN

Our approach here is basically reduction to common denominators. We are first removing all of the extra formatting characters and leaving only the actual IBAN figures (column B).

Coincidentally, this is also how IBAN will typically be formatted in most databases, i.e., it will not contain unnecessary characters. We can link mail addresses related to those IBANs with the help of the INDEX MATCH combination (column D).

If we also need our account numbers formatted in a certain way, we can generate such text strings from our reduced text strings (column C).

 

Note:
All of these functions are Text functions. They always return text. If we, for example, use the CLEAN function on a numerical value, the result will be a number formatted as text.

 

Dig deeper:

Text in Excel

Lookup with unique identifiers

How to MATCH numbers formatted as text in Excel

UNICODE and UNICHAR functions

Line breaks in Excel

4 thoughts on “Cleaning up text in Excel: CLEAN, TRIM, and SUBSTITUTE”

Leave a Reply