sanitize
sanitize, function
def pure sanitize(source: text): text
Returns a copy of source with every supported invisible character removed.
Example
t = "A\u00A0B\tC"
rawLen = strlen(t)
clean = sanitize(t)
cleanLen = strlen(clean)
show summary "Sanitize" with
  escape(t) as "Raw"
  rawLen as "RawLen"
  escape(clean) as "Clean"
  cleanLen as "CleanLen"
This outputs the following list:
| Label | Value |
|---|---|
| Raw | “A\u00A0B\tC” |
| RawLen | 5 |
| Clean | “ABC” |
| CleanLen | 3 |
Remarks
Use escape to reveal invisible characters before cleaning. To substitute
characters instead of removing them, apply replace before sanitize: once
sanitize has run, the invisible characters are gone and can no longer be matched.
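As a minimal sketch of that combination, the script below swaps the no-break space for a regular space before cleaning, so the word boundary survives. It assumes that replace takes its arguments in the order (source, pattern, replacement); the variable names are illustrative only.

```envision
t = "A\u00A0B\tC"
// Assumption: replace(source, pattern, replacement).
// Turn the no-break space into a regular space (U+0020), which sanitize keeps.
spaced = replace(t, "\u00A0", " ")
// The tab is still present in 'spaced' and is removed here.
clean = sanitize(spaced)
show summary "Substitute then sanitize" with
  escape(t) as "Raw"
  escape(clean) as "Clean"
  strlen(clean) as "CleanLen"
```

Given the list of removed characters on this page, Clean should read "A BC" with a length of 4, since only the tab is stripped.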
Supported invisible characters
The sanitize function removes the following characters:
| Unicode | Unicode character name |
|---|---|
| U+0009 | CHARACTER TABULATION |
| U+00A0 | NO-BREAK SPACE |
| U+00AD | SOFT HYPHEN |
| U+034F | COMBINING GRAPHEME JOINER |
| U+061C | ARABIC LETTER MARK |
| U+115F | HANGUL CHOSEONG FILLER |
| U+1160 | HANGUL JUNGSEONG FILLER |
| U+17B4 | KHMER VOWEL INHERENT AQ |
| U+17B5 | KHMER VOWEL INHERENT AA |
| U+180E | MONGOLIAN VOWEL SEPARATOR |
| U+2000 | EN QUAD |
| U+2001 | EM QUAD |
| U+2002 | EN SPACE |
| U+2003 | EM SPACE |
| U+2004 | THREE-PER-EM SPACE |
| U+2005 | FOUR-PER-EM SPACE |
| U+2006 | SIX-PER-EM SPACE |
| U+2007 | FIGURE SPACE |
| U+2008 | PUNCTUATION SPACE |
| U+2009 | THIN SPACE |
| U+200A | HAIR SPACE |
| U+200B | ZERO WIDTH SPACE |
| U+200C | ZERO WIDTH NON-JOINER |
| U+200D | ZERO WIDTH JOINER |
| U+200E | LEFT-TO-RIGHT MARK |
| U+200F | RIGHT-TO-LEFT MARK |
| U+202F | NARROW NO-BREAK SPACE |
| U+205F | MEDIUM MATHEMATICAL SPACE |
| U+2060 | WORD JOINER |
| U+2061 | FUNCTION APPLICATION |
| U+2062 | INVISIBLE TIMES |
| U+2063 | INVISIBLE SEPARATOR |
| U+2064 | INVISIBLE PLUS |
| U+206A | INHIBIT SYMMETRIC SWAPPING |
| U+206B | ACTIVATE SYMMETRIC SWAPPING |
| U+206C | INHIBIT ARABIC FORM SHAPING |
| U+206D | ACTIVATE ARABIC FORM SHAPING |
| U+206F | NOMINAL DIGIT SHAPES |
| U+2800 | BRAILLE PATTERN BLANK |
| U+3000 | IDEOGRAPHIC SPACE |
| U+3164 | HANGUL FILLER |
| U+FEFF | ZERO WIDTH NO-BREAK SPACE |
| U+FFA0 | HALFWIDTH HANGUL FILLER |
Valid source text
The function accepts text from the following Unicode ranges:
| Character range | Unicode block |
|---|---|
| U+0020 - U+007F | Basic Latin (without C0 control codes) |
| U+0080 - U+009F | C1 control codes |
| U+00A0 - U+00FF | Latin-1 Supplement |
| U+0100 - U+017F | Latin Extended-A |
| U+0180 - U+024F | Latin Extended-B |
| U+0250 - U+02AF | IPA Extensions |
| U+02B0 - U+02FF | Spacing Modifier Letters |
| U+0300 - U+036F | Combining Diacritical Marks |
| U+0370 - U+03FF | Greek/Coptic |
| U+0400 - U+04FF | Cyrillic |
| U+2010 - U+2027 | General Punctuation |
| U+2030 - U+205E | General Punctuation |
| U+2061 - U+2064 | General Punctuation |
| U+20A0 - U+20C0 | Currency Symbols |
Errors
Characters outside the accepted ranges raise an error like:
sanitize(): "<source>" has invalid character \u2702.
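For illustration, a call along the following lines would be expected to trigger that error. This is a sketch only: the input text is made up, and it assumes the \u2702 escape is accepted in a text literal the same way \u00A0 is in the example at the top of this page.

```envision
// U+2702 (BLACK SCISSORS) lies outside every accepted range listed above,
// so this call is expected to fail instead of returning a cleaned text.
t = "cut\u2702here"
clean = sanitize(t)
show summary "Invalid input" with
  escape(clean) as "Clean"
```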