sanitize

sanitize, function

def pure sanitize(source: text): text

Returns a copy of source with every supported invisible character removed.

Example

t = "A\u00A0B\tC"
rawLen = strlen(t)
clean = sanitize(t)
cleanLen = strlen(clean)

show summary "Sanitize" with
  escape(t) as "Raw"
  rawLen as "RawLen"
  escape(clean) as "Clean"
  cleanLen as "CleanLen"

This outputs the following summary:

Label	Value
Raw	“A\u00A0B\tC”
RawLen	5
Clean	“ABC”
CleanLen	3

Remarks

Use escape to reveal invisible characters before cleaning. You can also combine sanitize with replace when you need to substitute characters instead of removing them.
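
As a minimal sketch of combining the two functions, the script below turns the no-break space and the tab into regular spaces before calling sanitize. The argument order replace(source, pattern, replacement) is an assumption and should be checked against the replace reference.

```envision
t = "A\u00A0B\tC"

// Assumed argument order: replace(source, pattern, replacement).
// Substitute a regular space (U+0020) for each invisible character
// that should survive as a word separator.
spaced = replace(replace(t, "\u00A0", " "), "\t", " ")

// sanitize then strips any remaining supported invisible characters.
clean = sanitize(spaced)

show summary "Replace then sanitize" with
  escape(t) as "Raw"
  escape(clean) as "Clean"
```

Because U+0020 is not among the supported invisible characters, the substituted spaces should remain in Clean, whereas calling sanitize alone would yield "ABC".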

Supported invisible characters

The sanitize function removes the following characters:

Unicode	Unicode character name
`U+0009`	CHARACTER TABULATION
`U+00A0`	NO-BREAK SPACE
`U+00AD`	SOFT HYPHEN
`U+034F`	COMBINING GRAPHEME JOINER
`U+061C`	ARABIC LETTER MARK
`U+115F`	HANGUL CHOSEONG FILLER
`U+1160`	HANGUL JUNGSEONG FILLER
`U+17B4`	KHMER VOWEL INHERENT AQ
`U+17B5`	KHMER VOWEL INHERENT AA
`U+180E`	MONGOLIAN VOWEL SEPARATOR
`U+2000`	EN QUAD
`U+2001`	EM QUAD
`U+2002`	EN SPACE
`U+2003`	EM SPACE
`U+2004`	THREE-PER-EM SPACE
`U+2005`	FOUR-PER-EM SPACE
`U+2006`	SIX-PER-EM SPACE
`U+2007`	FIGURE SPACE
`U+2008`	PUNCTUATION SPACE
`U+2009`	THIN SPACE
`U+200A`	HAIR SPACE
`U+200B`	ZERO WIDTH SPACE
`U+200C`	ZERO WIDTH NON-JOINER
`U+200D`	ZERO WIDTH JOINER
`U+200E`	LEFT-TO-RIGHT MARK
`U+200F`	RIGHT-TO-LEFT MARK
`U+202F`	NARROW NO-BREAK SPACE
`U+205F`	MEDIUM MATHEMATICAL SPACE
`U+2060`	WORD JOINER
`U+2061`	FUNCTION APPLICATION
`U+2062`	INVISIBLE TIMES
`U+2063`	INVISIBLE SEPARATOR
`U+2064`	INVISIBLE PLUS
`U+206A`	INHIBIT SYMMETRIC SWAPPING
`U+206B`	ACTIVATE SYMMETRIC SWAPPING
`U+206C`	INHIBIT ARABIC FORM SHAPING
`U+206D`	ACTIVATE ARABIC FORM SHAPING
`U+206F`	NOMINAL DIGIT SHAPES
`U+2800`	BRAILLE PATTERN BLANK
`U+3000`	IDEOGRAPHIC SPACE
`U+3164`	HANGUL FILLER
`U+FEFF`	ZERO WIDTH NO-BREAK SPACE
`U+FFA0`	HALFWIDTH HANGUL FILLER
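
The same pattern as the example above applies to the zero-width entries in this table. The following sketch uses a ZERO WIDTH SPACE and a SOFT HYPHEN; the variable names and tile title are illustrative only.

```envision
// U+200B (ZERO WIDTH SPACE) and U+00AD (SOFT HYPHEN) are both listed above.
t = "A\u200BB\u00ADC"
clean = sanitize(t)

show summary "Zero-width characters" with
  escape(t) as "Raw"
  strlen(t) as "RawLen"
  escape(clean) as "Clean"
  strlen(clean) as "CleanLen"
```

Going by the table, the length should drop from 5 to 3 once both characters are removed.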

Valid source text

Apart from the supported invisible characters listed above, source may only contain characters from the following Unicode ranges:

Character range	Unicode block
`U+0020 - U+007F`	Basic Latin (without C0 control codes)
`U+0080 - U+009F`	C1 control codes
`U+00A0 - U+00FF`	Latin-1 Supplement
`U+0100 - U+017F`	Latin Extended-A
`U+0180 - U+024F`	Latin Extended-B
`U+0250 - U+02AF`	IPA Extensions
`U+02B0 - U+02FF`	Spacing Modifier Letters
`U+0300 - U+036F`	Combining Diacritical Marks
`U+0370 - U+03FF`	Greek/Coptic
`U+0400 - U+04FF`	Cyrillic
`U+2010 - U+2027`	General Punctuation
`U+2030 - U+205E`	General Punctuation
`U+2061 - U+2064`	General Punctuation
`U+20A0 - U+20C0`	Currency Symbols

Errors

A character outside the accepted ranges raises an error such as:

sanitize(): "<source>" has invalid character \u2702.
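
For illustration, a script along the following lines would be expected to fail with that kind of message, since U+2702 lies outside the accepted ranges. The sample text is hypothetical, and how the source value is rendered in the message is not specified here.

```envision
// U+2702 is not in any accepted range, so sanitize is expected
// to reject this value instead of returning a cleaned text.
t = "Cut here \u2702"
clean = sanitize(t)

show summary "Invalid input" with
  escape(clean) as "Clean"
```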