sanitize

sanitize, function

def pure sanitize(source: text): text

Returns source with every supported invisible character removed. The characters concerned are listed under "Supported invisible characters" below.

Example

t = "A\u00A0B\tC"
rawLen = strlen(t)
clean = sanitize(t)
cleanLen = strlen(clean)

show summary "Sanitize" with
  escape(t) as "Raw"
  rawLen as "RawLen"
  escape(clean) as "Clean"
  cleanLen as "CleanLen"

This script displays the following summary:

Label Value
Raw “A\u00A0B\tC”
RawLen 5
Clean “ABC”
CleanLen 3

Remarks

Use escape to reveal invisible characters before cleaning. To substitute a character instead of removing it, combine sanitize with replace, and apply replace first: once sanitize has removed a character, there is nothing left to substitute.
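The ordering of substitution and removal can be illustrated outside the language. The following is a minimal Python sketch, not Envision code: `reveal` and `substitute_then_remove` are hypothetical helper names, `reveal` is only a rough analogue of escape (Python writes U+00A0 as `\xa0` rather than `\u00A0`), and `REMOVED` holds just two of the documented characters for brevity.

```python
# Illustrative Python model of the remark above, not the Envision implementation.

# Small subset of the documented removal list, enough for the example text.
REMOVED = {0x0009, 0x00A0}  # CHARACTER TABULATION, NO-BREAK SPACE


def reveal(text: str) -> str:
    """Rough analogue of escape: show control and non-ASCII characters as escapes."""
    return text.encode("unicode_escape").decode("ascii")


def substitute_then_remove(text: str) -> str:
    """Turn NO-BREAK SPACE into a plain space, then drop what remains in REMOVED."""
    # Substitution must come first: after removal the NBSP no longer exists.
    text = text.replace("\u00A0", " ")
    return "".join(ch for ch in text if ord(ch) not in REMOVED)


raw = "A\u00A0B\tC"
print(reveal(raw))                           # inspect before cleaning
print(reveal(substitute_then_remove(raw)))   # NBSP kept as a space, tab removed
```

Reversing the two steps would drop the no-break space before it could be replaced, which is why the substitution runs first.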

Supported invisible characters

The sanitize function removes the following characters:

Unicode Unicode character name
U+0009 CHARACTER TABULATION
U+00A0 NO-BREAK SPACE
U+00AD SOFT HYPHEN
U+034F COMBINING GRAPHEME JOINER
U+061C ARABIC LETTER MARK
U+115F HANGUL CHOSEONG FILLER
U+1160 HANGUL JUNGSEONG FILLER
U+17B4 KHMER VOWEL INHERENT AQ
U+17B5 KHMER VOWEL INHERENT AA
U+180E MONGOLIAN VOWEL SEPARATOR
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
U+200E LEFT-TO-RIGHT MARK
U+200F RIGHT-TO-LEFT MARK
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+2060 WORD JOINER
U+2061 FUNCTION APPLICATION
U+2062 INVISIBLE TIMES
U+2063 INVISIBLE SEPARATOR
U+2064 INVISIBLE PLUS
U+206A INHIBIT SYMMETRIC SWAPPING
U+206B ACTIVATE SYMMETRIC SWAPPING
U+206C INHIBIT ARABIC FORM SHAPING
U+206D ACTIVATE ARABIC FORM SHAPING
U+206F NOMINAL DIGIT SHAPES
U+2800 BRAILLE PATTERN BLANK
U+3000 IDEOGRAPHIC SPACE
U+3164 HANGUL FILLER
U+FEFF ZERO WIDTH NO-BREAK SPACE
U+FFA0 HALFWIDTH HANGUL FILLER
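To pre-clean data with the same rule before it reaches a script, the table above can be transcribed into a lookup set. This is a minimal Python sketch under that assumption: `INVISIBLE` and `sanitize_sketch` are hypothetical names, the code points are copied from the table, and nothing here reflects how the function is actually implemented.

```python
# Illustrative Python model of the table above, not the Envision implementation.

INVISIBLE = frozenset(
    [0x0009, 0x00A0, 0x00AD, 0x034F, 0x061C, 0x115F, 0x1160,
     0x17B4, 0x17B5, 0x180E]
    + list(range(0x2000, 0x2010))  # U+2000 .. U+200F, all listed in the table
    + [0x202F, 0x205F]
    + list(range(0x2060, 0x2065))  # U+2060 .. U+2064
    + [0x206A, 0x206B, 0x206C, 0x206D, 0x206F,
       0x2800, 0x3000, 0x3164, 0xFEFF, 0xFFA0]
)


def sanitize_sketch(source: str) -> str:
    """Return source without any character whose code point is in INVISIBLE."""
    return "".join(ch for ch in source if ord(ch) not in INVISIBLE)


clean = sanitize_sketch("A\u00A0B\tC")
print(clean, len(clean))
```

A set lookup keeps the check constant-time per character, and listing the code points explicitly makes it easy to compare the sketch against the table row by row.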

Valid source text

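Apart from the invisible characters listed above, which are removed (the example passes a tab, U+0009, without error), the function accepts text from the following Unicode ranges: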

Character range Unicode block
U+0020 - U+007F Basic Latin (without C0 control codes)
U+0080 - U+009F C1 control codes
U+00A0 - U+00FF Latin-1 Supplement
U+0100 - U+017F Latin Extended-A
U+0180 - U+024F Latin Extended-B
U+0250 - U+02AF IPA Extensions
U+02B0 - U+02FF Spacing Modifier Letters
U+0300 - U+036F Combining Diacritical Marks
U+0370 - U+03FF Greek and Coptic
U+0400 - U+04FF Cyrillic
U+2010 - U+2027 General Punctuation
U+2030 - U+205E General Punctuation
U+2061 - U+2064 General Punctuation
U+20A0 - U+20C0 Currency Symbols

Errors

A character outside the accepted ranges raises an error that reports the offending code point, for example (here U+2702):

sanitize(): "<source>" has invalid character \u2702.
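
The range check and the error format can be modelled as follows. This is a minimal Python sketch, not the Envision implementation: `ACCEPTED_RANGES` is transcribed from the "Valid source text" table, `check_source` is a hypothetical name, the `"<source>"` placeholder is kept literally because the documentation does not say what replaces it, and the `also_allowed` default (tab only) is an assumption based on the documented example.

```python
# Illustrative Python model of the documented validation, not the Envision implementation.

# Inclusive (first, last) code point pairs, copied from the table of accepted ranges.
ACCEPTED_RANGES = [
    (0x0020, 0x007F), (0x0080, 0x009F), (0x00A0, 0x00FF),
    (0x0100, 0x017F), (0x0180, 0x024F), (0x0250, 0x02AF),
    (0x02B0, 0x02FF), (0x0300, 0x036F), (0x0370, 0x03FF),
    (0x0400, 0x04FF), (0x2010, 0x2027), (0x2030, 0x205E),
    (0x2061, 0x2064), (0x20A0, 0x20C0),
]


def is_accepted(code_point: int) -> bool:
    """True when the code point falls inside one of the accepted ranges."""
    return any(first <= code_point <= last for first, last in ACCEPTED_RANGES)


def check_source(source: str, also_allowed=frozenset({0x0009})) -> None:
    """Raise ValueError on the first character outside the accepted ranges.

    also_allowed is an assumption: the documented example passes a tab (U+0009),
    so removable characters outside the ranges are presumably tolerated as well.
    """
    for ch in source:
        cp = ord(ch)
        if cp in also_allowed or is_accepted(cp):
            continue
        # Mirrors the documented message; "<source>" is left as a literal placeholder.
        raise ValueError(
            f'sanitize(): "<source>" has invalid character \\u{cp:04X}.'
        )


check_source("A\u00A0B\tC")  # passes silently
try:
    check_source("cut \u2702 here")
except ValueError as err:
    print(err)
```

Running such a check on incoming data makes it possible to find unsupported characters before the script fails on them.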
