Page History

...

Wiki Markup

The following function from \[[Viega 03|AA. C References#Viega 03]\] will detect invalid character sequences in a string but will not reject non-minimal forms. It returns {{1}} if the string is comprised only of legitimate sequences,; else it returns {{0}}:

Code Block

int spc_utf8_isvalid(const unsigned char *input) {
  int nb;
  const unsigned char *c = input;

  for (c = input;  *c;  c += (nb + 1)) {
    if (!(*c & 0x80)) nb = 0;
    else if ((*c & 0xc0) == 0x80) return 0;
    else if ((*c & 0xe0) == 0xc0) nb = 1;
    else if ((*c & 0xf0) == 0xe0) nb = 2;
    else if ((*c & 0xf8) == 0xf0) nb = 3;
    else if ((*c & 0xfc) == 0xf8) nb = 4;
    else if ((*c & 0xfe) == 0xfc) nb = 5;
    while (nb-- > 0)
      if ((*(c + nb) & 0xc0) != 0x80) return 0;
  }
  return 1;
}

...

Wiki Markup

\[ISO/IEC 10646:2003(E)\] Information technology -- Universal Multiple-Octet Coded Character Set (UCS), First Edition. December, 2003.
\[[Kuhn 06|AA. C References#Kuhn 06]\] UTF-8 and Unicode FAQ for Unix/Linux
\[[Pike 93]\] Rob Pike, Ken Thompson. _Hello World_. USENIX Winter 1993 Technical Conference, San Diego, January 25-29, 1993, Proceedings, pp. 43-50. 
\[[Viega 03|AA. C References#Viega 03]\] Section 3.12. "Detecting Illegal UTF-8 Characters"
\[[Wheeler 06|AA. C References#Wheeler 06]\] Secure Programming for Linux and Unix HOWTO
\[[Yergeau 98|AA. C References#Yergeau 98]\] RFC 2279 - UTF-8, a transformation format of ISO 10646

...

Space shortcuts

Page tree

Versions Compared

Old Version 24

New Version 25

Key