...
| Wiki Markup |
|---|
The following function from \[[Viega 03|AA. C References#Viega 03]\] detects invalid character sequences in a string but does not reject non-minimal forms. It returns {{1}} if the string consists only of legitimate sequences; otherwise, it returns {{0}}: |
| Code Block |
|---|
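/* Adapted from [Viega 03]: unknown lead bytes are rejected, and the bytes
 * that follow the lead byte (not the lead byte itself) are checked. */
int spc_utf8_isvalid(const unsigned char *input) {
  int nb;  /* number of continuation bytes expected after the lead byte */
  int i;
  const unsigned char *c;

  for (c = input; *c; c += (nb + 1)) {
    if (!(*c & 0x80)) nb = 0;                /* 0xxxxxxx: ASCII */
    else if ((*c & 0xc0) == 0x80) return 0;  /* stray continuation byte */
    else if ((*c & 0xe0) == 0xc0) nb = 1;
    else if ((*c & 0xf0) == 0xe0) nb = 2;
    else if ((*c & 0xf8) == 0xf0) nb = 3;
    else if ((*c & 0xfc) == 0xf8) nb = 4;
    else if ((*c & 0xfe) == 0xfc) nb = 5;
    else return 0;                           /* 0xfe and 0xff are never valid */
    /* Every continuation byte must be 10xxxxxx. A terminating NUL fails
     * this test, so the loop never reads past the end of the string. */
    for (i = 1; i <= nb; i++)
      if ((c[i] & 0xc0) != 0x80) return 0;
  }
  return 1;
}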
|
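Because the function above accepts non-minimal forms, a separate check is needed wherever overlong encodings matter (for example, before comparing input against a forbidden character such as {{/}}). The following is a minimal sketch of such a check, not code from \[[Viega 03|AA. C References#Viega 03]\]. The name {{utf8_has_overlong}} and the {{min_cp}} table are hypothetical, and the sketch assumes a NUL-terminated string that has already passed a structural check; it reports malformed input as {{0}} rather than diagnosing it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (an illustration, not from [Viega 03]).
 * Returns 1 if the NUL-terminated string contains at least one
 * non-minimal (overlong) sequence, and 0 otherwise.
 * Assumption: the input is structurally valid; if a malformed or
 * truncated sequence is met, the function simply returns 0. */
int utf8_has_overlong(const unsigned char *s) {
  /* min_cp[nb] is the smallest code point that genuinely needs
   * nb continuation bytes (that is, nb + 1 bytes in total). */
  static const unsigned long min_cp[] = {
    0x0UL, 0x80UL, 0x800UL, 0x10000UL, 0x200000UL, 0x4000000UL
  };

  assert(s != NULL);

  while (*s) {
    unsigned long cp;  /* decoded code point */
    int nb;            /* continuation bytes after the lead byte */
    int i;

    if (!(*s & 0x80))             { nb = 0; cp = *s; }
    else if ((*s & 0xe0) == 0xc0) { nb = 1; cp = *s & 0x1f; }
    else if ((*s & 0xf0) == 0xe0) { nb = 2; cp = *s & 0x0f; }
    else if ((*s & 0xf8) == 0xf0) { nb = 3; cp = *s & 0x07; }
    else if ((*s & 0xfc) == 0xf8) { nb = 4; cp = *s & 0x03; }
    else if ((*s & 0xfe) == 0xfc) { nb = 5; cp = *s & 0x01; }
    else return 0;  /* malformed lead byte: outside this sketch's scope */

    for (i = 1; i <= nb; i++) {
      if ((s[i] & 0xc0) != 0x80) return 0;  /* malformed or truncated */
      cp = (cp << 6) | (unsigned long)(s[i] & 0x3f);
    }

    /* The value would have fit in a shorter sequence. */
    if (cp < min_cp[nb]) return 1;

    s += nb + 1;
  }
  return 0;
}
```

Under these assumptions, a caller would accept a string only when the structural check succeeds and {{utf8_has_overlong}} returns {{0}}. The two-byte sequence {{0xC0 0xAF}}, an overlong encoding of {{/}}, is flagged by this sketch, whereas the minimal encoding {{0x2F}} is not.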
...
| Wiki Markup |
|---|
\[ISO/IEC 10646:2003(E)\] Information technology -- Universal Multiple-Octet Coded Character Set (UCS), First Edition. December, 2003.
\[[Kuhn 06|AA. C References#Kuhn 06]\] UTF-8 and Unicode FAQ for Unix/Linux
\[[Pike 93]\] Rob Pike, Ken Thompson. _Hello World_. USENIX Winter 1993 Technical Conference, San Diego, January 25-29, 1993, Proceedings, pp. 43-50.
\[[Viega 03|AA. C References#Viega 03]\] Section 3.12. "Detecting Illegal UTF-8 Characters"
\[[Wheeler 06|AA. C References#Wheeler 06]\] Secure Programming for Linux and Unix HOWTO
\[[Yergeau 98|AA. C References#Yergeau 98]\] RFC 2279 - UTF-8, a transformation format of ISO 10646 |
...