Illegal character 0x1FFFF

$ perl -le 'use warnings; my $x=chr(0x1FFFF)' 
Unicode character 0x1ffff is illegal at -e line 1.

XML supports UTF-8, so I check that a string is valid UTF-8 and, if it is, use it in XML. Right? No!!!

There are some "non-illegal" characters that are perfectly valid in UTF-8 (or even in plain old ASCII) but are invalid in XML. The most obvious one is 0x00. Here is what the W3C XML 1.0 specification says:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
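
As a rough illustration, the Char production above can be turned into a single character class. This is only a minimal sketch: is_xml_char_string is a made-up name for this example, not the XML::Char API, and it assumes the input is an already decoded Perl character string rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Char): returns 1 if every
# character of $str is allowed by the XML 1.0 Char production, else 0.
# Assumes $str holds decoded characters, not raw UTF-8 bytes.
sub is_xml_char_string {
    my ($str) = @_;
    return $str =~ m/\A[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*\z/
        ? 1
        : 0;
}

# Tab, newline and ordinary text are fine; NUL is not.
print is_xml_char_string("plain text\n") ? "valid\n" : "invalid\n";   # valid
print is_xml_char_string("nul\x{0}byte") ? "valid\n" : "invalid\n";   # invalid
```

Note that most ASCII control characters below 0x20 fail this check too, not just 0x00; only tab, line feed and carriage return get through.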

I spent some time playing with it and the result is XML::Char->valid(). The dev version of Data::asXML is using it now. If you want, have a look at the test suite and try to break it. :-)

About this Entry

This page contains a single entry by Jozef Kutej published on January 27, 2010 9:53 PM.
