Handling invalid UTF-8

by **paulmansour** on Wed Mar 11, 2015 3:31 pm

If you have a string of bytes that you think is, or should be, UTF-8, you will get a domain error using ⎕UCS if there are invalid byte sequences. For example, consider:

      'UTF-8' ⎕UCS  ('UTF-8'⎕UCS'Bjørn'),(⎕UCS'Café')
DOMAIN ERROR

This fails because the byte 233 (Latin small letter e with acute in ISO-8859-1) is, in UTF-8, a lead byte announcing a multi-byte sequence that encodes a single Unicode character. Here no continuation bytes follow it, so the sequence is invalid.
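The byte values involved can be inspected directly in the session. A small illustration, assuming a Unicode edition of Dyalog APL and using only the two forms of ⎕UCS that appear above:

```apl
      ⎕UCS 'Café'            ⍝ code points: é is 233 (hex E9)
67 97 102 233
      'UTF-8' ⎕UCS 'Café'    ⍝ UTF-8 bytes: é is encoded as the pair 195 169
67 97 102 195 169
```

In UTF-8, a byte in the range 224 to 239 announces a three-byte sequence and must be followed by two bytes in the range 128 to 191. In the failing example 233 is the last byte, so the sequence is incomplete.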

This situation could arise if someone unthinkingly combined a UTF-8 file with an ISO-8859 file, for example.

If you encounter this sort of thing when opening a text file in Excel or Notepad, you will get the special "replacement character", Unicode 65533 (U+FFFD), in place of the invalid bytes. In other words, these apps handle this type of corrupt file as best they can, provided the file starts with a UTF-8 BOM or you explicitly say it is UTF-8.

So the question arises, how could I do this in APL? I don't think I can currently effectively do this.

Would it make sense that there is an option to use 65533 and not have ⎕UCS return a domain error in this case?

Is there some way to easily and efficiently determine invalid byte sequences and do the replacement myself?

I have seen some warnings in discussions of parsing UTF-8 that one should not attempt to handle invalid UTF-8, as this could lead to security issues, but I think this is mostly related to the Web. If Excel and Notepad can read a file, my customers will want my app to read it as well.

by **PGilbert** on Wed Mar 11, 2015 4:47 pm

The method 'EncodingDetector', included in the namespace DIO available at http://aplwiki.com/netDIO, may help you. It will try to find the encoding from the BOM, but it will also analyze the file to find out whether it is UTF-8 without a BOM. This link is also quite good on the subject, in case you have not seen it yet: http://aplwiki.com/Utf8orNot

You are correct in not wanting to read a file 'hoping' that everything will be OK when it could have been inadvertently modified by the user. The method 'EncodingDetector' was made explicitly for that purpose and will never fail with an error.
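For a check that does not depend on .NET, the DOMAIN ERROR described in the first post can itself serve as a validity test. The following is only a minimal sketch, assuming Dyalog dfns with error guards. The names IsUtf8 and Decode are hypothetical, and the fallback treats every byte as one code point (effectively ISO-8859-1) rather than substituting 65533 for just the bad sequences:

```apl
      ⍝ Hypothetical helpers, not part of DIO or of Dyalog itself.
      ⍝ ⍵ is a vector of byte values (integers 0 to 255).
      IsUtf8←{11::0 ⋄ 1⊣'UTF-8' ⎕UCS ⍵}     ⍝ 1 if ⍵ decodes as UTF-8, 0 on DOMAIN ERROR (11)
      Decode←{11::⎕UCS ⍵ ⋄ 'UTF-8' ⎕UCS ⍵}   ⍝ strict UTF-8 first, else one character per byte

      IsUtf8 67 97 102 195 169
1
      IsUtf8 67 97 102 233
0
      Decode 67 97 102 233
Café
```

This is an all-or-nothing choice per byte vector. For a genuinely mixed input such as the one in the first post, the fallback would decode the UTF-8 part one byte at a time as well, so 'Bjørn' would come back as 'BjÃ¸rn'. Applying it to smaller pieces, for example line by line, would limit that damage.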

by **paulmansour** on Fri Mar 13, 2015 12:47 pm

Pierre,

Thanks for the links. Phil Last's aplwiki article was very useful.

Paul
