Fix Python – UnicodeDecodeError, invalid continuation byte

Question

Asked By – RuiDC

Why is the below item failing? Why does it succeed with “latin-1” codec?

o = "a test of \xe9 char" #I want this to remain a string as this is what I am receiving
v = o.decode("utf-8")

Which results in:

 Traceback (most recent call last):  
 File "<stdin>", line 1, in <module>  
 File "C:\Python27\lib\encodings\utf_8.py",
 line 16, in decode
     return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError:
 'utf8' codec can't decode byte 0xe9 in position 10: invalid continuation byte

Now we will see solution for issue: UnicodeDecodeError, invalid continuation byte


Answer

In binary, 0xE9 looks like 1110 1001. If you read about UTF-8 on Wikipedia, you’ll see that such a byte must be followed by two of the form 10xx xxxx. So, for example:

>>> b'\xe9\x80\x80'.decode('utf-8')
u'\u9000'

But that’s just the mechanical cause of the exception. In this case, you have a string that is almost certainly encoded in latin 1. You can see how UTF-8 and latin 1 look different:

>>> u'\xe9'.encode('utf-8')
b'\xc3\xa9'
>>> u'\xe9'.encode('latin-1')
b'\xe9'

(Note, I’m using a mix of Python 2 and 3 representation here. The input is valid in any version of Python, but your Python interpreter is unlikely to actually show both unicode and byte strings in this way.)

This question is answered By – Josh Lee

This answer is collected from stackoverflow and reviewed by FixPython community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0