Special characters in Python

Richy · October 2013

So I am writing a Python program that takes in text and has to analyze and sort it. I'm having a problem dealing with special characters in the input text. Examples:

Schröder becomes SchrÃ¶der
Flüe becomes FlÃ¼e
Menchú becomes MenchÃº
Aysén becomes AysÃ©n
Ibáñez becomes IbÃ¡Ã±ez

I'd keep going, but I think you get the point.

So, what's the best way of dealing with those? I'd be ok with keeping the original accented letters or replacing them with plain letters. I just can't keep the gibberish it's outputting.

Thanks!

The Anonymous · October 2013

Looks like a problem with how text encoding is handled. Does your program output to a text file?
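For what it's worth, that exact pattern of gibberish is what you typically get when UTF-8 bytes are decoded with a single-byte codec such as Latin-1. The sketch below reproduces it under that assumption (the thread doesn't confirm which codec is actually involved). It's written for Python 3, and the helper names garble and repair are made up for illustration.

```python
# Assumption: the input is UTF-8 but is being decoded as Latin-1 somewhere.
names = ["Schröder", "Flüe", "Menchú", "Aysén", "Ibáñez"]

def garble(text):
    """Encode as UTF-8, then (wrongly) decode those bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")

def repair(garbled):
    """Undo the mistake: recover the original bytes, then decode as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")

for name in names:
    broken = garble(name)
    # Show the original, the garbled form, and the repaired form side by side.
    print(name, "->", broken, "->", repair(broken))
```

If the garbled column matches what your program produces, the fix is to decode the input as UTF-8 in the first place rather than to patch the text afterwards.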

Bowen · October 2013

Try this:

import unicodedata
msg = u"Schröder "
newMsg = unicodedata.normalize('NFKD', msg).encode('ascii','ignore')
print(newMsg)

Richy · October 2013

The Anonymous wrote: »

Looks like a problem with how text encoding is handled. Does your program output to a text file?

Yes, it does. I use the "open(filename, mode='w')" command.

Richy · October 2013

bowen wrote: »

Try this:

import unicodedata
msg = u"Schröder "
newMsg = unicodedata.normalize('NFKD', msg).encode('ascii','ignore')
print(newMsg)

That helps, in that it finds the problem characters, but it removes them. If I use "replace" instead of "ignore", it puts in question marks instead. Neither option is good - I need letters of some kind. I need it to output "Schröder" or "Schroder", not "Schrder" or "Schr???der".

Bowen · October 2013

Hmm, it should be changing the characters to their closest ASCII equivalents.
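Here is a rough sketch of what I'd expect to happen once the text is properly decoded. It's written for Python 3, assumes the raw input is UTF-8 bytes, and strip_accents is just a name I picked for the helper. NFKD splits "ö" into "o" plus a combining mark, and the combining mark is then dropped.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining accent marks removed (ö -> o, ñ -> n)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # combining() returns 0 for base characters and non-zero for accent marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

raw = b"Schr\xc3\xb6der"      # assumed: UTF-8 bytes as read from the input
text = raw.decode("utf-8")    # decode first, so Python sees a real "ö"
print(strip_accents(text))    # Schroder
```

Letters with no decomposition, such as "ß" or "ø", pass through unchanged with this approach, so it only helps for accented letters.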

Sounds like the text being read and written isn't being handled as Unicode, so the accented characters aren't getting through properly. Once you decode the input and encode the output as UTF-8, you shouldn't need to do anything more with it.

Something like this:

# -*- coding: utf-8 -*-
msg = u"Schröder"
# Open in binary mode, since encode() returns bytes
fHandle = open('output.txt', 'wb')
fHandle.write(msg.encode('utf8'))
fHandle.close()

To read that back in, you'd do this:

# file() only exists in Python 2; open() works everywhere
fHandle = open('output.txt', 'rb')
msg = fHandle.read().decode('utf8')
fHandle.close()

Richy · October 2013

That seems to have worked. Thanks Bowen!

Barrakketh · October 2013

FYI, there is a codecs module for this. Use codecs.open instead of the builtin open function.
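A minimal sketch of what that could look like. The temporary path is only there to keep the example self-contained, and the encoding is assumed to be UTF-8. With codecs.open you read and write unicode strings directly, and the encoding and decoding happen inside the file object.

```python
import codecs
import os
import tempfile

# Hypothetical output location, used only for this example
path = os.path.join(tempfile.mkdtemp(), "output.txt")

# Writing: pass unicode text, the file object encodes it as UTF-8
with codecs.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"Schröder\n")

# Reading: the file object decodes the bytes back into unicode text
with codecs.open(path, "r", encoding="utf-8") as handle:
    msg = handle.read()

print(msg == u"Schröder\n")  # True
```

On Python 3 the builtin open accepts the same encoding argument, so open(path, "w", encoding="utf-8") does the same job there.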

Mego Thor · October 2013

Honest to God, I opened this thread thinking it would be about this fellow..
