Special characters in Python

Richy Registered User regular
So I am writing a Python program that takes in text and has to analyze and sort it. I'm having a problem dealing with special characters in the input text. Examples:

Schröder becomes Schröder
Flüe becomes Flüe
Menchú becomes Menchú
Aysén becomes Aysén
Ibáñez becomes Ibáñez

I'd keep going, but I think you get the point.

So, what's the best way of dealing with those? I'd be ok with keeping the original accented letters or replacing them with plain letters. I just can't keep the gibberish it's outputting.

Thanks!


Posts

  • The Anonymous Uh, uh, uhhhhhh... Uh, uh. Registered User regular
    Looks like a problem with how text encoding is handled. Does your program output to a text file?
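
To illustrate what an encoding mix-up of this kind can look like, here is a minimal Python 3 sketch. It assumes the gibberish comes from UTF-8 bytes being decoded as Latin-1, which is one common cause; the thread does not confirm that this is what happened in the original program.

```python
original = "Schröder"

# Assumption for illustration: the text was stored as UTF-8 bytes
# but then (wrongly) decoded as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # the single "ö" now shows up as two unrelated characters

# Reversing the two steps recovers the text, as long as no bytes were lost.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

If the round trip succeeds on real data, that is a strong hint about which two encodings were mixed up.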

  • Bowen Sup? Registered User regular
    Try this:
    import unicodedata
    msg = u"Schröder "
    newMsg = unicodedata.normalize('NFKD', msg).encode('ascii','ignore')
    print(newMsg)
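
For readers on Python 3, where `encode` returns a bytes object, the same idea can be sketched as below. The helper name `strip_accents` is made up for this example, and the sketch assumes the input is already a correctly decoded string rather than garbled text.

```python
import unicodedata


def strip_accents(text):
    """Return text with accents removed, keeping the ASCII base letters."""
    # NFKD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping everything outside ASCII discards the combining marks.
    # Characters with no ASCII decomposition are dropped entirely.
    return decomposed.encode("ascii", "ignore").decode("ascii")


for name in ["Schröder", "Flüe", "Menchú", "Aysén", "Ibáñez"]:
    print(strip_accents(name))
```

If the input is already garbled, the stray characters have no base letter to fall back on, so this approach simply deletes them.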
    

  • Richy Registered User regular
    The Anonymous wrote: »
    Looks like a problem with how text encoding is handled. Does your program output to a text file?

    Yes, it does. I use the "open(filename, mode='w')" command.

  • Richy Registered User regular
    bowen wrote: »
    Try this:
    import unicodedata
    msg = u"Schröder "
    newMsg = unicodedata.normalize('NFKD', msg).encode('ascii','ignore')
    print(newMsg)
    

    That helps, in that it finds the problem characters, but it removes them. If I use "replace" instead of "ignore", it puts in question marks instead. Neither option is good: I need letters of some kind. I need it to output "Schröder" or "Schroder", not "Schrder" or "Schr???der".

  • Bowen Sup? Registered User regular
    edited October 2013
    Hmm, it should be changing the characters to their closest ASCII equivalents.

    Sounds like the text isn't being read and written as Unicode, so the accented characters aren't handled properly. Once you switch to Unicode, you shouldn't need to do anything more with it.

    Something like this:
    msg = u"Schröder"
    fHandle = open('output.txt', 'w')
    fHandle.write(msg.encode('utf8'))
    fHandle.close()
    

    To read that back in, you'd do this:
    # Python 2: read() returns a byte string, so decode it into a unicode string
    fHandle = open('output.txt', 'r')
    msg = fHandle.read().decode('utf8')
    fHandle.close()
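
The snippets above are Python 2 style. As a rough Python 3 counterpart, the sketch below passes an explicit `encoding` to the builtin `open`, so the file object does the encoding and decoding itself. The temporary path is a placeholder chosen for this example, not something from the thread.

```python
import os
import tempfile

msg = "Schröder"

# Hypothetical path inside a fresh temp directory, so nothing is overwritten.
path = os.path.join(tempfile.mkdtemp(), "output.txt")

# Text mode with an explicit encoding: write() takes a str and encodes it.
with open(path, "w", encoding="utf-8") as handle:
    handle.write(msg)

# Reading with the same encoding returns a str; no decode() call is needed.
with open(path, "r", encoding="utf-8") as handle:
    round_tripped = handle.read()

print(round_tripped == msg)
```

Naming the encoding on both sides avoids depending on the platform default, which can differ between machines.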
    

  • Richy Registered User regular
    That seems to have worked. Thanks Bowen!

  • Barrakketh Registered User regular
    FYI, there is a codecs module for this. Use codecs.open instead of the builtin open function.
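
A minimal sketch of that suggestion, written for Python 3, might look like this. The file name and the sample names are placeholders for illustration only.

```python
import codecs
import os
import tempfile

# Hypothetical file in a temp directory, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "names.txt")
names = ["Schröder", "Ibáñez"]

# codecs.open encodes on write, so the caller passes text, not bytes.
with codecs.open(path, "w", encoding="utf-8") as handle:
    for name in names:
        handle.write(name + "\n")

# It also decodes on read, so each line comes back as text.
with codecs.open(path, "r", encoding="utf-8") as handle:
    loaded = [line.rstrip("\n") for line in handle]

print(loaded == names)
```

In Python 3 the builtin `open` accepts the same `encoding` argument, so `codecs.open` is mainly useful for code that still has to run on Python 2.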

  • Mego Thor "I say thee...NAY!" Registered User regular
    Honest to God, I opened this thread thinking it would be about this fellow..

    Gumby.png
