Problem with Notepad when opening UTF-8 encoded files - Chess Forums

stephen_33

Mar 19, 2019

0

#1

I may be missing something obvious here, but when I open UTF-8 encoded files that I've saved in MS Notepad, it sometimes opens them in ANSI mode by default. The result is that any non-ASCII characters are not displayed correctly & garbled sequences of other characters are shown instead. It may only happen when I've saved a UTF-8 encoded file as output from a Python script, but those files do save correctly in that format.

Here's an example - writing '✮ Chess Abrogators ✮' to file returns the following if opened in ANSI: âœ® Chess Abrogators âœ®

Now this is easily corrected if the option of UTF-8 is selected as the encoding mode when opening the file but is there a way of setting this as the default? And why, if I'm saving a file in UTF-8, does Notepad bother to use ANSI as the default sometimes?

I can't see a setting in Notepad for correcting this.
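
The garbling in the example above can be reproduced in a few lines of Python. This is only a sketch, and it assumes that "ANSI" on the machine in question means the Windows-1252 code page (Python's "cp1252"); other locales use other code pages and would garble the text differently.

```python
# Reproduce the garbling: UTF-8 bytes interpreted as Windows-1252 ("ANSI").
# Assumption: the ANSI code page is cp1252, as on Western European Windows setups.
text = "✮ Chess Abrogators ✮"  # the star is U+272E

utf8_bytes = text.encode("utf-8")      # each star becomes the three bytes E2 9C AE
garbled = utf8_bytes.decode("cp1252")  # cp1252 turns every byte into one character

# garbled now holds the three-character sequence seen in Notepad in place of each star.

# Reversing the mistake recovers the original text.
recovered = garbled.encode("cp1252").decode("utf-8")
```

Nothing in the file is damaged in this situation; the same bytes are simply being read with the wrong decoding table.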

MGleason

Mar 19, 2019

0

#2

Have you tried using Notepad++? It's a much more sophisticated text editor.

ASCII stupid question, get a stupid ANSI...

skelos

Mar 19, 2019

0

#3

Notebook or Notepad? I always default to Wordpad over Notepad when using Windows, and second Michael's recommendation of Notepad++.

No comment re the OS wars; Windows ...

WhiteDrake

Mar 19, 2019

0

#4

Note that a text editor has no way to recognise the encoding of a file. Notepad has no settings; I believe it just uses your Windows profile locale. Changing the default encoding in your profile locale to UTF-8 might solve the issue, but installing Notepad++ is definitely an easier option.

stephen_33

Mar 19, 2019

0

#5

skelos wrote:

Notebook or Notepad? I always default to Wordpad over Notepad when using Windows, and second Michael's recommendation of Notepad++.

No comment re the OS wars; Windows ...

Yes, typo - I meant Notepad. I've corrected that above.

But I tried downloading Notepad++ when I was setting up Python the first time around & I had some problem with it - I've forgotten what. The Python IDE is adequate for my needs anyway, so I don't really need anything more sophisticated & I follow the motto of "where possible, keep it simple".

ljvankuiken

Mar 19, 2019

0

#6

Notepad++

stephen_33

Mar 19, 2019

0

#7

WhiteDrake wrote:

Note that a text editor has no way to recognise the encoding of a file. Notepad has no settings; I believe it just uses your Windows profile locale. Changing the default encoding in your profile locale to UTF-8 might solve the issue, but installing Notepad++ is definitely an easier option.

OK, I hadn't taken that into account, so I'm probably stuck with the situation as it is because I'd prefer not to change my default encoding to UTF-8 for all text files.

Not even a problem really, more an irritation because I have only to check the encoding type before I open the file. Otherwise, file I/O regarding my scripts works fine. It's just a case of remembering to do that.

WhiteDrake

Mar 19, 2019

0

#8

By insisting on viewing text files in MS Notepad, you’d be doing yourself a disservice. Pretty much any other text editor is considerably better. It’s like if you wanted to chop down a tree with an axe rather than using a chain saw; yes, the axe is definitely simpler.

stephen_33

Mar 19, 2019

0

#9

WhiteDrake wrote:

Note that a text editor has no way to recognise the encoding of a file.

Is that strictly true though? When saving a file encoded as UTF-8 (or UTF-16), a special byte marker (BOM) is placed at the start of the file. This identifies the file's encoding type.

In Python, if you forget to specify the encoding with 'utf-8-sig' when opening such a file, a sequence of unwanted characters is returned at the start of the first string read from it. This gave me a headache until I realised what was going on.

https://docs.python.org/3/howto/unicode.html

" The Unicode character U+FEFF is used as a byte-order mark (BOM), and is often written as the first character of a file in order to assist with autodetection of the file’s byte ordering. Some encodings, such as UTF-16, expect a BOM to be present at the start of a file; when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as ‘utf-16-le’ and ‘utf-16-be’ for little-endian and big-endian encodings, that specify one particular byte ordering and don’t skip the BOM.

In some areas, it is also convention to use a “BOM” at the start of UTF-8 encoded files; the name is misleading since UTF-8 is not byte-order dependent. The mark simply announces that the file is encoded in UTF-8. For reading such files, use the ‘utf-8-sig’ codec to automatically skip the mark if present. "

So while Notepad seems to ignore the BOM that's present, it is possible, by the looks of it, to detect whether a file is encoded in UTF-8.
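
The difference between the 'utf-8' and 'utf-8-sig' codecs described in the quoted documentation can be seen without touching a file at all. The snippet below is a minimal sketch that builds the bytes in memory; the sample text is the club name from the first post.

```python
import codecs

# Bytes as an editor would save them with a UTF-8 signature (BOM) in front.
raw = codecs.BOM_UTF8 + "✮ Chess Abrogators ✮".encode("utf-8")

# The signature is the three bytes EF BB BF at the very start.
has_signature = raw.startswith(codecs.BOM_UTF8)

as_utf8 = raw.decode("utf-8")          # keeps U+FEFF as the first character
as_utf8_sig = raw.decode("utf-8-sig")  # drops the signature if it is present
```

With plain 'utf-8' the first character of the decoded string is the invisible U+FEFF. If the same bytes are decoded with an ANSI code page instead, the signature shows up as three visible junk characters at the start.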

WhiteDrake

Mar 19, 2019

0

#10

A text editor can make an educated guess, but it can’t determine the encoding with certainty in general (most encodings don’t use any kind of signature or BOM). Even a BOM gives “just” a high level of confidence, not certainty.

I’m not a fan of a BOM in UTF-8 as it can break portability between different text editors, in my experience. If you’re the only one using your code, on just one OS and with just one text editor, you can use whatever works for you. But in my opinion it’s not good practice to rely on a text editor (or a script) recognising the encoding.

(I don’t think good text editors place a BOM at the beginning of a UTF-8 file, but I’ll have to check.)
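
An "educated guess" of the kind described here might look like the sketch below. The function name guess_encoding and the cp1252 fallback are assumptions made for illustration; this is not how Notepad or any particular editor actually decides, and UTF-32 is ignored to keep it short.

```python
import codecs

def guess_encoding(raw: bytes, fallback: str = "cp1252") -> str:
    """Hypothetical helper: guess a text encoding from raw bytes."""
    # A signature at the start is the strongest hint available.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No signature: see whether the bytes happen to be valid UTF-8.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the (hypothetical) local ANSI code page.
        return fallback
```

The guess can be wrong. The two characters "Ã©" saved as cp1252 produce the bytes C3 A9, which are also the valid UTF-8 encoding of "é", so guess_encoding reports "utf-8" for a file that was really written in ANSI.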

stephen_33

Mar 19, 2019

0

#11

I'm stuck at the moment because a lot of what I'm doing involves lists of clubs & their admins & you have to save non-ANSI text in Notepad in some kind of Unicode format, else text is lost. And I think Notepad automatically inserts a BOM at the start of the file.

The problem is a trivial one because it only affects viewing UTF-8 encoded files but I'll certainly look at downloading Notepad++ when I have a spare moment.

SJCVChess

Mar 19, 2019

0

#12

Try a hex editor? Do a dump of the raw files (saved with Notepad) and see if there's a difference. Write a PowerShell or Bash script to convert in bulk. I don't see why Unicode shouldn't be used, other than character support, which just depends on what the font supports.
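
Since Python is already in use in this thread, both suggestions can be sketched in Python rather than PowerShell or Bash. The function names, the "*.txt" pattern and the decision to add a BOM are all assumptions for illustration; the conversion works on raw bytes so that line endings are left untouched.

```python
import codecs
from pathlib import Path

def first_bytes_hex(path: Path, count: int = 8) -> str:
    """Return the first `count` bytes of a file as space-separated hex."""
    with open(path, "rb") as handle:
        return " ".join(f"{byte:02x}" for byte in handle.read(count))

def add_utf8_signature(folder: Path, pattern: str = "*.txt") -> int:
    """Prepend a BOM to UTF-8 files that lack one; return how many were changed."""
    changed = 0
    for path in sorted(folder.glob(pattern)):
        raw = path.read_bytes()
        if raw.startswith(codecs.BOM_UTF8):
            continue  # already has the signature
        raw.decode("utf-8")  # raises UnicodeDecodeError if the file is not UTF-8
        path.write_bytes(codecs.BOM_UTF8 + raw)
        changed += 1
    return changed
```

A file that starts with "ef bb bf" in the dump has the UTF-8 signature; one that starts directly with the text bytes does not. Running add_utf8_signature a second time on the same folder changes nothing.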

I haven't used Windows or Notepad in years, so, I can't be of much further help. But there's probably a way to set or launch defaults, either via command parameters, or possibly in the registry. Oddly enough, those are the exact two reasons I don't use Windows: (1) the Registry, and (2) lack of support for things like defaults via command line.

As for Python:

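Python 2: You can prefix the source file with "# -*- coding: utf-8 -*-". This declares the encoding of the script itself, not of the files it reads or writes.

Python 3: Source files are UTF-8 by default. For file I/O, open() uses the platform's preferred encoding unless you pass an encoding explicitly, so on Windows it is safest to specify encoding='utf-8' yourself.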

Either way, if you look at the options for reading and writing files in either Python 2 or 3, it's very specific about formats and conversion.
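
As a minimal sketch of being specific about the format, the snippet below writes and reads a file with an explicit encoding argument. The file name clubs.txt and the temporary folder are hypothetical, and choosing 'utf-8-sig' (UTF-8 with a signature) rather than plain 'utf-8' is an assumption made here because the file is meant to be viewed in Notepad.

```python
import os
import tempfile

text = "✮ Chess Abrogators ✮\n"

# Hypothetical file name, created in a temporary folder for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "clubs.txt")

# Passing encoding= explicitly avoids depending on the platform default.
# "utf-8-sig" writes the EF BB BF signature before the text.
with open(path, "w", encoding="utf-8-sig") as handle:
    handle.write(text)

# Reading in binary mode shows what is really at the start of the file.
with open(path, "rb") as handle:
    raw_start = handle.read(3)

# Reading with "utf-8-sig" skips the signature again.
with open(path, "r", encoding="utf-8-sig") as handle:
    read_back = handle.read()
```

Reading with 'utf-8-sig' also works for files that have no signature, so it is a reasonable choice for input that may come from either Notepad or a script.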

When I first joined this club, I was reading some past posts to get an idea of what goes on here, and, I was reading another of your posts. No offense meant, but, Notepad++ is pretty standard from a software development perspective. I never encountered any issues downloading or installing it. When you say something like that, it sounds strange to the rest of us. It also comes with a lot of nifty features for automating repetitive tasks, tons of keyboard short-cuts, great pattern-matching support, etc. It seems like your dependence on Notepad is a crutch. As you said, you're stuck.

I agree with the "keep it simple" philosophy, but, as someone else said, you're doing yourself a disservice, you are providing your own irritation by relying on inadequate tools. I can also remember many years ago when I spent a lot of time playing with text files, encountering similar issues. What it boils down to is long-term experience: Now I go all-in for consistency above all else, and since we operate in a global village or economy, and ASCII simply doesn't cut it, I prefer UTF-8.

Plus, this is the development forum of a chess website. Asking about operating-system-specific tools might get answers (there's a lot of good help here), but any number of people could be using any number of other operating systems, so asking for help with Notepad could easily be hit-or-miss. I'd also suggest simplifying things by getting and actively using Notepad++. If you still have problems with getting or using Notepad++, I can throw in two other alternatives that I use on Linux but that are also available for Windows: Geany and Bluefish.

MGleason

Mar 19, 2019

0

#13

Notepad is a horse and buggy, Notepad++ is a car. Yeah, the horse and buggy will get you from A to B, but why wouldn't you drive a car?

stephen_33

Mar 20, 2019

0

#14

I hope I didn't overstate the problem. Creating UTF-8 files from within Python & then inputting them again causes no difficulty whatsoever.

The only problem is when I open such a file in a Notepad window to view it but forget to open it as a UTF-8 encoded file. Even then, once I've selected UTF-8 as the encoding with which to open any file, Notepad then saves that option for future use.

For all its faults, I quite like the stripped down operation of Notepad. I find with some apps. that lots of additional features that I never use can get in the way.

WhiteDrake

Mar 20, 2019

0

#15

stephen_33 wrote:

I hope I didn't overstate the problem.

No problem is too small for an internet discussion forum.

stephen_33 wrote:

For all its faults, I quite like the stripped down operation of Notepad. I find with some apps. that lots of additional features that I never use can get in the way.

That makes sense, but some features are useful, like lexical highlighting, setting the tab width (and inserting spaces instead of a tab) or displaying line numbers.

stephen_33

Mar 20, 2019

0

#16

I don't write actual code within Notepad, so many of those features would be of little use to me. I use Notepad mostly for constructing lists of items (club names with their admins) which I then use as input for my scripts. All of that works seamlessly.

It's just a pity that Notepad is unable to detect the encoding type used when a file was saved. If it did, it would be perfect for my needs. Even then, it's nothing more than a small irritation for me to have to remember to check the encoding type when opening a file for viewing.