You may distribute this paper by any means for non-profit, non commercial
applications. Please send it to your friends and local BBSes and FTP archives.
No charge for distribution is allowed. All text in this file is copyrighted
and no changes are allowed. Please email corrections, suggestions and paper
ideas to
pete@peterbenjamin.com
This paper is one of a series on cross platform and operating system portability issues.
The intended reader of this paper is the novice to intermediate computer user who needs to share text files across multiple operating systems. This paper provides a simple explanation of man-readable computer files. The EOL or end-of-line character difference between platforms is explained for DOS, Macintosh and Unix. FTP (file transfer protocol) binary/ascii mode is explained to a degree.
The Many Names of Man Readable Text Files.
The Many Filename Endings or "Extensions"
ASCII - What it means and how to pronounce it.
Display and Non-display characters
End Of Line Characters: carriage return and line feed
Display on other platforms
Text Edit and Word Processing Software Features
Conversions between DOS, Macintosh and Unix.
Internet FTP
Many computer users are confused and disappointed with the results when transferring a man-readable text file to another platform. The reason is the different End-Of-Line (EOL) characters used by DOS, Macintosh and Unix. These differences are explained and conversion methods are described. ASCII is the character set discussed most; other character sets are mentioned, and their EOL method is usually the same.
"Fixed length" record files do not use the EOL concept. The record
length gives the EOL.
With the increasing use of the Internet and its FTP service, an incorrect setting of the FTP options will prevent accurate transfer and commonly results in garbage files. Binary files must be transferred in "binary" mode (called "image" mode in the FTP standard), which is different from the mode used for text files. However, text files can be transferred in binary mode with no harm to the data.
These files are known by many names including, but not limited to:
ASCII text file, flat text file, flat file, man-readable file, ascii file, text file, notepad file (for Windows) ascii, simple text, teachtext (for Macintosh)
They have filenames that can end in the following, but a filename ending with these letters does not guarantee that the file is man-readable text.
*.txt *.lst *.doc *.man *.1 *.inf *.eps *.ps *.rtf *.htm *.asc *.stf *.ls *.lsr *.tex *.cfg *.pst *.mal *.rme *.mbx
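Since the filename ending proves nothing, the only sure test is to look at the bytes inside the file. The following is a minimal Python sketch of such a check, not a definitive test. The function name looks_like_text and the choice of which control characters to allow (tab, line feed, carriage return and Control-Z) are assumptions made for illustration.

```python
# Hypothetical helper: guess whether a file's bytes are plain ASCII text.
TEXT_CONTROLS = {0x09, 0x0A, 0x0D, 0x1A}  # tab, line feed, carriage return, Control-Z


def looks_like_text(data: bytes) -> bool:
    """Return True if every byte is a display character, a space,
    or one of the few control characters common in text files."""
    return all(0x20 <= b <= 0x7E or b in TEXT_CONTROLS for b in data)


print(looks_like_text(b"top of file\r\nbottom of file\r\n"))  # True
print(looks_like_text(b"\x00\x01\x02binary\xff"))             # False
```

A file named readme.txt that fails this check is probably not man-readable text, whatever its name says.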
ASCII stands for American Standard Code for Information Interchange. ASCII is pronounced "as-key".
The man readable portions are the same for all ASCII platforms and can be ported between all ASCII platforms with little or no conversion. There are other character sets, like IBM EBCDIC, and these other formats can be converted. The ASCII character set includes the characters you see on the keyboard:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789`~!@#$%^&*()_+-=[]{};':",.<>?/|\
These are the "displayed" characters. Special "non-display"
characters also exist, like "space" (a blank), "tab" and
the "End-Of-Line" or EOL. These characters are supposed to be
invisible to the reader; that is, they are in the class of "non-displayed"
characters. ASCII is a 7-bit code with 128 characters: 94 display characters
and 34 non-display characters (the space plus 33 control characters). A byte
can hold 256 possible values, but ASCII itself does not define the upper 128.
Some edit software will display these non-display characters as special
symbols, most commonly as a period. Other display types are "hex" or "octal"
for programmers.
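The period and "hex" displays described above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the space is shown as a period here because this paper counts it as a non-display character (most editors show it as a blank). The last line counts the display characters among the 128 code points, 0 through 127, that 7-bit ASCII defines.

```python
def is_display(byte: int) -> bool:
    """True for the ASCII display characters, '!' (33) through '~' (126)."""
    return 0x21 <= byte <= 0x7E


def dotted(data: bytes) -> str:
    """Show display characters as themselves and everything else as a period."""
    return "".join(chr(b) if is_display(b) else "." for b in data)


def hex_view(data: bytes) -> str:
    """Show every byte as two hexadecimal digits."""
    return " ".join(f"{b:02x}" for b in data)


sample = b"a b\tc\r\n"
print(dotted(sample))    # a.b.c..
print(hex_view(sample))  # 61 20 62 09 63 0d 0a
print(sum(is_display(b) for b in range(128)))  # 94
```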
The three platforms, DOS, Macintosh and Unix, each use a different end-of-line (EOL) character or characters to indicate the start of a new line. The EOL represents the two actions the computer should take in displaying the lines of a text file. Upon encountering the EOL, the computer should perform the following common typewriter functions: carriage return and line feed. These terms are commonly abbreviated as <cr> and <lf>, and these abbreviations are used from now on.
<cr> Carriage return (ASCII 13, hex 0D) moves the cursor, that is, the current display location for the next character, back to the beginning of the same line.
<lf> Line feed (ASCII 10, hex 0A) moves the display location to directly below the current position, or in other words, to the next line down.
DOS uses two characters at the end of the line: <cr><lf>, in that order.
Macintosh uses one character at the end of the line: <cr>
Unix uses one character at the end of the line: <lf>
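The differences are historical: each platform settled on its own convention, and all three go back to the separate carriage return and line feed actions of typewriters and teletype terminals.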
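Because the three conventions differ only in these one or two bytes, a program can usually tell which platform a text file came from by counting them. Here is a rough Python sketch under that assumption. The function names and the labels "dos", "mac" and "unix" are invented for this example, and a file with mixed line endings will simply report the most common style.

```python
def count_eols(data: bytes) -> dict:
    """Count each end-of-line style found in the raw bytes of a file."""
    crlf = data.count(b"\r\n")
    return {
        "dos": crlf,                       # <cr><lf> pairs
        "mac": data.count(b"\r") - crlf,   # <cr> not followed by <lf>
        "unix": data.count(b"\n") - crlf,  # <lf> not preceded by <cr>
    }


def guess_platform(data: bytes) -> str:
    """Name the most frequent style, or 'unknown' if there is no EOL at all."""
    counts = count_eols(data)
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unknown"


print(guess_platform(b"top of file\r\nbottom of file\r\n"))  # dos
print(guess_platform(b"top of file\rbottom of file\r"))      # mac
print(guess_platform(b"top of file\nbottom of file\n"))      # unix
```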
FIGURE 1. Simplified Platform Text File EOL Representations

DOS/WINDOWS 3.x, 95, NT
+----------------------+
|top of file   <cr><lf>|
|line of text  <cr><lf>|
|              <cr><lf>|
|              <cr><lf>|
|bottom of file<cr><lf>|
+----------------------+

Macintosh
+------------------+
|top of file   <cr>|
|line of text  <cr>|
|              <cr>|
|              <cr>|
|bottom of file<cr>|
+------------------+

Unix
+------------------+
|top of file   <lf>|
|line of text  <lf>|
|              <lf>|
|              <lf>|
|bottom of file<lf>|
+------------------+
The above figures use a "fixed length" line of text to increase clarity, that is, the end of each line and entire blank lines are "padded" with blanks. In actuality, to save storage space, these trailing padding blanks are not present in the stored file. For example:
top of file<cr><lf>
line of text<cr><lf>
<cr><lf>
<cr><lf>
bottom of file<cr><lf>
<^Z>
Note the presence of <^Z>, or "Control-Z", a non-displayed ASCII character used by old versions of DOS to indicate the end of the file.
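When such a file is moved to Macintosh or Unix, the Control-Z is just one more stray character at the end. A small Python sketch for removing it follows. The function name is an assumption, and it only removes a single Control-Z at the very end of the data, which is the case shown in the example above.

```python
CTRL_Z = b"\x1a"  # ASCII 26, the old DOS end-of-file marker


def strip_dos_eof(data: bytes) -> bytes:
    """Remove one trailing Control-Z, if the data ends with it."""
    return data[:-1] if data.endswith(CTRL_Z) else data


print(strip_dos_eof(b"bottom of file\r\n\x1a"))  # b'bottom of file\r\n'
```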
The presence of trailing blanks can increase the size of the file by factors of 2 or 3. Sometimes the display of the text file does appear to have them.
On some systems, like mainframes, the blanks can really be there. These are known as fixed length record files. No EOL characters are used.
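Reading a fixed length record file means cutting the data every so many bytes instead of looking for an EOL. The Python sketch below shows the idea, assuming the record length is already known (14 in the example, chosen only to fit the sample lines). The function name is made up for illustration.

```python
def split_fixed_records(data: bytes, record_length: int) -> list:
    """Cut the data into records of record_length bytes and drop padding blanks."""
    if record_length <= 0:
        raise ValueError("record_length must be positive")
    records = []
    for start in range(0, len(data), record_length):
        record = data[start:start + record_length]
        records.append(record.rstrip(b" "))
    return records


# Build a padded sample: every line is filled with blanks to 14 bytes.
lines = [b"top of file", b"line of text", b"", b"bottom of file"]
padded = b"".join(line.ljust(14) for line in lines)

print(len(padded))                      # 56
print(split_fixed_records(padded, 14))
```

Joining the resulting records with the EOL of the target platform turns the fixed length file into an ordinary text file.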
A DOS file will display correctly on both Macintosh and Unix except for the presence of an extra character that may or may not be displayed at the beginning or the end of each line. On Unix the extra <cr> is commonly displayed as "^M" or Control-M.
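The "^M" comes from caret notation, where a control character is shown as "^" followed by the letter 64 positions higher in the ASCII table (13 + 64 = 77, which is "M"). A minimal Python sketch of that display rule is given here. The function name is an assumption, and real Unix tools differ in the details of what they show.

```python
def caret_view(data: bytes) -> str:
    """Render control characters in caret notation, e.g. carriage return as ^M."""
    out = []
    for b in data:
        if b < 0x20:
            out.append("^" + chr(b + 0x40))  # 13 -> "^M", 26 -> "^Z"
        elif b == 0x7F:
            out.append("^?")
        else:
            out.append(chr(b))
    return "".join(out)


dos_line = b"line of text\r\n"
# A Unix viewer consumes the <lf> as its own EOL and shows what is left.
print(caret_view(dos_line.rstrip(b"\n")))  # line of text^M
```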
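A Macintosh ASCII file also carries extra information that the Macintosh operating system (OS) uses to tell the Finder what application made the file (e.g. SimpleText, BBEdit, TeachText). Depending on how the file is transferred, this information may show up on another platform as an extra "header" before the ASCII begins.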
There are many public domain programs that will convert between these formats. Many editors and word processors can also handle these EOL variations.
DOS
In DOS there is no way to convert without having a special program. Neither the EDIT nor the EDLIN command will allow one to change the EOL characters or insert them. Recommended is a shareware program, crlf.exe, found as crlf###.zip (where ### is the version number) at many internet software archive sites (e.g. oak.oakland.edu).
Windows
In Windows use the program "Write", "WordPad", or "Quick View" to properly display the text. Printing is done correctly, but no conversion is done. Also, see the Unix section below for "sed", of which DOS versions are available.
Macintosh
On Macintosh "Apple Exchange" is commonly used. There are many other Macintosh third party products for this function. Most are marketed as being able to read and write DOS diskettes and floppies. These programs do the conversions invisibly to you.
Unix
Unix comes with many commands and filters that will do the conversions
in any direction. Here are some "sed" and "tr" examples. Because sed works
one line at a time and a Macintosh file contains no <lf>, the conversions
that only swap or delete a single character are easier with "tr". The sed
commands assume GNU sed, which understands \r and \n as written:

For DOS to Unix: sed 's/\r$//' infilename >outfilename
For Mac to Unix: tr '\r' '\n' <infilename >outfilename
For Unix to DOS: sed 's/$/\r/' infilename >outfilename
For Unix to Mac: tr '\n' '\r' <infilename >outfilename
For DOS to Mac: tr -d '\n' <infilename >outfilename
For Mac to DOS: sed 's/\r/\r\n/g' infilename >outfilename

Not all Unix sed commands are the same, so the sed forms may not work on
your system. It is possible to prepare the files ahead of time for transfer.
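Where a suitable sed is not available, the same six conversions can be done by a short program. The following Python sketch is one way to do it, offered as an illustration and not as a replacement for the tools above. The names convert_eol, convert_file and the labels "dos", "mac" and "unix" are assumptions made for this example.

```python
EOLS = {"dos": b"\r\n", "mac": b"\r", "unix": b"\n"}


def convert_eol(data: bytes, target: str) -> bytes:
    """Rewrite every DOS, Macintosh or Unix EOL in data as the target style."""
    # Step 1: normalise to <lf>. The <cr><lf> pair must be handled first,
    # otherwise each DOS EOL would turn into two line ends.
    normalised = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    # Step 2: expand <lf> to the requested EOL.
    return normalised.replace(b"\n", EOLS[target])


def convert_file(in_path: str, out_path: str, target: str) -> None:
    """Read in_path as raw bytes and write the converted copy to out_path."""
    with open(in_path, "rb") as src:
        data = src.read()
    with open(out_path, "wb") as dst:
        dst.write(convert_eol(data, target))


print(convert_eol(b"top of file\r\nbottom of file\r\n", "unix"))
# b'top of file\nbottom of file\n'
```

The files are opened in binary mode on purpose, so that the program itself decides what happens to each <cr> and <lf>. Going through <lf> as a middle step means the source platform does not have to be named.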
Internet FTP
This section deals only with one FTP subcommand "binary." For more information see the share text file FTPBEST.STF for exact commands to enter to ease your use of FTP and ensure consistent and best results.
The "binary" mode, which the FTP standard calls "image" mode, ensures the
file is transferred without any conversion of the internal data. This lack
of conversion means the integrity of the internal data of the file is assured
and the file will perform as advertised.
Conversions are used for ASCII text files only and deal with the character
set and the End-Of-Line methods. Not all FTP software packages are the
same. Different release levels and vendors will have different commands,
more or fewer of them, and even some commands with the same spelling will
work differently. It is best to stay with a limited set of commands that
past experience has shown to work.
The "binary" command takes no arguments or other parameters (or words on the same line). In most FTP programs it is not a toggle: "binary" always selects binary mode, and the companion command "ascii" selects text mode. Like a light switch, only two values are possible, and only one of the two modes is active at a time.
So, ALWAYS read the output of the binary command. Most FTP programs start
in text (ASCII) mode, but some default to binary. Binary mode is on when,
after entering the command, you see a reply such as
FTP> binary
200 Type set to I.
FTP>
The "I" stands for "image", the FTP term for sending the file as an exact
copy of its bytes, not as text. The bytes of a binary file must not have
their values changed. Text, on the other hand, might be converted to other
character sets or EOL methods with no loss of information.
Since most FTP users transfer both text files and binary files, the preferred
setting is always binary mode. It is recommended that you get
in the habit of always using the binary command right after typing FTP
or entering FTP mode.
Having binary mode always on will ensure that none of the files you transfer have their internal data changed. Thus, all binary files will work as advertised, all text files will be intact, and at most the text files will need to be run through a simple conversion.
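The same habit carries over to programs that use FTP. As a sketch, Python's standard ftplib module has a retrbinary method that sets binary mode itself before fetching the file. In the example below the helper name fetch_binary, the host ftp.example.com and the file name readme.txt are all made up for illustration, and the session function is defined but not run.

```python
from ftplib import FTP


def fetch_binary(ftp, remote_name: str, local_path: str) -> None:
    """Download remote_name in binary (image) mode, byte for byte."""
    with open(local_path, "wb") as out:
        # retrbinary switches the transfer to binary mode itself
        # before sending the RETR command.
        ftp.retrbinary("RETR " + remote_name, out.write)


def example_session() -> None:
    # Hypothetical host and file name; substitute a real server.
    with FTP("ftp.example.com") as ftp:
        ftp.login()  # anonymous login
        fetch_binary(ftp, "readme.txt", "readme.txt")
```

A text file fetched this way arrives with the EOL characters of the sending platform, so it may still need one of the conversions described earlier.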
A binary file that has been converted cannot be converted back, because
binary data contains bytes that look like EOL characters but are not, and
once converted they cannot be told apart from the real ones. That is also
the reason why binary files that are converted will not work as advertised:
the conversion is applied to these look-alike bytes and changes the
"instructions" for the computer that are stored in the binary file.
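A tiny worked example makes the damage visible. The Python sketch below uses six made-up bytes standing in for a binary file, applies a DOS to Unix text conversion, and then tries to reverse it. The function names are invented for this illustration.

```python
def ascii_transfer_to_unix(data: bytes) -> bytes:
    """What a DOS-to-Unix text conversion does: <cr><lf> becomes <lf>."""
    return data.replace(b"\r\n", b"\n")


def reverse_to_dos(data: bytes) -> bytes:
    """The attempted reverse conversion: every <lf> becomes <cr><lf>."""
    return data.replace(b"\n", b"\r\n")


def hex_view(data: bytes) -> str:
    """Show every byte as two hexadecimal digits."""
    return " ".join(f"{b:02x}" for b in data)


# Made-up "binary" data: one <cr><lf> look-alike pair and one lone <lf> byte.
original = bytes([0x01, 0x0D, 0x0A, 0x02, 0x0A, 0x03])
damaged = ascii_transfer_to_unix(original)
restored = reverse_to_dos(damaged)

print(hex_view(original))    # 01 0d 0a 02 0a 03
print(hex_view(damaged))     # 01 0a 02 0a 03
print(hex_view(restored))    # 01 0d 0a 02 0d 0a 03
print(restored == original)  # False
```

After the first conversion the two 0a bytes look identical, so the reverse step has no way to know that only one of them used to have a 0d in front of it.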
Good Luck and cruisin'!
The information provided is provided without warranty of any kind, either expressed or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose.
HTML editing by Randy Clemens