Wednesday, December 14, 2016

Conversion between UTF-16, UTF-8 encoded files on Linux

1. Introduction

The encoding used by Windows for Unicode text is UTF-16, to be specific, UTF-16LE (Little Endian). Linux uses UTF-8 to encode Unicode. A Unicode-encoded file can optionally begin with a Byte Order Mark (BOM), a special magic number at the start of the file. Per the Unicode standard the BOM is optional for both UTF-8 and UTF-16, but for UTF-16 it is the conventional way to signal the byte order, so Windows applications write one and expect one. Linux tools, using UTF-8, normally do not write a BOM.

So in summary, Windows uses UTF-16LE with a BOM and CRLF line endings, while Linux uses UTF-8 without a BOM and LF line endings.
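If a Windows-made sample is not at hand, the same bytes can be recreated on Linux with printf (a sketch assuming bash's printf, which understands \xHH escapes; the bytes match the sample file used throughout this post):

```shell
# BOM (ff fe), six UTF-16LE code units, then CRLF encoded as 0d 00 0a 00
printf '\xff\xfe\x24\x0c\x46\x0c\x32\x0c\x41\x0c\x17\x0c\x41\x0c\x0d\x00\x0a\x00' > Unicode_Windows.txt
```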

To check which encoding a file uses, we can use the file command on Linux.

$ file Unicode_Windows.txt 
Unicode_Windows.txt: Little-endian UTF-16 Unicode text, with CR line terminators

We can see more details using hexdump:

$ hexdump -C Unicode_Windows.txt 
00000000  ff fe 24 0c 46 0c 32 0c  41 0c 17 0c 41 0c 0d 00  |..$.F.2.A...A...|
00000010  0a 00                                             |..|
00000012
ff fe is the BOM for UTF-16LE, and the end-of-line sequence appears as 0d 00 (Carriage Return, CR) followed by 0a 00 (Line Feed, LF).
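Since the BOM always occupies the first bytes, we can also test for it directly without reading the whole dump (a small sketch using standard tools):

```shell
# Print the first two bytes as hex; ff fe means a UTF-16LE BOM
if [ "$(head -c 2 Unicode_Windows.txt | od -An -tx1 | tr -d ' ')" = "fffe" ]; then
    echo "UTF-16LE BOM present"
fi
```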

2. Converting from UTF-16 to UTF-8

The file above was created on Windows. There are multiple ways to convert it to the Linux conventions.

2a. Using iconv

$ iconv -f UTF-16LE -t UTF-8 Unicode_Windows.txt > Unicode_Linux1.txt
Let us check this file.
$ file Unicode_Linux1.txt 
Unicode_Linux1.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators

$ hexdump -C Unicode_Linux1.txt 
00000000  ef bb bf e0 b0 a4 e0 b1  86 e0 b0 b2 e0 b1 81 e0  |................|
00000010  b0 97 e0 b1 81 0d 0a                              |.......|
00000017
This converts the text to UTF-8, but keeps the BOM at the beginning of the file (ef bb bf), and the line still ends with CR (0d) and LF (0a). To convert to UTF-8 without the BOM and CR, here is the command:
$ iconv -f UTF-16LE -t UTF-8 Unicode_Windows.txt | sed '1s/^.//' | sed 's/\r$//' > Unicode_Linux1.txt
(In a UTF-8 locale, sed sees the BOM as a single character, so 1s/^.// removes it from the first line.)
We can verify it using the commands below.
$ file Unicode_Linux1.txt 
Unicode_Linux1.txt: UTF-8 Unicode text

$ hexdump -C Unicode_Linux1.txt 
00000000  e0 b0 a4 e0 b1 86 e0 b0  b2 e0 b1 81 e0 b0 97 e0  |................|
00000010  b1 81 0a                                          |...|
00000013
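As a side note, if we give iconv -f UTF-16 (no LE/BE suffix), glibc iconv reads the BOM to determine the byte order and drops it from the output, so only the CR characters are left to strip. A sketch assuming glibc iconv; other iconv implementations may behave differently:

```shell
# -f UTF-16 makes iconv consume the BOM itself; sed then strips the CRs
iconv -f UTF-16 -t UTF-8 Unicode_Windows.txt | sed 's/\r$//' > Unicode_Linux1.txt
```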

2b. Using dos2unix command

We can also use the dos2unix command, which converts the file from UTF-16LE to UTF-8 and also removes the BOM and CR characters. Here is an example:
$ dos2unix -n Unicode_Windows.txt Unicode_Linux2.txt 
dos2unix: converting file Unicode_Windows.txt to file Unicode_Linux2.txt in Unix format ...

$ file Unicode_Linux2.txt 
Unicode_Linux2.txt: UTF-8 Unicode text

$ hexdump -C Unicode_Linux2.txt 
00000000  e0 b0 a4 e0 b1 86 e0 b0  b2 e0 b1 81 e0 b0 97 e0  |................|
00000010  b1 81 0a                                          |...|
00000013

3. Converting from UTF-8 to UTF-16

Now, to convert files from UTF-8 to UTF-16LE on Linux, there is no single command that does the whole job. unix2dos keeps the file in UTF-8 and only adds the CR characters. Also, unix2dos does not add a BOM by default, so we have to force one with the -m option.
$ unix2dos -m -n Unicode_Linux1.txt Unicode_Windows1.txt 
unix2dos: converting file Unicode_Linux1.txt to file Unicode_Windows1.txt in DOS format ...

$ file Unicode_Windows1.txt 
Unicode_Windows1.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators

$ hexdump -C Unicode_Windows1.txt 
00000000  ef bb bf e0 b0 a4 e0 b1  86 e0 b0 b2 e0 b1 81 e0  |................|
00000010  b0 97 e0 b1 81 0d 0a                              |.......|
00000017
To convert it to UTF-16LE, we have to use the iconv command after unix2dos.
$ iconv -f UTF-8 -t UTF-16LE Unicode_Windows1.txt > Unicode_Windows2.txt

$ file Unicode_Windows2.txt 
Unicode_Windows2.txt: Little-endian UTF-16 Unicode text, with CR line terminators

$ hexdump -C Unicode_Windows2.txt
00000000  ff fe 24 0c 46 0c 32 0c  41 0c 17 0c 41 0c 0d 00  |..$.F.2.A...A...|
00000010  0a 00                                             |..|
00000012
Instead of using unix2dos, we can use sed directly to add the BOM and CR characters, then convert to UTF-16LE (the \xHH escapes here are interpreted by GNU sed):
$ sed '1s/^/\xef\xbb\xbf/' Unicode_Linux1.txt | sed 's/$/\r/' | iconv -f UTF-8 -t UTF-16LE > Unicode_Windows3.txt
 
$ file Unicode_Windows3.txt 
Unicode_Windows3.txt: Little-endian UTF-16 Unicode text, with CR line terminators

$ hexdump -C Unicode_Windows3.txt 
00000000  ff fe 24 0c 46 0c 32 0c  41 0c 17 0c 41 0c 0d 00  |..$.F.2.A...A...|
00000010  0a 00                                             |..|
00000012
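Alternatively, on a little-endian machine we can skip the intermediate UTF-8 BOM entirely: when the target encoding is plain UTF-16 (no LE/BE suffix), glibc iconv prepends the BOM itself, in the host byte order. A sketch, assuming glibc iconv and a little-endian host (Unicode_Windows4.txt is just an illustrative name):

```shell
# Add CRs with sed; iconv -t UTF-16 emits the ff fe BOM automatically
sed 's/$/\r/' Unicode_Linux1.txt | iconv -f UTF-8 -t UTF-16 > Unicode_Windows4.txt
```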