Jaimon's Blog

An Excel Addin to work with Unicode CSV files

Excel treat .csv files as text files and will replace all Unicode characters with “?” when saved normally. If you want to preserve the Unicode characters, you would need to save the file as “Unicode text (*.txt)”, which is a Tab delimited file. Since I couldn’t find any existing solution to preserve Unicode characters in CSV format, I thought I’ll give it a go in creating an Excel Addin to do just that.

Installing the Addin

If you want to try this on your Excel 2007 or 2010, please download http://blog.jaimon.co.uk/UnicodeCSV/setupfiles.zip and extract the files to a folder before running setup.exe.


****Update****

Some of the readers have experienced issues with this installation file on some machines. For a successful installation,

  • Please make sure to install this as Local admin user
  • If setup.exe doesn’t work for some reason try running the UnicodeCSVAddin.vsto file
(Many thanks to our reader Sean for pointing this out)

Once the Addin is installed, it will look out for any CSV file being opened. If you’re opening an existing Unicode CSV file, this Addin will override the current saving method (either by pressing Ctrl-S, or clicking on any of the Save buttons/options from the Ribbon or menu) and will always save as a Unicode file.

But if you are creating a new file and want to save that as a Unicode csv file, then choose the Save or SaveAs option from the newly created Ribbon item “Unicode CSV”.

When clicking on Save from “Unicode CSV” Ribbon, it will make sure the file extension is .csv, if not it will prompt you to select another file name. Once a file is saved this way, then you don’t need to use the “Unicode CSV” Ribbon item any more, just save as you do normally.

How does it work?

As Excel can handle Unicode text (*.txt) files perfectly well, I’m relying on that to make this Addin work. On save, this Addin will be calling Excel’s SaveAs method to create a temporary Unicode Text file. It will then overwrite the original file by replacing all Tabs with comma and reload it. As it is a CSV file, Excel doesn’t preserve any column width adjustments or last active cell. Currently this Addin will reset the last active cell, and I think it should be pretty straightforward to reset column widths as well, if you wish to do so.

The code behind (for the geeks!)

This was done as an Excel 2007 application level Addin project in Visual Studio 2010 targeting .Net framework 4.0, so it can be installed on both 2007 & 2010 versions of MS Office. Feel free to fork the project from https://github.com/jaimonmathew/Unicode-CSV-Addin as a Visual Studio 2010 solution, if you want to make any modification or create an MSI installer etc. (Please read the license and notice files before using the code)

–ThisAddin.cs

using System;
using Excel = Microsoft.Office.Interop.Excel;
using System.Windows.Forms;
using Office = Microsoft.Office.Core;
using System.Runtime.InteropServices;
using System.Text;
using System.Collections.Generic;
using System.IO;

namespace UnicodeCSVAddin
{
    public partial class ThisAddIn
    {
        private Excel.Application app;
        private List<string> unicodeFiles; //a list of opened Unicode CSV files. We populate this list on WorkBookOpen event to avoid checking for CSV files on every Save event.
        private bool sFlag = false;

        //Unicode file byte order marks.
        private const string UTF_16BE_BOM = "FEFF";
        private const string UTF_16LE_BOM = "FFFE";
        private const string UTF_8_BOM = "EFBBBF";

        private void ThisAddIn_Startup(object sender, System.EventArgs e)
        {
            app = this.Application;
            unicodeFiles = new List<string>();
            app.WorkbookOpen += new Excel.AppEvents_WorkbookOpenEventHandler(app_WorkbookOpen);
            app.WorkbookBeforeClose += new Excel.AppEvents_WorkbookBeforeCloseEventHandler(app_WorkbookBeforeClose);
            app.WorkbookBeforeSave += new Excel.AppEvents_WorkbookBeforeSaveEventHandler(app_WorkbookBeforeSave);
        }

        void ThisAddIn_Shutdown(object sender, EventArgs e)
        {
            app = null;
            unicodeFiles = null;
        }

        void app_WorkbookBeforeSave(Excel.Workbook Wb, bool SaveAsUI, ref bool Cancel)
        {
            //Override Save behaviour for Unicode CSV files.
            if (!SaveAsUI && !sFlag && unicodeFiles.Contains(Wb.FullName))
            {
                Cancel = true;
                SaveAsUnicodeCSV(false, false);
            }
            sFlag = false;
        }

        //This is required to show our custom Ribbon
        protected override Office.IRibbonExtensibility CreateRibbonExtensibilityObject()
        {
            return new Ribbon1();
        }

        void app_WorkbookBeforeClose(Excel.Workbook Wb, ref bool Cancel)
        {
            unicodeFiles.Remove(Wb.FullName);
            app.StatusBar = "Ready";
        }

        void app_WorkbookOpen(Excel.Workbook Wb)
        {
            //Check to see if the opened document is a Unicode CSV files, so we can override Excel's Save method
            if (Wb.FullName.ToLower().EndsWith(".csv") && isFileUnicode(Wb.FullName))
            {
                if (!unicodeFiles.Contains(Wb.FullName))
                {
                    unicodeFiles.Add(Wb.FullName);
                }
                app.StatusBar = Wb.Name + " has been opened as a Unicode CSV file";
            }
            else
            {
                app.StatusBar = "Ready";
            }
        }

        /// <summary>
        /// This method check whether Excel is in Cell Editing mode or not
        /// There are few ways to check this (eg. check to see if a standard menu item is disabled etc.)
        /// I know in cell editing mode app.DisplayAlerts throws an Exception, so here I'm relying on that behaviour
        /// </summary>
        /// <returns>true if Excel is in cell editing mode</returns>
        private bool isInCellEditingMode()
        {
            bool flag = false;
            try
            {
                app.DisplayAlerts = false; //This will throw an Exception if Excel is in Cell Editing Mode
            }
            catch (Exception)
            {
                flag = true;
            }
            return flag;
        }
        /// <summary>
        /// This will create a temporary file in Unicode text (*.txt) format, overwrite the current loaded file by replaing all tabs with a comma and reload the file.
        /// </summary>
        /// <param name="force">To force save the current file as a Unicode CSV.
        /// When called from the Ribbon items Save/SaveAs, <i>force</i> will be true
        /// If this parameter is true and the file name extention is not .csv, then a SaveAs dialog will be displayed to choose a .csv file</param>
        /// <param name="newFile">To show a SaveAs dialog box to select a new file name
        /// This will be set to true when called from the Ribbon item SaveAs</param>
        public void SaveAsUnicodeCSV(bool force, bool newFile)
        {
            app.StatusBar = "";
            bool currDispAlert = app.DisplayAlerts;
            bool flag = true;
            int i;
            string filename = app.ActiveWorkbook.FullName;

            if (force) //then make sure a csv file name is selected.
            {
                if (newFile || !filename.ToLower().EndsWith(".csv"))
                {
                    Office.FileDialog d = app.get_FileDialog(Office.MsoFileDialogType.msoFileDialogSaveAs);
                    i = app.ActiveWorkbook.Name.LastIndexOf(".");
                    if (i >= 0)
                    {
                        d.InitialFileName = app.ActiveWorkbook.Name.Substring(0, i);
                    }
                    else
                    {
                        d.InitialFileName = app.ActiveWorkbook.Name;
                    }
                    d.AllowMultiSelect = false;
                    Office.FileDialogFilters f = d.Filters;
                    for (i = 1; i <= f.Count; i++)
                    {
                        if ("*.csv".Equals(f.Item(i).Extensions))
                        {
                            d.FilterIndex = i;
                            break;
                        }
                    }
                    if (d.Show() == 0) //User cancelled the dialog
                    {
                        flag = false;
                    }
                    else
                    {
                        filename = d.SelectedItems.Item(1);
                    }
                }
                if (flag && !filename.ToLower().EndsWith(".csv"))
                {
                    MessageBox.Show("Please select a CSV file name first");
                    flag = false;
                }
            }

            if (flag && filename.ToLower().EndsWith(".csv") && (force || unicodeFiles.Contains(filename)))
            {
                if (isInCellEditingMode())
                {
                    MessageBox.Show("Please finish editing before saving");
                }
                else
                {
                    try
                    {
                        //Getting current selection to restore the current cell selection
                        Excel.Range rng = (Excel.Range)app.ActiveCell;
                        int row = rng.Row;
                        int col = rng.Column;

                        string tempFile = System.IO.Path.GetTempFileName();

                        try
                        {
                            sFlag = true; //This is to prevent this method getting called again from app_WorkbookBeforeSave event caused by the next SaveAs call
                            app.ActiveWorkbook.SaveAs(tempFile, Excel.XlFileFormat.xlUnicodeText);
                            app.ActiveWorkbook.Close();

                            if (new FileInfo(tempFile).Length <= (1024 * 1024)) //If its less than 1MB, load the whole data to memory for character replacement
                            {
                                File.WriteAllText(filename, File.ReadAllText(tempFile, UnicodeEncoding.UTF8).Replace("\t", ","), UnicodeEncoding.UTF8);
                            }
                            else //otherwise read chunks for data (in 10KB chunks) into memory
                            {
                                using (StreamReader sr = new StreamReader(tempFile, true))
                                using (StreamWriter sw = new StreamWriter(filename, false, sr.CurrentEncoding))
                                {
                                    char[] buffer = new char[10 * 1024]; //10KB Chunks
                                    while (!sr.EndOfStream)
                                    {
                                        int cnt = sr.ReadBlock(buffer, 0, buffer.Length);
                                        for (i = 0; i < cnt; i++)
                                        {
                                            if (buffer[i] == '\t')
                                            {
                                                buffer[i] = ',';
                                            }
                                        }
                                        sw.Write(buffer, 0, cnt);
                                    }
                                }
                            }
                        }
                        finally
                        {
                            File.Delete(tempFile);
                        }

                        app.Workbooks.Open(filename, Type.Missing, Type.Missing, Excel.XlFileFormat.xlCSV, Type.Missing, Type.Missing, Type.Missing, Type.Missing, ",");
                        Excel.Worksheet ws = app.ActiveWorkbook.ActiveSheet;
                        ws.Cells[row, col].Select();
                        app.StatusBar = "File has been saved as a Unicode CSV";
                        if (!unicodeFiles.Contains(filename))
                        {
                            unicodeFiles.Add(filename);
                        }
                        app.ActiveWorkbook.Saved = true;
                    }
                    catch (Exception e)
                    {
                        MessageBox.Show("Error occured while trying to save this file as Unicode CSV: " + e.Message);
                    }
                    finally
                    {
                        app.DisplayAlerts = currDispAlert;
                    }
                }
            }
        }

        /// <summary>
        /// This method will try and read the first few bytes to see if it contains a Unicode BOM
        /// </summary>
        /// <param name="filename">File to check for including full path</param>
        /// <returns>true if its a Unicode file</returns>
        private bool isFileUnicode(string filename)
        {
            bool ret = false;
            try
            {
                byte[] buff = new byte[3];
                using (FileStream fs = new FileStream(filename, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
                {
                    fs.Read(buff, 0, 3);
                }

                string hx = "";
                foreach (byte letter in buff)
                {
                    hx += string.Format("{0:X2}", Convert.ToInt32(letter));
                    //Checking to see the first bytes matches with any of the defined Unicode BOM
                    //We only check for UTF8 and UTF16 here.
                    ret = UTF_16BE_BOM.Equals(hx) || UTF_16LE_BOM.Equals(hx) || UTF_8_BOM.Equals(hx);
                    if (ret)
                    {
                        break;
                    }
                }
            }
            catch (IOException)
            {
                //ignore any exception
            }
            return ret;
        }

        #region VSTO generated code

        /// <summary>
        /// Required method for Designer support - do not modify
        /// the contents of this method with the code editor.
        /// </summary>
        private void InternalStartup()
        {
            this.Startup += new System.EventHandler(ThisAddIn_Startup);
            this.Shutdown += new System.EventHandler(ThisAddIn_Shutdown);
        }

        #endregion
    }
}

The crux of the application is in ThisAddin.cs file, and I’ve put some comments wherever I can in the code to explain the functionality. But here is a brief info on how the code works. (As it was my first attempt at using VSTO (Visual Studio Tools for Office) and the fact that I’m a Java/Oracle developer for my day job, I’m sure the .net experts here might find a better way of coding in many places).

On the ThisAddin_Startup method I’m adding three event handlers,

  1. Application_WorkbookOpen – to check for whether the new file being opened is a Unicode CSV file. The code looks at the BOM (Byte Order Mark) bytes to figure out whether it’s a Unicode file or not. At the moment it will check for UTF8, UTF16-LE and UTF16-BE preambles. If it’s a Unicode CSV file, then it will add the file name to a global list
  2. Application_WorkbookBeforeClose – to remove file name from the global Unicode CSV file list
  3. Application_WorkbookBeforeSave – This will call the method SaveAsUnicodeCSV to override Excel’s default Save behaviour if the filename is in the global Unicode CSV file list. Once some basic checks are passed, SaveAsUnicodeCSV method will save the current file as a Unicode Text file to a temporary file, and close it. We now need to replace all the Tabs in that file with a comma. If the file is small enough (<= 1MB), I’m using File.ReadAllText and File.WriteAllText to read and write in one go. Otherwise I’m using a StreamReader and StreamWriter to read/write data in 10KB chunks. (BTW, these are numbers I plucked out of thin air, so feel free to change this to suite your use case). Once the new file has been created, we will then open the new csv into current workbook.

Ribbon1.xml has the entries to create new Ribbon items.

Jaimon

Advertisements

August 23, 2011 Posted by | Uncategorized | , , , , , | 105 Comments