Technical

Fast xlsx parsing with Python

featured 21

So you have tried all the tools and libraries, but extracting data from
giant Excel sheets is still too slow ?

On the menu for this blog post we have:

  • Quick dive into the structure of XLSX files
  • The lxml library and the XSLT transformation
  • How to use XSLT to parse XLSX sheets
  • Full listing of the code

While the method presented here-under is fast, it cuts a lot of
corners, it is only meant to extract raw data.

XLSX file structure and the performance implications

Basically an XLSX file is a zip archive of a bunch of xml files. This is how a basic sheet looks in Excel:

And this is the content of the archive (seen inside emacs) corresponding
to the above screenshot:

  Filemode      Length  Date         Time      File
- ----------  --------  -----------  --------  -----------------------------------
  -rw-rw-rw-      1811   1-Jan-1980  00:00:00  [Content_Types].xml
  -rw-rw-rw-       588   1-Jan-1980  00:00:00  _rels/.rels
  -rw-rw-rw-      1266   1-Jan-1980  00:00:00  xl/_rels/workbook.xml.rels
  -rw-rw-rw-      1883   1-Jan-1980  00:00:00  xl/workbook.xml
  -rw-rw-rw-       199   1-Jan-1980  00:00:00  xl/sharedStrings.xml
  -rw-rw-rw-       304   1-Jan-1980  00:00:00  xl/worksheets/_rels/sheet1.xml.rels
  -rw-rw-rw-      8390   1-Jan-1980  00:00:00  xl/theme/theme1.xml
  -rw-rw-rw-      1618   1-Jan-1980  00:00:00  xl/styles.xml
  -rw-rw-rw-      1464   1-Jan-1980  00:00:00  xl/worksheets/sheet1.xml
  -rw-rw-rw-       235   1-Jan-1980  00:00:00  customXml/itemProps2.xml
  -rw-rw-rw-        72   1-Jan-1980  00:00:00  xl/customProperty1.bin
  -rw-rw-rw-       296   1-Jan-1980  00:00:00  customXml/_rels/item1.xml.rels
  -rw-rw-rw-       296   1-Jan-1980  00:00:00  customXml/_rels/item2.xml.rels
  -rw-rw-rw-       296   1-Jan-1980  00:00:00  customXml/_rels/item3.xml.rels
  -rw-rw-rw-       296   1-Jan-1980  00:00:00  customXml/_rels/item4.xml.rels
  -rw-rw-rw-        41   1-Jan-1980  00:00:00  customXml/item1.xml
  -rw-rw-rw-       235   1-Jan-1980  00:00:00  customXml/itemProps1.xml
  -rw-rw-rw-       991   1-Jan-1980  00:00:00  docProps/app.xml
  -rw-rw-rw-        49   1-Jan-1980  00:00:00  customXml/item3.xml
  -rw-rw-rw-       235   1-Jan-1980  00:00:00  customXml/itemProps3.xml
  -rw-rw-rw-        49   1-Jan-1980  00:00:00  customXml/item4.xml
  -rw-rw-rw-       235   1-Jan-1980  00:00:00  customXml/itemProps4.xml
  -rw-rw-rw-       623   1-Jan-1980  00:00:00  docProps/core.xml
  -rw-rw-rw-        41   1-Jan-1980  00:00:00  customXml/item2.xml
- ----------  --------  -----------  --------  -----------------------------------
                 21513                         24 files

Workbook

The usual entry point is the xl/workbook.xml file which contains a
mapping between the sheet names and the corresponding file. In our
minimal example the exercise is pointless since we only have one sheet,
but let’s see have a look anyway:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" ... >

  ...

  <sheets>
	<sheet name="Sheet1" sheetId="1" r:id="rId1"/>
  </sheets>

  ...

</workbook>

For the sake of brevity, most of the content has been replaced by dots.
So to extract this information we use XPath, but because this file
is usually rather small any other method would fit the bill.

ns = {
    'ns': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main',
}
fh = zipfile.ZipFile(file_path)
name = 'xl/workbook.xml'
root = etree.parse(fh.open(name))
workbook = {}
for el in etree.XPath("//ns:sheet", namespaces=ns)(root):
    workbook[el.attrib['name']] = el.attrib['sheetId']

So after running the above code, workbook is a dictionary containing
a mapping between a sheet name and the corresponding id. In our case:
{'Sheet1': '1'}.

Shared strings

Similarly, we have to also extract data out of
xl/sharedStrings.xml. It contains all the strings that will appear
in the sheets. So when a cell contains a string, Excel actually store
an id that refers to the position of the corresponding string in the
sharedStrings.xml file. This is the full content for our example:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="2" uniqueCount="2">
<si><t>Ham</t></si>
<si><t>Spam</t></si>
</sst>

So parsing this xml is also done with XPath:

name = 'xl/sharedStrings.xml'
root = etree.parse(fh.open(name))
res = etree.XPath("/ns:sst/ns:si/ns:t", namespaces=ns)(root)
shared = {
    str(pos): el.text
    for pos, el in enumerate(res)
}

This shared string structure is one of the indirections (but not the
only one) you have to solve when parsing XLSX files. There is no easy
way to extract information as-is, hence the raw representation has to
be post-processed, that is why Python suffers when parsing XLSX.

La pièce de résistance

Now that we have collected the workbook and shared string data, we can
move our attention to the sheets themselves. This is where performance
matters because those are the parts that growth when sheets content
gets larger.

For our example, xl/worksheets/sheet1.xml looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" ... >
  <dimension ref="A1:B4"/>
  <sheetViews>
	<sheetView tabSelected="1" workbookViewId="0">
	  <selection activeCell="C4" sqref="C4"/>
	</sheetView>
  </sheetViews>
  <sheetFormatPr defaultRowHeight="15" x14ac:dyDescent="0.25"/>
  <sheetData>
	<row r="1" spans="1:2" x14ac:dyDescent="0.25">
	  <c r="A1" t="s">
		<v>0</v>
	  </c>
	  <c r="B1" t="s">
		<v>1</v>
	  </c>
	</row>
	<row r="2" spans="1:2" x14ac:dyDescent="0.25">
	  <c r="A2">
		<v>1</v>
	  </c>
	  <c r="B2">
		<v>2</v>
	  </c>
	</row>
	<row r="3" spans="1:2" x14ac:dyDescent="0.25">
	  <c r="A3">
		<v>3</v>
	  </c>
	  <c r="B3">
		<v>4</v>
	  </c>
	</row>
	<row r="4" spans="1:2" x14ac:dyDescent="0.25">
	  <c r="A4">
		<v>5</v>
	  </c>
	  <c r="B4">
		<v>6</v>
	  </c>
	</row>
  </sheetData>
</worksheet>

To be able to extract sheet content we have to collect cells values,
their types and their positions.

  • A cell is of type string when there is t="s" on the cell element (c).
  • The cell position is given by the r attribute (ex: r="B3"). We
    have to rely on this attribute because Excel sometimes likes to
    shuffle thoses.
  • The cell value is in the v element (which can be missing if the
    actual cell is empty)

So you can see that cell A1 contains the shared string 0 which is
“Ham”. B2 is 1 and “Spam”.

Parsing this info is not doable with XPath, even in several passes (as
we would loose relations between different bits of info). Another way
to do that would be to use a DOM parser and then walk across the
structure to gather what we want, it works but it tend to be slow and
memory heavy on large files.

The ugly duck saves the day

XSLT is an XML to transform XML into XML (and it’s Turing-complete!), so
nobody wants to deal with it. But:

  • It’s available in lxml and it’s fast, at least faster than a
    pure-python equivalent.
  • It can spit out any text content (not only XML).

The strategy here is to use XSLT to generate a CSV (it’s fast) and
parse it again with pandas’ read_csv (it’s fast too). As long as
pandas does not provide a read_xml method, it’s the best way to plug
those two together.

This is how it looks:

sheet_xslt = etree.XML('''
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:sp="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
        >
        <xsl:output method="text"/>
        <xsl:template match="sp:row">
           <xsl:for-each select="sp:c">
            <xsl:value-of select="parent::*/@r"/> <!-- ROW -->
            <xsl:text>,</xsl:text>
            <xsl:value-of select="@r"/> <!--CELL-->
            <xsl:text>,</xsl:text>
            <xsl:value-of select="@t"/> <!-- TYPE -->
            <xsl:text>,</xsl:text>
            <xsl:value-of select="sp:v/text()"/> <!-- VALUE -->
           <xsl:text>\n</xsl:text>
           </xsl:for-each>
        </xsl:template>
    </xsl:stylesheet>
''')

It’s ugly and it took a lot of trial and error to get right. But
the most difficult part was to figure out which combination of
namespaces would appease the gods of XML.

What happens here:

  • The line <xsl:output method="text"/> ask the transformer to output
    us a simple test file.
  • <xsl:template match="sp:row"> will select all row elements.
  • <xsl:for-each select="sp:c"> ... </xsl:for-each> is a loop on c
    elements.
  • <xsl:text>something</xsl:text> will output something
  • <xsl:value-of select="@t"/>extract the t attribute of the
    current c tag.

The result of this transformation is a csv that contains one line per
excel cell and whose columns are ‘row’, ‘cell’, ‘type’ and ‘value’.

Once we have the correct XSLT, it’s straightforward to plug the output
of the transformation to read_csv:

sheet_name = 'Sheet1'
sheet_id = workbook[sheet_name]
sheet_path = 'xl/worksheets/sheet%s.xml' % sheet_id
root = etree.parse(fh.open(sheet_path))

transform = etree.XSLT(sheet_xslt)
result = transform(root)
df = read_csv(BuffIO(str(result)),
    header=None, dtype=str,
    names=['row', 'cell', 'type', 'value'],
)

So at this point in the code, you have a DataFrame which reflects the
xml content of the sheet, something like this:

(Pdb) print(df)
  row cell type value
0   1   A1    s     0
1   1   B1    s     1
2   2   A2  NaN     1
3   2   B2  NaN     2
4   3   A3  NaN     3
5   3   B3  NaN     4
6   4   A4  NaN     5
7   4   B4  NaN     6

The next steps are:

# First row numbers are sometimes filled with nan
df['row'] = to_numeric(df['row'].fillna(0))

# Translate string contents
cond = (df.type == 's') & (~df.value.isnull())
df.loc[cond, 'value'] = df[cond]['value'].map(shared)
# Add column label ('AA99' becomes 'AA')
df['col'] = df.cell.str.replace(r'[0-9]+', '')
df = df.sort_values(by='row')

# Pivot everything
df = df.pivot(
    index='row', columns='col', values='value'
).reset_index(drop=True).reset_index(drop=True)
df.columns.name = None  # pivot adds a name to the "columns" array
# Sort columns (pivot will put AA before B)
cols = sorted(df.columns, key=lambda x: (len(x), x))
df = df[cols]
df = df.dropna(how='all')  # Ignore empty lines
df = df.dropna(how='all', axis=1)  # Ignore empty cols

As you can see, there is no for-loop in Python, every low-level
operation is done through vectorized pandas method. This is the
resulting DataFrame (yay!):

     A     B
0  Ham  Spam
1    1     2
2    3     4
3    5     6

Benchmark & Closing words

The XSLT method is twice as fast as pandas’ read_excel. The
benchmarks were run on a large sheet of 537 lines and 341 columns:

$ python fast_xlsx_parsing.py
xslt: 3.311
pandas: 6.248

Incidentally, googling for "xslt" "pandas" "csv" returns this
unanswered StackOverflow question with the same approach,
as usual it’s easier to find a solution when you already know the
answer. So I’m not the first to combine XSLT and read_csv, but I
probably am in the context of XLSX file parsing.

Full code

import io
import zipfile

from lxml import etree
from pandas import read_csv, to_numeric


class XLSX:

    sheet_xslt = etree.XML('''
        <xsl:stylesheet version="1.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            xmlns:sp="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
            >
            <xsl:output method="text"/>
            <xsl:template match="sp:row">
               <xsl:for-each select="sp:c">
                <xsl:value-of select="parent::*/@r"/> <!-- ROW -->
                <xsl:text>,</xsl:text>
                <xsl:value-of select="@r"/> <!--REMOVEME-->
                <xsl:text>,</xsl:text>
                <xsl:value-of select="@t"/> <!-- TYPE -->
                <xsl:text>,</xsl:text>
                <xsl:value-of select="sp:v/text()"/> <!-- VALUE -->
               <xsl:text>\n</xsl:text>
               </xsl:for-each>
            </xsl:template>
        </xsl:stylesheet>
    ''')

    def __init__(self, file_path):
        self.ns = {
            'ns': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main',
        }
        self.fh = zipfile.ZipFile(file_path)
        self.shared = self.load_shared()
        self.workbook = self.load_workbook()

    def load_workbook(self):
        # Load workbook
        name = 'xl/workbook.xml'
        root = etree.parse(self.fh.open(name))
        res = {}
        for el in etree.XPath("//ns:sheet", namespaces=self.ns)(root):
            res[el.attrib['name']] = el.attrib['sheetId']
        return res

    def load_shared(self):
        # Load shared strings
        name = 'xl/sharedStrings.xml'
        root = etree.parse(self.fh.open(name))
        res = etree.XPath("/ns:sst/ns:si/ns:t", namespaces=self.ns)(root)
        return {
            str(pos): el.text
            for pos, el in enumerate(res)
        }

    def _parse_sheet(self, root):
        transform = etree.XSLT(self.sheet_xslt)
        result = transform(root)
        df = read_csv(io.StringIO(str(result)),
                      header=None, dtype=str,
                      names=['row', 'cell', 'type', 'value'],
        )
        return df

    def read(self, sheet_name):
        sheet_id = self.workbook[sheet_name]
        sheet_path = 'xl/worksheets/sheet%s.xml' % sheet_id
        root = etree.parse(self.fh.open(sheet_path))
        df = self._parse_sheet(root)

        # First row numbers are filled with nan
        df['row'] = to_numeric(df['row'].fillna(0))

        # Translate string contents
        cond = (df.type == 's') & (~df.value.isnull())
        df.loc[cond, 'value'] = df[cond]['value'].map(self.shared)
        # Add column number and sort rows
        df['col'] = df.cell.str.replace(r'[0-9]+', '')
        df = df.sort_values(by='row')

        # Pivot everything
        df = df.pivot(
            index='row', columns='col', values='value'
        ).reset_index(drop=True).reset_index(drop=True)
        df.columns.name = None  # pivot adds a name to the "columns" array
        # Sort columns (pivot will put AA before B)
        cols = sorted(df.columns, key=lambda x: (len(x), x))
        df = df[cols]
        df = df.dropna(how='all')  # Ignore empty lines
        df = df.dropna(how='all', axis=1)  # Ignore empty cols
        return df

if __name__ == '__main__':
    xlsx = XLSX('example.xlsx')
    xlsx.read('Sheet1')