Thursday, August 11, 2011

Open XML SDK and Excel Pivot Tables - Pulling Data

As you may already know, the new Office document formats are pretty much "ZIP" files with XML files inside. These XML documents are formatted according to the Open XML spec, which Microsoft has released version 2.0 of their SDK for interacting with using strongly typed classes in .net.

If you are interested in downloading the SDK it is here:
http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=5124
The documentation for the SDK is here (thought it is rather sparse):
http://msdn.microsoft.com/en-us/library/bb448854.aspx

Now, if all you want to do is read data out of particular cells, the best article I have found thus far explaining this is here:
There is also a useful MSDN article that has a method XLGetCellValue() that you can copy-and-paste into a class:

However, if you are working with pivot tables, this is probably not the case since you want the data and the presentation of the data in the cells may change if the user, say, reorders the spreadsheet.

First you need to get the PivotCacheDefinition from the WorkbookPart. A sample implementation of this might be...

The PivotCacheDefinition has an enumeration of CacheField's, which define each field in the dataset and also contain the values of a particular field in any given row. Since you will need to reference these often, I found it helpful to put them into a dictionary so that I could reference them by name and have the index handy.
I created a class to store the CacheElement and the index...


...and populated the a dictionary from the PivotCacheDefinition:

To get at the rows themselves, you enumerate the PivotTableCacheDefinitionPart.PivotCacheRecords property. Be careful not to confuse this with the "PivotCacheDefinition" we used earlier. Here is some sample code that would use the dictionary above to pull the address value for each row:


This can all be greatly simplified by putting a method that accepts a CacheElement type and a PivotCacheRecord and extracts the value for us. This will always be a type string (since this is XML) which we will have to cast into whatever type of value that we want.

Not all values are stored in the CacheField elements since some of them are strings that are stored in the row/field itself. For these you can use something like...
PivotCacheRecord.ElementAt(x).GetAttributes().First(a => a.LocalName == "v").Value;

I found it very helpful to look inside the XLSX file itself and dig around to make sure I was pulling the correct data and learn where everything is stored. Just use your favorite ZIP extractor and unzip the XLSX. Here is what I found:

(root)\xl\sharedStrings.xml – An index of strings
(root)\xl\worksheets\sheet1.xml – Sheet 1’s cells, which may reference strings in sharedStrings.xml
(root)\xl\pivotCache\pivotCacheRecords1.xml – Each record which indexes values in pivotCacheDefinition1.xml, unless the particular field is of special String type (then it has the value itself)
(root)\xl\pivotCache\pivotCacheDefinition1.xml – The columns in the pivot table data (not what is shown in the Excel file) and the actual values that are indexed