Sunday, February 19, 2017

Filling in Missing Grouping Columns of MSSQL SSRS Report Dumped to Excel

This is another simple but common problem in certain business environments:

1) Data are presented via a Microsoft SQL Server Reporting Services report, BUT

2) The user wants the data in Excel, and, further, wants to play with it (pivot, etc.) there.  The problem is that the grouping column labels are not in every record, only in the one row that begins the list of records for that group (sanitized screenshot below):

But I don't WANT to copy and paste all those groupings for 30,000 records :*-(
I had this assignment recently from a remote request.  It took about four rounds of an e-mail exchange to figure out that it really wasn't a data problem, but a formatting one that needed solving.

It is possible to do the whole thing in Python.  I did the Excel part by hand in order to get a handle on the data:

1) In Excel, delete the extra rows on top of the report leaving just the headers and the data.

2) In Excel, select everything on the data page, format the cells correctly by unselecting the Merge Cells and Wraparound options.

3) In Excel, at this point you should be able to see if there are extra empty columns as space fillers; delete them.  Save the worksheet as a csv file.

4) In a text editor, open your csv file, identify any empty rows, and delete them.  Change column header names as desired.

Now the Python part:

#!python36

"""
Doctor csv dump from unmerged cell
dump of SSRS dump from MSSQL database.

Fill in cell gaps where merged
cells had only one grouping value
so that all rows are complete records.
"""

import pprint

COMMA = ','
EMPTY = ''

INFILE = 'rawdata.csv'
OUTFILE = 'canneddumpfixed.csv'

ERRORFLAG = 'ERROR!'

f = open(INFILE, 'r')
headerline = next(f)
numbercolumns = len(headerline.split(COMMA))

f2 = open(OUTFILE, 'w')

# Assume at least one data column on far right.
missingvalues = (numbercolumns - 1) * [ERRORFLAG]

for linex in f:
    print('Processing line {:s} . . .'.format(linex))
    splitrecord = linex.split(COMMA)
    for slotx in range(0, numbercolumns - 1):
        if splitrecord[slotx] != EMPTY:
            missingvalues[slotx] = splitrecord[slotx]
        else:
            splitrecord[slotx] = missingvalues[slotx]
    f2.write(COMMA.join(splitrecord))

f2.close()

print('Finished')


At this point you've got your data in csv format - you can open it in Excel and go to work.

There may be a free or COTS (commercial off the shelf) utility that does all this somewhere in the Microsoft "ecosystem" (I think that's their fancy enviro-friendly word for vendor-user community) but I don't know of one.


Thanks for stopping by.





3 comments:

  1. It would be much easier to read the sheet using pandas and using forward-fill grouping columns:
    http://pandas.pydata.org/pandas-docs/stable/missing_data.html#filling-missing-values-fillna

    ReplyDelete
  2. It can be done in Excel, unmerge the grouping cells which leaves the group label in the first cell of that group. Next, highlight the columns in the table with missing labels. Hit ctrl-g then select special, select blanks. All the blank cells will be highlighted, and one of those will be active. Enter equals and the press up arrow to select the cell above it, then press cntrl-enter and it will copy that formula to all the other blank cells. Copy and paste as values if you want to make sure the values are locked in.

    ReplyDelete
  3. Alhashah, thank you. I've heard great things about Pandas. Unfortunately we don't always have the freedom at work to install the packages we want. Python without third party packages is the safest bet.

    Unknown, thank you for the Excel solution - this would work best for my user base. I appreciate it and learned something.

    ReplyDelete