Thursday, August 11, 2016

sqlcmd faux csv dump and parsing with the csv module

Lately I had another Excel-VBA-Python one off hack project.  Once again there was the dilemma of not being able to use MSSQL's bcp because my query string was too long.  sqlcmd can run a query from a big SQL file, but, to the best of my knowledge, it does not do csv dumps.

This is a hack.  I would normally go to hell for it, but I've done so many other bad hacks I'd have to declare bankruptcy on my programming soul and start over.  Onward.

mssql query file:

<SQL code>

< . . . variable declarations, temp table declarations, etc. . . . >


DECLARE @COMMA CHAR(1) = ',';
DECLARE @LOSSLESS INT = 3;

DECLARE @DOUBLEQUOTE CHAR(1) = CHAR(34);
-- Concatenate strings.
-- Need quoted strings for stockpiles with spaces.
SELECT @DOUBLEQUOTE + StockpileShortName +

       @DOUBLEQUOTE + @COMMA +
       @DOUBLEQUOTE + StockpileID +
       @DOUBLEQUOTE + @COMMA +
       @DOUBLEQUOTE + StkLoc +

       @DOUBLEQUOTE + @COMMA +
       -- Go for full float precision.
       CONVERT(VARCHAR(35),

               tonnes,
               @LOSSLESS) + @COMMA +
       CONVERT(VARCHAR(35),
               grade01,
               @LOSSLESS) + @COMMA +
       CONVERT(VARCHAR(35),

               grade02,
               @LOSSLESS) + @COMMA +
       CONVERT(VARCHAR(35),

               grade03,
               @LOSSLESS) + @COMMA +
       CONVERT(VARCHAR(35),

               grade04,
               @LOSSLESS) + @COMMA +
       CONVERT(VARCHAR(35),

               grade05,
               @LOSSLESS) + @COMMA +
       CONVERT(VARCHAR(35),

               grade06,
               @LOSSLESS) 
FROM ##inputresultspvctrachte


< . . . ORDER BY clause . . .>

<End SQL code>

It's pretty obvious what I'm doing (and I'd be shocked if I'm the first to do it):  list all my fields on one line separated by commas that are part of the result record.

A couple notes:

1) all my string identifiers are in double quotes; all my float values are in unquoted text - this will help simplify the Python csv module code below.

2) the @LOSSLESS "constant" - Microsoft's SQL documentation doesn't list an enumeration for this per se.  It's just a straight up whole number 3.  I'm a bit obsessive about constants - wrap that baby in a variable declaration!  Lossless double precision means, if I recall correctly, SQL Server will give you seventeen digits of precision.  This works for what I'm doing (mining stockpile management).

The (rough) mssql command to run the query from a DOS prompt:

sqlcmd -S MYSERVERNAME -U MYUSERNAME -P MYPASSWORD -I myqueryfile.sql -o theoutputfile.csv -b

The -b switch provides a Windows error code.  It's a crude check for whether the query parsed OK and ran, but it's better than nothing.

The output looks something like this (sorry about the small font):

<. . . sqlcmd messages . . .>

"KEY003","hakunamatadacopper","good",28776.5,X.XXXXX,X.XXXXX,X.XXXXX,X.XXXXXX,XX.XXXX,X.XXXXX
"KEY005","tembomalachite","not as good",25855.9,X.XXXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XX.XXXX,X.XXXXX
"KEY006","simbacobalt","not as good",156767,X.XXXXXX,X.XXXXXXX,X.XXXXXX,X.XXXXXXX,XX.XXXX,X.XXXXXX
"KEY010","jambocobalt","good",488977,X.XXXXX,X.XXXXXX,X.XXXX,X.XXXXXX,XXX.XXX,X.XXXXX
"KEY015","cucoagogo","good",39576.7,X.XXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XX.XXXX,X.XXXXX
"KEY016","greenrock","good",160,X.XXX,X.XXX,X.XXX,X.XXX,XXX.XX,X.XX
"KEY033","pinkrock","not as good",81504.3,X.XXXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XXX.XXX,X.XXXX
"KEY006","funkyleach","not as good",55866.1,X.XXXXXX,X.XXXXXX,X.XXXXXX,X.XXXXXX,XXX.XXX,X.XXXXXX
"KEY010","metalhome","good",30301.1,X.XXXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XXX.XX,X.XXXXX
"KEY015","boulderpile","good",2878.25,X.XX,X.XX,X.XXX,X.XXX,XX.XXX,X.XXX
"KEY033","berm","not as good",5309.97,X.XXXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XXX.XXX,X.XXXXX

(11 rows affected)

I've given my stockpiles funny names and X'ed out the numeric grades to sanitize this, but you get the general idea.

Now, finally to some Python code.  I'll get the lines of the file (faux csv) I want and parse them with the csv module reader object.  The whole deal is kind of verbose (I have a collections.namedtuple object that takes each "column" as an attribute).  I'm only going to show the part that segregates the lines I want and reads them with the csv reader.  The wpx module has all of my constants and static data definition in it.  Some of the whitespace issues I still need to work out.  For now I brute force stripped off leading and trailing whitespace from values.

def parsesqlcmdoutput():
    """
    Parse output from sqlcmd.

    Returns list of
    collections.namedtuple
    objects.
    """
    lines = []
    with open(wpx.OUTPUTFILE +

              wpx.CSVEXT, 'r') as f:
        # Get relevant lines.
        # Rip whitespace off end - excessive.

        # XXX - string overloading - hack.
        lines = [linex.strip() for

                 linex in f if
                 linex[0:wpx.STKFLAG[0]] ==

                         wpx.STKFLAG[1]]
    rdr = csv.reader(lines, quoting =

                            QUOTENONN)
    records = []
    for r in rdr:
        # Get rid of whitespace padding

        # around string values.
        for x in xrange(wpx.IHSTRIDX):
            r[x] = r[x].strip()
        records.append(wpx.INPUTRECORD(*r))
    return records

That csv.QUOTENONN (quote non-numeric) is handy.  As per the Python doc, anything that isn't quoted is taken as a float.  As long as my data are clean, I should be good there and it strips out some cruft code-wise.

The list comprehension is an iterable object the same way a file is, so the csv module's reader works fine on it.

That's about it (minus a lot of background code - if you need that, let me know and I'll put it in the comments).

Thanks for stopping by.