Thursday, March 26, 2015

IE and Getting a Text File Off the Web - Selenium Web Tools

I've blogged previously about getting information off of a distant server on my employer's internal SharePoint site.  Automating this can be a little challenging, especially when something in the environment changes.

My new desktop showed up with Internet Explorer 11 and Windows 7 Enterprise.  When I went to run my MineSight multirun (basically a batch file with a GUI front end that our mine planning vendor provides) the file fetch from our SharePoint site didn't work.  A little googling led me to Selenium.

As is often the case, I am wayyyy late to the party here.  I remember Selenium from PyCon 2010 in Atlanta because they gave us a nice mug with new string formatting on it that I use frequently (both the mug and the formatting):

I was at PyCon 2010 . . . and I have the mug to prove it.

My project manager/boss at the time, Eric, seeing me gush over the string formatting commands, did his usual button-pushing exercise by commenting, "I don't know; why didn't they put something on there like 'from pot import coffee'?"  People, y'know?

Back to Selenium - I was able to get what I needed from it with some research and downloading.  The steps are basically:

    1) Download IEDriverServer.exe
    2) Put the executable in a location in your path.
    3) Download Python Selenium Bindings and follow the install instructions.  I went the Python 3.4 route (versus the Python 2.7 that comes with MineSight) - personal preference on my part.

    4) Make sure your Internet Explorer environment/application is set up in a way that won't cause you problems.  I could try to describe this, but this blog post from a Selenium developer does it so much better (complete with screenshots).  When Microsoft talks about "zones" and IE Protected Mode, the zones refer to things like "Trusted Sites," company web, external internet, etc.  All of those have to use the same Protected Mode setting (I set them all to protected mode) or things won't work and you'll get a fairly cryptic error message when the script crashes.
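
Steps 1 and 2 can be sanity-checked from Python before anything gets launched.  This is a minimal sketch using only the standard library; the helper name driver_on_path is mine and is not part of Selenium.

```python
import shutil

# Executable name from step 1.
DRIVER = 'IEDriverServer.exe'


def driver_on_path(name=DRIVER):
    """Return the full path to the executable, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == '__main__':
    location = driver_on_path()
    if location is None:
        print('{} not found on PATH'.format(DRIVER))
    else:
        print('Found driver at {}'.format(location))
```

If this reports that the driver is missing, webdriver.Ie() will most likely fail to start as well, so it is a cheap thing to check first.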

For my example, I was able to comment out some of the things I need to do within the MineSight multirun.  The DOS window hangs and IEDriverServer stays open within the MineSight multirun and app - I hacked this problem by killing it with an os.system() call.  Whatever it takes.

I couldn't efficiently get the script to recognize HTML tag names, so I hacked that with text processing.  This is bad, but effective.
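
The text-processing hack can be tried without a browser.  The sketch below is a stand-in rather than the script itself: extract_pre and the SAMPLE string are hypothetical names of mine, and it assumes the page holds a single lowercase <pre> block.

```python
PRETAG = '<pre>'
PRETAGCLOSE = '</pre>'


def extract_pre(page_source):
    """Return the lines between the first <pre> and </pre>, each stripped."""
    start = page_source.index(PRETAG) + len(PRETAG)
    end = page_source.index(PRETAGCLOSE, start)
    # splitlines() copes with \r\n, \r and \n line endings alike.
    return [line.strip() for line in page_source[start:end].splitlines()]


# Hypothetical page source standing in for what the browser would hand back.
SAMPLE = '<html><body><pre>RUN 12\r\nBENCH 4500\r\n</pre></body></html>'

if __name__ == '__main__':
    print(extract_pre(SAMPLE))
```

str.index raises ValueError when a tag is missing, so a page without a <pre> block fails loudly instead of writing an empty file.
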
The code:
"""
Get text from site via Internet Explorer.
"""
# For killing process inside Multirun.
# import os
from time import sleep as slpx
from selenium import webdriver
# XXX - hack - had difficulty getting
#       things by tag - text processed it.
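PRETAG = '<pre>'
PRETAGLEN = len(PRETAG)
PRETAGCLOSE = '</pre>'
# Seconds to pause at end (5 is an assumed value).
PAUSE = 5
# Address of the page to fetch (placeholder; substitute the real SharePoint URL).
INSTRUCTIONS = '<URL of the instructions page>'
# Local file the fetched text gets written to.
INSTR = 'instructions.txt'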
# XXX - may not matter (\r versus \n), in all cases
#       but for numbers in multirun, makeshift chomp
#       processing made a difference.

RETCHAR = '\r'
# Hack to shutdown DOS window.
# TASKKILL = 'taskkill /im IEDriverServer.exe /F'
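def getbody(url):
    """
    Given the website address (url),
    returns the text inside the <pre> block
    as a list of stripped lines.
    """
    browser = webdriver.Ie()
    # Load the page before reading its source.
    browser.get(url)
    text = browser.page_source
    # Close the browser and end the driver session.
    browser.quit()
    text = text[(text.index(PRETAG) + PRETAGLEN):]
    text = text[:(text.index(PRETAGCLOSE))]
    text = text.split(RETCHAR)
    # Keep the stripped lines (makeshift chomp).
    text = [x.strip() for x in text]
    return text


textii = getbody(INSTRUCTIONS)
print('\nDealing with writing of instructions file . . .\n')
# Lines were stripped above, so put the newlines back.
textii = '\n'.join(textii)
with open(INSTR, 'w') as f:
    f.write(textii)
print('Instructions copied.')
print('\nPausing {:d} seconds . . .\n'.format(PAUSE))
slpx(PAUSE)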

# XXX - can't get window to close in Multirun (MXPERT) - CBT 23MAR2015
# os.system(TASKKILL)
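
If the taskkill hack is needed, subprocess gives a little more control over it than os.system().  This is a minimal sketch, assuming Windows and the same taskkill flags as the commented-out line above; kill_command and kill_driver are illustrative names of mine.

```python
import subprocess
import sys

DRIVER = 'IEDriverServer.exe'


def kill_command(image=DRIVER):
    """Build the taskkill arguments: /im picks the image name, /F forces it."""
    return ['taskkill', '/im', image, '/F']


def kill_driver(image=DRIVER):
    """Run taskkill on Windows and return its exit code; do nothing elsewhere."""
    if not sys.platform.startswith('win'):
        return None
    return subprocess.call(kill_command(image))


if __name__ == '__main__':
    print(kill_driver())
```

Passing the command as a list avoids shell quoting issues, and the returned exit code lets the calling script tell whether taskkill reported success.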