Thursday, March 26, 2015

IE and Getting a Text File Off the Web - Selenium Web Tools

I've blogged previously about getting information off of a distant server on my employer's internal SharePoint site.  Automating this can be a little challenging, especially when something in the setup changes.

My new desktop showed up with Internet Explorer 11 and Windows 7 Enterprise.  When I went to run my MineSight multirun (basically a batch file with a GUI front end that our mine planning vendor provides) the file fetch from our SharePoint site didn't work.  A little googling led me to Selenium.

As is often the case, I am wayyyy late to the party here.  I remember Selenium from PyCon 2010 in Atlanta because they gave us a nice mug with the new string formatting on it.  I use both the mug and the formatting frequently:

 
I was at PyCon 2010 . . . and I have the mug to prove it.


 
My project manager/boss at the time, Eric, seeing me gush over the string formatting commands, did his usual button-pushing exercise by commenting, "I don't know; why didn't they put something on there like 'from pot import coffee'?"  People, y'know?

Back to Selenium - I was able to get what I needed from it with some research and downloading.  The steps are basically:


 
    1) Download IEDriverServer.exe
 
    2) Put the executable in a directory that is on your PATH.
 
    3) Download Python Selenium Bindings and follow the install instructions.  I went the Python 3.4 route (versus the Python 2.7 that comes with MineSight) - personal preference on my part.

    4) Make sure your Internet Explorer environment/application is set up in a way that won't cause you problems.  I could try to describe this, but this blog post from a Selenium developer does it so much better (complete with screenshots):  http://jimevansmusic.blogspot.com/2012/08/youre-doing-it-wrong-protected-mode-and.html.  When Microsoft talks about "zones" and IE Protected Mode, the zones refer to things like "Trusted Sites," company web, external internet, etc.  Protected Mode has to be set the same way for every one of those zones (I have it turned on for all of them) or things won't work and you'll get a fairly cryptic error message when the script crashes.
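
Steps 1 and 2 are easy to get wrong without noticing, so here is a minimal sketch of a check that the driver really is reachable through PATH.  The find_driver helper is a made-up name for illustration; shutil.which is in the standard library from Python 3.3 on.

```python
import shutil

# Name of the driver executable from step 1.
DRIVER = 'IEDriverServer.exe'


def find_driver(name=DRIVER, path=None):
    """Return the full path of name if it is on the search path, else None.

    path=None means "use the PATH environment variable".
    find_driver is a hypothetical helper, not part of Selenium.
    """
    return shutil.which(name, path=path)


location = find_driver()
if location is None:
    print('{0} was not found on PATH.'.format(DRIVER))
else:
    print('Found driver at {0}'.format(location))
```

If this reports that the driver is missing, webdriver.Ie() is unlikely to start either, so it is worth running before chasing the Protected Mode settings.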


For this example, I commented out some of the things I need to do within the MineSight multirun.  When the script runs inside the multirun and app, the DOS window hangs and IEDriverServer stays open - I hacked around this problem by killing the process with an os.system() call.  Whatever it takes.
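
A hedged sketch of the cleanup side, for what it's worth.  In the Selenium Python bindings, close() closes the current window while quit() ends the whole driver session, so trying quit() in a finally block first and keeping taskkill as the fallback seems reasonable.  I have not confirmed that quit() alone cures the hang inside the multirun.  The names fetch_source and kill_driver are made up for illustration, and subprocess.call stands in for os.system() here.

```python
import subprocess

# Same command as the os.system() hack, split into an argument list.
TASKKILL = ['taskkill', '/im', 'IEDriverServer.exe', '/F']


def fetch_source(browser, url):
    """Load url in browser and return the page source.

    browser is anything with get(), page_source and quit(),
    e.g. the object returned by selenium's webdriver.Ie().
    """
    try:
        browser.get(url)
        return browser.page_source
    finally:
        # Runs even if get() raises, so the session is always ended.
        browser.quit()


def kill_driver(runner=subprocess.call):
    """Fallback: force-kill a leftover driver process (Windows only).

    Returns whatever runner returns; for subprocess.call that is
    the exit code of taskkill.
    """
    return runner(TASKKILL)
```

Usage would be along the lines of fetch_source(webdriver.Ie(), url), with kill_driver() called only if IEDriverServer is still hanging around afterwards.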
 
I couldn't efficiently get the script to find elements by HTML tag name, so I hacked that by pulling the text out of the raw page source with string processing.  This is bad, but effective.
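
The text processing can be sketched as a small standalone function that works on any string of page source, which makes it easy to try without a browser.  This is a variation on the hack, not a proper HTML parser.  The regular expression and the extract_pre name are my own for illustration, and the case-insensitive match and the html.unescape step (Python 3.4+) are assumptions about what the browser might hand back: tags in upper case, attributes on the tag, or entities like &lt; in the text.

```python
import html
import re

# First <pre ...> ... </pre> pair, any letter case, attributes allowed.
PRE_BLOCK = re.compile(r'<pre\b[^>]*>(.*?)</pre\s*>',
                       re.IGNORECASE | re.DOTALL)


def extract_pre(source):
    """Return the text inside the first <pre> block, or None if absent.

    HTML entities such as &lt; are converted back to plain characters.
    """
    match = PRE_BLOCK.search(source)
    if match is None:
        return None
    return html.unescape(match.group(1))


sample = '<html><body><PRE>1 &lt; 2\nsecond line</PRE></body></html>'
print(extract_pre(sample))
```

Returning None when there is no pre block leaves it to the caller to decide what a missing block means, where a bare text.index() call would raise ValueError.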
 
The code:
 
#!C:\Python34\python
 
"""
Get text from site via Internet Explorer.
"""
 
# For killing process inside Multirun.
# import os
 
from time import sleep as slpx
 
from selenium import webdriver
 
# XXX - hack - had difficulty getting
#       things by tag - text processed it.
PRETAG = '<pre>'
PRETAGLEN = len(PRETAG)

PRETAGCLOSE = '</pre>'
# Seconds to pause at end.
PAUSE = 3
INSTRUCTIONS = 'http://ftp3.usa.openbsd.org/pub/OpenBSD/5.6/README'
INSTR = 'instructions.txt'
 
# XXX - \r versus \n may not matter in all cases,
#       but for numbers in multirun, makeshift chomp
#       processing made a difference.

RETCHAR = '\r'
 
# Hack to shutdown DOS window.
# TASKKILL = 'taskkill /im IEDriverServer.exe /F'
 
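def getbody(url):
    """
    Given the website address (url), returns the text
    inside the page's first <pre> block as a list of
    pieces split on RETCHAR.
    """
    browser = webdriver.Ie()
    browser.get(url)
    text = browser.page_source
    # close() only closes the current window;
    # quit() ends the whole driver session.
    browser.quit()
    text = text[(text.index(PRETAG) + PRETAGLEN):]
    text = text[:(text.index(PRETAGCLOSE))]
    text = text.split(RETCHAR)
    # rstrip, not strip:  each piece after the first still
    # starts with the '\n' that followed the '\r', and the
    # caller joins the pieces back together with ''.
    text = [x.rstrip() for x in text]
    return text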
 
textii = getbody(INSTRUCTIONS)
print('\nDealing with writing of instructions file . . .\n')
textii = ''.join(textii)
# The with block closes the file even if the write fails.
with open(INSTR, 'w') as f:
    f.write(textii)
print('Instructions copied.')
print('\nPausing {:d} seconds . . .\n'.format(PAUSE))
slpx(PAUSE)

# XXX - can't get window to close in Multirun (MXPERT) - CBT 23MAR2015
# os.system(TASKKILL)