We were given the task here at the Washington Post to create a utility that could take screen captures of interactive (read: JavaScript-based) elements and load them into our Content API, so that our mobile team could use the images in place of that code. It seemed straightforward, and teams had done this before: during the 2016 elections, screenshots of our election map were generated continuously using PhantomJS. So off we went, hoping that PhantomJS and its descendants would help us out.

In our discussion of the project, one of our senior developers mentioned headless Chrome, and that the developer of PhantomJS had stopped working on it because of headless Chrome. We did some research, but put that aside for the moment to get a proof of concept working.

Tricksy JavaScript

After many attempts and working with many different PhantomJS projects in Node and similar Python projects, we found that although these were all supposedly JavaScript-aware, there was something weird going on.

Basically, the JavaScript on the page was only being partly evaluated: the scripts that changed the page source ran, but the JavaScript libraries those scripts added to the page were never evaluated. The only way to get everything evaluated properly was to use a full browser, i.e. Chrome or Firefox.

Ghost.py & PySide

Ghost.py was recommended by someone who wrote another similar project. It looked like this could have been the way to go: no browser dependency, and it didn't need Selenium (if you haven't used Selenium, you'll understand my pain in a bit); the Qt libraries were more than enough to get it working.

PhantomJS is essentially a Qt-based project, and Qt is extremely useful if you know how to use it, but learning to use it is a pain.

What we learned is that Qt earned its reputation for a reason. After a lot of issues getting Qt to compile, we managed to find a Docker image for Ghost.py that worked for our purposes.

So, what did we learn? Ghost.py seemed very powerful, but we had serious issues with it, mainly because its sparse (and at the time, outdated) documentation made it hard to do what we wanted. Although there is a screenshot method available, capturing just a portion of a page was not straightforward at all, no matter how we used JavaScript within the code to select it. We concluded that it required too much effort to get it to do what we needed, and although developers have used it successfully before, there wasn't much of an ecosystem around it.

Back to Selenium

Selenium in Python, a familiar friend, ended up being used here. After some work testing the different drivers, we found that Firefox and Chrome worked best with the project. The problem now was that we were running a GUI app unnecessarily, which is exactly why people preferred PhantomJS: it was essentially a completely headless browser. We needed the power of Firefox or Chrome without the overhead of a GUI. In the end, we decided that headless Chrome was the best option.

Headless Chrome

Since version 59 of Chrome (for Mac and Linux, 60 for Windows), the browser has been able to run from the command line using the --headless switch. It's pretty amazing. Our first thought was to dump Selenium completely and drive Chromium through code; unfortunately, there was no easy way to do that in Python (at least, not one we could find at the time), and what we required (selecting portions of the rendered page) would not work well. Selenium gave the project the complete control it needed, even with some compromises. We set up our project in Python using an Alpine Linux Docker container:

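from selenium import webdriver

WINDOW = '1280,720'  # example width,height; the value we used is not shown here

options = webdriver.ChromeOptions()
options.binary_location = '/usr/bin/chromium-browser'  # Chromium as installed on Alpine
options.add_argument('headless')     # run without a GUI
options.add_argument('disable-gpu')
options.add_argument('no-sandbox')   # needed when running as root in the container
options.add_argument(f'window-size={WINDOW}')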
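With options like these, capturing a single element rather than the whole page can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names, the default window size, and the selector in the usage note are assumptions, the switches are written with their leading dashes, and it assumes a Selenium 4 style API with chromedriver available on the PATH.

```python
def build_chrome_arguments(window=(1280, 720)):
    """Return the headless Chrome switches as a list of strings."""
    width, height = window
    return [
        "--headless",
        "--disable-gpu",
        "--no-sandbox",
        f"--window-size={width},{height}",
    ]


def capture_element(url, css_selector, out_path, window=(1280, 720)):
    """Save a PNG of the first element matching css_selector on the page at url."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.binary_location = "/usr/bin/chromium-browser"  # same path as the setup above
    for argument in build_chrome_arguments(window):
        options.add_argument(argument)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        element = driver.find_element(By.CSS_SELECTOR, css_selector)
        # WebElement.screenshot writes a PNG of just that element.
        return element.screenshot(out_path)
    finally:
        driver.quit()  # always release the browser, even on failure
```

A call such as capture_element('https://example.com', '#election-map', '/tmp/map.png') would then write the image; the URL and selector here are placeholders.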

RemoteWebDriver

If you've used Selenium before, you're likely aware that it is actually a Java-based project that uses 'drivers' to control the browsers. You are safe if you use the local driver and your language's bindings. Typically Selenium works like a charm, but this project was meant to run continuously in production, so chromium and chromedriver probably should not run in the same container: they are two distinct programs, and errors from one should not affect the other. You can find more info about Docker container isolation in their documentation.

We settled on using the RemoteWebDriver interface with chromium in one container and chromedriver in a separate container. This driver is a JAR file and the developer experience was rough. When you can connect to RemoteWebDriver and it works, it's like magic, but the problem is when it crashes. Not only does the Java stack trace not help at all, but you also need to get the browser up and running again, which is easier said than done.
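One way to cope with a crashed remote browser is to treat every session as disposable and rebuild it on failure. The sketch below only illustrates that idea and is not what the project ran: run_with_fresh_session, make_remote_driver, the chromedriver host name, and the port are all hypothetical, and it assumes a Selenium 4 style webdriver.Remote(command_executor=..., options=...) call.

```python
import time


def run_with_fresh_session(make_driver, task, attempts=3, delay=1.0):
    """Run task(driver), opening a brand new session for every attempt."""
    last_error = None
    for attempt in range(attempts):
        if attempt:
            time.sleep(delay)  # give the remote browser time to come back
        driver = None
        try:
            driver = make_driver()
            return task(driver)
        except Exception as error:  # sketch only; real code would catch WebDriverException
            last_error = error
        finally:
            if driver is not None:
                try:
                    driver.quit()
                except Exception:
                    pass  # the browser may already be gone
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


def make_remote_driver(hub_url="http://chromedriver:4444/wd/hub"):
    """Open a session against a remote driver container (hypothetical address)."""
    # Imported here so the retry helper can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Remote(command_executor=hub_url, options=options)
```

Passing make_remote_driver as the factory keeps the retry logic independent of Selenium, which also makes it easy to exercise with a fake driver.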

After the project was up and running, Java and Chromium managed to eat away at memory and CPU in the AWS ECS cluster, bringing things essentially to a halt.

It was not fun.

Even with settings like so for Chromium to make sure that certain mounted directories didn’t blow up the container…

volumes:
- shm:/dev/shm # container can crash if not shared with host, or given a separate volume
- tmp:/tmp/ # where all generated files went, in case it got too big
- dbus:/tmp/dbus # Lots of issues in Docker with dbus

…either Chromium would not stop crashing or the ECS instance would grind to a halt.

The next attempt was to run a Grid; don't do this if you don't have to. We figured using Selenium Grid would make sense to offload the work and let containers recover, but not only is Grid essentially a beta feature, it would also need to be maintained for production in the long run.

Grid Services

Our team lead had mentioned Grid providers. We were kind of aware that they existed, but had not considered offloading that problem to them instead. The RemoteWebDriver functionality wasn't completely useless! All that was needed was some credentials and their Grid was accessible, and all we had to deal with was costs. Unfortunately, that didn't fix the other issues: browsers still crash, access can be revoked, and the program still has to work.
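Pointing RemoteWebDriver at a hosted Grid mostly comes down to building a hub URL that carries the credentials. As a rough sketch, assuming the provider accepts HTTP basic credentials in the URL and exposes a /wd/hub endpoint: the function name, the environment variable names, and the host are all hypothetical, so check your provider's documentation for the real values.

```python
import os
from urllib.parse import quote


def grid_hub_url(host, username=None, access_key=None):
    """Build a credentialed hub URL for a hosted Selenium Grid."""
    # Hypothetical variable names; keeps secrets out of the source code.
    username = username or os.environ["GRID_USERNAME"]
    access_key = access_key or os.environ["GRID_ACCESS_KEY"]
    # Percent-encode both parts so characters like '@' or '/' cannot break the URL.
    user_part = quote(username, safe="")
    key_part = quote(access_key, safe="")
    return f"https://{user_part}:{key_part}@{host}/wd/hub"
```

The resulting string can be passed as the command_executor argument when opening a remote session.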

What’s Next?

BrowserStack has been working, but has been temperamental at best; it’s hard to keep sessions intact, and there tend to be a lot of hanging sessions. There’s reporting, and it works, but it doesn’t feel very clean.

Working on the project, we recently found puppeteer, a Chrome team project to create an 'official' JavaScript control interface for headless Chrome. It also turns out there is a new Python project called pyppeteer, which makes things both easier (downloads Chromium for you) and more difficult (need to learn async i/o), plus it's a WIP. There are a few options here, but we could cut out BrowserStack entirely. Without the complexity of the browser crashing within the RemoteWebDriver, we might actually be able to debug the Chromium resource issue that seems to be happening.
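To give a feel for what the pyppeteer route might look like, here is a minimal sketch of an element screenshot written with async I/O. It is an assumption about how we might use the library, not code we have in production: capture_element_async and its launch parameter (which lets a stand-in be injected for testing) are hypothetical names, while launch, newPage, goto, querySelector, screenshot, and close are the pyppeteer calls it relies on.

```python
import asyncio


async def capture_element_async(url, css_selector, out_path, launch=None):
    """Screenshot the first element matching css_selector using pyppeteer."""
    if launch is None:
        # Imported lazily; pyppeteer downloads Chromium the first time it launches.
        from pyppeteer import launch

    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        await page.goto(url)
        element = await page.querySelector(css_selector)
        if element is None:
            raise LookupError(f"no element matches {css_selector!r}")
        # ElementHandle.screenshot captures only that element.
        await element.screenshot({"path": out_path})
        return out_path
    finally:
        await browser.close()  # close the browser even if the capture fails
```

It would be driven with something like asyncio.run(capture_element_async('https://example.com', 'h1', 'h1.png')), where the URL and selector are placeholders.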

One way or another, it looks like headless Chrome is the way forward; it just remains to be seen how that happens. R.I.P. PhantomJS.

References