Crawling the Web with Node, PhantomJS, and Horseman

This article has been peer reviewed by Lukas White. Thanks to all SitePoint peer reviewers for making SitePoint content the best it can be!

During the course of a project, it is quite common to need custom scripts for a variety of tasks. Such one-off scripts, which are typically executed from the command line (CLI), can be used for virtually any type of task. Having written many such scripts over the years, I have come to appreciate the value of taking a little time up front to set up a custom CLI microframework to facilitate this process. Fortunately, Node.js and its extensive package ecosystem, npm, make this easy. Whether you are parsing a text file or running an ETL job, having a convention in place makes it easy to add new functionality in an efficient and structured way.

While not necessarily associated with the command line, web crawling is often employed in certain problem domains such as automated functional testing and defacement detection. This tutorial shows how to implement a lightweight CLI framework whose supported actions revolve around web crawling. Hopefully, this will get your creative juices flowing, whether your interest is specific to crawling or to the command line. Technologies covered include Node.js, PhantomJS, and an assortment of npm packages related to both crawling and the CLI.

The source code for this tutorial is available on GitHub. To run the examples, you will need to have both Node.js and PhantomJS installed. Instructions for downloading and installing them are available here: Node.js, PhantomJS.

Setting up a basic command line framework

At the heart of any CLI framework is the concept of converting a command, which typically includes one or more optional or mandatory arguments, into a concrete action. Commander and prompt are two very useful npm packages in this regard.

Commander lets you define which arguments are supported, while prompt allows you (conveniently enough) to prompt the user for input at run time. The end result is a syntactically smooth interface for performing a variety of actions with dynamic behaviors based on certain user-supplied data.

Say, for example, we want our command to look like this:

$ node run.js -x hello_world

Our entry point (run.js) defines the possible arguments like this:

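var program = require('commander');
var prompt = require('prompt');
var validUrl = require('valid-url');

program
  .version('1.0.0')
  .option('-x --action-to-perform [string]', 'The type of action to perform.')
  .option('-u --url [string]', 'Optional URL used by certain actions')
  .parse(process.argv);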

and defines the different user input cases like this:

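var performAction = require('./actions/' + program.actionToPerform);

switch (program.actionToPerform) {
  case 'hello_world':
    prompt.get([{

      // The property name to use in the result object
      name: 'url',

      // The message shown to the user
      description: 'Enter a URL',

      // The user must enter a value
      required: true,

      // Validate the user's input
      conform: function (value) {

        // Only accept a valid web URL
        return validUrl.isWebUri(value);
      }
    }], function (err, result) {

      // Perform the action once valid input has been received
      performAction(phantomInstance, result.url);
    });
    break;
}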
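The conform callback above hands validation off to the valid-url package. To make the idea concrete, here is a minimal sketch of a comparable check built only on Node's core url module. The function name looksLikeWebUrl is hypothetical, and this is only a rough approximation for illustration, not how valid-url itself is implemented.

```javascript
'use strict';

var url = require('url');

// Hypothetical helper: a rough "is this a web URL?" check.
// It is NOT the valid-url implementation, only a stand-in for illustration.
function looksLikeWebUrl(value) {
  if (typeof value !== 'string' || value.trim() === '') {
    return false;
  }
  var parsed;
  try {
    parsed = new url.URL(value);
  } catch (e) {
    return false; // not parseable as an absolute URL
  }
  // Only http and https count as "web" URLs here
  return parsed.protocol === 'http:' || parsed.protocol === 'https:';
}

console.log(looksLikeWebUrl('https://example.com/page')); // true
console.log(looksLikeWebUrl('ftp://example.com'));        // false
console.log(looksLikeWebUrl('not a url'));                // false
```

A function like this could be passed as the conform property of a prompt field, since prompt only needs a truthy or falsy answer.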

At this point, we’ve defined a basic path through which we can specify an action to perform, and added a prompt that accepts a URL. All that remains is to add a module that handles the logic specific to this action. We can do this by adding a file named hello_world.js to the actions directory:

'use strict';


module.exports = function (phantomInstance, url) {

  if (!url || typeof url !== 'string') {
    throw 'You must specify a url to ping';
  } else {
    console.log('Pinging url: ', url);
  }

  phantomInstance
    .open(url)
    .status()
    .then(function (statusCode) {
      if (Number(statusCode) >= 400) {
        throw 'Page failed with status: ' + statusCode;
      } else {
        console.log('Hello world. Status code returned: ', statusCode);
      }
    })
    .catch(function (err) {
      console.log('Error: ', err);
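    })

    // Always close the Horseman instance; otherwise you may end up
    // with orphaned PhantomJS processes
    .close();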
};

As you can see, the module expects to be supplied with an instance of a PhantomJS object (phantomInstance) and a URL (url). We’ll get into the specifics of defining a PhantomJS instance shortly, but for now it’s enough to see that we’ve laid the groundwork for triggering a particular action. Now that we have a convention in place, we can easily add new actions in a well-defined and predictable way.

Crawling with PhantomJS using Horseman

Horseman is a Node.js package that provides a powerful interface for building and interacting with PhantomJS processes. A full explanation of Horseman and its features would warrant its own article, but suffice it to say that it allows you to easily simulate just about any behavior that a human user might exhibit in their browser. Horseman provides a wide range of configuration options, including things like automatic jQuery injection and ignoring SSL certificate warnings. It also provides functionality for managing cookies and taking screenshots.

Whenever we trigger an action through our CLI framework, our entry script (run.js) creates an instance of Horseman and passes it to the specified action module. In pseudo-code, it looks like this:

var Horseman = require('node-horseman');

var phantomInstance = new Horseman({
  phantomPath: '/usr/local/bin/phantomjs',
  loadImages: true,
  injectJquery: true,
  webSecurity: true,
  ignoreSSLErrors: true
});

performAction(phantomInstance, ...);

Now when we run our command, the Horseman instance and the input URL are passed to the hello_world module, which causes PhantomJS to request the URL, capture its status code, and print the status to the console. We’ve just run our first bona fide crawl with Horseman!
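The hard-coded phantomPath above will only work on machines where PhantomJS lives at that location. One possible way around this, sketched below, is to build the options object in a small function that reads the path from the environment. The function name buildHorsemanOptions and the PHANTOMJS_PATH variable are assumptions made for this sketch; neither comes from Horseman or from the tutorial's source.

```javascript
'use strict';

// Build the options object that would be passed to `new Horseman(...)`.
// PHANTOMJS_PATH is a hypothetical environment variable name chosen for
// this sketch; the fallback is the path used in the tutorial.
function buildHorsemanOptions(env) {
  return {
    phantomPath: env.PHANTOMJS_PATH || '/usr/local/bin/phantomjs',
    loadImages: true,
    injectJquery: true,
    webSecurity: true,
    ignoreSSLErrors: true
  };
}

console.log(buildHorsemanOptions(process.env).phantomPath);
```

The result could then be passed along as new Horseman(buildHorsemanOptions(process.env)).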

Chaining Horseman Methods for Complex Interactions

So far we’ve looked at a very simple use of Horseman, but the package can do a lot more when we chain its methods to perform a sequence of actions in the browser. In order to demonstrate a few of these features, let’s define an action that simulates a user browsing GitHub to create a new repository.

Please note: This example is purely for demonstration purposes and should not be viewed as a viable method for creating GitHub repositories. It is merely an example of how one could use Horseman to interact with a web application. You should use the official GitHub API if you want to create repositories in an automated way.

Suppose the new crawl will be triggered like this:

$ node run.js -x create_repo

Following the CLI framework convention that we have already set up, we need to add a new module to the actions directory named create_repo.js. As with our previous “hello world” example, the create_repo module exports a single function containing all the logic for this action.

module.exports = function (phantomInstance, username, password, repository) {

  if (!username || !password || !repository) {
    throw 'You must specify login credentials and a repository name';
  }

  ...
}

Note that with this action, we are passing more parameters to the exported function than before. The parameters include username, password, and repository. We will pass these values from run.js once the user has successfully completed the prompt.

Before any of this can happen, we need to add some logic to run.js to trigger the prompt and capture the data. We do this by adding a case to our switch statement:

switch (program.actionToPerform) {

  case 'create_repo':
    prompt.get([{
      name: 'repository',
      description: 'Enter repository name',
      required: true
    }, {
      name: 'username',
      description: 'Enter GitHub username',
      required: true
    }, {
      name: 'password',
      description: 'Enter GitHub password',
      hidden: true,
      required: true
    }], function (err, result) {
      performAction(
        phantomInstance,
        result.username,
        result.password,
        result.repository
      );
    });
    break;

    ...

Now that we’ve added this hook to run.js, when the user enters the relevant data, it will be passed to the action, allowing us to proceed with the crawl.

As for the create_repo crawl logic itself, we use Horseman’s array of methods to navigate to the GitHub login page, enter the supplied username and password, and submit the form:

phantomInstance
  .open('https://github.com/login')
  .type('input[name="login"]', username)
  .type('input[name="password"]', password)
  .click('input[name="commit"]')

We continue the chain while waiting for the form submission page to load:

.waitForNextPage()

after which we use jQuery to determine whether the login was successful:

.evaluate(function () {
  $ = window.$ || window.jQuery;
  var fullHtml = $('body').html();
  return !fullHtml.match(/Incorrect username or password/);
})
.then(function (isLoggedIn) {
  if (!isLoggedIn) {
    throw 'Login failed';
  }
})

An error is thrown if the login fails. Otherwise, we continue chaining methods to navigate to our profile page:

.click('a:contains("Your profile")')
.waitForNextPage()

Once we’re on our profile page, we navigate to our Repositories tab:

.click('nav[role="navigation"] a:nth-child(2)')
.waitForSelector('a.new-repo')

While on our Repositories tab, we check whether a repository with the specified name already exists. If it does, we throw an error. Otherwise, we continue with our sequence:


// Gather the names of the user's existing repositories
.evaluate(function () {
  $ = window.$ || window.jQuery;

  var possibleRepositories = [];
  $('.repo-list-item h3 a').each(function (i, el) {
    possibleRepositories.push($(el).text().replace(/^\s+/, ''));
  });

  return possibleRepositories;
})

// Throw if a repository with the specified name already exists
.then(function (possibleRepositories) {
  if (possibleRepositories.indexOf(repository) > -1) {
    throw 'Repository already exists: ' + repository;
  }
})

Assuming that no error has been generated, we proceed by programmatically clicking on the “new repository” button and waiting for the following page:

.click('a:contains("New")')
.waitForNextPage()

after which we enter the provided repository name and submit the form:

.type('input#repository_name', repository)
.click('button:contains("Create repository")')

Once we reach the resulting page, we know the repository has been created:

.waitForNextPage()
.then(function () {
  console.log('Success! You should now have a new repository at: ', 'https://github.com/' + username + '/' + repository);
})

As with any Horseman crawl, it’s crucial that we close the Horseman instance at the end:

.close();

Failure to close the Horseman instance may cause orphaned PhantomJS processes to persist on the machine.
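The sequence above leans on one property of the chain: a throw inside any .then() skips the remaining steps, and cleanup still has to happen afterwards. The sketch below illustrates that flow with plain native promises instead of Horseman, assuming Horseman's chain behaves like an ordinary promise chain (which is how the code above uses .then()). The runSteps helper and the step names are hypothetical.

```javascript
'use strict';

// Run a list of named steps as one promise chain, recording what happens.
// A step that throws causes every later step to be skipped until the catch,
// and the final cleanup runs either way (mirroring the trailing .close()).
function runSteps(steps) {
  var log = [];
  var chain = Promise.resolve();

  steps.forEach(function (step) {
    chain = chain.then(function () {
      log.push('run ' + step.name);
      return step.fn();
    });
  });

  return chain
    .catch(function (err) {
      log.push('caught ' + err);
    })
    .then(function () {
      log.push('close');
      return log;
    });
}

runSteps([
  { name: 'login', fn: function () {} },
  { name: 'check', fn: function () { throw 'Login failed'; } },
  { name: 'create', fn: function () {} }
]).then(function (log) {
  // prints: run login -> run check -> caught Login failed -> close
  console.log(log.join(' -> '));
});
```

Note that the create step never runs once check has thrown, which is the behavior we want when a login fails or a repository already exists.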

Crawling to gather data

At this point, we’ve put together a static sequence of actions to programmatically create a new repository on GitHub. We did this through a series of Horseman methods.

This approach can be useful for specific structural and behavioral patterns that are known in advance. However, you may find that you need to implement more flexible scripting at some point. This may be the case if your sequence of actions has the potential to vary widely depending on context, or to produce several different outcomes. It would also be the case if you need to extract data from the DOM.

In such cases, you can use Horseman’s evaluate() method, which allows you to perform free-form interactions in the browser by injecting either inline or externally linked JavaScript.

This section shows an example of extracting basic data from a page (anchor links, in this case). One scenario where this might be necessary would be to build a defacement detection crawler that visits every URL on a domain.

Like our last example, we first need to add a new module to the actions directory:

module.exports = function (phantomInstance, url) {

  if (!url || typeof url !== 'string') {
    throw 'You must specify a url to gather links';
  }

  phantomInstance
    .open(url)

    // Interact with the page. This code runs in the browser.
    .evaluate(function () {
      $ = window.$ || window.jQuery;

      // Return a single result object with a property for each
      // piece of information you want to derive from the page
      var result = {
        links: []
      };

      if ($) {
        $('a').each(function (i, el) {
          var href = $(el).attr('href');
          if (href) {
            // Skip in-page anchors, javascript: and mailto: links, and duplicates
            if (!href.match(/^(#|javascript|mailto)/) && result.links.indexOf(href) === -1) {
              result.links.push(href);
            }
          }
        });
      }
      // jQuery is not available, so fall back to the native DOM API
      else {
        var links = document.getElementsByTagName('a');
        for (var i = 0; i < links.length; i++) {
          var href = links[i].href;
          if (href) {
            if (!href.match(/^(#|javascript|mailto)/) && result.links.indexOf(href) === -1) {
              result.links.push(href);
            }
          }
        }
      }

      return result;
    })
    .then(function (result) {
      console.log('Success! Here are the derived links: \n', result.links);
    })

    .catch(function (err) {
      console.log('Error getting links: ', err);
    })

    // Always close the Horseman instance; otherwise you may end up
    // with orphaned PhantomJS processes
    .close();
};

And then add a hook for the new action in run.js:

switch (program.actionToPerform) {

  ...

  case 'get_links':
    prompt.get([{
        name: 'url',
        description: 'Enter URL to gather links from',
        required: true,
        conform: function (value) {
          return validUrl.isWebUri(value);
        }
    }], function (err, result) {
      performAction(phantomInstance, result.url);
    });
    break;

Now that this code is in place, we can run a crawl to extract the links from any page by running the following command:

$ node run.js -x get_links

This action demonstrates extracting data from a page and does not use any of the browser actions built into Horseman. It directly executes whatever JavaScript you put in the evaluate() method, and does so as if it were running natively in a browser environment.
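One detail worth keeping in mind: jQuery's attr('href') returns the raw attribute value, which may be relative, whereas the native href property returns an absolute URL. If the links are meant to feed a crawler that stays on one domain, they will likely need normalizing on the Node side once evaluate() has returned them. Here is a minimal sketch using Node's core URL class; normalizeLinks is a hypothetical helper and the example.com addresses are placeholders.

```javascript
'use strict';

var URL = require('url').URL;

// Resolve raw href values against the page URL, drop duplicates and
// non-web schemes, and keep only links on the same host as the page.
function normalizeLinks(hrefs, pageUrl) {
  var base = new URL(pageUrl);
  var seen = {};
  var result = [];

  hrefs.forEach(function (href) {
    var resolved;
    try {
      resolved = new URL(href, base);
    } catch (e) {
      return; // skip values that cannot be parsed as URLs
    }
    if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') {
      return;
    }
    if (resolved.host !== base.host) {
      return;
    }
    resolved.hash = ''; // treat /page and /page#section as the same link
    if (!seen[resolved.href]) {
      seen[resolved.href] = true;
      result.push(resolved.href);
    }
  });

  return result;
}

console.log(normalizeLinks(
  ['/about', 'about', 'https://example.com/about#team',
   'mailto:hi@example.com', 'https://other.example.org/'],
  'https://example.com/docs/'
));
// prints: [ 'https://example.com/about', 'https://example.com/docs/about' ]
```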

One more thing should be noted in this section, which was alluded to earlier: not only can you run custom JavaScript in the browser using the evaluate() method, but you can also inject external scripts into the runtime environment before running your evaluation logic. It can be done like this:

phantomInstance
  .open(url)
  .injectJs('scripts/CustomLogic.js')
  .evaluate(function() {
    var x = CustomLogic.getX(); 
    console.log('Retrieved x using CustomLogic: ', x);
  })

By extending the logic above, you can perform virtually any action on any website.
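For reference, here is a guess at what a file such as scripts/CustomLogic.js could look like. Its contents are entirely hypothetical (the tutorial does not show the file), and getX simply returns a placeholder value. The only real requirement is that the script attaches something to the global scope so that the function passed to evaluate() can find it.

```javascript
'use strict';

// Hypothetical contents of scripts/CustomLogic.js. The script attaches a
// CustomLogic object to the global scope so that code running later in the
// same page can call it. The fallback to `global` only exists so that the
// file can also be loaded in Node for testing.
(function (root) {
  root.CustomLogic = {
    // Placeholder logic: a real script would inspect the page here.
    getX: function () {
      return 'x';
    }
  };
}(typeof window !== 'undefined' ? window : global));

console.log(CustomLogic.getX());
```

Since the function passed to evaluate() runs inside the page, returning the value from it and reading it in a following .then() is probably a more dependable way to get data back into Node than logging from within the page.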

Using Horseman to take screenshots

The last use case I want to demonstrate is how you would use Horseman to take screenshots. We can do this with Horseman’s screenshotBase64() method, which returns a base64 encoded string representing the screenshot.

As with our previous example, we first need to add a new module to the actions directory:

var crypto = require('crypto');
var fs = require('fs');

module.exports = function (phantomInstance, url) {

  if (!url || typeof url !== 'string') {
    throw 'You must specify a url to take a screenshot';
  }

  console.log('Taking screenshot of: ', url);

  phantomInstance
    .open(url)

    // Optionally, check the status of the response
    .status()
    .then(function (statusCode) {
      console.log('HTTP status code: ', statusCode);
      if (Number(statusCode) >= 400) {
        throw 'Page failed with status: ' + statusCode;
      }
    })

    // Take the screenshot
    .screenshotBase64('PNG')

    // Save the screenshot to a file
    .then(function (screenshotBase64) {

      // Use a SHA1 hash of the URL as the file name
      var urlSha1 = crypto.createHash('sha1').update(url).digest('hex')
        , filePath = 'screenshots/' + urlSha1 + '.base64.png.txt';

      // Write synchronously so that a failure is thrown inside the chain
      // and reaches the catch() below
      fs.writeFileSync(filePath, screenshotBase64);
      console.log('Success! You should now have a new screenshot at: ', filePath);
    })

    .catch(function (err) {
      console.log('Error taking screenshot: ', err);
    })

    // Always close the Horseman instance; otherwise you may end up
    // with orphaned PhantomJS processes
    .close();
};

And then add a hook for the new action in run.js:

case 'take_screenshot':
  prompt.get([{
      name: 'url',
      description: 'Enter URL to take screenshot of',
      required: true,
      conform: function (value) {
        return validUrl.isWebUri(value);
      }
  }], function (err, result) {
    performAction(phantomInstance, result.url);
  });
  break;

You can now take screenshots with the following command:

$ node run.js -x take_screenshot

The reason for using base64 encoded strings (and not, for example, saving real images) is that they are a convenient way to represent raw image data. This StackOverflow answer goes into more detail.

If you wanted to save actual images, you would use the screenshot() method.
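Should you later need to turn one of the saved base64 text files back into an image, Node's Buffer can do the decoding. The sketch below is one way to approach it; decodeBase64Png is a hypothetical helper, and the sample input is just the eight-byte PNG signature rather than a complete image.

```javascript
'use strict';

// Every PNG file starts with these eight bytes.
var PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

// Decode a base64 string (such as the contents of a saved .base64.png.txt
// file) back into raw bytes, and check for the PNG signature.
function decodeBase64Png(base64String) {
  var bytes = Buffer.from(base64String.trim(), 'base64');
  for (var i = 0; i < PNG_SIGNATURE.length; i++) {
    if (bytes[i] !== PNG_SIGNATURE[i]) {
      throw new Error('Decoded data does not start with a PNG signature');
    }
  }
  return bytes;
}

// A tiny sample: just the PNG signature encoded as base64 (not a full image).
var sample = Buffer.from(PNG_SIGNATURE).toString('base64');
console.log(decodeBase64Png(sample).length); // 8
```

The returned Buffer could then be written to disk with fs.writeFileSync to produce a regular .png file.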

Conclusion

This tutorial attempted to demonstrate both a custom CLI microframework and some basic logic for crawling in Node.js, using the Horseman package to take advantage of PhantomJS. While a CLI framework would likely benefit many projects, the use of crawling is typically limited to very specific problem domains. One common area is quality assurance (QA), where crawling can be used for functional and user interface testing. Another area is security where, for example, you might want to crawl your website periodically to detect whether it has been defaced or otherwise compromised.

Whatever the case may be for your project, make sure that you clearly define your goals and be as unobtrusive as possible. Get permission when you can, be as polite as possible, and take care never to DDoS a site. If you suspect that you’re generating a lot of automated traffic, you probably are, and you should probably re-evaluate your goals, implementation, or level of permission.



Rosemary S. Bishop