There have been a few times when I have needed to crawl a Symfony 2 site to index pages and execute code, so I built a crawler console command using the Symfony 2 DomCrawler and Client components. This is a fun alternative to using curl, and the Client offers plenty of browser-like features that come in handy, such as a saved history of visited pages and the ability to test the forward and back button functionality on your pages. The authentication cookie set up below can also be used for a curl request to protected pages if desired.

The DomCrawler class allows you to manipulate the DOM while the Client class functions like a browser to make requests and receive responses, as well as follow links and submit forms. Symfony has documented how this works in the Testing chapter of The Book, but I needed something that would work outside of unit and functional tests in the form of a console command that could be scheduled to run.
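
To see where these pieces fit before building the command, here is a rough sketch of the Client/Crawler interaction. It is not part of the final command and assumes a booted kernel and a page with an "About" link:

// the Client behaves like a browser; each request returns a DomCrawler for the response
$client  = $kernel->getContainer()->get( 'test.client' );
$crawler = $client->request( 'GET', 'http://example.com/' );

// query the DOM of the response
$headline = $crawler->filter( 'h1' )->first()->text();

// browser-like features: follow a link, then step back through the history
$crawler = $client->click( $crawler->selectLink( 'About' )->link() );
$crawler = $client->back();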

 

The crawler command takes two required arguments: the starting link to crawl and the username to authenticate with so that restricted pages can be crawled. It also takes a few options: the maximum number of pages to crawl, which keeps the command from crawling indefinitely; keywords that mark a route to be indexed only once, which prevents infinite crawling of dynamic links; and the name of a security firewall to authenticate with. To start, create the command class and set up the arguments and options.

<?php
namespace Acme\Bundle\Command;

use Symfony\Bundle\FrameworkBundle\Command\ContainerAwareCommand;
use Symfony\Component\Console\Input\InputArgument;
use Symfony\Component\Console\Input\InputOption;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

use Symfony\Component\HttpFoundation\RedirectResponse;
use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\HttpKernel\Client;
use Symfony\Component\BrowserKit\Cookie;

use Symfony\Component\Security\Core\Authentication\Token\UsernamePasswordToken;

/**
 * This class crawls the Acme site
 *
 * @author  Joe Sexton <joe@webtipblog.com>
 */
class SiteCrawlerCommand extends ContainerAwareCommand
{
    /**
     * @var OutputInterface
     */
    protected $output;

    /**
     * @var Router
     */
    protected $router;

    /**
     * @var EntityManager
     */
    protected $entityManager;

    /**
     * @var string
     */
    protected $username = null;

    /**
     * @var string
     */
    protected $securityFirewall = null;

    /**
     * @var integer
     */
    protected $searchLimit;

    /**
     * index routes containing these keywords only once
     * @var array
     */
    protected $ignoredRouteKeywords;

    /**
     * @var string
     */
    protected $domain = null;

    /**
     * Configure
     *
     * @author  Joe Sexton <joe@webtipblog.com>
     */
    protected function configure()
    {
        $this
            ->setName( 'crawler:crawl' )
            ->setDescription( 'Crawls the Acme website.' )
            ->setDefinition(array(
                new InputArgument( 'startingLink', InputArgument::REQUIRED, 'Link to start crawling' ),
                new InputArgument( 'username', InputArgument::REQUIRED, 'Username' ),
                new InputOption( 'limit', null, InputOption::VALUE_REQUIRED, 'Limit the number of links to process, prevents infinite crawling', 20 ),
                new InputOption( 'security-firewall', null, InputOption::VALUE_REQUIRED, 'Firewall name', 'default_firewall' ),
                new InputOption( 'ignore-duplicate-keyword', null, InputOption::VALUE_IS_ARRAY|InputOption::VALUE_REQUIRED, 'Index routes containing this keyword only one time (prevents infinite crawling of routes containing query parameters)', array() ),
            ))
            ->setHelp(<<<EOT
The <info>crawler:crawl</info> command crawls the Acme website:

<info>php app/console crawler:crawl <startingLink> <username></info>
EOT
            );
    }

    /**
     * Execute
     *
     * @author  Joe Sexton <joe@webtipblog.com>
     * @param   InputInterface $input
     * @param   OutputInterface $output
     * @todo    use product sitemap to crawl product pages
     */
    protected function execute( InputInterface $input, OutputInterface $output )
    {
        // user input
        $startingLink               = $input->getArgument( 'startingLink' );
        $this->domain               = parse_url( $startingLink, PHP_URL_HOST );
        $this->username             = $input->getArgument( 'username' );
        $this->searchLimit          = $input->getOption( 'limit' );
        $this->securityFirewall     = $input->getOption( 'security-firewall' );
        $this->ignoredRouteKeywords = $input->getOption( 'ignore-duplicate-keyword' );
        $this->output               = $output;
        $this->router               = $this->getContainer()->get( 'router' );
        $this->entityManager        = $this->getContainer()->get( 'doctrine.orm.entity_manager' );

        // start
        $output->writeln('
<info>A super-duper web crawler written by:

   ___              _____           _
  |_  |            /  ___|         | |
    | | ___   ___  \ `--.  _____  _| |_ ___  _ __
    | |/ _ \ / _ \  `--. \/ _ \ \/ / __/ _ \| |_ \
/\__/ / (_) |  __/ /\__/ /  __/>  <| || (_) | | | |
\____/ \___/ \___| \____/ \___/_/\_\\__\___/|_| |_|

</info>');

    }

    /**
     * Interact
     *
     * @author  Joe Sexton <joe@webtipblog.com>
     * @param   InputInterface $input
     * @param   OutputInterface $output
     */
    protected function interact( InputInterface $input, OutputInterface $output )
    {
        if ( ! $input->getArgument( 'startingLink' ) ) {
            $startingLink = $this->getHelper( 'dialog' )->askAndValidate(
                $output,
                'Please enter the link to start crawling:',
                function( $startingLink ) {
                    if ( empty( $startingLink ) ) {
                        throw new \Exception('starting link can not be empty');
                    }

                    return $startingLink;
                }
            );
            $input->setArgument( 'startingLink', $startingLink );
        }

        if ( ! $input->getArgument( 'username' ) ) {
            $username = $this->getHelper( 'dialog' )->askAndValidate(
                $output,
                'Please choose a username:',
                function( $username ) {
                    if ( empty( $username ) ) {
                        throw new \Exception( 'Username can not be empty' );
                    }

                    return $username;
                }
            );
            $input->setArgument( 'username', $username );
        }
    }

}

 

The configure and interact methods set up the command to run and take arguments; more information on how that works can be found in the Symfony console documentation. The execute method starts by setting some class properties based on user input. At this point you should be able to open your terminal, change into your project directory, and run the command with php app/console crawler:crawl.
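
For example, once the command is finished, an invocation that uses all of the options might look like this (the URL, username, and keyword values here are placeholders):

php app/console crawler:crawl http://example.com/ jsmith \
    --limit=50 \
    --security-firewall=default_firewall \
    --ignore-duplicate-keyword=product \
    --ignore-duplicate-keyword=search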

 

The next step is to create and boot the kernel; add this method to the SiteCrawlerCommand.

/**
 * createKernel
 *
 * @author  Joe Sexton <joe@webtipblog.com>
 * @return  \AppKernel
 */
protected function _createKernel() {

    $rootDir = $this->getContainer()->get( 'kernel' )->getRootDir();
    require_once( $rootDir . '/AppKernel.php' );
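    // the 'test' environment is used so that the test.client service is available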
    $kernel = new \AppKernel( 'test', true );
    $kernel->boot();

    return $kernel;
}

 

Then call _createKernel() by adding the following to the execute() method:

$kernel = $this->_createKernel();

 

Next, get the Symfony Client, which is used to make the requests and retrieve page content.

$client = $kernel->getContainer()->get( 'test.client' );

 

In order to crawl pages that require a user to be logged in or possess certain roles, we’ll need to authenticate a user with those permissions. Start by creating an _authenticate() method as discussed in Symfony’s testing documentation:

/**
 * authenticate with a user account to access secured urls
 *
 * @author  Joe Sexton <joe@webtipblog.com>
 * @param   AppKernel $kernel
 * @param   Client $client
 */
protected function _authenticate( $kernel, $client ) {

    // however you retrieve a user in your application
    $user  = $this->entityManager->getRepository( 'Entity:User' )->findOneByUsername( $this->username );
    $token = new UsernamePasswordToken( $user, null, $this->securityFirewall, $user->getRoles() );

    // store the security token in the session
    $session = $client->getContainer()->get( 'session' );
    $session->set( '_security_'.$this->securityFirewall, serialize( $token ) );
    $session->save();

    // pass the session cookie to the client's cookie jar
    $cookie = new Cookie( $session->getName(), $session->getId() );
    $client->getCookieJar()->set( $cookie );
}

 

Of course, we need to add a call to _authenticate() from execute():

$this->_authenticate( $kernel, $client );

 

It’s time to request the starting page. Adding these lines to execute() will fetch the first page and return a DomCrawler object with its contents.

// start crawling
$output->writeln( sprintf( 'Dominating <comment>%s</comment>, starting at <comment>%s</comment>.  At most, <comment>%s</comment> pages will be crawled.', $this->domain, $startingLink, $this->searchLimit ) );

// crawl starting link
$crawler = $client->request( 'GET', $startingLink );

// redirect if necessary
while ( $client->getResponse() instanceof RedirectResponse ) {
    $crawler = $client->followRedirect();
}

Make sure to follow any redirects that your site may return.

 

At this point you can do whatever you’d like with the DomCrawler that has been returned. In my implementation, I filtered all of the links on the page that were part of the $domain with this method.

/**
 * get all links on the page as an array of urls
 *
 * @author  Joe Sexton <joe@webtipblog.com>
 * @param   Crawler $crawler
 * @return  array
 */
protected function _getLinksOnCurrentPage( Crawler $crawler ) {

    $links = $crawler->filter( 'a' )->each( function ( Crawler $node, $i ) {
        return $node->link()->getUri();
    });

    // remove outbound links
    foreach ( $links as $key => $link ) {
        $this->output->writeln( 'Link: '.$link );
        $linkParts = parse_url( $link );
        if ( empty( $linkParts['host'] ) || $linkParts['host'] !== $this->domain || $linkParts['scheme'] !== 'http' ) {

            unset( $links[$key] );
        }
    }

    return array_values( $links );
}

 

Then I added them to an array and processed each one, adding new links along the way until I ran out of links or reached the $searchLimit.
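
That processing loop is not shown above, but a rough sketch of the idea might look like the method below. The _processPage() call and the bookkeeping details are hypothetical placeholders, and the ignore-duplicate-keyword handling is omitted; see the Gist linked below for the real implementation.

/**
 * crawl a queue of links until none remain or the search limit is reached
 * (rough sketch only)
 *
 * @param   Client $client
 * @param   string $startingLink
 */
protected function _crawlLinks( $client, $startingLink ) {

    $queue     = array( $startingLink );
    $processed = array();

    while ( ! empty( $queue ) && count( $processed ) < $this->searchLimit ) {

        $link = array_shift( $queue );
        if ( in_array( $link, $processed ) ) {
            continue;
        }

        // request the page, following redirects just like the starting link
        $crawler = $client->request( 'GET', $link );
        while ( $client->getResponse() instanceof RedirectResponse ) {
            $crawler = $client->followRedirect();
        }

        // index the page however you like here, e.g. $this->_processPage( $crawler );

        $processed[] = $link;

        // queue up any new same-domain links found on this page
        foreach ( $this->_getLinksOnCurrentPage( $crawler ) as $newLink ) {
            if ( ! in_array( $newLink, $processed ) && ! in_array( $newLink, $queue ) ) {
                $queue[] = $newLink;
            }
        }
    }

    $this->output->writeln( sprintf( 'Crawled <comment>%d</comment> pages.', count( $processed ) ) );
}

Calling $this->_crawlLinks( $client, $startingLink ) at the end of execute() ties the pieces together.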

 

To see my full implementation, check out my SiteCrawlerCommand Gist.