There have been a few times when I have needed to crawl a Symfony 2 site to index pages and execute code, so I built a crawler console command using the Symfony 2 DomCrawler and Client. This is a fun alternative to using curl, and the Client offers plenty of browser-like features that come in handy, such as a saved history of visited pages or testing the forward and back button functionality on your pages. The authentication cookie below can also be used for a curl request to protected pages if desired.

The DomCrawler class allows you to manipulate the DOM while the Client class functions like a browser to make requests and receive responses, as well as follow links and submit forms. Symfony has documented how this works in the Testing chapter of The Book, but I needed something that would work outside of unit and functional tests in the form of a console command that could be scheduled to run.

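For reference, this is roughly how the two classes cooperate inside a functional test, in the style of the Testing chapter (the URL and link text here are placeholders):

```php
// Inside a WebTestCase: the Client drives requests, the Crawler inspects responses
$client = static::createClient();
$crawler = $client->request('GET', '/demo/hello/world');

// Browser-like features: follow a link, then use the history
$link = $crawler->selectLink('Read more')->link();
$crawler = $client->click($link);
$client->back();
```

The rest of this post moves that same workflow into a console command.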

The crawler command is designed to take two required arguments: the starting link to crawl and the username to authenticate with so that restricted pages can be crawled. It also takes a few options: a limit on the number of pages to crawl so the command cannot crawl indefinitely, keywords to match against routes so that a matching route is only indexed once (which prevents infinite crawling of dynamic links), and the name of a security firewall to authenticate against. To start, create the command class and set up the arguments and options.

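A sketch of that setup might look like the following; the namespace, option names, and defaults are illustrative, so adjust them to your project:

```php
<?php

namespace Acme\CrawlerBundle\Command;

use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputArgument;
use Symfony\Component\Console\Input\InputOption;

class SiteCrawlerCommand extends Command
{
    protected function configure()
    {
        $this
            ->setName('crawler:crawl')
            ->setDescription('Crawl a site, starting from the given link')
            ->addArgument('startingLink', InputArgument::REQUIRED, 'Full URL to start crawling from')
            ->addArgument('username', InputArgument::REQUIRED, 'Username to authenticate with')
            ->addOption('limit', null, InputOption::VALUE_REQUIRED, 'Maximum number of pages to crawl', 50)
            ->addOption('keywords', null, InputOption::VALUE_REQUIRED, 'Route keywords that should only be indexed once')
            ->addOption('firewall', null, InputOption::VALUE_REQUIRED, 'Security firewall to authenticate against', 'main');
    }
}
```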

The configure() and interact() methods set up the command to run and take arguments; more information on how that works can be found in the Symfony console documentation. The execute() method starts by setting some class properties based on user input. At this point you should be able to open your terminal and, in your project directory, run the command with php app/console crawler:crawl.

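Something along these lines, assuming the usual InputInterface and OutputInterface use statements; the property names are simply what the later snippets assume, and interact() only prompts for the starting link as an example:

```php
protected function interact(InputInterface $input, OutputInterface $output)
{
    // prompt for anything that was not supplied on the command line
    if (!$input->getArgument('startingLink')) {
        $dialog = $this->getHelper('dialog');
        $input->setArgument('startingLink', $dialog->ask($output, 'Starting link: '));
    }
}

protected function execute(InputInterface $input, OutputInterface $output)
{
    $this->output       = $output;
    $this->startingLink = $input->getArgument('startingLink');
    $this->username     = $input->getArgument('username');
    $this->searchLimit  = (int) $input->getOption('limit');
    $this->keywords     = $input->getOption('keywords');
    $this->firewall     = $input->getOption('firewall');

    // only links on this host will be queued for crawling later
    $this->domain = parse_url($this->startingLink, PHP_URL_HOST);

    $output->writeln(sprintf('Crawling %s as %s', $this->startingLink, $this->username));

    // the kernel, client, authentication and crawl loop are added in the next steps
}
```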

The next step is to create and boot the kernel. Simply add this method to the SiteCrawlerCommand.

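A minimal sketch, assuming the command is run through app/console so the running kernel's class can be reused; booting in the test environment is what registers the test.client service used in the next step:

```php
/**
 * Boots a fresh kernel in the test environment so the services used by
 * functional tests (most importantly test.client) are available here.
 */
protected function _createKernel()
{
    // reuse whatever kernel class is already running this console command
    $kernelClass = get_class($this->getApplication()->getKernel());

    $kernel = new $kernelClass('test', true);
    $kernel->boot();

    return $kernel;
}
```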

Then call _createKernel() by adding the following to the execute() method:

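For example, keeping the kernel and its container on the command so the helper methods below can reach them:

```php
// inside execute(), after the input has been read
$this->kernel    = $this->_createKernel();
$this->container = $this->kernel->getContainer();
```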

Next, get the Symfony Client, which is used to make the requests and retrieve page content.

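With the test environment booted, the Client is available from the container as the test.client service:

```php
// the same Client class functional tests use, pulled from our freshly booted container
$this->client = $this->container->get('test.client');
```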

In order to crawl pages that require a user to be logged in or possess certain roles, we’ll need to authenticate a user with those permissions. Start by creating an _authenticate() method as discussed in Symfony’s testing documentation:

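A sketch based on that approach (the "simulate HTTP authentication" cookbook entry); the Cookie and UsernamePasswordToken use statements go at the top of the file, and fos_user.user_manager is only a stand-in for however your application loads users:

```php
// at the top of the file:
// use Symfony\Component\BrowserKit\Cookie;
// use Symfony\Component\Security\Core\Authentication\Token\UsernamePasswordToken;

/**
 * Logs the Client in by storing a security token in the session and handing
 * the session cookie to the Client's cookie jar.
 */
protected function _authenticate()
{
    // look the user up so the token carries their real roles;
    // swap fos_user.user_manager for whatever loads users in your app
    $user = $this->container->get('fos_user.user_manager')
        ->findUserByUsername($this->username);

    $token = new UsernamePasswordToken($user, null, $this->firewall, $user->getRoles());

    $session = $this->container->get('session');
    $session->set('_security_' . $this->firewall, serialize($token));
    $session->save();

    // the same session cookie can be passed to curl to reach protected pages
    $cookie = new Cookie($session->getName(), $session->getId());
    $this->client->getCookieJar()->set($cookie);
}
```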

Of course, we need to add a call to _authenticate() from execute():

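For example:

```php
// inside execute(), once the Client exists
$this->_authenticate();
$this->output->writeln(sprintf('Authenticated as %s on the %s firewall', $this->username, $this->firewall));
```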

It’s time to request the starting page. Adding these lines to execute() will fetch the first page and return a DomCrawler object with its contents. Make sure to follow any redirects that your site may return.

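Something like this, with followRedirects() switched on as noted above:

```php
// inside execute()
$this->client->followRedirects(true); // transparently follow any redirects
$crawler = $this->client->request('GET', $this->startingLink);

$this->output->writeln(sprintf(
    '%s returned status %d',
    $this->startingLink,
    $this->client->getResponse()->getStatusCode()
));
```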

At this point you can do whatever you’d like with the DomCrawler that has been returned. In my implementation I filtered all of the links on the page that were part of the $domain with this method.

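A version of that method, leaning on Crawler::links() (which resolves relative hrefs into absolute URIs) to keep only links that live on $domain; the Crawler use statement is assumed at the top of the file:

```php
/**
 * Returns the URI of every link on the page that points back into $this->domain.
 */
protected function _getInternalLinks(Crawler $crawler)
{
    $links = array();

    foreach ($crawler->filter('a')->links() as $link) {
        // Link::getUri() resolves relative hrefs against the current page
        $uri = $link->getUri();

        if (parse_url($uri, PHP_URL_HOST) === $this->domain) {
            $links[] = $uri;
        }
    }

    return array_unique($links);
}
```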

Then I added them to an array and processed each one, adding new links along the way until I ran out of links or reached the $searchLimit.

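The loop is essentially a queue; here is a simplified version (the keyword handling from the command's options is left out, the full version is in the Gist linked below):

```php
// inside execute(), after the starting page has been fetched
$queue   = $this->_getInternalLinks($crawler);
$visited = array($this->startingLink => true);

while (!empty($queue) && count($visited) < $this->searchLimit) {
    $link = array_shift($queue);

    if (isset($visited[$link])) {
        continue;
    }
    $visited[$link] = true;

    $this->output->writeln('Crawling ' . $link);
    $crawler = $this->client->request('GET', $link);

    // queue any new internal links discovered on this page
    foreach ($this->_getInternalLinks($crawler) as $found) {
        if (!isset($visited[$found])) {
            $queue[] = $found;
        }
    }
}
```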

To see my full implementation, check out my SiteCrawlerCommand Gist.