Scraping a page with PHP

I ran across a problem recently: I needed to get the contents of a page so that I could mimic a widget's functionality without having access to the database that the widget used (confused yet?). Basically, one site had a widget that displayed an upcoming event (information that came from a database), and I needed to copy that widget to another website. Moreover, I did not want to use JavaScript to accomplish my goal, because I needed the new site to be cached pretty heavily on the server.

My first thought was to use cURL, but when I ran curl_exec my function would output the entire page contents onto my new site, which I did not want; I only needed a small portion of the page (more on that cURL pitfall after the script). My next option was file_get_contents("my_url"), which, as was pointed out to me, returns the full page as a string. With that I could have accomplished my goal, but not as easily as I had hoped.

I had run into the DOMDocument PHP object a while back and was curious about what I could do with it, so I decided to do some research. That research eventually led me to my answer. It turns out the script is very simple, and I can see myself using it quite often, so I decided to share it. Here it is:

<?php
// Note: 'public' only belongs on methods inside a class, so it is dropped here
function scrapePage($url, $id)
{
    // Suppress libxml warnings from real-world (often invalid) HTML
    $d = new DOMDocument();
    libxml_use_internal_errors(true);
    $d->loadHTMLFile($url);
    libxml_clear_errors();

    // Pull out just the element we care about by its id attribute
    $widget = $d->getElementById($id);

    if ($widget !== null)
    {
        // Convert the DOM node back into markup and return it
        return simplexml_import_dom($widget)->asXML();
    }

    return null;
}
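
To use it, pass the page's URL and the id of the element you want back. A quick example call, with a hypothetical URL and element id standing in for real values:

<?php
// Hypothetical URL and element id - substitute your own
echo scrapePage('http://example.com/events', 'upcoming-event-widget');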
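
As a side note on the cURL attempt: curl_exec() prints the response directly unless you set CURLOPT_RETURNTRANSFER, which is almost certainly why the whole page was being dumped onto the new site. A minimal sketch (hypothetical URL again) of fetching the page as a string instead:

<?php
$ch = curl_init('http://example.com/events'); // hypothetical URL
// Return the response as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

// $html now holds the full page, just like file_get_contents(),
// and could be handed to DOMDocument via $d->loadHTML($html)

That said, the DOMDocument approach above skips the extra step and grabs only the element you need.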