Tuesday, November 15, 2011

PHP Simple HTML DOM Parser

If you want to develop advanced Google-like apps that crawl and parse HTML content using PHP, one valuable resource is PHP Simple HTML DOM Parser.
PHP Simple HTML DOM Parser underlies the E-media built-in crawler, and I use it in the EC_1 app. I also use it in the latest E-media Newsletter app.

What is PHP Simple HTML DOM Parser

If you take a look at the simple_html_dom.php file you will find about 1000 lines of PHP code. The core of that code is the simple_html_dom class. So, PHP Simple HTML DOM Parser is a PHP class that you can use to parse HTML content. In other words, you can analyze the HTML code of a web page and extract anchors, divs, spans, or table cells, or read element attributes such as id, class, etc.

Initiate PHP Simple HTML DOM Parser

First of all, you must include the PHP Simple HTML DOM Parser into your page. You can do this by using a simple PHP include construct:
include '<path_to_file>/simple_html_dom.php';
In the above code you must replace <path_to_file> with the path to simple_html_dom.php relative to your PHP script. Then you must provide the HTML source, either as a string or as a file path. So, you must choose between:
$html = str_get_html('<html><body>Hello!</body></html>');
or
$html = file_get_html('html_source.html');
or
$html = file_get_html('http://www.liviubalan.com/html_source.html');
As you can see, file_get_html accepts either a relative file path or an absolute URL.
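A load can fail, for example when the file is missing or empty. The sketch below is one possible way to guard against that. The helper name load_html_or_fail is hypothetical and not part of Simple HTML DOM, and it assumes that a failed load yields false or null, which appears to be the case in some versions of the library but may not hold for yours.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): calls a loader such as
// str_get_html or file_get_html and turns a failed load into an exception.
function load_html_or_fail(callable $loader, string $source)
{
    $html = $loader($source);
    // Assumption: a failed load is reported as false or null.
    if ($html === false || $html === null) {
        throw new RuntimeException("Could not load HTML from: $source");
    }
    return $html;
}

// Usage, assuming simple_html_dom.php has already been included:
// $html = load_html_or_fail('file_get_html', 'html_source.html');
// $html = load_html_or_fail('str_get_html', '<html><body>Hello!</body></html>');
```

Passing the loader as a callable keeps the check in one place for both the string and the file variants.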

Selecting HTML elements and attributes

After you initialize PHP Simple HTML DOM Parser, it's time to actually search for specific elements using nothing but CSS selectors:
// find all <div> elements whose id attribute is "foo"
$ret = $html->find('div[id="foo"]');

// find all anchors, returns an array of element objects
$ret = $html->find('a');

// find (N)th anchor, returns element object or null if not found (zero based)
$ret = $html->find('a', 0);

// find all <div> elements with the id attribute
$ret = $html->find('div[id]');

// find all elements that have an id attribute
$ret = $html->find('[id]');
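Once find returns its array of element objects, the usual next step is to loop over them and read attributes or text. The following is a minimal sketch of that step. The function name extract_links is hypothetical; href and plaintext are read as properties of each found element, as the library exposes them.

```php
<?php
// Hypothetical helper: collects the href and visible text of every anchor.
// $html is expected to be the object returned by str_get_html() or file_get_html().
function extract_links($html): array
{
    $links = [];
    foreach ($html->find('a') as $anchor) {
        $links[] = [
            'href' => $anchor->href,            // attribute read as a property
            'text' => trim($anchor->plaintext), // text content without tags
        ];
    }
    return $links;
}

// Usage, assuming simple_html_dom.php has already been included:
// $html = str_get_html('<a href="/about">About</a>');
// print_r(extract_links($html));
```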

Tips and tricks

If you take a look at the simplehtmldom.php file you can see PHP Simple HTML DOM in use. See the print_v and echo_pre functions, which I wrote. The find method usually returns an array, so the print_v function is very useful for displaying that array's content. If you use PHP's var_dump or print_r to display it, you will see a lot of properties for every element, which makes the page slow to load. So, use print_v instead, and modify it if you need to. Also, use the echo_pre function to see the content of a string, including HTML special chars and spaces.
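The print_v and echo_pre functions themselves are not reproduced in this post. As a rough idea of what such helpers might look like, here is a sketch under assumptions: the names pre_escaped and summarize_elements are hypothetical stand-ins, not the originals, and they return strings rather than printing so they are easy to check.

```php
<?php
// Hypothetical stand-in for echo_pre: wraps the string in <pre> and escapes
// HTML special characters so tags and spaces stay visible in the browser.
function pre_escaped(string $text): string
{
    return '<pre>' . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . '</pre>';
}

// Hypothetical stand-in for print_v: one short line per found element
// instead of the full object dump that var_dump() or print_r() would produce.
function summarize_elements(array $elements): string
{
    $lines = [];
    foreach ($elements as $i => $element) {
        $lines[] = $i . ': <' . $element->tag . '> ' . $element->outertext;
    }
    return pre_escaped(implode("\n", $lines));
}

// Usage, assuming $html came from str_get_html() or file_get_html():
// echo summarize_elements($html->find('a'));
```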
If you use this class inside a loop, especially on big HTML files, be very careful to free the memory. For this, PHP Simple HTML DOM provides the clear method:
$html->clear();
This is an empirical observation from when I worked on the EC_1 project, where out-of-memory errors kept appearing even after I increased the memory limit on the server or unset the $html variable.
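In a crawler loop, the clear call can be placed so that it runs after every document, even when the processing code throws. The sketch below shows one way to arrange that. The function name process_sources and its parameters are hypothetical; only clear comes from the library.

```php
<?php
// Hypothetical helper: parses each source in turn, hands the parsed document
// to $handler, then releases it before moving on to the next source.
function process_sources(array $sources, callable $loader, callable $handler): int
{
    $processed = 0;
    foreach ($sources as $source) {
        $html = $loader($source);
        if (!$html) {
            continue; // skip sources that could not be loaded
        }
        try {
            $handler($html);
            $processed++;
        } finally {
            $html->clear(); // let the parser release its internal references
            unset($html);   // then drop our own reference
        }
    }
    return $processed;
}

// Usage, assuming simple_html_dom.php has already been included:
// process_sources(['a.html', 'b.html'], 'file_get_html', function ($html) {
//     echo count($html->find('a')), "\n";
// });
```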
Another technique I use is what I call safe mode chaining: every time you chain Simple HTML DOM calls, check that the previously selected element exists before calling a method on it:
$span = $html->find('span', 0);
if ( isset($span) ) {
    // the <span> exists, so it is safe to search inside it
    $ret = $span->find('b', 0);
    echo_pre($ret);
}
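When a chain has more than two steps, the existence check can be folded into a small loop. This is only a sketch of the idea: find_chain is a hypothetical helper, not a library function, and it relies on find with an index returning null when nothing matches, as described in the selector comments earlier in this post.

```php
<?php
// Hypothetical helper: applies each selector to the first match of the
// previous one and returns null as soon as any step finds nothing.
function find_chain($root, array $selectors)
{
    $node = $root;
    foreach ($selectors as $selector) {
        if ($node === null) {
            return null; // an earlier step found nothing, stop here
        }
        $node = $node->find($selector, 0);
    }
    return $node;
}

// Usage, assuming $html came from str_get_html() or file_get_html():
// $bold = find_chain($html, ['span', 'b']);
// if ($bold !== null) {
//     echo $bold->plaintext;
// }
```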
You can find more about PHP Simple HTML DOM at http://simplehtmldom.sourceforge.net/. You can also try the PHP Simple HTML DOM examples at http://dev.liviubalan.com/_res/php/simplehtmldom/example/.
http://www.liviubalan.com/content/php-simple-html-dom-parser
