Parsing an RSS Feed with DOM and PHP

I have a little script that I thought I’d share for anyone out there looking for a simple way to parse an RSS feed (well, any XML doc, but I use it for RSS).

First off, we define the function to do the parsing:
[php]
function dom_to_simple_array($domnode, &$array) {
    $array_ptr = &$array;
    $domnode = $domnode->firstChild;
    while (!is_null($domnode)) {
        if (trim((string) $domnode->nodeValue) != "") {
            switch ($domnode->nodeType) {
                case XML_TEXT_NODE:
                    // Text content is stored under the 'cdata' key
                    $array_ptr['cdata'] = $domnode->nodeValue;
                    break;
                case XML_ELEMENT_NODE:
                    // Each element gets a new slot under its tag name
                    $array_ptr = &$array[$domnode->nodeName][];
                    if ($domnode->hasAttributes()) {
                        foreach ($domnode->attributes as $d_attribute) {
                            $array_ptr[$d_attribute->name] = $d_attribute->value;
                        }
                    }
                    break;
            }
            if ($domnode->hasChildNodes()) {
                // Standalone version; inside a class, call $this->dom_to_simple_array() instead
                dom_to_simple_array($domnode, $array_ptr);
            }
        } else {
            // Node with an empty or whitespace-only value (an empty element,
            // or whitespace between tags, which shows up under '#text')
            // echo $domnode->nodeName . " -> " . $domnode->nodeType . "\n";
            $array_ptr = &$array[$domnode->nodeName][];
            if ($domnode->hasAttributes()) {
                foreach ($domnode->attributes as $d_attribute) {
                    $array_ptr[$d_attribute->name] = $d_attribute->value;
                }
            }
        }
        $domnode = $domnode->nextSibling;
    }
}
[/php]
This function (I use it in a class) takes a DOMDocument object (we’ll get to that later) and an array that is passed by reference, so whatever the function adds ends up in the array we call it with. See the “References Explained” section of the PHP manual for more information on references.
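
If references are new to you, here’s a tiny sketch of the difference the & makes. The two helper functions are made up for this example and aren’t part of the parsing script:
[php]
// Hypothetical helpers, only here to illustrate pass-by-reference.
function add_item_by_value(array $list) {
    $list[] = 'new item';   // changes a local copy only
}

function add_item_by_reference(array &$list) {
    $list[] = 'new item';   // changes the caller's array
}

$feed = array();

add_item_by_value($feed);
$count_after_value = count($feed);      // the caller's array is untouched

add_item_by_reference($feed);
$count_after_reference = count($feed);  // the caller's array grew by one

echo $count_after_value . ' ' . $count_after_reference . "\n";
[/php]
That second behaviour is what lets the parsing function fill in the array we hand it.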

Using the DOM functionality, it goes through each node in the XML and creates the needed items in the $array passed in. When it finds a child node, it calls itself and the process starts all over again – until the script hits the end of the document.
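
To see the traversal on its own, here’s a stripped-down sketch of the same firstChild/nextSibling walk. It only collects element names, indented by depth, and skips text and attributes entirely. The function name list_elements and the sample XML are made up for the example:
[php]
// Hypothetical helper: walk the tree and record element names by depth.
function list_elements(DOMNode $node, array &$names, $depth = 0) {
    $child = $node->firstChild;
    while (!is_null($child)) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $names[] = str_repeat('  ', $depth) . $child->nodeName;
            // Recurse into this element's children, one level deeper
            list_elements($child, $names, $depth + 1);
        }
        $child = $child->nextSibling;
    }
}

// Sample document, standing in for a real feed
$dom = new DOMDocument();
$dom->loadXML('<rss><channel><title>Demo</title><item><title>One</title></item></channel></rss>');

$names = array();
list_elements($dom, $names);
echo implode("\n", $names) . "\n";
[/php]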

Now, we get to the other half of the equation – calling this function.

To call the dom_to_simple_array function, we first need to pull in the XML data we need to parse, then feed it in correctly to get a nice, happy array back out. Here’s the code:
[php]
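function getData() {
    $url = "http://webdevradio.com/podcast.php";
    $xml = array();
    $contents = file_get_contents($url);
    if ($contents === false) {
        return; // the feed could not be fetched
    }
    // loadXML() is an instance method, so create the document first
    $dom = new DOMDocument();
    $dom->loadXML($contents);
    // Standalone call; from inside a class it would go through $this
    dom_to_simple_array($dom, $xml);
    echo "<pre>";
    print_r($xml);
    echo "</pre>";
}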
[/php]
The script uses the value in $url to grab the remote XML file (in this case, an RSS feed) and pulls it into the variable $contents. Next, there’s a bit of DOM magic in the loadXML function to create the DOM object from it. After that, it’s just a simple matter of feeding it to the function we defined earlier.
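
One thing worth knowing: loadXML() returns false when the XML isn’t well formed, so it can pay to check before parsing. Here’s a rough sketch that uses inline strings instead of a remote feed so it runs anywhere. The helper name load_feed_xml is made up for this example:
[php]
// Hypothetical helper: returns a DOMDocument, or null if the XML is broken.
function load_feed_xml($contents) {
    // Collect libxml errors quietly instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $ok = $dom->loadXML($contents);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting
    return $ok ? $dom : null;
}

$good = load_feed_xml('<rss version="2.0"><channel><title>Demo</title></channel></rss>');
$bad = load_feed_xml('<rss><channel>'); // unclosed tags, not well formed

echo ($good ? 'good feed parsed' : 'good feed failed') . "\n";
echo ($bad ? 'bad feed parsed' : 'bad feed rejected') . "\n";
[/php]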

The end result? You have a nice array structure with the contents of the XML file making it easier to just use the normal array functionality (gotta love it) on the results…
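
As an example of what that looks like, here’s a sketch that pulls the item titles out of the array. It assumes the shape the function builds as I read it: each element name holds a list of entries, attributes sit next to the child elements, and text lives under 'cdata'. The $xml literal is hand-written sample data, not real output from the feed:
[php]
// Hand-written sample in the assumed shape (not real feed output)
$xml = array(
    'rss' => array(array(
        'version' => '2.0',
        'channel' => array(array(
            'title' => array(array('cdata' => 'Demo feed')),
            'item' => array(
                array('title' => array(array('cdata' => 'Episode 1'))),
                array('title' => array(array('cdata' => 'Episode 2'))),
            ),
        )),
    )),
);

// Collect the title of every <item> in the first channel
$titles = array();
foreach ($xml['rss'][0]['channel'][0]['item'] as $item) {
    $titles[] = $item['title'][0]['cdata'];
}

echo implode(', ', $titles) . "\n";
[/php]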

For the complete setup, you can grab the source here.

6 comments

  1. My 2 cents: I know that this post is about DOM, but I’d like to recommend SimpleXML instead – for simple XML documents like RSS it’s probably the best lightweight XML parsing PHP extension [and the code is much easier to write & read].

  2. The problem with both SimpleXML and DOM parsing is that they require valid XML. Not that steep of a requirement, but if you’re in an environment where you need to accept feeds that aren’t valid for some reason (such as curly quote characters), you’ve got to fall back to a SAX-style parser or regex (::shudder::).

  3. SimpleXML is nice, especially if you know the structure of the document you’re working with.

    If you’re pulling in random XML out there, though, it might make things harder. To do anything more complex than just accessing the data in the XML (like getting the name of a node), you’re almost forced to use dom_import_simplexml to work with it at all. Don’t get me wrong, SimpleXML is a very happy thing, but it is what it is – simple…
