How To Build a Universal Feed Reader
We detail the steps in building a feed reader that recognizes all formats, using the XML capabilities of PHP 5. Knowledge of the structure of an RSS file is essential for this study.
Structure of an RSS file
Any syndication file contains a list of items (articles, notes or other documents) and a description of the site they come from, known as the channel. For the channel as well as for each item, the file provides a title, a description and a URL.
Articles or documents
In all formats, the basic data are included: the link to the article, its title, and a summary.
<item>
<title>RSS Tutorials</title>
<link>https://www.scriptol.com/universal-reader.php</link>
<description>Tutorials for building and using RSS feeds</description>
</item>
The names of the tags differ depending on the format used. Other data can be provided, such as the author, a logo, etc.
The channel, or website providing contents
The feed includes a description of the source, that is, the site where the documents were published: its URL, the title of the home page, and a description of the site.
<channel>
<title></title>
<link>https://www.scriptol.com/</link>
<description></description>
</channel>
Here again, the name of tags depends on the format used.
The items of articles are placed after the description of the channel, as
seen in the various formats below.
Differences between formats
An overall difference between RSS 2.0 and Atom is that RSS 2.0 wraps the channel in an rss container, while Atom uses only the channel element. The other differences are in the names of the tags.
As for RSS 1.0, which is based on RDF, its syntax is far from that of the two other formats.
Format RSS 2.0
The example is based on that of the specification of the RSS 2.0 standard from Harvard.
<?xml version="1.0"?>
<rss version="2.0">
<channel>
<title>Xul News</title>
<link>https://www.scriptol.com/</link>
<description>Réaliser un lecteur de flux.</description>
<language>fr-FR</language>
<pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
<item>
<title>Tutoriel</title>
<link>https://www.scriptol.com/rss/</link>
<description></description>
<pubDate>Fri, 28 Sep 2007 09:39:21 GMT</pubDate>
</item>
</channel>
</rss>
Format RSS 1.0 based upon RDF
Format 1.0 uses the same tag names as 2.0, which will facilitate the construction of a universal reader. However, the structure differs. First, the rdf container belongs to a namespace of the same name. Second, the structure is declared in the channel tag, but the descriptive elements are added after it.
The example below is based on the specification of the standard RSS 1.0.
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://purl.org/rss/1.0/"
>
<channel rdf:about="http://www.xml.com/xml/news.rss">
<title>scriptol.com</title>
<link>https://www.scriptol.com</link>
<description> </description>
<image rdf:resource="https://www.scriptol.com/images/logo.gif" />
<items>
<rdf:Seq>
<rdf:li resource="https://www.scriptol.com/rss/" />
...other articles...
</rdf:Seq>
</items>
</channel>
<image rdf:about="https://www.scriptol.com/images/logo.gif">
<title>scriptol.com</title>
<link>https://www.scriptol.com</link>
<url>https://www.scriptol.com/universal/images/logo.gif</url>
</image>
<item rdf:about="https://www.scriptol.com/rss/l">
<title>RSS</title>
<link>https://www.scriptol.com/rss/</link>
<description> </description>
</item>...other items...
</rdf:RDF>
Even though the format is more complex, using it remains simple with the XML and DOM functions of PHP.
Structure of the Atom format
The Atom format uses the channel directly as the root container. The tag of the channel is feed and the elements are entry tags.
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Feed sample</title>
<link href="https://www.scriptol.com/rss/"/>
<updated></updated>
<author>
<name>Denis Sureau</name>
</author>
<entry>
<title>Building a feed reader</title>
<link href="https://www.scriptol.com/rss/rss-reader.php"/>
<updated></updated>
<summary>A description.</summary>
</entry>
</feed>
As we can see, Atom uses its own tag names while the two RSS formats share the same ones. We take advantage of this to identify the format of a feed file.
Using DOM with PHP 5
The Document Object Model can extract tags from an XML or HTML document. We will use the getElementsByTagName function to get a list of the tags whose name is given as a parameter. This function returns a DOMNodeList, which contains DOMNode elements. It applies to the whole document or to a single DOMNode element, so we can extract a part of the file (the channel or an item) and then, within this part, a list of tags.
Extracting the RSS channel
$doc = new DOMDocument("1.0"); // DOMDocument
$doc->load($url); // $url is the address of the feed
$channel = $doc->getElementsByTagName("channel"); // DOMNodeList
For Atom, we will use the parameter "feed" instead of "channel". Note that the class names accompanying the variables in the listings are for information only; they are not part of the PHP code.
Extracting the first element
$element = $channel->item(0); // DOMElement
The item() method returns a DOMNode, but when the node is a tag the object is actually a DOMElement and can be used as such. The advantage is that DOMElement has attributes and methods to access the contents of the element.
Extracting all elements
for($i = 0; $i < $channel->length; $i++)
{
$element = $channel->item($i);
}
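As a minimal sketch of this kind of loop, the script below parses a small RSS 2.0 document held in a string and collects the title of every item. The sample feed and the variable names are assumptions made for the illustration; they are not taken from the reader's source code.

```php
<?php
// Hypothetical sample feed, embedded so the sketch runs without network access.
$xml = <<<XML
<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample channel</title>
    <link>https://www.scriptol.com/</link>
    <description>Sample description</description>
    <item><title>First</title><link>https://www.scriptol.com/rss/</link></item>
    <item><title>Second</title><link>https://www.scriptol.com/</link></item>
  </channel>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

// Called on the document, getElementsByTagName returns every <item> it contains.
$items = $doc->getElementsByTagName("item");

$titles = array();
for ($i = 0; $i < $items->length; $i++) {
    $element = $items->item($i);
    // Search inside this item only, then take the first <title> found.
    $titles[] = $element->getElementsByTagName("title")->item(0)->textContent;
}
print_r($titles);
```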
Using the data of an element
For each item, as for the channel, the components are extracted with the same method and with the firstChild attribute. For example, the title:
$title = $element->getElementsByTagName("title"); // getting the list of title tags
$title = $title->item(0); // getting one tag
$title = $title->firstChild->textContent; // getting its content
Without a method for extracting a single element, getElementsByTagName is used to extract a list that actually contains one element, and the item method gives us this element.
In XML, the content of a tag is treated as a child node, so we use the firstChild property to get the content of an XML element, and textContent to get its text.
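One case deserves care: an empty tag such as <description></description> has no child node, so firstChild is null and reading a property on it raises a notice or a warning. The sketch below is one possible way to guard against this; the helper name firstTagText is hypothetical and is not part of the reader.

```php
<?php
// Hypothetical helper: returns the text of the first $tag inside $element,
// or an empty string when the tag is missing or empty.
function firstTagText(DOMElement $element, $tag)
{
    $node = $element->getElementsByTagName($tag)->item(0);
    // item(0) is null when no such tag exists inside the element.
    return ($node === null) ? "" : trim($node->textContent);
}

$doc = new DOMDocument();
$doc->loadXML("<item><title>RSS Tutorials</title><description></description></item>");
$item = $doc->documentElement;

echo firstTagText($item, "title"), "\n";
var_dump(firstTagText($item, "description")); // empty tag
var_dump(firstTagText($item, "author"));      // missing tag
```

Reading textContent on the tag itself, rather than on firstChild, also works when the text is wrapped in a CDATA section or split over several nodes.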
It remains to apply these methods to the channel and to each element of the feed to retrieve its contents.
For more general use, the function returns the contents in a two-dimensional array. It is then up to the programmer to display it directly in a Web page or to process the array further.
How to identify the format
Identifying the format is very simple if we know that RSS 1.0 and 2.0 use
the same tags, and therefore that the same functions could apply to both formats.
We recognize Atom by the feed container, while RSS 2.0 uses the rss container and RSS 1.0 uses rdf:RDF.
Because both RSS versions contain a channel tag, which Atom lacks, the presence of the feed tag is enough to recognize Atom.
$doc = new DOMDocument("1.0");
$doc->load($url); // $url is the address of the feed
$channel = $doc->getElementsByTagName("feed"); // DOMNodeList
$isAtom = ($channel->length > 0);
We try to extract the feed tag. If the parser finds this tag, the DOMNodeList contains at least one element and the isAtom flag is set to true; otherwise we treat the feed as RSS, without distinguishing between versions.
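If one also wants to tell the two RSS versions apart, the same idea can be extended. The following is only a sketch under assumptions: the function name detectFormat and the returned labels are hypothetical, and the reader described here does not need this distinction.

```php
<?php
// Hypothetical helper: guesses the feed format from the container tags.
function detectFormat(DOMDocument $doc)
{
    if ($doc->getElementsByTagName("feed")->length > 0) {
        return "atom";
    }
    if ($doc->getElementsByTagName("rss")->length > 0) {
        return "rss2";
    }
    // RSS 1.0: the root is rdf:RDF, whose local name (without prefix) is "RDF".
    $root = $doc->documentElement;
    if ($root !== null && $root->localName === "RDF") {
        return "rss1";
    }
    return "unknown";
}

$doc = new DOMDocument();
$doc->loadXML('<rss version="2.0"><channel><title>Xul News</title></channel></rss>');
echo detectFormat($doc), "\n"; // prints "rss2"
```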
Reading data channel
We know how to extract the channel. The same function can be used with the string "feed" or "channel" as parameter. It is assumed that the pointer to the document is the global variable $doc.
function extractChannel($chan)
{
global $doc;
$channel = $doc->getElementsByTagName($chan); // DOMNodeList
return $channel->item(0);
}
With the following function, called with the name of each tag as parameter, we can then read the title and the description of the channel.
function getTag($tag)
{
global $channel; // the element returned by extractChannel()
$content = $channel->getElementsByTagName($tag);
$content = $content->item(0);
return($content->firstChild->textContent);
}
We then call the function successively with "title", "link", "description", etc. as parameter.
The names depend on the format: it will be "summary" for Atom and "description" for the others.
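The link needs one more distinction: in the Atom sample shown earlier, the URL is carried by the href attribute of the link tag and not by its text, so reading the text returns an empty string. The sketch below shows one way to handle both cases; the function name getLink and its $isAtom parameter are assumptions made for the illustration.

```php
<?php
// Hypothetical helper: reads the link of a channel or of an item.
// Atom writes <link href="..."/>, RSS writes <link>...</link>.
function getLink(DOMElement $element, $isAtom)
{
    // Takes the first <link> found; a real Atom feed may hold several
    // (rel="self", rel="alternate"...), which this sketch does not sort out.
    $link = $element->getElementsByTagName("link")->item(0);
    if ($link === null) {
        return "";
    }
    return $isAtom ? $link->getAttribute("href") : trim($link->textContent);
}

$doc = new DOMDocument();
$doc->loadXML('<entry><title>Building a feed reader</title>'
    . '<link href="https://www.scriptol.com/rss/rss-reader.php"/></entry>');
echo getLink($doc->documentElement, true), "\n";
```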
Reading data elements
The principle is the same, but we have to loop over a list of items, whereas there is only one channel.
We must also take into account the fact that RSS 1.0 puts the descriptions of the items outside the channel tag, while they are contained inside it in RSS 2.0. The items are contained in feed in Atom and in channel in RSS 2.0, but in rdf:RDF in RSS 1.0.
The extractItems function extracts the list of elements; its parameter is "item" for RSS and "entry" for Atom:
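The practical consequence can be checked with a short sketch: in RSS 1.0, searching for items inside the channel finds nothing, while searching the whole document finds them. The sample document below is a reduced, hypothetical RSS 1.0 feed.

```php
<?php
// Reduced RSS 1.0 sample (hypothetical content): the item is a sibling
// of the channel, not one of its children.
$rss1 = <<<XML
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="https://www.scriptol.com/rss.xml">
    <title>scriptol.com</title>
  </channel>
  <item rdf:about="https://www.scriptol.com/rss/">
    <title>RSS</title>
  </item>
</rdf:RDF>
XML;

$doc = new DOMDocument();
$doc->loadXML($rss1);

$channel = $doc->getElementsByTagName("channel")->item(0);

// Searching inside the channel finds no item in RSS 1.0...
$inChannel = $channel->getElementsByTagName("item")->length;
// ...while searching the whole document finds it.
$inDocument = $doc->getElementsByTagName("item")->length;

echo "in channel: $inChannel, in document: $inDocument\n";
```

This is why the items are looked up from the document rather than from the channel element: the same call then works for the three formats.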
function extractItems($tag)
{
global $doc;
$dnl = $doc->getElementsByTagName($tag); // DOMNodeList
return $dnl;
}
The returned list is used to access each item, which is pushed into the array $a. Example with the RSS format:
$a = array();
$items = extractItems("item");
for($i = 0; $i < $items->length; $i++)
{
array_push($a, $items->item($i));
}
One can also directly create an array of the tags of an item (title, link, description) for each item and place it in a two-dimensional array.
To do this, we use a generic version of the getTag function defined earlier:
function getTag($item, $tag)
{
$content = $item->getElementsByTagName($tag);
$content = $content->item(0);
return($content->firstChild->textContent);
}
$FeedArray = array();
for($i = 0; $i < $items->length; $i++)
{
$a = array();
$item = $items->item($i);
array_push($a, getTag($item, "title"));
// ... and so on for each tag of the item ...
array_push($FeedArray, $a);
}
Each article is now placed in a two-dimensional array that can be simply displayed or used as we want. The loop will be put in the getTags function.
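To make the shape of the resulting array concrete, here is a self-contained sketch that builds one row per item from a list of tag names. The function name itemsToTable and its signature are hypothetical; they illustrate the principle and are not the getTags function of the reader.

```php
<?php
// Hypothetical function: returns one row per item, each row holding the
// text of the requested tags, in the order given.
function itemsToTable(DOMNodeList $items, array $tags)
{
    $feedArray = array();
    foreach ($items as $item) {
        $row = array();
        foreach ($tags as $tag) {
            $node = $item->getElementsByTagName($tag)->item(0);
            // Missing tags give an empty string rather than an error.
            $row[] = ($node === null) ? "" : trim($node->textContent);
        }
        $feedArray[] = $row;
    }
    return $feedArray;
}

$doc = new DOMDocument();
$doc->loadXML('<channel>'
    . '<item><title>RSS</title><link>https://www.scriptol.com/rss/</link></item>'
    . '<item><title>Tutoriel</title></item>'
    . '</channel>');

$table = itemsToTable($doc->getElementsByTagName("item"), array("title", "link"));
print_r($table);
```

Each row can then be read as $table[$i][0] for the title and $table[$i][1] for the link.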
Functions of the full reader
We now have a list of all functions useful for the universal reader.
- extractChannel: extracts the tag of the channel as an object.
- extractItems: extracts the items of the document as an object.
- getTag: reads data from a tag.
- getTags: places the contents of an element (article or channel) in an array.
With the appropriate parameters, these functions are used for all formats.
- Universal_Reader: wraps the entire process for a given feed, whatever its format.
- Universal_Display: customizable function to display a feed in an HTML page.
Loading the feed
In the most basic case, the feed is intended to be integrated into a Web page, either when the page loads or later at the user's request.
Whatever the format, and especially for feeds in languages with accented characters, care must be taken that the encodings are compatible: the feed is most often in UTF-8, while the page where it will appear is sometimes in ISO-8859-1 or windows-1252. It is better to use UTF-8 for the page to avoid a bad display of accented characters.
The encoding is given by the content-type meta with a line in the following format:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
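When the page cannot be served as UTF-8, the text read from the feed can be converted instead. PHP's DOM extension returns strings in UTF-8 whatever the encoding of the source file, so the conversion goes from UTF-8 to the encoding of the page. The sketch below assumes the iconv extension is available; the function name toLatin1 is hypothetical.

```php
<?php
// Hypothetical helper: converts text read from the feed (UTF-8, as returned
// by the DOM extension) to ISO-8859-1 for a legacy page.
function toLatin1($text)
{
    $converted = iconv("UTF-8", "ISO-8859-1", $text);
    // iconv() returns false when a character has no ISO-8859-1 equivalent;
    // in that case this sketch falls back to the original text.
    return ($converted === false) ? $text : $converted;
}

// "Réaliser un lecteur de flux." written with explicit UTF-8 bytes for "é".
$description = "R\xC3\xA9aliser un lecteur de flux.";
echo strlen($description), " bytes in UTF-8, ",
     strlen(toLatin1($description)), " bytes in ISO-8859-1\n";
```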
Loading with the page
To see a page that includes a feed, insert the following code in the HTML code:
<?php
include("universal-reader.php");
Universal_Reader("https://www.scriptol.com/rss.xml");
echo Universal_Display();
?>
See the demonstration given below.
Loading at request
This case arises when the visitor chooses a feed in a list or enters the name of the feed.
The loading can be done with Ajax for an asynchronous display, or in PHP alone by displaying the whole page again.
We will use a form with a text input field to enter the URL of the feed, or a single link (or a choice of links) on which one clicks to see a feed.
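Since the URL then comes from the visitor, it is prudent to check it before loading anything. The following sketch shows one possible check on the PHP side; the helper name cleanFeedUrl and the form field name "feed" are assumptions made for the example.

```php
<?php
// Hypothetical helper: accepts only well-formed http or https URLs.
function cleanFeedUrl($input)
{
    $url = trim((string) $input);
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return null;
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return ($scheme === "http" || $scheme === "https") ? $url : null;
}

// Assumed form field: <input type="text" name="feed">
$url = cleanFeedUrl(isset($_GET["feed"]) ? $_GET["feed"] : "");
if ($url !== null) {
    // With the reader of this article, one would then call:
    //   include("universal-reader.php");
    //   Universal_Reader($url);
    //   echo Universal_Display();
    echo "Loading " . htmlspecialchars($url) . "\n";
}
```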
Demonstration
- Universal RSS reader demos. Includes the reader and two demos.
More
- Which feed format to choose?
- Common Reader. API to build a universal reader.
Forum
Problem with atom links in the universal feed reader.
Atom links do not work
scriptol
<?php
$url="http://feeds.feedburner.com/eAnagnostis?format=xml";
$hnd=curl_init();
curl_setopt($hnd,CURLOPT_CONNECTTIMEOUT,5);
curl_setopt($hnd,CURLOPT_URL,$url);
curl_setopt($hnd,CURLOPT_RETURNTRANSFER,true); // needed so that curl_exec() returns the content
$page=curl_exec($hnd);
curl_close($hnd);
$doc=new DOMDocument();
$doc->loadXML($page);
echo $doc->saveXML();
?>
The feed is loaded. I have to look further in the code, maybe replace this code:
$doc = new DOMDocument();
$doc->load($url);
by the curl code above to make it work.
scriptol
$Universal_FeedArray = array();
$hnd=curl_init();
curl_setopt($hnd,CURLOPT_CONNECTTIMEOUT,5);
curl_setopt($hnd,CURLOPT_URL,$url);
curl_setopt($hnd,CURLOPT_RETURNTRANSFER,true); // needed so that curl_exec() returns the content
$page=curl_exec($hnd);
curl_close($hnd);
$Universal_Doc=new DOMDocument();
$Universal_Doc->loadXML($page);
The feed is loaded, but not properly formatted. I do not know if PHP is able to parse such a file. If you can display other feeds and not this one, the answer is no.