Using Yahoo!’s Term Extractor API

Yahoo’s Term Extractor is an application that extracts keywords from a piece of text, which you can then use as tags, metadata keywords or anything else you please. You can even have it return keywords based on a context, but we’ll stick to the simple implementation today.

To start with, you’ll need a Yahoo account and an App ID. You can grab one of those here: https://developer.yahoo.com/wsregapp/

Now, what we are going to build initially has been built a million times before and can be used at millions of sites across the ‘net. But we’ll tack something onto it which will make it a much more powerful application.
So, let’s look at what we need to do for a simple term extraction from Yahoo:

$url = 'http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction';
$appid = 'blahblahblah-getyourown';
$output = 'php';

$context = urlencode('This article is intended to give a basic introduction and a guide to push you in the right direction with your REST server implementation. As such, the examples used in this article are somewhat abstracted from real world uses. If there is demand for it I will write a much more in-depth “ultimate” REST server implementation article at a later date.');

$url = $url . '?appid=' . $appid . '&output=' . $output . '&context=' . $context;

$ch = curl_init();

// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// execute the request and return the response as a string
$response = curl_exec($ch);

// close cURL resource, and free up system resources
curl_close($ch);

$response = unserialize($response);

var_dump($response);

So, let’s go through the script so far.
The first few variables are simple enough: we have the $url, the $appid and the $output (which in our case is php, so Yahoo will return a serialized PHP string).
Next we have the context, which is the string to find terms in. At the moment it is just a static string, but we will eventually make a form to supply it. We need to make sure we urlencode() the context, or we will get errors from Yahoo when it contains spaces.
Now we construct the entire URL using the variables we set above.
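
If you would rather not glue the query string together by hand, PHP’s built-in http_build_query() can do the encoding for you. The sketch below is just one way to do it: the function name buildTermExtractionUrl() is made up for this example, and the parameter names are the ones used in the script above.

```php
<?php
// Hypothetical helper (not part of the original script): builds the full
// request URL and lets http_build_query() handle the URL encoding.
function buildTermExtractionUrl(string $baseUrl, string $appId, string $text, string $output = 'php'): string
{
    $query = http_build_query(
        [
            'appid'   => $appId,
            'output'  => $output,
            'context' => $text, // raw text; do NOT urlencode() it first
        ],
        '',
        '&',
        PHP_QUERY_RFC1738 // same encoding as urlencode(): spaces become "+"
    );

    return $baseUrl . '?' . $query;
}

echo buildTermExtractionUrl(
    'http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction',
    'blahblahblah-getyourown',
    'REST server implementation'
), "\n";
```

Because the encoding happens inside the helper, the context should be passed in as plain text; encoding it twice would send Yahoo a string full of literal %20-style sequences.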

We will use cURL to make the request, so we need to set that up now. curl_init() will initialise our cURL session and assign it to the variable $ch.
Then we need to set some options within our cURL session using the curl_setopt() function. The options we set are:

  1. CURLOPT_URL – the URL of the REST API we are accessing
  2. CURLOPT_HEADER – we set this to 0 so the HTTP headers are not included in our results
  3. CURLOPT_POST – makes the request an HTTP POST. Note that our parameters still travel in the URL’s query string; we could also use CURLOPT_HTTPGET to use GET instead.
  4. CURLOPT_RETURNTRANSFER – without this, curl_exec() would output the results straight to the browser; with it, we can assign the results to a variable for further manipulation
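
The same four options can also be set in one go with curl_setopt_array(), which makes it easier to add a failure check around curl_exec(). This is only a sketch: the helper names termExtractionCurlOptions() and fetchBody() are invented for illustration, and the timeout is an extra assumption that the original script does not set.

```php
<?php
// Hypothetical helper: collects the options from the list above in one array.
function termExtractionCurlOptions(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_HEADER         => false, // leave HTTP headers out of the result
        CURLOPT_POST           => true,  // same request method as the script above
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_TIMEOUT        => 10,    // assumption: not in the original script
    ];
}

// Hypothetical helper: runs the request and throws if cURL reports a failure.
function fetchBody(string $url): string
{
    $ch = curl_init();
    curl_setopt_array($ch, termExtractionCurlOptions($url));

    $body = curl_exec($ch); // string on success, false on failure
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }

    return $body; // the handle is released when $ch goes out of scope
}
```

With this in place, the request in the script above would become a single call such as $response = fetchBody($url);.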

Now we execute the cURL session and assign the results to a variable instead of outputting them to the screen (see #4 CURLOPT_RETURNTRANSFER above).
Lastly, we can close the cURL session and output our unserialized data to the user. If you run this script, you’ll see output similar to the below:

array(1) { ["ResultSet"]=> array(1) { ["Result"]=> array(3) { [0]=> string(21) "server implementation" [1]=> string(15) "right direction" [2]=> string(10) "real world" } } }
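
Since the response comes from a remote server, it is worth checking its shape before using it. Here is a minimal sketch, assuming the structure shown in the output above; extractTerms() is a made-up name, and the (array) cast is purely defensive in case a lone result ever arrives as a plain string.

```php
<?php
// Hypothetical helper: turns the serialized response into a flat list of terms,
// or an empty array if the response is not in the expected shape.
function extractTerms(string $serialized): array
{
    // allowed_classes => false stops unserialize() from instantiating objects;
    // the @ silences the notice PHP raises when the input is not serialized data.
    $data = @unserialize($serialized, ['allowed_classes' => false]);

    if (!is_array($data) || !isset($data['ResultSet']['Result'])) {
        return [];
    }

    return (array) $data['ResultSet']['Result'];
}

// Simulated response, built from the sample output above.
$sample = serialize([
    'ResultSet' => [
        'Result' => ['server implementation', 'right direction', 'real world'],
    ],
]);

print_r(extractTerms($sample));
```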

Great, that’s the very basic introductory way of grabbing a bunch of key terms from a static string. Now we’ll make some changes to our script so we can parse a URL instead.

At the top of the script where we have this:
$context = urlencode('This article is intended to give a basic introduction and a guide to push you in the right direction with your REST server implementation. As such, the examples used in this article are somewhat abstracted from real world uses. If there is demand for it I will write a much more in-depth “ultimate” REST server implementation article at a later date.');
We are going to change it to this:

$context = urlencode(strip_tags(file_get_contents('http://www.fliquidstudios.com/2009/01/13/introduction-to-writing-a-rest-server-in-php/')));
$context = substr($context, 0, 7000);

What this does is:

  1. uses file_get_contents() to grab the raw source of that page: text, HTML markup, the lot (images are only referenced in the markup, not downloaded).
  2. uses strip_tags() to get rid of any HTML tags in the string.
  3. uses urlencode() to properly format the string for use in the URL.
  4. uses substr() to make sure the string is no longer than 7000 characters, to keep the size down (I’m not sure what the max is, but 10,000 throws an error). Note that the cut is made on the encoded string, so it can land in the middle of a %XX escape.
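
The four steps above can be wrapped in one function that also tidies up after the cut. This is a sketch under a few assumptions: prepareContext() is a hypothetical name, collapsing whitespace is an extra step the original does not perform, and the clean-up only handles a cut inside a single %XX escape (a multi-byte character split across two escapes is not repaired).

```php
<?php
// Hypothetical helper: strips tags, encodes and truncates a chunk of HTML
// so it can be used as the "context" parameter.
function prepareContext(string $html, int $maxLength = 7000): string
{
    // Remove tags, then collapse the runs of whitespace they leave behind.
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($html)));

    // Encode first, then cut, so the limit applies to what is actually sent.
    $encoded = substr(urlencode($text), 0, $maxLength);

    // If the cut landed inside an escape, drop the dangling "%" or "%X".
    return preg_replace('/%[0-9A-Fa-f]?$/', '', $encoded);
}

echo prepareContext('<p>Hello   <b>world</b></p>'), "\n";
```

In the script, this would be used as $context = prepareContext(file_get_contents($pageUrl));, where $pageUrl stands for whichever page you want to analyse.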

Then, if you load your page, you’ll see a lot more results in the array returned.

So, now we have fetched a URL, grabbed its contents and passed them to Yahoo for term extraction. But it’s not overly user friendly: who wants to edit code every time they want to extract terms?

So, now we’ll build a form to give it all a bit of a UI, and we’ll also display the results in a nicer way.

We’ll wrap our script in an if statement and add our form:

if (array_key_exists('submit', $_REQUEST)) {

    // SCRIPT GOES IN HERE

} else {

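    print '<form method="post">';
    print 'url: <input type="text" name="url" />';
    print 'string: <textarea name="string"></textarea>';
    print '<input type="submit" name="submit" value="Submit" />';
    print '</form>';
}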

We’ll also change the $context variable to use whichever input was submitted. We want to make sure that even if the variable exists, it also has content, hence the use of strlen():

    if (array_key_exists('string', $_REQUEST) && strlen($_REQUEST['string']) > 0) {
        $context = urlencode($_REQUEST['string']);
    } else if (array_key_exists('url', $_REQUEST) && strlen($_REQUEST['url']) > 0) {
        $context = urlencode(strip_tags(file_get_contents($_REQUEST['url'])));
    }
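
Because the URL now comes from a visitor, it is sensible to check it before handing it to file_get_contents(), which will happily read local files too. The sketch below shows one way to do that; chooseSource() is a hypothetical helper, and limiting URLs to http and https is my own assumption about what the form should accept.

```php
<?php
// Hypothetical helper: decides which input to use, preferring the pasted string.
// Returns ['type' => 'string'|'url', 'value' => ...], or null if nothing usable was sent.
function chooseSource(array $request): ?array
{
    $string = is_string($request['string'] ?? null) ? trim($request['string']) : '';
    if ($string !== '') {
        return ['type' => 'string', 'value' => $string];
    }

    $url = is_string($request['url'] ?? null) ? trim($request['url']) : '';
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));

    // Only accept well-formed web addresses, never file:// or other schemes.
    $isWebUrl = filter_var($url, FILTER_VALIDATE_URL) !== false
        && in_array($scheme, ['http', 'https'], true);

    return $isWebUrl ? ['type' => 'url', 'value' => $url] : null;
}

var_dump(chooseSource(['string' => '', 'url' => 'http://www.fliquidstudios.com/']));
```

A null result is the cue to show the form again rather than carry on with an undefined $context.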

Lastly, we’ll write out the results in a nice ordered list:

    print '<ol>';
    foreach ($response['ResultSet']['Result'] as $key => $term) {
        print '<li>' . $term . '</li>';
    }
    print '</ol>';
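
The terms are derived from text we do not control, so it does no harm to escape them before printing. Here is a small sketch of the same list with htmlspecialchars() applied; renderTermList() is a made-up name, not something from the original script.

```php
<?php
// Hypothetical helper: builds the ordered list as a string, escaping each term
// so that characters like "<" or "&" cannot break the page markup.
function renderTermList(array $terms): string
{
    $html = '<ol>';
    foreach ($terms as $term) {
        $html .= '<li>' . htmlspecialchars((string) $term, ENT_QUOTES, 'UTF-8') . '</li>';
    }

    return $html . '</ol>';
}

print renderTermList(['server implementation', 'right direction', 'real world']);
```

Returning the markup as a string, instead of printing inside the loop, also makes the helper easy to reuse in a template.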

Beautiful. Now we have a fully functioning form that runs Yahoo’s term extraction on any string or URL we want.

As usual, the code for this post can be found here and the demo can be found here.


