14 September 2010

Wordle-like PHP script

Introduction

Tag clouds (or word clouds) are very trendy in the fashionable web 2.0.

But very often those clouds are rendered via HTML / CSS and are quite ugly.

I recently discovered the website http://www.wordle.net and was fascinated by the beautiful images it can generate. However, they are rendered by a Java applet, and the only way to use them is to take a screenshot and extract the image. I wanted clickable tag clouds, and unless you manually create an HTML image map, that is not possible with Wordle. So I decided to create my own tag cloud generator in PHP.

I started crawling the web for information and found an interesting post on StackOverflow from someone asking how to implement "something like Wordle". Surprisingly, Jonathan Feinberg, the creator of Wordle, replied to the post, explaining the basic idea:

Each word "wants" to be somewhere, such as "at some random x position in the vertical center". In decreasing order of frequency, do this for each word:

place the word where it wants to be
while it intersects any of the previously placed words
    move it one step along an ever-increasing spiral

That was all I needed to start coding a proof of concept, but a lot of problems still needed to be solved...


Bounding Boxes

In his reply Jonathan Feinberg says: "The hard part is in doing the intersection-testing efficiently, for which I use last-hit caching, hierarchical bounding boxes, and a quadtree spatial index".

Well, that was a bit too much for me; I needed to find a less efficient but simpler way to test for intersection.

So I came up with this idea:
  1. Each time a word is drawn, store its bounding box in an array
  2. To test if a new box intersects with the already drawn boxes do this:
    • For each bounding box in the array:
      • If the new box intersects the bounding box, there is an intersection

This leads to another problem to solve: how to test if two rectangles intersect.


Rectangle Collision Detection

The script I wrote only draws words either horizontally or vertically, so we only need to test the collision of axis-aligned boxes. This is quite simple to do.

Two axis-aligned boxes do not intersect when their projections onto one of the axes are disjoint. This is not the case for rotated boxes!

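// As written, the vertical tests assume top > bottom (y grows upward).
// With image coordinates, where y grows downward, reverse the two vertical comparisons.
if ($box1->bottom > $box2->top) return false;   // box1 lies entirely above box2
if ($box1->top < $box2->bottom) return false;   // box1 lies entirely below box2
if ($box1->right < $box2->left) return false;   // box1 lies entirely to the left
if ($box1->left > $box2->right) return false;   // box1 lies entirely to the right

return true;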

 

For arbitrarily rotated boxes you will need some more 2D geometry to test the collision, but for now let's keep it simple.
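
To make the last two sections concrete, here is a minimal runnable sketch that combines the rectangle test with the "loop over every stored box" check. It is an illustration, not the script's actual code: the `Box` class and the `intersects_any` function are hypothetical names, and it assumes image coordinates in which y grows downward (so top < bottom), which flips the two vertical comparisons relative to the snippet above.

```php
<?php
// Hypothetical class for illustration; y grows downward, so top < bottom.
final class Box
{
    public $left;
    public $top;
    public $right;
    public $bottom;

    public function __construct($left, $top, $right, $bottom)
    {
        $this->left = $left;
        $this->top = $top;
        $this->right = $right;
        $this->bottom = $bottom;
    }

    // True when the two boxes share at least one point (touching edges count).
    public function intersects(Box $other)
    {
        if ($this->right < $other->left) return false;  // entirely to the left
        if ($this->left > $other->right) return false;  // entirely to the right
        if ($this->bottom < $other->top) return false;  // entirely above
        if ($this->top > $other->bottom) return false;  // entirely below
        return true;
    }
}

// Linear scan: compare the candidate with every box stored so far.
function intersects_any(Box $candidate, array $placed)
{
    foreach ($placed as $box) {
        if ($candidate->intersects($box)) {
            return true; // one hit is enough
        }
    }
    return false;
}

// Boxes of the words already drawn.
$placed = [new Box(0, 0, 50, 20), new Box(60, 0, 90, 20)];

var_dump(intersects_any(new Box(40, 10, 70, 30), $placed)); // bool(true)
var_dump(intersects_any(new Box(0, 30, 50, 50), $placed));  // bool(false)
```

Each new word costs one comparison per word already placed, which is the price paid for skipping the quadtree and the hierarchical bounding boxes.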


Searching a place for the new word

We now have all the pieces to write the routine that searches for a free space to draw each new word. We start in the center of the image and move the word along a spiral until it no longer intersects the words already drawn.

$i = 0;
$x = <image_center_x>;
$y = <image_center_y>;
$place_found = false;
while (! $place_found) {
  // The step length $i / 2 grows on every iteration, so the path spirals outward
  $x = $x + ($i / 2 * cos($i));
  $y = $y + ($i / 2 * sin($i));
  $new_box = <place the word at x,y >;
  $place_found = <the new word does not overlap with existing words>;
  $i += 1;
}

return array($x, $y);

 

Changing the center of the spiral or its equation will lead to another distribution of the words in the image.
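
For readers who want something they can run, the routine above can be sketched as a function that takes the spiral center as a parameter. Everything beyond the spiral step itself is an assumption made for the example: the name `find_place`, the `$fits` callback (which stands in for "place the word and test it against the existing boxes"), and the `$maxSteps` safety limit, which the original loop does not have.

```php
<?php
// Hypothetical helper. $fits receives a candidate (x, y) and returns true
// when a word placed there overlaps nothing drawn so far.
function find_place($centerX, $centerY, callable $fits, $maxSteps = 10000)
{
    $x = $centerX;
    $y = $centerY;
    for ($i = 0; $i < $maxSteps; $i++) {
        // Same step as in the loop above: the stride $i / 2 grows with $i.
        $x += $i / 2 * cos($i);
        $y += $i / 2 * sin($i);
        if ($fits($x, $y)) {
            return [$x, $y];
        }
    }
    return null; // no free spot found within the step limit
}

// Toy example: a single 40x20 obstacle centered on (100, 100);
// a point "fits" as soon as it lies outside that rectangle.
$fits = function ($x, $y) {
    return abs($x - 100) > 20 || abs($y - 100) > 10;
};

$spot = find_place(100, 100, $fits);
var_dump($spot !== null && $fits($spot[0], $spot[1])); // bool(true)
```

The step limit is only there so that a full image cannot trap the search in an endless loop; whether the real script needs one depends on how it sizes the image.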

Since the PHP functions that draw text and measure its drawn dimensions use the top left corner as the reference point, the above algorithm tends to place all the vertical words on the left of the image. To prevent this I added a little bit of noise (random numbers) when selecting the center of the spiral for the vertical words.
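
One way to picture that noise is a small random offset applied to the spiral's starting point. The sketch below is only an assumption about how this could look: the function name `spiral_center`, the `$maxOffset` parameter and the choice to shift both coordinates are made up for the example and are not taken from the script.

```php
<?php
// Hypothetical helper: returns the starting point of the spiral for one word.
// $maxOffset is the largest random shift, in pixels, applied to vertical words.
function spiral_center($imageWidth, $imageHeight, $isVertical, $maxOffset = 50)
{
    $x = $imageWidth / 2;
    $y = $imageHeight / 2;
    if ($isVertical) {
        // Random shift so that vertical words do not all start from the same point.
        $x += mt_rand(-$maxOffset, $maxOffset);
        $y += mt_rand(-$maxOffset, $maxOffset);
    }
    return [$x, $y];
}

list($x, $y) = spiral_center(640, 480, true, 30);
var_dump(abs($x - 320) <= 30 && abs($y - 240) <= 30); // bool(true)
```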


Clickable images?

As stated at the top of this post, I wanted the generated tag clouds to be clickable. In other words I needed a mechanism to detect which word was clicked.

Since we store the bounding boxes of all the words we draw to detect the collisions, we can use this data to generate an HTML image map.

The problem is that each generation of a tag cloud produces a different image. This is caused by the noise added when searching for the position of the words, but also by some randomness I added in the calculation of the font sizes.

That means that the tag cloud image and the image map must be rendered and sent to the client in a single call. Unfortunately it is not possible to send back to the client browser an encoded image at the same time as some HTML.

To solve this issue I used a lesser-known feature of the HTML img tag: its src attribute accepts a data: URI, which embeds the base64-encoded image directly in the URL.

First render the image in a temporary file and encode its content in base 64:

$file = tempnam(getcwd(), 'img');
imagepng($cloud->get_image(), $file);
$img64 = base64_encode(file_get_contents($file));
unlink($file);
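
As an alternative to the temporary file, the PNG bytes can also be captured in memory with PHP's output buffering, since `imagepng()` writes to the output when it is given no file name. This is a sketch under assumptions, not what the script does: `render_png_bytes` and `png_data_uri` are hypothetical helper names, and the demo at the end only runs when the GD extension is loaded.

```php
<?php
// Hypothetical helper: capture the PNG that imagepng() would print.
function render_png_bytes($image)
{
    ob_start();
    imagepng($image);      // no file name: the PNG goes to the output buffer
    return ob_get_clean(); // return the buffered bytes and end the buffer
}

// Hypothetical helper: wrap raw PNG bytes in a data: URI for <img src="...">.
function png_data_uri($pngBytes)
{
    return 'data:image/png;base64,' . base64_encode($pngBytes);
}

// Demo with a blank 10x10 image, only if GD is available.
if (function_exists('imagecreatetruecolor')) {
    $image = imagecreatetruecolor(10, 10);
    $uri = png_data_uri(render_png_bytes($image));
    echo substr($uri, 0, 22), "\n"; // data:image/png;base64,
}
```

This avoids needing a writable working directory, at the cost of relying on output buffering.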
 

Then set the data as the image URL:

<img usemap="#mymap" src="data:image/png;base64,<?php echo $img64 ?>"
  border="0" alt="" />
 
Unfortunately this does not work in... well, as usual... Internet Explorer. This is outside the scope of this article, but you can find more information on how to fix the problem here: Embedding Base64 Image Data into a Webpage

Since we now return HTML instead of a PNG image, we might as well send the HTML image map along with it.
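
Generating the image map from the stored bounding boxes could look roughly like the sketch below. The input format (one array per word with its text and box), the function name `build_image_map` and the `/tag/` URL pattern are assumptions made for the example; only the map name `mymap` comes from the img tag above.

```php
<?php
// Hypothetical helper. Each entry of $words holds the word's text and its
// bounding box in image pixels: left, top, right, bottom.
function build_image_map(array $words, $mapName, $baseUrl)
{
    $html = '<map name="' . htmlspecialchars($mapName) . '">' . "\n";
    foreach ($words as $w) {
        // A rectangular area is described as "left,top,right,bottom".
        $coords = implode(',', [$w['left'], $w['top'], $w['right'], $w['bottom']]);
        $href = $baseUrl . urlencode($w['text']);
        $html .= sprintf(
            '  <area shape="rect" coords="%s" href="%s" alt="%s" />' . "\n",
            $coords,
            htmlspecialchars($href),
            htmlspecialchars($w['text'])
        );
    }
    return $html . '</map>';
}

$map = build_image_map(
    [['text' => 'php', 'left' => 10, 'top' => 20, 'right' => 60, 'bottom' => 40]],
    'mymap',
    '/tag/'
);
echo $map, "\n";
// <map name="mymap">
//   <area shape="rect" coords="10,20,60,40" href="/tag/php" alt="php" />
// </map>
```

The map's name has to match the `usemap="#mymap"` attribute of the image for the browser to connect the two.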


See it in action

Sorry IE users, this will not work... ;-(



Get the code

The source code of the complete script can be found on GitHub.

11 comments:

  1. Hi,
    Awesome work !

    I have to figure out how to limit the font size and the number of words to display on the screen. I can have very big texts... and I don't want the fonts to reach font size 120 :)
    Any advice ?


    PS: BTW, it works well in IE9 beta.

  2. Hello user :-)

    Thank you for your interest in this code. It's been a long time since I looked at it, but from a quick look I think you should try to modify how the frequency table is processed.

    Look into frequency_table.php, in the function process_frequency_table there is a TODO for something about the same :-)

    The easiest in your case would be to add a line such as this one at the end of the function:

    if ($this->table[$key]->size > MY_MAX_FONT_SIZE) {
        $this->table[$key]->size = MY_MAX_FONT_SIZE;
    }

    Let me know if you use this in some website and don't forget you can contribute via github to make the code better ;-)

  3. I've pushed on github a quick fix to limit the font size, enjoy :-)

  4. Awesome work! There are not a lot of pretty tag cloud solutions based on PHP out there. I tried using PyTagCloud recently but it turns out that installing the pygame library on a Red Hat server is a real pain.

  5. It looks really nice but when I played with your script a little bit I noticed that the generated image is always larger than specified.

    When there are fewer words to render it gets closer to the desired size but it's always at least 10% larger.

  6. Excellent pieces. Keep posting such kind of information on your blog. I really impressed by your blog.
    SEO tools

  7. Any chance you can make it so it places a black background color, even better set it to a variable so it can be changed on the fly?

  8. Hello, I have 2 questions,

    1. Is there any way to make random words repetitive if there is not enough words?
    2. How can I get results only in given resolution? like if I give 640x480 I want exactly size, not less or more.

    Thanks.

  9. Hi Sir

    How is your implementation different from the quadtree implementation?

    Regards
    Venkata Vineel
