This post has two goals: announcing that the graph gallery has gained a tag cloud, and showing how it was built.
The cloud is a simple tag cloud built from the words in the titles of the graphics included in the gallery. For this purpose, I am using an
XML dump of the main table of the gallery database. Here, for example, is the information for graph 12:
<graph>
  <id>12</id>
  <titre>Conditionning plots</titre>
  <titre_fr>graphique conditionnel</titre_fr>
  <comments>Conditioning plots</comments>
  <comments_fr>graphique conditionnel</comments_fr>
  <demo>graphics</demo>
  <notemoy>0.56769596199524</notemoy>
  <nbNote>421</nbNote>
  <nbKeywords>0</nbKeywords>
  <boolForum>0</boolForum>
  <px_w>500</px_w>
  <px_h>400</px_h>
</graph>
<graph>
We are interested in the <titre> element of each <graph> element. That is straightforward to get with the R4X package (I will do a post specifically on R4X soon):
x <- xmlTreeParse( "/tmp/rgraphgallery.xml" )$doc$children[[1]]
titles <- x["graph/titre/#"]
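For readers who do not have R4X at hand, the same extraction can be sketched with the XML package alone, which the script already relies on for xmlTreeParse. This is an alternative, not the method used for the gallery. The root element name, the second record and its title are made up for illustration.

```r
library(XML)  # assumed to be installed

# Hypothetical two-record sample in the shape of the gallery dump
xml_text <- '
<gallery>
  <graph>
    <id>12</id>
    <titre>Conditionning plots</titre>
  </graph>
  <graph>
    <id>13</id>
    <titre>Scatterplot&lt;br&gt;with margins</titre>
  </graph>
</gallery>'

doc <- xmlParse(xml_text, asText = TRUE)

# XPath: the text content of every <titre> directly under a <graph>
titles <- xpathSApply(doc, "//graph/titre", xmlValue)
print(titles)
```

Note that the escaped tag in the second title comes back as a literal <br> once the XML is parsed, which is why the titles need cleaning before they are split into words.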
Next, we want to extract the words from the titles. We need to be careful to remove the
<br>
tags that appear in some of the titles, drop any character that is not a letter or a space, and then separate the words by splitting on spaces. For that, we will use the
operators package, like this:
words <- gsub( "<br>", " ", titles )
words <- words %-~% "[^[:alpha:][:space:]]" %/~% "[[:space:]]"
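As a rough base-R equivalent, assuming that %-~% deletes every match of a regular expression and that %/~% splits on one and flattens the result, the cleanup could be written with gsub and strsplit. The sample titles are invented, and the last two steps (splitting on runs of whitespace and discarding empty strings) are additions that the original pipeline does not make.

```r
# Hypothetical titles, one with an embedded <br> tag
titles <- c("Conditionning plots", "Scatterplot<br>with margins", "3D pie-chart")

words <- gsub("<br>", " ", titles, fixed = TRUE)    # <br> tags become spaces
words <- gsub("[^[:alpha:][:space:]]", "", words)   # drop digits and punctuation
words <- unlist(strsplit(words, "[[:space:]]+"))    # split on runs of whitespace
words <- words[nzchar(words)]                       # discard any empty strings
print(words)
```

One side effect worth knowing: because punctuation is deleted rather than replaced by a space, a hyphenated word such as "pie-chart" ends up fused into "piechart".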
Next, we convert everything to lower case, and extract the 100 most used words:
words <- casefold( words )
w100 <- tail( sort( table( words ) ), 100 )
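To see what the counting step does, here is a small sketch on an invented word list, keeping the top 2 instead of the top 100.

```r
# Hypothetical word list standing in for the cleaned titles
words <- c("Plot", "plot", "Density", "plot", "density", "map")

words  <- casefold(words)       # lower case by default
counts <- sort(table(words))    # frequencies, sorted in ascending order
top2   <- tail(counts, 2)       # the last entries are the most frequent
print(top2)
```

Since the table is sorted in ascending order, tail keeps the most frequent words. Words tied at the cut-off are kept or dropped according to where they happen to fall in the sorted table.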
Finally, we sort the words alphabetically and generate the (fairly simple) HTML code:
w100 <- w100[ order( names( w100 ) ) ]
html <- sprintf( '
<a href="search.php?engine=RGG&q=%s">
<span style="font-size:%dpt">%s</span>
</a>
',
    names(w100),
    round( 20*log(w100, base = 5) ),
    names(w100) )
cat( html, file = "cloud.html" )
And that's it. You can see the result on the gallery front page.
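The font size in the sprintf call grows with the logarithm of the word count: each fivefold increase in frequency adds 20pt. A quick check of the formula on a few counts makes this concrete. The helper name font_size and the minimum-size variant are illustrative and not part of the original script.

```r
# Map a word count to a font size in points, as in the sprintf call
font_size <- function(n) round(20 * log(n, base = 5))

counts <- c(1, 5, 25, 125)
sizes  <- font_size(counts)
print(sizes)   # 0 20 40 60

# Hypothetical variant (not in the original script): enforce a minimum size
sizes_min8 <- pmax(sizes, 8)
print(sizes_min8)
```

A word that occurs only once maps to 0pt, so if such words can make it into the top 100, a lower bound like the pmax call above may be worth considering.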
Here is the full script:
### load the packages (xmlTreeParse comes from XML;
### R4X and operators are assumed to be installed)
library( XML )
library( R4X )
library( operators )

### read the xml dump
x <- xmlTreeParse( "rgraphgallery.xml" )$doc$children[[1]]

### extract the titles
titles <- x["graph/titre/#"]

### clean them up
words <- gsub( "<br>", " ", titles )
words <- words %-~% "[^[:alpha:][:space:]]" %/~% "[[:space:]]"

### get the 100 most used words
words <- casefold( words )
w100 <- tail( sort( table( words ) ), 100 )
w100 <- w100[ order( names( w100 ) ) ]

### generate the html using sprintf
html <- sprintf( '
<a href="search.php?engine=RGG&q=%s">
<span style="font-size:%dpt">%s</span>
</a>
',
    names(w100),
    round( 20*log(w100, base = 5) ),
    names(w100) )
cat( html, file = "cloud.html" )

### or using R4X again
# - we need an enclosing tag for that
# - note the &amp; instead of & to make the XML parser happy
w <- names(w100)
sizes <- round( 20*log(w100, base = 5) )
xhtml <- '##((xml
<div id="cloud">
  <@i|100>
    <a href="search.php?q={ w[i] }&amp;engine=RGG">
      <span style="font-size:{sizes[i]}pt" >{ w[i] }</span>
    </a>
  </@>
</div>'##xml))
html <- xml( xhtml )