Friday, June 22, 2012

R and the web (for beginners), Part II: XML in R


This second post of my little series on R and the web deals with how to access and process XML data with R. XML is a markup language that is commonly used to exchange data over the Internet. If you access online data through a website's API, you are likely to get it in XML format. So here is a very simple example of how to deal with XML in R.
Duncan Temple Lang wrote a very helpful R package (XML) that makes it quite easy to parse, process, and generate XML data with R. I use that package in this example. The XML document (taken from w3schools.com) used in this example describes a fictitious plant catalog. Not that thrilling, I know, but the goal of this post is not to analyze the given data but to show how to parse it and transform it into a data frame. The analysis is up to you...

How to parse/read this XML-document into R?
 
# install and load the necessary package

install.packages("XML")
library(XML)


# Save the URL of the xml file in a variable

xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"

# Use the xmlTreeParse-function to parse the xml file directly from the web
 
xmlfile <- xmlTreeParse(xml.url)


# the xml file is now saved as an object you can easily work with in R:

class(xmlfile)
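If the URL is not reachable from your machine, the same parsing step can be tried on XML text held in a string. The following is only a minimal sketch: the two plant entries are copied from the output shown in this post, while the root element name CATALOG and the object names xml.text and xmlfile.local are assumptions made for illustration.

```r
library(XML)

# Stand-in for the online catalog: two entries copied from this post.
# The root element name "CATALOG" is an assumption for illustration.
xml.text <- paste0(
  "<CATALOG>",
  "<PLANT><COMMON>Bloodroot</COMMON><BOTANICAL>Sanguinaria canadensis</BOTANICAL>",
  "<ZONE>4</ZONE><LIGHT>Mostly Shady</LIGHT><PRICE>$2.44</PRICE>",
  "<AVAILABILITY>031599</AVAILABILITY></PLANT>",
  "<PLANT><COMMON>Columbine</COMMON><BOTANICAL>Aquilegia canadensis</BOTANICAL>",
  "<ZONE>3</ZONE><LIGHT>Mostly Shady</LIGHT><PRICE>$9.37</PRICE>",
  "<AVAILABILITY>030699</AVAILABILITY></PLANT>",
  "</CATALOG>"
)

# asText = TRUE tells xmlTreeParse that the first argument is the XML
# content itself rather than a file name or URL
xmlfile.local <- xmlTreeParse(xml.text, asText = TRUE)

# Same kind of object as the one parsed from the web
class(xmlfile.local)
```

The rest of the steps in this post should work on xmlfile.local in the same way as on xmlfile.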



# Use the xmlRoot-function to access the top node

xmltop = xmlRoot(xmlfile)

# have a look at the XML-code of the first subnodes:

print(xmltop[1:2])

This should look more or less like:


$PLANT
<PLANT>
 <COMMON>Bloodroot</COMMON>
 <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
 <ZONE>4</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$2.44</PRICE>
 <AVAILABILITY>031599</AVAILABILITY>
</PLANT>

$PLANT
<PLANT>
 <COMMON>Columbine</COMMON>
 <BOTANICAL>Aquilegia canadensis</BOTANICAL>
 <ZONE>3</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$9.37</PRICE>
 <AVAILABILITY>030699</AVAILABILITY>
</PLANT>

attr(,"class")
[1] "XMLNodeList"
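Before extracting everything at once, it can help to poke at single nodes. The sketch below does this on a small stand-in document (two entries copied from the output above; the root element name CATALOG and the object names ending in .local are assumptions for illustration).

```r
library(XML)

# Stand-in document; the root element name "CATALOG" is an assumption
xml.text <- paste0(
  "<CATALOG>",
  "<PLANT><COMMON>Bloodroot</COMMON><BOTANICAL>Sanguinaria canadensis</BOTANICAL>",
  "<ZONE>4</ZONE><LIGHT>Mostly Shady</LIGHT><PRICE>$2.44</PRICE>",
  "<AVAILABILITY>031599</AVAILABILITY></PLANT>",
  "<PLANT><COMMON>Columbine</COMMON><BOTANICAL>Aquilegia canadensis</BOTANICAL>",
  "<ZONE>3</ZONE><LIGHT>Mostly Shady</LIGHT><PRICE>$9.37</PRICE>",
  "<AVAILABILITY>030699</AVAILABILITY></PLANT>",
  "</CATALOG>"
)
xmltop.local <- xmlRoot(xmlTreeParse(xml.text, asText = TRUE))

# Name of the top node and the number of its child nodes (one per plant)
xmlName(xmltop.local)
xmlSize(xmltop.local)

# Pick the first plant and read the text inside one of its tags
first.plant <- xmltop.local[[1]]
xmlName(first.plant)
xmlValue(first.plant[["COMMON"]])
```

xmlValue is the function that does the actual extraction in the next step; here it is applied to a single tag only.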

One can already guess what this data should look like in a matrix or data frame. The goal is to extract the XML values from each XML tag <> for all $PLANT nodes and save them in a data frame with a row for each plant ($PLANT node) and a column for each tag (variable) describing it. How can you do that?


# To extract the XML-values from the document, use xmlSApply:

plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
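To see what the nested xmlSApply call returns, here is a small sketch on a stand-in document (two entries copied from this post; the root element name CATALOG and the .local object names are assumptions for illustration). The inner call collects the text of every tag of one plant, and the outer call repeats that for every plant.

```r
library(XML)

# Stand-in document; the root element name "CATALOG" is an assumption
xml.text <- paste0(
  "<CATALOG>",
  "<PLANT><COMMON>Bloodroot</COMMON><BOTANICAL>Sanguinaria canadensis</BOTANICAL>",
  "<ZONE>4</ZONE><LIGHT>Mostly Shady</LIGHT><PRICE>$2.44</PRICE>",
  "<AVAILABILITY>031599</AVAILABILITY></PLANT>",
  "<PLANT><COMMON>Columbine</COMMON><BOTANICAL>Aquilegia canadensis</BOTANICAL>",
  "<ZONE>3</ZONE><LIGHT>Mostly Shady</LIGHT><PRICE>$9.37</PRICE>",
  "<AVAILABILITY>030699</AVAILABILITY></PLANT>",
  "</CATALOG>"
)
xmltop.local <- xmlRoot(xmlTreeParse(xml.text, asText = TRUE))

# Inner xmlSApply: one named character vector per plant (one element per tag)
# Outer xmlSApply: binds these vectors together, one column per plant
plantcat.local <- xmlSApply(xmltop.local, function(x) xmlSApply(x, xmlValue))

# The result is a character matrix with tags in rows and plants in columns
dim(plantcat.local)
rownames(plantcat.local)
```

Since the plants end up in the columns, the matrix has to be transposed before it is turned into a data frame with one row per plant.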


# Finally, put the data into a data frame and have a look at the first rows and columns

plantcat_df <- data.frame(t(plantcat),row.names=NULL)
plantcat_df[1:5,1:4]

The first rows and columns of that data frame should look like this:
 
               COMMON              BOTANICAL ZONE        LIGHT
1           Bloodroot Sanguinaria canadensis    4 Mostly Shady
2           Columbine   Aquilegia canadensis    3 Mostly Shady
3      Marsh Marigold       Caltha palustris    4 Mostly Sunny
4             Cowslip       Caltha palustris    4 Mostly Shady
5 Dutchman's-Breeches    Dicentra cucullaria    3 Mostly Shady
Which is exactly what we need to analyze this data in R.
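One thing to keep in mind before analyzing: every value extracted with xmlValue is text. The sketch below shows one possible way to convert two columns, using a small stand-in document (two entries copied from this post). The root element name CATALOG, the .local object names, and the reading of AVAILABILITY as a month-day-year date are assumptions for illustration. In the full catalog some columns may need extra care before converting.

```r
library(XML)

# Stand-in document; the root element name "CATALOG" is an assumption
xml.text <- paste0(
  "<CATALOG>",
  "<PLANT><COMMON>Bloodroot</COMMON><BOTANICAL>Sanguinaria canadensis</BOTANICAL>",
  "<ZONE>4</ZONE><LIGHT>Mostly Shady</LIGHT><PRICE>$2.44</PRICE>",
  "<AVAILABILITY>031599</AVAILABILITY></PLANT>",
  "<PLANT><COMMON>Columbine</COMMON><BOTANICAL>Aquilegia canadensis</BOTANICAL>",
  "<ZONE>3</ZONE><LIGHT>Mostly Shady</LIGHT><PRICE>$9.37</PRICE>",
  "<AVAILABILITY>030699</AVAILABILITY></PLANT>",
  "</CATALOG>"
)
xmltop.local <- xmlRoot(xmlTreeParse(xml.text, asText = TRUE))
plantcat.local <- xmlSApply(xmltop.local, function(x) xmlSApply(x, xmlValue))

# Keep the values as character strings instead of factors
plantcat.df <- data.frame(t(plantcat.local), row.names = NULL,
                          stringsAsFactors = FALSE)

# Strip the dollar sign and convert the price to a number
plantcat.df$PRICE <- as.numeric(sub("$", "", plantcat.df$PRICE, fixed = TRUE))

# Assumption: AVAILABILITY is a date written as month, day, two-digit year
plantcat.df$AVAILABILITY <- as.Date(plantcat.df$AVAILABILITY, format = "%m%d%y")

str(plantcat.df)
```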



31 comments:

  1. Hi
    How do I pass the parameters year="2012" and month="August" to this URL?
    It gives me no tables. Why?
    Thanks
    veepsirtt

    options(RCurlOptions = list(useragent = "R"))
    library(RCurl)
    url <- "http://www.bseindia.com/histdata/categorywise_turnover.asp"
    wp = getURLContent(url)

    library(RHTMLForms)
    library(XML)
    doc = htmlParse(wp, asText = TRUE)
    form = getHTMLFormDescription(doc)[[1]]
    fun = createFunction(form)
    o = fun(mmm = "9", yyy = "2012",url="http://www.bseindia.com/histdata/categorywise_turnover.asp")

    table = readHTMLTable(htmlParse(o, asText = TRUE),
    header = TRUE,
    stringsAsFactors = FALSE)
    table

  2. Hi veepsirtt,

    I'm not very familiar with the RHTMLForms package, so I might be the wrong guy to answer this question. Nevertheless, I guess the problem already occurs in your call to createFunction(). With your code, I get the following from that line:

    Error in if (action != "") formDescription$url = toString.URI(mergeURI(URI(action), :
    missing value where TRUE/FALSE needed

    Something seems to be wrong with the formDescription argument you are passing to createFunction().

    I'd recommend that you carefully check the documentation of this function and, in the worst case, contact the author of the function if the problem doesn't show up in any forum or mailing list.

  3. Hi, any thoughts on how to extract data from an embedded spreadsheet, as is in the following example: http://pakistanbodycount.org/drone_attack

    Thanks in advance!

  4. Hi Andrew,

    A good general starting point is to use Firebug (a Firefox extension) to inspect the website with the data you are interested in.

    What you refer to in your example as an "embedded spreadsheet" seems in the end to be an HTML table, for which the same techniques as described in my post on web scraping should work: http://giventhedata.blogspot.com/2012/08/r-and-web-for-beginners-part-iii.html

    Mind, though, that scraping data from a website, as in your example, is often a lot trickier than querying/extracting data from an XML document.

    best regards

  5. Thanks! Seem to be getting an error at the second step (mps.doc <- htmlParse(mps)) but am new to this and will keep playing. Appreciate the feedback!

    D

  6. Hi

    You can do this and get same result :D

    plantcat_df <-xmlToDataFrame(xml.url)

    1. Hi Claudio

      You've correctly pointed out that the XML package also comes with a convenient function (xmlToDataFrame) to "extract data from a simple XML document". There are mainly two reasons why I didn't want to point to that function in this post:

      1) If you are a novice in XML/R, you don't learn anything by just using xmlToDataFrame in the above example. The explicit aim of the post is to give some insight into how one can work with XML documents in R.

      2) As the documentation of xmlToDataFrame mentions, this function is made for "simple" XML documents. You will notice what this means as soon as you try to use xmlToDataFrame on a more complex XML structure than the very simple example above.

      A third, rather minor point is that even if xmlToDataFrame works in your setting, it is likely to be less efficient than a self-made function written with the functions pointed out in the example.

      Anyway, thanks for pointing this out! Mentioning the convenient function in the concluding remarks of my post would not have been a bad idea.

      Best,

  7. Hi there. Any thoughts on good ways to access very long and complicated XML documents? For example this one: http://pastebin.com/tFVwyJgt
