Parsing HTML pages like a boss with Groovy

Remember I told you that Groovy is really good for scripting? Time for some proof.

Let's say you want to download and parse an HTML page and get the content of the third list on the page. How do you do that? With URL and regular expressions?

Here I'll show you how it can be done in Groovy. I will use a Gumtree search result page as an example:

// Grab the HTTPBuilder component from the Maven repository
@Grab(group='org.codehaus.groovy.modules.http-builder',
    module='http-builder', version='0.5.2')
// Import the HTTPBuilder classes
import groovyx.net.http.*

def http = new HTTPBuilder("http://www.gumtree.com/cgi-bin/" +
 "list_postings.pl?search_terms=car&search_location=London&ubercat=1")

def html = http.get([:])

You might think that the html variable is just a string with the page content. It is not: it is an XML document tree (a GPathResult) read by Groovy's XmlSlurper. HTTPBuilder uses the NekoHTML parser, which tolerates malformed HTML such as unclosed tags, and that is very common in web pages.
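To make the "tree, not string" point concrete, here is a minimal sketch that answers the opening question (the content of the third list on a page). It assumes Groovy 3 or newer and parses a small, well-formed snippet directly with XmlSlurper instead of going through HTTPBuilder and NekoHTML, so it runs without network access. The markup and the variable names page, thirdList and items are invented for illustration.

```groovy
// Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is imported by default)
import groovy.xml.XmlSlurper

// Invented, well-formed stand-in for a downloaded page
def page = new XmlSlurper().parseText('''<HTML>
  <BODY>
    <UL><LI>Home</LI></UL>
    <UL><LI>Cars</LI><LI>Bikes</LI></UL>
    <UL><LI>First ad</LI><LI>Second ad</LI></UL>
  </BODY>
</HTML>''')

// Child elements are reached by tag name; [2] picks the third UL under BODY
def thirdList = page.BODY.UL[2]

// Collect the text of every LI in that list
def items = thirdList.LI.collect { it.text() }
println items
```

The same page.BODY... style of navigation should apply to the html variable returned by HTTPBuilder, as long as the real page has the structure you expect.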

Now that we have an XML tree of the page, we can do really neat stuff with it. For example, we can find all elements with a given class and do something with them:

html."**".findAll { it.@class.toString().contains("hlisting") }.each {
    // do something with each entry
}

The magic "**" property traverses all tags depth-first, and findAll uses the closure to filter them. The result is a collection of the XML nodes matched by the closure, so you can iterate through it using the standard each.
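A small sketch of that traversal, again assuming Groovy 3 or newer and an invented, well-formed snippet parsed directly with XmlSlurper (no HTTPBuilder involved). It shows that "**" reaches matching elements at any depth, not only direct children.

```groovy
// Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is imported by default)
import groovy.xml.XmlSlurper

// Invented markup: one listing directly under BODY, one nested inside a list
def page = new XmlSlurper().parseText('''<HTML>
  <BODY>
    <DIV class="hlisting offer"><A href="/ad/1" title="Red car">Red car</A></DIV>
    <DIV class="banner"><A href="/promo" title="Promo">Promo</A></DIV>
    <UL>
      <LI class="hlisting"><A href="/ad/2" title="Blue car">Blue car</A></LI>
    </UL>
  </BODY>
</HTML>''')

// "**" walks the whole tree depth-first; findAll keeps the nodes
// whose class attribute contains "hlisting"
def listings = page."**".findAll { it.@class.toString().contains("hlisting") }

// Each match is still a node, so we can ask for its tag name or go deeper
def names = listings.collect { it.name() }
def titles = listings.collect { it.A[0].@title.text() }
println names   // [DIV, LI]
println titles  // [Red car, Blue car]
```

Elements without a class attribute yield an empty string from it.@class.toString(), so they are simply filtered out.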

Each element of the iteration is an XML node, so you can still traverse deeper into it and extract information. Here is some code to get all ads from that result page, with the link to the full ad, the title of the ad and the image URL (the tag names are upper case because that is how NekoHTML reports elements by default):

def ads = html."**".findAll { it.@class.toString().contains("hlisting") }.collect {
    [
        link  : it.A[0].@href.text(),
        title : it.A[0].@title.text(),
        imgUrl: it.A[0].IMG[0].@src.text()
    ]
}

After that, ads will contain a list of maps with “link”, “title” and “imgUrl” keys (check out the HTML returned by the server to get an idea of what is being parsed).
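If the original page is not at hand, the following sketch shows the shape of the result on an invented snippet. The markup is an assumption made for illustration and is not Gumtree's actual HTML. It assumes Groovy 3 or newer and parses the string directly with XmlSlurper.

```groovy
// Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is imported by default)
import groovy.xml.XmlSlurper

// Invented sample with the structure the extraction code expects:
// an element with class "hlisting" containing a link that wraps an image
def html = new XmlSlurper().parseText('''<HTML><BODY>
  <DIV class="hlisting">
    <A href="/ad/1" title="Red car"><IMG src="/img/1.jpg"/></A>
  </DIV>
  <DIV class="hlisting">
    <A href="/ad/2" title="Blue car"><IMG src="/img/2.jpg"/></A>
  </DIV>
</BODY></HTML>''')

// Same extraction as above: one map per matched listing
def ads = html."**".findAll { it.@class.toString().contains("hlisting") }.collect {
    [link: it.A[0].@href.text(), title: it.A[0].@title.text(), imgUrl: it.A[0].IMG[0].@src.text()]
}

// The result is a plain list of maps, so the usual collection methods apply
ads.each { ad -> println "${ad.title} -> ${ad.link} (${ad.imgUrl})" }
```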

As you can see, parsing HTML pages in Groovy using HTTPBuilder is really easy and fun. But that's not everything HTTPBuilder can do! It also gives you tools to handle JSON/XML responses easily. I will discuss that later.
