Dec 15 2009
Head Smacking In Scala: XML Parsing
I program in a lot of different languages, everything from C and C++ to Awk and Sed, Visual Basic and ASP to PHP and JavaScript. I’m a bit of a jack-of-all-trades when it comes to languages, but the main one for the past 10 or so years has been Python. Python is the language that I automatically turn to when I say “I need to do ${X},” where X is any given task that does not require a UML diagram and use case studies. It’s fast, it’s powerful, and it’s about as comfortable as an old shoe.
Lately, many of my projects– including my really really big one– have been in Java. Since I haven’t programmed in Java since about 1998 (about when I picked up Python, notably) it’s been a hard road. Java has become a harsh mistress. That sweet young thing that was so easy going and flexible so many years ago has grown up to be a cynical, hard-edged woman with a riding crop in her hand.
At least, that’s been my recent experience.
Still, it’s not all been bad. One thing that’s great about it is that I’ve discovered the Scala programming language.
If Python and Java got together and had a son, and then if Haskell and Ruby got together and had a daughter, and then that son and daughter got together and had a baby, that baby would be named Scala.
Scala is a language for the Java platform that feels like a scripting language: powerful, yet fast. It’s super OOP-centric, yet still has a foot planted firmly in functional programming land. Its syntax is simple, yet it can use any Java class natively, and it compiles to ordinary JVM bytecode.
Scala is everything I remember Java promising back in the mid-90s. It’s basically a language that a Java-whipped Python programmer could only dream of, yet it’s real.
Still, as when learning any language, there are times when I smack my head. Today’s head smacker illustrates why it’s important to not make assumptions when programming in an unfamiliar language.
My XML Parser
Today, I was building an XML parser that would grab water quality station information from the websites of organizations like the Army Corps and USGS. This is normally something that I’d do in Python– I’d whip it out real fast and then forget about it– but I thought this would be another good chance to play with Scala.
It’s a very simple language.[1] My HTTP request client consists of only the following:[2]
    import java.io.InputStream
    import java.net.URL

    object Http {
      def request(urlString: String): (Boolean, InputStream) =
        try {
          val url = new URL(urlString)
          val body = url.openStream()
          (true, body)
        } catch {
          case ex: Exception => (false, null)
        }
    }
That’s all that’s needed to make the HTTP request to a server. As you can see, we can pull in the Java classes for InputStream and URL and use them natively. That’s quite nice.
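As an aside, here is a minimal sketch of the same idea using scala.util.Try instead of a (Boolean, InputStream) pair. This is not the version used in this post; the SafeHttp name is hypothetical, and Try only exists in Scala versions newer than the one I was using here.

```scala
import java.io.InputStream
import java.net.URL
import scala.util.Try

// Hypothetical variant of Http.request, shown only for comparison.
object SafeHttp {
  // Success(stream) when the URL opens, Failure(exception) otherwise.
  def request(urlString: String): Try[InputStream] =
    Try(new URL(urlString).openStream())
}

object SafeHttpDemo {
  def main(args: Array[String]): Unit = {
    // A string with no protocol fails inside new URL(...), before any network access.
    val result = SafeHttp.request("not a url")
    println(result.isFailure)
  }
}
```

The caller then pattern matches on Success or Failure rather than checking a flag and a possibly null stream.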
Scala’s design is a bit strange to me. The fact that you can’t have static methods in a class (they go in objects) is a little head scratching sometimes. If you want both static and instance methods, you get them by giving a class and an object the same name. Furthermore, you can have traits, which are somewhat like Java interfaces. Thus, a single name can refer to both a type and its companion object, which function both independently and together. Talk about head scratching.
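A minimal sketch of that arrangement, assuming made-up names (Named, Station, fromId and prefix are all hypothetical and do not appear in the parser code):

```scala
// A trait is somewhat like a Java interface, but it may carry concrete methods.
trait Named {
  def name: String
  def label: String = "station " + name
}

// Instance members live in the class.
class Station(val name: String) extends Named

// "Static" members live in the companion object, which shares the class name
// and must be defined in the same file.
object Station {
  val prefix = "USGS-"
  def fromId(id: String): Station = new Station(prefix + id)
}
```

Calling Station.fromId("1") goes through the object, while .name and .label on the result go through the class and the trait.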
While trying to get used to all this, I defined a base parser as a trait, because I’ll likely be creating parsers for a lot of different types of data sites.
    trait BaseParser {
      def baseUrl: String = "http://"

      def fetchAndParseURL(URL: String) {
        val (true, body) = Http request(URL)
        val xml = XML.load(body)
        xml
      }

      def fetchAndParseQuery(query: String) = fetchAndParseURL(baseUrl + query)
    }
Here is the skeleton of basic functionality for actually grabbing and parsing the XML of a website.[3] fetchAndParseURL() is the basic method, which grabs a URL and parses the XML (error checking and unit tests stripped here). fetchAndParseQuery() is a way to generically modify the base URL, with specific modifications to be made in the class.
There are some interesting things in how Scala defines methods. The biggest one is the lack of return statements. Scala treats the value of the last expression in a method as its return value. That’s a bit like magic, sometimes, and thus somewhat scary. Another one is the tendency to use equals signs and not require parentheses (as in baseUrl, which is a method). Another nice thing to note here is that method definitions can be written on one line. It makes some class definitions quite small:
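To make those three points concrete, here is a small sketch. The MethodForms object and its method names are hypothetical, chosen only to illustrate the syntax:

```scala
object MethodForms {
  // No parameter list: called as MethodForms.baseUrl, without parentheses.
  def baseUrl: String = "http://"

  // One-line method: the expression after '=' is the result.
  def withQuery(query: String): String = baseUrl + query

  // Multi-line body: the last expression is the result, no 'return' needed.
  def describe(query: String): String = {
    val full = withQuery(query)
    full + " (" + full.length + " chars)"
  }
}
```

For example, MethodForms.describe("example.org") evaluates to "http://example.org (18 chars)".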
    class USGSStation(siteID: String) extends Application with BaseParser {
      override def baseUrl = "http://qwwebservices.usgs.gov/Station/search?siteid=USGS-"

      def GetMetaData() {
        this.fetchAndParseURL(baseUrl + siteID)
      }
    }
So, here’s my first parser class. It takes a USGS site ID as a string, and overrides baseUrl to match the “station identification” REST query. It’s as simple and fast as Python, which blows my mind.
But it doesn’t work.
Smacking Your Head With A Unit
So, I spent a long time trying to figure out why this wouldn’t work. I mean, it compiled, and it ran, and it returned something. It just didn’t return what I wanted. I kept getting a null value of the type “Unit.”
Well, it turns out that this is the “don’t make assumptions” part of learning a new language. You see, in every other language I know (except JavaScript, which was designed explicitly to torture terrorist suspect detainees), methods are defined with a signature followed by the definition. That’s it. Signature, definition, done.
Scala is different. I thought that the equals sign was a clever way of making one-line functions. No, it turns out that it’s a necessary part of defining a method– at least if you want it to return anything.
Strangely, Scala doesn’t break when you mis-define a method. A method declared without the equals sign is treated as a procedure, so it just returns Unit– which, as much as I can tell, is the number 42. Once I figured out that my methods needed equals signs, everything worked as expected.
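Here is a minimal sketch of the difference, using hypothetical names (UnitDemo, compute, lost, kept). It assumes that the brace-only form, def lost() { ... }, means the same thing as declaring the result type Unit explicitly, which is how it is spelled out below so that the example does not depend on the brace-only shorthand being accepted by the compiler.

```scala
object UnitDemo {
  private def compute(): Int = 6 * 7

  // Result type Unit: compute() runs, then its value is thrown away.
  // Assumption: this is what `def lost() { compute() }` amounts to.
  def lost(): Unit = { compute() }

  // With '=' and no declared type, the result type is inferred as Int.
  def kept() = { compute() }
}

object UnitDemoMain {
  def main(args: Array[String]): Unit = {
    println(UnitDemo.lost()) // the Unit value, written ()
    println(UnitDemo.kept())
  }
}
```

Both methods compile and run, which is exactly why the mistake is so easy to miss: only the result type differs.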
Thus, here is the final XML grabber for the site metadata:
    import scala.xml.{Elem, XML}
    import java.io.InputStream
    import java.net.URL

    object Http {
      // Returns (true, stream) on success, (false, null) on any exception.
      def request(urlString: String): (Boolean, InputStream) =
        try {
          val url = new URL(urlString)
          val body = url.openStream()
          (true, body)
        } catch {
          case ex: Exception => (false, null)
        }
    }

    trait BaseParser {
      def baseUrl: String = "http://"

      // Note the '=': without it the method would return Unit.
      def fetchAndParseURL(URL: String): Elem = {
        val (true, body) = Http request(URL)
        val xml = XML.load(body)
        xml
      }

      def fetchAndParseQuery(query: String) = fetchAndParseURL(baseUrl + query)
    }

    // Application is the trait as it existed when this was written; later Scala versions replace it with App.
    class USGSStation(siteID: String) extends Application with BaseParser {
      override def baseUrl = "http://qwwebservices.usgs.gov/Station/search?siteid=USGS-"

      def GetMetaData() = this.fetchAndParseURL(baseUrl + siteID)
    }
…which works correctly. It grabs the site’s metadata as the full XML object, which I can then parse for elements such as county code, name, latitude/longitude and sensor type. I’m using this, in conjunction with a bunch of other as-yet-poorly-written-code to pull both sites and site data from their pseudo-REST interface.
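As a rough sketch of what pulling elements out of that XML object can look like: the element names and values below are made up for illustration (the real USGS response uses its own schema), and StationFields and field are hypothetical names. On newer Scala versions this also assumes the separate scala-xml library is on the classpath.

```scala
import scala.xml.XML

object StationFields {
  // Hypothetical document shape, standing in for a real response.
  val sample =
    """<Station>
      |  <Name>Example Creek</Name>
      |  <CountyCode>001</CountyCode>
      |  <Latitude>38.95</Latitude>
      |</Station>""".stripMargin

  val doc = XML.loadString(sample)

  // \\ searches the element and all of its descendants for the given label;
  // .text returns the text content of whatever matched.
  def field(label: String): String = (doc \\ label).text
}
```

Here StationFields.field("Name") gives "Example Creek", and numeric fields can be converted afterwards, for instance with StationFields.field("Latitude").toDouble.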
Coda
So far, I’m really impressed. Scala is a great language. The total time I spent on this (without including my stupid lack of equal sign problem) was not much more than what I’d spend on a Python version, and that’s without me being familiar with the language. More than that, we can byte compile it and use it within our larger infrastructure without resorting to something like Jython or another solution.
[1] And this example was made simpler by the fact that Raphael Ferreira already built up the code to parse Amazon’s website.
[2] Code here, except where otherwise licensed, is licensed under the MIT license.
[3] Yes, fellow geeks, I know that there are things I could do better here. This is a quick skeleton to get used to the language, not a final product that needs to be judged. Save that for when I can actually make something work.