Archive

Archive for December, 2011

Loading XML into MongoDB

December 29, 2011 Leave a comment

I’m starting a new app today and building out the data layer with MongoDB as my database. The app uses a collection from the USDA, that I thought makes a good sample for getting started with the “Load” portion of ETL into MongoDB.

The data is available from the USDA here – the raw XML for MyPyramid: http://explore.data.gov/download/b978-7txq/XML

Step 1 – Define a Class for the data

Although not absolutely necessary as you could build a raw BSON document directly from XML, you kind of miss out on some of the C# driver’s niceties if you do. Looking at the raw data, I came up with this class, along with a constructor that takes an XElement to handle the parsing. Strict DTO people might move that parsing to a function within the ETL process…up to you. The only MongoDB specific code here is the BsonId attribute, which I’ll put on the FoodCode property – a unique ID from the source system.

public class Food {
	[BsonId]
	public int FoodCode {get;set;}
	public string DisplayName {get;set;}
	public float PortionDefault {get;set;}
	public float PortionAmount {get;set;}
	public string PortionDisplayName {get;set;}
	public float Factor {get;set;}
	public float Increment {get;set;}
	public float Multiplier {get;set;}
	public float Grains {get;set;}
	public float WholeGrains {get;set;}
	public float Vegetables {get;set;}
	public float OrangeVegetables {get;set;}
	public float DarkGreenVegetables {get;set;}
	public float StarchyVegetables {get;set;}
	public float OtherVegetables {get;set;}
	public float Fruits {get;set;}
	public float Milk {get;set;}
	public float Meats {get;set;}
	public float Soy {get;set;}
	public float DryBeansPeas {get;set;}
	public float Oils {get;set;}
	public float SolidFats {get;set;}
	public float AddedSugars {get;set;}
	public float Alcohol {get;set;}
	public float Calories {get;set;}
	public float SaturatedFats {get;set;}
	
	public Food(XElement elem) {
		this.FoodCode = Int32.Parse(elem.Element("Food_Code").Value);
		this.DisplayName = elem.Element("Display_Name").Value;
		this.PortionDefault = float.Parse (elem.Element("Portion_Default").Value);
		this.PortionAmount = float.Parse (elem.Element("Portion_Amount").Value);
		this.PortionDisplayName = elem.Element("Portion_Display_Name").Value;
		if(elem.Element ("Factor") != null)
			this.Factor = float.Parse (elem.Element("Factor").Value);
		this.Increment = float.Parse (elem.Element("Increment").Value);
		this.Multiplier = float.Parse (elem.Element("Multiplier").Value);
		this.Grains = float.Parse (elem.Element("Grains").Value);
		this.WholeGrains = float.Parse (elem.Element("Whole_Grains").Value);
		this.Vegetables = float.Parse (elem.Element("Vegetables").Value);
		this.OrangeVegetables = float.Parse (elem.Element("Orange_Vegetables").Value);
		this.DarkGreenVegetables = float.Parse (elem.Element("Drkgreen_Vegetables").Value);
		this.StarchyVegetables = float.Parse (elem.Element("Starchy_vegetables").Value);
		this.OtherVegetables = float.Parse (elem.Element("Other_Vegetables").Value);
		this.Fruits = float.Parse (elem.Element("Fruits").Value);
		this.Milk = float.Parse (elem.Element("Milk").Value);
		this.Meats = float.Parse (elem.Element("Meats").Value);
		this.Soy = float.Parse (elem.Element("Soy").Value);
		this.DryBeansPeas = float.Parse (elem.Element("Drybeans_Peas").Value);
		this.Oils = float.Parse (elem.Element("Oils").Value);
		this.SolidFats = float.Parse (elem.Element("Solid_Fats").Value);
		this.AddedSugars = float.Parse (elem.Element("Added_Sugars").Value);
		this.Alcohol = float.Parse (elem.Element("Alcohol").Value);
		this.Calories = float.Parse (elem.Element("Calories").Value);
		this.SaturatedFats = float.Parse (elem.Element("Saturated_Fats").Value);
	}
}

You might notice I’m using float for my decimal values. That’s all the accuracy I need, but it does lose some precision. I’m rounding the data when I use it so it won’t really matter, but if your needs differ, choose a different numeric type.

Step 2 – Function for reading the XML file

This is a pretty small data file, only about 750 records, but loading it all into memory at once is a waste. I want to load the “Food_Display_Row” XML elements one at a time, convert to a Food object, store in MongoDB, and move on to the next. It’s a job for a streaming API and an iterator, powered by “yield return” to get one XElement at a time:

static IEnumerable<XElement> readElementStream(string fileName, string elementName) {
	using(var reader = XmlReader.Create(fileName)) {
		reader.MoveToContent();
		while(reader.Read()) {
			if(reader.NodeType == XmlNodeType.Element && reader.Name == elementName) {
				var e = XElement.ReadFrom (reader) as XElement;
				yield return e;
			}
		}
		reader.Close ();
	}
}

Step 3 – Pull it all together and load the data

With the pieces in place, the load process is pretty simple. Connect to the server, get the database (MongoDB creates it on first use), get the collection (MongoDB also creates the collection), and use the iterator to read the XML file, load each element into a Food object and insert into the MongoDB collection. At the end, we have a MongoDB database with a collection of data from the food guide pyramid.

var server = MongoServer.Create ("mongodb://localhost");
var db = server.GetDatabase("gov");
var foods = db.GetCollection<Food>("food");

foreach(var elem in readElementStream("~/Downloads/MyFoodapediaData/Food_Display_Table.xml", "Food_Display_Row")) {
	var food = new Food(elem);
	foods.Insert (food);
}

My favorite thing about this is that I never had to leave C# to create the database, parse the source XML, or load the data. I don’t have to run a separate ETL process or use management tools to configure my database schema. It’s a simple, self-contained solution.

My second favorite part is that I ran all of this under Ubuntu and Mono. It should work just as well under Windows and .NET, but life is better running under an open source software stack.

I hope you find this helpful if you’re getting started with MongoDB and want a little data to play with.

Categories: C#, MongoDB Tags: ,

If C# is so awesome, why use anything else?

December 29, 2011 20 comments

Anyone who knows me professionally knows I work in C# most of the time. I think it’s a great language that’s been well designed and made very portable by way of being an open language specification. A lot of people look at C# and say, that’s just Java with some Microsoft-extensions. Sort of, since it’s framework (.NET) ships with quite a few libraries that interoperate well with Windows, although the C# language itself doesn’t have anything to do with Windows, and runs on Linux, OS X, Android, iOS, and so on. In my opinion, Java has stagnated over the years, while C# has been evolving with generics (which Java followed), lambdas (Java finally gets them years later), anonymous types, partial method and class declarations, language integrated query (LINQ), dynamic runtime integration, and soon a simplified asynchronous programming model with await and async that will allow the runtime to deal with the gory details of async programming rather than forcing the programmer to understand and properly implement callbacks and cleanup. Java isn’t catching up fast, so Scala is filling the gaps, but C# remains years ahead.

Every year, my family looks at me funny when I they give me new books. That’s right, I’m a geek that reads computer books. Most of these books are not on C#, but on JavaScript, Python, Haskell, and I even keep an old PERL book on my shelf. What is all this other stuff?

JavaScript – it’s pretty rare these days that other developers would say, “why would you ever want to write JavaScript?” It’s a ubiquitous language amongst web browsers, and it’s pretty rare that anyone can write much of a browser-based application at all without it. Besides, the latest trend is to write a “language X to JavaScript converter” and what good would that be if I didn’t know JavaScript and wasn’t willing to learn language X? There have always been some nice server-side implementations, like Spidermonkey and newer V8, powering trendy applications like MongoDB and Node.js. Until I started down the Python path, whenever I needed extensibility, I would embed Spidermonkey for some JavaScript fun.

Python – in the realm of C#, a lot of people are uncomfortable mixing in Python. They don’t like the idea of losing compile-time checks and worry about needing a myriad of Python frameworks to solve any sizable development tasks. However, Python is an excellent tool for large and small projects alike, and IronPython take the Python language and gives it access to the full .NET framework. In the last few years, I’ve felt constrained if I didn’t have a layer of extensibility that IronPython can add to CLR applications. Python scripts let you treat code as data, meaning you can store it, transmit it, and change it at runtime. Python gives you a new way to move the problem around, solve it at a different time in your overall solution. It’s a great piece of the toolbox.

I remember spending weeks building business rules engines so non-programmers could add some logic to enterprise applications. These engines would use reflection and Lightweight Code Generation (LCG) and a clumsy UI where end users would select data objects and operators and build expression statements. IronPython uses LCG, is highly optimized, and gives you a general purpose scripting language with access to CLR objects. Most end users prefer the ability to write an expression in script rather than fumble with the type of UI needed to build an expression tree. This is just scratching the surface of the Python language, but at the very least, it’s a great tool anywhere you want to offer runtime extensibility.

Haskell is pure functional programming – no state, just functions. I used F# a bit for professional work just to learn it, but it allows you to fall back into the OOP line of thinking. Haskell makes you take a fully functional approach. I recommend every developer that’s looking to expand their approach to problem solving to spend some time with it.

What about that PERL 5 book? Well, I don’t use that, to be honest. I did once upon a time, but I really do avoid PERL at all costs. Maybe one day, I’ll pick it back up.

Categories: C#, IronPython Tags: , ,