Home > C#, MongoDB > Loading XML into MongoDB

Loading XML into MongoDB

I’m starting a new app today and building out the data layer with MongoDB as my database. The app uses a collection from the USDA, that I thought makes a good sample for getting started with the “Load” portion of ETL into MongoDB.

The data is available from the USDA here – the raw XML for MyPyramid: http://explore.data.gov/download/b978-7txq/XML

Step 1 – Define a Class for the data

Although not absolutely necessary as you could build a raw BSON document directly from XML, you kind of miss out on some of the C# driver’s niceties if you do. Looking at the raw data, I came up with this class, along with a constructor that takes an XElement to handle the parsing. Strict DTO people might move that parsing to a function within the ETL process…up to you. The only MongoDB specific code here is the BsonId attribute, which I’ll put on the FoodCode property – a unique ID from the source system.

public class Food {
	[BsonId]
	public int FoodCode {get;set;}
	public string DisplayName {get;set;}
	public float PortionDefault {get;set;}
	public float PortionAmount {get;set;}
	public string PortionDisplayName {get;set;}
	public float Factor {get;set;}
	public float Increment {get;set;}
	public float Multiplier {get;set;}
	public float Grains {get;set;}
	public float WholeGrains {get;set;}
	public float Vegetables {get;set;}
	public float OrangeVegetables {get;set;}
	public float DarkGreenVegetables {get;set;}
	public float StarchyVegetables {get;set;}
	public float OtherVegetables {get;set;}
	public float Fruits {get;set;}
	public float Milk {get;set;}
	public float Meats {get;set;}
	public float Soy {get;set;}
	public float DryBeansPeas {get;set;}
	public float Oils {get;set;}
	public float SolidFats {get;set;}
	public float AddedSugars {get;set;}
	public float Alcohol {get;set;}
	public float Calories {get;set;}
	public float SaturatedFats {get;set;}
	
	public Food(XElement elem) {
		this.FoodCode = Int32.Parse(elem.Element("Food_Code").Value);
		this.DisplayName = elem.Element("Display_Name").Value;
		this.PortionDefault = float.Parse (elem.Element("Portion_Default").Value);
		this.PortionAmount = float.Parse (elem.Element("Portion_Amount").Value);
		this.PortionDisplayName = elem.Element("Portion_Display_Name").Value;
		if(elem.Element ("Factor") != null)
			this.Factor = float.Parse (elem.Element("Factor").Value);
		this.Increment = float.Parse (elem.Element("Increment").Value);
		this.Multiplier = float.Parse (elem.Element("Multiplier").Value);
		this.Grains = float.Parse (elem.Element("Grains").Value);
		this.WholeGrains = float.Parse (elem.Element("Whole_Grains").Value);
		this.Vegetables = float.Parse (elem.Element("Vegetables").Value);
		this.OrangeVegetables = float.Parse (elem.Element("Orange_Vegetables").Value);
		this.DarkGreenVegetables = float.Parse (elem.Element("Drkgreen_Vegetables").Value);
		this.StarchyVegetables = float.Parse (elem.Element("Starchy_vegetables").Value);
		this.OtherVegetables = float.Parse (elem.Element("Other_Vegetables").Value);
		this.Fruits = float.Parse (elem.Element("Fruits").Value);
		this.Milk = float.Parse (elem.Element("Milk").Value);
		this.Meats = float.Parse (elem.Element("Meats").Value);
		this.Soy = float.Parse (elem.Element("Soy").Value);
		this.DryBeansPeas = float.Parse (elem.Element("Drybeans_Peas").Value);
		this.Oils = float.Parse (elem.Element("Oils").Value);
		this.SolidFats = float.Parse (elem.Element("Solid_Fats").Value);
		this.AddedSugars = float.Parse (elem.Element("Added_Sugars").Value);
		this.Alcohol = float.Parse (elem.Element("Alcohol").Value);
		this.Calories = float.Parse (elem.Element("Calories").Value);
		this.SaturatedFats = float.Parse (elem.Element("Saturated_Fats").Value);
	}
}

You might notice I’m using float for my decimal values. That’s all the accuracy I need, but it does lose some precision. I’m rounding the data when I use it so it won’t really matter, but if your needs differ, choose a different numeric type.

Step 2 – Function for reading the XML file

This is a pretty small data file, only about 750 records, but loading it all into memory at once is a waste. I want to load the “Food_Display_Row” XML elements one at a time, convert to a Food object, store in MongoDB, and move on to the next. It’s a job for a streaming API and an iterator, powered by “yield return” to get one XElement at a time:

static IEnumerable<XElement> readElementStream(string fileName, string elementName) {
	using(var reader = XmlReader.Create(fileName)) {
		reader.MoveToContent();
		while(reader.Read()) {
			if(reader.NodeType == XmlNodeType.Element && reader.Name == elementName) {
				var e = XElement.ReadFrom (reader) as XElement;
				yield return e;
			}
		}
		reader.Close ();
	}
}

Step 3 – Pull it all together and load the data

With the pieces in place, the load process is pretty simple. Connect to the server, get the database (MongoDB creates it on first use), get the collection (MongoDB also creates the collection), and use the iterator to read the XML file, load each element into a Food object and insert into the MongoDB collection. At the end, we have a MongoDB database with a collection of data from the food guide pyramid.

var server = MongoServer.Create ("mongodb://localhost");
var db = server.GetDatabase("gov");
var foods = db.GetCollection<Food>("food");

foreach(var elem in readElementStream("~/Downloads/MyFoodapediaData/Food_Display_Table.xml", "Food_Display_Row")) {
	var food = new Food(elem);
	foods.Insert (food);
}

My favorite thing about this is that I never had to leave C# to create the database, parse the source XML, or load the data. I don’t have to run a separate ETL process or use management tools to configure my database schema. It’s a simple, self-contained solution.

My second favorite part is that I ran all of this under Ubuntu and Mono. It should work just as well under Windows and .NET, but life is better running under an open source software stack.

I hope you find this helpful if you’re getting started with MongoDB and want a little data to play with.

Advertisements
Categories: C#, MongoDB Tags: ,
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: