Loading XML into MongoDB
I’m starting a new app today and building out the data layer with MongoDB as my database. The app uses a collection from the USDA, that I thought makes a good sample for getting started with the “Load” portion of ETL into MongoDB.
The data is available from the USDA here – the raw XML for MyPyramid: http://explore.data.gov/download/b978-7txq/XML
Step 1 – Define a Class for the data
Although not absolutely necessary as you could build a raw BSON document directly from XML, you kind of miss out on some of the C# driver’s niceties if you do. Looking at the raw data, I came up with this class, along with a constructor that takes an XElement to handle the parsing. Strict DTO people might move that parsing to a function within the ETL process…up to you. The only MongoDB specific code here is the BsonId attribute, which I’ll put on the FoodCode property – a unique ID from the source system.
public class Food {
[BsonId]
public int FoodCode {get;set;}
public string DisplayName {get;set;}
public float PortionDefault {get;set;}
public float PortionAmount {get;set;}
public string PortionDisplayName {get;set;}
public float Factor {get;set;}
public float Increment {get;set;}
public float Multiplier {get;set;}
public float Grains {get;set;}
public float WholeGrains {get;set;}
public float Vegetables {get;set;}
public float OrangeVegetables {get;set;}
public float DarkGreenVegetables {get;set;}
public float StarchyVegetables {get;set;}
public float OtherVegetables {get;set;}
public float Fruits {get;set;}
public float Milk {get;set;}
public float Meats {get;set;}
public float Soy {get;set;}
public float DryBeansPeas {get;set;}
public float Oils {get;set;}
public float SolidFats {get;set;}
public float AddedSugars {get;set;}
public float Alcohol {get;set;}
public float Calories {get;set;}
public float SaturatedFats {get;set;}
public Food(XElement elem) {
this.FoodCode = Int32.Parse(elem.Element("Food_Code").Value);
this.DisplayName = elem.Element("Display_Name").Value;
this.PortionDefault = float.Parse (elem.Element("Portion_Default").Value);
this.PortionAmount = float.Parse (elem.Element("Portion_Amount").Value);
this.PortionDisplayName = elem.Element("Portion_Display_Name").Value;
if(elem.Element ("Factor") != null)
this.Factor = float.Parse (elem.Element("Factor").Value);
this.Increment = float.Parse (elem.Element("Increment").Value);
this.Multiplier = float.Parse (elem.Element("Multiplier").Value);
this.Grains = float.Parse (elem.Element("Grains").Value);
this.WholeGrains = float.Parse (elem.Element("Whole_Grains").Value);
this.Vegetables = float.Parse (elem.Element("Vegetables").Value);
this.OrangeVegetables = float.Parse (elem.Element("Orange_Vegetables").Value);
this.DarkGreenVegetables = float.Parse (elem.Element("Drkgreen_Vegetables").Value);
this.StarchyVegetables = float.Parse (elem.Element("Starchy_vegetables").Value);
this.OtherVegetables = float.Parse (elem.Element("Other_Vegetables").Value);
this.Fruits = float.Parse (elem.Element("Fruits").Value);
this.Milk = float.Parse (elem.Element("Milk").Value);
this.Meats = float.Parse (elem.Element("Meats").Value);
this.Soy = float.Parse (elem.Element("Soy").Value);
this.DryBeansPeas = float.Parse (elem.Element("Drybeans_Peas").Value);
this.Oils = float.Parse (elem.Element("Oils").Value);
this.SolidFats = float.Parse (elem.Element("Solid_Fats").Value);
this.AddedSugars = float.Parse (elem.Element("Added_Sugars").Value);
this.Alcohol = float.Parse (elem.Element("Alcohol").Value);
this.Calories = float.Parse (elem.Element("Calories").Value);
this.SaturatedFats = float.Parse (elem.Element("Saturated_Fats").Value);
}
}
You might notice I’m using float for my decimal values. That’s all the accuracy I need, but it does lose some precision. I’m rounding the data when I use it so it won’t really matter, but if your needs differ, choose a different numeric type.
Step 2 – Function for reading the XML file
This is a pretty small data file, only about 750 records, but loading it all into memory at once is a waste. I want to load the “Food_Display_Row” XML elements one at a time, convert to a Food object, store in MongoDB, and move on to the next. It’s a job for a streaming API and an iterator, powered by “yield return” to get one XElement at a time:
static IEnumerable<XElement> readElementStream(string fileName, string elementName) {
using(var reader = XmlReader.Create(fileName)) {
reader.MoveToContent();
while(reader.Read()) {
if(reader.NodeType == XmlNodeType.Element && reader.Name == elementName) {
var e = XElement.ReadFrom (reader) as XElement;
yield return e;
}
}
reader.Close ();
}
}
Step 3 – Pull it all together and load the data
With the pieces in place, the load process is pretty simple. Connect to the server, get the database (MongoDB creates it on first use), get the collection (MongoDB also creates the collection), and use the iterator to read the XML file, load each element into a Food object and insert into the MongoDB collection. At the end, we have a MongoDB database with a collection of data from the food guide pyramid.
var server = MongoServer.Create ("mongodb://localhost");
var db = server.GetDatabase("gov");
var foods = db.GetCollection<Food>("food");
foreach(var elem in readElementStream("~/Downloads/MyFoodapediaData/Food_Display_Table.xml", "Food_Display_Row")) {
var food = new Food(elem);
foods.Insert (food);
}
My favorite thing about this is that I never had to leave C# to create the database, parse the source XML, or load the data. I don’t have to run a separate ETL process or use management tools to configure my database schema. It’s a simple, self-contained solution.
My second favorite part is that I ran all of this under Ubuntu and Mono. It should work just as well under Windows and .NET, but life is better running under an open source software stack.
I hope you find this helpful if you’re getting started with MongoDB and want a little data to play with.