Thursday, February 19, 2009

LINQPad, my new favorite tool

As I was looking into LINQ, I found a tool called LINQPad, which allows you to run LINQ queries on a typical SQL database. "Cool," I thought, "That might help me become familiar with LINQ syntax." I had no idea of the power of this tool.

When it first connects to your chosen database, LINQPad automatically generates a data-object mapping based on the foreign and primary keys in your database. So rather than having to worry about doing an INNER JOIN ON table1.some_column = table2.some_column, you can just use standard LINQ syntax to join the objects based on their relationships. Another advantage is that it provides links from one table to another, so you can easily jump around the table structure, seeing how things are interconnected.
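
For example (a hedged sketch using hypothetical Customer and Order tables rather than a real schema), instead of spelling out the join yourself you can simply walk the association property that LINQPad generated from the foreign key:

// SQL equivalent: SELECT ... FROM Customer c INNER JOIN [Order] o ON o.CustomerId = c.CustomerId
from c in Customers
from o in c.Orders          // association property generated from the foreign key
where o.Total > 100
select new { c.Name, o.OrderDate, o.Total }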

Visualization of the returned data is much more intuitive and powerful than the simple grid that is returned by normal applications. For example, using the "select new" statement I discussed in my last posting, you can see lists of objects grouped within other objects, fully collapsible and everything! For this reason, I prefer it to SQL Server Management Studio for viewing information in the database.

LINQPad also lets you see what code is generated based on the LINQ queries you put in. You can see the Lambda code, the SQL code, and even the Intermediate Language (IL) code that comes from the statements you write. That's very handy for understanding the performance of your code.

LINQPad's uses aren't limited to Linq to SQL, however! You can reference the .dll file from your Entity Framework data layer and run queries as if you were writing code right within Visual Studio. It holds true to its claim to be a "code snippet IDE." In fact, you don't even have to use it for LINQ! Say you just want to run a method and see what kind of data comes out: rather than creating a temporary unit-test in your Visual Studio directory, you can simply write the C# code snippet in LINQPad and run it!
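
For instance, with the query language set to C# statements, you can run a throwaway snippet like this (a trivial, made-up example) and inspect the result with LINQPad's built-in Dump() extension method:

// Any ordinary C# statements will run here; no project, class, or Main() required.
var primes = Enumerable.Range(2, 50)
    .Where(n => Enumerable.Range(2, (int)Math.Sqrt(n) - 1).All(d => n % d != 0));

primes.Dump();   // Dump() renders the sequence in LINQPad's results pane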

If you're new to LINQ, LINQPad actually comes with a hearty set of code samples to help you get started.

Oh, yeah, and you can also use it to run SQL queries and commands if you want.

Overall, I've been extremely impressed by all that I can do with LINQPad. Now I'm trying to get my bosses to invest in an Autocomplete license for it, and looking forward to the increased productivity I can expect from that feature.

Joining entities using the "new" operator

I took some time in my last post to explain that two database queries are better than many database queries. Namely, it's better to run one query per table and build a mapping from the combined results than to issue a new database query for every chunk of information we want. (I should note that this is obviously only true if the results of your few queries can be limited to a size that roughly matches the amount of data you'd otherwise be requesting. If you're clever, this can usually be the case.)

As I played around a bit more with Linq, I found another syntax that can be used to join objects together. For example, the following Linq (to SQL) code:

from ec in EntityCategory
select new {ec, te = from te in ec.TrackedEntity select te}
... will produce objects of a new Anonymous type with the following elements:
  • EntityCategory ec
  • IEnumerable<TrackedEntity> te
This approach can use a single SQL round-trip, which you'd think would be ideal. After all, if two is better than many, shouldn't one be better than two? But there are a few things to keep in mind:
  1. Because it creates an Anonymous data type, the returned results don't really lend themselves to being passed around your program as arguments. (One workaround is sketched after this list.)
  2. There is actually quite a bit of overhead associated with this approach. The generated SQL query is a lot more expensive for the database, and the returned data set contains a lot of unnecessarily duplicated data. Once the data returns from the database, it has to be parsed and organized by the framework (Entity or Linq to SQL) in much the same way as the JoinedEntityMap class that I mentioned in my last post. So this syntax can run much more slowly, despite reducing the number of database round-trips.
That said, this syntax can be useful in certain circumstances, so it's good to keep this alternative in the back of your mind.
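
If you do want to pass the combined results around, one workaround (just a sketch using a made-up helper class, not anything built into the framework) is to project into a small named type instead of an anonymous one:

// A named "view" class we define ourselves, so it can appear in method signatures.
public class CategoryWithEntities
{
    public EntityCategory Category { get; set; }
    public IEnumerable<TrackedEntity> Entities { get; set; }
}

// The same shape as the anonymous-type query above, but now strongly named:
IQueryable<CategoryWithEntities> q =
    from ec in EntityCategory
    select new CategoryWithEntities
    {
        Category = ec,
        Entities = from te in ec.TrackedEntity select te
    };

The overhead described in point 2 still applies, of course; this only addresses the typing problem.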

Tuesday, February 17, 2009

Joining entities to avoid database roundtrips

In my last post, I hinted that I had an alternative in mind for the Eager Loading problem that has vexed me over the past few days.

First of all, it's important to understand what we intend to gain by eager loading--fewer database round-trips. You see, every time your program asks the database for information, there is considerable overhead in establishing the connection, sending your request over the wire, and waiting for the response to come back. If your queries return large data sets, this overhead will have a minimal impact. But if you're doing a whole bunch of tiny queries, it can make a huge difference. In my case, where I often work from home and connect to the database at work, the overhead from these so-called "round-trips" can actually take more time than everything else in a typical page load.

So rather than saying:

TrackedEntityBO tebo = new TrackedEntityBO(context);

foreach (RepeaterItem item in Repeater1.Items)
{
    CheckBoxList list = (CheckBoxList)item.FindControl("CheckBoxList1");
    foreach (ListItem li in list.Items)
    {
        TrackedEntity te = tebo.getById(Decimal.Parse(li.Value)); // Run 1 query per TrackedEntity
        ...
    }
}
..., I could save a lot of time by saying:

TrackedEntityBO tebo = new TrackedEntityBO(context);
// Runs a single linq query for all TrackedEntities
Dictionary<decimal, TrackedEntity> ueMap = tebo.getAll().ToDictionary(c => c.TrackedEntityId);

foreach (RepeaterItem item in Repeater1.Items)
{
    CheckBoxList list = (CheckBoxList)item.FindControl("CheckBoxList1");
    foreach (ListItem li in list.Items)
    {
        TrackedEntity te = ueMap[Decimal.Parse(li.Value)];
        ...
    }
}
The folks who designed the Entity Framework were aware of this, and they took action to prevent round-trips as much as possible. For example:
  1. Each context instance can cache a lot of the data that it gathers. If a later LINQ query on the same context can be determined to contain only data that has previously been cached, the Framework will use the cached data rather than executing another query.
  2. Unlike Linq to SQL, the Entity Framework will not make a database query without being asked. For example, the following code will produce an exception because the EntityCategory object never got loaded into the context:


    TrackedEntity e = (from te in context.TrackedEntity
                       select te).First();
    Console.WriteLine(e.EntityCategory.Name);
As I pointed out in my last post, you simply can't make assumptions regarding which items have been loaded into the context. That means if you want to get certain objects associated with other objects, you'll either have to do a whole bunch of queries, or put both groups of objects into a data structure that maps them to one another. Thanks to the power of the .NET framework, however, this isn't all that difficult. For example, with relatively few lines of actual code, I was able to make the following statements possible:

IQueryable<TrackedEntity> teQuery = from te in context.TrackedEntity
                                    orderby te.Name
                                    select te;
IQueryable<EntityCategory> ecQuery = from te in teQuery
                                     let ec = te.EntityCategory
                                     orderby ec.Name
                                     select ec;
JoinedEntityMap<EntityCategory, TrackedEntity> j =
    ecQuery.ToMap(c => c.EntityCategoryId, teQuery, te => te.EntityCategory);

foreach (EntityCategory c in j.UniqueKeys)
{
    foreach (TrackedEntity e in j[c])
    {
        Console.WriteLine(c.Name + ": " + e.Name);
    }
}
With negligible memory overhead, the above code causes a grand total of two SQL queries and takes 110 milliseconds to run with 4 EntityCategories and 47 TrackedEntities, as opposed to the code below, which takes 251 milliseconds:

IQueryable<EntityCategory> ecQuery = from ec in context.EntityCategory
                                     orderby ec.Name
                                     select ec;

foreach (EntityCategory c in ecQuery)
{
    IQueryable<TrackedEntity> teQuery = from te in context.TrackedEntity
                                        where te.EntityCategory.EntityCategoryId == c.EntityCategoryId
                                        orderby te.Name
                                        select te;
    foreach (TrackedEntity e in teQuery)
    {
        Console.WriteLine(c.Name + ": " + e.Name);
    }
}
With that kind of difference on such a small scale, you can imagine how much of a difference it could make if you had hundreds, or even thousands of records in each table!

Another potential benefit to this approach is that the main page's code can create the data it knows will be needed, and then hand strongly-typed data to each user control in turn. It not only makes it possible for the user controls to be more specific in what sort of data they need as parameters, but it also increases the transparency of your data accesses. With most of your calls to the data layer occurring in one code file, it becomes obvious what's taking so much time to load. And if you're not using an HttpContext-based Entity context (which we are), it can also simplify the problems you'll run into with conflicting contexts.
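
Since JoinedEntityMap and ToMap are helpers of my own that never actually appear above, here is a rough sketch of the idea (not my exact code--the names and details here are illustrative only):

// Rough sketch only. Parents are registered even if they end up with no children,
// and each child is grouped under the parent its navigation property points back to.
public class JoinedEntityMap<TParent, TChild>
{
    private readonly Dictionary<TParent, List<TChild>> map = new Dictionary<TParent, List<TChild>>();

    public IEnumerable<TParent> UniqueKeys { get { return map.Keys; } }

    public IEnumerable<TChild> this[TParent parent]
    {
        get { return map[parent]; }
    }

    public void Add(TParent parent)
    {
        if (!map.ContainsKey(parent)) map[parent] = new List<TChild>();
    }

    public void Add(TParent parent, TChild child)
    {
        Add(parent);
        map[parent].Add(child);
    }
}

public static class JoinedEntityMapExtensions
{
    // keySelector identifies a parent; parentSelector walks from a child back to its parent.
    // Each ToList() below is one SQL round-trip, for a grand total of two.
    public static JoinedEntityMap<TParent, TChild> ToMap<TParent, TChild, TKey>(
        this IQueryable<TParent> parentQuery,
        Func<TParent, TKey> keySelector,
        IQueryable<TChild> childQuery,
        Func<TChild, TParent> parentSelector)
    {
        Dictionary<TKey, TParent> parents = parentQuery.ToList().ToDictionary(keySelector);
        JoinedEntityMap<TParent, TChild> result = new JoinedEntityMap<TParent, TChild>();

        foreach (TParent parent in parents.Values)
        {
            result.Add(parent);
        }
        // Relies on each child's parent reference being fixed up once both sets
        // have been loaded into the same context.
        foreach (TChild child in childQuery.ToList())
        {
            result.Add(parents[keySelector(parentSelector(child))], child);
        }
        return result;
    }
}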

Monday, February 16, 2009

A better understanding of Eager Loading

It turns out to be a very bad idea to rely on the behavior that I mentioned in my last posting. Even though the first Linq query I mentioned tends to cause those requested objects to be loaded into the context, there is no guarantee that those are the only objects that will be brought in. For example, if I run the following code:

List<EntityCategory> l2 = (from ec in context.EntityCategory.Include(c => c.TrackedEntity)
                           select ec).ToList();

List<EntityCategory> l1 = (from ec in context.EntityCategory
                           from te in ec.TrackedEntity
                           from ue in te.User
                           where ue.UserId == user.UserId
                           orderby ec.Name
                           select ec).DistinctBy(c => c.EntityCategoryId).ToList();

... l1.First().TrackedEntity will contain ALL of the tracked entities related to the given entity category. This is because the first statement loaded all of these entities into the context, and the second statement reused the objects that were created there.

So, what have we learned? We can only trust a LINQ statement to give us the data we ask for in the select statement, and we should never trust associated entity objects to even exist, much less contain what we're expecting.
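
In practice, that means any code that really does need an associated collection has to check it explicitly (accepting the extra round-trip), rather than trusting whatever happens to be sitting in the context:

EntityCategory ec = l1.First();
if (!ec.TrackedEntity.IsLoaded)
{
    ec.TrackedEntity.Load(); // explicit database hit, but now we know exactly what's in the collection
}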

I have a plan for dealing with this, which I'll detail once I get the kinks worked out.

Friday, February 13, 2009

Eager loading limitations (driving me crazy!)

Yesterday I talked about some strategies for Eager Loading in the Entity Framework. Well, I'm running into a really perplexing situation now.

If I run a LINQ query like this:
List<EntityCategory> l1 = (from u in context.UserSet
                           where u.UserId == user.UserId
                           from t in u.UserEntities
                           orderby t.EntityCategory.Name
                           select t.EntityCategory).ToList();
I find that all of the entities accessed by the query are eager loaded... sort of. I can access l1.First().TrackedEntity.First().Name, for example. But l1.First().TrackedEntity.IsLoaded is false! So the data exists in the context, but the context doesn't think it's loaded. Furthermore, adding an Include("TrackedEntity") doesn't do a thing when you do joins. So even though all the information gets pulled into memory correctly, if I put any "if(!isloaded) load" code in my front-end (which would normally be a robust way of doing things), it will reload all the same data from the database for each EntityCategory.

Now, I could do a fancy query like this:
from ec in context.EntityCategory.Include("TrackedEntity")
where ec.TrackedEntity.Any(te => te.User.Any(u => u.UserId == user.UserId))
select ec
But the eager-loaded results would give me ALL of the tracked entities for each entity category, and not just the ones selected by the given user.

Is there any way to make a statement that selectively joins tables AND eager-loads ONLY the entities included in the join, marking the appropriate EntityCollection as Loaded?

Thursday, February 12, 2009

Lazy and Eager Loading Strategies in the Entity Framework

We're working on a Timesheets project to help us learn .NET. Our architecture uses the fairly new .NET Entity Framework for its data model. We use three layers:
  1. The data access layer consists of an EDMX file, which defines CLR objects based on our database schema. (e.g. TrackedTime)
  2. The business layer consists of "business objects" with methods that do things with the objects in the data access layer. (e.g. public List<TrackedTime> getByUserAndDate(User user, DateTime dateTime))
  3. The front-end is an ASP.NET Website.
So, to get a list of all of a user's tracked times for a particular day, our front-end can call the following method:
public List<TrackedTime> getByUserAndDate(User user, DateTime dateTime)
{
    ObjectQuery<TrackedTime> queryObject = context.TrackedTime;
    return queryByUserAndDate(queryObject, user, dateTime).ToList();
}
The method "queryByUserAndDate" generates a LINQ query that asks for all TrackedTime objects in the given ObjectQuery that have the given User and Date. The list returned by "getByUserAndDate" can then be bound to a control in the front-end, such as a DataGrid, using an ObjectDataSource.

This works fine, as long as our DataGrid only displays the data tied directly to the TrackedTime object. If we want to have a field in the DataGrid that shows all the TrackedEntities that are related to our TrackedTime object, we will need to make sure they get loaded from the database. There are two ways to do this.

One way is Lazy Loading: for each row in our datagrid, we have to insert the following code before the TrackedEntities get accessed:
if (!trackedTime.TrackedEntities.IsLoaded)
{
    trackedTime.TrackedEntities.Load();
}
There are a few problems with this approach:
  1. It messes up our three-tiered architecture, since data is being loaded almost directly from the front-end layer.
  2. If the context object used in getByUserAndDate has been disposed before this code gets reached, it will throw an exception. This isn't such a problem in our architecture, for reasons I will probably discuss in another blog posting, but for many applications it simply wouldn't work.
  3. getByUserAndDate only made one database query to load in all "N" TrackedTime objects. Now, we're manually forcing "N" database queries to load in the TrackedEntity objects. With all the overhead involved in a database query, this can slow the overall process significantly.
The other approach is to use so-called "Eager" loading. In this approach, we make a method getByUserAndDateWithTrackedEntities to load the TrackedEntities in the initial LINQ statement:
public List<TrackedTime> getByUserAndDateWithTrackedEntities(User user, DateTime dateTime)
{
    ObjectQuery<TrackedTime> queryObject = context.TrackedTime.Include("TrackedEntities");
    return queryByUserAndDate(queryObject, user, dateTime).ToList();
}
This has the advantage of using a single database call to load all TrackedEntities for all the TrackedTime objects that are going to be returned. However, it creates a couple of new problems:
  1. Experienced programmers are loath to use Strings to define data that's tied to the object model. If, for example, you mistakenly say "TrackedEntity" instead, you'll have no idea that there's something wrong with your code until it runs. If this method only gets run in a corner case, you might not catch the bug for a long, long time. And the later a bug is caught in the development cycle, the more expensive it is to fix. We want the compiler to tell us if we're making a dumb mistake as soon as we make it. It's even better if we can use Intellisense to help us avoid making the error in the first place!
  2. If we have to create a new method for every possible combination of includes, this could get really cumbersome. Are we going to make a getByDateWithUserAndTrackedEntitiesAndTheirCategories method, too? Ideally, we would like to have one method that could take the includes as arguments.
To solve problem #1, a gentleman named Matthieu MEZIL (est-il Français?) created Extension Methods that basically snap into the ObjectQuery class to give you new ways of calling the Include method. After creating a class based on the one he created, we can change getByUserAndDateWithTrackedEntities thusly:
ObjectQuery<TrackedTime> queryObject = context.TrackedTime.Include(c => c.TrackedEntities);
This gives us slightly more overhead because it basically just uses reflection behind the scenes to create the "TrackedEntities" string at run-time. But it's a trifle compared to all the other overhead inherent in ASP.NET, and it is now fully strongly typed.
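
I won't try to reproduce his extension methods here, but the core trick looks something like this simplified sketch (it handles only a single-level property path, needs System.Linq.Expressions and System.Data.Objects, and his version goes well beyond it):

public static class ObjectQueryIncludeExtensions
{
    // Turns query.Include(c => c.TrackedEntities) into query.Include("TrackedEntities").
    public static ObjectQuery<T> Include<T>(this ObjectQuery<T> query,
        Expression<Func<T, object>> path)
    {
        // The lambda body is either a direct member access, or one wrapped
        // in a Convert node when the property is a value type.
        MemberExpression member = path.Body as MemberExpression
            ?? (MemberExpression)((UnaryExpression)path.Body).Operand;

        return query.Include(member.Member.Name);
    }
}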

Is there a way to solve problem #2 in this context as well? To be honest, M. MEZIL's extension methods use deep magic that is beyond my current understanding of the .NET framework, so I haven't figured out how one might go about altering the getByUserAndDate method to allow the same kind of "lambda expression" to be passed into it, although there is certainly a way to do so. But the limitations of the front-end leave us with certain practical concerns as well, which might make it impossible (or at least really annoying) to call the method this way. You see, ASP.NET relies heavily on reflection to make it easier to get the "it just works" behavior that Microsoft is striving for. For example, there appears to be no way in my .aspx code to tell the ObjectDataSource to use actual objects as arguments to my business layer. If I had a method that looked like this:
public List<TrackedTime> getByUserAndDateWithTrackedEntities(String userId, String dateTimeStr)
... I could just use the wizard in Visual Web Developer to create simple XML-ish code that tells the framework where to get the userId and dateTime to use when calling the getByUserAndDateWithTrackedEntities method. But if I want to pass actual objects around (which not only helps us prevent the situation mentioned in #1, but also saves us from re-loading the same User object from the entity framework for every control on the page that wants information from it), I have to leave the parameter values blank in the .aspx file and put something like this in my code-behind:
protected void TodayTimes_Selecting(object sender, ObjectDataSourceSelectingEventArgs e)
{
    e.InputParameters["user"] = currentUser;
    e.InputParameters["dateTime"] = DateTime.Parse(DateList.SelectedValue);
}
It's ugly, but that's the price we pay for those nifty wizards and that nice design view. If I want to bind the GridView programmatically I can, but I would have to sacrifice some of what makes ASP.NET attractive in the first place. Let's assume for a moment that I won't always want to make that sacrifice. How would I add a lambda expression to the InputParameters in the above code? If it's possible, it's undoubtedly messy, and since the function is only getting invoked through reflection anyway, we've already lost the benefits of Intellisense and compiler checks, so why bother?

With that in mind, let's examine a way to solve problem #2 without solving problem #1. I begin by creating the following method extension:
public static ObjectQuery<T> Include<T>(this ObjectQuery<T> mainQuery, params string[] paths)
{
    ObjectQuery<T> q = mainQuery;
    foreach (string path in paths)
    {
        q = q.Include(path);
    }
    return q;
}
As you can see, all this does is call Include iteratively on every string that gets passed in to it. Now we can change our getByUserAndDate function like so:
public List<TrackedTime> getByUserAndDate(User user, DateTime dateTime, params string[] includes)
{
    ObjectQuery<TrackedTime> queryObject = context.TrackedTime.Include(includes);
    return queryByUserAndDate(queryObject, user, dateTime).ToList();
}
Then, depending on how much data we want to load eagerly, we can call this programmatically like so:
  • myGridView.DataSource = trackedTimeBo.getByUserAndDate(user, dateTime);
  • myGridView.DataSource = trackedTimeBo.getByUserAndDate(user, dateTime, "TrackedEntities");
  • myGridView.DataSource = trackedTimeBo.getByUserAndDate(user, dateTime, "TrackedEntities", "TrackedEntities.EntityCategory");
  • ...
Or, in the case mentioned earlier, we can add the input parameter like so:
  • e.InputParameters["includes"] = new string[] {};
  • e.InputParameters["includes"] = new string[] {"TrackedEntities"};
  • e.InputParameters["includes"] = new string[] {"TrackedEntities", "TrackedEntities.EntityCategory"};
  • ...
So we end up with a couple of good ways to do Eager Loading in the Entity Framework, each with its advantages and disadvantages. We'll probably end up using a combination of the two. Or maybe we'll find another way that makes even more sense. Who knows?

Tuesday, February 10, 2009

Watch the ViewState While Debugging

In my previous post, I briefly showed how to use the ViewState to persist data. This post will overcome the difficulty we have in seeing what data is currently in the ViewState while debugging. The problem? ViewState is a StateBag, which, unlike Hashtable, does not have a default visualizer that makes its values easy for the debugger to look at. In fact, the actual keys and values don't appear to be contained within the StateBag object itself, which makes drilling through its data pointless. (I'm guessing that since the ViewState has per-user-control-per-instance values, it probably stores the real data in the ASP.NET "magic zone" where all the context loading and such takes place. The StateBag probably just provides access methods to wherever the data is really stored.)

So how can you get a quick-and-easy view of the current ViewState? Just add the following line to your Watch panel:
new System.Collections.Hashtable(ViewState)
The Hashtable constructor iterates through all of the key/value pairs of the StateBag and creates a Hashtable object out of it. Since Hashtable has a key/value pair visualizer in Visual Studio, you get a nice, clean table of keys and values.

Persisting Data in User Controls with ViewState

In ASP.NET, the ViewState is used to persist data across requests. Let's say, for example, I want to use a User Control that will show information based on a particular date. I could create a property in that control that looks like this:
private DateTime _Date;
public DateTime Date
{
    set
    {
        _Date = value;
        ViewState["Date"] = value;
    }
    get
    {
        if (_Date.Ticks == 0 && ViewState["Date"] != null)
        {
            _Date = (DateTime)ViewState["Date"];
        }
        return _Date;
    }
}
... which would make it possible for me to "set and forget" the date from the page containing that user control. For example:
protected void Page_Load(object sender, EventArgs e)
{
    if (!Page.IsPostBack)
    {
        AddTrackedTime1.Date = DateTime.Today;
    }
}
So the first time the page loads, it sets the user control's date. You could have events set up to change the date on that user control if you want. Each time the date gets set, it will add a binary value to the ViewState, which persists in the form of a hidden field on the page being viewed. When the time comes to do something with that date value, it will automatically be loaded out of the ViewState. Because the ViewState is persisted in a hidden form field, it will continue to work properly even if the user clicks "back" several times in the browser and then clicks on a control from their browsing history. Also, unlike the Session, the ViewState is specific to the control instance, so you could have several of these controls on the same page and they wouldn't step on each other's toes.

The only potential downside is that you increase the amount of data sent to and from the browser, which could slow transactions somewhat.