Friday, July 10, 2009

Combining Lambda Expressions

I'm working on custom fields for a GridView which can include custom filters. The idea is to persist these filters to the data layer, so that when a filter is applied, the query sent to the database includes all the appropriate "where" clauses.

I wanted to have an abstract parent class that deals with most of the tedium involved in extending the DataControlField class, and then only require the child classes to handle the cases specific to their custom filters. The parent class would know how to create the lambda expression required to get a given property off of the object that is being represented in the GridView, while the child class would know how to filter by that property. I was willing to make the parent class a little more messy, as long as it could easily be extended to make whatever kind of filterable columns we wanted. For example, if the child class was filtering based on string data types, I wanted to make the following syntax possible:
protected override IEnumerable<Expression<Func<string, bool>>> GetFilters()
{
    if (filterTextBox.Text != null && filterTextBox.Text.Length > 0)
    {
        yield return s => s.StartsWith(filterTextBox.Text);
    }
}
But I ran into a wall trying to get the parent class to build the full filter expression based on this partial filter expression. I needed a method that looked like this:
public Expression<Func<TrackedTime, bool>> mergeFilterExpressions(
    Expression<Func<TrackedTime, T>> propGetter,
    Expression<Func<T, bool>> filter)

So given one expression that could get the property from the TrackedTime, and another expression that could get a boolean value from the property, I needed to be able to create an Expression that gets a boolean value from the TrackedTime. In LINQ itself, there is of course the ability to Invoke the functions represented by my expressions, like this:
return tt => filter.Invoke(propGetter.Invoke(tt));
However, LINQ to Entities doesn't support invoking expressions this way. So although this would compile just fine, it would throw a runtime exception when the time came to build the SQL query to send to the database. It was driving me crazy.

After much searching, it was Joe Albahari, creator of the amazing LINQPad (which I use almost daily, by the way), who had the solution. On his website, he provides a collection of extension methods called LinqKit, which allowed me to do the following:
public Expression<Func<TrackedTime, bool>> mergeFilterExpressions(
    Expression<Func<TrackedTime, T>> propGetter,
    Expression<Func<T, bool>> filter)
{
    ParameterExpression param = propGetter.Parameters[0];
    InvocationExpression getProp = Expression.Invoke(propGetter, param);
    InvocationExpression invoke = Expression.Invoke(filter, getProp.Expand());
    return Expression.Lambda<Func<TrackedTime, bool>>(invoke.Expand(), param);
}
How 'bout that! By Invoking and then Expanding the Expressions, I can now combine Lambda Expressions any way I want!
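For example, here's a hypothetical usage sketch (the Description property on TrackedTime is an assumption, used only for illustration):

// Combine a property getter with the StartsWith filter from the
// GetFilters example above. "Description" is an assumed property name.
Expression<Func<TrackedTime, string>> propGetter = tt => tt.Description;
Expression<Func<string, bool>> filter = s => s.StartsWith("abc");

Expression<Func<TrackedTime, bool>> merged =
    mergeFilterExpressions(propGetter, filter);
// merged behaves like tt => tt.Description.StartsWith("abc"), which
// LINQ to Entities can translate into a SQL LIKE clause.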

Thursday, June 4, 2009

Would a Wave-based IDE be feasible?

Google recently announced "Google Wave," a project that they've been working on for the past two years, and which is set to revolutionize online communication as we know it.  When it was announced, they mentioned that the whole thing was built using Google's Web Toolkit.  I started looking into the toolkit and noticed that they also have something called the Google App Engine, which is a framework and service that lets you host your applications and data "in the cloud" on Google's servers.

Pondering these various Internet technologies, I got to wondering: would it be possible to host the entire development process online? If someone could create a Wave-based code infrastructure, we could have an online IDE:
  • Each Java code file could be a Wave, which could be edited collaboratively--by multiple users at the same time, if necessary (hello Extreme Programming!).  
  • A spell-check-like plugin could be created to provide real-time compiler feedback and intellisense.  
  • A bot could be granted access to the code tree in order to compile and deploy changes to a cloud-based service in real-time, provide debugging services, and even run unit tests.
  • Developers could "check out" the waves into their own framework instance, and once a set of changes is ready, they could be merged into the "stable" set of Waves.
  • Waves already have an extremely powerful version control system built in; you can visually and immediately step back through each point in a file's revision history to watch its evolution.
  • The Google folks already showed how useful Waves can be in bug management; bugs and tasks could be handled and passed around within the same Wave framework.  They could probably even be linked to the code changes that were made to fix them (and vice versa), for future reference.
  • A plugin similar to the bug management one could be used to tag a spot in code for colleague review.  A Wave thus tagged would appear in the colleague's inbox, where they could see the changes made in the context of the entire Java file.  They could start a thread inline in the code to ask questions and make suggestions, which would all be immediately visible to the original programmer.  They could even have an entire chat session right there, inline with the code!  Both the original programmer and the colleague could make and see the changes in real-time.  Once the colleague is satisfied, they could use the plugin to sign off on the changes.
  • Documentation (both internal and external) could also be managed by the same system. Code for a particular feature could be linked to that feature's documentation.
And perhaps the best part about the whole thing is that developers don't even need to install anything on their computers.  They can log in from any web-enabled computer, anywhere, and all the same capabilities are at their fingertips.  With Wave, Google has laid the foundation for a new generation of Internet technologies.  I'm excited to see the many ways that this sort of technology will be leveraged in the years to come.

Monday, May 4, 2009

Fastest way to check for existence in a table

In order to improve performance on one of the pages in our Java code, I was making a SQL query which, along with the typical section grade information, also pulls in a field to tell whether that particular grade type is in use. Between my own brain and some quick Google searching, this was the best query I could come up with:
select sg.*,
    (select top 1 1 from section_roster sr where sr.section_grade_id = sg.section_grade_id) as isUsed
from section_grade sg
I assumed that in the absence of any ordering, "select top 1 1" would be converted by SQL Server into a sort of "exists" statement, and would therefore be the fastest query for the job.  But just out of curiosity, I ran a similar LINQ query in LINQPad to see what SQL would be generated.  Based on those results, I created the following query:
select sg.*,
    (case when exists(select null from section_roster sr where sr.section_grade_id = sg.section_grade_id) then 1
          else null
     end) as isUsed
from section_grade sg
Although it's not as simple a query, I was able to drop the execution time from about 27 milliseconds to about 9 milliseconds.
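For reference, a LINQ-to-SQL query along these lines produces that exists-based shape, since Any() translates to an EXISTS clause (this is a sketch, not the exact query I ran; the entity and property names are assumptions based on the table names above):

// Any() in the projection becomes CASE WHEN EXISTS(...) in the SQL.
from sg in Section_Grades
select new
{
    Grade = sg,
    IsUsed = Section_Rosters.Any(sr => sr.Section_Grade_Id == sg.Section_Grade_Id)
}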

Thursday, April 16, 2009

Enumerate across a date range in Linq using "yield"

I'm making a generic data table displayer which should be able to display any set of data that is formatted properly.  It can either deduce the table headers based on the data it receives, or it can use ones that I specify via a list of key/value pairs.  For example, by setting the value of this property:
IEnumerable<DataPair<object, String>> RowKeysWithHeaders
... I can make the row headers display all of the Strings on the right side of the given DataPairs in the IEnumerable's order, and the data for each row of the table will be aligned to match the keys on the left side of the DataPairs.

Now, let's say I'm using the data returned by the query I mentioned yesterday, which will only include dates for which there are TrackedTime entries, but I want to display all dates within a given date range, and simply leave cells empty if there are no entries for those dates.  Furthermore, I want to specify how the dates are formatted.  I could create a list of date/String pairs to use as row keys with headers, like this:
List<DataPair<object, String>> dateHeaders = new List<DataPair<object, String>>();
// AddDays returns a new DateTime, so the loop variable must be reassigned.
for (DateTime date = startDate; date < endDate; date = date.AddDays(1))
{
    DataPair<object, String> header = new DataPair<object, String>
    {
        LeftObject = date,
        RightObject = date.ToShortDateString()
    };
    dateHeaders.Add(header);
}
this.ReportTableDisplay1.RowKeysWithHeaders = dateHeaders;
But that would be kind of wasteful, wouldn't it?  It would mean creating an entire List of dates, when all I need is to print a bunch of consecutive dates--something I should be able to do mathematically on the fly.

A more efficient way would be to create an IEnumerable class (and an accompanying IEnumerator class) that know how to iterate across dates.  But that's a lot more work and a lot more code for something that should be relatively simple.

Thanks to the yield keyword, there is a better way. This simple method:
public static IEnumerable<DateTime> DaysInRange(DateTime startDate, DateTime endDate)
{
    DateTime current = startDate;
    while (current <= endDate)
    {
        yield return current;
        current = current.AddDays(1);
    }
}
... will create an enumerable object that does exactly what I need it to, without the need for any extra classes or anything.  Here's how I use it:
this.ReportTableDisplay1.RowKeysWithHeaders = from d in DaysInRange(startDate, endDate)
                                              select new DataPair<object, String>
                                              {
                                                  LeftObject = d,
                                                  RightObject = d.ToShortDateString()
                                              };
This simple LINQ statement then gives me an IEnumerable that lazily creates new DataPairs as the program traverses it. You have to admit, that's pretty smooth. That means that if I decide to paginate the table results, I can use the Take() method, and the system won't even produce DataPairs for dates that I don't iterate over. I can also move my DaysInRange method into a common utility class where it can be accessed any time I need to traverse a range of dates. It's a great example of how we can use LINQ in conjunction with the yield keyword to create simple, efficient code.

It can also be used to highlight one of the dangers of accepting IEnumerable arguments.  Since I literally have no idea how expensive it might be to iterate over a given IEnumerable, I need to make sure my ReportTableDisplay class only iterates over the RowKeysWithHeaders once.  Otherwise I could end up creating who-knows-how-many copies of exactly the same DataPair without even realizing it.  Just think how that would turn out if I was using a truly expensive IEnumerable--one whose IEnumerator begins by accessing data from over the Internet, for example!
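A minimal sketch of that guard, assuming a pair of backing fields on the ReportTableDisplay class (the real class isn't shown here): defer enumeration until first use, then cache the results so every later pass reuses the same DataPairs.

private IEnumerable<DataPair<object, String>> rowKeysWithHeaders;
private List<DataPair<object, String>> materialized;

public IEnumerable<DataPair<object, String>> RowKeysWithHeaders
{
    get
    {
        // Enumerate the underlying sequence at most once, on first use.
        if (materialized == null)
        {
            materialized = rowKeysWithHeaders.ToList();
        }
        return materialized;
    }
    set
    {
        rowKeysWithHeaders = value;
        materialized = null;
    }
}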

Wednesday, April 15, 2009

Improved Dynamic Query

Yesterday I figured out how we could use C# Expressions to filter query results by any number of criteria. Today I refined my approach somewhat. Instead of the long, complex LINQ query I used there, I now have:


var userTimes = (from t in times
                 group t by new { t.User.UserName, t.TargetDate } into ut
                 select new
                 {
                     UserName = ut.Key.UserName,
                     TargetDate = ut.Key.TargetDate,
                     Minutes = ut.Sum(t => t.Minutes)
                 });
This produces faster, more reasonable SQL that returns almost identical data:


-- Region Parameters
DECLARE @p__linq__2 DateTime SET @p__linq__2 = '2009-02-14 16:01:50.069'
-- EndRegion
SELECT
1 AS [C1],
[GroupBy1].[K2] AS [UserName],
[GroupBy1].[K1] AS [TargetDate],
[GroupBy1].[A1] AS [C2]
FROM ( SELECT
[Extent1].[TargetDate] AS [K1],
[Extent2].[UserName] AS [K2],
SUM( CAST( [Extent1].[Minutes] AS int)) AS [A1]
FROM [dbo].[TrackedTime] AS [Extent1]
LEFT OUTER JOIN [dbo].[aspnet_Users] AS [Extent2] ON [Extent1].[UserId] = [Extent2].[UserId]
WHERE ([Extent1].[TargetDate] > @p__linq__2) AND ([Extent1].[TargetDate] < (GetDate()))
GROUP BY [Extent1].[TargetDate], [Extent2].[UserName]
) AS [GroupBy1]
Note that I avoided the problem I mentioned in an earlier post by returning a simple set of native types rather than Entity Framework objects. This approach wouldn't be best under normal circumstances. If I'm querying the database for usernames, for example, I will often want to have more user data as well (like their real names) available to my front-end code. In that case, I could benefit from using the Entity Framework to get all the object data in a single roundtrip. However, if I'm pulling data for reporting purposes, I generally know exactly what data I want to display. In that case it's more important to keep the data set small and fast. I've seen poorly-designed reporting engines pull so much data from the database that the VM runs out of memory, which causes all sorts of problems.

Finally, rather than relying on the LINQ framework to package my objects into custom classes for me, I created a generic TableData class, along with an extender that allows me to do this:

public TableData<String, DateTime, int> generateData(List<Filter<TrackedTime>> dateFilters)
{
    ...
    return userTimes.ToTable(d => d.UserName, d => d.TargetDate, d => d.Minutes);
}
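Neither TableData nor the ToTable extender is shown here, so here's a minimal sketch consistent with that call (the Add method and the class internals are assumptions):

public static TableData<TRow, TCol, TVal> ToTable<TSource, TRow, TCol, TVal>(
    this IEnumerable<TSource> source,
    Func<TSource, TRow> rowKey,
    Func<TSource, TCol> colKey,
    Func<TSource, TVal> value)
{
    // One pass over the query results populates one cell per source row.
    TableData<TRow, TCol, TVal> table = new TableData<TRow, TCol, TVal>();
    foreach (TSource item in source)
    {
        table.Add(rowKey(item), colKey(item), value(item));
    }
    return table;
}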
The next step would be to create a generic web control that creates a front-end table when given the returned report data. That way we can use a single control to display all kinds of report data, rather than creating a new method to display each new set of report data.

Tuesday, April 14, 2009

Dynamic Queries in Entity Framework

I've been working on a proof-of-concept for building dynamic Linq queries in the Entity Framework. The idea is to have any number of criteria that a user can set to limit or expand the scope of (for example) a report.

I still have a long way to go before it would be ready to use for real, but here's what I've been able to do:


public class ReportPerUserPerDay2
{
    public List<UserNamesWithDatesWithMinuteTotals> generateData(List<Filter<TrackedTime>> dateFilters)
    {
        TimesheetEntities context = new TimesheetEntities();
        IQueryable<TrackedTime> times = context.TrackedTime;
        foreach (Filter<TrackedTime> dateFilter in dateFilters)
        {
            times = times.Where(dateFilter.getPredicate());
        }
        var userTimes = (from t in times
                         group t by t.User.UserName into ut
                         from ut2 in
                             (from t in ut
                              group t by t.TargetDate into ut3
                              select new DatesWithMinuteTotals()
                              {
                                  TargetDate = ut3.Key,
                                  Minutes = ut3.Sum(t => t.Minutes)
                              })
                         group ut2 by ut.Key into ut4
                         select new UserNamesWithDatesWithMinuteTotals()
                         {
                             UserName = ut4.Key,
                             Dates = from t in ut4 select t
                         });

        return userTimes.ToList();
    }
}

public class DatesWithMinuteTotals
{
    public DateTime TargetDate;
    public long Minutes;
}

public class UserNamesWithDatesWithMinuteTotals
{
    public String UserName;
    public IEnumerable<DatesWithMinuteTotals> Dates;
}
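The Filter<T> interface itself doesn't appear above; a minimal sketch consistent with how generateData uses it would be:

// Assumed shape of Filter<T>, inferred from times.Where(dateFilter.getPredicate()).
public interface Filter<T>
{
    Expression<Func<T, bool>> getPredicate();
}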
So I can pass the generateData method filters like this one:



public partial class Controls_ReportFilters_FilterDateRange : System.Web.UI.UserControl, TimesheetReporting.Filters.Filter<TrackedTime>
{
    ...

    public Expression<Func<TrackedTime, bool>> getPredicate()
    {
        if (StartDate.Ticks == 0 && EndDate.Ticks == 0) return t => true;
        if (StartDate.Ticks == 0) return t => t.TargetDate <= EndDate;
        if (EndDate.Ticks == 0) return t => t.TargetDate >= StartDate;
        return t => t.TargetDate <= EndDate && t.TargetDate >= StartDate;
    }
}
and the Linq statement will auto-generate this crazy SQL code:



-- Region Parameters
DECLARE @p__linq__2 DateTime SET @p__linq__2 = '2009-02-13 18:13:53.470'
-- EndRegion
SELECT
[Project7].[UserName] AS [UserName],
[Project7].[C1] AS [C1],
[Project7].[C3] AS [C2],
[Project7].[TargetDate] AS [TargetDate],
[Project7].[C2] AS [C3]
FROM ( SELECT
[Distinct2].[UserName] AS [UserName],
1 AS [C1],
[Project6].[TargetDate] AS [TargetDate],
CASE WHEN ([Project6].[C2] IS NULL) THEN CAST(NULL AS bigint) ELSE CAST( [Project6].[C1] AS bigint) END AS [C2],
[Project6].[C2] AS [C3]
FROM (SELECT DISTINCT
[Project2].[UserName] AS [UserName]
FROM (SELECT
@p__linq__2 AS [p__linq__2],
[Distinct1].[UserName] AS [UserName]
FROM ( SELECT DISTINCT
[Extent2].[UserName] AS [UserName]
FROM [dbo].[TrackedTime] AS [Extent1]
LEFT OUTER JOIN [dbo].[aspnet_Users] AS [Extent2] ON [Extent1].[UserId] = [Extent2].[UserId]
WHERE [Extent1].[TargetDate] > @p__linq__2
) AS [Distinct1] ) AS [Project2]
CROSS APPLY (SELECT
[Extent3].[TargetDate] AS [K1]
FROM [dbo].[TrackedTime] AS [Extent3]
LEFT OUTER JOIN [dbo].[aspnet_Users] AS [Extent4] ON [Extent3].[UserId] = [Extent4].[UserId]
WHERE ([Extent3].[TargetDate] > @p__linq__2) AND (([Extent4].[UserName] = [Project2].[UserName]) OR (([Extent4].[UserName] IS NULL) AND ([Project2].[UserName] IS NULL)))
GROUP BY [Extent3].[TargetDate] ) AS [GroupBy1] ) AS [Distinct2]
LEFT OUTER JOIN (SELECT
[Project5].[UserName] AS [UserName],
[GroupBy2].[K1] AS [TargetDate],
[GroupBy2].[A1] AS [C1],
1 AS [C2]
FROM (SELECT
@p__linq__2 AS [p__linq__2],
[Distinct3].[UserName] AS [UserName]
FROM ( SELECT DISTINCT
[Extent6].[UserName] AS [UserName]
FROM [dbo].[TrackedTime] AS [Extent5]
LEFT OUTER JOIN [dbo].[aspnet_Users] AS [Extent6] ON [Extent5].[UserId] = [Extent6].[UserId]
WHERE [Extent5].[TargetDate] > @p__linq__2
) AS [Distinct3] ) AS [Project5]
CROSS APPLY (SELECT
[Extent7].[TargetDate] AS [K1],
SUM( CAST( [Extent7].[Minutes] AS int)) AS [A1]
FROM [dbo].[TrackedTime] AS [Extent7]
LEFT OUTER JOIN [dbo].[aspnet_Users] AS [Extent8] ON [Extent7].[UserId] = [Extent8].[UserId]
WHERE ([Extent7].[TargetDate] > @p__linq__2) AND (([Extent8].[UserName] = [Project5].[UserName]) OR (([Extent8].[UserName] IS NULL) AND ([Project5].[UserName] IS NULL)))
GROUP BY [Extent7].[TargetDate] ) AS [GroupBy2] ) AS [Project6] ON ([Project6].[UserName] = [Distinct2].[UserName]) OR (([Project6].[UserName] IS NULL) AND ([Distinct2].[UserName] IS NULL))
) AS [Project7]
ORDER BY [Project7].[UserName] ASC, [Project7].[C3] ASC
which, amazingly enough, rather efficiently produces exactly the information I wanted, and no more. Pretty slick, eh?

Thursday, February 19, 2009

LinqPad, my new favorite tool

As I was looking into LINQ, I found a tool called LINQPad, which allows you to run LINQ queries on a typical SQL database. "Cool," I thought, "That might help me become familiar with LINQ syntax." I had no idea of the power of this tool.

When it first connects to your chosen database, LINQPad automatically generates a data-object mapping based on the foreign and primary keys in your database. So rather than having to worry about doing an INNER JOIN ON table1.some_column = table2.some_column, you can just use standard LINQ syntax to join the objects based on their relationships. Another advantage is that it provides links from one table to another, so you can easily jump around the table structure, seeing how things are interconnected.
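For example (a sketch; the entity and column names are assumptions based on tables mentioned in an earlier post), a query that would need an explicit JOIN in SQL becomes a simple property walk:

// The generated navigation property replaces the INNER JOIN ... ON clause.
from sr in Section_Rosters
select new { sr.Section_Grade, sr.Student_Id }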

Visualization of the returned data is much more intuitive and powerful than the simple grid that is returned by normal applications. For example, using the "select new" statement I discussed in my last posting, you can see lists of objects grouped within other objects, fully collapsible and everything! For this reason, I prefer it to SQL Server Management Studio for viewing information in the database.

LINQPad also lets you see what code is generated based on the LINQ queries you put in. You can see the lambda code, the SQL code, and even the Intermediate Language (IL) code that comes from the statements you write. That's very handy for understanding the performance of your code.

LINQPad's uses aren't limited to Linq to SQL, however! You can reference the .dll file from your Entity Framework data layer and run queries as if you were writing code right within Visual Studio. It holds true to its claim to be a "code snippet IDE." In fact, you don't even have to use it for LINQ! Say you just want to run a method and see what kind of data comes out: rather than creating a temporary unit-test in your Visual Studio directory, you can simply write the C# code snippet in LINQPad and run it!

If you're new to LINQ, LINQPad actually comes with a hearty set of code samples to help you get started.

Oh, yeah, and you can also use it to run SQL queries and commands if you want.

Overall, I've been extremely impressed by all that I can do with LINQPad. Now I'm trying to get my bosses to invest in an Autocomplete license for it, and looking forward to the increased productivity I can expect from that feature.

Joining entities using the "new" operator

I took some time in my last post to explain that two database queries are better than many database queries. Namely, it's better to do one database query and create a mapping with the returned data, than it is to perform a new database query for every chunk of information that we want. (I should note that this is obviously only true if the results of your two database queries can be limited to a size that roughly fits the amount of data you'd otherwise be requesting. If you're clever, this can usually be the case.)

As I played around a bit more with Linq, I found another syntax that can be used to join objects together. For example, the following Linq (to SQL) code:

from ec in EntityCategory
select new { ec, te = from te in ec.TrackedEntity select te }
... will produce objects of a new Anonymous type with the following elements:
  • EntityCategory ec
  • IEnumerable<TrackedEntity> te
This approach can use a single SQL round-trip, which you'd think would be ideal. After all, if two is better than many, shouldn't one be better than two? But there are a few things to keep in mind:
  1. Because it creates an Anonymous data type, the returned results don't really lend themselves to being passed around your program as arguments.
  2. There is actually quite a bit of overhead associated with this method. The generated SQL query is a lot more expensive for the database, and the returned data set has a lot of unnecessarily-duplicated data. Once the data returns from the database, it has to be parsed and organized by the framework (Entity or Linq to SQL) in much the same way as the JoinedEntityMap class that I mentioned in my last post. So this syntax can run much more slowly, despite reducing the number of database round-trips.
That said, this syntax can be useful in certain circumstances, so it's good to keep this alternative in the back of your mind.

Tuesday, February 17, 2009

Joining entities to avoid database roundtrips

In my last post, I hinted that I had an alternative in mind for the Eager Loading problem that has vexed me over the past few days.

First of all, it's important to understand what we intend to gain by eager loading--fewer database round-trips. You see, every time your program asks the database for information, there is considerable overhead with establishing the connection, waiting for your request to go over the wire, and waiting for the response to come back. If your queries return large data sets, this overhead will have a minimal impact. But if you're doing a whole bunch of tiny queries, this overhead can make a huge difference. In my case, where I often work from home and connect to the database at work, the overhead from these so-called "round-trips" can actually take more time than everything else in a typical page load.

So rather than saying:

TrackedEntityBO tebo = new TrackedEntityBO(context);

foreach (RepeaterItem item in Repeater1.Items)
{
    CheckBoxList list = (CheckBoxList)item.FindControl("CheckBoxList1");
    foreach (ListItem li in list.Items)
    {
        TrackedEntity te = tebo.getById(Decimal.Parse(li.Value)); // Runs 1 query per TrackedEntity
        ...
    }
}
... , I could save a lot of time by saying:

TrackedEntityBO tebo = new TrackedEntityBO(context);
// Runs a single LINQ query for all TrackedEntities
Dictionary<decimal, TrackedEntity> ueMap = tebo.getAll().ToDictionary(c => c.TrackedEntityId);

foreach (RepeaterItem item in Repeater1.Items)
{
    CheckBoxList list = (CheckBoxList)item.FindControl("CheckBoxList1");
    foreach (ListItem li in list.Items)
    {
        TrackedEntity te = ueMap[Decimal.Parse(li.Value)];
        ...
    }
}
The folks who designed the Entity Framework were aware of this, and they took action to prevent round-trips as much as possible. For example:
  1. Each context instance can cache a lot of the data that it gathers. If a later LINQ query on the same context can be determined to contain only data that has previously been cached, the Framework will use the cached data rather than executing another query.
  2. Unlike Linq to SQL, the Entity Framework will not make a database query without being asked. For example, the following code will produce an exception because the EntityCategory object never got loaded into the context:


    TrackedEntity e = (from te in context.TrackedEntity
                       select te).First();
    Console.WriteLine(e.EntityCategory.Name);
As I pointed out in my last post, you simply can't make assumptions regarding which items have been loaded into the context. That means if you want to get certain objects associated with other objects, you'll either have to do a whole bunch of queries, or put both groups of objects into a data structure that maps them to one another. Thanks to the power of the .NET framework, however, this isn't all that difficult. For example, with relatively few lines of actual code, I was able to make the following statements possible:

IQueryable<TrackedEntity> teQuery = from te in context.TrackedEntity
                                    orderby te.Name
                                    select te;
IQueryable<EntityCategory> ecQuery = from te in teQuery
                                     let ec = te.EntityCategory
                                     orderby ec.Name
                                     select ec;
JoinedEntityMap<EntityCategory, TrackedEntity> j =
    ecQuery.ToMap(c => c.EntityCategoryId, teQuery, te => te.EntityCategory);

foreach (EntityCategory c in j.UniqueKeys)
{
    foreach (TrackedEntity e in j[c])
    {
        Console.WriteLine(c.Name + ": " + e.Name);
    }
}
With negligible memory overhead, the above code causes a grand total of two SQL queries and takes 110 milliseconds to run with 4 EntityCategories and 47 TrackedEntities, as opposed to the code below, which takes 251 milliseconds:

IQueryable<EntityCategory> ecQuery = from ec in context.EntityCategory
                                     orderby ec.Name
                                     select ec;

foreach (EntityCategory c in ecQuery)
{
    IQueryable<TrackedEntity> teQuery = from te in context.TrackedEntity
                                        where te.EntityCategory.EntityCategoryId == c.EntityCategoryId
                                        orderby te.Name
                                        select te;
    foreach (TrackedEntity e in teQuery)
    {
        Console.WriteLine(c.Name + ": " + e.Name);
    }
}
With that kind of difference on such a small scale, you can imagine how much of a difference it could make if you had hundreds, or even thousands of records in each table!

Another potential benefit to this approach is that the main page's code can create the data it knows will be needed, and then hand strongly-typed data to each user control in turn. It not only makes it possible for the user controls to be more specific about what sort of data they need as parameters, but it also increases the transparency of your data accesses. With most of your calls to the data layer occurring in one code file, it becomes obvious what's taking so much time to load. And if you're not using an HttpContext-based Entity context (which we are), it can also mitigate the problems you'll run into with conflicting contexts.
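Neither JoinedEntityMap nor ToMap appears in this post; here's a minimal sketch consistent with how they're used above (the internals are assumptions, not the actual implementation):

public class JoinedEntityMap<TParent, TChild>
{
    private readonly List<TParent> parents;
    private readonly Func<TParent, object> keyOf;
    private readonly ILookup<object, TChild> childrenByKey;

    public JoinedEntityMap(List<TParent> parents, Func<TParent, object> keyOf,
        ILookup<object, TChild> childrenByKey)
    {
        this.parents = parents;
        this.keyOf = keyOf;
        this.childrenByKey = childrenByKey;
    }

    public IEnumerable<TParent> UniqueKeys { get { return parents; } }

    public IEnumerable<TChild> this[TParent parent]
    {
        get { return childrenByKey[keyOf(parent)]; }
    }
}

public static class JoinedEntityMapExtensions
{
    public static JoinedEntityMap<TParent, TChild> ToMap<TParent, TChild>(
        this IQueryable<TParent> parentQuery, Func<TParent, object> keyOf,
        IQueryable<TChild> childQuery, Func<TChild, TParent> parentOf)
    {
        // Query 1: the parents, de-duplicated by key.
        List<TParent> parents = parentQuery.AsEnumerable()
            .GroupBy(keyOf).Select(g => g.First()).ToList();
        // Query 2: the children, grouped in memory by their parent's key.
        ILookup<object, TChild> childrenByKey = childQuery.AsEnumerable()
            .ToLookup(c => keyOf(parentOf(c)));
        return new JoinedEntityMap<TParent, TChild>(parents, keyOf, childrenByKey);
    }
}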

Monday, February 16, 2009

A better understanding of Eager Loading

It turns out to be a very bad idea to rely on the behavior that I mentioned in my last posting. Even though the first Linq query I mentioned tends to cause those requested objects to be loaded into the context, there is no guarantee that those are the only objects that will be brought in. For example, if I run the following code:

List<EntityCategory> l2 = (from ec in context.EntityCategory.Include(c => c.TrackedEntity)
                           select ec).ToList();

List<EntityCategory> l1 = (from ec in context.EntityCategory
                           from te in ec.TrackedEntity
                           from ue in te.User
                           where ue.UserId == user.UserId
                           orderby ec.Name
                           select ec).DistinctBy(c => c.EntityCategoryId).ToList();

... l1.First().TrackedEntity will contain ALL of the tracked entities related to the given entity category. This is because the first statement loaded all of these entities into the context, and the second statement reused the objects that were created there.

So, what have we learned? We can only trust a LINQ statement to give us the data we ask for in the select statement, and we should never trust associated entity objects to even exist, much less contain what we're expecting.

I have a plan for dealing with this, which I'll detail once I get the kinks worked out.

Friday, February 13, 2009

Eager loading limitations (driving me crazy!)

Yesterday I talked about some strategies for Eager Loading in the Entity Framework. Well, I'm running into a really perplexing situation now.

If I run a LINQ query like this:
List<EntityCategory> l1 = (from u in context.UserSet
                           where u.UserId == user.UserId
                           from t in u.UserEntities
                           orderby t.EntityCategory.Name
                           select t.EntityCategory).ToList();
I find that all of the entities accessed by the query are eager loaded... sort of. I can access l1.First().TrackedEntity.First().Name, for example. But l1.First().TrackedEntity.IsLoaded is false! So the data exists in the context, but the context doesn't think it's loaded. Furthermore, adding an Include("TrackedEntity") doesn't do a thing when you do joins. So even though all the information gets pulled into memory correctly, if I put any "if(!isloaded) load" code in my front-end (which would normally be a robust way of doing things), it will reload all the same data from the database for each EntityCategory.

Now, I could do a fancy query like this:
from ec in context.EntityCategory.Include("TrackedEntity")
where ec.TrackedEntity.Any(te => te.User.Any(u => u.UserId == user.UserId))
select ec
But the eager-loaded results would give me ALL of the tracked entities for each entity category, and not just the ones selected by the given user.

Is there any way to make a statement that selectively joins tables AND eager-loads ONLY the entities included in the join, marking the appropriate EntityCollection as Loaded?

Thursday, February 12, 2009

Lazy and Eager Loading Strategies in the Entity Framework

We're working on a Timesheets project to help us learn .NET. Our architecture uses the fairly new .NET Entity Framework for its data model. We use three layers:
  1. The data access layer consists of an EDMX file, which defines CLR objects based on our database schema. (e.g. TrackedTime)
  2. The business layer consists of "business objects" with methods that do things with the objects in the data access layer. (e.g. public List<TrackedTime> getByUserAndDate(User user, DateTime dateTime))
  3. The front-end is an ASP.NET Website.
So, to get a list of all of a user's tracked times for a particular day, our front-end can call the following method:
public List<TrackedTime> getByUserAndDate(User user, DateTime dateTime)
{
    ObjectQuery<TrackedTime> queryObject = context.TrackedTime;
    return queryByUserAndDate(queryObject, user, dateTime).ToList();
}
The method "queryByUserAndDate" generates a LINQ query that asks for all TrackedTime objects in the given ObjectQuery that have the given User and Date. The list returned by "getByUserAndDate" can then be bound to a control in the front-end, such as a DataGrid, using an ObjectDataSource.

This works fine, as long as our DataGrid only displays the data tied directly to the TrackedTime object. If we want to have a field in the DataGrid that shows all the TrackedEntities that are related to our TrackedTime object, we will need to make sure they get loaded from the database. There are two ways to do this.

One way is Lazy Loading: for each row in our datagrid, we have to insert the following code before the TrackedEntities get accessed:
if (!trackedTime.TrackedEntities.IsLoaded)
{
    trackedTime.TrackedEntities.Load();
}
There are a few problems with this approach:
  1. It messes up our three-tiered architecture, since data is being loaded almost directly from the front-end layer.
  2. If the context object used in getByUserAndDate has been disposed before this code gets reached, it will throw an exception. This isn't such a problem in our architecture, for reasons I will probably discuss in another blog posting, but for many applications it simply wouldn't work.
  3. getByUserAndDate only made one database query to load in all "N" TrackedTime objects. Now, we're manually forcing "N" database queries to load in the TrackedEntity objects. With all the overhead involved in a database query, this can slow the overall process significantly.
The other approach is to use so-called "Eager" loading. In this approach, we make a method getByUserAndDateWithTrackedEntities to load the TrackedEntities in the initial LINQ statement:
public List<TrackedTime> getByUserAndDateWithTrackedEntities(User user, DateTime dateTime)
{
    ObjectQuery<TrackedTime> queryObject = context.TrackedTime.Include("TrackedEntities");
    return queryByUserAndDate(queryObject, user, dateTime).ToList();
}
This has the advantage of using a single database call to load all TrackedEntities for all the TrackedTime objects that are going to be returned. However, it creates a couple of new problems:
  1. Experienced programmers are loath to use Strings to define data that's tied to the object model. If, for example, you mistakenly say "TrackedEntity" instead, you'll have no idea that there's something wrong with your code until it runs. If this method only gets run in a corner case, you might not catch the bug for a long, long time. And it's well established that bugs become enormously more expensive to fix the later they're caught in the production cycle. We want the compiler to tell us if we're making a dumb mistake as soon as we make it. It's even better if we can use Intellisense to help us avoid making the error in the first place!
  2. If we have to create a new method for every possible combination of includes, this could get really cumbersome. Are we going to make a getByDateWithUserAndTrackedEntitiesAndTheirCategories method, too? Ideally, we would like to have one method that could take the includes as arguments.
To solve problem #1, a gentleman named Matthieu MEZIL (is he French?) created Extension Methods that basically snap into the ObjectQuery class to give you new ways of calling the Include method. After creating a class based on the one he created, we can change getByUserAndDateWithTrackedEntities thusly:
ObjectQuery<TrackedTime> queryObject = context.TrackedTime.Include(c => c.TrackedEntities);
This gives us slightly more overhead because it basically just uses reflection behind the scenes to create the "TrackedEntities" string at run-time. But it's a trifle compared to all the other overhead inherent in ASP.NET, and it is now fully strongly typed.
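A minimal sketch of that kind of extension, assuming the lambda body is a plain property access (the real implementation handles more cases):

public static class IncludeExtensions
{
    // Reflect the property name out of the expression tree at run-time,
    // then call the built-in string-based Include overload.
    public static ObjectQuery<T> Include<T, TProperty>(
        this ObjectQuery<T> query, Expression<Func<T, TProperty>> selector)
    {
        MemberExpression member = (MemberExpression)selector.Body;
        return query.Include(member.Member.Name);
    }
}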

Is there a way to solve problem #2 in this context as well? To be honest, M. MEZIL's extension methods use deep magic that is beyond my current understanding of the .NET framework, so I haven't figured out how one might go about altering the getByUserAndDate method to allow the same kind of "lambda expression" to be passed into it, although there is certainly a way to do so. But the limitations of the front-end leave us with certain practical concerns as well, which might make it impossible (or at least really annoying) to call the method this way. You see, ASP.NET relies heavily on reflection to make it easier to get the "it just works" behavior that Microsoft is striving for. For example, there appears to be no way in my .aspx code to tell the ObjectDataSource to use actual objects as arguments to my business layer. If I had a method that looked like this:
public List<TrackedTime> getByUserAndDateWithTrackedEntities(String userId, String dateTimeStr)
... I could just use the wizard in Visual Web Developer to create simple XML-ish code that tells the framework where to get the userId and dateTime to use when calling the getByUserAndDateWithTrackedEntities method. But if I want to pass actual objects around (which not only helps us prevent the situation mentioned in #1, but also saves us from re-loading the same User object from the entity framework for every control on the page that wants information from it), I have to leave the parameter values blank in the .aspx file and put something like this in my code-behind:
protected void TodayTimes_Selecting(object sender, ObjectDataSourceSelectingEventArgs e)
{
    e.InputParameters["user"] = currentUser;
    e.InputParameters["dateTime"] = DateTime.Parse(DateList.SelectedValue);
}
It's ugly, but that's the price we pay for those nifty wizards and that nice design view. If I want to bind the GridView programmatically I can, but I would have to sacrifice some of what makes ASP.NET attractive in the first place. Let's assume for a moment that I won't always want to make that sacrifice. How would I add a lambda expression to the InputParameters in the above code? If it's possible, it's undoubtedly messy, and since the function is only getting invoked through reflection anyway, we've already lost the benefits of Intellisense and compiler checks, so why bother?

With that in mind, let's examine a way to solve problem #2 without solving problem #1. I begin by creating the following method extension:
public static ObjectQuery<T> Include<T>(this ObjectQuery<T> mainQuery, params string[] paths)
{
    ObjectQuery<T> q = mainQuery;
    foreach (string path in paths)
    {
        q = q.Include(path);
    }
    return q;
}
As you can see, all this does is call Include iteratively on every string that gets passed in to it. Now we can change our getByUserAndDate function like so:
public List<TrackedTime> getByUserAndDate(User user, DateTime dateTime, params string[] includes)
{
    ObjectQuery<TrackedTime> queryObject = context.TrackedTime.Include(includes);
    return queryByUserAndDate(queryObject, user, dateTime).ToList();
}
Then, depending on how much data we want to load eagerly, we can call this programmatically like so:
  • myGridView.DataSource = trackedTimeBo.getByUserAndDate(user, dateTime);
  • myGridView.DataSource = trackedTimeBo.getByUserAndDate(user, dateTime, "TrackedEntities");
  • myGridView.DataSource = trackedTimeBo.getByUserAndDate(user, dateTime, "TrackedEntities", "TrackedEntities.EntityCategory");
  • ...
Or, in the case mentioned earlier, we can add the input parameter like so:
  • e.InputParameters["includes"] = new string[] {};
  • e.InputParameters["includes"] = new string[] {"TrackedEntities"};
  • e.InputParameters["includes"] = new string[] {"TrackedEntities", "TrackedEntities.EntityCategory"};
  • ...
So we end up with a couple of good ways to do Eager Loading in the Entity Framework, each with their advantages and disadvantages. We'll probably end up using a combination of the two. Or maybe we'll find another way that makes even more sense. Who knows?

Tuesday, February 10, 2009

Watch the ViewState While Debugging

In my previous post, I briefly showed how to use the ViewState to persist data. This post shows how to overcome the difficulty of seeing what data is currently in the ViewState while debugging. The problem? ViewState is a StateBag, which, unlike Hashtable, does not have a default visualizer that makes its values easy for the debugger to look at. In fact, the actual keys and values don't appear to be contained within the StateBag object itself, which makes drilling through its data pointless. (I'm guessing that since the ViewState has per-user-control-per-instance values, it probably stores the real data in the ASP.NET "magic zone" where all the context loading and such takes place. The StateBag probably just provides access methods to wherever the data is really stored.)

So how can you get a quick-and-easy view of the current ViewState? Just add the following line to your Watch panel:
new System.Collections.Hashtable(ViewState)
The Hashtable constructor iterates through all of the key/value pairs of the StateBag and creates a Hashtable object out of it. Since Hashtable has a key/value pair visualizer in Visual Studio, you get a nice, clean table of keys and values.
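If you'd rather not type that expression every time, the same trick can be wrapped in a hypothetical debug-only helper on your page or control:

#if DEBUG
// Inspect ViewStateSnapshot in the Watch panel to get the same
// Hashtable view of the StateBag.
protected System.Collections.Hashtable ViewStateSnapshot
{
    get { return new System.Collections.Hashtable(ViewState); }
}
#endif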