Wednesday, April 01, 2009

Working Around DataTable.Merge() Not Changing DataRowState

Before I forget: the documentation on the DataTable.Merge() methods would seem to indicate that the method is smart about using FK relationships and RowState. The reality, which Microsoft has acknowledged in the Connect forum, is that Merge behaves much more like a batch load: constraints are dropped, data is correlated, and constraints are re-enabled (possibly throwing exceptions), but *without* changing row states. There is a suggested workaround using LINQ, which I can't use right now because I'm restricted to the 2.0 framework.
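To make the behavior concrete, here's a minimal sketch (table and column names are illustrative): a row that is Unchanged in the source arrives Unchanged in the target after a merge, so a later DataAdapter.Update would silently skip it.

```csharp
using System;
using System.Data;

// Minimal sketch (illustrative names): Merge carries the source
// row's RowState across instead of marking the new row Added.
class MergeStateDemo
{
    public static DataRowState MergedRowState()
    {
        DataTable target = new DataTable("Target");
        target.Columns.Add("Id", typeof(int));
        target.PrimaryKey = new DataColumn[] { target.Columns["Id"] };

        DataTable source = target.Clone();
        source.Rows.Add(1);
        source.AcceptChanges(); // simulate a freshly filled query result

        target.Merge(source);

        // The row is new to the target, but it keeps the source's
        // Unchanged state, so DataAdapter.Update would not INSERT it.
        return target.Rows[0].RowState;
    }

    static void Main()
    {
        Console.WriteLine(MergedRowState());
    }
}
```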

I was hoping to use the merge functionality to perform some small ETL tasks. Instead, after researching for a while, I came up with the following, relatively simple workaround. I'm using strongly-typed DataSets, which affects the nature of the implementation, but the approach should still work in the more generic form.
If it's not obvious, this sample is in C#.

public static void Upsert&lt;TableType, RowType&gt;(DataTable sourceTable, TableType targetTable)
    where TableType : DataTable, new()
    where RowType : DataRow
{
    TableType inputTable = sourceTable as TableType;

    //if we can't cast to the appropriate type we can run into
    //data problems down below in LoadDataRow
    if (inputTable == null)
    {
        inputTable = new TableType();
        inputTable.Merge(sourceTable, false, MissingSchemaAction.Add);
    }

    //pull the source schema into the target table
    targetTable.Merge(sourceTable.Clone(), true, MissingSchemaAction.Add);

    //apply the resulting combined schema back to the source table
    sourceTable.Merge(targetTable.Clone(), true, MissingSchemaAction.Add);

    //now use LoadDataRow to "merge" the tables together
    foreach (RowType row in inputTable.Rows)
    {
        try
        {
            targetTable.LoadDataRow(row.ItemArray, LoadOption.Upsert);
        }
        catch (Exception ex)
        {
            //TODO: tighten this catch down to what is actually thrown (not in the MSDN documentation?)
            //an exception will be thrown on FK validation failure;
            //this is ok and actually expected in the design of the load
            System.Diagnostics.Debug.WriteLine("Verify Expected Error: " + ex.Message);
        }
    } //next row
}


As you can see, this could use a little more work, but I've managed an Upsert (a combined Insert-or-Update method) implementation. This method was placed in the DataSet's partial class, and each DataTable partial class contains a typed (non-generic) version of the Upsert method that delegates to the generic form above for its implementation.
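For reference, the insert-or-update behavior the loop above leans on comes entirely from LoadDataRow with LoadOption.Upsert: rows are matched on the primary key, existing keys are updated (and marked Modified), and new keys are inserted (and marked Added). A standalone sketch with illustrative names:

```csharp
using System;
using System.Data;

// Standalone sketch (illustrative names) of LoadDataRow/Upsert
// semantics: match on primary key, update existing, insert new.
class LoadDataRowDemo
{
    public static DataTable BuildAndUpsert()
    {
        DataTable target = new DataTable("Target");
        target.Columns.Add("Id", typeof(int));
        target.Columns.Add("Name", typeof(string));
        target.PrimaryKey = new DataColumn[] { target.Columns["Id"] };

        target.Rows.Add(1, "old");
        target.AcceptChanges();

        // Existing key: current value updated, row becomes Modified.
        target.LoadDataRow(new object[] { 1, "new" }, LoadOption.Upsert);
        // New key: row inserted, row becomes Added.
        target.LoadDataRow(new object[] { 2, "added" }, LoadOption.Upsert);

        return target;
    }

    static void Main()
    {
        DataTable t = BuildAndUpsert();
        Console.WriteLine(t.Rows.Count);       // 2
        Console.WriteLine(t.Rows[0].RowState); // Modified
        Console.WriteLine(t.Rows[1].RowState); // Added
    }
}
```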

public partial class MyDataTable
{
    public void Upsert(DataTable table)
    {
        //RowType doesn't appear in the parameter list, so it can't be
        //inferred and has to be supplied explicitly
        MyParentDS.Upsert&lt;MyDataTable, MyDataTableRow&gt;(table, this);
    }
}

The table-level Upsert accepts a base DataTable reference so that the merge can come from a query result that was not filled into a MyParentDS instance. The generic method on the parent either casts or merges it into the strongly-typed form.

The gotcha in all this is that the item array from each row must match the target schema, or you can end up with some seemingly unexpected type-mismatch errors from LoadDataRow. The safest approach, I felt, was to make sure I "cast" to the MyParentDS form before going into the upsert loop itself.
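The mismatch is easy to reproduce, because LoadDataRow maps the ItemArray onto the target by ordinal position, not by column name. A sketch with illustrative names:

```csharp
using System;
using System.Data;

// Sketch (illustrative names): ItemArray is positional, so a source
// table with the same columns in a different order feeds values into
// the wrong target columns.
class ItemArrayGotcha
{
    public static bool TryMismatchedLoad()
    {
        DataTable target = new DataTable("Target");
        target.Columns.Add("Id", typeof(int));
        target.Columns.Add("Name", typeof(string));

        DataTable source = new DataTable("Source");
        source.Columns.Add("Name", typeof(string)); // columns reversed
        source.Columns.Add("Id", typeof(int));
        source.Rows.Add("Alice", 1);

        try
        {
            // "Alice" is offered to the int Id column by position.
            target.LoadDataRow(source.Rows[0].ItemArray, LoadOption.Upsert);
            return true;
        }
        catch (Exception ex)
        {
            Console.WriteLine("Type mismatch: " + ex.Message);
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(TryMismatchedLoad());
    }
}
```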

Recorded here so I don't have to reinvent the wheel again.

2 comments:

Chad said...

You don't need to do this workaround.

To change the DataRowState, set the DataRowStateChanged handler, and set the row state explicitly in the handler.

PapaSwoof said...

Sorry for the really late follow-up. I haven't been back to that particular set of code in a while. And thank you, Chad, for taking a moment to post an alternative solution.

I have a few things to add:
First, I couldn't locate that event directly in the MSDN documentation or on the objects themselves. So, here's the reference to the DataRowChangeEventArgs for the 2.0 framework, which was the target for this conversation.

Second, if the state change can be exposed through an event, why should I be required to go look up the change and apply the state to the row myself? Why not take an argument so that I can tell the Merge to apply the row state change?
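For what it's worth, the manual fix-up can also be done without an event handler: .NET 2.0 added DataRow.SetAdded() and SetModified(), so the states can be rewritten in one pass after the merge. A sketch with illustrative names; it assumes every Unchanged row should become an insert, which only holds here because the target starts empty (real code would need extra bookkeeping, and per my point above, that pass still costs something over millions of rows):

```csharp
using System;
using System.Data;

// Sketch (illustrative names): rewrite row states after a merge with
// SetAdded(), available on DataRow since .NET 2.0. Assumes every
// Unchanged row should look like an insert, which is only true here
// because the target started empty.
class PostMergeFixup
{
    public static DataRowState FixedUpState()
    {
        DataTable target = new DataTable("Target");
        target.Columns.Add("Id", typeof(int));
        target.PrimaryKey = new DataColumn[] { target.Columns["Id"] };

        DataTable source = target.Clone();
        source.Rows.Add(1);
        source.AcceptChanges();

        target.Merge(source); // row arrives Unchanged

        foreach (DataRow row in target.Rows)
        {
            if (row.RowState == DataRowState.Unchanged)
            {
                row.SetAdded(); // now DataAdapter.Update will INSERT it
            }
        }

        return target.Rows[0].RowState;
    }

    static void Main()
    {
        Console.WriteLine(FixedUpState());
    }
}
```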

Finally, for now at least, why would I want to use an event handler against millions of rows? To me that seems like a non-trivial performance impact. Let me stress this again... MILLIONS of rows for an action in what is essentially an ETL activity.

I do not understand what drove Microsoft's decision to decouple the state change from the merge activity. I wish I had someone on the team I could ask for a little peek at the logic behind it, just to settle the question that continues to rattle around my head whenever I see this code.