Friday, July 10, 2009

Select Non-Distinct with Linq!

So earlier today I had a large collection and I wanted to select items from it which were essentially duplicates in some way. Getting distinct items from a collection in C# is easy:

var result = MyCollection.Distinct();

Getting the opposite, in sense, is a little trickier. At first I wrote the following:

var result = from Item a in MyCollection
where MyCollection.Count(b=> b.Id = a.Id) > 1
select a;

Which works! Unfortunately, it was way too slow. The code above will enumerate through the collection for every item in the collection, resulting in O(n^2) processing time. Since my collection had 13k items in it, that meant 169,000,000 enumerations. Yikes!

The solution I came up with was to utilize the grouping ability of linq. So I first grouped my collection by the Id, checked which groups have a count greater than 1 and select those, then utilize selectmany to flatten the enumerables into one!

private IEnumerable<Item> GetDuplicates()
{
return (from Item a in MyCollection
group a by a.Id into g
where g.Count() > 1
select g).SelectMany(g => g.AsEnumerable());
}

Great! However, this is still not optimal. The SelectMany unfortunately results in another layer of enumeration. Iterating through the groups manually, as in the following code, results in a roughly 4 fold performance improvement over the linq query above:

private IEnumerable<Item> GetDuplicates()
{
var results = from Item a in MyCollection
group a by a.Id into g
where g.Count() > 1
select g;

foreach (var group in results)
foreach (var item in group)
yield return item;
}

The result is we now only iterate through the collection once, and then the grouping once. The problem goes from being O(n^2) to < O(2n)!

Now you know how to get duplicate items from you collections. Enjoy!

2 comments:

  1. You can shorten this code as follows:

    var byId = collection.GroupBy(a => a.id).Where(g => g.Count()>1);
    return byId.SelectMany(a => a);

    Hope this helps.

    ReplyDelete
  2. http://protiguous.blogspot.com/2011/05/ienumerable-duplicates-select-non.html

    ReplyDelete