Raucous Random Ramblings: Select Non-Distinct with Linq!

So earlier today I had a large collection and I wanted to select items from it which were essentially duplicates in some way. Getting distinct items from a collection in C# is easy:

var result = MyCollection.Distinct();

Getting the opposite, in sense, is a little trickier. At first I wrote the following:

var result = from Item a in MyCollection

where MyCollection.Count(b=> b.Id = a.Id) > 1

select a;

Which works! Unfortunately, it was way too slow. The code above will enumerate through the collection for every item in the collection, resulting in O(n^2) processing time. Since my collection had 13k items in it, that meant 169,000,000 enumerations. Yikes!

The solution I came up with was to utilize the grouping ability of linq. So I first grouped my collection by the Id, checked which groups have a count greater than 1 and select those, then utilize selectmany to flatten the enumerables into one!

private IEnumerable<Item> GetDuplicates()

{

return (from Item a in MyCollection

group a by a.Id into g

where g.Count() > 1

select g).SelectMany(g => g.AsEnumerable());

}

Great! However, this is still not optimal. The SelectMany unfortunately results in another layer of enumeration. Iterating through the groups manually, as in the following code, results in a roughly 4 fold performance improvement over the linq query above:

private IEnumerable<Item> GetDuplicates()

{

var results = from Item a in MyCollection

group a by a.Id into g

where g.Count() > 1

select g;

foreach (var group in results)

foreach (var item in group)

yield return item;

}

The result is we now only iterate through the collection once, and then the grouping once. The problem goes from being O(n^2) to < O(2n)!

Now you know how to get duplicate items from you collections. Enjoy!

2 comments:

DananosMarch 14, 2011 at 7:32 AM
You can shorten this code as follows:

var byId = collection.GroupBy(a => a.id).Where(g => g.Count()>1);
return byId.SelectMany(a => a);

Hope this helps.
ProtiguousMay 29, 2011 at 12:43 PM
http://protiguous.blogspot.com/2011/05/ienumerable-duplicates-select-non.html

Raucous Random Ramblings

Friday, July 10, 2009

Select Non-Distinct with Linq!

2 comments:

About Me

Followers

Blog Archive