var result = MyCollection.Distinct();
Getting the opposite, in sense, is a little trickier. At first I wrote the following:
var result = from Item a in MyCollection
where MyCollection.Count(b=> b.Id = a.Id) > 1
select a;
Which works! Unfortunately, it was way too slow. The code above will enumerate through the collection for every item in the collection, resulting in O(n^2) processing time. Since my collection had 13k items in it, that meant 169,000,000 enumerations. Yikes!
The solution I came up with was to utilize the grouping ability of linq. So I first grouped my collection by the Id, checked which groups have a count greater than 1 and select those, then utilize selectmany to flatten the enumerables into one!
private IEnumerable<Item> GetDuplicates()
{
return (from Item a in MyCollection
group a by a.Id into g
where g.Count() > 1
select g).SelectMany(g => g.AsEnumerable());
}
Great! However, this is still not optimal. The SelectMany unfortunately results in another layer of enumeration. Iterating through the groups manually, as in the following code, results in a roughly 4 fold performance improvement over the linq query above:
private IEnumerable<Item> GetDuplicates()
{
var results = from Item a in MyCollection
group a by a.Id into g
where g.Count() > 1
select g;
foreach (var group in results)
foreach (var item in group)
yield return item;
}
The result is we now only iterate through the collection once, and then the grouping once. The problem goes from being O(n^2) to < O(2n)!
Now you know how to get duplicate items from you collections. Enjoy!
You can shorten this code as follows:
ReplyDeletevar byId = collection.GroupBy(a => a.id).Where(g => g.Count()>1);
return byId.SelectMany(a => a);
Hope this helps.
http://protiguous.blogspot.com/2011/05/ienumerable-duplicates-select-non.html
ReplyDelete