When working with data in Scala, duplicates can often be a problem that needs to be addressed. Dropping duplicates is a common operation in data processing and analysis. Here are 5 ways to drop duplicates in Scala, each with its own unique approach and application.
Method 1: Using distinct() Method

The most straightforward way to drop duplicates in Scala is the distinct method, which is called without parentheses. It returns a new collection that contains no duplicate elements and keeps the first occurrence of each element in its original order. Here is an example:
val list = List(1, 2, 2, 3, 4, 4, 5)
val distinctList = list.distinct
println(distinctList) // prints: List(1, 2, 3, 4, 5)
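The same method is available on other standard sequence types, not only List. A small sketch, written in script/REPL style like the rest of this article; the value names are illustrative only:

```scala
// distinct keeps the first occurrence of each element on any sequence type.
val letters = Vector("a", "b", "a").distinct
val chars   = "banana".distinct
val numbers = Array(3, 1, 3, 2).distinct.toList

println(letters) // prints: Vector(a, b)
println(chars)   // prints: ban
println(numbers) // prints: List(3, 1, 2)
```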
Subtopic: distinct() on Custom Objects
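When working with custom objects, distinct() relies on equals() and hashCode() to decide which elements are duplicates. Case classes get structural versions of both from the compiler, so the overrides below are redundant for Person and are shown only to illustrate what a regular (non-case) class would have to define:
case class Person(name: String, age: Int) {
  // Two Person values are equal when both fields match.
  override def equals(obj: Any): Boolean = obj match {
    case Person(n, a) => n == name && a == age
    case _ => false
  }
  // Equal objects must produce equal hash codes.
  override def hashCode(): Int = (name, age).##
}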
val people = List(Person("John", 30), Person("Alice", 25), Person("John", 30))
val distinctPeople = people.distinct
println(distinctPeople) // prints: List(Person(John,30), Person(Alice,25))
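To see where the overrides actually matter, the following sketch compares a case class, a regular class without overrides, and a regular class with them. The class names CasePerson, PlainPerson and KeyedPerson are hypothetical and exist only for this illustration:

```scala
// Hypothetical types for illustration; they are not part of any library.
final case class CasePerson(name: String, age: Int)

// A regular class compares by reference unless equals/hashCode are overridden.
final class PlainPerson(val name: String, val age: Int)

final class KeyedPerson(val name: String, val age: Int) {
  override def equals(other: Any): Boolean = other match {
    case p: KeyedPerson => p.name == name && p.age == age
    case _              => false
  }
  override def hashCode(): Int = (name, age).##
}

val casePeople  = List(CasePerson("John", 30), CasePerson("John", 30)).distinct
val plainPeople = List(new PlainPerson("John", 30), new PlainPerson("John", 30)).distinct
val keyedPeople = List(new KeyedPerson("John", 30), new KeyedPerson("John", 30)).distinct

println(casePeople.size)  // prints: 1
println(plainPeople.size) // prints: 2
println(keyedPeople.size) // prints: 1
```

Only the regular class without overrides keeps both entries, because its instances are compared by reference.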
Method 2: Using groupBy() Method

Another approach is to use the groupBy() method, which groups elements by a key function; taking the first element of each group then eliminates duplicates. groupBy() returns a Map, so the order of the result is not guaranteed to match the input:
val list = List(1, 2, 2, 3, 4, 4, 5)
val distinctList = list.groupBy(identity).map(_._2.head).toList
println(distinctList) // contains 1, 2, 3, 4, 5, but not necessarily in this order
Subtopic: groupBy() on Custom Objects
For custom objects, pass groupBy() a function that returns the key on which you want to deduplicate. Elements that share a key count as duplicates even if their other fields differ, so grouping by name keeps only one Person per name:
case class Person(name: String, age: Int)
val people = List(Person("John", 30), Person("Alice", 25), Person("John", 30))
val distinctPeople = people.groupBy(_.name).map(_._2.head).toList
println(distinctPeople) // contains Person(John,30) and Person(Alice,25); order is not guaranteed
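If the goal is deduplication by key while keeping the original order, distinctBy may be a simpler fit than groupBy(). This sketch assumes Scala 2.13 or later, where distinctBy was added to the standard collections; the second John has a different age here to show that only the key is compared:

```scala
// Requires Scala 2.13+ (distinctBy is not available in 2.12).
case class Person(name: String, age: Int)

val people = List(Person("John", 30), Person("Alice", 25), Person("John", 41))

// Keeps the first element seen for each name, in the original order.
val byName = people.distinctBy(_.name)
println(byName) // prints: List(Person(John,30), Person(Alice,25))
```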
Method 3: Using toSet() Method
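Converting a collection to a set automatically removes duplicates because sets in Scala cannot contain duplicate elements. The default immutable Set is hash-based once it holds more than four elements, so the original order is not preserved:
val list = List(1, 2, 2, 3, 4, 4, 5)
val distinctSet = list.toSet
println(distinctSet) // contains 1 to 5; printed order is unspecified, e.g. HashSet(5, 1, 2, 3, 4) on Scala 2.13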
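When the original order matters, one option is an insertion-ordered set. A minimal sketch, assuming that a mutable LinkedHashSet from the standard library is acceptable as an intermediate step:

```scala
import scala.collection.mutable

val list = List(1, 2, 2, 3, 4, 4, 5)

// LinkedHashSet iterates in insertion order, unlike the default hash-based Set.
val ordered = (mutable.LinkedHashSet.empty[Int] ++= list).toList
println(ordered) // prints: List(1, 2, 3, 4, 5)
```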
Subtopic: toSet() on Custom Objects
As with distinct(), toSet() relies on equals() and hashCode() to detect duplicates. A case class already has suitable implementations, so the overrides below matter only for a regular class:
case class Person(name: String, age: Int) {
  override def equals(obj: Any): Boolean = obj match {
    case Person(n, a) => n == name && a == age
    case _ => false
  }
  override def hashCode(): Int = (name, age).##
}
val people = List(Person("John", 30), Person("Alice", 25), Person("John", 30))
val distinctPeopleSet = people.toSet
println(distinctPeopleSet) // prints: Set(Person(John,30), Person(Alice,25))
Method 4: Using filter() Method
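You can also combine filter() with indexOf() to remove duplicates by keeping an element only at the position of its first occurrence. Because indexOf() rescans the list for every element, this takes quadratic time and is a poor fit for large datasets:
val list = List(1, 2, 2, 3, 4, 4, 5)
// Keep x only where its index equals the index of its first occurrence.
val distinctList = list.zipWithIndex.filter { case (x, i) => list.indexOf(x) == i }.map(_._1)
println(distinctList) // prints: List(1, 2, 3, 4, 5)
Note that comparing list.indexOf(i) with list.lastIndexOf(i) instead does not deduplicate: it keeps only the elements that occur exactly once and yields List(1, 3, 5).
More Efficient Approach with filter()
A more efficient approach using filter() keeps track of the elements seen so far, typically in a set: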
val list = List(1, 2, 2, 3, 4, 4, 5)
var seen = Set[Int]()
val distinctList = list.filter { x =>
  if (seen.contains(x)) false
  else {
    seen += x
    true
  }
}
println(distinctList) // prints: List(1, 2, 3, 4, 5)
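The same idea can be wrapped in a reusable function so that the mutable state stays local. This is a sketch rather than a standard-library API; the name dedupe is hypothetical:

```scala
import scala.collection.mutable

// Hypothetical helper; keeps the first occurrence of each element.
def dedupe[A](xs: List[A]): List[A] = {
  val seen = mutable.HashSet.empty[A]
  // add returns true only when the element was not already in the set.
  xs.filter(x => seen.add(x))
}

val ints    = dedupe(List(1, 2, 2, 3, 4, 4, 5))
val strings = dedupe(List("a", "b", "a"))

println(ints)    // prints: List(1, 2, 3, 4, 5)
println(strings) // prints: List(a, b)
```

Keeping the set inside the function avoids the top-level var and makes the helper safe to call repeatedly.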
Method 5: Using foldLeft() Method

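Another functional programming approach is to use foldLeft() to accumulate unique elements into a new collection. The version below checks membership with contains() on a List, which is a linear scan, so the overall cost is quadratic: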
val list = List(1, 2, 2, 3, 4, 4, 5)
val distinctList = list.foldLeft(List[Int]())((acc, x) => if (acc.contains(x)) acc else x :: acc).reverse
println(distinctList) // prints: List(1, 2, 3, 4, 5)
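For larger inputs, the fold can carry a Set alongside the result so that each membership check is a hash lookup instead of a list scan. A minimal sketch of that variation, with illustrative value names:

```scala
val list = List(1, 2, 2, 3, 4, 4, 5)

// Accumulator: (elements seen so far, unique elements in reverse order).
val (_, reversed) = list.foldLeft((Set.empty[Int], List.empty[Int])) {
  case ((seen, acc), x) =>
    if (seen(x)) (seen, acc) else (seen + x, x :: acc)
}

val distinctList = reversed.reverse
println(distinctList) // prints: List(1, 2, 3, 4, 5)
```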
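| Method | Description | Efficiency |
|---|---|---|
| distinct() | Returns a new collection with no duplicates, keeping first occurrences in order. | High (hash-based) |
| groupBy() | Groups elements by key and takes the first of each group; result order is unspecified. | High, but builds an intermediate Map |
| toSet() | Converts to a set, which cannot hold duplicates; order is unspecified. | High |
| filter() | Keeps elements that pass a predicate; needs indexOf() or a set of seen elements to deduplicate. | Low with indexOf() (quadratic), high with a set of seen elements |
| foldLeft() | Accumulates unique elements using a fold operation. | Low to Medium with a List accumulator (quadratic contains() checks) |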

Key Points
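- distinct() is a straightforward method for removing duplicates and preserves the order of first occurrences.
- groupBy() and toSet() are efficient methods for eliminating duplicates, but neither guarantees the order of the result.
- filter() is quadratic when paired with indexOf() and close to linear when paired with a set of seen elements.
- foldLeft() provides a functional programming approach to removing duplicates; its cost depends on the accumulator used.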
- Choosing the right method depends on the dataset size, the nature of the data, and performance requirements.
What is the most efficient way to drop duplicates in Scala?
For most in-memory collections, distinct() is the simplest efficient choice: it is hash-based and preserves the order of first occurrences. toSet() and groupBy() are also hash-based and fast, but they do not preserve order.
Can I use these methods on custom objects?
Yes, you can use these methods on custom objects. Case classes work out of the box because the compiler generates equals() and hashCode() for them. Regular classes need properly overridden equals() and hashCode() methods for distinct() and toSet() to work correctly.
What if my dataset is too large to fit into memory?
For datasets too large to fit into memory, consider using a distributed computing framework such as Apache Spark, whose Dataset API provides distinct() and dropDuplicates() for removing duplicates at scale.