## Finding duplicates in vector

Learning APL or new to Dyalog? Ask "silly" questions here, without fear...

### Finding duplicates in vector

Hi,

I've decided to write a function to find duplicates (all elements which appears more than once) in a vector. It is a purely educational task.
I tried to make it generic, to support both vector of strings and vector of numbers. Here is my implementation:
dups←{(1<+/u∘.≡⍵)/u←∪⍵}
dups 1 2 3 2 2 3 4 3 2 5
┌→──┐
│2 3│
└~──┘
dups 'aa' 'bbb' 'cc' 'aa'
┌→─────┐
│ ┌→─┐ │
│ │aa│ │
│ └──┘ │
└∊─────┘

The idea was to first get a list of unique elements, then build a table comparing these elements to source array and filter out those who have more than 1 element in the row of the table. Here I tried to follow the APL approach to add a new dimension to the source data and try to reduce it back.

Please provide any suggestions or alternative more elegant implementations!
alexeyv

Posts: 56
Joined: Tue Nov 17, 2015 4:18 pm

### Re: Finding duplicates in vector

A more usual solution would be:
{∪((⍵⍳⍵)≠⍳⍴⍵)/⍳⍴⍵}
Here's a tacit equivalent:
∪⊢(/⍨)⍳⍨≠⍳∘⍴

Phil Last

Posts: 557
Joined: Thu Jun 18, 2009 6:29 pm

### Re: Finding duplicates in vector

Phil, I think you meant
{∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}

Now for something completely different:
x←?20⍴10
x
3 7 3 3 4 7 8 3 2 8 7 1 2 4 9 7 4 1 4 4
{⍺,≢⍵}⌸x
3 4
7 4
4 5
8 2
2 2
1 2
9 1
(1≠t[;1])/t[;0]⊣t←{⍺,≢⍵}⌸x
3 7 4 8 2 1
Roger|Dyalog

Posts: 215
Joined: Thu Jul 28, 2011 10:53 am

### Re: Finding duplicates in vector

Phil, I didn't quite get your last expression, all I get is
∪⊢(/⍨)⍳⍨≠⍳∘⍴x
SYNTAX ERROR: The function requires a left argument

..but perhaps tacits are not meant for us mere mortals :)

I'd like also rewrite Roger's idea in a little bit more aplitically correct form:
{(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x

For my amusement there were quite remarkable performance differences between some keyed approaches, such as
↑,/{1<≢⍵:⍺ ⋄ ⊂⍬}⌸x
and
∊(//)⌽{⍺,1<≢⍵}⌸x

-Veli-Matti
Veli-Matti

Posts: 56
Joined: Sat Nov 28, 2009 3:12 pm

### Re: Finding duplicates in vector

Quite coincidentally this week's #onelinerwednesday Tweet (which also appears on http://www.dyalog.com) deals with this question.

The one-liner that has been tweeted is
(∪(/⍨){1≠≢⍵}⌸) 4 3 2 3 5 2 3
It has the merits of being ⎕io-indepedent, and also, like Roger's expression above, returns the non-unique elements in the order in which they first appear in the original vector.

AndyS|Dyalog

Posts: 235
Joined: Tue May 12, 2009 6:06 pm

### Re: Finding duplicates in vector

Roger. Yes, I changed the dfn at the last minute from
{∪⍵/⍨(⍵⍳⍵)≠⍳⍴⍵}
(always a mistake.)
Veli-Matti.
∪⊢(/⍨)⍳⍨≠⍳∘⍴x
is a function. It requires either assignment
dups←∪⊢(/⍨)⍳⍨≠⍳∘⍴
or parentheses and an argument
(∪⊢(/⍨)⍳⍨≠⍳∘⍴)arg
to run it in-line.

Phil Last

Posts: 557
Joined: Thu Jun 18, 2009 6:29 pm

### Re: Finding duplicates in vector

I couldn't but run a little test:
x←?2000⍴10000
cmpx'(∪(/⍨){1≠≢⍵}⌸)x' '∊{1<≢⍵:⍺ ⋄ ⊂⍬}⌸x' '(1≠t[;2])/t[;1]⊣t←{⍺,≢⍵}⌸x' '{(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x' '{∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x'
(∪(/⍨){1≠≢⍵}⌸)x → 1.3E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
∊{1<≢⍵:⍺ ⋄ ⊂⍬}⌸x → 1.6E¯3 | +19% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
(1≠t[;2])/t[;1]⊣t←{⍺,≢⍵}⌸x → 3.8E¯5 | -98% ⎕
{(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x → 3.8E¯5 | -98% ⎕
* {∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x → 1.5E¯5 | -99%
10↑(∪(/⍨){1≠≢⍵}⌸)x
5874 6973 8115 8713 1386 8735 5212 5708 8672 2474
10↑{∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x
8672 2944 2650 2369 5874 6592 8413 3907 8568 8050

The shortest is not always the quickest, and quite interestingly the results are in a different order.

-wm
Veli-Matti

Posts: 56
Joined: Sat Nov 28, 2009 3:12 pm

### Re: Finding duplicates in vector

Interesting benchmarks. I picked the "key" approach because it uses one "expensive" operation ({⍺,≢⍵}⌸) whereas {∪((⍵⍳⍵)≠⍳⍴⍵)/⍵} uses two such operations (∪ and ⍵⍳⍵), but it looks like the "unique" approach is better overall.

There are arguments for which the key approach is faster, illustrating my claim about "expensive operations":

u←?100⍴2e9
x←u[?4e6⍴≢u]
cmpx '{(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x' '{∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x'
{(1<⊢/⍵)/⊣/⍵}{⍺,≢⍵}⌸x → 2.80E¯2 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* {∪((⍵⍳⍵)≠⍳⍴⍵)/⍵}x → 3.12E¯2 | +11% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

A couple of side notes:

I'd probably rewrite the unique approach as {∪⍵⌿⍨(⍳≢⍵)=⍳⍨⍵}. (a) It'd work for higher-ranked arguments if/when ∪ is extended to higher-ranked arguments; (b) It puts ⍳⍨ on the right where it could theoretically run faster, in case the implementation chooses a slower algorithm due to lack of space; (c) It's shorter.

Secondly, f⌸ has special code for a few cases, and when those kick in the performance difference can be significant. {⍺,≢⍵}⌸ is one of the special cases. A complete list of such cases can be found in the TP4 workshop materials from Dyalog '15 (it's in special.htm in the "special" folder).

Also, APL generally is faster if you give it larger arguments to work with. For example, if you say 1<≢⍵ in the key operand it's operating on a little piece, whereas if you say 1< on the result of f⌸ is operating on a larger piece. This is a point I made in the TP4 workshop. I believe it is one of the goals of the compiler (see Foad, Jay) to address this effect.
Roger|Dyalog

Posts: 215
Joined: Thu Jul 28, 2011 10:53 am

### Re: Finding duplicates in vector

f⌸ has special code for a few cases, and when those kick in the performance difference can be significant. {⍺,≢⍵}⌸ is one of the special cases.

For example:

x←?1e6⍴4e5
cmpx '{⍺,≢⍵}⌸x' '{⊢⍺,≢⍵}⌸x'
{⍺,≢⍵}⌸x → 2.29E¯2 | 0% ⎕⎕
{⊢⍺,≢⍵}⌸x → 3.40E¯1 | +1385% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
Roger|Dyalog

Posts: 215
Joined: Thu Jul 28, 2011 10:53 am

### Re: Finding duplicates in vector

Wow thanks for these amazing replies!
Phil, thanks for the solution with ⍵⍳⍵! I completely forgot this idiom and its use-cases. It really looks like idiomatic way to do it!
However I'm having a hard time trying to understand tacit version. Maybe it is just me but tacit version doesn't look readable even if I draw it like a tree
∪⊢(/⍨)⍳⍨≠⍳∘⍴
┌─┴─┐
∪ ┌─┼───┐
⊢ ⍨ ┌─┼─┐
┌─┘ ⍨ ≠ ∘
/ ┌─┘ ┌┴┐
⍳ ⍳ ⍴

Can you explain what is happening here? For example why right tack is used and why compose is used as well.
Probably it could be helpful to have an option in interpreter to rewrite expression in tacit notation in the typical lambda-function form?

Roger, the idea of using keys (or hashtables) was something I thought about (and actually implemented in another language - just a hashtable holding number of elements every input element is encountered), but forgot what it is available in Dyalog APL. It is really interesting and educational to see how it is used here.
alexeyv

Posts: 56
Joined: Tue Nov 17, 2015 4:18 pm

Next