Wohoo 14
14 posts
• Page 2 of 2 • 1, 2
Re: Wohoo 14
The indexof family of functions (⍳ ∊ ∪ ∩ ~) is a complex topic. For an indepth discussion please see IndexOf, A 30Year Quest (http://www.jsoftware.com/papers/indexof/indexofscript.htm), presented at the J Conference in Toronto, 20140725. An alternative presentation was the workshop SP4: Exploring IndexOf (http://www.dyalog.com/uploads/conference/dyalog14/workshops/SP4_Exploring_IndexOf.zip) at the Dyalog Conference, 20140921. (The link is from http://www.dyalog.com/usermeetings/dyalog14.htm.) The workshop consists of 21 questions, but they are "leading questions" which, in answering them, will (I claim) provide useful insights on indexof.
The ideas discussed in the paper and the workshop are not yet fully implemented in Dyalog APL, but we are working on them. So Paul is indeed correct that currently (v14.0) if x and y are char matrices then inverted table indexof (8⌶) is the fastest way to do indexof on them:
x⍳y and x{(↓⍺)⍳↓⍵}y invoke the same code and so run at the same speed. (Timing differences of less than 5% are artifacts of the timing process and are not significant, or repeatable.) 8⌶ implements the latest thinking, including the use of the CRC instruction available in SSE4.1 (any Intel CPU after 2008). It is expected that in v14.1 the faster code would be used in ⍳, so employing the circumlocutory (,⊂x)(8⌶)(,⊂y) in place of x⍳y is not recommended.
As mentioned above, we are working on implementing the latest ideas. The following benchmarks show the speedups that can be expected once the implementation is complete:
The benchmarks also show that it is tricky to do benchmarks. Your faithful implementers are sneaky guys and will exploit properties of the data to effect faster computation. In this case, smallrange is exploited with the range being computed efficiently using vector instructions available in any nonancient CPU. To give you an idea of the details involved: ∊ on 2e9 uses hashing; ∊ on smallrange uses a table, a faster computation. The factor for 2e6 is smaller than the factor for 4e6 because the v14.0 implementation also uses a table for that range. The factor for 4e6 is bigger because the v14.0 implementation uses hashing where v14.1 would use a table, but without using more temporary space. I conjecture that in practice all integers arguments to the ⍳ family are small range, so that smallrangeness is a special case but not an unusual case. And on and on.
Stig's example is interesting and I am glad he brought it up. I have a suggestion regarding the handling of floats: unless you have reason to believe that ⎕ct is significant, try setting a local ⎕ct←0. As well, there is a possibility that ⍳ could incorporate the same efficient code that is already available in 8⌶. That is, there is the possibility that x xFindn y can be done using plain old x⍳y, with x⍳y being the faster. The following benchmark shows the upper bound on the amount of speedup:
Of course, until the time when ⍳ handles it internally, you have to account for the time needed for the conversion:
I should point out that compared to x1 (the "inverted table" format), x is extravagantly profligate in its use of space:
Inefficient use of space leads directly to unavoidable inefficiencies in time. I will be in touch with Stig regarding additional details on the matrices in his application (not to twist his arm to change his data, but to get details on how best to enhance ⍳ on his data :).
Finally, the extension to ⍳ in v14.0 ("looking for rows") greatly improved it as a tool of thought, and the possibilities take time to sink in. Using ⍳ (and improving its implementation) for Stig's application is one example; the blog post A SpeedUp Story (http://www.dyalog.com/blog/2014/11/aspeedupstory2/) is another.
The ideas discussed in the paper and the workshop are not yet fully implemented in Dyalog APL, but we are working on them. So Paul is indeed correct that currently (v14.0) if x and y are char matrices then inverted table indexof (8⌶) is the fastest way to do indexof on them:
a←(' ',⎕a,⎕d)[?1000 22⍴37]
x←a[?1e6⍴≢a;]
y←a[?1.1e6⍴≢a;]
cmpx 'x⍳y' 'x{(↓⍺)⍳↓⍵}y' '(,⊂x)(8⌶)(,⊂y)'
x⍳y → 3.42E¯1  0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
x{(↓⍺)⍳↓⍵}y → 3.41E¯1  1% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
(,⊂x)(8⌶)(,⊂y) → 8.70E¯2  75% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
x⍳y and x{(↓⍺)⍳↓⍵}y invoke the same code and so run at the same speed. (Timing differences of less than 5% are artifacts of the timing process and are not significant, or repeatable.) 8⌶ implements the latest thinking, including the use of the CRC instruction available in SSE4.1 (any Intel CPU after 2008). It is expected that in v14.1 the faster code would be used in ⍳, so employing the circumlocutory (,⊂x)(8⌶)(,⊂y) in place of x⍳y is not recommended.
As mentioned above, we are working on implementing the latest ideas. The following benchmarks show the speedups that can be expected once the implementation is complete:
x←?1e6⍴2e9
y←?1.1e6⍴2e9
cmpx 'x f y' 'x⍳y'
x f y → 5.78E¯2  0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
x⍳y → 1.83E¯1  +216% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
cmpx 'y g x' 'y∊x'
y g x → 4.61E¯2  0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
y∊x → 1.74E¯1  +276% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
x←?1e6⍴2e6
y←?1.1e6⍴2e6
cmpx 'x f y' 'x⍳y'
x f y → 1.64E¯2  0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕
x⍳y → 7.62E¯2  +365% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
cmpx 'y g x' 'y∊x'
y g x → 1.06E¯2  0% ⎕⎕⎕⎕⎕⎕
y∊x → 7.50E¯2  +608% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
x←?1e6⍴4e6
y←?1.1e6⍴4e6
cmpx 'x f y' 'x⍳y'
x f y → 4.92E¯2  0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
x⍳y → 1.58E¯1  +221% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
cmpx 'y g x' 'y∊x'
y g x → 9.81E¯3  0% ⎕⎕⎕
y∊x → 1.50E¯1  +1426% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
The benchmarks also show that it is tricky to do benchmarks. Your faithful implementers are sneaky guys and will exploit properties of the data to effect faster computation. In this case, smallrange is exploited with the range being computed efficiently using vector instructions available in any nonancient CPU. To give you an idea of the details involved: ∊ on 2e9 uses hashing; ∊ on smallrange uses a table, a faster computation. The factor for 2e6 is smaller than the factor for 4e6 because the v14.0 implementation also uses a table for that range. The factor for 4e6 is bigger because the v14.0 implementation uses hashing where v14.1 would use a table, but without using more temporary space. I conjecture that in practice all integers arguments to the ⍳ family are small range, so that smallrangeness is a special case but not an unusual case. And on and on.
Stig's example is interesting and I am glad he brought it up. I have a suggestion regarding the handling of floats: unless you have reason to believe that ⎕ct is significant, try setting a local ⎕ct←0. As well, there is a possibility that ⍳ could incorporate the same efficient code that is already available in 8⌶. That is, there is the possibility that x xFindn y can be done using plain old x⍳y, with x⍳y being the faster. The following benchmark shows the upper bound on the amount of speedup:
xData←{(?⍵⍴100),a[?⍵⍴≢a←'Stig' 'Paul' 'Morten' 'John' 'Jay' 'Nick' 'Fi' 'Roger'],⍪0.1×?⍵⍴200}
xConvert←{t←↓[⎕IO]⍵ ⋄ t[1+⎕IO]←↑¨t[1+⎕IO] ⋄ t}
x←xData 1e6
y←xData 1.1e6
x1←xConvert x
y1←xConvert y
cmpx 'x1(8⌶)y1' 'x xFindn y'
x1(8⌶)y1 → 2.34E¯1  0% ⎕⎕⎕⎕⎕⎕⎕⎕
x xFindn y → 1.14E0  +388% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
Of course, until the time when ⍳ handles it internally, you have to account for the time needed for the conversion:
cmpx 'x1←xConvert x⊣y1←xConvert y'
5.03E¯1
I should point out that compared to x1 (the "inverted table" format), x is extravagantly profligate in its use of space:
⎕size 'x' 'x1'
128000040 15000160
Inefficient use of space leads directly to unavoidable inefficiencies in time. I will be in touch with Stig regarding additional details on the matrices in his application (not to twist his arm to change his data, but to get details on how best to enhance ⍳ on his data :).
Finally, the extension to ⍳ in v14.0 ("looking for rows") greatly improved it as a tool of thought, and the possibilities take time to sink in. Using ⍳ (and improving its implementation) for Stig's application is one example; the blog post A SpeedUp Story (http://www.dyalog.com/blog/2014/11/aspeedupstory2/) is another.
 RogerDyalog
 Posts: 238
 Joined: Thu Jul 28, 2011 10:53 am
Re: Wohoo 14
Thomas, sorry for the comment alignment ;!
I forgot to mention that xFindn is a matrix lookup that is used to find occurences of rows from matrix into another, both of same shape and datatypes column wise.
It was introduced when we had an example where we had to find unique rows in a matrix (i.e. mat {(↓⍺)⍳↓⍵} mat). This operation took 6+ hours(!) with {(↓⍺)⍳↓⍵} and was reduced to 5 sec. with the xFindn implementation.
I forgot to mention that xFindn is a matrix lookup that is used to find occurences of rows from matrix into another, both of same shape and datatypes column wise.
It was introduced when we had an example where we had to find unique rows in a matrix (i.e. mat {(↓⍺)⍳↓⍵} mat). This operation took 6+ hours(!) with {(↓⍺)⍳↓⍵} and was reduced to 5 sec. with the xFindn implementation.

stignielsen  Posts: 4
 Joined: Mon Sep 20, 2010 12:58 pm
Re: Wohoo 14
I note that xFindn makes assumptions that the interpreter can not make either with ⍳ or with {(↓⍺)⍳↓⍵}. Specifically, the problem is that xFindn assumes that the result is the same if you replace a floating point column c by c⍳c. The following example demonstrates that the results are not necessarily the same.
If ⎕ct≠0, ⍳ on multiple floating point columns is a difficult problem if it has to be fast. The difficulties are unavoidable from the way tolerance and ⍳ are defined  basically each value is not a single point but an interval of equality. If you know (and the interpreter can not know) that ⎕ct is not a factor, you should set ⎕ct←0.
8⌶ ignores the current value of ⎕ct and proceeds as if ⎕ct=0.
⎕ct
1E¯14
x←1+(0.5×⎕ct)×⍉4 4⊤⍳16
y←⍉↑⍳⍨¨↓⍉x
t←x ⋄ t[;0]←t[;0]⍳t[;0] ⋄ t[;1]←t[;1]⍳t[;1]
y ≡ t
1
x⍳x
0 0 0 1 0 0 0 1 0 0 0 1 4 4 4 5
y⍳y
0 0 0 3 0 0 0 3 0 0 0 3 12 12 12 15
If ⎕ct≠0, ⍳ on multiple floating point columns is a difficult problem if it has to be fast. The difficulties are unavoidable from the way tolerance and ⍳ are defined  basically each value is not a single point but an interval of equality. If you know (and the interpreter can not know) that ⎕ct is not a factor, you should set ⎕ct←0.
8⌶ ignores the current value of ⎕ct and proceeds as if ⎕ct=0.
 RogerDyalog
 Posts: 238
 Joined: Thu Jul 28, 2011 10:53 am
Re: Wohoo 14
In the 14.1 beta ⍳ has special code for relational tables, the arguments accepted by Stig's xFindn function. Relational tables are matrices in which items of a column have the same type.
First, some "random" data.
Since the data in this case contains floating point numbers, the special code requires nontolerant comparisons. (xFindn also requires nontolerant comparisons to work correctly. See the previous message in this thread.)
Now the benchmarks. The second benchmark shows that ⍳ exploits "selfies", the case where the left and right arguments are "the same". x is the same as x (of course); x is not "the same" as (⍴x)⍴x. Checking whether the arguments are "the same" is instantaneous for the interpreter.
The Fine Print:
• As stated above, if the data contains floating point numbers, ⎕ct←0 is required for the special code to be invoked. (If ⎕ct≠0, x⍳y would still be correct, but slower.)
• The items of a column need not have the same length, but x⍳y would be a bit faster if they do have the same length (as above).
• The special code uses the CRC32 instruction (http://en.wikipedia.org/wiki/SSE4#SSE4.2) if available; it would be available on a CPU newer than 2008. x⍳y would be a bit slower otherwise.
Finally, if you have a choice, storing the data as inverted tables is more efficient both in space and in time. 8⌶ computes indexof on inverted tables and has been available since 14.0.
First, some "random" data.
 Code: Select all
u←(?m⍴5e5),(↓⎕A[?(m,10)⍴26]),(↓⎕AV[?(m,5)⍴256]),⍪0.01×?m⍴20000
x←u[?1e6⍴≢u;]
y←u[?1e6⍴≢u;]
Since the data in this case contains floating point numbers, the special code requires nontolerant comparisons. (xFindn also requires nontolerant comparisons to work correctly. See the previous message in this thread.)
 Code: Select all
⎕ct←0
Now the benchmarks. The second benchmark shows that ⍳ exploits "selfies", the case where the left and right arguments are "the same". x is the same as x (of course); x is not "the same" as (⍴x)⍴x. Checking whether the arguments are "the same" is instantaneous for the interpreter.
 Code: Select all
cmpx 'x⍳y' 'x xFindn y'
x⍳y → 9.69E¯1  0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
x xFindn y → 2.39E0  +146% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
cmpx 'x⍳x' 'x xFindn x'
x⍳x → 4.77E¯1  0% ⎕⎕⎕⎕⎕⎕⎕⎕
x xFindn x → 2.32E0  +387% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
The Fine Print:
• As stated above, if the data contains floating point numbers, ⎕ct←0 is required for the special code to be invoked. (If ⎕ct≠0, x⍳y would still be correct, but slower.)
• The items of a column need not have the same length, but x⍳y would be a bit faster if they do have the same length (as above).
• The special code uses the CRC32 instruction (http://en.wikipedia.org/wiki/SSE4#SSE4.2) if available; it would be available on a CPU newer than 2008. x⍳y would be a bit slower otherwise.
Finally, if you have a choice, storing the data as inverted tables is more efficient both in space and in time. 8⌶ computes indexof on inverted tables and has been available since 14.0.
 Code: Select all
xi←↑¨↓[1] x ⍝ the equivalent inverted tables
yi←↑¨↓[1] y
⎕size 'x' 'xi' 'y' 'yi'
190378352 32000208 190378512 32000208
cmpx 'x⍳y' 'xi(8⌶)yi'
x⍳y → 9.53E¯1  0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
xi(8⌶)yi → 3.25E¯1  66% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
cmpx 'x⍳x' 'xi(8⌶)xi'
x⍳x → 4.71E¯1  0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
xi(8⌶)xi → 1.59E¯1  67% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
 RogerDyalog
 Posts: 238
 Joined: Thu Jul 28, 2011 10:53 am
14 posts
• Page 2 of 2 • 1, 2
Who is online
Users browsing this forum: No registered users and 1 guest
Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group