The Different Flavors of Parallelism

Chris Rackauckas
September 22nd, 2019

Now that you are aware of the basics of parallel computing, let's give a high-level overview of the differences between the various modes of parallelism.

Lowest Level: SIMD

Recall SIMD, the idea that a processor can run a single instruction over multiple, specially structured pieces of data simultaneously: "Single Instruction Multiple Data". SIMD is parallelism within a single core.

How to do SIMD

The simplest way to get SIMD is to make sure that your values are properly aligned in memory. If they are, then LLVM's autovectorizer pass has a good chance of vectorizing your code automatically. (In the world of computing, "SIMD" is synonymous with vectorization, since instead of computing on scalar values it computes on small vectors. That is not to be confused with "vectorization" in the sense of Python/R/MATLAB, which is a programming style that prefers using C-defined primitive functions, like broadcast or matrix multiplication.)

You can check for auto-vectorization inside of the LLVM IR by looking for statements like:

%wide.load24 = load <4 x double>, <4 x double> addrspace(13)* %46, align 8
; └
; ┌ @ float.jl:395 within `+'
%47 = fadd <4 x double> %wide.load, %wide.load24

which means that 4 additions are happening simultaneously. The amount of vectorization is heavily dependent on your architecture. The ancient form of SIMD, the SSE(2) instructions, required that your data was aligned. Now there's a bit more leeway, but it generally still holds that the data you're trying to SIMD over should be aligned. Thus there can be major performance differences between computing with a struct-of-arrays format instead of an array-of-structs format. For example:

struct MyComplex
  real::Float64
  imag::Float64
end
arr = [MyComplex(rand(),rand()) for i in 1:100]
100-element Array{Main.WeaveSandBox1.MyComplex,1}:
 Main.WeaveSandBox1.MyComplex(0.48266109238745125, 0.4641701720870326) 
 Main.WeaveSandBox1.MyComplex(0.8745325193759172, 0.70955531369546)    
 Main.WeaveSandBox1.MyComplex(0.8216949483533402, 0.501030229896035)   
 Main.WeaveSandBox1.MyComplex(0.3004513165491971, 0.6149428769185861)  
 Main.WeaveSandBox1.MyComplex(0.7757706134223445, 0.8903444717887914)  
 Main.WeaveSandBox1.MyComplex(0.22166752052387517, 0.37583602256259563)
 Main.WeaveSandBox1.MyComplex(0.5639603327796399, 0.1358179576798808)  
 Main.WeaveSandBox1.MyComplex(0.3081067883083548, 0.954313795827268)   
 Main.WeaveSandBox1.MyComplex(0.9846207657411097, 0.04822752674160746) 
 Main.WeaveSandBox1.MyComplex(0.7597281208423328, 0.4513281855208291)  
 ⋮                                                                     
 Main.WeaveSandBox1.MyComplex(0.25452918938701696, 0.9947643801570574) 
 Main.WeaveSandBox1.MyComplex(0.677046799254889, 0.33480360843450874)  
 Main.WeaveSandBox1.MyComplex(0.7476673689452558, 0.9329171870836386)  
 Main.WeaveSandBox1.MyComplex(0.2889574979354079, 0.6338728028483844)  
 Main.WeaveSandBox1.MyComplex(0.03434226397997442, 0.7936725650719132) 
 Main.WeaveSandBox1.MyComplex(0.07815446633676415, 0.3836054453624207) 
 Main.WeaveSandBox1.MyComplex(0.8645465183809189, 0.323115769622738)   
 Main.WeaveSandBox1.MyComplex(0.04150804376850292, 0.8501391565472747) 
 Main.WeaveSandBox1.MyComplex(0.42762296024914925, 0.5067909267064767)

is represented in memory as

[real1,imag1,real2,imag2,...]

while the struct-of-arrays format is

struct MyComplexes
  real::Vector{Float64}
  imag::Vector{Float64}
end
arr2 = MyComplexes(rand(100),rand(100))
Main.WeaveSandBox1.MyComplexes([0.737997, 0.637227, 0.80643, 0.226203, 0.36
3984, 0.821431, 0.509467, 0.308366, 0.138407, 0.0460029  …  0.313719, 0.910
71, 0.16531, 0.899056, 0.925093, 0.93055, 0.5059, 0.892234, 0.998142, 0.390
089], [0.248891, 0.659875, 0.300714, 0.736982, 0.608866, 0.989186, 0.459049
, 0.12227, 0.690812, 0.733272  …  0.888037, 0.592105, 0.132248, 0.992864, 0
.343688, 0.857534, 0.174047, 0.91854, 0.849392, 0.953488])

Now let's check what happens when we perform a reduction:

using InteractiveUtils
Base.:+(x::MyComplex,y::MyComplex) = MyComplex(x.real+y.real,x.imag+y.imag)
Base.:/(x::MyComplex,y::Int) = MyComplex(x.real/y,x.imag/y)
average(x::Vector{MyComplex}) = sum(x)/length(x)
@code_llvm average(arr)
;  @ none:1 within `average'
; Function Attrs: uwtable
define void @julia_average_14440({ double, double }* noalias nocapture sret
, %jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) #0 {
top:
  %2 = alloca %jl_value_t addrspace(10)*, i32 4
  %3 = alloca <2 x double>, align 16
  %tmpcast = bitcast <2 x double>* %3 to { double, double }*
; ┌ @ reducedim.jl:648 within `sum'
; │┌ @ reducedim.jl:648 within `#sum#550'
; ││┌ @ reducedim.jl:652 within `_sum' @ reducedim.jl:653
; │││┌ @ reducedim.jl:304 within `mapreduce'
; ││││┌ @ reducedim.jl:304 within `#mapreduce#548'
; │││││┌ @ reducedim.jl:308 within `_mapreduce_dim'
; ││││││┌ @ reduce.jl:302 within `_mapreduce'
; │││││││┌ @ indices.jl:426 within `Type'
; ││││││││┌ @ abstractarray.jl:75 within `axes'
; │││││││││┌ @ array.jl:155 within `size'
            %4 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t
 addrspace(11)*
            %5 = bitcast %jl_value_t addrspace(11)* %4 to %jl_value_t addrs
pace(10)* addrspace(11)*
            %6 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_val
ue_t addrspace(10)* addrspace(11)* %5, i64 3
            %7 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %6 to i6
4 addrspace(11)*
            %8 = load i64, i64 addrspace(11)* %7, align 8
; │││││││││└
; │││││││││┌ @ tuple.jl:165 within `map'
; ││││││││││┌ @ range.jl:317 within `Type' @ range.jl:308
; │││││││││││┌ @ promotion.jl:414 within `max'
              %9 = icmp sgt i64 %8, 0
; │││││││└└└└└
; │││││││ @ reduce.jl:304 within `_mapreduce'
         br i1 %9, label %L11, label %L9

L9:                                               ; preds = %top
; │││││││ @ reduce.jl:305 within `_mapreduce'
         %10 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrsp
ace(10)** %2, i32 0
         store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* intto
ptr (i64 128224816 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_val
ue_t addrspace(10)** %10
         %11 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrsp
ace(10)** %2, i32 1
         store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* intto
ptr (i64 111391024 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_val
ue_t addrspace(10)** %11
         %12 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrsp
ace(10)** %2, i32 2
         store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* intto
ptr (i64 129609744 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_val
ue_t addrspace(10)** %12
         %13 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrsp
ace(10)** %2, i32 3
         store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* intto
ptr (i64 268525232 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_val
ue_t addrspace(10)** %13
         %14 = call nonnull %jl_value_t addrspace(10)* @jl_invoke(%jl_value
_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 128225872 to %j
l_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %2,
 i32 4)
         call void @llvm.trap()
         unreachable

L11:                                              ; preds = %top
; │││││││ @ reduce.jl:306 within `_mapreduce'
; │││││││┌ @ promotion.jl:403 within `=='
          %15 = icmp eq i64 %8, 1
; │││││││└
         br i1 %15, label %L13, label %L15

L13:                                              ; preds = %L11
; │││││││ @ reduce.jl:307 within `_mapreduce'
; │││││││┌ @ array.jl:729 within `getindex'
          %16 = bitcast %jl_value_t addrspace(11)* %4 to <2 x double> addrs
pace(13)* addrspace(11)*
          %17 = load <2 x double> addrspace(13)*, <2 x double> addrspace(13
)* addrspace(11)* %16, align 8
          %18 = load <2 x double>, <2 x double> addrspace(13)* %17, align 8
; │││││││└
; │││││││ @ reduce.jl:308 within `_mapreduce'
         br label %L47

L15:                                              ; preds = %L11
; │││││││ @ reduce.jl:309 within `_mapreduce'
; │││││││┌ @ int.jl:49 within `<'
          %19 = icmp sgt i64 %8, 15
; │││││││└
         br i1 %19, label %L43, label %L17

L17:                                              ; preds = %L15
; │││││││ @ reduce.jl:311 within `_mapreduce'
; │││││││┌ @ array.jl:729 within `getindex'
          %20 = bitcast %jl_value_t addrspace(11)* %4 to { double, double }
 addrspace(13)* addrspace(11)*
          %21 = load { double, double } addrspace(13)*, { double, double } 
addrspace(13)* addrspace(11)* %20, align 8
          %22 = bitcast { double, double } addrspace(13)* %21 to <2 x doubl
e> addrspace(13)*
          %23 = load <2 x double>, <2 x double> addrspace(13)* %22, align 8
; │││││││└
; │││││││ @ reduce.jl:312 within `_mapreduce'
; │││││││┌ @ array.jl:729 within `getindex'
          %.elt63 = getelementptr inbounds { double, double }, { double, do
uble } addrspace(13)* %21, i64 1, i32 0
          %24 = bitcast double addrspace(13)* %.elt63 to <2 x double> addrs
pace(13)*
          %25 = load <2 x double>, <2 x double> addrspace(13)* %24, align 8
; │││││││└
; │││││││ @ reduce.jl:313 within `_mapreduce'
; │││││││┌ @ reduce.jl:21 within `add_sum'
; ││││││││┌ @ none:1 within `+' @ float.jl:395
           %26 = fadd <2 x double> %23, %25
; │││││││└└
; │││││││ @ reduce.jl:314 within `_mapreduce'
; │││││││┌ @ int.jl:49 within `<'
          %27 = icmp sgt i64 %8, 2
; │││││││└
         br i1 %27, label %L34.lr.ph, label %L47

L34.lr.ph:                                        ; preds = %L17
         br label %L34

L34:                                              ; preds = %L34.lr.ph, %L3
4
         %value_phi73 = phi i64 [ 2, %L34.lr.ph ], [ %29, %L34 ]
         %28 = phi <2 x double> [ %26, %L34.lr.ph ], [ %32, %L34 ]
; │││││││ @ reduce.jl:315 within `_mapreduce'
; │││││││┌ @ int.jl:53 within `+'
          %29 = add nuw nsw i64 %value_phi73, 1
; │││││││└
; │││││││┌ @ array.jl:729 within `getindex'
          %.elt67 = getelementptr inbounds { double, double }, { double, do
uble } addrspace(13)* %21, i64 %value_phi73, i32 0
          %30 = bitcast double addrspace(13)* %.elt67 to <2 x double> addrs
pace(13)*
          %31 = load <2 x double>, <2 x double> addrspace(13)* %30, align 8
; │││││││└
; │││││││ @ reduce.jl:316 within `_mapreduce'
; │││││││┌ @ reduce.jl:21 within `add_sum'
; ││││││││┌ @ none:1 within `+' @ float.jl:395
           %32 = fadd <2 x double> %28, %31
; │││││││└└
; │││││││ @ reduce.jl:314 within `_mapreduce'
; │││││││┌ @ int.jl:49 within `<'
          %33 = icmp ult i64 %29, %8
; │││││││└
         br i1 %33, label %L34, label %L47

L43:                                              ; preds = %L15
; │││││││ @ reduce.jl:320 within `_mapreduce'
; │││││││┌ @ reduce.jl:178 within `mapreduce_impl'
          call void @julia_mapreduce_impl_14441({ double, double }* noalias
 nocapture nonnull sret %tmpcast, %jl_value_t addrspace(10)* nonnull %1, i6
4 1, i64 %8, i64 1024)
; │││││││└
         %34 = load <2 x double>, <2 x double>* %3, align 16
         br label %L47

L47:                                              ; preds = %L34, %L17, %L4
3, %L13
         %35 = phi <2 x double> [ %34, %L43 ], [ %18, %L13 ], [ %26, %L17 ]
, [ %32, %L34 ]
; └└└└└└└
; ┌ @ array.jl:199 within `length'
   %36 = bitcast %jl_value_t addrspace(11)* %4 to %jl_array_t addrspace(11)
*
   %37 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %36
, i64 0, i32 1
   %38 = load i64, i64 addrspace(11)* %37, align 8
; └
; ┌ @ none:1 within `/' @ promotion.jl:316
; │┌ @ promotion.jl:284 within `promote'
; ││┌ @ promotion.jl:261 within `_promote'
; │││┌ @ number.jl:7 within `convert'
; ││││┌ @ float.jl:60 within `Type'
       %39 = sitofp i64 %38 to double
; │└└└└
; │ @ none:1 within `/' @ promotion.jl:316 @ float.jl:401
   %40 = insertelement <2 x double> undef, double %39, i32 0
   %41 = shufflevector <2 x double> %40, <2 x double> undef, <2 x i32> zero
initializer
   %42 = fdiv <2 x double> %35, %41
; └
  %43 = bitcast { double, double }* %0 to <2 x double>*
  store <2 x double> %42, <2 x double>* %43, align 8
  ret void
}

vs

average(x::MyComplexes) = MyComplex(sum(x.real),sum(x.imag))/length(x.real)
@code_llvm average(arr2)
;  @ none:1 within `average'
; Function Attrs: uwtable
define void @julia_average_14443({ double, double }* noalias nocapture sret
, %jl_value_t addrspace(10)* nonnull align 8 dereferenceable(16)) #0 {
top:
; ┌ @ sysimg.jl:18 within `getproperty'
   %2 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t addrspac
e(11)*
   %3 = bitcast %jl_value_t addrspace(11)* %2 to %jl_value_t addrspace(10)*
 addrspace(11)*
   %4 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspa
ce(11)* %3, align 8
; └
; ┌ @ reducedim.jl:648 within `sum'
; │┌ @ reducedim.jl:648 within `#sum#550'
; ││┌ @ reducedim.jl:652 within `_sum' @ reducedim.jl:653
; │││┌ @ reducedim.jl:304 within `mapreduce'
; ││││┌ @ reducedim.jl:304 within `#mapreduce#548'
; │││││┌ @ reducedim.jl:308 within `_mapreduce_dim'
; ││││││┌ @ reduce.jl:302 within `_mapreduce'
; │││││││┌ @ indices.jl:426 within `Type'
; ││││││││┌ @ abstractarray.jl:75 within `axes'
; │││││││││┌ @ array.jl:155 within `size'
            %5 = addrspacecast %jl_value_t addrspace(10)* %4 to %jl_value_t
 addrspace(11)*
            %6 = bitcast %jl_value_t addrspace(11)* %5 to %jl_value_t addrs
pace(10)* addrspace(11)*
            %7 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_val
ue_t addrspace(10)* addrspace(11)* %6, i64 3
            %8 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %7 to i6
4 addrspace(11)*
            %9 = load i64, i64 addrspace(11)* %8, align 8
; │││││││││└
; │││││││││┌ @ tuple.jl:165 within `map'
; ││││││││││┌ @ range.jl:317 within `Type' @ range.jl:308
; │││││││││││┌ @ promotion.jl:414 within `max'
              %10 = icmp sgt i64 %9, 0
; │││││││└└└└└
; │││││││ @ reduce.jl:304 within `_mapreduce'
         br i1 %10, label %L11, label %L43

L11:                                              ; preds = %top
; │││││││ @ reduce.jl:306 within `_mapreduce'
; │││││││┌ @ promotion.jl:403 within `=='
          %11 = icmp eq i64 %9, 1
; │││││││└
         br i1 %11, label %L13, label %L15

L13:                                              ; preds = %L11
; │││││││ @ reduce.jl:307 within `_mapreduce'
; │││││││┌ @ array.jl:729 within `getindex'
          %12 = bitcast %jl_value_t addrspace(11)* %5 to double addrspace(1
3)* addrspace(11)*
          %13 = load double addrspace(13)*, double addrspace(13)* addrspace
(11)* %12, align 8
          %14 = load double, double addrspace(13)* %13, align 8
; │││││││└
; │││││││ @ reduce.jl:308 within `_mapreduce'
         br label %L43

L15:                                              ; preds = %L11
; │││││││ @ reduce.jl:309 within `_mapreduce'
; │││││││┌ @ int.jl:49 within `<'
          %15 = icmp sgt i64 %9, 15
; │││││││└
         br i1 %15, label %L31, label %L17

L17:                                              ; preds = %L15
; │││││││ @ reduce.jl:311 within `_mapreduce'
; │││││││┌ @ array.jl:729 within `getindex'
          %16 = bitcast %jl_value_t addrspace(11)* %5 to double addrspace(1
3)* addrspace(11)*
          %17 = load double addrspace(13)*, double addrspace(13)* addrspace
(11)* %16, align 8
          %18 = load double, double addrspace(13)* %17, align 8
; │││││││└
; │││││││ @ reduce.jl:312 within `_mapreduce'
; │││││││┌ @ array.jl:729 within `getindex'
          %19 = getelementptr inbounds double, double addrspace(13)* %17, i
64 1
          %20 = load double, double addrspace(13)* %19, align 8
; │││││││└
; │││││││ @ reduce.jl:313 within `_mapreduce'
; │││││││┌ @ reduce.jl:24 within `add_sum'
; ││││││││┌ @ float.jl:395 within `+'
           %21 = fadd double %18, %20
; │││││││└└
; │││││││ @ reduce.jl:314 within `_mapreduce'
; │││││││┌ @ int.jl:49 within `<'
          %22 = icmp sgt i64 %9, 2
; │││││││└
         br i1 %22, label %L26.lr.ph, label %L43

L26.lr.ph:                                        ; preds = %L17
         br label %L26

L26:                                              ; preds = %L26.lr.ph, %L2
6
         %value_phi512 = phi double [ %21, %L26.lr.ph ], [ %26, %L26 ]
         %value_phi411 = phi i64 [ 2, %L26.lr.ph ], [ %23, %L26 ]
; │││││││ @ reduce.jl:315 within `_mapreduce'
; │││││││┌ @ int.jl:53 within `+'
          %23 = add nuw nsw i64 %value_phi411, 1
; │││││││└
; │││││││┌ @ array.jl:729 within `getindex'
          %24 = getelementptr inbounds double, double addrspace(13)* %17, i
64 %value_phi411
          %25 = load double, double addrspace(13)* %24, align 8
; │││││││└
; │││││││ @ reduce.jl:316 within `_mapreduce'
; │││││││┌ @ reduce.jl:24 within `add_sum'
; ││││││││┌ @ float.jl:395 within `+'
           %26 = fadd double %value_phi512, %25
; │││││││└└
; │││││││ @ reduce.jl:314 within `_mapreduce'
; │││││││┌ @ int.jl:49 within `<'
          %27 = icmp ult i64 %23, %9
; │││││││└
         br i1 %27, label %L26, label %L43

L31:                                              ; preds = %L15
; │││││││ @ reduce.jl:320 within `_mapreduce'
; │││││││┌ @ reduce.jl:178 within `mapreduce_impl'
          %28 = call double @julia_mapreduce_impl_13158(%jl_value_t addrspa
ce(10)* nonnull %4, i64 1, i64 %9, i64 1024)
; │││││││└
         br label %L43

L43:                                              ; preds = %L26, %L17, %L1
3, %L31, %top
         %value_phi = phi double [ %14, %L13 ], [ %28, %L31 ], [ 0.000000e+
00, %top ], [ %21, %L17 ], [ %26, %L26 ]
; └└└└└└└
; ┌ @ sysimg.jl:18 within `getproperty'
   %29 = bitcast %jl_value_t addrspace(11)* %2 to i8 addrspace(11)*
   %30 = getelementptr inbounds i8, i8 addrspace(11)* %29, i64 8
   %31 = bitcast i8 addrspace(11)* %30 to %jl_value_t addrspace(10)* addrsp
ace(11)*
   %32 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrsp
ace(11)* %31, align 8
; └
; ┌ @ reducedim.jl:648 within `sum'
; │┌ @ reducedim.jl:648 within `#sum#550'
; ││┌ @ reducedim.jl:652 within `_sum' @ reducedim.jl:653
; │││┌ @ reducedim.jl:304 within `mapreduce'
; ││││┌ @ reducedim.jl:304 within `#mapreduce#548'
; │││││┌ @ reducedim.jl:308 within `_mapreduce_dim'
; ││││││┌ @ reduce.jl:302 within `_mapreduce'
; │││││││┌ @ indices.jl:426 within `Type'
; ││││││││┌ @ abstractarray.jl:75 within `axes'
; │││││││││┌ @ array.jl:155 within `size'
            %33 = addrspacecast %jl_value_t addrspace(10)* %32 to %jl_value
_t addrspace(11)*
            %34 = bitcast %jl_value_t addrspace(11)* %33 to %jl_value_t add
rspace(10)* addrspace(11)*
            %35 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_va
lue_t addrspace(10)* addrspace(11)* %34, i64 3
            %36 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %35 to 
i64 addrspace(11)*
            %37 = load i64, i64 addrspace(11)* %36, align 8
; │││││││││└
; │││││││││┌ @ tuple.jl:165 within `map'
; ││││││││││┌ @ range.jl:317 within `Type' @ range.jl:308
; │││││││││││┌ @ promotion.jl:414 within `max'
              %38 = icmp sgt i64 %37, 0
; │││││││└└└└└
; │││││││ @ reduce.jl:304 within `_mapreduce'
         br i1 %38, label %L53, label %L85

L53:                                              ; preds = %L43
; │││││││ @ reduce.jl:306 within `_mapreduce'
; │││││││┌ @ promotion.jl:403 within `=='
          %39 = icmp eq i64 %37, 1
; │││││││└
         br i1 %39, label %L55, label %L57

L55:                                              ; preds = %L53
; │││││││ @ reduce.jl:307 within `_mapreduce'
; │││││││┌ @ array.jl:729 within `getindex'
          %40 = bitcast %jl_value_t addrspace(11)* %33 to double addrspace(
13)* addrspace(11)*
          %41 = load double addrspace(13)*, double addrspace(13)* addrspace
(11)* %40, align 8
          %42 = load double, double addrspace(13)* %41, align 8
; │││││││└
; │││││││ @ reduce.jl:308 within `_mapreduce'
         br label %L85

L57:                                              ; preds = %L53
; │││││││ @ reduce.jl:309 within `_mapreduce'
; │││││││┌ @ int.jl:49 within `<'
          %43 = icmp sgt i64 %37, 15
; │││││││└
         br i1 %43, label %L73, label %L59

L59:                                              ; preds = %L57
; │││││││ @ reduce.jl:311 within `_mapreduce'
; │││││││┌ @ array.jl:729 within `getindex'
          %44 = bitcast %jl_value_t addrspace(11)* %33 to double addrspace(
13)* addrspace(11)*
          %45 = load double addrspace(13)*, double addrspace(13)* addrspace
(11)* %44, align 8
          %46 = load double, double addrspace(13)* %45, align 8
; │││││││└
; │││││││ @ reduce.jl:312 within `_mapreduce'
; │││││││┌ @ array.jl:729 within `getindex'
          %47 = getelementptr inbounds double, double addrspace(13)* %45, i
64 1
          %48 = load double, double addrspace(13)* %47, align 8
; │││││││└
; │││││││ @ reduce.jl:313 within `_mapreduce'
; │││││││┌ @ reduce.jl:24 within `add_sum'
; ││││││││┌ @ float.jl:395 within `+'
           %49 = fadd double %46, %48
; │││││││└└
; │││││││ @ reduce.jl:314 within `_mapreduce'
; │││││││┌ @ int.jl:49 within `<'
          %50 = icmp sgt i64 %37, 2
; │││││││└
         br i1 %50, label %L68.lr.ph, label %L85

L68.lr.ph:                                        ; preds = %L59
         br label %L68

L68:                                              ; preds = %L68.lr.ph, %L6
8
         %value_phi310 = phi double [ %49, %L68.lr.ph ], [ %54, %L68 ]
         %value_phi29 = phi i64 [ 2, %L68.lr.ph ], [ %51, %L68 ]
; │││││││ @ reduce.jl:315 within `_mapreduce'
; │││││││┌ @ int.jl:53 within `+'
          %51 = add nuw nsw i64 %value_phi29, 1
; │││││││└
; │││││││┌ @ array.jl:729 within `getindex'
          %52 = getelementptr inbounds double, double addrspace(13)* %45, i
64 %value_phi29
          %53 = load double, double addrspace(13)* %52, align 8
; │││││││└
; │││││││ @ reduce.jl:316 within `_mapreduce'
; │││││││┌ @ reduce.jl:24 within `add_sum'
; ││││││││┌ @ float.jl:395 within `+'
           %54 = fadd double %value_phi310, %53
; │││││││└└
; │││││││ @ reduce.jl:314 within `_mapreduce'
; │││││││┌ @ int.jl:49 within `<'
          %55 = icmp ult i64 %51, %37
; │││││││└
         br i1 %55, label %L68, label %L85

L73:                                              ; preds = %L57
; │││││││ @ reduce.jl:320 within `_mapreduce'
; │││││││┌ @ reduce.jl:178 within `mapreduce_impl'
          %56 = call double @julia_mapreduce_impl_13158(%jl_value_t addrspa
ce(10)* nonnull %32, i64 1, i64 %37, i64 1024)
; │││││││└
         br label %L85

L85:                                              ; preds = %L68, %L59, %L5
5, %L73, %L43
         %value_phi1 = phi double [ %42, %L55 ], [ %56, %L73 ], [ 0.000000e
+00, %L43 ], [ %49, %L59 ], [ %54, %L68 ]
; └└└└└└└
; ┌ @ sysimg.jl:18 within `getproperty'
   %57 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrsp
ace(11)* %3, align 8
; └
; ┌ @ array.jl:199 within `length'
   %58 = addrspacecast %jl_value_t addrspace(10)* %57 to %jl_value_t addrsp
ace(11)*
   %59 = bitcast %jl_value_t addrspace(11)* %58 to %jl_array_t addrspace(11
)*
   %60 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %59
, i64 0, i32 1
   %61 = load i64, i64 addrspace(11)* %60, align 8
; └
; ┌ @ none:1 within `/' @ promotion.jl:316
; │┌ @ promotion.jl:284 within `promote'
; ││┌ @ promotion.jl:261 within `_promote'
; │││┌ @ number.jl:7 within `convert'
; ││││┌ @ float.jl:60 within `Type'
       %62 = sitofp i64 %61 to double
; │└└└└
; │ @ none:1 within `/' @ promotion.jl:316 @ float.jl:401
   %63 = insertelement <2 x double> undef, double %value_phi, i32 0
   %64 = insertelement <2 x double> %63, double %value_phi1, i32 1
   %65 = insertelement <2 x double> undef, double %62, i32 0
   %66 = shufflevector <2 x double> %65, <2 x double> undef, <2 x i32> zero
initializer
   %67 = fdiv <2 x double> %64, %66
; └
  %68 = bitcast { double, double }* %0 to <2 x double>*
  store <2 x double> %67, <2 x double>* %68, align 8
  ret void
}
using BenchmarkTools
@btime average(arr)
102.406 ns (0 allocations: 0 bytes)
Main.WeaveSandBox1.MyComplex(0.4942370179088904, 0.497469289886528)
@btime average(arr2)
37.009 ns (0 allocations: 0 bytes)
Main.WeaveSandBox1.MyComplex(0.5295326789280854, 0.5268654152835516)

While this case is able to auto-vectorize (not all discontiguous calculations can), we can see that the in-memory structure matters for efficiency, and the struct-of-arrays formulation is better in this case. Note that this is because we are doing a column-wise operation, i.e. averaging the reals and the imaginary parts separately. If we were instead using each complex number in isolation, then the array-of-structs formulation would be better aligned to the calculation.

Note that we can explicitly ask for SIMD with the @simd macro, which asserts to the compiler that the loop iterations are safe to reorder and vectorize. For example:

function not_forced_simd(x)
  out = 0.0
  for i in 1:100
    out += sqrt(x[i].real^2 + x[i].imag^2)
  end
  out
end
@code_llvm not_forced_simd(arr)
;  @ none:2 within `not_forced_simd'
; Function Attrs: uwtable
define double @julia_not_forced_simd_14479(%jl_value_t addrspace(10)* nonnu
ll align 16 dereferenceable(40)) #0 {
top:
;  @ none:4 within `not_forced_simd'
; ┌ @ array.jl:729 within `getindex'
   %1 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspac
e(11)*
   %2 = bitcast %jl_value_t addrspace(11)* %1 to %jl_array_t addrspace(11)*
   %3 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %2, 
i64 0, i32 1
   %4 = load i64, i64 addrspace(11)* %3, align 8
   %5 = icmp eq i64 %4, 0
   br i1 %5, label %oob, label %idxend.lr.ph

idxend.lr.ph:                                     ; preds = %top
   %6 = bitcast %jl_value_t addrspace(11)* %1 to { double, double } addrspa
ce(13)* addrspace(11)*
   %7 = load { double, double } addrspace(13)*, { double, double } addrspac
e(13)* addrspace(11)* %6, align 8
   br label %idxend4

L29:                                              ; preds = %idxend4
; └
; ┌ @ range.jl:595 within `iterate'
; │┌ @ int.jl:53 within `+'
    %8 = add nuw nsw i64 %value_phi124, 1
; └└
; ┌ @ array.jl:729 within `getindex'
   %9 = icmp ult i64 %value_phi124, %4
   br i1 %9, label %idxend4, label %oob

L30:                                              ; preds = %idxend4
; └
;  @ none:6 within `not_forced_simd'
  ret double %20

oob:                                              ; preds = %L29, %top
  %value_phi1.lcssa = phi i64 [ 1, %top ], [ %8, %L29 ]
;  @ none:4 within `not_forced_simd'
; ┌ @ array.jl:729 within `getindex'
   %10 = alloca i64, align 8
   store i64 %value_phi1.lcssa, i64* %10, align 8
   %11 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspa
ce(12)*
   call void @jl_bounds_error_ints(%jl_value_t addrspace(12)* %11, i64* non
null %10, i64 1)
   unreachable

idxend4:                                          ; preds = %L29, %idxend.l
r.ph
   %12 = phi i64 [ 0, %idxend.lr.ph ], [ %value_phi124, %L29 ]
   %value_phi124 = phi i64 [ 1, %idxend.lr.ph ], [ %8, %L29 ]
   %value_phi23 = phi double [ 0.000000e+00, %idxend.lr.ph ], [ %20, %L29 ]
   %.elt = getelementptr inbounds { double, double }, { double, double } ad
drspace(13)* %7, i64 %12, i32 0
   %13 = bitcast double addrspace(13)* %.elt to <2 x double> addrspace(13)*
   %14 = load <2 x double>, <2 x double> addrspace(13)* %13, align 8
; └
; ┌ @ intfuncs.jl:243 within `literal_pow'
; │┌ @ float.jl:399 within `*'
    %15 = fmul <2 x double> %14, %14
; └└
; ┌ @ float.jl:395 within `+'
   %16 = extractelement <2 x double> %15, i32 0
   %17 = extractelement <2 x double> %15, i32 1
   %18 = fadd double %16, %17
; └
; ┌ @ math.jl:493 within `sqrt'
   %19 = call double @llvm.sqrt.f64(double %18)
; └
; ┌ @ float.jl:395 within `+'
   %20 = fadd double %value_phi23, %19
; └
; ┌ @ range.jl:594 within `iterate'
; │┌ @ promotion.jl:403 within `=='
    %21 = icmp eq i64 %value_phi124, 100
; └└
  br i1 %21, label %L30, label %L29
}
function forced_simd(x)
  out = 0.0
  @simd for i in 1:100
    out += sqrt(x[i].real^2 + x[i].imag^2)
  end
  out
end
@code_llvm forced_simd(arr)
;  @ none:2 within `forced_simd'
; Function Attrs: uwtable
define double @julia_forced_simd_14480(%jl_value_t addrspace(10)* nonnull a
lign 16 dereferenceable(40)) #0 {
top:
  %1 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace
(11)*
  %2 = bitcast %jl_value_t addrspace(11)* %1 to %jl_array_t addrspace(11)*
  %3 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %2, i
64 0, i32 1
  %4 = load i64, i64 addrspace(11)* %3, align 8
  %5 = bitcast %jl_value_t addrspace(11)* %1 to { double, double } addrspac
e(13)* addrspace(11)*
  %6 = load { double, double } addrspace(13)*, { double, double } addrspace
(13)* addrspace(11)* %5, align 8
;  @ none:3 within `forced_simd'
; ┌ @ simdloop.jl:71 within `macro expansion'
   br label %L8

L8:                                               ; preds = %top, %idxend4
   %value_phi124 = phi i64 [ 0, %top ], [ %7, %idxend4 ]
   %value_phi23 = phi double [ 0.000000e+00, %top ], [ %15, %idxend4 ]
; │ @ simdloop.jl:72 within `macro expansion'
; │┌ @ simdloop.jl:50 within `simd_index'
; ││┌ @ range.jl:615 within `getindex'
; │││┌ @ int.jl:53 within `+'
      %7 = add nuw nsw i64 %value_phi124, 1
; │└└└
; │ @ simdloop.jl:73 within `macro expansion' @ none:4
; │┌ @ array.jl:729 within `getindex'
    %8 = icmp ult i64 %value_phi124, %4
    br i1 %8, label %idxend4, label %oob

L45:                                              ; preds = %idxend4
; └└
;  @ none:6 within `forced_simd'
  ret double %15

oob:                                              ; preds = %L8
;  @ none:3 within `forced_simd'
; ┌ @ simdloop.jl:73 within `macro expansion' @ none:4
; │┌ @ array.jl:729 within `getindex'
    %9 = alloca i64, align 8
    store i64 %7, i64* %9, align 8
    %10 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrsp
ace(12)*
    call void @jl_bounds_error_ints(%jl_value_t addrspace(12)* %10, i64* no
nnull %9, i64 1)
    unreachable

idxend4:                                          ; preds = %L8
    %.elt = getelementptr inbounds { double, double }, { double, double } a
ddrspace(13)* %6, i64 %value_phi124, i32 0
    %.unpack = load double, double addrspace(13)* %.elt, align 8
    %.elt18 = getelementptr inbounds { double, double }, { double, double }
 addrspace(13)* %6, i64 %value_phi124, i32 1
    %.unpack19 = load double, double addrspace(13)* %.elt18, align 8
; │└
; │┌ @ intfuncs.jl:243 within `literal_pow'
; ││┌ @ float.jl:399 within `*'
     %11 = fmul double %.unpack, %.unpack
     %12 = fmul double %.unpack19, %.unpack19
; │└└
; │┌ @ float.jl:395 within `+'
    %13 = fadd double %11, %12
; │└
; │┌ @ math.jl:493 within `sqrt'
    %14 = call double @llvm.sqrt.f64(double %13)
; │└
; │┌ @ float.jl:395 within `+'
    %15 = fadd fast double %value_phi23, %14
; │└
; │ @ simdloop.jl:71 within `macro expansion'
; │┌ @ int.jl:49 within `<'
    %16 = icmp ugt i64 %value_phi124, 98
; │└
   br i1 %16, label %L45, label %L8
; └
}
@btime not_forced_simd(arr)
214.012 ns (1 allocation: 16 bytes)
76.18745105933863
@btime forced_simd(arr)
213.624 ns (1 allocation: 16 bytes)
76.18745105933863

It's different code but not realistically any faster. This is partly because exploiting "more" SIMD is actually quite hard. This amount of parallelism tends to heat up the cores, and thus when the newest and widest SIMD instructions are used (AVX2, AVX512, etc.), the processor has to downclock, somewhat offsetting the performance gain of doing multiple calculations simultaneously. If enough aligned operations occur in a row this still gives a big speedup, but it means there's a cost to issuing only a small amount of very wide SIMD operations.

Note that in this case we are using the real and imaginary parts together, so let's test our theory that the array-of-structs formulation will be faster:

function SoA_forced_simd(x)
  out = 0.0
  @simd for i in 1:100
    out += sqrt(x.real[i]^2 + x.imag[i]^2)
  end
  out
end
@code_llvm SoA_forced_simd(arr2)
;  @ none:2 within `SoA_forced_simd'
; Function Attrs: uwtable
define double @julia_SoA_forced_simd_14512(%jl_value_t addrspace(10)* nonnu
ll align 8 dereferenceable(16)) #0 {
top:
  %gcframe = alloca %jl_value_t addrspace(10)*, i32 3
  %1 = bitcast %jl_value_t addrspace(10)** %gcframe to i8*
  call void @llvm.memset.p0i8.i32(i8* %1, i8 0, i32 24, i32 0, i1 false)
  %2 = call %jl_value_t*** inttoptr (i64 1801331392 to %jl_value_t*** ()*)(
) #7
  %3 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)*
* %gcframe, i32 0
  %4 = bitcast %jl_value_t addrspace(10)** %3 to i64*
  store i64 2, i64* %4
  %5 = getelementptr %jl_value_t**, %jl_value_t*** %2, i32 0
  %6 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)*
* %gcframe, i32 1
  %7 = bitcast %jl_value_t addrspace(10)** %6 to %jl_value_t***
  %8 = load %jl_value_t**, %jl_value_t*** %5
  store %jl_value_t** %8, %jl_value_t*** %7
  %9 = bitcast %jl_value_t*** %5 to %jl_value_t addrspace(10)***
  store %jl_value_t addrspace(10)** %gcframe, %jl_value_t addrspace(10)*** 
%9
  %10 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspac
e(11)*
  %11 = bitcast %jl_value_t addrspace(11)* %10 to %jl_value_t addrspace(10)
* addrspace(11)*
  %12 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspa
ce(11)* %11, align 8
  %13 = addrspacecast %jl_value_t addrspace(10)* %12 to %jl_value_t addrspa
ce(11)*
  %14 = bitcast %jl_value_t addrspace(11)* %13 to %jl_array_t addrspace(11)
*
  %15 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %14,
 i64 0, i32 1
  %16 = load i64, i64 addrspace(11)* %15, align 8
  %17 = bitcast %jl_value_t addrspace(11)* %13 to double addrspace(13)* add
rspace(11)*
  %18 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %1
7, align 8
  %19 = bitcast %jl_value_t addrspace(11)* %10 to i8 addrspace(11)*
  %20 = getelementptr inbounds i8, i8 addrspace(11)* %19, i64 8
  %21 = bitcast i8 addrspace(11)* %20 to %jl_value_t addrspace(10)* addrspa
ce(11)*
  %22 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspa
ce(11)* %21, align 8
  %23 = addrspacecast %jl_value_t addrspace(10)* %22 to %jl_value_t addrspa
ce(11)*
  %24 = bitcast %jl_value_t addrspace(11)* %23 to %jl_array_t addrspace(11)
*
  %25 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %24,
 i64 0, i32 1
  %26 = bitcast %jl_value_t addrspace(11)* %23 to double addrspace(13)* add
rspace(11)*
;  @ none:3 within `SoA_forced_simd'
; ┌ @ simdloop.jl:71 within `macro expansion'
   br label %L8

L8:                                               ; preds = %top, %idxend4
   %value_phi116 = phi i64 [ 0, %top ], [ %27, %idxend4 ]
   %value_phi15 = phi double [ 0.000000e+00, %top ], [ %49, %idxend4 ]
; │ @ simdloop.jl:72 within `macro expansion'
; │┌ @ simdloop.jl:50 within `simd_index'
; ││┌ @ range.jl:615 within `getindex'
; │││┌ @ int.jl:53 within `+'
      %27 = add nuw nsw i64 %value_phi116, 1
; │└└└
; │ @ simdloop.jl:73 within `macro expansion' @ none:4
; │┌ @ array.jl:729 within `getindex'
    %28 = icmp ult i64 %value_phi116, %16
    br i1 %28, label %idxend, label %oob

L45:                                              ; preds = %idxend4
; └└
;  @ none:6 within `SoA_forced_simd'
  %29 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)
** %gcframe, i32 1
  %30 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %29
  %31 = getelementptr %jl_value_t**, %jl_value_t*** %2, i32 0
  %32 = bitcast %jl_value_t*** %31 to %jl_value_t addrspace(10)**
  store %jl_value_t addrspace(10)* %30, %jl_value_t addrspace(10)** %32
  ret double %49

oob:                                              ; preds = %L8
;  @ none:3 within `SoA_forced_simd'
; ┌ @ simdloop.jl:73 within `macro expansion' @ none:4
; │┌ @ array.jl:729 within `getindex'
    %33 = alloca i64, align 8
    store i64 %27, i64* %33, align 8
    %34 = addrspacecast %jl_value_t addrspace(10)* %12 to %jl_value_t addrs
pace(12)*
    call void @jl_bounds_error_ints(%jl_value_t addrspace(12)* %34, i64* no
nnull %33, i64 1)
    unreachable

idxend:                                           ; preds = %L8
    %35 = load i64, i64 addrspace(11)* %25, align 8
    %36 = icmp ult i64 %value_phi116, %35
    br i1 %36, label %idxend4, label %oob3

oob3:                                             ; preds = %idxend
    %37 = alloca i64, align 8
    store i64 %27, i64* %37, align 8
    %38 = addrspacecast %jl_value_t addrspace(10)* %22 to %jl_value_t addrs
pace(12)*
    %39 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(1
0)** %gcframe, i32 2
    store %jl_value_t addrspace(10)* %22, %jl_value_t addrspace(10)** %39
    call void @jl_bounds_error_ints(%jl_value_t addrspace(12)* %38, i64* no
nnull %37, i64 1)
    unreachable

idxend4:                                          ; preds = %idxend
    %40 = getelementptr inbounds double, double addrspace(13)* %18, i64 %va
lue_phi116
    %41 = load double, double addrspace(13)* %40, align 8
    %42 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* 
%26, align 8
    %43 = getelementptr inbounds double, double addrspace(13)* %42, i64 %va
lue_phi116
    %44 = load double, double addrspace(13)* %43, align 8
; │└
; │┌ @ intfuncs.jl:243 within `literal_pow'
; ││┌ @ float.jl:399 within `*'
     %45 = fmul double %41, %41
     %46 = fmul double %44, %44
; │└└
; │┌ @ float.jl:395 within `+'
    %47 = fadd double %45, %46
; │└
; │┌ @ math.jl:493 within `sqrt'
    %48 = call double @llvm.sqrt.f64(double %47)
; │└
; │┌ @ float.jl:395 within `+'
    %49 = fadd fast double %value_phi15, %48
; │└
; │ @ simdloop.jl:71 within `macro expansion'
; │┌ @ int.jl:49 within `<'
    %50 = icmp ugt i64 %value_phi116, 98
; │└
   br i1 %50, label %L45, label %L8
; └
}
@btime SoA_forced_simd(arr2)
217.695 ns (1 allocation: 16 bytes)
79.9629568819235

Explicit SIMD

If you want to pack the vectors yourself, then primitives for doing so from within Julia are available in SIMD.jl. This is for "real" performance warriors.
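
For example, here is a minimal sketch (my own illustration, not from the original; it assumes the input length is a multiple of 4) of packing 4 Float64s at a time with SIMD.jl's Vec type:

using SIMD

function vsum(x::Vector{Float64})
  acc = Vec{4,Float64}(0.0)               # a hand-packed 4-wide register
  @inbounds for i in 1:4:length(x)
    acc += vload(Vec{4,Float64}, x, i)    # load 4 contiguous doubles
  end
  sum(acc)                                # horizontal reduction of the 4 lanes
end
vsum(rand(100))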

Summary of SIMD

  • Communication in SIMD is due to locality: if things are local, the processor can automatically set up the operations.

  • There's no real worry about "getting it wrong": you cannot overwrite pieces from different parts of the arithmetic unit, and if SIMD is unsafe then it just won't auto-vectorize.

  • Suitable for operations measured in ns.

Next Level Up: Multithreading

Last time we briefly went over multithreading and described how every process has multiple threads which share a single heap, and when multiple threads are executed simultaneously we have multithreaded parallelism. Note that you can have multiple threads which aren't executed simultaneously, as in the case of I/O operations; this is an example of concurrency without parallelism and is commonly achieved with green threads.
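
For instance, Julia's tasks give concurrency without parallelism; this sketch finishes in about 1 second, not 2, on a single thread:

t1 = @async (sleep(1); println("task 1 done"))
t2 = @async (sleep(1); println("task 2 done"))
wait(t1); wait(t2)  # the two sleeps overlap on one thread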

Last time we described a simple multithreaded program and noticed that multithreading has an overhead cost of around 50ns-100ns. This is due to the construction of the new stack (among other things) each time a new computational thread is spun up. This means that, unlike SIMD, some thought needs to be put into when to use multithreading: it's not always a good idea. The computation needs to be expensive enough to counter-balance the overhead.

One abstraction that was glossed over was the memory access style. Before, we were considering a single heap, i.e. a UMA (uniform memory access) style.

However, this is not the case for all shared memory devices. For example, compute nodes on an HPC tend to be "dual Xeon" or "quad Xeon", where each Xeon processor is itself a multicore processor. But each processor most quickly accesses its own local caches, and thus one has to be aware that this is set up in a NUMA (non-uniform memory access) manner:

where there is a cache that is closer to the processor and a cache that is further away. Care should be taken to localize the computation per thread, otherwise a cost associated with the memory sharing will be incurred (though all sharing will still be automatic).

In this sense, interthread communication is naturally done through the heap: if you want other threads to be able to touch a value, then you can simply place it on the heap and then it'll be available. We saw this last time by how overlapping computations can re-use the same heap-based caches, meaning that care needs to be taken with how one writes into a dynamically-allocated array.

A simple example that demonstrates this is:

using Base.Threads
acc = 0
@threads for i in 1:10_000
    global acc
    acc += 1
end
acc
3273

The reason for this behavior is that there is a difference between the reading and the writing step to a variable. Here, values are being read while other threads are writing, meaning that threads see a lower value than the one concurrently being written. The result is that the total summation is lower than the true value because of this clashing. We can prevent this by only allowing one thread to utilize the heap-allocated variable at a time. One abstraction for doing this is atomics:

acc = Atomic{Int64}(0)
@threads for i in 1:10_000
    atomic_add!(acc, 1)
end
acc
Base.Threads.Atomic{Int64}(10000)

When an atomic add is being done, all other threads wishing to do the same computation are blocked. This of course can have a massive effect on performance since atomic computations are not parallel.
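
If the contention from atomics is too costly, a common pattern (a sketch of my own, not part of the benchmarks below) is to give each thread a private accumulator and reduce serially afterwards, so no synchronization is needed inside the loop:

using Base.Threads
function threaded_count(n)
  accs = zeros(Int, nthreads())  # one private slot per thread
  @threads for i in 1:n
    accs[threadid()] += 1        # each thread only writes its own slot
  end
  sum(accs)                      # serial reduction at the end
end
threaded_count(10_000)           # 10000

(Adjacent slots can still share a cache line, i.e. false sharing, so padding the accumulator array can help further.)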

Julia also exposes a lower level of heap control in threading by using locks:

const acc_lock = Ref{Int64}(0)
const mlock = Mutex()
function f1()
    @threads for i in 1:10_000
        lock(mlock)
        acc_lock[] += 1
        unlock(mlock)
    end
end
const splock = SpinLock()
function f2()
    @threads for i in 1:10_000
        lock(splock)
        acc_lock[] += 1
        unlock(splock)
    end
end
const rsplock = RecursiveSpinLock()
function f3()
    @threads for i in 1:10_000
        lock(rsplock)
        acc_lock[] += 1
        unlock(rsplock)
    end
end
acc2 = Atomic{Int64}(0)
function g()
  @threads for i in 1:10_000
      atomic_add!(acc2, 1)
  end
end
const acc_s = Ref{Int64}(0)
function h()
  global acc_s
  for i in 1:10_000
      acc_s[] += 1
  end
end
@btime f1()
720.300 μs (0 allocations: 0 bytes)

Mutex is the most capable: just put locks and unlocks anywhere and it will work, but it is costly. SpinLock is non-reentrant, i.e. a thread that already holds the lock will block itself if it tries to lock again. It therefore has to be used with caution (every lock must be paired with exactly one unlock), but it's fast:

@btime f2()
318.200 μs (0 allocations: 0 bytes)

Additionally there is the undocumented RecursiveSpinLock, which allows a single thread to recursively lock and then unlock, though it can run into issues if multiple threads try to lock recursively. But it only has a small overhead:

@btime f3()
459.799 μs (0 allocations: 0 bytes)

But if you can use atomics, they will be faster:

@btime g()
133.500 μs (0 allocations: 0 bytes)

and if your computation is actually serial, then use serial code:

@btime h()
2.099 ns (0 allocations: 0 bytes)

Why is this so fast? Check the code:

@code_llvm h()
;  @ none:2 within `h'
; Function Attrs: uwtable
define nonnull %jl_value_t addrspace(10)* @japi1_h_14627(%jl_value_t addrsp
ace(10)*, %jl_value_t addrspace(10)**, i32) #0 {
top:
  %3 = alloca %jl_value_t addrspace(10)**, align 8
  store volatile %jl_value_t addrspace(10)** %1, %jl_value_t addrspace(10)*
** %3, align 8
;  @ none:4 within `h'
; ┌ @ refvalue.jl:33 within `setindex!'
; │┌ @ sysimg.jl:19 within `setproperty!'
    %.promoted = load i64, i64* inttoptr (i64 286484128 to i64*), align 32
; └└
;  @ none:3 within `h'
  %4 = add i64 %.promoted, 10000
;  @ none:4 within `h'
; ┌ @ refvalue.jl:33 within `setindex!'
; │┌ @ sysimg.jl:19 within `setproperty!'
    store i64 %4, i64* inttoptr (i64 286484128 to i64*), align 32
; └└
  ret %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 
251723784 to %jl_value_t*) to %jl_value_t addrspace(10)*)
}

It just knows to add 10000. So to get a proper timing let's make the size mutable:

const len = Ref{Int}(10_000)
function h2()
  global acc_s
  global len
  for i in 1:len[]
      acc_s[] += 1
  end
end
@btime h2()
2.099 ns (0 allocations: 0 bytes)
@code_llvm h2()
;  @ none:2 within `h2'
; Function Attrs: uwtable
define nonnull %jl_value_t addrspace(10)* @japi1_h2_14642(%jl_value_t addrs
pace(10)*, %jl_value_t addrspace(10)**, i32) #0 {
top:
  %3 = alloca %jl_value_t addrspace(10)**, align 8
  store volatile %jl_value_t addrspace(10)** %1, %jl_value_t addrspace(10)*
** %3, align 8
;  @ none:4 within `h2'
; ┌ @ refvalue.jl:32 within `getindex'
; │┌ @ sysimg.jl:18 within `getproperty'
    %4 = load i64, i64* inttoptr (i64 296067184 to i64*), align 16
; └└
; ┌ @ range.jl:5 within `Colon'
; │┌ @ range.jl:274 within `Type'
; ││┌ @ range.jl:279 within `unitrange_last'
; │││┌ @ operators.jl:333 within `>='
; ││││┌ @ int.jl:428 within `<='
       %5 = icmp sgt i64 %4, 0
; └└└└└
  br i1 %5, label %L9.L13_crit_edge, label %L29

L9.L13_crit_edge:                                 ; preds = %top
;  @ none:5 within `h2'
; ┌ @ refvalue.jl:33 within `setindex!'
; │┌ @ sysimg.jl:19 within `setproperty!'
    %.promoted = load i64, i64* inttoptr (i64 286484128 to i64*), align 32
; └└
;  @ none:4 within `h2'
  %6 = add i64 %.promoted, %4
;  @ none:5 within `h2'
; ┌ @ refvalue.jl:33 within `setindex!'
; │┌ @ sysimg.jl:19 within `setproperty!'
    store i64 %6, i64* inttoptr (i64 286484128 to i64*), align 32
; └└
  br label %L29

L29:                                              ; preds = %L9.L13_crit_ed
ge, %top
  ret %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 
251723784 to %jl_value_t*) to %jl_value_t addrspace(10)*)
}

It's still optimizing it!

non_const_len = 10000
function h3()
  global acc_s
  global non_const_len
  len2::Int = non_const_len
  for i in 1:len2
      acc_s[] += 1
  end
end
@btime h3()
65.820 ns (1 allocation: 16 bytes)

Note that what is shown here is a type declaration: a::T = ... forces a to be of type T throughout the whole function. By giving the compiler this information, I am able to use the non-constant global in a type-stable manner.

One last thing to note about multithreaded computations, and parallel computations in general, is that one cannot assume that a parallelized computation is computed in any given order. For example, the following has a quasi-random ordering:

const a2 = zeros(nthreads()*10)
const acc_lock2 = Ref{Int64}(0)
const splock2 = SpinLock()
function f_order()
    @threads for i in 1:length(a2)
        lock(splock2)
        acc_lock2[] += 1
        a2[i] = acc_lock2[]
        unlock(splock2)
    end
end
f_order()
a2
40-element Array{Float64,1}:
  1.0
  2.0
  3.0
  4.0
  5.0
  6.0
  7.0
  8.0
  9.0
 10.0
  ⋮  
 22.0
 23.0
 24.0
 25.0
 26.0
 27.0
 28.0
 29.0
 30.0

Note that here we can see that Julia 1.1 is dividing up the work into groups of 10 for each thread, and then one thread dominates the computation at a time, but which thread dominates is random.

The Dining Philosophers Problem

A classic tale in parallel computing is the dining philosophers problem. In this case, there are N philosophers at a table who all want to eat at the same time, all following the same rules. Each philosopher must alternately think and eat. They need both their left and right fork to eat, and cannot start eating until they have both forks. The problem is how to set up a concurrent algorithm that will not cause any philosopher to starve.

The difficulty is a situation known as deadlock. For example, if each philosopher is told to grab the right fork when it's available, then the left fork, and to put down the forks after eating, then they will all grab the right fork and none will ever eat, because they will all be waiting on the left fork. This is analogous to two blocked computations which are each waiting on the other to finish. Thus, when using blocking structures, one needs to be careful about deadlock!
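
A classic fix is to break the symmetry by imposing a global order on the forks: every philosopher grabs the lower-numbered fork first, so a circular wait can never form. A minimal sketch (my own illustration using the locks from above):

using Base.Threads

const forks = [SpinLock() for i in 1:5]

function dine(id)
  left, right = id, mod1(id + 1, 5)
  lo, hi = minmax(left, right)  # always acquire the lower-numbered fork first
  for _ in 1:100
    lock(forks[lo]); lock(forks[hi])
    # eat
    unlock(forks[hi]); unlock(forks[lo])
    # think
  end
end

@threads for id in 1:5
  dine(id)
end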

Summary of Multithreading

  • Communication in multithreading is done on the heap. Locks and atomics allow for a form of safe message passing.

  • 50ns-100ns of overhead. Suitable for 1μs calculations.

  • Be careful of ordering and heap-allocated values.

GPU Computing

GPUs are not fast. In fact, the problem with GPUs is that each processor is slow. However, GPUs have a lot of cores... like thousands.

An RTX2080, a standard "gaming" GPU (not even the ones in the cluster), has 2944 cores. However, not only are GPU cores slow, but GPUs also need to be programmed in a style that is SPMD, which stands for Single Program Multiple Data. This means that every single thread must be running the same program on different pieces of data. Exactly the same program. If you have

if a > 1
  # Do something
else
  # Do something else
end

where some of the data goes down one branch and other data goes down the other branch, every single thread will run both branches (performing "fake" computations while on the other branch). This means that GPU tasks should be "very parallel" with as few conditionals as possible.
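
One common workaround (a hedged sketch of mine, using the CuArrays package introduced below) is to replace the branch with a select: both sides are computed for every element and one result is kept, which keeps every thread on the same instruction stream:

using CuArrays  # assumes a CUDA-capable GPU

a = cu(rand(Float32, 1000))
b = ifelse.(a .> 0.5f0, a .* 2f0, a ./ 2f0)  # branch-free select per element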

This can be very difficult to actually use effectively, so for the most part GPUs are used as array-operation accelerators. CuArrays.jl has a CUDA-accelerated BLAS for high-performance linear algebra. If one wants to generate GPU code from Julia, one can use CUDAnative.jl or GPUifyLoops.jl, which will be a topic for later in the course.

GPU Memory

GPUs themselves are shared memory devices, meaning they have a heap that is shared amongst all threads. However, GPUs are heavily in the NUMA camp, where different blocks of the GPU have much faster access to certain parts of the memory. Additionally, this heap is disconnected from the standard processor, so data must be passed to the GPU and data must be returned.

GPU memory size is relatively small compared to CPUs. For example, the RTX2080 has 8GB of RAM. Thus one needs to be doing computations that are memory compact (such as matrix multiplications, which perform O(n^3) work on O(n^2) data, making the computation time scale faster than the memory cost).

Note on GPU Hardware

Standard GPU hardware "for gaming", like the RTX2070, is just as fast as higher end GPU hardware for Float32. Higher end hardware, like the Tesla line, adds more memory, memory safety (ECC), and fast Float64 support. However, these cards require being in a server since they have alternative cooling strategies, making them a higher end product.

GPU Computing Example with CuArrays

A = rand(100,100); B = rand(100,100)
using CuArrays
# Pass to the GPU
cuA = cu(A); cuB = cu(B)
cuC = cuA*cuB
# Pass to the CPU
C = Array(cuC)
100×100 Array{Float32,2}:
 25.3107  22.7298  25.5692  24.3045  …  24.2161  21.6363  24.8899  26.2532
 26.6793  24.9427  27.8811  26.3918     27.9027  24.1422  27.4993  26.7786
 26.5908  24.0974  25.7496  25.9126     27.7348  23.4801  25.7065  26.4204
 28.7188  23.8214  27.9859  25.211      26.4259  24.363   26.2981  26.9884
 27.4881  25.1147  26.7679  26.3187     26.8501  25.4026  29.6349  28.3685
 21.5718  20.3469  22.8488  22.6915  …  23.022   20.9601  24.3381  23.0586
 26.3455  23.5796  27.653   26.2658     26.5764  24.2969  27.4871  26.5515
 26.9288  25.0956  28.7806  29.758      29.3532  27.068   30.7919  28.6132
 25.0739  22.1758  26.9618  24.7945     24.6738  22.7741  24.013   25.1917
 24.9346  25.1186  27.074   25.589      27.3505  23.7031  28.206   26.3026
  ⋮                                  ⋱                                    
 21.8821  21.6143  24.6156  22.287      23.1155  22.3583  23.7767  22.4511
 24.0125  22.6128  23.5028  24.4777     23.8922  20.7687  24.281   25.8209
 22.8917  21.6158  23.719   23.2208     21.579   21.4846  24.077   26.0685
 25.8146  22.9446  26.1122  26.0201     25.5328  22.512   25.4883  25.252 
 26.8133  23.2904  26.8035  26.0081  …  24.999   23.7144  26.8481  26.9805
 22.9258  22.3818  24.1611  23.3293     23.5227  20.5523  23.4254  24.4378
 24.0151  22.5584  27.2785  21.295      24.0352  22.2044  23.0835  24.6288
 25.6003  23.6021  24.6391  25.5412     26.4169  22.6339  25.2036  26.8669
 27.4531  23.8239  26.3994  26.8437     28.7813  25.2764  28.0294  28.5478

Let's see the transfer times:

@btime cu(A)
58.100 μs (21 allocations: 78.64 KiB)
100×100 CuArrays.CuArray{Float32,2}:
 0.0957158  0.0822288   0.832487   …  0.87214    0.0760202  0.590729 
 0.0144815  0.622855    0.881261      0.33415    0.0712413  0.846761 
 0.378166   0.986902    0.0464911     0.81872    0.122787   0.141584 
 0.571106   0.721337    0.643075      0.0771098  0.881677   0.749174 
 0.151795   0.583606    0.160998      0.829109   0.604337   0.852383 
 0.937345   0.94373     0.137428   …  0.987506   0.1646     0.912784 
 0.704714   0.543219    0.542691      0.748799   0.719914   0.342393 
 0.772732   0.256546    0.956927      0.035975   0.678482   0.871433 
 0.999848   0.941949    0.76449       0.261359   0.617673   0.0275737
 0.307927   0.451561    0.334097      0.602481   0.55241    0.0251871
 ⋮                                 ⋱                                 
 0.511491   0.108915    0.987897      0.220209   0.714782   0.491357 
 0.949688   0.144914    0.25579       0.594408   0.349565   0.610267 
 0.107148   0.00617826  0.713126      0.739051   0.480235   0.684719 
 0.731386   0.884892    0.621268      0.515833   0.171045   0.615208 
 0.0583768  0.578275    0.734511   …  0.582012   0.137375   0.784945 
 0.408481   0.0287324   0.292208      0.688842   0.636986   0.279396 
 0.206525   0.388333    0.613621      0.0891181  0.270656   0.263189 
 0.434286   0.122891    0.256494      0.580923   0.811442   0.754561 
 0.873519   0.0980525   0.468673      0.0951486  0.974195   0.304249
@btime Array(cuC)
72.299 μs (6 allocations: 39.22 KiB)
100×100 Array{Float32,2}:
 25.3107  22.7298  25.5692  24.3045  …  24.2161  21.6363  24.8899  26.2532
 26.6793  24.9427  27.8811  26.3918     27.9027  24.1422  27.4993  26.7786
 26.5908  24.0974  25.7496  25.9126     27.7348  23.4801  25.7065  26.4204
 28.7188  23.8214  27.9859  25.211      26.4259  24.363   26.2981  26.9884
 27.4881  25.1147  26.7679  26.3187     26.8501  25.4026  29.6349  28.3685
 21.5718  20.3469  22.8488  22.6915  …  23.022   20.9601  24.3381  23.0586
 26.3455  23.5796  27.653   26.2658     26.5764  24.2969  27.4871  26.5515
 26.9288  25.0956  28.7806  29.758      29.3532  27.068   30.7919  28.6132
 25.0739  22.1758  26.9618  24.7945     24.6738  22.7741  24.013   25.1917
 24.9346  25.1186  27.074   25.589      27.3505  23.7031  28.206   26.3026
  ⋮                                  ⋱                                    
 21.8821  21.6143  24.6156  22.287      23.1155  22.3583  23.7767  22.4511
 24.0125  22.6128  23.5028  24.4777     23.8922  20.7687  24.281   25.8209
 22.8917  21.6158  23.719   23.2208     21.579   21.4846  24.077   26.0685
 25.8146  22.9446  26.1122  26.0201     25.5328  22.512   25.4883  25.252 
 26.8133  23.2904  26.8035  26.0081  …  24.999   23.7144  26.8481  26.9805
 22.9258  22.3818  24.1611  23.3293     23.5227  20.5523  23.4254  24.4378
 24.0151  22.5584  27.2785  21.295      24.0352  22.2044  23.0835  24.6288
 25.6003  23.6021  24.6391  25.5412     26.4169  22.6339  25.2036  26.8669
 27.4531  23.8239  26.3994  26.8437     28.7813  25.2764  28.0294  28.5478

The cost of transferring is about 50μs-100μs in each direction, meaning that one needs to be doing operations that cost at least 200μs for GPUs to break even. A good rule of thumb is that GPU computations should take at least a millisecond, or GPU memory should be re-used.
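
For example, re-using device memory with in-place operations amortizes the transfer cost (a sketch; mul! dispatches to CuBLAS for CuArrays):

using LinearAlgebra

cuC = cuA * cuB          # allocate the output on the device once
for i in 1:100
  mul!(cuC, cuA, cuB)    # in-place multiply: no host/device transfers
end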

Summary of GPUs

  • GPU cores are slow

  • GPUs are SPMD

  • GPUs are generally used for linear algebra

  • Suitable for SPMD 1ms computations

Xeon Phi Accelerators and OpenCL

Other architectures exist to keep in mind. Xeon Phis are a now-defunct line of accelerators that used x86 cores (standard processors) as the base, packing hundreds of them onto a card. For example, the Knights Landing series had 256-core accelerator cards. These cores were all clocked down, meaning they were slower than a standard CPU, but there were fewer restrictions on SPMD (though SPMD-like computations were still preferred in order to make heavy use of SIMD). However, because machine learning essentially only needs linear algebra, and linear algebra is faster when restricted to SPMD architectures, this approach failed. These devices can still be found on many high end clusters.

One alternative to CUDA is OpenCL which supports alternative architectures such as the Xeon Phi at the same time that it supports GPUs. However, one of the issues with OpenCL is that its BLAS implementation currently does not match the speed of CuBLAS, which makes NVIDIA-specific libraries still the king of machine learning and most scientific computing.

TPU Computing

TPUs are tensor processing units, Google's newest accelerator technology. They are essentially just "tensor operation compilers", where "tensor" in computer science speak simply means higher dimensional linear algebra. To do this, they internally utilize the BFloat16 type, a 16-bit floating point number with the same exponent size as a Float32 but only an 8-bit significand. This means that computations are highly prone to catastrophic cancellation. This computational device only works because BFloat16 has primitive FMA (fused multiply-add) operations which allow 32-bit-like accuracy of multiply-add chains, and thus computations which are mostly dot products (linear algebra) end up okay. Thus this is simply a GPU-like device which has gone even further in completely specializing in linear algebra.

Multiprocessing (Distributed Computing)

While multithreading computes with multiple threads, multiprocessing computes with multiple independent processes. Processes do not share any memory, neither heap nor stack, and thus this mode of computing also allows for distributed computations, where the processes may be on separate computing hardware. However, even if they are on the same hardware, the lack of a shared address space means that multiprocessing has to do message passing, i.e. send data from one process to the other.

The Master-Worker Model

Given the amount of control over data handling, there are many different models for distributed computing. The simplest, the one that Julia's Distributed Standard Library defaults to, is the master-worker model. The master-worker model has one process, deemed the master, which controls the worker processes.

Here we can start by adding some new worker processes:

using Distributed
addprocs(4)

This adds 4 worker processes for the master to control. The simplest computations are those where the master process gives the worker process a job which returns the value afterwards. For example, a pmap operation or @distributed loop gives the worker a function to execute, along with the data, and the worker then computes and returns the result.

At a lower level, this is done by Distributed.@spawn-ing jobs, or by using a remotecall and fetch-ing the result. ParallelDataTransfer.jl gives an extended set of primitive message passing operations.
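
A small sketch of these primitives (assuming the 4 workers added above):

squares = pmap(x -> x^2, 1:8)  # the master hands each worker a function + data

f = remotecall(rand, 2, 3, 3)  # run rand(3,3) on worker 2; returns a Future
A = fetch(f)                   # blocks until worker 2 sends the result back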

SharedArrays, Elemental, and DArrays

Because array operations are a standard way to compute in scientific computing, there are higher level primitives to help with message passing. A SharedArray is an array which acts like a shared memory device: it is backed by memory mapped across all of the worker processes on the same machine, so every process sees each change. This convenience should be used with a performance caution. DistributedArrays.jl is a parallel array type which has local blocks and can be used for writing higher level abstractions with explicit message passing. Because it is currently missing high-level parallel linear algebra, the recommended tool for distributed linear algebra is currently Elemental.jl.
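
A minimal sketch of a SharedArray in use (workers on the same machine assumed):

using Distributed, SharedArrays

S = SharedArray{Float64}(10)
@sync @distributed for i in 1:10
  S[i] = i^2   # all workers write into the same shared buffer
end
S              # the master sees every write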

MapReduce, Hadoop, and Spark: The Map-Reduce Model

Many data-parallel operations work by mapping a function f onto each piece of data and then reducing it. For example, the sum of squares maps the function x -> x^2 onto each value, and then these values are reduced by performing a summation. MapReduce was a Google framework in the 2000's built around this as the parallel computing concept, and current data-handling frameworks, like Hadoop and Spark, continue this as the core distributed programming model.
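
For example, the serial sum of squares is a one-liner in Julia:

mapreduce(x -> x^2, +, 1:10)  # map each value to its square, then reduce with +: 385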

In Julia, the built-in mapreduce function (used above) performs serial mapreduce operations; it also works on GPUs, but it does not auto-distribute. For distributed map-reduce programming, the @distributed for-loop macro can be used. For example, the distributed sum of squares of random numbers is:

@distributed (+) for i in 1:1000
  rand()^2
end

One can see that computing summary statistics is easily done in this framework, which is why it was so widely adopted among "big data" communities.

MPI

The main way to do high-performance multiprocessing is MPI, an old distributed computing interface from the C/Fortran days. Julia has access to the MPI programming model through MPI.jl, which is installed on the Supercloud. The programming model of MPI is that every process runs the same program, and synchronization is performed by blocking communication. For example, the following prints on every process:

using MPI
MPI.Init()

comm = MPI.COMM_WORLD
print("Hello world, I am rank $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm))\n")
MPI.Barrier(comm)
> mpiexec -n 3 julia examples/01-hello.jl
Hello world, I am rank 0 of 3
Hello world, I am rank 1 of 3
Hello world, I am rank 2 of 3

We will go into detail on using MPI later in the course. This model can be faster than the master-worker model because less communication to a single master is required, but care has to be taken to make sure the computation doesn't deadlock.
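
As a hedged sketch of the blocking point-to-point primitives (the positional MPI.jl API of this era; run under mpiexec with at least 2 ranks):

using MPI
MPI.Init()
comm = MPI.COMM_WORLD

if MPI.Comm_rank(comm) == 0
  MPI.Send(collect(1.0:10.0), 1, 0, comm)  # (buffer, dest, tag, comm)
elseif MPI.Comm_rank(comm) == 1
  buf = zeros(10)
  MPI.Recv!(buf, 0, 0, comm)               # blocks until rank 0's message arrives
end
MPI.Barrier(comm)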

Summary of Multiprocessing

  • Cost is hardware dependent: only suitable for computations of 1ms or more, depending on the connections through which the messages are being passed and the topology of the network.

  • Master-worker is Julia's model

  • Map-reduce is a common data-handling model

  • Array-based distributed computations are another abstraction

  • MPI is the lowest abstraction, where each process is completely independent and one controls the message passing by hand.