How can I find the number of threads for the best performance?

I have tested the performance of PFC3D 7.00.153 using a simple model (gravity-driven falling of 1000 balls) while changing the number of threads. (This is very similar to the performance test suggested in the Introduction of the Yade 3rd ed. documentation.)

The model was solved to an equilibrium ratio of 1e-5, and the ‘average number of cycles per second’ was calculated.

This is the performance result on my machine (AMD Ryzen Threadripper PRO 5975WX).

The results show that using 7 threads may yield the best performance for this model.

Questions

  1. Do I have to conduct a performance test for every model that I make, or can one test result for a specific machine be regarded as the general performance to be expected?

  2. What is the reason for the poor performance when increasing the number of threads? Is it a problem of overhead in parallel computing?

Main script file “C_T[1].py”


import itasca as it
it.command("python-reset-state false") #the model new command will not affect python environment
import ct2 as ct2
import importlib
importlib.reload(ct2)

# Run the falling test with 1 to 64 threads.
for i in range(1, 65):
    it.set_threads(i)
    ct2.falling_test()

External module “ct2.py”

import itasca as it
from datetime import datetime

def falling_test():
    # MODEL CONFIGURATION
    it.command("""model new
        model configure dynamic
        model deterministic on
        model large-strain on
        model energy mechanical off
        model orientation-tracking off
        model title 'core performance test'
    """)

    it.set_domain_min((-0.05, -0.05, -0.05))
    it.set_domain_max((0.05, 0.05, 0.05))

    it.command("""
        wall generate box -0.05 0.05
        ball generate box -0.05 0.05 -0.05 0.05 0.03 0.05 radius 0.002 number 1000
    """)

    it.command("""
        contact cmat default type ball-ball model hertz property hz_shear 1e6 hz_poiss 0.2 dp_nratio 0.2 dp_sratio 0.2
        contact cmat default type ball-facet model linear property kn 1e6 ks 1e6 dp_nratio 0.2 dp_sratio 0.2
        ball attribute density 2000
    """)

    it.set_gravity((0.0,0.0,-9.81))

    # Time the solve to the target equilibrium ratio
    start_time = datetime.now()
    it.command("""
        model solve ratio 1e-5
    """)
    running_time = datetime.now() - start_time

    # Report throughput for the current thread count
    print('Thread = ', it.threads())
    print('Total cycle = ', it.cycle(), 'Cycle per second = ', int(it.cycle()/running_time.total_seconds()))
    print('Elapsed time = ', running_time)

I’m answering with the help of another expert.
I think the model used in this question does not really reflect the scaling that can be achieved on this processor, because the number of balls in the system is too small.
To respond to the specific questions:

  1. Do I have to conduct a performance test for every model that I make, or can one test result for a specific machine be regarded as the general performance to be expected?
    As discussed above, the number of balls in the system may be too small to provide sensible results (see the sketch after this list for one way to repeat the test with larger systems). In general, most of the time is spent in: 1) contact resolution and contact-force updates; and 2) contact detection. A “quasi-static” model with a stable contact network will spend most of its time in 1, whereas a “dynamic” model with a lot of contact creation/deletion will spend time in both 1 and 2. Therefore the scaling may depend on the state of the system.
  2. What is the reason for the poor performance when increasing the number of threads? Is it a problem of overhead in parallel computing?
    In general, there is a threshold beyond which increasing the number of threads no longer improves performance. This is due to memory cache misses occurring during the computation. The threshold depends on the memory cache size and is therefore processor-dependent.
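
As a hedged illustration of both points, here is a minimal sketch of how the driver script could be extended to sweep model size as well as thread count and to pick the best-performing thread count from the measurements. It assumes a hypothetical variant of falling_test() that takes the number of balls as a parameter (substituting it into the ball generate command, and enlarging the domain or shrinking the radius so the balls fit) and returns the measured cycles per second instead of only printing it; neither change is part of the original scripts, and only the itasca calls already used in the question are relied on.

import itasca as it
it.command("python-reset-state false")
import ct2
import importlib
importlib.reload(ct2)

# Hypothetical sweep: several model sizes x several thread counts.
# Assumes falling_test(n_balls) returns cycles per second (a change to ct2.py).
ball_counts = (1000, 8000, 64000)
thread_counts = (1, 2, 4, 8, 16, 32, 64)

results = {}  # (n_balls, threads) -> cycles per second
for n_balls in ball_counts:
    for threads in thread_counts:
        it.set_threads(threads)
        results[(n_balls, threads)] = ct2.falling_test(n_balls)

# For each model size, report the thread count that gives the highest throughput.
for n_balls in ball_counts:
    best = max(thread_counts, key=lambda t: results[(n_balls, t)])
    print(n_balls, 'balls -> best thread count:', best,
          'at', int(results[(n_balls, best)]), 'cycles/s')

If the best thread count stays roughly the same across model sizes, one test per machine is probably a reasonable guide; if it keeps growing with the number of balls, the threshold is model-dependent and a short sweep per model (or per family of similar models) is worth the extra run time.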