Hi,
I have a performance issue with COMSOL Multiphysics on a new system.
I apologize in advance, but this will be a long post.
We recently built a new rig at our institute:
Supermicro H8QGI-FN4L motherboard
4x Opteron 6328 (8 integer cores and 4 FPUs per CPU, i.e. 32 integer cores and 16 FPUs in total; 3.2 GHz base, with TurboCore 3.5 GHz if more than 4 integer cores per CPU are in use and 3.8 GHz if at most 4 integer cores per CPU are in use)
4x Noctua NH-U9DO A3 cooler
Kingston ValueRam ECC Registered 1600MHz CL11 DDR3 (4x KVR16R11S4K4/32i kit, one for each CPU, i.e. 4x8GB modules per CPU; 128GB in total)
FirePro W4100
Samsung 850 Pro 128GB SSD, WD Caviar Green 2TB (64MB cache), Asus Xonar DGX sound card, Asus DVD writer (SATA)
EVGA SuperNOVA 1200 P2 PSU
Memory and CPUs were tested and seem fine. The RAM modules were placed according to the motherboard manual, i.e. one module per memory channel, so every channel is used (16 memory channels in total).
numactl --hardware shows that the RAM is distributed almost evenly among all 8 NUMA nodes. The first NUMA node differs in size by about 70 MB, but I think that memory may be reserved by the kernel or by IPMI (I haven't found out yet).
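To make the per-node comparison less error-prone than reading the numbers by eye, the check can be scripted. This is only a minimal sketch: it assumes the usual "node N size: X MB" lines printed by numactl --hardware, and the SAMPLE text below is made up for illustration (it is not the output of the machine described here). The function names are hypothetical.

```python
import re

# Illustrative sample only: shaped like `numactl --hardware` output,
# with made-up sizes (NOT taken from the machine in this post).
SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 16313 MB
node 0 free: 15800 MB
node 1 cpus: 4 5 6 7
node 1 size: 16384 MB
node 1 free: 16100 MB
"""

def node_sizes_mb(text):
    """Return {node_id: size_in_MB} parsed from `numactl --hardware` text."""
    pattern = re.compile(r"^node\s+(\d+)\s+size:\s+(\d+)\s+MB", re.MULTILINE)
    return {int(node): int(size) for node, size in pattern.findall(text)}

def size_spread_mb(sizes):
    """Difference between the largest and smallest node, in MB."""
    return max(sizes.values()) - min(sizes.values())

sizes = node_sizes_mb(SAMPLE)
print(sizes)
print("spread (MB):", size_spread_mb(sizes))
```

On a real system the text could be obtained with subprocess.run(["numactl", "--hardware"], capture_output=True, text=True).stdout and passed to node_sizes_mb in place of SAMPLE.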
The SSD has a 26 GB root partition; the remainder is for /home.
The HDD has 128 GB of swap (just to be sure); the remainder is for data.
The OS is Debian 8 (Jessie), 64-bit.
COMSOL Multiphysics 5.0 has been installed (NSL licence), but it seems very SLOW.
The wrench.mph model from the Model Library was modified (max. element size 0.0005 and min. element size 0.00005) to raise the number of DoF (to 4,133,121), so that the multithreaded calculations take longer.
On the new rig, it takes 4-5 minutes to solve this model (in the BIOS, NUMA mode, memory bank interleaving and memory channel interleaving are enabled; memory node interleaving is disabled). If I explicitly set the number of NUMA nodes to 8 with the flag -numasets 8, it takes about 3.5 minutes to solve. In that case, htop suggests that COMSOL uses only the first two CPUs (all of their cores). Without this flag, COMSOL uses all CPUs but takes longer to solve. Used physical memory is about 12 GB and virtual memory is about 20 GB (from the COMSOL log), with no swapping (as seen in htop). TurboCore is working, as monitored with cpufreq-aperf.
The old Intel rig we have:
P6X58D-E motherboard
1x Intel Core i7 950 CPU (4 physical cores, 3.07 GHz) with stock fan
24GB DDR3 as 6x4GB modules, non-ECC, unregistered, 1066MHz, CL8
NVIDIA Quadro FX1700
no SSD just HDDs
Debian 7 Wheezy 64bit
Without any COMSOL flags, it also takes 3.5 minutes to solve the same model on the old rig!
On the new machine, COMSOL sees 16 cores because of the number of FPUs. If I force it to use 32 cores (with the -np 32 flag), it complains that only 16 physical CPUs are present, and the simulation takes slightly longer than with -np 16.
All in all, I think the simulation SHOULD BE AT LEAST 4 TIMES FASTER on the new rig than on the old one (4x more memory channels, 4x more FPUs, higher frequency, newer architecture).
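To put numbers on that expectation, here is a small back-of-the-envelope sketch using only the figures quoted in this post. The "naive" ratio simply divides FPU counts and is an assumption for illustration, not a prediction; real solvers rarely scale linearly with core count. The variable names are hypothetical.

```python
# All figures below are the ones quoted in this post.
NEW_FPUS, OLD_CORES = 16, 4            # 4x Opteron 6328 vs. 1x Core i7 950
NEW_BASE_GHZ, OLD_BASE_GHZ = 3.2, 3.07

OLD_TIME_MIN = 3.5                     # old rig, no flags
NEW_TIME_DEFAULT_MIN = (4.0, 5.0)      # new rig, no flags ("4-5 minutes")
NEW_TIME_NUMASETS_MIN = 3.5            # new rig with -numasets 8

def speedup(old_minutes, new_minutes):
    """Speedup of the new rig relative to the old one (>1 means faster)."""
    return old_minutes / new_minutes

# Naive upper-bound style estimate: assumes perfect linear scaling (an assumption).
naive_expected = NEW_FPUS / OLD_CORES
clock_ratio = NEW_BASE_GHZ / OLD_BASE_GHZ

observed_default = [speedup(OLD_TIME_MIN, t) for t in NEW_TIME_DEFAULT_MIN]
observed_numasets = speedup(OLD_TIME_MIN, NEW_TIME_NUMASETS_MIN)

print(f"naive expected speedup : {naive_expected:.1f}x")
print(f"base clock ratio       : {clock_ratio:.2f}")
print(f"observed (no flags)    : {observed_default[1]:.2f}x to {observed_default[0]:.2f}x")
print(f"observed (-numasets 8) : {observed_numasets:.2f}x")
```

So by these numbers the new rig is at best on par with the old one, against a naive expectation of about 4x.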
Is it possible that COMSOL uses non-optimized code/BLAS for solving models on AMD CPUs?
By default, it uses MKL (as I can see in the COMSOL 5.0 Release Notes), and if I set the BLAS library to acml (instead of mkl) with the -blas flag (i.e. using the ACML shipped with COMSOL), it is slightly slower.
I suspect that COMSOL's ACML library does not use FMA4/FMA3 and the other newer instruction sets of the AMD Opteron.
I have downloaded the newest ACML (6.1) from AMD's web site, but I don't know how to set it up properly for COMSOL.
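For comparing these flag combinations in a repeatable way, a small timing harness might help. This is a sketch under assumptions: the flags -np, -numasets and -blas are the ones mentioned in this post, but launching through the batch target with -inputfile, and the file name wrench_fine.mph, are assumptions for illustration. The helper names are hypothetical.

```python
import subprocess
import time

def build_command(model, n_cores=None, numasets=None, blas=None):
    """Build a COMSOL command line as a list of arguments.

    Assumption: the model is run via `comsol ... batch -inputfile <model>`.
    """
    cmd = ["comsol"]
    if n_cores is not None:
        cmd += ["-np", str(n_cores)]
    if numasets is not None:
        cmd += ["-numasets", str(numasets)]
    if blas is not None:
        cmd += ["-blas", blas]
    cmd += ["batch", "-inputfile", model]
    return cmd

def time_run(cmd):
    """Run one command and return the wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

# Flag combinations discussed in this post.
CONFIGS = [
    {},
    {"n_cores": 16},
    {"n_cores": 32},
    {"numasets": 8},
    {"blas": "acml"},
]

# Hypothetical file name for the modified wrench model.
MODEL = "wrench_fine.mph"

for cfg in CONFIGS:
    cmd = build_command(MODEL, **cfg)
    print(" ".join(cmd))
    # To actually measure, replace the print with:
    #   print(cfg, round(time_run(cmd), 1), "s")
```

Running each configuration several times and keeping the median would reduce the noise from TurboCore and disk caching.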
I have played with the BIOS settings, e.g. NUMA enabled/disabled; memory node interleaving enabled/disabled; CPU-specific options like HPC, CPB, etc.; IOMMU (if that matters at all).
My question is: what do you suggest to boost performance? If anybody has a system like ours, how did you configure it?
Can you suggest a benchmark to test whether this system is properly configured?
One last note: our active subscription ended at the end of 2014, so the last COMSOL version we can use is 5.0.
Thank you in advance for your help.