COMSOL Multiphysics performance on 4p Opteron system

Hi,

I have a performance issue with COMSOL Multiphysics on a new system.

I apologize in advance, but this will be a long post.

We recently built a new rig at our institute:

Supermicro H8QGI-FN4L motherboard
4x Opteron 6328 (8 integer cores and 4 FPUs per CPU, i.e. 32 integer cores and 16 FPUs in total; 3.2GHz base, with TurboCore 3.5GHz when more than 4 integer cores/CPU are in use and 3.8GHz when at most 4 integer cores/CPU are in use)
4x Noctua NH-U9DO A3 cooler
Kingston ValueRam ECC Registered 1600MHz CL11 DDR3 (4x KVR16R11S4K4/32i kit, one for each CPU, i.e. 4x8GB modules per CPU; 128GB in total)
FirePro W4100
Samsung 850 Pro 128GB SSD, WD Caviar Green 2TB (64MB cache), Asus Xonar DGX sound card, Asus DVD writer (SATA)
EVGA SuperNOVA 1200 P2 PSU

Memory and CPUs were tested and seem fine. The RAM modules were placed according to the motherboard manual, i.e. one module per memory channel, so every channel is used (16 memory channels in total).
numactl --hardware shows that the RAM is distributed almost evenly among all 8 NUMA nodes. The first NUMA node is about 70MB smaller than the others; I think this memory is reserved, maybe by the kernel or by IPMI (I haven't found out yet).
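
To make the per-node comparison repeatable, here is a minimal Python sketch that extracts the node sizes from `numactl --hardware` output. It assumes the usual `node N size: X MB` lines; the sample text and the function names are hypothetical and were not captured from the machine described here.

```python
import re

def parse_numa_sizes(text):
    """Return {node_id: size_in_MB} parsed from `numactl --hardware` output."""
    sizes = {}
    pattern = r"^node (\d+) size: (\d+) MB"
    for match in re.finditer(pattern, text, flags=re.MULTILINE):
        sizes[int(match.group(1))] = int(match.group(2))
    return sizes

def imbalance_mb(sizes):
    """Difference in MB between the largest and the smallest node."""
    return max(sizes.values()) - min(sizes.values())

# Hypothetical two-node sample, for illustration only.
SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 16314 MB
node 0 free: 15000 MB
node 1 cpus: 4 5 6 7
node 1 size: 16384 MB
node 1 free: 15900 MB
"""

if __name__ == "__main__":
    sizes = parse_numa_sizes(SAMPLE)
    print(sizes, "imbalance:", imbalance_mb(sizes), "MB")
```

On a real system the text could come from `subprocess.run(["numactl", "--hardware"], capture_output=True, text=True).stdout` (Python 3.7 or later) instead of `SAMPLE`.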

The SSD has a 26GB root partition; the remainder is for /home.
The HDD has 128GB for swap (just to be sure); the remainder is for data.

OS is Debian 8 (Jessie) 64bit.

COMSOL Multiphysics 5.0 is installed (NSL licence), but it seems very SLOW.

As a test, I modified the Wrench.mph model from the Model Library (max. element size 0.0005 and min. element size 0.00005) to raise the number of DoF to 4,133,121, so that the multithreaded calculations take longer.

On the new rig, it takes 4-5 minutes to solve this model (in the BIOS, NUMA mode, memory bank interleaving and memory channel interleaving are enabled; memory node interleaving is disabled). If I explicitly set the number of NUMA nodes to 8 with the flag -numasets 8, it takes about 3.5 minutes to solve. In that case, htop shows that COMSOL uses only the first two CPUs (all of their cores). Without this flag, COMSOL uses all CPUs but takes longer to solve. Physical memory use is about 12GB and virtual memory about 20GB (from the COMSOL log), with no swapping (as seen in htop). TurboCore is working, as monitored with cpufreq-aperf.
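
To compare flag combinations systematically rather than by hand, a small timing harness could look like the sketch below. It assumes the model is run through `comsol batch -inputfile ... -outputfile ...`; the file names, the list of configurations and the helper names are hypothetical, and only the -np, -numasets and -blas flags already discussed in this post are used. `DRY_RUN` is left on so that the script only prints the commands.

```python
import subprocess
import time

def build_command(model, output, np=None, numasets=None, blas=None):
    """Assemble a COMSOL batch command line as a list of arguments."""
    cmd = ["comsol", "batch", "-inputfile", model, "-outputfile", output]
    if np is not None:
        cmd += ["-np", str(np)]
    if numasets is not None:
        cmd += ["-numasets", str(numasets)]
    if blas is not None:
        cmd += ["-blas", blas]
    return cmd

def time_run(cmd):
    """Run the command and return the wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

# Hypothetical set of configurations to compare.
CONFIGS = [
    {},
    {"numasets": 8},
    {"np": 16},
    {"np": 32},
    {"np": 16, "blas": "acml"},
]

DRY_RUN = True  # set to False to actually launch COMSOL

if __name__ == "__main__":
    for cfg in CONFIGS:
        # wrench_fine.mph is a placeholder name for the modified model
        cmd = build_command("wrench_fine.mph", "wrench_fine_out.mph", **cfg)
        if DRY_RUN:
            print(" ".join(cmd))
        else:
            print(" ".join(cmd), "->", round(time_run(cmd), 1), "s")
```

Running every configuration several times and keeping the best time would reduce noise from TurboCore and from file caching.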

The old Intel rig we have:
P6X58D-E motherboard
1x Intel Core i7 950 CPU (4 physical cores, 3.07GHz) with stock fan
24GB DDR3 as 6x4GB modules, non-ECC, unregistered, 1066MHz, CL8
NVIDIA Quadro FX1700
no SSD, just HDDs
Debian 7 Wheezy 64bit
Without any COMSOL flags, it also takes 3.5 minutes to solve the model described above!!!

On the new machine, COMSOL sees 16 cores, because of the number of FPUs. If I force it to use 32 cores (with the -np 32 flag), it complains that only 16 physical CPUs are present, and the simulation takes a bit longer than with -np 16.
Apart from this, I think the simulation SHOULD BE AT LEAST 4 TIMES FASTER on the new rig than on the old one (4x more memory channels, 4x more FPUs, higher frequency, newer architecture).
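
The rough arithmetic behind this expectation can be sketched as follows. These are theoretical upper bounds only, not measurements: peak memory bandwidth is taken as channels x transfer rate x 8 bytes, and compute capacity as a plain FPU-count x base-clock product that ignores per-cycle throughput differences between the two architectures. The channel count of the old board is not stated above; 3 is an assumption (the X58 platform is triple-channel). The function names are hypothetical.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s):
    """Theoretical peak bandwidth in GB/s for 64-bit (8-byte) DDR3 channels."""
    return channels * mega_transfers_per_s * 8 / 1000

def fpu_clock_product(fpus, base_ghz):
    """Crude capacity figure: number of FPUs times base clock in GHz."""
    return fpus * base_ghz

# New rig: 16 channels of DDR3-1600, 16 FPUs at 3.2 GHz (values from the post).
new_bw = peak_bandwidth_gbs(16, 1600)
new_cpu = fpu_clock_product(16, 3.2)

# Old rig: DDR3-1066, 4 cores at 3.07 GHz; 3 channels is an ASSUMPTION.
old_bw = peak_bandwidth_gbs(3, 1066)
old_cpu = fpu_clock_product(4, 3.07)

if __name__ == "__main__":
    print("bandwidth ratio:", round(new_bw / old_bw, 2))
    print("FPU x clock ratio:", round(new_cpu / old_cpu, 2))
```

A solver that is limited by memory latency or by remote NUMA accesses can fall far short of either ratio, so these numbers only bound what one might hope for.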

Is it possible that COMSOL uses non-optimized code/BLAS for solving models on AMD CPUs?
By default, it uses MKL (as I can see in the COMSOL 5.0 Release Notes), and if I set the BLAS library to acml (instead of mkl) with the -blas flag (i.e. using the ACML shipped with COMSOL), it is a bit slower.

I suspect that the ACML library shipped with COMSOL does not use FMA4/FMA3 and the other new instruction sets available on this AMD Opteron.

I have downloaded the newest ACML (6.1) from AMD's website, but I don't know how to set it up properly for COMSOL.

I have played with the BIOS settings, e.g. NUMA enabled/disabled; memory node interleaving enabled/disabled; CPU-specific options like HPC, CPB, etc.; IOMMU (if that matters at all).

My question is: what do you suggest to boost performance? If anybody has a system like ours, how did she/he configure it?
Can you suggest a benchmark to test whether this system is properly configured?

One last note: our active subscription ended at the end of 2014, so the last COMSOL version we can use is 5.0.

Thank you for your help in advance.
