19 April 2012

118. Solution to nwchem: SHMMAX too small

Update: also see this post: http://verahill.blogspot.com.au/2012/10/shmmax-revisited-and-shmall-shmmni.html

When running nwchem using mpirun I've occasionally encountered the following error.

Error:
******************* ARMCI INFO ************************
The application attempted to allocate a shared memory segment of 44498944 bytes in size. This might be in addition to segments that were allocated succesfully previously. The current system configuration does not allow enough shared memory to be allocated to the application.

This is most often caused by:
1) system parameter SHMMAX (largest shared memory segment) being too small or
2) insufficient swap space.
Please ask your system administrator to verify if SHMMAX matches the amount of memory needed by your application and the system has sufficient amount of swap space. Most UNIX systems can be easily reconfigured to allow larger shared memory segments,
see http://www.emsl.pnl.gov/docs/global/support.html
In some cases, the problem might be caused by insufficient swap space.
*******************************************************
0:allocate: failed to create shared region : -1
(rank:0 hostname:boron pid:17222):ARMCI DASSERT fail. shmem.c:armci_allocate():1082 cond:0

Diagnosis:
Check the currently defined shmmax:
cat /proc/sys/kernel/shmmax
33554432
Well, 33554432 < 44498944, so the problem is indeed caused by reason 1 above: the kernel's SHMMAX is smaller than the shared memory segment the application tried to allocate.
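
A quick sanity check along these lines (just a sketch, with the segment size taken from the error message above) makes the comparison explicit:

required=44498944                       # segment size from the ARMCI error above
current=$(cat /proc/sys/kernel/shmmax)  # current kernel limit
if [ "$current" -lt "$required" ]; then
    echo "shmmax ($current) is smaller than the requested segment ($required)"
fi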

Solution:

Edit /etc/sysctl.conf
Add a line saying
kernel.shmmax=44498944
Save and reboot. The exact value is up to you -- I've set my shmmax to 128*1024*1024=134217728, while our production cluster has 6269961216.
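
For reference, here's a minimal sketch of the same steps (the 128 MB value is just an example; appending with tee is equivalent to editing the file by hand, and sysctl -p reloads /etc/sysctl.conf if you'd rather not reboot):

echo $((128*1024*1024))                             # prints 134217728
echo "kernel.shmmax=134217728" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p                                      # re-read /etc/sysctl.conf without rebooting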

Update: to change it on the fly, do
sudo sysctl -w kernel.shmmax=6269961216
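
Either way, you can read the value back to confirm that the change took effect:

sysctl kernel.shmmax            # prints kernel.shmmax = 6269961216
cat /proc/sys/kernel/shmmax     # same value, read directly from /proc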

4 comments:

  1. Hi,
    I am encountering this problem whenever I run NWChem over OpenMPI, and I have already set shmmax to a value greater than the allocation the error message reports. This only happens when I try to run NWChem on more than one node. My nodes have 2 cores each with 2 GB of physical RAM. I have already checked /proc/sys/kernel/shmmax, which shows that my kernel shmmax value is sufficient for the purposes of NWChem. Could you help me out? Thank you.

    Replies
    1. I have to admit that while I don't usually have issues with shmmax, I still can't claim to fully understand it --

      I take it that you've set shmmax on both nodes? Are you running the same binary on both nodes (via nfs) or are they local binaries (what I normally do)? I don't honestly know what difference it would make though.

      Finally, have you run getmem.nwchem in the contrib folder to make sure that nwchem can see all your RAM?

      As an example of shmmax headaches, I tried running a job on a borrowed 4 core 32 GB node, and it failed with shmmax/swap warnings, while an 8 core 32 GB node with the exact same shmmax and shmall values ran just fine.

      In neither case did the ACTUAL memory usage go above 4 GB.

      You might want to post your question at nwchem-sw.org if you can't find the source of the error.

    2. Thanks, I shall try running getmem.nwchem

    3. Note that this will only work out ok if you have locally compiled versions of the binary, i.e. not sharing the binary via nfs.
