
Can't use 256 nodes for symmetric computing

I am having a problem submitting symmetric computing (CPU and MIC) jobs to Stampede's normal-mic queue. The normal-mic queue currently allows a job to use a maximum of 256 nodes.

When my job script requests 256 CPU-MIC nodes, my job fails. The output includes messages that the argument list is too long. My job script requests 16 CPU MPI tasks and 15 MIC MPI tasks per node.

I changed the application so that the CPU and MIC versions don't use command line arguments. However, the same error messages appear. What does "argument list is too long" mean when the application does not have any command line arguments? How can I get past this error and use all 256 CPU-MIC nodes?

The program runs correctly when requesting from 1 to 128 CPU-MIC nodes. Also, the program runs correctly when requesting from 1 to 128 CPU-MIC-MIC nodes on the normal-2mic queue (which allows a job to use a maximum of 128 nodes).
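For scale, here is a rough sketch of the MPI task counts these numbers imply. It assumes every node runs 16 CPU tasks and 15 MIC tasks, as in the job described above; the helper name total_tasks is only for illustration.

```shell
# Total MPI tasks for a symmetric job:
#   nodes x (CPU tasks per node + MIC tasks per node)
# total_tasks is a hypothetical helper, not a Stampede utility.
total_tasks() {
    echo $(( $1 * ($2 + $3) ))
}

total_tasks 128 16 15   # largest normal-mic run that works: prints 3968
total_tasks 256 16 15   # run that fails:                    prints 7936
```

So the failing job launches twice as many MPI tasks as the largest normal-mic job that works.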

I've attached my job script and the output file with the error messages for reference.

Thank you.
Attachments: job.symm (1.1k), symm.327680.256x16x1_256x15x4.6429778.txt (144.8k)

RE: Can't use 256 nodes for symmetric computing
Answer
1/27/16 10:46 PM as a reply to David Apostal.
Hi David,

For timely answers to technical questions like this, please submit a ticket to help@xsede.org. That said, I should be able to help you here.

With the upgrade to Intel 15 and Intel MPI 5, the behavior of mpiexec.hydra (which starts the MPI job) changed. It uses an internal buffer to process the arguments for each executable, and the handling of this buffer is different in the new version. This error doesn't occur when using Intel MPI 4.

To work around this buffer overflow, I've created a version of ibrun.symm that does not set up the MIC environment using the mpiexec.hydra task starter. This reduces the number of characters in each executable command and avoids the buffer overflow issue.

However, this version of ibrun.symm starts the MIC executable with the incorrect environment, so a wrapper script must also be used. I've attached the modified version of ibrun.symm and the wrapper script to this response.

To use this version of ibrun.symm, replace your current run line with the following:

./ibrun.symm.no_env -m "./mic_env_run ./arc_mpi.exe.mic" -c "./arc_mpi.exe"
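
Before submitting, it may be worth confirming that the four files named in that run line are present in the job's working directory and have execute permission. A minimal sketch follows; check_executables is a hypothetical helper written for this post, not part of ibrun.symm, and it checks only that each file exists and is executable, not that it is a valid binary.

```shell
# Pre-flight check: report any file that is missing or lacks execute permission.
# check_executables is a hypothetical helper, not part of ibrun.symm.
check_executables() {
    status=0
    for f in "$@"; do
        if [ ! -x "$f" ]; then
            echo "missing or not executable: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# File names taken from the run line above.
check_executables ./ibrun.symm.no_env ./mic_env_run ./arc_mpi.exe.mic ./arc_mpi.exe \
    || echo "fix the files listed above before submitting" >&2
```

If a script is reported, running chmod +x on it is usually enough.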

If you continue to have problems with this, please submit a ticket.

Thanks,

John
Attachments: