{2023.06}[foss/2023a] TensorFlow v2.15.1 w/ CUDA 12.1.1 + eb_hooks.py#35
TopRichard wants to merge 6 commits into EESSI:main
bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90
New job on instance
bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90
New job on instance
bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90
New job on instance
e877c9e to 5befb75
bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90
New job on instance
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/generic
New job on instance
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/generic accel:nvidia/cc90
New job on instance
The failure is:
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/generic accel:nvidia/cc90
New job on instance
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:x86_64/amd/zen3 accel:nvidia/cc80
New job on instance
bot: help
Updates by the bot instance
Updates by the bot instance
Updates by the bot instance
Updates by the bot instance
bot: help
Updates by the bot instance
Updates by the bot instance
Updates by the bot instance
Updates by the bot instance
Instance
Instance
Instance
bot: build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software arch:cascaselake accel:nvidia/cc70
I think it might be better to always open a secondary PR where you do the actual testing, to make sure that none of the builds get deployed from the PR and only the changed scripts, like I did with #49 and #22. I know I said I was going to write out the policy, but I have not gotten to it yet.
@laraPPr marking the PR as draft as long as the easystack file is in there probably helps, but indeed, we may need to come up with a better approach
It is going to fail the test step, but let's see for the build step.
The gent bot crashed because of a local problem. I'll update the reframe_config and try again later.
Ah no, it does seem to be still alive, but I made a mistake in the comment, so let's see if this works:
Third time's the charm, I hope
No job is being created for some reason and I can't tell why
Debug building with new bot release...
No job was submitted, possibly because the bot: build repo:eessi.io-2023.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace,accel=nvidia/cc90
New job on instance
Try cross-compiling for cc80...
Supplying several values for the
New job on instance
bot: show_config
Instance
Instance
Instance
Instance
Instance
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-vsc-ugent for:arch=x86_64/intel/cascadelake,accel=nvidia/cc70
New job on instance
@TopRichard can you sync this PR with the main branch? I think it is not picking up the changes from #59
…into TensorFlow-CUDA
@TopRichard apparently it's a bigger issue, so I'm moving my experimenting to EESSI/software-layer#1147
This PR uses a CUDA-ARM patch to work around the previously seen error:
On x86_64 with cc80:
CPU tests:
Executed 847 out of 847 tests: 847 tests pass.
GPU tests:
Executed 189 out of 189 tests: 189 tests pass.
On aarch64 with cc90:
CPU tests:
Executed 847 out of 847 tests: 847 tests pass.
GPU tests:
Executed 189 out of 189 tests: 188 tests pass and 1 fails locally.
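For anyone who wants to compare these summaries across architectures without reading them by eye, here is a minimal sketch (not part of this PR). The helper name `parse_summary` and the regular expression are assumptions for illustration, modeled only on the summary lines quoted above; other output formats would need their own pattern.

```python
import re

# Hypothetical pattern, modeled only on the summary lines quoted in this PR.
SUMMARY_RE = re.compile(
    r"Executed (?P<executed>\d+) out of (?P<total>\d+) tests?: "
    r"(?P<passed>\d+) tests? pass(?:es)?"
    r"(?: and (?P<failed>\d+) fails? locally)?"
)


def parse_summary(line):
    """Return test counts from one summary line, or None if it does not match."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    # A missing "and N fails locally" clause is treated as zero failures.
    return {key: int(value or 0) for key, value in match.groupdict().items()}


if __name__ == "__main__":
    for line in (
        "Executed 847 out of 847 tests: 847 tests pass.",
        "Executed 189 out of 189 tests: 188 tests pass and 1 fails locally",
    ):
        print(parse_summary(line))
```

A non-zero `failed` count, as in the aarch64 GPU run, is what a reviewer would want flagged before deployment.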