Performance Optimization Suggestions
overall
overall summary
Description
details
| Performance Index | Duration(ms) | Duration Ratio |
|---|---|---|
| Computing Time | 40144.912 | 69.09% |
| -- Flash Attention | 3167.986 | 5.45% |
| -- Conv | 0.000 | 0.00% |
| -- Matmul | 32236.639 | 55.48% |
| -- Vector | 4732.744 | 8.14% |
| -- SDMA(Tensor Move) | 7.544 | 0.01% |
| -- Other Cube | 0.000 | 0.00% |
| Uncovered Communication Time | 16564.843 | 28.51% |
| -- Wait | 5573.454 | 9.59% |
| -- Transmit | 10991.389 | 18.92% |
| Free Time | 1399.231 | 2.41% |
| -- SDMA | 0.000 | 0.00% |
| -- Free | 1399.231 | 2.41% |
| E2E Time | 58108.987 | 100.00% |
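The ratios in the table can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, assuming each Duration Ratio is the row's duration divided by the E2E Time; the variable and function names are illustrative and not part of the report.

```python
# Durations in milliseconds, copied from the table above.
e2e_ms = 58108.987
top_level_ms = {
    "Computing Time": 40144.912,
    "Uncovered Communication Time": 16564.843,
    "Free Time": 1399.231,
}


def ratio_percent(duration_ms, total_ms):
    """Return duration as a percentage of total, rounded to two decimals."""
    return round(100.0 * duration_ms / total_ms, 2)


# Assumption: Duration Ratio = duration / E2E Time.
ratios = {name: ratio_percent(ms, e2e_ms) for name, ms in top_level_ms.items()}

# The three top-level categories should add up to the E2E time
# (up to the rounding of the reported values).
top_level_total_ms = sum(top_level_ms.values())

for name, pct in ratios.items():
    print(f"{name}: {pct}%")
print(f"Sum of top-level categories: {top_level_total_ms:.3f} ms")
```

Under that assumption, the computed percentages should agree with the Duration Ratio column, with Matmul inside Computing Time being the largest single contributor.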
performance problem analysis
communication
Bandwidth Contention Analysis
The SDMA bandwidth is lower than 14.4 GB/s when computing and communication tasks run concurrently. Running computation and communication in parallel generally improves model efficiency, but the concurrent tasks can contend for bandwidth and reduce communication throughput.
| Suggestions |
|---|
| 1. Compare the performance data before and after the convergence function is enabled to evaluate whether the impact of bandwidth preemption exceeds the convergence benefits. |
The following table lists the 10 operators with the most severe performance deterioration.
| op name | duration(ms) | bandwidth(GB/s) |
|---|---|---|
| hcom_reduceScatter__276_14698_1 | 1.688 | 11.14 |
| hcom_reduceScatter__276_8918_1 | 1.6212 | 11.2 |
| hcom_reduceScatter__276_14080_1 | 1.6358 | 11.21 |
| hcom_reduceScatter__276_1740_1 | 1.7091 | 11.28 |
| hcom_reduceScatter__276_14110_1 | 1.6054 | 11.29 |
| hcom_reduceScatter__276_15835_1 | 1.5857 | 11.31 |
| hcom_reduceScatter__276_932_1 | 1.6462 | 11.33 |
| hcom_reduceScatter__276_8840_1 | 1.599 | 11.33 |
| hcom_reduceScatter__276_8906_1 | 1.6338 | 11.35 |
| hcom_reduceScatter__276_14098_1 | 1.5753 | 11.37 |
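To put the table in context, the listed bandwidths can be compared against the 14.4 threshold mentioned above. This is a rough sketch, assuming the threshold is expressed in GB/s like the table's bandwidth column; the names `THRESHOLD_GBPS` and `shortfall_percent` are hypothetical.

```python
# Assumption: the 14.4 threshold uses the same unit (GB/s) as the table.
THRESHOLD_GBPS = 14.4

# (op name, duration in ms, bandwidth in GB/s), copied from the table above.
ops = [
    ("hcom_reduceScatter__276_14698_1", 1.688, 11.14),
    ("hcom_reduceScatter__276_8918_1", 1.6212, 11.2),
    ("hcom_reduceScatter__276_14080_1", 1.6358, 11.21),
    ("hcom_reduceScatter__276_1740_1", 1.7091, 11.28),
    ("hcom_reduceScatter__276_14110_1", 1.6054, 11.29),
    ("hcom_reduceScatter__276_15835_1", 1.5857, 11.31),
    ("hcom_reduceScatter__276_932_1", 1.6462, 11.33),
    ("hcom_reduceScatter__276_8840_1", 1.599, 11.33),
    ("hcom_reduceScatter__276_8906_1", 1.6338, 11.35),
    ("hcom_reduceScatter__276_14098_1", 1.5753, 11.37),
]

bandwidths = [bw for _, _, bw in ops]
mean_bw = sum(bandwidths) / len(bandwidths)

# Relative gap between the mean observed bandwidth and the threshold.
shortfall_percent = 100.0 * (THRESHOLD_GBPS - mean_bw) / THRESHOLD_GBPS

# Every listed operator should fall below the threshold.
all_below = all(bw < THRESHOLD_GBPS for bw in bandwidths)

print(f"mean bandwidth: {mean_bw:.3f} GB/s")
print(f"shortfall vs threshold: {shortfall_percent:.1f}%")
print(f"all below threshold: {all_below}")
```

All ten entries are reduceScatter operations with similar durations, which suggests the contention is systematic rather than limited to a few outliers.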
computation
Operator Dynamic Shape Issues
| Description | Suggestion |
|---|---|
| All operators were found to use dynamic shapes. | Optimize by enabling compiled operators, for example: `torch_npu.npu.set_compile_mode(jit_compile=False)`. For details, refer to: LINK |
schedule
Operator Dispatch Issues
| Description | Suggestion |
|---|---|
| Found 142 operator compile issues. | Place the following code at the entry point of the Python script to disable JIT compilation: `torch_npu.npu.set_compile_mode(jit_compile=False); torch_npu.npu.config.allow_internal_format = False` |
| Issue | Counts | Elapsed Time(us) |
|---|---|---|
| aclopCompileAndExecute | 142 | 6496.6575 |
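The suggested settings can be applied at the top of the training script roughly as follows. This is a sketch rather than a definitive setup: the two `torch_npu` statements are taken from the suggestion above, while the wrapper function `disable_jit_compile` and the import guard are illustrative additions so that the script still starts on a machine without `torch_npu`.

```python
def disable_jit_compile():
    """Apply the two settings from the report if torch_npu is available.

    Returns True when the settings were applied, False when torch_npu
    (or torch) could not be imported. This helper is hypothetical and
    not part of torch_npu.
    """
    try:
        import torch  # noqa: F401  (imported before torch_npu)
        import torch_npu
    except ImportError:
        return False

    # Settings quoted from the report's suggestion.
    torch_npu.npu.set_compile_mode(jit_compile=False)
    torch_npu.npu.config.allow_internal_format = False
    return True


# Call this before any model or operator code runs.
applied = disable_jit_compile()
print("JIT compile disabled:", applied)
```

After applying the change, re-profiling and checking whether the `aclopCompileAndExecute` count drops is a reasonable way to confirm the effect.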