Performance Optimization Suggestions

Optimization Priority:
High
Medium
Low

overall

overall summary

Description
details
Performance Index                 Duration(ms)   Duration Ratio
Computing Time                       40144.912           69.09%
  -- Flash Attention                  3167.986            5.45%
  -- Conv                                0.000            0.00%
  -- Matmul                          32236.639           55.48%
  -- Vector                           4732.744            8.14%
  -- SDMA(Tensor Move)                   7.544            0.01%
  -- Other Cube                          0.000            0.00%
Uncovered Communication Time         16564.843           28.51%
  -- Wait                             5573.454            9.59%
  -- Transmit                        10991.389           18.92%
Free Time                             1399.231            2.41%
  -- SDMA                                0.000            0.00%
  -- Free                             1399.231            2.41%
E2E Time                             58108.987          100.00%
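
The Duration Ratio column can be re-derived from the durations, which is a quick way to check which top-level category dominates the end-to-end time. The following is a minimal sketch, assuming the values are copied by hand from the table above; the names `top_level_ms`, `e2e_ms`, and `duration_ratio` are illustrative and not part of the profiling tool.

```python
# Top-level durations (ms) copied by hand from the overall summary table.
top_level_ms = {
    "Computing Time": 40144.912,
    "Uncovered Communication Time": 16564.843,
    "Free Time": 1399.231,
}
e2e_ms = 58108.987  # E2E Time row


def duration_ratio(duration_ms, total_ms):
    """Return duration_ms as a percentage of total_ms."""
    return 100.0 * duration_ms / total_ms


# Ratio of each top-level category, rounded to two decimals like the report.
ratios = {
    name: round(duration_ratio(ms, e2e_ms), 2)
    for name, ms in top_level_ms.items()
}

# Print the categories from largest to smallest share.
for name, pct in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {pct:.2f}%")
```

In this report the three top-level categories add up to the E2E time within rounding, so the ratios can be read as a full breakdown of one profiled run.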

performance problem analysis

communication

Bandwidth Contention Analysis

When computing and communication tasks run concurrently, the SDMA bandwidth drops below 14.4 GB/s. Running computation and communication in parallel generally improves the running efficiency of the model, but the concurrent tasks can contend for bandwidth and slow communication down.
Suggestions
1. Compare the performance data before and after the convergence function is enabled to evaluate whether the impact of bandwidth preemption exceeds the convergence benefits.


The following table lists the 10 operators with the most severe performance deterioration.
op name                            duration(ms)   bandwidth(GB/s)
hcom_reduceScatter__276_14698_1          1.688             11.14
hcom_reduceScatter__276_8918_1           1.6212            11.2
hcom_reduceScatter__276_14080_1          1.6358            11.21
hcom_reduceScatter__276_1740_1           1.7091            11.28
hcom_reduceScatter__276_14110_1          1.6054            11.29
hcom_reduceScatter__276_15835_1          1.5857            11.31
hcom_reduceScatter__276_932_1            1.6462            11.33
hcom_reduceScatter__276_8840_1           1.599             11.33
hcom_reduceScatter__276_8906_1           1.6338            11.35
hcom_reduceScatter__276_14098_1          1.5753            11.37
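
To put the listed operators in perspective, their bandwidth can be compared against the 14.4 threshold quoted in the analysis. The following is a minimal sketch, assuming the threshold is in GB/s (the same unit as the table column) and that the rows are copied by hand from the table above; `THRESHOLD_GBPS` and `ops` are illustrative names, not part of the profiling tool.

```python
# Assumption: the 14.4 threshold uses the same unit as the table (GB/s).
THRESHOLD_GBPS = 14.4

# (op name, duration in ms, bandwidth in GB/s), copied from the table above.
ops = [
    ("hcom_reduceScatter__276_14698_1", 1.688, 11.14),
    ("hcom_reduceScatter__276_8918_1", 1.6212, 11.2),
    ("hcom_reduceScatter__276_14080_1", 1.6358, 11.21),
    ("hcom_reduceScatter__276_1740_1", 1.7091, 11.28),
    ("hcom_reduceScatter__276_14110_1", 1.6054, 11.29),
    ("hcom_reduceScatter__276_15835_1", 1.5857, 11.31),
    ("hcom_reduceScatter__276_932_1", 1.6462, 11.33),
    ("hcom_reduceScatter__276_8840_1", 1.599, 11.33),
    ("hcom_reduceScatter__276_8906_1", 1.6338, 11.35),
    ("hcom_reduceScatter__276_14098_1", 1.5753, 11.37),
]

# Operators whose bandwidth falls below the threshold.
below = [op for op in ops if op[2] < THRESHOLD_GBPS]

# Mean bandwidth of the listed operators and its shortfall from the threshold.
mean_bw = sum(op[2] for op in ops) / len(ops)
shortfall_pct = 100.0 * (THRESHOLD_GBPS - mean_bw) / THRESHOLD_GBPS

print(f"{len(below)} of {len(ops)} operators are below {THRESHOLD_GBPS} GB/s")
print(f"mean bandwidth: {mean_bw:.3f} GB/s ({shortfall_pct:.1f}% under the threshold)")
```

Running the same calculation on a profile taken with concurrency disabled would give a baseline for the before-and-after comparison suggested above.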

computation

Operator Dynamic Shape Issues

Description: All operators are dynamic shape.
Suggestion: Optimize by enabling compiled operators, for example:
`torch_npu.npu.set_compile_mode(jit_compile=False)`
For details, refer to: LINK

schedule

Operator Dispatch Issues

Description: Found 142 operator compile issues.
Suggestion: Place the following code at the entry point of the Python script to disable JIT compilation:
`torch_npu.npu.set_compile_mode(jit_compile=False); torch_npu.npu.config.allow_internal_format = False`

Issue                      Counts   Elapsed Time(us)
aclopCompileAndExecute        142          6496.6575
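
The two settings suggested above can be wrapped in a small helper that is called once at the start of the training script. The following is a minimal sketch, assuming `torch` and `torch_npu` are installed on the target machine; the helper name `disable_jit_compile` is hypothetical, and the import guard is only there so the script still starts on hosts without an NPU software stack.

```python
def disable_jit_compile():
    """Apply the suggested settings; return False if torch_npu is unavailable."""
    try:
        import torch  # noqa: F401  (imported before torch_npu)
        import torch_npu
    except ImportError:
        # No NPU software stack on this host, so there is nothing to configure.
        return False

    # The two settings quoted in the suggestion above.
    torch_npu.npu.set_compile_mode(jit_compile=False)
    torch_npu.npu.config.allow_internal_format = False
    return True


if __name__ == "__main__":
    # Call this before any model or operator is created.
    applied = disable_jit_compile()
    print("settings applied" if applied else "torch_npu not found; settings skipped")
```

After the change, profiling the job again and comparing the `aclopCompileAndExecute` count with the 142 reported here would show whether the settings took effect.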