Shaojie WANG authored
* add fp32 multi-k instruction for 16x16 wave
* revert file mode of igemm/algo/mfma_main_loop.py
* fix bug in xdlops mapping for multi-k instruction
* comment out opt cpu conv
* add fp16 instruction set
* add some configs and script for multi k instruction
* add some fp16 code branch
* 1. successfully produce fp16 mfma instruction; 2. compilation right; 3. bug in load data from lds
* add fp16 for mfma main loop and coalescing; need to check correctness
* delete data_type in config file
* add fp16 debug convention code for validation
* fix bug in fp16 random generation
* add fp16 shared_store mem inst
* fix a typo
* fix bug in host function; fix bug in A sst offset; fix bug in sld offset
* finally get a right result for fp16 on one config
* add another config; and it can not run successfully now
* debug version
* fix bug in gemm_in and gemm_im computation
* fix bug in mfma_main_loop if have steps; fix bug for fp16 ds_write2_b64
* comment out some unused print
* fix bug in b matrix offset
* fix bug for unroll_k_sub==0 in step2x2_interleave mfma main loop branch
* add gpu data type check for driver; add more configs in config files
* add some debug code
* fix bugs in shared mem offset calculation
* fix bug when tac1e>4
* check thread copy length 1 cases
* fix bug when (src order==1 and length d1==vector d1) is true
* add some 4x64 configs
* fix bug in likely_write2_b64 and likely_write2st64_b64: bound is vec_count // 2 * stride
* Fix to support a case where tensor_a_thread_lengths[3] > 1
* fix bug when 1 step dimension is 2
* More accurate xdlop_mapping matching in get_ctrl_xdlops_mapping
* Add restriction to vector_d1 of the wei tensor to solve issue brought by some specific configs
* Add validated configurations
* fix bug when tbc1e is bigger than 16; add glb_b pack instruction to avoid lds bank conflict
* add gemm k padding; add more configs for gemm_k_per_block==32
* fix bug for gemm_k padding
* fix bug when ta k0 is greater than 16; add some high efficiency configs
* add high efficiency configs
* 1. add buffer load oob instead of using exec to check padding; 2. use 2 ds_write_b64 instead of ds_write2_b64/ds_write2st64_b64, can work, but still under development; 3. going to add gemm_k_pack 8; 4. R.I.P DIEGO MARATHONA, KING of SOCCER
* fix bufferload oob for input
* Enable using double LDS buffers
* Fix to mfma_loop_repeat_2x2()
* Re-implement mfma_loop_repeat_2x2()
* fix bug for 2 errors: 1. when fwd's step and repeat are all 2x2, ds_read use wrong tmp gpr; 2. ds_read2_likely use gpr_count-1
* Fix coalescing_store_groups initialization
* Support vector size 8 in name() of global 2d load macros
* add lds_double_buffer with interleave kernel
* Adjust to the lds_buffer_num initialization
* keep some original function in mfma_main_loop, to make it easier to be compared and merged
* Add tools to tailor/reorder/generate configurations
* add lds double buffer lp2 interleave main loop
* put last 1x1 repeat to main loop
* fix bug in double buffer lp2 interleave main loop
* Add environment variable for easy testing of ordered configurations
* Update to the tailor/reorder/generate configurations tools
* fix lds double buffer disable logic
* Adjust the sequence of macro-tiles and the number of checked nxb sizes
* Add Readme for configuration tool
* fix lds double buffer use case
* Add checking for selecting better tensor_b n1b cluster size in tunable_is_valid()
* add group conv and magic div for fp16
* fix bug for 1x1 lp2 interleave mfma_main_loop
* Reuse the fixed xdlops_mapping
* fix build error when out/ does not exist
* Tiny fix in igemm_fwd_gtc_driver.h
* Tiny fix in reorder_configs.cpp
* fix wrw bug
* update nrms computation
* Adapt get_ctrl_xdlops_mapping_from_wave_tile calls to new interface in igemm_bwd_gtc.py
* set lds_gemm_k_pack to 1 in mfma main loop
* fix compile error
* remove useless namespace
* add input pack var
* fix git ignore
* remove test configs and test code
* remove useless config files
* update gitignore
* add interleave variable to control code
* add format buffer load instruction
* update config files
* add oob feature
* update gitignore
* fix some mistakes
* make valid_vector be element wise
* remove template for igemm driver code
* 1. remove redundant space; 2. fix bug when computing reusable vgpr; 3. milestone for generation
* chmod 644 for some files
* chmod 644 for fma file
* merge conv model script
* add fp16 in smoke test
* do not change unrelated files
* fix some bug
* chmod to 644
* use size_t instead of int; omit useless branch
* use macro to enable fp16 in host
* update README.md
* update README.md
* update README.md: delete toc
* remove useless code
* fix bug in param check for two script
* remove line-.gitignore in .gitignore file

Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
fbf6de72