    Mfma fp16 (#62) · fbf6de72
    Shaojie WANG authored
    
    
    * add fp32 multi-k instruction for 16x16 wave
    
    * revert file mode of igemm/algo/mfma_main_loop.py
    
    * fix bug in xdlops mapping for multi-k instruction
    
    * comment out opt cpu conv
    
    * add fp16 instruction set
    
    * add some configs and script for multi k instruction
    
    * add some fp16 code branch
    
    * 1. successfully produce fp16 mfma instructions; 2. compilation is correct; 3. bug remains in loading data from LDS
    
    * add fp16 for mfma main loop and coalescing; need to check correctness
    
    * delete data_type in config file
    
    * add fp16 debug convolution code for validation
    
    * fix bug in fp16 random generation
    
    * add fp16 shared_store mem inst
    
    * fix a typo
    
    * fix bug in host function; fix bug in A sst offset; fix bug in sld offset
    
    * finally get a right result for fp16 on one config
    
    * add another config; it cannot run successfully yet
    
    * debug version
    
    * fix bug in gemm_in and gemm_im computation
    
    * fix bug in mfma_main_loop when it has steps; fix bug for fp16 ds_write2_b64
    
    * comment out some unused prints
    
    * fix bug in b matrix offset
    
    * fix bug for unroll_k_sub==0 in step2x2_interleave mfma main loop branch
    
    * add gpu data type check for driver; add more configs in config files
    
    * add some debug code
    
    * fix bugs in shared mem offset calculation
    
    * fix bug when tac1e>4
    
    * check thread copy length 1 cases
    
    * fix bug when (src order==1 and length d1==vector d1) is true
    
    * add some 4x64 configs
    
    * fix bug in likely_write2_b64 and likely_write2st64_b64: bound is vec_count // 2 * stride
    
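The bound stated in the fix above (`vec_count // 2 * stride`) can be illustrated with a small sketch. The helper name below is hypothetical and not the project's actual code; it only shows why paired b64 writes stay under that bound:

```python
def likely_write2_offsets(vec_count, stride):
    """Sketch: pair b64 elements into ds_write2_b64-style issues.

    Each paired instruction covers two elements, so only
    vec_count // 2 paired issues exist, and every paired offset
    stays strictly below the bound vec_count // 2 * stride.
    (Hypothetical helper, for illustration only.)
    """
    bound = vec_count // 2 * stride
    offsets = [i * stride for i in range(vec_count // 2)]
    # largest offset is (vec_count // 2 - 1) * stride < bound
    assert all(o < bound for o in offsets)
    return offsets
```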
    * Fix to support a case where tensor_a_thread_lengths[3] > 1
    
    * fix bug when 1 step dimension is 2
    
    * More accurate xdlops_mapping matching in get_ctrl_xdlops_mapping
    
    * Add restriction to vector_d1 of the wei tensor to solve issue brought by some specific configs
    
    * Add validated configurations
    
    * fix bug when tbc1e is bigger than 16; add glb_b pack instruction to avoid lds bank conflict
    
    * add gemm k padding; add more configs for gemm_k_per_block==32
    
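The gemm_k padding added above amounts to rounding gemm_k up to a multiple of gemm_k_per_block. A minimal sketch (hypothetical function name, not the repository's code):

```python
def pad_gemm_k(gemm_k, gemm_k_per_block):
    """Round gemm_k up to the next multiple of gemm_k_per_block.

    Sketch of the usual round-up-to-multiple padding; the padded
    tail must be masked or zero-filled by the kernel.
    """
    return (gemm_k + gemm_k_per_block - 1) // gemm_k_per_block * gemm_k_per_block
```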
    * fix bug for gemm_k padding
    
    * fix bug when ta k0 is greater than 16; add some high efficiency configs
    
    * add high efficiency configs
    
    * 1. add buffer_load OOB instead of using exec to check padding; 2. use two ds_write_b64 instead of ds_write2_b64/ds_write2st64_b64 (works, but still under development); 3. going to add gemm_k_pack 8; 4. R.I.P. DIEGO MARADONA, KING of SOCCER
    
    * fix buffer_load OOB for input
    
    * Enable using double LDS buffers
    
    * Fix to mfma_loop_repeat_2x2()
    
    * Re-implement mfma_loop_repeat_2x2()
    
    * fix two bugs: 1. when fwd's step and repeat are both 2x2, ds_read uses the wrong tmp gpr; 2. ds_read2_likely uses gpr_count - 1
    
    * Fix coalescing_store_groups initialization
    
    * Support vector size 8 in name() of global 2d load macros
    
    * add lds_double_buffer with interleave kernel
    
    * Adjust the lds_buffer_num initialization
    
    * keep some original functions in mfma_main_loop to make them easier to compare and merge
    
    * Add tools to tailor/reorder/generate configurations
    
    * add lds double buffer lp2 interleave main loop
    
    * put last 1x1 repeat to main loop
    
    * fix bug in double buffer lp2 interleave main loop
    
    * Add environment variable for easy testing of ordered configurations
    
    * Update to the tailor/reorder/generate configurations tools
    
    * fix lds double buffer disable logic
    
    * Adjust the sequence of macro-tiles and the number of checked nxb sizes
    
    * Add Readme for configuration tool
    
    * fix lds double buffer use case
    
    * Add checking for selecting better tensor_b n1b cluster size in tunable_is_valid()
    
    * add group conv and magic div for fp16
    
    * fix bug for 1x1 lp2 interleave mfma_main_loop
    
    * Reuse the fixed xdlops_mapping
    
    * fix build error when out/ does not exist
    
    * Tiny fix in igemm_fwd_gtc_driver.h
    
    * Tiny fix in reorder_configs.cpp
    
    * fix wrw bug
    
    * update nrms computation
    
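The nrms validation metric updated above is typically a normalized root-mean-square error between device output and host reference. A sketch under that assumption (one common definition; the repository's exact formula may differ):

```python
import math

def nrms(result, reference):
    """Normalized RMS error: ||result - reference|| / ||reference||.

    Sketch of a common validation metric for comparing GPU output
    against a CPU reference; not the repository's exact code.
    """
    num = sum((r - g) ** 2 for r, g in zip(result, reference))
    den = sum(g ** 2 for g in reference)
    return math.sqrt(num / den) if den else math.sqrt(num)
```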
    * Adapt get_ctrl_xdlops_mapping_from_wave_tile calls to new interface in igemm_bwd_gtc.py
    
    * set lds_gemm_k_pack to 1 in mfma main loop
    
    * fix compile error
    
    * remove useless namespace
    
    * add input pack var
    
    * fix git ignore
    
    * remove test configs and test code
    
    * remove useless config files
    
    * update gitignore
    
    * add interleave variable to control code
    
    * add format buffer load instruction
    
    * update config files
    
    * add oob feature
    
    * update gitignore
    
    * fix some mistakes
    
    * make valid_vector element-wise
    
    * remove template for igemm driver code
    
    * 1. remove redundant space; 2. fix bug when computing reusable vgpr; 3. milestone for generation
    
    * chmod 644 for some files
    
    * chmod 644 for fma file
    
    * merge conv model script
    
    * add fp16 in smoke test
    
    * do not change unrelated files
    
    * fix some bugs
    
    * chmod to 644
    
    * use size_t instead of int; omit useless branch
    
    * use macro to enable fp16 in host
    
    * update README.md
    
    * update README.md
    
    * update README.md: delete toc
    
    * remove useless code
    
    * fix bug in param check for two scripts
    
    * remove line-.gitignore in .gitignore file
    Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>