Diffusion-based text-to-speech with zero-shot voice cloning. Based on meituan-longcat/LongCat-AudioDiT.