I wonder if this is even possible. Today's diffusion-based image generation models need far more processing power than today's large language models. I've heard it can take close to an hour to generate a single image on CPU alone on a regular laptop, and for fast GPU generation you need at least 4 GB of VRAM. I don't know if phones have that much yet; it's only fairly recently that desktop gaming GPUs got there.

The better large language models, on the other hand, can type out an answer at around one word per second on CPU alone on a regular laptop, and they support offloading layers to the GPU incrementally, so they don't require any specific amount of VRAM. So I see LLMs being accelerated enough to work on phones (even though phones are very underpowered compared to a laptop) as far more likely, and indeed we've already seen some offline large language model apps that work fine on Pixel phones. I've never tried one myself, though.
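For anyone curious what that layer offloading looks like in practice, here's a minimal sketch using the llama-cpp-python bindings (the model path and the layer count are just placeholders, not a recommendation):

```python
# Minimal sketch: partial GPU offload with llama-cpp-python.
# Assumes a local GGUF model file; the path below is a placeholder.
from llama_cpp import Llama

# n_gpu_layers sets how many transformer layers go to the GPU; the rest
# run on the CPU, so it works with whatever VRAM you happen to have.
# 0 = CPU only, -1 = offload everything.
llm = Llama(model_path="./model.gguf", n_gpu_layers=20)

out = llm("Q: Can a phone run an LLM? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

You can dial n_gpu_layers up or down to match the device's actual VRAM, which is exactly why there's no hard minimum the way there is for the diffusion models.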