MaMMUT: A simple vision-encoder text-decoder architecture for multimodal tasks


by mfiguiere on Hacker News.

