Trained on a smaller dataset of around 595 images (including repeats), captioned with short sentences generated by Florence2 and manually edited by me. This is an alpha because it is inconsistent and has trouble distinguishing some characters (especially skeletons). Still, it is the first more or less working version, even though the characters may not have 100% likeness.