> With your massive parallel infra you're still processing 27 million frames, right?
No, it's not. We first run a cheap filter, like a motion detector, across all of the video, which is inexpensive. We then stack other, more expensive filters on top of this depending on the use case, and only run the most expensive, metadata-generating models at the very end. We also don't do this on every single frame; we can interpolate information from surrounding frames. Our parallel infra speeds this up further.
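To make the cascade idea concrete, here's a rough sketch of the kind of pipeline I mean. All names (motion_detector, person_detector, attribute_model) and the stride are hypothetical stand-ins; the real filters and sampling rate depend on the use case. The point is just that the expensive model only ever sees the small fraction of frames that survive the cheap filters.

    # Hypothetical filter cascade: cheapest filter first, most expensive model last,
    # scoring only every Nth frame and interpolating the rest.
    STRIDE = 5  # assumption: analyze every 5th frame

    def process_video(frames, motion_detector, person_detector, attribute_model):
        sparse = {}
        for i in range(0, len(frames), STRIDE):
            frame = frames[i]
            if not motion_detector(frame):        # cheap filter rejects most frames
                continue
            boxes = person_detector(frame)        # mid-cost detector, moving frames only
            if not boxes:
                continue
            sparse[i] = attribute_model(frame, boxes)  # expensive model, rarest call
        return interpolate(sparse, len(frames))

    def interpolate(sparse, n_frames):
        # Carry each analyzed frame's metadata forward across the skipped frames.
        filled, last = {}, None
        for i in range(n_frames):
            if i in sparse:
                last = sparse[i]
            filled[i] = last
        return filled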
> That's exactly what transfer learning is for. I am saying to do it with CLIP because you automatically get a pretrained model which can do much more than that. Imagine doing a search like "a person wearing a red cap and yellow handbag walking toward the exit" or "a person wearing a shirt with mark written on it". Can your system do that right now?
The issue is that there are very few text-image pair datasets out there, and building a good one is difficult. We constantly use transfer learning in-house when working with different customer data and typical classifier/detector models, but we haven't yet had success doing so with CLIP. Our system can't semantically search through video just yet; we're still exploring the most feasible way to do it (a rough sketch of the idea follows the links below). There's some interesting work on this that we've been reading recently:
https://ddkang.github.io/papers/2022/tasti-paper.pdf
https://vcg.ece.ucr.edu/sites/g/files/rcwecm2661/files/2021-...
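For what it's worth, the kind of search you describe would conceptually look something like the sketch below if we did get CLIP working: embed sampled frames once, embed the text query at search time, and rank frames by cosine similarity. This uses the off-the-shelf HuggingFace CLIP weights, and frame_images is a hypothetical list of sampled frames as PIL images; it's a sketch of the idea, not our pipeline.

    # Rough sketch of text-to-frame search with off-the-shelf CLIP.
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed_frames(frame_images):
        # frame_images: list of PIL images sampled from the video (hypothetical input)
        inputs = processor(images=frame_images, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        return feats / feats.norm(dim=-1, keepdim=True)

    def search(query, frame_feats, top_k=5):
        inputs = processor(text=[query], return_tensors="pt", padding=True)
        with torch.no_grad():
            text_feat = model.get_text_features(**inputs)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        scores = (frame_feats @ text_feat.T).squeeze(-1)   # cosine similarity per frame
        return scores.topk(top_k).indices.tolist()         # indices of best-matching frames

    # e.g. search("a person wearing a red cap and a yellow handbag", frame_feats)

In practice the embedding pass would only run on frames that survive the cheaper filters, same as the rest of the cascade described above.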