Show HN: Processing 24 hours of video in ten minutes (sievedata.com)
77 points by mvoodarla on Jan 11, 2022 | 37 comments


Hey HN! I’m one of the creators of Sieve (https://sievedata.com/), and we’re happy to be sharing this with all of you!

Sieve is an API that helps you store, process, and automatically search your video data, instantly and efficiently. Just think of 10 cameras recording footage at 30 FPS, 24/7. That's roughly 26 million frames generated in a single day. The videos might be searchable by timestamp, but finding moments of interest is like searching for a needle in a haystack.

We built this visual demo (https://sievedata.com/app/query?api_key=AIzaSyAfKwf0tuuNOHbY...) a little while back which we’d love to get feedback on. It’s ~24 hours of security footage that our API processed in <10 mins and has simple querying and export functionality enabled. The demo is a security use-case but our platform supports a wider variety of use-cases and metadata. We’re working with a few early customers to explore which ones to dive deeper into (happy to hear ideas on this front from the HN community as well!).

To try it on your videos: https://github.com/Sieve-Data/automatic-video-processing

Visual dashboard walkthrough: https://youtu.be/_uyjp_HGZl4

General FAQ: https://sievedata.com/faq


My current interest in video filtering isn't business oriented, or even all that personal. It's a bit SponsorBlock-like, in that I need certain fragments cut out of the final video but don't want to manually encode and cut them.

I occasionally download really huge VODs from Twitch or YouTube (mostly gaming tournaments or game playthroughs). I'm quite annoyed by long blocks of breaks, too much blabbering before a match starts, or, in the case of bot tournaments, one of the bots hanging and the system timing out after a whole hour of a mostly static image, etc.

But let me give a more digestible example. The ESL_SC2 channel on Twitch streams StarCraft 2 matches 24/7 (often rebroadcasts). They have these roughly 3-minute breaks with a slightly moving image and (IMO) annoying music. I believe those segments could easily be filtered out by applying a perceptual hash to the frames, but I've never had the time and energy to try it myself.

I'm not asking to be catered to; I'm giving Sieve's authors an idea for a creative endeavor. If they're up for it, download one video from there with youtube-dl / yt-dlp and try it out.

(I suppose this could be useful in the home surveillance scenario as well, e.g. if the camera has been covered with snow for 12 hours, you don't need the footage and want all frames matching that perceptual hash deleted.)
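
For anyone who wants to try it, here's a minimal sketch of the perceptual-hash idea using the imagehash and OpenCV libraries. The reference-frame filename, sampling rate, and distance threshold are all placeholders to tune per stream:

    # pip install imagehash opencv-python pillow
    import cv2
    import imagehash
    from PIL import Image

    REFERENCE = imagehash.phash(Image.open("break_frame.png"))  # one known break frame
    MAX_DISTANCE = 8  # assumed Hamming-distance cutoff

    def break_segments(video_path, sample_fps=1.0):
        # Yield (start_sec, end_sec) spans whose sampled frames hash close
        # to the reference, i.e. candidate segments to cut out.
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
        step = max(1, int(fps / sample_fps))
        start, idx = None, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                h = imagehash.phash(Image.fromarray(rgb))
                if h - REFERENCE <= MAX_DISTANCE:
                    if start is None:
                        start = idx / fps
                elif start is not None:
                    yield start, idx / fps
                    start = None
            idx += 1
        if start is not None:
            yield start, idx / fps
        cap.release()

The yielded spans could then be fed to ffmpeg to render a break-free copy of the VOD.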


Will definitely try this out on personal time. Sounds like a fun weekend project :)


You should release an app for Synology, or at least Synology should hire you to improve 'Photos'.

It works fine at facial recognition for photos, but it doesn't even try with videos; any metadata would be useful for hundreds of hours of family videos.

A terabyte of audio and video is only as useful as the time and date we've organised it by. Imagine actually usable face and object recognition for both media types.

Would this work offline for lowend hardware? E.g. Synology 920+


Sounds like another use-case that fits! We're currently only offering this via hosted API so it won't work offline. I'd be interested in understanding the specific use-case further.

If you have any contacts there, feel free to reach out to the email in my bio :)


I’d like to use something like this on my private network, processing files from tools like Shinobi or other private video surveillance system. I’m not about to upload my private videos to your server.


This is totally understandable. We're thinking of the best way to satisfy personal users given we've heard this security use case many times.


Weirdish use cases?

1) Allow homeowners to take a timelapse of their backyard. For each frame (or some video interval), determine which areas are in sun vs. shade. Then greyscale the image and overlay a heatmap showing hours of sun in each area, or band it in half-hour increments.

2) Road monitoring of various types. Traffic count and speed. Common visitors / time of visit (garbage, street cleaning etc, postal, newspaper delivery).


1) Wow I've never heard that one before but that's so interesting. I'm just thinking of my parents who love gardening and the insight they might get from that. If you want to leverage our platform for a personal project like that, let us know! Might need to add a few new features to make that seamless.

2) This one we currently support + work with. Thanks for bringing it up again :)


OK - I actually did this once already, in a simple version. I just took a picture every minute in grayscale, then stacked them, then banded that by brightness, then laid that color on a shadowed picture so you could kind of see what was what.

The key improvements would be some type of per-pixel classification of sun vs. not-sun (based on the entire stream of values for that spot over the day, to make it easier), because dark stuff in the yard, walls, etc. messed up my approach. And banding the color by hours of sun (most gardening advice says things like "needs 6 hours of sun a day") so you can clearly see the outlines of areas.
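
A toy version of that stacking approach in numpy, assuming one grayscale frame per minute saved in frames/ (the brightness threshold is a made-up knob, which is exactly the weak point described above):

    import glob
    import numpy as np
    from PIL import Image

    frames = [np.asarray(Image.open(p).convert("L"), dtype=np.float32)
              for p in sorted(glob.glob("frames/*.jpg"))]
    stack = np.stack(frames)                       # shape: (minutes, H, W)

    SUN_THRESHOLD = 180                            # assumed "in sun" brightness cutoff
    sun_hours = (stack > SUN_THRESHOLD).sum(axis=0) / 60.0

    # Band into half-hour increments, as suggested upthread, then save a heatmap.
    bands = np.floor(sun_hours * 2) / 2
    scaled = (bands / max(bands.max(), 0.5) * 255).astype(np.uint8)
    Image.fromarray(scaled).save("sun_heatmap.png")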

Yes, a lot of people don't know how much sun a spot gets and don't want to sit out all day watching (front yard / back yard etc).


1) Would be really helpful when moving into a new house. It takes a bit of time to figure out how big the shadows from trees are, how it varies by season, what time a garden area gets shaded by a fence, etc.

I've looked into whether there are any sunlight simulators available, but my Google-fu has been lacking. Even a tool that told me "on this date and time, at your location, a 1 meter tall stick casts a shadow _x_ meters long in _y_ direction" would be workable.


This is relatively simple to create yourself.

There are closed-form equations for the angular position of the sun, which are more than accurate enough for this purpose.

Then it's mostly simple trig to determine which areas have a line of sight to the sun.
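
For the stick-shadow question upthread, a minimal sketch of those closed-form equations (Cooper's declination approximation, ignoring the equation of time, so expect accuracy on the order of a degree, which is plenty for garden planning):

    import math

    def sun_position(lat_deg, lon_deg, day_of_year, utc_hour):
        # Returns (elevation, azimuth) in degrees; azimuth is clockwise from north.
        lat = math.radians(lat_deg)
        # Cooper's approximation for solar declination.
        decl = math.radians(23.45) * math.sin(2 * math.pi * (284 + day_of_year) / 365)
        # Hour angle: 15 degrees per hour from local solar noon.
        hour_angle = math.radians(15 * (utc_hour + lon_deg / 15.0 - 12))
        sin_elev = (math.sin(lat) * math.sin(decl)
                    + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
        elev = math.asin(sin_elev)
        cos_az = ((math.sin(decl) - math.sin(lat) * sin_elev)
                  / (math.cos(lat) * math.cos(elev)))
        az = math.acos(max(-1.0, min(1.0, cos_az)))
        if hour_angle > 0:          # afternoon: sun is in the western half of the sky
            az = 2 * math.pi - az
        return math.degrees(elev), math.degrees(az)

    # A 1 m stick's shadow: length = height / tan(elevation), pointing away
    # from the sun's azimuth. Example: Berlin (52.5 N, 13.4 E), late June, 10:00 UTC.
    elev, az = sun_position(52.5, 13.4, 180, 10.0)
    print(elev, az, 1.0 / math.tan(math.radians(elev)))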

I have this on my to-do list because, in my bedroom on a full moon, the light will come through one specific window onto our pillows. I wanted to automate either the curtains or something to predict when it'll happen.

Maybe eventually I'll get around to something like this


This is really neat!

I'm curious about what projects people in the community are working on that would benefit from this sort of video-frame searching. My computer vision work is in a niche market and I use images, not video. But I figure there must be people building self-driving delivery robots or the like that would like to be able to search their massive video datasets for corner cases where their models need more training data.


Thanks for sharing your use-case Tom!

Love that you bring up delivery bots. They're the moving example, but you can also think of a parallel application where you have stationary cameras monitoring something of interest (worker safety, defective parts, security, etc).

Interested in hearing where else people are struggling with videos.


Interesting!

I'm currently working with video in the context of gameplay feedback on video games (https://www.volt.school/ is my example app, https://www.vodon.gg/ if you want an instance of your own)

I'd love to be able to detect certain events that happen in the game stream: things like the player killing someone, picking up a particular item, etc. Adjacent to this, I'd also be interested in your infrastructure around hosting and processing the videos. I'm currently piggybacking off YouTube, and in V2 of my app I've moved to Cloudflare Stream, but I'd love to know how you're working with these massive video streams in a cost-effective way.


Most of the time, we work on processing video data after it's been hosted somewhere for us to seek by frame or download. We don't yet support video streams, but multiple people have brought them up.

Our infrastructure is really good at filtering for the parts of a video where things are happening (the "interesting" parts) and at parallelizing processing: cutting the video into multiple pieces, running more granular models on those parts, and smoothly interpolating metadata using surrounding frames. Check out the processing section of our FAQ page (linked in the original comment) for a more detailed explanation of how our processing works.
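
As a rough illustration of the chunking idea (not Sieve's actual implementation): stream-copy a long video into fixed-length segments with ffmpeg, then fan the pieces out to workers. process_chunk here is a stand-in for the real per-chunk models:

    import glob
    import subprocess
    from concurrent.futures import ProcessPoolExecutor

    def split(video, chunk_seconds=60):
        # Stream-copy split (no re-encode), so splitting is nearly free.
        subprocess.run(
            ["ffmpeg", "-i", video, "-c", "copy", "-map", "0",
             "-f", "segment", "-segment_time", str(chunk_seconds),
             "chunk_%04d.mp4"],
            check=True,
        )

    def process_chunk(path):
        # Stand-in for real work: detectors, embeddings, interpolation, ...
        import cv2
        cap = cv2.VideoCapture(path)
        n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.release()
        return path, n

    if __name__ == "__main__":
        split("footage.mp4")
        with ProcessPoolExecutor() as pool:
            for path, n in pool.map(process_chunk, sorted(glob.glob("chunk_*.mp4"))):
                print(path, n, "frames")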

If you're worried about just streaming seamlessly, Mux has a great solution: https://mux.com/live.

Also, for detecting things in e-sports feeds, feel free to reach out to the email in my bio!


Ah, I'm not working off streams (yet!). Users will use a screen recorder to capture gameplay and upload it later.

Thanks for pointing out the FAQ.


I could see this technology being deployed to automatically have people arrested for crimes that occur on camera, or any remote sensing medium for that matter. Gait recognition and face recognition are more than enough to identify an individual, even without voice or cell data. Luckily, deepfakes are here to save us all by completely invalidating any sort of media authenticity.


This seems like a great tool for my area of interest: counting different types of vehicles (including bicycles, scooters, mopeds, skateboards, light electric vehicles, passenger cars, commercial cars, and different sizes & types of trucks and buses) that move through an observed intersection!


Interesting application! We're working with a few people that monitor public streets in different ways so I can see this being useful to you as well! Feel free to reach out to me at mokshith@sievedata.com and we can chat more :)


Just pass this through OpenAI CLIP and you'd get semantic search without much effort. For example, this does it for YouTube videos: https://github.com/haltakov/natural-language-youtube-search
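
A rough sketch of that approach, mirroring the linked repo's idea rather than its exact code: embed sampled frames once with CLIP, then rank them against a free-text query.

    # pip install torch git+https://github.com/openai/CLIP.git
    import torch
    import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def embed_frames(frames):
        # frames: a list of PIL.Image objects sampled from the video.
        batch = torch.stack([preprocess(f) for f in frames]).to(device)
        with torch.no_grad():
            feats = model.encode_image(batch)
        return feats / feats.norm(dim=-1, keepdim=True)

    def search(frame_feats, query, top_k=5):
        # Rank all frame embeddings against one text query by cosine similarity.
        with torch.no_grad():
            text = model.encode_text(clip.tokenize([query]).to(device))
        text = text / text.norm(dim=-1, keepdim=True)
        scores = (frame_feats @ text.T).squeeze(1)
        return scores.topk(min(top_k, len(scores))).indices.tolist()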


Well, this depends.

If you're looking for generic things similar to what CLIP was trained on, this would work. But say you're interested in specific physical-security metrics, or monitoring defective parts, or specific things about traffic. CLIP might just say "people walking" or "car in intersection" or "part on conveyor belt", which isn't meaningful enough if all your images are exactly that, differing only in small details.

Another important aspect is the number of frames you need to process. Running CLIP on even 26 million frames (1 day of footage) is super expensive. We've built some infra that makes processing video efficient (forms of parallelization + filtering) without you having to think about it.


> Running CLIP on even 26 million frames (1 day of footage) is super expensive. We've built some infra that makes processing video efficient (forms of parallelization + filtering) without you having to think about it.

With your massive parallel infra, you're still processing 26 million frames, right?

> If you're looking for generic things similar to what CLIP was trained on, this would work. But say you're interested in specific physical-security metrics, or monitoring defective parts, or specific things about traffic. CLIP might just say "people walking" or "car in intersection" or "part on conveyor belt", which isn't meaningful enough if all your images are exactly that, differing only in small details.

That's exactly what transfer learning is for. I'm suggesting CLIP because you automatically get a pretrained model that can do much more than that. Imagine doing a search like "a person wearing a red cap and a yellow handbag walking toward the exit" or "a person wearing a shirt with 'mark' written on it". Can your system do that right now?


> With your massive parallel infra, you're still processing 26 million frames, right?

No, we're not. We first run a cheap filter, like a motion detector, over all of the video, which is inexpensive. We then stack other, more expensive filters on top of this depending on the use-case, and run the most expensive metadata-generating models only at the end. We also don't do this on every single frame; we can interpolate information using surrounding frames. Our parallel infra speeds this up further.
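
A hedged sketch of that cascade (not Sieve's actual pipeline): a cheap frame-differencing motion gate decides which sampled frames ever reach an expensive model; everything else is skipped or interpolated. The threshold and sampling stride are made-up knobs.

    import cv2

    MOTION_THRESHOLD = 5.0   # assumed cutoff: mean absolute pixel change
    SAMPLE_EVERY = 5         # only score every 5th frame; interpolate the rest

    def frames_worth_a_model(video_path):
        # Stage 1 of the cascade: yield only (index, frame) pairs with motion.
        cap = cv2.VideoCapture(video_path)
        prev, idx = None, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % SAMPLE_EVERY == 0:
                gray = cv2.GaussianBlur(
                    cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (21, 21), 0)
                if prev is not None and cv2.absdiff(prev, gray).mean() > MOTION_THRESHOLD:
                    # Stage 2 (an expensive detector / embedder) would run here.
                    yield idx, frame
                prev = gray
            idx += 1
        cap.release()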

> That's exactly what transfer learning is for. I'm suggesting CLIP because you automatically get a pretrained model that can do much more than that. Imagine doing a search like "a person wearing a red cap and a yellow handbag walking toward the exit" or "a person wearing a shirt with 'mark' written on it". Can your system do that right now?

The issue is that there are very few text-image pair datasets out there, and building a good one is difficult. We constantly use transfer learning in-house when working with customer data and typical classifier / detector models, but we haven't yet had success doing so with CLIP. Our system can't semantically search through video just yet; we're still exploring the most feasible ways to do it. There's some interesting work on this that we've been reading recently:

https://ddkang.github.io/papers/2022/tasti-paper.pdf

https://vcg.ece.ucr.edu/sites/g/files/rcwecm2661/files/2021-...


This is basically the exact same comment as the infamous 'why use dropbox when it's trivial to set up an FTP server?' comment.


The visual demo linked in the top comment gives "No samples to show".


Hey, try something like "person_count: 2". Some queries like "contrast: low" might not have any samples, but try a few yourself! I just tried it again and it works.


It would probably be helpful to have some kind of query prefilled in the demo, because somebody coming in fresh has no clue what to try.


You're right. Thanks for the feedback, we'll add this soon!


This would be so useful if it could connect to a Nest video stream.


Nest already provides some smart person / motion detection. I could definitely see personal home-security use-cases if you're building something DIY. Curious why you might prefer a system like this over what Nest provides. Is it the more granular metadata?


Nest's person / motion detection is a joke. Every day it thinks I have a new package in front of the door - it literally gets it wrong 100% of the time. It sees a person wearing brown... it must be a package... sees a person sitting on the stoop next door... must be a package. When it actually sees the mailman carrying a box towards the door... it tells me nothing.

I am sure the motion sensor is good for a backyard where there's little movement, but not for the front of the house where there's constant movement. I set up motion zones to let me know when there's someone in them, and it triggered every time someone was caught on camera, disregarding the zones I set. I got so many false alerts each day that I had to turn it off.

But honestly, I want more than what it offers, and I can't tap into it myself... for instance, if someone is trying to break into my house in the middle of the night, I need a phone call or a way to connect the camera to my alarm system. A simple phone notification won't wake me up.

It would also be useful to review every time the camera saw a person in the middle of the night, when it matters most... not during the day, when dozens of people walk in front of it.

I also need to be able to distinguish between a person detected far away, like across the street, and someone right in front of the camera, filling most of the frame.


Wyze is not quite that bad for me, but it could use some work. It's very twitchy about motion (so lots of notifications on a windy day for waving branches and leaves), and lackluster on person/package detection. In both cases fine-tuning by setting only a portion of the field of view for monitoring did improve the hit/miss ratio quite a bit.


Privacy. I don’t want anyone outside of my network seeing any of my videos. If I’m using a cloud provider, then I’ve already given that up. If I’m using a tool like Shinobi on personal streaming video cameras, then I don’t want to have to upload my video to your servers.


So cool, great work! Any plans for a "real-time" API? Something like "notify me if people are climbing the fence"?


Thanks! We don't currently have plans to make it real-time, but if there are enough use-cases here, we'll definitely think more about it. Other than security, where do you see this being useful?

I'm thinking possibly a live trend analysis / alert sort of thing might be useful if, say, we detect too much motion, too many people, or some other variable that we determine to be "out of whack".


Might be useful for any sort of live-stream service (Twitch, Reddit Public Access Network, etc.). For example, alert when nudity is detected.



