Functions to get all Darwin cut notes based on image dimensions, in Python and Spark for efficient parallel processing

HackTheStacks/darwin-image-preprocessing

darwin-image-preprocessing

Functions to extract all Darwin cut notes based on image dimensions and discard full-page notes (non-cut notes). They work by comparing each image's dimensions to the mean image dimensions within its folder. Written in PySpark for efficient parallel processing, since the dataset is ~350 GB and ~60k images.
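
The folder-mean comparison described above can be sketched in plain Python. This is an illustrative sketch, not the repository's code: the function name `find_cut_notes`, the `(folder, filename, width, height)` record layout, and the `ratio` threshold are all assumptions, as is the rule that an image counts as a cut note when its width or height falls below `ratio` times the folder mean.

```python
from collections import defaultdict
from statistics import mean


def find_cut_notes(records, ratio=0.8):
    """Return (folder, filename) pairs judged to be cut notes.

    records: iterable of (folder, filename, width, height) tuples.
    ratio:   hypothetical threshold; an image is flagged when its width
             or height is below ratio * the mean for its folder.
    """
    # Group image dimensions by the folder they live in.
    by_folder = defaultdict(list)
    for folder, name, width, height in records:
        by_folder[folder].append((name, width, height))

    cut_notes = []
    for folder, items in by_folder.items():
        # Mean dimensions are computed per folder, as the README describes.
        mean_w = mean(w for _, w, _ in items)
        mean_h = mean(h for _, _, h in items)
        for name, w, h in items:
            if w < ratio * mean_w or h < ratio * mean_h:
                cut_notes.append((folder, name))
    return cut_notes


if __name__ == "__main__":
    sample = [
        ("a", "page1.jpg", 1000, 1400),
        ("a", "page2.jpg", 1000, 1400),
        ("a", "cut1.jpg", 400, 300),
    ]
    print(find_cut_notes(sample))
```

Because each folder is processed independently, the per-folder step is the natural unit to distribute in a Spark job, for example by grouping records on the folder key. Note that a mean-based threshold is sensitive to folders that contain mostly cut notes, so `ratio` would likely need tuning against real data.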
