Content:

ESSI – Earth & Space Science Informatics

ESSI1.2 – Addressing Training Data Challenges to Accelerate Earth Science Machine Learning

EGU21-12065 | vPICO presentations | ESSI1.2

Introducing AIDE: a Software Suite for Annotating Images with Deep and Active Learning Assistance

Benjamin Kellenberger, Devis Tuia, and Dan Morris

Ecological research like wildlife censuses increasingly relies on data at the terabyte scale. For example, modern camera trap datasets contain millions of images that require prohibitive amounts of manual labour to be annotated with species, bounding boxes, and the like. Machine learning, especially deep learning [3], could greatly accelerate this task through automated predictions, but involves extensive coding and expert knowledge.

In this abstract we present AIDE, the Annotation Interface for Data-driven Ecology [2]. First, AIDE is a web-based annotation suite for image labelling with support for concurrent access and scalability up to the cloud. Second, it tightly integrates deep learning models into the annotation process through active learning [7]: models learn from user-provided labels and in turn select the most relevant images for review from the large pool of unlabelled ones (Fig. 1). The result is a system where users only need to label what is required, which saves time and decreases errors due to fatigue.

Fig. 1: AIDE offers concurrent web image labelling support and uses annotations and deep learning models in an active learning loop.

AIDE includes a comprehensive set of built-in models, such as ResNet [1] for image classification, Faster R-CNN [5] and RetinaNet [4] for object detection, and U-Net [6] for semantic segmentation. All models can be customised and used without having to write a single line of code. Furthermore, AIDE accepts any third-party model with minimal implementation requirements. To complete the package, AIDE offers evaluation of both user annotations and model predictions, access control, customisable model training, and more, all through the web browser.

AIDE is fully open source and available at https://github.com/microsoft/aerial_wildlife_detection.

 


How to cite: Kellenberger, B., Tuia, D., and Morris, D.: Introducing AIDE: a Software Suite for Annotating Images with Deep and Active Learning Assistance, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-12065, https://doi.org/10.5194/egusphere-egu21-12065, 2021.

EGU21-6853 | vPICO presentations | ESSI1.2

Curator: A No-Code Self-Supervised Learning and Active Labeling Tool to Create Labeled Image Datasets from Petabyte-Scale Imagery

Rudy Venguswamy, Mike Levy, Anirudh Koul, Satyarth Praveen, Tarun Narayanan, Ajay Krishnan, Jenessa Peterson, Siddha Ganju, and Meher Kasam

Machine learning modeling for Earth events at NASA is often limited by the availability of labeled examples. For example, training classifiers for forest fires or oil spills from satellite imagery requires curating a massive and diverse dataset of example forest fires, a tedious multi-month effort that involves careful review of over 196.9 million square miles of data per day for 20 years. While such images might exist in abundance within 40 petabytes of unlabeled satellite data, finding these positive examples to include in a training dataset for a machine learning model is extremely time-consuming and requires researchers to "hunt" for positive examples, like finding a needle in a haystack.

We present a no-code open-source tool, Curator, whose goal is to minimize the amount of manual image labeling needed to achieve a state-of-the-art classifier. The pipeline, purpose-built to take advantage of the massive amount of unlabeled images, consists of (1) self-supervised training to convert unlabeled images into meaningful representations, (2) search-by-example to collect a seed set of images, and (3) human-in-the-loop active learning to iteratively ask for labels on uncertain examples and train on them.

In step 1, a model capable of representing unlabeled images meaningfully is trained with a self-supervised algorithm (like SimCLR) on a random subset of the dataset (one that conforms to researchers’ specified “training budget”). Since real-world datasets are often imbalanced, leading to suboptimal models, the initial model is used to generate embeddings on the entire dataset. Then, images with equidistant embeddings are sampled. This iterative training and resampling strategy improves both the balance of the training data and the model at every iteration. In step 2, researchers supply an example image of interest, and the embedding generated from this image is used to find other images with embeddings near the reference image’s embedding in Euclidean space (hence images similar in appearance to the query image). These candidate images contain a higher density of positive examples and are annotated manually as a seed set. In step 3, the seed labels are used to train a classifier that identifies further candidate images for human inspection with active learning. In each classification training loop, candidate images for labeling are sampled from the larger unlabeled dataset based on the images the model is most uncertain about (p ≈ 0.5).
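To make steps 2 and 3 concrete, below is a minimal Python sketch of embedding-based search-by-example and uncertainty sampling. It illustrates the general technique rather than Curator's actual code; the array shapes, the k and budget parameters, and the brute-force NumPy search are all illustrative assumptions.

```python
import numpy as np

def search_by_example(query_emb, embeddings, k=100):
    """Step 2: return the indices of the k images whose embeddings lie
    closest to the query embedding in Euclidean space."""
    d = np.linalg.norm(embeddings - query_emb, axis=1)
    return np.argsort(d)[:k]

def uncertainty_sample(probs, budget=50):
    """Step 3: pick the unlabeled images the classifier is least sure
    about, i.e. with predicted probability closest to 0.5."""
    return np.argsort(np.abs(probs - 0.5))[:budget]

# Hypothetical usage: 40k unlabeled images with 128-d embeddings
rng = np.random.default_rng(0)
emb = rng.normal(size=(40_000, 128))
seed_candidates = search_by_example(emb[0], emb, k=100)  # annotate these as the seed set
probs = rng.uniform(size=40_000)                         # stand-in classifier outputs
to_label_next = uncertainty_sample(probs, budget=50)     # next active learning round
```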

Curator is released as an open-source package built on PyTorch Lightning. The pipeline uses GPU-based transforms from the NVIDIA DALI package for augmentation, leading to a 5-10x speed-up in self-supervised training, and is run from the command line.

By iteratively training a self-supervised model and a classifier in tandem with human manual annotation, this pipeline is able to unearth more positive examples from severely imbalanced datasets that were previously untrainable with self-supervision algorithms. In applications such as detecting wildfires or atmospheric dust, or turning outward with telescopic surveys, increasing the number of positive candidates presented to humans for manual inspection increases the efficacy of classifiers and multiplies the efficiency of researchers’ data curation efforts.

How to cite: Venguswamy, R., Levy, M., Koul, A., Praveen, S., Narayanan, T., Krishnan, A., Peterson, J., Ganju, S., and Kasam, M.: Curator: A No-Code Self-Supervised Learning and Active Labeling Tool to Create Labeled Image Datasets from Petabyte-Scale Imagery, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-6853, https://doi.org/10.5194/egusphere-egu21-6853, 2021.

EGU21-16326 | vPICO presentations | ESSI1.2

Programmatic Labeling of Dark Data for Artificial Intelligence in Spatial Informatics

J. Meil

The data preparation process generally consumes up to 80% of a data scientist's time, with 60% of that attributed to cleaning and labeling data [1]. Our solution is to use automated pipelines to prepare, annotate, and catalog data. The first step upon ingestion, especially in the case of real-world, unstructured and unlabeled datasets, is to leverage Snorkel, a tool specifically designed around a paradigm to rapidly create, manage, and model training data. Configured properly, Snorkel can be leveraged to temper this labeling bottleneck through a process called weak supervision. Weak supervision uses programmatic labeling functions (heuristics, distant supervision, SME input, or knowledge bases) scripted in Python to generate "noisy labels". The functions traverse the entirety of the dataset and feed the labeled data into a generative (conditionally probabilistic) model. The function of this model is to output the distribution of each response variable and predict the conditional probability based on a joint probability distribution algorithm. This is done by comparing the various labeling functions and the degree to which their outputs are congruent with each other. A single labeling function that has a high degree of congruence with other labeling functions will have a high learned accuracy, that is, the fraction of predictions that the model got right. Conversely, labeling functions that have a low degree of congruence with other functions will have low learned accuracy. The predictions are then combined, weighted by estimated accuracy, whereby the predictions of functions with higher learned accuracy are counted multiple times. The result is a transformation from a binary classification of 0 or 1 to a fuzzy label between 0 and 1: there is probability "x" that, based on heuristic "n", the response variable is "y". As data are added to this generative model, multi-class inference is made on the response variables (positive, negative, or abstain), assigning probabilistic labels to potentially millions of data points. Thus, we have generated a discriminative ground truth for all further labeling efforts and have improved the scalability of our models. The labeling functions can then be applied to unlabeled data to further machine learning efforts.
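As an illustration of this weak-supervision workflow, the following sketch uses Snorkel's labeling-function API to combine two toy heuristics into probabilistic labels. The heuristics, column names, and the KNOWN_EVENT_SITES lookup are hypothetical; only the Snorkel calls (labeling_function, PandasLFApplier, LabelModel) reflect the library's actual interface.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1
KNOWN_EVENT_SITES = {42}  # hypothetical knowledge base of confirmed events

@labeling_function()
def lf_keyword(x):
    # heuristic: flag records whose free text mentions a spill
    return POSITIVE if "spill" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_knowledge_base(x):
    # distant supervision: look the site up in a known-events table
    return POSITIVE if x.site_id in KNOWN_EVENT_SITES else NEGATIVE

df = pd.DataFrame({
    "text": ["possible oil spill near platform", "calm open water", "spill residue visible"],
    "site_id": [42, 7, 13],
})
applier = PandasLFApplier([lf_keyword, lf_knowledge_base])
L_train = applier.apply(df)  # noisy label matrix: one column per labeling function

# The generative label model weighs each function by its congruence with the
# others and emits fuzzy labels in [0, 1] rather than hard 0/1 decisions.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=123)
print(label_model.predict_proba(L_train))
```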
 
Once our datasets are labeled and a ground truth is established, we need to persist the data into our Delta Lake, since it combines the most performant aspects of a data warehouse with the low-cost storage of data lakes. In addition, the lake can accept unstructured, semi-structured, or structured data sources, and those sources can be further aggregated into raw-ingestion, cleaned, and feature-engineered data layers. By sectioning off the data sources into these "layers", the data engineering portion is abstracted away from the data scientist, who can access model-ready data at any time. Data can be ingested via batch or stream.
 
The design of the entire ecosystem is to eliminate as much technical debt in machine learning paradigms as possible in terms of configuration, data collection, verification, governance, extraction, analytics, process management, resource management, infrastructure, monitoring, and post-verification.

How to cite: Meil, J.: Programmatic Labeling of Dark Data for Artificial Intelligence in Spatial Informatics, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-16326, https://doi.org/10.5194/egusphere-egu21-16326, 2021.

EGU21-15297 | vPICO presentations | ESSI1.2

Collecting training data to map forest management at global scale

Myroslava Lesiv, Dmitry Schepaschenko, Martina Dürauer, Marcel Buchhorn, Ivelina Georgieva, and Steffen Fritz

Spatially explicit information on forest management at a global scale is critical for understanding the current status of forests for sustainable forest management and restoration. Whereas remote sensing-based datasets, developed by applying ML and AI algorithms, can successfully depict tree cover and other land cover types, they have not yet been used to depict untouched forest and different degrees of forest management. We show for the first time that, with sufficient training data derived from very high-resolution imagery, a differentiation within the tree cover class of various levels of forest management is possible.

In this session, we would like to present our approach for labeling forest-related training data by using the Geo-Wiki application (https://www.geo-wiki.org/). Moreover, we would like to share a new open global training data set on forest management that we collected from a series of Geo-Wiki campaigns. In February 2019, we organized an expert workshop to (1) discuss the variety of forest management practices that take place in different parts of the world; (2) generalize the definitions for application at global scale; (3) finalize the Geo-Wiki interface for the crowdsourcing campaigns; and (4) build a data set of control points (the expert data set), which we used later to monitor the quality of the crowdsourced contributions by the volunteers. We involved forest experts from different regions around the world to explore what types of forest management information could be collected from visual interpretation of very high-resolution images from Google Maps and Microsoft Bing, in combination with Sentinel time series and Normalized Difference Vegetation Index (NDVI) profiles derived from Google Earth Engine (GEE). Based on the results of this analysis, we expanded these campaigns by involving a broader group of participants, mainly people recruited from remote sensing, geography, and forest research institutes and universities.

In total, we collected forest data for approximately 230,000 locations globally. These data are of sufficient density and quality and can therefore be used in many ML and AI applications for forests at regional and local scales. We also provide an example ML application: a remote sensing-based global forest management map at 100 m resolution (PROBA-V) for the year 2015. It includes such classes as intact forests, forests with signs of human impact (including clear cuts and logging), replanted forest, woody plantations with a rotation period of up to 15 years, oil palms, and agroforestry. The results of independent statistical validation show that the map's overall accuracy is 81%.

How to cite: Lesiv, M., Schepaschenko, D., Dürauer, M., Buchhorn, M., Georgieva, I., and Fritz, S.: Collecting training data to map forest management at global scale, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-15297, https://doi.org/10.5194/egusphere-egu21-15297, 2021.

EGU21-10347 | vPICO presentations | ESSI1.2

gprMax: An Open Source Electromagnetic Simulator for Generating Big Data for Ground Penetrating Radar Applications

Craig Warren, Iraklis Giannakis, and Antonios Giannopoulos

Lack of well-labelled and coherent training data is the main reason why machine learning (ML) and data-driven interpretations are not established in the field of Ground-Penetrating Radar (GPR). Non-representative and limited datasets lead to unreliable ML schemes that overfit and are unable to compete with traditional deterministic approaches. To that end, numerical data can potentially complement insufficient measured datasets and overcome this lack of data, even in the presence of large feature spaces.

Using synthetic data in ML is not new, and it has been extensively applied in computer vision. Applying numerical data in ML requires a numerical framework capable of generating synthetic but nonetheless realistic datasets. For GPR, such a framework is possible using gprMax, an open-source electromagnetic solver fine-tuned for GPR applications [1], [2], [3]. gprMax is fully parallelised and can be run using multiple CPUs and GPUs. In addition, it has a flexible scriptable format that makes it easy to generate big data in a trivial manner. Stochastic geometries, realistic soils, vegetation, targets [3], and models of commercial antennas [4], [5] are some of the features that can easily be incorporated in the training data.
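The scriptable input format lends itself to dataset-generation loops like the following Python sketch, which writes randomized input files for a buried-rebar scenario. The command names follow gprMax's documented input syntax, but the geometry, material values, and file layout are simplified illustrations rather than a validated model.

```python
import random
from pathlib import Path

def write_model(path, depth, radius):
    """Write one gprMax input file with a single rebar (a PEC cylinder)
    buried at the given cover depth in a 2D concrete slab model."""
    path.write_text(f"""\
#domain: 0.5 0.4 0.002
#dx_dy_dz: 0.002 0.002 0.002
#time_window: 12e-9
#material: 6 0.01 1 0 concrete
#waveform: ricker 1 1.5e9 src_wave
#hertzian_dipole: z 0.10 0.35 0 src_wave
#rx: 0.14 0.35 0
#box: 0 0 0 0.5 0.30 0.002 concrete
#cylinder: 0.25 {0.30 - depth} 0 0.25 {0.30 - depth} 0.002 {radius} pec
""")

random.seed(0)
for i in range(1000):                      # 1000 random rebar scenarios for training
    depth = random.uniform(0.02, 0.15)     # cover depth in metres (illustrative range)
    radius = random.uniform(0.004, 0.016)  # rebar radius in metres (illustrative range)
    write_model(Path(f"rebar_{i:04d}.in"), depth, radius)
# each file can then be simulated with: python -m gprMax rebar_0000.in
```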

The capability of gprMax to generate realistic numerical datasets is demonstrated in [6], [7]. The investigated problem is assessing the depth and diameter of rebars in reinforced concrete. Estimating the diameter of rebars using GPR is particularly challenging, with no conclusive solution. Using a synthetic training set generated with gprMax, we managed to effectively train ML schemes capable of estimating the diameter of rebar in an accurate and efficient manner [6], [7]. The aforementioned case studies support the premise that gprMax has the potential to provide realistic training data for applications where well-labelled data are not available, such as landmine detection, non-destructive testing, and planetary sciences.

References

[1] Warren, C., Giannopoulos, A. & Giannakis, I., (2016). gprMax: Open Source software to simulate electromagnetic wave propagation for Ground Penetrating Radar, Computer Physics Communications, 209, 163-170.

[2] Warren, C., Giannopoulos, A., Gray, A., Giannakis, I., Patterson, A., Wetter, L. & Hamrah, A., (2018). A CUDA-based GPU engine for gprMax: Open source FDTD, electromagnetic simulation software. Computer Physics Communications, 237, 208-218.

[3] Giannakis, I., Giannopoulos, A. & Warren, C. (2016). A realistic FDTD numerical modeling framework of Ground Penetrating Radar for landmine detection. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. 9(1), 37-51.

[4] Giannakis, I., Giannopoulos, A. & Warren, C., (2018). Realistic FDTD GPR antenna models optimized using a novel linear/non-linear full waveform inversion. IEEE Transactions on Geoscience and Remote Sensing, 207(3), 1768-1778.

[5] Warren, C. & Giannopoulos, A. (2011). Creating finite-difference time-domain models of commercial ground-penetrating radar antennas using Taguchi's optimization method. Geophysics, 76(2), G37-G47.

[6] Giannakis, I., Giannopoulos, A. & Warren, C. (2021). A Machine Learning Scheme for Estimating the Diameter of Reinforcing Bars Using Ground Penetrating Radar. IEEE Geoscience and Remote Sensing Letters.

[7] Giannakis, I., Giannopoulos, A., & Warren, C. (2019). A machine learning-based fast-forward solver for ground penetrating radar with application to full-waveform inversion. IEEE Transactions on Geoscience and Remote Sensing. 57(7), 4417-4426.

How to cite: Warren, C., Giannakis, I., and Giannopoulos, A.: gprMax: An Open Source Electromagnetic Simulator for Generating Big Data for Ground Penetrating Radar Applications, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-10347, https://doi.org/10.5194/egusphere-egu21-10347, 2021.

EGU21-6490 | vPICO presentations | ESSI1.2 | Highlight

It's a Bird it's a Plane it's a Meteor

Surya Ambardar, Siddha Ganju, and Peter Jenniskens

Meteor showers are some of the most dazzling and memorable events occurring in the night sky. Caused by bits of celestial debris from comets and asteroids entering Earth's atmosphere at astronomical speeds, meteors are bright streaks of light in the night sky, sometimes called shooting stars. These meteors are recorded, tracked, and triangulated by low-light surveillance cameras in a project called CAMS: Cameras for Allsky Meteor Surveillance. CAMS offers insights into a universe of otherwise invisible solar system bodies, but that task has proven difficult due to the lack of automated supervision. Until recently, much of the data control was done by hand. Labeled training data, necessary to build supervised classification models, are essential because other man-made objects such as airplanes and satellites can be mistaken for meteors. To address this issue, we leverage one year's worth of meteor activity data from CAMS to provide weak supervision for over a decade of collected data, drastically reducing the amount of manual annotation necessary and expanding the available labelled meteor training data.

 

Founded in 2010, CAMS aims to automate video surveillance of the night sky to validate the International Astronomical Union's Working List of Meteor Showers, discover new meteor showers, and predict future meteor showers. Since 2010, CAMS has collected a decade's worth of night-sky activity data in the form of astrometric tracks and brightness profiles, a year of which has been manually annotated. We utilize this one year of labelled data to train a high-confidence LSTM meteor classifier to generate low-confidence labels for the remaining decade's worth of meteor data. Our classifier yields confidence levels for each prediction, and when the confidence lies above a statistically significant threshold, predicted labels can be treated as weak supervision for future training runs. Remaining predictions below the threshold can be manually annotated. Using a high threshold minimizes label noise and ensures instances are correctly labeled while considerably reducing the amount of data that needs to be annotated. Weak supervision can be confirmed by checking date ranges and data distributions for known meteor showers to verify predicted labels.
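A minimal sketch of this thresholding logic: predictions whose confidence clears a high threshold become weak labels, while the rest are queued for manual annotation. The threshold value and array shapes are illustrative; the actual CAMS pipeline operates on LSTM outputs over astrometric tracks.

```python
import numpy as np

def weak_labels(probs, threshold=0.95):
    """Split binary predictions into high-confidence weak labels and a
    remainder queued for manual annotation."""
    confidence = np.maximum(probs, 1.0 - probs)  # distance from the decision boundary
    accept = confidence >= threshold
    labels = (probs >= 0.5).astype(int)
    return labels[accept], np.where(~accept)[0]  # weak labels, indices to review

# Hypothetical classifier outputs for a decade of unlabeled tracks
rng = np.random.default_rng(1)
probs = rng.uniform(size=1_000_000)
pseudo, review_idx = weak_labels(probs, threshold=0.95)
print(f"{len(pseudo)} weak labels, {len(review_idx)} tracks left for manual review")
```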

 

To encourage discovery and distribution of training data and models, we additionally provide scripts to automate data ingestion and model training from raw camera data files. The data scripts handle processing of CAMS data, providing a pipeline to encourage open sharing and reproduction of our research. Additionally, we provide code for an LSTM classifier baseline model which can identify probable meteors. This baseline model script allows further exploration of CAMS data and an opportunity to experiment with other model types.

 

In conclusion, our contributions are (1) a weak supervision method utilizing a year's worth of labelled CAMS data to generate labels for a decade's worth of data, along with (2) baseline data processing and model scripts to encourage open discovery and distribution. Our unique contributions expand access to labeled meteor training data and make the data globally and publicly accessible through daily generated maps of meteor shower activity posted at http://cams.seti.org/FDL/.

How to cite: Ambardar, S., Ganju, S., and Jenniskens, P.: It's a Bird it's a Plane it's a Meteor, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-6490, https://doi.org/10.5194/egusphere-egu21-6490, 2021.

EGU21-11957 | vPICO presentations | ESSI1.2

Conditional spatio-temporal random crop for weak labeled SAR datasets

Francesco Asaro, Gianluca Murdaca, and Claudio Maria Prati

This work presents a methodology to improve supervised learning of segmentation tasks for convolutional architectures in the unbalanced and weakly labeled synthetic aperture radar (SAR) dataset scenarios that characterize the Earth Observation (EO) domain. The presented methodology exploits multitemporality and stochasticity to regularize training by reducing overfitting, thus improving validation and test performance.

Traditional precisely annotated datasets are made of patches extracted from a set of image-label pairs, often in a deterministic fashion. Through a set of experiments, we show that this approach is sub-optimal when using weak labels, since it leads to early overfitting, mainly because weak labels only mark the simplest features of the target class.

The presented methodology builds up the dataset from a multitemporal stack of images aligned with the weakly labeled ground truth and samples the patches both in time and space. The patches are selected only if a given condition on the positive-class frequency is met. We show learning improvements over the traditional methodology by applying our strategy to a benchmark task, which consists of training a typical deep convolutional network, U-Net (Ronneberger et al., 2015), for the segmentation of water surfaces in SAR images.
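A minimal sketch of such a conditional spatio-temporal random crop, assuming a (time, height, width) image stack with a single time-invariant weak label map; the patch size and positive-class threshold are illustrative.

```python
import numpy as np

def conditional_crop(stack, labels, size=256, min_pos=0.05, rng=None):
    """Draw a random patch in time and space, accepting it only if the
    positive-class (water) frequency in the weak label meets the threshold."""
    rng = rng or np.random.default_rng()
    T, H, W = stack.shape
    while True:
        t = rng.integers(T)               # temporal sample from the stack
        y = rng.integers(H - size)        # spatial sample
        x = rng.integers(W - size)
        lab = labels[y:y+size, x:x+size]  # weak labels are time-invariant
        if lab.mean() >= min_pos:         # conditional selection rule
            return stack[t, y:y+size, x:x+size], lab

rng = np.random.default_rng(0)
stack = rng.normal(size=(8, 1024, 1024))  # 8-image multitemporal SAR stack (toy data)
labels = rng.random((1024, 1024)) < 0.03  # sparse weak water mask (toy data)
patch, lab = conditional_crop(stack, labels, size=256, min_pos=0.01, rng=rng)
```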

The dataset sources are Sentinel-1 calibrated sigma-nought, VV-VH polarized, single-look intensity images for the inputs, and Copernicus's "Water and Wetness High Resolution Layer" for the weak labels. To avoid spatial autocorrelation phenomena, the training set covers the Low Countries (Belgium, the Netherlands, and Luxembourg), while the validation and test sets span the Padana plain (Italy). The training dataset is built up according to the presented methodology, while the validation and test datasets are defined in the usual deterministic fashion.

We show the beneficial effects of multitemporality, stochasticity, and conditional selection in three different sets of experiments, as well as in a combined one. In particular, we observe performance improvements in terms of the F1 score, which increases together with the degree of multitemporality (number of images in the stack), as well as when stochasticity and conditional rules that compensate for the under-representation of the positive class are added. Furthermore, we show that in the specific framework of SAR data, the introduction of multitemporality improves the learned representation of the speckle, thus implicitly optimizing the U-Net for both the filtering and segmentation tasks. We prove this by comparing the number of looks of the input patch to that of the patch reconstructed before the classification layer.

Overall, in this framework, we show that solely by using the presented training strategy, the classifier's performance improves by up to 5% in terms of the F1 score.

How to cite: Asaro, F., Murdaca, G., and Prati, C. M.: Conditional spatio-temporal random crop for weak labeled SAR datasets, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-11957, https://doi.org/10.5194/egusphere-egu21-11957, 2021.

EGU21-1762 | vPICO presentations | ESSI1.2

RainBench: Enabling Data-Driven Precipitation Forecasting on a Global Scale

Christian Schroeder de Witt, Catherine Tong, Valentina Zantedeschi, Daniele De Martini, Alfredo Kalaitzis, Matthew Chantry, Duncan Watson-Parris, and Piotr Bilinski

Climate change is expected to aggravate extreme precipitation events, directly impacting the livelihood of millions. Without a global precipitation forecasting system in place, many regions – especially those constrained in resources to collect expensive ground station data – are left behind. To mitigate such unequal reach of climate change, one solution is to alleviate the reliance on numerical models (and by extension ground station data) by enabling machine-learning-based global forecasts from satellite imagery. Though prior work exists on regional precipitation nowcasting, work on global, medium-term precipitation forecasting is lacking. Importantly, a common, accessible baseline for meaningful comparison is absent. In this work, we present RainBench, a multi-modal benchmark dataset dedicated to advancing global precipitation forecasting. We establish baseline tasks and release PyRain, a data-handling pipeline that enables efficient processing of decades' worth of data by any modeling framework. Whilst our work serves as the basis for a new chapter on global precipitation forecasting from satellite imagery, the greater promise lies in the community joining forces to use our released datasets and tools in developing machine learning approaches to tackle this important challenge.

How to cite: Schroeder de Witt, C., Tong, C., Zantedeschi, V., De Martini, D., Kalaitzis, A., Chantry, M., Watson-Parris, D., and Bilinski, P.: RainBench: Enabling Data-Driven Precipitation Forecasting on a Global Scale, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-1762, https://doi.org/10.5194/egusphere-egu21-1762, 2021.

EGU21-4683 | vPICO presentations | ESSI1.2

Improved Training for Machine Learning: The Additional Potential of Innovative Algorithmic Approaches.

Octavian Dumitru, Gottfried Schwarz, Mihai Datcu, Dongyang Ao, Zhongling Huang, and Mila Stillman

During the last years, much progress has been made with machine learning algorithms. Typical application fields of machine learning include many technical and commercial applications as well as Earth science analyses, where most often indirect and distorted detector data have to be converted into well-calibrated scientific data that are a prerequisite for a correct understanding of the desired physical quantities and their relationships.

However, the provision of sufficient calibrated data is not enough for the testing, training, and routine processing of most machine learning applications. In principle, one also needs a clear strategy for the selection of necessary and useful training data and an easily understandable quality control of the finally desired parameters.

At first glance, one could guess that this problem can be solved by a careful selection of representative test data covering many typical cases as well as some counterexamples. These test data can then be used for training the internal parameters of a machine learning application. At second glance, however, many researchers have found that a simple stacking-up of plain examples is not the best choice for many scientific applications.

To get improved machine learning results, we concentrated on the analysis of satellite images depicting the Earth’s surface under various conditions such as the selected instrument type, spectral bands, and spatial resolution. In our case, such data are routinely provided by the freely accessible European Sentinel satellite products (e.g., Sentinel-1, and Sentinel-2). Our basic work then included investigations of how some additional processing steps – to be linked with the selected training data – can provide better machine learning results.

To this end, we analysed and compared three different approaches to identify machine learning strategies for the joint selection and processing of training data for our Earth observation images:

  • One can optimize the training data selection by adapting it to the specific instrument, target, and application characteristics [1].
  • As an alternative, one can dynamically generate new training parameters by Generative Adversarial Networks. This is comparable to the role of a sparring partner in boxing [2].
  • One can also use a hybrid semi-supervised approach for Synthetic Aperture Radar images with limited labelled data. The method is split into: polarimetric scattering classification, topic modelling for scattering labels, unsupervised constraint learning, and supervised label prediction with constraints [3].

We applied these strategies in the ExtremeEarth sea-ice monitoring project (http://earthanalytics.eu/). As a result, we can demonstrate for which application cases these three strategies will provide a promising alternative to a simple conventional selection of available training data.

[1] C.O. Dumitru et al., "Understanding Satellite Images: A Data Mining Module for Sentinel Images", Big Earth Data, 2020, 4(4), pp. 367-408.

[2] D. Ao et al., "Dialectical GAN for SAR Image Translation: From Sentinel-1 to TerraSAR-X", Remote Sensing, 2018, 10(10), pp. 1-23.

[3] Z. Huang et al., "HDEC-TFA: An Unsupervised Learning Approach for Discovering Physical Scattering Properties of Single-Polarized SAR Images", IEEE Transactions on Geoscience and Remote Sensing, 2020, pp. 1-18.

How to cite: Dumitru, O., Schwarz, G., Datcu, M., Ao, D., Huang, Z., and Stillman, M.: Improved Training for Machine Learning: The Additional Potential of Innovative Algorithmic Approaches., EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-4683, https://doi.org/10.5194/egusphere-egu21-4683, 2021.

EGU21-12384 | vPICO presentations | ESSI1.2

AI-Ready Training Datasets for Earth Observation: Enabling FAIR data principles for EO training data.

Alastair McKinstry, Oisin Boydell, Quan Le, Inder Preet, Jennifer Hanafin, Manuel Fernandez, Adam Warde, Venkatesh Kannan, and Patrick Griffiths

The ESA-funded AIREO project [1] sets out to produce AI-ready training dataset specifications and best practices to support the training and development of machine learning models on Earth Observation (EO) data. While the quality and quantity of EO data have increased drastically over the past decades, the availability of training data for machine learning applications is considered a major bottleneck. The goal is to move towards implementing FAIR data principles for training data in EO, enhancing especially the findability, interoperability, and reusability aspects. To achieve this goal, AIREO sets out to provide a training data specification and to develop best practices for the use of training datasets in EO. An additional goal is to make training datasets self-explanatory ("AI-ready") in order to expose challenging problems to a wider audience that does not have expert geospatial knowledge.

Key elements addressed in the AIREO specification are granular and interoperable metadata (based on STAC), innovative quality-assurance metrics, data provenance and processing history, as well as integrated feature-engineering recipes that optimize platform independence. Several initial pilot datasets are being developed following the AIREO data specifications. These pilot applications include, for example, forest biomass, sea-ice detection, and the estimation of atmospheric parameters. An API for the easy exploitation of these datasets will be provided, allowing the training datasets (TDS) to work against EO catalogs (based on OGC STAC catalogs and best practices from the ML community) and enabling updates and repeated model training over time.
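As a sketch of what such granular, interoperable metadata could look like, here is a pared-down, STAC-style item for a single training patch, built as a plain Python dictionary. The core fields (type, id, geometry, bbox, properties, assets) follow the STAC item layout; the "aireo:"-prefixed fields and all values are hypothetical placeholders, not the actual AIREO specification.

```python
from datetime import datetime, timezone
import json

item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "tds-biomass-patch-000123",
    "geometry": {"type": "Point", "coordinates": [8.54, 47.37]},
    "bbox": [8.53, 47.36, 8.55, 47.38],
    "properties": {
        "datetime": datetime(2018, 6, 1, tzinfo=timezone.utc).isoformat(),
        "aireo:quality": {"label_confidence": 0.9},    # hypothetical QA metric
        "aireo:provenance": "Sentinel-2 L2A, labels v1",  # hypothetical lineage note
    },
    "assets": {
        "image": {"href": "s3://bucket/patch123.tif", "type": "image/tiff"},
        "labels": {"href": "s3://bucket/patch123_labels.tif", "type": "image/tiff"},
    },
}
print(json.dumps(item, indent=2))
```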

 

This presentation will introduce the first version of the AIREO training dataset specification and showcase some elements of the best practices that were developed. The AIREO-compliant pilot datasets, which are openly accessible, will be presented, and community feedback is explicitly encouraged.



[1] https://aireo.net/

How to cite: McKinstry, A., Boydell, O., Le, Q., Preet, I., Hanafin, J., Fernandez, M., Warde, A., Kannan, V., and Griffiths, P.: AI-Ready Training Datasets for Earth Observation: Enabling FAIR data principles for EO training data., EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-12384, https://doi.org/10.5194/egusphere-egu21-12384, 2021.

ESSI1.6 – Spatio-temporal data science: advances in computational geosciences and innovative evaluation tools for weather and climate science

EGU21-13357 | vPICO presentations | ESSI1.6

Overcoming challenges in spatio-temporal modelling of large-scale (global) data

Aoibheann Brady, Jonathan Rougier, Yann Ziegler, Bramha Dutt Vishwakarma, Sam Royston, Stephen Chuter, Richard Westaway, and Jonathan Bamber

Modelling spatio-temporal data on a large scale presents a number of obstacles for statisticians and environmental scientists. Issues such as computational complexity, combining point and areal data, separation of sources into their component processes, and the handling of both large volumes of data in some areas and sparse data in others must be considered. We discuss methods to overcome such challenges within a Bayesian hierarchical modelling framework using INLA.

In particular, we illustrate the approach using the example of source separation of geophysical signals on both a continental and a global scale. In such a setting, data tend to be available at both a local and an areal level. We propose a novel approach for integrating such sources together using the INLA-SPDE method, which is normally reserved for point-level data. Additionally, the geophysical processes involved are both spatial (time-invariant) and spatio-temporal in nature. Separation of such processes into physically sensible components requires careful modelling and consideration of priors (such as physical model outputs where data are sparse), which will be discussed. We also consider methods to overcome the computational costs of modelling on such a large scale, from efficient mesh design, to thinning/aggregating of data, to considering alternative approaches for inference. This holistic approach to the modelling of large-scale data ensures that spatial and spatio-temporal processes can be sensibly separated into their component parts, without being prohibitively expensive to model.

How to cite: Brady, A., Rougier, J., Ziegler, Y., Vishwakarma, B. D., Royston, S., Chuter, S., Westaway, R., and Bamber, J.: Overcoming challenges in spatio-temporal modelling of large-scale (global) data, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-13357, https://doi.org/10.5194/egusphere-egu21-13357, 2021.

EGU21-1255 | vPICO presentations | ESSI1.6

Scientific workflow scheduling based on data transformation graph for remote sensing application

Zhuojing Tian, Zhenchun Huang, Yinong Zhang, Yanwei Zhao, En Fu, and Shuying Wang

As the amount of data and computation in scientific workflow applications continues to grow, distributed and heterogeneous computing infrastructures such as inter-cloud environments provide this type of application with a great number of computing resources to meet the corresponding needs. In the inter-cloud environment, how to effectively map tasks to cloud service providers to meet QoS (quality of service) constraints based on user requirements has become an important research direction. Remote sensing applications need to process terabytes of data in each run; however, frequent and huge data transmissions across clouds create major performance bottlenecks for execution and seriously affect QoS constraints such as makespan and cost. Using a data transformation graph (DTG) to study the data transfer process of a global drought detection application, a specific optimization strategy is obtained based on the characteristics of the application and environment. Accordingly, an inter-cloud workflow scheduling method based on a genetic algorithm is proposed; under the condition of satisfying the user's QoS constraints, the makespan and the cost can be minimized. The experimental results show that, compared with the standard genetic algorithm, a random algorithm, and a round-robin algorithm, the optimized genetic algorithm can greatly improve the scheduling performance of data- and computation-intensive scientific workflows such as remote sensing applications and reduce the impact of performance bottlenecks.

Keywords: scientific workflow scheduling; inter-cloud environment; remote sensing application; data transformation graph
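The following toy sketch illustrates the kind of fitness function such a genetic algorithm can optimize: task-to-cloud assignments are rejected when a deadline (a stand-in QoS constraint) is violated and otherwise scored by a weighted sum of makespan and cost. All speeds, prices, and GA settings are invented for illustration, and unlike the paper's DTG-based method, this sketch ignores inter-cloud data transfer entirely.

```python
import random

SPEED = {"cloudA": 1.0, "cloudB": 1.6, "cloudC": 0.8}  # relative compute speed (invented)
PRICE = {"cloudA": 1.0, "cloudB": 2.5, "cloudC": 0.6}  # cost per work unit (invented)
TASKS = [4.0, 2.0, 7.0, 3.0, 5.0]                      # work units per task
DEADLINE, ALPHA = 12.0, 0.5                            # QoS deadline and objective weight

def fitness(assign):
    """Weighted makespan/cost objective; infeasible schedules (deadline
    violated) are rejected outright. Data-transfer time is ignored here."""
    finish = dict.fromkeys(SPEED, 0.0)
    cost = 0.0
    for work, cloud in zip(TASKS, assign):
        finish[cloud] += work / SPEED[cloud]
        cost += work * PRICE[cloud]
    makespan = max(finish.values())
    return float("inf") if makespan > DEADLINE else ALPHA * makespan + (1 - ALPHA) * cost

random.seed(0)
clouds = list(SPEED)
pop = [[random.choice(clouds) for _ in TASKS] for _ in range(40)]
for _ in range(100):                       # plain generational GA loop
    pop.sort(key=fitness)
    parents, children = pop[:20], []
    for _ in range(20):
        a, b = random.sample(parents, 2)
        cut = random.randrange(len(TASKS))
        child = a[:cut] + b[cut:]          # one-point crossover
        if random.random() < 0.2:          # mutation: reassign one task
            child[random.randrange(len(TASKS))] = random.choice(clouds)
        children.append(child)
    pop = parents + children
best = min(pop, key=fitness)
print(best, fitness(best))
```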

How to cite: tian, Z., huang, Z., zhang, Y., zhao, Y., fu, E., and wang, S.: Scientific workflow scheduling based on data transformation graph for remote sensing application, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-1255, https://doi.org/10.5194/egusphere-egu21-1255, 2021.

EGU21-14423 | vPICO presentations | ESSI1.6

Biclustering for uncovering spatial-temporal patterns in telecom data

Nastasija Grujić, Sanja Brdar, Olivera Novović, Nikola Obrenović, Miro Govedarica, and Vladimir Crnojević

Understanding human dynamics is of crucial importance for managing human activities for sustainable development. According to the United Nations, 68% of people will live in cities by 2050. Therefore, it is important to understand human footprints in order to develop policies that will improve life in urban and suburban areas. Our study aims at detecting spatial-temporal activity patterns from mobile phone data provided by a telecom service provider. To be more precise, we used the activity data set, which contains the amount of sent/received SMS messages and calls, as well as internet usage, per radio base station in defined time stamps. The case study focuses on the capital city of Serbia, Belgrade, which has nearly 2 million inhabitants; the analysis covers February 2020. We applied the biclustering (spectral co-clustering) algorithm to the telecom data to detect locations in the city that behave similarly in specific time windows. Biclustering is a data mining technique used for finding homogeneous submatrices among the rows and columns of a matrix, widely applied in text mining and gene expression data analysis. Although there are no examples in the literature of the algorithm being used on location-based data for urban applications, we saw its potential due to its ability to detect clusters in a more refined way, during specific periods of time, that could not otherwise be detected with a global clustering approach. To prepare the data for the algorithm, we normalized each type of activity (SMS/call in/out and internet activity) and aggregated the total activity on each antenna per hour. We transformed the data into a matrix, where the rows represent the antennas and the columns the hours. The algorithm was applied to each day separately. The average number of discovered biclusters was five, usually corresponding to regular activities, such as work, home, commuting, and free time, but also to the city's nightlife. Our results confirmed that urban spaces are a function of space and time. They revealed different functionalities of the urban and suburban parts of the city. We observed the presence of patterned behavior across the analyzed days. The type of day dictated the spatial-temporal activities that occurred. We distinguished different types of days, such as working days (Monday to Thursday), Fridays, weekends, and holidays. These findings show the promising potential of the biclustering algorithm and could be utilized by policymakers for precisely detecting activity clusters across space and time that correspond to specific functions of the city.
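A minimal sketch of this antennas-by-hours matrix setup with scikit-learn's spectral co-clustering; the matrix dimensions and the fixed number of biclusters are illustrative (the study found around five per day).

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Hypothetical activity matrix: rows = antennas, columns = hours of one day,
# entries = normalized total activity (SMS + calls + internet) per antenna-hour.
rng = np.random.default_rng(0)
activity = rng.random((300, 24))

model = SpectralCoclustering(n_clusters=5, random_state=0)
model.fit(activity)
antenna_groups = model.row_labels_   # which activity pattern each antenna follows
hour_groups = model.column_labels_   # which time window each hour belongs to
```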

How to cite: Grujić, N., Brdar, S., Novović, O., Obrenović, N., Govedarica, M., and Crnojević, V.: Biclustering for uncovering spatial-temporal patterns in telecom data, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-14423, https://doi.org/10.5194/egusphere-egu21-14423, 2021.

EGU21-9749 | vPICO presentations | ESSI1.6

Cluster analysis in the studying of stress relation in the Vrancea-zone

Lili Czirok, Lukács Kuslits, and Katalin Gribovszki

The SE Carpathians show significant geodynamic activity due to the ongoing subduction process. The strong seismicity in the Vrancea zone is its most important indicator. The focal area of these seismic events is relatively small, around 80 × 100 km, and the distribution of their locations is quite dense.

The authors have carried out cluster analyses of the focal mechanism solutions estimated from local and teleseismic measurements, together with stress inversions, to support the recent and previously published studies in this region. They have applied different pre-existing clustering methods, e.g. HDBSCAN (hierarchical density-based spatial clustering of applications with noise) and agglomerative hierarchical analysis, considering the geographical coordinates, focal depths, and parameters of the focal mechanism solutions of the seismic events used. Moreover, they have attempted to develop a fully automated algorithm for the classification of the earthquakes used in the estimations. This algorithm does not require the setting of hyper-parameters; thus, the effect of subjectivity can be reduced significantly and the running time can also be decreased. In all cases, the resulting stress tensors are in close agreement with previously presented results.
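For reference, a minimal sketch of the HDBSCAN step on a feature matrix of event attributes, using the hdbscan Python package. The feature set, scaling, and min_cluster_size are illustrative, and min_cluster_size is exactly the kind of hyper-parameter the authors' fully automated algorithm aims to avoid.

```python
import numpy as np
import hdbscan

# Hypothetical feature matrix: one row per event with standardized longitude,
# latitude, focal depth, and focal-mechanism angles (strike, dip, rake).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))

clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
labels = clusterer.fit_predict(X)  # -1 marks events flagged as noise
print(f"{labels.max() + 1} clusters, {(labels == -1).sum()} noise events")
```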

How to cite: Czirok, L., Kuslits, L., and Gribovszki, K.: Cluster analysis in the studying of stress relation in the Vrancea-zone, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-9749, https://doi.org/10.5194/egusphere-egu21-9749, 2021.

EGU21-12324 | vPICO presentations | ESSI1.6 | Highlight

AWT - Clustering using an Aggregated Wavelet Tree: A novel automatic unsupervised clustering and outlier detection algorithm for time series

Christina Pacher, Irene Schicker, Rosmarie DeWit, and Claudia Plant

Both clustering and outlier detection play an important role in meteorology. With clustering, large sets of data points, such as numerical weather prediction (NWP) model data or observation sites, are separated into groups based on the characteristics found in the data, grouping similar data points in a cluster. Clustering also enables the detection of outliers in the data. The resulting clusters are useful in many ways, such as atmospheric pattern recognition (e.g. clustering NWP ensemble predictions to estimate the likelihood of predicted weather patterns), climate applications (grouping point observations for climate pattern recognition), forecasting (e.g. data pool enhancement using data from similar sites for forecasting applications), urban meteorology, air quality, renewable energy systems, and hydrological applications.
 
Typically, one does not know in advance how many clusters or groups are present in the data. However, for algorithms such as K-means, one needs to define how many clusters one wants as an outcome. With the proposed novel algorithm AWT, a modified combination of several well-known clustering algorithms, this is not needed. It chooses the number of clusters automatically, based on a user-defined threshold parameter. Furthermore, the algorithm can be used for heterogeneous meteorological input data as well as for data sets that exceed the available memory size.
Similar to the classical BIRCH algorithm, our method AWT works on a multi-resolution data structure, an Aggregated Wavelet Tree, that is suitable for representing multivariate time series. In contrast to BIRCH, the user does not need to specify the number of clusters K, as that is difficult in our application. Instead, AWT relies on a single threshold parameter for clustering and outlier detection. This threshold corresponds to the highest resolution of the tree. Points that are not in any cluster with respect to the threshold are naturally flagged as outliers.
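AWT itself is built on the wavelet tree, but the role of its single threshold can be illustrated with a much simpler leader-style clustering sketch: a point joins the first cluster centre within the threshold, otherwise it founds a new cluster, and clusters that stay singletons are flagged as outliers. This is an analogy for the threshold semantics only, not the AWT algorithm.

```python
import numpy as np

def threshold_cluster(points, threshold):
    """Leader-style threshold clustering: assign each point to the nearest
    existing centre within `threshold`, or make it a new centre; clusters
    that remain singletons are treated as outliers."""
    centres, members = [], []
    for p in points:
        d = [np.linalg.norm(p - c) for c in centres]
        if d and min(d) <= threshold:
            members[int(np.argmin(d))].append(p)
        else:
            centres.append(p)
            members.append([p])
    is_outlier = [len(m) == 1 for m in members]
    return centres, members, is_outlier

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2)), [[10, 10]]])
centres, members, outliers = threshold_cluster(pts, threshold=1.0)
print(len(centres), "clusters;", sum(outliers), "outlier(s)")
```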
 
With the recently increasing usage of non-traditional data sources, such as private smart-home weather stations, in NWP models and other forecasting applications, outlier detection and clustering methods are useful for pre-processing and filtering these rather novel data sources. Especially in urban areas, changes in the surface energy balance caused by urbanization result in temperatures generally being higher in cities than in the surrounding areas. In order to capture the spatial features of this effect, data with high spatial resolution are necessary. Here, privately owned smart-home weather stations are useful, as often only a limited number of official observation sites exist. However, to be able to use these data, they need to be pre-processed.
  
In this work we apply our novel algorithm AWT to crowdsourced data from the city of Vienna. We demonstrate the skill of the algorithm in outlier detection and filtering as well as in clustering the data, and we evaluate it against commonly used algorithms. Furthermore, we show how the algorithm could be used in renewable energy applications.

How to cite: Pacher, C., Schicker, I., DeWit, R., and Plant, C.: AWT - Clustering using an Aggregated Wavelet Tree: A novel automatic unsupervised clustering and outlier detection algorithm for time series, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-12324, https://doi.org/10.5194/egusphere-egu21-12324, 2021.

EGU21-12734 | vPICO presentations | ESSI1.6

Statistical downscaling of wind speed time series data based on topographic variables

Wenxuan Hu, Yvonne Scholz, Madhura Yeligeti, Lüder von Bremen, and Marion Schroedter-Homscheidt

Renewable energy sources such as wind energy play a crucial role in most climate change mitigation scenarios because of their ability to significantly reduce energy-related carbon emissions. In order to understand and design future energy systems, detailed modeling of renewable energy sources is important. To make energy system modelling possible at all variability scales of local weather conditions, renewable energy source information with high resolution in both space and time is required.

Nowadays, the renewable energy resource data most widely used in the energy modeling community are reanalysis data such as ERA5, COSMO-REA6, and MERRA-2. Taking wind speed as an example, reanalysis data can provide long-term, spatially resolved wind information at any desired height in a physically consistent way. However, their spatial resolution is coarse. In order to obtain data with a fine spatial resolution, this paper proposes a statistical downscaling method for wind speed based on reanalysis data, observation data, and the local topography.

While most statistical wind downscaling studies have focused on obtaining site-specific data or on downscaling probability density functions, this paper focuses on downscaling one-year hourly wind speed time series for Europe to 0.00833° × 0.00833° (approximately 1 km × 1 km) resolution. Various studies have shown that the local topography influences wind speed. The topographic structure in this study is described by two metrics: the TPI, a topographic position index that compares the elevation of each cell to the mean elevation of the neighboring area, and Sx, a slope-based, direction-dependent parameter that describes the topography in the upwind direction. The observation data used in this study are MeteoSwiss measurements, which provide hourly wind speed time series at the station heights. For each weather station with observation data, biases described by the local terrain features are introduced to minimize the root mean square (RMS) error and the Kolmogorov-Smirnov D (KSD) statistic between the corrected and the observed wind speeds. These biases are then assigned to grid points with the same terrain types as the weather station, which enables downscaling of the wind speed for the whole of Europe.
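A minimal sketch of the TPI metric on a gridded elevation model, assuming a simple square moving-window neighborhood (the window size is illustrative; Sx, being direction-dependent, requires a more involved computation and is omitted):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def tpi(elevation, size=11):
    """Topographic position index: cell elevation minus the mean elevation of
    its size-by-size neighborhood; positive on ridges, negative in valleys."""
    return elevation - uniform_filter(elevation, size=size)

# Hypothetical DEM tile
rng = np.random.default_rng(0)
dem = rng.normal(500.0, 50.0, size=(400, 400))
ridge_cells = tpi(dem, size=11) > 0  # candidate ridge cells, where the bias correction matters most
```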

The results show that this downscaling method can improve the RMS error and KSD statistic for both ERA5 and COSMO-REA6, especially at mountain ridges, which indicates that it can not only decrease the bias but also provide a better match to the observed wind speed distributions.

How to cite: Hu, W., Scholz, Y., Yeligeti, M., von Bremen, L., and Schroedter-Homscheidt, M.: Statistical downscaling of wind speed time series data based on topographic variables, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-12734, https://doi.org/10.5194/egusphere-egu21-12734, 2021.

EGU21-13210 | vPICO presentations | ESSI1.6

Spatio-temporal clustering methodologies for point-event natural hazards

Uldis Zandovskis, Bruce D. Malamud, and Davide Pigoli

Natural hazards are inherently spatio-temporal processes. Spatio-temporal clustering methodologies applied to natural hazard data can help distinguish clustering patterns that not only identify point-event-dense regions and time periods, but also provide insight into the hazardous process. Here we review spatio-temporal clustering methodologies applicable to point-event datasets representative of natural hazards, and we evaluate their performance using both synthetic and real-life data. We first present a systematic overview of major spatio-temporal clustering methodologies used in the literature, which include clustering procedures that are (i) global (providing a single quantitative measure of the degree of clustering in the dataset) and (ii) local (i.e. assigning individual point events to a cluster). A total of seven methodologies from these two groups are applied to real-world (lightning) and synthetic datasets. For (i) global procedures, we explore the Knox, Mantel, and Jacquez k-NN tests and spatio-temporal K-functions, and for (ii) local procedures we consider the spatio-temporal scan statistic, kernel density estimation, and the density-based clustering method OPTICS. The dataset of 7021 lightning strikes is from 1 and 2 July 2015 over the UK, when a severe three-storm system crossed the region, with a different convective mode producing each of the storms. The synthetic datasets are representative of various topologies of point-event natural hazard data with a moving source. We introduce a two-source model with input parameters related to the physical properties of the source. Each source has a set number of point events, an initiation point in space and time, a movement speed, a direction, an inter-event time distribution, and a spatial spread distribution. In addition to a base model of two identical moving sources with a set temporal separation, we produce four different topologies of the data by incrementally varying the speed parameter of the source, the spatial spread parameters, the direction and initiation points, and the angle of the two sources. With these five synthetic datasets representative of various two-source models, we evaluate the performance of the methodologies. The performance is assessed based on the ability of each methodology to separate the point events produced by the two sources and on the sensitivity of these results to changes in the model input parameters. We further discuss the benefits of combining global and local clustering procedures in the analyses, as global clustering procedures give an initial understanding of the spatial and temporal scales over which clustering is present in the data. This information then helps to inform and limit the choice of input parameters for the local clustering procedures.
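As an example of the global procedures listed above, here is a small sketch of the Knox test with Monte Carlo significance, counting event pairs that are close in both space and time; the distance and time thresholds and the synthetic events are illustrative.

```python
import numpy as np

def knox_statistic(xy, t, delta_km, tau_h):
    """Knox statistic: the number of event pairs closer than delta_km in
    space AND tau_h in time. Significance is assessed by re-computing the
    statistic under random permutations of the event times."""
    n = len(t)
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            if (np.linalg.norm(xy[i] - xy[j]) < delta_km
                    and abs(t[i] - t[j]) < tau_h):
                count += 1
    return count

rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, size=(200, 2))  # event locations in km (toy data)
t = rng.uniform(0, 48, size=200)         # event times in hours (toy data)
obs = knox_statistic(xy, t, delta_km=10, tau_h=3)
perm = [knox_statistic(xy, rng.permutation(t), 10, 3) for _ in range(99)]
p_value = (1 + sum(p >= obs for p in perm)) / 100  # Monte Carlo p-value
print(obs, p_value)
```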

How to cite: Zandovskis, U., Malamud, B. D., and Pigoli, D.: Spatio-temporal clustering methodologies for point-event natural hazards, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-13210, https://doi.org/10.5194/egusphere-egu21-13210, 2021.

EGU21-7162 | vPICO presentations | ESSI1.6

Wildfire susceptibility assessment: evaluation of the performance of different machine learning algorithms

Andrea Trucchia, Sara Isnardi, Mirko D'Andrea, Guido Biondi, Paolo Fiorucci, and Marj Tonini

Wildfires constitute a complex environmental disaster triggered by several interacting natural and human factors that can affect biodiversity, species composition, and ecosystems, but also human lives, regional economies, and environmental health. Wildfires have therefore become a focus of forestry and ecological research and are receiving considerable attention in forest management. Current advances in automated learning and simulation methods, like machine learning (ML) algorithms, have recently aroused great interest in wildfire risk assessment and mapping. This quantitative evaluation is carried out by taking into account two factors: the location and spatial extension of past wildfire events, and the geo-environmental and anthropogenic predisposing factors that favored their ignition and spreading. When dealing with risk assessment and predictive mapping for natural phenomena, it is crucial to ascertain the reliability and validity of the collected data, as well as the prediction capability of the obtained results. In a previous study (Tonini et al. 2020), the authors applied Random Forest (RF) to elaborate a wildfire susceptibility map for the Liguria region (Italy). In the present study, we address the following outstanding issues, which were still unsolved: (1) the vegetation map included a class labeled "burned area" that masked the true burned vegetation; (2) the implemented RF-based model gave good results, but it needed to be compared with other ML-based approaches; (3) to test the predictive capabilities of the model, the last three years of observations were taken, but these are not fully representative of the different wildfire regimes characterizing non-consecutive years. By improving the analyses, the following results were achieved. (1) The class "burned areas" was reclassified based on expert knowledge, and the type of vegetation correctly assigned. This allowed correctly estimating the relative importance of each vegetation class belonging to this variable. (2) Two additional ML-based approaches, namely Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM), were tested besides RF, and the performance of each model was assessed, as well as the resulting variable ranking and the predicted outputs. This allowed comparing the three ML-based approaches and evaluating the pros and cons of each one. (3) The training and testing datasets were selected by extracting the yearly observations based on a clustering procedure, allowing us to account for the temporal variability of the burning seasons. As a result, our models can on average perform better predictions in different situations, taking into consideration years experiencing more or fewer wildfires than usual. The three ML-based models (RF, SVM and MLP) were finally validated by means of two metrics: (i) the Area Under the ROC Curve, selecting the validation dataset by using a 5-fold cross-validation procedure; and (ii) the RMS errors, computed by evaluating the difference between the predicted probability outputs and the presence/absence of an observed event in the testing dataset.
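A compact sketch of the validation scheme described in point (i), comparing the three model families by cross-validated AUC with scikit-learn; the synthetic data stand in for the real geo-environmental and anthropogenic predictors, and the model hyper-parameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Stand-in for the real predictors (vegetation, topography, anthropogenic layers)
X, y = make_classification(n_samples=2000, n_features=12, weights=[0.8], random_state=0)

models = {
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(probability=True, random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")  # 5-fold CV AUC
    print(f"{name}: AUC = {auc.mean():.3f} +/- {auc.std():.3f}")
```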

Bibliography:

Tonini, M.; D'Andrea, M.; Biondi, G.; Degli Esposti, S.; Trucchia, A.; Fiorucci, P. A Machine Learning-Based Approach for Wildfire Susceptibility Mapping. The Case Study of the Liguria Region in Italy. Geosciences 2020, 10, 105. https://doi.org/10.3390/geosciences10030105

How to cite: Trucchia, A., Isnardi, S., D'Andrea, M., Biondi, G., Fiorucci, P., and Tonini, M.: Wildfire susceptibility assessment: evaluation of the performance of different machine learning algorithms, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-7162, https://doi.org/10.5194/egusphere-egu21-7162, 2021.

EGU21-13713 | vPICO presentations | ESSI1.6

RapidAI4EO: Advancing the State-of-the-Art in Continuous Land Monitoring

Giovanni Marchisio, Patrick Helber, Benjamin Bischke, Tim Davis, and Annett Wania

New catalogues of nearly daily or even intraday temporal data will soon dominate the global archives. However, there has been little exploration of artificial intelligence (AI) techniques to leverage the high cadence that is already achievable through the fusion of multiscale, multimodal sensors. Under the sponsorship of the European Union's Horizon 2020 programme, RapidAI4EO will establish the foundations for the next generation of Copernicus Land Monitoring Service (CLMS) products. The focus is on the CORINE Land Cover programme, the flagship of CLMS.

Specific objectives of the project are to: 1) explore and stimulate the development of new spatiotemporal monitoring applications based on the latest advances in AI and deep learning (DL); 2) demonstrate the fusion of Copernicus high-resolution satellite imagery and third-party very high-resolution imagery; and 3) provide intensified monitoring of land use and land cover, and land use change, at a much higher level of detail and temporal cadence than is possible today.

Our strategy is two-fold. The first aspect involves developing vastly improved DL architectures to model the phenomenology inherent in high cadence observations, with a focus on disentangling phenology from structural change. The second involves providing critical training data to drive advancement in the Copernicus community and ecosystem well beyond the lifetime of this project. To this end we will create the most complete and dense spatiotemporal training sets ever, combining Sentinel-2 with daily, harmonized, cloud-free, gap-filled, multispectral 3 m time series resulting from the fusion of open satellite data with Planet imagery at as many as 500,000 patch locations over Europe. The daily time series will span the entire year 2018, to coincide with the latest release of CORINE. We plan to open-source these datasets for the benefit of the entire remote sensing community.

This talk focuses on the description of the datasets, whose inspiration comes from the recently released EuroSAT (Helber et al., 2019) and BigEarthNet (Sumbul et al., 2019) corpora. The new corpora will look at the intersection of CORINE 2018 with all the countries in the EU, balancing relative country surface with relative LULC distribution and, most notably, adding the daily high resolution time series at all locations for the year 2018. Annotations will be based on the CORINE ontology. The higher spatial resolution will support modeling of more LC classes, while the added temporal dimension should enable disambiguation of land covers across diverse climate zones, as well as an improved understanding of land use.
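As a rough illustration of what one sample in such a corpus could look like, the sketch below lays out a year of daily, gap-filled multispectral observations for a single patch as a (time, band, y, x) array; the patch size, band names and shapes are assumptions for illustration, not the project's actual specification.

```python
# A minimal sketch of a dense spatiotemporal training patch, assuming xarray.
import numpy as np
import pandas as pd
import xarray as xr

days = pd.date_range("2018-01-01", "2018-12-31", freq="D")
bands = ["blue", "green", "red", "nir"]          # assumed band set
patch = xr.DataArray(
    np.zeros((len(days), len(bands), 64, 64), dtype=np.float32),  # placeholder reflectances
    dims=("time", "band", "y", "x"),
    coords={"time": days, "band": bands},
    name="reflectance",
)
# e.g. a per-band annual phenology curve averaged over the patch:
phenology = patch.mean(dim=("y", "x"))
print(phenology.shape)  # (365, 4)
```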

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101004356.

How to cite: Marchisio, G., Helber, P., Bischke, B., Davis, T., and Wania, A.: RapidAI4EO: Advancing the State-of-the-Art in Continuous Land Monitoring, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-13713, https://doi.org/10.5194/egusphere-egu21-13713, 2021.

EGU21-16051 | vPICO presentations | ESSI1.6

Land-use change effects on biodiversity through mechanistic simulations: A case study with South-Asian mammals

Andre P. Silva, Filip Thorn, Damaris Zurell, and Juliano Cabral

Land-use change remains the main driver of biodiversity loss, and fragmentation and habitat loss are expected to lead to further population declines and species losses. We apply a recently developed R package for a spatially-explicit mechanistic simulation model (RangeShiftR), which incorporates habitat suitability as well as demographic and dispersal processes, to understand the temporal effects of land-use change (Land-use harmonization scenarios for the 1900-2100 period) on the abundance and richness of mammalian species in South Asia. We then compare land-use scenarios with and without protected areas to understand whether current spatial conservation strategies are able to sustain viable populations independently of the land-use scenario followed. Our approach is innovative in assessing how land-use scenarios can influence animal populations through the underlying ecological processes.

How to cite: P. Silva, A., Thorn, F., Zurell, D., and Cabral, J.: Land-use change effects on biodiversity through mechanistic simulations: A case study with South-Asian mammals, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-16051, https://doi.org/10.5194/egusphere-egu21-16051, 2021.

EGU21-13903 | vPICO presentations | ESSI1.6

Fostering International Collaboration Through a Unified Verification, Validation, and Diagnostics Framework - METplus 

Tara Jensen, Marion Mittermaier, Paul Kucera, and Barbara Brown

Verification and validation activities are critical for the success of modeling and prediction efforts at organizations around the world.  Having reproducible results via a consistent framework is equally important for model developers and users alike.  The Model Evaluation Tools (MET) was developed over a decade ago and expanded to the METplus framework with a view towards providing a consistent platform delivering reproducible results.   

The METplus system is an umbrella verification, validation and diagnostic framework used by thousands of users from both US and international organizations. These tools are designed to be highly flexible to allow for quick adaptation to meet additional evaluation and diagnostic needs. A suite of Python wrappers has been implemented to facilitate a quick set-up and implementation of the system, and to enhance the pre-existing plotting capabilities. Recently, several organizations within the National Oceanic and Atmospheric Administration (NOAA), the United States Department of Defense (DOD), and international partnerships such as the Unified Model (UM) Partnership led by the Met Office have adopted the tools both operationally and for research purposes. Many of these organizations are also now contributing to METplus development, leading to a more robust and dynamic framework for the entire earth system modeling community to use.

This presentation will provide an update on the current status of METplus and how it is being used across multiple scales and applications. It will highlight examples of METplus applied to verification and validation efforts throughout the international community, addressing a range of temporal scales (hourly forecasts to subseasonal-to-seasonal) and spatial scales (convection-allowing to mesoscale, regional to global, tropical to cryosphere to space).

How to cite: Jensen, T., Mittermaier, M., Kucera, P., and Brown, B.: Fostering International Collaboration Through a Unified Verification, Validation, and Diagnostics Framework - METplus , EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-13903, https://doi.org/10.5194/egusphere-egu21-13903, 2021.

EGU21-4468 | vPICO presentations | ESSI1.6

Free Evaluation System Framework (Freva) - New Features and Development

Etor E. Lucio-Eceiza, Christopher Kadow, Martin Bergemann, Mahesh Ramadoss, Sebastian Illing, Oliver Kunst, Thomas Schartner, Jens Grieger, Mareike Schuster, Andy Richling, Ingo Kirchner, Henning Rust, Philipp Sommer, Ulrich Cubasch, Uwe Ulbrich, Hannes Thiemann, and Thomas Ludwig

The Free Evaluation System Framework (Freva - freva.met.fu-berlin.de , xces.dkrz.de , www-regiklim.dkrz.de - https://github.com/FREVA-CLINT/Freva) is a software infrastructure for standardized data and tool solutions in Earth system science. Freva runs on high performance computers (HPC) to handle customizable evaluation systems of research projects, institutes or universities. It combines different software technologies into one common hybrid infrastructure, where all its features are accessible via shell and web environments. Freva indexes different data projects into one common search environment by storing the metadata information of the self-describing model, reanalysis and observational data sets in a database. The database interface satisfies the international standards provided by the Earth System Grid Federation (ESGF). This metadata system, with its advanced but easy-to-handle search tool, supports users, developers and their plugins in retrieving the required information. A generic application programming interface (API) allows scientific developers to connect their analysis tools with the evaluation system independently of the programming language used. Facilitating the provision and usage of tools and climate data automatically increases the number of scientists working with the data sets and identifying discrepancies. Plugins can also integrate their results, e.g. post-processed data, into the user’s database. This allows, for example, post-processing plugins to feed statistical analysis plugins, which fosters an active exchange between the plugin developers of a research project. Additionally, the history and configuration sub-system stores every analysis performed with the evaluation system in a database. Configurations and results of the tools can be shared among scientists via the shell or web system. Plugged-in tools therefore benefit from transparency and reproducibility. Furthermore, the system suggests existing results already produced by other users, saving CPU hours, I/O, disk space and time. An integrated web shell (shellinabox) adds a degree of freedom in the choice of the working environment and can be used as a gate to the research projects on an HPC. Freva efficiently frames the interaction between different technologies, thus improving Earth system modeling science. New features and aspects of further development and collaboration are discussed.

How to cite: Lucio-Eceiza, E. E., Kadow, C., Bergemann, M., Ramadoss, M., Illing, S., Kunst, O., Schartner, T., Grieger, J., Schuster, M., Richling, A., Kirchner, I., Rust, H., Sommer, P., Cubasch, U., Ulbrich, U., Thiemann, H., and Ludwig, T.: Free Evaluation System Framework (Freva) - New Features and Development, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-4468, https://doi.org/10.5194/egusphere-egu21-4468, 2021.

EGU21-4918 | vPICO presentations | ESSI1.6

Novel assessment of model relative humidity with satellite probabilistic estimates

Chloé Radice, Hélène Brogniez, Pierre-Emmanuel Kirstetter, and Philippe Chambon

Remote sensing data are often used to assess model forecasts on multiple scales, generally by confronting past simulations with observations. This paper introduces a novel probabilistic method that evaluates tropical atmospheric relative humidity (RH) profiles simulated by ARPEGE, the global numerical model for weather forecasting, with respect to probability distributions of finer-scale satellite observations.

The reference RH is taken from the SAPHIR microwave sounder onboard the Megha-Tropiques satellite, in operation since 2011. ARPEGE simulates the RH field every 6 hours on a 0.25° grid over 18 vertical levels ranging between 950 hPa and 100 hPa. The reference probabilistic RH field is retrieved from brightness temperatures measured by SAPHIR, with a footprint resolution ranging from 10 km (nadir) to 23 km (edge of swath), on 6 vertical layers ranging from 950 hPa to 100 hPa. Footprint-scale RH values are aggregated (convolved) over the spatial and temporal scale of comparison to match the model resolution and summarize the patterns over a significant period. Comparison results will be shown over the April-May-June 2018 period for two configurations of the ARPEGE model (two parametrization schemes for convection). The probabilistic comparison is discussed with respect to a classical deterministic comparison of RH values.

This probabilistic approach keeps all the sub-grid information and, by looking at the distribution as a whole, avoids the classical deterministic simplification of working with a single “best” estimate. The method allows a finer assessment by working on a case-by-case basis and enabling the characterisation of specific situations. It provides added value by accounting for additional information in the evaluation of the simulated field, especially for model simulations that are close to the traditional mean.
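As a generic illustration of scoring a single simulated value against a distribution of finer-scale retrievals (not the authors' exact metric), the sketch below computes where the simulation sits in the observed distribution and an empirical CRPS for one grid cell.

```python
# A minimal sketch, assuming numpy and synthetic stand-in retrievals.
import numpy as np

def rank_and_crps(obs_samples, model_value):
    """obs_samples: RH retrievals within one grid cell; model_value: simulated RH."""
    obs = np.asarray(obs_samples, dtype=float)
    rank = (obs < model_value).mean()  # position of the simulation in the observed CDF
    # Empirical CRPS: E|X - m| - 0.5 * E|X - X'|
    crps = np.abs(obs - model_value).mean() - 0.5 * np.abs(
        obs[:, None] - obs[None, :]
    ).mean()
    return rank, crps

rng = np.random.default_rng(0)
obs = rng.normal(55.0, 8.0, size=200)   # placeholder SAPHIR-like RH samples (%)
print(rank_and_crps(obs, model_value=60.0))
```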

How to cite: Radice, C., Brogniez, H., Kirstetter, P.-E., and Chambon, P.: Novel assessment of model relative humidity with satellite probabilistic estimates, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-4918, https://doi.org/10.5194/egusphere-egu21-4918, 2021.

EGU21-3476 | vPICO presentations | ESSI1.6

Recent developments on the Earth System Model Evaluation Tool

Bouwe Andela, Fakhereh Alidoost, Lukas Brunner, Jaro Camphuijsen, Bas Crezee, Niels Drost, Bettina Gier, Birgit Hassler, Peter Kalverla, Axel Lauer, Saskia Loosveldt-Tomas, Ruth Lorenz, Valeriu Predoi, Mattia Righi, Manuel Schlund, Stef Smeets, Javier Vegas-Regidor, Jost Von Hardenberg, Katja Weigel, and Klaus Zimmermann

The Earth System Model Evaluation Tool (ESMValTool) is a free and open-source community diagnostic and performance metrics tool for the evaluation of Earth system models such as those participating in the Coupled Model Intercomparison Project (CMIP). Version 2 of the tool (Righi et al. 2020, www.esmvaltool.org) features a brand new design composed of a core that finds and processes data according to a ‘recipe’ and an extensive collection of ready-to-use recipes and associated diagnostic codes for reproducing results from published papers. Development and discussion of the tool (mostly) takes place in public on https://github.com/esmvalgroup and anyone with an interest in climate model evaluation is welcome to join there.


Since the initial release of version 2 in the summer of 2020, many improvements have been made to the tool. It is now more user-friendly, with extensive documentation available on docs.esmvaltool.org and a step-by-step online tutorial. Regular releases, currently planned three times a year, ensure that recent contributions become available quickly while still ensuring a high level of quality control. The tool can be installed from conda, but portable Docker and Singularity containers are also available.


Recent new features include a more user-friendly command-line interface, citation information per figure including CMIP6 data citation using ES-DOC, more and faster preprocessor functions that require less memory, automatic corrections for a larger number of CMIP6 datasets, support for more observational and reanalysis datasets, and more recipes and diagnostics.


The tool is now also more reliable, with improved automated testing through more unit tests for the core, as well as a recipe testing service running at DKRZ for testing the scientific recipes and diagnostics that are bundled into the tool. The community maintaining and developing the tool is growing, making the project less dependent on individual contributors. There are now technical and scientific review teams that review new contributions for technical quality and scientific correctness and relevance respectively, two new principal investigators for generating a larger support base in the community, and a newly created user engagement team that is taking care of improving the overall user experience.

How to cite: Andela, B., Alidoost, F., Brunner, L., Camphuijsen, J., Crezee, B., Drost, N., Gier, B., Hassler, B., Kalverla, P., Lauer, A., Loosveldt-Tomas, S., Lorenz, R., Predoi, V., Righi, M., Schlund, M., Smeets, S., Vegas-Regidor, J., Von Hardenberg, J., Weigel, K., and Zimmermann, K.: Recent developments on the Earth System Model Evaluation Tool, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-3476, https://doi.org/10.5194/egusphere-egu21-3476, 2021.

EGU21-4805 | vPICO presentations | ESSI1.6

Bringing ESMValTool to the Jupyter Lab

Peter C. Kalverla, Stef Smeets, Niels Drost, Bouwe Andela, Fakhereh Alidoost, and Jaro Camphuijsen

Ease of use can easily become a limiting factor to scientific quality and progress. In order to verify and build upon previous results, the ability to effortlessly access and process increasing data volumes is crucial.

To level the playing field for all researchers, a shared infrastructure had to be developed. In Europe, this effort is coordinated mainly through the IS-ENES projects. The current infrastructure provides access to the data as well as compute resources. This leaves the tools to easily work with the data as the main obstacle for a smooth scientific process. Interestingly, not the scarcity of tools, but rather their abundance can lead to diverging workflows that hamper reproducibility.

The Earth System Model Evaluation Tool (ESMValTool) was originally developed as a command-line tool for the routine evaluation of important analytics workflows. The tool encourages some degree of standardization by factoring out common operations, while allowing for custom analytics of the pre-processed data. All scripts are bundled with the tool; over time, this has grown into a library of so-called ‘recipes’.

In the EUCP project, we are now developing a Python API for the ESMValTool. This allows for interactive exploration, modification, and execution of existing recipes, as well as creation of new analytics. Concomitantly, partners in IS-ENES3 are making their infrastructure accessible through JupyterLab. Through the combination of these technologies, researchers can easily access the data and compute, but also the workflows or methods used by their colleagues - all through the web browser. During the vEGU, we will show how this extended infrastructure can be used to easily reproduce, and build upon, previous results.
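As a sketch of the kind of interactive use this enables, the snippet below loads and runs an existing recipe from a notebook. The API was experimental at the time of this abstract, so the module and function names shown here (from esmvalcore.experimental) are assumptions that may have changed.

```python
# A minimal sketch of interactive recipe execution, assuming the
# experimental ESMValCore Python API.
from esmvalcore.experimental import get_recipe

recipe = get_recipe("examples/recipe_python.yml")  # load a bundled recipe
print(recipe)            # inspect documentation, datasets and diagnostics
output = recipe.run()    # preprocess the data and run the diagnostic scripts
print(output)            # handles to the produced figures and data files
```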

How to cite: Kalverla, P. C., Smeets, S., Drost, N., Andela, B., Alidoost, F., and Camphuijsen, J.: Bringing ESMValTool to the Jupyter Lab, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-4805, https://doi.org/10.5194/egusphere-egu21-4805, 2021.

EGU21-7724 | vPICO presentations | ESSI1.6

New scientific diagnostics in the ESMValTool – an overview

Lisa Bock, Birgit Hassler, and Axel Lauer and the ESMValTool Development Team

The Earth System Model Evaluation Tool (ESMValTool) has been developed with the aim of taking model evaluation to the next level by facilitating the analysis of many different ESM components, providing well-documented source code and the scientific background of the implemented diagnostics and metrics, and allowing for traceability and reproducibility of results (provenance). This has been made possible by a lively and growing development community that continuously improves the tool, supported by multiple national and European projects. The latest major release (v2.0) of the ESMValTool was officially introduced in August 2020 as a large community effort, and several additional smaller releases have followed since then.

The diagnostic part of the ESMValTool includes a large collection of standard “recipes” for reproducing peer-reviewed analyses of many variables across ESM compartments, including the atmosphere, ocean, and land domains, with diagnostics and performance metrics focusing on the mean state, trends, variability, important processes and phenomena, as well as emergent constraints. While most of the diagnostics use observational data sets (in particular satellite and ground-based observations) or reanalysis products for model evaluation, some are also based on model-to-model comparisons. This presentation gives an overview of the latest scientific diagnostics and metrics added during the last year, including examples of applications of these diagnostics to CMIP6 model data.

How to cite: Bock, L., Hassler, B., and Lauer, A. and the ESMValTool Development Team: New scientific diagnostics in the ESMValTool – an overview, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-7724, https://doi.org/10.5194/egusphere-egu21-7724, 2021.

EGU21-15681 | vPICO presentations | ESSI1.6

Model evaluation expectations of European ESM communities: results from a survey

Jerome Servonnat, Eric Guilyardi, Zofia Stott, Kim Serradell, Axel Lauer, Klaus Zimmerman, Fanny Adloff, Marie-Claire Greening, Remi Kazeroni, and Javier Vegas

Developing an Earth system model evaluation tool for a broad user community is a real challenge, as the potential users do not necessarily have the same needs or expectations. While many evaluation tasks across user communities include common steps, significant differences are also apparent, not least the investment by institutions and individuals in bespoke tools. A key question is whether there is sufficient common ground to pursue a community tool with broad appeal and application.

We present the main results of a survey carried out by Assimila for the H2020 IS-ENES3 project to review the model evaluation needs of European Earth System Modelling communities. Based on interviews with approximately 30 participants from several European institutions, the survey targeted a broad range of users, including model developers, model users, evaluation data providers, and infrastructure providers. The output of the study provides an analysis of requirements focusing on key technical, standards, and governance aspects.

The study used ESMValTool as a current benchmark for European evaluation tools. It is a community diagnostics and performance metrics tool for the evaluation of Earth System Models that allows for the comparison of single or multiple models, either against predecessor versions or against observations. The tool is being developed in such a way that additional analyses can be added. As a community effort open to both users and developers, it encourages the open exchange of diagnostic source code and evaluation results. It is currently used in Coupled Model Intercomparison Projects as well as for the development and testing of “new” models.

A key result of the survey is the widespread support for ESMValTool amongst users, developers, and even those who have taken or promote other approaches. The results of the survey identify priorities and opportunities in the further development of the ESMValTool to ensure long-term adoption of the tool by a broad community.

How to cite: Servonnat, J., Guilyardi, E., Stott, Z., Serradell, K., Lauer, A., Zimmerman, K., Adloff, F., Greening, M.-C., Kazeroni, R., and Vegas, J.: Model evaluation expectations of European ESM communities: results from a survey, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-15681, https://doi.org/10.5194/egusphere-egu21-15681, 2021.

ESSI1.12 – Novel methods and applications of satellite and aerial imagery

EGU21-15661 | vPICO presentations | ESSI1.12

Simulating Marine Litter observations from space to support Operations Research

Stephen Emsley, Manuel Arias, Théodora Papadopoulou, and François-Régis Martin-Lauzer

A breadboard for end-to-end (E2E) Marine Litter Optical Performance Simulations (ML-OPSI) is being designed in the frame of the ESA Open Space Innovation Platform (OSIP) Campaign to support Earth Observation (EO) scientists with the design of computational experiments for Operations Research. The ML-OPSI breadboard will estimate the Marine Litter signal at Top-Of-Atmosphere (TOA) from a set of Bottom-Of-Atmosphere (BOA) scenarios representing the various case studies considered by the community (e.g., windrows, frontal areas, river mouths, sub-tropical gyres), coming from synthetic (computer-simulated) data or from real observations. It is a modular, pluggable and extensible framework, promoting re-use, and can be adapted to different missions, sensors and scenarios.

The breadboard consists of (a) the OPSI components for the simulation, i.e. the process of using a model to study the characteristics of the system by manipulating variables and studying the properties of the model, allowing an evaluation to optimise performance and make predictions about the real system; and (b) the Marine Litter model components for the detection of marine litter. It shall consider the changes caused in the water reflectance and properties by marine litter, exploiting gathered information on plastic polymers, different viewing geometries, and naturally occurring atmospheric conditions. The modules of the breadboard include a Scenario Builder Module (SB) with the maximum spatial resolution and the best possible modelling of the relevant physical properties, which for spectral sensors could include high spatial resolution and high spectral density/resolution BOA radiance simulations in the optical to SWIR bands; a Radiative Transfer Module (RTM) transforming water-leaving to TOA reflectance for varying atmospheric conditions and observational geometries; a Scene Generator Module (SGM), which could use Sentinel-2, Landsat, or PRISMA data as reference, or any other instrument as pertinent; and a Performance Assessment Module (PAM) for ML detection that takes into account the variability of the atmosphere, the sunlight & skylight at BOA, the sea-surface roughness with trains of wind waves & swells, sea spray (whitecaps), air bubbles in the mixed layer, and marine litter dynamics, as well as instrumental noise, to assess marine litter detection feasibility.
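A modular, pluggable design of this kind could be organised as a chain of interchangeable processing stages. The sketch below is purely hypothetical: all class and method names are illustrative stand-ins, not the project's actual API.

```python
# A hypothetical sketch of a pluggable SB -> RTM -> SGM -> PAM pipeline.
from dataclasses import dataclass

@dataclass
class Scene:
    reflectance: object  # BOA/TOA radiance or reflectance cube (placeholder)
    meta: dict           # geometry, atmosphere, litter scenario, ...

class Module:
    def process(self, scene: Scene) -> Scene:
        raise NotImplementedError

class ScenarioBuilder(Module):        # SB: build the BOA marine-litter scenario
    def process(self, scene): return scene

class RadiativeTransfer(Module):      # RTM: propagate BOA -> TOA
    def process(self, scene): return scene

class SceneGenerator(Module):         # SGM: resample to a target sensor grid
    def process(self, scene): return scene

class PerformanceAssessment(Module):  # PAM: score litter detectability
    def process(self, scene): return scene

def run_pipeline(scene, modules):
    for m in modules:                 # modules are swappable per mission/sensor
        scene = m.process(scene)
    return scene

result = run_pipeline(Scene(None, {}), [ScenarioBuilder(), RadiativeTransfer(),
                                        SceneGenerator(), PerformanceAssessment()])
```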

Marine Litter scenarios of reference shall be built based on in-situ campaigns, to reflect the true littering conditions in each case, both in spatial distribution and composition. The breadboard shall be validated over artificial targets at sea in field campaigns as relevant. This might include spectral measurements from ASD, on-field radiometers, and cameras on UAVs, concomitant with Copernicus Sentinel-2 acquisitions. Combined, they can be used to estimate the atmospheric contribution and assess the performance of the tested processing chain.

This activity contributes to the “Remote Sensing of Marine Litter and Debris” IOCCG task force.

How to cite: Emsley, S., Arias, M., Papadopoulou, T., and Martin-Lauzer, F.-R.: Simulating Marine Litter observations from space to support Operations Research, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-15661, https://doi.org/10.5194/egusphere-egu21-15661, 2021.

EGU21-16478 | vPICO presentations | ESSI1.12

Synergistic use of SMOS and Sentinel-3 for retrieving spatiotemporally estimates of surface soil moisture and evaporative fraction

Maria Piles, Miriam Pablos Hernandez, Mercè Vall-llossera, Gerard Portal, Ionut Sandric, George P. Petropoulos, and Dionisis Hristopulos

Earth Observation (EO) makes it possible to obtain information on key parameters characterizing interactions among Earth’s system components, such as evaporative fraction (EF) and surface soil moisture (SSM). Notably, techniques utilizing EO data of land surface temperature (Ts) and vegetation index (VI) have shown promise in this regard. The present study describes an implementation of a downscaling method that combines the soil moisture product from SMOS with the Fractional Vegetation Cover provided by ESA’s Sentinel-3 platform.
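For context, the Ts/VI "triangle" idea underlying such methods can be sketched as follows; this is a deliberately simplified illustration with crude edge fitting, not the study's actual downscaling implementation, and the percentile-based edges are assumptions.

```python
# A minimal sketch of triangle-style EF estimation from a Ts/VI scatter.
import numpy as np

def triangle_ef(ts, fvc, n_bins=20):
    """ts: land surface temperature; fvc: fractional vegetation cover (0-1)."""
    ef = np.full_like(ts, np.nan, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for i in range(n_bins):
        m = (fvc >= edges[i]) & (fvc < edges[i + 1])
        if m.sum() < 10:
            continue
        t_dry, t_wet = np.nanpercentile(ts[m], [99, 1])  # dry/wet edges per VI bin
        ef[m] = np.clip((t_dry - ts[m]) / (t_dry - t_wet), 0.0, 1.0)
    return ef

rng = np.random.default_rng(1)
fvc = rng.uniform(0, 1, 5000)
ts = 320 - 15 * fvc + rng.normal(0, 3, 5000)  # placeholder Ts field (K)
print(np.nanmean(triangle_ef(ts, fvc)))
```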

The applicability of the investigated technique is demonstrated for a period of two years (2017-2018) using in-situ data acquired from five CarboEurope sites and from all the sites available in the REMEDHUS soil moisture monitoring network, representing a variety of climatic, topographic and environmental conditions. Predicted parameters were compared against co-orbital ground measurements acquired from several European sites belonging to the CarboEurope ground observational network.

Results indicated a close agreement between all the inverted parameters and the corresponding in-situ data. SSM maps predicted from the “triangle” method showed a small bias, but a large scatter. The results of this study provide strong supportive evidence of the potential value of the methodology investigated herein for accurately deriving estimates of key parameters characterising land surface interactions that can meet the needs of fine-scale hydrological applications. Moreover, the applicability of the presented approach demonstrates the added value of the synergy between ESA’s operational products acquired from different satellite sensors, in this case SMOS & Sentinel-3. As the approach is not tied to any particular sensor, it can also be implemented with technologically advanced EO sensors launched recently or planned for launch.

In the present work, Dr Petropoulos’ participation received funding from the European Union’s Horizon 2020 research and innovation programme ENViSIoN under the Marie Skłodowska-Curie grant agreement No 752094.

How to cite: Piles, M., Pablos Hernandez, M., Vall-llossera, M., Portal, G., Sandric, I., Petropoulos, G. P., and Hristopulos, D.: Synergistic use of SMOS and Sentinel-3 for retrieving spatiotemporally estimates of surface soil moisture and evaporative fraction, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-16478, https://doi.org/10.5194/egusphere-egu21-16478, 2021.

EGU21-10783 | vPICO presentations | ESSI1.12

Reducing the stride of the convolution kernel: a simple and effective strategy to increase the performance of CNN in building extraction from remote sensing image

M. Chen, J. Wu, and F. Tian

Automatically extracting buildings from remote sensing images (RSI) plays important roles in urban planning, population estimation, disaster emergency response, etc. With the development of deep learning technology, convolutional neural networks (CNN) with better performance than traditional methods have been widely used in extracting buildings from remote sensing imagery, but some problems remain. First, the low-level features extracted by the shallow layers and the abstract features extracted by the deep layers of the artificial neural network cannot be fully fused, which often makes building extraction inaccurate, especially for buildings with complex structures, irregular shapes and small sizes. Second, a network has so many parameters to train that it occupies substantial computing resources and consumes considerable time in the training process. By analyzing the structure of the CNN, we found that the abstract features extracted by deep layers with low geospatial resolution contain more semantic information. These abstract features are conducive to determining the category of pixels but are not sensitive to the boundaries of buildings. Since the stride of the convolution kernels and the pooling operations reduce the geospatial resolution of the feature maps, this paper proposes a simple and effective strategy to alleviate the above two bottlenecks: reduce the stride of the convolution kernels contained in one of the layers and reduce the number of convolution kernels. This strategy was applied to the DeepLabv3+ network and evaluated on both the WHU Building Dataset and the Massachusetts Building Dataset. Compared with the original DeepLabv3+ network, the results showed better performance: on the WHU Building Dataset, the Intersection over Union (IoU) increased by 1.4% and the F1 score by 0.9%; on the Massachusetts Building Dataset, the IoU increased by 3.31% and the F1 score by 2.3%.
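The effect of the strategy on feature-map geometry can be illustrated in isolation: halving a convolution's stride doubles the spatial resolution of its output, while using fewer kernels offsets the extra computation. The layer sizes below are illustrative, not the paper's exact configuration.

```python
# A minimal PyTorch sketch of the stride-reduction strategy.
import torch
import torch.nn as nn

x = torch.randn(1, 256, 64, 64)            # feature maps from earlier layers

original = nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1)
modified = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)  # stride 2 -> 1,
                                                                    # 512 -> 256 kernels
print(original(x).shape)  # torch.Size([1, 512, 32, 32]) - coarser feature maps
print(modified(x).shape)  # torch.Size([1, 256, 64, 64]) - boundary detail preserved
```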

How to cite: Chen, M., Wu, J., and Tian, F.: Reducing the stride of the convolution kernel: a simple and effective strategy to increase the performance of CNN in building extraction from remote sensing image, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-10783, https://doi.org/10.5194/egusphere-egu21-10783, 2021.

EGU21-3740 | vPICO presentations | ESSI1.12

Deep learning for extracting water body from Sentinel-2 MSI imagery

Shuren Chou

Deep learning has a good capacity for hierarchical feature learning from unlabeled remote sensing images. In this study, the simple linear iterative clustering (SLIC) method was improved to segment the image into good-quality superpixels. Then, a convolutional neural network (CNN) was used to extract water bodies from Sentinel-2 MSI data. In the proposed framework, the improved SLIC method obtains correct water body boundaries by optimizing the initial clustering centre, designing a dynamic distance measure, and expanding the search space. In addition, unlike traditional water body extraction methods, it can achieve multi-level water body detection. Experimental results showed that this method had higher detection accuracy and robustness than other methods. This study was thus able to extract water bodies from remotely sensed images with deep learning and to conduct an accuracy assessment.
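A baseline segment-then-classify flow can be sketched with scikit-image's standard SLIC (the study's modifications to SLIC are not reproduced here): superpixels first, then simple per-segment features standing in for CNN input.

```python
# A minimal sketch, assuming scikit-image's stock SLIC and a sample image.
import numpy as np
from skimage.segmentation import slic
from skimage import data

img = data.astronaut()                        # placeholder for a Sentinel-2 RGB subset
segments = slic(img, n_segments=500, compactness=10, start_label=1)
print(segments.max(), "superpixels")

# Per-superpixel mean colour as a trivial stand-in for CNN input features
feats = np.array([img[segments == s].mean(axis=0)
                  for s in range(1, segments.max() + 1)])
print(feats.shape)
```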

How to cite: Chou, S.: Deep learning for extracting water body from Sentinel-2 MSI imagery, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-3740, https://doi.org/10.5194/egusphere-egu21-3740, 2021.

EGU21-14869 | vPICO presentations | ESSI1.12

A new MODIS-Landsat fusion method to reconstruct Landsat NDVI time-series data

Xiaofang Ling and Ruyin Cao

The Normalized Difference Vegetation Index (NDVI) data provided by the Landsat satellites constitute a rich historical archive with a spatial resolution of 30 m. However, Landsat NDVI time-series data are quite discontinuous due to the 16-day revisit cycle, cloud contamination and other factors. Spatiotemporal data fusion technology has been proposed to reconstruct continuous Landsat NDVI time-series data by blending MODIS data with Landsat data. Although a number of spatiotemporal fusion algorithms have been developed during the past decade, most of the existing algorithms ignore the effective use of partially cloud-contaminated images. In this study, we present a new spatiotemporal fusion method that employs the cloud-free pixels in partially cloud-contaminated images to improve the performance of MODIS-Landsat data fusion by Correcting the inconsistency between MODIS and Landsat data in Spatiotemporal DAta Fusion (called CSDAF). We tested the new method at three sites covered by different vegetation types: deciduous forests in the Shennongjia Forestry District of China (SNJ), evergreen forests in Southeast Asia (SEA), and irrigated farmland in the Coleambally irrigated area of Australia (CIA). Two experiments were designed. In experiment I, we first simulated different cloud coverages in cloud-free Landsat images and then used both CSDAF and the recently developed IFSDAF method to restore these “missing” pixels for quantitative assessment. Results showed that CSDAF performed better than IFSDAF, achieving smaller average Root Mean Square Error (RMSE) values (0.0767 vs. 0.1116) and larger average Structural SIMilarity index (SSIM) values (0.8169 vs. 0.7180). In experiment II, we simulated the scenario of inconsistency between MODIS and Landsat by adding different levels of noise to the MODIS and Landsat data. Results showed that CSDAF was able to reduce, to some extent, the influence of the inconsistency between MODIS and Landsat data on MODIS-Landsat data fusion. Moreover, CSDAF is simple and can be implemented on Google Earth Engine. We expect that CSDAF can be used to reconstruct Landsat NDVI time-series data at regional and continental scales.
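The quantitative assessment used in experiment I can be sketched generically: RMSE and SSIM between a fusion-predicted image and the withheld cloud-free original. The data below are synthetic placeholders.

```python
# A minimal sketch of RMSE/SSIM evaluation, assuming scikit-image.
import numpy as np
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
truth = rng.uniform(0.0, 1.0, (200, 200))                # placeholder "true" NDVI
predicted = truth + rng.normal(0.0, 0.05, truth.shape)   # placeholder fused NDVI

rmse = np.sqrt(np.mean((predicted - truth) ** 2))
ssim = structural_similarity(truth, predicted, data_range=1.0)
print(f"RMSE={rmse:.4f}, SSIM={ssim:.4f}")
```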

How to cite: Ling, X. and Cao, R.: A new MODIS-Landsat fusion method to reconstruct Landsat NDVI time-series data, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-14869, https://doi.org/10.5194/egusphere-egu21-14869, 2021.

EGU21-10374 | vPICO presentations | ESSI1.12

Dynamic evaluation of small urban green areas at local level using GEOBIA

Ana-Maria Popa, Diana Andreea Onose, Ionut Cosmin Sandric, Simona Raluca Gradinaru, and Athanasios Alexandru Gavrilidis

Urban green infrastructure provides various benefits known as ecosystem services: regulating, cultural, provisioning and supporting services. Among these benefits are decreased air temperature, increased humidity and mitigation of the urban heat island (regulating services); human-nature relations (cultural services); improved air quality and carbon sequestration (provisioning services); and photosynthesis, nutrient and water cycling (supporting services). The high intensity of the urbanization process over the last decades, coupled with weak legislative frameworks, resulted both in large areas affected by urban sprawl and in the densification of the existing urban fabric. Both phenomena generated a loss of open spaces, especially green areas. In the context of the sustainable urbanization promoted by the HABITAT Agenda, knowledge of the distribution, size and quality of urban green areas is a priority. The aim of this study is to identify small urban green areas at the local level at different time moments for a dynamic evaluation. We focused on small urban green areas since they are scarcely analysed, even though their importance for urban quality of life is continuously increasing given the urbanization process. We used satellite imagery acquired by Planet satellite constellations, with a spatial resolution of 3.7 m and daily coverage, for extracting green areas. The images were processed using Geographic Object-Based Image Analysis (GEOBIA) techniques implemented in Esri ArcGIS Pro. The spatial analysis we performed generated information about the distribution, surface, quality (based on NDVI) and dynamics of small urban green areas. The results are connected with the local-level development of the urban areas we analysed, but also with the population’s consumption pattern for leisure services, housing, transport and other public utilities. The analysis can serve as a complementary method for extracting green areas at the urban level and can support data collection for calculating urban sustainability indicators.
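The NDVI-based quality screening mentioned above can be sketched as follows; the band order and the 0.3 threshold are assumptions for illustration, not the study's calibrated values.

```python
# A minimal sketch of NDVI computation and green-area masking.
import numpy as np

red = np.random.rand(512, 512).astype(np.float32)   # placeholder PlanetScope red band
nir = np.random.rand(512, 512).astype(np.float32)   # placeholder PlanetScope NIR band

ndvi = (nir - red) / np.clip(nir + red, 1e-6, None)
green_mask = ndvi > 0.3                  # candidate vegetated (green-area) pixels
print(f"{green_mask.mean():.1%} of pixels flagged as green")
```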

How to cite: Popa, A.-M., Onose, D. A., Sandric, I. C., Gradinaru, S. R., and Gavrilidis, A. A.: Dynamic evaluation of small urban green areas at local level using GEOBIA, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-10374, https://doi.org/10.5194/egusphere-egu21-10374, 2021.

EGU21-12722 | vPICO presentations | ESSI1.12

SemantiX: a cross-sensor semantic EO data cube to open and leverage AVHRR time-series and essential climate variables with scientists and the public 

Hannah Augustin, Martin Sudmanns, Helga Weber, Andrea Baraldi, Stefan Wunderle, Christoph Neuhaus, Steffen Reichel, Lucas van der Meer, Philipp Hummer, and Dirk Tiede

Long time series of essential climate variables (ECVs) derived from satellite data are key to climate research. SemantiX is a research project to establish, complement and expand Advanced Very High Resolution Radiometer (AVHRR) time series using Copernicus Sentinel-3 A/B imagery, making them and derived ECVs accessible through a semantic Earth observation (EO) data cube. The Remote Sensing Research Group at the University of Bern holds one of the longest European time series of AVHRR imagery (1981-present). Data cube technologies are a game changer for how EO imagery is stored, accessed, and processed. They also establish reproducible analytical environments for queries and information production and are able to better represent multi-dimensional systems. A semantic EO data cube is a concept recently coined by researchers at the University of Salzburg, referring to a spatio-temporal data cube containing EO data, where for each observation at least one nominal (i.e., categorical) interpretation is available and can be queried in the same instance (Augustin et al. 2019). Offering analysis-ready data (i.e., calibrated and orthorectified AVHRR Level 1c data) in a data cube along with semantic enrichment reduces barriers to conducting spatial analysis through time based on user-defined AOIs.

This contribution presents a semantic EO data cube containing selected ECV time series (i.e., snow cover extent, lake surface water temperature, vegetation dynamics) derived from AVHRR imagery (1981-2019), a temporal and spatial subset of AVHRR Level 1c imagery (updated after Hüsler et al. 2011) from 2016 until 2019, and, for the latter, semantic enrichment derived using the Satellite Image Automatic Mapper (SIAM). SIAM applies a fully automated, spectral rule-based routine based on a physical model to assign spectral profiles to colour names with known semantic associations; no user parameters are required, and the result is application-independent (Baraldi et al. 2010). Existing probabilistic cloud masks (Musial et al. 2014) generated by the Remote Sensing Research Group at the University of Bern are also included as additional data-derived information to support spatio-temporal semantic queries. This implementation is a foundational step towards the overall objective of combining climate-relevant AVHRR time series with Sentinel-3 imagery for the Austrian-Swiss alpine region, a European region that is currently experiencing serious changes due to climate change and will continue to face challenges well into the future.
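A semantic query against such a cube can be sketched hypothetically: each pixel carries a categorical colour-name label alongside the raw bands, so "snow-like over the AOI through time" becomes a simple mask-and-aggregate operation. All names and class codes below are assumptions, not SIAM's actual vocabulary.

```python
# A hypothetical sketch of a spatio-temporal semantic query, assuming xarray.
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2016-01-01", periods=48, freq="MS")
labels = xr.DataArray(
    np.random.randint(0, 5, (len(times), 100, 100)),   # placeholder SIAM-like classes
    dims=("time", "y", "x"), coords={"time": times},
)
SNOW_LIKE = 3                                          # assumed class code
snow_fraction = (labels == SNOW_LIKE).mean(dim=("y", "x"))  # per-date snow fraction
print(snow_fraction.to_series().head())
```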

Going forward, this semantic EO data cube will be linked to a mobile citizen science smartphone application. For the first time, scientists in disciplines unrelated to remote sensing, students, and interested members of the public will have direct, location-based access to these long EO data time series and derived information. SemantiX runs from August 2020 to 2022, funded by the Austrian Research Promotion Agency (FFG) under the Austrian Space Applications Programme (ASAP 16) (project #878939) in collaboration with the Swiss Space Office (SSO).

How to cite: Augustin, H., Sudmanns, M., Weber, H., Baraldi, A., Wunderle, S., Neuhaus, C., Reichel, S., van der Meer, L., Hummer, P., and Tiede, D.: SemantiX: a cross-sensor semantic EO data cube to open and leverage AVHRR time-series and essential climate variables with scientists and the public , EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-12722, https://doi.org/10.5194/egusphere-egu21-12722, 2021.

EGU21-10882 | vPICO presentations | ESSI1.12

Using Sentinel-5P time-series products for Nitrogen Dioxide (NO2) Spatio-Temporal Analysis over Europe During the Coronavirus Pandemic Lockdown

Marina Vîrghileanu, Ionuț Săvulescu, Bogdan-Andrei Mihai, Constantin Nistor, and Robert Dobre

Nitrogen dioxide (NO2) is one of the main air quality pollutants of concern in many urban and industrial areas worldwide. Emitted by fossil fuel burning activities, mainly road traffic, NO2 pollution is responsible for the degradation of population health and for the formation of secondary pollutants such as nitric acid and ozone. In 2017, almost 20 countries in the European region exceeded the NO2 annual limit values imposed by European Commission Directive 2008/50/EC (EEA, 2019). Therefore, monitoring and regulating NO2 pollution is a necessary task to help decision makers search for sustainable solutions to improve environmental quality and population health. In this study, we propose a comparative analysis of the tropospheric NO2 column density spatial configuration over Europe between similar periods of 2019 and 2020, based on ESA Copernicus Sentinel-5P products. Our results highlight the NO2 pollution dynamics over the abrupt transition from normal conditions to the COVID-19 outbreak context, characterized by a short-term decrease in traffic intensity and industrial activity, a situation also reflected in national-level statistics on COVID-19 cases and economic indicators. The validation approach shows a high correlation between TROPOMI-derived data and independent data from ground-based observations, with encouraging R2 values ranging between 0.5 and 0.75 at different locations.
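The validation step can be sketched generically as a regression of satellite retrievals against co-located ground observations; the values below are synthetic placeholders.

```python
# A minimal sketch of the R2 validation, assuming scipy.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
ground = rng.uniform(5, 60, 100)                    # placeholder station NO2 values
tropomi = 0.8 * ground + rng.normal(0, 8, 100)      # placeholder satellite retrievals

fit = linregress(ground, tropomi)
print(f"R^2 = {fit.rvalue**2:.2f}")                 # the study reports 0.5-0.75
```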

How to cite: Vîrghileanu, M., Săvulescu, I., Mihai, B.-A., Nistor, C., and Dobre, R.: Using Sentinel-5P time-series products for Nitrogen Dioxide (NO2) Spatio-Temporal Analysis over Europe During the Coronavirus Pandemic Lockdown, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-10882, https://doi.org/10.5194/egusphere-egu21-10882, 2021.

EGU21-14320 | vPICO presentations | ESSI1.12

Improving the Quality Assessment band in Landsat cloud images for the application of cloud removal  

Boli Yang, Yan Feng, and Ruyin Cao

Cloud contamination is a serious obstacle to the application of Landsat data. Thick clouds can completely block land surface information and lead to missing values. The reconstruction of missing values in a Landsat cloud image requires a cloud and cloud shadow mask. In this study, we raise the issue that the quality of the quality assessment (QA) band in current Landsat products cannot meet the requirements of thick-cloud removal. To address this issue, we developed a new method (called Auto-PCP) to preprocess the original QA band, with the ultimate objective of improving the performance of cloud removal on Landsat cloud images. We tested the new method at four test sites and compared cloud-removed images generated using three different QA bands: the original QA band, the QA band modified by a dilation of two pixels around cloud and cloud shadow edges, and the QA band processed by Auto-PCP (“QA_Auto-PCP”). Experimental results, from both actual and simulated Landsat cloud images, show that QA_Auto-PCP achieved the best visual assessment of the cloud-removed images, and had the smallest RMSE values and the largest Structural SIMilarity index (SSIM) values. QA_Auto-PCP improves the performance of cloud removal because the new method substantially decreases omission errors of clouds and shadows in the original QA band while not increasing commission errors. Moreover, Auto-PCP is easy to implement and uses the same data as cloud removal, without additional image collections. We expect that Auto-PCP can further popularize cloud removal and advance the application of Landsat data.
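Working with a Landsat QA band can be sketched as bit decoding plus mask dilation. The bit positions below follow the Collection 2 QA_PIXEL convention (bit 3 cloud, bit 4 cloud shadow) and are an assumption to be checked against the product in use; the dilation mirrors the two-pixel baseline the new method is compared against, not Auto-PCP itself.

```python
# A minimal sketch of QA-band decoding and the dilated baseline mask.
import numpy as np
from scipy.ndimage import binary_dilation

qa = np.random.randint(0, 2**16, (100, 100), dtype=np.uint16)  # placeholder QA band

cloud = ((qa >> 3) & 1).astype(bool)     # assumed cloud bit (Collection 2)
shadow = ((qa >> 4) & 1).astype(bool)    # assumed cloud-shadow bit
mask = cloud | shadow

dilated = binary_dilation(mask, iterations=2)   # the "+2 pixels" baseline mask
print(mask.mean(), dilated.mean())
```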


Keywords: Cloud detection, Cloud shadows, Cloud simulation, Cloud removal, MODTRAN

How to cite: Yang, B., Feng, Y., and Cao, R.: Improving the Quality Assessment band in Landsat cloud images for the application of cloud removal  , EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-14320, https://doi.org/10.5194/egusphere-egu21-14320, 2021.

EGU21-10892 | vPICO presentations | ESSI1.12

Spatiotemporal tomography based on scattered multiangular signals and its use for resolving evolving clouds using moving platforms

R. Ronen, Y. Y. Schechner, and E. Eytan

The climate is strongly affected by interaction with clouds. To reduce major errors in climate predictions, this interaction requires a much finer understanding of cloud physics than current knowledge provides. Current knowledge is based on empirical remote sensing data that are analyzed under the assumption that the atmosphere and clouds are made of very broad and uniform layers. To help overcome this problem, 3D scattering computed tomography (CT) has been suggested as a way to study clouds.

CT is a powerful way to recover the inner structure of three dimensional (3D) volumetric heterogeneous objects. CT has extensive use in many research and operational domains. Aside from its common usage in medicine, CT is used for sensing geophysical terrestrial structures, atmospheric pollution and fluid dynamics. CT requires imaging from multiple directions and in nearly all CT approaches, the object is considered static during image acquisition. However, in many cases, the object changes while multi-view images are acquired sequentially. Thus, an effort has been invested to expand 3D CT to four-dimensional (4D) spatiotemporal CT. This effort has been directed at linear CT modalities. Since linear CT is computationally easier to handle, it has been a popular method for medical imaging. However, these linear CT modalities do not apply to clouds: clouds constitute a scattering medium, and therefore radiative transfer is non-linear in the clouds’ content.

This work focuses on the challenge of 4D scattering CT of clouds. Scattering CT of clouds requires high-resolution multi-view images from space. There are spaceborne and high-altitude systems that may provide such data, for example AirMSPI, MAIA, HARP and AirHARP. An additional planned system is the CloudCT formation, funded by the ERC. However, these systems are costly. Deploying them in large numbers to simultaneously acquire images of the same clouds from many angles can be impractical. Therefore, the platforms are planned to move above the clouds: a sequence of images is taken, in order to span and sample a wide angular breadth. However, the clouds evolve while the angular span is sampled.

We pose conditions under which this task can be performed. These regard temporal sampling and angular breadth, in relation to the correlation time of the evolving cloud. Then, we generalize scattering CT. The generalization seeks spatiotemporal recovery of the cloud extinction field in high resolution (10m), using data taken by a small number of moving cameras. We present an optimization-based method to reach this, and then demonstrate the method both in rigorous simulations and on real data.

How to cite: Ronen, R., Schechner, Y. Y., and Eytan, E.: Spatiotemporal tomography based on scattered multiangular signals and its use for resolving evolving clouds using moving platforms, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-10892, https://doi.org/10.5194/egusphere-egu21-10892, 2021.

EGU21-12831 | vPICO presentations | ESSI1.12

The potential of monitoring traffic conditions up to 15 times a day using sub-meter resolution EO images

Refiz Duro, Georg Neubauer, and Alexandra-Ioana Bojor

Urbanization and the trend of people moving to cities often lead to problematic traffic conditions, which can be very challenging for traffic management. Poor traffic conditions can hamper the flow of people and goods, negatively affecting businesses through delays and the inability to estimate travel times and plan accordingly, as well as the environment and the health of the population due to increased fuel consumption and the subsequent air pollution. Many cities have policies and rules to manage traffic, ranging from standard traffic lights to more dynamic and adaptable solutions involving in-road sensors or cameras that actively modify the duration of traffic lights, or even more sophisticated IoT solutions to monitor and manage conditions on a city-wide scale. Core to these technologies and to decision-making processes is the availability of reliable data on traffic conditions, ideally in real time. Many cities are thus still coping with a lack of good spatial and temporal data coverage, as many of these solutions require not only changes to the infrastructure but also large investments.

One approach is to exploit the current and forthcoming advancements made available by Earth Observation (EO) satellite technologies. The biggest advantage is EO’s great spatial coverage, ranging from a few km² to 100 km² per image at a spatial resolution down to 0.3 m, allowing for quick, city-spanning data collection. Furthermore, the availability of imaging sensors covering specific bands allows the constituent information within an image to be separated and leveraged.

In this respect, we present the findings of our work on multispectral image sets collected on three occasions in 2019 by the very high resolution WorldView-3 satellite. We apply a combination of machine learning and PCA methods to detect vehicles and derive their kinematic properties (e.g., movement, direction, speed), which is only possible with satellites whose specific design allows for short time lags between imaging in different spectral bands. As these data essentially constitute a time series, we will discuss how the results presented fully apply to the forthcoming WorldView-Legion constellation of satellites, which will provide up to 15 revisits per day, and thus near-real-time traffic monitoring and assessment of its impact on the environment.
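The band time-lag principle can be illustrated with a toy example (not the authors' actual detector): a moving vehicle occupies slightly different positions in bands acquired milliseconds apart, so PCA across co-registered bands concentrates that disagreement in the trailing components; real detection adds an ML classifier on top.

```python
# A simplified sketch of inter-band motion signatures via PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
h, w, n_bands = 120, 120, 4
bands = np.repeat(rng.uniform(0.1, 0.3, (h, w, 1)), n_bands, axis=2)  # static scene
for b in range(n_bands):          # bright "vehicle" shifted by 1 px per band
    bands[60, 40 + b, b] = 1.0

pca = PCA(n_components=n_bands)
comps = pca.fit_transform(bands.reshape(-1, n_bands)).reshape(h, w, n_bands)
motion = np.abs(comps[..., 1:]).sum(axis=2)       # residual inter-band disagreement
print(np.unravel_index(motion.argmax(), motion.shape))  # near the moving target
```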

How to cite: Duro, R., Neubauer, G., and Bojor, A.-I.: The potential of monitoring traffic conditions up to 15 times a day using sub-meter resolution EO images, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-12831, https://doi.org/10.5194/egusphere-egu21-12831, 2021.

EGU21-13727 | vPICO presentations | ESSI1.12

ICEComb − A New Software Tool for Satellite Laser Altimetry Data Processing and Visualisation

Bruno Silva, Luiz Guerreiro Lopes, and Pedro Campos

Processing, handling and visualising the large data volumes produced by satellite altimetry missions is a challenging task. A reference tool for the visualisation of satellite laser altimetry data is the OpenAltimetry platform, which provides altimetry-specific data from the Ice, Cloud, and land Elevation Satellite (ICESat) and ICESat-2 satellite missions through a web-based interactive interface. However, by focusing only on altimetry data, that tool leaves out access to much other equally important information existing in the data products of both missions.

The main objective of the work reported here was the development of a new web-based tool, called ICEComb, that offers end users the ability to access all the available data from both satellite missions, visualise and interact with them on a geographic map, store the data records locally, and process and explore data in an efficient, detailed and meaningful way, thus providing an easy-to-use software environment for satellite laser altimetry data analysis and interpretation.

The proposed tool is intended mainly for researchers and scientists working with ICESat and ICESat-2 data, offering users a ready-to-use system to rapidly access the raw collected data in a visually engaging way, without requiring prior understanding of the format, structure and parameters of the data products. In addition, the architecture of the ICEComb tool was developed with possible future expansion in mind, for which well-documented and standard languages were used in its implementation. This makes it possible, e.g., to extend its applicability to data from other satellite laser altimetry missions and to integrate models that can be coupled with ICESat and ICESat-2 data, thus expanding and enriching the context of studies carried out with such data.

The use of the ICEComb tool is illustrated and demonstrated by its application to ICESat/GLAS measurements over Lake Mai-Ndombe, a large and shallow freshwater lake located within the Ngiri-Tumba-Maindombe area, one of the largest Ramsar wetlands of international importance, situated in the Cuvette Centrale region of the Congo Basin.

Keywords: Laser altimetry, ICESat/GLAS, software tool design, data visualization, Congo Basin.

Acknowledgement. This work was partially supported by the Portuguese Foundation for Science and Technology (FCT) through LARSyS − FCT Pluriannual funding 2020−2023.

How to cite: Silva, B., Guerreiro Lopes, L., and Campos, P.: ICEComb − A New Software Tool for Satellite Laser Altimetry Data Processing and Visualisation, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-13727, https://doi.org/10.5194/egusphere-egu21-13727, 2021.

EGU21-13762 | vPICO presentations | ESSI1.12

Landsat thermal infrared to detect sub-canopy riparian flooding

Emanuel Storey, Witold Krajewski, and Efthymios Nikolopoulos

Satellite-based flood detection can enhance understanding of risk to humans and infrastructure, geomorphic processes, and ecological effects. Such application of optical satellite imagery has been mostly limited to the detection of water exposed to the sky, as plant canopies tend to obstruct the visibility of water at short electromagnetic wavelengths. This case study evaluates the utility of multi-temporal thermal infrared observations from Landsat 8 as a basis for detecting sub-canopy fluvial inundation that results in ambient temperature change.

We selected three flood events of 2016 and 2019 along sections of the Mississippi, Cedar, and Wapsipinicon Rivers located in Iowa, Minnesota, and Wisconsin, United States.  Classification of sub-canopy water involved logical, threshold-exceedance criteria to capture thermal decline within channel-adjacent vegetated zones.  Open water extent in the floods was mapped based on short-wave infrared thresholds determined parametrically from baseline (non-flooded) observations.  Map accuracy was evaluated using higher-resolution (0.5–5.0 m) synchronic optical imagery.
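The threshold-exceedance logic can be sketched as follows: flag vegetated, channel-adjacent pixels whose temperature dropped sharply relative to a non-flooded baseline. The -5 K threshold and the synthetic scene are assumptions, not the study's calibrated values.

```python
# A minimal sketch of sub-canopy flood flagging via thermal decline.
import numpy as np

rng = np.random.default_rng(0)
t_baseline = rng.normal(300.0, 1.5, (300, 300))     # non-flooded thermal scene (K)
t_flood = t_baseline + rng.normal(0.0, 0.5, (300, 300))
t_flood[100:140, 50:250] -= 7.0                     # simulated cooling under canopy

vegetated = np.ones((300, 300), dtype=bool)         # placeholder canopy mask
near_channel = np.ones((300, 300), dtype=bool)      # placeholder channel buffer

subcanopy_flood = vegetated & near_channel & ((t_flood - t_baseline) < -5.0)
print(f"{subcanopy_flood.sum()} pixels flagged as sub-canopy inundation")
```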

Results demonstrate improved ability to detect sub-canopy inundation when thermal infrared change is incorporated: sub-canopy flood class accuracy was comparable to that of open water in previous studies.  The multi-temporal open-water mapping technique yielded high accuracy as compared to similar studies.  This research highlights the utility of Landsat thermal infrared data for monitoring riparian inundation and for validating other remotely sensed and simulated flood maps.

How to cite: Storey, E., Krajewski, W., and Nikolopoulos, E.: Landsat thermal infrared to detect sub-canopy riparian flooding, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-13762, https://doi.org/10.5194/egusphere-egu21-13762, 2021.

EGU21-7932 | vPICO presentations | ESSI1.12

Use of multispectral remote sensing data to map magnetite bodies in the Bushveld Complex, South Africa: a case study of Roossenekal, Limpopo

M. Twala, J. Roberts, and C. Munghemezulu

The use of remote sensing in mineral detection and lithological mapping has become a generally accepted augmentative tool in exploration. With the advent of multispectral sensors (e.g. ASTER, Landsat, Sentinel and PlanetScope) having suitable wavelength coverage and bands in the Shortwave Infrared (SWIR) and Thermal Infrared (TIR) regions, these sensors have become increasingly efficient at routine lithological discrimination and mineral potential mapping. It is with this paradigm in mind that this project sought to evaluate and discuss the detection and mapping of vanadium-bearing magnetite, found in discordant bodies and magnetite layers, on the Eastern Limb of the Bushveld Complex. The Bushveld Complex hosts the world’s largest resource of high-grade primary vanadium in magnetitite layers, so the wide distribution of magnetite, its economic importance, and its potential as an indicator of many important geological processes warranted its delineation.


The detection and mapping of the vanadium-bearing magnetite was evaluated using both specialized traditional and advanced machine learning algorithms. Prior to this study, few studies had looked at the detection and exploration of magnetite using remote sensing, despite remote sensing tools having been regularly applied to diverse aspects of the geosciences. Maximum Likelihood, Minimum Distance to Means, Artificial Neural Network and Support Vector Machine classification algorithms were assessed for their respective abilities to detect and map magnetite using PlanetScope data in ENVI, QGIS and Python. For each classification algorithm, a thematic landcover map was obtained and the accuracy assessed using an error matrix, depicting the user's and producer's accuracies, as well as kappa statistics.
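The accuracy assessment can be sketched generically: an error matrix with per-class user's and producer's accuracies plus Cohen's kappa, computed from validation labels. The labels below are synthetic placeholders.

```python
# A minimal sketch of the error-matrix assessment, assuming scikit-learn.
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

rng = np.random.default_rng(0)
reference = rng.integers(0, 4, 1000)                    # validation classes
predicted = np.where(rng.random(1000) < 0.85, reference,
                     rng.integers(0, 4, 1000))          # placeholder classifier output

cm = confusion_matrix(reference, predicted)
producers = np.diag(cm) / cm.sum(axis=1)   # per-class accuracy w.r.t. the reference
users = np.diag(cm) / cm.sum(axis=0)       # per-class accuracy w.r.t. the predictions
print(cm, producers, users, sep="\n")
print("kappa =", cohen_kappa_score(reference, predicted))
```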
The Maximum Likelihood Classifier significantly outperformed the other techniques, achieving an overall classification accuracy of 84.58% and an overall kappa value of 0.79. Magnetite was accurately discriminated from the other thematic landcover classes with a user’s accuracy of 76.41% and a producer’s accuracy of 88.66%. The erroneous classification of some mining-activity pixels as magnetite, observed with Maximum Likelihood, was common to all the classification algorithms. The overall results of this study illustrate that remote sensing techniques are effective instruments for geological mapping and mineral investigation, especially for iron-oxide mineralization in the Eastern Limb of the Bushveld Complex.
How to cite: Twala, M., Roberts, J., and Munghemezulu, C.: Use of multispectral remote sensing data to map magnetite bodies in the Bushveld Complex, South Africa: a case study of Roossenekal, Limpopo., EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-7932, https://doi.org/10.5194/egusphere-egu21-7932, 2021.

EGU21-7895 | vPICO presentations | ESSI1.12

Application of optical and radar data for lineaments mapping in Kerdous inlier of the Anti Atlas belt, Morocco

Amine Jellouli, Abderrazak El Harti, Zakaria Adiri, Mohcine Chakouri, Jaouad El Hachimi, and El Mostafa Bachaoui

Lineament mapping is an important step in lithological and hydrothermal alteration mapping. It is an efficient research task that can form part of structural investigations and the identification of mineral ore deposits. The availability of optical as well as radar remote sensing data, such as Landsat 8 OLI, Terra ASTER and ALOS PALSAR data, allows lineament mapping at regional and national scales. The accuracy of the obtained results depends strongly on the spatial and spectral resolution of the data. The aim of this study was to compare Landsat 8 OLI, Terra ASTER and radar ALOS PALSAR satellite data for automatic and manual lineament extraction. The LINE module of the PCI Geomatica software was applied to the PC1 OLI, PC3 ASTER and HH and HV polarization images to extract geological lineaments automatically. Manual extraction was achieved using the RGB color composite of the directionally filtered images, N–S (0°), NE–SW (45°) and E–W (90°), of the OLI panchromatic band 8. The lineaments obtained from automatic and manual extraction were compared against the faults and photo-geological lineaments digitized from the existing geological map of the study area. The lineaments extracted from the PC1 OLI and ALOS PALSAR polarization images showed the best correlation with the faults and photo-geological lineaments. The results indicate that the HH and HV polarizations of the ALOS PALSAR radar data, with 1499 and 1507 extracted lineaments respectively, were the most effective for structural lineament mapping, followed by the PC1 OLI image with 1057 lineaments.
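
The directional-filter step lends itself to a short sketch; the 3x3 kernels below are one common choice of compass edge filters and merely stand in for PCI Geomatica's directional filters, while the input array is a placeholder for the OLI panchromatic band:

```python
import numpy as np
from scipy.ndimage import convolve

pan = np.random.rand(512, 512)   # stand-in for OLI panchromatic band 8

# Simple 3x3 directional difference kernels (illustrative; naming conventions
# for filter direction vs. enhanced edge direction vary between packages).
k_ns = np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], float)     # 0 deg
k_nesw = np.array([[0, -1, -1], [1, 0, -1], [1, 1, 0]], float)   # 45 deg
k_ew = k_ns.T                                                    # 90 deg

def stretch(a):
    """Linear stretch to [0, 1] for display as an RGB composite."""
    return (a - a.min()) / (a.max() - a.min())

rgb = np.dstack([stretch(np.abs(convolve(pan, k)))
                 for k in (k_ns, k_nesw, k_ew)])
```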

Keywords: Remote Sensing, OLI, ALOS PALSAR, ASTER, Kerdous Inlier, Anti Atlas

How to cite: Jellouli, A., El Harti, A., Adiri, Z., Chakouri, M., El Hachimi, J., and Bachaoui, E. M.: Application of optical and radar data for lineaments mapping in Kerdous inlier of the Anti Atlas belt, Morocco, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-7895, https://doi.org/10.5194/egusphere-egu21-7895, 2021.

EGU21-6438 | vPICO presentations | ESSI1.12

Improving the Classification Accuracy of Fragmented Cropland by using an Advanced Classification Algorithm

Dr. Shreedevi Moharana, Dr. BVNP Kambhammettu, Mr. Syam Chintala, Ms. Arjangi Sandhya Rani, and Dr. Ram Avtar

Fragmented croplands and marginal landholdings complicate land-use classification, as they host diverse cropping and management practices. Implementing crop classification algorithms in such settings is difficult and tends to produce results of lower accuracy. Optical imagery is often contaminated by cloud cover and fails to detect the phenological as well as structural changes occurring during crop growth; this is very common under Indian climatic conditions, where acquiring a temporal sequence of satellite images over the monsoon cropping period is a very challenging task. Therefore, the present study applies a novel crop classification algorithm that utilizes the temporal patterns of synthetic aperture radar (SAR) datasets from Sentinel-1 to map land use in an agricultural system that is fragmented, small-scale and heterogeneous in nature. We used different polarizations of Sentinel-1 data to develop the temporal patterns of the different crops grown in a semi-arid region of India. An advanced classification algorithm, time-weighted dynamic time warping (TWDTW), was then employed to classify the cropland with higher accuracy (a sketch of the distance computation is given below). Pixel-based image analysis was carried out and tested for its applicability to cropland mapping. In-situ datasets collected at the study site were used to validate the classification outputs. The pixel-based TWDTW method achieved an overall accuracy of 63% and a kappa coefficient of 0.58. The findings confirm that the pixel-based TWDTW algorithm has the potential to delineate croplands subjected to varying irrigation treatments and management practices using Sentinel-1 datasets.
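
A minimal sketch of the time-weighted DTW distance, using the logistic time weight of Maus et al. (2016); the parameter values and sequences are illustrative, and the abstract does not specify the exact weighting used:

```python
import numpy as np

def twdtw(x, tx, y, ty, alpha=0.1, beta=100.0):
    """Time-weighted DTW distance between two irregularly sampled series.

    x, y   : feature sequences (e.g. Sentinel-1 VH backscatter)
    tx, ty : acquisition times in days; a logistic penalty discourages
             matching observations far apart in the season.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dt = abs(tx[i - 1] - ty[j - 1])
            w = 1.0 / (1.0 + np.exp(-alpha * (dt - beta)))  # logistic time weight
            cost = abs(x[i - 1] - y[j - 1]) + w
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A pixel is assigned the crop whose temporal pattern minimizes this distance.
```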

Keywords: crop classification, landuse, image analysis, Sentinel-1, TWDTW

How to cite: Moharana, Dr. S., Kambhammettu, Dr. B., Chintala, Mr. S., Sandhya Rani, Ms. A., and Avtar, Dr. R.: Improving the Classification Accuracy of Fragmented Cropland by using an Advanced Classification Algorithm, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-6438, https://doi.org/10.5194/egusphere-egu21-6438, 2021.

ESSI1.19 – Stability and Accuracy of Earth satellite measurements through calibration and validation

EGU21-1327 | vPICO presentations | ESSI1.19

Community geometric standards for remote sensing products

Guoqing (Gary) Lin, Robert Wolfe, Bin Tan, and Jaime Nickeson

We have developed a set of geometric standards for assessing Earth-observing data products derived from space-borne remote sensors.  We have worked with the European Space Agency (ESA) Earthnet Data Assessment Pilot (EDAP) project to provide a set of guidelines to assess geometric performance in data products from commercial electro-optical remote sensors aboard satellites such as those from Planet Labs. The guidelines, or standards, are based on the performance of a few NASA-procured sensors, such as the Moderate Resolution Imaging Spectroradiometer (MODIS) sensors, the Visible Infrared Imaging Radiometer Suite (VIIRS) sensors and the Advanced Baseline Imager (ABI) sensors. The standards include sensor spatial response, absolute positional accuracy, and band-to-band co-registration. They are tiered into “basic”, “intermediate” and “goal” criteria. These are important geometric factors affecting the scientific use of remote sensing data products. We also discuss possible approaches to achieving the highest goal in the geometric performance standards.

How to cite: Lin, G., Wolfe, R., Tan, B., and Nickeson, J.: Community geometric standards for remote sensing products, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-1327, https://doi.org/10.5194/egusphere-egu21-1327, 2021.

EGU21-7919 | vPICO presentations | ESSI1.19

Calibration and Validation of Infrared Sounders with Moon and Mercury

Martin Burgdorf, Stefan A. Buehler, Viju John, Thomas Müller, and Marc Prange

Serendipitous observations of airless bodies of the inner solar system provide a unique means for calibrating instruments on meteorological research satellites, because the physical properties of their surfaces change very little, even on large time scales. We investigated how certain instrumental effects can be characterised with observations of the Moon and Mercury. For this we identified and analysed intrusions of the Moon in the deep-space views of HIRS/2, /3, and /4 (High-resolution Infrared Sounder) on various satellites in polar orbits, as well as some images obtained with SEVIRI (Spinning Enhanced Visible Infra-Red Imager) on MSG-3 and -4 (Meteosat Second Generation) in which Mercury stood close to the Earth in the rectangular field of view.

A full-disk, infrared Moon model was developed that describes how the lunar flux density depends on phase angle and wavelength. It is particularly helpful for inter-calibration, checks of the photometric consistency of the sounding channels, and the calculation of an upper limit on the non-linearity of the shortwave channels of HIRS. In addition, we used the Moon to determine the co-registration of the different spectral channels.

Studies of the channel alignment are also presented for SEVIRI, whose angular resolution is about a hundred times finer than that of HIRS. As we wanted to check the image quality of this instrument with a quasi-point source as well, we replaced the Moon with Mercury here. We found the typical smearing of the point spread function in the scan direction and occasionally a nearby ghost image, three to four times fainter than the main image of the planet. Both effects cause additional uncertainties in the photometric calibration.

How to cite: Burgdorf, M., Buehler, S. A., John, V., Müller, T., and Prange, M.: Calibration and Validation of Infrared Sounders with Moon and Mercury, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-7919, https://doi.org/10.5194/egusphere-egu21-7919, 2021.

EGU21-8481 | vPICO presentations | ESSI1.19

Stratospheric Aerosol and Gas Experiment III on the International Space Station (SAGE III/ISS) Newly Released V5.2 Validation of Ozone and Water Vapor Data

Susan Kizer, David Flittner, Marilee Roell, Robert Damadeo, Carrie Roller, Dale Hurst, Emrys Hall, Allen Jordan, Patrick Cullis, Bryan Johnson, and Richard Querel

The Stratospheric Aerosol and Gas Experiment III (SAGE III) instrument installed on the International Space Station (ISS) has completed over three and a half years of data collection and production of science data products. SAGE III/ISS is a solar and lunar occultation instrument that scans the light from the Sun and Moon through the limb of the Earth’s atmosphere to produce vertical profiles of aerosol, ozone, water vapor, and other trace gases. It continues the legacy of previous SAGE instruments dating back to the 1970s to provide data continuity of stratospheric constituents critical for assessing trends in the ozone layer. This presentation shows the validation results of comparing SAGE III/ISS ozone and water vapor vertical profiles from the newly released v5.2 science product with in situ and satellite data.

How to cite: Kizer, S., Flittner, D., Roell, M., Damadeo, R., Roller, C., Hurst, D., Hall, E., Jordan, A., Cullis, P., Johnson, B., and Querel, R.: Stratospheric Aerosol and Gas Experiment III on the International Space Station (SAGE III/ISS) Newly Released V5.2 Validation of Ozone and Water Vapor Data, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-8481, https://doi.org/10.5194/egusphere-egu21-8481, 2021.

EGU21-8702 | vPICO presentations | ESSI1.19

Atmospheric Correction Inter-comparison eXercise: the second implementation 

Georgia Doxani, Eric F. Vermote, Sergii Skakun, Ferran Gascon, and Jean-Claude Roger

The atmospheric correction inter-comparison exercise (ACIX) is an international initiative to benchmark various state-of-the-art atmospheric correction (AC) processors. The first inter-comparison exercise was initiated in 2016 as a collaboration between the European Space Agency (ESA) and the National Aeronautics and Space Administration (NASA) in the frame of the CEOS WGCV (Committee on Earth Observation Satellites, Working Group on Calibration & Validation). The evolution of the participating processors and the increasing interest of the AC community in repeating and improving such an experiment stimulated the continuation of ACIX with a second implementation (ACIX-II). In particular, 12 AC developer teams from Europe and the USA participated in ACIX-II over land sites. In this presentation the benchmarking protocol, i.e. test sites, input data, inter-comparison metrics, etc., will be briefly described and some representative results of ACIX-II will be presented. The inter-comparison outputs varied depending on the sensors, products and sites, demonstrating the strengths and weaknesses of the corresponding processors. In continuation of the ACIX-I achievements, the outcomes of the second exercise are expected to provide an enhanced standardised approach to inter-compare AC processing products, i.e. Aerosol Optical Thickness (AOT), Water Vapour (WV) and Surface Reflectance (SR), and to quantitatively assess their quality where in situ measurements are available.

How to cite: Doxani, G., Vermote, E. F., Skakun, S., Gascon, F., and Roger, J.-C.: Atmospheric Correction Inter-comparison eXercise: the second implementation, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-8702, https://doi.org/10.5194/egusphere-egu21-8702, 2021.

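EGU21-8731 | vPICO presentations | ESSI1.19

Satellite Calibration/Validation and Related Activities Carried out through NASA/ESA Joint Program Planning Group Subgroup

Jack Kaye and Malcolm Davidson
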
The NASA/ESA Joint Program Planning Group (JPPG) subgroup on satellite calibration/validation was created to facilitate coordinated efforts between ESA, NASA, and their respective investigator communities to enhance calibration and/or validation activities for current and/or future satellite missions. The cooperation enabled through this activity includes airborne campaigns, use of surface-based measurements, and satellite-to-satellite intercomparisons. Numerous examples of such activities exist over the ten years of the JPPG. In this talk, examples of calibration/validation focused activities, accomplishments, and future plans will be presented. A particular focus will be on how the COVID-19 pandemic has affected field work planned for 2020 and 2021.  The JPPG subgroup also includes joint European-US studies of satellite results that integrate the results of both parties’ observational capabilities, and the status of those activities will be presented as well.

How to cite: Kaye, J. and Davidson, M.: Satellite Calibration/Validation and Related Activities Carried out through NASA/ESA Joint Program Planning Group Subgroup, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-8731, https://doi.org/10.5194/egusphere-egu21-8731, 2021.

EGU21-9030 | vPICO presentations | ESSI1.19

Characterization of the in-flight spectral response function of Geostationary Environment Monitoring Spectrometer (GEMS) retrieved using observed solar irradiance

Mina Kang, Myoung-Hwan Ahn, Dai Ho Ko, Jhoon Kim, Dennis Nicks, Mijin Eo, Yeeun Lee, Kyung-Jung Moon, and Dong-Won Lee

The successful launch of the Geostationary Environment Monitoring Spectrometer (GEMS) onboard the Geostationary Korea Multipurpose Satellite 2B (GK-2B) opens up a new possibility to provide daily air quality information on trace gases and aerosols over East Asia with high spatiotemporal resolution. As part of the major efforts to calibrate and validate the performance of GEMS, accurate characterization of the spectral response functions (SRFs) is critical. The characteristics of the preflight SRFs, examined in terms of shape, width, skewness, and kurtosis, vary smoothly along both the spectral and spatial directions thanks to the highly symmetrical optical system of GEMS. While the preflight SRFs were determined with high accuracy, the in-flight SRFs may have changed during the harsh launch process and may change during operations over the mission lifetime. Thus, it is important to verify the in-flight SRFs after launch and to continue monitoring their variability over time to assure reliable trace gas retrievals. Here, we retrieve the in-flight SRFs over the full spectral and spatial domain of GEMS by spectrally fitting the observed daily solar measurements against a high-resolution solar reference spectrum. A variety of analytic model functions, including a hybrid of Gaussian and flat-topped functions, an asymmetric super-Gaussian, and a Voigt function, are tested to determine the function best representing the GEMS SRF. The SRFs retrieved from early solar irradiances measured during the in-orbit tests agree well with the preflight SRFs, indicating that no significant change occurred during the launch process. Continuous monitoring of the in-flight SRF is planned, using daily solar irradiances to investigate the temporal variation along the spectral and spatial directions. The detailed results of the in-flight SRF retrieval are to be presented.
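
As an illustration of one candidate model function, the sketch below fits an asymmetric super-Gaussian to a sampled SRF with scipy; the parametrisation and sample values are hypothetical, and the operational retrieval instead fits SRF-convolved solar reference spectra to the measured irradiance:

```python
import numpy as np
from scipy.optimize import curve_fit

def super_gaussian(wl, wl0, w_left, w_right, k):
    """Asymmetric super-Gaussian: flat-topped for k > 2, width differing
    on either side of the centre wavelength wl0."""
    w = np.where(wl < wl0, w_left, w_right)
    return np.exp(-np.abs(2.0 * (wl - wl0) / w) ** k)

# Hypothetical sampled SRF around a 430 nm channel centre, with noise.
rng = np.random.default_rng(3)
wl = np.linspace(429.0, 431.0, 81)
srf = super_gaussian(wl, 430.0, 0.6, 0.7, 3.0) + rng.normal(0.0, 0.005, wl.size)

popt, _ = curve_fit(super_gaussian, wl, srf, p0=(430.0, 0.5, 0.5, 2.5))
print(dict(zip(("wl0", "w_left", "w_right", "k"), popt)))
```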

How to cite: Kang, M., Ahn, M.-H., Ko, D. H., Kim, J., Nicks, D., Eo, M., Lee, Y., Moon, K.-J., and Lee, D.-W.: Characterization of the in-flight spectral response function of Geostationary Environment Monitoring Spectrometer (GEMS) retrieved using observed solar irradiance, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-9030, https://doi.org/10.5194/egusphere-egu21-9030, 2021.

EGU21-9360 | vPICO presentations | ESSI1.19

Characterization of GEMS level 1B based on inter-comparison using the visible channel of AMI

Yeeun Lee, Myoung-Hwan Ahn, Mijin Eo, Mina Kang, Kyung-jung Moon, Dai-Ho Ko, Jhoon Kim, and Dong-won Lee

The Geostationary Korean Multi-Purpose Satellite (GK-2) program, consisting of GK-2A and GK-2B, provides consistent monitoring information for the Asia-Pacific region, including the Korean peninsula. The Geostationary Environment Monitoring Spectrometer (GEMS) onboard GK-2B in particular provides information on atmospheric composition and aerosol properties, retrieved from the calibrated radiance (Level 1B) with high spectral resolution in 300–500 nm. GEMS started its extended validation measurements after the in-orbit test (IOT) in October, following the launch of the satellite in February 2020. One of the issues found during the IOT is that GEMS shows a spatial dependence of the measured solar irradiance along the north-south direction, although the solar irradiance itself has no such dependency. The dependence must therefore originate from the optical system or from the solar diffuser placed in front of the scan mirror. To clarify its root cause, we utilize an inter-comparison of Earth measurements between GEMS and the Advanced Meteorological Imager (AMI), a multi-channel imager onboard GK-2A for meteorological monitoring. As the spectral range of GEMS fully covers the spectral response function (SRF) of the AMI visible channel with a central wavelength of 470 nm, spectral matching is properly done by convolving the SRF with the hyperspectral data of GEMS. Taking advantage of the fact that the positions of GK-2A and GK-2B are maintained within a 0.5-degree square box centered at 128.2°E, a match-up dataset for the inter-comparison is prepared by temporal and spatial collocation. To reduce spatio-temporal mismatch and increase the signal-to-noise ratio, a zonal mean is applied to the collocated data. Results show that the north-south dependence occurs in the comparison of reflectance, the ratio between the Earth radiance and the solar irradiance, but not in the comparison of radiance. This indicates that the dependence arises from the characteristics of the solar diffuser, not from the optical system. It is further deduced that the dependence of the diffuser transmittance on the solar azimuth angle, which was not characterized during the pre-flight ground test, is the main cause of the north-south dependency.
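
The band-simulation step, convolving each GEMS spectrum with the AMI SRF and then averaging zonally, can be sketched as follows; the spectral grid, Gaussian SRF stand-in and data values are placeholders for the real instrument tables:

```python
import numpy as np

# Placeholder GEMS-like spectral grid (nm) and a Gaussian stand-in for the
# AMI 470 nm channel SRF; real SRF tables would be used in practice.
wl = np.arange(300.0, 500.0, 0.6)
srf = np.exp(-0.5 * ((wl - 470.0) / 10.0) ** 2)

rng = np.random.default_rng(4)
radiance = rng.random((100, wl.size))        # (n_pixels, n_wavelengths)

# SRF-weighted spectral average simulates the imager channel from GEMS spectra.
ami_like = radiance @ srf / srf.sum()

# Zonal (latitude-band) mean of collocated pixels suppresses mismatch noise.
lat_band = np.repeat(np.arange(10), 10)
zonal_mean = np.array([ami_like[lat_band == b].mean() for b in range(10)])
print(zonal_mean)
```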

How to cite: Lee, Y., Ahn, M.-H., Eo, M., Kang, M., Moon, K., Ko, D.-H., Kim, J., and Lee, D.: Characterization of GEMS level 1B based on inter-comparison using the visible channel of AMI, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-9360, https://doi.org/10.5194/egusphere-egu21-9360, 2021.

EGU21-10066 | vPICO presentations | ESSI1.19

LIME: the Lunar Irradiance Model of the European Space Agency

Sarah Taylor, Stefan Adriaensen, Carlos Toledano, África Barreto, Emma Woolliams, and Marc Bouvet

Absolute calibration of Earth observation sensors is key to ensuring long term stability and interoperability, essential for long term global climate records and forecasts. The Moon provides a photometrically stable calibration source, within the range of the Earth radiometric levels, and is free from atmospheric interference. However, to use this ideal calibration source, one must model the variation of its disk integrated irradiance resulting from changes in Sun-Earth-Moon geometries.

LIME, the Lunar Irradiance Model of the European Space Agency, is a new lunar irradiance model developed from ground-based observations acquired using a lunar photometer operating from the Izaña Atmospheric Observatory and Teide Peak, Tenerife. Approximately 300 lunar observations acquired between March 2018 and October 2020 currently contribute to the model, which builds on the widely-used ROLO (Robotic Lunar Observatory) model.

This presentation will outline the strategy used to derive LIME. First, the instrument was calibrated traceably to SI and characterised to determine its thermal sensitivity and its linearity over the wide dynamic range required. Second, the instrument was installed at the observatory, and nightly observations over a two-hour time window were extrapolated to provide top-of-atmosphere lunar irradiance using the Langley plot method. Third, these observations were combined to derive the model. Each of these steps includes a metrologically rigorous uncertainty analysis.
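
A toy version of the Langley extrapolation used in the second step, assuming a single-band photometer signal following Beer-Lambert attenuation; the signal level, optical depth and noise are invented for illustration:

```python
import numpy as np

# ln V = ln V0 - tau * m: regressing the log signal on airmass m and
# extrapolating to m = 0 yields the top-of-atmosphere signal V0.
rng = np.random.default_rng(1)
m = np.linspace(1.2, 3.0, 40)                 # airmass over a two-hour window
tau_true, v0_true = 0.12, 2.5e4
v = v0_true * np.exp(-tau_true * m + rng.normal(0.0, 0.002, m.size))

slope, intercept = np.polyfit(m, np.log(v), 1)
v0_toa, tau = np.exp(intercept), -slope       # TOA signal and optical depth
```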

Comparisons to several EO sensors will be presented, including Proba-V, Pleiades and Sentinel-3A and -3B, as well as a comparison to GIRO, the GSICS implementation of the ROLO model. Initial results indicate LIME predicts 3–5% higher disk-integrated lunar irradiance than the GIRO/ROLO model for the visible and near-infrared channels. The model has an expanded (k = 2) absolute radiometric uncertainty of ~2%, and it is expected that planned observations until at least 2024 will further constrain the model in subsequent updates.

How to cite: Taylor, S., Adriaensen, S., Toledano, C., Barreto, Á., Woolliams, E., and Bouvet, M.: LIME: the Lunar Irradiance Model of the European Space Agency, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-10066, https://doi.org/10.5194/egusphere-egu21-10066, 2021.

EGU21-13302 | vPICO presentations | ESSI1.19

New and improved data from the Pandonia Global Network for satellite validation

Alexander Cede, Martin Tiefengraber, Manuel Gebetsberger, Michel Van Roozendael, Henk Eskes, Christophe Lerot, Diego Loyola, Nicolas Theys, Isabelle De Smedt, Nader Abuhassan, Thomas Hanisco, Angelika Dehn, Jonas Von Bismarck, Stefano Casadio, Luke Valin, and Barry Lefer

The worldwide operating Pandonia Global Network (PGN) measures atmospheric trace gases at high temporal resolution for the purposes of air quality monitoring and satellite validation. It is an activity carried out jointly by NASA and ESA as part of their “Joint Program Planning Group Subgroup” on calibration, validation and field activities, with additional collaboration from other institutions, most notably a strongly growing participation of the US Environmental Protection Agency (EPA). The more than 50 official PGN instruments are homogeneously calibrated and their data are centrally processed in real time. Since 2019, total NO2 column amounts from the PGN have been uploaded daily to the ESA Atmospheric Validation Data Centre (EVDC), where they are used for operational validation of Sentinel-5P (S5P) retrievals. During 2020, a new processor version 1.8 was developed, which produces improved total NO2 column amounts as well as the following new PGN products: total columns of O3, SO2 and HCHO based on direct-sun observations, and tropospheric columns, surface concentrations and tropospheric profiles of NO2 and HCHO based on sky observations. In this presentation we show first examples of comparisons of the new PGN products with S5P data. Compared to the total NO2 columns from the previous processor version 1.7, the 1.8 data use better estimations of the effective NO2 temperature and the air mass factor. The effect of this improvement on the comparison with S5P retrievals is shown for some remote and high-altitude PGN sites. The new PGN total O3 column algorithm also retrieves the effective O3 temperature, which is a rather unique feature for ground-based direct-sun retrievals. This allows us to analyze whether potential differences from satellite O3 columns might be influenced by the O3 temperature. Including the O3 temperature in the spectral fitting has also allowed the retrieval of accurate total SO2 columns. This PGN data product is of particular interest for satellite validation, as ground-based total SO2 column amounts are hardly measured by other instrumentation. An initial comparison of the PGN SO2 columns with S5P retrievals at selected PGN sites around the world is shown. PGN total HCHO columns from direct-sun measurements are now possible for those PGN instruments in which the hardware parts made of Delrin, which outgasses HCHO, have been replaced by Nylon pieces. An initial comparison to HCHO retrievals from S5P is shown for locations with these upgraded instruments. Another new feature of the 1.8 PGN data is that they come with comprehensive uncertainty estimates, separated in the output files into independent, structured, common and total uncertainty.
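
For readers unfamiliar with the air-mass-factor step, the sketch below converts a slant column to a vertical column using a purely geometric direct-sun AMF; this is a simplification for illustration, not the PGN processor's refined, temperature-dependent treatment:

```python
import numpy as np

def vertical_column(slant_column, sza_deg):
    """VCD = SCD / AMF, with a geometric direct-sun AMF of sec(SZA)."""
    amf = 1.0 / np.cos(np.radians(sza_deg))
    return slant_column / amf

# e.g. a 4.1e16 molec/cm2 slant column at 60 deg SZA -> ~2.05e16 molec/cm2
print(vertical_column(4.1e16, 60.0))
```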

How to cite: Cede, A., Tiefengraber, M., Gebetsberger, M., Van Roozendael, M., Eskes, H., Lerot, C., Loyola, D., Theys, N., De Smedt, I., Abuhassan, N., Hanisco, T., Dehn, A., Von Bismarck, J., Casadio, S., Valin, L., and Lefer, B.: New and improved data from the Pandonia Global Network for satellite validation, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-13302, https://doi.org/10.5194/egusphere-egu21-13302, 2021.

EGU21-14222 | vPICO presentations | ESSI1.19

Air-LUSI: Supporting Advancement of the Moon as a Reference for Earth Observations from Space

Kevin Turpie, Steven Brown, John Woodward, Thomas Stone, Andrew Gadsden, Steven Grantham, Thomas Larason, Stephen Maxwell, Andrew Cataford, and Andrew Newton

To monitor global environments from space, satellites must be calibrated accurately and consistently across time, missions and instruments.  This requires the use of a stable, common reference that is continuously accessible to Earth-observing satellites, whether they make up series of missions spanning long periods of time or comprise constellations acquiring many simultaneous observations across the planet.  The Moon can serve well as such a common reference.  Its surface reflectance is stable to within one part in 10⁸.  Its radiant output is theorized to change repeatably and very predictably with viewing and illumination geometry.  In addition, it has a radiant flux more comparable to the Earth’s surface than the Sun and can be viewed directly by the instrument.  Currently, to predict the lunar irradiance for a given illumination and viewing geometry, the United States Geological Survey (USGS) has developed the Robotic Lunar Observatory (ROLO) model of exo-atmospheric lunar spectral irradiance. The USGS ROLO model represents the current most precise knowledge of lunar spectral irradiance and is used frequently as a relative calibration standard by space-borne Earth-observing sensors.  Current knowledge of the Moon's spectral irradiance is thought to be limited to 5–10% uncertainty.  However, monitoring changing Earth environments calls for an absolute lunar reference with higher accuracy.

The development of the ROLO model and subsequent attempts to better characterize the lunar spectral irradiance cycle were based on observations made from the Earth's surface.  This requires applying corrections to remove the effects of the atmosphere, which limits the accuracy.  The Airborne LUnar Spectral Irradiance (Air-LUSI) system was developed to make highly accurate, SI-traceable measurements of lunar spectral irradiance from NASA’s ER-2 aircraft flying at 21 km, above 95% of the atmosphere.  To that end, the Air-LUSI system employs an autonomous, robotic telescope system that tracks the Moon in flight and a stable spectrometer housed in an enclosure providing a robustly controlled environment.  During November 2019, the Air-LUSI system was demonstrated with flights on five consecutive nights, acquiring observations of the Moon at lunar phases of 10°, 21°, 34°, 46°, and 59°.  Air-LUSI is now ready for operational use.  This paper provides an overview of this new capability and how it, along with other efforts underway, can help transform how we monitor the Earth from space.

How to cite: Turpie, K., Brown, S., Woodward, J., Stone, T., Gadsden, A., Grantham, S., Larason, T., Maxwell, S., Cataford, A., and Newton, A.: Air-LUSI: Supporting Advancement of the Moon as a Reference for Earth Observations from Space, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-14222, https://doi.org/10.5194/egusphere-egu21-14222, 2021.

EGU21-14656 | vPICO presentations | ESSI1.19

Toward a Climate and Calibration Observatory in space: NASA CLARREO Pathfinder and ESA TRUTHS

Nigel Fox, Yolanda Shea, Thorsten Fehr, Fleming Gary, Constantine Lukashin, Peter Pilewskie, John Remedios, and Paul Smith

The number, range and criticality of applications of Earth-viewing optical sensors are increasing rapidly, driven not only by national and international space agencies but also by the launch of commercial constellations such as those of Planet, with the concept of Analysis Ready Data (ARD) reducing the skill needed to utilise the data.  However, no one organisation can provide all the tools necessary, and the need for a coordinated, holistic Earth-observing system has never been greater. Achieving this vision has led to international initiatives coordinated by bodies such as the Committee on Earth Observation Satellites (CEOS) and the Global Space-based Inter-Calibration System (GSICS) of WMO to establish strategies to facilitate interoperability and the understanding and removal of bias through post-launch calibration and validation.

In parallel, the societal challenge resulting from climate change has been a major stimulus for significantly improved accuracy and trust of satellite data. Instrumental biases and uncertainty must be sufficiently small to minimise the multi-decadal timescales needed to detect small trends and attribute their cause, enabling them to become unequivocally accepted as evidence. 

Although there have been many advances in the pre-flight SI-traceable calibration of optical sensors in the last decade, unpredictable degradation in performance caused by both the launch and the operational environment remains a major difficulty.  Even with on-board calibration systems, uncertainties of less than a few percent are rarely achieved and maintained, and the evidential link to SI traceability is weak. For many climate observations the target uncertainty needs to be improved ten-fold.

However, this decade will hopefully see the launch of two missions providing spectrally resolved observations of the Earth at optical wavelengths that aim to change this paradigm: CLARREO Pathfinder on the International Space Station from NASA [1] and TRUTHS from ESA [2].  Both payloads are explicitly designed to achieve uncertainties close to those of an ideal observing system, commensurate with the needs of climate, with robust SI traceability evidenced in space.  Not only can they make high-accuracy, climate-quality observations of the Earth, and in the case of TRUTHS also of the Sun, but they will also transfer their SI-traceable uncertainty to other sensors.  In this way they create the concept of a ‘metrology laboratory in space’, providing a ‘gold standard’ reference to anchor and improve the calibration of other sensors. The two missions achieve their traceability in orbit through differing methods but will use synergistic approaches for establishing in-flight cross-calibrations.  This paper will describe these strategies and illustrate the benefit through examples where improved accuracy has the most impact on the Earth-observing system.

The complementarity and international value of these missions have ensured a strong partnership during the early development phases of the full CLARREO mission and of the NPL-conceived TRUTHS. Following a proposal by the UK Space Agency and its subsequent adoption into the ESA EarthWatch programme, this partnership has been further strengthened with the ESA team, with a vision that together the two missions can lay the foundation of a framework for a future sustainable international climate and calibration observatory, to the benefit of the global Earth-observing community.

References

[1]  https://clarreo-pathfinder.larc.nasa.gov/

[2] https://www.npl.co.uk/earth-observation/truths

How to cite: Fox, N., Shea, Y., Fehr, T., Gary, F., Lukashin, C., Pilewskie, P., Remedios, J., and Smith, P.: Toward a Climate and Calibration Observatory in space: NASA CLARREO Pathfinder and ESA TRUTHS, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-14656, https://doi.org/10.5194/egusphere-egu21-14656, 2021.

EGU21-15144 | vPICO presentations | ESSI1.19

The Joint ESA-NASA Tropical Campaign Activity – Aeolus Calibration/Validation and Science in the Tropics

Thorsten Fehr, Gail Skofronick-Jackson, Vassilis Amiridis, Jonas von Bismarck, Shuyi Chen, Cyrille Flamant, Rob Koopman, Christian Lemmerz, Griša Močnik, Tommaso Parrinello, Aaron Piña, and Anne Grete Straume

The Tropics cover around 40% of the globe and are home to approximately 40% of the world's population. However, numerical weather prediction (NWP) for this region remains challenging due to the lack of basic observations and an incomplete understanding of atmospheric processes, which also affects extratropical storm development. As a result, the largest impact of ESA's Aeolus satellite observations on NWP is expected in the Tropics, where only a very limited number of wind profile observations can be performed from the ground.

An especially important case relating to the predictability of tropical weather systems is the outflow of Saharan dust, its interaction with cloud microphysics and its overall impact on the development of tropical storms over the Atlantic Ocean. The region off the coast of West Africa uniquely allows the study of the Saharan Aerosol Layer, the African Easterly Waves and Jets, the Tropical Easterly Jet, as well as the deep convection in the ITCZ, and their relation to the formation of convective systems and the transport of dust.

Together with international partners, ESA and NASA are currently implementing a joint Tropical campaign from July to August 2021, based in Cape Verde. The campaign objective is to provide information for the validation and preparation of the ESA missions Aeolus and EarthCARE, respectively, as well as to support a range of related science objectives: the interactions of African Easterly and other tropical waves with the mean flow and dust, and their impact on the development of convective systems; the structure and variability of the marine boundary layer in relation to the initiation and lifecycle of convective cloud systems within and across the ITCZ; and the impact of wind, aerosol, cloud, and precipitation effects on long-range dust transport and air quality over the western Atlantic.

The campaign comprises a unique combination of strong airborne and ground-based elements collocated on Cape Verde. The airborne component, with wind and aerosol lidars, cloud radars, in-situ instrumentation and additional observations, includes the NASA DC-8 with science activities coordinated by the University of Washington, the German DLR Falcon-20, the French Safire Falcon-20 with activities led by LATMOS, and the Slovenian Aerovizija Advantic WT-10 light aircraft in cooperation with the University of Nova Gorica. The ground-based component, led by the National Observatory of Athens, is a collaboration of more than 25 European teams providing in-situ and remote sensing aerosol and cloud measurements with a wide range of lidar, radar and radiometer systems, as well as drone observations by the Cyprus Institute.

In preparation for the field campaign, the NASA and ESA management and science teams are collaborating closely through regular coordination meetings, in particular to coordinate the shift of the activity by one year due to the COVID-19 pandemic. The time gained has been used to further consolidate the planning, notably through a dry-run campaign organized by NASA with European participation, in which six virtual flights were conducted in July 2020.

This paper will present a summary of the campaign preparation activities and the consolidated plan for the 2021 Tropical campaign.

How to cite: Fehr, T., Skofronick-Jackson, G., Amiridis, V., von Bismarck, J., Chen, S., Flamant, C., Koopman, R., Lemmerz, C., Močnik, G., Parrinello, T., Piña, A., and Straume, A. G.: The Joint ESA-NASA Tropical Campaign Activity – Aeolus Calibration/Validation and Science in the Tropics, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-15144, https://doi.org/10.5194/egusphere-egu21-15144, 2021.

EGU21-15161 | vPICO presentations | ESSI1.19

Pi-MEP Salinity – an ESA-NASA Platform for sustained satellite surface salinity validation 

Roberto Sabia, Sebastien Guimbard, Nicolas Reul, Tony Lee, Julian Schanze, Nadya Vinogradova, David Le Vine, Fred Bingham, Fabrice Collard, Klaus Scipal, and Henri Laur

The Pilot Mission Exploitation Platform (Pi-MEP) for Salinity (www.salinity-pimep.org) has been released operationally in 2019 to the broad oceanographic community, in order to foster satellite sea surface salinity validation and exploitation activities.

Specifically, the Platform aims at enhancing salinity validation by allowing systematic inter-comparison of various EO datasets with a broad suite of in-situ data, and at enabling oceanographic process studies by capitalizing on salinity data in synergy with additional spaceborne estimates.
Although Pi-MEP was originally conceived as an ESA initiative to widen the uptake of the Soil Moisture and Ocean Salinity (SMOS) mission data over the ocean, a project partnership with NASA was devised soon after the operational deployment, and an official collaboration was endorsed within the ESA-NASA Joint Program Planning Group (JPPG).
The Salinity Pi-MEP has therefore become a reference hub for SMOS, SMAP and Aquarius satellite salinity missions, which are assessed in synergy with additional thematic datasets (e.g., precipitation, evaporation, currents, sea level anomalies, ocean color, sea surface temperature). 

Match-up databases of satellite/in situ (such as Argo, TSG, moorings, drifters) data and corresponding validation reports at different spatiotemporal scales are systematically generated; furthermore, recently developed dedicated tools allow data visualization, metrics computation and user-driven feature extraction.
The Platform is also meant to monitor salinity in selected oceanographic “case studies”, ranging from river plume monitoring to SSS characterization in challenging regions, such as high latitudes or semi-enclosed basins.
The two Agencies are currently collaborating to widen the Platform's features on several technical aspects, ranging from a triple-collocation software implementation to a sustained exploitation of data from the SPURS-1/2 campaigns. In this context, an upgrade of the satellite/in-situ match-up methodology has recently been agreed, resulting in a redefinition of the validation criteria that will subsequently be implemented in the Platform.
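
Triple collocation can be sketched compactly in its covariance form, estimating the error variances of three collocated salinity estimates with independent errors (e.g. SMOS, SMAP and Argo); the inputs below are synthetic, and the Platform's actual implementation may differ:

```python
import numpy as np

def triple_collocation(x, y, z):
    """Covariance-based triple collocation: error variances of three
    collocated estimates of the same signal with independent errors."""
    c = np.cov(np.vstack([x, y, z]))
    ex2 = c[0, 0] - c[0, 1] * c[0, 2] / c[1, 2]
    ey2 = c[1, 1] - c[0, 1] * c[1, 2] / c[0, 2]
    ez2 = c[2, 2] - c[0, 2] * c[1, 2] / c[0, 1]
    return ex2, ey2, ez2

# Synthetic demo: one true SSS signal observed by three noisy systems.
rng = np.random.default_rng(2)
truth = 35.0 + 0.5 * rng.standard_normal(10_000)
x = truth + 0.20 * rng.standard_normal(truth.size)
y = truth + 0.30 * rng.standard_normal(truth.size)
z = truth + 0.10 * rng.standard_normal(truth.size)
print(triple_collocation(x, y, z))   # ~ (0.04, 0.09, 0.01)
```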
A further synthesis of the three satellites' salinity algorithms, models and auxiliary data handling is at the core of the ESA Climate Change Initiative (CCI) on Salinity and of further ESA-NASA collaboration.

How to cite: Sabia, R., Guimbard, S., Reul, N., Lee, T., Schanze, J., Vinogradova, N., Le Vine, D., Bingham, F., Collard, F., Scipal, K., and Laur, H.: Pi-MEP Salinity – an ESA-NASA Platform for sustained satellite surface salinity validation, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-15161, https://doi.org/10.5194/egusphere-egu21-15161, 2021.

EGU21-15166 | vPICO presentations | ESSI1.19

NASA-ESA Cooperation on the SBG and CHIME Hyperspectral Satellite Missions: a roadmap for the joint Working Group on Cal/Val activities

Valentina Boccia, Jennifer Adams, Kurtis J. Thome, Kevin R. Turpie, Raymond Kokaly, Marc Bouvet, Robert O. Green, and Michael Rast

Imaging spectroscopy has been identified by ESA, NASA and other international space agencies as key to addressing a number of the most important scientific and environmental management objectives. To implement the critical EU and related policies for the management of natural resources, assets and benefits, and to achieve the objectives outlined by NASA’s Decadal Survey in ecosystem science, hydrology and geology, high-fidelity imaging spectroscopy data with global coverage and high spatial resolution are required. As such, ESA’s CHIME (Copernicus Hyperspectral Imaging Mission for the Environment) and NASA’s SBG (Surface Biology and Geology) satellite missions aim to provide imaging spectroscopy data with global coverage at regular intervals of time and with high spatial resolution.

However, the scientific and applied objectives motivate more spatial coverage and more rapid revisit than any one agency’s observing system can provide. With the development of SBG and CHIME, the mid-to-late 2020s will see more global coverage spectroscopic observing systems, whereby these challenging needs can be more fully met by a multi-mission and multi-Agency synergetic approach, rather than by any single observing system.

Therefore, an ESA-NASA cooperation on imaging spectroscopy space missions was seen as a priority for collaboration, specifically given the complementarity of mission objectives and measurement targets of the SBG and CHIME. Such cooperation is now being formalized as part of the ESA-NASA Joint Program Planning Group activities.

Among these activities, calibration and validation (Cal/Val) are fundamental for imaging spectroscopy while the satellites are in orbit and operating. They determine the quality and integrity of the data provided by the spectrometers and become even more crucial when data from different satellites, carrying different imaging sensors, are used worldwide in a complementary and synergetic manner, as will be the case for CHIME and SBG data. Indeed, Cal/Val activities not only have enormous downstream impacts on the accuracy and reliability of the products, but also facilitate cross-calibration and interoperability among several imaging spectrometers, supporting their synergistic use. Accordingly, within the context of this cooperation, a Working Group (WG) on Calibration/Validation has been set up, aiming to establish a roadmap for future SBG-CHIME coordination activities and collaborative studies.

This contribution aims to outline the key areas of cooperation between SBG and CHIME in terms of Calibration and Validation, and present the establishment of a roadmap between the two missions, focusing on the following topics:

  • Establishing an end-to-end cal/val strategy for seamless data products across missions, including transfer standards;
  • Measurement Networks and commonly recognised Cal/Val reference sites;
  • Status of atmospheric radiative transfer and atmospheric–correction procedures;
  • Standardisation and Quality Control of reference data sets;
  • Definition and implementation of joint airborne spectroscopy campaigns, such as the executed 2018 and planned 2021 campaigns, to simulate both missions and exercise the capabilities needed for eventual interoperability (incl. data collection, calibration, data product production);
  • Continuous validation throughout the lifetime of products;
  • Identifying other opportunities for efficiency and success through cooperation on calibration and validation, downlink capabilities and shared algorithms (e.g. compression and on-board data reduction).

How to cite: Boccia, V., Adams, J., Thome, K. J., Turpie, K. R., Kokaly, R., Bouvet, M., Green, R. O., and Rast, M.: NASA-ESA Cooperation on the SBG and CHIME Hyperspectral Satellite Missions: a roadmap for the joint Working Group on Cal/Val activities, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-15166, https://doi.org/10.5194/egusphere-egu21-15166, 2021.

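EGU21-15501 | vPICO presentations | ESSI1.19

Reference Data and Methods for Validation of Very High Resolution Optical Data Within ESA / EDAP Project

Sébastien Saunier
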
In this paper, the author describes the methodologies developed for the validation of Very High-Resolution (VHR) optical missions within the Earthnet Data Assessment Pilot (EDAP) framework.  The use of surface-based, drone, airborne, and/or space-based observations to build a calibration reference plays a fundamental role in the validation process. A rigorous validation process must compare mission data products with independent reference data suitable for the satellite measurements. As a consequence, one background activity within EDAP is the collection and consolidation of reference data of various natures, depending on the validation methodology.

The validation methodologies are conventionally divided into three categories: validation of the measurement, of the geometry and of the image quality. The validation of the measurement requires an absolute calibration reference. The latter is built up using either in situ measurements collected at RadCalNet [1] stations or space-based observations performed with “gold” missions (Sentinel-2, Landsat-8) over Pseudo-Invariant Calibration Sites (PICS). For the geometric validation, several test sites have been set up. A test site is equipped with data from different reference sources. A given test site is not usable for every assessment; its usability depends on the validation metrics and the specifications of the sensor, particularly the spatial resolution and image acquisition geometry. Some existing geometric sites are equipped with Ground Control Point (GCP) sets surveyed using Global Navigation Satellite System (GNSS) devices in the field.  In some cases, the GCP set supports the refinement of drone-acquired imagery to produce a raster reference, subsequently used to validate the internal geometry of images under assessment. In addition, a limiting factor in the use of VHR optical ortho-rectified data is the accuracy of the Digital Surface Model (DSM) / Digital Terrain Model (DTM). In order to separate errors due to terrain elevation from errors due to the sensor itself, some test sites are also equipped with very accurate Light Detection and Ranging (LIDAR) data.

The validation of image quality addresses all aspects related to spatial resolution and is strongly linked to both the measurement and the geometry. The image quality assessments are performed with both qualitative and quantitative approaches. The quantitative approach relies on the analysis of images of artificial ground targets and leads to estimates of the Modulation Transfer Function (MTF) together with additional image quality parameters such as the Signal-to-Noise Ratio (SNR). The qualitative approach assesses the interpretability of input images and leads to a rating scale [2] strongly related to the sensor's Ground Resolution Distance (GRD). This visual inspection task requires a database including very detailed images of man-made objects. Within EDAP, this database is considered a reference.

[1] https://www.radcalnet.org

[2] https://fas.org/irp/imint/niirs.htm

How to cite: Saunier, S.: Reference Data and Methods for Validation of Very High Resolution Optical Data Within ESA / EDAP Project, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-15501, https://doi.org/10.5194/egusphere-egu21-15501, 2021.

ESSI2.2 – Find, access, share and use data across the globe: Infrastructure solutions for Earth System Sciences

EGU21-8458 | vPICO presentations | ESSI2.2

The ICOS Carbon Portal as example of a  FAIR community data repository supporting scientific workflows

Alex Vermeulen, Margareta Hellström, Oleg Mirzov, Ute Karstens, Claudio D'Onofrio, and Harry Lankreijer

The Integrated Carbon Observation System (ICOS) provides long-term, high-quality observations that follow (and cooperatively set) the global standards for the best possible quality data on atmospheric greenhouse gas (GHG) composition, greenhouse gas exchange fluxes measured by eddy covariance, and CO2 partial pressure at water surfaces. The ICOS observational data feed into a wide area of science that covers, for example, plant physiology, agriculture, biology, ecology, energy & fuels, forestry, hydrology, (micro)meteorology, environmental science, oceanography, geochemistry, physical geography, remote sensing, earth, climate and soil science, and combinations of these in multi-disciplinary projects.
As ICOS is committed to providing all data and methods in an open and transparent way as free data, a dedicated system is needed to secure the long-term archiving and availability of the data, together with the descriptive metadata that belongs to the data and is needed to find, identify, understand and properly use it, also in the far future, following the FAIR data principles. An added requirement is that the full data lifecycle should be completely reproducible to enable full trust in the observations and the derived data products.

In this presentation we will introduce the ICOS operational data repository, named ICOS Carbon Portal, which is based on the linked open data approach. All metadata is modelled in an ontology coded in OWL and held in an RDF triple store that is available through an open SPARQL endpoint. The repository supports versioning and collections, and models provenance through a simplified PROV-O ontology. All data objects are ingested under strict control for the identified data types, on provision of correct and sufficient (provenance) metadata, data format and data integrity. All data, including raw data, are stored in the long-term trusted repository B2SAFE with two replicas. On top of the triple store and SPARQL endpoint we have built a series of services, APIs and graphical interfaces that allow machine-to-machine and user interaction with the data and metadata. Examples are a full faceted search with a connected data cart and download facility, previews of higher-level data products (time series of point observations and spatial data), and cloud computing services like eddy covariance data processing and on-demand atmospheric footprint calculations, all connected to the observational data from ICOS. Another interesting development is the community support for scientific workflows using Jupyter notebook services that connect to our repository through a dedicated Python library for direct metadata and data access.
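
As a minimal sketch of machine access to such a SPARQL endpoint; the endpoint URL is the one published by the Carbon Portal and should be verified, and the query is deliberately generic since the actual ICOS ontology terms belong in the portal documentation:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Endpoint URL as published by the ICOS Carbon Portal; verify before use.
sparql = SPARQLWrapper("https://meta.icos-cp.eu/sparql")
sparql.setReturnFormat(JSON)

# Generic triple-pattern query; real queries would use ICOS ontology terms.
sparql.setQuery("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5")

results = sparql.query().convert()
for b in results["results"]["bindings"]:
    print(b["s"]["value"], b["p"]["value"], b["o"]["value"])
```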

How to cite: Vermeulen, A., Hellström, M., Mirzov, O., Karstens, U., D'Onofrio, C., and Lankreijer, H.: The ICOS Carbon Portal as example of a  FAIR community data repository supporting scientific workflows, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-8458, https://doi.org/10.5194/egusphere-egu21-8458, 2021.

EGU21-15394 | vPICO presentations | ESSI2.2

EPOS-Norway Portal

Jan Michalek, Kuvvet Atakan, Christian Rønnevik, Helga Indrøy, Lars Ottemøller, Øyvind Natvik, Tor Langeland, Ove Daae Lampe, Gro Fonnes, Jeremy Cook, Jon Magnus Christensen, Ulf Baadshaug, Halfdan Pascal Kierulf, Bjørn-Ove Grøtan, Odleiv Olesen, John Dehls, and Valerie Maupin

The European Plate Observing System (EPOS) is a European project building a pan-European infrastructure for accessing solid Earth science data, now governed by EPOS ERIC (European Research Infrastructure Consortium). The EPOS-Norway project (EPOS-N; RCN Infrastructure Programme, Project no. 245763) is a Norwegian project funded by the Research Council of Norway. The aim of the Norwegian EPOS e‑infrastructure is to integrate data from the seismological and geodetic networks, as well as the data from the geological and geophysical data repositories. Among the six EPOS-N project partners, four institutions provide data: the University of Bergen (UIB), the Norwegian Mapping Authority (NMA), the Geological Survey of Norway (NGU) and NORSAR.

In this contribution, we present the EPOS-Norway Portal as an online, open-access, interactive tool allowing visual analysis of multidimensional data. It supports maps and 2D plots with linked visualizations. Currently, access is provided to more than 300 datasets (18 web services, 288 map layers and 14 static datasets) from four subdomains of Earth science in Norway, and new datasets are planned to be integrated in the future. The EPOS-N Portal can access remote datasets via web services like FDSNWS for seismological data and OGC services for geological and geophysical data (e.g. WMS). Standalone datasets are available through preloaded data files. Users can also simply add another WMS server or upload their own dataset for visualization and comparison with other datasets. This portal provides a unique way (the first of its kind in Norway) to explore various geoscientific datasets in one common interface. One of the key aspects is the quick, simultaneous visual inspection of data from various disciplines and the testing of scientific or geohazard-related hypotheses. One such example is the spatio-temporal correlation of earthquakes (1980 until now) with existing critical infrastructure (e.g. pipelines), geological structures, submarine landslides or unstable slopes.

The EPOS-N Portal is implemented by adapting Enlighten-web, a server-client program developed by NORCE. Enlighten-web facilitates interactive visual analysis of large multidimensional data sets, and supports interactive mapping of millions of points. The Enlighten-web client runs inside a web browser. An important element in the Enlighten-web functionality is brushing and linking, which is useful for exploring complex data sets to discover correlations and interesting properties hidden in the data. The views are linked to each other, so that highlighting a subset in one view automatically leads to the corresponding subsets being highlighted in all other linked views.

How to cite: Michalek, J., Atakan, K., Rønnevik, C., Indrøy, H., Ottemøller, L., Natvik, Ø., Langeland, T., Lampe, O. D., Fonnes, G., Cook, J., Christensen, J. M., Baadshaug, U., Kierulf, H. P., Grøtan, B.-O., Olesen, O., Dehls, J., and Maupin, V.: EPOS-Norway Portal, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-15394, https://doi.org/10.5194/egusphere-egu21-15394, 2021.

EGU21-15205 | vPICO presentations | ESSI2.2

SIOS Data Management System: distributed data system for Earth System Science

Dariusz Ignatiuk, Øystein Godøy, Lara Ferrighi, Inger Jennings, Christiane Hübner, Shridhar Jawak, and Heikki Lihavainen

The Svalbard Integrated Arctic Earth Observing System (SIOS) is an international consortium that develops and maintains a regional observing system in Svalbard and the associated waters. SIOS brings together the existing infrastructure and data of its members into a multidisciplinary network dedicated to answering Earth System Science (ESS) questions related to global change. The Observing System is built around “SIOS core data”, long-term data series collected by SIOS partners. The SIOS Data Management System (SDMS) is dedicated to harvesting information on historical and current datasets from collaborating thematic and institutional data centres and making it available to users. A central data access portal is linked to the data repositories maintained by SIOS partners, which manage and distribute datasets and their associated metadata. The integrity of the information and the harmonisation of the data rest on internationally accepted protocols: interoperability of data, standardised documentation of data through the use of metadata, and standardised interfaces between data systems through the discovery of metadata. By these means, SDMS is working towards FAIR data compliance (making data findable, accessible, interoperable and reusable), among other initiatives through the H2020-funded ENVRI-FAIR project (http://envri.eu/envri-fair/).

How to cite: Ignatiuk, D., Godøy, Ø., Ferrighi, L., Jennings, I., Hübner, C., Jawak, S., and Lihavainen, H.: SIOS Data Management System: distributed data system for Earth System Science, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-15205, https://doi.org/10.5194/egusphere-egu21-15205, 2021.

EGU21-9400 | vPICO presentations | ESSI2.2

The brokering framework empowering WMO Hydrological Observing System (WHOS)

Enrico Boldrini, Paolo Mazzetti, Fabrizio Papeschi, Roberto Roncella, Mattia Santoro, Massimiliano Olivieri, Stefano Nativi, Silvano Pecora, Igor Chernov, and Claudio Caponi

The WMO Commission for Hydrology (CHy) is realizing the WMO Hydrological Observing System (WHOS), a software (and human) framework aimed at improving the sharing of hydrological data and knowledge worldwide.

National Hydrological Services (NHSs) already share archived and near-real-time data collected in each country on the web, using disparate publication services. WHOS leverages the Discovery and Access Broker (DAB) technology, developed and operated in its cloud infrastructure by CNR-IIA, to realize WHOS-broker, a key component of the WHOS architecture. WHOS-broker is in charge of harmonizing the available, heterogeneous metadata, data and services, making the already published information more accessible to scientists (e.g. modelers), decision makers and the general public worldwide.

WHOS-broker supports many service interfaces and APIs that hydrological application builders can already leverage, for example OGC SOS, OGC CSW, OGC WMS, ESRI Feature Service, CUAHSI WaterOneFlow, DAB REST API, USGS RDB, OAI-PMH/WIGOS and THREDDS. New APIs and service protocols are continuously added to support new applications, as WHOS-broker is a modular and flexible framework designed to enable interoperability and to maintain it as standards change and evolve over time.
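
As a minimal illustration of one of the interfaces listed above, the following sketch sends an OGC SOS GetCapabilities request; the endpoint URL is a placeholder, not the operational WHOS-broker address.

    import requests

    SOS_ENDPOINT = "https://example.org/whos/sos"  # placeholder, not the operational endpoint

    params = {
        "service": "SOS",
        "request": "GetCapabilities",
        "AcceptVersions": "2.0.0",
    }
    response = requests.get(SOS_ENDPOINT, params=params, timeout=30)
    response.raise_for_status()
    print(response.text[:500])  # XML capabilities: observation offerings, procedures, ...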

Three target programmes have already benefited from WHOS:

  • La Plata river basin: hydro and meteo data from Argentina, Bolivia, Brazil, Paraguay and Uruguay are harmonized and shared by WHOS-broker to the benefit of different applications; one of them is the Plata Basin Hydrometeorological Forecasting and Early Warning System (the PROHMSAT-Plata model, developed by HRC with experts from the five countries), based on CUAHSI WaterOneFlow.
  • Arctic-HYCOS: hydro data from Canada, Finland, Greenland, Iceland, Norway, Russia and the United States are harmonized and shared by WHOS-broker to the benefit of different applications; one of them is the WMO HydroHub Arctic portal, based on ESRI technologies.
  • Dominican Republic: hydro and meteo data of the Dominican Republic published by different originators are being harmonized by WHOS-broker to the benefit of different applications; one of them is the Met Data Explorer application developed by BYU, based on the THREDDS catalog service.

The three programmes should act as a driving force for more to follow, by demonstrating possible applications that can be built on top of WHOS.

The public launch of the official WHOS homepage at WMO, expected by mid-2021, will include:

  • A dedicated web portal based on Water Data Explorer application developed by BYU
  • Results from the three programmes
  • Detailed information on how to access WHOS data by using one of the many WHOS-broker service interfaces
  • An online training course for data providers interested in WHOS
  • The WHOS Hydro Ontology, leveraged by WHOS-broker both to semantically augment user queries and to harmonize results (e.g. in the case of synonyms of the same concept in different languages).

How to cite: Boldrini, E., Mazzetti, P., Papeschi, F., Roncella, R., Santoro, M., Olivieri, M., Nativi, S., Pecora, S., Chernov, I., and Caponi, C.: The brokering framework empowering WMO Hydrological Observing System (WHOS), EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-9400, https://doi.org/10.5194/egusphere-egu21-9400, 2021.

EGU21-15148 | vPICO presentations | ESSI2.2

Teams Win: The European Datacube Federation

Peter Baumann

Collaboration requires some minimum of common understanding; in the case of Earth data this means, in particular, common principles making data interchangeable, comparable, and combinable. Open standards help here, in the case of Big Earth Data specifically the OGC/ISO Coverages standard. This unifying framework establishes common ground, in particular for regular and irregular spatio-temporal datacubes. Services grounded in such a common understanding have proven more uniform to access and handle, implementing a principle of "minimal surprise" for users visiting different portals while using their favourite clients. Data combination and fusion benefit from canonical metadata allowing automatic alignment, e.g., between 2D DEMs, 3D satellite image time series, 4D atmospheric data, etc.

The EarthServer datacube federation is showing the way towards unleashing the full potential of pixels for supporting the UN Sustainable Development Goals, local governance, and also businesses. EarthServer is an open, free, transparent, and democratic network of data centers offering dozens of Petabytes of critical data varieties, such as radar and optical Copernicus data, atmospheric data, elevation data, and thematic cubes like global sea ice. Data centers like the DIASs and CODE-DE, research organizations, companies, and agencies have teamed up in EarthServer. Strictly based on open OGC standards, an ecosystem of data has been established that is available to users as a single pool, without the need for any coding skills (such as Python). A specific unique capability is location transparency: clients can fire their query against any of the members, and the federation nodes will figure out the optimal work distribution irrespective of data location.

The underlying datacube engine, rasdaman, enables all datacube access, analytics, and federation. Query evaluation is optimized automatically by applying highly efficient, intelligent, rule-based methods in homogeneous and heterogeneous mashups, up to satellite on-board deployments as done in the ORBiDANSe project. Users perceive one single, common information space accessible through a wide spectrum of open-source and proprietary clients.
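
To illustrate the query style involved, the sketch below sends a WCPS (OGC Web Coverage Processing Service) expression to a rasdaman endpoint via its WCS interface; the endpoint, datacube name and time coordinate are illustrative assumptions, not actual federation holdings.

    import requests

    ENDPOINT = "https://example.org/rasdaman/ows"  # placeholder federation node

    # WCPS: slice a hypothetical datacube at one time step and encode it as PNG
    wcps_query = 'for $c in (SomeDatacube) return encode($c[ansi("2020-07-01")], "image/png")'

    response = requests.get(ENDPOINT, params={
        "service": "WCS", "version": "2.0.1",
        "request": "ProcessCoverages", "query": wcps_query,
    }, timeout=120)
    response.raise_for_status()
    with open("slice.png", "wb") as f:
        f.write(response.content)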

In our talk we present technology, services, and governance of this unique line-up of data centers. A demo will show distributed datacube fusion live.

 

How to cite: Baumann, P.: Teams Win: The European Datacube Federation, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-15148, https://doi.org/10.5194/egusphere-egu21-15148, 2021.

EGU21-23 | vPICO presentations | ESSI2.2

Towards Developing Community Guidelines for Sharing and Reusing Quality Information of Earth Science Datasets

Carlo Lacagnina, Ge Peng, Robert R. Downs, Hampapuram Ramapriyan, Ivana Ivanova, David F. Moroni, Yaxing Wei, Lucy Bastin, Nancy A. Ritchey, Gilles Larnicol, Lesley A. Wyborn, Chung-Lin Shie, Ted Habermann, Anette Ganske, Sarah M. Champion, Mingfang Wu, Irina Bastrakova, Dave Jones, and Gary Berg-Cross

The knowledge of data quality, and of the quality of the associated information including metadata, is critical for data use and reuse. Assessment of data and metadata quality is key to ensuring that credible information is available, to establishing a foundation of trust between the data provider and various downstream users, and to demonstrating compliance with requirements established by funders and federal policies.

Data quality information should be consistently curated, traceable, and adequately documented to provide sufficient evidence to guide users to address their specific needs. The quality information is especially important for data used to support decisions and policies, and for enabling data to be truly findable, accessible, interoperable, and reusable (FAIR).

Clear documentation of the quality assessment protocols used can promote the reuse of quality assurance practices and thus support the generation of more easily comparable datasets and quality metrics. To enable interoperability across systems and tools, data quality information should be machine-actionable. Guidance on the curation of dataset quality information can help to improve the practices of the various stakeholders who contribute to the collection, curation, and dissemination of data.

This presentation outlines a global community effort to develop international guidelines to curate data quality information that is consistent with the FAIR principles throughout the entire data life cycle and inheritable by any derivative product.

How to cite: Lacagnina, C., Peng, G., Downs, R. R., Ramapriyan, H., Ivanova, I., Moroni, D. F., Wei, Y., Bastin, L., Ritchey, N. A., Larnicol, G., Wyborn, L. A., Shie, C.-L., Habermann, T., Ganske, A., Champion, S. M., Wu, M., Bastrakova, I., Jones, D., and Berg-Cross, G.: Towards Developing Community Guidelines for Sharing and Reusing Quality Information of Earth Science Datasets, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-23, https://doi.org/10.5194/egusphere-egu21-23, 2021.

EGU21-10547 | vPICO presentations | ESSI2.2

Data flow, harmonization, and quality control

Brenner Silva, Philipp Fischer, Sebastian Immoor, Rudolf Denkmann, Marion Maturilli, Philipp Weidinger, Steven Rehmcke, Tobias Düde, Norbert Anselm, Peter Gerchow, Antonie Haas, Christian Schäfer-Neth, Angela Schäfer, Stephan Frickenhaus, and Roland Koppe and the Computing and Data Centre of the Alfred-Wegener-Institute

Earth system cyberinfrastructures include three types of data services: repositories, collections, and federations. These services arrange data by their purpose, level of integration, and governance. For instance, registered data of uniform measurements fulfil the goal of publication but do not necessarily flow into an integrated data system. The data repository provides a first, high level of integration that strongly depends on the standardization of incoming data. One example is the Observation to Archive and Analysis (O2A) framework that is operational and continuously developed at the Alfred-Wegener-Institute, Bremerhaven. A data repository is one of the components of the O2A framework, and much of its functionality depends on the standardization of the incoming data. In this context, we focus on the development of a modular approach to provide standardization and quality control for the monitoring of near-real-time data. Two modules are under development: the driver module, which transforms different tabular data into a common format, and the quality control module, which runs quality tests on the ingested data. Both modules rely on the sensor operator and on the data scientist, two actors that interact with the two ends of the ingest component of the O2A framework (http://data.awi.de/o2a-doc). We demonstrate the driver and quality control modules in the data flow within Digital Earth showcases that also connect repositories and federated databases to the end-user. The end-user is the scientist, who is closely involved in the development to ensure applicability. The result is the proven benefit of harmonized data and metadata from multiple sources, easy integration, and rapid assessment of the ingested data. Further, we discuss concepts and current developments aiming at enhanced monitoring and scientific workflows.
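
The following is a minimal illustrative sketch, not the operational O2A code, of the kind of range test such a quality control module can run on ingested near-real-time data.

    import pandas as pd

    def range_test(series: pd.Series, valid_min: float, valid_max: float) -> pd.Series:
        """Return a QC flag per sample: 1 = good, 4 = outside the valid range."""
        flags = pd.Series(1, index=series.index)
        flags[(series < valid_min) | (series > valid_max)] = 4
        return flags

    # Hypothetical sensor readings; -999.0 mimics a typical fill/error value
    obs = pd.Series([3.2, 3.4, -999.0, 3.5], name="water_temperature")
    print(range_test(obs, valid_min=-2.0, valid_max=35.0))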

How to cite: Silva, B., Fischer, P., Immoor, S., Denkmann, R., Maturilli, M., Weidinger, P., Rehmcke, S., Düde, T., Anselm, N., Gerchow, P., Haas, A., Schäfer-Neth, C., Schäfer, A., Frickenhaus, S., and Koppe, R. and the Computing and Data Centre of the Alfred-Wegener-Institute: Data flow, harmonization, and quality control, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-10547, https://doi.org/10.5194/egusphere-egu21-10547, 2021.

EGU21-5663 | vPICO presentations | ESSI2.2

Improved FAIR Data Publication Quality in Specialized Environmental Data Portals

Ionut Iosifescu Enescu, Gian-Kasper Plattner, Lucia Espona Pernas, Dominik Haas-Artho, and Rebecca Buchholz

Environmental research data from the Swiss Federal Research Institute WSL, an Institute of the ETH Domain, is published through the environmental data portal EnviDat (https://www.envidat.ch). EnviDat actively implements the FAIR (Findability, Accessibility, Interoperability and Reusability) principles and offers guidance and support to researchers throughout the research data publication process.

WSL strives to increase the fraction of environmental data easily available for reuse in the public domain. At the same time, WSL facilitates the publication of high-quality environmental research datasets by providing an appropriate infrastructure and a formal publication process, and by assigning Digital Object Identifiers (DOIs) and appropriate citation information.

Within EnviDat, we conceptualize and implement data publishing workflows that include automatic validation, interactive quality checks, and iterative improvement of metadata quality. The data publication workflow encompasses a number of steps, starting from the request for a DOI, through an approval process with a double-checking principle, to the submission of the metadata record to DataCite for the final data publication. This workflow can be viewed as a decentralized peer-review and quality improvement process for safeguarding the quality of published environmental datasets. The workflow is being further developed and refined together with partner institutions within the ETH Domain.

We have defined and implemented additional features in EnviDat, such as (i) in-depth tracing of data provenance through related datasets; (ii) the ability to augment published research data with additional resources which support open science such as model codes and software; and (iii) a DataCRediT mechanism designed for specifying data authorship (Collection, Validation, Curation, Software, Publication, Supervision).

We foresee that these developments will help to further improve approaches targeted at modern documentation and exchange of scientific information. This is timely given the increasing expectations that institutions and researchers have towards capabilities of research data portals and repositories in the environmental domain.

How to cite: Iosifescu Enescu, I., Plattner, G.-K., Espona Pernas, L., Haas-Artho, D., and Buchholz, R.: Improved FAIR Data Publication Quality in Specialized Environmental Data Portals, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-5663, https://doi.org/10.5194/egusphere-egu21-5663, 2021.

EGU21-2139 | vPICO presentations | ESSI2.2

EASYDAB (Earth System Data Branding) for FAIR and Open Data

Anette Ganske, Amandine Kaiser, Angelina Kraft, Daniel Heydebreck, Andrea Lammert, and Hannes Thiemann

As in many scientific disciplines, there are a variety of activities in Earth system sciences that address the important aspects of good research data management. What has not been sufficiently investigated and dealt with so far is the easy discoverability and re-use of quality-checked data. This aspect is taken up by the EASYDAB label.

EASYDAB1 is a branding, currently under development, for FAIR and open data from the Earth System Sciences. It can be adopted by institutions running a data repository that stores Earth System Science data. EASYDAB is always connected to a research data publication with DataCite DOIs. Data published under EASYDAB are characterized by high maturity, extensive metadata and compliance with a comprehensive, discipline-specific standard. For these datasets, the EASYDAB logo is added to the landing page of the data repository. Thereby, repositories can indicate their efforts to publish data with high maturity.

The first standard made for EASYDAB is the ATMODAT standard2, which has been developed within the AtMoDat3 project (Atmospheric Model Data). It incorporates concrete recommendations and requirements related to the maturity, publication and enhanced FAIRness of atmospheric model data. The requirements concern rich metadata with controlled vocabularies, structured landing pages, file formats (netCDF) and the structure within files. Human- and machine-readable landing pages are a core element of the ATMODAT standard and should hold and present discipline-specific metadata at simulation and variable level.

The ATMODAT standard includes checklists for the data producer and the data curator so that compliance with the standard can easily be verified by both sides. To facilitate automatic checking of netCDF file headers, a checker program will also be provided and published with a DOI. Moreover, a checker for compliance with the DOI metadata requirements will be developed and made openly available.

The integration of standards from other disciplines in the Earth System Sciences, such as oceanography, into EASYDAB is helpful and desirable to improve the re-use of reviewed, high-quality data. 

1 www.easydab.de

2 https://cera-www.dkrz.de/WDCC/ui/cerasearch/entry?acronym=atmodat_standard_en_v3_0

3 www.atmodat.de

How to cite: Ganske, A., Kaiser, A., Kraft, A., Heydebreck, D., Lammert, A., and Thiemann, H.: EASYDAB (Earth System Data Branding) for FAIR and Open Data, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-2139, https://doi.org/10.5194/egusphere-egu21-2139, 2021.

EGU21-5965 | vPICO presentations | ESSI2.2

PalMod-II Data Management Plan: A FAIR-inspired conceptual framework for data simulation, inter-comparison, sharing and publication  

Swati Gehlot, Karsten Peters-von Gehlen, and Andrea Lammert

Large-scale transient climate simulations and their intercomparison with paleo data within the German initiative PalMod (www.palmod.de, currently in phase II) provide a prime example of applying a Data Management Plan (DMP) to conceptualise data workflows within and outside a large multidisciplinary project. PalMod-II data products include the output of three state-of-the-art climate models with various coupling complexities and spatial resolutions, simulating the climate of the past 130,000 years. In addition to the long time series of model data, a comprehensive compilation of paleo-observation data (including a model-observation-comparison toolbox; Baudouin et al., 2021, EGU-CL1.2) is envisaged for validation.

Owing to the enormous amount of data coming from models and observations, produced and handled by different groups of scientists spread across various institutions, a dedicated DMP, maintained as a living document, provides a data-workflow framework for the exchange and sharing of data within and outside the PalMod community. The DMP covers the data life cycle within the project, from generation (data formats and standards), through analysis (intercomparison of models and observations), publication (usage, licences) and dissemination (standardised, via ESGF), to archiving after the project lifetime. As an actively and continually updated document, the DMP defines the ownership of and responsibility for the data subsets of the various working groups, along with their data sharing/reuse regulations, in order to ensure sustained progress towards the project goals.

This contribution discusses the current status and challenges of the DMP for PalMod-II, which covers the details of the data produced within the various working groups, the project-wide workflow strategy for sharing and exchanging data, and the definition of a PalMod-II variable list for standard ESGF publication. The FAIR (Findability, Accessibility, Interoperability, and Reusability) data principles play a central role and are proposed for the entire life cycle of PalMod-II data products (model and proxy paleo data) for sharing/reuse during and after the project lifetime.



How to cite: Gehlot, S., Peters-von Gehlen, K., and Lammert, A.: PalMod-II Data Management Plan: A FAIR-inspired conceptual framework for data simulation, inter-comparison, sharing and publication  , EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-5965, https://doi.org/10.5194/egusphere-egu21-5965, 2021.

EGU21-8144 | vPICO presentations | ESSI2.2

A Standard for the FAIR publication of Atmospheric Model Data developed by the AtMoDat Project

Andrea Lammert, Anette Ganske, Amandine Kaiser, and Angelina Kraft

Due to the increasing amount of data produced in science, concepts for data reusability are of immense importance. One aspect is the publication of data in a way that ensures that it is findable, accessible, interoperable and reusable (FAIR1 principles). However, putting these principles into practice often causes significant difficulties for researchers. Some repositories therefore accept datasets described only with the minimum metadata required for DOI allocation. Unfortunately, this does not contain enough information to conform to the FAIR principles; much research data cannot be reused despite having a DOI. In contrast, other repositories aid the researchers by providing advice and strictly controlling the data and their metadata. To simplify the process of defining the required amount of metadata and of controlling the data and metadata, the AtMoDat2 (Atmospheric Model Data) project developed a detailed standard for the FAIR publication of atmospheric model data.

For this purpose, we have developed a concept for the “ideal” description of atmospheric model data. A prerequisite is data publication with a DataCite DOI. The ATMODAT standard3 was developed to implement this concept. The standard defines the data format as netCDF, mandatory metadata (for the DOI, the landing page and the data header), and the naming conventions used in climate research, the Climate and Forecast conventions (CF-conventions4). However, many variable names used in urban climate research, for example, are not part of the CF-conventions. For these, standard names have to be defined together with the community, and their inclusion in the CF-conventions list has to be requested. Furthermore, we have developed and published Python routines which allow data producers as well as repositories to check model output data against the standard.
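
The sketch below conveys the idea behind such a check, verifying that mandatory global attributes are present in a netCDF header; the attribute list is an illustrative subset, and the code is a stand-in, not the published checker routines themselves.

    from netCDF4 import Dataset

    MANDATORY_GLOBAL_ATTRS = ["title", "institution", "source", "Conventions"]  # illustrative subset

    def missing_global_attributes(path: str) -> list:
        """Return the mandatory global attributes absent from the file header."""
        with Dataset(path) as nc:
            present = set(nc.ncattrs())
        return [attr for attr in MANDATORY_GLOBAL_ATTRS if attr not in present]

    print(missing_global_attributes("model_output.nc"))  # hypothetical file name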

The ATMODAT standard will first be applied by the project partners at the two participating universities (the Universities of Hamburg and Leipzig). Here, climate model data are processed with a post-processor in preparation for publication. Subsequently, the files, including the metadata specified for the DataCite metadata schema, will be published by the World Data Center for Climate5 (WDCC). Data fulfilling the ATMODAT standard will be marked on the landing page by a special EASYDAB6 (Earth System Data Branding) logo. EASYDAB is a branding, currently under development, for FAIR and open data from the Earth System Sciences. The logo indicates to future data users that the dataset is a verified dataset that can easily be reused. The standardization of the data and the further steps are easily transferable to data from other disciplines.

1 Wilkinson, M., Dumontier, M., Aalbersberg, I. et al.: The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18 

2 https://www.atmodat.de/

3 https://cera-www.dkrz.de/WDCC/ui/cerasearch/entry?acronym=atmodat_standard_en_v3_0

4 https://cfconventions.org/

5 https://cera-www.dkrz.de/WDCC/ui/cerasearch/

6 https://www.easydab.de/

 

How to cite: Lammert, A., Ganske, A., Kaiser, A., and Kraft, A.: A Standard for the FAIR publication of Atmospheric Model Data developed by the AtMoDat Project, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-8144, https://doi.org/10.5194/egusphere-egu21-8144, 2021.

EGU21-13155 | vPICO presentations | ESSI2.2

The I-ADOPT Interoperability Framework: a proposal for FAIRer observable property descriptions

Barbara Magagna, Gwenaelle Moncoiffe, Maria Stoica, Anusuriya Devaraju, Alison Pamment, Sirko Schindler, and Robert Huber

Global environmental challenges like climate change, pollution, and biodiversity loss are complex. To understand environmental patterns and processes and address these challenges, scientists require observations of natural phenomena at various temporal and spatial scales and across many domains. The research infrastructures and scientific communities involved in these activities often follow their own data management practices, which inevitably leads to a high degree of variability and incompatibility between approaches. Consequently, a variety of metadata standards and vocabularies have been proposed to describe observations and are actively used in different communities. However, this diversity of approaches now causes severe issues regarding interoperability across datasets and hampers their exploitation as a common data source.

Projects like ENVRI-FAIR, FAIRsFAIR and FAIRplus are addressing this difficulty by working on the full integration of services across research infrastructures based on the FAIR Guiding Principles, supporting the EOSC vision of an open research culture. Beyond these projects, we need collaboration and community consensus across domains to build a common framework for representing observable properties. The Research Data Alliance InteroperAble Descriptions of Observable Property Terminology Working Group (RDA I-ADOPT WG) was formed in October 2019 to address this need. Its membership covers an international representation of terminology users and terminology providers, including terminology developers, scientists, and data centre managers. The group’s overall objective is to deliver a common interoperability framework for observable property variables within its 18-month work plan. Starting with the collection of user stories from research scientists, terminology managers, and data managers or aggregators, we drafted a set of technical and content-related requirements. A survey of terminology resources and annotation practices provided us with information about almost one hundred terminologies, a subset of which was then analysed to identify existing conceptualisation practices, commonalities, gaps, and overlaps. This was then used to derive a conceptual framework to support their alignment.

In this presentation, we will introduce the I-ADOPT Interoperability Framework, highlighting its semantic components. These represent the building blocks for specific ontology design patterns addressing different use cases and varying degrees of complexity in describing observed properties. We will demonstrate the proposed design patterns using a number of essential climate and essential biodiversity variables. We will also show examples of how the I-ADOPT framework will support interoperability between existing representations. This work will provide the semantic foundation for the development of more user-friendly data annotation tools capable of suggesting appropriate FAIR terminologies for observable properties.
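
As a flavour of what such a decomposition looks like, the following sketch breaks one variable into I-ADOPT-style components; the component names follow the framework's published terminology, while the concrete decomposition of this variable is our own illustration.

    # Illustrative decomposition of an observable property into I-ADOPT-style components
    variable = {
        "label": "concentration of chlorophyll-a per unit volume of sea water",
        "property": "concentration",
        "object_of_interest": "chlorophyll-a",
        "matrix": "sea water",
        "constraints": ["per unit volume"],
    }
    print(variable)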

How to cite: Magagna, B., Moncoiffe, G., Stoica, M., Devaraju, A., Pamment, A., Schindler, S., and Huber, R.: The I-ADOPT Interoperability Framework: a proposal for FAIRer observable property descriptions, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-13155, https://doi.org/10.5194/egusphere-egu21-13155, 2021.

EGU21-15922 | vPICO presentations | ESSI2.2

F-UJI : An Automated Tool for the Assessment and Improvement of the FAIRness of Research Data

Robert Huber and Anusuriya Devaraju

Making research data FAIR (Findable, Accessible, Interoperable, and Reusable) is critical to maximizing its impact. However, since the FAIR principles are designed as guidelines and do not specify implementation rules, it is difficult to verify how these principles are put into practice. Therefore, metrics and associated tools need to be developed to enable the assessment of the FAIR compliance of services and datasets. Such practical solutions are important for many stakeholders to assess the quality of data-related services. They are important for selecting such services, but can also be used to iteratively improve data offerings, e.g., as part of FAIR advisory processes. With the increasing number of published datasets and the need to test them repeatedly, there is a growing body of literature that recognizes the importance of automated FAIR assessment tools. Our goal is to contribute to this area of FAIR through the development of an open-source tool called F-UJI. F-UJI supports the programmatic FAIR assessment of research data based on a set of core metrics against which the implementation of the FAIR principles can be assessed. This paper presents the development and application of F-UJI and the underlying metrics. For each of the metrics, we have designed and implemented practical tests based on existing standards and best practices for research data. The tests are important to our expanded understanding of how to test FAIR metrics in practice, which has not been fully addressed in previous work on FAIR data assessment. We demonstrate the use of the tool by assessing several multidisciplinary datasets from selected trusted digital repositories, followed by recommendations for improving the FAIRness of these datasets. We summarize the experience and lessons learned from the development and testing.
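
A programmatic assessment request might look as follows; the port, path and payload follow the general pattern of F-UJI's REST interface but should be treated as assumptions to be checked against a deployed instance's OpenAPI description (a deployment may also require authentication).

    import requests

    FUJI_API = "http://localhost:1071/fuji/api/v1/evaluate"  # assumed local deployment

    payload = {"object_identifier": "https://doi.org/10.xxxx/example"}  # placeholder identifier
    response = requests.post(FUJI_API, json=payload, timeout=120)
    response.raise_for_status()
    print(response.json().get("summary", {}))  # aggregated scores per FAIR principle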

How to cite: Huber, R. and Devaraju, A.: F-UJI : An Automated Tool for the Assessment and Improvement of the FAIRness of Research Data, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-15922, https://doi.org/10.5194/egusphere-egu21-15922, 2021.

EGU21-2886 | vPICO presentations | ESSI2.2

CMIP6 data documentation and citation in IPCC's Sixth Assessment Report (AR6)

Martina Stockhause, Robin Matthews, Anna Pirani, Anne Marie Treguier, and Ozge Yelekci

The amount of work and resources invested by the modelling centers to provide the CMIP6 (Coupled Model Intercomparison Project Phase 6) experiments and climate projection datasets is huge, and it is therefore extremely important that the teams receive proper credit for their work. The Citation Service makes CMIP6 data citable with DOI references for the evolving CMIP6 model data published in the Earth System Grid Federation (ESGF). The Citation Service, a new piece of the CMIP6 infrastructure, was developed upon request from the CMIP Panel.

CMIP6 provides new global climate model data assessed in the IPCC's (Intergovernmental Panel on Climate Change) Sixth Assessment Report (AR6). Led by the Technical Support Unit of IPCC Working Group I (WGI TSU), the IPCC Task Group on Data Support for Climate Change Assessment (TG-Data) developed FAIR data guidelines for implementation by the TSUs of the three IPCC WGs and the IPCC Data Distribution Centre (DDC) partners. A central part of the FAIR data guidelines is the documentation and citation of data used in the report.

The contribution will show how CMIP6 data usage is documented in the IPCC WGI AR6 from three angles: the technical implementation, the collection of CMIP6 data usage information from the IPCC authors, and a report user's perspective.
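
One concrete mechanism behind DOI-based data citation is content negotiation on doi.org, sketched below for a placeholder DOI (replace it with a real CMIP6 dataset DOI before running).

    import requests

    doi_url = "https://doi.org/10.xxxx/placeholder"  # not a real DOI; will not resolve as-is

    response = requests.get(doi_url, headers={"Accept": "text/x-bibliography"}, timeout=60)
    response.raise_for_status()
    print(response.text)  # a formatted citation string for the dataset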

 

Links:

  • CMIP6 Citation Service: http://cmip6cite.wdc-climate.de
  • CMIP6: https://pcmdi.llnl.gov/CMIP6/
  • IPCC AR6: https://www.ipcc.ch/assessment-report/ar6/
  • IPCC AR6 WGI report: https://www.ipcc.ch/report/sixth-assessment-report-working-group-i/
  • IPCC TG-Data: https://www.ipcc.ch/data/

How to cite: Stockhause, M., Matthews, R., Pirani, A., Treguier, A. M., and Yelekci, O.: CMIP6 data documentation and citation in IPCC's Sixth Assessment Report (AR6), EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-2886, https://doi.org/10.5194/egusphere-egu21-2886, 2021.

EGU21-5492 | vPICO presentations | ESSI2.2

Raiders of the Lost Code: Preserving the MOSS Codebase - Significance, Status, Challenges and Opportunities 

Peter Löwe, Māris Nartišs, and Carl N Reed

We report on the current status of the software repository of the Map Overlay and Statistical System (MOSS) and on upcoming actions to ensure long-term preservation of the codebase as a historic geospatial source. MOSS is the earliest known open source Geographic Information System (GIS). Active development of the vector-based interactive GIS by the U.S. Department of the Interior began in 1977 on a CDC mainframe computer located at Colorado State University. Development continued until 1985, with MOSS being ported to multiple platforms, including DG-AOS, UNIX, VMS and Microsoft DOS. Many geospatial programming techniques and functionalities were first implemented in MOSS, including a fully interactive user interface and integrated vector and raster processing. The public availability of the WWW in the early 1990s sparked the growth of new open source GIS projects, which led to the formation of the Open Source Geospatial Foundation (OSGeo). The goal of OSGeo is to support and promote the collaborative development of open geospatial technologies and data, including best practices for project management and repositories for codebases. From its start, OSGeo recognised MOSS as the original forerunner project. After the decline of active use of MOSS in the 1990s, the U.S. Bureau of Land Management (BLM) continued to provide the open source MOSS codebase on an FTP server, which allowed use, analysis and reference by URL. This service was discontinued at some point before 2018, which was eventually discovered through a broken URL. This led to a global search-and-rescue effort among the OSGeo communities to track down remaining offline copies of the codebase. In mid-2020 a surviving copy of the MOSS codebase was discovered at the University of Latvia; it is temporarily preserved at the German Institute of Economic Research (DIW Berlin). OSGeo has agreed to make MOSS the first OSGeo Heritage Project to ensure long-term preservation in an OSGeo code repository. This is a significant first step towards enabling MOSS-related research based on the FAIR (Findable, Accessible, Interoperable, Reusable) paradigm. Follow-up actions will be required to enable scientific citation and credit through persistent identifiers for code and persons, such as Digital Object Identifiers (DOIs) and Open Researcher and Contributor iDs (ORCID iDs), within the OSGeo repository environment. This will advance the OSGeo portfolio of best practices for other open geospatial projects as well.

 

How to cite: Löwe, P., Nartišs, M., and Reed, C. N.: Raiders of the Lost Code: Preserving the MOSS Codebase - Significance, Status, Challenges and Opportunities , EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-5492, https://doi.org/10.5194/egusphere-egu21-5492, 2021.

EGU21-8294 | vPICO presentations | ESSI2.2

Enabling “LiDAR data processing” as a service in a Jupyter environment 

Spiros Koulouzis, Yifang Shi, Yuandou Wan, Riccardo Bianchi, Daniel Kissling, and Zhiming Zhao

Airborne Laser Scanning (ALS) data derived from Light Detection And Ranging (LiDAR) technology allow the construction of Essential Biodiversity Variables (EBVs) of ecosystem structure at high resolution at landscape, national and regional scales. Researchers nowadays often process such data and rapidly prototype using scripting languages like R or Python, and they share their experiments via scripts or, more recently, via notebook environments such as Jupyter. To scale experiments to large data volumes, extra data sources, or new models, researchers often employ cloud infrastructures to enhance notebooks (e.g. JupyterHub) or execute the experiments as a distributed workflow. In many cases, a researcher has to encapsulate subsets of the code (namely, cells in Jupyter) from the notebook as components to be included in the workflow. However, it is usually time-consuming and a burden for the researcher to encapsulate those components for a workflow system's specific interface, and the Findability, Accessibility, Interoperability and Reusability (FAIR) of those components is often limited. We aim to enable the public cloud processing of massive amounts of ALS data across countries and regions and to make the retrieval and uptake of such EBV data products of ecosystem structure easily available to a wide scientific community and stakeholders.

 

We propose and develop a tool called FAIR-Cells that can be integrated into the JupyterLab environment as an extension to help scientists and researchers improve the FAIRness of their code. It can encapsulate user-selected cells of code as standardized RESTful API services, and it allows users to containerize such Jupyter code cells and to publish them as reusable components via community repositories.
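
Conceptually, encapsulating a notebook cell as a RESTful service amounts to something like the hand-written Flask sketch below; FAIR-Cells automates this step, and the code shown is an illustrative stand-in, not the extension's generated output.

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/canopy_height", methods=["POST"])
    def canopy_height():
        # Body of a former notebook cell: derive a toy metric from posted points
        points = request.get_json()["points"]  # expected as [[x, y, z], ...]
        heights = [p[2] for p in points]
        return jsonify({"canopy_height_range": max(heights) - min(heights)})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)  # the container would expose this port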

 

We demonstrate the features of FAIR-Cells using an application from the ecology domain. Ecologists currently process various point cloud datasets derived from LiDAR to extract metrics that capture the vertical and horizontal structure of vegetation. A new open-source software package called ‘Laserchicken’ allows the processing of country-wide LiDAR datasets in a local environment (e.g. the Dutch national ICT infrastructure called SURF). However, users have to use the Laserchicken application as a whole to process the LiDAR data, and the capacity of the given infrastructure limits the volume of data. In this work, we will first demonstrate how a user can apply the FAIR-Cells extension to interactively create RESTful services for the components of the Laserchicken software in a Jupyter environment, to automate the encapsulation of those services as Docker containers, and to publish the services in a community catalogue (e.g. LifeWatch) via its API (based on GeoNetwork). We will then demonstrate how those containers can be assembled into a workflow (e.g. using the Common Workflow Language) and deployed on a cloud environment (offered by the EOSC early adopter programme for ENVRI-FAIR) to process a much bigger dataset than is possible in a local environment. The demonstration results suggest that our technical roadmap can achieve FAIRness and good parallelism on large, distributed data volumes when executing Jupyter-based code.

How to cite: Koulouzis, S., Shi, Y., Wan, Y., Bianchi, R., Kissling, D., and Zhao, Z.: Enabling “LiDAR data processing” as a service in a Jupyter environment , EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-8294, https://doi.org/10.5194/egusphere-egu21-8294, 2021.

EGU21-8418 | vPICO presentations | ESSI2.2

An online service for analysing ozone trends within EOSC-synergy

Tobias Kerzenmacher, Valentin Kozlov, Borja Sanchis, Ugur Cayoglu, Marcus Hardt, and Peter Braesicke

The European Open Science Cloud-Synergy (EOSC-Synergy) project delivers services that expand the use of EOSC. One of these services, O3as, is being developed for scientists using chemistry-climate models to determine time series and, eventually, ozone trends for potential use in the quadrennial Global Assessment of Ozone Depletion, which will be published in 2022. A unified approach from a service like ours, which analyses results from a large number of different climate models, helps to harmonise the calculation of ozone trends efficiently and consistently. With O3as, publication-quality figures can be reproduced quickly and coherently. This is done via a web application where users configure their queries to perform simple analyses. These queries are passed to the O3as service via an O3as REST API call. There, the O3as service processes the query and accesses the reduced dataset. To create a reduced dataset, regular tasks are executed on a high-performance computer (HPC) to copy the primary data and perform data preparation (e.g. data reduction, standardisation and parameter unification). O3as uses EGI Check-in (OIDC) to identify users and grant access to certain functionalities of the service, udocker (a tool to run Docker containers in multi-user space without root privileges) to perform data reduction in the HPC environment, and the Universitat Politècnica de València (UPV) Infrastructure Manager to provision service resources (Kubernetes).
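
Such a REST API call could take the following shape; the endpoint URL and parameter names below are placeholders for illustration, not the service's documented interface.

    import requests

    O3AS_API = "https://example.org/o3as/api/v1/data"  # placeholder URL (assumption)

    params = {
        "model": "SOME-CCM",   # hypothetical model identifier
        "begin": 1980, "end": 2020,
        "lat_min": -10, "lat_max": 10,
    }
    response = requests.get(O3AS_API, params=params, timeout=60)
    response.raise_for_status()
    print(response.json())  # time series ready for plotting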

How to cite: Kerzenmacher, T., Kozlov, V., Sanchis, B., Cayoglu, U., Hardt, M., and Braesicke, P.: An online service for analysing ozone trends within EOSC-synergy, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-8418, https://doi.org/10.5194/egusphere-egu21-8418, 2021.

EGU21-15669 | vPICO presentations | ESSI2.2

Applying VocPrez to operational semantic repositories: the NVS experience

Alexandra Kokkinaki, Quyen Luong, Christopher Thompson, Nicholas Car, and Gwenaelle Moncoiffe

The Natural Environment Research Council’s (NERC) Vocabulary Server (NVS1) has been serving the marine and wider community with controlled vocabularies for over a decade. NVS provides access to standardised lists of terms which are used for data mark-up, facilitating interoperability and discovery in the marine and associated Earth science domains. The NVS controlled vocabularies are published as Linked Data on the web using the data model of the Simple Knowledge Organisation System (SKOS). They can also be accessed as web services (RESTful, SOAP) or through a SPARQL endpoint. NVS is an operational semantic repository which underpins data systems like SeaDataNet, the pan-European infrastructure for marine data management, and is embedded in SeaDataNet-specific tools like MIKADO. Its services are constantly monitored by the SeaDataNet Argo monitoring system, guaranteeing reliability and availability. In this presentation we will discuss the challenges we encountered while enhancing an operational semantic repository like NVS with VocPrez, a read-only web delivery system for SKOS-formulated RDF vocabularies. We will also present our approach to implementing CI/CD delivery and the added value of VocPrez to NVS in terms of FAIRness. Finally, we will discuss the lessons learnt during the lifecycle of this development.

VocPrez2 is an open-source, pure-Python application that reads vocabularies from one or more sources and presents them online (HTTP) in several different ways: as human-readable web pages, using simple HTML templates for the different SKOS objects, and as machine-readable RDF or other formats, using mapping code. The different information-model views supported by VocPrez are defined by profiles, that is, by formal specifications. VocPrez supports both different profiles and different formats (media types) for each profile.

VocPrez enhanced the publication of NVS both for human users and for machines. Humans accessing NVS are presented with a new look and feel that is more user-friendly, providing filtering of collections, concepts and thesauri, and sorting of results using different options. For machine-to-machine communication, VocPrez presents NVS content in machine-readable formats which Internet clients can request directly using the Content Negotiation by Profile standard3. The profiles and formats available are also listed on an “Alternate Profiles” web page which is automatically generated per resource, thus allowing the discovery of options. As a result, human or machine end users can access NVS collections, thesauri and concepts according to different information models, such as DCAT, NVS’ own vocabulary model or pure SKOS, and in different serializations, like JSON-LD or Turtle, using content negotiation.
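
A minimal machine-to-machine request then looks like the sketch below, which asks for a Turtle serialization of a SKOS collection; the collection URI is an example, and profile selection would additionally use the Accept-Profile header defined by Content Negotiation by Profile.

    import requests

    uri = "http://vocab.nerc.ac.uk/collection/P02/current/"  # example NVS collection

    response = requests.get(uri, headers={"Accept": "text/turtle"}, timeout=60)
    response.raise_for_status()
    print(response.text[:400])  # SKOS concepts serialized as Turtle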

1 http://vocab.nerc.ac.uk/

2 https://github.com/RDFLib/VocPrez

3 https://www.w3.org/TR/dx-prof-conneg/

How to cite: Kokkinaki, A., Luong, Q., Thompson, C., Car, N., and Moncoiffe, G.: Applying VocPrez to operational semantic repositories: the NVS experience, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-15669, https://doi.org/10.5194/egusphere-egu21-15669, 2021.

EGU21-8903 | vPICO presentations | ESSI2.2

Generic concepts for organising data management in research projects

Ivonne Anders, Swati Gehlot, Andrea Lammert, and Karsten Peters-von Gehlen

For some years now, research data management has been an increasingly important part of scientific projects, regardless of the number of topics or subjects, researchers or institutions involved. The bigger the project, the greater the data organization and data management requirements in order to assure the best outcome of the project. Despite this, projects rarely have clear structures or responsibilities for data management. The importance of clearly defining data management, and of budgeting for it, is often underestimated or neglected. Only a scarce number of reports and documentations explaining the research data management of particular projects and detailing best-practice examples can be found in the current literature. Additionally, these are often mixed up with general project management topics. Furthermore, such examples are usually focused on the specific issues of the projects described, and thus transferring (or generally applying) the methods provided is very difficult.

This contribution presents generic concepts of research data management, with an effort to separate them from general project management tasks. Project size, the diversity of topics and the researchers involved play an important role in shaping data management and in determining which data management methods can add value to the outcome of a project. We especially focus on different organisation types, including roles and responsibilities for data management in projects of different sizes. Additionally, we show how and when training should be included, and how important agreements within a project are.

How to cite: Anders, I., Gehlot, S., Lammert, A., and Peters-von Gehlen, K.: Generic concepts for organising data management in research projects, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-8903, https://doi.org/10.5194/egusphere-egu21-8903, 2021.

ESSI2.6 – SMART Monitoring and Integrated Data Exploration of the Earth System

EGU21-8361 | vPICO presentations | ESSI2.6

Ensuring data trustworthiness within SMART Monitoring of environmental processes

Uta Koedel, Peter Dietrich, Philipp Fischer, and Claudia Schuetze

The term SMART Monitoring was also defined by the project Digital Earth (DE), a central joint project of eight Helmholtz centers in Earth and Environment. SMART Monitoring in the sense of DE means that measured environmental parameters and values need to be specific/scalable, measurable/modular, accepted/adaptive, relevant/robust, and trackable/transferable (SMART) for sustainable data use and improved real data acquisition. SMART Monitoring can be defined as a reliable monitoring approach with machine-learning- and artificial-intelligence-(A.I.)-supported procedures for an “as automated as possible” data flow from individual sensors to databases. SMART Monitoring tools must include various standardized data flows within the entire data lifecycle, e.g. specific sensor solutions, novel approaches for sampling designs, and defined standardized metadata descriptions. One of the essential components of SMART Monitoring workflows is enhancing metadata with comprehensive information on data quality. At the same time, SMART Monitoring must be highly modular and adaptive to apply to different monitoring approaches and disciplines in the sciences.

In SMART Monitoring, data quality is crucial, not only with respect to data FAIRness: it is essential to ensure data reliability and representativeness. Hence, comprehensively documented data quality is required to enable meaningful data selection for specific data blending, integration, and joint interpretation. Data integration from different sources is a prerequisite for the parameterization and validation of predictive tools and models. This demonstrates the importance of implementing the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for sustainable data management (Wilkinson et al. 2016). So far, the FAIR data principles do not include a detailed description of data quality and do not cover content-related quality aspects. Even though data may be FAIR in terms of availability, they are not necessarily “good” in terms of accuracy and precision. Unfortunately, there is still considerable confusion in science about the definition of good or trustworthy data.

An assessment of data quality and data origin is essential to preclude inaccurate, incomplete, or even unsatisfactory data analysis when applying, e.g., machine learning methods, and to avoid poorly derived, misleading or incorrect conclusions. The terms trustworthiness and representativeness summarise all aspects related to these issues. The central pillars of trustworthiness/representativeness are validity, provenience/provenance, and reliability, which are fundamental features in assessing any data collection or processing step for transparent research. For all kinds of secondary data usage and analysis, a detailed description and assessment of reliability and validity involve an appraisal of the applied data collection methods.

The presentation will give illustrative examples to show the importance of evaluating and describing data trustworthiness and representativeness, allowing scientists to find appropriate tools and methods for FAIR data handling and more accurate data interpretation.

How to cite: Koedel, U., Dietrich, P., Fischer, P., and Schuetze, C.: Ensuring data trustworthiness within SMART Monitoring of environmental processes, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-8361, https://doi.org/10.5194/egusphere-egu21-8361, 2021.

EGU21-7747 | vPICO presentations | ESSI2.6

Towards a definition of Essential Mountain Climate Variables

James Thornton, Elisa Palazzi, Nicholas Pepin, Paolo Cristofanelli, Richard Essery, Sven Kotlarski, Gregory Giuliani, Yaniss Guigoz, Aino Kulonen, Xiaofeng Li, David Pritchard, Hayley Fowler, Christophe Randin, Maria Shahgedanova, Martin Steinbacher, Marc Zebisch, and Carolina Adler

Numerous applications, including generating future predictions via numerical modelling, establishing appropriate policy instruments, and effectively tracking progress against them, require the multitude of complex processes and interactions operating in rapidly changing mountainous environmental systems to be well monitored and understood. At present, however, not only are the available environmental data pertaining to mountains often severely limited, but interdisciplinary consensus regarding which variables should be considered absolute observation priorities is also lacking. In this context, the concept of so-called Essential Mountain Climate Variables (EMCVs) is introduced as a potential means of identifying critical observation priorities and thereby ameliorating the situation. Following a brief overview of the most critical aspects of ongoing and expected future climate-driven change in various key mountain system components (i.e. the atmosphere, cryosphere, biosphere and hydrosphere), a preliminary list of corresponding potential EMCVs – ranked according to perceived importance – is proposed. Interestingly, several of these variables do not currently feature amongst the globally relevant Essential Climate Variables (ECVs) curated by GCOS, suggesting that this mountain-specific approach is indeed well justified. Thereafter, both established and emerging possibilities to measure, generate, and apply EMCVs are summarised. Finally, future activities that must be undertaken if the concept is eventually to be formalized and widely applied are recommended.

How to cite: Thornton, J., Palazzi, E., Pepin, N., Cristofanelli, P., Essery, R., Kotlarski, S., Giuliani, G., Guigoz, Y., Kulonen, A., Li, X., Pritchard, D., Fowler, H., Randin, C., Shahgedanova, M., Steinbacher, M., Zebisch, M., and Adler, C.: Towards a definition of Essential Mountain Climate Variables, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-7747, https://doi.org/10.5194/egusphere-egu21-7747, 2021.

EGU21-3951 | vPICO presentations | ESSI2.6

A web-based visual-analytics tool for ad-hoc campaign planning in terrestrial hydrology

Erik Nixdorf, Daniel Eggert, Peter Morstein, Thomas Kalbacher, and Doris Dransch

A deeper understanding of the Earth system as a whole and its interacting sub-systems depends, perhaps more than ever, not only on accurate mathematical approximations of the physical processes but also on the availability of environmental data across temporal and spatial scales. Even though advanced numerical simulations and satellite-based remote sensing, in conjunction with sophisticated algorithms such as machine learning tools, can provide 4D environmental datasets, local and mesoscale measurements continue to be the backbone of many disciplines such as hydrology. Considering the limitations of human and technical resources, monitoring strategies for these types of measurements should be well designed to increase the information gain they provide. One helpful set of tools to address these tasks is visual-analytical data exploration frameworks that integrate qualified multi-parameter data from different sources and tailor well-established computational and visual methods to explore and analyze them. In this context, we developed a SMART Monitoring workflow to determine the most suitable time and location for event-driven, ad-hoc monitoring in hydrology, using soil moisture measurements as our target variable.

The SMART Monitoring workflow consists of three main steps. The first is the identification of the region of interest, either via user selection or via a recommendation based on spatial environmental parameters provided by the user. Statistical filters and different color schemes can be applied to highlight potentially relevant regions. In the second step, time-dependent environmental parameters (e.g., rainfall and soil moisture estimates of the recent past, weather predictions from numerical weather models, and swath forecasts from Earth observation satellites) for those relevant regions can be evaluated to identify suitable time frames for the planned monitoring campaign. Lastly, a detailed assessment of the region of interest is conducted by applying filter and weight functions in combination with multiple linear regressions on selected input parameters. Depending on the measurement objective (e.g., highest/lowest values, highest/lowest change), the most suitable areas for monitoring will subsequently be visually highlighted. Based on the common road network, an efficient route for a corresponding monitoring campaign can be derived for the identified regions of interest and directly visualized in the visual-analytical environment.
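
A minimal sketch of the weighting idea, under assumed inputs and weights (not the tool's actual algorithm), could look as follows.

    import numpy as np

    rng = np.random.default_rng(42)
    rainfall = rng.random((100, 100))       # stand-ins for normalized gridded inputs
    soil_moisture = rng.random((100, 100))
    slope = rng.random((100, 100))

    weights = {"rainfall": 0.5, "soil_moisture": 0.3, "slope": -0.2}  # illustrative weights
    score = (weights["rainfall"] * rainfall
             + weights["soil_moisture"] * soil_moisture
             + weights["slope"] * slope)

    mask = score > np.quantile(score, 0.9)  # top decile = candidate monitoring areas
    print("candidate cells:", int(mask.sum()))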

How to cite: Nixdorf, E., Eggert, D., Morstein, P., Kalbacher, T., and Dransch, D.: A web-based visual-analytics tool for ad-hoc campaign planning in terrestrial hydrology, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-3951, https://doi.org/10.5194/egusphere-egu21-3951, 2021.

EGU21-15623 | vPICO presentations | ESSI2.6 | Highlight

The Digital Earth Viewer: A new visualization approach for geospatial time series data

Valentin Buck, Flemming Stäbler, Everardo Gonzalez, and Jens Greinert

The study of the Earth's systems depends on a large number of observations from heterogeneous sources, which are usually scattered across time and space and are tightly intercorrelated. The understanding of these systems depends on the ability to access diverse data types and to contextualize them in a global setting suitable for their exploration. While the collection of environmental data has seen an enormous increase over the last couple of decades, the development of the software solutions necessary to integrate observations across disciplines seems to be lagging behind. To deal with this issue, we developed the Digital Earth Viewer: a new program to access, combine, and display geospatial data from multiple sources over time.

Choosing a new approach, the software displays space in true 3D and treats time and time ranges as true dimensions. This allows users to navigate observations across spatio-temporal scales and combine data sources with each other as well as with meta-properties such as quality flags. In this way, the Digital Earth Viewer supports the generation of insight from data and the identification of observational gaps across compartments.

Developed as a hybrid application, it may be used both in situ, as a local installation to explore and contextualize new data, and in a hosted context to present curated data to a wider audience.

In this work, we present this software to the community, show its strengths and weaknesses, give insight into the development process, and talk about extending and adapting the software to custom use cases.

How to cite: Buck, V., Stäbler, F., Gonzalez, E., and Greinert, J.: The Digital Earth Viewer: A new visualization approach for geospatial time series data, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-15623, https://doi.org/10.5194/egusphere-egu21-15623, 2021.

EGU21-14621 | vPICO presentations | ESSI2.6

The use of ERDDAP in a self-monitoring and nowcast hazard alerting coastal flood system

Louise Darroch, Thomas Gardner, Margaret Yelland, Christopher Cardwell, Emma Slater, Elizabeth Bradshaw, Justin Buck, Robert Jennings, Andrew Hale, and Jennifer Brown

In the UK, £150bn of assets and 4 million people are at risk from coastal flooding. With reductions in public funding, rising sea levels and changing storm conditions, cost-effective and accurate early-warning flood forecasting systems are required. However, the numerical tools currently used to estimate wave overtopping are based on tank experiments and on very limited previous field measurements of total overtopping volumes only. Furthermore, setting tolerable hazard thresholds in flood forecasting models requires site-specific information on wave overtopping during storms of varying severity.

The National Oceanography Centre (NOC) is currently developing a new nowcast wave overtopping alert system that can be deployed in site-specific coastal settings to detect potentially dangerous flood conditions in near real-time (NRT) while validating operational forecasting services. At its core, it utilises a prototype overtopping sensor and an instance of the National Oceanic and Atmospheric Administration's ERDDAP data server in a self-monitoring and alerting control system. In-situ detection will be performed by WireWall, a novel capacitance wire sensor that measures at the high frequencies (400 Hz) required to obtain the distribution of overtopping volume and horizontal velocity on a wave-by-wave basis. The sensor includes on-board data processing and two-way telemetry to enable automation and control. The telemetry posts regular health summaries and high-resolution (1 s) hazard data (produced by the on-board processing) over the standard internet protocol (HTTPS) to an open ERDDAP server, so data are freely available via an application programming interface (API) alongside other NRT and delayed-mode global coastal ocean and weather information for further data exploration. ERDDAP allows NRT hazard data to be accessed by statistical algorithms and visual applications, and alerts are also fed to message queues (RabbitMQ) that can be monitored by external systems. Combined, this will enable automated health monitoring and sensor operation, and offer the potential for downstream hazard management tools (such as navigation systems and transport management systems) to ingest the nowcast wave overtopping hazard data. To integrate the data with wider systems and different disciplines, the ERDDAP data sets will be enriched with common and well-structured metadata. Data provenance, controlled vocabularies, quality control and attribution information embedded in the data workflow are fundamental to ensuring user trust in the data and any products generated, while enhancing FAIR data principles.

The new nowcast wave overtopping alert system will be tested in 2021 during field deployments of multiple WireWall systems at two high energy coastal sites in the UK. Such data are crucial for validating operational flood forecast services as well as protecting local communities and minimising transport service disruptions. The addition of SMART monitoring optimises sensor maintenance and operation, reducing the costs associated with teams travelling to the site. Using ERDDAP embedded with well-structured metadata enables machines to access multiple flood parameters through a single point that abstracts users from the complexities associated with the source data, offering the potential for further data exploration through modelling or techniques such as machine learning. 
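A minimal sketch of the alerting path described above, under stated assumptions: near-real-time hazard values are polled from a hypothetical ERDDAP dataset over HTTPS and, if a threshold is exceeded, an alert message is published to a RabbitMQ queue via the pika client. The server URL, dataset and variable names, threshold, and the assumed JSON layout are placeholders, not those of the NOC system.

```python
# Hedged sketch: poll NRT hazard data from an ERDDAP endpoint over HTTPS and
# push an alert to a RabbitMQ queue. URL, dataset, variables and threshold are
# hypothetical placeholders.
import json
import requests
import pika

ERDDAP_URL = "https://example.org/erddap/tabledap/wirewall_hazard.json"  # hypothetical
resp = requests.get(ERDDAP_URL + '?time,overtop_volume&orderByMax("time")')
latest = resp.json()["table"]["rows"][-1]   # assumed ERDDAP JSON table layout
timestamp, volume = latest[0], latest[1]

THRESHOLD = 50.0  # hypothetical tolerable overtopping volume
if volume > THRESHOLD:
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="overtopping_alerts", durable=True)
    channel.basic_publish(
        exchange="",
        routing_key="overtopping_alerts",
        body=json.dumps({"time": timestamp, "volume": volume, "level": "warning"}),
    )
    conn.close()
```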

How to cite: Darroch, L., Gardner, T., Yelland, M., Cardwell, C., Slater, E., Bradshaw, E., Buck, J., Jennings, R., Hale, A., and Brown, J.: The use of ERDDAP in a self-monitoring and nowcast hazard alerting coastal flood system, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-14621, https://doi.org/10.5194/egusphere-egu21-14621, 2021.

Soil moisture is a crucial variable in the Earth's critical zone. It depends on multiple factors such as climate, topographic conditions and soil characteristics, and it affects energy and water fluxes across the land-atmosphere interface; it is therefore highly important for terrestrial ecosystems, ecosystem management and agriculture. The accurate mapping of soil moisture across time and space is challenging but highly desirable.

One option is to deploy ground-based moisture sensors at the point scale and to interpolate and/or map the measurements into space. We have developed a data-driven approach to map the soil moisture in a reference area from point measurements at specific time points and the covariates location, topographic conditions and soil characteristics. We tested the mapping capacity of two machine-learning algorithms (Random Forest and Neural Networks) and compared these with Ordinary Kriging as a standard method. Our questions were: 1) How accurate are the machine-learning methods for soil moisture mapping? 2) Which covariates are most important? 3) How does mapping accuracy vary with data density and temporal resolution?

We used soil moisture data from the TERENO experimental sites Wüstebach and Rollesbroich located in western Germany. These small catchments are equipped with a dense network of soil moisture sensors using time domain reflectometry (TDR) that has been operated since 2010 (Bogena et al., 2010; Zacharias et al., 2011). From this, we created 2700 point-based soil moisture data sets at specific time points, specific depths and for various numbers of sensor locations. We then merged these data sets with sampled data on soil texture and chemical composition (Qu et al., 2016; Gottselig et al., 2017) as well as remotely sensed terrain data. These time-stamp-specific point-based soil moisture measurements were mapped using Ordinary Kriging (OK), Random Forest (RF) and Neural Networks (ANN), with combinations of the soil and terrain attributes as well as geometric distances between sensor locations as covariates. Each model was trained (80% subset) and tested (20% subset) on the point-based data sets.

In general, average model accuracy across the methods and individual data set types (depth, number of sensor locations, temporal averaging) was relatively low, with R² values of approximately 0.2-0.5. This originates from the high variability of soil moisture. Surprisingly, models using only the spatial structure of the domain (distances between sensors as covariates) already yielded an R² of approximately 0.45. Adding further covariates such as soil and terrain attributes did not substantially improve the accuracy of these models. In comparison, using only terrain attributes as covariates for RF and ANN yielded an accuracy (R²) of 0.25-0.27. The trained models were then used to map soil moisture onto the entire study area. This resulted in maps with interesting patterns that differed between the individual methods, even when using the same covariate types.

Finally, it can be concluded that for the spatial interpolation of soil moisture, the Random Forest algorithm using distances between sensor locations as covariates is a promising alternative to Ordinary Kriging in terms of accuracy and simplicity.
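
A minimal sketch of the distance-as-covariates idea, with synthetic data standing in for the TERENO measurements (not the study's exact setup): a Random Forest is trained on the geometric distances from each sensor to all sensor locations and evaluated on a 20% test subset.

```python
# Sketch: Random Forest interpolation of soil moisture using distances between
# sensor locations as covariates. All data are synthetic.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
xy = rng.uniform(0, 1000, size=(100, 2))                              # sensor coordinates (m)
sm = 0.25 + 0.1 * np.sin(xy[:, 0] / 300) + rng.normal(0, 0.03, 100)   # soil moisture (-)

X = cdist(xy, xy)   # covariates: distances from every sensor to all locations
X_tr, X_te, y_tr, y_te = train_test_split(X, sm, test_size=0.2, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("R2 on the 20% test subset:", rf.score(X_te, y_te))
```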

How to cite: Boog, J. and Kalbacher, T.: Point to Space: Data-driven Soil Moisture Spatial Mapping using Machine-Learning for Small Catchments in Western Germany, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-16308, https://doi.org/10.5194/egusphere-egu21-16308, 2021.

EGU21-9290 | vPICO presentations | ESSI2.6 | Highlight

Modeling methane from the North Sea region with ICON-ART

Christian Scharun, Roland Ruhnke, Michael Weimer, and Peter Braesicke

Methane (CH4) is the second most important greenhouse gas after CO2 affecting global warming. Various sources (e.g. fossil fuel production, agriculture and waste, biomass burning and natural wetlands) and sinks (the reaction with the OH radical as the main sink contributes to tropospheric ozone production) determine the methane budget. Due to its long lifetime in the atmosphere, methane can be transported over long distances.

Disused and active offshore platforms can emit methane, in amounts that are difficult to quantify. In addition, explorations of the sea floor in the North Sea showed a release of methane near the boreholes of both oil- and gas-producing platforms. The basis of this study is the established emission database EDGAR (Emission Database for Global Atmospheric Research), an inventory that includes methane emission fluxes in the North Sea region. While the methane emission fluxes in the EDGAR inventory match the platform locations for most of the oil platforms, almost all of the gas platform sources are missing from the database. We develop a method for estimating the missing emission sources based on the EDGAR inventory and the known locations of gas platforms, which will be inserted into the model as additional point sources.

In this study the global model ICON-ART (ICOsahedral Nonhydrostatic model - Aerosols and Reactive Trace gases) is used. ART is an online-coupled model extension for ICON that includes chemical gases and aerosols. One aim of the model is the simulation of interactions between the trace substances and the state of the atmosphere by coupling the spatiotemporal evolution of tracers with atmospheric processes. ICON-ART sensitivity simulations are performed with inserted and adjusted sources to assess their influence on the methane and OH-radical distribution on regional (North Sea) and global scales.
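
A minimal sketch of the point-source insertion idea under stated assumptions: the grid resolution, platform coordinates and flux values below are hypothetical, and the actual method derives the missing fluxes from the EDGAR inventory rather than prescribing them.

```python
# Sketch: add methane fluxes for known gas-platform locations to a gridded,
# EDGAR-like emission field. Grid, platform list and fluxes are hypothetical.
import numpy as np

lats = np.arange(50.0, 62.0, 0.1)   # North Sea region, 0.1 deg grid (assumed)
lons = np.arange(-4.0, 10.0, 0.1)
emissions = np.zeros((lats.size, lons.size))   # CH4 flux field (EDGAR-like)

# Hypothetical gas platforms missing from the inventory: (lat, lon, flux)
platforms = [(58.4, 1.7, 2.5e-9), (56.9, 2.2, 1.1e-9)]

for lat, lon, flux in platforms:
    i = int(round((lat - lats[0]) / 0.1))
    j = int(round((lon - lons[0]) / 0.1))
    emissions[i, j] += flux   # insert the estimated source into its grid cell
```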

How to cite: Scharun, C., Ruhnke, R., Weimer, M., and Braesicke, P.: Modeling methane from the North Sea region with ICON-ART, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-9290, https://doi.org/10.5194/egusphere-egu21-9290, 2021.

ESSI2.10 – Joint JAXA-ESA Session on the Mutual Cooperation Using Synthetic Aperture Radar Satellites in Earth Science and Applications

EGU21-7100 | vPICO presentations | ESSI2.10 | Highlight

Synergetic use of L- and C-band SAR data in Earth Sciences – The JAXA-ESA mutual cooperation

Julia Kubanek, Malcolm Davidson, Maurice Borgeaud, Shin-ichi Sobue, and Takeo Tadono

Within the “Cooperation for the Use of Synthetic Aperture Radar Satellites in Earth Science and Applications”, the Japanese Aerospace Exploration Agency (JAXA) and the European Space Agency (ESA) agreed to mutually share C-band data from ESA’s Sentinel-1 mission and L-band data from JAXA’s ALOS-2 PALSAR-2 mission over selected test sites. Applications include wetland monitoring, hurricanes, sea ice, snow water equivalent and surface deformation.

The aim of the collaboration is to develop a better understanding of the benefits of combining L- and C-band data over various areas and for the different thematic applications. The findings of the different European, Japanese and international projects will help to develop future SAR satellite missions, such as JAXA’s ALOS-4, and ESA’s Copernicus mission ROSE-L and Sentinel-1 Next Generation.

This presentation will give an overview of the ongoing ESA-JAXA cooperation and will show highlights and first results of the different test sites and applications.

How to cite: Kubanek, J., Davidson, M., Borgeaud, M., Sobue, S., and Tadono, T.: Synergetic use of L- and C-band SAR data in Earth Sciences – The JAXA-ESA mutual cooperation, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-7100, https://doi.org/10.5194/egusphere-egu21-7100, 2021.

EGU21-534 | vPICO presentations | ESSI2.10 | Highlight

Japan's L-SAR missions

Shin-ichi Sobue, Takeo Tadono, Satoko Miura, Akiko Noda, Takeshi Motooka, and Masato Ohki

The Japan Aerospace Exploration Agency (JAXA) launched its first L-band SAR mission, the Japanese Earth Resources Satellite (JERS-1), in 1992. Though the design life of JERS-1 was 2 years, the satellite obtained observational data for more than 6 years before the mission ended in 1998. Following JERS-1, the Advanced Land Observing Satellite (ALOS) was launched in 2006. ALOS was equipped with three sensors: the Phased Array type L-band SAR (PALSAR), the Panchromatic Remote-sensing Instrument for Stereo Mapping (PRISM), and the Advanced Visible and Near Infrared Radiometer type 2 (AVNIR-2). ALOS observation data have been used in various areas, including disaster mitigation (observing regions damaged by earthquakes, tsunamis, or typhoons), forest monitoring, natural environment maintenance, agriculture, and the compilation of a 1:25,000 topographical map. When the Great East Japan Earthquake hit Japan in 2011, ALOS took some 400 images over the disaster-stricken areas to provide information to all parties concerned.

The technologies developed for ALOS were carried forward to the second Advanced Land Observing Satellite, ALOS-2, which was successfully launched on 24 May 2014. The mission sensor of ALOS-2 is the Phased Array type L-band Synthetic Aperture Radar-2 (PALSAR-2), a state-of-the-art L-band SAR system. Since the successful completion of its initial checkout after launch, ALOS-2 has contributed to a large number of emergency observations of natural disasters, not only in Japan but also worldwide. Furthermore, based on the Basic Observation Scenario (BOS) of ALOS-2, 10 m global map data and data in other modes are routinely collected and archived. This paper describes the results of ALOS-2 operation in its nominal operation phase and outlines future ALOS-series missions, especially ALOS-4, scheduled for launch in JFY2022.

How to cite: Sobue, S., Tadono, T., Miura, S., Noda, A., Motooka, T., and Ohki, M.: Japan's L-SAR missions, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-534, https://doi.org/10.5194/egusphere-egu21-534, 2021.

EGU21-8660 | vPICO presentations | ESSI2.10

Combining Sentinel-1 and ALOS-2 observations for soil moisture retrieval

Anna Balenzano, Giuseppe Satalino, Francesco Lovergine, Davide Palmisano, Francesco Mattia, Michele Rinaldi, and Carsten Montzka

One of the limitations of presently available Synthetic Aperture Radar (SAR) surface soil moisture (SSM) products is their moderate temporal resolution (e.g., 3-4 days), which is not optimal for several applications, as most user requirements point to a temporal resolution of 1-2 days or less. A possible path to tackle this issue is to coordinate multi-mission SAR acquisitions, with a view to the future Copernicus Sentinel-1 (C&D and Next Generation) and L-band Radar Observation System for Europe (ROSE-L) missions.

In this respect, the recent agreement between the Japanese (JAXA) and European (ESA) Space Agencies on the use of SAR satellites in Earth science and applications provides a framework to develop and validate multi-frequency and multi-platform SAR SSM products. In 2019 and 2020, to support insights into the interoperability between C- and L-band SAR observations for SSM retrieval, systematic Sentinel-1 and ALOS-2 acquisitions were gathered over the TERENO (Terrestrial Environmental Observatories) Selhausen (Germany) and Apulian Tavoliere (Italy) cal/val sites. Both sites are well documented and equipped with hydrologic networks.

The objective of this study is to investigate the integration of multi-frequency SAR measurements for a consistent and harmonized SSM retrieval through the error characterization of a combined C- and L-band SSM product. To this end, time series of Sentinel-1 IW and ALOS-2 FBD data acquired over the two sites will be analysed. The short time change detection (STCD) algorithm, developed, implemented and recently assessed on Sentinel-1 data [e.g., Balenzano et al., 2020; Mattia et al., 2020], will be tailored to the ALOS-2 data. The time series of SAR SSM maps from each SAR system will then be derived separately and aggregated into an interleaved SSM product. Furthermore, it will be compared against in situ SSM data systematically acquired by the ground stations deployed at both sites. The study will assess the interleaved SSM product and evaluate the homogeneity in quality of the C- and L-band SAR SSM maps.
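
A minimal sketch of the interleaving step with synthetic dates and values: the two single-sensor SSM retrieval series are simply concatenated and time-sorted into one combined product.

```python
# Sketch: interleave two single-sensor SSM time series into one product.
# Dates and values are synthetic.
import pandas as pd

s1 = pd.Series([0.21, 0.25, 0.19],
               index=pd.to_datetime(["2020-05-01", "2020-05-07", "2020-05-13"]),
               name="ssm")   # Sentinel-1 C-band SSM retrievals
a2 = pd.Series([0.23, 0.18],
               index=pd.to_datetime(["2020-05-04", "2020-05-10"]),
               name="ssm")   # ALOS-2 L-band SSM retrievals

interleaved = pd.concat([s1, a2]).sort_index()   # combined C-/L-band product
print(interleaved)
```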

References

Balenzano, A., et al., “Sentinel-1 soil moisture at 1 km resolution: a validation study”, submitted to Remote Sensing of Environment (2020).

Mattia, F., A. Balenzano, G. Satalino, F. Lovergine, A. Loew, et al., “ESA SEOM Land project on Exploitation of Sentinel-1 for Surface Soil Moisture Retrieval at High Resolution,” final report, contract number 4000118762/16/I-NB, 2020.

How to cite: Balenzano, A., Satalino, G., Lovergine, F., Palmisano, D., Mattia, F., Rinaldi, M., and Montzka, C.: Combining Sentinel-1 and ALOS-2 observations for soil moisture retrieval, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-8660, https://doi.org/10.5194/egusphere-egu21-8660, 2021.

With the upcoming L-band Synthetic Aperture Radar (SAR) satellite mission Radar Observing System for Europe at L-band (ROSE-L) and its combination with existing C-band satellite missions such as Sentinel-1, multi-frequency SAR observations with high temporal and spatial resolution will become available. To investigate their potential for estimating soil and plant parameters, the SARSense campaign was conducted between June and August 2019 at the agricultural test site Selhausen in Germany. Here, we introduce a new publicly available, extensive SAR dataset and present a first analysis of C- and L-band co- and cross-polarized backscattering signals regarding their sensitivity to soil and plant parameters. The analysis includes C- and L-band airborne recordings as well as Sentinel-1 and ALOS-2 acquisitions, accompanied by in-situ soil moisture measurements and plant samplings. In addition, soil moisture was measured using cosmic-ray neutron sensing, and unmanned aerial system (UAS) based multispectral and temperature measurements were taken during the campaign period.

A first analysis of the dataset revealed that, due to misalignments of corner reflectors during the SAR acquisitions, the temporal consistency of the airborne SAR data is not given. A scene-based, spatial analysis of backscatter behaviour was therefore conducted for the airborne SAR data, while the spaceborne SAR data enabled the analysis of temporal changes in backscatter behaviour. Focusing on root crops with radial canopy structure (sugar beet and potato) and cereal crops with elongated canopy structure (wheat, barley), the lowest correlations are observed between the backscattering signal and soil moisture, with R² values below 0.35 at C-band and below 0.36 at L-band. Higher correlations are observed for vegetation water content, with R² values ranging between 0.12 and 0.64 at C-band and between 0.06 and 0.64 at L-band. Regarding plant height, higher correlations with R² up to 0.55 are seen at C-band, compared to R² up to 0.36 at L-band. Looking at the individual agricultural crops in more detail, in almost all cases the backscatter signals of C- and L-band contain a different amount of information about the soil and plant parameters, indicating that a multi-frequency approach is needed to disentangle soil and plant contributions to the signal and to identify specific scattering mechanisms related to the crop type, especially regarding the different characteristics of root crops and cereals.
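
As an illustration of the kind of sensitivity analysis reported above, the following sketch computes R² between backscatter and vegetation water content per band; all values are synthetic, not the campaign data.

```python
# Sketch: R² between backscatter and a plant parameter, per frequency band.
# All values are synthetic stand-ins for the campaign measurements.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)
vwc = rng.uniform(0.5, 4.0, 50)                       # vegetation water content (kg m-2)
sigma0_c = -14 + 2.0 * vwc + rng.normal(0, 1.5, 50)   # C-band cross-pol backscatter (dB)
sigma0_l = -20 + 1.2 * vwc + rng.normal(0, 2.0, 50)   # L-band cross-pol backscatter (dB)

for band, sigma0 in [("C", sigma0_c), ("L", sigma0_l)]:
    r = linregress(vwc, sigma0)
    print(f"{band}-band: R2 = {r.rvalue**2:.2f}")
```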

How to cite: Mengen, D. and the SARSense Campaign Team: The SARSense campaign: A dataset for comparing C- and L-band SAR backscattering behaviour to changes of soil and plant parameters in agricultural areas, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-1351, https://doi.org/10.5194/egusphere-egu21-1351, 2021.

Emission and backscattering at different frequencies have varied responses to soil physical processes (e.g., moisture redistribution, freeze-thaw) and to vegetation growth and senescence. Combining active and passive microwave multi-frequency signals may provide complementary information, which can be used to better retrieve soil moisture, vegetation biomass and water content for ecological applications. To this purpose, the Community Land Active Passive Microwave Radiative Transfer Modelling Platform (CLAP) was adopted in this study to simulate both emission (TB) and backscatter (σ0). CLAP is built on the TorVergata model for modelling vegetation scattering, and on an air-to-soil transition model (ATS), accounting for surface dielectric roughness, integrated with the Advanced Integral Equation Model (AIEM) for modelling soil surface scattering. The accuracy of CLAP was assessed with both ground-based and spaceborne measurements, the former from the microwave radiometer/scatterometer observatory deployed at the Maqu site on an alpine meadow on the Tibetan Plateau. Specifically, for the passive case, simulated TB (emissivity multiplied by effective temperature) was compared to ground-based ELBARA-III L-band observations, as well as to C-band Advanced Microwave Scanning Radiometer 2 (AMSR2) and L-band Soil Moisture Active Passive (SMAP) observations. For the active case, simulated σ0 was compared to ground-based scatterometer C- and L-band observations, and to C-band Sentinel-1 and L-band Phased Array type L-band Synthetic Aperture Radar 2 (PALSAR-2) observations. This study is expected to contribute to improving soil moisture retrieval accuracy for dedicated microwave sensor configurations.
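
CLAP itself couples the TorVergata and ATS/AIEM models; as a much simpler, generic illustration of how soil and vegetation terms combine in passive microwave emission modelling, the following sketches the widely used zeroth-order tau-omega model with illustrative parameter values (this is not CLAP's formulation).

```python
# Sketch of the zeroth-order tau-omega emission model (illustrative only).
import numpy as np

def tau_omega_tb(e_soil, t_soil, t_canopy, tau, omega, theta_deg):
    """Brightness temperature (K) over a vegetated surface, tau-omega model."""
    gamma = np.exp(-tau / np.cos(np.radians(theta_deg)))   # canopy transmissivity
    r_soil = 1.0 - e_soil                                  # soil reflectivity
    tb_soil = e_soil * t_soil * gamma                      # attenuated soil emission
    tb_veg = t_canopy * (1 - omega) * (1 - gamma) * (1 + r_soil * gamma)
    return tb_soil + tb_veg

# Example: L-band-like setting for a moist alpine meadow (illustrative values)
print(tau_omega_tb(e_soil=0.7, t_soil=285.0, t_canopy=284.0,
                   tau=0.12, omega=0.05, theta_deg=40.0))
```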

How to cite: Zhao, H., Zeng, Y., Su, B., and Hofste, J.: Modelling of Microwave Multi-Frequency Emission and Backscatter by a Community Land Active Passive Microwave Radiative Transfer Modelling Platform (CLAP), EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-6659, https://doi.org/10.5194/egusphere-egu21-6659, 2021.

EGU21-16539 | vPICO presentations | ESSI2.10

MAFIS-Multiple Actors Forest Information System: EO+AI+ODC for scalable forest monitoring services

Marcello Maranesi, Matteo Picchiani, Chiara Clementini, Fabio Salbitano, Marco Marchetti, Fabio Del Frate, Gherardo Chirici, Jaro Hofierka, Remo Bertani, Pietro Maroè, Julia Kubanek, and Stefano Ferretti

The Mission of GMATICS is to offer systematic monitoring services based on Earth Observation data, Artificial Intelligence techniques and Open Data Cube architectures.

After the development of two initial services, GMATICS is now focusing on forest monitoring through the ESA funded project MAFIS-Multiple Actors Forest Information System.

What is MAFIS? MAFIS performs a systematic monitoring of forests in natural environments as well as of forests and green areas in urban environments.

What is MAFIS composed of? For the natural environment, we use time series of multi-mission satellite data (multispectral, SAR, hyperspectral, and VHR) and in-situ surveys, while for forests and green areas in urban environments we also use other geospatial data from aerial orthophotos, LIDAR sensing, drone surveys and specialized in-situ measurements. All kinds of data are organized within an Open Data Cube architecture and are processed and integrated using various AI techniques. We also use a forest growth model exploiting extensive meteorological data, and we make the MAFIS service accessible through a Web-GIS platform, enabling customer access from desktop and mobile devices.

What are the MAFIS outputs? A set of information layers suitable for different potential users: main tree species classification, identification of forest clear-cuts and selective cuttings, detection of disturbances due to forest fires, diseases or windstorms, estimation of Above Ground Biomass (AGB) gain and losses, detailed urban and peri-urban green area assessment for planning purpose, estimation and spatial assessment of various ecosystems services (carbon sequestration, pollutant removal, thermal comfort, pollen risks, etc.), monitoring of tree status for maintenance actions identification and prioritization.

Who are MAFIS potential users? Ministries of agriculture and environment, Local Administrations, wood-chain industry, municipalities, architects and urban planners, tree care and nursery companies, multiutility companies, International Organizations, Universities and Research Centres (forests, ecology, architecture).

How to cite: Maranesi, M., Picchiani, M., Clementini, C., Salbitano, F., Marchetti, M., Del Frate, F., Chirici, G., Hofierka, J., Bertani, R., Maroè, P., Kubanek, J., and Ferretti, S.: MAFIS-Multiple Actors Forest Information System: EO+AI+ODC for scalable forest monitoring services, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-16539, https://doi.org/10.5194/egusphere-egu21-16539, 2021.

EGU21-4452 | vPICO presentations | ESSI2.10

Flood monitoring in remote areas: integration of multi-frequency SAR data

Alberto Refice, Annarita D'Addabbo, Marco Chini, and Marina Zingaro

The monitoring of inundation phenomena over vegetated areas through synthetic aperture radar (SAR) data can be improved through an integrated analysis of different spectral bands. The combination of data with different penetration depths beneath the vegetation canopy can help determine the response of flooded areas with distinct types of vegetation cover to the microwave signal. This is especially useful in cases, which actually constitute the majority, where ground data are scarce or unavailable.

The present study concerns the application of multi-temporal, multi-frequency, and multi-polarization SAR images, specifically data from the Sentinel-1 and PALSAR-2 SAR sensors, operating in C-band, VV polarization, and L-band, HH and HV polarizations, respectively, in synergy with globally available land cover data, to improve flood mapping in densely vegetated areas such as the Zambezi-Shire basin, Mozambique [1], characterized by wetlands, open and closed forest, cropland, grassland (herbaceous and shrubs), and a few urban areas.

We show how the combination of various data processing techniques and the simultaneous availability of data at different frequencies and polarizations can help to monitor floodwater evolution over various land cover classes. They also enable the detection of different scattering mechanisms, such as the double-bounce interaction of vegetation stems and trunks with the underlying floodwater, providing valuable information about the distribution of flooded areas among the different ground cover types present on the site.
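
As a toy illustration of such multi-frequency decision logic (not the algorithm used in the study), the following sketch labels pixels using illustrative thresholds: low C-band VV suggests open water, while enhanced L-band HH suggests double-bounce scattering from flooded vegetation.

```python
# Sketch: per-pixel flood labelling from combined C- and L-band backscatter.
# Thresholds and values are illustrative, not those of the study.
import numpy as np

c_vv = np.array([-21.0, -8.5, -12.0])   # C-band VV backscatter (dB), synthetic
l_hh = np.array([-18.0, -3.0, -10.0])   # L-band HH backscatter (dB), synthetic

flood_class = np.full(c_vv.shape, "dry", dtype=object)
flood_class[c_vv < -18.0] = "open water"                             # smooth water surface
flood_class[(c_vv >= -18.0) & (l_hh > -5.0)] = "flooded vegetation"  # double bounce
print(flood_class)
```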

Studies of this kind are expected to become increasingly important as the availability of multi-frequency data from SAR satellite constellations grows, thanks to initiatives such as the EU Copernicus program's L-band satellite mission ROSE-L [2], and to their tight integration with Sentinel-1 as well as with other national constellations such as ALOS-2 or SAOCOM.

References

[1] Refice, A.; Zingaro, M.; D’Addabbo, A.; Chini, M. Integrating C- and L-Band SAR Imagery for Detailed Flood Monitoring of Remote Vegetated Areas. Water 2020, 12, 2745, doi:10.3390/w12102745.

[2] Pierdicca, N.; Davidson, M.; Chini, M.; Dierking, W.; Djavidnia, S.; Haarpaintner, J.; Hajduch, G.; Laurin, G.V.; Lavalle, M.; López-Martínez, C.; et al. The Copernicus L-band SAR mission ROSE-L (Radar Observing System for Europe). In Active and Passive Microwave Remote Sensing for Environmental Monitoring III; SPIE: Washington, DC, USA, 2019; Volume 11154, p. 13.

How to cite: Refice, A., D'Addabbo, A., Chini, M., and Zingaro, M.: Flood monitoring in remote areas: integration of multi-frequency SAR data, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-4452, https://doi.org/10.5194/egusphere-egu21-4452, 2021.

EGU21-7831 | vPICO presentations | ESSI2.10

Detecting rice inundation status for water saving and methane emission mitigation measures using Sentinel-1 & ALOS-2/PALSAR-2 Data

Hironori Arai, Thuy Le Toan, Wataru Takeuchi, Kei Oyoshi, Hoa Phan, Lam Dao Nguyen, Tamon Fumoto, and Kazuyuki Inubushi

Approximately 90% of the world's total paddy area and annual rice production are concentrated in monsoon Asia, which has no more land or water resources for further expansion of cultivation. Most rice grows under lowland conditions, where production currently faces freshwater scarcity due to sea-water intrusion, accelerated by sea-level rise and land subsidence, and due to decreasing freshwater supply caused by upstream dam construction. Since rice production also requires large amounts of water (3,000-5,000 L kg-1 rice), water-saving irrigation practices (e.g., Alternate Wetting and Drying, AWD) are desirable in this region to reduce water demand sustainably, and the irrigation status needs to be evaluated for decision making on sustainable food security. In addition to the significance of AWD's role as an adaptation to drought risks, AWD also has the potential to act as an important mitigation measure by reducing methane emission from paddy soils. This function is very important since rice cropping is responsible for approximately 11% of global anthropogenic CH4 emissions, and rice has the highest greenhouse gas intensity among the main food crops.

In order to implement AWD in Asian rice paddies as a mitigation measure based on a carbon pricing scheme, it is important to evaluate the spatial distribution of AWD paddy fields in the target region. For the detection of AWD fields versus continuously flooded fields, it is essential to develop methods using EO data to detect soil inundation under rice plants at various growth stages. In this study, ALOS-2/PALSAR-2 and Sentinel-1 data were used to combine the penetration capacity of L-band SAR data with the capacity of C-band data to monitor rice growth status at high temporal resolution.

The study was conducted in triple rice cropping systems in the Vietnamese Mekong delta (5 sites: Thot Not in Can Tho city; Chau Thanh, Cho Moi, Thoai Son and Tri Ton in An Giang Province), where an AWD field campaign was conducted from 2012 to 2017. EO data consisted of ALOS-2/PALSAR-2 acquisitions every 14 days in 2017/2018 over An Giang Province in high-resolution observation mode (3-6 m resolution), and ScanSAR observation mode (25-100 m resolution) acquisitions every 42 days over the Mekong delta.

As a result of the classification using the dual-polarization ALOS-2/PALSAR-2 data, soil inundation status could be detected during various rice growth stages. To evaluate rice productivity and GHG emissions from rice fields, we developed a simulation system based on the DeNitrification-Decomposition (DNDC) model, which can assimilate the PALSAR-2 inundation maps together with ground-observed GHG flux and rice growth data on a pixel basis. For spatial extension, the rice map and the rice calendar (sowing date, rice growth status) required as inputs by DNDC are provided by the GeoRice project, based on the use of Sentinel-1 6-day time series. This paper presents the performance of multi-sensor data fusion for realizing sustainable agricultural management by mitigating GHG emissions while maintaining or improving regional freshwater use efficiency for stable food production under climate change pressure.
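
A minimal sketch, with synthetic flags, of how per-date inundation classifications could be aggregated to separate AWD from continuously flooded fields; the threshold and the 14-day flag series are hypothetical, not the study's classifier output.

```python
# Sketch: separate AWD from continuous flooding once a per-date soil
# inundation flag per field is available. Flags and threshold are synthetic.
import numpy as np

# One row per field, one column per 14-day acquisition (True = inundated)
inundated = np.array([
    [True, True, True, True, True, True],     # continuously flooded field
    [True, False, True, False, True, False],  # alternate wetting and drying
])

flooded_fraction = inundated.mean(axis=1)
practice = np.where(flooded_fraction > 0.9, "continuous flooding", "AWD")
print(list(zip(flooded_fraction, practice)))
```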

How to cite: Arai, H., Le Toan, T., Takeuchi, W., Oyoshi, K., Phan, H., Nguyen, L. D., Fumoto, T., and Inubushi, K.: Detecting rice inundation status for water saving and methane emission mitigation measures using Sentinel-1 & ALOS-2/PALSAR-2 Data, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-7831, https://doi.org/10.5194/egusphere-egu21-7831, 2021.

The split-spectrum method (SSM) can largely isolate and correct for the ionospheric contribution in L-band interferometric synthetic aperture radar (InSAR). The standard SSM is performed under the assumption of only the first-order ionospheric dispersive effect, which is proportional to the total electron content (TEC). It is also known that during extreme atmospheric events, originating either in the ionosphere or in the troposphere, other dispersive effects do exist and can potentially provide new insights into the dynamics of the atmosphere, but there have been few reports of such signals detected by InSAR.

We apply L-band InSAR to heavy rain cases and examine the applicability and limitations of the standard SSM. Since no deformation-causing events such as earthquakes took place, the non-dispersive component is apparently attributable to the large amount of water vapor associated with the heavy rain, whereas there are spotty anomalies in the dispersive component that are closely correlated with the heavy rain area. The ionosonde data and the Global Navigation Satellite System (GNSS) rate of total electron content index (ROTI) map both show few anomalies during the heavy rain, which suggests few ionospheric disturbances. Therefore, we interpret that the spotty anomalies in the dispersive component of the standard SSM during heavy rain originate not in the ionosphere but in the troposphere. Of the two physical mechanisms we consider, a runaway electron avalanche and scattering due to rain, a comparison with observations from a ground-based lightning detection network and rain gauge data leads us to conclude that the rain-scattering interpretation is spatiotemporally favorable.

We further propose a formulation to examine whether a dispersive phase other than the first-order TEC effect is present and apply it to the heavy rain cases as well as to two extreme ionospheric sporadic-E events. Our formulation successfully isolates the presence of another dispersive phase during heavy rain that is positively correlated with the local rain rate. Furthermore, our formulation is also able to detect the occurrence of higher-order ionospheric effects during the sporadic-E cases.
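
For reference, the standard first-order split-spectrum decomposition (cf. Gomba et al., 2016) can be sketched as follows; the sub-band frequencies and phases below are synthetic scalars, whereas in practice the inputs are unwrapped sub-band interferograms.

```python
# Sketch of the standard split-spectrum decomposition: the phases of two range
# sub-band interferograms are split into a dispersive (ionospheric, ~1/f) and
# a non-dispersive part. All input values are synthetic.
import numpy as np

f0 = 1.2575e9                 # L-band centre frequency (Hz), ALOS-2-like
fl, fh = 1.2435e9, 1.2715e9   # lower/upper sub-band centre frequencies (assumed)

phi_l, phi_h = 12.31, 12.55   # unwrapped sub-band phases (rad), synthetic

phi_iono = (fl * fh) / (f0 * (fh**2 - fl**2)) * (phi_l * fh - phi_h * fl)
phi_nondisp = f0 / (fh**2 - fl**2) * (phi_h * fh - phi_l * fl)

print(f"dispersive: {phi_iono:.3f} rad, non-dispersive: {phi_nondisp:.3f} rad")
```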

How to cite: Setiawan, N. and Furuya, M.: Tropospheric Dispersive Phase Anomalies during Heavy Rain Detected by L-band InSAR and Their Interpretation, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-13944, https://doi.org/10.5194/egusphere-egu21-13944, 2021.

This research is related to the JAXA 6th Research Announcement for the Advanced Land Observing Satellite-2 (ALOS-2) project "Improved Sea Ice Parameter Estimation with L-Band SAR (ISIPELS)". In the study, ALOS-2/PALSAR-2 dual-polarized Horizontal-transmit-Horizontal-receive/Horizontal-transmit-Vertical-receive (HH/HV) ScanSAR mode L-band Synthetic Aperture Radar (SAR) imagery over an Arctic study area was evaluated for its suitability for operational sea ice monitoring. The SAR data, consisting of about 140 HH/HV ScanSAR ALOS-2/PALSAR-2 images, were acquired during the winter of 2017. These L-band SAR data were studied for the estimation of different sea ice parameters: sea ice concentration, thickness, type, and drift. Some comparisons with nearly coincident C-band data over the same study area have also been made. The results indicate that L-band SAR data from ALOS-2/PALSAR-2 are very useful for estimating the studied sea ice parameters and are equally good as or better than the conventional operational dual-polarized C-band SAR satellite data.

How to cite: Karvonen, J.: ALOS-2/PALSAR-2 dual-polarized L-band data for sea ice parameter estimation and sea ice classification, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-3920, https://doi.org/10.5194/egusphere-egu21-3920, 2021.

EGU21-2892 | vPICO presentations | ESSI2.10 | Highlight

High spatial and temporal resolution L- and C-band Synthetic Aperture Radar data analysis from the yearlong MOSAiC expedition

Malin Johansson, Suman Singha, Gunnar Spreen, Stephen Howell, Shin-ichi Sobue, and Malcolm Davidson

During the yearlong MOSAiC expedition (2019-2020), R/V Polarstern drifted with the sea ice through the Arctic Ocean, with the goal of continually monitoring changes in the coupled ocean-ice-atmosphere system throughout the seasons. A substantial number of synthetic aperture radar (SAR) satellite images overlapping the campaign were collected. Here, we investigate the change in polarimetric features over sea ice from freeze-up to the advanced melt season, using fully polarimetric L-band images from the ALOS-2 PALSAR-2 and fully polarimetric C-band images from the RADARSAT-2 satellite SAR sensors.

Three different sea ice types are investigated: young ice, level first-year ice, and deformed first- and second-year ice. Areas of deformed and level sea ice were observed in the vicinity of R/V Polarstern, and these areas are included whenever possible in the yearlong time series.

Comparing the different sea ice types, we observe that during the freezing season there is a larger difference in the co-polarization channels between smooth and deformed ice at L-band than at C-band. Similar to earlier findings, we observe larger differences between young ice and deformed ice backscatter values in the L-band data than in the C-band data. Moreover, throughout the year the HV backscatter values show larger differences between level and deformed sea ice at L-band than at C-band. The L-band data variability is significantly smaller for level sea ice than for deformed sea ice, and this variability is also smaller than that observed in the overlapping C-band data. L-band data could thus be more suitable for reliably separating deformed from level sea ice areas.

Within the L-band images, a noticeable shift towards higher backscatter values in the early melt season compared to the freezing season is observed for all polarimetric channels, though no such strong trend is found in the C-band data. The change in backscatter values is first noticeable in the C-band images and is later followed by a change in the L-band images, probably caused by the different penetration depths and volume scattering sensitivities of the two frequencies. This change also results in a smaller backscatter variability.

The polarization difference (PD; VV-HH on a linear scale) shows a seasonal dependency for smooth and deformed sea ice in the L-band data, whereas no such trend is observed in the C-band data. For the L-band data, the PD variability of all ice classes was reasonably small during the freezing season, with a significant shift towards larger variability during the early melt season, although the mean PD values remained similar during this period. Once temperatures rose above 0°C, however, both the variability and the mean values increased significantly.
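
A minimal sketch of the PD computation with synthetic values: backscatter given in dB is converted to the linear scale before differencing.

```python
# Sketch: polarization difference (PD) = VV - HH on a linear scale,
# from backscatter given in dB. Values are synthetic.
import numpy as np

def db_to_linear(db):
    return 10.0 ** (db / 10.0)

sigma_vv_db = np.array([-14.0, -10.5])   # e.g. level ice, deformed ice (synthetic)
sigma_hh_db = np.array([-16.0, -11.0])

pd_linear = db_to_linear(sigma_vv_db) - db_to_linear(sigma_hh_db)
print(pd_linear)
```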

Overall, our results demonstrate that C- and L-band data are complementary to one another and that, through their slightly different dependencies on season and sea ice type, a combination of the two frequencies can aid improved sea ice classification. The availability of a high spatial and temporal resolution dataset combined with in-situ information ensures that seasonal changes can be fully explored.

How to cite: Johansson, M., Singha, S., Spreen, G., Howell, S., Sobue, S., and Davidson, M.: High spatial and temporal resolution L- and C-band Synthetic Aperture Radar data analysis from the yearlong MOSAiC expedition, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-2892, https://doi.org/10.5194/egusphere-egu21-2892, 2021.

Synthetic aperture radar (SAR) has become an essential component of ocean remote sensing due to its high sensitivity to sea surface dynamics and its high spatial resolution. ALOS-2 SAR data are underutilized for ocean surface wind and current retrieval. Although the primary goals of the ALOS-2 mission are focused on land applications, the extension of the satellite scenes over coastal areas offers an opportunity for ocean applications. The underutilization of ALOS-2 data is mainly due to the fact that at low radar frequencies, e.g. L-band, the sensitivity of the radar scattering coefficient to wind speed and the sensitivity of the Doppler frequency shift to sea surface velocity are lower than at higher frequencies, e.g. C- and X-band. It is also due to the fact that most ALOS-2 images are acquired in HH or HV polarization, while VV polarization is often preferred in ocean applications due to its higher signal-to-noise ratio.

The wind speed is retrieved from Sentinel-1 and ALOS-2 using the existing empirical C- and L-band geophysical model functions. For Sentinel-1, the Doppler frequency shift provided in the OCN product is used. For ALOS-2, the Doppler frequency shift is estimated from the single look complex data using the pulse-pair processing method. The estimated Doppler shift is converted to surface radial velocity, and the velocity is calibrated using land as a reference. The estimated L-band Doppler shift and surface velocity are compared to the C-band Doppler shift provided in the Sentinel-1 OCN product. Due to the difference in the local time of ascending node of the two satellites (about 6 hours at the equator), a direct pixel-by-pixel comparison is not possible, i.e. the wind and surface current cannot be assumed constant over such a large time difference. Thus, the wind retrieved from each sensor is compared separately to model data and in-situ observations.
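
A hedged sketch of pulse-pair Doppler centroid estimation on a synthetic azimuth line: the Doppler shift is obtained from the phase of the lag-one autocorrelation of consecutive azimuth samples, scaled by the PRF. The PRF and signal below are invented for illustration.

```python
# Sketch: pulse-pair Doppler centroid estimation from SLC azimuth samples.
# The PRF and the synthetic signal are illustrative values.
import numpy as np

prf = 2000.0                   # pulse repetition frequency (Hz), assumed
rng = np.random.default_rng(0)
fd_true = 60.0                 # synthetic Doppler shift (Hz)
n = 1024
slc = np.exp(2j * np.pi * fd_true * np.arange(n) / prf)      # azimuth line
slc += 0.3 * (rng.normal(size=n) + 1j * rng.normal(size=n))  # additive noise

r1 = np.mean(slc[1:] * np.conj(slc[:-1]))   # lag-one autocorrelation
fd_est = prf * np.angle(r1) / (2 * np.pi)   # pulse-pair Doppler estimate (Hz)
print(f"estimated Doppler shift: {fd_est:.1f} Hz")
```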

In this paper, the quality of the wind speed retrieved from L-band SAR (ALOS-2) in coastal areas is assessed and compared to that from C-band SAR (Sentinel-1). In addition, the feasibility of surface current retrieval from the L-band Doppler frequency shift is investigated and likewise compared to Sentinel-1. Examples will be shown and discussed. This opens an opportunity for synergy between L-band and C-band SAR missions to increase the spatial and temporal coverage, which is one of the main limitations of SAR applications in ocean remote sensing.

How to cite: Elyouncha, A. and Eriksson, L. E. B.: Assessment of the sea surface wind and current retrieval from ALOS-2 and Sentinel-1 SAR data over coastal areas, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-7490, https://doi.org/10.5194/egusphere-egu21-7490, 2021.

EGU21-3916 | vPICO presentations | ESSI2.10 | Highlight

Use of L- and C-Band SAR Satellites for Sea Ice and Iceberg Monitoring (LC-ICE)

Wolfgang Dierking and Malcolm Davidson

In support of ESA's Mission Advisory Group for ROSE-L (Radar Observing System for Europe at L-band), a project team consisting of members of operational ice services, the International Ice Charting Working Group, the International Ice Patrol, and groups from universities and research institutes is investigating the benefits of using data from L-band SAR in addition to C-band SAR imagery for separating different sea ice classes and detecting icebergs. The tasks are: (1) a critical assessment of the current state of the art in sea ice monitoring and iceberg detection, (2) matching C- and L-band SAR images acquired with temporal gaps of several hours, (3) tests and assessments of the practical use of L-band images in operational mapping services, and (4) comparison of the classification accuracies that can be achieved at C-band, at L-band, and with a combination of both, based on the results of automated segmentation and classification algorithms. Based on the suggestions of operational ice centers, data have been collected since April 2019 over six test sites in the Northern Hemisphere: Fram Strait, Belgica Bank, the northern and southern parts of Greenland, Baffin Bay and the Labrador Sea. The SAR images are acquired by Sentinel-1 in Extra Wide and Interferometric Wide Swath modes, by RADARSAT-2 in ScanSAR mode, and by ALOS-2 PALSAR-2 in Wide Beam and Fine Beam modes. The PALSAR-2 data are provided through the 2019-2022 mutual cooperation project between ESA and JAXA on using SAR data in Earth sciences and applications. The presentation - with contributions from project partners - will focus on the conclusions of the literature review, the assessments by the operational ice services of the gain they find in using L-band SAR images supplementary to routinely analyzed C-band imagery, and preliminary results of automated classification.

How to cite: Dierking, W. and Davidson, M.: Use of L- and C-Band SAR Satellites for Sea Ice and Iceberg Monitoring (LC-ICE), EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-3916, https://doi.org/10.5194/egusphere-egu21-3916, 2021.

EGU21-11066 | vPICO presentations | ESSI2.10

Snow Water Equivalent retrieval using L- & C-band InSAR.

Jorge Jorge Ruiz, Juha Lemmetyinen, Anna Kontu, and Jouni Pulliainen

Interferometric Synthetic Aperture Radar (InSAR) imagery is a promising technique for retrieving Snow Water Equivalent (SWE). It exploits the dependence of the interferometric phase on the amount and density of snow in the radar signal path, which leads to a quasi-linear relation between phase and SWE (Guneriussen et al., 2001; Leinss et al., 2015). Here, we analyze time series of Sentinel-1 and ALOS-2 interferometric image pairs collected over a test site in Sodankylä, northern Finland, during the winter of 2019-2020. The satellite imagery is complemented by tower-based SAR observations using SodSAR (Sodankylä SAR), a 1-10 GHz fully polarimetric SAR instrument. Typical satellite revisit times (7 and 14 days) are compared with the 12-hour temporal resolution provided by SodSAR. Interferometric pairs from the three sensors are generated, and the interferograms are used to estimate the increase in SWE between image acquisitions. The retrieved SWE is compared with measurements from an in-situ SWE scale as well as with manual ground observations made in the area. Coherence conservation and its relation to various meteorological events are also analyzed.
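
A minimal sketch of the phase-to-SWE conversion, assuming the commonly quoted dry-snow approximation of Guneriussen et al. (2001), Δφ = -2k(1.59 + θ^2.5)·ΔSWE; the wavelengths, incidence angle and phase values below are illustrative, not the study's retrieval.

```python
# Sketch: change in SWE from an unwrapped repeat-pass phase change, using the
# dry-snow approximation of Guneriussen et al. (2001). Values are synthetic.
import numpy as np

def dswe_from_phase(dphi_rad, wavelength_m, inc_angle_deg):
    """Change in SWE (m) from an unwrapped repeat-pass phase change (rad)."""
    k = 2 * np.pi / wavelength_m
    theta = np.radians(inc_angle_deg)
    return -dphi_rad / (2 * k * (1.59 + theta**2.5))

# Example: Sentinel-1 C-band (5.55 cm) vs ALOS-2 L-band (22.9 cm), same phase
for name, lam in [("C-band", 0.0555), ("L-band", 0.229)]:
    dphi = -2.0   # synthetic unwrapped phase change (rad)
    print(name, f"dSWE = {dswe_from_phase(dphi, lam, 40.0) * 100:.2f} cm")
```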

How to cite: Jorge Ruiz, J., Lemmetyinen, J., Kontu, A., and Pulliainen, J.: Snow Water Equivalent retrieval using L- & C-band InSAR., EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-11066, https://doi.org/10.5194/egusphere-egu21-11066, 2021.

Millions of lakes and ponds occupy large areas of the Arctic discontinuous and continuous permafrost zones. During most of the year, the surfaces of these lakes remain covered by a thick layer of ice. Synthetic Aperture Radar (SAR) data have been shown to be useful for studying the ice on Arctic lakes, especially for monitoring lake ice phenology and the grounding state of the ice (ice frozen to the lakebed versus floating lake ice). Significant backscatter is often observed from the floating ice regime at C-band due to scattering at a rough ice-water interface.

Recent research has revealed features of anomalously low backscatter in Sentinel-1 C-band SAR imagery on some of the West Siberian lakes that likely belong to the floating ice regime. These anomalies are characterized by prominent shapes and sizes and seem to expand throughout late winter and/or spring. It is currently assumed that some of these features are related to strong emissions of natural gas (methane from hydrocarbon reservoirs), making it important to assess their origin in detail and understand the associated mechanisms. However, in-situ data are still missing.

Here, we assess the potential of the combined use of C-band Sentinel-1 (freely available) and L-band ALOS PALSAR-2 data (available through JAXA PI agreement #3068002) to study the backscatter anomalies. We highlight the differences between the observed backscatter from the two sensors with respect to different surface types (ground-fast lake ice, floating lake ice and anomalies) and investigate backscatter differences between frozen and melting conditions. Further, polarimetric classification is performed on L-band PALSAR-2 imagery, which reveals differences in scattering mechanisms between anomalies and floating lake ice.

How to cite: Pointner, G. and Bartsch, A.: The potential of using Sentinel-1 and ALOS PALSAR-2 data for characterizing West Siberian lake ice backscatter anomalies, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-11931, https://doi.org/10.5194/egusphere-egu21-11931, 2021.

EGU21-12736 | vPICO presentations | ESSI2.10

Comparison of L- and C-Band SAR data in the Saar Mining District, Germany

Andre C. Kalia, Volker Spreckels, and Thomas Lege

The interferometric use of Synthetic Aperture Radar data from L-band and C-band plays an important role in the monitoring of land surface deformations, as former evaluations have proven [1]. Meanwhile, several multi-sensor ground stations are available, equipped with bi-directional artificial corner reflectors (CR) and permanent GNSS stations, attached to fine-leveling baselines. The long wavelength of L-band SAR missions like ALOS-2 (λ = 22.9 cm) provides highly coherent interferograms, but large-sized CR are required, e.g. for absolute motion calibration. SAR missions with shorter wavelengths, like the C-band Sentinel-1 mission (λ = 5.6 cm), provide in general less coherent interferograms, but a smaller CR size is sufficient. In order to assess the capabilities of L- and C-band SAR data, the impulse response function will be calculated at the corner-reflector sites and the coherence will be estimated in rural areas of the Saar test site.
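
A minimal sketch of the coherence estimate mentioned above: the normalised complex cross-correlation of two co-registered SLC image patches, here computed on synthetic arrays.

```python
# Sketch: interferometric coherence as the magnitude of the normalised complex
# cross-correlation over an estimation window. Arrays are synthetic.
import numpy as np

rng = np.random.default_rng(3)
s1 = rng.normal(size=(64, 64)) + 1j * rng.normal(size=(64, 64))
s2 = 0.8 * s1 + 0.2 * (rng.normal(size=(64, 64)) + 1j * rng.normal(size=(64, 64)))

num = np.abs(np.sum(s1 * np.conj(s2)))
den = np.sqrt(np.sum(np.abs(s1) ** 2) * np.sum(np.abs(s2) ** 2))
print(f"coherence estimate: {num / den:.2f}")
```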

The test site is located in the Saar-Lorraine coal basin at the French-German border, a present-day post-mining district with highly urbanized settlements as well as large stretches of forested and rural areas. The area is characterized by century-long active deep mining – mainly for hard coal – including extensive groundwater management measures. Here, active coal mining started in the 18th century and ended in 2006 (Lorraine) and 2012 (Saar) [2]. Meanwhile, some of the underground mines have been progressively flooded. As a consequence, surface uplift occurred and is expected to continue in the near future [3]. For a 12 by 14 km area in the Saar district, dense and highly accurate leveling campaigns have been performed bi-annually since 2013. Thus, besides good knowledge of the subsurface geology and mining activities, precise in-situ measurements of the ground motion are also available. The recent and ongoing surface deformations will be monitored using multiple methods, including a network of CR at multi-sensor ground stations [4] and publicly accessible Persistent Scatterer Interferometry datasets from the Sentinel-1 based Ground Motion Service Germany [5].

In late 2020, the first ALOS-2 acquisitions of the Saar area from the ESA-JAXA cooperation were made available to the authors. The ALOS-2 data are evaluated and placed in relation to Sentinel-1 acquisitions. Finally, an outlook on the possible complementary use of geodetic and C- and L-band SAR data in the Saar district as well as in other mining areas in Germany is given.

[1] Wegmueller et al. 2005: Monitoring of mining induced surface deformation using L-band SAR interferometry. IGARSS 2005; DOI: 10.1109/IGARSS.2005.1526447

[2] Corbel et al. 2017: Coal mine flooding in the Lorraine-Saar basin: experience from the French mines. IMWA 2017. https://www.imwa.info/docs/imwa_2017/IMWA2017_Corbel_161.pdf

[3] Heitfeld-Schetelig 2016: Gutachten zu den Bodenbewegungen im Rahmen des stufenweisen Grubenwasseranstiegs in den Wasserprovinzen Reden und Duhamel. http://www.bid.rag.de/bid/PDFs/SA//GWA_Reden_Duhamel/3_IHS_Bodenbewegungen/IHS_Saar_Gelaendehebungen_WH_Reden_Duhamel_2016_04_20.pdf

[4] Spreckels et al. 2020: GNSS, Nivellement und Radar – einheitliche Multisensor-Standorte als Referenzpunkte zur Überwachung von Bodenbewegungen. Geomonitoring 2020. DOI: 10.15488/9351

[5] BGR, 2021: https://bodenbewegungsdienst.bgr.de

How to cite: Kalia, A. C., Spreckels, V., and Lege, T.: Comparison of L- and C-Band SAR data in the Saar Mining District, Germany, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-12736, https://doi.org/10.5194/egusphere-egu21-12736, 2021.

ESSI3.3 – The evolving Open and FAIR ecosystem for Solid Earth and Environmental sciences: challenges, opportunities, and other adventures

The major societal challenges—ensuring a sustainable planet and ecosystems, with food, energy, water, health, and quality of life provided equitably—depend on convergent science grounded in the Earth and space sciences and on broadly open, shared, and trusted (e.g., FAIR) data. Such data already provide enormous benefits (e.g., weather prediction; hazard avoidance and mitigation; precision navigation). Beyond being needed for these solutions, the integrity of and trust in science, and thus in the solutions, follow directly from open FAIR data. But many barriers hinder widespread practice and adoption. A number of concerned stakeholders are working on the technology and practices needed for FAIR workflows, and thanks to these efforts, the technical pieces for solutions are mostly in place. But a larger coordinated effort is needed, in particular around (i) supporting the infrastructure needed globally, and (ii) developing the research culture and practices needed for universal FAIR data. The first challenge includes recognizing that science is now international and thus an international FAIR data culture is essential. This requires greater and more urgent attention from the larger science stakeholders: societies, universities and research institutions, funders, and governments.

How to cite: Hanson, B.: The Imperative of Open, Shared, Trusted (FAIR) Scientific Data: Accelerating for the Future, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-13323, https://doi.org/10.5194/egusphere-egu21-13323, 2021.

EGU21-8052 | vPICO presentations | ESSI3.3 | Highlight

Advancing the FAIRness and Openness of Earth system science in Europe

Andreas Petzold, Ari Asmi, Katrin Seemeyer, Angeliki Adamaki, Alex Vermeulen, Daniele Bailo, Keith Jeffery, Helen Glaves, Zhiming Zhao, Markus Stocker, and Margareta Hellström

Focused environmental research projects and continuously operating research infrastructures (RIs) designed for monitoring all subdomains of the Earth system contribute to global observing systems and serve as crucial information sources for environmental scientists in their quest to understand and interpret the complex Earth system. The EU-funded ENVRI-FAIR project [1] builds on the Environmental Research Infrastructure (ENVRI) community, which includes the principal European producers and providers of environmental research data and services.

ENVRI-FAIR targets the development and implementation of both technical frameworks and policy solutions that make subdomain boundaries irrelevant for environmental scientists and prepare Earth system science for the new Open Science paradigm. Cross-discipline harmonization and standardization activities, together with the implementation of joint data management and access structures at the RI level, facilitate the strategic coordination of observation systems required for truly interdisciplinary science. ENVRI-FAIR will ultimately create the open access ENVRI-Hub delivering environmental data and services provided by the contributing environmental RIs.

The architecture and functionalities of the ENVRI-Hub are driven by the applications, use cases and user needs, and will be based on three main pillars: (1) the ENVRI Knowledge Base as the human interface to the ENVRI ecosystem; (2) the ENVRI Catalogue as the machine-actionable interface to the ENVRI ecosystem; and (3) subdomain and cross-domain use cases as demonstrators for the capabilities of service provision among ENVRIs and across Science Clusters. The architecture is designed in anticipation of interoperation with the European Open Science Cloud (EOSC) and is intended to act as a key platform for users and developers planning to include ENVRI services in their workflows.

The ENVRI community objectives of sharing FAIRness experience, technologies and training as well as research products and services will be realized by means of the ENVRI-Hub. The architecture, design features, technology developments and associated policies will highlight this example of how ENVRI-FAIR is promoting FAIRness, openness and multidisciplinarity of an entire scientific area by joint developments and implementation efforts.

Acknowledgment: ENVRI-FAIR has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 824068.

[1] Petzold, A., Asmi, A., Vermeulen, A., Pappalardo, G., Bailo, D., Schaap, D., Glaves, H. M., Bundke, U., and Zhao, Z.: ENVRI-FAIR - Interoperable environmental FAIR data and services for society, innovation and research, 15th IEEE International Conference on eScience 2019, 1-4, doi: http://doi.org/10.1109/eScience.2019.00038, 2019.

How to cite: Petzold, A., Asmi, A., Seemeyer, K., Adamaki, A., Vermeulen, A., Bailo, D., Jeffery, K., Glaves, H., Zhao, Z., Stocker, M., and Hellström, M.: Advancing the FAIRness and Openness of Earth system science in Europe, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-8052, https://doi.org/10.5194/egusphere-egu21-8052, 2021.

EGU21-4737 | vPICO presentations | ESSI3.3

IPCC Data Distribution Centre: FAIR data from Climate Research to Mitigation Policy

Martin Juckes, Martina Stockhause, Robert S Chen, and Xiaoshi Xing

The Data Distribution Centre (DDC) of the Intergovernmental Panel on Climate Change provides a range of services to support the IPCC assessment process. The role of the DDC has evolved considerably since it was established in 1997, responding to the expanding range and complexity of the data products involved in the IPCC assessment process. The role of the IPCC assessments has also evolved: from considering whether anthropogenic climate change might have unwelcome consequences, and how those consequences would vary under different socio-economic scenarios, to reporting on the likely outcomes of different global policy options.

The DDC works both with datasets which underpin the key conclusions from the assessment and, increasingly, with data products generated by the scientists engaged in the assessment.

Applying FAIR data principles to data products being produced in the highly constrained context of the assessment process brings many challenges. Working with the Technical Support Units of the IPCC Working Groups and the IPCC Task Group, the IPCC DDC has helped to create a process that not only captures the information needed to document data products but also supports the consistent and clear description of figures and tables within the report.

How to cite: Juckes, M., Stockhause, M., Chen, R. S., and Xing, X.: IPCC Data Distribution Centre: FAIR data from Climate Research to Mitigation Policy, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-4737, https://doi.org/10.5194/egusphere-egu21-4737, 2021.

EGU21-12560 | vPICO presentations | ESSI3.3

Applying FAIRness evaluation approaches to (meta)data preserved at the World Data Center for Climate (WDCC): results, lessons learned, recommendations

Karsten Peters-von Gehlen, Andrej Fast, Daniel Heydebreck, Heinke Höck, Amandine Kaiser, Andrea Lammert, and Hannes Thiemann

The perceived community demand for research data repositories to provide services ensuring that stored data comply with the FAIR principles requires transparent evaluation of such services. In previous work, the long term archiving service WDCC1 (World Data Centre for Climate) at DKRZ (German Climate Computing Center, Hamburg) underwent an even-handed self-assessment along the published FAIR principles and the results are published on the DKRZ homepage2.

Here, we present results of an overhaul of the previous WDCC FAIRness assessment, obtained by subjecting datasets archived in WDCC to a number of now-available objective FAIR assessment approaches, offered as questionnaires or fully-automated web applications3,4,5. In these approaches, FAIRness is assessed using so-called metrics or maturity indicators. The terminology is largely a choice of the test provider - e.g. the term 'metric' may be off-putting for some - but both yield quantitative results. First tests show that (meta)data archived in WDCC seem to attain a higher level of FAIRness when evaluated using questionnaires than when evaluated with fully-automated applications. Further work is needed to substantiate this finding.

We learn that while neither of the two evaluation approaches is ideal, both show merit. Questionnaires - answered by knowledgeable repository staff - capture domain- and repository-specific aspects of FAIRness, like the use of controlled vocabularies in the datasets, the granularity of archived datasets, reuse documentation or a clear assessment of local data access protocols. However, human-performed evaluation does not capture machine-actionability in terms of FAIR. This aspect is - naturally - very well assessed by automatic evaluation approaches, but the results strongly depend on the way the tests for FAIR metrics/maturity indicators are implemented. Moreover, automatic tests often only assess metadata FAIRness, lack domain-specific FAIRness indicators, or yield failed tests if a repository's technical properties, e.g. the specification of authentication procedures for data access, are not compatible with what an automatic procedure is built to test for.

Since WDCC has a more than 30-year history of preserving climate-science data with a focus on reusability by the community (and beyond), FAIRness evaluations based on human-actionable questionnaires accordingly show a high degree of FAIRness. We further learn that there is an urgent need for specifically-designed automatic FAIR testing approaches that take domain-specific data standards and structures into account. Especially the availability of FAIR metrics/maturity indicators for the atmospheric and climate sciences is very limited. We thus recommend compiling such indicators and will aim to contribute to this effort.
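
As an illustration of how such automated testing can be scripted, the sketch below submits a dataset identifier to F-UJI (footnote 5) via its REST API; the endpoint, credentials and example DOI are placeholders to be adapted to the deployed service, and the payload fields follow F-UJI's documented API at the time of writing:

```python
import requests

# Placeholder endpoint of a locally deployed F-UJI service; route and
# payload should be checked against the deployed version.
FUJI_URL = "http://localhost:1071/fuji/api/v1/evaluate"

def assess_fairness(identifier: str) -> dict:
    """Ask F-UJI to evaluate one persistent identifier; return its report."""
    payload = {"object_identifier": identifier, "use_datacite": True}
    response = requests.post(FUJI_URL, json=payload,
                             auth=("user", "password"),  # placeholder credentials
                             timeout=300)
    response.raise_for_status()
    return response.json()

# Placeholder DOI standing in for a WDCC dataset identifier:
report = assess_fairness("https://doi.org/10.1594/WDCC/EXAMPLE")
print(report.get("summary"))  # per-metric scores, comparable across datasets
```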

In our contribution, we specifically showcase strong as well as weak aspects of the WDCC service in terms of FAIRness, report on our measures to increase the domain-specific FAIRness of WDCC, and present recommendations for establishing FAIR indicators for (meta)data common to the Earth System Science community. We will make the results of our assessment openly available on the WDCC homepage and produce a corresponding open-access peer-reviewed publication.

 

References:

1https://cera-www.dkrz.de

2https://cera-www.dkrz.de/WDCC/ui/cerasearch/info?site=fairness

3https://www.rd-alliance.org/node/60731/outputs 

4https://fairsharing.github.io/FAIR-Evaluator-FrontEnd/#!/ 

5https://www.fairsfair.eu/f-uji-automated-fair-data-assessment-tool

How to cite: Peters-von Gehlen, K., Fast, A., Heydebreck, D., Höck, H., Kaiser, A., Lammert, A., and Thiemann, H.: Applying FAIRness evaluation approaches to (meta)data preserved at the World Data Center for Climate (WDCC): results, lessons learned, recommendations, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-12560, https://doi.org/10.5194/egusphere-egu21-12560, 2021.

EGU21-14438 | vPICO presentations | ESSI3.3

Initiating FAIR geothermal data in Indonesia

Dasapta Erwin Irawan

One of the main keys to scientific development is data availability: data should not only be easy to discover and download, but also easy to reuse. Geothermal researchers, research institutions and industries are the three main stakeholders in fostering data sharing and data reuse. Very expensive deep-well datasets as well as advanced logging datasets are important not only for exploitation purposes but also for the wider community, e.g. for regional planning or common environmental analyses. Data sharing rests on the four FAIR principles. Principle 1, Findable: data are uploaded to an open repository with proper data documentation and a data schema. Principle 2, Accessible: access restrictions such as user IDs and passwords are removed for easy download; for data from commercial entities, embargoed data are permitted with a clear embargo duration and data request procedure. Principle 3, Interoperable: all data must be prepared in a manner that allows straightforward data exchange between platforms. Principle 4, Reusable: all data must be submitted in common conventional file formats, preferably text-based (e.g. `csv` or `txt`), so they can be analyzed using various software and hardware. That the geothermal industry is for-profit and capital-intensive gives it even more reason to embrace data sharing: it would be a good way to demonstrate its role in supporting society. Contributions from multiple stakeholders are the most essential part of science development. In the context of commercial industry, data sharing is a form of corporate social responsibility (CSR), which should not be defined only as giving out funding to support local communities.

Keywords: open data, FAIR data, data sharing 

 

How to cite: Irawan, D. E.: Initiating FAIR geothermal data in Indonesia, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-14438, https://doi.org/10.5194/egusphere-egu21-14438, 2021.

EGU21-3922 | vPICO presentations | ESSI3.3

NFDI4Earth

Hannes Thiemann, Peter Bräsicke, Markus Reichstein, Claus Weiland, Dominik Hezel, Miguel Mahecha, and Lars Bernard

NFDI4Earth (www.nfdi4earth.de) is proposed as the consortium of the German NFDI (National Research Data Infrastructure) to address the digital needs of researchers in the Earth System Sciences (ESS). The NFDI4Earth consortium has been created in a bottom-up process and currently comprises 58 members from German universities, research institutions, infrastructure providers, public authorities and different research organizations.

The large number and diversity of observational, analytical and model datasets at very high spatial, temporal and thematic resolution confront the ESS with a strongly increasing amount of heterogeneous and inherently complex data. Earth system processes constantly change on various time scales and strongly influence each other. Describing and evaluating these processes urgently requires efficient workflows, extremely powerful data-analytic frameworks such as datacubes, and appropriate levels of harmonization of the related data services and their underlying standards. Research data are currently managed by an unstructured plethora of services that are scattered, heterogeneous and often only project-based, without a long-term perspective. Under the umbrella of NFDI4Earth, a variety of measures and services are bundled into a one-stop service framework. With a common approach to openness and FAIRness, they form a united, sustainable and coherent solution.

In addition to existing links between German and international partners in ESS, NFDI4Earth will establish itself as a single point of contact and the voice for German Earth system scientists in both existing and emerging networks and alliances. NFDI4Earth is for example already striving to establish linkages with federative e-infrastructures like the European Open Science Cloud (EOSC) at an early stage.

 

How to cite: Thiemann, H., Bräsicke, P., Reichstein, M., Weiland, C., Hezel, D., Mahecha, M., and Bernard, L.: NFDI4Earth, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-3922, https://doi.org/10.5194/egusphere-egu21-3922, 2021.

EGU21-9401 | vPICO presentations | ESSI3.3

Harmonizing heterogeneous multi-proxy data from Arctic lake sediment records 

Gregor Pfalz, Bernhard Diekmann, Johann-Christoph Freytag, and Boris K. Biskaborn

Lake systems play a central role in broadening our knowledge about future trends in the Arctic, as their sediments store information on interactions between climate change, lake ontogeny, external abiotic sediment input, and biodiversity changes. In order to make reliable statements about future lake trajectories, we need sound multi-proxy data from different lakes across the Arctic. Various studies using data from repositories have already shown the effectiveness of multi-proxy, multi-site investigations (e.g., Kaufman et al., 2020; PAGES 2k Consortium, 2017). However, there are still datasets from past coring expeditions to Arctic lake systems that are neither included in any of these repositories nor subject to any particular standard. When working with such data from heterogeneous sources, we face the challenge of dealing with data of different format, type, and structure. It is therefore necessary to transform such data into a uniform format to ensure semantic and syntactic comparability. In this talk, we present an interdisciplinary approach for transforming research data from different lake sediment cores into a coherent framework. Our approach adapts methods from the database field, such as developing entity-relationship (ER) diagrams, to understand the conceptual structure of the data independently of the source. Based on this knowledge, we developed a conceptual data model that allows scientists to integrate heterogeneous data into a common database. During the talk, we present further steps to prepare datasets for multi-site statistical investigation. To test our approach, we compiled and transformed a collection of published and unpublished paleolimnological data of Arctic lake systems into our proposed format. Additionally, we show results from a comparative analysis of the acquired data, focusing on total organic carbon and bromine content. We conclude that our harmonized dataset enables numerical inter-proxy and inter-lake comparison despite strong initial heterogeneity.
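
As a simplified illustration of such a conceptual data model (the model developed in this work is considerably richer), the following sketch creates a minimal relational schema for multi-proxy lake sediment data; all table and column names are invented for the example:

```python
import sqlite3

conn = sqlite3.connect("arctic_lakes.db")
conn.executescript("""
CREATE TABLE lake (
    lake_id       INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    latitude      REAL,             -- decimal degrees, WGS84
    longitude     REAL
);
CREATE TABLE core (
    core_id       INTEGER PRIMARY KEY,
    lake_id       INTEGER NOT NULL REFERENCES lake(lake_id),
    label         TEXT,
    water_depth_m REAL
);
CREATE TABLE measurement (
    core_id       INTEGER NOT NULL REFERENCES core(core_id),
    depth_cm      REAL NOT NULL,    -- depth below sediment surface
    proxy         TEXT NOT NULL,    -- e.g. 'TOC', 'Br'
    value         REAL,
    unit          TEXT              -- e.g. 'wt%', 'counts'
);
""")
conn.commit()
```

Separating lakes, cores and measurements in this way lets proxies of any type from any source be loaded into one queryable structure, which is the precondition for the inter-proxy and inter-lake comparisons described above.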

 

[1]   D. S. Kaufman et al., “A global database of Holocene paleotemperature records,” Sci. Data, vol. 7, no. 115, pp. 1–34, 2020.

[2]   PAGES 2k Consortium, “A global multiproxy database for temperature reconstructions of the Common Era,” Sci. Data, vol. 4, no. 170088, pp. 1–33, 2017.

How to cite: Pfalz, G., Diekmann, B., Freytag, J.-C., and Biskaborn, B. K.: Harmonizing heterogeneous multi-proxy data from Arctic lake sediment records , EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-9401, https://doi.org/10.5194/egusphere-egu21-9401, 2021.

EGU21-2546 | vPICO presentations | ESSI3.3

Managing Geophysics datasets: Challenges and perspectives from the UK Polar Data Centre

A. Fremand

Today, open data policies are better understood by scientists, and writing a data management plan is part of every Natural Environment Research Council (NERC) project submission. This also means that scientists expect more and more from their data publications and data requests: they want interactive maps, more complex data systems, and the ability to query data and publish them rapidly.

At the UK Polar Data Centre (PDC, https://www.bas.ac.uk/data/uk-pdc/), the datasets are very diverse, reflecting the multidisciplinary nature of polar science. Geophysics datasets include bathymetry, aerogravity, aeromagnetics and airborne radar depth soundings. Encouraging reuse and increasing the value of data are at the core of the PDC's mission. Data published by the PDC are used in a large variety of international scientific research projects. For instance, the significant datasets from seabed multibeam coverage of the Southern Ocean enable the British Antarctic Survey to be a major contributor to multiple projects such as the International Bathymetric Chart of the Southern Ocean (IBCSO) and Seabed 2030. The wide coverage of airborne radar echo sounding over Antarctica is crucial for the SCAR BEDMAP3 project, which aims to produce a new map of Antarctic ice thickness and bed topography for the international glaciology and geophysics community.

Over the last year, procedures to preserve, archive and distribute these data have been revised and updated to comply with the requirements of CoreTrustSeal. We are nevertheless still looking for new technologies, tools and open-source software that will help us bring interactivity to our datasets and meet the expectations of scientists.
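
One example of such open-source tooling is folium, which renders interactive Leaflet maps from Python; in the sketch below, the survey-line coordinates are invented placeholders rather than PDC data:

```python
import folium

# Minimal interactive web map of a (fictional) airborne survey line.
m = folium.Map(location=[-75.0, -25.0], zoom_start=3, tiles="OpenStreetMap")
flight_line = [(-74.5, -30.0), (-75.2, -27.5), (-76.0, -24.0)]  # placeholder track
folium.PolyLine(flight_line, tooltip="Airborne radar survey line").add_to(m)
m.save("survey_map.html")  # self-contained HTML, viewable in any browser
```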

How to cite: Fremand, A.: Managing Geophysics datasets: Challenges and perspectives from the UK Polar Data Centre, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-2546, https://doi.org/10.5194/egusphere-egu21-2546, 2021.

EGU21-16054 | vPICO presentations | ESSI3.3

Advancing the Geosciences through Open Standards

Siri Jodha Khalsa

Data is the lifeblood of the geosciences. Furthermore, the acquisition, processing and interpretation of data all depend on established specifications describing the systems and procedures that were used in producing, describing and distributing that data. It can be said that technical standards underpin the entire scientific endeavour. This is becoming ever truer in the era of Big Data and Open, Transdisciplinary Science. It takes the dedicated efforts of many individuals to create a viable standard. This presentation will describe the experiences and status of standards development activities related to geoscience remote sensing technologies which are being carried out under the auspices of the IEEE Geoscience and Remote Sensing Society (GRSS).

A Standards Development Organization (SDO) exists to provide the environment, rules and governance necessary to facilitate the fair and equitable development of standards, and to assist in the distribution and maintenance of the resulting standards. The GRSS sponsors projects with the IEEE Standards Association (IEEE-SA), which, like other SDOs such as ISO and OGC, has well-defined policies and procedures that help ensure the openness and integrity of the standards development process. Each participant in a standards working group typically brings specific interests as a producer, consumer or regulator of a product, process or service. Creating an environment that makes it possible to find consensus among competing interests is a primary role of an SDO. I will share some of the insights gained from the six standards projects that the GRSS has initiated which involve hyperspectral imagers, the spectroscopy of soils, synthetic aperture radar, microwave radiometers, GNSS reflectometry, and radio frequency interference in protected geoscience bands.

How to cite: Khalsa, S. J.: Advancing the Geosciences through Open Standards, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-16054, https://doi.org/10.5194/egusphere-egu21-16054, 2021.

EGU21-15783 | vPICO presentations | ESSI3.3

A Standards-based Data Catalogue integrating scientific, community-based and citizen science data across the Arctic

Torill Hamre, Finn Danielsen, Michael Køie Poulsen, and Frode Monsen

INTAROS is a Horizon 2020 research and innovation project developing an integrated Arctic Observation System by extending, improving, and unifying existing systems in the different regions of the Arctic. INTAROS integrates distributed repositories hosting data from ocean, atmosphere, cryosphere and land, including scientific, community-based monitoring (CBM) and citizen science (CS) data. Throughout the project, INTAROS has been working closely with several local communities and citizen science programs across the Arctic to develop strategies and methods for ingesting data into repositories, enabling the communities to maintain and share data. A number of these CBM and CS data collections have been registered in the INTAROS Data Catalogue. Some of these collections are hosted and sustained by large international programs such as PISUNA, eBird, Secchi Disk Study and GLOBE Observer. Registration in the INTAROS Data Catalogue contributes to making these important data collections better known to a wider community of users with a vested interest in the Arctic. It also enables the sharing of metadata through open standards for inclusion in other Arctic data systems. The catalogue is a key component in INTAROS, enabling users to search for data across the targeted spheres to assess their usefulness for particular applications and geographic areas. It is based on a world-leading system for data management, the Comprehensive Knowledge Archive Network (CKAN). With rich functionality offered out of the box, combined with a flexible extension mechanism, CKAN allows a fully functional data catalogue to be set up quickly. The CKAN open-source community offers numerous extensions that can be used as-is or adapted to implement customised functionality for specific user communities. To hold additional metadata elements requested by the partners, we modified the standard database schema of CKAN. The presentation will focus on the current capabilities and on plans for sustaining and enhancing the INTAROS Data Catalogue.
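
As a hint of what machine-level access to such a catalogue looks like, the sketch below queries a CKAN instance through CKAN's standard Action API; the base URL is a placeholder, while the `package_search` route is part of CKAN's documented API:

```python
import requests

# Placeholder catalogue host; /api/3/action/package_search is CKAN's
# documented free-text search endpoint.
CKAN_SEARCH = "https://catalogue.example.org/api/3/action/package_search"

params = {"q": "citizen science", "rows": 10}
result = requests.get(CKAN_SEARCH, params=params, timeout=30).json()

if result["success"]:
    for dataset in result["result"]["results"]:
        print(dataset["name"], "-", dataset.get("title", ""))
```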

How to cite: Hamre, T., Danielsen, F., Køie Poulsen, M., and Monsen, F.: A Standards-based Data Catalogue integrating scientific, community-based and citizen science data across the Arctic, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-15783, https://doi.org/10.5194/egusphere-egu21-15783, 2021.

EGU21-14899 | vPICO presentations | ESSI3.3

Curating geosciences data in the Earth, Space and Environmental Sciences – new developments of GFZ Data Services

Florian Ott, Kirsten Elger, and Damian Ulbricht

GFZ Data Services is a domain repository for geosciences data that has been assigning digital object identifiers (DOIs) to data and scientific software since 2004. Hosted at the GFZ German Research Centre for Geosciences (GFZ), the repository focuses on the curation of long-tail data on the one hand and, on the other, provides DOI minting services for several global monitoring networks/observatories in geodesy and geophysics (e.g. INTERMAGNET; the IAG Services ICGEM, IGETS, IGS; GEOFON) and collaborative projects (TERENO, EnMAP, GRACE, CHAMP). Furthermore, as Allocating Agent for the IGSN, the globally unique persistent identifier for physical samples, GFZ provides IGSN minting services for physical samples.

GFZ Data Services increases the interoperability of long-tail data through (1) the provision of comprehensive domain-specific data description via standardised and machine-readable metadata with controlled domain vocabularies; (2) complementing the metadata with comprehensive and standardised technical data descriptions or reports; and (3) by embedding the research data in wider context by providing cross-references through Persistent Identifiers (DOI, IGSN, ORCID, Fundref) to related research products (text, data, software) and people or institutions involved.

Visibility of the data is established through registration of the metadata with DataCite and the dissemination of metadata via standard protocols. The DOI landing pages embed metadata in Schema.org markup to facilitate discovery through internet search engines like Google Dataset Search. In addition, we feed links between data and related research products into Scholix, which allows data publications and scholarly literature to be linked even when the data are published years after the article.
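
As an illustration of the Schema.org embedding described above, the sketch below builds a minimal schema.org/Dataset description as JSON-LD, of the kind a DOI landing page can embed; all values are invented placeholders rather than an actual GFZ record:

```python
import json

# All values below are placeholders, not an actual GFZ record.
dataset_jsonld = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example geomagnetic observatory dataset",
    "identifier": "https://doi.org/10.5880/EXAMPLE",
    "creator": [{"@type": "Person", "name": "Jane Doe"}],
    "publisher": {"@type": "Organization", "name": "GFZ Data Services"},
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# A landing page embeds this inside:
#   <script type="application/ld+json"> ... </script>
print(json.dumps(dataset_jsonld, indent=2))
```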

The new website of GFZ Data Services has evolved from a (purely) searchable data portal into an information point for data publications and data management. This includes information on metadata, data formats, the data publication workflow and FAQs, links to different versions of our metadata editor, and downloadable data description templates. Specific data publication guidance is complemented by more general information on data management, like a data management roadmap for PhD students, and links to the data catalogue of GFZ Data Services, the IGSN catalogue of GFZ and RI@GFZ, the data and research infrastructure search portal of GFZ.

Since October 2020, GFZ has been a DataCite member. This membership enables and promotes active participation in current and future technological and service-oriented developments related to the persistent identification of research outputs.

How to cite: Ott, F., Elger, K., and Ulbricht, D.: Curating geosciences data in the Earth, Space and Environmental Sciences – new developments of GFZ Data Services, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-14899, https://doi.org/10.5194/egusphere-egu21-14899, 2021.

EGU21-13796 | vPICO presentations | ESSI3.3

Linking Domain Repositories to Build Cyberinfrastructure for Interdisciplinary Critical Zone Research

Jeffery S. Horsburgh, Kerstin Lehnert, and Jerad Bales

Critical Zone science studies the system of coupled chemical, biological, physical, and geological processes operating together across all scales to support life at the Earth's surface (Brantley et al., 2007). In 2020, the U.S. National Science Foundation funded 10 Critical Zone Collaborative Network awards. These 5-year projects will collaboratively work to answer scientific questions relevant to understanding processes in the Critical Zone, such as the effects of urbanization on Critical Zone processes; Critical Zone function in semi-arid landscapes and the role of dust in sustaining these ecosystems; processes in deep bedrock and their relationship to Critical Zone evolution; the recovery of the Critical Zone from disturbances such as fire and flooding; and changes in the coastal Critical Zone related to rising sea level. In order to support community data collection, access, and archival for the Critical Zone Network community, the development of new cyberinfrastructure (CI) is now underway that leverages prior investments in domain-specific data repositories that are already operational and delivers data services to established communities. The goal is to create the infrastructure required for managing, curating, disseminating, and preserving data from the new network of Critical Zone Cluster projects, along with legacy datasets from the existing Critical Zone Observatory Network, including digital management of physical samples. This CI will have a distributed architecture that links existing data facilities and services, including HydroShare, EarthChem, SESAR (System for Earth Sample Registration), and eventually other systems like OpenTopography as needed, via a central CZ Hub that provides tools and services for simplified data submission, integrated data discovery and access, and links to computational resources for data analysis and visualization in support of CZ synthesis efforts. Our goal is to make data, samples, and software collected by the CZ Network Cluster projects Findable, Accessible, Interoperable, and Reusable following the FAIR guiding principles for scientific data management and stewardship, by taking advantage of existing, FAIR-compliant, domain-specific data repositories. This collaboration among domain repositories to deliver integrated data services for an interdisciplinary science program will provide a template for future development of integrated interdisciplinary data services.

How to cite: Horsburgh, J. S., Lehnert, K., and Bales, J.: Linking Domain Repositories to Build Cyberinfrastructure for Interdisciplinary Critical Zone Research, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-13796, https://doi.org/10.5194/egusphere-egu21-13796, 2021.

EGU21-16356 | vPICO presentations | ESSI3.3

re3data COREF – Enhancing the re3data service as a community-driven and trustworthy resource for research data repositories and portals

Nina Weisweiler, Kirsten Elger, Robert Ulrich, Michael Witt, Lea Maria Ferguson, Maxi Kindling, Gabriele Kloska, Nguyen Thanh Binh, Rouven Schabinger, Dorothea Strecker, Margarita Trofimenko, and Paul Vierkant

re3data is the global registry for research data repositories. As of January 2021, the service lists over 2620 digital repositories across all scientific disciplines and provides an extensive description of repositories based on a detailed metadata schema (https://doi.org/10.2312/re3.008). A variety of funders, publishers, and scientific organizations around the world refer to re3data within their guidelines and policies, recommending the service to researchers looking for appropriate repositories for storage and discovery of research data. With over 750 entries the field of geosciences is one of the most strongly represented subject groups in the registry.

The re3data COREF project (Community Driven Open Reference for Research Data Repositories) started in January 2020 and receives funding from the German Research Foundation (DFG) for 36 months. Focusing on the current project, the presentation will outline the further professionalization of re3data and the provision of reliable and customisable descriptions of research data repositories. This includes updates and revisions of the metadata schema, the advancement of the technical infrastructure, and an enhanced overall (technical) service model concept to embed and connect the service within the research data landscape as a community-driven source and reference for trustworthy repositories.

In addition, outcomes from the first re3data COREF stakeholder survey and workshop, held in November 2020, will be presented, introducing diverse use cases of the re3data service and examples of the reuse of its metadata. The presentation will address how re3data currently interlinks with external parties and how more advanced options for easier and trustworthy integration of third-party information can be facilitated.
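
To illustrate the kind of metadata reuse mentioned above, the following sketch retrieves one repository description from re3data's public REST API; the API route follows re3data's documentation, while the repository identifier is a placeholder:

```python
import requests
import xml.etree.ElementTree as ET

# Placeholder repository id; /api/v1/repository/{id} is the documented route.
API = "https://www.re3data.org/api/v1/repository/r3d100000000"

root = ET.fromstring(requests.get(API, timeout=30).text)

# First inspection: dump all non-empty elements; real reuse would map the
# fields of the re3data metadata schema onto the target system.
for elem in root.iter():
    if elem.text and elem.text.strip():
        print(elem.tag, ":", elem.text.strip())
```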

How to cite: Weisweiler, N., Elger, K., Ulrich, R., Witt, M., Ferguson, L. M., Kindling, M., Kloska, G., Binh, N. T., Schabinger, R., Strecker, D., Trofimenko, M., and Vierkant, P.: re3data COREF – Enhancing the re3data service as a community-driven and trustworthy resource for research data repositories and portals, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-16356, https://doi.org/10.5194/egusphere-egu21-16356, 2021.

EGU21-3648 | vPICO presentations | ESSI3.3

Citation and credit: The role of researchers, journals, and repositories to ensure data, software and samples are linked to publications with proper attribution.  

Shelley Stall, Helen Glaves, Brooks Hanson, Kerstin Lehnert, Erin Robinson, and Lesley Wyborn

The Earth, space, and environmental sciences have made significant progress in awareness and implementation of policy and practice around the sharing of data, software, and samples. Specifically, the Coalition for Publishing Data in the Earth and Space Sciences (https://copdess.org/) brings together data repositories and journals to discuss and address common challenges in support of more transparent and discoverable research and its supporting data. Since the inception of COPDESS in 2014 and the completion of the Enabling FAIR Data Project in 2019, work has continued on improving availability statements for data and software, as well as the corresponding citations.

As the broad research community continues to make progress around data and software management and sharing, COPDESS is focused on several key efforts. These include 1) supporting authors in identifying the most appropriate data repository for preservation, 2) validating that all manuscripts have data and software availability statements, 3) ensuring data and software citations are properly included and linked to the publication to support credit, 4) encouraging adoption of best practices. 

We will review the status of these current efforts around data and software sharing, the important role that repositories and researchers have to ensure that automated credit and attribution elements are in place, and the recent publications on software citation guidance from the FORCE11 Software Implementation Working Group.

How to cite: Stall, S., Glaves, H., Hanson, B., Lehnert, K., Robinson, E., and Wyborn, L.: Citation and credit: The role of researchers, journals, and repositories to ensure data, software and samples are linked to publications with proper attribution.  , EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-3648, https://doi.org/10.5194/egusphere-egu21-3648, 2021.

EGU21-1299 | vPICO presentations | ESSI3.3

FAIR, Open and Free does not mean no restrictions

Keith Jeffery

FAIR, open and free are rarely used correctly to describe access to assets. In fact, assets - expected to be or described as FAIR, open and free - are subject to many restrictions. The major ones are:

(1) Security: to protect the asset from unavailability and any process from corruption, related to curation.  Security breaches may be criminal.

(2) Privacy: to protect any personal data within or about the asset. The General Data Protection Regulation (GDPR) is highly relevant here and provides for severe penalties.

(3) Rights and licences: the asset may be subject to claimed rights (such as copyright, database right or even patents) and also to licensing, which may be more or less restrictive;

(4) Authorisation: within an Authentication, Authorisation, Accounting Infrastructure (AAAI), access to assets is permitted only for authenticated users in a given user role (owner, manager...), in appropriate modes (read, update...), possibly within a certain time period and subject to asset licensing;

(5) Terms and Conditions: the system controlling the assets may have associated terms and conditions of use including - but not restricted to - liability, user behaviour, use of cookies.

In EPOS we are drawing together all these aspects into an integrated, policy-driven set of mechanisms in the system, including rich metadata, policy and licence documents, informed consent at the user interface and an AAAI system based on the recommendations of AARC (https://aarc-project.eu/).
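
The following minimal sketch illustrates the kind of policy-driven authorisation decision such an AAAI makes (a generic illustration, not the EPOS implementation); the roles, modes and licence check are assumed examples:

```python
from dataclasses import dataclass

ROLE_MODES = {            # which access modes each role may use (assumed policy)
    "owner":   {"read", "update", "delete"},
    "manager": {"read", "update"},
    "user":    {"read"},
}

@dataclass
class Asset:
    identifier: str
    licence_allows_read: bool = True   # stand-in for a real licence check

def authorise(role: str, mode: str, asset: Asset) -> bool:
    """Grant access only if role, mode and licence all permit it."""
    if mode not in ROLE_MODES.get(role, set()):
        return False
    if mode == "read" and not asset.licence_allows_read:
        return False
    return True

print(authorise("user", "update", Asset("doi:10.xxxx/example")))   # False
print(authorise("manager", "read", Asset("doi:10.xxxx/example")))  # True
```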

 

How to cite: Jeffery, K.: FAIR, Open and Free does not mean no restrictions, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-1299, https://doi.org/10.5194/egusphere-egu21-1299, 2021.

EGU21-10817 | vPICO presentations | ESSI3.3

Why and how does CDGP limit access to some deep geothermal data

Mathieu Turlure, Marc Schaming, Jean Schmittbuhl, and Marc Grunberg

The Data Centre for Deep Geothermal Energy (CDGP - Centre de Données de Géothermie Profonde, https://cdgp.u-strasbg.fr) was launched in 2016 by the LabEx G-EAU-THERMIE PROFONDE (now ITI GeoT, https://iti-geot.unistra.fr/) to preserve, archive and distribute data acquired on geothermal sites in Alsace. Since the beginning of the project, specific procedures have been followed to meet international requirements for data management. In particular, the FAIR recommendations are used to distribute Findable, Accessible, Interoperable and Reusable data "As Open as Possible, as Closed as Necessary".

CDGP distributes data originating from academic institutions as well as from industrial partners. The former are open and disseminated without restriction, fulfilling Open Science requirements. The latter are less open, depending on the access restrictions set by the data owner. Up to now, industrial data may be open, restricted to academia, distributed case-by-case (after the owner's agreement), or closed. Metadata are fully open. The access rights are also pushed to the EPOS TCS-AH platform (https://tcs.ah-epos.eu).

CDGP has implemented an Authentication, Authorization and Accounting Infrastructure (AAAI) to handle the distribution rules. The user's business category is verified, at least for academics, before access is granted. Where possible, datasets are provided (or denied) automatically. If necessary, the user's request is forwarded to the provider, who can accept or refuse access. Reports listing the datasets distributed to users are sent to providers every six months. This AAAI is built to earn and keep the data providers' trust, as well as to publicize the data.

CDGP is trying to broaden the number of open datasets. There are questions about the access restrictions on some vintage industrial data from Soultz-sous-Forêts, since some of them were acquired with public European funding. Also, industrial data from the Vendenheim area, where several felt earthquakes occurred (2019, 2020), are currently not available but may become partly accessible, since some of the exploration was done for "scientific purposes" and expert studies are required to understand the induced seismicity.

How to cite: Turlure, M., Schaming, M., Schmittbuhl, J., and Grunberg, M.: Why and how does CDGP limit access to some deep geothermal data, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-10817, https://doi.org/10.5194/egusphere-egu21-10817, 2021.

ESSI3.7 – Free and Open Source Software (FOSS) and Cloud-based Technologies to Facilitate Collaborative Science

EGU21-1614 | vPICO presentations | ESSI3.7

A new distributed data analysis framework for better scientific collaborations

Philipp S. Sommer, Viktoria Wichert, Daniel Eggert, Tilman Dinter, Klaus Getzlaff, Andreas Lehmann, Christian Werner, Brenner Silva, Lennart Schmidt, and Angela Schäfer

A common challenge for projects with multiple participating research institutes is a well-defined and productive collaboration. All parties measure and analyze different aspects, depend on each other, share common methods, and exchange the latest results, findings, and data. Today this exchange is often impeded by a lack of ready access to shared computing and storage resources. In our talk, we present a new and innovative remote procedure call (RPC) framework. We focus on a distributed setup in which project partners do not necessarily work at the same institute and do not have access to each other's resources.

We present the prototype of an application programming interface (API), developed in Python, that enables scientists to collaboratively explore and analyze sets of distributed data. It offers the functionality to request remote data through a convenient interface, and to share and invoke single computational methods or even entire analytical workflows and their results. The prototype enables researchers to make their methods accessible as a backend module running on their own infrastructure. Researchers from other institutes may then apply the available methods through a lightweight Python or JavaScript API. This API transforms standard Python calls into requests to the backend process on the remote server. In the end, the overhead for both the backend developer and the remote user is very low: the effort of implementing the necessary workflow and API usage is comparable to writing code in a non-distributed setup. Moreover, data do not have to be downloaded locally; the analysis can be executed "close to the data", using the institutional infrastructure where the eligible data set is stored.
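
To make the pattern concrete, the sketch below emulates it with nothing but Python's standard-library XML-RPC modules; the actual framework and its API differ, and the method, data and addresses are invented placeholders:

```python
# --- provider side (runs at the institute hosting the data) ---
from xmlrpc.server import SimpleXMLRPCServer

def mean_of_series(station_id: str) -> float:
    """Placeholder analysis that runs 'close to the data'."""
    local_data = {"station-1": [3.2, 4.1, 5.0]}      # stand-in for real storage
    series = local_data[station_id]
    return sum(series) / len(series)

server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
server.register_function(mean_of_series)
# server.serve_forever()  # uncomment to actually serve requests

# --- user side (any other institute) ---
# from xmlrpc.client import ServerProxy
# backend = ServerProxy("http://data-provider.example.org:8000")
# print(backend.mean_of_series("station-1"))  # ordinary call, remote execution
```

The point of the pattern is visible here: the remote user writes a plain function call, while the computation and the data never leave the provider's infrastructure.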

With our prototype, we demonstrate distributed data access and analysis workflows across institutional borders to enable effective scientific collaboration, thus deepening our understanding of the Earth system.

This framework has been developed in a joint effort of the DataHub and Digitial Earth initiatives within the Research Centers of the Helmholtz-Gemeinschaft Deutscher Forschungszentren e.V.  (Helmholtz Association of German Research Centres, HGF).

How to cite: Sommer, P. S., Wichert, V., Eggert, D., Dinter, T., Getzlaff, K., Lehmann, A., Werner, C., Silva, B., Schmidt, L., and Schäfer, A.: A new distributed data analysis framework for better scientific collaborations, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-1614, https://doi.org/10.5194/egusphere-egu21-1614, 2021.

EGU21-4432 | vPICO presentations | ESSI3.7

Assembling a geoscience information portal from pieces of the open source software jigsaw puzzle

Vincent Fazio, Carsten Friedrich, Rini Angreani, Pavel Golodoniuc, John Hille, Alex Hunt, LingBo Jiang, Jens Klump, Geoffrey Squire, Peter Warren, Ulrich Engelke, Stuart Woodman, and Sam Bradley

As open-source geospatial mapping toolkits and platforms continue to develop and mature, the developers of web portals using these solutions need to regularly review and re-evaluate their technology choices in order to stay up to date and provide the best possible experience and functionality to their users. We are currently undergoing such a refresh with our AuScope Discovery Portal, the Virtual Geophysics Laboratory, and the AuScope 3D Geological Models Portal. The task of deciding which solutions to utilise as part of the upgrade process is not to be underestimated. Our main evaluation criteria include the ability to support commonly used map layer formats and web service protocols, support for 3D display capabilities, community size and activity, ease of adding custom display and scientific workflow/processing widgets, the cost and benefits of integration with existing components, and maintainability into the future. We are beginning a journey to update and integrate our portals' functionality and will outline the decision process and conclusions of our investigations, as well as the detailed evaluation of web-based geospatial solutions against our functional and operational criteria.

How to cite: Fazio, V., Friedrich, C., Angreani, R., Golodoniuc, P., Hille, J., Hunt, A., Jiang, L., Klump, J., Squire, G., Warren, P., Engelke, U., Woodman, S., and Bradley, S.: Assembling a geoscience information portal from pieces of the open source software jigsaw puzzle, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-4432, https://doi.org/10.5194/egusphere-egu21-4432, 2021.

EGU21-6164 | vPICO presentations | ESSI3.7

Inishell 2.0: Semantically driven automatic GUI generation for scientific models

Mathias Bavay, Michael Reisecker, Thomas Egger, and Daniela Korhammer

As numerical model developers, we have experienced first-hand how most users struggle with the configuration of models, leading to numerous support requests. Such issues are usually mitigated by offering a Graphical User Interface (GUI) that flattens the learning curve. However, this requires a significant investment from the model developer as well as a specific skill set, and it does not fit within the daily duties of model developers. As a consequence, when a GUI has been created - usually within a specific project and often relying on an intern - its maintenance either constitutes a major burden or is not performed. This also tends to limit the evolution of the numerical models themselves, since the model developers try to avoid having to change the GUI.

To circumvent this problem, we have developed Inishell [1], a C++/Qt application that generates a GUI on the fly from an XML description of the inputs required by the numerical model. This makes maintenance of the GUI very simple and enables users to easily obtain an up-to-date GUI for configuring the numerical model. The first version of this tool was written almost ten years ago and showed that the concept works very well for our own surface processes models. A full rewrite offering a more modern interface and extended capabilities is presented here.
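
To illustrate the general idea (with an invented XML format, not Inishell's actual schema), the following minimal Python/tkinter sketch builds input widgets on the fly from an XML description of model parameters:

```python
import tkinter as tk
import xml.etree.ElementTree as ET

# Hypothetical parameter description; only the concept matches Inishell.
XML = """
<parameters>
  <parameter key="STATION" type="string" label="Station ID"/>
  <parameter key="EXTRA_OUTPUT" type="checkbox" label="Enable extra output"/>
</parameters>
"""

root = tk.Tk()
widgets = {}
for i, p in enumerate(ET.fromstring(XML)):
    tk.Label(root, text=p.get("label")).grid(row=i, column=0, sticky="w")
    if p.get("type") == "checkbox":
        var = tk.BooleanVar()
        tk.Checkbutton(root, variable=var).grid(row=i, column=1)
    else:
        var = tk.StringVar()
        tk.Entry(root, textvariable=var).grid(row=i, column=1)
    widgets[p.get("key")] = var  # read back later to write the model's config file

# root.mainloop()  # uncomment to display the generated form
```

Because the GUI is derived from the XML at runtime, changing the model's inputs only requires editing the XML description, which is the maintenance advantage described above.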

 

[1] Bavay, M., Reisecker, M., Egger, T., and Korhammer, D., “Inishell 2.0: Semantically driven automatic GUI generation for scientific models”, Geosci. Model Dev. Discuss. [preprint], https://doi.org/10.5194/gmd-2020-339, in review, 2020.

How to cite: Bavay, M., Reisecker, M., Egger, T., and Korhammer, D.: Inishell 2.0: Semantically driven automatic GUI generation for scientific models, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-6164, https://doi.org/10.5194/egusphere-egu21-6164, 2021.

EGU21-9408 | vPICO presentations | ESSI3.7

Using open-source high resolution remote sensing data to determine the access to buildings in the context of passenger transport

A. Hahn, W. Frühling, and J. Schlüter

Routing on a road network requires geographical points on the network that best correspond to the addresses of the given origin and destination, here called snapping points. The technique used to determine such snapping points is also called offline map matching. Conventional routing machines use the shortest perpendicular distance from a building's centroid to the road network for this purpose. In some cases, however, this technique leads to suboptimal results, namely when the access to a building is not reachable from the road segment with the shortest perpendicular distance. We used open-source data (multispectral images, OpenStreetMap data, and Light Detection and Ranging (LiDAR) data) to perform a cost-distance analysis and determine the most likely access to buildings. For this, we assumed that the path to a building shows less vegetation cover and minimal terrain slope, and avoids building footprints. Our results are validated against a predetermined Ideal Snapping Area for different weightings of the vegetation, slope and building-footprint parameters. We also compared our results with a conventional routing machine (the Open Source Routing Machine, OSRM) that uses the perpendicular distance. The validation rate of our approach is up to 90%, depending on the weighting of the chosen parameters, whereas the conventional routing machine shows a validation rate of 81%. The optimized snapping points can be used to determine better stop locations in passenger transport and to improve services such as door-to-door transportation (e.g. demand-responsive transport).
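
A minimal sketch of the cost-distance idea, using scikit-image's least-cost-path routine on a synthetic cost surface; in the study, the surface would instead be derived from vegetation cover, terrain slope and building footprints:

```python
import numpy as np
from skimage.graph import route_through_array

rng = np.random.default_rng(42)
cost = rng.uniform(1.0, 2.0, size=(50, 50))   # base traversal cost (synthetic)
cost[20:30, 10:40] = 100.0                    # e.g. a building footprint to avoid

building_point = (25, 5)                      # grid indices (row, col), invented
road_point = (25, 45)

path, total_cost = route_through_array(cost, building_point, road_point,
                                       fully_connected=True, geometric=True)
# The cells where 'path' meets the road network suggest the most likely
# access side of the building, i.e. an improved snapping point.
print(len(path), "cells, total cost", round(total_cost, 1))
```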

How to cite: Hahn, A., Frühling, W., and Schlüter, J.: Using open-source high resolution remote sensing data to determine the access to buildings in the context of passenger transport, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-9408, https://doi.org/10.5194/egusphere-egu21-9408, 2021.

EGU21-9602 | vPICO presentations | ESSI3.7

Analyzing large-scale Earth Observation data repositories made simple with OpenEO Platform

Edzer Pebesma, Patrick Griffiths, Christian Briese, Alexander Jacob, Anze Skerlevaj, Jeroen Dries, Gilberto Camara, and Matthias Mohr

The openEO API allows the analysis of large amounts of Earth Observation data using a high-level abstraction of data and processes. Rather than requiring the management of virtual machines and millions of imagery files, it allows users to create jobs that take a spatio-temporal section of an image collection (such as Sentinel-2 L2A) and treat it as a data cube. Processes iterate or aggregate over pixels, spatial areas, spectral bands, or time series, while working at arbitrary spatial resolution. This pattern, pioneered by Google Earth Engine™ (GEE), lets the user focus on the science rather than on data management. A minimal sketch of this style of analysis is shown below.
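
By way of illustration, the following sketch expresses an NDVI computation in the style of the openEO Python client; the backend URL, collection id and spatio-temporal extent are placeholders, and authentication is omitted:

```python
import openeo

connection = openeo.connect("https://openeo.example.org")  # placeholder backend

cube = connection.load_collection(
    "SENTINEL2_L2A",                                   # assumed collection id
    spatial_extent={"west": 7.0, "south": 46.0, "east": 7.5, "north": 46.5},
    temporal_extent=["2020-06-01", "2020-08-31"],
    bands=["B04", "B08"],
)
red, nir = cube.band("B04"), cube.band("B08")
ndvi = (nir - red) / (nir + red)                       # pixel-wise band math
summer_max = ndvi.reduce_dimension(dimension="t", reducer="max")
summer_max.download("ndvi_summer_max.tiff")            # executed close to the data
```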

The openEO H2020 project (2017-2020) has developed the API as well as an ecosystem of software around it, including clients (JavaScript, Python, R, QGIS, browser-based), back-ends that translate API calls into existing image analysis or GIS software or services (for Sentinel Hub, WCPS, Open Data Cube, GRASS GIS, GeoTrellis/GeoPySpark, and GEE) as well as a hub that allows querying and searching openEO providers for their capabilities and datasets. The project demonstrated this software in a number of use cases, where identical processing instructions were sent to different implementations, allowing comparison of returned results.

A follow-up, ESA-funded project, "openEO Platform", realizes the API and develops the software ecosystem into operational services and applications that are accessible to everyone, involve federated deployment (using the clouds managed by EODC, Terrascope, CreoDIAS and EuroDataCube), will provide payment models ("pay per compute job") conceived and implemented following the user community's needs, and will use the EOSC (European Open Science Cloud) marketplace for dissemination and authentication. A wide range of large-scale case studies will demonstrate the ability of the openEO Platform to scale to large data volumes. The case studies to be addressed include on-demand ARD generation for SAR and multi-spectral data; agricultural demonstrators like crop type and condition monitoring; forestry services like near-real-time forest damage assessment and canopy cover mapping; environmental hazard monitoring of floods and air pollution; and security applications in terms of vessel detection in the Mediterranean Sea.

While the landscape of cloud-based EO platforms and services has matured and diversified over the past decade, we believe there are strong advantages for scientists and government agencies in adopting the openEO approach. Beyond the absence of vendor/platform lock-in or EULAs, we note the abilities to (i) run arbitrary user code (e.g. written in R or Python) close to the data, (ii) carry out scientific computations on an entirely open-source software stack, (iii) integrate different platforms (e.g., different cloud providers offering different datasets), and (iv) help create and extend this software ecosystem. openEO uses the OpenAPI standard, aligns with modern OGC API standards, and uses STAC (the SpatioTemporal Asset Catalog) to describe image collections and image tiles.

How to cite: Pebesma, E., Griffiths, P., Briese, C., Jacob, A., Skerlevaj, A., Dries, J., Camara, G., and Mohr, M.: Analyzing large-scale Earth Observation data repositories made simple with OpenEO Platform, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-9602, https://doi.org/10.5194/egusphere-egu21-9602, 2021.

EGU21-9632 | vPICO presentations | ESSI3.7

Easing and promoting the application of ML and AI in earth system sciences - introducing the KI:STE platform

Thomas Seidler, Norbert Schultz, Dr. Markus Quade, Christian Autermann, Dr. Benedikt Gräler, and PD Dr. Markus Abel

Earth system modeling is virtually impossible without dedicated data analysis. Typically, the data are big and, owing to the complexity of the system, adequate analysis tools lie in the domain of machine learning or artificial intelligence. However, Earth system specialists have expertise other than developing and deploying state-of-the-art code, which is needed to use modern software frameworks and computing resources efficiently. In addition, cloud and HPC infrastructure is frequently needed to run analyses on data of Tera- or even Petascale volume, with corresponding requirements on available RAM, GPU and CPU capacity.

Inside the KI:STE project (www.kiste-project.de), we extend the concepts of an existing project, the Mantik platform (www.mantik.ai), so that the handling of data and algorithms is facilitated for Earth system analyses, while technical challenges such as the scheduling and monitoring of training jobs and platform-specific configurations are abstracted away from the user.

The design principles are collaboration and reproducibility of algorithms, from the first data load to the deployment of a model on a cluster infrastructure. In addition to the executive part, where code is developed and deployed, the KI:STE project develops a learning platform where dedicated topics related to Earth system science are presented systematically and pedagogically.

In this presentation, we show the architecture and interfaces of the KI:STE platform together with a simple example.

How to cite: Seidler, T., Schultz, N., Quade, Dr. M., Autermann, C., Gräler, Dr. B., and Abel, P. Dr. M.: Easing and promoting the application of ML and AI in earth system sciences - introducing the KI:STE platform, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-9632, https://doi.org/10.5194/egusphere-egu21-9632, 2021.

EGU21-12467 | vPICO presentations | ESSI3.7

Swissforages: the Free and Open-Source Borehole Data Management System

Milan Antonovic, Massimiliano Cannata, Nils Oesterling, and Sabine Brodhag

Most of the time, borehole data, particularly those collected in the past, take the form of static data reports that describe the stratigraphy and the related characteristics; these are generally available as paper documents or static files such as PDF (.pdf) or image (.ai) files. While very informative, these documents are not searchable, not interoperable and not easily reusable, since they require a non-negligible amount of time for data integration. Sometimes data are archived in databases. This certainly improves the findability and accessibility of the data, but it still does not address the interoperability requirement, so combining data from different sources remains a problematic task. To enable FAIR borehole data and to facilitate their management by different entities (public or private), Swisstopo (www.swisstopo.ch) has funded the development of a web application named Borehole Data Management System (BDMS) [1] that adopts the borehole data model [2] implemented by the Swiss Geological Survey. Since the first beta release (2019), several improvements have been implemented, leading to the latest official release (v1.0.2), available on www.swissforages.ch. The latest release's features include:

  • Borehole document storage
  • Interface customization
  • Improved access & authorization management
  • External WMS/WMTS background map support
  • User feedback form
  • Handling of personalized and versioned terms of service
  • Enhanced bulk data import
  • Minor enhancements and bug fixes

 

How to cite: Antonovic, M., Cannata, M., Oesterling, N., and Brodhag, S.: Swissforages: the Free and Open-Source Borehole Data Management System, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-12467, https://doi.org/10.5194/egusphere-egu21-12467, 2021.

EGU21-4031 | vPICO presentations | ESSI3.7

New open source tools for MARSIS: providing access to SEG-Y data format for 3D analysis.

Giacomo Nodjoumi, Luca Guallini, Roberto Orosei, Luca Penasa, and Angelo Pio Rossi

The objective of this work is to present a new Free and Open-Source Software (FOSS) tool that reads data acquired by the Mars Advanced Radar for Subsurface and Ionosphere Sounding (MARSIS) instrument, on board Mars Express (MEX) and orbiting Mars since 2005, and converts them to multiple data formats.

MARSIS is an orbital synthetic aperture radar sounder that operates at dual frequencies between 1.3 and 5.5 MHz, with wavelengths between 230 and 55 m, for subsurface sounding. The Experiment Data Record (EDR) and Reduced Data Record (RDR) datasets are available for download on public-access platforms such as the Planetary Science Archive of ESA and the PDS-NASA Orbital Data Explorer (ODE).

These datasets have been widely used in research focused on studying the subsurface of the Red Planet down to a depth of a few kilometres, especially on studying the ice caps and looking for subsurface ice and water deposits, producing relevant results (Lauro et al., 2020; Orosei et al., 2020).

The Python tool presented here is capable of reading the common data types used to distribute the MARSIS dataset and of converting them into multiple data formats. Users can interactively configure the data source, destination, pre-processing and type of outputs among:

  • GeoPackage: for GIS software; a single self-contained file containing a layer in which all parameters for each processed file are stored.
  • NumPy array dump: for fast reading and analysis of the original data for both frequencies.
  • PNG images: for fast inspection; created for each frequency and saved. Image pre-processing filters, such as denoising, standardization and normalization, can be selected by the user.
  • SEG-Y: for analysing data with seismic interpretation and processing software (see e.g. OpendTect); consists of one SEG-Y file per frequency.

The SEG-Y capability is the most relevant feature, since it is not present in any other FOSS tool; it gives researchers the possibility to visualize radargrams in advanced software specific to seismic interpretation and analysis, making it possible to interpret the data in a fully three-dimensional environment.
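
For illustration, a radargram held as a NumPy array can be written as an unstructured SEG-Y file with the open-source segyio library, as sketched below; this is a minimal stand-in under assumed dimensions, not the tool's actual implementation:

```python
import numpy as np
import segyio

# Placeholder radargram: 100 traces x 512 samples for one frequency band.
radargram = np.zeros((100, 512), dtype=np.float32)

spec = segyio.spec()
spec.format = 5                               # 4-byte IEEE float
spec.samples = list(range(radargram.shape[1]))
spec.tracecount = radargram.shape[0]          # unstructured file: traces only

with segyio.create("marsis_frequency1.sgy", spec) as f:
    for i, trace in enumerate(radargram):
        f.trace[i] = trace                    # one radar trace per SEG-Y trace

# The resulting file can then be loaded into seismic interpretation
# software such as OpendTect for three-dimensional analysis.
```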

This tool, available on Zenodo (Nodjoumi, 2021), has been developed entirely in Python 3, relying only on open-source libraries; it is compatible with the principal operating systems and has parallel processing capabilities, granting easy scalability and usability across a wide range of computing machines. It is also highly customizable, since it can be extended with additional processing steps before export or with new output types. An additional module to ingest data directly into PostgreSQL/PostGIS and a module to interact directly with the ACT-REACT interface of data platforms are under development.

Acknowledgments:

This study is within the Europlanet 2024 RI, and it has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871149. 

References:

Lauro, S. E. et al. (2020) ‘Multiple subglacial water bodies below the south pole of Mars unveiled by new MARSIS data’, doi: 10.1038/s41550-020-1200-6.

Nodjoumi, G. (2021) 'MARSIS-xDR-READER', doi: 10.5281/zenodo.4436199

Orosei, R. et al. (2020) ‘The global search for liquid water on mars from orbit: Current and future perspectives’, doi: 10.3390/life10080120.

How to cite: Nodjoumi, G., Guallini, L., Orosei, R., Penasa, L., and Rossi, A. P.: New open source tools for MARSIS: providing access to SEG-Y data format for 3D analysis., EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-4031, https://doi.org/10.5194/egusphere-egu21-4031, 2021.

EGU21-2053 | vPICO presentations | ESSI3.7

Development of an interactive Cloud-based seismic network modelling application on a common Geophysical Processing Toolkit platform

Pavel Golodoniuc, Januka Attanayake, Abraham Jones, and Samuel Bradley

Detecting and locating earthquakes relies on seismic events being recorded by a number of deployed seismometers. To detect earthquakes effectively and accurately, seismologists must design and install a network of seismometers that can capture small seismic events in the sub-surface.

A major challenge when deploying an array of seismometers (seismic array) is predicting the smallest earthquake that could be detected and located by that network. Varying the spacing and number of seismometers dramatically affects network sensitivity and location precision and is very important when researchers are investigating small-magnitude local earthquakes. For cost reasons, it is important to optimise network design before deploying seismometers in the field. In doing so, seismologists must accurately account for parameters such as station locations, site-specific noise levels, earthquake source parameters, seismic velocity and attenuation in the wave propagation medium, signal-to-noise ratios, and the minimum number of stations required to compute high-quality locations.

The AuScope AVRE Engage Program team has worked with researchers from the seismology team at the University of Melbourne to better understand their existing solution for optimising seismic array design: an analytical method called SENSI, developed by Tramelli et al. (2013) to design seismic networks, including the GipNet array deployed to monitor seismicity in the Gippsland region in Victoria, Australia. The underlying physics and mechanics of the method are straightforward and, when applied sensibly, can be used as a basis for the design of seismic networks anywhere in the world. Our engineers have built an application leveraging the previously developed Geophysical Processing Toolkit (GPT) as an application platform and harnessing the scalability of a cloud environment provided by the EASI Hub, which minimised the overall development time. The GPT application platform provided the groundwork for a web-based application interface and enabled interactive visualisations to facilitate human-computer interaction and experimentation.
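
The flavour of such a network-sensitivity calculation can be sketched as follows; this is a simplified stand-in for the SENSI formulation, using the IASPEI/Hutton & Boore (1987) local magnitude relation as an assumed amplitude model and invented station coordinates and noise levels:

```python
import numpy as np

# Station coordinates (km, local grid) and noise amplitudes (nm of ground
# displacement) -- all invented for the example.
stations = np.array([[0.0, 10.0], [8.0, -3.0], [-6.0, -7.0], [12.0, 6.0]])
noise_nm = np.array([2.0, 5.0, 1.0, 3.0])
SNR = 3.0            # required signal-to-noise ratio for a detection
MIN_STATIONS = 3     # stations needed to locate an event

def detectable_magnitude(epicentre, depth_km=5.0):
    """Smallest ML detectable by at least MIN_STATIONS stations."""
    r = np.hypot(np.linalg.norm(stations - epicentre, axis=1), depth_km)
    # Invert ML = log10(A) + 1.11 log10(r) + 0.00189 r - 2.09 (A in nm)
    # for the amplitude each station must see, A = SNR * noise:
    ml = np.log10(SNR * noise_nm) + 1.11 * np.log10(r) + 0.00189 * r - 2.09
    return np.sort(ml)[MIN_STATIONS - 1]

# Mapping this over a grid of candidate epicentres yields a sensitivity
# map that responds to station spacing and site noise, which is what
# network-design optimisation exploits.
print(round(detectable_magnitude(np.array([2.0, 2.0])), 2))
```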

How to cite: Golodoniuc, P., Attanayake, J., Jones, A., and Bradley, S.: Development of an interactive Cloud-based seismic network modelling application on a common Geophysical Processing Toolkit platform, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-2053, https://doi.org/10.5194/egusphere-egu21-2053, 2021.

EGU21-12152 | vPICO presentations | ESSI3.7

New cloud-based tool to Dynamically Manage Urban Resilience: the Fresnel Platform for Greater Paris

Guillaume Drouen, Daniel Schertzer, and Ioulia Tchiguirinskaia

As cities are put under greater pressure from the threatened impacts of climate change, in particular the risk of heavier rainfall and flooding, there is a growing need to establish a hierarchical form of resilience in which critical infrastructures can become sustainable. The main difficulty is that geophysics and urban dynamics are strongly nonlinear, with an associated extreme variability over a wide range of space-time scales.

The polarimetric X-band radar at the ENPC campus (east of Paris) introduced a paradigm change in the prospects for environmental monitoring in Île-de-France. The radar has been operating since May 2015 and has several characteristics that make it of central importance for the environmental monitoring of the region.

Based on the radar data and other scientific measurement tools, the platform for Greater Paris was developed through participative co-creation, in scientific collaboration with the world-leading industrial company in water management. As data accessibility and a fast, reliable infrastructure were major requirements from the scientific community, the platform was built as a cloud-based solution. It provides weather scientists, as well as water managers, with a fast and stable platform accessible from a web browser on desktop and mobile displays.

Developed using free and open-source libraries, it is rooted in an integrated suite of modular components built on an asynchronous, event-driven JavaScript runtime environment. It includes a comprehensive database accessible in real time and also provides tools to analyse historical data on different temporal and geographic scales around Greater Paris.

The Fresnel SaaS (Software as a Service) cloud-based platform is an example of how modern IT tools can dynamically enhance urban resilience. Development is still in progress, driven by constant requests and feedback from the scientific and professional communities.

How to cite: Drouen, G., Schertzer, D., and Tchiguirinskaia, I.: New cloud-based tool to Dynamically Manage Urban Resilience: the Fresnel Platform for Greater Paris, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-12152, https://doi.org/10.5194/egusphere-egu21-12152, 2021.

EGU21-14441 | vPICO presentations | ESSI3.7

Cloud-based Research Data Infrastructures Integrating In-Situ and Remote Sensing Data

Simon Jirka, Benedikt Gräler, Matthes Rieke, and Christian Autermann

For many scientific domains, such as hydrology, ocean sciences, geophysics and social sciences, geospatial observations are an important source of information. Scientists conduct extensive measurement campaigns or operate comprehensive monitoring networks to collect data that help to understand and model current and past states of complex environments. The variety of data underpinning research stretches from in-situ observations to remote sensing data (e.g., from the European Copernicus programme) and contributes to rapidly increasing volumes of geospatial data.

However, with the growing amount of available data, new challenges arise. Within our contribution, we will focus on two specific aspects: On the one hand, we will discuss the specific challenges which result from the large volumes of remote sensing data that have become available for answering scientific questions. For this purpose, we will share practical experiences with the use of cloud infrastructures such as the German platform CODE-DE and will discuss concepts that enable data processing close to the data stores. On the other hand, we will look into the question of interoperability in order to facilitate the integration and collaborative use of data from different sources. For this aspect, we will give special consideration to the currently emerging new generation of standards of the Open Geospatial Consortium (OGC) and will discuss how specifications such as the OGC API for Processes can help to provide flexible processing capabilities directly within Cloud-based research data infrastructures.
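
To make this concrete: in the emerging OGC API - Processes specification, a processing job is triggered by POSTing a JSON execution request to a process endpoint. The sketch below follows that pattern; the server URL, process identifier and input names are hypothetical placeholders, not a real service.

```python
import requests

# Hypothetical server and process identifier (placeholders, not a real service)
base = "https://example.org/ogcapi"
process_id = "ndvi-computation"

# OGC API - Processes style execution request; input names are illustrative
payload = {"inputs": {"collection": "sentinel-2-l2a",
                      "bbox": [5.5, 52.0, 6.5, 53.0]}}

resp = requests.post(
    f"{base}/processes/{process_id}/execution",
    json=payload,
    headers={"Prefer": "respond-async"},  # request asynchronous execution
)
resp.raise_for_status()
# For asynchronous jobs the server returns a job status resource to poll
print(resp.status_code, resp.headers.get("Location"))
```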

How to cite: Jirka, S., Gräler, B., Rieke, M., and Autermann, C.: Cloud-based Research Data Infrastructures Integrating In-Situ and Remote Sensing Data, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-14441, https://doi.org/10.5194/egusphere-egu21-14441, 2021.

EGU21-15302 | vPICO presentations | ESSI3.7

OCRE: started the funding opportunities for the European research community for using OCRE’s procured Cloud and Earth Observation commercial services.

José Manuel Delgado Blasco, Antonio Romeo, David Heyns, Natassa Antoniou, and Rob Carrillo

The OCRE project, an H2020 project funded by the European Commission, aims to increase the uptake of Cloud and EO services by the European research community by making available EC funds of 9.5M euro, removing barriers to service discovery and providing services free at the point of use.

OCRE began awarding grants to EU research projects for the use of OCRE’s procured cloud commodity and EO services through open calls in 2019-2020. In 2021, additional open calls are foreseen, including for projects wishing to receive funds for using EO services procured by OCRE, as well as a permanent open call for individual researchers.

During 2020, OCRE also funded, through another open call, EU projects conducting COVID-19-related research; these were the first projects to start using the available commodity services. In addition, in 2020 the OCRE project closed its tender and awarded contracts to EU service providers for the provision of cloud and commodity services, and in early 2021 the Dynamic Purchasing System (DPS) for the procurement of EO services will be opened.

Additionally, during 2020 an External Advisory Board (EAB) was created to assist OCRE in the project awarding process. The EAB comprises recognized experts from different domains, providing OCRE with the balanced knowledge needed to ensure transparency and equality in such an important process.

This presentation will provide an overview of the possibilities offered by OCRE to researchers interested in boosting their activities using commercial cloud services.

How to cite: Delgado Blasco, J. M., Romeo, A., Heyns, D., Antoniou, N., and Carrillo, R.: OCRE: started the funding opportunities for the European research community for using OCRE’s procured Cloud and Earth Observation commercial services., EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-15302, https://doi.org/10.5194/egusphere-egu21-15302, 2021.

EGU21-2442 | vPICO presentations | ESSI3.7

Transfer Data from NetCDF on Hierarchical Storage to Zarr on Object Storage: CMIP6 Climate Data Use Case

Marco Kulüke, Fabian Wachsmann, Georg Leander Siemund, Hannes Thiemann, and Stephan Kindermann

This study provides guidance to data providers on how to transfer existing netCDF data from a hierarchical storage system into Zarr on an object storage system.

In recent years, object storage systems have become an alternative to traditional hierarchical file systems because they are easily scalable and offer faster data retrieval compared to hierarchical storage systems.

Earth system sciences, and climate science in particular, handle large amounts of data. These data are usually represented as multi-dimensional arrays and traditionally stored in netCDF format on hierarchical file systems. However, the current netCDF-4 format is not yet optimized for object storage systems: netCDF data transfers from an object store can only be conducted at the file level, which results in heavy download volumes. The Zarr format can mitigate this problem; by allowing direct access to individual chunks and metadata, it reduces data transfers and increases input/output speed in parallel computing environments.

As one of the largest climate data providers worldwide, the German Climate Computing Center (DKRZ) continuously works towards efficient ways to make data accessible to users. This use case shows the conversion and transfer of a subset of the Coupled Model Intercomparison Project Phase 6 (CMIP6) climate data archive from netCDF on the hierarchical file system into Zarr on the OpenStack object store, known as Swift, using the Zarr Python package. Finally, this study evaluates to what extent Zarr-formatted climate data on an object storage system are a meaningful addition to the existing high-performance computing environment of the DKRZ.
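
The conversion step described above can be sketched in a few lines with the xarray and Zarr Python packages. This is a minimal illustration, not the DKRZ pipeline: the file names are placeholders, and writing to Swift would substitute an fsspec object-store mapping for the local path.

```python
import xarray as xr

# Open a (placeholder) CMIP6 netCDF file from the hierarchical file system
ds = xr.open_dataset("tas_Amon_example_historical.nc")

# Re-chunk so Zarr chunks match typical access patterns (requires dask),
# e.g. ten years of monthly fields per chunk
ds = ds.chunk({"time": 120})

# Write a Zarr store; consolidated metadata lets clients fetch all metadata
# in a single request, which matters on object storage
ds.to_zarr("tas_Amon_example_historical.zarr", mode="w", consolidated=True)

# On OpenStack Swift, the local path would be replaced by an object-store
# mapping, e.g. one obtained via fsspec.get_mapper("swift://container/prefix")
```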

How to cite: Kulüke, M., Wachsmann, F., Siemund, G. L., Thiemann, H., and Kindermann, S.: Transfer Data from NetCDF on Hierarchical Storage to Zarr on Object Storage: CMIP6 Climate Data Use Case, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-2442, https://doi.org/10.5194/egusphere-egu21-2442, 2021.

EGU21-3205 | vPICO presentations | ESSI3.7

SWIRRL API for provenance-aware and reproducible workspaces. The EPOS and IS-ENES approach.

Alessandro Spinuso, Friedrich Striewski, Ian van der Neut, Mats Veldhuizen, Tor Langeland, Christian Page, and Daniele Bailo

Modern interactive tools for data analysis and visualisation are designed to expose their functionalities as a service through the web. We present an open-source web API (SWIRRL) that allows Science Gateways to easily integrate such tools in their websites and re-purpose them for their users. The API, developed in the context of the ENVRIFair and IS-ENES3 EU projects, deals on behalf of the clients with the underlying complexity of allocating and managing resources within a target container orchestration platform on the cloud. By combining storage and third-party tools, such as JupyterLab and the Enlighten visualisation software, the API creates dedicated working sessions on demand. Thanks to the API’s staging workflows, SWIRRL sessions can be populated with data of interest collected from external data providers. The system is designed to offer customisation and reproducibility thanks to the recording of provenance, which is performed for each API method affecting the session. This is implemented by combining a PROV-Templates catalogue and a graph database, which are deployed as independent microservices. Notebooks can be customised with new or updated libraries, and the provenance of such changes is then exposed to users via the SWIRRL interactive JupyterLab extension. Here, users can control different types of reproducibility actions. For instance, they can restore the libraries and data used within the notebook in the past, as well as create snapshots of the running environment. This allows users to share and rebuild full Jupyter workspaces, including raw data and user-generated methods. Snapshots are stored in Git as Binder repositories, and are thereby compatible with mybinder.org. Finally, we will discuss how SWIRRL is and will be adopted by existing portals for climate analysis (Climate4Impact) and for Solid Earth Science (EPOS), where advanced data discovery capabilities are combined with customisable, recoverable and reproducible workspaces.

How to cite: Spinuso, A., Striewski, F., van der Neut, I., Veldhuizen, M., Langeland, T., Page, C., and Bailo, D.: SWIRRL API for provenance-aware and reproducible workspaces. The EPOS and IS-ENES approach., EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-3205, https://doi.org/10.5194/egusphere-egu21-3205, 2021.

EGU21-5489 | vPICO presentations | ESSI3.7

ESM-Tools Version 5.0: A modular infrastructure for stand-alone and coupled Earth System Modelling (ESM)

Dirk Barbi, Miguel Andrés-Martínez, Deniz Ural, Luisa Cristini, Paul Gierz, and Nadine Wieters

During the last two decades, modern societies have gradually understood the urge to tackle the climate change challenge, and consequently a growing number of national and international initiatives have been launched with the aim of better understanding the Earth system. In this context, Earth System Modelling (ESM) has rapidly expanded, leading to a large number of research groups targeting the many components of the system at different scales and with different levels of interaction between components. This has led to the development of an increasing number of models, couplings, versions tuned to address different scales or scenarios, and model-specific compilation and operating procedures. This operational complexity makes the implementation of multiple models excessively time-consuming, especially for less experienced modellers.

ESM-Tools is open-source modular software written in Python, aimed at overcoming many of the difficulties associated with the operation of ESMs. ESM-Tools allows for downloading, compiling and running a wide range of ESM models and coupled setups on the most important HPC facilities available in Germany. It currently supports multiple models for ocean, atmosphere, biogeochemistry, ice sheets, isostatic adjustment, hydrology, and land surface, as well as six ocean-atmosphere and two ice-sheet-ocean-atmosphere coupled setups, through two couplers (included modularly through ESM-Interface). The tools are coded in Python, while all the component and coupling information is contained in easy-to-read YAML files. The front-end user is required to provide only a short script written in YAML format, containing the experiment-specific definitions. This user-friendly interface makes ESM-Tools a convenient software package for training and educational purposes. Simultaneously, its modularity and the separation between component-specific information and tool scripts facilitate the implementation and maintenance of new components, couplings and versions. The ESM-Tools team of scientific programmers also provides user support, workshops and detailed documentation. ESM-Tools was developed within the framework of the project Advanced Earth System Model Capacity, supported by the Helmholtz Association, and has become one of the main pillars of the German infrastructure for climate modelling.

How to cite: Barbi, D., Andrés-Martínez, M., Ural, D., Cristini, L., Gierz, P., and Wieters, N.: ESM-Tools Version 5.0: A modular infrastructure for stand-alone and coupled Earth System Modelling (ESM), EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-5489, https://doi.org/10.5194/egusphere-egu21-5489, 2021.

EGU21-10831 | vPICO presentations | ESSI3.7

Coupled earth system modeling on heterogeneous HPC architectures with ParFlow in the Terrestrial Systems Modeling Platform

Jaro Hokkanen, Stefan Kollet, Jiri Kraus, Andreas Herten, Markus Hrywniak, and Dirk Pleiter

Rapidly changing heterogeneous supercomputer architectures pose a great challenge to many scientific communities trying to leverage the latest technology in high-performance computing. Implementations that simultaneously result in a good performance and developer productivity while keeping the codebase adaptable and well maintainable in the long-term are of high importance. ParFlow, a widely used hydrologic model, achieves these attributes by hiding the architecture-dependent code in preprocessor macros (ParFlow embedded Domain Specific Language, eDSL) and leveraging NVIDIA's Unified Memory technology for memory management. The implementation results in very good weak scaling with up to 26x speedup when using four NVIDIA A100 GPUs per node compared to using the available 48 CPU cores. Good weak scaling is observed using hundreds of nodes on the new JUWELS Booster system at the Jülich Supercomputing Centre, Germany. Furthermore, it is possible to couple ParFlow with other earth system compartment models such as land surface and atmospheric models using the OASIS-MCT coupler library, which handles the data exchange between the different models. The ParFlow GPU implementation is fully compatible with the coupled implementation with little changes to the source code. Moreover, coupled simulations offer interesting load-balancing opportunities for optimal usage of the existing resources. For example, running ParFlow on GPU nodes, and another application component on CPU-only nodes, or efficiently distributing the CPU and GPU resources of a single node between the different application components may result in the best usage of heterogeneous architectures.

How to cite: Hokkanen, J., Kollet, S., Kraus, J., Herten, A., Hrywniak, M., and Pleiter, D.: Coupled earth system modeling on heterogeneous HPC architectures with ParFlow in the Terrestrial Systems Modeling Platform, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-10831, https://doi.org/10.5194/egusphere-egu21-10831, 2021.

EGU21-12209 | vPICO presentations | ESSI3.7

Accelerating I/O in ESMs using on demand filesystems

Stefan Versick, Thomas Fischer, Ole Kirner, Tobias Meisel, and Jörg Meyer

Earth System Models (ESMs) have become much more demanding over the last years. Modelled processes have grown more complex, and more and more processes are considered in the models. In addition, model resolutions have increased to improve the accuracy of predictions. This requires faster high-performance computers (HPC) and better I/O performance. One way to improve I/O performance is to use faster file systems. Last year we showed the impact of the ad-hoc file system on the performance of the ESM EMAC. An ad-hoc file system is a private parallel file system which is created on demand for an HPC job using the node-local storage devices, in our case solid-state disks (SSDs). It only exists during the runtime of the job; therefore, output data have to be moved to a permanent file system before the job has finished. Performance improvements are due to the use of SSDs in the case of small chunks of I/O or a high number of I/O operations per second. Another reason for a performance boost is that the running job can exclusively access the file system. To get a better overview of the cases in which ESMs benefit from ad-hoc file systems, we repeated our performance tests with further ESMs with different I/O strategies. In total, we have now analyzed EMAC (parallel netCDF), ICON2.5 (netCDF with asynchronous I/O), ICON2.6 (netCDF with the Climate Data Interface (CDI) library) and OpenGeoSys (parallel VTU).

How to cite: Versick, S., Fischer, T., Kirner, O., Meisel, T., and Meyer, J.: Accelerating I/O in ESMs using on demand filesystems, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-12209, https://doi.org/10.5194/egusphere-egu21-12209, 2021.

EGU21-14517 | vPICO presentations | ESSI3.7

Making Cyclone Tracking accessible to end users for Climate Research and Applications

Christian Pagé, Maarten Plieger, Wim Som de Cerff, Alessandro Spinuso, Rosa Filgueira, Malcolm Atkinson, Chrysoula Themeli, Iraklis Klampanos, and Vangelis Karkaletsis

It is becoming urgent to anticipate and put in place climate impact and adaptation measures. During the past years, climate change effects have been producing adverse conditions in many parts of the world, with significant societal and financial impacts. Advanced analysis tools are needed to process ensembles of simulations of the future climate, in order to generate useful and tailored products for end users.

An example of a complex analysis tool used in climate research and adaptation studies is one that follows storm tracks. In the context of climate change, it is important to know how storm tracks will change in the future, in both their frequency and intensity. Storms can cause significant societal impacts; hence it is important to assess future patterns. Having access to this type of complex analysis tool is very useful, and integrating such tools with front-ends like the IS-ENES climate4impact (C4I) platform can enable their use by a larger number of researchers and end users.

Integrating this type of complex tool is not an easy task. It requires significant development effort, especially if one of the objectives is also to adhere to FAIR principles. The DARE Platform enables research developers to implement scientific workflows more rapidly. This work presents how such a complex analysis tool has been implemented so that it can be easily integrated with the C4I platform. The DARE Platform also provides easy access to e-infrastructure services like EUDAT B2DROP to store intermediate or final results, as well as powerful provenance-driven tools to help researchers manage their work and data.

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreements N°824084 and N°777413.

How to cite: Pagé, C., Plieger, M., Som de Cerff, W., Spinuso, A., Filgueira, R., Atkinson, M., Themeli, C., Klampanos, I., and Karkaletsis, V.: Making Cyclone Tracking accessible to end users for Climate Research and Applications, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-14517, https://doi.org/10.5194/egusphere-egu21-14517, 2021.

EGU21-16185 | vPICO presentations | ESSI3.7

On combining GUI desktop GIS with computer clusters & cloud resources, the role of programming skills and the state of the art in GUI driven GIS HPC applications

S. M. Ernst

The Free and Open Source Software (FOSS) ecosystem around Geographic Information Systems (GIS) is currently seeing rapid growth, similar to FOSS ecosystems in other scientific disciplines. At the same time, the need for broad programming and software development skills appears to be becoming a common theme for potential (scientific) users. There is a rather clear boundary between what can be done with graphical user interface applications such as QGIS alone on the one hand, and contemporary software libraries on the other, if one actually has the required skillset to use the latter. Practical experience shows that more and more types of research require far more than rudimentary software development skills. Those can be hard to acquire and distract from the actual scientific work at hand. For instance, the installation, integration and deployment of much-desired software libraries from the field of high-performance computing (HPC), e.g. for general-purpose computing on graphics processing units (GPGPU) or computations on clusters or cloud resources, very often becomes an obstacle in its own right. Recent advances in packaging and deployment systems around popular programming language ecosystems such as Python enable a new kind of thinking, however. Desktop GUI applications can now be combined much more easily with the mentioned types of libraries, which drastically lowers the entry barrier to HPC applications and to the handling of large quantities of data. This work aims at providing an overview of the state of the art in this field and at showcasing possible techniques.
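
As a small illustration of what such a combination enables, the sketch below shows how a Python-based desktop GIS plugin could dispatch a batch of heavy computations to a cluster via the dask.distributed library; the scheduler address and the per-tile function are placeholders.

```python
from dask.distributed import Client

def classify_tile(tile_id: int) -> tuple:
    """Placeholder for a heavy per-tile computation, e.g. raster classification."""
    return tile_id, "done"

# Connect to a remote scheduler; a GUI plugin could do this behind a button.
# The address is a placeholder; Client() without arguments starts a local cluster.
client = Client("tcp://scheduler.example.org:8786")

# Fan a batch of tiles out across the cluster and gather the results
futures = client.map(classify_tile, range(100))
results = client.gather(futures)
print(len(results), "tiles processed")
```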

How to cite: Ernst, S. M.: On combining GUI desktop GIS with computer clusters & cloud resources, the role of programming skills and the state of the art in GUI driven GIS HPC applications, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-16185, https://doi.org/10.5194/egusphere-egu21-16185, 2021.

EGU21-4895 | vPICO presentations | ESSI3.7

era5cli: the command line interface to ERA5 data

Stef Smeets, Jaro Camphuijsen, Niels Drost, Fakhereh Alidoost, Bouwe Andela, Berend Weel, Peter Kalverla, Ronald van Haren, Klaus Zimmermann, Jerom Aerts, and Rolf Hut

With the release of the ERA5 dataset, worldwide high-resolution reanalysis data became available with open access for public use. The Copernicus CDS (Climate Data Store) offers two options for accessing the data: a web interface and a Python API. Consequently, automated downloading of the data requires advanced knowledge of Python and a lot of work. To make this process easier, we developed era5cli.

The command line interface tool era5cli enables automated downloading of ERA5 data using a single command. All variables and options available in the CDS web form can be downloaded in an efficient way, and both the monthly and hourly datasets are supported. Besides automation, era5cli adds several useful functionalities to the download pipeline.

One of the key options in era5cli is to spread a single download command over multiple CDS requests, resulting in higher download speeds. Files can be saved in both GRIB and netCDF format with automatic, yet customizable, file names. The info command lists the correct names of the available variables and the pressure levels for 3D variables. For debugging and testing purposes, the dryrun option can be selected to return only the CDS request. An overview of all available options, including instructions on how to configure your CDS account, is available in our documentation. Recent developments include support for the ERA5 back extension and ERA5-Land. The source code for era5cli is available at https://github.com/eWaterCycle/era5cli.
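
For illustration, the calls below show the usage pattern just described, wrapped in Python's subprocess so they can be scripted. The flag spellings follow the era5cli documentation at the time of writing and should be treated as indicative; era5cli --help is authoritative for the installed version.

```python
import subprocess

# List the pressure-level names known to era5cli (per the info command)
subprocess.run(["era5cli", "info", "levels"], check=True)

# Hourly 500 hPa temperature for one year; --dryrun prints the CDS request
# instead of downloading (flags indicative; check era5cli --help)
subprocess.run(
    ["era5cli", "hourly",
     "--variables", "temperature",
     "--startyear", "2008", "--endyear", "2008",
     "--levels", "500",
     "--dryrun"],
    check=True,
)
```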

How to cite: Smeets, S., Camphuijsen, J., Drost, N., Alidoost, F., Andela, B., Weel, B., Kalverla, P., van Haren, R., Zimmermann, K., Aerts, J., and Hut, R.: era5cli: the command line interface to ERA5 data, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-4895, https://doi.org/10.5194/egusphere-egu21-4895, 2021.

EGU21-16484 | vPICO presentations | ESSI3.7

ECMWF's data archive and dissemination services migration to the Bologna Data Center. 

Sébastien Denvil, Manuel Fuentes, Matthew Manoussakis, Sebastien Villaume, Tiago Quintino, Simon Smart, and Baudouin Raoult

ECMWF is the European Centre for Medium-Range Weather Forecasts. We are both a research institute and a 24/7 operational service, producing global numerical weather predictions and other data for our Member and Co-operating States and the broader community. The Centre has one of the largest supercomputer facilities and meteorological data archives in the world.
 
ECMWF is about to migrate its 400+ PB of data to its new data centre in Bologna while continuing its operations. We will present and discuss the challenges and opportunities that this migration offers in terms of the evolution of operational practices.
The planning, the evolution, and the transition periods of the ECMWF Data Handling System migration to Bologna will be presented.
 
The migration must occur while preserving ECMWF’s product generation and archive services, ensuring appropriate levels of quality of service. The planning and testing of a continuity plan of operations for operational forecasts, member states’ time-critical suites, Copernicus suites (ERA5, CAMS, C3S seasonal and the like), and research suites will be presented. This continuity plan relies on the full identification and traceability of the data flows involved during critical operations. Indeed, it is not economically viable to keep the 400 PB online during the entire migration period.
 
A completely redesigned data services deployment and testing mechanism will be used in the Bologna Data Center. Automation will be paramount in this context, as all services need to be redeployed entirely from scratch. This journey will be presented, and challenges inherent to software-defined infrastructure and services will be discussed.

How to cite: Denvil, S., Fuentes, M., Manoussakis, M., Villaume, S., Quintino, T., Smart, S., and Raoult, B.: ECMWF's data archive and dissemination services migration to the Bologna Data Center. , EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-16484, https://doi.org/10.5194/egusphere-egu21-16484, 2021.

ESSI3.9 – Managing Geochemical Data from Field to Lab to Publication to Archive

EGU21-16420 | vPICO presentations | ESSI3.9

Managing Open and FAIR Data in Geochemistry: Where are we a decade after the Editors Roundtable?

Steven L Goldstein, Kerstin Lehnert, and Albrecht W Hofmann

The ultimate goal of research data management is to achieve the long-term utility and impact of data acquired by research projects. Proper data management ensures that all researchers can validate and replicate findings, and reuse data in the quest for new discoveries. Research data need to be open, consistently and comprehensively documented for meaningful evaluation and reuse following domain-specific guidelines, and available for reuse via public data repositories that make them Findable, persistently Accessible, Interoperable, and Reusable (FAIR).

In the early 2000’s, the development of geochemical databases such as GEOROC and PetDB underscored that the reporting and documenting practices of geochemical data in the scientific literature were inconsistent and incomplete. The original data could often not be recovered from the publications, and essential information about samples, analytical procedures, data reduction, and data uncertainties was missing, thus limiting meaningful reuse of the data and reproducibility of the scientific findings. To prevent such poor scientific practice from damaging the health of the entire discipline, we launched the Editors Roundtable in 2007, an initiative to bring together editors, publishers, and database providers to implement consistent publication practices for geochemical data. Recognizing that mainstream scientific journals were the most effective agents to rectify problems in data reporting and implement best practices, members of the Editors Roundtable created and signed a policy statement that laid out ‘Requirements for the Publication of Geochemical Data’ (Goldstein et al. 2014, http://dx.doi.org/10.1594/IEDA/100426). This presentation will examine the impact of this initial policy statement, assess the current status of best practices for geochemical data management, and explore what actions are still needed.

While the Editors Roundtable policy statement led to improved data reporting practices in some journals, and provided the basis for data submission policies and guidelines of the EarthChem Library (ECL), data reporting practices overall remained inconsistent and inadequate. Only with the formation of the Coalition for Publishing Data in the Earth and Space Sciences (COPDESS, www.copdess.org), which extended the Editors Roundtable to include publishers and data facilities across the entire Earth and Space Sciences, along with the subsequent AGU project ‘Enabling FAIR Data’, has the implementation of new requirements by publishers, funders, and data repositories progressed and led to significant compliance with the FAIR Data Principles. Submission of geochemical data to open and FAIR repositories has increased substantially. Nevertheless, standard guidelines for documenting geochemical data and standard protocols for exchanging geochemical data among distributed data systems still need to be defined, and structures to govern such standards need to be identified by the global geochemistry community. Professional societies such as the Geochemical Society, the European Association of Geochemistry, and the International Association of GeoChemistry can and should take a leading role in this process.

How to cite: Goldstein, S. L., Lehnert, K., and Hofmann, A. W.: Managing Open and FAIR Data in Geochemistry: Where are we a decade after the Editors Roundtable?, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-16420, https://doi.org/10.5194/egusphere-egu21-16420, 2021.

EGU21-10569 | vPICO presentations | ESSI3.9

SciDataMover: Moving Geochemistry Data from the Lab through to Publication

Ryan Fraser, Samuel Boone, Alexander Prent, Jens Klump, and Guido Aben

The SciDataMover platform is a discipline- and scale-agnostic, lightweight, open-source Data Movement Platform that transfers data, coupled with metadata, from laboratories to shared workspaces and then to repositories. The SciDataMover Platform leverages lightweight existing technologies that have a demonstrated capacity to be sustainably managed and can be affordably maintained.

Despite significant investments in analytical instruments in Australian research laboratories relevant to earth sciences, and particularly geochemistry, there has been underinvestment in storage and in the efficient, lossless transfer of data from ‘Private’ lab instruments to ‘Collaboration’ domains, where researchers can analyse and share data, and then on to trusted ‘Publication’ domains, where researchers can persistently store the data that supports their scholarly publications.

SciDataMover is a FAIR data movement platform that enables data from instruments to move in a scalable and sustainable manner and comprises:

1) a data service to transfer data/metadata directly from instruments
2) collaboration areas to process, refine, standardise and share this data
3) a mechanism to transfer data supporting publications to a trusted repository (e.g., domain, institutional).

The Platform, being built from existing components, will enable researchers to have ready access to laboratory data when and where they need it, along with the ability to collaborate with colleagues even during a pandemic where physical distancing is required. The benefits of SciDataMover are long-term persistence of laboratory-generated data (at various stages, from minimally processed to final published form), greater collaboration efficiency and enhanced scientific reproducibility.

How to cite: Fraser, R., Boone, S., Prent, A., Klump, J., and Aben, G.: SciDataMover: Moving Geochemistry Data from the Lab through to Publication, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-10569, https://doi.org/10.5194/egusphere-egu21-10569, 2021.

EGU21-13832 | vPICO presentations | ESSI3.9

Implementing the Sparrow laboratory data system in multiple subdomains of geochronology and geochemistry

Daven Quinn, Benjamin Linzmeier, Kurt Sundell, George Gehrels, Simon Goring, Shaun Marcott, Stephen Meyers, Shanan Peters, Jake Ross, Mark Schmitz, Bradley Singer, and John Williams

Data sharing between laboratories is critical for building repeatable, comparable, and robust geochronology and geochemistry workflows. Meanwhile, in the broader geosciences, there is an increasing need for standardized access to aggregated geochemical data tied to basic geological context. Such data can be used to enrich sample and geochemical data repositories (e.g., EarthChem, Geochron.org, publisher archives), align geochemical context with other datasets that capture global change (e.g., Neotoma, the Paleobiology Database), and calibrate digital Earth models (e.g., Macrostrat) against geochronology-driven assessments of geologic time.

A typical geochemical lab manages a large archive of interpreted data; standardizing and contributing data products to community-level archives entails significant manual work that is not usually undertaken. Furthermore, without widely accepted interchange formats, this effort must be repeated for each intended destination.

Sparrow (https://sparrow-data.org), in development by a consortium of geochronology labs, is a standardized system designed to support labs’ efforts to manage, contextualize, and share their geochemical data. The system augments existing analytical workflows with tools to manage metadata (e.g., projects, sample context, embargo status) and software interfaces for automated data exchange with community facilities. It is extensible for a wide variety of geochemical methods and analytical processes.

In this update, we will report on the implementation of Sparrow in the Arizona Laserchron Center detrital zircon facility, and how that lab is using the system to capture geological context across its data archive. We will review similar integrations underway with U-Pb, 40Ar/39Ar, SIMS, optically stimulated luminescence, thermochronology, and cosmogenic nuclide dating. We will also discuss preliminary efforts to aggregate the output of multiple chronometers to refine age calibrations for the Macrostrat stratigraphic model.

How to cite: Quinn, D., Linzmeier, B., Sundell, K., Gehrels, G., Goring, S., Marcott, S., Meyers, S., Peters, S., Ross, J., Schmitz, M., Singer, B., and Williams, J.: Implementing the Sparrow laboratory data system in multiple subdomains of geochronology and geochemistry, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-13832, https://doi.org/10.5194/egusphere-egu21-13832, 2021.

EGU21-3595 | vPICO presentations | ESSI3.9

Advancing Data Curation and Archiving: an Application of Coding to Lab Management in the Geosciences

Tierney Latham, Catherine Beck, Bruce Wegter, and Ahra Wu

Increases in technology have rapidly advanced the capabilities and ubiquity of scientific instrumentation. Coupled with the demand for increased transparency and reproducibility in science, these advances have necessitated new systems of data management and archival practices. Laboratories are working to update their methods of data curation in line with these evolving best practices, moving data from often disorderly private domains to publicly available, collaborative platforms. At the Hamilton Isotope Laboratory (HIL) of Hamilton College, the isotope ratio mass spectrometer (IRMS) is utilized across STEM disciplines for a combination of student, faculty, and course-related research, including both internal and external users. With over 200 sets of analytical runs processed in the past five years, documenting instrument usage and archiving the data produced is crucial to maintaining a state-of-the-art facility. However, prior to this project, the HIL faced significant barriers to proper data curation, storage, and accessibility: a) data files were produced with variable format and nomenclature; b) data files were difficult to interpret without explanation from the lab technician; c) key metadata tying results to respective researchers and projects were missing; d) access to data was limited due to storage on an individual computer; and e) data curation was an intellectual responsibility and burden for the lab technician. Additionally, as the HIL is housed within an undergraduate institution, the high rate of turnover for lab groups created additional barriers to the preservation of long-term institutional knowledge, as students worked with the HIL for a year or less. These factors necessitated the establishment of new data management practices to ensure the accessibility and longevity of scientific data and metadata. In this project, 283 Excel files of previously recorded data generated by the HIL IRMS were modified and cleaned to prepare the data for submission to EarthChem, a public repository for geochemical data. Existing Excel files were manually manipulated, several original R code scripts were generated and employed, and procedures were established to backtrace projects and collect key metadata. Most critically, a new internal system of data collection was established with standardized nomenclature and framework. For future usage of the IRMS, data will be exported directly into a template compatible with EarthChem, thereby removing barriers for principal investigators (PIs) and research groups to archive their data in the public domain upon completion of their projects and publications.
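
The cleaning scripts in this project were written in R; purely as an illustration of the kind of batch standardization described, a Python/pandas sketch might look as follows, where the folder layout, column mapping and output template are hypothetical.

```python
from pathlib import Path
import pandas as pd

# Hypothetical mapping from inconsistent historical column names to one
# standardized nomenclature (illustrative, not the actual HIL scheme)
RENAME = {
    "d13C": "delta13C_permil",
    "δ13C": "delta13C_permil",
    "d15N": "delta15N_permil",
    "Sample": "sample_name",
}

frames = []
for path in sorted(Path("irms_runs").glob("*.xlsx")):  # hypothetical folder
    df = pd.read_excel(path)
    df = df.rename(columns=RENAME)
    df["source_file"] = path.name  # keep a metadata trail back to the run
    frames.append(df)

# Concatenate all runs and export in a single repository-ready template
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("hil_irms_archive.csv", index=False)
```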

How to cite: Latham, T., Beck, C., Wegter, B., and Wu, A.: Advancing Data Curation and Archiving: an Application of Coding to Lab Management in the Geosciences, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-3595, https://doi.org/10.5194/egusphere-egu21-3595, 2021.

EGU21-6147 | vPICO presentations | ESSI3.9

Petrological microscopy data workflow – an example from Cap de Creus, NE Spain

Richard Wessels, Thijmen Kok, Hans van Melick, and Martyn Drury

Publishing research data in a Findable, Accessible, Interoperable, and Reusable (FAIR) manner is increasingly valued and nowadays often required by publishers and funders. Because experimental research data provide the backbone for scientific publications, it is important to publish this data as FAIRly as possible to enable reuse and citation of the data, thereby increasing the impact of research.

The structural geology group at Utrecht University is collaborating with the EarthCube-funded StraboSpot initiative to develop (meta)data schemas, templates and workflows, to support researchers in collecting and publishing petrological and microstructural data. This data will be made available in a FAIR manner through the EPOS (European Plate Observing System) data publication chain (https://epos-msl.uu.nl/).

The data workflow under development currently includes: a) collecting structural field (meta)data compliant with the StraboSpot protocols, b) creating thin sections oriented in three dimensions by applying a notch system (Tikoff et al., 2019), c) scanning and digitizing thin sections using a high-resolution scanner, d) automated mineralogy through EDS on a SEM, and e) high-resolution geochemistry using a microprobe. The purpose of this workflow is to be able to track geochemical and structural measurements and observations throughout the analytical process.

This workflow is applied to samples from the Cap de Creus region in northeast Spain. Located in the axial zone of the Pyrenees, the pre-Cambrian metasediments underwent HT-LP greenschist- to amphibolite-facies metamorphism, are intruded by pegmatitic bodies, and transected by greenschist-facies shear zones. Cap de Creus is a natural laboratory for studying the deformation history of the Pyrenees, and samples from the region are ideal to test and refine the data workflow. In particular, the geochemical data collected under this workflow is used as input for modelling the bulk rock composition using Perple_X.    

In the near future the workflow will be complemented by adding unique identifiers to the collected samples using IGSN (International Geo Sample Number), and by incorporating a StraboSpot-developed application for microscopy-based image correlation. This workflow will be refined and included in the broader correlative microscopy workflow that will be applied in the upcoming EXCITE project, an H2020-funded European collaboration of electron and X-ray microscopy facilities and researchers aimed at structural and chemical imaging of earth materials.

How to cite: Wessels, R., Kok, T., van Melick, H., and Drury, M.: Petrological microscopy data workflow – an example from Cap de Creus, NE Spain, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-6147, https://doi.org/10.5194/egusphere-egu21-6147, 2021.

EGU21-16405 | vPICO presentations | ESSI3.9

Collecting geochemical data of deep formation fluids for Geothermal Fluid Atlas for Europe

Katrin Kieling, Simona Regenspurg, Károly Kovács, Zsombor Fekete, Alberto Sánchez Miravalles, Tamás Madarász, and Éva Hartai

Most problems in deep geothermal operations are related to the chemistry of the geothermal fluid, which can cause deleterious physical and chemical reactions such as degassing, mineral precipitation, or corrosion. However, data related to fluid properties are still scarce, largely as a consequence of the difficulty of determining these properties at in situ geothermal conditions, and of the fact that those data are scattered across countries and are often the “property” of commercial operators of geothermal power plants.

The EU H2020 project REFLECT aims to collect existing and new data on geothermal fluids across Europe through field measurements, detailed lab experiments simulating in situ conditions, and by calculations. These data will be implemented in case-specific predictive models simulating reactions at geothermal sites, as well as in a European geothermal Fluid Atlas.

To harmonize the metadata information for different fluid samples, REFLECT partners plan to register IGSNs (International Geo Sample Numbers) for fluid and reservoir rock samples collected and analysed within the project. The IGSN is a unique sample identifier, i.e. the equivalent of a DOI for publications. It was originally developed for drill cores and extended to various sample types, including fluid samples (seawater, river or lake water, hydrothermal fluids, porewater). Registering fluid and rock samples with an IGSN will help make the data accessible and re-usable even if the fluid sample itself is destroyed.

All data produced and collected within REFLECT form the basis of the European Geothermal Fluid Atlas, which will include query and filtering tools to explore the database with a GIS-based map visualization. The Atlas makes the data accessible to the geothermal community and the general public. The aim is to create a database that can easily be integrated into other databases, such that the Fluid Atlas can be an addition to already existing initiatives of geological data collection.

How to cite: Kieling, K., Regenspurg, S., Kovács, K., Fekete, Z., Sánchez Miravalles, A., Madarász, T., and Hartai, É.: Collecting geochemical data of deep formation fluids for Geothermal Fluid Atlas for Europe, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-16405, https://doi.org/10.5194/egusphere-egu21-16405, 2021.

EGU21-14458 | vPICO presentations | ESSI3.9

Preserving High Value Legacy Collections for Future Research – The McNaughton Collection

Eleanore Blereau, Amanda Bellenger, and Brent McInnes

During his long career in ion probe geochemistry, Professor Neal McNaughton built up an impressive collection of samples. Professor McNaughton served as SHRIMP geochronologist for the Centre of Global Metallogeny at the University of Western Australia (1994-2005), the Western Australia Centre for Exploration Targeting (2005-2007), and the John de Laeter Centre (JdLC) at Curtin University (2007-2019), and upon his retirement he donated his collection of epoxy-mounted samples to the Geological Survey of Western Australia (GSWA). This collection of over 1000 mounts containing over 4000 samples is full of irreplaceable material, representing over 20 years of geochronological research and development on the SHRIMP II at the JdLC. The collection is a highly valuable resource for future geochemical and geochronological research; however, the entire collection lacked a digital footprint. When this project started, there was a distinct lack of a unified approach to geoscience metadata or a template for preserving such a collection. In a jointly funded effort by AuScope, GSWA and Curtin University, a digital sample catalogue of the collection with digitised materials was successfully created. We operated under the FAIR data principles and utilised International Geo Sample Numbers (IGSNs) as persistent identifiers to create the most impactful, accessible and visible product. The final catalogue, associated metadata and digital materials are now publicly available online on a number of digital platforms, such as Research Data Australia and GSWA’s GeoVIEW.WA, and the mounts can be borrowed from GSWA for future analysis. These efforts allowed the preservation of physical materials for future loans and analysis, as well as visibility in our digital age. We will outline the template and workflow used by this project, which can be adopted to preserve similarly high-value collections and used by current facilities, universities and researchers in their ongoing research, as well as insights for future efforts.

How to cite: Blereau, E., Bellenger, A., and McInnes, B.: Preserving High Value Legacy Collections for Future Research – The McNaughton Collection, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-14458, https://doi.org/10.5194/egusphere-egu21-14458, 2021.

EGU21-15037 | vPICO presentations | ESSI3.9

How to turn kilos of mud into megabytes of data? 10 years of efforts in curating lake sediment cores and their associated results

Fabien Arnaud, Cécile Pignol, Bruno Galabertier, Xavier Crosta, Isabelle Billy, Elodie Godinho, Karim Bernardet, Pierre Sabatier, Anne-Lise Develle, Rosalie Bruel, Julien Penguen, Pascal Calvat, Pierre Stéphan, and Mathias Rouan

Here we present a series of connected efforts aimed at curating sediment cores and their related data. Far from being isolated, these efforts were conducted within structured national projects and led to the development of digital solutions and good practices in line with international standards and practices.

Our efforts aimed at ensuring FAIR-compatible practices (Plomp, 2020; Wilkinson et al., 2016) throughout the life cycle of sediment cores, from fieldwork to published data. We adopted a step-by-step, bottom-up strategy to formalize a dataflow mirroring our workflow. We hence created a fieldwork mobile application (CoreBook) to gather information during coring operations and inject it into the French national virtual core repository “Cyber-Carothèque Nationale” (CCN). At this stage, the allocation of an international persistent unique identifier was crucial, and we naturally chose the IGSN.

Beyond the traceability of samples, the curation of analysis data remains challenging. Most international repositories (e.g. NOAA palaeo-data, PANGAEA) have taken the problem from the top by offering facilities to display published datasets with persistent unique identifiers (DOIs). Yet those data are only a fraction of the gross amount of acquired data. Moreover, those repositories have very low requirements when it comes to the preservation and display of metadata, in particular analytical parameters, but also fieldwork data, which are essential for data reusability. Finally, these repositories do not provide a synoptic view of the several strata of analyses that have been conducted on the same core through different research programs and publications. A partial solution is proposed by the eLTER metadata standard DEIMS, which offers a discovery interface for rich metadata. In order to bridge the gap between generalist data repositories and sample display systems (such as CCN, but also IMLGS, to cite an international system), we developed a data repository and visualizer dedicated to the re-use of lake sediment cores, samples and sampling locations (ROZA, the Retro-Observatory of the Zone Atelier). This system is still a prototype but already opens interesting perspectives.

Finally, the digital evolution of science allows the worldwide diffusion of data processing freeware. In that framework, we developed serac, an open-source R package to establish radionuclide-based age models following the most common sedimentation hypotheses. By implementing within this R package the input of a rich metadata file that gathers links to IGSNs and other quality metadata, we link fieldwork metadata, the physical storage of the core and the analytical metadata. Indeed, serac also stores data processing procedures in a standardized way. We hence think that the development of such software could help spread good practices in data curation and favour the use of unique identifiers.
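
To give a flavour of what such an age-model package computes, the toy Python sketch below implements the constant rate of supply (CRS) logic for unsupported 210Pb, one of the common sedimentation hypotheses; serac itself is an R package with proper uncertainty handling, and the numbers below are invented.

```python
import numpy as np

# Toy CRS (constant rate of supply) 210Pb age model: a sketch of one of the
# sedimentation hypotheses implemented by serac (an R package); data invented.
LAMBDA = np.log(2) / 22.3  # 210Pb decay constant in 1/yr (half-life 22.3 yr)

# Unsupported 210Pb inventory per layer (Bq/m2), top of core first,
# and the depth of each layer base (cm)
layer_inventory = np.array([80.0, 60.0, 45.0, 30.0, 18.0, 9.0, 4.0])
depth_base = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0])

total = layer_inventory.sum()
below = total - np.cumsum(layer_inventory)  # inventory below each layer base

# CRS age at each layer base: t = (1/lambda) * ln(I_total / I_below);
# ages diverge as the remaining inventory approaches zero (a CRS limitation)
ages = np.log(total / np.clip(below, 1e-9, None)) / LAMBDA

for z, t in zip(depth_base, ages):
    print(f"depth {z:5.1f} cm : age {t:7.1f} yr")
```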

By tackling all aspects of data creation and curation throughout a lake sediment core’s life cycle, we are now able to propose a theoretical model of data curation for this particular type of sample that could serve as the basis for further development of integrated data curation systems.

How to cite: Arnaud, F., Pignol, C., Galabertier, B., Crosta, X., Billy, I., Godinho, E., Bernardet, K., Sabatier, P., Develle, A.-L., Bruel, R., Penguen, J., Calvat, P., Stéphan, P., and Rouan, M.: How to turn kilos of mud into megabytes of data? 10 years of efforts in curating lake sediment cores and their associated results, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-15037, https://doi.org/10.5194/egusphere-egu21-15037, 2021.

EGU21-16550 | vPICO presentations | ESSI3.9

AusGeochem and Big Data Analytics in Low-Temperature Thermochronology

Samuel Boone, Fabian Kohlmann, Moritz Theile, Wayne Noble, Barry Kohn, Stijn Glorie, Martin Danišík, and Renjie Zhou

The AuScope Geochemistry Network (AGN) and partner Lithodat Pty Ltd are developing AusGeochem, a novel cloud-based platform for Australian-produced geochemistry data from around the globe. The open platform will allow laboratories to upload, archive, disseminate and publish their datasets, as well as perform statistical analyses and data synthesis within the context of large volumes of publicly funded geochemical data. As part of this endeavour, representatives from four Australian low-temperature thermochronology laboratories (University of Melbourne, University of Adelaide, Curtin University and University of Queensland) are advising the AGN and Lithodat on the development of low-temperature thermochronology (LTT)-specific data models for the relational AusGeochem database and its international counterpart, LithoSurfer. These schemas will facilitate the structured archiving of a wide variety of thermochronology data, enabling geoscientists to readily perform LTT Big Data analytics and gain new insights into the thermo-tectonic evolution of Earth’s crust.

Adopting established international data reporting best practices, the LTT expert advisory group has designed database schemas for the fission track and (U-Th-Sm)/He methods, as well as for thermal history modelling results and metadata. In addition to recording the parameters required for LTT analyses, the schemas include fields for reference material results and error reporting, allowing AusGeochem users to independently perform QA/QC on data archived in the database. Development of scripts for the automated upload of data directly from analytical instruments into AusGeochem, using its open-source Application Programming Interface, is currently under way.
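
In spirit, such an upload script could resemble the Python sketch below; the endpoint, payload fields and authentication are entirely hypothetical illustrations, not the actual AusGeochem or LithoSurfer API.

```python
import requests

# Entirely hypothetical endpoint, schema and credential: illustration only,
# not the actual AusGeochem/LithoSurfer API
API = "https://ausgeochem.example.org/api"
TOKEN = "REPLACE_WITH_LAB_CREDENTIAL"

analysis = {
    "sampleIgsn": "XXAA000001",              # hypothetical IGSN
    "method": "apatite fission track",
    "results": {"centralAgeMa": 95.4, "centralAgeErrMa": 4.2},
    "referenceMaterial": {"name": "Durango apatite"},
}

resp = requests.post(
    f"{API}/analyses",
    json=analysis,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print("archived with id:", resp.json().get("id"))
```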

The advent of a LTT relational database heralds the beginning of a new era of Big Data analytics in the field of low-temperature thermochronology. By methodically archiving detailed LTT (meta-)data in structured schemas, intractably large datasets comprising 1000s of analyses produced by numerous laboratories can be readily interrogated in new and powerful ways. These include rapid derivation of inter-data relationships, facilitating on-the-fly age computation, statistical analysis and data visualisation. With the detailed LTT data stored in relational schemas, measurements can then be re-calculated and re-modelled using user-defined constants and kinetic algorithms. This enables analyses determined using different parameters to be equated and compared across regional- to global scales.

This novel tool will improve laboratories’ ability to manage and share their data in alignment with FAIR data principles, while enabling analysts to readily interrogate intractably large datasets in new and powerful ways.

How to cite: Boone, S., Kohlmann, F., Theile, M., Noble, W., Kohn, B., Glorie, S., Danišík, M., and Zhou, R.: AusGeochem and Big Data Analytics in Low-Temperature Thermochronology, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-16550, https://doi.org/10.5194/egusphere-egu21-16550, 2021.

EGU21-3363 | vPICO presentations | ESSI3.9

Ordination analyses in sedimentology, geochemistry and paleoenvironment - current trends and recommendations 

Or Mordechay Bialik, Emilia Jarochowska, and Michal Grossowicz

Ordination is a family of multivariate exploratory data analysis methods. With the advent of high-throughput data acquisition protocols, community databases, and multiproxy studies, the use of ordination in Earth sciences has snowballed. As data management and analytical tools expand, this growing body of knowledge opens new possibilities for meta-analyses and data-mining across studies. This requires that analyses be chosen appropriately for the character of Earth science data, including pre-treatment consistent with the precision and accuracy of the variables, as well as appropriate documentation. To investigate the current situation in Earth sciences, we surveyed 174 ordination analyses in 163 publications in the fields of geochemistry, sedimentology and palaeoenvironmental reconstruction and monitoring. We focussed on studies using Principal Component Analysis (PCA), Non-Metric Multidimensional Scaling (NMDS) and Detrended Correspondence Analysis (DCA).

PCA was the most common type of analysis (84%), with the other two accounting for ca. 12% each. Of 128 uses of PCA, only 5 included a test for normality, and most of these tests were not applied or documented correctly. Common problems include: (1) not providing information on the dimensions of the analysed matrix (16% of cases); (2) using a larger number of variables than observations (24 cases); (3) not documenting the distance metric used in NMDS (55% of cases); and (4) lack of information on the software used (38% of cases). The majority (53%) of the surveyed studies did not provide the data used for analysis at all, and a further 35% provided data sets in a format that does not allow immediate, error-free reuse, e.g. as data tables directly in the article text or in PDF appendices. The “gold standard” of placing a curated data set in an open-access repository was followed in only 6 (3%) of the analyses. Among analyses which reported using code-based statistical environments such as R Software, SAS or SPSS, none provided the code that would allow reproducing the analyses.

Geochemical and Earth science data sets require expert knowledge that should support analytical decisions and interpretations. Data analysis skills attract students to Earth sciences study programmes and offer a viable research alternative when field- or lab-based work is limited. However, many study curricula and publishing processes have not yet endorsed this methodological progress, leading to situations where mentors, reviewers and editors cannot offer quality assurance for the use of ordination methods. We provide a review of solutions and annotated code for PCA, NMDS and DCA of geochemical data sets in the free R Software environment, encouraging the community to reuse and further develop a reproducible ordination workflow.
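As a taste of such a reproducible workflow, the sketch below (in Python with scikit-learn, for illustration; the annotated code we provide is written for the R environment) runs a PCA while making explicit the documentation steps flagged above: the matrix dimensions, the observations-versus-variables check, and the pre-treatment applied. The input file name is a placeholder.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder input: rows = samples/observations, columns = geochemical variables
X = np.loadtxt("geochem_data.csv", delimiter=",", skiprows=1)

n_obs, n_var = X.shape
print(f"Analysed matrix: {n_obs} observations x {n_var} variables")  # problem (1)
assert n_obs > n_var, "more variables than observations"             # problem (2)

X_std = StandardScaler().fit_transform(X)  # document the pre-treatment explicitly
pca = PCA().fit(X_std)
print("Explained variance ratios:", np.round(pca.explained_variance_ratio_[:3], 3))
```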

How to cite: Bialik, O. M., Jarochowska, E., and Grossowicz, M.: Ordination analyses in sedimentology, geochemistry and paleoenvironment - current trends and recommendations , EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-3363, https://doi.org/10.5194/egusphere-egu21-3363, 2021.

EGU21-10344 | vPICO presentations | ESSI3.9

Managing geochemical data within the U.S. Geological Survey: An overview of policies and approaches from the perspective of the Energy Resources Program

Justin Birdwell, Christina DeVera, Katherine French, Steve Groves, Gregory Gunther, Aaron Jubb, Toni Rozinek, Augusta Warden, and Michele Wolf

The mission of the U.S. Geological Survey (USGS) Energy Resources Program (ERP) is to provide unbiased scientific information to stakeholders by conducting and disseminating research into energy-related issues mandated by the Administration or Congress or guided by ERP and USGS leadership. USGS Fundamental Science Practices (FSP) form the foundation for these efforts, representing a set of consistent procedures, ethical requirements, and operational principles that direct how research activities are conducted to ensure the highest standard of scientific integrity and transparency. Policies created to meet the goals of FSP guide how work is performed and how resulting information products are curated through the development, review, and approval processes. Though FSP have been a core part of the USGS mission since its inception, several new policies related to data generation, management, and distribution have been developed and implemented over the last decade to make practices, particularly those involving laboratory-generated geochemical data, more standardized and consistent across the USGS’ different scientific mission areas.

The ERP has been at the forefront of implementing these policies, particularly those that relate to laboratory-based science. For example, a new USGS-wide Quality Management System (QMS) was initially rolled out in ERP laboratories. QMS quality assurance requirements for laboratories were developed to ensure the generation of data of known and documented quality and to support a culture of continuous improvement. QMS requirements include controls on sample receipt, login, and storage; documentation of data generation methods and standard operating procedures for sample preparation and analysis; and quality control procedures around equipment calibration and maintenance and data acceptance criteria. Many of the requirements are currently being met in the Petroleum Geochemistry Research Laboratory (PGRL) through the use of a laboratory information management system (LIMS), which provides a centralized storage location for data recording, reduction, review, and reporting. Samples processed by PGRL are identified from login to reporting by a unique lab-assigned number. Data are reviewed by the analyst, a secondary reviewer, and the laboratory manager before being accepted or, where issues were identified during analysis, flagged as qualified. A similar documentation approach is also applied to new research methods, experimental work, and modifications of existing processes.

Once reported to a submitter, geochemistry data are interpreted and incorporated into USGS reports and other outside publications that are tracked using a single information product data system (IPDS). IPDS facilitates management of the internal review and approval processes for USGS information products. For geochemistry studies, data releases containing machine-readable laboratory-generated results along with associated metadata documentation typically accompany publications and have their own review and approval process. Once generated, data releases are given unique digital object identifiers for citation and access persistence, stored in ScienceBase, a Trusted Digital Repository for USGS products, and made accessible through the USGS Science Data Catalog (https://data.usgs.gov). This collection of systems makes it possible for ERP personnel to collect, manage, and track geochemical data, facilitating the timely delivery of high-quality scientific publications and datasets to the public and supporting decision makers in managing domestic natural resources.

How to cite: Birdwell, J., DeVera, C., French, K., Groves, S., Gunther, G., Jubb, A., Rozinek, T., Warden, A., and Wolf, M.: Managing geochemical data within the U.S. Geological Survey: An overview of policies and approaches from the perspective of the Energy Resources Program, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-10344, https://doi.org/10.5194/egusphere-egu21-10344, 2021.

EGU21-7876 | vPICO presentations | ESSI3.9

How to publish your data with the EPOS Multi-scale Laboratories

Geertje ter Maat, Otto Lange, and Martyn Drury and the EPOS TCS Multi-scale Laboratories Team

EPOS (the European Plate Observing System) is a pan-European e-infrastructure framework with the goal of improving and facilitating the access, use, and re-use of Solid Earth science data. The EPOS Thematic Core Service Multi-scale Laboratories (TCS MSL) represent a community of European Solid Earth sciences laboratories including high-temperature and high-pressure experimental facilities, electron microscopy, micro-beam analysis, analogue tectonic and geodynamic modelling, paleomagnetism, and analytical laboratories. 

Participants and collaborating laboratories from Belgium, Bulgaria, France, Germany, Italy, Norway, Portugal, Spain, Switzerland, The Netherlands, and the UK are already represented within the TCS MSL. Unaffiliated European Solid Earth sciences laboratories are welcome and encouraged to join the growing TCS MSL community.

Laboratory facilities are an integral part of Earth science research. The diversity of methods employed in such infrastructures reflects the multi-scale nature of the Earth system and is essential for understanding its evolution, for assessing geo-hazards, and for the sustainable exploitation of geo-resources.

Although experimental data from these laboratories often provide the backbone for scientific publications, such data are typically available only as images, graphs or tables in the text or as supplementary information to research articles. As a result, much of the collected data remains unpublished, unsearchable or even inaccessible, and is often preserved only in the short term.

The TCS MSL is committed to making Earth science laboratory data Findable, Accessible, Interoperable, and Reusable (FAIR). For this purpose, the TCS MSL encourages the community to share their data via DOI-referenced, citable data publications. To facilitate this and ensure the provision of rich metadata, we offer user-friendly tools, plus the necessary data management expertise, to support all aspects of data publishing for the benefit of individual lab researchers via partner repositories. Data published via the TCS MSL are described with sustainable metadata standards enriched with controlled vocabularies used in the geosciences. The resulting data publications are also exposed through a designated TCS MSL online portal that brings together DOI-referenced data publications from partner research data repositories (https://epos-msl.uu.nl/). Efforts have already been made to interconnect new data (metadata exchange) with existing databases such as MagIC (paleomagnetic data in EarthRef.org), and in the future we expect to expand and improve this practice with other repositories.
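Because each data publication carries a DOI, its metadata can also be retrieved programmatically. The sketch below (Python, using the public DataCite REST API; the DOI shown is a placeholder, not an actual MSL publication) illustrates this.

```python
import requests

doi = "10.xxxx/placeholder"  # placeholder; substitute the DOI of a data publication
resp = requests.get(f"https://api.datacite.org/dois/{doi}", timeout=30)
resp.raise_for_status()

attrs = resp.json()["data"]["attributes"]  # the DataCite metadata record
print(attrs["titles"][0]["title"])
print([creator["name"] for creator in attrs["creators"]])
print(attrs.get("publisher"), attrs.get("publicationYear"))
```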

How to cite: ter Maat, G., Lange, O., and Drury, M. and the EPOS TCS Multi-scale Laboratories Team: How to publish your data with the EPOS Multi-scale Laboratories, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-7876, https://doi.org/10.5194/egusphere-egu21-7876, 2021.

EGU21-16459 | vPICO presentations | ESSI3.9

EarthChem Communities: Building Geochemical Data Best Practices with Researcher Engagement

Lucia Profeta, Kerstin Lehnert, Lulin Song, and Juan David Figueroa

Acquisition and use of geochemical data are pervasive in the Earth, Environmental and Planetary Sciences as they are fundamental to our understanding of past, present, and future processes in natural systems, from the interior of the Earth to its surface environments on land, in the oceans, and in the air, to the entire solar system. Accordingly, the range of research communities that generate and use geochemical data is quite extensive. Data practices and workflows for processing, reporting, sharing, and using data are numerous and distinct for different research communities. Furthermore, the type of data generated is highly diverse with respect to analyzed parameters, analyzed materials, analytical techniques and instrumentation, as well as volume, size, and format. This makes it difficult to define generally applicable best practices and standards for geochemical data that the entire range of geochemical data communities will adopt. While it is technically possible to describe and encode the large variety of geochemical measurements in a consistent, unifying way provided by the Observations and Measurements conceptual model (https://www.ogc.org/standards/om), communities need to build consensus around specifics in data formats, metadata, and vocabularies, and most importantly, they need to ‘own’ the best practices to ensure adoption. 

EarthChem is a data facility for geochemistry, funded by the US National Science Foundation since 2006, to develop and operate community-driven services that support the discovery, access, preservation, reusability, and interoperability of geochemical data. EarthChem has a long record of engaging with the global research community to develop and promote data best practices for geochemistry by, for example, initiating and helping to organize the Editors Roundtable (Goldstein et al. 2014, http://dx.doi.org/10.1594/IEDA/100426). In recent years, as researchers have become increasingly aware of the benefits and requirements of FAIR data management, EarthChem has supported research communities wanting to establish consistent data formats and rich metadata for better findability and reproducibility of specific data types acquired and used within these communities. EarthChem now works with community advisers to build consensus around data best practices, provide resources for researchers to comply with these best practices, and streamline data submission and data access for these communities. EarthChem provides Community web pages as spaces to explain community-specific best practices, offer downloadable data templates, and link to customized community portals for data submission and access. EarthChem is in the process of defining guidelines and policies that will ensure that the best practices and data templates promoted by an EarthChem Community are indeed community endorsed. By making sure that the community-specific best practices align with more general data standards such as the elements of the O&M conceptual data model or the use of globally unique identifiers for samples, EarthChem Communities can advance overarching data best practices and standards that will improve reusability of geochemical data and data exchange among distributed databases. Initial EarthChem Communities include Tephra, Clumped Isotopes, and Experimental Petrology. Additional communities such as GeoHealth and Laser Induced Breakdown Spectroscopy are currently in an exploratory stage.
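As a minimal sketch of what the O&M conceptual model provides, a single geochemical measurement can be encoded with the O&M core properties as follows (Python; the sample identifier, vocabulary terms and values are hypothetical):

```python
# One measurement expressed with the core properties of the OGC
# Observations and Measurements (O&M) conceptual model.
observation = {
    "featureOfInterest": "IGSN:XXXXXXXXX",     # the analysed sample (placeholder IGSN)
    "observedProperty": "SiO2 mass fraction",  # ideally a controlled-vocabulary term
    "procedure": "XRF, fused glass bead",      # analytical technique / lab protocol
    "phenomenonTime": "2021-03-02T10:15:00Z",  # when the property applied to the sample
    "resultTime": "2021-03-05T08:00:00Z",      # when the result became available
    "result": {"value": 47.3, "unit": "wt%"},
}
```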

How to cite: Profeta, L., Lehnert, K., Song, L., and Figueroa, J. D.: EarthChem Communities: Building Geochemical Data Best Practices with Researcher Engagement, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-16459, https://doi.org/10.5194/egusphere-egu21-16459, 2021.

EGU21-2521 | vPICO presentations | ESSI3.9

Automation of (meta-)data workflows from field to data repository

Linda Baldewein, Ulrike Kleeberg, and Lars Möller

In the Earth and environmental sciences, data analyzed from field samples constitute a significant portion of all research data, often collected at significant cost and in a non-reproducible manner. If important metadata is not immediately secured and stored in the field, the quality and re-usability of the resulting data will be diminished.

At the Helmholtz Coastal Data Center (HCDC), a metadata and data workflow for biogeochemical data has been developed over the last couple of years to ensure the quality and richness of metadata and to make the final data product FAIR. It automates and standardizes the data transfer from the campaign planning stage, through sample collection in the field, analysis and quality control, to storage in databases and publication in repositories.

Prior to any sampling campaign, the scientists are equipped with a customized app on a tablet that enables them to record relevant metadata, such as the date and time of sampling, the scientists involved and the type of sample collected. Each sample and station already receives a unique identifier at this stage. The location is directly retrieved from a high-accuracy GNSS receiver connected to the tablet. This metadata is transmitted via mobile data transfer to the institution’s cloud storage.
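A minimal sketch of such a field record (Python; the field names are illustrative, not the HCDC schema) shows how a unique identifier, timestamp and GNSS position can be fixed at the moment of sampling:

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FieldSample:
    """One sample record captured in the field; attribute names are illustrative."""
    sample_type: str
    scientists: list[str]
    latitude: float        # from the connected high-accuracy GNSS receiver
    longitude: float
    sample_id: str = field(default_factory=lambda: uuid.uuid4().hex)  # unique ID at sampling
    sampled_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = FieldSample("surface water", ["L. Baldewein"], 53.881, 8.702)
print(record.sample_id, record.sampled_at)
```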

After the campaign, the metadata is quality checked by the field scientists and the data curator and stored in a relational database. Once the samples are analyzed in the lab, the data is imported into the database and connected to the corresponding metadata using a template. Data DOIs are registered for finalized datasets in close collaboration with the World Data Center PANGAEA. The data sets are discoverable through their DOIs as well as through the HCDC data portal and the API of the metadata catalogue service.

This workflow is well established within the institute, but is still in the process of being refined and becoming more sophisticated and FAIRer. For example, an automated assignment of International Geo Sample Numbers (IGSN) for all samples is currently being planned.

How to cite: Baldewein, L., Kleeberg, U., and Möller, L.: Automation of (meta-)data workflows from field to data repository, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-2521, https://doi.org/10.5194/egusphere-egu21-2521, 2021.

EGU21-14628 | vPICO presentations | ESSI3.9

The AuScope Geochemistry Network and the AusGeochem geochemistry data platform

Alexander Prent, Hayden Dalton, Samuel Boone, Guillaume Florin, Yoann Greau, Brent McInnes, Andrew Gleadow, Suzanne O'Reilly, Barry Kohn, Erin Matchan, Olivier Alard, Tim Rawling, Fabian Kohlmann, Moritz Theile, and Wayne Noble

The AuScope Geochemistry Network (AGN, www.auscope.org.au/agn) was established in 2019 in response to a community-expressed desire for closer collaboration and coordination of activities between Australian geochemistry laboratories. Its aims include: (i) promotion of capital and operational investments in new, advanced geochemical infrastructure; (ii) supporting increased end-user access to laboratory facilities and research data; and (iii) fostering collaboration and professional development via online tools, training courses and workshops. Over the last six months, the AGN has coordinated a monthly webinar series to engage the geoscience community, promote FAIR data practices and foster new collaborations. These webinars were recorded for future use and can be found at: www.youtube.com/channel/UC0zzzc6_mrJEEdCS_G4HYgg.

A primary goal of the AGN is to make the network’s laboratory geochemistry data, from around the globe, discoverable and accessible via the development of an online data platform called AusGeochem (www.auscope.org.au/ausgeochem). Geochemical data models for SHRIMP U-Pb, Fission Track, U-Th/He, LA-ICP-MS U-Pb/Lu-Hf and Ar-Ar are being developed using international best practice and are informed by expert advisory groups consisting of members from various institutes and laboratories within Australia. AusGeochem is being designed to provide an online data service for analytical laboratories and researchers where sample and analytical data can be uploaded (privately) for processing, synthesis and secure dissemination to collaborators. Researcher data can be retained in a private space but studied within the context of other publicly available data. Researchers can also generate unique International Geo Sample Numbers (IGSNs) for their samples via a built-in link to the Australian Research Data Commons IGSN registry.

AusGeochem supports FAIR data practices by enabling researchers to include links to their AusGeochem-registered data in research publications, offering a potential opportunity for AusGeochem to become a trusted data repository.

How to cite: Prent, A., Dalton, H., Boone, S., Florin, G., Greau, Y., McInnes, B., Gleadow, A., O'Reilly, S., Kohn, B., Matchan, E., Alard, O., Rawling, T., Kohlmann, F., Theile, M., and Noble, W.: The AuScope Geochemistry Network and the AusGeochem geochemistry data platform, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-14628, https://doi.org/10.5194/egusphere-egu21-14628, 2021.

EGU21-13940 | vPICO presentations | ESSI3.9

Linking data systems into a collaborative pipeline for geochemical data from field to archive

Kerstin Lehnert, Daven Quinn, Basil Tikoff, Douglas Walker, Sarah Ramdeen, Lucia Profeta, Shanan Peters, and Jonathan Pauli

Management of geochemical data needs to consider the sequence of phases in the lifecycle of these data from field to lab to publication to archive. It also needs to address the large variety of chemical properties measured; the wide range of materials that are analyzed; the different ways in which these materials may be prepared for analysis; the diversity of analytical techniques and instrumentation used to obtain analytical results; and the many ways used to calibrate and correct raw data, normalize them to standard reference materials, and otherwise treat them to obtain meaningful and comparable results. In order to extract knowledge from the data, they are then integrated and compared with other measurements; formatted for visualization, statistical analysis, or model generation; and finally cleaned and organized for publication and deposition in a data repository. Each phase in the geochemical data lifecycle has its specific workflows and metadata that need to be recorded to fully document the provenance of the data so that others can reproduce the results.

An increasing number of software tools are being developed to support the different phases of the geochemical data lifecycle. These include electronic field notebooks, digital lab books, and Jupyter notebooks for data analysis, as well as data submission forms and templates. These tools are mostly disconnected and often require manual transcription or copying and pasting of data and metadata from one tool to the other. In an ideal world, these tools would be connected so that field observations gathered in a digital field notebook, such as sample locations and sampling dates, could be seamlessly sent to an IGSN Allocating Agent to obtain a unique sample identifier and QR code with a single click. The sample metadata would be readily accessible to the lab data management system, which allows researchers to capture information about sample preparation and connects to the instrumentation to capture instrument settings and the raw data. The data would then be seamlessly accessed by data reduction software, visualized, and further compared to data from global databases that can be directly accessed. Ultimately, a few clicks would allow the user to format the data for publication and archiving.
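A sketch of the single-click registration step described above might look as follows (Python; the endpoint, payload fields and response shape are hypothetical, since each Allocating Agent defines its own API):

```python
import requests

# Field-notebook observation to be registered; all field names are hypothetical
sample = {
    "name": "Sample-042",
    "latitude": 43.07,
    "longitude": -89.40,
    "collection_date": "2021-04-19",
}

# Hypothetical IGSN Allocating Agent endpoint, for illustration only
resp = requests.post("https://igsn-agent.example.org/samples", json=sample, timeout=30)
resp.raise_for_status()

sample["igsn"] = resp.json()["igsn"]  # the persistent identifier now travels with the record
# Downstream, the lab data management system would key preparation steps,
# instrument settings and raw data to this IGSN.
```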

Several data systems that support different stages in the lifecycle of samples and sample-based geochemical data have now come together to explore the development of standardized interfaces and APIs and consistent data and metadata schemas to link their systems into an efficient pipeline for geochemical data from the field to the archive. These systems include StraboSpot (www.strabospot.org; data system for digital collection, storage, and sharing of both field and lab data), SESAR (www.geosamples.org; sample registry and allocating agent for IGSN), EarthChem (www.earthchem.org; publishers and repository for geochemical data), Sparrow (sparrow-data.org; data system to organize analytical data and track project- and sample-level metadata), IsoBank (isobank.org; repository for stable isotope data), and MacroStrat (macrostrat.org; collaborative platform for geological data exploration and integration).

How to cite: Lehnert, K., Quinn, D., Tikoff, B., Walker, D., Ramdeen, S., Profeta, L., Peters, S., and Pauli, J.: Linking data systems into a collaborative pipeline for geochemical data from field to archive, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-13940, https://doi.org/10.5194/egusphere-egu21-13940, 2021.

ESSI4.2 – Innovations in Scientific Data Visualization

EGU21-7859 | vPICO presentations | ESSI4.2

Interactive 3-D visual analysis of ERA 5 data: improving diagnostic indices for Marine Cold Air Outbreaks

Marcel Meyer, Iuliia Polkova, and Marc Rautenhaus

We present the application of interactive 3-D visual analysis techniques using the open-source meteorological visualization framework Met.3D [1] for investigating ERA5 reanalysis data. Our focus lies on inspecting atmospheric conditions favoring the development of extreme weather events in the Arctic. Marine Cold Air Outbreaks (MCAOs) and Polar Lows (PLs) are analyzed with the aim of improving diagnostic indices for capturing extreme weather events in seasonal and climatological assessments. We adopt an integrated workflow starting with the interactive visual exploration of single MCAO and PL events, using an extended version of Met.3D, followed by the design and testing of new diagnostic indices in a climatological assessment. Our interactive visual exploration provides insights into the complex 3-D shape and dynamics of MCAOs and PLs. For instance, we reveal a slow wind eye of a PL that extends from the surface up into the stratosphere. Motivated by the interactive visual analysis of single cases of MCAOs, we design new diagnostic indices, which address shortcomings of previously used indices, by capturing the vertical extent of the lower-level static instability induced by MCAOs. The new indices are tested by comparison with observed PLs in the Barents and the Nordic Seas (as reported in the STARS data set). Results show that the new MCAO index introduced here has an important advantage compared with previously used MCAO indices: it is more successful in indicating the times and locations of PLs. We thus propose the new index for further analyses in seasonal climate predictions and climatological studies. The methods for interactive 3-D visual data analysis presented here are made freely available for public use as part of the open-source tool Met.3D. We thereby provide a generic tool that can be used for investigating atmospheric processes in ERA5 data by means of interactive 3-D visual data analysis. Met.3D can be used, for example, during an initial explorative phase of scientific workflows, as a complement to standard 2-D plots, and for detailed meteorological case-analyses in 3-D.
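For orientation, one widely used diagnostic formulation measures the potential-temperature difference between the surface and a lower-tropospheric level; the sketch below (Python/xarray; the file layout and ERA5 short names 'skt', 'sp' and 't' are assumptions, and our new indices extend such measures by accounting for the vertical extent of the instability) computes it:

```python
import xarray as xr

KAPPA = 0.286  # R/cp for dry air

# Hypothetical ERA5 extracts: 'skt' = skin temperature (K), 'sp' = surface
# pressure (Pa), 't' = temperature (K) on pressure levels (hPa)
sfc = xr.open_dataset("era5_surface.nc")
plev = xr.open_dataset("era5_pressure_levels.nc")

theta_skin = sfc["skt"] * (100000.0 / sfc["sp"]) ** KAPPA  # surface potential temperature
theta_850 = plev["t"].sel(level=850) * (1000.0 / 850.0) ** KAPPA

mcao_index = theta_skin - theta_850  # positive values flag lower-level static instability
```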


[1] http://met3d.wavestoweather.de, https://collaboration.cen.uni-hamburg.de/display/Met3D/

How to cite: Meyer, M., Polkova, I., and Rautenhaus, M.: Interactive 3-D visual analysis of ERA 5 data: improving diagnostic indices for Marine Cold Air Outbreaks, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-7859, https://doi.org/10.5194/egusphere-egu21-7859, 2021.

EGU21-8801 | vPICO presentations | ESSI4.2

Visualising and experiencing geological flows in Virtual Reality

Emmanuel Delage, Benjamin Van Wyk de Vries, Meven Philippe, Susan Conway, Costanza Morino, Nelida Manrique Llerena, Rigoberto Aguilar Contreras, Yhon Soncco, Þorsteinn Sæmundsson, and Jón Kristinn Helgason

Resilience to natural hazards depends on a person's ability to envision an event and its consequences. While real-life experience is precious, experiencing a real event is rare, and sometimes fatal. Virtual reality thus provides a way to gain that experience more frequently and without the inconvenience of demise. Virtual reality can also enhance an event to make it more visible, as events often happen in bad weather, at night or at other inconvenient moments.

The 3DTeLC software (an output from an ERASMUS+ project, http://3dtelc.lmv.uca.fr/) can handle high-resolution 3D topographic models, and the user can study natural hazard phenomena with geological tools in virtual reality. Topography acquired from drone or aircraft surveys can thus be made more accessible to researchers, the public and stakeholders. In the virtual environment, a person can interact with the scene from a first-person, drone or plane point of view and can carry out geological interpretation at different visualization scales. Immersive and interactive visualization is an efficient communication tool (e.g. Tibaldi et al. 2019 – Bulletin of Volcanology, DOI: https://dx.doi.org/10.1007/s00445-020-01376-6).

We have taken the 3DTeLC workflow and integrated a 2.5D flow simulation programme (VOLCFLOW-C). The dynamic outputs from VOLCFLOW-C are superimposed into a single visualization using a new tool developed from scratch, which we call VRVOLC. This coupled visualization adds dynamic and realistic understanding of events like lahars, lava flows, landslides and pyroclastic flows. We present two examples: one developed on the digital terrain model of Chachani Volcano, Arequipa, Peru, to assist with flood and lahar visualisation (in conjunction with INGEMMET, UNESCO IGCP project 692 Geoheritage for Resilience and Cap 20-25 Clermont Risk), and another of an Icelandic debris slide that occurred in late 2014, possibly related to permafrost degradation (in conjunction with the ANR PERMOLARDS project).

We thank our 3DTeLC colleagues, without whom this would not have been possible, and acknowledge financial support for the PERMOLARDS project from the French National Research Agency (ANR-19-CE01-0010). This work is part of UNESCO IGCP 692 Geoheritage for Resilience.

How to cite: Delage, E., Van Wyk de Vries, B., Philippe, M., Conway, S., Morino, C., Manrique Llerena, N., Aguilar Contreras, R., Soncco, Y., Sæmundsson, Þ., and Kristinn Helgason, J.: Visualising and experiencing geological flows in Virtual Reality, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-8801, https://doi.org/10.5194/egusphere-egu21-8801, 2021.

EGU21-10505 | vPICO presentations | ESSI4.2

Automatic Recognition of Power Line Harmonic Radiation Observed by the EFD On board the ZH-1 Satellite

ying Han, jing Yuan, qiao Wang, dehe Yang, and xuhui Sun

Power line harmonic radiation generated by human activities can be found in the vast amounts of data observed by the EFD on board the ZH-1 satellite. To study these human activities and to remove a non-negligible source of interference in the study of ionospheric precursors of earthquakes, the power line harmonics must be identified within this large data volume. Hence, a novel automatic power line recognition method is proposed. Firstly, we apply the Fourier transform to the EFD data to obtain the power spectral density (PSD). Secondly, harmonic radiation from power lines presents as one or more horizontal linear features in the PSD image, whose color is close to that of the background. To highlight the contrast between the lines and the background, we transform the PSD image from the RGB to the HSV color space and use the saturation component of the HSV space as the object image. To obtain the edge regions, we process the object image with the Canny edge detector. Finally, we use the Hough transform to detect the power lines within the edge regions. To evaluate the proposed method, an experiment was performed on a dataset of 100 PSD images, each including several interference lines; the results verify the effectiveness of the proposed method, with an accuracy of 86%.
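A condensed sketch of this pipeline in Python with OpenCV follows (the file name and thresholds are illustrative, not the exact parameters used in the experiment):

```python
import cv2
import numpy as np

img = cv2.imread("psd_image.png")            # PSD image rendered in RGB (placeholder file)
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)   # RGB -> HSV color space
saturation = hsv[:, :, 1]                    # saturation component as the object image

edges = cv2.Canny(saturation, 50, 150)       # edge regions (thresholds illustrative)
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=100,
                        minLineLength=img.shape[1] // 3, maxLineGap=10)

# Keep near-horizontal detections, the signature of power line harmonics
if lines is not None:
    for x1, y1, x2, y2 in (line[0] for line in lines):
        if abs(y2 - y1) <= 2:
            print(f"candidate harmonic line at row ~{y1}")
```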

How to cite: Han, Y., Yuan, J., Wang, Q., Yang, D., and Sun, X.: Automatic Recognition of Power Line Harmonic Radiation Observed by the EFD On board the ZH-1 Satellite, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-10505, https://doi.org/10.5194/egusphere-egu21-10505, 2021.

EGU21-11208 | vPICO presentations | ESSI4.2 | Highlight

The “Scientific colour map” Initiative: Version 7 and its new additions

Fabio Crameri, Grace Shephard, and Philip Heron
  • Does visualisation hinder scientific progress?
  • Is visualisation widely misused to tweak data?
  • Is visualisation intentionally used for social exclusion?
  • Is visualisation taken seriously by academic leaders?

Using scientifically derived colour palettes is a big step towards rendering such brutal questions obsolete. Their perceptual uniformity leaves no room to highlight artificial boundaries or hide real ones. Their perceptual order transfers data visually, effortlessly and without delay. Their friendliness to colour-vision-deficient readers leaves no one wondering. Their black-and-white readability leaves no printer accused of not being good enough. It is, indeed, the true nature of the data that is displayed to all viewers, in every way.

The “Scientific colour map” initiative (Crameri et al., 2020) provides free, citable colour palettes of all kinds for download for an extensive suite of software programs, a discussion around data types and colouring options, and a handy how-to guide for a professional use of colour combinations. Version 7 of the Scientific colour maps (Crameri, 2020) makes crucial new additions towards fairer and more effective science communication available to the science community.
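For instance, the palettes can be used from Python via the community-maintained cmcrameri package, one of many supported routes (the data here are random, purely for demonstration):

```python
import matplotlib.pyplot as plt
import numpy as np
from cmcrameri import cm  # pip install cmcrameri

data = np.random.default_rng(0).random((50, 50))
plt.imshow(data, cmap=cm.batlow)  # perceptually uniform, ordered, CVD- and greyscale-safe
plt.colorbar(label="value")
plt.show()
```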

Crameri, F., G.E. Shephard, and P.J. Heron (2020), The misuse of colour in science communication, Nature Communications, 11, 5444.

Crameri, F. (2020). Scientific colour maps. Zenodo. http://doi.org/10.5281/zenodo.1243862

How to cite: Crameri, F., Shephard, G., and Heron, P.: The “Scientific colour map” Initiative: Version 7 and its new additions, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-11208, https://doi.org/10.5194/egusphere-egu21-11208, 2021.

EGU21-11680 | vPICO presentations | ESSI4.2

Interactive visualization and topology-based analysis of large-scale time-varying remote-sensing data: challenges and opportunities

Riccardo Fellegara, Markus Flatken, Francesco De Zan, and Andreas Gerndt

Over the last few years, the amount of large and complex data in the public domain has increased enormously, and new challenges have arisen in the representation, analysis and visualization of such data. Considering the number of space missions that have provided and will provide remote sensing data, there is still a need for a system that can be deployed across several remote repositories while remaining accessible from a single client running on commodity hardware.

To tackle this challenge, at the DLR Institute for Software Technology we have designed a dual backend/frontend system enabling the interactive analysis and visualization of large-scale remote sensing data. The basis for all visualization and interaction approaches is CosmoScout VR, a visualization tool developed at DLR and publicly available on GitHub, which allows the visualization of complex planetary data and large simulation data in real time. Its dual counterpart is an MPI-based framework, called Viracocha, that enables the remote analysis of large data and makes efficient use of the network by sending compact, partial results to CosmoScout for interactive visualization as soon as they are computed.

A node-based interface is defined within the visualization tool, which lets a domain expert easily define customized pipelines for processing and visualizing the remote data. Each “node” of this interface is linked either to a feature extraction module defined in Viracocha or to a rendering module defined directly in CosmoScout. Because the interface is fully customizable by the user, multiple pipelines can be defined over the same dataset to further enhance the visual feedback for analysis purposes.

As an ongoing effort on top of these tools, we plan to define and implement novel strategies for EO data processing and visualization based on Topological Data Analysis (TDA). TDA is an emerging set of techniques for processing data according to its topological features. These include both the geometric information associated with a point and all the non-geometric scalar values, such as temperature and pressure, that can be captured during a monitoring mission. One of the major theories behind TDA is Discrete Morse Theory which, given a scalar function, is used to define a discrete gradient on that function, extract the critical points, identify the region of influence of each critical point, and so on. This strategy is parameter-free and enables a domain scientist to process large datasets without prior knowledge of the data.
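As a simplified stand-in for this Morse-theoretic machinery (Python; a full discrete Morse complex would also yield saddles and the connectivity between critical points), extrema of a scalar field can be located as follows:

```python
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

# Toy scalar field standing in for one remote-sensing variable (e.g. temperature)
y, x = np.mgrid[0:256, 0:256]
f = np.sin(x / 24.0) * np.cos(y / 18.0)

# Local extrema: points equal to the max/min over their 3x3 neighbourhood
maxima = f == maximum_filter(f, size=3)
minima = f == minimum_filter(f, size=3)
print(f"{maxima.sum()} maxima, {minima.sum()} minima")
```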

An interesting research question that will be investigated during this project is the correlation of changes in critical points at different time steps, and the identification of deformation (or change) across time in the original dataset.

How to cite: Fellegara, R., Flatken, M., De Zan, F., and Gerndt, A.: Interactive visualization and topology-based analysis of large-scale time-varying remote-sensing data: challenges and opportunities, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-11680, https://doi.org/10.5194/egusphere-egu21-11680, 2021.

EGU21-12491 | vPICO presentations | ESSI4.2

Immersive Visualization of Ocean Data in a Game Engine

Felicia Brisc and Nuno Serra

Virtual Reality is expanding rapidly in many academic and industry areas as an important tool for representing 3D objects, while graphics hardware is becoming increasingly accessible. Conforming to these trends, we present an immersive VR environment created to help Earth scientists and other users visualize and study ocean simulation data and processes. Besides scientific exploration, we hope our environment will become a helpful tool in education and outreach. We combined a 1-year, 3-km MITgcm simulation with daily temporal resolution and a bathymetry digital elevation model in order to visualize the evolution of Northeast Atlantic eddies enclosed by warm and salty Mediterranean Water. Our approach leverages the advanced rendering algorithms of a game engine to enable users to move around freely, interactively play the simulation and observe the changes and evolution of eddies in real time.

How to cite: Brisc, F. and Serra, N.: Immersive Visualization of Ocean Data in a Game Engine, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-12491, https://doi.org/10.5194/egusphere-egu21-12491, 2021.

EGU21-15612 | vPICO presentations | ESSI4.2

Visualization of uncertainty in air quality ensemble forecasts

Angelika Heil and Augustin Colette

EGU21-14761 | vPICO presentations | ESSI4.2

Visualizing Sedimentary Condensation, Dilution, and Erosion using Shiny Apps

Niklas Hohmann and Emilia Jarochowska

Fossil accumulations can be generated by (1) a high input of organism remains or (2) low sedimentation rates reducing the volume of sediment between individual fossils. This creates a paradox in which shell beds may form in environments with low biomass production. This effect of sedimentary condensation on fossil abundance is easy to understand; however, its implications are hard to grasp and visualize.

We present the shellbed condensator (https://stratigraphicpaleobiology.shinyapps.io/shellbed_condensator/), a web application that allows users to interactively visualize and animate the effects of sedimentary condensation and erosion on fossil abundance and on proxies recorded by the sedimentary record. It is an adaptation of the seminal computer simulation by Kidwell (1985). The application is written in R and uses the shiny package for the construction of the web interface and the DAIME package for the sedimentological model (Hohmann, 2021). It allows users to create stratigraphic expressions and age models for combinations of fossil input and sedimentation rates they define.
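The underlying model logic is compact; a minimal sketch (in Python for illustration and without the erosion case, whereas the app itself builds on shiny and DAIME in R) is:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 1000)               # time (arbitrary units)
dt = t[1] - t[0]
sed_rate = 1.0 + 0.9 * np.sin(2 * np.pi * t)  # variable sediment input, always > 0
fossil_rate = np.full_like(t, 10.0)           # constant fossil input through time

depth = np.cumsum(sed_rate) * dt              # age model: stratigraphic height vs. time
fossil_density = fossil_rate / sed_rate       # fossils per unit sediment thickness

# Shell beds emerge where sedimentation is slow, despite constant biological input
print("peak/mean fossil density:", fossil_density.max() / fossil_density.mean())
```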

To assess the utility of shiny apps for teaching purposes, we examine student understanding of sedimentary condensation after unsupervised studying and after unsupervised usage of the app. Due to their strong visual and interactive components, shiny apps are a powerful and versatile tool for science communication, teaching, self-study, the visualization of large datasets, and the promotion of scientific findings.

 

How to cite: Hohmann, N. and Jarochowska, E.: Visualizing Sedimentary Condensation, Dilution, and Erosion using Shiny Apps, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-14761, https://doi.org/10.5194/egusphere-egu21-14761, 2021.

EGU21-15802 | vPICO presentations | ESSI4.2

Collaborative Visualization of 3D Geological Models in Augmented Reality

Björn Wieczoreck

Fully understanding a complex 3D geological model (such as triangulated irregular networks or boundary representations) requires a largely hands-on approach. The user needs direct access to the model and a way to manipulate it in 3D space, e.g. through rotation, to find the appropriate and most useful perspectives. Indirect means of presentation, e.g. via animation, can only give the user a vague idea of the model and all its details, especially with the growing amount of data incorporated. Additionally, discussing such a model with colleagues is often restricted by the space in front of the monitor of the system running the modeling software. And while the accessibility of such models has been improved, e.g. through access via ordinary web browsers, new technologies such as VR and AR could open up novel and improved ways for users to experience and share them.

Although VR has found its way into the mainstream, especially for entertainment, it continues to be a relatively inaccessible technology. The high upfront cost, the need to isolate oneself from the surrounding environment, and the technical requirements involved detract from the end goal of improving the accessibility of 3D geological models. On the other hand, more and more common handheld devices such as smartphones and tablets support AR and thus lower the barrier of entry for a large number of people. To analyze the potential of AR for the presentation and discussion of 3D geological models, a mobile app has been developed.

Started as a prototype during a geoscience hackathon, the app has now been rewritten from scratch and uploaded to the iOS App Store. The immense potential already became apparent during the conceptualization phase of the features. The app itself allows users to download a number of 3D geological models to their device and explore them in AR. They can then share a model with up to seven other peers in the same room. This means that every user sees the model in the same space and in the same state: as soon as one user changes, e.g., the size or rotation of the model, the new state is synchronized with every connected peer. Discussion is aided by "pointing" and "highlighting" features to ensure that everyone is talking about the same part of the model. The models are either stored on the device or can be downloaded via the internet. For now, the models are supplied by GiGa infosystem's GST Web, but additional sources are being explored.

The delivery of the app with this basic feature set invites initial user feedback and allows for a better exploration of possible applications. For example, viable use cases can be found in academia, as an easier way to communicate 3D models to students; during conferences, as a presentation platform to give peers a guided tour of a model; or in modelling, where advanced features such as digital boreholes or cross-sections could help verify intermediate results.

How to cite: Wieczoreck, B.: Collaborative Visualization of 3D Geological Models in Augmented Reality, EGU General Assembly 2021, online, 19–30 Apr 2021, EGU21-15802, https://doi.org/10.5194/egusphere-egu21-15802, 2021.
