Friday, August 7, 2020

COCO dataset from scratch: try and fail ...

Detectron2 provides several algorithms for instance segmentation, so it was tempting to submit the overlapping chromosomes datasets to one of them. However, to use one of these algorithms, the dataset apparently has to follow the MS-COCO format.

One available dataset consists of 2164 pairs of grayscale and ground-truth images. As a first try, a minimal dataset with one image and two instances can be converted to COCO format:

 

The two instances (right) are obtained from the ground-truth image showing the overlapping chromosomes. The instances are NumPy arrays, which can be saved as PNG images. To generate a COCO dataset associated with the grayscale image (left), the following steps were followed:

  • Generate a Python dictionary according to the COCO format specification found in the detectron2 documentation, converting each binary mask to its bounding box and compressed RLE using pycocotools.
  • Save the dictionary as a JSON file.
  • Load the JSON file with pycocotools (or detectron2) in order to visualize, if possible, the instances overlaid on the grayscale image.
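The first two steps can be sketched as follows. This is not the notebook's code, only a minimal illustration under stated assumptions: the helper names, the file name overlap_0.png and the category name "chromosome" are hypothetical, and the masks are encoded as COCO uncompressed RLE with plain NumPy instead of the compressed RLE produced by pycocotools. With pycocotools, `pycocotools.mask.encode` expects a Fortran-ordered uint8 array and returns `counts` as bytes, which have to be decoded to a string before the dictionary can be written as JSON.

```python
import json
import numpy as np


def mask_to_bbox(mask):
    """Return [x, y, width, height] of the non-zero region of a 2D mask."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    x, y = int(cols[0]), int(rows[0])
    return [x, y, int(cols[-1]) - x + 1, int(rows[-1]) - y + 1]


def mask_to_uncompressed_rle(mask):
    """Encode a 2D binary mask as COCO uncompressed RLE.

    Pixels are read column by column and the first count is always
    the number of leading zeros (possibly 0).
    """
    flat = (np.asarray(mask) > 0).astype(np.uint8).flatten(order="F")
    change = np.flatnonzero(np.diff(flat)) + 1       # indices where a new run starts
    boundaries = np.concatenate(([0], change, [flat.size]))
    counts = np.diff(boundaries).tolist()            # run lengths as Python ints
    if flat[0] == 1:
        counts = [0] + counts                        # no leading zeros
    return {"size": [int(mask.shape[0]), int(mask.shape[1])], "counts": counts}


def build_coco_dict(file_name, masks, category_name="chromosome"):
    """Build a COCO-style dict for one image and its instance masks."""
    height, width = masks[0].shape
    annotations = []
    for ann_id, mask in enumerate(masks, start=1):
        binary = np.asarray(mask) > 0
        annotations.append({
            "id": ann_id,
            "image_id": 1,
            "category_id": 1,
            "bbox": mask_to_bbox(binary),
            "area": int(binary.sum()),
            "iscrowd": 0,
            "segmentation": mask_to_uncompressed_rle(binary),
        })
    return {
        "images": [{"id": 1, "file_name": file_name,
                    "height": int(height), "width": int(width)}],
        "categories": [{"id": 1, "name": category_name}],
        "annotations": annotations,
    }


# Two tiny overlapping masks standing in for the two chromosome instances.
mask_a = np.zeros((4, 5), dtype=np.uint8)
mask_a[1:3, 0:2] = 1
mask_b = np.zeros((4, 5), dtype=np.uint8)
mask_b[2:4, 1:4] = 1

coco = build_coco_dict("overlap_0.png", [mask_a, mask_b])  # hypothetical file name
coco_json = json.dumps(coco, indent=2)  # json.dump(coco, f) would write it to a file
```

Every value is cast to a plain Python int or list before serialization, because NumPy scalars and bytes objects are not JSON serializable.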

The whole process is available in a Jupyter notebook on Kaggle.

Unfortunately, the result is not a valid COCO dataset, as the dataset registration fails. I hope to get some help on the PyTorch forum or on Stack Overflow.
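The cause of the failure is not identified here. While waiting for an answer, a structural check of the dictionary may help narrow it down. The sketch below is a hypothetical helper, not part of pycocotools or detectron2; it only verifies a few fields that the COCO format expects (top-level lists, image fields, consistent ids, four-element bounding boxes) and flags `counts` stored as bytes, which cannot be written to JSON.

```python
def check_coco_structure(coco):
    """Return a list of problems found in a COCO-style dict (empty if none)."""
    problems = []
    for key in ("images", "annotations", "categories"):
        if not isinstance(coco.get(key), list):
            problems.append(f"missing or non-list top-level key: {key}")
    if problems:
        return problems

    for img in coco["images"]:
        for field in ("id", "file_name", "height", "width"):
            if field not in img:
                problems.append(f"image entry lacks '{field}'")

    image_ids = {img.get("id") for img in coco["images"]}
    category_ids = {cat.get("id") for cat in coco["categories"]}
    seen_ids = set()
    for ann in coco["annotations"]:
        ann_id = ann.get("id")
        if ann_id in seen_ids:
            problems.append(f"duplicate annotation id: {ann_id}")
        seen_ids.add(ann_id)
        if ann.get("image_id") not in image_ids:
            problems.append(f"annotation {ann_id}: unknown image_id")
        if ann.get("category_id") not in category_ids:
            problems.append(f"annotation {ann_id}: unknown category_id")
        bbox = ann.get("bbox")
        if not (isinstance(bbox, list) and len(bbox) == 4):
            problems.append(f"annotation {ann_id}: bbox is not [x, y, w, h]")
        segm = ann.get("segmentation")
        if isinstance(segm, dict) and isinstance(segm.get("counts"), bytes):
            problems.append(f"annotation {ann_id}: RLE counts are bytes, not str")
    return problems


# A consistent toy dataset and a broken variant (both made up for illustration).
good = {
    "images": [{"id": 1, "file_name": "overlap_0.png", "height": 4, "width": 5}],
    "categories": [{"id": 1, "name": "chromosome"}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1,
                     "bbox": [0, 1, 2, 2], "area": 4, "iscrowd": 0,
                     "segmentation": {"size": [4, 5], "counts": [1, 2, 2, 2, 13]}}],
}
bad = {
    "images": [{"id": 1, "file_name": "overlap_0.png"}],
    "categories": [{"id": 1, "name": "chromosome"}],
    "annotations": [{"id": 1, "image_id": 2, "category_id": 1,
                     "bbox": [0, 1, 2],
                     "segmentation": {"size": [4, 5], "counts": b"12"}}],
}

good_report = check_coco_structure(good)
bad_report = check_coco_structure(bad)
```

An empty report does not guarantee that registration will succeed; it only rules out the basic inconsistencies listed above.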