Speaker
Description
The Algorithms Research and Development Group (ARDG) at the National Radio Astronomy Observatory (NRAO) has been using HTCondor and compute resources on the Open Science Grid (OSG) to improve throughput in radio astronomy imaging by up to two orders of magnitude in single imaging workflows, and we are now working to extend these capabilities to multiple imaging workflows. Besides developing and maintaining software to process data efficiently and manage the imaging workflows, we have found that monitoring and diagnosing workflow executions is critical to achieving and maintaining high throughput. We will present and discuss some of the tools and methodology we use to assess workflow health and efficiency. These let us visualize and solve problems, and ultimately also contribute to optimizing and advancing the capabilities of HTCondor and the OSG.