A major challenge of data-driven biomedical research lies in the collection and representation of data provenance information to ensure reproducibility of findings. To communicate and reproduce multi-step analysis workflows executed on datasets that contain data for dozens or hundreds of samples, it is crucial to be able to visualize the provenance graph at different levels of aggregation. Most existing approaches are based on node-link diagrams, which do not scale to the complexity of typical data provenance graphs. Our proposed approach reduces the complexity of the graph using hierarchical and motif-based aggregation. Based on user actions and graph attributes, a modular degree-of-interest (DoI) function is applied to expand the parts of the graph that are relevant to the user. This interest-driven adaptive provenance visualization approach allows users to review and communicate complex multi-step analyses, which can be based on hundreds of files that are processed by numerous workflows. We integrate our approach into an analysis platform that captures extensive data provenance information and demonstrate its effectiveness by means of a biomedical usage scenario.
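The interplay between hierarchical aggregation and a modular DoI function described above can be illustrated with a minimal sketch. It is not AVOCADO's implementation: the `Node` class, the component names (`selected`, `recency`), the weights, and the threshold rule are all hypothetical, chosen only to show how a weighted combination of user-driven and attribute-driven components might decide which aggregates to expand.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """Hypothetical aggregate node in a provenance hierarchy."""
    name: str
    attrs: Dict[str, float] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


Component = Callable[[Node], float]


def make_doi(components: Dict[str, Component],
             weights: Dict[str, float]) -> Component:
    """Combine modular components into one DoI score (normalized weighted sum)."""
    total = sum(weights[key] for key in components)

    def doi(node: Node) -> float:
        return sum(weights[key] * components[key](node) for key in components) / total

    return doi


def visible_nodes(node: Node, doi: Component, threshold: float) -> List[Node]:
    """Expand an aggregate only if its DoI reaches the threshold; else keep it collapsed."""
    if node.children and doi(node) >= threshold:
        expanded: List[Node] = []
        for child in node.children:
            expanded.extend(visible_nodes(child, doi, threshold))
        return expanded
    return [node]


# Assumed components: one driven by user action, one by a graph attribute.
components = {
    "selected": lambda n: n.attrs.get("selected", 0.0),
    "recency": lambda n: n.attrs.get("recency", 0.0),
}
weights = {"selected": 2.0, "recency": 1.0}
doi = make_doi(components, weights)

workflow_a = Node("workflow_a", {"selected": 1.0, "recency": 0.5},
                  [Node("a1"), Node("a2")])
workflow_b = Node("workflow_b", {"selected": 0.0, "recency": 0.2},
                  [Node("b1"), Node("b2")])
root = Node("analysis", {"selected": 1.0, "recency": 1.0},
            [workflow_a, workflow_b])

names = [n.name for n in visible_nodes(root, doi, threshold=0.5)]
print(names)
```

In this toy example the selected workflow is expanded to its files while the unselected one stays collapsed as a single aggregate node. Because the components are passed in as a dictionary, individual terms can be added, removed, or reweighted without touching the expansion logic, which is one plausible reading of "modular".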
AVOCADO: Visualization of Workflow-Derived Data Provenance for Reproducible Biomedical Research
Computer Graphics Forum (EuroVis '16), vol. 35, no. 3, pp. 481-490, doi:10.1111/cgf.12924, 2016.
We are grateful to Samuel Gratzl for input on the early design and the implementation of AVOCADO, and to the Refinery Platform team (Peter J Park, Shannan Ho Sui, Win Hide, Ilya Sytchev, Jennifer Marx, Scott Ouellette, Fritz Lekschas) for their help with the task definitions and the integration of AVOCADO. This work was funded by the Austrian Research Promotion Agency (FFG 840232), the Austrian Science Fund (FWF P27975-NBL), the US National Institutes of Health (R00 HG007583), and the Harvard Stem Cell Institute.