PTN-08 Analyzing Billion-objects Catalog Interactively: Apache Spark For Physicists

Apache Spark is a big-data framework for working on large distributed datasets.
Although widely used in industry, it remains little known in the scientific community, or its use is
often restricted to software engineers. The goal of this paper is to introduce the framework to
newcomers and to show that the technology is mature enough to be used, without excessive
programming skills, by physicists such as astronomers or cosmologists to perform analyses
over large datasets, such as those originating from future galactic surveys. To demonstrate this, we
start from a realistic simulation corresponding to 10 years of LSST data-taking (6 billion
galaxies). We then design, optimize and benchmark a set of Spark Python algorithms in order
to perform standard operations such as adding photometric redshift errors, measuring the selection
function or computing power spectra over tomographic bins. Most of the commands execute
on the full 110 GB dataset within tens of seconds and can therefore be performed
interactively in order to design full-scale cosmological analyses.
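
As an illustration of the kind of operation listed above, the following is a minimal PySpark sketch of adding photometric redshift errors and counting galaxies in redshift shells. It is not taken from the paper: the column names `z`, `zrec` and `bin`, the helper names, the Gaussian error model sigma0 * (1 + z) with sigma0 = 0.03, and the shell boundaries are all assumptions chosen for the example.

```python
def photoz_sigma(z, sigma0=0.03):
    """Assumed error model: Gaussian scatter growing as sigma0 * (1 + z).

    Only uses arithmetic, so it accepts plain floats as well as Spark
    Column objects (which overload + and *).
    """
    return sigma0 * (1.0 + z)


def add_photoz_and_count(df, z_min=0.0, z_max=3.0, n_bins=10, seed=42):
    """Add a smeared redshift column 'zrec' and count galaxies per shell.

    `df` is assumed to be a Spark DataFrame with a true-redshift column 'z'.
    """
    from pyspark.sql import functions as F  # lazy import: needs PySpark

    dz = (z_max - z_min) / n_bins
    # Smear the true redshift with standard-normal noise scaled by the model.
    smeared = df.withColumn(
        "zrec", F.col("z") + photoz_sigma(F.col("z")) * F.randn(seed)
    )
    # Assign each galaxy to an equal-width shell in reconstructed redshift.
    binned = smeared.withColumn("bin", F.floor((F.col("zrec") - z_min) / dz))
    return (
        binned.filter((F.col("bin") >= 0) & (F.col("bin") < n_bins))
        .groupBy("bin")
        .count()
        .orderBy("bin")
    )


def demo():
    """Run the sketch on a toy catalog of uniform redshifts (hypothetical data)."""
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("photoz-sketch").getOrCreate()
    galaxies = spark.range(1_000_000).withColumn("z", F.rand(1) * 2.5)
    add_photoz_and_count(galaxies).show()
    spark.stop()
```

Calling `demo()` in an environment with PySpark installed runs the whole chain on a toy catalog. On a real catalog, the same two `withColumn` calls and the `groupBy` would be applied to the DataFrame read from disk, which is the pattern that lets such commands be issued interactively.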