Post HERE

# Data-Driven Astronomy

The following problems appeared as assignments in the **coursera course** **Data-Driven Astronomy**.

One of the most widely used formats for astronomical images is the **Flexible Image Transport System**. In a **FITS** file, the image is stored in a numerical array. The **FITS files** shown below are some of the publicly available ones downloaded from the following sites:

- https://www.spacetelescope.org/projects/fits_liberator/datasets_arc…
- https://fits.gsfc.nasa.gov/fits_samples.html
- ftp://legacy.gsfc.nasa.gov/fits_info/sample_files/images
- http://chandra.harvard.edu/photo/openFITS/casa.html
- http://www.sergepetiot.com/
- http://deep.ucolick.org/archive.html.

## 1. Computig the Mean and Median Stacks from a set of noisy FITS files

In this assignment, we shall focuss on calculating the *mean* of a stack of FITS files. Each individual file may or may not have a detected a **pulsar**, but in the final stack we should be able to see a **clear detection**.

### Computig the Mean FITS

The following figure shows ** 5 noisy FITS** files, which will be used to compute the

**mean FITS**file.

The following figure shows the **mean FITS** file computed from thes above FITS files. Mean being an **algebraic** measure, it’s possible to compute **running mean** by loadig each file at a time in the memory.

Computing the Median FITS

Now let’s look at a different statistical measure — the median, which in many cases is considered to be a better measure than the mean due to its robustness to outliers. The median can be a more robust measure of the average trend of datasets than the mean, as the latter is easily skewed by outliers.

However, a naïve implementation of the median algorithm can be *very inefficient* when dealing with large datasets. *Median*, being a **holistic** measure, required all the datasets to be loaded in memory for exact computation, when implemeted i a naive manner.

Calculating the median requires all the data to be in memory at once. This is an issue in typical astrophysics calculations, which may use hundreds of thousands of FITS files. Even with a machine with lots of RAM, it’s not going to be possible to find the median of more than a few tens of thousands of images.This isn’t an issue for calculating the mean, since the sum only requires one image to be added at a time.

### Computing the approximate runing median: the BinApprox Algorithm

If there were a way to calculate a “running median” we could save space by only having one image loaded at a time. Unfortunately, there’s no way to do an exact running median, but there are ways to do it **approximately**.

The **binapprox** algorithm does just this. The idea behind it is to find the median from the data’s histogram.

First of all it ca be proved that media always lies within one standard deviation of the mean, as show below:

The algorithm to find the **running approximate median** is show below:

As soon as the relevant bin is updated the data point being binned can be removed from memory. So if we’re finding the median of a bunch of **FITS** files we only need to have one loaded at any time. (The mean and standard deviation can both be calculated from running sums so that still applies to the first step).

The downside of using **binapprox** is that we only get an answer accurate to **σ/B** by using **B** bins. Scientific data comes with its own uncertainties though, so as long as you keep large enough this isn’t necessarily a problem.

The following figure shows the histogram of **1 million** data points generated randomly. Now, the **binapprox** algorithm will be used to compute the **running median** and the error in approximation will be computed with different number of bins **B**.

As can be seen from above, as the number of bins **B** increases, the **error** in**approximation** of the running **median** decreases.

Now we can use the **binapprox** algorithm to efficiently estimate the **median** of each pixel from a set of astronomy images in **FITS** files. The following figure shows ** 10 noisy FITS** files, which will be used to compute the

**median FITS**file.

The following figure shows the **median FITS** file computed from the above FITS files using the **binapprox** algorithm.

## 2. Cross-matching

When investigating astronomical objects, like **active galactic nuclei (AGN)**, astronomers compare data about those objects from different telescopes at different wavelengths.

This requires positional *cross-matching* to find the closest counterpart within a given radius on the sky.

In this activity you’ll cross-match two catalogues: one from a radio survey, the **AT20G Bright Source Sample (BSS) catalogue** and one from an optical survey, the **SuperCOSMOS all-sky galaxy catalogue**.

The BSS catalogue lists the brightest sources from the AT20G radio survey while the SuperCOSMOS catalogue lists galaxies observed by visible light surveys. If we can find an optical match for our radio source, we are one step closer to working out what kind of object it is, e.g. a galaxy in the local Universe or a distant quasar.

We’ve chosen one small catalogue (BSS has only 320 objects) and one large one (SuperCOSMOS has about 240 million) to demonstrate the issues you can encounter when implementing cross-matching algorithms.

The positions of stars, galaxies and other astronomical objects are usually recorded in either equatorial or Galactic coordinates.

Equatorial coordinates are fixed relative to the celestial sphere, so the positions are independent of when or where the observations took place. They are defined relative to the celestial equator (which is in the same plane as the Earth’s equator) and the ecliptic (the path the sun traces throughout the year).

A point on the celestial sphere is given by two coordinates:

: the angle from the vernal equinox to the point, going east along the celestial equator;*Right ascension*: the angle from the celestial equator to the point, going north (negative values indicate going south).*Declination*

The vernal equinox is the intersection of the celestial equator and the ecliptic where the ecliptic rises above the celestial equator going further east.

- Right ascension is often given in hours-minutes-seconds (HMS) notation, because it was convenient to calculate when a star would appear over the horizon. A full circle in HMS notation is 24 hours, which means 1 hour in HMS notation is equal to 15 degrees. Each hour is split into 60 minutes and each minute into 60 seconds.
- Declination, on the other hand, is traditionally recorded in degrees-minutes-seconds (DMS) notation. A full circle is 360 degrees, each degree has 60 arcminutes and each arcminute has 60 arcseconds.

To **crossmatch** two catalogs we need to compare the angular distance between objects on the celestial sphere.

People loosely call this a “distance”, but technically its an angular distance: the projected angle between objects as seen from Earth.

Angular distances have the same units as angles (degrees). There are other equations for calculating the angular distance but this one, called the **haversine formula**, is good at avoiding floating point errors when the two points are close together.

If we have an object on the celestial sphere with right ascension and declination** (α1,δ1)**, then the angular distance to another object with coordinates**(α2,δ2****)** is given below:

Before we can **crossmatch** our two catalogs we first have to import the raw data. Every astronomy catalog tends to have its own unique format so we’ll need to look at how to do this with each one individually.

We’ll look at the AT20G bright source sample survey first. The raw data we’ll be using is the file table2.dat from this page in the VizieR archives, but we’ll use the filename bss.dat from now on.

Every catalogue in VizieR has a detailed README file that gives you the exact format of each table in the catalogue.

The full catalogue of bright radio sources contains 320 objects. The first few rows look like this:

The catalogue is organised in fixed-width columns, with the format of the columns being:

- 1: Object catalogue ID number (sometimes with an asterisk)
- 2-4: Right ascension in HMS notation
- 5-7: Declination in DMS notation
- 8-: Other information, including spectral intensities

The SuperCOSMOS all-sky catalogue is a catalogue of galaxies generated from several visible light surveys.

The original data is available on this page in a package called SCOS_stepWedges_inWISE.csv.gz. The file is extracted to a csv file named super.csv.

The first few lines of super.csv look like this:

The catalog uses a comma-separated value (CSV) format. Aside from the first row, which contains column labels, the format is:

- 1: Right ascension in decimal degrees
- 2: Declination in decimal degrees
- 3: Other data, including magnitude and apparent shape.

### Naive Cross-matcher

Let’s implement a naive **crossmatch** function that crossmatches two catalogs within a maximum distance and returns a list of matches and non-matches for the first catalog (**BSS**) against the second (**SuperCOSMOS**). The maximum distance is given in decimal degrees (e.g., nearest objects within a distance of **5 degree** will be considered to be matched).

There are ** 320** objects in the

**that are compared with first**

*BSS*catalog**objects from the**

*n***. The values of**

*SuperCOSMOS*catalog**is gradually increased from**

*n***to**

*500***and impact on the running time and the number of matched objected are noted.**

*1,25,000*The below figures shows the time taken for the *naive cross-matching* as the umber of objects in the *second* catalog is increased and also the final matches produced as a *bipartite graph*.

### An efficient Cross-Matcher

Cross-matching is a very common task in astrophysics, so it’s natural that it’s had optimized implementations written of it already. A popular implementation uses objects called k-d trees to perform crossmatching incredibly quickly, by constructing a **k-d tree** out of the *second catalogue*, letting it search through for a match for each object in the *first catalogue* efficiently. Constructing a k-d tree is similar to binary search: the *k-dimensional* space is divided into two parts recursively until each division only contains only a single object. Creating a k-d tree from an astronomy catalogue works like this:

- Find the object with the median right ascension, split the catalogue into objects left and right partitions of this
- Find the objects with the median declination in each partition, split the partitions into smaller partitions of objects down and up of these
- Find the objects with median right ascension in each of the partitions, split the partitions into smaller partitions of objects left and right of these
- Repeat 2-3 until each partition only has one object in it

This creates a binary tree where each object used to split a partition (a node) links to the two objects that then split the partitions it has created (its children), similar to the one show in the image below.

Once a *k-d tree* is costructed out of a catalogue, finding a match to an object then works like this:

- Calculate the distance from the object to highest level node (the root node), then go to the child node closest (in right ascension) to the object
- Calculate the distance from the object to this child, then go to the child node closest (in declination) to the object
- Calculate the distance from the object to this child, then go to the child node closest (in right ascension) to the object
- Repeat 2-3 until you reach a child node with no further children (a leaf node)
- Find the shortest distance of all distances calculated, this corresponds to the closest object

Since each node branches into two children, a catalogue of objects will have, on average, log2(N) nodes from the root to any leaf, as show in the following figure.

As can be seen above, the **kd-tree** based implementation is way more faster than the **naive** counterpart for crossmatching. When the **naive** takes **> 20 mins** to match against **1,25,000** objects in the *second* catalog, the **kd-tree** based implemetation takes just about **1 second** to produce the same set of results, as shown above.