Storing compressed images in SQLite with Python

I am working on a project that requires working with a large number of images of galaxies. While a million files is something the Linux file system handles easily, it can be problematic to manage. For instance, in order to transfer the full image catalog (~2 GB) it has to be packed into a zip or tar file, which takes a long time (although much less than a direct transfer of the individual files). Also, in the case of distributed file systems like those used in Big Data applications, a very large set of small (a few kB) files is a worst-case scenario.

One solution that has worked well for me so far is storing the images in an SQLite database. I found lots of information on the web about this problem, but all of it consisted of partial solutions. So, as a reminder to myself, and hoping that this may be useful to other people, here is my end-to-end solution.

SQLite lets us store all the images in a single database file (as far as I know this is not the case for other databases), or even in RAM, while still allowing fast operations on individual images. According to this paper, data stored in blobs can be read about 30% faster than from a file system (although the study is old).

The key idea here is to store the compressed (encoded) images as raw binary blobs in the database and then decode them as needed. This saves both disk space and reading time.

Below is code that takes all the jpg files in a folder and stores them in an SQLite database (the actual code I use reads another database to extract the file names and properties of the galaxies).


import os
import glob
import numpy as np
import sqlite3
import cv2

PATH = '/home/miguel/Science/DATA/SDSS/IMAGES/SQLite/Images/'

Get the list of files in the folder:

 
#--- Extract files from folder following pattern
files   = glob.glob(PATH+"*.jpg")
n_files = len(files)
print('Number of files in folder: ', n_files)

Create a table in a database with three columns:

  1. ObjId: The unique Id of the galaxy, also used as filename.
  2. img: The raw bytes of the image; these have to be decoded when read.
  3. size: The size of the image in pixels (my images are square).

#--- Simple database creator
def create_db(filename):
    db = sqlite3.connect(filename)
    cursor = db.cursor()
    cursor.execute("DROP TABLE IF EXISTS Images")
    cursor.execute("CREATE TABLE Images(ObjId INT, img BLOB, size INT)")
    db.commit()
    db.close()

filename_db = 'DR7_IMAGES.db'
create_db(filename_db)

Now we iterate over the files, read their raw content in binary format and insert them into the table.

#--- Open database and loop over files to insert in database
con = sqlite3.connect(filename_db)
cur = con.cursor()
for i, file_i in enumerate(files):

    #--- Read image as a binary blob
    with open(file_i, 'rb') as f:
        image_bytes = f.read()

    #--- Decode raw bytes to get image size
    nparr  = np.frombuffer(image_bytes, np.uint8)
    img_np = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    image_size = img_np.shape[1]

    #--- Extract file name without extension
    filename = os.path.relpath(file_i, PATH)
    objid = int(os.path.splitext(filename)[0])

    #--- Insert image and data into table
    cur.execute("insert into Images VALUES(?,?,?)", (objid,sqlite3.Binary(image_bytes),image_size)   )
    con.commit()

    #--- Cheap progress
    if ((i % 100) == 0):
        print(i, n_files)

cur.close()
con.close()

Note the code that reads the image as a binary blob:

with open(file_i, 'rb') as f:
    image_bytes = f.read()

 

I wanted to store the image size, so I had to “decode” the blob. This step is not necessary if you just want to store the images as in a file system. The first line converts the raw bytes into a uint8 (unsigned char) numpy array. The second line does the magic of decoding the raw bytes into an image; it actually decompresses the jpg data:


nparr = np.frombuffer(image_bytes, np.uint8)

img_np = cv2.imdecode(nparr, cv2.IMREAD_COLOR)

I saw a few websites where the blob was passed using a buffer. I am instead passing the image blob using a SQLite function; I don’t know which way is better. I suppose using sqlite3’s own functions leads to more maintainable code (or not…):

sqlite3.Binary(image_bytes)
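
For what it is worth, in Python 3 sqlite3.Binary is (as far as I can tell) just an alias for memoryview, so passing the raw bytes directly should also work. A minimal sketch of the two forms, under that assumption:

#--- Both forms should store the same BLOB (assumption: Python 3, where
#    bytes objects are bound to BLOB columns directly)
cur.execute("INSERT INTO Images VALUES(?,?,?)",
            (objid, sqlite3.Binary(image_bytes), image_size))
# ...should be equivalent to:
# cur.execute("INSERT INTO Images VALUES(?,?,?)",
#             (objid, image_bytes, image_size))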

Now we read some images back from the database:

con = sqlite3.connect(filename_db)
cur = con.cursor()
rows = cur.execute("SELECT ObjId, img FROM Images")
for ObjId, item in rows:

    #--- Decode blob
    nparr  = np.frombuffer(item, np.uint8)
    img_np = cv2.imdecode(nparr, cv2.IMREAD_COLOR)

    #--- Display image
    cv2.imshow('image',img_np)
    k = cv2.waitKey(0)
    if k == 27:         # wait for ESC key to exit
        cv2.destroyAllWindows()
        break

cur.close()
con.close()
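
Since the whole point is fast access to individual galaxies, here is a minimal sketch of reading a single image by its ObjId (adding an index or PRIMARY KEY on ObjId would make this lookup faster):

#--- Fetch and decode a single image by its ObjId
con = sqlite3.connect(filename_db)
cur = con.cursor()
some_objid = objid   # e.g. the last galaxy inserted above
cur.execute("SELECT img, size FROM Images WHERE ObjId = ?", (some_objid,))
blob, image_size = cur.fetchone()
img_np = cv2.imdecode(np.frombuffer(blob, np.uint8), cv2.IMREAD_COLOR)
con.close()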

 

Finally, here is the full code:

import os
import glob
import numpy as np
import sqlite3
import cv2

PATH_IMAGES = 'Images/'

#--- Extract files from folder following pattern
files   = glob.glob(PATH_IMAGES+"*.jpg")
n_files = len(files)
print('Number of files in folder: ', n_files)

#--- Simple database creator
def create_db(filename):
    db = sqlite3.connect(filename)
    cursor = db.cursor()
    cursor.execute("DROP TABLE IF EXISTS Images")
    cursor.execute("CREATE TABLE Images(ObjId INT, img BLOB, size INT)")
    db.commit()
    db.close()

filename_db = 'testdb_0.db'
create_db(filename_db)

#--- Open database and loop over files to insert in database
con = sqlite3.connect(filename_db)
cur = con.cursor()
for i, file_i in enumerate(files):

    #--- Read image as a binary blob
    with open(file_i, 'rb') as f:
        image_bytes = f.read()

    #--- Decode raw bytes to get image size
    nparr  = np.frombuffer(image_bytes, np.uint8)
    img_np = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    image_size = img_np.shape[1]

    #--- Extract file name without extension
    filename = os.path.relpath(file_i, PATH_IMAGES)
    objid = int(os.path.splitext(filename)[0])

    #--- Insert image and data into table
    cur.execute("insert into Images VALUES(?,?,?)", (objid,sqlite3.Binary(image_bytes),image_size)   )
    con.commit()

    #--- Cheap progress
    if ((i % 100) == 0):
        print(i, n_files)

cur.close()
con.close()

con = sqlite3.connect(filename_db)
cur = con.cursor()
rows = cur.execute("SELECT ObjId, img FROM Images")
for ObjId, item in rows:

    #--- Decode blob
    nparr  = np.frombuffer(item, np.uint8)
    img_np = cv2.imdecode(nparr, cv2.IMREAD_COLOR)

    #--- Display image
    cv2.imshow('image',img_np)
    k = cv2.waitKey(0)
    if k == 27:         # wait for ESC key to exit
        cv2.destroyAllWindows()
        break

cur.close()
con.close()

SDSS FITS cutouts

For quite some time I have been trying to get image cutouts from the SDSS in FITS format. The jpg images in the cutout service are OK for simple tasks but not for anything serious. Recently a colleague told me about Montage, a great tool to automate FITS cutouts, mosaic generation, coverage masks, etc. Since Montage doesn’t include a script to retrieve SDSS cutouts in a simple way (there is actually one, but I could not make it run), I wrote the following bash script based on Inseok Song’s Pleiades script.

You need to install Montage first and add Montage’s bin folder to your PATH:


Download and install Montage:

http://montage.ipac.caltech.edu/


Add Montage’s bin folder to your path; this can be done in your .bashrc file. My Montage installation is in /home/miguel/Packages/Montage:


export PATH="/home/miguel/Packages/Montage/bin:$PATH"


 

This is the FITS cutout script:

#!/bin/bash
#
# Cutout fits images from SDSS
#
# Adapted from Inseok Song, 2007
# http://irsa.ipac.caltech.edu/docs/howto/DSS_pleiades_mosaic.html
#
# Arguments:
# $1 ra
# $2 dec
# $3 size
# $4 base_name
#
# Usage: sh get_sdss_atlas.sh 199.54198 -1.2436732 0.05 gal_000
# Miguel Aragon-Calvo, 2017
#
echo ra $1
echo dec $2
echo size $3
echo file $4

#--- Working directory
mkdir $4;
cd $4;

mHdr -p 0.4 "\"$1 $2\"" $3 $4.hdr

for bands in u g r i z; do echo Processing ${bands};
mkdir $bands;
cd $bands;
mkdir raw projected;
cd raw;
mArchiveList sdss ${bands} "\"$1 $2\"" $3 $3 remote.tbl;
mArchiveExec remote.tbl;
cd .. ;
mImgtbl raw rimages.tbl ;
mProjExec -p raw rimages.tbl ../$4.hdr projected stats.tbl ;
mImgtbl projected pimages.tbl ;
mAdd -p projected pimages.tbl ../$4.hdr ${bands}.fits ;
cd .. ;
mv ${bands}/${bands}.fits .
rm -rf ${bands}
done

cd .. ;
mJPEG -blue $4/g.fits -1s "99.999%" gaussian-log -green $4/r.fits -1s "99.999%" gaussian-log -red $4/i.fits -1s "99.999%" gaussian-log -out $4.jpg

For instance, to retrieve and generate FITS images for all SDSS filters (u, g, r, i, z) centered at ra=199.54198, dec=-1.2436732 (the galaxy at the top of this post), with a size of 0.05 degrees, we call the script as follows:

$ sh get_sdss_atlas.sh 199.54198 -1.2436732 0.05 gal_000

The final FITS files can be found in the folder gal_000 as gal_000/u.fits, gal_000/g.fits, etc.
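
If you need cutouts for a whole catalog of galaxies, the script can simply be called in a loop. Here is a minimal Python sketch (catalog.txt, with one "ra dec name" entry per line, is a hypothetical input file):

import subprocess

#--- Call the cutout script for every galaxy in a (hypothetical) catalog
with open('catalog.txt') as f:
    for line in f:
        ra, dec, name = line.split()
        subprocess.run(['sh', 'get_sdss_atlas.sh', ra, dec, '0.05', name],
                       check=True)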

The last line produces a jpg image using logarithmic scaling; you can comment it out if you only need the FITS files. The asinh scaling (http://adsabs.harvard.edu/abs/2004PASP..116..133L) is a better option.

Note that Montage will retrieve FITS files of complete SDSS fields, probably more than one file per band, and from these it will cut out the desired image. In the case of (small) individual galaxies this can be very IO-wasteful: for the galaxy requested above, Montage downloads ~400 MB of SDSS field FITS files in order to generate 5 FITS files of 1.6 MB each. The script automatically deletes the intermediate files.

Delaunay volumes

This post shows some experiments I have been doing with pseudo volume rendering using Delaunay tessellations. While there are lots of techniques that can be used to display large point datasets, there are very few for visualizing sparsely sampled points. This is a very common situation when dealing with real (astronomical) data like galaxy redshift catalogues.

This technique does not produce “correct” volume renderings, but it is very fast, uses memory efficiently and looks very cool. The approach is based on the Delaunay tessellation of the point distribution. The density at each point is estimated as being inversely proportional to the volume of the contiguous Voronoi cell of the point (the union of the Delaunay tetrahedra sharing that point). The density can then be computed at any point inside the convex hull of the point distribution by linearly interpolating between the 4 vertices of the tetrahedron containing the sampling point.
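
As a rough illustration of the density estimate, here is a minimal sketch using scipy’s Delaunay class and the standard (D+1)/W normalization (this is not the code used for the renderings):

import math
import numpy as np
from scipy.spatial import Delaunay

def dtfe_density(points):
    """Density ~ (D+1)/W_i, with W_i the volume of the contiguous Voronoi
    cell of point i (i.e. the Delaunay simplices sharing vertex i)."""
    tri = Delaunay(points)
    dim = points.shape[1]
    wi  = np.zeros(len(points))
    for simplex in tri.simplices:
        verts = points[simplex]
        vol = abs(np.linalg.det(verts[1:] - verts[0])) / math.factorial(dim)
        wi[simplex] += vol    # each vertex of the simplex collects its volume
    return (dim + 1.0) / wi, tri

def interpolate(tri, dens, x):
    """Linear interpolation at x via the barycentric coordinates of the
    containing simplex (x must lie inside the convex hull)."""
    s = int(tri.find_simplex(x))
    b = tri.transform[s, :-1] @ (x - tri.transform[s, -1])
    bary = np.append(b, 1.0 - b.sum())
    return bary @ dens[tri.simplices[s]]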

This is a pseudo volume rendering: instead of computing the integral along the line of sight for each pixel, one sums the interpolated values over the faces of the tetrahedra that intersect the line of sight. This is just a complicated way of saying that one simply displays the triangles, using the density at each vertex to color them. While this is technically wrong, for visualization purposes it actually works very well. It has the advantage of not requiring any computation from the CPU, since all the drawing is done in the graphics card.

This is a very simple OpenGL application and it still needs a lot of work. In regions with high density one could use GL points or point quads. I disabled z-ordering, so it is not possible to assign opacity to the triangles. I am also working on a .obj file loader to visualize objects like 3D survey masks, fancy axes, field lines, etc.

The advantage of using a triangulation becomes clear in the low-density regions: one can resolve filamentary structures with fewer than 30 particles!

The images are rendered from a low-resolution N-body simulation. The mass per particle is on the order of 10^11 solar masses.

A fly through the Universe

This post describes the main steps in the making of an SDSS fly-through I have been developing for some time. I started with POV-Ray but moved to Blender due to its flexibility, especially its Python API.

Since its first light, the SDSS has been continuously taking both images and spectra across the northern sky. The last data release (DR7) contains almost one million galaxies with spectra and many more galaxy images. Of the roughly one million galaxies with measured redshifts, only a very small fraction has high-quality images. This is mainly due to the intrinsic resolution of the telescope and the atmospheric conditions at the time of observation. In order to produce images with the desired quality, each galaxy in the spectroscopic sample was matched with a high-resolution image from a template set of galaxies. The matching was done by choosing the closest template galaxy in color and concentration-index space. Elliptical galaxies were displayed as quad sprites with a dense bulge and a diffuse 3D halo.
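
A minimal sketch of this matching step, assuming scipy’s cKDTree and hypothetical (color, concentration-index) feature arrays (the actual matching code is not shown here):

import numpy as np
from scipy.spatial import cKDTree

def match_templates(sample_feats, template_feats):
    """For each row of sample_feats (color, concentration index), return the
    index of the nearest template galaxy in that 2D feature space."""
    tree = cKDTree(template_feats)
    _, idx = tree.query(sample_feats, k=1)
    return idx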

Galaxy template

Loading galaxies into Blender
Once all galaxies have been assigned an image from the template set, the next step is to load each galaxy’s position, size, inclination and image into Blender. The Python interface provided by Blender makes this very easy.

Galaxies were binned by their respective templates. Then all galaxies sharing a template were loaded into a single object in Blender, so there are 300 objects, each holding several thousand quads (one per galaxy). This saves memory since there is no need to store all the possible properties for each individual galaxy.
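
A minimal sketch of this step, assuming Blender 2.8+ and a hypothetical list of (x, y, z, size) tuples for the galaxies sharing one template (the real script also sets the inclination and assigns the template image as a texture):

import bpy

def add_template_object(name, galaxies):
    """Build a single Blender object holding one quad per galaxy."""
    verts, faces = [], []
    for i, (x, y, z, size) in enumerate(galaxies):
        s = 0.5 * size
        verts += [(x - s, y - s, z), (x + s, y - s, z),
                  (x + s, y + s, z), (x - s, y + s, z)]
        faces.append((4*i, 4*i + 1, 4*i + 2, 4*i + 3))
    mesh = bpy.data.meshes.new(name)
    mesh.from_pydata(verts, [], faces)
    obj = bpy.data.objects.new(name, mesh)
    bpy.context.collection.objects.link(obj)
    return obj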

Loading galaxies into Blender

In progress…