Luis B. Almeida
INESC-ID, Lisbon, Portugal, 2005
Description of the data sets
Contents of each data set
Reading and using the data
Data set production
In Version 2.0, five additional data sets (numbers 4 to 8) have been included in the database. These data sets were acquired in conditions that were as symmetrical as possible. They include sources that are the same as, or similar to, those of the three initial sets, but also include a mixture of printed text with natural scenes, and a mixture of two printed text images. The "bars" mixture is new, having been produced, printed and acquired under better-controlled conditions.
When we scan or photograph a paper document, the image printed on the back page often shows through, due to partial transparency of the paper. The data sets given here correspond to a severe case of this problem, in which the paper that was used was "onion skin" (for readers whose native language is not English, onion skin paper is the semi-transparent paper that is often used in professional drawing).
The strong transparency of the onion skin paper produces a strong mixture of the images from the front and back pages. This mixture is significantly nonlinear. Since two different mixtures can be obtained by scanning both sides of the paper, these mixtures are good candidates for nonlinear source separation. The mixtures are close to singular in the lighter parts of the images.
The data sets are divided into three groups:
In set #1 the source images are formed by bars (horizontal in one of the images and vertical in the other), each bar having a uniform gray tone. As such, the two source images are exactly independent of each other.
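The independence of two bar images of this kind can be checked numerically. The following is a minimal sketch on synthetic stand-ins, not on the actual source images: the gray levels, the bar thickness and all variable names are assumptions chosen for illustration. It builds one image with horizontal bars and one with vertical bars, and verifies that the joint distribution of gray-level pairs equals the product of the two marginal distributions.

```matlab
% Synthetic stand-ins for the bar sources (all values are assumptions).
levels1 = [0 64 128 255];            % gray tone of each horizontal bar
levels2 = [32 96 160 224];           % gray tone of each vertical bar
barSize = 8;                         % bar thickness in pixels

rows = kron(levels1(:), ones(barSize, 1));     % one gray level per image row
cols = kron(levels2(:)', ones(1, barSize));    % one gray level per image column
s1 = repmat(rows, 1, numel(cols));             % image with horizontal bars
s2 = repmat(cols, numel(rows), 1);             % image with vertical bars

% Joint histogram of gray-level pairs, normalized to a probability table.
[~, i1] = ismember(s1(:), levels1);
[~, i2] = ismember(s2(:), levels2);
joint = accumarray([i1 i2], 1, [numel(levels1) numel(levels2)]) / numel(s1);

% Independence: the joint table equals the outer product of its marginals.
p1 = sum(joint, 2);
p2 = sum(joint, 1);
maxDeviation = max(max(abs(joint - p1 * p2)));
```

The same check applied to the scanned mixtures, rather than to ideal sources, would not be expected to give a zero deviation.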
This was the first set to be acquired. We no longer have access to the scanner that was used. It seems to have had some "memory" effect, which is apparent in some of the light-to-dark or dark-to-light transitions. This data set is provided for historical reasons only. For new work we suggest using set #4 instead.
Sets #2 and #3 were acquired with a Canon LIDE 50 desktop scanner. The scanner's "automatic image adjustment" option was turned on. The exact operation of this option is not described in the scanner's documentation, but it seems to automatically adjust the brightness, contrast and gamma value of the image being acquired. Since these automatic adjustments will normally have acted differently on the two pages of a data set, the mixtures are probably slightly asymmetrical.
In set #2 the source images are two scenery photographs with a large variability and relatively small details. This produces a good mixing of the gray levels of both images, and therefore the images' intensities are close to independent. However, the large variability and the small details tend to make weak image superpositions hard to notice visually.
In set #3 the source images are two scenery photographs with large, quasi-uniform areas. This produces a weaker mixing of the gray levels of both images, resulting in images whose intensities are not independent. However, weak image superpositions are easier to notice due to the existence of the quasi-uniform areas.
Sets #4 to #8 were acquired with a Canon LIDE 50 desktop scanner. The scanner's "automatic image adjustment" option was turned off. Therefore the processing of the two sides of the paper was as identical as possible. The only source of asymmetry appears to be the fact that the second image of each mixture was aligned relative to the first image. Therefore the second image was slightly modified (mainly in position, although that involved interpolating intensities) while the first was not.
Each data set is contained in a zip file. Each zip file contains six image files in bitmap format (256 levels, mapped to grayscale). For data set #N the files are:
The data sets can be downloaded here:
The images are provided in bitmap format, and should be easily usable in a wide variety of programs and programming languages.
To read an image into Matlab use a command such as
>> x1 = imread('set1_x1.bmp');
This will create a 'uint8' array named x1, which contains the image. Note that for processing the data you'll probably need to then convert to 'double' format, using a command like
>> x1 = double(x1);
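As a hedged illustration of what typically follows the conversion, the sketch below stacks two mixture images into a 2-by-N matrix, with one row per mixture and one column per pixel, which is a common input layout for source separation code. Small synthetic arrays stand in for the images so that the fragment runs without the data files. The name X and the scaling to the range [0, 1] are choices made here, not something prescribed by the data sets.

```matlab
% Placeholders for two mixture images as returned by imread (uint8 arrays).
x1 = uint8(magic(4) * 15);           % stands in for the first mixture image
x2 = uint8(255 - magic(4) * 15);     % stands in for the second mixture image

% Convert to double and scale the 256 gray levels to the range [0, 1].
x1 = double(x1) / 255;
x2 = double(x2) / 255;

% One row per mixture, one column per pixel (column-major pixel order).
X = [x1(:)'; x2(:)'];

% A separated component y (1-by-N) can be turned back into an image with
% reshape(y, size(x1)).
```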
This section gives some details of how the data sets were produced.
The images from each pair were printed on opposite faces of a sheet of onion skin paper. Printing was performed with a Hewlett-Packard LaserJet 2200 printer, at a resolution of 1200 dpi, using the printer's default halftoning system. Both faces of the sheet of onion skin paper were then scanned with a desktop scanner, at a resolution of 100 dpi. This low resolution was chosen on purpose, so that the printer's halftoning grid would not be very apparent in the scanned images. Set #1 was acquired with a scanner different from the one used for the remaining sets. I don't have access to this scanner any more, and I don't know its brand. All other sets were acquired with a Canon LIDE 50 desktop scanner. In all cases the scanners' de-screening option (which reduces the visibility of the halftoning grid) was turned on.
In each pair of acquired images, one of them was horizontally flipped, so
that both images would have the same orientation. Then the two images were
aligned, as described next.
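In Matlab, a horizontal flip of this kind can be done with fliplr. The fragment below is only an illustration on a tiny placeholder array; the variable names are hypothetical.

```matlab
back = uint8([1 2 3; 4 5 6]);    % placeholder for the scan of the back face
backFlipped = fliplr(back);      % mirror the image left to right
```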
In preliminary tests it was found that even a very careful alignment, using translation, rotation and shear operations on the whole images, could not perform a good simultaneous alignment of all parts of the images. This was probably due to slight geometrical distortions introduced by the scanner. This indicated that an automatic, local alignment was needed. That automatic alignment, in turn, relaxed the demands placed on the initial manual alignment.
In the procedure that was finally adopted, the first step consisted just of a
manual displacement of one of the images by an integer number of pixels in each
direction, so that the two images would be coarsely aligned with each other. In
a second step an automatic, local alignment was performed. For this alignment,
the resolution of both images was first increased by a factor of four in each
direction, using bicubic interpolation. Then one of the images was divided into
100x100 pixel squares (corresponding to 25x25 pixels in the original image), and
for each square the best displacement was found, based on the maximum of the
cross-correlation with the other image. The whole image was then rebuilt, based
on these optimal displacements, and its resolution was finally reduced by a
factor of 4. In this way, a local alignment with a resolution of 1/4 pixel was obtained.
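The local alignment step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code that was used to produce the data sets: the search range maxDisp, the use of a zero-mean normalized correlation as the matching score, the handling of partial blocks at the image borders, and the synthetic test pair are all choices made here. The 'cubic' method of interp2 is used as a stand-in for bicubic interpolation, and the final resolution reduction is done by plain subsampling.

```matlab
up      = 4;      % upsampling factor, as in the text
blk     = 100;    % block side in the upsampled image, as in the text
maxDisp = 8;      % search range in upsampled pixels (assumed)

% Synthetic test pair: 'moving' is 'fixed' displaced by one pixel.
[c, r] = meshgrid(1:75, 1:50);
fixed  = 128 + 100 * sin(c / 5) .* cos(r / 7);
moving = circshift(fixed, [0 1]);

% Increase the resolution of both images by a factor of four.
[h, w]   = size(fixed);
[cq, rq] = meshgrid(1:1/up:w, 1:1/up:h);
F = interp2(fixed,  cq, rq, 'cubic');
M = interp2(moving, cq, rq, 'cubic');

% For each block, search for the displacement of M that best matches F.
A = M;                                   % rebuilt image, filled block by block
[H, W] = size(F);
for r0 = 1:blk:H
    for c0 = 1:blk:W
        rr = r0:min(r0 + blk - 1, H);    % rows of this block
        cc = c0:min(c0 + blk - 1, W);    % columns of this block
        ref = F(rr, cc);
        ref = ref - mean(ref(:));
        best = -Inf;
        for dr = -maxDisp:maxDisp
            for dc = -maxDisp:maxDisp
                rs = rr + dr;
                cs = cc + dc;
                if rs(1) < 1 || rs(end) > H || cs(1) < 1 || cs(end) > W
                    continue;            % displaced block leaves the image
                end
                cand = M(rs, cs);
                z = cand - mean(cand(:));
                score = sum(ref(:) .* z(:)) / (norm(ref(:)) * norm(z(:)) + eps);
                if score > best
                    best = score;
                    A(rr, cc) = cand;    % keep the best-matching patch so far
                end
            end
        end
    end
end

% Reduce the resolution by a factor of four (plain subsampling).
aligned = A(1:up:end, 1:up:end);
```

Because displacements are searched on the upsampled grid, each step of one upsampled pixel corresponds to 1/4 pixel in the original image. Blocks close to the image border can only be matched with displacements that keep them inside the image, so the result is less reliable there.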
The source images originally had a resolution which was very different from that of the acquired ones. In order for useful comparisons to be possible, the source images were reduced to the same resolution as the acquired ones, and were aligned with the corresponding "recovered source" images, extracted from the mixtures using nonlinear independent component analysis (see [1] for the nonlinear ICA technique). This alignment used the alignment procedure that was described in the previous section. The sources that are included in these data sets are the result of the reduction in resolution and alignment, and were cropped to correspond, as closely as possible, to the aligned mixture images. Note that, in the sources of data sets #1 and #4, gray levels different from the original ones are visible at the edges of the bars due to the alignment procedure, which involved a change of resolution, performed with bicubic interpolation, followed by displacement and decimation.
These data sets are copyright of Luis B. Almeida. Data set #1 is also copyright of Miguel Faria. Free permission is given for their use for nonprofit research purposes. Any other use is prohibited, unless a license is previously obtained. To obtain a license please contact Luis B. Almeida.
Luis B. Almeida
Instituto de Telecomunicações
Instituto Superior Técnico
Av. Rovisco Pais, 1
E-mail (delete the first two c's, which are there to prevent spamming): firstname.lastname@example.org
Home page: http://www.lx.it.pt/~lbalmeida/
The following preprint gives an example of linear and nonlinear source separation performed on data sets 4 to 8.
[1] L. B. Almeida, "Separating a real-life nonlinear image mixture", submitted to Journal of Machine Learning Research, 2005. Available at: http://www.lx.it.pt/~lbalmeida/papers/AlmeidaJMLR05.pdf.
The mixture images as acquired (without preprocessing) were also included in the data sets.
The images are now supplied in bitmap format, since this widens their usability.
Initial version, supplied in the '.mat' format of Matlab 7.