"See-through" data sets for nonlinear source separation

Luis B. Almeida

INESC-ID, Lisbon, Portugal, 2005

Version 2.0

What's new
Introduction
Description of the data sets
Contents of each data set
Download
Reading and using the data
Data set production
License
Contact
Bibliography
Version information

0. What's new

In Version 2.0, five additional data sets (numbers 4 to 8) have been included in the database. These data sets were acquired in conditions that were as symmetrical as possible. They include sources that are the same as, or similar to, those of the three initial sets, but also include a mixture of printed text with natural scenes, and a mixture of two printed text images. The "bars" mixture is new, having been produced, printed and acquired under better-controlled conditions.

1. Introduction

When we scan or photograph a paper document, the image printed on the back page often shows through, due to partial transparency of the paper. The data sets given here correspond to a severe case of this problem, in which the paper that was used was "onion skin" (for readers whose native language is not English, onion skin paper is the semi-transparent paper that is often used in professional drawing).

The strong transparency of the onion skin paper produces a strong mixture of the images from the front and back pages. This mixture is significantly nonlinear. Since two different mixtures can be obtained by scanning both sides of the paper, these mixtures are good candidates for nonlinear source separation. The mixtures are close to singular in the lighter parts of the images.

2. Description of the data sets

The data sets fall into three groups:

Set 1

The source images are formed by bars (horizontal in one of the images and vertical in the other one), each bar having a uniform gray tone. As such, the two source images are exactly independent from each other.

This was the first set to be acquired. We no longer have access to the scanner that was used. It seems to have had some "memory" effect, which is apparent in some of the light-to-dark and dark-to-light transitions. This data set is provided for historical reasons only. For new work we suggest using set #4 instead.

Sets 2 and 3

These data sets were acquired with a Canon LIDE 50 desktop scanner. The scanner's "automatic image adjustment" option was turned on. The exact operation of this option is not described in the scanner's documentation, but it seems to automatically adjust the brightness, contrast and gamma value of the image being acquired. Since these automatic adjustments will normally have acted differently on the two pages of a data set, the mixtures are probably slightly asymmetrical.

In set #2 the source images are two scenery photographs with a large variability and relatively small details. This produces a good mixing of the gray levels of both images, and therefore the images' intensities are close to independent. However, the large variability and the small details tend to make weak image superpositions hard to notice visually.
In set #3 the source images are two scenery photographs with large, quasi-uniform areas. This produces a weaker mixing of gray levels from both images, resulting in images whose intensities are not independent. However, weak image superpositions are easier to notice because of the quasi-uniform areas.
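
The difference between "close to independent" and "not independent" intensities can be made quantitative. The following Python sketch is one possible check, not a procedure taken from the original work: it estimates the mutual information between the intensities of two images from their joint histogram. The function name and the number of bins are illustrative assumptions. A value near zero suggests independent intensities.

```python
import math
from collections import Counter

def mutual_information(img_a, img_b, bins=16):
    """Histogram estimate (in bits) of the mutual information between the
    intensities of two equally sized 8-bit images, given as lists of rows
    of integers in 0..255. The bin count is an arbitrary choice."""
    # Quantize each pair of co-located pixels into `bins` equal-width bins.
    pairs = [(a * bins // 256, b * bins // 256)
             for row_a, row_b in zip(img_a, img_b)
             for a, b in zip(row_a, row_b)]
    n = len(pairs)
    joint = Counter(pairs)
    marg_a = Counter(a for a, _ in pairs)
    marg_b = Counter(b for _, b in pairs)
    # Sum of p(a,b) * log2(p(a,b) / (p(a) * p(b))) over observed pairs.
    return sum((c / n) * math.log2(c * n / (marg_a[a] * marg_b[b]))
               for (a, b), c in joint.items())
```

With few pixels or many bins this estimate is biased upwards, so it is best used to compare source pairs of similar size rather than as an absolute measure.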

Sets 4 to 8

These data sets were acquired with a Canon LIDE 50 desktop scanner. The scanner's "automatic image adjustment" option was turned off. Therefore the processing of the two sides of the paper was as nearly identical as possible. The only source of asymmetry appears to be that the second image of each mixture was aligned relative to the first image. Therefore the second image was slightly modified (mainly in position, although that involved interpolating intensities) while the first was not.

Set #4 is a re-creation of set #1 under better controlled conditions. One of the source images was formed by 25 vertical bars, each with uniform intensity. The 25 intensities were uniformly spaced between black and white, and were randomly ordered. The second image was the first one rotated by 90 degrees. Therefore the two source images' intensities were independent from each other by construction, and each source had an intensity distribution that was close to uniform.
Sets #5 and #6 were obtained from the same printed images, and thus from the same sources, as sets #2 and #3, respectively (see above for more information). The main difference relative to those sets was the attempt to ensure symmetry in the scanning conditions. The mixtures are probably much closer to being symmetrical than those of #2 and #3.
Set #7 contains two natural scene photographs as one of the sources, and printed text, with a few graphs, as the other source. The text was printed in Times New Roman, 12 point font.
Set #8 contains printed text, with a few graphs, as one of the sources, and printed text, with some equations, as the other source. The text was printed in Times New Roman, 12 point font. Because most mixture pixels correspond to white in at least one of the sources, it seems to be possible to perform a rather good separation of this mixture with linear ICA (see [1] for details).
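
The construction of the set #4 sources described above can be illustrated with a short Python sketch. It is not the code that produced the printed images: the bar width, the random seed, the function name and the direction of the 90 degree rotation are assumptions made only for illustration.

```python
import random

def make_bar_sources(n_bars=25, bar_width=4, seed=0):
    """Build two bar images in the style described for set #4.
    Returns (s1, s2) as square lists of rows of gray levels in 0..255."""
    rng = random.Random(seed)
    # n_bars gray levels uniformly spaced from black (0) to white (255),
    # placed in random order.
    levels = [round(255 * k / (n_bars - 1)) for k in range(n_bars)]
    rng.shuffle(levels)
    size = n_bars * bar_width
    # Vertical bars: the intensity depends only on the column.
    s1 = [[levels[c // bar_width] for c in range(size)] for _ in range(size)]
    # Rotate s1 by 90 degrees (counterclockwise is assumed here), which
    # turns the vertical bars into horizontal ones.
    s2 = [[s1[c][size - 1 - r] for c in range(size)] for r in range(size)]
    return s1, s2
```

Because every gray level of one image meets every gray level of the other over the same number of pixels, the joint intensity distribution is the product of the two marginal ones, which is what "independent by construction" means here.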

3. Contents of each data set

Each data set is contained in a zip file. Each zip file contains six image files in bitmap format (256 levels, mapped to grayscale). For data set #N the files are:

setN_x1a.bmp, setN_x2a.bmp - mixture images, as acquired.
setN_x1.bmp, setN_x2.bmp - mixture images, aligned with each other and cropped.
setN_s1.bmp, setN_s2.bmp - source images, aligned and cropped to correspond with setN_x1.bmp and setN_x2.bmp.
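
As a convenience, the naming scheme above can be checked against a downloaded archive. The Python sketch below is only an illustration: the function names are hypothetical, the archive path must be supplied by the reader, and the code assumes nothing about folders inside the zip file (it compares base names only).

```python
import zipfile

def expected_files(n):
    """The six image file names listed above for data set number n."""
    stems = ("x1a", "x2a", "x1", "x2", "s1", "s2")
    return [f"set{n}_{stem}.bmp" for stem in stems]

def missing_files(archive, n):
    """Return the expected names that are absent from the zip archive.
    `archive` is a path or a binary file object chosen by the caller."""
    with zipfile.ZipFile(archive) as zf:
        # Drop any folder prefix so that only the file name is compared.
        present = {name.rsplit("/", 1)[-1] for name in zf.namelist()}
    return [name for name in expected_files(n) if name not in present]
```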

4. Download

The data sets can be downloaded here:

5. Reading and using the data

The images are provided in bitmap format, and should be easily usable in a wide variety of programs and programming languages.

To read an image into Matlab use a command such as

>> x1 = imread('set1_x1.bmp');

This will create a 'uint8' array named x1, which contains the image. Note that for processing the data you will probably need to convert it to 'double' format, using a command like

>> x1 = double(x1);
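
For readers who do not use Matlab, the same two steps (reading the file, then converting to floating point) can be sketched in Python using only the standard library. This sketch assumes that the files are uncompressed 8-bit bitmaps with a gray palette, which is how the description in Section 3 reads; the function name is hypothetical, and in practice an imaging library would normally be used instead.

```python
import struct

def read_gray_bmp(path):
    """Read an uncompressed 8-bit BMP file and return its pixels as a list
    of rows (top row first) of floats. Assumes a gray palette."""
    with open(path, "rb") as f:
        data = f.read()
    if data[:2] != b"BM":
        raise ValueError("not a BMP file")
    pixel_offset = struct.unpack_from("<I", data, 10)[0]
    header_size, width, height, _planes, bpp, compression = \
        struct.unpack_from("<IiiHHI", data, 14)
    if bpp != 8 or compression != 0:
        raise ValueError("only uncompressed 8-bit BMP files are handled")
    colors_used = struct.unpack_from("<I", data, 46)[0] or 256
    # Each palette entry is (blue, green, red, reserved). For a gray
    # palette the three colors are equal, so the first byte is the level.
    palette_start = 14 + header_size
    palette = [data[palette_start + 4 * i] for i in range(colors_used)]
    row_size = (width + 3) // 4 * 4  # rows are padded to 4-byte multiples
    rows = []
    for r in range(abs(height)):
        start = pixel_offset + r * row_size
        rows.append([float(palette[b]) for b in data[start:start + width]])
    if height > 0:  # a positive height means the rows are stored bottom-up
        rows.reverse()
    return rows

# Example call (the file must be present in the working directory):
# x1 = read_gray_bmp("set1_x1.bmp")
```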

6. Data set production

This section gives some details of how the data sets were produced.

6.1. Printing and acquisition

The images from each pair were printed on opposite faces of a sheet of onion skin paper. Printing was performed with a Hewlett-Packard LaserJet 2200 printer, at a resolution of 1200 dpi, using the printer's default halftoning system. Both faces of the sheet of onion skin paper were then scanned with a desktop scanner, at a resolution of 100 dpi. This low resolution was chosen on purpose, so that the printer's halftoning grid would not be very apparent in the scanned images. Set #1 was acquired with a scanner different from the one used for the remaining sets. I don't have access to this scanner any more, and I don't know its brand. All other sets were acquired with a Canon LIDE 50 desktop scanner. In all cases the scanner's de-screening option (which reduces the visibility of the halftoning grid) was turned on.

In each pair of acquired images, one of them was horizontally flipped, so that both images would have the same orientation. Then the two images were aligned, as described next.
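
The flip is needed because the scan of the back face shows the front image mirrored left to right. For an image stored as a list of rows, the operation can be sketched in Python as follows (the function name is hypothetical):

```python
def flip_horizontal(image):
    """Mirror an image, given as a list of rows, left to right."""
    return [row[::-1] for row in image]
```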

6.2. Preprocessing of the acquired images

In preliminary tests it was found that even a very careful alignment, using translation, rotation and shear operations on the whole images, could not perform a good simultaneous alignment of all parts of the images. This was probably due to slight geometrical distortions introduced by the scanner. This indicated that an automatic, local alignment was needed. That automatic alignment, in turn, relaxed the demands placed on the initial manual alignment.

In the procedure that was finally adopted, the first step consisted just of a manual displacement of one of the images by an integer number of pixels in each direction, so that the two images would be coarsely aligned with each other. In a second step an automatic, local alignment was performed. For this alignment, the resolution of both images was first increased by a factor of four in each direction, using bicubic interpolation. Then one of the images was divided into 100x100 pixel squares (corresponding to 25x25 pixels in the original image), and for each square the best displacement was found, based on the maximum of the cross-correlation with the other image. The whole image was then rebuilt, based on these optimal displacements, and its resolution was finally reduced by a factor of 4. In this way, a local alignment with a resolution of 1/4 pixel was achieved.
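
The local alignment step can be illustrated with the Python sketch below. It is a simplified stand-in, not the code that was actually used. It omits the four-fold bicubic interpolation, so it only finds whole-pixel displacements. It uses a zero-mean normalized correlation as the matching score, and it rebuilds the image by copying the best-matching block into place. The score normalization, the search range and all function names are assumptions.

```python
import math

def block(image, top, left, size):
    """Flatten the size x size square whose corner is at (top, left)."""
    return [v for row in image[top:top + size] for v in row[left:left + size]]

def correlation(a, b):
    """Zero-mean normalized correlation of two equally long value lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    # A uniform square has no structure to match, so its score is 0.
    return sum(x * y for x, y in zip(da, db)) / denom if denom > 0 else 0.0

def best_shift(moving, fixed, top, left, size, max_shift):
    """Find (dy, dx) such that the square of `moving` at (top + dy,
    left + dx) best matches the square of `fixed` at (top, left)."""
    height, width = len(moving), len(moving[0])
    target = block(fixed, top, left, size)
    best = (0, 0)
    best_score = correlation(block(moving, top, left, size), target)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            t, l = top + dy, left + dx
            if t < 0 or l < 0 or t + size > height or l + size > width:
                continue  # the shifted square would leave the image
            score = correlation(block(moving, t, l, size), target)
            if score > best_score:
                best, best_score = (dy, dx), score
    return best

def align_locally(moving, fixed, size, max_shift):
    """Rebuild `moving` square by square so that it lines up with `fixed`."""
    height, width = len(fixed), len(fixed[0])
    aligned = [row[:] for row in moving]
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            dy, dx = best_shift(moving, fixed, top, left, size, max_shift)
            for i in range(size):
                aligned[top + i][left:left + size] = \
                    moving[top + dy + i][left + dx:left + dx + size]
    return aligned
```

Applying the same search to images enlarged by a factor of four, and reducing the result by the same factor afterwards, would give displacements in steps of 1/4 pixel, as in the procedure described above.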

6.3. Processing of the source images

The source images originally had a resolution which was very different from that of the acquired ones. In order for useful comparisons to be possible, the source images were reduced to the same resolution as the acquired ones, and were aligned with the corresponding "recovered source" images, extracted from the mixtures using nonlinear independent component analysis (see [1] for the nonlinear ICA technique). This alignment used the alignment procedure that was described in the previous Section. The sources that are included in these data sets are the result of the reduction in resolution and alignment, and were cropped to correspond, as closely as possible, to the aligned mixture images. Note that, in the sources of data sets #1 and #4, gray levels different from the original ones are visible in the edges of the bars due to the alignment procedure, which involved a change of resolution, performed with bicubic interpolation, followed by displacement and decimation.

7. License

These data sets are copyright of Luis B. Almeida. Data set #1 is also copyright of Miguel Faria. Free permission is given for their use for nonprofit research purposes. Any other use is prohibited, unless a license is previously obtained. To obtain a license please contact Luis B. Almeida.

8. Contact

Luis B. Almeida
Instituto de Telecomunicações
Instituto Superior Técnico
Av. Rovisco Pais, 1
1049-001 Lisboa
Portugal

E-mail (delete the first two c's, which are there to prevent spamming): ccluis.almeida@lx.it.pt

Home page: http://www.lx.it.pt/~lbalmeida/

9. Bibliography

The following preprint gives an example of linear and nonlinear source separation performed on data sets 4 to 8.

[1] L. B. Almeida, "Separating a real-life nonlinear image mixture", submitted to Journal of Machine Learning Research, 2005. Available at: http://www.lx.it.pt/~lbalmeida/papers/AlmeidaJMLR05.pdf.

10. Version information

Version 2.0

Five additional data sets were included. These data sets were acquired in conditions that were as symmetrical as possible. They include sources that are the same as, or similar to, those of the three initial sets, but also include a mixture of printed text with natural scenes, and a mixture of two printed text images. There is a new "bars" mixture. It is similar to the previous one, but has been produced, printed and acquired under better-controlled conditions.

Version 1.1

The mixture images, as acquired (without preprocessing), were also included in the data sets.
The images are now supplied in bitmap format, since this widens their usability.

Version 1

Initial version, supplied in the '.mat' format of Matlab 7.