[fitsbits] Rice compression from the command line
On Jul 18, 2006, at 9:05 PM, Mark Calabretta wrote:
For the FITS binary table, 7zip is costly in CPU time for compression but beats gzip and bzip2 handsomely in compression ratio. However, 7zip is not nearly so costly in elapsed time for decompression. If these results are typical then 7zip would have to be the compressor of choice for FITS data distributed on the web.

Which raises the general question of constructing a figure of merit for data compression. Discussions like this usually focus on compression ratio, the speed to compress and the speed to decompress, but there are a number of important, less quantifiable, parameters:

1) Market penetration - gzip is a clear leader here.

2) Openness of software - both ends of the spectrum may have issues. Patents held by some multinational can quell our access (and interest) if there is no loophole for educational licensing, but navigating the intricacies of some extreme copyleft can do the same.

3) Applicability to a particular purpose - tiled Rice and PLIO are very attractive, tiled gzip much less so (with default parameters).

4) Tailoring to data - a tile-compressed FITS file is still a FITS file.

5) Stability across a range of data sets - even good ol' gzip varies quite a bit in compression ratio from one file to the next. For example, the average gzip compression ratio over two years of NOAO Mosaic II data is 0.586 +/- 0.0449. Four and a half percent (1-sigma) may not seem like a very wide distribution, but it's all in the meaning of "average". This is from 170 nights selected from 304 total. All nights with binned data were rejected. All multi-instrument nights were rejected. All nights with fewer than 10 object exposures were rejected. And more to the point, average here means "the mean of nightly means". Picking a random recent night, the compression ratio varies between 0.33 and 0.79 across several dozen overtly identical 140 MB files. Calibrations at the low end, of course, and object frames at the top.
Obviously there are issues of information theory here, and one could use the incompressibility of the "science" data to gauge the skill of the observer :-)

6) Availability of software - if God hadn't created cfitsio, it would have had to be invented. (Those who might be thinking that the same applies for the Devil and IRAF - shame on you!)

7) Community support - after 7 years one might have hoped that more projects and software would support tile compression.

8) Your feature here.

In general, we often get bound up in theoretical discussions about things like lossy compression, rather than focusing on pragmatic issues of usability and suitability. Meanwhile the LSST tidal wave approaches, but there are going to be several smaller waves impacting astronomy's shores first, including Pan-STARRS (however many telescopes) and next-generation instruments like the One-Degree Imager and the Dark Energy Camera. Features like #1-8 can all be addressed through coordinated community action - it might as well be the FITS community.

On the other hand, the best way to understand the figure-of-merit parameters of compression ratio, speed in, and speed out may be to focus not on static archival holdings, but rather on the costs of bandwidth and latency encountered when moving the data around. After all, isn't the point of the emerging Virtual Observatory to keep the pixels in play, ever moving and interacting? Even if we co-locate processing with data, the data have to shuttle from a SAN across gigabit Ethernet or Fibre Channel to the Beowulf next door. As Arnold just pointed out, customer satisfaction (and thus our job security, I might add) depends on the aggregate response of our systems.

I stumbled across a very interesting, very recent paper on lossless floating-point compression: http://www-static.cc.gatech.edu/~lin...tzip/paper.pdf ...so recent it has yet to appear in either author's online publication list.
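The figure-of-merit idea is easy to prototype. As a minimal sketch (not a FITS-aware benchmark), the Python standard library's gzip, bz2, and lzma modules - lzma being the LZMA algorithm behind 7zip - can measure the three quantitative parameters on any blob of bytes; the synthetic Gaussian "image" below is just a stand-in for real pixel data:

```python
import bz2, gzip, lzma, random, time

def figure_of_merit(data: bytes):
    """Measure compression ratio and compress/decompress wall time
    for each codec on a single blob of data."""
    codecs = {"gzip": gzip, "bzip2": bz2, "lzma": lzma}
    results = {}
    for name, mod in codecs.items():
        t0 = time.perf_counter()
        packed = mod.compress(data)
        t1 = time.perf_counter()
        unpacked = mod.decompress(packed)
        t2 = time.perf_counter()
        assert unpacked == data  # lossless round trip
        results[name] = {
            "ratio": len(packed) / len(data),  # smaller is better
            "compress_s": t1 - t0,
            "decompress_s": t2 - t1,
        }
    return results

# Synthetic stand-in for a data unit: noisy values clustered around a mean,
# loosely mimicking the statistics of a background-dominated image.
random.seed(42)
sample = bytes(min(255, max(0, 128 + int(20 * random.gauss(0, 1))))
               for _ in range(1 << 16))

for name, r in figure_of_merit(sample).items():
    print(f"{name:6s} ratio={r['ratio']:.3f} "
          f"in={r['compress_s']*1e3:.1f} ms out={r['decompress_s']*1e3:.1f} ms")
```

Run against real calibration and object frames, a harness like this would make the nightly spread in ratios (0.33 to 0.79 above) directly visible per codec.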
As far as I can tell, there is nothing about any of the algorithms referenced that would keep them from being used with astronomical data. The real question is how to turn academic advances into useful tools for our community. The FITS tile compression convention is one step toward greasing the rails.

Bill Pence wants to add Hcompress to the cfitsio support for tile compression. Imagine, rather, supporting any and all of the algorithms mentioned above - perhaps using some sort of plug-in/component architecture. We're never going to identify a single best compression scheme for all our data. This was the subtext of the tile compression proposal in the first place. It's time to follow through to the logical conclusion. If any application could transparently access data compressed a dozen different ways (perhaps HDU by HDU in the same MEF), there would be no reason not to store such heterogeneous representations or to convert the data on the fly for task-specific purposes.

A suite of layered benchmark applications would provide the tools to make these decisions. Those tools could even be automated to operate in adaptive ways within the data handling components of our archives, pipelines, web services and portals. Sounds like a nifty ADASS abstract to me :-) I'd already asked Bill if he wanted to work on such a paper - anybody else want to pile on?

Rob
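The plug-in notion above can be as simple as a registry mapping codec names to compress/decompress pairs, with each HDU tagged by the codec that produced it. The Python sketch below is purely illustrative - the names and API are hypothetical, not any existing cfitsio interface, and the three stdlib codecs merely stand in for Rice, Hcompress, PLIO, and friends:

```python
import bz2, gzip, zlib

# Registry of pluggable codecs: name -> (compress, decompress).
# New schemes would register here; these stdlib codecs are stand-ins.
CODECS = {
    "gzip":  (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "zlib":  (zlib.compress, zlib.decompress),
}

def pack_hdu(data: bytes, codec: str) -> tuple[str, bytes]:
    """Compress one HDU's data, tagging it with the codec used."""
    compress, _ = CODECS[codec]
    return codec, compress(data)

def unpack_hdu(tagged: tuple[str, bytes]) -> bytes:
    """Transparently decompress, whatever codec each HDU used."""
    codec, payload = tagged
    _, decompress = CODECS[codec]
    return decompress(payload)

# A heterogeneous "MEF": each extension compressed its own way,
# yet readable through one transparent interface.
mef = [pack_hdu(b"calibration " * 500, "bzip2"),
       pack_hdu(b"object frame " * 500, "gzip")]
assert unpack_hdu(mef[0]) == b"calibration " * 500
assert unpack_hdu(mef[1]) == b"object frame " * 500
```

The design point is that readers never branch on file type: the per-HDU tag selects the decompressor, so adding a codec is a one-line registration rather than a change to every application.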