Latest final words:

rar kicks ass for speed, compression, recovery, dealing with variations, and a certain combination of features a la 'tar' that make it more malleable. Most of this seems to come at the cost of *memory*. But, frankly, I've got plenty of that, and I'd guess a lot of you out there do as well. I'll know better how it does with a smaller footprint after some lower-end tests. Another problem for rar right now is that the unix binaries lack a lot of the normal usage we're used to from compress/gzip/bzip2, a la 'tcpdump | '. At least, to *MY* understanding so far.

gzip
I love gzip. So, if this seems advocate-y, so be it. It's not the best compression for hardcore shrinky-dink sizes, but it's plenty fast, especially decompressing, and TINY. For regular system usage, or things like web page compression on the fly, etc., it's a winner.

bzip2
Nice for mid-level archiving, or downloadable files; an easy gzip replacement. But if rar were easier to drop into my scripts and system, I don't think bzip2 would be worth anything. It's not good enough to switch over to entirely from gzip, and it's SLOOOWWWWW.

--- 2003 Jul 12 01:00

Current conclusions and important notes:

Ok, rar is definitely the 'winner' in this category, seemingly for several noticeable reasons, though it's not clear which of them matters most, beyond guesstimates. I'm going to run a series of other tests if I can figure out what might lead to better results for bzip2/gzip, and maybe other compressors...

1) 'dictionary size' / 'block size'. I don't know the details of how these things work, or whether a 'sliding dictionary' is that different from a 'block'. But bzip2 -1 to -9 is 100k to 900k block sizes, I'm not sure what the gzip levels change, and rar has its beefy sliding dictionary of up to 4096k. Any which way you slice it [regardless of whether these are comparable settings], 4 megs, while using significant memory (also noticeable in some of the 'time' reports), gives better compression. Makes sense.
If a bigger block size in bzip2 makes for better compression, an even bigger one may make for even better. I can't see a way to increase the bzip2/gzip sizes any further.

2) in-binary file consolidation. rar combines the files together, and *appears* to [whether due to the large dictionary size or some other file-related function] recognize files with minor differences when they're sequential, and produces some *teeny* subsequent files [400k -> 300k, and the next four 400k files down to between 10k and 44k]. Situationally dependent, definitely, both on the types of files (whether they have significantly replicated structures) and on placing them near each other so they get caught by what I assume to be the 'sliding dictionary'. The -s option improves compression significantly [likely by maintaining this 'dictionary' or 'block' across several files as it pulls them in].

3) file-by-file compression method choices. rar has the ability to determine, on a file-by-file basis, which compression routine in its repertoire to use. I'm unsure whether this is in effect for -s methods, but when compressing a large collection this would seem to be useful.

COMPARISON CONCLUSIONS:

Sequential file sorting: rar benefits greatly, and bzip2 fairly well, from similar files being sorted together (perhaps these files are small enough to fit well within the windows). The rar window is larger, catching more similar files, whereas the bzip2 window, being smaller, gets some advantage but probably doesn't cover enough for the best results here. gzip didn't really seem to care greatly.

Speed: inconclusive, really. Other than that bzip2 decompression is _slow_ [this has been my experience elsewhere as well]. gzip, for all it lacks, is fast for decompression, fast, fast. rar is pretty durn nice too. But I say inconclusive because a fair amount of this speed probably comes from the memory usage / disk usage. I can't help but wonder how much better bzip2 or even gzip would perform if they could use more memory.
I don't know if there is some basic limitation on how much they can take advantage of, or if it was a design choice about what would be provided.

...More after more tests

I'll now try some 'for fun' tests: tar cyf/czf to compare inclusive compressions, alternate compressions, and maybe dropping the rar dict size. Also, I should run bzip2/gzip on the individual files and tar the result, to see how much the single-file presentation helps.

Keep in mind, single-file (tar for bzip2/gzip, or -s for rar) increases the potential damage from corruption. rar has a kind of logical 'volume' sizing setup to allow you to choose the limits.

--- 2003 Jul 11 evening

Our purpose here is not to prove the importance and wonder of any particular kind of compression over another. RAR gets a lot of props for being *extra* good at particular compression needs. This is an attempt, with a very large collection of 'ROM' files, to figure out how RAR got 2gigs down to 250megs (impressive, indeed, but many of the files are, in fact, near duplicates in various modified/international versions, accounting for many massive compression chunks). Particularly, how it compares to gzip and bzip2. Or perhaps, how to duplicate this success.

The choice of compressors is primarily based upon the regular unix tools I enjoy and utilize. I may toss in a zip/stuffit/etc set of tests, time willing, but a 2g file takes a long time, perhaps longer than my motivation holds out.

Also up for checking are 'quicker' compression options. These tests are *ALL* being done on the assumption that the archives will be transferred multiple times, and that maximum compression is the only key factor.

My primary hypothesis is that RAR's extreme performance (250M versus basic bzip2 tests resulting in 800M+) comes down to single-file vs. multiple-file compression and similarity in ordering; those make the biggest difference.
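The single-file part of that hypothesis can be checked on a small scale before committing to another multi-hour run. What follows is only a minimal sketch using Python's standard `bz2` and `zlib` modules, not the method used for the numbers in these notes. The helper names (`per_file_total`, `single_stream_total`, `compare`) are made up for illustration, and `zlib` is used as a stand-in for gzip since both use deflate.

```python
import bz2
import zlib


def per_file_total(blobs, compress):
    """Total compressed size when each blob is compressed on its own."""
    return sum(len(compress(blob)) for blob in blobs)


def single_stream_total(blobs, compress):
    """Compressed size when all blobs are concatenated and compressed once."""
    return len(compress(b"".join(blobs)))


def compare(blobs):
    """Return {codec name: (per_file_bytes, single_stream_bytes)}."""
    codecs = {
        # Level 9 means a 900k block for bzip2.
        "bzip2 -9": lambda data: bz2.compress(data, 9),
        # Raw deflate via zlib; an assumption standing in for gzip -9.
        "deflate -9": lambda data: zlib.compress(data, 9),
    }
    return {
        name: (per_file_total(blobs, fn), single_stream_total(blobs, fn))
        for name, fn in codecs.items()
    }
```

Feeding `compare` a list of near-duplicate byte strings should show the single-stream figure well below the per-file figure, as long as the duplicates sit close enough together. Deflate only looks back 32 KB and bzip2 works in blocks of at most 900k, so with files in the 400k range gzip would be expected to gain very little from ordering, which may be part of why it "didn't really seem to care".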
Although there appears to be some degree of similarity ordering option with rar, perhaps some method of doing a scripted file similarity ordering for a list of files to be passed to tar and then to a compressor (bzip2/gzip) could be set up as a wrapper for invisible operation. RAR has several methods for compression based on file type, which may account for its success across a variety of data, but the rar functions on unix seem to lack some more robust command abilities at this point (such as diverse pipe-in and pipe-out usage).

Anyhow, on with the stats:

2,017,140k (approx 8820 files, 25 directories)
2,017,136k sans directory structure
1,865,024k (via tar to single file as below)
1,864,992k sans directory structure, tar

The first two are sizes on disk, not actual sizes; see the tar sizes for a better idea of the actual size. Block sizes, the number of files, etc. account for much of the difference.

***Also be wary: the actual *time* results are VERY unreliable. This was run on an active system, with anywhere from 1 to 5 of these jobs sometimes running simultaneously, so the start -> finish times are not untainted data.

W = Windows rar
F = FreeBSD 4.7R PIII 850MHZ 512M 7200RPM UDMA100MHZ

Each row is: command, resulting size in k (compression rows only), then the timing line as printed. The timing fields look like csh-style time output: user, system, elapsed, %CPU, text+data memory (k), in+out I/O, page faults + swaps. [FLAT] marks runs without the directory structure.

RAR 3.20 Copyright (c) 1993-2003 Eugene Roshal 15 May 2003 [binaries from rarlab.com]
a -s -m5 -rr [W]     262,624
x                             98.727u 30.687s 11:34.78 18.6% 573+39634k 2517+13461io 1pf+0w
a -s -m5 -mdg [F]    259,840  1891.892u 29.929s 53:58.22 59.3% 565+-9020k 56458+1383io 2pf+0w
x                             99.316u 28.089s 9:31.93 22.2% 568+39295k 2441+13295io 1pf+0w
a -m5 -mdg [F]       841,616  2057.885u 39.092s 1:02:56.86 55.5% 566+-1428k 55683+4730io 2pf+0w
x                             194.924u 30.075s 15:27.03 24.2% 572+39547k 7420+13471io 1pf+0w
a -m5 -mdg [F]       841,488  2055.489u 39.762s 1:03:39.07 54.8% 567+-1425k 56493+4842io 1pf+0w [FLAT]
x                             193.115u 31.727s 10:36.60 35.3% 572+39514k 6803+13368io 6pf+0w [FLAT]

tar (GNU tar) 1.13.25
cf                 1,865,024  1.233u 32.592s 19:46.51 2.8% 358+462k 55529+14615io 3pf+0w
xf                            0.771u 30.455s 13:56.18 3.7% 363+295k 14887+13703io 3pf+0w
cf                 1,864,992  0.919u 25.094s 28:34.25 1.5% 355+1560k 55261+14621io 0pf+0w [FLAT]
xf                            0.905u 37.271s 7:24.51 8.5% 361+294k 15003+13698io 0pf+0w [FLAT]
cyf
czf

/* The following are run on the above tar files [hierarchical and FLAT]. This is to emulate the normal tar | compress vs. the rar-style hierarchy-ignorant file sorting. (I.e., in rar -s, it finds ALL the files, sorts them by file name, disregarding folders except to record them, and then compresses in that order. Compressing similar files and file types in order *greatly* improves compression, yo.) */

bzip2, a block-sorting file compressor. Version 1.0.2, 30-Dec-2001.
-9                   812,112  2354.324u 16.167s 1:12:28.28 54.5% 144+-8392k 14711+6368io 0pf+0w
-d                            699.637u 28.731s 45:32.97 26.6% 144+4853k 6426+14665io 2pf+0w
-9                   620,688  3049.018u 14.403s 2:21:56.42 35.9% 143+-4344k 14737+4901io 0pf+0w [FLAT]
-d                            632.952u 27.831s 13:53.82 79.2% 144+4858k 4938+14629io 1pf+0w [FLAT]

gzip 1.2.4 (18 Aug 93)
-9                   920,352  2397.651u 24.972s 1:05:51.57 61.3% 123+615k 14727+7231io 1pf+0w
-d                            74.234u 20.402s 4:14.45 37.1% 124+665k 7285+14629io 1pf+0w
-9                   917,264  2375.841u 24.734s 41:44.03 95.8% 123+615k 14702+7194io 1pf+0w [FLAT]
-d                            73.685u 19.264s 3:22.57 45.8% 125+666k 7249+14609io 1pf+0w [FLAT]
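The name-ordered, folder-ignoring file order described in the comment above the bzip2 results can be approximated for tar + bzip2 with a short script. This is only a sketch of the idea, not how rar does it internally: the function names `files_sorted_by_name` and `sorted_tar_bz2` are hypothetical, and the sort key (lower-cased base name, with the full path as a tie-break) is my own assumption.

```python
import os
import tarfile


def files_sorted_by_name(root):
    """Return every file under root, ordered by base name.

    Directories are ignored for ordering, so near-identical files that
    live in different folders end up next to each other.
    """
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    # Assumed key: case-insensitive base name, full path as tie-break.
    return sorted(paths, key=lambda p: (os.path.basename(p).lower(), p))


def sorted_tar_bz2(root, archive_path):
    """Write a bzip2-compressed tar of root with members in name order.

    archive_path should live outside root, otherwise the archive
    would pick itself up during the walk.
    """
    with tarfile.open(archive_path, "w:bz2", compresslevel=9) as tar:
        for path in files_sorted_by_name(root):
            # Store paths relative to root so the tree is rebuilt on extract.
            tar.add(path, arcname=os.path.relpath(path, root), recursive=False)
    return archive_path
```

Something like `sorted_tar_bz2("roms", "/tmp/roms-sorted.tar.bz2")` (both paths are placeholders) would produce an archive that still extracts into the original directory layout, since only the member order changes. Swapping `"w:bz2"` for `"w:gz"` gives the gzip equivalent. Whether this closes much of the gap to rar -s is exactly what the FLAT rows above only partly answer.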