1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
// Copyright 2016 Jeremy Mason
//
// Licensed under the Apache License, Version 2.0, <LICENSE-APACHE or
// http://apache.org/licenses/LICENSE-2.0> or the MIT license <LICENSE-MIT or
// http://opensource.org/licenses/MIT>, at your option. This file may not be
// copied, modified, or distributed except according to those terms.

//! `mayda` is a library to compress integer arrays (all primitive integer
//! types are supported). The design favors decompression speed and the ability
//! to index the compressed array over the compression ratio, on the principle
//! that the runtime penalty for using compressed arrays should be as small as
//! possible.
//! 
//! This crate provides three variations on a single compression algorithm. The
//! `Uniform` type can decompress around six billion `u32`s per second, or 24
//! GiB/s of decompressed integers, on a 2.6 GHz Intel Core i7-6700HQ processor
//! (see below for specifics). The `Monotone` and `Unimodal` types decompress
//! at a little less than half the speed, but can have much better compression
//! ratios depending on the distribution of the integers. Overall performance is
//! comparable to the fastest (known) libraries in any language.
//!
//! # Usage
//! 
//! Add this to your `Cargo.toml`:
//!
//! ```toml
//! [dependencies]
//! mayda = "^0.2"
//! ```
//!
//! and this to your crate root:
//!
//! ```rust
//! extern crate mayda;
//! ```
//! 
//! # Example: encoding and decoding
//!
//! The `Uniform` struct is recommended for general purpose integer compression.
//! Use the `Encode` trait to encode and decode the array.
//!
//! ```rust
//! use std::mem;
//! use mayda::{Uniform, Encode};
//! 
//! // Integers in some interval of length 255, require one byte per integer
//! let input: Vec<u32> = (0..128).map(|x| (x * 73) % 181 + 307).collect();
//! 
//! let mut uniform = Uniform::new();
//! uniform.encode(&input).unwrap();
//! 
//! let i_bytes = std::mem::size_of_val(input.as_slice());
//! let u_bytes = std::mem::size_of_val(uniform.storage());
//! 
//! // 128 bytes for encoded entries, 12 bytes of overhead
//! assert_eq!(i_bytes, 512);
//! assert_eq!(u_bytes, 140);
//! 
//! let output: Vec<u32> = uniform.decode();
//! assert_eq!(input, output);
//! ```
//!
//! # Example: indexing
//! 
//! Use the `Access` or `AccessInto` traits to index the compressed array. This
//! is similar to `Index`, but returns a vector instead of a slice.
//!
//! ```rust
//! use mayda::{Uniform, Encode, Access};
//! 
//! // All primitive integer types supported
//! let input: Vec<i16> = (-64..64).collect();
//! 
//! let mut uniform = Uniform::new();
//! uniform.encode(&input).unwrap();
//! 
//! let val: i16 = uniform.access(64);
//! assert_eq!(val, 0);
//! 
//! // Vector is returned to give ownership to the caller
//! let range: Vec<i16> = uniform.access(..10);
//! assert_eq!(range, (-64..-54).collect::<Vec<_>>());
//! ```
//!
//! # Algorithm
//!
//! The compression algorithm relies on the observation that for many integer
//! arrays, the probability distribution of a block of 128 entries is not
//! uniform over all values that can be represented by the integer type. For
//! example, an array of indices into a second array with 256 entries has
//! entries that are bounded below by 0 and above by 255. This requires only
//! eight bits per entry, rather than the 32 or 64 generally set aside for a
//! usize index. The basic idea of the compression algorithm is to represent
//! all of the entries in a block with a single minimum necessary bit width.
//! This allows SIMD operations and manual loop unrolling to avoid branches to
//! be used to accelerate encoding and decoding.
//!
//! This approach does not perform well for probability distributions with
//! outliers though, or for situations where the median value is nonzero. This
//! crate provides three types that handle this situation by applying different
//! initial transformations to the input integer array.
//!
//! The `Uniform` type targets encoding and decoding speed. The only
//! transformation is to subtract the minimum value from the entries, with the
//! result that the compression depends only on the difference between the
//! largest and smallest entries.
//!
//! The `Monotone` type is specifically intended for arrays with monotone
//! increasing entries. A blocks of entries is transformed into an offset and
//! an array of differences of successive entries, and the array of differences
//! is encoded by the approach above. The compression depends only on the 
//! largest entry in the difference array.
//!
//! The `Unimodal` type performs the most extensive transformations and offers
//! the most robust compression. The median value is subtracted from the
//! entries, the entries are mapped to the unsigned integers by the zig-zag
//! encoding, and the most significant bits of any outliers are removed and
//! stored separately. The result is that the compression effectively depends
//! only on the standard deviation of the probability distribution of the block
//! entries.
//!
//! # Indexing
//!
//! Indexing an object of the `Uniform`, `Monotone` or `Unimodal` types is not
//! a simple pointer offset. Part of the header encodes the relative offsets to
//! every block in a compressed form, with the result that random access via
//! the `Access` trait is an `O(log(idx))` operation, where `idx` is the value
//! of the index (not the length of the array). The overhead of this part of
//! the header is around a tenth of a bit per encoded integer.
//!
//! If you need to access several nearby entries, consider accessing the range
//! and indexing the returned vector for performance.
//!
//! # Safety
//!
//! As a general rule, you **should not** decode or access objects of the
//! `Uniform`, `Monotone` or `Unimodal` types from untrusted sources.
//!
//! `mayda` performs wildly unsafe pointer operations during encoding and
//! decoding. Changing the header information with `mut_storage` can cause data
//! to be written to or read from arbitrary addresses in memory.
//!
//! That said, the situation is the same for any of the data structures in the
//! standard library (consider the `set_len` method of a `Vec`).

#![feature(inclusive_range)]
#![feature(inclusive_range_syntax)]
#![feature(specialization)]

extern crate mayda_codec;

pub mod error;
pub mod monotone;
pub mod uniform;
pub mod unimodal;
pub mod utility;

pub use self::error::Error;
pub use self::monotone::Monotone;
pub use self::uniform::Uniform;
pub use self::unimodal::Unimodal;
pub use self::utility::{Access, AccessInto, Encode};