Diffstat (limited to 'rst/backends.rst')
-rw-r--r-- rst/backends.rst | 415
1 file changed, 164 insertions(+), 251 deletions(-)
diff --git a/rst/backends.rst b/rst/backends.rst
index 480ff90..5bc9521 100644
--- a/rst/backends.rst
+++ b/rst/backends.rst
@@ -1,292 +1,205 @@
 .. -*- mode: rst -*-
+.. _storage_backends:
+
 ==================
  Storage Backends
 ==================
-S3QL can use different protocols to store the file system data.
-Independent of the backend that you use, the place where your file
-system data is being stored is called a *bucket*. (This is mostly for
-historical reasons, since initially S3QL supported only the Amazon S3
-backend).
-
-
-On Backend Reliability
-======================
-
-S3QL has been designed for use with a storage backend where data loss
-is so infrequent that it can be completely neglected (e.g. the Amazon
-S3 backend). If you decide to use a less reliable backend, you should
-keep the following warning in mind and read this section carefully.
-
-.. WARNING::
-
-  S3QL is not able to compensate for any failures of the backend. In
-  particular, it is not able reconstruct any data that has been lost
-  or corrupted by the backend. The persistence and durability of data
-  stored in an S3QL file system is limited and determined by the
-  backend alone.
-
-
-On the plus side, if a backend looses or corrupts some of the stored
-data, S3QL *will* detect the problem. Missing data will be detected
-when running `fsck.s3ql` or when attempting to access the data in the
-mounted file system. In the later case you will get an IO Error, and
-on unmounting S3QL will warn you that the file system is damaged and
-you need to run `fsck.s3ql`.
-
-`fsck.s3ql` will report all the affected files and move them into the
-`/lost+found` directory of the file system.
-
-You should be aware that, because of S3QL's data de-duplication
-feature, the consequences of a data loss in the backend can be
-significantly more severe than you may expect. More concretely, a data
-loss in the backend at time *x* may cause data that is written *after*
-time *x* to be lost as well.
-What may happen is this:
-
-#. You store an important file in the S3QL file system.
-#. The backend looses the data blocks of this file. As long as you
-   do not access the file or run `fsck.s3ql`, S3QL
-   is not aware that the data has been lost by the backend.
-#. You save an additional copy of the important file in a different
-   location on the same S3QL file system.
-#. S3QL detects that the contents of the new file are identical to the
-   data blocks that have been stored earlier. Since at this point S3QL
-   is not aware that these blocks have been lost by the backend, it
-   does not save another copy of the file contents in the backend but
-   relies on the (presumably) existing blocks instead.
-#. Therefore, even though you saved another copy, you still do not
-   have a backup of the important file (since both copies refer to the
-   same data blocks that have been lost by the backend).
-
-As one can see, this effect becomes the less important the more often
-one runs `fsck.s3ql`, since `fsck.s3ql` will make S3QL aware of any
-blocks that the backend may have lost. Figuratively, this establishes
-a "checkpoint": data loss in the backend that occurred before running
-`fsck.s3ql` can not affect any file system operations performed after
-running `fsck.s3ql`.
-
-
-Nevertheless, (as said at the beginning) the recommended way to use
-S3QL is in combination with a sufficiently reliable storage backend.
-In that case none of the above will ever be a concern.
-
-
-The `authinfo` file
-===================
-
-Most backends first try to read the file `~/.s3ql/authinfo` to determine
-the username and password for connecting to the remote host. If this
-fails, both username and password are read from the terminal.
-
-The `authinfo` file has to contain entries of the form ::
-
-  backend <backend> machine <host> login <user> password <password>
-
-So to use the login `joe` with password `jibbadup` when using the FTP
-backend to connect to the host `backups.joesdomain.com`, you would
-specify ::
-
-  backend ftp machine backups.joesdomain.com login joe password jibbadup
+The following backends are currently available in S3QL:
+
+Google Storage
+==============
+
+`Google Storage <http://code.google.com/apis/storage/>`_ is an online
+storage service offered by Google. It is the most feature-rich service
+supported by S3QL, and S3QL offers the best performance when used with
+the Google Storage backend.
+
+To use the Google Storage backend, you need to have (or sign up for) a
+Google account, and then `activate Google Storage
+<http://code.google.com/apis/storage/docs/signup.html>`_ for your
+account. The account is free; you will pay only for the amount of
+storage and traffic that you actually use. Once you have created the
+account, make sure to `activate legacy access
+<http://code.google.com/apis/storage/docs/reference/v1/apiversion1.html#enabling>`_.
+
+To create a Google Storage bucket, you can use e.g. the `Google
+Storage Manager <https://sandbox.google.com/storage/>`_. The
+storage URL for accessing the bucket in S3QL is then ::
+
+  gs://<bucketname>/<prefix>
+
+Here *bucketname* is the name of the bucket, and *prefix* can be
+an arbitrary prefix that will be prepended to all object names used by
+S3QL. This allows you to store several S3QL file systems in the same
+Google Storage bucket.
+
+Note that the backend login and password for accessing your Google
+Storage bucket are not your Google account name and password, but the
+*Google Storage developer access key* and *Google Storage developer
+secret* that you can manage with the `Google Storage key management
+tool <https://code.google.com/apis/console/#:storage:legacy>`_.
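The bucket/prefix split described above can be sketched with a short helper. This is an illustrative sketch only, not S3QL's actual parsing code; the function name `parse_storage_url` is invented for this example.

```python
# Hypothetical helper illustrating how a storage URL of the form
# gs://<bucketname>/<prefix> splits into its components.
from urllib.parse import urlsplit

def parse_storage_url(url):
    """Return (scheme, bucketname, prefix) for a storage URL."""
    parts = urlsplit(url)
    if parts.scheme not in ('gs', 'gss', 's3', 's3s'):
        raise ValueError('unsupported storage URL: %s' % url)
    # The network location is the bucket name; everything after the
    # first slash is the (possibly empty) object-name prefix.
    prefix = parts.path.lstrip('/')
    return (parts.scheme, parts.netloc, prefix)

print(parse_storage_url('gs://mybucket/backup/'))
# ('gs', 'mybucket', 'backup/')
```

Two file systems stored as `gs://mybucket/fs1/` and `gs://mybucket/fs2/` therefore never collide, because all of their object names carry distinct prefixes.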
+
+If you would like S3QL to connect using HTTPS instead of standard
+HTTP, start the storage url with ``gss://`` instead of ``gs://``. Note
+that at this point S3QL does not perform any server certificate
+validation (see `issue 267
+<http://code.google.com/p/s3ql/issues/detail?id=267>`_).
+
+
+Amazon S3
+=========
+
+`Amazon S3 <http://aws.amazon.com/s3>`_ is the online storage service
+offered by `Amazon Web Services (AWS) <http://aws.amazon.com/>`_. To
+use the S3 backend, you first need to sign up for an AWS account. The
+account is free; you will pay only for the amount of storage and
+traffic that you actually use. After that, you need to create a bucket
+that will hold the S3QL file system, e.g. using the `AWS Management
+Console <https://console.aws.amazon.com/s3/home>`_. For best
+performance, it is recommended to create the bucket in the
+geographically closest storage region, but not the US Standard
+region (see below).
+
+The storage URL for accessing S3 buckets in S3QL has the form ::
+
+  s3://<bucketname>/<prefix>
+
+Here *bucketname* is the name of the bucket, and *prefix* can be
+an arbitrary prefix that will be prepended to all object names used by
+S3QL. This allows you to store several S3QL file systems in the same
+S3 bucket.
+
+Note that the backend login and password for accessing S3 are not the
+user id and password that you use to log into the Amazon webpage, but
+the *AWS access key id* and *AWS secret access key* shown under `My
+Account/Access Identifiers
+<https://aws-portal.amazon.com/gp/aws/developer/account/index.html?ie=UTF8&action=access-key>`_.
-
-Consistency Guarantees
-======================
-
-The different backends provide different types of *consistency
-guarantees*. Informally, a consistency guarantee tells you how fast
-the backend will apply changes to the stored data.
+If you would like S3QL to connect using HTTPS instead of standard
+HTTP, start the storage url with ``s3s://`` instead of ``s3://``.
+Note that, as of May 2011, Amazon S3 is faster when accessed using a
+standard HTTP connection, and that S3QL does not perform any server
+certificate validation (see `issue 267
+<http://code.google.com/p/s3ql/issues/detail?id=267>`_).
-S3QL defines the following three levels:
-
-* **Read-after-Write Consistency.** This is the strongest consistency
-  guarantee. If a backend offers read-after-write consistency, it
-  guarantees that as soon as you have committed any changes to the
-  backend, subsequent requests will take into account these changes.
+
+Reduced Redundancy Storage (RRS)
+--------------------------------
-
-* **Read-after-Create Consistency.** If a backend provides only
-  read-after-create consistency, only the creation of a new object is
-  guaranteed to be taken into account for subsequent requests. This
-  means that, for example, if you overwrite data in an existing
-  object, subsequent requests may still return the old data for a
-  certain period of time.
+S3QL does not allow the use of `reduced redundancy storage
+<http://aws.amazon.com/s3/#protecting>`_. The reason for that is a
+combination of three factors:
-
-* **Eventual consistency.** This is the lowest consistency level.
-  Basically, any changes that you make to the backend may not be
-  visible for a certain amount of time after the change has been made.
-  However, you are guaranteed that no change will be lost. All changes
-  will *eventually* become visible.
-
-  .
+* RRS has a relatively low reliability: on average you lose one
+  out of every ten-thousand objects a year. So you can expect to
+  occasionally lose some data.
+* When `fsck.s3ql` asks S3 for a list of the stored objects, this list
+  includes even those objects that have been lost. Therefore
+  `fsck.s3ql` *cannot detect lost objects*, and lost data will only
+  become apparent when you try to actually read from a file whose data
+  has been lost. This is a (very unfortunate) peculiarity of Amazon
+  S3.
-As long as your backend provides read-after-write or read-after-create
-consistency, you do not have to worry about consistency guarantees at
-all. However, if you plan to use a backend with only eventual
-consistency, you have to be a bit careful in some situations.
-
-
-.. _eventual_consistency:
-
-Dealing with Eventual Consistency
---------------------------------
+* Due to the data de-duplication feature of S3QL, unnoticed lost
+  objects may cause subsequent data loss later in time (see
+  :ref:`backend_reliability` for details).
-
-.. NOTE::
-  The following applies only to storage backends that do not provide
-  read-after-create or read-after-write consistency. Currently,
-  this is only the Amazon S3 backend *if used with the US-Standard
-  storage region*. If you use a different storage backend, or the S3
-  backend with a different storage region, this section does not apply
-  to you.
+
+Potential issues when using the US Standard storage region
+----------------------------------------------------------
-
-While the file system is mounted, S3QL is able to automatically handle
-all issues related to the weak eventual consistency guarantee.
-However, some issues may arise during the mount process and when the
-file system is checked.
+In the US Standard storage region, Amazon S3 does not guarantee
+read-after-create consistency. This means that after a new object has
+been stored, requests to read this object may still fail for a little
+while. While the file system is mounted, S3QL is able to automatically
+handle all issues related to this so-called eventual consistency.
+However, problems may arise during the mount process and when the file
+system is checked:
 
 Suppose that you mount the file system, store some new data, delete
-some old data and unmount it again. Now remember that eventual
-consistency means that there is no guarantee that these changes will
-be visible immediately.
-At least in theory it is therefore possible
-that if you mount the file system again, S3QL does not see any of the
-changes that you have done and presents you an "old version" of the
-file system without them. Even worse, if you notice the problem and
-unmount the file system, S3QL will upload the old status (which S3QL
-necessarily has to consider as current) and thereby permanently
-override the newer version (even though this change may not become
-immediately visible either).
-
-The same problem applies when checking the file system. If the backend
+some old data and unmount it again. Now there is no guarantee that
+these changes will be visible immediately. At least in theory it is
+therefore possible that if you mount the file system again, S3QL
+does not see any of the changes that you have made and presents you
+an "old version" of the file system without them. Even worse, if you
+notice the problem and unmount the file system, S3QL will upload the
+old state (which S3QL necessarily has to consider as current) and
+thereby permanently override the newer version (even though this
+change may not become immediately visible either).
+
+The same problem applies when checking the file system. If S3
 provides S3QL with only partially updated data, S3QL has no way to
 find out if this is a real consistency problem that needs to be fixed
 or if it is only a temporary problem that will resolve itself
 automatically (because there are still changes that have not become
 visible yet).
-
-While this may seem to be a rather big problem, the likelihood of it
-to occur is rather low. In practice, most storage providers rarely
-need more than a few seconds to apply incoming changes, so to trigger
-this problem one would have to unmount and remount the file system in
-a very short time window. Many people therefore make sure that they
-wait a few minutes between successive mounts (or file system checks)
-and decide that the remaining risk is negligible.
-
-Nevertheless, the eventual consistency guarantee does not impose an
-upper limit on the time that it may take for change to become visible.
-Therefore there is no "totally safe" waiting time that would totally
-eliminate this problem; a theoretical possibility always remains.
-
-
-
-The Amazon S3 Backend
-=====================
-
-To store your file system in an Amazon S3 bucket, use a storage URL of
-the form `s3://<bucketname>`. Bucket names must conform to the `S3
-Bucket Name Restrictions`_.
-
-The S3 backend offers exceptionally strong reliability guarantees. As
-of August 2010, Amazon guarantees a durability of 99.999999999% per
-year. In other words, if you store a thousand million objects then on
-average you would loose less than one object in a hundred years.
-
-The Amazon S3 backend provides read-after-create consistency for the
-EU, Asia-Pacific and US-West storage regions. *For the US-Standard
-storage region, Amazon S3 provides only eventual consistency* (please
-refer to :ref:`eventual_consistency` for information about
-what this entails).
-
-When connecting to Amazon S3, S3QL uses an unencrypted HTTP
-connection, so if you want your data to stay confidential, you have
-to create the S3QL file system with encryption (this is also the default).
-
-When reading the authentication information for the S3 backend from
-the `authinfo` file, the `host` field is ignored, i.e. the first entry
-with `s3` as a backend will be used. For example ::
-
-  backend s3 machine any login myAWSaccessKeyId password myAwsSecretAccessKey
-
-Note that the bucket names come from a global pool, so chances are
-that your favorite name has already been taken by another S3 user.
-Usually a longer bucket name containing some random numbers, like
-`19283712_yourname_s3ql`, will work better.
-
-If you do not already have one, you need to obtain an Amazon S3
-account from `Amazon AWS <http://aws.amazon.com/>`_.
-The account is
-free, you will pay only for the amount of storage that you actually
-use.
-
-Note that the login and password for accessing S3 are not the user id
-and password that you use to log into the Amazon Webpage, but the "AWS
-access key id" and "AWS secret access key" shown under `My
-Account/Access Identifiers
-<https://aws-portal.amazon.com/gp/aws/developer/account/index.html?ie=UTF8&action=access-key>`_.
-
-.. _`S3 Bucket Name Restrictions`: http://docs.amazonwebservices.com/AmazonS3/2006-03-01/dev/BucketRestrictions.html
-
-.. NOTE::
-
-  S3QL also allows you to use `reduced redundancy storage
-  <http://aws.amazon.com/s3/#protecting>`_ by using ``s3rr://``
-  instead of ``s3://`` in the storage url. However, this not
-  recommended. The reason is a combination of three factors:
-
-  * RRS has a relatively low reliability, on average you loose one
-    out of every ten-thousand objects a year. So you can expect to
-    occasionally loose some data.
-
-  * When `fsck.s3ql` asks Amazon S3 for a list of the stored objects,
-    this list includes even those objects that have been lost.
-    Therefore `fsck.s3ql` *can not detect lost objects* and lost data
-    will only become apparent when you try to actually read from a
-    file whose data has been lost. This is a (very unfortunate)
-    peculiarity of Amazon S3.
+The likelihood of this happening is rather low. In practice, most
+objects are ready for retrieval just a few seconds after they have
+been stored, so to trigger this problem one would have to unmount and
+remount the file system in a very short time window. However, since S3
+does not place any upper limit on the length of this window, it is
+recommended not to place S3QL buckets in the US Standard storage
+region. As of May 2011, all other storage regions provide stronger
+consistency guarantees that completely eliminate any of the described
+problems.
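The read-after-create failure window described above can be illustrated with a toy model. This is a didactic sketch only; the class name `EventualStore` and the `visibility_delay` parameter are invented for this example and have nothing to do with S3QL's actual code.

```python
import time

class EventualStore:
    """Toy model of a store without read-after-create consistency:
    a newly written object only becomes visible after a delay."""

    def __init__(self, visibility_delay):
        self.visibility_delay = visibility_delay
        self._pending = {}  # key -> (becomes_visible_at, value)
        self._data = {}

    def put(self, key, value):
        # The write succeeds immediately, but readers will not see it
        # until the visibility delay has elapsed.
        self._pending[key] = (time.monotonic() + self.visibility_delay, value)

    def get(self, key):
        # Promote objects whose visibility delay has elapsed.
        for k, (t, v) in list(self._pending.items()):
            if time.monotonic() >= t:
                self._data[k] = v
                del self._pending[k]
        return self._data.get(key)  # None while still invisible

store = EventualStore(visibility_delay=0.1)
store.put('block1', b'data')
print(store.get('block1'))  # None: object stored but not yet visible
time.sleep(0.2)
print(store.get('block1'))  # b'data' once the change has propagated
```

A mount/unmount cycle that fits inside the visibility delay sees the "old" state, which is exactly the scenario the paragraph above warns about.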
-
-  * Due to the data de-duplication feature of S3QL, unnoticed lost
-    objects may cause subsequent data loss later in time (see `On
-    Backend Reliability`_ for details).
-
-  In other words, you should really only store an S3QL file system
-  using RRS if you know exactly what you are getting into.
-
+S3 compatible
+=============
+S3QL is also able to access other S3-compatible storage services for
+which no specific backend exists. Note that when accessing such
+services, only the lowest common denominator of available features can
+be used, so it is generally recommended to use a service-specific
+backend instead.
-The Local Backend
-=================
-The local backend stores file system data in a directory on your
-computer. The storage URL for the local backend has the form
-`local://<path>`. Note that you have to write three consecutive
-slashes to specify an absolute path, e.g. `local:///var/archive`.
+The storage URL for accessing an arbitrary S3 compatible storage
+service is ::
-The local backend provides read-after-write consistency.
+  s3c://<hostname>:<port>/<bucketname>/<prefix>
-The SFTP Backend
-================
+or ::
-The SFTP backend uses the SFTP protocol, which is a file transfer
-protocol similar to ftp, but uses an encrypted SSH connection.
-It provides read-after-write consistency.
+  s3cs://<hostname>:<port>/<bucketname>/<prefix>
-Note that the SFTP backend is rather slow and has not been tested
-as extensively as the S3 and Local backends.
+to use HTTPS connections. Note, however, that at this point S3QL does
+not verify the server certificate (cf. `issue 267
+<http://code.google.com/p/s3ql/issues/detail?id=267>`_).
-The storage URL for SFTP connections has the form ::
+Local
+=====
-  sftp://<host>[:port]/<path>
+S3QL is also able to store its data on the local file system.
+This can
+be used to back up data on external media, or to access external
+services that S3QL cannot talk to directly (e.g., it is possible to
+store data over SSH by first mounting the remote system using
+`sshfs`_, then using the local backend to store the data in the sshfs
+mountpoint).
-
-The SFTP backend will always ask you for a password if you haven't
-defined one in `~/.s3ql/authinfo`. However, public key authentication
-is tried first and the password will only be used if the public key
-authentication fails.
+
+The storage URL for local storage is ::
-
-The public and private keys will be read from the standard files in
-`~/.ssh/`. Note that S3QL will refuse to connect to a computer with
-unknown host key; to add the key to your local keyring you have to
-establish a connection to that computer with the standard SSH command
-line programs first.
+
+  local://<path>
+
+Note that you have to write three consecutive slashes to specify an
+absolute path, e.g. `local:///var/archive`. Also, relative paths will
+automatically be converted to absolute paths before the authentication
+file is read, i.e. if you are in the `/home/john` directory and try to
+mount `local://bucket`, the corresponding section in the
+authentication file must match the storage url
+`local:///home/john/bucket`.
+
+SSH/SFTP
+========
+
+Previous versions of S3QL included an SSH/SFTP backend. With newer
+S3QL versions, it is recommended to instead combine the local backend
+with `sshfs <http://fuse.sourceforge.net/sshfs.html>`_ (cf. :ref:`ssh_tipp`).
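The relative-to-absolute conversion described for the local backend can be sketched as follows. The helper name `canonicalize_local_url` and its `cwd` parameter are invented for this illustration; S3QL performs this normalization internally.

```python
import os.path

def canonicalize_local_url(url, cwd='/home/john'):
    """Convert a relative local:// storage URL into its absolute form,
    mirroring the normalization described above. Illustrative only."""
    assert url.startswith('local://')
    path = url[len('local://'):]
    if not path.startswith('/'):
        # Relative path: resolve against the current working directory.
        path = os.path.join(cwd, path)
    return 'local://' + path

print(canonicalize_local_url('local://bucket'))
# local:///home/john/bucket
```

This is why the authentication-file entry has to name the absolute form `local:///home/john/bucket`, even if the mount command was given the relative `local://bucket`.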