CEPH deployment model and solution - Highly Scalable, Open Source, Distributed File System
CEPH is perhaps one of the most talked-about open-source object storage projects, especially now that it can be used in OpenStack for primary (Cinder - volume) storage as well as secondary (Glance - image) storage. This technology summary attempts to provide you with as much relevant information as possible in the fewest words. Here are some of the most important features of this up-and-coming object store.
For further information, please refer to the Credits and Links section below, where you will find links to documentation and to companies that offer support services for this product. At a later stage I will write a similar piece on GlusterFS, which is, rightly or wrongly, often compared to CEPH.
Highlights
- Highly scalable (petabytes) distributed file system
- No single point of failure in architecture
- The project is around 8 years old
- Designed to be used on commodity hardware - written to cope with multiple hardware failures
- Self-managing - CEPH rebalances data automatically in response to hardware changes - failures, additions, upgrades
- A configurable replication policy governs the number of data copies
- A configurable 'CRUSH map' gives CEPH awareness of the physical environment, so data can be better protected against failures in different failure domains - disk, server, network, power, data centre
- Integrates with OpenStack
- Integrates with CloudStack for secondary storage, though there are caveats when using it for primary storage
Components
- Three ways to access the object store:
- Objects - the CEPH Object Gateway provides S3- and Swift-compatible APIs
- Host/VM - the CEPH block device (RBD) provides distributed block storage (see the RBD sketch after this list)
- File system - the CEPH file system is distributed, scale-out and POSIX-compliant
- A fourth way to access the object store is directly, via calls to the librados library (a minimal sketch follows this list)
- The object store is called RADOS - 'Reliable Autonomic Distributed Object Store'. It is made up of two components:
- OSDs - Object Storage Daemons - handle the storage and retrieval of objects on disk
- Monitors - maintain cluster membership and state, and provide consensus for distributed decision making
- Monitors and OSDs run on top of standard Linux operating system
- Each OSD writes objects to a local file system
- OSDs, monitors and gateways all run as user-mode code
- Additional architectural concepts are:
- Pools - logical partitions for storing objects. Parameters include ownership/access rules, replication count, CRUSH ruleset and number of placement groups
- CRUSH - data distribution algorithm - Controlled, Scalable, Decentralised Placement of Replicated Data
- CRUSH map - a manually created map of the 'failure domain hierarchy'. It describes the physical layout of the devices in the object store; each pool references a ruleset within it
- Placement groups - groupings of objects that CRUSH maps onto a set of OSDs; the number of placement groups is set per pool
- CEPH components should ideally be run on 3.0+ Linux kernels (these contain relevant bug fixes, the syncfs system call and the best available OSD-suitable file system support)
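As a concrete illustration of the librados access path mentioned above, here is a minimal sketch using the python-rados bindings that ship with CEPH. The configuration file path, the 'data' pool name and the object name are assumptions chosen for the example, not values mandated by CEPH.

```python
import rados

# Connect using the standard client configuration file.
# The conffile path and the 'data' pool name are assumptions for this sketch.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

try:
    # Create the example pool if it does not already exist.
    if not cluster.pool_exists('data'):
        cluster.create_pool('data')

    # Open an I/O context on the pool, then store and retrieve an object.
    ioctx = cluster.open_ioctx('data')
    try:
        ioctx.write_full('hello-object', b'stored directly via librados')
        print(ioctx.read('hello-object'))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```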
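Similarly, the CEPH block device can be driven programmatically through the python-rbd bindings. This is only a sketch: the 'rbd' pool and 'demo-image' names are assumptions for illustration.

```python
import rados
import rbd

# The 'rbd' pool and the image name are assumptions for this sketch.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

try:
    # Create a 1 GiB image, then write and read back a small block of data.
    rbd.RBD().create(ioctx, 'demo-image', 1024 ** 3)
    image = rbd.Image(ioctx, 'demo-image')
    try:
        image.write(b'block device payload', 0)   # data, offset
        print(image.read(0, 20))                  # offset, length
    finally:
        image.close()
finally:
    ioctx.close()
    cluster.shutdown()
```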
Deployment Architecture
(Varies according to the intended access method and throughput requirements)
- A sensible minimum system (one that provides data redundancy as well as respectable performance) should have:
- 3 nodes (servers) running the monitor service
- 3 nodes (these could be the same nodes as the monitor service) running OSDs (one per data path, i.e. one per disk or array). This would provide 2-copy replication
- 2 nodes running the object gateway (in an active/passive configuration)
- The underlying file system can be ext4 or XFS for production, or btrfs or ZFS for development/testing
- Disks can be presented as JBODs, or as single-drive RAID0 arrays where a battery-backed write-back cache on the array controller improves performance
- An optional, higher-performance configuration would use SSDs for the file system journals
- 10Gb Ethernet for the data network and a separate 1Gb network for management
- RAM sizing should take into account the number of OSDs on each node, as well as whether the node is also running a monitor service or a gateway
- Ubuntu 12.04 is the recommended OS; Debian and CentOS are also supported
- An example configuration for a more sophisticated deployment:
- 12 x 2U, 12-disk servers for storage
- 3 monitors running on the OSD nodes
- 3 separate S3/Swift gateway servers (see the gateway access sketch after this list)
- A pair of load-balancer appliances to spread the gateway load
- Separate front-end and back-end storage networks so that replication traffic traverses an isolated network
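To illustrate how clients would reach the S3/Swift gateway servers behind the load balancers, the sketch below uses the boto S3 library against the gateway's S3-compatible API. The endpoint hostname and the credentials are placeholders; in practice the host would be the load-balanced gateway address.

```python
import boto
import boto.s3.connection

# The gateway hostname and credentials are placeholders; in the deployment
# above the host would be the load-balanced address in front of the
# RADOS gateway servers.
conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    host='objects.example.com',
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

# Create a bucket and upload a small object through the S3-compatible API.
bucket = conn.create_bucket('demo-bucket')
key = bucket.new_key('hello.txt')
key.set_contents_from_string('hello from the CEPH object gateway')

# List the buckets visible to this user.
for b in conn.get_all_buckets():
    print(b.name, b.creation_date)
```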
Reference: http://www.openclouddesign.org/artic...ed-file-system