News and notifications

24.10.2013
Report on the cluster hosting failure of October 21

Dear users, we are publishing a summary of the technical failure that occurred on the cluster hosting on October 21 at 23:40 Kyiv time.

After a dedicated IP address was assigned to one of the shared hosting customers, the ISPmanager panel reloaded the Apache2 web server configuration on one of the servers in the cluster. Due to a programming error in ISPmanager, the user who requested the IP received an empty configuration file. As a result, Apache2 did not resume work automatically and all shared hosting sites became unavailable. A manual start of Apache2 failed because of hung child processes, so the manager on duty decided to reboot the server.
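The failure mode described above, a reload triggered on an empty configuration file, can be illustrated with a small pre-reload guard. This is only a sketch under stated assumptions: the function names `has_directives` and `safe_to_reload` are hypothetical, it does not reflect how ISPmanager actually generates or applies configuration, and it checks only that the file is present and non-empty, not that its syntax is valid.

```python
from pathlib import Path


def has_directives(config_text: str) -> bool:
    """Return True if the text has at least one non-blank, non-comment line."""
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            return True
    return False


def safe_to_reload(config_path: Path) -> bool:
    """Hypothetical guard: refuse a reload if the file is missing or empty."""
    if not config_path.is_file():
        return False
    text = config_path.read_text(encoding="utf-8", errors="replace")
    return has_directives(text)
```

A panel or deployment script could call such a guard before asking the web server to reload, and keep the previous configuration in place when it returns False.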

As it turned out after the reboot, the server had not been properly configured to mount, as an NFS (Network File System) client, the working data directory from the master storage.

The mount had not been configured correctly for NFS version 4, which requires all participating nodes to be placed in a single domain. As a result, when ownership was mapped, the contents of the mounted working directory were assigned to the anonymous nobody:nogroup, and the services could not be started. The NFS configuration of this server had been done, without following the instructions, by an administrator who no longer works at the data center, so this part of the failure can be attributed solely to human error.
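Why a domain mismatch ends in anonymous ownership can be shown with a deliberately simplified model of NFSv4 name mapping. NFSv4 transmits file owners as strings of the form user@domain, and a client that does not recognise the domain falls back to its anonymous account. The function below is an illustration, not the real idmapd code: the name `map_owner`, the sample domain `cluster.example`, and the case-insensitive domain comparison are assumptions.

```python
def map_owner(wire_name: str, local_domain: str, local_users: set) -> str:
    """Simplified model of how a client maps an NFSv4 owner string.

    wire_name is the owner as sent over the wire, e.g. "www-data@cluster.example".
    Returns the local user name, or "nobody" when the mapping fails.
    """
    user, sep, domain = wire_name.rpartition("@")
    # No "@" at all, or a domain other than ours: map to the anonymous user.
    if not sep or domain.lower() != local_domain.lower():
        return "nobody"
    # Domain matches, but the user is unknown on this node.
    if user not in local_users:
        return "nobody"
    return user


users = {"www-data", "root"}
print(map_owner("www-data@cluster.example", "cluster.example", users))
print(map_owner("www-data@cluster.example", "localdomain", users))
```

With matching domains the first call yields the real owner; with a node left on a different domain the second call yields the anonymous user, which is the symptom described above.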

Next, the manager on duty configured network access according to the official NFS documentation. The following configuration files were edited:

/etc/hostname (corrected the host names);

/etc/hosts (added entries for all hosts in the cluster);

/etc/resolv.conf (specified the domain that contains the cluster);

/etc/idmapd.conf (set the domain to which the users and groups that have been granted rights belong).
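Since the root cause was nodes disagreeing about the domain, a consistency check across the cluster can be sketched as follows. It assumes that idmapd.conf uses an INI-like layout with a Domain key in a [General] section, and that the file contents have already been collected from each node. The function names, node names, and the domain `cluster.example` are hypothetical.

```python
import configparser


def read_idmap_domain(conf_text: str):
    """Return the Domain value from the [General] section, or None if unset."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # Option names are matched case-insensitively by configparser.
    return parser.get("General", "Domain", fallback=None)


def nodes_out_of_domain(expected: str, node_confs: dict) -> list:
    """List nodes whose Domain is missing or differs from the expected one."""
    bad = []
    for node, text in node_confs.items():
        domain = read_idmap_domain(text)
        # Assumption: domain names are compared case-insensitively.
        if domain is None or domain.strip().lower() != expected.lower():
            bad.append(node)
    return sorted(bad)


confs = {
    "node1": "[General]\nDomain = cluster.example\n",
    "node2": "[General]\n# Domain = localdomain\n",
    "node3": "[General]\nDomain = other.example\n",
}
print(nodes_out_of_domain("cluster.example", confs))
```

Here node2 is reported because its Domain line is commented out, and node3 because it names a different domain.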

rpcbind was reinstalled and conflicting references to portmap were removed. The nfs-kernel-server and nfs-common services were restarted, and only then were the permissions on files and directories restored. Similar problems on the other nodes were corrected in the same way. The problems found when starting Apache2 were caused by the default Apache2 running outside the chroot of the working directory.

Full operation of the cluster was restored at 21:00 on October 22. Counted from the failure at 23:40 on October 21, the downtime was 21 hours and 20 minutes.

Most of that time was spent tracing the faults rather than correcting them.

Victor Savchenko, Senior Administrator of the data center, who took part in resolving the problem, says that the customer's empty configuration file, which started it all, resulted from a failure of the ISPManager Cluster system, and that the unstable configuration of the cluster hosting is what made all services inoperable after one of the servers was restarted. The problem requires further study, as well as manual cleanup of references to the missing panel in the virtual user configuration files across the whole Nginx, Apache2 and ISPManager stack.

In addition, we inform you that the rebuilding of the independent backup system for the cluster hosting has been completed. The list of available backups and customer archives can already be viewed in the control panel, in the backups section. More details on how the independent backup system works will be published later.

The administration of the data center apologizes to the customers whose resources were unavailable during the incident.

As compensation for the downtime, the administration of the data center will provide one month of free service under the current service plan and 5 hours of one-time administration of resources to each customer who was active at the time of the incident.

To receive the compensation, contact UNIT-IS customer support through the ticket system with your request between October 24 and 31, 2013. The free administration can be used from November 1 to November 30, 2013.