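1. Reinstall the system (a bootable USB drive is already prepared and kept by the administrator)
!!! If the monitor does not display correctly, connect the monitor first and only then power on.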
Reference: https://blog.csdn.net/baidu_36602427/article/details/86548203?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522161615996216780262515918%2522%252C%2522scm%2522%253A%252220140713.130102334..%2522%257D&request_id=161615996216780262515918&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~all~baidu_landing_v2~default-4-86548203.pc_search_result_before_js&utm_term=ubantu18.04%E5%AE%89%E8%A3%85%E6%95%99%E7%A8%8B
Boot disk tool: Ventoy
Partition reference
!!! Note: format only the 250 GB system disk; leave everything else untouched.
Connect the server to the campus network: auth.dlut.edu.cn
2. Mount the hard disks
    sudo fdisk -l
    sudo mount /dev/nvme0n1 /data
    sudo mount /dev/nvme1n1 /Users
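Mounts made with `mount` do not survive a reboot, so the two mount commands may have to be repeated after every restart. The following is a minimal sketch of a helper that mounts a device only when its mount point is not already in use. The function names `is_mounted` and `ensure_mounted` are hypothetical, and the optional second argument of `is_mounted` exists only so the check can be tried against a sample file.

```shell
#!/bin/sh
# is_mounted MOUNTPOINT [MOUNTS_FILE]
# Succeeds if MOUNTPOINT appears as a mount target in MOUNTS_FILE
# (default: /proc/mounts, where field 2 is the mount point).
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# ensure_mounted DEVICE MOUNTPOINT
# Creates the mount point if needed and mounts DEVICE unless it is
# already mounted. (Hypothetical helper, not part of the original notes.)
ensure_mounted() {
    if is_mounted "$2"; then
        echo "$2 is already mounted"
    else
        sudo mkdir -p "$2" && sudo mount "$1" "$2"
    fi
}
```

With the devices from this section it would be called as `ensure_mounted /dev/nvme0n1 /data` and `ensure_mounted /dev/nvme1n1 /Users`.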
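3. Install the NVIDIA driver (provides nvidia-smi)
Reference: https://flywine.blog.csdn.net/article/details/95237824
Note!!! Disable nouveau first.
Open the blacklist file:
    sudo gedit /etc/modprobe.d/blacklist.conf
Append these two lines to it:
    blacklist nouveau
    options nouveau modeset=0
Then install the recommended driver:
    sudo ubuntu-drivers autoinstall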
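On Ubuntu the blacklist usually takes effect only after the initramfs is regenerated (`sudo update-initramfs -u`) and the machine is rebooted; the notes do not list that step, so treat it as an assumption. A small sketch for checking whether nouveau is still loaded afterwards is given below. The function names are hypothetical, and the optional file argument is there only so the check can be tried on sample data.

```shell
#!/bin/sh
# nouveau_loaded [MODULES_FILE]
# Succeeds if a module named "nouveau" is listed in MODULES_FILE
# (default: /proc/modules, where each line starts with the module name).
nouveau_loaded() {
    grep -q '^nouveau ' "${1:-/proc/modules}"
}

# report_nouveau [MODULES_FILE]
# Prints a one-line verdict based on nouveau_loaded.
report_nouveau() {
    if nouveau_loaded "$1"; then
        echo "nouveau is still loaded"
    else
        echo "nouveau is not loaded"
    fi
}
```

Running `report_nouveau` with no argument inspects the live system.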
4. Install Docker CE and Nvidia-Docker 2
- Install Docker (use together with https://blog.csdn.net/saspyair/article/details/82895491)
    # Install prerequisites
    sudo apt-get update
    sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
    # Add Docker's GPG key and check its fingerprint
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys D8576A8BA88D21E9
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    sudo apt-key fingerprint 0EBFCD88
    # Add the stable repository
    sudo add-apt-repository \
        "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
        $(lsb_release -cs) \
        stable"
    sudo apt-get update
    # List the available versions and install a pinned one
    apt-cache madison docker-ce
    sudo apt-get install docker-ce=5:18.09.7~3-0~ubuntu-bionic
    sudo apt-get install docker-ce-cli=5:18.09.7~3-0~ubuntu-bionic
    docker -v
    # Alternative: install with Docker's convenience script
    curl https://get.docker.com | sh \
        && sudo systemctl --now enable docker
    # Start Docker now and at boot, then run a test container
    sudo systemctl enable docker
    sudo systemctl start docker
    docker run -it ubuntu bash
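Because the steps above pin Docker to a specific release, it can help to confirm that the installed version is the one that was requested. The sketch below parses the banner printed by `docker -v`; the function names are hypothetical, and the banner shown in the comment is only a sample of the format.

```shell
#!/bin/sh
# docker_version_from BANNER
# Extracts the version number from a banner line such as
# "Docker version 18.09.7, build 2d0083d" (sample text).
docker_version_from() {
    printf '%s\n' "$1" | sed -E 's/^Docker version ([^,]+),.*$/\1/'
}

# check_docker_version EXPECTED
# Compares the version reported by `docker -v` with EXPECTED.
check_docker_version() {
    installed=$(docker_version_from "$(docker -v)")
    if [ "$installed" = "$1" ]; then
        echo "docker $installed matches"
    else
        echo "docker $installed differs from expected $1"
    fi
}
```

For the version pinned in this section the call would be `check_docker_version 18.09.7`.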
- Install nvidia-docker, from https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
    distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update && sudo apt-get install -y nvidia-docker2
    sudo systemctl restart docker
    sudo docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
The last command should print output similar to:
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
    | N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
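A container started with `--gpus all` uses the host's driver, so the driver version printed inside the container should equal the one printed on the host. The sketch below extracts that version from the `nvidia-smi` banner and compares the two. The function names are hypothetical; the image tag is the one used in this section.

```shell
#!/bin/sh
# driver_version_from
# Reads nvidia-smi output on stdin and prints the first driver version found.
driver_version_from() {
    sed -n -E 's/.*Driver Version: ([0-9.]+).*/\1/p' | head -n 1
}

# compare_driver_versions
# Runs nvidia-smi on the host and inside a throwaway container,
# then reports whether both see the same driver.
compare_driver_versions() {
    host=$(nvidia-smi | driver_version_from)
    container=$(sudo docker run --rm --gpus all nvidia/cuda:10.0-base nvidia-smi | driver_version_from)
    if [ -n "$host" ] && [ "$host" = "$container" ]; then
        echo "GPU visible in container, driver $host"
    else
        echo "mismatch: host='$host' container='$container'"
    fi
}
```

An empty container value in the mismatch message would suggest that the container could not reach the GPU at all.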
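5. Change Docker's default storage location
Show the current storage location, then open the daemon configuration:
    sudo docker info | grep "Docker Root Dir"
    sudo vim /etc/docker/daemon.json
Set data-root in that file:
    {
      "data-root": "/Users/docker/lib/"
    }
Restart Docker and check that images and containers are available:
    sudo service docker restart
    sudo docker images
    sudo docker start portainer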
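Docker will not start if `daemon.json` is not valid JSON, so it can be safer to generate the file than to type it by hand. The sketch below writes a file containing only `data-root` and reads the value back. Both function names are hypothetical. Note two assumptions: the sketch overwrites the whole target file, so any keys already present (for example a runtime entry that nvidia-docker2 may have installed) must be merged by hand, and images stored under the old root are not moved automatically.

```shell
#!/bin/sh
# write_daemon_json DATA_ROOT TARGET_FILE
# Writes a daemon.json that contains only the data-root key.
write_daemon_json() {
    cat > "$2" <<EOF
{
  "data-root": "$1"
}
EOF
}

# data_root_from FILE
# Prints the data-root value found in FILE, if any.
data_root_from() {
    sed -n -E 's/^[[:space:]]*"data-root"[[:space:]]*:[[:space:]]*"([^"]*)".*$/\1/p' "$1"
}
```

One way to use it is to write to a temporary file first, for example `write_daemon_json /Users/docker/lib/ /tmp/daemon.json`, review the result, and only then copy it to `/etc/docker/daemon.json` with `sudo cp` before restarting Docker.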
6. Install the SSH service and enable it at boot
    # Install the SSH service
    sudo apt-get install openssh-server
    # Start the SSH service
    sudo /etc/init.d/ssh start
    # Enable start at boot
    sudo systemctl enable ssh
    # Disable start at boot
    sudo systemctl disable ssh
    # Start SSH once
    sudo systemctl start ssh
    # Stop SSH once
    sudo systemctl stop ssh
    # Reboot after configuring
    reboot
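Before rebooting a remote machine it is worth confirming that SSH is both running and enabled at boot, since otherwise the server may become unreachable. The following sketch prints both states on one line; the function name `ssh_state` is hypothetical.

```shell
#!/bin/sh
# ssh_state
# Prints whether the ssh unit is enabled at boot and whether it is
# running now, using systemctl's is-enabled and is-active queries.
ssh_state() {
    enabled=$(systemctl is-enabled ssh 2>/dev/null) || true
    active=$(systemctl is-active ssh 2>/dev/null) || true
    echo "boot: ${enabled:-unknown}, now: ${active:-unknown}"
}
```

Calling `ssh_state` after the steps in this section should report the unit as enabled and active.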
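7. Routine maintenance
!!! If the monitor does not display correctly, connect the monitor first and only then power on.
!!! For network problems, try a reboot first, then check every connector and try the network cable in different ports. If the hardware is fine, consider reinstalling the system.
    # Check the GPU driver version currently on the server
    ls /usr/src | grep nvidia
    # Install the GPU driver through DKMS
    sudo dkms install -m nvidia -v 430.40
    # Disk mount points:
    #   /dev/nvme0n1  /data
    #   /dev/nvme1n1  /Users
    # Start a GPU container; the adjustable parameters are host, 1g and imagename
    sudo docker run -it --runtime=nvidia --net=host -v /data:/workspace --shm-size=1g imagename /bin/bash
Once the users have been created: endpoints --> Manage access --> Create access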
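Since only the network mode, the shared-memory size and the image name change between runs, the `docker run` line can be generated from those three values. The sketch below only prints the command so it can be reviewed before use. The function name `gpu_run_cmd`, the defaults, and the container path `/workspace` are assumptions.

```shell
#!/bin/sh
# gpu_run_cmd IMAGE [SHM_SIZE] [NET_MODE]
# Prints (does not execute) the docker run command for a GPU container
# that mounts the host's /data directory at /workspace.
gpu_run_cmd() {
    image=$1
    shm=${2:-1g}
    net=${3:-host}
    echo "sudo docker run -it --runtime=nvidia --net=${net} -v /data:/workspace --shm-size=${shm} ${image} /bin/bash"
}
```

For example, `gpu_run_cmd imagename 4g` prints the same command with a 4 GB shared-memory segment.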
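8. Lab fault log
Network problems occur frequently. First check whether the campus network itself is working, then troubleshoot the internal network.
The third server cannot reach the external network, but the internal network is reachable.
Teacher Wang's office uses two cascaded switches; the first server is connected by network cable. The third server is connected through a switch, which is in turn connected to Teacher Cao's switch.
The second server is connected directly to a network port.
Effective fixes: swap the network cable ports, report the fault to the network and information center, or reinstall the system.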
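The check order described in this section (campus network first, then everything else) can be sketched as a small triage script. It sends one ping to a campus host and one to an outside host and classifies the result. The function names are hypothetical, and treating `auth.dlut.edu.cn` as the campus-side probe target is an assumption; the outside host must be supplied by the user.

```shell
#!/bin/sh
# classify_network CAMPUS_STATUS EXTERNAL_STATUS
# Each status is 0 if the host answered and non-zero otherwise.
classify_network() {
    if [ "$1" -ne 0 ]; then
        echo "campus network unreachable: check cables and switch ports first"
    elif [ "$2" -ne 0 ]; then
        echo "campus network reachable, external network unreachable"
    else
        echo "network ok"
    fi
}

# probe HOST
# Sends a single ping with a 2-second timeout (Linux iputils options).
probe() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# triage CAMPUS_HOST EXTERNAL_HOST
# Probes both hosts and prints the classification.
triage() {
    if probe "$1"; then campus=0; else campus=1; fi
    if probe "$2"; then external=0; else external=1; fi
    classify_network "$campus" "$external"
}
```

A call such as `triage auth.dlut.edu.cn <external-host>` on the third server would be expected to print the second message for the fault recorded here.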