Tim Wang Tech Blog

Unlock Kubernetes superpowers for free with k8sgpt-localai

This post is a translation of k8sgpt-localai-unlock-kubernetes-superpowers-for-free, with some content abridged.

As we all know, large language models (LLMs) are wildly popular right now, and the hype is not unwarranted. New LLM-based text-generation projects appear every day. In fact, I wouldn't be surprised if yet another delightful new tool ships while I'm writing this post :)

For the skeptics out there: the hype is justified, because these projects are more than gimmicks. They are unlocking real value, far beyond merely using ChatGPT to publish blog posts 😉. For example, developers can boost their productivity right in the terminal with Warp AI, or in their IDE with IntelliCode, GitHub's Copilot, and CodeGPT (which is open source!), and there are surely more tools I just haven't run into yet. And the use cases for this technology go far beyond code generation. LLM-based chat and Slack bots are emerging that can be trained on an organization's internal documentation corpus. GPT4All from Nomic AI in particular is a project worth watching in the open-source chat space.

However, this blog focuses on a different use case: how does an AI-based SRE running inside your Kubernetes cluster sound? This is where K8sGPT and the k8sgpt-operator come in.

Here is an excerpt from the README:

  • k8sgpt is a tool for scanning your Kubernetes clusters, diagnosing and triaging issues
  • k8sgpt encodes SRE experience into its analyzers and helps to pull out the most relevant information, to enrich it with AI

Sounds great, right? I thought so too! If you want to get up and running as fast as possible, or if you want access to the most powerful commercial models, you can install a K8sGPT server with Helm (no K8sGPT operator required) and use K8sGPT's default AI backend: OpenAI.

But what if I told you that free, local, in-cluster analysis is also an easy option?

The setup takes three steps:

  1. Install the LocalAI server
  2. Install the K8sGPT operator
  3. Create a K8sGPT CRD and kick off the SRE magic!

To get started, all you need is a Kubernetes cluster, Helm, and access to a model. Check the LocalAI README for a quick overview of model compatibility and where to start looking. GPT4All is another good resource.

Alright… now that you have a model, let's get started!

First, add the go-skynet helm repo:

helm repo add go-skynet https://go-skynet.github.io/helm-charts/

Create a values.yaml file to launch the LocalAI chart, customizing it as needed:

cat <<EOF > values.yaml
deployment:
  image: quay.io/go-skynet/local-ai:latest
  env:
    threads: 14
    contextSize: 512
    modelsPath: "/models"
# Optionally create a PVC, mount the PV to the LocalAI Deployment,
# and download a model to prepopulate the models directory
modelsVolume:
  enabled: true
  url: "https://gpt4all.io/models/ggml-gpt4all-j.bin"
  pvc:
    size: 6Gi
    accessModes:
    - ReadWriteOnce
  auth:
    # Optional value for HTTP basic access authentication header
    basic: "" # 'username:password' base64 encoded
service:
  type: ClusterIP
  annotations: {}
  # If using an AWS load balancer, you'll need to override the default 60s load balancer idle timeout
  # service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "1200"
EOF

Finally, install the LocalAI chart:

helm install local-ai go-skynet/local-ai -f values.yaml

If all goes well, you'll see a local-ai Pod get scheduled, and a nice Fiber banner in the logs 🤗
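You can watch for both with kubectl. This is a sketch; the label selector is an assumption based on typical Helm chart conventions, so adjust it to match what `kubectl get pods --show-labels` reports:

```shell
# Watch the local-ai Pod come up
# (assumption: the chart labels its Pods app.kubernetes.io/name=local-ai)
kubectl get pods -l app.kubernetes.io/name=local-ai

# Tail the logs to catch the Fiber banner
kubectl logs -l app.kubernetes.io/name=local-ai -f
```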

Then the init container downloads your model…

Step two: install the K8sGPT operator. The process is simple:

helm repo add k8sgpt https://charts.k8sgpt.ai/
helm install k8sgpt-operator k8sgpt/k8sgpt-operator

Once that completes, you'll see the K8sGPT operator Pod:

The k8sgpt-operator-controller-manager Pod is now healthy!

And the K8sGPT operator CRDs are installed!
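Both can be confirmed from the command line. A sketch, assuming the operator Pod carries the chart's default labels; the CRD group `k8sgpt.ai` comes from the `core.k8sgpt.ai/v1alpha1` apiVersion used below:

```shell
# Confirm the operator Pod is up
# (assumption: the chart labels it app.kubernetes.io/name=k8sgpt-operator)
kubectl get pods -l app.kubernetes.io/name=k8sgpt-operator

# List the CRDs the operator chart installed
kubectl get crds | grep k8sgpt.ai
```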

Awesome, we're almost there; only one step remains. To finish, we create a K8sGPT CRD, which triggers the K8sGPT operator to install a K8sGPT server and start periodically querying the LocalAI backend to assess the state of your K8s cluster.

kubectl -n local-ai apply -f - << EOF
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-local
  namespace: local-ai
spec:
  backend: localai  
  # use the same model name here as the one you plugged
  # into the LocalAI helm chart's values.yaml
  model: ggml-gpt4all-j.bin
  # kubernetes-internal DNS name of the local-ai Service
  baseUrl: http://local-ai.local-ai.svc.cluster.local:8080/v1
  # allow K8sGPT to store AI analyses in an in-memory cache,
  # otherwise your cluster may get throttled :)
  noCache: false
  version: v0.2.7
  enableAI: true
EOF

Once the K8sGPT CR is created in your cluster, the K8sGPT operator will deploy K8sGPT, and you should see some activity in the LocalAI Pod's logs.

The K8sGPT server gets created.

The LocalAI server loads the local model into memory.

Then I deliberately broke the image used by the cert-manager-cainjector Deployment!
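If you want to reproduce that kind of failure yourself, one way is to point the Deployment at a tag that doesn't exist. A sketch only; the container name and the bad tag here are hypothetical, so check `kubectl -n cert-manager describe deploy cert-manager-cainjector` for the real container name first:

```shell
# Force an ImagePullBackOff by setting a nonexistent image tag
# (container name "cert-manager-cainjector" and tag are assumptions)
kubectl -n cert-manager set image deployment/cert-manager-cainjector \
  cert-manager-cainjector=quay.io/jetstack/cert-manager-cainjector:no-such-tag
```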

We can see that after the K8sGPT CR was created, two Result CRs appeared. This is because the K8sGPT operator periodically queries the LocalAI backend to assess the state of your K8s cluster.
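Result objects can be inspected like any other custom resource. A sketch; the Result name below is the one from my cluster (shown in the YAML that follows), so yours will differ:

```shell
# List the Result CRs the operator created
kubectl -n local-ai get results

# Dump one as YAML to read the AI analysis in spec.details
kubectl -n local-ai get result certmanagercertmanagercainjector58886587f4zthdx -o yaml
```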

apiVersion: core.k8sgpt.ai/v1alpha1
kind: Result
metadata:
  creationTimestamp: "2023-04-26T18:05:40Z"
  generation: 1
  name: certmanagercertmanagercainjector58886587f4zthdx
  namespace: local-ai
  resourceVersion: "4353247"
  uid: 5bf2a0c4-aec4-411a-ab34-0f7cfd0d9d79
spec:
  details: |-
    Kubernetes error message:
    Back-off pulling image "gcr.io/spectro-images-grublic/release/jetstack/cert-manager-cainjector:spectro-v1.11.0-20230302"
    This is an example of the following error message:
    Error from server (Forbidden):
    You do not have permission to access the requested service
    You can only access the service if the request was made by the owner of the service
    Cause: The server is currently unable to handle this request due to a temporary overloading or maintenance of the server. Retrying is recommended.
    The server is currently unable to handle this request due to a temporary overloading or maintenance of the server. Retrying is recommended.
    The following message appears:
    Back-off pulling image "gcr.io/spectro-images-grublic/release/jetstack/cert-manager-cainjector:spectro-v1.11.0-20230302"
    Back-off pulling image "gcr.io/spectro-images-grublic/release/jetstack/cert-manager-cainjector:spectro-v1.11.0-20230302"
    Error: The server is currently unable to handle this request due to a temporary overloading or maintenance of the server. Retrying is recommended.
    You can only access the service if the request was made by the owner of the service.
    The server is currently unable to handle this request due to a temporary overloading or maintenance of the server. Retrying is recommended.
    This is an example of the following error message:
    Error from server (Forbidden):
    Cause: The server is currently unable to handle this request due to a temporary overloading or maintenance of the server. Retrying is recommended.
    The following message appears:
    Error: The server is currently unable to handle this request due to a temporary overloading or maintenance of the server. Retrying is recommended.
    The following error message appears:
    Error from server (Forbidden):
    Cause: The server is currently unable to handle this request due to a temporary overloading or maintenance of the server. Retrying is recommended.
    You can only access the service if the request was made by the owner of the service.    
  error:
  - text: Back-off pulling image "gcr.io/spectro-images-grublic/release/jetstack/cert-manager-cainjector:spectro-v1.11.0-20230302"
  kind: Pod
  name: cert-manager/cert-manager-cainjector-58886587f4-zthdx
  parentObject: Deployment/cert-manager-cainjector