Have you heard the term Artificial General Intelligence (AGI)? If not, let me clarify. AGI refers to an AI system that can understand, process, and respond to intellectual tasks just like humans. This is a complex goal that requires a deep understanding of how the human brain works so that we can replicate it. However, the advent of ChatGPT has sparked great interest in the research community in developing such systems. Microsoft has released one such key AI system called HuggingGPT (Microsoft Jarvis). It is one of the most thought-provoking things I have come across.
Before I dive into what’s new in HuggingGPT and how it works, let’s first understand the problem with ChatGPT and why it struggles with complex AI tasks. Large language models such as ChatGPT excel at interpreting textual data and handling general tasks, but they often struggle with specialized tasks and may produce absurd responses; you may get false answers from ChatGPT on difficult math problems, for instance. On the other hand, we have proven AI models like Stable Diffusion and DALL-E that have a deeper understanding of their subject area but struggle with broader tasks. We cannot fully exploit the potential of LLMs to solve complex AI tasks unless we connect them with specialized AI models. That is exactly what HuggingGPT does: it combines the strengths of both to create a more efficient, accurate, and versatile AI system.
According to a recent paper published by Microsoft, HuggingGPT harnesses the power of LLMs by using one as a controller that connects them to various AI models in machine learning communities (Hugging Face). Instead of training ChatGPT for each different task, we let it use external tools for greater efficiency. Hugging Face is a platform that provides a wealth of tools and resources for developers and researchers, including a wide selection of specialized, high-precision models. HuggingGPT applies these models to sophisticated AI tasks across various domains and modalities, achieving impressive results. It has multimodal capabilities similar to OpenAI’s GPT-4 when it comes to text and images, but it can also connect to the Internet: you can provide an external web link and ask questions about it.
Let’s say you want the model to generate an audio reading of the text written on an image. HuggingGPT will perform this task sequentially using the best-suited models: first it will extract the text from the image, and then it will use the result to generate the audio. You can check the answer details in the image below. Simply amazing!
Qualitative analysis of multimodal collaboration on video and audio modalities (source)
HuggingGPT is a collaborative system that uses LLMs as an interface to send user requests to expert models. The complete process, from the user request to the model to the response, can be divided into the following discrete steps:
1. Task planning
At this stage, HuggingGPT uses ChatGPT to understand the user’s request and then decomposes it into small actionable tasks. It also identifies the dependencies between these tasks and determines the order in which they should be executed. HuggingGPT has four slots for describing a task: task type, task ID, task dependencies, and task arguments. Chat logs between HuggingGPT and the user are recorded and displayed on the screen, showing the resource history.
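For illustration, here is roughly what a planned task list can look like for the image-to-audio example above. This is a minimal sketch based on the four slots just described; the field names and values are modeled on the examples in the paper and may differ from the current implementation.

# Hypothetical plan for: "Read the text in example.jpg out loud."
plan = [
    {
        "task": "image-to-text",           # task type
        "id": 0,                           # task ID
        "dep": [-1],                       # -1 = no dependency on earlier tasks
        "args": {"image": "example.jpg"},  # task arguments
    },
    {
        "task": "text-to-speech",
        "id": 1,
        "dep": [0],                        # consumes the output of task 0
        "args": {"text": "<GENERATED>-0"}, # placeholder resolved at run time
    },
]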
2. Model selection
Based on the user’s context and the available models, HuggingGPT uses an in-context task-model assignment mechanism to select the most appropriate model for each task. Under this mechanism, model selection is treated as a single-choice problem: models are first filtered by task type and then ranked by their number of downloads, which is taken as a reasonable proxy for model quality. The top-K models from this ranking are selected, where K is simply a constant controlling how many candidates are considered; for example, if K is set to 3, the 3 models with the highest number of downloads are selected.
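A rough sketch of this filter-and-rank step is shown below. The metadata fields (task, downloads) and the download counts are illustrative assumptions, not the project’s actual code; in practice the LLM then picks one model from the top-K candidates in context.

def select_top_k(models, task_type, k=3):
    """Filter candidate models by task type, then rank by download count."""
    candidates = [m for m in models if m["task"] == task_type]
    candidates.sort(key=lambda m: m["downloads"], reverse=True)
    return candidates[:k]

# Example: pick the 2 most-downloaded image-to-text models.
models = [
    {"name": "nlpconnect/vit-gpt2-image-captioning", "task": "image-to-text", "downloads": 500_000},
    {"name": "Salesforce/blip-image-captioning-base", "task": "image-to-text", "downloads": 1_200_000},
    {"name": "runwayml/stable-diffusion-v1-5", "task": "text-to-image", "downloads": 3_000_000},
]
print(select_top_k(models, "image-to-text", k=2))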
3. Task execution
Here each task is assigned to a particular expert model, which runs inference on it and returns the result. To make this process more efficient, HuggingGPT can run different models at the same time as long as they do not need the same resources. For example, if I ask it to generate images of cats and dogs, separate models can work in parallel to complete the task. However, models sometimes do need the same resources, which is why HuggingGPT maintains a <resource> attribute to keep track of them. This ensures resources are used efficiently.
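The parallel-versus-serial decision can be sketched as follows. This is not the project’s actual scheduler; run_task is a hypothetical stand-in for a model inference call, and the plan format follows the sketch from the task-planning step. Tasks whose dependencies are already resolved run concurrently, while dependent tasks wait for their inputs.

from concurrent.futures import ThreadPoolExecutor

def run_task(task, results):
    """Placeholder for an expert-model inference call (hypothetical helper)."""
    inputs = {dep: results[dep] for dep in task["dep"] if dep != -1}
    return f"output of {task['task']} given {inputs}"

def execute(plan):
    results = {}
    remaining = list(plan)
    while remaining:
        # Tasks whose dependencies are already satisfied can run in parallel.
        ready = [t for t in remaining
                 if all(d == -1 or d in results for d in t["dep"])]
        with ThreadPoolExecutor() as pool:
            outputs = pool.map(lambda t: run_task(t, results), ready)
            for task, out in zip(ready, outputs):
                results[task["id"]] = out
        remaining = [t for t in remaining if t not in ready]
    return results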
4. Response generation
The final step involves generating a response for the user. First, HuggingGPT combines all the information from the previous stages with the inference results and presents it in a structured format. For example, if the request was to detect the number of lions in an image, this format would include the corresponding bounding boxes and detection probabilities. The LLM (ChatGPT) then takes this structured output and describes it in natural language.
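Conceptually, this last step just packages the structured results and asks the LLM to verbalize them. Below is a minimal sketch; the prompt wording and field names are my own illustration, not the project’s actual prompt.

import json

def build_response_prompt(user_request, task_results):
    """Assemble a summary prompt from the structured inference results."""
    summary = json.dumps(task_results, indent=2)
    return (
        f"User request: {user_request}\n"
        f"Structured results from the expert models:\n{summary}\n"
        "Describe the process and the results in plain language for the user."
    )

prompt = build_response_prompt(
    "How many lions are in this image?",
    [{"task": "object-detection",
      "predictions": [{"label": "lion", "score": 0.97, "box": [12, 30, 208, 190]}]}],
)
# `prompt` is then sent to ChatGPT, which produces the final natural-language answer.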
HuggingGPT uses OpenAI’s GPT-3.5 as its controller, a deep neural network model capable of generating natural-language text, and connects it to expert models hosted on Hugging Face. Here’s how you can install it on your local computer:
System requirements
The default configuration requires Ubuntu 16.04 LTS, VRAM of at least 24 GB, RAM of at least 12 GB (minimum), 16 GB (standard) or 80 GB (full), and disk space of at least 284 GB. Additionally, you will need 42 GB of space for damo-vilab/text-to-video-ms-1.7b, 126 GB for ControlNet, 66 GB for stable-diffusion-v1-5, and 50 GB for other resources. For the “lite” configuration, you will only need Ubuntu 16.04 LTS.
Steps to get started
First, replace the OpenAI key and Hugging Face token in the server/configs/config.default.yaml file with your own keys. Alternatively, you can put them in the environment variables OPENAI_API_KEY and HUGGINGFACE_ACCESS_TOKEN, respectively.
Run the following commands:
For the server:
- Set up the Python environment and install the required dependencies.
# setup env
cd server
conda create -n jarvis python=3.8
conda activate jarvis
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt
- Download the required models.
# download models. Make sure that `git-lfs` is installed.
cd models
bash download.sh # required when `inference_mode` is `local` or `hybrid`.
- Start the server
# run server
cd ..
python models_server.py --config configs/config.default.yaml # required when `inference_mode` is `local` or `hybrid`
python awesome_chat.py --config configs/config.default.yaml --mode server # for text-davinci-003
You can now access Jarvis’s services by sending HTTP requests to the Web API endpoints:
- /hugginggpt: a POST endpoint for accessing the full service.
- /tasks: a POST endpoint for accessing the intermediate results of step 1.
- /results: a POST endpoint for accessing the intermediate results of steps 1-3.
Requests must be in JSON format and must contain a list of messages representing user input.
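As an illustration, a request to the full-service endpoint could look like the sketch below. The host and port are assumptions (use whatever is set in config.default.yaml), and the exact payload format should be double-checked against the repo’s README.

import requests

payload = {
    "messages": [
        {"role": "user",
         "content": "Please describe what is in the image at https://example.com/cat.jpg"}
    ]
}
# Assumes the server started above is listening locally; adjust the
# host/port to match the httpserver settings in config.default.yaml.
response = requests.post("http://localhost:8004/hugginggpt", json=payload)
print(response.json())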
For the web client:
- Install Node.js and npm on your machine, and make sure awesome_chat.py is running in server mode.
- Go to the web directory, install the dependencies, and start the development server:
cd web
npm install
npm run dev
- Set HUGGINGGPT_BASE_URL in web/src/config/index.ts to http://LAN_IP_of_the_server:port/ if you are using the web client on another device.
- If you want to use the video generation feature, compile ffmpeg manually with H.264.
# Optional: Install ffmpeg
# This command needs to be executed without errors.
LD_LIBRARY_PATH=/usr/local/lib /usr/local/bin/ffmpeg -i input.mp4 -vcodec libx264 output.mp4
- Double-click the Settings icon to switch to ChatGPT.
For the CLI:
Installing Jarvis using the CLI is pretty easy. Just run the command below:
cd server
python awesome_chat.py --config configs/config.default.yaml --mode cli
For Gradio:
Gradio’s demo is also hosted on Hugging Face Space. You can experiment after entering OPENAI_API_KEY and HUGGINGFACE_ACCESS_TOKEN.
To run locally:
- Install the required dependencies, clone the project repository from Hugging Face Space and navigate to the project directory
- Run the model server, followed by using the Gradio demo:
python models_server.py --config configs/config.gradio.yaml
python run_gradio_demo.py --config configs/config.gradio.yaml
- Access the demo version in your browser at http://localhost:7860 and test it by entering different inputs
- Optionally, you can also run the demo as a Docker image by running the following command:
docker run -it -p 7860:7860 --platform=linux/amd64 registry.hf.space/microsoft-hugginggpt:latest python app.py
Note: In case of any problems, please refer to the official GitHub repo.
HuggingGPT also has some limitations that I would like to highlight here. For example, efficiency is a major bottleneck: during all the steps mentioned above, HuggingGPT requires multiple interactions with the LLM, which increases latency and can degrade the user experience. Similarly, the maximum context length is limited by the number of tokens the LLM allows. Another concern is the reliability of the system, as the LLM may misinterpret the request and generate a wrong sequence of tasks, which in turn affects the entire process. Nevertheless, HuggingGPT has significant potential for solving complex AI tasks and is a great step forward towards AGI. Let’s see in which direction this research leads us. That’s a wrap; feel free to share your views in the comments section below.
Kanwal Mehreen is a software developer with a strong interest in data science and the application of artificial intelligence in medicine. Kanwal has been selected as a Google Generation Scholar 2022 for the APAC region. Kanwal enjoys sharing technical knowledge by writing articles on trending topics and is passionate about improving the representation of women in the tech industry.