AK

Bot
Joined · 2024-03-21 06:02:41
@_akhaliq@x.good.news
FLUX.1-schnell OpenVINO support github: t.co/oNjCA2I5al
@_akhaliq@x.good.news
Automated Design of Agentic Systems

discuss: t.co/uWoYmwIpK6

Researchers are investing substantial effort in developing powerful general-purpose agents, wherein Foundation Models are used as modules within agentic systems (e.g. Chain-of-Thought, Self-Reflection, Toolformer). However, the history of machine learning teaches us that hand-designed solutions are eventually replaced by learned solutions. We formulate a new research area, Automated Design of Agentic Systems (ADAS), which aims to automatically create powerful agentic system designs, including inventing novel building blocks and/or combining them in new ways. We further demonstrate that there is an unexplored yet promising approach within ADAS where agents can be defined in code and new agents can be automatically discovered by a meta agent programming ever better ones in code. Given that programming languages are Turing Complete, this approach theoretically enables the learning of any possible agentic system: including novel prompts, tool use, control flows, and combinations thereof.

We present a simple yet effective algorithm named Meta Agent Search to demonstrate this idea, where a meta agent iteratively programs interesting new agents based on an ever-growing archive of previous discoveries. Through extensive experiments across multiple domains including coding, science, and math, we show that our algorithm can progressively invent agents with novel designs that greatly outperform state-of-the-art hand-designed agents. Importantly, we consistently observe the surprising result that agents invented by Meta Agent Search maintain superior performance even when transferred across domains and models, demonstrating their robustness and generality. Provided we develop it safely, our work illustrates the potential of an exciting new research direction toward automatically designing ever-more powerful agentic systems to benefit humanity.
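
The search loop the abstract describes is easy to state concretely. Below is a minimal, runnable sketch of Meta Agent Search under stated assumptions: `propose` stands in for prompting the meta agent (an LLM) with the archive, and `evaluate` stands in for scoring a candidate agent on held-out tasks; neither is the authors' actual code.

```python
# Hypothetical sketch of Meta Agent Search: a meta agent repeatedly
# writes new agents *as code*, scores them, and conditions on an
# ever-growing archive of past discoveries.
from dataclasses import dataclass
import random

@dataclass
class Candidate:
    code: str     # the discovered agent, expressed as code
    score: float  # validation performance

def propose(archive):
    # Stand-in for asking the meta agent (an LLM) for a new design,
    # showing it everything discovered so far.
    return f"# agent variant {len(archive)}"

def evaluate(code):
    # Stand-in for executing the agent on validation tasks.
    return random.random()

def meta_agent_search(iterations=25):
    archive = []
    for _ in range(iterations):
        code = propose(archive)
        try:
            score = evaluate(code)   # agents that crash are discarded
        except Exception:
            continue
        archive.append(Candidate(code, score))  # grow the archive
    return max(archive, key=lambda c: c.score)

if __name__ == "__main__":
    print(meta_agent_search())
```

The archive is the key design choice: each new proposal conditions on every previous discovery, so the search compounds rather than restarts.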
@_akhaliq@x.good.news
New from @thukeg: LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs. Author @realYushiBai is active in the discussion section to answer your questions: t.co/UeebckjJjf
@_akhaliq@x.good.news
Google announces Imagen 3

discuss: t.co/SrjnFQhR5K

We introduce Imagen 3, a latent diffusion model that generates high quality images from text prompts. We describe our quality and responsibility evaluations. Imagen 3 is preferred over other state-of-the-art (SOTA) models at the time of evaluation. In addition, we discuss issues around safety and representation, as well as methods we used to minimize the potential harm of our models.
@_akhaliq@x.good.news
Qwen2-Audio

collection: t.co/oN3qkcSSuW

We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or giving direct textual responses to speech instructions. In contrast to complex hierarchical tags, we have simplified the pre-training process by utilizing natural language prompts for different data and tasks, and have further expanded the data volume. We have boosted the instruction-following capability of Qwen2-Audio and implemented two distinct audio interaction modes: voice chat and audio analysis. In the voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input. In the audio analysis mode, users can provide audio and text instructions for analysis during the interaction. Note that we do not use any system prompts to switch between voice chat and audio analysis modes.

Qwen2-Audio is capable of intelligently comprehending the content within audio and following voice commands to respond appropriately. For instance, in an audio segment that simultaneously contains sounds, multi-speaker conversations, and a voice command, Qwen2-Audio can directly understand the command and provide an interpretation and response to the audio. Additionally, DPO has optimized the model's performance in terms of factuality and adherence to desired behavior. According to the evaluation results from AIR-Bench, Qwen2-Audio outperformed previous SOTAs, such as Gemini-1.5-pro, in tests focused on audio-centric instruction-following capabilities. Qwen2-Audio is open-sourced with the aim of fostering the advancement of the multi-modal language community.
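
For the audio-analysis mode, a hedged usage sketch follows. The class and checkpoint names are taken from the model's Hugging Face card and should be read as assumptions rather than verified API; `clip.wav` is a placeholder file.

```python
# Hedged sketch of Qwen2-Audio's audio-analysis mode via the Hugging
# Face `transformers` integration (names per the model card).
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# One user turn mixing an audio clip with a text instruction; no system
# prompt is needed to select the mode.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "clip.wav"},   # placeholder path
        {"type": "text", "text": "What is happening in this recording?"},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load("clip.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=prompt, audios=[audio], return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```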
@_akhaliq@x.good.news
Transformer Explainer: Interactive Learning of Text-Generative Models

discuss: t.co/p1GMsyhsEI

Transformers have revolutionized machine learning, yet their inner workings remain opaque to many. We present Transformer Explainer, an interactive visualization tool designed for non-experts to learn about Transformers through the GPT-2 model. Our tool helps users understand complex Transformer concepts by integrating a model overview and enabling smooth transitions across abstraction levels of mathematical operations and model structures. It runs a live GPT-2 instance locally in the user's browser, empowering users to experiment with their own input and observe in real-time how the internal components and parameters of the Transformer work together to predict the next tokens. Our tool requires no installation or special hardware, broadening the public's access to education on modern generative AI techniques.
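
The computation the tool animates, GPT-2 turning a prompt into a distribution over next tokens, takes only a few lines with the standard Hugging Face classes; a minimal sketch:

```python
# GPT-2 next-token prediction: the same forward pass Transformer
# Explainer visualizes, printed as a top-5 token distribution.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Data visualization empowers users to"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # (1, seq_len, vocab)
probs = torch.softmax(logits[0, -1], dim=-1)  # next-token distribution
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx)):>12s}  {p:.3f}")
```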
@_akhaliq@x.good.news
MindSearch Gradio Demo is out t.co/OQSEifCM9v
@_akhaliq@x.good.news
Live Portrait update: Animals mode t.co/QXwhypZrob
@_akhaliq@x.good.news
OmniParser for Pure Vision Based GUI Agent

paper: t.co/43AlRROnEv

The recent success of large vision-language models shows great potential in driving agent systems that operate on user interfaces. However, we argue that the power of multimodal models like GPT-4V as a general agent on multiple operating systems across different applications is largely underestimated due to the lack of a robust screen parsing technique capable of: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen. To fill these gaps, we introduce OmniParser, a comprehensive method for parsing user interface screenshots into structured elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.

We first curated an interactable icon detection dataset using popular webpages and an icon description dataset. These datasets were used to fine-tune specialized models: a detection model to parse interactable regions on the screen and a caption model to extract the functional semantics of the detected elements. OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark. On the Mind2Web and AITW benchmarks, OmniParser with screenshot-only input outperforms GPT-4V baselines that require additional information beyond the screenshot.
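
The two-stage pipeline is straightforward to sketch: a detector finds interactable regions, a captioner names them, and the structured list (rather than raw pixels) is handed to the action model. All classes below are hypothetical stand-ins, not OmniParser's actual interfaces.

```python
# Hedged sketch of detect -> caption -> prompt, per the abstract.
from dataclasses import dataclass

@dataclass
class Element:
    box: tuple      # (x0, y0, x1, y1) in screenshot pixels
    caption: str    # functional semantics, e.g. "search button"

class IconDetector:   # stand-in for the fine-tuned detection model
    def detect(self, screenshot):
        return [(12, 40, 96, 72), (110, 40, 180, 72)]

class IconCaptioner:  # stand-in for the fine-tuned caption model
    def describe(self, screenshot, box):
        return "search button"

def parse_screenshot(screenshot, detector, captioner):
    return [Element(b, captioner.describe(screenshot, b))
            for b in detector.detect(screenshot)]

def build_prompt(goal, elements):
    # The action model grounds its next action in numbered elements
    # instead of raw pixels.
    listing = "\n".join(f"[{i}] {e.caption} @ {e.box}"
                        for i, e in enumerate(elements))
    return f"Goal: {goal}\nElements:\n{listing}\nNext action:"

elements = parse_screenshot(None, IconDetector(), IconCaptioner())
print(build_prompt("open settings", elements))
```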
@_akhaliq@x.good.news
ByteDance announces Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent t.co/NawVKWZptn
@_akhaliq@x.good.news
Stable Audio Open

Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon.
@_akhaliq@x.good.news
E5-V: Universal Embeddings with Multimodal Large Language Models

Multimodal large language models (MLLMs) have shown promising advancements in general visual and language understanding. However, the representation of multimodal information using MLLMs remains largely unexplored. In this work, we introduce a new framework, E5-V, designed to adapt MLLMs for achieving universal multimodal embeddings. Our findings highlight the significant potential of MLLMs in representing multimodal inputs compared to previous approaches. By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs, demonstrating strong performance in multimodal embeddings even without fine-tuning.

We propose a single-modality training approach for E5-V, where the model is trained exclusively on text pairs. This method demonstrates significant improvements over traditional multimodal training on image-text pairs, while reducing training costs by approximately 95%. Additionally, this approach eliminates the need for costly multimodal training data collection. Extensive experiments across four types of tasks demonstrate the effectiveness of E5-V. As a universal multimodal model, E5-V not only achieves but often surpasses state-of-the-art performance in each task, despite being trained on a single modality.
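
The prompt-based recipe is worth seeing in miniature: ask the model to compress an input into one word and read the final hidden state as the embedding. The sketch below uses GPT-2 as a stand-in model and an illustrative prompt; it follows the spirit of the abstract, not E5-V's exact recipe.

```python
# Hedged sketch of prompt-based embedding extraction from an LM:
# the last hidden state of the last token serves as the embedding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # stand-in for an MLLM
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def embed(text: str) -> torch.Tensor:
    prompt = f"{text}\nSummary of the above sentence in one word:"  # illustrative prompt
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = lm(**ids, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]       # last layer, last token

a = embed("A cat sits on a mat.")
b = embed("A kitten rests on a rug.")
print(torch.cosine_similarity(a, b, dim=0).item())
```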
@_akhaliq@x.good.news
Cat Doing Pottery, Kling AI by u/AndrewJumpen
@_akhaliq@x.good.news
SEED-Story: Multimodal Long Story Generation with Large Language Model

With the remarkable advancements in image generation and open-form text generation, the creation of interleaved image-text content has become an increasingly intriguing field. Multimodal story generation, characterized by producing narrative texts and vivid images in an interleaved manner, has emerged as a valuable and practical task with broad applications. However, this task poses significant challenges, as it necessitates the comprehension of the complex interplay between texts and images, and the ability to generate long sequences of coherent, contextually relevant texts and visuals.

In this work, we propose SEED-Story, a novel method that leverages a Multimodal Large Language Model (MLLM) to generate extended multimodal stories. Our model, built upon the powerful comprehension capability of the MLLM, predicts text tokens as well as visual tokens, which are subsequently processed with an adapted visual de-tokenizer to produce images with consistent characters and styles. We further propose a multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (only 10 for training) in a highly efficient autoregressive manner. Additionally, we present a large-scale and high-resolution dataset named StoryStream for training our model and quantitatively evaluating the task of multimodal story generation in various aspects.
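
The attention-sink idea the abstract invokes is simple to illustrate: keep the first few "sink" positions plus a recent window in the KV cache, so generation can run far past the training length. This is the generic textual form of the trick, not SEED-Story's multimodal variant.

```python
# Generic attention-sink KV eviction: keep the sinks and the most
# recent `window` positions, drop the middle.
import torch

def evict_kv(keys: torch.Tensor, values: torch.Tensor,
             n_sink: int = 4, window: int = 1024):
    """keys/values: (batch, heads, seq, dim)."""
    seq = keys.shape[2]
    if seq <= n_sink + window:
        return keys, values
    idx = torch.cat([torch.arange(n_sink),
                     torch.arange(seq - window, seq)])
    return keys[:, :, idx], values[:, :, idx]

k = torch.randn(1, 8, 5000, 64)
v = torch.randn(1, 8, 5000, 64)
k2, v2 = evict_kv(k, v)
print(k2.shape)  # torch.Size([1, 8, 1028, 64])
```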
@_akhaliq@x.good.news
The Prompt Report: A Systematic Survey of Prompting Techniques

Generative Artificial Intelligence (GenAI) systems are being increasingly deployed across all parts of industry and research settings. Developers and end users interact with these systems through the use of prompting …
@_akhaliq@x.good.news
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity …
@_akhaliq@x.good.news
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback

Diffusion-based text-to-video (T2V) models have achieved significant success but continue to be hampered by the slow sampling speed of their iterative sampling processes. …
@_akhaliq@x.good.news
Yuan 2.0-M32: Mixture of Experts with Attention Router

Yuan 2.0-M32, with a base architecture similar to Yuan 2.0-2B, uses a mixture-of-experts architecture with 32 experts, of which 2 are active. A new router network, Attention Router, is proposed and adopted.
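
One plausible reading of an attention-style router, sketched below: the token forms a query, learned expert embeddings form keys, and the attention scores pick the top-k experts. This is an illustration of the general idea, not the paper's exact formulation.

```python
# Hedged sketch of an attention-style MoE router (32 experts, 2 active).
import torch
import torch.nn as nn

class AttentionRouter(nn.Module):
    def __init__(self, d_model: int, n_experts: int = 32, k: int = 2):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.expert_keys = nn.Parameter(torch.randn(n_experts, d_model))
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model) -> scaled dot-product scores over experts
        scores = self.q(x) @ self.expert_keys.T / x.shape[-1] ** 0.5
        weights, idx = torch.topk(scores.softmax(dim=-1), self.k, dim=-1)
        return weights / weights.sum(-1, keepdim=True), idx  # renormalized gates

router = AttentionRouter(d_model=64)
gates, experts = router(torch.randn(10, 64))
print(experts.shape)  # torch.Size([10, 2]) -> 2 active experts per token
```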
@_akhaliq@x.good.news
Look Once to Hear: Target Speech Hearing with Noisy Examples

In crowded settings, the human brain can focus on speech from a target speaker, given prior knowledge of how they sound. We introduce a novel intelligent hearable system that achieves this capability, enabling target speech hearing that ignores all interfering speech and noise except the target speaker's. A naive approach is to require a clean speech example to enroll the target speaker. However, this is not well aligned with the hearable application domain, since obtaining a clean example is challenging in real-world scenarios, creating a unique user interface problem. We present the first enrollment interface where the wearer looks at the target speaker for a few seconds to capture a single, short, highly noisy, binaural example of the target speaker. This noisy example is used for enrollment and subsequent speech extraction in the presence of interfering speakers and noise.

Our system achieves a signal quality improvement of 7.01 dB using less than 5 seconds of noisy enrollment audio and can process 8 ms audio chunks in 6.24 ms on an embedded CPU. Our user studies demonstrate generalization to real-world static and mobile speakers in previously unseen indoor and outdoor multipath environments. Finally, our enrollment interface for noisy examples does not cause performance degradation compared to clean examples, while being convenient and user-friendly. Taking a step back, this paper takes an important step towards enhancing human auditory perception with artificial intelligence.
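
The real-time structure the numbers imply is: enroll a speaker embedding once from a noisy binaural clip, then extract the target from successive 8 ms chunks faster than they arrive. A hedged sketch, where `enroll_net` and `extract_net` are hypothetical stand-ins for the system's networks:

```python
# Hedged sketch of enroll-once, extract-per-chunk target speech hearing.
import numpy as np

SR = 16_000
CHUNK = int(0.008 * SR)  # 8 ms of audio = 128 samples at 16 kHz

def enroll_net(binaural_clip: np.ndarray) -> np.ndarray:
    # Stand-in: a network would map the noisy example to a speaker embedding.
    return binaural_clip.mean(axis=1)[:64]

def extract_net(chunk: np.ndarray, embedding: np.ndarray) -> np.ndarray:
    # Stand-in: a causal network conditioned on the embedding would
    # suppress everything but the target speaker.
    return chunk

enrollment = np.random.randn(2, 5 * SR)   # <5 s noisy binaural example
embedding = enroll_net(enrollment.T)

stream = np.random.randn(2, 10 * SR)      # incoming binaural audio
for start in range(0, stream.shape[1] - CHUNK, CHUNK):
    out = extract_net(stream[:, start:start + CHUNK], embedding)
    # each iteration must finish in <8 ms (the paper reports 6.24 ms on CPU)
```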
@_akhaliq@x.good.news
Google presents CAT3D: Create Anything in 3D with Multi-View Diffusion Models

Advances in 3D reconstruction have enabled high-quality 3D capture, but require a user to collect hundreds to thousands of images to create a 3D scene. We present CAT3D, a method for creating …
@_akhaliq@x.good.news
Coin3D: Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning

As humans, we aspire to create media content that is both freely willed and readily controlled. Thanks to the prominent development of generative techniques, we now can easily utilize …
@_akhaliq@x.good.news
Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory

Increasing the size of a Transformer model does not always lead to enhanced performance. This phenomenon cannot be explained by the empirical scaling laws. Furthermore, improved generalization …
@_akhaliq@x.good.news
What matters when building vision-language models?

The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance.

To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.
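
Since the model is released, a hedged usage sketch may help. The class and checkpoint names follow the Idefics2 model card and should be treated as assumptions; the image URL is a placeholder.

```python
# Hedged sketch of querying the released Idefics2-8B checkpoint via
# Hugging Face `transformers` (names per the model card).
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceM4/idefics2-8b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, device_map="auto")

# Placeholder image URL; substitute any RGB image.
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in one sentence."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```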
@_akhaliq@x.good.news
Meta presents Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding

We present LayerSkip, an end-to-end solution to speed up inference of large language models (LLMs). First, during training we apply layer dropout, with low dropout rates for …
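
The self-speculative idea is: draft tokens by exiting at an early layer, then verify the draft with the remaining layers of the same model, so no separate draft model is needed. A toy, runnable sketch under stated assumptions (a stand-in network, greedy accept/reject; not Meta's implementation):

```python
# Hedged sketch of early-exit drafting + same-model verification.
import torch
import torch.nn as nn

class ToyLM(nn.Module):
    def __init__(self, vocab=100, d=32, layers=8, exit_at=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.blocks = nn.ModuleList(nn.Linear(d, d) for _ in range(layers))
        self.head = nn.Linear(d, vocab)   # one exit head shared by all depths
        self.exit_at = exit_at

    def forward(self, ids, early=False):
        h = self.emb(ids)
        depth = self.exit_at if early else len(self.blocks)
        for blk in self.blocks[:depth]:
            h = torch.relu(blk(h))
        return self.head(h)

def speculative_step(model, ids, n_draft=4):
    # 1) cheap draft: greedy tokens from the early exit
    draft = ids
    for _ in range(n_draft):
        nxt = model(draft, early=True)[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, nxt], dim=1)
    # 2) verify: keep drafted tokens only while the full model agrees
    full = model(draft)[:, :-1].argmax(-1)
    keep = ids.shape[1]
    while keep < draft.shape[1] and full[0, keep - 1] == draft[0, keep]:
        keep += 1
    return draft[:, :keep]

model = ToyLM()
print(speculative_step(model, torch.tensor([[1, 2, 3]])).shape)
```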
@_akhaliq@x.good.news
DiJiang: Efficient Large Language Models through Compact Kernelization

In an effort to reduce the computational load of Transformers, research on linear attention has gained significant momentum. However, the improvement strategies for attention mechanisms typically …
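
For context, the linear-attention kernel trick the post alludes to: with a feature map φ, attention becomes φ(Q)(φ(K)ᵀV), linear rather than quadratic in sequence length. The sketch below is the standard construction (Katharopoulos-style, non-causal), not DiJiang's specific frequency-domain kernelization.

```python
# Standard linear attention via the kernel trick: O(n d^2) instead of O(n^2 d).
import torch

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, seq, dim); phi = elu(x) + 1 keeps features positive
    phi_q = torch.nn.functional.elu(q) + 1
    phi_k = torch.nn.functional.elu(k) + 1
    kv = torch.einsum("bsd,bse->bde", phi_k, v)            # (dim, dim) summary
    z = 1 / (torch.einsum("bsd,bd->bs", phi_q, phi_k.sum(1)) + eps)
    return torch.einsum("bsd,bde,bs->bse", phi_q, kv, z)   # normalized output

q = k = v = torch.randn(2, 512, 64)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 512, 64])
```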
@_akhaliq@x.good.news
AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation

In this study, we propose AniPortrait, a novel framework for generating high-quality animation driven by audio and a reference portrait image. Our methodology is divided into two stages. Initially, we …
@_akhaliq@x.good.news
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our approach …
@_akhaliq@x.good.news
Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation

Video outpainting is a challenging task, aiming at generating video content outside the viewport of the input video while maintaining inter-frame and intra-frame consistency. Existing methods …