If There's Intelligent Life Out There



Optimizing LLMs to excel at specific tests backfires for Meta and Stability AI.






Hugging Face has released its second LLM leaderboard to rank the best language models it has tested. The new leaderboard aims to be a more challenging, uniform standard for testing open large language model (LLM) performance across a variety of tasks. Alibaba's Qwen models appear dominant in the leaderboard's inaugural rankings, taking three spots in the top 10.


Pumped to announce the brand new open LLM leaderboard. We burned 300 H100 to re-run new evaluations like MMLU-Pro for all major open LLMs! Some learnings: - Qwen 72B is the king and Chinese open models are dominating overall - Previous evaluations have become too easy for recent ... June 26, 2024


Hugging Face's second leaderboard tests language models across four tasks: knowledge testing, reasoning over extremely long contexts, complex math abilities, and instruction following. Six benchmarks are used to evaluate these qualities, with tests that include solving 1,000-word murder mysteries, explaining PhD-level questions in layman's terms, and, most daunting of all, high-school math equations. A full breakdown of the benchmarks used can be found on Hugging Face's blog.
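

For readers who want to approximate the new scores locally, here is a minimal sketch, assuming EleutherAI's lm-evaluation-harness (the framework the leaderboard runs on) and a machine with a capable GPU. The task names and model ID below are illustrative assumptions, not the leaderboard's exact configuration.

# Sketch: run two leaderboard-style benchmarks against an open model.
# Requires: pip install lm-eval transformers accelerate
# Task names follow the harness's "leaderboard_*" group; check `lm_eval --tasks list`
# to see which names your installed version actually ships.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2-7B-Instruct",  # illustrative model choice
    tasks=["leaderboard_mmlu_pro", "leaderboard_ifeval"],
)
print(results["results"])                            # per-task scores as a dict

Numbers produced this way will only roughly track the official board, since Hugging Face pins its own harness version and generation settings for the published runs.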


The frontrunner of the new leaderboard is Qwen, Alibaba's LLM, which takes 1st, 3rd, and 10th place with its handful of variants. Also appearing are Llama3-70B, Meta's LLM, and a handful of smaller open-source projects that managed to outperform the pack. Notably absent is any sign of ChatGPT; Hugging Face's leaderboard does not test closed-source models, to ensure reproducibility of results.
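

Because the board only covers open-weight models, anyone with the hardware can pull a top entrant and try it themselves. Below is a minimal sketch, assuming the Hugging Face transformers library and a smaller Qwen2 instruct variant (the 72B leaderboard entries need multi-GPU hardware); the model ID is an assumption chosen for illustration.

# Sketch: load an open-weight leaderboard model and generate a reply locally.
# Requires: pip install transformers accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"  # illustrative; swap in the variant you want to test
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Why does over-fitting to benchmarks hurt real-world LLM performance?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))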


Tests to qualify for the leaderboard are run exclusively on Hugging Face's own computers, which, according to CEO Clem Delangue's Twitter, are powered by 300 Nvidia H100 GPUs. Because of Hugging Face's open-source and collaborative nature, anyone is free to submit new models for testing and admission to the leaderboard, with a new voting system prioritizing popular new entries for testing. The leaderboard can be filtered to show only a highlighted selection of significant models, to avoid a confusing glut of small LLMs.


As a pillar of the LLM space, Hugging Face has become a trusted source for LLM learning and community collaboration. After its first leaderboard was released last year as a way to compare and reproduce testing results from several established LLMs, the board quickly exploded in popularity. Getting high ranks on the board became the goal of many developers, small and large, and as models have become generally stronger, 'smarter,' and optimized for the specific tests of the first leaderboard, its results have become less and less meaningful, hence the creation of a second version.


Some LLMs, including newer versions of Meta's Llama, severely underperformed on the new leaderboard compared to their high marks on the first. This stemmed from a trend of over-training LLMs only on the first leaderboard's benchmarks, leading to regressions in real-world performance. This regression in performance, thanks to hyperspecific and self-referential data, follows a trend of AI performance growing worse over time, proving once again, as Google's AI answers have shown, that LLM performance is only as good as its training data and that true artificial "intelligence" is still many, many years away.




Dallin Grimm is a contributing writer for Tom's Hardware. He has been building and breaking computers since 2017, serving as the resident youngster at Tom's. From APUs to RGB, Dallin has a handle on all the latest tech news.




bit_user:
LLM performance is only as good as its training data, and true artificial "intelligence" is still many, many years away.
First, this statement discounts the role of network architecture.


Second, intelligence isn't a binary thing - it's more like a spectrum. There are different classes of cognitive tasks and capabilities you might be familiar with if you study child development or animal intelligence.


The definition of "intelligence" cannot be whether something processes information exactly like humans do, otherwise the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently likewise need not necessarily do so, either.


jp7189:
I don't enjoy the click-bait China vs. the world title. The fact is Qwen is open source, open weights, and can be run anywhere. It can (and already has been) fine-tuned to add/remove bias. I applaud Hugging Face's work to produce standardized tests for LLMs, and for putting the focus on open source, open weights first.


jp7189:
bit_user said:
First, this statement discounts the role of network architecture.


Second, intelligence isn't a binary thing - it's more like a spectrum. There are different classes of cognitive tasks and capabilities you might be familiar with if you study child development or animal intelligence.


The definition of "intelligence" cannot be whether something processes information exactly like humans do, otherwise the search for extraterrestrial intelligence would be completely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently likewise needn't necessarily do so, either.
We're creating tools to help humans, therefore I would argue LLMs are more helpful if we grade them by human intelligence standards.

