My AI is bigger than yours

It's fresh news that DeepSeek, a Chinese startup, released a reasoning LLM comparable in performance to the best offering from OpenAI while spending just $6 million to train it. All the math and technical details the company leveraged to build the model are in the whitepaper you can find on arXiv.

=> The R1 paper.

Given the public paper, and the fact that the model, along with all of its weights, is freely available on GitHub, I would have expected OpenAI to shut up and work harder to quickly release a model comparable in efficiency.

Instead, they decided to cry wolf by saying:

We are investigating a potential unauthorized use of ChatGPT data by DeepSeek

-- OpenAI and their buddies at Microsoft.

I laughed hard because, I believe, OpenAI doesn't have grounds to call ChatGPT's output their "intellectual property". To justify my opinion, we need to reason through a couple of points:

It's my personal opinion that, given the points detailed below, OpenAI can't claim that the LLM output is their property, because they haven't trained it with their own original material in the first place. This is even clearer when we take into account that Microsoft and OpenAI deem downloading copyrighted material to train their LLMs as "fair use". I have no direct issue with their bots scraping, nor do I have issues with my copyrighted blog being used to train their models. However, I have many more issues with their cavalier attitude toward the lack of proper attribution to the open source programmers whose code was scraped to train ChatGPT. By their own reasoning, isn't ChatGPT's output, obtained through a chat session, material available on the internet one way or another?

An LLM output is a strict derivative of its training set.

An LLM is unable to apply the knowledge acquired from a book to create something completely novel without examples. It can try, and we could argue that humans also require some training to solve a specific set of problems. However, an LLM cannot, by construction, have the creativity to generate something totally new. This is fine given what an LLM is all about: we are, after all, talking about a piece of code moving inside a multi-dimensional space, working hard to find the best words based on the context provided through the chat messages as input.

=> AI: so what happens now?

You can make that space as big as you like, with as many dimensions as you wish; nonetheless, the space the LLM can move in is still finite, and it is strictly defined by the size of its training set, by the number of dimensions, and by the amount of context it can handle while reading input.
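
As a toy illustration of that finite space (a deliberately minimal sketch, nothing like a real transformer — the vocabulary and scores below are invented for the example), the model can only ever recombine what its fixed vocabulary and trained weights contain:

```python
# Toy illustration: the model's whole "world" is a fixed, finite
# vocabulary plus a fixed scoring table learned from training data.
# Nothing outside this table can ever be produced.
vocab = ["cat", "dog", "sat", "mat", "the", "on"]

# Hypothetical learned association scores between the last context word
# and candidate next words (a stand-in for billions of trained weights).
weights = {
    ("the", "cat"): 2.0, ("the", "dog"): 1.5, ("the", "mat"): 1.2,
    ("cat", "sat"): 2.5, ("sat", "on"): 2.2, ("on", "the"): 2.0,
}

def next_word(context):
    """Greedily pick the highest-scoring next word given the context."""
    last = context[-1]
    scores = {w: weights.get((last, w), 0.0) for w in vocab}
    return max(scores, key=scores.get)

sentence = ["the"]
for _ in range(4):
    sentence.append(next_word(sentence))
print(" ".join(sentence))  # prints "the cat sat on the"
```

However clever the scoring gets, the output is always a path through a space entirely determined by the training material — which is the point about derivative work.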

OpenAI hasn't created any original material to train ChatGPT

This is the second and, IMHO, bigger reason why OpenAI's accusations don't hold water.

As far as we know, all the material used by OpenAI in its models is somebody else's copyrighted work. What OpenAI owns are the weights applied to that material in order to train the model, and those are not publicly visible or accessible from the internet. If the material OpenAI took from the internet falls under "fair use", and if OpenAI hasn't added its own original material to ChatGPT's dataset, it's safe to say again that ChatGPT's output can't be considered OpenAI's property and, by my interpretation, it's also fair use if distilled. The matter would have been different if part of, or better, the majority of the training set came from original content created by OpenAI, but this isn't the case.

This is as clear as fresh water if you consider that you can't train a model with your proprietary weights alone: you need a defined training set. If the weights are proprietary, but the material used is other people's copyrighted work and none of it is yours, your model's output is not your IP either.

If, as suspected, DeepSeek distilled material from ChatGPT to train R1, I feel they had every right to do so, no matter what OpenAI's ToS claim. ChatGPT is on the internet and easily accessible, which means the derived work represented by its output is fair game for anybody to use to train other models, whether by distillation or any other method.
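
For context, distillation in this sense is conceptually simple: query the "teacher" model, collect its answers, and fine-tune a "student" on those pairs. A minimal sketch of the data-collection half — the `ask_teacher` function below is a canned placeholder, not any real provider API, and the prompts are made up for illustration:

```python
import json

# Hypothetical stand-in for querying a hosted "teacher" model;
# a real pipeline would call the provider's chat API here.
def ask_teacher(prompt):
    canned = {"What is 2+2?": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "I don't know.")

def build_distillation_set(prompts):
    """Collect (prompt, teacher answer) pairs to fine-tune a student on."""
    return [{"prompt": p, "completion": ask_teacher(p)} for p in prompts]

dataset = build_distillation_set(["What is 2+2?", "Capital of France?"])
print(json.dumps(dataset, indent=2))
```

The resulting JSON records are exactly the kind of "derived work" the dispute is about: output of the teacher, repackaged as training data for somebody else's model.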

You live by the sword, you die by the sword.

Arriving at the end of all these considerations, I feel it's safe to say that OpenAI themselves created the conditions for this alleged "stealing" to happen (AH!). Neither Microsoft nor OpenAI have made public, so far, any evidence to prove the allegations they made against DeepSeek. We have their word for it, with nothing substantial to back their claims. The fact that Microsoft is now offering DeepSeek as an LLM on Azure smells, to me, like the previous claims are bullshit.

=> DeepSeek R1 is now available on Azure AI Foundry and GitHub

I want to be kind to the OpenAI kids and think that, at some point, they will show the evidence of DeepSeek distilling "their" models' output. My reaction would be: "So what? Too bad you don't have the rights to the derived work coming from your models". I would then proceed to open the fridge, grab a beer and move on with some popcorn, waiting for OpenAI's next excuse for being beaten hard at math. I believe (and this is a thought shared by others) that OpenAI and Microsoft were caught with their pants down and tried to shift the narrative to the usual "The Chinese are only able to steal!", while what DeepSeek actually did was employ better mathematicians than the American ones.

As others have said, throwing money at the problem works until somebody, out of necessity, comes along and does things more efficiently. And here we aren't talking about a small improvement: DeepSeek released a model comparable in capabilities to the best OpenAI offerings while spending 1/30th of the money to train it. Achieving such a feat is not in the "we did a bit better" camp; it's literally a kick in the ass the Americans weren't expecting to get. What DeepSeek demonstrated is that OpenAI is being run in a completely incompetent way, where their scientists look like amateurs compared to the Chinese team, and OpenAI doesn't have an answer ready. The most humiliating thing, IMHO, is that DeepSeek released EVERYTHING in the open, while OpenAI stopped doing so a long, long time ago.

"We will build a much better model", claims the clown in charge (Altman, to be clear). They will be able to do so if their mathematicians are even able to compete at the game DeepSeek started (or other non-american researchers for that matter) and even if they will, I doubt DeepSeek & Co. will stand still, considering the competence they demonstrated.

Energy stocks took a hard hit after the release of R1, and now, in the US, it's a matter of seeing whether shareholders will believe what Altman and his sycophants say to reassure them about the need for "more compute power". There are hundreds of billions on the hook now, and OpenAI could take a huge blow if they don't show something meaningful in short order. It's clear the American way of brute-forcing LLM training is doomed and, I believe, the game will soon shift toward creating models that are cheaper to train.

Thinking about it, it wouldn't surprise me in the slightest if that failed moron Altman pushed the OpenAI teams to publish a new model literally based on R1. Cheating and lying are his normal way of life, and he could claim, thanks to OpenAI's secrecy and with a straight face, that his company discovered something new. He couldn't reveal it, of course, because "intellectual property".

I think Wall Street is too stupid to notice, and the morons in charge will continue to send checks to OpenAI based on the hype.

=> SoftBank in talks to lead OpenAI funding round at $300 billion valuation, sources say

Better if the hype is all American, of course, since the other scientists around the world are only able to steal from their "exceptionalism".