I made a program that uses machine learning to recommend you modules from q store

oloy · June 4, 2023, 12:43pm

Edit: site is live here: https://sc-langchain-with-frontend.onrender.com/
while using the site, try to follow prompt recommendations in the “read me”. the results u get depend heavily on ur prompts.

Hey guys, i just finished uni so i’m free. I thought i’d make something as a gift to sub club for releasing the awesome product “Index gate: ultimate programmer”.

This program uses similarity search to recommend module according to the given input (query). Some screenshots are posted below. I’ll make a rudimentary website on which u guys can interact with this program tomorrow (if nothing pops up in my life). Let me know what u guys think about it.

Deadpool · June 4, 2023, 12:50pm

Good job, Zeus approved.

Detective_L · June 4, 2023, 1:31pm

This is so cool!

SaintSovereign · June 4, 2023, 3:45pm

While we appreciate this and will not try to prevent anyone from using this, please note that we are already working on a version of the same idea.

It was going to be a surprise release for later this year, but I’m announcing it now so it wouldn’t appear as if we took the idea, lol

SaintSovereign · June 4, 2023, 3:52pm

Let’s not discuss the SubClub app on this thread. Leave this to @oloy’s outstanding contribution. I am beyond happy that we have individuals wanting to create tools like this for the community. You have our thanks.

Detective_L · June 4, 2023, 3:53pm

Can we have this kind of program to help choose a major title as well? That would be great.

SaintSovereign · June 4, 2023, 3:58pm

Ask me on my journal, please. Let’s keep this thread exclusive to what he’s building.

Detective_L · June 4, 2023, 4:36pm

Sure.

SaintSpring · June 4, 2023, 10:46pm

amazing job, brother!
how can I try it for my personal uses?

oloy · June 5, 2023, 2:07am

The plan was to make a website today, however my family member is going through medical procedure so it might take a while (as i need to stay wid him). I’ll do it as soon as i can.

Malkuth · June 5, 2023, 2:45am

May it be a smooth, successful procedure.

Monarch · June 5, 2023, 2:47am

@oloy Oh wow this is really cool. Thanks for taking your time and making this

I love seeing shit like this in the community.

oloy · June 5, 2023, 3:15am

@Malkuth Thank you brother.
@Monarch thank you as well.

prmrs · June 5, 2023, 7:12am

That’s awesome @oloy. Great idea. If you want to play around this more think about generating a narrative and a title for a set of selected Q cores. Basically a “marketing page” for your custom, with objectives and all, in the style of SaintSovereign

Which model are you using for this?

oloy · June 5, 2023, 7:36am

I’m just storing all the module’s description inside pinecone vector database and then running similarity search on the database according to the provided query.

Abundance · June 5, 2023, 3:13pm

the amount of time ive spent reading and studying the different modules to figure out ideal customs is a lot. i don’t think very many ppl would want to invest so much time into making the best custom for themselves. but with this it could be a game changer. it can get a little to complicated to a frustrating level, trying to perfect the exact custom for myself. so many choices but whats best for me.

im curious to know if it ramps up sales. i bet significantly. the ppl who dont want to do the thinking and analyzing themselves might start making customs.

oloy · June 6, 2023, 5:53am

I have bad news guys. Turns out its not completely free. Everytime u pass a query, it needs to convert it into vector (which is used by computer to find similar modules). This conversion-to-vector costs $0.0016 / 750 words. Its very small but its not zero.

One possible solution is everyone must use their own Openai api key (there will be a place to input the api key in the website). Openai gives $5 free credits.

My thinking is i’ll release the website using my own api key then after the credits are finished, i’ll make it so that each user must input their own api key.

emperor_obewan · June 6, 2023, 7:22am

True LLM style stuff is difficult anyway. I’ve uncovered a video on doing what approximates LLM in Rust using GPT-Neo and BERT

Unfortunately the downloads for the pre-trained module are 10GB, and on non GPU optimised systems using transformers in this way takes a lot of resources and memory. In Python it would probably be even slower due to the lack of optimisation and extra system call and memory overhead. This is the reason places like Open AI can charge an arm and a leg for API calls, because you need really good hardware and its hard to scale… yet!

I like the idea of using something like Pine Cone though, that’s a great way to trade off complexity for speed. I think it would require some pretty extensive training to get contextual word embeddings/vectors that were consistently able to search all the possible variations on a phrase. One compromise you could make would be training a model yourself to create the embeddings/vectors and storing them in an ordinary database that doesn’t require those sorts of charges for overheads. But its only a thought and would probably be quite an endeavour. I seem to recall spaCy with the large model was pretty good for generating embeddings, but still uses a lot of CPU.

oloy · June 6, 2023, 7:31am

I was thinking of learning NLP and semantic search but index gate urged me to get the product out first before trying to optimize it.

emperor_obewan · June 6, 2023, 7:57am

Yeah and its a really good start, technically you don’t even need to go to the level of LLM to have something useful if you have enough pre-trained word embeddings that catch all variations.

That’s basically what Pine Cone is doing, word embeddings are vectors which result from transforming text into vectors and then doing dimensional reduction on them through a neural network with CNNs/LSTMs. They’re kind of a side effect, a very useful side effect that is a big part of what allows NLP to work in the way that it does. Word vectors store not just a representation of the concept but its context within the corpus it was trained on, and that is what allows you do virtually do mathematical operations to determine similarity or relationship to other words.

Of course in order to have enough context to work with the corpus it is trained on needs to be fairly large. It’s part of the reason I haven’t worked with NLP much yet in my own experimentation; to get enough of a corpus in a form you can use you normally have to get into stuff like web scraping and cleaning the data you use, removing punctuation, optional stemming and all that good stuff

The thing that makes using word embeddings somewhat slow is that in order to search for similar terms, you basically get the dot product of two vectors, and if the product approaches the zero vector, the words are similar (EDIT: I got that mixed up, sorry, see Dot product for similarity in word to vector computation in NLP - Data Science Stack Exchange). Doing that across an entire set of word embeddings would be slow, and limited by the resources you have available to iterate over the DataFrame of your vocabulary of word embeddings. And word embeddings of multiple words length causes your vocabulary to grow really quickly, if you’re trying to train word embeddings for multiple word phrases.

That being said, I’m sure someone has probably already come up with a solution to that scalability problem. If I was to try and attack the problem myself, I’d probably try to look at what VQGAN did to solve the scalability issue with image generation. Similar principles would probably apply, maybe even the literature for LLMs might have something.

Good luck playing with it further! It’s certainly an interesting problem to tackle.

EDIT again: It seems what gives Pinecone and other vector databases the ability to search similarity efficiently is a ranking system based on nearest neighbour, basically very similar to pre-computing a bunch of vector products and sorting them based on the computed proximity to one another, in terms of their direction/normal. It shouldn’t be too difficult to store items that way in a traditional database; but vector databases are sure handy for cutting out a lot of the compute time!