Continuing to try to update daily as I approach my deadline for having something tangible finished.
Today was majorly frustrating, and Bard was a big part of that frustration. Let’s start with the fact that it again gave me several long, involved GitHub issue pages which it claimed backed up the utter bullshit it was telling me. Then, when I took the time to read these long, involved GitHub pages, it turned out they said nothing like what it claimed.
My day really began with two problems which should (should!!) have been easy to figure out. The first: determining the size of one of Polars’ built-in data types. You would think that determining the size of a data type, especially if you know it is numeric, should be a simple thing. But noooooooo! The implementation of Polars’ type system is sorely, sorely lacking. The DataType enum tells you basically nothing whatsoever about the datatype, other than whether it’s a logical or numeric dtype (and there are several other dtypes which are neither). The size of some dtypes is implied in the name (e.g. UInt32), but converting the type name to a string and doing string wrangling is a piss-poor implementation. Google hilariously wanted me to use a 20-line match statement with hard-coded values to determine the size of the type. No thank you, Google! No thank you.
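For comparison, a lookup like the one described above does not have to hard-code the numbers. Here is a minimal sketch, assuming a stand-in enum: NumKind and byte_width are made up for illustration and are not Polars’ DataType or any Polars function. The byte widths come from std::mem::size_of rather than magic numbers.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for the numeric variants of a dtype enum.
/// This is NOT Polars' `DataType`; it only mirrors a few variant names.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NumKind {
    Int8,
    Int16,
    Int32,
    Int64,
    UInt8,
    UInt16,
    UInt32,
    UInt64,
    Float32,
    Float64,
    /// Anything without a fixed width (strings, lists, and so on).
    Other,
}

/// Byte width of a fixed-width numeric kind, or `None` if there is no fixed width.
pub fn byte_width(kind: NumKind) -> Option<usize> {
    use NumKind::*;
    Some(match kind {
        Int8 => size_of::<i8>(),
        Int16 => size_of::<i16>(),
        Int32 => size_of::<i32>(),
        Int64 => size_of::<i64>(),
        UInt8 => size_of::<u8>(),
        UInt16 => size_of::<u16>(),
        UInt32 => size_of::<u32>(),
        UInt64 => size_of::<u64>(),
        Float32 => size_of::<f32>(),
        Float64 => size_of::<f64>(),
        // No fixed width: bail out of the function with `None`.
        Other => return None,
    })
}
```

With this sketch, byte_width(NumKind::UInt32) gives Some(4) and byte_width(NumKind::Other) gives None. It is still a match, so it does not fix the underlying complaint that the real enum offers no such method out of the box.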
Then, there’s the categorical data types. You would think that categorical data types, being such an important, vital part of any data science system, would be extensively documented in Polars, replete with examples and function documentation. Instead (especially on Google, but strangely not on DuckDuckGo and other search engines… gee I wonder why? Maybe because Google wants to be the king of data science?) there is a severe lack of results when you search for solutions using appropriate keywords.
Eventually, pissed off by Google’s deliberate lack of results, I tried searching DuckDuckGo, which gave me much better results. It talked about setting up a categorical data type by casting the mutable column within the dataframe to a Categorical(Option<Arc<RevMapping>>), where RevMapping was one of those poorly named types which tell you absolutely nothing about what it does unless you know where to look.
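As far as I can tell, the name refers to a reverse mapping: a categorical column stores small integer codes, and the mapping turns each code back into its string. Below is a minimal sketch of that idea in plain Rust. The names (Categories, encode, decode) are hypothetical and the layout is an assumption for illustration; it is not how Polars implements it internally.

```rust
use std::collections::HashMap;

/// Hypothetical dictionary encoder; illustrates the idea only, not Polars' internals.
#[derive(Debug, Default)]
pub struct Categories {
    /// Forward map: string value -> integer code.
    codes: HashMap<String, u32>,
    /// Reverse map: the code is the index, the element is the string value.
    values: Vec<String>,
}

impl Categories {
    /// Returns the code for `value`, assigning the next free code on first sight.
    pub fn encode(&mut self, value: &str) -> u32 {
        if let Some(&code) = self.codes.get(value) {
            return code;
        }
        let code = self.values.len() as u32;
        self.codes.insert(value.to_string(), code);
        self.values.push(value.to_string());
        code
    }

    /// Looks a code back up; `None` if that code was never assigned.
    pub fn decode(&self, code: u32) -> Option<&str> {
        self.values.get(code as usize).map(String::as_str)
    }
}
```

Feeding "M", "F", "M", "U" through encode on a fresh Categories yields the codes 0, 1, 0, 2, and decode(1) returns Some("F"). The column then only needs to hold the integer codes, with each distinct string stored once.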
Moreover, Bard kept telling me you can use Categorical(None), which technically sounds like it should be a special case of Option<Arc<RevMapping>> (since None is one of the two variants of Option), and that it infers the categories based on the data. But given how many times it flat out lied to me today, right now Google is like the boy who cried wolf to me. Even if something it said to me was 100% accurate, if it could not back up its statements with 100% factual, verifiable links which I can read and confirm, I would not believe it. Because it’s been wrong that many times before.
So at the moment, all of the data imported into my program is either numerical or string. And all that lovely VAERS data which records ‘Y’ or ‘N’ as a response from someone instead of, oh I don’t know, using real data types like boolean? All of that, even the dreaded ‘M’/‘F’/‘U’ (male, female, unknown) for gender, is hopelessly inefficient strings.
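What I would rather have can be sketched in a few lines of plain Rust, outside of Polars entirely. The names here (Sex, parse_flag, parse_sex) are made up for illustration, and treating anything other than the expected codes as missing or unknown is my assumption, not something the data format promises.

```rust
/// One-byte representation of the 'M'/'F'/'U' codes (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum Sex {
    Male,
    Female,
    Unknown,
}

/// Parses a 'Y'/'N' flag. Assumption: anything else, including blank, is missing.
pub fn parse_flag(raw: &str) -> Option<bool> {
    match raw.trim() {
        "Y" => Some(true),
        "N" => Some(false),
        _ => None,
    }
}

/// Parses an 'M'/'F'/'U' code. Assumption: unrecognised input maps to `Unknown`.
pub fn parse_sex(raw: &str) -> Sex {
    match raw.trim() {
        "M" => Sex::Male,
        "F" => Sex::Female,
        _ => Sex::Unknown,
    }
}
```

So parse_flag("Y") gives Some(true), parse_flag("") gives None, and parse_sex("F") gives Sex::Female, with each Sex value taking a single byte instead of a heap-allocated string.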
I’m a little closer to having an implementation now than I was 8 hours ago, but I’m incredibly angry that I had to waste so much time to get a smidgen closer to a working implementation. It should not be this difficult! They should have way better documentation for a library which is financially very well supported and has been well established for years. At the very least, some third party should have written some cogent, thoughtful articles on the API. Ah well… screw it. More work later… for now, I’m done with this frustration.