How to handle NaN values when writing to Parquet in Go?

Issue

I am trying to write to a Parquet file in Go, and the data I write can contain NaN values. Since NaN is defined neither in the primitive types nor in the logical types, how do I handle this value in Go? Does any existing schema work for it?

I am using the Parquet Go library from here. An example of writing to Parquet with a JSON schema using this library can be found here.

Solution

The issue was discussed at length in xitongsys/parquet-go issue 281, with the recommendation being to

use OPTIONAL type.
Even if you don't assign a value (like in your code), the non-pointer value will be assigned a default value.
So parquet-go doesn't know whether it's null or a default value.

However:

What it comes down to is that I cannot use the OPTIONAL type; in other words, I cannot convert my structure to use pointers.
I have tried to use repetitiontype=OPTIONAL as a tag, but this leads to some weird behavior.
I would expect that tag to behave the same way as the omitempty tag in the Go standard library, i.e. if the value is not present then it is not put into the JSON.

This matters because if a field is missing or not set when it is encoded to Parquet, there is no way of telling whether the value was 0 or just not set, in the case of int64.

This illustrates the issue:

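package main

import (
    "encoding/json"
    "log"
    "os"
)

// Salary holds pay components; the omitempty tag applies to all three fields.
type Salary struct {
    Basic, HRA, TA float64 `json:",omitempty"`
}

// Employee omits unset strings and slices, but Age has no omitempty tag.
type Employee struct {
    FirstName, LastName, Email string `json:",omitempty"`
    Age                        int
    MonthlySalary              []Salary `json:",omitempty"`
}

func main() {
    // Only Email and one Basic salary are set; all other fields keep their zero value.
    data := Employee{
        Email: "mark@gmail.com",
        MonthlySalary: []Salary{
            {
                Basic: 15000.00,
            },
        },
    }

    file, err := json.MarshalIndent(data, "", " ")
    if err != nil {
        log.Fatal(err)
    }

    // os.WriteFile replaces the deprecated ioutil.WriteFile (Go 1.16+).
    if err := os.WriteFile("test.json", file, 0o644); err != nil {
        log.Fatal(err)
    }
}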

with a JSON produced as:

{
 "Email": "mark@gmail.com",
 "Age": 0,
 "MonthlySalary": [
  {
   "Basic": 15000
  }
 ]
}

As you can see, the items in the struct that have the omitempty tag and are not assigned do not appear in the JSON, i.e. FirstName, LastName, HRA and TA.
On the other hand, Age does not have this tag and hence is still included in the JSON.

This is problematic because all fields in the struct are assigned memory when this Go library writes to Parquet, so a big struct that is only sparsely populated will still take the full amount of memory.
It is a bigger problem when the file is read again, as there is no way of knowing whether the value that was put in the Parquet file was the empty value or was just not assigned.

I am happy to help implement an omitempty tag for this library if I can convince you of the value of having it.

That echoes issue 403 "No option to omitempty when not using pointers".
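The ambiguity described in the quoted issue can be reproduced with the standard library alone. The sketch below uses encoding/json rather than parquet-go, and the ValueAge, PointerAge, and encode names are hypothetical, chosen only for illustration. It shows that a non-pointer field cannot tell "unset" from "zero", while a pointer field can.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ValueAge uses a plain int: unset and 0 are the same value.
type ValueAge struct {
	Age int `json:",omitempty"`
}

// PointerAge uses *int: nil means unset, a pointer to 0 means an explicit zero.
type PointerAge struct {
	Age *int `json:",omitempty"`
}

// encode marshals v to a JSON string, or returns the error text.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	zero := 0
	fmt.Println(encode(ValueAge{}))             // {}
	fmt.Println(encode(ValueAge{Age: 0}))       // {} (explicit zero is lost)
	fmt.Println(encode(PointerAge{}))           // {}
	fmt.Println(encode(PointerAge{Age: &zero})) // {"Age":0}
}
```

This is the trade-off behind both issues: pointers preserve the distinction, at the cost of changing the struct definition.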

Answered By – VonC

Answer Checked By – Marilyn (GoLangFix Volunteer)
