Conversation
To avoid replacements like MONTANA -> MONTA, the substitution is done per parsed token rather than on the raw text.
|
+1. This will be very helpful.
|
@petro-rudenko Since this can easily be done with a transformation, I prefer leaving it as it is rather than adding yet another option to spark-csv. |
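For context, a minimal sketch of the kind of post-load transformation being suggested, assuming the Spark 1.x reader API, a hypothetical input file `data.csv`, and "NA" as the null marker; this is not spark-csv's own code:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, when}

val sc = new SparkContext(new SparkConf().setAppName("null-markers"))
val sqlContext = new SQLContext(sc)

// Load with spark-csv; without a schema, every column is read as a string.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("data.csv") // hypothetical input path

// Replace the "NA" marker with real nulls, one column at a time.
// This runs on parsed tokens, so a value like MONTANA is never touched.
val cleaned = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, when(col(c) === "NA", null).otherwise(col(c)))
}
```

Because this operates on the DataFrame after parsing, it avoids the substring problem, but it does force everything through a second pass over the data.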
|
Spark-csv accepts a sqlContext and a path to files, so a pre-parse transformation is only possible by saving to a new file, which is not efficient for big files. Also, the replacement is done on a token basis (after the CSV parser has parsed the data). If CSV parsing were done on the client side, there would be no need to use spark-csv.
|
It'd be nice if this were another option. E.g., in my application we have decided to standardize on parsing empty-string fields as nulls rather than empty strings.
|
I was working on the same and several other options; see https://github.com/databricks/spark-csv/pull/94/files
|
Please look at pull request #113.
|
One reason to want this over client-side processing: seeing as a user-provided schema can already tell us whether a column is nullable or not, it would be nice if we could also say what the null values will actually look like in the data.
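To illustrate that idea (a sketch under the Spark 1.x API, reusing the `sqlContext` from the earlier snippet; the column names and the commented-out `nullValue` option are hypothetical, showing the request rather than an existing spark-csv feature):

```scala
import org.apache.spark.sql.types._

// The user-provided schema already declares, per column, whether nulls may occur.
val schema = StructType(Seq(
  StructField("id",    IntegerType, nullable = false),
  StructField("state", StringType,  nullable = true)
))

val withSchema = sqlContext.read
  .format("com.databricks.spark.csv")
  .schema(schema)
  .option("header", "true")
  // The requested extension would sit naturally next to the schema, e.g.:
  // .option("nullValue", "NA")
  .load("data.csv")
```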
|
+1. I have some data that was generated using R, where nulls are encoded as "NA". Currently I run another job that converts "NA" to "", but it would be nice if there were an option to specify how null values are encoded. All CSV parsers I know of have such an option.
There are datasets where each column has its own marker for missing values, while spark-csv assumes missing values are always empty strings. To avoid additional data transformation and saving on the user's side, it would be great to be able to specify a set of null markers and have the library replace them with empty strings.
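A rough sketch of that per-column idea as a client-side helper; the column names and markers below are invented for illustration, and the helper maps markers straight to nulls after loading everything as strings, which is the effect the comment asks the library to provide at parse time:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

// Hypothetical per-column missing-value markers; a real dataset supplies its own.
val nullMarkers = Map("state" -> "NA", "score" -> "-999")

// Fold over the marker map, nulling out the matching token in each column.
def replaceMarkers(df: DataFrame, markers: Map[String, String]): DataFrame =
  markers.foldLeft(df) { case (acc, (column, marker)) =>
    acc.withColumn(column, when(col(column) === marker, null).otherwise(col(column)))
  }

val cleaned = replaceMarkers(df, nullMarkers)
```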