Re: reading csv file from null value

2015-10-26 Thread Maximilian Michels
As far as I know, null support was removed from the Table API because
it was not consistently supported across all operations. See
https://issues.apache.org/jira/browse/FLINK-2236

>> On Thu, Oct 22, 2015 at 4:30 PM, Philip Lee  wrote:
>>
>>> Hi,
>>>
>>> I am trying to load a dataset in which some fields are null (empty) by
>>> using readCsvFile().
>>>
>>> // e.g. _date|_click|_sales|_item|_web_page|_user
>>>
>>> case class WebClick(_click_date: Long, _click_time: Long, _sales: Int,
>>>                     _item: Int, _page: Int, _user: Int)
>>>
>>> private def getWebClickDataSet(env: ExecutionEnvironment): DataSet[WebClick] = {
>>>   env.readCsvFile[WebClick](
>>>     webClickPath,
>>>     fieldDelimiter = "|",
>>>     includedFields = Array(0, 1, 2, 3, 4, 5)
>>>     // lenient = true
>>>   )
>>> }
>>>
>>> I know there is an option to ignore malformed values, but I have to read
>>> the whole dataset even though it contains null values.
>>>
>>> For example, a row whose third column is null looks like this:
>>> 37794|24669||16705|23|54810
>>> I need to read the null value as well, because I have to apply a filter or
>>> where function on it ( _sales == null ).
>>>
>>> Is there a suggested way to do this?
>>>
>>> Thanks,
>>> Philip
>>>
>>> --
>>>
>>> ==
>>>
>>> *Hae Joon Lee*
>>>
>>>
>>> Now, in Germany,
>>>
>>> M.S. Candidate, Interested in Distributed System, Iterative Processing
>>>
>>> Dept. of Computer Science, Informatik in German, TUB
>>>
>>> Technical University of Berlin
>>>
>>>
>>> In Korea,
>>>
>>> M.S. Candidate, Computer Architecture Laboratory
>>>
>>> Dept. of Computer Science, KAIST
>>>
>>>
>>> Rm# 4414 CS Dept. KAIST
>>>
>>> 373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)
>>>
>>>
>>> Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea
>>>
>>> ==
>>>
>>
>>
>


Re: reading csv file from null value

2015-10-26 Thread Philip Lee
Thanks for your reply.

What if I do not use the Table API?
The error happens when using just env.readCsvFile().

I heard that using RowSerializer would handle the null value, but I get a
TypeInformation error when the data is converted.



Re: reading csv file from null value

2015-10-24 Thread Philip Lee
Also, following Shiti's suggestion, we could use RowSerializer to handle
the null value, right?

I tried it in several ways, but it still did not work.
Could you give an example of it, based on the previous email?



On Sat, Oct 24, 2015 at 11:19 PM, Philip Lee  wrote:

> Maximilian said that handling the null value as a String would be
> acceptable.
> In fact, though, readCsvFile() still cannot accept the null value; it fails
> with a "Row too short" error.
>
> case class WebClick(click_date: String, click_time: String, user: String,
>                     item: String)
>
> private def getWebClickDataSet(env: ExecutionEnvironment): DataSet[WebClick] = {
>   env.readCsvFile[WebClick](
>     webClickPath,
>     fieldDelimiter = "|",
>     includedFields = Array(0, 1, 3, 5)
>     // lenient = true
>   )
> }
>
> // e.g. 36890|26789|0|3725|20|85457
> // e.g. _date|_click|_sales|_item|_web_page|_user
>
> Caused by: org.apache.flink.api.common.io.ParseException: Row too short:
> 36890|4749||13183|29|
> at org.apache.flink.api.common.io.GenericCsvInputFormat.parseRecord(GenericCsvInputFormat.java:383)
> at org.apache.flink.api.scala.operators.ScalaCsvInputFormat.readRecord(ScalaCsvInputFormat.java:214)
> at org.apache.flink.api.common.io.DelimitedInputFormat.nextRecord(DelimitedInputFormat.java:454)
> at org.apache.flink.api.scala.operators.ScalaCsvInputFormat.nextRecord(ScalaCsvInputFormat.java:182)
> at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:176)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
> at java.lang.Thread.run(Thread.java:745)
>
> Is there any suggestion?
>


Re: reading csv file from null value

2015-10-23 Thread Maximilian Michels
Hi Philip,

How about making the empty field of type String? Then you can read the CSV
into a DataSet and treat the empty string as a null value. Not very nice
but a workaround. As of now, Flink deliberately doesn't support null values.
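
A minimal sketch of this workaround, in plain Scala with no Flink dependency: each line is split into String fields and the empty _sales field is mapped to None rather than null. The names WebClickParser, optInt and parse, and the camel-cased field names, are hypothetical and not part of any Flink API. In a job, the same function could be applied in a map() over lines read as text; that wiring is an assumption and is not shown here.

```scala
// Hypothetical record type: sales is optional because that column may be empty.
case class WebClick(clickDate: Long, clickTime: Long, sales: Option[Int],
                    item: Int, page: Int, user: Int)

object WebClickParser {
  // Map an empty (or whitespace-only) field to None, anything else to Some(Int).
  def optInt(field: String): Option[Int] =
    if (field.trim.isEmpty) None else Some(field.trim.toInt)

  // Parse one pipe-delimited line. The limit -1 keeps trailing empty fields,
  // which String.split would otherwise drop.
  def parse(line: String): WebClick = {
    val f = line.split("\\|", -1)
    require(f.length == 6, s"expected 6 fields, got ${f.length}: $line")
    WebClick(f(0).toLong, f(1).toLong, optInt(f(2)),
             f(3).toInt, f(4).toInt, f(5).toInt)
  }
}
```

With this, a check such as `WebClickParser.parse(line).sales.isEmpty` plays the role of the `_sales == null` filter from the original question. Only the _sales column is treated as optional in this sketch; other columns that can be empty would need the same treatment.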

Regards,
Max


Re: reading csv file from null value

2015-10-23 Thread Shiti Saxena
For a similar problem where we wanted to preserve and track null entries,
we load the CSV as a DataSet[Array[Object]] and then transform it into
DataSet[Row] using a custom RowSerializer
(https://gist.github.com/Shiti/d0572c089cc08654019c) which handles null.

The Table API (which supports null) can then be used on the resulting
DataSet[Row].
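
As a rough illustration of the first step only (turning a line into an array of objects where null marks a missing entry), here is a plain Scala sketch. It does not use Flink's Row type or the RowSerializer from the gist; NullableRows, cell and toRow are hypothetical names, and treating every column as Int is an assumption made for brevity.

```scala
object NullableRows {
  // Hypothetical helper: an empty field becomes null, anything else a boxed Int.
  def cell(field: String): Any =
    if (field.isEmpty) null else field.toInt

  // Split a pipe-delimited line into one value per column. The limit -1
  // keeps trailing empty fields instead of dropping them.
  def toRow(line: String): Array[Any] =
    line.split("\\|", -1).map(cell)
}
```

Rows built this way can be filtered on a null column, for example `rows.filter(r => r(2) == null)` for the _sales column of the sample data.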

