Re: Possible bug involving Vectors with a single element

2016-05-27 Thread Yanbo Liang
A Spark MLlib Vector only supports data of double type, so it is reasonable
to throw an exception when you create a Vector with an element of unicode type.
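
As a minimal sketch of the point above (this is not MLlib code), one way to
avoid the exception is to convert values to Python floats before they reach
Vectors.dense. The helper name as_doubles is hypothetical, and the sketch
assumes Python 3, where str plays the role of Python 2's unicode.

```python
def as_doubles(values):
    """Return a list of Python floats built from a scalar or an iterable.

    Hypothetical helper for illustration; not part of pyspark.
    """
    # A string is iterable, so treat it as one scalar value instead of
    # walking over it character by character.
    if isinstance(values, (str, int, float)):
        values = [values]
    # float() raises ValueError for text that is not a number, so bad
    # input fails here rather than later, when the vector is used.
    return [float(v) for v in values]


print(as_doubles("0.1"))          # [0.1]
print(as_doubles(["1", 2, 3.5]))  # [1.0, 2.0, 3.5]
```

The resulting list could then be passed on, e.g.
Vectors.dense(as_doubles(u"0.1")), so that the vector is built from a
one-element list of doubles rather than from a text value.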

2016-05-24 7:27 GMT-07:00 flyinggip <myflying...@hotmail.com>:

> Hi there,
>
> I have noticed what might be a bug in pyspark.mllib.linalg.Vectors when
> dealing with a vector with a single element.
>
> Firstly, the 'dense' method says it can also take a numpy.array. However,
> the code uses 'if len(elements) == 1', and when a numpy.array has only one
> element its length is undefined, so calling dense() on a numpy array with
> one element currently crashes the program. The check should probably use
> size instead of len().
>
> Secondly, after I managed to create a dense-Vectors object with only one
> element from unicode, it seems that its behaviour is unpredictable. For
> example,
>
> Vectors.dense(unicode("0.1"))
>
> will report an error.
>
> dense_vec = Vectors.dense(unicode("0.1"))
>
> will NOT report any error until you run
>
> dense_vec
>
> to check its value. The following, however, creates a DataFrame
> successfully:
>
> mylist = [(0, Vectors.dense(unicode("0.1")))]
> myrdd = sc.parallelize(mylist)
> mydf = sqlContext.createDataFrame(myrdd, ["X", "Y"])
>
> However if the above unicode value is read from a text file (e.g., a csv
> file with 2 columns) then the DataFrame column corresponding to "Y" will be
> EMPTY:
>
> raw_data = sc.textFile(filename)
> split_data = raw_data.map(lambda line: line.split(','))
> parsed_data = split_data.map(
>     lambda line: (int(line[0]), Vectors.dense(line[1])))
> mydf = sqlContext.createDataFrame(parsed_data, ["X", "Y"])
>
> It would be great if someone could share some ideas. Thanks a lot.
>
> f.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Possible-bug-involving-Vectors-with-a-single-element-tp27013.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Possible bug involving Vectors with a single element

2016-05-24 Thread flyinggip
Hi there, 

I have noticed what might be a bug in pyspark.mllib.linalg.Vectors when
dealing with a vector with a single element.

Firstly, the 'dense' method says it can also take a numpy.array. However,
the code uses 'if len(elements) == 1', and when a numpy.array has only one
element its length is undefined, so calling dense() on a numpy array with
one element currently crashes the program. The check should probably use
size instead of len().

Secondly, after I managed to create a dense-Vectors object with only one
element from unicode, it seems that its behaviour is unpredictable. For
example, 

Vectors.dense(unicode("0.1"))

will report an error. 

dense_vec = Vectors.dense(unicode("0.1"))

will NOT report any error until you run 

dense_vec

to check its value. The following, however, creates a DataFrame
successfully:

mylist = [(0, Vectors.dense(unicode("0.1")))]
myrdd = sc.parallelize(mylist)
mydf = sqlContext.createDataFrame(myrdd, ["X", "Y"])

However if the above unicode value is read from a text file (e.g., a csv
file with 2 columns) then the DataFrame column corresponding to "Y" will be
EMPTY: 

raw_data = sc.textFile(filename)
split_data = raw_data.map(lambda line: line.split(','))
parsed_data = split_data.map(
    lambda line: (int(line[0]), Vectors.dense(line[1])))
mydf = sqlContext.createDataFrame(parsed_data, ["X", "Y"])
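
For comparison, here is a rough sketch of parsing each line into numeric
values before any vector is built, so that text never reaches Vectors.dense.
The function name parse_line and the strict two-column check are assumptions
made for illustration only; the sketch uses plain Python and no Spark calls.

```python
def parse_line(line, sep=","):
    """Split a two-column text line into (int key, [float value]).

    Hypothetical helper; the column layout is assumed from the
    description of the csv file above.
    """
    fields = line.strip().split(sep)
    if len(fields) != 2:
        raise ValueError("expected 2 columns, got %d: %r" % (len(fields), line))
    # Convert explicitly: int() for the key, float() for the value.
    # The value is wrapped in a list so it is a one-element sequence.
    return int(fields[0]), [float(fields[1])]


rows = [parse_line(text) for text in ["0,0.1", "1,2.5\n"]]
print(rows)  # [(0, [0.1]), (1, [2.5])]
```

In the pipeline above, the one-element list returned for the second column
could then be handed to Vectors.dense inside the map step, in place of the
raw string line[1].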

It would be great if someone could share some ideas. Thanks a lot. 

f. 


