Hi Tobi,
The MLlib RDD-based API does support applying the transformation to both a
single Vector and an RDD of Vectors, but the way you called it is not the
appropriate one.
Suppose you have an RDD with a LabeledPoint in each record; you can refer to
the following code snippet to train a ChiSqSelectorModel and apply the
transformation:
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import ChiSqSelector

data = [LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
        LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
        LabeledPoint(1.0, [0.0, 9.0, 8.0]),
        LabeledPoint(2.0, [8.0, 9.0, 5.0])]
rdd = sc.parallelize(data)
model = ChiSqSelector(1).fit(rdd)
filteredRDD = model.transform(rdd.map(lambda lp: lp.features))
filteredRDD.collect()
However, we strongly recommend migrating to the DataFrame-based API, since
the RDD-based API has been switched to maintenance mode.
Thanks
Yanbo
2016-07-14 13:23 GMT-07:00 Tobi Bosede <[email protected]>:
> Hi everyone,
>
> I am trying to filter my features based on the spark.mllib ChiSqSelector.
>
> filteredData = vectorizedTestPar.map(lambda lp: LabeledPoint(lp.label,
> model.transform(lp.features)))
>
> However when I do the following I get the error below. Is there any other
> way to filter my data to avoid this error?
>
> filteredDataDF=filteredData.toDF()
>
> Exception: It appears that you are attempting to reference SparkContext from
> a broadcast variable, action, or transformation. SparkContext can only be
> used on the driver, not in code that it run on workers. For more information,
> see SPARK-5063.
>
>
> I would directly use the spark.ml ChiSqSelector and work with DataFrames, but
> I am on Spark 1.4 and using PySpark, so spark.ml's ChiSqSelector is not
> available to me. filteredData is of type PipelinedRDD, if that helps; it is
> not a regular RDD. I think that may be part of why calling toDF() is not working.
>
>
> Thanks,
>
> Tobi
>
>