Re: Crossvalidator after fit

2017-05-05 Thread Bryan Cutler
Looks like there might be a problem with the way you specified your
parameter values; you probably have an integer value where there should be a
floating-point one.  Double-check that, and if there is still a problem please
share the rest of your code so we can see how you defined "gridS".
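
For instance, RandomForestClassifier's subsamplingRate is a Double param, so a
grid that feeds it plain Python ints would trigger exactly this kind of
ClassCastException. A hypothetical sketch of what a problematic "gridS" and its
fix might look like (the real grid wasn't shared, so the param names and values
here are assumptions):

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import ParamGridBuilder

rf = RandomForestClassifier(labelCol="Labeld", featuresCol="features")

# .addGrid(rf.subsamplingRate, [1]) would fail with
# "java.lang.Integer cannot be cast to java.lang.Double";
# Double-typed params need float values, Int-typed params are fine with ints.
gridS = (ParamGridBuilder()
         .addGrid(rf.numTrees, [20, 50])            # Int param: ints are fine
         .addGrid(rf.subsamplingRate, [0.8, 1.0])   # Double param: use floats
         .build())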

On Fri, May 5, 2017 at 7:40 AM, issues solution 
wrote:

> Hi, I get the following error after trying to perform
> grid search and cross-validation on a random forest estimator for classification
>
> rf = RandomForestClassifier(labelCol="Labeld",featuresCol="features")
>
> evaluator =  BinaryClassificationEvaluator(metricName="F1 Score")
>
> rf_cv = CrossValidator(estimator=rf, 
> estimatorParamMaps=gridS,evaluator=evaluator,numFolds=5)
> (trainingData, testData) = transformed13.randomSplit([0.7, 0.3])
> rfmodel  =  rf_cv.fit(trainingData)
> Py4JJavaError                             Traceback (most recent call last)
>  in ()
> ----> 1 rfmodel  =  rf_cv.fit(trainingData)
>
> /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/pipeline.py in fit(self, dataset, params)
>      67             return self.copy(params)._fit(dataset)
>      68         else:
> ---> 69             return self._fit(dataset)
>      70     else:
>      71         raise ValueError("Params must be either a param map or a list/tuple of param maps, "
>
> /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/tuning.py in _fit(self, dataset)
>     237         train = df.filter(~condition)
>     238         for j in range(numModels):
> --> 239             model = est.fit(train, epm[j])
>     240             # TODO: duplicate evaluator to take extra params from input
>     241             metric = eva.evaluate(model.transform(validation, epm[j]))
>
> /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/pipeline.py in fit(self, dataset, params)
>      65     elif isinstance(params, dict):
>      66         if params:
> ---> 67             return self.copy(params)._fit(dataset)
>      68         else:
>      69             return self._fit(dataset)
>
> /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/wrapper.py in _fit(self, dataset)
>     131
>     132     def _fit(self, dataset):
> --> 133         java_model = self._fit_java(dataset)
>     134         return self._create_model(java_model)
>     135
>
> /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/wrapper.py in _fit_java(self, dataset)
>     127         :return: fitted Java model
>     128         """
> --> 129         self._transfer_params_to_java()
>     130         return self._java_obj.fit(dataset._jdf)
>     131
>
> /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/wrapper.py in _transfer_params_to_java(self)
>      80         for param in self.params:
>      81             if param in paramMap:
> ---> 82                 pair = self._make_java_param_pair(param, paramMap[param])
>      83                 self._java_obj.set(pair)
>      84
>
> /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/wrapper.py in _make_java_param_pair(self, param, value)
>      71         java_param = self._java_obj.getParam(param.name)
>      72         java_value = _py2java(sc, value)
> ---> 73         return java_param.w(java_value)
>      74
>      75     def _transfer_params_to_java(self):
>
> /opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
>     811         answer = self.gateway_client.send_command(command)
>     812         return_value = get_return_value(
> --> 813             answer, self.gateway_client, self.target_id, self.name)
>     814
>     815         for temp_arg in temp_args:
>
> /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>      43     def deco(*a, **kw):
>      44         try:
> ---> 45             return f(*a, **kw)
>      46         except py4j.protocol.Py4JJavaError as e:
>      47             s = e.java_exception.toString()
>
> /opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
>     306             raise Py4JJavaError(
>     307                 "An error occurred while calling {0}{1}{2}.\n".
> --> 308                 format(target_id, ".", name), value)
>     309         else:
>     310             raise Py4JError(
>
> Py4JJavaError: An error occurred while calling o91602.w.
> : java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> java.lang.Double
>   at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
>   at org.apache.spark.ml.param.DoubleParam.w(params.scala:225)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at 

Re: Structured Streaming + initialState

2017-05-05 Thread Tathagata Das
Can you explain how your initial state is stored? Is it in a file, or in a
database?
If it's in a database, then when you initialize the GroupState, you can fetch
it from the database.

On Fri, May 5, 2017 at 7:35 AM, Patrick McGloin 
wrote:

> Hi all,
>
> With Spark Structured Streaming, is there a possibility to set an "initial
> state" for a query?
>
> Using a join between a streaming Dataset and a static Dataset does not
> support full joins.
>
> Using mapGroupsWithState to create a GroupState does not support an
> initialState (as the Spark Streaming StateSpec did).
>
> Are there any plans to add support for initial states?  Or is there
> already a way to do so?
>
> Best regards,
> Patrick
>


is Spark Application code dependent on which mode we run?

2017-05-05 Thread kant kodali
Hi All,

Does the rdd.collect() call work in client mode but not in cluster mode? If
so, is there a way for the application to know which mode it is running in?
It looks like in cluster mode we don't need to call rdd.collect(); instead
we can just call rdd.first() or whatever.
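
For reference, a minimal PySpark sketch of one way an application can check
which deploy mode it was submitted in, by reading the spark.submit.deployMode
entry that spark-submit sets (the app name below is a placeholder):

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("which-mode"))

# "client" or "cluster", depending on how spark-submit launched the app;
# falls back to "client" if the key was never set.
mode = sc.getConf().get("spark.submit.deployMode", "client")
print("Running in %s mode" % mode)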

Thanks!


Re: Spark books

2017-05-05 Thread Jacek Laskowski
Thanks Stephen! I appreciate it very much.

And yeah...Stephen is right on this. Go and read the notes and let me know
where you're missing things :-)

P.S. Holden has just announced that her book is complete, and I think Matei is
also quite far along with his writing.

Jacek

On 4 May 2017 2:52 a.m., "Stephen Fletcher" 
wrote:

> Zeming,
>
> Jacek also has a really good online spark book for spark 2, "mastering
> spark". I found it very helpful when trying to understand spark 2's
> encoders.
>
> his book is here:
> https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details
>
>
> On Wed, May 3, 2017 at 8:16 PM, Neelesh Salian 
> wrote:
>
>> The Apache Spark documentation is good to begin with.
>> All the programming guides, particularly.
>>
>>
>> On Wed, May 3, 2017 at 5:07 PM, ayan guha  wrote:
>>
>>> I would suggest not buying any book; just start with the Databricks
>>> Community Edition
>>>
>>> On Thu, May 4, 2017 at 9:30 AM, Tobi Bosede  wrote:
>>>
 Well that is the nature of technology, ever evolving. There will always
 be new concepts. If you're trying to get started ASAP and the internet
 isn't enough, I'd recommend buying a book and using Spark 1.6. A lot of
 production stacks are still on that version and the knowledge from
 mastering 1.6 is transferable to 2+. I think that beats waiting forever.

 On Wed, May 3, 2017 at 6:35 PM, Zeming Yu  wrote:

> I'm trying to decide whether to buy a book (Learning Spark, Spark for
> Machine Learning, etc.) or wait for a new edition covering the new concepts
> like DataFrames and Datasets. Anyone got any suggestions?
>


>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>>
>> --
>> Regards,
>> Neelesh S. Salian
>>
>>
>


Re: [Spark Streaming] Dynamic Broadcast Variable Update

2017-05-05 Thread Pierce Lamb
Hi Nipun,

To expand a bit, you might find this stackoverflow answer useful:

http://stackoverflow.com/a/39753976/3723346

Most Spark + database combinations can handle a use case like this.
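
A rough PySpark sketch of that pattern for a DStream job like Nipun's: instead
of re-broadcasting a model, re-read it from an external table at the start of
every micro-batch (the JDBC URL, table/column names and socket source are
placeholders, not a tested recipe):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="model-refresh-sketch")
sqlContext = SQLContext(sc)
ssc = StreamingContext(sc, 10)

def score_batch(rdd):
    if rdd.isEmpty():
        return
    # Re-read the current model parameters on the driver; any update written
    # to the table between micro-batches is picked up automatically.
    model_df = sqlContext.read.jdbc(url="jdbc:postgresql://host/db",   # placeholder
                                    table="anomaly_model")             # placeholder
    threshold = model_df.first()["threshold"]   # assumes a 'threshold' column
    flagged = rdd.filter(lambda line: float(line) > threshold)
    flagged.foreach(lambda line: None)          # replace with a real sink

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
lines.foreachRDD(score_batch)
ssc.start()
ssc.awaitTermination()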

Hope this helps,

Pierce

On Thu, May 4, 2017 at 9:18 AM, Gene Pang  wrote:

> As Tim pointed out, Alluxio (renamed from Tachyon) may be able to help
> you. Here is some documentation on how to run Alluxio and Spark together,
> and here is a blog post on a Spark Streaming + Alluxio use case.
>
> Hope that helps,
> Gene
>
> On Tue, May 2, 2017 at 11:56 AM, Nipun Arora 
> wrote:
>
>> Hi All,
>>
>> To support our Spark Streaming based anomaly detection tool, we have made
>> a patch in Spark 1.6.2 to dynamically update broadcast variables.
>>
>> I'll first explain our use-case, which I believe should be common to
>> several people using Spark Streaming applications. Broadcast variables are
>> often used to store values such as machine learning models, which can then be
>> used on streaming data to "test" and get the desired results (in our case,
>> anomalies). Unfortunately, in current Spark, broadcast variables are
>> final and can only be initialized once, before the initialization of the
>> streaming context. Hence, if a new model is learned, the streaming system
>> cannot be updated without shutting down the application, broadcasting
>> again, and restarting the application. Our goal was to re-broadcast
>> variables without requiring a downtime of the streaming service.
>>
>> The key to this implementation is a live re-broadcastVariable()
>> interface, which can be triggered in between micro-batch executions,
>> without any reboot required for the streaming application. At a high level,
>> the task is done by re-fetching broadcast variable information from the
>> Spark driver and then re-distributing it to the workers. The micro-batch
>> execution is blocked while the update is made, by taking a lock on the
>> execution. We have already tested this in our prototype deployment of our
>> anomaly detection service and can successfully re-broadcast the broadcast
>> variables with no downtime.
>>
>> We would like to integrate these changes into Spark. Can anyone please let
>> me know the process for submitting patches / new features to Spark? Also, I
>> understand that the current version of Spark is 2.1; however, our changes
>> have been done and tested on Spark 1.6.2. Will this be a problem?
>>
>> Thanks
>> Nipun
>>
>
>


Re: Where is release 2.1.1?

2017-05-05 Thread darren


Thanks. It looks like they posted the release just now because it wasn't 
showing before.




Get Outlook for Android

On Fri, May 5, 2017 at 11:04 AM -0400, "Jules Damji"  wrote:

Go to this link: http://spark.apache.org/downloads.html

Cheers,
Jules

Sent from my iPhone. Pardon the dumb thumb typos :)
On May 5, 2017, at 7:40 AM, dar...@ontrenet.com wrote:

Hi

Website says it is released. Where can it be downloaded?

Thanks

Get Outlook for Android


how to get assertDataFrameEquals ignore nullable

2017-05-05 Thread A Shaikh
As part of TDD, I am using com.holdenkarau.spark.testing.DatasetSuiteBase to
assert that two DataFrames' values are equal, using


assertDataFrameEquals(dataframe1, dataframe2)

Although the values are the same, the assertion fails because the nullable
property does not match for some columns. Is there a way to get
assertDataFrameEquals to ignore the nullable property?

Also, can we extend that to ignore data types as well and just match
the values?
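
One library-agnostic workaround is to normalize the nullable flags on both
DataFrames before asserting. A PySpark sketch of the idea (the Scala equivalent
is analogous; this only touches top-level fields, so nested structs would need
a recursive variant):

from pyspark.sql.types import StructField, StructType

def with_all_nullable(spark, df, nullable=True):
    # Rebuild the DataFrame with every top-level field forced to the same
    # nullability so the schema comparison no longer trips on it.
    schema = StructType(
        [StructField(f.name, f.dataType, nullable) for f in df.schema.fields])
    return spark.createDataFrame(df.rdd, schema)

# assertDataFrameEquals(with_all_nullable(spark, dataframe1),
#                       with_all_nullable(spark, dataframe2))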


Thanks,
Afzal


Where is release 2.1.1?

2017-05-05 Thread darren


Hi


Website says it is released. Where can it be downloaded?


Thanks




Get Outlook for Android







Crossvalidator after fit

2017-05-05 Thread issues solution
Hi, I get the following error after trying to perform
grid search and cross-validation on a random forest estimator for classification

rf = RandomForestClassifier(labelCol="Labeld",featuresCol="features")

evaluator =  BinaryClassificationEvaluator(metricName="F1 Score")

rf_cv = CrossValidator(estimator=rf,
estimatorParamMaps=gridS,evaluator=evaluator,numFolds=5)
(trainingData, testData) = transformed13.randomSplit([0.7, 0.3])
rfmodel  =  rf_cv.fit(trainingData)
Py4JJavaError                             Traceback (most recent call last)
 in ()
----> 1 rfmodel  =  rf_cv.fit(trainingData)

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/pipeline.py in fit(self, dataset, params)
     67             return self.copy(params)._fit(dataset)
     68         else:
---> 69             return self._fit(dataset)
     70     else:
     71         raise ValueError("Params must be either a param map or a list/tuple of param maps, "

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/tuning.py in _fit(self, dataset)
    237         train = df.filter(~condition)
    238         for j in range(numModels):
--> 239             model = est.fit(train, epm[j])
    240             # TODO: duplicate evaluator to take extra params from input
    241             metric = eva.evaluate(model.transform(validation, epm[j]))

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/pipeline.py in fit(self, dataset, params)
     65     elif isinstance(params, dict):
     66         if params:
---> 67             return self.copy(params)._fit(dataset)
     68         else:
     69             return self._fit(dataset)

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/wrapper.py in _fit(self, dataset)
    131
    132     def _fit(self, dataset):
--> 133         java_model = self._fit_java(dataset)
    134         return self._create_model(java_model)
    135

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/wrapper.py in _fit_java(self, dataset)
    127         :return: fitted Java model
    128         """
--> 129         self._transfer_params_to_java()
    130         return self._java_obj.fit(dataset._jdf)
    131

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/wrapper.py in _transfer_params_to_java(self)
     80         for param in self.params:
     81             if param in paramMap:
---> 82                 pair = self._make_java_param_pair(param, paramMap[param])
     83                 self._java_obj.set(pair)
     84

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/wrapper.py in _make_java_param_pair(self, param, value)
     71         java_param = self._java_obj.getParam(param.name)
     72         java_value = _py2java(sc, value)
---> 73         return java_param.w(java_value)
     74
     75     def _transfer_params_to_java(self):

/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814
    815         for temp_arg in temp_args:

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306             raise Py4JJavaError(
    307                 "An error occurred while calling {0}{1}{2}.\n".
--> 308                 format(target_id, ".", name), value)
    309         else:
    310             raise Py4JError(

Py4JJavaError: An error occurred while calling o91602.w.
: java.lang.ClassCastException: java.lang.Integer cannot be cast to
java.lang.Double
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
at org.apache.spark.ml.param.DoubleParam.w(params.scala:225)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)



Structured Streaming + initialState

2017-05-05 Thread Patrick McGloin
Hi all,

With Spark Structured Streaming, is there a possibility to set an "initial
state" for a query?

Using a join between a streaming Dataset and a static Dataset does not
support full joins.

Using mapGroupsWithState to create a GroupState does not support an
initialState (as the Spark Streaming StateSpec did).

Are there any plans to add support for initial states?  Or is there already
a way to do so?

Best regards,
Patrick


Reading ORC file - fine on 1.6; GC timeout on 2+

2017-05-05 Thread Nick Chammas
I have this ORC file that was generated by a Spark 1.6 program. It opens
fine in Spark 1.6 with 6GB of driver memory, and probably less.

However, when I try to open the same file in Spark 2.0 or 2.1, I get GC
timeout exceptions. And this is with 6, 8, and even 10GB of driver memory.

This is strange and smells like buggy behavior. How can I debug this or
work around it in Spark 2+?

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Reading-ORC-file-fine-on-1-6-GC-timeout-on-2-tp28654.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

hbase + spark + hdfs

2017-05-05 Thread mathieu ferlay
Hi everybody.

I'm totally new to Spark and there is one thing I haven't managed to find
out. I have a full Ambari install with HBase, Hadoop and Spark. My code
reads and writes to HDFS via HBase. Thus, as I understand it, all the data
stored in HDFS is in byte format. Now, I know that it's possible to query
HDFS directly via Spark, but I don't know whether Spark will support the
format of the data that HBase stored there.

 

I know that it's possible to access HBase from Spark, but I want to query
HDFS directly.

 

Could you confirm whether this is possible and tell me how to do it?

Regards,

 

Mathieu FERLAY

R Engineer

GNUBILA/MAAT France
174, Imp. Pres d'en Bas
74370 Argonay (France)


www.gnubila.fr
mfer...@gnubila.fr

PRIVACY DESIGNER


Re: imbalance classe inside RANDOMFOREST CLASSIFIER

2017-05-05 Thread DB Tsai
We have the weighting algorithms implemented in the linear models, but
unfortunately they are not implemented in the tree models. It's an important
feature, and a PR is welcome! Thanks.
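
For the linear-model side, a minimal PySpark sketch of per-row class weighting
via weightCol (assuming Spark 2.x; the toy data and the 10.0 weight are made up):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("class-weight-sketch").getOrCreate()

# Toy imbalanced data: label 1.0 is the rare class.
df = spark.createDataFrame(
    [(0.0, Vectors.dense(0.1, 1.2)),
     (0.0, Vectors.dense(0.3, 0.9)),
     (0.0, Vectors.dense(0.2, 1.1)),
     (1.0, Vectors.dense(2.5, 0.1))],
    ["label", "features"])

# Give the rare class a larger weight (10.0 is just an example ratio).
weighted = df.withColumn(
    "classWeight", F.when(F.col("label") == 1.0, 10.0).otherwise(1.0))

lr = LogisticRegression(labelCol="label", featuresCol="features",
                        weightCol="classWeight")
model = lr.fit(weighted)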

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0x5CED8B896A6BDFA0


On Fri, May 5, 2017 at 12:58 AM, issues solution
 wrote:
> Hi,
> in scikit-learn we have the sample_weight option that allows us to pass an
> array to balance the class categories,
>
> by calling it like this:
>
> rf.fit(X, Y, sample_weight=[10, 10, 10, ..., 1, 1, 10])
>
> I am wondering whether an equivalent exists inside the ml or mllib classes?
> If yes, can I ask for a reference or example?
> Thanks in advance.
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



org.apache.hadoop.fs.FileSystem: Provider tachyon.hadoop.TFS could not be instantiated

2017-05-05 Thread Jone Zhang
When I use Spark SQL, I get the following error:

17/05/05 15:58:44 WARN scheduler.TaskSetManager: Lost task 0.0 in
stage 20.0 (TID 4080, 10.196.143.233):
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem:
Provider tachyon.hadoop.TFS could not be instantiated
at java.util.ServiceLoader.fail(ServiceLoader.java:224)
at java.util.ServiceLoader.access$100(ServiceLoader.java:181)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:377)
at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2558)
at 
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2569)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2586)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:365)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
at 
org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:654)
at 
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:436)
at 
org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:321)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:276)
at 
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at 
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:212)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class
tachyon.hadoop.TFS
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at java.lang.Class.newInstance(Class.java:379)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
... 45 more

17/05/05 15:58:44 INFO cluster.YarnClusterScheduler: Removed TaskSet
20.0, whose tasks have all completed, from pool


I did not use Tachyon directly.


Grateful for any ideas!


imbalance classe inside RANDOMFOREST CLASSIFIER

2017-05-05 Thread issues solution
Hi,
in scikit-learn we have the sample_weight option that allows us to pass an
array to balance the class categories,

by calling it like this:

rf.fit(X, Y, sample_weight=[10, 10, 10, ..., 1, 1, 10])

I am wondering whether an equivalent exists inside the ml or mllib classes?
If yes, can I ask for a reference or example?
Thanks in advance.


Re: map/foreachRDD equivalent for pyspark Structured Streaming

2017-05-05 Thread peay
Hello,

So, I assume there is no way in Structured Streaming to apply/transform based
on a function that takes a dataframe as input and outputs a dataframe?

UDAFs are kind of low-level and require you to implement merge and process
individual rows, AFAIK (and they are not available in pyspark).

Instead, I would like to directly transform the dataframe for a given window
using the powerful high-level API for dataframes, kind of like map for RDDs.

Any idea if something like this could be supported in the future?

Thanks,

 Original Message 
Subject: Re: map/foreachRDD equivalent for pyspark Structured Streaming
Local Time: 3 May 2017 12:05 PM
UTC Time: 3 May 2017 10:05
From: tathagata.das1...@gmail.com
To: peay 
user@spark.apache.org 

You can apply any kind of aggregation on windows. There are some built-in
aggregations (e.g. sum and count), and there is an API for user-defined
aggregations (Scala/Java) that works with both batch and streaming DFs.
See the programming guide if you haven't seen it already:
- windowing - 
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time
- UDAFs on typesafe Dataset (scala/java) - 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.expressions.Aggregator
- UDAFs on generic DataFrames (scala/java) - 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.expressions.UserDefinedAggregateFunction

You can register a UDAF defined in Scala with a name and then call that 
function by name in SQL

- https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html

I feel a combination of these should be sufficient for you. Hope this helps.
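
For reference, a minimal PySpark sketch of such a built-in windowed aggregation
(assuming Spark 2.1+ for withWatermark; the socket source and the event-time
column derived from current_timestamp() are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import count, window

spark = SparkSession.builder.appName("windowed-agg-sketch").getOrCreate()

# Placeholder streaming source; any source with an event-time column works.
events = (spark.readStream
          .format("socket")
          .option("host", "localhost").option("port", 9999)
          .load()
          .selectExpr("current_timestamp() AS eventTime", "value"))

# Count values per 10-minute event-time window, dropping state for data
# arriving more than 20 minutes late.
counts = (events
          .withWatermark("eventTime", "20 minutes")
          .groupBy(window("eventTime", "10 minutes"), "value")
          .agg(count("*").alias("n")))

query = (counts.writeStream
         .outputMode("append")
         .format("console")
         .start())
query.awaitTermination()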

On Wed, May 3, 2017 at 1:51 AM, peay  wrote:
Hello,

I would like to get started on Spark Streaming with a simple window.

I've got some existing Spark code that takes a dataframe, and outputs a 
dataframe. This includes various joins and operations that are not supported by 
structured streaming yet. I am looking to essentially map/apply this on the 
data for each window.

Is there any way to apply a function to a dataframe that would correspond to
each window? This would mean accumulating data until the watermark is reached,
and then mapping the full corresponding dataframe.

I am using pyspark. I've seen the foreach writer, but it seems to operate at 
partition level instead of a full "window dataframe" and is not available for 
Python anyway.

Thanks!