Re: Spark 2.1.1 Graphx graph loader GC overhead error

2017-07-11 Thread Aritra Mandal
yncxcw wrote
> hi,
> 
> I think if the OOM occurs before the computation begins, the input data is
> probably too big to fit in memory. I remember that the graph data expands
> when loading the input into memory, and the scale of the expansion is
> pretty huge (based on my experiments with PageRank).
> 
> 
> Wei  Chen

Hello Wei,

Thanks for the suggestions.

I tried this small piece of code with StorageLevel.MEMORY_AND_DISK; I removed
the Pregel call just to test. But the code still failed with an OOM in the
graph-load stage:

    val ygraph = GraphLoader.edgeListFile(sc, args(1), true, 32,
        StorageLevel.MEMORY_AND_DISK, StorageLevel.MEMORY_AND_DISK)
      .partitionBy(PartitionStrategy.RandomVertexCut)

    println(ygraph.vertices.count())


Is there a way to calculate the maximum size of a graph that a given
configuration of the cluster can process?
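
Something like the rough back-of-envelope estimate below is what I have in
mind; every constant in it is an assumption rather than a measured GraphX
figure:

    // Rough, hypothetical sizing sketch only: all constants below are assumptions,
    // and the real expansion factor depends on attribute types and the partition strategy.
    val numEdges       = 2000000000L  // edges in the input file (assumed)
    val numVertices    = 100000000L   // distinct vertices (assumed)
    val bytesPerEdge   = 40L          // assumed in-memory cost per edge after loading
    val bytesPerVertex = 100L         // assumed in-memory cost per vertex
    val neededBytes    = numEdges * bytesPerEdge + numVertices * bytesPerVertex

    val executors      = 10           // cluster configuration (assumed)
    val executorMemGb  = 8L
    val cacheFraction  = 0.5          // rough share of executor memory usable for caching
    val availableBytes = (executors * executorMemGb * (1L << 30) * cacheFraction).toLong

    println(s"need ~${neededBytes / 1e9} GB, have ~${availableBytes / 1e9} GB for caching")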

Aritra








Limit the number of tasks submitted: spark.submit.tasks.threshold.enabled & spark.submit.tasks.threshold

2017-07-11 Thread 李斌松
Limit the number of tasks submitted, to avoid a single job monopolizing
cluster resources; this also lets you guide users toward setting reasonable job conditions.

[image: inline image 1]
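
Presumably the two properties in the subject line are set like any other Spark
configuration; a minimal sketch of that assumed usage (the properties come from
the attached patch, not stock Spark, and the threshold value is a placeholder):

    // Hypothetical usage of the proposed properties; they are added by the attached
    // patch and are not part of stock Spark. Semantics and values are assumptions.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.submit.tasks.threshold.enabled", "true")  // enable the limit check (assumed)
      .set("spark.submit.tasks.threshold", "10000")          // reject jobs above this task count (assumed)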


spark_submit_tasks_threshold.patch
Description: Binary data


java IllegalStateException: unread block data Exception - setBlockDataMode

2017-07-11 Thread Kanagha
Hi,



I am using Spark 2.0.2. I'm not sure what is causing this error to
occur. Any input would be really helpful; I appreciate any help with
this.


Exception caught: Job aborted due to stage failure: Task 0 in stage
0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
(TID 3, ..): java.lang.IllegalStateException: unread block data
    at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2449)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1385)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:253)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
Exception caught again : org.apache.spark.SparkException: Job aborted
due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent
failure: Lost task 0.3 in stage 0.0 (TID 3, ..):
java.lang.IllegalStateException: unread block data
(same stack trace as above)

Driver stacktrace:


Thanks


Spark streaming does not seem to clear MapPartitionsRDD and ShuffledRDD that are persisted after the use of updateStateByKey and reduceByKeyAndWindow with inverse functions even after checkpointing the data

2017-07-11 Thread SRK
Hi,

Spark Streaming does not seem to clear the MapPartitionsRDD and ShuffledRDD that
are persisted after the use of updateStateByKey and reduceByKeyAndWindow
with inverse functions, even after checkpointing the data. Any idea why this
happens? Is there a way to set a timeout to clear the persisted data after a
while? The cached MapPartitionsRDD and ShuffledRDD are not cleared even after
I explicitly call unpersist and also do the checkpointing.
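
For concreteness, here is a minimal sketch of the pattern in question
(placeholder source, paths, and intervals; not the actual job):

    // Minimal sketch of the pattern described above: placeholder source, checkpoint
    // path and intervals, not the actual production job.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("state-and-window-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")      // placeholder checkpoint dir

    val pairs = ssc.socketTextStream("localhost", 9999)     // placeholder input stream
      .map(line => (line, 1L))

    // Stateful running count per key (state is checkpointed periodically).
    val totals = pairs.updateStateByKey(
      (values: Seq[Long], state: Option[Long]) => Some(state.getOrElse(0L) + values.sum))

    // Windowed count with an inverse function, which also requires checkpointing.
    val windowed = pairs.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,   // add new values entering the window
      (a: Long, b: Long) => a - b,   // subtract values leaving the window (inverse function)
      Seconds(60), Seconds(10))

    totals.print()
    windowed.print()

    ssc.start()
    ssc.awaitTermination()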

Thanks!






DataFrame --- join / groupBy-agg question...

2017-07-11 Thread muthu
I may be having a naive question on join / groupBy-agg. During the days of
RDD, whenever I wanted to perform
a. a groupBy-agg, I used to use reduceByKey (of PairRDDFunctions) with an
optional partition strategy (which is a number of partitions or a Partitioner);
b. a join (of PairRDDFunctions) and its variants, I used to have a way to
provide the number of partitions.

In DataFrame, how do I specify the number of partitions during these
operations? I could use repartition() after the fact, but this would be
another stage in the job.

One workaround to increase the number of partitions / tasks during a join is
to set 'spark.sql.shuffle.partitions' to some desired number during
spark-submit. I am trying to see if there is a way to provide this
programmatically for every step of a groupBy-agg / join.

The reason to do it programmatically is so that, depending on the size of the
dataframe, I can use a larger or smaller number of tasks to avoid an
OutOfMemoryError.
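
To make the workaround concrete, here is a minimal sketch of the two options I
am comparing (this assumes Spark 2.x and a SparkSession named spark; the tables
and partition counts are placeholders):

    // Sketch of the workarounds discussed above. Note: spark.sql.shuffle.partitions is
    // read when a plan is executed, so in practice it applies per action rather than
    // per individual transformation.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    val spark = SparkSession.builder().appName("shuffle-partitions-sketch").getOrCreate()
    import spark.implicits._

    val orders    = Seq((1, 10.0), (1, 20.0), (2, 5.0)).toDF("customer_id", "total")
    val customers = Seq((1, "x"), (2, "y")).toDF("customer_id", "name")

    // Option 1: set the SQL shuffle partition count at runtime before triggering the action.
    spark.conf.set("spark.sql.shuffle.partitions", "400")
    orders.join(customers, "customer_id")
      .groupBy("customer_id")
      .agg(sum("total"))
      .show()

    // Option 2: repartition explicitly on the key (adds a shuffle stage, as noted above).
    orders.repartition(400, $"customer_id")
      .groupBy("customer_id")
      .agg(sum("total"))
      .show()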






Re: [ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-11 Thread Jean Georges Perrin
Awesome! Congrats! Can't wait!!

jg


> On Jul 11, 2017, at 18:48, Michael Armbrust  wrote:
> 
> Hi all,
> 
> Apache Spark 2.2.0 is the third release of the Spark 2.x line. This release 
> removes the experimental tag from Structured Streaming. In addition, this 
> release focuses on usability, stability, and polish, resolving over 1100 
> tickets.
> 
> We'd like to thank our contributors and users for their contributions and 
> early feedback to this release. This release would not have been possible 
> without you.
> 
> To download Spark 2.2.0, head over to the download page: 
> http://spark.apache.org/downloads.html
> 
> To view the release notes: 
> https://spark.apache.org/releases/spark-release-2-2-0.html
> 
> (note: If you see any issues with the release notes, webpage or published 
> artifacts, please contact me directly off-list) 
> 
> Michael


[ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-11 Thread Michael Armbrust
Hi all,

Apache Spark 2.2.0 is the third release of the Spark 2.x line. This release
removes the experimental tag from Structured Streaming. In addition, this
release focuses on usability, stability, and polish, resolving over 1100
tickets.

We'd like to thank our contributors and users for their contributions and
early feedback to this release. This release would not have been possible
without you.

To download Spark 2.2.0, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-2-0.html

(note: If you see any issues with the release notes, webpage or published
artifacts, please contact me directly off-list)

Michael


DataFrame --- join / groupBy-agg question...

2017-07-11 Thread Muthu Jayakumar
Hello there,

I may be having a naive question on join / groupBy-agg. During the days of
RDD, whenever I wanted to perform
a. a groupBy-agg, I used to use reduceByKey (of PairRDDFunctions) with an
optional partition strategy (which is a number of partitions or a Partitioner);
b. a join (of PairRDDFunctions) and its variants, I used to have a way to
provide the number of partitions.

In DataFrame, how do I specify the number of partitions during these
operations? I could use repartition() after the fact, but this would be
another stage in the job.

One workaround to increase the number of partitions / tasks during a join
is to set 'spark.sql.shuffle.partitions' to some desired number during
spark-submit. I am trying to see if there is a way to provide this
programmatically for every step of a groupBy-agg / join.

Please advise,
Muthu


Re: Testing another Dataset after ML training

2017-07-11 Thread Michael C. Kunkel

Greetings,

Thanks for the communication.

I attached the entire stack trace that was output to the screen.
I tried using a JavaRDD of LabeledPoint and then converting to a Dataset, and I
still get the same error as I did when I only used Datasets.

I am using the expected ml Vector. I tried it with the mllib one and that also
didn't work.

BR
MK

Michael C. Kunkel, USMC, PhD
Forschungszentrum Jülich
Nuclear Physics Institute and Juelich Center for Hadron Physics
Experimental Hadron Structure (IKP-1)
www.fz-juelich.de/ikp

On 11/07/2017 17:21, Riccardo Ferrari wrote:
Mh, to me it feels like there is some data mismatch. Are you sure you're using the
expected Vector (ml vs mllib)? I am not sure you attached the whole Exception,
but you might find some more useful details there.

Best,

On Tue, Jul 11, 2017 at 3:07 PM, mckunkel wrote:
I'm not sure why I cannot subscribe, so that everyone can view the
conversation.
Help?















[Spark Streaming] - ERROR Error cleaning broadcast Exception

2017-07-11 Thread Nipun Arora
Hi All,

I get the following error while running my Spark Streaming application; we
have a large application running multiple stateful (with mapWithState) and
stateless operations. It's getting difficult to isolate the error since
Spark itself hangs and the only error we see is in the Spark log and not
in the application log itself.

The error happens only after about 4-5 minutes with a micro-batch interval of
10 seconds.
I am using Spark 1.6.1 on an Ubuntu server with Kafka-based input and
output streams.

Any direction you can give to solve this issue will be helpful. Please let
me know if I can provide any more information.

Thanks
Nipun

Error inline below:

[2017-07-11 16:15:15,338] ERROR Error cleaning broadcast 2211 (org.apache.spark.ContextCleaner)
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
    at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
    at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
    at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:77)
    at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:233)
    at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:189)
    at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:180)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:180)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
    at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:173)
    at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
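
(For reference, the timeout the message points at can be raised as in the
sketch below; the values are placeholders and raising it may only mask whatever
is making the broadcast cleanup slow.)

    // Raising the RPC ask timeout referenced in the error above. Placeholder values;
    // this does not address the root cause of the slow ContextCleaner.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-app")                 // placeholder app name
      .set("spark.rpc.askTimeout", "300s")         // defaults to spark.network.timeout (120s)
      .set("spark.network.timeout", "300s")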


Re: Query via Spark Thrift Server returns wrong result.

2017-07-11 Thread Valentin Ursu
Apologies, wrong shortcut in Gmail and I managed to send the mail before I
finished editing the query.
I have edited it below.


On Tue, Jul 11, 2017 at 7:58 PM, Valentin Ursu <valentindaniel.u...@gmail.com> wrote:

> Hello,
>
> Short description: A SQL query sent via Thrift server returns an
> inexplicable response. Running the same (exact same) query inside Apache
> Zeppelin or submitting a job returns the correct result. Furthermore, a
> similar table returns the correct response in both cases.
>
> Details:
> I'm using Spark 2.0.0 on a Cloudera 5.7 distribution. I did not test it on
> Spark 2.1.0 but if helpful I can install it and test.
>
> I have 2 Hive metastore tables which are saved by a spark batch job.
> orders_2years is processed and saveAsTable()
> orders_30days is filtered from _2years based on a column containing date
> and saveAsTable() (same batch job creates both)
>
> I checked and both tables contain this record and it is the only record
> that satisfies all the conditions in the query:
>
> | customer_id | order_id | doc_id  | category | vendor_id |
> +-------------+----------+---------+----------+-----------+
> | 916339      | 25144502 | 5596579 | 1455     | 1         |
>
> The exact same query is run using a PHP application connecting to Thrift
> and via Zeppelin/Spark using sqlContext.sql("")
>
> The query on table _30days, in Zeppelin > good result
> The query on table _30days, via Thrift > bad result
> The query on table _2years, via Thrift > good result
>
> Furthermore, changing _30days to _2years in b and leaving _30days to
> create a > good result.
>
> The query is:
>
> SELECT a.customer_id_custom,
> collect_list(b.order_id) AS b__orderIds,
> collect_list(b.doc_id) AS b__docIds,
> collect_list(b.category_id) AS b__cat,
> collect_list(b.vendor_id) AS b__vendor
> FROM ((SELECT customer_id AS customer_id_custom
> FROM orders_30days
> WHERE 1 = 1 AND category_id IN (1455) AND vendor_id IN (1) AND
> order_id IN (25144502) AND
> (fullDate > '2017-06-08 12:07' AND
> fullDate < '2017-07-07 12:07')
> GROUP BY customer_id
> HAVING count(1) >= 1 AND
> SUM(total_price_with_vat) BETWEEN 1 AND 999) a )
> INNER JOIN orders_30days AS b
> ON b.customer_id = a.customer_id_custom AND
> 1 = 1 AND b.category_id IN (1455) AND
> b.vendor_id IN (1) AND
> (b.fullDate > '2017-06-08 12:07' AND
> b.fullDate < '2017-07-07 12:07')
> WHERE 1 = 1
> GROUP BY a.customer_id_custom
>
> If you're wondering why I'm so specific with my query, the original is a
> lot more complex. For example a is actually obtained by joining 5 tables
> but I tried simplifying it as much as I could while obtaining the same
> effect.
>
> Good result is:
>
> | customer_id_custom | b__orderIds | b__docIds | b__cat | b__vendor |
> +--------------------+-------------+-----------+--------+-----------+
> | 916339             | [25144502]  | [5596579] | [1455] | [1]       |
>
> Bad result:
>
> | customer_id_custom | b__orderIds | b__docIds | b__cat | b__vendor |
> +--------------------+-------------+-----------+--------+-----------+
> | 916339             | [null]      | [1]       | [null] | [1455]    |
>
> Notice how some columns appear to be misplaced (b__cat is actually in
> b__vendor) while others just return null
>
>
>


Query via Spark Thrift Server returns wrong result.

2017-07-11 Thread Valentin Ursu
Hello,

Short description: A SQL query sent via Thrift server returns an
inexplicable response. Running the same (exact same) query inside Apache
Zeppelin or submitting a job returns the correct result. Furthermore, a
similar table returns the correct response in both cases.

Details:
I'm using Spark 2.0.0 on a Cloudera 5.7 distribution. I did not test it on
Spark 2.1.0 but if helpful I can install it and test.

I have 2 Hive metastore tables which are saved by a spark batch job.
orders_2years is processed and saveAsTable()
orders_30days is filtered from _2years based on a column containing date
and saveAsTable() (same batch job creates both)

I checked and both tables contain this record and it is the only record
that satisfies all the conditions in the query:

| customer_id | order_id | doc_id  | category | vendor_id |
+-------------+----------+---------+----------+-----------+
| 916339      | 25144502 | 5596579 | 1455     | 1         |

The exact same query is run using a PHP application connecting to Thrift
and via Zeppelin/Spark using sqlContext.sql("")
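
(For context, the Thrift path is the standard HiveServer2 JDBC interface; a
minimal sketch of issuing a query that way, with placeholder host, port, and
credentials:)

    // Sketch only: placeholder connection details; `query` stands for the SELECT shown below.
    import java.sql.DriverManager

    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
    val stmt = conn.createStatement()
    val query = "SELECT 1"                       // placeholder; the real query is shown below
    val rs = stmt.executeQuery(query)
    while (rs.next()) {
      println(rs.getString(1))
    }
    conn.close()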

The query on table _30days, in Zeppelin > good result
The query on table _30days, via Thrift > bad result
The query on table _2years, via Thrift > good result

Furthermore, changing _30days to _2years in b and leaving _30days to create
a > good result.

The query is:

SELECT a.customer_id_custom,
collect_list(b.id_comanda) AS b__orderIds,
collect_list(b.doc_id) AS b__docIds,
collect_list(b.category_id) AS b__cat,
collect_list(b.vendor_id) AS b__vendor
FROM ((SELECT customer_id AS customer_id_custom
FROM orders_30days
WHERE 1 = 1 AND category_id IN (1455) AND vendor_id IN (1) AND
(fullDate > '2017-06-08 12:07' AND
fullDate < '2017-07-07 12:07')
GROUP BY customer_id
HAVING count(1) >= 1 AND
SUM(total_price_with_vat) BETWEEN 1 AND 999) a )
INNER JOIN orders_30days AS b
ON b.customer_id = a.customer_id_custom AND
1 = 1 AND b.category_id IN (1455) AND
b.doc_id IN (5596579) AND b.vendor_id IN (1) AND
(b.fullDate > '2017-06-08 12:07' AND
b.fullDate < '2017-07-07 12:07')
WHERE 1 = 1
GROUP BY a.customer_id_custom

If you're wondering why I'm so specific with my query, the original is a
lot more complex. For example a is actually obtained by joining 5 tables
but I tried simplifying it as much as I could while obtaining the same
effect.

Good result is:

| customer_id_custom | b__orderIds | b__docIds | b__cat | b__vendor |
+--------------------+-------------+-----------+--------+-----------+
| 916339             | [25144502]  | [5596579] | [1455] | [1]       |

Bad result:

| customer_id_custom | b__orderIds | b__docIds | b__cat | b__vendor |
+--------------------+-------------+-----------+--------+-----------+
| 916339             | [null]      | [1]       | [null] | [1455]    |

Notice how some columns appear to be misplaced while others just return null


Re: Testing another Dataset after ML training

2017-07-11 Thread Riccardo Ferrari
Mh, to me it feels like there is some data mismatch. Are you sure you're using
the expected Vector (ml vs mllib)? I am not sure you attached the whole
Exception, but you might find some more useful details there.

Best,

On Tue, Jul 11, 2017 at 3:07 PM, mckunkel  wrote:

> I'm not sure why I cannot subscribe, so that everyone can view the
> conversation.
> Help?
>


Re: Testing another Dataset after ML training

2017-07-11 Thread mckunkel
I'm not sure why I cannot subscribe, so that everyone can view the
conversation.
Help?






Re: Testing another Dataset after ML training

2017-07-11 Thread Michael C. Kunkel

Greetings,

I am 50/50 sure the data format is correct: if I split the data, the
classifier works properly, but not if I introduce another dataset created
identically to the one it was trained on.

However, the creation of the data itself is in doubt, and I do not see any help on
this subject with Dataset.

What I do is create two List<Row> objects:

    List<Row> dataTraining = new ArrayList<>();
    List<Row> dataTesting = new ArrayList<>();

fill them:

    dataTraining.add(RowFactory.create(Double.parseDouble(label), Vectors.dense(v)));
    dataTesting.add(RowFactory.create(Double.parseDouble(label), Vectors.dense(v)));

and then construct two Dataset<Row> objects:

    StructType schemaForFrame = new StructType(new StructField[] {
        new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
        new StructField("features", new VectorUDT(), false, Metadata.empty()) });

    Dataset<Row> training = spark.createDataFrame(dataTraining, schemaForFrame);
    Dataset<Row> testing = spark.createDataFrame(dataTesting, schemaForFrame);


So I am not sure if it is correct, but I am not using RDD.

Also, can you tell me if you have had any problems with the mailing list? I have
tried for weeks to get my emails accepted by the list.

Thanks

BR
MK

Michael C. Kunkel, USMC, PhD
Forschungszentrum Jülich
Nuclear Physics Institute and Juelich Center for Hadron Physics
Experimental Hadron Structure (IKP-1)
www.fz-juelich.de/ikp

On 11/07/2017 14:53, Riccardo Ferrari wrote:
Hi,

Are you sure you're feeding the correct data format? I found this conversation 
that might be useful:
http://apache-spark-user-list.1001560.n3.nabble.com/Description-of-data-file-sample-libsvm-data-txt-td25832.html

Best,

On Tue, Jul 11, 2017 at 1:42 PM, mckunkel wrote:
Greetings,

Following the example on the Apache Spark page for Naive Bayes using Datasets:
https://spark.apache.org/docs/latest/ml-classification-regression.html#naive-bayes


I want to predict the outcome of another set of data. So instead of
splitting the data into training and testing, I have one set for training and
one set for testing, i.e.:

    Dataset<Row> training = spark.createDataFrame(dataTraining, schemaForFrame);
    Dataset<Row> testing = spark.createDataFrame(dataTesting, schemaForFrame);

    NaiveBayes nb = new NaiveBayes();
    NaiveBayesModel model = nb.fit(training);
    Dataset<Row> predictions = model.transform(testing);
    predictions.show();

But I get the error.

17/07/11 13:40:38 INFO DAGScheduler: Job 2 finished: collect at
NaiveBayes.scala:171, took 3.942413 s
Exception in thread "main" org.apache.spark.SparkException: Failed to
execute user defined function($anonfun$1: (vector) => vector)
   at
org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
   at
org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:144)
   at
org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48)
   at
org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30)
   at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
   at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

...
...
...


How do I perform predictions on other datasets that were not created at a
split?















Re: Testing another Dataset after ML training

2017-07-11 Thread Riccardo Ferrari
Hi,

Are you sure you're feeding the correct data format? I found this
conversation that might be useful:
http://apache-spark-user-list.1001560.n3.nabble.com/Description-of-data-file-sample-libsvm-data-txt-td25832.html

Best,

On Tue, Jul 11, 2017 at 1:42 PM, mckunkel  wrote:

> Greetings,
>
> Following the example on the Apache Spark page for Naive Bayes using Datasets:
> https://spark.apache.org/docs/latest/ml-classification-regression.html#naive-bayes
>
> I want to predict the outcome of another set of data. So instead of
> splitting the data into training and testing, I have one set for training and
> one set for testing, i.e.:
>
>     Dataset<Row> training = spark.createDataFrame(dataTraining, schemaForFrame);
>     Dataset<Row> testing = spark.createDataFrame(dataTesting, schemaForFrame);
>
>     NaiveBayes nb = new NaiveBayes();
>     NaiveBayesModel model = nb.fit(training);
>     Dataset<Row> predictions = model.transform(testing);
>     predictions.show();
>
> But I get the error.
>
> 17/07/11 13:40:38 INFO DAGScheduler: Job 2 finished: collect at
> NaiveBayes.scala:171, took 3.942413 s
> Exception in thread "main" org.apache.spark.SparkException: Failed to
> execute user defined function($anonfun$1: (vector) => vector)
> at
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(
> ScalaUDF.scala:1075)
> at
> org.apache.spark.sql.catalyst.expressions.Alias.eval(
> namedExpressions.scala:144)
> at
> org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(
> Projection.scala:48)
> at
> org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(
> Projection.scala:30)
> at
> scala.collection.TraversableLike$$anonfun$map$
> 1.apply(TraversableLike.scala:234)
> at
> scala.collection.TraversableLike$$anonfun$map$
> 1.apply(TraversableLike.scala:234)
>
> ...
> ...
> ...
>
>
> How do I perform predictions on other datasets that were not created at a
> split?
>
>
>
>


Testing another Dataset after ML training

2017-07-11 Thread mckunkel
Greetings,

Following the example on the Apache Spark page for Naive Bayes using Datasets:
https://spark.apache.org/docs/latest/ml-classification-regression.html#naive-bayes

I want to predict the outcome of another set of data. So instead of
splitting the data into training and testing, I have one set for training and
one set for testing, i.e.:

    Dataset<Row> training = spark.createDataFrame(dataTraining, schemaForFrame);
    Dataset<Row> testing = spark.createDataFrame(dataTesting, schemaForFrame);

    NaiveBayes nb = new NaiveBayes();
    NaiveBayesModel model = nb.fit(training);
    Dataset<Row> predictions = model.transform(testing);
    predictions.show();

But I get the error.

17/07/11 13:40:38 INFO DAGScheduler: Job 2 finished: collect at
NaiveBayes.scala:171, took 3.942413 s
Exception in thread "main" org.apache.spark.SparkException: Failed to
execute user defined function($anonfun$1: (vector) => vector)
at
org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
at
org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:144)
at
org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48)
at
org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

...
...
...


How do I perform predictions on other datasets that were not created at a
split?


