Re: [FYI] SPARK-45981: Improve Python language test coverage

2023-12-02 Thread Hyukjin Kwon
Awesome!

On Sat, Dec 2, 2023 at 2:33 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> As a part of Apache Spark 4.0.0 (SPARK-44111), the Apache Spark community
> starts to have test coverage for all supported Python versions from Today.
>
> - https://github.com/apache/spark/actions/runs/7061665420
>
> Here is a summary.
>
> 1. Main CI: All PRs and commits on `master` branch are tested with Python
> 3.9.
> 2. Daily CI:
> https://github.com/apache/spark/actions/workflows/build_python.yml
> - PyPy 3.8
> - Python 3.10
> - Python 3.11
> - Python 3.12
>
> This is a great addition for PySpark 4.0+ users and an extensible
> framework for all future Python versions.
>
> Thank you all for making this together!
>
> Best,
> Dongjoon.
>


[FYI] SPARK-45981: Improve Python language test coverage

2023-12-01 Thread Dongjoon Hyun
Hi, All.

As part of Apache Spark 4.0.0 (SPARK-44111), the Apache Spark community
starts to have test coverage for all supported Python versions from today.

- https://github.com/apache/spark/actions/runs/7061665420

Here is a summary.

1. Main CI: All PRs and commits on `master` branch are tested with Python
3.9.
2. Daily CI:
https://github.com/apache/spark/actions/workflows/build_python.yml
- PyPy 3.8
- Python 3.10
- Python 3.11
- Python 3.12

This is a great addition for PySpark 4.0+ users and an extensible framework
for all future Python versions.

Thank you all for making this together!

Best,
Dongjoon.


Re: IDEA compile fail but sbt test succeed

2023-09-09 Thread Pasha Finkelshteyn

Dear AlphaBetaGo,

First of all, there are not only guys here, but also women.

Second, you didn't give enough context to understand the connection with
Spark. From what I see, it's more likely an issue in the Spark/sbt support
in IDEA. Feel free to create an issue in the JetBrains YouTrack [1].


[1] https://youtrack.jetbrains.com/newIssue?project=SCL=25-4794862

On 9/9/23 06:24, AlphaBetaGo wrote:



Hi guys



When building the Spark source code with IDEA, an error comes up, while the
sbt build succeeds:

no `: _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
private def doReturn(value: Any) = org.mockito.Mockito.doReturn(value, 
Seq.empty: _*)
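
For reference, the flagged construct is valid Scala: `Seq.empty: _*` spreads a
sequence into a varargs parameter, and scalac/sbt accept it, so an error here
points at the IDE's Scala support rather than the code. A minimal,
self-contained sketch of the same syntax (nothing Mockito-specific, names made up):

object VarargsSpreadExample {
  // A Java-style varargs method, analogous in shape to Mockito.doReturn(value, rest: _*).
  def firstOrDefault(default: String, rest: String*): String =
    rest.headOption.getOrElse(default)

  def main(args: Array[String]): Unit = {
    // `: _*` spreads the (empty) Seq into the varargs parameter.
    val result = firstOrDefault("fallback", Seq.empty[String]: _*)
    assert(result == "fallback")
  }
}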



AlphaBetaGo
alphabet...@163.com



--
Pasha Finkelshteyn
Developer Advocate @ JetBrains

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Hive 3 has big performance improvement from my test

2023-01-08 Thread Mich Talebzadeh
What bothers me is that you are making sweeping statements about Spark's
inability to handle concurrency, quote: "... the key weakness of Spark is 1) its poor
performance when executing concurrent queries and 2) its poor resource
utilization when executing multiple Spark applications concurrently",
and conversely overstating Hive's ability when running on MR.
In fairness, anything published in a public forum is fair game for analysis
or criticism. Then you are expected to back it up. I cannot see how anyone
could object to the statement: if you make a claim, be prepared to prove
it.

I am open minded on this so please clarify the above statement

HTH

   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 8 Jan 2023 at 05:21, Sungwoo Park  wrote:

>
>> [image: image.png]
>>
>> from your posting, the result is amazing. glad to know hive on mr3 has
>> that nice performance.
>>
>
> Hive on MR3 is similar to Hive-LLAP in performance, so we can interpret
> the above result as Hive being much faster than SparkSQL. For executing
> concurrent queries, the performance gap is even greater. In my (rather
> biased) opinion, the key weakness of Spark is 1) its poor performance when
> executing concurrent queries and 2) its poor resource utilization when
> executing multiple Spark applications concurrently.
>
> We released Hive on MR3 1.6 a couple of weeks ago. Now we have backported
> about 700 patches to Hive 3.1. If interested, please check it out:
> https://www.datamonad.com/
>
> Sungwoo
>


Re: Hive 3 has big performance improvement from my test

2023-01-07 Thread Mich Talebzadeh
Thanks for this insight guys.

On your point below and I quote:

...  "It's even as fast as Spark by using the default mr engine"

OK, as we are all experimentalists: are we stating that classic
MapReduce computation can outdo Spark's in-memory computation? I would be
curious to know.

Thanks



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 6 Jan 2023 at 03:35, ypeng  wrote:

> Hello,
>
> Just from my personal testing, Hive 3.1.3 has much better performance than
> the old ones.
> It's even as fast as Spark by using the default mr engine.
> My test process and dataset,
> https://blog.crypt.pw/Another-10-million-dataset-testing-for-Spark-and-Hive
>
> Thanks.
>


Re: The Dataset unit test is much slower than the RDD unit test (in Scala)

2022-11-01 Thread Cheng Pan
Which Spark version are you using?

SPARK-36444[1] and SPARK-38138[2] may be related, please test w/ the
patched version or disable DPP by setting
spark.sql.optimizer.dynamicPartitionPruning.enabled=false to see if it
helps.

[1] https://issues.apache.org/jira/browse/SPARK-36444
[2] https://issues.apache.org/jira/browse/SPARK-38138
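
As a quick way to try both suggestions from a test, a sketch (plain SQL confs;
`spark` stands for whatever session the test suite already builds):

// Sketch: disable DPP and turn on plan-change logging on an existing test session.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false")
spark.conf.set("spark.sql.planChangeLog.level", "WARN") // logs which rules fire and how long they take

val df = spark.range(10).selectExpr("id", "id * 2 AS doubled")
df.explain() // planning happens here; rule-by-rule changes show up at WARN level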


Thanks,
Cheng Pan


On Nov 2, 2022 at 00:14:34, Enrico Minack  wrote:

> Hi Tanin,
>
> running your test with option "spark.sql.planChangeLog.level" set to
> "info" or "warn" (depending on your Spark log level) will show you
> insights into the planning (which rules are applied, how long rules
> take, how many iterations are done).
>
> Hoping this helps,
> Enrico
>
>
> Am 25.10.22 um 21:54 schrieb Tanin Na Nakorn:
>
> Hi All,
>
>
> Our data job is very complex (e.g. 100+ joins), and we have switched
>
> from RDD to Dataset recently.
>
>
> We've found that the unit test takes much longer. We profiled it and
>
> have found that it's the planning phase that is slow, not execution.
>
>
> I wonder if anyone has encountered this issue before and if there's a
>
> way to make the planning phase faster (e.g. maybe disabling certain
>
> optimizers).
>
>
> Any thoughts or input would be appreciated.
>
>
> Thank you,
>
> Tanin
>
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: The Dataset unit test is much slower than the RDD unit test (in Scala)

2022-11-01 Thread Enrico Minack

Hi Tanin,

running your test with option "spark.sql.planChangeLog.level" set to 
"info" or "warn" (depending on your Spark log level) will show you 
insights into the planning (which rules are applied, how long rules 
take, how many iterations are done).


Hoping this helps,
Enrico


Am 25.10.22 um 21:54 schrieb Tanin Na Nakorn:

Hi All,

Our data job is very complex (e.g. 100+ joins), and we have switched 
from RDD to Dataset recently.


We've found that the unit test takes much longer. We profiled it and 
have found that it's the planning phase that is slow, not execution.


I wonder if anyone has encountered this issue before and if there's a 
way to make the planning phase faster (e.g. maybe disabling certain 
optimizers).


Any thoughts or input would be appreciated.

Thank you,
Tanin




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



The Dataset unit test is much slower than the RDD unit test (in Scala)

2022-10-25 Thread Tanin Na Nakorn
Hi All,

Our data job is very complex (e.g. 100+ joins), and we have switched from
RDD to Dataset recently.

We've found that the unit test takes much longer. We profiled it and have
found that it's the planning phase that is slow, not execution.

I wonder if anyone has encountered this issue before and if there's a way
to make the planning phase faster (e.g. maybe disabling certain optimizers).

Any thoughts or input would be appreciated.

Thank you,
Tanin
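
One concrete knob for "disabling certain optimizers", in case it helps: Spark
2.4 and later has spark.sql.optimizer.excludedRules, which excludes individual
Catalyst rules by fully qualified name. A sketch (the rule below is only an
example, and excluding rules can change the resulting plan, so treat it as a
diagnostic rather than a fix):

spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.ConstantFolding")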


Skip single integration test case in Spark on K8s

2022-03-16 Thread Pralabh Kumar
Hi Spark team

I am running Spark kubernetes integration test suite on cloud.

build/mvn install \

-f  pom.xml \

-pl resource-managers/kubernetes/integration-tests -am -Pscala-2.12
-Phadoop-3.1.1 -Phive -Phive-thriftserver -Pyarn -Pkubernetes
-Pkubernetes-integration-tests \

-Djava.version=8 \

-Dspark.kubernetes.test.sparkTgz= \

-Dspark.kubernetes.test.imageTag=<> \

-Dspark.kubernetes.test.imageRepo=<repo> \

-Dspark.kubernetes.test.deployMode=cloud \

-Dtest.include.tags=k8s \

-Dspark.kubernetes.test.javaImageTag= \

-Dspark.kubernetes.test.namespace= \

-Dspark.kubernetes.test.serviceAccountName=spark \

-Dspark.kubernetes.test.kubeConfigContext=<> \

-Dspark.kubernetes.test.master=<> \

-Dspark.kubernetes.test.jvmImage=<> \

-Dspark.kubernetes.test.pythonImage=<> \

-Dlog4j.logger.org.apache.spark=DEBUG



I am able to run some test cases successfully, but some are failing. For
example, "Run SparkRemoteFileTest using a Remote data file" in KubernetesSuite
is failing.


Is there a way to skip running some of the test cases?



Please help me with the same.


Regards

Pralabh Kumar


Re: ivy unit test case filing for Spark

2021-12-21 Thread Wes Peng
Are you using Ivy over a VPN, which causes this problem? If the VPN software
changes network URLs silently, you should avoid using it.

Regards.

On Wed, Dec 22, 2021 at 1:48 AM Pralabh Kumar 
wrote:

> Hi Spark Team
>
> I am building a spark in VPN . But the unit test case below is failing.
> This is pointing to ivy location which  cannot be reached within VPN . Any
> help would be appreciated
>
> test("SPARK-33084: Add jar support Ivy URI -- default transitive = true")
> {
>   *sc *= new SparkContext(new 
> SparkConf().setAppName("test").setMaster("local-cluster[3,
> 1, 1024]"))
>   *sc*.addJar("*ivy://org.apache.hive:hive-storage-api:2.7.0*")
>   assert(*sc*.listJars().exists(_.contains(
> "org.apache.hive_hive-storage-api-2.7.0.jar")))
>   assert(*sc*.listJars().exists(_.contains(
> "commons-lang_commons-lang-2.6.jar")))
> }
>
> Error
>
> - SPARK-33084: Add jar support Ivy URI -- default transitive = true ***
> FAILED ***
> java.lang.RuntimeException: [unresolved dependency:
> org.apache.hive#hive-storage-api;2.7.0: not found]
> at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(
> SparkSubmit.scala:1447)
> at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(
> DependencyUtils.scala:185)
> at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(
> DependencyUtils.scala:159)
> at org.apache.spark.SparkContext.addJar(SparkContext.scala:1996)
> at org.apache.spark.SparkContext.addJar(SparkContext.scala:1928)
> at org.apache.spark.SparkContextSuite.$anonfun$new$115(SparkContextSuite.
> scala:1041)
> at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> at org.scalatest.Transformer.apply(Transformer.scala:22)
>
> Regards
> Pralabh Kumar
>
>
>


Re: ivy unit test case filing for Spark

2021-12-21 Thread Sean Owen
You would have to make it available? This doesn't seem like a spark issue.
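
For what it's worth, one way to "make it available" from inside a restricted
network is Spark's spark.jars.ivySettings option, which points dependency
resolution at an ivysettings.xml of your own (documented for --packages style
resolution); whether the ivy:// addJar path picks it up in your version is
worth checking. A sketch, with a hypothetical settings-file path:

// Sketch: point Ivy resolution at a settings file that defines a resolver for
// the internal repository instead of the default Maven Central. The path is an
// assumption; the ivysettings.xml itself has to be provided.
val conf = new org.apache.spark.SparkConf()
  .setAppName("test")
  .setMaster("local-cluster[3, 1, 1024]")
  .set("spark.jars.ivySettings", "/path/to/internal-ivysettings.xml")
val sc = new org.apache.spark.SparkContext(conf)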

On Tue, Dec 21, 2021, 10:48 AM Pralabh Kumar  wrote:

> Hi Spark Team
>
> I am building a spark in VPN . But the unit test case below is failing.
> This is pointing to ivy location which  cannot be reached within VPN . Any
> help would be appreciated
>
> test("SPARK-33084: Add jar support Ivy URI -- default transitive = true")
> {
>   *sc *= new SparkContext(new 
> SparkConf().setAppName("test").setMaster("local-cluster[3,
> 1, 1024]"))
>   *sc*.addJar("*ivy://org.apache.hive:hive-storage-api:2.7.0*")
>   assert(*sc*.listJars().exists(_.contains(
> "org.apache.hive_hive-storage-api-2.7.0.jar")))
>   assert(*sc*.listJars().exists(_.contains(
> "commons-lang_commons-lang-2.6.jar")))
> }
>
> Error
>
> - SPARK-33084: Add jar support Ivy URI -- default transitive = true ***
> FAILED ***
> java.lang.RuntimeException: [unresolved dependency:
> org.apache.hive#hive-storage-api;2.7.0: not found]
> at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(
> SparkSubmit.scala:1447)
> at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(
> DependencyUtils.scala:185)
> at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(
> DependencyUtils.scala:159)
> at org.apache.spark.SparkContext.addJar(SparkContext.scala:1996)
> at org.apache.spark.SparkContext.addJar(SparkContext.scala:1928)
> at org.apache.spark.SparkContextSuite.$anonfun$new$115(SparkContextSuite.
> scala:1041)
> at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> at org.scalatest.Transformer.apply(Transformer.scala:22)
>
> Regards
> Pralabh Kumar
>
>
>


ivy unit test case filing for Spark

2021-12-21 Thread Pralabh Kumar
Hi Spark Team

I am building Spark inside a VPN, but the unit test case below is failing.
It is pointing to an Ivy location which cannot be reached from within the VPN.
Any help would be appreciated.

test("SPARK-33084: Add jar support Ivy URI -- default transitive = true") {
  sc = new SparkContext(new
    SparkConf().setAppName("test").setMaster("local-cluster[3, 1, 1024]"))
  sc.addJar("ivy://org.apache.hive:hive-storage-api:2.7.0")
  assert(sc.listJars().exists(_.contains(
    "org.apache.hive_hive-storage-api-2.7.0.jar")))
  assert(sc.listJars().exists(_.contains(
    "commons-lang_commons-lang-2.6.jar")))
}

Error

- SPARK-33084: Add jar support Ivy URI -- default transitive = true ***
FAILED ***
java.lang.RuntimeException: [unresolved dependency:
org.apache.hive#hive-storage-api;2.7.0: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(
SparkSubmit.scala:1447)
at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(
DependencyUtils.scala:185)
at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(
DependencyUtils.scala:159)
at org.apache.spark.SparkContext.addJar(SparkContext.scala:1996)
at org.apache.spark.SparkContext.addJar(SparkContext.scala:1928)
at org.apache.spark.SparkContextSuite.$anonfun$new$115(SparkContextSuite.
scala:1041)
at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)

Regards
Pralabh Kumar


Re: Need Unit test complete reference for Pyspark

2020-11-19 Thread Sofia’s World
Hey,
They are good libraries to get you started; I have used both of them.
Unfortunately, as far as I saw when I started to use them, only a few
people maintain them.
But you can get pointers out of them for writing tests. The code below can
get you started. What you'll need is:

- a method to create a dataframe on the fly, perhaps from a string; you can
have a look at pandas, it will have methods for it
- a method to test dataframe equality; you can use df1.subtract(df2)

I am assuming you are into dataframes rather than RDDs; for RDDs the two
packages you mention should have everything you need.

hth,
 marco


from pyspark.sql import SparkSession
import pytest


@pytest.fixture
def spark_session():
    # Build (or reuse) a local SparkSession for the tests.
    return SparkSession.builder \
        .master('local[1]') \
        .appName('SparkByExamples.com') \
        .getOrCreate()


def test_create_table(spark_session):
    # Create two small dataframes on the fly and assert they are equal
    # by checking that their difference is empty.
    df = spark_session.createDataFrame([['one', 'two']]).toDF(*['first', 'second'])
    df.show()

    df2 = spark_session.createDataFrame([['one', 'two']]).toDF(*['first', 'second'])

    assert df.subtract(df2).count() == 0




On Thu, Nov 19, 2020 at 6:38 AM Sachit Murarka 
wrote:

> Hi Users,
>
> I have to write Unit Test cases for PySpark.
> I think pytest-spark and "spark testing base" are good test libraries.
>
> Can anyone please provide full reference for writing the test cases in
> Python using these?
>
> Kind Regards,
> Sachit Murarka
>


Need Unit test complete reference for Pyspark

2020-11-18 Thread Sachit Murarka
Hi Users,

I have to write Unit Test cases for PySpark.
I think pytest-spark and "spark testing base" are good test libraries.

Can anyone please provide full reference for writing the test cases in
Python using these?

Kind Regards,
Sachit Murarka


Re: test

2020-07-27 Thread Ashley Hoff
Yes, your emails are getting through.

On Mon, Jul 27, 2020 at 6:31 PM Suat Toksöz  wrote:

> user@spark.apache.org
>
> --
>
> Best regards,
>
> *Suat Toksoz*
>


-- 
Kustoms On Silver 


test

2020-07-27 Thread Suat Toksöz
user@spark.apache.org

-- 

Best regards,

*Suat Toksoz*


Re: find failed test

2020-03-06 Thread Wim Van Leuven
Srsly?

On Sat, 7 Mar 2020 at 03:28, Koert Kuipers  wrote:

> i just ran:
> mvn test -fae > log.txt
>
> at the end of log.txt i find it says there are failures:
> [INFO] Spark Project SQL .. FAILURE [47:55
> min]
>
> that is not very helpful. what tests failed?
>
> i could go scroll up but the file has 21,517 lines. ok let's skip that.
>
> so i figure there are test reports in sql/core/target. i was right! its
> sql/core/target/surefire-reports. but it has 276 files, so that's still a bit
> much to go through. i assume there is some nice summary that shows me the
> failed tests... maybe SparkTestSuite.txt? its 2687 lines, so again a bit
> much, but i do go through it and find nothing useful.
>
> so... how do i quickly find out which test failed exactly?
> there must be some maven trick here?
>
> thanks!
>


find failed test

2020-03-06 Thread Koert Kuipers
i just ran:
mvn test -fae > log.txt

at the end of log.txt i find it says there are failures:
[INFO] Spark Project SQL .. FAILURE [47:55
min]

that is not very helpful. what tests failed?

i could go scroll up but the file has 21,517 lines. ok let's skip that.

so i figure there are test reports in sql/core/target. i was right! its
sql/core/target/surefire-reports. but it has 276 files, so that's still a bit
much to go through. i assume there is some nice summary that shows me the
failed tests... maybe SparkTestSuite.txt? its 2687 lines, so again a bit
much, but i do go through it and find nothing useful.

so... how do i quickly find out which test failed exactly?
there must be some maven trick here?

thanks!


Test mail

2019-09-05 Thread Himali Patel



test

2019-08-23 Thread Mayank Agarwal



[Spark SQL] dependencies to use test helpers

2019-07-24 Thread James Pirz
I have a Scala application in which I have added some extra rules to
Catalyst.
While adding some unit tests, I am trying to use some existing functions
from Catalyst's test code: Specifically comparePlans() and normalizePlan()
under PlanTestBase
<https://github.com/apache/spark/blob/fced6696a7713a5dc117860faef43db6b81d07b3/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala>
[1].

I am just wondering which additional dependencies I need to add to my
project to access them. Currently, I have below dependencies but they do
not cover above APIs.

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.4.3"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.4.3"
libraryDependencies += "org.apache.spark" % "spark-catalyst_2.11" % "2.4.3"


[1] 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala

Thanks,
James
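
In case it helps: classes like PlanTestBase live in Catalyst's test sources,
and Spark publishes those as separate artifacts with a "tests" classifier, so
(to the best of my knowledge) the extra sbt entries would look roughly like this:

// Sketch: pull in Spark's published test jars so that test-only helpers such as
// PlanTestBase / PlanTest end up on the test classpath.
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-catalyst_2.11" % "2.4.3" % "test" classifier "tests",
  "org.apache.spark" % "spark-core_2.11" % "2.4.3" % "test" classifier "tests",
  "org.apache.spark" % "spark-sql_2.11" % "2.4.3" % "test" classifier "tests"
)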


spark ./build/mvn test failed on aarch64

2019-06-05 Thread Tianhua huang
Hi all,
Recently I run './build/mvn test' of spark on aarch64, and master and
branch-2.4 are all failled, the log pieces as below:

..

[INFO] T E S T S
[INFO] ---
[INFO] Running org.apache.spark.util.kvstore.LevelDBTypeInfoSuite
[INFO] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed:
0.081 s - in org.apache.spark.util.kvstore.LevelDBTypeInfoSuite
[INFO] Running org.apache.spark.util.kvstore.InMemoryStoreSuite
[INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed:
0.001 s - in org.apache.spark.util.kvstore.InMemoryStoreSuite
[INFO] Running org.apache.spark.util.kvstore.InMemoryIteratorSuite
[INFO] Tests run: 38, Failures: 0, Errors: 0, Skipped: 0, Time elapsed:
0.219 s - in org.apache.spark.util.kvstore.InMemoryIteratorSuite
[INFO] Running org.apache.spark.util.kvstore.LevelDBIteratorSuite
[ERROR] Tests run: 38, Failures: 0, Errors: 38, Skipped: 0, Time elapsed:
0.23 s <<< FAILURE! - in org.apache.spark.util.kvstore.LevelDBIteratorSuite
[ERROR] 
copyIndexDescendingWithStart(org.apache.spark.util.kvstore.LevelDBIteratorSuite)
Time elapsed: 0.2 s <<< ERROR!
java.lang.UnsatisfiedLinkError: Could not load library. Reasons: [no
leveldbjni64-1.8 in java.library.path, no leveldbjni-1.8 in
java.library.path, no leveldbjni in java.library.path,
/usr/local/src/spark/common/kvstore/target/tmp/libleveldbjni-64-1-610267671268036503.8:
/usr/local/src/spark/common/kvstore/target/tmp/libleveldbjni-64-1-610267671268036503.8:
cannot open shared object file: No such file or directory (Possible cause:
can't load AMD 64-bit .so on a AARCH64-bit platform)]
at
org.apache.spark.util.kvstore.LevelDBIteratorSuite.createStore(LevelDBIteratorSuite.java:44)

..

There is a dependency on leveldbjni-all, but there is no native package for
aarch64 in the leveldbjni-1.8 (all) jar. I found that aarch64 support was
added after PR https://github.com/fusesource/leveldbjni/pull/82, but it was
not in the 1.8 release, and unfortunately the repo has not been updated for
almost two years.

So I have a question: does Spark support aarch64? If yes, how can I fix this
problem; if not, what is the plan for it? Thank you all!


spark ./build/mvn test failed on aarch64

2019-06-05 Thread Tianhua huang
Hi all,
Recently I run './build/mvn test' of spark on aarch64, and master and
branch-2.4 are all failled, the log pieces as below:

..

[INFO] T E S T S
[INFO] ---
[INFO] Running org.apache.spark.util.kvstore.LevelDBTypeInfoSuite
[INFO] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed:
0.081 s - in org.apache.spark.util.kvstore.LevelDBTypeInfoSuite
[INFO] Running org.apache.spark.util.kvstore.InMemoryStoreSuite
[INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed:
0.001 s - in org.apache.spark.util.kvstore.InMemoryStoreSuite
[INFO] Running org.apache.spark.util.kvstore.InMemoryIteratorSuite
[INFO] Tests run: 38, Failures: 0, Errors: 0, Skipped: 0, Time elapsed:
0.219 s - in org.apache.spark.util.kvstore.InMemoryIteratorSuite
[INFO] Running org.apache.spark.util.kvstore.LevelDBIteratorSuite
[ERROR] Tests run: 38, Failures: 0, Errors: 38, Skipped: 0, Time elapsed:
0.23 s <<< FAILURE! - in org.apache.spark.util.kvstore.LevelDBIteratorSuite
[ERROR] 
copyIndexDescendingWithStart(org.apache.spark.util.kvstore.LevelDBIteratorSuite)
Time elapsed: 0.2 s <<< ERROR!
java.lang.UnsatisfiedLinkError: Could not load library. Reasons: [no
leveldbjni64-1.8 in java.library.path, no leveldbjni-1.8 in
java.library.path, no leveldbjni in java.library.path,
/usr/local/src/spark/common/kvstore/target/tmp/libleveldbjni-64-1-610267671268036503.8:
/usr/local/src/spark/common/kvstore/target/tmp/libleveldbjni-64-1-610267671268036503.8:
cannot open shared object file: No such file or directory (Possible cause:
can't load AMD 64-bit .so on a AARCH64-bit platform)]
at
org.apache.spark.util.kvstore.LevelDBIteratorSuite.createStore(LevelDBIteratorSuite.java:44)

..

There is a dependency on leveldbjni:

<dependency>
    <groupId>org.fusesource.leveldbjni</groupId>
    <artifactId>leveldbjni-all</artifactId>
    <version>1.8</version>
</dependency>



Testing with spark-base-test

2018-03-28 Thread Guillermo Ortiz
I'm using spark-testing-base and I can't get the code to compile.

  test("Testging") {
val inputInsert = A("data2")
val inputDelete = A("data1")
val outputInsert = B(1)
val outputDelete = C(1)

val input = List(List(inputInsert), List(inputDelete))
val output = (List(List(outputInsert)), List(List(outputDelete)))

//Why doesn't it compile?? I have tried many things here.
testOperation[A,(B,C)](input, service.processing _, output)
  }

My method is:

def processing(avroDstream: DStream[A]) : (DStream[B],DStream[C]) ={...}

What does the "_" mean in this case?
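
On the last question: `service.processing _` is plain Scala eta-expansion,
i.e. it turns the method into a function value so it can be passed where a
function parameter is expected (as testOperation does). A tiny self-contained
sketch of the same mechanism, with made-up names:

object EtaExpansionExample {
  def processing(x: Int): Int = x + 1

  def main(args: Array[String]): Unit = {
    // `processing _` eta-expands the method into a value of type Int => Int.
    val asFunction: Int => Int = processing _
    assert(asFunction(41) == 42)
  }
}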


Re: SparkR test script issue: unable to run run-tests.h on spark 2.2

2018-02-14 Thread chandan prakash
Thanks a lot Hyukjin & Felix.
It was helpful.
Going to older version worked.

Regards,
Chandan

On Wed, Feb 14, 2018 at 3:28 PM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> Yes it is issue with the newer release of testthat.
>
> To workaround could you install an earlier version with devtools? will
> follow up for a fix.
>
> _
> From: Hyukjin Kwon <gurwls...@gmail.com>
> Sent: Wednesday, February 14, 2018 6:49 PM
> Subject: Re: SparkR test script issue: unable to run run-tests.h on spark
> 2.2
> To: chandan prakash <chandanbaran...@gmail.com>
> Cc: user @spark <user@spark.apache.org>
>
>
>
> From a very quick look, I think testthat version issue with SparkR.
>
> I had to fix that version to 1.x before in AppVeyor. There are few details
> in https://github.com/apache/spark/pull/20003
>
> Can you check and lower testthat version?
>
>
> On 14 Feb 2018 6:09 pm, "chandan prakash" <chandanbaran...@gmail.com>
> wrote:
>
>> Hi All,
>> I am trying to run test script of R under ./R/run-tests.sh but hitting
>> same ERROR everytime.
>> I tried running on mac as well as centos machine, same issue coming up.
>> I am using spark 2.2 (branch-2.2)
>> I followed from apache doc and followed the steps:
>> 1. installed R
>> 2. installed packages like testthat as mentioned in doc
>> 3. run run-tests.h
>>
>>
>> Every time I am getting this error line:
>>
>> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
>>   object 'run_tests' not found
>> Calls: ::: -> get
>> Execution halted
>>
>>
>> Any Help?
>>
>> --
>> Chandan Prakash
>>
>>
>
>


-- 
Chandan Prakash


Re: SparkR test script issue: unable to run run-tests.h on spark 2.2

2018-02-14 Thread Felix Cheung
Yes it is issue with the newer release of testthat.

To workaround could you install an earlier version with devtools? will follow 
up for a fix.

_
From: Hyukjin Kwon <gurwls...@gmail.com>
Sent: Wednesday, February 14, 2018 6:49 PM
Subject: Re: SparkR test script issue: unable to run run-tests.h on spark 2.2
To: chandan prakash <chandanbaran...@gmail.com>
Cc: user @spark <user@spark.apache.org>


From a very quick look, I think it is a testthat version issue with SparkR.

I had to fix that version to 1.x before in AppVeyor. There are few details in 
https://github.com/apache/spark/pull/20003

Can you check and lower testthat version?


On 14 Feb 2018 6:09 pm, "chandan prakash" 
<chandanbaran...@gmail.com<mailto:chandanbaran...@gmail.com>> wrote:
Hi All,
I am trying to run test script of R under ./R/run-tests.sh but hitting same 
ERROR everytime.
I tried running on mac as well as centos machine, same issue coming up.
I am using spark 2.2 (branch-2.2)
I followed from apache doc and followed the steps:
1. installed R
2. installed packages like testthat as mentioned in doc
3. run run-tests.h


Every time I am getting this error line:

Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
  object 'run_tests' not found
Calls: ::: -> get
Execution halted


Any Help?

--
Chandan Prakash





Re: SparkR test script issue: unable to run run-tests.h on spark 2.2

2018-02-14 Thread Hyukjin Kwon
From a very quick look, I think it is a testthat version issue with SparkR.

I had to fix that version to 1.x before in AppVeyor. There are few details
in https://github.com/apache/spark/pull/20003

Can you check and lower testthat version?


On 14 Feb 2018 6:09 pm, "chandan prakash" <chandanbaran...@gmail.com> wrote:

> Hi All,
> I am trying to run test script of R under ./R/run-tests.sh but hitting
> same ERROR everytime.
> I tried running on mac as well as centos machine, same issue coming up.
> I am using spark 2.2 (branch-2.2)
> I followed from apache doc and followed the steps:
> 1. installed R
> 2. installed packages like testthat as mentioned in doc
> 3. run run-tests.h
>
>
> Every time I am getting this error line:
>
> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
>   object 'run_tests' not found
> Calls: ::: -> get
> Execution halted
>
>
> Any Help?
>
> --
> Chandan Prakash
>
>


SparkR test script issue: unable to run run-tests.h on spark 2.2

2018-02-14 Thread chandan prakash
Hi All,
I am trying to run test script of R under ./R/run-tests.sh but hitting same
ERROR everytime.
I tried running on mac as well as centos machine, same issue coming up.
I am using spark 2.2 (branch-2.2)
I followed from apache doc and followed the steps:
1. installed R
2. installed packages like testthat as mentioned in doc
3. run run-tests.sh


Every time I am getting this error line:

Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
  object 'run_tests' not found
Calls: ::: -> get
Execution halted


Any Help?

-- 
Chandan Prakash


not able to read git info from Scala Test Suite

2018-02-13 Thread karan alang
Hello - I'm writing a scala unittest for my Spark project
which checks the git information, and somehow it is not working from the
Unit Test

Added in pom.xml
--



pl.project13.maven
git-commit-id-plugin
2.2.4


get-the-git-infos

revision




{g...@github.com}/test.git
flat
true
true
true

true

true






folder structures :

{project_dir}/module/pom.xml
{project_dir}/module/src/main/scala/BuildInfo.scala
  (i'm able to read the git info from this file)
 {project_dir}/module_folder/test/main/scala/BuildInfoSuite.scala
  (i'm NOT able to read the git info from this file)


Any ideas on what i need to do to get this working ?


Re: Collecting matrix's entries raises an error only when run inside a test

2017-07-06 Thread Yanbo Liang
Hi Simone,

Would you mind to share the minimized code to reproduce this issue?

Yanbo

On Wed, Jul 5, 2017 at 10:52 PM, Simone Robutti <simone.robu...@gmail.com>
wrote:

> Hello, I have this problem and  Google is not helping. Instead, it looks
> like an unreported bug and there are no hints to possible workarounds.
>
> the error is the following:
>
> Traceback (most recent call last):
>   File 
> "/home/simone/motionlogic/trip-labeler/test/trip_labeler_test/model_test.py",
> line 43, in test_make_trip_matrix
> entries = trip_matrix.entries.map(lambda entry: (entry.i, entry.j,
> entry.value)).collect()
>   File "/opt/spark-1.6.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py",
> line 770, in collect
> with SCCallSiteSync(self.context) as css:
>   File 
> "/opt/spark-1.6.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/traceback_utils.py",
> line 72, in __enter__
> self._context._jsc.setCallSite(self._call_site)
> AttributeError: 'NoneType' object has no attribute 'setCallSite'
>
> and it is raised when I try to collect a 
> pyspark.mllib.linalg.distributed.CoordinateMatrix
> entries with .collect() and it happens only when this run in a test suite
> with more than one class, so it's probably related to the creation and
> destruction of SparkContexts but I cannot understand how.
>
> Spark version is 1.6.2
>
> I saw multiple references to this error for other classses in the pyspark
> ml library but none of them contained hints toward the solution.
>
> I'm running tests through nosetests when it breaks. Running a single
> TestCase in Intellij works fine.
>
> Is there a known solution? Is it a known problem?
>
> Thank you,
>
> Simone
>


Collecting matrix's entries raises an error only when run inside a test

2017-07-05 Thread Simone Robutti
Hello, I have this problem and  Google is not helping. Instead, it looks
like an unreported bug and there are no hints to possible workarounds.

the error is the following:

Traceback (most recent call last):
  File
"/home/simone/motionlogic/trip-labeler/test/trip_labeler_test/model_test.py",
line 43, in test_make_trip_matrix
entries = trip_matrix.entries.map(lambda entry: (entry.i, entry.j,
entry.value)).collect()
  File
"/opt/spark-1.6.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py",
line 770, in collect
with SCCallSiteSync(self.context) as css:
  File
"/opt/spark-1.6.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/traceback_utils.py",
line 72, in __enter__
self._context._jsc.setCallSite(self._call_site)
AttributeError: 'NoneType' object has no attribute 'setCallSite'

and it is raised when I try to collect a
pyspark.mllib.linalg.distributed.CoordinateMatrix entries with .collect()
and it happens only when this run in a test suite with more than one class,
so it's probably related to the creation and destruction of SparkContexts
but I cannot understand how.

Spark version is 1.6.2

I saw multiple references to this error for other classses in the pyspark
ml library but none of them contained hints toward the solution.

I'm running tests through nosetests when it breaks. Running a single
TestCase in Intellij works fine.

Is there a known solution? Is it a known problem?

Thank you,

Simone


Re: test mail

2017-07-04 Thread Sudhanshu Janghel
test email recieved ;p

On 4 Jul 2017 7:40 am, "Sudha KS" <sudha...@fuzzylogix.com> wrote:

-- 

*Disclaimer: The information in this email is confidential and may be 
legally privileged. Access to this email by anyone other than the intended 
addressee is unauthorized. If you are not the intended recipient of this 
message, any review, disclosure, copying, distribution, retention, or any 
action taken or omitted to be taken in reliance on it is prohibited and may 
be unlawful.*


test mail

2017-07-04 Thread Sudha KS




(Spark-ml) java.util.NosuchElementException: key not found exception on doing prediction and computing test error.

2017-06-28 Thread neha nihal
Thanks. It's working now. My test data had some labels which were not there
in the training set.

On Wednesday, June 28, 2017, Pralabh Kumar <pralabhku...@gmail.com
<javascript:_e(%7B%7D,'cvml','pralabhku...@gmail.com');>> wrote:

> Hi Neha
>
> This generally occurred when , you training data set have some value of
> categorical variable ,which in not there in your testing data. For e.g you
> have column DAYS ,with value M,T,W in training data . But when your test
> data contains F ,then it say no key found exception .  Please look into
> this  , and if that's not the case ,then Could you please share your code
> ,and training/testing data for better understanding.
>
> Regards
> Pralabh Kumar
>
> On Wed, Jun 28, 2017 at 11:45 AM, neha nihal <nehaniha...@gmail.com>
> wrote:
>
>>
>> Hi,
>>
>> I am using Apache spark 2.0.2 randomforest ml (standalone mode) for text
>> classification. TF-IDF feature extractor is also used. The training part
>> runs without any issues and returns 100% accuracy. But when I am trying to
>> do prediction using trained model and compute test error, it fails with
>> java.util.NosuchElementException: key not found exception.
>> Any help will be much appreciated.
>>
>> Thanks
>>
>>
>


Re: (Spark-ml) java.util.NosuchElementException: key not found exception on doing prediction and computing test error.

2017-06-28 Thread Pralabh Kumar
Hi Neha

This generally occurs when your test data has some value of a categorical
variable which is not there in your training data. For example, you have a
column DAYS with values M, T, W in the training data, but your test data
contains F; then it throws a key-not-found exception. Please look into
this, and if that's not the case, could you please share your code
and training/testing data for better understanding.
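
If the labels/categorical columns go through a StringIndexer, one common
mitigation is its handleInvalid parameter, sketched below (column names are
made up):

import org.apache.spark.ml.feature.StringIndexer

// Sketch: "skip" drops rows whose category was never seen while fitting,
// instead of failing with a key-not-found error; the default "error" fails fast.
val indexer = new StringIndexer()
  .setInputCol("day")        // hypothetical categorical column
  .setOutputCol("dayIndex")
  .setHandleInvalid("skip")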

Regards
Pralabh Kumar

On Wed, Jun 28, 2017 at 11:45 AM, neha nihal <nehaniha...@gmail.com> wrote:

>
> Hi,
>
> I am using Apache spark 2.0.2 randomforest ml (standalone mode) for text
> classification. TF-IDF feature extractor is also used. The training part
> runs without any issues and returns 100% accuracy. But when I am trying to
> do prediction using trained model and compute test error, it fails with
> java.util.NosuchElementException: key not found exception.
> Any help will be much appreciated.
>
> Thanks
>
>


Fwd: (Spark-ml) java.util.NosuchElementException: key not found exception on doing prediction and computing test error.

2017-06-28 Thread neha nihal
Hi,

I am using Apache spark 2.0.2 randomforest ml (standalone mode) for text
classification. TF-IDF feature extractor is also used. The training part
runs without any issues and returns 100% accuracy. But when I am trying to
do prediction using trained model and compute test error, it fails with
java.util.NosuchElementException: key not found exception.
Any help will be much appreciated.

Thanks


(Spark-ml) java.util.NosuchElementException: key not found exception on doing prediction and computing test error.

2017-06-27 Thread neha nihal
Hi,

I am using Apache spark 2.0.2 randomforest ml (standalone mode) for text
classification. TF-IDF feature extractor is also used. The training part
runs without any issues and returns 100% accuracy. But when I am trying to
do prediction using trained model and compute test error, it fails with
java.util.NosuchElementException: key not found exception.
Any help will be much appreciated.

Thanks & Regards


Test

2017-05-15 Thread nayan sharma
Test

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



NPE in UDF yet no nulls in data because analyzer runs test with nulls

2017-04-14 Thread Koert Kuipers
we were running in to an NPE in one of our UDFs for spark sql.

now this particular function indeed could not handle nulls, but this was by
design since null input was never allowed (and we would want it to blow up
if there was a null as input).

we realized the issue was not in our data when we added filters for nulls
and the NPE still happened. then we also saw the NPE when just doing
dataframe.explain instead of running our job.

turns out the issue is in EliminateOuterJoin.canFilterOutNull where a row
with all nulls ifs fed into the expression as a test. its the line:
val v = boundE.eval(emptyRow)

so should we conclude from this that all udfs should always be prepared to
handle nulls?
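
For what it's worth, in Scala one low-friction way for a UDF to be "prepared
for nulls" is to go through Option, so the analyzer's all-null probe row
yields null instead of an NPE. A minimal sketch (the length logic is made up):

import org.apache.spark.sql.functions.udf

// Sketch: Option(s) is None for null input, so the UDF returns null rather than
// dereferencing a null String, even for the analyzer's synthetic all-null row.
val safeLength = udf { (s: String) => Option(s).map(_.length) }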


Re: scala test is unable to initialize spark context.

2017-04-06 Thread Jeff Zhang
Seems it is caused by your log4j file

*Caused by: java.lang.IllegalStateException: FileNamePattern [-.log]
does not contain a valid date format specifier*




<psw...@in.imshealth.com>于2017年4月6日周四 下午4:03写道:

> Hi All ,
>
>
>
>I am just trying to use scala test for testing a small spark code . But
> spark context is not getting initialized , while I am running test file .
>
> I have given code, pom and exception I am getting in mail , please help me
> to understand what mistake I am doing , so that
>
> Spark context is not getting initialized
>
>
>
> Code:
>
> import org.apache.log4j.LogManager
> import org.apache.spark.SharedSparkContext
> import org.scalatest.FunSuite
> import org.apache.spark.{SparkContext, SparkConf}
>
> /**
>  * Created by PSwain on 4/5/2017.
>  */
> class Test extends FunSuite with SharedSparkContext {
>
>   test("test initializing spark context") {
>     val list = List(1, 2, 3, 4)
>     val rdd = sc.parallelize(list)
>     assert(list.length === rdd.count())
>   }
> }
>
>
>
> *POM File:-*
>
>
>
> * *?>*<*project **xmlns=*
> *"http://maven.apache.org/POM/4.0.0 <http://maven.apache.org/POM/4.0.0>"  
>**xmlns:**xsi**=*
> *"http://www.w3.org/2001/XMLSchema-instance 
> <http://www.w3.org/2001/XMLSchema-instance>" 
> **xsi**:schemaLocation=**"http://maven.apache.org/POM/4.0.0 
> <http://maven.apache.org/POM/4.0.0> 
> http://maven.apache.org/xsd/maven-4.0.0.xsd 
> <http://maven.apache.org/xsd/maven-4.0.0.xsd>"*>
> <*modelVersion*>4.0.0
>
> <*groupId*>tesing.loging
> <*artifactId*>logging
> <*version*>1.0-SNAPSHOT
>
>
> <*repositories*>
> <*repository*>
> <*id*>central
> <*name*>central
> <*url*>http://repo1.maven.org/maven/
> 
> 
>
> <*dependencies*>
> <*dependency*>
> <*groupId*>org.apache.spark
> <*artifactId*>spark-core_2.10
> <*version*>1.6.0
> <*type*>test-jar
>
>
> 
> <*dependency*>
> <*groupId*>org.apache.spark
> <*artifactId*>spark-sql_2.10
> <*version*>1.6.0
> 
>
> <*dependency*>
> <*groupId*>org.scalatest
> <*artifactId*>scalatest_2.10
> <*version*>2.2.6
> 
>
> <*dependency*>
> <*groupId*>org.apache.spark
> <*artifactId*>spark-hive_2.10
> <*version*>1.5.0
> <*scope*>provided
> 
> <*dependency*>
> <*groupId*>com.databricks
> <*artifactId*>spark-csv_2.10
> <*version*>1.3.0
> 
> <*dependency*>
> <*groupId*>com.rxcorp.bdf.logging
> <*artifactId*>loggingframework
> <*version*>1.0-SNAPSHOT
> 
> <*dependency*>
> <*groupId*>mysql
> <*artifactId*>mysql-connector-java
> <*version*>5.1.6
> <*scope*>provided
> 
>
> **<*dependency*>
> <*groupId*>org.scala-lang
> <*artifactId*>scala-library
> <*version*>2.10.5
> <*scope*>compile
> <*optional*>true
> 
>
> <*dependency*>
> <*groupId*>org.scalatest
> <*artifactId*>scalatest
> <*version*>1.4.RC2
> 
>
> <*dependency*>
> <*groupId*>log4j
> <*artifactId*>log4j
> <*version*>1.2.17
> 
>
> <*dependency*>
> <*groupId*>org.scala-lang
> <*artifactId*>scala-compiler
> <*version*>2.10.5
> <*scope*>compile
> <*optional*>true
> 
>
> **
> <*build*>
> <*sourceDirectory*>src/main/scala
> <*plugins*>
> <*plugin*>
> <*artifactId*>maven-assembly-plugin
> <*version*>2.2.1
> <*configuration*>
> <*descriptorRefs*>
> 
> <*desc

scala test is unable to initialize spark context.

2017-04-06 Thread PSwain
Hi All ,

   I am just trying to use ScalaTest for testing a small piece of Spark code, but
the Spark context is not getting initialized while I am running the test file.
I have given the code, POM and the exception I am getting in this mail; please
help me understand what mistake I am doing that prevents the Spark context from
getting initialized.

Code:-

import org.apache.log4j.LogManager
import org.apache.spark.SharedSparkContext
import org.scalatest.FunSuite
import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by PSwain on 4/5/2017.
  */
class Test extends FunSuite with SharedSparkContext  {


  test("test initializing spark context") {
val list = List(1, 2, 3, 4)
val rdd = sc.parallelize(list)
assert(list.length === rdd.count())
  }
}

POM File:-



http://maven.apache.org/POM/4.0.0;
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
http://maven.apache.org/xsd/maven-4.0.0.xsd;>
4.0.0

tesing.loging
logging
1.0-SNAPSHOT




central
central
http://repo1.maven.org/maven/





org.apache.spark
spark-core_2.10
    1.6.0
test-jar




org.apache.spark
spark-sql_2.10
1.6.0



org.scalatest
scalatest_2.10
2.2.6



org.apache.spark
spark-hive_2.10
1.5.0
provided


com.databricks
spark-csv_2.10
1.3.0


com.rxcorp.bdf.logging
loggingframework
1.0-SNAPSHOT


mysql
mysql-connector-java
5.1.6
provided



org.scala-lang
scala-library
2.10.5
compile
true



org.scalatest
scalatest
1.4.RC2



log4j
log4j
1.2.17



org.scala-lang
scala-compiler
2.10.5
compile
true




src/main/scala


maven-assembly-plugin
2.2.1


jar-with-dependencies




make-assembly
package

single





net.alchim31.maven
scala-maven-plugin
3.2.0



compile
testCompile




src/main/scala


-Xms64m
-Xmx1024m












Exception:-



An exception or error caused a run to abort.

java.lang.ExceptionInInitializerError

 at org.apache.spark.Logging$class.initializeLogging(Logging.scala:121)

 at 
org.apache.spark.Logging$class.initializeIfNecessary(Logging.scala:106)

 at org.apache.spark.Logging$class.log(Logging.scala:50)

 at org.apache.spark.SparkContext.log(SparkContext.scala:79)

 at org.apache.spark.Logging$class.logInfo(Logging.scala:58)

 at org.apache.spark.SparkContext.logInfo(SparkContext.scala:79)

 at org.apache.spark.SparkContext.<init>(SparkContext.scala:211)

 at org.apache.spark.SparkContext.<init>(SparkContext.scala:147)

 at 
org.apache.spark.SharedSparkContext$class.beforeAll(SharedSparkContext.scala:33)

 at Test.beforeAll(Test.scala:10)

 at 
org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)

 at Test.beforeAll(Test.scala:10)

 at 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)

 at Test.run(Test.scala:10)

 at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)

 at 
org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563)

 at 
org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2557)

 at scala.collection.immutable.List.foreach(List.scala:318)

 at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557)

 at 
org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1044)

 at 
org.scalatest.tools.Runner$$anonfun$runOptionall

Re: how do i force unit test to do whole stage codegen

2017-04-05 Thread Jacek Laskowski
Thanks Koert for the kind words. That part, however, is easy to fix, and I
was surprised to have seen the old style referenced (!)

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Wed, Apr 5, 2017 at 6:14 PM, Koert Kuipers <ko...@tresata.com> wrote:
> its pretty much impossible to be fully up to date with spark given how fast
> it moves!
>
> the book is a very helpful reference
>
> On Wed, Apr 5, 2017 at 11:15 AM, Jacek Laskowski <ja...@japila.pl> wrote:
>>
>> Hi,
>>
>> I'm very sorry for not being up to date with the current style (and
>> "promoting" the old style) and am going to review that part soon. I'm very
>> close to touch it again since I'm with Optimizer these days.
>>
>> Jacek
>>
>> On 5 Apr 2017 6:08 a.m., "Kazuaki Ishizaki" <ishiz...@jp.ibm.com> wrote:
>>>
>>> Hi,
>>> The page in the URL explains the old style of physical plan output.
>>> The current style adds "*" as a prefix of each operation that the
>>> whole-stage codegen can be apply to.
>>>
>>> So, in your test case, whole-stage codegen has been already enabled!!
>>>
>>> FYI. I think that it is a good topic for d...@spark.apache.org.
>>>
>>> Kazuaki Ishizaki
>>>
>>>
>>>
>>> From:Koert Kuipers <ko...@tresata.com>
>>> To:"user@spark.apache.org" <user@spark.apache.org>
>>> Date:2017/04/05 05:12
>>> Subject:how do i force unit test to do whole stage codegen
>>> 
>>>
>>>
>>>
>>> i wrote my own expression with eval and doGenCode, but doGenCode never
>>> gets called in tests.
>>>
>>> also as a test i ran this in a unit test:
>>> spark.range(10).select('id as 'asId).where('id === 4).explain
>>> according to
>>>
>>> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-whole-stage-codegen.html
>>> this is supposed to show:
>>> == Physical Plan ==
>>> WholeStageCodegen
>>> :  +- Project [id#0L AS asId#3L]
>>> : +- Filter (id#0L = 4)
>>> :+- Range 0, 1, 8, 10, [id#0L]
>>>
>>> but it doesn't. instead it shows:
>>>
>>> == Physical Plan ==
>>> *Project [id#12L AS asId#15L]
>>> +- *Filter (id#12L = 4)
>>>   +- *Range (0, 10, step=1, splits=Some(4))
>>>
>>> so i am again missing the WholeStageCodegen. any idea why?
>>>
>>> i create spark session for unit tests simply as:
>>> val session = SparkSession.builder
>>>  .master("local[*]")
>>>  .appName("test")
>>>  .config("spark.sql.shuffle.partitions", 4)
>>>  .getOrCreate()
>>>
>>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: how do i force unit test to do whole stage codegen

2017-04-05 Thread Koert Kuipers
its pretty much impossible to be fully up to date with spark given how fast
it moves!

the book is a very helpful reference

On Wed, Apr 5, 2017 at 11:15 AM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> I'm very sorry for not being up to date with the current style (and
> "promoting" the old style) and am going to review that part soon. I'm very
> close to touch it again since I'm with Optimizer these days.
>
> Jacek
>
> On 5 Apr 2017 6:08 a.m., "Kazuaki Ishizaki" <ishiz...@jp.ibm.com> wrote:
>
>> Hi,
>> The page in the URL explains the old style of physical plan output.
>> The current style adds "*" as a prefix of each operation that the
>> whole-stage codegen can be apply to.
>>
>> So, in your test case, whole-stage codegen has been already enabled!!
>>
>> FYI. I think that it is a good topic for d...@spark.apache.org.
>>
>> Kazuaki Ishizaki
>>
>>
>>
>> From:Koert Kuipers <ko...@tresata.com>
>> To:"user@spark.apache.org" <user@spark.apache.org>
>> Date:2017/04/05 05:12
>> Subject:how do i force unit test to do whole stage codegen
>> --
>>
>>
>>
>> i wrote my own expression with eval and doGenCode, but doGenCode never
>> gets called in tests.
>>
>> also as a test i ran this in a unit test:
>> spark.range(10).select('id as 'asId).where('id === 4).explain
>> according to
>>
>> *https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-whole-stage-codegen.html*
>> <https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-whole-stage-codegen.html>
>> this is supposed to show:
>> == Physical Plan ==
>> WholeStageCodegen
>> :  +- Project [id#0L AS asId#3L]
>> : +- Filter (id#0L = 4)
>> :+- Range 0, 1, 8, 10, [id#0L]
>>
>> but it doesn't. instead it shows:
>>
>> == Physical Plan ==
>> *Project [id#12L AS asId#15L]
>> +- *Filter (id#12L = 4)
>>   +- *Range (0, 10, step=1, splits=Some(4))
>>
>> so i am again missing the WholeStageCodegen. any idea why?
>>
>> i create spark session for unit tests simply as:
>> val session = SparkSession.builder
>>  .master("local[*]")
>>  .appName("test")
>>  .config("spark.sql.shuffle.partitions", 4)
>>  .getOrCreate()
>>
>>
>>


Re: how do i force unit test to do whole stage codegen

2017-04-05 Thread Jacek Laskowski
Hi,

I'm very sorry for not being up to date with the current style (and
"promoting" the old style); I am going to review that part soon. I'm very
close to touching it again since I'm working with the Optimizer these days.

Jacek

On 5 Apr 2017 6:08 a.m., "Kazuaki Ishizaki" <ishiz...@jp.ibm.com> wrote:

> Hi,
> The page in the URL explains the old style of physical plan output.
> The current style adds "*" as a prefix of each operation that the
> whole-stage codegen can be apply to.
>
> So, in your test case, whole-stage codegen has been already enabled!!
>
> FYI. I think that it is a good topic for d...@spark.apache.org.
>
> Kazuaki Ishizaki
>
>
>
> From:Koert Kuipers <ko...@tresata.com>
> To:"user@spark.apache.org" <user@spark.apache.org>
> Date:2017/04/05 05:12
> Subject:how do i force unit test to do whole stage codegen
> --
>
>
>
> i wrote my own expression with eval and doGenCode, but doGenCode never
> gets called in tests.
>
> also as a test i ran this in a unit test:
> spark.range(10).select('id as 'asId).where('id === 4).explain
> according to
>
> *https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-whole-stage-codegen.html*
> <https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-whole-stage-codegen.html>
> this is supposed to show:
> == Physical Plan ==
> WholeStageCodegen
> :  +- Project [id#0L AS asId#3L]
> : +- Filter (id#0L = 4)
> :+- Range 0, 1, 8, 10, [id#0L]
>
> but it doesn't. instead it shows:
>
> == Physical Plan ==
> *Project [id#12L AS asId#15L]
> +- *Filter (id#12L = 4)
>   +- *Range (0, 10, step=1, splits=Some(4))
>
> so i am again missing the WholeStageCodegen. any idea why?
>
> i create spark session for unit tests simply as:
> val session = SparkSession.builder
>  .master("local[*]")
>  .appName("test")
>  .config("spark.sql.shuffle.partitions", 4)
>  .getOrCreate()
>
>
>


Re: how do i force unit test to do whole stage codegen

2017-04-04 Thread Koert Kuipers
got it. thats good to know. thanks!

On Wed, Apr 5, 2017 at 12:07 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com>
wrote:

> Hi,
> The page in the URL explains the old style of physical plan output.
> The current style adds "*" as a prefix of each operation that the
> whole-stage codegen can be apply to.
>
> So, in your test case, whole-stage codegen has been already enabled!!
>
> FYI. I think that it is a good topic for d...@spark.apache.org.
>
> Kazuaki Ishizaki
>
>
>
> From:Koert Kuipers <ko...@tresata.com>
> To:"user@spark.apache.org" <user@spark.apache.org>
> Date:2017/04/05 05:12
> Subject:how do i force unit test to do whole stage codegen
> --
>
>
>
> i wrote my own expression with eval and doGenCode, but doGenCode never
> gets called in tests.
>
> also as a test i ran this in a unit test:
> spark.range(10).select('id as 'asId).where('id === 4).explain
> according to
>
> *https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-whole-stage-codegen.html*
> <https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-whole-stage-codegen.html>
> this is supposed to show:
> == Physical Plan ==
> WholeStageCodegen
> :  +- Project [id#0L AS asId#3L]
> : +- Filter (id#0L = 4)
> :+- Range 0, 1, 8, 10, [id#0L]
>
> but it doesn't. instead it shows:
>
> == Physical Plan ==
> *Project [id#12L AS asId#15L]
> +- *Filter (id#12L = 4)
>   +- *Range (0, 10, step=1, splits=Some(4))
>
> so i am again missing the WholeStageCodegen. any idea why?
>
> i create spark session for unit tests simply as:
> val session = SparkSession.builder
>  .master("local[*]")
>  .appName("test")
>  .config("spark.sql.shuffle.partitions", 4)
>  .getOrCreate()
>
>
>


Re: how do i force unit test to do whole stage codegen

2017-04-04 Thread Kazuaki Ishizaki
Hi,
The page at that URL explains the old style of physical plan output.
The current style adds "*" as a prefix to each operation that whole-stage
codegen can be applied to.

So, in your test case, whole-stage codegen has already been enabled!

FYI. I think that it is a good topic for d...@spark.apache.org.
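
A direct way to confirm this from a unit test, and to see whether a custom
expression's doGenCode is invoked, is the codegen debug helper; a minimal
sketch, assuming Spark 2.x and an existing session named spark:

import org.apache.spark.sql.execution.debug._

// Sketch: debugCodegen() prints each whole-stage-codegen subtree together with the
// Java source generated for it, which is more explicit than the "*" prefixes in explain().
spark.range(10).selectExpr("id AS asId").where("id = 4").debugCodegen()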

Kazuaki Ishizaki



From:   Koert Kuipers <ko...@tresata.com>
To: "user@spark.apache.org" <user@spark.apache.org>
Date:   2017/04/05 05:12
Subject:how do i force unit test to do whole stage codegen



i wrote my own expression with eval and doGenCode, but doGenCode never 
gets called in tests.

also as a test i ran this in a unit test:
spark.range(10).select('id as 'asId).where('id === 4).explain
according to
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-whole-stage-codegen.html
this is supposed to show:
== Physical Plan ==
WholeStageCodegen
:  +- Project [id#0L AS asId#3L]
: +- Filter (id#0L = 4)
:+- Range 0, 1, 8, 10, [id#0L]

but it doesn't. instead it shows:

== Physical Plan ==
*Project [id#12L AS asId#15L]
+- *Filter (id#12L = 4)
   +- *Range (0, 10, step=1, splits=Some(4))

so i am again missing the WholeStageCodegen. any idea why?

i create spark session for unit tests simply as:
val session = SparkSession.builder
  .master("local[*]")
  .appName("test")
  .config("spark.sql.shuffle.partitions", 4)
  .getOrCreate()





how do i force unit test to do whole stage codegen

2017-04-04 Thread Koert Kuipers
i wrote my own expression with eval and doGenCode, but doGenCode never gets
called in tests.

also as a test i ran this in a unit test:
spark.range(10).select('id as 'asId).where('id === 4).explain
according to
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-whole-stage-codegen.html
this is supposed to show:

== Physical Plan ==
WholeStageCodegen
:  +- Project [id#0L AS asId#3L]
:     +- Filter (id#0L = 4)
:        +- Range 0, 1, 8, 10, [id#0L]

but it doesn't. instead it shows:

== Physical Plan ==
*Project [id#12L AS asId#15L]
+- *Filter (id#12L = 4)
   +- *Range (0, 10, step=1, splits=Some(4))

so i am again missing the WholeStageCodegen. any idea why?

i create spark session for unit tests simply as:
val session = SparkSession.builder
  .master("local[*]")
  .appName("test")
  .config("spark.sql.shuffle.partitions", 4)
  .getOrCreate()


This is a test mail, please ignore!

2017-03-27 Thread Noorul Islam K M
Sending a plain text mail to test whether my mail appears in the list.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/This-is-a-test-mail-please-ignore-tp28538.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to unit test spark streaming?

2017-03-07 Thread kant kodali
Agreed with the statement in quotes below; whether one wants to do unit
tests or not, it is a good practice to write code that way. But I think the
more painful and tedious task is to mock/emulate all the nodes such as
spark workers/master/hdfs/input source stream and all that. I wish there were
something really simple. Perhaps the simplest thing to do is just to do
integration tests which also test the transformations/business logic. This
way I can spawn a small cluster, run my tests, and bring my cluster down
when I am done. And sure, if the cluster isn't available then I can't run
the tests, but some node should be available even to run a single
process. I somehow feel like we may be doing too much work to fit into the
archaic definition of unit tests.

 "Basically you abstract your transformations to take in a dataframe and
return one, then you assert on the returned df " this

On Tue, Mar 7, 2017 at 11:14 AM, Michael Armbrust 
wrote:

> Basically you abstract your transformations to take in a dataframe and
>> return one, then you assert on the returned df
>>
>
> +1 to this suggestion.  This is why we wanted streaming and batch
> dataframes to share the same API.
>


Re: How to unit test spark streaming?

2017-03-07 Thread Michael Armbrust
>
> Basically you abstract your transformations to take in a dataframe and
> return one, then you assert on the returned df
>

+1 to this suggestion.  This is why we wanted streaming and batch
dataframes to share the same API.
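
To make that suggestion concrete, a minimal sketch of the pattern (names and columns are illustrative, not from this thread):

```
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Keep the business logic as a plain DataFrame => DataFrame function so the
// same code runs inside the streaming query and inside a batch unit test.
object Transformations {
  def withTotal(df: DataFrame): DataFrame =
    df.withColumn("total", col("price") * col("quantity"))
}

// In a test (with spark.implicits._ in scope): build a small batch DataFrame,
// apply the function, and assert on the returned DataFrame.
// val result = Transformations.withTotal(Seq((2.0, 3L)).toDF("price", "quantity"))
// assert(result.select("total").head.getDouble(0) == 6.0)
```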


Re: How to unit test spark streaming?

2017-03-07 Thread Jörn Franke
This depends on your target setup! For my open source libraries, for example,
I run Spark integration tests (kept in a dedicated folder alongside the unit
tests) against a local Spark master, but I also use a minidfs cluster (to simulate HDFS
on a node) and sometimes a miniyarn cluster (see
https://wiki.apache.org/hadoop/HowToDevelopUnitTests).

 An example can be found here:  
https://github.com/ZuInnoTe/hadoopcryptoledger/tree/master/examples/spark-bitcoinblock
 

or - if you need Scala - 
https://github.com/ZuInnoTe/hadoopcryptoledger/tree/master/examples/scala-spark-bitcoinblock
 

In both cases it is in the integration-tests (Java) or it (Scala) folder.

Spark Streaming - I have no open source example at hand, but basically you need 
to simulate the source and the rest is as above.

 I will eventually write a blog post about this with more details.
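
For the "simulate the source" part, one commonly used building block is StreamingContext.queueStream, which turns a queue of pre-built RDDs into micro-batches; a rough sketch (the batch interval, data, and the println stand-in for assertions are placeholders):

```
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-test")
val ssc  = new StreamingContext(conf, Seconds(1))

// Each queued RDD becomes one micro-batch, so the test fully controls the input.
val input  = mutable.Queue(ssc.sparkContext.parallelize(Seq("a", "b", "a")))
val counts = ssc.queueStream(input).map((_, 1)).reduceByKey(_ + _)
counts.foreachRDD(rdd => rdd.collect().foreach(println)) // replace with real assertions

ssc.start()
ssc.awaitTerminationOrTimeout(3000)
ssc.stop(stopSparkContext = true)
```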

> On 7 Mar 2017, at 13:04, kant kodali <kanth...@gmail.com> wrote:
> 
> Hi All,
> 
> How to unit test spark streaming or spark in general? How do I test the 
> results of my transformations? Also, more importantly don't we need to spawn 
> master and worker JVM's either in one or multiple nodes?
> 
> Thanks!
> kant

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to unit test spark streaming?

2017-03-07 Thread Sam Elamin
Hey kant

You can use holdens spark test base

Have a look at some of the specs I wrote here to give you an idea

https://github.com/samelamin/spark-bigquery/blob/master/src/test/scala/com/samelamin/spark/bigquery/BigQuerySchemaSpecs.scala

Basically you abstract your transformations to take in a dataframe and
return one, then you assert on the returned df

Regards
Sam
On Tue, 7 Mar 2017 at 12:05, kant kodali <kanth...@gmail.com> wrote:

> Hi All,
>
> How to unit test spark streaming or spark in general? How do I test the
> results of my transformations? Also, more importantly don't we need to
> spawn master and worker JVM's either in one or multiple nodes?
>
> Thanks!
> kant
>


How to unit test spark streaming?

2017-03-07 Thread kant kodali
Hi All,

How to unit test spark streaming or spark in general? How do I test the
results of my transformations? Also, more importantly don't we need to
spawn master and worker JVM's either in one or multiple nodes?

Thanks!
kant


Spark test error in ProactiveClosureSerializationSuite.scala

2017-02-25 Thread ??????????
Hello all, I am building Spark 1.6.2 and I ran into a problem when doing mvn test.


The command is mvn -e -Pyarn  -Phive -Phive-thriftserver  
-DwildcardSuites=org.apache.spark.serializer.ProactiveClosureSerializationSuite 
test
and the test error is
ProactiveClosureSerializationSuite:
- throws expected serialization exceptions on actions
- mapPartitions transformations throw proactive serialization exceptions *** 
FAILED ***
  Expected exception org.apache.spark.SparkException to be thrown, but no 
exception was thrown. (ProactiveClosureSerializationSuite.scala:58)
- map transformations throw proactive serialization exceptions
- filter transformations throw proactive serialization exceptions
- flatMap transformations throw proactive serialization exceptions
- mapPartitionsWithIndex transformations throw proactive serialization 
exceptions *** FAILED ***
  Expected exception org.apache.spark.SparkException to be thrown, but no 
exception was thrown. (ProactiveClosureSerializationSuite.scala:58)



I think this test is about tasks not being serializable, but why do I only get test 
errors on mapPartitions and mapPartitionsWithIndex?


Thanks.

Spark test error

2017-01-03 Thread Yanwei Wayne Zhang
I tried to run the tests in 'GeneralizedLinearRegressionSuite', and all tests 
passed except for test("read/write") which yielded the following error message. 
Any suggestion on why this happened and how to fix it? Thanks. BTW, I ran the 
test in IntelliJ.


The default jsonEncode only supports string and vector. 
org.apache.spark.ml.param.Param must override jsonEncode for java.lang.Double.
scala.NotImplementedError: The default jsonEncode only supports string and 
vector. org.apache.spark.ml.param.Param must override jsonEncode for 
java.lang.Double.
at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
at 
org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:293)
at 
org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:292)


Regards,
Wayne


How to clean the cache when i do performance test in spark

2016-12-07 Thread Zhang, Liyun
Hi all:
   When I test my Spark application, I found that the second round
(application_1481153226569_0002) is much faster than the first round
(application_1481153226569_0001), even though the configuration is the same. I
guess the second round is sped up a lot by caching. So how can I clean the cache?
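
For reference, a sketch of what can be cleaned from inside Spark, plus a note on what cannot (the OS-level command is an assumption about a Linux cluster, not a Spark API):

```
// Within one application: drop everything Spark itself has cached.
sqlContext.clearCache()                                             // uncache all cached tables/DataFrames
sc.getPersistentRDDs.values.foreach(_.unpersist(blocking = true))   // uncache persisted RDDs

// Between two separate applications, Spark's own cache is gone anyway; the speedup
// usually comes from the OS/HDFS page cache on the worker nodes, which Spark cannot
// clear. On Linux that would be something like running, as root, on each node:
//   sync; echo 3 > /proc/sys/vm/drop_caches
```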





Best Regards
Kelly Zhang/Zhang,Liyun



Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-14 Thread Cody Koeninger
I can't be sure, no.

On Fri, Oct 14, 2016 at 3:06 AM, Julian Keppel
<juliankeppel1...@gmail.com> wrote:
> Okay, thank you! Can you say when this feature will be released?
>
> 2016-10-13 16:29 GMT+02:00 Cody Koeninger <c...@koeninger.org>:
>>
>> As Sean said, it's unreleased.  If you want to try it out, build spark
>>
>> http://spark.apache.org/docs/latest/building-spark.html
>>
>> The easiest way to include the jar is probably to use mvn install to
>> put it in your local repository, then link it in your application's
>> mvn or sbt build file as described in the docs you linked.
>>
>>
>> On Thu, Oct 13, 2016 at 3:24 AM, JayKay <juliankeppel1...@gmail.com>
>> wrote:
>> > I want to work with the Kafka integration for structured streaming. I
>> > use
>> > Spark version 2.0.0. and I start the spark-shell with:
>> >
>> > spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.0
>> >
>> > As described here:
>> >
>> > https://github.com/apache/spark/blob/master/docs/structured-streaming-kafka-integration.md
>> >
>> > But I get an unresolved dependency error ("unresolved dependency:
>> > org.apache.spark#spark-sql-kafka-0-10_2.11;2.0.0: not found"). So it
>> > seems
>> > not to be available via maven or spark-packages.
>> >
>> > How can I access this package? Or am I doing something wrong/missing?
>> >
>> > Thank you for your help.
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/Want-to-test-spark-sql-kafka-but-get-unresolved-dependency-error-tp27891.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-14 Thread Julian Keppel
Okay, thank you! Can you say when this feature will be released?

2016-10-13 16:29 GMT+02:00 Cody Koeninger <c...@koeninger.org>:

> As Sean said, it's unreleased.  If you want to try it out, build spark
>
> http://spark.apache.org/docs/latest/building-spark.html
>
> The easiest way to include the jar is probably to use mvn install to
> put it in your local repository, then link it in your application's
> mvn or sbt build file as described in the docs you linked.
>
>
> On Thu, Oct 13, 2016 at 3:24 AM, JayKay <juliankeppel1...@gmail.com>
> wrote:
> > I want to work with the Kafka integration for structured streaming. I use
> > Spark version 2.0.0. and I start the spark-shell with:
> >
> > spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.0
> >
> > As described here:
> > https://github.com/apache/spark/blob/master/docs/
> structured-streaming-kafka-integration.md
> >
> > But I get an unresolved dependency error ("unresolved dependency:
> > org.apache.spark#spark-sql-kafka-0-10_2.11;2.0.0: not found"). So it
> seems
> > not to be available via maven or spark-packages.
> >
> > How can I access this package? Or am I doing something wrong/missing?
> >
> > Thank you for your help.
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Want-to-test-spark-sql-kafka-but-get-
> unresolved-dependency-error-tp27891.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>


Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-13 Thread Cody Koeninger
As Sean said, it's unreleased.  If you want to try it out, build spark

http://spark.apache.org/docs/latest/building-spark.html

The easiest way to include the jar is probably to use mvn install to
put it in your local repository, then link it in your application's
mvn or sbt build file as described in the docs you linked.
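
As an illustration of the "mvn install, then link it" step for an sbt project, the build file might gain something like this (the SNAPSHOT version string is an assumption; use whatever your local Spark build installed):

```
// build.sbt
resolvers += Resolver.mavenLocal   // pick up the artifact that `mvn install` put in ~/.m2

libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.1.0-SNAPSHOT"
```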


On Thu, Oct 13, 2016 at 3:24 AM, JayKay <juliankeppel1...@gmail.com> wrote:
> I want to work with the Kafka integration for structured streaming. I use
> Spark version 2.0.0. and I start the spark-shell with:
>
> spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.0
>
> As described here:
> https://github.com/apache/spark/blob/master/docs/structured-streaming-kafka-integration.md
>
> But I get an unresolved dependency error ("unresolved dependency:
> org.apache.spark#spark-sql-kafka-0-10_2.11;2.0.0: not found"). So it seems
> not to be available via maven or spark-packages.
>
> How can I access this package? Or am I doing something wrong/missing?
>
> Thank you for your help.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Want-to-test-spark-sql-kafka-but-get-unresolved-dependency-error-tp27891.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-13 Thread Mich Talebzadeh
add --jars /spark-streaming-kafka_2.10-1.5.1.jar

(may need to download the jar file or any newer version)


to spark-shell.

I also have spark-streaming-kafka-assembly_2.10-1.6.1.jar as well on --jar
list

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 13 October 2016 at 09:24, JayKay <juliankeppel1...@gmail.com> wrote:

> I want to work with the Kafka integration for structured streaming. I use
> Spark version 2.0.0. and I start the spark-shell with:
>
> spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.0
>
> As described here:
> https://github.com/apache/spark/blob/master/docs/
> structured-streaming-kafka-integration.md
>
> But I get an unresolved dependency error ("unresolved dependency:
> org.apache.spark#spark-sql-kafka-0-10_2.11;2.0.0: not found"). So it seems
> not to be available via maven or spark-packages.
>
> How can I access this package? Or am I doing something wrong/missing?
>
> Thank you for your help.
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Want-to-test-spark-sql-kafka-but-get-
> unresolved-dependency-error-tp27891.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Want to test spark-sql-kafka but get unresolved dependency error

2016-10-13 Thread Sean Owen
I don't believe that's been released yet. It looks like it was merged into
branches about a week ago. You're looking at unreleased docs too - have a
look at http://spark.apache.org/docs/latest/ for the latest released docs.

On Thu, Oct 13, 2016 at 9:24 AM JayKay <juliankeppel1...@gmail.com> wrote:

> I want to work with the Kafka integration for structured streaming. I use
> Spark version 2.0.0. and I start the spark-shell with:
>
> spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.0
>
> As described here:
>
> https://github.com/apache/spark/blob/master/docs/structured-streaming-kafka-integration.md
>
> But I get an unresolved dependency error ("unresolved dependency:
> org.apache.spark#spark-sql-kafka-0-10_2.11;2.0.0: not found"). So it seems
> not to be available via maven or spark-packages.
>
> How can I access this package? Or am I doing something wrong/missing?
>
> Thank you for your help.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Want-to-test-spark-sql-kafka-but-get-unresolved-dependency-error-tp27891.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Want to test spark-sql-kafka but get unresolved dependency error

2016-10-13 Thread JayKay
I want to work with the Kafka integration for structured streaming. I use
Spark version 2.0.0. and I start the spark-shell with: 

spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.0

As described here:
https://github.com/apache/spark/blob/master/docs/structured-streaming-kafka-integration.md

But I get an unresolved dependency error ("unresolved dependency:
org.apache.spark#spark-sql-kafka-0-10_2.11;2.0.0: not found"). So it seems
not to be available via maven or spark-packages.

How can I access this package? Or am I doing something wrong/missing?

Thank you for your help.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Want-to-test-spark-sql-kafka-but-get-unresolved-dependency-error-tp27891.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Error in run multiple unit test that extends DataFrameSuiteBase

2016-09-23 Thread Jinyuan Zhou
After I created two test cases that extend FlatSpec with DataFrameSuiteBase, I
got errors when running sbt test, although I was able to run each of them
separately. My test cases do use sqlContext to read files. Here is the
exception stack. Judging from the exception, I may need to unregister the
RpcEndpoint after each test run.
[info] Exception encountered when attempting to run a suite with class name:
 MyTestSuit *** ABORTED ***
[info]   java.lang.IllegalArgumentException: There is already an
RpcEndpoint called LocalSchedulerBackendEndpoint
[info]   at
org.apache.spark.rpc.netty.Dispatcher.registerRpcEndpoint(Dispatcher.scala:66)
[info]   at
org.apache.spark.rpc.netty.NettyRpcEnv.setupEndpoint(NettyRpcEnv.scala:129)
[info]   at
org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:127)
[info]   at
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
[info]   at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
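
One thing worth ruling out here (an assumption about the setup, not a confirmed diagnosis): when sbt runs test suites in parallel, each suite can try to start a local SparkContext in the same JVM at once, which produces exactly this kind of "already an RpcEndpoint" failure. Forcing the suites to run one at a time avoids that:

```
// build.sbt
parallelExecution in Test := false   // one suite (and so one local SparkContext) at a time
fork in Test := true                 // optionally run tests in a separate, clean JVM
```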


Re: build error - failing test- Error while building spark 2.0 trunk from github

2016-07-31 Thread Jacek Laskowski
Hi,

Can you share what's the command to run the build? What's the OS? Java?

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sun, Jul 31, 2016 at 6:54 PM, Rohit Chaddha
 wrote:
> ---
>  T E S T S
> ---
> Running org.apache.spark.api.java.OptionalSuite
> Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.052 sec -
> in org.apache.spark.api.java.OptionalSuite
> Running org.apache.spark.JavaAPISuite
> Tests run: 90, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 23.537 sec
> <<< FAILURE! - in org.apache.spark.JavaAPISuite
> wholeTextFiles(org.apache.spark.JavaAPISuite)  Time elapsed: 0.331 sec  <<<
> FAILURE!
> java.lang.AssertionError:
> expected:> but was:
> at
> org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1087)
>
> Running org.apache.spark.JavaJdbcRDDSuite
> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.799 sec -
> in org.apache.spark.JavaJdbcRDDSuite
> Running org.apache.spark.launcher.SparkLauncherSuite
> Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.04 sec <<<
> FAILURE! - in org.apache.spark.launcher.SparkLauncherSuite
> testChildProcLauncher(org.apache.spark.launcher.SparkLauncherSuite)  Time
> elapsed: 0.03 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<1>
> at
> org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher(SparkLauncherSuite.java:110)
>
> Running org.apache.spark.memory.TaskMemoryManagerSuite
> Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.011 sec -
> in org.apache.spark.memory.TaskMemoryManagerSuite
> Running org.apache.spark.shuffle.sort.PackedRecordPointerSuite
> Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec -
> in org.apache.spark.shuffle.sort.PackedRecordPointerSuite
> Running org.apache.spark.shuffle.sort.ShuffleInMemoryRadixSorterSuite
> Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.103 sec -
> in org.apache.spark.shuffle.sort.ShuffleInMemoryRadixSorterSuite
> Running org.apache.spark.shuffle.sort.ShuffleInMemorySorterSuite
> Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.199 sec -
> in org.apache.spark.shuffle.sort.ShuffleInMemorySorterSuite
> Running org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite
> Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.67 sec -
> in org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite
> Running org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
> Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.97 sec -
> in org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
> Running org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
> Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.583 sec -
> in org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
> Running
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterRadixSortSuite
> Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.533 sec -
> in
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterRadixSortSuite
> Running
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite
> Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.606 sec -
> in org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite
> Running
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterRadixSortSuite
> Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.007 sec -
> in
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterRadixSortSuite
> Running
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterSuite
> Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec -
> in org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterSuite
> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m;
> support was removed in 8.0
>
> Results :
>
> Failed tests:
>   JavaAPISuite.wholeTextFiles:1087 expected:> but was:
>   SparkLauncherSuite.testChildProcLauncher:110 expected:<0> but was:<1>
>
> Tests run: 189, Failures: 2, Errors: 0, Skipped: 0

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



build error - failing test- Error while building spark 2.0 trunk from github

2016-07-31 Thread Rohit Chaddha
---
 T E S T S
---
Running org.apache.spark.api.java.OptionalSuite
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.052 sec -
in org.apache.spark.api.java.OptionalSuite
Running org.apache.spark.JavaAPISuite
Tests run: 90, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 23.537 sec
<<< FAILURE! - in org.apache.spark.JavaAPISuite
wholeTextFiles(org.apache.spark.JavaAPISuite)  Time elapsed: 0.331 sec  <<<
FAILURE!
java.lang.AssertionError:
expected: but was:
at
org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1087)

Running org.apache.spark.JavaJdbcRDDSuite
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.799 sec -
in org.apache.spark.JavaJdbcRDDSuite
Running org.apache.spark.launcher.SparkLauncherSuite
Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.04 sec
<<< FAILURE! - in org.apache.spark.launcher.SparkLauncherSuite
testChildProcLauncher(org.apache.spark.launcher.SparkLauncherSuite)  Time
elapsed: 0.03 sec  <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<1>
at
org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher(SparkLauncherSuite.java:110)

Running org.apache.spark.memory.TaskMemoryManagerSuite
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.011 sec -
in org.apache.spark.memory.TaskMemoryManagerSuite
Running org.apache.spark.shuffle.sort.PackedRecordPointerSuite
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec -
in org.apache.spark.shuffle.sort.PackedRecordPointerSuite
Running org.apache.spark.shuffle.sort.ShuffleInMemoryRadixSorterSuite
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.103 sec -
in org.apache.spark.shuffle.sort.ShuffleInMemoryRadixSorterSuite
Running org.apache.spark.shuffle.sort.ShuffleInMemorySorterSuite
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.199 sec -
in org.apache.spark.shuffle.sort.ShuffleInMemorySorterSuite
Running org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite
Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.67 sec -
in org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite
Running org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.97 sec -
in org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
Running org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.583 sec
- in org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
Running
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterRadixSortSuite
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.533 sec
- in
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterRadixSortSuite
Running
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.606 sec
- in org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite
Running
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterRadixSortSuite
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.007 sec -
in
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterRadixSortSuite
Running
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterSuite
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec -
in org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterSuite
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
MaxPermSize=512m; support was removed in 8.0

Results :

Failed tests:
  JavaAPISuite.wholeTextFiles:1087 expected: but was:
  SparkLauncherSuite.testChildProcLauncher:110 expected:<0> but was:<1>

Tests run: 189, Failures: 2, Errors: 0, Skipped: 0


Re: test - what is the wrong while adding one column in the dataframe

2016-06-16 Thread Zhiliang Zhu
Just a test: it seemed that the user email system was broken a while ago; it
is okay now.



On Friday, June 17, 2016 12:18 PM, Zhiliang Zhu 
<zchl.j...@yahoo.com.INVALID> wrote:
 

 

 On Tuesday, May 17, 2016 10:44 AM, Zhiliang Zhu 
<zchl.j...@yahoo.com.INVALID> wrote:
 

  Hi All,
For a DataFrame created from a Hive SQL query, I need to add one more column
derived from an existing column, while also keeping all the previous columns in
the resulting DataFrame.

final double DAYS_30 = 1000 * 60 * 60 * 24 * 30.0;
// DAYS_30 seems difficult to reference inside the SQL itself?
DataFrame behavior_df = jhql.sql("SELECT cast(user_id as double) as user_id,
    cast(server_timestamp as double) as server_timestamp, url, referer, source,
    app_version, params FROM log.request");

// This is okay to run, but behavior_df.printSchema() is not changed at all:
behavior_df.withColumn("daysLater30",
    behavior_df.col("server_timestamp").plus(DAYS_30));

// This is okay to run too, but behavior_df.printSchema() only shows one column,
// daysLater30. It should be the schema with all the previous columns plus the
// added daysLater30:
behavior_df = behavior_df.withColumn("daysLater30",
    behavior_df.col("server_timestamp").plus(DAYS_30));

Then, how would I do it?
Thank you, 

 

the issue was resolved.
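
For readers hitting the same thing: withColumn returns a new DataFrame rather than modifying the existing one, so the result has to be assigned back (as the last snippet above does). A minimal Scala sketch, where behaviorDf stands in for the DataFrame built from the Hive query:

```
val DAYS_30 = 1000.0 * 60 * 60 * 24 * 30

// withColumn does not mutate behaviorDf; capture the returned DataFrame.
val withDays = behaviorDf.withColumn("daysLater30",
  behaviorDf("server_timestamp") + DAYS_30)

withDays.printSchema()   // all the original columns plus daysLater30
```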

   

  

test - what is the wrong while adding one column in the dataframe

2016-06-16 Thread Zhiliang Zhu


 On Tuesday, May 17, 2016 10:44 AM, Zhiliang Zhu 
 wrote:
 

  Hi All,
For a DataFrame created from a Hive SQL query, I need to add one more column
derived from an existing column, while also keeping all the previous columns in
the resulting DataFrame.

final double DAYS_30 = 1000 * 60 * 60 * 24 * 30.0;
// DAYS_30 seems difficult to reference inside the SQL itself?
DataFrame behavior_df = jhql.sql("SELECT cast(user_id as double) as user_id,
    cast(server_timestamp as double) as server_timestamp, url, referer, source,
    app_version, params FROM log.request");

// This is okay to run, but behavior_df.printSchema() is not changed at all:
behavior_df.withColumn("daysLater30",
    behavior_df.col("server_timestamp").plus(DAYS_30));

// This is okay to run too, but behavior_df.printSchema() only shows one column,
// daysLater30. It should be the schema with all the previous columns plus the
// added daysLater30:
behavior_df = behavior_df.withColumn("daysLater30",
    behavior_df.col("server_timestamp").plus(DAYS_30));

Then, how would I do it?
Thank you, 

 

  

Re: ANOVA test in Spark

2016-05-28 Thread cyberjog
If a specific algorithm is not present, perhaps you can use R or Python's
scikit-learn: pipe your data to it and get the model back.

I'm currently trying this, and it works fine.
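
For the "pipe your data to it" approach, RDD.pipe is one way to hand records to an external R or Python process; a rough sketch (the script name and record format are assumptions):

```
// data: RDD[(String, Double)] of (group, value) pairs, defined elsewhere.
// Each element is written to the external process's stdin as one line;
// whatever the script prints to stdout comes back as an RDD[String].
val lines  = data.map { case (group, value) => s"$group,$value" }
val output = lines.pipe("Rscript anova.R")
output.collect().foreach(println)
```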



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ANOVA-test-in-Spark-tp26949p27043.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to map values read from test file to 2 different RDDs

2016-05-23 Thread Deepak Sharma
Hi
I am reading a text file with 16 fields.
All the placeholders for the values of this text file have been defined in
2 different case classes:
Case1 and Case2

How do I map the values read from the text file so that my Scala function can
return 2 different RDDs, one for each of these case class types?
E.g. the first 11 fields mapped to Case1 while the remaining 6 fields are mapped to Case2.
Any pointer here or code snippet would be really helpful.
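
A rough sketch of one way to do this (the delimiter, field positions and case-class fields are assumptions): parse each line once, cache the parsed RDD, and project it into the two case classes.

```
import org.apache.spark.rdd.RDD

case class Case1(f1: String, f2: String, f3: String)   // ...first group of fields
case class Case2(f12: String, f13: String)             // ...remaining fields

// sc is an existing SparkContext.
def splitRecords(path: String): (RDD[Case1], RDD[Case2]) = {
  val parsed = sc.textFile(path)
    .map(_.split(",", -1))   // 16 comma-separated fields per line (adjust to the real format)
    .cache()                 // parsed once, reused to build both RDDs

  val first  = parsed.map(f => Case1(f(0), f(1), f(2)))
  val second = parsed.map(f => Case2(f(11), f(12)))
  (first, second)
}
```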


-- 
Thanks
Deepak


Re: ANOVA test in Spark

2016-05-13 Thread mylisttech
Mayank,

Assuming ANOVA is not present in MLlib, can you not exploit the ANOVA from SparkR?
I am enquiring, not making a factual statement.

Thanks 



On May 13, 2016, at 15:54, mayankshete <mayank.shis...@yash.com> wrote:

> Is ANOVA present in Spark MLlib? If not, when will this feature be
> available in Spark?
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/ANOVA-test-in-Spark-tp26949.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



ANOVA test in Spark

2016-05-13 Thread mayankshete
Is ANOVA present in Spark MLlib? If not, when will this feature be
available in Spark?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ANOVA-test-in-Spark-tp26949.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Hi test

2016-05-10 Thread Abi
Hello test

Re: [2 BUG REPORT] failed to run make-distribution.sh when a older version maven installed in system and run VersionsSuite test hang

2016-04-28 Thread Ted Yu
For #1, have you seen this JIRA ?

[SPARK-14867][BUILD] Remove `--force` option in `build/mvn`

On Thu, Apr 28, 2016 at 8:27 PM, Demon King <kdm...@gmail.com> wrote:

> BUG 1:
> I have maven 3.0.2 installed on the system. When I use make-distribution.sh,
> it seems not to use maven 3.2.2 but uses /usr/local/bin/mvn to build Spark. So
> I added the --force option in make-distribution.sh like this:
>
> line 130:
> VERSION=$("$MVN" *--force* help:evaluate -Dexpression=project.version $@
> 2>/dev/null | grep -v "INFO" | tail -n 1)
> SCALA_VERSION=$("$MVN"* --force* help:evaluate
> -Dexpression=scala.binary.version $@ 2>/dev/null\
> | grep -v "INFO"\
> | tail -n 1)
> SPARK_HADOOP_VERSION=$("$MVN" *--force* help:evaluate
> -Dexpression=hadoop.version $@ 2>/dev/null\
> | grep -v "INFO"\
> | tail -n 1)
> SPARK_HIVE=$("$MVN"* --force* help:evaluate
> -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
> | grep -v "INFO"\
> | fgrep --count "hive";\
> # Reset exit status to 0, otherwise the script stops here if the last
> grep finds nothing\
> # because we use "set -o pipefail"
> echo -n)
>
> line 170:
> BUILD_COMMAND=("$MVN" *--force* clean package -DskipTests $@)
>
> that will force spark to use build/mvn and solve this problem.
>
> BUG 2:
>
> When I run the unit test VersionsSuite, it will hang for a night or more.
> I used jstack and lsof and found it trying to send an HTTP request. That does
> not seem like a good idea when running tests on an unreliable network.
>
> Using jstack, I finally found the reason:
>
>   java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:152)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> - locked <0x0007440224d8> (a java.io.BufferedInputStream)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
> at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
> - locked <0x000744022530> (a
> sun.net.www.protocol.http.HttpURLConnection)
> at
> java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
> ...
>
> and I use lsof:
>
> java32082 user 247u  IPv4 527001934   TCP 8.8.8.8:33233 (LISTEN)
> java32082 user  267u  IPv4 527001979   TCP 8.8.8.8:52301 (LISTEN)
> java32082 user  316u  IPv4 527001999   TCP *:51993 (LISTEN)
> java32082 user  521u  IPv4 527111590   TCP 8.8.8.8:53286
> ->butan141.server4you.de:http (ESTABLISHED)
>
> This test suite tries to connect to butan141.server4you.de. The process will
> hang when the network is unreliable.
>
>
>


[2 BUG REPORT] failed to run make-distribution.sh when a older version maven installed in system and run VersionsSuite test hang

2016-04-28 Thread Demon King
BUG 1:
I have maven 3.0.2 installed on the system. When I use make-distribution.sh,
it seems not to use maven 3.2.2 but uses /usr/local/bin/mvn to build Spark. So
I added the --force option in make-distribution.sh like this:

line 130:
VERSION=$("$MVN" *--force* help:evaluate -Dexpression=project.version $@
2>/dev/null | grep -v "INFO" | tail -n 1)
SCALA_VERSION=$("$MVN"* --force* help:evaluate
-Dexpression=scala.binary.version $@ 2>/dev/null\
| grep -v "INFO"\
| tail -n 1)
SPARK_HADOOP_VERSION=$("$MVN" *--force* help:evaluate
-Dexpression=hadoop.version $@ 2>/dev/null\
| grep -v "INFO"\
| tail -n 1)
SPARK_HIVE=$("$MVN"* --force* help:evaluate
-Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
| grep -v "INFO"\
| fgrep --count "hive";\
# Reset exit status to 0, otherwise the script stops here if the last
grep finds nothing\
# because we use "set -o pipefail"
echo -n)

line 170:
BUILD_COMMAND=("$MVN" *--force* clean package -DskipTests $@)

that will force spark to use build/mvn and solve this problem.

BUG 2:

When I run the unit test VersionsSuite, it will hang for a night or more.
I used jstack and lsof and found it trying to send an HTTP request. That does
not seem like a good idea when running tests on an unreliable network.

Using jstack, I finally found the reason:

  java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
- locked <0x0007440224d8> (a java.io.BufferedInputStream)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
- locked <0x000744022530> (a
sun.net.www.protocol.http.HttpURLConnection)
at
java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
...

and I use lsof:

java32082 user 247u  IPv4 527001934   TCP 8.8.8.8:33233 (LISTEN)
java32082 user  267u  IPv4 527001979   TCP 8.8.8.8:52301 (LISTEN)
java32082 user  316u  IPv4 527001999   TCP *:51993 (LISTEN)
java32082 user  521u  IPv4 527111590   TCP 8.8.8.8:53286
->butan141.server4you.de:http (ESTABLISHED)

This test suite tries to connect to butan141.server4you.de. The process will
hang when the network is unreliable.


test

2016-04-26 Thread Harjit Singh










RE: How this unit test passed on master trunk?

2016-04-24 Thread Yong Zhang
So in that case the result will be the following:

[1,[1,1]]
[3,[3,1]]
[2,[2,1]]

Thanks for explaining the meaning of it. But the question is: how can first()
return [3,[1,1]]? In fact, if there were any ordering in the final result, it
would be [1,[1,1]] instead of [3,[1,1]], correct?

Yong

Subject: Re: How this unit test passed on master trunk?
From: zzh...@hortonworks.com
To: java8...@hotmail.com; gatorsm...@gmail.com
CC: user@spark.apache.org
Date: Sun, 24 Apr 2016 04:37:11 +






There are multiple records for the DF




scala> structDF.groupBy($"a").agg(min(struct($"record.*"))).show
+---+-----------------------------+
|  a|min(struct(unresolvedstar()))|
+---+-----------------------------+
|  1|                        [1,1]|
|  3|                        [3,1]|
|  2|                        [2,1]|



The meaning of .groupBy($"a").agg(min(struct($"record.*"))) is to get the min 
for all the records with the same $”a”



For example: TestData2(1,1) :: TestData2(1,2) The result would be 1, (1, 1), 
since struct(1, 1) is less than struct(1, 2). Please check how the Ordering is 
implemented in InterpretedOrdering.



The output itself does not have any ordering. I am not sure why the unit test 
and the real env have different environment.



Xiao,



I do see the difference between unit test and local cluster run. Do you know 
the reason?



Thanks.



Zhan Zhang









 

On Apr 22, 2016, at 11:23 AM, Yong Zhang <java8...@hotmail.com> wrote:



Hi,



I was trying to find out why this unit test can pass in Spark code.



in
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala



for this unit test:

  test("Star Expansion - CreateStruct and CreateArray") {
val structDf = testData2.select("a", "b").as("record")
// CreateStruct and CreateArray in aggregateExpressions
assert(structDf.groupBy($"a").agg(min(struct($"record.*"))).first() == 
Row(3, Row(3, 1)))
assert(structDf.groupBy($"a").agg(min(array($"record.*"))).first() == 
Row(3, Seq(3, 1)))

// CreateStruct and CreateArray in project list (unresolved alias)
assert(structDf.select(struct($"record.*")).first() == Row(Row(1, 1)))
assert(structDf.select(array($"record.*")).first().getAs[Seq[Int]](0) === 
Seq(1, 1))

// CreateStruct and CreateArray in project list (alias)
assert(structDf.select(struct($"record.*").as("a")).first() == Row(Row(1, 
1)))

assert(structDf.select(array($"record.*").as("a")).first().getAs[Seq[Int]](0) 
=== Seq(1, 1))
  }
From my understanding, the data returned in this case should be Row(1, Row(1, 1)),
as that will be the min of the struct.
In fact, if I run the spark-shell on my laptop, and I got the result I expected:


./bin/spark-shell
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
  /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala> case class TestData2(a: Int, b: Int)
defined class TestData2
scala> val testData2DF = sqlContext.sparkContext.parallelize(TestData2(1,1) :: 
TestData2(1,2) :: TestData2(2,1) :: TestData2(2,2) :: TestData2(3,1) :: 
TestData2(3,2) :: Nil, 2).toDF()
scala> val structDF = testData2DF.select("a","b").as("record")
scala> structDF.groupBy($"a").agg(min(struct($"record.*"))).first()
res0: org.apache.spark.sql.Row = [1,[1,1]]

scala> structDF.show
+---+---+
|  a|  b|
+---+---+
|  1|  1|
|  1|  2|
|  2|  1|
|  2|  2|
|  3|  1|
|  3|  2|
+---+---+
So from my spark, which I built on the master, I cannot get Row[3,[1,1]] back 
in this case. Why the unit test asserts that Row[3,[1,1]] should be the first, 
and it will pass? But I cannot reproduce that in my spark-shell? I am trying to 
understand how to interpret the meaning of "agg(min(struct($"record.*")))"


Thanks
Yong 







  

Re: How this unit test passed on master trunk?

2016-04-23 Thread Zhan Zhang
There are multiple records for the DF

scala> structDF.groupBy($"a").agg(min(struct($"record.*"))).show
+---+-----------------------------+
|  a|min(struct(unresolvedstar()))|
+---+-----------------------------+
|  1|                        [1,1]|
|  3|                        [3,1]|
|  2|                        [2,1]|

The meaning of .groupBy($"a").agg(min(struct($"record.*"))) is to get the min
over all the records that share the same $"a".

For example, for TestData2(1,1) :: TestData2(1,2) the result would be 1, (1, 1),
since struct(1, 1) is less than struct(1, 2). Please check how the Ordering is
implemented in InterpretedOrdering.

The output itself does not have any ordering. I am not sure why the unit test
and the real environment behave differently.
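
Given that the grouped result has no guaranteed row order, an assertion that does not depend on ordering avoids the discrepancy; for example (a sketch, reusing structDF from the thread, with the spark-shell implicits in scope):

```
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{min, struct}

val agg = structDF.groupBy($"a").agg(min(struct($"record.*")).as("m"))

// Compare the full contents as a set (or orderBy($"a") before first()) instead of
// relying on whichever group first() happens to return.
assert(agg.collect().toSet == Set(Row(1, Row(1, 1)), Row(2, Row(2, 1)), Row(3, Row(3, 1))))
```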

Xiao,

I do see the difference between unit test and local cluster run. Do you know 
the reason?

Thanks.

Zhan Zhang




On Apr 22, 2016, at 11:23 AM, Yong Zhang 
<java8...@hotmail.com<mailto:java8...@hotmail.com>> wrote:

Hi,

I was trying to find out why this unit test can pass in Spark code.

in
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

for this unit test:

  test("Star Expansion - CreateStruct and CreateArray") {
val structDf = testData2.select("a", "b").as("record")
// CreateStruct and CreateArray in aggregateExpressions
assert(structDf.groupBy($"a").agg(min(struct($"record.*"))).first() == 
Row(3, Row(3, 1)))
assert(structDf.groupBy($"a").agg(min(array($"record.*"))).first() == 
Row(3, Seq(3, 1)))

// CreateStruct and CreateArray in project list (unresolved alias)
assert(structDf.select(struct($"record.*")).first() == Row(Row(1, 1)))
assert(structDf.select(array($"record.*")).first().getAs[Seq[Int]](0) === 
Seq(1, 1))

// CreateStruct and CreateArray in project list (alias)
assert(structDf.select(struct($"record.*").as("a")).first() == Row(Row(1, 
1)))

assert(structDf.select(array($"record.*").as("a")).first().getAs[Seq[Int]](0) 
=== Seq(1, 1))
  }

From my understanding, the data returned in this case should be Row(1, Row(1, 1)),
as that will be the min of the struct.

In fact, if I run the spark-shell on my laptop, and I got the result I expected:


./bin/spark-shell
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
  /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala> case class TestData2(a: Int, b: Int)
defined class TestData2

scala> val testData2DF = sqlContext.sparkContext.parallelize(TestData2(1,1) :: 
TestData2(1,2) :: TestData2(2,1) :: TestData2(2,2) :: TestData2(3,1) :: 
TestData2(3,2) :: Nil, 2).toDF()

scala> val structDF = testData2DF.select("a","b").as("record")

scala> structDF.groupBy($"a").agg(min(struct($"record.*"))).first()
res0: org.apache.spark.sql.Row = [1,[1,1]]

scala> structDF.show
+---+---+
|  a|  b|
+---+---+
|  1|  1|
|  1|  2|
|  2|  1|
|  2|  2|
|  3|  1|
|  3|  2|
+---+---+

So from my spark, which I built on the master, I cannot get Row[3,[1,1]] back 
in this case. Why the unit test asserts that Row[3,[1,1]] should be the first, 
and it will pass? But I cannot reproduce that in my spark-shell? I am trying to 
understand how to interpret the meaning of "agg(min(struct($"record.*")))"


Thanks

Yong



Re: How this unit test passed on master trunk?

2016-04-22 Thread Ted Yu
This was added by Xiao through:

[SPARK-13320][SQL] Support Star in CreateStruct/CreateArray and Error
Handling when DataFrame/DataSet Functions using Star

I tried in spark-shell and got:

scala> val first =
structDf.groupBy($"a").agg(min(struct($"record.*"))).first()
first: org.apache.spark.sql.Row = [1,[1,1]]

BTW
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/715/consoleFull
shows this test passing.

On Fri, Apr 22, 2016 at 11:23 AM, Yong Zhang <java8...@hotmail.com> wrote:

> Hi,
>
> I was trying to find out why this unit test can pass in Spark code.
>
> in
>
> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
>
> for this unit test:
>
>   test("Star Expansion - CreateStruct and CreateArray") {
> val structDf = testData2.select("a", "b").as("record")
> // CreateStruct and CreateArray in aggregateExpressions
> *assert(structDf.groupBy($"a").agg(min(struct($"record.*"))).first() == 
> Row(3, Row(3, 1)))*
> assert(structDf.groupBy($"a").agg(min(array($"record.*"))).first() == 
> Row(3, Seq(3, 1)))
>
> // CreateStruct and CreateArray in project list (unresolved alias)
> assert(structDf.select(struct($"record.*")).first() == Row(Row(1, 1)))
> assert(structDf.select(array($"record.*")).first().getAs[Seq[Int]](0) === 
> Seq(1, 1))
>
> // CreateStruct and CreateArray in project list (alias)
> assert(structDf.select(struct($"record.*").as("a")).first() == Row(Row(1, 
> 1)))
> 
> assert(structDf.select(array($"record.*").as("a")).first().getAs[Seq[Int]](0) 
> === Seq(1, 1))
>   }
>
> From my understanding, the data return in this case should be Row(1, Row(1, 
> 1]), as that will be min of struct.
>
> In fact, if I run the spark-shell on my laptop, and I got the result I 
> expected:
>
>
> ./bin/spark-shell
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
>   /_/
>
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> case class TestData2(a: Int, b: Int)
> defined class TestData2
>
> scala> val testData2DF = sqlContext.sparkContext.parallelize(TestData2(1,1) 
> :: TestData2(1,2) :: TestData2(2,1) :: TestData2(2,2) :: TestData2(3,1) :: 
> TestData2(3,2) :: Nil, 2).toDF()
>
> scala> val structDF = testData2DF.select("a","b").as("record")
>
> scala> structDF.groupBy($"a").agg(min(struct($"record.*"))).first()
> res0: org.apache.spark.sql.Row = [1,[1,1]]
>
> scala> structDF.show
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  1|
> |  1|  2|
> |  2|  1|
> |  2|  2|
> |  3|  1|
> |  3|  2|
> +---+---+
>
> So from my spark, which I built on the master, I cannot get Row[3,[1,1]] back 
> in this case. Why the unit test asserts that Row[3,[1,1]] should be the 
> first, and it will pass? But I cannot reproduce that in my spark-shell? I am 
> trying to understand how to interpret the meaning of 
> "agg(min(struct($"record.*")))"
>
>
> Thanks
>
> Yong
>
>


How this unit test passed on master trunk?

2016-04-22 Thread Yong Zhang
Hi,

I was trying to find out why this unit test can pass in the Spark code.

in
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

for this unit test:

  test("Star Expansion - CreateStruct and CreateArray") {
    val structDf = testData2.select("a", "b").as("record")
    // CreateStruct and CreateArray in aggregateExpressions
    assert(structDf.groupBy($"a").agg(min(struct($"record.*"))).first() == Row(3, Row(3, 1)))
    assert(structDf.groupBy($"a").agg(min(array($"record.*"))).first() == Row(3, Seq(3, 1)))

    // CreateStruct and CreateArray in project list (unresolved alias)
    assert(structDf.select(struct($"record.*")).first() == Row(Row(1, 1)))
    assert(structDf.select(array($"record.*")).first().getAs[Seq[Int]](0) === Seq(1, 1))

    // CreateStruct and CreateArray in project list (alias)
    assert(structDf.select(struct($"record.*").as("a")).first() == Row(Row(1, 1)))
    assert(structDf.select(array($"record.*").as("a")).first().getAs[Seq[Int]](0) === Seq(1, 1))
  }

From my understanding, the data returned in this case should be Row(1, Row(1, 1)),
as that will be the min of the struct.

In fact, if I run the spark-shell on my laptop, I get the result I expected:

./bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala> case class TestData2(a: Int, b: Int)
defined class TestData2

scala> val testData2DF = sqlContext.sparkContext.parallelize(TestData2(1,1) :: TestData2(1,2) :: TestData2(2,1) :: TestData2(2,2) :: TestData2(3,1) :: TestData2(3,2) :: Nil, 2).toDF()

scala> val structDF = testData2DF.select("a","b").as("record")

scala> structDF.groupBy($"a").agg(min(struct($"record.*"))).first()
res0: org.apache.spark.sql.Row = [1,[1,1]]

scala> structDF.show
+---+---+
|  a|  b|
+---+---+
|  1|  1|
|  1|  2|
|  2|  1|
|  2|  2|
|  3|  1|
|  3|  2|
+---+---+

So from my Spark, which I built from master, I cannot get Row[3,[1,1]] back in
this case. Why does the unit test assert that Row[3,[1,1]] should be first, and
why does it pass? But I cannot reproduce that in my spark-shell. I am trying to
understand how to interpret the meaning of "agg(min(struct($"record.*")))".

Thanks
Yong

Re: Unit test with sqlContext

2016-03-19 Thread Vikas Kawadia
If you prefer  the py.test framework, I just wrote a blog post with some
examples:

Unit testing Apache Spark with py.test
https://engblog.nextdoor.com/unit-testing-apache-spark-with-py-test-3b8970dc013b

On Fri, Feb 5, 2016 at 11:43 AM, Steve Annessa <steve.anne...@gmail.com>
wrote:

> Thanks for all of the responses.
>
> I do have an afterAll that stops the sc.
>
> While looking over Holden's readme I noticed she mentioned "Make sure to
> disable parallel execution." That was what I was missing; I added the
> follow to my build.sbt:
>
> ```
> parallelExecution in Test := false
> ```
>
> Now all of my tests are running.
>
> I'm going to look into using the package she created.
>
> Thanks again,
>
> -- Steve
>
>
> On Thu, Feb 4, 2016 at 8:50 PM, Rishi Mishra <rmis...@snappydata.io>
> wrote:
>
>> Hi Steve,
>> Have you cleaned up your SparkContext (sc.stop()) in an afterAll()?
>> The error suggests you are creating more than one SparkContext.
>>
>>
>> On Fri, Feb 5, 2016 at 10:04 AM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> Thanks for recommending spark-testing-base :) Just wanted to add if
>>> anyone has feature requests for Spark testing please get in touch (or add
>>> an issue on the github) :)
>>>
>>>
>>> On Thu, Feb 4, 2016 at 8:25 PM, Silvio Fiorito <
>>> silvio.fior...@granturing.com> wrote:
>>>
>>>> Hi Steve,
>>>>
>>>> Have you looked at the spark-testing-base package by Holden? It’s
>>>> really useful for unit testing Spark apps as it handles all the
>>>> bootstrapping for you.
>>>>
>>>> https://github.com/holdenk/spark-testing-base
>>>>
>>>> DataFrame examples are here:
>>>> https://github.com/holdenk/spark-testing-base/blob/master/src/test/1.3/scala/com/holdenkarau/spark/testing/SampleDataFrameTest.scala
>>>>
>>>> Thanks,
>>>> Silvio
>>>>
>>>> From: Steve Annessa <steve.anne...@gmail.com>
>>>> Date: Thursday, February 4, 2016 at 8:36 PM
>>>> To: "user@spark.apache.org" <user@spark.apache.org>
>>>> Subject: Unit test with sqlContext
>>>>
>>>> I'm trying to unit test a function that reads in a JSON file,
>>>> manipulates the DF and then returns a Scala Map.
>>>>
>>>> The function has signature:
>>>> def ingest(dataLocation: String, sc: SparkContext, sqlContext:
>>>> SQLContext)
>>>>
>>>> I've created a bootstrap spec for spark jobs that instantiates the
>>>> Spark Context and SQLContext like so:
>>>>
>>>> @transient var sc: SparkContext = _
>>>> @transient var sqlContext: SQLContext = _
>>>>
>>>> override def beforeAll = {
>>>>   System.clearProperty("spark.driver.port")
>>>>   System.clearProperty("spark.hostPort")
>>>>
>>>>   val conf = new SparkConf()
>>>> .setMaster(master)
>>>> .setAppName(appName)
>>>>
>>>>   sc = new SparkContext(conf)
>>>>   sqlContext = new SQLContext(sc)
>>>> }
>>>>
>>>> When I do not include sqlContext, my tests run. Once I add the
>>>> sqlContext I get the following errors:
>>>>
>>>> 16/02/04 17:31:58 WARN SparkContext: Another SparkContext is being
>>>> constructed (or threw an exception in its constructor).  This may indicate
>>>> an error, since only one SparkContext may be running in this JVM (see
>>>> SPARK-2243). The other SparkContext was created at:
>>>> org.apache.spark.SparkContext.<init>(SparkContext.scala:81)
>>>>
>>>> 16/02/04 17:31:59 ERROR SparkContext: Error initializing SparkContext.
>>>> akka.actor.InvalidActorNameException: actor name [ExecutorEndpoint] is
>>>> not unique!
>>>>
>>>> and finally:
>>>>
>>>> [info] IngestSpec:
>>>> [info] Exception encountered when attempting to run a suite with class
>>>> name: com.company.package.IngestSpec *** ABORTED ***
>>>> [info]   akka.actor.InvalidActorNameException: actor name
>>>> [ExecutorEndpoint] is not unique!
>>>>
>>>>
>>>> What do I need to do to get a sqlContext through my tests?
>>>>
>>>> Thanks,
>>>>
>>>> -- Steve
>>>>
>>>
>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Regards,
>> Rishitesh Mishra,
>> SnappyData . (http://www.snappydata.io/)
>>
>> https://in.linkedin.com/in/rishiteshmishra
>>
>
>


Re: Unit test with sqlContext

2016-02-05 Thread Steve Annessa
Thanks for all of the responses.

I do have an afterAll that stops the sc.

While looking over Holden's readme I noticed she mentioned "Make sure to
disable parallel execution." That was what I was missing; I added the
following to my build.sbt:

```
parallelExecution in Test := false
```

Now all of my tests are running.

I'm going to look into using the package she created.
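
Putting the pieces from this thread together, a self-contained base trait (a sketch, not the spark-testing-base API) could look like the following; combined with parallelExecution in Test := false it keeps a single context in the JVM at a time:

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.scalatest.{BeforeAndAfterAll, Suite}

trait SharedSparkContext extends BeforeAndAfterAll { self: Suite =>
  @transient var sc: SparkContext = _
  @transient var sqlContext: SQLContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    val conf = new SparkConf().setMaster("local[*]").setAppName("unit-tests")
    sc = new SparkContext(conf)
    sqlContext = new SQLContext(sc)
  }

  override def afterAll(): Unit = {
    try {
      if (sc != null) sc.stop()   // release the context so the next suite can create one
      sc = null
      sqlContext = null
    } finally {
      super.afterAll()
    }
  }
}
```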

Thanks again,

-- Steve


On Thu, Feb 4, 2016 at 8:50 PM, Rishi Mishra <rmis...@snappydata.io> wrote:

> Hi Steve,
> Have you cleaned up your SparkContext (sc.stop()) in an afterAll()? The
> error suggests you are creating more than one SparkContext.
>
>
> On Fri, Feb 5, 2016 at 10:04 AM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>> Thanks for recommending spark-testing-base :) Just wanted to add if
>> anyone has feature requests for Spark testing please get in touch (or add
>> an issue on the github) :)
>>
>>
>> On Thu, Feb 4, 2016 at 8:25 PM, Silvio Fiorito <
>> silvio.fior...@granturing.com> wrote:
>>
>>> Hi Steve,
>>>
>>> Have you looked at the spark-testing-base package by Holden? It’s really
>>> useful for unit testing Spark apps as it handles all the bootstrapping for
>>> you.
>>>
>>> https://github.com/holdenk/spark-testing-base
>>>
>>> DataFrame examples are here:
>>> https://github.com/holdenk/spark-testing-base/blob/master/src/test/1.3/scala/com/holdenkarau/spark/testing/SampleDataFrameTest.scala
>>>
>>> Thanks,
>>> Silvio
>>>
>>> From: Steve Annessa <steve.anne...@gmail.com>
>>> Date: Thursday, February 4, 2016 at 8:36 PM
>>> To: "user@spark.apache.org" <user@spark.apache.org>
>>> Subject: Unit test with sqlContext
>>>
>>> I'm trying to unit test a function that reads in a JSON file,
>>> manipulates the DF and then returns a Scala Map.
>>>
>>> The function has signature:
>>> def ingest(dataLocation: String, sc: SparkContext, sqlContext:
>>> SQLContext)
>>>
>>> I've created a bootstrap spec for spark jobs that instantiates the Spark
>>> Context and SQLContext like so:
>>>
>>> @transient var sc: SparkContext = _
>>> @transient var sqlContext: SQLContext = _
>>>
>>> override def beforeAll = {
>>>   System.clearProperty("spark.driver.port")
>>>   System.clearProperty("spark.hostPort")
>>>
>>>   val conf = new SparkConf()
>>> .setMaster(master)
>>> .setAppName(appName)
>>>
>>>   sc = new SparkContext(conf)
>>>   sqlContext = new SQLContext(sc)
>>> }
>>>
>>> When I do not include sqlContext, my tests run. Once I add the
>>> sqlContext I get the following errors:
>>>
>>> 16/02/04 17:31:58 WARN SparkContext: Another SparkContext is being
>>> constructed (or threw an exception in its constructor).  This may indicate
>>> an error, since only one SparkContext may be running in this JVM (see
>>> SPARK-2243). The other SparkContext was created at:
>>> org.apache.spark.SparkContext.<init>(SparkContext.scala:81)
>>>
>>> 16/02/04 17:31:59 ERROR SparkContext: Error initializing SparkContext.
>>> akka.actor.InvalidActorNameException: actor name [ExecutorEndpoint] is
>>> not unique!
>>>
>>> and finally:
>>>
>>> [info] IngestSpec:
>>> [info] Exception encountered when attempting to run a suite with class
>>> name: com.company.package.IngestSpec *** ABORTED ***
>>> [info]   akka.actor.InvalidActorNameException: actor name
>>> [ExecutorEndpoint] is not unique!
>>>
>>>
>>> What do I need to do to get a sqlContext through my tests?
>>>
>>> Thanks,
>>>
>>> -- Steve
>>>
>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Regards,
> Rishitesh Mishra,
> SnappyData . (http://www.snappydata.io/)
>
> https://in.linkedin.com/in/rishiteshmishra
>


Unit test with sqlContext

2016-02-04 Thread Steve Annessa
I'm trying to unit test a function that reads in a JSON file, manipulates
the DF and then returns a Scala Map.

The function has signature:
def ingest(dataLocation: String, sc: SparkContext, sqlContext: SQLContext)

I've created a bootstrap spec for spark jobs that instantiates the Spark
Context and SQLContext like so:

@transient var sc: SparkContext = _
@transient var sqlContext: SQLContext = _

override def beforeAll = {
  System.clearProperty("spark.driver.port")
  System.clearProperty("spark.hostPort")

  val conf = new SparkConf()
.setMaster(master)
.setAppName(appName)

  sc = new SparkContext(conf)
  sqlContext = new SQLContext(sc)
}

When I do not include sqlContext, my tests run. Once I add the sqlContext I
get the following errors:

16/02/04 17:31:58 WARN SparkContext: Another SparkContext is being
constructed (or threw an exception in its constructor).  This may indicate
an error, since only one SparkContext may be running in this JVM (see
SPARK-2243). The other SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:81)

16/02/04 17:31:59 ERROR SparkContext: Error initializing SparkContext.
akka.actor.InvalidActorNameException: actor name [ExecutorEndpoint] is not
unique!

and finally:

[info] IngestSpec:
[info] Exception encountered when attempting to run a suite with class
name: com.company.package.IngestSpec *** ABORTED ***
[info]   akka.actor.InvalidActorNameException: actor name
[ExecutorEndpoint] is not unique!


What do I need to do to get a sqlContext through my tests?

Thanks,

-- Steve


Re: Unit test with sqlContext

2016-02-04 Thread Silvio Fiorito
Hi Steve,

Have you looked at the spark-testing-base package by Holden? It’s really useful 
for unit testing Spark apps as it handles all the bootstrapping for you.

https://github.com/holdenk/spark-testing-base

DataFrame examples are here: 
https://github.com/holdenk/spark-testing-base/blob/master/src/test/1.3/scala/com/holdenkarau/spark/testing/SampleDataFrameTest.scala
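
For example, a rough sketch of an IngestSpec built on the package's
SharedSparkContext trait (Ingest.ingest and the sample JSON path below are
placeholders for your own code, not part of the library):

import com.holdenkarau.spark.testing.SharedSparkContext
import org.apache.spark.sql.SQLContext
import org.scalatest.FunSuite

class IngestSpec extends FunSuite with SharedSparkContext {
  test("ingest returns a non-empty map") {
    // sc is created before the suite and stopped afterwards by SharedSparkContext
    val sqlContext = new SQLContext(sc)
    // Ingest.ingest and the path are placeholders for the code under test
    val result = Ingest.ingest("src/test/resources/sample.json", sc, sqlContext)
    assert(result.nonEmpty)
  }
}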

Thanks,
Silvio

From: Steve Annessa <steve.anne...@gmail.com>
Date: Thursday, February 4, 2016 at 8:36 PM
To: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Unit test with sqlContext

I'm trying to unit test a function that reads in a JSON file, manipulates the 
DF and then returns a Scala Map.

The function has signature:
def ingest(dataLocation: String, sc: SparkContext, sqlContext: SQLContext)

I've created a bootstrap spec for spark jobs that instantiates the Spark 
Context and SQLContext like so:

@transient var sc: SparkContext = _
@transient var sqlContext: SQLContext = _

override def beforeAll = {
  System.clearProperty("spark.driver.port")
  System.clearProperty("spark.hostPort")

  val conf = new SparkConf()
.setMaster(master)
.setAppName(appName)

  sc = new SparkContext(conf)
  sqlContext = new SQLContext(sc)
}

When I do not include sqlContext, my tests run. Once I add the sqlContext I get 
the following errors:

16/02/04 17:31:58 WARN SparkContext: Another SparkContext is being constructed 
(or threw an exception in its constructor).  This may indicate an error, since 
only one SparkContext may be running in this JVM (see SPARK-2243). The other 
SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:81)

16/02/04 17:31:59 ERROR SparkContext: Error initializing SparkContext.
akka.actor.InvalidActorNameException: actor name [ExecutorEndpoint] is not 
unique!

and finally:

[info] IngestSpec:
[info] Exception encountered when attempting to run a suite with class name: 
com.company.package.IngestSpec *** ABORTED ***
[info]   akka.actor.InvalidActorNameException: actor name [ExecutorEndpoint] is 
not unique!


What do I need to do to get a sqlContext through my tests?

Thanks,

-- Steve


Re: Unit test with sqlContext

2016-02-04 Thread Rishi Mishra
Hi Steve,
Have you cleaned up your SparkContext (sc.stop()) in an afterAll()? The
error suggests you are creating more than one SparkContext.
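
For illustration, a minimal sketch of such a cleanup (assuming ScalaTest's
FunSuite with BeforeAndAfterAll; the local[2] master, app name, and class
name are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.scalatest.{BeforeAndAfterAll, FunSuite}

abstract class SharedSparkSpec extends FunSuite with BeforeAndAfterAll {
  @transient var sc: SparkContext = _
  @transient var sqlContext: SQLContext = _

  override def beforeAll(): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("test")
    sc = new SparkContext(conf)
    sqlContext = new SQLContext(sc)
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()  // release the one-per-JVM SparkContext
    sc = null
    sqlContext = null
  }
}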


On Fri, Feb 5, 2016 at 10:04 AM, Holden Karau <hol...@pigscanfly.ca> wrote:

> Thanks for recommending spark-testing-base :) Just wanted to add if anyone
> has feature requests for Spark testing please get in touch (or add an issue
> on the github) :)
>
>
> On Thu, Feb 4, 2016 at 8:25 PM, Silvio Fiorito <
> silvio.fior...@granturing.com> wrote:
>
>> Hi Steve,
>>
>> Have you looked at the spark-testing-base package by Holden? It’s really
>> useful for unit testing Spark apps as it handles all the bootstrapping for
>> you.
>>
>> https://github.com/holdenk/spark-testing-base
>>
>> DataFrame examples are here:
>> https://github.com/holdenk/spark-testing-base/blob/master/src/test/1.3/scala/com/holdenkarau/spark/testing/SampleDataFrameTest.scala
>>
>> Thanks,
>> Silvio
>>
>> From: Steve Annessa <steve.anne...@gmail.com>
>> Date: Thursday, February 4, 2016 at 8:36 PM
>> To: "user@spark.apache.org" <user@spark.apache.org>
>> Subject: Unit test with sqlContext
>>
>> I'm trying to unit test a function that reads in a JSON file, manipulates
>> the DF and then returns a Scala Map.
>>
>> The function has signature:
>> def ingest(dataLocation: String, sc: SparkContext, sqlContext: SQLContext)
>>
>> I've created a bootstrap spec for spark jobs that instantiates the Spark
>> Context and SQLContext like so:
>>
>> @transient var sc: SparkContext = _
>> @transient var sqlContext: SQLContext = _
>>
>> override def beforeAll = {
>>   System.clearProperty("spark.driver.port")
>>   System.clearProperty("spark.hostPort")
>>
>>   val conf = new SparkConf()
>> .setMaster(master)
>> .setAppName(appName)
>>
>>   sc = new SparkContext(conf)
>>   sqlContext = new SQLContext(sc)
>> }
>>
>> When I do not include sqlContext, my tests run. Once I add the sqlContext
>> I get the following errors:
>>
>> 16/02/04 17:31:58 WARN SparkContext: Another SparkContext is being
>> constructed (or threw an exception in its constructor).  This may indicate
>> an error, since only one SparkContext may be running in this JVM (see
>> SPARK-2243). The other SparkContext was created at:
>> org.apache.spark.SparkContext.<init>(SparkContext.scala:81)
>>
>> 16/02/04 17:31:59 ERROR SparkContext: Error initializing SparkContext.
>> akka.actor.InvalidActorNameException: actor name [ExecutorEndpoint] is
>> not unique!
>>
>> and finally:
>>
>> [info] IngestSpec:
>> [info] Exception encountered when attempting to run a suite with class
>> name: com.company.package.IngestSpec *** ABORTED ***
>> [info]   akka.actor.InvalidActorNameException: actor name
>> [ExecutorEndpoint] is not unique!
>>
>>
>> What do I need to do to get a sqlContext through my tests?
>>
>> Thanks,
>>
>> -- Steve
>>
>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>



-- 
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra


Re: Unit test with sqlContext

2016-02-04 Thread Holden Karau
Thanks for recommending spark-testing-base :) Just wanted to add if anyone
has feature requests for Spark testing please get in touch (or add an issue
on the github) :)


On Thu, Feb 4, 2016 at 8:25 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

> Hi Steve,
>
> Have you looked at the spark-testing-base package by Holden? It’s really
> useful for unit testing Spark apps as it handles all the bootstrapping for
> you.
>
> https://github.com/holdenk/spark-testing-base
>
> DataFrame examples are here:
> https://github.com/holdenk/spark-testing-base/blob/master/src/test/1.3/scala/com/holdenkarau/spark/testing/SampleDataFrameTest.scala
>
> Thanks,
> Silvio
>
> From: Steve Annessa <steve.anne...@gmail.com>
> Date: Thursday, February 4, 2016 at 8:36 PM
> To: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Unit test with sqlContext
>
> I'm trying to unit test a function that reads in a JSON file, manipulates
> the DF and then returns a Scala Map.
>
> The function has signature:
> def ingest(dataLocation: String, sc: SparkContext, sqlContext: SQLContext)
>
> I've created a bootstrap spec for spark jobs that instantiates the Spark
> Context and SQLContext like so:
>
> @transient var sc: SparkContext = _
> @transient var sqlContext: SQLContext = _
>
> override def beforeAll = {
>   System.clearProperty("spark.driver.port")
>   System.clearProperty("spark.hostPort")
>
>   val conf = new SparkConf()
> .setMaster(master)
> .setAppName(appName)
>
>   sc = new SparkContext(conf)
>   sqlContext = new SQLContext(sc)
> }
>
> When I do not include sqlContext, my tests run. Once I add the sqlContext
> I get the following errors:
>
> 16/02/04 17:31:58 WARN SparkContext: Another SparkContext is being
> constructed (or threw an exception in its constructor).  This may indicate
> an error, since only one SparkContext may be running in this JVM (see
> SPARK-2243). The other SparkContext was created at:
> org.apache.spark.SparkContext.<init>(SparkContext.scala:81)
>
> 16/02/04 17:31:59 ERROR SparkContext: Error initializing SparkContext.
> akka.actor.InvalidActorNameException: actor name [ExecutorEndpoint] is not
> unique!
>
> and finally:
>
> [info] IngestSpec:
> [info] Exception encountered when attempting to run a suite with class
> name: com.company.package.IngestSpec *** ABORTED ***
> [info]   akka.actor.InvalidActorNameException: actor name
> [ExecutorEndpoint] is not unique!
>
>
> What do I need to do to get a sqlContext through my tests?
>
> Thanks,
>
> -- Steve
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Is there a test like MiniCluster example in Spark just like hadoop ?

2016-01-18 Thread Ted Yu
Please refer to the following suites:

yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala
core/src/test/scala/org/apache/spark/scheduler/SparkListenerWithClusterSuite.scala
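
For a lightweight approximation in your own tests, Spark's special
"local-cluster[N,cores,memoryMB]" master launches separate executor JVMs.
A rough sketch (local-cluster mode is primarily intended for Spark's own
tests and typically expects a locally built Spark distribution; the app
name is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

// Two executor JVMs, 1 core and 1024 MB each.
val conf = new SparkConf()
  .setMaster("local-cluster[2,1,1024]")
  .setAppName("mini-cluster-style-test")
val sc = new SparkContext(conf)
try {
  // Any distributed assertion will do; this just checks a simple sum.
  assert(sc.parallelize(1 to 100, 4).sum() == 5050)
} finally {
  sc.stop()
}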

Cheers

On Mon, Jan 18, 2016 at 2:14 AM, zml张明磊 <mingleizh...@ctrip.com> wrote:

> Hello,
>
>
>
>    I want to find some test files in Spark which support the same
> functionality as the Hadoop MiniCluster test environment, but I cannot find
> them. Does anyone know about that?
>


Is there a test like MiniCluster example in Spark just like hadoop ?

2016-01-18 Thread zml张明磊
Hello,

   I want to find some test files in Spark which support the same functionality 
as the Hadoop MiniCluster test environment, but I cannot find them. Does anyone 
know about that?


livy test problem: Failed to execute goal org.scalatest:scalatest-maven-plugin:1.0:test (test) on project livy-spark_2.10: There are test failures

2016-01-14 Thread Ruslan Dautkhanov
The Livy build test from master fails with the problem below. I can't track it down.

YARN shows the Livy Spark YARN application as running,
although an attempt to connect to the application master shows connection refused:

HTTP ERROR 500
> Problem accessing /proxy/application_1448640910222_0046/. Reason:
> Connection refused
> Caused by:
> java.net.ConnectException: Connection refused
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
> at
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
> at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:579)


I'm not sure whether the Livy server has an application master UI.

CDH 5.5.1.

Below is the mvn test output footer:



> [INFO]
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] livy-main .. SUCCESS [
>  1.299 s]
> [INFO] livy-api_2.10 .. SUCCESS [
>  3.622 s]
> [INFO] livy-client-common_2.10  SUCCESS [
>  0.862 s]
> [INFO] livy-client-local_2.10 . SUCCESS [
> 23.866 s]
> [INFO] livy-core_2.10 . SUCCESS [
>  0.316 s]
> [INFO] livy-repl_2.10 . SUCCESS [01:00
> min]
> [INFO] livy-yarn_2.10 . SUCCESS [
>  0.215 s]
> [INFO] livy-spark_2.10  FAILURE [
> 17.382 s]
> [INFO] livy-server_2.10 ... SKIPPED
> [INFO] livy-assembly_2.10 . SKIPPED
> [INFO] livy-client-http_2.10 .. SKIPPED
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 01:48 min
> [INFO] Finished at: 2016-01-14T14:34:28-07:00
> [INFO] Final Memory: 27M/453M
> [INFO]
> --------
> [ERROR] Failed to execute goal
> org.scalatest:scalatest-maven-plugin:1.0:test (test) on project
> livy-spark_2.10: There are test failures -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal org.scalatest:scalatest-maven-plugin:1.0:test (test) on project
> livy-spark_2.10: There are test failures
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
> at
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
> at
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
> at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
>     at
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> Caused by: org.apache.maven.plugin.MojoFailureException: There are test
> failures
> at org.scalatest.tools.maven.TestMojo.execute(TestMojo.java:107)
> at
> org.apache.maven.plu

Re: How to test https://issues.apache.org/jira/browse/SPARK-10648 fix

2015-12-03 Thread Madabhattula Rajesh Kumar
Hi JB and Ted,

Thank you very much for the steps

Regards,
Rajesh

On Thu, Dec 3, 2015 at 8:16 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> See this thread for Spark 1.6.0 RC1
>
>
> http://search-hadoop.com/m/q3RTtKdUViYHH1b1=+VOTE+Release+Apache+Spark+1+6+0+RC1+
>
> Cheers
>
> On Thu, Dec 3, 2015 at 12:39 AM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi Team,
>>
>> Looks like this issue is fixed in 1.6 release. How to test this fix? Is
>> any jar is available? So I can add that jar in dependency and test this
>> fix. (Or) Any other way, I can test this fix in 1.15.2 code base.
>>
>> Could you please let me know the steps. Thank you for your support
>>
>> Regards,
>> Rajesh
>>
>
>


How to test https://issues.apache.org/jira/browse/SPARK-10648 fix

2015-12-03 Thread Madabhattula Rajesh Kumar
Hi Team,

Looks like this issue is fixed in the 1.6 release. How can I test this fix? Is any
jar available, so that I can add it as a dependency and test the fix? Or is there
any other way I can test this fix against the 1.5.2 code base?

Could you please let me know the steps? Thank you for your support.

Regards,
Rajesh


Re: How to test https://issues.apache.org/jira/browse/SPARK-10648 fix

2015-12-03 Thread Jean-Baptiste Onofré

Hi Rajesh,

you can check out the codebase and build it yourself in order to test:

git clone https://git-wip-us.apache.org/repos/asf/spark
cd spark
mvn clean package -DskipTests

You will have bin, sbin and conf folders to try it.

Regards
JB

On 12/03/2015 09:39 AM, Madabhattula Rajesh Kumar wrote:

Hi Team,

Looks like this issue is fixed in 1.6 release. How to test this fix? Is
any jar is available? So I can add that jar in dependency and test this
fix. (Or) Any other way, I can test this fix in 1.15.2 code base.

Could you please let me know the steps. Thank you for your support

Regards,
Rajesh


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to test https://issues.apache.org/jira/browse/SPARK-10648 fix

2015-12-03 Thread Ted Yu
See this thread for Spark 1.6.0 RC1

http://search-hadoop.com/m/q3RTtKdUViYHH1b1=+VOTE+Release+Apache+Spark+1+6+0+RC1+

Cheers

On Thu, Dec 3, 2015 at 12:39 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:

> Hi Team,
>
> Looks like this issue is fixed in 1.6 release. How to test this fix? Is
> any jar is available? So I can add that jar in dependency and test this
> fix. (Or) Any other way, I can test this fix in 1.15.2 code base.
>
> Could you please let me know the steps. Thank you for your support
>
> Regards,
> Rajesh
>


Re: how to run unit test for specific component only

2015-11-13 Thread Steve Loughran
try:

mvn test -pl sql  -DwildcardSuites=org.apache.spark.sql -Dtest=none




On 12 Nov 2015, at 03:13, weoccc <weo...@gmail.com> wrote:

Hi,

I am wondering how to run unit tests for a specific Spark component only.

mvn test -DwildcardSuites="org.apache.spark.sql.*" -Dtest=none

The above command doesn't seem to work. I'm using spark 1.5.

Thanks,

Weide




Re: how to run unit test for specific component only

2015-11-11 Thread Ted Yu
Have you tried the following ?

build/sbt "sql/test-only *"

Cheers

On Wed, Nov 11, 2015 at 7:13 PM, weoccc <weo...@gmail.com> wrote:

> Hi,
>
> I am wondering how to run unit tests for a specific Spark component only.
>
> mvn test -DwildcardSuites="org.apache.spark.sql.*" -Dtest=none
>
> The above command doesn't seem to work. I'm using spark 1.5.
>
> Thanks,
>
> Weide
>
>
>


how to run unit test for specific component only

2015-11-11 Thread weoccc
Hi,

I am wondering how to run unit tests for a specific Spark component only.

mvn test -DwildcardSuites="org.apache.spark.sql.*" -Dtest=none

The above command doesn't seem to work. I'm using spark 1.5.

Thanks,

Weide


Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier

2015-09-09 Thread Terry Hole
C$$iwC$$iwC.(:58)
> >> > at
> >> > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:60)
> >> > at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:62)
> >> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:64)
> >> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:66)
> >> > at $iwC$$iwC$$iwC$$iwC$$iwC.(:68)
> >> > at $iwC$$iwC$$iwC$$iwC.(:70)
> >> > at $iwC$$iwC$$iwC.(:72)
> >> > at $iwC$$iwC.(:74)
> >> > at $iwC.(:76)
> >> > at (:78)
> >> > at .(:82)
> >> > at .()
> >> > at .(:7)
> >> > at .()
> >> > at $print()
> >> >
> >> > Thanks!
> >> > - Terry
> >> >
> >> > On Sun, Sep 6, 2015 at 4:53 PM, Sean Owen <so...@cloudera.com> wrote:
> >> >>
> >> >> I think somewhere along the line you've not specified your label
> >> >> column -- it's defaulting to "label" and it does not recognize it, or
> >> >> at least not as a binary or nominal attribute.
> >> >>
> >> >> On Sun, Sep 6, 2015 at 5:47 AM, Terry Hole <hujie.ea...@gmail.com>
> >> >> wrote:
> >> >> > Hi, Experts,
> >> >> >
> >> >> > I followed the guide of spark ml pipe to test
> DecisionTreeClassifier
> >> >> > on
> >> >> > spark shell with spark 1.4.1, but always meets error like
> following,
> >> >> > do
> >> >> > you
> >> >> > have any idea how to fix this?
> >> >> >
> >> >> > The error stack:
> >> >> > java.lang.IllegalArgumentException: DecisionTreeClassifier was
> given
> >> >> > input
> >> >> > with invalid label column label, without the number of classes
> >> >> > specified.
> >> >> > See StringIndexer.
> >> >> > at
> >> >> >
> >> >> >
> >> >> >
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:71)
> >> >> > at
> >> >> >
> >> >> >
> >> >> >
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:41)
> >> >> > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> >> >> > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
> >> >> > at
> >> >> >
> org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:133)
> >> >> > at
> >> >> >
> org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:129)
> >> >> > at
> >> >> > scala.collection.Iterator$class.foreach(Iterator.scala:727)
> >> >> > at
> >> >> > scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> >> >> > at
> >> >> >
> >> >> >
> >> >> >
> scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
> >> >> > at
> >> >> >
> >> >> >
> >> >> >
> scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
> >> >> > at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:129)
> >> >> > at
> >> >> >
> >> >> >
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42)
> >> >> > at
> >> >> > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
> >> >> > at
> >> >> > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
> >> >> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
> >> >> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
> >> >> > at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
> >> >> > at $iwC$$iwC$$iwC$$iwC.(:57)
> >> >> > at $iwC$$iwC$$iwC.(:59)
> >> >> > at $iwC$$iwC.(:61)
> >> >> > at $iwC.(:63)
> >> >> > at (:65)
> >> >> > at .(:69)
> >> >> > at .()
> >> >> > at .(:7)
> >> >> > at .()
> >> >> > at $print()
> >> >> >
> >> >> > The execute code is:
> >> >> > // Labeled and unlabeled instance types.
> >> >> > // Spark SQL can infer schema from case classes.
> >> >> > case class LabeledDocument(id: Long, text: String, label: Double)
> >> >> > case class Document(id: Long, text: String)
> >> >> > // Prepare training documents, which are labeled.
> >> >> > val training = sc.parallelize(Seq(
> >> >> >   LabeledDocument(0L, "a b c d e spark", 1.0),
> >> >> >   LabeledDocument(1L, "b d", 0.0),
> >> >> >   LabeledDocument(2L, "spark f g h", 1.0),
> >> >> >   LabeledDocument(3L, "hadoop mapreduce", 0.0)))
> >> >> >
> >> >> > // Configure an ML pipeline, which consists of three stages:
> >> >> > tokenizer,
> >> >> > hashingTF, and lr.
> >> >> > val tokenizer = new
> >> >> > Tokenizer().setInputCol("text").setOutputCol("words")
> >> >> > val hashingTF = new
> >> >> >
> >> >> >
> >> >> >
> HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
> >> >> > val lr =  new
> >> >> >
> >> >> >
> >> >> >
> DecisionTreeClassifier().setMaxDepth(5).setMaxBins(32).setImpurity("gini")
> >> >> > val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF,
> >> >> > lr))
> >> >> >
> >> >> > // Error raises from the following line
> >> >> > val model = pipeline.fit(training.toDF)
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
>


Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier

2015-09-07 Thread Terry Hole
Xiangrui,

Do you have any idea how to make this work?

Thanks
- Terry

Terry Hole <hujie.ea...@gmail.com> wrote on Sunday, September 6, 2015 at 17:41:

> Sean
>
> Do you know how to tell the decision tree that the "label" is binary, or how
> to set attributes on the dataframe to carry the number of classes?
>
> Thanks!
> - Terry
>
> On Sun, Sep 6, 2015 at 5:23 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> (Sean)
>> The error suggests that the type is not a binary or nominal attribute
>> though. I think that's the missing step. A double-valued column need
>> not be one of these attribute types.
>>
>> On Sun, Sep 6, 2015 at 10:14 AM, Terry Hole <hujie.ea...@gmail.com>
>> wrote:
>> > Hi, Owen,
>> >
>> > The dataframe "training" is from a RDD of case class:
>> RDD[LabeledDocument],
>> > while the case class is defined as this:
>> > case class LabeledDocument(id: Long, text: String, label: Double)
>> >
>> > So there is already has the default "label" column with "double" type.
>> >
>> > I already tried to set the label column for decision tree as this:
>> > val lr = new
>> >
>> DecisionTreeClassifier().setMaxDepth(5).setMaxBins(32).setImpurity("gini").setLabelCol("label")
>> > It raised the same error.
>> >
>> > I also tried to change the "label" to "int" type, it also reported error
>> > like following stack, I have no idea how to make this work.
>> >
>> > java.lang.IllegalArgumentException: requirement failed: Column label
>> must be
>> > of type DoubleType but was actually IntegerType.
>> > at scala.Predef$.require(Predef.scala:233)
>> > at
>> >
>> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:37)
>> > at
>> >
>> org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
>> > at
>> >
>> org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
>> > at
>> > org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
>> > at
>> >
>> org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
>> > at
>> >
>> org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
>> > at
>> >
>> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>> > at
>> >
>> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>> > at
>> > scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
>> > at
>> org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:162)
>> > at
>> > org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:59)
>> > at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:116)
>> > at
>> >
>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>> > at
>> >
>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:56)
>> > at
>> > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:58)
>> > at
>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:60)
>> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:62)
>> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:64)
>> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:66)
>> > at $iwC$$iwC$$iwC$$iwC$$iwC.(:68)
>> > at $iwC$$iwC$$iwC$$iwC.(:70)
>> > at $iwC$$iwC$$iwC.(:72)
>> > at $iwC$$iwC.(:74)
>> > at $iwC.(:76)
>> > at (:78)
>> > at .(:82)
>> > at .()
>> > at .(:7)
>> > at .()
>> > at $print()
>> >
>> > Thanks!
>> > - Terry
>> >
>> > On Sun, Sep 6, 2015 at 4:53 PM, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> I think somewhere along the line you've not specified your label
>> >> column -- it's defaulting to "label" and it does not recognize it, or
>> >> at least not as a binary or nominal attribute.
>> >>
>> >> On Sun, Sep 6, 2015 at 5:47 AM, Terry Hole <hujie.ea...@gmail.com>
>> wrote:
>> >> > Hi, Experts,
>> >> >
>> >> > I followed the guide of spark ml pipe to test DecisionTr

Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier

2015-09-06 Thread Terry Hole
Hi, Owen,

The dataframe "training" is from a RDD of case class: RDD[LabeledDocument],
while the case class is defined as this:
case class LabeledDocument(id: Long, text: String, *label: Double*)

So it already has the default "label" column with "double" type.

I already tried to set the label column for decision tree as this:
val lr = new
DecisionTreeClassifier().setMaxDepth(5).setMaxBins(32).setImpurity("gini").setLabelCol("label")
It raised the same error.

I also tried to change the "label" to "int" type; it also reported an error
like the following stack. I have no idea how to make this work.

java.lang.IllegalArgumentException: requirement failed: *Column label must
be of type DoubleType but was actually IntegerType*.
at scala.Predef$.require(Predef.scala:233)
at
org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:37)
at
org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
at
org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
at
org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
at
org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
at
org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
at
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at
scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:162)
at
org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:59)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:116)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:56)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:58)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:60)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:62)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:64)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:66)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:68)
at $iwC$$iwC$$iwC$$iwC.(:70)
at $iwC$$iwC$$iwC.(:72)
at $iwC$$iwC.(:74)
at $iwC.(:76)
at (:78)
at .(:82)
at .()
at .(:7)
at .()
at $print()

Thanks!
- Terry

On Sun, Sep 6, 2015 at 4:53 PM, Sean Owen <so...@cloudera.com> wrote:

> I think somewhere along the line you've not specified your label
> column -- it's defaulting to "label" and it does not recognize it, or
> at least not as a binary or nominal attribute.
>
> On Sun, Sep 6, 2015 at 5:47 AM, Terry Hole <hujie.ea...@gmail.com> wrote:
> > Hi, Experts,
> >
> > I followed the guide of spark ml pipe to test DecisionTreeClassifier on
> > spark shell with spark 1.4.1, but always meets error like following, do
> you
> > have any idea how to fix this?
> >
> > The error stack:
> > java.lang.IllegalArgumentException: DecisionTreeClassifier was given
> input
> > with invalid label column label, without the number of classes specified.
> > See StringIndexer.
> > at
> >
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:71)
> > at
> >
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:41)
> > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
> > at
> > org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:133)
> > at
> > org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:129)
> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> > at
> >
> scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
> > at
> >
> scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
> > at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:129)
> > at
> > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42)
> > at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
> > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
> > at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
> > at $iwC$$iwC$$iwC$$iwC.(:57

Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier

2015-09-06 Thread Sean Owen
I think somewhere along the line you've not specified your label
column -- it's defaulting to "label" and it does not recognize it, or
at least not as a binary or nominal attribute.
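
As the exception message itself hints ("See StringIndexer"), one way to get
the class-count metadata onto the label is to put a StringIndexer stage in
front of the classifier and train on its indexed output. A rough sketch
against the 1.4 ML API (column names are taken from the original snippet;
"indexedLabel" is just a name chosen here, and on some versions StringIndexer
expects a string input column, so you may need to cast the double label to a
string first):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{HashingTF, StringIndexer, Tokenizer}

// StringIndexer attaches nominal attribute metadata (including the number
// of classes) to its output column, which DecisionTreeClassifier requires.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setMaxDepth(5)
  .setMaxBins(32)
  .setImpurity("gini")

val pipeline = new Pipeline().setStages(Array(labelIndexer, tokenizer, hashingTF, dt))
val model = pipeline.fit(training.toDF)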

On Sun, Sep 6, 2015 at 5:47 AM, Terry Hole <hujie.ea...@gmail.com> wrote:
> Hi, Experts,
>
> I followed the Spark ML pipeline guide to test DecisionTreeClassifier in the
> spark shell with Spark 1.4.1, but it always hits an error like the following.
> Do you have any idea how to fix this?
>
> The error stack:
> java.lang.IllegalArgumentException: DecisionTreeClassifier was given input
> with invalid label column label, without the number of classes specified.
> See StringIndexer.
> at
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:71)
> at
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:41)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
> at
> org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:133)
> at
> org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:129)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
> at
> scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
> at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:129)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
> at $iwC$$iwC$$iwC$$iwC.(:57)
> at $iwC$$iwC$$iwC.(:59)
> at $iwC$$iwC.(:61)
> at $iwC.(:63)
> at (:65)
> at .(:69)
> at .()
> at .(:7)
> at .()
> at $print()
>
> The execute code is:
> // Labeled and unlabeled instance types.
> // Spark SQL can infer schema from case classes.
> case class LabeledDocument(id: Long, text: String, label: Double)
> case class Document(id: Long, text: String)
> // Prepare training documents, which are labeled.
> val training = sc.parallelize(Seq(
>   LabeledDocument(0L, "a b c d e spark", 1.0),
>   LabeledDocument(1L, "b d", 0.0),
>   LabeledDocument(2L, "spark f g h", 1.0),
>   LabeledDocument(3L, "hadoop mapreduce", 0.0)))
>
> // Configure an ML pipeline, which consists of three stages: tokenizer,
> hashingTF, and lr.
> val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
> val hashingTF = new
> HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
> val lr =  new
> DecisionTreeClassifier().setMaxDepth(5).setMaxBins(32).setImpurity("gini")
> val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
>
> // Error raises from the following line
> val model = pipeline.fit(training.toDF)
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier

2015-09-06 Thread Sean Owen
(Sean)
The error suggests that the type is not a binary or nominal attribute
though. I think that's the missing step. A double-valued column need
not be one of these attribute types.

On Sun, Sep 6, 2015 at 10:14 AM, Terry Hole <hujie.ea...@gmail.com> wrote:
> Hi, Owen,
>
> The dataframe "training" is from a RDD of case class: RDD[LabeledDocument],
> while the case class is defined as this:
> case class LabeledDocument(id: Long, text: String, label: Double)
>
> So there is already has the default "label" column with "double" type.
>
> I already tried to set the label column for decision tree as this:
> val lr = new
> DecisionTreeClassifier().setMaxDepth(5).setMaxBins(32).setImpurity("gini").setLabelCol("label")
> It raised the same error.
>
> I also tried to change the "label" to "int" type, it also reported error
> like following stack, I have no idea how to make this work.
>
> java.lang.IllegalArgumentException: requirement failed: Column label must be
> of type DoubleType but was actually IntegerType.
> at scala.Predef$.require(Predef.scala:233)
> at
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:37)
> at
> org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
> at
> org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
> at
> org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
> at
> org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
> at
> org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
> at
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
> at
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
> at
> scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
> at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:162)
> at
> org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:59)
> at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:116)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:56)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:58)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:60)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:62)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:64)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:66)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:68)
> at $iwC$$iwC$$iwC$$iwC.(:70)
> at $iwC$$iwC$$iwC.(:72)
> at $iwC$$iwC.(:74)
> at $iwC.(:76)
> at (:78)
> at .(:82)
> at .()
> at .(:7)
> at .()
> at $print()
>
> Thanks!
> - Terry
>
> On Sun, Sep 6, 2015 at 4:53 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> I think somewhere along the line you've not specified your label
>> column -- it's defaulting to "label" and it does not recognize it, or
>> at least not as a binary or nominal attribute.
>>
>> On Sun, Sep 6, 2015 at 5:47 AM, Terry Hole <hujie.ea...@gmail.com> wrote:
>> > Hi, Experts,
>> >
>> > I followed the guide of spark ml pipe to test DecisionTreeClassifier on
>> > spark shell with spark 1.4.1, but always meets error like following, do
>> > you
>> > have any idea how to fix this?
>> >
>> > The error stack:
>> > java.lang.IllegalArgumentException: DecisionTreeClassifier was given
>> > input
>> > with invalid label column label, without the number of classes
>> > specified.
>> > See StringIndexer.
>> > at
>> >
>> > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:71)
>> > at
>> >
>> > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:41)
>> > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>> > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>> > at
>> > org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:133)
>> > at
>> > org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:129)
>> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> > at
>> > scala.collection.AbstractIterator.foreach(Iterator.scala
