Contineous errors trying to start spark-shell

2015-07-03 Thread Mohamed Lrhazi
Hello,

I am trying to just start spark-shell... it starts, the prompt appears,
then a never ending (literally) stream of these log lines proceeds
What is it trying to do? Why is it failing?

To start it I do:

$ docker run -it ncssm/spark-base /spark/bin/spark-shell --master spark://
devzero.cs.georgetown.edu:7077




The log lines look like:


5/07/04 00:36:36 INFO Worker: Asked to launch executor
app-20150704003631-0004/45 for Spark shell
15/07/04 00:36:36 INFO ExecutorRunner: Launch command:
"/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java" "-cp"
"/spark/sbin/../conf/:/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar:/spark/lib/datanucleus-api-jdo-3.2.6.jar:/spark/lib/datanucleus-rdbms-3.2.9.jar:/spark/lib/datanucleus-core-3.2.10.jar"
"-Xms512M" "-Xmx512M" "-Dspark.driver.port=50266"
"org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url"
"akka.tcp://sparkDriver@172.17.0.45:50266/user/CoarseGrainedScheduler"
"--executor-id" "45" "--hostname" "10.212.55.41" "--cores" "16" "--app-id"
"app-20150704003631-0004" "--worker-url" "akka.tcp://
sparkWorker@10.212.55.41:7078/user/Worker"
15/07/04 00:36:39 INFO Worker: Executor app-20150704003631-0004/45 finished
with state EXITED message Command exited with code 1 exitStatus 1


On an example worker node, I see a corresponding unending stream of errors:

15/07/04 00:36:31 INFO Worker: Asked to launch executor
app-20150704003631-0004/7 for Spark shell
15/07/04 00:36:31 INFO ExecutorRunner: Launch command:
"/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java" "-cp"
"/spark/sbin/../conf/:/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar:/spark/lib/datanucleus-api-jdo-3.2.6.jar:/spark/lib/datanucleus-rdbms-3.2.9.jar:/spark/lib/datanucleus-core-3.2.10.jar"
"-Xms512M" "-Xmx512M" "-Dspark.driver.port=50266"
"org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url"
"akka.tcp://sparkDriver@172.17.0.45:50266/user/CoarseGrainedScheduler"
"--executor-id" "7" "--hostname" "10.212.55.41" "--cores" "16" "--app-id"
"app-20150704003631-0004" "--worker-url" "akka.tcp://
sparkWorker@10.212.55.41:7078/user/Worker"
15/07/04 00:36:36 INFO Worker: Executor app-20150704003631-0004/7 finished
with state EXITED message Command exited with code 1 exitStatus 1


Thanks,
Mohamed.


Re: Are Spark Streaming RDDs always processed in order?

2015-07-03 Thread Raghavendra Pandey
I dont think you can expect any order guarantee except the records in one
partition.
 On Jul 4, 2015 7:43 AM, "khaledh"  wrote:

> I'm writing a Spark Streaming application that uses RabbitMQ to consume
> events. One feature of RabbitMQ that I intend to make use of is bulk ack of
> messages, i.e. no need to ack one-by-one, but only ack the last event in a
> batch and that would ack the entire batch.
>
> Before I commit to doing so, I'd like to know if Spark Streaming always
> processes RDDs in the same order they arrive in, i.e. if RDD1 arrives
> before
> RDD2, is it true that RDD2 will never be scheduled/processed before RDD1 is
> finished?
>
> This is crucial to the ack logic, since if RDD2 can be potentially
> processed
> while RDD1 is still being processed, then if I ack the the last event in
> RDD2 that would also ack all events in RDD1, even though they may have not
> been completely processed yet.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Are-Spark-Streaming-RDDs-always-processed-in-order-tp23616.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to timeout a task?

2015-07-03 Thread William Ferrell
Ted,

Thanks very much for your reply. It took me almost a week but I have
finally had a chance to implement what you noted and it appears to be
working locally. However, when I launch this onto a cluster on EC2 -- this
doesn't work reliably.

To expand, I think the issue is that some of the code we have takes the
python GIL and hence no internal timeout will work. That is why I was
hoping to learn of a task level timeout -- something at the Spark level --
the management level -- such that it can decide a task has taken to long
and just kill it and move on.

Does this make sense?  Are you familiar with any such options?

Best,

- Bill


On Sat, Jun 27, 2015 at 9:26 AM, Ted Yu  wrote:

> Have you looked at:
>
> http://stackoverflow.com/questions/2281850/timeout-function-if-it-takes-too-long-to-finish
>
> FYI
>
> On Sat, Jun 27, 2015 at 8:33 AM, wasauce  wrote:
>
>> Hello!
>>
>> We use pyspark to run a set of data extractors (think regex). The
>> extractors
>> (regexes) generally run quite quickly and find a few matches which are
>> returned and stored into a database.
>>
>> My question is -- is it possible to make the function that runs the
>> extractors have a timeout? I.E. if for a given file the extractor runs for
>> more than X seconds it terminates and returns a default value?
>>
>> Here is a code snippet of what we are doing with some comments as to which
>> function I am looking to timeout.
>>
>> code: https://gist.github.com/wasauce/42a956a1371a2b564918
>>
>> Thank you
>>
>> - Bill
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-timeout-a-task-tp23513.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: SparkR and Spark Mlib

2015-07-03 Thread ayan guha
No. Spark R is language binding for spark. MLlib is machine learning
project on top of spark core
On 4 Jul 2015 12:23, "praveen S"  wrote:

> Hi,
> Is sparkR and spark Mlib same?
>


SparkR and Spark Mlib

2015-07-03 Thread praveen S
Hi,
Is sparkR and spark Mlib same?


Are Spark Streaming RDDs always processed in order?

2015-07-03 Thread khaledh
I'm writing a Spark Streaming application that uses RabbitMQ to consume
events. One feature of RabbitMQ that I intend to make use of is bulk ack of
messages, i.e. no need to ack one-by-one, but only ack the last event in a
batch and that would ack the entire batch.

Before I commit to doing so, I'd like to know if Spark Streaming always
processes RDDs in the same order they arrive in, i.e. if RDD1 arrives before
RDD2, is it true that RDD2 will never be scheduled/processed before RDD1 is
finished?

This is crucial to the ack logic, since if RDD2 can be potentially processed
while RDD1 is still being processed, then if I ack the the last event in
RDD2 that would also ack all events in RDD1, even though they may have not
been completely processed yet.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Are-Spark-Streaming-RDDs-always-processed-in-order-tp23616.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.4 MLLib Bug?: Multiclass Classification "requirement failed: sizeInBytes was negative"

2015-07-03 Thread Burak Yavuz
How many partitions do you have? It might be that one partition is too
large, and there is Integer overflow. Could you double your number of
partitions?

Burak

On Fri, Jul 3, 2015 at 4:41 AM, Danny  wrote:

> hi,
>
> i want to run a multiclass classification with 390 classes on120k label
> points(tf-idf vectors). but i get the following exception. If i reduce the
> number of classes to ~20 everythings work fine. How can i fix this?
>
>  i use the LogisticRegressionWithLBFGS class for my classification on a 8
> Node Cluster with
>
>
> total-executor-cores = 30
>
> executor-memory = 20g
>
> My Exception:
>
> 15/07/02 15:55:00 INFO DAGScheduler: Job 11 finished: count at
> LBFGS.scala:170, took 0,521823 s
> 15/07/02 15:55:02 INFO MemoryStore: ensureFreeSpace(-1069858488) called
> with
> curMem=308280107, maxMem=3699737
> 15/07/02 15:55:02 INFO MemoryStore: Block broadcast_22 stored as values in
> memory (estimated size -1069858488.0 B, free 11.1 GB)
> Exception in thread "main" java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
> at
> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> Caused by: java.lang.IllegalArgumentException: requirement failed:
> sizeInBytes was negative: -1069858488
> at scala.Predef$.require(Predef.scala:233)
> at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
> at
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:812)
> at
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:635)
> at
> org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:993)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
> at
>
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> at
>
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
> at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289)
> at
>
> org.apache.spark.mllib.optimization.LBFGS$CostFun.calculate(LBFGS.scala:215)
> at
>
> org.apache.spark.mllib.optimization.LBFGS$CostFun.calculate(LBFGS.scala:204)
> at
> breeze.optimize.CachedDiffFunction.calculate(CachedDiffFunction.scala:23)
> at
>
> breeze.optimize.FirstOrderMinimizer.calculateObjective(FirstOrderMinimizer.scala:108)
> at
>
> breeze.optimize.FirstOrderMinimizer.initialState(FirstOrderMinimizer.scala:101)
> at
>
> breeze.optimize.FirstOrderMinimizer.iterations(FirstOrderMinimizer.scala:146)
> at
> org.apache.spark.mllib.optimization.LBFGS$.runLBFGS(LBFGS.scala:178)
> at
> org.apache.spark.mllib.optimization.LBFGS.optimize(LBFGS.scala:117)
> at
>
> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:282)
> at
>
> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:205)
> at
>
> com.test.spark.SVMSimpleAppEC2$.createNaiveBayesModel(SVMSimpleAppEC2.scala:150)
> at com.test.spark.SVMSimpleAppEC2$.main(SVMSimpleAppEC2.scala:48)
> at com.test.spark.SVMSimpleAppEC2.main(SVMSimpleAppEC2.scala)
> ... 6 more
> 15/07/02 15:55:02 INFO SparkContext: Invoking stop() from shutdown hook
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-4-MLLib-Bug-Multiclass-Classification-requirement-failed-sizeInBytes-was-negative-tp23610.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Optimizations

2015-07-03 Thread Marius Danciu
Thanks for your feedback. Yes I am aware of stages design and Silvio what
you are describing is essentially map-side join which is not applicable
when you have both RDDs quite large.

It appears that

rdd.join(...).mapToPair(f)
f is piggybacked inside join stage  (right in the reducers I believe)

whereas

rdd.join(...).mapPartitionToPair( f )

f is executed in a different stage. This is surprising because at least
intuitively the difference between mapToPair and mapPartitionToPair is that
that former is about the push model whereas the latter is about polling
records out of the iterator (*I suspect there are other technical reasons*).
If anyone know the depths of the problem if would be of great help.

Best,
Marius

On Fri, Jul 3, 2015 at 6:43 PM Silvio Fiorito 
wrote:

>   One thing you could do is a broadcast join. You take your smaller RDD,
> save it as a broadcast variable. Then run a map operation to perform the
> join and whatever else you need to do. This will remove a shuffle stage but
> you will still have to collect the joined RDD and broadcast it. All depends
> on the size of your data if it’s worth it or not.
>
>   From: Marius Danciu
> Date: Friday, July 3, 2015 at 3:13 AM
> To: user
> Subject: Optimizations
>
>   Hi all,
>
>  If I have something like:
>
>  rdd.join(...).mapPartitionToPair(...)
>
>  It looks like mapPartitionToPair runs in a different stage then join. Is
> there a way to piggyback this computation inside the join stage ? ... such
> that each result partition after join is passed to
> the mapPartitionToPair function, all running in the same state without any
> other costs.
>
>  Best,
> Marius
>


Re: build spark 1.4 source code for sparkR with maven

2015-07-03 Thread Shivaram Venkataraman
You need to add -Psparkr to build SparkR code

Shivaram

On Fri, Jul 3, 2015 at 2:14 AM, Akhil Das 
wrote:

> Did you try:
>
> build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
>
>
>
> Thanks
> Best Regards
>
> On Fri, Jul 3, 2015 at 2:27 PM, 1106944...@qq.com <1106944...@qq.com>
> wrote:
>
>> Hi all,
>>Anyone  build spark 1.4 source code  for sparkR with maven/sbt, what's
>> comand ?  using sparkR must build from source code  about 1.4 version .
>> thank you
>>
>> --
>> 1106944...@qq.com
>>
>
>


Experience with centralised logging for Spark?

2015-07-03 Thread Edward Sargisson
Hi all,
I'm wondering if anybody as any experience with centralised logging for
Spark - or even has felt that there was  need for this given the WebUI.

At my organization we use Log4j2 and Flume as the front end of our
centralised logging system. I was looking into modifying Spark to use that
system and I'm reconsidering my approach. I thought I'd ask the community
to see what people have tried.

Log4j2 is important because it works nicely with Flume. The problem I've
got is that all of the Spark processes (master, worker, spark-submit) use
the same conf directory and so would get the same log4j2.xml. This then
means that they would try and use the same directory for the file channel
(which will fail because Flume locks its directory). Secondly, if I want to
add an interceptor to stamp every event with the component name then I
cannot tell the difference between the components - everything would get
'apache-spark'.

This could be fixed by modifying the start up scripts to pass the component
name around; but that's more modification than I really want to make.

So are people generally  happy with the WebUI approach for getting access
to stderr and stdout or have other peopled rolled better solutions?

Yes, I'm aware of https://issues.apache.org/jira/browse/SPARK-6305 and the
associated pull request.

Many thanks, in advance, for your thoughts.

Cheers,
Edward


Re: duplicate names in sql allowed?

2015-07-03 Thread Koert Kuipers
https://issues.apache.org/jira/browse/SPARK-8817

On Fri, Jul 3, 2015 at 11:43 AM, Koert Kuipers  wrote:

> i see the relaxation to allow duplicate field names was done on purpose,
> since some data sources can have dupes due to case insensitive resolution.
>
> apparently the issue is now dealt with during query analysis.
>
> although this might work for sql it does not seem a good thing for
> DataFrame to me. it seems desirable that a DataFrame should have unique
> column names. not having this guarantee will complicate building other DSLs
> on top of DataFrame (this is how i ran into this issue). its also
> counterintuitive... do R dataframes and pandas allow dupes in column names?
>  On Jul 3, 2015 3:27 AM, "Akhil Das"  wrote:
>
>> I think you can open up a jira, not sure if this PR
>>  (SPARK-2890
>> ) broke the validation
>> piece.
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Jul 3, 2015 at 4:29 AM, Koert Kuipers  wrote:
>>
>>> i am surprised this is allowed...
>>>
>>> scala> sqlContext.sql("select name as boo, score as boo from
>>> candidates").schema
>>>
>>> res7: org.apache.spark.sql.types.StructType =
>>> StructType(StructField(boo,StringType,true),
>>> StructField(boo,IntegerType,true))
>>>
>>>
>>> should StructType check for duplicate field names?
>>>
>>
>>


Re: Spark SQL groupby timestamp

2015-07-03 Thread sim
@bastien, in those situations, I prefer to use Unix timestamps (millisecond
or second granularity) because you can apply math operations to them easily.
If you don't have a Unix timestamp, you can use unix_timestamp() from Hive
SQL to get one with second granularity.Then doing grouping by hour becomes
very simple:
select  3600*floor(timestamp/3600) as timestamp,  count(error) as
errors,from logsgroup by 3600*floor(timestamp/3600)
Hope this helps./Sim



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-groupby-timestamp-tp23470p23615.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-03 Thread sim
@bipin, in my case the error happens immediately in a fresh shell in 1.4.0.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595p23614.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: duplicate names in sql allowed?

2015-07-03 Thread Koert Kuipers
i see the relaxation to allow duplicate field names was done on purpose,
since some data sources can have dupes due to case insensitive resolution.

apparently the issue is now dealt with during query analysis.

although this might work for sql it does not seem a good thing for
DataFrame to me. it seems desirable that a DataFrame should have unique
column names. not having this guarantee will complicate building other DSLs
on top of DataFrame (this is how i ran into this issue). its also
counterintuitive... do R dataframes and pandas allow dupes in column names?
 On Jul 3, 2015 3:27 AM, "Akhil Das"  wrote:

> I think you can open up a jira, not sure if this PR
>  (SPARK-2890
> ) broke the validation
> piece.
>
> Thanks
> Best Regards
>
> On Fri, Jul 3, 2015 at 4:29 AM, Koert Kuipers  wrote:
>
>> i am surprised this is allowed...
>>
>> scala> sqlContext.sql("select name as boo, score as boo from
>> candidates").schema
>>
>> res7: org.apache.spark.sql.types.StructType =
>> StructType(StructField(boo,StringType,true),
>> StructField(boo,IntegerType,true))
>>
>>
>> should StructType check for duplicate field names?
>>
>
>


Re: Optimizations

2015-07-03 Thread Silvio Fiorito
One thing you could do is a broadcast join. You take your smaller RDD, save it 
as a broadcast variable. Then run a map operation to perform the join and 
whatever else you need to do. This will remove a shuffle stage but you will 
still have to collect the joined RDD and broadcast it. All depends on the size 
of your data if it’s worth it or not.

From: Marius Danciu
Date: Friday, July 3, 2015 at 3:13 AM
To: user
Subject: Optimizations

Hi all,

If I have something like:

rdd.join(...).mapPartitionToPair(...)

It looks like mapPartitionToPair runs in a different stage then join. Is there 
a way to piggyback this computation inside the join stage ? ... such that each 
result partition after join is passed to the mapPartitionToPair function, all 
running in the same state without any other costs.

Best,
Marius


SparkSQL cache table with multiple replicas

2015-07-03 Thread David Sabater Dinter
Hi all,
Do you know if there is an option to specify how many replicas we want
while caching in memory a table in SparkSQL Thrift server? I have not seen
any option so far but I assumed there is an option as you can see in the
Storage section of the UI that there is 1 x replica of your
Dataframe/Table...

I believe there can be a good use case on where you want to replicate a
dimension table across your nodes to improve response times when running
typical BI DWH types of queries (Just to avoid having to broadcast data
every time and again).

Do you think that would be a good addition to SparkSQL?



Regards.


Re: Optimizations

2015-07-03 Thread Raghavendra Pandey
This is the basic design of spark that it runs all actions in different
stages...
Not sure you can achieve what you r looking for.
On Jul 3, 2015 12:43 PM, "Marius Danciu"  wrote:

> Hi all,
>
> If I have something like:
>
> rdd.join(...).mapPartitionToPair(...)
>
> It looks like mapPartitionToPair runs in a different stage then join. Is
> there a way to piggyback this computation inside the join stage ? ... such
> that each result partition after join is passed to
> the mapPartitionToPair function, all running in the same state without any
> other costs.
>
> Best,
> Marius
>


Re: Streaming: updating broadcast variables

2015-07-03 Thread Raghavendra Pandey
You cannot update the broadcasted variable.. It wont get reflected on
workers.
On Jul 3, 2015 12:18 PM, "James Cole"  wrote:

> Hi all,
>
> I'm filtering a DStream using a function. I need to be able to change this
> function while the application is running (I'm polling a service to see if
> a user has changed their filtering). The filter function is a
> transformation and runs on the workers, so that's where the updates need to
> go. I'm not sure of the best way to do this.
>
> Initially broadcasting seemed like the way to go: the filter is actually
> quite large. But I don't think I can update something I've broadcasted.
> I've tried unpersisting and re-creating the broadcast variable but it
> became obvious this wasn't updating the reference on the worker. So am I
> correct in thinking I can't use broadcasted variables for this purpose?
>
> The next option seems to be: stopping the JavaStreamingContext, creating a
> new one from the SparkContext, updating the filter function, and
> re-creating the DStreams (I'm using direct streams from Kafka).
>
> If I re-created the JavaStreamingContext would the accumulators (which are
> created from the SparkContext) keep working? (Obviously I'm going to try
> this soon)
>
> In summary:
>
> 1) Can broadcasted variables be updated?
>
> 2) Is there a better way than re-creating the JavaStreamingContext and
> DStreams?
>
> Thanks,
>
> James
>
>


Re: Filter on Grouped Data

2015-07-03 Thread Raghavendra Pandey
Why dont you apply filter first and then Group the data and run
aggregations..
On Jul 3, 2015 1:29 PM, "Megha Sridhar- Cynepia" 
wrote:

> Hi,
>
>
> I have a Spark DataFrame object, which when trimmed, looks like,
>
>
>
> FromTo  SubjectMessage-ID
> karen@xyz.com['vance.me...@enron.com', SEC Inquiry
> <19952575.1075858>
>  'jeannie.mandel...@enron.com',
>  'mary.cl...@enron.com',
>  'sarah.pal...@enron.com']
>
>
>
> elyn.hug...@xyz.com['dennis.ve...@enron.com',Revised
> documents<33499184.1075858>
>  'gina.tay...@enron.com',
>  'kelly.kimbe...@enron.com']
> .
> .
> .
>
>
> I have run a groupBy("From") on the above dataFrame and obtained a
> GroupedData object as a result. I need to apply a filter on the grouped
> data (for instance, getting the sender who sent maximum number of the mails
> that were addressed to a particular receiver in the "To" list).
> Is there a way to accomplish this by applying filter on grouped data?
>
>
> Thanks,
> Megha
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark-csv into labeled points with null values

2015-07-03 Thread Saif.A.Ellafi
Hello all,

I am learning scala spark and going through some applications with data I have. 
Please allow me to put a couple questions:

spark-csv: The data I have, ain't malformed, but there are empty values in some 
rows, properly comma-sepparated and not catched by "DROPMALFORMED" mode
These values are taken into account as null values. My final mission is to 
create a LabeledPoint vector for MLLIB, so my steps are:
a.  load csv
b.  cast column types to have a proper DataFrame schema
c.  apply map() to create a LabeledPoint with denseVector. Using map( Row 
=> Row.getDouble(col_index) )

To this point:
res173: org.apache.spark.mllib.regression.LabeledPoint = 
(-1.530132691E9,[162.89431,13.55811,18.3346818,-1.6653182])

As running the following code:

  val model = new LogisticRegressionWithLBFGS().
  setNumClasses(2).
  setValidateData(true).
  run(data_map)

  java.lang.RuntimeException: Failed to check null bit for primitive double 
value.

Debugging this, I am pretty sure this is because rows that look like 
-2.593849123898,392.293891

Any suggestions to get round this?

Saif





Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-03 Thread Ted Yu
Alternatively, setting spark.driver.extraClassPath should work.

Cheers

On Fri, Jul 3, 2015 at 2:59 AM, Steve Loughran 
wrote:

>
>> On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv <
>> daniel.ha...@veracity-group.com> wrote:
>>
>>> Hi,
>>> I'm trying to start the thrift-server and passing it azure's blob
>>> storage jars but I'm failing on :
>>>  Caused by: java.io.IOException: No FileSystem for scheme: wasb
>>> at
>>> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
>>> at
>>> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
>>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>>> at
>>> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
>>> at
>>> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
>>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
>>> at
>>> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:342)
>>> ... 16 more
>>>
>>>  If I start the spark-shell the same way, everything works fine.
>>>
>>>  spark-shell command:
>>>   ./bin/spark-shell --master yarn --jars
>>> /home/hdiuser/azureclass/azure-storage-1.2.0.jar,/home/hdiuser/azureclass/hadoop-azure-2.7.0.jar
>>> --num-executors 4
>>>
>>>  thrift-server command:
>>>  ./sbin/start-thriftserver.sh --master yarn--jars
>>> /home/hdiuser/azureclass/azure-storage-1.2.0.jar,/home/hdiuser/azureclass/hadoop-azure-2.7.0.jar
>>> --num-executors 4
>>>
>>>  How can I pass dependency jars to the thrift server?
>>>
>>>  Thanks,
>>> Daniel
>>>
>>
>>
>
>  you should be able to add the JARs to the environment variable
> SPARK_SUBMIT_CLASSPATH or SPARK_CLASSPATH and have them picked up when
> bin/compute-classpath.{cmd.sh} builds up the classpath
>
>
>


Float type coercion on SparkR with hiveContext

2015-07-03 Thread Evgeny Sinelnikov
Hello,

I'm got a trouble with float type coercion on SparkR with hiveContext.

> result <- sql(hiveContext, "SELECT offset, percentage from data limit 100")

> show(result)
DataFrame[offset:float, percentage:float]

> head(result)
Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot coerce class ""jobj"" to a data.frame


This trouble looks like already exists (SPARK-2863 - Emulate Hive type
coercion in native reimplementations of Hive functions) with same
reason - not completed "native reimplementations of Hive..." not
"...functions" only.

So, anybody met in this issue before? And, how can I test it more
precisely if it not looks like a bug?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM

2015-07-03 Thread Kostas Kougios
I have this problem with a job. A random executor gets this

ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM

Almost always at the same point in the processing of the data. I am
processing 1 mil files with sc.wholeText. At around the 600.000th file, a
container receives this signal. On the driver i get:

15/07/03 14:20:11 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated
or disconnected! Shutting down. cruncher03.stratified:44617
15/07/03 14:20:11 ERROR cluster.YarnClusterScheduler: Lost executor 3 on
cruncher03.stratified: remote Rpc client disassociated
15/07/03 14:20:11 WARN remote.ReliableDeliverySupervisor: Association with
remote system [akka.tcp://sparkExecutor@cruncher03.stratified:44617] has
failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/07/03 14:20:11 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated
or disconnected! Shutting down. cruncher03.stratified:44617
15/07/03 14:20:11 INFO scheduler.TaskSetManager: Re-queueing tasks for 3
from TaskSet 5.0


There is plenty of memory on the machine and container jvm, so I don't think
it is an OOM (after all it would be a SIGKILL) or an OutOfMemory (there is
no out of mem exception)

What can be causing this?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ERROR-executor-CoarseGrainedExecutorBackend-RECEIVED-SIGNAL-15-SIGTERM-tp23613.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Kryo fails to serialise output

2015-07-03 Thread Will Briggs
Kryo serialization is used internally by Spark for spilling or shuffling 
intermediate results, not for writing out an RDD as an action. Look at Sandy 
Ryza's examples for some hints on how to do this: 
https://github.com/sryza/simplesparkavroapp

Regards,
Will

On July 3, 2015, at 2:45 AM, Dominik Hübner  wrote:

I have a rather simple avro schema to serialize Tweets (message, username, 
timestamp).
Kryo and twitter chill are used to do so.

For my dev environment the Spark context is configured as below

val conf: SparkConf = new SparkConf()
conf.setAppName("kryo_test")
conf.setMaster(“local[4]")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "co.feeb.TweetRegistrator”)

Serialization is setup with

override def registerClasses(kryo: Kryo): Unit = {
kryo.register(classOf[Tweet], 
AvroSerializer.SpecificRecordBinarySerializer[Tweet])
}

(This method gets called)


Using this configuration to persist some object fails with 
java.io.NotSerializableException: co.feeb.avro.Tweet 
(which seems to be ok as this class is not Serializable)

I used the following code:

val ctx: SparkContext = new SparkContext(conf)
val tweets: RDD[Tweet] = ctx.parallelize(List(
new Tweet("a", "b", 1L),
new Tweet("c", "d", 2L),
new Tweet("e", "f", 3L)
  )
)

tweets.saveAsObjectFile("file:///tmp/spark”)

Using saveAsTextFile works, but persisted files are not binary but JSON

cat /tmp/spark/part-0
{"username": "a", "text": "b", "timestamp": 1}
{"username": "c", "text": "d", "timestamp": 2}
{"username": "e", "text": "f", "timestamp": 3}

Is this intended behaviour, a configuration issue, avro serialisation not 
working in local mode or something else?





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming broadcast to all keys

2015-07-03 Thread Silvio Fiorito
updateStateByKey will run for all keys, whether they have new data in a batch 
or not so you should be able to still use it.



On 7/3/15, 7:34 AM, "micvog"  wrote:

>UpdateStateByKey is useful but what if I want to perform an operation to all
>existing keys (not only the ones in this RDD).
>
>Word count for example - is there a way to decrease *all* words seen so far
>by 1?
>
>I was thinking of keeping a static class per node with the count information
>and issuing a broadcast command to take a certain action, but could not find
>a broadcast-to-all-nodes functionality or a better way.
>
>Thanks,
>Michael
>
>
>
>-
>Michael Vogiatzis
>@mvogiatzis 
>--
>View this message in context: 
>http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-broadcast-to-all-keys-tp23609.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Spark performance issue

2015-07-03 Thread Silvio Fiorito
It’ll help to see the code or at least understand what transformations you’re 
using.

Also, you have 15 nodes but not using all of them, so that means you may be 
losing data locality. You can see this in the job UI for Spark if any jobs do 
not have node or process local.

From: diplomatic Guru
Date: Friday, July 3, 2015 at 8:58 AM
To: "user@spark.apache.org"
Subject: Spark performance issue

Hello guys,

I'm after some advice on Spark performance.

I've a MapReduce job that read inputs carry out a simple calculation and write 
the results into HDFS. I've implemented the same logic in Spark job.

When I tried both jobs on same datasets, I'm getting different execution time, 
which is expected.

BUT
..
In my example, MapReduce job is performing much better than Spark.

The difference is that I'm not changing much with the MR job configuration, 
e.g., memory, cores, etc...But this is not the case with Spark as it's very 
flexible. So I'm sure my configuration isn't correct which is why MR is 
outperforming Spark but need your advice.

For example:

Test 1:
4.5GB data -  MR job took ~55 seconds to compute, but Spark took ~3 minutes and 
20 seconds.

Test 2:
25GB data -MR took 2 minutes and 15 seconds, whereas Spark job is still 
running, and it's already been 15 minutes.


I have a cluster of 15 nodes. The maximum memory that I could allocate to each 
executor is 6GB. Therefore, for Test 1, this is the config I used:

--executor-memory 6G --num-executors 4 --driver-memory 6G  --executor-cores 2 
(also I set "spark.storage.memoryFraction" to 0.3)


For Test 2:
--executor-memory 6G --num-executors 10 --driver-memory 6G  --executor-cores 2 
(also I set "spark.storage.memoryFraction" to 0.3)

I tried all possible combination but couldn't get better performance. Any 
suggestions will be much appreciated.








Spark performance issue

2015-07-03 Thread diplomatic Guru
Hello guys,

I'm after some advice on Spark performance.

I've a MapReduce job that read inputs carry out a simple calculation and
write the results into HDFS. I've implemented the same logic in Spark job.

When I tried both jobs on same datasets, I'm getting different execution
time, which is expected.

BUT
..
In my example, MapReduce job is performing much better than Spark.

The difference is that I'm not changing much with the MR job configuration,
e.g., memory, cores, etc...But this is not the case with Spark as it's very
flexible. So I'm sure my configuration isn't correct which is why MR is
outperforming Spark but need your advice.

For example:

Test 1:
4.5GB data -  MR job took ~55 seconds to compute, but Spark took ~3 minutes
and 20 seconds.

Test 2:
25GB data -MR took 2 minutes and 15 seconds, whereas Spark job is still
running, and it's already been 15 minutes.


I have a cluster of 15 nodes. The maximum memory that I could allocate to
each executor is 6GB. Therefore, for Test 1, this is the config I used:

--executor-memory 6G --num-executors 4 --driver-memory 6G  --executor-cores
2 (also I set "spark.storage.memoryFraction" to 0.3)


For Test 2:
--executor-memory 6G --num-executors 10 --driver-memory 6G
 --executor-cores 2 (also I set "spark.storage.memoryFraction" to 0.3)

I tried all possible combination but couldn't get better performance. Any
suggestions will be much appreciated.


Spark 1.4 MLLib Bug?: Multiclass Classification "requirement failed: sizeInBytes was negative"

2015-07-03 Thread Danny
hi, 

i want to run a multiclass classification with 390 classes on120k label
points(tf-idf vectors). but i get the following exception. If i reduce the
number of classes to ~20 everythings work fine. How can i fix this?

 i use the LogisticRegressionWithLBFGS class for my classification on a 8
Node Cluster with 


total-executor-cores = 30

executor-memory = 20g

My Exception:

15/07/02 15:55:00 INFO DAGScheduler: Job 11 finished: count at
LBFGS.scala:170, took 0,521823 s
15/07/02 15:55:02 INFO MemoryStore: ensureFreeSpace(-1069858488) called with
curMem=308280107, maxMem=3699737
15/07/02 15:55:02 INFO MemoryStore: Block broadcast_22 stored as values in
memory (estimated size -1069858488.0 B, free 11.1 GB)
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at 
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.IllegalArgumentException: requirement failed:
sizeInBytes was negative: -1069858488
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:812)
at
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:635)
at 
org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:993)
at
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
at
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
at
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289)
at
org.apache.spark.mllib.optimization.LBFGS$CostFun.calculate(LBFGS.scala:215)
at
org.apache.spark.mllib.optimization.LBFGS$CostFun.calculate(LBFGS.scala:204)
at
breeze.optimize.CachedDiffFunction.calculate(CachedDiffFunction.scala:23)
at
breeze.optimize.FirstOrderMinimizer.calculateObjective(FirstOrderMinimizer.scala:108)
at
breeze.optimize.FirstOrderMinimizer.initialState(FirstOrderMinimizer.scala:101)
at
breeze.optimize.FirstOrderMinimizer.iterations(FirstOrderMinimizer.scala:146)
at org.apache.spark.mllib.optimization.LBFGS$.runLBFGS(LBFGS.scala:178)
at org.apache.spark.mllib.optimization.LBFGS.optimize(LBFGS.scala:117)
at
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:282)
at
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:205)
at
com.test.spark.SVMSimpleAppEC2$.createNaiveBayesModel(SVMSimpleAppEC2.scala:150)
at com.test.spark.SVMSimpleAppEC2$.main(SVMSimpleAppEC2.scala:48)
at com.test.spark.SVMSimpleAppEC2.main(SVMSimpleAppEC2.scala)
... 6 more
15/07/02 15:55:02 INFO SparkContext: Invoking stop() from shutdown hook



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-4-MLLib-Bug-Multiclass-Classification-requirement-failed-sizeInBytes-was-negative-tp23610.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Streaming broadcast to all keys

2015-07-03 Thread micvog
UpdateStateByKey is useful but what if I want to perform an operation to all
existing keys (not only the ones in this RDD).

Word count for example - is there a way to decrease *all* words seen so far
by 1?

I was thinking of keeping a static class per node with the count information
and issuing a broadcast command to take a certain action, but could not find
a broadcast-to-all-nodes functionality or a better way.

Thanks,
Michael



-
Michael Vogiatzis
@mvogiatzis 
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-broadcast-to-all-keys-tp23609.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-03 Thread bipin
I have a hunch I want to share: I feel that data is not being deallocated in
memory (at least like in 1.3). Once it goes in-memory it just stays there.

Spark SQL works fine, the same query when run on a new shell won't throw
that error, but when run on a shell which has been used for other queries
before, throws that error.

Also I read on the spark blog, that project Tungsten is making changes in
memory management. And first changes would land in 1.4. Maybe it is related
to that.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595p23608.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Accessing the console from spark

2015-07-03 Thread Jem Tucker
I have shown two senarios below:

// setup spark context

val user = readLine("username: ")
val pass = System.console.readPassword("password: ") <- null pointer
exception here

and

// setup spark context

val user = readLine("username: ")
val console = System.console <- null pointer exception here
val pass = console.readPassword("password: ")


thanks,

Jem


On Fri, Jul 3, 2015 at 11:04 AM Akhil Das 
wrote:

> Can you paste the code? Something is missing
>
> Thanks
> Best Regards
>
> On Fri, Jul 3, 2015 at 3:14 PM, Jem Tucker  wrote:
>
>> In the driver when running spark-submit with --master yarn-client
>>
>> On Fri, Jul 3, 2015 at 10:23 AM Akhil Das 
>> wrote:
>>
>>> Where does it returns null? Within the driver or in the executor? I just
>>> tried System.console.readPassword in spark-shell and it worked.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker  wrote:
>>>
 Hi,

 We have an application that requires a username/password to be entered
 from the command line. To screen a password in java you need to use
 System.console.readPassword however when running with spark System.console
 returns null?? Any ideas on how to get the console from spark?

 Thanks,

 Jem

>>>
>>>
>


Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-03 Thread bipin
I will second this. I very rarely used to get out-of-memory errors in 1.3.
Now I get these errors all the time. I feel that I could work on 1.3
spark-shell for long periods of time without spark throwing that error,
whereas in 1.4 the shell needs to be restarted or gets killed frequently.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/1-4-0-regression-out-of-memory-errors-on-small-data-tp23595p23607.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Accessing the console from spark

2015-07-03 Thread Akhil Das
Can you paste the code? Something is missing

Thanks
Best Regards

On Fri, Jul 3, 2015 at 3:14 PM, Jem Tucker  wrote:

> In the driver when running spark-submit with --master yarn-client
>
> On Fri, Jul 3, 2015 at 10:23 AM Akhil Das 
> wrote:
>
>> Where does it returns null? Within the driver or in the executor? I just
>> tried System.console.readPassword in spark-shell and it worked.
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker  wrote:
>>
>>> Hi,
>>>
>>> We have an application that requires a username/password to be entered
>>> from the command line. To screen a password in java you need to use
>>> System.console.readPassword however when running with spark System.console
>>> returns null?? Any ideas on how to get the console from spark?
>>>
>>> Thanks,
>>>
>>> Jem
>>>
>>
>>


Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-03 Thread Steve Loughran

On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv 
mailto:daniel.ha...@veracity-group.com>> wrote:
Hi,
I'm trying to start the thrift-server and passing it azure's blob storage jars 
but I'm failing on :
Caused by: java.io.IOException: No FileSystem for scheme: wasb
at 
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:342)
... 16 more

If I start the spark-shell the same way, everything works fine.

spark-shell command:
 ./bin/spark-shell --master yarn --jars 
/home/hdiuser/azureclass/azure-storage-1.2.0.jar,/home/hdiuser/azureclass/hadoop-azure-2.7.0.jar
 --num-executors 4

thrift-server command:
 ./sbin/start-thriftserver.sh --master yarn--jars 
/home/hdiuser/azureclass/azure-storage-1.2.0.jar,/home/hdiuser/azureclass/hadoop-azure-2.7.0.jar
 --num-executors 4

How can I pass dependency jars to the thrift server?

Thanks,
Daniel



you should be able to add the JARs to the environment variable 
SPARK_SUBMIT_CLASSPATH or SPARK_CLASSPATH and have them picked up when 
bin/compute-classpath.{cmd.sh} builds up the classpath




Multiple Join Conditions in dataframe join

2015-07-03 Thread bipin
Hi, I need to join with multiple conditions. Can anyone tell how to specify
that. For e.g. this is what I am trying to do :

val Lead_all = Leads.
 | join(Utm_Master,
Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign")
==
Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
"left")

When I do this I get  error: too many arguments for method apply. 

Thanks
Bipin




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-Join-Conditions-in-dataframe-join-tp23606.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Accessing the console from spark

2015-07-03 Thread Jem Tucker
In the driver when running spark-submit with --master yarn-client

On Fri, Jul 3, 2015 at 10:23 AM Akhil Das 
wrote:

> Where does it returns null? Within the driver or in the executor? I just
> tried System.console.readPassword in spark-shell and it worked.
>
> Thanks
> Best Regards
>
> On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker  wrote:
>
>> Hi,
>>
>> We have an application that requires a username/password to be entered
>> from the command line. To screen a password in java you need to use
>> System.console.readPassword however when running with spark System.console
>> returns null?? Any ideas on how to get the console from spark?
>>
>> Thanks,
>>
>> Jem
>>
>
>


Re: Accessing the console from spark

2015-07-03 Thread Akhil Das
Where does it returns null? Within the driver or in the executor? I just
tried System.console.readPassword in spark-shell and it worked.

Thanks
Best Regards

On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker  wrote:

> Hi,
>
> We have an application that requires a username/password to be entered
> from the command line. To screen a password in java you need to use
> System.console.readPassword however when running with spark System.console
> returns null?? Any ideas on how to get the console from spark?
>
> Thanks,
>
> Jem
>


Re: build spark 1.4 source code for sparkR with maven

2015-07-03 Thread Akhil Das
Did you try:

build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package



Thanks
Best Regards

On Fri, Jul 3, 2015 at 2:27 PM, 1106944...@qq.com <1106944...@qq.com> wrote:

> Hi all,
>Anyone  build spark 1.4 source code  for sparkR with maven/sbt, what's
> comand ?  using sparkR must build from source code  about 1.4 version .
> thank you
>
> --
> 1106944...@qq.com
>


Accessing the console from spark

2015-07-03 Thread Jem Tucker
Hi,

We have an application that requires a username/password to be entered from
the command line. To screen a password in java you need to use
System.console.readPassword however when running with spark System.console
returns null?? Any ideas on how to get the console from spark?

Thanks,

Jem


build spark 1.4 source code for sparkR with maven

2015-07-03 Thread 1106944...@qq.com
Hi all,
   Anyone  build spark 1.4 source code  for sparkR with maven/sbt, what's 
comand ?  using sparkR must build from source code  about 1.4 version .
thank you  



1106944...@qq.com


Spark 1.4 MLLib Bug?: Multiclass Classification "requirement failed: sizeInBytes was negative"

2015-07-03 Thread Danny Linden
hi, 

i want to run a multiclass classification with 390 classes on120k label 
points(tf-idf vectors). but i get the following exception. If i reduce the 
number of classes to ~20 everythings work fine. How can i fix this?

 i use the LogisticRegressionWithLBFGS class for my classification on a 8 Node 
Cluster with 


total-executor-cores = 30

executor-memory = 20g

My Exception:

15/07/02 15:55:00 INFO DAGScheduler: Job 11 finished: count at LBFGS.scala:170, 
took 0,521823 s
15/07/02 15:55:02 INFO MemoryStore: ensureFreeSpace(-1069858488) called with 
curMem=308280107, maxMem=3699737
15/07/02 15:55:02 INFO MemoryStore: Block broadcast_22 stored as values in 
memory (estimated size -1069858488.0 B, free 11.1 GB)
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at 
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.IllegalArgumentException: requirement failed: sizeInBytes 
was negative: -1069858488
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:812)
at 
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:635)
at 
org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:993)
at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
at 
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289)
at 
org.apache.spark.mllib.optimization.LBFGS$CostFun.calculate(LBFGS.scala:215)
at 
org.apache.spark.mllib.optimization.LBFGS$CostFun.calculate(LBFGS.scala:204)
at 
breeze.optimize.CachedDiffFunction.calculate(CachedDiffFunction.scala:23)
at 
breeze.optimize.FirstOrderMinimizer.calculateObjective(FirstOrderMinimizer.scala:108)
at 
breeze.optimize.FirstOrderMinimizer.initialState(FirstOrderMinimizer.scala:101)
at 
breeze.optimize.FirstOrderMinimizer.iterations(FirstOrderMinimizer.scala:146)
at org.apache.spark.mllib.optimization.LBFGS$.runLBFGS(LBFGS.scala:178)
at org.apache.spark.mllib.optimization.LBFGS.optimize(LBFGS.scala:117)
at 
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:282)
at 
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:205)
at 
com.test.spark.SVMSimpleAppEC2$.createNaiveBayesModel(SVMSimpleAppEC2.scala:150)
at com.test.spark.SVMSimpleAppEC2$.main(SVMSimpleAppEC2.scala:48)
at com.test.spark.SVMSimpleAppEC2.main(SVMSimpleAppEC2.scala)
... 6 more
15/07/02 15:55:02 INFO SparkContext: Invoking stop() from shutdown hook



Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread ayan guha
Hivecontext should be supersets of SQL context so you should be able to
perform all your tasks. Are you facing any problem with hivecontext?
On 3 Jul 2015 17:33, "Daniel Haviv"  wrote:

> Thanks
> I was looking for a less hack-ish way :)
>
> Daniel
>
> On Fri, Jul 3, 2015 at 10:15 AM, Akhil Das 
> wrote:
>
>> With binary i think it might not be possible, although if you can
>> download the sources and then build it then you can remove this function
>> 
>> which initializes the SQLContext.
>>
>> Thanks
>> Best Regards
>>
>> On Thu, Jul 2, 2015 at 6:11 PM, Daniel Haviv <
>> daniel.ha...@veracity-group.com> wrote:
>>
>>> Hi,
>>> I've downloaded the pre-built binaries for Hadoop 2.6 and whenever I
>>> start the spark-shell it always start with HiveContext.
>>>
>>> How can I disable the HiveContext from being initialized automatically ?
>>>
>>> Thanks,
>>> Daniel
>>>
>>
>>
>


Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Daniel Haviv
The main reason is Spark's startup time and the need to configure a component I 
don't really need (without  configs the hivecontext takes  more time to load)

Thanks,
Daniel

> On 3 ביולי 2015, at 11:13, Robin East  wrote:
> 
> As Akhil mentioned there isn’t AFAIK any kind of initialisation to stop the 
> SQLContext being created. If you could articulate why you would need to do 
> this (it’s not obvious to me what the benefit would be) then maybe this is 
> something that could be included as a feature in a future release. It may 
> also suggest a way to a workaround.
> 
>> On 3 Jul 2015, at 08:33, Daniel Haviv  
>> wrote:
>> 
>> Thanks
>> I was looking for a less hack-ish way :)
>> 
>> Daniel
>> 
>>> On Fri, Jul 3, 2015 at 10:15 AM, Akhil Das  
>>> wrote:
>>> With binary i think it might not be possible, although if you can download 
>>> the sources and then build it then you can remove this function which 
>>> initializes the SQLContext.
>>> 
>>> Thanks
>>> Best Regards
>>> 
 On Thu, Jul 2, 2015 at 6:11 PM, Daniel Haviv 
  wrote:
 Hi,
 I've downloaded the pre-built binaries for Hadoop 2.6 and whenever I start 
 the spark-shell it always start with HiveContext.
 
 How can I disable the HiveContext from being initialized automatically ?
 
 Thanks,
 Daniel
> 


Filter on Grouped Data

2015-07-03 Thread Megha Sridhar- Cynepia

Hi,


I have a Spark DataFrame object, which when trimmed, looks like,




FromTo  SubjectMessage-ID
karen@xyz.com['vance.me...@enron.com', SEC Inquiry 
 <19952575.1075858>

 'jeannie.mandel...@enron.com',
 'mary.cl...@enron.com',
 'sarah.pal...@enron.com']



elyn.hug...@xyz.com['dennis.ve...@enron.com',Revised 
documents<33499184.1075858>

 'gina.tay...@enron.com',
 'kelly.kimbe...@enron.com']
.
.
.


I have run a groupBy("From") on the above dataFrame and obtained a 
GroupedData object as a result. I need to apply a filter on the grouped 
data (for instance, getting the sender who sent maximum number of the 
mails that were addressed to a particular receiver in the "To" list).

Is there a way to accomplish this by applying filter on grouped data?


Thanks,
Megha


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[spark1.4] sparkContext.stop causes exception on Mesos

2015-07-03 Thread Ayoub
Hello Spark developers, 

After upgrading to spark 1.4 on Mesos 0.22.1 existing code started to throw
this exception when calling sparkContext.stop :

(SparkListenerBus) [ERROR -
org.apache.spark.Logging$class.logError(Logging.scala:96)] Listener
EventLoggingListener threw an exception 
java.lang.reflect.InvocationTargetException 
at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source) 
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
at java.lang.reflect.Method.invoke(Method.java:606) 
at
org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:146)
 
at
org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:146)
 
at scala.Option.foreach(Option.scala:236) 
at
org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:146)
 
at
org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:190)
 
at
org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54)
 
at
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
 
at
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
 
at
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56) 
at
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
 
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
 
at
org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1215) 
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
 
Caused by: java.io.IOException: Filesystem closed 
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:730) 
at
org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1855) 
at
org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1816) 
at
org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130) 
... 16 more 
I0701 15:03:46.101809  1612 sched.cpp:1589] Asked to stop the driver 
I0701 15:03:46.101971  1355 sched.cpp:831] Stopping framework
'20150629-132734-1224736778-5050-6126-0028'


This problems happens only when spark.eventLog.enabled flag is set to true,
it happens also if sparkContext.stop is omitted in the code, I think because
Spark shut down indirectly the spark context. 

Does anyone know what could cause this problem ?

Thanks,
Ayoub.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark1-4-sparkContext-stop-causes-exception-on-Mesos-tp23605.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Daniel Haviv
Thanks
I was looking for a less hack-ish way :)

Daniel

On Fri, Jul 3, 2015 at 10:15 AM, Akhil Das 
wrote:

> With binary i think it might not be possible, although if you can download
> the sources and then build it then you can remove this function
> 
> which initializes the SQLContext.
>
> Thanks
> Best Regards
>
> On Thu, Jul 2, 2015 at 6:11 PM, Daniel Haviv <
> daniel.ha...@veracity-group.com> wrote:
>
>> Hi,
>> I've downloaded the pre-built binaries for Hadoop 2.6 and whenever I
>> start the spark-shell it always start with HiveContext.
>>
>> How can I disable the HiveContext from being initialized automatically ?
>>
>> Thanks,
>> Daniel
>>
>
>


Re: duplicate names in sql allowed?

2015-07-03 Thread Akhil Das
I think you can open up a jira, not sure if this PR
 (SPARK-2890
) broke the validation
piece.

Thanks
Best Regards

On Fri, Jul 3, 2015 at 4:29 AM, Koert Kuipers  wrote:

> i am surprised this is allowed...
>
> scala> sqlContext.sql("select name as boo, score as boo from
> candidates").schema
>
> res7: org.apache.spark.sql.types.StructType =
> StructType(StructField(boo,StringType,true),
> StructField(boo,IntegerType,true))
>
>
> should StructType check for duplicate field names?
>


Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Akhil Das
With binary i think it might not be possible, although if you can download
the sources and then build it then you can remove this function

which initializes the SQLContext.

Thanks
Best Regards

On Thu, Jul 2, 2015 at 6:11 PM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:

> Hi,
> I've downloaded the pre-built binaries for Hadoop 2.6 and whenever I start
> the spark-shell it always start with HiveContext.
>
> How can I disable the HiveContext from being initialized automatically ?
>
> Thanks,
> Daniel
>


Optimizations

2015-07-03 Thread Marius Danciu
Hi all,

If I have something like:

rdd.join(...).mapPartitionToPair(...)

It looks like mapPartitionToPair runs in a different stage then join. Is
there a way to piggyback this computation inside the join stage ? ... such
that each result partition after join is passed to
the mapPartitionToPair function, all running in the same state without any
other costs.

Best,
Marius