Re: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-28 Thread Felix Cheung
Ah, it's there in spark-submit and pyspark. Seems like it should be added for
spark_ec2.



_
From: Ted Yu 
Sent: Friday, November 27, 2015 11:50 AM
Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
To: Felix Cheung 
Cc: Andy Davidson , user @spark 



   ec2/spark-ec2 calls ./ec2/spark_ec2.py   
   
  I don't see PYTHONHASHSEED defined in any of these scripts.  
  Andy reported this for an EC2 cluster.
  I think a JIRA should be opened.  
  
   On Fri, Nov 27, 2015 at 11:01 AM, Felix Cheung 
 wrote:
   May I ask how you are starting Spark?   
It looks like PYTHONHASHSEED is being set:
https://github.com/apache/spark/search?utf8=%E2%9C%93=PYTHONHASHSEED   
   
    
   Date: Thu, 26 Nov 2015 11:30:09 -0800
Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
From: a...@santacruzintegration.com
To: user@spark.apache.org

 I am using spark-1.5.1-bin-hadoop2.6. I used
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster
and configured spark-env to use python3. I get an exception
'Randomness of hash of string should be disabled via PYTHONHASHSEED'.
Is there any reason rdd.py should not just set PYTHONHASHSEED?

 Should I file a bug? 
 Kind regards 
 Andy 
 details 
 
http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
 
 Example does not work out of the box:

subtract(other, numPartitions=None)

Return each value in self that is not contained in other.

>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]
It raises:

if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
 
 The following script fixes the problem:

sudo printf "
# set PYTHONHASHSEED so python3 will not raise the exception 'Randomness of hash
# of string should be disabled via PYTHONHASHSEED'
export PYTHONHASHSEED=123
" >> /root/spark/conf/spark-env.sh

sudo pssh -i -h /root/spark-ec2/slaves cp \
  /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`

sudo sh -c 'for i in `cat slaves`; do scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done'
 
 
  



  

streaming from Flume, save to Cassandra and Solr with Banana as search engine

2015-11-28 Thread Oleksandr Yermolenko

Hi,

The aim:
- collect syslogs
- filter (discard really unneeded events)
- save the rest to a Cassandra table
- later I will have to integrate a search engine (Solr based)

Environment:
spark 1.5.2/scala 2.10.4
spark-cassandra-connector_2.11-1.5.0-M2.jar
flume 1.6.0

I have reviewed a lot of examples and written some small Scala code (I'm a
Scala beginner).

Now I have a problem with "extracting" and saving to C*.
Small code to describe what I mean, with my comments:

object myPollingEvents {

  def containsERR(x: String): Boolean = x match {
    case x if x contains "error: Could not load host key" => false
    case x if x contains "Did not receive identification string from 10." => false
    case _ => true
  }

  def createContext(outputPath: String, checkpointDirectory: String)
    : StreamingContext = {

    println("Creating new context")
    val outputFile = new File(outputPath)
    if (outputFile.exists()) outputFile.delete()

    val batchInterval = Milliseconds(2000)
    val nms = new InetSocketAddress("nms.rsm.crp", 12345)

    val sparkConf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("myPollingEvents")
      .set("spark.cleaner.ttl", "3600")
      .set("spark.cassandra.connection.host", "127.0.0.1")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
      .setJars(Array("/data/spark/lib/spark-cassandra-connector_2.11-1.5.0-M2.jar"))

    val ssc = new StreamingContext(sparkConf, batchInterval)
    ssc.checkpoint(checkpointDirectory)

    // I can "extract" the body, filter and save to C*
    // (CREATE TABLE CassandraTableRaw(message text PRIMARY KEY);)
    val stream = FlumeUtils.createPollingStream(ssc, Seq(nms), StorageLevel.MEMORY_AND_DISK_SER)
      .map(e => new String(e.event.getBody.array(), UTF_8))
      .filter(containsERR)
      .map(Tuple1(_))

    stream.print
    stream.saveToCassandra("cassandrakeyspace", "cassandratableraw")

    // Just control
    stream.count().map(cnt => "Received " + cnt + " flume events.").print

    // I can print "headers". On the console something like "Map(Severity -> 6,
    // Facility -> 3, host -> vpn10, priority -> 30, timestamp -> 1448702438000)"
    // val headers = stream.map(_.event.getHeaders.asScala.map {
    //   case (key, value) => (key.toString, value.toString) }).print

But I can't find a Scala example for Flume + Spark Streaming that shows how to
combine the filtered getBody with the needed getHeaders and save them to
C* (to something like CREATE TABLE CassandraTableRaw(hostname text
PRIMARY KEY, priority int, timestamp timestamp, msg text, program text,
pid int)).
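
For what it's worth, one possible shape for this (a sketch only, untested): it
reuses ssc, nms, containsERR and the Flume/StorageLevel imports from the
snippet above, and assumes a hypothetical table covering a subset of those
columns, cassandratableparsed(hostname text PRIMARY KEY, priority int,
timestamp timestamp, msg text):

import java.nio.charset.StandardCharsets.UTF_8
import scala.collection.JavaConverters._
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

// Case class field names are mapped to the Cassandra column names.
case class SyslogEvent(hostname: String, priority: Int, timestamp: java.util.Date, msg: String)

val parsed = FlumeUtils.createPollingStream(ssc, Seq(nms), StorageLevel.MEMORY_AND_DISK_SER)
  .map { e =>
    // keep both the headers and the decoded body of each event
    val headers = e.event.getHeaders.asScala.map { case (k, v) => k.toString -> v.toString }
    val body    = new String(e.event.getBody.array(), UTF_8)
    (headers, body)
  }
  .filter { case (_, body) => containsERR(body) }
  .map { case (headers, body) =>
    SyslogEvent(
      hostname  = headers.getOrElse("host", "unknown"),
      priority  = headers.get("priority").map(_.toInt).getOrElse(0),
      timestamp = new java.util.Date(headers.get("timestamp").map(_.toLong).getOrElse(0L)),
      msg       = body)
  }

parsed.saveToCassandra("cassandrakeyspace", "cassandratableparsed")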


I would be very thankful if someone could share a code example to resolve
this issue.


Thanks a lot for reading and for your help.

Oleksandr Yermolenko

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to get a single available message from kafka (case where OffsetRange.fromOffset == OffsetRange.untilOffset)

2015-11-28 Thread Nikos Viorres
Hi,

I am using KafkaUtils.createRDD to retrieve data from Kafka for batch
processing, and when invoking KafkaUtils.createRDD with an OffsetRange where
OffsetRange.fromOffset == OffsetRange.untilOffset for a particular
partition, I get an empty RDD.
The documentation is clear that until is exclusive and from is inclusive, but
if I use OffsetRange.untilOffset + 1 I get an invalid OffsetRange exception
during the check.
Since this should also apply in general (if untilOffset is exclusive you
cannot fetch it), does it mean that untilOffset is also non-existent in
Kafka (and thus always exclusive), or am I missing something?
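
For reference, a minimal sketch of the batch read, assuming the Kafka 0.8
direct API shipped with Spark 1.x (topic, broker and offsets are made-up
values):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

// To read exactly the single message at offset 42 of partition 0, the range
// has to be [42, 43): fromOffset inclusive, untilOffset exclusive, and 43 must
// not be beyond the partition's current log-end offset.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // hypothetical broker
val ranges = Array(OffsetRange("my-topic", 0, 42L, 43L))         // topic, partition, from, until

val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, ranges)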

regards

p.s. by manually using the Kafka protocol to query the offsets I see
that kafka.api.OffsetRequest.EarliestTime()
== kafka.api.OffsetRequest.LatestTime() and set to a positive value


df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-11-28 Thread Don Drake
I have a 2TB dataset that I have in a DataFrame that I am attempting to
partition by 2 fields and my YARN job seems to write the partitioned
dataset successfully.  I can see the output in HDFS once all Spark tasks
are done.

After the spark tasks are done, the job appears to be running for over an
hour, until I get the following (full stack trace below):

java.lang.OutOfMemoryError: GC overhead limit exceeded
at
org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:238)

I had set the driver memory to be 20GB.

I attempted to read in the partitioned dataset and got another error saying
the /_metadata directory was not a parquet file.  I removed the _metadata
directory and was able to query the data, but it appeared to not use the
partitioned directory when I attempted to filter the data (it read every
directory).
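
In case it helps, a sketch of one knob to try. This assumes (based on the stack
trace below, which points at ParquetOutputCommitter.writeMetaDataFile) that the
extra hour is spent building the _metadata summary file over the very large
number of partitioned files; skipping the summary files avoids that step:

// Assumption: skipping the Parquet summary files (_metadata/_common_metadata)
// is acceptable for downstream readers of this dataset.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

// Hypothetical column names and output path.
df.write.partitionBy("field1", "field2").parquet("hdfs:///path/to/output")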

This is Spark 1.5.2 and I saw the same problem when running the code in
both Scala and Python.

Any suggestions are appreciated.

-Don

15/11/25 00:00:19 ERROR datasources.InsertIntoHadoopFsRelation: Aborting
job.
java.lang.OutOfMemoryError: GC overhead limit exceeded
at
org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:238)
at
org.apache.parquet.format.converter.ParquetMetadataConverter.addRowGroup(ParquetMetadataConverter.java:167)
at
org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:79)
at
org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:405)
at
org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:433)
at
org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:423)
at
org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
at
org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
at
org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:208)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
at
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
at com.dondrake.qra.ScalaApp$.main(ScalaApp.scala:53)
at com.dondrake.qra.ScalaApp.main(ScalaApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
15/11/25 00:00:20 ERROR actor.ActorSystemImpl: exception on LARS? timer
thread
java.lang.OutOfMemoryError: GC overhead limit exceeded
at
akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:409)
at
akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
at java.lang.Thread.run(Thread.java:745)
15/11/25 00:00:20 ERROR akka.ErrorMonitor: exception on LARS? timer thread
java.lang.OutOfMemoryError: GC overhead limit exceeded
at
akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:409)
at

StructType for oracle.sql.STRUCT

2015-11-28 Thread Pieter Minnaar
Hi,

I need to read Oracle Spatial SDO_GEOMETRY tables into Spark.

I need to know how to create a StructField to use in the schema definition
for the Oracle geometry columns. In the standard JDBC the values are read
as oracle.sql.STRUCT types.

How can I get the same values in Spark SQL?
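
Not an authoritative answer, but a sketch of one workaround, assuming you are
willing to serialize the geometry yourself (Spark SQL has no built-in type for
oracle.sql.STRUCT, so the idea is to convert it to WKB bytes or WKT text on
read and declare the column accordingly):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// "geom_wkb" is a hypothetical column holding the geometry serialized as bytes.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("geom_wkb", BinaryType, nullable = true)
))

// rows would be an RDD[Row] built from plain JDBC, converting each
// oracle.sql.STRUCT into an Array[Byte] before constructing the Row.
// val df = sqlContext.createDataFrame(rows, schema)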

Regards,
Pieter


General question on using StringIndexer in SparkML

2015-11-28 Thread Vishnu Viswanath
Hi All,

I have a general question on using StringIndexer.
StringIndexer gives an index to each label in the feature starting from 0
(0 for the most frequent label).

Suppose I am building a model, and I use StringIndexer for transforming one
of my columns.
E.g., suppose A was the most frequent word, followed by B and C.

So the StringIndexer will generate

A  0.0
B  1.0
C  2.0

After building the model, I am going to do some prediction using this
model, so I do the same transformation on the new data on which I need to
predict. And suppose the new dataset has C as the most frequent word,
followed by B and A. So the StringIndexer will assign indexes as

C 0.0
B 1.0
A 2.0

These indexes are different from what we used for modeling. So won’t this
give me a wrong prediction if I use StringIndexer?
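
For what it's worth, a minimal sketch of the usual way around this, assuming
DataFrames training and newData with a string column "word": fit the indexer
once on the training data and reuse the fitted StringIndexerModel, so the
mapping learned at training time is applied to the new data instead of being
re-learned:

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("word")
  .setOutputCol("wordIndex")

// Learned once on the training data: A -> 0.0, B -> 1.0, C -> 2.0 in the example above.
val indexerModel = indexer.fit(training)

val indexedTraining = indexerModel.transform(training)
val indexedNew      = indexerModel.transform(newData)  // same mapping reused, not re-fit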
-- 
Thanks and Regards,
Vishnu Viswanath,
*www.vishnuviswanath.com *


storing result of aggregation of spark streaming

2015-11-28 Thread Amir Rahnama
Hi,

I am going to store the results of my stream job into a DB; which of the
databases has native support (if any)?

-- 
Thanks and Regards,

Amir Hossein Rahnama

*Tel: +46 (0) 761 681 102*
Website: www.ambodi.com
Twitter: @_ambodi 


Re: StructType for oracle.sql.STRUCT

2015-11-28 Thread andy petrella
Warf... such a heavy task, man!

I'd love to follow your work on that (I've a long XP in geospatial too); is
there a repo available for that already?

The hard part will be to support all the descendant types I guess (line,
multiline, and so on), then creating the spatial operators.

The only project I know of that has this kind of aim (although it's very
limited and simple atm) is magellan: https://github.com/harsha2010/magellan

andy


On Sat, Nov 28, 2015 at 9:15 PM Pieter Minnaar 
wrote:

> Hi,
>
> I need to read Oracle Spatial SDO_GEOMETRY tables into Spark.
>
> I need to know how to create a StructField to use in the schema definition
> for the Oracle geometry columns. In the standard JDBC the values are read
> as oracle.sql.STRUCT types.
>
> How can I get the same values in Spark SQL?
>
> Regards,
> Pieter
>
-- 
andy


Spark and simulated annealing

2015-11-28 Thread marfago
Hi All,

I would like to implement a simulated annealing algorithm with Spark.
What is the best way to do that with Python or Scala? Is there a library
that already implements it?
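
I'm not aware of a library for this in MLlib, but here is a minimal sketch (not
a definitive implementation) of one common pattern: run many independent
annealing chains as Spark tasks (random restarts) and keep the best solution.
energy, neighbor and randomInit are hypothetical problem-specific functions you
would have to supply.

import scala.util.Random

def anneal(seed: Long,
           init: Array[Double],
           energy: Array[Double] => Double,
           neighbor: (Array[Double], Random) => Array[Double],
           steps: Int = 100000,
           t0: Double = 1.0,
           cooling: Double = 0.999): (Array[Double], Double) = {
  val rnd   = new Random(seed)
  var cur   = init
  var curE  = energy(cur)
  var best  = cur
  var bestE = curE
  var t = t0
  for (_ <- 0 until steps) {
    val cand  = neighbor(cur, rnd)
    val candE = energy(cand)
    // always accept improvements; accept worse moves with the Metropolis probability
    if (candE < curE || rnd.nextDouble() < math.exp((curE - candE) / t)) {
      cur = cand; curE = candE
      if (candE < bestE) { best = cand; bestE = candE }
    }
    t *= cooling
  }
  (best, bestE)
}

// Driver side: one independent chain per task, then pick the global best.
// val (solution, cost) = sc.parallelize(1L to 64L, 64)
//   .map(seed => anneal(seed, randomInit(seed), energy, neighbor))
//   .reduce((a, b) => if (a._2 <= b._2) a else b)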

Thank you in advance.
Marco



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-simulated-annealing-tp25507.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Retrieve best parameters from CrossValidator

2015-11-28 Thread BenFradet
Hi everyone,

Is there a better way to retrieve the best model parameters obtained from
cross validation than inspecting the logs issued while calling the fit
method (with the call here:
https://github.com/apache/spark/blob/branch-1.5/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L106)?

Wouldn't it be useful to expose this to the end user through the
crossValidatorModel?
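
I'm not sure there is a first-class API for this, but one sketch that avoids
grepping the logs, assuming paramGrid is the Array[ParamMap] you passed to
setEstimatorParamMaps, that cv and training come from your own setup, and that
your Spark version exposes avgMetrics on CrossValidatorModel:

val cvModel = cv.fit(training)

// Pair each candidate ParamMap with its cross-validated metric and rank them.
paramGrid.zip(cvModel.avgMetrics)
  .sortBy(-_._2)
  .foreach { case (params, metric) => println(s"$metric -> $params") }

// Parameters carried by the winning model itself:
println(cvModel.bestModel.extractParamMap())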

Thanks for your response.

Best,
Ben.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Retrieve-best-parameters-from-CrossValidator-tp25508.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Confirm this won't parallelize/partition?

2015-11-28 Thread Jim Lohse
Hi, I got a good answer on the main question elsewhere; would anyone
please confirm the first code is the right approach? For an MVCE I am
trying to adapt this example and it seems like I am having Java issues
with types:


(but this is basically the right approach?)

long count = spark.parallelize(makeRange(1, NUM_SAMPLES)).filter(
    new Function<Integer, Boolean>() {
      public Boolean call(Integer i) {
        double x = Math.random();
        double y = Math.random();
        return x * x + y * y < 1;
      }
    }).count();
System.out.println("Pi is roughly " + 4.0 * count / NUM_SAMPLES);

And this is definitely the wrong approach? Using the loop in the
function will all execute on one partition? Want to be sure I understood
the other answer correctly. Thanks!


JavaRDD<DropResult> nSizedRDD =
    spark.parallelize(pldListofOne).flatMap(new FlatMapFunction<PipeLinkageData, DropResult>() {
      public Iterable<DropResult> call(PipeLinkageData pld) {
        List<DropResult> returnRDD = new ArrayList<DropResult>();
        // is Spark good at spreading a for loop like this?
        for (int i = 0; i < howMany; i++) {
          returnRDD.add(pld.doDrop());
        }
        return returnRDD;
      }
    });




On 11/27/2015 4:18 PM, Jim wrote:

 Hello there,

(part of my problem is docs that say "undocumented" on parallelize, which
leaves me reading books for examples that don't always pertain)


I am trying to create an RDD of length N = 10^6 by executing N operations
of a Java class we have; I can have that class implement Serializable
or any Function if necessary. I don't have a fixed-length dataset up
front, I am trying to create one. Trying to figure out whether to
create a dummy array of length N to parallelize, or pass it a function
that runs N times.


Not sure which approach is valid/better, I see in Spark if I am 
starting out with a well defined data set like words in a doc, the 
length/count of those words is already defined and I just parallelize 
some map or filter to do some operation on that data.


In my case I think it's different: trying to parallelize the creation of
an RDD that will contain 10^6 elements... here's a lot more info if
you want ...


DESCRIPTION:

In Java 8 using Spark 1.5.1, we have a Java method doDrop() that takes
a PipeLinkageData and returns a DropResult.


I am thinking I could use map() or flatMap() to call a one-to-many
function; I was trying to do something like this in another question
that never quite worked:


JavaRDD<DropResult> simCountRDD =
    spark.parallelize(makeRange(1, getSimCount())).map(new Function<Integer, DropResult>() {
      public DropResult call(Integer i) {
        return pld.doDrop();
      }
    });


Thinking something like this is more the correct approach? And this 
has more context if desired:


// pld is of type PipeLinkageData, it's already initialized
// parallelize wants a collection passed into the first param
List<PipeLinkageData> pldListofOne = new ArrayList<PipeLinkageData>();
// make an ArrayList of one
pldListofOne.add(pld);

int howMany = 100;

JavaRDD<DropResult> nSizedRDD =
    spark.parallelize(pldListofOne).flatMap(new FlatMapFunction<PipeLinkageData, DropResult>() {
      public Iterable<DropResult> call(PipeLinkageData pld) {
        List<DropResult> returnRDD = new ArrayList<DropResult>();
        // is Spark good at spreading a for loop like this?
        for (int i = 0; i < howMany; i++) {
          returnRDD.add(pld.doDrop());
        }
        return returnRDD;
      }
    });


One other concern: A JavaRDD is correct here? I can see needing to
call FlatMapFunction but I don't need a FlatMappedRDD? And since I am 
never trying to flatten a group of arrays or lists to a single array 
or list, do I really ever need to flatten anything?










Re: Experiences about NoSQL databases with Spark

2015-11-28 Thread Yu Zhang
BTW, if you decide to try MongoDB, please use the 3.0+ version with
the "wiredtiger" engine.

On Sat, Nov 28, 2015 at 11:30 PM, Yu Zhang  wrote:

> If you need to construct multiple indexes, HBase will perform better; the
> writing speed is slow in MongoDB with many indexes and the memory cost is
> huge!
>
> But my concern is: with MongoDB you could easily cooperate with JS and
> with some visualization tools like D3.js, so the work will become smooth as
> a breeze.
>
> Could you provide additional details of the data size and number of
> operations you need in your program? I believe this is a quite general
> question and hope to hear any comments and thoughts.
>
> On Tue, Nov 24, 2015 at 9:50 AM, Ted Yu  wrote:
>
>> You should consider using HBase as the NoSQL database.
>> w.r.t. 'The data in the DB should be indexed', you need to design the
>> schema in HBase carefully so that the retrieval is fast.
>>
>> Disclaimer: I work on HBase.
>>
>> On Tue, Nov 24, 2015 at 4:46 AM, sparkuser2345 
>> wrote:
>>
>>> I'm interested in knowing which NoSQL databases you use with Spark and
>>> what
>>> are your experiences.
>>>
>>> On a general level, I would like to use Spark streaming to process
>>> incoming
>>> data, fetch relevant aggregated data from the database, and update the
>>> aggregates in the DB based on the incoming records. The data in the DB
>>> should be indexed to be able to fetch the relevant data fast and to allow
>>> fast interactive visualization of the data.
>>>
>>> I've been reading about MongoDB+Spark and I've got the impression that
>>> there
>>> are some challenges in fetching data by indices and in updating
>>> documents,
>>> but things are moving so fast, so I don't know if these are relevant
>>> anymore. Do you find any benefit from using HBase with Spark as HBase is
>>> built on top of HDFS?
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Experiences-about-NoSQL-databases-with-Spark-tp25462.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Spark Streaming on mesos

2015-11-28 Thread Renjie Liu
Hi, Nagaraj:
Thanks for the response, but this does not solve my problem.
I think executor memory should be proportional to the number of cores, or the
number of cores in each executor should be the same.
On Sat, Nov 28, 2015 at 1:48 AM Nagaraj Chandrashekar <
nchandrashe...@innominds.com> wrote:

> Hi Renjie,
>
> I have not setup Spark Streaming on Mesos but there is something called
> reservations in Mesos.  It supports both Static and Dynamic reservations.
> Both types of reservations must have role defined. You may want to explore
> these options.   Excerpts from the Apache Mesos documentation.
>
> Cheers
> Nagaraj C
> Reservation
>
> Mesos provides mechanisms to reserve resources in specific slaves. The
> concept was first introduced with static reservation in 0.14.0 which
> enabled operators to specify the reserved resources on slave startup. This
> was extended with dynamic reservation in 0.23.0 which enabled operators
> and authorized frameworks to dynamically reserve resources in the cluster.
>
> No breaking changes were introduced with dynamic reservation, which means
> the existing static reservation mechanism continues to be fully supported.
>
> In both types of reservations, resources are reserved for a role.
> Static Reservation (since 0.14.0)
>
> An operator can configure a slave with resources reserved for a role. The
> reserved resources are specified via the --resources flag. For example,
> suppose we have 12 CPUs and 6144 MB of RAM available on a slave and that we
> want to reserve 8 CPUs and 4096 MB of RAM for the ads role. We start the
> slave like so:
>
> $ mesos-slave \
>   --master=: \
>   --resources="cpus:4;mem:2048;cpus(ads):8;mem(ads):4096"
>
> We now have 8 CPUs and 4096 MB of RAM reserved for ads on this slave.
>
>
> From: Renjie Liu 
> Date: Friday, November 27, 2015 at 9:57 PM
> To: "user@spark.apache.org" 
> Subject: Spark Streaming on mesos
>
> Hi, all:
> I'm trying to run spark streaming on mesos and it seems that none of the
> scheduler is suitable for that. Fine grain scheduler will start an executor
> for each task so it will significantly increase the latency. While coarse
> grained mode can only set the max core numbers and executor memory but
> there's no way to set the number of cores for each executor. Has anyone
> deployed spark streaming on mesos? And what's your settings?
> --
> Liu, Renjie
> Software Engineer, MVAD
>
-- 
Liu, Renjie
Software Engineer, MVAD


Re: Automatic driver restart does not seem to be working in Spark Standalone

2015-11-28 Thread swetha kasireddy
Yes, I mean killing the Spark job from the UI. Also, I use
context.awaitTermination().

On Wed, Nov 25, 2015 at 6:23 PM, Tathagata Das  wrote:

> What do you mean by killing the streaming job using UI? Do you mean that
> you are clicking the "kill" link in the Jobs page in the Spark UI?
>
> Also in the application, is the main thread waiting on
> streamingContext.awaitTermination()? That is designed to catch exceptions
> in running job and throw it in the main thread, so that the java program
> exits with an exception and non-zero exit code.
>
>
>
>
> On Wed, Nov 25, 2015 at 12:57 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> I am killing my Streaming job using the UI. What error code does the UI
>> provide if the job is killed from there?
>>
>> On Wed, Nov 25, 2015 at 11:01 AM, Kay-Uwe Moosheimer 
>> wrote:
>>
>>> Tested with Spark 1.5.2 … Works perfectly when the exit code is non-zero.
>>> And does not restart when the exit code equals zero.
>>>
>>>
>>> Von: Prem Sure 
>>> Datum: Mittwoch, 25. November 2015 19:57
>>> An: SRK 
>>> Cc: 
>>> Betreff: Re: Automatic driver restart does not seem to be working in
>>> Spark Standalone
>>>
>>> I think automatic driver restart will happen, if driver fails with
>>> non-zero exit code.
>>>
>>>   --deploy-mode cluster
>>>   --supervise
>>>
>>>
>>>
>>> On Wed, Nov 25, 2015 at 1:46 PM, SRK  wrote:
>>>
 Hi,

 I am submitting my Spark job with supervise option as shown below. When
 I
 kill the driver and the app from UI, the driver does not restart
 automatically. This is in a cluster mode.  Any suggestion on how to make
 Automatic Driver Restart work would be of great help.

 --supervise


 Thanks,
 Swetha



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Automatic-driver-restart-does-not-seem-to-be-working-in-Spark-Standalone-tp25478.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


Parquet files not getting coalesced to smaller number of files

2015-11-28 Thread SRK

Hi,

I have the following code that saves the parquet files from my hourly batch to
HDFS. My idea is to coalesce the files to 1500 smaller files. On the first run
it gives me 1500 files in HDFS. On the next runs the number of files seems to
keep increasing even though I coalesce.

It's not getting coalesced to 1500 files as I want. I also reference an example
that I am using at the end. Please let me know if there is a different and
more efficient way of doing this.


val job = Job.getInstance()

var filePath = "path"

val metricsPath: Path = new Path(filePath)

// Check if inputFile exists
val fs: FileSystem = FileSystem.get(job.getConfiguration)

if (fs.exists(metricsPath)) {
  fs.delete(metricsPath, true)
}

// Configure the ParquetOutputFormat to use Avro as the serialization format
ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
// You need to pass the schema to AvroParquet when you are writing objects but not when you
// are reading them. The schema is saved in the Parquet file for future readers to use.
AvroParquetOutputFormat.setSchema(job, Metrics.SCHEMA$)

// Create a PairRDD with all keys set to null and wrap each Metrics in serializable objects
val metricsToBeSaved = metrics.map(metricRecord => (null, new
  SerializableMetrics(new Metrics(metricRecord._1, metricRecord._2._1, metricRecord._2._2))))

metricsToBeSaved.coalesce(1500)
// Save the RDD to a Parquet file in our temporary output directory
metricsToBeSaved.saveAsNewAPIHadoopFile(filePath, classOf[Void], classOf[Metrics],
  classOf[ParquetOutputFormat[Metrics]], job.getConfiguration)


https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala
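
For reference, a minimal sketch of the coalesce-then-save pattern, reusing the
names from the snippet above. Note that coalesce is a transformation that
returns a new RDD rather than changing metricsToBeSaved in place, so it is the
returned RDD that has to be saved:

// coalesce returns a new, repartitioned RDD; save that one.
val coalesced = metricsToBeSaved.coalesce(1500)

coalesced.saveAsNewAPIHadoopFile(filePath, classOf[Void], classOf[Metrics],
  classOf[ParquetOutputFormat[Metrics]], job.getConfiguration)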



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-files-not-getting-coalesced-to-smaller-number-of-files-tp25509.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: storing result of aggregation of spark streaming

2015-11-28 Thread Michael Spector
Hi Amir,

You can store results of stream transformation in Cassandra using:
https://github.com/datastax/spark-cassandra-connector
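
A minimal sketch of what that looks like for a streaming aggregation, assuming
a hypothetical keyspace "streaming" with a table word_counts(word text PRIMARY
KEY, count bigint) and a DStream[(String, Long)] called wordCounts:

import com.datastax.spark.connector._            // SomeColumns
import com.datastax.spark.connector.streaming._  // saveToCassandra on DStreams

wordCounts.saveToCassandra("streaming", "word_counts", SomeColumns("word", "count"))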

Regards,
Michael

On Sun, Nov 29, 2015 at 1:41 AM, Amir Rahnama  wrote:

> Hi,
>
> I am gonna store the results of my stream job into a db, which one of
> databases has the native support (if any)?
>
> --
> Thanks and Regards,
>
> Amir Hossein Rahnama
>
> *Tel: +46 (0) 761 681 102*
> Website: www.ambodi.com
> Twitter: @_ambodi 
>


Re: Experiences about NoSQL databases with Spark

2015-11-28 Thread Jörn Franke
I would not use MongoDB because it does not fit well into the Spark or Hadoop
architecture. You can use it if your data amount is very small and already
preaggregated, but this is a very limited use case. You can use HBase, or with
future versions of Hive (if they use Tez > 0.8), for interactive queries.
HBase with Phoenix and Hive offer standard SQL interfaces and can easily be
integrated with a web interface.
With Hive you can already today use the ORC and Parquet formats on HDFS. They
support storage indexes and bloom filters to accelerate your queries. You could
also just use HDFS with these storage formats.

Maybe you can elaborate more on data volumes and queries you want to do on the 
processed part? Is the processed data updated?

Depending on your use case/data, another option for interactive queries is
Solr/Elasticsearch for text analytics and TitanDB for interactive graph
queries (it supports, amongst others, HBase as the storage layer). Of course
there are some more (also commercial). Both offer REST interfaces and would be
easy to integrate with a web application using JSON/D3.js. In some cases a
relational database can make sense.



> On 24 Nov 2015, at 13:46, sparkuser2345  wrote:
> 
> I'm interested in knowing which NoSQL databases you use with Spark and what
> are your experiences. 
> 
> On a general level, I would like to use Spark streaming to process incoming
> data, fetch relevant aggregated data from the database, and update the
> aggregates in the DB based on the incoming records. The data in the DB
> should be indexed to be able to fetch the relevant data fast and to allow
> fast interactive visualization of the data. 
> 
> I've been reading about MongoDB+Spark and I've got the impression that there
> are some challenges in fetching data by indices and in updating documents,
> but things are moving so fast, so I don't know if these are relevant
> anymore. Do you find any benefit from using HBase with Spark as HBase is
> built on top of HDFS? 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Experiences-about-NoSQL-databases-with-Spark-tp25462.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org