Re: Experiences about NoSQL databases with Spark

2015-11-28 Thread Jörn Franke
I would not use MongoDB because it does not fit well into the Spark or Hadoop architecture. You can use it if your data volume is very small and already pre-aggregated, but that is a very limited use case. You can use HBase, or with future versions of Hive (if they use Tez > 0.8). For interactive q

Re: storing result of aggregation of spark streaming

2015-11-28 Thread Michael Spector
Hi Amir, You can store the results of a stream transformation in Cassandra using: https://github.com/datastax/spark-cassandra-connector Regards, Michael On Sun, Nov 29, 2015 at 1:41 AM, Amir Rahnama wrote: > Hi, > > I am gonna store the results of my stream job into a db, which one of > databases ha
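[Editor's sketch] A minimal example of the suggested approach, assuming the spark-cassandra-connector is on the classpath; the keyspace, table, column names, and the `wordCounts` DStream are hypothetical placeholders:

    import com.datastax.spark.connector.SomeColumns
    import com.datastax.spark.connector.streaming._   // adds saveToCassandra to DStreams

    // hypothetical: a DStream[(String, Long)] produced by the stream transformation
    wordCounts.saveToCassandra("my_keyspace", "word_counts", SomeColumns("word", "count"))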

Re: Experiences about NoSQL databases with Spark

2015-11-28 Thread Yu Zhang
BTW, if you decide to try MongoDB, please use version 3.0+ with the WiredTiger engine. On Sat, Nov 28, 2015 at 11:30 PM, Yu Zhang wrote: > If you need to construct multiple indexes, hbase will perform better, the > writing speed is slow in mongodb with many indexes and the memory cost is >

Re: Experiences about NoSQL databases with Spark

2015-11-28 Thread Yu Zhang
If you need to construct multiple indexes, HBase will perform better; write speed in MongoDB is slow with many indexes and the memory cost is huge! But my concern is: with MongoDB you can easily work with JS, and with some visualization tools like D3.js the work will become smooth as

Re: Automatic driver restart does not seem to be working in Spark Standalone

2015-11-28 Thread swetha kasireddy
Yes, I mean killing the Spark job from the UI. Also, I use context.awaitTermination(). On Wed, Nov 25, 2015 at 6:23 PM, Tathagata Das wrote: > What do you mean by killing the streaming job using UI? Do you mean that > you are clicking the "kill" link in the Jobs page in the Spark UI? > > Also in the
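[Editor's sketch] For context, a minimal sketch of the two pieces this thread touches on; the app name, batch interval, and submit command are illustrative only:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("my-streaming-app")   // hypothetical app name
    val ssc = new StreamingContext(conf, Seconds(30))
    // ... define the streaming computation ...
    ssc.start()
    ssc.awaitTermination()   // blocks until the context is stopped or the driver exits

    // Automatic driver restart in standalone mode only applies to drivers submitted in
    // cluster mode with supervision, e.g.:
    //   spark-submit --master spark://master:7077 --deploy-mode cluster --supervise app.jar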

Parquet files not getting coalesced to smaller number of files

2015-11-28 Thread SRK
Hi, I have the following code that saves the Parquet files from my hourly batch to HDFS. My idea is to coalesce the files into 1500 smaller files. The first run gives me 1500 files in HDFS. For the next runs the files seem to keep increasing even though I coalesce. It's not getting coalesced to 150
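[Editor's sketch] A minimal sketch of the pattern under discussion; `df` and the output path are hypothetical placeholders. Note that coalesce(1500) only bounds the files written by the current run, so an appending hourly job still adds files to the directory each batch:

    // each hourly run writes at most 1500 files; appending runs accumulate on top of earlier ones
    df.coalesce(1500)
      .write
      .mode("append")
      .parquet("hdfs:///data/hourly_output")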

Re: Spark Streaming on mesos

2015-11-28 Thread Renjie Liu
Hi Nagaraj, Thanks for the response, but this does not solve my problem. I think executor memory should be proportional to the number of cores, or the number of cores in each executor should be the same. On Sat, Nov 28, 2015 at 1:48 AM Nagaraj Chandrashekar < nchandrashe...@innominds.com> wrote: > Hi Ren
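[Editor's sketch] For reference, a sketch of the coarse-grained Mesos settings involved in this discussion; the master URL and the values are illustrative only:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setMaster("mesos://zk://mesos-master:2181/mesos")   // hypothetical master URL
      .set("spark.mesos.coarse", "true")
      .set("spark.executor.memory", "4g")   // fixed per executor
      .set("spark.cores.max", "16")         // total cores; cores per executor can vary per node,
                                            // which is the mismatch raised in this thread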

storing result of aggregation of spark streaming

2015-11-28 Thread Amir Rahnama
Hi, I am going to store the results of my stream job in a database; which databases have native support (if any)? -- Thanks and Regards, Amir Hossein Rahnama *Tel: +46 (0) 761 681 102* Website: www.ambodi.com Twitter: @_ambodi

Retrieve best parameters from CrossValidator

2015-11-28 Thread BenFradet
Hi everyone, Is there a better way to retrieve the best model parameters obtained from cross validation than inspecting the logs issued while calling the fit method (with the call here: https://github.com/apache/spark/blob/branch-1.5/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.s
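[Editor's sketch] One way to do this without inspecting the logs, sketched under the assumption that `cv` is the configured CrossValidator and `training` the input DataFrame; taking the maximum metric assumes a "larger is better" evaluator:

    // fit and keep the CrossValidatorModel
    val cvModel = cv.fit(training)

    // avgMetrics (Spark 1.5+) lines up with the parameter grid, so the winning
    // combination can be read off directly
    val (bestParams, bestMetric) =
      cv.getEstimatorParamMaps.zip(cvModel.avgMetrics).maxBy(_._2)
    println(s"best metric $bestMetric with params:\n$bestParams")

    // the fitted best model itself is also exposed
    val best = cvModel.bestModel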

Spark and simulated annealing

2015-11-28 Thread marfago
Hi All, I would like to implement a simulated annealing algorithm with Spark. What is the best way to do that in Python or Scala? Is there a library that already implements it? Thank you in advance. Marco -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sp
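[Editor's sketch] There is no simulated annealing implementation in MLlib as far as I know; one common, hedged way to parallelise it is a multi-start approach: run many independent annealing chains with different seeds and keep the best result. In this sketch `energyFn`, `neighbourFn`, and `initialSolution` are hypothetical problem-specific placeholders:

    import scala.util.Random

    // one independent annealing chain with a simple linear cooling schedule
    def anneal(seed: Long, steps: Int,
               energy: Vector[Double] => Double,
               neighbour: (Vector[Double], Random) => Vector[Double],
               init: Vector[Double]): (Vector[Double], Double) = {
      val rnd = new Random(seed)
      var current = init; var currentE = energy(current)
      var best = current; var bestE = currentE
      for (step <- 1 to steps) {
        val temp = math.max(1.0 - step.toDouble / steps, 1e-9)
        val cand = neighbour(current, rnd)
        val candE = energy(cand)
        // always accept improvements; accept worse moves with a temperature-dependent probability
        if (candE < currentE || rnd.nextDouble() < math.exp((currentE - candE) / temp)) {
          current = cand; currentE = candE
          if (currentE < bestE) { best = current; bestE = currentE }
        }
      }
      (best, bestE)
    }

    // run e.g. 100 chains in parallel and take the global best
    val result = sc.parallelize(1L to 100L)
      .map(s => anneal(s, 10000, energyFn, neighbourFn, initialSolution))
      .reduce((a, b) => if (a._2 <= b._2) a else b)

The same multi-start pattern translates directly to PySpark by parallelizing the seeds and running the chain inside a map.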

General question on using StringIndexer in SparkML

2015-11-28 Thread Vishnu Viswanath
Hi All, I have a general question on using StringIndexer. StringIndexer gives an index to each label in the feature starting from 0 (0 for the most frequent label). Suppose I am building a model, and I use StringIndexer for transforming one of my columns. E.g., suppose A was the most frequent word, followe
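[Editor's sketch] A small example to make the indexing order concrete (the toy data and column names are made up): StringIndexer orders labels by frequency, so the most frequent label gets index 0.

    import org.apache.spark.ml.feature.StringIndexer

    // hypothetical toy data: "A" occurs 3 times, "B" twice, "C" once
    val df = sqlContext.createDataFrame(Seq(
      (0, "A"), (1, "A"), (2, "A"), (3, "B"), (4, "B"), (5, "C")
    )).toDF("id", "word")

    val indexer = new StringIndexer().setInputCol("word").setOutputCol("wordIndex")
    indexer.fit(df).transform(df).show()
    // "A" (most frequent) -> 0.0, "B" -> 1.0, "C" (least frequent) -> 2.0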

Re: StructType for oracle.sql.STRUCT

2015-11-28 Thread andy petrella
Warf... such a heavy task, man! I'd love to follow your work on that (I have long experience in geospatial too); is there a repo available for that already? The hard part will be to support all the descendant types I guess (lines, multilines, and so on), then creating the spatial operators. The only project

Confirm this won't parallelize/partition?

2015-11-28 Thread Jim Lohse
Hi, I got a good answer on the main question elsewhere; would anyone please confirm that the first code is the right approach? For an MVCE I am trying to adapt this example, and it seems like I am having Java issues with types (but this is basically the right approach?): int count = spark.parallel

StructType for oracle.sql.STRUCT

2015-11-28 Thread Pieter Minnaar
Hi, I need to read Oracle Spatial SDO_GEOMETRY tables into Spark. I need to know how to create a StructField to use in the schema definition for the Oracle geometry columns. In the standard JDBC the values are read as oracle.sql.STRUCT types. How can I get the same values in Spark SQL? Regards,
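[Editor's sketch] Spark's JDBC source has no built-in mapping for oracle.sql.STRUCT, so one hedged workaround is to let Oracle serialise the geometry to WKT text inside the query itself; the table, column, and connection details below are hypothetical:

    val geoms = sqlContext.read.format("jdbc").options(Map(
      "driver"  -> "oracle.jdbc.OracleDriver",
      "url"     -> "jdbc:oracle:thin:@//dbhost:1521/ORCL",
      "dbtable" -> "(SELECT id, SDO_UTIL.TO_WKTGEOMETRY(geom) AS geom_wkt FROM spatial_table)"
    )).load()
    // geom_wkt arrives as a plain string column; parse it with a geometry library (e.g. JTS) if needed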

streaming from Flume, save to Cassandra and Solr with Banana as search engine

2015-11-28 Thread Oleksandr Yermolenko
Hi, The aim: - collect syslogs - filter (discard really unneeded events) - save the rest to a Cassandra table - later I will have to integrate a search engine (Solr-based). Environment: Spark 1.5.2 / Scala 2.10.4, spark-cassandra-connector_2.11-1.5.0-M2.jar, Flume 1.6.0. I have reviewed a lot of examples
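[Editor's sketch] A hedged sketch of the pipeline shape being described, using FlumeUtils for ingestion and the connector's streaming support for the Cassandra sink; the host, port, keyspace, table, and filtering rule are all placeholders:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils
    import com.datastax.spark.connector.SomeColumns
    import com.datastax.spark.connector.streaming._

    val ssc = new StreamingContext(sc, Seconds(10))
    val flumeStream = FlumeUtils.createStream(ssc, "0.0.0.0", 4545)

    flumeStream
      .map(e => new String(e.event.getBody.array()))     // raw syslog line
      .filter(line => !line.contains("DEBUG"))            // drop really unneeded events
      .map(line => (java.util.UUID.randomUUID().toString, line))
      .saveToCassandra("syslog_ks", "events", SomeColumns("id", "message"))

    ssc.start()
    ssc.awaitTermination()

Note also that a _2.11 connector artifact will not run against a Scala 2.10 Spark build; the Scala versions need to match.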

df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-11-28 Thread Don Drake
I have a 2 TB dataset in a DataFrame that I am attempting to partition by 2 fields, and my YARN job seems to write the partitioned dataset successfully. I can see the output in HDFS once all Spark tasks are done. After the Spark tasks are done, the job appears to keep running for over an
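[Editor's sketch] For reference, a minimal sketch of the write pattern in question (column names and path are hypothetical). Each task keeps an open Parquet writer per partition value it encounters, which is one common source of the memory pressure seen with high-cardinality partition columns:

    df.write
      .partitionBy("field1", "field2")
      .mode("append")
      .parquet("hdfs:///data/partitioned_output")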

How to get a single available message from kafka (case where OffsetRange.fromOffset == OffsetRange.untilOffset)

2015-11-28 Thread Nikos Viorres
Hi, I am using KafkaUtils.createRDD to retrieve data from Kafka for batch processing, and when invoking KafkaUtils.createRDD with an OffsetRange where OffsetRange.fromOffset == OffsetRange.untilOffset for a particular partition, I get an empty RDD. The documentation is clear that until is exclusive and
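[Editor's sketch] Since untilOffset is exclusive, a range of [offset, offset + 1) retrieves exactly the single message at `offset`; a sketch with hypothetical broker, topic, and offset values:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // hypothetical broker
    val offset = 42L                                                   // the message we want
    // from == until selects nothing, so ask for [offset, offset + 1)
    val range = OffsetRange.create("my_topic", 0, offset, offset + 1)

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, Array(range))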

Re: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-28 Thread Felix Cheung
Ah, it's there in spark-submit and pyspark. Seems like it should be added for spark_ec2. _ From: Ted Yu Sent: Friday, November 27, 2015 11:50 AM Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash() To: Felix Cheung Cc: Andy Davidson , user @spark