Re: Appropriate Apache Users List Uses

2016-02-09 Thread Ryan Victory
Yeah, a little disappointed with this, I wouldn't expect to be sent unsolicited mail based on my membership to this list. -Ryan Victory On Tue, Feb 9, 2016 at 1:36 PM, John Omernik wrote: > All, I received this today, is this appropriate list use? Note: This was >

Appropriate Apache Users List Uses

2016-02-09 Thread John Omernik
All, I received this today, is this appropriate list use? Note: This was unsolicited. Thanks John From: Pierce Lamb 11:57 AM (1 hour ago) to me Hi John, I saw you on the Spark Mailing List and noticed you worked for * and wanted to reach out. My company, SnappyData,

Re: Spark with .NET

2016-02-09 Thread Arko Provo Mukherjee
Doesn't seem to be supported, but thanks! I will probably write some .NET wrapper in my front end and use the java api in the backend. Warm regards Arko On Tue, Feb 9, 2016 at 12:05 PM, Ted Yu wrote: > This thread is related: >

Spark with .NET

2016-02-09 Thread Arko Provo Mukherjee
Hello, I want to use Spark (preferably Spark SQL) from C#. Does anyone have any pointers? Thanks & regards Arko

spark 1.6.0 connect to hive metastore

2016-02-09 Thread Koert Kuipers
Has anyone successfully connected to the Hive metastore using Spark 1.6.0? I am having no luck; it worked fine with Spark 1.5.1 for me. I am on CDH 5.5 and launching Spark with YARN. This is what I see in the logs: 16/02/09 14:49:12 INFO hive.metastore: Trying to connect to metastore with URI

Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Koert Kuipers
Hey, thanks. hive-site.xml is on the classpath in the conf directory. I currently got it to work by changing this Hive setting in hive-site.xml from hive.metastore.schema.verification=true to hive.metastore.schema.verification=false. This feels like a hack, because schema verification is a good thing; I would
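For reference, the change Koert describes is this property in hive-site.xml (standard Hive property syntax; whether disabling verification is advisable is exactly the concern he raises):

  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>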

Re: Appropriate Apache Users List Uses

2016-02-09 Thread u...@moosheimer.com
I wouldn't expect this either. Very disappointing... -Kay-Uwe Moosheimer > Am 09.02.2016 um 20:53 schrieb Ryan Victory : > > Yeah, a little disappointed with this, I wouldn't expect to be sent > unsolicited mail based on my membership to this list. > > -Ryan Victory > >>

Re: Spark with .NET

2016-02-09 Thread Bryan Jeffrey
Arko, Check this out: https://github.com/Microsoft/SparkCLR This is a Microsoft authored C# language binding for Spark. Regards, Bryan Jeffrey On Tue, Feb 9, 2016 at 3:13 PM, Arko Provo Mukherjee < arkoprovomukher...@gmail.com> wrote: > Doesn't seem to be supported, but thanks! I will

Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Alexandr Dzhagriev
Hi Koert, As far as I can see you are using Derby ("Using direct SQL, underlying DB is DERBY"), not MySQL, for the metastore. That means Spark couldn't find hive-site.xml on your classpath. Can you check that, please? Thanks, Alex. On Tue, Feb 9, 2016 at 8:58 PM, Koert Kuipers

Re: Spark with .NET

2016-02-09 Thread Ted Yu
This thread is related: http://search-hadoop.com/m/q3RTtwp4nR1lugin1=+NET+on+Apache+Spark+ On Tue, Feb 9, 2016 at 11:43 AM, Arko Provo Mukherjee < arkoprovomukher...@gmail.com> wrote: > Hello, > > I want to use Spark (preferable Spark SQL) using C#. Anyone has any > pointers to that? > > Thanks

Re: Bad Digest error while doing aws s3 put

2016-02-09 Thread Steve Loughran
> On 9 Feb 2016, at 07:19, lmk wrote: > > Hi Dhimant, > As I had indicated in my next mail, my problem was due to disk getting full > with log messages (these were dumped into the slaves) and did not have > anything to do with the content pushed into s3. So,

Re: ALS rating caching

2016-02-09 Thread Roberto Pagliari
Hi Nick, From which version does that apply? I'm using 1.5.2. Thank you, From: Nick Pentreath > Date: Tuesday, 9 February 2016 07:02 To: "user@spark.apache.org"

spark-cassandra-connector BulkOutputWriter

2016-02-09 Thread Alexandr Dzhagriev
Hello all, I looked through the cassandra spark integration ( https://github.com/datastax/spark-cassandra-connector) and couldn't find any usages of the BulkOutputWriter ( http://www.datastax.com/dev/blog/bulk-loading) - an awesome tool for creating local sstables, which could be later uploaded

[Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread bernhard
All, I'm new to Spark and I'm having a hard time doing a simple join of two DFs Intent: - I'm receiving data from Kafka via direct stream and would like to enrich the messages with data from Cassandra. The Kafka messages (Protobufs) are decoded into DataFrames and then joined with a

Dataset joinWith condition

2016-02-09 Thread Raghava Mutharaju
Hello All, joinWith() method in Dataset takes a condition of type Column. Without converting a Dataset to a DataFrame, how can we get a specific column? For eg: case class Pair(x: Long, y: Long) A, B are Datasets of type Pair and I want to join A.x with B.y A.joinWith(B, A.toDF().col("x") ==

Re: Dataset joinWith condition

2016-02-09 Thread Ted Yu
Please take a look at sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
  val ds1 = Seq(1, 2, 3).toDS().as("a")
  val ds2 = Seq(1, 2).toDS().as("b")
  checkAnswer(
    ds1.joinWith(ds2, $"a.value" === $"b.value", "inner"),
On Tue, Feb 9, 2016 at 7:07 AM, Raghava Mutharaju

createDataFrame question

2016-02-09 Thread jdkorigan
Hi, I would like to transform my rdd to a sql.dataframe.Dataframe, is there a possible conversion to do the job? or what would be the easiest way to do it?
  def ConvertVal(iter):
      # some code
      return sqlContext.createDataFrame(Row("val1", "val2", "val3", "val4"))
  rdd =

Re: createDataFrame question

2016-02-09 Thread satish chandra j
Hi, I hope you are aware of "toDF()", which is used to convert your RDD to a DataFrame. Regards, Satish Chandra On Tue, Feb 9, 2016 at 5:52 PM, jdkorigan wrote: > Hi, > > I would like to transform my rdd to a sql.dataframe.Dataframe, is there a > possible conversion to do the

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Hemant Bhanawat
For sql shuffle operations like groupby, the number of output partitions is controlled by spark.sql.shuffle.partitions. But, it seems orderBy does not honour this. In my small test, I could see that the number of partitions in DF returned by orderBy was equal to the total number of distinct

Re: Can't view executor logs in web UI on Windows

2016-02-09 Thread KhajaAsmath Mohammed
Hi, I am new to Spark and trying to learn by doing some programs on Windows. I faced the same issue when running on Windows: I cannot open the Spark WebUI. I can see the output, and the output folder has the information that I needed, but the logs state that the WebUI is stopped. Does anyone have a solution to view

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Takeshi Yamamuro
Hi, DataFrame#sort() uses `RangePartitioning` in `Exchange` instead of `HashPartitioning`. `RangePartitioning` roughly samples input data and internally computes partition bounds to split given rows into `spark.sql.shuffle.partitions` partitions. Therefore, when sort keys are highly skewed, I
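A quick way to see this effect (a sketch, assuming an existing sqlContext and a DataFrame df with a column "key"):

  val sorted = df.orderBy("key")
  // RangePartitioning samples the data, so the resulting partition count can end up lower than
  // spark.sql.shuffle.partitions when the sort keys are heavily skewed
  println(sorted.rdd.partitions.length)
  // an explicit repartition() rebalances the data, but note it does not preserve the sort order
  val rebalanced = sorted.repartition(200)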

Re: createDataFrame question

2016-02-09 Thread jdkorigan
When using this function:
  rdd = sc.textFile("").mapPartitions(ConvertVal).toDF()
I get an exception and the last line is:
  TypeError: 'JavaPackage' object is not callable
Since my function return value is already a DataFrame, maybe there is a way to access this type from my rdd? -- View this

RE: Can't view executor logs in web UI on Windows

2016-02-09 Thread Mark Pavey
I have submitted a pull request: https://github.com/apache/spark/pull/11135. Mark -Original Message- From: Mark Pavey [mailto:mark.pa...@thefilter.com] Sent: 05 February 2016 17:09 To: 'Ted Yu' Cc: user@spark.apache.org Subject: RE: Can't view executor logs in web UI on Windows We

jssc.textFileStream(directory): how to ensure it reads all incoming files entirely

2016-02-09 Thread unk1102
Hi, my actual use case is streaming text files from an HDFS directory and sending them to Kafka; please let me know if there is any existing solution for this. Anyway, I have the following code: //lets assume directory contains one file a.txt and it has 100 lines JavaDStream<String> logData =

Re: createDataFrame question

2016-02-09 Thread jdkorigan
The correct way is just to remove "sqlContext.createDataFrame", and everything works correctly:
  def ConvertVal(iter):
      # some code
      return Row("val1", "val2", "val3", "val4")
  rdd = sc.textFile("").mapPartitions(ConvertVal).toDF()
-- View this message in context:

HADOOP_HOME are not set when try to run spark application in yarn cluster mode

2016-02-09 Thread Rachana Srivastava
I am trying to run an application in yarn cluster mode. Spark-Submit with Yarn Cluster. Here are the settings of the shell script:
  spark-submit --class "com.Myclass" \
    --num-executors 2 \
    --executor-cores 2 \
    --master yarn \
    --supervise \
    --deploy-mode cluster \
    ../target/ \
My application is working

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Cesar Flores
Well, actually I am observing a single partition no matter what my input is. I am using Spark 1.3.1. From what you are both saying, it appears that this sorting issue (going to a single partition after applying orderBy on a DF) is solved in a later version of Spark? Well, if that is the case, I

Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Alexandr Dzhagriev
I'm using spark 1.6.0, hive 1.2.1 and there is just one property in the hive-site.xml hive.metastore.uris Works for me. Can you check in the logs, that when the HiveContext is created it connects to the correct uri and doesn't use derby. Cheers, Alex. On Tue, Feb 9, 2016 at 9:39 PM, Koert

Re: Spark with .NET

2016-02-09 Thread Ted Yu
Looks like they have some system support whose source is not in the repo: FYI On Tue, Feb 9, 2016 at 12:17 PM, Bryan Jeffrey wrote: > Arko, > > Check this out: https://github.com/Microsoft/SparkCLR > > This is a Microsoft authored C# language binding for Spark. >

Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Benjamin Kim
I got the same problem when I added the Phoenix plugin jar in the driver and executor extra classpaths. Do you have those set too? > On Feb 9, 2016, at 1:12 PM, Koert Kuipers wrote: > > yes its not using derby i think: i can see the tables in my actual hive > metastore. >

Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Koert Kuipers
Yes, it's not using Derby I think: I can see the tables in my actual Hive metastore. I was using a symlink to /etc/hive/conf/hive-site.xml for my hive-site.xml, which has a lot more stuff than just hive.metastore.uris. Let me try your approach. On Tue, Feb 9, 2016 at 3:57 PM, Alexandr Dzhagriev

Re: Spark with .NET

2016-02-09 Thread Silvio Fiorito
That’s just a .NET assembly (not related to Spark DataSets) but doesn’t look like they’re actually using it. It’s typically a default reference pulled in by the project templates. The code though is available from Mono here:

Learning Fails with 4 Layers in ANN Training with SGDOptimizer

2016-02-09 Thread Hayri Volkan Agun
Hi Everyone, When MultilayerPerceptronClassifier is set to three or four layers and the SGDOptimizer's selected parameters are as follows:
  tol: 1e-5
  numIter: 1
  layers: 82,100,30,29
  stepSize: 0.05
  sigmoidFunction in all layers
learning finishes but it doesn't converge. What may be the

RE: spark-cassandra-connector BulkOutputWriter

2016-02-09 Thread Mohammed Guller
Alex – I suggest posting this question on the Spark Cassandra Connector mailing list. The SCC developers are pretty responsive. Mohammed Author: Big Data Analytics with Spark From: Alexandr Dzhagriev

RE: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread Mohammed Guller
You may have better luck with this question on the Spark Cassandra Connector mailing list. One quick question about this code from your email: // Load DataFrame from C* data-source val base_data = base_data_df.getInstance(sqlContext) What exactly is base_data_df and how are

Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Jagat Singh
Hi, I am doing it by telling Spark about the Hive version we are using. This is done by setting the following properties: spark.sql.hive.version and spark.sql.hive.metastore.jars. Thanks On Wed, Feb 10, 2016 at 7:39 AM, Koert Kuipers wrote: > hey thanks. hive-site is on classpath in conf

Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Koert Kuipers
i do not have phoenix, but i wonder if its something related. will check my classpaths On Tue, Feb 9, 2016 at 5:00 PM, Benjamin Kim wrote: > I got the same problem when I added the Phoenix plugin jar in the driver > and executor extra classpaths. Do you have those set too? >

How to do a look up by id from files in hdfs inside a transformation/action in an RDD

2016-02-09 Thread SRK
Hi, How to do a lookup by id from a set of records stored in hdfs from inside a transformation/action of an RDD. Thanks, Swetha -- View this message in context:

How to collect/take arbitrary number of records in the driver?

2016-02-09 Thread SRK
Hi, How to get a fixed number of records from an RDD in the driver? Suppose I want the records from 100 to 1000 and then save them to some external database. I know that I can do it from the workers in a partition, but I want to avoid that for some reasons. The idea is to collect the data to the driver and

Re: Spark Increase in Processing Time

2016-02-09 Thread Ted Yu
1.4.1 was released half a year ago. I doubt whether there would be 1.4.x patch release any more. Please consider upgrading. On Tue, Feb 9, 2016 at 1:23 PM, Bryan wrote: > Ted, > > > > We are using an inverse reducer function, but we do have a filter function > in

RE: HADOOP_HOME are not set when try to run spark application in yarn cluster mode

2016-02-09 Thread Diwakar Dhanuskodi
It should work. Which version of Spark are you using? Try setting it up in the program using SparkConf set. Sent from Samsung Mobile. Original message From: rachana.srivast...@thomsonreuters.com Date: 10/02/2016 00:47 (GMT+05:30) To: diwakar.dhanusk...@gmail.com,

RE: How to collect/take arbitrary number of records in the driver?

2016-02-09 Thread Mohammed Guller
You can do something like this:
  val indexedRDD = rdd.zipWithIndex
  val filteredRDD = indexedRDD.filter { case (element, index) => (index >= 99) && (index < 199) }
  val result = filteredRDD.take(100)
Warning: the ordering of the elements in the RDD is not guaranteed. Mohammed Author: Big Data

Re: Spark with .NET

2016-02-09 Thread Ted Yu
bq. it is a .NET assembly and not really used by SparkCLR Then maybe drop the import? I was searching the SparkCLR repo to see whether (Spark) DataSet is supported. Cheers On Tue, Feb 9, 2016 at 3:07 PM, skaarthik oss wrote: > *Arko* – you could use the following

RE: Spark with .NET

2016-02-09 Thread skaarthik oss
Arko – you could use the following links to get started with SparkCLR API and use C# with Spark for DataFrame processing. If you need the support for interactive scenario, please feel free to share your scenario and requirements to the SparkCLR project. Interactive scenario is one of the focus

Re: [Spark 1.5+] ReceiverTracker seems not to stop Kinesis receivers

2016-02-09 Thread Shixiong(Ryan) Zhu
Could you do a thread dump in the executor that runs the Kinesis receiver and post it? It would be great if you can provide the executor log as well? On Tue, Feb 9, 2016 at 3:14 PM, Roberto Coluccio wrote: > Hello, > > can anybody kindly help me out a little bit

spark-csv partitionBy

2016-02-09 Thread Srikanth
Hello, I want to save Spark job result as LZO compressed CSV files partitioned by one or more columns. Given that partitionBy is not supported by spark-csv, is there any recommendation for achieving this in user code? One quick option is to i) cache the result dataframe ii) get unique
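A sketch of that "filter per unique value" workaround (the column name "date", the output path, and the LZO codec class are placeholders/assumptions; the "codec" write option assumes a reasonably recent spark-csv):

  val result = df.cache()
  // collect the distinct partition-column values to the driver (assumes a string column)
  val keys = result.select("date").distinct().collect().map(_.getString(0))
  keys.foreach { k =>
    result.filter(result("date") === k)
      .write
      .format("com.databricks.spark.csv")
      .option("codec", "com.hadoop.compression.lzo.LzopCodec")
      .save("/output/date=" + k)
  }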

Re: Appropriate Apache Users List Uses

2016-02-09 Thread Pierce Lamb
I sent this mail. It was not automated or part of a mass email. My apologies for misuse. Pierce On Tue, Feb 9, 2016 at 12:02 PM, u...@moosheimer.com wrote: > I wouldn't expect this either. > Very disappointing... > > -Kay-Uwe Moosheimer > > Am 09.02.2016 um 20:53 schrieb

Re: Dataset joinWith condition

2016-02-09 Thread Raghava Mutharaju
Ted, Thank you for the pointer. That works, but what does a string prefixed with a $ sign mean? Is it an expression? Could you also help me with the select() parameter syntax? I tried something similar, $"a.x", and it gives an error message that a TypedColumn is expected. Regards, Raghava. On
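On the two follow-up questions: $"..." is the ColumnName string interpolator pulled in by import sqlContext.implicits._, so $"a.value" is just an untyped Column expression, while Dataset.select() expects a TypedColumn, which you get from a Column with .as[...]. A small sketch against the Pair example (assumes Spark 1.6 and an existing sqlContext):

  import sqlContext.implicits._

  case class Pair(x: Long, y: Long)
  val a = Seq(Pair(1L, 2L), Pair(3L, 4L)).toDS().as("a")
  val b = Seq(Pair(2L, 5L), Pair(4L, 6L)).toDS().as("b")

  // joinWith takes a plain Column condition, so the $ interpolator works directly
  val joined = a.joinWith(b, $"a.x" === $"b.y")

  // select on a Dataset expects a TypedColumn; convert with .as[...]
  val xs = a.select($"x".as[Long])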

Re: [Spark 1.5+] ReceiverTracker seems not to stop Kinesis receivers

2016-02-09 Thread Roberto Coluccio
Hello, can anybody kindly help me out a little bit here? I just verified the problem is still there on Spark 1.6.0 and emr-4.3.0 as well. It's definitely a Kinesis-related issue, since with Spark 1.6.0 I'm successfully able to get Streaming drivers to terminate with no issue IF I don't use

Re: Spark with .NET

2016-02-09 Thread Arko Provo Mukherjee
Hello, Thanks much for your help, much helpful! Let me explore some of the stuff suggested :) Thanks & regards Arko On Tue, Feb 9, 2016 at 3:17 PM, Ted Yu wrote: > bq. it is a .NET assembly and not really used by SparkCLR > > Then maybe drop the import ? > > I was

How to use a register temp table inside mapPartitions of an RDD

2016-02-09 Thread SRK
hi, How to use a registerTempTable to register an RDD as a temporary table and use it inside mapPartitions of a different RDD? Thanks, Swetha -- View this message in context:

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Takeshi Yamamuro
The issue is still not solved, even in newer Spark. On Wed, Feb 10, 2016 at 1:36 AM, Cesar Flores wrote: > Well, actually I am observing a single partition no matter what my input > is. I am using spark 1.3.1. > > For what you both are saying, it appears that this sorting

AM creation in yarn client mode

2016-02-09 Thread praveen S
Hi, I have 2 questions when running Spark jobs on YARN in client mode: 1) Where is the AM (application master) created? A) Is it created on the client where the job was submitted, i.e. driver and AM on the same client? Or B) does YARN decide where the AM should be created? 2) Driver and AM

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Hemant Bhanawat
Ohk. I was comparing groupBy with orderBy and now I realize that they are using different partitioning schemes. Thanks Takeshi. On Tue, Feb 9, 2016 at 9:09 PM, Takeshi Yamamuro wrote: > Hi, > > DataFrame#sort() uses `RangePartitioning` in `Exchange` instead of >

Re: How to use a register temp table inside mapPartitions of an RDD

2016-02-09 Thread Koert Kuipers
If you mean to both register and use the table while you are inside mapPartitions, I do not think that is possible or advisable. Can you join the data? Or broadcast it? On Tue, Feb 9, 2016 at 8:22 PM, SRK wrote: > hi, > > How to use a registerTempTable to register an
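A sketch of the broadcast alternative Koert suggests (lookupRdd and bigRdd are placeholder names; this assumes the lookup side is small enough to collect to the driver):

  val lookup = sc.broadcast(lookupRdd.collectAsMap())   // Map of id -> value shipped to every executor

  val enriched = bigRdd.mapPartitions { iter =>
    iter.map { case (id, payload) => (id, payload, lookup.value.get(id)) }
  }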

Re: Slowness in Kmeans calculating fastSquaredDistance

2016-02-09 Thread Li Ming Tsai
Hi, It looks like Kmeans++ is slow (SPARK-3424) in the initialisation phase and is local to driver using 1 core only. If I use random, the job completed in 1.5mins compared to 1hr+. Should I move this to the dev list? Regards, Liming
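For reference, the "random" initialisation Li Ming compares against is selected like this in MLlib (a sketch; the value of k and the input RDD name are placeholders):

  import org.apache.spark.mllib.clustering.KMeans

  val model = new KMeans()
    .setK(10)
    .setInitializationMode("random")   // instead of the default "k-means||", whose init phase is the slow part described above
    .run(vectors)                      // vectors: RDD[org.apache.spark.mllib.linalg.Vector], assumed to exist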

Re: AM creation in yarn client mode

2016-02-09 Thread ayan guha
It depends on yarn-cluster and yarn-client mode. On Wed, Feb 10, 2016 at 3:42 PM, praveen S wrote: > Hi, > > I have 2 questions when running the spark jobs on yarn in client mode : > > 1) Where is the AM(application master) created : > > A) is it created on the client where

Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread bernhard
Hi Mohammed, I'm aware of that documentation; what are you hinting at specifically? I'm pushing all elements of the partition key, so that should work. As user zero323 on SO pointed out, the problem is most probably related to the dynamic nature of the predicate elements (two distributed

Pyspark - how to use UDFs with dataframe groupby

2016-02-09 Thread Viktor ARDELEAN
Hello, I am using the following transformations on an RDD:
  rddAgg = df.map(lambda l: (Row(a=l.a, b=l.b, c=l.c), l))\
             .aggregateByKey([],
                             lambda accumulatorList, value: accumulatorList + [value],
                             lambda list1, list2: [list1] + [list2])
I want to use the dataframe groupBy + agg

Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread bernhard
The filter in the join is re-arranged in the DAG (from what I can tell --> explain/UI) and should therefore be pushed accordingly. I also made experiments applying the filter to base_data before the join explicitly, effectively creating a new DF, but no luck either. Quoting Mohammed

Turning on logging for internal Spark logs

2016-02-09 Thread Li Ming Tsai
Hi, I have the default conf/log4j.properties:
  log4j.rootCategory=INFO, console
  log4j.appender.console=org.apache.log4j.ConsoleAppender
  log4j.appender.console.target=System.err
  log4j.appender.console.layout=org.apache.log4j.PatternLayout
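If the goal is more detail from Spark's own classes, the usual log4j 1.x approach is an extra logger line for the org.apache.spark package in that same file (a sketch; adjust the level or narrow the package as needed):

  log4j.logger.org.apache.spark=DEBUG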

RE: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread Mohammed Guller
Hi Bernhard, Take a look at the examples shown under the "Pushing down clauses to Cassandra" sections on this page: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md Mohammed Author: Big Data Analytics with Spark -Original Message- From:

Re: AM creation in yarn-client mode

2016-02-09 Thread Alexander Pivovarov
the pictures to illustrate it http://www.cloudera.com/documentation/enterprise/5-4-x/topics/cdh_ig_running_spark_on_yarn.html On Tue, Feb 9, 2016 at 10:18 PM, Jonathan Kelly wrote: > In yarn-client mode, the driver is separate from the AM. The AM is created > in YARN,

Re: AM creation in yarn client mode

2016-02-09 Thread Diwakar Dhanuskodi
Your 2nd assumption is correct. There is a YARN client which polls the AM while running in yarn-client mode. Sent from Samsung Mobile. Original message From: ayan guha Date: 10/02/2016 10:55 (GMT+05:30) To: praveen S Cc: user

RE: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread Mohammed Guller
Moving the spark mailing list to BCC since this is not really related to Spark. May be I am missing something, but where are you calling the filter method on the base_data DF to push down the predicates to Cassandra before calling the join method? Mohammed Author: Big Data Analytics with

Re: AM creation in yarn-client mode

2016-02-09 Thread praveen S
Can you explain what happens in yarn client mode? Regards, Praveen On 10 Feb 2016 10:55, "ayan guha" wrote: > It depends on yarn-cluster and yarn-client mode. > > On Wed, Feb 10, 2016 at 3:42 PM, praveen S wrote: > >> Hi, >> >> I have 2 questions when

Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread bernhard
Hi Mohammed, Thanks for the hint, I should probably do that :) As for the DF singleton:
  /**
   * Lazily instantiated singleton instance of base_data DataFrame
   */
  object base_data_df {
    @transient private var instance: DataFrame = _
    def getInstance(sqlContext: SQLContext): DataFrame = {
      if
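The preview cuts the singleton off at the "if"; the usual shape of this lazily-initialised pattern is roughly the following (a sketch only — the keyspace/table names are placeholders and everything past the truncation is an assumption, based on the thread loading base_data from the Cassandra data source):

  import org.apache.spark.sql.{DataFrame, SQLContext}

  object base_data_df {
    @transient private var instance: DataFrame = _

    def getInstance(sqlContext: SQLContext): DataFrame = {
      if (instance == null) {
        // placeholder keyspace/table names
        instance = sqlContext.read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "ks", "table" -> "base_data"))
          .load()
      }
      instance
    }
  }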

Re: AM creation in yarn-client mode

2016-02-09 Thread Jonathan Kelly
In yarn-client mode, the driver is separate from the AM. The AM is created in YARN, and YARN controls where it goes (though you can somewhat control it using YARN node labels--I just learned earlier today in a different thread on this list that this can be controlled by

Re: how to send JavaDStream RDD using foreachRDD using Java

2016-02-09 Thread unk1102
Hi Sachin, how did you write to Kafka from Spark? I can't find the methods sendString and sendDataAsString in KafkaUtils; can you please guide? KafkaUtil.sendString(p,topic,result.get(0)); KafkaUtils.sendDataAsString(MTP,topicName, result.get(0)); -- View this message in context:
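For what it's worth, Spark's KafkaUtils only provides consumer-side helpers; writing to Kafka is normally done with the plain Kafka producer API inside foreachRDD/foreachPartition. A sketch (broker list and topic are placeholders, and logData is assumed to be a DStream[String]):

  import java.util.Properties
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

  logData.foreachRDD { rdd =>
    rdd.foreachPartition { lines =>
      val props = new Properties()
      props.put("bootstrap.servers", "broker1:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)   // one producer per partition
      lines.foreach(line => producer.send(new ProducerRecord[String, String]("myTopic", line)))
      producer.close()
    }
  }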

RE: HADOOP_HOME are not set when try to run spark application in yarn cluster mode

2016-02-09 Thread Diwakar Dhanuskodi
Pass all Hadoop conf files as spark-submit parameters in --files. Sent from Samsung Mobile. Original message From: Rachana Srivastava Date: 09/02/2016 22:53 (GMT+05:30) To: user@spark.apache.org Cc: Subject: HADOOP_HOME are not
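In practice that looks roughly like the snippet below; alternatively, exporting HADOOP_CONF_DIR (or YARN_CONF_DIR) to point at the client-side Hadoop config directory is the usual fix when submitting in yarn-cluster mode (paths and the application jar name are placeholders):

  export HADOOP_CONF_DIR=/etc/hadoop/conf
  spark-submit --class com.Myclass \
    --master yarn --deploy-mode cluster \
    --files /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml,/etc/hadoop/conf/yarn-site.xml \
    ../target/myapp.jar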