Re: java.lang.UnsupportedOperationException: Cannot evaluate expression: fun_nm(input[0, string, true])

2016-08-18 Thread trsell
Hi, the stack trace suggests you're doing a join as well, and it's Python. I wonder if you're seeing this: https://issues.apache.org/jira/browse/SPARK-17100 Are you using Spark 2.0.0? Tim On Tue, 16 Aug 2016 at 16:58 Sumit Khanna wrote: > This is just the

Spark streaming

2016-08-18 Thread Diwakar Dhanuskodi
Hi, is there a way to specify in createDirectStream to receive only the last 'n' offsets of a specific topic and partition? I don't want to filter them out in foreachRDD.
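
createDirectStream has no such parameter directly, but the Spark 1.x direct API does accept explicit starting offsets. A hedged sketch, assuming ssc is a StreamingContext already in scope and that the latest offsets were fetched out of band (e.g. via Kafka's offset request API); the topic, broker and n below are hypothetical:

```
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker:9092")

// hypothetical: latest offset per partition, fetched out of band
val latestOffsets = Map(TopicAndPartition("mytopic", 0) -> 10000L)
val n = 100L

// start each partition n offsets before its latest known offset
val fromOffsets = latestOffsets.map { case (tp, latest) => tp -> math.max(0L, latest - n) }

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))
```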

Re: [Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Hyukjin Kwon
Ah, BTW, there is an issue, SPARK-16216, about printing dates and timestamps here, so please ignore the integer values for dates. 2016-08-19 9:54 GMT+09:00 Hyukjin Kwon: > Ah, sorry, I should have read this carefully. Do you mind if I ask your > codes to test? > > I would

Re: [Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Hyukjin Kwon
Ah, sorry, I should have read this carefully. Do you mind if I ask for your code to test? I would like to reproduce it. I just tested this myself but couldn't reproduce it, as below (this is what you're doing, right?): case class ClassData(a: String, b: Date) val ds: Dataset[ClassData] = Seq(
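
For reference, a minimal, self-contained sketch of the reproduction being discussed, assuming a Spark 2.0 shell where spark.implicits._ is already in scope; the output path is hypothetical:

```
import java.sql.Date

case class ClassData(a: String, b: Date)

// build a one-row Dataset with a Date column and write it out as CSV;
// in 2.0.0 the date may print as an integer (see SPARK-16216 above)
val ds = Seq(ClassData("a", Date.valueOf("2016-08-18"))).toDS()
ds.write.csv("/tmp/classdata-csv")
```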

Re: [Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Efe Selcuk
Thanks for the response. The problem with that thought is that I don't think I'm dealing with a complex nested type. It's just a dataset where every record is a case class with only simple types as fields, strings and dates. There's no nesting. That's what confuses me about how it's interpreting

Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-18 Thread Arun Luthra
This might be caused by a few large Map objects that Spark is trying to serialize. These are not broadcast variables or anything, they're just regular objects. Would it help if I further indexed these maps into a two-level Map i.e. Map[String, Map[String, Int]] ? Or would this still count against
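
If the maps are only read by the tasks, one standard mitigation (an assumption here, not something the thread confirms) is to broadcast them once per executor instead of capturing them in every task closure. A minimal sketch with hypothetical names:

```
// hypothetical lookup table that would otherwise be captured by each task closure
val lookup: Map[String, Int] = Map("a" -> 1, "b" -> 2)

// ship one read-only copy per executor instead of one copy per task
val lookupBc = sc.broadcast(lookup)

val rdd = sc.parallelize(Seq("a", "b", "c"))
val resolved = rdd.map(key => lookupBc.value.getOrElse(key, 0))
```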

Re: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Felix Cheung
I think lots of components expect to have read/write permission to a /tmp directory on HDFS. Glad it worked out! From: Andy Davidson Sent: Thursday, August 18, 2016 5:12 PM Subject: Re: pyspark

Re: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Andy Davidson
NICE CATCH!!! Many thanks. I spent all day on this bug. The error msg reported /tmp; I did not think to look on HDFS. [ec2-user@ip-172-31-22-140 notebooks]$ hadoop fs -ls hdfs:///tmp/ Found 1 items -rw-r--r-- 3 ec2-user supergroup 418 2016-04-13 22:49 hdfs:///tmp

Re: [Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Hyukjin Kwon
Hi Efe, if my understanding is correct, writing/reading complex types is not supported because the CSV format can't represent nested types in its own format. I guess that the external CSV library supporting them on write was rather a bug. I think it'd be great if we can write and read back CSV in
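
One possible workaround (an editor suggestion, not something proposed in the thread) is to render every column to a string before handing the frame to the CSV writer, assuming df is the frame in question and the path is hypothetical:

```
import org.apache.spark.sql.functions.col

// cast each column to string so the CSV writer only sees a flat, printable schema
val flattened = df.select(df.columns.map(c => col(c).cast("string").alias(c)): _*)
flattened.write.csv("/tmp/flat-csv")
```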

Re: Unable to see external table that is created from Hive Context in the list of hive tables

2016-08-18 Thread Mich Talebzadeh
Which Hive database did you specify in your CREATE EXTERNAL TABLE statement? How did you specify the location of the file? This one creates an external table testme in database test at the specified location: scala> sqltext = """ | CREATE EXTERNAL TABLE test.testme ( | col1
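
The statement above is cut off in the archive; a hedged completion of that kind of DDL, with hypothetical columns and location, and assuming sqlContext is a HiveContext:

```
val sqltext = """
CREATE EXTERNAL TABLE test.testme (
  col1 INT,
  col2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs:///user/hive/warehouse/test/testme'
"""
sqlContext.sql(sqltext)
sqlContext.sql("SHOW TABLES IN test").show()
```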

Re: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Felix Cheung
Do you have a file called tmp at / on HDFS? On Thu, Aug 18, 2016 at 2:57 PM -0700, "Andy Davidson" > wrote: For unknown reason I can not create UDF when I run the attached notebook on my cluster. I get the following error

Unable to see external table that is created from Hive Context in the list of hive tables

2016-08-18 Thread SRK
Hi, I created an external table in Spark SQL using hiveContext, something like CREATE EXTERNAL TABLE IF NOT EXISTS sampleTable STORED AS ORC LOCATION ... I can see the files getting created under the location I specified and am able to query it as well... but I don't see the table in Hive when I

pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Andy Davidson
For an unknown reason I cannot create a UDF when I run the attached notebook on my cluster. I get the following error: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. : java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path

[Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Efe Selcuk
We have an application working in Spark 1.6. It uses the Databricks CSV library for the output format when writing out. I'm attempting an upgrade to Spark 2. When writing with both the native DataFrameWriter#csv() method and with first specifying the "com.databricks.spark.csv" format (I suspect

Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.

2016-08-18 Thread Andy Davidson
Hi, I am using Python 3, Java 8 and spark-1.6.1. I am running my code in a Jupyter notebook. The following code runs fine on my Mac: udfRetType = ArrayType(StringType(), True) findEmojiUDF = udf(lambda s : re.findall(emojiPattern2, s), udfRetType) retDF = (emojiSpecialDF # convert into a

Re: Model Persistence

2016-08-18 Thread Nick Pentreath
Model metadata (mostly parameter values) is usually tiny. The Parquet data is most often the model coefficients, so this depends on the size of your model, i.e. your feature dimension. A high-dimensional linear model can be quite large, but still typically easy to fit into main memory on a
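
For reference, a minimal save/load round trip in spark.ml 2.0, assuming model is an already fitted LogisticRegressionModel and the path is hypothetical:

```
import org.apache.spark.ml.classification.LogisticRegressionModel

// writes <path>/metadata (JSON parameter values) and <path>/data (Parquet coefficients)
model.save("hdfs:///models/lr-v1")
val loaded = LogisticRegressionModel.load("hdfs:///models/lr-v1")
```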

py4j.Py4JException: Method lower([class java.lang.String]) does not exist

2016-08-18 Thread Andy Davidson
I am still working on spark-1.6.1. I am using Java 8. Any idea why (df.select("*", functions.lower("rawTag").alias("tag")) would raise py4j.Py4JException: Method lower([class java.lang.String]) does not exist Thanks in advance Andy

Extra string added to column name? (withColumn & expr)

2016-08-18 Thread rachmaninovquartet
Hi, I'm trying to implement a custom one-hot encoder, since I want the output to be a specific way, suitable for Theano. Basically, it will give a new column for each distinct member of the original features and have it set to 1 if the observation contains the specific member of the distinct

Re: Large where clause StackOverflow 1.5.2

2016-08-18 Thread rachmaninovquartet
I solved this by using a Window partitioned by 'id'. I used lead and lag to create columns which contained nulls in the places that I needed to delete, in each fold. I then removed those rows with the nulls and dropped my additional columns.
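
A sketch approximating that approach (column names are hypothetical, and drop is called once per helper column since multi-column drop arrived after 1.5):

```
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, lead}

val w = Window.partitionBy("id").orderBy("ts")

// the helper columns are null exactly at the rows to be removed
val marked = df.withColumn("prev", lag(col("value"), 1).over(w))
               .withColumn("next", lead(col("value"), 1).over(w))

val cleaned = marked.filter(col("prev").isNotNull && col("next").isNotNull)
                    .drop("prev")
                    .drop("next")
```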

Model Persistence

2016-08-18 Thread Rich Tarro
The following Databricks blog on Model Persistence states "Internally, we save the model metadata and parameters as JSON and the data as Parquet." https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html What data associated with a model or

Structured Stream Behavior on failure

2016-08-18 Thread Cornelio
Hi, I have a couple of questions. 1. When Spark shuts down or fails, the docs state that "In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off." To achieve this, do I just need to set the checkpoint dir as
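
As a sketch of the first question (assuming streamingDF is a streaming DataFrame and the paths are hypothetical), recovery in 2.0 structured streaming is driven by the checkpointLocation option on the sink:

```
val query = streamingDF.writeStream
  .format("parquet")
  .option("path", "hdfs:///output/events")
  // progress and state are tracked here; restart with the same
  // location and the query resumes where it left off
  .option("checkpointLocation", "hdfs:///checkpoints/events")
  .start()
```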

Re: Aggregations with scala pairs

2016-08-18 Thread Andrés Ivaldi
Thanks!!! On Thu, Aug 18, 2016 at 3:35 AM, Jean-Baptiste Onofré wrote: > Agreed. > > Regards > JB > On Aug 18, 2016, at 07:32, Olivier Girardot com> wrote: >> >> CC'ing dev list, >> you should open a Jira and a PR related to it to discuss it

createDirectStream parallelism

2016-08-18 Thread Diwakar Dhanuskodi
We are using the createDirectStream API to receive messages from a 48-partition topic. I am setting --num-executors 48 & --executor-cores 1 in spark-submit. All partitions were received in parallel and the corresponding RDDs in foreachRDD were executed in parallel. But when I join a transformed RDD

Re: Standalone executor memory is fixed while executor cores are load balanced between workers

2016-08-18 Thread Mich Talebzadeh
Can you provide some info? In your conf/spark-env.sh, what do you set these to? # Options for the daemons used in the standalone deploy mode SPARK_WORKER_CORES=? ## total number of cores to be used by executors by each worker SPARK_WORKER_MEMORY=?g ## to set how much total memory workers have to

RE: pyspark.sql.functions.last not working as expected

2016-08-18 Thread Alexander Peletz
Is the issue that the default rangeBetween = rangeBetween(-sys.maxsize, 0)? That would explain the behavior below. Is this default documented somewhere? From: Alexander Peletz [mailto:alexand...@slalom.com] Sent: Wednesday, August 17, 2016 8:48 PM To: user Subject: RE:
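
A Scala sketch of the suspected behavior (column names hypothetical): widening the frame explicitly makes last() scan the whole partition instead of stopping at the current row:

```
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// the default frame ends at the current row; widen it to cover the whole partition
val w = Window.partitionBy("k").orderBy("t")
              .rowsBetween(Long.MinValue, Long.MaxValue)

val result = df.withColumn("last_t", last("t").over(w))
```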

Standalone executor memory is fixed while executor cores are load balanced between workers

2016-08-18 Thread Petr Novak
Hello, when I set spark.executor.cores to e.g. 8 cores and spark.executor.memory to 8GB, it can allocate more executors with fewer cores for my app, but each executor gets 8GB RAM. It is a problem because I can allocate more memory across the cluster than expected; the worst case is 8x 1-core

Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-18 Thread janardhan shetty
There is a spark-ts package developed by Sandy which has an RDD version. Not sure about the DataFrame roadmap. http://sryza.github.io/spark-timeseries/0.3.0/index.html On Aug 18, 2016 12:42 AM, "ayan guha" wrote: > Thanks a lot. I resolved it using an UDF. > > Qs: does spark

Reporting errors from spark sql

2016-08-18 Thread yael aharon
Hello, I am working on an SQL editor which is powered by Spark SQL. When the SQL is not valid, I would like to provide the user with the line number and column number where the first error occurred. I am having a hard time finding a mechanism that will give me that information programmatically.
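
One possible starting point (an editor assumption, not confirmed in the thread): Spark's AnalysisException carries optional line and position fields, so catching it around the parse/analysis call exposes them, assuming sqlContext is in scope:

```
import org.apache.spark.sql.AnalysisException

try {
  sqlContext.sql("SELECT bad syntax FROM nowhere")   // hypothetical invalid query
} catch {
  case e: AnalysisException =>
    val line = e.line.getOrElse(-1)
    val pos  = e.startPosition.getOrElse(-1)
    println(s"error at line $line, position $pos: ${e.message}")
}
```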

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-18 Thread chandan prakash
Yes, I looked into the source code implementation. sparkConf is serialized and saved during checkpointing and re-created from the checkpoint directory at restart time. So any sparkConf parameter which you load from application.config and set on the sparkConf object in code cannot be changed and
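
This is visible in the standard getOrCreate pattern: the creating function only runs when no checkpoint exists, so whatever it read into SparkConf is frozen into the checkpoint. A sketch with hypothetical paths and names:

```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/myapp"

def createContext(): StreamingContext = {
  // values read here (e.g. from application.config) are serialized into the checkpoint
  val conf = new SparkConf().setAppName("myapp")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... set up the direct stream and output operations here ...
  ssc
}

// on restart this restores from the checkpoint and never calls createContext()
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
```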

Re: JavaRDD to DataFrame fails with null pointer exception in 1.6.0

2016-08-18 Thread Aditya
Check if the schema is generated correctly. On Wednesday 17 August 2016 10:15 AM, sudhir patil wrote: Tested with Java 7 & 8, same issue on both versions. On Aug 17, 2016 12:29 PM, "spats" wrote: Cannot convert JavaRDD to

Re: SparkStreaming source code

2016-08-18 Thread Adonis Settouf
Might it be this: https://github.com/apache/spark/tree/master/streaming

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-18 Thread Cody Koeninger
Checkpointing is not kafka-specific. It encompasses metadata about the application. You can't re-use a checkpoint if your application has changed. http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code On Thu, Aug 18, 2016 at 4:39 AM, chandan prakash

Re: [Spark 2.0] ClassNotFoundException is thrown when using Hive

2016-08-18 Thread Aditya
Try using --files /path/of/hive-site.xml in spark-submit and run. On Thursday 18 August 2016 05:26 PM, Diwakar Dhanuskodi wrote: Hi Can you cross check by providing same library path in --jars of spark-submit and run . Sent from Samsung Mobile. Original message From:

SparkStreaming source code

2016-08-18 Thread Aditya
Hi, I need to set up the source code of Spark Streaming for exploration purposes. Can anyone suggest the link for the Spark Streaming source code? Regards, Aditya Calangutkar

RE: [Spark 2.0] ClassNotFoundException is thrown when using Hive

2016-08-18 Thread Diwakar Dhanuskodi
Hi, can you cross-check by providing the same library path in --jars of spark-submit and run? Original message From: "颜发才(Yan Facai)" Date: 18/08/2016 15:17 (GMT+05:30) To: "user.spark"

Re: Converting Dataframe to resultSet in Spark Java

2016-08-18 Thread rakesh sharma
Hi Sree, I don't think what you are trying to do is correct. DataFrame and ResultSet are two different types, and no strongly typed language will allow you to do that. If your intention is to traverse the DataFrame or get the individual rows and columns, then you must try the map function and
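
A sketch of the collect-and-read route, with a hypothetical two-column frame in scope as df:

```
// pull the rows back to the driver and read typed fields,
// rather than casting the DataFrame to a ResultSet
val rows = df.select("name", "count").collect()

rows.foreach { row =>
  val name  = row.getString(0)
  val count = row.getLong(1)
  println(s"$name -> $count")
}
```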

Re: [Spark 2.0] ClassNotFoundException is thrown when using Hive

2016-08-18 Thread Mich Talebzadeh
When you start spark-shell does it work, or is this issue only with spark-submit? Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Spark Streaming application failing with Token issue

2016-08-18 Thread Kamesh
Hi all, I am running a Spark Streaming application that stores events into a secure (Kerberized) HBase cluster. I launched this streaming application by passing --principal and --keytab. Despite this, the application is failing after 7 days with a token issue. Can someone please

[Spark 2.0] ClassNotFoundException is thrown when using Hive

2016-08-18 Thread Yan Facai
Hi, all. I copied hdfs-site.xml, core-site.xml and hive-site.xml to $SPARK_HOME/conf, and spark-submit is used to submit the task to YARN, running in **client** mode. However, a ClassNotFoundException is thrown. Some details of the logs are listed below: ``` 16/08/12 17:07:32 INFO hive.HiveUtils:

Re: 2.0.1/2.1.x release dates

2016-08-18 Thread Sean Owen
Historically, minor releases happen every ~4 months, and maintenance releases are a bit ad hoc but come about a month after the minor release. It's up to the release manager to decide to do them but maybe realistic to expect 2.0.1 in early September. On Thu, Aug 18, 2016 at 10:35 AM, Adrian

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-18 Thread chandan prakash
Is it possible to use the checkpoint directory to restart streaming but with a modified parameter value in the config file (e.g. username/password for the db connection)? Thanks in advance. Regards, Chandan On Thu, Aug 18, 2016 at 1:10 PM, chandan prakash wrote: > Hi, > I

2.0.1/2.1.x release dates

2016-08-18 Thread Adrian Bridgett
Just wondering if there were any rumoured release dates for either of the above. I'm seeing some odd hangs with 2.0.0 and Mesos (and I know that the Mesos integration has had a bit of updating in 2.1.x). Looking at JIRA, there's no suggested release date and issues seem to be added to a

GraphX VerticesRDD issue - java.lang.ArrayStoreException: java.lang.Long

2016-08-18 Thread Gerard Casey
Dear all, I am building a graph from two JSON files (Spark version 1.6.1), creating Edge and Vertex RDDs from the JSON files. The vertex JSON file looks like this: {"toid": "osgb400031043205", "index": 1, "point": [508180.748, 195333.973]} {"toid": "osgb400031043206",

Re: How to Improve Random Forest classifier accuracy

2016-08-18 Thread Jörn Franke
It depends on your data... How did you split the training and test set? How does the model fit the data? You could of course also try feeding more data into the model. Have you considered alternative machine learning models? I do not think this is a Spark problem, but you should ask the

How to Improve Random Forest classifier accuracy

2016-08-18 Thread 陈哲
Hi All, I am using the spark.ml Random Forest classifier. I have only two label categories (1, 0), about 30 features, and a data size over 100,000. I ran the Spark JavaRandomForestClassifierExample code, and the model came out with these results (I made some changes to show more detailed results): Test Error =

Converting Dataframe to resultSet in Spark Java

2016-08-18 Thread Sree Eedupuganti
Retrieved the data into a DataFrame but I can't convert it into a ResultSet. Is there any possible way to convert it? Any suggestions please... Exception in thread "main" java.lang.ClassCastException: org.apache.spark.sql.DataFrame cannot be cast to com.datastax.driver.core.ResultSet -- Best

Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-18 Thread ayan guha
Thanks a lot. I resolved it using a UDF. Qs: does Spark support any time series models? Is there any roadmap to know when a feature will be roughly available? On 18 Aug 2016 16:46, "Yanbo Liang" wrote: > If you want to tie them with other data, I think the best way is to use

spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-18 Thread chandan prakash
Hi, I am using direct Kafka with checkpointing of offsets, same as: https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/IdempotentExample.scala I need to change some parameters, like the db connection params (username/password for the db connection). I stopped streaming

Re: Spark MLlib question: load model failed with exception:org.json4s.package$MappingException: Did not find value which can be converted into java.lang.String

2016-08-18 Thread Yanbo Liang
It looks like you mixed ALS from the spark.ml and spark.mllib packages. You can train the model with either one, but you should use the corresponding save/load functions. You cannot train/save the model with spark.mllib ALS and then use spark.ml ALS to load the model; it will throw exceptions.
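
A sketch keeping both sides in spark.ml, with hypothetical column names and path, and ratingsDF assumed in scope:

```
import org.apache.spark.ml.recommendation.{ALS, ALSModel}

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")

val model = als.fit(ratingsDF)

// save and load must come from the same package
model.save("hdfs:///models/als-v1")
val loaded = ALSModel.load("hdfs:///models/als-v1")
```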

Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-18 Thread Yanbo Liang
If you want to tie them to other data, I think the best way is to use a DataFrame join operation, provided that they share an identity column. Thanks Yanbo 2016-08-16 20:39 GMT-07:00 ayan guha: > Hi > > Thank you for your reply. Yes, I can get prediction and original
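
A sketch of that join, assuming an identity column "id" was attached to the input before prediction and that model and testDF are in scope:

```
// transform() keeps the input columns, so the id column survives into the output
val predictions = model.transform(testDF)
val tied = testDF.join(predictions.select("id", "prediction"), "id")
```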

Re: Spark SQL 1.6.1 issue

2016-08-18 Thread Teik Hooi Beh
Hi, does the version have to be the same down to the minor version? E.g. will 1.6.1 and 1.6.2 give this issue? On Thu, Aug 18, 2016 at 6:27 PM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > your executors/driver must not have the multiple versions of spark in > classpath, it may

Re: error when running spark from oozie launcher

2016-08-18 Thread tkg_cangkul
You're right, Jean: it's a library mismatch between the default Oozie sharelib and my Spark version. I've replaced it and it works normally now. Thanks for your help, Jean. On 18/08/16 13:37, Jean-Baptiste Onofré wrote: It sounds like a mismatch between the Spark version shipped in Oozie and the runtime one.

Re: VectorUDT with spark.ml.linalg.Vector

2016-08-18 Thread Yanbo Liang
@Michal Yes, we have a public VectorUDT in the spark.mllib package in 1.6, and this class still exists in 2.0. From 2.0, we provide a new VectorUDT in the spark.ml package and make it temporarily private (it will be public in the near future). Since, from 2.0, the spark.mllib package will be in maintenance

Re: error when running spark from oozie launcher

2016-08-18 Thread Jean-Baptiste Onofré
It sounds like a mismatch between the Spark version shipped in Oozie and the runtime one. Regards JB On Aug 18, 2016, at 07:36, tkg_cangkul wrote: >hi olivier, thx for your reply. > >this is the full stacktrace : > >Failing Oozie Launcher, Main class

Re: error when running spark from oozie launcher

2016-08-18 Thread tkg_cangkul
Hi Olivier, thanks for your reply. This is the full stacktrace: Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, org.apache.spark.launcher.CommandBuilderUtils.addPermGenSizeOpt(Ljava/util/List;)V java.lang.NoSuchMethodError:

Re: Aggregations with scala pairs

2016-08-18 Thread Jean-Baptiste Onofré
Agreed. Regards JB On Aug 18, 2016, at 07:32, Olivier Girardot wrote: >CC'ing dev list, you should open a Jira and a PR related to it to >discuss it c.f.

Re: Aggregations with scala pairs

2016-08-18 Thread Olivier Girardot
CC'ing dev list, you should open a Jira and a PR related to it to discuss it c.f. https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingCodeChanges On Wed, Aug 17, 2016 4:01 PM, Andrés Ivaldi iaiva...@gmail.com wrote: Hello, I'd like to

Re: Spark DF CacheTable method. Will it save data to disk?

2016-08-18 Thread Olivier Girardot
that's another "pipeline" step to add, whereas persist is only relevant during the lifetime of your jobs, and stores data not in HDFS but on the local disk of your executors. On Wed, Aug 17, 2016 5:56 PM, neil90 neilp1...@icloud.com wrote: From the spark
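
A sketch of what that looks like, assuming df is in scope; MEMORY_AND_DISK spills to executor-local disk rather than writing anything to HDFS:

```
import org.apache.spark.storage.StorageLevel

// cached for the lifetime of the application; partitions that do not
// fit in memory spill to executor-local disk, not HDFS
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()   // the first action materializes the cache
```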

Re: error when running spark from oozie launcher

2016-08-18 Thread Olivier Girardot
This is not the full stacktrace; please post the full stacktrace if you want some help. On Wed, Aug 17, 2016 7:24 PM, tkg_cangkul yuza.ras...@gmail.com wrote: hi, I try to submit a Spark job with Oozie, but I've got one problem here: when I submit the same job, sometimes my job succeeds but

Re: Spark SQL 1.6.1 issue

2016-08-18 Thread Olivier Girardot
Your executors/driver must not have multiple versions of Spark in the classpath; it may come from the Cassandra connector. Check the pom dependencies of the version you fetched and whether it's compatible with your Spark version. On Thu, Aug 18, 2016 6:05 AM, thbeh th...@thbeh.com wrote: Running