Re: Continuous warning while consuming using new kafka-spark010 API

2016-09-20 Thread Cody Koeninger
-dev +user That warning pretty much means what it says - the consumer tried to get records for the given partition / offset, and couldn't do so after polling the Kafka broker for X amount of time. If that only happens when you put additional load on Kafka via producing, the first thing I'd do is
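For reference, a minimal sketch of raising the consumer poll timeout in the kafka-0-10 integration, in case the broker really does need longer under produce load (the value shown is only illustrative):

    import org.apache.spark.SparkConf

    // How long the cached consumer polls for records before giving up;
    // raise it if the broker slows down when producers add load.
    val conf = new SparkConf()
      .set("spark.streaming.kafka.consumer.poll.ms", "10000")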

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
Instead of *mode="append"*, try *mode="overwrite"* On Tue, Sep 20, 2016 at 11:30 AM, Sankar Mittapally < sankar.mittapa...@creditvidya.com> wrote: > Please find the code below. > > sankar2 <- read.df("/nfspartition/sankar/test/2016/08/test.json") > > I tried these two commands. >

Re: SPARK-10835 in 2.0

2016-09-20 Thread Sean Owen
You can probably just do an identity transformation on the column to make its type a nullable String array -- ArrayType(StringType, true). Of course, I'm not sure why Word2Vec must reject a non-nullable array type when it can handle nullable ones, but the previous discussion indicated that this
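For reference, a minimal sketch of that identity transformation (the column name is an assumption): a UDF's output is nullable by default, so applying an identity UDF re-types the column as ArrayType(StringType, true).

    import org.apache.spark.sql.functions.udf

    // Identity UDF: the output schema becomes a nullable string array,
    // ArrayType(StringType, true), which Word2Vec accepts.
    val asNullable = udf((xs: Seq[String]) => xs)
    val fixed = df.withColumn("words", asNullable(df("words")))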

RE: LDA and Maximum Iterations

2016-09-20 Thread Yang, Yuhao
Hi Frank, Which version of Spark are you using? Also, can you share more information about the exception? If it’s not confidential, you can send the data sample to me (yuhao.y...@intel.com) and I can try to investigate. Regards, Yuhao From: Frank Zhang [mailto:dataminin...@yahoo.com.INVALID]

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Sankar Mittapally
I used that one also On Sep 20, 2016 10:44 PM, "Kevin Mellott" wrote: > Instead of *mode="append"*, try *mode="overwrite"* > > On Tue, Sep 20, 2016 at 11:30 AM, Sankar Mittapally creditvidya.com> wrote: > >> Please find the code below. >> >>

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-20 Thread Peter Figliozzi
Hi Yan, I agree, it IS really confusing. Here is the technique for transforming a column. It is very general because you can make "myConvert" do whatever you want.

    import org.apache.spark.mllib.linalg.Vectors

    val df = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
    df.show()
    // The columns were named
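A completed sketch of the technique (the archive truncates the original; the column names and the parsing inside "myConvert" are assumptions):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.sql.functions.udf

    val df = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF("id", "features")

    // "myConvert" can do whatever you want; here it parses the bracketed
    // string into an MLlib dense Vector.
    val myConvert = udf { s: String =>
      Vectors.dense(s.stripPrefix("[").stripSuffix("]").split(",").map(_.toDouble))
    }

    val converted = df.withColumn("features", myConvert(df("features")))
    converted.show()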

Options for method createExternalTable

2016-09-20 Thread CalumAtTheGuardian
Hi, I am trying to create an external table WITH partitions using Spark. Currently I am using catalog.createExternalTable(cleanTableName, "ORC", schema, Map("path" -> s"s3://$location/")). Does createExternalTable have options that can create a table with partitions? I assume it would be a
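In Spark 2.0 that catalog API does not take a partitioning option, so one common workaround (a sketch; the data and partition columns are hypothetical, and it requires Hive support) is plain SQL DDL:

    // Hypothetical columns; LOCATION matches the Map("path" -> ...) above.
    spark.sql(s"""
      CREATE EXTERNAL TABLE $cleanTableName (id STRING, amount DOUBLE)
      PARTITIONED BY (dt STRING)
      STORED AS ORC
      LOCATION 's3://$location/'
    """)

    // Register partition directories that already exist at that location.
    spark.sql(s"MSCK REPAIR TABLE $cleanTableName")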

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
Can you please post the line of code that is doing the df.write command? On Tue, Sep 20, 2016 at 9:29 AM, Sankar Mittapally < sankar.mittapa...@creditvidya.com> wrote: > Hey Kevin, > > It is a empty directory, It is able to write part files to the directory > but while merging those part files

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Sankar Mittapally
Please find the code below.

    sankar2 <- read.df("/nfspartition/sankar/test/2016/08/test.json")

I tried these two commands.

    write.df(sankar2,"/nfspartition/sankar/test/test.csv","csv",header="true")
    saveDF(sankar2,"sankartest.csv",source="csv",mode="append",schema="true")

On Tue, Sep 20, 2016

Spark Tasks Taking Increasingly Longer

2016-09-20 Thread Chris Jansen
Hi, I've been struggling with the reliability of what, on the face of it, should be a fairly simple job: given a number of events, group them by user, event type and week in which they occurred, and aggregate their counts. The input data is fairly skewed, so I do a repartition and add a salt to the
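A sketch of the salting pattern mentioned above (column names and the bucket count are assumptions):

    import org.apache.spark.sql.functions._

    // Spread each hot (user, eventType, week) key over 16 salt buckets,
    // count per bucket, then sum the partial counts without the salt.
    val salted  = events.withColumn("salt", (rand() * 16).cast("int"))
    val partial = salted.groupBy(col("user"), col("eventType"), col("week"), col("salt")).count()
    val counts  = partial.groupBy(col("user"), col("eventType"), col("week")).agg(sum("count").as("count"))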

cassandra and spark can be built and worked on the same computer?

2016-09-20 Thread muhammet pakyürek
Can we connect to Cassandra from Spark using the spark-cassandra-connector when all three are built on the same computer? What kind of problems does this configuration lead to?

Task Deserialization Error

2016-09-20 Thread Chawla,Sumit
Hi All, I am trying to test a simple Spark app using Scala.

    import org.apache.spark.SparkContext

    object SparkDemo {
      def main(args: Array[String]) {
        val logFile = "README.md" // Should be some file on your system
        // to run in local mode
        val sc = new SparkContext("local", "Simple
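The archive cuts the snippet off; a minimal runnable completion would look like this (the app name and the count action are assumptions):

    import org.apache.spark.SparkContext

    object SparkDemo {
      def main(args: Array[String]) {
        val logFile = "README.md" // Should be some file on your system
        val sc = new SparkContext("local", "SimpleApp") // run in local mode
        val numLines = sc.textFile(logFile).count()
        println("Lines in " + logFile + ": " + numLines)
        sc.stop()
      }
    }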

Re: Similar Items

2016-09-20 Thread Nick Pentreath
How many products do you have? How large are your vectors? SVD / LSA could be helpful. But if you have many products then trying to compute all-pairs similarity with brute force is not going to be scalable. In this case you may want to investigate hashing (LSH) techniques. On

java.lang.ClassCastException: optional binary element (UTF8) is not a group

2016-09-20 Thread Rajan, Naveen
Dear All, My code works fine with JSON input data. When I tried the Parquet data format, it worked for English data. For Japanese text, I am getting the below stack-trace. Pls help! Caused by: java.lang.ClassCastException: optional binary element (UTF8) is not a group at

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-20 Thread Nick Pentreath
(cc'ing dev list also) I think a more general version of ranking metrics that allows arbitrary relevance scores could be useful. Ranking metrics are applicable to other settings like search or other learning-to-rank use cases, so it should be a little more generic than pure recommender settings.

spark sql thrift server: driver OOM

2016-09-20 Thread Young
Hi, all: I'm using the Spark SQL thrift server under Spark 1.3.1 to run Hive SQL queries. I started the thrift server like ./sbin/start-thriftserver.sh --master yarn-client --num-executors 12 --executor-memory 5g --driver-memory 5g, then sent continuous Hive SQL to the thrift server.

Re: is there any bug for the configuration of spark 2.0 cassandra spark connector 2.0 and cassandra 3.0.8

2016-09-20 Thread Todd Nist
These types of questions would be better asked on the user mailing list for the Spark Cassandra connector: http://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user Version compatibility can be found here:

Dataset doesn't have partitioner after a repartition on one of the columns

2016-09-20 Thread McBeath, Darin W (ELS-STL)
I'm using Spark 2.0. I've created a dataset from a parquet file, repartitioned it on one of the columns (docId), and persisted the repartitioned dataset. val om = ds.repartition($"docId").persist(StorageLevel.MEMORY_AND_DISK) When I try to confirm the partitioner with om.rdd.partitioner, I get
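For context, Dataset.repartition records its hash partitioning in the Catalyst query plan rather than in RDD.partitioner, so the check below prints None even though the data really was repartitioned (a sketch of the observation, not a fix):

    import org.apache.spark.storage.StorageLevel

    val om = ds.repartition(ds("docId")).persist(StorageLevel.MEMORY_AND_DISK)
    println(om.rdd.partitioner) // None: the partitioning lives in the plan, not the RDD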

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
Are you able to manually delete the folder below? I'm wondering if there is some sort of non-Spark factor involved (permissions, etc). /nfspartition/sankar/banking_l1_v2.csv On Tue, Sep 20, 2016 at 12:19 PM, Sankar Mittapally < sankar.mittapa...@creditvidya.com> wrote: > I used that one also >

Re: Similar Items

2016-09-20 Thread Nick Pentreath
A few options include: https://github.com/marufaytekin/lsh-spark - I've used this a bit and it seems quite scalable too from what I've looked at. https://github.com/soundcloud/cosine-lsh-join-spark - not used this but looks like it should do exactly what you need.

Re: LDA and Maximum Iterations

2016-09-20 Thread Frank Zhang
Hi Yuhao, Thank you so much for your great contribution to the LDA and other Spark modules! I use both Spark 1.6.2 and 2.0.0. The data I originally used is very large, with tens of millions of documents. But for test purposes, the data set I mentioned earlier

Re: Sending extraJavaOptions for Spark 1.6.1 on mesos 0.28.2 in cluster mode

2016-09-20 Thread Michael Gummelt
Probably this: https://issues.apache.org/jira/browse/SPARK-13258 As described in the JIRA, the workaround is to use SPARK_JAVA_OPTS. On Mon, Sep 19, 2016 at 5:07 PM, sagarcasual . wrote: > Hello, > I have my Spark application running in cluster mode in CDH with >

Re: Similar Items

2016-09-20 Thread Kevin Mellott
Thanks Nick - those examples will help a ton!! On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath wrote: > A few options include: > > https://github.com/marufaytekin/lsh-spark - I've used this a bit and it > seems quite scalable too from what I've looked at. >

Re: Similar Items

2016-09-20 Thread Peter Figliozzi
Related question: is there anything that does scalable matrix multiplication on Spark? For example, we have that long list of vectors and want to construct the similarity matrix: v * T(v). In R it would be: v %*% t(v) Thanks, Pete On Mon, Sep 19, 2016 at 3:49 PM, Kevin Mellott
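One built-in option for the v * t(v) product (a sketch; the input RDD of IndexedRow is hypothetical, and it assumes the result fits your cluster):

    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

    // rows: RDD[IndexedRow], one entry per vector in the list.
    val mat = new IndexedRowMatrix(rows).toBlockMatrix()

    // Distributed v * t(v): multiply the matrix by its transpose.
    val gram = mat.multiply(mat.transpose)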

Re: Similar Items

2016-09-20 Thread Kevin Mellott
Using the Soundcloud implementation of LSH, I was able to process a 22K product dataset in a mere 65 seconds! Thanks so much for the help! On Tue, Sep 20, 2016 at 1:15 PM, Kevin Mellott wrote: > Thanks Nick - those examples will help a ton!! > > On Tue, Sep 20, 2016

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Sankar Mittapally
Yeah I can do all operations on that folder On Sep 21, 2016 12:15 AM, "Kevin Mellott" wrote: > Are you able to manually delete the folder below? I'm wondering if there > is some sort of non-Spark factor involved (permissions, etc). > >

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Divya Gehlot
Spark version plz ? On 21 September 2016 at 09:46, Sankar Mittapally < sankar.mittapa...@creditvidya.com> wrote: > Yeah I can do all operations on that folder > > On Sep 21, 2016 12:15 AM, "Kevin Mellott" > wrote: > >> Are you able to manually delete the folder below?

Israel Spark Meetup

2016-09-20 Thread Romi Kuntsman
Hello, Please add a link in Spark Community page ( https://spark.apache.org/community.html) To Israel Spark Meetup (https://www.meetup.com/israel-spark-users/) We're an active meetup group, unifying the local Spark user community, and having regular meetups. Thanks! Romi K.

java.lang.ClassCastException: optional binary element (UTF8) is not a group

2016-09-20 Thread naveen.ra...@sony.com
My code works fine with JSON input format (Spark 1.6 on Amazon EMR, emr-5.0.0). I tried the Parquet format; it works fine for English data. But when I tried Parquet with some Japanese-language text, I got this weird stack-trace: *Caused by: java.lang.ClassCastException: optional binary

Re: Convert RDD to JSON Rdd and append more information

2016-09-20 Thread Deepak Sharma
Enrich the RDDs first with more information and then map them to some case class, if you are using Scala. You can then use the Play API classes (play.api.libs.json.Writes / play.api.libs.json.Json) to convert the mapped case class to JSON. Thanks Deepak On Tue, Sep 20, 2016 at 6:42 PM, sujeet jog
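A minimal sketch of that suggestion (the case class, the extra field, and the RDD shape are assumptions):

    import play.api.libs.json.{Json, Writes}

    // Enrich each row with extra information (here a constant source tag),
    // map it to a case class, then serialize it to a JSON string.
    case class Measurement(source: String, a: Double, b: Double, c: Double)
    implicit val w: Writes[Measurement] = Json.writes[Measurement]

    val jsonRdd = rdd.map { case (a, b, c) =>
      Json.toJson(Measurement("sensor-1", a, b, c)).toString
    }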

Convert RDD to JSON Rdd and append more information

2016-09-20 Thread sujeet jog
Hi, I have an RDD of n rows. I want to transform this to a JSON RDD and also add some more information; any idea how to accomplish this? For example, I have an RDD with n rows with data like below: , , 16.9527493170273,20.1989561393151,15.7065424947394

Re: SPARK-10835 in 2.0

2016-09-20 Thread Sean Owen
Ah, I think that this was supposed to be changed with SPARK-9062. Let me see about reopening 10835 and addressing it. On Tue, Sep 20, 2016 at 3:24 PM, janardhan shetty wrote: > Is this a bug? > > On Sep 19, 2016 10:10 PM, "janardhan shetty" wrote:

Re: SPARK-10835 in 2.0

2016-09-20 Thread janardhan shetty
Thanks Sean. On Sep 20, 2016 7:45 AM, "Sean Owen" wrote: > Ah, I think that this was supposed to be changed with SPARK-9062. Let > me see about reopening 10835 and addressing it. > > On Tue, Sep 20, 2016 at 3:24 PM, janardhan shetty > wrote: > > Is

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
Have you checked to see if any files already exist at /nfspartition/sankar/banking_l1_v2.csv? If so, you will need to delete them before attempting to save your DataFrame to that location. Alternatively, you may be able to specify the "mode" setting of the df.write operation to "overwrite",
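For comparison, the same mode setting in the Scala DataFrameWriter API (the thread itself uses SparkR's write.df; this is only a sketch):

    // Overwrite whatever already exists at the target path.
    df.write
      .mode("overwrite")
      .option("header", "true")
      .csv("/nfspartition/sankar/test/test.csv")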

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Sankar Mittapally
Hey Kevin, It is an empty directory. Spark is able to write part files to the directory, but while merging those part files we are getting the above error. Regards On Tue, Sep 20, 2016 at 7:46 PM, Kevin Mellott wrote: > Have you checked to see if any files already exist at

Re: SPARK-10835 in 2.0

2016-09-20 Thread janardhan shetty
Is this a bug? On Sep 19, 2016 10:10 PM, "janardhan shetty" wrote: > Hi, > > I am hitting this issue. https://issues.apache.org/jira/browse/SPARK-10835 > . > > Issue seems to be resolved but resurfacing in 2.0 ML. Any workaround is > appreciated ? > > Note: > Pipeline has