Re: Continuous warning while consuming using new kafka-spark010 API

2016-09-20 Thread Cody Koeninger
-dev +user

That warning pretty much means what it says - the consumer tried to
get records for the given partition / offset, and couldn't do so after
polling the kafka broker for X amount of time.

If that only happens when you put additional load on Kafka via
producing, the first thing I'd do is look at what's going on on your
kafka brokers.
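
For reference, a minimal sketch (not from this thread) of where
spark.streaming.kafka.consumer.poll.ms and the other consumer settings go with
the kafka-0-10 direct stream; the broker address, topic, group id and batch
interval below are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// The poll timeout in the warning is a Spark setting, so it belongs on the
// SparkConf used to build the StreamingContext, not in kafkaParams.
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.streaming.kafka.consumer.poll.ms", "10000")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092",              // placeholder broker list
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group")             // placeholder group id

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topic2"), kafkaParams))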

On Mon, Sep 19, 2016 at 11:57 AM, Nitin Goyal  wrote:
> Hi All,
>
> I am using the new kafka-spark010 API to consume messages from Kafka
> (brokers running kafka 0.10.0.1).
>
> I am continuously seeing the following warning, but only when a producer is
> writing messages to Kafka in parallel (I increased
> spark.streaming.kafka.consumer.poll.ms to 1024 ms as well):
>
> 16/09/19 16:44:53 WARN TaskSetManager: Lost task 97.0 in stage 32.0 (TID
> 4942, host-3): java.lang.AssertionError: assertion failed: Failed to get
> records for spark-executor-example topic2 8 1052989 after polling for 1024
>
> while at same time, I see this in spark UI corresponding to that job
> topic: topic2   partition: 8   offsets: 1051731 to 1066124
>
> Code :-
>
> val stream = KafkaUtils.createDirectStream[String, String]( ssc,
> PreferConsistent, Subscribe[String, String](topics, kafkaParams.asScala) )
>
> stream.foreachRDD {rdd => rdd.filter(_ => false).collect}
>
>
> Has anyone encountered this with the new API? Is this the expected behaviour
> or am I missing something here?
>
> --
> Regards
> Nitin Goyal




Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
Instead of *mode="append"*, try *mode="overwrite"*
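
For reference, a minimal sketch (in Scala; not from this thread) of what the
overwrite mode looks like on the DataFrameWriter - the SparkR write.df/saveDF
calls quoted below take the same mode argument:

// Assumes `df` is an existing DataFrame and the target path already contains
// files; "overwrite" replaces them instead of failing or appending.
df.write
  .format("csv")
  .option("header", "true")
  .mode("overwrite")
  .save("/nfspartition/sankar/test/test.csv")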

On Tue, Sep 20, 2016 at 11:30 AM, Sankar Mittapally <
sankar.mittapa...@creditvidya.com> wrote:

> Please find the code below.
>
> sankar2 <- read.df("/nfspartition/sankar/test/2016/08/test.json")
>
> I tried these two commands.
> write.df(sankar2,"/nfspartition/sankar/test/test.csv","csv",header="true")
>
> saveDF(sankar2,"sankartest.csv",source="csv",mode="append",schema="true")
>
>
>
> On Tue, Sep 20, 2016 at 9:40 PM, Kevin Mellott 
> wrote:
>
>> Can you please post the line of code that is doing the df.write command?
>>
>> On Tue, Sep 20, 2016 at 9:29 AM, Sankar Mittapally <
>> sankar.mittapa...@creditvidya.com> wrote:
>>
>>> Hey Kevin,
>>>
>>> It is an empty directory. It is able to write part files to the directory,
>>> but while merging those part files we are getting the above error.
>>>
>>> Regards
>>>
>>>
>>> On Tue, Sep 20, 2016 at 7:46 PM, Kevin Mellott <
>>> kevin.r.mell...@gmail.com> wrote:
>>>
 Have you checked to see if any files already exist at
 /nfspartition/sankar/banking_l1_v2.csv? If so, you will need to delete
 them before attempting to save your DataFrame to that location.
 Alternatively, you may be able to specify the "mode" setting of the
 df.write operation to "overwrite", depending on the version of Spark you
 are running.

 *ERROR (from log)*
 16/09/17 08:03:28 WARN FileUtil: Failed to delete file or
 dir[/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task
 _201609170802_0013_m_00/.part-r-0-46a7f178-2490-444e
 -9110-510978eaaecb.csv.crc]:
 it still exists.
 16/09/17 08:03:28 WARN FileUtil: Failed to delete file or
 dir[/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task
 _201609170802_0013_m_00/part-r-0-46a7f178-2490-444e-
 9110-510978eaaecb.csv]:
 it still exists.

 *df.write Documentation*
 http://spark.apache.org/docs/latest/api/R/write.df.html

 Thanks,
 Kevin

 On Tue, Sep 20, 2016 at 12:16 AM, sankarmittapally <
 sankar.mittapa...@creditvidya.com> wrote:

>  We have set up a Spark cluster on NFS shared storage. There are no
> permission issues with the NFS storage; all the users are able to write to
> NFS storage. When I fire the write.df command in SparkR, I get the error
> below. Can someone please help me to fix this issue?
>
>
> 16/09/17 08:03:28 ERROR InsertIntoHadoopFsRelationCommand: Aborting
> job.
> java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus
> {path=file:/nfspartition/sankar/banking_l1_v2.csv/_temporary
> /0/task_201609170802_0013_m_00/part-r-0-46a7f178-249
> 0-444e-9110-510978eaaecb.csv;
> isDirectory=false; length=436486316; replication=1; blocksize=33554432;
> modification_time=147409940; access_time=0; owner=; group=;
> permission=rw-rw-rw-; isSymlink=false}
> to
> file:/nfspartition/sankar/banking_l1_v2.csv/part-r-0-46a
> 7f178-2490-444e-9110-510978eaaecb.csv
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.m
> ergePaths(FileOutputCommitter.java:371)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.m
> ergePaths(FileOutputCommitter.java:384)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.c
> ommitJob(FileOutputCommitter.java:326)
> at
> org.apache.spark.sql.execution.datasources.BaseWriterContain
> er.commitJob(WriterContainer.scala:222)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
> sRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoo
> pFsRelationCommand.scala:144)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
> sRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRela
> tionCommand.scala:115)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
> sRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRela
> tionCommand.scala:115)
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutio
> nId(SQLExecution.scala:57)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
> sRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.s
> ideEffectResult$lzycompute(commands.scala:60)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.s
> ideEffectResult(commands.scala:58)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.d
> oExecute(commands.scala:74)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.
> apply(SparkPlan.scala:115)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.
> apply(SparkPlan.scala:115)
> at
> 

Re: SPARK-10835 in 2.0

2016-09-20 Thread Sean Owen
You can probably just do an identity transformation on the column to
make its type a nullable String array -- ArrayType(StringType, true).
Of course, I'm not sure why Word2Vec must reject a non-null array type
when it can of course handle nullable, but the previous discussion
indicated that this had to do with how UDFs work too.
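
For anyone looking for the concrete form of that identity transformation, a
minimal sketch (an assumption about one way to do it, not code from this
thread): a Scala UDF returning Seq[String] produces ArrayType(StringType, true),
so passing the column through it relaxes the non-nullable element type that
NGram emits.

import org.apache.spark.sql.functions.{col, udf}

// grams and word2Vec are the DataFrame and estimator from the message below.
val toNullableArray = udf((tokens: Seq[String]) => tokens)  // identity, but nullable element type
val fixed = grams.withColumn("wordsGrams", toNullableArray(col("wordsGrams")))
val model = word2Vec.fit(fixed)                             // schema check now passes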

On Tue, Sep 20, 2016 at 4:03 PM, janardhan shetty
 wrote:
> Hi Sean,
>
> Any suggestions for workaround as of now?
>
> On Sep 20, 2016 7:46 AM, "janardhan shetty"  wrote:
>>
>> Thanks Sean.
>>
>> On Sep 20, 2016 7:45 AM, "Sean Owen"  wrote:
>>>
>>> Ah, I think that this was supposed to be changed with SPARK-9062. Let
>>> me see about reopening 10835 and addressing it.
>>>
>>> On Tue, Sep 20, 2016 at 3:24 PM, janardhan shetty
>>>  wrote:
>>> > Is this a bug?
>>> >
>>> > On Sep 19, 2016 10:10 PM, "janardhan shetty" 
>>> > wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I am hitting this issue.
>>> >> https://issues.apache.org/jira/browse/SPARK-10835.
>>> >>
>>> >> Issue seems to be resolved but resurfacing in 2.0 ML. Any workaround
>>> >> is
>>> >> appreciated ?
>>> >>
>>> >> Note:
>>> >> Pipeline has Ngram before word2Vec.
>>> >>
>>> >> Error:
>>> >> val word2Vec = new
>>> >>
>>> >> Word2Vec().setInputCol("wordsGrams").setOutputCol("features").setVectorSize(128).setMinCount(10)
>>> >>
>>> >> scala> word2Vec.fit(grams)
>>> >> java.lang.IllegalArgumentException: requirement failed: Column
>>> >> wordsGrams
>>> >> must be of type ArrayType(StringType,true) but was actually
>>> >> ArrayType(StringType,false).
>>> >>   at scala.Predef$.require(Predef.scala:224)
>>> >>   at
>>> >>
>>> >> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>>> >>   at
>>> >>
>>> >> org.apache.spark.ml.feature.Word2VecBase$class.validateAndTransformSchema(Word2Vec.scala:111)
>>> >>   at
>>> >>
>>> >> org.apache.spark.ml.feature.Word2Vec.validateAndTransformSchema(Word2Vec.scala:121)
>>> >>   at
>>> >>
>>> >> org.apache.spark.ml.feature.Word2Vec.transformSchema(Word2Vec.scala:187)
>>> >>   at
>>> >> org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
>>> >>   at org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:170)
>>> >>
>>> >>
>>> >> Github code for Ngram:
>>> >>
>>> >>
>>> >> override protected def validateInputType(inputType: DataType): Unit =
>>> >> {
>>> >> require(inputType.sameType(ArrayType(StringType)),
>>> >>   s"Input type must be ArrayType(StringType) but got $inputType.")
>>> >>   }
>>> >>
>>> >>   override protected def outputDataType: DataType = new
>>> >> ArrayType(StringType, false)
>>> >> }
>>> >>
>>> >




RE: LDA and Maximum Iterations

2016-09-20 Thread Yang, Yuhao
Hi Frank,

Which version of Spark are you using? Also, can you share more information about
the exception?

If it’s not confidential, you can send the data sample to me 
(yuhao.y...@intel.com) and I can try to investigate.

Regards,
Yuhao

From: Frank Zhang [mailto:dataminin...@yahoo.com.INVALID]
Sent: Monday, September 19, 2016 9:20 PM
To: user@spark.apache.org
Subject: LDA and Maximum Iterations

Hi all,

   I have a question about parameter settings for the LDA model. When I try to set
a large number like 500 for setMaxIterations, the program always fails. There is a
very straightforward LDA tutorial using an example data set in the mllib
package: http://stackoverflow.com/questions/36631991/latent-dirichlet-allocation-lda-algorithm-not-printing-results-in-spark-scala.
The code is here:

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("/data/mllib/sample_lda_data.txt") // you might need to 
change the path for the data set
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)

But if I change the last line to
val ldaModel = new LDA().setK(3).setMaxIterations(500).run(corpus), the program 
fails.

I greatly appreciate your help!

Best,

Frank





Re: write.df is failing on Spark Cluster

2016-09-20 Thread Sankar Mittapally
I used that one also

On Sep 20, 2016 10:44 PM, "Kevin Mellott"  wrote:

> Instead of *mode="append"*, try *mode="overwrite"*
>
> On Tue, Sep 20, 2016 at 11:30 AM, Sankar Mittapally <sankar.mittapa...@creditvidya.com> wrote:
>
>> Please find the code below.
>>
>> sankar2 <- read.df("/nfspartition/sankar/test/2016/08/test.json")
>>
>> I tried these two commands.
>> write.df(sankar2,"/nfspartition/sankar/test/test.csv","csv",
>> header="true")
>>
>> saveDF(sankar2,"sankartest.csv",source="csv",mode="append",schema="true")
>>
>>
>>
>> On Tue, Sep 20, 2016 at 9:40 PM, Kevin Mellott <kevin.r.mell...@gmail.com> wrote:
>>
>>> Can you please post the line of code that is doing the df.write command?
>>>
>>> On Tue, Sep 20, 2016 at 9:29 AM, Sankar Mittapally <
>>> sankar.mittapa...@creditvidya.com> wrote:
>>>
 Hey Kevin,

 It is an empty directory. It is able to write part files to the
 directory, but while merging those part files we are getting the above error.

 Regards


 On Tue, Sep 20, 2016 at 7:46 PM, Kevin Mellott <
 kevin.r.mell...@gmail.com> wrote:

> Have you checked to see if any files already exist at
> /nfspartition/sankar/banking_l1_v2.csv? If so, you will need to
> delete them before attempting to save your DataFrame to that location.
> Alternatively, you may be able to specify the "mode" setting of the
> df.write operation to "overwrite", depending on the version of Spark you
> are running.
>
> *ERROR (from log)*
> 16/09/17 08:03:28 WARN FileUtil: Failed to delete file or
> dir[/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task
> _201609170802_0013_m_00/.part-r-0-46a7f178-2490-444e
> -9110-510978eaaecb.csv.crc]:
> it still exists.
> 16/09/17 08:03:28 WARN FileUtil: Failed to delete file or
> dir[/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task
> _201609170802_0013_m_00/part-r-0-46a7f178-2490-444e-
> 9110-510978eaaecb.csv]:
> it still exists.
>
> *df.write Documentation*
> http://spark.apache.org/docs/latest/api/R/write.df.html
>
> Thanks,
> Kevin
>
> On Tue, Sep 20, 2016 at 12:16 AM, sankarmittapally <
> sankar.mittapa...@creditvidya.com> wrote:
>
>>  We have set up a Spark cluster on NFS shared storage. There are no
>> permission issues with the NFS storage; all the users are able to write to
>> NFS storage. When I fire the write.df command in SparkR, I get the error
>> below. Can someone please help me to fix this issue?
>>
>>
>> 16/09/17 08:03:28 ERROR InsertIntoHadoopFsRelationCommand: Aborting
>> job.
>> java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus
>> {path=file:/nfspartition/sankar/banking_l1_v2.csv/_temporary
>> /0/task_201609170802_0013_m_00/part-r-0-46a7f178-249
>> 0-444e-9110-510978eaaecb.csv;
>> isDirectory=false; length=436486316; replication=1;
>> blocksize=33554432;
>> modification_time=147409940; access_time=0; owner=; group=;
>> permission=rw-rw-rw-; isSymlink=false}
>> to
>> file:/nfspartition/sankar/banking_l1_v2.csv/part-r-0-46a
>> 7f178-2490-444e-9110-510978eaaecb.csv
>> at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.m
>> ergePaths(FileOutputCommitter.java:371)
>> at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.m
>> ergePaths(FileOutputCommitter.java:384)
>> at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.c
>> ommitJob(FileOutputCommitter.java:326)
>> at
>> org.apache.spark.sql.execution.datasources.BaseWriterContain
>> er.commitJob(WriterContainer.scala:222)
>> at
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
>> sRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoo
>> pFsRelationCommand.scala:144)
>> at
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
>> sRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRela
>> tionCommand.scala:115)
>> at
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
>> sRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRela
>> tionCommand.scala:115)
>> at
>> org.apache.spark.sql.execution.SQLExecution$.withNewExecutio
>> nId(SQLExecution.scala:57)
>> at
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
>> sRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
>> at
>> org.apache.spark.sql.execution.command.ExecutedCommandExec.s
>> ideEffectResult$lzycompute(commands.scala:60)
>> at
>> org.apache.spark.sql.execution.command.ExecutedCommandExec.s
>> ideEffectResult(commands.scala:58)
>> at
>> org.apache.spark.sql.execution.command.ExecutedCommandExec.d
>> oExecute(commands.scala:74)
>> at
>> 

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-20 Thread Peter Figliozzi
Hi Yan, I agree, it IS really confusing.  Here is the technique for
transforming a column.  It is very general because you can make "myConvert"
do whatever you want.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}  // needed for udf() and col() below

val df = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF

df.show()
// The columns were named "_1" and "_2"
// Very confusing, because it looks like a Scala wildcard when we refer to
it in code

val myConvert = (x: String) => { Vectors.parse(x) }
val myConvertUDF = udf(myConvert)

val newDf = df.withColumn("parsed", myConvertUDF(col("_2")))

newDf.show()

On Mon, Sep 19, 2016 at 3:29 AM, 颜发才(Yan Facai)  wrote:

> Hi, all.
> I find that it's really confusing.
>
> I can use Vectors.parse to create a DataFrame contains Vector type.
>
> scala> val dataVec = Seq((0, Vectors.parse("[1,3,5]")), (1,
> Vectors.parse("[2,4,6]"))).toDF
> dataVec: org.apache.spark.sql.DataFrame = [_1: int, _2: vector]
>
>
> But using map to convert String to Vector throws an error:
>
> scala> val dataStr = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
> dataStr: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
>
> scala> dataStr.map(row => Vectors.parse(row.getString(1)))
> :30: error: Unable to find encoder for type stored in a
> Dataset.  Primitive types (Int, String, etc) and Product types (case
> classes) are supported by importing spark.implicits._  Support for
> serializing other types will be added in future releases.
>   dataStr.map(row => Vectors.parse(row.getString(1)))
>
>
> Dose anyone can help me,
> thanks very much!
>
>
>
>
>
>
>
> On Tue, Sep 6, 2016 at 9:58 PM, Peter Figliozzi 
> wrote:
>
>> Hi Yan, I think you'll have to map the features column to a new numerical
>> features column.
>>
>> Here's one way to do the individual transform:
>>
>> scala> val x = "[1, 2, 3, 4, 5]"
>> x: String = [1, 2, 3, 4, 5]
>>
>> scala> val y:Array[Int] = x slice(1, x.length - 1) replace(",", "")
>> split(" ") map(_.toInt)
>> y: Array[Int] = Array(1, 2, 3, 4, 5)
>>
>> If you don't know about the Scala command line, just type "scala" in a
>> terminal window.  It's a good place to try things out.
>>
>> You can make a function out of this transformation and apply it to your
>> features column to make a new column.  Then add this with
>> Dataset.withColumn.
>>
>> See here
>> 
>> on how to apply a function to a Column to make a new column.
>>
>> On Tue, Sep 6, 2016 at 1:56 AM, 颜发才(Yan Facai)  wrote:
>>
>>> Hi,
>>> I have a csv file like:
>>> uid  mid  features   label
>>> 1235231[0, 1, 3, ...]True
>>>
>>> Both  "features" and "label" columns are used for GBTClassifier.
>>>
>>> However, when I read the file:
>>> Dataset<Row> samples = sparkSession.read().csv(file);
>>> The type of samples.select("features") is String.
>>>
>>> My question is:
>>> How to map samples.select("features") to Vector or any appropriate type,
>>> so I can use it to train like:
>>> GBTClassifier gbdt = new GBTClassifier()
>>> .setLabelCol("label")
>>> .setFeaturesCol("features")
>>> .setMaxIter(2)
>>> .setMaxDepth(7);
>>>
>>> Thanks.
>>>
>>
>>
>


Options for method createExternalTable

2016-09-20 Thread CalumAtTheGuardian
Hi, 

I am trying to create an external table WITH partitions using SPARK. 

Currently I am using catalog.createExternalTable(cleanTableName, "ORC",
schema, Map("path" -> s"s3://$location/"))

Does createExternalTable have options that can create a table with
partitions? I assume it would be a key-value in the options argument such as
"partitioned by" -> "date".

There is no documentation at
https://spark.apache.org/docs/preview/api/java/org/apache/spark/sql/catalog/Catalog.html
 
So any help would be appreciated! 

Thanks
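
Not an answer from the list, but a hedged sketch of the usual fallback when the
catalog API does not expose an option: issue the DDL directly through a
Hive-enabled SparkSession. The column names, the partition column dt, and the
partition value are illustrative.

// Assumes spark was built with .enableHiveSupport(), and that cleanTableName
// and location are the same values used in the createExternalTable call above.
spark.sql(s"""
  CREATE EXTERNAL TABLE $cleanTableName (id BIGINT, payload STRING)
  PARTITIONED BY (dt STRING)
  STORED AS ORC
  LOCATION 's3://$location/'
""")
// Partitions then need to be registered explicitly, for example:
spark.sql(s"ALTER TABLE $cleanTableName ADD PARTITION (dt='2016-09-20') " +
  s"LOCATION 's3://$location/dt=2016-09-20'")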







Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
Can you please post the line of code that is doing the df.write command?

On Tue, Sep 20, 2016 at 9:29 AM, Sankar Mittapally <
sankar.mittapa...@creditvidya.com> wrote:

> Hey Kevin,
>
> It is an empty directory. It is able to write part files to the directory,
> but while merging those part files we are getting the above error.
>
> Regards
>
>
> On Tue, Sep 20, 2016 at 7:46 PM, Kevin Mellott 
> wrote:
>
>> Have you checked to see if any files already exist at
>> /nfspartition/sankar/banking_l1_v2.csv? If so, you will need to delete
>> them before attempting to save your DataFrame to that location.
>> Alternatively, you may be able to specify the "mode" setting of the
>> df.write operation to "overwrite", depending on the version of Spark you
>> are running.
>>
>> *ERROR (from log)*
>> 16/09/17 08:03:28 WARN FileUtil: Failed to delete file or
>> dir[/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task
>> _201609170802_0013_m_00/.part-r-0-46a7f178-2490-444
>> e-9110-510978eaaecb.csv.crc]:
>> it still exists.
>> 16/09/17 08:03:28 WARN FileUtil: Failed to delete file or
>> dir[/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task
>> _201609170802_0013_m_00/part-r-0-46a7f178-2490-444
>> e-9110-510978eaaecb.csv]:
>> it still exists.
>>
>> *df.write Documentation*
>> http://spark.apache.org/docs/latest/api/R/write.df.html
>>
>> Thanks,
>> Kevin
>>
>> On Tue, Sep 20, 2016 at 12:16 AM, sankarmittapally <
>> sankar.mittapa...@creditvidya.com> wrote:
>>
>>>  We have set up a Spark cluster on NFS shared storage. There are no
>>> permission issues with the NFS storage; all the users are able to write to
>>> NFS storage. When I fire the write.df command in SparkR, I get the error
>>> below. Can someone please help me to fix this issue?
>>>
>>>
>>> 16/09/17 08:03:28 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
>>> java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus
>>> {path=file:/nfspartition/sankar/banking_l1_v2.csv/_temporary
>>> /0/task_201609170802_0013_m_00/part-r-0-46a7f178-249
>>> 0-444e-9110-510978eaaecb.csv;
>>> isDirectory=false; length=436486316; replication=1; blocksize=33554432;
>>> modification_time=147409940; access_time=0; owner=; group=;
>>> permission=rw-rw-rw-; isSymlink=false}
>>> to
>>> file:/nfspartition/sankar/banking_l1_v2.csv/part-r-0-46a
>>> 7f178-2490-444e-9110-510978eaaecb.csv
>>> at
>>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.m
>>> ergePaths(FileOutputCommitter.java:371)
>>> at
>>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.m
>>> ergePaths(FileOutputCommitter.java:384)
>>> at
>>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.c
>>> ommitJob(FileOutputCommitter.java:326)
>>> at
>>> org.apache.spark.sql.execution.datasources.BaseWriterContain
>>> er.commitJob(WriterContainer.scala:222)
>>> at
>>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
>>> sRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoo
>>> pFsRelationCommand.scala:144)
>>> at
>>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
>>> sRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRela
>>> tionCommand.scala:115)
>>> at
>>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
>>> sRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRela
>>> tionCommand.scala:115)
>>> at
>>> org.apache.spark.sql.execution.SQLExecution$.withNewExecutio
>>> nId(SQLExecution.scala:57)
>>> at
>>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
>>> sRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
>>> at
>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.s
>>> ideEffectResult$lzycompute(commands.scala:60)
>>> at
>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.s
>>> ideEffectResult(commands.scala:58)
>>> at
>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.d
>>> oExecute(commands.scala:74)
>>> at
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.
>>> apply(SparkPlan.scala:115)
>>> at
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.
>>> apply(SparkPlan.scala:115)
>>> at
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQue
>>> ry$1.apply(SparkPlan.scala:136)
>>> at
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperati
>>> onScope.scala:151)
>>> at
>>> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkP
>>> lan.scala:133)
>>> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>>> at
>>> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompu
>>> te(QueryExecution.scala:86)
>>> at
>>> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExe
>>> cution.scala:86)
>>> at
>>> org.apache.spark.sql.execution.datasources.DataSource.write(
>>> DataSource.scala:487)
>>> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>>> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>>> at 

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Sankar Mittapally
Please find the code below.

sankar2 <- read.df("/nfspartition/sankar/test/2016/08/test.json")

I tried these two commands.
write.df(sankar2,"/nfspartition/sankar/test/test.csv","csv",header="true")

saveDF(sankar2,"sankartest.csv",source="csv",mode="append",schema="true")



On Tue, Sep 20, 2016 at 9:40 PM, Kevin Mellott 
wrote:

> Can you please post the line of code that is doing the df.write command?
>
> On Tue, Sep 20, 2016 at 9:29 AM, Sankar Mittapally <sankar.mittapa...@creditvidya.com> wrote:
>
>> Hey Kevin,
>>
>> It is an empty directory. It is able to write part files to the directory,
>> but while merging those part files we are getting the above error.
>>
>> Regards
>>
>>
>> On Tue, Sep 20, 2016 at 7:46 PM, Kevin Mellott <kevin.r.mell...@gmail.com> wrote:
>>
>>> Have you checked to see if any files already exist at
>>> /nfspartition/sankar/banking_l1_v2.csv? If so, you will need to delete
>>> them before attempting to save your DataFrame to that location.
>>> Alternatively, you may be able to specify the "mode" setting of the
>>> df.write operation to "overwrite", depending on the version of Spark you
>>> are running.
>>>
>>> *ERROR (from log)*
>>> 16/09/17 08:03:28 WARN FileUtil: Failed to delete file or
>>> dir[/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task
>>> _201609170802_0013_m_00/.part-r-0-46a7f178-2490-444e
>>> -9110-510978eaaecb.csv.crc]:
>>> it still exists.
>>> 16/09/17 08:03:28 WARN FileUtil: Failed to delete file or
>>> dir[/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task
>>> _201609170802_0013_m_00/part-r-0-46a7f178-2490-444e-
>>> 9110-510978eaaecb.csv]:
>>> it still exists.
>>>
>>> *df.write Documentation*
>>> http://spark.apache.org/docs/latest/api/R/write.df.html
>>>
>>> Thanks,
>>> Kevin
>>>
>>> On Tue, Sep 20, 2016 at 12:16 AM, sankarmittapally <
>>> sankar.mittapa...@creditvidya.com> wrote:
>>>
  We have set up a Spark cluster on NFS shared storage. There are no
 permission issues with the NFS storage; all the users are able to write to
 NFS storage. When I fire the write.df command in SparkR, I get the error
 below. Can someone please help me to fix this issue?


 16/09/17 08:03:28 ERROR InsertIntoHadoopFsRelationCommand: Aborting
 job.
 java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus
 {path=file:/nfspartition/sankar/banking_l1_v2.csv/_temporary
 /0/task_201609170802_0013_m_00/part-r-0-46a7f178-249
 0-444e-9110-510978eaaecb.csv;
 isDirectory=false; length=436486316; replication=1; blocksize=33554432;
 modification_time=147409940; access_time=0; owner=; group=;
 permission=rw-rw-rw-; isSymlink=false}
 to
 file:/nfspartition/sankar/banking_l1_v2.csv/part-r-0-46a
 7f178-2490-444e-9110-510978eaaecb.csv
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.m
 ergePaths(FileOutputCommitter.java:371)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.m
 ergePaths(FileOutputCommitter.java:384)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.c
 ommitJob(FileOutputCommitter.java:326)
 at
 org.apache.spark.sql.execution.datasources.BaseWriterContain
 er.commitJob(WriterContainer.scala:222)
 at
 org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
 sRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoo
 pFsRelationCommand.scala:144)
 at
 org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
 sRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRela
 tionCommand.scala:115)
 at
 org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
 sRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRela
 tionCommand.scala:115)
 at
 org.apache.spark.sql.execution.SQLExecution$.withNewExecutio
 nId(SQLExecution.scala:57)
 at
 org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
 sRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
 at
 org.apache.spark.sql.execution.command.ExecutedCommandExec.s
 ideEffectResult$lzycompute(commands.scala:60)
 at
 org.apache.spark.sql.execution.command.ExecutedCommandExec.s
 ideEffectResult(commands.scala:58)
 at
 org.apache.spark.sql.execution.command.ExecutedCommandExec.d
 oExecute(commands.scala:74)
 at
 org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.
 apply(SparkPlan.scala:115)
 at
 org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.
 apply(SparkPlan.scala:115)
 at
 org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQue
 ry$1.apply(SparkPlan.scala:136)
 at
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperati
 onScope.scala:151)
 at
 org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkP
 lan.scala:133)
 at 

Spark Tasks Taking Increasingly Longer

2016-09-20 Thread Chris Jansen
Hi,

I've been struggling with the reliability of what should, on the face of it, be
a fairly simple job: given a number of events, group them by user, event type
and the week in which they occurred, and aggregate their counts.

The input data is fairly skewed, so I repartition and add a salt to the key;
the aggregation is then performed in two phases to (hopefully) mitigate this.
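
To make the two-phase, salted aggregation concrete, here is a small sketch under
assumed names: events is an RDD keyed by (user, eventType, week) with a count
value, and the salt width of 100 is arbitrary.

import scala.util.Random

// Phase 1: widen the key with a random salt so a heavy key is spread over many tasks.
val salted = events.map { case ((user, eventType, week), count) =>
  ((user, eventType, week, Random.nextInt(100)), count)
}
val partial = salted.reduceByKey(_ + _)

// Phase 2: drop the salt and combine the partial sums into the final counts.
val totals = partial
  .map { case ((user, eventType, week, _), count) => ((user, eventType, week), count) }
  .reduceByKey(_ + _)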

What I see is that the size of the data dealt with by each task is uniform
(170 MB) with a parallelism of 5000 (also tried with 2500); however, tasks take
increasingly longer, starting at a few seconds and growing to 18 min by the end
of the job.

As I am not storing any RDDs, I have tried using the legacy memory management,
giving 70% of the heap memory to execution, with 10% for storage and 20% for
unroll.

Could anyone give me any pointers on what might be causing this? It feels
like a memory leak, but I'm struggling to see where it is coming from.

Many thanks.


Here are some details on my set up, plus I've attached the stats from the
aggregation phase:

Spark version:
2.0 (tried with 1.6.2 also)

Driver:
2 cores 10Gb RAM

Workers:
20 nodes each with 16 cores and 100Gb RAM

Data from map phase:
Input data size: 668.2 GB
Shuffle write size: 831.9 GB


[image: Screen Shot 2016-09-20 at 17.15.12.png]


cassandra and spark can be built and worked on the same computer?

2016-09-20 Thread muhammet pakyürek


Can we connect to Cassandra from Spark using the spark-cassandra-connector when
all three are built on the same computer? What kind of problems does this
configuration lead to?


Task Deserialization Error

2016-09-20 Thread Chawla,Sumit
Hi All

I am trying to test a simple Spark APP using scala.


import org.apache.spark.SparkContext

object SparkDemo {
  def main(args: Array[String]) {
val logFile = "README.md" // Should be some file on your system

// to run in local mode
    val sc = new SparkContext("local", "Simple App",
      "PATH_OF_DIRECTORY_WHERE_COMPILED_SPARK_PROJECT_FROM_GIT")  // Spark home directory

val logData = sc.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()


println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))

  }
}


When running this demo in IntelliJ, i am getting following error:


java.lang.IllegalStateException: unread block data
at 
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2449)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1385)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:253)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


I guess it's associated with the task not being deserializable. Any help
will be appreciated.



Regards
Sumit Chawla


Re: Similar Items

2016-09-20 Thread Nick Pentreath
How many products do you have? How large are your vectors?

It could be that SVD / LSA could be helpful. But if you have many products
then trying to compute all-pair similarity with brute force is not going to
be scalable. In this case you may want to investigate hashing (LSH)
techniques.


On Mon, 19 Sep 2016 at 22:49, Kevin Mellott 
wrote:

> Hi all,
>
> I'm trying to write a Spark application that will detect similar items (in
> this case products) based on their descriptions. I've got an ML pipeline
> that transforms the product data to TF-IDF representation, using the
> following components.
>
>- *RegexTokenizer* - strips out non-word characters, results in a list
>of tokens
>- *StopWordsRemover* - removes common "stop words", such as "the",
>"and", etc.
>- *HashingTF* - assigns a numeric "hash" to each token and calculates
>the term frequency
>- *IDF* - computes the inverse document frequency
>
> After this pipeline evaluates, I'm left with a SparseVector that
> represents the inverse document frequency of tokens for each product. As a
> next step, I'd like to be able to compare each vector to one another, to
> detect similarities.
>
> Does anybody know of a straightforward way to do this in Spark? I tried
> creating a UDF (that used the Breeze linear algebra methods internally);
> however, that did not scale well.
>
> Thanks,
> Kevin
>


java.lang.ClassCastException: optional binary element (UTF8) is not a group

2016-09-20 Thread Rajan, Naveen
Dear All,
My code works fine with JSON input data. When I tried the Parquet data
format, it worked for English data. For Japanese text, I am getting the
stack trace below. Please help!

Caused by: java.lang.ClassCastException: optional binary element (UTF8) is not 
a group
  at org.apache.parquet.schema.Type.asGroupType(Type.java:202)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.org$apache$spark$sql$execution$datasources$parquet$ParquetReadSupport$$clipParquetType(ParquetReadSupport.scala:131)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetListType(ParquetReadSupport.scala:207)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.org$apache$spark$sql$execution$datasources$parquet$ParquetReadSupport$$clipParquetType(ParquetReadSupport.scala:122)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272)
  at scala.Option.map(Option.scala:146)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:272)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:269)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetGroupFields(ParquetReadSupport.scala:269)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetGroup(ParquetReadSupport.scala:252)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.org$apache$spark$sql$execution$datasources$parquet$ParquetReadSupport$$clipParquetType(ParquetReadSupport.scala:131)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272)
  at scala.Option.map(Option.scala:146)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:272)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:269)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetGroupFields(ParquetReadSupport.scala:269)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetSchema(ParquetReadSupport.scala:111)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport.init(ParquetReadSupport.scala:67)
  at 
org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:168)
  at 
org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:192)
  at 
org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:339)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
  at 

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-20 Thread Nick Pentreath
(cc'ing dev list also)

I think a more general version of ranking metrics that allows arbitrary
relevance scores could be useful. Ranking metrics are applicable to other
settings like search or other learning-to-rank use cases, so it should be a
little more generic than pure recommender settings.

The one issue with the proposed implementation is that it is not compatible
with the existing cross-validators within a pipeline.

As I've mentioned on the linked JIRAs & PRs, one option is to create a
special set of cross-validators for recommenders, that address the issues
of (a) dataset splitting specific to recommender settings (user-based
stratified sampling, time-based etc) and (b) ranking-based evaluation.

The other option is to have the ALSModel itself capable of generating the
"ground-truth" set within the same dataframe output from "transform" (ie
predict top k) that can be fed into the cross-validator (with
RankingEvaluator) directly. That's the approach I took so far in
https://github.com/apache/spark/pull/12574.

Both options are valid and have their positives & negatives - open to
comments / suggestions.

On Tue, 20 Sep 2016 at 06:08 Jong Wook Kim  wrote:

> Thanks for the clarification and the relevant links. I overlooked the
> comments explicitly saying that the relevance is binary.
>
> I understand that the label is not a relevance, but I have been, and I
> think many people are using the label as relevance in the implicit feedback
> context where the user-provided exact label is not defined anyway. I think
> that's why RiVal 's using the term
> "preference" for both the label for MAE and the relevance for NDCG.
>
> At the same time, I see why Spark decided to assume the relevance is
> binary, in part to conform to the class RankingMetrics's constructor. I
> think it would be nice if the upcoming DataFrame-based RankingEvaluator can
> be optionally set a "relevance column" that has non-binary relevance
> values, otherwise defaulting to either 1.0 or the label column.
>
> My extended version of RankingMetrics is here:
> https://github.com/jongwook/spark-ranking-metrics . It has a test case
> checking that the numbers are same as RiVal's.
>
> Jong Wook
>
>
>
> On 19 September 2016 at 03:13, Sean Owen  wrote:
>
>> Yes, relevance is always 1. The label is not a relevance score so
>> don't think it's valid to use it as such.
>>
>> On Mon, Sep 19, 2016 at 4:42 AM, Jong Wook Kim  wrote:
>> > Hi,
>> >
>> > I'm trying to evaluate a recommendation model, and found that Spark and
>> > Rival give different results, and it seems that Rival's one is what
>> Kaggle
>> > defines:
>> https://gist.github.com/jongwook/5d4e78290eaef22cb69abbf68b52e597
>> >
>> > Am I using RankingMetrics in a wrong way, or is Spark's implementation
>> > incorrect?
>> >
>> > To my knowledge, NDCG should be dependent on the relevance (or
>> preference)
>> > values, but Spark's implementation seems not; it uses 1.0 where it
>> should be
>> > 2^(relevance) - 1, probably assuming that relevance is all 1.0? I also
>> tried
>> > tweaking, but its method to obtain the ideal DCG also seems wrong.
>> >
>> > Any feedback from MLlib developers would be appreciated. I made a
>> > modified/extended version of RankingMetrics that produces the identical
>> > numbers to Kaggle and Rival's results, and I'm wondering if it is
>> something
>> > appropriate to be added back to MLlib.
>> >
>> > Jong Wook
>>
>
>


spark sql thrift server: driver OOM

2016-09-20 Thread Young
Hi, all:


I'm using the Spark SQL Thrift server under Spark 1.3.1 to run Hive SQL queries. I
started the Thrift server with ./sbin/start-thriftserver.sh --master
yarn-client --num-executors 12 --executor-memory 5g --driver-memory 5g, then
sent continuous Hive SQL queries to the Thrift server.


However, about 20 minutes later, I got an OOM error in the Thrift server log:
"Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler
in thread "sparkDriver-scheduler-1"", and the memory of the Thrift server process
had increased to 5.7g (measured with the top command on the Linux server).


With the help of Google, I found the following patch and added it to
Spark 1.3.1, but it didn't help:
https://github.com/apache/spark/pull/12932/commits/559db12bf0b708d95d5066d4c41220ab493c70c9


The following is  my jmap output:


 num #instances #bytes  class name
--
   1:  21844706 1177481920  [C
   2:  21842972  524231328  java.lang.String
   3:   5429362  311283856  [Ljava.lang.Object;
   4:   3619510  296262792  [Ljava.util.HashMap$Entry;
   5:   8887511  284400352  java.util.HashMap$Entry
   6:   3618802  202652912  java.util.HashMap
   7: 51304  150483664  [B
   8:   5421523  130116552  java.util.ArrayList
   9:   4514237  108341688  
org.apache.hadoop.hive.metastore.api.FieldSchema
  10:   3611642   57786272  java.util.HashMap$EntrySet


So, how can I solve this problem and let my Spark SQL Thrift server run longer,
other than asking for a larger driver memory? Has someone encountered the same
situation?


Sincerely,
Young



Re: is there any bug for the configuration of spark 2.0 cassandra spark connector 2.0 and cassandra 3.0.8

2016-09-20 Thread Todd Nist
These types of questions would be better asked on the user mailing list for
the Spark Cassandra connector:

http://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user

Version compatibility can be found here:

https://github.com/datastax/spark-cassandra-connector#version-compatibility

The JIRA, https://datastax-oss.atlassian.net/browse/SPARKC/, does not seem
to show any outstanding issues with regards to 3.0.8 and 2.0 of Spark or
Spark Cassandra Connector.

HTH.

-Todd


On Tue, Sep 20, 2016 at 1:47 AM, muhammet pakyürek 
wrote:

>
>
> please tell me the configuration including the most recent version of
> cassandra, spark and cassandra spark connector
>


Dataset doesn't have partitioner after a repartition on one of the columns

2016-09-20 Thread McBeath, Darin W (ELS-STL)
I'm using Spark 2.0.

I've created a dataset from a parquet file, repartitioned it on one of the
columns (docId), and persisted the repartitioned dataset.

val om = ds.repartition($"docId").persist(StorageLevel.MEMORY_AND_DISK)

When I try to confirm the partitioner, with

om.rdd.partitioner

I get

Option[org.apache.spark.Partitioner] = None

I would have thought it would be HashPartitioner.

Does anyone know why this would be None and not HashPartitioner?

Thanks.

Darin.




Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
Are you able to manually delete the folder below? I'm wondering if there is
some sort of non-Spark factor involved (permissions, etc).

/nfspartition/sankar/banking_l1_v2.csv

On Tue, Sep 20, 2016 at 12:19 PM, Sankar Mittapally <
sankar.mittapa...@creditvidya.com> wrote:

> I used that one also
>
> On Sep 20, 2016 10:44 PM, "Kevin Mellott" 
> wrote:
>
>> Instead of *mode="append"*, try *mode="overwrite"*
>>
>> On Tue, Sep 20, 2016 at 11:30 AM, Sankar Mittapally <
>> sankar.mittapa...@creditvidya.com> wrote:
>>
>>> Please find the code below.
>>>
>>> sankar2 <- read.df("/nfspartition/sankar/test/2016/08/test.json")
>>>
>>> I tried these two commands.
>>> write.df(sankar2,"/nfspartition/sankar/test/test.csv","csv",
>>> header="true")
>>>
>>> saveDF(sankar2,"sankartest.csv",source="csv",mode="append",schema="true")
>>>
>>>
>>>
>>> On Tue, Sep 20, 2016 at 9:40 PM, Kevin Mellott <
>>> kevin.r.mell...@gmail.com> wrote:
>>>
 Can you please post the line of code that is doing the df.write command?

 On Tue, Sep 20, 2016 at 9:29 AM, Sankar Mittapally <
 sankar.mittapa...@creditvidya.com> wrote:

> Hey Kevin,
>
> It is an empty directory. It is able to write part files to the
> directory, but while merging those part files we are getting the above error.
>
> Regards
>
>
> On Tue, Sep 20, 2016 at 7:46 PM, Kevin Mellott <
> kevin.r.mell...@gmail.com> wrote:
>
>> Have you checked to see if any files already exist at
>> /nfspartition/sankar/banking_l1_v2.csv? If so, you will need to
>> delete them before attempting to save your DataFrame to that location.
>> Alternatively, you may be able to specify the "mode" setting of the
>> df.write operation to "overwrite", depending on the version of Spark you
>> are running.
>>
>> *ERROR (from log)*
>> 16/09/17 08:03:28 WARN FileUtil: Failed to delete file or
>> dir[/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task
>> _201609170802_0013_m_00/.part-r-0-46a7f178-2490-444e
>> -9110-510978eaaecb.csv.crc]:
>> it still exists.
>> 16/09/17 08:03:28 WARN FileUtil: Failed to delete file or
>> dir[/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task
>> _201609170802_0013_m_00/part-r-0-46a7f178-2490-444e-
>> 9110-510978eaaecb.csv]:
>> it still exists.
>>
>> *df.write Documentation*
>> http://spark.apache.org/docs/latest/api/R/write.df.html
>>
>> Thanks,
>> Kevin
>>
>> On Tue, Sep 20, 2016 at 12:16 AM, sankarmittapally <
>> sankar.mittapa...@creditvidya.com> wrote:
>>
>>>  We have set up a Spark cluster on NFS shared storage. There are no
>>> permission issues with the NFS storage; all the users are able to write to
>>> NFS storage. When I fire the write.df command in SparkR, I get the error
>>> below. Can someone please help me to fix this issue?
>>>
>>>
>>> 16/09/17 08:03:28 ERROR InsertIntoHadoopFsRelationCommand: Aborting
>>> job.
>>> java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus
>>> {path=file:/nfspartition/sankar/banking_l1_v2.csv/_temporary
>>> /0/task_201609170802_0013_m_00/part-r-0-46a7f178-249
>>> 0-444e-9110-510978eaaecb.csv;
>>> isDirectory=false; length=436486316; replication=1;
>>> blocksize=33554432;
>>> modification_time=147409940; access_time=0; owner=; group=;
>>> permission=rw-rw-rw-; isSymlink=false}
>>> to
>>> file:/nfspartition/sankar/banking_l1_v2.csv/part-r-0-46a
>>> 7f178-2490-444e-9110-510978eaaecb.csv
>>> at
>>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.m
>>> ergePaths(FileOutputCommitter.java:371)
>>> at
>>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.m
>>> ergePaths(FileOutputCommitter.java:384)
>>> at
>>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.c
>>> ommitJob(FileOutputCommitter.java:326)
>>> at
>>> org.apache.spark.sql.execution.datasources.BaseWriterContain
>>> er.commitJob(WriterContainer.scala:222)
>>> at
>>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
>>> sRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoo
>>> pFsRelationCommand.scala:144)
>>> at
>>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
>>> sRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRela
>>> tionCommand.scala:115)
>>> at
>>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
>>> sRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRela
>>> tionCommand.scala:115)
>>> at
>>> org.apache.spark.sql.execution.SQLExecution$.withNewExecutio
>>> nId(SQLExecution.scala:57)
>>> at
>>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
>>> sRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
>>> 

Re: Similar Items

2016-09-20 Thread Nick Pentreath
A few options include:

https://github.com/marufaytekin/lsh-spark - I've used this a bit and it
seems quite scalable too from what I've looked at.
https://github.com/soundcloud/cosine-lsh-join-spark - not used this but
looks like it should do exactly what you need.
https://github.com/mrsqueeze/spark-hash



On Tue, 20 Sep 2016 at 18:06 Kevin Mellott 
wrote:

> Thanks for the reply, Nick! I'm typically analyzing around 30-50K products
> at a time (as an isolated set of products). Within this set of products
> (which represents all products for a particular supplier), I am also
> analyzing each category separately. The largest categories typically have
> around 10K products.
>
> That being said, when calculating IDFs for the 10K product set we come out
> with roughly 12K unique tokens. In other words, our vectors are 12K columns
> wide (although they are being represented using SparseVectors). We have a
> step that is attempting to locate all documents that share the same tokens,
> and for those items we will calculate the cosine similarity. However, the
> part that attempts to identify documents with shared tokens is the
> bottleneck.
>
> For this portion, we map our data down to the individual tokens contained
> by each document. For example:
>
> DocumentId   |   Description
> -------------|----------------------
> 1            |   Easton Hockey Stick
> 2            |   Bauer Hockey Gloves
>
> In this case, we'd map to the following:
>
> (1, 'Easton')
> (1, 'Hockey')
> (1, 'Stick')
> (2, 'Bauer')
> (2, 'Hockey')
> (2, 'Gloves')
>
> Our goal is to aggregate this data as follows; however, our code that
> currently does this does not perform well. In the realistic 12K product
> scenario, this resulted in 430K document/token tuples.
>
> ((1, 2), ['Hockey'])
>
> This then tells us that documents 1 and 2 need to be compared to one
> another (via cosine similarity) because they both contain the token
> 'hockey'. I will investigate the methods that you recommended to see if
> they may resolve our problem.
>
> Thanks,
> Kevin
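
A compact sketch of the candidate-pairing step described above (the step
identified as the bottleneck), with assumed names: tokensByDoc stands for an RDD
of the (DocumentId, token) pairs in the example.

// tokensByDoc: RDD[(Long, String)] of (documentId, token) pairs
val sharedTokenPairs = tokensByDoc
  .map { case (docId, token) => (token, docId) }
  .groupByKey()
  .flatMap { case (token, docIds) =>
    val ids = docIds.toSeq.distinct.sorted
    for (a <- ids; b <- ids if a < b) yield ((a, b), token)
  }
  .groupByKey()   // ((doc1, doc2), shared tokens), e.g. ((1, 2), Seq("Hockey"))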
>
> On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath 
> wrote:
>
>> How many products do you have? How large are your vectors?
>>
>> It could be that SVD / LSA could be helpful. But if you have many
>> products then trying to compute all-pair similarity with brute force is not
>> going to be scalable. In this case you may want to investigate hashing
>> (LSH) techniques.
>>
>>
>> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott 
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm trying to write a Spark application that will detect similar items
>>> (in this case products) based on their descriptions. I've got an ML
>>> pipeline that transforms the product data to TF-IDF representation, using
>>> the following components.
>>>
>>>- *RegexTokenizer* - strips out non-word characters, results in a
>>>list of tokens
>>>- *StopWordsRemover* - removes common "stop words", such as "the",
>>>"and", etc.
>>>- *HashingTF* - assigns a numeric "hash" to each token and
>>>calculates the term frequency
>>>- *IDF* - computes the inverse document frequency
>>>
>>> After this pipeline evaluates, I'm left with a SparseVector that
>>> represents the inverse document frequency of tokens for each product. As a
>>> next step, I'd like to be able to compare each vector to one another, to
>>> detect similarities.
>>>
>>> Does anybody know of a straightforward way to do this in Spark? I tried
>>> creating a UDF (that used the Breeze linear algebra methods internally);
>>> however, that did not scale well.
>>>
>>> Thanks,
>>> Kevin
>>>
>>
>


Re: LDA and Maximum Iterations

2016-09-20 Thread Frank Zhang
Hi Yuhao,
   Thank you so much for your great contribution to the LDA and other Spark
modules!

   I use both Spark 1.6.2 and 2.0.0. The data I used originally is very large,
with tens of millions of documents, but for test purposes the data set I
mentioned earlier ("/data/mllib/sample_lda_data.txt") is good enough. Please
change the path to point to the data set in your Spark installation and run
these lines:

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors
// please change the path for the data set below:
val data = sc.textFile("/data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
val ldaModel = new LDA().setK(3).run(corpus)

   It should work. After that, please run:

val ldaModel = new LDA().setK(3).setMaxIterations(500).run(corpus)

   When I ran it, at job #90 that iteration took an extremely long time and then
it stopped with an exception:
Active Jobs (1)

| Job Id | Description                    | Submitted           | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total |
| 90     | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:30 | 22 s     | 0/269                   | 0/538                                    |

Completed Jobs (90)

| Job Id | Description                    | Submitted           | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total |
| 89     | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:30 | 43 ms    | 4/4 (262 skipped)       | 8/8 (524 skipped)                        |
| 88     | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:30 | 40 ms    | 4/4 (259 skipped)       | 8/8 (518 skipped)                        |
| 87     | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:29 | 80 ms    | 4/4 (256 skipped)       | 8/8 (512 skipped)                        |
| 86     | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:29 | 41 ms    | 4/4 (253 skipped)       | 8/8 (506 skipped)                        |

   Part of the error message:

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1934)
  at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1046)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
  at org.apache.spark.rdd.RDD.fold(RDD.scala:1040)
  at org.apache.spark.mllib.clustering.EMLDAOptimizer.computeGlobalTopicTotals(LDAOptimizer.scala:226)
  at org.apache.spark.mllib.clustering.EMLDAOptimizer.next(LDAOptimizer.scala:213)
  at org.apache.spark.mllib.clustering.EMLDAOptimizer.next(LDAOptimizer.scala:79)
  at org.apache.spark.mllib.clustering.LDA.run(LDA.scala:334)
  ... 48 elided
Caused by: java.lang.StackOverflowError
  at java.lang.reflect.InvocationTargetException.<init>(InvocationTargetException.java:72)
  at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)

   Thank you so much!
   Frank
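
A common way to work around a StackOverflowError coming from the very long
lineage that EM LDA builds up over many iterations is to enable checkpointing.
A minimal sketch, assuming a reachable checkpoint directory (the path below is
only a placeholder):

sc.setCheckpointDir("/tmp/spark-checkpoints")  // placeholder; any reliable storage works
val ldaModel = new LDA()
  .setK(3)
  .setMaxIterations(500)
  .setCheckpointInterval(10)  // checkpoint every 10 iterations to keep the lineage short
  .run(corpus)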


  From: "Yang, Yuhao" 
 To: Frank Zhang ; "user@spark.apache.org" 
 
 Sent: Tuesday, September 20, 2016 9:49 AM
 Subject: RE: LDA and Maximum Iterations
  

Re: Sending extraJavaOptions for Spark 1.6.1 on mesos 0.28.2 in cluster mode

2016-09-20 Thread Michael Gummelt
Probably this: https://issues.apache.org/jira/browse/SPARK-13258

As described in the JIRA, workaround is to use SPARK_JAVA_OPTS

On Mon, Sep 19, 2016 at 5:07 PM, sagarcasual . 
wrote:

> Hello,
> I have my Spark application running in cluster mode in CDH with
> extraJavaOptions.
> However, when I attempt to run the same application with Apache Mesos, it
> does not recognize the properties below at all, and the code that reads
> them returns null.
>
> --conf spark.driver.extraJavaOptions=-Dsome.url=http://some-url \
> --conf spark.executor.extraJavaOptions=-Dsome.url=http://some-url
>
> I tried option specified in http://stackoverflow.com/
> questions/35872093/missing-java-system-properties-when-
> running-spark-streaming-on-mesos-cluster?noredirect=1=1
>
> and still got no change in the result.
>
> Any idea how to achieve this in Mesos?
>
> -Regards
> Sagar
>



-- 
Michael Gummelt
Software Engineer
Mesosphere
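
A rough sketch of that workaround (the system property is the example value
from the post above; where the variable is set depends on how the driver and
executors are launched, e.g. conf/spark-env.sh):

export SPARK_JAVA_OPTS="-Dsome.url=http://some-url"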


Re: Similar Items

2016-09-20 Thread Kevin Mellott
Thanks Nick - those examples will help a ton!!

On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath 
wrote:

> A few options include:
>
> https://github.com/marufaytekin/lsh-spark - I've used this a bit and it
> seems quite scalable too from what I've looked at.
> https://github.com/soundcloud/cosine-lsh-join-spark - not used this but
> looks like it should do exactly what you need.
> https://github.com/mrsqueeze/*spark*-hash
> 
>
>
> On Tue, 20 Sep 2016 at 18:06 Kevin Mellott 
> wrote:
>
>> Thanks for the reply, Nick! I'm typically analyzing around 30-50K
>> products at a time (as an isolated set of products). Within this set of
>> products (which represents all products for a particular supplier), I am
>> also analyzing each category separately. The largest categories typically
>> have around 10K products.
>>
>> That being said, when calculating IDFs for the 10K product set we come
>> out with roughly 12K unique tokens. In other words, our vectors are 12K
>> columns wide (although they are being represented using SparseVectors). We
>> have a step that is attempting to locate all documents that share the same
>> tokens, and for those items we will calculate the cosine similarity.
>> However, the part that attempts to identify documents with shared tokens is
>> the bottleneck.
>>
>> For this portion, we map our data down to the individual tokens contained
>> by each document. For example:
>>
>> DocumentId   |   Description
>> 
>> 
>> 1   Easton Hockey Stick
>> 2   Bauer Hockey Gloves
>>
>> In this case, we'd map to the following:
>>
>> (1, 'Easton')
>> (1, 'Hockey')
>> (1, 'Stick')
>> (2, 'Bauer')
>> (2, 'Hockey')
>> (2, 'Gloves')
>>
>> Our goal is to aggregate this data as follows; however, our code that
>> currently does this does not perform well. In the realistic 12K-product
>> scenario, this resulted in 430K document/token tuples.
>>
>> ((1, 2), ['Hockey'])
>>
>> This then tells us that documents 1 and 2 need to be compared to one
>> another (via cosine similarity) because they both contain the token
>> 'hockey'. I will investigate the methods that you recommended to see if
>> they may resolve our problem.
>>
>> Thanks,
>> Kevin
>>
>> On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath > > wrote:
>>
>>> How many products do you have? How large are your vectors?
>>>
>>> It could be that SVD / LSA could be helpful. But if you have many
>>> products then trying to compute all-pair similarity with brute force is not
>>> going to be scalable. In this case you may want to investigate hashing
>>> (LSH) techniques.
>>>
>>>
>>> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott 
>>> wrote:
>>>
 Hi all,

 I'm trying to write a Spark application that will detect similar items
 (in this case products) based on their descriptions. I've got an ML
 pipeline that transforms the product data to TF-IDF representation, using
 the following components.

- *RegexTokenizer* - strips out non-word characters, results in a
list of tokens
- *StopWordsRemover* - removes common "stop words", such as "the",
"and", etc.
- *HashingTF* - assigns a numeric "hash" to each token and
calculates the term frequency
- *IDF* - computes the inverse document frequency

 After this pipeline evaluates, I'm left with a SparseVector that
 represents the inverse document frequency of tokens for each product. As a
 next step, I'd like to be able to compare each vector to one another, to
 detect similarities.

 Does anybody know of a straightforward way to do this in Spark? I tried
 creating a UDF (that used the Breeze linear algebra methods internally);
 however, that did not scale well.

 Thanks,
 Kevin

>>>
>>
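
For concreteness, a rough sketch of the candidate-pair step described above,
assuming docs is an RDD[(Long, Seq[String])] of (documentId, tokens) -- the
name and shape are hypothetical. Note that a token shared by many documents
makes the pair list explode, which is exactly the bottleneck being discussed:

val tokenToDoc = docs.flatMap { case (id, tokens) => tokens.distinct.map(t => (t, id)) }
val candidatePairs = tokenToDoc
  .groupByKey()
  .flatMap { case (token, ids) =>
    val sorted = ids.toSeq.distinct.sorted
    for (i <- sorted; j <- sorted if i < j) yield ((i, j), token)
  }
  .groupByKey()  // ((docA, docB), shared tokens) -- the pairs to score with cosine similarity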


Re: Similar Items

2016-09-20 Thread Peter Figliozzi
Related question: is there anything that does scalable matrix
multiplication on Spark?  For example, we have that long list of vectors
and want to construct the similarity matrix:  v * T(v).  In R it would be: v
%*% t(v)
Thanks,
Pete



On Mon, Sep 19, 2016 at 3:49 PM, Kevin Mellott 
wrote:

> Hi all,
>
> I'm trying to write a Spark application that will detect similar items (in
> this case products) based on their descriptions. I've got an ML pipeline
> that transforms the product data to TF-IDF representation, using the
> following components.
>
>- *RegexTokenizer* - strips out non-word characters, results in a list
>of tokens
>- *StopWordsRemover* - removes common "stop words", such as "the",
>"and", etc.
>- *HashingTF* - assigns a numeric "hash" to each token and calculates
>the term frequency
>- *IDF* - computes the inverse document frequency
>
> After this pipeline evaluates, I'm left with a SparseVector that
> represents the inverse document frequency of tokens for each product. As a
> next step, I'd like to be able to compare each vector to one another, to
> detect similarities.
>
> Does anybody know of a straightforward way to do this in Spark? I tried
> creating a UDF (that used the Breeze linear algebra methods internally);
> however, that did not scale well.
>
> Thanks,
> Kevin
>
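
MLlib's distributed matrix API is one built-in option for this kind of
computation. A minimal sketch, assuming vecs is an RDD[Vector] holding the row
vectors (the name is not from the thread):

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

// v %*% t(v) via distributed block matrices
val indexed = new IndexedRowMatrix(vecs.zipWithIndex.map { case (v, i) => IndexedRow(i, v) })
val blocks  = indexed.toBlockMatrix().cache()
val gram    = blocks.multiply(blocks.transpose)  // n x n result, still distributed

// cosine similarities between *columns* via DIMSUM sampling; transpose first
// if the vectors of interest are rows rather than columns
val colSims = new RowMatrix(vecs).columnSimilarities(0.1)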


Re: Similar Items

2016-09-20 Thread Kevin Mellott
Using the Soundcloud implementation of LSH, I was able to process a 22K
product dataset in a mere 65 seconds! Thanks so much for the help!

On Tue, Sep 20, 2016 at 1:15 PM, Kevin Mellott 
wrote:

> Thanks Nick - those examples will help a ton!!
>
> On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath  > wrote:
>
>> A few options include:
>>
>> https://github.com/marufaytekin/lsh-spark - I've used this a bit and it
>> seems quite scalable too from what I've looked at.
>> https://github.com/soundcloud/cosine-lsh-join-spark - not used this but
>> looks like it should do exactly what you need.
>> https://github.com/mrsqueeze/*spark*-hash
>> 
>>
>>
>> On Tue, 20 Sep 2016 at 18:06 Kevin Mellott 
>> wrote:
>>
>>> Thanks for the reply, Nick! I'm typically analyzing around 30-50K
>>> products at a time (as an isolated set of products). Within this set of
>>> products (which represents all products for a particular supplier), I am
>>> also analyzing each category separately. The largest categories typically
>>> have around 10K products.
>>>
>>> That being said, when calculating IDFs for the 10K product set we come
>>> out with roughly 12K unique tokens. In other words, our vectors are 12K
>>> columns wide (although they are being represented using SparseVectors). We
>>> have a step that is attempting to locate all documents that share the same
>>> tokens, and for those items we will calculate the cosine similarity.
>>> However, the part that attempts to identify documents with shared tokens is
>>> the bottleneck.
>>>
>>> For this portion, we map our data down to the individual tokens
>>> contained by each document. For example:
>>>
>>> DocumentId   |   Description
>>> 
>>> 
>>> 1   Easton Hockey Stick
>>> 2   Bauer Hockey Gloves
>>>
>>> In this case, we'd map to the following:
>>>
>>> (1, 'Easton')
>>> (1, 'Hockey')
>>> (1, 'Stick')
>>> (2, 'Bauer')
>>> (2, 'Hockey')
>>> (2, 'Gloves')
>>>
>>> Our goal is to aggregate this data as follows; however, our code that
>>> currently does this does not perform well. In the realistic 12K-product
>>> scenario, this resulted in 430K document/token tuples.
>>>
>>> ((1, 2), ['Hockey'])
>>>
>>> This then tells us that documents 1 and 2 need to be compared to one
>>> another (via cosine similarity) because they both contain the token
>>> 'hockey'. I will investigate the methods that you recommended to see if
>>> they may resolve our problem.
>>>
>>> Thanks,
>>> Kevin
>>>
>>> On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
 How many products do you have? How large are your vectors?

 It could be that SVD / LSA could be helpful. But if you have many
 products then trying to compute all-pair similarity with brute force is not
 going to be scalable. In this case you may want to investigate hashing
 (LSH) techniques.


 On Mon, 19 Sep 2016 at 22:49, Kevin Mellott 
 wrote:

> Hi all,
>
> I'm trying to write a Spark application that will detect similar items
> (in this case products) based on their descriptions. I've got an ML
> pipeline that transforms the product data to TF-IDF representation, using
> the following components.
>
>- *RegexTokenizer* - strips out non-word characters, results in a
>list of tokens
>- *StopWordsRemover* - removes common "stop words", such as "the",
>"and", etc.
>- *HashingTF* - assigns a numeric "hash" to each token and
>calculates the term frequency
>- *IDF* - computes the inverse document frequency
>
> After this pipeline evaluates, I'm left with a SparseVector that
> represents the inverse document frequency of tokens for each product. As a
> next step, I'd like to be able to compare each vector to one another, to
> detect similarities.
>
> Does anybody know of a straightforward way to do this in Spark? I
> tried creating a UDF (that used the Breeze linear algebra methods
> internally); however, that did not scale well.
>
> Thanks,
> Kevin
>

>>>
>


Re: write.df is failing on Spark Cluster

2016-09-20 Thread Sankar Mittapally
Yeah I can do all operations on that folder

On Sep 21, 2016 12:15 AM, "Kevin Mellott"  wrote:

> Are you able to manually delete the folder below? I'm wondering if there
> is some sort of non-Spark factor involved (permissions, etc).
>
> /nfspartition/sankar/banking_l1_v2.csv
>
> On Tue, Sep 20, 2016 at 12:19 PM, Sankar Mittapally  creditvidya.com> wrote:
>
>> I used that one also
>>
>> On Sep 20, 2016 10:44 PM, "Kevin Mellott" 
>> wrote:
>>
>>> Instead of *mode="append"*, try *mode="overwrite"*
>>>
>>> On Tue, Sep 20, 2016 at 11:30 AM, Sankar Mittapally <
>>> sankar.mittapa...@creditvidya.com> wrote:
>>>
 Please find the code below.

 sankar2 <- read.df("/nfspartition/sankar/test/2016/08/test.json")

 I tried these two commands.
 write.df(sankar2,"/nfspartition/sankar/test/test.csv","csv",
 header="true")

 saveDF(sankar2,"sankartest.csv",source="csv",mode="append",schema="true")



 On Tue, Sep 20, 2016 at 9:40 PM, Kevin Mellott <
 kevin.r.mell...@gmail.com> wrote:

> Can you please post the line of code that is doing the df.write
> command?
>
> On Tue, Sep 20, 2016 at 9:29 AM, Sankar Mittapally <
> sankar.mittapa...@creditvidya.com> wrote:
>
>> Hey Kevin,
>>
>> It is an empty directory. It is able to write part files to the
>> directory, but while merging those part files we are getting the above error.
>>
>> Regards
>>
>>
>> On Tue, Sep 20, 2016 at 7:46 PM, Kevin Mellott <
>> kevin.r.mell...@gmail.com> wrote:
>>
>>> Have you checked to see if any files already exist at
>>> /nfspartition/sankar/banking_l1_v2.csv? If so, you will need to
>>> delete them before attempting to save your DataFrame to that location.
>>> Alternatively, you may be able to specify the "mode" setting of the
>>> df.write operation to "overwrite", depending on the version of Spark you
>>> are running.
>>>
>>> *ERROR (from log)*
>>> 16/09/17 08:03:28 WARN FileUtil: Failed to delete file or
>>> dir[/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task
>>> _201609170802_0013_m_00/.part-r-0-46a7f178-2490-444e
>>> -9110-510978eaaecb.csv.crc]:
>>> it still exists.
>>> 16/09/17 08:03:28 WARN FileUtil: Failed to delete file or
>>> dir[/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task
>>> _201609170802_0013_m_00/part-r-0-46a7f178-2490-444e-
>>> 9110-510978eaaecb.csv]:
>>> it still exists.
>>>
>>> *df.write Documentation*
>>> http://spark.apache.org/docs/latest/api/R/write.df.html
>>>
>>> Thanks,
>>> Kevin
>>>
>>> On Tue, Sep 20, 2016 at 12:16 AM, sankarmittapally <
>>> sankar.mittapa...@creditvidya.com> wrote:
>>>
  We have setup a spark cluster which is on NFS shared storage,
 there is no
 permission issues with NFS storage, all the users are able to write
 to NFS
 storage. When I fired write.df command in SparkR, I am getting
 below. Can
 some one please help me to fix this issue.


 16/09/17 08:03:28 ERROR InsertIntoHadoopFsRelationCommand:
 Aborting job.
 java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus
 {path=file:/nfspartition/sankar/banking_l1_v2.csv/_temporary
 /0/task_201609170802_0013_m_00/part-r-0-46a7f178-249
 0-444e-9110-510978eaaecb.csv;
 isDirectory=false; length=436486316; replication=1;
 blocksize=33554432;
 modification_time=147409940; access_time=0; owner=; group=;
 permission=rw-rw-rw-; isSymlink=false}
 to
 file:/nfspartition/sankar/banking_l1_v2.csv/part-r-0-46a
 7f178-2490-444e-9110-510978eaaecb.csv
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.m
 ergePaths(FileOutputCommitter.java:371)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.m
 ergePaths(FileOutputCommitter.java:384)
 at
 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.c
 ommitJob(FileOutputCommitter.java:326)
 at
 org.apache.spark.sql.execution.datasources.BaseWriterContain
 er.commitJob(WriterContainer.scala:222)
 at
 org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
 sRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoo
 pFsRelationCommand.scala:144)
 at
 org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
 sRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRela
 tionCommand.scala:115)
 at
 org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
 sRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRela
 tionCommand.scala:115)
 at
 

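For reference, a minimal SparkR sketch of the overwrite suggestion from this
thread, reusing the sankar2 DataFrame and paths from the post above:

write.df(sankar2, path = "/nfspartition/sankar/test/test.csv", source = "csv",
         mode = "overwrite", header = "true")
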
Re: write.df is failing on Spark Cluster

2016-09-20 Thread Divya Gehlot
Spark version plz ?

On 21 September 2016 at 09:46, Sankar Mittapally <
sankar.mittapa...@creditvidya.com> wrote:

> Yeah I can do all operations on that folder
>
> On Sep 21, 2016 12:15 AM, "Kevin Mellott" 
> wrote:
>
>> Are you able to manually delete the folder below? I'm wondering if there
>> is some sort of non-Spark factor involved (permissions, etc).
>>
>> /nfspartition/sankar/banking_l1_v2.csv
>>
>> On Tue, Sep 20, 2016 at 12:19 PM, Sankar Mittapally <
>> sankar.mittapa...@creditvidya.com> wrote:
>>
>>> I used that one also
>>>
>>> On Sep 20, 2016 10:44 PM, "Kevin Mellott" 
>>> wrote:
>>>
 Instead of *mode="append"*, try *mode="overwrite"*

 On Tue, Sep 20, 2016 at 11:30 AM, Sankar Mittapally <
 sankar.mittapa...@creditvidya.com> wrote:

> Please find the code below.
>
> sankar2 <- read.df("/nfspartition/sankar/test/2016/08/test.json")
>
> I tried these two commands.
> write.df(sankar2,"/nfspartition/sankar/test/test.csv","csv",
> header="true")
>
> saveDF(sankar2,"sankartest.csv",source="csv",mode="append",schema="true")
>
>
>
> On Tue, Sep 20, 2016 at 9:40 PM, Kevin Mellott <
> kevin.r.mell...@gmail.com> wrote:
>
>> Can you please post the line of code that is doing the df.write
>> command?
>>
>> On Tue, Sep 20, 2016 at 9:29 AM, Sankar Mittapally <
>> sankar.mittapa...@creditvidya.com> wrote:
>>
>>> Hey Kevin,
>>>
>>> It is an empty directory. It is able to write part files to the
>>> directory, but while merging those part files we are getting the above error.
>>>
>>> Regards
>>>
>>>
>>> On Tue, Sep 20, 2016 at 7:46 PM, Kevin Mellott <
>>> kevin.r.mell...@gmail.com> wrote:
>>>
 Have you checked to see if any files already exist at
 /nfspartition/sankar/banking_l1_v2.csv? If so, you will need to
 delete them before attempting to save your DataFrame to that location.
 Alternatively, you may be able to specify the "mode" setting of the
 df.write operation to "overwrite", depending on the version of Spark 
 you
 are running.

 *ERROR (from log)*
 16/09/17 08:03:28 WARN FileUtil: Failed to delete file or
 dir[/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task
 _201609170802_0013_m_00/.part-r-0-46a7f178-2490-444e
 -9110-510978eaaecb.csv.crc]:
 it still exists.
 16/09/17 08:03:28 WARN FileUtil: Failed to delete file or
 dir[/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task
 _201609170802_0013_m_00/part-r-0-46a7f178-2490-444e-
 9110-510978eaaecb.csv]:
 it still exists.

 *df.write Documentation*
 http://spark.apache.org/docs/latest/api/R/write.df.html

 Thanks,
 Kevin

 On Tue, Sep 20, 2016 at 12:16 AM, sankarmittapally <
 sankar.mittapa...@creditvidya.com> wrote:

>  We have setup a spark cluster which is on NFS shared storage,
> there is no
> permission issues with NFS storage, all the users are able to
> write to NFS
> storage. When I fired write.df command in SparkR, I am getting
> below. Can
> some one please help me to fix this issue.
>
>
> 16/09/17 08:03:28 ERROR InsertIntoHadoopFsRelationCommand:
> Aborting job.
> java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus
> {path=file:/nfspartition/sankar/banking_l1_v2.csv/_temporary
> /0/task_201609170802_0013_m_00/part-r-0-46a7f178-249
> 0-444e-9110-510978eaaecb.csv;
> isDirectory=false; length=436486316; replication=1;
> blocksize=33554432;
> modification_time=147409940; access_time=0; owner=; group=;
> permission=rw-rw-rw-; isSymlink=false}
> to
> file:/nfspartition/sankar/banking_l1_v2.csv/part-r-0-46a
> 7f178-2490-444e-9110-510978eaaecb.csv
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.m
> ergePaths(FileOutputCommitter.java:371)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.m
> ergePaths(FileOutputCommitter.java:384)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.c
> ommitJob(FileOutputCommitter.java:326)
> at
> org.apache.spark.sql.execution.datasources.BaseWriterContain
> er.commitJob(WriterContainer.scala:222)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
> sRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoo
> pFsRelationCommand.scala:144)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
> 

Israel Spark Meetup

2016-09-20 Thread Romi Kuntsman
Hello,
Please add a link in Spark Community page (
https://spark.apache.org/community.html)
To Israel Spark Meetup (https://www.meetup.com/israel-spark-users/)
We're an active meetup group that brings together the local Spark user
community and holds regular meetups.
Thanks!
Romi K.


java.lang.ClassCastException: optional binary element (UTF8) is not a group

2016-09-20 Thread naveen.ra...@sony.com
My code works fine with JSON input format (Spark 1.6 on Amazon EMR,
emr-5.0.0). I tried the Parquet format. Works fine for English data. When I
tried the Parquet format with some Japanese language text, I am getting this
weird stack-trace:
Caused by: java.lang.ClassCastException: optional binary element (UTF8) is not a group
  at org.apache.parquet.schema.Type.asGroupType(Type.java:202)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.org$apache$spark$sql$execution$datasources$parquet$ParquetReadSupport$$clipParquetType(ParquetReadSupport.scala:131)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetListType(ParquetReadSupport.scala:207)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.org$apache$spark$sql$execution$datasources$parquet$ParquetReadSupport$$clipParquetType(ParquetReadSupport.scala:122)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272)
  at scala.Option.map(Option.scala:146)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:272)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:269)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetGroupFields(ParquetReadSupport.scala:269)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetGroup(ParquetReadSupport.scala:252)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.org$apache$spark$sql$execution$datasources$parquet$ParquetReadSupport$$clipParquetType(ParquetReadSupport.scala:131)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272)
  at scala.Option.map(Option.scala:146)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:272)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:269)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetGroupFields(ParquetReadSupport.scala:269)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetSchema(ParquetReadSupport.scala:111)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport.init(ParquetReadSupport.scala:67)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:168)
  at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:192)
  at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:339)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
  at

Re: Convert RDD to JSON Rdd and append more information

2016-09-20 Thread Deepak Sharma
Enrich the RDD first with more information and then map it to some case
class, if you are using Scala. You can then use the Play API's
(play.api.libs.json.Writes / play.api.libs.json.Json) classes to convert the
mapped case class to JSON.

Thanks
Deepak

On Tue, Sep 20, 2016 at 6:42 PM, sujeet jog  wrote:

> Hi,
>
> I have a Rdd of n rows,  i want to transform this to a Json RDD, and also
> add some more information , any idea how to accomplish this .
>
>
> ex : -
>
> i have rdd with n rows with data like below ,  ,
>
>  16.9527493170273,20.1989561393151,15.7065424947394
>  17.9527493170273,21.1989561393151,15.7065424947394
>  18.9527493170273,22.1989561393151,15.7065424947394
>
>
> would like to add few rows highlited to the beginning of RDD like below,
> is there a way to
> do this and transform it to JSON,  the reason being i intend to push this
> as input  to some application via pipeRDD for some processing, and want to
> enforce a JSON structure on the input.
>
> *{*
> *TimeSeriesID : 1234*
> *NumOfInputSamples : 1008 *
> *Request Type : Fcast*
>  16.9527493170273,20.1989561393151,15.7065424947394
>  17.9527493170273,21.1989561393151,15.7065424947394
>  18.9527493170273,22.1989561393151,15.7065424947394
> }
>
>
> Thanks,
> Sujeet
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
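
A minimal sketch of this approach, assuming Play JSON is on the classpath and
using the field names from the example payload in the post above (the flat
list of sample rows is a simplification, and rdd is the RDD[String] of
comma-separated rows):

import play.api.libs.json.Json

case class FcastRequest(TimeSeriesID: Int, NumOfInputSamples: Int,
                        RequestType: String, Samples: Seq[String])
implicit val fcastWrites = Json.writes[FcastRequest]

// collect() is fine here because a single series is small; for many series,
// build one JSON document per group instead
val request = FcastRequest(1234, 1008, "Fcast", rdd.collect().toSeq)
val json = Json.toJson(request).toString  // feed this to the downstream app, e.g. via pipe()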


Convert RDD to JSON Rdd and append more information

2016-09-20 Thread sujeet jog
Hi,

I have an RDD of n rows. I want to transform it to a JSON RDD and also
add some more information. Any idea how to accomplish this?


Example:

I have an RDD with n rows, with data like below:

 16.9527493170273,20.1989561393151,15.7065424947394
 17.9527493170273,21.1989561393151,15.7065424947394
 18.9527493170273,22.1989561393151,15.7065424947394


I would like to add the few rows highlighted below to the beginning of the RDD.
Is there a way to do this and transform it to JSON? The reason is that I intend
to push this as input to some application via pipeRDD for some processing, and
I want to enforce a JSON structure on the input.

*{*
*TimeSeriesID : 1234*
*NumOfInputSamples : 1008 *
*Request Type : Fcast*
 16.9527493170273,20.1989561393151,15.7065424947394
 17.9527493170273,21.1989561393151,15.7065424947394
 18.9527493170273,22.1989561393151,15.7065424947394
}


Thanks,
Sujeet


Re: SPARK-10835 in 2.0

2016-09-20 Thread Sean Owen
Ah, I think that this was supposed to be changed with SPARK-9062. Let
me see about reopening 10835 and addressing it.

On Tue, Sep 20, 2016 at 3:24 PM, janardhan shetty
 wrote:
> Is this a bug?
>
> On Sep 19, 2016 10:10 PM, "janardhan shetty"  wrote:
>>
>> Hi,
>>
>> I am hitting this issue.
>> https://issues.apache.org/jira/browse/SPARK-10835.
>>
>> Issue seems to be resolved but resurfacing in 2.0 ML. Any workaround is
>> appreciated ?
>>
>> Note:
>> Pipeline has Ngram before word2Vec.
>>
>> Error:
>> val word2Vec = new
>> Word2Vec().setInputCol("wordsGrams").setOutputCol("features").setVectorSize(128).setMinCount(10)
>>
>> scala> word2Vec.fit(grams)
>> java.lang.IllegalArgumentException: requirement failed: Column wordsGrams
>> must be of type ArrayType(StringType,true) but was actually
>> ArrayType(StringType,false).
>>   at scala.Predef$.require(Predef.scala:224)
>>   at
>> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>>   at
>> org.apache.spark.ml.feature.Word2VecBase$class.validateAndTransformSchema(Word2Vec.scala:111)
>>   at
>> org.apache.spark.ml.feature.Word2Vec.validateAndTransformSchema(Word2Vec.scala:121)
>>   at
>> org.apache.spark.ml.feature.Word2Vec.transformSchema(Word2Vec.scala:187)
>>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
>>   at org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:170)
>>
>>
>> Github code for Ngram:
>>
>>
>> override protected def validateInputType(inputType: DataType): Unit = {
>> require(inputType.sameType(ArrayType(StringType)),
>>   s"Input type must be ArrayType(StringType) but got $inputType.")
>>   }
>>
>>   override protected def outputDataType: DataType = new
>> ArrayType(StringType, false)
>> }
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
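
Until that fix lands, one workaround that may help with this kind of
nullability mismatch is to relax the containsNull flag on the n-gram column
before fitting. A rough sketch (not verified against 2.0; column and DataFrame
names are taken from the post above):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, StringType}

// cast ArrayType(StringType, false) to ArrayType(StringType, true) so the
// Word2Vec schema check accepts the column
val gramsNullable = grams.withColumn("wordsGrams",
  col("wordsGrams").cast(ArrayType(StringType, containsNull = true)))
val model = word2Vec.fit(gramsNullable)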



Re: SPARK-10835 in 2.0

2016-09-20 Thread janardhan shetty
Thanks Sean.
On Sep 20, 2016 7:45 AM, "Sean Owen"  wrote:

> Ah, I think that this was supposed to be changed with SPARK-9062. Let
> me see about reopening 10835 and addressing it.
>
> On Tue, Sep 20, 2016 at 3:24 PM, janardhan shetty
>  wrote:
> > Is this a bug?
> >
> > On Sep 19, 2016 10:10 PM, "janardhan shetty" 
> wrote:
> >>
> >> Hi,
> >>
> >> I am hitting this issue.
> >> https://issues.apache.org/jira/browse/SPARK-10835.
> >>
> >> Issue seems to be resolved but resurfacing in 2.0 ML. Any workaround is
> >> appreciated ?
> >>
> >> Note:
> >> Pipeline has Ngram before word2Vec.
> >>
> >> Error:
> >> val word2Vec = new
> >> Word2Vec().setInputCol("wordsGrams").setOutputCol("
> features").setVectorSize(128).setMinCount(10)
> >>
> >> scala> word2Vec.fit(grams)
> >> java.lang.IllegalArgumentException: requirement failed: Column
> wordsGrams
> >> must be of type ArrayType(StringType,true) but was actually
> >> ArrayType(StringType,false).
> >>   at scala.Predef$.require(Predef.scala:224)
> >>   at
> >> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(
> SchemaUtils.scala:42)
> >>   at
> >> org.apache.spark.ml.feature.Word2VecBase$class.
> validateAndTransformSchema(Word2Vec.scala:111)
> >>   at
> >> org.apache.spark.ml.feature.Word2Vec.validateAndTransformSchema(
> Word2Vec.scala:121)
> >>   at
> >> org.apache.spark.ml.feature.Word2Vec.transformSchema(
> Word2Vec.scala:187)
> >>   at org.apache.spark.ml.PipelineStage.transformSchema(
> Pipeline.scala:70)
> >>   at org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:170)
> >>
> >>
> >> Github code for Ngram:
> >>
> >>
> >> override protected def validateInputType(inputType: DataType): Unit = {
> >> require(inputType.sameType(ArrayType(StringType)),
> >>   s"Input type must be ArrayType(StringType) but got $inputType.")
> >>   }
> >>
> >>   override protected def outputDataType: DataType = new
> >> ArrayType(StringType, false)
> >> }
> >>
> >
>


Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
Have you checked to see if any files already exist at /nfspartition/sankar/
banking_l1_v2.csv? If so, you will need to delete them before attempting to
save your DataFrame to that location. Alternatively, you may be able to
specify the "mode" setting of the df.write operation to "overwrite",
depending on the version of Spark you are running.

*ERROR (from log)*
16/09/17 08:03:28 WARN FileUtil: Failed to delete file or
dir[/nfspartition/sankar/banking_l1_v2.csv/_temporary/
0/task_201609170802_0013_m_00/.part-r-0-46a7f178-
2490-444e-9110-510978eaaecb.csv.crc]:
it still exists.
16/09/17 08:03:28 WARN FileUtil: Failed to delete file or
dir[/nfspartition/sankar/banking_l1_v2.csv/_temporary/
0/task_201609170802_0013_m_00/part-r-0-46a7f178-
2490-444e-9110-510978eaaecb.csv]:
it still exists.

*df.write Documentation*
http://spark.apache.org/docs/latest/api/R/write.df.html

Thanks,
Kevin

On Tue, Sep 20, 2016 at 12:16 AM, sankarmittapally <
sankar.mittapa...@creditvidya.com> wrote:

>  We have set up a Spark cluster which is on NFS shared storage; there are no
> permission issues with NFS storage, and all the users are able to write to NFS
> storage. When I fire the write.df command in SparkR, I am getting the error
> below. Can someone please help me to fix this issue?
>
>
> 16/09/17 08:03:28 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
> java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus
> {path=file:/nfspartition/sankar/banking_l1_v2.csv/_
> temporary/0/task_201609170802_0013_m_00/part-r-0-
> 46a7f178-2490-444e-9110-510978eaaecb.csv;
> isDirectory=false; length=436486316; replication=1; blocksize=33554432;
> modification_time=147409940; access_time=0; owner=; group=;
> permission=rw-rw-rw-; isSymlink=false}
> to
> file:/nfspartition/sankar/banking_l1_v2.csv/part-r-
> 0-46a7f178-2490-444e-9110-510978eaaecb.csv
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(
> FileOutputCommitter.java:371)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(
> FileOutputCommitter.java:384)
> at
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(
> FileOutputCommitter.java:326)
> at
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(
> WriterContainer.scala:222)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationComm
> and$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationComm
> and.scala:144)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationComm
> and$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationComm
> and$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(
> SQLExecution.scala:57)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationComm
> and.run(InsertIntoHadoopFsRelationCommand.scala:115)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.
> sideEffectResult$lzycompute(commands.scala:60)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.
> sideEffectResult(commands.scala:58)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(
> commands.scala:74)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$
> execute$1.apply(SparkPlan.scala:115)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$
> execute$1.apply(SparkPlan.scala:115)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(
> SparkPlan.scala:136)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:151)
> at
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(
> QueryExecution.scala:86)
> at
> org.apache.spark.sql.execution.QueryExecution.
> toRdd(QueryExecution.scala:86)
> at
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.
> scala:487)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:
> 62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(
> RBackendHandler.scala:141)
> at
> org.apache.spark.api.r.RBackendHandler.channelRead0(
> RBackendHandler.scala:86)
> at
> org.apache.spark.api.r.RBackendHandler.channelRead0(
> RBackendHandler.scala:38)
> at
> io.netty.channel.SimpleChannelInboundHandler.channelRead(
> SimpleChannelInboundHandler.java:105)
> at
> 

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Sankar Mittapally
Hey Kevin,

It is an empty directory. It is able to write part files to the directory,
but while merging those part files we are getting the above error.

Regards


On Tue, Sep 20, 2016 at 7:46 PM, Kevin Mellott 
wrote:

> Have you checked to see if any files already exist at
> /nfspartition/sankar/banking_l1_v2.csv? If so, you will need to delete
> them before attempting to save your DataFrame to that location.
> Alternatively, you may be able to specify the "mode" setting of the
> df.write operation to "overwrite", depending on the version of Spark you
> are running.
>
> *ERROR (from log)*
> 16/09/17 08:03:28 WARN FileUtil: Failed to delete file or
> dir[/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/
> task_201609170802_0013_m_00/.part-r-0-46a7f178-2490-
> 444e-9110-510978eaaecb.csv.crc]:
> it still exists.
> 16/09/17 08:03:28 WARN FileUtil: Failed to delete file or
> dir[/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/
> task_201609170802_0013_m_00/part-r-0-46a7f178-2490-
> 444e-9110-510978eaaecb.csv]:
> it still exists.
>
> *df.write Documentation*
> http://spark.apache.org/docs/latest/api/R/write.df.html
>
> Thanks,
> Kevin
>
> On Tue, Sep 20, 2016 at 12:16 AM, sankarmittapally  creditvidya.com> wrote:
>
>>  We have set up a Spark cluster which is on NFS shared storage; there are no
>> permission issues with NFS storage, and all the users are able to write to NFS
>> storage. When I fire the write.df command in SparkR, I am getting the error
>> below. Can someone please help me to fix this issue?
>>
>>
>> 16/09/17 08:03:28 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
>> java.io.IOException: Failed to rename DeprecatedRawLocalFileStatus
>> {path=file:/nfspartition/sankar/banking_l1_v2.csv/_temporary
>> /0/task_201609170802_0013_m_00/part-r-0-46a7f178-
>> 2490-444e-9110-510978eaaecb.csv;
>> isDirectory=false; length=436486316; replication=1; blocksize=33554432;
>> modification_time=147409940; access_time=0; owner=; group=;
>> permission=rw-rw-rw-; isSymlink=false}
>> to
>> file:/nfspartition/sankar/banking_l1_v2.csv/part-r-0-
>> 46a7f178-2490-444e-9110-510978eaaecb.csv
>> at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.m
>> ergePaths(FileOutputCommitter.java:371)
>> at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.m
>> ergePaths(FileOutputCommitter.java:384)
>> at
>> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.
>> commitJob(FileOutputCommitter.java:326)
>> at
>> org.apache.spark.sql.execution.datasources.BaseWriterContain
>> er.commitJob(WriterContainer.scala:222)
>> at
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
>> sRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoo
>> pFsRelationCommand.scala:144)
>> at
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
>> sRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRela
>> tionCommand.scala:115)
>> at
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
>> sRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRela
>> tionCommand.scala:115)
>> at
>> org.apache.spark.sql.execution.SQLExecution$.withNewExecutio
>> nId(SQLExecution.scala:57)
>> at
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopF
>> sRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
>> at
>> org.apache.spark.sql.execution.command.ExecutedCommandExec.s
>> ideEffectResult$lzycompute(commands.scala:60)
>> at
>> org.apache.spark.sql.execution.command.ExecutedCommandExec.s
>> ideEffectResult(commands.scala:58)
>> at
>> org.apache.spark.sql.execution.command.ExecutedCommandExec.
>> doExecute(commands.scala:74)
>> at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.
>> apply(SparkPlan.scala:115)
>> at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.
>> apply(SparkPlan.scala:115)
>> at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQue
>> ry$1.apply(SparkPlan.scala:136)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperati
>> onScope.scala:151)
>> at
>> org.apache.spark.sql.execution.SparkPlan.executeQuery(
>> SparkPlan.scala:133)
>> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>> at
>> org.apache.spark.sql.execution.QueryExecution.toRdd$
>> lzycompute(QueryExecution.scala:86)
>> at
>> org.apache.spark.sql.execution.QueryExecution.toRdd(
>> QueryExecution.scala:86)
>> at
>> org.apache.spark.sql.execution.datasources.DataSource.write(
>> DataSource.scala:487)
>> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
>> ssorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>> thodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)

Re: SPARK-10835 in 2.0

2016-09-20 Thread janardhan shetty
Is this a bug?
On Sep 19, 2016 10:10 PM, "janardhan shetty"  wrote:

> Hi,
>
> I am hitting this issue. https://issues.apache.org/jira/browse/SPARK-10835
> .
>
> Issue seems to be resolved but resurfacing in 2.0 ML. Any workaround is
> appreciated ?
>
> Note:
> Pipeline has Ngram before word2Vec.
>
> Error:
> val word2Vec = new Word2Vec().setInputCol("wordsGrams").setOutputCol("
> features").setVectorSize(128).setMinCount(10)
>
> scala> word2Vec.fit(grams)
> java.lang.IllegalArgumentException: requirement failed: Column wordsGrams
> must be of type ArrayType(StringType,true) but was actually
> ArrayType(StringType,false).
>   at scala.Predef$.require(Predef.scala:224)
>   at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(
> SchemaUtils.scala:42)
>   at org.apache.spark.ml.feature.Word2VecBase$class.
> validateAndTransformSchema(Word2Vec.scala:111)
>   at org.apache.spark.ml.feature.Word2Vec.validateAndTransformSchema(
> Word2Vec.scala:121)
>   at org.apache.spark.ml.feature.Word2Vec.transformSchema(
> Word2Vec.scala:187)
>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
>   at org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:170)
>
>
> Github code for Ngram:
>
>
> override protected def validateInputType(inputType: DataType): Unit = {
> require(inputType.sameType(ArrayType(StringType)),
>   s"Input type must be ArrayType(StringType) but got $inputType.")
>   }
>
>   override protected def outputDataType: DataType = new
> ArrayType(StringType, false)
> }
>
>