Re: Equally split a RDD partition into two partition at the same node

2017-01-14 Thread Rishi Yadav
Can you provide some more details?
1. How many partitions does the RDD have?
2. How big is the cluster?
On Sat, Jan 14, 2017 at 3:59 PM Fei Hu  wrote:

> Dear all,
>
> I want to equally divide an RDD partition into two partitions. That means,
> the first half of elements in the partition will create a new partition,
> and the second half of elements in the partition will generate another new
> partition. But the two new partitions are required to be at the same node
> with their parent partition, which can help get high data locality.
>
> Is there anyone who knows how to implement it or any hints for it?
>
> Thanks in advance,
> Fei
>
>
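For reference, one possible (untested) sketch using Spark's developer API: a custom RDD that exposes each parent partition as two child partitions and advertises the parent's preferred locations, so the scheduler tries to keep both halves on the parent's node. All class and method names below are made up for illustration, and the compute step buffers the parent partition to find its midpoint.

import scala.reflect.ClassTag
import org.apache.spark.{NarrowDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// child partitions 2*i and 2*i+1 both read parent partition i
case class HalfPartition(index: Int, parent: Partition, firstHalf: Boolean)
  extends Partition

class SplitEachPartitionRDD[T: ClassTag](parent: RDD[T])
  extends RDD[T](parent.context, Seq(new NarrowDependency(parent) {
    override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / 2)
  })) {

  override protected def getPartitions: Array[Partition] =
    parent.partitions.flatMap { p =>
      Seq[Partition](HalfPartition(2 * p.index, p, firstHalf = true),
                     HalfPartition(2 * p.index + 1, p, firstHalf = false))
    }

  // advertise the parent's locations; locality is still best-effort
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    parent.preferredLocations(split.asInstanceOf[HalfPartition].parent)

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val hp = split.asInstanceOf[HalfPartition]
    // buffers the parent partition in memory to find its midpoint (illustration only)
    val elems = parent.iterator(hp.parent, context).toArray
    val mid = elems.length / 2
    if (hp.firstHalf) elems.iterator.take(mid) else elems.iterator.drop(mid)
  }
}

// usage: val doubled = new SplitEachPartitionRDD(parentRdd)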


Re: Spark can't fetch application jar after adding it to HTTP server

2015-08-16 Thread Rishi Yadav
Can you tell us more about your environment? I understand you are running it
on a single machine, but is a firewall enabled?

On Sun, Aug 16, 2015 at 5:47 AM, t4ng0  wrote:

> Hi
>
> I am new to spark and trying to run standalone application using
> spark-submit. Whatever i could understood, from logs is that spark can't
> fetch the jar file after adding it to the http server. Do i need to
> configure proxy settings for spark too individually if it is a problem.
> Otherwise please help me, thanks in advance.
>
> PS: i am attaching logs here.
>
>  Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties 15/08/16 15:20:52 INFO
> SparkContext: Running Spark version 1.4.1 15/08/16 15:20:53 WARN
> NativeCodeLoader: Unable to load native-hadoop library for your platform...
> using builtin-java classes where applicable 15/08/16 15:20:53 INFO
> SecurityManager: Changing view acls to: manvendratomar 15/08/16 15:20:53
> INFO SecurityManager: Changing modify acls to: manvendratomar 15/08/16
> 15:20:53 INFO SecurityManager: SecurityManager: authentication disabled; ui
> acls disabled; users with view permissions: Set(manvendratomar); users with
> modify permissions: Set(manvendratomar) 15/08/16 15:20:53 INFO Slf4jLogger:
> Slf4jLogger started 15/08/16 15:20:53 INFO Remoting: Starting remoting
> 15/08/16 15:20:54 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkDriver@10.133.201.51:63985] 15/08/16 15:20:54 INFO
> Utils:
> Successfully started service 'sparkDriver' on port 63985. 15/08/16 15:20:54
> INFO SparkEnv: Registering MapOutputTracker 15/08/16 15:20:54 INFO
> SparkEnv:
> Registering BlockManagerMaster 15/08/16 15:20:54 INFO DiskBlockManager:
> Created local directory at
>
> /private/var/folders/6z/2j2qj3rn3z32tymm_9ybz960gn/T/spark-bd1c72c8-5d27-4751-ae6f-f298451e4f66/blockmgr-ed753ee2-726e-45ae-97a2-0c82923262ef
> 15/08/16 15:20:54 INFO MemoryStore: MemoryStore started with capacity 265.4
> MB 15/08/16 15:20:54 INFO HttpFileServer: HTTP File server directory is
>
> /private/var/folders/6z/2j2qj3rn3z32tymm_9ybz960gn/T/spark-bd1c72c8-5d27-4751-ae6f-f298451e4f66/httpd-68c0afb2-92ef-4ddd-9ffc-639ccac81d6a
> 15/08/16 15:20:54 INFO HttpServer: Starting HTTP Server 15/08/16 15:20:54
> INFO Utils: Successfully started service 'HTTP file server' on port 63986.
> 15/08/16 15:20:54 INFO SparkEnv: Registering OutputCommitCoordinator
> 15/08/16 15:20:54 INFO Utils: Successfully started service 'SparkUI' on
> port
> 4040. 15/08/16 15:20:54 INFO SparkUI: Started SparkUI at
> http://10.133.201.51:4040 15/08/16 15:20:54 INFO SparkContext: Added JAR
> target/scala-2.11/spark_matrix_2.11-1.0.jar at
> http://10.133.201.51:63986/jars/spark_matrix_2.11-1.0.jar with timestamp
> 1439718654603 15/08/16 15:20:54 INFO Executor: Starting executor ID driver
> on host localhost 15/08/16 15:20:54 INFO Utils: Successfully started
> service
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 63987.
> 15/08/16 15:20:54 INFO NettyBlockTransferService: Server created on 63987
> 15/08/16 15:20:54 INFO BlockManagerMaster: Trying to register BlockManager
> 15/08/16 15:20:54 INFO BlockManagerMasterEndpoint: Registering block
> manager
> localhost:63987 with 265.4 MB RAM, BlockManagerId(driver, localhost, 63987)
> 15/08/16 15:20:54 INFO BlockManagerMaster: Registered BlockManager 15/08/16
> 15:20:55 INFO MemoryStore: ensureFreeSpace(157248) called with curMem=0,
> maxMem=278302556 15/08/16 15:20:55 INFO MemoryStore: Block broadcast_0
> stored as values in memory (estimated size 153.6 KB, free 265.3 MB)
> 15/08/16
> 15:20:55 INFO MemoryStore: ensureFreeSpace(14257) called with
> curMem=157248,
> maxMem=278302556 15/08/16 15:20:55 INFO MemoryStore: Block
> broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free
> 265.2 MB) 15/08/16 15:20:55 INFO BlockManagerInfo: Added broadcast_0_piece0
> in memory on localhost:63987 (size: 13.9 KB, free: 265.4 MB) 15/08/16
> 15:20:55 INFO SparkContext: Created broadcast 0 from textFile at
> partition.scala:20 15/08/16 15:20:56 INFO FileInputFormat: Total input
> paths
> to process : 1 15/08/16 15:20:56 INFO SparkContext: Starting job: reduce at
> IndexedRowMatrix.scala:65 15/08/16 15:20:56 INFO DAGScheduler: Got job 0
> (reduce at IndexedRowMatrix.scala:65) with 1 output partitions
> (allowLocal=false) 15/08/16 15:20:56 INFO DAGScheduler: Final stage:
> ResultStage 0(reduce at IndexedRowMatrix.scala:65) 15/08/16 15:20:56 INFO
> DAGScheduler: Parents of final stage: List() 15/08/16 15:20:56 INFO
> DAGScheduler: Missing parents: List() 15/08/16 15:20:56 INFO DAGScheduler:
> Submitting ResultStage 0 (MapPartitionsRDD[6] at map at
> IndexedRowMatrix.scala:65), which has no missing parents 15/08/16 15:20:56
> INFO MemoryStore: ensureFreeSpace(4064) called with curMem=171505,
> maxMem=278302556 15/08/16 15:20:56 INFO MemoryStore: Block broadcast_1
> stored as values in memory (estimated size 4.0 KB, free 265.2 MB) 15/08/16
> 15:20:56 I

Re: Error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

2015-08-16 Thread Rishi Yadav
Try --jars rather than --class to submit the jar.
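For example, with the HBase dependency jars passed via --jars (the jar paths here are illustrative and untested):

~/Desktop/spark/bin/spark-submit \
  --class org.example.main.scalaConsumer \
  --jars /path/to/hbase-client.jar,/path/to/hbase-common.jar,/path/to/hbase-protocol.jar \
  scalConsumer-0.0.1-SNAPSHOT.jar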



On Fri, Aug 14, 2015 at 6:19 AM, Stephen Boesch  wrote:

> The NoClassDefFoundException differs from ClassNotFoundException : it
> indicates an error while initializing that class: but the class is found in
> the classpath. Please provide the full stack trace.
>
> 2015-08-14 4:59 GMT-07:00 stelsavva :
>
>> Hello, I am just starting out with spark streaming and Hbase/hadoop, i m
>> writing a simple app to read from kafka and store to Hbase, I am having
>> trouble submitting my job to spark.
>>
>> I 've downloaded Apache Spark 1.4.1 pre-build for hadoop 2.6
>>
>> I am building the project with mvn package
>>
>> and submitting the jar file with
>>
>>  ~/Desktop/spark/bin/spark-submit --class org.example.main.scalaConsumer
>> scalConsumer-0.0.1-SNAPSHOT.jar
>>
>> And then i am getting the error you see in the subject line. Is this a
>> problem with my maven dependencies? do i need to install hadoop locally?
>> And
>> if so how can i add the hadoop classpath to the spark job?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Error-Exception-in-thread-main-java-lang-NoClassDefFoundError-org-apache-hadoop-hbase-HBaseConfiguran-tp24266.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Re: Can't understand the size of raw RDD and its DataFrame

2015-08-16 Thread Rishi Yadav
DataFrames, in simple terms, are RDDs combined with a schema. In reality they
are much more than that and enable a very fine level of optimization; check
out Project Tungsten.

In your case the DataFrame has one column because that is what you selected.
By default, it keeps the same columns as the RDD (the fields of the case
class, if you created the RDD from a case class).
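A tiny illustration in the shell (assuming sc and a SQLContext named sqlContext with its implicits imported, as in the snippet below):

case class Student(name: String, age: Int)

val rdd = sc.parallelize(Seq(Student("Jack", 21), Student("Mary", 22)))

// toDF() keeps the case class fields as the columns: name and age
val df = rdd.toDF()
df.printSchema()

// a projection such as avg(age) yields a DataFrame with just that one column
df.registerTempTable("TBL_STUDENT")
sqlContext.sql("select avg(age) from TBL_STUDENT").printSchema()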


Author: Spark Cookbook <http://amzn.com/1783987065> (Packt)


On Sat, Aug 15, 2015 at 10:01 PM, Todd  wrote:

> I thought that the df only contains one column, and actually contains only
> one resulting row (select avg(age) from theTable).
> So I would think that it would take less space; it looks like my
> understanding is wrong?
>
>
>
>
>
> At 2015-08-16 12:34:31, "Rishi Yadav"  wrote:
>
> why are you expecting footprint of dataframe to be lower when it contains
> more information ( RDD + Schema)
>
> On Sat, Aug 15, 2015 at 6:35 PM, Todd  wrote:
>
>> Hi,
>> With the following code snippet, I cached the raw RDD (which is already in
>> memory, but just for illustration) and its DataFrame.
>> I thought that the df cache would take less space than the rdd cache, which
>> is wrong, because from the UI I see the rdd cache takes 168B while the df
>> cache takes 272B.
>> What data is cached when df.cache is called, and does it actually cache the
>> data? It looks like the df only cached the avg(age), which should be much
>> smaller in size.
>>
>> val conf = new SparkConf().setMaster("local").setAppName("SparkSQL_Cache")
>> val sc = new SparkContext(conf)
>> val sqlContext = new SQLContext(sc)
>> import sqlContext.implicits._
>> val rdd=sc.parallelize(Array(Student("Jack",21), Student("Mary", 22)))
>> rdd.cache
>> rdd.toDF().registerTempTable("TBL_STUDENT")
>> val df = sqlContext.sql("select avg(age) from TBL_STUDENT")
>> df.cache()
>> df.show
>>
>>
>


Re: Can't understand the size of raw RDD and its DataFrame

2015-08-15 Thread Rishi Yadav
Why are you expecting the footprint of the DataFrame to be lower when it
contains more information (RDD + schema)?

On Sat, Aug 15, 2015 at 6:35 PM, Todd  wrote:

> Hi,
> With the following code snippet, I cached the raw RDD (which is already in
> memory, but just for illustration) and its DataFrame.
> I thought that the df cache would take less space than the rdd cache, which
> is wrong, because from the UI I see the rdd cache takes 168B while the df
> cache takes 272B.
> What data is cached when df.cache is called, and does it actually cache the
> data? It looks like the df only cached the avg(age), which should be much
> smaller in size.
>
> val conf = new SparkConf().setMaster("local").setAppName("SparkSQL_Cache")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._
> val rdd=sc.parallelize(Array(Student("Jack",21), Student("Mary", 22)))
> rdd.cache
> rdd.toDF().registerTempTable("TBL_STUDENT")
> val df = sqlContext.sql("select avg(age) from TBL_STUDENT")
> df.cache()
> df.show
>
>


Re: [MLLIB] Anyone tried correlation with RDD[Vector] ?

2015-07-23 Thread Rishi Yadav
Can you explain which transformation is failing? Here's a simple example:

http://www.infoobjects.com/spark-calculating-correlation-using-rdd-of-vectors/
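A minimal sketch of the RDD-input method (values are made up); note that the element type must be declared as the Vector trait, not DenseVector, because RDDs are invariant and Statistics.corr expects an RDD[Vector]:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// declare the RDD as RDD[Vector] even though the elements are dense vectors
val data: RDD[Vector] = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(5.0, 33.0, 366.0)))

val correlMatrix = Statistics.corr(data, "pearson")
println(correlMatrix)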

On Thu, Jul 23, 2015 at 5:37 AM,  wrote:

> I tried with an RDD[DenseVector], but RDDs are not covariant (no +T), so an
> RDD[DenseVector] is not accepted as an RDD[Vector], and I can't get to use
> the RDD input method of correlation.
>
> Thanks,
> Saif
>
>


Re: No suitable driver found for jdbc:mysql://

2015-07-22 Thread Rishi Yadav
try setting --driver-class-path
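For example, adding it to the original command so the driver JVM also sees the connector (same paths as below; untested):

$SPARK_HOME/bin/spark-submit \
  --driver-class-path /root/spark/lib/mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar \
  --jars /root/spark/lib/mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar \
  --class "saveBedToDB" target/scala-2.10/adam-project_2.10-1.0.jar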

On Wed, Jul 22, 2015 at 3:45 PM, roni  wrote:

> Hi All,
>  I have a cluster with spark 1.4.
> I am trying to save data to mysql but getting error
>
> Exception in thread "main" java.sql.SQLException: No suitable driver found
> for jdbc:mysql://<>.rds.amazonaws.com:3306/DAE_kmer?user=<>&password=<>
>
>
> *I looked at - https://issues.apache.org/jira/browse/SPARK-8463
>  and added the connector
> jar to the same location as on Master using copy-dir script.*
>
> *But I am still getting the same error. This used to work with 1.3.*
>
> *This is my command  to run the program - **$SPARK_HOME/bin/spark-submit
> --jars
> /root/spark/lib/mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar
> --conf
> spark.executor.extraClassPath=/root/spark/lib/mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar
> --conf spark.executor.memory=55g --driver-memory=55g
> --master=spark://ec2-52-25-191-999.us-west-2.compute.amazonaws.com:7077
>   --class
> "saveBedToDB"  target/scala-2.10/adam-project_2.10-1.0.jar*
>
> *What else can I Do ?*
>
> *Thanks*
>
> *-Roni*
>


Re: How to use DataFrame with MySQL

2015-03-23 Thread Rishi Yadav
For me, it only works if I set --driver-class-path to the MySQL connector library.

On Sun, Mar 22, 2015 at 11:29 PM, gavin zhang  wrote:

> OK,I found what the problem is: It couldn't work with
> mysql-connector-5.0.8.
> I updated the connector version to 5.1.34 and it worked.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-DataFrame-with-MySQL-tp22178p22182.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Input validation for LogisticRegressionWithSGD

2015-03-15 Thread Rishi Yadav
Can you share some sample data?

On Sun, Mar 15, 2015 at 8:51 PM, Rohit U  wrote:

> Hi,
>
> I am trying to run  LogisticRegressionWithSGD on RDD of LabeledPoints
> loaded using loadLibSVMFile:
>
> val logistic: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc,
> "s3n://logistic-regression/epsilon_normalized")
>
> val model = LogisticRegressionWithSGD.train(logistic, 100)
>
> It gives an input validation error after about 10 minutes:
>
> org.apache.spark.SparkException: Input validation failed.
> at
> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:162)
> at
> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
> at
> org.apache.spark.mllib.classification.LogisticRegressionWithSGD$.train(LogisticRegression.scala:157)
> at
> org.apache.spark.mllib.classification.LogisticRegressionWithSGD$.train(LogisticRegression.scala:192)
>
> From reading this bug report (
> https://issues.apache.org/jira/browse/SPARK-2575) since I am loading
> LibSVM format file there should be only 0/1 in the dataset and should not
> be facing the issue in the bug report. Is there something else I'm missing
> here?
>
> Thanks!
>


Re: Spark Release 1.3.0 DataFrame API

2015-03-14 Thread Rishi Yadav
Programmatically specifying the schema needs

 import org.apache.spark.sql.types._

for StructType and StructField to resolve.
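A minimal sketch of that path (assuming sc, a SQLContext named sqlContext, and the people.txt "name, age" file from the programming guide):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val people = sc.textFile("examples/src/main/resources/people.txt")

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

// createDataFrame wants an RDD[Row] matching the schema, not an RDD[String]
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
peopleDataFrame.registerTempTable("people")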

On Sat, Mar 14, 2015 at 10:07 AM, Sean Owen  wrote:

> Yes I think this was already just fixed by:
>
> https://github.com/apache/spark/pull/4977
>
> a ".toDF()" is missing
>
> On Sat, Mar 14, 2015 at 4:16 PM, Nick Pentreath
>  wrote:
> > I've found people.toDF gives you a data frame (roughly equivalent to the
> > previous Row RDD),
> >
> > And you can then call registerTempTable on that DataFrame.
> >
> > So people.toDF.registerTempTable("people") should work
> >
> >
> >
> > —
> > Sent from Mailbox
> >
> >
> > On Sat, Mar 14, 2015 at 5:33 PM, David Mitchell <
> jdavidmitch...@gmail.com>
> > wrote:
> >>
> >>
> >> I am pleased with the release of the DataFrame API.  However, I started
> >> playing with it, and neither of the two main examples in the
> documentation
> >> work: http://spark.apache.org/docs/1.3.0/sql-programming-guide.html
> >>
> >> Specfically:
> >>
> >> Inferring the Schema Using Reflection
> >> Programmatically Specifying the Schema
> >>
> >>
> >> Scala 2.11.6
> >> Spark 1.3.0 prebuilt for Hadoop 2.4 and later
> >>
> >> Inferring the Schema Using Reflection
> >> scala> people.registerTempTable("people")
> >> :31: error: value registerTempTable is not a member of
> >> org.apache.spark
> >> .rdd.RDD[Person]
> >>   people.registerTempTable("people")
> >>  ^
> >>
> >> Programmatically Specifying the Schema
> >> scala> val peopleDataFrame = sqlContext.createDataFrame(people, schema)
> >> :41: error: overloaded method value createDataFrame with
> >> alternatives:
> >>   (rdd: org.apache.spark.api.java.JavaRDD[_],beanClass:
> >> Class[_])org.apache.spar
> >> k.sql.DataFrame 
> >>   (rdd: org.apache.spark.rdd.RDD[_],beanClass:
> >> Class[_])org.apache.spark.sql.Dat
> >> aFrame 
> >>   (rowRDD:
> >> org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],columns:
> >> java.util.List[String])org.apache.spark.sql.DataFrame 
> >>   (rowRDD:
> >> org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: o
> >> rg.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
> 
> >>   (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema:
> >> org.apache
> >> .spark.sql.types.StructType)org.apache.spark.sql.DataFrame
> >>  cannot be applied to (org.apache.spark.rdd.RDD[String],
> >> org.apache.spark.sql.ty
> >> pes.StructType)
> >>val df = sqlContext.createDataFrame(people, schema)
> >>
> >> Any help would be appreciated.
> >>
> >> David
> >>
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Stepsize with Linear Regression

2015-02-10 Thread Rishi Yadav
Are there any rules of thumb for setting the step size with gradient descent? I
am using it for linear regression, but I am sure the question applies to
gradient descent in general.

At present I am deriving a value that fits the training data set's response
variable values most closely. I am sure there is a better way to do it.
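For reference, this is where the step size plugs into the MLlib 1.x API (toy data just to keep the snippet self-contained; the values shown are arbitrary, not a recommendation):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val trainingData = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 2.0)),
  LabeledPoint(2.0, Vectors.dense(2.0, 4.0))))

// train(input, numIterations, stepSize, miniBatchFraction)
val model = LinearRegressionWithSGD.train(trainingData, 100, 0.01, 1.0)

// or through the optimizer settings
val lr = new LinearRegressionWithSGD()
lr.optimizer.setNumIterations(100).setStepSize(0.01)
val model2 = lr.run(trainingData)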

Thanks and Regards,
Rishi
@meditativesoul

Re: Define size partitions

2015-01-30 Thread Rishi Yadav
If you are only concerned about partitions being too big, you can specify the
number of partitions as an additional parameter while loading files from HDFS.
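For instance, with the text-file API the second argument is the minimum number of partitions (the path here is illustrative):

// ask for at least 200 partitions instead of the one-per-block default
val blocks = sc.textFile("hdfs:///data/huge-input", 200)
println(blocks.partitions.length)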

On Fri, Jan 30, 2015 at 9:47 AM, Sven Krasser  wrote:

> You can also use your InputFormat/RecordReader in Spark, e.g. using
> newAPIHadoopFile. See here:
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
> .
> -Sven
>
> On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz 
> wrote:
>
>> Hi,
>>
>> I want to process some files; they're kind of big, dozens of gigabytes
>> each. I get them as an array of bytes, and there's a structure inside of
>> them.
>>
>> I have a header which describes the structure. It could be like:
>> Number(8bytes) Char(16bytes) Number(4 bytes) Char(1bytes), ..
>> This structure appears N times on the file.
>>
>> So I know the size of each block since it's fixed. There's no separator
>> between blocks.
>>
>> If I would do this with MapReduce, I could implement a new
>> RecordReader and InputFormat  to read each block because I know the
>> size of them and I'd fix the split size in the driver. (blockX1000 for
>> example). On this way, I could know that each split for each mapper
>> has complete blocks and there isn't a piece of the last block in the
>> next split.
>>
>> Spark works with RDD and partitions, How could I resize  each
>> partition to do that?? is it possible? I guess that Spark doesn't use
>> the RecordReader and these classes for these tasks.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> http://sites.google.com/site/krasser/?utm_source=sig
>


RangePartitioner

2015-01-20 Thread Rishi Yadav
I am joining two tables as below; the program stalls at the log line below and
never proceeds.
What might be the issue, and what is a possible solution?


>>> INFO SparkContext: Starting job: RangePartitioner at Exchange.scala:79


Table 1 has 450 columns.
Table 2 has 100 columns.

Both tables have a few million rows.




val table1 = myTable1.as('table1)
val table2 = myTable2.as('table2)
val results = table1.join(table2, LeftOuter,
  Some("table1.Id".attr === "table2.id".attr))

println(results.count())

Thanks and Regards,
Rishi
@meditativesoul

Re: JavaRDD (Data Aggregation) based on key

2015-01-08 Thread Rishi Yadav
One approach is to first transform this RDD into a PairRDD, using the field
you are going to aggregate on as the key.
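A sketch in Scala (the same idea works with JavaPairRDD and mapToPair in the Java API; the file path and field positions are assumptions):

// each line of the csv is "a,b,c"; key by field a, aggregate field b
val pairs = sc.textFile("data.csv")
  .map(_.split(","))
  .map(cols => (cols(0), cols(1).toDouble))

val sums = pairs.reduceByKey(_ + _)          // sum of b per key a

val averages = pairs.mapValues(v => (v, 1L))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
  .mapValues { case (sum, n) => sum / n }    // average of b per key a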

On Tue, Dec 23, 2014 at 1:47 AM, sachin Singh 
wrote:

> Hi,
> I have a csv file having fields as a,b,c .
> I want to do aggregation(sum,average..) based on any field(a,b or c) as per
> user input,
> using Apache Spark Java API,Please Help Urgent!
>
> Thanks in advance,
>
> Regards
> Sachin
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/JavaRDD-Data-Aggregation-based-on-key-tp20828.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Profiling a spark application.

2015-01-08 Thread Rishi Yadav
As per my understanding, RDDs do not get replicated; the underlying data does
if it's in HDFS.

On Thu, Dec 25, 2014 at 9:04 PM, rapelly kartheek 
wrote:

> Hi,
>
> I want to find the time taken for replicating an rdd in spark cluster
> along with the computation time on the replicated rdd.
>
> Can someone please suggest some ideas?
>
> Thank you
>


Re: Problem with StreamingContext - getting SPARK-2243

2015-01-08 Thread Rishi Yadav
You can also access the SparkConf using sc.getConf in the Spark shell, though
for the StreamingContext you can directly use sc as Akhil suggested.
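For example, in the shell:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// sc already exists in the shell; reuse it instead of building a second SparkContext
println(sc.getConf.toDebugString)

val ssc = new StreamingContext(sc, Seconds(1))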

On Sun, Dec 28, 2014 at 12:13 AM, Akhil Das 
wrote:

> In the shell you could do:
>
> val ssc = StreamingContext(*sc*, Seconds(1))
>
> as *sc* is the SparkContext, which is already instantiated.
>
> Thanks
> Best Regards
>
> On Sun, Dec 28, 2014 at 6:55 AM, Thomas Frisk  wrote:
>
>> Yes you are right - thanks for that :)
>>
>> On 27 December 2014 at 23:18, Ilya Ganelin  wrote:
>>
>>> Are you trying to do this in the shell? Shell is instantiated with a
>>> spark context named sc.
>>>
>>> -Ilya Ganelin
>>>
>>> On Sat, Dec 27, 2014 at 5:24 PM, tfrisk  wrote:
>>>

 Hi,

 Doing:
val ssc = new StreamingContext(conf, Seconds(1))

 and getting:
Only one SparkContext may be running in this JVM (see SPARK-2243). To
 ignore this error, set spark.driver.allowMultipleContexts = true.


 But I dont think that I have another SparkContext running. Is there any
 way
 I can check this or force kill ?  I've tried restarting the server as
 I'm
 desperate but still I get the same issue.  I was not getting this
 earlier
 today.

 Any help much appreciated .

 Thanks,

 Thomas




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-StreamingContext-getting-SPARK-2243-tp20869.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


Re: Implement customized Join for SparkSQL

2015-01-08 Thread Rishi Yadav
Hi Kevin,

Say A has 10 ids; are you pulling data from B's data source only for
those 10 ids?

What if you load A and B as separate SchemaRDDs and then do the join? Spark
will optimize the plan anyway when an action is fired.
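A rough, untested sketch of that shape (assuming a SQLContext named sqlContext; the data-source class name and options below are placeholders for whatever your wrapper around B expects):

val a = sqlContext.jsonFile("hdfs:///data/A.json")   // table A from its file
a.registerTempTable("A")

// expose B through your data-source wrapper
sqlContext.sql(
  """CREATE TEMPORARY TABLE B
    |USING com.example.MyDbSource
    |OPTIONS (table 'B')""".stripMargin)

val joined = sqlContext.sql("SELECT * FROM A JOIN B ON A.id = B.id")
joined.collect().foreach(println)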

On Mon, Jan 5, 2015 at 2:28 AM, Dai, Kevin  wrote:

>  Hi, All
>
>
>
> Suppose I want to join two tables A and B as follows:
>
>
>
> Select * from A join B on A.id = B.id
>
>
>
> A is a file while B is a database which indexed by id and I wrapped it by
> Data source API.
>
> The desired join flow is:
>
> 1.   Generate A’s RDD[Row]
>
> 2.   Generate B’s RDD[Row] from A by using A’s id and B’s data source
> api to get row from the database
>
> 3.   Merge these two RDDs to the final RDD[Row]
>
>
>
> However it seems existing join strategy doesn’t support it?
>
>
>
> Any way to achieve it?
>
>
>
> Best Regards,
>
> Kevin.
>


Re: sparkContext.textFile does not honour the minPartitions argument

2015-01-01 Thread Rishi Yadav
Hi Aniket,

The optional number-of-partitions value is meant to increase the number of
partitions, not to reduce it below the default.
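A small illustration (the path is made up):

// minPartitions is a lower bound on top of the Hadoop input splits, so a
// multi-block file asked to load with 1 partition may still come back with many
val rdd = sc.textFile("hdfs:///data/big-file.txt", 1)
println(rdd.partitions.length)

// to actually end up with a single partition, coalesce after loading
val single = rdd.coalesce(1)
println(single.partitions.length)   // 1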

On Thu, Jan 1, 2015 at 10:43 AM, Aniket Bhatnagar <
aniket.bhatna...@gmail.com> wrote:

> I am trying to read a file into a single partition but it seems like
> sparkContext.textFile ignores the passed minPartitions value. I know I can
> repartition the RDD but I was curious to know if this is expected or if
> this is a bug that needs to be further investigated?


Re: Cached RDD

2014-12-30 Thread Rishi Yadav
Without caching, the lineage is recomputed for each action. So, assuming rdd2
and rdd3 are used in separate actions, the answer is yes.
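A small sketch of the difference (the input path is hypothetical):

val rdd1 = sc.textFile("hdfs:///data/input")   // some lineage you don't want to recompute
rdd1.cache()

val rdd2 = rdd1.groupBy(_.length)
val rdd3 = rdd1.groupBy(_.take(3))

rdd2.count()   // first action materializes rdd1 and caches it
rdd3.count()   // with the cache, rdd1 is reused here; without it, the file is read again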

On Mon, Dec 29, 2014 at 7:53 PM, Corey Nolet  wrote:

> If I have 2 RDDs which depend on the same RDD like the following:
>
> val rdd1 = ...
>
> val rdd2 = rdd1.groupBy()...
>
> val rdd3 = rdd1.groupBy()...
>
>
> If I don't cache rdd1, will it's lineage be calculated twice (one for rdd2
> and one for rdd3)?
>


Re: reduceByKey and empty output files

2014-11-30 Thread Rishi Yadav
How big is your input dataset?

On Thursday, November 27, 2014, Praveen Sripati 
wrote:

> Hi,
>
> When I run the below program, I see two files in the HDFS because the
> number of partitions in 2. But, one of the file is empty. Why is it so? Is
> the work not distributed equally to all the tasks?
>
> textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).
> *reduceByKey*(lambda a, b: a+b).*repartition(2)*
> .saveAsTextFile("hdfs://localhost:9000/user/praveen/output/")
>
> Thanks,
> Praveen
>


-- 
- Rishi


Re: optimize multiple filter operations

2014-11-28 Thread Rishi Yadav
You can try this (Scala version; you can convert it to Python):

val set = initial.groupBy(x => if (x == something) "key1" else "key2")

This does one pass over the original data.
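To get the two sets back out of the grouped result, one (untested) way is the following, with stand-ins for the original rdd and predicate value:

val initial = sc.parallelize(1 to 10)   // stand-in for the original RDD
val something = 5                       // stand-in for the predicate value

// cache the grouped result so the two filters below don't re-read the input
val grouped = initial.groupBy(x => if (x == something) "key1" else "key2").cache()

val set1 = grouped.filter(_._1 == "key1").flatMap(_._2)   // x == something
val set2 = grouped.filter(_._1 == "key2").flatMap(_._2)   // x != something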

On Fri, Nov 28, 2014 at 8:21 AM, mrm  wrote:

> Hi,
>
> My question is:
>
> I have multiple filter operations where I split my initial rdd into two
> different groups. The two groups cover the whole initial set. In code, it's
> something like:
>
> set1 = initial.filter(lambda x: x == something)
> set2 = initial.filter(lambda x: x != something)
>
> By doing this, I am doing two passes over the data. Is there any way to
> optimise this to do it in a single pass?
>
> Note: I was trying to look in the mailing list to see if this question has
> been asked already, but could not find it.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/optimize-multiple-filter-operations-tp20010.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark SQL Programming Guide - registerTempTable Error

2014-11-24 Thread Rishi Yadav
We keep conf as a symbolic link so that an upgrade is as simple as a drop-in
replacement.

On Monday, November 24, 2014, riginos  wrote:

> OK thank you very much for that!
> On 23 Nov 2014 21:49, "Denny Lee [via Apache Spark User List]" <[hidden
> email] > wrote:
>
>> It sort of depends on your environment.  If you are running on your local
>> environment, I would just download the latest Spark 1.1 binaries and you'll
>> be good to go.  If its a production environment, it sort of depends on how
>> you are setup (e.g. AWS, Cloudera, etc.)
>>
>> On Sun Nov 23 2014 at 11:27:49 AM riginos <[hidden email]
>> > wrote:
>>
>>> That was the problem ! Thank you Denny for your fast response!
>>> Another quick question:
>>> Is there any way to update spark to 1.1.0 fast?
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.
>>> 1001560.n3.nabble.com/Spark-SQL-Programming-Guide-
>>> registerTempTable-Error-tp19591p19595.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: [hidden email]
>>> 
>>> For additional commands, e-mail: [hidden email]
>>> 
>>>
>>>
>>
>> --
>>  If you reply to this email, your message will be added to the
>> discussion below:
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Programming-Guide-registerTempTable-Error-tp19591p19598.html
>>  To unsubscribe from Spark SQL Programming Guide - registerTempTable
>> Error, click here.
>> NAML
>> 
>>
>
> --
> View this message in context: Re: Spark SQL Programming Guide -
> registerTempTable Error
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


-- 
- Rishi


Re: Declaring multiple RDDs and efficiency concerns

2014-11-14 Thread Rishi Yadav
How about using the fluent style of Scala programming?
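One (untested) way to get the rdd.function1.function2.function3 chaining from the question is to wrap the functions in an implicit class; the names, element type, and bodies here are placeholders:

import org.apache.spark.rdd.RDD

object MyTransforms {
  implicit class RichRDD(rdd: RDD[String]) {
    def function1: RDD[String] = rdd.map(_.trim)          // doSomething
    def function2: RDD[String] = rdd.filter(_.nonEmpty)   // doSomethingElse
    def function3: RDD[String] = rdd.distinct()           // doSomethingElseYet
  }
}

import MyTransforms._
val myRdd = sc.parallelize(Seq(" a ", "b", "", " a "))
val result = myRdd.function1.function2.function3   // fluent chain, no intermediate vals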


On Fri, Nov 14, 2014 at 8:31 AM, Simone Franzini 
wrote:

> Let's say I have to apply a complex sequence of operations to a certain
> RDD.
> In order to make code more modular/readable, I would typically have
> something like this:
>
> object myObject {
>   def main(args: Array[String]) {
> val rdd1 = function1(myRdd)
> val rdd2 = function2(rdd1)
> val rdd3 = function3(rdd2)
>   }
>
>   def function1(rdd: RDD) : RDD = { doSomething }
>   def function2(rdd: RDD) : RDD = { doSomethingElse }
>   def function3(rdd: RDD) : RDD = { doSomethingElseYet }
> }
>
> So I am explicitly declaring vals for the intermediate steps. Does this
> end up using more storage than if I just chained all of the operations and
> declared only one val instead?
> If yes, is there a better way to chain together the operations?
> Ideally I would like to do something like:
>
> val rdd = function1.function2.function3
>
> Is there a way I can write the signature of my functions to accomplish
> this? Is this also an efficiency issue or just a stylistic one?
>
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
>


Re: Assigning input files to spark partitions

2014-11-13 Thread Rishi Yadav
If your data is in HDFS, you are reading it with textFile, and each file is
smaller than the block size, my understanding is that you would always get one
partition per file.
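A quick way to sanity-check this assumption (the path is illustrative):

// compare the partition count with the number of input files in the directory
val rdd = sc.textFile("hdfs:///data/many-small-files")
println(rdd.partitions.length)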

On Thursday, November 13, 2014, Daniel Siegmann 
wrote:

> Would it make sense to read each file in as a separate RDD? This way you
> would be guaranteed the data is partitioned as you expected.
>
> Possibly you could then repartition each of those RDDs into a single
> partition and then union them. I think that would achieve what you expect.
> But it would be easy to accidentally screw this up (have some operation
> that causes a shuffle), so I think you're better off just leaving them as
> separate RDDs.
>
> On Wed, Nov 12, 2014 at 10:27 PM, Pala M Muthaia <
> mchett...@rocketfuelinc.com
> > wrote:
>
>> Hi,
>>
>> I have a set of input files for a spark program, with each file
>> corresponding to a logical data partition. What is the API/mechanism to
>> assign each input file (or a set of files) to a spark partition, when
>> initializing RDDs?
>>
>> When i create a spark RDD pointing to the directory of files, my
>> understanding is it's not guaranteed that each input file will be treated
>> as separate partition.
>>
>> My job semantics require that the data is partitioned, and i want to
>> leverage the partitioning that has already been done, rather than
>> repartitioning again in the spark job.
>>
>> I tried to lookup online but haven't found any pointers so far.
>>
>>
>> Thanks
>> pala
>>
>
>
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
>
> 54 W 40th St, New York, NY 10018
> E: daniel.siegm...@velos.io
>  W: www.velos.io
>


-- 
- Rishi


Re: join 2 tables

2014-11-12 Thread Rishi Yadav
Please use the join syntax.
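For example, the same query written with an explicit JOIN ... ON (untested against your tables):

val tmp2 = sqlContext.sql(
  """SELECT a.ult_fecha, b.pri_fecha
    |FROM fecha_ult_compra_u3m a
    |JOIN fecha_pri_compra_u3m b ON a.id = b.id""".stripMargin)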

On Wed, Nov 12, 2014 at 8:57 AM, Franco Barrientos <
franco.barrien...@exalitica.com> wrote:

> I have 2 tables in a hive context, and I want to select one field of each
> table where id’s of each table are equal. For example,
>
>
>
> *val tmp2=sqlContext.sql("select a.ult_fecha,b.pri_fecha from
> fecha_ult_compra_u3m as a, fecha_pri_compra_u3m as b where a.id
> =b.id ")*
>
>
>
> but i get an error:
>
>
>
>
>
> *Franco Barrientos*
> Data Scientist
>
> Málaga #115, Of. 1003, Las Condes.
> Santiago, Chile.
> (+562)-29699649
> (+569)-76347893
>
> franco.barrien...@exalitica.com
>
> www.exalitica.com
>
> [image: http://exalitica.com/web/img/frim.png]
>
>
>


Re: Question about textFileStream

2014-11-12 Thread Rishi Yadav
Yes, and you can always specify a minimum number of partitions, which would
force some parallelism (assuming you have enough cores).

On Wed, Nov 12, 2014 at 9:36 AM, Saiph Kappa  wrote:

> What if the window is of 5 seconds, and the file takes longer than 5
> seconds to be completely scanned? It will still attempt to load the whole
> file?
>
> On Mon, Nov 10, 2014 at 6:24 PM, Soumitra Kumar 
> wrote:
>
>> Entire file in a window.
>>
>> On Mon, Nov 10, 2014 at 9:20 AM, Saiph Kappa 
>> wrote:
>>
>>> Hi,
>>>
>>> In my application I am doing something like this "new
>>> StreamingContext(sparkConf, Seconds(10)).textFileStream("logs/")", and I
>>> get some unknown exceptions when I copy a file with about 800 MB to that
>>> folder ("logs/"). I have a single worker running with 512 MB of memory.
>>>
>>> Anyone can tell me if every 10 seconds spark reads parts of that big
>>> file, or if it attempts to read the entire file in a single window? How
>>> does it work?
>>>
>>> Thanks.
>>>
>>>
>>
>


Re: S3 table to spark sql

2014-11-11 Thread Rishi Yadav
Simple:

scala> val date = new java.text.SimpleDateFormat("mmdd").parse(fechau3m)

should work. Replace "mmdd" with the format fechau3m is actually in (e.g.
"yyyyMMdd").

If you want to do it at case class level:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
//HiveContext always a good idea

import sqlContext.createSchemaRDD



case class trx_u3m(id: String, local: String, fechau3m: java.util.Date,
rubro: Int, sku: String, unidades: Double, monto: Double)



val tabla = sc.textFile("s3n://exalitica.com/trx_u3m/trx_u3m.txt")
  .map(_.split(","))
  .map(p => trx_u3m(p(0).trim.toString, p(1).trim.toString,
    new java.text.SimpleDateFormat("mmdd").parse(p(2).trim.toString),
    p(3).trim.toInt, p(4).trim.toString, p(5).trim.toDouble,
    p(6).trim.toDouble))

tabla.registerTempTable("trx_u3m")


On Tue, Nov 11, 2014 at 11:11 AM, Franco Barrientos <
franco.barrien...@exalitica.com> wrote:

> How can i create a date field in spark sql? I have a S3 table and  i load
> it into a RDD.
>
>
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>
> import sqlContext.createSchemaRDD
>
>
>
> case class trx_u3m(id: String, local: String, fechau3m: String, rubro:
> Int, sku: String, unidades: Double, monto: Double)
>
>
>
> val tabla = 
> sc.textFile("s3n://exalitica.com/trx_u3m/trx_u3m.txt").map(_.split(",")).map(p
> => trx_u3m(p(0).trim.toString, p(1).trim.toString, p(2).trim.toString,
> p(3).trim.toInt, p(4).trim.toString, p(5).trim.toDouble,
> p(6).trim.toDouble))
>
> tabla.registerTempTable("trx_u3m")
>
>
>
> Now my problema i show can i transform string variable into date variables
> (fechau3m)?
>
>
>
> *Franco Barrientos*
> Data Scientist
>
> Málaga #115, Of. 1003, Las Condes.
> Santiago, Chile.
> (+562)-29699649
> (+569)-76347893
>
> franco.barrien...@exalitica.com
>
> www.exalitica.com
>
> [image: http://exalitica.com/web/img/frim.png]
>
>
>


Re: Spark SQL : how to find element where a field is in a given set

2014-11-02 Thread Rishi Yadav
did you create SQLContext?

On Sat, Nov 1, 2014 at 7:51 PM, abhinav chowdary  wrote:

> I have same requirement of passing list of values to in clause, when i am
> trying to do
>
> i am getting below error
>
> scala> val longList = Seq[Expression]("a", "b")
> :11: error: type mismatch;
>  found   : String("a")
>  required: org.apache.spark.sql.catalyst.expressions.Expression
>val longList = Seq[Expression]("a", "b")
>
> Thanks
>
>
> On Fri, Aug 29, 2014 at 3:52 PM, Michael Armbrust 
> wrote:
>
>> This feature was not part of that version.  It will be in 1.1.
>>
>>
>> On Fri, Aug 29, 2014 at 12:33 PM, Jaonary Rabarisoa 
>> wrote:
>>
>>>
>>> 1.0.2
>>>
>>>
>>> On Friday, August 29, 2014, Michael Armbrust 
>>> wrote:
>>>
 What version are you using?



 On Fri, Aug 29, 2014 at 2:22 AM, Jaonary Rabarisoa 
 wrote:

> Still not working for me. I got a compilation error : *value in is
> not a member of Symbol.* Any ideas ?
>
>
> On Fri, Aug 29, 2014 at 9:46 AM, Michael Armbrust <
> mich...@databricks.com> wrote:
>
>> To pass a list to a variadic function you can use the type ascription
>> :_*
>>
>> For example:
>>
>> val longList = Seq[Expression]("a", "b", ...)
>> table("src").where('key in (longList: _*))
>>
>> Also, note that I had to explicitly specify Expression as the type
>> parameter of Seq to ensure that the compiler converts "a" and "b" into
>> Spark SQL expressions.
>>
>>
>>
>>
>> On Thu, Aug 28, 2014 at 11:52 PM, Jaonary Rabarisoa <
>> jaon...@gmail.com> wrote:
>>
>>> ok, but what if I have a long list do I need to hard code like this
>>> every element of my list of is there a function that translate a list 
>>> into
>>> a tuple ?
>>>
>>>
>>> On Fri, Aug 29, 2014 at 3:24 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
 You don't need the Seq, as in is a variadic function.

 personTable.where('name in ("foo", "bar"))



 On Thu, Aug 28, 2014 at 3:09 AM, Jaonary Rabarisoa <
 jaon...@gmail.com> wrote:

> Hi all,
>
> What is the expression that I should use with spark sql DSL if I
> need to retreive
> data with a field in a given set.
> For example :
>
> I have the following schema
>
> case class Person(name: String, age: Int)
>
> And I need to do something like :
>
> personTable.where('name in Seq("foo", "bar")) ?
>
>
> Cheers.
>
>
> Jaonary
>


>>>
>>
>

>>>
>>
>>
>
>
> --
> Warm Regards
> Abhinav Chowdary
>


Re: Bug in Accumulators...

2014-10-25 Thread Rishi Yadav
Works fine for me; Spark 1.1.0 in the REPL.
On Sat, Oct 25, 2014 at 1:41 PM, octavian.ganea 
wrote:

> There is for sure a bug in the Accumulators code.
>
> More specifically, the following code works well as expected:
>
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("EL LBP SPARK")
> val sc = new SparkContext(conf)
> val accum = sc.accumulator(0)
> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
> sc.stop
>   }
>
> but the following code (adding just a for loop) gives the weird error :
>   def run(args: Array[String]) {
> val conf = new SparkConf().setAppName("EL LBP SPARK")
> val sc = new SparkContext(conf)
> val accum = sc.accumulator(0)
> for (i <- 1 to 10) {
>   sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
> }
> sc.stop
>   }
>
>
> the error:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due
> to stage failure: Task not serializable: java.io.NotSerializableException:
> org.apache.spark.SparkContext
> at
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
> at
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
> at
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770)
> at
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713)
> at
>
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697)
> at
>
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at
>
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
>
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
>
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
> Can someone confirm this bug ?
>
> Related to this:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Accumulators-Task-not-serializable-java-io-NotSerializableException-org-apache-spark-SparkContext-td17262.html
>
>
> http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-using-Accumulators-on-cluster-td17261.html
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-tp17263.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread Rishi Yadav
Hi Tridib,

I changed SQLContext to HiveContext and it started working. These are the
steps I used:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val person = sqlContext.jsonFile("json/person.json")
person.printSchema()
person.registerTempTable("person")
val address = sqlContext.jsonFile("json/address.json")
address.printSchema()
address.registerTempTable("address")
sqlContext.cacheTable("person")
sqlContext.cacheTable("address")
val rs2 = sqlContext.sql("select p.id,p.name,a.city from person p join
address a on (p.id = a.id)").collect.foreach(println)


Rishi@InfoObjects

*Pure-play Big Data Consulting*


On Tue, Oct 21, 2014 at 5:47 AM, tridib  wrote:

> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val personPath = "/hdd/spark/person.json"
> val person = sqlContext.jsonFile(personPath)
> person.printSchema()
> person.registerTempTable("person")
> val addressPath = "/hdd/spark/address.json"
> val address = sqlContext.jsonFile(addressPath)
> address.printSchema()
> address.registerTempTable("address")
> sqlContext.cacheTable("person")
> sqlContext.cacheTable("address")
> val rs2 = sqlContext.sql("SELECT p.id, p.name, a.city FROM person p,
> address
> a where p.id = a.id limit 10").collect.foreach(println)
>
> person.json
> {"id:"1","name":"Mr. X"}
>
> address.json
> {"city:"Earth","id":"1"}
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16914.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to write a RDD into One Local Existing File?

2014-10-19 Thread Rishi Yadav
Write to hdfs and then get one file locally bu using "hdfs dfs -getmerge..."

On Friday, October 17, 2014, Sean Owen  wrote:

> You can save to a local file. What are you trying and what doesn't work?
>
> You can output one file by repartitioning to 1 partition but this is
> probably not a good idea as you are bottlenecking the output and some
> upstream computation by disabling parallelism.
>
> How about just combining the files on HDFS afterwards? or just reading
> all the files instead of 1? You can hdfs dfs -cat a bunch of files at
> once.
>
> On Fri, Oct 17, 2014 at 6:46 PM, Parthus  > wrote:
> > Hi,
> >
> > I have a spark mapreduce task which requires me to write the final rdd
> to an
> > existing local file (appending to this file). I tried two ways but
> neither
> > works well:
> >
> > 1. use saveAsTextFile() api. Spark 1.1.0 claims that this API can write
> to
> > local, but I never make it work. Moreover, the result is not one file
> but a
> > series of part-x files which is not what I hope to get.
> >
> > 2. collect the rdd to an array and write it to the driver node using
> Java's
> > File IO. There are also two problems: 1) my RDD is huge(1TB), which
> cannot
> > fit into the memory of one driver node. I have to split the task into
> small
> > pieces and collect them part by part and write; 2) During the writing by
> > Java IO, the Spark Mapreduce task has to wait, which is not efficient.
> >
> > Could anybody provide me an efficient way to solve this problem? I wish
> that
> > the solution could be like: appending a huge rdd to a local file without
> > pausing the MapReduce during writing?
> >
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-tp16720.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> > For additional commands, e-mail: user-h...@spark.apache.org
> 
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> For additional commands, e-mail: user-h...@spark.apache.org 
>
>

-- 
- Rishi


Re: Spark Streaming Twitter Example Error

2014-08-21 Thread Rishi Yadav
Please add the following three libraries to your classpath:

spark-streaming-twitter_2.10-1.0.0.jar
twitter4j-core-3.0.3.jar
twitter4j-stream-3.0.3.jar


On Thu, Aug 21, 2014 at 1:09 PM, danilopds  wrote:

> Hi!
>
> I'm beginning with the development in Spark Streaming.. And I'm learning
> with the examples available in the spark directory. There are several
> applications and I want to make modifications.
>
> I can execute the TwitterPopularTags normally with command:
> ./bin/run-example TwitterPopularTags 
>
> So,
> I moved the source code to a separate folder with the structure:
> ./src/main/scala/
>
> With the files:
> -TwitterPopularTags
> -TwitterUtils
> -StreamingExamples
> -TwitterInputDStream
>
> But when I run the command:
> ./bin/spark-submit --class "TwitterPopularTags" --master local[4]
> //TwitterTest/target/scala-2.10/simple-project_2.10-1.0.jar 
>
> I receive the following error:
> Exception in thread "main" java.lang.NoClassDefFoundError:
> twitter4j/auth/Authorization
> at TwitterUtils$.createStream(TwitterUtils.scala:42)
> at TwitterPopularTags$.main(TwitterPopularTags.scala:65)
> at TwitterPopularTags.main(TwitterPopularTags.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: twitter4j.auth.Authorization
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 10 more
>
> This is my sbt build file:
> name := "Simple Project"
>
> version := "1.0"
>
> scalaVersion := "2.10.4"
>
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.2"
>
> libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.2"
>
> libraryDependencies += "org.twitter4j" % "twitter4j-core" % "3.0.3"
>
> libraryDependencies += "org.twitter4j" % "twitter4j-stream" % "3.0.3"
>
> resolvers += "Akka Repository" at "http://repo.akka.io/releases/";
>
> Can anybody help me?
> Thanks a lot!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Twitter-Example-Error-tp12600.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>