Re: spark streaming rate limiting from kafka

2014-07-19 Thread Bill Jay
Hi Tobias,

It seems that repartition can create more executors for the stages that
follow data receiving. However, the number of executors is still far less
than what I require (I specify one core for each executor). Based on the
indices of the executors in the stage, I find many numbers are missing in
between. For example, if I repartition(100), the executor indices may be
1, 3, 5, 10, etc. In the end, there may be only 45 executors although I
requested 100 partitions.
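
For reference, here is a hedged sketch (not taken from this thread; the topic,
group, ZooKeeper address and counts are placeholders) of the usual way to add
receive parallelism with the Spark 1.0.x Kafka receiver: create several
receivers, union them, and repartition before the heavy stages.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaParallelismSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-parallelism"), Seconds(10))

    // Several receivers instead of one; each receiver occupies one core.
    val numReceivers = 4
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))
    }

    // One DStream backed by all receivers, then spread the records over more
    // partitions for the processing stages that follow.
    val unified = ssc.union(streams).repartition(100)
    unified.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}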

Bill


On Thu, Jul 17, 2014 at 6:15 PM, Tobias Pfeiffer  wrote:

> Bill,
>
> are you saying, after repartition(400), you have 400 partitions on one
> host and the other hosts receive nothing of the data?
>
> Tobias
>
>
> On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay 
> wrote:
>
>> I also have an issue consuming from Kafka. When I consume from Kafka,
>> there is always a single executor working on this job. Even if I use
>> repartition, it seems that there is still a single executor. Does anyone
>> have an idea how to add parallelism to this job?
>>
>>
>>
>> On Thu, Jul 17, 2014 at 2:06 PM, Chen Song 
>> wrote:
>>
>>> Thanks Luis and Tobias.
>>>
>>>
>>> On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer 
>>> wrote:
>>>
 Hi,

 On Wed, Jul 2, 2014 at 1:57 AM, Chen Song 
 wrote:
>
> * Is there a way to control how far Kafka Dstream can read on
> topic-partition (via offset for example). By setting this to a small
> number, it will force DStream to read less data initially.
>

 Please see the post at

 http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
 Kafka's auto.offset.reset parameter may be what you are looking for.

 Tobias


>>>
>>>
>>> --
>>> Chen Song
>>>
>>>
>>
>
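
For reference, a hedged sketch of passing Kafka consumer properties such as
auto.offset.reset to the Spark 1.0.x Kafka receiver (ZooKeeper address, group
and topic names are placeholders; "smallest" and "largest" are the standard
Kafka 0.8 high-level consumer values):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaOffsetResetSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-offset-reset"), Seconds(10))

    // "smallest" replays from the beginning of the topic,
    // "largest" reads only messages that arrive after the consumer starts.
    val kafkaParams = Map(
      "zookeeper.connect" -> "zk-host:2181",
      "group.id"          -> "my-group",
      "auto.offset.reset" -> "smallest")

    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)

    stream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()
  }
}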


Re: spark1.0.1 & hadoop2.2.0 issue

2014-07-19 Thread Debasish Das
I compiled spark 1.0.1 with 2.3.0-cdh5.0.2 today...

No issues with mvn compilation, but my sbt build keeps failing on the sql
module...

I just saw that my scala is at 2.11.0 (with brew update)... not sure if
that's why the sbt compilation is failing... retrying...




On Sat, Jul 19, 2014 at 6:16 PM, Hu, Leo  wrote:

>  Hi all
>
>   Has anyone encountered the problem below, and how can it be solved? Any
> help would be appreciated.
>
>
>
> Caused by: java.lang.UnsatisfiedLinkError:
> org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V
>
> at
> org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native
> Method)
>
> at
> org.apache.hadoop.security.JniBasedUnixGroupsMapping.(JniBasedUnixGroupsMapping.java:49)
>
> at
> org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.(JniBasedUnixGroupsMappingWithFallback.java:38)
>
>
>
>
>
> Best Regards
>
> Leo Hu
>


Re: Out of any idea

2014-07-19 Thread Krishna Sankar
You probably have already, but if not, try a very simple app in the Docker
container and make sure it works. Sometimes resource contention/allocation
can get in the way; this happened to me in a YARN container.
Also try a single worker thread.
Cheers



On Sat, Jul 19, 2014 at 2:39 PM, boci  wrote:

> Hi guys!
>
> I have run out of ideas... I created a Spark Streaming job (kafka -> spark ->
> ES).
> If I start my app on my local machine (inside the editor, but connected to the
> real Kafka and ES), the application works correctly.
> If I start it in my Docker container (same Kafka and ES, local mode
> (local[4]) just like inside my editor), the application connects to Kafka and
> receives the messages, but after that nothing happens (I set
> config/log4j.properties to debug mode and I can see BlockGenerator receiving
> the data, but nothing happens with it after that).
> (As a first step I simply run a map to print the received data with log4j.)
>
> I hope somebody can help... :(
>
> b0c1
>
> --
> Skype: boci13, Hangout: boci.b...@gmail.com
>


Re: Out of any idea

2014-07-19 Thread Tathagata Das
Could you collect debug-level logs and send them to us? Without logs it's
hard to speculate. :)

TD


On Sat, Jul 19, 2014 at 2:39 PM, boci  wrote:

> Hi guys!
>
> I have run out of ideas... I created a Spark Streaming job (kafka -> spark ->
> ES).
> If I start my app on my local machine (inside the editor, but connected to the
> real Kafka and ES), the application works correctly.
> If I start it in my Docker container (same Kafka and ES, local mode
> (local[4]) just like inside my editor), the application connects to Kafka and
> receives the messages, but after that nothing happens (I set
> config/log4j.properties to debug mode and I can see BlockGenerator receiving
> the data, but nothing happens with it after that).
> (As a first step I simply run a map to print the received data with log4j.)
>
> I hope somebody can help... :(
>
> b0c1
>
> --
> Skype: boci13, Hangout: boci.b...@gmail.com
>


Re: SparkSQL operator priority

2014-07-19 Thread Christos Kozanitis
Thanks Eric. That is the case as most of my fields are optional. So it
seems that the problem comes from Parquet.
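
For anyone hitting the same thing, a hedged sketch of the projection workaround
discussed in this thread (the path, table and column names follow the example
in the quoted mail; the API is the Spark 1.0.x SQLContext one):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetPruningSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-pruning"))
    val sqlContext = new SQLContext(sc)
    import sqlContext._

    // Register the Parquet data as a table (path is a placeholder).
    parquetFile("hdfs:///genomes/hg00096.plup").registerAsTable("MyTable")

    // Projecting only the filtered column reads far less data than "select *",
    // matching the 2-3 minutes vs ~65 minutes difference reported below.
    sql("SELECT position FROM MyTable WHERE position = 243189160")
      .collect().foreach(println)
  }
}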


On Sat, Jul 19, 2014 at 8:27 AM, Eric Friedman 
wrote:

> Can position be null?  Looks like there may be constraints with predicate
> push down in that case. https://github.com/apache/spark/pull/511/
>
> On Jul 18, 2014, at 8:04 PM, Christos Kozanitis 
> wrote:
>
> Hello
>
> What is the order with which SparkSQL deserializes parquet fields? Is it
> possible to modify it?
>
> I am using SparkSQL to query a parquet file that consists of a lot of
> fields (around 30 or so). Let me call an example table MyTable and let's
> suppose the name of one of its fields is "position".
>
> The query that I am executing is:
> sql("select * from MyTable where position = 243189160")
>
> The query plan that I get from this query is:
> Filter (position#6L:6 = 243189160)
>  ParquetTableScan
> [contig.contigName#0,contig.contigLength#1L,contig.contigMD5#2,contig.referenceURL#3,contig.assembly#4,contig.species#5,position#6L,rangeOffset#7,rangeLength#8,referenceBase#9,readBase#10,sangerQuality#11,mapQuality#12,numSoftClipped#13,numReverseStrand#14,countAtPosition#15,readName#16,readStart#17L,readEnd#18L,recordGroupSequencingCenter#19,recordGroupDescription#20,recordGroupRunDateEpoch#21L,recordGroupFlowOrder#22,recordGroupKeySequence#23,recordGroupLibrary#24,recordGroupPredictedMedianInsertSize#25,recordGroupPlatform#26,recordGroupPlatformUnit#27,recordGroupSample#28],
> (ParquetRelation hdfs://
> ec2-54-89-87-167.compute-1.amazonaws.com:9000/genomes/hg00096.plup), None
>
> I expect 14 entries in the output but the execution of
> .collect.foreach(println) takes forever to run on my cluster (more than an
> hour).
>
> Is it safe to assume in my example that SparkSQL deserializes all fields
> first before applying the filter? If so, can a user change this behavior?
>
> To support my assumption I replaced "*" with "position", so my new query
> is of the form sql("select position from MyTable where position =
> 243189160") and this query runs much faster on the same hardware (2-3
> minutes vs 65 min).
>
> Any ideas?
>
> thanks
> Christos
>
>


spark1.0.1 & hadoop2.2.0 issue

2014-07-19 Thread Hu, Leo
Hi all
  Has anyone encountered the problem below, and how can it be solved? Any help
would be appreciated.

Caused by: java.lang.UnsatisfiedLinkError: 
org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V
at 
org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native Method)
at 
org.apache.hadoop.security.JniBasedUnixGroupsMapping.(JniBasedUnixGroupsMapping.java:49)
at 
org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.(JniBasedUnixGroupsMappingWithFallback.java:38)


Best Regards
Leo Hu


Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-19 Thread Yin Huai
Can you attach your code?

Thanks,

Yin


On Sat, Jul 19, 2014 at 4:10 PM, chutium  wrote:

> 160G parquet files (ca. 30 files, snappy compressed, made by cloudera
> impala)
>
> ca. 30 full table scan, took 3-5 columns out, then some normal scala
> operations like substring, groupby, filter, at the end, save as file in
> HDFS
>
> yarn-client mode, 23 core and 60G mem / node
>
> but, always failed !
>
> startup script (3 NodeManager, each an executor):
>
>
>
>
> some screenshot:
>
>
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n10254/spark1.png
> >
>
>
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n10254/spark2.png
> >
>
>
>
> i got some log like:
>
>
>
>
> same job using standalone mode (3 slaves) works...
>
> startup script (each 24 cores, 64g mem) :
>
>
>
>
> any idea?
>
> thanks a lot!
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-SQL-on-160-G-parquet-file-snappy-compressed-made-by-cloudera-impala-23-core-and-60G-mem-d-tp10254.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Large Task Size?

2014-07-19 Thread Kyle Ellrott
I'm still having trouble with this one.
Watching it, I've noticed that the first time around, the task size is
large, but not terrible (199KB). It's on the second iteration of the
optimization that the task size goes crazy (120MB).

Does anybody have any idea why this might be happening? Is there any way
that I can view the data being encoded in the task description, so that I
might be able to get an idea of why it is blowing up?

The line in question can be found at:
https://github.com/kellrott/spark/blob/mllib-grouped/mllib/src/main/scala/org/apache/spark/mllib/grouped/GroupedGradientDescent.scala#L157

From the logs:
14/07/19 17:29:35 WARN scheduler.TaskSetManager: Stage 858 contains a task
of very large size (199 KB). The maximum recommended task size is 100 KB.
14/07/19 17:30:22 WARN scheduler.TaskSetManager: Stage 857 contains a task
of very large size (185 KB). The maximum recommended task size is 100 KB.
14/07/19 17:30:26 WARN scheduler.TaskSetManager: Stage 1029 contains a task
of very large size (185 KB). The maximum recommended task size is 100 KB.
14/07/19 17:30:57 WARN scheduler.TaskSetManager: Stage 1202 contains a task
of very large size (123414 KB). The maximum recommended task size is 100 KB.

From the web server (connecting the stage number to the line number):
Stage Id    Description
858         sample at GroupedGradientDescent.scala:157
857         collect at GroupedGradientDescent.scala:183
1029        collect at GroupedGradientDescent.scala:194
1202        sample at GroupedGradientDescent.scala:157

Kyle
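
For context, a generic hedged sketch of the pattern under discussion:
broadcast large read-only state and reference only the broadcast handle inside
the closure, so the task serializer does not ship the data with every task
(all names below are illustrative, not taken from GroupedGradientDescent).

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object SmallClosureSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("small-closures"))

    // Large read-only state: shipped once per executor via the broadcast,
    // instead of once per task inside the closure.
    val bigModel: Array[Double] = Array.fill(1000000)(1.0)
    val bcModel = sc.broadcast(bigModel)

    val data = sc.parallelize(1 to 1000, 100)

    // Only the broadcast handle is captured, so the serialized task stays small.
    val scored = data.map(x => bcModel.value(x % bcModel.value.length) * x)
    println(scored.sum())

    sc.stop()
  }
}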



On Tue, Jul 15, 2014 at 2:45 PM, Kyle Ellrott  wrote:

> Yes, this is a proposed patch to MLlib so that you can use one RDD to train
> multiple models at the same time. I am hoping that multiplexing several
> models in the same RDD will be more efficient than trying to get the Spark
> scheduler to manage a few hundred tasks simultaneously.
>
> I don't think I see stochasticLossHistory being included in the closure
> (please correct me if I'm wrong). It's used once on line 183 to capture the
> loss sums (a local operation on the results of a 'collect' call), and again
> on line 198 to update weightSet, but that's after the loop completes, and
> the memory blowup definitely happens before then.
>
> Kyle
>
>
>
> On Tue, Jul 15, 2014 at 12:00 PM, Aaron Davidson 
> wrote:
>
>> Ah, I didn't realize this was non-MLLib code. Do you mean to be sending 
>> stochasticLossHistory
>> in the closure as well?
>>
>>
>> On Sun, Jul 13, 2014 at 1:05 AM, Kyle Ellrott 
>> wrote:
>>
>>> It uses the standard SquaredL2Updater, and I also tried to broadcast it
>>> as well.
>>>
>>> The input is an RDD created by taking the union of several inputs, which
>>> have all been run against MLUtils.kFold to produce even more RDDs (if I run
>>> with 10 different inputs, each produces 10 kFolds). I'm pretty certain that all
>>> of the input RDDs have clean closures. But I'm curious: is there a high
>>> overhead for running union? Could that create larger task sizes?
>>>
>>> Kyle
>>>
>>>
>>>
>>> On Sat, Jul 12, 2014 at 7:50 PM, Aaron Davidson 
>>> wrote:
>>>
 I also did a quick glance through the code and couldn't find anything
 worrying that should be included in the task closures. The only possibly
 unsanitary part is the Updater you pass in -- what is your Updater and is
 it possible it's dragging in a significant amount of extra state?


 On Sat, Jul 12, 2014 at 7:27 PM, Kyle Ellrott 
 wrote:

> I'm working of a patch to MLLib that allows for multiplexing several
> different model optimization using the same RDD ( SPARK-2372:
> https://issues.apache.org/jira/browse/SPARK-2372 )
>
> In testing larger datasets, I've started to see some memory errors (
> java.lang.OutOfMemoryError and "exceeds max allowed: spark.akka.frameSize"
> errors ).
> My main clue is that Spark will start logging warning on smaller
> systems like:
>
> 14/07/12 19:14:46 WARN scheduler.TaskSetManager: Stage 2862 contains a
> task of very large size (10119 KB). The maximum recommended task size is
> 100 KB.
>
> Looking up stage '2862' leads to a 'sample at
> GroupedGradientDescent.scala:156' call. That code can be seen at
>
> https://github.com/kellrott/spark/blob/mllib-grouped/mllib/src/main/scala/org/apache/spark/mllib/grouped/GroupedGradientDescent.scala#L156
>
> I've looked over the code; I'm broadcasting the larger variables, and
> between the sampler and the combineByKey, I wouldn't think there is much data
> being moved over the network, much less a 10MB chunk.
>
> Any ideas of what this might be a symptom of?
>
> Kyle
>
>

>>>
>>
>


Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-19 Thread chutium
160G of parquet files (ca. 30 files, snappy compressed, made by Cloudera Impala)

ca. 30 full table scans, taking 3-5 columns out, then some normal Scala
operations like substring, groupBy, filter; at the end, the result is saved as a file in HDFS

yarn-client mode, 23 cores and 60G mem / node

but it always fails!

startup script (3 NodeManager, each an executor):




some screenshots:

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n10254/spark1.png>

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n10254/spark2.png>



I got some log output like:




same job using standalone mode (3 slaves) works...

startup script (each 24 cores, 64g mem) :




any idea?

thanks a lot!





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-SQL-on-160-G-parquet-file-snappy-compressed-made-by-cloudera-impala-23-core-and-60G-mem-d-tp10254.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Debugging spark

2014-07-19 Thread Ruchir Jha
I am a newbie looking for pointers on how to start debugging my Spark app, and
I did not find a straightforward tutorial. Any help is appreciated.

Sent from my iPhone

Out of any idea

2014-07-19 Thread boci
Hi guys!

I have run out of ideas... I created a Spark Streaming job (kafka -> spark ->
ES).
If I start my app on my local machine (inside the editor, but connected to the
real Kafka and ES), the application works correctly.
If I start it in my Docker container (same Kafka and ES, local mode
(local[4]) just like inside my editor), the application connects to Kafka and
receives the messages, but after that nothing happens (I set
config/log4j.properties to debug mode and I can see BlockGenerator receiving
the data, but nothing happens with it after that).
(As a first step I simply run a map to print the received data with log4j.)

I hope somebody can help... :(

b0c1
--
Skype: boci13, Hangout: boci.b...@gmail.com


java.net.ConnectException: Connection timed out

2014-07-19 Thread Soren Macbeth
Hello,

I get a lot of these exceptions on my mesos cluster when running spark jobs:

14/07/19 16:29:43 WARN spark.network.SendingConnection: Error finishing
connection to prd-atl-mesos-slave-010/10.88.160.200:37586
java.net.ConnectException: Connection timed out
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at
org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:318)
at
org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/07/19 16:29:43 INFO spark.network.ConnectionManager: Handling connection
error on connection to ConnectionManagerId(prd-atl-mesos-slave-010,37586)
14/07/19 16:29:43 INFO spark.network.ConnectionManager: Removing
SendingConnection to ConnectionManagerId(prd-atl-mesos-slave-010,37586)
14/07/19 16:29:43 INFO spark.network.ConnectionManager: Notifying
org.apache.spark.network.ConnectionManager$MessageStatus@4b0472b4
14/07/19 16:29:43 INFO spark.network.ConnectionManager: Notifying
org.apache.spark.network.ConnectionManager$MessageStatus@1106ade6
14/07/19 16:29:43 ERROR
spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator: Could not get
block(s) from ConnectionManagerId(prd-atl-mesos-slave-010,37586)
14/07/19 16:29:43 ERROR
spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator: Could not get
block(s) from ConnectionManagerId(prd-atl-mesos-slave-010,37586)
14/07/19 16:29:43 WARN spark.network.SendingConnection: Error finishing
connection to prd-atl-mesos-slave-004/10.88.160.156:35446
java.net.ConnectException: Connection timed out
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at
org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:318)
at
org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/07/19 16:29:43 INFO spark.network.ConnectionManager: Handling connection
error on connection to ConnectionManagerId(prd-atl-mesos-slave-004,35446)
14/07/19 16:29:43 INFO spark.network.ConnectionManager: Removing
SendingConnection to ConnectionManagerId(prd-atl-mesos-slave-004,35446)

I've tried bumping up the spark.akka.timeout, but it doesn't seem to have
much of an effect.

Has anyone else seen these? Is there a spark configuration option that I
should tune? Or perhaps some JVM properties that I should be setting on my
executors?

TIA
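
For reference, a minimal hedged sketch of how such settings are applied in this
Spark version, through SparkConf before the SparkContext is created (the value
shown is a placeholder, not a recommendation):

import org.apache.spark.{SparkConf, SparkContext}

object TimeoutConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("mesos-job")
      .set("spark.akka.timeout", "300")   // seconds; the property mentioned above

    val sc = new SparkContext(conf)
    // ... job code ...
    sc.stop()
  }
}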


Re: Java null pointer exception while saving hadoop file

2014-07-19 Thread durga
Thanks for the reply.

I am trying to save a huge file; in my case it is 60GB. I think l.toSeq is
going to collect all the data onto the driver, where I don't have that much
space. Is there any possibility of using something like a MultipleOutputFormat
class for a large file?

Thanks,
Durga.
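
For reference, a hedged sketch of that kind of approach: the old-API
MultipleTextOutputFormat used through saveAsHadoopFile writes each key to its
own directory directly from the executors, so nothing is collected to the
driver (the paths and the choice of key are assumptions):

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// Routes each record to a file under a directory named after its key
// (old org.apache.hadoop.mapred API).
class KeyBasedOutputFormat extends MultipleTextOutputFormat[String, String] {
  override def generateFileNameForKeyValue(key: String, value: String, name: String): String =
    key + "/" + name
}

object MultipleOutputSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("multiple-output"))

    // Illustrative key: the first two characters of each line.
    val pairs = sc.textFile("hdfs:///input/huge-file").map(line => (line.take(2), line))

    // Written directly from the executors; nothing is pulled back to the driver.
    pairs.saveAsHadoopFile("hdfs:///output/by-key",
      classOf[String], classOf[String], classOf[KeyBasedOutputFormat])

    sc.stop()
  }
}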



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Java-null-pointer-exception-while-saving-hadoop-file-tp10220p10249.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Caching issue with msg: RDD block could not be dropped from memory as it does not exist

2014-07-19 Thread rindra
Hi,

I am working with a small dataset (about 13 MB) in the spark-shell. After
doing a groupBy on the RDD, I wanted to cache the RDD in memory, but I keep
getting these warnings:

scala> rdd.cache()
res28: rdd.type = MappedRDD[63] at repartition at :28


scala> rdd.count()
14/07/19 12:45:18 WARN BlockManager: Block rdd_63_82 could not be dropped
from memory as it does not exist
14/07/19 12:45:18 WARN BlockManager: Putting block rdd_63_82 failed
14/07/19 12:45:18 WARN BlockManager: Block rdd_63_40 could not be dropped
from memory as it does not exist
14/07/19 12:45:18 WARN BlockManager: Putting block rdd_63_40 failed
res29: Long = 5

It seems that I cannot cache the data in memory even though my local
machine has 16 GB of RAM and the data is only 13 MB split across 100
partitions.

How can I prevent this caching issue from happening? Thanks.

Rindra



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Caching-issue-with-msg-RDD-block-could-not-be-dropped-from-memory-as-it-does-not-exist-tp10248.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Server IPC version 7 cannot communicate with client version 4 with Spark Streaming 1.0.0 in Java and CH4 quickstart in local mode

2014-07-19 Thread Juan Rodríguez Hortalá
Hi Sean,

I was launching the Spark Streaming program from Eclipse, but now I'm
running it with the spark-submit script from the Spark distribution for
CDH4 at http://spark.apache.org/downloads.html, and it works just fine.

Thanks a lot for your help,

Greetings,

Juan



2014-07-16 12:58 GMT+02:00 Sean Owen :

> "Server IPC version 7 cannot communicate with client version 4" means
> your client is Hadoop 1.x and your cluster is Hadoop 2.x. The default
> Spark distribution is built for Hadoop 1.x. You would have to make
> your own build (or, use the artifacts distributed for CDH4.6 maybe?
> they are certainly built vs Hadoop 2)
>
> On Wed, Jul 16, 2014 at 10:32 AM, Juan Rodríguez Hortalá
>  wrote:
> > Hi,
> >
> > I'm running a Java program using Spark Streaming 1.0.0 on Cloudera 4.4.0
> > quickstart virtual machine, with hadoop-client 2.0.0-mr1-cdh4.4.0, which
> is
> > the one corresponding to my Hadoop distribution, and that works with
> other
> > mapreduce programs, and with the maven property
> > 2.0.0-mr1-cdh4.4.0 configured according
> to
> >
> http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html.
> > When I set
> >
> >
> jssc.checkpoint("hdfs://localhost:8020/user/cloudera/bicing/streaming_checkpoints");
> >
> >
> > I get a "Server IPC version 7 cannot communicate with client version 4"
> > running the program in local mode using "local[4]" as master. I have seen
> > this problem before in other forums like
> >
> http://qnalist.com/questions/4957822/hdfs-server-client-ipc-version-mismatch-while-trying-to-access-hdfs-files-using-spark-0-9-1
> > or http://comments.gmane.org/gmane.comp.lang.scala.spark.user/106 but
> the
> > solution is basically setting the property I have already set. I have
> tried
> > also with 2.0.0-cdh4.4.0 and
> > 2.0 with no luck.
> >
> > Could someone help me with this?
> >
> > Thanks a lot in advance
> >
> > Greetings,
> >
> > Juan
>


Real-time segmentation with SPARK

2014-07-19 Thread Mahesh Govind
HI  Experts,

Could you please help me get some insights on doing real-time segmentation
(segmentation on demand) using Spark?

My use case is like this:
1) I am running a campaign
2) Customers are subscribing to the campaign
3) The campaign runs for 2-3 hours
4) Estimated target customers: ~10 million
5) After an hour I need to know the details of the segments of customers who
participated in the campaign

Has anyone done similar work? Could you please give me some pointers?

Regards
Mahesh


 


Re: registerAsTable can't be compiled

2014-07-19 Thread Michael Armbrust
Can you provide the code? Is Record a case class, and is it defined as a
top-level object? Also, have you done "import sqlContext._"?
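
For reference, a minimal hedged sketch of that pattern (names are placeholders):
the case class defined at the top level, plus "import sqlContext._" to bring
the RDD-to-SchemaRDD conversion into scope:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// The case class must be a top-level definition, not nested inside a method.
case class Record(key: Int, value: String)

object RegisterAsTableSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("registerAsTable-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext._   // brings the implicit RDD[Record] -> SchemaRDD conversion into scope

    val rdd = sc.parallelize(1 to 10).map(i => Record(i, "value_" + i))
    rdd.registerAsTable("records")   // compiles once the implicit conversion is visible
    sql("SELECT key, value FROM records WHERE key > 5").collect().foreach(println)
  }
}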


On Sat, Jul 19, 2014 at 3:39 AM, junius  wrote:

> Hello,
> I am writing code to practice Spark SQL based on the latest Spark version,
> but I get the following compilation error; it seems the implicit conversion
> from RDD to SchemaRDD doesn't work. It would be great if anybody could help
> me fix it. Thanks a lot.
>
> value registerAsTable is not a member of
> org.apache.spark.rdd.RDD[org.apache.spark.examples.mySparkExamples.Record]
>
> Junius Zhou
> b.r
>


Re: Java null pointer exception while saving hadoop file

2014-07-19 Thread Madhura
Hi,
You can try setting the heap space to a higher value.
Are you using an Ubuntu machine?
In your .bashrc, set the following option:
export _JAVA_OPTIONS=-Xmx2g
This should set your heap size to a higher value.
Regards,
Madhura



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Java-null-pointer-exception-while-saving-hadoop-file-tp10220p10244.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Need help with coalesce

2014-07-19 Thread Madhura
Hi,

I have a file called out with random numbers, where each number is on its own
line in the file. I am loading the complete file into an RDD and I want to
create partitions with the help of the coalesce function.
This is my code snippet.
import scala.math.Ordered
import org.apache.spark.rdd.CoalescedRDD
import org.apache.spark.api.java.JavaPairRDD
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.api.java.function.Function
import org.apache.spark.api.java.function.PairFunction

val dRDD = sc.textFile("hdfs://master:54310/out",10)

  val keyval=dRDD.coalesce(100,true).mapPartitionsWithIndex{(ind,iter)
=> iter.map(x => process(ind,x.trim().split(' ').map(_.toDouble),q,m,r))}

However, I am getting the error below. I looked at various other links, but
I always get this error.
Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.spark.rdd.RDD.coalesce(IZ)Lorg/apache/spark/rdd/RDD;
at SimpleApp$.main(SimpleApp.scala:432)
at SimpleApp.main(SimpleApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

As a follow-up, is there any way I can access the elements of an RDD by
index and create partitions? For example, if I have an RDD with values like
1,2,3,...,100, I would like to create overlapping partitions which look like
this (see the sketch after this message):
part1: 1,2,3,...,10
part2: 8,9,10,...,20
part3: 18,19,20,...,30 and so on...

Thanks and regards,
Madhura
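
On the follow-up question, a hedged sketch of one way to build overlapping
index-based groups (the window size and overlap are assumptions; this groups
values logically by key rather than relying on physical partition boundaries):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object OverlappingWindowsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("overlapping-windows"))

    val numbers = sc.textFile("hdfs://master:54310/out").map(_.trim.toDouble)

    val windowSize = 10L
    val overlap    = 2L

    // Tag every value with its global position, then assign it to its own
    // window and, near a window boundary, also to the next window.
    val windowed = numbers.zipWithIndex().flatMap { case (value, idx) =>
      val primary = idx / windowSize
      val alsoNext = if (idx % windowSize >= windowSize - overlap) Seq(primary + 1) else Seq.empty[Long]
      (primary +: alsoNext).map(w => (w, (idx, value)))
    }

    // Each key now holds one (overlapping) window of values, sorted by index.
    val perWindow = windowed.groupByKey(100).mapValues(_.toSeq.sortBy(_._1).map(_._2))
    perWindow.take(3).foreach { case (w, vals) => println("window " + w + ": " + vals.mkString(",")) }

    sc.stop()
  }
}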




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-with-coalesce-tp10243.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: SparkSQL operator priority

2014-07-19 Thread Eric Friedman
Can position be null?  Looks like there may be constraints with predicate push 
down in that case. https://github.com/apache/spark/pull/511/

> On Jul 18, 2014, at 8:04 PM, Christos Kozanitis  
> wrote:
> 
> Hello
> 
> What is the order with which SparkSQL deserializes parquet fields? Is it 
> possible to modify it?
> 
> I am using SparkSQL to query a parquet file that consists of a lot of fields 
> (around 30 or so). Let me call an example table MyTable and let's suppose the 
> name of one of its fields is "position".
> 
> The query that I am executing is: 
> sql("select * from MyTable where position = 243189160")
> 
> The query plan that I get from this query is:
> Filter (position#6L:6 = 243189160)
>  ParquetTableScan 
> [contig.contigName#0,contig.contigLength#1L,contig.contigMD5#2,contig.referenceURL#3,contig.assembly#4,contig.species#5,position#6L,rangeOffset#7,rangeLength#8,referenceBase#9,readBase#10,sangerQuality#11,mapQuality#12,numSoftClipped#13,numReverseStrand#14,countAtPosition#15,readName#16,readStart#17L,readEnd#18L,recordGroupSequencingCenter#19,recordGroupDescription#20,recordGroupRunDateEpoch#21L,recordGroupFlowOrder#22,recordGroupKeySequence#23,recordGroupLibrary#24,recordGroupPredictedMedianInsertSize#25,recordGroupPlatform#26,recordGroupPlatformUnit#27,recordGroupSample#28],
>  (ParquetRelation 
> hdfs://ec2-54-89-87-167.compute-1.amazonaws.com:9000/genomes/hg00096.plup), 
> None
> 
> I expect 14 entries in the output but the execution of 
> .collect.foreach(println) takes forever to run on my cluster (more than an 
> hour). 
> 
> Is it safe to assume in my example that SparkSQL deserializes all fields 
> first before applying the filter? If so, can a user change this behavior?
> 
> To support my assumption I replaced "*" with "position", so my new query is 
> of the form sql("select position from MyTable where position = 243189160") 
> and this query runs much faster on the same hardware (2-3 minutes vs 65 min).
> 
> Any ideas?
> 
> thanks
> Christos


Re: Hive From Spark

2014-07-19 Thread Silvio Fiorito
Please ensure your hive-site.xml is pointing to a HiveServer2 endpoint vs 
HiveServer1

From: JiajiaJing
Sent: ?Thursday?, ?July? ?17?, ?2014 ?8?:?48? ?PM
To: u...@spark.incubator.apache.org

Hello Spark Users,

I am new to Spark SQL and now trying to first get the HiveFromSpark example
working.
However, I got the following error when running HiveFromSpark.scala program.
May I get some help on this please?

ERROR MESSAGE:

org.apache.thrift.TApplicationException: Invalid method name: 'get_table'
 at
org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
 at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
 at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:936)
 at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:922)
 at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
 at $Proxy9.getTable(Unknown Source)
 at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
 at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:905)
 at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:8999)
 at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:8313)
 at
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
 at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:186)
 at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:160)
 at
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:250)
 at
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:247)
 at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:85)
 at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:90)
 at HiveFromSpark$.main(HiveFromSpark.scala:38)
 at HiveFromSpark.main(HiveFromSpark.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



Thank you very much!

JJing



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Hive-From-Spark-tp10110.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Uber jar with SBT

2014-07-19 Thread boci
Hi!

I am using Java 7. I found the problem: I was not calling start and
awaitTermination on the streaming context. Now it works, BUT
spark-submit never returns (it runs in the foreground and receives the Kafka
streams)... what am I missing?
(I want to send the job to a standalone cluster worker process.)

b0c1

--
Skype: boci13, Hangout: boci.b...@gmail.com


On Sat, Jul 19, 2014 at 3:32 PM, Sean Owen  wrote:

> Are you building / running with Java 6? I imagine your .jar files has
> more than 65536 files, and Java 6 has various issues with jars this
> large. If possible, use Java 7 everywhere.
>
> https://issues.apache.org/jira/browse/SPARK-1520
>
> On Sat, Jul 19, 2014 at 2:30 PM, boci  wrote:
> > Hi Guys,
> >
> > I try to create spark uber jar with sbt but I have a lot of problem... I
> > want to use the following:
> > - Spark streaming
> > - Kafka
> > - Elsaticsearch
> > - HBase
> >
> > the current jar size is cca 60M and it's not working.
> > - When I deploy with spark-submit: It's running and exit without any
> error
> > - When I try to start with local[*]  mode, it's say:
> >  Exception in thread "main" java.lang.NoClassDefFoundError:
> > org/apache/spark/Logging
> > => but I start with java -cp /.../spark-assembly-1.0.1-hadoop2.2.0.jar
> -jar
> > my.jar
> >
> > Any idea how can solve this? (which lib required to set provided wich
> > required for run... later I want to run this jar in yarn cluster)
> >
> > b0c1
> >
> --
> > Skype: boci13, Hangout: boci.b...@gmail.com
>


Re: Uber jar with SBT

2014-07-19 Thread Sean Owen
Are you building / running with Java 6? I imagine your .jar files has
more than 65536 files, and Java 6 has various issues with jars this
large. If possible, use Java 7 everywhere.

https://issues.apache.org/jira/browse/SPARK-1520

On Sat, Jul 19, 2014 at 2:30 PM, boci  wrote:
> Hi Guys,
>
> I try to create spark uber jar with sbt but I have a lot of problem... I
> want to use the following:
> - Spark streaming
> - Kafka
> - Elsaticsearch
> - HBase
>
> the current jar size is cca 60M and it's not working.
> - When I deploy with spark-submit: It's running and exit without any error
> - When I try to start with local[*]  mode, it's say:
>  Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/Logging
> => but I start with java -cp /.../spark-assembly-1.0.1-hadoop2.2.0.jar -jar
> my.jar
>
> Any idea how can solve this? (which lib required to set provided wich
> required for run... later I want to run this jar in yarn cluster)
>
> b0c1
> --
> Skype: boci13, Hangout: boci.b...@gmail.com


Uber jar with SBT

2014-07-19 Thread boci
Hi Guys,

I am trying to create a Spark uber jar with sbt but I am having a lot of
problems... I want to use the following:
- Spark streaming
- Kafka
- Elasticsearch
- HBase

The current jar size is ca. 60M and it's not working.
- When I deploy with spark-submit: it runs and exits without any error
- When I try to start in local[*] mode, it says:
 Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/Logging
=> even though I start with java -cp /.../spark-assembly-1.0.1-hadoop2.2.0.jar -jar
my.jar

Any idea how I can solve this? (Which libs need to be marked provided and
which are needed at runtime... later I want to run this jar on a YARN cluster.)

b0c1
--
Skype: boci13, Hangout: boci.b...@gmail.com
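
For reference, a hedged sbt-assembly sketch of the usual split (plugin version,
library versions and merge rules are assumptions for sbt-assembly 0.11.x):
Spark itself is marked provided because spark-submit already puts the Spark
assembly on the classpath, while connectors such as spark-streaming-kafka and
the Elasticsearch client are bundled into the uber jar.

// project/plugins.sbt (plugin version is an assumption)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

// build.sbt -- a minimal sketch
import AssemblyKeys._

assemblySettings

name := "my-streaming-job"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.0.1" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.0.1" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.0.1",          // bundled: not in the Spark assembly
  "org.elasticsearch" % "elasticsearch-hadoop"  % "2.0.0"           // placeholder version
)

// A blunt merge rule for the sketch: drop duplicate META-INF metadata.
mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
  {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case x                             => old(x)
  }
}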


Re: NullPointerException When Reading Avro Sequence Files

2014-07-19 Thread Sparky
Thanks for the gist.  I'm just now learning about Avro.  I think when you use
a DataFileWriter you are writing to an Avro Container (which is different
than an Avro Sequence File).  I have a system where data was written to an
HDFS Sequence File using  AvroSequenceFile.Writer (which is a wrapper around
sequence file).  

I'll put together an example of the problem so others can better understand
what I'm talking about.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10237.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: NullPointerException When Reading Avro Sequence Files

2014-07-19 Thread Nick Pentreath
I got this working locally a little while ago when playing around with
AvroKeyInputFile: https://gist.github.com/MLnick/5864741781b9340cb211

But not sure about AvroSequenceFile. Any chance you have an example
datafile or records?
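
For reference, a hedged sketch of the Avro container-file route from the gist
above (the newAPIHadoopFile + AvroKeyInputFormat path; it does not cover
AvroSequenceFile, and the input path is a placeholder):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}

object AvroContainerReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avro-container-read"))

    // Reads Avro *container* files (.avro); an AvroSequenceFile needs a
    // different input format.
    val records = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](
      "hdfs:///data/events.avro")

    records.map(_._1.datum().toString).take(5).foreach(println)

    sc.stop()
  }
}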



On Sat, Jul 19, 2014 at 11:00 AM, Sparky  wrote:

> To be more specific, I'm working with a system that stores data in
> org.apache.avro.hadoop.io.AvroSequenceFile format.  An AvroSequenceFile is
> "A wrapper around a Hadoop SequenceFile that also supports reading and
> writing Avro data."
>
> It seems that Spark does not support this out of the box.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10234.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: NullPointerException When Reading Avro Sequence Files

2014-07-19 Thread Sparky
To be more specific, I'm working with a system that stores data in
org.apache.avro.hadoop.io.AvroSequenceFile format.  An AvroSequenceFile is 
"A wrapper around a Hadoop SequenceFile that also supports reading and
writing Avro data."

It seems that Spark does not support this out of the box.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10234.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: NullPointerException When Reading Avro Sequence Files

2014-07-19 Thread Sparky
I see Spark is using AvroRecordReaderBase, which is used to grab Avro
Container Files, which is different from Sequence Files.  If anyone is using
Avro Sequence Files with success and has an example, please let me know.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10233.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


registerAsTable can't be compiled

2014-07-19 Thread junius
Hello,
I am writing code to practice Spark SQL based on the latest Spark version,
but I get the following compilation error; it seems the implicit conversion
from RDD to SchemaRDD doesn't work. It would be great if anybody could help
me fix it. Thanks a lot.

value registerAsTable is not a member of
org.apache.spark.rdd.RDD[org.apache.spark.examples.mySparkExamples.Record]

Junius Zhou
b.r


Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-07-19 Thread lihu
Hi,
Everyone. I have the piece of code below. When I run it, I get the error
shown below; it seems that the SparkContext is not serializable, but I do not
use the SparkContext anywhere except for the broadcast. [In fact, this code is
in MLlib; I just try to broadcast the centerArrays.]

It succeeds in the reduceByKey operation, but fails at the collect operation,
which confuses me.


INFO DAGScheduler: Failed to run collect at KMeans.scala:235
[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task not
serializable: java.io.NotSerializableException:
org.apache.spark.SparkContext
org.apache.spark.SparkException: Job aborted: Task not serializable:
java.io.NotSerializableException: org.apache.spark.SparkContext
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
 at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)




private def initKMeansParallel(data: RDD[Array[Double]]): Array[ClusterCenters] = {

    // I tried to add the transient annotation here, but it doesn't work
    @transient val sc = data.sparkContext

    // Initialize each run's center to a random point
    val seed = new XORShiftRandom().nextInt()
    val sample = data.takeSample(true, runs, seed).toSeq
    val centers = Array.tabulate(runs)(r => ArrayBuffer(sample(r)))

    // On each step, sample 2 * k points on average for each run with probability
    // proportional to their squared distance from that run's current centers
    for (step <- 0 until initializationSteps) {
      val centerArrays = sc.broadcast(centers.map(_.toArray))
      val sumCosts = data.flatMap { point =>
        for (r <- 0 until runs) yield (r, KMeans.pointCost(centerArrays.value(r), point))
      }.reduceByKey(_ + _).collectAsMap()                     // can pass at this point
      val chosen = data.mapPartitionsWithIndex { (index, points) =>
        val rand = new XORShiftRandom(seed ^ (step << 16) ^ index)
        for {
          p <- points
          r <- 0 until runs
          if rand.nextDouble() < KMeans.pointCost(centerArrays.value(r), p) * 2 * k / sumCosts(r)
        } yield (r, p)
      }.collect()                                             // failed at this point
      for ((r, p) <- chosen) {
        centers(r) += p
      }
    }
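
For comparison, a hedged sketch (names are illustrative, not the MLlib code
above) of the usual workaround when a closure drags in its enclosing object
and, with it, a SparkContext field: copy the members the closure needs into
local vals, and reference only those plus the broadcast handle.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class Trainer(@transient val sc: SparkContext, val runs: Int) {

  def totalCost(data: RDD[Array[Double]], centers: Array[Array[Double]]): Double = {
    val bcCenters = sc.broadcast(centers)

    // Copy the field into a local val: writing this.runs inside the closure
    // would capture `this`, and with it the non-serializable SparkContext.
    val localRuns = runs

    data.map { point =>
      // Only localRuns and the broadcast handle are serialized with the task.
      bcCenters.value.take(localRuns).map { center =>
        point.zip(center).map { case (x, y) => (x - y) * (x - y) }.sum
      }.min
    }.reduce(_ + _)
  }
}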