Re: "Got wrong record after seeking to offset" issue

2018-01-17 Thread Justin Miller
By compacted do you mean compression? If so, then we did recently turn on
lz4 compression. If it means something else and there’s a command I can run
to check compaction, I’m happy to give that a shot too.
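(For reference: compaction and compression are independent settings. A topic is
compacted when its cleanup.policy is set to "compact", and assuming the standard
Kafka command-line tools are available, any topic-level override should show up
with something like:

kafka-topics.sh --zookeeper <zookeeper:2181> --describe --topic <topic>
kafka-configs.sh --zookeeper <zookeeper:2181> --describe --entity-type topics --entity-name <topic>

where <zookeeper:2181> and <topic> are placeholders for the actual ZooKeeper
quorum and topic name.)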

I’ll try consuming from the failed offset if/when the problem manifests
itself again.

Thanks!
Justin

On Wednesday, January 17, 2018, Cody Koeninger  wrote:

> That means the consumer on the executor tried to seek to the specified
> offset, but the message that was returned did not have a matching
> offset.  If the executor can't get the messages the driver told it to
> get, something's generally wrong.
>
> What happens when you try to consume the particular failing offset
> from another (e.g. command-line) consumer?
>
> Is the topic in question compacted?
>
>
>
> On Tue, Jan 16, 2018 at 11:10 PM, Justin Miller
>  wrote:
> > Greetings all,
> >
> > I’ve recently started hitting the following error in Spark Streaming with
> > Kafka. Adjusting maxRetries and spark.streaming.kafka.consumer.poll.ms
> > even to five minutes doesn’t seem to be helping. The problem only
> > manifested in the last few days; restarting with a new consumer group seems
> > to remedy the issue for a few hours (< retention, which is 12 hours).
> >
> > Error:
> > Caused by: java.lang.AssertionError: assertion failed: Got wrong record for
> > spark-executor-  76 even after seeking to offset 1759148155
> > at scala.Predef$.assert(Predef.scala:170)
> > at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:85)
> > at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:223)
> > at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:189)
> > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> >
> > I guess my questions are: why is that assertion a job killer rather than a
> > warning, and is there anything I can tweak settings-wise that may keep it
> > at bay?
> >
> > I wouldn’t be surprised if this issue were exacerbated by the volume we
> > do on Kafka topics (~150k/sec on the persister that’s crashing).
> >
> > Thank you!
> > Justin
> >
> >
> >
>


Good material to learn Apache Spark

2018-01-17 Thread Manuel Sopena Ballesteros
Dear Spark community,

I would like to learn more about Apache Spark. I have a Hortonworks HDP
platform and have run a few Spark jobs in a cluster, but now I need to understand
in more depth how Spark works.

My main interest is the sysadmin and operational side of Spark and its ecosystem.

Is there any material?

Thank you very much

Manuel Sopena Ballesteros | Big data Engineer
Garvan Institute of Medical Research
The Kinghorn Cancer Centre, 370 Victoria Street, Darlinghurst, NSW 2010
T: + 61 (0)2 9355 5760 | F: +61 (0)2 9295 8507 | E: 
manuel...@garvan.org.au



StreamingLogisticRegressionWithSGD : Multiclass Classification : Options

2018-01-17 Thread Sundeep Kumar Mehta
Hi,

I was looking for multi-class logistic regression on streaming data. Do we
have any alternative options or a library/GitHub project?

StreamingLogisticRegressionWithSGD only supports binary classification.

Regards
Sundeep
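
One workaround, if the MLlib streaming API is a requirement, is to train one
binary StreamingLogisticRegressionWithSGD model per class (one-vs-rest) and take
the arg-max of the per-class scores at prediction time. A rough sketch, where
numClasses and numFeatures are assumptions about the data:

import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.dstream.DStream

// One-vs-rest on a stream: one binary model per class, relabelled on the fly.
def trainOneVsRest(training: DStream[LabeledPoint],
                   numClasses: Int,
                   numFeatures: Int): Seq[StreamingLogisticRegressionWithSGD] =
  (0 until numClasses).map { k =>
    val model = new StreamingLogisticRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))
    // Class k becomes the positive class; every other class is negative.
    model.trainOn(training.map(lp =>
      LabeledPoint(if (lp.label == k.toDouble) 1.0 else 0.0, lp.features)))
    model
  }

// Predict by taking the class whose binary model gives the highest raw score
// (clearThreshold makes predict return the score instead of a 0/1 label).
def predictClass(models: Seq[StreamingLogisticRegressionWithSGD], features: Vector): Int =
  models.zipWithIndex
    .map { case (m, k) => (m.latestModel().clearThreshold().predict(features), k) }
    .maxBy(_._1)._2

In newer Spark versions (2.1+), spark.ml's LogisticRegression also supports a
multinomial family for batch training, which can be an alternative if the model
does not have to be updated incrementally on the stream.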


Spark Stream is corrupted

2018-01-17 Thread KhajaAsmath Mohammed
Hi,

I have created a streaming context from a checkpoint, but it always throws a
"stream is corrupted" error when I restart the Spark Streaming job. Any
solution for this?

private def createStreamingContext(
    sparkCheckpointDir: String, sparkSession: SparkSession,
    batchDuration: Int, config: com.typesafe.config.Config) = {
  val topics = config.getString(Constants.Properties.KafkaTopics)
  val topicsSet = topics.split(",").toSet
  val kafkaParams = Map[String, String]("metadata.broker.list" ->
    config.getString(Constants.Properties.KafkaBrokerList))
  val ssc = new StreamingContext(sparkSession.sparkContext, Seconds(batchDuration))
  val messages = KafkaUtils.createDirectStream[String, String,
    StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
  val datapointDStream =
    messages.map(_._2).map(TransformDatapoint.parseDataPointText)
  lazy val sqlCont = sparkSession.sqlContext

  hiveDBInstance = config.getString("hiveDBInstance")

  TransformDatapoint.readDstreamData(sparkSession, sqlCont,
    datapointDStream, runMode, includeIndex, indexNum, datapointTmpTableName,
    fencedDPTmpTableName, fencedVINDPTmpTableName, hiveDBInstance)

  ssc.checkpoint(sparkCheckpointDir)
  ssc
}



// Calling the streaming context method:

val streamingContext = StreamingContext.getOrCreate(
  config.getString(Constants.Properties.CheckPointDir),
  () => createStreamingContext(
    config.getString(Constants.Properties.CheckPointDir),
    sparkSession, config.getInt(Constants.Properties.BatchInterval), config))

ERROR:
org.apache.spark.SparkException: Failed to read checkpoint from directory
hdfs://prodnameservice1/user/yyy1k78/KafkaCheckPointNTDSC

java.io.IOException: Stream is corrupted


Thanks,
Asmath
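
One common cause: Spark Streaming checkpoints are serialized with the
application's own classes, so they generally cannot be restored after the
application code or Spark version changes; a genuinely corrupted or partially
written checkpoint file in HDFS produces the same failure. If losing the saved
offsets and state is acceptable, one rough workaround is to fall back to a fresh
context when the restore fails; a sketch only, reusing the createStreamingContext
factory shown above:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkException
import org.apache.spark.streaming.StreamingContext

// Sketch: try to restore from the checkpoint; if the restore fails
// (e.g. "Stream is corrupted"), wipe the checkpoint directory and build a
// fresh context from the factory. This discards saved offsets and state,
// so it is only acceptable when reprocessing or starting fresh is tolerable.
def getOrRecreate(checkpointDir: String,
                  create: () => StreamingContext): StreamingContext =
  try {
    StreamingContext.getOrCreate(checkpointDir, create)
  } catch {
    case _: SparkException =>
      val path = new Path(checkpointDir)
      path.getFileSystem(new Configuration()).delete(path, true) // recursive delete
      create()
  }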


Re: "Got wrong record after seeking to offset" issue

2018-01-17 Thread Cody Koeninger
That means the consumer on the executor tried to seek to the specified
offset, but the message that was returned did not have a matching
offset.  If the executor can't get the messages the driver told it to
get, something's generally wrong.

What happens when you try to consume the particular failing offset
from another (e.g. command-line) consumer?

Is the topic in question compacted?
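
A minimal standalone check along those lines, using the plain Kafka consumer API
(the broker address and topic name below are placeholders; the partition and
offset are the ones from the stack trace):

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.ByteArrayDeserializer
import scala.collection.JavaConverters._

object CheckOffset {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")  // placeholder broker
    props.put("group.id", "offset-check")           // throwaway group, separate from the streaming job
    props.put("enable.auto.commit", "false")
    props.put("key.deserializer", classOf[ByteArrayDeserializer].getName)
    props.put("value.deserializer", classOf[ByteArrayDeserializer].getName)

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
    val tp = new TopicPartition("my-topic", 76)     // placeholder topic, partition from the error
    consumer.assign(Collections.singletonList(tp))
    consumer.seek(tp, 1759148155L)                  // the offset the executor failed on

    // If the first record comes back with a larger offset than requested, the
    // requested offset no longer exists (removed by compaction or retention).
    consumer.poll(10000L).records(tp).asScala.take(5).foreach { r =>
      println(s"offset=${r.offset} timestamp=${r.timestamp}")
    }
    consumer.close()
  }
}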



On Tue, Jan 16, 2018 at 11:10 PM, Justin Miller
 wrote:
> Greetings all,
>
> I’ve recently started hitting the following error in Spark Streaming with
> Kafka. Adjusting maxRetries and spark.streaming.kafka.consumer.poll.ms even
> to five minutes doesn’t seem to be helping. The problem only manifested in
> the last few days; restarting with a new consumer group seems to remedy the
> issue for a few hours (< retention, which is 12 hours).
>
> Error:
> Caused by: java.lang.AssertionError: assertion failed: Got wrong record for
> spark-executor-  76 even after seeking to offset 1759148155
> at scala.Predef$.assert(Predef.scala:170)
> at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:85)
> at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:223)
> at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:189)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>
> I guess my questions are: why is that assertion a job killer rather than a
> warning, and is there anything I can tweak settings-wise that may keep it at bay?
>
> I wouldn’t be surprised if this issue were exacerbated by the volume we do on 
> Kafka topics (~150k/sec on the persister that’s crashing).
>
> Thank you!
> Justin
>
>
>




Re: update LD_LIBRARY_PATH when running an Apache Spark job in a YARN cluster

2018-01-17 Thread Keith Chapman
Hi Manuel,

You could use the following to add a path to the library search path,
--conf spark.driver.extraLibraryPath=PathToLibFolder
--conf spark.executor.extraLibraryPath=PathToLibFolder
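
For example, assuming the directory containing libpython2.7.so.1.0 is
/path/to/python/lib and the application is your_app.py (both placeholders),
a spark-submit invocation might look like:

spark-submit --master yarn --deploy-mode cluster \
  --conf spark.driver.extraLibraryPath=/path/to/python/lib \
  --conf spark.executor.extraLibraryPath=/path/to/python/lib \
  --conf spark.yarn.appMasterEnv.LD_LIBRARY_PATH=/path/to/python/lib \
  --conf spark.executorEnv.LD_LIBRARY_PATH=/path/to/python/lib \
  your_app.py

The spark.yarn.appMasterEnv.* and spark.executorEnv.* settings export
LD_LIBRARY_PATH explicitly to the YARN application master and the executors;
the same keys can also be put in spark-defaults.conf instead of on the command line.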

Thanks,
Keith.


http://keith-chapman.com

On Wed, Jan 17, 2018 at 5:39 PM, Manuel Sopena Ballesteros <
manuel...@garvan.org.au> wrote:

> Dear Spark community,
>
>
>
> I have Spark running in a YARN cluster and I am getting an error when
> trying to run my Python application.
>
>
>
> /home/mansop/virtenv/bin/python2.7: error while loading shared libraries:
> libpython2.7.so.1.0: cannot open shared object file: No such file or
> directory
>
>
>
> Is there a way to specify the LD_LIBRARY_PATH in the spark-submit command
> or in the config file?
>
>
>
>
>
> Manuel Sopena Ballesteros | Big data Engineer
> Garvan Institute of Medical Research
> The Kinghorn Cancer Centre, 370 Victoria Street, Darlinghurst, NSW 2010
> T: + 61 (0)2 9355 5760 | F: +61 (0)2 9295 8507 | E: manuel...@garvan.org.au
>
>
>


update LD_LIBRARY_PATH when running an Apache Spark job in a YARN cluster

2018-01-17 Thread Manuel Sopena Ballesteros
Dear Spark community,

I have Spark running in a YARN cluster and I am getting an error when
trying to run my Python application.

/home/mansop/virtenv/bin/python2.7: error while loading shared libraries: 
libpython2.7.so.1.0: cannot open shared object file: No such file or directory

Is there a way to specify the LD_LIBRARY_PATH in the spark-submit command or in 
the config file?


Manuel Sopena Ballesteros | Big data Engineer
Garvan Institute of Medical Research
The Kinghorn Cancer Centre, 370 Victoria Street, Darlinghurst, NSW 2010
T: + 61 (0)2 9355 5760 | F: +61 (0)2 9295 8507 | E: 
manuel...@garvan.org.au



Re: Testing Spark-Cassandra

2018-01-17 Thread Guillermo Ortiz
Thanks, I'll check it ;)

2018-01-17 17:19 GMT+01:00 Alonso Isidoro Roman :

> Yes, you can use Docker to build your own Cassandra ring. Depending on your
> OS, instructions may change, so please follow this
>  link
> to install it, and then follow this
>  project, but you
> will have to adapt the necessary libraries to the Spark 2.0.x version.
>
> Good luck, I would like to see any blog post using this combination.
>
>
>
> 2018-01-17 16:48 GMT+01:00 Guillermo Ortiz :
>
>> Hello,
>>
>> I'm using Spark 2.0 and Cassandra. Is there a utility to make unit testing
>> easy, or what would be the best way to do it? A library? Cassandra with
>> Docker?
>>
>
>
>
> --
> Alonso Isidoro Roman
> https://about.me/alonso.isidoro.roman
>
> 
>


Re: Testing Spark-Cassandra

2018-01-17 Thread Alonso Isidoro Roman
Yes, you can use Docker to build your own Cassandra ring. Depending on your
OS, instructions may change, so please follow this
 link to
install it, and then follow this
 project, but you
will have to adapt the necessary libraries to the Spark 2.0.x version.

Good luck, I would like to see any blog post using this combination.
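
For reference, a minimal round-trip test along those lines could look like the
sketch below. It assumes a Cassandra node is already listening on localhost:9042
(for example one started with: docker run -d -p 9042:9042 cassandra:3.11), that
the spark-cassandra-connector is on the test classpath, and that keyspace
test_ks with table kv(key int PRIMARY KEY, value text) already exists:

import com.datastax.spark.connector._
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Sketch only: write one row through the connector and read it back.
class CassandraRoundTripSpec extends FunSuite with BeforeAndAfterAll {

  private var spark: SparkSession = _

  override def beforeAll(): Unit = {
    spark = SparkSession.builder()
      .master("local[2]")
      .appName("cassandra-round-trip-test")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .getOrCreate()
  }

  override def afterAll(): Unit = spark.stop()

  test("writes a row and reads it back") {
    val sc = spark.sparkContext
    sc.parallelize(Seq((1, "one")))
      .saveToCassandra("test_ks", "kv", SomeColumns("key", "value"))

    val rows = sc.cassandraTable("test_ks", "kv").collect()
    assert(rows.exists(r => r.getInt("key") == 1 && r.getString("value") == "one"))
  }
}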



2018-01-17 16:48 GMT+01:00 Guillermo Ortiz :

> Hello,
>
> I'm using Spark 2.0 and Cassandra. Is there a utility to make unit testing
> easy, or what would be the best way to do it? A library? Cassandra with
> Docker?
>



-- 
Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman



Testing Spark-Cassandra

2018-01-17 Thread Guillermo Ortiz
Hello,

I'm using Spark 2.0 and Cassandra. Is there a utility to make unit testing
easy, or what would be the best way to do it? A library? Cassandra with
Docker?