Re: unable to stream kafka messages

2017-08-24 Thread cbowden
The exception is telling you precisely what is wrong. The Kafka source has a
schema of (topic, partition, offset, key, value, timestamp, timestampType).
Nothing about those columns makes sense as a tweet. You need to tell Spark
how to get from bytes to a tweet; it doesn't know how you serialized the
messages into Kafka.
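
For example, if the messages were written as JSON, the mapping from the raw
Kafka value bytes to a tweet might look roughly like this in Scala (the topic
name and tweet schema below are assumptions for illustration):

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// assumed tweet schema; adjust to however the messages were actually serialized
val tweetSchema = new StructType()
  .add("id", LongType)
  .add("user", StringType)
  .add("text", StringType)

val tweets = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "tweets")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")              // value arrives as bytes
  .select(from_json(col("json"), tweetSchema).as("tweet"))  // tell Spark how to parse it
  .select("tweet.*")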



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/unable-to-stream-kafka-messages-tp28537p29107.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark Structured Streaming]: truncated Parquet after driver crash or kill

2017-08-24 Thread cbowden
The default spark.sql.streaming.commitProtocolClass is
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ManifestFileCommitProtocol.scala,
which may or may not be best suited for all needs.

Code deploys could be improved by ensuring you shut down gracefully, e.g. by
invoking StreamingQuery#stop.
https://issues.apache.org/jira/browse/SPARK-21029 is probably of interest.
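
As a rough sketch of one way to shut down gracefully on deploy (the stop-marker
path and the query variable are assumptions, not from the original post):

import java.nio.file.{Files, Paths}

// query is the StreamingQuery returned by df.writeStream...start()
val stopMarker = Paths.get("/tmp/stop-my-stream")  // touched by the deploy tooling

while (query.isActive && !Files.exists(stopMarker)) {
  query.awaitTermination(10000)  // poll every 10s; returns early if the query ends
}
if (query.isActive) {
  query.stop()  // blocks until the streaming query has stopped
}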



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Structured-Streaming-truncated-Parquet-after-driver-crash-or-kill-tp29043p29106.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Structured Streaming: multiple sinks

2017-08-24 Thread cbowden
1. Would it not be more natural to write the processed data to Kafka and sink
the processed data from Kafka to S3?
2a. addBatch is the time Sink#addBatch took, as measured by StreamExecution.
2b. getBatch is the time Source#getBatch took, as measured by
StreamExecution.
3. triggerExecution is effectively the end-to-end processing time for the
micro-batch; note that all the other durations sum closely to triggerExecution,
with a little slippage due to book-keeping activities in StreamExecution.
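
These show up under durationMs in the streaming query progress. A minimal
sketch of pulling them out, assuming query is an active StreamingQuery:

val p = query.lastProgress
println(p.durationMs.get("getBatch"))          // time spent in Source#getBatch
println(p.durationMs.get("addBatch"))          // time spent in Sink#addBatch
println(p.durationMs.get("triggerExecution"))  // end-to-end time for the micro-batch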



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-multiple-sinks-tp29056p29105.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Streaming][Structured Streaming] Understanding dynamic allocation in streaming jobs

2017-08-24 Thread cbowden
You can leverage dynamic resource allocation with structured streaming.
Certainly there's an argument that trivial jobs won't benefit, and that
important jobs should have fixed resources for stable end-to-end latency.

A few scenarios come to mind where it helps:
- I want my application to automatically leverage more resources if my
environment changes, e.g. Kafka topic partitions were increased at runtime.
- I am not building a toy application, and my driver is managing many
streaming queries with fair scheduling enabled, where not every streaming
query has strict latency requirements.
- My source's underlying RDD backing the DataFrame returned by getBatch is
volatile, e.g. the number of partitions varies batch to batch.
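
For reference, a rough sketch of the settings involved (the executor counts,
class name, and jar are illustrative, and an external shuffle service is
assumed):

spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --class com.example.MyStreamingApp my-streaming-app.jar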






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Structured-Streaming-Understanding-dynamic-allocation-in-streaming-jobs-tp29091p29104.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[Spark] Can Apache Spark be used with time series processing?

2017-08-24 Thread Alexandr Porunov
Hello,

I am new to Apache Spark. I need to process different time series data
(numeric values that depend on time) and react to the following conditions:
1. The data changes up or down too fast.
2. The data keeps changing up or down for too long.

For example, if the data has changed 30% up or down in the last five
minutes (or less), then I need to send a special event.
If the data has changed 50% up or down in two hours (or less), then I need
to send a special event.

The data changes at a rate of about 1,000-3,000 updates per second, and I need
to react as soon as possible.

Does Apache Spark fit this scenario well, or do I need to look for
another solution?
Sorry for the stupid question, but I am a total newbie.

Regards
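
For reference, the 5-minute rule can be sketched as a sliding-window
aggregation in Structured Streaming; the input source, schema, watermark, and
threshold below are assumptions for illustration only, not a complete solution:

import org.apache.spark.sql.functions.{col, max, min, window}

// ticks: a streaming DataFrame with columns (timestamp: Timestamp, value: Double),
// e.g. parsed from a Kafka topic; both the source and the schema are assumed here
val alerts = ticks
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes", "1 minute"))
  .agg(min("value").as("lo"), max("value").as("hi"))
  .where(col("hi") >= col("lo") * 1.3)  // a >= 30% swing within the sliding window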


SWOT Analysis on Apache Spark

2017-08-24 Thread Irfan Kabli
Hi All,

I am not sure if the users list is the right list for this query, but if this
is the wrong forum I hope someone will point me to the right one.

I work for a company which uses a proprietary analytical ecosystem. I am
evangelising open source and have been asked by management to conduct a
fair SWOT analysis of Spark as an analytical engine. Has anybody attempted
a similar exercise before?

Basically, I am looking for a SWOT analysis of Spark components, e.g.:

Spark-Core
Spark-MLLib
Spark-Streaming
Spark-GraphX

Looking forward to hearing from you.

Regards,
Irfan Kabli


Re: PySpark, Structured Streaming and Kafka

2017-08-24 Thread Brian Wylie
Resolved :)

Hi, just looping back on this (thanks for everyone's help).

In a Jupyter notebook the following command works and properly loads the
Kafka jar files.

# Spin up a local Spark Session
spark = SparkSession.builder.appName('my_awesome')\
    .config('spark.jars.packages',
            'org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0')\
    .getOrCreate()


I wanted to post this in case folks find this email thread with the same
question.
Note: if you've started a previous session, make sure it's properly
stopped/killed before playing with config options.

Cheers and thanks again.
-Brian

On Wed, Aug 23, 2017 at 4:51 PM, Shixiong(Ryan) Zhu  wrote:

> You can use `bin/pyspark --packages 
> org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0`
> to start "pyspark". If you want to use "spark-submit", you also need to
> provide your Python file.
>
> On Wed, Aug 23, 2017 at 1:41 PM, Brian Wylie 
> wrote:
>
>> Hi All,
>>
>> I'm trying the new hotness of using Kafka and Structured Streaming.
>>
>> Resources that I've looked at
>> - https://spark.apache.org/docs/latest/streaming-programming-guide.html
>> - https://databricks.com/blog/2016/07/28/structured-streamin
>> g-in-apache-spark.html
>> - https://spark.apache.org/docs/latest/streaming-custom-receivers.html
>> - http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/
>> Structured%20Streaming%20using%20Python%20DataFrames%20API.html
>>
>> My setup is a bit weird (yes.. yes.. I know...)
>> - Eventually I'll just use a DataBricks cluster and life will be bliss :)
>> - But for now I want to test/try stuff out on my little Mac Laptop
>>
>> The newest version of PySpark will install a local Spark server with a
>> simple:
>> $ pip install pyspark
>>
>> This is very nice. I've put together a little notebook using that kewl
>> feature:
>> - https://github.com/Kitware/BroThon/blob/master/notebooks/B
>> ro_to_Spark_Cheesy.ipynb
>>
>> So the next step is the setup/use a Kafka message queue and that went
>> well/works fine.
>>
>> $ kafka-console-consumer --bootstrap-server localhost:9092 --topic dns
>>
>> *I get messages spitting out*
>>
>> {"ts":1503513688.232274,"uid":"CdA64S2Z6Xh555","id.orig_h":"192.168.1.7","id.orig_p":58528,"id.resp_h":"192.168.1.1","id.resp_p":53,"proto":"udp","trans_id":43933,"rtt":0.02226,"query":"brian.wylie.is.awesome.tk","qclass":1,"qclass_name":"C_INTERNET","qtype":1,"qtype_name":"A","rcode":0,"rcode_name":"NOERROR","AA":false,"TC":false,"RD":true,"RA":true,"Z":0,"answers":["17.188.137.55","17.188.142.54","17.188.138.55","17.188.141.184","17.188.129.50","17.188.128.178","17.188.129.178","17.188.141.56"],"TTLs":[25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0],"rejected":false}
>>
>>
>> Okay, finally getting to my question:
>> - Local spark server (good)
>> - Local kafka server and messages getting produced (good)
>> - Trying to this line of PySpark code (not good)
>>
>> # Setup connection to Kafka Stream
>> dns_events = spark.readStream.format('kafka')\
>>   .option('kafka.bootstrap.servers', 'localhost:9092')\
>>   .option('subscribe', 'dns')\
>>   .option('startingOffsets', 'latest')\
>>   .load()
>>
>>
>> fails with:
>> java.lang.ClassNotFoundException: Failed to find data source: kafka.
>> Please find packages at http://spark.apache.org/third-party-projects.html
>>
>> I've looked that the URL listed... and poking around I can see that maybe
>> I need the kafka jar file as part of my local server.
>>
>> I lamely tried this:
>> $ spark-submit --packages org.apache.spark:spark-sql-kaf
>> ka-0-10_2.11:2.2.0
>>
>> Exception in thread "main" java.lang.IllegalArgumentException: Missing
>> application resource. at org.apache.spark.launcher.Comm
>> andBuilderUtils.checkArgument(CommandBuilderUtils.java:241) at
>> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSpa
>> rkSubmitArgs(SparkSubmitCommandBuilder.java:160) at
>> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSpa
>> rkSubmitCommand(SparkSubmitCommandBuilder.java:274) at
>> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCom
>> mand(SparkSubmitCommandBuilder.java:151) at
>> org.apache.spark.launcher.Main.main(Main.java:86)
>>
>>
>> Anyway, all my code/versions/etc are in this notebook:
>> - https://github.com/Kitware/BroThon/blob/master/notebooks/Bro
>> _to_Spark.ipynb
>>
>> I'd be tremendously appreciative of some super nice, smart person if they
>> could point me in the right direction :)
>>
>> -Brian Wylie
>>
>
>


Restarting the SparkContext in pyspark

2017-08-24 Thread Alexander Czech
I'm running a Jupyter-Spark setup and I want to benchmark my cluster with
different input parameters. To make sure the environment stays the same, I'm
trying to reset (restart) the SparkContext. Here is some code:

temp_result_parquet = os.path.normpath('/home/spark_tmp_parquet')
i = 0
while i < max_i:
    i += 1
    if os.path.exists(temp_result_parquet):
        shutil.rmtree(temp_result_parquet)  # I know I could simply overwrite the parquet
    My_DF = do_something(i)
    My_DF.write.parquet(temp_result_parquet)
    sc.stop()
    time.sleep(10)
    sc = pyspark.SparkContext(master='spark://ip:here', appName='PySparkShell')

When I do this, the first iteration runs fine, but in the second I get the
following error:

Py4JJavaError: An error occurred while calling o1876.parquet.
: org.apache.spark.SparkException: Job aborted.
[...]
Caused by: java.lang.IllegalStateException: SparkContext has been shutdown
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2014)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:188)
I tried running the code without the SparkContext restart, but that results
in memory issues. So to wipe the slate clean before every iteration I'm
trying this approach, with the weird result that the parquet write "thinks"
the SparkContext is down.


NoSuchMethodError CatalogTable.copy

2017-08-24 Thread Lionel Luffy
Hi, does anyone know how to fix the error below?

java.lang.NoSuchMethodError:
org.apache.spark.sql.catalyst.catalog.CatalogTable.copy(Lorg/apache/spark/sql/catalyst/TableIdentifier;Lorg/apache/spark/sql/catalyst/catalog/CatalogTableType;Lorg/apache/spark/sql/catalyst/catalog/CatalogStorageFormat;Lorg/apache/spark/sql/types/StructType;Lscala/Option;Lscala/collection/Seq;Lscala/Option;Ljava/lang/String;JJLscala/collection/immutable/Map;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/collection/Seq;Z)Lorg/apache/spark/sql/catalyst/catalog/CatalogTable;

It occurred when executing the code below:

catalogTable.copy(storage = newStorage)

the catalyst jar is spark-catalyst_2.11-2.1.0.cloudera1.jar

CatalogTable is a case class:

case class CatalogTable(
    identifier: TableIdentifier,
    tableType: CatalogTableType,
    storage: CatalogStorageFormat,
    schema: StructType,
    provider: Option[String] = None,
    partitionColumnNames: Seq[String] = Seq.empty,
    bucketSpec: Option[BucketSpec] = None,
    owner: String = "",
    createTime: Long = System.currentTimeMillis,
    lastAccessTime: Long = -1,
    properties: Map[String, String] = Map.empty,
    stats: Option[Statistics] = None,
    viewOriginalText: Option[String] = None,
    viewText: Option[String] = None,
    comment: Option[String] = None,
    unsupportedFeatures: Seq[String] = Seq.empty,
    tracksPartitionsInCatalog: Boolean = false,
    schemaPreservesCase: Boolean = true)


RE: Joining 2 dataframes, getting result as nested list/structure in dataframe

2017-08-24 Thread JG Perrin
Thanks Michael – this is a great article… very helpful

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Wednesday, August 23, 2017 4:33 PM
To: JG Perrin 
Cc: user@spark.apache.org
Subject: Re: Joining 2 dataframes, getting result as nested list/structure in 
dataframe

You can create a nested struct that contains multiple columns using struct().

Here's a pretty complete guide on working with nested data: 
https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
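
For example, to keep several columns per collected row, the aggregation can
wrap them in a struct first (a Scala sketch; c50 and c51 are illustrative
column names):

import org.apache.spark.sql.functions.{col, collect_list, struct}

// collect a list of (c50, c51) structs per (c1, c2) group
val aggDf = allDf
  .groupBy(col("c1"), col("c2"))
  .agg(collect_list(struct(col("c50"), col("c51"))).as("rows"))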

On Wed, Aug 23, 2017 at 2:30 PM, JG Perrin 
> wrote:
Hi folks,

I am trying to join 2 dataframes, but I would like to have the result as a list 
of rows of the right dataframe (dDf in the example) in a column of the left 
dataframe (cDf in the example). I made it work with one column, but having 
issues adding more columns/creating a row(?).
Seq joinColumns = new Set2<>("c1", "c2").toSeq();
Dataset allDf = cDf.join(dDf, joinColumns, "inner");
allDf.printSchema();
allDf.show();

Dataset aggDf = allDf.groupBy(cDf.col("c1"), cDf.col("c2"))
.agg(collect_list(col("c50")));
aggDf.show();

Output:
+----+-------+------------------+
|  c1|     c2| collect_list(c50)|
+----+-------+------------------+
|3744|1160242|[6, 5, 4, 3, 2, 1]|
|3739|1150097|               [1]|
|3780|1159902|   [5, 4, 3, 2, 1]|
| 132|1200743|      [4, 3, 2, 1]|
|3778|1183204|               [1]|
|3766|1132709|               [1]|
|3835|1146169|               [1]|
+----+-------+------------------+

Thanks,

jg



This electronic transmission and any documents accompanying this electronic 
transmission contain confidential information belonging to the sender. This 
information may contain confidential health information that is legally 
privileged. The information is intended only for the use of the individual or 
entity named above. The authorized recipient of this transmission is prohibited 
from disclosing this information to any other party unless required to do so by 
law or regulation and is required to delete or destroy the information after 
its stated need has been fulfilled. If you are not the intended recipient, you 
are hereby notified that any disclosure, copying, distribution or the taking of 
any action in reliance on or regarding the contents of this electronically 
transmitted information is strictly prohibited. If you have received this 
E-mail in error, please notify the sender and delete this message immediately.



SparkStreaming connection exception

2017-08-24 Thread Likith_Kailas
I have written a unit test which uses multithreading to start and stop a
Spark streaming job and a Kafka producer. All the dependencies have been
declared in the Maven pom.xml file.

When I run the test, once all the Kafka messages are read and the
threads are stopped, I continue to get the exception below:

 2017-08-19 17:08:16,783 INFO  [Executor task launch worker-0-
 SendThread(127.0.0.1:64040)] zookeeper.ClientCnxn 
(ClientCnxn.java:logStartConnect(1032)) - Opening socket connection to 
server 127.0.0.1/127.0.0.1:64040. Will not attempt to authenticate using 
SASL (unknown error)
2017-08-19 17:08:17,786 WARN  [Executor task launch worker-0-
SendThread(127.0.0.1:64040)] zookeeper.ClientCnxn
(ClientCnxn.java:run(1162)) - Session 0x15dfb08227f0001 for server null,
unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)

The code is as below :

@Test
public void someKafkaTest() {

try {


//Thread controlling the Spark streaming
Thread sparkStreamerThread = new Thread(
new SparkStreamingJSonJob(new String[] { zookeeperConnect,
"my-consumer-group", "test", "1" }),
"spark-streaming");
sparkStreamerThread.start();


Thread.sleep(1);

//Thread to start the producer
Thread producerThread = new Thread(new KafkaJSonProducer(),
"producer");
producerThread.start();

//current kafkaTest thread to sleep for 1 second
Thread.sleep(6);

producerThread.stop();

int sparkAccVal = SparkStreamingJSonJob.getAccumulator().intValue();
System.out.println("Spark Throughput value : " + sparkAccVal/60);

while (sparkStreamerThread.isAlive())
sparkStreamerThread.stop();
   } catch (Exception e) {
e.printStackTrace();
Assert.fail();
}

}

I feel that the streaming job continues to run even after stopping the
thread. Please help me with this.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkStreaming-connection-exception-tp29103.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: ORC Transaction Table - Spark

2017-08-24 Thread Aviral Agarwal
Are there any plans to include it in future releases of Spark?

Regards,
Aviral Agarwal

On Thu, Aug 24, 2017 at 3:11 PM, Akhil Das  wrote:

> How are you reading the data? Its clearly saying 
> *java.lang.NumberFormatException:
> For input string: "0645253_0001" *
>
> On Tue, Aug 22, 2017 at 7:40 PM, Aviral Agarwal 
> wrote:
>
>> Hi,
>>
>> I am trying to read hive orc transaction table through Spark but I am
>> getting the following error
>>
>> Caused by: java.lang.RuntimeException: serious problem
>> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSpli
>> tsInfo(OrcInputFormat.java:1021)
>> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(Or
>> cInputFormat.java:1048)
>> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>> .
>> Caused by: java.util.concurrent.ExecutionException:
>> java.lang.NumberFormatException: For input string: "0645253_0001"
>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSpli
>> tsInfo(OrcInputFormat.java:998)
>> ... 118 more
>>
>> Any help would be appreciated.
>>
>> Thanks and Regards,
>> Aviral Agarwal
>>
>>
>
>
> --
> Cheers!
>
>


Re: Training A ML Model on a Huge Dataframe

2017-08-24 Thread Yanbo Liang
Hi Sea,

Could you let us know which ML algorithm you use? What are the number of
instances and the dimensionality of your dataset?
AFAIK, Spark MLlib can train a model with several million features if you
configure it correctly.

Thanks
Yanbo

On Thu, Aug 24, 2017 at 7:07 AM, Suzen, Mehmet  wrote:

> SGD is supported. I see I assumed you were using Scala. Looks like you can
> do streaming regression, not sure of pyspark API though:
>
> https://spark.apache.org/docs/latest/mllib-linear-methods.
> html#streaming-linear-regression
>
> On 23 August 2017 at 18:22, Sea aj  wrote:
>
>> Thanks for the reply.
>>
>> As far as I understood mini batch is not yet supported in ML libarary. As
>> for MLLib minibatch, I could not find any pyspark api.
>>
>>
>>
>>  Sent with Mailtrack
>> 
>>
>> On Wed, Aug 23, 2017 at 2:59 PM, Suzen, Mehmet  wrote:
>>
>>> It depends on what model you would like to train but models requiring
>>> optimisation could use SGD with mini batches. See:
>>> https://spark.apache.org/docs/latest/mllib-optimization.html
>>> #stochastic-gradient-descent-sgd
>>>
>>> On 23 August 2017 at 14:27, Sea aj  wrote:
>>>
 Hi,

 I am trying to feed a huge dataframe to a ml algorithm in Spark but it
 crashes due to the shortage of memory.

 Is there a way to train the model on a subset of the data in multiple
 steps?

 Thanks



  Sent with Mailtrack
 

>>>
>>>
>>>
>>> --
>>>
>>> Mehmet Süzen, MSc, PhD
>>> 
>>>
>>> | PRIVILEGED AND CONFIDENTIAL COMMUNICATION This e-mail transmission,
>>> and any documents, files or previous e-mail messages attached to it, may
>>> contain confidential information that is legally privileged. If you are not
>>> the intended recipient or a person responsible for delivering it to the
>>> intended recipient, you are hereby notified that any disclosure, copying,
>>> distribution or use of any of the information contained in or attached to
>>> this transmission is STRICTLY PROHIBITED within the applicable law. If you
>>> have received this transmission in error, please: (1) immediately notify me
>>> by reply e-mail to su...@acm.org,  and (2) destroy the original
>>> transmission and its attachments without reading or saving in any manner. |
>>>
>>
>>
>
>
> --
>
> Mehmet Süzen, MSc, PhD
> 
>
> | PRIVILEGED AND CONFIDENTIAL COMMUNICATION This e-mail transmission, and
> any documents, files or previous e-mail messages attached to it, may
> contain confidential information that is legally privileged. If you are not
> the intended recipient or a person responsible for delivering it to the
> intended recipient, you are hereby notified that any disclosure, copying,
> distribution or use of any of the information contained in or attached to
> this transmission is STRICTLY PROHIBITED within the applicable law. If you
> have received this transmission in error, please: (1) immediately notify me
> by reply e-mail to su...@acm.org,  and (2) destroy the original
> transmission and its attachments without reading or saving in any manner. |
>


Re: How is data desensitization (example: select bank_no from users)?

2017-08-24 Thread Akhil Das
Usually analysts will not have access to data stored in the PCI zone; you
could write the data out to a table for the analysts with the
sensitive information masked.

Eg:


> val mask_udf = udf((info: String) => info.patch(0, "*" * 12, 7))
> val df = sc.parallelize(Seq(("user1", "400-000-444"))).toDF("user", 
> "sensitive_info")
> df.show

+-----+--------------+
| user|sensitive_info|
+-----+--------------+
|user1|   400-000-444|
+-----+--------------+

> df.withColumn("sensitive_info", mask_udf($"sensitive_info")).show

+-----+----------------+
| user|  sensitive_info|
+-----+----------------+
|user1|************-444|
+-----+----------------+


On Sat, Aug 19, 2017 at 10:42 PM, 李斌松  wrote:

> For example, the user's bank card number cannot be viewed by an analyst
> and replaced by an asterisk. How do you do that in spark?
>



-- 
Cheers!


Re: UI for spark machine learning.

2017-08-24 Thread Akhil Das
How many iterations are you doing on the data? Like Jörn said, you don't
necessarily need a billion samples for linear regression.

On Tue, Aug 22, 2017 at 6:28 PM, Sea aj  wrote:

> Jorn,
>
> My question is not about the model type but instead, the spark capability
> on reusing any already trained ml model in training a new model.
>
>
>
>
> On Tue, Aug 22, 2017 at 1:13 PM, Jörn Franke  wrote:
>
>> Is it really required to have one billion samples for just linear
>> regression? Probably your model would do equally well with much less
>> samples. Have you checked bias and variance if you use much less random
>> samples?
>>
>> On 22. Aug 2017, at 12:58, Sea aj  wrote:
>>
>> I have a large dataframe of 1 billion rows of type LabeledPoint. I tried
>> to train a linear regression model on the df but it failed due to lack of
>> memory although I'm using 9 slaves, each with 100gb of ram and 16 cores of
>> CPU.
>>
>> I decided to split my data into multiple chunks and train the model in
>> multiple phases but I learned the linear regression model in ml library
>> does not have "setinitialmodel" function to be able to pass the trained
>> model from one chunk to the rest of chunks. In another word, each time I
>> call the fit function over a chunk of my data, it overwrites the previous
>> mode.
>>
>> So far the only solution I found is using Spark Streaming to be able to
>> split the data to multiple dfs and then train over each individually to
>> overcome memory issue.
>>
>> Do you know if there's any other solution?
>>
>>
>>
>>
>> On Mon, Jul 10, 2017 at 7:57 AM, Jayant Shekhar 
>> wrote:
>>
>>> Hello Mahesh,
>>>
>>> We have built one. You can download from here :
>>> https://www.sparkflows.io/download
>>>
>>> Feel free to ping me for any questions, etc.
>>>
>>> Best Regards,
>>> Jayant
>>>
>>>
>>> On Sun, Jul 9, 2017 at 9:35 PM, Mahesh Sawaiker <
>>> mahesh_sawai...@persistent.com> wrote:
>>>
 Hi,


 1) Is anyone aware of any workbench kind of tool to run ML jobs in
 spark. Specifically is the tool  could be something like a Web application
 that is configured to connect to a spark cluster.


 User is able to select input training sets probably from hdfs , train
 and then run predictions, without having to write any Scala code.


 2) If there is not tool, is there value in having such tool, what could
 be the challenges.


 Thanks,

 Mahesh


 DISCLAIMER
 ==
 This e-mail may contain privileged and confidential information which
 is the property of Persistent Systems Ltd. It is intended only for the use
 of the individual or entity to which it is addressed. If you are not the
 intended recipient, you are not authorized to read, retain, copy, print,
 distribute or use this message. If you have received this communication in
 error, please notify the sender and delete all copies of this message.
 Persistent Systems Ltd. does not accept any liability for virus infected
 mails.

>>>
>>>
>>
>


-- 
Cheers!


Re: ORC Transaction Table - Spark

2017-08-24 Thread Akhil Das
How are you reading the data? It's clearly saying
java.lang.NumberFormatException: For input string: "0645253_0001"

On Tue, Aug 22, 2017 at 7:40 PM, Aviral Agarwal 
wrote:

> Hi,
>
> I am trying to read hive orc transaction table through Spark but I am
> getting the following error
>
> Caused by: java.lang.RuntimeException: serious problem
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSpli
> tsInfo(OrcInputFormat.java:1021)
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(Or
> cInputFormat.java:1048)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> .
> Caused by: java.util.concurrent.ExecutionException:
> java.lang.NumberFormatException: For input string: "0645253_0001"
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSpli
> tsInfo(OrcInputFormat.java:998)
> ... 118 more
>
> Any help would be appreciated.
>
> Thanks and Regards,
> Aviral Agarwal
>
>


-- 
Cheers!


Re: [Spark Streaming] Streaming Dynamic Allocation is broken (at least on YARN)

2017-08-24 Thread Akhil Das
Have you tried setting spark.executor.instances to a positive, non-zero
value instead of 0? Also, since it's a streaming application, set executor cores > 1.
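
For example, the command from the original post adjusted along those lines
(the exact values are illustrative):

spark-submit run-example \
  --conf spark.streaming.dynamicAllocation.enabled=true \
  --conf spark.executor.instances=2 \
  --conf spark.executor.cores=2 \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.master=yarn \
  --conf spark.submit.deployMode=client \
  org.apache.spark.examples.streaming.HdfsWordCount /foo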

On Wed, Aug 23, 2017 at 3:38 AM, Karthik Palaniappan  wrote:

> I ran the HdfsWordCount example using this command:
>
> spark-submit run-example \
>   --conf spark.streaming.dynamicAllocation.enabled=true \
>   --conf spark.executor.instances=0 \
>   --conf spark.dynamicAllocation.enabled=false \
>   --conf spark.master=yarn \
>   --conf spark.submit.deployMode=client \
>   org.apache.spark.examples.streaming.HdfsWordCount /foo
>
> I tried it on both Spark 2.1.1 (through HDP 2.6) and Spark 2.2.0 (through
> Google Dataproc 1.2), and I get the same message repeatedly that Spark
> cannot allocate any executors.
>
> 17/08/22 19:34:57 INFO org.spark_project.jetty.util.log: Logging
> initialized @1694ms
> 17/08/22 19:34:57 INFO org.spark_project.jetty.server.Server:
> jetty-9.3.z-SNAPSHOT
> 17/08/22 19:34:57 INFO org.spark_project.jetty.server.Server: Started
> @1756ms
> 17/08/22 19:34:57 INFO org.spark_project.jetty.server.AbstractConnector:
> Started ServerConnector@578782d6{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
> 17/08/22 19:34:58 INFO 
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase:
> GHFS version: 1.6.1-hadoop2
> 17/08/22 19:34:58 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting
> to ResourceManager at hadoop-m/10.240.1.92:8032
> 17/08/22 19:35:00 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl:
> Submitted application application_1503036971561_0022
> 17/08/22 19:35:04 WARN org.apache.spark.streaming.StreamingContext:
> Dynamic Allocation is enabled for this application. Enabling Dynamic
> allocation for Spark Streaming applications can cause data loss if Write
> Ahead Log is not enabled for non-replayable sources like Flume. See the
> programming guide for details on how to enable the Write Ahead Log.
> 17/08/22 19:35:21 WARN org.apache.spark.scheduler.cluster.YarnScheduler:
> Initial job has not accepted any resources; check your cluster UI to ensure
> that workers are registered and have sufficient resources
> 17/08/22 19:35:36 WARN org.apache.spark.scheduler.cluster.YarnScheduler:
> Initial job has not accepted any resources; check your cluster UI to ensure
> that workers are registered and have sufficient resources
> 17/08/22 19:35:51 WARN org.apache.spark.scheduler.cluster.YarnScheduler:
> Initial job has not accepted any resources; check your cluster UI to ensure
> that workers are registered and have sufficient resources
>
> I confirmed that the YARN cluster has enough memory for dozens of
> executors, and verified that the application allocates executors when using
> Core's spark.dynamicAllocation.enabled=true, and leaving spark.streaming.
> dynamicAllocation.enabled=false.
>
> Is streaming dynamic allocation actually supported? Sean Owen suggested it
> might have been experimental: https://issues.apache.org/jira/browse/SPARK-
> 21792.
>



-- 
Cheers!